The Databricks Labs data generator (aka `dbldatagen`) is a Spark based solution for generating
realistic synthetic data. It uses the features of Spark dataframes and Spark SQL
- to generate synthetic data. As the output of the process is a Spark dataframe populated
- with the generated data , it may be saved to storage in a variety of formats, saved to tables
- or generally manipulated using the existing Spark Dataframe APIs.
+ to generate synthetic data. As the process produces a Spark dataframe populated
+ with generated data, it may be saved to storage in a variety of formats, saved to tables
+ or generally manipulated using the existing Spark Dataframe APIs.
+
+ It can also be used as a source in a Delta Live Tables pipeline, supporting both streaming and batch operation.

It has no dependencies on any libraries that are not already included in the Databricks
runtime, and you can use it from Scala, R or other languages by defining
@@ -47,33 +49,37 @@ and [formatting on string columns](textdata)
* Use [SQL based expressions](#using-sql-in-data-generation) to control or augment column generation
* Script Spark SQL table creation statement for dataset
* Specify a [statistical distribution for random values](./DISTRIBUTIONS.md)
+ * Support for use within Databricks Delta Live Tables pipelines

## Tutorials and examples

In the root directory of the project, there are a number of examples and tutorials.

- The Python examples in the `examples` folder can be run directly or imported into the Databricks runtime environment as Python files.
+ The Python examples in the `examples` folder can be run directly or imported into the Databricks runtime environment
+ as Python files.

- The examples in the `tutorials` folder are in notebook export format and are intended to be imported into the Databricks runtime environment.
+ The examples in the `tutorials` folder are in notebook export format and are intended to be imported into the Databricks
+ runtime environment.

## Basic concepts

- The Databricks Labs Data Generator is a Python framework that uses Spark to generate a dataframe of test data.
+ The Databricks Labs Data Generator is a Python framework that uses Spark to generate a dataframe of synthetic data.

- Once the data frame is generated, it can be used with any Spark dataframee compatible API to save or persist data,
- to analyze data, to write it to an external database or stream, or generally used in the same manner as a regular dataframe.
+ Once the data frame is generated, it can be used with any Spark dataframe compatible API to save or persist data,
+ to analyze data, to write it to an external database or stream, or used in the same manner as a regular
+ PySpark dataframe.

To consume it from Scala, R, SQL or other languages, create a view over the resulting test dataframe and you can use
it from any Databricks Spark runtime compatible language. By use of the appropriate parameters,
you can instruct the data generator to automatically register a view as part of generating the test data.
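As an illustrative sketch only (the spec, column names and view name below are hypothetical, not taken from this document), a generated dataframe can be exposed to SQL or other languages by registering a view with the standard Spark API:

```python
import dbldatagen as dg

# Build a small generated dataframe (column definitions here are placeholders)
df = (dg.DataGenerator(spark, name="view_example", rows=1000)
      .withColumn("site_id", "integer", minValue=1, maxValue=20, random=True)
      .build())

# Register a temporary view so the generated data can be queried from SQL, Scala or R
df.createOrReplaceTempView("generated_site_data")

spark.sql("SELECT site_id, count(*) AS n FROM generated_site_data GROUP BY site_id").show()
```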
### Generating the test data
- The test data generation process is controlled by a test data generation spec which can build a schema implicitly,
+ The data generation process is controlled by a data generation spec which can build a schema implicitly,
or a schema can be added from an existing table or Spark SQL schema object.

- Each column to be generated derives its test data from a set of one or more seed values.
- By default, this is the id field of the base data frame
+ Each column to be generated derives its generated data from a set of one or more seed values.
+ By default, this is the `id` field of the base data frame
(generated with `spark.range` for batch data frames, or using a `Rate` source for streaming data frames).

Each column can be specified as based on the `id` field or other columns in the test data generation spec.
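A minimal sketch of this idea (the spec name, column names and ranges below are illustrative assumptions, not taken from this document): columns are derived from the seed `id` column unless another base column is named.

```python
import dbldatagen as dg

df = (dg.DataGenerator(spark, name="seed_example", rows=100000, partitions=4)
      .withIdOutput()                                # include the seed `id` column in the output
      .withColumn("internal_device_id", "long",
                  minValue=0x1000000000000,
                  uniqueValues=10000)                # derived from the seed `id` column by default
      .withColumn("device_id", "string",
                  format="0x%013x",
                  baseColumn="internal_device_id")   # derived from another generated column
      .build())
```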
@@ -92,17 +98,19 @@ There is also support for applying arbitrary SQL expressions, and generation of
### Getting started

- Before you can use the data generator, you need to install the package in your environment and import it in your code.
+ Before using the data generator, you need to install the package in your environment and import it in your code.
You can install the package from the Github releases as a library on your cluster.

> NOTE: When running in a Databricks notebook environment, you can install directly using
> the `%pip` command in a notebook cell
>
> To install as a notebook scoped library, add a cell with the following text and execute it:
>
- > `%pip install git+https://github.com/databrickslabs/dbldatagen`
+ > `%pip install git+https://github.com/databrickslabs/dbldatagen@current`

- The `%pip install` method will work in the Databricks Community Environment also.
+ The `%pip install` method will work in the Databricks Community Environment and in Delta Live Tables pipelines also.
+
+ You can also manually download a wheel file from the releases and install it in your environment.

The releases are located at
[Databricks Labs Data Generator releases](https://github.com/databrickslabs/dbldatagen/releases)
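Once the package is installed, a first run might look like the following sketch (the spec name, column names and value ranges are illustrative placeholders, not taken from this document):

```python
import dbldatagen as dg

# Define a simple data generation spec and materialize it as a Spark dataframe
testDataSpec = (dg.DataGenerator(spark, name="quickstart", rows=1000, partitions=4)
                .withColumn("customer_id", "long", minValue=1000000, maxValue=9999999)
                .withColumn("plan", "string",
                            values=["basic", "standard", "premium"], random=True)
                )

dfTestData = testDataSpec.build()
dfTestData.show()   # in a Databricks notebook, display(dfTestData) also works
```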
@@ -209,6 +217,9 @@ the allowable values `['a', 'b', or 'c']`
inclusive. These will be computed via a uniformly distributed random value but with weighting applied so that
the value `a` occurs 9 times as frequently as the values `b` or `c`
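A sketch of how such a weighted column might be declared (the spec and column names are placeholders; the weights follow the 9:1:1 ratio described above):

```python
import dbldatagen as dg

df = (dg.DataGenerator(spark, name="weighted_example", rows=100000)
      .withColumn("code", "string",
                  values=["a", "b", "c"],
                  weights=[9, 1, 1],        # `a` is generated roughly 9x as often as `b` or `c`
                  random=True)
      .build())

df.groupBy("code").count().show()
```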
+ > NOTE: As the seed field named `id` is currently reserved for system use, manually adding a column named `id` can
+ > interfere with the data generation. This will be fixed in a forthcoming release
+
### Creating data set with pre-existing schema
What if we want to generate data conforming to a pre-existing schema? You can specify a schema for your data by either
taking a schema from an existing table, or computing an explicit schema.
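For example, a sketch using a schema taken from an existing table (the table name and the column overridden here are hypothetical):

```python
import dbldatagen as dg

# Take the schema from an existing table ...
table_schema = spark.table("examples.customers").schema

# ... and generate data conforming to it, adding generation rules for selected columns
dataSpec = (dg.DataGenerator(spark, name="schema_based", rows=10000)
            .withSchema(table_schema)
            .withColumnSpec("customer_id", minValue=1000000, maxValue=9999999)
            )

df = dataSpec.build()
```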
@@ -342,7 +353,8 @@ testDataSpec = (dg.DataGenerator(spark, name="device_data_set", rows=data_rows,
values=["activation", "deactivation", "plan change",
"telecoms activity", "internet activity", "device error"],
random=True)
- .withColumn("event_ts", "timestamp", begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00", interval="1 minute", random=True)
+ .withColumn("event_ts", "timestamp", begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00",
+             interval="1 minute", random=True)

)
@@ -439,7 +451,8 @@ testDataSpec = (dg.DataGenerator(spark, name="device_data_set", rows=data_rows,
values=["activation", "deactivation", "plan change",
"telecoms activity", "internet activity", "device error"],
random=True)
- .withColumn("event_ts", "timestamp", begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00", interval="1 minute", random=True)
+ .withColumn("event_ts", "timestamp", begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00",
+             interval="1 minute", random=True)

)