- # Databricks Labs Data Generator (`dbldatagen`)
+ # Databricks Labs Data Generator (`dbldatagen`)
+
+ <!-- Top bar will be removed from PyPi packaged versions -->
+ <!-- Dont remove: exclude package -->
+ [Documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html) |
  [Release Notes](CHANGELOG.md) |
- [Python Wheel](https://github.com/databrickslabs/dbldatagen/releases/tag/v.0.2.0-rc1-master) |
- [Developer Docs](docs/USING_THE_APIS.md) |
  [Examples](examples) |
  [Tutorial](tutorial)
+ <!-- Dont remove: end exclude package -->

  [![build](https://github.com/databrickslabs/dbldatagen/workflows/build/badge.svg?branch=master)](https://github.com/databrickslabs/dbldatagen/actions?query=workflow%3Abuild+branch%3Amaster)
  [![codecov](https://codecov.io/gh/databrickslabs/dbldatagen/branch/master/graph/badge.svg)](https://codecov.io/gh/databrickslabs/dbldatagen)
@@ -23,6 +26,7 @@ It has no dependencies on any libraries that are not already included in the Data
  runtime, and you can use it from Scala, R or other languages by defining
  a view over the generated data.
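
As a minimal sketch of the view-based approach described above, a generated Spark DataFrame can be registered as a temporary view and then queried from other languages through Spark SQL. The helper `publish_as_view` and the view name `generated_data` are hypothetical names chosen for illustration and are not part of `dbldatagen`; `createOrReplaceTempView` is the standard PySpark DataFrame method.

```python
def publish_as_view(df, view_name):
    """Register a generated DataFrame as a session-scoped temporary view.

    `df` is assumed to be a PySpark DataFrame, for example the result of
    calling `build()` on a data generator spec.
    """
    # Standard PySpark call: makes the data visible to Spark SQL in this session
    df.createOrReplaceTempView(view_name)
    return view_name


# Hypothetical usage in a notebook where `df` already holds generated data:
#   publish_as_view(df, "generated_data")
# The view can then be read from other languages, for example:
#   Scala: spark.table("generated_data")
#   SQL:   SELECT * FROM generated_data
```

A temporary view is scoped to the Spark session, so it suits notebooks where cells in several languages share one session.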

+ ### Feature Summary
  It supports:
  * Generating synthetic data at scale up to billions of rows within minutes using appropriately sized clusters
  * Generating repeatable, predictable data supporting the needs for producing multiple tables, Change Data Capture,
@@ -43,16 +47,32 @@ used in other computations
  * plugin mechanism to allow use of 3rd party libraries such as Faker
  * Use of data generator to generate data sources in Databricks Delta Live Tables

- Details of these features can be found in the [Developer Docs](docs/source/APIDOCS.md) and the online help
- (which contains the full documentation including the HTML version of the Developer Docs) -
- [Online Help](https://databrickslabs.github.io/dbldatagen/public_docs/index.html).
+ Details of these features can be found in the
+ [online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html).

+ ## Documentation

+ Please refer to the [online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html) for
+ details of use and many examples.

- ## Project Support
- Please note that all projects in the `databrickslabs` github space are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.
+ Release notes and details of the latest changes for this specific release
+ can be found in the GitHub repository
+ [here](https://github.com/databrickslabs/dbldatagen/blob/release/v0.2.1/CHANGELOG.md).
+
+ # Installation
+
+ Use `pip install dbldatagen` to install the PyPI package.
+
+ Within a Databricks notebook, invoke the following in a notebook cell:
+ ```commandline
+ %pip install dbldatagen
+ ```
+
+ This works within a Databricks notebook, in a Delta Live Tables pipeline, and even on the Databricks
+ Community Edition.

- Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo. They will be reviewed as time permits, but there are no formal SLAs for support.
+ The [installation notes](https://databrickslabs.github.io/dbldatagen/public_docs/installation_notes.html) in the documentation
+ contain details of installation using alternative mechanisms.
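
After installation, one way to confirm that the package is visible to the current Python environment is to look it up without importing it. The sketch below uses only the standard library; `is_installed` is a hypothetical helper and is not part of `dbldatagen`.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the import system can locate a top-level package."""
    # find_spec returns None when no top-level module of that name is found
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Expected to print True once the pip install step has succeeded
    print(is_installed("dbldatagen"))
```

Looking up the module spec avoids importing the package itself, so the check is cheap and does not require a running Spark session.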

  ## Compatibility
  The Databricks Labs data generator framework can be used with PySpark 3.x and Python 3.6 or later
@@ -65,23 +85,6 @@ release notes for library compatibility

  - https://docs.databricks.com/release-notes/runtime/releases.html

- ## Using a pre-built release
- The release binaries can be accessed at:
- - Databricks Labs Github Data Generator releases - https://github.com/databrickslabs/dbldatagen/releases
-
- You can install the library as a notebook scoped library when working within the Databricks
- notebook environment through the use of a `%pip install` cell in your notebook.
-
- To install as a notebook-scoped library, create and execute a notebook cell with the following text:
-
- > `%pip install git+https://github.com/databrickslabs/dbldatagen@current`
-
- The `%pip install` method will work in Delta Live Tables pipelines and in the Databricks Community
- Environment also.
-
- Alternatively, you can download a wheel file and install using the Databricks install mechanism to install a wheel based
- library into your workspace.
-
  ## Using the Data Generator
  To use the data generator, install the library using the `%pip install` method or install the Python wheel directly
  in your environment.
@@ -98,58 +101,39 @@ data_rows = 1000 * 1000
  df_spec = (dg.DataGenerator(spark, name="test_data_set1", rows=data_rows,
                              partitions=4)
             .withIdOutput()
-            .withColumn("r", FloatType(), expr="floor(rand() * 350) * (86400 + 3600)",
-                        numColumns=column_count)
+            .withColumn("r", FloatType(),
+                        expr="floor(rand() * 350) * (86400 + 3600)",
+                        numColumns=column_count)
             .withColumn("code1", IntegerType(), minValue=100, maxValue=200)
             .withColumn("code2", IntegerType(), minValue=0, maxValue=10)
             .withColumn("code3", StringType(), values=['a', 'b', 'c'])
-            .withColumn("code4", StringType(), values=['a', 'b', 'c'], random=True)
-            .withColumn("code5", StringType(), values=['a', 'b', 'c'], random=True, weights=[9, 1, 1])
+            .withColumn("code4", StringType(), values=['a', 'b', 'c'],
+                        random=True)
+            .withColumn("code5", StringType(), values=['a', 'b', 'c'],
+                        random=True, weights=[9, 1, 1])

             )

  df = df_spec.build()
  num_rows = df.count()
  ```
+ Refer to the [online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html) for further
+ examples.

+ The GitHub repository also contains further examples in the examples directory.
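
To make the example above more concrete, the sketch below works out in plain Python what two of its column definitions imply. It assumes that `weights` are relative frequencies (so `[9, 1, 1]` would mean roughly 9 in 11 rows receive `'a'`) and that Spark's `rand()` returns values in the half-open interval [0, 1). The helper names are hypothetical and are not part of `dbldatagen`.

```python
from fractions import Fraction


def expected_shares(values, weights):
    """Map each value to its expected share of rows, treating weights as relative frequencies."""
    total = sum(weights)
    return {value: Fraction(weight, total) for value, weight in zip(values, weights)}


def expr_value_range(buckets=350, step=86400 + 3600):
    """Smallest and largest result of floor(rand() * buckets) * step for rand() in [0, 1)."""
    # floor(rand() * buckets) takes integer values from 0 to buckets - 1
    return 0, (buckets - 1) * step


shares = expected_shares(['a', 'b', 'c'], [9, 1, 1])
print(shares['a'], shares['b'], shares['c'])  # 9/11 1/11 1/11
print(expr_value_range())                     # (0, 31410000)
```

Under these assumptions, the `r` columns hold multiples of 90000 (one day plus one hour, in seconds) between 0 and 31410000, and `code5` is heavily skewed toward `'a'`.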

- # Building the code
-
- See [CONTRIBUTING.md](CONTRIBUTING.md) for detailed build and testing instructions, including use of alternative
- build environments such as conda.
-
- Dependencies are maintained by [Pipenv](https://pipenv.pypa.io/). In order to start with development,
- you should install `pipenv` and `pyenv`.
-
- Use `make test-with-html-report` to build and run the tests with a coverage report.
-
- Use `make dist` to make the distributable. The resulting wheel file will be placed in the `dist` subdirectory.
-
- ## Creating the HTML documentation
-
- Run `make docs` from the main project directory.
-
- The main html document will be in the file (relative to the root of the build directory) `./python/docs/docs/build/html/index.html`
-
- ## Running unit tests
-
- If using an environment with multiple Python versions, make sure to use virtual env or similar to pick up correct python versions.
+ ## Project Support
+ Please note that all projects released under [`Databricks Labs`](https://www.databricks.com/learn/labs)
+ are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements
+ (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket
+ relating to any issues arising from the use of these projects.

- If necessary, set `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` to point to correct versions of Python.
+ Any issues discovered through the use of this project should be filed as issues on the GitHub repo.
+ They will be reviewed as time permits, but there are no formal SLAs for support.

- Run `make test` from the main project directory to run the unit tests.

  ## Feedback

  Issues with the application? Found a bug? Have a great idea for an addition?
- Feel free to file an issue.
-
- ## Project Support
-
- Please note that all projects in the /databrickslabs github account are provided for your exploration only, and are
- not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not
- make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use
- of these projects.
+ Feel free to file an [issue](https://github.com/databrickslabs/dbldatagen/issues/new).

- Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo. They will
- be reviewed as time permits, but there are no formal SLAs for support.