Commit be47e15

Doc updates 100522 (#119)
* fixed reference to dbx in pull_request_template
* reverted inadvertently changed file
* release 0.2.1
* doc updates
* doc updates
* updates for building docs
* updated public docs
* updated sphinx version
* updated docs
* doc updates
* removed generated docs
* removed changes to non-doc
1 parent d1a66a8 commit be47e15

13 files changed (+94 −51 lines changed)

CHANGELOG.md

Lines changed: 4 additions & 2 deletions

@@ -4,7 +4,7 @@
 
 See the contents of the file `python/require.txt` to see the Python package dependencies
 
-### Version 0.2.0-rc1
+### Version 0.2.1
 
 #### Features
 * Uses pipenv for main build process
@@ -23,4 +23,6 @@ See the contents of the file `python/require.txt` to see the Python package depe
 * renamed packaging to `dbldatagen`
 * moved Github repo to https://github.com/databrickslabs/dbldatagen/releases
 * code tidy up and rename of options
-* added text generation plugin support for python functions and 3rd party libraries
+* added text generation plugin support for python functions and 3rd party libraries such as Faker
+* Use of data generator to generate static and streaming data sources in Databricks Delta Live Tables
+
PULL_REQUEST_TEMPLATE.md

Lines changed: 1 addition & 1 deletion

@@ -5,7 +5,7 @@ If it fixes a bug or resolves a feature request, please provide a link to that i
 
 ## Types of changes
 
-What types of changes does your code introduce to dbx?
+What types of changes does your code introduce to dbldatagen?
 _Put an `x` in the boxes that apply_
 
 - [ ] Bug fix (non-breaking change which fixes an issue)

README.md

Lines changed: 12 additions & 12 deletions

@@ -7,7 +7,6 @@
 
 [![build](https://github.com/databrickslabs/dbldatagen/workflows/build/badge.svg?branch=master)](https://github.com/databrickslabs/dbldatagen/actions?query=workflow%3Abuild+branch%3Amaster)
 [![codecov](https://codecov.io/gh/databrickslabs/dbldatagen/branch/master/graph/badge.svg)](https://codecov.io/gh/databrickslabs/dbldatagen)
-![lines](https://img.shields.io/tokei/lines/github/databrickslabs/dbldatagen)
 [![downloads](https://img.shields.io/github/downloads/databrickslabs/dbldatagen/total.svg)](https://hanadigital.github.io/grev/?user=databrickslabs&repo=dbldatagen)
 [![Language grade: Python](https://img.shields.io/lgtm/grade/python/g/databrickslabs/dbldatagen.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/databrickslabs/dbldatagen/context:python)
 
@@ -42,6 +41,7 @@ used in other computations
 * Generating values to conform to a schema or independent of an existing schema
 * use of SQL expressions in test data generation
 * plugin mechanism to allow use of 3rd party libraries such as Faker
+* Use of data generator to generate data sources in Databricks Delta Live Tables
 
 Details of these features can be found in the [Developer Docs](docs/source/APIDOCS.md) and the online help
 (which contains the full documentation including the HTML version of the Developer Docs) -
@@ -69,24 +69,24 @@ release notes for library compatibility
 The release binaries can be accessed at:
 - Databricks Labs Github Data Generator releases - https://github.com/databrickslabs/dbldatagen/releases
 
-To use download a wheel file and install using the Databricks install mechanism to install a wheel based
-library into your workspace.
-
-Alternatively, you can install the library as a notebook scoped library when working within the Databricks
-notebook environment through the use of a `%pip` cell in your notebook.
+You can install the library as a notebook scoped library when working within the Databricks
+notebook environment through the use of a `%pip install` cell in your notebook.
 
 To install as a notebook-scoped library, create and execute a notebook cell with the following text:
 
-> `%pip install git+https://github.com/databrickslabs/dbldatagen`
+> `%pip install git+https://github.com/databrickslabs/dbldatagen@current`
 
-The `%pip install` method will work in the Databricks Community Environment also.
+The `%pip install` method will work in Delta Live Tables pipelines and in the Databricks Community
+Environment also.
 
-The latest pre-release is code complete and fully functional.
+Alternatively, you can download a wheel file and install using the Databricks install mechanism to install a wheel based
+library into your workspace.
 
-## Using the Project
-To use the project, the generated wheel should be installed in your Python notebook as a wheel based library
+## Using the Data Generator
+To use the data generator, install the library using the `%pip install` method or install the Python wheel directly
+in your environment.
 
-Once the library has been installed, you can use it to generate a test data frame.
+Once the library has been installed, you can use it to generate a data frame composed of synthetic data.
 
 For example
 
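The README's usage example itself is truncated in this commit view. For orientation, a minimal sketch of the kind of usage the README describes might look like the following (the spec name, columns, and parameters here are illustrative assumptions, not taken from the commit):

```python
import dbldatagen as dg

# sketch of a minimal data generation spec; column names and ranges
# are illustrative rather than from the README's own example
testDataSpec = (dg.DataGenerator(spark, name="example_data", rows=1000, partitions=4)
                .withIdOutput()  # include the seed `id` column in the output
                .withColumn("code", "int", minValue=1, maxValue=20, random=True)
                .withColumn("status", "string",
                            values=["active", "inactive"], random=True))

df = testDataSpec.build()  # yields an ordinary PySpark dataframe
df.show(5)
```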

docs/source/APIDOCS.md

Lines changed: 29 additions & 16 deletions

@@ -2,9 +2,11 @@
 
 The Databricks Labs data generator (aka `dbldatagen`) is a Spark based solution for generating
 realistic synthetic data. It uses the features of Spark dataframes and Spark SQL
-to generate synthetic data. As the output of the process is a Spark dataframe populated
-with the generated data , it may be saved to storage in a variety of formats, saved to tables
-or generally manipulated using the existing Spark Dataframe APIs.
+to generate synthetic data. As the process produces a Spark dataframe populated
+with generated data, it may be saved to storage in a variety of formats, saved to tables
+or generally manipulated using the existing Spark Dataframe APIs.
+
+It can also be used as a source in a Delta Live Tables pipeline supporting both streaming and batch operation.
 
 It has no dependencies on any libraries that are not already included in the Databricks
 runtime, and you can use it from Scala, R or other languages by defining
@@ -47,33 +49,37 @@ and [formatting on string columns](textdata)
 * Use [SQL based expressions](#using-sql-in-data-generation) to control or augment column generation
 * Script Spark SQL table creation statement for dataset
 * Specify a [statistical distribution for random values](./DISTRIBUTIONS.md)
+* Support for use within Databricks Delta Live Tables pipelines
 
 
 ## Tutorials and examples
 
 In the root directory of the project, there are a number of examples and tutorials.
 
-The Python examples in the `examples` folder can be run directly or imported into the Databricks runtime environment as Python files.
+The Python examples in the `examples` folder can be run directly or imported into the Databricks runtime environment
+as Python files.
 
-The examples in the `tutorials` folder are in notebook export format and are intended to be imported into the Databricks runtime environment.
+The examples in the `tutorials` folder are in notebook export format and are intended to be imported into the Databricks
+runtime environment.
 
 ## Basic concepts
 
-The Databricks Labs Data Generator is a Python framework that uses Spark to generate a dataframe of test data.
+The Databricks Labs Data Generator is a Python framework that uses Spark to generate a dataframe of synthetic data.
 
-Once the data frame is generated, it can be used with any Spark dataframee compatible API to save or persist data,
-to analyze data, to write it to an external database or stream, or generally used in the same manner as a regular dataframe.
+Once the data frame is generated, it can be used with any Spark dataframe compatible API to save or persist data,
+to analyze data, to write it to an external database or stream, or used in the same manner as a regular
+PySpark dataframe.
 
 To consume it from Scala, R, SQL or other languages, create a view over the resulting test dataframe and you can use
 it from any Databricks Spark runtime compatible language. By use of the appropriate parameters,
 you can instruct the data generator to automatically register a view as part of generating the test data.
 
 ### Generating the test data
-The test data generation process is controlled by a test data generation spec which can build a schema implicitly,
+The data generation process is controlled by a data generation spec which can build a schema implicitly,
 or a schema can be added from an existing table or Spark SQL schema object.
 
-Each column to be generated derives its test data from a set of one or more seed values.
-By default, this is the id field of the base data frame
+Each column to be generated derives its generated data from a set of one or more seed values.
+By default, this is the `id` field of the base data frame
 (generated with `spark.range` for batch data frames, or using a `Rate` source for streaming data frames).
 
 Each column can be specified as based on the `id` field or other columns in the test data generation spec.
@@ -92,17 +98,19 @@ There is also support for applying arbitrary SQL expressions, and generation of
 
 ### Getting started
 
-Before you can use the data generator, you need to install the package in your environment and import it in your code.
+Before using the data generator, you need to install the package in your environment and import it in your code.
 You can install the package from the Github releases as a library on your cluster.
 
 > NOTE: When running in a Databricks notebook environment, you can install directly using
 > the `%pip` command in a notebook cell
 >
 > To install as a notebook scoped library, add a cell with the following text and execute it:
 >
-> `%pip install git+https://github.com/databrickslabs/dbldatagen`
+> `%pip install git+https://github.com/databrickslabs/dbldatagen@current`
 
-The `%pip install` method will work in the Databricks Community Environment also.
+The `%pip install` method will work in the Databricks Community Environment and in Delta Live Tables pipelines also.
+
+You can also manually download a wheel file from the releases and install it in your environment.
 
 The releases are located at
 [Databricks Labs Data Generator releases](https://github.com/databrickslabs/dbldatagen/releases)
@@ -209,6 +217,9 @@ the allowable values `['a', 'b', or 'c']`
 inclusive. These will be computed via a uniformly distributed random value but with weighting applied so that
 the value `a` occurs 9 times as frequently as the values `b` or `c`
 
+> NOTE: As the seed field named `id` is currently reserved for system use, manually adding a column named `id` can
+> interfere with the data generation. This will be fixed in a forthcoming release
+
 ### Creating data set with pre-existing schema
 What if we want to generate data conforming to a pre-existing schema? You can specify a schema for your data by either
 taking a schema from an existing table, or computing an explicit schema.
@@ -342,7 +353,8 @@ testDataSpec = (dg.DataGenerator(spark, name="device_data_set", rows=data_rows,
                     values=["activation", "deactivation", "plan change",
                             "telecoms activity", "internet activity", "device error"],
                     random=True)
-        .withColumn("event_ts", "timestamp", begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00", interval="1 minute", random=True)
+        .withColumn("event_ts", "timestamp", begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00",
+                    interval="1 minute", random=True)
 
         )
 
@@ -439,7 +451,8 @@ testDataSpec = (dg.DataGenerator(spark, name="device_data_set", rows=data_rows,
                     values=["activation", "deactivation", "plan change",
                             "telecoms activity", "internet activity", "device error"],
                     random=True)
-        .withColumn("event_ts", "timestamp", begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00", interval="1 minute", random=True)
+        .withColumn("event_ts", "timestamp", begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00",
+                    interval="1 minute", random=True)
 
         )
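To make the weighted-values passage in the APIDOCS hunk above concrete, here is a sketch of a weighted column spec (assuming the `weights` parameter described elsewhere in the project docs; the 9:1:1 ratio mirrors the example in the text):

```python
import dbldatagen as dg

# sketch: `a` should occur roughly 9 times as often as `b` or `c`
df = (dg.DataGenerator(spark, name="weighted_example", rows=100000)
      .withColumn("code", "string", values=["a", "b", "c"],
                  weights=[9, 1, 1], random=True)
      .build())

df.groupBy("code").count().show()  # expect counts near a 9:1:1 ratio
```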

docs/source/_static/css/tdg.css

Lines changed: 5 additions & 1 deletion

@@ -5,6 +5,10 @@ body {
     font-family:"Source Sans Pro",sans-serif!important;
 }
 
+p.caption {
+    font-family:"Source Sans Pro",sans-serif!important;
+}
+
 code.sig-name.descname {
     font-family:"Source Sans Pro",sans-serif!important;
 }
@@ -39,7 +43,7 @@ h3 {
     max-width: 80%;
 }
 
-/* Left pannel size */
+/* Left panel size */
 @media (min-width: 768px) {
     .col-md-3 {
         flex: 0 0 20%;

docs/source/conf.py

Lines changed: 13 additions & 5 deletions

@@ -24,11 +24,11 @@
 # -- Project information -----------------------------------------------------
 
 project = 'Databricks Labs Data Generator'
-copyright = '2020, Databricks Inc'
+copyright = '2022, Databricks Inc'
 author = 'Databricks Inc'
 
 # The full version, including alpha/beta/rc tags
-release = "0.2.0-rc1"  # DO NOT EDIT THIS DIRECTLY! It is managed by bumpversion
+release = "0.2.1"  # DO NOT EDIT THIS DIRECTLY! It is managed by bumpversion
 
 
 # -- General configuration ---------------------------------------------------
@@ -37,13 +37,14 @@
 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
 # ones.
 extensions = [
-    'sphinx_rtd_theme',
+    'sphinx.ext.intersphinx',
     'sphinx.ext.autodoc',
     'sphinx.ext.napoleon',  # enable sphinx to parse NumPy and Google style doc strings
     #'sphinx.ext.autosummary',
     'sphinx.ext.viewcode',  # add links to source code
     #'numpydoc',  # handle NumPy documentation formatted docstrings. Needs to install
-    'recommonmark'  # allow including Commonmark markdown in sources
+    'recommonmark',  # allow including Commonmark markdown in sources
+    'sphinx_rtd_theme'
 ]
 
 source_suffix = {
@@ -86,6 +87,12 @@
 # a list of builtin themes.
 #
 import sphinx_rtd_theme
+
+intersphinx_mapping = {
+    'rtd': ('https://docs.readthedocs.io/en/stable/', None),
+    'sphinx': ('https://www.sphinx-doc.org/en/master/', None),
+}
+
 html_theme = "sphinx_rtd_theme"
 
 html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
@@ -100,7 +107,7 @@
 html_static_path = ['_static']
 
 html_css_files = [
-    'css/tdg.css',
+    'css/tdg.css'
 ]
 
 #html_sidebars={
@@ -118,3 +125,4 @@
 numpydoc_show_inherited_class_members=False
 numpydoc_class_members_toctree=False
 numpydoc_attributes_as_param_list=True
+
docs/source/extending_text_generation.rst

Lines changed: 2 additions & 1 deletion

@@ -8,7 +8,8 @@ Extending text generation
 
 This feature should be considered ``Experimental``.
 
-The ``PyfuncText``, ``PyfuncTextFactory`` and ``FakerTextFactory`` classes provide a mechanism to expand text generation to include
+The ``PyfuncText``, ``PyfuncTextFactory`` and ``FakerTextFactory`` classes provide a mechanism to expand text
+generation to include
 the use of arbitrary Python functions and 3rd party data generation libraries.
 
 The following example illustrates extension with the open source Faker library using the
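The Faker example referenced on the last line of this hunk is cut off in the diff view. A minimal sketch of the pattern the page describes, assuming the `fakerText` helper exported by `dbldatagen` and a cluster with the `faker` package installed:

```python
import dbldatagen as dg
from dbldatagen import fakerText  # assumed helper; requires `faker` to be installed

# sketch: use the Faker library via the text generation plugin mechanism
# to produce realistic-looking names
df = (dg.DataGenerator(spark, name="faker_example", rows=1000)
      .withColumn("customer_name", "string", text=fakerText("name"))
      .build())

df.show(5)
```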

docs/source/generating_cdc_data.rst

Lines changed: 3 additions & 1 deletion

@@ -51,7 +51,9 @@ We'll add a timestamp for when the row was generated and a memo field to mark wh
     .withColumn("customer_id","long", uniqueValues=uniqueCustomers)
     .withColumn("name", percentNulls=0.01, template=r'\\w \\w|\\w a. \\w')
     .withColumn("alias", percentNulls=0.01, template=r'\\w \\w|\\w a. \\w')
-    .withColumn("payment_instrument_type", values=['paypal', 'Visa', 'Mastercard', 'American Express', 'discover', 'branded visa', 'branded mastercard'], random=True, distribution="normal")
+    .withColumn("payment_instrument_type", values=['paypal', 'Visa', 'Mastercard',
+                'American Express', 'discover', 'branded visa', 'branded mastercard'],
+                random=True, distribution="normal")
     .withColumn("int_payment_instrument", "int", minValue=0000, maxValue=9999, baseColumn="customer_id",
                 baseColumnType="hash", omit=True)
     .withColumn("payment_instrument", expr="format_number(int_payment_instrument, '**** ****** *####')",

docs/source/index.rst

Lines changed: 4 additions & 1 deletion

@@ -16,6 +16,9 @@ or through creating a schema on the fly, you can control how synthetic data is g
 As the data generator generates a PySpark data frame, it is simple to create a view over it to expose it
 to Scala or R based Spark applications also.
 
+As it is installable via `%pip install`, it can also be incorporated in environments such as
+`Delta Live Tables <https://www.databricks.com/product/delta-live-tables>`_.
+
 .. toctree::
    :maxdepth: 1
    :caption: Getting Started
@@ -56,7 +59,7 @@ to Scala or R based Spark applications also.
    license
 
 Indices and tables
-==================
+------------------
 
 * :ref:`genindex`
 * :ref:`modindex`
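Since several files in this commit advertise Delta Live Tables support, a sketch of what that usage could look like in a DLT pipeline notebook follows (the table name, columns, and spec are illustrative assumptions, not taken from this commit):

```python
import dlt
import dbldatagen as dg

@dlt.table(name="synthetic_events", comment="illustrative synthetic event source")
def synthetic_events():
    # with withStreaming=True the generator seeds rows from Spark's
    # Rate source instead of spark.range, producing a streaming dataframe
    spec = (dg.DataGenerator(spark, name="events", rows=100000, partitions=4)
            .withColumn("event_type", "string",
                        values=["activation", "deactivation", "plan change"],
                        random=True))
    return spec.build(withStreaming=True)
```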

docs/source/installation_notes.rst

Lines changed: 10 additions & 4 deletions

@@ -23,14 +23,20 @@ To do this add and execute the following cell at the start of your notebook:
 
 .. code-block::
 
-   %pip install git+https://github.com/databrickslabs/dbldatagen
+   %pip install git+https://github.com/databrickslabs/dbldatagen@current
 
-By default, this will install a fresh build from the ``master`` branch. You can install from a
-specific branch by appending the branch identifier to the github URL.
+By default, this will install a fresh build from the latest release based on the ``master`` branch.
+You can install from a specific branch by appending the branch identifier or tag to the github URL.
 
 .. code-block::
 
-   %pip install git+https://github.com/databrickslabs/dbldatagen#error-report-improvements
+   %pip install git+https://github.com/databrickslabs/dbldatagen@dbr_7_3_LTS_compat
+
+The following tags will be used to pick up specific versions:
+
+* `current` - latest build from master + doc changes and critical bug fixes
+* `stable` - latest release from master (with changes for version marking and documentation only)
+* `preview` - preview build of forthcoming features
 
 .. seealso::
    See the following links for more details:
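Combining the tag list above with the install command, pinning to a specific tag presumably follows the same `@` pattern, for example:

```
%pip install git+https://github.com/databrickslabs/dbldatagen@stable
```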
