The Databricks Labs data generator (aka `dbldatagen`) is a Spark based solution for generating
realistic synthetic data. It uses the features of Spark dataframes and Spark SQL
- to generate synthetic data. As the output of the process is a Spark dataframe populated
- with the generated data , it may be saved to storage in a variety of formats, saved to tables
- or generally manipulated using the existing Spark Dataframe APIs.
+ to generate synthetic data. As the process produces a Spark dataframe populated
+ with generated data, it may be saved to storage in a variety of formats, saved to tables
+ or generally manipulated using the existing Spark Dataframe APIs.
+
+ It can also be used as a source in a Delta Live Tables pipeline, supporting both streaming and batch operation.

It has no dependencies on any libraries that are not already included in the Databricks
runtime, and you can use it from Scala, R or other languages by defining
@@ -47,33 +49,37 @@ and [formatting on string columns](textdata)
* Use [SQL based expressions](#using-sql-in-data-generation) to control or augment column generation
* Script Spark SQL table creation statement for dataset
* Specify a [statistical distribution for random values](./DISTRIBUTIONS.md)
+ * Support for use within Databricks Delta Live Tables pipelines

## Tutorials and examples

In the root directory of the project, there are a number of examples and tutorials.

- The Python examples in the `examples` folder can be run directly or imported into the Databricks runtime environment as Python files.
+ The Python examples in the `examples` folder can be run directly or imported into the Databricks runtime environment
+ as Python files.

- The examples in the `tutorials` folder are in notebook export format and are intended to be imported into the Databricks runtime environment.
+ The examples in the `tutorials` folder are in notebook export format and are intended to be imported into the Databricks
+ runtime environment.

## Basic concepts

- The Databricks Labs Data Generator is a Python framework that uses Spark to generate a dataframe of test data.
+ The Databricks Labs Data Generator is a Python framework that uses Spark to generate a dataframe of synthetic data.

- Once the data frame is generated, it can be used with any Spark dataframee compatible API to save or persist data,
- to analyze data, to write it to an external database or stream, or generally used in the same manner as a regular dataframe.
+ Once the data frame is generated, it can be used with any Spark dataframe compatible API to save or persist data,
+ to analyze data, to write it to an external database or stream, or used in the same manner as a regular
+ PySpark dataframe.

To consume it from Scala, R, SQL or other languages, create a view over the resulting test dataframe and you can use
it from any Databricks Spark runtime compatible language. By use of the appropriate parameters,
you can instruct the data generator to automatically register a view as part of generating the test data.
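As an illustrative sketch only (the spec, column names and view name below are hypothetical, not taken from this document), a generated dataframe can be exposed to SQL or other languages by registering a view with the standard Spark API:

```python
import dbldatagen as dg

# Build a small generated dataframe (column definitions here are placeholders)
df = (dg.DataGenerator(spark, name="view_example", rows=1000)
      .withColumn("site_id", "integer", minValue=1, maxValue=20, random=True)
      .build())

# Register a temporary view so the generated data can be queried from SQL, Scala or R
df.createOrReplaceTempView("generated_site_data")

spark.sql("SELECT site_id, count(*) AS n FROM generated_site_data GROUP BY site_id").show()
```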
### Generating the test data
- The test data generation process is controlled by a test data generation spec which can build a schema implicitly,
+ The data generation process is controlled by a data generation spec which can build a schema implicitly,
or a schema can be added from an existing table or Spark SQL schema object.

- Each column to be generated derives its test data from a set of one or more seed values.
- By default, this is the id field of the base data frame
+ Each column to be generated derives its generated data from a set of one or more seed values.
+ By default, this is the `id` field of the base data frame
(generated with `spark.range` for batch data frames, or using a `Rate` source for streaming data frames).

Each column can be specified as based on the `id` field or other columns in the test data generation spec.
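A minimal sketch of this idea (the spec name, column names and ranges below are illustrative assumptions, not taken from this document): columns are derived from the seed `id` column unless another base column is named.

```python
import dbldatagen as dg

df = (dg.DataGenerator(spark, name="seed_example", rows=100000, partitions=4)
      .withIdOutput()                                # include the seed `id` column in the output
      .withColumn("internal_device_id", "long",
                  minValue=0x1000000000000,
                  uniqueValues=10000)                # derived from the seed `id` column by default
      .withColumn("device_id", "string",
                  format="0x%013x",
                  baseColumn="internal_device_id")   # derived from another generated column
      .build())
```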
@@ -92,17 +98,19 @@ There is also support for applying arbitrary SQL expressions, and generation of
### Getting started

- Before you can use the data generator, you need to install the package in your environment and import it in your code.
+ Before using the data generator, you need to install the package in your environment and import it in your code.
You can install the package from the Github releases as a library on your cluster.

> NOTE: When running in a Databricks notebook environment, you can install directly using
> the `%pip` command in a notebook cell
>
> To install as a notebook scoped library, add a cell with the following text and execute it:
>
- > `%pip install git+https://github.com/databrickslabs/dbldatagen`
+ > `%pip install git+https://github.com/databrickslabs/dbldatagen@current`

- The `%pip install` method will work in the Databricks Community Environment also.
+ The `%pip install` method will work in the Databricks Community Environment and in Delta Live Tables pipelines also.
+
+ You can also manually download a wheel file from the releases and install it in your environment.

The releases are located at
[Databricks Labs Data Generator releases](https://github.com/databrickslabs/dbldatagen/releases)
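Once the package is installed, a first run might look like the following sketch (the spec name, column names and value ranges are illustrative placeholders, not taken from this document):

```python
import dbldatagen as dg

# Define a simple data generation spec and materialize it as a Spark dataframe
testDataSpec = (dg.DataGenerator(spark, name="quickstart", rows=1000, partitions=4)
                .withColumn("customer_id", "long", minValue=1000000, maxValue=9999999)
                .withColumn("plan", "string",
                            values=["basic", "standard", "premium"], random=True)
                )

dfTestData = testDataSpec.build()
dfTestData.show()   # in a Databricks notebook, display(dfTestData) also works
```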
@@ -209,6 +217,9 @@ the allowable values `['a', 'b', or 'c']`
inclusive. These will be computed via a uniformly distributed random value but with weighting applied so that
the value `a` occurs 9 times as frequently as the values `b` or `c`
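A sketch of how such a weighted column might be declared (the spec and column names are placeholders; the weights follow the 9:1:1 ratio described above):

```python
import dbldatagen as dg

df = (dg.DataGenerator(spark, name="weighted_example", rows=100000)
      .withColumn("code", "string",
                  values=["a", "b", "c"],
                  weights=[9, 1, 1],        # `a` is generated roughly 9x as often as `b` or `c`
                  random=True)
      .build())

df.groupBy("code").count().show()
```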
+ > NOTE: As the seed field named `id` is currently reserved for system use, manually adding a column named `id` can
+ > interfere with the data generation. This will be fixed in a forthcoming release
+
### Creating data set with pre-existing schema
What if we want to generate data conforming to a pre-existing schema? You can specify a schema for your data by either
taking a schema from an existing table, or computing an explicit schema.
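For example, a sketch using a schema taken from an existing table (the table name and the column overridden here are hypothetical):

```python
import dbldatagen as dg

# Take the schema from an existing table ...
table_schema = spark.table("examples.customers").schema

# ... and generate data conforming to it, adding generation rules for selected columns
dataSpec = (dg.DataGenerator(spark, name="schema_based", rows=10000)
            .withSchema(table_schema)
            .withColumnSpec("customer_id", minValue=1000000, maxValue=9999999)
            )

df = dataSpec.build()
```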
@@ -342,7 +353,8 @@ testDataSpec = (dg.DataGenerator(spark, name="device_data_set", rows=data_rows,
values=["activation", "deactivation", "plan change",
"telecoms activity", "internet activity", "device error"],
random=True)
- .withColumn("event_ts", "timestamp", begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00", interval="1 minute", random=True)
+ .withColumn("event_ts", "timestamp", begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00",
+             interval="1 minute", random=True)

)
@@ -439,7 +451,8 @@ testDataSpec = (dg.DataGenerator(spark, name="device_data_set", rows=data_rows,
values=["activation", "deactivation", "plan change",
"telecoms activity", "internet activity", "device error"],
random=True)
- .withColumn("event_ts", "timestamp", begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00", interval="1 minute", random=True)
+ .withColumn("event_ts", "timestamp", begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00",
+             interval="1 minute", random=True)

)