Home
Ronan Stokes edited this page Sep 29, 2021 · 17 revisions
Welcome to the data-generator wiki!
The Databricks data generator (dbldatagen) is available as public preview at present.
It supports all major functionality and is code complete.
Release steps:
- soft release (with docs hosted on GitHub Pages)
- package release (with docs hosted via ReadTheDocs), with the data generator available as a package
Key features in initial release:
- Generation of data conforming to statistical distributions
- Faker integration via plugin mechanism
- Support for generation of streaming data
- Support for generation of multi-table data with consistency between primary and foreign keys
- Support for generation of CDC style data
- Support for generation of IoT style data
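To make the multi-table feature concrete, here is a minimal sketch of what "consistency between primary and foreign keys" means. It is plain Python using only the standard library and does not use the dbldatagen API. The function `generate_tables`, the `customers` and `orders` tables, and all column names are hypothetical, chosen only for illustration.

```python
import random


def generate_tables(num_customers=5, num_orders=20, seed=42):
    """Return (customers, orders) as lists of dicts with consistent keys.

    Hypothetical illustration only; this is not how dbldatagen is implemented.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same data

    # Parent table: primary keys are a dense integer range starting at 1.
    customers = [
        {"customer_id": cid, "segment": rng.choice(["retail", "business"])}
        for cid in range(1, num_customers + 1)
    ]

    # Child table: each foreign key is drawn from the parent's primary keys,
    # so every order refers to a customer that exists.
    customer_ids = [c["customer_id"] for c in customers]
    orders = [
        {
            "order_id": oid,
            "customer_id": rng.choice(customer_ids),
            # Amounts follow a normal distribution, floored at zero.
            "amount": round(max(0.0, rng.gauss(100.0, 25.0)), 2),
        }
        for oid in range(1, num_orders + 1)
    ]
    return customers, orders


customers, orders = generate_tables()
primary_keys = {c["customer_id"] for c in customers}
orphans = [o for o in orders if o["customer_id"] not in primary_keys]
print(f"{len(customers)} customers, {len(orders)} orders, {len(orphans)} orphaned orders")
```

Drawing the child table's keys from the parent table's key set is what guarantees that a join between the two tables never drops rows. The same idea applies at Spark scale, although the library's own mechanism may differ from this sketch.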
Todo items:
- Additional sections in help content
- Fix up consistency of argument naming for withColumn, withColumnSpec, withColumnSpecs
- Fix up function names for consistency (will follow PySpark SQL conventions)
- Fix up public API method names (should be very few remaining) => will adopt camelCase throughout
- Fix up private method names => will adopt camelCase throughout
- Addition of doc sections on CDC and multi-table data generation
To access the latest help:
- Go to GitHub Pages
- Click on the view deployment link
The following direct link will bring you to the documentation: Databricks Data Generator online documentation