Ronan Stokes edited this page Sep 29, 2021 · 17 revisions

Welcome to the data-generator wiki!

The Databricks data generator (dbldatagen) is currently available as a public preview.

It supports all major functionality and is code complete.

Roadmap for initial release

Steps:

  • Soft release (with docs hosted on GitHub Pages)
  • Package release (with docs hosted on Read the Docs), with the data generator available as an installable package

Key features in initial release:

  • Data generation with support for generation of data conforming to statistical distributions
  • Faker integration via plugin mechanism
  • Support for generation of streaming data
  • Support for generation of multi-table data with consistency between primary and foreign keys
  • Support for generation of CDC-style (change data capture) data
  • Support for generation of IoT-style data
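
To make the multi-table feature above more concrete, here is a minimal plain-Python sketch of the idea of keeping foreign keys consistent with primary keys. It does not use the dbldatagen API: the function names, column names, and the normal distribution parameters are all hypothetical and chosen only for illustration.

```python
import random


def generate_customers(n, seed=0):
    """Build a parent table whose primary key is customer_id (hypothetical schema)."""
    rng = random.Random(seed)
    return [
        {"customer_id": i, "segment": rng.choice(["retail", "business"])}
        for i in range(n)
    ]


def generate_orders(n, customers, seed=1):
    """Build a child table whose customer_id values are drawn only from the parent table."""
    rng = random.Random(seed)
    valid_ids = [c["customer_id"] for c in customers]
    return [
        {
            "order_id": i,
            # Foreign key: always an existing primary key from the parent table.
            "customer_id": rng.choice(valid_ids),
            # Amounts drawn from a normal distribution (assumed mean 50, std dev 10).
            "amount": round(rng.gauss(50.0, 10.0), 2),
        }
        for i in range(n)
    ]


customers = generate_customers(5)
orders = generate_orders(20, customers)
```

Because every order picks its customer_id from the generated customers, a join between the two tables never yields orphaned rows. Fixed seeds keep the output repeatable between runs.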

Todo items:

  • Additional sections in help content

  • Make argument naming consistent across withColumn, withColumnSpec, and withColumnSpecs

  • Make function names consistent (these will follow PySpark SQL conventions)

    • Rename the remaining public API methods (there should be very few); camelCase will be adopted throughout
    • Rename private methods; camelCase will be adopted throughout
  • Add documentation sections on CDC and multi-table data generation

Online Help
