
Commit 571fae2

new documentation on generating JSON data (#151)
* new documentation on generating JSON data
* doc changes
* updates to docs
* fixed typo in docs
1 parent 04c8bd0 commit 571fae2

File tree

4 files changed (+471, -16 lines)


docs/source/generating_json_data.rst

Lines changed: 236 additions & 0 deletions
@@ -0,0 +1,236 @@
.. Test Data Generator documentation master file, created by
   sphinx-quickstart on Sun Jun 21 10:54:30 2020.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Generating JSON and structured column data
==========================================

This section explores generating JSON and structured column data. By structured columns,
we mean columns that are some combination of `struct`, `array` and `map` of other types.

Generating JSON data
--------------------
There are several methods for generating JSON data:

- Generate a dataframe and save it as JSON; this writes the full data set out as JSON
- Generate JSON valued fields using Spark SQL functions such as `named_struct` and `to_json`

Writing a dataframe as JSON data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following example illustrates the basic technique for generating JSON data from a dataframe.
.. code-block:: python

    from pyspark.sql.types import LongType, IntegerType, StringType

    import dbldatagen as dg

    # number of unique device ids to generate - value chosen for illustration only
    device_population = 100000

    country_codes = ['CN', 'US', 'FR', 'CA', 'IN', 'JM', 'IE', 'PK', 'GB', 'IL', 'AU', 'SG',
                     'ES', 'GE', 'MX', 'ET', 'SA', 'LB', 'NL']
    country_weights = [1300, 365, 67, 38, 1300, 3, 7, 212, 67, 9, 25, 6, 47, 83, 126, 109, 58, 8,
                       17]

    manufacturers = ['Delta corp', 'Xyzzy Inc.', 'Lakehouse Ltd', 'Acme Corp', 'Embanks Devices']

    lines = ['delta', 'xyzzy', 'lakehouse', 'gadget', 'droid']

    testDataSpec = (dg.DataGenerator(spark, name="device_data_set", rows=1000000,
                                     partitions=8,
                                     randomSeedMethod='hash_fieldname')
                    .withIdOutput()
                    # we'll use a hash of the base field to generate the ids to
                    # avoid a simple incrementing sequence
                    .withColumn("internal_device_id", LongType(), minValue=0x1000000000000,
                                uniqueValues=device_population, omit=True, baseColumnType="hash")

                    # note for format strings, we must use "%lx" not "%x" as the
                    # underlying value is a long
                    .withColumn("device_id", StringType(), format="0x%013x",
                                baseColumn="internal_device_id")

                    # the device / user attributes will be the same for the same device id
                    # so let's use the internal device id as the base column for these attributes
                    .withColumn("country", StringType(), values=country_codes,
                                weights=country_weights,
                                baseColumn="internal_device_id")
                    .withColumn("manufacturer", StringType(), values=manufacturers,
                                baseColumn="internal_device_id")

                    # use omit=True if you don't want a column to appear in the final output
                    # but just want to use it as part of the generation of another column
                    .withColumn("line", StringType(), values=lines, baseColumn="manufacturer",
                                baseColumnType="hash")
                    .withColumn("model_ser", IntegerType(), minValue=1, maxValue=11,
                                baseColumn="device_id",
                                baseColumnType="hash", omit=True)

                    .withColumn("event_type", StringType(),
                                values=["activation", "deactivation", "plan change",
                                        "telecoms activity", "internet activity", "device error"],
                                random=True)
                    .withColumn("event_ts", "timestamp", begin="2020-01-01 01:00:00",
                                end="2020-12-31 23:59:00",
                                interval="1 minute", random=True)
                    )

    dfTestData = testDataSpec.build()

    dfTestData.write.format("json").mode("overwrite").save("/tmp/jsonData1")
In the most basic form, you can simply save the dataframe to storage in JSON format.
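To spot-check the output, the saved files can be read back with Spark's JSON reader. The following is a
minimal sketch; it assumes the same `spark` session and the save path used above.

.. code-block:: python

    # read the generated JSON data back and inspect the schema and a few rows
    dfJson = spark.read.format("json").load("/tmp/jsonData1")
    dfJson.printSchema()
    dfJson.show(5, truncate=False)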
Use of nested structures in data generation specifications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When we save a dataframe containing complex column types such as `map`, `struct` and `array`, these will be
converted to equivalent constructs in JSON.

So how do we go about creating these?

We can use a struct valued column to hold the nested structure data and write the results out as JSON.

Struct, array and map valued columns can be created by adding a column of the appropriate type and using the `expr`
attribute to assemble the complex column.

Note that in the current release, the `expr` attribute will override other column data generation rules.
.. code-block:: python

    from pyspark.sql.types import LongType, FloatType, IntegerType, StringType, DoubleType, BooleanType, ShortType, \
        TimestampType, DateType, DecimalType, ByteType, BinaryType, ArrayType, MapType, StructType, StructField

    import dbldatagen as dg

    # number of unique device ids to generate - value chosen for illustration only
    device_population = 100000

    country_codes = ['CN', 'US', 'FR', 'CA', 'IN', 'JM', 'IE', 'PK', 'GB', 'IL', 'AU', 'SG',
                     'ES', 'GE', 'MX', 'ET', 'SA', 'LB', 'NL']
    country_weights = [1300, 365, 67, 38, 1300, 3, 7, 212, 67, 9, 25, 6, 47, 83, 126, 109, 58, 8,
                       17]

    manufacturers = ['Delta corp', 'Xyzzy Inc.', 'Lakehouse Ltd', 'Acme Corp', 'Embanks Devices']

    lines = ['delta', 'xyzzy', 'lakehouse', 'gadget', 'droid']

    testDataSpec = (dg.DataGenerator(spark, name="device_data_set", rows=1000000,
                                     partitions=8,
                                     randomSeedMethod='hash_fieldname')
                    .withIdOutput()
                    # we'll use a hash of the base field to generate the ids to
                    # avoid a simple incrementing sequence
                    .withColumn("internal_device_id", LongType(), minValue=0x1000000000000,
                                uniqueValues=device_population, omit=True, baseColumnType="hash")

                    # note for format strings, we must use "%lx" not "%x" as the
                    # underlying value is a long
                    .withColumn("device_id", StringType(), format="0x%013x",
                                baseColumn="internal_device_id")

                    # the device / user attributes will be the same for the same device id
                    # so let's use the internal device id as the base column for these attributes
                    .withColumn("country", StringType(), values=country_codes,
                                weights=country_weights,
                                baseColumn="internal_device_id")

                    .withColumn("manufacturer", StringType(), values=manufacturers,
                                baseColumn="internal_device_id", omit=True)
                    .withColumn("line", StringType(), values=lines, baseColumn="manufacturer",
                                baseColumnType="hash", omit=True)
                    .withColumn("manufacturer_info",
                                StructType([StructField('line', StringType()),
                                            StructField('manufacturer', StringType())]),
                                expr="named_struct('line', line, 'manufacturer', manufacturer)",
                                baseColumn=['manufacturer', 'line'])

                    .withColumn("model_ser", IntegerType(), minValue=1, maxValue=11,
                                baseColumn="device_id",
                                baseColumnType="hash", omit=True)

                    .withColumn("event_type", StringType(),
                                values=["activation", "deactivation", "plan change",
                                        "telecoms activity", "internet activity", "device error"],
                                random=True, omit=True)
                    .withColumn("event_ts", "timestamp", begin="2020-01-01 01:00:00",
                                end="2020-12-31 23:59:00",
                                interval="1 minute", random=True, omit=True)

                    .withColumn("event_info",
                                StructType([StructField('event_type', StringType()),
                                            StructField('event_ts', TimestampType())]),
                                expr="named_struct('event_type', event_type, 'event_ts', event_ts)",
                                baseColumn=['event_type', 'event_ts'])
                    )

    dfTestData = testDataSpec.build()
    dfTestData.write.format("json").mode("overwrite").save("/tmp/jsonData2")
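Array and map valued columns can be assembled in the same way, using Spark SQL functions such as `array`
and `map` in the `expr` attribute. The fragment below is a sketch only; the `event_history` and `tags`
columns are illustrative additions, not part of the specification above.

.. code-block:: python

    from pyspark.sql.types import ArrayType, MapType, StringType

    # hypothetical extra columns that could be chained onto the data generation spec above;
    # each `expr` references previously defined columns listed in `baseColumn`
    testDataSpec = (testDataSpec
                    .withColumn("event_history", ArrayType(StringType()),
                                expr="array(event_type, cast(event_ts as string))",
                                baseColumn=['event_type', 'event_ts'])
                    .withColumn("tags", MapType(StringType(), StringType()),
                                expr="map('country', country, 'line', line)",
                                baseColumn=['country', 'line'])
                    )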
Generating JSON valued fields
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

JSON valued fields can be generated as fields of `string` type and assembled using a combination of Spark SQL
functions such as `named_struct` and `to_json`.
.. code-block:: python

    from pyspark.sql.types import LongType, FloatType, IntegerType, StringType, DoubleType, BooleanType, ShortType, \
        TimestampType, DateType, DecimalType, ByteType, BinaryType, ArrayType, MapType, StructType, StructField

    import dbldatagen as dg

    # number of unique device ids to generate - value chosen for illustration only
    device_population = 100000

    country_codes = ['CN', 'US', 'FR', 'CA', 'IN', 'JM', 'IE', 'PK', 'GB', 'IL', 'AU', 'SG',
                     'ES', 'GE', 'MX', 'ET', 'SA', 'LB', 'NL']
    country_weights = [1300, 365, 67, 38, 1300, 3, 7, 212, 67, 9, 25, 6, 47, 83, 126, 109, 58, 8,
                       17]

    manufacturers = ['Delta corp', 'Xyzzy Inc.', 'Lakehouse Ltd', 'Acme Corp', 'Embanks Devices']

    lines = ['delta', 'xyzzy', 'lakehouse', 'gadget', 'droid']

    testDataSpec = (dg.DataGenerator(spark, name="device_data_set", rows=1000000,
                                     partitions=8,
                                     randomSeedMethod='hash_fieldname')
                    .withIdOutput()
                    # we'll use a hash of the base field to generate the ids to
                    # avoid a simple incrementing sequence
                    .withColumn("internal_device_id", LongType(), minValue=0x1000000000000,
                                uniqueValues=device_population, omit=True, baseColumnType="hash")

                    # note for format strings, we must use "%lx" not "%x" as the
                    # underlying value is a long
                    .withColumn("device_id", StringType(), format="0x%013x",
                                baseColumn="internal_device_id")

                    # the device / user attributes will be the same for the same device id
                    # so let's use the internal device id as the base column for these attributes
                    .withColumn("country", StringType(), values=country_codes,
                                weights=country_weights,
                                baseColumn="internal_device_id")

                    .withColumn("manufacturer", StringType(), values=manufacturers,
                                baseColumn="internal_device_id", omit=True)
                    .withColumn("line", StringType(), values=lines, baseColumn="manufacturer",
                                baseColumnType="hash", omit=True)
                    .withColumn("manufacturer_info", "string",
                                expr="to_json(named_struct('line', line, 'manufacturer', manufacturer))",
                                baseColumn=['manufacturer', 'line'])

                    .withColumn("model_ser", IntegerType(), minValue=1, maxValue=11,
                                baseColumn="device_id",
                                baseColumnType="hash", omit=True)

                    .withColumn("event_type", StringType(),
                                values=["activation", "deactivation", "plan change",
                                        "telecoms activity", "internet activity", "device error"],
                                random=True, omit=True)
                    .withColumn("event_ts", "timestamp", begin="2020-01-01 01:00:00",
                                end="2020-12-31 23:59:00",
                                interval="1 minute", random=True, omit=True)

                    .withColumn("event_info", "string",
                                expr="to_json(named_struct('event_type', event_type, 'event_ts', event_ts))",
                                baseColumn=['event_type', 'event_ts'])
                    )

    dfTestData = testDataSpec.build()

    # dfTestData.write.format("json").mode("overwrite").save("/tmp/jsonData2")
    display(dfTestData)    # display() is available in Databricks notebooks
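Because the `manufacturer_info` and `event_info` columns are plain JSON strings, consumers of the data can
parse them back into structured columns with `from_json`. A minimal sketch, assuming the generated dataframe
above and a schema matching the `event_info` payload:

.. code-block:: python

    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    event_schema = StructType([StructField('event_type', StringType()),
                               StructField('event_ts', TimestampType())])

    # parse the JSON string column back into a struct and access its fields
    dfParsed = dfTestData.withColumn("event_info_struct", F.from_json("event_info", event_schema))
    dfParsed.select("device_id", "event_info_struct.event_type", "event_info_struct.event_ts").show(5)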

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -32,6 +32,7 @@ As it is installable via `%pip install`, it can also be incorporated in environm
    Options for column specification <options_and_features>
    Generating repeatable data <repeatable_data_generation>
    Using streaming data <using_streaming_data>
+   Generating JSON and structured column data <generating_json_data>
    Generating Change Data Capture (CDC) data<generating_cdc_data>
    Using multiple tables <multi_table_data>
    Extending text generation <extending_text_generation>

docs/source/options_and_features.rst

Lines changed: 55 additions & 8 deletions
@@ -12,27 +12,60 @@ Options for column specification

The following table lists some of the common options that can be applied with the ``withColumn`` and ``withColumnSpec``
methods.

.. table:: Column creation options

================ ==============================
Parameter        Usage
================ ==============================
minValue         Minimum value for the range of generated values. As an alternative, use ``dataRange``.

maxValue         Maximum value for the range of generated values. As an alternative, use ``dataRange``.

step             Step to use for the range of generated values.

                 As an alternative, you may use the `dataRange` parameter.

random           If `True`, will generate random values for the column value. Defaults to `False`.

randomSeedMethod Determines how the random seed will be used.

                 If set to the value 'fixed', a fixed random seed will be used.

                 If set to 'hash_fieldname', a hash of the field name will be used as the random seed
                 for a specific column.

baseColumn       Either the string name of the base column, or a list of columns to use to control data generation.

values           List of discrete values for the column.

                 Discrete values can be numeric, dates, timestamps, strings, etc.

weights          List of discrete weights for the column. Controls the spread of values.

percentNulls     Percentage of nulls to generate for the column.

                 Fraction representing a percentage between 0.0 and 1.0.

uniqueValues     Number of distinct unique values for the column. Use as an alternative to a data range.

begin            Beginning of range for date and timestamp fields.

end              End of range for date and timestamp fields.

interval         Interval of range for date and timestamp fields.

dataRange        An instance of an `NRange` or `DateRange` object.

                 This can be used in place of ``minValue``, etc.

template         Template controlling text generation.

omit             If `True`, omit the column from the final output.

                 Use when a column is only needed to compute other columns.

expr             SQL expression to control data generation.
================ ==============================

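To illustrate how several of these options combine in practice, a specification along the following lines
could be used. This is a sketch only; the column names are illustrative and an active `spark` session is
assumed.

.. code-block:: python

    import dbldatagen as dg

    # sketch illustrating minValue/maxValue, values/weights, percentNulls and uniqueValues
    exampleSpec = (dg.DataGenerator(spark, name="options_example", rows=10000, partitions=4)
                   .withColumn("score", "int", minValue=0, maxValue=100, random=True)
                   .withColumn("tier", "string", values=["bronze", "silver", "gold"],
                               weights=[7, 2, 1], random=True)
                   .withColumn("optional_note", "string", values=["a", "b", "c"],
                               percentNulls=0.1, random=True)
                   .withColumn("account_id", "long", uniqueValues=1000)
                   )
    exampleDf = exampleSpec.build()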
@@ -44,12 +77,26 @@ expr SQL expression to control data generation

For more information, see :data:`~dbldatagen.daterange.DateRange`
or :data:`~dbldatagen.daterange.NRange`.

Using custom SQL to control data generation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The `expr` attribute can be used to specify an arbitrary Spark SQL expression to control how the data is
generated for a column. If the body of the SQL references other columns, you will need to ensure that
those columns are created first.

By default, the columns are created in the order specified.

However, you can control the order of column creation using the `baseColumn` attribute.
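For example, a specification along the following lines uses `expr` to derive one column from another and
lists that column in `baseColumn` so it is generated first. This is a minimal sketch with illustrative
column names; an active `spark` session is assumed.

.. code-block:: python

    import dbldatagen as dg

    # `total_amount` is computed from `base_amount` via a SQL expression;
    # naming `base_amount` in `baseColumn` ensures it is created before `total_amount`
    df = (dg.DataGenerator(spark, name="expr_example", rows=1000, partitions=4)
          .withColumn("base_amount", "float", minValue=1.0, maxValue=100.0, random=True)
          .withColumn("total_amount", "float",
                      expr="round(base_amount * 1.1, 2)",
                      baseColumn="base_amount")
          .build())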
More Details
^^^^^^^^^^^^

The full set of options for column specification which may be used with the ``withColumn``, ``withColumnSpec``
and ``withColumnSpecs`` methods can be found at:

* :data:`~dbldatagen.column_spec_options.ColumnSpecOptions`

Generating views automatically
------------------------------
