
Commit f3167d8

new docs and tutorials - "Multi-table data generation" (#61)
* new docs and tutorials
* new docs and tutorials
* new docs and tutorials
* minor formatting
* fixes for differences between AWS and Azure
* minor formatting
1 parent d90ce87 commit f3167d8


6 files changed: 926 additions and 8 deletions

docs/source/generating_cdc_data.rst

Lines changed: 13 additions & 4 deletions
@@ -6,7 +6,7 @@
 Generating Change Data Capture data
 ===================================
 
-This section explores some of the features for generating CDC style data - that is exploring the abilitty to
+This section explores some of the features for generating CDC style data - that is exploring the ability to
 generate a base data set and then apply changes such as updates to existing rows and
 new rows that will be inserts to the existing data
 
@@ -123,15 +123,24 @@ We will also generate a set of updates by sampling from the existing data and ad
     .withColumn("customer_id", F.expr(f"customer_id + {start_of_new_ids}"))
     )
 
-df1_updates = (df1.sample(False, 0.1)
+# read the written data - if we simply recompute, timestamps of original will be lost
+df_original = spark.read.format("delta").load(customers1_location)
+
+df1_updates = (df_original.sample(False, 0.1)
     .limit(50 * 1000)
     .withColumn("alias", F.lit('modified alias'))
-    .withColumn("modified_ts",F.expr('current_timestamp()'))
+    .withColumn("modified_ts",F.expr('now()'))
     .withColumn("memo", F.lit("update")))
 
-
 df_changes = df1_inserts.union(df1_updates)
 
+# randomize ordering
+df_changes = (df_changes.withColumn("order_rand", F.expr("rand()"))
+    .orderBy("order_rand")
+    .drop("order_rand")
+    )
+
+
 display(df_changes)
 
 Merging in the changes

docs/source/index.rst

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ to Scala or R based Spark applications also.
    Generating repeatable data <repeatable_data_generation>
    Using streaming data <using_streaming_data>
    Generating Change Data Capture (CDC) data<generating_cdc_data>
-   Multi table data <multi_table_data>
+   Using multiple tables <multi_table_data>
    Extending text generation <extending_text_generation>
    Troubleshooting data generation <troubleshooting>
 