[HUDI-9359] add TIMESTAMP_MILLIS and TIME_MILLIS to InternalSchema #13291


Open

jonvex wants to merge 5 commits into master

Conversation

@jonvex (Contributor) commented May 12, 2025

Change Logs

  • Add the new types TIMESTAMP_MILLIS and TIME_MILLIS to InternalSchema (the Avro logical types they preserve are illustrated in the sketch below).

  • Add validation so the row writer is not used when the schema contains timestamp-millis fields.

  • Testing: added the timestamp and time types to the schema used by TestAvroSchemaEvolutionUtils.testFixNullOrdering() so it validates that the logical types are preserved.
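For context, a minimal sketch (not code from this PR; field names are illustrative) of the two Avro logical types the new internal types are meant to preserve, built with the plain Avro API:

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class LogicalTypeExample {
  public static void main(String[] args) {
    // Per the Avro spec, timestamp-millis annotates a long and time-millis annotates an int.
    Schema tsMillis = LogicalTypes.timestampMillis().addToSchema(Schema.create(Schema.Type.LONG));
    Schema timeMillis = LogicalTypes.timeMillis().addToSchema(Schema.create(Schema.Type.INT));

    Schema record = SchemaBuilder.record("example").fields()
        .name("event_ts").type(tsMillis).noDefault()
        .name("event_time").type(timeMillis).noDefault()
        .endRecord();

    // Prints a record schema whose fields carry the millis logical types end to end.
    System.out.println(record.toString(true));
  }
}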

Impact

Prevents information loss when converting an Avro schema to the internal schema.

Risk level (write none, low, medium or high below)

low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label May 12, 2025
@github-actions github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:S PR with lines of changes in (10, 100] labels May 13, 2025
@jonvex jonvex marked this pull request as ready for review May 13, 2025 18:28
@hudi-bot

CI report:

Bot commands: @hudi-bot supports the following:
  • @hudi-bot run azure: re-run the last Azure build

@rkwagner left a comment

👍 LGTM, hopefully this goes in soon!

@@ -39,7 +39,9 @@ object HoodieDataTypeUtils {
    StructType.fromString(jsonSchema)

  def canUseRowWriter(schema: Schema, conf: Configuration): Boolean = {
    if (conf.getBoolean(AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE, true)) {
      if (HoodieAvroUtils.hasTimestampMillisField(schema)) {
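For reference, the guard above only applies while parquet-avro's legacy list structure is enabled, which is the default. A minimal sketch, assuming a plain Hadoop Configuration, of opting out of that setting so the timestamp-millis check is skipped:

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroWriteSupport;

public class RowWriterConfigExample {
  public static Configuration disableOldListStructure() {
    Configuration conf = new Configuration();
    // AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE ("parquet.avro.write-old-list-structure")
    // defaults to true; when set to false, the outer branch in canUseRowWriter above is not taken.
    conf.setBoolean(AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE, false);
    return conf;
  }
}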
Contributor

Do we have a UT or integration test for Spark already?

Contributor Author

I tested HoodieAvroUtils.hasTimestampMillisField but not canUseRowWriter. Honestly, I'm not sure whether I should just get rid of it. I think there is a setting in Spark that allows writing timestamp millis (see the sketch below), so maybe the check should look for millis and micros being used at the same time, and also inspect that config. But we also have a Hudi config that sets that Spark config. I'm not sure how much value all the time spent validating that would add, so maybe for now we just remove it?
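For reference, the Spark setting alluded to here is presumably spark.sql.parquet.outputTimestampType, which controls the Parquet timestamp representation Spark writes (INT96, TIMESTAMP_MICROS or TIMESTAMP_MILLIS). A minimal sketch of enabling millisecond output:

import org.apache.spark.sql.SparkSession;

public class OutputTimestampTypeExample {
  public static void main(String[] args) {
    // With TIMESTAMP_MILLIS, Spark writes Parquet timestamp columns at millisecond precision.
    SparkSession spark = SparkSession.builder()
        .master("local[1]")
        .appName("timestamp-millis-example")
        .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
        .getOrCreate();

    spark.stop();
  }
}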

Contributor

if there are both millis and micros used at the same time. And also checking that config. But we also have a hudi config that sets that config

The default timestamp precision in Spark is 6 (microseconds). Are you saying users specify an explicit precision of 3 for some columns? I'd guess most cases use either the default precision 6 or 3 throughout, so the mixed case should be very rare.

Is the patch to fix the schema evolution use case for timestamp(3)?


The fix is for this issue:
#13233

Hudi streamers always force timestamps into micros, no matter what the user specifies in the output schema when creating a new table. As you can see in the internal converter, whichever version of timestamp is used in the output schema (millis or micros), you always end up with micros.

The OR clause here makes that clear:
https://github.com/apache/hudi/pull/13291/files#diff-2d823101c425b4f9fbc444d1def5b6ebe1607bf19b532c80f5b0851cfd27a292

And it is reproducible with the script in the linked issue.
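To make the round trip concrete, a small illustrative sketch (field declarations only, not code from this PR): the user declares timestamp-millis, but because millis and micros shared one branch in the converter, the persisted schema comes back as timestamp-micros:

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class TimestampPrecisionExample {
  public static void main(String[] args) {
    // What the user declares in the streamer's output schema...
    Schema declared = LogicalTypes.timestampMillis().addToSchema(Schema.create(Schema.Type.LONG));
    // ...and what the table ends up with after Avro -> InternalSchema -> Avro, prior to this PR.
    Schema persisted = LogicalTypes.timestampMicros().addToSchema(Schema.create(Schema.Type.LONG));

    System.out.println("declared:  " + declared);   // {"type":"long","logicalType":"timestamp-millis"}
    System.out.println("persisted: " + persisted);  // {"type":"long","logicalType":"timestamp-micros"}
  }
}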

Contributor

Hmm, then I think we should support timestamp(3) for Spark, @jonvex do you think it is feasible?

@@ -62,7 +62,9 @@ enum TypeID {
  DATE(Integer.class),
  BOOLEAN(Boolean.class),
  TIME(Long.class),
  TIME_MILLIS(Integer.class),
Contributor

Should the primitive type for the logical type of time-millis also use Long type?

Contributor

Looks like Integer type is aligned with Avro. Could you add a note here in a comment?
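For reference, the Avro specification defines time-millis on int, while time-micros, timestamp-millis and timestamp-micros all annotate long, which is presumably why TIME_MILLIS carries Integer.class while TIME and the timestamp types carry Long.class. A small sketch that prints the underlying primitive of each logical type:

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class LogicalTypePrimitives {
  public static void main(String[] args) {
    Schema timeMillis = LogicalTypes.timeMillis().addToSchema(Schema.create(Schema.Type.INT));
    Schema timeMicros = LogicalTypes.timeMicros().addToSchema(Schema.create(Schema.Type.LONG));
    Schema tsMillis = LogicalTypes.timestampMillis().addToSchema(Schema.create(Schema.Type.LONG));
    Schema tsMicros = LogicalTypes.timestampMicros().addToSchema(Schema.create(Schema.Type.LONG));

    // Each logical type annotates a specific primitive per the Avro spec.
    System.out.println("time-millis      -> " + timeMillis.getType());  // INT
    System.out.println("time-micros      -> " + timeMicros.getType());  // LONG
    System.out.println("timestamp-millis -> " + tsMillis.getType());    // LONG
    System.out.println("timestamp-micros -> " + tsMicros.getType());    // LONG
  }
}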

Comment on lines +348 to +354
+    } else if (logical instanceof LogicalTypes.TimeMillis) {
+      return Types.TimeMillisType.get();
     } else if (logical instanceof LogicalTypes.TimeMicros) {
       return Types.TimeType.get();
-    } else if (logical instanceof LogicalTypes.TimestampMillis
-        || logical instanceof LogicalTypes.TimestampMicros) {
+    } else if (logical instanceof LogicalTypes.TimestampMillis) {
+      return Types.TimestampMillisType.get();
+    } else if (logical instanceof LogicalTypes.TimestampMicros) {
Contributor

Why would type handling go through the schema-evolution-on-read path here? Supposedly the schema-evolution-on-read logic should not be invoked by default. Is the logic leaking into the non-schema-on-read code path?

Labels
size:M PR with lines of changes in (100, 300]
5 participants