[HUDI-9359] add TIMESTAMP_MILLIS and TIME_MILLIS to InternalSchema #13291
base: master
Changes from all commits: 2424b58, c5b408a, d1fa7b7, 8c58594, 4885fd2
```diff
@@ -62,7 +62,9 @@ enum TypeID {
     DATE(Integer.class),
     BOOLEAN(Boolean.class),
     TIME(Long.class),
+    TIME_MILLIS(Integer.class),
     TIMESTAMP(Long.class),
+    TIMESTAMP_MILLIS(Long.class),
     DECIMAL(BigDecimal.class),
     UUID(UUID.class);
     private final String name;
```

> **Reviewer:** Should the primitive type for the logical type of time-millis also use …
>
> **Reply:** Looks like …
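As context for the value classes above: per the Avro specification, time-millis annotates an `int` while timestamp-millis annotates a `long`, which is why the two new entries carry different classes. A minimal, self-contained sketch (not Hudi's actual class) mirroring that mapping:

```java
// Sketch only: mirrors the TypeID mapping from the diff above.
// Per the Avro spec, time-millis is backed by int and
// timestamp-millis by long, hence the differing value classes.
public class TypeIdSketch {
  enum TypeID {
    DATE(Integer.class),
    BOOLEAN(Boolean.class),
    TIME(Long.class),             // time-micros  -> long
    TIME_MILLIS(Integer.class),   // time-millis  -> int
    TIMESTAMP(Long.class),        // timestamp-micros -> long
    TIMESTAMP_MILLIS(Long.class), // timestamp-millis -> long
    DECIMAL(java.math.BigDecimal.class),
    UUID(java.util.UUID.class);

    private final Class<?> valueClass;

    TypeID(Class<?> valueClass) {
      this.valueClass = valueClass;
    }

    Class<?> getValueClass() {
      return valueClass;
    }
  }

  public static void main(String[] args) {
    // Prints "Integer" then "Long"
    System.out.println(TypeID.TIME_MILLIS.getValueClass().getSimpleName());
    System.out.println(TypeID.TIMESTAMP_MILLIS.getValueClass().getSimpleName());
  }
}
```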
```diff
@@ -345,14 +345,13 @@ private static Type visitAvroPrimitiveToBuildInternalType(Schema primitive) {
     } else if (logical instanceof LogicalTypes.Date) {
       return Types.DateType.get();
-    } else if (
-        logical instanceof LogicalTypes.TimeMillis
-            || logical instanceof LogicalTypes.TimeMicros) {
+    } else if (logical instanceof LogicalTypes.TimeMillis) {
+      return Types.TimeMillisType.get();
+    } else if (logical instanceof LogicalTypes.TimeMicros) {
       return Types.TimeType.get();
-    } else if (
-        logical instanceof LogicalTypes.TimestampMillis
-            || logical instanceof LogicalTypes.TimestampMicros) {
+    } else if (logical instanceof LogicalTypes.TimestampMillis) {
+      return Types.TimestampMillisType.get();
+    } else if (logical instanceof LogicalTypes.TimestampMicros) {
       return Types.TimestampType.get();
     } else if (LogicalTypes.uuid().getName().equals(name)) {
       return Types.UUIDType.get();
```

> **Reviewer (on lines +348 to +354):** Why would type handling go through the schema evolution on read here? Supposedly the schema evolution on read logic should not be invoked by default. Is the logic being leaked to a non-schema-on-read code path?
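The effect of the change above is that the millis and micros variants no longer collapse into a single internal type. A hedged sketch of the new dispatch, using a plain string switch rather than Hudi's or Avro's real classes (all names here are illustrative):

```java
// Illustrative only: models the split branches from the diff above.
// Before the change, "time-millis" and "timestamp-millis" were folded
// into the micros-based TIME/TIMESTAMP internal types.
public class LogicalTypeDispatch {
  static String toInternalType(String avroLogicalType) {
    switch (avroLogicalType) {
      case "time-millis":
        return "TIME_MILLIS";        // previously folded into TIME
      case "time-micros":
        return "TIME";
      case "timestamp-millis":
        return "TIMESTAMP_MILLIS";   // previously folded into TIMESTAMP
      case "timestamp-micros":
        return "TIMESTAMP";
      default:
        throw new IllegalArgumentException("unexpected: " + avroLogicalType);
    }
  }

  public static void main(String[] args) {
    // Prints "TIMESTAMP_MILLIS"
    System.out.println(toInternalType("timestamp-millis"));
  }
}
```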
```diff
@@ -542,9 +541,15 @@ private static Schema visitInternalPrimitiveToBuildAvroPrimitiveType(Type.Primit
       case TIME:
         return LogicalTypes.timeMicros().addToSchema(Schema.create(Schema.Type.LONG));

+      case TIME_MILLIS:
+        return LogicalTypes.timeMillis().addToSchema(Schema.create(Schema.Type.INT));
+
       case TIMESTAMP:
         return LogicalTypes.timestampMicros().addToSchema(Schema.create(Schema.Type.LONG));

+      case TIMESTAMP_MILLIS:
+        return LogicalTypes.timestampMillis().addToSchema(Schema.create(Schema.Type.LONG));
+
       case STRING:
         return Schema.create(Schema.Type.STRING);
```
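For reference, the Avro schemas these new case branches build correspond to the following JSON fragments as defined by the Avro specification (shown here only to make the int-vs-long base types concrete):

```json
{"type": "int",  "logicalType": "time-millis"}
{"type": "long", "logicalType": "timestamp-millis"}
```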
> **Reviewer:** Do we have a UT or integration test for Spark already?
>
> **Author:** I tested HoodieAvroUtils.hasTimestampMillisField but not canUseRowWriter. To be honest, I'm not sure whether I should just get rid of it. I think there is a setting in Spark that allows using timestamp millis, so maybe it should check whether both millis and micros are used at the same time, and also check that config. But we also have a Hudi config that sets that Spark config. I don't know how much value all the time spent validating that would add, so maybe for now we just remove it?
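The real HoodieAvroUtils.hasTimestampMillisField mentioned above walks an Avro Schema; as a hedged illustration of what such a check does, here is a self-contained sketch over a toy schema tree (none of these types are Hudi or Avro APIs):

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: a toy tree stands in for an Avro Schema so the recursive
// shape of a hasTimestampMillisField-style check is self-contained.
public class TimestampMillisScan {
  static class Node {
    final String logicalType;  // set for leaf fields, null for records
    final List<Node> children; // set for records, empty for leaves

    Node(String logicalType, List<Node> children) {
      this.logicalType = logicalType;
      this.children = children;
    }

    static Node leaf(String logicalType) {
      return new Node(logicalType, List.of());
    }

    static Node record(Node... fields) {
      return new Node(null, Arrays.asList(fields));
    }
  }

  // True if any field anywhere in the tree carries timestamp-millis.
  static boolean hasTimestampMillis(Node node) {
    if ("timestamp-millis".equals(node.logicalType)) {
      return true;
    }
    for (Node child : node.children) {
      if (hasTimestampMillis(child)) {
        return true;
      }
    }
    return false;
  }
}
```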
> **Reviewer:** The default timestamp precision of Spark is 6. Are you saying users specify the timestamp precision as an explicit 3 for some columns? I guess most cases would just use either the default precision 6 or 3; the mixed case should be very rare. Is the patch to fix the schema evolution use case for timestamp(3)?
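The precision-3 vs precision-6 distinction matters because the two conversions are not symmetric: widening millis to micros is lossless, but narrowing micros to millis truncates. A small arithmetic sketch (plain Java, no Hudi or Avro APIs):

```java
// Why mixing precisions is risky: millis -> micros is a lossless
// widening, but micros -> millis drops sub-millisecond detail.
public class TimestampPrecision {
  static long millisToMicros(long millis) {
    return millis * 1_000L; // lossless
  }

  static long microsToMillis(long micros) {
    return Math.floorDiv(micros, 1_000L); // truncates sub-millisecond part
  }

  public static void main(String[] args) {
    long micros = 1_234_567L;             // 1.234567 s
    long millis = microsToMillis(micros); // 1234 ms: the 567 us are lost
    System.out.println(millis);           // prints 1234
    // Round-tripping does not recover the original value:
    System.out.println(millisToMicros(millis)); // prints 1234000, not 1234567
  }
}
```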
> **Author:** The fix is for this issue: #13233. Hudi streamers always force Timestamp into micros, no matter what the user specifies in the output schema for a new table. As you can see in the internal converter, whichever version of timestamp is used in the output schema (millis or micros), you always end up with micros. The OR clause here makes that clear:
> https://github.com/apache/hudi/pull/13291/files#diff-2d823101c425b4f9fbc444d1def5b6ebe1607bf19b532c80f5b0851cfd27a292
> It is reproducible with the script in the linked issue.
> **Reviewer:** Hmm, then I think we should support timestamp(3) for Spark. @jonvex, do you think it is feasible?