[HUDI-9359] add TIMESTAMP_MILLIS and TIME_MILLIS to InternalSchema #13291
Conversation
…ose info during conversion
👍 LGTM, hopefully this goes in soon!
@@ -39,7 +39,9 @@ object HoodieDataTypeUtils {
    StructType.fromString(jsonSchema)

  def canUseRowWriter(schema: Schema, conf: Configuration): Boolean = {
    if (conf.getBoolean(AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE, true)) {
      if (HoodieAvroUtils.hasTimestampMillisField(schema)) {
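For anyone following along: the PR adds HoodieAvroUtils.hasTimestampMillisField, and a recursive scan along these lines would be enough to detect the logical type anywhere in a nested schema. This is a hypothetical sketch using only standard Avro APIs, not the PR's actual implementation:

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class TimestampMillisScanSketch {
  // Walk records, unions, arrays and maps, checking every leaf schema for
  // the timestamp-millis logical type.
  public static boolean hasTimestampMillisField(Schema schema) {
    switch (schema.getType()) {
      case RECORD:
        return schema.getFields().stream().anyMatch(f -> hasTimestampMillisField(f.schema()));
      case UNION:
        return schema.getTypes().stream().anyMatch(s -> hasTimestampMillisField(s));
      case ARRAY:
        return hasTimestampMillisField(schema.getElementType());
      case MAP:
        return hasTimestampMillisField(schema.getValueType());
      default:
        // getLogicalType() returns null for plain primitives; instanceof handles that.
        return schema.getLogicalType() instanceof LogicalTypes.TimestampMillis;
    }
  }
}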
Do we have a UT or integration test for Spark already?
I tested HoodieAvroUtils.hasTimestampMillisField but not canUseRowWriter. To be honest, I'm not sure whether I should just get rid of it. There is a setting in Spark that allows writing timestamp millis, so maybe the check should be whether both millis and micros are used at the same time, and it should also check that config. But we also have a Hudi config that sets that Spark config. I'm not sure how much value all the time spent validating that would add, so maybe for now we just remove it?
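(For reference: the Spark setting alluded to is presumably spark.sql.parquet.outputTimestampType, and the Hudi option that forwards a Parquet timestamp type is, as far as I know, hoodie.parquet.outputtimestamptype. A hedged sketch, with both keys to be verified against the versions in use:)

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.sql.SparkSession;

public class MillisConfigSketch {
  public static void main(String[] args) {
    // Ask Spark's Parquet writer for millis-precision timestamps.
    SparkSession spark = SparkSession.builder()
        .appName("timestamp-millis-config")
        .master("local[*]")
        .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
        .getOrCreate();

    // Hudi write option believed to forward the same choice to its Parquet writer.
    Map<String, String> hudiOpts = new HashMap<>();
    hudiOpts.put("hoodie.parquet.outputtimestamptype", "TIMESTAMP_MILLIS");
    spark.stop();
  }
}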
> if there are both millis and micros used at the same time, and it should also check that config. But we also have a Hudi config that sets that config
The default timestamp precision in Spark is 6 (microseconds). Are you saying users explicitly specify precision 3 for some columns? I'd guess most cases use either the default precision 6 or precision 3 throughout; the mixed case should be very rare.
Is the patch to fix the schema evolution use case for timestamp(3)?
The fix is for this issue:
#13233
where Hudi streamers always force timestamps into micros, no matter what the user specifies in the output schema for a new table. As you can see in the internal converter, whichever timestamp variant the output schema uses (millis or micros), you always end up with micros.
The OR clause here makes that clear:
https://github.com/apache/hudi/pull/13291/files#diff-2d823101c425b4f9fbc444d1def5b6ebe1607bf19b532c80f5b0851cfd27a292
It is reproducible with the script in the linked issue.
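To make that concrete, here is a minimal sketch (plain Avro APIs, hypothetical record and field names) of a schema that declares timestamp-millis; per the linked issue, the pre-patch conversion would hand back timestamp-micros for the ts field:

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class MillisSchemaDemo {
  public static void main(String[] args) {
    // "ts" is explicitly declared as timestamp-millis in the output schema.
    Schema withMillis = SchemaBuilder.record("event").fields()
        .name("ts")
        .type(LogicalTypes.timestampMillis().addToSchema(Schema.create(Schema.Type.LONG)))
        .noDefault()
        .endRecord();

    // Prints "timestamp-millis"; before this patch, round-tripping through
    // InternalSchema reportedly turned this into timestamp-micros.
    System.out.println(withMillis.getField("ts").schema().getLogicalType().getName());
  }
}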
Hmm, then I think we should support timestamp(3) for Spark. @jonvex, do you think that is feasible?
@@ -62,7 +62,9 @@ enum TypeID {
  DATE(Integer.class),
  BOOLEAN(Boolean.class),
  TIME(Long.class),
  TIME_MILLIS(Integer.class),
Should the primitive type for the time-millis logical type also be Long?
Looks like the Integer type is aligned with Avro. Could you add a note here in a comment?
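For the record, the Avro spec fixes these mappings, which is presumably why Integer was chosen; a small sketch showing how the logical types annotate their base types:

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class AvroTimeMappings {
  public static void main(String[] args) {
    // Per the Avro spec, time-millis annotates int (milliseconds after
    // midnight); time-micros and both timestamp logical types annotate long.
    Schema timeMillis = LogicalTypes.timeMillis().addToSchema(Schema.create(Schema.Type.INT));
    Schema timeMicros = LogicalTypes.timeMicros().addToSchema(Schema.create(Schema.Type.LONG));
    Schema tsMillis = LogicalTypes.timestampMillis().addToSchema(Schema.create(Schema.Type.LONG));
    Schema tsMicros = LogicalTypes.timestampMicros().addToSchema(Schema.create(Schema.Type.LONG));
    System.out.println(timeMillis + " " + timeMicros + " " + tsMillis + " " + tsMicros);
  }
}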
  } else if (logical instanceof LogicalTypes.TimeMillis) {
    return Types.TimeMillisType.get();
  } else if (logical instanceof LogicalTypes.TimeMicros) {
    return Types.TimeType.get();
- } else if (
-     logical instanceof LogicalTypes.TimestampMillis
-         || logical instanceof LogicalTypes.TimestampMicros) {
+ } else if (logical instanceof LogicalTypes.TimestampMillis) {
+   return Types.TimestampMillisType.get();
+ } else if (logical instanceof LogicalTypes.TimestampMicros) {
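Condensing the diff above into one place, the post-patch dispatch looks roughly like this. It is an illustration, not the full converter: Type and Types are Hudi's internal-schema classes from the diff, and the TimestampMicros return is assumed to stay Types.TimestampType, matching the old combined branch:

// Hypothetical condensed view of the logical-type dispatch after this patch.
static Type fromLogicalType(LogicalType logical) {
  if (logical instanceof LogicalTypes.TimeMillis) {
    return Types.TimeMillisType.get();       // new in this PR
  } else if (logical instanceof LogicalTypes.TimeMicros) {
    return Types.TimeType.get();
  } else if (logical instanceof LogicalTypes.TimestampMillis) {
    return Types.TimestampMillisType.get();  // new in this PR
  } else if (logical instanceof LogicalTypes.TimestampMicros) {
    return Types.TimestampType.get();        // assumed: unchanged behavior for micros
  }
  throw new IllegalArgumentException("Unsupported logical type: " + logical);
}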
Why would type handling go through the schema evolution on read path here? Supposedly the schema-on-read evolution logic should not be invoked by default. Is that logic leaking into the non-schema-on-read code path?
Change Logs
- Add new types TIMESTAMP_MILLIS and TIME_MILLIS to InternalSchema.
- Add validation so the row writer is not used when the schema contains timestamp-millis fields.
- Testing: added the timestamp and time types to the schema used by TestAvroSchemaEvolutionUtils.testFixNullOrdering() so it validates that the logical types are preserved.

Impact

Prevents information loss when converting an Avro schema to an InternalSchema.
Risk level (write none, low, medium or high below)
low
Documentation Update
N/A
Contributor's checklist