[HUDI-9359] add TIMESTAMP_MILLIS and TIME_MILLIS to InternalSchema #13291


Open

jonvex wants to merge 5 commits into master

Conversation

@jonvex (Contributor) commented May 12, 2025

Change Logs

  • Add the new types TIMESTAMP_MILLIS and TIME_MILLIS to InternalSchema (the Avro logical types they preserve are illustrated in the sketch below).

  • Add validation so the row writer is not used when the schema contains timestamp-millis fields.

  • Testing: added the timestamp and time types to the schema used by TestAvroSchemaEvolutionUtils.testFixNullOrdering() so it validates that the logical types are preserved.
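For context, a minimal sketch (not code from this PR; field names are illustrative) of the two Avro logical types the new internal types are meant to preserve, built with the plain Avro API:

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class LogicalTypeExample {
  public static void main(String[] args) {
    // Per the Avro spec, timestamp-millis annotates a long and time-millis annotates an int.
    Schema tsMillis = LogicalTypes.timestampMillis().addToSchema(Schema.create(Schema.Type.LONG));
    Schema timeMillis = LogicalTypes.timeMillis().addToSchema(Schema.create(Schema.Type.INT));

    Schema record = SchemaBuilder.record("example").fields()
        .name("event_ts").type(tsMillis).noDefault()
        .name("event_time").type(timeMillis).noDefault()
        .endRecord();

    // Prints a record schema whose fields carry the millis logical types end to end.
    System.out.println(record.toString(true));
  }
}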

Impact

Prevents information loss when converting an Avro schema to the internal schema.

Risk level (write none, low, medium or high below)

low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label May 12, 2025
@github-actions github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:S PR with lines of changes in (10, 100] labels May 13, 2025
@jonvex jonvex marked this pull request as ready for review May 13, 2025 18:28
@hudi-bot

CI report:

Bot commands: @hudi-bot supports the following:
  • @hudi-bot run azure: re-run the last Azure build

@rkwagner left a comment

👍 LGTM, hopefully this goes in soon!

@@ -39,7 +39,9 @@ object HoodieDataTypeUtils {
    StructType.fromString(jsonSchema)

  def canUseRowWriter(schema: Schema, conf: Configuration): Boolean = {
    if (conf.getBoolean(AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE, true)) {
      if (HoodieAvroUtils.hasTimestampMillisField(schema)) {
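For reference, the guard above only applies while parquet-avro's legacy list structure is enabled, which is the default. A minimal sketch, assuming a plain Hadoop Configuration, of opting out of that setting so the timestamp-millis check is skipped:

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroWriteSupport;

public class RowWriterConfigExample {
  public static Configuration disableOldListStructure() {
    Configuration conf = new Configuration();
    // AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE ("parquet.avro.write-old-list-structure")
    // defaults to true; when set to false, the outer branch in canUseRowWriter above is not taken.
    conf.setBoolean(AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE, false);
    return conf;
  }
}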
Contributor

Do we have a UT or integration test for Spark already?

Contributor Author

I tested HoodieAvroUtils.hasTimestampMillisField but not canUseRowWriter. Honestly, I'm not sure whether I should just get rid of it. I think there is a setting in Spark that allows writing timestamp millis (see the sketch below), so maybe the check should look for millis and micros being used at the same time, and also inspect that config. But we also have a Hudi config that sets that Spark config. I'm not sure how much value all the time spent validating that would add, so maybe for now we just remove it?
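For reference, the Spark setting alluded to here is presumably spark.sql.parquet.outputTimestampType, which controls the Parquet timestamp representation Spark writes (INT96, TIMESTAMP_MICROS or TIMESTAMP_MILLIS). A minimal sketch of enabling millisecond output:

import org.apache.spark.sql.SparkSession;

public class OutputTimestampTypeExample {
  public static void main(String[] args) {
    // With TIMESTAMP_MILLIS, Spark writes Parquet timestamp columns at millisecond precision.
    SparkSession spark = SparkSession.builder()
        .master("local[1]")
        .appName("timestamp-millis-example")
        .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
        .getOrCreate();

    spark.stop();
  }
}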

Contributor

if there are both millis and micros used at the same time. And also checking that config. But we also have a hudi config that sets that config

The default timestamp precision in Spark is 6 (microseconds). Are you saying users specify an explicit precision of 3 for some columns? I'd guess most cases use either the default precision 6 or 3 throughout, so the mixed case should be very rare.

Is the patch to fix the schema evolution use case for timestamp(3)?


The fix is for this issue:
#13233

Hudi streamers always force timestamps into micros, no matter what the user specifies in the output schema when creating a new table. As you can see in the internal converter, whichever version of timestamp is used in the output schema (millis or micros), you always end up with micros.

The OR clause here makes that clear:
https://github.com/apache/hudi/pull/13291/files#diff-2d823101c425b4f9fbc444d1def5b6ebe1607bf19b532c80f5b0851cfd27a292

And it is reproducible with the script in the linked issue.
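To make the round trip concrete, a small illustrative sketch (field declarations only, not code from this PR): the user declares timestamp-millis, but because millis and micros shared one branch in the converter, the persisted schema comes back as timestamp-micros:

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class TimestampPrecisionExample {
  public static void main(String[] args) {
    // What the user declares in the streamer's output schema...
    Schema declared = LogicalTypes.timestampMillis().addToSchema(Schema.create(Schema.Type.LONG));
    // ...and what the table ends up with after Avro -> InternalSchema -> Avro, prior to this PR.
    Schema persisted = LogicalTypes.timestampMicros().addToSchema(Schema.create(Schema.Type.LONG));

    System.out.println("declared:  " + declared);   // {"type":"long","logicalType":"timestamp-millis"}
    System.out.println("persisted: " + persisted);  // {"type":"long","logicalType":"timestamp-micros"}
  }
}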

Contributor

Hmm, then I think we should support timestamp(3) for Spark, @jonvex do you think it is feasible?

@@ -62,7 +62,9 @@ enum TypeID {
  DATE(Integer.class),
  BOOLEAN(Boolean.class),
  TIME(Long.class),
  TIME_MILLIS(Integer.class),
Contributor

Should the primitive type for the logical type of time-millis also use Long type?

Contributor

Looks like Integer type is aligned with Avro. Could you add a note here in a comment?
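For reference, the Avro specification defines time-millis on int, while time-micros, timestamp-millis and timestamp-micros all annotate long, which is presumably why TIME_MILLIS carries Integer.class while TIME and the timestamp types carry Long.class. A small sketch that prints the underlying primitive of each logical type:

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class LogicalTypePrimitives {
  public static void main(String[] args) {
    Schema timeMillis = LogicalTypes.timeMillis().addToSchema(Schema.create(Schema.Type.INT));
    Schema timeMicros = LogicalTypes.timeMicros().addToSchema(Schema.create(Schema.Type.LONG));
    Schema tsMillis = LogicalTypes.timestampMillis().addToSchema(Schema.create(Schema.Type.LONG));
    Schema tsMicros = LogicalTypes.timestampMicros().addToSchema(Schema.create(Schema.Type.LONG));

    // Each logical type annotates a specific primitive per the Avro spec.
    System.out.println("time-millis      -> " + timeMillis.getType());  // INT
    System.out.println("time-micros      -> " + timeMicros.getType());  // LONG
    System.out.println("timestamp-millis -> " + tsMillis.getType());    // LONG
    System.out.println("timestamp-micros -> " + tsMicros.getType());    // LONG
  }
}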

Comment on lines +348 to +354
+    } else if (logical instanceof LogicalTypes.TimeMillis) {
+      return Types.TimeMillisType.get();
     } else if (logical instanceof LogicalTypes.TimeMicros) {
       return Types.TimeType.get();
-    } else if (logical instanceof LogicalTypes.TimestampMillis
-        || logical instanceof LogicalTypes.TimestampMicros) {
+    } else if (logical instanceof LogicalTypes.TimestampMillis) {
+      return Types.TimestampMillisType.get();
+    } else if (logical instanceof LogicalTypes.TimestampMicros) {
Contributor

Why would type handling go through the schema-evolution-on-read path here? Supposedly the schema-evolution-on-read logic should not be invoked by default. Is the logic leaking into the non-schema-on-read code path?

Labels
size:M PR with lines of changes in (100, 300]
5 participants