-
Notifications
You must be signed in to change notification settings - Fork 3.3k
Allow using temporary staging path in Iceberg for writing sorted files #26172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
606ea70
to
9280531
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Adds support for a configurable temporary staging directory when writing sorted files in Iceberg, improving performance on object stores.
- Introduces
temporaryStagingDirectoryEnabled
andtemporaryStagingDirectoryPath
inIcebergConfig
and exposes them as session properties. - Updates
IcebergSessionProperties
andIcebergPageSink
to select between the staging directory and default temp path per session. - Expands tests and documentation to cover the new staging directory behavior.
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergConfig.java | Add new config properties and their setters/getters |
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSessionProperties.java | Register new session properties and provide accessors |
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSink.java | Implement logic to choose staging directory or default temp path |
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSinkProvider.java | Remove unused SortingFileWriterConfig injection |
plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/*.java | Update tests to include SortingFileWriterConfig and new session props |
docs/src/main/sphinx/connector/iceberg.md | Document the new temporary staging directory properties |
Comments suppressed due to low confidence (2)
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergConfig.java:90
- [nitpick] The field name includes the 'is' prefix which duplicates the boolean getter style. Consider renaming it to
temporaryStagingDirectoryEnabled
to avoid confusion between field and getter naming.
private boolean isTemporaryStagingDirectoryEnabled;
docs/src/main/sphinx/connector/iceberg.md:220
- [nitpick] The list formatting uses
* -
which differs from surrounding items. Align these entries with the existing Sphinx list style for consistency and readability.
* - `iceberg.temporary-staging-directory-enabled`
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSink.java
Show resolved
Hide resolved
9280531
to
1c37cdf
Compare
1c37cdf
to
80d1253
Compare
@@ -87,6 +87,8 @@ public class IcebergConfig | |||
private boolean hideMaterializedViewStorageTable = true; | |||
private Optional<String> materializedViewsStorageSchema = Optional.empty(); | |||
private boolean sortedWritingEnabled = true; | |||
private boolean temporaryStagingDirectoryEnabled; | |||
private String temporaryStagingDirectoryPath = "/tmp/presto-${USER}"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
presto -> trino
operations. The `${USER}` placeholder can be used to use a different | ||
location for each user. Equivalent session property is | ||
`sorted_writing_temporary_staging_directory_path`. | ||
- `/tmp/presto-${USER}` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
presto -> trino
private static final String SORTED_WRITING_WRITER_BUFFER_SIZE = "sorted_writing_write_buffer_size"; | ||
private static final String SORTED_WRITING_WRITER_MAX_OPEN_FILES = "sorted_writing_writer_max_open_files"; | ||
private static final String SORTED_WRITING_TEMP_STAGING_DIR_ENABLED = "sorted_writing_temporary_staging_directory_enabled"; | ||
private static final String SORTED_WRITING_TEMP_STAGING_DIR_PATH = "sorted_writing_temporary_staging_directory_path"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want to expose any of these as session properties, some of these were removed from hive connector in https://github.com/trinodb/trino/pull/17390/files#diff-7dad18a4dbe9d9a536d45655875f47ffd0d8e3b5d5c6d2f3447246763841fb44L7716
.setCatalogSessionProperty("iceberg", "orc_writer_max_stripe_rows", "200") | ||
.setCatalogSessionProperty("iceberg", "parquet_writer_block_size", "20kB") | ||
.setCatalogSessionProperty("iceberg", "parquet_writer_batch_size", "200") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are not needed, sorting is per file, not per row-group
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use small value of target_max_file_size
to produce multiple files
.setCatalogSessionProperty("iceberg", "sorted_writing_temporary_staging_directory_enabled", "true") | ||
.setCatalogSessionProperty("iceberg", "sorted_writing_write_buffer_size", "2kB") | ||
.setCatalogSessionProperty("iceberg", "sorted_writing_writer_max_open_files", "5") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should be setup in config properties rather than session
@@ -566,6 +566,33 @@ public void testFileSortingWithLargerTable() | |||
} | |||
} | |||
|
|||
@Test | |||
public void testFileSortingWithLargerTableAndTempDirForBufferFiles() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
testSortedWritingTempStaging
Description
This change adds support for using temporary staging directory during write operations involving sorted tables. Writes to sorted tables will utilize this path for staging temporary files during sorting operation. When disabled, the target storage will be used for staging while writing sorted tables which can be inefficient when writing to object stores like S3.
Adding session properties for these, as well as for the properties that control the number and target size of buffer files. This is useful because files of different sizes/number can be configured per query, so they can be tuned for the type of data in a given table.
Additional context and related issues
Fixes #24376
Similar to functionality added for Hive in #3434
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: