CFR: Assorted fixes and improvements #483
Conversation
Walkthrough

The changes update the changelog with new CFR subsystem entries and refactor the CFR system table exporter and importer.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Exporter as SystemTableExporter
    participant Importer as SystemTableImporter
    participant FileSystem
    participant DB as Database
    User->>Exporter: save()
    Exporter->>Exporter: _save(path_schema, path_data)
    Exporter->>FileSystem: Write schema/data files
    Exporter->>User: Return export path
    User->>Importer: load()
    Importer->>Importer: _load(path_schema, path_data)
    Importer->>FileSystem: Read schema/data files (using orjsonl/pandas)
    Importer->>DB: Execute DELETE on target table (truncate)
    Importer->>DB: Bulk insert records via pandas.to_sql
    Importer->>User: Complete
```
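For orientation, a hedged usage sketch of the export/import flow shown in the diagram; the constructor arguments (`dburi`, `target`, `source`) are assumptions made for illustration, not the verified API.

```python
# Sketch only: the constructor arguments below are assumed, adjust to the real API.
from cratedb_toolkit.cfr.systable import SystemTableExporter, SystemTableImporter

DBURI = "crate://localhost:4200"

# Export: write schema and data files, return the export path.
exporter = SystemTableExporter(dburi=DBURI, target="./cfr-export")  # args assumed
export_path = exporter.save()

# Import: truncate each target table, then bulk-insert the records.
importer = SystemTableImporter(dburi=DBURI, source=export_path)  # args assumed
importer.load()
```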
Actionable comments posted: 1
🧹 Nitpick comments (1)
cratedb_toolkit/cfr/systable.py (1)
254-255: Import policy change to "replace" is implemented correctly.

The switch to pandas `to_sql` with `if_exists="replace"` and `method=insert_bulk` provides better performance through bulk insertion. Note that this is a breaking change from the previous append behavior.

Consider documenting this breaking change prominently in the release notes or migration guide, as users relying on append behavior will need to adjust their workflows.
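For reference, a minimal sketch of the call pattern described above, under the assumption that `insert_bulk` is importable from `sqlalchemy_cratedb.support`; the engine URL, DataFrame, and table name are placeholders.

```python
import pandas as pd
import sqlalchemy as sa
from sqlalchemy_cratedb.support import insert_bulk  # import path assumed

engine = sa.create_engine("crate://localhost:4200")
df = pd.DataFrame([{"id": 1, "name": "foo"}, {"id": 2, "name": "bar"}])

# `if_exists="replace"` drops and re-creates the target table before inserting,
# which is the breaking difference from the previous "append" policy;
# `method=insert_bulk` routes rows through CrateDB's bulk insert mechanism.
df.to_sql("cfr_example", engine, if_exists="replace", index=False, method=insert_bulk)
```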
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- CHANGES.md (1 hunks)
- cratedb_toolkit/cfr/systable.py (5 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: amotl
PR: crate/cratedb-toolkit#81
File: cratedb_toolkit/api/main.py:0-0
Timestamp: 2025-04-27T17:47:50.261Z
Learning: When suggesting improvements or refactorings that aren't directly related to the current PR, create a GitHub issue instead of suggesting a direct code change. This applies especially to improvements related to error handling, performance optimizations, or restructuring that aren't critical to the current PR.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
- GitHub Check: Generic: Python 3.13 on OS ubuntu-latest
- GitHub Check: Generic: Python 3.9 on OS ubuntu-latest
- GitHub Check: Kinesis: Python 3.13 on OS ubuntu-latest
- GitHub Check: PyMongo: Python 3.9 on OS ubuntu-latest
- GitHub Check: Kinesis: Python 3.10 on OS ubuntu-latest
- GitHub Check: CFR: Python 3.13 on OS ubuntu-latest
- GitHub Check: PyMongo: Python 3.12 on OS ubuntu-latest
- GitHub Check: CFR: Python 3.9 on OS ubuntu-latest
- GitHub Check: Cloud: Python 3.13 on OS ubuntu-latest
- GitHub Check: build-and-test
- GitHub Check: CFR for OS windows-latest
- GitHub Check: CFR for OS macos-13
- GitHub Check: CFR for OS macos-latest
- GitHub Check: CFR for OS ubuntu-latest
🔇 Additional comments (5)
CHANGES.md (1)
5-8: LGTM! Changelog entries accurately document the CFR improvements.

The entries properly describe the log output improvements, the table name quoting fix, the import policy change from append to replace, and the data importing enhancements.
cratedb_toolkit/cfr/systable.py (4)
26-30: LGTM! Import additions support the refactoring objectives.

The new imports for `orjsonl`, `pandas`, `insert_bulk`, and `logging_redirect_tqdm` align well with the PR's goals of improving log output and data importing without NumPy dependencies.
166-168: Good refactoring for improved log output handling.

Wrapping the export logic with `logging_redirect_tqdm()` properly handles the interaction between logging and tqdm progress bars, preventing display issues (a sketch of this pattern follows the comment list below).
180-180: Table name construction correctly simplified.

Removing the `adapter.quote_relation_name` call and directly prefixing the table name resolves the double quoting issue mentioned in the changelog.
231-232: Consistent refactoring pattern for import logging.

The same `logging_redirect_tqdm()` pattern ensures proper log handling during import operations.
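For illustration, a self-contained sketch of the `logging_redirect_tqdm()` pattern referenced in the comments above; the table list and loop body are placeholders, not the exporter's actual code.

```python
import logging
import time

from tqdm import tqdm
from tqdm.contrib.logging import logging_redirect_tqdm

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def export_all(tables):
    # Route log records through tqdm.write() for the duration of the loop,
    # so log lines do not corrupt the progress bar rendering.
    with logging_redirect_tqdm():
        for table in tqdm(tables, desc="Exporting system tables"):
            logger.info("Exporting table: %s", table)
            time.sleep(0.05)  # placeholder for the real export work


export_all(["sys.jobs", "sys.shards", "sys.snapshots"])
```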
0a92f3e to 13d2739 (compare)
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- CHANGES.md (1 hunks)
- cratedb_toolkit/cfr/systable.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- CHANGES.md
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: amotl
PR: crate/cratedb-toolkit#81
File: cratedb_toolkit/api/main.py:0-0
Timestamp: 2025-04-27T17:47:50.261Z
Learning: When suggesting improvements or refactorings that aren't directly related to the current PR, create a GitHub issue instead of suggesting a direct code change. This applies especially to improvements related to error handling, performance optimizations, or restructuring that aren't critical to the current PR.
Learnt from: amotl
PR: crate/cratedb-toolkit#81
File: cratedb_toolkit/api/main.py:7-7
Timestamp: 2025-04-27T02:46:46.795Z
Learning: Test coverage for the `@lru_cache` decorator on the `get_client_bundle` method in `cratedb_toolkit/api/main.py` is being tracked in issue GH-423 for a future iteration and should not be raised during PR reviews.
🪛 GitHub Check: codecov/patch
cratedb_toolkit/cfr/systable.py
[warning] 276-276: cratedb_toolkit/cfr/systable.py#L276
Added line #L276 was not covered by tests
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: CFR for OS macos-13
- GitHub Check: CFR for OS ubuntu-latest
- GitHub Check: Kinesis: Python 3.13 on OS ubuntu-latest
- GitHub Check: CFR for OS windows-latest
- GitHub Check: CFR for OS macos-latest
- GitHub Check: Generic: Python 3.13 on OS ubuntu-latest
- GitHub Check: Cloud: Python 3.13 on OS ubuntu-latest
- GitHub Check: build-and-test
- GitHub Check: Python 3.13, LocalStack 4.5, OS ubuntu-latest
🔇 Additional comments (7)
cratedb_toolkit/cfr/systable.py (7)
26-30: New dependencies added for improved functionality.

The addition of the `orjsonl`, `pandas`, `insert_bulk`, and `logging_redirect_tqdm` imports aligns with the refactoring to improve data handling and logging integration for the CFR subsystem.
166-169: Good refactoring to separate concerns and improve logging.

The refactoring of the `save` method to delegate to `_save` within a `logging_redirect_tqdm()` context is a clean separation of concerns that improves the integration of progress bars with logging output.
170-199: Main export logic properly extracted to private method.

The extraction of the main export logic into `_save` maintains the original functionality while enabling the logging integration. The logic flow remains intact.
180-181: Simplified table name handling.

The direct string formatting approach for the output table name is cleaner than the previous adapter-based quoting. This change aligns with the AI summary's mention of fixing double quoting issues.
231-233: Consistent refactoring pattern for import functionality.

The `load` method refactoring mirrors the `save` method approach, providing consistent logging integration across both export and import operations.
252-254: Table truncation strategy implemented correctly.

The explicit table truncation using `DELETE` before data import implements the "replace" strategy mentioned in the AI summary. This ensures clean data import without DDL complexities (a sketch of this step follows the comment list below).
257-264: Improved data import with pandas and bulk insertion.

The switch from Polars to pandas with `insert_bulk` should provide better type mapping and more efficient bulk insertion as mentioned in the AI summary.
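For illustration, a hedged sketch of the truncation step described above, using a plain SQLAlchemy engine; the table name is a placeholder, the real importer derives it from the export files.

```python
import sqlalchemy as sa

engine = sa.create_engine("crate://localhost:4200")

with engine.connect() as connection:
    # Clear all rows from the target table before the bulk insert; the table
    # itself (and its schema) is left in place, so no DDL work is required.
    connection.execute(sa.text('DELETE FROM "cfr_example"'))
    connection.commit()
```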
```diff
     def load_table(self, path: Path) -> t.List:
         import polars as pl

         if path.suffix in [".jsonl"]:
-            return pl.read_ndjson(path)
+            return orjsonl.load(path)
         elif path.suffix in [".parquet", ".pq"]:
-            return pl.read_parquet(path)
+            return pl.read_parquet(path).to_pandas().to_dict("records")
```
💡 Verification agent
❓ Verification inconclusive
Verify parquet loading path is tested.
The `load_table` method correctly switches from returning a DataFrame to returning a list of records, using `orjsonl.load` for JSONL files and pandas records for Parquet files.

The static analysis indicates line 276 (the Parquet loading path) lacks test coverage. Consider adding tests for the Parquet loading functionality:
🏁 Script executed:

```bash
#!/bin/bash
# Description: Check test coverage for parquet loading in SystemTableImporter
# Expected: Find tests that exercise the parquet loading path

# Search for tests related to parquet loading in SystemTableImporter
rg -A 10 -B 5 "parquet.*SystemTableImporter|SystemTableImporter.*parquet" tests/
```

Length of output: 82
Add tests for the parquet-loading path in `load_table`

The `load_table` method correctly handles both JSONL and Parquet files, but we have no evidence of tests covering the Parquet branch. Please add unit tests that:

- Exercise both `.parquet` and `.pq` suffixes.
- Use Polars (or another library) to write a small in-memory DataFrame to a temporary Parquet file.
- Call `load_table(path)` and assert the returned list of records matches the original data.
- Live under your existing test hierarchy (e.g. `tests/`), perhaps as `tests/test_systable_parquet.py`.

This will ensure future changes to the Parquet-loading logic are caught by CI.
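A minimal sketch of such a test, assuming `SystemTableImporter` can be constructed with the arguments shown (they are placeholders; adapt to the real signature):

```python
from pathlib import Path

import polars as pl
import pytest

from cratedb_toolkit.cfr.systable import SystemTableImporter


@pytest.mark.parametrize("suffix", [".parquet", ".pq"])
def test_load_table_parquet(tmp_path: Path, suffix: str):
    # Write a small DataFrame to a temporary Parquet file.
    records = [{"id": 1, "name": "foo"}, {"id": 2, "name": "bar"}]
    path = tmp_path / f"sys-jobs{suffix}"
    pl.DataFrame(records).write_parquet(path)

    # Constructor arguments are assumed for illustration only.
    importer = SystemTableImporter(dburi="crate://localhost:4200", source=tmp_path)

    # The Parquet branch should return a plain list of dict records.
    assert importer.load_table(path) == records
```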
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 276-276: cratedb_toolkit/cfr/systable.py#L276
Added line #L276 was not covered by tests
🤖 Prompt for AI Agents
In cratedb_toolkit/cfr/systable.py around lines 270 to 276, the load_table
method handles Parquet files but lacks test coverage for this branch. To fix
this, create a new test file under the tests/ directory, for example
tests/test_systable_parquet.py, and write unit tests that generate small
in-memory DataFrames using Polars or a similar library, save them to temporary
files with .parquet and .pq suffixes, then call load_table with these paths and
assert that the returned list of records matches the original DataFrame data.
This will ensure the Parquet-loading logic is properly tested and covered by CI.
@coderabbitai: You are right, this code path is not exercised using software tests just yet. However, we will not dedicate cycles to that within this iteration.
@amotl Understood! I acknowledge your decision to defer the test coverage for the parquet loading path to a future iteration. This aligns with your project's approach of focusing on core functionality first.
Would you like me to create a GitHub issue to track adding test coverage for the parquet loading path in SystemTableImporter.load_table()
for future work? This would help ensure it doesn't get lost and can be addressed when you have cycles to dedicate to more comprehensive testing.
(\ /)
( ._. )
o_(")(") *understanding nod*
🧠 Learnings used
Learnt from: amotl
PR: crate/cratedb-toolkit#81
File: cratedb_toolkit/api/main.py:7-7
Timestamp: 2025-04-27T02:46:46.795Z
Learning: Test coverage for the `@lru_cache` decorator on the `get_client_bundle` method in `cratedb_toolkit/api/main.py` is being tracked in issue GH-423 for a future iteration and should not be raised during PR reviews.
Learnt from: amotl
PR: crate/cratedb-toolkit#421
File: cratedb_toolkit/settings/compare.py:35-45
Timestamp: 2025-04-25T18:40:39.517Z
Learning: For cratedb-toolkit, amotl prefers starting with minimal "run-to-completion" tests that verify basic functionality before adding more thorough test cases.
Learnt from: amotl
PR: crate/cratedb-toolkit#421
File: cratedb_toolkit/settings/compare.py:412-447
Timestamp: 2025-04-25T18:36:10.508Z
Learning: The cratedb-toolkit project prefers blackbox "run-to-completion" testing for CLI tools rather than detailed unit tests for individual utility functions. Integration tests that verify the end-to-end functionality are prioritized over testing individual components in isolation.
Learnt from: amotl
PR: crate/cratedb-toolkit#400
File: cratedb_toolkit/docs/settings.py:0-0
Timestamp: 2025-04-13T14:43:29.858Z
Learning: For the cratedb-toolkit project, functional validation through CLI commands is preferred as the initial testing approach, with more granular unit tests being deferred to subsequent iterations when necessary.
Learnt from: amotl
PR: crate/cratedb-toolkit#421
File: tests/docs/test_functions.py:9-24
Timestamp: 2025-04-25T17:59:17.303Z
Learning: In the cratedb-toolkit project, explicit and separate test cases are preferred over parameterized tests when testing similar functionality with different outputs to maintain code clarity.
Learnt from: amotl
PR: crate/cratedb-toolkit#421
File: cratedb_toolkit/settings/compare.py:129-149
Timestamp: 2025-04-25T18:40:26.508Z
Learning: In the cratedb-toolkit project, comprehensive unit tests for utility functions like `normalize_value` in the settings comparison module are intentionally skipped as per the project's testing strategy.
Learnt from: amotl
PR: crate/cratedb-toolkit#421
File: cratedb_toolkit/settings/compare.py:152-183
Timestamp: 2025-04-25T18:39:11.724Z
Learning: The project is currently favoring minimal "run-to-completion" tests over comprehensive unit tests, especially for new functionality like the settings comparison utility.
Learnt from: amotl
PR: crate/cratedb-toolkit#81
File: cratedb_toolkit/cfr/marimo.py:53-54
Timestamp: 2025-05-02T11:43:42.639Z
Learning: The submodules within `cratedb_toolkit/io` employ a different naming scheme than other parts of the codebase and retain their own conventions (e.g., may continue using `cratedb_sqlalchemy_url` instead of `sqlalchemy_url`).
Learnt from: amotl
PR: crate/cratedb-toolkit#81
File: cratedb_toolkit/api/main.py:0-0
Timestamp: 2025-04-27T17:47:50.261Z
Learning: When suggesting improvements or refactorings that aren't directly related to the current PR, create a GitHub issue instead of suggesting a direct code change. This applies especially to improvements related to error handling, performance optimizations, or restructuring that aren't critical to the current PR.
pyproject.toml (Outdated)

```diff
-    "sqlalchemy-cratedb>=0.41",
+    "sqlalchemy-cratedb>=0.42.0.dev2",
```
This needs a release, or it should be reverted. Otherwise, downstream users who use `uv` will need to add `--prerelease=allow` to install the `cratedb-toolkit` package.
Using orjson instead of Polars does not introduce NumPy types.
Also, use `append` strategy again, because `replace` doesn't do the right DDL.
About
The CFR subsystem feels like it never worked accurately, at least not `ctk cfr sys-import`. Well, let's fix forward.

Preview

uv pip install --upgrade 'cratedb-toolkit[cfr] @ git+https://github.com/crate/cratedb-toolkit.git@ctk-cfr-fixes'
References
- InvalidRelationName[Relation name ""sys-snapshots"" is invalid.] #481
- Importing table failed: XXX. Reason: Type is not JSON serializable: numpy.ndarray #482