chore: simplify schema creation from storage #1368

nikhilsinhaparseable · 2025-07-06T05:59:14Z

Remove the functions that create schemas from ingestors and queriers separately.
Reuse the function `fetch_schema`, which fetches all schema files and merges them into one schema.

This ensures the schema is always the latest.
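
For readers unfamiliar with the merge step, the idea of combining several per-node schema files into one can be sketched as below. This is only an illustration under stated assumptions: `Field` and `merge_schemas` are hypothetical names, the schema model is reduced to name/type string pairs, and the conflict handling shown here is an assumption rather than the behavior of the real `fetch_schema`.

```rust
use std::collections::btree_map::Entry;
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for a schema field (not Parseable's real type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Field {
    pub name: String,
    pub data_type: String,
}

/// Merge several schemas into one, keyed by field name.
/// Returns an error if the same field name appears with two different types.
/// The result is ordered by field name because a BTreeMap is used.
pub fn merge_schemas(schemas: &[Vec<Field>]) -> Result<Vec<Field>, String> {
    let mut merged: BTreeMap<String, Field> = BTreeMap::new();
    for schema in schemas {
        for field in schema {
            match merged.entry(field.name.clone()) {
                Entry::Occupied(existing) => {
                    // Same name seen before: the types must agree.
                    if existing.get().data_type != field.data_type {
                        return Err(format!(
                            "conflicting types for field `{}`: {} vs {}",
                            field.name,
                            existing.get().data_type,
                            field.data_type
                        ));
                    }
                }
                Entry::Vacant(slot) => {
                    // First time this field is seen: add it to the merged schema.
                    slot.insert(field.clone());
                }
            }
        }
    }
    Ok(merged.into_values().collect())
}
```

Because every schema file contributes its fields to the result, a field added by any node shows up in the merged schema, which is the sense in which the merged schema is "always the latest".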

Summary by CodeRabbit

Refactor
- Streamlined schema creation by consolidating multiple schema retrieval methods into a single process, now always sourcing schemas directly from storage.
- Updated internal logic to use the new unified schema retrieval approach, removing fallback mechanisms and redundant methods.
- Improved stream detection to avoid recreating streams that already exist with defined schemas, enhancing efficiency.

coderabbitai · 2025-07-06T05:59:19Z

"""

Walkthrough

The changes refactor schema creation logic by removing separate methods for creating schemas from ingestor and querier sources, consolidating them into a single method that fetches and stores schemas using a unified approach. Related function calls and helpers are updated or removed in accordance with this new schema retrieval strategy. Additionally, stream presence checks in the querier now consider schema emptiness to avoid redundant recreations.

Changes

| File(s) | Change Summary |
|---|---|
| `src/migration/mod.rs` | Removed the `fetch_or_create_schema` function and updated `migration_stream` to call `create_schema_from_storage` directly. |
| `src/parseable/mod.rs` | Updated to use `create_schema_from_storage` instead of `create_schema_from_ingestor` in schema creation logic. |
| `src/storage/object_storage.rs` | Removed `create_schema_from_querier` and `create_schema_from_ingestor`; added and implemented `create_schema_from_storage`. |
| `src/handlers/http/query.rs` | Modified stream missing detection logic to include streams with empty schema fields in the querier as missing. |

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant ObjectStorage
    participant fetch_schema

    Caller->>ObjectStorage: create_schema_from_storage(stream_name)
    ObjectStorage->>fetch_schema: fetch_schema(stream_name)
    fetch_schema-->>ObjectStorage: Schema object
    ObjectStorage->>ObjectStorage: Serialize schema, store under schema path
    ObjectStorage-->>Caller: Stored schema bytes
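
The flow in the diagram above can be sketched with a toy in-memory store. Everything here is an assumption made for illustration: `InMemoryStorage`, the `<stream>/nodes/` and `<stream>/schema` paths, and the comma-separated "schema" encoding are hypothetical and do not reflect the real `ObjectStorage` trait or its file layout. Only the order of steps (fetch and merge, serialize, store, return bytes) is taken from the diagram.

```rust
use std::collections::{BTreeSet, HashMap};

/// Toy object store: object path -> raw bytes. Hypothetical, for illustration only.
#[derive(Default)]
pub struct InMemoryStorage {
    objects: HashMap<String, Vec<u8>>,
}

impl InMemoryStorage {
    pub fn put(&mut self, path: &str, bytes: &[u8]) {
        self.objects.insert(path.to_string(), bytes.to_vec());
    }

    pub fn get(&self, path: &str) -> Option<&Vec<u8>> {
        self.objects.get(path)
    }

    /// Read every per-node schema file for the stream and merge the field names.
    /// In this sketch a "schema" is just a comma-separated list of field names.
    pub fn fetch_schema(&self, stream_name: &str) -> BTreeSet<String> {
        let prefix = format!("{stream_name}/nodes/");
        let mut fields = BTreeSet::new();
        for (path, bytes) in &self.objects {
            if path.starts_with(&prefix) {
                let text = String::from_utf8_lossy(bytes);
                fields.extend(text.split(',').filter(|f| !f.is_empty()).map(str::to_string));
            }
        }
        fields
    }

    /// Fetch the merged schema, serialize it, store it under the stream's
    /// schema path, and return the stored bytes to the caller.
    pub fn create_schema_from_storage(&mut self, stream_name: &str) -> Vec<u8> {
        let schema = self.fetch_schema(stream_name);
        let bytes = schema.into_iter().collect::<Vec<_>>().join(",").into_bytes();
        self.put(&format!("{stream_name}/schema"), &bytes);
        bytes
    }
}
```

With a single entry point like this, callers no longer need to choose between an ingestor-specific and a querier-specific path.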

Possibly related PRs

fix: dataloss due to contention at stream creation #1258: Modifies create_stream_and_schema_from_storage to use get_or_create instead of create for stream creation, addressing contention; related by affecting the same method and schema handling.

Suggested labels

for next release

Suggested reviewers

parmesant

Poem

A hop, a skip, a schema hop,
No more ingestor, no more swap!
One method now, so neat and bright,
Fetches schemas left and right.
With every change, the code grows clean,
A rabbit’s joy—so swift, so keen! 🐇
🥕✨
"""

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a4c8f78 and 42cebec.

📒 Files selected for processing (1)

src/handlers/http/query.rs (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

src/handlers/http/query.rs

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)

GitHub Check: Quest Smoke and Load Tests for Standalone deployments
GitHub Check: Quest Smoke and Load Tests for Distributed deployments
GitHub Check: Build Default aarch64-apple-darwin
GitHub Check: Build Default x86_64-pc-windows-msvc
GitHub Check: Build Default x86_64-unknown-linux-gnu
GitHub Check: Build Default aarch64-unknown-linux-gnu
GitHub Check: Build Kafka x86_64-unknown-linux-gnu
GitHub Check: Build Default x86_64-apple-darwin
GitHub Check: coverage
GitHub Check: Build Kafka aarch64-apple-darwin

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

src/handlers/http/query.rs (1)

442-446: Update the comment to match the corrected logic.

The comment is confusing and doesn't accurately describe the intended behavior. After fixing the logical error, update the comment to be clearer.

-    // fetch querier streams which have field list blank
-    // now missing streams should be list of streams which are in storage but not in querier
-    // and also have no fields in the schema
-    // this is to ensure that we do not create streams for querier which already exist in querier
+    // Find streams that need to be created/updated in the querier:
+    // 1. Streams that exist in storage but not in querier (need to be created)
+    // 2. Streams that exist in querier but have empty schema (need schema update)
+    // This avoids recreating streams that already exist with valid schemas
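
The selection rule described in the suggested comment can be sketched as follows. This is a minimal illustration, not the code in `src/handlers/http/query.rs`: `streams_to_sync` is a hypothetical name, the querier state is modeled as a map from stream name to its list of field names, and the sketch assumes that only streams present in storage are candidates.

```rust
use std::collections::{HashMap, HashSet};

/// Return the streams that should be created or refreshed in the querier:
/// 1. streams in storage that the querier does not know about, and
/// 2. streams the querier knows about but whose schema has no fields.
/// Streams that already have a non-empty schema in the querier are left alone.
pub fn streams_to_sync(
    storage_streams: &HashSet<String>,
    querier_streams: &HashMap<String, Vec<String>>,
) -> Vec<String> {
    let mut missing: Vec<String> = storage_streams
        .iter()
        .filter(|name| match querier_streams.get(*name) {
            None => true,                        // not in querier at all
            Some(fields) => fields.is_empty(),   // in querier, but empty schema
        })
        .cloned()
        .collect();
    // Sort only to make the output deterministic for this example.
    missing.sort();
    missing
}
```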

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1a55a11 and a4c8f78.

📒 Files selected for processing (4)

src/handlers/http/query.rs (1 hunks)
src/migration/mod.rs (1 hunks)
src/parseable/mod.rs (1 hunks)
src/storage/object_storage.rs (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (3)

src/parseable/mod.rs
src/migration/mod.rs
src/storage/object_storage.rs

🧰 Additional context used

🧠 Learnings (2)

📓 Common learnings

Learnt from: de-sh
PR: parseablehq/parseable#1185
File: src/handlers/http/logstream.rs:255-261
Timestamp: 2025-02-14T09:49:25.818Z
Learning: In Parseable's logstream handlers, stream existence checks must be performed for both query and standalone modes. The pattern `!PARSEABLE.streams.contains(&stream_name) && (PARSEABLE.options.mode != Mode::Query || !PARSEABLE.create_stream_and_schema_from_storage(&stream_name).await?)` ensures proper error handling in both modes.
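
The boolean pattern quoted in this learning can be read more easily in isolation. The sketch below is an assumption-laden illustration: `Mode`, `stream_is_missing`, and the closure standing in for `create_stream_and_schema_from_storage` are hypothetical simplifications, and the real call is async and fallible.

```rust
/// Hypothetical subset of server modes, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Query,
    Ingest,
    All,
}

/// A stream is treated as missing when it is not in memory and either
/// the server is not in query mode, or loading it from storage fails.
/// `load_from_storage` is only invoked in query mode (short-circuit).
pub fn stream_is_missing(
    in_memory: bool,
    mode: Mode,
    load_from_storage: impl FnOnce() -> bool,
) -> bool {
    !in_memory && (mode != Mode::Query || !load_from_storage())
}
```

In query mode the check falls back to storage before reporting the stream as missing; in the other modes an absent in-memory stream is reported as missing immediately.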

src/handlers/http/query.rs (2)

Learnt from: de-sh
PR: parseablehq/parseable#1185
File: src/handlers/http/logstream.rs:255-261
Timestamp: 2025-02-14T09:49:25.818Z
Learning: In Parseable's logstream handlers, stream existence checks must be performed for both query and standalone modes. The pattern `!PARSEABLE.streams.contains(&stream_name) && (PARSEABLE.options.mode != Mode::Query || !PARSEABLE.create_stream_and_schema_from_storage(&stream_name).await?)` ensures proper error handling in both modes.

Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/query/mod.rs:64-66
Timestamp: 2025-06-18T06:39:04.775Z
Learning: In src/query/mod.rs, QUERY_SESSION_STATE and QUERY_SESSION serve different architectural purposes: QUERY_SESSION_STATE is used for stats calculation and allows dynamic registration of individual parquet files from the staging path (files created every minute), while QUERY_SESSION is used for object store queries with the global schema provider. Session contexts with schema providers don't support registering individual tables/parquets, so both session objects are necessary for their respective use cases.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)

GitHub Check: Build Default x86_64-pc-windows-msvc
GitHub Check: Build Default x86_64-apple-darwin
GitHub Check: Build Default x86_64-unknown-linux-gnu
GitHub Check: Build Default aarch64-apple-darwin
GitHub Check: Build Default aarch64-unknown-linux-gnu
GitHub Check: Build Kafka aarch64-apple-darwin
GitHub Check: Build Kafka x86_64-unknown-linux-gnu
GitHub Check: Quest Smoke and Load Tests for Distributed deployments
GitHub Check: Quest Smoke and Load Tests for Standalone deployments
GitHub Check: coverage

coderabbitai bot previously approved these changes Jul 6, 2025

nikhilsinhaparseable dismissed coderabbitai[bot]’s stale review via 29315f2 July 8, 2025 14:08

nikhilsinhaparseable force-pushed the simplify-schema-merge branch from 8eb4934 to 29315f2 Compare July 8, 2025 14:08

coderabbitai bot previously approved these changes Jul 8, 2025

nikhilsinhaparseable added 2 commits July 16, 2025 07:46

chore: simplify schema creation from storage

8ad7eb4

Remove the functions that create schemas from ingestors and queriers separately. Reuse the function `fetch_schema`, which fetches all schema files and merges them into one schema. This ensures the schema is always the latest.

create stream for empty fields stream in querier

a4c8f78

nikhilsinhaparseable dismissed coderabbitai[bot]’s stale review via a4c8f78 July 16, 2025 14:46

nikhilsinhaparseable force-pushed the simplify-schema-merge branch from 1a55a11 to a4c8f78 Compare July 16, 2025 14:46

coderabbitai bot requested changes Jul 16, 2025

correct logic

42cebec

coderabbitai bot approved these changes Jul 16, 2025

nitisht merged commit 7ef5169 into parseablehq:main Jul 17, 2025
13 checks passed

nikhilsinhaparseable deleted the simplify-schema-merge branch July 17, 2025 06:10
