Introduce CHOMPs for multi-step pipelines. #119 #120
Conversation
Hello @sambhavnoobcoder, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team, gemini-code-assist here to provide a summary of this pull request. This PR, titled "Introduce CHOMPs for multi-step pipelines. #119", aims to complete the core features for Chonkie's multi-step pipeline system, referred to as CHOMP. Building upon existing basic components, it introduces advanced document processing capabilities for various formats (PDF, HTML, Markdown, JSON, CSV), implements infrastructure for parallel execution using threading and multiprocessing, adds comprehensive logging and performance monitoring features, and establishes a command-line interface for processing files and managing pipeline configurations. The changes span multiple new files covering these areas, along with corresponding tests and demo scripts.
Highlights
- Multi-step Pipeline (CHOMP): Introduces the core framework for building multi-step text processing pipelines, allowing chaining of 'Chefs' (preprocessing), a 'Chunker', 'Refineries' (post-processing), and 'Porters' (exporting); a sketch of such a chain follows this list.
- Advanced Document Processing Chefs: Adds new 'Chefs' for handling specific document types: PDFCleanerChef (extracts text, metadata, and page numbers from PDFs; requires PyMuPDF), HTMLCleanerChef (converts HTML to text; requires BeautifulSoup), MarkdownCleanerChef (converts Markdown to text), JSONCleanerChef (processes JSON; flattens and extracts fields), and CSVCleanerChef (processes CSV; extracts and joins columns).
- Parallel Processing: Implements a ParallelProcessor utility using concurrent.futures to enable parallel execution of tasks within the pipeline, supporting both process-based and thread-based parallelism.
- Logging and Performance Monitoring: Introduces ChonkieLogger for centralized logging with configurable levels and destinations (console/file), including built-in timers and metrics collection for performance monitoring.
- Configuration System: Adds functionality to serialize and deserialize pipeline configurations to/from JSON and Pickle formats, enabling saving, loading, and sharing of pipeline setups.
- Command-Line Interface (CLI): Provides a command-line tool (chonkie-cli.py) with 'chunk' and 'config' subcommands for processing files/directories with customizable pipelines (defined via args or config files) and generating/saving configurations.
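As a rough orientation, here is a minimal sketch of such a chain. The module paths, method names (add_chef, set_chunker, build), and constructor arguments are assumptions inferred from the changelog and tests below, not the PR's confirmed API:

```python
# Illustrative only: Chomp's construction API is defined in this PR
# (see tests/test_chomp.py); method and argument names are assumptions.
from chonkie import RecursiveChunker
from chonkie.chefs import TextCleanerChef
from chonkie.chomp import Chomp

pipeline = (
    Chomp()
    .add_chef(TextCleanerChef())        # preprocessing ('Chef') step
    .set_chunker(RecursiveChunker())    # core chunking step
    .build()
)
chunks = pipeline("Some   raw text\nto clean and chunk.")
```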
Changelog
- chonkie-cli.py
  - New file: Entry point for the Chonkie command-line interface.
  - Adds project root to Python path.
- demo_chefs.py
  - New file: Demonstrates usage of various 'Chefs' (Text, HTML, JSON, CSV, Markdown) and a basic CHOMP pipeline.
  - Includes examples for processing different content types.
- demo_config.py
  - New file: Demonstrates saving and loading CHOMP pipeline configurations using JSON.
  - Shows how to create and use pipeline 'recipes' from configurations.
- src/chonkie/chefs/__init__.py
  - Adds imports and __all__ list for new chef classes: PDFCleanerChef, CSVCleanerChef, JSONCleanerChef, HTMLCleanerChef, MarkdownCleanerChef.
- src/chonkie/chefs/base.py
  - New file: Defines the abstract base class BaseChef.
  - Includes abstract methods is_available and preprocess.
- src/chonkie/chefs/document.py
  - New file: Implements PDFCleanerChef for extracting text from PDF files.
  - Includes options for extracting metadata, page numbers, and handling tables.
  - Adds dependency check and lazy import for PyMuPDF (fitz).
- src/chonkie/chefs/structured.py
  - New file: Implements JSONCleanerChef for processing JSON data.
  - Includes options for flattening, extracting specific fields, and joining text values.
  - Implements CSVCleanerChef for processing CSV data.
  - Includes options for custom delimiter, extracting columns (by name or index), handling headers, skipping lines, and joining columns.
- src/chonkie/chefs/text.py
  - New file: Implements TextCleanerChef for basic text cleaning (whitespace, strip, lowercase, remove patterns).
  - Implements HTMLCleanerChef for converting HTML to text, with options for stripping tags and preserving line breaks (requires BeautifulSoup).
  - Implements MarkdownCleanerChef for converting Markdown to text, with options for stripping formatting and preserving headings/links (requires markdown).
- src/chonkie/chomp/config.py
  - New file: Implements ChompConfig class for pipeline serialization/deserialization.
  - Provides static methods to serialize and deserialize pipeline components (chefs, chunker, refineries, porter, handshake).
  - Includes logic for handling special types like RecursiveRules and RecursiveLevel.
  - Adds static methods save_json, load_json, save_pickle, and load_pickle for writing and reading configurations to/from files.
- src/chonkie/chomp/parallel.py
  - New file: Implements ParallelProcessor class for parallel execution.
  - Uses concurrent.futures for thread or process pools.
  - Provides map and batch_process methods for parallel task execution.
- src/chonkie/cli/__init__.py
  - New file: Initializes the CLI package.
  - Imports the main function from commands.
- src/chonkie/cli/commands.py
  - New file: Implements the CLI command logic using argparse.
  - Defines 'chunk' command for processing files/directories with various options (chunker, size, format, parallel, overlap, log, verbose).
  - Defines 'config' command for creating and saving pipeline configurations.
  - Includes helper functions for format detection and pipeline creation from arguments.
- src/chonkie/utils/logger.py
  - New file: Implements ChonkieLogger for structured logging.
  - Supports different log levels, console and file output, and custom formats.
  - Includes methods for performance timing (start_timer, end_timer) and logging pipeline steps; a hedged usage sketch follows this list.
- src/chonkie/utils/visualizer.py
  - New file: Implements Visualizer for visualizing chunks and pipelines.
  - Provides methods to output visualizations in text, HTML, or JSON formats.
- test_components.py
  - New file: Adds tests for the new utility and core components.
  - Includes tests for ChonkieLogger, ParallelProcessor, Visualizer, and CLI argument parsing.
- test_pdf_chef.py
  - New file: Adds specific tests for the PDFCleanerChef.
  - Tests chef availability (PyMuPDF) and basic PDF text extraction.
- tests/test_chomp.py
  - New file: Adds unit tests for the Chomp pipeline class.
  - Tests pipeline creation, adding/setting components, building, processing, and calling the pipeline.
  - Uses mock components for isolation.
- tests/test_chomp_config.py
  - New file: Adds unit tests for the ChompConfig class.
  - Tests serialization and deserialization of pipeline configurations.
  - Tests saving and loading configurations to/from JSON and Pickle files.
- tests/test_structured_chefs.py
  - New file: Adds unit tests for the JSONCleanerChef and CSVCleanerChef.
  - Tests various processing options for JSON (flatten, extract, join) and CSV (delimiter, columns, header, skip, join).
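For the timing helpers mentioned above, usage might look like the sketch below; ChonkieLogger's constructor arguments and the chunking step are placeholders based only on the changelog description:

```python
# Hypothetical usage of ChonkieLogger's timers; only start_timer/end_timer
# are named in the changelog, everything else here is assumed.
from chonkie.utils.logger import ChonkieLogger

logger = ChonkieLogger(log_file="pipeline.log")
logger.start_timer("chunking")
chunks = run_chunking_step(text)  # placeholder for the pipeline's chunk step
logger.end_timer("chunking")      # records the elapsed time as a metric
```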
Code Review
This pull request introduces a significant set of features for Chonkie's multi-step pipeline (CHOMP), including various Chef implementations for document and data preprocessing, a configuration system, parallel processing capabilities, a logger, a visualizer, and a command-line interface. The overall structure is modular and the additions are comprehensive.
The code is generally well-written, with good use of type hinting and docstrings in most places. The testing approach covers many new components, which is commendable.
I've identified a few areas for improvement, primarily concerning error handling consistency in Chefs, the use of global for optional dependencies, and some specific implementation details in HTML cleaning and configuration serialization. Addressing these points will enhance the robustness and maintainability of the new CHOMP system.
No specific style guide was provided, so feedback related to style generally follows PEP 8 for Python and common best practices.
Summary of Findings
- Error Handling in Chefs: Several Chef classes (PDFCleanerChef, JSONCleanerChef, CSVCleanerChef) return error messages as strings when handle_errors=True. This can mask failures and lead to incorrect data processing downstream. It's recommended to raise exceptions or return explicit error indicators like None.
- Use of global for Optional Dependencies: The use of global for lazy-loading optional dependencies (fitz, BeautifulSoup, markdown, csv) is generally discouraged as it can affect module state predictability and testability. Consider alternative patterns like class/instance attributes or standard top-level imports with try-except blocks.
- HTMLCleanerChef Line Break Preservation: In HTMLCleanerChef, the preserve_line_breaks functionality appears to be negated by a subsequent re.sub(r'\s+', ' ', text) call, which collapses all whitespace, including newlines, into single spaces. This should be reviewed to ensure it behaves as intended.
- Configuration Serialization Robustness: ChompConfig's serialization relies on __dict__ for component parameters and falls back to str() for unknown parameter types. This could be made more robust by using explicit serialization methods or stricter type checking.
- Parallel Processing Error Reporting: ParallelProcessor.map appends exceptions to the results list, which requires callers to manually check item types. A clearer error reporting mechanism might be beneficial.
- Insecure tempfile.mktemp Usage in Tests: test_components.py uses tempfile.mktemp(), which is insecure. tempfile.NamedTemporaryFile or tempfile.mkstemp() should be used instead.
- Bare except: Clauses in Demo Code: demo_config.py uses bare except: clauses, which can catch system-exiting exceptions. It's better to catch Exception or more specific exception types.
- PDFCleanerChef Table Handling: The handle_tables functionality in PDFCleanerChef currently doesn't perform any specific table processing. The implementation or documentation should be clarified.
- Missing Newlines at End of File (Low Severity): Many Python files are missing a newline character at the end of the file. This is a minor PEP 8 style issue.
- Unused Imports (Low Severity): Some files have unused imports (e.g., re in src/chonkie/chefs/document.py and src/chonkie/chefs/structured.py). These should be removed.
Merge Readiness
This pull request introduces substantial and valuable features to Chonkie. The implementation is largely solid. However, there are several high and medium severity issues identified, particularly around error handling in Chefs, the use of global for dependencies, and the HTML cleaning logic.
I recommend addressing these issues, especially the high severity ones, before merging to ensure the robustness and maintainability of the new CHOMP system. I am not authorized to approve pull requests, so please ensure further review and approval from authorized maintainers after addressing the feedback.
text = re.sub(r'\s+', ' ', text)
text = text.strip()
The preserve_line_breaks option (if True) aims to keep newlines by replacing <br> and appending \n to <p> tags. However, text = re.sub(r'\s+', ' ', text) on line 171 will subsequently collapse all whitespace, including these preserved newlines, into single spaces. This seems to negate the effect of preserve_line_breaks.
If line breaks are intended to be preserved as \n, should this re.sub call be modified to not affect newlines, e.g., by targeting only spaces and tabs with re.sub(r'[ \t]+', ' ', text), or by handling newline preservation after this step?
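A quick self-contained sketch of the suggested character-class fix, showing that it collapses runs of spaces and tabs while newlines survive:

```python
import re

text = "line one\nline  two\t\tmore"
# Collapse runs of spaces/tabs only, leaving newlines intact, so the
# preserve_line_breaks option keeps its effect.
collapsed = re.sub(r'[ \t]+', ' ', text)
assert collapsed == "line one\nline two more"
```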
if self.handle_errors:
    return f"Error processing CSV: {str(e)}"
else:
    raise
Similar to the JSONChef, returning an error message f"Error processing CSV: {str(e)}" as a string when handle_errors is true can hide the error from the pipeline, potentially leading to incorrect downstream processing.
What do you think about raising an exception or returning None/empty string and logging the error instead?
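A minimal sketch of that alternative error contract; the _process_csv helper and constructor are hypothetical stand-ins, not the PR's code:

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)

class CSVCleanerChef:
    """Sketch of the suggested error contract; _process_csv is hypothetical."""

    def __init__(self, handle_errors: bool = True):
        self.handle_errors = handle_errors

    def preprocess(self, text: str) -> Optional[str]:
        try:
            return self._process_csv(text)
        except Exception as e:
            if self.handle_errors:
                # Log and return an explicit "no result" rather than an
                # error string downstream steps could mistake for data.
                logger.error("Error processing CSV: %s", e)
                return None
            raise

    def _process_csv(self, text: str) -> str:
        # Placeholder for the chef's actual CSV handling.
        return text
```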
if self.handle_errors:
    return f"Error parsing JSON: {str(e)}"
else:
    raise
When handle_errors is true, returning an error message f"Error parsing JSON: {str(e)}" as a string can mask failures from the caller, as it might be treated as valid processed text. This could lead to downstream issues.
Would it be more robust to raise the exception or return a more explicit error indicator (e.g., None or an empty string) while logging the error?
if not self.extract_metadata:
    return f"Error extracting text from PDF: {str(e)}"
else:
    raise
Returning an error message string f"Error extracting text from PDF: {str(e)}" as if it's successfully processed content can be misleading for the caller. The pipeline might treat this error string as actual data. It's generally better to raise an exception or return a distinct value (like None or an empty string) and log the error.
What are your thoughts on raising an exception here, or perhaps logging the error and returning an empty string, to make error states more explicit?
for table in tables.tables:
    rows = []
    for cells in table.cells:
        for cell in cells:
            # Process cells if needed
            pass
The handle_tables parameter suggests specific table processing, but the loop for table in tables.tables: ... for cells in table.cells: ... for cell in cells: pass doesn't perform any operations. If page.get_text() already extracts table content sufficiently, this block might be redundant or misleading.
Could you clarify what handle_tables=True is intended to achieve beyond the default text extraction? If it's a placeholder for future enhancements, a NotImplementedError or a more descriptive comment might be appropriate.
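One way the fail-loud option could look as a method-body fragment; exactly where this guard belongs in PDFCleanerChef.preprocess is an assumption:

```python
if self.handle_tables:
    # Table-aware extraction is not implemented yet; fail loudly rather
    # than silently falling back to plain page.get_text() output.
    raise NotImplementedError(
        "handle_tables=True is not supported yet; "
        "tables are currently extracted as plain text."
    )
```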
def _import_dependencies(self) -> None:
    """Import BeautifulSoup."""
    if self._bs4_available:
        global BeautifulSoup
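The Summary of Findings above recommends replacing this global-based lazy import with a standard guarded import. A minimal sketch of that pattern, with the availability check reduced to illustration:

```python
# Module-level guarded import instead of `global` inside a method.
try:
    from bs4 import BeautifulSoup
except ImportError:
    BeautifulSoup = None

class HTMLCleanerChef:
    def is_available(self) -> bool:
        # Availability reads module state set once at import time,
        # which is predictable and easy to patch in tests.
        return BeautifulSoup is not None
```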
# Default fallback: convert to string
return str(param)
The fallback return str(param) for serializing parameters might lead to loss of information or non-deserializable representations for complex objects not explicitly handled by SPECIAL_TYPES. For a robust configuration system, it's often better to raise an error for unsupported types or require types to implement a specific serialization interface.
What are your thoughts on making this stricter, perhaps by raising a TypeError if a parameter type isn't explicitly serializable?
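A sketch of the stricter fallback, with the SPECIAL_TYPES handling omitted for brevity:

```python
def _serialize_param(param):
    # Strict fallback: permit JSON-native types, recurse into containers,
    # and raise for anything else instead of degrading to str(param).
    if isinstance(param, (str, int, float, bool, type(None))):
        return param
    if isinstance(param, (list, tuple)):
        return [_serialize_param(p) for p in param]
    if isinstance(param, dict):
        return {k: _serialize_param(v) for k, v in param.items()}
    raise TypeError(
        f"Cannot serialize parameter of type {type(param).__name__}; "
        "register it in SPECIAL_TYPES or add explicit handling."
    )
```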
{
    "type": type(chef).__name__,
    "module": type(chef).__module__,
    "params": {k: ChompConfig._serialize_param(v) for k, v in chef.__dict__.items()}
Serializing component parameters using chef.__dict__.items() can be fragile. It might expose internal state unintended for configuration, or fail for classes using __slots__. A more robust approach is often to define an explicit to_config_dict() method on components, or to serialize only the parameters passed to __init__.
Could this be changed to a more explicit serialization method for components to ensure only relevant config parameters are saved?
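A sketch of the explicit whitelist approach; the constructor parameters shown are assumptions based on the changelog's description of PDFCleanerChef:

```python
class PDFCleanerChef:
    """Sketch: explicit config serialization instead of dumping __dict__."""

    def __init__(self, extract_metadata: bool = True, handle_tables: bool = False):
        self.extract_metadata = extract_metadata
        self.handle_tables = handle_tables
        self._fitz = None  # lazy import cache: should NOT be serialized

    def to_config_dict(self) -> dict:
        # Whitelist only the constructor parameters; internal state such
        # as the lazy import cache stays out of the saved configuration.
        return {
            "extract_metadata": self.extract_metadata,
            "handle_tables": self.handle_tables,
        }
```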
except Exception as e:
    # Collect any exceptions that occurred
    results.append(e)
Appending exceptions directly to the results list (e.g., results.append(e)) means the caller must inspect each item's type to differentiate between successful results and exceptions. This can make error handling cumbersome for the caller.
Consider either raising the first encountered exception, collecting exceptions in a separate list, or wrapping results in a structure that indicates success/failure (e.g., {'result': ..., 'error': None} or {'result': None, 'error': ...}). What do you think would be most user-friendly for the Chomp pipeline?
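A self-contained sketch of the wrapper option, independent of ParallelProcessor's actual interface:

```python
from concurrent.futures import ThreadPoolExecutor

def safe_map(fn, items, max_workers=4):
    # Wrap each outcome so callers can branch on 'error' explicitly
    # instead of type-checking results for Exception instances.
    def run(item):
        try:
            return {"result": fn(item), "error": None}
        except Exception as e:
            return {"result": None, "error": e}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, items))

# Usage:
# outcomes = safe_map(parse, documents)
# failures = [o for o in outcomes if o["error"] is not None]
```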
# Test direct file logging
print("\nTesting direct file logging...")
custom_log_file = tempfile.mktemp(suffix='.log')
tempfile.mktemp() is considered insecure because of a potential race condition: another process could create a file with the same name between mktemp() returning the name and your script attempting to use it. It's recommended to use tempfile.NamedTemporaryFile(delete=False) (as you've done for log_file) or tempfile.mkstemp() instead.
Could this be changed to use tempfile.NamedTemporaryFile or mkstemp for creating custom_log_file?
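A minimal sketch of the mkstemp replacement for the test; the cleanup pattern is one common choice, not the only one:

```python
import os
import tempfile

# Create the file atomically; mkstemp returns an open descriptor and
# the path, avoiding the mktemp() race condition.
fd, custom_log_file = tempfile.mkstemp(suffix=".log")
os.close(fd)  # the logger will reopen the path itself
try:
    ...  # exercise file logging against custom_log_file
finally:
    os.remove(custom_log_file)
```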
Closing PR, since this is a core feature which requires an RFC. Please refer to the Contributing Guidelines. Thanks 😊
Background
This PR implements the remaining core features for Chonkie's multi-step pipeline (CHOMP) to create a complete text chunking system. Building on the existing foundation of basic chefs and chunkers, this PR adds advanced document processing, parallel execution, comprehensive logging, and a command-line interface.
Approach
We followed a modular implementation approach, developing each component to be self-contained yet fully integrated with the existing architecture. We prioritized backward compatibility while adding new capabilities that significantly enhance both functionality and performance.
Features Implemented
1. Advanced Document Processing
2. Parallel Processing Infrastructure
3. Logging and Performance Monitoring
4. Command-Line Interface
Testing Approach
We developed comprehensive tests covering all new functionality:
- Component Tests
- Format-Specific Tests
- Integration Tests
Fixes #119