
Introduce CHOMPs for multi-step pipelines. #119 #120


Conversation

sambhavnoobcoder

Background

This PR implements the remaining core features for Chonkie's multi-step pipeline (CHOMP) to create a complete text chunking system. Building on the existing foundation of basic chefs and chunkers, this PR adds advanced document processing, parallel execution, comprehensive logging, and a command-line interface.

Approach

We followed a modular implementation approach, developing each component to be self-contained yet fully integrated with the existing architecture. We prioritized backward compatibility while adding new capabilities that significantly enhance both functionality and performance.

Features Implemented

1. Advanced Document Processing

  • PDFCleanerChef: Processes PDF files with metadata extraction, page numbering, and table handling
  • Format detection: Automatically identifies file formats based on extensions for seamless processing
  • Dependency management: Optional PyMuPDF integration for PDF processing with appropriate fallbacks
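The thread doesn't show the actual detection code; a minimal sketch of what extension-based format detection might look like (the mapping and function name here are assumptions, not the PR's implementation):

```python
from pathlib import Path

# Hypothetical chef registry keyed by file extension.
FORMAT_CHEFS = {
    ".pdf": "PDFCleanerChef",
    ".html": "HTMLCleanerChef",
    ".md": "MarkdownCleanerChef",
    ".json": "JSONCleanerChef",
    ".csv": "CSVCleanerChef",
    ".txt": "TextCleanerChef",
}

def detect_format(path: str) -> str:
    """Pick a chef name from the file extension, falling back to plain text."""
    return FORMAT_CHEFS.get(Path(path).suffix.lower(), "TextCleanerChef")
```

Lower-casing the suffix makes detection case-insensitive, so `report.PDF` and `report.pdf` route to the same chef.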

2. Parallel Processing Infrastructure

  • ParallelProcessor class: Core component supporting both threading and multiprocessing modes
  • Worker configuration: Customizable worker count based on available CPU cores
  • Batch processing: Efficient handling of multiple documents in configurable batches
  • Timeout handling: Graceful termination of long-running processes
  • Pipeline integration: Seamless parallel execution within the existing CHOMP framework
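The combination of thread/process modes, worker configuration, and timeouts described above can be sketched with `concurrent.futures` alone. This is an illustrative outline under assumed names, not the PR's actual `ParallelProcessor`:

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, Iterable, Optional

class ParallelProcessor:
    """Sketch: thread- or process-based parallel map with a timeout."""

    def __init__(self, mode: str = "thread", workers: Optional[int] = None):
        # Default the worker count to the number of available CPU cores.
        self.workers = workers or os.cpu_count() or 1
        self._pool_cls = ThreadPoolExecutor if mode == "thread" else ProcessPoolExecutor

    def map(self, fn: Callable, items: Iterable, timeout: Optional[float] = None) -> list:
        # A timeout makes result collection raise instead of hanging forever.
        with self._pool_cls(max_workers=self.workers) as pool:
            return list(pool.map(fn, items, timeout=timeout))
```

Thread mode suits I/O-bound chefs (file reading, network calls); process mode sidesteps the GIL for CPU-bound chunking at the cost of pickling overhead.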

3. Logging and Performance Monitoring

  • ChonkieLogger: Centralized logging with configurable verbosity levels
  • Performance tracking: Built-in timers and metrics collection
  • Pipeline monitoring: Step-by-step visibility into processing stages
  • Multi-destination output: Support for both console and file-based logging
  • Performance metrics: Runtime statistics for optimization and debugging
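A logger with the features listed above (configurable levels, console/file destinations, built-in timers) can be built on the stdlib `logging` module. The method names follow the PR's description (`start_timer`/`end_timer`), but the body is an assumed sketch:

```python
import logging
import time
from typing import Optional

class ChonkieLogger:
    """Sketch of a centralized logger with built-in performance timers."""

    def __init__(self, name: str = "chonkie", level: int = logging.INFO,
                 log_file: Optional[str] = None):
        self.logger = logging.getLogger(name)
        self.logger.setLevel(level)
        fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        handlers = [logging.StreamHandler()]
        if log_file:
            handlers.append(logging.FileHandler(log_file))  # multi-destination output
        for h in handlers:
            h.setFormatter(fmt)
            self.logger.addHandler(h)
        self._timers = {}

    def start_timer(self, label: str) -> None:
        self._timers[label] = time.perf_counter()

    def end_timer(self, label: str) -> float:
        elapsed = time.perf_counter() - self._timers.pop(label)
        self.logger.info("%s finished in %.3fs", label, elapsed)
        return elapsed
```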

4. Command-Line Interface

  • Chunking command: Process files and directories with customizable parameters
  • Configuration command: Create and manage reusable pipeline configurations
  • Format handling: Support for all implemented document formats
  • Directory processing: Batch process multiple files with parallel execution
  • User feedback: Clear progress and result reporting
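The exact flag names aren't shown in this thread; a minimal `argparse` sketch of what the `chunk` subcommand's surface might look like (all flag names and defaults here are assumptions):

```python
import argparse

parser = argparse.ArgumentParser(prog="chonkie-cli")
sub = parser.add_subparsers(dest="command")

# Hypothetical 'chunk' subcommand mirroring the options described above.
chunk = sub.add_parser("chunk", help="chunk a file or directory")
chunk.add_argument("path")
chunk.add_argument("--chunker", default="recursive")
chunk.add_argument("--size", type=int, default=512)
chunk.add_argument("--parallel", action="store_true")

args = parser.parse_args(["chunk", "docs/", "--size", "256", "--parallel"])
```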

Testing Approach

We developed comprehensive tests covering all new functionality:

Component Tests

  • test_components.py: Verifies the core functionality of individual components
    • Logger tests: Confirm message logging, performance tracking, and file output
    • Parallel processor tests: Validate performance gains and result correctness across processing modes
    • Visualizer tests: Ensure proper output formatting in multiple formats
    • CLI argument tests: Verify command parsing and parameter handling

Format-Specific Tests

  • test_pdf_chef.py: Validates PDF processing capabilities
    • Tests availability detection and dependency handling
    • Verifies text extraction, metadata handling, and page numbering
    • Ensures proper error handling for missing dependencies

Integration Tests

  • End-to-end CLI testing with different file formats:
    • Plain text processing (sample.txt)
    • Markdown document handling (sample.md)
    • JSON data extraction (sample.json)
    • CSV data processing (sample.csv)
    • Directory batch processing with parallel execution
(Screenshots attached: three CLI demo runs captured 2025-05-16.)

Fixes #119

Contributor

@gemini-code-assist (bot) left a comment


Hello @sambhavnoobcoder, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team, gemini-code-assist here to provide a summary of this pull request. This PR, titled "Introduce CHOMPs for multi-step pipelines. #119", aims to complete the core features for Chonkie's multi-step pipeline system, referred to as CHOMP. Building upon existing basic components, it introduces advanced document processing capabilities for various formats (PDF, HTML, Markdown, JSON, CSV), implements infrastructure for parallel execution using threading and multiprocessing, adds comprehensive logging and performance monitoring features, and establishes a command-line interface for processing files and managing pipeline configurations. The changes span multiple new files covering these areas, along with corresponding tests and demo scripts.

Highlights

  • Multi-step Pipeline (CHOMP): Introduces the core framework for building multi-step text processing pipelines, allowing chaining of 'Chefs' (preprocessing), a 'Chunker', 'Refineries' (post-processing), and 'Porters' (exporting).
  • Advanced Document Processing Chefs: Adds new 'Chefs' for handling specific document types: PDFCleanerChef (extracts text, metadata, page numbers from PDFs, requires PyMuPDF), HTMLCleanerChef (converts HTML to text, requires BeautifulSoup), MarkdownCleanerChef (converts Markdown to text), JSONCleanerChef (processes JSON, flattens, extracts fields), and CSVCleanerChef (processes CSV, extracts columns, joins).
  • Parallel Processing: Implements a ParallelProcessor utility using concurrent.futures to enable parallel execution of tasks within the pipeline, supporting both process-based and thread-based parallelism.
  • Logging and Performance Monitoring: Introduces ChonkieLogger for centralized logging with configurable levels and destinations (console/file), including built-in timers and metrics collection for performance monitoring.
  • Configuration System: Adds functionality to serialize and deserialize pipeline configurations to/from JSON and Pickle formats, enabling saving, loading, and sharing of pipeline setups.
  • Command-Line Interface (CLI): Provides a command-line tool (chonkie-cli.py) with 'chunk' and 'config' subcommands for processing files/directories with customizable pipelines (defined via args or config files) and generating/saving configurations.

Changelog

  • chonkie-cli.py
    • New file: Entry point for the Chonkie command-line interface.
    • Adds project root to Python path.
  • demo_chefs.py
    • New file: Demonstrates usage of various 'Chefs' (Text, HTML, JSON, CSV, Markdown) and a basic CHOMP pipeline.
    • Includes examples for processing different content types.
  • demo_config.py
    • New file: Demonstrates saving and loading CHOMP pipeline configurations using JSON.
    • Shows how to create and use pipeline 'recipes' from configurations.
  • src/chonkie/chefs/__init__.py
    • Adds imports and __all__ list for new chef classes: PDFCleanerChef, CSVCleanerChef, JSONCleanerChef, HTMLCleanerChef, MarkdownCleanerChef.
  • src/chonkie/chefs/base.py
    • New file: Defines the abstract base class BaseChef.
    • Includes abstract methods is_available and preprocess.
  • src/chonkie/chefs/document.py
    • New file: Implements PDFCleanerChef for extracting text from PDF files.
    • Includes options for extracting metadata, page numbers, and handling tables.
    • Adds dependency check and lazy import for PyMuPDF (fitz).
  • src/chonkie/chefs/structured.py
    • New file: Implements JSONCleanerChef for processing JSON data.
    • Includes options for flattening, extracting specific fields, and joining text values.
    • Implements CSVCleanerChef for processing CSV data.
    • Includes options for custom delimiter, extracting columns (by name or index), handling headers, skipping lines, and joining columns.
  • src/chonkie/chefs/text.py
    • New file: Implements TextCleanerChef for basic text cleaning (whitespace, strip, lowercase, remove patterns).
    • Implements HTMLCleanerChef for converting HTML to text, with options for stripping tags and preserving line breaks (requires BeautifulSoup).
    • Implements MarkdownCleanerChef for converting Markdown to text, with options for stripping formatting and preserving headings/links (requires markdown).
  • src/chonkie/chomp/config.py
    • New file: Implements ChompConfig class for pipeline serialization/deserialization.
    • Provides static methods to serialize and deserialize pipeline components (chefs, chunker, refineries, porter, handshake).
    • Includes logic for handling special types like RecursiveRules and RecursiveLevel.
    • Adds static methods to save_json, load_json, save_pickle, and load_pickle configurations to/from files.
  • src/chonkie/chomp/parallel.py
    • New file: Implements ParallelProcessor class for parallel execution.
    • Uses concurrent.futures for thread or process pools.
    • Provides map and batch_process methods for parallel task execution.
  • src/chonkie/cli/__init__.py
    • New file: Initializes the CLI package.
    • Imports the main function from commands.
  • src/chonkie/cli/commands.py
    • New file: Implements the CLI command logic using argparse.
    • Defines 'chunk' command for processing files/directories with various options (chunker, size, format, parallel, overlap, log, verbose).
    • Defines 'config' command for creating and saving pipeline configurations.
    • Includes helper functions for format detection and pipeline creation from arguments.
  • src/chonkie/utils/logger.py
    • New file: Implements ChonkieLogger for structured logging.
    • Supports different log levels, console and file output, and custom formats.
    • Includes methods for performance timing (start_timer, end_timer) and logging pipeline steps.
  • src/chonkie/utils/visualizer.py
    • New file: Implements Visualizer for visualizing chunks and pipelines.
    • Provides methods to output visualizations in text, HTML, or JSON formats.
  • test_components.py
    • New file: Adds tests for the new utility and core components.
    • Includes tests for ChonkieLogger, ParallelProcessor, Visualizer, and CLI argument parsing.
  • test_pdf_chef.py
    • New file: Adds specific tests for the PDFCleanerChef.
    • Tests chef availability (PyMuPDF) and basic PDF text extraction.
  • tests/test_chomp.py
    • New file: Adds unit tests for the Chomp pipeline class.
    • Tests pipeline creation, adding/setting components, building, processing, and calling the pipeline.
    • Uses mock components for isolation.
  • tests/test_chomp_config.py
    • New file: Adds unit tests for the ChompConfig class.
    • Tests serialization and deserialization of pipeline configurations.
    • Tests saving and loading configurations to/from JSON and Pickle files.
  • tests/test_structured_chefs.py
    • New file: Adds unit tests for the JSONCleanerChef and CSVCleanerChef.
    • Tests various processing options for JSON (flatten, extract, join) and CSV (delimiter, columns, header, skip, join).

Pipeline's flow so neat,
Chefs and chunkers, can't be beat,
Data's journey's done.


Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a significant set of features for Chonkie's multi-step pipeline (CHOMP), including various Chef implementations for document and data preprocessing, a configuration system, parallel processing capabilities, a logger, a visualizer, and a command-line interface. The overall structure is modular and the additions are comprehensive.

The code is generally well-written, with good use of type hinting and docstrings in most places. The testing approach covers many new components, which is commendable.

I've identified a few areas for improvement, primarily concerning error handling consistency in Chefs, the use of global for optional dependencies, and some specific implementation details in HTML cleaning and configuration serialization. Addressing these points will enhance the robustness and maintainability of the new CHOMP system.

No specific style guide was provided, so feedback related to style generally follows PEP 8 for Python and common best practices.

Summary of Findings

  • Error Handling in Chefs: Several Chef classes (PDFCleanerChef, JSONCleanerChef, CSVCleanerChef) return error messages as strings when handle_errors=True. This can mask failures and lead to incorrect data processing downstream. It's recommended to raise exceptions or return explicit error indicators like None.
  • Use of global for Optional Dependencies: The use of global for lazy-loading optional dependencies (fitz, BeautifulSoup, markdown, csv) is generally discouraged as it can affect module state predictability and testability. Consider alternative patterns like class/instance attributes or standard top-level imports with try-except blocks.
  • HTMLCleanerChef Line Break Preservation: In HTMLCleanerChef, the preserve_line_breaks functionality appears to be negated by a subsequent re.sub(r'\s+', ' ', text) call, which collapses all whitespace, including newlines, into single spaces. This should be reviewed to ensure it behaves as intended.
  • Configuration Serialization Robustness: ChompConfig's serialization relies on __dict__ for component parameters and falls back to str() for unknown parameter types. This could be made more robust by using explicit serialization methods or stricter type checking.
  • Parallel Processing Error Reporting: ParallelProcessor.map appends exceptions to the results list, which requires callers to manually check item types. A clearer error reporting mechanism might be beneficial.
  • Insecure tempfile.mktemp Usage in Tests: test_components.py uses tempfile.mktemp(), which is insecure. tempfile.NamedTemporaryFile or tempfile.mkstemp() should be used instead.
  • Bare except: Clauses in Demo Code: demo_config.py uses bare except: clauses, which can catch system-exiting exceptions. It's better to catch Exception or more specific exception types.
  • PDFCleanerChef Table Handling: The handle_tables functionality in PDFCleanerChef currently doesn't perform any specific table processing. The implementation or documentation should be clarified.
  • Missing Newlines at End of File (Low Severity): Many Python files are missing a newline character at the end of the file. This is a minor PEP 8 style issue.
  • Unused Imports (Low Severity): Some files have unused imports (e.g., re in src/chonkie/chefs/document.py and src/chonkie/chefs/structured.py). These should be removed.

Merge Readiness

This pull request introduces substantial and valuable features to Chonkie. The implementation is largely solid. However, there are several high and medium severity issues identified, particularly around error handling in Chefs, the use of global for dependencies, and the HTML cleaning logic.

I recommend addressing these issues, especially the high severity ones, before merging to ensure the robustness and maintainability of the new CHOMP system. I am not authorized to approve pull requests, so please ensure further review and approval from authorized maintainers after addressing the feedback.

Comment on lines +171 to +172
text = re.sub(r'\s+', ' ', text)
text = text.strip()

Severity: high

The preserve_line_breaks option (if True) aims to keep newlines by replacing <br> and appending \n to <p> tags. However, text = re.sub(r'\s+', ' ', text) on line 171 will subsequently collapse all whitespace, including these preserved newlines, into single spaces. This seems to negate the effect of preserve_line_breaks.

If line breaks are intended to be preserved as \n, should this re.sub call be modified to not affect newlines, e.g., by targeting only spaces and tabs re.sub(r'[ \t]+', ' ', text) or by handling newline preservation after this step?
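The difference between the two regexes the reviewer describes can be shown directly; this small sketch reproduces the issue and the suggested fix:

```python
import re

text = "Hello   world\nNew   line"

# The current call collapses ALL whitespace, including preserved newlines:
collapsed = re.sub(r'\s+', ' ', text)   # 'Hello world New line'

# Targeting only spaces and tabs keeps the line breaks intact:
fixed = re.sub(r'[ \t]+', ' ', text)    # 'Hello world\nNew line'
```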

Comment on lines +350 to +353
if self.handle_errors:
    return f"Error processing CSV: {str(e)}"
else:
    raise

Severity: high

Similar to the JSONChef, returning an error message f"Error processing CSV: {str(e)}" as a string when handle_errors is true can hide the error from the pipeline, potentially leading to incorrect downstream processing.

What do you think about raising an exception or returning None/empty string and logging the error instead?

Comment on lines +164 to +167
if self.handle_errors:
    return f"Error parsing JSON: {str(e)}"
else:
    raise

Severity: high

When handle_errors is true, returning an error message f"Error parsing JSON: {str(e)}" as a string can mask failures from the caller, as it might be treated as valid processed text. This could lead to downstream issues.

Would it be more robust to raise the exception or return a more explicit error indicator (e.g., None or an empty string) while logging the error?
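The pattern the reviewer is suggesting, log the error and return an explicit failure marker instead of an error string, might look like this (an illustrative sketch, not the PR's code):

```python
import json
import logging

logger = logging.getLogger(__name__)

def preprocess(text: str, handle_errors: bool = True):
    """Return parsed JSON, None on failure, or re-raise when handle_errors is off."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        if handle_errors:
            logger.error("Error parsing JSON: %s", e)
            return None  # explicit failure marker, never fake content
        raise
```

Downstream pipeline steps can then check `is None` instead of accidentally chunking an error message as if it were document text.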

Comment on lines +138 to +141
if not self.extract_metadata:
    return f"Error extracting text from PDF: {str(e)}"
else:
    raise

Severity: high

Returning an error message string f"Error extracting text from PDF: {str(e)}" as if it's successfully processed content can be misleading for the caller. The pipeline might treat this error string as actual data. It's generally better to raise an exception or return a distinct value (like None or an empty string) and log the error.

What are your thoughts on raising an exception here, or perhaps logging the error and returning an empty string, to make error states more explicit?

Comment on lines +108 to +113
for table in tables.tables:
    rows = []
    for cells in table.cells:
        for cell in cells:
            # Process cells if needed
            pass

Severity: medium

The handle_tables parameter suggests specific table processing, but the loop for table in tables.tables: ... for cells in table.cells: ... for cell in cells: pass doesn't perform any operations. If page.get_text() already extracts table content sufficiently, this block might be redundant or misleading.

Could you clarify what handle_tables=True is intended to achieve beyond the default text extraction? If it's a placeholder for future enhancements, a NotImplementedError or a more descriptive comment might be appropriate.
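The reviewer's `NotImplementedError` suggestion would look roughly like this sketch (a hypothetical reduction of the chef, not the PR's actual class):

```python
class PDFCleanerChef:
    """Sketch: fail loudly while table handling remains unimplemented."""

    def __init__(self, handle_tables: bool = False):
        if handle_tables:
            # page.get_text() already includes table text; anything beyond
            # that is future work, so refuse rather than silently no-op.
            raise NotImplementedError("handle_tables=True is not implemented yet")
        self.handle_tables = handle_tables
```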

def _import_dependencies(self) -> None:
    """Import BeautifulSoup."""
    if self._bs4_available:
        global BeautifulSoup

Severity: medium

The use of global BeautifulSoup for lazy loading presents similar maintainability and testability concerns as noted for fitz and csv. Storing the imported module on the class or instance, or a top-level import if appropriate, would be preferable.

Could this be refactored to avoid global?
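One `global`-free alternative is to keep the lazily imported module on the instance. The sketch below uses the stdlib `csv` module to stay runnable, but `bs4` or `fitz` would follow the same pattern:

```python
import importlib

class CSVCleanerChef:
    """Sketch: store the lazily imported module on the instance, not in globals()."""

    def __init__(self):
        self._csv = None  # imported on first access

    @property
    def csv(self):
        if self._csv is None:
            self._csv = importlib.import_module("csv")
        return self._csv
```

This keeps module state untouched and lets tests inject a fake by setting `chef._csv` directly.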

}

# Default fallback: convert to string
return str(param)

Severity: medium

The fallback return str(param) for serializing parameters might lead to loss of information or non-deserializable representations for complex objects not explicitly handled by SPECIAL_TYPES. For a robust configuration system, it's often better to raise an error for unsupported types or require types to implement a specific serialization interface.

What are your thoughts on making this stricter, perhaps by raising a TypeError if a parameter type isn't explicitly serializable?
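A stricter fallback along the lines the reviewer proposes might look like this (an assumed sketch; `JSON_SAFE` and the function name are illustrative, not the PR's `SPECIAL_TYPES` machinery):

```python
JSON_SAFE = (str, int, float, bool, type(None))

def serialize_param(param):
    """Recursively serialize a parameter, rejecting anything that can't round-trip."""
    if isinstance(param, JSON_SAFE):
        return param
    if isinstance(param, (list, tuple)):
        return [serialize_param(p) for p in param]
    if isinstance(param, dict):
        return {k: serialize_param(v) for k, v in param.items()}
    raise TypeError(f"cannot serialize parameter of type {type(param).__name__}")
```

Raising `TypeError` surfaces unsupported types at save time, rather than producing a `str()` blob that fails (or silently misbehaves) at load time.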

{
    "type": type(chef).__name__,
    "module": type(chef).__module__,
    "params": {k: ChompConfig._serialize_param(v) for k, v in chef.__dict__.items()}

Severity: medium

Serializing component parameters using chef.__dict__.items() can be fragile. It might expose internal state unintended for configuration, or fail for classes using __slots__. A more robust approach is often to define an explicit to_config_dict() method on components, or to serialize only the parameters passed to __init__.

Could this be changed to a more explicit serialization method for components to ensure only relevant config parameters are saved?
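An explicit `to_config_dict()` along the lines suggested above could look like this hypothetical sketch, where only constructor parameters are exposed and internal state is kept out of the config:

```python
class TextCleanerChef:
    """Sketch: expose exactly the config a component was built from, not __dict__."""

    def __init__(self, lowercase: bool = False, strip: bool = True):
        self.lowercase = lowercase
        self.strip = strip
        self._cache = {}  # internal state that should never reach a config file

    def to_config_dict(self) -> dict:
        return {"lowercase": self.lowercase, "strip": self.strip}
```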

Comment on lines +65 to +67
except Exception as e:
    # Collect any exceptions that occurred
    results.append(e)

Severity: medium

Appending exceptions directly to the results list (e.g., results.append(e)) means the caller must inspect each item's type to differentiate between successful results and exceptions. This can make error handling cumbersome for the caller.

Consider either raising the first encountered exception, collecting exceptions in a separate list, or wrapping results in a structure that indicates success/failure (e.g., {'result': ..., 'error': None} or {'result': None, 'error': ...}). What do you think would be most user-friendly for the Chomp pipeline?
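The result/error wrapping option can be sketched as follows (illustrative only; `safe_map` is a hypothetical name, not part of the PR):

```python
from concurrent.futures import ThreadPoolExecutor

def safe_map(fn, items, workers: int = 4) -> list:
    """Map fn over items in parallel; every item yields a result/error pair."""
    def wrap(item):
        try:
            return {"result": fn(item), "error": None}
        except Exception as e:
            return {"result": None, "error": e}

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(wrap, items))
```

Callers then check `entry["error"] is None` uniformly instead of type-testing each list element.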


# Test direct file logging
print("\nTesting direct file logging...")
custom_log_file = tempfile.mktemp(suffix='.log')

Severity: medium

tempfile.mktemp() is considered insecure because of a potential race condition: another process could create a file with the same name between mktemp() returning the name and your script attempting to use it. It's recommended to use tempfile.NamedTemporaryFile(delete=False) (as you've done for log_file) or tempfile.mkstemp() instead.

Could this be changed to use tempfile.NamedTemporaryFile or mkstemp for creating custom_log_file?
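The `mkstemp` replacement the reviewer suggests would look like this:

```python
import os
import tempfile

# mkstemp creates AND opens the file atomically, closing the mktemp race window.
fd, custom_log_file = tempfile.mkstemp(suffix=".log")
os.close(fd)  # the logger reopens the path itself, so release our handle

# ... point the logger at custom_log_file and run the test ...

os.unlink(custom_log_file)  # clean up afterwards
```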

@chonknick
Contributor

Closing PR, since this is a core feature which requires a RFC. Please refer to the Contributing Guidelines.

Thanks 😊

@chonknick closed this Aug 2, 2025