
Introduce CHOMPs for multi-step pipelines. #119 #120


Conversation

sambhavnoobcoder

Background

This PR implements the remaining core features for Chonkie's multi-step pipeline (CHOMP) to create a complete text chunking system. Building on the existing foundation of basic chefs and chunkers, this PR adds advanced document processing, parallel execution, comprehensive logging, and a command-line interface.

Approach

We followed a modular implementation approach, developing each component to be self-contained yet fully integrated with the existing architecture. We prioritized backward compatibility while adding new capabilities that significantly enhance both functionality and performance.

Features Implemented

1. Advanced Document Processing

  • PDFCleanerChef: Processes PDF files with metadata extraction, page numbering, and table handling
  • Format detection: Automatically identifies file formats based on extensions for seamless processing
  • Dependency management: Optional PyMuPDF integration for PDF processing with appropriate fallbacks
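The thread doesn't show the actual detection code; a minimal sketch of what extension-based format detection might look like (the mapping and function name here are assumptions, not the PR's implementation):

```python
from pathlib import Path

# Hypothetical chef registry keyed by file extension.
FORMAT_CHEFS = {
    ".pdf": "PDFCleanerChef",
    ".html": "HTMLCleanerChef",
    ".md": "MarkdownCleanerChef",
    ".json": "JSONCleanerChef",
    ".csv": "CSVCleanerChef",
    ".txt": "TextCleanerChef",
}

def detect_format(path: str) -> str:
    """Pick a chef name from the file extension, falling back to plain text."""
    return FORMAT_CHEFS.get(Path(path).suffix.lower(), "TextCleanerChef")
```

Lower-casing the suffix makes detection case-insensitive, so `report.PDF` and `report.pdf` route to the same chef.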

2. Parallel Processing Infrastructure

  • ParallelProcessor class: Core component supporting both threading and multiprocessing modes
  • Worker configuration: Customizable worker count based on available CPU cores
  • Batch processing: Efficient handling of multiple documents in configurable batches
  • Timeout handling: Graceful termination of long-running processes
  • Pipeline integration: Seamless parallel execution within the existing CHOMP framework
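The combination of thread/process modes, worker configuration, and timeouts described above can be sketched with `concurrent.futures` alone. This is an illustrative outline under assumed names, not the PR's actual `ParallelProcessor`:

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, Iterable, Optional

class ParallelProcessor:
    """Sketch: thread- or process-based parallel map with a timeout."""

    def __init__(self, mode: str = "thread", workers: Optional[int] = None):
        # Default the worker count to the number of available CPU cores.
        self.workers = workers or os.cpu_count() or 1
        self._pool_cls = ThreadPoolExecutor if mode == "thread" else ProcessPoolExecutor

    def map(self, fn: Callable, items: Iterable, timeout: Optional[float] = None) -> list:
        # A timeout makes result collection raise instead of hanging forever.
        with self._pool_cls(max_workers=self.workers) as pool:
            return list(pool.map(fn, items, timeout=timeout))
```

Thread mode suits I/O-bound chefs (file reading, network calls); process mode sidesteps the GIL for CPU-bound chunking at the cost of pickling overhead.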

3. Logging and Performance Monitoring

  • ChonkieLogger: Centralized logging with configurable verbosity levels
  • Performance tracking: Built-in timers and metrics collection
  • Pipeline monitoring: Step-by-step visibility into processing stages
  • Multi-destination output: Support for both console and file-based logging
  • Performance metrics: Runtime statistics for optimization and debugging
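A logger with the features listed above (configurable levels, console/file destinations, built-in timers) can be built on the stdlib `logging` module. The method names follow the PR's description (`start_timer`/`end_timer`), but the body is an assumed sketch:

```python
import logging
import time
from typing import Optional

class ChonkieLogger:
    """Sketch of a centralized logger with built-in performance timers."""

    def __init__(self, name: str = "chonkie", level: int = logging.INFO,
                 log_file: Optional[str] = None):
        self.logger = logging.getLogger(name)
        self.logger.setLevel(level)
        fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        handlers = [logging.StreamHandler()]
        if log_file:
            handlers.append(logging.FileHandler(log_file))  # multi-destination output
        for h in handlers:
            h.setFormatter(fmt)
            self.logger.addHandler(h)
        self._timers = {}

    def start_timer(self, label: str) -> None:
        self._timers[label] = time.perf_counter()

    def end_timer(self, label: str) -> float:
        elapsed = time.perf_counter() - self._timers.pop(label)
        self.logger.info("%s finished in %.3fs", label, elapsed)
        return elapsed
```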

4. Command-Line Interface

  • Chunking command: Process files and directories with customizable parameters
  • Configuration command: Create and manage reusable pipeline configurations
  • Format handling: Support for all implemented document formats
  • Directory processing: Batch process multiple files with parallel execution
  • User feedback: Clear progress and result reporting
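The exact flag names aren't shown in this thread; a minimal `argparse` sketch of what the `chunk` subcommand's surface might look like (all flag names and defaults here are assumptions):

```python
import argparse

parser = argparse.ArgumentParser(prog="chonkie-cli")
sub = parser.add_subparsers(dest="command")

# Hypothetical 'chunk' subcommand mirroring the options described above.
chunk = sub.add_parser("chunk", help="chunk a file or directory")
chunk.add_argument("path")
chunk.add_argument("--chunker", default="recursive")
chunk.add_argument("--size", type=int, default=512)
chunk.add_argument("--parallel", action="store_true")

args = parser.parse_args(["chunk", "docs/", "--size", "256", "--parallel"])
```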

Testing Approach

We developed comprehensive tests covering all new functionality:

Component Tests

  • test_components.py: Verifies the core functionality of individual components
    • Logger tests: Confirm message logging, performance tracking, and file output
    • Parallel processor tests: Validate performance gains and result correctness across processing modes
    • Visualizer tests: Ensure proper output formatting in multiple formats
    • CLI argument tests: Verify command parsing and parameter handling

Format-Specific Tests

  • test_pdf_chef.py: Validates PDF processing capabilities
    • Tests availability detection and dependency handling
    • Verifies text extraction, metadata handling, and page numbering
    • Ensures proper error handling for missing dependencies

Integration Tests

  • End-to-end CLI testing with different file formats:
    • Plain text processing (sample.txt)
    • Markdown document handling (sample.md)
    • JSON data extraction (sample.json)
    • CSV data processing (sample.csv)
    • Directory batch processing with parallel execution
(Screenshots attached: three CLI demo runs captured 2025-05-16.)

Fixes #119

Contributor

@gemini-code-assist (bot) left a comment


Hello @sambhavnoobcoder, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team, gemini-code-assist here to provide a summary of this pull request. This PR, titled "Introduce CHOMPs for multi-step pipelines. #119", aims to complete the core features for Chonkie's multi-step pipeline system, referred to as CHOMP. Building upon existing basic components, it introduces advanced document processing capabilities for various formats (PDF, HTML, Markdown, JSON, CSV), implements infrastructure for parallel execution using threading and multiprocessing, adds comprehensive logging and performance monitoring features, and establishes a command-line interface for processing files and managing pipeline configurations. The changes span multiple new files covering these areas, along with corresponding tests and demo scripts.

Highlights

  • Multi-step Pipeline (CHOMP): Introduces the core framework for building multi-step text processing pipelines, allowing chaining of 'Chefs' (preprocessing), a 'Chunker', 'Refineries' (post-processing), and 'Porters' (exporting).
  • Advanced Document Processing Chefs: Adds new 'Chefs' for handling specific document types: PDFCleanerChef (extracts text, metadata, page numbers from PDFs, requires PyMuPDF), HTMLCleanerChef (converts HTML to text, requires BeautifulSoup), MarkdownCleanerChef (converts Markdown to text), JSONCleanerChef (processes JSON, flattens, extracts fields), and CSVCleanerChef (processes CSV, extracts columns, joins).
  • Parallel Processing: Implements a ParallelProcessor utility using concurrent.futures to enable parallel execution of tasks within the pipeline, supporting both process-based and thread-based parallelism.
  • Logging and Performance Monitoring: Introduces ChonkieLogger for centralized logging with configurable levels and destinations (console/file), including built-in timers and metrics collection for performance monitoring.
  • Configuration System: Adds functionality to serialize and deserialize pipeline configurations to/from JSON and Pickle formats, enabling saving, loading, and sharing of pipeline setups.
  • Command-Line Interface (CLI): Provides a command-line tool (chonkie-cli.py) with 'chunk' and 'config' subcommands for processing files/directories with customizable pipelines (defined via args or config files) and generating/saving configurations.

Changelog

  • chonkie-cli.py
    • New file: Entry point for the Chonkie command-line interface.
    • Adds project root to Python path.
  • demo_chefs.py
    • New file: Demonstrates usage of various 'Chefs' (Text, HTML, JSON, CSV, Markdown) and a basic CHOMP pipeline.
    • Includes examples for processing different content types.
  • demo_config.py
    • New file: Demonstrates saving and loading CHOMP pipeline configurations using JSON.
    • Shows how to create and use pipeline 'recipes' from configurations.
  • src/chonkie/chefs/__init__.py
    • Adds imports and __all__ list for new chef classes: PDFCleanerChef, CSVCleanerChef, JSONCleanerChef, HTMLCleanerChef, MarkdownCleanerChef.
  • src/chonkie/chefs/base.py
    • New file: Defines the abstract base class BaseChef.
    • Includes abstract methods is_available and preprocess.
  • src/chonkie/chefs/document.py
    • New file: Implements PDFCleanerChef for extracting text from PDF files.
    • Includes options for extracting metadata, page numbers, and handling tables.
    • Adds dependency check and lazy import for PyMuPDF (fitz).
  • src/chonkie/chefs/structured.py
    • New file: Implements JSONCleanerChef for processing JSON data.
    • Includes options for flattening, extracting specific fields, and joining text values.
    • Implements CSVCleanerChef for processing CSV data.
    • Includes options for custom delimiter, extracting columns (by name or index), handling headers, skipping lines, and joining columns.
  • src/chonkie/chefs/text.py
    • New file: Implements TextCleanerChef for basic text cleaning (whitespace, strip, lowercase, remove patterns).
    • Implements HTMLCleanerChef for converting HTML to text, with options for stripping tags and preserving line breaks (requires BeautifulSoup).
    • Implements MarkdownCleanerChef for converting Markdown to text, with options for stripping formatting and preserving headings/links (requires markdown).
  • src/chonkie/chomp/config.py
    • New file: Implements ChompConfig class for pipeline serialization/deserialization.
    • Provides static methods to serialize and deserialize pipeline components (chefs, chunker, refineries, porter, handshake).
    • Includes logic for handling special types like RecursiveRules and RecursiveLevel.
    • Adds static methods to save_json, load_json, save_pickle, and load_pickle configurations to/from files.
  • src/chonkie/chomp/parallel.py
    • New file: Implements ParallelProcessor class for parallel execution.
    • Uses concurrent.futures for thread or process pools.
    • Provides map and batch_process methods for parallel task execution.
  • src/chonkie/cli/__init__.py
    • New file: Initializes the CLI package.
    • Imports the main function from commands.
  • src/chonkie/cli/commands.py
    • New file: Implements the CLI command logic using argparse.
    • Defines 'chunk' command for processing files/directories with various options (chunker, size, format, parallel, overlap, log, verbose).
    • Defines 'config' command for creating and saving pipeline configurations.
    • Includes helper functions for format detection and pipeline creation from arguments.
  • src/chonkie/utils/logger.py
    • New file: Implements ChonkieLogger for structured logging.
    • Supports different log levels, console and file output, and custom formats.
    • Includes methods for performance timing (start_timer, end_timer) and logging pipeline steps.
  • src/chonkie/utils/visualizer.py
    • New file: Implements Visualizer for visualizing chunks and pipelines.
    • Provides methods to output visualizations in text, HTML, or JSON formats.
  • test_components.py
    • New file: Adds tests for the new utility and core components.
    • Includes tests for ChonkieLogger, ParallelProcessor, Visualizer, and CLI argument parsing.
  • test_pdf_chef.py
    • New file: Adds specific tests for the PDFCleanerChef.
    • Tests chef availability (PyMuPDF) and basic PDF text extraction.
  • tests/test_chomp.py
    • New file: Adds unit tests for the Chomp pipeline class.
    • Tests pipeline creation, adding/setting components, building, processing, and calling the pipeline.
    • Uses mock components for isolation.
  • tests/test_chomp_config.py
    • New file: Adds unit tests for the ChompConfig class.
    • Tests serialization and deserialization of pipeline configurations.
    • Tests saving and loading configurations to/from JSON and Pickle files.
  • tests/test_structured_chefs.py
    • New file: Adds unit tests for the JSONCleanerChef and CSVCleanerChef.
    • Tests various processing options for JSON (flatten, extract, join) and CSV (delimiter, columns, header, skip, join).

Pipeline's flow so neat,
Chefs and chunkers, can't be beat,
Data's journey's done.


Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a significant set of features for Chonkie's multi-step pipeline (CHOMP), including various Chef implementations for document and data preprocessing, a configuration system, parallel processing capabilities, a logger, a visualizer, and a command-line interface. The overall structure is modular and the additions are comprehensive.

The code is generally well-written, with good use of type hinting and docstrings in most places. The testing approach covers many new components, which is commendable.

I've identified a few areas for improvement, primarily concerning error handling consistency in Chefs, the use of global for optional dependencies, and some specific implementation details in HTML cleaning and configuration serialization. Addressing these points will enhance the robustness and maintainability of the new CHOMP system.

No specific style guide was provided, so feedback related to style generally follows PEP 8 for Python and common best practices.

Summary of Findings

  • Error Handling in Chefs: Several Chef classes (PDFCleanerChef, JSONCleanerChef, CSVCleanerChef) return error messages as strings when handle_errors=True. This can mask failures and lead to incorrect data processing downstream. It's recommended to raise exceptions or return explicit error indicators like None.
  • Use of global for Optional Dependencies: The use of global for lazy-loading optional dependencies (fitz, BeautifulSoup, markdown, csv) is generally discouraged as it can affect module state predictability and testability. Consider alternative patterns like class/instance attributes or standard top-level imports with try-except blocks.
  • HTMLCleanerChef Line Break Preservation: In HTMLCleanerChef, the preserve_line_breaks functionality appears to be negated by a subsequent re.sub(r'\s+', ' ', text) call, which collapses all whitespace, including newlines, into single spaces. This should be reviewed to ensure it behaves as intended.
  • Configuration Serialization Robustness: ChompConfig's serialization relies on __dict__ for component parameters and falls back to str() for unknown parameter types. This could be made more robust by using explicit serialization methods or stricter type checking.
  • Parallel Processing Error Reporting: ParallelProcessor.map appends exceptions to the results list, which requires callers to manually check item types. A clearer error reporting mechanism might be beneficial.
  • Insecure tempfile.mktemp Usage in Tests: test_components.py uses tempfile.mktemp(), which is insecure. tempfile.NamedTemporaryFile or tempfile.mkstemp() should be used instead.
  • Bare except: Clauses in Demo Code: demo_config.py uses bare except: clauses, which can catch system-exiting exceptions. It's better to catch Exception or more specific exception types.
  • PDFCleanerChef Table Handling: The handle_tables functionality in PDFCleanerChef currently doesn't perform any specific table processing. The implementation or documentation should be clarified.
  • Missing Newlines at End of File (Low Severity): Many Python files are missing a newline character at the end of the file. This is a minor PEP 8 style issue.
  • Unused Imports (Low Severity): Some files have unused imports (e.g., re in src/chonkie/chefs/document.py and src/chonkie/chefs/structured.py). These should be removed.

Merge Readiness

This pull request introduces substantial and valuable features to Chonkie. The implementation is largely solid. However, there are several high and medium severity issues identified, particularly around error handling in Chefs, the use of global for dependencies, and the HTML cleaning logic.

I recommend addressing these issues, especially the high severity ones, before merging to ensure the robustness and maintainability of the new CHOMP system. I am not authorized to approve pull requests, so please ensure further review and approval from authorized maintainers after addressing the feedback.

Comment on lines +171 to +172
text = re.sub(r'\s+', ' ', text)
text = text.strip()

Severity: high

The preserve_line_breaks option (if True) aims to keep newlines by replacing <br> and appending \n to <p> tags. However, text = re.sub(r'\s+', ' ', text) on line 171 will subsequently collapse all whitespace, including these preserved newlines, into single spaces. This seems to negate the effect of preserve_line_breaks.

If line breaks are intended to be preserved as \n, should this re.sub call be modified to not affect newlines, e.g., by targeting only spaces and tabs re.sub(r'[ \t]+', ' ', text) or by handling newline preservation after this step?
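The difference between the two regexes the reviewer describes can be shown directly; this small sketch reproduces the issue and the suggested fix:

```python
import re

text = "Hello   world\nNew   line"

# The current call collapses ALL whitespace, including preserved newlines:
collapsed = re.sub(r'\s+', ' ', text)   # 'Hello world New line'

# Targeting only spaces and tabs keeps the line breaks intact:
fixed = re.sub(r'[ \t]+', ' ', text)    # 'Hello world\nNew line'
```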

Comment on lines +350 to +353
if self.handle_errors:
    return f"Error processing CSV: {str(e)}"
else:
    raise

Severity: high

Similar to the JSONChef, returning an error message f"Error processing CSV: {str(e)}" as a string when handle_errors is true can hide the error from the pipeline, potentially leading to incorrect downstream processing.

What do you think about raising an exception or returning None/empty string and logging the error instead?

Comment on lines +164 to +167
if self.handle_errors:
    return f"Error parsing JSON: {str(e)}"
else:
    raise

Severity: high

When handle_errors is true, returning an error message f"Error parsing JSON: {str(e)}" as a string can mask failures from the caller, as it might be treated as valid processed text. This could lead to downstream issues.

Would it be more robust to raise the exception or return a more explicit error indicator (e.g., None or an empty string) while logging the error?
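The pattern the reviewer is suggesting, log the error and return an explicit failure marker instead of an error string, might look like this (an illustrative sketch, not the PR's code):

```python
import json
import logging

logger = logging.getLogger(__name__)

def preprocess(text: str, handle_errors: bool = True):
    """Return parsed JSON, None on failure, or re-raise when handle_errors is off."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        if handle_errors:
            logger.error("Error parsing JSON: %s", e)
            return None  # explicit failure marker, never fake content
        raise
```

Downstream pipeline steps can then check `is None` instead of accidentally chunking an error message as if it were document text.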

Comment on lines +138 to +141
if not self.extract_metadata:
    return f"Error extracting text from PDF: {str(e)}"
else:
    raise

Severity: high

Returning an error message string f"Error extracting text from PDF: {str(e)}" as if it's successfully processed content can be misleading for the caller. The pipeline might treat this error string as actual data. It's generally better to raise an exception or return a distinct value (like None or an empty string) and log the error.

What are your thoughts on raising an exception here, or perhaps logging the error and returning an empty string, to make error states more explicit?

Comment on lines +108 to +113
for table in tables.tables:
    rows = []
    for cells in table.cells:
        for cell in cells:
            # Process cells if needed
            pass

Severity: medium

The handle_tables parameter suggests specific table processing, but the loop for table in tables.tables: ... for cells in table.cells: ... for cell in cells: pass doesn't perform any operations. If page.get_text() already extracts table content sufficiently, this block might be redundant or misleading.

Could you clarify what handle_tables=True is intended to achieve beyond the default text extraction? If it's a placeholder for future enhancements, a NotImplementedError or a more descriptive comment might be appropriate.
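The reviewer's `NotImplementedError` suggestion would look roughly like this sketch (a hypothetical reduction of the chef, not the PR's actual class):

```python
class PDFCleanerChef:
    """Sketch: fail loudly while table handling remains unimplemented."""

    def __init__(self, handle_tables: bool = False):
        if handle_tables:
            # page.get_text() already includes table text; anything beyond
            # that is future work, so refuse rather than silently no-op.
            raise NotImplementedError("handle_tables=True is not implemented yet")
        self.handle_tables = handle_tables
```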

def _import_dependencies(self) -> None:
    """Import BeautifulSoup."""
    if self._bs4_available:
        global BeautifulSoup

Severity: medium

The use of global BeautifulSoup for lazy loading presents similar maintainability and testability concerns as noted for fitz and csv. Storing the imported module on the class or instance, or a top-level import if appropriate, would be preferable.

Could this be refactored to avoid global?
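One `global`-free alternative is to keep the lazily imported module on the instance. The sketch below uses the stdlib `csv` module to stay runnable, but `bs4` or `fitz` would follow the same pattern:

```python
import importlib

class CSVCleanerChef:
    """Sketch: store the lazily imported module on the instance, not in globals()."""

    def __init__(self):
        self._csv = None  # imported on first access

    @property
    def csv(self):
        if self._csv is None:
            self._csv = importlib.import_module("csv")
        return self._csv
```

This keeps module state untouched and lets tests inject a fake by setting `chef._csv` directly.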

}

# Default fallback: convert to string
return str(param)

Severity: medium

The fallback return str(param) for serializing parameters might lead to loss of information or non-deserializable representations for complex objects not explicitly handled by SPECIAL_TYPES. For a robust configuration system, it's often better to raise an error for unsupported types or require types to implement a specific serialization interface.

What are your thoughts on making this stricter, perhaps by raising a TypeError if a parameter type isn't explicitly serializable?
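A stricter fallback along the lines the reviewer proposes might look like this (an assumed sketch; `JSON_SAFE` and the function name are illustrative, not the PR's `SPECIAL_TYPES` machinery):

```python
JSON_SAFE = (str, int, float, bool, type(None))

def serialize_param(param):
    """Recursively serialize a parameter, rejecting anything that can't round-trip."""
    if isinstance(param, JSON_SAFE):
        return param
    if isinstance(param, (list, tuple)):
        return [serialize_param(p) for p in param]
    if isinstance(param, dict):
        return {k: serialize_param(v) for k, v in param.items()}
    raise TypeError(f"cannot serialize parameter of type {type(param).__name__}")
```

Raising `TypeError` surfaces unsupported types at save time, rather than producing a `str()` blob that fails (or silently misbehaves) at load time.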

{
    "type": type(chef).__name__,
    "module": type(chef).__module__,
    "params": {k: ChompConfig._serialize_param(v) for k, v in chef.__dict__.items()}

Severity: medium

Serializing component parameters using chef.__dict__.items() can be fragile. It might expose internal state unintended for configuration, or fail for classes using __slots__. A more robust approach is often to define an explicit to_config_dict() method on components, or to serialize only the parameters passed to __init__.

Could this be changed to a more explicit serialization method for components to ensure only relevant config parameters are saved?
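An explicit `to_config_dict()` along the lines suggested above could look like this hypothetical sketch, where only constructor parameters are exposed and internal state is kept out of the config:

```python
class TextCleanerChef:
    """Sketch: expose exactly the config a component was built from, not __dict__."""

    def __init__(self, lowercase: bool = False, strip: bool = True):
        self.lowercase = lowercase
        self.strip = strip
        self._cache = {}  # internal state that should never reach a config file

    def to_config_dict(self) -> dict:
        return {"lowercase": self.lowercase, "strip": self.strip}
```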

Comment on lines +65 to +67
except Exception as e:
    # Collect any exceptions that occurred
    results.append(e)

Severity: medium

Appending exceptions directly to the results list (e.g., results.append(e)) means the caller must inspect each item's type to differentiate between successful results and exceptions. This can make error handling cumbersome for the caller.

Consider either raising the first encountered exception, collecting exceptions in a separate list, or wrapping results in a structure that indicates success/failure (e.g., {'result': ..., 'error': None} or {'result': None, 'error': ...}). What do you think would be most user-friendly for the Chomp pipeline?
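The result/error wrapping option can be sketched as follows (illustrative only; `safe_map` is a hypothetical name, not part of the PR):

```python
from concurrent.futures import ThreadPoolExecutor

def safe_map(fn, items, workers: int = 4) -> list:
    """Map fn over items in parallel; every item yields a result/error pair."""
    def wrap(item):
        try:
            return {"result": fn(item), "error": None}
        except Exception as e:
            return {"result": None, "error": e}

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(wrap, items))
```

Callers then check `entry["error"] is None` uniformly instead of type-testing each list element.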


# Test direct file logging
print("\nTesting direct file logging...")
custom_log_file = tempfile.mktemp(suffix='.log')

Severity: medium

tempfile.mktemp() is considered insecure because of a potential race condition: another process could create a file with the same name between mktemp() returning the name and your script attempting to use it. It's recommended to use tempfile.NamedTemporaryFile(delete=False) (as you've done for log_file) or tempfile.mkstemp() instead.

Could this be changed to use tempfile.NamedTemporaryFile or mkstemp for creating custom_log_file?
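The `mkstemp` replacement the reviewer suggests would look like this:

```python
import os
import tempfile

# mkstemp creates AND opens the file atomically, closing the mktemp race window.
fd, custom_log_file = tempfile.mkstemp(suffix=".log")
os.close(fd)  # the logger reopens the path itself, so release our handle

# ... point the logger at custom_log_file and run the test ...

os.unlink(custom_log_file)  # clean up afterwards
```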

@chonknick
Contributor

Closing PR, since this is a core feature which requires a RFC. Please refer to the Contributing Guidelines.

Thanks 😊

@chonknick closed this Aug 2, 2025