This project trains models with reinforcement learning (RL) to search through email datasets and answer user queries. The goal is to create agents that could integrate with email providers (e.g., as a Gmail plugin), letting users ask questions like "what time does my wife's flight arrive on Friday" or "what are the next steps I committed to for project X" and receive accurate answers based on email content.
- Email database creation and management with SQLite
- Advanced search capabilities (keywords, date ranges, sender/recipient filters)
- Email question-answering with reinforcement learning
- Comprehensive model evaluation framework
- Performance benchmarking against multiple models
This project uses the Enron Email Dataset, downloaded from Kaggle. The dataset is processed and stored in a SQLite database with full-text search capabilities.
To generate training data, the system:
- Processes email inboxes of Enron employees
- Generates synthetic questions about email content
- Creates training/validation splits for model training
- Python 3.10+
- Key dependencies:
  - art (for reinforcement learning)
  - pandas, polars (data processing)
  - mailparser, kaggle (dataset handling)
  - sqlite3 (database)
  - litellm (inference)
  - datasets, tqdm (data management)
- API keys:
  - Kaggle (for dataset download)
  - Model providers for evaluation (OpenAI, etc.)
- S3 bucket to store logs and model checkpoints (set via the `BACKUP_BUCKET` env var)
```bash
# Clone repository
git clone https://github.com/OpenPipe/ART
cd ART/examples/jarvis-mail

# Install package
uv sync

# Create .env file with required variables
# BACKUP_BUCKET=your-s3-bucket-name
# OPENPIPE_API_KEY=your-key  # Optional, for logging
```
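The `kaggle` package looks for credentials in `~/.kaggle/kaggle.json` (or the `KAGGLE_USERNAME`/`KAGGLE_KEY` env vars). One way to set that up before running the dataset download below, assuming you've downloaded `kaggle.json` from your Kaggle account settings:

```bash
# Standard Kaggle API credential setup (path to kaggle.json is up to you)
mkdir -p ~/.kaggle
cp /path/to/kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
```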
The following commands were used to create the initial dataset. However, you can skip this if you just want to reproduce the results, since the processed dataset is freely hosted at https://huggingface.co/datasets/corbt/enron_emails_sample_questions.
```bash
# Download and process the Enron dataset (default: 100 emails)
python -m email_deep_research.data.convert_enron_email_dataset

# Process more emails
python -m email_deep_research.data.convert_enron_email_dataset --max-emails 10000

# Generate SQLite database
python -c "from email_deep_research.data.local_email_db import generate_database; generate_database(overwrite=True)"
```
I used skypilot with the Runpod backend to train these models. Once you've authenticated Runpod for use with skypilot, the following command should start a training job that replicates our reported results:

```bash
uv run run_training_job.py 008 --fast
```

You can see the other model variants I tried training in `train.py`.
```python
# Benchmark a model
from email_deep_research.evaluate.benchmark import benchmark_model
from email_deep_research.project_types import ProjectPolicyConfig
import asyncio
import art

# Create model
model = art.Model(
    name="gpt-4o",  # Can also use your trained models
    project="email_agent",
    config=ProjectPolicyConfig(
        litellm_model_name="openai/gpt-4o",
        use_tools=True,
    ),
)

# Run benchmark
results = asyncio.run(benchmark_model(model))
print(results)
```
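The same pattern works for benchmarking other providers. Continuing from the snippet above, a sketch for a comparison model (the model identifiers here are illustrative; use whichever litellm-supported models you have API keys for):

```python
# Benchmark a comparison model (names/identifiers are illustrative)
gemini = art.Model(
    name="gemini-2.5-pro",
    project="email_agent",
    config=ProjectPolicyConfig(
        litellm_model_name="gemini/gemini-2.5-pro",
        use_tools=True,
    ),
)
print(asyncio.run(benchmark_model(gemini)))
```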
- `data/`: Dataset processing and management
  - `convert_enron_email_dataset.py`: Downloads/processes Enron dataset
  - `local_email_db.py`: SQLite database creation and management
  - `generate_synthetic_question_data.py`: Creates question-answer pairs
  - `query_iterators.py`: Loads datasets for training/evaluation
  - `types_enron.py`: Data models for emails and queries
- `email_search_tools.py`: Tools for searching and retrieving emails
- `evaluate/`: Model evaluation
  - `benchmark.py`: Performance benchmarking
  - `evaluate.py`: Analysis and visualization tools
- `train.py`: Main training script
- `rollout.py`: Defines model-environment interaction
- `project_types.py`: Configuration classes
- Data Preparation: Process Enron emails into a searchable database
- Question Generation: Create synthetic questions about emails
- Training: Models learn via reinforcement learning to:
  - Search for relevant emails using keywords
  - Read email content to extract information
  - Formulate correct answers with proper citations
- Rewards: System provides rewards based on answer correctness, sourcing, and efficiency (illustrated in the sketch after this list)
- Evaluation: Compare models on metrics like accuracy, turn count, and source citation
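To make the reward structure concrete, here is a simplified, illustrative sketch. The weights and conditions are hypothetical, not the actual reward logic in `rollout.py`:

```python
def example_reward(
    answer_correct: bool,
    cited_correct_source: bool,
    num_turns: int,
    max_turns: int = 10,
) -> float:
    """Illustrative reward combining correctness, sourcing, and efficiency.

    NOTE: hypothetical weights for illustration only; see rollout.py for the
    reward actually used during training.
    """
    reward = 0.0
    if answer_correct:
        reward += 1.0  # primary signal: the answer matches the ground truth
        if cited_correct_source:
            reward += 0.5  # bonus for citing the email the answer came from
        # small efficiency bonus for finishing in fewer turns
        reward += 0.1 * (1.0 - num_turns / max_turns)
    return reward
```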
The evaluation framework tracks:
- Answer correctness (semantic match to ground truth; see the judge sketch after this list)
- Source citation accuracy (identifying the correct email)
- Efficiency (number of turns to find answer)
- Tool use effectiveness
- Search strategy quality
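Answer correctness is judged semantically rather than by exact string match; one common way to do this is with an LLM judge. A minimal sketch using `litellm` (the prompt and judge model are illustrative, not the project's actual evaluation code):

```python
import litellm


def judge_correctness(question: str, answer: str, ground_truth: str) -> bool:
    """Ask an LLM judge whether the answer semantically matches the ground truth.

    Illustrative sketch: prompt wording and judge model are assumptions.
    """
    response = litellm.completion(
        model="openai/gpt-4o",  # judge model; any litellm-supported model works
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n"
                f"Reference answer: {ground_truth}\n"
                f"Candidate answer: {answer}\n"
                "Does the candidate answer convey the same information as the "
                "reference? Reply with exactly YES or NO."
            ),
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```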
Models are benchmarked against commercial LLMs like GPT-4.1 and Gemini 2.5 Pro to measure relative performance.