The Environmental Emissions Analyzer, also known as the "Report Revolutionizer," is a powerful Streamlit application designed to transform dense, unstructured environmental emissions reports into actionable insights. Leveraging cutting-edge AI, including Large Language Models (LLMs) via LangChain and LangGraph, this tool automates data extraction, provides rich visualizations, generates in-depth analyses augmented by web knowledge, and enables users to query reports using a sophisticated RAG (Retrieval Augmented Generation) system.
The core mission is to make complex environmental data accessible, understandable, and interactive, empowering users to quickly grasp key information, identify trends, and make informed decisions in minutes rather than hours.
- **Versatile Data Ingestion:**
  - Upload PDF reports directly.
  - Search the web for publicly available reports using keywords.
- **Automated Structured Data Extraction:**
  - Intelligently identifies and extracts key emissions data points (e.g., emission type, amount, unit, region, year) from unstructured text.
- **Dynamic Visualizations:**
  - Converts extracted data into beautiful, interactive charts and graphs (bar charts, line charts, pie charts) for easy comprehension of trends and distributions.
- **AI-Powered Summarization:**
  - Generates concise executive summaries of the key findings from the processed reports.
- **In-Depth Analysis with Web Augmentation:**
  - Provides a detailed analytical report that goes beyond the document's content by contextualizing findings, identifying trends, and discussing potential data gaps, often leveraging broader internet knowledge for richer insights.
- **Interactive RAG Q&A:**
  - Allows users to "chat" with their documents: ask specific questions about the report content and receive precise answers sourced directly from the document, powered by a FAISS vector store and retrieval QA chains.
- **Multi-Agent Architecture (LangGraph):**
  - Employs a workflow orchestrated by LangGraph, in which specialized AI agents collaborate on ingestion, extraction, vector store creation, summarization, and analysis, keeping the process efficient and modular.
- **User-Friendly Interface:**
  - Built with Streamlit for an intuitive and interactive web application experience.
- **Configurable & Extensible:**
  - Easily configure API keys and model preferences. The modular design allows for future enhancements and integration of new capabilities.
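To make the extraction target concrete, here is a minimal sketch of what one structured emissions data point might look like, plus a tiny aggregation helper of the kind that feeds a bar chart. The field names, the `EmissionRecord` dataclass, and `totals_by_region` are illustrative assumptions, not the app's actual schema:

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class EmissionRecord:
    """One extracted data point (illustrative schema, not the app's exact one)."""
    emission_type: str   # e.g. "CO2", "CH4"
    amount: float        # numeric quantity
    unit: str            # e.g. "kt CO2e"
    region: str          # reporting region
    year: int            # reporting year

def totals_by_region(records):
    """Sum amounts per region, e.g. to feed a bar chart."""
    totals = defaultdict(float)
    for r in records:
        totals[r.region] += r.amount
    return dict(totals)

records = [
    EmissionRecord("CO2", 120.0, "kt CO2e", "EU", 2022),
    EmissionRecord("CH4", 30.0, "kt CO2e", "EU", 2022),
    EmissionRecord("CO2", 95.5, "kt CO2e", "APAC", 2022),
]
print(totals_by_region(records))  # {'EU': 150.0, 'APAC': 95.5}
```

In the app itself, aggregations like this are done with pandas DataFrames and rendered with Plotly, but the underlying shape of the data is the same.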
Here's a glimpse of the Report Revolutionizer in action:

- Landing Page & Input Selection
- Data Extraction & Processing Feedback
- Structured Data Table
- Visualizations: Emissions by Region & Year
- Visualizations: Emissions by Type (Pie Chart)
- Visualizations: Total Emissions Trend Over Time
- AI-Generated Summary
- In-Depth AI Analysis
- RAG Q&A Feature
The application utilizes a multi-agent system orchestrated by LangGraph. Each agent has a specific responsibility:
- **Ingestion Agent:**
  - Role: Loads data from the chosen source (PDF or web search results). It handles file reading, web scraping (via Serper and `UnstructuredURLLoader`/BeautifulSoup), and initial text aggregation.
  - Output: Raw text content and a list of LangChain `Document` objects.
- **Extraction Agent:**
  - Role: Processes the raw text to identify and extract structured emissions data (emission type, amount, unit, region, year, source snippet) using an LLM with a specific prompt.
  - Output: A list of dictionaries, each representing an extracted emissions data point.
- **Vector Store Agent:**
  - Role: Takes the `Document` objects (or creates them from raw text if needed) and builds a FAISS vector store using Google Generative AI Embeddings. This store is crucial for the RAG Q&A functionality.
  - Output: A FAISS vector store instance.
- **Summarization Agent:**
  - Role: Generates a concise executive summary of the key environmental emissions insights, based on either the extracted structured data or the raw text, using an LLM.
  - Output: A string containing the summary.
- **In-Depth Analysis Agent:**
  - Role: Produces a more detailed analytical report. It considers the summary and structured data, and uses an LLM to discuss trends, contributors, regional hotspots, data completeness, and significant figures, potentially drawing on its general knowledge to enrich the analysis.
  - Output: A Markdown string containing the in-depth analysis.
- **Q&A Processing Agent (Standalone Functionality):**
  - Role: Handles user queries in the Q&A tab. It uses the FAISS vector store (if available) and a `RetrievalQA` chain to find relevant information in the processed documents and generate an answer. It can also fall back to raw text or structured data if the vector store is unavailable.
  - Output: A string containing the answer to the user's question.
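The fallback order described for the Q&A agent can be sketched in plain Python. This is a toy illustration of the control flow only — `answer_question`, its parameters, and the keyword-matching fallback are assumptions for the sketch, not the app's implementation (the real path runs a `RetrievalQA` chain over the FAISS store):

```python
def answer_question(question, vector_store=None, raw_text="", structured_data=None):
    """Illustrates the fallback order: vector store -> raw text -> structured data."""
    if vector_store is not None:
        # In the app this would invoke a RetrievalQA chain over the FAISS store.
        return vector_store.query(question)
    if raw_text:
        # Degraded mode: return sentences mentioning any word from the question.
        hits = [s.strip() for s in raw_text.split(".")
                if any(w.lower() in s.lower() for w in question.split())]
        return ". ".join(hits) or "No relevant passage found."
    if structured_data:
        return f"{len(structured_data)} structured records available; no free-text answer."
    return "No data has been processed yet."

print(answer_question("CO2 emissions",
                      raw_text="Total CO2 emissions rose in 2022. Methane fell."))
```

The point of structuring the agent this way is that a missing or failed vector store degrades the answer quality instead of breaking the Q&A tab entirely.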
| Package | Version (Example) | Use Case |
|---|---|---|
| `streamlit` | `>=1.30.0` | Building the interactive web user interface. |
| `python-dotenv` | `>=1.0.0` | Loading environment variables (API keys) from a `.env` file. |
| `pandas` | `>=2.0.0` | Data manipulation; creating DataFrames for structured data display. |
| `plotly` | `>=5.15.0` | Generating interactive charts and visualizations. |
| `requests` | `>=2.30.0` | Making HTTP requests (used as a fallback for URL content fetching). |
| `beautifulsoup4` | `>=4.12.0` | Parsing HTML and XML (used for basic web scraping). |
| `langchain-google-genai` | `>=1.0.0` | Interacting with Google's Generative AI models (Gemini) for LLM and embeddings. |
| `langchain-community` | `>=0.0.30` | Community integrations: `PyMuPDFLoader`, `UnstructuredURLLoader`, `FAISS`, `GoogleSerperAPIWrapper`. |
| `langchain` | `>=0.1.10` | Core LangChain functionality: prompts, chains, text splitters. |
| `langgraph` | `>=0.0.30` | Building and orchestrating the multi-agent workflow. |
| `pymupdf` | `>=1.23.0` | Loading and parsing PDF files (`PyMuPDFLoader`). |
| `unstructured` | `>=0.12.0` | Advanced document parsing, especially for URLs (`UnstructuredURLLoader`). |
| `faiss-cpu` | `>=1.7.0` | Efficient similarity search and vector store creation (CPU version). |
| `google-search-results` | `>=2.4.0` | Wrapper for the Serper Google Search API. |
| `tiktoken` | (implicit) | Tokenizer used by LangChain text splitters and for LLM context management. |
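For reference, the table above corresponds roughly to a `pyproject.toml` dependency list like the following. The project name and `requires-python` floor are assumptions for the sketch; the version bounds are the example values from the table, not tested pins:

```toml
[project]
name = "environmental-analyzer"   # assumed name
version = "0.1.0"
requires-python = ">=3.10"        # assumed floor
dependencies = [
    "streamlit>=1.30.0",
    "python-dotenv>=1.0.0",
    "pandas>=2.0.0",
    "plotly>=5.15.0",
    "requests>=2.30.0",
    "beautifulsoup4>=4.12.0",
    "langchain-google-genai>=1.0.0",
    "langchain-community>=0.0.30",
    "langchain>=0.1.10",
    "langgraph>=0.0.30",
    "pymupdf>=1.23.0",
    "unstructured>=0.12.0",
    "faiss-cpu>=1.7.0",
    "google-search-results>=2.4.0",
]
```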
This flowchart illustrates the main data processing pipeline orchestrated by LangGraph:
```mermaid
flowchart TD
    A[Start: User Input] --> B{Ingestion Method}
    B -->|Upload PDF| C[PDF Upload Handler]
    B -->|Search Web| D[Keyword-based Web Crawler]
    C --> E[Data Preprocessing]
    D --> E
    E --> F["LLM-Based Data Extraction<br/>(emission type, amount, unit, etc.)"]
    F --> G["Create Vector Store (FAISS)"]
    F --> H[Generate Structured Dataset]
    G --> I["Interactive RAG Q&A<br/>(LangChain + Retrieval QA)"]
    H --> J["Dynamic Visualizations<br/>(Charts & Graphs)"]
    F --> K["AI-Powered Summarization<br/>(Executive Summary)"]
    F --> L["Web-Augmented Analysis<br/>(Contextual Insights & Trends)"]
    I --> M["User Interface (Streamlit)"]
    J --> M
    K --> M
    L --> M
    subgraph LangGraph Orchestration
        E
        F
        G
        H
        K
        L
    end
    M --> N[User Decisions & Actions]
```
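The essence of the orchestration is a shared state that each node reads and updates in turn. The sketch below mimics that shape in plain Python; the node names, state keys, and hard-coded `PIPELINE` list are illustrative stand-ins — the real app wires its nodes and edges with LangGraph's `StateGraph` rather than a simple loop:

```python
# Each "node" takes the shared state dict, updates it, and returns it,
# mirroring how LangGraph threads state through the agent graph.

def ingest(state):
    state["raw_text"] = f"report text from {state['source']}"
    return state

def extract(state):
    # Stand-in for the LLM-based extraction step.
    state["records"] = [{"emission_type": "CO2", "amount": 1.0}]
    return state

def summarize(state):
    state["summary"] = f"{len(state['records'])} data point(s) extracted."
    return state

PIPELINE = [ingest, extract, summarize]  # edges: ingest -> extract -> summarize

def run(source):
    state = {"source": source}
    for node in PIPELINE:
        state = node(state)
    return state

result = run("report.pdf")
print(result["summary"])  # 1 data point(s) extracted.
```

Keeping every agent's inputs and outputs in one state object is what makes the workflow modular: nodes can be added, removed, or branched without rewriting their neighbors.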
- Frontend: Streamlit
- Backend Logic: Python
- AI/LLM Orchestration: LangChain, LangGraph
- LLM & Embeddings: Google Gemini (via `langchain-google-genai`)
- Vector Store: FAISS
- Web Search: Serper API
- Document Parsing: PyMuPDF, Unstructured, BeautifulSoup
- Data Handling: Pandas
- Visualization: Plotly Express
- Environment Management: `uv`
- Support for More Document Types: Extend ingestion to support `.docx`, `.txt`, `.html` files, etc.
- Advanced Data Cleaning & Normalization: Implement more robust cleaning for extracted amounts, units, and dates to improve visualization accuracy.
- Comparative Analysis: Allow users to upload multiple reports and perform comparative analyses between them.
- Trend Extrapolation: If sufficient time-series data is available, attempt to predict future emission trends.
- Customizable Extraction Schemas: Allow users to define or select different schemas for data extraction based on report type (e.g., financial reports, sustainability reports).
- Agent Memory & Statefulness: Explore more persistent memory for agents for more complex, multi-turn interactions or analyses.
- User Authentication & Saved Reports: Implement user accounts to save processed reports and analyses.
- Fine-tuning LLMs: For very specific report formats or analysis needs, explore fine-tuning smaller LLMs on domain-specific data.
- Deployment: Provide clear instructions or scripts for deployment on platforms like Streamlit Community Cloud, Hugging Face Spaces, or cloud VMs.
- Enhanced Error Handling & Feedback: More granular error messages and user guidance.
Create a `.env` file in the root of your project directory with your API keys:

```env
GOOGLE_API_KEY="YOUR_GOOGLE_GEMINI_API_KEY"
SERPER_API_KEY="YOUR_SERPER_DEV_API_KEY"
MODEL="gemini-1.5-flash-latest" # Or your preferred Gemini model

# Optional: Fallback keys if you want to define them directly (not recommended for production)
# DEFAULT_GOOGLE_API_KEY_FALLBACK="YOUR_FALLBACK_GOOGLE_API_KEY_HERE_IF_NEEDED"
# DEFAULT_SERPER_API_KEY_FALLBACK="YOUR_FALLBACK_SERPER_API_KEY_HERE_IF_NEEDED"
```

**Important:** Add `.env` to your `.gitignore` file to prevent committing your API keys.
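At startup, values like these are typically read with `python-dotenv` (listed in the dependency table). A minimal sketch, assuming the `.env` file sits in the working directory; the `get_setting` helper is an illustrative convenience, not part of the app's API:

```python
import os

try:
    from dotenv import load_dotenv  # python-dotenv
    load_dotenv()                   # reads .env from the working directory, if present
except ImportError:
    pass  # fall back to plain environment variables

def get_setting(name, default=None):
    """Read a config value from the environment, with an explicit default."""
    return os.getenv(name, default)

# Example: prefer the .env value, fall back to a sensible default model name.
model = get_setting("MODEL", "gemini-1.5-flash-latest")
```

Reading keys through the environment (rather than hard-coding them) is what lets the same code run locally with a `.env` file and on hosting platforms that inject secrets as environment variables.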
This project uses `uv` for efficient Python environment and package management.

1. **Install `uv`:** If you don't have `uv` installed, follow the instructions at astral.sh/uv. A common method is:

   ```bash
   curl -LsSf https://astral.sh/uv/install.sh | sh
   ```

   Or via pip (if you have a base Python accessible):

   ```bash
   pip install uv
   ```

2. **Navigate to the Project Directory:** Open your terminal and change to the root directory of this project (where `pyproject.toml` is located):

   ```bash
   cd path/to/environmental_analyzer
   ```

3. **Create a Virtual Environment:**

   ```bash
   uv venv
   ```

   This creates a `.venv` directory in your project.

4. **Activate the Virtual Environment:**

   - On macOS/Linux: `source .venv/bin/activate`
   - On Windows (Command Prompt): `.venv\Scripts\activate.bat`
   - On Windows (PowerShell): `.venv\Scripts\Activate.ps1`

   You should see `(.venv)` at the beginning of your terminal prompt.

5. **Install Dependencies:** `uv` will use the `pyproject.toml` file to install the required packages:

   ```bash
   uv pip install .
   ```

   Alternatively, if you have a `requirements.txt` (which can be generated from `pyproject.toml` using `uv pip freeze > requirements.txt`):

   ```bash
   uv pip install -r requirements.txt
   ```

6. **Run the Application:** Once the environment is set up and dependencies are installed:

   ```bash
   streamlit run app.py
   ```
This project is licensed under the MIT License. (Consider adding a `LICENSE` file with the MIT license text.)