Skip to content

KokTeng00/PDF-Translation

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

24 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PDF Translation Project

The PDF Translation Project is an end-to-end solution for extracting text from PDF documents and translating it into a target language (default: Chinese) using OpenAI’s API. It integrates robust text extraction, a powerful translation service, and a user-friendly web interface. Additionally, the project provides optional capabilities for fine-tuning translation models using state-of-the-art techniques with mT5 and VBLoRA.

Table of Contents

Overview

The project provides a comprehensive pipeline to:

  • Extract text from PDFs, including Optical Character Recognition (OCR) for embedded images.
  • Translate extracted text using an OpenAI-powered service.
  • Display translations through an intuitive web interface.
  • (Optional) Fine-tune your translation model using mT5 and VBLoRA for enhanced performance.

Features

  • PDF Extraction:
    Leverages robust methods for text extraction. For PDF extraction, OpenAI is used to summarize or extract text from images instead of conventional OCR techniques.

  • Translation:
    Translates the extracted text into the desired language using OpenAI’s API.
    Default target language is Chinese.

  • Web User Interface:
    Provides a simple and responsive interface for PDF uploads and displaying translated content.
    Note: When you click on the "Extract & Translate" button, please wait a minute as the system processes the translation until the translated text is shown.

  • Model Fine-Tuning (Optional):
    Experiment with fine-tuning your translation model using mT5 along with VBLoRA for parameter-efficient training.
    Note: Initial experiments were paused due to high computational costs and resource limitations. However, the code is fully runnable if you have sufficient resources, so feel free to try it.

Project Structure

PDF-Translation/
├── app.py                      # Main FastAPI application exposing API endpoints.
├── crawler/
│   └── pdf_crawler.py          # Utilities for PDF crawling and text extraction.
├── llm/
│   ├── openai_translation.py   # Translation service utilizing OpenAI's API.
│   └── mt5_translation.py      # Script for fine-tuning mT5 using VBLoRA (optional).
├── web/
│   ├── main.html               # HTML file for the web interface.
│   └── main.js                 # JavaScript handling file uploads and API interactions.
│   └── styles.css              # CSS script for frontend designing and styling.
└── ReadMe.md                   # Project documentation.

Installation

Clone the Repository

git clone https://github.com/yourusername/PDF-Translation.git
cd PDF-Translation

Create & Activate a Virtual Environment

python3 -m venv venv
source venv/bin/activate

Install Dependencies

Ensure that your requirements.txt file includes packages such as FastAPI, Uvicorn, Mangum, openai, Pillow, Transformers, and Datasets. Then run:

pip install -r requirements.txt

Configuration

  • OpenAI API Key:
    Update the OPENAI_API_KEY constant in both app.py and llm/openai_translation.py with your valid OpenAI API key.

  • Model Settings:
    The default translation model is configured in llm/openai_translation.py. Modify the settings if a different model is desired.

Running the Application

Backend API

  1. Start the FastAPI Server:

    uvicorn app:app --reload
  2. API Endpoints:

    • /extract_text: Endpoint for PDF text extraction.
    • /translate: Endpoint for text translation.

    The API will be available at:
    http://127.0.0.1:8000

Web Interface

  1. Ensure the backend API is running.

  2. Change to the web Directory:

    cd web
  3. Serve the Static Files using Python’s HTTP Server:

    python3 -m http.server 8000
  4. Open your Browser and Navigate to:

    http://localhost:8000/main.html
    

Usage

  • PDF Upload:
    Use the web interface to upload a PDF file. The backend API extracts text (using OpenAI to either summarize or extract text from images) instead of traditional OCR methods.

  • Translation:
    The extracted text is sent to the /translate endpoint, translated using the OpenAI-powered service, and the resulting translation is displayed in the web interface.
    Tip: After clicking "Extract & Translate," please allow up to a minute for the processing to complete before the translated text is shown.

Fine-Tuning with mT5 and VBLoRA

For users interested in custom translation models, the repository includes optional scripts for fine-tuning:

  • mT5:
    A multilingual text-to-text transformer pre-trained on extensive multilingual datasets. It serves as a robust base model for translation tasks across multiple languages.

  • VBLoRA:
    Stands for Varying Bottleneck Low-Rank Adaptation. This technique allows for efficient fine-tuning by adapting only a small subset of model parameters, reducing computational costs and mitigating catastrophic forgetting.

The script llm/mt5_translation.py demonstrates how to integrate VBLoRA with mT5 for resource-friendly fine-tuning.
Note: Due to high computational costs and resource limitations, initial experiments were paused after a few optimization trials. However, if you have sufficient resources, the code is fully runnable for training your own model. Feel free to explore and extend this capability.

Troubleshooting

  • 422 Unprocessable Entity:
    Verify that JSON requests to endpoints (e.g., /translate) conform to the expected Pydantic models.

  • OpenAI API Errors:
    Confirm that your OpenAI API key is valid and that the model configuration aligns with supported parameters.

Acknowledgments

  • Built with FastAPI
  • Powered by OpenAI
  • Fine-tuning experiments use mT5 enhanced with VBLoRA.
  • Special thanks to all contributors and the open-source community.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published