B.Sc. ICT AI Dissertation University of Malta (ICT3909) – Accurate name extraction from broadcast news videos using YOLOv12 for graphic detection, OCR for text recognition, and NER models for identifying personal names. The system processes video frames, extracts names, and generates a structured name timeline.

AFLucas-UOM/Accurate-Name-Extraction

ANEP: Accurate Name Extraction from News Video Graphics

This repository contains the full implementation of ANEP (Accurate Name Extraction Pipeline), a hybrid Deep Learning (DL) and Generative AI (GenAI) system for extracting personal names from graphical overlays in broadcast and social media news videos.

Python 3.10+ MIT License

Google Cloud Vision API Gemini 1.5 Pro API LLaMA 4 Maverick via OpenRouter

📌 Project Overview

In today’s fast-paced digital news ecosystem, crucial information, such as the names of individuals featured in stories, is often displayed visually through broadcast graphics rather than spoken aloud. These elements appear in the form of lower-thirds, tickers, headlines, and other on-screen text overlays. However, their inconsistent styles, short display times, and frequent visual clutter make automated extraction of names a highly challenging task.

This project addresses that challenge through a novel, two-pronged solution for accurate name extraction from news video graphics:

  • ANEP Pipeline
    A custom-built pipeline that integrates:

    • YOLOv12 for detecting news-related graphical elements (e.g. headlines, tickers, etc.).
    • Tesseract OCR with advanced preprocessing (CLAHE, thresholding, de-noising) to extract text from detected regions.
    • Transformer-based Named Entity Recognition (NER) models (e.g. BERT, spaCy + GLiNER) to identify and validate personal names in noisy OCR output.
    • Clustering techniques to consolidate name variants and generate structured appearance timelines.
  • GenAI Pipelines
    Parallel pipelines built using:

    • Google Cloud Vision API for high-accuracy OCR,
    • Gemini 1.5 Pro and LLaMA 4 Maverick, two powerful large multimodal models capable of extracting and reasoning over names directly from video frames.

    These models are evaluated as alternatives to classical CV-NLP pipelines, with a focus on name extraction accuracy, runtime performance, and robustness to visual noise.
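The final ANEP step listed above, turning per-frame name detections into an appearance timeline, can be sketched roughly as follows. This is an illustration only, not the repository's implementation: the function name `build_timeline`, the `(timestamp, name)` input shape, and the `max_gap` value are all assumptions.

```python
from collections import defaultdict


def build_timeline(detections, max_gap=2.0):
    """Merge per-frame detections into appearance intervals per name.

    detections: iterable of (timestamp_seconds, name) pairs (hypothetical shape).
    max_gap: detections of the same name no more than this many seconds apart
             are treated as one continuous appearance (assumed value).
    """
    by_name = defaultdict(list)
    for timestamp, name in detections:
        by_name[name].append(timestamp)

    timeline = {}
    for name, stamps in by_name.items():
        stamps.sort()
        intervals = [[stamps[0], stamps[0]]]
        for t in stamps[1:]:
            if t - intervals[-1][1] <= max_gap:
                intervals[-1][1] = t        # extend the current appearance
            else:
                intervals.append([t, t])    # start a new appearance
        timeline[name] = [tuple(i) for i in intervals]
    return timeline


# Placeholder names and timestamps, for illustration only.
detections = [(0.0, "Jane Doe"), (1.0, "Jane Doe"), (2.0, "Jane Doe"),
              (10.0, "John Smith"), (30.0, "Jane Doe")]
print(build_timeline(detections))
```

Here "Jane Doe" yields two intervals, because the detection at 30 seconds is further than `max_gap` from the previous one.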

By combining traditional deep learning (DL) with cutting-edge GenAI, this project contributes a robust, scalable system for extracting names from video media, with direct applications in media monitoring, automated news summarisation, and AI-based fact-checking.

ANEP Architecture Overview

%%{init: {
  "themeVariables": {
    "fontSize":       "16px",
    "edgeLabelFontSize": "14px",
    "edgeLabelColor": "#37474F"
  }
}}%%

flowchart TB
  %% darker text shades on same fills
  classDef user      fill:#BBDEFB,stroke:#1976D2,stroke-width:2px,color:#0D47A1;
  classDef process   fill:#C8E6C9,stroke:#2E7D32,stroke-width:2px,color:#1B5E20;
  classDef datastore fill:#FFECB3,stroke:#FFA000,stroke-width:2px,color:#EF6C00;

  %% nodes
  User[User]:::user
  SM((Select Model)):::process
  UV((Upload Video)):::process
  D1[(D1: Uploaded Video)]:::datastore
  CS((Confirm Settings)):::process
  RA((Run Analysis)):::process
  Backend[Backend API]:::user
  D3[(D3: NGD)]:::datastore
  D2[(D2: Analysis Results)]:::datastore
  VR((View Results)):::process

  %% flows
  User -->|Model selection| SM
  User -->|Video file| UV

  UV -->|Video + metadata| D1

  D1 -->|Video metadata| CS
  SM -->|Selected model ID| CS

  CS -->|Confirmed settings| RA
  D1 -->|Video file| RA

  RA -->|Video + model ID| Backend

  Backend -->|Training/inference data| D3
  Backend -->|Processed results| D2

  D2 -->|Extracted names,<br>timestamps,<br>confidence scores| VR
  Backend -->|Log/progress stream| VR

  VR -->|Downloaded results| User

🔎 Key Features

  • Intelligent Frame Sampling & Deduplication
    Efficiently processes long videos using DCT-based perceptual hashing and ORB features to identify and retain only visually distinct frames, reducing redundancy while preserving key content.

  • YOLOv12-based Graphic Detection
    Fine-tuned YOLOv12 model trained on a custom dataset detects six distinct classes of broadcast graphics: Breaking News, Digital On-Screen Graphics, Lower Thirds, Headlines, News Tickers, and Other Graphics.

  • Custom Annotated Dataset: NGD (News Graphics Dataset)
    Purpose-built dataset containing 1,500+ annotated frames sourced from local and international news videos, across six classes: Breaking News, Lower Thirds, News Ticker, Digital On-Screen Graphics, Headlines, and Other.

  • OCR with Adaptive Preprocessing
    Applies multi-method image preprocessing (CLAHE, thresholding, morphological operations, noise reduction) to maximise text clarity prior to recognition. Tesseract OCR is used with confidence scoring.

  • Named Entity Recognition (NER)
    Combines spaCy with GLiNER (for zero-shot multilingual NER) and a fine-tuned Transformer model to identify real-world person names from noisy OCR text. Includes heuristic and linguistic validation.

  • Name Clustering & Deduplication
    Clusters name variants using fuzzy string matching, token-based distance (Jaccard), and embedding-based cosine similarity to generate accurate, canonical name lists and appearance timelines.

  • GenAI Integration
    Alternative pipelines using:

    • Google Cloud Vision API + Gemini 1.5 Pro
    • LLaMA 4 Maverick via OpenRouter

    These systems extract names directly from video frames using multimodal reasoning and structured prompts.

  • Survey Dashboard & Evaluation Metrics
    Includes a dedicated visualisation dashboard for survey findings on news consumption trends. Evaluation metrics include precision, recall, F1-score, and runtime comparisons across pipelines.

  • Progressive Web App (PWA)
    Fully featured frontend built with React, Tailwind CSS, and Vite. Provides a clean, step-by-step UI for uploading videos, selecting models, and visualising extracted results.
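The name clustering feature described above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the project's code: it covers only fuzzy string matching (via `difflib`) and token-level Jaccard similarity, leaves out the embedding-based cosine similarity, and the helper names, the greedy strategy, and the `threshold` value are all hypothetical.

```python
from difflib import SequenceMatcher


def jaccard(a, b):
    """Jaccard similarity between the token sets of two names."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def fuzzy(a, b):
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_names(names, threshold=0.8):
    """Greedy single-pass clustering: a name joins the first cluster whose
    first member it matches on either measure, else it starts a new cluster."""
    clusters = []
    for name in names:
        for cluster in clusters:
            rep = cluster[0]
            if max(fuzzy(name, rep), jaccard(name, rep)) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def canonical(cluster):
    """Pick the most frequent variant; ties go to the longest spelling."""
    return max(set(cluster), key=lambda n: (cluster.count(n), len(n)))


# Placeholder OCR outputs, including one misread ("Marla") and one all-caps variant.
names = ["Maria Borg", "Maria Borg", "Marla Borg", "MARIA BORG", "Joseph Camilleri"]
for cluster in cluster_names(names):
    print(canonical(cluster), "<-", cluster)
```

Taking the maximum of the two measures lets a reordered name ("Borg Maria") match through Jaccard, while a single-character OCR error still matches through the character-level ratio.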

🎯 Object Detection Performance (YOLO Models)

| Model | Precision | Recall | mAP@0.5 | mAP@0.5:0.95 | Epochs | Type |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv12(m) 🥇 | 93.9% | 93.5% | 95.8% | 88.7% | 102 | Local |
| YOLOv8(m) | 92.6% | 86.9% | 93.7% | 75.2% | 47 | Local |
| YOLOv12(n) 🥈 | 91.6% | 90.8% | 93.8% | 85.4% | 120 | Cloud |
| YOLOv11(n) | 91.2% | 90.4% | 93.1% | 84.9% | 100 | Cloud |
| YOLOv12(n) Reflect | 91.4% | 85.7% | 91.8% | 80.4% | 72 | Cloud |
| YOLO-NAS(n) | 85.1% | 84.3% | 91.0% | 61.0% | 51 | Cloud |
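For readers unfamiliar with the two mAP columns: a predicted box counts as a match when its Intersection over Union (IoU) with a ground-truth box reaches a threshold. The first mAP column uses a threshold of 0.5, and the second averages over thresholds from 0.5 to 0.95 in steps of 0.05. The sketch below shows only the IoU check, not a full mAP computation; the `(x1, y1, x2, y2)` box format and the example coordinates are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union for two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# IoU thresholds averaged by the stricter metric: 0.50, 0.55, ..., 0.95
THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

predicted = (100, 400, 500, 460)     # hypothetical lower-third prediction
ground_truth = (150, 400, 550, 460)  # hypothetical annotation
score = iou(predicted, ground_truth)
passed = [t for t in THRESHOLDS if score >= t]
print(f"IoU = {score:.3f}, counted as a match at thresholds {passed}")
```

A box that is shifted sideways like this one still matches at the 0.5 threshold but fails the stricter ones, which is why the averaged metric is always the lower of the two figures.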

🔍 Name Extraction Performance

| Pipeline | Precision | Recall | F1 Score | Speed | Status |
| --- | --- | --- | --- | --- | --- |
| GVA + Gemini 1.5 🥇 | | | | 94.68s | Production |
| ANEP Pipeline 🥈 | | | | 542.15s 🐢 | Explainable |
| LLaMA 4 Maverick 🥉 | | | | 140.18s ⏱️ | Experimental |
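Precision, recall, and F1 for name extraction can be computed by comparing the set of extracted names against a ground-truth set for each video. The sketch below assumes exact, case-insensitive matching; the matching rules actually used in the dissertation may differ, and the function name and sample names are hypothetical.

```python
def name_scores(predicted, expected):
    """Precision, recall and F1 over case-insensitive sets of names."""
    pred = {n.strip().lower() for n in predicted}
    gold = {n.strip().lower() for n in expected}
    tp = len(pred & gold)  # names that were both extracted and expected
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Placeholder data: one false positive ("Breaking News") and two missed names.
predicted = ["Maria Borg", "Joseph Camilleri", "Breaking News"]
expected = ["Maria Borg", "Joseph Camilleri", "Anna Vella", "Paul Zammit"]
p, r, f1 = name_scores(predicted, expected)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```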

📈 Performance Overview

graph LR
    A[Speed] -->|94.68s| B[GVA + Gemini]
    C[Accuracy] -->|82.22%| B
    D[Explainability] -->|High| E[ANEP]
    F[Balance] -->|68.10%| E
    G[Simplicity] -->|55.56%| H[LLaMA 4]
    I[Cost] -->|Low| H

    style B fill:#2ECC71,stroke:#27AE60,stroke-width:2px,color:#FFF
    style E fill:#F39C12,stroke:#E67E22,stroke-width:2px,color:#FFF
    style H fill:#E74C3C,stroke:#C0392B,stroke-width:2px,color:#FFF
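To put the runtimes in perspective, the short calculation below expresses each pipeline's time relative to the fastest one. The figures are copied from the Name Extraction Performance table; the variable names are illustrative.

```python
# Runtimes in seconds, copied from the Name Extraction Performance table.
runtimes = {
    "GVA + Gemini 1.5": 94.68,
    "ANEP Pipeline": 542.15,
    "LLaMA 4 Maverick": 140.18,
}

fastest = min(runtimes, key=runtimes.get)
slowdown = {name: secs / runtimes[fastest] for name, secs in runtimes.items()}

for name, factor in sorted(slowdown.items(), key=lambda kv: kv[1]):
    print(f"{name}: {factor:.2f}x the runtime of {fastest}")
```

On these figures the ANEP pipeline takes roughly 5.7 times as long as GVA + Gemini, and LLaMA 4 Maverick roughly 1.5 times as long.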

🔐 Ethics & Data Usage

  • All data used in the NGD is sourced from publicly available news footage under fair use for research purposes.
  • No private personal data is collected or stored.
  • The system is NOT intended for surveillance or use in sensitive political contexts.

📊 Dataset & Model (Roboflow)

Explore the News Graphics Dataset (NGD) and experiment with the fine-tuned YOLOv12 model directly on Roboflow.

🚀 Getting Started

Prerequisites

Python 3.10+
Node.js 12+
CUDA-capable GPU (recommended)

Clone Repository

git clone https://github.com/AFLucas-UOM/Accurate-Name-Extraction
cd Accurate-Name-Extraction

Backend Configuration (config.json)

To enable the GenAI-based pipelines, create a config.json file inside the 6. GenAI API/ folder:

{
  "google_cloud_vision_api_key": "your-google-vision-api-key",
  "google_gemini_api_key": "your-gemini-api-key",
  "openrouter_api_key": "your-openrouter-api-key"
}

⚠️ Important: Never commit your API keys to GitHub.
Ensure that config.json is added to your .gitignore to keep sensitive credentials secure.
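A small loader can catch a missing or unedited config.json before any API call is made. The sketch below is not taken from the repository: the function name `load_config` and the placeholder check are assumptions, while the three key names are the ones shown in the example file above.

```python
import json
from pathlib import Path

# Key names as shown in the example config.json above.
REQUIRED_KEYS = (
    "google_cloud_vision_api_key",
    "google_gemini_api_key",
    "openrouter_api_key",
)


def load_config(path):
    """Read config.json and fail early if a key is missing or still a placeholder."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    problems = [
        key for key in REQUIRED_KEYS
        if not config.get(key) or str(config[key]).startswith("your-")
    ]
    if problems:
        raise ValueError(
            f"config.json is missing real values for: {', '.join(problems)}"
        )
    return config
```

It could be called as `load_config("6. GenAI API/config.json")` from the repository root, assuming the folder layout described above.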

🎓 Dissertation

The full dissertation, containing methodology, evaluation, and survey results, is included in the 7. Documentation/ folder.

📘 Citation

If you use the News Graphic Dataset (NGD) or ANEP in your research, please cite the following:

📂 News Graphic Dataset (NGD)

@dataset{news_graphic_dataset,
  title     = {News Graphic Dataset (NGD)},
  type      = {Open Source Dataset},
  author    = {Andrea Filiberto Lucas and Dylan Seychell},
  year      = {2025},
  publisher = {Roboflow},
  howpublished = {\url{https://universe.roboflow.com/ict3909-fyp/news-graphic-dataset}},
  url       = {https://universe.roboflow.com/ict3909-fyp/news-graphic-dataset}
}

🎓 Dissertation

@thesis{lucas2025anep,
  title     = {Accurate Name Extraction from News Video Graphics},
  author    = {Andrea Filiberto Lucas and Dylan Seychell},
  year      = {2025},
  school    = {University of Malta},
  type      = {B.Sc. (Hons.) Dissertation}
}

✨ Contribution

Contributions that improve the code, add new features, or optimise model performance are welcome. Fork the repository, make your changes, and submit a pull request.

🪪 License

This project is licensed under the MIT License. See the LICENSE file for details.

🙏🏻 Acknowledgments

This project was developed as part of the ICT3909 Final Year Project course at the University of Malta, and submitted in partial fulfilment of the requirements for the B.Sc. (Hons.) in Information Technology (Artificial Intelligence). Supervised by Dr. Dylan Seychell.

✉️ Contact

For questions, collaboration, or feedback, please contact Andrea Filiberto Lucas.
