This repository contains the full implementation of ANEP (Accurate Name Extraction Pipeline), a hybrid Deep Learning (DL) and Generative AI (GenAI) system for extracting personal names from graphical overlays in broadcast and social media news videos.
In today’s fast-paced digital news ecosystem, crucial information, such as the names of individuals featured in stories, is often displayed visually through broadcast graphics rather than spoken aloud. These elements appear in the form of lower-thirds, tickers, headlines, and other on-screen text overlays. However, their inconsistent styles, short display times, and frequent visual clutter make automated extraction of names a highly challenging task.
This project addresses that challenge through a novel, two-pronged solution for accurate name extraction from news video graphics:
- **ANEP Pipeline**
  A custom-built pipeline that integrates:
  - YOLOv12 for detecting news-related graphical elements (e.g. headlines and tickers)
  - Tesseract OCR with advanced preprocessing (CLAHE, thresholding, de-noising) to extract text from detected regions
  - Transformer-based Named Entity Recognition (NER) models (e.g. BERT, spaCy + GLiNER) to identify and validate personal names in noisy OCR output
  - Clustering techniques to consolidate name variants and generate structured appearance timelines
- **GenAI Pipelines**
  Parallel pipelines built using:
  - Google Cloud Vision API for high-accuracy OCR
  - Gemini 1.5 Pro and LLaMA 4 Maverick, two powerful large multimodal models capable of extracting and reasoning over names directly from video frames

  These models are evaluated as alternatives to the classical CV-NLP pipeline, with a focus on name extraction accuracy, runtime performance, and robustness to visual noise.
By combining traditional deep learning (DL) with cutting-edge GenAI, this project contributes a robust, scalable system for extracting names from video media, with direct applications in media monitoring, automated news summarisation, and AI-based fact-checking.
```mermaid
%%{init: {
  "themeVariables": {
    "fontSize": "16px",
    "edgeLabelFontSize": "14px",
    "edgeLabelColor": "#37474F"
  }
}}%%
flowchart TB
    %% darker text shades on same fills
    classDef user fill:#BBDEFB,stroke:#1976D2,stroke-width:2px,color:#0D47A1;
    classDef process fill:#C8E6C9,stroke:#2E7D32,stroke-width:2px,color:#1B5E20;
    classDef datastore fill:#FFECB3,stroke:#FFA000,stroke-width:2px,color:#EF6C00;

    %% nodes
    User[User]:::user
    SM((Select Model)):::process
    UV((Upload Video)):::process
    D1[(D1: Uploaded Video)]:::datastore
    CS((Confirm Settings)):::process
    RA((Run Analysis)):::process
    Backend[Backend API]:::user
    D3[(D3: NGD)]:::datastore
    D2[(D2: Analysis Results)]:::datastore
    VR((View Results)):::process

    %% flows
    User -->|Model selection| SM
    User -->|Video file| UV
    UV -->|Video + metadata| D1
    D1 -->|Video metadata| CS
    SM -->|Selected model ID| CS
    CS -->|Confirmed settings| RA
    D1 -->|Video file| RA
    RA -->|Video + model ID| Backend
    Backend -->|Training/inference data| D3
    Backend -->|Processed results| D2
    D2 -->|Extracted names,<br>timestamps,<br>confidence scores| VR
    Backend -->|Log/progress stream| VR
    User -->|Downloaded results| VR
```
- **Intelligent Frame Sampling & Deduplication**
  Efficiently processes long videos using DCT-based perceptual hashing and ORB feature matching to identify and retain only visually distinct frames, reducing redundancy while preserving key content.
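The DCT-based perceptual hashing behind this step can be sketched in pure NumPy. This is an illustrative simplification, not the repository's implementation: the real pipeline uses image libraries (e.g. OpenCV) for resizing and ORB matching, whereas the 32×32 block-average downscale below is a stand-in that assumes frames of at least 32×32 pixels.

```python
import numpy as np

def dct_2d(block: np.ndarray) -> np.ndarray:
    """Orthonormal 2D DCT-II of a square block via matrix multiplication."""
    n = block.shape[0]
    k = np.arange(n)
    basis = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    basis[0, :] *= 1 / np.sqrt(2)   # orthonormal scaling of the DC row
    basis *= np.sqrt(2 / n)
    return basis @ block @ basis.T

def phash(gray: np.ndarray, hash_size: int = 8) -> np.ndarray:
    """64-bit perceptual hash: DCT of a downscaled frame, thresholded at the median."""
    h, w = gray.shape
    # Crude 32x32 downscale by block averaging (stand-in for a proper resize).
    small = gray[: h - h % 32, : w - w % 32].reshape(32, h // 32, 32, w // 32).mean(axis=(1, 3))
    coeffs = dct_2d(small)[:hash_size, :hash_size]  # keep low frequencies only
    return (coeffs > np.median(coeffs)).flatten()   # binary hash

def hamming(h1: np.ndarray, h2: np.ndarray) -> int:
    """Number of differing bits; near-duplicate frames score close to 0."""
    return int(np.count_nonzero(h1 != h2))
```

Frames whose Hamming distance falls below a small threshold can then be treated as duplicates and discarded.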
- **YOLOv12-based Graphic Detection**
  A fine-tuned YOLOv12 model, trained on a custom dataset, detects six distinct classes of broadcast graphics: Breaking News, Digital On-Screen Graphics, Lower Thirds, Headlines, News Tickers, and Other Graphics.
- **Custom Annotated Dataset: NGD (News Graphics Dataset)**
  A purpose-built dataset containing 1,500+ annotated frames sourced from local and international news videos, across six classes: Breaking News, Digital On-Screen Graphics, Lower Thirds, Headlines, News Tickers, and Other Graphics.
- **OCR with Adaptive Preprocessing**
  Applies multi-method image preprocessing (CLAHE, thresholding, morphological operations, noise reduction) to maximise text clarity prior to recognition. Tesseract OCR is used with confidence scoring.
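As a simplified illustration of the thresholding stage (the pipeline itself relies on OpenCV for CLAHE, morphology, and de-noising), a global Otsu binarisation can be written in plain NumPy:

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Global Otsu threshold: maximise between-class variance over the histogram."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                  # class-0 probability up to each level
    mu = np.cumsum(prob * np.arange(256))    # cumulative mean up to each level
    mu_t = mu[-1]                            # global mean
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1 - omega))
    sigma_b = np.nan_to_num(sigma_b)         # 0/0 at the histogram edges -> 0
    return int(np.argmax(sigma_b))

def binarise(gray: np.ndarray) -> np.ndarray:
    """Threshold a uint8 grayscale image to a clean black/white mask for OCR."""
    return (gray > otsu_threshold(gray)).astype(np.uint8) * 255
```

On a lower-third crop this separates light text from a darker background (or vice versa) before the image is handed to Tesseract.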
- **Named Entity Recognition (NER)**
  Combines spaCy with GLiNER (for zero-shot multilingual NER) and a fine-tuned Transformer model to identify real-world person names from noisy OCR text. Includes heuristic and linguistic validation.
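The heuristic validation layer might look like the sketch below. Every rule and constant here is an illustrative assumption, not the repository's actual rule set; the real pipeline combines these checks with spaCy, GLiNER, and the fine-tuned Transformer.

```python
import re

# Hypothetical honorifics to strip before validating (assumption, not from the repo).
TITLE_WORDS = {"mr", "mrs", "ms", "dr", "prof", "sir"}

def looks_like_person_name(candidate: str) -> bool:
    """Cheap sanity check that an OCR string is plausibly a person's name."""
    tokens = candidate.strip().split()
    tokens = [t for t in tokens if t.lower().rstrip(".") not in TITLE_WORDS]
    if not 2 <= len(tokens) <= 4:   # most on-screen names are 2-4 tokens long
        return False
    for tok in tokens:
        # Title-case, letters only; allows hyphenated and apostrophe names
        # (Mary-Jane, O'Brien) but rejects all-caps banners and digits.
        if not re.fullmatch(r"[A-Z][a-z]*(?:[\-'][A-Z][a-z]+)*", tok):
            return False
    return True
```

Candidates that fail these checks (e.g. "BREAKING NEWS") are dropped before reaching the statistical NER models.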
- **Name Clustering & Deduplication**
  Clusters name variants using fuzzy string matching, token-based distance (Jaccard), and embedding-based cosine similarity to generate accurate, canonical name lists and appearance timelines.
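A minimal sketch of the token-based (Jaccard) part of this step, using only the standard library. The 0.4 threshold and greedy union-find merging are arbitrary assumptions for illustration; the production pipeline additionally applies fuzzy string matching and embedding cosine similarity.

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two name strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def cluster_names(names: list[str], threshold: float = 0.4) -> list[set[str]]:
    """Union-find clustering: merge names whose similarity reaches the threshold."""
    parent = {n: n for n in names}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if jaccard(a, b) >= threshold:
            parent[find(a)] = find(b)

    clusters: dict[str, set[str]] = {}
    for n in names:
        clusters.setdefault(find(n), set()).add(n)
    return list(clusters.values())
```

Each resulting cluster can then be collapsed to a canonical spelling and paired with the timestamps of the frames it appeared in.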
- **GenAI Integration**
  Alternative pipelines using:
  - Google Cloud Vision API + Gemini 1.5 Pro
  - LLaMA 4 Maverick via OpenRouter

  These systems extract names directly from video frames using multimodal reasoning and structured prompts.
- **Survey Dashboard & Evaluation Metrics**
  Includes a dedicated visualisation dashboard for survey findings on news consumption trends. Evaluation metrics include precision, recall, F1-score, and runtime comparisons across pipelines.
- **Progressive Web App (PWA)**
  A fully featured frontend built with React, Tailwind CSS, and Vite, providing a clean, step-by-step UI for uploading videos, selecting models, and visualising extracted results.
| Model | Precision | Recall | mAP@0.5 | mAP@0.5:0.95 | Epochs | Type |
|---|---|---|---|---|---|---|
| YOLOv12(m) 🥇 | 93.9% | 93.5% | 95.8% | 88.7% | 102 | Local |
| YOLOv8(m) | 92.6% | 86.9% | 93.7% | 75.2% | 47 | Local |
| YOLOv12(n) 🥈 | 91.6% | 90.8% | 93.8% | 85.4% | 120 | Cloud |
| YOLOv11(n) | 91.2% | 90.4% | 93.1% | 84.9% | 100 | Cloud |
| YOLOv12(n) Reflect | 91.4% | 85.7% | 91.8% | 80.4% | 72 | Cloud |
| YOLO-NAS(n) | 85.1% | 84.3% | 91.0% | 61.0% | 51 | Cloud |
| Pipeline | Precision | Recall | F1 Score | Speed | Status |
|---|---|---|---|---|---|
| GVA + Gemini 1.5 🥇 | | | | 94.68s ⚡ | Production |
| ANEP Pipeline 🥈 | | | | 542.15s 🐢 | Explainable |
| LLaMA 4 Maverick 🥉 | | | | 140.18s ⏱️ | Experimental |
```mermaid
graph LR
    A[Speed] -->|94.68s| B[GVA + Gemini]
    C[Accuracy] -->|82.22%| B
    D[Explainability] -->|High| E[ANEP]
    F[Balance] -->|68.10%| E
    G[Simplicity] -->|55.56%| H[LLaMA 4]
    I[Cost] -->|Low| H

    style B fill:#2ECC71,stroke:#27AE60,stroke-width:2px,color:#FFF
    style E fill:#F39C12,stroke:#E67E22,stroke-width:2px,color:#FFF
    style H fill:#E74C3C,stroke:#C0392B,stroke-width:2px,color:#FFF
```
- All data used in the NGD is sourced from publicly available news footage under fair use for research purposes.
- No private personal data is collected or stored.
- The system is NOT intended for surveillance or use in sensitive political contexts.
Explore the News Graphics Dataset (NGD) and experiment with the fine-tuned YOLOv12 model directly on Roboflow.
- Python 3.10+
- Node.js 12+
- CUDA-capable GPU (recommended)
```bash
git clone https://github.com/AFLucas-UOM/Accurate-Name-Extraction
cd Accurate-Name-Extraction
```
To enable the GenAI-based pipelines, create a `config.json` file inside the `6. GenAI API/` folder:
```json
{
  "google_cloud_vision_api_key": "your-google-vision-api-key",
  "google_gemini_api_key": "your-gemini-api-key",
  "openrouter_api_key": "your-openrouter-api-key"
}
```
⚠️ **Important:** Never commit your API keys to GitHub. Ensure that `config.json` is added to your `.gitignore` to keep sensitive credentials secure.
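A minimal sketch of how these keys might be loaded and validated at startup. The loader function and its fail-fast error handling are illustrative assumptions, not code from the repository; only the folder name and key names match the layout above.

```python
import json
from pathlib import Path

# Path and key names taken from the repository layout described above.
CONFIG_PATH = Path("6. GenAI API") / "config.json"
REQUIRED_KEYS = {
    "google_cloud_vision_api_key",
    "google_gemini_api_key",
    "openrouter_api_key",
}

def load_config(path: Path = CONFIG_PATH) -> dict:
    """Load the API keys and fail fast if any expected key is missing."""
    config = json.loads(path.read_text(encoding="utf-8"))
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise KeyError(f"config.json is missing keys: {sorted(missing)}")
    return config
```

Failing at startup with a clear message beats discovering a missing key mid-run, after minutes of frame processing.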
The full dissertation, containing the methodology, evaluation, and survey results, is included in the `7. Documentation/` folder.
If you use the News Graphics Dataset (NGD) or the ANEP pipeline in your research, please cite the following:
```bibtex
@dataset{news_graphic_dataset,
  title        = {News Graphic Dataset (NGD)},
  type         = {Open Source Dataset},
  author       = {Andrea Filiberto Lucas and Dylan Seychell},
  year         = {2025},
  publisher    = {Roboflow},
  howpublished = {\url{https://universe.roboflow.com/ict3909-fyp/news-graphic-dataset}},
  url          = {https://universe.roboflow.com/ict3909-fyp/news-graphic-dataset}
}
```
```bibtex
@thesis{lucas2025anep,
  title  = {Accurate Name Extraction from News Video Graphics},
  author = {Andrea Filiberto Lucas and Dylan Seychell},
  year   = {2025},
  school = {University of Malta},
  type   = {B.Sc. (Hons.) Dissertation}
}
```
Contributions to improve the code, add new features, or optimise model performance are welcome! Fork the repository, make your changes, and submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.
This project was developed as part of the `ICT3909` Final Year Project course at the University of Malta, and submitted in partial fulfilment of the requirements for the B.Sc. (Hons.) in Information Technology (Artificial Intelligence).
Supervised by Dr. Dylan Seychell.
For questions, collaboration, or feedback, please contact Andrea Filiberto Lucas.