Repository: https://github.com/chirindaopensource/llm_faithfulness_hallucination_misalignment_detection
Owner: 2025 Craig Chirinda (Open Source Projects)
This repository contains an independent, professional-grade Python implementation of the research methodology from the 2025 paper entitled "Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models" by:
- Igor Halperin
The project provides a complete, end-to-end computational framework for detecting faithfulness hallucinations (confabulations) in Large Language Models (LLMs). It moves beyond traditional prompt-agnostic methods by introducing a prompt-aware, ensemble-based approach that measures the semantic consistency of LLM responses across multiple, semantically equivalent paraphrases of a user's query. The goal is to provide a transparent, robust, and computationally efficient toolkit for researchers and practitioners to replicate, validate, and apply the Semantic Divergence Metrics (SDM) framework.

## Table of Contents

- Introduction
- Theoretical Background
- Features
- Methodology Implemented
- Core Components (Notebook Structure)
- Key Callable: execute_sdm_analysis
- Prerequisites
- Installation
- Input Data Structure
- Usage
- Output Structure
- Project Structure
- Customization
- Contributing
- License
- Citation
- Acknowledgments

## Introduction

This project provides a Python implementation of the methodologies presented in the 2025 paper "Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models." The core of this repository is the iPython Notebook `faithfulness_hallucination_misalignment_detection_draft.ipynb`, which contains a comprehensive suite of functions to replicate the paper's findings, from initial configuration validation to the final calculation of the SDM scores and a full suite of robustness checks.
Traditional hallucination detection methods often measure the diversity of answers to a single, fixed prompt. This can fail to distinguish between a healthy, multifaceted answer and a genuinely unstable, confabulatory one. This project implements the SDM framework, which introduces a more rigorous, prompt-aware methodology.
This codebase enables users to:
- Rigorously validate and structure a complete experimental configuration using Pydantic.
- Automatically generate a high-quality corpus of semantically equivalent prompt paraphrases.
- Efficiently generate a matrix of LLM responses using fault-tolerant, asynchronous API calls.
- Transform the raw text corpus into a shared semantic topic space via joint embedding and hierarchical clustering.
- Calculate a full suite of information-theoretic (JSD, KL Divergence) and geometric (Wasserstein Distance) metrics.
- Aggregate these metrics into the final, interpretable scores for Semantic Instability ($S_H$) and Semantic Exploration (KL).
- Execute a full suite of robustness checks to validate the stability of the framework itself.

## Theoretical Background

The implemented methods are grounded in information theory, statistics, and natural language processing, providing a quantitative framework for measuring the alignment between a prompt and a response.
1. Ensemble-Based Testing: The core innovation is to test for a deeper form of arbitrariness. Instead of just generating multiple responses to a single fixed prompt, the framework generates `M` semantically equivalent paraphrases of the prompt and `N` responses to each paraphrase, then measures how consistent the answers remain across the entire ensemble.
2. Joint Semantic Clustering: All sentences from both the prompt paraphrases and the generated responses are embedded and clustered jointly, so that prompts and answers are described by a single, shared set of semantic topics.
3. Semantic Divergence Metrics: From the topic assignments, topic probability distributions are created for the prompts ($P$) and the answers ($A$), and several divergence measures are computed between them (see the sketch after this list):
   - Jensen-Shannon Divergence ($D_{JS}$): A symmetric, bounded measure of the dissimilarity between the prompt and answer topic distributions.
     $$ D_{JS}(P||A) = \frac{1}{2}\left(D_{KL}(P||M) + D_{KL}(A||M)\right), \quad M = \frac{1}{2}(P+A) $$
   - Wasserstein Distance ($W_d$): A measure of the geometric shift between the raw embedding clouds, capturing changes in meaning that might not be reflected in the topic distributions.
   - Kullback-Leibler (KL) Divergence ($D_{KL}$): An asymmetric measure of "surprise." The paper identifies $D_{KL}(A||P)$ as a powerful indicator of Semantic Exploration: the degree to which the LLM must introduce new concepts not present in the prompt.
4. Final Aggregated Scores: These components are combined into the final, normalized scores:
   - Semantic Instability ($S_H$): The primary hallucination score.
     $$ S_H = \frac{w_{jsd} \cdot D_{JS}^{ens} + w_{wass} \cdot W_d}{H(P)} $$
   - Semantic Exploration (KL Score):
     $$ KL(\text{Answer}||\text{Prompt}) = \frac{D_{KL}^{ens}(A||P)}{H(P)} $$
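
To make these formulas concrete, the minimal sketch below computes the divergences and the normalized scores for a single prompt-answer pair using `scipy`. It is illustrative only: the toy distributions, the equal weights, and the logarithm bases are assumptions, and in the full framework $D_{JS}^{ens}$ and $D_{KL}^{ens}$ are ensemble averages over the `M x N` response matrix rather than single-pair values.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.special import rel_entr
from scipy.stats import entropy, wasserstein_distance

# Toy topic distributions over k* = 4 shared topics (illustrative values only).
P = np.array([0.40, 0.30, 0.20, 0.10])  # prompt topic distribution
A = np.array([0.20, 0.20, 0.30, 0.30])  # answer topic distribution

# Jensen-Shannon divergence: scipy returns the JS distance (square root of the
# divergence), so square it to recover D_JS.
d_js = jensenshannon(P, A, base=2) ** 2

# KL divergence D_KL(A||P), the "semantic exploration" direction (natural log).
d_kl_a_p = rel_entr(A, P).sum()

# 1-D Wasserstein distance between two illustrative embedding clouds, projected onto
# a single dimension purely for demonstration.
rng = np.random.default_rng(0)
w_d = wasserstein_distance(rng.normal(0.0, 1.0, 200), rng.normal(0.5, 1.0, 200))

# Final normalized scores, following the formulas above with assumed equal weights.
w_jsd, w_wass = 0.5, 0.5
h_p = entropy(P, base=2)                   # Shannon entropy H(P) of the prompt distribution
s_h = (w_jsd * d_js + w_wass * w_d) / h_p  # Semantic Instability S_H
kl_score = d_kl_a_p / h_p                  # Semantic Exploration (KL score)

print(f"D_JS={d_js:.4f}  W_d={w_d:.4f}  D_KL(A||P)={d_kl_a_p:.4f}")
print(f"S_H={s_h:.4f}  KL(Answer||Prompt)={kl_score:.4f}")
```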

## Features

The provided iPython Notebook (`faithfulness_hallucination_misalignment_detection_draft.ipynb`) implements the full research pipeline, including:
- Configuration Pipeline: A robust, Pydantic-based validation system for all experimental parameters.
- High-Performance Data Generation: Asynchronous API calls for efficient generation of the paraphrase and response corpora, with built-in fault tolerance and retry logic (see the sketch after this list).
- Rigorous Analytics: Modular functions for each stage of the analysis, from embedding and clustering to the final metric calculations, leveraging optimized libraries such as `scipy` and `scikit-learn`.
- Automated Orchestration: A master function that runs the entire end-to-end workflow with a single call.
- Comprehensive Validation: A full suite of robustness checks to analyze the framework's sensitivity to hyperparameters, model substitutions, and statistical noise.
- Full Research Lifecycle: The codebase covers the entire research process from configuration to final, validated scores, providing a complete and transparent replication package.
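
As a rough illustration of the data-generation pattern described above, the sketch below pairs `AsyncOpenAI` with `tenacity` retries. The model name, retry policy, and helper names are assumptions for illustration only, not the notebook's actual implementation.

```python
import asyncio
from typing import List

from openai import AsyncOpenAI
from tenacity import retry, stop_after_attempt, wait_exponential

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
async def generate_response(prompt: str, model: str = "gpt-4o-mini",
                            temperature: float = 0.7) -> str:
    # A failed request is retried with exponential backoff before the error propagates.
    resp = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content


async def generate_response_matrix(paraphrases: List[str], n: int) -> List[List[str]]:
    # Issue all M x N requests concurrently, then reshape the flat results into a matrix.
    flat = await asyncio.gather(*(generate_response(p)
                                  for p in paraphrases for _ in range(n)))
    return [flat[i * n:(i + 1) * n] for i in range(len(paraphrases))]
```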

## Methodology Implemented

The core analytical steps directly implement the methodology from the paper:
- Configuration Validation (Task 1): The pipeline ingests a configuration dictionary and rigorously validates its schema, constraints, and content.
- Environment Setup (Task 2): It establishes a deterministic, reproducible computational environment and initializes all models and clients.
- Paraphrase Generation (Task 3): It generates and validates `M` semantically equivalent paraphrases of the original prompt.
- Response Generation (Task 4): It generates and validates an `M x N` matrix of responses.
- Sentence Segmentation (Task 5): It deconstructs all texts into a cataloged, sentence-level corpus.
- Embedding Generation (Task 6): It transforms the sentence corpus into a validated, high-dimensional vector space.
- Clustering (Task 7): It determines the optimal number of topics (`k*`) and partitions the embedding space into `k*` clusters.
- Distribution Construction (Task 8): It translates the discrete cluster labels into numerically stable probability distributions (Tasks 6-8 are illustrated in the sketch after this list).
- Metric Computation (Tasks 9-10): It calculates the full suite of information-theoretic and geometric metrics.
- Score Aggregation (Task 11): It synthesizes all intermediate metrics into the final, interpretable SDM scores and validates them against paper benchmarks.
- Orchestration & Robustness (Tasks 12-13): Master functions orchestrate the main pipeline and the optional, full suite of robustness checks.
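
The following minimal sketch conveys the flavor of Tasks 6-8 (joint embedding, clustering, and distribution construction). The embedding model, the use of agglomerative clustering, and the fixed `k_star` are illustrative assumptions; the notebook selects `k*` automatically and validates every intermediate artifact.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

# Embed prompt and answer sentences into one shared vector space.
model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model
prompt_sents = ["What drives inflation?", "Which factors push prices upward?"]
answer_sents = ["Inflation is driven by demand outpacing supply.",
                "Monetary expansion can also raise prices.",
                "Supply shocks, such as oil embargoes, contribute as well."]
embeddings = model.encode(prompt_sents + answer_sents)

# Jointly cluster all sentences into k* shared topics (k* fixed at 2 here for brevity).
k_star = 2
labels = AgglomerativeClustering(n_clusters=k_star).fit_predict(embeddings)
prompt_labels, answer_labels = labels[:len(prompt_sents)], labels[len(prompt_sents):]


def topic_distribution(lbls: np.ndarray, k: int, eps: float = 1e-9) -> np.ndarray:
    # Convert hard cluster labels into a smoothed, numerically stable probability vector.
    counts = np.bincount(lbls, minlength=k).astype(float) + eps
    return counts / counts.sum()


P = topic_distribution(prompt_labels, k_star)   # prompt topic distribution
A = topic_distribution(answer_labels, k_star)   # answer topic distribution
print("P =", P, "A =", A)
```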

## Core Components (Notebook Structure)

The `faithfulness_hallucination_misalignment_detection_draft.ipynb` notebook is structured as a logical pipeline with modular orchestrator functions for each of the 13 tasks.

## Key Callable: execute_sdm_analysis

The central function in this project is `execute_sdm_analysis`. It orchestrates the entire analytical workflow, providing a single entry point for either a standard analysis or a full robustness study.

```python
def execute_sdm_analysis(
    experiment_config: Dict[str, Any],
    perform_robustness_checks: bool = False
) -> Dict[str, Any]:
    """
    Executes the main SDM analysis pipeline and optionally a full suite of robustness checks.
    """
    # ... (implementation is in the notebook)
```

## Prerequisites

- Python 3.8+
- An OpenAI API key set as an environment variable (`OPENAI_API_KEY`).
- Core dependencies: `pandas`, `numpy`, `scipy`, `scikit-learn`, `pydantic`, `openai`, `sentence-transformers`, `nltk`, `tenacity`, `tqdm`.

## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/chirindaopensource/llm_faithfulness_hallucination_misalignment_detection.git
   cd llm_faithfulness_hallucination_misalignment_detection
   ```

2. Create and activate a virtual environment (recommended):

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
   ```

3. Install Python dependencies:

   ```bash
   pip install pandas numpy scipy scikit-learn pydantic "openai>=1.0.0" sentence-transformers nltk tenacity tqdm
   ```

4. Set your OpenAI API key:

   ```bash
   export OPENAI_API_KEY='your_secret_api_key_here'
   ```

5. Download NLTK data by running the following in a Python interpreter:

   ```python
   import nltk
   nltk.download('punkt')
   ```

## Input Data Structure

The pipeline is controlled by a single, comprehensive Python dictionary, `experiment_config`. A fully specified example, `FusedExperimentInput`, is provided in the notebook. This dictionary defines everything from the prompt text and model choices to hyperparameters and validation thresholds.
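
The abridged sketch below conveys the general shape of such a dictionary. The top-level key names follow the Customization section of this README, but the nested keys and all values are illustrative placeholders; the authoritative schema is the `FusedExperimentInput` template in the notebook.

```python
# Illustrative shape only -- consult FusedExperimentInput in the notebook for the real schema.
experiment_config = {
    "original_prompt_text": "Explain the main drivers of inflation.",
    "system_components": {
        "response_model": "gpt-4o-mini",        # assumed key and value
        "embedding_model": "all-MiniLM-L6-v2",  # assumed key and value
    },
    "hyperparameters": {
        "M": 5,             # number of prompt paraphrases
        "N": 10,            # responses per paraphrase
        "temperature": 0.7,
        "score_weights": {"w_jsd": 0.5, "w_wass": 0.5},  # assumed key and values
    },
    "validation_protocols": {
        "min_paraphrase_similarity": 0.85,      # assumed threshold
    },
}
```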

## Usage

The `faithfulness_hallucination_misalignment_detection_draft.ipynb` notebook provides a complete, step-by-step guide. The core workflow is:

1. Prepare Inputs: Define your `experiment_config` dictionary. A complete template is provided.

2. Execute Pipeline: Call the master orchestrator function.

   For a standard, single analysis:

   ```python
   # Returns a dictionary with the results of the main run
   standard_results = execute_sdm_analysis(
       experiment_config=FusedExperimentInput,
       perform_robustness_checks=False
   )
   ```

   For a full robustness study (computationally expensive):

   ```python
   # Returns a dictionary with main run results and robustness reports
   full_study_results = execute_sdm_analysis(
       experiment_config=FusedExperimentInput,
       perform_robustness_checks=True
   )
   ```

3. Inspect Outputs: Programmatically access any result from the returned dictionary. For example, to view the primary scores:

   ```python
   final_scores = full_study_results['main_run']['final_scores']
   print(final_scores)
   ```

## Output Structure

The `execute_sdm_analysis` function returns a single, comprehensive dictionary:

- `main_run`: A dictionary containing the `SDMFullResult` object from the primary analysis. This includes the final scores, all intermediate diagnostic metrics, and the validation report against paper benchmarks.
- `robustness_analysis` (optional): If `perform_robustness_checks=True`, this key contains a dictionary of `pandas.DataFrame`s, with each DataFrame summarizing the results of a specific robustness test.
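
For example, the optional robustness report can be inspected as plain DataFrames (reusing `full_study_results` from the Usage section; the test names printed depend on the configured checks):

```python
# Assumes execute_sdm_analysis was run with perform_robustness_checks=True.
robustness_reports = full_study_results.get("robustness_analysis", {})

# Each robustness check is summarized in its own pandas DataFrame.
for test_name, report_df in robustness_reports.items():
    print(f"=== {test_name} ===")
    print(report_df.head(), end="\n\n")
```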

## Project Structure

```
llm_faithfulness_hallucination_misalignment_detection/
│
├── faithfulness_hallucination_misalignment_detection_draft.ipynb  # Main implementation notebook
├── requirements.txt                                               # Python package dependencies
├── LICENSE                                                        # MIT license file
└── README.md                                                      # This documentation file
```

## Customization

The pipeline is highly customizable via the master `experiment_config` dictionary. Users can easily modify:

- The `original_prompt_text` to analyze any prompt.
- The `system_components` to target different LLMs or embedding models.
- All `hyperparameters`, including `M`, `N`, `temperature`, clustering settings, and final score weights.
- All `validation_protocols` to tighten or loosen quality control thresholds.
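
A minimal example of overriding a few of these fields before a run is shown below; the nested key layout is an assumption for illustration and should be checked against the `FusedExperimentInput` template.

```python
import copy

# Start from the fully specified template provided in the notebook.
custom_config = copy.deepcopy(FusedExperimentInput)

# Override a few fields (nested layout assumed for illustration).
custom_config["original_prompt_text"] = "Summarize the causes of the 2008 financial crisis."
custom_config["hyperparameters"]["M"] = 10            # more paraphrases
custom_config["hyperparameters"]["N"] = 20            # more responses per paraphrase
custom_config["hyperparameters"]["temperature"] = 0.3

results = execute_sdm_analysis(experiment_config=custom_config)
```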

## Contributing

Contributions are welcome. Please fork the repository, create a feature branch, and submit a pull request with a clear description of your changes. Adherence to PEP 8, type hinting, and comprehensive docstrings is required.

## License

This project is licensed under the MIT License. See the `LICENSE` file for details.

## Citation

If you use this code or the methodology in your research, please cite the original paper:

```bibtex
@article{halperin2025prompt,
  title={Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models},
  author={Halperin, Igor},
  journal={arXiv preprint arXiv:2508.10192},
  year={2025}
}
```
For the implementation itself, you may cite this repository:
Chirinda, C. (2025). A Python Implementation of "Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models".
GitHub repository: https://github.com/chirindaopensource/llm_faithfulness_hallucination_misalignment_detection

## Acknowledgments

- Credit to Igor Halperin for the insightful and clearly articulated research.
- Thanks to the developers of the scientific Python ecosystem (`numpy`, `pandas`, `scipy`, `scikit-learn`, `pydantic`) that makes this work possible.

---

This README was generated based on the structure and content of `faithfulness_hallucination_misalignment_detection_draft.ipynb` and follows best practices for research software documentation.