This repository implements the research described in our poster submission "Seeing Eye to AI? Probing Vision-Language Model Alignment with Human Expert Visual Grouping". The project investigates whether state-of-the-art Vision-Language Models (VLMs) can align with human-centric, stimuli-based categorization of data visualizations.
Unlike previous work that focuses on task-based data interpretation, this research probes whether VLMs can categorize visualizations based purely on their essential visual stimuli as perceived by human experts, independent of specific data interpretation tasks. We evaluate VLMs against Chen et al.'s image-based typology derived from expert analysis of the VIS30K dataset.
- Can VLMs approximate human cognitive processes in visualization categorization?
- How well do VLMs grasp the "essential stimuli" that drive human expert categorization?
- What are the current limitations of strictly stimuli-based AI visual understanding in the visualization domain?
The research methodology is implemented using two interactive Marimo notebooks.
Interactive notebook for running VLM inference on visualization images:
- Loads and samples the VIS30K dataset (stratified sampling of 305 images)
- Implements zero-shot categorization using structured prompts
- Supports concurrent processing with rate limiting and caching
- Outputs structured predictions for purpose, encoding, and dimensionality
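The actual client code lives in `src/inference.py`; the snippet below is only a minimal sketch of the zero-shot, structured-output pattern described above. It assumes an OpenAI-compatible chat client, a hypothetical prompt wording, and a plain dictionary cache; the notebook additionally handles concurrency and rate limiting.

```python
import base64
import json
from openai import OpenAI  # assumption: an OpenAI-compatible client is used

client = OpenAI()  # reads the API key from the environment

# Hypothetical zero-shot prompt; the wording in src/inference.py may differ.
PROMPT = (
    "Categorize this visualization image. Answer with JSON containing "
    '"purpose" (gui | schematic | vis), "encoding" (e.g. bar, line, scatter) '
    'and "dimensionality" (2D | 3D).'
)

_cache: dict = {}  # avoids re-querying images that were already categorized


def categorize(model: str, image_path: str) -> dict:
    """Zero-shot categorization of one image, returning structured predictions."""
    key = (model, image_path)
    if key in _cache:
        return _cache[key]

    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},  # force structured JSON output
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    prediction = json.loads(response.choices[0].message.content)
    _cache[key] = prediction  # e.g. {"purpose": ["vis"], "encoding": ["bar"], "dimensionality": ["2D"]}
    return prediction
```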
Interactive notebook for comprehensive evaluation and analysis:
- Computes multi-label classification metrics (Accuracy, Hamming Loss, Jaccard Score, Precision/Recall/F1)
- Generates confusion matrices and performance visualizations
- Provides interactive exploration of results by model, feature, and difficulty
- Compares performance across all evaluated models
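The metric code in `src/evaluation.py` is not reproduced here; the toy example below only sketches how the listed multi-label metrics can be computed with scikit-learn, using made-up labels and an assumed set-of-labels representation.

```python
from sklearn.metrics import (accuracy_score, hamming_loss, jaccard_score,
                             multilabel_confusion_matrix,
                             precision_recall_fscore_support)
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up expert labels vs. model predictions for the "encoding" feature.
y_true = [{"bar", "line"}, {"scatter"}, {"bar"}]
y_pred = [{"bar"}, {"scatter"}, {"bar", "line"}]

mlb = MultiLabelBinarizer()
Y_true = mlb.fit_transform(y_true)   # binary indicator matrix, one column per label
Y_pred = mlb.transform(y_pred)

print("subset accuracy:", accuracy_score(Y_true, Y_pred))
print("hamming loss:   ", hamming_loss(Y_true, Y_pred))
print("jaccard (samples avg):", jaccard_score(Y_true, Y_pred, average="samples"))

p, r, f1, _ = precision_recall_fscore_support(
    Y_true, Y_pred, average="micro", zero_division=0)
print(f"micro precision/recall/F1: {p:.2f} / {r:.2f} / {f1:.2f}")

# One 2x2 confusion matrix per label, the basis for the confusion-matrix plots.
print(multilabel_confusion_matrix(Y_true, Y_pred))
```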
- `gemini-2.0-flash`
- `gemini-2.5-flash-preview-05-20`
- `gemini-2.5-pro-preview-05-06`
- `gpt-4.1`
- `gpt-4.1-mini`
- `gpt-4.1-nano`
- `o4-mini`
- `llama-4-scout`
- `llama-4-maverick`
- `mistral-small-3.1-24b-instruct`
- `mistral-medium-3`
- `pixtral-large-2411`
- `qwen2.5-vl-32b-instruct`
- Dataset: VIS30K with expert annotations (6,803 images)
- Sample: Stratified sample of 305 images across encoding types, dimensionalities, and difficulty levels
- Features Evaluated:
  - Purpose: `gui`, `schematic`, `vis`
  - Encoding: Various encoding types (bar, line, scatter, etc.)
  - Dimensionality: 2D, 3D, others
- Setting: Zero-shot evaluation with structured JSON output
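For reference, a proportional stratified sample like the 305-image one described above can be drawn with pandas roughly as sketched below; the column names and the synthetic annotation table are assumptions for illustration, not the repository's actual data-loading code.

```python
import pandas as pd

# Synthetic stand-in for the VIS30K annotation table (real columns may differ).
annotations = pd.DataFrame({
    "image": [f"img_{i:05d}.png" for i in range(6803)],
    "encoding": (["bar", "line", "scatter", "other"] * 1701)[:6803],
    "dimensionality": (["2D"] * 5000) + (["3D"] * 1803),
    "difficulty": (["easy", "medium", "hard"] * 2268)[:6803],
})

TARGET = 305
strata = ["encoding", "dimensionality", "difficulty"]

# Draw the same fraction from every (encoding, dimensionality, difficulty) stratum,
# so the sample mirrors the distribution of the full annotation table.
sample = annotations.groupby(strata).sample(
    frac=TARGET / len(annotations), random_state=42
)
print(len(sample))  # approximately 305
```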
- Purpose Identification: VLMs achieve reasonable accuracy ($>0.7$) for high-level categorization
- Dimensionality: Performance varies with complexity, showing challenges with nuanced spatial reasoning
- Encoding Recognition: Most challenging task for all VLMs ($<0.4$ accuracy), highlighting the difficulty of discerning fine-grained visual stimuli
- Difficulty Impact: Performance decreases with expert-assessed image complexity across all models
- Install Dependencies: `uv sync`
- Set up API Keys: copy the example environment file (`cp .env.example .env`) and fill in the missing values
- Run Inference Notebook: `uv run marimo run src/inference.py`
- Run Evaluation Notebook: `uv run marimo run src/evaluation.py`
This work is a precursor to a more comprehensive study that will provide insights for:
- AI Development: Understanding current VLM limitations in abstract visual reasoning
- Human-AI Collaboration: Informing the design of visualization tools that leverage human perceptual strengths
- Visualization Research: Establishing benchmarks for AI alignment with human-centric frameworks
- One-shot and few-shot prompting experiments
- Full VIS30K dataset evaluation
- Model uncertainty quantification
- Parameter sensitivity analysis
- Determinism evaluation across multiple runs