mixpeek/awesome-multimodal-search

Awesome Multimodal Search

A curated collection of 🔍 libraries, ☁️ platforms, 📖 research, 📊 benchmarks, and 📚 tutorials focused on Multimodal Search — enabling semantic retrieval across images, video, audio, and documents.

📢 Stay updated on multimodal search trends! Subscribe to the Mixpeek newsletter for the latest developments in multimodal AI.

🔍 Libraries & Frameworks

| Name | Description | Links |
| --- | --- | --- |
| Jina AI | Flow-based neural search framework for text, image, video, and audio. | GitHub · Website |
| Weaviate | Vector DB with modules for image, text, and audio embeddings (e.g. CLIP, ImageBind). | GitHub · Website |
| Towhee | Multimodal data pipelines with 100+ pretrained models. | GitHub · Website |
| CLIP Retrieval | Lightweight toolkit to search CLIP-embedded LAION datasets. | GitHub · Demo |
| Qdrant | Vector database with multimodal search capabilities and filtering. | GitHub · Website |
| Milvus | Open-source vector database for embedding similarity search. | GitHub · Website |
| Vespa | Real-time search and recommendation engine with multimodal capabilities. | GitHub · Website |
| ChromaDB | Embedding database for building AI applications with multimodal data. | GitHub · Website |
| LlamaIndex | Data framework for connecting custom data to LLMs with multimodal retrieval. | GitHub · Docs |
| LangChain | Framework for developing applications with LLMs and multimodal retrieval. | GitHub · Website |
| DocArray | Data structure for multimodal and nested data, pairs with Jina. | GitHub · Docs |
| Haystack | End-to-end framework for building search pipelines with multimodal support. | GitHub · Website |
| FAISS | Library for efficient similarity search from Meta Research, supports image vectors. | GitHub · Docs |
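
Most of the libraries above build on the same core operation: embed every item (image, video clip, audio segment, document) into a shared vector space, then rank items by similarity to a query embedding. The sketch below illustrates that idea with brute-force cosine similarity in plain Python. It is not the API of any library listed here; the item names and 3-dimensional vectors in `toy_index` are made-up placeholders standing in for real model embeddings.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every indexed item against the query and return the k best."""
    scored = [(item_id, cosine_similarity(query, vec)) for item_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings: in practice these would come from a model such as
# CLIP or ImageBind and have hundreds of dimensions.
toy_index = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.mp4": [0.8, 0.3, 0.1],
    "invoice.pdf": [0.0, 0.2, 0.9],
}
query_embedding = [1.0, 0.0, 0.0]  # e.g. the embedding of a text query

for item_id, score in top_k(query_embedding, toy_index, k=2):
    print(f"{item_id}: {score:.3f}")
```

Brute-force scoring is linear in the number of indexed items, which is why production systems typically rely on approximate nearest-neighbor indexes such as those provided by the vector databases in the table.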

☁️ Cloud Services & APIs

| Name | Modalities | Links | Notes |
| --- | --- | --- | --- |
| OpenAI API | Text, image (GPT-4V), audio (Whisper) | Docs | Supports RAG + embeddings |
| Vertex AI (Google) | Image + Text | Docs | CoCa model embeddings |
| AWS Rekognition + Kendra + Transcribe | Image, text, audio | Rekognition · Kendra | Modular pipeline for multimodal search |
| Pinecone | Vector database supporting text, image, audio embeddings | Website | Hybrid search with metadata filtering |
| Mixpeek | Text, image, video, audio, PDF, time series, tabular | Website · Docs | Multimodal data warehouse with 25+ specialized feature extractors (face grouping, object tracking, scene detection, etc.), automatic model upgrades, and cross-modal correlation capabilities |
| Microsoft Azure AI Search | Text, images, PDFs, audio transcription | Docs | Cognitive search capabilities |
| Anthropic Claude API | Text + image understanding | Docs | Claude 3 Opus/Sonnet/Haiku models |
| Cohere | Text embeddings with multilingual support | Website | Embed, Rerank, and Generate APIs |
| Supabase Vector | Vector embeddings in Postgres | Docs | pgvector integration |
| Vectara | Managed neural search platform | Website | Zero-shot cross-modal search |
| Zilliz Cloud | Managed Milvus service for vector search | Website | Enterprise-grade vector DB service |
| Algolia | Search API with AI-powered vector search | Website | Hybrid keyword + semantic search |
| Elastic AI Search | Enterprise search with vector capabilities | Website | ELSER and vector search capabilities |

📖 Landmark Papers

| Title | Modality | Venue | Links |
| --- | --- | --- | --- |
| CLIP | Image–Text | ICML 2021 | Paper · GitHub |
| ImageBind | All (6 modalities) | CVPR 2023 | Paper · GitHub |
| CLAP | Audio–Text | ICASSP 2023 | Paper · GitHub |
| BLIP/BLIP-2 | Image–Text | ICML 2022/2023 | Paper · GitHub |
| LLaVA | Image–Text | NeurIPS 2023 | Paper · GitHub |
| Video-LLaMA | Video–Text | EMNLP 2023 (System Demonstrations) | Paper · GitHub |
| Flamingo | Image/Video–Text | NeurIPS 2022 | Paper · DeepMind |
| AudioLDM | Audio–Text | ICML 2023 | Paper · GitHub |
| CM3 | Text–Image | ICLR 2023 | Paper · GitHub |
| ALIGN | Image–Text | ICML 2021 | Paper · Blog |
| FLAVA | Image–Text | CVPR 2022 | Paper · GitHub |
| Kosmos-2 | Image–Text | ICLR 2024 | Paper · GitHub |
| Whisper | Audio–Text | 2022 | Paper · GitHub |
| CoCa | Image–Text | TMLR 2022 | Paper · Blog |

📊 Benchmarks & Leaderboards

| Benchmark | Modality | Metric | Example |
| --- | --- | --- | --- |
| MS COCO | Image–Text | R@1, R@5, R@10 | BLIP-2 > 80% R@1 |
| MSR-VTT | Video–Text | R@1, R@5 | Marengo > 60% R@1 |
| Clotho, AudioCaps | Audio–Text | mAP@10, R@10 | CLAP ~0.21 mAP |
| Wiki-SS | Document screenshots | Top-1 accuracy | DSE 49% top-1 |
| Flickr30k | Image–Text | R@1, R@5, R@10 | CLIP ~65% R@1 |
| MS MARCO | Text (passage retrieval) | MRR@10, nDCG@10 | RankFusion ~0.4 MRR |
| VQAv2 | Image–Question–Answer | Accuracy | LLaVA ~80% |
| MTEB | Text embeddings | Avg. performance | BGE ~65% avg |
| MS COCO Captioning | Image–Text | BLEU, METEOR, CIDEr | CoCa 143.6 CIDEr |
| DiDeMo | Video–Text | R@1, R@5 | CLIP4Clip ~45% R@1 |
| AudioSet | Audio classification | mAP | ImageBind ~0.44 mAP |
| SentEval | Text embeddings | Accuracy | OpenAI text-embedding-3 ~87% |
| HowTo100M | Video–Text | R@1, R@5 | VideoCLIP ~32% R@1 |
| ImageNet | Image classification | Top-1, Top-5 | CLIP ~76% Top-1 |
| BEIR | Text retrieval | nDCG@10 | GTR ~66% nDCG |
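
The retrieval metrics in the table (R@K, MRR@10, nDCG@10) can be computed from a ranked result list and a set of relevant item IDs. The following is a minimal sketch under simplifying assumptions: relevance is binary, ranked lists contain no duplicates, and R@K is taken as the share of queries with at least one relevant item in the top K, a common convention for image-text retrieval. Individual benchmarks may define these metrics slightly differently, and the function names and toy queries are illustrative only.

```python
import math
from typing import List, Sequence, Set, Tuple


def hit_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """1.0 if any relevant item appears in the top-k results, else 0.0."""
    return 1.0 if any(item in relevant_ids for item in ranked_ids[:k]) else 0.0


def reciprocal_rank_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """1/rank of the first relevant item within the top-k, or 0.0 if none."""
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Normalized discounted cumulative gain with binary relevance."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked_ids[:k], start=1)
        if item in relevant_ids
    )
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def mean(values: Sequence[float]) -> float:
    """Average over queries (0.0 for an empty list)."""
    return sum(values) / len(values) if values else 0.0


# Toy evaluation set: (ranked results, relevant IDs) per query. Hypothetical data.
queries: List[Tuple[List[str], Set[str]]] = [
    (["a", "b", "c"], {"a"}),  # relevant item ranked first
    (["x", "y", "z"], {"z"}),  # relevant item ranked third
]

recall_at_1 = mean([hit_at_k(r, rel, 1) for r, rel in queries])
recall_at_5 = mean([hit_at_k(r, rel, 5) for r, rel in queries])
mrr_at_10 = mean([reciprocal_rank_at_k(r, rel, 10) for r, rel in queries])
ndcg_at_10 = mean([ndcg_at_k(r, rel, 10) for r, rel in queries])

print(f"R@1={recall_at_1:.2f} R@5={recall_at_5:.2f} "
      f"MRR@10={mrr_at_10:.3f} nDCG@10={ndcg_at_10:.3f}")
```

When each query has exactly one relevant item, this hit-based R@K coincides with the fraction of relevant items retrieved, so the two conventions only diverge on benchmarks with multiple ground-truth matches per query.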

📚 Tutorials & Demos

| Title | Modality | Links |
| --- | --- | --- |
| ImageBind + Deep Lake | Unified search | Tutorial |
| Pinecone + CLIP | Text–Image | Blog |
| Mixpeek Reverse Video Search | Video–Video | Tutorial |
| Jina Hello Multimodal | Text + Image | Code |
| RAG + CLIP + OpenAI | Multimodal RAG | Colab |
| LangChain Multimodal RAG | Text, Image, Video | Tutorial |
| Hugging Face CLIP Demo | Text–Image | Demo |
| Building Multimodal Search Engines | Text, Image | Course |
| FAISS Tutorial with Images | Image similarity | Tutorial |
| Video Search with PyTorch | Video retrieval | Tutorial |
| Milvus Bootcamp | Vector search | Bootcamp |
| ChromaDB Multimodal Examples | Text, Image | Cookbook |
| LlamaIndex Multimodal Guide | Text, Image, PDF | Guide |
| Vespa Image Search Tutorial | Image similarity | Tutorial |
| ImageBind Zero-Shot Classification | All modalities | Colab |
| Haystack Multimodal Pipelines | Text, Image, Audio | Tutorial |

📰 Multimodal Monday Blog Posts

| Title | Date | Author | Summary | Link |
| --- | --- | --- | --- | --- |
| Multimodal Monday #3 — Scaling Multimodal AI: Laws, Lightweights & Large Releases | Apr 14, 2025 | Philip Bankier | Apple's new scaling law research redefines how multimodal models are built, while Moonshot and OpenGVLab release powerful open-source VLMs with reasoning and tool use. | Read More |
| Multimodal Monday #2 — From Tiny VLMs to 10M‑Token Titans | Apr 6, 2025 | Ethan Steininger | Major multimodal model releases, including Meta's Llama 4 Scout & Maverick and Microsoft's Phi-4-Multimodal, mark the start of a new era of natively multimodal AI. | Read More |
| Multimodal Monday #1 - State of the Stack | - | - | Researchers introduce new methods that replace embeddings with discrete IDs for faster cross-modal search. | Read More |

📬 Contributions welcome! PRs and issues encouraged.
