Learn how multimodal AI merges text, image, and audio for smarter models
AI-powered tool to turn long videos into short, viral-ready clips. Combines transcription, speaker diarization, scene detection & 9:16 resizing — perfect for creators & smart automation.
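The 9:16 resizing step mentioned above typically reduces to a centered crop computation. A minimal, stdlib-only sketch (the frame dimensions and crop logic here are illustrative assumptions, not taken from the tool itself):

```python
def crop_to_9_16(width, height):
    """Compute a centered 9:16 crop box (left, top, right, bottom)
    for a frame of the given pixel dimensions."""
    if width / height > 9 / 16:
        # Frame is too wide for vertical video: trim the sides.
        crop_h = height
        crop_w = round(height * 9 / 16)
    else:
        # Frame is too tall: trim top and bottom.
        crop_w = width
        crop_h = round(width * 16 / 9)
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h

# A 1920x1080 landscape frame gets its sides trimmed to a vertical strip.
box = crop_to_9_16(1920, 1080)
```

In a real pipeline this box would be fed to a video library's crop filter; scene detection and diarization would decide *where* in time to cut, while this decides where in space.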
Neocortex Unity SDK for Smart NPCs and Virtual Assistants
A demo multimodal AI chat application built with Streamlit and Google's Gemini model. Features include: secure Google OAuth, persistent data storage with Cloud SQL (PostgreSQL), and intelligent function calling. Includes a persona-based newsletter engine to deliver personalized insights.
Enterprise-ready solution leveraging multimodal Generative AI (Gen AI) to enhance existing or new applications beyond text—implementing RAG, image classification, video analysis, and advanced image embeddings.
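At the core of RAG systems like the ones described here is similarity search over embeddings. A toy sketch of that retrieval step using cosine similarity (the tiny 3-dimensional vectors and document IDs are hypothetical stand-ins for real text or image embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, corpus, k=2):
    """Return the ids of the k corpus entries most similar to the query."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy "embeddings"; a production system would use a model's vectors
# and an approximate-nearest-neighbor index instead of a full sort.
corpus = [("doc_a", [1.0, 0.0, 0.0]),
          ("doc_b", [0.9, 0.1, 0.0]),
          ("doc_c", [0.0, 1.0, 0.0])]
top = retrieve([1.0, 0.05, 0.0], corpus, k=2)
```

The retrieved documents (or image captions, video transcripts, etc.) are then injected into the generative model's prompt as grounding context.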
🔥 The first survey on bridging VLMs and synthetic data, for which I read all 125 papers and wrote the research paper in
ICML 2025 Papers: Dive into cutting-edge research from the premier machine learning conference. Stay current with breakthroughs in deep learning, generative AI, optimization, reinforcement learning, and beyond. Code implementations included. ⭐ support the future of machine learning research!
#3 Winner of Best Use of Zoom API at Stanford TreeHacks 2024! An AI-powered meeting assistant that captures video, audio and textual context from Zoom calls using multimodal RAG.
This GitHub repository contains the complete code for building Business-Ready Generative AI Systems (GenAISys) from scratch. It guides you through architecting and implementing advanced AI controllers, intelligent agents, and dynamic RAG frameworks. The projects demonstrate practical applications across various domains.
VLDBench: A large-scale benchmark for evaluating Vision-Language Models (VLMs) and Large Language Models (LLMs) on multimodal disinformation detection.
🎭 Real-time voice-controlled 3D avatar with multimodal AI - speak naturally and watch your AI companion respond with perfect lip-sync
A hands-on collection of experimental AI mini-projects exploring large language models, multimodal reasoning, retrieval-augmented generation (RAG), reinforcement learning, and real-world applications in finance, eKYC, and voice interfaces.
Leveraging Bayesian Neural Networks for multimodal AUV data fusion, enabling precise and uncertainty-aware mapping of underwater environments.
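Uncertainty-aware fusion of multiple sensors, as in the AUV project above, is often grounded in inverse-variance weighting of independent Gaussian estimates. This sketch shows that textbook rule, not the repository's actual Bayesian neural network (the sensor values are invented for illustration):

```python
def fuse_gaussian(mu1, var1, mu2, var2):
    """Fuse two independent Gaussian estimates of the same quantity
    (e.g. sonar depth and camera-derived depth) by inverse-variance
    weighting. The fused variance is never larger than either input's."""
    w1, w2 = 1.0 / var1, 1.0 / var2
    mu = (w1 * mu1 + w2 * mu2) / (w1 + w2)
    var = 1.0 / (w1 + w2)
    return mu, var

# A noisy reading (var=4.0) fused with a precise one (var=1.0):
# the precise sensor dominates the fused estimate.
mu, var = fuse_gaussian(10.0, 4.0, 12.0, 1.0)
```

A Bayesian neural network generalizes this idea: it learns per-prediction variances from data rather than assuming fixed, independent sensor noise.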
Gallery showcasing AI-generated images and videos created using the Nova model
Mai is an emotionally intelligent, voice-enabled AI assistant built with FastAPI, Together.ai LLMs, memory persistence via ChromaDB, and real-time sentiment analysis. Designed to feel alive, empathetic, and human-like, Mai blends the charm of a flirty cyberpunk companion with the power of modern multimodal AI.
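Real-time sentiment analysis like Mai's can be approximated, at its crudest, with a lexicon-based polarity score. A minimal sketch under that assumption (the word lists are invented; the actual project presumably uses a learned model):

```python
# Hypothetical sentiment lexicons; production systems use trained models.
POSITIVE = {"love", "great", "happy", "wonderful", "thanks"}
NEGATIVE = {"hate", "sad", "awful", "terrible", "angry"}

def sentiment(text):
    """Crude lexicon-based polarity score in [-1, 1]:
    +1 is purely positive, -1 purely negative, 0 neutral/unknown."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total
```

An assistant can feed this score into its reply generation (e.g. softer phrasing when the score is negative) and persist it alongside conversation memory.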
Multi-modal AI system for diagnosing respiratory diseases using Vision Transformers and BERT.
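A common way to combine a vision model and a text model, as in the diagnosis system above, is late fusion of their per-class probabilities. A toy sketch of that pattern (the probabilities and mixing weight are invented; the repository's actual ViT/BERT architecture may fuse features earlier):

```python
def late_fuse(image_probs, text_probs, alpha=0.5):
    """Late fusion: convex combination of per-class probabilities from
    an image model (e.g. a ViT over X-rays) and a text model (e.g. BERT
    over clinical notes), returning the argmax class and fused scores."""
    fused = [alpha * p_img + (1 - alpha) * p_txt
             for p_img, p_txt in zip(image_probs, text_probs)]
    return fused.index(max(fused)), fused

# The image model slightly favors class 1, the text model class 2;
# with equal weights the fused prediction is decided jointly.
label, fused = late_fuse([0.2, 0.7, 0.1], [0.1, 0.3, 0.6])
```

Early fusion (concatenating hidden features before a shared classifier head) usually performs better but requires joint training; late fusion lets each modality's model be trained and swapped independently.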
AI Framework for Remote Sensing Image Analysis using RAG - 88%+ accuracy, multi-modal queries, ChatGPT-like interface
Lab website
Generative AI (Gen AI) is a branch of artificial intelligence that creates new content such as text, images, audio, or code using models like GPT or Gemini. It powers applications like AI chatbots, image generation tools, and creative assistants across various industries.