[Model] Add support for Jina Embeddings V4 #20802

sigridjineth · 2025-07-11T07:01:42Z

Purpose

This PR adds support for the Jina Embeddings V4 model
(jinaai/jina-embeddings-v4-vllm-retrieval) in vLLM, enabling multimodal embeddings for text and image inputs.

FIX #20463

Test Plan

Run the comprehensive test suite for Jina V4

pytest
tests/models/multimodal/test_jina_embeddings_v4.py -v


## Documentation Update

  The following documentation has been added:

  1. Model Registry (vllm/model_executor/models/registry.py): Added JinaVLForEmbedding entry
  2. Supported Models (docs/source/models/supported_models.rst): Added Jina Embeddings V4 to the multimodal embeddings section
  3. Example Script (examples/offline_inference/jina_embeddings_v4.py):
  4. Implementation Guide  (JINA_EMBEDDINGS_V4_IMPLEMENTATION.md): Detailed
  technical documentation

github-actions · 2025-07-11T07:01:50Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

gemini-code-assist

Summary of Changes

Hello @sigridjineth, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands vLLM's capabilities by integrating native support for the Jina Embeddings V4 multimodal model. My work ensures that users can now efficiently generate embeddings for both text and image inputs, benefiting from highly optimized processing, robust error handling, and comprehensive testing to guarantee accuracy and stability.

Highlights

Jina Embeddings V4 Integration: I've added full, production-ready support for the Jina Embeddings V4 model (jinaai/jina-embeddings-v4-vllm-retrieval), enabling both text and image multimodal embeddings within vLLM.
Optimized Multimodal Pooling: The implementation includes thread-safe, token-type-aware pooling, leveraging optimized Triton CUDA kernels for efficient vision token extraction, with a robust PyTorch fallback.
Robustness and Observability: I've incorporated comprehensive error handling, including graceful fallback mechanisms and OOM recovery during pooling, alongside full observability integration for performance metrics.
Comprehensive Testing & Examples: New unit tests, a validation benchmark against HuggingFace, and an offline inference example script have been added to ensure accuracy and demonstrate usage.
Documentation Updates: The model has been registered in the model registry, added to the supported models documentation, and a detailed implementation guide has been provided.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This PR adds production-ready support for the Jina Embeddings V4 model. I've identified a bug in the tests, a performance issue in the core implementation, and some areas for code improvement in the example and validation scripts.

tests/models/multimodal/test_jina_embeddings_v4.py

vllm/model_executor/models/jina_embeddings_v4.py

benchmarks/jina_embeddings_v4_validation.py

examples/offline_inference/jina_embeddings_v4.py

DarkLight1337

Thanks for contributing! Can you add this model to the test registry and supported models page?

vllm/model_executor/models/jina_embeddings_v4.py

vllm/model_executor/models/registry.py

DarkLight1337 · 2025-07-11T07:37:12Z

vllm/model_executor/models/jina_embeddings_v4.py

+
+
+# Triton kernel for optimized vision token extraction
+if HAS_TRITON:


How much is the performance increase using triton that this additional complexity is justified? cc @Isotr0py @imkero

would provide Triton performance benchmarks after finshing up some tasks in the pr

If this triton kernel is only used in pooler, I think the performance improvement will be very little. But it would be best to have a performance benchmarks first.

Can you perform benchmarking on this?

@sigridjineth I did some benchmarks between triton kernels and torch native implementation on RTX3090, but found that triton kernel can be much slower when image seq_len is quite long, which can be normal image input case for Qwen2-VL like model:

Benchmark results

Image sequence length: 512, Text sequence length: 2048, Number of images: 1 -- triton vision pooling = 0.08771181106567383 -- native vision pooling = 0.05670571327209473 Image sequence length: 1024, Text sequence length: 2048, Number of images: 1 -- triton vision pooling = 0.10277390480041504 -- native vision pooling = 0.03438115119934082 Image sequence length: 8192, Text sequence length: 2048, Number of images: 1 -- triton vision pooling = 0.3178141117095947 -- native vision pooling = 0.07503867149353027 Image sequence length: 16384, Text sequence length: 2048, Number of images: 1 -- triton vision pooling = 0.5705935955047607 -- native vision pooling = 0.11778688430786133 Image sequence length: 512, Text sequence length: 2048, Number of images: 2 -- triton vision pooling = 0.09008479118347168 -- native vision pooling = 0.03199028968811035 Image sequence length: 1024, Text sequence length: 2048, Number of images: 2 -- triton vision pooling = 0.10735464096069336 -- native vision pooling = 0.03523516654968262 Image sequence length: 8192, Text sequence length: 2048, Number of images: 2 -- triton vision pooling = 0.3502342700958252 -- native vision pooling = 0.0757303237915039 Image sequence length: 16384, Text sequence length: 2048, Number of images: 2 -- triton vision pooling = 0.6468491554260254 -- native vision pooling = 0.12034487724304199 Image sequence length: 512, Text sequence length: 2048, Number of images: 4 -- triton vision pooling = 0.09511590003967285 -- native vision pooling = 0.03257870674133301 Image sequence length: 1024, Text sequence length: 2048, Number of images: 4 -- triton vision pooling = 0.11696052551269531 -- native vision pooling = 0.03539228439331055 Image sequence length: 8192, Text sequence length: 2048, Number of images: 4 -- triton vision pooling = 0.4277994632720947 -- native vision pooling = 0.07425379753112793 Image sequence length: 16384, Text sequence length: 2048, Number of images: 4 -- triton vision pooling = 0.8103950023651123 -- native vision pooling = 0.11885881423950195

Any idea about this? The benchmark script can be found here: https://gist.github.com/Isotr0py/eef7470ff176a28ac40340b883cf1abe

vllm/model_executor/models/jina_embeddings_v4.py

Isotr0py · 2025-07-11T08:04:28Z

vllm/model_executor/models/jina_embeddings_v4.py

+
+
+# Triton kernel for optimized vision token extraction
+if HAS_TRITON:


If this triton kernel is only used in pooler, I think the performance improvement will be very little. But it would be best to have a performance benchmarks first.

vllm/model_executor/models/jina_embeddings_v4.py

Isotr0py · 2025-07-11T14:16:24Z

vllm/model_executor/models/jina_embeddings_v4.py

+    @triton.jit
+    def extract_vision_tokens_kernel(
+        hidden_states_ptr,
+        token_ids_ptr,
+        output_ptr,
+        seq_start,
+        seq_len,
+        hidden_size,
+        vision_start_id: tl.constexpr,
+        vision_end_id: tl.constexpr,
+        BLOCK_SIZE: tl.constexpr,
+    ):


I don't like putting triton kernel in model implementation, we should move this to pooler.py or somewhere else if the performance improvement is significant.

done caea1fe

sigridjineth · 2025-07-12T18:18:15Z

@Isotr0py @DarkLight1337 do review if more changes needed if you think so

DarkLight1337 · 2025-07-16T08:15:31Z

Sorry for the delay, can you merge from main and fix pre-commit?

Signed-off-by: Sigrid Jin (Sionic AI) <[email protected]>

Fixed import statement formatting to comply with isort requirements. The PoolingMetadata import now has proper line breaks and indentation. Signed-off-by: Sigrid Jin (Sionic AI) <[email protected]>

sigridjineth · 2025-07-16T19:23:32Z

@DarkLight1337 hello, I just merged the branch into the main and rebased it.

just gave some lots of efforts to pass pre-commit but it seems like yapf and isort hook is being conflicted in jina_embeddings_v4.py today and no lock. that's very annoying.

can you help me out? Thanks!

Signed-off-by: Sigrid Jin (Sionic AI) <[email protected]> This is a known issue where CI runs formatters on all files, not just changed files.

DarkLight1337 · 2025-07-17T02:19:31Z

You can yapf: disable the affected lines and yapf: enable afterwards

As suggested by maintainer, use yapf: disable/enable comments around the pooling_metadata imports to prevent formatter conflicts. This allows isort to handle the import formatting while yapf skips these lines. Signed-off-by: Sigrid Jin (Sionic AI) <[email protected]>

tests/models/multimodal/test_jina_embeddings_v4.py

vllm/third_party/pynvml.py

vllm/model_executor/models/jina_embeddings_v4.py

benchmarks/jina_embeddings_v4_validation.py

DarkLight1337 · 2025-07-17T10:10:09Z

Once #21058 is merged, you also have to update this PR by refactoring the pooler call into a class that inherits from Pooler and adding is_pooling_model flag to the model class

Signed-off-by: Sigrid Jin (Sionic AI) <[email protected]>

sigridjineth · 2025-07-17T15:42:17Z

@DarkLight1337 okay, will look forward to. when is the expected merge date then?

DarkLight1337 · 2025-07-17T15:47:13Z

Within this hour

Isotr0py · 2025-07-17T16:12:30Z

vllm/model_executor/layers/pooler.py

@@ -15,9 +15,12 @@
    PoolingMetadata as V0PoolingMetadata)
 from vllm.model_executor.pooling_metadata import PoolingTensors
 from vllm.sequence import PoolerOutput, PoolingSequenceGroupOutput
+from vllm.triton_utils import tl, triton


Suggested change

from vllm.triton_utils import tl, triton

from vllm.triton_utils import tl, triton, HAS_TRITON

Isotr0py · 2025-07-17T16:49:19Z

vllm/model_executor/models/jina_embeddings_v4.py

+
+
+# Triton kernel for optimized vision token extraction
+if HAS_TRITON:


@sigridjineth I did some benchmarks between triton kernels and torch native implementation on RTX3090, but found that triton kernel can be much slower when image seq_len is quite long, which can be normal image input case for Qwen2-VL like model:

Benchmark results

Image sequence length: 512, Text sequence length: 2048, Number of images: 1 -- triton vision pooling = 0.08771181106567383 -- native vision pooling = 0.05670571327209473 Image sequence length: 1024, Text sequence length: 2048, Number of images: 1 -- triton vision pooling = 0.10277390480041504 -- native vision pooling = 0.03438115119934082 Image sequence length: 8192, Text sequence length: 2048, Number of images: 1 -- triton vision pooling = 0.3178141117095947 -- native vision pooling = 0.07503867149353027 Image sequence length: 16384, Text sequence length: 2048, Number of images: 1 -- triton vision pooling = 0.5705935955047607 -- native vision pooling = 0.11778688430786133 Image sequence length: 512, Text sequence length: 2048, Number of images: 2 -- triton vision pooling = 0.09008479118347168 -- native vision pooling = 0.03199028968811035 Image sequence length: 1024, Text sequence length: 2048, Number of images: 2 -- triton vision pooling = 0.10735464096069336 -- native vision pooling = 0.03523516654968262 Image sequence length: 8192, Text sequence length: 2048, Number of images: 2 -- triton vision pooling = 0.3502342700958252 -- native vision pooling = 0.0757303237915039 Image sequence length: 16384, Text sequence length: 2048, Number of images: 2 -- triton vision pooling = 0.6468491554260254 -- native vision pooling = 0.12034487724304199 Image sequence length: 512, Text sequence length: 2048, Number of images: 4 -- triton vision pooling = 0.09511590003967285 -- native vision pooling = 0.03257870674133301 Image sequence length: 1024, Text sequence length: 2048, Number of images: 4 -- triton vision pooling = 0.11696052551269531 -- native vision pooling = 0.03539228439331055 Image sequence length: 8192, Text sequence length: 2048, Number of images: 4 -- triton vision pooling = 0.4277994632720947 -- native vision pooling = 0.07425379753112793 Image sequence length: 16384, Text sequence length: 2048, Number of images: 4 -- triton vision pooling = 0.8103950023651123 -- native vision pooling = 0.11885881423950195

Any idea about this? The benchmark script can be found here: https://gist.github.com/Isotr0py/eef7470ff176a28ac40340b883cf1abe

Isotr0py · 2025-07-17T16:58:31Z

benchmarks/jina_embeddings_v4_validation.py

I think this file is unnecessary here, the scripts under benchmarks are usually used to benchmark the performance of kernels or common models instead of a specific model.

Isotr0py · 2025-07-17T16:59:29Z

examples/offline_inference/jina_embeddings_v4.py

Perhaps move the content of this file to examples/offline_inference/vision_language_embedding.py?

Isotr0py · 2025-07-17T17:02:34Z

vllm/model_executor/models/registry.py

+    # Multimodal embedding model with token-type-aware pooling
+    "JinaVLForEmbedding": ("jina_embeddings_v4", "JinaVLForEmbedding"),


We should also update tests/models/registry.py.

sigridjineth requested review from DarkLight1337 and ywang96 as code owners July 11, 2025 07:01

gemini-code-assist bot reviewed Jul 11, 2025

View reviewed changes

mergify bot added documentation Improvements or additions to documentation multi-modality Related to multi-modality (#4194) new-model Requests to new models performance Performance-related issues labels Jul 11, 2025

gemini-code-assist bot reviewed Jul 11, 2025

View reviewed changes

sigridjineth force-pushed the jina-support branch from 34f3e7f to d7d6b60 Compare July 11, 2025 07:10