
[CI] Add mteb testing to test the accuracy of the embedding model #17175


Open
wants to merge 1 commit into base: main

Conversation

@noooop (Contributor) commented Apr 25, 2025

Summary:

  1. Add mteb STS12 testing to test the accuracy of the embedding model (tests/entrypoints/openai/correctness/test_mteb.py); a hedged sketch of such a test follows this list.
  2. Test snowflake_arctic_embed models using mteb STS12 (tests/models/embedding/language/test_snowflake_arctic_embed.py).
  3. Run Snowflake/snowflake-arctic-embed-m-long without rope_scaling.
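
Below is a minimal, hedged sketch of what such a correctness test could look like. The helper names (`VllmMtebEncoder`, `run_sts12`), the example model, the tolerance, and the exact way the mteb result object is read are illustrative assumptions, not the PR's actual code.

```python
# Hypothetical sketch only; the PR's real test file may be organized differently.
import mteb
import numpy as np

MODEL_NAME = "BAAI/bge-base-en-v1.5"  # example model, not fixed by this PR
MTEB_TASK = "STS12"
SCORE_TOLERANCE = 1e-4                # stability bound discussed in this PR


class VllmMtebEncoder:
    """Adapter exposing the encode() interface that mteb expects."""

    def __init__(self, model_name: str, dtype: str = "auto"):
        from vllm import LLM
        self.llm = LLM(model=model_name, task="embed", dtype=dtype)

    def encode(self, sentences, **kwargs) -> np.ndarray:
        # vLLM's offline embedding API; each output carries one embedding vector.
        outputs = self.llm.embed(sentences)
        return np.array([o.outputs.embedding for o in outputs])


def run_sts12(model) -> float:
    tasks = mteb.get_tasks(tasks=[MTEB_TASK])
    results = mteb.MTEB(tasks=tasks).run(model, verbosity=0)
    # For STS12 the main score is the Spearman correlation of cosine similarities.
    return results[0].scores["test"][0]["main_score"]


def test_mteb_sts12_matches_sentence_transformers():
    from sentence_transformers import SentenceTransformer

    st_score = run_sts12(SentenceTransformer(MODEL_NAME))
    vllm_score = run_sts12(VllmMtebEncoder(MODEL_NAME))
    assert abs(vllm_score - st_score) < SCORE_TOLERANCE
```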

Task selection

Although mteb v2 significantly speeds up the testing process, it still requires several hours to complete all tests.

Here we choose a small mteb task: STS12.

  1. Running time on an RTX 4090 is approximately 26.95 s.

  2. The score on this test set is highly numerically stable (differences <1e-4) with respect to minor variations in the model implementation and tensor data types.

  3. The score differences between different models are also quite noticeable (>1e-3).

This makes mteb STS12 a great embedding model test set.

Numerical stability

| dtype | vLLM main score | SentenceTransformer main score | Difference |
|---|---|---|---|
| float16 | 0.787317856459396 | 0.7873427091972599 | 2.4852737863900742e-05 |
| bfloat16 | 0.7873663350672234 | 0.7873427091972599 | -2.3625869963517232e-05 |

The difference is very subtle (<1e-4) at least on this test set.

Ten rounds:

| dtype | Difference | std |
|---|---|---|
| float32 | 2.2034500413159464e-07 | 2.2674275951409703e-06 |
| float16 | -1.2960828871366736e-05 | 6.329514177900761e-06 |
| bfloat16 | -0.0001093438180336248 | 3.559915712887334e-05 |

The ten-round results suggest that casting float32 weights to float16 preserves the score better than bfloat16 does (vLLM defaults to casting float32 models to float16).
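
As a rough illustration of how numbers like these could be collected, here is a hedged helper built on the `run_sts12()`/`VllmMtebEncoder` sketch from the summary above; the dtype plumbing and the aggregation are assumptions, not the PR's actual procedure.

```python
# Hedged sketch reusing run_sts12()/VllmMtebEncoder from the earlier example.
import statistics


def score_drift(model_name: str, st_score: float, dtype: str, rounds: int = 10):
    """Mean and std of (vLLM main score - SentenceTransformer main score) over repeated runs."""
    encoder = VllmMtebEncoder(model_name, dtype=dtype)
    diffs = [run_sts12(encoder) - st_score for _ in range(rounds)]
    return statistics.mean(diffs), statistics.stdev(diffs)


# e.g. score_drift("BAAI/bge-base-en-v1.5", st_score=0.7873427091972599, dtype="float16")
```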

More about numerical stability

Most models exhibit excellent numerical stability

| model name | st_main_score | Difference | std |
|---|---|---|---|
| BAAI/bge-m3 | 0.7873424632849964 | -4.014647728589615e-06 | 1.266416587059263e-05 |
| BAAI/bge-base-en-v1.5 | 0.7802846624612514 | 1.1294266950234721e-05 | 6.865350381034025e-06 |
| Snowflake/snowflake-arctic-embed-xs | 0.7149276890717804 | 1.5002530987628937e-05 | 5.132361246049283e-06 |
| Snowflake/snowflake-arctic-embed-s | 0.7408120447186094 | 1.2957674633273797e-05 | 5.364178900440517e-06 |
| Snowflake/snowflake-arctic-embed-m | 0.6467522411844727 | -3.727433978584216e-06 | 8.904071772230203e-06 |
| Snowflake/snowflake-arctic-embed-l | 0.6362746289758823 | 9.515755331335196e-05 | 2.023830079795977e-05 |
| Snowflake/snowflake-arctic-embed-m-v1.5 | 0.6490882209298032 | 1.8871733633019083e-05 | 6.591107037250243e-06 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 0.7122583106737259 | 1.074976228776503e-05 | 1.3400689624215418e-05 |
| Snowflake/snowflake-arctic-embed-m-v2.0 | 0.7066229164460937 | 1.5418442692483048e-05 | 9.792523972420118e-06 |
| Alibaba-NLP/gte-Qwen2-1.5B-instruct | 0.7280529229028553 | 5.124313459714536e-05 | 1.6385524234026275e-05 |

  • intfloat/multilingual-e5-small shows a significant drop with fp16, so fp32 needs to be used (possibly because the model is so small); see the engine-argument sketch after this list.

| dtype | model name | st_main_score | Difference | std |
|---|---|---|---|---|
| fp16 | intfloat/multilingual-e5-small | 0.7805425596252846 | -0.2749311085815237 | 0.006216913108536066 |
| fp32 | intfloat/multilingual-e5-small | 0.7805425596252846 | -1.6403316041024851e-06 | 7.53539269543218e-06 |

  • jinaai/jina-embeddings-v3 shows a slight drop when using fp16.

| dtype | model name | st_main_score | Difference | std |
|---|---|---|---|---|
| fp16 | jinaai/jina-embeddings-v3 | 0.7834129787836271 | -0.0709833671361465 | 0.004834963031278825 |
| fp32 | jinaai/jina-embeddings-v3 | 0.8243646209061513 | -3.119267999662778e-05 | 6.651161140301139e-06 |

  • Running Snowflake/snowflake-arctic-embed-m-long with rope_scaling results in a slight drop in precision; see the engine-argument sketch after this list.

| config | model name | st_main_score | Difference | std |
|---|---|---|---|---|
| with rope_scaling | Snowflake/snowflake-arctic-embed-m-long | 0.6811445157066163 | 0.002028678862646127 | 1.7115555299524317e-05 |
| without rope_scaling | Snowflake/snowflake-arctic-embed-m-long | 0.6811445157066163 | 3.396798716037708e-05 | 1.224356222837439e-05 |
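
For the two workarounds above (forcing fp32 for the small model, and loading snowflake-arctic-embed-m-long without rope_scaling), here is a hedged sketch of how they could be expressed with vLLM engine arguments. `dtype` and `hf_overrides` are real vLLM arguments, but whether the PR disables rope_scaling via `hf_overrides` (rather than, say, an edited config) is an assumption.

```python
# Hedged sketch; the PR's actual test setup may differ.
from vllm import LLM

# Force float32 for a model that degrades noticeably in fp16.
e5_small = LLM(
    model="intfloat/multilingual-e5-small",
    task="embed",
    dtype="float32",
)

# Load snowflake-arctic-embed-m-long with its rope_scaling config entry dropped.
arctic_m_long = LLM(
    model="Snowflake/snowflake-arctic-embed-m-long",
    task="embed",
    trust_remote_code=True,               # the model relies on custom code from the Hub
    hf_overrides={"rope_scaling": None},  # assumption: clearing the HF config field
)
```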


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@DarkLight1337 (Member)

Thanks for adding this, can you fix pre-commit?

@mergify mergify bot added the ci/build label Apr 25, 2025
@noooop (Contributor, Author) commented Apr 25, 2025

@DarkLight1337

Should we eventually delete benchmarks/eval/test_mteb.py, or turn it into a more general testing script?

@DarkLight1337 (Member)

cc @mgoin @comaniac do you think we can incorporate this eval script into our existing scripts? Or would it be better to keep them separate?

@comaniac (Collaborator)

Hmm I'm not sure we want to have benchmark/evals. For correctness checking in the CI, we should be able to just test 2-3 cases to keep the stability.

@noooop noooop reopened this Apr 28, 2025
@noooop noooop requested a review from ywang96 as a code owner April 28, 2025 07:53
@noooop (Contributor, Author) commented Apr 28, 2025

@DarkLight1337

I tested more models, and most of them showed strong numerical stability ( <1e-4 ), even better than I imagined.

The score differences between different models are also quite noticeable ( >1e-3).

This makes mteb STS12 a great embedding model test set.
