
[CI] Add mteb testing to test the accuracy of the embedding model #17175


Open
wants to merge 1 commit into base: main

Conversation

@noooop (Contributor) commented Apr 25, 2025

Summary:

  1. Add mteb STS12 testing to test the accuracy of the embedding model (tests/entrypoints/openai/correctness/test_mteb.py); a hedged sketch of such a test follows this list.
  2. Test snowflake_arctic_embed models using mteb STS12 (tests/models/embedding/language/test_snowflake_arctic_embed.py).
  3. Run Snowflake/snowflake-arctic-embed-m-long without rope_scaling.
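
Below is a minimal, hedged sketch of what such a correctness test could look like. The helper names (`VllmMtebEncoder`, `run_sts12`), the example model, the tolerance, and the exact way the mteb result object is read are illustrative assumptions, not the PR's actual code.

```python
# Hypothetical sketch only; the PR's real test file may be organized differently.
import mteb
import numpy as np

MODEL_NAME = "BAAI/bge-base-en-v1.5"  # example model, not fixed by this PR
MTEB_TASK = "STS12"
SCORE_TOLERANCE = 1e-4                # stability bound discussed in this PR


class VllmMtebEncoder:
    """Adapter exposing the encode() interface that mteb expects."""

    def __init__(self, model_name: str, dtype: str = "auto"):
        from vllm import LLM
        self.llm = LLM(model=model_name, task="embed", dtype=dtype)

    def encode(self, sentences, **kwargs) -> np.ndarray:
        # vLLM's offline embedding API; each output carries one embedding vector.
        outputs = self.llm.embed(sentences)
        return np.array([o.outputs.embedding for o in outputs])


def run_sts12(model) -> float:
    tasks = mteb.get_tasks(tasks=[MTEB_TASK])
    results = mteb.MTEB(tasks=tasks).run(model, verbosity=0)
    # For STS12 the main score is the Spearman correlation of cosine similarities.
    return results[0].scores["test"][0]["main_score"]


def test_mteb_sts12_matches_sentence_transformers():
    from sentence_transformers import SentenceTransformer

    st_score = run_sts12(SentenceTransformer(MODEL_NAME))
    vllm_score = run_sts12(VllmMtebEncoder(MODEL_NAME))
    assert abs(vllm_score - st_score) < SCORE_TOLERANCE
```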

Task selection

Although mteb v2 significantly speeds up the testing process, it still requires several hours to complete all tests.

Here we choose a small mteb task: STS12.

  1. Running time on an RTX 4090 is approximately 26.95 s.

  2. The score on this test set is highly numerically stable (differences <1e-4) with respect to minor variations in the model implementation and tensor data types.

  3. The score differences between different models are also quite noticeable (>1e-3).

This makes mteb STS12 a great embedding model test set.

Numerical stability

| dtype | vLLM main score | SentenceTransformer main score | Difference |
|---|---|---|---|
| float16 | 0.787317856459396 | 0.7873427091972599 | 2.4852737863900742e-05 |
| bfloat16 | 0.7873663350672234 | 0.7873427091972599 | -2.3625869963517232e-05 |

The difference is very subtle (<1e-4) at least on this test set.

Ten rounds:

| dtype | Difference | std |
|---|---|---|
| float32 | 2.2034500413159464e-07 | 2.2674275951409703e-06 |
| float16 | -1.2960828871366736e-05 | 6.329514177900761e-06 |
| bfloat16 | -0.0001093438180336248 | 3.559915712887334e-05 |

The ten-round results suggest that casting float32 weights to float16 preserves the score better than bfloat16 does (vLLM defaults to casting float32 models to float16).
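
As a rough illustration of how numbers like these could be collected, here is a hedged helper built on the `run_sts12()`/`VllmMtebEncoder` sketch from the summary above; the dtype plumbing and the aggregation are assumptions, not the PR's actual procedure.

```python
# Hedged sketch reusing run_sts12()/VllmMtebEncoder from the earlier example.
import statistics


def score_drift(model_name: str, st_score: float, dtype: str, rounds: int = 10):
    """Mean and std of (vLLM main score - SentenceTransformer main score) over repeated runs."""
    encoder = VllmMtebEncoder(model_name, dtype=dtype)
    diffs = [run_sts12(encoder) - st_score for _ in range(rounds)]
    return statistics.mean(diffs), statistics.stdev(diffs)


# e.g. score_drift("BAAI/bge-base-en-v1.5", st_score=0.7873427091972599, dtype="float16")
```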

More about numerical stability

Most models exhibit excellent numerical stability

| model name | st_main_score | Difference | std |
|---|---|---|---|
| BAAI/bge-m3 | 0.7873424632849964 | -4.014647728589615e-06 | 1.266416587059263e-05 |
| BAAI/bge-base-en-v1.5 | 0.7802846624612514 | 1.1294266950234721e-05 | 6.865350381034025e-06 |
| Snowflake/snowflake-arctic-embed-xs | 0.7149276890717804 | 1.5002530987628937e-05 | 5.132361246049283e-06 |
| Snowflake/snowflake-arctic-embed-s | 0.7408120447186094 | 1.2957674633273797e-05 | 5.364178900440517e-06 |
| Snowflake/snowflake-arctic-embed-m | 0.6467522411844727 | -3.727433978584216e-06 | 8.904071772230203e-06 |
| Snowflake/snowflake-arctic-embed-l | 0.6362746289758823 | 9.515755331335196e-05 | 2.023830079795977e-05 |
| Snowflake/snowflake-arctic-embed-m-v1.5 | 0.6490882209298032 | 1.8871733633019083e-05 | 6.591107037250243e-06 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 0.7122583106737259 | 1.074976228776503e-05 | 1.3400689624215418e-05 |
| Snowflake/snowflake-arctic-embed-m-v2.0 | 0.7066229164460937 | 1.5418442692483048e-05 | 9.792523972420118e-06 |
| Alibaba-NLP/gte-Qwen2-1.5B-instruct | 0.7280529229028553 | 5.124313459714536e-05 | 1.6385524234026275e-05 |

  • intfloat/multilingual-e5-small shows a significant drop with fp16, so fp32 needs to be used (possibly because the model is so small); see the engine-argument sketch after this list.

| dtype | model name | st_main_score | Difference | std |
|---|---|---|---|---|
| fp16 | intfloat/multilingual-e5-small | 0.7805425596252846 | -0.2749311085815237 | 0.006216913108536066 |
| fp32 | intfloat/multilingual-e5-small | 0.7805425596252846 | -1.6403316041024851e-06 | 7.53539269543218e-06 |

  • jinaai/jina-embeddings-v3 shows a slight drop when using fp16.

| dtype | model name | st_main_score | Difference | std |
|---|---|---|---|---|
| fp16 | jinaai/jina-embeddings-v3 | 0.7834129787836271 | -0.0709833671361465 | 0.004834963031278825 |
| fp32 | jinaai/jina-embeddings-v3 | 0.8243646209061513 | -3.119267999662778e-05 | 6.651161140301139e-06 |

  • Running Snowflake/snowflake-arctic-embed-m-long with rope_scaling results in a slight drop in precision; see the engine-argument sketch after this list.

| config | model name | st_main_score | Difference | std |
|---|---|---|---|---|
| with rope_scaling | Snowflake/snowflake-arctic-embed-m-long | 0.6811445157066163 | 0.002028678862646127 | 1.7115555299524317e-05 |
| without rope_scaling | Snowflake/snowflake-arctic-embed-m-long | 0.6811445157066163 | 3.396798716037708e-05 | 1.224356222837439e-05 |
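
For the two workarounds above (forcing fp32 for the small model, and loading snowflake-arctic-embed-m-long without rope_scaling), here is a hedged sketch of how they could be expressed with vLLM engine arguments. `dtype` and `hf_overrides` are real vLLM arguments, but whether the PR disables rope_scaling via `hf_overrides` (rather than, say, an edited config) is an assumption.

```python
# Hedged sketch; the PR's actual test setup may differ.
from vllm import LLM

# Force float32 for a model that degrades noticeably in fp16.
e5_small = LLM(
    model="intfloat/multilingual-e5-small",
    task="embed",
    dtype="float32",
)

# Load snowflake-arctic-embed-m-long with its rope_scaling config entry dropped.
arctic_m_long = LLM(
    model="Snowflake/snowflake-arctic-embed-m-long",
    task="embed",
    trust_remote_code=True,               # the model relies on custom code from the Hub
    hf_overrides={"rope_scaling": None},  # assumption: clearing the HF config field
)
```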


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@DarkLight1337 (Member)

Thanks for adding this, can you fix pre-commit?

@mergify mergify bot added the ci/build label Apr 25, 2025
@noooop (Contributor, Author) commented Apr 25, 2025

@DarkLight1337

Should we eventually delete benchmarks/eval/test_mteb.py, or turn it into a more general testing script?

@DarkLight1337 (Member)

cc @mgoin @comaniac do you think we can incorporate this eval script into our existing scripts? Or would it be better to keep them separate?

@comaniac (Collaborator)

Hmm I'm not sure we want to have benchmark/evals. For correctness checking in the CI, we should be able to just test 2-3 cases to keep the stability.

@noooop noooop reopened this Apr 28, 2025
@noooop noooop requested a review from ywang96 as a code owner April 28, 2025 07:53
@noooop (Contributor, Author) commented Apr 28, 2025

@DarkLight1337

I tested more models, and most of them showed strong numerical stability ( <1e-4 ), even better than I imagined.

The score differences between different models are also quite noticeable ( >1e-3).

This makes mteb STS12 a great embedding model test set.
