
Releases: intel/xFasterTransformer

v2.1.1

07 May 02:52
a7ab7d7

Performance

  • Optimize Qwen3 MoE model conversion.

v2.1.0 Qwen3 Series models supported!🎉

29 Apr 08:11
56acc31

Models

  • Support Qwen3 series models.

Performance

  • Optimize DeepSeek-R1 fp8_e4m3 performance.

v2.0.0 DeepSeek-R1 671B supported!🎉

26 Mar 07:53
eefbcda

Models

  • Support DeepSeek-R1 671B with fp8_e4m3 dtype, using bf16 KV cache dtype (see the loading sketch after this list).
  • Support Mixtral MoE series models.
  • Support TeleChat model.
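A minimal loading sketch for the fp8_e4m3 path is below. The AutoModel.from_pretrained call with dtype and kv_cache_dtype keyword arguments is an assumption pieced together from these notes, not a verbatim example from the repository.

```python
# Hypothetical sketch: load DeepSeek-R1 671B with fp8_e4m3 weights and a bf16
# KV cache. The keyword names dtype/kv_cache_dtype and the path are assumptions.
import xfastertransformer

model = xfastertransformer.AutoModel.from_pretrained(
    "/path/to/converted/DeepSeek-R1",  # xFT-converted checkpoint (placeholder)
    dtype="fp8_e4m3",                  # weight dtype introduced in this release
    kv_cache_dtype="bf16",             # KV cache dtype noted above
)
```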

Full Changelog: v1.8.2...v2.0.0

v1.8.2

10 Oct 08:17
b43edc8

Performance

  • Enable flash attention by default for the W8A8 dtype to accelerate first-token performance.

Benchmark

  • When the number of ranks is 1, run in single mode to avoid the dependency on mpirun.
  • Support SNC-3 platform.

v1.8.1

31 Jul 08:08
df57cb2

Functionality

  • Expose the interface of embedding lookup.

Performance

  • Optimized the performance of grouped query attention (GQA).
  • Enhanced the performance of creating keys for the oneDNN primitive cache.
  • Set the [bs][nh][seq][hs] layout as the default for KV Cache, resulting in better performance (an addressing sketch follows this list).
  • Improved the task split imbalance issue in self-attention.
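The default layout in the item above can be read as batch-major, then head, then sequence position, then element within the head. A tiny illustration with hypothetical names:

```python
# Illustration of the [bs][nh][seq][hs] KV Cache layout: flat offset of one
# element. All names are hypothetical, not xFT internals.
def kv_offset(b, h, s, e, num_heads, seq_len, head_size):
    # batch index -> head index -> sequence position -> element within the head
    return ((b * num_heads + h) * seq_len + s) * head_size + e
```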

v1.8.0 Continuous Batching on Single ARC GPU and AMX_FP16 Support.

23 Jul 01:25
faa25f4

Highlight

  • Continuous batching on a single ARC GPU is supported and can be integrated via vllm-xft.
  • Introduce Intel AMX instruction support for the float16 data type.

Models

  • Support ChatGLM4 series models.
  • Introduce BF16/FP16 full path support for Qwen series models.

BUG fix

  • Fixed memory leak of oneDNN primitive cache.
  • Fixed SPR-HBM flat QUAD mode detection issue in benchmark scripts.
  • Fixed head split error for distributed grouped-query attention (GQA).
  • Fixed an issue with the invokeAttentionLLaMA API.

What's Changed

  • [Kernel] Enable continuous batching on single GPU. by @changqi1 in #452
  • [Bugfix] fixed shm reduceAdd & rope error when batch size is large by @abenmao in #457
  • [Feature] Enable AMX FP16 on next generation CPU by @wenhuanh in #456
  • [Kernel] Cache oneDNN primitive when M < XFT_PRIMITIVE_CACHE_M, default 256. by @Duyi-Wang in #460
  • [Denpendency] Pin python requirements.txt version. by @Duyi-Wang in #458
  • [Dependency] Bump web_demo requirement. by @Duyi-Wang in #463
  • [Layers] Enable AMX FP16 of FlashAttn by @abenmao in #459
  • [Layers] Fix invokeAttentionLLaMA API by @wenhuanh in #464
  • [Readme] Add accepted papers by @wenhuanh in #465
  • [Kernel] Make SelfAttention prepared for AMX_FP16; More balanced task split in Cross Attention by @pujiang2018 in #466
  • [Kernel] Upgrade xDNN to v1.5.2 and make AMX_FP16 work by @pujiang2018 in #468

Full Changelog: v1.7.3...v1.8.0

v1.7.3

01 Jul 01:52

BUG fix

  • Fixed SHM reduceAdd & rope error when batch size is large.
  • Fixed the issue of abnormal usage of oneDNN primitive cache.

v1.7.2 - Continuous batching feature supports Qwen 1.0 & hybrid data types.

18 Jun 05:07
da2a7fa

Functionality

  • Add continuous batching support for Qwen 1.0 models.
  • Enable hybrid data types for the continuous batching feature, including BF16_FP16, BF16_INT8, BF16_W8A8, BF16_INT4, BF16_NF4, W8A8_INT8, W8A8_INT4, and W8A8_NF4.
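A minimal sketch of requesting one of these hybrid data types follows; passing the hybrid string via the dtype argument of AutoModel.from_pretrained is an assumption, not verbatim usage from the repository.

```python
# Hypothetical sketch: load a model with a hybrid dtype. Per the naming,
# presumably the first half (BF16) serves the first token and the second half
# (INT8) the following tokens; that reading and the keyword are assumptions.
import xfastertransformer

model = xfastertransformer.AutoModel.from_pretrained(
    "/path/to/converted/model",  # placeholder path to an xFT-converted model
    dtype="bf16_int8",           # one of the hybrid dtypes listed above
)
```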

BUG fix

  • Fixed the conversion fault in Baichuan1 models.

Full Changelog: v1.7.1...v1.7.2

v1.7.1 - Continuous batching feature supports ChatGLM2/3.

12 Jun 05:27
38658b1

Functionality

  • Add continuous batching support for ChatGLM2/3 models.
  • Qwen2Convert supports Qwen2 models quantized by GPTQ, such as GPTQ-Int8 and GPTQ-Int4, via the parameter from_quantized_model="gptq" (see the sketch below).
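A minimal conversion sketch for a GPTQ checkpoint is below; from_quantized_model="gptq" is the parameter named above, while the positional input/output directory arguments are assumptions.

```python
# Hypothetical sketch: convert a GPTQ-quantized Qwen2 checkpoint for xFT.
# Only from_quantized_model="gptq" comes from this release note; the directory
# arguments are placeholders/assumptions.
import xfastertransformer as xft

xft.Qwen2Convert().convert(
    "/path/to/Qwen2-GPTQ-Int4",  # Hugging Face GPTQ checkpoint (placeholder)
    "/path/to/xft/output",       # output directory for converted weights
    from_quantized_model="gptq",
)
```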

BUG fix

  • Fixed the segmentation fault error when running with more than 2 ranks in vllm-xft serving.

Full Changelog: v1.7.0...v1.7.1

v1.7.0 - Continuous batching feature supported.

05 Jun 05:13
76ddad7

Functionality

  • Refactor the framework to support the continuous batching feature. vllm-xft, a fork of vLLM, integrates the xFasterTransformer backend and maintains compatibility with most of the official vLLM features.
  • Remove the FP32 data type option for KV Cache.
  • Add get_env() Python API to get the recommended LD_PRELOAD settings (see the sketch after this list).
  • Add GPU build option for Intel Arc GPU series.
  • Expose the interface of the LLaMA model, including attention and decoder.
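A minimal sketch of the new get_env() API is below; its exact return format is not described in these notes, so printing it and exporting the result in the launch shell is just one plausible way to apply it.

```python
# Hypothetical usage: print the recommended environment (e.g. LD_PRELOAD) so a
# launch script can export it before starting the service. Only get_env() itself
# is named in this release note.
import xfastertransformer as xft

print(xft.get_env())
```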

Performance

  • Update xDNN to release v1.5.1.
  • Baichuan series models support a full FP16 pipeline to improve performance.
  • More FP16 data type kernels added, including MHA, MLP, YaRN rotary embedding, RMSNorm, and RoPE.
  • Kernel implementation of crossAttnByHead.

Dependency

  • Bump torch to 2.3.0.

BUG fix

  • Fixed the segmentation fault error when running with more than 4 ranks.
  • Fixed core dump and hang bugs when running across nodes.

Full Changelog: v1.6.0...v1.7.0