Releases · intel/xFasterTransformer
v2.1.1
v2.1.0 Qwen3 Series models supported!🎉
Models
- Support Qwen3 series models.
Performance
- Optimize DeepSeek-R1 fp8_e4m3 performance.
v2.0.0 DeepSeek-R1 671B supported!🎉
Models
- Support DeepSeek-R1 671B with `fp8_e4m3` weight dtype and `bf16` KV cache dtype (loading sketch after this list).
- Support Mixtral MoE series models.
- Support TeleChat model.
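For readers who want to try the new dtype, a minimal loading sketch with the xFasterTransformer Python API follows; the `fp8_e4m3`/`bf16` strings and the `kv_cache_dtype` keyword are assumptions inferred from the notes above, and the paths are placeholders.

```python
# Hypothetical sketch: load an xFT-converted DeepSeek-R1 checkpoint with
# fp8_e4m3 weights and a bf16 KV cache. Dtype strings and the kv_cache_dtype
# keyword are assumptions based on these release notes, not a verified signature.
import xfastertransformer
from transformers import AutoTokenizer

MODEL_PATH = "/data/DeepSeek-R1-xft"   # xFT-converted weights (placeholder)
TOKEN_PATH = "/data/DeepSeek-R1"       # original HF tokenizer (placeholder)

tokenizer = AutoTokenizer.from_pretrained(TOKEN_PATH, trust_remote_code=True)
model = xfastertransformer.AutoModel.from_pretrained(
    MODEL_PATH,
    dtype="fp8_e4m3",        # weight dtype introduced in v2.0.0 (assumed string)
    kv_cache_dtype="bf16",   # KV cache dtype (assumed keyword)
)

input_ids = tokenizer("What is xFasterTransformer?", return_tensors="pt").input_ids
output = model.generate(input_ids, max_length=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```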
What's Changed
- Bump gradio from 4.37.2 to 5.0.0 in /examples/web_demo by @dependabot in #479
- Bump gradio from 5.0.0 to 5.5.0 in /examples/web_demo by @dependabot in #483
- [API] Add layernorm FP16 support; by @wenhuanh in #485
- Bump gradio from 5.5.0 to 5.11.0 in /examples/web_demo by @dependabot in #488
- Fix bug for EMR SNC-2 mode benchmark by @qiuyuleng1 in #484
- Fix bugs in mpirun commands by @zsym-sjtu in #487
- [web demo] Add thinking process for demo by @wenhuanh in #492
New Contributors
- @qiuyuleng1 made their first contribution in #484
- @zsym-sjtu made their first contribution in #487
Full Changelog: v1.8.2...v2.0.0
v1.8.2
v1.8.1
Functionality
- Expose the interface of embedding lookup.
Performance
- Optimized the performance of grouped query attention (GQA).
- Enhanced the performance of creating keys for the oneDNN primitive cache.
- Set the [bs][nh][seq][hs] layout as the default for KV Cache, resulting in better performance.
- Reduced the task-split imbalance in self-attention.
v1.8.0 Continuous Batching on Single ARC GPU and AMX_FP16 Support.
Highlight
- Continuous batching on a single ARC GPU is supported and can be integrated through `vllm-xft` (usage sketch after this list).
- Introduce Intel AMX instruction support for the `float16` data type.
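As a rough illustration of the `vllm-xft` integration, here is a minimal offline-inference sketch using the vLLM-compatible Python API that the fork keeps; model and tokenizer paths are placeholders, and device selection is left to the runtime.

```python
# Minimal sketch: run an xFT-converted model through the vllm-xft fork, which
# keeps the upstream vLLM Python API. Paths are placeholders; install vllm-xft
# (not upstream vLLM) to get the xFasterTransformer backend.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/data/llama-2-7b-xft",     # xFT-converted weights (placeholder)
    tokenizer="/data/llama-2-7b-hf",  # original HF tokenizer (placeholder)
    dtype="bfloat16",
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
for out in llm.generate(["Explain continuous batching in one paragraph."], params):
    print(out.outputs[0].text)
```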
Models
- Support ChatGLM4 series models.
- Introduce BF16/FP16 full path support for Qwen series models.
BUG fix
- Fixed a memory leak in the oneDNN primitive cache.
- Fixed the SPR-HBM flat QUAD mode detection issue in the benchmark scripts.
- Fixed the head split error for distributed grouped-query attention (GQA).
- Fixed an issue with the invokeAttentionLLaMA API.
What's Changed
- [Kernel] Enable continuous batching on single GPU. by @changqi1 in #452
- [Bugfix] fixed shm reduceAdd & rope error when batch size is large by @abenmao in #457
- [Feature] Enable AMX FP16 on next generation CPU by @wenhuanh in #456
- [Kernel] Cache oneDNN primitive when M < `XFT_PRIMITIVE_CACHE_M`, default 256. by @Duyi-Wang in #460
- [Dependency] Pin python requirements.txt version. by @Duyi-Wang in #458
- [Dependency] Bump web_demo requirement. by @Duyi-Wang in #463
- [Layers] Enable AMX FP16 of FlashAttn by @abenmao in #459
- [Layers] Fix invokeAttentionLLaMA API by @wenhuanh in #464
- [Readme] Add accepted papers by @wenhuanh in #465
- [Kernel] Make SelfAttention prepared for AMX_FP16; More balanced task split in Cross Attention by @pujiang2018 in #466
- [Kernel] Upgrade xDNN to v1.5.2 and make AMX_FP16 work by @pujiang2018 in #468
Full Changelog: v1.7.3...v1.8.0
v1.7.3
v1.7.2 - Continuous batching feature supports Qwen 1.0 & hybrid data types.
Functionality
- Add continuous batching support for Qwen 1.0 models.
- Enable hybrid data types for the continuous batching feature, including `BF16_FP16`, `BF16_INT8`, `BF16_W8A8`, `BF16_INT4`, `BF16_NF4`, `W8A8_INT8`, `W8A8_int4`, `W8A8_NF4` (loading sketch after this list).
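A minimal sketch of picking one of these hybrid combinations follows; the lowercase `bf16_int8` string and its exact semantics are assumptions based on the list above.

```python
# Sketch: load a model with a hybrid dtype for continuous batching.
# "bf16_int8" mirrors the BF16_INT8 entry above; the lowercase form and the
# prefill/decode split it implies are assumptions, and the path is a placeholder.
import xfastertransformer

model = xfastertransformer.AutoModel.from_pretrained(
    "/data/qwen-7b-xft",   # xFT-converted Qwen 1.0 checkpoint (placeholder)
    dtype="bf16_int8",     # assumed: BF16 for the first-token pass, INT8 thereafter
)
```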
BUG fix
- Fixed the conversion failure in Baichuan1 models.
What's Changed
- [Doc] Add vllm benchmark docs. by @marvin-Yu in #448
- [Kernel] Add GPU kernels and enable LLaMA model. by @changqi1 in #372
- [Tools] Add Baichuan1/2 convert tool by @abenmao in #451
- [Layers] Add qwenRope support for Qwen1.0 in CB mode by @abenmao in #449
- [Framework] Remove duplicated code by @xiangzez in #450
- [Model] Support hybrid model in continuous batching. by @Duyi-Wang in #453
- [Version] v1.7.2. by @Duyi-Wang in #454
Full Changelog: v1.7.1...v1.7.2
v1.7.1 - Continuous batching feature supports ChatGLM2/3.
Functionality
- Add continuous batching support for ChatGLM2/3 models.
- Qwen2Convert supports Qwen2 models quantized by GPTQ, such as GPTQ-Int8 and GPTQ-Int4, via the parameter `from_quantized_model="gptq"` (conversion sketch after this list).
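A minimal conversion sketch; `from_quantized_model="gptq"` comes from the note above, while the positional input/output arguments and their order are assumptions.

```python
# Sketch: convert a GPTQ-quantized Qwen2 checkpoint into xFT format.
# The from_quantized_model="gptq" parameter is taken from these notes; the
# input/output argument order is an assumption, and paths are placeholders.
import xfastertransformer

xfastertransformer.Qwen2Convert().convert(
    "/data/Qwen2-7B-Instruct-GPTQ-Int4",   # HF GPTQ checkpoint (placeholder)
    "/data/Qwen2-7B-Instruct-xft",         # converted output directory (placeholder)
    from_quantized_model="gptq",
)
```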
BUG fix
- Fixed the segmentation fault when running with more than 2 ranks in vllm-xft serving.
What's Changed
- [README] Update README.md. by @Duyi-Wang in #434
- [README] Update README.md. by @Duyi-Wang in #435
- [Common]Add INT8/UINT4 to BF16 weight convert by @xiangzez in #436
- Add Continue Batching support for Chatglm2/3 by @a3213105 in #438
- [Model] Add Qwen2 GPTQ model support by @xiangzez in #439
- [Model] Fix array out of bounds when rank > 2. by @Duyi-Wang in #441
- Bump gradio from 4.19.2 to 4.36.0 in /examples/web_demo by @dependabot in #442
- [Version] v1.7.1. by @Duyi-Wang in #445
Full Changelog: v1.7.0...v1.7.1
v1.7.0 - Continuous batching feature supported.
Functionality
- Refactor the framework to support the continuous batching feature. `vllm-xft`, a fork of vLLM, integrates the xFasterTransformer backend and maintains compatibility with most of the official vLLM features.
- Remove the FP32 data type option for the KV cache.
- Add a `get_env()` Python API to get the recommended LD_PRELOAD set (usage sketch after this list).
- Add a GPU build option for the Intel Arc GPU series.
- Expose the interface of the LLaMA model, including Attention and decoder.
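A small sketch of how `get_env()` might be consulted before launching inference; the exact return format is an assumption, so treat this as illustrative only.

```python
# Sketch: inspect the recommended LD_PRELOAD setting exposed by get_env().
# The return format (a library path string vs. key/value pairs) is an assumption;
# typical use is to export it in the shell that launches the model.
import xfastertransformer

recommended = xfastertransformer.get_env()
print(recommended)   # e.g. feed this into LD_PRELOAD before running inference
```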
Performance
- Update xDNN to release `v1.5.1`.
- Baichuan series models support a full FP16 pipeline to improve performance.
- More FP16 kernels added, including MHA, MLP, YARN rotary_embedding, rmsnorm, and rope.
- Add kernel implementation of crossAttnByHead.
Dependency
- Bump `torch` to `2.3.0`.
BUG fix
- Fixed the segmentation fault when running with more than 4 ranks.
- Fixed core dump and hang bugs when running across nodes.
What's Changed
- [Fix] add utf-8 encoding. by @marvin-Yu in #354
- [Benchmark] Calculate throughput using avg latency. by @Duyi-Wang in #360
- [GPU] Add GPU build option. by @changqi1 in #359
- Fix Qwen prompt.json by @JunxiChhen in #368
- [Model] Fix ICX build issue. by @changqi1 in #370
- [CMake] Remove evaluation under XFT_BUILD_TESTS option. by @Duyi-Wang in #374
- [Kernel][UT] Kernel impl. of crossAttnByHead and unit test for cross attention. by @pujiang2018 in #348
- [API] Add LLaMA attention API. by @changqi1 in #378
- [Finetune] Scripts for Llama2-7b lora finetune example using stock pytorch by @ustcuna in #327
- [Demo] Add abbreviation for output length. by @Duyi-Wang in #385
- [API] Add LLaMA decoder API. by @changqi1 in #386
- [API] Optimize API Impl. by @changqi1 in #396
- [Framework] Continuous Batching Support by @pujiang2018 in #357
- [KVCache] Remove FP32 data type. by @Duyi-Wang in #399
- [Interface] Change return shape of forward_cb. by @Duyi-Wang in #400
- [Example] Add demo of offline continuous batching by @pujiang2018 in #401
- [Layers] Add alibiSlopes Attn && Flash Attn for CB. by @abenmao in #402
- [Interface] Support List[int] and List[List[int]] for set_input_sb. by @Duyi-Wang in #404
- [Bug] fix incorrect input offset computing by @pujiang2018 in #405
- [Example] Fix incorrect tensor dimension with latest interface by @pujiang2018 in #406
- [Models/Layers/Kernels] Add Baichuan1/2 full-link bf16 support & Fix next-tok gen bug by @abenmao in #407
- [xDNN] Release v1.5.0. by @changqi1 in #410
- [Kernel] Add FP16 rmsnorm and rope kernels. by @changqi1 in #408
- [Kernel] Add FP16 LLaMA YARN rotary_embedding. by @changqi1 in #412
- [Benchmark] Add platform options. Support real model. by @JunxiChhen in #409
- [Dependency] Update torch to 2.3.0. by @Duyi-Wang in #416
- [COMM] Fix bugs of core dump && hang when running cross nodes by @abenmao in #423
- [xDNN] Release v1.5.1. by @changqi1 in #422
- [Kernel] Add FP16 MHA and MLP kernels. by @changqi1 in #415
- [Python] Add `get_env()` to get LD_PRELOAD set. by @Duyi-Wang in #427
- Add --padding and fix bug by @yangkunx in #418
- [Layers] Fixed the seg fault error when running with more than 4 ranks by @abenmao in #424
- [Kernel] Less compute for Self-Attention (Q * K) by @pujiang2018 in #420
- [Dependency] Update libiomp5.so to `5.0.20230815` contained in mkl. by @Duyi-Wang in #430
- [Distribute] Add distribute support for continuous batching api. by @Duyi-Wang in #421
- [Layers] Fixed error in yarn by @abenmao in #429
- [README] Update readme. by @Duyi-Wang in #431
- [Dependency] Fix wrong so path returned in `get_env()`. by @Duyi-Wang in #432
- [Version] v1.7.0. by @Duyi-Wang in #433
New Contributors
Full Changelog: v1.6.0...v1.7.0