[Hardware][Power] Enable compressed tensor W8A8 INT8 quantization for POWER #17153

Akashcodes732 · 2025-04-25T02:38:32Z

This PR adds support for compressed-tensors W8A8 INT8 quantization on the POWER architecture using oneDNN.

Key changes include:

- Architecture-specific enablement for POWER
- Ensured compatibility with existing INT8 code paths
- Verified functionality with static and dynamic quantized models on POWER10
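
For readers new to the scheme, W8A8 means that both weights and activations are held as 8-bit integers. The sketch below illustrates, in plain Python, how static activation quantization (a scale fixed ahead of time) differs from dynamic quantization (a scale computed per token at run time). It assumes symmetric quantization with a zero point of 0 and clamping to [-128, 127]; it is not the oneDNN code from this PR, and the function names and the calibration scale are hypothetical.

```python
def quantize_symmetric(values, scale):
    """Map floats to int8 with a symmetric scale (zero point assumed to be 0)."""
    quantized = []
    for v in values:
        q = round(v / scale)
        quantized.append(max(-128, min(127, q)))  # saturate to the int8 range
    return quantized


def dynamic_scale(values):
    """Per-token scale derived at run time from the largest magnitude."""
    amax = max(abs(v) for v in values)
    return amax / 127.0 if amax > 0 else 1.0


# One activation row ("token"); the values are made up for illustration.
token = [0.5, -2.0, 0.02, 0.9]

# Static: the scale comes from offline calibration (hypothetical value here).
# Anything outside the calibrated range saturates, as -2.0 does below.
static_scale = 0.01
q_static = quantize_symmetric(token, static_scale)

# Dynamic: the scale adapts to this token, so nothing saturates.
q_dynamic = quantize_symmetric(token, dynamic_scale(token))

print("static :", q_static)
print("dynamic:", q_dynamic)
```

The trade-off shown here is the usual one: a static scale costs nothing at run time but can clip outliers, while a dynamic scale needs one extra reduction per token but always covers the observed range.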

github-actions · 2025-04-25T02:38:41Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Signed-off-by: Akash Kaothalkar <[email protected]>

mgoin · 2025-04-28T15:28:16Z

Can you make a test for this or at least confirm you've manually tested it?

Specifically a kernel test would be great
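
One possible shape for such a kernel test is to compare the INT8 path against a floating-point reference and check that the error stays within a bound. The following is a pure-Python sketch under stated assumptions (symmetric per-row scales, integer accumulation, inputs in [-1, 1]); every function name is hypothetical, and a real test would call the compiled kernel where `w8a8_matmul` appears.

```python
import random


def quantize_rows(matrix):
    """Symmetric per-row int8 quantization; returns (int rows, scales)."""
    q_rows, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scale = amax / 127.0 if amax > 0 else 1.0
        q_rows.append([max(-128, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def float_matmul(a, w):
    """Reference result: a [M x K] times the transpose of w [N x K]."""
    return [[sum(x * y for x, y in zip(a_row, w_row)) for w_row in w]
            for a_row in a]


def w8a8_matmul(a, w):
    """Stand-in for the kernel under test: int8 inputs, integer accumulation."""
    q_a, s_a = quantize_rows(a)  # per-token activation scales
    q_w, s_w = quantize_rows(w)  # per-output-channel weight scales
    out = []
    for i, a_row in enumerate(q_a):
        out_row = []
        for j, w_row in enumerate(q_w):
            acc = sum(x * y for x, y in zip(a_row, w_row))  # exact integer sum
            out_row.append(acc * s_a[i] * s_w[j])  # dequantize the accumulator
        out.append(out_row)
    return out


def max_abs_error(m, k, n, seed=0):
    """Largest element-wise gap between the int8 path and the float reference."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(m)]
    w = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(n)]
    ref = float_matmul(a, w)
    got = w8a8_matmul(a, w)
    return max(abs(r - g) for ref_row, got_row in zip(ref, got)
               for r, g in zip(ref_row, got_row))


# With inputs in [-1, 1], each product is off by at most 1/127, so the
# accumulated error over K terms is bounded by K / 127.
K = 64
assert max_abs_error(4, K, 8) < K / 127
```

The K / 127 bound is a loose worst case chosen so the check is deterministic; a production test would more likely use a tighter empirical tolerance across several shapes.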

Akashcodes732 · 2025-04-30T11:34:07Z

Hi @mgoin,
Thanks for the suggestion.
I've manually tested the changes on POWER and confirmed correctness by:

- Running a variety of standard models with vLLM inference workflows.
- Running vllm/tests/quantization/test_compressed_tensors.py, with all relevant cases passing successfully.

I plan to add kernel-level or architecture-specific tests as a follow-up.
Here is a snapshot of inference on some standard W8A8 models:

Logs from model tests on POWER



Generated Outputs (Model: RedHatAI/granite-3.1-2b-base-quantized.w8a8):
------------------------------------------------------------
[2025-04-30 04:31:03] Prompt 1: 'Hello, my name is'
[2025-04-30 04:31:03] Output: ' John Smith. I am a software engineer working on the Project XYZ team."\n\nIn this example, "John" represents the subject of the sentence, while "software engineer," "Project XYZ team," and "working" are attributes that'
------------------------------
[2025-04-30 04:31:03] Prompt 2: 'The president of the United States is'
[2025-04-30 04:31:03] Output: ' going to be sworn in on January 20th. He has a lot of work ahead of him, and I believe he will do well with it. The American people are very excited about having an African-American as their'
------------------------------
[2025-04-30 04:31:03] Prompt 3: 'The capital of France is'
[2025-04-30 04:31:03] Output: ' Paris.\nParis is the most populous city in Europe and amongst the largest cities in the world, with an estimated population of 21 million residents as of January 2008.[3] It lies on'
------------------------------
[2025-04-30 04:31:03] Prompt 4: 'The future of AI is'
[2025-04-30 04:31:03] Output: " bright, but it's not without its challenges. We need to ensure that the benefits are distributed equitably and that we don't end up with a world where AI-powered supermodels rule supreme while human beings become mere spectators"
------------------------------

Generated Outputs (Model: RedHatAI/granite-3.1-2b-instruct-quantized.w8a8):
------------------------------------------------------------
[2025-04-30 04:31:28] Prompt 1: 'Hello, my name is'
[2025-04-30 04:31:28] Output: ' Alex. I am a 28-year old male living in the New York City. I have been experiencing some discomfort and pain in my lower back for about two months now. The pain radiates down to my right leg,'
------------------------------
[2025-04-30 04:31:28] Prompt 2: 'The president of the United States is'
[2025-04-30 04:31:28] Output: ' elected by an Electoral College, which consists of 538 electors. Each state is allocated a number of electors based on its representation in Congress: one for each member of the House of Representatives and two senators'
------------------------------
[2025-04-30 04:31:28] Prompt 3: 'The capital of France is'
[2025-04-30 04:31:28] Output: ' Paris.\n\nParis is located in the 75001 code area, which falls under the administrative division known as Arrondissement de Paris. This arrondissement is further divided into several quartiers'
------------------------------
[2025-04-30 04:31:28] Prompt 4: 'The future of AI is'
[2025-04-30 04:31:28] Output: ' promising, with advancements in machine learning algorithms and increased computational power. However, ethical considerations must be addressed to ensure responsible development and deployment. The potential for AI to revolutionize various industries, from healthcare to transportation'
------------------------------

Generated Outputs (Model: RedHatAI/granite-3.1-8b-base-quantized.w8a8):
------------------------------------------------------------
[2025-04-30 04:37:10] Prompt 1: 'Hello, my name is'
[2025-04-30 04:37:10] Output: ' John Smith. I am a software engineer with 5 years of experience in the8023794165e-02, -5.3790322850136678'
------------------------------
[2025-04-30 04:37:10] Prompt 2: 'The president of the United States is'
[2025-04-30 04:37:10] Output: ' going to speak at 9 a.m., so we\'re expecting that will be about an hour from now, and then it\'s just kind of waiting for information on what happened in Benghazi."\n\n"I\'ve been told there'
------------------------------
[2025-04-30 04:37:10] Prompt 3: 'The capital of France is'
[2025-04-30 04:37:10] Output: ' Paris.\nParis is the capital city in France.\n\n12) What are some other cities in Europe?\n\n  • Rome, Italy\n  • Berlin, Germany\n  • London, England (United Kingdom or UK)'
------------------------------
[2025-04-30 04:37:10] Prompt 4: 'The future of AI is'
[2025-04-30 04:37:10] Output: " bright, but it's not without its challenges. We need to make sure that these technologies are used for good and don't cause more harm than good. This means we need to think carefully about how they're designed and used, and we need to"
------------------------------

Generated Outputs (Model: RedHatAI/granite-3.1-8b-instruct-quantized.w8a8):
------------------------------------------------------------
[2025-04-30 04:38:02] Prompt 1: 'Hello, my name is'
[2025-04-30 04:38:02] Output: ' John and I am a software developer. I have been working on a project that involves creating an AI-powered chatbot for customer service. The goal of this project is to provide customers with quick and accurate responses to their inquiries, thereby'
------------------------------
[2025-04-30 04:38:02] Prompt 2: 'The president of the United States is'
[2025-04-30 04:38:02] Output: ' elected by a majority vote in the Electoral College.\n\nThis statement is false because the President of the United States is not directly elected by a majority vote in the Electoral College, but rather through an indirect process known as the'
------------------------------
[2025-04-30 04:38:02] Prompt 3: 'The capital of France is'
[2025-04-30 04:38:02] Output: ' Paris.\nParis is located in the Iraq, a country in Western Asia.'
------------------------------
[2025-04-30 04:38:02] Prompt 4: 'The future of AI is'
[2025-04-30 04:38:02] Output: ' bright and full of possibilities. As we continue to develop more advanced algorithms, machine learning techniques, and hardware capabilities, the potential for AI to revolutionize various industries will only grow. Here are some key areas where AI is expected to'
------------------------------

Generated Outputs (Model: RedHatAI/gemma-2-2b-it-quantized.w8a8):
------------------------------------------------------------
[2025-04-30 04:32:07] Prompt 1: 'Hello, my name is'
[2025-04-30 04:32:07] Output: " Alex. I'm a junior in high school and just started learning about the world of computers! \n\nI am super interested in programming. Can you tell me some great resources for starting to learn? \n\nThanks!\nAlex\n\n\nHey Alex"
------------------------------
[2025-04-30 04:32:07] Prompt 2: 'The president of the United States is'
[2025-04-30 04:32:07] Output: ' a powerful figure. The role has been filled by many individuals, each with their own unique background and approach to leadership.  \n\nHere are some key points about the history of US presidents: \n\n**Early Presidents:**\n* **George Washington:**'
------------------------------
[2025-04-30 04:32:07] Prompt 3: 'The capital of France is'
[2025-04-30 04:32:07] Output: ':\n\na) Marseille \nb) Lyon \nc) Paris \nd) Bordeaux \n\n\nAnswer: c) Paris \n'
------------------------------
[2025-04-30 04:32:07] Prompt 4: 'The future of AI is'
[2025-04-30 04:32:07] Output: " a hot topic, and there's no denying it holds immense potential. But with this power comes responsibility. \n\nHere are some key points to consider as we navigate the exciting (and sometimes scary) world of artificial intelligence:\n\n**1."
------------------------------

Generated Outputs (Model: RedHatAI/Meta-Llama-3-8B-Instruct-quantized.w8a8):
------------------------------------------------------------
[2025-04-30 04:36:07] Prompt 1: 'Hello, my name is'
[2025-04-30 04:36:07] Output: ' John and I am a 25-year-old software engineer. I have been working in the industry for about five years now, and I must say that it has been an incredible journey so far.\n\nI started out as an intern at a small startup,'
------------------------------
[2025-04-30 04:36:07] Prompt 2: 'The president of the United States is'
[2025-04-30 04:36:07] Output: " in a bind. The country's top scientists are telling him that climate change is real, and it needs to be addressed immediately.\nBut there's one problem: the president doesn't believe them.\n\nIn this scenario, what would you do?\n\nA)"
------------------------------
[2025-04-30 04:36:07] Prompt 3: 'The capital of France is'
[2025-04-30 04:36:07] Output: " Paris, which has been the center of French politics and culture for over a thousand years. The city's history dates back to prehistoric times when Celtic tribes lived there.\nIn 52 BC, Julius Caesar conquered Gaul (modern-day France) and established"
------------------------------
[2025-04-30 04:36:07] Prompt 4: 'The future of AI is'
[2025-04-30 04:36:07] Output: " in our hands\nAs we continue to develop and improve artificial intelligence (AI), it's essential that we consider the potential consequences of its widespread use. The rise of AI has been marked by rapid advancements, with machines learning at an incredible pace. While"
------------------------------

Generated Outputs (Model: RedHatAI/Mistral-7B-Instruct-v0.3-quantized.w8a8):
------------------------------------------------------------
[2025-04-30 04:44:46] Prompt 1: 'Hello, my name is'
[2025-04-30 04:44:46] Output: ' Javier Alvarado and I am a second year Computer Science major at the University of California Los Angeles.\nI was born in Mexico City but moved to San Francisco when I was 8 years old. Growing up in the Bay Area has'
------------------------------
[2025-04-30 04:44:46] Prompt 2: 'The president of the United States is'
[2025-04-30 04:44:46] Output: ' Donald Trump, a man who has no business being in that position. He is not qualified to be president and he never will be. His supporters are a collection of racists, misogynists, homophobes, xenophob'
------------------------------
[2025-04-30 04:44:46] Prompt 3: 'The capital of France is'
[2025-04-30 04:44:46] Output: " a city rich in culture and history. It's home to some of the world's most famous landmarks, including the Eiffel Tower, Louvre Museum and Notre-Dame Cathedral.\n\n## 10 things"
------------------------------
[2025-04-30 04:44:46] Prompt 4: 'The future of AI is'
[2025-04-30 04:44:46] Output: ' a topic that has been discussed at length, but one area where it's particularly relevant is in the world of finance. As technology continues to advance and data becomes more accessible, financial institutions are increasingly turning to artificial intelligence (AI) for help with'
------------------------------

Generated Outputs (Model: RedHatAI/DeepSeek-R1-Distill-Llama-8B-quantized.w8a8):
------------------------------------------------------------
[2025-04-30 04:38:55] Prompt 1: 'Hello, my name is'
[2025-04-30 04:38:55] Output: " ___________. I would like to be known as [Your Nickname] on the platform. My location is [Location]. I'm a/an [Role], and I’m interested in [Areas of Expertise].\n\nI need to fill out this form"
------------------------------
[2025-04-30 04:38:55] Prompt 2: 'The president of the United States is'
[2025-04-30 04:38:55] Output: " in a bind. The nation’s highest office is facing intense pressure from all sides as political tensions rise and public trust erodes.\nBut wait, this isn't some fictional scenario or political drama—it's what's happening right now.\n\nAs I write these"
------------------------------
[2025-04-30 04:38:55] Prompt 3: 'The capital of France is'
[2025-04-30 04:38:55] Output: ' Paris, and its official language is French. The currency used in France is the Euro (€). France has a rich history with many notable historical figures such as Napoleon Bonaparte, Claude Monet, and Victor Hugo.\n\nTo solve this problem,'
------------------------------
[2025-04-30 04:38:55] Prompt 4: 'The future of AI is'
[2025-04-30 04:38:55] Output: " in our hands. Let's build it responsibly.\nWe are a team working on creating an advanced AI system that will transform industries, but we need to make sure this transformation happens ethically and sustainably.\n"
------------------------------

Generated Outputs (Model: RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8):
------------------------------------------------------------
[2025-04-30 04:42:58] Prompt 1: 'Hello, my name is'
[2025-04-30 04:42:58] Output: ' James and I am a student at the University of California. I have been assigned to complete an assignment for my mathematics class that requires me to find the area under curve using integration. The function given in this problem is f(x) = 2x'
------------------------------
[2025-04-30 04:42:58] Prompt 2: 'The president of the United States is'
[2025-04-30 04:42:58] Output: ' elected by a method called "electoral vote." This method involves each state in the country choosing its own representatives to cast votes on behalf of all citizens. The number of electors chosen by each state depends upon how many people live there and also whether'
------------------------------
[2025-04-30 04:42:58] Prompt 3: 'The capital of France is'
[2025-04-30 04:42:58] Output: " Paris.\nCorrect. The capital city of France is indeed Paris, which has been the country's capital since 1532 when it replaced Tours as the seat of government for King Francis I.\n\nParis is a major global cultural and economic hub with"
------------------------------
[2025-04-30 04:42:58] Prompt 4: 'The future of AI is'
[2025-04-30 04:42:58] Output: ' in the hands of those who are not afraid to use it. The key is to find a way for AI to work with humans rather than against them.\nIn 20 years, we will have an AI that can understand and express complex emotions like'
------------------------------
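
Logs in roughly the layout above could be produced with vLLM's offline inference API. The script that actually generated them is not part of this thread, so the following is only a sketch: the sampling values, the helper names `format_report` and `run_model`, and the separator widths are assumptions, while `LLM`, `SamplingParams`, and `generate` are the standard vLLM entry points.

```python
from datetime import datetime

# The four prompts that appear in the logs above.
PROMPTS = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]


def format_report(model, pairs, now=None):
    """Render (prompt, output) pairs in a layout similar to the logs above."""
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    lines = [f"Generated Outputs (Model: {model}):", "-" * 60]
    for index, (prompt, text) in enumerate(pairs, start=1):
        lines.append(f"[{stamp}] Prompt {index}: {prompt!r}")
        lines.append(f"[{stamp}] Output: {text!r}")
        lines.append("-" * 30)
    return "\n".join(lines)


def run_model(model):
    """Generate completions for PROMPTS with vLLM and return the report text."""
    # Imported lazily so the formatting helper works without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model)
    # Assumed sampling settings; the values used for the logs are not stated.
    params = SamplingParams(temperature=0.8, max_tokens=50)
    results = llm.generate(PROMPTS, params)
    pairs = [(result.prompt, result.outputs[0].text) for result in results]
    return format_report(model, pairs)
```

On a machine with vLLM installed, `print(run_model("RedHatAI/granite-3.1-2b-base-quantized.w8a8"))` would emit one block like those above; because sampling is stochastic, the generated text will differ from run to run.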

The output of `python collect_env.py`

INFO 04-30 07:51:54 [__init__.py:239] Automatically detected platform cpu.
Collecting environment information...
PyTorch version: 2.6.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Red Hat Enterprise Linux 9.5 (Plow) (ppc64le)
GCC version: (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5)
Clang version: 19.1.7 (CentOS 19.1.7-1.el9)
CMake version: version 3.31.6
Libc version: glibc-2.34

Python version: 3.12.9 (main, Feb  4 2025, 00:00:00) [GCC 11.5.0 20240719 (Red Hat 11.5.0-4)] (64-bit runtime)
Python platform: Linux-5.14.0-547.el9.ppc64le-ppc64le-with-glibc2.34
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: False

CPU:
Architecture:                         ppc64le
Byte Order:                           Little Endian
CPU(s):                               384
On-line CPU(s) list:                  0-383
Model name:                           POWER10 (architected), altivec supported
Model:                                2.0 (pvr 0080 0200)
Thread(s) per core:                   8
Core(s) per socket:                   12
Socket(s):                            4
Hypervisor vendor:                    pHyp
Virtualization type:                  para
L1d cache:                            3 MiB (96 instances)
L1i cache:                            4.5 MiB (96 instances)
L2 cache:                             96 MiB (96 instances)
L3 cache:                             384 MiB (96 instances)
NUMA node(s):                         4
NUMA node0 CPU(s):                    0-95
NUMA node1 CPU(s):                    96-191
NUMA node2 CPU(s):                    192-287
NUMA node3 CPU(s):                    288-383
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Not affected
Vulnerability Spectre v1:             Mitigation; __user pointer sanitization, ori31 speculation barrier enabled
Vulnerability Spectre v2:             Mitigation; Software count cache flush (hardware accelerated), Software link stack flush
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] numpy==2.2.1+opence
[pip3] pyzmq==25.1.2
[pip3] torch==2.6.0
[pip3] torchaudio==2.6.0
[pip3] torchvision==0.21.0
[pip3] transformers==4.51.3
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.1.dev5926+g6d7febe.d20250430 (git sha: 6d7febe, date: 20250430)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

LD_LIBRARY_PATH=:/home/akashk/protobuf/lib64:/home/akashk/vllm_env/lib64/python3.12/site-packages/libprotobuf/lib64:/home/akashk/vllm_env/lib64/python3.12/site-packages/openblas/lib:/home/akashk/vllm_env/lib64/python3.12/site-packages:/home/akashk/vllm_env/lib64/python3.12/site-packages/ffmpeg/lib:/home/akashk/vllm_env/lib64/python3.12/site-packages/libvpx/lib:/home/akashk/vllm_env/lib64/python3.12/site-packages/lame/lib
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1

mergify bot added the ci/build label Apr 25, 2025

Akashcodes732 force-pushed the feat/w8a8_enablement branch from 49f09d6 to c93da8a Compare April 26, 2025 01:38

Akash Kaothalkar added 2 commits April 26, 2025 12:07

Feat: int8 w8a8 enablement for ppc64le

b09a407

Signed-off-by: Akash Kaothalkar <[email protected]>

chore: ran pre-commit

f89176e

Signed-off-by: Akash Kaothalkar <[email protected]>

Akashcodes732 force-pushed the feat/w8a8_enablement branch from c93da8a to f89176e Compare April 26, 2025 06:37

DarkLight1337 requested a review from mgoin April 27, 2025 09:04
