💡 Not Just for SageMaker! This container runs anywhere Docker is available—on your laptop, on-prem servers, or any cloud (not just AWS or SageMaker).
- For local Docker and Docker Compose usage, see the Intel doc and the Arm doc.
- For Kubernetes/Helm, see helm/README.md
That's it! The container automatically downloads, converts, and optimizes the model for your CPU architecture.
Because small language models and modern CPUs are a great match for cost-efficient AI inference. More context in these blog posts: "The case for small language model inference on Arm CPUs" and "Is running language models on CPU really viable?".
Because I've been trying for a while to collaborate with AWS and Arm on this project, and I got tired of waiting 😴
So there. Enjoy!
Caveat: I've only tested sub-10B models so far. Timeouts could hit on larger models. Bug reports, ideas, and pull requests are welcome.
- Based on a clean source build of llama.cpp
- Native integration with the SageMaker SDK and both Graviton3/Graviton4 (ARM64) and Intel Xeon (AMD64) instances
- Model deployment from the Hugging Face hub or an Amazon S3 bucket
- Single-step deployment and optimization of safetensors models, with automatic GGUF conversion and quantization
- Deployment of existing GGUF models
- Support for the OpenAI API (`/v1/chat/completions`, `/v1/completions`)
- Support for streaming and non-streaming text generation
- Support for all `llama-server` flags
SageMaker Endpoint → FastAPI Adapter (port 8080) → llama.cpp Server (port 8081)
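Inside the container, a small adapter translates SageMaker's invocation routes into OpenAI-style calls against llama-server. The sketch below only illustrates that request flow; the FastAPI/httpx choice and the exact route handling are assumptions, not the container's actual adapter code.

```python
# Illustrative request-flow sketch (FastAPI + httpx are assumptions),
# not the container's actual adapter implementation.
from fastapi import FastAPI, Request
import httpx

app = FastAPI()
LLAMA_SERVER = "http://127.0.0.1:8081"  # llama.cpp server inside the container

@app.get("/ping")
async def ping():
    # SageMaker health check
    return {"status": "ok"}

@app.post("/invocations")
async def invocations(request: Request):
    # Forward the OpenAI-style payload to llama-server and relay its response
    payload = await request.json()
    async with httpx.AsyncClient(timeout=None) as client:
        upstream = await client.post(f"{LLAMA_SERVER}/v1/chat/completions", json=payload)
    return upstream.json()
```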
- Docker with AMD64 or ARM64 support
- A Docker Hub login
- A Hugging Face Hub login
- An existing ECR repository
# For Intel/AMD64 systems
docker pull juliensimon/sagemaker-inference-llamacpp-cpu:amd64
# For ARM64/Graviton systems
docker pull juliensimon/sagemaker-inference-llamacpp-cpu:arm64
mkdir local_models
# Start the container with a public Hugging Face model
docker run -p 8080:8080 \
-e HF_MODEL_ID="arcee-ai/arcee-lite" \
-e QUANTIZATION="Q4_K_M" \
-v $(pwd)/local_models:/opt/models \
--name llm-inference \
juliensimon/sagemaker-inference-llamacpp-cpu:arm64
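The first start downloads and converts the model, which can take a few minutes. Here is a small readiness check you can run before sending requests (a sketch assuming the adapter exposes the SageMaker-style /ping health route on port 8080):

```python
# Poll the health route until the model is downloaded and loaded.
# Assumption: the adapter serves /ping on port 8080 (SageMaker convention).
import time
import requests

while True:
    try:
        if requests.get("http://localhost:8080/ping", timeout=2).status_code == 200:
            print("Container is ready")
            break
    except requests.exceptions.RequestException:
        pass  # container still starting up
    time.sleep(5)
```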
# Chat completion
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Hello! How are you today?"}
],
"max_tokens": 100
}'
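Because the endpoints are OpenAI-compatible, the OpenAI Python client works against the local container as well. A minimal sketch; the API key is a dummy value, and the model name is a placeholder that llama-server typically ignores:

```python
# Call the local container through the OpenAI Python client (openai >= 1.0).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",  # placeholder; the server generally ignores this field
    messages=[{"role": "user", "content": "Hello! How are you today?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```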
- Docker with AMD64 or ARM64 support (building on a Mac for Graviton works great)
- AWS CLI configured with appropriate permissions
- ECR repository created
Pre-built images are available on Docker Hub:
# Pull pre-built images
docker pull juliensimon/sagemaker-inference-llamacpp-cpu:amd64
docker pull juliensimon/sagemaker-inference-llamacpp-cpu:arm64
Or build from source:
# Clone repository
git clone https://github.com/juliensimon/sagemaker-inference-container-graviton
cd sagemaker-inference-container-graviton
# Build for ARM64 (Graviton)
docker build --platform linux/arm64 -t sagemaker-inference-container-graviton:arm64 .
# Build for AMD64 (x86_64)
docker build --platform linux/amd64 -t sagemaker-inference-container-graviton:amd64 .
# Set variables
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
AWS_REGION=$(aws configure get region)
ECR_REPOSITORY="sagemaker-inference-container-graviton"
# Create ECR repository (if it doesn't exist)
aws ecr create-repository \
--repository-name $ECR_REPOSITORY \
--region $AWS_REGION \
--image-scanning-configuration scanOnPush=true \
--image-tag-mutability MUTABLE
# Login to ECR
aws ecr get-login-password --region $AWS_REGION | \
docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
# Tag images
docker tag sagemaker-inference-container-graviton:arm64 \
$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPOSITORY:arm64
docker tag sagemaker-inference-container-graviton:amd64 \
$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPOSITORY:amd64
# Push images
docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPOSITORY:arm64
docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPOSITORY:amd64
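To double-check that both tags landed in the repository, you can list them with boto3 (an optional sanity check, assuming the same repository name as above):

```python
# List the image tags present in the ECR repository.
import boto3

ecr = boto3.client("ecr")
images = ecr.describe_images(repositoryName="sagemaker-inference-container-graviton")
for detail in images["imageDetails"]:
    print(detail.get("imageTags", []))
```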
Here's a quick overview of how to deploy models. A full notebook is available in `examples/`.
# Option 1: Deploy a safetensors model from HuggingFace Hub (auto-convert + quantize)
model_environment = {
"HF_MODEL_ID": "your-model-repository",
"QUANTIZATION": "Q8_0",
"HF_TOKEN": hf_token, # optional, only for private and gated models
"LLAMA_CPP_ARGS": llama_cpp_args # optional, see llama-server -h
}
# Option 2: Deploy a GGUF model from HuggingFace Hub
model_environment = {
"HF_MODEL_ID": "your-model-repository-GGUF",
"MODEL_FILENAME": "your-model.gguf"
}
# Option 3: Deploy a safetensors model from S3 (auto-convert + quantize)
model_environment = {
"HF_MODEL_URI": "s3://your-bucket/your-model/",
"QUANTIZATION": "Q4_0"
}
# Option 4: Deploy a GGUF model from S3
model_environment = {
"HF_MODEL_URI": "s3://your-bucket/",
"MODEL_FILENAME": "your-model.gguf"
}
from sagemaker.model import Model

# Create deployable model
model = Model(
    image_uri=your_image_uri,
    role=role,
    env=model_environment,
)
# Deploy the model
response = model.deploy(...)
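For reference, a typical deploy call could look like the sketch below. The instance type, count, and endpoint name are illustrative placeholders, not values prescribed by this repository; pick a Graviton instance type for the arm64 image or an Intel one for amd64.

```python
# Illustrative deploy call -- instance type and endpoint name are placeholders.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.c7g.4xlarge",  # Graviton3; use an Intel type with the amd64 image
    endpoint_name="llamacpp-cpu-endpoint",
)
```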
model_sample_input = {
"messages": [
{"role": "system", "content": "You are a friendly and helpful AI assistant."},
{
"role": "user",
"content": "Suggest 5 names for a new neighborhood pet food store. Names should be short, fun, easy to remember, and respectful of pets. \
Explain why customers would like them.",
},
],
"max_tokens": 1024
}
import json
import boto3

runtime_sm_client = boto3.client("sagemaker-runtime")

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(model_sample_input),
)
output = json.loads(response["Body"].read().decode("utf8"))
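The response body follows the OpenAI chat completion schema, so the generated text can be pulled out like this (assuming the default non-streaming response):

```python
# Extract the assistant's reply from the OpenAI-style response.
print(output["choices"][0]["message"]["content"])
```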
| Variable | Description | Usage |
|---|---|---|
| `HF_MODEL_ID` | Hugging Face model repository | Hub deployments |
| `HF_MODEL_URI` | S3 URI for model files (safetensors or GGUF) | S3 deployments |
| `MODEL_FILENAME` | Specific GGUF file to use | GGUF model deployments |
| `HF_TOKEN` | Hugging Face token for private and gated models | Private and gated Hub models |
| `QUANTIZATION` | Quantization level (e.g., `Q4_K_M`) | Default is `F16` |
| `LLAMA_CPP_ARGS` | Additional llama.cpp arguments | Default is empty |
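As an example of LLAMA_CPP_ARGS, the sketch below passes a larger context window and an explicit thread count. The flags come from `llama-server -h`; the specific values here are arbitrary illustrations.

```python
# Illustrative environment: quantized Hub model plus extra llama-server flags.
model_environment = {
    "HF_MODEL_ID": "arcee-ai/arcee-lite",
    "QUANTIZATION": "Q4_K_M",
    "LLAMA_CPP_ARGS": "--ctx-size 8192 --threads 16",
}
```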
Modified MIT License