
An Amazon SageMaker Container for Hugging Face Inference on Graviton and Intel CPUs


💡 Not Just for SageMaker! This container runs anywhere Docker is available—on your laptop, on-prem servers, or any cloud (not just AWS or SageMaker).

Point it at a Hugging Face model, and the container automatically downloads, converts, and optimizes it for your CPU architecture.

Why?

Because small language models and modern CPUs are a great match for cost-efficient AI inference. More context in these blog posts: "The case for small language model inference on Arm CPUs" and "Is running language models on CPU really viable?".

Because I've been trying for a while to collaborate with AWS and Arm on this project, and I got tired of waiting 😴

So there. Enjoy!

Caveat: I've only tested sub-10B models so far; larger models may hit conversion or invocation timeouts. Bug reports, ideas, and pull requests are welcome.

What It Does

  • Based on a clean source build of llama.cpp
  • Native integration with the SageMaker SDK and both Graviton3/Graviton4 (ARM64) and Intel Xeon (AMD64) instances
  • Model deployment from the Hugging Face hub or an Amazon S3 bucket
  • Single-step deployment and optimization of safetensors models, with automatic GGUF conversion and quantization
  • Deployment of existing GGUF models
  • Support for OpenAI API (/v1/chat/completions, /v1/completions)
  • Support for streaming and non-streaming text generation
  • Support for all llama-server flags

Architecture

SageMaker Endpoint → FastAPI Adapter (port 8080) → llama.cpp Server (port 8081)
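
The FastAPI adapter is a thin proxy between SageMaker's container contract and llama-server. Here is a minimal sketch of that pattern, not the repository's actual code: it assumes the standard SageMaker routes (/ping and /invocations on port 8080) and llama-server's /health and OpenAI-compatible routes on port 8081.

import httpx
from fastapi import FastAPI, Request, Response

app = FastAPI()
LLAMA_SERVER = "http://127.0.0.1:8081"  # llama-server, started by the container

@app.get("/ping")
async def ping():
    # SageMaker's health check: report healthy once llama-server answers
    try:
        async with httpx.AsyncClient() as client:
            r = await client.get(f"{LLAMA_SERVER}/health")
        return Response(status_code=200 if r.status_code == 200 else 503)
    except httpx.HTTPError:
        return Response(status_code=503)

@app.post("/invocations")
@app.post("/v1/chat/completions")
async def invoke(request: Request):
    # Forward the JSON body untouched to llama-server's OpenAI-compatible API
    payload = await request.body()
    async with httpx.AsyncClient(timeout=None) as client:
        r = await client.post(
            f"{LLAMA_SERVER}/v1/chat/completions",
            content=payload,
            headers={"Content-Type": "application/json"},
        )
    return Response(content=r.content, media_type="application/json")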

Quickstart

Prerequisites

  • Docker with AMD64 or ARM64 support
  • A Docker Hub login (to pull the images)
  • A Hugging Face Hub login (only needed for private and gated models)

1. Pull the Container

# For Intel/AMD64 systems
docker pull juliensimon/sagemaker-inference-llamacpp-cpu:amd64

# For ARM64/Graviton systems
docker pull juliensimon/sagemaker-inference-llamacpp-cpu:arm64

2. Run with a Public Model

mkdir local_models
# Start the container with a public Hugging Face model
# (use the :amd64 tag instead on Intel/AMD machines)
docker run -p 8080:8080 \
  -e HF_MODEL_ID="arcee-ai/arcee-lite" \
  -e QUANTIZATION="Q4_K_M" \
  -v $(pwd)/local_models:/opt/models \
  --name llm-inference \
  juliensimon/sagemaker-inference-llamacpp-cpu:arm64

3. Test the API

# Chat completion
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello! How are you today?"}
    ],
    "max_tokens": 100
  }'
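
Streaming works through the same endpoint. A minimal sketch, assuming the adapter passes streaming responses through and that the server follows the OpenAI server-sent-events format for "stream": true, as llama-server does:

import json
import requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Tell me a short story."}],
        "max_tokens": 100,
        "stream": True,
    },
    stream=True,
)
for line in response.iter_lines():
    if line.startswith(b"data: ") and line != b"data: [DONE]":
        chunk = json.loads(line[len(b"data: "):])
        # Each chunk carries an incremental delta of the generated text
        print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
print()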

Amazon SageMaker Instructions

Prerequisites

  • Docker with AMD64 or ARM64 support (building on a Mac for Graviton works great)
  • AWS CLI configured with appropriate permissions
  • ECR repository created

1. Build the Container (Optional)

Pre-built images are available on Docker Hub:

# Pull pre-built images
docker pull juliensimon/sagemaker-inference-llamacpp-cpu:amd64
docker pull juliensimon/sagemaker-inference-llamacpp-cpu:arm64

Or build from source:

# Clone repository
git clone https://github.com/juliensimon/sagemaker-inference-container-graviton

cd sagemaker-inference-container-graviton

# Build for ARM64 (Graviton)
docker build --platform linux/arm64 -t sagemaker-inference-container-graviton:arm64 .

# Build for AMD64 (x86_64)
docker build --platform linux/amd64 -t sagemaker-inference-container-graviton:amd64 .

2. Push to ECR

# Set variables
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
AWS_REGION=$(aws configure get region)
ECR_REPOSITORY="sagemaker-inference-container-graviton"

# Create ECR repository (if it doesn't exist)
aws ecr create-repository \
    --repository-name $ECR_REPOSITORY \
    --region $AWS_REGION \
    --image-scanning-configuration scanOnPush=true \
    --image-tag-mutability MUTABLE

# Login to ECR
aws ecr get-login-password --region $AWS_REGION | \
    docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com

# Tag images
docker tag sagemaker-inference-container-graviton:arm64 \
    $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPOSITORY:arm64

docker tag sagemaker-inference-container-graviton:amd64 \
    $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPOSITORY:amd64

# Push images
docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPOSITORY:arm64
docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPOSITORY:amd64

Deploy to SageMaker

Here's a quick overview of how to deploy models. A full notebook is available in examples/.

from sagemaker.model import Model

# Option 1: Deploy a safetensors model from the Hugging Face Hub (auto-convert + quantize)
model_environment = {
    "HF_MODEL_ID": "your-model-repository",
    "QUANTIZATION": "Q8_0",
    "HF_TOKEN": hf_token,  # optional, only for private and gated models
    "LLAMA_CPP_ARGS": llama_cpp_args,  # optional, see llama-server -h
}

# Option 2: Deploy a GGUF model from the Hugging Face Hub
model_environment = {
    "HF_MODEL_ID": "your-model-repository-GGUF",
    "MODEL_FILENAME": "your-model.gguf"
}

# Option 3: Deploy a safetensors model from S3 (auto-convert + quantize)
model_environment = {
    "HF_MODEL_URI": "s3://your-bucket/your-model/",
    "QUANTIZATION": "Q4_0"
}

# Option 4: Deploy a GGUF model from S3
model_environment = {
    "HF_MODEL_URI": "s3://your-bucket/",
    "MODEL_FILENAME": "your-model.gguf"
}

# Create deployable model
model = Model(
    image_uri=your_image_uri,
    role=role,
    env=model_environment,
)

# Deploy the model
response = model.deploy(...)
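
For reference, here is a hedged sketch of the deploy call itself; the instance type and endpoint name below are illustrative assumptions, not values prescribed by this repository:

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.c7g.2xlarge",  # Graviton3; pick an Intel type (e.g. ml.c6i.2xlarge) for the amd64 image
    endpoint_name="llamacpp-cpu-endpoint",  # hypothetical name
)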

Usage

Test the endpoint

import json
import boto3

runtime_sm_client = boto3.client("sagemaker-runtime")

model_sample_input = {
    "messages": [
        {"role": "system", "content": "You are a friendly and helpful AI assistant."},
        {
            "role": "user",
            "content": "Suggest 5 names for a new neighborhood pet food store. "
            "Names should be short, fun, easy to remember, and respectful of pets. "
            "Explain why customers would like them.",
        },
    ],
    "max_tokens": 1024,
}

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(model_sample_input),
)

output = json.loads(response["Body"].read().decode("utf8"))
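
Assuming the response follows the OpenAI chat-completions schema (which llama-server emits), the generated text can be pulled out like this:

# OpenAI-style response: the reply text lives in choices[0].message.content
print(output["choices"][0]["message"]["content"])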

Environment Variables

Variable         Description                                        Notes
HF_MODEL_ID      Hugging Face model repository                      Hub deployments
HF_MODEL_URI     S3 URI for model files (safetensors or GGUF)       S3 deployments
MODEL_FILENAME   Specific GGUF file to use                          GGUF model deployments
HF_TOKEN         Hugging Face token for private and gated models    Private and gated Hub models
QUANTIZATION     Quantization level (e.g., Q4_K_M)                  Default is F16
LLAMA_CPP_ARGS   Additional llama-server arguments                  Default is empty
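
Since LLAMA_CPP_ARGS is passed straight through to llama-server, any of its flags can be set there. A sketch with two common tuning flags (--ctx-size and --threads are standard llama-server options; the values are illustrative):

# Hypothetical Hub deployment with extra llama-server flags
model_environment = {
    "HF_MODEL_ID": "arcee-ai/arcee-lite",
    "QUANTIZATION": "Q4_K_M",
    "LLAMA_CPP_ARGS": "--ctx-size 4096 --threads 8",
}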

License

Modified MIT License