- Benchmarked on an NVIDIA RTX 3080
- Ultra-low latency: 17.4ms average prediction time
- Fast initialization: Model loads in under 1 second
- Efficient memory usage: ~940MB model storage footprint
- Responsive API: Non-blocking architecture with background processing
| Metric | LightTxt-Predict Qwen2 | Standard Qwen2-0.5B | GPT-3.5 Turbo |
|---|---|---|---|
| Avg. Latency | 17.4ms | ~20ms | 50-100ms |
| Model Size | 942.3 MB | 1.17 GB | N/A (Cloud) |
| Initialization | 0.69s | 1-2s | N/A (Cloud) |
| Tokens/sec | ~57.5 | 49.94 | ~67.83 |
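The latency figure can be roughly reproduced against a locally running server. This is a minimal sketch, not the project's benchmark script; it measures end-to-end HTTP latency, which adds a small overhead on top of raw prediction time, and uses the GET endpoint described later in this README:

```python
import time
import requests

URL = "http://localhost:8000/predict"
# Distinct prompts so the measurement is not dominated by the response cache
prompts = [f"The quick brown fox number {i} jumps over the" for i in range(100)]

# One warm-up request so cold-start work is not counted
requests.get(URL, params={"text": "warm up"})

start = time.perf_counter()
for text in prompts:
    requests.get(URL, params={"text": text})
elapsed = time.perf_counter() - start

print(f"avg end-to-end latency: {elapsed / len(prompts) * 1000:.1f} ms")
```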
- GPU-optimized inference with Flash Attention 2.0 support on Ampere and newer GPUs
- Multi-level caching system for repeated queries (sketched below, after this list)
- Precomputed common phrases for frequently used inputs (needs to be expanded upon)
- Asynchronous processing with background threading
- RESTful API with both GET and POST endpoints
- Batch prediction support for improved throughput
- Automatic hardware adaptation (CPU/GPU with optimal settings) (limited testing so far)
- Comprehensive statistics and basic health monitoring
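The multi-level cache referenced above can be sketched roughly as a precomputed table of common phrases checked before an LRU-cached model call. `predict_with_model` is a hypothetical placeholder, not the project's actual function:

```python
from functools import lru_cache

# Level 1: precomputed completions for very common inputs
PRECOMPUTED = {
    "i would like to": ["know", "create", "use"],
}

def predict_with_model(text):
    # Placeholder for the real model inference call
    return ["<model", "predictions", "here>"]

# Level 2: LRU cache over recent model predictions
@lru_cache(maxsize=4096)
def cached_predict(text):
    return tuple(predict_with_model(text))

def predict(text):
    key = text.strip().lower()
    if key in PRECOMPUTED:              # fast path: no model call at all
        return PRECOMPUTED[key]
    return list(cached_predict(key))    # repeated queries hit the cache
```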
To install:

```bash
git clone https://github.com/WindingMotor/LightTxt-Predict
cd LightTxt-Predict
pip install -r requirements.txt

# Optional: Install Flash Attention for maximum performance
pip install flash-attn --no-build-isolation
```
- Python 3.8+
- NVIDIA CUDA-compatible GPU recommended (but works on CPU)
- 2GB RAM minimum (4GB+ recommended)
Start the server:

```bash
python server.py --host 0.0.0.0 --port 8000
```
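Once the server is running, a quick sanity check against the health and stats endpoints (listed in the API table below) confirms the model has loaded; the exact JSON fields they return are not documented here:

```bash
curl http://localhost:8000/health
curl http://localhost:8000/stats
```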
Example usage from Python:

```python
import requests

# Simple GET request
response = requests.get(
    "http://localhost:8000/predict",
    params={"text": "I would like to"},
)
predictions = response.json()

# Output predictions
for pred in predictions["predictions"]:
    print(f"{pred['word']} ({pred['probability']:.0%})")
```
Example output:

```
know (43%)
create (31%)
use (26%)
```
| Endpoint | Method | Description |
|---|---|---|
| /predict | GET | Get predictions for a single text input |
| /predict | POST | Get predictions for single or batch inputs |
| /health | GET | Check server health and model status |
| /stats | GET | Get performance statistics |
Example GET request:

```
http://localhost:8000/predict?text=Hello%20world&top_k=3
```
POST body for a single input:

```json
{
  "text": "Once upon a",
  "top_k": 3
}
```
POST body for a batch of inputs:

```json
{
  "text": ["Hello world", "Once upon a", "The weather is"],
  "top_k": 3
}
```
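A sketch of sending that batch body from Python; this assumes the POST endpoint accepts the JSON shown above and returns one set of predictions per input string, so the response is printed as-is rather than assuming field names:

```python
import requests

payload = {
    "text": ["Hello world", "Once upon a", "The weather is"],
    "top_k": 3,
}

response = requests.post("http://localhost:8000/predict", json=payload)
response.raise_for_status()

# Print the raw JSON; one prediction set per input text is expected
print(response.json())
```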
The server accepts several command-line options:
```bash
python server.py --help
```

- `--host`: Host address (default: localhost)
- `--port`: Port number (default: 8000)
- `--disable-optimizations`: Disable model optimizations for higher-quality (but slower) predictions
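For example, to serve on a different port with the speed optimizations turned off:

```bash
python server.py --port 9000 --disable-optimizations
```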
LightTxt-Predict uses several optimization techniques:
- Flash Attention 2.0 when available for efficient transformer computation
- Multi-level caching to avoid redundant computation
- Automatic precision adaptation (FP16 on CUDA, FP32 on CPU)
- Optimized tokenization with left-side padding
- Selective quantization for CPU deployment
- Context-aware processing that only looks at relevant text
- Warm-up phase to avoid cold-start latency
- Auto device mapping for optimal GPU memory utilization
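Several of these settings (FP16 on CUDA vs. FP32 on CPU, Flash Attention 2.0 when installed, left-side padding, auto device mapping, warm-up) map onto standard Hugging Face `transformers` options. The following is a minimal sketch of how they typically combine, not the project's actual loading code, and `Qwen/Qwen2-0.5B` is assumed as the base checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2-0.5B"  # assumed base model; the project may differ

def load_model():
    use_cuda = torch.cuda.is_available()
    # Automatic precision adaptation: FP16 on CUDA, FP32 on CPU
    dtype = torch.float16 if use_cuda else torch.float32

    kwargs = {"torch_dtype": dtype}
    if use_cuda:
        # Auto device mapping (requires the `accelerate` package)
        kwargs["device_map"] = "auto"
        try:
            # Flash Attention 2.0 only if the optional package is installed
            import flash_attn  # noqa: F401
            kwargs["attn_implementation"] = "flash_attention_2"
        except ImportError:
            pass

    # Left-side padding keeps the most recent tokens aligned for next-word prediction
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, padding_side="left")
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, **kwargs)
    model.eval()

    # Warm-up pass to avoid cold-start latency on the first real request
    with torch.no_grad():
        warmup = tokenizer("Hello", return_tensors="pt").to(model.device)
        model(**warmup)

    return tokenizer, model
```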
LightTxt-Predict offers a balance of speed and quality that outperforms many alternatives:
- vs. API-based services (OpenAI, Claude): Lower latency, no API costs, local deployment
- vs. Local LLMs: 2-5x faster than similar-sized models with standard configuration
- vs. Traditional Predictive Text: Higher-quality predictions with semantic understanding, versus n-gram-based approaches
- Intelligent text editors
- Code completion tools
- Mobile keyboard suggestions
- Form auto-completion
- Chat applications
- CLI tools with autocomplete
MIT License
This project uses the excellent Qwen2-0.5B model developed by Alibaba Cloud.