- Benchmarked on an NVIDIA RTX 3080
- Ultra-low latency: 17.4ms average prediction time
- Fast initialization: Model loads in under 1 second
- Efficient memory usage: ~940MB model storage footprint
- Responsive API: Non-blocking architecture with background processing
| Metric | LightTxt-Predict Qwen2 | Standard Qwen2-0.5B | GPT-3.5 Turbo |
|---|---|---|---|
| Avg. Latency | 17.4ms | ~20ms | 50-100ms |
| Model Size | 942.3 MB | 1.17 GB | N/A (Cloud) |
| Initialization | 0.69s | 1-2s | N/A (Cloud) |
| Tokens/sec | ~57.5 | 49.94 | ~67.83 |
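The latency figure can be roughly reproduced against a locally running server. This is a minimal sketch, not the project's benchmark script; it measures end-to-end HTTP latency, which adds a small overhead on top of raw prediction time, and uses the GET endpoint described later in this README:

```python
import time
import requests

URL = "http://localhost:8000/predict"
# Distinct prompts so the measurement is not dominated by the response cache
prompts = [f"The quick brown fox number {i} jumps over the" for i in range(100)]

# One warm-up request so cold-start work is not counted
requests.get(URL, params={"text": "warm up"})

start = time.perf_counter()
for text in prompts:
    requests.get(URL, params={"text": text})
elapsed = time.perf_counter() - start

print(f"avg end-to-end latency: {elapsed / len(prompts) * 1000:.1f} ms")
```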
- GPU-optimized inference with Flash Attention 2.0 support on Ampere and newer GPUs
- Multi-level caching system for repeated queries (sketched below, after this list)
- Precomputed common phrases for frequently used inputs (needs to be expanded upon)
- Asynchronous processing with background threading
- RESTful API with both GET and POST endpoints
- Batch prediction support for improved throughput
- Automatic hardware adaptation (CPU/GPU with optimal settings) (limited testing so far)
- Comprehensive statistics and basic health monitoring
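The multi-level cache referenced above can be sketched roughly as a precomputed table of common phrases checked before an LRU-cached model call. `predict_with_model` is a hypothetical placeholder, not the project's actual function:

```python
from functools import lru_cache

# Level 1: precomputed completions for very common inputs
PRECOMPUTED = {
    "i would like to": ["know", "create", "use"],
}

def predict_with_model(text):
    # Placeholder for the real model inference call
    return ["<model", "predictions", "here>"]

# Level 2: LRU cache over recent model predictions
@lru_cache(maxsize=4096)
def cached_predict(text):
    return tuple(predict_with_model(text))

def predict(text):
    key = text.strip().lower()
    if key in PRECOMPUTED:              # fast path: no model call at all
        return PRECOMPUTED[key]
    return list(cached_predict(key))    # repeated queries hit the cache
```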
To install:

```bash
git clone https://github.com/WindingMotor/LightTxt-Predict
cd LightTxt-Predict
pip install -r requirements.txt

# Optional: Install Flash Attention for maximum performance
pip install flash-attn --no-build-isolation
```
- Python 3.8+
- NVIDIA CUDA-compatible GPU recommended (but works on CPU)
- 2GB RAM minimum (4GB+ recommended)
Start the server:

```bash
python server.py --host 0.0.0.0 --port 8000
```
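Once the server is running, a quick sanity check against the health and stats endpoints (listed in the API table below) confirms the model has loaded; the exact JSON fields they return are not documented here:

```bash
curl http://localhost:8000/health
curl http://localhost:8000/stats
```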
Example usage from Python:

```python
import requests

# Simple GET request
response = requests.get(
    "http://localhost:8000/predict",
    params={"text": "I would like to"},
)
predictions = response.json()

# Output predictions
for pred in predictions["predictions"]:
    print(f"{pred['word']} ({pred['probability']:.0%})")
```
Example output:

```
know (43%)
create (31%)
use (26%)
```
| Endpoint | Method | Description |
|---|---|---|
| /predict | GET | Get predictions for a single text input |
| /predict | POST | Get predictions for single or batch inputs |
| /health | GET | Check server health and model status |
| /stats | GET | Get performance statistics |
Example GET request:

```
http://localhost:8000/predict?text=Hello%20world&top_k=3
```
POST body for a single input:

```json
{
  "text": "Once upon a",
  "top_k": 3
}
```
POST body for a batch of inputs:

```json
{
  "text": ["Hello world", "Once upon a", "The weather is"],
  "top_k": 3
}
```
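A sketch of sending that batch body from Python; this assumes the POST endpoint accepts the JSON shown above and returns one set of predictions per input string, so the response is printed as-is rather than assuming field names:

```python
import requests

payload = {
    "text": ["Hello world", "Once upon a", "The weather is"],
    "top_k": 3,
}

response = requests.post("http://localhost:8000/predict", json=payload)
response.raise_for_status()

# Print the raw JSON; one prediction set per input text is expected
print(response.json())
```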
The server accepts several command-line options:
```bash
python server.py --help
```

- `--host`: Host address (default: localhost)
- `--port`: Port number (default: 8000)
- `--disable-optimizations`: Disable model optimizations for higher-quality (but slower) predictions
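For example, to serve on a different port with the speed optimizations turned off:

```bash
python server.py --port 9000 --disable-optimizations
```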
LightTxt-Predict uses several optimization techniques:
- Flash Attention 2.0 when available for efficient transformer computation
- Multi-level caching to avoid redundant computation
- Automatic precision adaptation (FP16 on CUDA, FP32 on CPU)
- Optimized tokenization with left-side padding
- Selective quantization for CPU deployment
- Context-aware processing that only looks at relevant text
- Warm-up phase to avoid cold-start latency
- Auto device mapping for optimal GPU memory utilization
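Several of these settings (FP16 on CUDA vs. FP32 on CPU, Flash Attention 2.0 when installed, left-side padding, auto device mapping, warm-up) map onto standard Hugging Face `transformers` options. The following is a minimal sketch of how they typically combine, not the project's actual loading code, and `Qwen/Qwen2-0.5B` is assumed as the base checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2-0.5B"  # assumed base model; the project may differ

def load_model():
    use_cuda = torch.cuda.is_available()
    # Automatic precision adaptation: FP16 on CUDA, FP32 on CPU
    dtype = torch.float16 if use_cuda else torch.float32

    kwargs = {"torch_dtype": dtype}
    if use_cuda:
        # Auto device mapping (requires the `accelerate` package)
        kwargs["device_map"] = "auto"
        try:
            # Flash Attention 2.0 only if the optional package is installed
            import flash_attn  # noqa: F401
            kwargs["attn_implementation"] = "flash_attention_2"
        except ImportError:
            pass

    # Left-side padding keeps the most recent tokens aligned for next-word prediction
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, padding_side="left")
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, **kwargs)
    model.eval()

    # Warm-up pass to avoid cold-start latency on the first real request
    with torch.no_grad():
        warmup = tokenizer("Hello", return_tensors="pt").to(model.device)
        model(**warmup)

    return tokenizer, model
```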
LightTxt-Predict offers a balance of speed and quality that outperforms many alternatives:
- vs. API-based services (OpenAI, Claude): Lower latency, no API costs, local deployment
- vs. Local LLMs: 2-5x faster than similar-sized models with standard configuration
- vs. Traditional Predictive Text: Higher-quality predictions with semantic understanding, versus n-gram-based approaches
- Intelligent text editors
- Code completion tools
- Mobile keyboard suggestions
- Form auto-completion
- Chat applications
- CLI tools with autocomplete
MIT License
This project uses the excellent Qwen2-0.5B model developed by Alibaba Cloud.