Core Features
• No CUDA/Python dependencies
• Supports Nvidia/AMD/Intel GPUs
• Vulkan/Dx12/OpenGL backends
• WASM support (browser-ready)
• Batched inference
• Int8 and NF4 quantization
• Supports RWKV V4 through V7
• LoRA merging & model serialization
Functional Scope
• Tokenizer
• Model Loading
• State Creation & Updating
• GPU-accelerated `run` & `softmax`
• Model Quantization
• OpenAI-compatible API
• Built-in Samplers
• State Caching System
• Python Bindings
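
The sketch below shows how these pieces typically compose into an inference loop. Every type here is a simplified stand-in rather than web-rwkv's actual API (the crate's examples are the authoritative reference); the point is the RWKV-specific shape of the loop: after the prompt is consumed, each step feeds back only the newly sampled token, because the recurrent state carries the context.

// All types below are simplified stand-ins, NOT web-rwkv's real API.
struct Tokenizer;
struct Model;
#[derive(Default)]
struct State(Vec<f32>); // recurrent state, carried across calls

impl Tokenizer {
    fn encode(&self, text: &str) -> Vec<u16> {
        text.bytes().map(u16::from).collect() // placeholder byte-level encoding
    }
}

impl Model {
    // Consumes tokens and updates the state in place (hypothetical signature).
    fn run(&self, state: &mut State, tokens: &[u16]) -> Vec<f32> {
        state.0.push(tokens.len() as f32); // stand-in for the real state update
        vec![0.0; 256]                     // stand-in logits
    }
}

fn argmax(logits: &[f32]) -> u16 {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i as u16)
        .unwrap_or(0)
}

fn main() {
    let (tokenizer, model) = (Tokenizer, Model);
    let mut state = State::default(); // fresh state per conversation
    let mut tokens = tokenizer.encode("The quick brown fox");
    for _ in 0..16 {
        let logits = model.run(&mut state, &tokens); // GPU pass, state updated
        let next = argmax(&logits);                  // greedy sampling for brevity
        tokens = vec![next]; // only the new token next step; the state holds the rest
    }
    println!("state entries: {}", state.0.len());
}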
Usage Examples
Generate text with the built-in example:

cargo run --release --example gen

Chat with a local model at a given path:

cargo run --release --example chat -- --model /path/to/model.st

Chat with quantization enabled (`--quant` sets the number of quantized layers):

cargo run --release --example chat -- --quant 32
Advanced Features
The asynchronous runtime API allows the CPU and GPU to work in parallel, maximizing hardware utilization:

let runtime = TokioRuntime::new(bundle).await;
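
As an illustration of that overlap (a minimal tokio sketch, not web-rwkv code): one task stands in for GPU submission and readback while the main loop samples on the CPU, and a small channel buffer lets the two stages pipeline.

// `gpu_forward` and `sample` are stand-ins for real GPU work and sampling.
use tokio::sync::mpsc;

async fn gpu_forward(step: u32) -> Vec<f32> {
    // Stand-in for submitting a command buffer and awaiting readback.
    tokio::time::sleep(std::time::Duration::from_millis(10)).await;
    vec![step as f32; 4]
}

fn sample(logits: &[f32]) -> usize {
    // Greedy stand-in for CPU-side sampling.
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
        .unwrap_or(0)
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<Vec<f32>>(2);

    // GPU-side task: keeps the queue fed without waiting on the CPU.
    let gpu = tokio::spawn(async move {
        for step in 0..8 {
            if tx.send(gpu_forward(step).await).await.is_err() {
                break;
            }
        }
    });

    // CPU-side loop: samples each batch of logits as it arrives.
    while let Some(logits) = rx.recv().await {
        println!("sampled token {}", sample(&logits));
    }
    gpu.await.unwrap();
}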
Input Tokens → Hook Point → Output Logits
Hooks inject custom tensor operations into the inference pass, enabling dynamic LoRA, control nets, and similar runtime modifications.
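
A minimal sketch of the pattern, with `HookPoint` and `Tensor` as hypothetical stand-ins for the crate's real types: a closure registered at a layer's attention output adds a (pretend) LoRA delta when the forward pass reaches that point.

// `HookPoint` and `Tensor` are hypothetical stand-ins, not the crate's types.
use std::collections::HashMap;

#[allow(dead_code)]
#[derive(PartialEq, Eq, Hash)]
enum HookPoint {
    PostAtt(usize), // after the attention block of layer N
    PostFfn(usize), // after the feed-forward block of layer N
}

type Tensor = Vec<f32>;
type Hook = Box<dyn Fn(&mut Tensor)>;

fn main() {
    let mut hooks: HashMap<HookPoint, Hook> = HashMap::new();

    // Register a hook: add a (pretend) LoRA delta to layer 0's attention output.
    let lora_delta: Tensor = vec![0.1; 4];
    hooks.insert(
        HookPoint::PostAtt(0),
        Box::new(move |x| {
            for (xi, di) in x.iter_mut().zip(&lora_delta) {
                *xi += *di;
            }
        }),
    );

    // Inside the (pretend) forward pass, the runtime calls matching hooks.
    let mut hidden: Tensor = vec![1.0; 4];
    if let Some(hook) = hooks.get(&HookPoint::PostAtt(0)) {
        hook(&mut hidden);
    }
    println!("{hidden:?}"); // [1.1, 1.1, 1.1, 1.1]
}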
Model Conversion
Convert a PyTorch `.pth` checkpoint into the SafeTensors format the loader expects:

python assets/scripts/convert_safetensors.py --input model.pth --output model.st