Run large language models directly on your Android device. Three inference backends, persistent memory, character cards, and on-device benchmarking — all offline. No cloud. No subscription.
Import PNG/JSON character cards with lorebooks, alternate greetings, and world info. Full spec support.
MNN with OpenCL GPU (up to 2.4x faster), GGUF/llama.cpp with ARM i8mm, and OpenAI-compatible Remote API. Best backend auto-selected per model.
Collapsible <think> blocks for Qwen3, DeepSeek-R1, and QwQ. See the model's chain-of-thought reasoning live.
Private AI conversations, character personas, and on-device benchmarks — all running locally on your phone.
Real devices. Real tok/s. Reproducible configs.
| Model | MNN (OpenCL GPU) | GGUF (llama.cpp CPU) | MNN Advantage |
|---|---|---|---|
| Qwen3-0.6B | 34 tok/s | 36.5 tok/s | GGUF wins at 0.6B |
| Qwen3-1.7B | 21 tok/s | 16 tok/s | 1.3x faster |
| Qwen3-4B | 20.68 tok/s | 9.6 tok/s | 2.15x faster |
| Qwen3-8B | 14.05 tok/s | 6.4 tok/s | 2.19x faster |
| Qwen3-14B | 8.25 tok/s | 3.8 tok/s | 2.17x faster |
Benchmarked on a RedMagic 11 Pro (Snapdragon 8 Elite / SM8850, 24GB RAM, Adreno 840). GGUF uses a 2-thread futex barrier with KleidiAI i8mm; MNN uses the OpenCL GPU backend with precision=low.
| Device | SoC | Model | Backend | Decode tok/s |
|---|---|---|---|---|
| RedMagic 11 Pro | SM8850 | Qwen3-4B | OpenCL | 20.68 |
| RedMagic 11 Pro | SM8850 | Qwen3-8B | OpenCL | 14.05 |
| Galaxy S26 Ultra | SM8850 | Qwen3.5-4B | CPU | 21.30 |
| Galaxy S24 Ultra | SM8650 | Qwen3-4B | OpenCL | 13.58 |
| Lenovo TB520FU | SM8650 | Qwen3-8B | OpenCL | 10.10 |
| Xiaomi Pad 7 Pro | SM8635 | Qwen3-4B | CPU | 11.81 |
MNN OpenCL wins on standard attention models (Qwen3). Qwen3.5 (LinearAttention) auto-routes to CPU where it matches or exceeds OpenCL — no GPU-CPU transfer penalty.
| Model | Quant | Threads | Decode tok/s | Prefill tok/s |
|---|---|---|---|---|
| Qwen3-0.6B | Q4_K_M | 2T | 42.7 | 113.0 |
| Qwen3-1.7B | Q4_K_M | 2T | 16.3 | 43.9 |
| Llama-3.2-3B | Q4_K_M | 2T | 10.1 | 26.6 |
| Qwen3-4B | Q4_K_M | 2T | 9.0 | 20.7 |
| Qwen3-8B | Q4_K_M | 2T | 5.4 | 12.0 |
| Qwen3-14B | Q4_K_M | 2T | 2.7 | 5.8 |
GGUF uses llama.cpp with KleidiAI i8mm acceleration and futex-barrier threading. Two threads consistently outperform four on Snapdragon 8 Elite.
Built for privacy-conscious users, roleplay enthusiasts, and developers who want full control.
Full TavernAI V2 spec: PNG/JSON import, lorebooks, alternate greetings, {{char}}/{{user}} placeholders, world info.
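For illustration, here is a minimal sketch of how a V2 card's {{char}}/{{user}} placeholders can be expanded at load time. The field names follow the public Character Card V2 JSON spec, but the `CharacterCardV2` class and `replacePlaceholders` helper are examples, not TokForge's internal types.

```kotlin
// Illustrative sketch of V2 card fields and placeholder expansion.
// Field names follow the Character Card V2 JSON spec; this data class
// and replacePlaceholders() are examples, not TokForge's real types.
data class CharacterCardV2(
    val name: String,
    val description: String,
    val personality: String = "",
    val scenario: String = "",
    val firstMes: String = "",
    val alternateGreetings: List<String> = emptyList(),
)

fun replacePlaceholders(template: String, card: CharacterCardV2, userName: String): String =
    template
        .replace("{{char}}", card.name, ignoreCase = true)
        .replace("{{user}}", userName, ignoreCase = true)

fun main() {
    val card = CharacterCardV2(
        name = "Aria",
        description = "{{char}} is a sarcastic ship AI who calls {{user}} 'captain'.",
        firstMes = "Welcome aboard, {{user}}."
    )
    println(replacePlaceholders(card.description, card, userName = "Sam"))
    println(replacePlaceholders(card.firstMes, card, userName = "Sam"))
}
```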
Collapsible <think> reasoning blocks for Qwen3, DeepSeek-R1, and QwQ. See chain-of-thought live.
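A minimal sketch of the parsing idea, assuming the `<think>…</think>` tags these models emit. The `splitThinkBlock` helper is illustrative; a streaming UI applies the same split incrementally as tokens arrive.

```kotlin
// Minimal sketch: split a completed response into the hidden reasoning
// (the <think>...</think> span) and the visible answer.
data class ParsedReply(val reasoning: String?, val answer: String)

fun splitThinkBlock(raw: String): ParsedReply {
    val start = raw.indexOf("<think>")
    val end = raw.indexOf("</think>")
    if (start == -1 || end == -1 || end < start) return ParsedReply(null, raw.trim())
    val reasoning = raw.substring(start + "<think>".length, end).trim()
    val answer = (raw.substring(0, start) + raw.substring(end + "</think>".length)).trim()
    return ParsedReply(reasoning, answer)
}

fun main() {
    val reply = "<think>The user asked for 2 + 2, which is 4.</think>2 + 2 = 4."
    val parsed = splitThinkBlock(reply)
    println("reasoning: ${parsed.reasoning}")
    println("answer: ${parsed.answer}")
}
```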
A curated model library organized across four categories: General, Roleplay, Creative, and Thinking. Plus HuggingFace search for any GGUF model.
MNN + OpenCL GPU (up to 2.4x faster), GGUF/llama.cpp with ARM i8mm/KleidiAI, and OpenAI-compatible remote API. Auto-selects the best for your hardware and model.
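As a rough sketch of how that routing can be expressed, using the signals described on this page (model format, attention type, OpenCL availability, free RAM). The enum, the `ModelInfo` fields, and the rules below are hypothetical, not TokForge's actual selection logic.

```kotlin
// Hypothetical backend-routing heuristic; names and rules are illustrative.
enum class Backend { MNN_OPENCL, MNN_CPU, GGUF_CPU, REMOTE_API }

data class ModelInfo(
    val format: String,           // "mnn" or "gguf"
    val linearAttention: Boolean, // e.g. Qwen3.5-style LinearAttention
    val sizeBytes: Long,
)

fun pickBackend(model: ModelInfo, openClAvailable: Boolean, freeRamBytes: Long): Backend = when {
    model.sizeBytes > freeRamBytes -> Backend.REMOTE_API  // too large to run locally
    model.format == "gguf" -> Backend.GGUF_CPU            // llama.cpp + KleidiAI i8mm
    model.linearAttention -> Backend.MNN_CPU              // avoids GPU-CPU transfer penalty
    openClAvailable -> Backend.MNN_OPENCL                 // standard attention: GPU wins
    else -> Backend.MNN_CPU
}
```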
Seven sampling presets: Default, Creative, Precise, Deterministic, Roleplay, MNN Optimized, and GGUF Optimized. Full sampler control including DRY, Mirostat, and min-p.
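A sketch of what a preset can look like as a plain config object, using common llama.cpp-style sampler names. The `SamplerPreset` type and the values shown are illustrative, not the shipped presets.

```kotlin
// Illustrative sampler preset shape; values are examples only.
data class SamplerPreset(
    val name: String,
    val temperature: Float,
    val topP: Float,
    val minP: Float,
    val mirostat: Int = 0,        // 0 = off, 1/2 = Mirostat v1/v2
    val dryMultiplier: Float = 0f // 0 = DRY repetition penalty off
)

val presets = listOf(
    SamplerPreset("Deterministic", temperature = 0.0f, topP = 1.0f, minP = 0.0f),
    SamplerPreset("Precise", temperature = 0.3f, topP = 0.9f, minP = 0.05f),
    SamplerPreset("Creative", temperature = 1.1f, topP = 0.95f, minP = 0.02f, dryMultiplier = 0.8f),
)
```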
OpenAI-compatible streaming. Connect to Ollama, llama.cpp server, vLLM, or text-generation-webui as a fallback for larger models.
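A minimal streaming sketch against an OpenAI-compatible endpoint. The `/v1/chat/completions` path, `"stream": true`, and the `data:` / `[DONE]` SSE framing are part of the protocol; the URL and model name are placeholders, and real code would parse each JSON chunk instead of printing it.

```kotlin
import java.net.HttpURLConnection
import java.net.URL

// Minimal sketch of an OpenAI-compatible streaming request (SSE framing),
// e.g. against a local Ollama or llama.cpp server. URL and model are placeholders.
fun streamChat(baseUrl: String = "http://192.168.1.10:11434", model: String = "qwen3:4b") {
    val conn = URL("$baseUrl/v1/chat/completions").openConnection() as HttpURLConnection
    conn.requestMethod = "POST"
    conn.doOutput = true
    conn.setRequestProperty("Content-Type", "application/json")
    val body = """
        {"model":"$model","stream":true,
         "messages":[{"role":"user","content":"Hello from TokForge"}]}
    """.trimIndent()
    conn.outputStream.use { it.write(body.toByteArray()) }

    conn.inputStream.bufferedReader().useLines { lines ->
        lines.filter { it.startsWith("data: ") }
            .map { it.removePrefix("data: ") }
            .takeWhile { it != "[DONE]" }
            .forEach { chunk -> println(chunk) } // each chunk is a JSON delta
    }
    conn.disconnect()
}
```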
Parameter sweeps across threads, batch sizes, and GPU configs. Persistent benchmark database with cross-device export/import.
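A sketch of the sweep idea under simple assumptions: cross thread counts with batch sizes and rank the results. `BenchResult` and the `runDecodeBenchmark` callback are placeholders for the app's own benchmark runner.

```kotlin
// Illustrative benchmark sweep grid; types and defaults are examples only.
data class BenchResult(val threads: Int, val batchSize: Int, val decodeTokPerSec: Double)

fun sweep(
    threadCounts: List<Int> = listOf(2, 4, 6),
    batchSizes: List<Int> = listOf(64, 128, 256),
    runDecodeBenchmark: (threads: Int, batch: Int) -> Double,
): List<BenchResult> =
    threadCounts.flatMap { t ->
        batchSizes.map { b -> BenchResult(t, b, runDecodeBenchmark(t, b)) }
    }.sortedByDescending { it.decodeTokPerSec }
```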
Auto-detects SoC, CPU topology, GPU, and RAM. Recommends optimal models and settings for your specific device.
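The signals involved map onto standard Android APIs. A sketch, assuming the illustrative `DeviceProfile` type below; reading per-core max frequencies from sysfs works on most devices but is not guaranteed.

```kotlin
import android.app.ActivityManager
import android.content.Context
import android.os.Build
import java.io.File

// Sketch of standard device-profiling signals: SoC model (API 31+), core count,
// per-core max frequency (big/little topology), and total RAM.
data class DeviceProfile(
    val soc: String,
    val coreCount: Int,
    val maxFreqKhzPerCore: List<Long>,
    val totalRamBytes: Long,
)

fun profileDevice(context: Context): DeviceProfile {
    val soc = if (Build.VERSION.SDK_INT >= 31) Build.SOC_MODEL else Build.HARDWARE
    val cores = Runtime.getRuntime().availableProcessors()
    // Reading cpufreq from sysfs is common but not guaranteed on every device.
    val maxFreqs = (0 until cores).map { i ->
        File("/sys/devices/system/cpu/cpu$i/cpufreq/cpuinfo_max_freq")
            .takeIf { it.canRead() }?.readText()?.trim()?.toLongOrNull() ?: 0L
    }
    val memInfo = ActivityManager.MemoryInfo()
    (context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager).getMemoryInfo(memInfo)
    return DeviceProfile(soc, cores, maxFreqs, memInfo.totalMem)
}
```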
Voice input via Android SpeechRecognizer and text-to-speech read-aloud on assistant messages. Hands-free AI conversations on your phone.
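Both features map onto platform APIs (`SpeechRecognizer` and `TextToSpeech`). A trimmed sketch, leaving out the RECORD_AUDIO permission and lifecycle handling a real integration needs; the callback wiring is illustrative.

```kotlin
import android.content.Context
import android.content.Intent
import android.os.Bundle
import android.speech.RecognitionListener
import android.speech.RecognizerIntent
import android.speech.SpeechRecognizer
import android.speech.tts.TextToSpeech

// Start listening and hand the top recognition result to the caller.
fun startVoiceInput(context: Context, onText: (String) -> Unit): SpeechRecognizer {
    val recognizer = SpeechRecognizer.createSpeechRecognizer(context)
    recognizer.setRecognitionListener(object : RecognitionListener {
        override fun onResults(results: Bundle) {
            results.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
                ?.firstOrNull()?.let(onText)
        }
        override fun onReadyForSpeech(params: Bundle?) {}
        override fun onBeginningOfSpeech() {}
        override fun onRmsChanged(rmsdB: Float) {}
        override fun onBufferReceived(buffer: ByteArray?) {}
        override fun onEndOfSpeech() {}
        override fun onError(error: Int) {}
        override fun onPartialResults(partialResults: Bundle?) {}
        override fun onEvent(eventType: Int, params: Bundle?) {}
    })
    recognizer.startListening(Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH))
    return recognizer
}

// Read an assistant message aloud once the TTS engine reports it is ready.
fun speak(context: Context, text: String) {
    var tts: TextToSpeech? = null
    tts = TextToSpeech(context) { status ->
        if (status == TextToSpeech.SUCCESS) {
            tts?.speak(text, TextToSpeech.QUEUE_FLUSH, null, "tokforge-msg")
        }
    }
}
```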
Rich text formatting during streaming with a dual-mode roleplay and markdown renderer. Code blocks, bold, italic, and lists are all rendered live.
Your AI remembers you across conversations. Per-character memory with full-text search, knowledge graphs, and document import. Pin important facts, archive old ones, and let the AI learn your preferences over time.
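As one way such a store can be implemented, here is a sketch using Android's built-in SQLite FTS4 full-text search. The table and column names are examples, not TokForge's actual schema.

```kotlin
import android.content.ContentValues
import android.content.Context
import android.database.sqlite.SQLiteDatabase
import android.database.sqlite.SQLiteOpenHelper

// Illustrative schema only: per-character memory facts indexed with SQLite FTS4.
class MemoryDb(context: Context) : SQLiteOpenHelper(context, "memory.db", null, 1) {
    override fun onCreate(db: SQLiteDatabase) {
        db.execSQL("CREATE VIRTUAL TABLE facts USING fts4(character_id, fact, pinned)")
    }
    override fun onUpgrade(db: SQLiteDatabase, oldV: Int, newV: Int) {}

    fun remember(characterId: String, fact: String, pinned: Boolean = false) {
        writableDatabase.insert("facts", null, ContentValues().apply {
            put("character_id", characterId)
            put("fact", fact)
            put("pinned", if (pinned) 1 else 0)
        })
    }

    fun search(characterId: String, query: String): List<String> =
        readableDatabase.rawQuery(
            "SELECT fact FROM facts WHERE character_id = ? AND fact MATCH ?",
            arrayOf(characterId, query)
        ).use { c -> generateSequence { if (c.moveToNext()) c.getString(0) else null }.toList() }
}
```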
Select text anywhere on Android and tap "TokForge" to save it as a memory fact or inject it into chat. Share intent support and background inference with wake lock.
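The text-selection entry point is Android's `ACTION_PROCESS_TEXT`. A sketch of the receiving side, with `ProcessTextActivity` and `saveAsMemoryFact` as illustrative names; the manifest also needs a matching `PROCESS_TEXT` intent filter for `text/plain`.

```kotlin
import android.app.Activity
import android.content.Intent
import android.os.Bundle

// Illustrative receiver for Android's text-selection action.
class ProcessTextActivity : Activity() {
    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        val selected = intent.getCharSequenceExtra(Intent.EXTRA_PROCESS_TEXT)?.toString()
        if (selected != null) {
            // Hand the selection to the app: save as a memory fact or inject into chat.
            saveAsMemoryFact(selected)
        }
        finish()
    }

    private fun saveAsMemoryFact(text: String) {
        // Placeholder for the app's own persistence call.
    }
}
```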
Four optimization tiers from Instant (5 min) to Long (2 hours). AutoForge sweeps all configs automatically. Shareable PNG report cards, cross-device comparison matrix, and exportable profiles.
Automatic device profiling on first launch. Degradation watchdog monitors inference quality and recommends config changes in real-time.
Caches pre-computed system prompt state to disk. Subsequent messages in a conversation start instantly, with no re-processing of the system prompt every turn.
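A sketch of the caching idea: key the saved prefill state by a hash of the system prompt so an unchanged prompt is never re-processed. The `InferenceEngine` interface and its `saveState` / `loadState` calls are hypothetical placeholders for whatever the backend binding exposes.

```kotlin
import java.io.File
import java.security.MessageDigest

// Hypothetical engine surface; real bindings differ.
interface InferenceEngine {
    fun prefill(prompt: String)
    fun saveState(path: File)
    fun loadState(path: File): Boolean
}

fun promptCacheFile(cacheDir: File, systemPrompt: String): File {
    val digest = MessageDigest.getInstance("SHA-256")
        .digest(systemPrompt.toByteArray())
        .joinToString("") { "%02x".format(it) }
    return File(cacheDir, "sysprompt-$digest.state")
}

fun warmSystemPrompt(engine: InferenceEngine, cacheDir: File, systemPrompt: String) {
    val cached = promptCacheFile(cacheDir, systemPrompt)
    if (cached.exists() && engine.loadState(cached)) return  // instant start
    engine.prefill(systemPrompt)                             // pay the cost once...
    engine.saveState(cached)                                 // ...then reuse on later turns
}
```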
Chat + inference pipeline
TokForge is free during the beta period. Help us test across devices and shape the future of private mobile AI.
No telemetry. No background reporting. Your data stays on your device unless you explicitly opt in.