TokForge — Private AI Chat on Your Phone

Private AI Chat on Your Phone.

Run large language models directly on your Android device. Three inference backends, persistent memory, character cards, and on-device benchmarking — all offline. No cloud. No subscription.

100% offline · Zero telemetry · Free during beta · Android 8.0+

TavernAI V2 Character Cards

Import PNG/JSON character cards with lorebooks, alternate greetings, and world info. Full spec support.

Three Inference Backends

MNN with OpenCL GPU (up to 2.4x faster), GGUF/llama.cpp with ARM i8mm, and OpenAI-compatible Remote API. Best backend auto-selected per model.

Thinking & Reasoning Mode

Collapsible <think> blocks for Qwen3, DeepSeek-R1, and QwQ. See the model's chain-of-thought reasoning live.
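As an illustration of what such extraction involves, here is a minimal Python sketch that separates `<think>` reasoning from the visible answer. The function name and the handling of still-open tags are our assumptions, not TokForge's implementation:

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Separate <think>...</think> reasoning from the visible reply.

    Returns (reasoning, answer). An unclosed <think> block — common while
    a response is still streaming — is treated as reasoning-in-progress.
    """
    match = re.search(r"<think>(.*?)(?:</think>|$)", text, re.DOTALL)
    if not match:
        return "", text
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

A UI can render the first element collapsed and stream the second as the chat bubble.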

Cross-Device Benchmark Matrix

Real devices. Real tok/s. Reproducible configs.

Updated: 2026-03-05 — v2.7.0

MNN vs GGUF — Same Model, Same Device

| Model | MNN (OpenCL GPU) | GGUF (llama.cpp CPU) | MNN Advantage |
|---|---|---|---|
| Qwen3-0.6B | 34 tok/s | 36.5 tok/s | GGUF wins at 0.6B |
| Qwen3-1.7B | 21 tok/s | 16 tok/s | 1.3x faster |
| Qwen3-4B | 20.68 tok/s | 9.6 tok/s | 2.15x faster |
| Qwen3-8B | 14.05 tok/s | 6.4 tok/s | 2.19x faster |
| Qwen3-14B | 8.25 tok/s | 3.8 tok/s | 2.17x faster |

Benchmarked on RedMagic 11 Pro (Snapdragon 8 Elite / SM8850, 24GB RAM, Adreno 840). GGUF uses 2-thread futex barrier, KleidiAI i8mm. MNN uses OpenCL GPU with precision=low.

Cross-Device Fleet Results (v2.7.0)

| Device | SoC | Model | Backend | Decode tok/s |
|---|---|---|---|---|
| RedMagic 11 Pro | SM8850 | Qwen3-4B | OpenCL | 20.68 |
| RedMagic 11 Pro | SM8850 | Qwen3-8B | OpenCL | 14.05 |
| Galaxy S26 Ultra | SM8850 | Qwen3.5-4B | CPU | 21.30 |
| Galaxy S24 Ultra | SM8650 | Qwen3-4B | OpenCL | 13.58 |
| Lenovo TB520FU | SM8650 | Qwen3-8B | OpenCL | 10.10 |
| Xiaomi Pad 7 Pro | SM8635 | Qwen3-4B | CPU | 11.81 |

MNN OpenCL wins on standard attention models (Qwen3). Qwen3.5 (LinearAttention) auto-routes to CPU where it matches or exceeds OpenCL — no GPU-CPU transfer penalty.

GGUF Decode Speed by Model Size (RedMagic 11 Pro)

| Model | Quant | Threads | Decode tok/s | Prefill tok/s |
|---|---|---|---|---|
| Qwen3-0.6B | Q4_K_M | 2T | 42.7 | 113.0 |
| Qwen3-1.7B | Q4_K_M | 2T | 16.3 | 43.9 |
| Llama-3.2-3B | Q4_K_M | 2T | 10.1 | 26.6 |
| Qwen3-4B | Q4_K_M | 2T | 9.0 | 20.7 |
| Qwen3-8B | Q4_K_M | 2T | 5.4 | 12.0 |
| Qwen3-14B | Q4_K_M | 2T | 2.7 | 5.8 |

GGUF uses llama.cpp with KleidiAI i8mm acceleration and futex barrier threading. A 2-thread configuration consistently outperforms 4 threads on Snapdragon 8 Elite, where decode is memory-bandwidth bound.

Key Findings

  • MNN with OpenCL GPU is up to 2.4x faster than GGUF for models 1.7B and above.
  • GGUF wins for tiny models (0.6B) and is required for thinking/reasoning mode.
  • 2 threads beats 4 threads for GGUF on Snapdragon 8 Elite — memory-bandwidth bound.
  • OpenCL benefit is device-dependent: great on RedMagic/Lenovo, slower on Xiaomi.
  • All results exportable and reproducible via the 70+ endpoint remote API.

Everything you need for local AI chat.

Built for privacy-conscious users, roleplay enthusiasts, and developers who want full control.

Character Cards

Full TavernAI V2 spec: PNG/JSON import, lorebooks, alternate greetings, {{char}}/{{user}} placeholders, world info.
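To illustrate the placeholder part of the spec, here is a minimal sketch of {{char}}/{{user}} substitution. The function name is ours, and real V2 cards define further macros this sketch ignores:

```python
import re

def render_card_text(template: str, char_name: str, user_name: str) -> str:
    """Replace TavernAI-style {{char}}/{{user}} placeholders, case-insensitively."""
    def repl(match: re.Match) -> str:
        return char_name if match.group(1).lower() == "char" else user_name
    return re.sub(r"\{\{(char|user)\}\}", repl, template, flags=re.IGNORECASE)
```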

Thinking Mode

Collapsible <think> reasoning blocks for Qwen3, DeepSeek-R1, and QwQ. See chain-of-thought live.

17 Curated Models

Organized across 4 categories: General, Roleplay, Creative, and Thinking. Plus HuggingFace search for any GGUF model.

Three Backends

MNN + OpenCL GPU (up to 2.4x faster), GGUF/llama.cpp with ARM i8mm/KleidiAI, and OpenAI-compatible remote API. Auto-selects the best for your hardware and model.

7 Inference Presets

Default, Creative, Precise, Deterministic, Roleplay, MNN Optimized, GGUF Optimized. Full sampler control including DRY, Mirostat, min-p.
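As a rough illustration, a preset can be modeled as a named bundle of sampler values that per-chat settings override. The preset names below come from the list above, but every numeric value is a placeholder, not TokForge's shipped defaults:

```python
# Illustrative sampler presets — values are placeholders, not the app's defaults.
PRESETS = {
    "Default":       {"temperature": 0.8, "top_p": 0.95, "min_p": 0.05},
    "Creative":      {"temperature": 1.1, "top_p": 0.98, "min_p": 0.02},
    "Precise":       {"temperature": 0.4, "top_p": 0.90, "min_p": 0.10},
    "Deterministic": {"temperature": 0.0, "top_p": 1.00, "min_p": 0.00},
}

def sampler_for(preset: str, **overrides) -> dict:
    """Start from a named preset, then apply per-chat overrides on a copy."""
    cfg = dict(PRESETS[preset])
    cfg.update(overrides)
    return cfg
```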

Remote API

OpenAI-compatible streaming. Connect to Ollama, llama.cpp server, vLLM, or text-generation-webui as a fallback for larger models.
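Because the API is OpenAI-compatible, any client can build the same request body for any of those servers. A minimal sketch — the model name `qwen3:4b` is illustrative and server-specific:

```python
import json

def chat_request(model: str, messages: list[dict], stream: bool = True) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body.

    Any server speaking this dialect (Ollama, llama.cpp server, vLLM,
    text-generation-webui) accepts the same shape.
    """
    return {"model": model, "messages": messages, "stream": stream}

body = json.dumps(chat_request(
    "qwen3:4b",  # model name is server-specific; this one is illustrative
    [{"role": "user", "content": "Hello"}],
))
```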

Auto-Tune & Benchmarks

Parameter sweeps across threads, batch sizes, and GPU configs. Persistent benchmark database with cross-device export/import.

Hardware Profiling

Auto-detects SoC, CPU topology, GPU, and RAM. Recommends optimal models and settings for your specific device.
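A sketch of the kind of heuristic such a recommender might use. The 0.6 GB-per-billion-parameters figure and the half-of-RAM budget are rough rules of thumb for Q4_K_M GGUF files, not TokForge's actual logic:

```python
def recommend_model_size(ram_gb: float) -> str:
    """Pick the largest model whose Q4_K_M weights fit a conservative RAM budget.

    Assumes ~0.6 GB per billion parameters at Q4_K_M, and reserves roughly
    half of RAM for the OS, the KV cache, and other apps. Illustrative only.
    """
    usable_gb = ram_gb * 0.5
    params_b = usable_gb / 0.6
    for size in (14, 8, 4, 1.7, 0.6):
        if params_b >= size:
            return f"Qwen3-{size:g}B"
    return "Qwen3-0.6B"
```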

Voice & TTS

Voice input via Android SpeechRecognizer and text-to-speech read-aloud on assistant messages. Hands-free AI conversations on your phone.

Markdown Rendering

Rich text formatting during streaming with dual-mode roleplay and markdown renderer. Code blocks, bold, italic, lists — all rendered live.

Persistent Memory

Your AI remembers you across conversations. Per-character memory with full-text search, knowledge graphs, and document import. Pin important facts, archive old ones, and let the AI learn your preferences over time.
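As a toy illustration of that pin/archive/search lifecycle — TokForge's real store adds full-text indexing and knowledge graphs, and all names here are ours:

```python
from dataclasses import dataclass

@dataclass
class MemoryFact:
    text: str
    pinned: bool = False
    archived: bool = False

class CharacterMemory:
    """Toy per-character memory store: add, pin, archive, substring search."""

    def __init__(self):
        self.facts: list[MemoryFact] = []

    def add(self, text: str) -> MemoryFact:
        fact = MemoryFact(text)
        self.facts.append(fact)
        return fact

    def search(self, query: str) -> list[MemoryFact]:
        """Substring search over non-archived facts, pinned facts first."""
        q = query.lower()
        hits = [f for f in self.facts if not f.archived and q in f.text.lower()]
        return sorted(hits, key=lambda f: not f.pinned)
```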

Android Integration

Select text anywhere on Android and tap "TokForge" to save as a memory fact or inject into chat. Share intent support and background inference with wake lock.

ForgeLab Benchmarking

Four optimization tiers from Instant (5 min) to Long (2 hours). AutoForge sweeps all configs automatically. Shareable PNG report cards, cross-device comparison matrix, and exportable profiles.

Auto-Profiler & Watchdog

Automatic device profiling on first launch. Degradation watchdog monitors inference quality and recommends config changes in real-time.

System Prompt Cache

Caches pre-computed system prompt state to disk. Second messages in a conversation start instantly — no re-processing the system prompt every turn.
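The idea can be sketched as a cache keyed on both the model and the system prompt, so changing either invalidates the entry. Paths and helper names below are illustrative, not TokForge's internals:

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

# Illustrative location — a real app would use a stable app-data directory.
CACHE_DIR = Path(tempfile.mkdtemp())

def cache_key(system_prompt: str, model_id: str) -> str:
    """Entries must be invalidated when either the prompt or the model
    changes, so both go into the key."""
    return hashlib.sha256(f"{model_id}\x00{system_prompt}".encode()).hexdigest()

def load_or_compute(system_prompt: str, model_id: str, compute):
    """Return the cached prompt state if present, else compute and persist it."""
    path = CACHE_DIR / cache_key(system_prompt, model_id)
    if path.exists():
        return pickle.loads(path.read_bytes())
    state = compute(system_prompt)
    path.write_bytes(pickle.dumps(state))
    return state
```

On the second call with the same prompt and model, `compute` (the expensive prefill) is skipped entirely.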

Read the docs

Built for serious mobile inference.

  • Character cards + persona — TavernAI V2 import, user persona injection, system prompt assembly
  • MNN + GGUF dual backends — GPU-accelerated MNN and CPU-optimized llama.cpp with ARM i8mm
  • Hardware profiler — Auto-detects SoC, CPU topology, GPU, RAM for optimal config
  • 70+ API endpoints — Full remote control: inference, models, config, benchmarks, UI navigation
  • Benchmark database — Persistent results, cross-device export/import, auto-matrix benchmarks
1. Import character card → system prompt assembly
2. Hardware profiler → auto-detect optimal config
3. Backend (MNN GPU / GGUF CPU) → token streaming
4. Thinking extraction → collapsible <think> blocks
5. Benchmark → persistent results → export

Chat + inference pipeline

Local-first by design.
Transparent by default.

Get Free Beta Access

TokForge is free during the beta period. Help us test across devices and shape the future of private mobile AI.

v2.7.0 — Three backends, persistent memory, character cards, thinking mode, 17 curated models, 70+ API endpoints. Free while in beta.

Request Access

No spam. We'll only email you about beta access.

No telemetry. No background reporting. Your data stays on your device unless you explicitly opt in.