TokForge — Private AI Chat on Your Phone

Private AI Chat on Your Phone.

Run large language models directly on your Android device. Three inference backends, persistent memory, character cards, and on-device benchmarking — all offline. No cloud. No subscription.

100% offline · Zero telemetry · Free during beta · Android 8.0+

TavernAI V2 Character Cards

Import PNG/JSON character cards with lorebooks, alternate greetings, and world info. Full spec support.

Three Inference Backends

MNN with OpenCL GPU (up to 2.4x faster), GGUF/llama.cpp with ARM i8mm, and OpenAI-compatible Remote API. Best backend auto-selected per model.

Thinking & Reasoning Mode

Collapsible <think> blocks for Qwen3, DeepSeek-R1, and QwQ. See the model's chain-of-thought reasoning live.
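As an illustration of what such extraction involves, here is a minimal Python sketch that separates `<think>` reasoning from the visible answer. The function name and the handling of still-open tags are our assumptions, not TokForge's implementation:

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Separate <think>...</think> reasoning from the visible reply.

    Returns (reasoning, answer). An unclosed <think> block — common while
    a response is still streaming — is treated as reasoning-in-progress.
    """
    match = re.search(r"<think>(.*?)(?:</think>|$)", text, re.DOTALL)
    if not match:
        return "", text
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

A UI can render the first element collapsed and stream the second as the chat bubble.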

Cross-Device Benchmark Matrix

Real devices. Real tok/s. Reproducible configs.

Updated: 2026-03-05 — v2.7.0

MNN vs GGUF — Same Model, Same Device

| Model | MNN (OpenCL GPU) | GGUF (llama.cpp CPU) | MNN Advantage |
|---|---|---|---|
| Qwen3-0.6B | 34 tok/s | 36.5 tok/s | GGUF wins at 0.6B |
| Qwen3-1.7B | 21 tok/s | 16 tok/s | 1.3x faster |
| Qwen3-4B | 20.68 tok/s | 9.6 tok/s | 2.15x faster |
| Qwen3-8B | 14.05 tok/s | 6.4 tok/s | 2.19x faster |
| Qwen3-14B | 8.25 tok/s | 3.8 tok/s | 2.17x faster |

Benchmarked on RedMagic 11 Pro (Snapdragon 8 Elite / SM8850, 24GB RAM, Adreno 840). GGUF uses 2-thread futex barrier, KleidiAI i8mm. MNN uses OpenCL GPU with precision=low.

Cross-Device Fleet Results (v2.7.0)

| Device | SoC | Model | Backend | Decode tok/s |
|---|---|---|---|---|
| RedMagic 11 Pro | SM8850 | Qwen3-4B | OpenCL | 20.68 |
| RedMagic 11 Pro | SM8850 | Qwen3-8B | OpenCL | 14.05 |
| Galaxy S26 Ultra | SM8850 | Qwen3.5-4B | CPU | 21.30 |
| Galaxy S24 Ultra | SM8650 | Qwen3-4B | OpenCL | 13.58 |
| Lenovo TB520FU | SM8650 | Qwen3-8B | OpenCL | 10.10 |
| Xiaomi Pad 7 Pro | SM8635 | Qwen3-4B | CPU | 11.81 |

MNN OpenCL wins on standard attention models (Qwen3). Qwen3.5 (LinearAttention) auto-routes to CPU where it matches or exceeds OpenCL — no GPU-CPU transfer penalty.

GGUF Decode Speed by Model Size (RedMagic 11 Pro)

| Model | Quant | Threads | Decode tok/s | Prefill tok/s |
|---|---|---|---|---|
| Qwen3-0.6B | Q4_K_M | 2T | 42.7 | 113.0 |
| Qwen3-1.7B | Q4_K_M | 2T | 16.3 | 43.9 |
| Llama-3.2-3B | Q4_K_M | 2T | 10.1 | 26.6 |
| Qwen3-4B | Q4_K_M | 2T | 9.0 | 20.7 |
| Qwen3-8B | Q4_K_M | 2T | 5.4 | 12.0 |
| Qwen3-14B | Q4_K_M | 2T | 2.7 | 5.8 |

GGUF uses llama.cpp with KleidiAI i8mm acceleration and futex barrier threading. A 2-thread configuration consistently outperforms 4 threads on Snapdragon 8 Elite, where decode is memory-bandwidth bound.

Key Findings

  • MNN with OpenCL GPU is up to 2.4x faster than GGUF for models 1.7B and above.
  • GGUF wins for tiny models (0.6B) and is required for thinking/reasoning mode.
  • 2 threads beats 4 threads for GGUF on Snapdragon 8 Elite — memory-bandwidth bound.
  • OpenCL benefit is device-dependent: great on RedMagic/Lenovo, slower on Xiaomi.
  • All results exportable and reproducible via the 70+ endpoint remote API.

Everything you need for local AI chat.

Built for privacy-conscious users, roleplay enthusiasts, and developers who want full control.

Character Cards

Full TavernAI V2 spec: PNG/JSON import, lorebooks, alternate greetings, {{char}}/{{user}} placeholders, world info.
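To illustrate the placeholder part of the spec, here is a minimal sketch of {{char}}/{{user}} substitution. The function name is ours, and real V2 cards define further macros this sketch ignores:

```python
import re

def render_card_text(template: str, char_name: str, user_name: str) -> str:
    """Replace TavernAI-style {{char}}/{{user}} placeholders, case-insensitively."""
    def repl(match: re.Match) -> str:
        return char_name if match.group(1).lower() == "char" else user_name
    return re.sub(r"\{\{(char|user)\}\}", repl, template, flags=re.IGNORECASE)
```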

Thinking Mode

Collapsible <think> reasoning blocks for Qwen3, DeepSeek-R1, and QwQ. See chain-of-thought live.

17 Curated Models

Organized across 4 categories: General, Roleplay, Creative, and Thinking. Plus HuggingFace search for any GGUF model.

Three Backends

MNN + OpenCL GPU (up to 2.4x faster), GGUF/llama.cpp with ARM i8mm/KleidiAI, and OpenAI-compatible remote API. Auto-selects the best for your hardware and model.

7 Inference Presets

Default, Creative, Precise, Deterministic, Roleplay, MNN Optimized, GGUF Optimized. Full sampler control including DRY, Mirostat, min-p.
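As a rough illustration, a preset can be modeled as a named bundle of sampler values that per-chat settings override. The preset names below come from the list above, but every numeric value is a placeholder, not TokForge's shipped defaults:

```python
# Illustrative sampler presets — values are placeholders, not the app's defaults.
PRESETS = {
    "Default":       {"temperature": 0.8, "top_p": 0.95, "min_p": 0.05},
    "Creative":      {"temperature": 1.1, "top_p": 0.98, "min_p": 0.02},
    "Precise":       {"temperature": 0.4, "top_p": 0.90, "min_p": 0.10},
    "Deterministic": {"temperature": 0.0, "top_p": 1.00, "min_p": 0.00},
}

def sampler_for(preset: str, **overrides) -> dict:
    """Start from a named preset, then apply per-chat overrides on a copy."""
    cfg = dict(PRESETS[preset])
    cfg.update(overrides)
    return cfg
```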

Remote API

OpenAI-compatible streaming. Connect to Ollama, llama.cpp server, vLLM, or text-generation-webui as a fallback for larger models.
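Because the API is OpenAI-compatible, any client can build the same request body for any of those servers. A minimal sketch — the model name `qwen3:4b` is illustrative and server-specific:

```python
import json

def chat_request(model: str, messages: list[dict], stream: bool = True) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body.

    Any server speaking this dialect (Ollama, llama.cpp server, vLLM,
    text-generation-webui) accepts the same shape.
    """
    return {"model": model, "messages": messages, "stream": stream}

body = json.dumps(chat_request(
    "qwen3:4b",  # model name is server-specific; this one is illustrative
    [{"role": "user", "content": "Hello"}],
))
```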

Auto-Tune & Benchmarks

Parameter sweeps across threads, batch sizes, and GPU configs. Persistent benchmark database with cross-device export/import.

Hardware Profiling

Auto-detects SoC, CPU topology, GPU, and RAM. Recommends optimal models and settings for your specific device.
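A sketch of the kind of heuristic such a recommender might use. The 0.6 GB-per-billion-parameters figure and the half-of-RAM budget are rough rules of thumb for Q4_K_M GGUF files, not TokForge's actual logic:

```python
def recommend_model_size(ram_gb: float) -> str:
    """Pick the largest model whose Q4_K_M weights fit a conservative RAM budget.

    Assumes ~0.6 GB per billion parameters at Q4_K_M, and reserves roughly
    half of RAM for the OS, the KV cache, and other apps. Illustrative only.
    """
    usable_gb = ram_gb * 0.5
    params_b = usable_gb / 0.6
    for size in (14, 8, 4, 1.7, 0.6):
        if params_b >= size:
            return f"Qwen3-{size:g}B"
    return "Qwen3-0.6B"
```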

Voice & TTS

Voice input via Android SpeechRecognizer and text-to-speech read-aloud on assistant messages. Hands-free AI conversations on your phone.

Markdown Rendering

Rich text formatting during streaming with dual-mode roleplay and markdown renderer. Code blocks, bold, italic, lists — all rendered live.

Persistent Memory

Your AI remembers you across conversations. Per-character memory with full-text search, knowledge graphs, and document import. Pin important facts, archive old ones, and let the AI learn your preferences over time.
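As a toy illustration of that pin/archive/search lifecycle — TokForge's real store adds full-text indexing and knowledge graphs, and all names here are ours:

```python
from dataclasses import dataclass

@dataclass
class MemoryFact:
    text: str
    pinned: bool = False
    archived: bool = False

class CharacterMemory:
    """Toy per-character memory store: add, pin, archive, substring search."""

    def __init__(self):
        self.facts: list[MemoryFact] = []

    def add(self, text: str) -> MemoryFact:
        fact = MemoryFact(text)
        self.facts.append(fact)
        return fact

    def search(self, query: str) -> list[MemoryFact]:
        """Substring search over non-archived facts, pinned facts first."""
        q = query.lower()
        hits = [f for f in self.facts if not f.archived and q in f.text.lower()]
        return sorted(hits, key=lambda f: not f.pinned)
```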

Android Integration

Select text anywhere on Android and tap "TokForge" to save as a memory fact or inject into chat. Share intent support and background inference with wake lock.

ForgeLab Benchmarking

Four optimization tiers from Instant (5 min) to Long (2 hours). AutoForge sweeps all configs automatically. Shareable PNG report cards, cross-device comparison matrix, and exportable profiles.

Auto-Profiler & Watchdog

Automatic device profiling on first launch. Degradation watchdog monitors inference quality and recommends config changes in real-time.

System Prompt Cache

Caches pre-computed system prompt state to disk. Second messages in a conversation start instantly — no re-processing the system prompt every turn.
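The idea can be sketched as a cache keyed on both the model and the system prompt, so changing either invalidates the entry. Paths and helper names below are illustrative, not TokForge's internals:

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

# Illustrative location — a real app would use a stable app-data directory.
CACHE_DIR = Path(tempfile.mkdtemp())

def cache_key(system_prompt: str, model_id: str) -> str:
    """Entries must be invalidated when either the prompt or the model
    changes, so both go into the key."""
    return hashlib.sha256(f"{model_id}\x00{system_prompt}".encode()).hexdigest()

def load_or_compute(system_prompt: str, model_id: str, compute):
    """Return the cached prompt state if present, else compute and persist it."""
    path = CACHE_DIR / cache_key(system_prompt, model_id)
    if path.exists():
        return pickle.loads(path.read_bytes())
    state = compute(system_prompt)
    path.write_bytes(pickle.dumps(state))
    return state
```

On the second call with the same prompt and model, `compute` (the expensive prefill) is skipped entirely.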

Read the docs

Built for serious mobile inference.

  • Character cards + persona — TavernAI V2 import, user persona injection, system prompt assembly
  • MNN + GGUF dual backends — GPU-accelerated MNN and CPU-optimized llama.cpp with ARM i8mm
  • Hardware profiler — Auto-detects SoC, CPU topology, GPU, RAM for optimal config
  • 70+ API endpoints — Full remote control: inference, models, config, benchmarks, UI navigation
  • Benchmark database — Persistent results, cross-device export/import, auto-matrix benchmarks
1. Import character card → system prompt assembly
2. Hardware profiler → auto-detect optimal config
3. Backend (MNN GPU / GGUF CPU) → token streaming
4. Thinking extraction → collapsible <think> blocks
5. Benchmark → persistent results → export

Chat + inference pipeline

Local-first by design.
Transparent by default.

Get Free Beta Access

TokForge is free during the beta period. Help us test across devices and shape the future of private mobile AI.

v2.7.0 — Three backends, persistent memory, character cards, thinking mode, 17 curated models, 70+ API endpoints. Free while in beta.

Request Access

No spam. We'll only email you about beta access.

No telemetry. No background reporting. Your data stays on your device unless you explicitly opt in.