January 7, 2026
Most LLM apps have the same shape. Ship text to a server, pay per token, and pray the Wi-Fi stays up.
WebLLM is the fun twist. It runs an LLM inside the browser using WebGPU. That unlocks privacy-friendly demos, offline-ish behavior, and a new kind of “deployment” where your biggest backend cost is your user’s laptop fan spinning up like it just saw a Dark Souls boss.
The goal here is simple: one chat UI, one message format, and two interchangeable “brains” (a local WebLLM model and a remote, server-backed one).
tl;dr: build a tiny chat app where switching between local WebLLM and a remote model is just a dropdown.
Scaffold a Vite + React + TypeScript project and install WebLLM:
npm create vite@latest webllm-dual-provider-chat -- --template react-ts
cd webllm-dual-provider-chat
npm i
npm i @mlc-ai/web-llm
npm run dev
You now have a normal React app that will become a “two-brain” chat UI.
This interface is the entire trick. The UI does not care how tokens appear, only that they stream in.
In src/ai/types.ts:
export type Role = "system" | "user" | "assistant";
export type ChatMessage = {
role: Role;
content: string;
};
export type StreamChunk = {
delta: string;
done?: boolean;
};
export type ChatProvider = {
id: string;
label: string;
// Called once when selecting this provider (load model, warmup, etc.)
init?: (opts?: {
signal?: AbortSignal;
onStatus?: (s: string) => void;
}) => Promise<void>;
// Stream response tokens/chunks
streamChat: (args: {
messages: ChatMessage[];
signal?: AbortSignal;
onChunk: (chunk: StreamChunk) => void;
onStatus?: (s: string) => void;
}) => Promise<void>;
// Optional cleanup
dispose?: () => Promise<void>;
};
From this point forward, everything is just “implement the interface.”
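To make that concrete, here is a minimal sketch of how anything (a UI, a test, a scratch script) can drive a ChatProvider without knowing what sits behind it. runOnce is a hypothetical helper for illustration, not part of the app we build below; adjust the import path to wherever you put it.
import type { ChatProvider } from "./ai/types";

// Hypothetical helper: run one prompt through any provider and collect the streamed text.
export async function runOnce(provider: ChatProvider, prompt: string): Promise<string> {
  let text = "";
  await provider.init?.({ onStatus: (s) => console.log(`[${provider.label}] ${s}`) });
  await provider.streamChat({
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: prompt },
    ],
    onChunk: ({ delta }) => {
      text += delta;
    },
    onStatus: (s) => console.log(`[${provider.label}] ${s}`),
  });
  return text;
}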
Two important realities: WebLLM needs WebGPU, so it only works in browsers and on hardware that support it, and the first init downloads the model weights, which takes a while (the progress callback exists for a reason).
Also, use a model ID that actually works; this one worked at the time of writing:
Llama-3.1-8B-Instruct-q4f32_1-MLC
In src/ai/webllmProvider.ts:
import type { ChatMessage, ChatProvider } from "./types";
import * as webllm from "@mlc-ai/web-llm";
function toWebLLMMessages(
messages: ChatMessage[]
): webllm.ChatCompletionMessageParam[] {
return messages.map((m) => ({ role: m.role, content: m.content }));
}
export function createWebLLMProvider(
modelId = "Llama-3.1-8B-Instruct-q4f32_1-MLC"
): ChatProvider {
let engine: webllm.MLCEngineInterface | null = null;
const init: ChatProvider["init"] = async ({ signal, onStatus } = {}) => {
if (!("gpu" in navigator)) {
throw new Error("WebGPU not available in this browser.");
}
if (engine) return;
onStatus?.("Initializing WebLLM engine...");
engine = await webllm.CreateMLCEngine(modelId, {
initProgressCallback: (p) => {
const msg =
typeof p === "string" ? p : (p as any)?.text ?? "Loading model...";
onStatus?.(msg);
},
});
onStatus?.("Warming up...");
await engine.chat.completions.create({
messages: [{ role: "user", content: "Say 'ready'." }],
temperature: 0,
});
onStatus?.("Ready.");
signal?.throwIfAborted?.();
};
return {
id: "local-webllm",
label: "Local (WebLLM)",
init,
streamChat: async ({ messages, signal, onChunk, onStatus }) => {
if (!engine) {
onStatus?.("Engine not initialized. Initializing now...");
await init({ signal, onStatus });
}
if (!engine) throw new Error("WebLLM engine failed to initialize.");
onStatus?.("Generating...");
const resp = await engine.chat.completions.create({
messages: toWebLLMMessages(messages),
stream: true,
temperature: 0.7,
});
for await (const event of resp) {
signal?.throwIfAborted?.();
const delta = event.choices?.[0]?.delta?.content ?? "";
// Optional cleanup if your model spits template markers
const cleaned = delta
.replaceAll("<|start_header_id|>", "")
.replaceAll("<|end_header_id|>", "");
if (cleaned) onChunk({ delta: cleaned });
}
onChunk({ delta: "", done: true });
onStatus?.("Done.");
},
dispose: async () => {
// Some builds expose engine.dispose(). If not, dropping the reference is fine.
// @ts-expect-error optional
await engine?.dispose?.();
engine = null;
},
};
}
Model IDs can change across WebLLM releases and builds. If a model ID fails to load, look up the current ID for your installed version and swap it in.
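One way to find a current ID: recent versions of @mlc-ai/web-llm export a prebuiltAppConfig listing the models that build knows about. A quick sketch (the exact shape may differ across versions, so treat the field names as an assumption):
import { prebuiltAppConfig } from "@mlc-ai/web-llm";

// Print every model ID bundled with the installed web-llm release,
// then pick one to pass to createWebLLMProvider().
for (const m of prebuiltAppConfig.model_list) {
  console.log(m.model_id);
}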
Same contract, same streaming shape. The UI should not have to care if the text came from WebGPU wizardry or a server in a trench coat. You can skip this step if you prefer to only have a local provider.
In src/ai/remoteProvider.ts:
import type { ChatProvider } from "./types";
export function createRemoteProvider(endpoint = "/api/chat"): ChatProvider {
return {
id: "remote",
label: "Remote (Server)",
streamChat: async ({ messages, signal, onChunk, onStatus }) => {
onStatus?.("Contacting server...");
const res = await fetch(endpoint, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ messages }),
signal,
});
if (!res.ok || !res.body) {
throw new Error(`Remote provider error: ${res.status}`);
}
onStatus?.("Streaming...");
const reader = res.body.getReader();
const decoder = new TextDecoder();
while (true) {
const { value, done } = await reader.read();
if (done) break;
const text = decoder.decode(value, { stream: true });
if (text) onChunk({ delta: text });
}
onChunk({ delta: "", done: true });
onStatus?.("Done.");
},
};
}
The /api/chat route (Next.js). You can skip this step if you prefer to only have a local provider.
This route handler:
- accepts { messages } from the client
- calls the OpenAI Responses API with stream: true
- streams the text deltas back to the browser as plain text
In app/api/chat/route.ts:
export const runtime = "edge";
type Role = "system" | "user" | "assistant";
type ChatMessage = { role: Role; content: string };
export async function POST(req: Request) {
const { messages } = (await req.json()) as { messages: ChatMessage[] };
const apiKey = process.env.OPENAI_API_KEY;
if (!apiKey) {
return new Response("Missing OPENAI_API_KEY", { status: 500 });
}
const model = process.env.OPENAI_MODEL || "gpt-4o-mini";
const upstream = await fetch("https://api.openai.com/v1/responses", {
method: "POST",
headers: {
Authorization: `Bearer ${apiKey}`,
"Content-Type": "application/json",
},
body: JSON.stringify({
model,
stream: true,
// Plain string content is accepted for every role; typed parts would need
// "input_text" for user/system turns but "output_text" for assistant turns.
input: messages.map((m) => ({ role: m.role, content: m.content })),
text: { format: { type: "text" } },
}),
});
if (!upstream.ok || !upstream.body) {
const errText = await upstream.text().catch(() => "");
return new Response(`Upstream error (${upstream.status}): ${errText}`, {
status: 500,
});
}
const encoder = new TextEncoder();
const decoder = new TextDecoder();
let buffer = "";
const stream = new ReadableStream<Uint8Array>({
async start(controller) {
const reader = upstream.body!.getReader();
try {
while (true) {
const { value, done } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
// SSE events are separated by a blank line
let idx;
while ((idx = buffer.indexOf("\n\n")) !== -1) {
const rawEvent = buffer.slice(0, idx);
buffer = buffer.slice(idx + 2);
const dataLines = rawEvent
.split("\n")
.filter((line) => line.startsWith("data:"))
.map((line) => line.replace(/^data:\s?/, "").trim());
for (const data of dataLines) {
if (!data) continue;
if (data === "[DONE]") {
controller.close();
return;
}
let evt: any;
try {
evt = JSON.parse(data);
} catch {
continue;
}
if (
evt.type === "response.output_text.delta" &&
typeof evt.delta === "string"
) {
controller.enqueue(encoder.encode(evt.delta));
}
}
}
}
} catch (e) {
controller.error(e);
} finally {
// close() throws if the stream was already closed (on [DONE]) or errored,
// so guard it instead of letting the route crash.
try {
controller.close();
} catch {
// already closed or errored; nothing to do
}
}
},
});
return new Response(stream, {
headers: {
"Content-Type": "text/plain; charset=utf-8",
"Cache-Control": "no-cache, no-transform",
},
});
}
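Before wiring this into the client, it can be worth smoke-testing the route directly. A small sketch you can paste into a browser console or a scratch script, assuming the Next.js dev server is on localhost:3000 and the key is set:
// POST one message to /api/chat and print streamed chunks as they arrive.
async function smokeTestChatRoute() {
  const res = await fetch("http://localhost:3000/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      messages: [{ role: "user", content: "Say hello in five words." }],
    }),
  });
  if (!res.ok || !res.body) throw new Error(`HTTP ${res.status}`);
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    console.log(decoder.decode(value, { stream: true }));
  }
}
smokeTestChatRoute();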
If the chat UI is running on Vite (localhost:5173) and Next.js is running on localhost:3000, calling /api/chat from Vite will hit Vite’s server, not Next. The easy fix is a dev proxy.
Update vite.config.ts:
import { defineConfig } from "vite";
import react from "@vitejs/plugin-react";
export default defineConfig({
plugins: [react()],
server: {
proxy: {
"/api": "http://localhost:3000",
},
},
});
Now the client can keep using createRemoteProvider("/api/chat") and Vite will forward it to Next.
Create .env.local in the Next.js project:
OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-4o-mini
The whole job of this hook is to keep the message list in state, let you switch providers, stream chunks into the last assistant message, and expose send/stop controls.
The sharp edge: streaming makes state bugs very obvious. If you mutate the last message in place, React will punish you with duplication weirdness, especially in dev.
So we update the last message immutably.
In src/hooks/useChat.ts:
import { useMemo, useRef, useState } from "react";
import type { ChatMessage, ChatProvider } from "../ai/types";
export function useChat(providers: ChatProvider[]) {
const [providerId, setProviderId] = useState(providers[0]?.id ?? "");
const provider = useMemo(
() => providers.find((p) => p.id === providerId)!,
[providers, providerId]
);
const [messages, setMessages] = useState<ChatMessage[]>([
{ role: "system", content: "You are a helpful assistant." },
]);
const [status, setStatus] = useState<string>("");
const [isStreaming, setIsStreaming] = useState(false);
const abortRef = useRef<AbortController | null>(null);
async function selectProvider(nextId: string) {
abortRef.current?.abort();
setProviderId(nextId);
const next = providers.find((p) => p.id === nextId);
if (next?.init) {
setStatus("Preparing provider...");
try {
await next.init({ onStatus: setStatus });
} catch (e: any) {
setStatus(e?.message ?? "Failed to initialize provider.");
}
}
}
async function send(userText: string) {
if (!userText.trim()) return;
if (isStreaming) return;
abortRef.current?.abort();
abortRef.current = new AbortController();
const userMsg: ChatMessage = { role: "user", content: userText };
// Add user + placeholder assistant
setMessages((prev) => [...prev, userMsg, { role: "assistant", content: "" }]);
setIsStreaming(true);
setStatus("");
try {
await provider.streamChat({
messages: [...messages, userMsg], // good enough for a demo
signal: abortRef.current.signal,
onStatus: setStatus,
onChunk: ({ delta, done }) => {
if (delta) {
setMessages((prev) => {
const last = prev[prev.length - 1];
if (!last || last.role !== "assistant") return prev;
// Immutable update
const updatedLast = { ...last, content: last.content + delta };
return [...prev.slice(0, -1), updatedLast];
});
}
if (done) setIsStreaming(false);
},
});
} catch (e: any) {
setIsStreaming(false);
setStatus(e?.message ?? "Error while streaming.");
}
}
function stop() {
abortRef.current?.abort();
setIsStreaming(false);
setStatus("Stopped.");
}
return {
providers,
providerId,
provider,
messages,
status,
isStreaming,
selectProvider,
send,
stop,
};
}
React state closure note:
messages: [...messages, userMsg] uses the current render’s messages. For normal chat usage, that is fine. If you want to harden it, store messages in a ref and read from that when starting the stream.
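A minimal sketch of that hardening. useLatestMessages is a hypothetical helper, not something the hook above already has; it keeps a ref in sync with the latest messages so a long-running stream never reads a stale closure:
import { useEffect, useRef } from "react";
import type { ChatMessage } from "../ai/types";

// Mirror the latest messages into a ref so async code can read the current
// value instead of the value captured when the stream started.
export function useLatestMessages(messages: ChatMessage[]) {
  const ref = useRef<ChatMessage[]>(messages);
  useEffect(() => {
    ref.current = messages;
  }, [messages]);
  return ref;
}

// Inside useChat you would then build the request from the ref:
//   const messagesRef = useLatestMessages(messages);
//   ...
//   await provider.streamChat({ messages: [...messagesRef.current, userMsg], ... });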
Keep it simple. Treat the provider dropdown as a “brain toggle” and let the rest of the UI stay boring on purpose.
In src/App.tsx:
import { useEffect, useMemo, useState } from "react";
import { createWebLLMProvider } from "./ai/webllmProvider";
import { createRemoteProvider } from "./ai/remoteProvider";
import { useChat } from "./hooks/useChat";
export default function App() {
const providers = useMemo(
() => [createWebLLMProvider(), createRemoteProvider("/api/chat")],
[]
);
const chat = useChat(providers);
const [input, setInput] = useState("");
useEffect(() => {
chat.selectProvider(chat.providerId);
// eslint-disable-next-line react-hooks/exhaustive-deps
}, []);
return (
<div style={{ maxWidth: 900, margin: "0 auto", padding: 16, fontFamily: "system-ui" }}>
<h1>Dual Provider Chat</h1>
<div style={{ display: "flex", gap: 12, alignItems: "center" }}>
<label>
Provider{" "}
<select
value={chat.providerId}
onChange={(e) => chat.selectProvider(e.target.value)}
disabled={chat.isStreaming}
>
{chat.providers.map((p) => (
<option key={p.id} value={p.id}>
{p.label}
</option>
))}
</select>
</label>
<div style={{ opacity: 0.8 }}>{chat.status}</div>
{chat.isStreaming && <button onClick={chat.stop}>Stop</button>}
</div>
<div style={{ marginTop: 16, border: "1px solid #ddd", borderRadius: 8, padding: 12, minHeight: 300 }}>
{chat.messages
.filter((m) => m.role !== "system")
.map((m, idx) => (
<div key={idx} style={{ marginBottom: 12 }}>
<div style={{ fontWeight: 700 }}>{m.role}</div>
<div style={{ whiteSpace: "pre-wrap" }}>{m.content}</div>
</div>
))}
</div>
<form
onSubmit={(e) => {
e.preventDefault();
chat.send(input);
setInput("");
}}
style={{ display: "flex", gap: 8, marginTop: 12 }}
>
<input
value={input}
onChange={(e) => setInput(e.target.value)}
placeholder="Say something..."
style={{ flex: 1, padding: 10 }}
disabled={chat.isStreaming}
/>
<button type="submit" disabled={chat.isStreaming}>
Send
</button>
</form>
</div>
);
}
At this point you have one chat UI, a provider dropdown, a local WebLLM brain, and a remote brain streaming through /api/chat.
If the user’s text is sensitive (journaling, medical notes, internal docs), local mode keeps content on-device by default.
Local mode is effectively “free per token” after the download, which is great for demos, experiments, and anything chatty enough that per-request API pricing would sting.
Local mode can be an upgrade path instead of a requirement: start everyone on the remote provider, and offer the local one to users whose browser and hardware can handle it.
Full offline is tricky, but “no server call needed for this” is still a huge win for flaky connections and privacy-sensitive features.
Some model builds emit template markers like <|start_header_id|>. Filtering them out is fine for demos. For cleaner output long-term, experiment with model choices and chat templates.
Expect differences: a quantized model running in your browser will not match a hosted frontier model on quality, speed, or context length.
A provider boundary is one of those small architectural choices that pays rent forever. Models change, vendors change, pricing changes, browser capabilities evolve. A chat UI that can swap brains is a lot harder to paint into a corner.
Also, it is extremely satisfying to flip a dropdown and make your browser turn into a tiny AI workstation. 😈💻