
Provider-Agnostic Chat in React: WebLLM Local Mode + Remote Fallback

January 7, 2026

Intro

Most LLM apps have the same shape. Ship text to a server, pay per token, and pray the Wi-Fi stays up.

WebLLM is the fun twist. It runs an LLM inside the browser using WebGPU. That unlocks privacy-friendly demos, offline-ish behavior, and a new kind of “deployment” where your biggest backend cost is your user’s laptop fan spinning up like it just saw a Dark Souls boss.

The goal here is simple: one chat UI, one message format, two interchangeable “brains”:

  • Local provider: WebLLM in the browser (WebGPU)
  • Remote provider: a server endpoint with an OpenAI-compatible shape (Next.js route handler)

tl;dr: Build a tiny chat app where switching between local WebLLM and a remote model is just a dropdown.

Setting Up / Prerequisites

  • Node 18+ (20+ preferred)
  • A modern Chromium browser with WebGPU enabled (Chrome or Edge is easiest)
  • Basic React + TypeScript comfort

Optional but recommended:

  • A machine with decent RAM. Smaller laptops can run it, but you will feel the pain sooner.
  • Patience for the first model download.

Implementation Steps

Step 1: Create the app (Vite + React)

npm create vite@latest webllm-dual-provider-chat -- --template react-ts
cd webllm-dual-provider-chat
npm i
npm i @mlc-ai/web-llm
npm run dev

You now have a normal React app that will become a “two-brain” chat UI.

Step 2: Define a provider interface

This interface is the entire trick. The UI does not care how tokens appear, only that they stream in.

In src/ai/types.ts:

export type Role = "system" | "user" | "assistant";

export type ChatMessage = {
  role: Role;
  content: string;
};

export type StreamChunk = {
  delta: string;
  done?: boolean;
};

export type ChatProvider = {
  id: string;
  label: string;

  // Called once when selecting this provider (load model, warmup, etc.)
  init?: (opts?: {
    signal?: AbortSignal;
    onStatus?: (s: string) => void;
  }) => Promise<void>;

  // Stream response tokens/chunks
  streamChat: (args: {
    messages: ChatMessage[];
    signal?: AbortSignal;
    onChunk: (chunk: StreamChunk) => void;
    onStatus?: (s: string) => void;
  }) => Promise<void>;

  // Optional cleanup
  dispose?: () => Promise<void>;
};

From this point forward, everything is just “implement the interface.”
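
For example, a fake "echo" provider that satisfies the contract is only a few lines. It is not part of the final app, just a hypothetical smoke test you can wire into the dropdown before any real model exists:

// src/ai/echoProvider.ts (hypothetical smoke-test provider)
import type { ChatProvider } from "./types";

export function createEchoProvider(): ChatProvider {
  return {
    id: "echo",
    label: "Echo (fake)",

    streamChat: async ({ messages, onChunk, onStatus }) => {
      onStatus?.("Echoing...");
      const lastUser = [...messages].reverse().find((m) => m.role === "user");
      const text = lastUser ? `You said: ${lastUser.content}` : "Nothing to echo.";

      // Fake streaming: emit one word at a time with a tiny delay
      for (const word of text.split(" ")) {
        await new Promise((resolve) => setTimeout(resolve, 50));
        onChunk({ delta: word + " " });
      }

      onChunk({ delta: "", done: true });
      onStatus?.("Done.");
    },
  };
}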

Step 3: Implement the WebLLM local provider

Two important realities:

  • First run downloads a model. This can be big. Show status text so it does not look frozen.
  • WebGPU is not universal. Feature detect and fall back.

Also, use a model ID that actually exists in your installed WebLLM build. This one worked at the time of writing:

  • Llama-3.1-8B-Instruct-q4f32_1-MLC

In src/ai/webllmProvider.ts

import type { ChatMessage, ChatProvider } from "./types";
import * as webllm from "@mlc-ai/web-llm";

function toWebLLMMessages(
  messages: ChatMessage[]
): webllm.ChatCompletionMessageParam[] {
  return messages.map((m) => ({ role: m.role, content: m.content }));
}

export function createWebLLMProvider(
  modelId = "Llama-3.1-8B-Instruct-q4f32_1-MLC"
): ChatProvider {
  let engine: webllm.MLCEngineInterface | null = null;

  const init: ChatProvider["init"] = async ({ signal, onStatus } = {}) => {
    if (!("gpu" in navigator)) {
      throw new Error("WebGPU not available in this browser.");
    }
    if (engine) return;

    onStatus?.("Initializing WebLLM engine...");
    engine = await webllm.CreateMLCEngine(modelId, {
      initProgressCallback: (p) => {
        const msg =
          typeof p === "string" ? p : (p as any)?.text ?? "Loading model...";
        onStatus?.(msg);
      },
    });

    onStatus?.("Warming up...");
    await engine.chat.completions.create({
      messages: [{ role: "user", content: "Say 'ready'." }],
      temperature: 0,
    });

    onStatus?.("Ready.");
    signal?.throwIfAborted?.();
  };

  return {
    id: "local-webllm",
    label: "Local (WebLLM)",
    init,

    streamChat: async ({ messages, signal, onChunk, onStatus }) => {
      if (!engine) {
        onStatus?.("Engine not initialized. Initializing now...");
        await init({ signal, onStatus });
      }
      if (!engine) throw new Error("WebLLM engine failed to initialize.");

      onStatus?.("Generating...");

      const resp = await engine.chat.completions.create({
        messages: toWebLLMMessages(messages),
        stream: true,
        temperature: 0.7,
      });

      for await (const event of resp) {
        signal?.throwIfAborted?.();
        const delta = event.choices?.[0]?.delta?.content ?? "";

        // Optional cleanup if your model spits template markers
        const cleaned = delta
          .replaceAll("<|start_header_id|>", "")
          .replaceAll("<|end_header_id|>", "");

        if (cleaned) onChunk({ delta: cleaned });
      }

      onChunk({ delta: "", done: true });
      onStatus?.("Done.");
    },

    dispose: async () => {
      // Some builds expose engine.dispose(). If not, dropping the reference is fine.
      await (engine as any)?.dispose?.();
      engine = null;
    },
  };
}

Model IDs can change across releases and builds. If a model ID fails to load, check what your installed version actually ships and swap in a current one.
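
A quick way to see the supported IDs is to inspect the prebuilt config the package exports. A minimal sketch, assuming prebuiltAppConfig keeps its current shape:

import * as webllm from "@mlc-ai/web-llm";

// Log the model IDs bundled with the installed web-llm release.
// Handy when a hardcoded ID stops resolving after an upgrade.
console.log(webllm.prebuiltAppConfig.model_list.map((m) => m.model_id));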

Step 4: Implement the remote provider client

Same contract, same streaming shape. The UI should not have to care if the text came from WebGPU wizardry or a server in a trench coat. You can skip this step if you prefer to only have a local provider.

In src/ai/remoteProvider.ts

import type { ChatProvider } from "./types";

export function createRemoteProvider(endpoint = "/api/chat"): ChatProvider {
  return {
    id: "remote",
    label: "Remote (Server)",

    streamChat: async ({ messages, signal, onChunk, onStatus }) => {
      onStatus?.("Contacting server...");

      const res = await fetch(endpoint, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ messages }),
        signal,
      });

      if (!res.ok || !res.body) {
        throw new Error(`Remote provider error: ${res.status}`);
      }

      onStatus?.("Streaming...");

      const reader = res.body.getReader();
      const decoder = new TextDecoder();

      while (true) {
        const { value, done } = await reader.read();
        if (done) break;

        const text = decoder.decode(value, { stream: true });
        if (text) onChunk({ delta: text });
      }

      onChunk({ delta: "", done: true });
      onStatus?.("Done.");
    },
  };
}

Step 5: Add a Next.js route handler for /api/chat

You can skip this step if you prefer to only have a local provider. The handler lives in a separate Next.js (App Router) project that runs alongside the Vite app; the proxy section below wires the two together.

This route handler:

  • receives { messages } from the client
  • calls OpenAI’s Responses API with stream: true
  • converts the SSE stream into a plain text stream your Vite client already understands

In app/api/chat/route.ts

export const runtime = "edge";

type Role = "system" | "user" | "assistant";
type ChatMessage = { role: Role; content: string };

export async function POST(req: Request) {
  const { messages } = (await req.json()) as { messages: ChatMessage[] };

  const apiKey = process.env.OPENAI_API_KEY;
  if (!apiKey) {
    return new Response("Missing OPENAI_API_KEY", { status: 500 });
  }

  const model = process.env.OPENAI_MODEL || "gpt-4o-mini";

  const upstream = await fetch("https://api.openai.com/v1/responses", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model,
      stream: true,
      input: messages.map((m) => ({
        role: m.role,
        // Plain string content is accepted for every role, which sidesteps
        // the input_text vs. output_text distinction for assistant messages.
        content: m.content,
      })),
      text: { format: { type: "text" } },
    }),
  });

  if (!upstream.ok || !upstream.body) {
    const errText = await upstream.text().catch(() => "");
    return new Response(`Upstream error (${upstream.status}): ${errText}`, {
      status: 500,
    });
  }

  const encoder = new TextEncoder();
  const decoder = new TextDecoder();

  let buffer = "";

  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      const reader = upstream.body!.getReader();

      try {
        while (true) {
          const { value, done } = await reader.read();
          if (done) break;

          buffer += decoder.decode(value, { stream: true });

          // SSE events are separated by a blank line
          let idx;
          while ((idx = buffer.indexOf("\n\n")) !== -1) {
            const rawEvent = buffer.slice(0, idx);
            buffer = buffer.slice(idx + 2);

            const dataLines = rawEvent
              .split("\n")
              .filter((line) => line.startsWith("data:"))
              .map((line) => line.replace(/^data:\s?/, "").trim());

            for (const data of dataLines) {
              if (!data) continue;

              if (data === "[DONE]") {
                controller.close();
                return;
              }

              let evt: any;
              try {
                evt = JSON.parse(data);
              } catch {
                continue;
              }

              if (
                evt.type === "response.output_text.delta" &&
                typeof evt.delta === "string"
              ) {
                controller.enqueue(encoder.encode(evt.delta));
              }
            }
          }
        }

        // Upstream ended without an explicit [DONE]; close normally.
        // (Closing here instead of in a finally avoids a double close
        // after the [DONE] branch above.)
        controller.close();
      } catch (e) {
        controller.error(e);
      }
    },
  });

  return new Response(stream, {
    headers: {
      "Content-Type": "text/plain; charset=utf-8",
      "Cache-Control": "no-cache, no-transform",
    },
  });
}

Running Next.js alongside Vite without CORS pain

If the chat UI is running on Vite (localhost:5173) and Next.js is running on localhost:3000, calling /api/chat from Vite will hit Vite’s server, not Next. The easy fix is a dev proxy.

Update vite.config.ts:

import { defineConfig } from "vite";
import react from "@vitejs/plugin-react";

export default defineConfig({
  plugins: [react()],
  server: {
    proxy: {
      "/api": "http://localhost:3000",
    },
  },
});

Now the client can keep using createRemoteProvider("/api/chat") and Vite will forward it to Next.

Environment variables for Next.js

Create .env.local in the Next.js project:

OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-4o-mini

Step 6: Build the chat hook (provider-agnostic brain socket)

The whole job of this hook is to:

  • manage messages
  • manage streaming state
  • route the “append these tokens” events into the last assistant message

The sharp edge: streaming makes state bugs very obvious. If you mutate the last message in place, React will punish you with duplication weirdness, especially in dev.

So we update the last message immutably.

In src/hooks/useChat.ts

import { useMemo, useRef, useState } from "react";
import type { ChatMessage, ChatProvider } from "../ai/types";

export function useChat(providers: ChatProvider[]) {
  const [providerId, setProviderId] = useState(providers[0]?.id ?? "");
  const provider = useMemo(
    () => providers.find((p) => p.id === providerId)!,
    [providers, providerId]
  );

  const [messages, setMessages] = useState<ChatMessage[]>([
    { role: "system", content: "You are a helpful assistant." },
  ]);

  const [status, setStatus] = useState<string>("");
  const [isStreaming, setIsStreaming] = useState(false);

  const abortRef = useRef<AbortController | null>(null);

  async function selectProvider(nextId: string) {
    abortRef.current?.abort();
    setProviderId(nextId);

    const next = providers.find((p) => p.id === nextId);
    if (next?.init) {
      setStatus("Preparing provider...");
      try {
        await next.init({ onStatus: setStatus });
      } catch (e: any) {
        setStatus(e?.message ?? "Failed to initialize provider.");
      }
    }
  }

  async function send(userText: string) {
    if (!userText.trim()) return;
    if (isStreaming) return;

    abortRef.current?.abort();
    abortRef.current = new AbortController();

    const userMsg: ChatMessage = { role: "user", content: userText };

    // Add user + placeholder assistant
    setMessages((prev) => [...prev, userMsg, { role: "assistant", content: "" }]);
    setIsStreaming(true);
    setStatus("");

    try {
      await provider.streamChat({
        messages: [...messages, userMsg], // good enough for a demo
        signal: abortRef.current.signal,
        onStatus: setStatus,
        onChunk: ({ delta, done }) => {
          if (delta) {
            setMessages((prev) => {
              const last = prev[prev.length - 1];
              if (!last || last.role !== "assistant") return prev;

              // Immutable update
              const updatedLast = { ...last, content: last.content + delta };
              return [...prev.slice(0, -1), updatedLast];
            });
          }

          if (done) setIsStreaming(false);
        },
      });
    } catch (e: any) {
      setStatus(e?.message ?? "Error while streaming.");
    } finally {
      // Always clear the streaming flag, even if a provider never emits
      // a final done chunk.
      setIsStreaming(false);
    }
  }

  function stop() {
    abortRef.current?.abort();
    setIsStreaming(false);
    setStatus("Stopped.");
  }

  return {
    providers,
    providerId,
    provider,
    messages,
    status,
    isStreaming,
    selectProvider,
    send,
    stop,
  };
}

React state closure note: messages: [...messages, userMsg] uses the current render’s messages. For normal chat usage, that is fine. If you want to harden it, store messages in a ref and read from that when starting the stream.
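
If you do go the ref route, the change is small. A minimal sketch of the relevant lines inside useChat (the rest of the hook stays the same; you would also add useEffect to the existing react import):

// Mirror messages into a ref so send() always reads the latest history.
const messagesRef = useRef<ChatMessage[]>(messages);

useEffect(() => {
  messagesRef.current = messages;
}, [messages]);

// ...then inside send(), start the stream from the ref instead of state:
// messages: [...messagesRef.current, userMsg],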

Step 7: UI component

Keep it simple. Treat the provider dropdown as a “brain toggle” and let the rest of the UI stay boring on purpose.

In src/App.tsx

import { useEffect, useMemo, useState } from "react";
import { createWebLLMProvider } from "./ai/webllmProvider";
import { createRemoteProvider } from "./ai/remoteProvider";
import { useChat } from "./hooks/useChat";

export default function App() {
  const providers = useMemo(
    () => [createWebLLMProvider(), createRemoteProvider("/api/chat")],
    []
  );

  const chat = useChat(providers);
  const [input, setInput] = useState("");

  useEffect(() => {
    chat.selectProvider(chat.providerId);
    // eslint-disable-next-line react-hooks/exhaustive-deps
  }, []);

  return (
    <div style={{ maxWidth: 900, margin: "0 auto", padding: 16, fontFamily: "system-ui" }}>
      <h1>Dual Provider Chat</h1>

      <div style={{ display: "flex", gap: 12, alignItems: "center" }}>
        <label>
          Provider{" "}
          <select
            value={chat.providerId}
            onChange={(e) => chat.selectProvider(e.target.value)}
            disabled={chat.isStreaming}
          >
            {chat.providers.map((p) => (
              <option key={p.id} value={p.id}>
                {p.label}
              </option>
            ))}
          </select>
        </label>

        <div style={{ opacity: 0.8 }}>{chat.status}</div>
        {chat.isStreaming && <button onClick={chat.stop}>Stop</button>}
      </div>

      <div style={{ marginTop: 16, border: "1px solid #ddd", borderRadius: 8, padding: 12, minHeight: 300 }}>
        {chat.messages
          .filter((m) => m.role !== "system")
          .map((m, idx) => (
            <div key={idx} style={{ marginBottom: 12 }}>
              <div style={{ fontWeight: 700 }}>{m.role}</div>
              <div style={{ whiteSpace: "pre-wrap" }}>{m.content}</div>
            </div>
          ))}
      </div>

      <form
        onSubmit={(e) => {
          e.preventDefault();
          chat.send(input);
          setInput("");
        }}
        style={{ display: "flex", gap: 8, marginTop: 12 }}
      >
        <input
          value={input}
          onChange={(e) => setInput(e.target.value)}
          placeholder="Say something..."
          style={{ flex: 1, padding: 10 }}
          disabled={chat.isStreaming}
        />
        <button type="submit" disabled={chat.isStreaming}>
          Send
        </button>
      </form>
    </div>
  );
}

At this point you have:

  • a local WebGPU chat provider
  • a remote API chat provider
  • a UI that can swap between them without rewriting anything

Why this setup is worth having

1) Privacy-first features without a backend

If the user’s text is sensitive (journaling, medical notes, internal docs), local mode keeps content on-device by default.

2) Cost control and “free” demos

Local mode is effectively “free per token” after the download. It is great for:

  • prototypes
  • workshops
  • dev tooling
  • weekend projects that should not come with a monthly bill

3) Graceful degradation

Local mode can be an upgrade path instead of a requirement (see the sketch after this list):

  • WebGPU available: local
  • WebGPU missing: remote fallback
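
A minimal selection sketch, assuming the provider IDs used in the earlier steps (local-webllm and remote):

import type { ChatProvider } from "./ai/types";

// Prefer local when WebGPU is available, otherwise fall back to remote.
// A fuller router could also catch init() failures and downgrade at runtime.
export function pickDefaultProviderId(providers: ChatProvider[]): string {
  const hasWebGPU = typeof navigator !== "undefined" && "gpu" in navigator;
  const preferredId = hasWebGPU ? "local-webllm" : "remote";
  return providers.find((p) => p.id === preferredId)?.id ?? providers[0]?.id ?? "";
}

In App.tsx you could pass this result to the initial selectProvider call instead of always starting on the first entry in the list.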

4) Offline-ish UX for specific workflows

Full offline is tricky, but “no server call needed for this” is still a huge win for:

  • rewriting text
  • summarizing
  • quick Q&A over content already in the browser

Real Talk

  • Download time: first load can be chunky. People will assume it is broken unless you show progress.
  • Device limits: mobile can struggle. Low-RAM machines can crash tabs or throttle hard.
  • WebGPU support: treat local mode as progressive enhancement, not a hard dependency.
  • Privacy win: local mode avoids shipping user text to your server by default.
  • Cost win: local mode shifts the cost to user compute, which is nice until it is not.

Watch outs and gotchas

Token junk like <|start_header_id|>

Some model builds emit template markers. Filtering them out is fine for demos. For cleaner output long-term, experiment with model choices and chat templates.

Local models are not remote models

Expect differences:

  • weaker instruction following
  • more formatting quirks
  • occasional “why are you like this” moments

Possible Improvements

  1. Model picker UI
     • dropdown of model IDs
     • persist selection in localStorage
     • show estimated download size if available
  2. Provider router
     • auto-pick local if WebGPU exists
     • auto-fallback to remote if init fails
     • show a small badge: “Local” or “Remote”
  3. Conversation memory controls
     • send last N messages only (see the sketch after this list)
     • auto-summarize older messages (local if possible)
  4. Structured output mode
     • have the assistant return JSON “actions”
     • validate with zod before rendering anything
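
To make item 3 concrete, a minimal trimming helper. The keep value is arbitrary here; tune it to your model's context window:

import type { ChatMessage } from "./ai/types";

// Keep the system prompt plus the most recent `keep` messages.
export function trimHistory(messages: ChatMessage[], keep = 8): ChatMessage[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  return [...system, ...rest.slice(-keep)];
}

In useChat's send(), you would wrap the outgoing history: messages: trimHistory([...messages, userMsg]).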

Outro

A provider boundary is one of those small architectural choices that pays rent forever. Models change, vendors change, pricing changes, browser capabilities evolve. A chat UI that can swap brains is a lot harder to paint into a corner.

Also, it is extremely satisfying to flip a dropdown and make your browser turn into a tiny AI workstation. 😈💻