July 28, 2025
Getting a “ChatGPT for my docs” into prod used to feel like tape and prayers. In 2025 everything finally snaps together: the Next.js 14 App Router, the Vercel AI SDK’s useChat hook for instant streaming UI, OpenAI’s Responses API for agent-style calls, and Pinecone (or pgvector) for retrieval. Add Fluid Compute so your Node functions stay warm, and you’ve got a latency-friendly RAG bot that still fits the hobby budget. (AI SDK, OpenAI, Vercel)
tl;dr: We bootstrap a Next app, chunk + embed Markdown, shove vectors into Pinecone, wire a LangChain RAG chain that calls the Responses API, and deploy on Vercel. Copy-paste the code below and you’ll have a docs bot humming before your coffee hits optimal drinkability.
What | Why |
---|---|
Node 20 + pnpm | Modern fetch for both OpenAI & Pinecone. |
OpenAI API key | The Responses API gives a single endpoint for chat + tool use. (OpenAI) |
Pinecone Starter tier | 2 GB / 2 M writes / 1 M reads free—~300 k vectors. (Pinecone, Reddit) |
Vercel Pro (optional) | Turns on Fluid Compute; Hobby works but with lower CPU quotas. (Vercel) |
npx create-next-app@latest rag-bot --app
cd rag-bot
pnpm add ai openai langchain @langchain/core @langchain/openai @langchain/pinecone \
  @langchain/textsplitters @pinecone-database/pinecone @ai-sdk/react
pnpm add -D tsx
# if you prefer Postgres later
# pnpm add pg pgvector
Create .env.local:
OPENAI_API_KEY=sk-...
PINECONE_API_KEY=...
PINECONE_INDEX_NAME=docs-test
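Before ingesting anything, the index named in PINECONE_INDEX_NAME has to exist. If you haven’t created it in the Pinecone console, a throwaway script along these lines does it (file name, cloud, and region are placeholders; 1536 dimensions matches text-embedding-3-small):
// scripts/create-index.ts — hypothetical one-off setup
import { Pinecone } from '@pinecone-database/pinecone'
const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! })
await pc.createIndex({
  name: process.env.PINECONE_INDEX_NAME!,
  dimension: 1536, // text-embedding-3-small output size
  metric: 'cosine',
  spec: { serverless: { cloud: 'aws', region: 'us-east-1' } },
  suppressConflicts: true, // no-op if the index already exists
  waitUntilReady: true,
})
console.log('Index ready')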
scripts/ingest.ts
The ingest script reads Markdown files, splits them into manageable chunks, embeds them with OpenAI’s text-embedding-3-small model, and stores the vectors in Pinecone. This sets up the vector store for retrieval later.
import fs from 'node:fs/promises'
import path from 'node:path'
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters'
import { OpenAIEmbeddings } from '@langchain/openai'
import { Pinecone } from '@pinecone-database/pinecone'
import { PineconeStore } from '@langchain/pinecone'
const docsPath = path.join(process.cwd(), 'docs-test') // folder of Markdown files to index
const files = await fs.readdir(docsPath)
// ~1,000-character chunks with 100 characters of overlap so context isn't lost at boundaries
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 1_000, chunkOverlap: 100 })
const texts: string[] = []
for (const file of files) {
const raw = await fs.readFile(path.join(docsPath, file), 'utf8')
texts.push(...(await splitter.splitText(raw)))
}
const embeddings = new OpenAIEmbeddings({ modelName: 'text-embedding-3-small' })
const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! })
const index = pc.Index(process.env.PINECONE_INDEX_NAME!)
// Write into the same `docs` namespace that lib/vectorStore.ts reads from
await PineconeStore.fromTexts(texts, {}, embeddings, { pineconeIndex: index, namespace: 'docs' })
console.log('✅ Ingestion complete')
Why: RecursiveCharacterTextSplitter keeps every chunk comfortably under the embedding model’s 8 k-token limit while overlapping 100 characters for coherence. (LangChain) And text-embedding-3-small is roughly 5× cheaper than Ada-002, at $0.02 per million tokens. (OpenAI Platform)
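If you’re curious what the splitter actually emits before paying for embeddings, here’s a tiny sketch (sample text and sizes are made up, shrunk just to make the overlap visible):
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters'
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 120, chunkOverlap: 30 })
const chunks = await splitter.splitText(
  'Deploys run on every push to main. Preview deploys get their own URL. Roll back from the dashboard if a release misbehaves.'
)
// Neighbouring chunks repeat up to ~30 characters, so no sentence is orphaned at a boundary
console.log(chunks)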
Run once:
pnpm tsx scripts/ingest.ts
lib/vectorStore.ts
import { PineconeStore } from '@langchain/pinecone'
import { OpenAIEmbeddings } from '@langchain/openai'
import { Pinecone } from '@pinecone-database/pinecone'
// We use `PineconeStore.fromExistingIndex` to load the vector store from the existing Pinecone index, which allows us to query it later in the RAG chain. (LangChain)
export async function makeStore() {
const embeddings = new OpenAIEmbeddings({ modelName: 'text-embedding-3-small' })
const pinecone = new Pinecone()
const pineconeIndex = pinecone.Index(process.env.PINECONE_INDEX_NAME!)
return PineconeStore.fromExistingIndex(embeddings, {
pineconeIndex,
namespace: 'docs',
maxConcurrency: 5,
})
}
Why: fromExistingIndex expects (embeddings, cfg); get the order wrong and store.embeddings is undefined, crashing later. (LangChain)
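Before building the chain, it’s worth confirming that retrieval returns something sensible. A quick throwaway check (the query text is just an example, and it assumes your OpenAI and Pinecone env vars are exported in the shell):
// scripts/test-retrieval.ts — hypothetical sanity check
import { makeStore } from '../lib/vectorStore'
const store = await makeStore()
// Top-2 nearest chunks plus their similarity scores
const hits = await store.similaritySearchWithScore('How do I deploy to production?', 2)
for (const [doc, score] of hits) {
  console.log(score.toFixed(3), doc.pageContent.slice(0, 80))
}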
The chain retrieves relevant documents from Pinecone and generates a response with OpenAI’s Responses API. It uses RunnableSequence to pipe the retrieval and generation steps together.
lib/rag.ts
import { RunnableSequence } from '@langchain/core/runnables'
import { StringOutputParser } from '@langchain/core/output_parsers'
import { ChatPromptTemplate } from '@langchain/core/prompts'
import { ChatOpenAI } from '@langchain/openai'
import { PineconeStore } from '@langchain/pinecone'
export async function makeChain(store: PineconeStore) {
  const retriever = store.asRetriever({ k: 4 })
  // ChatOpenAI talks to the Responses API when `useResponsesApi` is set;
  // o3 is a reasoning model, so temperature stays off (the API rejects it for o-series)
  const model = new ChatOpenAI({
    model: 'o3',
    maxTokens: 512,
    streaming: true,
    useResponsesApi: true,
  })
  const prompt = ChatPromptTemplate.fromTemplate(
    'Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}'
  )
  return RunnableSequence.from([
    {
      // retrieval step: join the top-k chunks into a context string, pass the question through
      context: async (input: string) =>
        (await retriever.invoke(input)).map((d) => d.pageContent).join('\n\n'),
      question: (input: string) => input,
    },
    prompt,
    model,
    new StringOutputParser(), // stream plain-text chunks
  ])
}
RunnableSequence pipes retrieval → prompt → generation with minimal overhead. (LangChain)
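If you want to smoke-test the chain before touching the route, a throwaway script streams straight to the terminal (run it with tsx; the question is just an example and env vars are assumed to be exported in your shell):
// scripts/test-chain.ts — hypothetical smoke test
import { makeStore } from '../lib/vectorStore'
import { makeChain } from '../lib/rag'
const chain = await makeChain(await makeStore())
// StringOutputParser means each streamed chunk is already plain text
for await (const chunk of await chain.stream('How do I configure a custom domain?')) {
  process.stdout.write(chunk)
}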
Now we wire up the API route to handle chat requests. It reads the last message from the chat, runs the retrieval chain against Pinecone, and streams the response back to the client in the format useChat expects.
app/api/chat/route.ts
import { NextRequest } from 'next/server'
import { LangChainAdapter } from 'ai'
import { makeStore } from '@/lib/vectorStore'
import { makeChain } from '@/lib/rag'
export const runtime = 'nodejs' // Pinecone SDK needs core Node APIs
export async function POST(req: NextRequest) {
  const { messages } = await req.json()
  const question = messages.at(-1)?.content ?? ''
  const store = await makeStore()
  const chain = await makeChain(store)
  const stream = await chain.stream(question)
  // LangChainAdapter converts the LangChain stream into the data-stream response useChat expects
  return LangChainAdapter.toDataStreamResponse(stream)
}
Edge vs Node: the Pinecone SDK imports node:stream, which is absent in the Edge Runtime, so we stay on Node. Cold starts are slightly higher, but Fluid Compute still keeps instances warm. (Next.js, Vercel)
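One knob worth knowing about: streaming RAG answers can outrun the default function timeout, and Next.js lets you raise it per route with the maxDuration segment config. A minimal sketch, assuming 60 s fits your Vercel plan’s limits:
// app/api/chat/route.ts — extra route segment config (the value is illustrative)
export const maxDuration = 60 // seconds the function may run, including streaming time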
And finally, we create the chat UI using the useChat hook from the Vercel AI SDK. This hook manages the chat state and handles streaming responses from the API.
components/Chat.tsx
'use client'
import { useChat } from '@ai-sdk/react'
export default function Chat() {
const { messages, input, handleInputChange, handleSubmit } = useChat({ api: '/api/chat' })
return (
<form onSubmit={handleSubmit} className="space-y-4">
<ul className="space-y-2 max-h-[60vh] overflow-y-auto">
{messages.map((m) => (
<li key={m.id} className={m.role === 'user' ? 'text-right' : ''}>
<span className="px-3 py-2 rounded-lg inline-block bg-zinc-800">{m.content}</span>
</li>
))}
</ul>
<input
value={input}
onChange={handleInputChange}
className="w-full px-3 py-2 rounded-lg bg-zinc-900"
placeholder="Ask me about the docs…"
/>
</form>
)
}
useChat streams tokens as they arrive and manages local chat state for you. (AI SDK, Vercel)
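To put it on screen, render the component from a page. A minimal app/page.tsx sketch (the Tailwind classes are just placeholders):
import Chat from '@/components/Chat'
export default function Home() {
  return (
    <main className="mx-auto max-w-2xl p-6">
      <h1 className="mb-4 text-xl font-semibold">Docs bot</h1>
      <Chat />
    </main>
  )
}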
vercel --prod
Fluid Compute keeps micro-VMs hot, caches bytecode, and allows in-function concurrency—cutting cold starts 30-60%. (Vercel)
- Hybrid Search – combine pgvector for “last week” edits with Pinecone for the archive.
- LangSmith Tracing – set LANGCHAIN_TRACING_V2=true (plus LANGCHAIN_API_KEY) to inspect every prompt, retrieval, and model call.
- Rerank – Pinecone’s new inference reranker boosts relevance without re-embedding. (Pinecone)
- Upgrade to GPT-5 – arriving in August with stronger reasoning. (The Verge)

Just to be clear on the budget, here’s a breakdown of the expected costs for running this RAG chatbot on Vercel with Pinecone and OpenAI:
Item | Cost |
---|---|
Embeddings: 100 k tokens × text-embedding-3-small | $0.002 (OpenAI Platform) |
Chats: 1 000 req × 1 800 tokens (in+out) on o3 | ≈ $9 (token-based) |
Pinecone Starter (≤2 GB) | Free (Pinecone) |
Vercel Hobby | Free (Pro is $20 if you need more CPU) |
In ~150 lines you now have a warm-booting, vector-grounded chatbot that keeps its answers grounded in your own docs instead of hallucinating. Fork it, theme it, or bolt on extra tools, and if you find a slick tweak, drop a comment. Happy hacking!