
Building Long-Term Memory for AI Agents With ChromaDB

Sep 12, 2022 · 6 min read · Utso Sarkar

Everyone building AI agents in 2022 had the same dirty secret: their “memory” was a bloated system prompt and a prayer. You paste in the last twenty messages, hit the token limit, watch the model forget who you are, and call it a product. I got tired of pretending that worked.

I was running a personal Telegram agent as a side project and needed it to remember deals, people, half-baked ideas, and the specific way I phrase rejection emails. Stateless LLM calls were useless for that. I needed durable, queryable memory that survived restarts and did not require me to manually curate context windows every morning.

ChromaDB was the pragmatic choice. Not because it was the best vector database on paper, but because I could embed it in a Python service in an afternoon, point it at local disk, and stop thinking about infrastructure until I had paying users. Perfect is the enemy of a founder who still writes ingestion scripts at midnight.

Why Prompt Stuffing Is Not Memory

Memory is not “more tokens in the context window.” Memory is selective retrieval under uncertainty. When I ask “what did I promise the investor last Tuesday,” the system should not replay our entire chat history. It should pull the three fragments that matter, rank them, and inject them before the model answers.

Naive approaches fail in predictable ways:

Recency bias. Recent messages crowd out older but more important facts.
No semantic jump. Keyword search misses paraphrases. “Term sheet” and “signed docs” should connect.
No persistence model. Restart the process, lose the illusion of continuity.
Cost explosion. Every turn re-sends the full history. Your API bill scales with anxiety.

Vector stores fix the retrieval problem if you treat memory as a pipeline, not a dump.

Architecture: Memory Retrieval Pipeline

The core loop is boring and correct: write embeddings on ingest, query by similarity on read, assemble a bounded context pack for the LLM.

```mermaid
flowchart LR
    subgraph ingest [Ingest Path]
        TG[Telegram Message] --> Parser[Message Parser]
        Parser --> Chunker[Semantic Chunker]
        Chunker --> Embed[Embedding Model]
        Embed --> Chroma[(ChromaDB Collection)]
    end

    subgraph retrieve [Retrieval Path]
        Query[User Query] --> QEmbed[Query Embedding]
        QEmbed --> Search[Similarity Search]
        Chroma --> Search
        Search --> Rank[Re-rank and Filter]
        Rank --> Pack[Context Pack Builder]
        Pack --> LLM[LLM Completion]
        LLM --> Reply[Telegram Reply]
    end

    subgraph meta [Metadata Layer]
        Chroma --> Meta[tags, timestamps, source]
        Meta --> Rank
    end
```

Telegram was the interface because founders live on their phones and because async messaging maps cleanly to agent loops. A message arrives, you classify intent, you maybe retrieve memory, you respond. No fake typing indicators required for v1.

Chunking: Where Most People Blow It

I chunked by conversational turn pairs, not arbitrary token windows. A user message plus the assistant reply became one unit when they were tightly coupled; standalone notes became single chunks. Metadata mattered as much as vectors:

source: telegram, manual note, email forward
timestamp: ISO string, used for decay and “last week” queries
entity_tags: extracted names, companies, project codenames
importance: manual pin or heuristic score
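
A minimal sketch of turn-pair chunking, under stated assumptions: messages arrive as dicts with hypothetical keys role, text, and ts, and "tightly coupled" is approximated as "an assistant reply directly follows a user message". Entity extraction is left out, and the importance value is a placeholder.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_turns(messages, source="telegram"):
    """Pair each user message with the assistant reply that directly follows it.

    Anything not followed by an assistant reply becomes a standalone chunk.
    """
    chunks = []
    i = 0
    while i < len(messages):
        msg = messages[i]
        nxt = messages[i + 1] if i + 1 < len(messages) else None
        if msg["role"] == "user" and nxt is not None and nxt["role"] == "assistant":
            text = f"User: {msg['text']}\nAssistant: {nxt['text']}"
            i += 2
        else:
            text = msg["text"]
            i += 1
        chunks.append(Chunk(text, {
            "source": source,
            "timestamp": msg["ts"].isoformat(),  # ISO string, as in the list above
            "importance": 0.0,  # placeholder; a manual pin or heuristic would set this
        }))
    return chunks
```

Each chunk takes the timestamp of its first message, which keeps "last week" style queries anchored to when the exchange started.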

ChromaDB’s metadata filtering saved me from building a second database early. “Similarity search within last 14 days tagged investor” is a product feature, not a research paper.
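
As a rough sketch, that kind of query comes down to building a filter dict and passing it as the where argument of collection.query. Chroma's range operators such as $gte compare numbers, so this sketch assumes a hypothetical numeric ts_epoch field stored next to the ISO timestamp, plus a hypothetical single-valued tag field. Neither name comes from the original setup.

```python
import time


def recent_tag_filter(tag, days, now=None):
    """Build a Chroma-style `where` filter: a given tag within the last `days` days.

    `tag` and `ts_epoch` are hypothetical metadata field names.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # seconds in a day
    return {
        "$and": [
            {"tag": {"$eq": tag}},
            {"ts_epoch": {"$gte": cutoff}},
        ]
    }
```

The result would be used roughly as collection.query(query_embeddings=[...], n_results=20, where=recent_tag_filter("investor", 14)).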

Embedding Model Choices in Late 2022

I used text-embedding-ada-002 via API for quality and all-MiniLM-L6-v2 locally when I wanted zero marginal cost on ingestion experiments. The local model was worse on proper nouns and Indian company names. The API model cost cents per thousand chunks. For a personal agent, API won. For a batch re-index of 50k Slack messages, local won and I lived with the recall hit.

Do not fetishize embedding benchmarks. Measure recall@k on your own queries. I kept a spreadsheet of fifty questions I actually ask my agent and scored retrieval weekly. That beat every leaderboard.
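
The scoring itself is small enough to live in a script instead of a spreadsheet. A minimal sketch, assuming each test question has a hand-labelled set of relevant chunk IDs (the function names are illustrative):

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, one per question."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0
```

Tracking the mean week over week is what shows whether a chunking or embedding change helped on the questions that matter.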

Writing Path: When to Remember

Not every message deserves persistence. I added a lightweight classifier (a fine-tuned small model at first, later just GPT-3.5 with a rigid JSON schema) that sorted each message into one of three classes:

Ephemeral: greetings, jokes, one-off calculations
Durable: commitments, preferences, contact facts, project state
Derived: summaries the agent produced that should compound

Durable writes went to Chroma. Derived summaries got their own collection so retrieval could prefer distilled facts over raw chat sludge.
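
The routing step can be sketched as follows. The labels match the three classes above; the collection names and the JSON shape ({"label": ...}) are assumptions for illustration, not the schema actually used.

```python
import json

# Hypothetical collection names; "ephemeral" is deliberately absent so it is never stored.
ROUTES = {"durable": "memory", "derived": "summaries"}


def route_decision(raw_json):
    """Return the target collection name, or None if the message should not be stored."""
    try:
        label = json.loads(raw_json).get("label")
    except (json.JSONDecodeError, AttributeError):
        return None  # malformed classifier output: fail closed and store nothing
    return ROUTES.get(label) if isinstance(label, str) else None
```

Failing closed on bad output means a confused classifier costs a missed memory, not a polluted collection.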

Retrieval Tuning That Actually Moved Numbers

Similarity alone is naive. My production-ish pipeline:

Embed the query.
Pull top 20 from Chroma with metadata pre-filter when possible.
Re-rank with a cheap cross-encoder or heuristic blend of similarity + recency + importance.
Pack until ~1500 tokens of memory context, hard cap.

The hard cap is non-negotiable. Unbounded retrieval is how you recreate prompt stuffing with extra steps.
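
A minimal sketch of the heuristic re-rank and the capped packer. The weights, the 30-day half-life, and the whitespace token counter are all placeholder assumptions; a real tokenizer and tuned weights would replace them.

```python
import math


def blended_score(similarity, age_days, importance,
                  w_sim=0.6, w_rec=0.25, w_imp=0.15, half_life_days=30.0):
    """Weighted blend of similarity, exponentially decayed recency, and importance."""
    recency = math.exp(-math.log(2) * age_days / half_life_days)
    return w_sim * similarity + w_rec * recency + w_imp * importance


def pack_context(candidates, max_tokens=1500, count_tokens=lambda t: len(t.split())):
    """Greedily pack the best-scoring chunks without exceeding the token cap."""
    ranked = sorted(
        candidates,
        key=lambda c: blended_score(c["similarity"], c["age_days"], c["importance"]),
        reverse=True,
    )
    packed, used = [], 0
    for chunk in ranked:
        cost = count_tokens(chunk["text"])
        if used + cost > max_tokens:
            continue  # skip what would bust the cap; a smaller chunk may still fit
        packed.append(chunk)
        used += cost
    return packed
```

The packer never truncates a chunk. It either fits whole or is left out, which keeps quoted facts intact.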

Failure Modes I Hit in Production

Duplicate chunks. The agent repeated the same fact because I ingested near-identical messages. Fix: dedupe by cosine similarity threshold on insert.
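
A sketch of the insert-time check, with a hypothetical threshold of 0.95. It scans a list of existing vectors for clarity; in practice the nearest neighbour would come from a one-result query against the collection.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(new_vec, existing_vecs, threshold=0.95):
    """True if the new embedding is at least `threshold` similar to any stored one."""
    return any(cosine_similarity(new_vec, v) >= threshold for v in existing_vecs)
```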

Stale memory wins. Old wrong facts outranked corrections. Fix: a supersede pattern. The correcting chunk carries a supersedes_id in its metadata pointing at the chunk it replaces, and superseded chunks are filtered out at read time.
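
The read-time filter is short. A sketch, assuming each retrieved chunk is a dict with an id and a metadata dict (that shape is an assumption, not Chroma's return format):

```python
def drop_superseded(chunks):
    """Remove any chunk that another chunk in the same result set supersedes."""
    superseded = {
        c["metadata"]["supersedes_id"]
        for c in chunks
        if c["metadata"].get("supersedes_id")
    }
    return [c for c in chunks if c["id"] not in superseded]
```

One caveat: this only works when the correction is retrieved alongside the stale chunk. If the correction can fall outside the top results, the stale chunk also needs to be flagged at write time.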

Hallucinated retrieval confidence. The model cited memory that was only weakly related. Fix: force the model to quote chunk IDs in an internal scratchpad, and drop chunks below a similarity floor before they reach the prompt.
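
A sketch of the floor, under two assumptions: the collection uses cosine space, where the distance Chroma returns is 1 minus cosine similarity, and 0.75 is a made-up floor. Inputs are the ID and distance lists for a single query.

```python
def apply_similarity_floor(ids, distances, floor=0.75):
    """Keep (id, similarity) pairs whose similarity meets the floor.

    Assumes cosine distance, so similarity = 1 - distance.
    """
    kept = []
    for chunk_id, distance in zip(ids, distances):
        similarity = 1.0 - distance
        if similarity >= floor:
            kept.append((chunk_id, similarity))
    return kept
```

An empty result is a valid outcome. Answering with no memory beats answering with the wrong memory.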

Telegram rate limits. Burst ingestion during a long voice-note rant tripped limits. Fix: queue with backoff. Boring. Correct.
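
The backoff half of that fix, sketched with exponential delays and jitter. The send callable is hypothetical and is assumed to return True on success and False when rate limited. The queue itself is omitted.

```python
import random


def backoff_delays(max_retries=5, base=1.0, cap=60.0, jitter=random.random):
    """Yield sleep durations: exponential growth, capped, scaled by random jitter."""
    for attempt in range(max_retries):
        yield min(cap, base * 2 ** attempt) * jitter()


def send_with_backoff(send, payload, sleep, **kwargs):
    """Try `send(payload)` up to max_retries times, sleeping between failed attempts."""
    for delay in backoff_delays(**kwargs):
        if send(payload):
            return True
        sleep(delay)
    return False
```

Passing sleep in as an argument (time.sleep in a real service) keeps the retry logic testable without waiting.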

Agent Memory Lessons

Building memory turned the agent from a clever parrot into something I could delegate to. It still lied sometimes. But it lied consistently about the same outdated facts, which meant I could debug memory instead of debugging “the model.”

ChromaDB was not forever infrastructure. It was the right 80% solution while I validated whether anyone besides me wanted an agent that remembered. They did not, in large numbers, in 2022. But I learned that memory UX is harder than memory engineering: users do not want to manage vectors; they want to be understood.

If you are building agents today, steal this pipeline. Swap Chroma for whatever your cloud vendor subsidizes. Keep the split between ingest, retrieve, pack, and generate. Your future self will thank you when you need to audit why the bot thought you still worked at a company you left in March.

Memory is not a feature slide. It is plumbing. Build the plumbing first, then lie on stage about AGI.
