<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Utso Sarkar</title><description>Writing on ML research, startups, and systems.</description><link>https://utso.stamped.work/</link><item><title>LECE: How Founder OS Learns From Your Life</title><link>https://utso.stamped.work/blog/2026-06-29-lece-how-founder-os-learns-from-your-life/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2026-06-29-lece-how-founder-os-learns-from-your-life/</guid><description>The Lived-Experience Cognitive Engine turns flight-recorder traces into principles, rehearses high-stakes actions before you send them, and earns autonomy one capability at a time. Local-first. No goldfish.</description><pubDate>Mon, 29 Jun 2026 00:00:00 GMT</pubDate><content:encoded>Every serious agent logs everything. Tool calls, timings, failures, final answers - JSONL traces piling up in a folder like security footage nobody watches.

Founder OS was the same for months. Rich flight recorder. Zero learning loop. The system could **act** but not **grow**. It was a capable body with no memory that compounded.

I wrote about *why* I built the body in the first place - two thousand outreach threads, a lying spreadsheet, Telegram as command center: [I Built Founder OS Because I Was Drowning in Two Thousand Conversations](/blog/2026-06-28-i-built-founder-os-because-my-crm-was-lying-to-me/). This post is what I bolted on top when RAG and CRM were not enough.

It is called **LECE** - the **Lived-Experience Cognitive Engine**. The full technical whitepaper is embedded at the bottom of this post (PDF). Source and implementation live in the [Founder OS repo](https://github.com/officiallyutso/Founder-OS). I will tell you what it does, why I built it this way, and what I think is actually new.

## The Thesis in One Sentence

The most valuable training data for *your* assistant is not the internet. It is the private, day-by-day record of **your** company - and almost nobody uses it because nobody closed the loop on-device without shipping your life to a cloud trainer.

LECE closes the loop.

## Three Pillars (Not Three Features)

LECE is not a module you toggle for &quot;smarter replies.&quot; It is three reinforcing systems:

```mermaid
flowchart TB
    subgraph live [Every day you work]
        act[You + Founder OS act]
        trace[(Flight recorder traces)]
    end

    subgraph p1 [Pillar 1: Distillation]
        episodes[Score episodes]
        principles[(Principles + manual)]
    end

    subgraph p2 [Pillar 2: Preplay]
        twin[Digital twin of your business]
        sandbox[Sandbox - no real sends]
        preplay[Rehearse before commit]
    end

    subgraph p3 [Pillar 3: Workspace + Trust]
        ws[Global Workspace attention]
        trust[Earned autonomy per action]
    end

    act --&gt; trace
    trace --&gt; episodes --&gt; principles
    principles --&gt; act
    act --&gt; preplay --&gt; sandbox
    preplay --&gt; twin
    twin --&gt; preplay
    ws --&gt; act
    trust --&gt; act
    act --&gt; twin
```

Each pillar echoes published research (EvolveR, continual experience internalization, ProPlay-style world models, Global Workspace architectures). The contribution is not inventing those ideas in isolation. It is **wiring them into one circuit on one founder&apos;s private data**.

---

## Pillar 1: Self-Distillation - The Diary Someone Finally Reads

Every turn Founder OS lives gets appended to `data/traces/*.jsonl`. In most agent projects that file is for debugging. Post-mortems. Git blame for the model.

LECE reads it on a schedule - nightly, after memory consolidation - and runs a pipeline I care about deeply:

1. **Segment and score** episodes (0–1) from signals already in the system: did verification pass? did I approve the action? did outreach succeed? did self-healing kick in mid-task?
2. **Off-policy filter** - only keep genuinely good episodes as teaching material. Learning from your current flailing reinforces mistakes. The 2026 literature on continual internalization is brutal about this. I listened.
3. **Distill principles, not anecdotes** - not &quot;emailed Sarah Tuesday&quot; but &quot;for technical-founder outreach, lead with a specific product observation before generic praise.&quot;
4. **Deduplicate and decay** - near-duplicate principles reinforce; stale advice quarantines. A mind that only accumulates gets cluttered. LECE forgets on purpose.
5. **Inject** - principles land in a living **Personalized Operating Manual**, surfaced turn-by-turn and step-by-step during reasoning, not dumped once into a bloated system prompt.
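
A minimal sketch of steps 1, 2, and 4, not the code in the repo. The signal names, weights, threshold, and decay rate are all assumptions chosen for illustration; the post only says that episodes are scored 0–1 from existing signals, filtered, and decayed.

```python
from dataclasses import dataclass

# Hypothetical weights: the post names the signals but not how they combine.
WEIGHTS = {"verified": 0.5, "approved": 0.25, "outcome_ok": 0.25}
SELF_HEAL_PENALTY = 0.25  # assumed cost when self-healing kicked in mid-task
KEEP_THRESHOLD = 0.75     # assumed cut-off for a genuinely good episode


def score_episode(signals):
    """Combine boolean signals into a score clamped to the 0-1 range."""
    score = sum(w for name, w in WEIGHTS.items() if signals.get(name))
    if signals.get("self_healed"):
        score -= SELF_HEAL_PENALTY
    return max(0.0, min(1.0, score))


def keep_for_teaching(episodes):
    """Off-policy filter: only high-scoring episodes become teaching material."""
    return [ep for ep in episodes if score_episode(ep) >= KEEP_THRESHOLD]


@dataclass
class Principle:
    text: str
    strength: float = 1.0


def decay(principles, rate=0.9, quarantine_below=0.3):
    """Weaken every principle once per run, then split active from quarantined."""
    active, quarantined = [], []
    for p in principles:
        p.strength *= rate
        (active if p.strength >= quarantine_below else quarantined).append(p)
    return active, quarantined
```

In this sketch, reinforcement would simply reset or raise `strength` when a near-duplicate principle is distilled again, so only advice that stops being confirmed drifts toward quarantine.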

After a few weeks the agent **reasons differently** than day one - not because weights changed (there is an optional local LoRA path for that later), but because it carries compressed wisdom from *our* lived experience.

And it never left my laptop.

That matters when your CRM contains investors who ghosted, enterprise pilots that stalled, and the exact wording you used when you were wrong about pricing.

---

## Pillar 2: Digital Twin + Preplay - Rehearse Before You Send

Distillation makes the system wiser about the **past**. The dangerous moments are **irreversible**: the email to your best lead, the public post, the calendar commit you cannot unsend.

Humans rehearse. &quot;If I open aggressive, he balks. If I anchor softer and follow up Thursday, he engages.&quot; LECE gives Founder OS that habit.

**Digital twin** - structured state of *my* business: CRM pipeline by stage, tasks, goals, runway, recent outcomes. Layered on top: a **procedure graph** learned from the action log - when the business looked like *this* and I took *that* action, *this* tended to happen, with *this* reliability.

**Sandbox** - when simulation mode is on, a flag propagates through execution context and the tool layer **intercepts every real-world side effect**. `send_email` in the sandbox does not send. It returns a prediction. Enforced at the registry level so rehearsal cannot leak into production. I can let the agent imagine freely; the world does not feel it.
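
One way to picture registry-level interception, as a sketch under stated assumptions: `ToolRegistry`, `SIMULATION`, and `rehearse` are hypothetical names, and a context variable is only one plausible way to propagate the flag. The point is that the check lives in the single place every tool call passes through.

```python
import contextvars

# Assumption: simulation mode travels with the execution context.
SIMULATION = contextvars.ContextVar("simulation", default=False)


class ToolRegistry:
    """Hypothetical registry: every tool call goes through one choke point."""

    def __init__(self):
        self._tools = {}

    def register(self, name, real, predict):
        self._tools[name] = (real, predict)

    def call(self, name, **kwargs):
        real, predict = self._tools[name]
        if SIMULATION.get():
            # Sandbox: never run the real side effect, return a prediction.
            return {"simulated": True, "prediction": predict(**kwargs)}
        return {"simulated": False, "result": real(**kwargs)}


sent = []  # stands in for the outside world

registry = ToolRegistry()
registry.register(
    "send_email",
    real=lambda to, body: sent.append((to, body)) or "sent",
    predict=lambda to, body: "reply likely" if "observation" in body else "unclear",
)


def rehearse(name, **kwargs):
    """Run one tool call with simulation mode on, then restore the flag."""
    token = SIMULATION.set(True)
    try:
        return registry.call(name, **kwargs)
    finally:
        SIMULATION.reset(token)
```

Because tools never check the flag themselves, a newly registered tool is sandboxed by default in this design rather than by remembering to opt in.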

**Preplay** - before high-stakes tools fire, Founder OS generates candidate variants, estimates outcomes (procedure graph, swarm role-play, or hybrid), picks a branch, and only then routes the real action through the normal approval gate - with reasoning attached so I see *why* it chose what it chose.

After the real action lands, **expectation feedback** compares prediction to reality and updates transition reliability. Imagination gets calibrated. Wrong rehearsals hurt the model&apos;s self-trust the same way wrong predictions hurt yours.
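
Expectation feedback can be sketched as a per-transition hit counter. This is an illustration, not the update rule in the repo: `ProcedureGraph`, the state and action labels, and the Laplace-smoothed ratio are all assumptions.

```python
from collections import defaultdict


class ProcedureGraph:
    """Hypothetical store of reliability counts per (state, action, predicted outcome)."""

    def __init__(self):
        # Each transition starts from one imagined hit and one imagined miss.
        self._counts = defaultdict(lambda: [1, 1])

    def reliability(self, state, action, outcome):
        hits, misses = self._counts[(state, action, outcome)]
        return hits / (hits + misses)

    def feedback(self, state, action, predicted, observed):
        """Compare prediction to reality and update the transition that predicted it."""
        counts = self._counts[(state, action, predicted)]
        if predicted == observed:
            counts[0] += 1
        else:
            counts[1] += 1
        return self.reliability(state, action, predicted)
```

A transition that keeps predicting replies that never arrive sinks toward zero here, which is the calibration effect described above.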

When I was juggling Stamped Energy conversations with plant managers who have seen a hundred vendors lie about OPC-UA integration, I did not want an agent that drafts fast. I wanted one that **thinks before it spends my credibility**.

---

## Pillar 3: Global Workspace + Trust You Earn

The third pillar is what makes it feel less like a tool and more like something **awake** - without handing it keys to your kingdom on day one.

**Global Workspace** (in the cognitive-architecture sense): specialized processes bid for attention on a blackboard - pending approvals, overdue follow-ups, goal pressure, recent failures, twin risk when runway or backlog goes red. Highest salience wins and gets broadcast into deliberation. Optional continuous mode ticks this every couple of minutes so the system can tap your shoulder before you remember to ask.
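
The bidding step reduces to a small selection rule. The sketch below assumes a hypothetical `Bid` record and a salience floor that the post does not specify; how each process computes its own salience is left out.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    source: str      # which specialized process is bidding
    message: str     # what it wants deliberation to see
    salience: float  # 0-1, how urgently it wants attention


def broadcast(bids, floor=0.5):
    """Return the highest-salience bid, or None if nothing clears the floor."""
    if not bids:
        return None
    winner = max(bids, key=lambda b: b.salience)
    return winner if winner.salience >= floor else None
```

The floor is a design choice in this sketch: without it, a continuous loop would interrupt with whatever happened to be the least unimportant item on every tick.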

But proactive agents are terrifying unless autonomy is **earned**.

Most products give you a global dial: cautious, balanced, yolo. That is nonsense. You trust a cofounder to book travel before you trust them to email your lead investor.

LECE tracks **per-action-type competence** from outcomes and the failure ledger. Autonomy scales with demonstrated success, modeled risk tolerance, and preplay danger signals. It can grant *more* caution than static settings. It can **never** grant less than the hard floor - irreversible actions still hit the approval gate. Sacred list. Non-negotiable.
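
A rough sketch of how a learned, per-action trust score can sit on top of a hard floor. The action names in `ALWAYS_ASK`, the smoothing, and the `required_trust` threshold are assumptions for illustration; the real system would derive them from the failure ledger and the modeled risk tolerance.

```python
# Assumption: the sacred list of irreversible actions; names are illustrative.
ALWAYS_ASK = {"send_email", "post_public", "commit_calendar"}


def needs_approval(action, successes, failures, preplay_danger=0.0, required_trust=0.8):
    """Decide whether an action must go through the human approval gate."""
    if action in ALWAYS_ASK:
        return True  # hard floor: learned trust can never bypass this
    # Smoothed success rate, so a new action type starts at 0.5, not 1.0.
    competence = (successes + 1) / (successes + failures + 2)
    # A preplay danger signal (0-1) can only reduce trust, never raise it.
    trust = competence * (1.0 - preplay_danger)
    return not (trust >= required_trust)
```

The floor check comes first and returns early, so no amount of demonstrated success on an irreversible action can reach the code path that grants autonomy.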

There is also a lightweight **theory of mind** of me - communication style, decision speed, risk appetite - refined from how I actually behave, not an onboarding form I lied on.

---

## A Composite Day (All Three Pillars)

3 a.m. Distillation runs. Yesterday&apos;s successful outreach episodes become a reinforced principle about technical-founder messaging. A stale principle about a channel I abandoned decays into quarantine.

8 a.m. Briefing surfaces three follow-ups I would have missed.

10 a.m. I say: follow up with the lead from the energy conference. The agent recalls my terse style, preplays three draft variants against the twin of that relationship, picks the observation-led one, routes it to my phone for one-tap approve. I tap yes.

Four days later the lead replies. CRM updates. Tonight that episode becomes tomorrow&apos;s teaching material. The procedure graph ticks reliability upward for *that kind of message in that kind of state*.

That is the flywheel: **act → record → distill → imagine → act better → measure → distill again**.

Not a demo. A loop.

---

## What Is Actually New (Intellectual Honesty)

I did not invent self-distillation, world models, or Global Workspace theory. I stand on EvolveR, EvoSC, continual experience internalization, ProPlay, and the 2026 GWT agent papers - cited in the repo.

What I claim:

1. **The problem formulation** - personal, private, longitudinal, on-device improvement from one human&apos;s operational life. Not anonymous benchmarks.
2. **Closed loop on private life** - traces → principles → behavior; twin → preplay → safer actions → measured outcomes → sharper twin. Integrated in shipping code.
3. **Sandbox at the tool registry** - structural guarantee that rehearsal cannot touch production.
4. **Trust as a learned, per-capability quantity** with a hard safety floor.

Synthesis counts. The transistor was known physics. The arrangement was not.

---

## Config and Limits (Because I Am Not Selling Magic)

LECE is mostly **default-on** for safe learning (Tier A distillation). Heavy stuff is behind flags: the continuous workspace loop, Tier B LoRA training on a local GPU, and swarm preplay rollouts that are LLM-estimated, not full simulation.

Read the config table in the whitepaper below or `docs/LECE.md` in the repo if you are running it yourself. Start with distillation and preplay. Turn on continuous cognition when you accept the API bill.

Swarm preplay is not a physics engine. Founder theory-of-mind is heuristic until you give it months of data. Tier B weight training needs a GPU and discipline about eval gates.

I would rather ship honest limits than a render.

---

## How This Connects Back

Founder OS is the **body** - tools, CRM, Telegram, swarm, immune system. [That story is here](/blog/2026-06-28-i-built-founder-os-because-my-crm-was-lying-to-me/).

LECE is the **part that remembers** - the thing that turns a sequence of brilliant disconnected moments into something that compounds.

If you are building agents for founders (or for anyone with a longitudinal, high-stakes life), stop optimizing only for benchmark cleverness. Optimize for **continuity**: what happened, what worked, what to rehearse, what to trust.

The model is the engine. LECE is what keeps it from resetting every morning.

Code and docs: [github.com/officiallyutso/Founder-OS](https://github.com/officiallyutso/Founder-OS). If you build on it or think I am wrong about any of this, tell me. The machine is listening. For once, it will remember what you said.</content:encoded></item><item><title>I Built Founder OS Because I Was Drowning in Two Thousand Conversations</title><link>https://utso.stamped.work/blog/2026-06-28-i-built-founder-os-because-my-crm-was-lying-to-me/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2026-06-28-i-built-founder-os-because-my-crm-was-lying-to-me/</guid><description>Since January I have contacted thousands of people for Stamped and Stamped Energy. Spreadsheets lied, ChatGPT forgot, and I needed something that lived in Telegram and actually remembered.</description><pubDate>Sun, 28 Jun 2026 00:00:00 GMT</pubDate><content:encoded>Since January 2026 I have been in outreach mode nonstop. Stamped, then the Bangalore YC week in mid-April, then the pivot to Stamped Energy in May. I have talked to plant managers, integrators, investors, founders, operators - the kind of volume where you stop remembering *which* polite &quot;send a deck&quot; was from last Tuesday.

I am not exaggerating for effect. I have contacted on the order of **two thousand people** across email, LinkedIn, Telegram, and introductions. Not two thousand deep relationships. Two thousand threads that each need a state: interested, ghosted, hard no, &quot;follow up after their board meeting,&quot; &quot;wrong ICP but intro&apos;d us to someone useful.&quot;

That is not a networking brag. It is an operational nightmare.

## The Problem Nobody Warns You About

Everyone talks about finding product-market fit. Nobody talks about what happens when you are actually doing the work of finding it: you become a human CRM with amnesia.

I had spreadsheets. I had Notion. I had starred emails and half-finished labels in Gmail. I had notes in three apps because each one failed at a different axis. The spreadsheet knew *who*. It did not know *what I promised them* or *why I marked them warm in March and cold in June*.

Worse: I had ChatGPT and other LLM tabs open constantly. Brilliant for drafting one email. Useless the next morning. It did not know Vinayak and I pivoted to energy intelligence. It did not know which lead already said no. It did not know I prefer follow-ups short, specific, and never apologetic. Every session was day one.

Vinayak and I split building and GTM, which helps enormously. But even with a co-founder, **you still need a system that holds the whole map** - product facts, pipeline state, follow-up timing, what worked in outreach, what got ignored. Co-founders are not a substitute for institutional memory. They are another human with their own overloaded brain.

I got tired of re-explaining my company to my own tools.

## What I Actually Wanted

Not another dashboard I would open once and forget. Not a SaaS CRM that wants my investor list on someone else&apos;s servers.

I wanted something that:

- Lives where I already am (**Telegram** - WhatsApp-style simplicity, but I wired it to Telegram first)
- Knows **who** is in my pipeline and **what state** they are in
- Drafts follow-ups in **my** voice without me repeating context every time
- Tracks product details as Vinayak and I change the pitch (Stamped to Stamped Energy was not a rename; the buyer changed)
- Runs in the background and **nudges me** when something is due
- Does real work - research, email drafts, calendar, reminders - not just chat

And I wanted it to get **better** the longer I used it. Not smarter in the abstract. Better at being *my* chief of staff.

That is **[Founder OS](https://github.com/officiallyutso/Founder-OS)**.

## What Founder OS Is (Honestly)

Founder OS is not a chatbot with a CRM plugin. It is an **agentic system** I run locally: you tell it an outcome (&quot;follow up with the plant manager from last week,&quot; &quot;find three competitors in energy telemetry,&quot; &quot;remind me if nobody replies in four days&quot;), and it plans, picks tools, executes, and verifies before it claims done.

On the surface it feels like texting a very competent operator. Underneath it is closer to a small company staff:

- **117 tools** - inbox over IMAP, headless browser research, CRM read/write, calendar, PDFs, web watches, voice notes via local Whisper, runway tracking, social drafts. Anything irreversible goes through a **human approval gate** first.
- **SQLite CRM** - contacts, stages, notes, follow-up dates. The thing my spreadsheet pretended to be.
- **ChromaDB memory** - semantic recall over conversations, principles, and context. Not &quot;paste your last 50 messages into the prompt.&quot;
- **Swarm mode** when a task needs parallel specialists (research, fundraising, competitive intel) instead of one general model flailing.
- **Self-healing** - retries, circuit breakers, stuck-loop detection. It runs for weeks, not just one demo conversation.
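
To make the CRM bullet concrete, here is a minimal sketch of a contacts table and an overdue-follow-ups query using the standard `sqlite3` module. The table layout, column names, and stage labels are hypothetical; the post lists what the CRM stores, not its schema.

```python
import sqlite3

# Hypothetical schema: contacts, stages, notes, and follow-up dates in one table.
SCHEMA = """
CREATE TABLE contacts (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    stage TEXT NOT NULL,
    notes TEXT DEFAULT '',
    follow_up_on TEXT
);
"""


def overdue(conn, today):
    """Names of open contacts whose follow-up date is on or before `today`."""
    rows = conn.execute(
        "SELECT name FROM contacts "
        "WHERE follow_up_on IS NOT NULL AND ? >= follow_up_on "
        "AND stage NOT IN ('hard_no', 'closed') "
        "ORDER BY follow_up_on",
        (today,),
    )
    return [name for (name,) in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO contacts (name, stage, notes, follow_up_on) VALUES (?, ?, ?, ?)",
    [
        ("Plant manager A", "interested", "asked about OPC-UA read-only", "2026-06-20"),
        ("Investor B", "ghosted", "follow up after board meeting", "2026-07-05"),
        ("Integrator C", "hard_no", "", "2026-06-01"),
    ],
)
```

Dates are stored as ISO 8601 strings in this sketch so that plain string comparison orders them correctly.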

I started building it in **May 2026**, the same month we pivoted to Stamped Energy. That timing is not accidental. The pivot multiplied outreach complexity overnight. New ICP, new vocabulary, new objections. I needed software that could keep up with *me* changing my mind every two weeks without forgetting who I already talked to.

## A Day With It

Morning: I ask what follow-ups are overdue. It pulls CRM state, checks which threads went cold, drafts three emails in the tone it has learned I use.

Afternoon: a plant manager replies with a technical question about OPC-UA read-only access. I do not re-brief the model on Stamped Energy&apos;s architecture. It already has product context and recent conversation memory.

Night: I am asleep. Scheduled jobs consolidate memory. The system records what happened today the way a flight recorder records a plane - every tool call, every outcome. That log is not decoration. It feeds the part of Founder OS I am most proud of, which I wrote about separately: **[LECE - the Lived-Experience Cognitive Engine](/blog/2026-06-29-lece-how-founder-os-learns-from-your-life/)**.

That second post is the technical heart. This one is the *why*.

## More Than RAG

People hear &quot;memory&quot; and think RAG: embed documents, retrieve chunks, stuff the prompt. Founder OS does that. It is necessary. It is not sufficient.

RAG gives you **recall**. It does not give you **judgment**. It does not distill &quot;when I lead with a specific product observation, technical founders reply more often&quot; from fifty successful outreach threads. It does not rehearse sending an email to your most important lead before you actually send it. It does not earn the right to act autonomously on tasks it has proven it can handle.

Founder OS is built for the full loop: **act → record → learn → act better**. LECE is the learning part. Without it, you have a very capable goldfish.

## Why Local-First

My pipeline is sensitive. Runway, investor conversations, enterprise prospects, half-formed strategy doubts at 2 a.m. I am not shipping that to a random multi-tenant SaaS to train someone else&apos;s model.

Founder OS runs on **my machine**. Traces, CRM, principles, distillation - local. Optional cloud LLM API calls if I want them, replaceable with Ollama. The architecture assumes **my data stays mine**.

If you are a founder doing serious outbound in India or anywhere else, you already know why this matters.

## What I Am Not Claiming

This is not &quot;AI as your co-founder.&quot; I&apos;m still in the loop. I still make the calls that matter.

This is not finished. I use it every day. It breaks in ways that teach me what to fix next. Some features are behind flags because continuous autonomous cognition is expensive and not always desirable.

It is real code, open on GitHub: [github.com/officiallyutso/Founder-OS](https://github.com/officiallyutso/Founder-OS).

## If You Are Building Something Similar

Start with the pain, not the architecture.

Mine was: **two thousand conversations and no trustworthy memory of any of them.**

If your pain is different - solo dev tools, content pipeline, hiring - the shape might differ. But the lesson generalizes: frontier models are amnesiac geniuses. The product is not the model. The product is **continuity** - CRM state, follow-up discipline, product context, and a loop that turns your lived experience into something the system can reuse tomorrow.

Read the engine post next: [LECE: How Founder OS Learns From Your Life](/blog/2026-06-29-lece-how-founder-os-learns-from-your-life/). That is where I stop venting about spreadsheets and start explaining the three pillars - distillation, digital-twin preplay, and earned trust - in detail.

If you are drowning in outreach like I was, fix the memory layer first. Everything else is lipstick on a goldfish.</content:encoded></item><item><title>Stamped Energy: First Month of Enterprise Conversations</title><link>https://utso.stamped.work/blog/2026-06-15-stamped-energy-first-enterprise-conversations/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2026-06-15-stamped-energy-first-enterprise-conversations/</guid><description>We pivoted to Stamped Energy in May 2026. One month in, here is what plant managers ask and what we are learning from customer conversations.</description><pubDate>Mon, 15 Jun 2026 00:00:00 GMT</pubDate><content:encoded>Vinayak Raizada and I pivoted to **Stamped Energy** in May 2026. We are not deployed in a plant yet. We are one month into conversations with plant managers, integrators, and people who have seen energy dashboards die in bookmark folders.

This post is what we are hearing before we claim any deployment wins.

## How We Got Here

January 2026: Vinayak and I started **Stamped** (C2PA and image authenticity, consumer ZK proofs, then B2B verification for logistics and insurance). We mapped competitors (Truepic and others). At **ETHMumbai 2026** we won the **Hashed Emergent VC track** with the [stamped-ethm](https://github.com/officiallyutso/stamped-ethm) ZK capture stack.

Mid-April 2026: we went to **Bangalore** for a YC event (April 15–20), explored the ecosystem, and talked to founders who had been building in India for years. That week changed how we thought about enterprise GTM.

By late spring, enterprise interviews kept surfacing a parallel pain: operational waste in manufacturing, terabytes of energy telemetry, and no decisions on top of it. In May 2026 we pivoted from Stamped to **Stamped Energy**. Same co-founder team. Different buyer and integration surface.

## What Plant Managers Ask First

Not &quot;show me your model.&quot; Not &quot;what is your Series A story.&quot;

**&quot;Will this work with our historian without opening write access?&quot;** OPC-UA read-only is table stakes. If you need a firewall exception that takes six months, you are dead before you start.

**&quot;Can I trust the number on screen?&quot;** They have seen dashboards that disagree with the meter on the wall. Side-by-side historian cross-check is not a feature. It is an admission requirement.

**&quot;Who maintains the tag mapping when equipment changes?&quot;** Integrators name tags like archaeology. Your product needs a UI plant staff will actually use, not a one-time professional services engagement.

**&quot;What happens when your model is wrong?&quot;** Maintenance calendars, dismiss-with-reason, and audit logs. False positives without context destroy trust faster than no product at all.

## What We Are Building Toward

```mermaid
flowchart TB
    subgraph PlantOT[&quot;Plant OT Network&quot;]
        PLC[PLCs / DCS]
        HIST[Plant Historian]
        OPC[OPC-UA Server]
    end

    subgraph DMZ[&quot;Site DMZ&quot;]
        GW[Stamped Energy Gateway - Read Only]
        VAL[Validation Engine]
        QS[Quality Scorer]
    end

    subgraph Cloud[&quot;Stamped Energy Cloud&quot;]
        TS[(Time Series Store)]
        ML[Anomaly Models]
        DASH[Energy Dashboard]
    end

    subgraph Users[&quot;Users&quot;]
        PM[Plant Manager - War Room]
        ENG[Process Engineer - Drill Down]
    end

    PLC --&gt; HIST --&gt; OPC
    OPC --&gt;|TLS, read-only| GW
    GW --&gt; VAL --&gt; QS
    QS --&gt; TS
    TS --&gt; ML --&gt; DASH
    DASH --&gt; PM
    DASH --&gt; ENG

    style VAL fill:#1a3a5c,color:#fff
    style QS fill:#1a3a5c,color:#fff
    style PM fill:#2d5016,color:#fff
```

Architecture we are scoping with early conversations. Not shipped to production yet.

## Month One Lessons

**Data quality is the product before optimization is.** Every serious conversation eventually lands on invalid readings, stale tags, and naming chaos. Intelligence on dirty data is a demo, not a deployment.

**Credibility beats chart density.** Plant managers want three numbers they believe, not twenty they ignore.

**Sales cycles are measured in quarters, not sprints.** We knew this from insurance discovery on Stamped. Energy confirms it with harder procurement and longer IT reviews.

**A pivot is not a rebrand.** Stamped was image authenticity. Stamped Energy is industrial energy intelligence. We kept the founder team and the discipline around trust; we did not pretend the SKUs are the same.

## What Comes Next

More conversations. Honest scoping with plants willing to talk to an early team. No inflated deployment narrative until we have earned it.

If you are a plant manager who has buried an energy dashboard before, I want to hear what killed it. That feedback is worth more than any accelerator lecture.</content:encoded></item><item><title>YC Startup School India: What It Is and Whether You Should Go</title><link>https://utso.stamped.work/blog/2026-03-20-yc-startup-school-india/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2026-03-20-yc-startup-school-india/</guid><description>An honest account of YC Startup School India in March 2026. Not affiliate marketing. A decision framework for founders who have limited time and zero patience for hype.</description><pubDate>Fri, 20 Mar 2026 00:00:00 GMT</pubDate><content:encoded>YC Startup School is free. That should immediately make you suspicious. Nothing valuable in startups is free unless you are the product or the alumni network is the real SKU.

I went through the India cohort in March 2026 while building in India. Here is what it actually is, what it is not, and whether you should spend ten weeks on it when you could be shipping.

## What Startup School Actually Is

Startup School is Y Combinator&apos;s online program for early-stage founders. Weekly lectures (often recycled from YC partners), group sessions with other founders, accountability structures (weekly updates), and access to a community of people at similar stages globally.

The India-specific cohorts add timezone-friendly sessions and occasionally India-focused guest speakers. The curriculum is largely the same as the global program with regional community layering.

It is not Y Combinator admission. Completing Startup School does not improve your YC application in any guaranteed way. YC partners have said this publicly. Founders still believe it. Belief persists because hope is cheaper than evidence.

## What You Actually Get

**Structured accountability.** Weekly updates force you to articulate progress. If you lack discipline, external structure helps. If you already ship daily, it is redundant.

**Lecture library.** Paul Graham essays in video form, partner talks on fundraising, growth, hiring. Quality is high. Also available on YouTube without the cohort structure.

**Peer network.** You meet other founders in your group. Quality varies wildly. Some groups are serious builders. Some are idea-stage tourists collecting certificates for LinkedIn.

**Credibility signal (weak).** &quot;YC Startup School alum&quot; impresses people who do not know the difference between SUS and YC batch. Sophisticated investors know the difference.

## What You Do Not Get

- Funding
- YC partner mentorship at batch intensity
- Introductions to YC&apos;s investor network
- Validation that your idea is good
- India-specific regulatory or GTM playbooks beyond occasional talks

If you need any of the above, Startup School is the wrong product.

## Decision Framework

```mermaid
flowchart TD
    Start[Considering Startup School India?] --&gt; Q1{Do you have a co-founder and a shipped MVP?}
    Q1 --&gt;|No| Q1a{Need accountability to start?}
    Q1a --&gt;|Yes| Go[Consider SUS for discipline]
    Q1a --&gt;|No| Skip1[Build first. SUS later or never]

    Q1 --&gt;|Yes| Q2{Is your bottleneck knowledge or execution?}
    Q2 --&gt;|Execution| Skip2[Skip. Ship features instead]
    Q2 --&gt;|Knowledge - fundraising, hiring| Q3{Can you learn from YC content async?}
    Q3 --&gt;|Yes| Async[Watch lectures. Skip cohort]
    Q3 --&gt;|No| Q4{Will peer group quality matter for your sector?}
    Q4 --&gt;|Yes - niche B2B, deep tech| Go2[Join for network]
    Q4 --&gt;|No| Skip3[Skip. Niche communities better]

    Q2 --&gt;|Knowledge - India GTM, regulation| Skip4[Skip. SUS won&apos;t help much]
    
    Go --&gt; Done[10 weeks commitment]
    Go2 --&gt; Done
    Async --&gt; Done2[Self-paced, lower cost]
    Skip1 --&gt; Done3[Founder time preserved]
    Skip2 --&gt; Done3
    Skip3 --&gt; Done3
    Skip4 --&gt; Done3

    style Skip1 fill:#2d5016,color:#fff
    style Skip2 fill:#2d5016,color:#fff
    style Skip3 fill:#2d5016,color:#fff
    style Skip4 fill:#2d5016,color:#fff
    style Done3 fill:#1a3a5c,color:#fff
```

Most serious Indian founders with traction should land on Skip. That is not anti-YC. It is pro-time.

## India-Specific Considerations

**Timezone and community.** India cohorts help with synchronous group sessions. Global cohorts mean 2 AM lectures unless you async everything.

**Market context.** SUS teaches Silicon Valley defaults: launch fast, charge USD, hire in SF. Indian B2B enterprise sales, regulatory moats, and rupee economics get short shrift. You will hear &quot;talk to users&quot; which is correct and insufficient when your user is a plant manager with a 14-month procurement cycle.

**Fundraising narrative.** SUS emphasizes YC-style seed rounds. Indian seed dynamics in 2026 still differ from Silicon Valley defaults. Apply the principles, ignore the assumed market.

**Network value.** If your group includes founders in adjacent spaces (energy, manufacturing, fintech), cross-pollination is real. If your group is twelve AI wrapper pitches, you learned nothing except patience.

## Alternatives Worth Your Time

- **Industry-specific founder communities** with actual customers in the room
- **One paid advisor** with domain expertise vs ten weeks of general lectures
- **Customer discovery sprints** with structured interview scripts
- **Building in public** with accountability to users who pay, not peers who cheer

## My Honest Account

I participated in the March 2026 India cohort while building Stamped with Vinayak and Dhanraj. We had started in January and were pivoting from consumer ZK capture to B2B image authenticity. The weekly update discipline was useful when we were prone to rabbit holes. The lectures I had mostly already consumed as essays. The peer group had two serious founders and several tourists.

I would not do it again at my current stage. I would recommend it narrowly: first-time founders pre-MVP who need external structure and have never read PG essays.

I would not recommend it for: founders with paying customers, founders who treat it as YC pipeline strategy, founders who use completion as LinkedIn performance.

## Bottom Line

Startup School is a free, well-produced intro course with a community wrapper. It is not a moat. It is not a credential that closes rounds. It is not a substitute for shipping.

If ten weeks of structured learning beats ten weeks of building for your current stage, go. If you know it does not, the fact that it is free does not mean the time cost is zero. Your time is the most expensive line item on your cap table.

Choose accordingly.</content:encoded></item><item><title>Solo Founder vs Co-founder in India</title><link>https://utso.stamped.work/blog/2026-02-10-solo-founder-vs-cofounder-india/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2026-02-10-solo-founder-vs-cofounder-india/</guid><description>Vinayak Raizada changed how I think about co-founder fit. A framework for the decision nobody makes rationally.</description><pubDate>Tue, 10 Feb 2026 00:00:00 GMT</pubDate><content:encoded>The solo founder vs co-founder debate produces more bad advice than good product. Twitter says you need a co-founder. Paul Graham said solo founders are harder. Indian family WhatsApp groups say get a partner for &quot;stability.&quot; None of these sources know your runway, your skill gaps, or whether you can tolerate another human in your cap table for ten years.

I spent years solo on Strato Inc. In January 2026 I started Stamped with Vinayak, my batchmate from IIT Roorkee (he is Electrical Engineering; I am Mathematics &amp; Computing). Six weeks in, that collaboration is already shaping how I think about this decision. Here is the framework I wish I had before we committed.

## The Myths

**Myth: Solo founders cannot raise.** False with traction. Harder at pre-seed without network. Not impossible.

**Myth: Co-founders double output.** Only true if skills are complementary and conflict is low. Otherwise they halve equity and double meetings.

**Myth: Friends make good co-founders.** Friends make good friends. Co-founder relationship is closer to marriage with vesting cliffs and board seats.

**Myth: India requires a co-founder for credibility.** Some enterprise buyers ask about team depth. Revenue solves credibility faster than a co-founder title on a slide.

## What Vinayak Taught Me

I am genuinely fond of Vinayak. Not in the performative &quot;grateful co-founder&quot; LinkedIn sense. In the sense that six weeks of building together taught me more about how a real entrepreneur operates than years of solo Strato Inc ever could.

He multitasks without dropping threads. Integration bug at two, pitch rewrite at five, and he still has the energy to argue the product decision properly at night. I tend to go deep on one stack and let the rest queue up. He keeps parallel workstreams alive without letting any of them rot. I have learned to steal that rhythm.

His tenacity is the other thing. Early customer conversations are full of polite nos, vague &quot;send a deck,&quot; and meetings that go nowhere. Vinayak does not take the no personally and does not stop following up because the first conversation felt awkward. He has a stubbornness that is productive, not theatrical. When we pivoted from consumer ZK to B2B, he did not mourn the old thesis for three weeks. He was on the phone the next day.


Working with him on Stamped was not generic &quot;we split equity 50-50 and built a company.&quot; It was explicit about domains: who owned product vs engineering vs GTM, how decisions escalated, and what happened when we disagreed.

The collaboration worked because:

- Overlapping skills were minimal. Complementary skills were maximal
- We had worked together on smaller commitments before we committed to Stamped
- Neither needed the other for ego validation. Both needed the other for capability gaps
- Exit expectations aligned: build durable business, not flip in 18 months

It would not have worked if any of those were false. I have seen co-founder breakups where all four failed simultaneously. The lawsuits are worse than the loneliness. Vinayak passes the framework. That is not luck. That is why I am writing this post with his name attached.

## Decision Framework

```mermaid
flowchart TD
    Start[Solo or Co-founder?] --&gt; Q1{Can you ship MVP alone in 6 months?}
    Q1 --&gt;|Yes| Q2{Is your bottleneck a skill you lack?}
    Q1 --&gt;|No| NeedPartner[Strong co-founder candidate needed]

    Q2 --&gt;|No - bottleneck is time/customers| Solo1[Stay solo. Hire contractors]
    Q2 --&gt;|Yes - deep domain gap| Q3{Do you know someone with 6+ months trust history?}
    
    Q3 --&gt;|No| Solo2[Stay solo. Hire employee #1 later]
    Q3 --&gt;|Yes| Q4{Do your timelines and ambition align?}
    
    Q4 --&gt;|No| Solo3[Do not force it. Solo beats a bad co-founder]
    Q4 --&gt;|Yes| Q5{Have you worked through one real conflict together?}
    
    Q5 --&gt;|No| Trial[Run 90-day trial project. No equity yet]
    Q5 --&gt;|Yes| Cofound[Co-founder path viable]

    NeedPartner --&gt; Q3
    Trial --&gt; Q6{Trial succeeded?}
    Q6 --&gt;|Yes| Cofound
    Q6 --&gt;|No| Solo3

    style Solo3 fill:#8b0000,color:#fff
    style Cofound fill:#2d5016,color:#fff
    style Trial fill:#1a3a5c,color:#fff
```

The 90-day trial without equity is the most underrated step. Most founders skip it because urgency feels virtuous. Urgency without compatibility is expensive.

## India-Specific Factors

**Family pressure to partner.** Indian founders often face social expectation to enter business with relatives or college friends. Technical compatibility and conflict resolution matter more than shared alumni network.

**Equity and marriage negotiations.** Co-founder equity splits get compared to family property discussions. Document everything early. Use standard vesting (4 years, 1 year cliff). Cheap lawyers cost more than good lawyers.

**Geographic dispersion.** Co-founders in Bangalore and Mumbai can work with discipline. Different cities without explicit communication norms drift apart silently.

**Visa and relocation.** If one co-founder plans US relocation and the other does not, resolve that before incorporation, not after the first term sheet.

## When Solo Is Correct

- You are technical and selling to users who buy from individuals (certain dev tools, consulting)
- You have capital or revenue to hire specialists without giving co-founder equity
- Your previous co-founder relationship ended badly and you have not processed why
- Speed of decision-making is existential and your co-founder candidate is consensus-oriented

Solo is not lonely if you build a network of advisors, contractors, and early employees with real ownership (ESOP), not fake &quot;founding team&quot; titles.

## When Co-founder Is Correct

- You need deep complementary expertise from day one (hard tech + hard enterprise sales)
- The market window requires parallel workstreams you cannot serially execute
- You have a tested relationship with aligned risk tolerance
- Investors are truly blocked without team depth and you cannot hire fast enough

## The Middle Path

Not every strong collaborator is a co-founder. Structures that sit in between:

- Equal partners on a specific venture with clear incorporation
- First employee with significant ESOP and &quot;founding&quot; title
- Part-time technical co-founder with pro-rata vesting tied to hours

Forcing the co-founder label onto every key relationship dilutes the term and creates a cap table mess.


## My Current Stance

Strato Inc was solo by necessity and temperament. Stamped is built with Vinayak because enterprise B2B needed parallel GTM and engineering from day one, and because I found someone who is better than me at half the job. Same person, different structure per bet.

The decision is not solo vs co-founder. It is: do I have a specific person who passes the framework, or do I not? If not, solo is not a consolation prize. It is the correct answer until someone passes the trial.

Choose the structure that survives the first serious disagreement. Everything else is LinkedIn cosplay.</content:encoded></item><item><title>What I Actually Think About the AI Future</title><link>https://utso.stamped.work/blog/2025-06-01-what-i-actually-think-about-ai-future/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2025-06-01-what-i-actually-think-about-ai-future/</guid><description>Not a hype thread. Not doomerism. A founder-researcher&apos;s timeline for capabilities, moats, and where the money actually goes.</description><pubDate>Sun, 01 Jun 2025 00:00:00 GMT</pubDate><content:encoded>Everyone wants a single sentence about AI&apos;s future. &quot;AGI in two years.&quot; &quot;It&apos;s a bubble.&quot; &quot;It changes everything.&quot; These are personality statements dressed as forecasts.

I build ML products for enterprise customers and research verification problems on the side. Here is my actual model of the AI future, with timelines I would bet modest money on and caveats where I will not.

## Capability Timeline

```mermaid
gantt
    title AI Capability Timeline (My Estimates)
    dateFormat YYYY
    axisFormat %Y

    section Commoditized
    Text generation quality plateau           :done, 2023, 2025
    Code assist mainstream dev                :done, 2024, 2026
    Cheap open-weight inference               :done, 2025, 2027

    section Improving Fast
    Agentic workflows reliable sub-domains    :active, 2024, 2028
    Multimodal doc understanding              :2024, 2027
    Real-time voice agents commercial         :2025, 2028

    section Hard Problems
    Reliable long-horizon autonomy            :2026, 2032
    Forensic-grade vision                     :2024, 2030
    Robotics general purpose                  :2027, 2035

    section Overhyped Near Term
    Full job replacement knowledge work       :2025, 2026
    AGI                                       :2027, 2040
    Regulatory clarity global                 :2026, 2030
```

Commoditized means good enough for most use cases at low cost. Hard problems means I would not bet my company on solving them in three years.

## Hype vs Reality

**Hype:** AI replaces software engineers in five years.  
**Reality:** AI changes what engineers do. Writing boilerplate is commoditized. System design, debugging production at 3 AM, and knowing which corners not to cut remain human. Team sizes may shrink at low-performing orgs. Output per engineer rises elsewhere.

**Hype:** Every startup needs an AI strategy.  
**Reality:** Every startup needs a customer strategy. AI is implementation detail unless AI is the product. Most products need a thin AI layer or none.

**Hype:** Foundation model companies capture all value.  
**Reality:** Value accrues to distribution, data flywheels, regulatory moats, and workflow lock-in. Open weights commoditized the middle. Wrappers died. Vertical depth survives.

**Hype:** Scaling laws solve everything.  
**Reality:** Scaling helps until it hits data walls, eval walls, and economic walls. The DeepSeek moment proved economics matter as much as parameter counts.

**Hype:** AI safety pauses deployment.  
**Reality:** Enterprise procurement and liability law pause deployment faster than OpenAI&apos;s safety team does. Regulated industries move slowly for boring reasons.

## Moats That Survive

1. **Proprietary operational data** with feedback loops (labeled outcomes from real workflows, not scraped web text)
2. **Regulatory certification** (health, finance, compliance-heavy B2B)
3. **Physical world integration** (hardware, on-site trust, messy deployments)
4. **Customer switching costs** embedded in workflow, not chat interface
5. **Brand trust in high-stakes decisions** (safety, verification, money movement)

Moats that erode:

- &quot;We call GPT-4&quot;
- &quot;We have a chatbot&quot;
- &quot;We fine-tuned on public data&quot;
- &quot;Our prompt is secret sauce&quot;

## Timeline Bets I Will Make

**2025-2027:** Agentic workflows reliable in narrow domains (customer support with human escalation, code migration with test suites, document processing with validation gates). Not reliable for open-ended &quot;run my company.&quot;

**2025-2028:** Multimodal models good enough for document QA and triage. Not good enough for forensic evidence without specialized CV underneath.

**2026-2030:** Significant job restructuring in content mills, basic legal doc review, tier-1 support. Not mass unemployment. Labor market friction and retraining lag technology by years.

**2028+:** Robotics advances if sim-to-real and hardware costs improve. I am skeptical of home humanoid hype before warehouse humanoid reliability.

**AGI:** I will not give you a date. Anyone who does is selling something. Transformers were a breakthrough. Breakthroughs continue. &quot;General intelligence&quot; is definitional quicksand.


## Policy and Geopolitics

AI policy fragments by jurisdiction: the EU AI Act, US executive orders, India&apos;s evolving framework. Enterprise customers will require compliance documentation. Compliance becomes a moat for companies that invest early.

US-China model competition continues. Enterprise buyers outside both blocs may mix providers. Founders should avoid single-provider dependency for core inference.

## Personal Position

I am not an AI doomer. I am not an AI maximalist. I am a founder who has seen RAG fail, sensors lie, and plant managers ignore dashboards built by people who never visited a plant.

AI is the most important engineering shift since mobile. It is also the most oversold technology since blockchain met enterprise ERP.

Build for the world where inference is cheap, models are interchangeable, and customers still pay for outcomes. That world arrived faster than I expected in February 2025. The founders who win are the ones who never confused model capability with product value.

That is what I actually think. Ask me again in a year. I expect to revise.</content:encoded></item><item><title>DeepSeek Changed the AI Economics Overnight</title><link>https://utso.stamped.work/blog/2025-02-10-deepseek-changed-the-ai-economics-overnight/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2025-02-10-deepseek-changed-the-ai-economics-overnight/</guid><description>February 2025 broke the inference cost curve. Founders who priced on OpenAI margins need new spreadsheets.</description><pubDate>Mon, 10 Feb 2025 00:00:00 GMT</pubDate><content:encoded>DeepSeek dropped in February 2025 and every AI startup&apos;s unit economics spreadsheet became a historical document. Not because the models were magic. Because they proved that frontier-class inference could run at a fraction of the cost the market had priced in, with open weights, on hardware people already owned.

If you were building on the assumption that GPT-4 API pricing was a permanent cost floor, you were wrong. If you were building on the assumption that only hyperscalers could afford to serve large models, you were also wrong.

## What Happened

DeepSeek-R1 and associated models demonstrated competitive reasoning performance with training and inference economics that violated the consensus narrative. Open weights. Efficient architectures. Reported training costs that made Silicon Valley venture rounds look like overfunding a rounding error.

The market reaction was predictable: NVIDIA stock volatility, panic in AI infra startups selling &quot;only we can afford to run this,&quot; excitement in founders who had been margin-crushed by token pricing.

The technical reaction was messier: replication attempts, debate about training details, distillation accusations, and the usual arXiv thunderdome. For founders, the technical nuance matters less than the economic fact: inference got cheaper faster than anyone&apos;s pricing model assumed.

## The Cost Curve Before and After

```mermaid
xychart-beta
    title &quot;Inference Cost per 1M Tokens (Illustrative Index)&quot;
    x-axis [&quot;2023 Q1&quot;, &quot;2023 Q4&quot;, &quot;2024 Q2&quot;, &quot;2024 Q4&quot;, &quot;2025 Q1 Post-DeepSeek&quot;]
    y-axis &quot;Relative Cost Index&quot; 0 --&gt; 100
    line &quot;Closed Frontier API&quot; [100, 85, 70, 55, 50]
    line &quot;Open Weights Self-Hosted&quot; [90, 75, 60, 45, 15]
    line &quot;Distilled / Efficient Models&quot; [80, 65, 50, 35, 10]
```

The gap between closed API pricing and self-hosted open weights widened discontinuously in Q1 2025. Exact numbers vary by workload, quantization, and hardware. The direction does not.

## Founder Implications

**Margin structure reset.** AI-native products priced at 70% gross margin assuming $X per million tokens may now have 85% margin at same price, or competitors will undercut at same margin. Your moat was not the API wrapper. It was never the API wrapper.
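
To make the margin arithmetic concrete, here is a small sketch. It assumes inference is the only cost of goods sold, which is a simplification, and the price of 100 is a placeholder rather than a real figure.

```python
def gross_margin(price, cost):
    """Gross margin as a fraction of price."""
    return (price - cost) / price


def cost_for_margin(price, margin):
    """Cost that yields the given gross margin at this price."""
    return price * (1 - margin)


price = 100.0                              # hypothetical price per unit of usage
old_cost = cost_for_margin(price, 0.70)    # cost implied by a 70% margin
new_cost = cost_for_margin(price, 0.85)    # cost implied by an 85% margin

# Fraction by which inference cost must fall to move from 70% to 85% margin.
cost_drop = 1 - new_cost / old_cost

# Price a competitor can charge while keeping the old 70% margin on the new cost.
undercut_price = new_cost / (1 - 0.70)

print(round(cost_drop, 2), round(undercut_price, 2))
```

Under these assumptions, going from 70% to 85% margin at the same price means the cost line halved, and a competitor content with the old margin can charge roughly half the price.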

**Build vs buy recalculated.** Self-hosting DeepSeek-class models on rented GPUs became viable for Series A stage companies. Legal, ops, and ML engineering costs shift. Total cost of ownership favors teams with infra talent.

**Commoditization acceleration.** &quot;We use GPT-4&quot; stopped being a feature in February 2025. &quot;We use AI&quot; was already not a feature in 2024. Differentiation returns to workflow, data, distribution, trust.

**Fundraising narrative shift.** Investors who funded &quot;AI infra moat&quot; companies face awkward LP calls. Investors who funded vertical AI with customer lock-in sleep slightly better.

**Geopolitical dimension.** DeepSeek is a Chinese lab. US enterprise procurement adds compliance questions. Indian and global founders outside the US-China binary may have more flexibility. Read your customer&apos;s vendor policy before betting the company on one provider.

## What Did Not Change

Models still hallucinate. RAG still breaks in production. Enterprise sales cycles still measured in quarters. Regulatory requirements still exist. Customer trust still earned per deployment.

Cheaper inference does not fix bad product. It makes bad product cheaper to operate, which is mixed news.

## Strategic Responses

**Own the workflow, not the model.** Swap model backends. Customer should not notice or care beyond quality delta.
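
A minimal sketch of what a swappable backend can look like. The `ModelRouter` class and the two stub functions are hypothetical names for illustration, not any provider SDK; the only assumption is that a backend is a callable from prompt to completion.

```python
from typing import Callable, Dict

# Assumption for this sketch: a backend is any callable from prompt to completion.
Backend = Callable[[str], str]


class ModelRouter:
    """Workflow code talks to the router, never to a specific provider."""

    def __init__(self):
        self._backends: Dict[str, Backend] = {}
        self._active = None

    def register(self, name, backend):
        self._backends[name] = backend
        if self._active is None:
            self._active = name   # first registered backend is the default

    def use(self, name):
        if name not in self._backends:
            raise KeyError(f"unknown backend: {name}")
        self._active = name

    def complete(self, prompt):
        return self._backends[self._active](prompt)


# Stub backends stand in for real provider clients.
def closed_api_stub(prompt):
    return "closed:" + prompt


def self_hosted_stub(prompt):
    return "open:" + prompt


router = ModelRouter()
router.register("closed", closed_api_stub)
router.register("self_hosted", self_hosted_stub)

first = router.complete("summarise ticket")
router.use("self_hosted")   # swap the backend; calling code does not change
second = router.complete("summarise ticket")
```

The point of the indirection is that a pricing shock becomes a one-line configuration change instead of a refactor.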

**Invest in eval infrastructure.** Model switching is cheap only if you know when quality regressed. Golden sets, automated evals, production monitoring.
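
A golden-set regression check can be very small. The sketch below is one possible shape, assuming exact-match scoring and a 2-point tolerance; both are placeholders you would tune, and the toy models are plain functions rather than real inference calls.

```python
def exact_match(expected, actual):
    """Case- and whitespace-insensitive comparison."""
    return expected.strip().lower() == actual.strip().lower()


def pass_rate(model, golden_set, scorer=exact_match):
    """Fraction of golden examples the model gets right."""
    hits = sum(1 for prompt, expected in golden_set if scorer(expected, model(prompt)))
    return hits / len(golden_set)


def regressed(baseline_rate, candidate_rate, tolerance=0.02):
    """True if the candidate is worse than the baseline by more than the tolerance."""
    return baseline_rate - candidate_rate > tolerance


# Toy golden set and toy models, purely illustrative.
golden_set = [("2+2", "4"), ("capital of France", "Paris"), ("opposite of hot", "cold")]
answers = dict(golden_set)


def baseline(prompt):
    return answers[prompt]


def candidate(prompt):
    # The cheaper model gets one answer wrong.
    return "warm" if prompt == "opposite of hot" else answers[prompt]


baseline_rate = pass_rate(baseline, golden_set)
candidate_rate = pass_rate(candidate, golden_set)
block_switch = regressed(baseline_rate, candidate_rate)
```

Run this before every backend swap. If `block_switch` is true, the cheaper model is not yet a drop-in replacement for that workload.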

**Reprice or reinvest.** Either pass savings to customers for competitive win, or reinvest margin into product depth. Defaulting to founder dividend via burn reduction is valid in funding winter.

**Hardware planning.** If self-hosting, capacity planning becomes core competency. Spot instances, quantization tradeoffs, batch vs realtime serving.

## Hot Takes I Will Defend

Open weights won the economic war even if closed models win occasional benchmarks. Not forever. For this cycle.

AI startups that are CRUD apps with chatbot frontends die first in price war. Good.

Foundation model companies need new revenue stories beyond API tokens. Obvious now. Was obvious before if you listened.

Indian founders benefit from lower inference costs disproportionately because rupee revenue against dollar API bills was brutal.

## What I Changed

We re-ran inference workloads on side projects against self-hosted open models. Anything that used LLMs for explanation (not detection) migrated partially off closed APIs. Edge deployments kept models small and local for latency and air-gap requirements anyway.

Burn dropped. Dependency dropped. Eval suite got two sprints of investment.

## Timeline Estimate

Six months of chaos: pricing wars, model releases, benchmark gaming. Twelve months: stable tier structure emerges (frontier closed, efficient open, tiny on-device). Twenty-four months: next discontinuity, probably not from who you expect.

DeepSeek did not end the AI race. It changed the admission price. Founders who treat inference as commodity cost and product as moat survive. Founders who treated OpenAI as moat are updating LinkedIn headlines.

Update your spreadsheet. Then update your product. The cost curve will move again.</content:encoded></item><item><title>Doing a Mathematics Degree While Building on the Side</title><link>https://utso.stamped.work/blog/2025-02-08-iit-roorkee-mathematics-degree-founder/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2025-02-08-iit-roorkee-mathematics-degree-founder/</guid><description>I joined IIT Roorkee in August 2024. Six months in, this is the honest accounting of time cost versus benefit while still publishing apps through Strato Inc.</description><pubDate>Sat, 08 Feb 2025 00:00:00 GMT</pubDate><content:encoded>People love the story arc: dropout founder beats the system. People also love the opposite arc: founder with elite credentials wins on credibility. The messy middle, finishing a mathematics degree at IIT Roorkee while still building, does not fit a keynote slide. It fits my reality since August 2024.

I joined IIT Roorkee in August 2024 for B.S./M.S. Mathematics. Strato Inc, my Play Store publishing company, did not pause. I still ship apps. I am not running a venture-backed startup alongside exams. I am doing proof-based coursework while keeping the builder identity that predates the admission letter.

This is not advice to copy. It is a ledger from the first six months.

## Why Mathematics, Why Roorkee

Mathematics at IIT Roorkee is not a coding bootcamp with prestige. You live in proof-based coursework: real analysis, linear algebra that hurts, probability that finally makes sense months later. The department does not care about your Play Store download count. Deadlines are theorem-shaped.

I chose math because I wanted a foundation that would not expire when the next JavaScript framework died. Publishing apps rewards speed. Mathematics rewards careful definitions. Holding both in your head at 2 AM is character building or sleep destroying, depending on the week.

## The Time Tax Is Real

There are 168 hours in a week. IIT coursework can consume 50-70 if you are honest about assignments and exams. Strato Inc experiments and client work can consume every remaining waking hour if you let them. Something gives. Usually sleep, relationships, or grades. Sometimes all three.

```mermaid
pie title Typical week (exam season, 2024-25)
    &quot;Coursework + problem sets&quot; : 35
    &quot;Strato Inc + side projects&quot; : 40
    &quot;Sleep&quot; : 20
    &quot;Everything else&quot; : 5
```

That pie chart is generous. The slices are percentages of the week, and in the worst weeks the &quot;everything else&quot; slice shrinks to near zero.

## Benefits That Actually Showed Up

**Rigorous notation:** Startups throw around &quot;loss functions&quot; and &quot;convergence.&quot; A math degree makes you annoyed at sloppy definitions in a useful way.

**Credibility in India:** Outside global Twitter, IIT still opens doors. I am not proud that the brand matters. It matters.

**Peer network:** Some of my sharpest conversations now happen in hostel corridors and lab hours. Talent density is not myth.

**Optionality:** Strato Inc predates IIT. The degree adds a parallel track, not a replacement identity.

## Costs Nobody Posts on LinkedIn

**Opportunity cost of depth:** You cannot read every ML paper and ship every app update in the same week.

**Split identity:** Professors ask if you are serious about academics. The Play Store asks if you shipped this month&apos;s fix. Seriousness becomes performance.

**Mental fragmentation:** Context switching between Galois theory and a production bug is cognitive overhead with measurable error rates.

## Strategies That Keep Me Sane (Mostly)

**Semester theming:** Align electives with interests when possible. Optimization, numerical methods, probability: yes.

**Ruthless calendar blocking:** Coursework gets fixed slots. Building gets mornings before class when possible.

**Say no to vanity:** Campus clubs, random hackathons, most networking events. IIT offers infinite distraction disguised as opportunity.

**Use the institute without worshiping it:** Labs, libraries, subsidized infrastructure. Do not wait for permission to build.

## What IIT Did Not Teach

Sales. Hiring. Saying no to bad deals. Reading a term sheet. Handling a customer who ghosts after a pilot. Mathematics programs are not designed for that. You learn by doing, often expensively.

Also: modern ML engineering. You will self-study PyTorch regardless of degree.

## Should You Do Both?

If you need the degree for family, visa, or personal closure: yes, with eyes open.

If you already have a funded company with a team that needs you full-time: calculate the cost honestly.

If you think the degree is only signaling: calculate whether the signal buys something you cannot buy cheaper.

## India-Specific Notes

The IIT label travels differently in Bangalore enterprise sales than in San Francisco VC Twitter. Plan for the market you are actually selling into.

Hostel life is simultaneous accelerant and trap. Cheap rent, smart roommates, constant interruptions.

## Closing Ledger

Doing a mathematics degree while keeping Strato Inc alive is not heroic. It is arithmetic with bad constants. You trade time for credibility and rigor. You pay in stress.

I am one semester in. The Play Store account survived the transition. The coursework is hard. I would not call it optimal. I would call it mine.

If you are in the middle of it: track your hours honestly, protect one day a week from both gods, and remember that degrees are long games and apps are short ones until they are not.</content:encoded></item><item><title>2024: IIT Roorkee, Writing, and the Funding Winter</title><link>https://utso.stamped.work/blog/2024-12-31-2024-year-in-review/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2024-12-31-2024-year-in-review/</guid><description>A year-in-review on joining IIT Roorkee, ML engineering depth, and surviving the Indian startup funding crunch.</description><pubDate>Tue, 31 Dec 2024 00:00:00 GMT</pubDate><content:encoded>2024 was the year I joined IIT Roorkee in August and went deep on technical foundations while the Indian startup ecosystem contracted around me. This is the honest review: coursework, side projects, and the funding winter that forced clarity — without pretending I was running a venture-backed company.

## January-July: Pre-IIT and Strato Inc

Before August, I was still publishing through Strato Inc, sitting board exams, and navigating JEE outcomes. No incorporated companies. Just maintenance, learning, and the hallway between school and college.

## August-December: IIT Roorkee Begins

Joined B.S./M.S. Mathematics and Computing at IIT Roorkee in August 2024. Q3 and Q4 were coursework shock, automata theory, and learning to balance hostel life with the Play Store account I refused to abandon.

```mermaid
flowchart LR
    subgraph Research[&quot;Learning - 2024 Status&quot;]
        R1[RAG Production Patterns - Documented]
        R2[Continual Learning Reading - Ongoing]
        R3[ViT vs CNN Deployment - Applied]
        R4[Multimodal LLM Skepticism - Written]
    end

    subgraph Skills[&quot;Skills - 2024 Status&quot;]
        S1[IIT Mathematics - On Track]
        S2[ML Agent Reading - Side Interest]
    end

    subgraph Writing[&quot;Writing - 2024 Status&quot;]
        W1[Technical Essays - 12+ posts]
        W2[Founder Opinion - India-specific]
        W3[Personal IIT transition posts]
    end

    Research --&gt; Writing
    Skills --&gt; Research
```

Every box is something active and compounding. No fake &quot;shipped to enterprise&quot; labels.

## What Worked

**Writing as accountability.** Publishing forced clarity on what I actually believed vs what sounded good in conversation.

**Depth over breadth.** RAG failures, ViT deployment tradeoffs, and LLM hype skepticism are related skills, and going deep on them compounded. Jumping between them as if they were one company&apos;s roadmap did not, and I was not running a company anyway.

## What Failed

**Pretending parallel company narratives.** Early 2024 drafts sometimes sounded like I had a board and payroll. I did not. I had exams, then a campus ID card.

**Sleep.** December productivity was fine. August decision quality was not.

**Conference FOMO.** Every event I skipped, I did not miss anything important.

## Numbers I Will Share

- Blog posts published in 2024: enough to make December embarrassing by comparison
- IIT admission: August 2024
- New companies incorporated: 0

## Lessons for 2025

1. Ship companies when incorporated, not when theorized
2. College schedules are real constraints, not excuses
3. Kill side project features in November, not March after sunk cost rationalization
4. Keep human escalation paths in any high-stakes ML deployment you might build later

## Looking Forward

2025 was about AI economics shifting again (DeepSeek), side projects getting serious, and deciding which ideas become real commitments. Not a third narrative for the sake of looking busy.

Happy new year. Back to work Monday.</content:encoded></item><item><title>GNFA State Elimination: The Algorithm That Made Automata Theory Click</title><link>https://utso.stamped.work/blog/2024-12-10-iit-roorkee-automata-theory-gfna/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2024-12-10-iit-roorkee-automata-theory-gfna/</guid><description>State elimination on Generalized NFAs turned NFA-to-regex from black magic into something I could implement without crying in the IIT Roorkee library.</description><pubDate>Tue, 10 Dec 2024 00:00:00 GMT</pubDate><content:encoded>I joined IIT Roorkee in August 2024. By the end of my first semester, automata theory had a reputation: either you loved the elegance or you failed the midterm and questioned your life choices. I was in the second camp until GNFA state elimination clicked. Not clicked like &quot;I got a good grade.&quot; Clicked like &quot;I can derive the regex for this NFA on a whiteboard without praying to the textbook gods.&quot;

The course covered NFAs, DFAs, regular expressions, pumping lemmas, and the guilt-inducing feeling that everyone else seemed to see the isomorphism while you memorized constructions. State elimination on Generalized NFAs (GNFAs) was the bridge that made regex and automata feel like the same object wearing different hats.

## The Problem: NFA to Regular Expression

Given an NFA, produce an equivalent regular expression. The textbook Thompson construction goes regex to NFA. Going backward feels harder because NFAs have parallel paths, epsilon transitions, and the combinatorial explosion of naive state subset methods.

State elimination gives a direct, if tedious, path:

1. Convert NFA to a GNFA with structured start and end states
2. Repeatedly eliminate internal states
3. Read the regex label on the single edge from start to accept

The magic is not magic. It is accounting. Every elimination updates edge labels with regex algebra that captures all paths that routed through the removed state.

## What Is a GNFA?

A Generalized NFA allows each transition to be labeled with a regular expression, not just a single symbol. We assume:

- One start state with no incoming edges
- One accept state with no outgoing edges
- Every pair of states has an edge wherever the first two rules allow one, self-loops included (use the empty-set label ∅ if one is missing)

That last constraint sounds ugly. It makes elimination uniform. You never special-case missing transitions because &quot;missing&quot; is just the empty-set regex ∅, which is not the same thing as ε.

## The Elimination Rule

To eliminate state `k`, for every pair of remaining states `(i, j)`, including `i = j`, update the label on edge `(i, j)`:

```
R_ij = R_ij + R_ik · (R_kk)* · R_kj
```

Where `·` is concatenation, `+` is union, and `*` is Kleene star. Read it as: old paths from i to j, plus paths that go from i to k, loop zero or more times at k, then go from k to j.

If you have ever written a Floyd-Warshall loop and felt smug, this is the regex version with more symbols and fewer integers.

## Before and After One Elimination Step

```mermaid
flowchart LR
    subgraph before [Before Eliminating State k]
        direction LR
        S((s)) --&gt;|A| I((i))
        I --&gt;|B| K((k))
        K --&gt;|C| J((j))
        I --&gt;|D| J
        K --&gt;|E| K
        J --&gt;|F| T((t))
        I --&gt;|G| T
    end

    subgraph after [After Eliminating k]
        direction LR
        S2((s)) --&gt;|A| I2((i))
        I2 --&gt;|D + B·E*·C| J2((j))
        J2 --&gt;|F| T2((t))
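        I2 --&gt;|G| T2
    end
```

State `k` vanishes. Its looping behavior compresses into `E*`. Paths that used `k` as a relay fold into the updated label on `(i, j)`. The `i → t` edge stays `G`: the rule would add `B·E*·R_kt`, but `k` has no edge to `t`, so `R_kt` is the empty set and the whole term drops out.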

Repeat until only `s` and `t` remain. The label on `s → t` is your regex. Simplify with algebra if you enjoy pain.

## Elimination Order Matters for Human Sanity

The algorithm works for any elimination order of internal states. Different orders produce different regexes, all equivalent but some look like a cat walked on your keyboard.

Heuristics that saved me on exams:

- Eliminate states with few incident edges first when possible
- Eliminate sink-like states that only loop or exit early
- Leave high fan-in/fan-out hubs for later when labels are already compound

For course assignments, brute elimination order is fine if your simplification is disciplined. For humans, order is the difference between `(a+b)*` and a seventeen-parenthesis monster.
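
The heuristics above can be approximated with a greedy ordering. The sketch below scores each internal state by in-edges times out-edges and eliminates the cheapest first, recomputing after each step because elimination adds shortcut edges. That weight is one common choice and an assumption here, not something the course prescribed. Edges are bare `(src, dst)` pairs; labels are ignored.

```python
def elimination_order(edges, internal):
    """Greedy order: repeatedly remove the internal state with the smallest
    (in-edges x out-edges) product, self-loops excluded."""
    edges = set(edges)
    remaining = list(internal)
    order = []
    while remaining:
        def weight(k):
            ins = sum(1 for (i, j) in edges if j == k and i != k)
            outs = sum(1 for (i, j) in edges if i == k and j != k)
            return ins * outs

        k = min(remaining, key=weight)   # ties go to the earliest listed state
        preds = [i for (i, j) in edges if j == k and i != k]
        succs = [j for (i, j) in edges if i == k and j != k]
        # Removing k creates a shortcut from every predecessor to every successor.
        edges = {(i, j) for (i, j) in edges if i != k and j != k}
        edges |= {(i, j) for i in preds for j in succs}
        remaining.remove(k)
        order.append(k)
    return order


# Small example graph: s is the start, t the accept, and i is the hub.
graph = [("s", "i"), ("i", "k"), ("k", "j"), ("i", "j"),
         ("k", "k"), ("j", "t"), ("i", "t")]
order = elimination_order(graph, ["i", "k", "j"])
print(order)   # ['k', 'j', 'i']
```

The hub `i` goes last, which matches the advice to leave high fan-in/fan-out states until the end.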

## Worked Intuition: Epsilon Transitions

Real NFAs have epsilons. Convert them away first or fold them into GNFA labels during preprocessing. Epsilon closure on transitions becomes union of labels. I wasted a problem set trying to eliminate epsilons mid-algorithm. Do not be me.

Standard preprocessing:

1. Add fresh start and accept with epsilons as needed
2. Remove epsilon transitions by label composition
3. Ensure GNFA completeness with empty labels on missing edges

Then eliminate. The preprocessing is boring. The elimination is mechanical. The algebra simplification is where grades go to die.
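Steps 1 and 3 are small enough to sketch. This assumes step 2 is already done (the input NFA is epsilon-free), stores edges in a dict keyed by `(src, dst)`, uses `None` for the empty label, and writes the fresh epsilon edges as the string `eps`. All of those encodings, and the names `to_gnfa`, `START`, and `ACCEPT`, are my choices for illustration.

```python
EMPTY = None  # empty language on a missing edge (assumed encoding)


def to_gnfa(states, start, accepts, transitions):
    """Wrap an epsilon-free NFA into a complete GNFA edge dict.

    transitions: iterable of (src, symbol, dst) triples.
    Returns (edges, new_start, new_accept).
    """
    new_start, new_accept = "START", "ACCEPT"  # assumed not to clash with state names
    edges = {}
    for src, symbol, dst in transitions:
        # Parallel edges collapse into one label joined by union.
        if (src, dst) in edges:
            edges[(src, dst)] = edges[(src, dst)] + "+" + symbol
        else:
            edges[(src, dst)] = symbol
    # Step 1: fresh start and accept, wired in with epsilon labels.
    edges[(new_start, start)] = "eps"
    for state in accepts:
        edges[(state, new_accept)] = "eps"
    # Step 3: completeness, every missing pair gets the empty label.
    for src in [new_start, *states]:
        for dst in [*states, new_accept]:
            edges.setdefault((src, dst), EMPTY)
    return edges, new_start, new_accept


edges, s, t = to_gnfa(
    ["p", "q"], "p", ["q"],
    [("p", "a", "q"), ("p", "b", "q"), ("q", "a", "q")],
)
print(edges[("p", "q")])  # a+b
```

Nothing enters the new start and nothing leaves the new accept, which is what lets elimination stop cleanly at two states.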

## Implementation Sketch (What I Actually Coded)

For a competitive-programming helper script, I represented edge labels as strings and trusted a simplification pass... lightly. It produces a correct regex for small NFAs:

```python
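def eliminate(gnfa, k):
    # The self-loop on k is the same for every (i, j) pair, so star it once.
    loop = kleene_star(gnfa.edge(k, k))
    for i in gnfa.states:
        if i == k:
            continue
        for j in gnfa.states:
            if j == k:
                continue
            rik = gnfa.edge(i, k)  # label from i into k
            rkj = gnfa.edge(k, j)  # label from k out to j
            rij = gnfa.edge(i, j)  # existing direct label
            # New label: old direct paths, plus detours through k.
            gnfa.set_edge(i, j, union(rij, concat(rik, concat(loop, rkj))))
    # Edges touching k were only read, never rewritten, so k can go now.
    gnfa.remove_state(k)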
```

`union`, `concat`, and `kleene_star` should canonicalize: drop empty unions, flatten nested stars where safe, remove trivial epsilons. I did not solve regex minimization. I solved &quot;TA can read it.&quot;
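For anyone who wants something that runs end to end, here is a minimal self-contained version of the same step. Assumptions, all mine for illustration: the GNFA is a plain dict from `(src, dst)` to a label string, `None` is the empty language, the empty string is epsilon, and `+` is union to match the notation in this post. The helpers do only the cheapest canonicalization, nowhere near real simplification.

```python
EMPTY = None  # empty language: no path (assumed encoding)
EPS = ""      # the empty string


def wrap(r):
    # Parenthesize anything longer than a single symbol.
    return r if len(r) == 1 else "(" + r + ")"


def union(a, b):
    if a is EMPTY:
        return b
    if b is EMPTY or a == b:
        return a
    return a + "+" + b


def concat(a, b):
    if a is EMPTY or b is EMPTY:
        return EMPTY  # concatenating with no path gives no path
    if a == EPS:
        return b
    if b == EPS:
        return a
    left = wrap(a) if "+" in a else a
    right = wrap(b) if "+" in b else b
    return left + right


def kleene_star(r):
    if r is EMPTY or r == EPS:
        return EPS  # zero repetitions is all a missing loop allows
    return wrap(r) + "*"


def eliminate_state(edges, states, k):
    """Remove k from the edge dict, rerouting every path that went through it."""
    loop = kleene_star(edges.get((k, k), EMPTY))
    rest = [q for q in states if q != k]
    for i in rest:
        for j in rest:
            detour = concat(edges.get((i, k), EMPTY),
                            concat(loop, edges.get((k, j), EMPTY)))
            label = union(edges.get((i, j), EMPTY), detour)
            if label is not EMPTY:
                edges[(i, j)] = label
    for pair in [p for p in edges if k in p]:
        del edges[pair]
    return rest


edges = {
    ("s", "i"): "A", ("i", "k"): "B", ("k", "j"): "C", ("i", "j"): "D",
    ("k", "k"): "E", ("j", "t"): "F", ("i", "t"): "G",
}
remaining = eliminate_state(edges, ["s", "i", "k", "j", "t"], "k")
print(edges[("i", "j")])  # D+BE*C
```

Returning `EMPTY` from `concat` whenever either side is empty is what keeps dead detours from polluting labels: on this graph the `i` to `t` label stays `G`, since `k` has no edge to `t`.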

## Connection to Broader Theory

State elimination proves the NFA-to-regex half of the equivalence without appealing to minimal DFAs and partition refinement. Pedagogically, that matters. It is constructive in the direction students fear.

It also connects to:

- **Path expressions** in graph theory
- **Algebraic automata theory** where languages are solutions to matrix equations
- **Kleene&apos;s theorem** in a form you can touch

Once I saw elimination as &quot;Gaussian elimination but regex,&quot; the course stopped feeling like a bag of unrelated tricks.

## Exam War Stories

First semester automata exams loved to ask for regex from a diagram with eight states and one troll epsilon loop. My strategy:

1. Draw GNFA mentally, five minutes
2. Pick elimination order, write it down, stick to it
3. Eliminate one state per page for partial credit
4. Simplify only at the end unless intermediate labels exceed half a page

Partial credit saved my grade more than elegance. The algorithm rewards showing work.

## When Not to Use This

In real compilers and regex engines, you do not convert NFA to regex for fun. You build NFAs from regexes and simulate them. In the worst case, the expression that state elimination produces is exponentially larger than the automaton it came from. It is a theory tool, not a production pipeline.
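For contrast, the simulate direction is tiny. A minimal sketch, assuming an epsilon-free NFA stored as a dict from `(state, symbol)` to a set of next states; the function name and the example automaton are mine.

```python
def nfa_accepts(transitions, start, accepts, word):
    """Simulate an epsilon-free NFA by tracking the set of live states."""
    current = {start}
    for symbol in word:
        # Follow every possible move at once; dead branches just drop out.
        current = {
            dst
            for state in current
            for dst in transitions.get((state, symbol), ())
        }
    return bool(current.intersection(accepts))


# NFA for strings over {a, b} that end in b.
transitions = {("p", "a"): {"p"}, ("p", "b"): {"p", "q"}}
print(nfa_accepts(transitions, "p", {"q"}, "ab"))  # True
print(nfa_accepts(transitions, "p", {"q"}, "ba"))  # False
```

The work per input symbol is bounded by the number of states and transitions, with no regex blow-up anywhere, which is why engines go this way.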

But for learning, interviews, and the automata problem your friend swore was &quot;just theory,&quot; it is the algorithm that made the subject click.

## The Takeaway

GNFA state elimination is not pretty. It is honest. You watch paths consolidate into algebra until only start and accept remain. No hidden oracle, no pumping lemma hand-waving, just repeated application of one update rule.

Automata theory stopped being memorization for me when I could eliminate a state on a whiteboard and explain every term in the updated label. If you are stuck in the NFA-regex wilderness, learn this algorithm, pick your elimination order carefully, and carry a spare whiteboard marker. You will need it.

GNFA. State elimination. Regex. Finally the same language.</content:encoded></item><item><title>Six Weeks Into IIT Roorkee as Someone Who Can&apos;t Stop Building</title><link>https://utso.stamped.work/blog/2024-11-12-six-weeks-into-iit-roorkee-cant-stop-building/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2024-11-12-six-weeks-into-iit-roorkee-cant-stop-building/</guid><description>November 2024: First semester was a blur of classes, hostel life, and the discovery that college schedules do not respect side projects.</description><pubDate>Tue, 12 Nov 2024 00:00:00 GMT</pubDate><content:encoded>I arrived at IIT Roorkee in August 2024 for Mathematics and Computing. By November I had been on campus long enough to know where the mess hall was, which assignments actually mattered, and that **founder brain does not turn off because you have a timetable**.

This is a personal post about the first six weeks — not a guide to IIT, not a course review, just what it felt like to be a builder inside a degree program.

## August: Orientation Noise

Everything is new. Names, faces, clubs pitching you, seniors with opinions, Wi-Fi that works until it does not.

I did not introduce myself as &quot;founder of Strato Foods.&quot; That story is true and also exhausting at 18 in a new hostel. I said I build Android apps. That was enough.

The first surprise: **many people here also build.** The competitive energy is not only ranks. It is side projects in dorm rooms at 1 AM.

## September: The Schedule Hits

Mathematics and Computing is no joke. Problem sets are real. Sleep becomes negotiable.

Strato Inc did not pause. Users do not know you have a Linear Algebra deadline. I learned quickly:

**Batch your maintenance.** One evening a week for crashes and emails, not constant notification anxiety.

**Say no to new app ideas.** November me had a graveyard of &quot;great concepts&quot; from September that wisely never shipped.

**Use college APIs for learning, not company formation.** Courses give structure. Side projects give joy. Confusing the two makes both worse.

```mermaid
flowchart LR
    subgraph campus [IIT Roorkee Nov 2024]
        C[Classes]
        H[Hostel life]
        B[Strato Inc maintenance]
    end

    C --&gt;|priority most days| Grades
    B --&gt;|scheduled slots| Users
    H --&gt;|everything else| Sanity
```

## October: Identity Recalibration

I was no longer &quot;the kid who ran local delivery.&quot; I was one of many smart people in a lecture hall. Humbling. Necessary.

What transferred:
- Comfort with debugging under time pressure
- Less fear of public failure (Strato Foods shutdown inoculated me a bit)
- Ability to learn a new API fast when an assignment required it

What did not transfer automatically:
- Grades. Building skills ≠ exam skills without practice.
- Social capital from hometown founder story. New campus, new ledger.

## What I Was Building

Not a startup. Maintenance on existing Play Store apps. Small experiments — some tied to coursework curiosity, some pure procrastination.

The ML discourse on campus was louder than my Telegram bot era. People discussed transformers like cricket scores. I listened more than I talked. Six weeks is not long enough to pretend you are the expert in the room.

## What Surprised Me

**How much I liked being a student again.** Not performing founder updates. Just learning with a cohort.

**How guilty maintenance made me feel.** Should I be starting something new? Should I focus only on academics? The answer in November was: **keep users unbroken, keep grades acceptable, sleep sometimes.**

**How little anyone cared about my Play Store download counts.** Refreshing. Freeing.

## What I Did Not Know Yet

I did not know what I would build in 2026. November 2024 me was not sketching company names. I was trying to pass quizzes and not ruin apps I shipped years ago.

That ignorance was correct. You cannot schedule insight.

## Takeaway

College does not replace builder identity — it **compresses** it into fewer hours. The skill is choosing which hours matter.

Six weeks in, my formula was ugly but workable: classes first, scheduled maintenance second, new ambitions third — deferred until they stop being fantasies and start being commitments.

Roorkee winter was coming. Assignments would get harder. The Play Store would still send crash emails. I would still open Android Studio after midnight sometimes.

Not because I had to prove something. Because building is how I think. Campus just made me do it on a budget of time I had never respected before.

That was November. Still early. Still learning the balance.</content:encoded></item><item><title>Multimodal LLMs Are Not Computer Vision Models</title><link>https://utso.stamped.work/blog/2024-09-18-multimodal-llms-computer-vision/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2024-09-18-multimodal-llms-computer-vision/</guid><description>GPT-4V can describe your lunch. It cannot replace a YOLO pipeline or a forensic detector you have not built yet. Stop conflating multimodal chat with computer vision.</description><pubDate>Wed, 18 Sep 2024 00:00:00 GMT</pubDate><content:encoded>Every week in late 2024 someone in a campus group chat shares a GPT-4V demo and asks if we can &quot;just use this instead of training a model.&quot; Every week I try to explain why that is the wrong comparison.

Multimodal LLMs are language models that can see. They are not computer vision models. The distinction matters when your use case requires pixel-level fidelity, repeatable outputs, or anything you would grade in a lab assignment without hand-waving.

I am a Mathematics and Computing student at IIT Roorkee who has shipped Android apps and trained small CV models on a laptop. Here is where multimodal LLMs fit and where they do not — from that seat, not from a product I have not built yet.

## What Multimodal LLMs Actually Do

Models like GPT-4V, Claude with vision, Gemini Pro Vision: they encode images into tokens, fuse with text context, and generate language outputs. The output is always text (or structured text). The internal representation is optimized for semantic understanding and conversational coherence, not geometric precision.

They excel at:

- Image captioning and description
- Visual question answering (&quot;how many people are in this photo?&quot;)
- Document understanding (OCR-ish extraction from clean scans)
- Rough object identification in unconstrained photos
- Multimodal reasoning that combines image context with world knowledge

They fail at:

- Pixel-accurate localization without specialized tooling
- Detecting subtle manipulations (ELA-level, copy-move, splicing artifacts)
- Consistent results across near-duplicate inputs
- Calibrated probability outputs for high-stakes workflows
- Real-time inference at scale on high-resolution images

## What Specialized CV Pipelines Need

Production computer vision — the kind I practiced with YOLO on a laptop, the kind assignment rubrics demand — needs:

1. **Repeatability:** Same input, same output, every time
2. **Explainability:** Bounding boxes and scores, not vibes
3. **Calibration:** You can measure false positives on a held-out set
4. **Version pinning:** Frozen weights for reproducibility
5. **Latency budgets:** Especially on mobile

Multimodal LLMs fail criteria 1, 3, and 5 out of the box. They partially address 2 with plausible-sounding prose that is not evidence. Hosted models also fail 4 whenever a provider update changes their outputs. They are not designed for adversarial robustness.
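Criterion 3 is the easiest one to make concrete. A minimal sketch, assuming the detector emits one confidence score per image and the held-out set carries binary labels; the function name, the threshold, and the sample numbers are made up for illustration.

```python
def false_positive_rate(scores, labels, threshold):
    """Fraction of clean images the detector flags at this threshold.

    scores: detector confidences in [0, 1].
    labels: 1 = manipulated, 0 = clean (ground truth from the held-out set).
    """
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not negatives:
        raise ValueError("held-out set has no clean examples")
    flagged = sum(1 for s in negatives if s >= threshold)
    return flagged / len(negatives)


scores = [0.9, 0.2, 0.6, 0.1, 0.7]
labels = [1, 0, 0, 0, 1]
print(round(false_positive_rate(scores, labels, 0.5), 4))  # 0.3333
```

You can only write this function when the model hands you a score. Prose output gives you nothing to threshold.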

## Task Suitability Matrix

```mermaid
quadrantChart
    title LLM Vision vs Specialized CV Models
    x-axis Low Precision Required --&gt; High Precision Required
    y-axis Semantic Task --&gt; Geometric/Forensic Task
    quadrant-1 Use Specialized CV
    quadrant-2 Hybrid Pipeline
    quadrant-3 LLM Sufficient
    quadrant-4 LLM Insufficient

    Image Captioning: [0.2, 0.15]
    Document QA: [0.35, 0.25]
    Object Counting: [0.45, 0.3]
    Damage Assessment: [0.65, 0.55]
    Tamper Detection: [0.85, 0.85]
    Copy-Move Forgery: [0.9, 0.9]
    Biometric Matching: [0.95, 0.7]
    Scene Understanding: [0.3, 0.2]
    Medical Imaging: [0.95, 0.95]
```

Upper-right quadrant: do not use multimodal LLMs as primary detectors. Lower-left: LLMs are fine, maybe overkill.

## The Hybrid Pattern That Actually Makes Sense

Use LLMs where language is the output and specialized CV where pixels are the evidence:

1. **CV model** runs detection, segmentation, or feature extraction
2. **Structured metadata** captures scores, regions, model version
3. **LLM** generates human-readable summaries from structured inputs, not from raw pixels alone

The LLM should not be the only thing between a user and a consequential decision. It can explain what the CV stack found.

This is architecture homework, not a pitch deck.
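The three steps can be sketched without touching any real model. Everything here is hypothetical scaffolding: the `Detection` fields, the `build_summary_prompt` name, and the prompt wording are my assumptions, and the actual LLM call is left out on purpose.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Detection:
    label: str
    score: float
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


def build_summary_prompt(detections, model_version):
    """Serialize CV findings so the LLM explains them instead of guessing from pixels."""
    payload = {
        "model_version": model_version,
        "detections": [asdict(d) for d in detections],
    }
    return (
        "Summarize these detector findings for a non-technical reader. "
        "Do not add findings that are not listed.\n"
        + json.dumps(payload, indent=2)
    )


prompt = build_summary_prompt(
    [Detection("splice", 0.91, (10, 20, 110, 220))],
    "detector-v1",
)
print(prompt)
```

The point of the design is that scores, regions, and the model version exist as data before any language model sees them, so the summary can be audited against its inputs.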

## Specific GPT-4V Failure Modes I Hit

**Confident hallucination on ambiguous regions.** Compression artifacts described as &quot;signs of digital manipulation.&quot; Sometimes true. Often not. No confidence score.

**Inconsistent across crops.** Same region, different crop padding, different verbal assessment.

**Resolution limits.** Downscale a large photo to model input size, lose high-frequency detail, get a clean bill of health.

**No frozen behavior.** Model updates change outputs. Fine for chat. Bad for anything you need to reproduce in a report.

## A Small Experiment, Honestly

I ran GPT-4V on a few dozen images from public manipulation datasets alongside a simple OpenCV baseline I trusted from coursework. The LLM&apos;s verbal explanations sounded plausible on most images. Agreement with the baseline on localization: poor.

That gap is the lesson. Impressive language is not impressive geometry.
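If you want a number for "agreement on localization," intersection over union is one common way to get it. This is a generic sketch, not necessarily the scoring I used in that experiment, and it assumes both systems can be reduced to axis-aligned boxes.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Overlap width and height clamp to zero when the boxes do not touch.
    inter_w = max(0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union_area = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union_area if union_area else 0.0


print(round(iou((0, 0, 10, 10), (5, 5, 15, 15)), 4))  # 0.1429
```

A verbal answer like "the tampering is in the upper left" has to be turned into a box before it can be scored at all, which is part of the gap.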

## When Students Should Use Multimodal LLMs

- Explaining CV outputs to non-technical teammates
- Triage: &quot;is this worth running through the expensive pipeline?&quot;
- Document extraction from heterogeneous forms where OCR + LLM beats pure OCR
- Rapid prototyping to validate whether anyone wants a feature

## When They Should Not

- Primary tamper detection
- Anything legally or financially consequential without human review
- Replacing a trained detector because the demo video looked cool
- Anything where false positive costs exceed API costs by orders of magnitude

## Research Direction

The gap may narrow. Vision-language models with grounding improve localization. Fine-tuned specialist models on manipulation datasets outperform generalist LLMs. The trend is toward ensembles, not replacement.

Betting your grade — or your company — on &quot;GPT-N will solve vision&quot; is betting against every CV assignment rubric that asks for numbers.

Multimodal LLMs are an interface layer and a reasoning layer. They are not a retina. Build your project accordingly.

September 2024 me was learning that in lecture halls and group chats. The demos will keep coming. The distinction stays the same.</content:encoded></item><item><title>Llama 3.1 vs Closed APIs: One Month Into IIT With a Real GPU</title><link>https://utso.stamped.work/blog/2024-08-25-llama-3-1-open-models-vs-closed-apis/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2024-08-25-llama-3-1-open-models-vs-closed-apis/</guid><description>August 2024: Meta shipped Llama 3.1 in July. I had just started IIT Roorkee with an HP Omen and an RTX 4050. Here is what open weights could and could not do.</description><pubDate>Sun, 25 Aug 2024 00:00:00 GMT</pubDate><content:encoded>**Llama 3.1** dropped on **July 23, 2024** — 8B, 70B, and a 405B flagship — with **128k context** on the smaller tiers and a fresh round of &quot;open weights will kill OpenAI&quot; posts on ML Twitter.

I read those posts from a hostel room at **IIT Roorkee**. Orientation was loud. Wi-Fi was worse. And for the first time in my student life I had a laptop that could run models locally instead of only staring at API dashboards: an **HP Omen** with a **Ryzen 7000-series** CPU and an **NVIDIA RTX 4050 (6 GB VRAM)**.

This is not a benchmark paper. It is what **open models vs closed APIs** looked like in **late August 2024** on that hardware — no hindsight from 2025 pricing wars.

## What Llama 3.1 Actually Brought (July 2024)

Meta&apos;s pitch, stripped of keynote adjectives:

| Model | Role in the stack (Aug 2024) |
|-------|------------------------------|
| **8B** | Laptop-friendly after quantization |
| **70B** | Serious quality, serious compute |
| **405B** | Flagship narrative, datacenter territory |

The jump that mattered for builders on the 8B/70B path: **128k context windows** on those models, up from the 8k-class limits on Llama 3 that made long-document experiments painful.

Also: a public conversation about **license terms** and what &quot;open&quot; means when weights are downloadable but the license has guardrails. Lawyers eat well; developers skim the FAQ.

```mermaid
flowchart TB
    subgraph local [My RTX 4050 6GB - Aug 2024]
        Q4[8B Q4 / Q5 quant]
        Slow[70B - mostly not on GPU alone]
    end

    subgraph cloud [Closed APIs - Aug 2024]
        G4o[GPT-4o API]
        Other[Anthropic / Google APIs]
    end

    Task[Side project task] --&gt; local
    Task --&gt; cloud
    local --&gt;|fits VRAM| OK[Private, free-ish, slower]
    cloud --&gt;|fits budget| OK2[Better reasoning, costs money]
```

## What Fit on a 4050 6 GB

Be honest about 6 GB VRAM in August 2024:

**Llama 3.1 8B (quantized)** — yes. **Q4_K_M** and friends via llama.cpp / Ollama / similar tools. Not blazing. Not datacenter. But **you can iterate at 2 AM without API keys or hostel Wi-Fi drama**.

**13B-class models** — tight. Possible with aggressive quant and patience. Not my default.

**70B** — not on 6 GB VRAM alone in any useful way. CPU offload on a Ryzen 7000 chip can technically run it. &quot;Technically&quot; and &quot;I will do this twice&quot; are different sentences.

**405B** — a meme on my desk. Download size alone is a lifestyle choice.

The Omen&apos;s **CPU** mattered: offloading layers when VRAM ran out, preprocessing, compiling tooling. The GPU mattered more: finally a student machine where &quot;run a small model locally&quot; is a Tuesday, not a fantasy.
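The VRAM math behind this section is mostly one multiplication. A rough sketch, assuming about 4.5 bits per weight for a 4-bit quant plus its scaling overhead (my round number, not a measured figure) and nominal parameter counts. It ignores the KV cache, activations, and anything else using the GPU, so real headroom is smaller.

```python
def weight_gib(params_billion, bits_per_weight):
    """Approximate size of the quantized weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# 4.5 bits per weight is an assumed average, not a spec.
for name, size in [("8B", 8), ("70B", 70), ("405B", 405)]:
    print(f"{name}: about {weight_gib(size, 4.5):.1f} GiB of weights")
```

Under those assumptions 8B lands a little over 4 GiB and 70B lands near 37 GiB, which is the whole story of why one fits on a 6 GB card and the other needs CPU offload.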

## Closed APIs in the Same Week

**GPT-4o** (from May 2024) was still my quality bar for hard prompts: multi-step reasoning, messy instructions, &quot;fix this stack trace&quot; tasks.

**Strengths of closed APIs in August 2024:**

- Better out-of-the-box reasoning on hard tasks
- No VRAM math
- Fast iteration when internet works
- Tooling ecosystems (function calling, structured outputs) more mature than local stacks

**Weaknesses:**

- Cost accumulates on a student budget
- Privacy: you are uploading problem sets, code, screenshots
- Dependency: rate limits, policy changes, outages
- Latency + connectivity in a hostel

## When I Reached for Which (August 2024)

**Local Llama 3.1 8B** when:

- Iterating on prompts I did not want logged in the cloud
- Offline or flaky network
- Batch experiments (generate fifty variants, pick one)
- Learning how tokenization and context actually behave

**GPT-4o API** when:

- Quality threshold mattered more than cost
- Vision + text in one call for a screenshot workflow
- Deadlines (yes, IIT had already started proving that)

**Neither** when:

- I should have been sleeping

## Open vs Closed Is Not Religion

The August 2024 discourse was tedious: open weights warriors vs API maximalists.

Practical solo-dev truth on one laptop:

1. **Open weights won privacy and marginal cost** for small models you can actually run.
2. **Closed APIs won ceiling quality** for the hard 10% of tasks.
3. **128k context on 8B** changed local use cases — paste a long PDF chunk, ask questions — but did not delete the need for evals. Long context ≠ correct answers.
4. **405B existing** mattered narratively more than practically for students. It moved the Overton window. I still could not run it.

## Mistakes I Made in Week One

**Assuming quant 8B equals GPT-4o.** It does not on reasoning-heavy tasks. Obvious in hindsight. Embarrassing on the first assignment-adjacent experiment.

**Ignoring RAM and thermals.** Gaming laptops throttle. My Omen fans sounded like a small aircraft. Plan for sustained load, not a five-minute demo.

**Downloading everything.** Disk is finite. You do not need all quantizations.

**Skipping evals because it is &quot;local.&quot;** Local models hallucinate with confidence too. Free inference is not free technical debt.

## What I Did Not Know Yet (And Will Not Pretend I Did)

I did not know which model family would win 2025 economics. I did not know every campus policy on local LLM use. I had not started any company in 2026. August 2024 me was a first-year student with a GPU, a Supabase side note from last summer, and a GPT-4o tab — trying to learn without outsourcing all thinking to either cloud.

## Takeaway

**Llama 3.1 in August 2024** did not kill closed APIs on a 6 GB laptop. It **split the workflow**: open 8B for volume and privacy, GPT-4o for quality and multimodal convenience.

If you are a student buying your first &quot;ML-capable&quot; machine: **6 GB VRAM is a real constraint, not an insult.** It is enough to learn, prototype, and run 8B-class models if you accept quantization and patience. It is not enough to cosplay as a datacenter.

The interesting year was not &quot;open vs closed.&quot; It was **both**: weights on disk, APIs in the tab, and the discipline to know which tool earns the task.

My Omen earned its price in the first month — not because it ran 405B, but because it finally let me **touch** the stack instead of only reading release notes.</content:encoded></item><item><title>GPT-4o: The Spring Update That Made Multimodal Feel Product-Ready</title><link>https://utso.stamped.work/blog/2024-05-22-gpt-4o-spring-update-multimodal-first-impressions/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2024-05-22-gpt-4o-spring-update-multimodal-first-impressions/</guid><description>May 2024: OpenAI&apos;s Spring Update dropped GPT-4o. I watched the demo, ran the API, and wrote down one bet about voice that felt crazy at the time.</description><pubDate>Wed, 22 May 2024 00:00:00 GMT</pubDate><content:encoded>OpenAI&apos;s **Spring Update** landed on **May 13, 2024** with **GPT-4o** — &quot;o&quot; for omni — and for a week every group chat looked the same: screen recordings of a model that could see, hear, and talk back with latency that did not feel like 2022&apos;s &quot;wait fifteen seconds for a paragraph.&quot;

I watched it between JEE aftermath and packing for IIT Roorkee. This post is what GPT-4o actually changed for a builder who had been on the API since GPT-3 — and one prediction I wrote down on **May 22** that I still have screenshots of.

## What They Announced (As of Mid-May 2024)

**GPT-4o** is a single model line pitched as native multimodal: text, vision, and audio in one stack, not bolt-on modules duct-taped in a rush.

The claims that mattered to developers:

- **Faster** than GPT-4 Turbo on many tasks
- **Cheaper** API pricing than GPT-4 Turbo (roughly half on input/output at launch — check your dashboard, not my memory)
- **Vision in the same model** you already call for chat completions
- A **desktop demo** of real-time voice conversation that felt like a product, not a lab bench

What was **not** fully in my hands on day one: the polished real-time voice experience from the keynote in a clean public API with the same latency as the demo. The API was rolling out in tiers. The demo was ahead of the SDK. That gap is important.

```mermaid
flowchart TB
    subgraph keynote [What the demo showed - May 13]
        V[Live camera input]
        A[Spoken responses]
        L[Low latency turn-taking]
    end

    subgraph api [What builders had - May 22]
        T[Text + vision API]
        P[Streaming completions]
        W[Voice UX mostly still coming]
    end

    keynote --&gt; Gap[Demo ahead of API]
    api --&gt; Ship[Still shippable for many apps]
```

## What Changed From GPT-4 Turbo in Practice

I had a small harness from older posts: prompt templates, token logging, a few Flutter-adjacent experiments that send images + questions to the API.

**Vision without the &quot;vision model&quot; tax dance.** Before GPT-4o, multimodal often felt like model shopping: this endpoint for text, that endpoint for images, merge outputs in your code. GPT-4o pushed toward **one call, one thread, image bytes in the message**. That is an architecture simplification, not a party trick.

**Latency matters for mobile.** Non-streaming completions already felt rude on 4G. GPT-4o was noticeably snappier in my informal tests on a fixed set of twenty prompts. Not scientific, but enough to update my default model string.

**Price changes behavior.** When input tokens get cheaper, you stop hoarding context like a miser. You send the screenshot. You send the error log. You stop building elaborate pre-filters because every byte hurts.

**The demo is a UX category.** The Scarlett-Johansson-adjacent voice banter was memed instantly. Under the memes: OpenAI showed that **turn-taking latency** is now a competitive axis, not just benchmark scores.

## What Did Not Change

**Hallucinations.** Faster wrong answers are still wrong.

**Eval discipline.** If you did not have a test set in April, GPT-4o does not give you one in May.

**Privacy.** You are still sending user photos to a third-party API. Consent and retention policies did not magically become simple.

**Offline.** None of this runs on-device for builders yet. The stack is still cloud-first.

## The Builder Use Cases That Got Better Immediately

**&quot;What is wrong with this screen?&quot;** Upload a screenshot of a Flutter error or a broken layout. One multimodal call. This replaced a workflow of OCR + guesswork.

**Document-ish photos.** Messy photos of forms, receipts, whiteboards — not forensic-grade, but good enough for triage and student-side automation.

**Rapid product validation.** Cheaper + faster means you can afford to put multimodal behind a beta button without CFO cosplay.

## What I Thought Was Theater (Fairly)

The emotional voice persona is a product marketing asset. Enterprises will disable it. Teenagers will love it. Both reactions can be true.

&quot;Solved AGI&quot; tweets were theater. Ignore them.

## The Prediction I Wrote on May 22, 2024

Here is the sentence I saved in my notes that day:

&gt; **OpenAI will ship a consumer voice mode in ChatGPT that feels close to the keynote demo — not perfect, but close enough that people stop calling it &quot;coming soon&quot; — before the end of summer 2024 in the US rollout, and developers will get a narrower API version shortly after.**

Why it felt bold then:

- The demo was clearly polished routing, not something I could reproduce in Postman on May 14.
- Voice products historically ship quarters late.
- OpenAI had every incentive to move fast after Google I/O noise the same month.

Why I believed it anyway:

- The latency step-change was real even if hidden behind demo magic.
- They unified audio in the **same model brand** (4o), which signals product commitment, not a research slide.
- Consumer voice is a retention moat; they would not leave it in the keynote only.

I did not know the name **Advanced Voice Mode** yet. I did not know the exact July rollout date. I was guessing from incentive structure, not insider knowledge.

If you are reading this after summer 2024: check whether I got lucky or whether &quot;demo → product in ~10 weeks&quot; was the obvious cadence. I will take either verdict.

## What I Would Do With GPT-4o in May 2024

**Default new experiments to 4o** for text + image tasks.

**Keep a regression harness.** Same prompts as GPT-4 Turbo week. Diff outputs. Count hallucination style changes.
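A regression harness does not need to be fancy. A minimal sketch, assuming each model is wrapped as a plain function from prompt string to output string; the stubs below stand in for real API clients, and the 0.8 similarity cutoff is an arbitrary starting point.

```python
import difflib


def regression_report(prompts, old_model, new_model, min_ratio=0.8):
    """Run the same prompts through two callables and flag outputs that drift.

    old_model / new_model: functions taking a prompt string and returning text.
    """
    flagged = []
    for prompt in prompts:
        old_out, new_out = old_model(prompt), new_model(prompt)
        ratio = difflib.SequenceMatcher(None, old_out, new_out).ratio()
        if min_ratio > ratio:
            flagged.append((prompt, round(ratio, 2)))
    return flagged


# Stub models for illustration only; swap in real clients yourself.
def old_stub(prompt):
    return prompt.upper()


def new_stub(prompt):
    return prompt.upper() if "a" in prompt else "totally different"


print(regression_report(["abc", "xyz"], old_stub, new_stub))  # [('xyz', 0.0)]
```

Character-level similarity is a blunt instrument. It tells you which prompts to reread, not whether the new answer is better.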

**Do not rebuild your app around voice yet** unless you enjoy building on preview-tier APIs.

**Log tokens.** Cheaper does not mean free. JEE-era me still had a budget.
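The logging itself is one line of arithmetic per call. A sketch with placeholder prices: the per-million-token numbers below are illustrative inputs, so read the real ones off your dashboard.

```python
def call_cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost of one API call given per-million-token prices in USD."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Placeholder prices: 5.0 and 15.0 USD per million tokens.
print(round(call_cost_usd(2000, 500, 5.0, 15.0), 4))  # 0.0175
```

Multiply that by a few hundred calls in a debugging session and the student budget becomes visible.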

## Takeaway

GPT-4o in May 2024 was the first moment multimodal felt like **infrastructure** — one endpoint, acceptable latency, pricing that lets students iterate — even if the **Her demo** was ahead of what mere mortals could ship that week.

The model war was not about parameter counts anymore. It was about **modalities per dollar** and **seconds per turn**.

I was about to start IIT with a new laptop and a shorter attention span for bad APIs. GPT-4o made the API tab worth keeping open. The voice bet was my one arrogant sentence in the notebook. We will see if May-me was guessing or seeing clearly.</content:encoded></item><item><title>The Indian Startup Funding Crunch: An Honest Take From the Trenches</title><link>https://utso.stamped.work/blog/2024-04-10-indian-startup-funding-crunch-honest-take/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2024-04-10-indian-startup-funding-crunch-honest-take/</guid><description>2021 money is gone. Down rounds are not shameful anymore. Here is what survival actually looks like when the term sheets stop coming.</description><pubDate>Wed, 10 Apr 2024 00:00:00 GMT</pubDate><content:encoded>I raised in 2021. I did not raise in 2023. That sentence contains more education than most accelerator curricula.

The Indian startup funding crunch is not a news cycle. It is a structural reset that separates founders who built businesses from founders who built fundraising machines. I have friends on both sides. Some are fine. Some are ghosting their cap tables.

## What Actually Changed

2021 was a liquidity hallucination. Tiger Global moved fast. Every sector had an &quot;X of India&quot; pitch. Revenue multiples stopped mattering when growth rate was the only variable. Indian founders who had been told to copy the Silicon Valley playbook copied the fundraising part and skipped the unit economics part.

2022 started the correction globally. 2023-2024 hammered India specifically as global LPs reallocated, Chinese capital retreated from certain sectors, and public market comps for tech destroyed private market markups. Seed got tighter. Series A became a graveyard for companies with pretty decks and 40% month-over-month user growth and negative contribution margins.

```mermaid
flowchart TB
    subgraph Y2021[&quot;2021 Environment&quot;]
        A1[Abundant Capital]
        A2[Growth &gt; Unit Economics]
        A3[12-18 Month Raise Cycles]
        A4[Logo Hiring]
        A5[High Burn Accepted]
    end

    subgraph Y2024[&quot;2024 Environment&quot;]
        B1[Selective Capital]
        B2[Path to Profitability Required]
        B3[Extend Runway or Die]
        B4[Hire for Output]
        B5[Burn Scrutiny Every Board Meeting]
    end

    A1 -.-&gt;|Capital dried up| B1
    A2 -.-&gt;|Investors learned| B2
    A3 -.-&gt;|Market closed| B3
    A4 -.-&gt;|Layoffs| B4
    A5 -.-&gt;|Down rounds / shutdowns| B5

    style A1 fill:#2d5016,color:#fff
    style B1 fill:#8b0000,color:#fff
    style B5 fill:#8b0000,color:#fff
```

The diagram is simplified. Reality is messier. But the directional shift is correct and permanent for this cycle.

## Down Rounds Are Not Moral Failures

I know founders who took down rounds and kept building. I know founders who refused down rounds on principle and shut down with six months runway left because ego outranked arithmetic.

A down round hurts. It wipes prior returns for early employees. It signals market repricing. It also keeps the company alive. In 2024, alive is an achievement.

Bridge rounds with heavy structure (liquidation preferences, ratchets, full pay-to-play) are down rounds wearing a disguise. Read the term sheet. If your lawyer says &quot;this is standard,&quot; get a second lawyer.

## Survival Strategies That Work

**Revenue now, not later.** Every pilot must have a paid path. &quot;Strategic partnership&quot; without contract value is a hobby. We moved enterprise conversations from &quot;innovation budget&quot; to &quot;operational budget&quot; by tying ROI to measurable cost savings. Harder sell. Closes faster.

**Cut burn before you have to.** Founders who cut at 18 months runway look decisive. Founders who cut at 4 months look desperate. Same cuts. Different narrative. Different employee trust.

**Ignore vanity metrics in board updates.** Monthly active users without retention curves is noise. CAC without LTV is noise. Show revenue, gross margin, net revenue retention if you have it, and cash zero date under conservative assumptions.
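The cash zero date is simple arithmetic, which is exactly why it belongs in every update. A minimal sketch, assuming burn and revenue stay flat from today (the conservative case, since it books no growth); the function name and the sample figures are made up.

```python
from datetime import date


def cash_zero_date(cash, monthly_burn, monthly_revenue, today):
    """First day of the month in which cash runs out, or None if revenue covers burn."""
    net_burn = monthly_burn - monthly_revenue
    if not net_burn > 0:
        return None  # no cash-zero date while revenue covers burn
    months = int(cash // net_burn)  # whole months of runway left
    total = today.year * 12 + (today.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


print(cash_zero_date(12_000_000, 1_500_000, 500_000, date(2024, 4, 10)))  # 2025-04-01
```

Rerun it with the revenue line cut in half before the board does it for you.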

**Extend without poisoning the cap table.** Revenue-based financing, venture debt (if you have revenue to support it), government grants, customer prepayments for annual contracts. Not glamorous. Keeps you alive.

**Kill the second product.** You are not a portfolio company. One wedge. One ICP. One GTM motion until revenue covers burn.

## What Investors Say vs What They Mean

&quot;We love the team, come back when you have more traction&quot; means no.

&quot;We are watching the space&quot; means no.

&quot;We would lead if you find a co-lead&quot; means no unless you actually have a co-lead lined up.

&quot;We are doing internal portfolio management&quot; means they are not deploying this quarter.

Learn to hear no without requiring a follow-up meeting that wastes three weeks.

## India-Specific Dynamics

Domestic VC funds have dry powder but deployment is cautious. Corporate venture arms retrenched. Family offices still write checks but want revenue and often want governance seats that complicate cap tables.

Tier 2 city startups with lower burn have an advantage over Bangalore burn-rate clones. Remote-first ops teams cost less than Indiranagar offices with foosball nobody plays.

Regulatory-heavy sectors (fintech, health, regulated B2B) face longer sales cycles. Factor that into runway math. A 24-month enterprise sales cycle with 12 months of cash is not a company. It is a countdown.
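
The runway math is simple enough to sketch. This is a minimal illustration assuming a flat monthly net burn; the function names, the six-month buffer, and every figure are hypothetical, not numbers from a real company.

```python
def runway_months(cash: float, monthly_net_burn: float) -> float:
    """Months until cash hits zero at a flat net burn (spend minus revenue)."""
    if not monthly_net_burn > 0:
        return float("inf")  # break-even or profitable: no cash zero date
    return cash / monthly_net_burn


def survives_sales_cycle(cash, monthly_net_burn, sales_cycle_months, buffer_months=6):
    """True only if runway covers the whole sales cycle plus a safety buffer.

    buffer_months is an assumed cushion for slipped deals, not a standard.
    """
    needed = sales_cycle_months + buffer_months
    return runway_months(cash, monthly_net_burn) >= needed


# 12 months of cash against a 24-month enterprise sales cycle.
print(runway_months(12_000_000, 1_000_000))             # 12.0
print(survives_sales_cycle(12_000_000, 1_000_000, 24))  # False
```

Running the same check with the burn you would have after cuts shows how many months a cut actually buys.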

## The Psychological Cost

Funding crunch is not just financial. It is identity. Founders who raised big rounds built public personas around being &quot;funded founders.&quot; When the next round does not come, the identity crisis is real.

Talk to other founders honestly. Not LinkedIn honestly. Actual honestly. The ones doing well are rarely posting about it. The ones struggling are rarely posting at all.

## My Position

I stopped optimizing for the next raise and started optimizing for customers who pay. That sounds like a tweet. It is a daily discipline that means saying no to press, no to conferences, no to &quot;brand partnerships&quot; that consume founder time for logo placement.

The funding winter will end. It always does. The founders who survive with revenue and sane cap tables will raise on better terms than the ones who limp through on bridges and hope.

If you are in the crunch right now: cut burn this week, call ten customers tomorrow, and stop refreshing Tracxn for comparables that no longer exist. The market does not care about your 2021 valuation. It cares whether you are still here in 2025.</content:encoded></item><item><title>Spring Before College: What Class 12 Felt Like From the Inside</title><link>https://utso.stamped.work/blog/2024-03-20-spring-before-college-class-12/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2024-03-20-spring-before-college-class-12/</guid><description>March 2024: Boards were winding down. JEE results were weeks away. I was not a college student yet — just someone who had been building for years and did not know what came next.</description><pubDate>Wed, 20 Mar 2024 00:00:00 GMT</pubDate><content:encoded>March 2024 is an awkward month to write about. You are not in the climax of exams anymore. You are not in college yet. You are in the hallway between outcomes, refreshing portals you pretend you are not refreshing.

This is a personal post about that hallway — as someone who had already founded and shut down a food delivery startup, shipped apps under Strato Inc, and still had to wait like everyone else.

## The Waiting Room

JEE Main results had started arriving for some rounds. Advanced was ahead. IIT admission was a summer problem, not a March certainty.

Meanwhile:
- Strato Inc apps still needed occasional fixes
- Friends discussed branches and cutoffs like sports scores
- LinkedIn suggested I should &quot;announce something&quot; because silence looks like failure

I announced nothing. March is not a good month for announcements when your future address is still unknown.

## What Strato Inc Meant in March

Not a venture-backed company. A publisher account, a portfolio of small Android projects, and the muscle memory of having run something real called Strato Foods.

That history changed how I heard advice:

**&quot;Start young&quot;** — I had. It did not make waiting easier.

**&quot;Founders don&apos;t need degrees&quot;** — maybe. I still wanted the degree. Roorkee was the plan if the numbers cooperated.

**&quot;Build a personal brand&quot;** — I had a Play Store presence. That was enough brand for March me.

```mermaid
flowchart TD
    A[March 2024] --&gt; B{Outcomes pending}
    B --&gt; C[JEE / IIT hope]
    B --&gt; D[Strato Inc maintenance]
    B --&gt; E[Identity: ex-founder + student]
    C --&gt; F[Summer decision]
    D --&gt; G[Skills stay warm]
    E --&gt; H[Patience practice]
```

## What I Was Reading

Less startup Twitter, more syllabus leftovers and the occasional ML paper when guilt allowed. GPT-4 and DevDay were months in the rearview. The vibe shift was from &quot;wow, demos&quot; to &quot;okay, what is actually useful for a solo builder.&quot;

I was also reading admission forums — the universal humiliation of teenagers comparing ranks. No post makes that dignified.

## Honest Feelings

**Impatience.** I had shipped products. Sitting still felt unnatural.

**Relief.** No restaurant ops emergencies anymore. The Strato Foods chapter was closed; March was quieter.

**Fear.** What if the exam did not go well? What if the building years counted for nothing in the only game parents could easily explain to relatives?

**Pride I did not say out loud.** I had done more than most classmates. Saying that aloud in March would have sounded like coping. Maybe it was coping. It was also true.

## What I Did Not Know

I did not know which hostel I would sleep in come August. I did not know which friendships would survive the move to Roorkee. I did not know that college would change my schedule more than any framework announcement.

I definitely did not know what company I would start in 2026. March 2024 me was not planning that arc. March 2024 me was trying to land somewhere worth going.

## Takeaway

The spring before college is underrated as a life chapter. You are not the person you were in class 10. You are not yet the person campus makes you. You are in between — and in between is when you notice what building has already taught you.

Patience. Triage. Shipping small. Shutting down honestly.

Those lessons were mine before any admission letter confirmed a campus name.

When the letter did come, it would say IIT Roorkee, Mathematics and Computing. March me did not have that PDF yet. March me just had work, wait, and the stubborn habit of opening Android Studio after dinner.

That was enough for one month.</content:encoded></item><item><title>Board Exams and a Play Store That Wouldn&apos;t Wait</title><link>https://utso.stamped.work/blog/2024-02-18-board-exams-play-store-wouldnt-wait/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2024-02-18-board-exams-play-store-wouldnt-wait/</guid><description>February 2024: CBSE boards were here. Strato Inc still had users filing bugs. I learned you cannot optimize two calendars at once.</description><pubDate>Sun, 18 Feb 2024 00:00:00 GMT</pubDate><content:encoded>February 2024 was the month my two identities collided hardest: **student taking board exams** and **developer with apps in the wild**.

I am not writing this to sound heroic. I am writing it because a lot of founders-in-school posts pretend you can do both perfectly. You cannot. You negotiate tradeoffs daily and sometimes you negotiate badly.

## The Two Calendars

**Calendar A:** Physics, Chemistry, Mathematics. Sample papers. School attendance rules. Parents asking reasonable questions about ranks.

**Calendar B:** Crash reports from apps published under Strato Inc. A user email about a payment edge case. A dependency that broke because Google moved a target SDK deadline again.

They do not share a priority function. Calendar A punishes you in public results. Calendar B punishes you in one-star reviews and guilt.

```mermaid
flowchart LR
    subgraph feb [February 2024]
        B[Board exams]
        P[Play Store maintenance]
    end

    B --&gt;|wins most days| Study
    P --&gt;|wins some nights| Hotfix
    Study --&gt; JEE[JEE pipeline]
    Hotfix --&gt; Users[Real users]
```

## What I Actually Did

**Set a maintenance window.** Not a formal SLA — I am one person. But I stopped pretending every bug was P0. Critical crashes yes. Cosmetic regressions waited until after papers.

**Wrote release notes in my head instead of shipping them.** The number of half-finished branches that February could fill a graveyard. Shipping less was the win.

**Used boards as an excuse to say no.** Friends suggesting new startup ideas got an honest &quot;ask me in April.&quot; That boundary was harder than any technical debt.

**Kept one app stable on purpose.** Not everything in the portfolio deserved attention. Pick the thing with real users and let the experiments rot quietly.

## What Broke

**Sleep.** Obviously.

**Motivation loops.** Studying felt pointless when a bug fix would help someone today. Fixing bugs felt irresponsible when integrals were tomorrow.

**Identity.** &quot;Founder&quot; is a loud word in Indian family WhatsApp groups. &quot;Student&quot; is the louder one in February.

## What I Learned About Building

Production does not pause for your life events. Users do not know you have a Chemistry paper. That is not unfair — it is the deal when you publish.

The skill I practiced that month was **triage**, not feature velocity. Same skill I would need later, in different contexts, with different stakes.

## What I Did Not Do

I did not launch a new company. I did not write thought leadership about industries I had not worked in. I did not pretend boards were a &quot;strategic pause&quot; for investors — I had no investors, just apps and exams.

## Takeaway

If you are building while exams are real: **pick one calendar to win each day.** Some days that calendar is school and you accept the guilt about GitHub. Some nights the calendar is a hotfix and you accept the guilt about sleep.

February 2024 taught me that sustainability is not a podcast topic. It is whether you can still open your laptop in March without hating yourself.

Boards ended. JEE was next. The Play Store was still there, patient and slightly broken, like always.</content:encoded></item><item><title>RAG in Production: The Failure Modes Nobody Writes About</title><link>https://utso.stamped.work/blog/2024-01-25-rag-production-failures/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2024-01-25-rag-production-failures/</guid><description>Chunking, hybrid search, and re-ranking look elegant in demos. In production they fail in predictable, expensive ways.</description><pubDate>Thu, 25 Jan 2024 00:00:00 GMT</pubDate><content:encoded>Every RAG tutorial ends with a green checkmark. Your vector database returns relevant chunks. Your LLM synthesizes a coherent answer. You ship it Friday afternoon and spend Monday explaining why the bot told a customer that your refund policy covers time travel.

I have deployed RAG systems in three production environments. Two of them required architectural rewrites within six weeks. The third worked because we treated retrieval as an engineering problem, not a LangChain recipe. Here is what actually breaks.

## The Demo Lie

Demo RAG uses clean PDFs, short documents, and questions written by the person who indexed the corpus. Production RAG ingests Confluence pages from 2019, Slack exports with broken threading, and PDF tables that OCR turned into abstract art. The embedding model does not know your org chart. It knows cosine similarity between token sequences.

The failure is not hallucination. Hallucination is downstream. The failure is retrieval that looks confident and is wrong. Users trust retrieved context more than raw model output. That makes bad retrieval worse than no retrieval at all.

## Chunking Is Not a Hyperparameter

Everyone treats chunk size like learning rate: sweep 256, 512, 1024, pick the best on a dev set of twelve questions, declare victory.

Real documents have structure. A chunking strategy that splits mid-paragraph destroys referential context. Split mid-table and you get numeric fragments that embed near unrelated financial data. Split API documentation at function boundaries and your retrieval returns half a signature with no return type.

We ran an audit on a legal corpus: 34% of chunks contained a pronoun whose antecedent lived in an adjacent chunk. No amount of re-ranking fixes that. You need structure-aware chunking: respect headings, tables, code blocks, and list hierarchies. For technical docs, chunk by semantic section, not token count.

Overlap helps but is not a cure. Overlap without boundary awareness just gives you duplicate wrong answers with higher recall.
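
A minimal sketch of structure-aware chunking for Markdown-like docs, assuming headings start with `#` and code blocks use triple-backtick fences. `Chunk` and `chunk_by_section` are hypothetical names for illustration. Tables and list hierarchies need the same boundary treatment and are not handled here.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # built this way so the example can live inside a fence


@dataclass
class Chunk:
    heading_path: tuple  # e.g. ("Refunds", "Exceptions"), stored as metadata
    text: str


def chunk_by_section(markdown: str) -> list:
    """Split on headings, never inside a fenced code block."""
    chunks = []
    path = []  # current heading trail
    buf = []   # lines of the section being collected
    in_fence = False

    def flush():
        body = "\n".join(buf).strip()
        if body:
            chunks.append(Chunk(tuple(path), body))
        buf.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # A "#" line inside a fence is code (a comment), not a heading.
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # close the previous section under its own heading path
            level = len(match.group(1))
            del path[level - 1:]
            path.append(match.group(2).strip())
        else:
            buf.append(line)
    flush()
    return chunks
```

Each chunk carries its heading trail, so a chunk that opens with a pronoun still arrives with the section it belongs to.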

## Hybrid Search: When Vectors Lie

Pure vector search fails on exact-match queries. User asks for error code `E-4471` and your embedding model returns chunks about &quot;error handling best practices&quot; because the semantics overlap. Hybrid search (BM25 + dense vectors) fixes this, but introduces new failure modes.

BM25 dominates on rare tokens. Dense retrieval dominates on paraphrase. Without score normalization and fusion tuning, one modality silently wins and you get bimodal behavior: some queries work perfectly, others fail consistently, and your logs look random until you plot query type against retrieval source.

We use reciprocal rank fusion with per-collection calibration. We also log which modality contributed the winning chunk for every query. That logging paid for itself in the first week when we discovered that 40% of support tickets referenced SKU numbers that vector search never surfaced.
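
Reciprocal rank fusion with modality logging fits in a few lines. The sketch below is an illustration, not the production implementation described above. `rrf_fuse`, the per-modality `weights` (a stand-in for per-collection calibration), and the document ids are hypothetical. The constant `k = 60` is the value commonly used in the RRF literature.

```python
from collections import defaultdict


def rrf_fuse(rankings: dict, k: int = 60, weights: dict = None) -> list:
    """Fuse ranked lists with reciprocal rank fusion.

    rankings maps a modality name to doc ids, best first.
    Returns (doc_id, fused_score, top_modality) tuples, best first, where
    top_modality is the retriever that contributed most to that doc.
    """
    weights = weights or {}
    contrib = defaultdict(dict)
    for modality, doc_ids in rankings.items():
        w = weights.get(modality, 1.0)
        for rank, doc_id in enumerate(doc_ids, start=1):
            contrib[doc_id][modality] = w / (k + rank)

    fused = []
    for doc_id, parts in contrib.items():
        top_modality = max(parts, key=parts.get)
        fused.append((doc_id, sum(parts.values()), top_modality))
    fused.sort(key=lambda row: row[1], reverse=True)
    return fused


results = rrf_fuse({
    "bm25": ["sku-4471", "faq-1"],
    "dense": ["faq-1", "guide-9"],
})
for doc_id, score, modality in results:
    print(doc_id, round(score, 4), modality)  # log the modality per query
```

Aggregating the logged `top_modality` by query type is what reveals whether one retriever is silently winning.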

## Re-Ranking: The Expensive Band-Aid

Cross-encoder re-rankers improve precision dramatically. They also add 200-800ms latency per query and do not fix garbage chunks. Re-ranking is a filter, not a foundation.

The production pattern that works: retrieve wide (top 50-100), re-rank narrow (top 5), then apply a confidence threshold before generation. If the top re-ranked score falls below that threshold, refuse to answer or escalate to human. Most teams skip the check because it makes demos look bad. Production is not a demo.

We also learned to re-rank on the full query-chunk pair, not truncated pairs. Truncation for speed silently drops the constraints that matter in long enterprise queries.
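
The retrieve-wide, re-rank-narrow, gate-on-confidence pattern can be sketched without committing to any library. Here `retrieve` and `score` are hypothetical callables standing in for a hybrid retriever and a cross-encoder, and the `min_score` default is a placeholder that would need calibrating on real traffic.

```python
from typing import Callable, Optional


def gated_context(
    query: str,
    retrieve: Callable[[str, int], list],
    score: Callable[[str, str], float],
    wide_k: int = 50,
    narrow_k: int = 5,
    min_score: float = 0.5,
) -> Optional[list]:
    """Return the top re-ranked chunks, or None to signal refuse/escalate."""
    candidates = retrieve(query, wide_k)  # wide: favour recall
    # Score the full query-chunk pair; no truncation for speed.
    scored = [(score(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:narrow_k]  # narrow: favour precision
    if not top or min_score > top[0][0]:
        return None  # low confidence: do not generate an answer
    return [chunk for _, chunk in top]
```

A `None` return is the refusal path. The caller routes to a human instead of handing weak context to the model.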

## Failure Points in the Pipeline

```mermaid
flowchart TD
    A[User Query] --&gt; B[Query Preprocessing]
    B --&gt; C{Embedding Model}
    C --&gt;|Stale index| D[Wrong Vector Space]
    C --&gt;|OK| E[Hybrid Retrieval]
    E --&gt; F[BM25 Results]
    E --&gt; G[Dense Results]
    F --&gt; H[Score Fusion]
    G --&gt; H
    H --&gt;|Poor calibration| I[Wrong Modality Wins]
    H --&gt;|OK| J[Re-Ranker]
    J --&gt;|Low confidence| K[Should Refuse - Often Skipped]
    J --&gt;|OK| L[Context Assembly]
    L --&gt;|Bad chunking| M[Broken Context]
    L --&gt;|OK| N[LLM Generation]
    N --&gt; O[Confident Wrong Answer]
    D --&gt; O
    I --&gt; O
    K --&gt; O
    M --&gt; O

    style D fill:#8b0000,color:#fff
    style I fill:#8b0000,color:#fff
    style K fill:#8b0000,color:#fff
    style M fill:#8b0000,color:#fff
    style O fill:#8b0000,color:#fff
```

Every red node is a failure mode we hit in production. Most tutorials skip straight from H to N.

## Evaluation That Actually Matters

Offline metrics on synthetic QA pairs lie. Build an evaluation set from production failures: every escalated ticket, every thumbs-down, every &quot;that is not what our docs say&quot; message. Label the failure stage (retrieval, re-rank, generation, chunking). Fix the stage, not the symptom.

Track retrieval recall at k=10 separately from end-to-end answer quality. If recall is low, re-ranking and prompt engineering are theater.
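
Both measurements are cheap to compute once failures are labelled. A minimal sketch, assuming each evaluation case records the retrieved chunk ids, the ids a human marked relevant, and a `stage` label; the field names are hypothetical.

```python
from collections import Counter


def recall_at_k(retrieved: list, relevant: set, k: int = 10) -> float:
    """Fraction of the relevant chunk ids that appear in the top k results."""
    if not relevant:
        raise ValueError("need at least one relevant chunk id")
    hits = relevant.intersection(retrieved[:k])
    return len(hits) / len(relevant)


def stage_breakdown(failures: list) -> list:
    """Count labelled failures per pipeline stage, most common first."""
    return Counter(f["stage"] for f in failures).most_common()


failures = [
    {"stage": "retrieval", "retrieved": ["c7", "c2"], "relevant": {"c9"}},
    {"stage": "chunking", "retrieved": ["c1", "c4"], "relevant": {"c1", "c5"}},
    {"stage": "retrieval", "retrieved": ["c3"], "relevant": {"c8"}},
]
print(stage_breakdown(failures))
print([recall_at_k(f["retrieved"], f["relevant"]) for f in failures])
```

If the breakdown is dominated by retrieval and recall sits near zero, the fix belongs upstream of the prompt.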

## What We Changed

After the second rewrite we landed on: structure-aware chunking with metadata (source, section, timestamp), hybrid retrieval with logged modality contribution, cross-encoder re-ranking with confidence gating, and a refusal path that routes to human support with the retrieved context attached for faster resolution.

Latency went up 300ms. Wrong answers dropped 60%. Support escalations from the bot dropped because we stopped pretending low-confidence retrieval was good enough.

RAG in production is not an LLM problem. It is a search problem with an LLM stapled on the end. Treat it that way or your users will treat your product as broken. They will be right.</content:encoded></item><item><title>2023: Shutting Down a Hit App, AI Goes Mainstream</title><link>https://utso.stamped.work/blog/2023-12-31-2023-year-in-review/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2023-12-31-2023-year-in-review/</guid><description>A personal year-in-review: Strato Foods shutdown, the AI mainstream moment, and what came next.</description><pubDate>Sun, 31 Dec 2023 00:00:00 GMT</pubDate><content:encoded>2023 was the year Strato Foods, my biggest Strato Inc app, stopped making sense to operate. It was also the year AI went from &quot;interesting research area&quot; to &quot;every dinner conversation mentions ChatGPT.&quot; I shut down Strato Foods in February, kept publishing through Strato Inc, and spent the rest of the year learning what building looks like after your local hit dies.

This is not a highlight reel. It is an honest accounting.

## The Year at a Glance

```mermaid
timeline
    title 2023 - Personal and Professional
    section Q1
        Feb : Strato Foods shutdown under Strato Inc
        Mar : GPT-4 launch reshapes AI discourse
        Mar : SVB collapse triggers treasury lessons
    section Q2
        Apr-May : Recovery and reflection post-shutdown
        Jun-Jul : Small Strato Inc releases, summer building
    section Q3
        Aug : More Strato Inc app experiments
        Sep : Vision transformer production reading
        Oct : Continual learning research deepens
    section Q4
        Nov : OpenAI DevDay and GPT Store hype
        Nov : OpenAI board drama governance lessons
        Dec : JEE prep and year-end consolidation
```

## January-March: AI Hype and Hard Lessons

GPT-4 dropped in March and reset every conversation about what machines can do. I had been working with LLMs since GPT-3, but GPT-4 was a step change that made non-technical people suddenly care about context windows and hallucination rates.

I shut down Strato Foods in February. I wrote about that separately. The meta-lesson: product success in a local market is not the same as a durable business when national aggregators arrive. Strato Inc continued. The app did not. That distinction mattered for my sanity.

SVB collapsed in March. Even as a mostly India-based builder, the panic was a reminder that banking and treasury hygiene are founder skills, not finance team problems.

## April-September: Research and Strato Inc

Q2 and Q3 were about shipping smaller things through Strato Inc and reading ML papers without pretending each paper was a company pitch. Class 12 and JEE prep were the other calendar — sometimes winning, sometimes not.

On the research side, I spent time on continual learning for LLMs, vision transformers, and what production actually needs versus what papers claim.

## October-December: Consolidation

Q4 was writing, research consolidation, and planning for college. OpenAI DevDay in November was a builder-economics moment. The board drama the same month was a governance lesson for anyone building on top of foundation models.

## Numbers I Will Share

- 1 major product shutdown (Strato Foods)
- Strato Inc Play Store account: still active
- 0 new incorporated companies this year

## What 2024 Needed to Be

Keep Strato Inc alive. Finish school. Get into college without abandoning the builder identity that survived the shutdown.

## Closing

2023 was the year I learned that shutting down Strato Foods is not shutting down as a builder. Strato Inc outlived the app. The skills compounded. The identity crisis was real and temporary.

Happy new year. Back to work.</content:encoded></item><item><title>The OpenAI Board Drama: What Founders Missed</title><link>https://utso.stamped.work/blog/2023-12-05-openai-board-drama-what-founders-missed/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2023-12-05-openai-board-drama-what-founders-missed/</guid><description>November 2023&apos;s Sam Altman firing and reinstatement was not tech gossip. It was a governance case study every founder building an AI company should study.</description><pubDate>Tue, 05 Dec 2023 00:00:00 GMT</pubDate><content:encoded>On November 17, 2023, OpenAI&apos;s board fired Sam Altman as CEO. By November 22, he was back. In between: 700 of 770 employees threatened to quit, Microsoft offered to hire everyone, two interim CEOs, and a governance structure that looked sophisticated on paper and collapsed under pressure in 72 hours.

Founders treated this as tech Twitter drama. They should have treated it as a case study in corporate governance, board composition, and the unique risks of building a company with a nonprofit mission and a for-profit engine.

## Timeline of Events

```mermaid
timeline
    title OpenAI Board Crisis - November 2023
    section Friday Nov 17
        Board fires Sam Altman as CEO : Greg Brockman removed as board chairman, quits as president
        Mira Murati appointed interim CEO
    section Saturday Nov 18
        Altman negotiates potential return : Board refuses conditions
    section Sunday Nov 19
        Murati replaced by Emmett Shear as interim CEO
        Microsoft announces Altman and Brockman join to lead new AI lab
    section Monday Nov 20
        OpenAI employees sign letter threatening mass resignation
        Negotiations intensify : Employee exodus pressure mounts
    section Tuesday Nov 21
        Board in discussions to reinstate Altman
    section Wednesday Nov 22
        Altman reinstated as CEO : New board formed : Brockman returns as president
```

Five days. That is how long it took for the most important AI company in the world to nearly disintegrate because four board members decided they could fire the CEO without a succession plan, without employee buy-in, and without understanding their own cap table dynamics.

## What Actually Happened (Best Available Account)

The board cited Altman&apos;s lack of candor as the stated reason. The underlying tensions were more complex: disagreements about commercialization speed, safety vs scaling priorities, and board members who fundamentally misunderstood the power dynamics of firing a founder-CEO who is the public face, primary fundraiser, and cultural center of gravity for the company.

The nonprofit board controlled a for-profit subsidiary through a unique governance structure. The board&apos;s fiduciary duty was to the nonprofit mission (&quot;ensure AGI benefits all of humanity&quot;), not to shareholders or employees. This sounded principled in charter documents. In practice, it created a board with authority to make catastrophic decisions without accountability to the people who actually build and fund the company.

When the board fired Altman, they expected the company to continue under new leadership. What they got was a loyalty cascade: employees, investors, and Microsoft all aligned with Altman within hours. The board&apos;s leverage evaporated because they had no credible replacement and no support from the people who matter.

## Governance Lessons for AI Founders

**Lesson 1: Board composition is destiny.**

OpenAI&apos;s board had members with strong safety and governance credentials but limited operating experience at hypergrowth companies. A board that can evaluate research safety but cannot evaluate CEO performance in a commercial context will make bad decisions. Balance technical, operational, and financial expertise.

If you are an AI founder assembling a board, ask: can this person add value during a crisis, not just during a quarterly review? OpenAI&apos;s board failed the crisis test spectacularly.

**Lesson 2: Nonprofit mission + for-profit operations is a structural time bomb.**

OpenAI&apos;s capped-profit structure was designed to prevent commercial incentives from overriding safety. Instead, it created a governance layer that could override commercial reality without understanding commercial consequences. The board could fire the CEO because the mission came first. The employees could threaten to quit because their equity and careers came first. These interests were not aligned, and the structure did not resolve the conflict. It hid it until it exploded.

If you are structuring an AI company with a safety mission, solve the governance problem on day one. Do not assume good intentions prevent power struggles.

**Lesson 3: The CEO is not fungible at founder-led companies.**

Altman&apos;s firing assumed OpenAI would function without him. It could not. Founder-CEOs at high-growth companies are not interchangeable executives. They hold investor relationships, employee loyalty, strategic vision, and external credibility that no interim CEO replicates in 48 hours.

Boards at founder-led companies need succession plans before they need succession. If your board cannot answer &quot;what happens if we remove the CEO tomorrow&quot; with a credible plan, they should not remove the CEO.

**Lesson 4: Employee leverage is real and increasing.**

700 of 770 employees signed a letter saying they would leave unless the board resigned. In a talent-constrained market, employees are not replaceable on a timeline that preserves company value. Boards that forget this learn painfully.

AI founders: your team is your moat. Not your model weights (those leak). Not your compute (Microsoft will sell you more). Your people. Governance structures that alienate the team destroy value faster than any competitor.

**Lesson 5: Your largest investor is also your backup CEO employer.**

Microsoft invested a reported $13 billion in OpenAI. When Altman was fired, Microsoft offered to hire him and the entire team within three days. This means Microsoft&apos;s leverage over OpenAI&apos;s governance is enormous and permanent. Any AI startup&apos;s largest investor has similar latent power.

Understand your cap table dynamics before a crisis, not during one. Who can hire your team out from under you? Who has board seats? Who has information rights that let them move faster than you expect?

## What Founders Missed by Treating This as Gossip

**This will happen again.** OpenAI is not unique. Every AI company with dual missions, complex cap tables, and safety boards will face governance tensions as commercial pressure increases. Anthropic, Cohere, Mistral, and the next generation of labs are all building on governance structures that have not been stress-tested.

**Investors are watching.** VC firms updated their governance requirements for AI portfolio companies within weeks. Expect more explicit board composition clauses, clearer CEO removal procedures, and investor protective provisions in term sheets.

**Regulatory attention increased.** US lawmakers who barely understood AI before November suddenly cared about who controls the most powerful AI company in America. Governance failures at leading labs invite regulation that affects everyone in the ecosystem.

**Employee expectations shifted.** AI talent now knows that collective action works. The OpenAI employee letter set a precedent. Founders who treat employees as replaceable resources in a talent war are operating on outdated assumptions.

## What I Would Do Differently as an AI Founder

- Keep the board small (3-5 members) with at least two people who have operated companies at my stage
- Write explicit CEO succession and removal procedures into governance documents before raising
- Avoid hybrid nonprofit/for-profit structures unless a lawyer can explain exactly what happens in a CEO dispute
- Maintain direct relationships with key employees independent of board dynamics
- Understand my largest investor&apos;s incentives in a crisis scenario, not just a funding scenario

## Closing

The OpenAI board drama was not entertainment. It was a five-day masterclass in what happens when governance structure, board competence, and power dynamics misalign at the most important company in AI.

Founders building AI companies in 2023 and beyond should study this closely. Your governance documents will be tested. Your board will face decisions they are not prepared for. Your employees will have more leverage than you expect.

Build the governance structure for the crisis, not for the pitch deck. OpenAI survived because Sam Altman had enough loyalty to return. Your company might not have that luck.

The mission matters. The structure that protects the mission matters more. OpenAI&apos;s board learned that the hard way. Learn from their mistake without paying their price.</content:encoded></item><item><title>OpenAI DevDay and the GPT Store Hype</title><link>https://utso.stamped.work/blog/2023-11-22-openai-devday-gpt-store-hype/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2023-11-22-openai-devday-gpt-store-hype/</guid><description>November 2023: Sam Altman announced custom GPTs and a store. I watched from my coaching centre room wondering what was product and what was keynote theater.</description><pubDate>Wed, 22 Nov 2023 00:00:00 GMT</pubDate><content:encoded>OpenAI DevDay landed in November 2023 the way only a few keynotes do: every group chat forwarded the stream, people who had never written a line of Python had opinions about &quot;GPTs,&quot; and my Telegram agent experiments from 2022 suddenly looked both early and quaint.

I watched it between problem sets. Class 12 energy, JEE on the horizon, Strato Inc still ticking in the background. This post is what DevDay felt like from that seat, not an enterprise strategy memo.

## What They Announced (That Mattered to Me)

**GPT-4 Turbo** — longer context, cheaper API pricing. The builder headline. If you were already paying OpenAI bills for side projects, your spreadsheet changed overnight.

**Custom GPTs** — no-code wrappers with instructions, files, and actions. Everyone became a &quot;GPT entrepreneur&quot; for forty-eight hours.

**GPT Store** — marketplace for those wrappers. Monetization story TBD. Hype immediate.

**Assistants API** — closer to what I had been hacking manually: threads, tools, retrieval-ish behavior without assembling the plumbing from scratch every time.

None of this replaced studying. It did replace sleep for one night.

## Why It Felt Different from ChatGPT&apos;s Launch

ChatGPT in late 2022 changed public awareness. DevDay changed **builder economics**.

The gap between &quot;I chained prompts in a Telegram bot&quot; and &quot;I can publish a configured agent in OpenAI&apos;s UI&quot; collapsed for non-engineers. That is good for access. It is terrifying for differentiation if your product was thin prompting.

```mermaid
flowchart TB
    subgraph before [My 2022 stack]
        TG[Telegram]
        P[Hand-rolled prompts]
        API[GPT-3 API]
    end

    subgraph after [DevDay narrative]
        UI[Custom GPT UI]
        Store[GPT Store]
        AST[Assistants API]
    end

    before --&gt; after
    after --&gt; Q{What is still defensible?}
    Q --&gt; Memory[Memory + tools you own]
    Q --&gt; UX[UX outside OpenAI walled garden]
    Q --&gt; Domain[Real workflow integration]
```

## What I Thought Was Real

**Cheaper tokens** — real. Side projects became less scary to leave running.

**Assistants API** — real for prototypes. Still not magic memory. Still needed eval discipline.

**Custom GPTs** — great for personal workflows and demos. Most would not survive contact with strangers who type one-word prompts.

**GPT Store gold rush** — mostly theater until revenue share and discovery were clear.

## What I Was Building Then

Not a DevDay competitor. Strato Inc maintenance, JEE prep, reading continual learning papers, occasional Flutter fixes.

The honest internal question: **does this change what I ship on the Play Store?** Mostly no. It changed what I experimented with on weekends — richer bots, file upload toys, the usual.

## The JEE Complication

It is strange to watch a platform shift while your near-term future is decided by an exam that has nothing to do with LLMs. Part of me felt behind for not shipping a custom GPT. Part of me knew integration math mattered more that week.

Both feelings were valid. Only one had a deadline.

## Takeaway

DevDay was a pricing and packaging event dressed in keynote clothes. If you were already building on the API, you got cheaper experiments. If you were not building yet, you got a new way to procrastinate with a storefront fantasy.

November 2023 me did not need a market map. I needed to finish problem sets, keep Strato Inc alive, and remember that **wrappers age fast; skills and taste age slower.**

The GPT Store would be full within months. Most listings would be forgotten by the next announcement cycle. I was fine being a student who noticed that pattern early.</content:encoded></item><item><title>Continual Learning Research in 2023: A Review With Opinions</title><link>https://utso.stamped.work/blog/2023-10-28-continual-learning-research-review/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2023-10-28-continual-learning-research-review/</guid><description>EWC, replay methods, and why benchmarks lie about progress in continual learning. An honest survey with strong opinions.</description><pubDate>Sat, 28 Oct 2023 00:00:00 GMT</pubDate><content:encoded>Continual learning is the problem of updating a model on new data without forgetting what it learned before. It is also the problem every production ML team faces and almost no research paper solves in conditions that resemble production.

In 2023, the field has a taxonomy of approaches, a graveyard of benchmarks that reward cheating, and a growing gap between &quot;prevents catastrophic forgetting on Split-MNIST&quot; and &quot;personalizes an LLM for a user without destroying base capabilities.&quot; This review covers the landscape with opinions attached, because neutral surveys of a field this messy are useless.

## Continual Learning Taxonomy

```mermaid
mindmap
  root((Continual Learning))
    Regularization
      EWC
      SI
      LwF
    Replay
      Experience replay
      Generative replay
      Coreset selection
    Architecture
      Progressive nets
      PackNet
      Dynamic expansion
    Optimization
      GEM
      A-GEM
      OGD
    Meta-learning
      MAML variants
      Online meta-learning
    Hybrid
      ER + EWC
      Distillation + regularization
      Replay + distillation
```

Every branch has papers claiming SOTA. Most results do not replicate under fair evaluation protocols.

## The Core Problem: Catastrophic Forgetting

Train a neural network on task A. Fine-tune on task B. Performance on task A collapses. This is catastrophic forgetting, and it is not a corner case. It is the default behavior of gradient-based optimization on non-stationary data distributions.

Production teams encounter this constantly:

- A recommendation model retrained on recent user behavior forgets long-tail preferences
- An LLM fine-tuned on domain data loses general reasoning capability
- A vision model updated on new product SKUs degrades on older SKU detection

The research field formalizes this as continual learning. The solutions proposed in papers rarely survive contact with real deployment constraints.

## Elastic Weight Consolidation (EWC)

Kirkpatrick et al.&apos;s EWC (2017) remains the most cited regularization approach. After training on task A, compute the Fisher information matrix diagonal to estimate parameter importance. When training on task B, add a penalty that prevents important parameters from moving far from their task-A values.
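
A minimal sketch of the penalty described above, in plain Python so the arithmetic is visible. It assumes the Fisher diagonal is estimated as the mean squared per-example gradient at the task-A solution; the function names, the gradient values, and the `lam` strength are hypothetical, not taken from any particular library.

```python
def fisher_diagonal(per_example_grads):
    """Estimate the Fisher diagonal as the mean squared gradient per parameter."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[i] ** 2 for g in per_example_grads) / n for i in range(dim)]


def ewc_penalty(theta, theta_star, fisher, lam):
    """Quadratic penalty: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2."""
    return 0.5 * lam * sum(
        f * (t - s) ** 2 for f, t, s in zip(fisher, theta, theta_star)
    )


def ewc_penalty_grad(theta, theta_star, fisher, lam):
    """Gradient of the penalty, added to the task-B gradient at every step."""
    return [lam * f * (t - s) for f, t, s in zip(fisher, theta, theta_star)]


# Hypothetical task-A gradients: parameter 0 matters a lot, parameter 1 barely.
grads_a = [[2.0, 0.1], [-2.0, -0.1]]
fisher = fisher_diagonal(grads_a)
theta_star = [1.0, 1.0]   # task-A solution
theta = [1.5, 1.5]        # both parameters drifted by 0.5 during task B

penalty = ewc_penalty(theta, theta_star, fisher, lam=10.0)
grad = ewc_penalty_grad(theta, theta_star, fisher, lam=10.0)
print(round(penalty, 4), [round(g, 4) for g in grad])  # 5.0125 [20.0, 0.05]
```

Both parameters moved the same distance, but the pull back toward the task-A value is 400 times stronger on the one with high Fisher information. That asymmetry is the whole idea, and it is also where the noise in a diagonal estimate hurts.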

**What works:** EWC reduces forgetting on small-scale benchmarks (Permuted MNIST, Split MNIST) with clear task boundaries.

**What does not work:** Fisher diagonal approximations are noisy for large models. The quadratic penalty fights the new task gradient when tasks conflict. EWC does not scale cleanly to models with billions of parameters. Nobody has shown EWC working on a production LLM personalization pipeline at scale.

**My opinion:** EWC is a pedagogical tool, not a production solution. It teaches you why regularization-based CL is appealing and why it fails when task boundaries blur.

## Replay Methods

Experience replay stores a buffer of examples from previous tasks and mixes them into training on new tasks. Simple, effective, and honest about its memory cost.

**Experience replay:** Keep N examples per task. Sample uniformly during new task training. Works surprisingly well if the buffer is large enough. The question is always: how large, and who pays for storage?
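
A sketch of a fixed-size replay buffer, under assumptions. It uses reservoir sampling, a common way to keep a uniform sample of a stream when the stream length is unknown; that is my choice for the illustration, not something the methods above require. The class name, the capacity, and the 50 percent replay fraction are all hypothetical.

```python
import random


class ReplayBuffer:
    """Fixed-size buffer filled by reservoir sampling over a data stream."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) != self.capacity:
            # Buffer not full yet: always keep the example.
            self.items.append(example)
            return
        # Buffer full: keep the new example with probability capacity / seen.
        slot = self.rng.randrange(self.seen)
        if self.capacity > slot:
            self.items[slot] = example

    def sample(self, k):
        """Draw up to k stored examples uniformly, without replacement."""
        return self.rng.sample(self.items, min(k, len(self.items)))


def mixed_batch(new_examples, buffer, replay_fraction=0.5):
    """Build a training batch from new data plus replayed old data."""
    n_replay = int(len(new_examples) * replay_fraction)
    return list(new_examples) + buffer.sample(n_replay)


buffer = ReplayBuffer(capacity=100)
for i in range(1000):
    buffer.add(("task_a", i))

batch = mixed_batch([("task_b", i) for i in range(8)], buffer)
print(len(buffer.items), len(batch))  # 100 12
```

The storage question from above shows up as a single number here: `capacity`. Everything else is bookkeeping.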

**Generative replay:** Train a generative model on task A data, synthesize pseudo-examples when learning task B. Clever. Generator quality limits replay quality. GAN-generated replay for CL mostly works on MNIST-scale data.

**Coreset selection:** Instead of random replay, select a diverse subset that maximally preserves task performance. Herding, k-center, gradient-based selection. Better sample efficiency but expensive selection algorithms.

**My opinion:** Replay is the most honest approach because it admits that preventing forgetting requires retaining information about old data. The field&apos;s discomfort with replay (it &quot;cheats&quot; by storing data) is ideological, not practical. In production, you have logs. Use them.

## Architecture-Based Methods

Progressive neural networks add a new column of parameters for each task with lateral connections to previous columns. No forgetting by construction. Memory grows linearly with tasks.

PackNet prunes and reassigns parameters per task in a shared network. Dynamic expansion methods add neurons or modules for new tasks.

**What works:** Clean task boundaries with moderate task count.

**What fails:** Task count scaling (100 tasks means 100 columns or a fragmented network), inference complexity (which column/module for which input?), and the assumption that task identity is known at inference time.

**My opinion:** Architecture methods solve forgetting by throwing parameters at the problem. For LLMs where parameters are already expensive, this is a non-starter unless task count is tiny (2-5 distinct domains, not millions of users).

## Why Benchmarks Are Broken

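Split-MNIST: train on a few digit classes at a time (typically pairs: 0-1, then 2-3, and so on; the two-task variant uses 0-4, then 5-9). Permuted MNIST: same digits, different pixel permutations. CORe50: incremental object recognition from short videos of handheld objects.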

These benchmarks share fatal flaws:

**Clear task boundaries at train and test time.** Production data does not arrive in labeled task blocks. It arrives as a stream with shifting distributions and no task ID.

**Small model scale.** MLPs and small CNNs on MNIST do not predict behavior on 7B parameter transformers.

**No measurement of forward transfer.** Most benchmarks only measure forgetting (backward transfer). A method that prevents forgetting by learning nothing on new tasks scores well. That is not continual learning. That is freezing.
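
To make the freezing problem concrete, here is a small sketch with hypothetical accuracy numbers. `acc[i][j]` is the accuracy on task j after training through task i. Backward transfer follows the usual definition (final accuracy minus accuracy right after learning the task); `learning_accuracy` is a deliberately simple stand-in for whether new tasks get learned at all, not the formal forward-transfer metric.

```python
def backward_transfer(acc):
    """Mean change on earlier tasks after all training (negative means forgetting)."""
    last = len(acc) - 1
    return sum(acc[last][j] - acc[j][j] for j in range(last)) / last


def learning_accuracy(acc):
    """Mean accuracy on each task measured right after training on it."""
    return sum(acc[j][j] for j in range(len(acc))) / len(acc)


# Hypothetical results on three tasks.
frozen = [  # learns task 0, then never updates again
    [0.95, 0.10, 0.10],
    [0.95, 0.10, 0.10],
    [0.95, 0.10, 0.10],
]
finetuned = [  # plain sequential fine-tuning
    [0.95, 0.10, 0.10],
    [0.60, 0.94, 0.10],
    [0.40, 0.55, 0.93],
]

for name, acc in (("frozen", frozen), ("finetuned", finetuned)):
    print(name, round(backward_transfer(acc), 2), round(learning_accuracy(acc), 2))
# frozen 0.0 0.38
# finetuned -0.47 0.94
```

A leaderboard that reports only the first number ranks the frozen model first. Reporting both columns is the minimum needed to tell freezing apart from learning.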

**No compute budget constraints.** Methods that store full replay buffers and retrain on all previous data every task are not scalable. Papers rarely report training cost.

**No evaluation of base capability preservation for LLMs.** Fine-tune an LLM on medical QA, then measure both medical QA improvement and general MMLU retention. Almost no CL benchmarks do this.

## What Would Actually Matter for LLM Personalization

The application I care about most: updating an LLM&apos;s behavior for a specific user or domain without degrading general capabilities. This is continual learning in the wild.

Requirements that 2023 research mostly ignores:

1. **No task labels.** The system does not know when &quot;task B&quot; started. Data arrives continuously.
2. **Compute budget.** Cannot replay full pretraining corpus on every update.
3. **Latency.** Updates should not require full retraining. Minutes, not days.
4. **Safety.** Personalization must not introduce harmful behavior or leak other users&apos; data.
5. **Evaluation.** Measure both new task performance and general capability retention on standard benchmarks.

Current approaches that come closest:

- **LoRA adapters per user/domain.** Freeze base model, train small adapter. No forgetting of base model (it is frozen). Limited expressiveness per adapter.
- **Selective fine-tuning with replay.** Fine-tune on user data mixed with a sample of general corpus. Replay in disguise. Works if replay sample is representative.
- **Retrieval-augmented personalization.** Do not update weights. Update retrieval index with user-specific context. Not continual learning strictly, but avoids forgetting entirely.
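
The adapter idea in the first bullet can be sketched without any ML framework. This is a toy under stated assumptions: one linear layer, plain Python lists instead of tensors, and the common LoRA convention of a zero-initialized `B` matrix so the adapter starts as a no-op. The class name, the sizes, and the `alpha` scaling value are hypothetical.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen base weights, shared by every user
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(d_out)]  # zero init: no-op at start
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * d for y, d in zip(base, delta)]

    def trainable_params(self):
        return len(self.a) * len(self.a[0]) + len(self.b) * len(self.b[0])


weight = [
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 3.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
]
layer = LoRALinear(weight, rank=1)
out = layer.forward([1.0, 1.0, 1.0, 1.0])
print(out, layer.trainable_params())  # [3.0, 4.0, 2.0, 3.0] 8
```

Before any training the output matches the base layer exactly, which is the no-forgetting guarantee in miniature. The cost is also visible: 8 trainable numbers against 16 frozen ones is the expressiveness limit the bullet mentions.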

## Strong Opinions Section

**Opinion 1:** The field over-indexes on preventing forgetting and under-indexes on forward transfer and compute efficiency. A method that forgets 5% but learns new tasks 3x faster is more useful than one that forgets 0% at 10x compute cost.

**Opinion 2:** For LLMs, the right default is frozen base model plus adapters or retrieval, not continual weight updates. The research community should stop pretending Split-MNIST results inform LLM deployment strategy.

**Opinion 3:** Replay is not cheating. Data retention is a design choice. Privacy-preserving replay (differential privacy, federated buffers) is an engineering problem, not a reason to abandon the most effective CL strategy.

**Opinion 4:** Most &quot;continual learning&quot; papers would be better classified as &quot;regularized fine-tuning with extra steps.&quot; The label &quot;continual learning&quot; attracts citations without delivering production value.

## What I Am Watching

- **Online LoRA merging methods** that compose user adapters without interference
- **Curriculum-based replay selection** using influence functions to pick high-value replay examples
- **Evaluation frameworks** that test LLM personalization with MMLU/HELM retention metrics alongside domain task improvement
- **Federated continual learning** for settings where data cannot be centralized

## Closing

Continual learning research in 2023 has strong theoretical foundations and weak production relevance for large-scale models. EWC and architecture methods teach important concepts but do not solve real deployment problems. Replay works but the field stigmatizes it. Benchmarks reward methods that exploit unrealistic assumptions.

If you are a founder or engineer facing forgetting in production: start with replay (mix old and new data), consider LoRA adapters instead of full fine-tuning, and measure both new task and base capability metrics. Ignore Split-MNIST SOTA claims.

The field will mature when benchmarks match production constraints. Until then, read papers for ideas, not for deployment recipes.</content:encoded></item><item><title>Vision Transformers in 2023: The Real Talk</title><link>https://utso.stamped.work/blog/2023-09-15-transformers-in-vision-vit-real-talk/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2023-09-15-transformers-in-vision-vit-real-talk/</guid><description>ViT vs CNN tradeoffs, where DINOv2 actually matters, and why your production pipeline probably still needs convolutions.</description><pubDate>Fri, 15 Sep 2023 00:00:00 GMT</pubDate><content:encoded>Vision Transformers were supposed to kill CNNs by 2022. It is 2023. ResNet variants still dominate production deployments. ViT and its descendants win on benchmarks, in papers that assume infinite GPU budget and no latency constraints, and lose on engineering.

I am not anti-ViT. I use ViT-based models in production. But the gap between &quot;SOTA on ImageNet&quot; and &quot;works in our factory inspection pipeline on edge hardware&quot; is where most of the discourse conveniently stops.

## ViT Architecture: What Actually Happens

```mermaid
flowchart LR
    subgraph Input
        IMG[Input image H x W x 3]
    end

    subgraph PatchEmbed[&quot;Patch Embedding&quot;]
        PATCH[Split into P x P patches]
        FLAT[Flatten patches to sequence]
        PROJ[Linear projection to D dimensions]
        POS[Add positional embeddings]
    end

    subgraph Transformer[&quot;Transformer Encoder x L layers&quot;]
        MHSA[Multi-Head Self-Attention]
        FFN[Feed-Forward Network]
        MHSA --&gt; FFN
        FFN --&gt; MHSA
    end

    subgraph Output
        CLS[CLS token or global average pool]
        HEAD[Classification / downstream head]
    end

    IMG --&gt; PATCH --&gt; FLAT --&gt; PROJ --&gt; POS
    POS --&gt; MHSA
    FFN --&gt; CLS --&gt; HEAD
```

The image becomes a sequence of patch tokens. Self-attention lets every patch attend to every other patch. Global context from layer one. That is the core advantage over convolutions, which build receptive field gradually through stacked layers.

The cost: self-attention is O(n^2) in the number of patches. A 224x224 image with 16x16 patches gives 196 tokens. Manageable. A 1024x1024 image at the same patch size gives 4,096 tokens, more than 400 times the attention pairs, and a GPU memory bill that makes your CFO ask questions.
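
The arithmetic is worth running once. A minimal sketch, assuming square non-overlapping patches and ignoring the extra CLS token; the function names are mine.

```python
def patch_tokens(height, width, patch):
    """Number of patch tokens for an image split into patch x patch squares."""
    return (height // patch) * (width // patch)


def attention_entries(tokens):
    """Entries in one attention matrix (per head, per layer): tokens squared."""
    return tokens * tokens


small = patch_tokens(224, 224, 16)
large = patch_tokens(1024, 1024, 16)
ratio = attention_entries(large) / attention_entries(small)
print(small, large, round(ratio, 1))  # 196 4096 436.7
```

About 21 times the tokens, about 437 times the attention entries. That quadratic gap is why high-resolution ViT inference gets expensive so quickly.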

## ViT vs CNN: Honest Tradeoffs

| Dimension | CNN (ResNet, EfficientNet) | ViT (and variants) |
|-----------|---------------------------|---------------------|
| Data efficiency | Good with limited data | Needs large datasets or strong pretraining |
| Inductive bias | Locality, translation equivariance built in | Must learn spatial relationships from data |
| Compute at inference | Lower for equivalent accuracy on many tasks | Higher attention cost, especially at high resolution |
| Pretraining leverage | ImageNet pretrain works, less transfer gap | Massive benefit from large-scale pretrain (DINOv2, CLIP) |
| Small object detection | Strong with FPN architectures | Requires adaptations (Deformable DETR, etc.) |
| Edge deployment | Mature quantization and pruning tooling | Catching up, still harder |

The table oversimplifies. But directionally correct for 2023 production decisions.

## DINOv2: The Pretraining Story That Matters

Meta&apos;s DINOv2 (released in April 2023) is the most practically relevant ViT development for founders and engineers, not because it tops a leaderboard, but because it produces general-purpose visual features that transfer with minimal fine-tuning.

Self-supervised pretraining on 142 million curated images. Strong dense features for segmentation, depth estimation, and retrieval without task-specific labels. If you are building a computer vision product and need a backbone, DINOv2 ViT-S or ViT-B is a credible starting point.

What DINOv2 does not solve:

- Inference latency on edge devices
- Real-time requirements above 30 FPS on non-GPU hardware
- Domain shift when your factory images look nothing like LVD-142M pretraining data (fine-tuning still required)
- The engineering work of integrating a PyTorch checkpoint into your existing pipeline

## Where ViT Wins in Production

**Retrieval and similarity search.** Embedding images with a pretrained ViT and searching by cosine similarity works well for duplicate detection, visual search, and content moderation. Near-duplicate image matching is one signal in larger deduplication pipelines.
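
A sketch of the search step, assuming the embeddings already exist. The three-dimensional vectors and file names below are made up for illustration; a real ViT backbone emits hundreds of dimensions, and at scale you would swap the linear scan for an approximate nearest-neighbor index.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest(query, index, top_k=2):
    """Return the top_k (name, score) pairs, highest similarity first."""
    scored = [(name, cosine(query, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical embeddings standing in for ViT outputs.
index = {
    "cat_original.jpg": [0.9, 0.1, 0.0],
    "cat_resized.jpg": [0.88, 0.12, 0.01],
    "invoice_scan.jpg": [0.0, 0.2, 0.95],
}
hits = nearest([0.9, 0.1, 0.0], index)
print([name for name, _ in hits])  # ['cat_original.jpg', 'cat_resized.jpg']
```

In a deduplication pipeline the score would be compared against a threshold tuned on your own data, then combined with other signals rather than trusted alone.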

**Semi-supervised and self-supervised pipelines.** When labeling data is expensive and you have lots of unlabeled images, ViT backbones pretrained with DINO or MAE extract features that reduce labeling requirements.

**Multi-modal systems.** If you are already running a transformer for language (you are), sharing architectural patterns between vision and language encoders simplifies the stack. CLIP-style dual encoders enable zero-shot classification that CNN pipelines cannot match without retraining.

**High-resolution document and scene understanding.** When you need global context across an entire page or scene and can afford the compute, ViT attention captures long-range dependencies that CNNs need deep stacks to approximate.

## Where CNNs Still Win

**Real-time video on edge hardware.** Factory inspection, autonomous drones, mobile AR. Convolutions with INT8 quantization on NPUs and TPUs still beat ViT on latency and power at equivalent accuracy for most edge tasks.

**Small datasets without pretraining budget.** If you have 500 labeled images and no GPU cluster for self-supervised pretraining, a fine-tuned EfficientNet-B0 will outperform a ViT-B/16 in most cases.

**Mature deployment tooling.** TensorRT, ONNX Runtime, CoreML, and TFLite have years of CNN optimization. ViT support exists but the tooling edge goes to convolutions for now.

**Proven architectures for detection and segmentation.** YOLO, Mask R-CNN, and U-Net variants with CNN backbones are battle-tested in production. Transformer-based detectors (DETR family) are improving but the ecosystem is younger.

## What I Actually Recommend

1. **Default to CNN backbones** unless you have a specific reason to use ViT (retrieval, multi-modal, large-scale pretrain available).
2. **Use DINOv2 features** when you need strong general-purpose embeddings and compute is not the bottleneck.
3. **Benchmark on your data, your hardware, your latency budget.** ImageNet accuracy is irrelevant if your deployment target is a Raspberry Pi.
4. **Do not rewrite a working CNN pipeline** because a paper showed ViT wins on ImageNet by 0.3%. That is not engineering. That is vanity.
5. **Watch hierarchical ViTs** (Swin, PVT) if you need transformer benefits with better compute scaling for high-resolution inputs.
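
Recommendation 3 can be sketched as a tiny harness. Everything here is an assumption for illustration: `fake_model` stands in for real inference, the 200 ms budget is a placeholder, and the percentile is the simple nearest-rank version. Run it on the actual deployment hardware, not your laptop.

```python
import time


def percentile(samples, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * pct // 100))  # ceiling division
    return ordered[int(rank) - 1]


def benchmark(fn, example, warmup=5, runs=50):
    """Time fn(example) in milliseconds after a few warmup calls."""
    for _ in range(warmup):
        fn(example)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(example)
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


def within_budget(timings_ms, budget_ms, pct=95):
    """True when the chosen percentile latency fits inside the budget."""
    return budget_ms >= percentile(timings_ms, pct)


def fake_model(image):
    # Stand-in for real inference; replace with your model call.
    return sum(image)


timings = benchmark(fake_model, list(range(1000)))
print(percentile(timings, 50), percentile(timings, 95), within_budget(timings, 200.0))
```

Tail latency, not the mean, is what a latency budget is usually judged on, which is why the check uses the 95th percentile.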

## The Research vs Production Gap

Academic CV in 2023 optimizes for benchmark rankings. Production CV optimizes for accuracy per dollar per millisecond per watt. These objectives diverge constantly.

ViT papers report results with massive pretraining, multi-GPU inference, and test-time augmentation. Your production pipeline has a single T4 GPU, no TTA, and an SLA of 200ms per image. The model that wins in the paper is not the model that wins in your datacenter.

Be honest about which game you are playing. If you are publishing research, chase SOTA. If you are shipping product, chase the Pareto frontier on your actual constraints.

## Closing

Vision Transformers are a real architectural advance. They are not a universal replacement for convolutions in 2023. DINOv2 makes ViT backbones practical for embedding and transfer learning tasks. CNNs remain the default for latency-constrained edge deployment.

The real talk: use the right backbone for your constraints, not the backbone from the most recent arXiv paper. Benchmark everything. Ship what works. Ignore the &quot;CNNs are dead&quot; discourse. They are very much alive in every factory and phone in the world.</content:encoded></item><item><title>Starting a B2B Startup in India in 2023</title><link>https://utso.stamped.work/blog/2023-08-10-founding-b2b-startup-in-india-2023/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2023-08-10-founding-b2b-startup-in-india-2023/</guid><description>A pre-launch checklist for Indian B2B founders covering entity structure, compliance, and the mistakes that cost months.</description><pubDate>Thu, 10 Aug 2023 00:00:00 GMT</pubDate><content:encoded>Starting a B2B company in India in 2023 is not the same as starting a consumer app in 2015. The playbook has changed. Enterprise buyers expect security questionnaires, GST compliance, data localization awareness, and vendors who can invoice properly. Regulators expect you to have chosen the right entity structure before you raise, not after.

Here is the checklist I wish I had on day one when I was still researching B2B markets, plus the entity structure decision that every founder gets wrong.

## Pre-Launch Checklist

```mermaid
flowchart TD
    Start([B2B startup idea validated]) --&gt; E{Choose entity type}
    E --&gt;|Solo / 2 founders, services-heavy| LLP[LLP registration]
    E --&gt;|Raising VC, ESOP, scale intent| Pvt[Pvt Ltd registration]

    LLP --&gt; C1[DIN / DPIN for partners]
    Pvt --&gt; C1

    C1 --&gt; C2[Register on MCA portal]
    C2 --&gt; C3[Obtain PAN and TAN]
    C3 --&gt; C4[Open current account]
    C4 --&gt; C5[GST registration if turnover &gt; threshold or B2B inter-state]

    C5 --&gt; C6[Professional tax registration - state specific]
    C6 --&gt; C7[Shop and Establishment Act registration]
    C7 --&gt; C8[Startup India recognition - optional but useful]

    C8 --&gt; Ops[Operational readiness]
    Ops --&gt; O1[Accounting system - Zoho Books / Tally]
    Ops --&gt; O2[Contract templates - MSA, SOW, NDA]
    Ops --&gt; O3[Data privacy policy and DPDP awareness]
    Ops --&gt; O4[IP assignment agreements for founders and contractors]

    O4 --&gt; GTM[Go-to-market readiness]
    GTM --&gt; G1[ICP defined with budget authority mapped]
    GTM --&gt; G2[Pilot pricing and invoicing workflow tested]
    GTM --&gt; G3[First 3 design partner LOIs or MOUs]
    GTM --&gt; G4[Security questionnaire baseline answers ready]

    G4 --&gt; Launch([Launch sales motion])
```

Most founders skip straight from idea to &quot;build product.&quot; B2B buyers in India will ask for your GSTIN before they ask for your demo.

## LLP vs Pvt Ltd: The Decision That Matters

This is the question I get most often from student founders and first-time entrepreneurs in India.

**Limited Liability Partnership (LLP):**

- Cheaper and faster to incorporate
- Fewer compliance requirements (no mandatory audit below turnover threshold in many cases)
- Partners, not shareholders. No shares, no ESOP pool
- Harder to raise institutional VC (most funds prefer Pvt Ltd)
- Suitable for consulting, agency work, early-stage validation with services revenue

**Private Limited Company (Pvt Ltd):**

- Required for VC fundraising, ESOP grants, and most accelerator programs
- More compliance: annual filings, board meetings, audited financials
- Share-based cap table that investors understand
- Limited liability for shareholders, with a clearer corporate governance framework
- The default choice if you intend to raise equity capital within 18 months

My rule: if you are building a product company that will raise venture capital, incorporate as Pvt Ltd from the start. Converting LLP to Pvt Ltd later is possible but wastes time and legal fees. If you are bootstrapping a services business that might become a product company, LLP buys you runway to validate before committing to Pvt Ltd compliance overhead.

## Compliance Nobody Warns You About

**GST for B2B SaaS.** Software as a service supplied to Indian businesses is taxable. Inter-state B2B supply requires GST registration regardless of turnover threshold in most cases. If you invoice a client in Maharashtra from your Karnataka entity, you need GST registration and IGST on the invoice. Get this wrong and enterprise clients will not pay your invoices.

**TDS on vendor payments.** When you pay contractors above threshold amounts, you deduct TDS. When clients pay you, they may deduct TDS on your invoices. Understand Form 16A, TDS certificates, and how this affects your cash flow. Enterprise clients will deduct 10% TDS on your SaaS invoice by default if you have not provided a lower deduction certificate.
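
The cash-flow effect of the two paragraphs above is easier to see as arithmetic. A rough sketch under loud assumptions: the 18 percent GST rate and the choice to compute TDS on the pre-GST amount are my illustration defaults, the 10 percent TDS figure is the one quoted above, and the function name is hypothetical. Treat it as a spreadsheet check, not tax advice.

```python
def b2b_invoice(base_inr, supplier_state, buyer_state, gst_rate=0.18, tds_rate=0.10):
    """Split GST by place of supply and estimate cash received after TDS."""
    gst = round(base_inr * gst_rate, 2)
    if supplier_state == buyer_state:
        # Same state: tax is split into central and state halves.
        taxes = {"CGST": round(gst / 2, 2), "SGST": round(gst / 2, 2)}
    else:
        # Different states: a single integrated tax line.
        taxes = {"IGST": gst}
    tds = round(base_inr * tds_rate, 2)  # assumption: TDS on the pre-GST amount
    return {
        "taxes": taxes,
        "invoice_total": round(base_inr + gst, 2),
        "tds_withheld": tds,
        "cash_received": round(base_inr + gst - tds, 2),
    }


invoice = b2b_invoice(100000.0, "Karnataka", "Maharashtra")
print(invoice["taxes"], invoice["invoice_total"], invoice["cash_received"])
# {'IGST': 18000.0} 118000.0 108000.0
```

On a one lakh invoice you bill 1.18 lakh, owe 18,000 of it to the government, and see 10,000 held back until you claim the credit. Plan working capital around the number that actually lands in the bank.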

**DPDP Act awareness.** India&apos;s Digital Personal Data Protection Act passed in 2023. B2B companies processing personal data of end users (even on behalf of clients) need privacy policies, data processing agreements, and consent mechanisms. Enterprise buyers will ask about this in security reviews.

**Contractor vs employee classification.** Hiring developers as contractors to save on PF and ESI is common and risky. If the &quot;contractor&quot; works exclusively for you, uses your equipment, and follows your hours, they are an employee under Indian labor law regardless of what the contract says.

## B2B Sales in India: What Is Different

Indian B2B sales cycles are long. Six to eighteen months for enterprise deals is normal, not a sign your product is wrong. Decision-making involves multiple stakeholders, procurement committees, and price negotiation culture that US SaaS playbooks do not prepare you for.

**Design partners before paid pilots.** Indian enterprises want to see your product working in their environment before they commit budget. Structure design partnerships with clear success criteria and a conversion path to paid contracts.

**Founder-led sales is mandatory early.** You do not have a brand. Your website is six months old. The founder must close the first ten deals to learn what buyers actually care about vs what you think they care about.

**Reference customers are currency.** One logo from a recognizable Indian enterprise is worth more than ten features on your roadmap. Over-invest in making early customers successful even if the contract value is small.

**Pricing in INR vs USD.** Indian enterprises prefer INR invoicing. If your costs are in USD (cloud, APIs), build FX buffer into pricing. Do not price at US SaaS rates with a currency label swap. Indian buyers know the difference.

## Banking and Finance Setup

Open a current account with a bank that understands startups. HDFC, ICICI, and Axis all have startup banking programs. You need:

- Current account in company name (not founder personal account)
- Payment gateway for online invoicing (Razorpay, Cashfree for B2B)
- Accounting software connected to bank feed from month one
- Separate tracking of founder loans vs equity investment vs revenue

Do not commingle personal and company funds. Auditors and investors will find out. It creates cap table and tax nightmares.

## Common Mistakes I See

**Incorporating too late.** You have a paying client but no entity. You invoice from a personal account. Fix this before the second client, not after the tenth.

**No IP assignment from day one.** Every founder and contractor must assign IP to the company. Without this, an acquirer or investor will find the gap in diligence and use it to renegotiate.

**Ignoring compliance until fundraising.** Due diligence will surface every missed GST filing, every unsigned contract, every TDS mismatch. Clean it up before you need the money, not during the process.

**Building for US buyers while incorporated in India without a plan.** If your market is US enterprises, you may need a US entity for contracting. Understand when to add a Delaware C-Corp alongside your Indian operating company.

## Closing

Starting a B2B startup in India in 2023 requires more administrative groundwork than the Twitter founder myth suggests. That groundwork is not optional. Enterprise buyers and regulators treat it as table stakes.

Incorporate correctly. Get compliance right early. Sell before you scale the team. The founders who skip these steps do not move faster. They just hit walls later, when the walls are more expensive.

Build the boring infrastructure first. Then build the product. Then sell relentlessly. That is the India B2B playbook in 2023.</content:encoded></item><item><title>Supabase vs Firebase for a Solo Android Developer</title><link>https://utso.stamped.work/blog/2023-07-04-supabase-vs-firebase-solo-android-dev/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2023-07-04-supabase-vs-firebase-solo-android-dev/</guid><description>July 2023: I had shipped Firebase on Flutter apps for years. Supabase kept showing up as the open-source alternative. I finally compared them properly.</description><pubDate>Tue, 04 Jul 2023 00:00:00 GMT</pubDate><content:encoded>I have been publishing Android apps under Strato Inc since 2020. Every backend I shipped before mid-2023 was **Firebase**: Auth, Firestore or Realtime Database, Cloud Storage, the occasional Cloud Function. It worked. It also felt like renting an apartment where the landlord can change the lease and you have no SQL escape hatch.

This summer I was between major products and had time to test **Supabase** seriously — not a weekend hello-world, but &quot;could I rebuild a Strato-scale app on this stack?&quot; Here is the comparison as of **July 2023**, from a solo developer in India with one laptop and no DevOps team.

## What Each Stack Actually Is (July 2023)

**Firebase** is Google&apos;s managed BaaS. You think in documents, security rules, and Google Cloud billing. The Flutter SDK is mature. The mental model is &quot;append JSON, pray for indexes.&quot;

**Supabase** is an open-source layer on **PostgreSQL**: database, auth, storage, edge functions, and realtime subscriptions. You can use their hosted cloud or self-host with Docker. The mental model is &quot;it&apos;s Postgres with batteries.&quot;

Neither is magic. Both will waste your weekend if you skip reading the security model.

```mermaid
flowchart LR
    subgraph firebase [Firebase - July 2023]
        FA[Firebase Auth]
        FS[Firestore / RTDB]
        FST[Cloud Storage]
        CF[Cloud Functions]
    end

    subgraph supabase [Supabase - July 2023]
        SA[Supabase Auth]
        PG[PostgreSQL]
        SST[Supabase Storage]
        EF[Edge Functions]
        RT[Realtime on Postgres]
    end

    App[Flutter Android app] --&gt; firebase
    App --&gt; supabase
```

## Where Firebase Still Wins

**Flutter integration depth.** Firebase&apos;s Flutter plugins are first-party energy. Auth flows, Crashlytics hooks, analytics: if you live entirely in Google land, the path of least resistance is real.

**Document model for chaotic prototypes.** When your schema changes every week because you are 18 and still figuring out the product, schemaless documents forgive you. Postgres wants migrations. Migrations want discipline.

**Offline sync on mobile.** Firestore&apos;s offline persistence story was still ahead of what I could get working with Supabase + Postgres in the same afternoon in July 2023. For field apps with flaky 3G, that mattered.

**I already knew the failure modes.** After two years on Firebase I knew where the bills spike, where rules get weird, where composite indexes bite. Switching stacks has a learning tax.

## Where Supabase Pulls Ahead

**SQL.** Joins. Aggregations. `SELECT` statements that do not feel like a crime. If your app has reporting, ledgers, or anything relational, fighting Firestore&apos;s query limitations gets old fast.

**Row Level Security (RLS).** Policies live in the database. In July 2023 this was Supabase&apos;s killer feature for me: auth-aware Postgres rules that feel closer to how a real backend should enforce access. Firebase Security Rules work, but debugging them at 1 AM is a different kind of suffering.

**Open source and portability.** The project is on GitHub. You can self-host. If Supabase-the-company changes pricing or disappears, you are not locked to a proprietary document store — you have Postgres dumps and a migration path.

**Pricing transparency at hobby scale.** Firebase&apos;s free tier is generous until it is not. Firestore reads add up quietly. Supabase&apos;s hosted tiers in 2023 were straightforward for a side project: Postgres size, bandwidth, monthly active users. I could spreadsheet it without a PhD in Google Cloud billing.

**Edge Functions (Deno).** By mid-2023 Supabase Edge Functions were usable for webhooks and light API glue. Firebase Cloud Functions were the incumbent. Neither is fun, but having TypeScript at the edge next to your database is a nice mental stack.

## What I Tested Hands-On

I did not migrate a production app. I rebuilt a slice of a familiar pattern:

1. Email + magic link auth
2. User-owned rows in a table with RLS
3. Image upload to object storage with signed URLs
4. A realtime listener when a row changes

**Auth:** Both worked in Flutter. Supabase&apos;s Flutter client was fine in July 2023 — not as polished as Firebase Auth&apos;s docs ecosystem, but not scary.

**Queries:** Supabase won any &quot;show me orders grouped by restaurant&quot; task in one SQL file. Firestore wanted composite indexes and denormalized counters.
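
For flavor, here is the grouped-orders query as a runnable sketch. It uses the SQLite module from the Python standard library as a stand-in for Postgres, so it runs anywhere; the table names, columns, and sample rows are hypothetical, and this particular query is plain SQL that Postgres accepts as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE restaurants (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        restaurant_id INTEGER NOT NULL REFERENCES restaurants(id),
        amount_inr INTEGER NOT NULL
    );
    INSERT INTO restaurants VALUES (1, 'Dosa Corner'), (2, 'Biryani House');
    INSERT INTO orders (restaurant_id, amount_inr) VALUES (1, 120), (1, 80), (2, 350);
    """
)

# One join and one GROUP BY: order count and revenue per restaurant.
rows = conn.execute(
    """
    SELECT r.name, COUNT(o.id) AS order_count, SUM(o.amount_inr) AS revenue_inr
    FROM orders AS o
    JOIN restaurants AS r ON r.id = o.restaurant_id
    GROUP BY r.name
    ORDER BY revenue_inr DESC
    """
).fetchall()
print(rows)  # [('Biryani House', 1, 350), ('Dosa Corner', 2, 200)]
```

The Firestore equivalent is typically a counter document per restaurant that you keep in sync on every write, plus whatever composite indexes the console asks for.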

**Realtime:** Firebase felt snappier out of the box on mobile. Supabase realtime over Postgres changes worked, but I spent more time wiring listeners and reading the docs to understand replication slots.

**Self-host curiosity:** I spun up Supabase locally with Docker once. It is not trivial — you are operating Postgres, Kong, GoTrue, etc. For a solo dev, hosted Supabase is the realistic choice. Self-host is a flex for when you outgrow free tiers or have ops help.

## The India / Solo-Dev Constraints

**Latency:** Both hosted stacks serve from regions you pick. For Indian users, region choice matters more than brand. Measure RTT; do not assume.

**Payment friction:** Firebase bills through Google. Supabase bills through Stripe. Neither is perfect for a teenager with a debit card and anxiety about surprise charges. Set billing alerts on day one.

**Sleep budget:** Firebase lets you ignore ops until you cannot. Supabase still hides most ops in hosted mode, but Postgres backups and migration discipline are on you eventually.

## What Supabase Did Not Have (Yet, in July 2023)

I am writing this in July, so I will not pretend I know what ships next quarter.

As of now I would **not** bet a production app on:

- Supabase matching Firebase&apos;s full mobile offline story without custom work
- Edge Functions replacing a proper backend when you need long-running jobs
- &quot;We will self-host&quot; as a plan with zero ops experience (I include myself)

I also had not seen Supabase become the default answer in every Flutter tutorial. Firebase still owns the beginner funnel.

## My Actual Decision in Summer 2023

I did not rip Firebase out of live apps. Users do not care about your ideological preference for SQL.

I **did** start new experiments on Supabase when the data model was relational from day one — anything with settlements, ledgers, or reporting. I kept Firebase for fast document-shaped prototypes and apps where offline-first mobile was the core requirement.

That split stack is annoying. It is also honest for a solo publisher without a platform team.

## Takeaway

If you are a solo Android developer choosing in **July 2023**:

- Pick **Firebase** when you need the fastest Flutter path, document flexibility, and offline mobile sync — and you accept vendor shape.
- Pick **Supabase** when you want **Postgres, SQL, RLS, and open-source exit ramps** — and you will read the docs instead of cargo-culting rules.

There is no correct answer for all apps. There is a correct answer per schema, and most founders pick Firebase first because the tutorial did.

I am glad I ran this comparison while I had summer hours. College was coming. Side projects would get shorter. Knowing where SQL beats documents saves you a rewrite later — even if later-me still has not learned to love migrations.</content:encoded></item><item><title>What SVB&apos;s Collapse Taught Indian Founders About Banking Risk</title><link>https://utso.stamped.work/blog/2023-05-22-svb-collapse-lesson-for-indian-founders/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2023-05-22-svb-collapse-lesson-for-indian-founders/</guid><description>March 2023&apos;s Silicon Valley Bank crisis exposed treasury management gaps that Indian founders with US exposure cannot ignore.</description><pubDate>Mon, 22 May 2023 00:00:00 GMT</pubDate><content:encoded>On March 10, 2023, Silicon Valley Bank failed. Within 72 hours, the second and third largest bank failures in US history had occurred. Indian founders watched from Bangalore and Mumbai wondering if their US-domiciled holding companies, their SVB accounts, and their pending wire transfers were about to evaporate.

Some lost access to payroll for US employees. Some had millions locked for days. Some discovered they had concentrated 100% of their treasury in a single institution because &quot;that is what everyone in YC did.&quot; The FDIC made depositors whole above the $250k insurance limit, but the scare was real and the lessons are permanent.

## What Actually Happened

SVB took deposits from tech startups and venture capital firms, invested heavily in long-duration Treasuries and mortgage-backed securities, and failed to hedge interest rate risk. When the Fed raised rates, those bonds lost value. When VC funding slowed and startups burned cash, deposit outflows accelerated. SVB sold bonds at a loss, announced a capital raise, and triggered a bank run. Regulators closed the bank on a Friday. By Sunday, the Treasury, Fed, and FDIC announced all depositors would be made whole.

The speed was the terrifying part. A bank that looked fine on Monday was gone by Friday.

## Cascade: How SVB Failure Hit Startups

```mermaid
flowchart TD
    A[Fed raises interest rates] --&gt; B[SVB bond portfolio loses value]
    B --&gt; C[SVB announces $2.25B capital raise]
    C --&gt; D[VC firms advise portfolio cos to withdraw]
    D --&gt; E[Bank run: $42B withdrawn in one day]
    E --&gt; F[SVB seized by FDIC]

    F --&gt; G1[Startups lose payroll access]
    F --&gt; G2[Pending wire transfers frozen]
    F --&gt; G3[VC funds locked mid-close]
    F --&gt; G4[Credit lines and cards suspended]

    G1 --&gt; H[Founders scramble to open new accounts]
    G2 --&gt; H
    G3 --&gt; H
    G4 --&gt; H

    H --&gt; I[Multi-day delays on new bank onboarding]
    I --&gt; J[Runway calculations suddenly wrong]
```

Indian founders with US entities felt this cascade even if their deposits were eventually safe. The operational disruption was immediate.

## Indian Founder Exposure: The Specifics

Indian startups raising from US VCs often set up a US parent entity, typically a Delaware C-Corp, for fundraising while operating from India. This creates banking complexity:

**US bank accounts for US entities.** Many used SVB because it was the default recommendation from US investors, accelerators, and lawyers. SVB understood startup cap tables, venture debt, and the rhythm of fundraising. That convenience created concentration risk.

**USD treasury held in US banks.** Indian founders who raised in dollars often kept those dollars in US accounts for US payroll, US vendors, and future US expansion. Rupee accounts in India do not help when your burn is in dollars and your employees are in San Francisco.

**Wire transfer timing.** Founders mid-fundraise had capital calls and closings delayed because counterparties could not verify bank details, new accounts needed setup, and trust in &quot;startup-friendly&quot; banks evaporated overnight.

**Venture debt exposure.** Startups with SVB venture debt lines faced uncertainty about whether those facilities would be honored by the acquirer (First Citizens Bank eventually acquired SVB&apos;s assets).

## Treasury Management Lessons

If SVB taught one lesson, it is that treasury management is a founder responsibility, not something you delegate to &quot;whoever opened the account at demo day.&quot;

**Never concentrate more than the $250k FDIC insurance limit in a single bank without a plan.** Spread across multiple institutions. Use treasury management products that sweep into money market funds. Boring? Yes. That is the point.

**Maintain 6+ months runway in liquid, diversified accounts.** Not in long-duration bonds your bank bought. Not in crypto. Not in your co-founder&apos;s personal account because the wire was faster.

**Know your bank&apos;s balance sheet.** You do not need to be a fixed income analyst. You need to know if your bank has concentrated exposure to a single sector (tech deposits) and interest rate risk. SVB&apos;s 10-K was public. The risk was visible to anyone who read it.

**Have a backup bank before you need one.** The worst time to open a business account at Mercury or Brex is the weekend your primary bank fails. Open the backup account now. Keep it funded with enough to cover two weeks of operations.

**Document your wire instructions.** When banks fail, you re-establish payment rails. Having payroll, vendor payments, and investor wire details documented saves days of chaos.
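
Two of these lessons reduce to arithmetic you can run against your own numbers. The sketch below is a rough illustration, assuming the standard $250k FDIC limit per depositor per bank, a 30-day month, and evenly spread burn; the bank names, balances, and burn figure are made up.

```python
FDIC_LIMIT_USD = 250_000  # standard FDIC insurance limit per depositor, per bank


def uninsured_exposure(balances):
    """Return the amount above the FDIC limit at each bank."""
    return {bank: max(0, amount - FDIC_LIMIT_USD) for bank, amount in balances.items()}


def backup_float(monthly_burn_usd, days=14):
    """Cash needed in the backup account to cover `days` of operations.

    Assumes a 30-day month and evenly spread burn, which is a simplification.
    """
    return monthly_burn_usd * days / 30


# Hypothetical balances and burn, for illustration only.
balances = {"primary_bank": 1_200_000, "backup_bank": 200_000}
exposure = uninsured_exposure(balances)
print(exposure)
print(backup_float(monthly_burn_usd=150_000))
```

Here the primary bank carries $950k above the limit and the backup account needs about $70k to cover two weeks. Neither number is a recommendation. The point is that you should know both before a Friday afternoon forces the question.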

## The India-Specific Angle

Indian founders operate in a different regulatory environment but face parallel risks:

**RBI regulations on foreign exchange.** Moving USD between US and Indian entities involves FEMA compliance, authorized dealer banks, and documentation. A banking crisis in the US does not suspend Indian regulatory requirements. Founders who had never thought about FX hedging started thinking about it fast.

**Indian bank stability is different but not risk-free.** Yes, Indian banking is more conservatively regulated in some dimensions. No, that does not mean you should hold 100% of your treasury in a single Indian bank either. Diversification is jurisdiction-agnostic.

**Fundraising timing.** US VC funds that lost access to capital call mechanisms or had LP concerns about bank exposure slowed deployment. Indian founders raising Series A in Q2 2023 felt this as extended diligence cycles and &quot;let us wait and see&quot; responses.

## What Changed After March 2023

**Investors now ask about treasury in diligence.** &quot;Where is your cash?&quot; became a standard question. Founders who cannot answer clearly lose credibility.

**Multi-bank setups became normal.** The stigma of &quot;spreading cash around&quot; disappeared. Prudence replaced convenience.

**US entity structuring got more scrutiny.** Lawyers started recommending treasury diversification clauses in board resolutions. Some VCs updated their portfolio company guidelines.

**Startup-focused fintechs gained share.** Mercury, Brex, and others picked up SVB refugees. Whether they are safer long-term is debatable, but the monoculture broke.

## What I Did

After SVB, I audited every account across my entities. Spread USD holdings across two US institutions. Ensured Indian operating accounts had sufficient rupee runway independent of US banking status. Documented every wire instruction. Set a calendar reminder to review treasury quarterly, not annually.

It took one day. The peace of mind was worth more than the interest rate optimization I gave up by not concentrating at a single bank.

## Closing

SVB&apos;s collapse was a liquidity and interest rate risk failure, not a tech failure. Indian founders with US exposure learned that banking is infrastructure, and infrastructure fails. The FDIC backstop was not guaranteed in advance. Treating it as guaranteed next time is the same mistake as treating SVB as too-big-to-fail for startups.

Diversify your treasury. Understand your exposure. Have backup rails. The next bank failure might not resolve over a weekend.

Runway is not just how much cash you have. It is how accessible that cash is when something breaks.</content:encoded></item><item><title>Fine-tuning vs RAG vs Prompting</title><link>https://utso.stamped.work/blog/2023-04-18-fine-tuning-vs-rag-vs-prompting/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2023-04-18-fine-tuning-vs-rag-vs-prompting/</guid><description>A decision framework for choosing how to adapt LLMs to your domain, with honest tradeoffs and no silver bullets.</description><pubDate>Tue, 18 Apr 2023 00:00:00 GMT</pubDate><content:encoded>Every team building on LLMs hits the same question: how do we make the model good at our specific task? The three options on the table are prompting, retrieval-augmented generation (RAG), and fine-tuning. Conference talks treat them as interchangeable. They are not. Each has different cost profiles, latency characteristics, maintenance burdens, and failure modes.

I have seen teams fine-tune when prompting would suffice, wasting months and tens of thousands of dollars. I have seen teams prompt-engineer endlessly when a small fine-tune would have solved the problem in a week. The choice is not about which approach is &quot;best.&quot; It is about which approach fits your constraints.

## The Decision Framework

```mermaid
flowchart TD
    Start([Adapt LLM to your task]) --&gt; Q1{Knowledge changes frequently?}
    Q1 --&gt;|Yes, daily/weekly| RAG[RAG + prompting]
    Q1 --&gt;|No, stable domain| Q2{Need specific output format/style?}

    Q2 --&gt;|Yes, rigid schema| Q3{Have 500+ quality examples?}
    Q2 --&gt;|No, flexible outputs| Q4{Context fits in window?}

    Q4 --&gt;|Yes| Prompt[Prompt engineering + few-shot]
    Q4 --&gt;|No| RAG

    Q3 --&gt;|Yes| FT[Fine-tuning]
    Q3 --&gt;|No| Q5{Can you generate synthetic examples?}

    Q5 --&gt;|Yes| Synth[Synthetic data + fine-tune]
    Q5 --&gt;|No| Prompt

    RAG --&gt; Q6{Retrieval quality sufficient?}
    Q6 --&gt;|No| FixRAG[Fix chunking / embeddings first]
    Q6 --&gt;|Yes| Ship[Ship and iterate]

    FT --&gt; Q7{Base model updates break you?}
    Q7 --&gt;|Yes| RAG
    Q7 --&gt;|No| Ship

    Prompt --&gt; Ship
    Synth --&gt; Ship
    FixRAG --&gt; RAG
```

This is a starting point, not gospel. Your specific task may violate every assumption here. But it beats the default approach of &quot;let&apos;s fine-tune because it sounds serious.&quot;
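
If you prefer code to diagrams, the same branches can be sketched as a plain function. This is one reading of the flowchart, not a tool I ship: the function name and parameters are hypothetical, and it leaves out the retrieval-quality loop because that is an ongoing check rather than a one-time choice.

```python
def choose_adaptation(
    knowledge_changes_often,
    needs_rigid_format,
    quality_examples,
    can_generate_synthetic,
    context_fits_window,
    base_model_updates_break_you=False,
):
    """Walk the flowchart above and return the suggested starting approach."""
    if knowledge_changes_often:
        return "rag"
    if not needs_rigid_format:
        # Flexible outputs: prompt if the context fits, otherwise retrieve.
        return "prompting" if context_fits_window else "rag"
    if quality_examples >= 500:
        # The flowchart routes back to RAG when base model churn is a problem.
        return "rag" if base_model_updates_break_you else "fine-tuning"
    if can_generate_synthetic:
        return "synthetic data + fine-tuning"
    return "prompting"


# Hypothetical case: stable domain, rigid schema, only 120 labeled examples.
print(
    choose_adaptation(
        knowledge_changes_often=False,
        needs_rigid_format=True,
        quality_examples=120,
        can_generate_synthetic=False,
        context_fits_window=True,
    )
)
```

Writing it out this way makes the thresholds explicit, which also makes them easy to argue with.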

## Prompting: The Underrated Baseline

Prompt engineering gets mocked as &quot;not real engineering.&quot; This is stupid. Prompting is the fastest iteration loop available. Change a system prompt, run your eval set, see results in minutes. No training pipeline. No GPU cluster. No data labeling budget.

Prompting works well when:

- Your task fits in the context window with room for examples
- Output format is flexible or enforceable via structured prompting
- You need to ship this week, not next quarter
- Your domain knowledge can be expressed as instructions, not implicit in thousands of examples

Prompting fails when:

- The model consistently ignores instructions despite careful prompt design
- You need behavior that requires internalizing patterns too complex for in-context demonstration
- Latency and cost from long prompts with many few-shot examples exceed fine-tuning inference costs
- You are stuffing 50 examples into every request and calling it a &quot;prompt strategy&quot;

- **Cost profile:** API inference only. Scales linearly with prompt length.
- **Maintenance:** Low. Update prompts in code, redeploy.
- **Latency:** Depends on prompt size. Long few-shot prompts hurt.

## RAG: When Your Knowledge Is External and Dynamic

RAG separates the model&apos;s reasoning capability from your domain knowledge. You retrieve relevant documents at query time and inject them into the prompt. The model answers using provided context.

RAG works well when:

- Your knowledge base changes frequently (product docs, support articles, legal regulations)
- The total knowledge exceeds any context window
- You need citations and traceability for answers
- You want to update knowledge without retraining

RAG fails when:

- Retrieval returns wrong or incomplete context (most common failure)
- The task requires synthesizing information across many documents in non-obvious ways
- Your documents are poorly structured for chunking
- You use RAG as a substitute for fixing a model that lacks basic reasoning capability

- **Cost profile:** Embedding API + vector storage + inference. Re-indexing costs when documents change.
- **Maintenance:** Medium-high. Pipeline for ingestion, chunking, embedding, index updates.
- **Latency:** Retrieval step adds 50-200ms depending on infrastructure.

The dirty secret of RAG in 2023: most teams should spend 80% of their effort on document quality and chunking strategy and 20% on retrieval infrastructure. Most teams do the reverse.

## Fine-tuning: When You Need Behavior, Not Knowledge

Fine-tuning adapts model weights to your task. The model internalizes patterns from training examples rather than reading them at inference time.

Fine-tuning works well when:

- You have hundreds to thousands of high-quality input-output pairs
- You need consistent output format, tone, or style the base model resists via prompting
- Inference cost matters and a shorter fine-tuned model replaces long few-shot prompts
- The task is stable and will not change with every product release

Fine-tuning fails when:

- You have fewer than 200 quality examples (results will be unreliable)
- Your knowledge changes frequently (model goes stale)
- You fine-tune for knowledge injection instead of behavior shaping (use RAG)
- You do not have an eval set to detect regression when the base model updates

- **Cost profile:** Training compute (one-time per version) + inference. The OpenAI fine-tuning API charges for training tokens, and inference on the fine-tuned model is priced at a premium.
- **Maintenance:** Medium. Retrain when base model updates or task distribution shifts.
- **Latency:** Lower than long-prompt RAG for equivalent quality on behavior tasks.
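
Whether the shorter prompt pays for the training run is a break-even calculation. The sketch below shows the shape of it. Every number is a placeholder rather than real pricing, and it assumes input and output tokens share one price, which is a simplification.

```python
import math


def cost_per_request(prompt_tokens, output_tokens, price_per_1k_tokens):
    """API cost for one request when input and output share one price."""
    return (prompt_tokens + output_tokens) * price_per_1k_tokens / 1000


def break_even_requests(training_cost, few_shot_cost, fine_tuned_cost):
    """Requests needed before the training spend is recovered.

    Returns None when the fine-tuned request is not cheaper, because
    the training cost is then never recovered.
    """
    saving = few_shot_cost - fine_tuned_cost
    if not saving > 0:
        return None
    return math.ceil(training_cost / saving)


# Hypothetical numbers, not real pricing: a long few-shot prompt on a cheap
# base model versus a short prompt on a fine-tuned model at four times the rate.
few_shot = cost_per_request(4000, 200, price_per_1k_tokens=0.002)
fine_tuned = cost_per_request(300, 200, price_per_1k_tokens=0.008)
print(break_even_requests(50.0, few_shot, fine_tuned))
```

With these placeholder figures the training run pays for itself after roughly eleven thousand requests. If the premium wipes out the saving, the function returns `None` and the cost argument for fine-tuning disappears.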

## The Combinations That Actually Work

Real production systems combine approaches:

**Prompt + RAG:** The default stack for most B2B AI products in 2023. RAG for knowledge, prompting for behavior and format control.

**Prompt + fine-tuning:** Fine-tune for style and format, prompt for task-specific instructions that change frequently.

**RAG + fine-tuning:** Fine-tune a model to better use retrieved context (train on query-document-answer triples). Expensive to get right but powerful when retrieval alone is insufficient.

**All three:** Justified only at scale with dedicated ML infrastructure. Most startups should not start here.

## Common Mistakes I See Repeatedly

**Fine-tuning for facts.** If the answer is in your docs, RAG is cheaper and more maintainable. Fine-tuning factual knowledge into weights is how you get a model that confidently states outdated information.

**RAG without evaluation.** Teams ship retrieval pipelines without measuring whether the right chunks are retrieved. Answer quality is a downstream symptom of retrieval quality.
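
Measuring retrieval does not need a framework. A minimal sketch, assuming you have hand-labeled which chunk IDs are relevant for each eval question (the IDs and cases below are invented):

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("each eval case needs at least one relevant chunk")
    top_k = set(retrieved_ids[:k])
    return len(top_k.intersection(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(cases, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(found, relevant, k) for found, relevant in cases) / len(cases)


# Hypothetical eval cases: ranked retriever output vs. hand-labeled chunks.
cases = [
    (["c7", "c2", "c9"], {"c2"}),        # relevant chunk at rank 2
    (["c1", "c4", "c5"], {"c3", "c4"}),  # one of two relevant chunks found
    (["c8", "c6", "c0"], {"c5"}),        # retrieval miss
]
print(mean_recall_at_k(cases, k=3))
```

If this number is low, no amount of prompt tuning on the answer side will fix it.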

**Prompt engineering without an eval set.** You cannot improve what you do not measure. Ten examples in a spreadsheet is not an eval set. Build 100+ labeled cases before declaring prompting insufficient.

**Chasing base model updates.** When GPT-4 improves, your fine-tuned GPT-3 model gets relatively worse. Have a migration plan or accept that fine-tuning ties you to a model version.

## My Default Recommendation

Start with prompting. Add RAG if knowledge is external or dynamic. Fine-tune only when you have evidence that prompting and RAG cannot achieve required quality, and you have the data and eval infrastructure to do it properly.

This sequence minimizes wasted effort. Each step teaches you something about your task that informs the next step. Skipping straight to fine-tuning because it sounds more &quot;serious&quot; is how startups burn three months and learn nothing.

## Closing

There is no universally correct choice. There is a correct choice for your task, your data, your timeline, and your team&apos;s capabilities. The framework above is how I think about it when advising founders.

The LLM infrastructure vendors want you to believe fine-tuning is always the answer because fine-tuning generates training revenue. Vector DB vendors want you to believe RAG is always the answer because RAG generates storage revenue. Prompting generates nothing for vendors, which is exactly why you should start there.

Build the eval set. Run the experiments. Let metrics decide.</content:encoded></item><item><title>Vector Databases Are Overhyped</title><link>https://utso.stamped.work/blog/2023-03-30-vector-databases-overhyped/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2023-03-30-vector-databases-overhyped/</guid><description>When pgvector is enough, when Pinecone makes sense, and why most RAG pipelines fail before retrieval even matters.</description><pubDate>Thu, 30 Mar 2023 00:00:00 GMT</pubDate><content:encoded>Every LLM startup in 2023 has a vector database in their architecture diagram. Pinecone logos on pitch decks. Weaviate mentioned in Slack channels. Qdrant discussed like it is infrastructure oxygen. I am going to say the quiet part loud: for most teams, a dedicated vector database is premature optimization dressed up as AI strategy.

This is not anti-vector-search. Embeddings are useful. Similarity retrieval works. The problem is that teams reach for a managed vector DB before they have validated that retrieval is their bottleneck, before they have exhausted simpler options, and before they understand why their RAG pipeline produces garbage answers.

## The Hype Cycle in One Sentence

Investors ask &quot;what is your RAG stack?&quot; Founders panic and buy Pinecone. Nobody asks whether the documents are chunked correctly.

## pgvector vs Pinecone: An Honest Comparison

**pgvector** is a PostgreSQL extension. You store embeddings alongside your relational data. One database, one backup strategy, one connection pool, transactions that actually work. Query latency is fine for datasets under a few million vectors with proper indexing. You already have Postgres ops knowledge on your team because every startup has Postgres.

**Pinecone** is a managed vector database optimized for similarity search at scale. Sub-50ms queries on billion-vector indexes. Managed scaling. Zero ops if you trust their SLA. Costs scale with usage in ways that surprise founders who demo&apos;d on the free tier.

The tradeoff is not &quot;pgvector bad, Pinecone good.&quot; The tradeoff is operational complexity vs. query performance at scale vs. cost predictability.

For a B2B SaaS with 10,000 documents and 500 queries per day? pgvector in your existing Postgres instance is the correct answer 90% of the time. For a consumer app doing 10 million similarity searches per hour across 100 million vectors? You need purpose-built infrastructure.

Most startups are in the first category and architect for the second because the second category sounds more like a &quot;real AI company.&quot;

## Decision Tree: Vector DB vs Simpler Retrieval

```mermaid
flowchart TD
    Start([Need to retrieve context for LLM]) --&gt; Q1{Corpus size}
    Q1 --&gt;|&lt; 100k chunks| Q2{Already on Postgres?}
    Q1 --&gt;|100k - 10M chunks| Q3{Query latency SLA &lt; 100ms?}
    Q1 --&gt;|&gt; 10M chunks| VDB[Consider dedicated vector DB]

    Q2 --&gt;|Yes| PG[pgvector extension]
    Q2 --&gt;|No| Q4{Need metadata joins?}
    Q4 --&gt;|Yes| PG
    Q4 --&gt;|No| BM25[Try BM25 / keyword search first]

    Q3 --&gt;|No| PG
    Q3 --&gt;|Yes| Q5{Team has vector DB ops experience?}
    Q5 --&gt;|No| PG
    Q5 --&gt;|Yes| VDB

    BM25 --&gt; Eval{Retrieval quality good enough?}
    PG --&gt; Eval
    VDB --&gt; Eval

    Eval --&gt;|No| Fix[Fix chunking and embedding model first]
    Eval --&gt;|Yes| Ship[Ship it]
    Fix --&gt; Start
```

Notice where the decision tree sends you most often: fix your fundamentals before buying infrastructure.
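
For readers who would rather read branches than boxes, here is the first half of the tree as a function. It is a sketch of my reading of the diagram: the function name and parameter names are hypothetical, and the evaluation loop at the bottom of the tree is left out because it applies whatever this returns.

```python
def choose_retrieval(
    chunks,
    on_postgres,
    needs_metadata_joins,
    needs_sub_100ms,
    has_vector_db_ops,
):
    """Return the retrieval backend the decision tree above suggests trying first."""
    if chunks > 10_000_000:
        return "dedicated vector DB"
    if chunks >= 100_000:
        # Mid-size corpus: only leave Postgres for a hard latency SLA
        # and a team that can actually operate the alternative.
        if needs_sub_100ms and has_vector_db_ops:
            return "dedicated vector DB"
        return "pgvector"
    if on_postgres or needs_metadata_joins:
        return "pgvector"
    return "BM25 / keyword search"


# Hypothetical B2B SaaS: 10,000 chunks, already on Postgres.
print(choose_retrieval(10_000, True, False, False, False))
```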

## Why RAG Fails (It Is Usually Not the Database)

I have debugged RAG pipelines that used Pinecone, Weaviate, and pgvector. The vector database was never the problem. The problems were always:

**Chunking strategy.** Fixed 512-token chunks split mid-sentence, mid-table, mid-code-block. The retriever returns fragments that no LLM can synthesize into a coherent answer. Semantic chunking helps. Structure-aware chunking helps more.
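
As a rough illustration of the simplest structure-aware approach, the sketch below packs whole paragraphs into chunks instead of cutting at a fixed offset. It assumes paragraphs are separated by blank lines and uses word count as a crude stand-in for tokens; tables and code blocks need their own handling, which is not shown.

```python
def chunk_by_paragraph(text, max_words=120):
    """Pack whole paragraphs into chunks of at most `max_words` words.

    Word count is a rough stand-in for tokens. A single paragraph longer
    than the budget becomes its own oversized chunk rather than being cut.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_words = [], [], 0
    for paragraph in paragraphs:
        words = len(paragraph.split())
        if current and current_words + words > max_words:
            # Budget exceeded: close the current chunk at a paragraph boundary.
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(paragraph)
        current_words += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Hypothetical document with three short paragraphs.
doc = "Refunds take five days.\n\nContact support to start one.\n\nShipping is separate."
for chunk in chunk_by_paragraph(doc, max_words=9):
    print(repr(chunk))
```

The chunk boundary always falls between paragraphs, so the retriever never returns half a sentence.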

**Embedding model mismatch.** You embed with `text-embedding-ada-002` but your domain is legal contracts in Hindi. The embedding space does not capture the semantics you care about. Fine-tuned or domain-specific embedders matter more than which vector DB you use.

**No reranking.** Top-k cosine similarity returns plausible-looking garbage. A cross-encoder reranker on the top 20 candidates before passing to the LLM improves answer quality more than switching from pgvector to Pinecone.
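
The reranking step itself is small. In the sketch below, `score_fn` is a hypothetical placeholder for a cross-encoder: any callable that scores a query and passage pair. The word-overlap scorer is a toy included only so the example runs; it is not a substitute for a real model.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Re-score retrieved candidates with a stronger scorer and keep the best.

    `score_fn(query, text)` stands in for a cross-encoder: any callable that
    returns a higher number for a more relevant (query, passage) pair.
    """
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def overlap_score(query, text):
    """Toy scorer: fraction of query words present in the passage."""
    query_words = set(query.lower().split())
    return len(query_words.intersection(text.lower().split())) / len(query_words)


# Hypothetical top-3 from a vector search, in the order the index returned them.
candidates = [
    "shipping takes 5 days",
    "refund window details",
    "refund window is 30 days",
]
print(rerank("refund window days", candidates, overlap_score, top_n=2))
```

The pattern is retrieve wide, rerank narrow: pull twenty candidates cheaply, then spend the expensive scorer only on those.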

**Stale index.** Your product docs update weekly. Your index updates never. Users ask about features that launched last month and get answers from six-month-old documentation. The vector DB works perfectly. The pipeline is broken.
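
One way to catch staleness, sketched under the assumption that you store a content hash next to each embedded document: compare the live docs against those hashes on a schedule and re-embed only what changed. The function names and the shape of the two dictionaries are hypothetical.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document body."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(current_docs, indexed_hashes):
    """Compare live documents against what the index was built from.

    current_docs: doc_id mapped to current text.
    indexed_hashes: doc_id mapped to the hash stored at last embedding.
    Returns (ids to embed or re-embed, ids to delete from the index).
    """
    to_embed = sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(set(indexed_hashes) - set(current_docs))
    return to_embed, to_delete


# Hypothetical state: one edited doc, one unchanged, one new, one removed.
current_docs = {"pricing": "v2 text", "faq": "same text", "changelog": "new page"}
indexed_hashes = {
    "pricing": content_hash("v1 text"),
    "faq": content_hash("same text"),
    "legacy": content_hash("removed page"),
}
print(plan_reindex(current_docs, indexed_hashes))
```

Run something like this from a scheduled job and alert when the re-embed list keeps growing. That turns "the index updates never" into a visible number.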

**No evaluation harness.** Teams ship RAG without measuring retrieval precision and answer faithfulness. They discover problems in production when customers complain. Build the eval set before you build the infra.

## When a Dedicated Vector DB Actually Makes Sense

I am not saying never use Pinecone. Use it when:

- You have validated that retrieval quality is good with simpler tools and latency or scale is the bottleneck
- Your query volume exceeds what Postgres can serve without dedicated tuning
- You need hybrid search (dense + sparse) with sophisticated filtering at a scale Postgres extensions struggle with
- Your team lacks Postgres expertise but has budget for managed services

These are engineering constraints, not pitch deck constraints.

## The Cost Surprise

Pinecone&apos;s pricing model punishes the curious founder. You prototype on the free tier, demo to investors, get traction, and suddenly your vector DB bill exceeds your LLM API bill. pgvector costs you whatever you already pay for Postgres compute. For early-stage startups, that difference is runway.

Run the math before you commit. Include embedding API costs, re-indexing costs when you change chunking strategy, and the engineering time to migrate when you outgrow your initial choice.

## What I Recommend in 2023

1. Start with BM25 or keyword search as a baseline. Measure answer quality.
2. Add pgvector if you are on Postgres and semantic search improves metrics.
3. Invest in chunking, reranking, and evaluation before switching vector DBs.
4. Move to a dedicated vector DB only when profiling shows Postgres is the bottleneck.
5. Never let &quot;our RAG stack&quot; become a selling point. Customers buy answers, not architecture.
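
The BM25 baseline in step 1 fits in one file with no dependencies. This is a minimal sketch of Okapi BM25 with the commonly used `k1` and `b` defaults and a naive whitespace tokenizer; the class and sample documents are illustrative, and a real baseline would want proper tokenization.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; real pipelines need something better."""
    return text.lower().split()


class BM25:
    """Okapi BM25 over a small in-memory corpus (must be non-empty)."""

    def __init__(self, documents, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in documents]
        self.term_counts = [Counter(d) for d in self.docs]
        self.avg_len = sum(len(d) for d in self.docs) / len(self.docs)
        doc_freq = Counter(term for d in self.docs for term in set(d))
        n = len(self.docs)
        # Smoothed IDF that stays positive even for very common terms.
        self.idf = {
            term: math.log((n - df + 0.5) / (df + 0.5) + 1)
            for term, df in doc_freq.items()
        }

    def score(self, query, index):
        counts, length = self.term_counts[index], len(self.docs[index])
        # Length normalization: long documents need more hits to score well.
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            tf = counts.get(term, 0)
            total += self.idf.get(term, 0.0) * tf * (self.k1 + 1) / (tf + norm)
        return total

    def search(self, query, top_k=3):
        order = sorted(
            range(len(self.docs)),
            key=lambda i: self.score(query, i),
            reverse=True,
        )
        return order[:top_k]


# Hypothetical support docs.
docs = [
    "refunds are processed within five days",
    "shipping takes three to five days",
    "reset your password from the settings page",
]
bm25 = BM25(docs)
print(bm25.search("password reset", top_k=1))
```

Score your eval set against this first. If embeddings do not beat it on your own questions, the vector database is not the missing piece.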

## Closing

Vector databases are infrastructure. Infrastructure should be boring and justified by metrics. The fact that every AI conference sponsor is a vector DB company does not mean you need one on day one.

Build the simplest retrieval pipeline that works. Measure it. Then optimize. The founders who skip this sequence spend months debugging Pinecone configs when their chunking strategy was broken from the start.

The hype will pass. Good retrieval fundamentals will not.</content:encoded></item><item><title>What I Learned When I Shut Down Strato Foods</title><link>https://utso.stamped.work/blog/2023-02-14-what-i-learned-shutting-down-strato-foods/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2023-02-14-what-i-learned-shutting-down-strato-foods/</guid><description>Strato Foods was a hometown hit under Strato Inc. I walked away in 2023 when Zomato arrived. The publisher stayed. The app did not.</description><pubDate>Tue, 14 Feb 2023 00:00:00 GMT</pubDate><content:encoded>In early 2023 I shut down **Strato Foods**, the food delivery app that had been Strato Inc&apos;s biggest hit. This was not an acquisition. No LOI, no wire transfer, no corp dev call. I stopped operating the app because the market changed, I had no one to run day-to-day operations, and keeping it alive would have meant lying to restaurants and riders.

Strato Inc is my Play Store publishing company, started in 2020 (originally PAC Limited). I have shipped many apps through it over the years. Strato Foods was the one that worked: real orders, real revenue, real name recognition in my hometown through 2022 and into 2023. Then Swiggy and Zomato expanded properly into the same geography. The game ended. I chose to exit that product, not the publisher.

## What Strato Inc Actually Was

Strato Inc is not a venture-backed startup in the Silicon Valley sense. It is a developer account, a portfolio of Android apps, and years of learning how to ship on the Play Store without a team of forty.

Some apps flopped. Some got a few thousand downloads. Strato Foods was the outlier: it generated revenue, repeat users, and the kind of local word-of-mouth that does not show up in TechCrunch.

I built it because my hometown needed it before the aggregators cared. I shut it down because once they cared, I could not compete on logistics, subsidies, or operations without capital and people I did not have.

## Why Shutdown Is Not the Same as Failure

Founders are trained to narrate shutdowns as pivots or acquisitions. Sometimes the honest story is simpler: you had product-market fit for a window, the window closed, and you made an adult decision.

```mermaid
flowchart TD
    A[2021-2022: Local demand, weak aggregator coverage] --&gt; B[Strato Foods gains traction via Strato Inc]
    B --&gt; C[2022-2023: Revenue and repeat orders in hometown]
    C --&gt; D[Aggregators expand + ops burden grows]
    D --&gt; E{Can you sustain operations?}
    E --&gt;|No team, no capital| F[Shut down Strato Foods]
    E --&gt;|Yes| G[Scale into regional player]
    F --&gt; H[Keep Strato Inc + other apps]
    G --&gt; I[Rare outcome for solo founders]
```

I landed on F. That is not a tragedy. It is arithmetic.

## What I Got Wrong

**I confused product success with business durability.** The app worked. The business model did not survive competition from players who burn cash for market share.

**I had no operator.** I was the developer, the support desk, and the person negotiating with restaurants at odd hours. That works at small scale. It breaks when order volume spikes and something always needs fixing.

**I waited too long to call it.** I knew aggregators were coming. I kept the app running out of pride and sunk cost. The last two months cost more stress than the revenue justified.

## What I Got Right

**I kept Strato Inc.** The Play Store account, the other apps, the publishing pipeline. Shutting down one product did not mean shutting down how I build.

**I did not fake an acquisition story.** Local founders sometimes inflate a shutdown into an exit. I am not interested. The lesson is worth more than the LinkedIn headline.

**I learned operations matter as much as code.** Food delivery is logistics with an app attached. I was good at the app.

## The Post-Shutdown Period

Identity shift is real even without a deal. For two years I was &quot;the guy who built Strato Foods.&quot; After shutdown I was just another developer with a Play Store portfolio. That was humbling and useful.

I kept building on Strato Inc: utilities, experiments, client work, side projects. Revenue from Strato Foods was gone. The skills and the account remained.

## Advice If You Are Running a Local App Business

1. **Know your aggregator timeline.** If Zomato or Swiggy are hiring in your city, your window is measured in months, not years.
2. **Separate the product from the publisher.** Keep your developer account and your other apps structurally independent from any single hit.
3. **Hire or partner for operations before you need it.** If you cannot, plan the shutdown date in advance.
4. **Do not call it an exit unless money changed hands for the company.** Words matter for your own honesty.

## Closing

My &quot;first exit&quot; in 2023 was walking away from Strato Foods, a product that could not survive the next phase of the market. Strato Inc is still here. I still publish apps. That chapter taught me more about real users and real operations than most CS coursework ever could.

That is the lesson I wish someone had told me before I attached my identity to a single app in a portfolio.</content:encoded></item><item><title>GPT-4 Is Not AGI and the Framing Matters</title><link>https://utso.stamped.work/blog/2023-01-20-gpt4-is-not-agi/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2023-01-20-gpt4-is-not-agi/</guid><description>Why calling GPT-4 AGI is sloppy thinking, what the term actually requires, and the failure modes that keep showing up in production.</description><pubDate>Fri, 20 Jan 2023 00:00:00 GMT</pubDate><content:encoded>Everyone called GPT-4 a step toward AGI within hours of the demo. I watched the same people who couldn&apos;t explain backpropagation last year suddenly become philosophers of mind. The framing matters because sloppy language drives sloppy product decisions, and sloppy product decisions burn runway.

Let me be precise: GPT-4 is an impressive general-purpose pattern completer. It is not artificial general intelligence by any definition that would survive a five-minute cross-examination in a room full of skeptical engineers.

## What AGI Actually Means (When People Aren&apos;t Grifting)

The term AGI gets abused because it has no single canonical definition. That ambiguity is convenient for fundraising decks and Twitter threads. When I use AGI here, I mean a system that can:

1. Learn new tasks from minimal instruction without retraining the entire model
2. Transfer knowledge across domains in ways that generalize beyond surface pattern matching
3. Maintain coherent goals over long horizons with reliable self-correction
4. Operate in open environments where the action space and observation space are not bounded by a training distribution

GPT-4 fails on every single one of these when you push past the demo layer. It fails quietly, which is worse than failing loudly.

## The Capability Comparison Nobody Wants to Draw

```mermaid
graph TB
    subgraph AGI[&quot;AGI Requirements&quot;]
        A1[Novel task acquisition]
        A2[Cross-domain transfer]
        A3[Long-horizon planning]
        A4[Open-world robustness]
        A5[Reliable self-correction]
    end

    subgraph GPT4[&quot;GPT-4 Actual Capabilities&quot;]
        G1[Strong in-distribution completion]
        G2[Impressive zero-shot heuristics]
        G3[Fragile multi-step reasoning]
        G4[Hallucination under uncertainty]
        G5[No persistent learning without fine-tuning]
    end

    A1 -.-&gt;|partial mimicry| G2
    A2 -.-&gt;|surface only| G1
    A3 -.-&gt;|breaks on complexity| G3
    A4 -.-&gt;|fails silently| G4
    A5 -.-&gt;|requires external loop| G5
```

The dotted lines are the lie. Partial mimicry is not capability. A system that looks like it plans when the plan fits training distribution patterns is not planning. It is autocomplete with confidence.

## Failure Modes That Production Surfaces Immediately

If you have shipped anything with GPT-4 beyond a chatbot demo, you have hit these:

**Hallucination with authority.** The model does not know what it does not know. It generates plausible continuations. In a customer support context, plausible wrong answers are worse than &quot;I don&apos;t know&quot; because users trust the fluency.

**Reasoning collapse on nested dependencies.** Ask it to track five interdependent constraints over twelve steps. Watch it drop constraints silently. Chain-of-thought prompting helps marginally. It does not fix the underlying lack of reliable symbolic manipulation.

**No learning without expensive retraining.** GPT-4 does not get better at your specific domain from use. Every &quot;it learned my preferences&quot; story is either prompt engineering, RAG, or fine-tuning dressed up as magic. The base model is frozen.

**Tool use is scaffolding, not agency.** Function calling and plugins look like agency. They are API wrappers with a language model choosing which button to press. The moment your tool schema has an edge case, the model picks the wrong tool with the same confidence it shows when it picks the right one.

**Context window is not memory.** Stuffing 32k tokens into a prompt is not the same as maintaining a coherent world model. Retrieval helps. It does not create understanding.

## Why the Framing Matters for Founders

When you tell investors your product is &quot;AGI-powered,&quot; you are making a claim you cannot defend in diligence. When you tell your engineering team the model is &quot;generally intelligent,&quot; they will under-invest in guardrails because they assume the model will figure it out.

The correct framing: GPT-4 is a probabilistic text engine with broad but shallow competence. Treat it like a very fast intern who has read the entire internet but never verified any of it.

That framing drives better architecture:

- Always verify outputs against ground truth when stakes are high
- Build explicit state machines for multi-step workflows instead of hoping the model chains correctly
- Invest in evaluation harnesses before you invest in prompt prettiness
- Assume the model will fail on edge cases and design graceful degradation
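
The state-machine and verification points above can be sketched in a few lines. This is a minimal sketch, assuming a hypothetical refund workflow: `call_model`, the `Step` names, and the ground-truth check are illustrative placeholders, not part of any real SDK. The code owns every transition; the model only fills in one value per step.

```python
from enum import Enum, auto
from typing import Callable, Optional, Tuple


class Step(Enum):
    EXTRACT = auto()
    VERIFY = auto()
    DONE = auto()
    ESCALATE = auto()


def run_workflow(ticket: str, call_model: Callable[[str], str],
                 ground_truth_amount: float,
                 max_retries: int = 2) -> Tuple[Step, Optional[float]]:
    """Drive a two-step workflow with explicit, code-owned transitions."""
    state, failures, amount = Step.EXTRACT, 0, None
    while state not in (Step.DONE, Step.ESCALATE):
        if state is Step.EXTRACT:
            # Hypothetical model call: any function from prompt to text works here.
            raw = call_model(f"Extract the refund amount as a number: {ticket}")
            try:
                amount = float(raw.strip())
                state = Step.VERIFY
            except ValueError:
                failures += 1  # unparseable output counts as a failure
        elif state is Step.VERIFY:
            if abs(amount - ground_truth_amount) > 0.01:
                failures += 1  # fluent but wrong: retry instead of trusting it
                state = Step.EXTRACT
            else:
                state = Step.DONE
        if failures > max_retries:
            state = Step.ESCALATE  # graceful degradation: hand off to a human
    return state, (amount if state is Step.DONE else None)
```

Passing a stub such as `lambda prompt: "42.50"` as `call_model` lets you exercise every transition in a unit test before a real model is ever involved.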

## The Hype Cycle Is Not Your Friend

Every AI wave produces a cohort of founders who confuse capability demonstrations with product readiness. GPT-4 is the best demonstration we have seen. It is still not a product. The gap between &quot;wow demo&quot; and &quot;reliable system&quot; is where companies die.

Researchers who should know better participate in the hype because attention is currency. I get it. But if you are building something real, your job is to be the person in the room who says: this is impressive, and it is not what you think it is.

## What Would Actually Move the Needle

Real progress toward AGI, however you define it, requires at least:

- Continual learning without catastrophic forgetting (still an open problem in 2023)
- Grounded world models tied to sensorimotor experience or high-fidelity simulation
- Reliable calibration of uncertainty, not just fluent guessing
- Compositional reasoning that does not collapse under adversarial perturbation

None of these are solved by scaling transformers and adding more RLHF. Scaling helps. It is not sufficient. Anyone who tells you otherwise is selling something.

## Closing

GPT-4 is the most capable language model available as of early 2023. That is a statement about engineering achievement, not about the nature of intelligence. Count on it for tasks where errors are cheap and human review is cheap. Do not trust it for tasks where errors are expensive and you cannot verify.

The founders who win in this cycle will be the ones who understand exactly what the model can and cannot do, build systems that compensate for the gaps, and refuse to act like the gaps do not exist.

Calling GPT-4 AGI is not optimism. It is marketing. And marketing is a terrible foundation for system design.</content:encoded></item><item><title>2022: The Year Everything Changed in AI</title><link>https://utso.stamped.work/blog/2022-12-31-2022-year-in-review/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2022-12-31-2022-year-in-review/</guid><description>From GPT-3 experiments in January to ChatGPT in November, a founder-researcher&apos;s log of the year that broke the Overton window on machine intelligence.</description><pubDate>Sat, 31 Dec 2022 00:00:00 GMT</pubDate><content:encoded>I started 2022 building with LLMs when it was still a weird hobby you explained at parties while people backed away slowly. I ended 2022 watching relatives use ChatGPT to write wedding speeches. The compression of &quot;fringe&quot; to &quot;default&quot; in eleven months broke my sense of how fast industries are supposed to move.

This is not a timeline of every paper. It is a personal field log: what I built, what failed, what suddenly worked, and what I wish I had not optimistically shipped.

## January to March: Plumbing Before the Hype

GPT-3 API access was the toy that ate my Q1. I was not training models. I was wiring retrieval, prompt templates, and evaluation harnesses around models I did not own. On the side I ran agent experiments with Telegram ingress and memory prototypes that would take years to become anything serious.

Lessons from the quiet months:

- **Prompts are code.** Version them or drown in regressions.
- **Evals before features.** If you cannot score outputs, you cannot ship responsibly.
- **Latency kills UX.** Streaming completions were not vanity. They were survival.

Nobody paid much attention. That was the advantage.

## April to June: Images Enter the Chat

DALL-E 2 and Midjourney made generative media real for normies. I split time between text agents and image pipelines for marketing assets. The startup pitch deck problem (&quot;we need visuals&quot;) got cheaper overnight.

I also hit the first wall of **policy whack-a-mole**: NSFW filters, copyright anxiety, client requests that made me update terms of service twice.

Internally I drew a map of skills I was accumulating. By midyear it looked like a messy graph, not a ladder.

```mermaid
flowchart TB
    subgraph skills [Skills Layer]
        PROMPT[Prompt Engineering]
        EVAL[Eval Design]
        RET[Retrieval / RAG]
        EMB[Embeddings]
        FINETUNE[Fine-Tuning LoRA]
        DIFF[Latent Diffusion]
    end

    subgraph projects [Projects Layer]
        AGENT[Telegram Agent Experiments]
        MEM[Memory Prototypes]
        IMG[Image Gen Pipelines]
        B2B[B2B SaaS Experiments]
    end

    subgraph infra [Infra Layer]
        PY[Python Services]
        API[OpenAI / HF APIs]
        GPU[Local GPU Experiments]
        TG[Telegram Bot Layer]
    end

    PROMPT --&gt; AGENT
    RET --&gt; MEM
    EMB --&gt; MEM
    EMB --&gt; AGENT
    DIFF --&gt; IMG
    FINETUNE --&gt; IMG
    EVAL --&gt; AGENT
    EVAL --&gt; B2B

    PY --&gt; AGENT
    API --&gt; AGENT
    API --&gt; IMG
    GPU --&gt; DIFF
    GPU --&gt; FINETUNE
    TG --&gt; AGENT

    MEM --&gt; AGENT
    IMG --&gt; B2B
    AGENT --&gt; B2B
```

The edges are dependencies. The graph got denser every month. That density was the year&apos;s real story: tools stopped being siloed and started composing.

## July to August: Stable Diffusion and the Open Source Shock

I wrote about this separately because the week mattered. Stable Diffusion&apos;s release moved image generation from API rent-seeking to checkpoint files on disk. My GPU went from gaming hardware to production-ish infra.

Fine-tuning culture exploded. DreamBooth, LoRA, community WebUIs. I shipped internal tools faster and slept worse.

## September to October: Memory, Markets, and India Reality

September was ChromaDB memory work: making agents remember without stuffing prompts. October was market reality: B2B SaaS in India does not care about your US pricing calculator. I watched founders burn runway on GTM that ignored procurement and relationship graphs.

I straddled builder and researcher modes. Neither side felt optional. If you ignore GTM, you build demos. If you ignore memory and evals, you build demos that lie confidently.

## November: ChatGPT and the Overton Window

OpenAI shipped ChatGPT on November 30. The date is burned in because everything after felt like damage control for attention.

Effects I felt within two weeks:

- **Inbound interest spiked** from people who had ignored LLMs for months
- **Expectations detached from capability** overnight
- **Hiring conversations changed** from &quot;what is GPT&quot; to &quot;why are you not ChatGPT yet&quot;
- **My old demos looked quaint** even when they were more specialized

ChatGPT was not magically more capable than what tinkerers had been chaining together. It was accessible, conversational, and free to try. Distribution beat depth for public narrative.

I rewrote onboarding for everything I showed investors. Screenshots aged in days.

## December: Consolidation and Honest Accounting

By year end I had:

- A Telegram agent stack with real memory, still imperfect but debuggable
- Image pipelines I trusted for internal use, not legal-approved for all clients
- Strong opinions on Indian B2B GTM I wished I had learned cheaper

Failures:

- Shipped features without eval coverage and paid in support time
- Underestimated how fast open models commoditize UI wrappers
- Overestimated how fast enterprises adopt anything without security theater answers

## Themes That Survived the Hype Cycles

**Composition won.** Retrieval plus tools plus memory plus UI beat raw model size for most founder use cases.

**Open weights changed bargaining power.** APIs are convenient until your unit economics depend on them.

**India is not a discount US market.** It is a parallel GTM universe.

**Memory is product, not research.** Users do not care about your vector DB. They care that the bot remembers their kid&apos;s name.

**Speed of narrative outran speed of reliability.** 2022 rewarded demos. 2023 would punish them. I could feel that coming in December.

## What I Would Tell January-2022 Me

1. Buy more GPU before the scalpers wake up.
2. Log prompts and outputs from day one.
3. Say no to custom demos without contracts.
4. Study sales cycles in your actual market, not on podcasts.
5. ChatGPT is coming. Build specialized value, not a chat box.

## Looking at 2023 From the Edge

The year everything changed in AI was also the year I stopped asking whether LLMs were a fad. The fad was thinking you could ignore them and keep a normal software career.

I entered 2023 tired, more technical than I was in January, and more skeptical of people who confuse virality with product-market fit. The tools got better. The hard parts stayed hard: memory, evals, distribution, trust.

2022 did not give us AGI. It gave us permission to build like the future was already here, with all the mess that implies.

I would not want to relive the chaos. I would not want to have missed it.

Happy new year. Ship evals first.</content:encoded></item><item><title>The Week Stable Diffusion Went Open Source</title><link>https://utso.stamped.work/blog/2022-11-17-stable-diffusion-open-source-changed-everything/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2022-11-17-stable-diffusion-open-source-changed-everything/</guid><description>August 2022 changed generative AI forever. Latent diffusion, open weights, and a fine-tuning ecosystem that left closed APIs scrambling.</description><pubDate>Thu, 17 Nov 2022 00:00:00 GMT</pubDate><content:encoded>There is a before and after. Before Stable Diffusion&apos;s public release, image generation was a demo you accessed through a waiting list, a Discord bot, or a research paper you could read but not run. After, it was a 4 GB checkpoint on your laptop and a weekend away from fine-tuning a model on your face, your product, or your questionable fan art.

I do not overuse the word &quot;inflection.&quot; This was one.

## What Actually Shipped

Stable Diffusion is a latent diffusion model. Instead of denoising pixels directly in high-resolution space, it compresses images into a lower-dimensional latent space using a variational autoencoder, runs the diffusion process there, and decodes back to pixels. That trick is why you could generate 512x512 images on a consumer GPU instead of a government budget.
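
To put a number on that trick, here is a back-of-envelope sketch. It assumes the Stable Diffusion v1 shapes: a 512x512 RGB image, a VAE that downsamples each side by 8, and 4 latent channels. The helper name is mine.

```python
def element_count(height: int, width: int, channels: int) -> int:
    """Number of values in a height x width x channels tensor."""
    return height * width * channels


# Pixel space: 512 x 512 RGB.
pixels = element_count(512, 512, 3)

# Latent space: each side divided by 8, with 4 channels (assumed v1 shapes).
latents = element_count(512 // 8, 512 // 8, 4)

compression = pixels / latents
print(pixels, latents, compression)  # 786432 16384 48.0
```

The denoiser works on a tensor with 48 times fewer values than the image it eventually produces, which is a large part of why the loop fits on consumer hardware.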

The open-source release included:

- Model weights
- Inference code
- The implicit invitation to break every license ambiguity on the internet within 72 hours

Competitors had models. They did not have distribution plus permissiveness plus a community that already knew PyTorch. Stability AI (contentious founder drama aside) catalyzed a Cambrian explosion by choosing release over lock-in.

## Latent Diffusion Architecture (Why It Worked)

The pipeline is conceptually simple and computationally vicious:

```mermaid
flowchart TB
    subgraph encode [Encoding]
        IMG[Input Image] --&gt; VAEEnc[VAE Encoder]
        VAEEnc --&gt; LAT[Latent Representation z]
    end

    subgraph diffuse [Diffusion in Latent Space]
        NOISE[Gaussian Noise] --&gt; UNET[U-Net Denoiser]
        LAT --&gt; UNET
        TEXT[Text Embedding from CLIP] --&gt; UNET
        UNET --&gt; DENOISED[Denoised Latent z&apos;]
    end

    subgraph decode [Decoding]
        DENOISED --&gt; VAEDec[VAE Decoder]
        VAEDec --&gt; OUT[Generated Image]
    end

    subgraph cond [Conditioning]
        PROMPT[Text Prompt] --&gt; CLIP[CLIP Text Encoder]
        CLIP --&gt; TEXT
    end
```

Text enters through CLIP&apos;s text encoder. The U-Net learns to predict noise conditioned on timestep and text embedding. At inference, you start from pure noise and walk backward through the schedule. The VAE decoder turns latents into something you can post on Twitter before the content policy account wakes up.
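
The noise-prediction objective is easier to see in a toy. The sketch below is not a sampler and not the Stable Diffusion code: it uses a four-number list as a stand-in latent, a linear beta schedule with commonly used DDPM-style defaults (an assumption), and an oracle that returns the true noise in place of a trained U-Net. It shows the forward noising step and why a good noise prediction is enough to recover the clean latent.

```python
import math
import random


def alpha_bar_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    bars, running = [], 1.0
    for t in range(steps):
        beta = beta_start + (beta_end - beta_start) * t / (steps - 1)
        running *= 1.0 - beta
        bars.append(running)
    return bars


def add_noise(z0, eps, alpha_bar):
    """Forward process: z_t = sqrt(alpha_bar) * z0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * z + b * e for z, e in zip(z0, eps)]


def estimate_clean(zt, eps_pred, alpha_bar):
    """Invert the forward process given a prediction of the added noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(z - b * e) / a for z, e in zip(zt, eps_pred)]


random.seed(0)
bars = alpha_bar_schedule(1000)
z0 = [0.5, -1.0, 0.25, 2.0]                       # stand-in for a latent
eps = [random.gauss(0.0, 1.0) for _ in z0]        # the noise that gets added
zt = add_noise(z0, eps, bars[500])                # noisy latent at step 500
recovered = estimate_clean(zt, eps, bars[500])    # oracle prediction: eps itself
```

With a perfect prediction, `recovered` matches `z0` up to floating-point error. A real sampler has only an imperfect prediction, so it takes many small steps back through the schedule instead of one jump.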

Understanding this diagram mattered because every hack in the ecosystem targeted a different box: better schedulers on the denoiser loop, LoRA adapters on the U-Net, textual inversion on embeddings, ControlNet on conditioning paths.

## The Week It Broke Loose

I remember the sequence vividly:

**Day 1:** Weights hit Hugging Face. Discord servers melted. Colab notebooks appeared like mushrooms.

**Day 2:** AUTOMATIC1111&apos;s WebUI gained a thousand stars. Non-engineers generated their first images. Prompt engineering became a meme and a job.

**Day 3:** DreamBooth fine-tuning tutorials dropped. People cloned pets, celebrities (ethics be damned), and brand mascots.

**Day 4:** Stock photo Twitter accounts entered existential crisis. Illustrator forums split between curiosity and rage.

**Day 5:** Every AI startup pitch deck gained a &quot;generative&quot; slide whether or not the founders had GPUs.

**Day 6:** Lawyers discovered the license.

**Day 7:** I slept. Many did not.

## Fine-Tuning Ecosystem: The Real Product

The base model was a commodity within weeks. The moat (temporary, porous) was tooling:

- **DreamBooth** for subject personalization with a handful of images
- **LoRA** for low-rank adaptation that made fine-tunes small and swappable
- **Hypernetworks** and **textual inversion** for style and concept injection
- **ControlNet** later for spatial conditioning (edges, poses, depth)

Open weights meant open experimentation. Closed APIs could not keep pace with the combinatorics of community repos. Google and OpenAI had talent. The community had parallel idle GPUs and no committee approval for weird ideas.

I fine-tuned models for product mockups, blog hero images, and internal design exploration. Quality was inconsistent. Speed was unbeatable. For a startup, &quot;good enough today&quot; beat &quot;perfect next quarter.&quot;

## Research vs Production Gap

Academic papers optimize FID. Founders optimize &quot;does this get clicks&quot; and &quot;can we ship without a legal call.&quot; Stable Diffusion lowered the research-to-meme latency to hours.

Production issues showed up immediately:

- **NSFW generation** and moderation failures
- **Bias and stereotyping** baked into training data
- **Copyright ambiguity** for commercial use
- **Artifacting** on hands, text, and faces until you learned negative prompts by rote

The model did not solve these. It made them everyone&apos;s problem instead of a lab&apos;s problem.

## Economic Shockwaves

Midjourney had aesthetics and ease. DALL-E had brand and safety rails. Stable Diffusion had economics. Once inference cost approached zero on owned hardware, usage exploded in places that would never pay $30/month for a subscription.

That shifted:

- **Design agencies** experimenting with asset pipelines
- **Game studios** prototyping concept art faster
- **E-commerce** players generating catalog variations
- **Bad actors** generating disinformation imagery (predictable and under-discussed in hype threads)

Incumbents responded with better UX and API bundling. They could not respond with the same freedom to fork and modify. Open source ate the long tail.

## What Changed in My Work

I stopped treating image generation as a party trick and started treating it as infrastructure. Pipelines for:

- Batch generation with fixed seeds for reproducibility
- Prompt templates versioned in git (yes, really)
- Human review gates before anything customer-facing shipped
- Watermarking and metadata for anything public
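
The first two bullets can be sketched without any image library. This is a minimal sketch under my own assumptions: the template text, the field names, and the hashing scheme are illustrative, and the actual generation call is left out. The idea is that the seed and the template version are derived from content, so rerunning the batch reproduces the same jobs.

```python
import hashlib

# Hypothetical template; in practice this lives in a file tracked in git.
PROMPT_TEMPLATE = "product photo of {subject}, studio lighting, white background"


def template_version(template):
    """Short content hash so a job records exactly which template it used."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def seed_for(template, subject, index):
    """Derive a stable 32-bit seed from template version, subject, and index."""
    key = f"{template_version(template)}|{subject}|{index}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")


def build_batch(template, subjects, variations):
    """Return one job dict per subject and variation, ready for a generator."""
    jobs = []
    for subject in subjects:
        for i in range(variations):
            jobs.append({
                "prompt": template.format(subject=subject),
                "seed": seed_for(template, subject, i),
                "template_version": template_version(template),
                "needs_human_review": True,  # gate before anything customer-facing
            })
    return jobs
```

Editing the template changes its hash and therefore every seed, so an old image can always be traced to the exact prompt text that produced it.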

The research world moved to video and 3D. The builder world was still catching up on 2D batch jobs. That gap was opportunity for anyone willing to write boring scripts.

## Lessons for Open Model Releases

If you release weights 2022-style:

1. Assume fine-tunes within days
2. Assume NSFW within hours
3. Ship inference code or someone else&apos;s becomes standard
4. License clarity matters more than benchmark points
5. Community tooling is distribution

Stable Diffusion did not win because it was the prettiest generator. It won because it was the most forkable.

## Looking Back From Late 2022

The week Stable Diffusion went open source, generative AI stopped being a spectator sport. Researchers still mattered. But the center of gravity moved to GitHub repos, Discord channels, and Colab notebooks running on borrowed compute.

Every closed model since has lived in the shadow of that release. Either you justify the API tax with safety, scale, and UX, or you get commoditized by the next checkpoint someone torrents.

I am not nostalgic for chaos. But I am honest: that week changed what solo founders could build without asking permission. The image was just the beginning. The permission structure was what actually broke.</content:encoded></item><item><title>B2B SaaS in India Is Different</title><link>https://utso.stamped.work/blog/2022-10-08-india-b2b-saas-is-different/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2022-10-08-india-b2b-saas-is-different/</guid><description>Why copying US playbooks fails in Indian B2B: procurement friction, relationship selling, and pricing that looks broken until you understand the market.</description><pubDate>Sat, 08 Oct 2022 00:00:00 GMT</pubDate><content:encoded>I watched a YC alum pitch a $15k ACV HR SaaS to a mid-market Indian manufacturer and get laughed out of the room. Not because the product was bad. Because the buyer&apos;s mental model for software spend, trust, and implementation looked nothing like the slide deck assumed. The founder had copied a US GTM motion, swapped dollars for rupees at face value, and wondered why pipeline stalled after the first demo.

B2B SaaS in India is not &quot;US SaaS but cheaper.&quot; It is a different game with different winners, different timelines, and pricing that makes US investors twitch until you explain unit economics on their terms.

## The Procurement Maze

US mid-market SaaS often sells on ROI slides and a credit card. Indian mid-market SaaS sells through relationships, reference calls with someone the buyer already trusts, and procurement committees that treat software like capex even when you swear it is opex.

Typical friction points:

- **Vendor registration** that takes longer than your runway
- **GST invoicing** requirements that break naive billing systems
- **TDS deductions** that confuse your US-centric accountant
- **Security questionnaires** photocopied from enterprise templates and applied to your ten-person startup
- **Payment terms** of Net 60 or Net 90 treated as normal, not a favor

Your beautiful self-serve funnel dies at the accounts payable gate. I have seen deals close on WhatsApp and die in email threads with someone&apos;s uncle who &quot;handles IT.&quot;

## Relationship Selling Is Not Optional

In the US, cold outbound plus product-led growth can work for certain categories. In India, for many B2B segments, warm intros are the primary channel. Not because Indians are allergic to software. Because reputational risk is social. If the tool fails, the champion looks foolish in a network where everyone knows everyone.

That means:

- Founders sell longer than they want to
- Customer success is pre-sales
- Case studies from recognizable logos matter more than feature matrices
- Conference rooms and chai matter more than your landing page gradient

I am not romanticizing this. It is inefficient. It is also the market.

## Pricing: The 40-70% Discount Reality

US pricing anchors do not transfer. A tool that sells for $50/user/month in the US might need to land at INR pricing that implies 40-70% lower effective ARPU once you account for purchasing power, negotiation norms, and bundled services buyers expect.
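
The arithmetic is worth doing once. A rough sketch, using the $50 example and the 40-70% band from above; the 100-seat comparison is my own illustration, and everything stays in dollars to avoid assuming an exchange rate.

```python
import math


def effective_arpu_range(us_price, low_discount, high_discount):
    """Return (floor, ceiling) effective price after the discount band."""
    return us_price * (1 - high_discount), us_price * (1 - low_discount)


def seats_to_match(us_seats, us_price, local_price):
    """Seats needed at the local price to equal the US revenue."""
    return math.ceil(us_seats * us_price / local_price)


floor, ceiling = effective_arpu_range(50.0, 0.40, 0.70)
print(f"{floor:.2f} to {ceiling:.2f} USD per user per month")

# Seats needed to match 100 US seats at each end of the band.
print(seats_to_match(100, 50.0, ceiling), seats_to_match(100, 50.0, floor))
```

At the deep end of the band you need more than three times the seats to match the same revenue, which is why packaging and expansion matter more than the headline price.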

Indian buyers often expect:

- **Implementation help** included or heavily discounted
- **Local support** in timezone and language
- **Flexible contracts** with exit ramps
- **Multi-year discounts** demanded upfront

Your US investor sees ₹ pricing and calls it a lifestyle business. You need to show that CAC is lower (sometimes), that expansion happens via seats and modules (sometimes), and that payment discipline is improving (slowly).

The founders who win do not pretend parity. They design packaging for India: land with a wedge module, expand after trust, price for annual prepay to fix cash flow.

## Sales Cycle: US vs India

The shape of the cycle differs more than the length. US can be shorter for PLG bottoms-up. India enterprise-leaning deals stretch through festivals, budget cycles tied to April-March fiscal years, and random freezes during election seasons.

```mermaid
flowchart TB
    subgraph us [US B2B SaaS - Typical Mid-Market]
        U1[Inbound / PLG Signup] --&gt; U2[Self-Serve Trial]
        U2 --&gt; U3[Champion Internal Sell]
        U3 --&gt; U4[Security Review]
        U4 --&gt; U5[Procurement / Card]
        U5 --&gt; U6[Closed Won]
    end

    subgraph india [India B2B SaaS - Typical Mid-Market]
        I1[Warm Intro or Event] --&gt; I2[Founder-Led Demo]
        I2 --&gt; I3[Pilot / POC Negotiation]
        I3 --&gt; I4[Reference Calls + Site Visit]
        I4 --&gt; I5[Vendor Onboarding + Legal]
        I5 --&gt; I6[Invoice + TDS + AP Queue]
        I6 --&gt; I7[Closed Won - Often Annual]
    end
```

Notice the extra states. Each state is a place deals go to die. Your CRM must model them or you will misforecast every quarter.
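
Modeling those states does not need a fancy CRM. Here is a minimal sketch that mirrors the India column of the diagram. The stage names come from the diagram; the weighting function and any win rates you feed it are assumptions you would replace with your own history.

```python
from enum import Enum


class IndiaDealStage(Enum):
    WARM_INTRO = 1
    FOUNDER_DEMO = 2
    PILOT_NEGOTIATION = 3
    REFERENCE_AND_SITE_VISIT = 4
    VENDOR_ONBOARDING_LEGAL = 5
    INVOICE_TDS_AP_QUEUE = 6
    CLOSED_WON = 7


def weighted_pipeline(deals, stage_win_rate):
    """Sum each deal value times the win rate of its current stage.

    `deals` is a list of (stage, value) pairs; `stage_win_rate` maps a
    stage to the fraction of deals that historically closed from there.
    """
    return sum(value * stage_win_rate[stage] for stage, value in deals)


# Placeholder rates for illustration only, not real data.
example_rates = {
    IndiaDealStage.FOUNDER_DEMO: 0.1,
    IndiaDealStage.INVOICE_TDS_AP_QUEUE: 0.8,
}
example_deals = [
    (IndiaDealStage.FOUNDER_DEMO, 1_000_000),
    (IndiaDealStage.INVOICE_TDS_AP_QUEUE, 500_000),
]
forecast = weighted_pipeline(example_deals, example_rates)
```

A forecast that lumps the accounts payable queue in with closed won will be wrong every quarter; separate stages with separate win rates make the leak visible.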

## Product Implications

&quot;Global-first&quot; features are often India-last in practice:

- **Offline or flaky connectivity** still matters outside metro offices
- **Mobile-first admin** beats desktop dashboards for many users
- **Role-based access** must map to messy real org charts, not clean RBAC demos
- **Integrations** with local accounting and payroll systems beat shiny AI widgets

US startups ship integrations with Salesforce and call it a platform. Indian startups ship Tally and Zoho Payroll connectors and call it survival.

## Hiring and Distribution

Inside sales teams modeled on US SDR/AE ratios struggle when deals need founder credibility. The first ten customers often need the founder in the room. Literally. Or on a Zoom where they turn on video and speak the buyer&apos;s language.

Channel partners can accelerate or destroy you. A reseller who promises implementation you cannot support will churn customers you never met. Vet partners like investors vet you.

## What I Tell Founders Who Ask

Do not copy US pricing slides. Build an India-specific plan with annual prepay incentives and implementation tiers you can actually deliver.

Do not underestimate accounts payable. Hire or outsource finance ops earlier than feels cool.

Do not confuse metro early adopters with national product-market fit. Bangalore is not Bharat.

Do not apologize for relationship selling. Systematize it: reference programs, champion enablement kits, WhatsApp-friendly support playbooks that do not violate privacy norms.

## The Upside Nobody Puts in Keynotes

India rewards patience and kills tourists. If you survive the procurement maze, churn can be lower than US SMB horror stories because switching costs include social capital. Expansion revenue exists once trust is banked. The market is large enough that niche B2B verticals can produce real businesses without winning San Francisco mindshare.

B2B SaaS in India is different. Treat it that way and you might close the deal that made your US clone quit. Ignore it and you will post on Twitter about how Indian customers &quot;do not pay for software&quot; while your competitor with an uncle in manufacturing eats your lunch.

The market is not broken. Your playbook was.</content:encoded></item><item><title>Building Long-Term Memory for AI Agents With ChromaDB</title><link>https://utso.stamped.work/blog/2022-09-12-building-chromadb-memory-for-ai-agents/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2022-09-12-building-chromadb-memory-for-ai-agents/</guid><description>How I wired vector memory into a Telegram-based agent stack using ChromaDB, and why naive prompt stuffing fails at scale.</description><pubDate>Mon, 12 Sep 2022 00:00:00 GMT</pubDate><content:encoded>Everyone building AI agents in 2022 had the same dirty secret: their &quot;memory&quot; was a bloated system prompt and a prayer. You paste in the last twenty messages, hit the token limit, watch the model forget who you are, and call it a product. I got tired of pretending that worked.

I was running a personal Telegram agent side project and needed it to remember deals, people, half-baked ideas, and the specific way I phrase rejection emails. Stateless LLM calls were useless for that. I needed durable, queryable memory that survived restarts and did not require me to manually curate context windows every morning.

ChromaDB was the pragmatic choice. Not because it was the best vector database on paper, but because I could embed it in a Python service in an afternoon, point it at local disk, and stop thinking about infrastructure until I had paying users. Perfect is the enemy of a founder who still writes ingestion scripts at midnight.

## Why Prompt Stuffing Is Not Memory

Memory is not &quot;more tokens in the context window.&quot; Memory is selective retrieval under uncertainty. When I ask &quot;what did I promise the investor last Tuesday,&quot; the system should not replay our entire chat history. It should pull the three fragments that matter, rank them, and inject them before the model answers.

Naive approaches fail in predictable ways:

1. **Recency bias.** Recent messages crowd out older but more important facts.
2. **No semantic jump.** Keyword search misses paraphrases. &quot;Term sheet&quot; and &quot;signed docs&quot; should connect.
3. **No persistence model.** Restart the process, lose the illusion of continuity.
4. **Cost explosion.** Every turn re-sends the full history. Your API bill scales with anxiety.

Vector stores fix the retrieval problem if you treat memory as a pipeline, not a dump.

## Architecture: Memory Retrieval Pipeline

The core loop is boring and correct: write embeddings on ingest, query by similarity on read, assemble a bounded context pack for the LLM.

```mermaid
flowchart LR
    subgraph ingest [Ingest Path]
        TG[Telegram Message] --&gt; Parser[Message Parser]
        Parser --&gt; Chunker[Semantic Chunker]
        Chunker --&gt; Embed[Embedding Model]
        Embed --&gt; Chroma[(ChromaDB Collection)]
    end

    subgraph retrieve [Retrieval Path]
        Query[User Query] --&gt; QEmbed[Query Embedding]
        QEmbed --&gt; Search[Similarity Search]
        Chroma --&gt; Search
        Search --&gt; Rank[Re-rank and Filter]
        Rank --&gt; Pack[Context Pack Builder]
        Pack --&gt; LLM[LLM Completion]
        LLM --&gt; Reply[Telegram Reply]
    end

    subgraph meta [Metadata Layer]
        Chroma --&gt; Meta[tags, timestamps, source]
        Meta --&gt; Rank
    end
```

Telegram was the interface because founders live on their phones and because async messaging maps cleanly to agent loops. A message arrives, you classify intent, you maybe retrieve memory, you respond. No fake typing indicators required for v1.

## Chunking: Where Most People Blow It

I chunked by conversational turn pairs, not arbitrary token windows. A user message plus the assistant reply became one unit when they were tightly coupled; standalone notes became single chunks. Metadata mattered as much as vectors:

- `source`: telegram, manual note, email forward
- `timestamp`: ISO string, used for decay and &quot;last week&quot; queries
- `entity_tags`: extracted names, companies, project codenames
- `importance`: manual pin or heuristic score

ChromaDB&apos;s metadata filtering saved me from building a second database early. &quot;Similarity search within last 14 days tagged `investor`&quot; is a product feature, not a research paper.
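
A minimal sketch of the chunking and the metadata filter, in plain Python rather than against ChromaDB itself. The `Chunk` fields follow the list above; `pair_turns` and `recent_with_tag` are hypothetical helper names, and "tightly coupled" is simplified here to "an assistant reply directly follows a user message".

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class Chunk:
    text: str
    source: str                      # "telegram", "manual_note", "email_forward"
    timestamp: str                   # ISO 8601 string
    entity_tags: list = field(default_factory=list)
    importance: float = 0.0


def pair_turns(messages, source="telegram"):
    """Merge each user message with the assistant reply that follows it.

    `messages` is a list of (role, text, iso_timestamp) tuples. Anything
    without a directly following assistant reply becomes its own chunk.
    """
    chunks, i = [], 0
    while len(messages) > i:
        role, text, ts = messages[i]
        nxt = messages[i + 1] if len(messages) > i + 1 else None
        if role == "user" and nxt is not None and nxt[0] == "assistant":
            chunks.append(Chunk(f"user: {text}\nassistant: {nxt[1]}", source, ts))
            i += 2
        else:
            chunks.append(Chunk(f"{role}: {text}", source, ts))
            i += 1
    return chunks


def recent_with_tag(chunks, tag, days, now):
    """Keep chunks carrying `tag` whose timestamp falls within the last `days`."""
    cutoff = now - timedelta(days=days)
    return [c for c in chunks
            if tag in c.entity_tags and datetime.fromisoformat(c.timestamp) >= cutoff]
```

In a real store the same filter would run as a metadata pre-filter before the similarity search, so the vector index only ranks chunks that already match the tag and the time window.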

## Embedding Model Choices in Late 2022

I used `text-embedding-ada-002` via API for quality and `all-MiniLM-L6-v2` locally when I wanted zero marginal cost on ingestion experiments. The local model was worse on proper nouns and Indian company names. The API model cost cents per thousand chunks. For a personal agent, API won. For a batch re-index of 50k Slack messages, local won and I lived with the recall hit.

Do not fetishize embedding benchmarks. Measure recall@k on your own queries. I kept a spreadsheet of fifty questions I actually ask my agent and scored retrieval weekly. That beat every leaderboard.
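
The spreadsheet scoring reduces to a few lines. A minimal sketch, assuming each eval question comes with a hand-labeled set of relevant chunk IDs and that `retrieve` is whatever function returns ranked chunk IDs for a question (both names are mine).

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunk IDs that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("each eval question needs at least one relevant ID")
    top_k = set(retrieved_ids[:k])
    return len(top_k.intersection(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(eval_set, retrieve, k=5):
    """Average recall@k over a list of (question, relevant_ids) pairs."""
    scores = [recall_at_k(retrieve(question), relevant, k)
              for question, relevant in eval_set]
    return sum(scores) / len(scores)
```

Run it after every change to chunking, embeddings, or filters, and keep the question set fixed so the weekly numbers stay comparable.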

## Writing Path: When to Remember

Not every message deserves persistence. I added a lightweight classifier (fine-tuned small model, later just GPT-3.5 with a rigid JSON schema) that decided:

- **Ephemeral**: greetings, jokes, one-off calculations
- **Durable**: commitments, preferences, contact facts, project state
- **Derived**: summaries the agent produced that should compound

Durable writes went to Chroma. Derived summaries got their own collection so retrieval could prefer distilled facts over raw chat sludge.
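A rough sketch of the routing step, assuming the classifier replies with a JSON object holding a single `label` key. The collection names and the `route_write` helper are hypothetical, not the names used in the original system.

```python
import json

ALLOWED = {"ephemeral", "durable", "derived"}
COLLECTION_FOR = {"durable": "memory_raw", "derived": "memory_summaries"}


def route_write(classifier_json):
    """Return the target collection name, or None to skip persistence.

    Anything that fails to parse or falls outside the schema is treated
    as ephemeral, so a malformed classifier reply never pollutes memory.
    """
    try:
        label = json.loads(classifier_json)["label"]
    except (ValueError, KeyError, TypeError):
        return None
    if not isinstance(label, str) or label not in ALLOWED:
        return None
    return COLLECTION_FOR.get(label)
```

Failing closed is the design choice here: a missed memory can be re-added later, while a junk memory keeps surfacing in retrieval.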

## Retrieval Tuning That Actually Moved Numbers

Similarity alone is naive. My production-ish pipeline:

1. Embed the query.
2. Pull top 20 from Chroma with metadata pre-filter when possible.
3. Re-rank with a cheap cross-encoder or heuristic blend of similarity + recency + importance.
4. Pack until ~1500 tokens of memory context, hard cap.

The hard cap is non-negotiable. Unbounded retrieval is how you recreate prompt stuffing with extra steps.
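Steps 3 and 4 can be sketched as follows. The weights, the 30-day half-life, and the word-count token estimate are all placeholder assumptions; a real pipeline would use the tokenizer that matches the model and tune the blend against its own eval set.

```python
import math


def blended_score(similarity, age_days, importance,
                  w_sim=0.7, w_recency=0.2, w_imp=0.1, half_life_days=30.0):
    """Heuristic blend of similarity, exponential recency decay, and importance."""
    recency = math.exp(-math.log(2) * age_days / half_life_days)
    return w_sim * similarity + w_recency * recency + w_imp * importance


def pack_context(candidates, max_tokens=1500):
    """Greedily pack the best-scoring chunks under a hard token cap.

    Each candidate is a dict with text, similarity, age_days, importance.
    Token counts are approximated as whitespace-separated words.
    """
    ranked = sorted(
        candidates,
        key=lambda c: blended_score(c["similarity"], c["age_days"], c["importance"]),
        reverse=True,
    )
    packed, used = [], 0
    for c in ranked:
        cost = len(c["text"].split())
        if used + cost > max_tokens:
            continue  # skip chunks that would break the cap
        packed.append(c)
        used += cost
    return packed, used
```

Skipping an oversized chunk instead of stopping lets a smaller, lower-ranked chunk still use the remaining budget.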

## Failure Modes I Hit in Production

**Duplicate chunks.** The agent repeated the same fact because I ingested near-identical messages. Fix: dedupe by cosine similarity threshold on insert.

**Stale memory wins.** Old wrong facts outranked corrections. Fix: supersede pattern. New chunk with `supersedes_id` metadata; filter out losers at read time.

**Hallucinated retrieval confidence.** The model cited memory that was only weakly related. Fix: force the model to quote chunk IDs in an internal scratchpad and drop chunks below a similarity floor.

**Telegram rate limits.** Burst ingestion during a long voice-note rant tripped limits. Fix: queue with backoff. Boring. Correct.
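The first two fixes are small enough to sketch in plain Python. The 0.95 threshold is an assumed starting point, not a measured value, and the dict layout (`id`, `embedding`, `supersedes_id`) is illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def insert_if_new(store, chunk, threshold=0.95):
    """Skip the insert when a stored embedding is a near-duplicate."""
    for existing in store:
        if cosine(existing["embedding"], chunk["embedding"]) >= threshold:
            return False
    store.append(chunk)
    return True


def drop_superseded(chunks):
    """Read-time filter: hide any chunk that a newer chunk supersedes."""
    losers = {c["supersedes_id"] for c in chunks if c.get("supersedes_id")}
    return [c for c in chunks if c["id"] not in losers]
```

One caveat: a correction is often a near-duplicate of the fact it replaces, so writes that carry a `supersedes_id` probably need to bypass the dedupe check.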

## Agent Memory Lessons

Building memory turned the agent from a clever parrot into something I could delegate to. It still lied sometimes. But it lied *consistently* about the same outdated facts, which meant I could debug memory instead of debugging &quot;the model.&quot;

ChromaDB was not forever infrastructure. It was the right 80% solution while I validated whether anyone besides me wanted an agent that remembered. They did not, in large numbers, in 2022. But I learned that memory UX is harder than memory engineering: users do not want to manage vectors; they want to be understood.

If you are building agents today, steal this pipeline. Swap Chroma for whatever your cloud vendor subsidizes. Keep the ingest, retrieve, pack, generate split. Your future self will thank you when you need to audit why the bot thought you still worked at a company you left in March.

Memory is not a feature slide. It is plumbing. Build the plumbing first, then lie on stage about AGI.</content:encoded></item><item><title>LoRA Fine-tuning Actually Works</title><link>https://utso.stamped.work/blog/2022-08-20-lora-fine-tuning-actually-works/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2022-08-20-lora-fine-tuning-actually-works/</guid><description>Low-rank adaptation makes LLM fine-tuning cheap enough for small teams. Intuition, math sketch, and why it beat full fine-tuning in practice.</description><pubDate>Sat, 20 Aug 2022 00:00:00 GMT</pubDate><content:encoded>Full fine-tuning a large language model is like buying the building because you wanted to repaint one wall. Most weights do not need to move much to adapt a model to a new task, style, or domain. LoRA (Low-Rank Adaptation) from Hu et al. exploits that fact: freeze the pretrained model, inject small trainable low-rank matrices into attention layers, and update only those. It works annoyingly well.

By August 2022, practitioners were not debating whether LoRA was real. They were debating how to stack it with quantization, which layers to target, and how to merge adapters for deployment. If you were a founder with one GPU and a niche dataset, LoRA was the difference between &quot;maybe&quot; and &quot;shipped Friday.&quot;

## The Intuition: Updates Live in a Low-Dimensional Subspace

Pretrained transformers already encode broad language structure. Task-specific adaptation often lies in a subspace with far fewer degrees of freedom than the full parameter count suggests.

Instead of updating a weight matrix W directly, LoRA learns a delta:

W&apos; = W + BA

Where B is d x r, A is r x k, and rank r is tiny (4, 8, 16) compared to full rank.

You train A and B. W stays frozen.

```mermaid
flowchart LR
    subgraph frozen[&quot;Frozen pretrained weights&quot;]
        W[&quot;W (d x k)&quot;]
    end
    subgraph lora[&quot;Trainable LoRA adapters&quot;]
        A[&quot;A (r x k)&quot;]
        B[&quot;B (d x r)&quot;]
        A --&gt; BA[&quot;B @ A&quot;]
    end
    X[&quot;Input x&quot;] --&gt; W
    X --&gt; A
    W --&gt; Sum[&quot;Wx + BAx&quot;]
    BA --&gt; Sum
    Sum --&gt; Out[&quot;Output&quot;]
```

At inference, BA can be merged into W for no extra latency if you plan ahead. That merge step is when LoRA stops being research and starts being infrastructure.
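A toy check of the merge claim, using column-vector inputs (so the frozen path is W times x) and hand-picked numbers. The tiny shapes and the pure-Python `matmul` helper are only for illustration; a real stack would do this on framework tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy shapes: d = 2 outputs, k = 3 inputs, rank r = 1.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]          # frozen, d x k
B = [[0.5],
     [-1.0]]                   # trainable, d x r
A = [[1.0, 2.0, 0.0]]          # trainable, r x k
x = [[1.0], [1.0], [1.0]]      # one input as a k x 1 column

# Adapter-sidecar path: two extra small matmuls at inference.
sidecar = add(matmul(W, x), matmul(B, matmul(A, x)))

# Merged path: fold BA into W once, then serve a single matmul.
W_merged = add(W, matmul(B, A))
merged = matmul(W_merged, x)

print(sidecar == merged)  # True
```

With real fp16 weights the two paths agree only up to rounding, which is why the merged model still deserves its own eval pass.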

## Why Full Fine-Tuning Hurts in Production

**Memory:** Optimizer states for billions of parameters dominate VRAM.

**Catastrophic forgetting:** Large updates erode general capabilities.

**Storage:** One full checkpoint per customer or task is untenable.

**Iteration speed:** Slow training loops kill experimentation.

LoRA adds one hyperparameter (rank) and in exchange cuts trainable parameters by orders of magnitude. Not magic. Engineering.
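The arithmetic behind that reduction, for a single weight matrix. The 4096 x 4096 shape is an assumed example size, not a figure from any specific model.

```python
def lora_param_counts(d, k, r):
    """Trainable parameters for one d x k weight matrix."""
    full = d * k              # full fine-tuning updates every entry of W
    lora = r * (d + k)        # LoRA trains B (d x r) and A (r x k)
    return full, lora, full / lora


full, lora, ratio = lora_param_counts(d=4096, k=4096, r=8)
print(full, lora, ratio)  # 16777216 65536 256.0
```

The saving compounds further because only some matrices get adapters at all, and optimizer state is kept only for the trainable part.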

## Where to Attach Adapters

The original paper focused on attention projection matrices (the query and value projections in most of its experiments). Community practice expanded to more layers. Rule of thumb in 2022:

- Start with attention projections
- If underfitting, increase rank before adding every layer
- If overfitting on tiny data, decrease rank and regularize harder

More adapters is not automatically better. You are fitting a budget.

## LoRA vs Other Parameter-Efficient Methods

Adapters (Houlsby et al.) insert bottleneck modules. Prefix tuning prepends learned vectors. Prompt tuning learns soft prompts.

LoRA&apos;s sweet spot:

- **Fewer architectural surprises** than bolting adapters everywhere
- **Mergeable weights** for deployment
- **Simple to implement** in PyTorch with hooks on linear layers

For many LLM fine-tunes, LoRA became the default first attempt.

## Quantization + LoRA: QLoRA Preview Energy

Even before the QLoRA paper hype peaked, the pattern was obvious: 4-bit quantized base weights plus LoRA adapters trained in higher precision. Train cheap. Serve merged or adapter-sidecar depending on infra.

Founders with consumer GPUs could fine-tune models that previously required serious clusters. That democratization has second-order effects: more niche models, more garbage models, more compliance questions.

## What LoRA Is Good At

- Style and format adherence (JSON, support macros, legal tone)
- Domain vocabulary injection (medical shorthand, internal acronyms)
- Instruction following tweaks on small curated datasets
- Multi-tenant SaaS where each customer gets an adapter, not a full model

## What LoRA Does Not Fix

- **Bad data:** Garbage demonstrations produce garbage adapters.
- **Factual grounding:** LoRA will confidently bake in wrong facts from your CSV.
- **Safety:** If your dataset contains toxic patterns, low-rank updates still learn them.
- **Evaluation:** You still need held-out prompts and regression tests.

LoRA makes experimentation cheap. It does not make responsibility optional.

## Training Recipe That Actually Shipped

1. **Curate 500-5,000 high-quality examples** (quality beats 50k scraped rows).
2. **Pick a base model** matched to latency budget (not the biggest on the leaderboard).
3. **Set rank 8 or 16**, alpha often 16 or 32, tune learning rate 1e-4 to 3e-4 as a starting band.
4. **Train 1-3 epochs**; watch eval loss for overfit on tiny sets.
5. **Merge adapters** for inference unless you need hot-swapping per tenant.
6. **Run eval prompts** from production logs, not just training loss.

This is boring. Boring ships.
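One way to pin the recipe down is a small frozen config object. This is a sketch with assumed defaults drawn from the ranges above; the `LoraRecipe` class and the layer names in `target_modules` are hypothetical and depend on the base model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraRecipe:
    rank: int = 8
    alpha: int = 16
    learning_rate: float = 2e-4
    epochs: int = 2
    target_modules: tuple = ("q_proj", "v_proj")  # hypothetical layer names

    @property
    def scaling(self):
        """LoRA scales the low-rank update BA by alpha / rank."""
        return self.alpha / self.rank
```

Because the update is scaled by alpha over rank, raising the rank from 8 to 16 while leaving alpha at 16 halves the effective scale, which is one reason the two values tend to be changed together.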

## Multi-Tenant Deployment Patterns

**Merged weights per tenant:** Simple inference, painful update pipeline.

**Shared base + adapter files:** Hot swap LoRA weights, more serving complexity.

**Grouped adapters:** Cluster similar customers to reduce cardinality.

Your MLOps maturity picks the pattern. LoRA enables options full fine-tuning could not afford.

## Skepticism I Had to Eat

I assumed low-rank would underfit complex reasoning tasks. Sometimes it does. Surprisingly often, rank 8 on a 7B-class model (when those weights became available openly) captured enough shift for vertical copilots.

I also assumed merged weights would drift numerically. Merging worked fine with fp16 discipline. Test your stack.

## Relation to Stable Diffusion LoRA Ecosystem

Same idea, different modality. Image LoRAs for styles and characters exploded in late 2022. The mental model transfers: small adapter, big frozen base, community forks everywhere. LLM LoRA just has uglier eval metrics.

## Research Loose Ends in 2022

- Optimal rank selection theory vs heuristics
- Which layers matter for reasoning vs style
- Composition of multiple LoRAs without interference
- Federated LoRA with privacy constraints

Academia will publish. You need to ship adapters and monitor drift.

## Closing

LoRA fine-tuning actually works. Not because low-rank magic is profound, but because pretrained models are already good and most products need a nudge, not a lobotomy.

If you are still copying full model checkpoints per experiment on a single A100, stop. Freeze W. Train BA. Merge. Measure. Ship.

The hard part remains data and evaluation. LoRA just makes failure cheaper and success faster. That is enough to change what startups build.</content:encoded></item><item><title>React Native vs Flutter in 2022</title><link>https://utso.stamped.work/blog/2022-07-29-react-native-vs-flutter-2022/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2022-07-29-react-native-vs-flutter-2022/</guid><description>Hermes, rendering pipelines, and hiring in India. A technical comparison of React Native and Flutter when choosing a mobile stack in 2022.</description><pubDate>Fri, 29 Jul 2022 00:00:00 GMT</pubDate><content:encoded>Every mobile stack debate is two parts engineering and one part religion. In 2022, the practical choice for most Indian startups still narrowed to React Native or Flutter. Native iOS/Swift and Android/Kotlin were correct and slow for small teams. Kotlin Multiplatform was promising and thin on hiring. Xamarin was a ghost.

I have shipped with both RN and Flutter. Neither is free. Here is how they compared when Hermes matured, Material 3 landed, and everyone pretended they would rewrite the app &quot;after PMF.&quot;

## The Real Question: What Are You Optimizing?

- **Time to hire in India:** React Native usually wins on JS/React familiarity.
- **UI consistency across platforms:** Flutter wins if you accept its widget religion.
- **Integration with existing web team:** RN if your web team is React-heavy.
- **Animation-heavy custom UI:** Flutter&apos;s compositor model is pleasant.
- **Brownfield native modules:** RN&apos;s bridge ecosystem is older, messier, more examples.

Pick constraints, not logos.

## React Native in 2022: Not the Old Bridge Monolith

The New Architecture (Fabric renderer, TurboModules, JSI) was rolling out. Hermes became the default JavaScript engine on many templates. Hermes improved startup time and memory versus JSC on mid-range Android devices, which is most of India.

RN still renders native components. Your React tree maps to UIView and Android views. That means platform quirks leak through: spacing, keyboards, navigation transitions, accessibility edge cases.

**Strengths:**

- Massive JS talent pool
- Share logic and some UI patterns with React web
- Mature libraries for navigation, forms, analytics
- Easier incremental adoption in apps with existing native code

**Weaknesses:**

- Performance cliffs when badly built lists bring bridge chatter back
- Dependency hell when native modules disagree on RN version
- UI consistency requires discipline; platforms look &quot;almost the same&quot;

If your team already thinks in hooks and Redux, RN is the path of least political resistance.

## Flutter in 2022: Skia All the Way Down

Flutter draws its own pixels via Skia (and Impeller coming for iOS). No native widget mapping for the core UI. One rendering pipeline, one layout model, one set of animations.

That is liberating until you need a platform-specific behavior that fights the framework.

**Strengths:**

- Predictable UI across iOS and Android
- Hot reload culture is excellent for designer-engineer loops
- Strong performance on animations and custom painters
- Single language (Dart) end to end, fewer &quot;who owns the native module&quot; disputes

**Weaknesses:**

- Smaller hiring pool than JS in most Indian cities
- APK/IPA size baseline heavier than minimal RN apps
- Platform channel glue for obscure SDKs
- Dart is fine; convincing seniors to learn it is a sales job

## Rendering Pipeline: Where the Philosophies Diverge

```mermaid
flowchart TB
    subgraph rn[&quot;React Native&quot;]
        RNJS[&quot;JavaScript (Hermes)&quot;] --&gt; JSI[&quot;JSI / TurboModules&quot;]
        JSI --&gt; Fabric[&quot;Fabric renderer&quot;]
        Fabric --&gt; NativeW[&quot;Native views (UIKit / Android)&quot;]
    end
    subgraph fl[&quot;Flutter&quot;]
        DartUI[&quot;Dart framework&quot;] --&gt; Engine[&quot;Flutter engine&quot;]
        Engine --&gt; Skia[&quot;Skia compositor&quot;]
        Skia --&gt; GPU[&quot;GPU surface&quot;]
    end
```

RN: JavaScript orchestrates native widgets. Flutter: Dart orchestrates pixels. The RN diagram still has more moving parts at the UI boundary. That matters for debugging flicker at 3 AM.

## Hermes Changed the RN Calculus

Before Hermes, RN startup on budget Android was a punchline. Ahead-of-time compilation to Hermes bytecode and tighter memory use made RN credible for consumer apps in India where devices skew mid-range.

It did not erase Flutter&apos;s advantages. It removed one historical RN veto from architecture reviews.

## Hiring in India: The Uncomfortable Truth

Job boards lie. Real hiring data in 2022 India:

- **React Native:** Easier to find candidates who have touched React. Depth varies wildly. Many &quot;2 years RN&quot; resumes are mostly web with one Expo toy app.
- **Flutter:** Harder to fill senior slots outside Bangalore/Hyderabad/Pune bubbles. Strong Flutter devs exist; they take more time and effort to recruit.

For startups, time-to-hire beats benchmark FPS. A good RN team ships before a perfect Flutter team is assembled.

For design-heavy products with custom UI, paying the Flutter recruiting tax can still be correct.

## Ecosystem and Third-Party SDK Pain

Both ecosystems suffer when a payment gateway or KYC SDK ships a half-maintained plugin. RN has more wrappers because it is older. Flutter plugins are often cleaner when they exist.

Evaluate your **mandatory SDK list** before choosing. If your vertical needs an SDK with only an RN bridge, the debate ends.

## Developer Experience

RN: JavaScript tooling, TypeScript adoption high, Metro bundler quirks, Flipper debugging when it works.

Flutter: `flutter doctor` is honest about your broken Xcode. DevTools solid. Build times creep on large apps.

Neither is Xcode Storyboards hell. Both beat that.

## Performance: Stop Quoting Hello World

List scrolling, image caching, and navigation stacks dominate perceived performance. Both frameworks can stutter:

- RN: fix list virtualization, memoization, native driver for animations
- Flutter: watch build method rebuild storms, profile with DevTools

Microbenchmarks are for conference talks. Profile your actual screens.

## Opinionated Recommendation for 2022

**Choose React Native if:**

- Your team is mostly web React engineers
- You need gradual native integration
- Hiring speed in India is critical
- Your UI is standard forms + lists + modals

**Choose Flutter if:**

- UI consistency and motion design are core product value
- You can afford Dart onboarding
- You want fewer platform visual leaks
- Your app is greenfield without legacy native

**Choose native if:**

- You have budget for two platform teams and complex platform APIs dominate

## What I Would Not Do

Rewrite a shipping RN app to Flutter for &quot;performance&quot; without profiling.

Pick Flutter because Twitter said Google killed RN. RN was very much alive.

Ignore accessibility. Both frameworks require work; RN inherits some platform behavior, Flutter requires you to wire semantics deliberately.

## Closing

React Native vs Flutter in 2022 was not a knockout. Hermes narrowed RN&apos;s weakness. Flutter matured tooling. The winner is the stack your team can ship and maintain for three years.

Founders want a verdict. The verdict is constraints. Hire for what you choose. Profile what you ship. Ignore holy wars from people who have not released a Play Store update in six months.</content:encoded></item><item><title>Training YOLO on a Laptop GPU Over Summer Break</title><link>https://utso.stamped.work/blog/2022-06-18-training-yolo-laptop-gpu-summer-break/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2022-06-18-training-yolo-laptop-gpu-summer-break/</guid><description>June 2022: I spent summer break trying to get real-time object detection working on my own hardware. It was slower, messier, and more educational than the YouTube tutorials promised.</description><pubDate>Sat, 18 Jun 2022 00:00:00 GMT</pubDate><content:encoded>June 2022. Board exams were behind me. Strato Foods still needed occasional fixes. And I had a laptop GPU, a Kaggle dataset, and the arrogant belief that I could reproduce a YOLO demo in a weekend.

It took the whole month. This is what that actually looked like for a teenager building through Strato Inc, not a research lab with labeled forensic data.

## Why I Picked Object Detection

I had already played with image classifiers. Detection felt like the next level: boxes, classes, something you could point at on a screen and say &quot;see, it works.&quot;

I was not solving fraud. I was not building verification for anyone&apos;s pipeline. I wanted a side project that might eventually ship as a silly demo app or teach me enough CV to stop copy-pasting Stack Overflow answers into Flutter plugins.

YOLOv5 was the default recommendation on ML Twitter that summer. Ultralytics made it easy to pretend the hard parts were solved.

## The Hardware Reality

My laptop had a consumer NVIDIA GPU. Not datacenter. Not even particularly new.

What the tutorials skip:
- **CUDA version hell** with PyTorch builds
- **Thermal throttling** after twenty minutes of training
- **Batch size 4** because batch size 16 OOM&apos;d instantly
- **Dataset on an external drive** because my SSD was full of Android build artifacts

The first time training ran overnight and Windows Update rebooted the machine, I considered switching to philosophy.

## What Actually Worked

```mermaid
flowchart LR
    subgraph data [Data]
        K[Kaggle dataset]
        L[Label cleanup by hand]
    end

    subgraph train [Train]
        Y[YOLOv5 small]
        G[Laptop GPU]
    end

    subgraph ship [Ship]
        E[Export ONNX]
        T[Test on phone photos]
    end

    K --&gt; L --&gt; Y
    G --&gt; Y
    Y --&gt; E --&gt; T
```

**Start tiny.** YOLOv5s, not x. I did not need SOTA. I needed a box around a cup that did not flicker every third frame.

**Label hygiene matters more than architecture.** Half my early failures were mislabeled training images I was too lazy to fix.

**Export early.** Getting a `.pt` file to run in Python is not the same as making it usable anywhere else. I learned that before I learned mAP.

**Record your hyperparameters.** Not because I was rigorous. Because I forgot what worked and re-ran bad configs twice.
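A minimal way to stop re-running bad configs is an append-only JSONL log. The file layout and the `map50` metric key are assumptions for illustration, not what the training tool itself writes.

```python
import json
import time
from pathlib import Path


def log_run(path, hyperparams, metrics):
    """Append one training run as a JSON line so configs are never lost."""
    record = {
        "logged_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "hyperparams": hyperparams,
        "metrics": metrics,
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def best_run(path, metric="map50"):
    """Return the logged run with the highest value of `metric`."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    runs = [json.loads(line) for line in lines if line.strip()]
    return max(runs, key=lambda r: r["metrics"][metric])
```

Calling `log_run` at the end of every training script costs nothing and survives reboots in a way that terminal scrollback does not.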

## What Failed

**Real-time on phone.** The exported model ran on laptop webcam at acceptable FPS. On a mid-range Android phone through a naive pipeline, it did not. That gap between &quot;demo&quot; and &quot;product&quot; would follow me for years.

**Custom classes with too little data.** I tried adding a class with forty images. The model confidently detected everything as that class. Classic overfit. I was not special; every beginner hits this wall.

**Expecting tutorial metrics to mean product quality.** High mAP on a clean validation set does not mean robustness to bad lighting, motion blur, or a user who holds the phone at an angle that would make a cinematographer cry.

## Connection to Strato Inc

None of this shipped as a Strato Foods feature. It was parallel learning — the same pattern as my GPT-3 Telegram experiments that month: build at night, learn something real, maybe reuse later.

That reuse would come in random places: better intuition for camera pipelines in Android apps, less trust in &quot;99% accuracy&quot; slides, patience when ML Kit did something dumb in production.

## Takeaway

If you are a student with one GPU and infinite YouTube confidence: **shrink the problem until it fits your hardware and your patience.** Detection is not magic. It is labels, compute, and the humility to fix your dataset before you swap model architectures.

Summer break ended. I went back to classes, Strato Foods, and GPT-3 plumbing. The YOLO weights lived in a folder I would rediscover months later and delete to free space.

The lessons stuck longer than the checkpoints.</content:encoded></item><item><title>Everyone Keeps Calling Things ChatGPT</title><link>https://utso.stamped.work/blog/2022-04-11-chatgpt-is-not-gpt4/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2022-04-11-chatgpt-is-not-gpt4/</guid><description>GPT-3, InstructGPT, ChatGPT, and RLHF are not interchangeable labels. A caustic primer for people shipping products on top of the wrong mental model.</description><pubDate>Mon, 11 Apr 2022 00:00:00 GMT</pubDate><content:encoded>If I had a rupee for every pitch deck that said &quot;we use ChatGPT under the hood&quot; in Q1 2022, I could fund a seed round. Most of those decks meant &quot;we call the OpenAI API.&quot; Some meant &quot;we fine-tuned GPT-3.&quot; A few confidently meant &quot;we built our own foundation model.&quot; The words collapsed into one brand, and technical decisions got worse because of it.

ChatGPT is a product. GPT-4 did not exist in public conversation yet. GPT-3 is a base language model. InstructGPT is a fine-tuned variant optimized for following instructions. RLHF is the training recipe that connects them. Confusing these is not pedantry. It is how you end up with wrong latency budgets, wrong safety assumptions, and wrong fine-tuning strategies.

## GPT-3: Autocomplete at Scale

GPT-3 is a large autoregressive transformer trained to predict the next token on internet text. It is good at continuation. Ask it to &quot;write a polite email,&quot; and it might write an email, or it might write a forum post about writing emails, or it might veer into fiction because the prior token context looked like a story.

Base models are simulators of text distribution. They are not assistants. They do not know what you want unless the prompt makes the desired mode statistically likely.

For founders, the base model lesson is simple: **do not expect alignment from scale alone.** GPT-3&apos;s raw API was a power tool for people who could prompt engineer around its moods. Everyone else got inconsistent magic.

## InstructGPT: Instructions as the Objective

OpenAI&apos;s InstructGPT work (Ouyang et al., 2022) fine-tuned GPT-3-style models on demonstrations of humans following instructions, then refined them with reinforcement learning from human feedback. The resulting models were preferred by labelers over much larger base models: in the paper, outputs from the 1.3B InstructGPT model beat those of the 175B GPT-3.

That was the inflection. Capability stopped being purely about pretraining loss and started being about **human preference on outputs**.

Instruct-style models answer questions more directly, refuse some bad requests more often, and hallucinate with more confidence. That last part is important. Alignment can make models sound right when they are wrong.

## RLHF: The Training Stack People Hand-Wave

Reinforcement Learning from Human Feedback is a pipeline, not a checkbox:

1. **Supervised fine-tuning (SFT)** on curated instruction-response pairs.
2. **Reward model training** on comparisons: humans pick better answers.
3. **Policy optimization** (often PPO) to maximize reward while staying close to the SFT model.

```mermaid
flowchart TB
    Base[&quot;Base LM (GPT-3 class)&quot;] --&gt; SFT[&quot;Supervised fine-tuning on demonstrations&quot;]
    SFT --&gt; RM[&quot;Train reward model from human comparisons&quot;]
    RM --&gt; PPO[&quot;RL policy optimization (e.g. PPO)&quot;]
    PPO --&gt; Instruct[&quot;Instruct-style model&quot;]
    Instruct --&gt; Chat[&quot;Chat product layer&quot;]
    Chat --&gt; Tools[&quot;Plugins, browsing, system prompts&quot;]
    Base -.-&gt;|&quot;not the same artifact&quot;| Chat
```

Each box is a different artifact with different failure modes. Skipping SFT and hoping PPO fixes everything is how you get unstable training and bizarre policies. Skipping reward modeling and using heuristics is how you get gaming.
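Step 2 is the least hand-wavy once written down. Reward models of this kind are commonly trained with a pairwise loss on human comparisons; the sketch below shows that loss for a single pair, with scalar rewards standing in for the output of a real reward network.

```python
import math


def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that the chosen answer outranks the rejected one.

    loss = -log(sigmoid(r_chosen - r_rejected)), written in a numerically
    stable softplus form.
    """
    margin = reward_chosen - reward_rejected
    # softplus(-margin) equals -log(sigmoid(margin))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss only sees the difference between the two rewards, so the reward model learns a ranking, not an absolute score. That is part of why the policy step needs a penalty for drifting away from the SFT model.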

If your ML lead says &quot;we will just RLHF it next sprint,&quot; ask which step they mean. Then ask for data.

## ChatGPT: Product, Not Model Card

ChatGPT entered public consciousness with its November 2022 launch, but by April 2022 the ingredients were already visible to anyone reading papers and API changelogs. ChatGPT wraps model + system prompt + moderation + UX affordances (chat history, regeneration, thumbs up/down feeding future training).

Calling your startup &quot;ChatGPT for X&quot; tells investors you ride trends. It tells engineers nothing about:

- Which model snapshot you are on
- Whether you rely on chat-tuned vs code-tuned endpoints
- How you handle context windows
- What moderation hooks you inherit vs implement

The product layer matters. System prompts and tool use can turn the same weights into a lawyer cosplay or a SQL assistant. That is not mysticism. It is conditioning.

## Why the Naming Mess Hurts Shipping

**Latency and cost:** Chat-style products encourage multi-turn context. Base completion APIs reward single-shot prompts. Your architecture differs.

**Safety:** Instruction-tuned models have refusals baked in. Base models may comply with harmful prompts unless you bolt on classifiers. Compliance teams care.

**Fine-tuning:** OpenAI and others offered different fine-tuning surfaces over time. Fine-tuning Davinci is not the same as fine-tuning an instruct model. Data formats differ. Evaluation differs.

**Evaluation:** Comparing your system to &quot;ChatGPT&quot; without a fixed benchmark is meaningless. ChatGPT changes. Your demo does not.

## GPT-3 vs InstructGPT vs ChatGPT: A Founder Cheat Sheet

| Layer | What it is | What you get |
|-------|------------|--------------|
| GPT-3 (base) | Pretrained LM | Continuation, brittle control |
| InstructGPT | SFT + RLHF weights | Instruction following, refusals |
| ChatGPT | Product on tuned weights | UX, moderation, multi-turn |

When someone says &quot;GPT-4&quot; in April 2022, they are often time-traveling or bluffing. GPT-4 was not public. Clarify or ignore.

## What RLHF Does Not Fix

RLHF aligns to labeler preferences, not ground truth. It can:

- Reduce toxic outputs while increasing polished nonsense
- Encode cultural bias from raters
- Overfit to short, helpful-sounding formats

If your application needs factual grounding, you need retrieval, tools, or domain-specific fine-tuning and evals. RLHF is not a database.

## Building Without Confusion

1. **Name the artifact** in your docs: model ID, snapshot date, API version.
2. **Separate policy from weights:** system prompts and filters are your liability surface too.
3. **Log prompts and outputs** (with privacy policy) so you can debug when the vendor updates weights silently.
4. **Benchmark against tasks**, not against a brand.
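Points 1 to 3 collapse into one habit: write a structured record for every model call. A minimal sketch follows, with hypothetical field names; redaction and retention rules are left out and would depend on the privacy policy.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_call_record(model_id, api_version, system_prompt, prompt, output):
    """Build an audit record for one model call.

    The system prompt is stored as a hash so prompt versions can be compared
    across records without repeating the full text in every log line.
    """
    return {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "model_id": model_id,
        "api_version": api_version,
        "system_prompt_sha256": hashlib.sha256(system_prompt.encode("utf-8")).hexdigest(),
        "prompt": prompt,
        "output": output,
    }
```

When behavior shifts overnight, grouping these records by `model_id` and prompt hash shows quickly whether the weights changed or the prompt did.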

## The Opinion You Paid For

The industry brands everything ChatGPT because consumers recognize it. Inside your codebase, that branding is malpractice. GPT-3 is the engine block. InstructGPT is the fuel mapping. RLHF is the tuning process. ChatGPT is the car with airbags and a speed limiter.

Drive the car if you want. Just do not pretend you manufactured the engine because you can floor the accelerator in a parking lot.

When GPT-4 actually ships publicly, the naming will get worse. Start building the discipline now.</content:encoded></item><item><title>Building with the GPT-3 API Before Anyone Had Heard of ChatGPT</title><link>https://utso.stamped.work/blog/2022-03-20-building-with-gpt-3-before-chatgpt/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2022-03-20-building-with-gpt-3-before-chatgpt/</guid><description>In March 2022 the OpenAI API was still a weird side project for tinkerers. I was wiring Telegram bots, burning credits, and learning prompts are code.</description><pubDate>Sun, 20 Mar 2022 00:00:00 GMT</pubDate><content:encoded>In March 2022 I was wiring **GPT-3** into side experiments from my room — Telegram ingress, rough memory sketches, prompt templates that broke every time OpenAI tweaked a model string. Nobody in my extended family had heard of ChatGPT. That product would not exist until November.

This post is a field note from the quiet months: before Stable Diffusion, before the hype tsunami, when using the API felt like a hobby you apologized for at dinner.

## The Landscape in Early 2022

If you were paying attention to ML Twitter in Q1 2022, the text side looked like this:

- **GPT-3** via API: strong autocomplete, flaky instruction-following unless you prompt carefully
- **InstructGPT** (Ouyang et al., January 2022): proof that RLHF could make smaller fine-tunes beat raw scale on human preferences — most founders had not internalized this yet
- **Codex**: magic for boilerplate, dangerous for security-sensitive codegen
- **DALL-E**: existed, but the public shock wave was still a few weeks away

What did not exist in my world: ChatGPT, cheap open-weight chat models, LangChain as default plumbing, or a single Indian enterprise buyer asking for &quot;an LLM strategy.&quot;

I was still running Strato Inc apps and tinkering at night. The LLM work was not a startup pitch. It was me trying to see if a Telegram bot could remember context better than stuffing the last twenty messages into a prompt.

## Why Telegram

Founders live on their phones. Async messaging maps cleanly to agent loops: message in, classify intent, maybe retrieve something, reply.

I picked Telegram because the Bot API is boring in a good way. No app store review for v1. Easy to share a bot link with three friends who would tolerate broken behavior.

The architecture in March was embarrassingly simple:

```mermaid
flowchart LR
    TG[Telegram message] --&gt; Router[Intent / keyword router]
    Router --&gt; Prompt[Prompt template + few-shot]
    Prompt --&gt; API[OpenAI Completions API]
    API --&gt; Post[Light post-process]
    Post --&gt; Reply[Telegram reply]
```

No vector DB yet. No tools. No swarm. Just **prompt + API + hope**, which is where everyone starts whether they admit it or not.
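The whole loop fits in one file. In this sketch the completion call is injected as a plain callable so no real API signature is assumed; the keywords, the template wording, and all function names are placeholders.

```python
import json

PROMPT_TEMPLATE = (
    "You are a terse assistant. Reply with a JSON object with one key, reply.\n"
    "Intent: {intent}\nMessage: {message}\nJSON:"
)


def route_intent(message):
    """Keyword router; the keywords are placeholders."""
    text = message.lower()
    if "remind" in text:
        return "reminder"
    if text.endswith("?"):
        return "question"
    return "chat"


def post_process(raw_completion, fallback="Sorry, I could not parse that."):
    """Base models do not always return the JSON that was requested."""
    try:
        parsed = json.loads(raw_completion)
    except ValueError:
        return fallback
    if isinstance(parsed, dict) and isinstance(parsed.get("reply"), str):
        return parsed["reply"]
    return fallback


def handle_message(message, complete):
    """Run one message through router, template, completion, post-process.

    `complete` is any callable that maps a prompt string to raw text, so the
    real API client can be swapped for a stub in tests.
    """
    prompt = PROMPT_TEMPLATE.format(intent=route_intent(message), message=message)
    return post_process(complete(prompt))
```

Keeping the completion call behind a callable also makes prompt changes testable: the template can be diffed and exercised without spending credits.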

## What GPT-3 Taught Me the Hard Way

**Prompts are code.** I versioned them in git after the third time a &quot;small wording tweak&quot; turned a helpful bot into a passive-aggressive poet. If you are not diffing prompts, you are not engineering.

**Base models simulate text.** Ask for a JSON object, sometimes get JSON, sometimes get a Medium article about JSON. Instruct-style behavior was coming; the public API surface in early 2022 still rewarded people who understood that GPT-3 was a distribution learner, not an assistant.

**Latency is UX.** Non-streaming completions felt broken on mobile. Waiting eight seconds for &quot;okay&quot; is not a product. Streaming responses were not vanity — they were survival.

**Cost is real.** I tracked tokens before I tracked uptime. A badly designed loop that re-sent the system prompt every turn would eat lunch money in an afternoon.
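
A rough sketch of why that loop hurts, assuming fixed token counts per turn. The function name and the numbers in the usage note are illustrative, not measurements from my bot.

```python
def tokens_sent(turns: int, system_tokens: int, tokens_per_turn: int, resend_history: bool) -> int:
    """Total prompt tokens sent over a conversation.

    Every call carries the system prompt. With resend_history=True each call
    also carries all earlier turns, so the total grows quadratically.
    """
    total = 0
    for turn in range(1, turns + 1):
        history = tokens_per_turn * (turn - 1) if resend_history else 0
        total += system_tokens + history + tokens_per_turn
    return total
```

With 20 turns, a 300-token system prompt, and 50 tokens per turn, this gives 7000 tokens without history and 16500 with it. Multiply by whatever the per-token price is on the day you read this.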

**Evals or chaos.** I had no formal harness yet. I had a spreadsheet of prompts and &quot;good / bad / hilarious failure&quot; notes. That primitive eval was enough to stop me shipping the worst regressions to friends.

## What I Was Not Doing

I was not fine-tuning with LoRA. That post comes later in 2022, after Stable Diffusion made open-weight adaptation culture normal.

I was not pitching enterprises on &quot;AI transformation.&quot; Indian B2B buyers were still asking for basic dashboards and WhatsApp integrations, not retrieval-augmented generation.

March 2022 me was plumbing, not thought leadership.

## The InstructGPT Paper Mattered More Than the Hype

The InstructGPT result landed in January 2022 and changed how I thought about the stack even before products caught up:

1. **Alignment is a training recipe**, not a side effect of scale
2. **Human preference data** is a moat if you can collect it responsibly
3. **Smaller aligned models beating larger base models** would reshape API economics

I did not have RLHF infrastructure. Almost nobody did outside labs. But I stopped assuming the biggest model string on the pricing page was automatically the right tool for a constrained bot.

That mental shift paid off when ChatGPT arrived nine months later and everyone conflated &quot;chat UI&quot; with &quot;new science.&quot;

## Failure Modes I Hit in March

- **Context stuffing:** pasted entire chat history until I hit token limits and got truncated nonsense
- **No memory across sessions:** users thought the bot was rude; the bot was amnesiac
- **Over-trusting outputs:** sent a draft email that sounded confident and was wrong about a date
- **Rate limits at demo time:** classic

Each failure pushed me toward the memory and eval work I would write about later in 2022. The through-line is boring: **continuity and measurement**, not bigger models.
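
The context-stuffing failure has a boring first fix: keep only the most recent messages that fit a budget. A minimal sketch, assuming whitespace word counts as a crude stand-in for a real tokenizer (the helper name is mine):

```python
def trim_history(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages whose rough token count fits the budget."""
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):
        cost = len(msg.split())  # crude proxy, not a real token count
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order
```

This does nothing for memory across sessions. It only stops the truncated nonsense.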

## What Changed by Year End (Foreshadowing)

November 2022 would compress a decade of narrative into a weekend when ChatGPT launched. March me would have laughed if you said relatives would use LLMs for wedding speeches.

But the foundations were already visible in March if you were building instead of tweeting:

- Prompt versioning
- Streaming UX
- Token budgets
- Async messaging interfaces
- Skepticism of raw base-model behavior

Stable Diffusion in August would add image pipelines. ChromaDB experiments in September would add memory. None of that was visible yet. The GPT-3 API was enough to learn on.

## Takeaway

If you are building in 2022 (or reading this retroactively): **you do not need the trendiest acronym to start.** You need a narrow loop, a logging habit, and friends who will tell you when the bot lies.

ChatGPT would later make everyone think they were late. They were not. They were early if they had already learned that prompts are code and demos without evals are liabilities.

I was just a kid with Strato Inc on the side and a Telegram bot that sometimes worked. That was the right level of ambition for March 2022.

The enterprise pivots could wait. The API was already enough to teach the real lessons.</content:encoded></item><item><title>Before Stable Diffusion: The Summer of DALL-E</title><link>https://utso.stamped.work/blog/2022-02-14-stable-diffusion-before-stable-diffusion/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2022-02-14-stable-diffusion-before-stable-diffusion/</guid><description>DALL-E 1, GLIDE, and classifier-free guidance set the stage for open diffusion. A founder-researcher&apos;s field notes from the summer everything changed.</description><pubDate>Mon, 14 Feb 2022 00:00:00 GMT</pubDate><content:encoded>Everyone talks about Stable Diffusion like it appeared from nowhere in August 2022. It did not. The real inflection was the summer before, when OpenAI and Google Research published two papers that made text-to-image generation feel inevitable instead of laughable. If you were building anything in computer vision that year, you were not waiting for a single model release. You were watching a stack crystallize: diffusion as the generative backbone, CLIP as the semantic bridge, and classifier-free guidance as the knob that turned fuzzy blobs into something you could ship in a demo.

I was running a startup and doing research on the side. That combination makes you allergic to hype but hungry for leverage. DALL-E 1 and GLIDE were leverage. Not because they were perfect. They were not. But they made the failure modes legible.

## DALL-E 1: Proof That Text Could Steer Pixels

OpenAI&apos;s &quot;Zero-Shot Text-to-Image Generation&quot; (January 2021, but the ecosystem digestion took months) used a discrete VQ-VAE token space and an autoregressive transformer. You did not diffuse in pixel space. You predicted the next image token given prior tokens and text. The results were surreal, inconsistent, and occasionally brilliant. That was the point.

DALL-E 1 established three things practitioners still rely on:

1. **Joint training on image-text pairs scales.** The model did not need fine-grained captions for every object; noisy web alt-text was enough to learn rough alignment.
2. **Compositionality is hard.** Ask for &quot;a red cube on top of a blue sphere&quot; and you get symbolism, not physics. Text conditioning is not a scene graph.
3. **Autoregressive image modeling is expensive at inference.** Generating hundreds or thousands of tokens per image is fine for research demos, painful for products.

If you only remember DALL-E 1 as &quot;the weird avocado chair meme,&quot; you missed the architectural bet: treat images as a sequence problem conditioned on language. CLIP, released alongside it, reranked the sampled candidates and helped make that bet look viable.

## GLIDE: Diffusion Enters the Chat

OpenAI&apos;s GLIDE (Guided Language to Image Diffusion for Generation and Editing) took a different route. Train a diffusion model in pixel space (with a U-Net), condition on text embeddings, and use guidance at sampling time to sharpen outputs toward the prompt. The paper compared CLIP guidance against classifier-free guidance, and human raters preferred the classifier-free samples.

GLIDE was better at photorealism than DALL-E 1 for many prompts. It also exposed the sampling cost problem openly: diffusion means many forward passes through a large U-Net. Quality scales with steps. Your GPU bill scales with it too.

The paper&apos;s editing results were underrated. Inpainting and masked editing with the same diffusion objective foreshadowed what Stable Diffusion&apos;s latent inpainting would later productize. The research community was already circling the same idea: diffusion is not just generation; it is iterative refinement under constraints.

## Classifier-Free Guidance: The Hack That Won

Classifier guidance required a separate classifier and gradients through it at sample time. Clever, but clunky. Ho and Salimans&apos; classifier-free guidance (CFG) trick dropped the separate classifier. During training, randomly drop the conditioning signal so the model learns both conditional and unconditional score estimates. At inference, combine the two predictions. With a weight above 1 this is extrapolation past the conditional estimate rather than interpolation.

The guided score is a weighted combination of conditional and unconditional predictions. Turn the guidance weight up and images get sharper, more literal, more &quot;on prompt.&quot; Turn it too high and they get crunchy, oversaturated, and artifact-ridden.

Every modern text-to-image stack you have used since then is, in some sense, CFG all the way down. Stable Diffusion. DALL-E 2&apos;s public details differ, but the guidance intuition persists. Midjourney&apos;s aesthetic bias is not magic; it is training data plus guidance schedules plus post-processing.

If you implement one equation from this era, implement CFG. It is the cheapest performance lever in the generative toolbox.
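
Here is that equation as a minimal sketch on plain lists. The name `scale` is my choice. Ho and Salimans write the same thing with a weight w, where `scale = 1 + w`.

```python
def cfg_combine(eps_uncond: list[float], eps_cond: list[float], scale: float) -> list[float]:
    """Classifier-free guidance on one noise prediction.

    Start at the unconditional prediction and move `scale` times along the
    direction toward the conditional one. scale=0 is unconditional, scale=1
    is plain conditional, larger values push past the conditional estimate.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

In a real sampler both predictions come from the same network, run once with the text embedding and once with it dropped, at every denoising step. That doubles the forward passes, which is the price of the lever.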

## Forward and Reverse: The Diffusion Mental Model

Diffusion is not deep learning astrology. It is a controlled noise process. Forward diffusion gradually corrupts data into Gaussian noise. Reverse diffusion learns to denoise step by step, recovering structure. Text conditioning bends the denoising vector field so &quot;a fox in watercolor&quot; means something different than &quot;a fox in neon.&quot;

```mermaid
flowchart LR
    subgraph forward[&quot;Forward diffusion&quot;]
        X0[&quot;x0: data&quot;] --&gt; X1[&quot;x1: light noise&quot;]
        X1 --&gt; X2[&quot;x2: more noise&quot;]
        X2 --&gt; XT[&quot;xT: ~ Gaussian&quot;]
    end
    subgraph reverse[&quot;Reverse diffusion (learned)&quot;]
        XT --&gt; Rn1[&quot;denoise step T&quot;]
        Rn1 --&gt; Rn2[&quot;denoise step T-1&quot;]
        Rn2 --&gt; R0[&quot;x0: sample&quot;]
    end
    Text[&quot;Text embedding&quot;] -.-&gt; Rn1
    Text -.-&gt; Rn2
    CFG[&quot;Classifier-free guidance&quot;] -.-&gt; Rn1
    CFG -.-&gt; Rn2
```

Once you internalize this diagram, papers stop sounding like incantations. DDPM, DDIM, latent diffusion: they are variations on how you parameterize the reverse steps, how many you take, and whether you work in pixel space or a VAE latent.
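
The forward half of the diagram fits in a few lines. A sketch assuming the linear noise schedule from the DDPM paper (1e-4 to 0.02 over 1000 steps). The function names are mine, and the learned reverse model is not shown.

```python
import math
import random


def linear_betas(steps: int, start: float = 1e-4, end: float = 0.02) -> list[float]:
    """Linearly spaced noise variances, one per step (needs steps >= 2)."""
    return [start + (end - start) * t / (steps - 1) for t in range(steps)]


def alpha_bars(betas: list[float]) -> list[float]:
    """Cumulative product of (1 - beta): how much signal variance survives."""
    out, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        out.append(running)
    return out


def q_sample(x0: list[float], abar_t: float, rng: random.Random) -> list[float]:
    """Jump straight from x0 to x_t: sqrt(abar) * x0 + sqrt(1 - abar) * noise."""
    signal = math.sqrt(abar_t)
    noise = math.sqrt(1.0 - abar_t)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]
```

Early steps keep almost all of the signal. By the last step the surviving fraction is close to zero, which is why xT is approximately Gaussian.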

## DALL-E 1 vs GLIDE: What Actually Mattered for Founders

From a product standpoint in early 2022, neither model was yours to ship. APIs were gated. Weights were not public. But the comparison still informed build vs wait decisions:

| Dimension | DALL-E 1 (autoregressive) | GLIDE (diffusion) |
|-----------|---------------------------|-------------------|
| Visual fidelity | Stylized, variable | Stronger photorealism |
| Controllability | Prompt-only, limited edit | Masked editing in paper |
| Inference cost | Many serial token steps | Many sequential U-Net passes |
| Open replication | Harder without scale | Feasible with effort |

The startup calculus was brutal. If your moat was &quot;we call OpenAI,&quot; you had no moat. If your moat was domain-specific data, constrained generation, or verification on outputs, the model layer being closed was almost irrelevant.

## What We Got Wrong That Summer

We underestimated how fast latent diffusion would drop VRAM requirements. DALL-E 1&apos;s token story made us think only hyperscalers could play. Stable Diffusion proved a compressed latent space plus a smaller U-Net could run on consumer GPUs. We overestimated how much users cared about semantic correctness vs aesthetic punch. CFG-heavy sampling taught users that cranking &quot;strength&quot; fixes prompts. It does, until it does not.

We also underweighted safety and provenance. Generative demos were so novel that nobody wanted to talk about deepfakes at dinner. That changed quickly.

## The Line to Stable Diffusion

Stable Diffusion did not invent text-to-image. It bundled the winning ingredients for replication:

- Latent diffusion (from Rombach et al.) to cut compute
- CLIP text conditioning (the semantic glue since DALL-E 1&apos;s era)
- CFG for prompt adherence
- Open weights, which turned research into a fork ecosystem overnight

If you were paying attention in the summer of DALL-E and GLIDE, Stable Diffusion felt like the obvious open-source endpoint, not a surprise. The shock was licensing and speed, not science.

## What I Would Tell a Founder Starting in 2022

Read the papers, not the threads. Implement a tiny DDPM on MNIST, then scale your intuition. Replicate CFG ablations on a small conditional model so you feel the artifact tradeoff in your own outputs. Do not anchor your roadmap to a single vendor&apos;s API tier.

The summer before Stable Diffusion was when text-to-image went from &quot;party trick&quot; to &quot;infrastructure.&quot; Everything after is distribution, fine-tuning, and the unglamorous work of verification, rights, and cost control. The models were the spark. The product is still the hard part.</content:encoded></item><item><title>2021: A Year of Building Things That Did Not Exist Before</title><link>https://utso.stamped.work/blog/2021-12-31-2021-year-in-review/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2021-12-31-2021-year-in-review/</guid><description>An honest year-in-review from Strato Inc: Strato Foods, Play Store apps, and what actually shipped.</description><pubDate>Fri, 31 Dec 2021 00:00:00 GMT</pubDate><content:encoded>This is the honest accounting of my 2021. Not the LinkedIn highlight reel. Not the &quot;grateful for the journey&quot; post. The version I would write if I were advising myself in January and had to report back with actual data.

I was 19 for most of this year, publishing Android apps under Strato Inc (started 2020 as PAC Limited) from my room, and trying to figure out whether I was a founder who codes or a coder who wanted to be a founder. The answer, unsatisfyingly, is both, and both suffered when the other demanded attention.

## The Timeline

```mermaid
gantt
    title 2021 Projects via Strato Inc (approximate)
    dateFormat YYYY-MM
    axisFormat %b

    section Strato Foods
    Strato Foods food delivery app :2021-03, 2021-12

    section Research
    GAN synthetic data experiments :2021-04, 2021-07
    Transformer attention project  :2021-09, 2021-12

    section Skills
    Flutter production apps        :2021-01, 2021-12
```

Overlap is intentional. I was bad at focusing on one thing. I am probably still bad at it. But 2021 was the year I learned that parallel projects have compound returns even when they feel like parallel burnout.

## What Shipped

**Strato Foods (March onward)**: Food delivery app published through Strato Inc. Real orders in my hometown, repeat users, revenue that made the Play Store account feel like a business instead of a hobby. This was the project that defined the year.

**Research (ongoing)**: GAN experiments for synthetic training data (mostly failed, learned honestly). Transformer attention project for document classification (worked, not published). Both informed how I think about ML in production vs ML in papers.

## What Did Not Ship

Honesty requires this section.

**VC fundraise**: 40 outreach attempts, 0 term sheets. Wrote about it in July. Stopped chasing in September. Built revenue instead.

**GAN synthetic data pipeline**: Mode collapse won. Shipped a simpler augmentation approach. The research notebook is a cautionary tale.

**Strato Foods at national scale**: Local product-market fit, not aggregator-scale logistics. That tension would matter later.

**School and sleep**: Both took damage. The app hours came from somewhere.

**Social life**: Moderate damage. Friends were patient. I owe them dinners.

## Numbers (Approximate)

| Metric | Value |
|--------|-------|
| VC meetings | 8 |
| Term sheets | 0 |
| Strato Foods | Shipped March 2021 |
| Revenue 2021 | Low five figures INR |
| Flutter apps in production | Multiple via Strato Inc |
| Blog posts written | 0 (this series is retroactive) |
| Hours of sleep per night (average) | Less than recommended |

## What I Learned

**About building**: Shipping beats planning. Every project that taught me something shipped in some form. Every project I &quot;designed&quot; extensively did not.

**About India**: Market constraints (connectivity, device tier, lighting, regulation) are features, not bugs. Products that work in San Francisco and fail on a mid-range phone in tier-2 India were not built for my market. Embrace the constraint.

**About fundraising**: Traction is the universal solvent. Without it, warm intros are just warmer cold emails. I respect the game more and enjoy playing it less.

**About ML**: GANs are not magic. Transformers are not magic. ML Kit on a mid-range phone is engineering. The gap between paper and production is where founders live or die.

**About myself**: I can sustain 70-hour weeks for about 10 months before quality degrades. December 2021 me was worse at decisions than March 2021 me. Rest is not optional. I ignored this.

## People Who Mattered

My parents, who did not fully understand what I was building but funded the laptop upgrades and did not demand I stop when grades slipped.

Restaurants and riders in my hometown who gave Strato Foods a chance before the aggregators cared.

## 2022 Intentions (Written in Retrospect)

I wrote these in my December 2021 journal. Reporting back for the reader who cares about continuity:

1. **Grow Strato Foods and Strato Inc revenue.** Needed for sustainability without VC.
2. **Publish one research paper.** Transformer work. Status: eventual.
3. **Sleep more.** Failed immediately in January 2022.

## The Honest Sentence

2021 was the year I built things that did not exist before me: apps, revenue, failures, and opinions strong enough to write about.

It was not the year I figured everything out. It was the year I stopped pretending I had already figured everything out.

If you are a young founder reading this: building from your room works until it does not. The skills compound. The burnout is real. Ship anyway. Write the honest accounting at year end. It helps.

Happy new year. Back to work.</content:encoded></item><item><title>Nine Months of Running a Food Delivery App in a Tier-2 Town</title><link>https://utso.stamped.work/blog/2021-12-01-nine-months-strato-foods-local-delivery/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2021-12-01-nine-months-strato-foods-local-delivery/</guid><description>Strato Foods hit nine months in December 2021. Real orders, real riders, and the gap between a working local app and a Swiggy competitor.</description><pubDate>Wed, 01 Dec 2021 00:00:00 GMT</pubDate><content:encoded>In March 2021 I shipped **Strato Foods** through Strato Inc — a food delivery app for my hometown. By December I had nine months of production data: orders, complaints, restaurant churn, rider no-shows, and revenue that was real but not venture-shaped.

This is not a growth hack post. It is what running local delivery actually feels like when you are 19, building from your room, and nobody in Bangalore is returning your emails.

## What I Thought I Was Building

The pitch in my head was simple: restaurants in my town did not need a national aggregator&apos;s commission structure or onboarding bureaucracy. They needed orders on their phones and riders who showed up.

Strato Foods was Flutter on Android, Firebase on the backend, and me playing customer support at midnight when someone&apos;s biryani went to the wrong lane.

I was trying to make a marketplace work where the &quot;market&quot; was three kilometers wide and everyone already knew each other&apos;s phone numbers.

## What Actually Worked

**Density.** One town, a handful of restaurants that trusted me, repeat users who ordered because I fixed things personally when they broke.

**Founder on WhatsApp.** Terrible for scale. Excellent for retention in month two through six. People forgave bugs when a human answered in ten minutes.

**Low burn.** No office. No salaried ops team. Rider payments and restaurant settlements were spreadsheets I understood because I built them in pain.

**Shipping cadence.** Play Store updates every two weeks. Small fixes visible to merchants: &quot;he actually listens.&quot;

By autumn we had a rhythm. Restaurants called me before they called Swiggy&apos;s local rep — not because we were bigger, but because we were reachable.

## What Did Not Work

**Pretending to be an aggregator.** Swiggy and Zomato win on subsidies, brand, and rider pools I could not match from a hostel room.

**Feature envy.** I kept sketching &quot;dark store&quot; ideas and loyalty programs while basic ETA accuracy was still flaky.

**Ops as an afterthought.** Software was the fun part. Rider incentives, rain-day surge, restaurant payout disputes — that was the company. I underestimated it until month five.

**Sleep.** The 2021 year-in-review numbers do not lie. December me made worse decisions than April me.

```mermaid
flowchart LR
    subgraph worked [What compounded]
        A[Single-town density]
        B[Founder-led support]
        C[Low burn / fast ship]
    end

    subgraph failed [What hurt]
        D[Aggregator cosplay]
        E[Ops debt]
        F[Feature envy]
    end

    worked --&gt; Revenue[Real but local revenue]
    failed --&gt; Ceiling[Growth ceiling]
```

## The Numbers I Care About (Approximate)

| Metric | Dec 2021 |
|--------|----------|
| Months live | 9 |
| Towns | 1 |
| Model | Marketplace + own rider coordination |
| Revenue | Low five figures INR for the year |
| VC meetings that quarter | 0 (I stopped chasing in September) |
| Nights I debugged Firebase rules | Too many |

None of this is unicorn math. It is proof that Strato Inc could ship something strangers paid for — which mattered more to me than another pitch deck revision.

## What I Was Learning in Parallel

Strato Foods was not the only thread. I was running other Android experiments under Strato Inc and poking at a transformer attention side project for document classification. The GAN synthetic data work had mostly died by summer. The lesson that stuck: **ML papers and food delivery ops do not share a calendar.** I could not be in lab-head mode and dispatch-head mode on the same Tuesday.

That tension would define the next few years. In 2021 it just felt like being bad at focusing.

## Restaurants, Riders, and the Unsexy Truth

The product was an app. The business was phone calls.

Restaurants wanted:
- Predictable settlement cycles
- Fewer canceled orders
- Someone to yell at who could fix menus same-day

Riders wanted:
- Clear per-delivery pay
- No ambiguous &quot;bonus maybe&quot; language
- Respect when it rained and roads disappeared

Users wanted:
- Food hot
- Accurate ETA
- Refunds without a ticket system designed for enterprises

Every item on that list is operations. The Flutter UI was maybe 30% of whether people reordered.

## Why I Kept Going

I stopped fundraising in the fall. Not because I became anti-VC — I wrote about that summer separately — but because Strato Foods was teaching me things no incubator lecture covered: local trust, settlement float, and how fast word-of-mouth travels in a town where your mother might hear about a bad delivery before you see the complaint.

That knowledge compounds. Even when the app does not scale nationally.

## Takeaway

If you are a young founder building in a tier-2 city: **local delivery is not a smaller Swiggy.** It is a logistics and relationships business that happens to have an app attached.

Ship in one geography until the ops are boring. Treat every restaurant owner like your first investor. Log the failures in a spreadsheet, not in your memory.

Nine months in, Strato Foods was not a unicorn. It was a real company in a real town. That was enough to keep building — and enough to write the honest version down before the LinkedIn highlight reel kicks in.</content:encoded></item><item><title>Attention Is All You Need: Three Years Later</title><link>https://utso.stamped.work/blog/2021-11-08-attention-is-all-you-need-three-years-later/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2021-11-08-attention-is-all-you-need-three-years-later/</guid><description>Revisiting the Transformer paper in 2021: quadratic attention costs, BERT vs GPT-2, and why the architecture ate NLP.</description><pubDate>Mon, 08 Nov 2021 00:00:00 GMT</pubDate><content:encoded>In June 2017, eight Google researchers published &quot;Attention Is All You Need&quot; and quietly detonated the entire field of natural language processing. This post is a technical revisit of the paper that started it all, written from the perspective of someone who has read it five times and understood it properly on the fourth. By November 2021, three years and change after publication, I was implementing attention mechanisms for a side research project while the rest of the world debated whether GPT-3 was AGI.

Transformers are not magic. They are weighted sum machines with a clever inductive bias. But that bias turned out to be exactly what language needed.

## What the Paper Actually Proposed

Before Transformers, sequence modeling meant RNNs (LSTM, GRU) or CNNs (ByteNet, ConvS2S). RNNs process tokens sequentially. Training is slow because parallelism is limited. Long-range dependencies vanish in gradient flow despite LSTM gates.

The Transformer throws recurrence away. Every token attends to every other token in parallel. Position information comes from positional encodings, not from sequential processing.

Core components:
- **Multi-head self-attention**: each token builds a representation by attending to all tokens, with multiple attention heads learning different relationship types
- **Position-wise feed-forward networks**: two linear layers with ReLU applied per token independently
- **Residual connections and layer normalization** around each sublayer
- **Encoder-decoder architecture** for sequence-to-sequence tasks (original paper targeted machine translation)

```mermaid
flowchart TB
    subgraph Encoder
        E_IN[Input embeddings + Positional encoding] --&gt; E_ATTN[Multi-Head Self-Attention]
        E_ATTN --&gt; E_ADD1[Add and Norm]
        E_ADD1 --&gt; E_FFN[Feed Forward]
        E_FFN --&gt; E_ADD2[Add and Norm]
        E_ADD2 --&gt; E_OUT[Encoder output]
    end

    subgraph Decoder
        D_IN[Output embeddings + Positional encoding] --&gt; D_MASK[Masked Self-Attention]
        D_MASK --&gt; D_ADD1[Add and Norm]
        D_ADD1 --&gt; D_CROSS[Cross-Attention to Encoder]
        E_OUT --&gt; D_CROSS
        D_CROSS --&gt; D_ADD2[Add and Norm]
        D_ADD2 --&gt; D_FFN[Feed Forward]
        D_FFN --&gt; D_ADD3[Add and Norm]
        D_ADD3 --&gt; D_SOFT[Linear + Softmax]
        D_SOFT --&gt; D_OUT[Output probabilities]
    end
```

The diagram is the whole paper. Everything else is engineering details and training tricks.

## The O(n^2) Elephant in the Room

Self-attention computes pairwise interactions between all tokens. For sequence length n, attention is O(n^2) in both compute and memory.

The 2017 paper used sequences up to a few hundred tokens. Fine for translation. In 2021:
- **BERT** maxes at 512 tokens
- **GPT-2** uses 1024
- **GPT-3** uses 2048
- Document-level tasks want 4096+

At n=4096, the attention matrix has 16 million entries per head per layer. Multiply by batch size, heads, and layers. Your GPU weeps.
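
The arithmetic, as a sketch. It counts only the attention matrices themselves, assumes they are all held in memory at once (as a naive training implementation does for the backward pass), and defaults to 4-byte floats.

```python
def attention_matrix_bytes(seq_len: int, heads: int, layers: int, batch: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold every n x n attention matrix at once."""
    return seq_len * seq_len * heads * layers * batch * bytes_per_value
```

For a BERT-base-shaped model (12 heads, 12 layers) at n=4096 and batch size 1, that is exactly 9 GiB before you store a single weight or any other activation.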

The research response by 2021 included:
- **Sparse attention** (Longformer, BigBird): attend locally plus global tokens
- **Linear attention** (Performer, Linformer): kernel or low-rank approximations to avoid materializing the full matrix
- **FlashAttention** (published in 2022, so not available to me in 2021): IO-aware exact attention
- **Recurrence hybrids** (Transformer-XL): cache previous segments

None of these fully solved the problem. Most traded exactness or full-sequence context for scalability. The Transformer ate NLP anyway because n=512 covers most profitable use cases and GPUs got bigger.

For my research project processing longer documents, I hit the wall at 512 tokens and had to chunk with overlap, losing cross-chunk context. The paper does not mention this pain. Production does.
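
The chunking workaround looks roughly like this. A sketch with an assumed overlap of 64 tokens (I am not claiming that was my setting), ignoring the special tokens a BERT tokenizer adds to each window.

```python
def chunk_with_overlap(tokens: list[int], size: int = 512, overlap: int = 64) -> list[list[int]]:
    """Split tokens into windows of at most `size`, each sharing `overlap`
    tokens with the window before it."""
    if overlap >= size or 0 > overlap:
        raise ValueError("overlap must be non-negative and smaller than size")
    stride = size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end
    return chunks
```

Overlap softens the boundary problem but does not remove it. A token in the first window still never attends to anything in the last one.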

## BERT vs GPT-2: Same Architecture, Different Religion

Both use Transformer blocks. The difference is training objective and architecture variant.

**BERT** (2018): Encoder-only. Trained with masked language modeling (predict hidden tokens) and next sentence prediction. Bidirectional context. Fine-tune for classification, NER, QA.

**GPT-2** (2019): Decoder-only. Trained with causal language modeling (predict next token). Left-to-right only. Fine-tune for generation, or prompt without fine-tuning.

| Aspect | BERT | GPT-2 |
|--------|------|-------|
| Architecture | Encoder | Decoder |
| Attention | Bidirectional | Causal (masked) |
| Pre-training | MLM + NSP | CLM |
| Best for | Understanding tasks | Generation |
| Fine-tuning | Standard | Prompting emerges |
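
The attention row of that table is just a difference in masks. A tiny sketch, using the convention that 1 means the query position may attend to the key position (function names are mine):

```python
def bidirectional_mask(n: int) -> list[list[int]]:
    """BERT-style: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n: int) -> list[list[int]]:
    """GPT-style: position i may attend only to positions j with j at or before i."""
    return [[1 if i >= j else 0 for j in range(n)] for i in range(n)]
```

Same Transformer block either way. The lower-triangular mask is what forces a decoder to predict the next token without peeking.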

The ideological split: is language understanding best served by bidirectional context (BERT) or by generative modeling that must implicitly learn understanding to predict well (GPT)?

By 2021 the answer was leaning GPT. Scaling decoder-only models produced emergent capabilities BERT&apos;s architecture could not match. BERT still wins on extractive QA and classification with limited data. GPT wins when you have compute and want one model to do everything.

I used BERT-base for a text classification task with 2000 labeled examples. It worked. I tried GPT-2 for the same task with prompting. It kind of worked. With 2000 examples, BERT was the correct engineering choice. The paper&apos;s lesson is not &quot;always use the biggest model.&quot; It is &quot;match architecture to objective and data scale.&quot;

## Multi-Head Attention: What the Heads Learn

The paper uses h=8 heads with d_model=512, so d_k=d_v=64 per head. Each head learns different attention patterns. Visualizations in follow-up work show:
- Heads that attend to previous/next token (syntax)
- Heads that attend to matching brackets or delimiters
- Heads that attend to coreferent entities

You do not design these patterns. They emerge from training. This is the Transformer&apos;s real superpower: flexible relational inductive bias without hand-crafted linguistic features.

Implementation detail that tripped me:

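```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    # Scale by sqrt(d_k) so score variance does not grow with head dimension
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 get a large negative score, so softmax gives them ~0 weight
        scores = scores.masked_fill(mask == 0, -1e9)
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, V), attn
```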

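The scaling by sqrt(d_k) prevents softmax saturation as dimension grows. Small detail. Without it, dot products grow with d_k, the softmax saturates, and gradients shrink until training stalls. The paper gives this one sentence and a footnote. Code makes you feel it.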
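
You can see the effect without a GPU. A sketch that assumes queries and keys with independent unit-variance components, which is the same assumption the paper makes for its argument. The function name is mine.

```python
import math
import random


def dot_product_variance(d_k: int, samples: int, scaled: bool, seed: int = 0) -> float:
    """Empirical variance of q.k for random unit-variance q and k."""
    rng = random.Random(seed)
    values = []
    for _ in range(samples):
        dot = sum(rng.gauss(0, 1) * rng.gauss(0, 1) for _ in range(d_k))
        values.append(dot / math.sqrt(d_k) if scaled else dot)
    mean = sum(values) / samples
    return sum((v - mean) ** 2 for v in values) / samples
```

Unscaled, the variance comes out near d_k. Scaled, it stays near 1 whatever the head dimension is, so the softmax sees inputs in a range where it still has useful gradients.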

## Positional Encodings: Sinusoidal vs Learned

The original paper uses fixed sinusoidal encodings. Later models (GPT, BERT) use learned positional embeddings. Relative position encodings (Transformer-XL, T5) generalize better to longer sequences than absolute positions.

In 2021 this was still an active research area. For fine-tuning BERT on short texts, it barely matters. For extrapolating to longer sequences than training, it matters enormously.
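
The fixed sinusoidal scheme from the original paper is short enough to write by hand. A sketch for a single position (the function name is mine):

```python
import math


def sinusoidal_encoding(position: int, d_model: int) -> list[float]:
    """Encoding for one position: sin on even indices, cos on odd indices.

    Each sin/cos pair shares a wavelength, and wavelengths grow
    geometrically across the embedding dimension.
    """
    enc = []
    for i in range(d_model):
        pair_index = i - i % 2  # the 2i in the exponent of the paper
        angle = position / (10000 ** (pair_index / d_model))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc
```

Nothing here is learned, so the function will happily produce an encoding for position 5000 even if training never went past 512. Whether the model does anything sensible with it is a separate question.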

## Three Years Later: What Held Up

**Still true:**
- Attention as the core primitive for sequence modeling
- Parallelization advantage over RNNs
- Transfer learning via pre-train then fine-tune (or prompt)
- Residual connections and layer normalization around each sublayer (though GPT-2 and later models moved the normalization ahead of the sublayer)

**Underestimated in 2017:**
- Scale. The paper&apos;s largest model was 213M parameters (Transformer-big). GPT-3 is 175B.
- Decoder-only dominance for general-purpose models
- Prompting as an interface replacing fine-tuning
- Multimodal extension (Vision Transformer, CLIP, DALL-E)

**Overestimated:**
- Encoder-decoder as the default architecture (decoder-only won for LLMs)
- Need for task-specific architectures (one big model eats them)
- Efficiency at long context (still painful in 2021)

## What I Tell Juniors Who Ask &quot;Should I Read the Paper?&quot;

Yes. Read the original. Not just the annotated blog version.

Read it for:
- Scaled dot-product attention definition
- Why multi-head instead of single head with larger dimension
- Encoder-decoder cross-attention for seq2seq

Skip the detailed derivation of the learning rate schedule unless you are reproducing training. Nobody uses the exact warmup from 2017 anymore.

Then read BERT and GPT-2 papers to see how the architecture forked. Then read one efficient attention paper (Longformer or Performer) to understand the O(n^2) mitigation landscape.

## The Honest Assessment

&quot;Attention Is All You Need&quot; is one of the most impactful ML papers ever written. It is also, by 2021 standards, an incomplete blueprint for modern LLMs. No RLHF, no scaling laws, no emergent abilities, no prompting, no safety considerations.

Three years later the title was prophetic and slightly wrong. Attention is most of what you need. You also need 10^23 FLOPs, a data pipeline, and a product team.

But it started here. Every LLM you use in 2024 traces back to this architecture. Worth understanding properly, not just as a buzzword on a pitch deck.</content:encoded></item><item><title>Image Compression Is a Solved Problem the Way Traffic Is a Solved Problem</title><link>https://utso.stamped.work/blog/2021-10-15-image-compression-not-solved/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2021-10-15-image-compression-not-solved/</guid><description>JPEG, WebP, AVIF, and neural codecs: why your images are still too big and your quality metrics still lie.</description><pubDate>Fri, 15 Oct 2021 00:00:00 GMT</pubDate><content:encoded>People say image compression is solved the way people say traffic is solved: technically there are solutions, practically you are still stuck in a queue wondering where the design went wrong. I spent October 2021 optimizing image pipelines for Strato Foods — restaurant menu photos, dish images, and rider-uploaded proof shots on 3G connections. The textbook answer was &quot;use WebP.&quot; The real answer was more complicated, more interesting, and more frustrating.

## The Codec Landscape in 2021

Four contenders mattered for our use case:

**JPEG** (1992): Universal support, fast encode/decode, terrible efficiency by modern standards. Still the default because everything accepts it. Baseline for comparison.

**WebP** (2010): Google&apos;s answer. 25-35% smaller than JPEG at equivalent visual quality. Lossy and lossless modes. Android support since 4.0. iOS since 14. Good enough for most mobile apps.

**AVIF** (2019): Based on AV1 video codec intra frames. 50% smaller than JPEG in benchmarks. Encode times measured in seconds per image on mobile. Decode support spotty in 2021.

**HEIC** (2015): Apple&apos;s choice. Excellent compression. Useless for cross-platform web delivery. Relevant only if your users are all on iOS, which ours were not.

Neural compression (learned image compression, LIC) was publishing impressive RD curves in papers but had no production-ready encoders you could ship in a Flutter app without writing a research project.

## The Quality vs Size Tradeoff Nobody Plots Honestly

Codec comparisons in blog posts pick a single quality setting, compare file sizes, declare a winner. Real systems live on a curve.

Rate-distortion tradeoff (qualitative, illustrative values):

| Setting | File size (KB) | Perceptual quality (SSIM) |
|---------|----------------|---------------------------|
| JPEG q=85 | 400 | 0.92 |
| JPEG q=60 | 200 | 0.85 |
| WebP q=80 | 250 | 0.92 |
| WebP q=60 | 120 | 0.85 |
| AVIF q=50 | 100 | 0.92 |
| Neural codec | 80 | 0.90 |

The neural codec numbers look magical until you add **encode time** as a third axis. AVIF looks great until you try it on a 2019 Oppo. WebP wins not because it is optimal but because it sits on the Pareto frontier of quality, size, encode speed, and compatibility.

For our restaurant and menu photos:
- Resolution: 1920x1080 to 4032x3024 depending on device
- Content: food textures, steam blur, harsh kitchen lighting, occasional motion blur from riders
- Constraint: upload on 3G in under 10 seconds
- Target: under 300KB per image with legible menu text and recognizable dishes

JPEG at quality 85 averaged 800KB. Unacceptable. WebP at quality 75 averaged 280KB with acceptable detail. AVIF at equivalent quality averaged 180KB but took 4x longer to encode on mid-range Android.
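
The 10-second budget turns each average file size into a minimum sustained uplink rate. A quick sketch of that arithmetic follows. The helper name `required_kbps` is mine, it assumes 1 KB = 1024 bytes, and it ignores protocol overhead and retries, so real requirements are somewhat higher.

```python
def required_kbps(size_kb: float, seconds: float = 10.0) -> float:
    """Sustained throughput in kbit/s needed to upload size_kb within `seconds`."""
    bits = size_kb * 1024 * 8
    return bits / seconds / 1000


sizes = [("JPEG q=85", 800), ("WebP q=75", 280), ("AVIF", 180), ("300KB cap", 300)]
for label, size_kb in sizes:
    print(f"{label}: {required_kbps(size_kb):.0f} kbit/s")
```

The 800KB JPEG needs roughly 655 kbit/s sustained for the full ten seconds. The 280KB WebP needs roughly 229 kbit/s, which is a far easier ask on a flaky connection.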

We shipped WebP. Not because it is the best codec. Because it is the best codec **for our constraints**.

## Why SSIM and PSNR Lie

**PSNR** (Peak Signal-to-Noise Ratio) measures pixel-level MSE. High-frequency detail loss in textures and text edges (exactly what matters for menu legibility) can look bad to humans while PSNR stays high.
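
A small sketch of that blind spot: PSNR depends only on the mean squared error, so a faint uniform shift and a concentrated local error can score identically. The pixel values below are made up for illustration.

```python
import math


def psnr(reference: list[int], test: list[int], peak: int = 255) -> float:
    """PSNR in dB between two equally sized 8-bit grayscale pixel lists."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / mse)


reference = [128] * 64                  # flat 8x8 patch
uniform_error = [130] * 64              # every pixel off by 2
local_error = [136] * 4 + [128] * 60    # four pixels off by 8, the rest exact

# Both candidates have a mean squared error of 4.0, so PSNR cannot tell them apart.
print(psnr(reference, uniform_error) == psnr(reference, local_error))
```

The uniform shift is invisible to a viewer. The concentrated error is the kind that eats a thin stroke of text. PSNR rates them as equal.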

**SSIM** (Structural Similarity Index) is better but still computed on downsampled luminance. Two images with different compression artifacts in different regions can have identical SSIM while one preserves text edges and the other smears them.

**VMAF** (Netflix&apos;s metric) is more perceptually aligned but designed for video at streaming bitrates. Overkill for still images and slow to compute at capture time.

We ended up with a hybrid evaluation:
1. SSIM for automated regression testing in CI
2. Human review on a sample set — restaurant owners comparing compressed vs original on their phones (the ground truth)
3. Downstream classifier sanity check on compressed images (does our simple tagging still work?)

The third test was the one that mattered. A compression setting that looked fine on a laptop made menu text illegible on a cheap Android screen. We tightened quality until readability recovered. Compression is not independent of your downstream pipeline. Treat it as part of the product.

## Neural Compression: The Research vs Production Gap

Learned image compression papers in 2021 showed architectures like hyperprior autoencoders beating BPG and approaching AVIF on the Kodak dataset. Exciting.

Production reality:
- **No standard file format** with universal decoder support
- **GPU dependency** for reasonable encode times
- **Model size** adds megabytes to your app
- **Domain shift**: models trained on natural images compress menu scans and rider proof shots differently

I experimented with a TFLite-quantized learned codec from a research codebase. Encoding a 1080p image took 2.3 seconds on GPU, 8 seconds on CPU. Output was 40% smaller than WebP. The research paper did not mention that the model was 12MB and the decoder had edge cases that produced block artifacts on high-contrast metal surfaces.

Neural compression will win eventually. In 2021 it was a conference paper, not a library import.

## Practical Pipeline Decisions

What actually shipped:

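```python
# load_image, auto_orient, resize_long_edge and encode_webp are helpers not shown in this post.
def compress_for_upload(image_path: str, max_kb: int = 300) -&gt; bytes:
    """Return WebP bytes for upload, aiming for at most max_kb kilobytes."""
    img = load_image(image_path)
    img = auto_orient(img)  # EXIF rotation, non-negotiable
    img = resize_long_edge(img, max_px=2048)  # cap resolution first

    # Step quality down from 80 to 45 and return the first encode under the cap.
    for quality in range(80, 40, -5):
        webp_bytes = encode_webp(img, quality=quality)
        if len(webp_bytes) &lt;= max_kb * 1024:
            return webp_bytes

    # Fallback: aggressive resize, then one fixed-quality encode.
    # This last result is returned without re-checking it against max_kb.
    img = resize_long_edge(img, max_px=1280)
    return encode_webp(img, quality=50)
```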

Key decisions:
- **Resize before compress.** Halving resolution saves more bytes than any quality tweak.
- **Strip EXIF** except orientation (already applied). Location metadata in delivery photos is a privacy headache.
- **Progressive quality reduction** with a hard size cap. Upload constraints are binary.
- **Server-side re-encode** for archival at higher quality. Client optimizes for upload speed.
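
The resize-first rule is easiest to see in the pixel counts. Below is a hypothetical stand-in for the dimension math behind a long-edge resize. `fit_long_edge` is my own name, and the actual helper used in the pipeline is not shown in this post.

```python
def fit_long_edge(width: int, height: int, max_px: int) -> tuple[int, int]:
    """Scale (width, height) so the longer edge is at most max_px, keeping aspect ratio."""
    long_edge = max(width, height)
    if max_px >= long_edge:
        return width, height  # never upscale
    scale = max_px / long_edge
    return max(1, round(width * scale)), max(1, round(height * scale))


w, h = fit_long_edge(4032, 3024, max_px=2048)
share = w * h / (4032 * 3024)
print(w, h, f"{share:.0%} of the original pixels")
```

A 4032x3024 capture capped at 2048 on the long edge keeps roughly a quarter of its pixels before the encoder touches a single quality setting.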

## The Traffic Analogy Extended

Traffic is &quot;solved&quot; because we have roads, traffic lights, and GPS routing. Yet you are still late because the system optimizes for average case and you live in the worst case.

Image compression is &quot;solved&quot; because we have codecs that work on average images for average viewers on average hardware. Food delivery photos are not average:
- Steam and glare on dish shots
- Extreme lighting in small kitchens (see my ML Kit post)
- Users zooming in to read menu prices
- Upload on networks that drop packets when it rains

## What I Would Do Today

In 2021 WebP was the answer. A later pipeline might revisit:
- **AVIF** with hardware encode on newer devices, WebP fallback
- **Content-adaptive compression**: higher quality on text-heavy menu regions (ROI-aware encoding exists in research, emerging in tools)
- **On-device learned codecs** as TFLite models mature

## The Takeaway

Image compression is a solved problem the way calculus is a solved problem: the theory is complete, your integral is still messy.

Pick your codec based on constraints, not benchmarks. Measure with your downstream task, not just SSIM. Resize aggressively. Test on the worst phone on the worst network in your market.

And when someone says &quot;just use AVIF,&quot; ask them encode time on a Rs. 10,000 Android phone. Then ask again.</content:encoded></item><item><title>The Myth Indian Founders Are Sold About VC Funding</title><link>https://utso.stamped.work/blog/2021-07-22-india-startup-founders-vc-myth/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2021-07-22-india-startup-founders-vc-myth/</guid><description>Bangalore bias, traction requirements at pre-seed, and the fundraising funnel nobody draws honestly.</description><pubDate>Thu, 22 Jul 2021 00:00:00 GMT</pubDate><content:encoded>I spent the summer of 2021 sending cold emails to Indian VCs while growing Strato Foods under Strato Inc from my hometown. I got polite passes, one meeting that went nowhere, and an education in how the Indian fundraising narrative diverges from the Indian fundraising reality. This post is for the founder who just watched a Bangalore startup raise a pre-seed on a pitch deck and wonders what they are doing wrong.

You are probably not doing anything wrong. The game is just different than the blogs say.

## The Myth

The myth goes like this: build something, show traction, pitch investors, raise a round, hire a team, scale. LinkedIn posts reinforce it. Inc42 headlines reinforce it. Y Combinator&apos;s playbook, slightly Indianized, reinforces it.

The myth has specific sub-clauses for India:

1. **&quot;Indian VCs are hungry for deals.&quot;** True in aggregate. Not true for you specifically.
2. **&quot;Pre-seed is about the team and vision.&quot;** True if your team went to IIT Bombay and worked at Flipkart. Otherwise they want traction.
3. **&quot;Bangalore is just a flight away.&quot;** Geographically yes. Psychologically it is a different country.
4. **&quot;Warm intros are preferred but cold outreach works.&quot;** Cold outreach works for getting a meeting maybe 2% of the time. Warm intros work maybe 15% of the time. Neither works if you are building in a sector the fund has already &quot;done.&quot;

## Bangalore Bias Is Real and Rational

I am not going to pretend geography should not matter. VCs in Bangalore see more deals per week than a Mumbai or Delhi fund because the density of startups is higher. They pattern-match faster. They have more reference checks in their network. A founder they met at a BLR meetup is lower risk than a founder who flew in from a tier-2 city with a deck and enthusiasm.

This is rational from the investor&apos;s perspective. It is maddening from mine.

What Bangalore bias actually looks like in practice:

- **&quot;Let&apos;s catch up when you are in town.&quot;** Translation: not now.
- **Portfolio company referrals get meetings.** Cold emails get auto-replies.
- **Co-working space presence signals commitment.** Hostel room does not.

The counter-move is not &quot;move to Bangalore immediately.&quot; I could not afford Bangalore rent in 2021 and my product did not need a local sales team yet. The counter-move is **build undeniable traction in your niche** and let the inbound come. Harder. Slower. More honest about how power works.

## Traction at Pre-Seed: The Moving Goalpost

American pre-seed often means: idea, team, prototype, maybe some LOIs. Indian pre-seed in 2021 increasingly meant: idea, team, prototype, **paying customers or signed pilots**.

I watched funds advertise &quot;we invest at idea stage&quot; and then pass on pre-revenue startups with &quot;come back when you have Rs. 5 lakh MRR.&quot; The term sheet language says pre-seed. The behavior says seed.

For B2B consumer logistics and local marketplace plays specifically, the bar was higher because **sales cycles are long** and **operations matter**. A VC who does not understand last-mile economics will pass rather than learn. Fair enough. But do not tell me you back bold founders in regulated markets if you only back founders who already cleared operational hurdles.

What counted as traction when I talked to investors:

| Signal | Weight |
|--------|--------|
| Revenue (any amount) | High |
| Repeat orders in a local market | High |
| LOI from enterprise | Medium |
| 10k app downloads | Low (vanity) |
| &quot;We are in talks with...&quot; | Zero |
| Hackathon win | Negative (sorry) |

The honest conversation I had with one angel: &quot;Your traction is interesting but I do not know anyone in your market who will take my intro. I cannot help. I will pass.&quot;

That was the most useful feedback I received all summer.

## The Funnel Nobody Draws

```mermaid
flowchart TD
    A[Cold email / warm intro] --&gt; B{Opened?}
    B --&gt;|No| Z1[Dead]
    B --&gt;|Yes| C{Reply?}
    C --&gt;|No| Z1
    C --&gt;|Yes| D[First call scheduled]
    D --&gt; E{Partner meeting?}
    E --&gt;|No| Z2[Polite pass]
    E --&gt;|Yes| F[Due diligence]
    F --&gt; G{Traction check}
    G --&gt;|Insufficient| Z3[Come back later]
    G --&gt;|Sufficient| H[Term sheet]
    H --&gt; I{Terms acceptable?}
    I --&gt;|No| Z4[Walk away]
    I --&gt;|Yes| J[Closed round]

    style Z1 fill:#333
    style Z2 fill:#333
    style Z3 fill:#333
    style Z4 fill:#333
```

Most founders stare at the top of this funnel. The kill zone is **G: Traction check**, and it happens later than you expect, after you have invested weeks in partner meetings and data room prep.

My numbers, roughly:
- 40 outreach attempts (mix of cold and warm)
- 8 first calls
- 2 partner meetings
- 0 term sheets

Not unusual for a first-time founder outside the Bangalore network. The myth says persistence wins. Reality says persistence plus traction wins. Persistence alone gets you more polite passes.

## What Indian VCs Actually Optimize For

Having been on the founder side and talked to enough investors to pattern-match:

1. **Risk reduction.** Indian LP money is scarcer than US LP money. Funds take fewer bets.
2. **Follow-on signaling.** Can this company raise a Series A from a top-tier fund? Your pre-seed investor is buying a call option on your Series A story.
3. **Sector thesis fit.** If the fund did edtech in 2019 and it failed, they are not doing edtech in 2021. No matter your deck.
4. **Founder-market fit.** Do you have unfair access to customers? IIT helps. Family business in the target industry helps more.
5. **Speed to revenue.** Indian consumer startups can scale users fast. B2B needs revenue faster because user counts do not impress the same way.

None of this is evil. It is a market. But the myth that &quot;Indian founders just need to pitch better&quot; ignores that pitching is a small variable in a large equation.

## What I Did Instead of Chasing Term Sheets

I stopped fundraising in September 2021. Not forever. For that cycle.

Instead:
- **Doubled down on Strato Foods** in my hometown where word-of-mouth was real
- **Kept burn near zero** on Strato Inc
- **Built revenue** (small, but real)
- **Stayed local** until aggregators changed the math

The next fundraising conversation, when it happened, started with &quot;we have X paying customers and Y pipeline&quot; instead of &quot;imagine if we captured Z% of the market.&quot;

## Advice for Indian Founders (Unsolicited)

1. **Do not compare your chapter 1 to someone else&apos;s chapter 4.** The Bangalore startup that raised pre-seed on a deck had a chapter 0 you did not see.
2. **Treat fundraising as a full-time job only when you have traction to show.** Otherwise it is a part-time job that destroys morale.
3. **Geography matters.** Mitigate with video, with traction, with angels in your city. Do not pretend it does not matter.
4. **Regulated sectors need regulatory literacy.** Investors smell naivety instantly.
5. **A pass is not a verdict.** It is one fund&apos;s portfolio strategy. Keep building.

The myth will not die. New founders will keep believing the playbook works uniformly. It does not. But you can still build a company. You just might build it before you raise, not after.

That is harder. It is also the only path that worked for me.</content:encoded></item><item><title>GANs Are Not Magic</title><link>https://utso.stamped.work/blog/2021-06-10-gans-are-not-magic/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2021-06-10-gans-are-not-magic/</guid><description>Mode collapse, training instability, and why FID scores lie to you. A researcher&apos;s honest field guide to generative adversarial networks.</description><pubDate>Thu, 10 Jun 2021 00:00:00 GMT</pubDate><content:encoded>Every few months someone posts AI-generated faces that look indistinguishable from photographs, and the internet collectively forgets that Generative Adversarial Networks have been a nightmare to train since Ian Goodfellow invented them in a bar in 2014. I spent the first half of 2021 trying to use GANs for synthetic training data in a computer vision pipeline. This post is what I wish the tutorials had told me before I burned three weeks on a model that generated the same three faces with different hair.

GANs are not magic. They are two neural networks playing a minimax game with no general convergence guarantee, evaluated with metrics that correlate weakly with human judgment, and surrounded by a literature that reports best-case results while burying the failed runs.

## The Setup Everyone Understands

A GAN has two components: a **generator** G that maps random noise z to synthetic samples, and a **discriminator** D that classifies samples as real or fake. They train adversarially. G tries to fool D. D tries to catch G. At equilibrium, G&apos;s outputs are indistinguishable from the real data distribution.

In theory, elegant. In practice, you are balancing two loss functions that fight each other while both networks are simultaneously learning representations of your data. It is like teaching two students to debate while both are still learning the subject.

```mermaid
flowchart LR
    Z[Random noise z] --&gt; G[Generator G]
    G --&gt; Fake[Fake sample]
    Real[Real sample] --&gt; D[Discriminator D]
    Fake --&gt; D
    D --&gt;|Real/Fake score| LossD[Discriminator loss]
    D --&gt;|Gradient to G| LossG[Generator loss]
    LossD --&gt;|Update D| D
    LossG --&gt;|Update G| G
```

The loop looks clean in a diagram. In a Jupyter notebook at 2 AM it looks like NaN losses and generated images that resemble television static had a baby with a Rorschach test.

## Mode Collapse: The Silent Killer

**Mode collapse** is when the generator learns to produce a small set of outputs that fool the discriminator, ignoring most of the training distribution. Your dataset has ten thousand diverse faces. Your generator produces one face with slightly different skin tones. The discriminator cannot distinguish this one face from real faces in its neighborhood of feature space, so G stops exploring.

I hit mode collapse on day four. My generator kept producing the same defect on the same surface, rotated slightly. The discriminator was not stupid. It was complacent. G found a local optimum and camped there.

Detecting mode collapse is easier than fixing it:
- Visual inspection (the honest method)
- Intra-batch diversity metrics (compare pairwise distances in generated batch)
- Sudden drop in discriminator accuracy with no improvement in sample quality
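
The intra-batch diversity check can be as simple as an average pairwise distance. A minimal sketch, assuming each generated sample has already been turned into a feature vector (flattened pixels or an embedding). The toy 2-D vectors and the function name are illustrative only.

```python
import math
from itertools import combinations


def mean_pairwise_distance(batch: list[list[float]]) -> float:
    """Average Euclidean distance over all unordered pairs in a batch."""
    pairs = list(combinations(batch, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


diverse = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
collapsed = [[1.0, 1.0], [1.0, 1.01], [1.0, 0.99]]

print(mean_pairwise_distance(diverse))    # large spread
print(mean_pairwise_distance(collapsed))  # close to zero
```

Track this number over training alongside the same statistic for a real batch. A generated value that sinks far below the real one is a collapse warning worth a visual check.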

Fixing it is a grab bag:
- **Minibatch discrimination** (let D see batches, not individual samples)
- **Unrolled GANs** (G optimizes against k-step lookahead of D)
- **Different architectures** (StyleGAN&apos;s mapping network helps)
- **More data** (brutal but often true)
- **Starting over** (underrated)

None of these are guaranteed. The GAN literature is a graveyard of techniques that work on CelebA and fail on your domain.

## Training Instability: Where the Loss Curves Lie

GAN loss curves are famously uninformative. The generator loss can increase while sample quality improves. The discriminator loss can flatline at 0.69 (ln 2) while the model is either perfect or useless. You cannot read GAN training like you read supervised learning.
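
Where the 0.69 comes from, as a short sketch. It assumes the discriminator loss is binary cross-entropy averaged over the real and fake halves of the batch. If your framework sums the two halves instead, the plateau sits near 1.39.

```python
import math


def bce(prediction: float, label: int) -> float:
    """Binary cross-entropy for one sample, with prediction strictly between 0 and 1."""
    return -math.log(prediction) if label == 1 else -math.log(1.0 - prediction)


# A discriminator that answers 0.5 for everything, real or fake.
real_loss = bce(0.5, label=1)
fake_loss = bce(0.5, label=0)
d_loss = (real_loss + fake_loss) / 2  # averaged over the real and fake halves

print(round(d_loss, 4), round(math.log(2), 4))
```

A coin-flipping discriminator lands on ln 2 whether the generator is flawless or the discriminator has learned nothing, which is why the number alone tells you so little.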

What actually helps me monitor training:

1. **Sample grids every N steps.** Save 64 generated images. Look at them. Your eyes are the best metric.
2. **Discriminator accuracy on a held-out real set.** If D hits 100% and stays there, G is not learning. If D is at 50%, either G is perfect or D is broken. Context matters.
3. **Feature matching loss** as a supplementary signal. Match intermediate discriminator features between real and fake batches.
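
Feature matching from item 3 reduces to comparing batch means. A minimal sketch in plain Python, where each row stands in for one sample of intermediate discriminator activations. The function name and the toy numbers are assumptions.

```python
def feature_matching_loss(real_feats: list[list[float]],
                          fake_feats: list[list[float]]) -> float:
    """Squared L2 distance between the batch-mean feature vectors."""
    def batch_mean(feats):
        n = len(feats)
        return [sum(col) / n for col in zip(*feats)]

    real_mean = batch_mean(real_feats)
    fake_mean = batch_mean(fake_feats)
    return sum((r - f) ** 2 for r, f in zip(real_mean, fake_mean))


real = [[1.0, 2.0], [3.0, 4.0]]   # batch mean is [2.0, 3.0]
fake = [[2.0, 2.0], [2.0, 2.0]]   # batch mean is [2.0, 2.0]
print(feature_matching_loss(real, fake))  # 1.0
```

Because it compares statistics rather than a real-or-fake verdict, this signal keeps moving even when the adversarial losses have stopped saying anything useful.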

Hyperparameters that matter more than the papers admit:
- **Learning rate ratio between G and D.** If D learns too fast, G gets no gradient. If D learns too slow, G collapses. I typically use 1:1 or 2:1 (D:G) with separate optimizers.
- **Batch size.** Small batches increase gradient variance. GANs hate variance.
- **Architecture capacity.** A generator with too few parameters collapses. One with too many overfits to adversarial examples against a weak D.

I spent a week tuning learning rates before I realized my data pipeline was feeding D and G different normalizations. Check your preprocessing before you touch hyperparameters. I say this because I did not.

## FID Lies (Or At Least Misleadingly Whispers)

**Fréchet Inception Distance (FID)** measures the distance between feature distributions of real and generated images, using Inception-v3 activations. Lower is better. Papers report FID of 2.5 and reviewers nod approvingly.

Problems with FID:

**Inception-v3 was trained on ImageNet.** If your domain is medical X-rays or industrial inspection photos, Inception features may not capture what matters. Two generated images can have excellent FID while being useless for your downstream task.

**FID is a distribution metric.** It tells you whether generated samples cover the real distribution on average. It does not tell you whether any individual sample is good. A model with mode collapse can achieve decent FID by nailing the most common modes.

**Sample size sensitivity.** FID estimates depend on how many generated samples you use. Comparing FID across papers with different sample counts is apples to oranges with different apple varieties.

**It does not measure diversity within the generated set well.** Two models with identical FID can have wildly different mode coverage.
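
The last point is easiest to see in one dimension, where the Fréchet distance between two fitted Gaussians depends only on their means and standard deviations. The sketch below is a deliberate simplification with made-up samples: real FID applies the matrix form of the same formula to 2048-dimensional Inception features.

```python
import math


def frechet_distance_1d(real: list[float], fake: list[float]) -> float:
    """Squared Frechet distance between 1-D Gaussians fitted to each sample set.

    For scalars the formula reduces to (mu_r - mu_f)^2 + (sigma_r - sigma_f)^2.
    """
    def mean_std(xs):
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        return mu, math.sqrt(var)

    mu_r, sd_r = mean_std(real)
    mu_f, sd_f = mean_std(fake)
    return (mu_r - mu_f) ** 2 + (sd_r - sd_f) ** 2


real = [-1.0, 1.0] * 4               # two equally common modes
fake = [-2.0] + [0.0] * 6 + [2.0]    # mostly one mode plus two outliers
print(frechet_distance_1d(real, fake))  # 0.0: same mean and variance
```

The two sets look nothing alike, yet the distance is zero because only the first two moments enter the formula. The full metric has the same structure in a higher-dimensional space.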

I still compute FID. It is useful for relative comparison within a single experiment series. I do not trust it as an absolute quality gate. When someone says &quot;our GAN achieves state-of-the-art FID on this dataset,&quot; ask how the samples look and whether anyone tried to use them for anything real.

## What Actually Worked for Me

After the GAN experiments, I landed on a hybrid approach:

1. **Use GANs for data augmentation of existing images**, not generation from scratch. Conditional GANs that take a real image and add synthetic variation performed better than unconditional generation. The real image anchors the output.

2. **Start from pretrained generators.** StyleGAN2 checkpoints, fine-tune on domain data. Training from scratch is a research project, not an engineering task.

3. **Pair GAN outputs with a quality filter.** A separate classifier rejects generated samples below a confidence threshold. Brute force, but it beats shipping bad synthetic data into a production training set.

4. **Consider diffusion models.** In 2021 they were emerging. By now they have largely superseded GANs for image generation quality. If you are starting fresh in generative modeling, look at DDPM and Stable Diffusion before you invest in GAN expertise.
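
The quality filter from step 3 is a few lines once a scoring function exists. A minimal sketch: the `score` callable stands in for whatever classifier you trust, and the 0.8 threshold is an arbitrary placeholder, not a recommendation.

```python
from typing import Callable, Sequence, TypeVar

Sample = TypeVar("Sample")


def filter_by_confidence(samples: Sequence[Sample],
                         score: Callable[[Sample], float],
                         threshold: float = 0.8) -> list[Sample]:
    """Keep only generated samples whose classifier confidence meets the threshold."""
    return [s for s in samples if score(s) >= threshold]


# Toy stand-in: each sample carries a precomputed confidence.
generated = [("img_a", 0.95), ("img_b", 0.40), ("img_c", 0.81)]
kept = filter_by_confidence(generated, score=lambda s: s[1])
print([name for name, _ in kept])  # ['img_a', 'img_c']
```

Log the rejection rate as well. If the filter throws away most of what the generator produces, the generator is the problem and the filter is only hiding it.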

## The Honest Researcher&apos;s Checklist

Before you commit to GANs for a project:

- [ ] Have you confirmed that you lack enough real data and actually need generation? (Often you do not.)
- [ ] Is your use case tolerant of occasional garbage outputs?
- [ ] Do you have GPU budget for hundreds of hyperparameter sweeps?
- [ ] Can you evaluate quality with domain-specific metrics, not just FID?
- [ ] Have you tried simpler baselines (copy-paste augmentation, texture synthesis, 3D rendering)?

If you checked fewer than three, reconsider.

## The Takeaway

GANs produced genuinely remarkable results. StyleGAN faces, CycleGAN style transfer, Pix2Pix for paired image translation. These are real achievements built on a foundation of unstable optimization and careful engineering that papers underreport.

They are not magic. They are a tool that works sometimes, for some domains, with significant tuning investment. Treating them as plug-and-play generative models is how you end up with three faces and a FID score you cannot explain to your advisor.

I still respect the architecture. I just no longer trust it blindly. That is the difference between reading papers and doing research.</content:encoded></item><item><title>Flutter State Management Is a Mess and Nobody Will Admit It</title><link>https://utso.stamped.work/blog/2021-03-12-flutter-state-management-is-a-mess/</link><guid isPermaLink="true">https://utso.stamped.work/blog/2021-03-12-flutter-state-management-is-a-mess/</guid><description>Provider, BLoC, Riverpod, GetX: a frank comparison from someone who shipped production Flutter apps and got burned by all of them.</description><pubDate>Fri, 12 Mar 2021 00:00:00 GMT</pubDate><content:encoded>I shipped my first production Flutter app in late 2020. By March 2021 I had rewritten the state layer twice, considered a third migration, and developed a permanent twitch whenever someone says &quot;just use Provider, it&apos;s simple.&quot; It is not simple. None of this is simple. The Flutter ecosystem has convinced itself that state management is a solved problem because there are twelve incompatible solutions, each with a conference talk and a Medium article claiming victory.

This post is not a tutorial. Tutorials are how we got here. This is an autopsy.

## The Four Horsemen

If you have spent any time in Flutter Discord servers or r/FlutterDev, you know the cast. **Provider** is the official-ish recommendation from the Flutter team, a thin wrapper over InheritedWidget that most people use wrong. **BLoC** (Business Logic Component) is the enterprise answer: streams, events, separation of concerns, and enough boilerplate to make a Java developer feel at home. **Riverpod** is Provider&apos;s ambitious nephew who read too much functional programming and decided global state needed compile-time safety. **GetX** is what happens when someone optimizes for lines of code deleted and accidentally optimizes for testability deleted too.

I have used all four in production or near-production contexts. Here is what nobody says out loud.

### Provider: Simple Until It Isn&apos;t

Provider works beautifully for a counter app. You wrap your app in `MultiProvider`, sprinkle `ChangeNotifierProvider` widgets around, call `context.read&lt;Counter&gt;().increment()`, and ship to the Play Store feeling clever.

Then you add authentication. Then you add a shopping cart that depends on auth state. Then you add a nested navigator with its own providers. Then you discover that `context.watch` inside `initState` throws because the widget cannot register as a listener that early (`context.read` is the call that works there). Then you discover that `Consumer` rebuilds more than you expected. Then you add `Selector` everywhere and your widget tree looks like a Christmas tree designed by someone who hates Christmas.

Provider&apos;s real problem is not capability. It is scope. `InheritedWidget` propagates down the tree, which means you are constantly thinking about where in the widget hierarchy your state lives. Put `UserProvider` too high and everything rebuilds. Put it too low and half your app cannot access it. There is no compile-time enforcement. You find out at runtime, usually in QA, usually on a Tuesday.

### BLoC: Correctness at What Cost

BLoC was my second attempt. I wanted testability. I wanted clear separation between UI and business logic. I got those things. I also got three files per feature: `auth_event.dart`, `auth_state.dart`, `auth_bloc.dart`. For a login screen.

The pattern is sound. Events flow in, states flow out, the UI is a pure function of state. Your unit tests are beautiful. Your `flutter_bloc` `BlocBuilder` widgets are clean.

```dart
// This is fine. The other 400 lines are not.
import &apos;package:flutter_bloc/flutter_bloc.dart&apos;;

class AuthBloc extends Bloc&lt;AuthEvent, AuthState&gt; {
  // AuthRepository is assumed to be the app&apos;s own login wrapper, injected here.
  AuthBloc(this.authRepository) : super(AuthUnauthenticated()) {
    on&lt;LoginRequested&gt;(_onLoginRequested);
  }

  final AuthRepository authRepository;

  Future&lt;void&gt; _onLoginRequested(
    LoginRequested event,
    Emitter&lt;AuthState&gt; emit,
  ) async {
    emit(AuthLoading());
    try {
      final user = await authRepository.login(event.email, event.password);
      emit(AuthAuthenticated(user));
    } catch (e) {
      emit(AuthError(e.toString()));
    }
  }
}
```

The problem is velocity. BLoC was designed for teams where a senior engineer reviews every state transition. For a solo founder shipping an MVP to validate a market, BLoC is a tax you pay in files. Cubit exists and is simpler, but then you are back to &quot;which flavor of BLoC&quot; debates that consume more Slack messages than the actual feature work.

### Riverpod: The Smart Choice That Requires You to Be Smart

Riverpod fixes Provider&apos;s scoping problems with `ProviderScope`, `ref.watch`, and providers that are not tied to the widget tree. Compile-time safety via `Provider` types. No `BuildContext` in your business logic. Async providers with built-in loading and error states.

I like Riverpod. I genuinely do. It is the best-designed of the four.

It is also the hardest to onboard. `Provider`, `StateNotifierProvider`, `FutureProvider`, `StreamProvider`, `family`, `autoDispose`, `override` for testing. The mental model is excellent once you have it. Getting there requires reading documentation that assumes you already understand the problem Riverpod is solving, which you do not, because you came from Provider.

### GetX: Speedrun Any% Wrong Architecture

GetX deletes boilerplate. `Get.put(Controller())`, `Obx(() =&gt; Text(controller.count.value))`, done. Navigation, dependency injection, state management, snackbars, dialogs, all in one package. It is seductive.

I used GetX on a prototype. It was fast. It was also a nightmare to test, a nightmare to reason about when controllers outlived their routes, and a nightmare to untangle when I needed to migrate. `Get.find&lt;SomeController&gt;()` is global mutable state with extra steps. The GetX community will tell you that you are holding it wrong. They are partially right and entirely unhelpful.

## The Diagram Everyone Needs and Nobody Draws

Here is what actually happens in a typical Flutter app with Provider versus BLoC. Not the marketing diagram. The real one.

```mermaid
flowchart TB
    subgraph Provider Flow
        UI1[Widget Tree] --&gt;|context.watch| CN[ChangeNotifier]
        CN --&gt;|notifyListeners| UI1
        UI1 --&gt;|context.read| CN
        CN --&gt;|async API call| API1[Backend]
        API1 --&gt;|setState via notifyListeners| CN
    end

    subgraph BLoC Flow
        UI2[Widget] --&gt;|add event| BLoC[BLoC]
        BLoC --&gt;|emit state| UI2
        BLoC --&gt;|repository call| REPO[Repository]
        REPO --&gt;|Future/Stream| BLoC
        BLoC --&gt;|map to state| UI2
    end
```

Notice the asymmetry. Provider couples your widget tree to your state. BLoC decouples them but inserts a ceremony layer. The Provider diagram does not show the bug where your `ListView` rebuilds because someone called `notifyListeners` on a provider three levels up. The BLoC diagram does not show the `BlocProvider` you forgot to wrap around a route.

## What I Actually Recommend (Reluctantly)

If you are a solo developer shipping an MVP: use **Riverpod** and accept the learning curve. The upfront cost pays off when you add features six months later and your state does not spaghetti.

If you are on a team with Java or Android architecture experience: **BLoC** is fine. Your team already thinks in events and states. Do not let anyone convince you that Cubit is &quot;not real BLoC.&quot; Ship the product.

If you are learning Flutter: start with **Provider** on a toy project, then migrate to Riverpod before you ship anything real. Treat Provider as training wheels, not a destination.

If someone suggests **GetX** for a production app with more than two screens: ask them what their test coverage looks like. Then ask again.

## The Deeper Problem

State management in Flutter is a mess because Flutter itself conflates two things: **ephemeral state** (text field contents, animation controllers, scroll positions) and **app state** (user session, cached data, feature flags). The framework gives you `StatefulWidget` for the first and a dozen third-party libraries for the second, with no clear boundary.

React had the same problem and mostly solved it with hooks plus a single dominant library (Redux, then Context, now Zustand or Jotai depending on who you ask). Flutter has not converged. Google will not pick a winner because picking a winner means deprecating something, and the Flutter team is allergic to breaking changes that affect Medium tutorial authors.

## What I Would Do Differently

I would pick Riverpod on day one and never touch Provider except to read legacy code. I would use `flutter_bloc` only if my co-founder insisted and had a background in enterprise Android. I would never use GetX in anything I expected to maintain past three months.

I would also stop reading &quot;State Management in Flutter: Which One Should You Choose?&quot; articles written by people who implemented a todo app in each framework over a weekend and declared a winner based on line count.

State management is not solved. It is managed. Pick your poison, document your choice in the README, and get back to building something users care about. That is the only advice that survived contact with production.</content:encoded></item></channel></rss>