Everyone Keeps Calling Things ChatGPT
If I had a rupee for every pitch deck that said “we use ChatGPT under the hood” in Q1 2022, I could fund a seed round. Most of those decks meant “we call the OpenAI API.” Some meant “we fine-tuned GPT-3.” A few confidently meant “we built our own foundation model.” The words collapsed into one brand, and technical decisions got worse because of it.
ChatGPT is a product. GPT-4 did not exist in public conversation yet. GPT-3 is a base language model. InstructGPT is a fine-tuned variant optimized for following instructions. RLHF is the training recipe that connects them. Confusing these is not pedantry. It is how you end up with wrong latency budgets, wrong safety assumptions, and wrong fine-tuning strategies.
GPT-3: Autocomplete at Scale
GPT-3 is a large autoregressive transformer trained to predict the next token on internet text. It is good at continuation. Ask it to “write a polite email,” and it might write an email, or it might write a forum post about writing emails, or it might veer into fiction because the prior token context looked like a story.
Base models are simulators of text distribution. They are not assistants. They do not know what you want unless the prompt makes the desired mode statistically likely.
For founders, the base model lesson is simple: do not expect alignment from scale alone. GPT-3’s raw API was a power tool for people who could prompt engineer around its moods. Everyone else got inconsistent magic.
InstructGPT: Instructions as the Objective
OpenAI’s InstructGPT work (Ouyang et al., 2022) fine-tuned GPT-3 style models on demonstrations of humans following instructions, then refined with reinforcement learning from human feedback. The result models were smaller in parameter count comparisons yet preferred by labelers over larger base models.
That was the inflection. Capability stopped being purely about pretraining loss and started being about human preference on outputs.
Instruct-style models answer questions more directly, refuse some bad requests more often, and hallucinate with more confidence. That last part is important. Alignment can make models sound right when they are wrong.
RLHF: The Training Stack People Hand-Wave
Reinforcement Learning from Human Feedback is a pipeline, not a checkbox:
- Supervised fine-tuning (SFT) on curated instruction-response pairs.
- Reward model training on comparisons: humans pick better answers.
- Policy optimization (often PPO) to maximize reward while staying close to the SFT model.
flowchart TB
Base["Base LM (GPT-3 class)"] --> SFT["Supervised fine-tuning on demonstrations"]
SFT --> RM["Train reward model from human comparisons"]
RM --> PPO["RL policy optimization (e.g. PPO)"]
PPO --> Instruct["Instruct-style model"]
Instruct --> Chat["Chat product layer"]
Chat --> Tools["Plugins, browsing, system prompts"]
Base -.->|"not the same artifact"| Chat
Each box is a different artifact with different failure modes. Skipping SFT and hoping PPO fixes everything is how you get unstable training and bizarre policies. Skipping reward modeling and using heuristics is how you get gaming.
If your ML lead says “we will just RLHF it next sprint,” ask which step they mean. Then ask for data.
ChatGPT: Product, Not Model Card
ChatGPT launched November 2022 for the public consciousness, but by April 2022 the ingredients were already visible to anyone reading papers and API changelogs. ChatGPT wraps model + system prompt + moderation + UX affordances (chat history, regeneration, thumbs up/down feeding future training).
Calling your startup “ChatGPT for X” tells investors you ride trends. It tells engineers nothing about:
- Which model snapshot you are on
- Whether you rely on chat-tuned vs code-tuned endpoints
- How you handle context windows
- What moderation hooks you inherit vs implement
The product layer matters. System prompts and tool use can turn the same weights into a lawyer cosplay or a SQL assistant. That is not mysticism. It is conditioning.
Why the Naming Mess Hurts Shipping
Latency and cost: Chat-style products encourage multi-turn context. Base completion APIs reward single-shot prompts. Your architecture differs.
Safety: Instruction-tuned models have refusals baked in. Base models may comply with harmful prompts unless you bolt on classifiers. Compliance teams care.
Fine-tuning: OpenAI and others offered different fine-tuning surfaces over time. Fine-tuning Davinci is not the same as fine-tuning an instruct model. Data formats differ. Evaluation differs.
Evaluation: Comparing your system to “ChatGPT” without a fixed benchmark is meaningless. ChatGPT changes. Your demo does not.
GPT-3 vs InstructGPT vs ChatGPT: A Founder Cheat Sheet
| Layer | What it is | What you get |
|---|---|---|
| GPT-3 (base) | Pretrained LM | Continuation, brittle control |
| InstructGPT | SFT + RLHF weights | Instruction following, refusals |
| ChatGPT | Product on tuned weights | UX, moderation, multi-turn |
When someone says “GPT-4” in 2022 April, they are often time-traveling or bluffing. GPT-4 was not public. Clarify or ignore.
What RLHF Does Not Fix
RLHF aligns to labeler preferences, not ground truth. It can:
- Reduce toxic outputs while increasing polished nonsense
- Encode cultural bias from raters
- Overfit to short, helpful-sounding formats
If your application needs factual grounding, you need retrieval, tools, or domain-specific fine-tuning and evals. RLHF is not a database.
Building Without Confusion
- Name the artifact in your docs: model ID, snapshot date, API version.
- Separate policy from weights: system prompts and filters are your liability surface too.
- Log prompts and outputs (with privacy policy) so you can debug when the vendor updates weights silently.
- Benchmark against tasks, not against a brand.
The Opinion You Paid For
The industry brands everything ChatGPT because consumers recognize it. Inside your codebase, that branding is malpractice. GPT-3 is the engine block. InstructGPT is the fuel mapping. RLHF is the tuning process. ChatGPT is the car with airbags and a speed limiter.
Drive the car if you want. Just do not pretend you manufactured the engine because you can floor the accelerator in a parking lot.
When GPT-4 actually ships publicly, the naming will get worse. Start building the discipline now.