Multimodal LLMs Are Not Computer Vision Models
Every week in late 2024 someone in a campus group chat shares a GPT-4V demo and asks if we can “just use this instead of training a model.” Every week I try to explain why that is the wrong comparison.
Multimodal LLMs are language models that can see. They are not computer vision models. The distinction matters when your use case requires pixel-level fidelity, repeatable outputs, or anything you would grade in a lab assignment without hand-waving.
I am a Mathematics and Computing student at IIT Roorkee who has shipped Android apps and trained small CV models on a laptop. Here is where multimodal LLMs fit and where they do not, seen from that seat rather than from a product I have not built yet.
What Multimodal LLMs Actually Do
Models such as GPT-4V, Claude with vision, and Gemini Pro Vision encode images into tokens, fuse them with the text context, and generate language. The output is always text (or structured text). The internal representation is optimized for semantic understanding and conversational coherence, not geometric precision.
They excel at:
- Image captioning and description
- Visual question answering (“how many people are in this photo?”)
- Document understanding (OCR-ish extraction from clean scans)
- Rough object identification in unconstrained photos
- Multimodal reasoning that combines image context with world knowledge
They fail at:
- Pixel-accurate localization without specialized tooling
- Detecting subtle manipulations: copy-move, splicing, and artifacts that only show up under error level analysis (ELA)
- Consistent results across near-duplicate inputs
- Calibrated probability outputs for high-stakes workflows
- Real-time inference at scale on high-resolution images
What Specialized CV Pipelines Need
Production computer vision, the kind I practiced with YOLO on a laptop and the kind assignment rubrics demand, needs:
- Repeatability: Same input, same output, every time
- Explainability: Bounding boxes and scores, not vibes
- Calibration: You can measure false positives on a held-out set
- Version pinning: Frozen weights for reproducibility
- Latency budgets: Especially on mobile
Out of the box, multimodal LLMs fail on repeatability, calibration, and latency budgets. They partially address explainability, but with plausible-sounding prose rather than evidence. They are also not designed for adversarial robustness.
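Repeatability and calibration are the two criteria that are easiest to turn into numbers. The sketch below is a minimal illustration, not a harness I ran: `predict` is a hypothetical stand-in for whatever model call you are testing, and the stub models and labels are made up.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Sequence


def repeatability(predict: Callable[[bytes], str], image: bytes, runs: int = 5) -> float:
    """Fraction of runs that agree with the most common output (1.0 = fully repeatable)."""
    outputs = [predict(image) for _ in range(runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs


def false_positive_rate(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Clean images flagged as positive, divided by all clean images in the held-out set."""
    false_pos = sum(p and not a for p, a in zip(predicted, actual))
    negatives = sum(not a for a in actual)
    if negatives == 0:
        raise ValueError("held-out set has no negative examples")
    return false_pos / negatives


# Hypothetical stand-ins for real model calls.
def stable_model(image: bytes) -> str:
    return "clean"


_answers = cycle(["clean", "tampered"])


def flaky_model(image: bytes) -> str:
    return next(_answers)


print(repeatability(stable_model, b"fake-image-bytes"))
print(repeatability(flaky_model, b"fake-image-bytes", runs=4))
print(false_positive_rate([True, False, True, False], [True, False, False, False]))
```

A frozen detector should score 1.0 on the first check by construction. For a hosted LLM, exact string equality is a strict test, so in practice you would probably compare a parsed verdict rather than the raw text.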
Task Suitability Matrix
quadrantChart
title LLM Vision vs Specialized CV Models
x-axis Low Precision Required --> High Precision Required
y-axis Semantic Task --> Geometric/Forensic Task
quadrant-1 Use Specialized CV
quadrant-2 Hybrid Pipeline
quadrant-3 LLM Sufficient
quadrant-4 LLM Insufficient
Image Captioning: [0.2, 0.15]
Document QA: [0.35, 0.25]
Object Counting: [0.45, 0.3]
Damage Assessment: [0.65, 0.55]
Tamper Detection: [0.85, 0.85]
Copy-Move Forgery: [0.9, 0.9]
Biometric Matching: [0.95, 0.7]
Scene Understanding: [0.3, 0.2]
Medical Imaging: [0.95, 0.95]
Upper-right quadrant: do not use multimodal LLMs as primary detectors. Lower-left: LLMs are fine, maybe overkill.
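The chart's placement rule can also be read as a lookup. This is a small sketch that assumes the quadrants split at 0.5 on both axes; the coordinates are copied from the chart, and `quadrant_for` is a hypothetical helper name.

```python
def quadrant_for(precision: float, geometric: float) -> str:
    """Map chart coordinates (x = precision required, y = geometric/forensic) to a label."""
    if precision >= 0.5 and geometric >= 0.5:
        return "Use Specialized CV"
    if precision < 0.5 and geometric >= 0.5:
        return "Hybrid Pipeline"
    if precision < 0.5 and geometric < 0.5:
        return "LLM Sufficient"
    return "LLM Insufficient"


# A few of the points from the chart above.
tasks = {
    "Image Captioning": (0.2, 0.15),
    "Object Counting": (0.45, 0.3),
    "Damage Assessment": (0.65, 0.55),
    "Copy-Move Forgery": (0.9, 0.9),
}

for name, (x, y) in tasks.items():
    print(f"{name}: {quadrant_for(x, y)}")
```

The coordinates are my judgment calls, not measurements, so treat the labels as a starting point for discussion rather than a decision rule.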
The Hybrid Pattern That Actually Makes Sense
Use LLMs where language is the output and specialized CV where pixels are the evidence:
- CV model runs detection, segmentation, or feature extraction
- Structured metadata captures scores, regions, model version
- LLM generates human-readable summaries from structured inputs, not from raw pixels alone
The LLM should not be the only thing between a user and a consequential decision. It can explain what the CV stack found.
This is architecture homework, not a pitch deck.
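The three steps above can be sketched as a data structure plus a prompt builder. Everything named here (`Detection`, `CvReport`, `build_summary_prompt`, the model version string) is hypothetical and only illustrates the shape of the handoff. The actual LLM call is left out on purpose.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass
class Detection:
    label: str
    score: float                      # detector confidence in [0, 1]
    box: Tuple[int, int, int, int]    # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class CvReport:
    model_version: str                # pinned weights identifier, for reproducibility
    image_id: str
    detections: List[Detection]


def build_summary_prompt(report: CvReport) -> str:
    """Serialize the CV output so the LLM summarizes evidence instead of raw pixels."""
    payload = json.dumps(asdict(report), sort_keys=True)
    return (
        "Summarize the following detector output for a non-technical reader. "
        "Use only the fields provided; do not infer anything about the image "
        "that is not in the JSON.\n" + payload
    )


report = CvReport(
    model_version="yolo-demo-v1",     # made-up version string
    image_id="img_001",
    detections=[Detection("scratch", 0.91, (40, 60, 120, 140))],
)
print(build_summary_prompt(report))
```

The point of the design is that scores, regions, and the model version travel with the summary, so a reviewer can always trace the prose back to what the detector actually reported.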
Specific GPT-4V Failure Modes I Hit
Confident hallucination on ambiguous regions. Compression artifacts described as “signs of digital manipulation.” Sometimes true. Often not. No confidence score.
Inconsistent across crops. Same region, different crop padding, different verbal assessment.
Resolution limits. Downscale a large photo to model input size, lose high-frequency detail, get a clean bill of health.
No frozen behavior. Model updates change outputs. Fine for chat. Bad for anything you need to reproduce in a report.
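The resolution failure mode is easy to reproduce without any model. The sketch below assumes simple box-filter downscaling on a toy grayscale grid. I do not know what resampling any particular API uses, so this only illustrates the general effect: a one-pixel checkerboard, the highest-frequency pattern an image can hold, averages out to flat gray.

```python
from typing import List

Image = List[List[float]]


def box_downscale(image: Image, factor: int) -> Image:
    """Downscale by averaging each factor x factor block of pixels."""
    height, width = len(image), len(image[0])
    out = []
    for r in range(0, height - height % factor, factor):
        row = []
        for c in range(0, width - width % factor, factor):
            block = [image[r + dr][c + dc] for dr in range(factor) for dc in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def contrast(image: Image) -> float:
    """Difference between the brightest and darkest pixel."""
    pixels = [p for row in image for p in row]
    return max(pixels) - min(pixels)


# 8x8 checkerboard alternating between 255 and 0 at every pixel.
checker = [[255.0 if (r + c) % 2 == 0 else 0.0 for c in range(8)] for r in range(8)]
small = box_downscale(checker, 2)

print(contrast(checker))  # full contrast before downscaling
print(contrast(small))    # every 2x2 block averages to the same gray
```

Anything a forensic method would look for at that scale is simply absent from the smaller image, so a model that only sees the downscaled version has nothing to find.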
A Small Experiment, Honestly
I ran GPT-4V on a few dozen images from public manipulation datasets alongside a simple OpenCV baseline I trusted from coursework. The LLM’s verbal explanations sounded plausible on most images. Agreement with the baseline on localization: poor.
That gap is the lesson. Impressive language is not impressive geometry.
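One common way to put a number on localization agreement is intersection over union (IoU) between two boxes. This is an assumption about how you might score it, not a record of how I scored my own runs, and the boxes below are made up.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes; 0.0 if they do not overlap."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


baseline_box = (0, 0, 10, 10)     # hypothetical region from a classical baseline
llm_box = (5, 5, 15, 15)          # hypothetical region parsed from an LLM answer

print(round(iou(baseline_box, llm_box), 3))
```

Getting a box out of an LLM at all usually means asking for coordinates in the prompt and parsing the reply, which is one more place for the comparison to go wrong.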
When Students Should Use Multimodal LLMs
- Explaining CV outputs to non-technical teammates
- Triage: “is this worth running through the expensive pipeline?”
- Document extraction from heterogeneous forms where OCR + LLM beats pure OCR
- Rapid prototyping to validate whether anyone wants a feature
When They Should Not
- Primary tamper detection
- Anything legally or financially consequential without human review
- Replacing a trained detector because the demo video looked cool
- Anything where false positive costs exceed API costs by orders of magnitude
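The last bullet is an expected-cost argument, and a few lines of arithmetic make it concrete. All numbers here are hypothetical, chosen only to show the shape of the calculation.

```python
def expected_cost_per_image(api_cost: float, fp_rate: float, fp_cost: float) -> float:
    """API fee plus the average cost of acting on a false positive."""
    return api_cost + fp_rate * fp_cost


api_cost = 0.01    # hypothetical fee per image
fp_rate = 0.02     # hypothetical: 2% of clean images get flagged
fp_cost = 500.0    # hypothetical cost of one wrongly flagged case

total = expected_cost_per_image(api_cost, fp_rate, fp_cost)
fp_share = (fp_rate * fp_cost) / total

print(f"expected cost per image: {total:.2f}")
print(f"share caused by false positives: {fp_share:.1%}")
```

With these made-up inputs the API fee is a rounding error next to the cost of false positives, which is why a cheap call is not a cheap decision.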
Research Direction
The gap may narrow. Vision-language models with grounding are getting better at localization, and specialist models fine-tuned on manipulation datasets outperform generalist LLMs. The trend is toward ensembles, not replacement.
Betting your grade, or your company, on “GPT-N will solve vision” is betting against every CV assignment rubric that asks for numbers.
Multimodal LLMs are an interface layer and a reasoning layer. They are not a retina. Build your project accordingly.
I was learning that in lecture halls and group chats in September 2024. The demos will keep coming. The distinction stays the same.