
Multimodal LLMs Are Not Computer Vision Models

Sep 18, 2024 · 5 min read · Utso Sarkar

Every week in late 2024 someone in a campus group chat shares a GPT-4V demo and asks if we can “just use this instead of training a model.” Every week I try to explain why that is the wrong comparison.

Multimodal LLMs are language models that can see. They are not computer vision models. The distinction matters when your use case requires pixel-level fidelity, repeatable outputs, or anything you would grade in a lab assignment without hand-waving.

I am a Mathematics and Computing student at IIT Roorkee who has shipped Android apps and trained small CV models on a laptop. Here is where multimodal LLMs fit and where they do not — from that seat, not from a product I have not built yet.

What Multimodal LLMs Actually Do

Models like GPT-4V, Claude with vision, and Gemini Pro Vision encode images into tokens, fuse them with the text context, and generate language outputs. The output is always text (or structured text). The internal representation is optimized for semantic understanding and conversational coherence, not geometric precision.

They excel at:

Image captioning and description
Visual question answering (“how many people are in this photo?”)
Document understanding (OCR-ish extraction from clean scans)
Rough object identification in unconstrained photos
Multimodal reasoning that combines image context with world knowledge

They fail at:

Pixel-accurate localization without specialized tooling
Detecting subtle manipulations (ELA-level, copy-move, splicing artifacts)
Consistent results across near-duplicate inputs
Calibrated probability outputs for high-stakes workflows
Real-time inference at scale on high-resolution images

What Specialized CV Pipelines Need

Production computer vision — the kind I practiced with YOLO on a laptop, the kind assignment rubrics demand — needs:

Repeatability: Same input, same output, every time
Explainability: Bounding boxes and scores, not vibes
Calibration: You can measure false positives on a held-out set
Version pinning: Frozen weights for reproducibility
Latency budgets: Especially on mobile

Multimodal LLMs fail repeatability, calibration, and latency budgets out of the box. They partially address explainability with plausible-sounding prose that is not evidence. They are not designed for adversarial robustness.
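Repeatability and false positives are both things you can measure rather than argue about. A minimal sketch, assuming you have already collected the outputs yourself: `agreement_rate` takes the answers from repeated runs on the same input and reports how often they match the most common one, and `false_positive_rate` scores a detector on a labelled held-out set. The function names, the 0.5 threshold, and the sample values are illustrative assumptions, not part of any library.

```python
from collections import Counter


def agreement_rate(outputs):
    """Fraction of repeated-run outputs that equal the most common output.

    1.0 means every run produced the same answer (repeatable).
    """
    if not outputs:
        raise ValueError("need at least one output")
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)


def false_positive_rate(scores, labels, threshold=0.5):
    """FP / (FP + TN) on a held-out set.

    scores: detector confidence per image (higher = more likely tampered)
    labels: ground truth per image, True if actually tampered
    """
    negatives = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not negatives:
        raise ValueError("held-out set has no negative examples")
    false_positives = sum(1 for s in negatives if s >= threshold)
    return false_positives / len(negatives)


# Hypothetical data, for illustration only.
llm_answers = ["tampered", "clean", "tampered", "tampered"]  # same image, 4 runs
detector_scores = [0.9, 0.2, 0.7, 0.1, 0.6]
ground_truth = [True, False, False, False, True]

print(agreement_rate(llm_answers))                         # 0.75
print(false_positive_rate(detector_scores, ground_truth))  # 1 of 3 negatives flagged
```

A verdict that arrives only as prose has no score to threshold, so the second function cannot be swept across thresholds for it. That is what missing calibration looks like in practice.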

Task Suitability Matrix

quadrantChart
    title LLM Vision vs Specialized CV Models
    x-axis Low Precision Required --> High Precision Required
    y-axis Semantic Task --> Geometric/Forensic Task
    quadrant-1 Use Specialized CV
    quadrant-2 Hybrid Pipeline
    quadrant-3 LLM Sufficient
    quadrant-4 LLM Insufficient

    Image Captioning: [0.2, 0.15]
    Document QA: [0.35, 0.25]
    Object Counting: [0.45, 0.3]
    Damage Assessment: [0.65, 0.55]
    Tamper Detection: [0.85, 0.85]
    Copy-Move Forgery: [0.9, 0.9]
    Biometric Matching: [0.95, 0.7]
    Scene Understanding: [0.3, 0.2]
    Medical Imaging: [0.95, 0.95]

Upper-right quadrant: do not use multimodal LLMs as primary detectors. Lower-left: LLMs are fine, maybe overkill.

The Hybrid Pattern That Actually Makes Sense

Use LLMs where language is the output and specialized CV where pixels are the evidence:

CV model runs detection, segmentation, or feature extraction
Structured metadata captures scores, regions, model version
LLM generates human-readable summaries from structured inputs, not from raw pixels alone

The LLM should not be the only thing between a user and a consequential decision. It can explain what the CV stack found.

This is architecture homework, not a pitch deck.
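The three steps above can be sketched as a data structure plus a prompt builder. This is a minimal sketch under stated assumptions: `Detection`, `CVReport`, and `build_summary_prompt` are hypothetical names, the detector output is hard-coded rather than produced by a real model, and the actual LLM call is left out because it depends on the provider you use.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class Detection:
    label: str
    score: float   # detector confidence in [0, 1]
    box: tuple     # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class CVReport:
    model_version: str   # pinned weights identifier, for reproducibility
    image_id: str
    detections: list = field(default_factory=list)


def build_summary_prompt(report: CVReport) -> str:
    """Turn structured CV output into an LLM prompt.

    The LLM only sees what the CV stack measured; it is asked to
    explain those findings, not to re-inspect pixels.
    """
    payload = json.dumps(asdict(report), indent=2)
    return (
        "Summarize the following detector output for a non-technical reader. "
        "Do not add findings that are not in the JSON.\n\n" + payload
    )


# Hypothetical detector output, hard-coded for illustration.
report = CVReport(
    model_version="detector-v1.0-frozen",
    image_id="sample_001",
    detections=[Detection("spliced_region", 0.87, (40, 60, 180, 200))],
)
prompt = build_summary_prompt(report)
print(prompt)
```

Keeping scores, regions, and the model version in the payload means the summary can always be checked against the numbers it was written from.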

Specific GPT-4V Failure Modes I Hit

Confident hallucination on ambiguous regions. Compression artifacts described as “signs of digital manipulation.” Sometimes true. Often not. No confidence score.

Inconsistent across crops. Same region, different crop padding, different verbal assessment.

Resolution limits. Downscale a large photo to model input size, lose high-frequency detail, get a clean bill of health.

No frozen behavior. Model updates change outputs. Fine for chat. Bad for anything you need to reproduce in a report.
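The resolution limit above is easy to reproduce on a toy image. This sketch assumes a simple 2x2 box-average downscale, which is a stand-in; the actual preprocessing inside any hosted model may differ. A one-pixel checkerboard, the highest-frequency pattern a pixel grid can hold, averages out to flat gray.

```python
def checkerboard(size):
    """size x size grid alternating 0 and 255 every pixel."""
    return [[255 * ((r + c) % 2) for c in range(size)] for r in range(size)]


def downscale_2x(image):
    """Halve each dimension by averaging non-overlapping 2x2 blocks."""
    h, w = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4
            for c in range(0, w - 1, 2)
        ]
        for r in range(0, h - 1, 2)
    ]


original = checkerboard(8)
small = downscale_2x(original)

# Distinct pixel values before and after: the pattern is gone after downscaling.
print(sorted({v for row in original for v in row}))  # [0, 255]
print(sorted({v for row in small for v in row}))     # [127.5]
```

Any artifact that lives at that scale is simply not in the downscaled input, so no amount of reasoning on top can recover it.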

A Small Experiment, Honestly

I ran GPT-4V on a few dozen images from public manipulation datasets alongside a simple OpenCV baseline I trusted from coursework. The LLM’s verbal explanations sounded plausible on most images. Agreement with the baseline on localization: poor.

That gap is the lesson. Impressive language is not impressive geometry.
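One common way to put a number on localization agreement is intersection over union (IoU). A minimal sketch, assuming both the LLM-described region and the baseline region have been reduced to axis-aligned boxes; the coordinates below are invented for illustration and are not results from the experiment.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Overlap rectangle; width/height clamp to 0 when boxes do not intersect.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Invented boxes for illustration only.
baseline_box = (0, 0, 100, 100)
llm_box = (50, 50, 150, 150)

print(iou(baseline_box, baseline_box))  # 1.0
print(iou(baseline_box, llm_box))       # 2500 / 17500, about 0.143
```

Two boxes that overlap by a quarter of each still score only about 0.14, which is why a description that sounds roughly right can still fail a localization check.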

When Students Should Use Multimodal LLMs

Explaining CV outputs to non-technical teammates
Triage: “is this worth running through the expensive pipeline?”
Document extraction from heterogeneous forms where OCR + LLM beats pure OCR
Rapid prototyping to validate whether anyone wants a feature

When They Should Not

Primary tamper detection
Anything legally or financially consequential without human review
Replacing a trained detector because the demo video looked cool
Anything where false positive costs exceed API costs by orders of magnitude

Research Direction

The gap may narrow. Vision-language models with grounding improve localization. Specialist models fine-tuned on manipulation datasets outperform generalist LLMs. The trend is toward ensembles, not replacement.

Betting your grade — or your company — on “GPT-N will solve vision” is betting against every CV assignment rubric that asks for numbers.

Multimodal LLMs are an interface layer and a reasoning layer. They are not a retina. Build your project accordingly.

September 2024 me was learning that in lecture halls and group chats. The demos will keep coming. The distinction stays the same.
