Multimodal LLMs Are Not Computer Vision Models

Every week in late 2024 someone in a campus group chat shares a GPT-4V demo and asks if we can “just use this instead of training a model.” Every week I try to explain why that is the wrong comparison.

Multimodal LLMs are language models that can see. They are not computer vision models. The distinction matters when your use case requires pixel-level fidelity, repeatable outputs, or anything you would grade in a lab assignment without hand-waving.

I am a Mathematics and Computing student at IIT Roorkee who has shipped Android apps and trained small CV models on a laptop. Here is where multimodal LLMs fit and where they do not — from that seat, not from a product I have not built yet.

What Multimodal LLMs Actually Do

Models like GPT-4V, Claude with vision, and Gemini Pro Vision encode images into tokens, fuse them with the text context, and generate language outputs. The output is always text (or structured text). The internal representation is optimized for semantic understanding and conversational coherence, not geometric precision.

They excel at:

They fail at:

What Specialized CV Pipelines Need

Production computer vision — the kind I practiced with YOLO on a laptop, the kind assignment rubrics demand — needs:

  1. Repeatability: Same input, same output, every time
  2. Explainability: Bounding boxes and scores, not vibes
  3. Calibration: You can measure false positives on a held-out set
  4. Version pinning: Frozen weights for reproducibility
  5. Latency budgets: Especially on mobile

Multimodal LLMs fail criteria 1, 3, and 5 out of the box, and hosted models that change under you make 4 hard to guarantee. They address 2 only with plausible-sounding prose, which is not evidence. They are also not designed for adversarial robustness.
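Criteria 1 and 3 are the easiest to turn into a check you can actually run. What follows is a minimal sketch, not a benchmark harness: `is_repeatable` and `false_positive_rate` are names I made up for illustration, and `model` stands for any callable that takes image bytes and returns a JSON-serializable dict.

```python
import hashlib
import json
from typing import Callable, Sequence


def is_repeatable(model: Callable[[bytes], dict], image: bytes, runs: int = 5) -> bool:
    """Return True if `model` produces identical output on every run.

    `model` is a hypothetical stand-in for a detector or an API wrapper.
    """
    digests = {
        hashlib.sha256(
            json.dumps(model(image), sort_keys=True).encode("utf-8")
        ).hexdigest()
        for _ in range(runs)
    }
    return len(digests) == 1


def false_positive_rate(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Compute FP / (FP + TN) over a held-out set of binary labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    if fp + tn == 0:
        raise ValueError("held-out set contains no negative examples")
    return fp / (fp + tn)
```

A frozen detector should pass the first check trivially. The second only means something if the model emits a decision you can compare against labels, which free-form prose does not.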

Task Suitability Matrix

quadrantChart
    title LLM Vision vs Specialized CV Models
    x-axis Low Precision Required --> High Precision Required
    y-axis Semantic Task --> Geometric/Forensic Task
    quadrant-1 Use Specialized CV
    quadrant-2 Hybrid Pipeline
    quadrant-3 LLM Sufficient
    quadrant-4 LLM Insufficient

    Image Captioning: [0.2, 0.15]
    Document QA: [0.35, 0.25]
    Object Counting: [0.45, 0.3]
    Damage Assessment: [0.65, 0.55]
    Tamper Detection: [0.85, 0.85]
    Copy-Move Forgery: [0.9, 0.9]
    Biometric Matching: [0.95, 0.7]
    Scene Understanding: [0.3, 0.2]
    Medical Imaging: [0.95, 0.95]

Upper-right quadrant: do not use multimodal LLMs as primary detectors. Lower-left: LLMs are fine, maybe overkill.
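If you want the same matrix as a lookup, here is a rough sketch. The 0.5 midpoint and the function name `recommend` are my assumptions. The labels are copied from the chart, and the placement of each task on the axes is a judgment call, not a measurement.

```python
def recommend(precision: float, geometric: float) -> str:
    """Map a task's (x, y) position on the chart to its quadrant label.

    precision: 0 = low precision required, 1 = high precision required.
    geometric: 0 = semantic task, 1 = geometric/forensic task.
    The 0.5 split is an assumed midpoint.
    """
    if not (0.0 <= precision <= 1.0 and 0.0 <= geometric <= 1.0):
        raise ValueError("both coordinates must lie in [0, 1]")
    high_precision = precision >= 0.5
    high_geometric = geometric >= 0.5
    if high_precision and high_geometric:
        return "Use Specialized CV"   # upper right
    if high_geometric:
        return "Hybrid Pipeline"      # upper left
    if high_precision:
        return "LLM Insufficient"     # lower right
    return "LLM Sufficient"           # lower left


print(recommend(0.85, 0.85))  # Tamper Detection -> Use Specialized CV
print(recommend(0.2, 0.15))   # Image Captioning -> LLM Sufficient
```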

The Hybrid Pattern That Actually Makes Sense

Use LLMs where language is the output and specialized CV where pixels are the evidence:

  1. CV model runs detection, segmentation, or feature extraction
  2. Structured metadata captures scores, regions, model version
  3. LLM generates human-readable summaries from structured inputs, not from raw pixels alone

The LLM should not be the only thing between a user and a consequential decision. It can explain what the CV stack found.
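The three steps above can be sketched as a data contract between the CV stage and the LLM stage. This is one possible shape under my own assumptions, not a reference design: `Detection`, `CVReport`, and `build_summary_prompt` are hypothetical names, and the actual LLM call is left out because it depends on whichever API you use.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Detection:
    label: str
    score: float                     # detector confidence in [0, 1]
    box: Tuple[int, int, int, int]   # x_min, y_min, x_max, y_max in pixels


@dataclass(frozen=True)
class CVReport:
    model_version: str               # pinned weights identifier
    detections: List[Detection]


def build_summary_prompt(report: CVReport) -> str:
    """Turn structured CV output into a text-only prompt for the LLM.

    The LLM sees scores, regions, and the model version, never raw pixels.
    """
    payload = json.dumps(asdict(report), indent=2)
    return (
        "Summarize the findings below for a non-technical reader. "
        "Use only the fields provided and do not infer anything "
        "about the image itself.\n\n" + payload
    )


report = CVReport(
    model_version="yolo-demo-0.1",   # hypothetical version string
    detections=[Detection("scratch", 0.91, (10, 20, 110, 220))],
)
prompt = build_summary_prompt(report)
```

The point of the design is that every sentence the LLM writes can be traced back to a field in the report, and the report itself can be reproduced by rerunning pinned weights.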

This is architecture homework, not a pitch deck.

Specific GPT-4V Failure Modes I Hit

Confident hallucination on ambiguous regions. Compression artifacts described as “signs of digital manipulation.” Sometimes true. Often not. No confidence score.

Inconsistent across crops. Same region, different crop padding, different verbal assessment.

Resolution limits. Downscale a large photo to model input size, lose high-frequency detail, get a clean bill of health.

No frozen behavior. Model updates change outputs. Fine for chat. Bad for anything you need to reproduce in a report.
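The resolution point is easy to put numbers on. A back-of-the-envelope sketch follows. The 2000-pixel cap is a made-up figure chosen for clean arithmetic, not a documented limit of any model, and both function names are mine.

```python
def downscale_factor(width: int, height: int, max_side: int) -> float:
    """Factor by which an image shrinks so its longest side fits `max_side`."""
    if min(width, height, max_side) <= 0:
        raise ValueError("dimensions must be positive")
    return max(1.0, max(width, height) / max_side)


def surviving_width(feature_px: float, factor: float) -> float:
    """Approximate width in pixels of a feature after downscaling."""
    return feature_px / factor


factor = downscale_factor(6000, 4000, max_side=2000)  # hypothetical cap
print(factor)                      # 3.0
print(surviving_width(8, factor))  # about 2.67
```

Under that assumption, an 8-pixel seam in a 6000x4000 photo shrinks to under 3 pixels before the model looks at it, and each output pixel blends roughly 9 source pixels. Fine detail at that scale is mostly gone.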

A Small Experiment, Honestly

I ran GPT-4V on a few dozen images from public manipulation datasets alongside a simple OpenCV baseline I trusted from coursework. The LLM’s verbal explanations sounded plausible on most images. Agreement with the baseline on localization: poor.
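For anyone repeating this, one common way to score localization agreement is intersection over union (IoU) between two boxes. This sketch assumes both methods can be reduced to axis-aligned boxes, which takes some manual work for verbal LLM output. The 0.5 threshold is a conventional default, not necessarily what I used, and the function names are illustrative.

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def agreement_rate(pairs: Iterable[Tuple[Box, Box]], threshold: float = 0.5) -> float:
    """Fraction of (llm_box, baseline_box) pairs whose IoU meets the threshold."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(iou(a, b) >= threshold for a, b in pairs) / len(pairs)


print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # 1.0
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # 0.0
```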

That gap is the lesson. Impressive language is not impressive geometry.

When Students Should Use Multimodal LLMs

When They Should Not

Research Direction

The gap may narrow. Vision-language models with grounding improve localization. Fine-tuned specialist models on manipulation datasets outperform generalist LLMs. The trend is toward ensembles, not replacement.

Betting your grade — or your company — on “GPT-N will solve vision” is betting against every CV assignment rubric that asks for numbers.

Multimodal LLMs are an interface layer and a reasoning layer. They are not a retina. Build your project accordingly.

September 2024 me was learning that in lecture halls and group chats. The demos will keep coming. The distinction stays the same.
