
Vision Transformers in 2023: The Real Talk

Sep 15, 2023 · 6 min read · Utso Sarkar

Vision Transformers were supposed to kill CNNs by 2022. It is 2023. ResNet variants still dominate production deployments. ViT and its descendants win on benchmarks, in papers that assume infinite GPU budget and no latency constraints, and lose on engineering.

I am not anti-ViT. I use ViT-based models in production. But the gap between “SOTA on ImageNet” and “works in our factory inspection pipeline on edge hardware” is where most of the discourse conveniently stops.

ViT Architecture: What Actually Happens

flowchart LR
    subgraph Input
        IMG[Input image H x W x 3]
    end

    subgraph PatchEmbed["Patch Embedding"]
        PATCH[Split into P x P patches]
        FLAT[Flatten patches to sequence]
        PROJ[Linear projection to D dimensions]
        POS[Add positional embeddings]
    end

    subgraph Transformer["Transformer Encoder x L layers"]
        MHSA[Multi-Head Self-Attention]
        FFN[Feed-Forward Network]
        MHSA --> FFN
        FFN --> MHSA
    end

    subgraph Output
        CLS[CLS token or global average pool]
        HEAD[Classification / downstream head]
    end

    IMG --> PATCH --> FLAT --> PROJ --> POS
    POS --> MHSA
    FFN --> CLS --> HEAD

The image becomes a sequence of patch tokens. Self-attention lets every patch attend to every other patch. Global context from layer one. That is the core advantage over convolutions, which build receptive field gradually through stacked layers.

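The cost: self-attention is O(n^2) in the number of patch tokens n. A 224x224 image with 16x16 patches gives 14 x 14 = 196 tokens. Manageable. A 1024x1024 image at the same patch size gives 64 x 64 = 4,096 tokens, roughly 21 times as many, which means about 437 times as many attention scores per head and a GPU memory bill that makes your CFO ask questions.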
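To make the quadratic growth concrete, here is a minimal sketch of the arithmetic. The function names are mine, and the defaults (12 heads, 2 bytes per score for fp16, no CLS token, a fully materialized score matrix) are assumptions for illustration. Fused attention kernels avoid storing the full matrix, so treat the byte count as an upper-bound intuition rather than a measurement.

```python
def num_patch_tokens(height: int, width: int, patch: int = 16) -> int:
    """Number of patch tokens for an image split into patch x patch tiles."""
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return (height // patch) * (width // patch)


def attention_score_bytes(tokens: int, heads: int = 12, bytes_per_value: int = 2) -> int:
    """Bytes needed to hold one layer's n x n attention scores for every head.

    Assumes the score matrix is fully materialized (hypothetical worst case).
    """
    return heads * tokens * tokens * bytes_per_value


if __name__ == "__main__":
    for side in (224, 1024):
        n = num_patch_tokens(side, side)
        mib = attention_score_bytes(n) / 2**20
        print(f"{side}x{side}: {n} tokens, {mib:.2f} MiB of scores per layer")
```

Under these assumptions the score matrices come to under 1 MiB per layer at 224x224 and 384 MiB per layer at 1024x1024, per image.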

ViT vs CNN: Honest Tradeoffs

Dimension	CNN (ResNet, EfficientNet)	ViT (and variants)
Data efficiency	Good with limited data	Needs large datasets or strong pretraining
Inductive bias	Locality, translation equivariance built in	Must learn spatial relationships from data
Compute at inference	Lower for equivalent accuracy on many tasks	Higher attention cost, especially at high resolution
Pretraining leverage	ImageNet pretrain works, less transfer gap	Massive benefit from large-scale pretrain (DINOv2, CLIP)
Small object detection	Strong with FPN architectures	Requires adaptations (Deformable DETR, etc.)
Edge deployment	Mature quantization and pruning tooling	Catching up, still harder

The table oversimplifies, but it is directionally correct for 2023 production decisions.

DINOv2: The Pretraining Story That Matters

Meta’s DINOv2 (released in April 2023) is the most practically relevant ViT development for founders and engineers, not because it tops a leaderboard, but because it produces general-purpose visual features that transfer with minimal fine-tuning.

Self-supervised pretraining on 142 million curated images. Strong dense features for segmentation, depth estimation, and retrieval without task-specific labels. If you are building a computer vision product and need a backbone, DINOv2 ViT-S or ViT-B is a credible starting point.
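For a sense of what "starting point" looks like in code, here is a rough sketch that pulls a DINOv2 backbone through torch.hub. It assumes PyTorch is installed, that the machine can download weights from the facebookresearch/dinov2 hub repository, and that inputs are already normalized float tensors. The helper names are hypothetical, and the resize helper reflects my assumption that you want image sides snapped to the 14-pixel patch grid these checkpoints use.

```python
def snap_to_patch_multiple(size: int, patch: int = 14) -> int:
    """Round an image side length down to a multiple of the patch size."""
    if size < patch:
        raise ValueError("image side is smaller than one patch")
    return (size // patch) * patch


def embed_batch(images, model_name: str = "dinov2_vits14"):
    """Return one global embedding per image from a pretrained DINOv2 backbone.

    `images` is assumed to be a float tensor of shape (batch, 3, H, W) with
    H and W multiples of 14. The first call downloads the checkpoint.
    """
    import torch  # imported lazily so the helper above works without PyTorch

    model = torch.hub.load("facebookresearch/dinov2", model_name)
    model.eval()
    with torch.no_grad():
        return model(images)  # (batch, 384) for the ViT-S/14 checkpoint
```

In a real pipeline you would load the model once and reuse it rather than reloading inside the function. It is written this way only to keep the sketch short.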

What DINOv2 does not solve:

Inference latency on edge devices
Real-time requirements above 30 FPS on non-GPU hardware
Domain shift when your factory images look nothing like LVD-142M pretraining data (fine-tuning still required)
The engineering work of integrating a PyTorch checkpoint into your existing pipeline

Where ViT Wins in Production

Retrieval and similarity search. Embedding images with a pretrained ViT and searching by cosine similarity works well for duplicate detection, visual search, and content moderation. Near-duplicate image matching is one signal in larger deduplication pipelines.
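A minimal sketch of the search step, assuming the embeddings have already been computed by some backbone and fit in memory as a NumPy array. The function names are illustrative, and a brute-force matrix product stands in for whatever vector index you would use at scale.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def top_k_similar(query: np.ndarray, gallery: np.ndarray, k: int = 3):
    """Return (index, cosine similarity) for the k gallery rows closest to query."""
    q = l2_normalize(np.asarray(query, dtype=float)[None, :])[0]
    g = l2_normalize(np.asarray(gallery, dtype=float))
    scores = g @ q                      # one cosine score per gallery row
    order = np.argsort(-scores)[:k]     # highest similarity first
    return [(int(i), float(scores[i])) for i in order]
```

For near-duplicate detection you would typically threshold the returned score. The right threshold depends on the backbone and the data, so it is something to tune rather than copy.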

Semi-supervised and self-supervised pipelines. When labeling data is expensive and you have lots of unlabeled images, ViT backbones pretrained with DINO or MAE extract features that reduce labeling requirements.

Multi-modal systems. If you are already running a transformer for language (you are), sharing architectural patterns between vision and language encoders simplifies the stack. CLIP-style dual encoders enable zero-shot classification that conventional supervised CNN classifiers cannot match without retraining.
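The zero-shot step itself is small once both encoders exist. This sketch assumes you already have one image embedding and one text embedding per class prompt in the same space. The function name and the temperature default are placeholders of mine, not values taken from any particular checkpoint.

```python
import numpy as np


def zero_shot_probs(image_emb, text_embs, temperature: float = 0.01) -> np.ndarray:
    """Softmax over cosine similarities between one image and N class prompts."""
    img = np.asarray(image_emb, dtype=float)
    txt = np.asarray(text_embs, dtype=float)
    img = img / np.linalg.norm(img)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = (txt @ img) / temperature  # sharper distribution as temperature shrinks
    logits = logits - logits.max()      # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```

Adding a class means adding a row to `text_embs`. No image model is retrained, which is the property the paragraph above is pointing at.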

High-resolution document and scene understanding. When you need global context across an entire page or scene and can afford the compute, ViT attention captures long-range dependencies that CNNs need deep stacks to approximate.

Where CNNs Still Win

Real-time video on edge hardware. Factory inspection, autonomous drones, mobile AR. Convolutions with INT8 quantization on NPUs and TPUs still beat ViT on latency and power at equivalent accuracy for most edge tasks.
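For readers who have not looked inside INT8 quantization, this is a toy sketch of the simplest variant: symmetric, per-tensor, with the scale taken from the largest absolute value. It is my illustration of the idea, not how TensorRT or any specific NPU toolchain calibrates a model, and the function names are hypothetical.

```python
import numpy as np


def quantize_int8(x):
    """Map float values to int8 with one shared scale (symmetric, per-tensor)."""
    x = np.asarray(x, dtype=np.float32)
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid dividing by zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from int8 codes."""
    return q.astype(np.float32) * scale
```

The rounding error per value is bounded by half the scale, which is why tensors with a few large outliers quantize badly under this scheme.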

Small datasets without pretraining budget. If you have 500 labeled images and no GPU cluster for self-supervised pretraining, a fine-tuned EfficientNet-B0 will usually outperform a ViT-B/16 that lacks strong pretraining.

Mature deployment tooling. TensorRT, ONNX Runtime, Core ML, and TFLite have years of CNN optimization. ViT support exists, but the tooling edge goes to convolutions for now.

Proven architectures for detection and segmentation. YOLO, Mask R-CNN, and U-Net variants with CNN backbones are battle-tested in production. Transformer-based detectors (DETR family) are improving but the ecosystem is younger.

Practical Recommendations

Default to CNN backbones unless you have a specific reason to use ViT (retrieval, multi-modal, large-scale pretrain available).
Use DINOv2 features when you need strong general-purpose embeddings and compute is not the bottleneck.
Benchmark on your data, your hardware, your latency budget. ImageNet accuracy is irrelevant if your deployment target is a Raspberry Pi.
Do not rewrite a working CNN pipeline because a paper showed ViT wins on ImageNet by 0.3%. That is not engineering. That is vanity.
Watch hierarchical ViTs (Swin, PVT) if you need transformer benefits with better compute scaling for high-resolution inputs.
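"Benchmark on your hardware" can start as something this small. The sketch below times any zero-argument callable and compares a percentile against a latency budget. The names are hypothetical, the p95 is a simple index into the sorted samples, and for GPU inference you would need to synchronize the device inside the callable so the timer measures finished work.

```python
import statistics
import time


def measure_latency_ms(fn, warmup: int = 5, runs: int = 50) -> dict:
    """Time fn() repeatedly and report p50, p95, and max latency in milliseconds."""
    for _ in range(warmup):             # warm caches and lazy initialization
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "p50": statistics.median(samples),
        "p95": samples[p95_index],
        "max": samples[-1],
    }


def meets_budget(stats: dict, budget_ms: float, percentile: str = "p95") -> bool:
    """True if the chosen latency percentile fits inside the budget."""
    return stats[percentile] <= budget_ms
```

Wrap each candidate model's inference call in a lambda, run it on the deployment device with representative inputs, and compare the numbers side by side.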

The Research vs Production Gap

Academic CV in 2023 optimizes for benchmark rankings. Production CV optimizes for accuracy per dollar per millisecond per watt. These objectives diverge constantly.

ViT papers report results with massive pretraining, multi-GPU inference, and test-time augmentation. Your production pipeline has a single T4 GPU, no TTA, and an SLA of 200 ms per image. The model that wins in the paper is not the model that wins in your datacenter.

Be honest about which game you are playing. If you are publishing research, chase SOTA. If you are shipping product, chase the Pareto frontier on your actual constraints.
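Chasing the Pareto frontier has a precise meaning: keep only the candidates that no other candidate beats on every axis at once. Here is a two-axis sketch over accuracy and latency. The `Candidate` type and function names are hypothetical, and the numbers you feed in should be your own measurements, not figures from a paper.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    accuracy: float     # higher is better
    latency_ms: float   # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.latency_ms <= b.latency_ms
    strictly_better = a.accuracy > b.accuracy or a.latency_ms < b.latency_ms
    return no_worse and strictly_better


def pareto_frontier(candidates: List[Candidate]) -> List[Candidate]:
    """Candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


def best_within_budget(candidates: List[Candidate], budget_ms: float) -> Candidate:
    """Most accurate frontier candidate that fits the latency budget."""
    feasible = [c for c in pareto_frontier(candidates) if c.latency_ms <= budget_ms]
    if not feasible:
        raise ValueError("no candidate meets the latency budget")
    return max(feasible, key=lambda c: c.accuracy)
```

Extending this to cost or power means adding fields and widening `dominates`. The filtering logic stays the same.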

Closing

Vision Transformers are a real architectural advance. They are not a universal replacement for convolutions in 2023. DINOv2 makes ViT backbones practical for embedding and transfer learning tasks. CNNs remain the default for latency-constrained edge deployment.

The real talk: use the right backbone for your constraints, not the backbone from the most recent arXiv paper. Benchmark everything. Ship what works. Ignore the “CNNs are dead” discourse. They are very much alive in every factory and phone in the world.
