Image Compression Is a Solved Problem the Way Traffic Is a Solved Problem

Oct 15, 2021 · 6 min read · Utso Sarkar

People say image compression is solved the way people say traffic is solved: technically there are solutions, practically you are still stuck in a queue wondering where the design went wrong. I spent October 2021 optimizing image pipelines for Strato Foods — restaurant menu photos, dish images, and rider-uploaded proof shots on 3G connections. The textbook answer was “use WebP.” The real answer was more complicated, more interesting, and more frustrating.

The Codec Landscape in 2021

Four contenders mattered for our use case:

JPEG (1992): Universal support, fast encode/decode, terrible efficiency by modern standards. Still the default because everything accepts it. Baseline for comparison.

WebP (2010): Google’s answer. 25-35% smaller than JPEG at equivalent visual quality. Lossy and lossless modes. Android support since 4.0. iOS since 14. Good enough for most mobile apps.

AVIF (2019): Based on AV1 video codec intra frames. 50% smaller than JPEG in benchmarks. Encode times measured in seconds per image on mobile. Decode support spotty in 2021.

HEIC (2015): Apple’s choice. Excellent compression. Useless for cross-platform web delivery. Relevant only if your users are all on iOS, which ours were not.

Neural compression (learned image compression, LIC) was showing impressive rate-distortion (RD) curves in papers, but there were no production-ready encoders you could ship in a Flutter app without turning the integration into a research project.

The Quality vs Size Tradeoff Nobody Plots Honestly

Codec comparisons in blog posts pick a single quality setting, compare file sizes, declare a winner. Real systems live on a curve.

xychart-beta
    title "Rate-Distortion Tradeoff (qualitative)"
    x-axis "File size (KB)" 0 --> 500
    y-axis "Perceptual quality (SSIM)" 0 --> 1
    line "JPEG q=85" [400, 0.92]
    line "JPEG q=60" [200, 0.85]
    line "WebP q=80" [250, 0.92]
    line "WebP q=60" [120, 0.85]
    line "AVIF q=50" [100, 0.92]
    line "Neural codec" [80, 0.90]

The neural codec line looks magical until you add encode time as a third axis. AVIF looks great until you try it on a 2019 Oppo. WebP wins not because it is optimal but because it sits on the Pareto frontier of quality, size, encode speed, and compatibility.
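To make the "third axis" point concrete, here is a minimal sketch of a Pareto-frontier check. The size and SSIM values are the qualitative points from the chart above; the relative encode-cost column is a placeholder I made up for illustration (only the rough 4x gap between AVIF and WebP echoes the numbers in this post), so treat the output as a demonstration of the idea, not a benchmark.

```python
from typing import List, Tuple

Candidate = Tuple[str, float, float, float]  # name, size_kb, ssim, encode_cost

# Size and SSIM: qualitative chart points. Encode cost: hypothetical placeholders.
CANDIDATES: List[Candidate] = [
    ("JPEG q=85", 400, 0.92, 1.0),
    ("JPEG q=60", 200, 0.85, 1.0),
    ("WebP q=80", 250, 0.92, 2.0),
    ("WebP q=60", 120, 0.85, 2.0),
    ("AVIF q=50", 100, 0.92, 8.0),
    ("Neural codec", 80, 0.90, 30.0),
]


def dominates(a: Candidate, b: Candidate, use_cost: bool) -> bool:
    """True if a is no worse than b on every objective and better on at least one."""
    # Lower is better for every pair, so SSIM is negated.
    pairs = [(a[1], b[1]), (-a[2], -b[2])]
    if use_cost:
        pairs.append((a[3], b[3]))
    return all(x <= y for x, y in pairs) and any(x < y for x, y in pairs)


def pareto_frontier(candidates: List[Candidate], use_cost: bool) -> List[str]:
    """Names of candidates that no other candidate dominates."""
    return [
        c[0]
        for c in candidates
        if not any(dominates(other, c, use_cost) for other in candidates)
    ]


print(pareto_frontier(CANDIDATES, use_cost=False))
print(pareto_frontier(CANDIDATES, use_cost=True))
```

With these placeholder costs, the two-axis call returns only the AVIF and neural entries, while the three-axis call returns all six candidates. Once encode cost counts, nothing dominates WebP or even JPEG, and the constraints make the final choice.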

For our restaurant and menu photos:

Resolution: 1920x1080 to 4032x3024 depending on device
Content: food textures, steam blur, harsh kitchen lighting, occasional motion blur from riders
Constraint: upload on 3G in under 10 seconds
Target: under 300KB per image with legible menu text and recognizable dishes

JPEG at quality 85 averaged 800KB. Unacceptable. WebP at quality 75 averaged 280KB with acceptable detail. AVIF at equivalent quality averaged 180KB but took 4x longer to encode on mid-range Android.
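The size numbers translate into upload time with simple arithmetic. The sketch below assumes an effective 3G uplink of 250 kbit/s, which is a hypothetical figure rather than something measured in this project, and it ignores protocol overhead and retransmits. The function name is mine.

```python
def upload_seconds(size_kb: float, uplink_kbit_per_s: float) -> float:
    """Seconds to send size_kb kilobytes (1 KB = 1024 bytes) over the given uplink."""
    bits = size_kb * 1024 * 8
    return bits / (uplink_kbit_per_s * 1000)


ASSUMED_UPLINK = 250  # kbit/s, hypothetical effective 3G uplink

for label, size_kb in [("JPEG q=85", 800), ("WebP q=75", 280), ("AVIF", 180)]:
    print(label, round(upload_seconds(size_kb, ASSUMED_UPLINK), 1), "s")
# JPEG q=85 26.2 s
# WebP q=75 9.2 s
# AVIF 5.9 s
```

Under that assumption the 800KB JPEG blows the 10-second budget by a wide margin, WebP squeaks in, and AVIF has headroom on the wire that it then spends on encode time.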

We shipped WebP. Not because it is the best codec. Because it is the best codec for our constraints.

Why SSIM and PSNR Lie

PSNR (Peak Signal-to-Noise Ratio) is a log-scaled function of pixel-level mean squared error (MSE) averaged over the whole image. High-frequency detail loss in textures and small text (exactly what matters for legible menus) can look bad to humans while PSNR stays high.
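A toy example shows why a whole-image average hides local damage. This is a minimal sketch using the standard PSNR definition on flat lists of 8-bit pixel values; the two "degraded" patches are synthetic and chosen so that their MSE is identical.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], test: Sequence[float], peak: float = 255.0) -> float:
    """PSNR in dB between two equal-length sequences of pixel values."""
    if len(reference) != len(test):
        raise ValueError("inputs must have the same length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


original = [128] * 64            # flat 8x8 gray patch
spread = [130] * 64              # every pixel off by 2
local = [136] * 4 + [128] * 60   # four pixels off by 8, the rest untouched

print(psnr(original, spread) == psnr(original, local))  # True
```

Both patches have an MSE of 4, so PSNR cannot tell a faint overall shift from an error concentrated in four pixels. If those four pixels are a price on a menu, the second case is the one users notice.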

SSIM (Structural Similarity Index) is better, but it is typically computed on (often downsampled) luminance over local windows and then averaged into a single score. Two images with different compression artifacts in different regions can have identical SSIM while one preserves the edges of menu text and the other smears them.

VMAF (Netflix’s metric) is more perceptually aligned but designed for video at streaming bitrates. Overkill for still images and slow to compute at capture time.

We ended up with a hybrid evaluation:

SSIM for automated regression testing in CI
Human review on a sample set — restaurant owners comparing compressed vs original on their phones (the ground truth)
Downstream classifier sanity check on compressed images (does our simple tagging still work?)
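The first item in that list, an SSIM gate in CI, can be sketched roughly as follows. This is a simplification, not our actual test: it computes SSIM over one global window instead of the usual mean over sliding windows, works on flat lists of luminance values, and the baseline and tolerance values are hypothetical.

```python
from statistics import fmean
from typing import Sequence

# Standard SSIM stabilizing constants for 8-bit data (K1=0.01, K2=0.03, L=255).
C1 = (0.01 * 255) ** 2
C2 = (0.03 * 255) ** 2


def global_ssim(x: Sequence[float], y: Sequence[float]) -> float:
    """Single-window SSIM over two equal-length luminance sequences (0-255)."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    mx, my = fmean(x), fmean(y)
    vx = fmean([(a - mx) ** 2 for a in x])
    vy = fmean([(b - my) ** 2 for b in y])
    cov = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    numerator = (2 * mx * my + C1) * (2 * cov + C2)
    denominator = (mx * mx + my * my + C1) * (vx + vy + C2)
    return numerator / denominator


def passes_regression(score: float, baseline: float, tolerance: float = 0.01) -> bool:
    """Fail the build if the score drops more than `tolerance` below the baseline."""
    return score >= baseline - tolerance


reference = [10, 50, 90, 130, 170, 210]
flattened = [110] * 6  # all structure destroyed, mean preserved
print(global_ssim(reference, reference), global_ssim(reference, flattened))
```

A gate like this catches large regressions cheaply. It does not replace the other two checks, because a mean score cannot say where in the image the damage landed.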

The third test was the one that mattered. A compression setting that looked fine on a laptop made menu text illegible on a cheap Android screen. We tightened quality until readability recovered. Compression is not independent of your downstream pipeline. Treat it as part of the product.

Neural Compression: The Research vs Production Gap

Learned image compression papers in 2021 showed architectures like hyperprior autoencoders beating BPG and approaching AVIF on the Kodak dataset. Exciting.

Production reality:

No standard file format with universal decoder support
GPU dependency for reasonable encode times
Model size adds megabytes to your app
Domain shift: models trained on natural images compress document scans and damage photos differently

I experimented with a TFLite-quantized learned codec from a research codebase. Encoding a 1080p image took 2.3 seconds on GPU, 8 seconds on CPU. Output was 40% smaller than WebP. The research paper did not mention that the model was 12MB and the decoder had edge cases that produced block artifacts on high-contrast metal surfaces.

Neural compression will win eventually. In 2021 it was a conference paper, not a library import.

Practical Pipeline Decisions

What actually shipped:

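def compress_for_upload(image_path: str, max_kb: int = 300) -> bytes:
    """Return WebP bytes for upload, aiming for at most max_kb kilobytes.

    load_image, auto_orient, resize_long_edge, and encode_webp are helper
    functions that are not shown here.
    """
    img = load_image(image_path)
    img = auto_orient(img)  # EXIF rotation, non-negotiable
    img = resize_long_edge(img, max_px=2048)  # cap resolution first

    # Step quality down from 80 to 45; return the first encode under the cap.
    for quality in range(80, 40, -5):
        webp_bytes = encode_webp(img, quality=quality)
        if len(webp_bytes) <= max_kb * 1024:
            return webp_bytes

    # Fallback: aggressive resize, then a single encode at fixed quality.
    # Note: this last encode is not re-checked against max_kb.
    img = resize_long_edge(img, max_px=1280)
    return encode_webp(img, quality=50)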
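One gap in the function above: the fallback returns whatever the final encode produces, even if it is still over the limit. If the cap really is hard, a variant can walk through smaller resolutions and fail loudly when nothing fits. This is a sketch, not what shipped. `encode` is a hypothetical callable that wraps the resize and WebP encode steps, the 960px tier is my assumption, and `fake_encode` only exists to make the example runnable.

```python
from typing import Callable, Sequence, Tuple


def compress_with_hard_cap(
    encode: Callable[[int, int], bytes],
    max_bytes: int,
    long_edges: Sequence[int] = (2048, 1280, 960),
    qualities: Sequence[int] = tuple(range(80, 40, -5)),
) -> Tuple[int, int, bytes]:
    """Return (long_edge, quality, data) for the first encode within max_bytes.

    Tries the largest resolution first and steps quality down before shrinking.
    Raises ValueError if no combination fits, so the caller decides what to do.
    """
    for long_edge in long_edges:
        for quality in qualities:
            data = encode(long_edge, quality)
            if len(data) <= max_bytes:
                return long_edge, quality, data
    raise ValueError(f"no setting fits in {max_bytes} bytes")


def fake_encode(long_edge: int, quality: int) -> bytes:
    """Stand-in encoder for the demo: size grows with pixel count and quality."""
    return bytes(long_edge * long_edge * quality // 800)


edge, quality, data = compress_with_hard_cap(fake_encode, max_bytes=300 * 1024)
print(edge, quality, len(data))  # 2048 55 288358
```

Raising instead of returning an oversized file pushes the decision (retry later, ask for a retake, upload anyway) to the caller, which is where it belongs if upload constraints are binary.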

Key decisions:

Resize before compress. Halving resolution saves more bytes than any quality tweak.
Strip EXIF except orientation (already applied). Location metadata in delivery photos is a privacy headache.
Progressive quality reduction with a hard size cap. Upload constraints are binary.
Server-side re-encode for archival at higher quality. Client optimizes for upload speed.
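The "resize before compress" point is mostly pixel arithmetic: halving the long edge cuts the pixel count to a quarter before the codec does anything. A small sketch, assuming a resize that preserves aspect ratio and never upscales (the function name is mine, not the helper used in the pipeline):

```python
from typing import Tuple


def resized_dims(width: int, height: int, max_px: int) -> Tuple[int, int]:
    """Dimensions after capping the long edge at max_px, keeping aspect ratio."""
    long_edge = max(width, height)
    if long_edge <= max_px:
        return width, height  # never upscale
    return round(width * max_px / long_edge), round(height * max_px / long_edge)


w, h = resized_dims(4032, 3024, 2048)
print(w, h, round(4032 * 3024 / (w * h), 2))  # 2048 1536 3.88

w, h = resized_dims(4032, 3024, 1280)
print(w, h, round(4032 * 3024 / (w * h), 2))  # 1280 960 9.92
```

A 12-megapixel capture capped at 2048px has about 3.9x fewer pixels to encode, and the 1280px fallback has about 9.9x fewer. File size does not scale exactly with pixel count, but no quality-slider change gets close to that.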

The Traffic Analogy Extended

Traffic is “solved” because we have roads, traffic lights, and GPS routing. Yet you are still late because the system optimizes for average case and you live in the worst case.

Image compression is “solved” because we have codecs that work on average images for average viewers on average hardware. Food delivery photos are not average:

Steam and glare on dish shots
Extreme lighting in small kitchens (see my ML Kit post)
Users zooming in to read menu prices
Upload on networks that drop packets when it rains

What I Would Do Today

In 2021 WebP was the answer. A later pipeline might revisit:

AVIF with hardware encode on newer devices, WebP fallback
Content-adaptive compression: higher quality on text-heavy menu regions (ROI-aware encoding exists in research, emerging in tools)
On-device learned codecs as TFLite models mature

The Takeaway

Image compression is a solved problem the way calculus is a solved problem: the theory is complete, your integral is still messy.

Pick your codec based on constraints, not benchmarks. Measure with your downstream task, not just SSIM. Resize aggressively. Test on the worst phone on the worst network in your market.

And when someone says “just use AVIF,” ask them about encode time on a Rs. 10,000 Android phone. Then ask again.
