Appendix B: Datasets & Benchmarks Catalog

"They call me held-out data. I have been downloaded eleven million times, mirrored on four continents, and quietly memorized by at least one foundation model. At this point, I recognize the architectures before they recognize me."
A Thoroughly Memorized Test Set

Big Picture

A benchmark is not a pile of images; it is a contract. Every standard benchmark in vision is a four-part agreement: a dataset, a fixed split, a metric, and an evaluation protocol. Two papers reporting "COCO mAP" are comparable only if all four parts match. This appendix catalogs the field's standard contracts, task by task: what each dataset contains, how big it is, what you may legally do with it, and exactly which claim it is the accepted evidence for. Use it as a map when you choose training data, and as a checklist when you read a results table.

The datasets below are the ones the book's chapters lean on again and again: the classification sets behind the architecture story of Part III, the geometry and flow benchmarks behind the multi-view machinery of Part II, the restoration test sets that grade the enhancement methods of Part I, and the generative evaluation suites that anchor Part IV. Each entry records five things: name, size, contents, license and access conditions, and the specific benchmark role the dataset plays in the literature. Every URL points at the canonical source.

Two reading hints before the catalog proper. First, "license" in vision almost always means two licenses: one for the annotations (usually permissive) and one for the images (usually not yours to grant). Section 8 unpacks why that distinction matters. Second, the standard metric is part of each dataset's identity: ImageNet means top-1 accuracy, COCO means mAP averaged over IoU thresholds, Sintel means endpoint error. Reporting a different metric on the same images is a different benchmark, and honest papers say so explicitly.

1. Image Classification

Classification benchmarks form a ladder of difficulty that doubles as a history of the field. The five rungs below take you from a dataset a laptop trains on in seconds to one that requires a cluster, and from ten balanced classes to ten thousand species in a long tail. Table B.1 compares them at a glance.

MNIST

70,000 grayscale images of handwritten digits at 28×28 pixels, split 60,000 train and 10,000 test, ten classes. MNIST was assembled by Yann LeCun, Corinna Cortes, and Christopher Burges from two NIST handwriting collections and has been the "hello world" of machine learning since 1998. It is freely downloadable and redistributed everywhere, including built-in loaders in every framework. As a benchmark it is saturated: simple convolutional networks exceed 99.5 percent accuracy, so modern papers use it only for sanity checks, didactic examples, and method ablations where speed matters more than difficulty. Canonical page: yann.lecun.com/exdb/mnist.

Fun Fact

The two NIST collections behind MNIST had very different authors: one was written by Census Bureau employees, the other by high-school students. The original NIST split used the tidy bureaucrat handwriting for training and the messier student handwriting for testing, which made the task artificially hard. MNIST's lasting contribution was partly just remixing the two populations into both splits, an early lesson in distribution shift that the field keeps relearning.

CIFAR-10 and CIFAR-100

Each contains 60,000 color images at 32×32 pixels, split 50,000 train and 10,000 test. CIFAR-10 has ten coarse classes (airplane, automobile, bird, and so on) with 6,000 images each; CIFAR-100 has 100 classes with 600 images each, grouped into 20 superclasses. Curated by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton as labeled subsets of the 80 Million Tiny Images collection (which was itself withdrawn in 2020; the CIFAR subsets remain available), they are distributed without restriction from the University of Toronto. CIFAR-10 is the standard benchmark for fast architecture iteration and for reproducible-training exercises; CIFAR-100 stresses fine-grained discrimination at the same tiny resolution. Download: cs.toronto.edu/~kriz/cifar.html.

ImageNet-1k (ILSVRC 2012)

About 1.28 million training images, 50,000 validation images, and 100,000 test images with withheld labels, covering 1,000 object classes drawn from WordNet. ImageNet-1k is the single most consequential dataset in computer vision: top-1 accuracy on its validation set is the default headline metric for image backbones, and "pretrained on ImageNet" is the default starting point for transfer learning throughout Part III. Access requires registering at the official site and agreeing to non-commercial research terms; the images are web-scraped and may not be redistributed, though the ILSVRC 2012 package is also mirrored for competition use on Kaggle. Since 2021 an updated release blurs human faces, with negligible effect on benchmark accuracy. Official site: image-net.org.

ImageNet-21k

The full ImageNet release: roughly 14 million images spanning 21,841 WordNet synsets, of which ImageNet-1k is the famous subset. It is too inconsistent for clean evaluation (classes overlap, the hierarchy is uneven, and there is no official test split), so its role is pretraining scale: vision transformers and modern CNNs pretrain here before fine-tuning on ImageNet-1k. The cleaned, balanced ImageNet-21k-P preprocessing of Ridnik et al. is the variant most pretraining recipes actually use. Access follows the same registration and non-commercial terms as ImageNet-1k at image-net.org; the 21k-P processing scripts live at github.com/Alibaba-MIIL/ImageNet21K.

iNaturalist

A family of fine-grained, long-tailed species-classification datasets built from the iNaturalist citizen-science platform. The 2021 edition contains 2.7 million training images of 10,000 species; the 2017 and 2018 editions are smaller but even more imbalanced, which is precisely why they are cited: iNaturalist is the standard benchmark for long-tail recognition and fine-grained classification under realistic class imbalance, the problems that balanced CIFAR and ImageNet hide. Labels come from community consensus verified by experts. Each photograph carries the Creative Commons license its uploader chose (many are non-commercial), so audit per-image licenses before any commercial use. Competition data and download instructions: github.com/visipedia/inat_comp.

Table B.1 · Classification datasets at a glance

Dataset	Size	Standard benchmark for	License / access
MNIST	70k images, 10 classes, 28×28 gray	Sanity checks, didactic baselines (saturated)	Free, redistributed everywhere
CIFAR-10/100	60k images each, 10/100 classes, 32×32	Fast architecture iteration; fine-grained small-image recognition	Free download, no registration
ImageNet-1k	1.28M train / 50k val, 1,000 classes	Top-1 accuracy for backbones; transfer-learning pretraining	Registration, non-commercial research terms
ImageNet-21k	~14M images, 21,841 classes	Large-scale pretraining (no clean eval split)	Same terms as ImageNet-1k
iNaturalist 2021	2.7M train images, 10,000 species	Long-tail and fine-grained recognition	Per-image CC licenses, many non-commercial

2. Detection & Segmentation

Localization benchmarks add a second axis to evaluation: not just what is in the image but where. The six datasets below span the historical arc from 20-class detection to open-vocabulary, long-tail, and pixel-dense evaluation, and Table B.2 summarizes them.

Pascal VOC

The benchmark that defined object detection's first decade. The 2012 edition provides roughly 11,500 annotated images across 20 everyday classes with about 27,000 bounding boxes, plus a segmentation subset of 2,913 pixel-labeled images; the 2007 edition (closer to 10,000 images) remains in use because its test labels are public. The legacy metric, mAP at IoU 0.5, is still reported as "VOC-style mAP". Images were collected from Flickr, and usage is subject to Flickr's terms; the annotations themselves are freely available. Today VOC serves as a small, clean detection and semantic-segmentation testbed rather than a frontier challenge. Official site: host.robots.ox.ac.uk/pascal/VOC.

COCO (Common Objects in Context)

The center of gravity for modern detection and instance segmentation. The 2017 split provides 118,000 training and 5,000 validation images annotated with 80 object categories, around 1.5 million object instances with per-instance masks, five human captions per image, person keypoints for roughly a quarter of a million people, and panoptic labels. The COCO metric, mAP averaged over ten IoU thresholds from 0.5 to 0.95 in steps of 0.05 and computed by the official pycocotools code, is the standard headline number for every detector and instance segmenter in Part III. Annotations are CC BY 4.0; images come from Flickr under a mix of Creative Commons licenses. Site, downloads, and evaluation server: cocodataset.org.
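To make the threshold averaging concrete, here is a minimal sketch of box IoU and the ten thresholds that sit behind mAP@[.5:.95]. The helper names are illustrative and are not part of pycocotools, and the boxes use corner coordinates rather than COCO's native x, y, width, height format. Reported numbers should still come from the official evaluation code.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# The ten IoU thresholds averaged by the COCO metric: 0.50, 0.55, ..., 0.95
COCO_IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def thresholds_passed(pred, gt):
    """Count how many COCO thresholds one predicted box clears against one truth box."""
    iou = box_iou(pred, gt)
    return sum(iou >= t for t in COCO_IOU_THRESHOLDS)

gt = (0, 0, 100, 100)
pred = (10, 0, 110, 100)             # same size, shifted 10 pixels to the right
print(round(box_iou(pred, gt), 3))   # 0.818
print(thresholds_passed(pred, gt))   # 7
```

A ten-pixel shift on a hundred-pixel box still counts as a hit at the VOC-style 0.5 threshold but fails the three strictest COCO thresholds, which is why the two metrics reward localization quality so differently.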

LVIS

Large Vocabulary Instance Segmentation reuses COCO's images but annotates 1,203 categories with about two million high-quality masks, distributed in a natural long tail with explicit rare, common, and frequent buckets. Its federated annotation design means not every image is exhaustively labeled for every category, and its evaluation reports AP per frequency bucket; the rare-category AP is the standard measure of long-tail and open-vocabulary detection ability. Annotations are released under a permissive license mirroring COCO's; image licensing follows COCO. Site: lvisdataset.org.

Open Images

Google's web-scale localization dataset: about 9 million images carrying some 16 million bounding boxes over 600 classes, 2.8 million instance masks across 350 classes, visual-relationship triples, localized narratives, and, in V7, point-level labels. It is the largest publicly annotated detection corpus and the standard pretraining and scale-stress benchmark for detectors. Annotations are CC BY 4.0. The images were listed as CC BY 2.0 when they were collected, but Google makes no warranty about the license status of any individual image, and as always the underlying photographs remain their owners' property. Site: storage.googleapis.com/openimages/web.

ADE20K

The standard scene-parsing benchmark. The SceneParse150 protocol uses 20,210 training and 2,000 validation images densely labeled with 150 semantic categories covering both objects and "stuff" like sky, floor, and wall; the full dataset's open vocabulary exceeds 3,000 categories. Mean IoU on the 150-class split is the default semantic-segmentation number for new backbones and segmentation heads. Access requires registration at the MIT site and is free for research. Site: ade20k.csail.mit.edu.

Cityscapes

The standard urban driving segmentation benchmark: 5,000 finely annotated frames (2,975 train, 500 validation, 1,525 test) plus 20,000 coarsely annotated ones, captured across 50 European cities with stereo pairs, GPS, and vehicle odometry. Thirty classes are annotated and 19 are scored; 19-class mIoU on the held-out test server is the canonical metric, with instance-level and panoptic tracks alongside. Access is free for non-commercial research after registration, with commercial licensing available separately. Site: cityscapes-dataset.com.
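The mean IoU reported on ADE20K and Cityscapes can be sketched from a confusion matrix as follows. This is a minimal illustration, not the official scoring script: the function name and the ignore value of 255 are assumptions, and benchmark scripts typically accumulate the confusion matrix over the whole split rather than averaging per image.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes from two integer label maps of equal shape."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    keep = target != ignore_index            # drop unlabeled / ignored pixels
    pred, target = pred[keep], target[keep]
    # Confusion matrix: rows are ground-truth classes, columns are predictions
    conf = np.bincount(target * num_classes + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    valid = union > 0                        # skip classes absent from both maps
    return float((inter[valid] / union[valid]).mean())

target = [[0, 0, 1], [1, 2, 255]]            # last pixel is marked "ignore"
pred   = [[0, 1, 1], [1, 2, 2]]
print(round(mean_iou(pred, target, num_classes=3), 4))   # 0.7222
```

To score a full split with this sketch, concatenate the flattened label maps of all images before calling the function, so that rare classes are weighted by their dataset-wide pixel counts.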

Table B.2 · Detection and segmentation benchmarks

Dataset	Size	Standard benchmark for	License / access
Pascal VOC	~11.5k images, 20 classes (2012)	Legacy detection (mAP@0.5), small clean testbed	Free; images subject to Flickr terms
COCO	123k labeled images, 80 classes, 1.5M instances	Detection / instance segmentation mAP@[.5:.95]; keypoints; captions	Annotations CC BY 4.0; Flickr images
LVIS	COCO images, 1,203 classes, ~2M masks	Long-tail and open-vocabulary instance segmentation	Permissive annotations; COCO images
Open Images	~9M images, 16M boxes, 2.8M masks	Web-scale detection pretraining and evaluation	Annotations CC BY 4.0; images CC BY 2.0 (as collected)
ADE20K	20.2k train / 2k val, 150 scored classes	Scene parsing mIoU	Registration; free for research
Cityscapes	5k fine + 20k coarse frames, 19 scored classes	Urban semantic / instance / panoptic segmentation	Registration; non-commercial research

3. Geometry, Flow & 3D

Geometric benchmarks differ from recognition benchmarks in one crucial way: their ground truth is physically measured (laser scanners, structured light, rendered synthetic scenes) rather than human-labeled, so accuracy is graded in pixels and millimeters. These six anchor the evaluation of the stereo, flow, and reconstruction methods of Part II and their learned successors. Table B.3 compares them.

KITTI

The autonomous-driving benchmark suite, captured in 2011 around Karlsruhe from a car rigged with stereo cameras, a Velodyne LiDAR, and GPS/IMU. It hosts separate leaderboards for stereo, optical flow, scene flow, depth estimation, visual odometry (22 sequences), 3D object detection (7,481 training and 7,518 test frames), and tracking. KITTI's metrics (percentage of bad disparity pixels, translational drift per distance, 3D AP at fixed difficulty tiers) remain the standard for driving-domain geometry. Data is released under CC BY-NC-SA 3.0; commercial use requires separate arrangement. Site: cvlibs.net/datasets/kitti.

Middlebury

The precision benchmark family for stereo, optical flow, and multi-view stereo, maintained for over two decades. The 2014 stereo edition provides 33 high-resolution indoor image pairs with subpixel ground truth from structured light; the flow and multi-view tracks are similarly small but exquisitely accurate. Middlebury datasets are tiny by modern standards, which is the point: they measure the accuracy ceiling rather than scale, and the online leaderboards enforce one-submission discipline that keeps test sets honest. Free for research use with citation. Site: vision.middlebury.edu.

MPI Sintel

An optical-flow benchmark rendered from the Blender Foundation's open movie Sintel: 23 training sequences with 1,041 ground-truth flow fields and a 552-frame held-out test set, each in a "clean" pass and a "final" pass with motion blur, defocus, and atmospheric effects. Average endpoint error (EPE) on the final pass is the standard difficulty-frontier number for flow methods, classical and learned alike. Because the source movie is Creative Commons Attribution, the dataset is freely downloadable without registration. Site: sintel.is.tue.mpg.de.
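Average endpoint error is simple enough to state in a few lines. The sketch below assumes flow fields stored as (H, W, 2) arrays of per-pixel (u, v) displacements; the function name and the optional validity mask are illustrative, and leaderboard numbers come from the benchmark's own server.

```python
import numpy as np

def average_epe(flow_pred, flow_gt, valid=None):
    """Mean Euclidean distance between predicted and true flow vectors.

    flow_pred, flow_gt: arrays of shape (H, W, 2) holding (u, v) per pixel.
    valid: optional boolean (H, W) mask selecting the pixels to score.
    """
    diff = np.asarray(flow_pred, dtype=np.float64) - np.asarray(flow_gt, dtype=np.float64)
    epe = np.sqrt((diff ** 2).sum(axis=-1))
    if valid is not None:
        epe = epe[np.asarray(valid, dtype=bool)]
    return float(epe.mean())

gt = np.zeros((4, 4, 2))
pred = np.zeros((4, 4, 2))
pred[..., 0] = 3.0                  # every predicted vector is off by (3, 4)
pred[..., 1] = 4.0
print(average_epe(pred, gt))        # 5.0
```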

ScanNet

The standard indoor RGB-D corpus: 1,513 scanned sequences of 707 indoor spaces, about 2.5 million RGB-D frames, with camera poses, surface reconstructions, and 3D semantic and instance labels. It anchors benchmarks for 3D semantic segmentation, 3D instance segmentation (including the harder 200-class ScanNet200 vocabulary), and RGB-D reconstruction quality. Access requires emailing a signed terms-of-use agreement; usage is restricted to non-commercial research. Site: scan-net.org.

ETH3D

A multi-view stereo and SLAM benchmark with millimeter-accurate laser-scanned ground truth: roughly two dozen indoor and outdoor scenes in a high-resolution multi-view track, plus low-resolution many-view and stereo/SLAM tracks. Its accuracy and completeness scores at fixed distance thresholds are the standard evidence for dense-reconstruction quality alongside Tanks and Temples. Released under CC BY-NC-SA 4.0 with an online evaluation server. Site: eth3d.net.

Tanks and Temples

The large-scale reconstruction benchmark: 14 evaluation scenes (eight intermediate, six advanced) ranging from statues to entire building interiors, captured as high-resolution video with industrial laser scans as ground truth, plus a training set with public ground truth. Reconstruction quality is scored as an F-score, the harmonic mean of precision and recall at a per-scene distance threshold, evaluated on a server with held-out ground truth. It is the standard benchmark for photogrammetry pipelines such as COLMAP and, increasingly, for neural reconstruction methods. Data is licensed CC BY-NC-SA 3.0. Site: tanksandtemples.org.
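A point-cloud F-score of this kind can be sketched as below. This is an illustration under simplifying assumptions, not the benchmark's toolbox: the function names are hypothetical, the nearest-neighbor search is brute force (fine for a few thousand points only), and the official pipeline also aligns and crops the reconstruction before scoring.

```python
import numpy as np

def nearest_distances(src, dst):
    """Distance from each point in src (N, 3) to its nearest neighbor in dst (M, 3)."""
    d = np.linalg.norm(src[:, None, :] - dst[None, :, :], axis=-1)
    return d.min(axis=1)

def reconstruction_f_score(recon, truth, tau):
    """Return (precision, recall, F-score) at distance threshold tau."""
    recon = np.asarray(recon, dtype=np.float64)
    truth = np.asarray(truth, dtype=np.float64)
    # Precision: share of reconstructed points that lie close to the ground truth
    precision = float((nearest_distances(recon, truth) < tau).mean())
    # Recall: share of ground-truth points that the reconstruction covers
    recall = float((nearest_distances(truth, recon) < tau).mean())
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

truth = [[0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]]
recon = [[0, 0, 0.01], [1, 0, 0.01], [5, 0, 0]]    # two good points, one outlier
p, r, f = reconstruction_f_score(recon, truth, tau=0.05)
print(round(p, 3), round(r, 3), round(f, 3))       # 0.667 0.5 0.571
```

The toy case shows why both halves matter: the outlier costs precision, the two uncovered ground-truth points cost recall, and neither failure can hide behind the other.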

Table B.3 · Geometry, flow and 3D benchmarks

Dataset	Size	Standard benchmark for	License / access
KITTI	Driving suite; e.g. 7,481 train frames (3D detection), 22 odometry sequences	Driving stereo, flow, odometry, 3D detection	CC BY-NC-SA 3.0
Middlebury	Tens of pairs per track, subpixel ground truth	Accuracy ceiling for stereo, flow, MVS	Free for research, cite per track
MPI Sintel	1,041 training flow fields, 552 test frames	Optical flow EPE (clean and final passes)	Free; derived from CC BY movie
ScanNet	1,513 scans, ~2.5M RGB-D frames	3D semantic / instance segmentation, RGB-D reconstruction	Signed agreement, non-commercial
ETH3D	~25 scenes, laser ground truth	Multi-view stereo accuracy and completeness; SLAM	CC BY-NC-SA 4.0
Tanks and Temples	14 eval scenes + training scenes, laser ground truth	Large-scale reconstruction F-score	CC BY-NC-SA 3.0

4. Video Understanding

Video benchmarks split along what they actually test: appearance-dominated action labels, true temporal reasoning, spatiotemporal localization, and identity-preserving tracking. The four families below cover that spectrum; Table B.4 compares them.

Kinetics (400 / 600 / 700)

The ImageNet of action recognition: ten-second YouTube clips labeled with one human action each. Kinetics-400 provides roughly 240,000 training clips over 400 classes; the 600- and 700-class editions extend coverage to about 650,000 clips. Top-1 accuracy on Kinetics-400 is the standard headline number for video backbones, and Kinetics pretraining is the default initialization for downstream video tasks. The dataset is distributed as YouTube identifiers with timestamps, and the annotations are CC BY 4.0; the videos themselves remain on YouTube and decay over time as uploads vanish, so exact reproducibility degrades with the years. The CVD Foundation maintains downloadable archives: github.com/cvdfoundation/kinetics-dataset.

Something-Something V2

220,847 short clips of crowd-sourced performers executing 174 template actions like "pushing something so that it falls off the table". Because the labels describe object-agnostic interactions, frame-shuffled or single-frame models collapse on it: it is the standard benchmark for genuine temporal reasoning, the property Kinetics only weakly tests. The data is free for academic use after accepting the license from its current steward, Qualcomm: qualcomm.com/developer/software/something-something-v-2-dataset.

AVA

Atomic Visual Actions: 430 fifteen-minute movie clips in which every person is localized with a bounding box and labeled with one or more of 80 atomic actions (stand, talk to, carry) at one-second intervals, totaling about 1.6 million labels in version 2.2. Frame-level mAP at IoU 0.5 on AVA is the standard benchmark for spatiotemporal action detection, the task of answering who is doing what, where, and when. Annotations are released under CC BY 4.0; the underlying films are referenced, not redistributed. Site: research.google.com/ava.

MOT17 and MOT20

The MOTChallenge pedestrian-tracking benchmarks. MOT17 offers seven training and seven test sequences of street scenes, each paired with three public detector outputs so that tracking can be compared independently of detection quality. MOT20 adds eight sequences of very dense crowds (stations, rallies) with crowd densities an order of magnitude above MOT17, totaling over two million boxes. The modern headline metric is HOTA, reported alongside MOTA and IDF1, on the withheld test annotations via the evaluation server. Data is CC BY-NC-SA 3.0. Site and leaderboards: motchallenge.net.

Table B.4 · Video understanding benchmarks

Dataset	Size	Standard benchmark for	License / access
Kinetics-400/600/700	240k to 650k ten-second clips	Action-recognition top-1; video pretraining	CC BY 4.0 annotations; videos via YouTube (decaying)
Something-Something V2	220,847 clips, 174 templates	Temporal reasoning	Free for research via Qualcomm license
AVA v2.2	430 movie clips, ~1.6M labels, 80 actions	Spatiotemporal action detection (frame mAP)	CC BY 4.0 annotations
MOT17 / MOT20	14 + 8 sequences, >2M boxes	Multi-object tracking (HOTA, MOTA, IDF1)	CC BY-NC-SA 3.0

5. Restoration & Image Quality

The classic restoration test sets are unusual in being tiny: from five to a few dozen clean reference images against which degraded-and-restored versions are scored with PSNR and SSIM. Their small size is tolerable because the ground truth is exact (the clean image itself) and the degradations are synthesized on the fly. Table B.5 lists the standard sets, together with the larger DIV2K and FFHQ collections that supply training data.

BSD68 / CBSD68

Sixty-eight images carved from the Berkeley Segmentation Dataset's test split, used in grayscale (BSD68) and color (CBSD68) form. They are the standard denoising benchmark: methods report PSNR after removing additive Gaussian noise at sigma 15, 25, and 50, a protocol stretching unbroken from BM3D to today's deep denoisers. The parent BSDS data is freely available for research from Berkeley: www2.eecs.berkeley.edu/Research/Projects/CS/vision/bsds.
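The denoising protocol is easy to sketch: add Gaussian noise at a fixed sigma to a clean image, run the denoiser, and report PSNR against the clean reference. The snippet below scores the noisy input itself, which gives the baseline any denoiser must beat. The random array is only a stand-in for a BSD68 image, and details such as whether the noisy image is clipped or quantized vary between papers, so treat this as an assumption-laden illustration.

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-shape images."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))

rng = np.random.default_rng(0)                         # fixed seed for repeatability
clean = rng.integers(0, 256, size=(64, 64)).astype(np.float64)   # stand-in image

for sigma in (15, 25, 50):                             # the three standard noise levels
    noisy = clean + rng.normal(0.0, sigma, size=clean.shape)
    print(f"sigma={sigma}: noisy-input PSNR = {psnr(clean, noisy):.2f} dB")
```

With unclipped noise the input PSNR is close to 10·log10(255² / sigma²), about 20.2 dB at sigma 25, so a denoiser's reported gain is best read relative to that floor.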

Set5 and Set14

Five and fourteen images respectively (the baby, bird, butterfly, head, and woman of Set5 are among the most super-resolved pixels in history). They are the legacy super-resolution test sets: PSNR and SSIM on the luminance channel at 2×, 3×, and 4× upscaling, reported by essentially every SR paper since the early 2010s. Both sets are freely redistributed in standard form; a widely used canonical packaging (together with BSD100 and Urban100) ships with the SelfExSR project: github.com/jbhuang0604/SelfExSR.

DIV2K

800 training, 100 validation, and 100 test images at 2K resolution with high photographic quality and diverse content, created for the NTIRE super-resolution challenges. DIV2K is the standard training set for modern SR models and its validation split a standard benchmark; the bicubic-downsampling tracks define the canonical degradation protocol. The data is freely available for academic research. Site: data.vision.ee.ethz.ch/cvl/DIV2K.

FFHQ

Flickr-Faces-HQ: 70,000 aligned and cropped face photographs at 1024×1024, collected by NVIDIA for the StyleGAN line of work. FID against FFHQ is the standard benchmark for unconditional face generation, and the dataset doubles as training data for face restoration and editing research. The individual photographs were selected for permissive licenses (CC BY, public domain, and similar), the compiled dataset is released under CC BY-NC-SA 4.0, and NVIDIA honors removal requests from photographed individuals; treat any face dataset with extra care under biometric-privacy laws. Repository: github.com/NVlabs/ffhq-dataset.

Table B.5 · Restoration and quality test sets

Dataset	Size	Standard benchmark for	License / access
BSD68 / CBSD68	68 images	Gaussian denoising PSNR (sigma 15/25/50)	Free for research (Berkeley BSDS)
Set5 / Set14	5 / 14 images	Super-resolution PSNR/SSIM at 2-4×	Freely redistributed
DIV2K	1,000 images at 2K	SR training and NTIRE challenge protocol	Free for academic research
FFHQ	70k faces at 1024×1024	Face-generation FID; face restoration	CC BY-NC-SA 4.0 compilation; permissive source images

6. Generative & Multimodal

Generative benchmarks come in two flavors: training corpora that pair images with text at web scale, and evaluation suites that score what generators produce. The distinction matters, because the corpora are too dirty to evaluate on and the evaluation sets are too small to train on. Table B.6 covers both flavors; the metrics they feed (FID, CLIPScore, compositional accuracy) are dissected in Part IV.

LAION-5B (and Re-LAION-5B)

The web-scale corpus behind the open text-to-image era: 5.85 billion CLIP-filtered image-text pairs (2.3 billion in English) harvested from Common Crawl, distributed not as images but as URL-plus-metadata parquet files under CC BY 4.0, with the photographs remaining under their owners' copyright. Stable Diffusion and OpenCLIP were trained on LAION-5B or on subsets of it. Its later history matters to anyone who uses it: in December 2023 the Stanford Internet Observatory found links to child sexual abuse material in the index, and LAION withdrew the dataset; in August 2024 the project released Re-LAION-5B, a cleaned research-safe re-issue, which is the version to use today. Original announcement: laion.ai/blog/laion-5b; the re-release: laion.ai/blog/relaion-5b.

Conceptual Captions (CC3M and CC12M)

Google's curated alt-text corpora: CC3M pairs 3.3 million images with cleaned, hypernymized captions; CC12M relaxes the filtering to reach 12 million noisier pairs. Both are standard pretraining sets for captioning and vision-language models, and CC3M's validation split is a common captioning benchmark. Like LAION they are distributed as URL lists (the images stay with their owners), so a meaningful fraction of links has rotted; published papers routinely note the percentage they could still download. CC3M: ai.google.com/research/ConceptualCaptions; CC12M: github.com/google-research-datasets/conceptual-12m.

MS-COCO Captions

COCO's five-captions-per-image annotations double as the standard text-to-image evaluation set: the zero-shot "FID-30k" protocol samples 30,000 captions from the 2014 validation split, generates one image per caption, and computes FID against the corresponding real images. DALL-E 2, Imagen, Parti, and the Stable Diffusion releases all report this number, which makes it the closest thing text-to-image has to a shared headline metric, caveats and all. Data and details: cocodataset.org/#captions-2015.

ImageNet as a generative benchmark

The same ImageNet-1k from Section 1 is also the standard class-conditional generation benchmark: models generate 50,000 images at 256×256 (or 512×512) conditioned on class labels, and FID-50k is computed against the training distribution, usually alongside Inception Score and precision/recall. The class-conditional diffusion lineage (the U-Net-based ADM, the transformer-based DiT, and their successors) is ranked almost entirely on this protocol. Access terms are those of ImageNet itself: image-net.org.
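The distance at the heart of FID can be sketched in a few lines once features are in hand. The function below is a minimal illustration that takes two feature matrices as given; a real FID-50k run extracts those features with a specific Inception network and should use an established implementation, since preprocessing details shift the number. The function name is hypothetical, and the eigenvalue route to the matrix square-root trace is a choice made here to avoid a SciPy dependency.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two (N, D) feature sets, D >= 2."""
    feats_real = np.asarray(feats_real, dtype=np.float64)
    feats_gen = np.asarray(feats_gen, dtype=np.float64)
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative in theory;
    # clipping guards against tiny negative values from round-off.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = ((mu_r - mu_g) ** 2).sum()
    return float(mean_term + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))             # toy 4-D features, not Inception features
print(round(frechet_distance(real, real), 6))        # identical sets: approximately 0
print(round(frechet_distance(real, real + 3.0), 6))  # pure mean shift: approximately 36
```

The second call shifts every feature by 3 in each of four dimensions while leaving the covariance untouched, so only the squared mean difference, 4 × 3² = 36, survives.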

PartiPrompts

A prompt suite rather than an image dataset: more than 1,600 English prompts spanning 12 categories (animals, artifacts, world knowledge, writing) and 11 challenge dimensions from simple to "imagination". Released with Google's Parti model under a permissive license, it is the standard qualitative and human-evaluation suite for text-to-image systems; there is deliberately no automatic metric attached. Repository: github.com/google-research/parti.

GenEval

An object-focused automatic evaluation framework for text-to-image alignment: 553 templated prompts probe six compositional skills (single object, two objects, counting, color, position, and color binding), and an off-the-shelf detector and segmenter verify whether each generated image actually satisfies its prompt. GenEval scores are now a standard table in text-to-image papers, including the Stable Diffusion 3 and FLUX reports, precisely because the check is objective rather than aesthetic. MIT-licensed code: github.com/djghosh13/geneval.

Table B.6 · Generative and multimodal datasets and evaluation suites

Resource	Size	Standard role	License / access
LAION-5B / Re-LAION-5B	5.85B / 5.5B URL-text pairs	Web-scale T2I and CLIP training corpus (use Re-LAION)	CC BY 4.0 metadata; images stay with owners
CC3M / CC12M	3.3M / 12M URL-caption pairs	Vision-language pretraining; captioning	Free annotations; URL distribution, link rot
MS-COCO Captions	5 captions × 123k images	Zero-shot FID-30k for text-to-image	Annotations CC BY 4.0
ImageNet-1k (FID)	50k generated vs train reference	Class-conditional generation FID-50k	ImageNet research terms
PartiPrompts	>1,600 prompts	Qualitative / human T2I evaluation	Permissive, on GitHub
GenEval	553 prompts + verifier code	Compositional T2I alignment score	MIT

7. Choosing a Benchmark Honestly

A benchmark choice is a claim about what your method is for, so the first rule is alignment: pick the dataset whose difficulty axis matches your contribution. A long-tail method belongs on LVIS or iNaturalist, not on balanced CIFAR; a temporal-reasoning model belongs on Something-Something, not on appearance-dominated Kinetics; a precision-focused stereo method belongs on Middlebury, a robustness-focused one on KITTI. Evaluating where your method's advantage is invisible wastes the experiment, and evaluating only where it is most flattering invites the reviewer question you least want.

The second rule is protocol fidelity. Use the official evaluation code (pycocotools for COCO, the Cityscapes scripts, the MOTChallenge and Tanks and Temples servers) rather than reimplementing metrics, because subtle choices like IoU thresholds, ignore regions, and boundary handling change numbers by amounts larger than many claimed improvements. Report the exact split, resolution, and metric variant; "COCO val2017 mAP@[.5:.95] at 640 pixels, single-scale" is a benchmark, "COCO accuracy" is not. And where a leaderboard with withheld test labels exists, submit to it: the limited-submission discipline is what keeps the number meaningful.

Key Insight

Most "test sets" in daily use are actually validation sets. ImageNet's labeled 50k split, COCO val2017, and Cityscapes val have absorbed roughly a decade of model selection, and the ImageNetV2 replication study (github.com/modestyachts/ImageNetV2) showed that ImageNet accuracies drop by roughly 11 to 14 points on a freshly collected test set built with the original protocol, even though the ranking of models is largely preserved. A headline number is therefore tied to its particular split far more tightly than the word "test" suggests. The honest pattern is to tune on the validation split, touch the held-out test server once, and treat any saturated benchmark (MNIST above 99.5, CIFAR-10 above 99) as a sanity check rather than evidence.

8. Licensing Pitfalls

Vision datasets have a structural licensing problem: the people who annotate images rarely own them. The annotation files for COCO or Open Images are genuinely CC BY 4.0, but that license cannot launder the underlying photographs, which were scraped from Flickr or the open web under terms their owners set. For research this distinction is mostly academic; the moment a model trained on such data ships in a product, it stops being academic. Read the image-license story separately from the annotation-license story for every dataset you adopt, and remember that several workhorse benchmarks (KITTI, Cityscapes, ScanNet, ETH3D, MOTChallenge, ImageNet) carry explicit non-commercial clauses that cover the data, and arguably models derived from it, depending on jurisdiction and counsel.

Warning

Datasets get withdrawn, and projects built on them inherit the problem. 80 Million Tiny Images (CIFAR's parent) was retracted in 2020 over offensive labels; MS-Celeb-1M and DukeMTMC were taken down over consent and privacy failures; LAION-5B was withdrawn in 2023 and only its cleaned Re-LAION re-issue is appropriate today. URL-distributed corpora (LAION, CC12M, Kinetics) additionally decay as links rot, so the dataset you download is not quite the dataset the paper used. Defensive practice: record the dataset name, version, download date, license text, and a checksum in your project repository at the moment you first train on it, and check the source page for takedown notices before any release.
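One way to follow that defensive practice is a small provenance file written next to the training code. The sketch below uses only the standard library; the function names, record fields, and output filename are illustrative choices rather than any established standard.

```python
import hashlib
import json
from datetime import date
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_provenance(archive, name, version, license_text, out_path):
    """Record what was downloaded, when, under which terms, and its checksum."""
    record = {
        "name": name,
        "version": version,
        "download_date": date.today().isoformat(),
        "license": license_text,
        "archive": Path(archive).name,
        "sha256": sha256_of(archive),
    }
    Path(out_path).write_text(json.dumps(record, indent=2))
    return record

# Example call (hypothetical local filename; substitute your own download):
# write_provenance("cifar-10-python.tar.gz", "CIFAR-10", "python version",
#                  "see source page", "DATASET_PROVENANCE.json")
```

Committing the resulting JSON alongside the experiment configuration means a later takedown or silent re-upload shows up as a checksum mismatch instead of an unexplained change in results.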

9. Dataset Contamination

When training corpora reach billions of web images, the question stops being whether benchmark data leaked into training and becomes how much. LAION-scale scrapes demonstrably contain copies and near-duplicates of ImageNet, COCO, and most other public test imagery, which complicates every "zero-shot" claim built on top of them: a CLIP-style model evaluated zero-shot on ImageNet may have seen many of those exact images, captioned, during pretraining. Generative evaluation has a parallel failure mode: a diffusion model that memorizes training images can score an excellent FID against that same distribution while quietly regurgitating its sources.

The practical defenses are deduplication and disclosure. Strong evaluations run near-duplicate detection (perceptual hashing or CLIP-embedding similarity) between the training corpus and the test set, report the overlap, and rerun the benchmark with detected duplicates removed; strong dataset releases publish their dedup methodology up front. When you read a results table in the foundation-model era, the absence of any contamination analysis is itself information. And when you build your own evaluation, prefer test data created after your training corpus was frozen, the one contamination check that needs no tooling.
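The perceptual-hashing half of that defense can be sketched with a simple average hash. This is a toy under stated assumptions: grayscale images arrive as 2-D arrays at least eight pixels on a side, the five-bit distance threshold is an arbitrary illustrative choice, and the all-pairs loop would need an index (or an embedding-based search) at real corpus scale.

```python
import numpy as np

def average_hash(gray, hash_size=8):
    """Boolean fingerprint: block-average to hash_size x hash_size, threshold at the mean."""
    gray = np.asarray(gray, dtype=np.float64)
    h, w = gray.shape
    # Crop so both sides divide evenly, then average each block
    gray = gray[: h - h % hash_size, : w - w % hash_size]
    blocks = gray.reshape(hash_size, gray.shape[0] // hash_size,
                          hash_size, gray.shape[1] // hash_size).mean(axis=(1, 3))
    return (blocks > blocks.mean()).ravel()

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return int(np.count_nonzero(a != b))

def find_near_duplicates(train_images, test_images, max_distance=5):
    """Return (test_index, train_index, distance) for pairs within max_distance bits."""
    train_hashes = [average_hash(img) for img in train_images]
    hits = []
    for test_idx, img in enumerate(test_images):
        test_hash = average_hash(img)
        for train_idx, train_hash in enumerate(train_hashes):
            d = hamming(test_hash, train_hash)
            if d <= max_distance:
                hits.append((test_idx, train_idx, d))
    return hits

rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(64, 64))    # stand-in "training" image
unrelated = rng.integers(0, 256, size=(64, 64))   # a different image
leaked = 0.9 * original + 5.0                     # same picture, slightly darker copy
print(find_near_duplicates([original, unrelated], [leaked]))
```

The darker copy should be flagged against the original and not against the unrelated image, because thresholding at the mean makes the fingerprint insensitive to global brightness and contrast changes. Crops, flips, and heavy recompression defeat a hash this simple, which is why embedding similarity is the usual complement.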

Research Frontier

Dataset curation has become a research topic in its own right. The DataComp benchmark (datacomp.ai) inverts the usual contest: the model and training recipe are fixed and participants compete on filtering a 12.8-billion-pair candidate pool, turning data quality into the measured variable. Re-LAION-5B (2024) established the template for safety-revised re-releases of web corpora. And on the evaluation side, compositional suites are racing the generators: T2I-CompBench and the dense-prompt DPG-Bench extend GenEval-style verifiable scoring to harder attribute-binding and long-prompt regimes, and each new text-to-image release through 2025 reports against this moving battery.

10. Dataset Hubs & Loading Tooling

Three hubs cover most practical needs. The Hugging Face Hub (huggingface.co/datasets) hosts mirrors of most catalog entries above with versioned, programmatic access through the datasets library. Papers with Code (paperswithcode.com) indexed datasets, leaderboards, and code from 2018 onward; after its 2025 sunset by Meta its archives remain a useful historical record, with the paper-and-code discovery role continuing at Hugging Face Papers (huggingface.co/papers). The FiftyOne dataset zoo (docs.voxel51.com/dataset_zoo) adds the piece the others lack: visual inspection of images and labels before you commit to training on them.

Library Shortcut

You rarely need to hand-download archives and write parsers. One line of the Hugging Face datasets API replaces the download-extract-parse boilerplate (often fifty lines or more per dataset) and handles caching, versioning, and streaming internally:

from datasets import load_dataset

cifar = load_dataset("uoft-cs/cifar10")        # 50,000 train / 10,000 test
print(cifar["train"].features["label"].names)  # ['airplane', 'automobile', ...]

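# Web-scale sets stream without a full download.
# This repository may be gated: if loading fails, accept its terms on the Hub
# and authenticate first. Column names differ between LAION releases, so
# inspect the first record's keys if "url" is missing.
laion_sample = load_dataset("laion/relaion2B-en-research-safe",
                            split="train", streaming=True)
print(next(iter(laion_sample))["url"])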

Loading CIFAR-10 eagerly and streaming a research-safe Re-LAION shard lazily with the Hugging Face datasets library; caching, checksums, and shard management are handled internally.

For detection and segmentation data, looking at the labels before training is the cheapest bug prevention available, and FiftyOne makes the inspection loop two lines plus a browser tab. The snippet below pulls a slice of COCO and opens the interactive viewer.

import fiftyone as fo
import fiftyone.zoo as foz

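# Download only what you need: 500 validation images with their labels.
# Boxes load by default; request "segmentations" as well to get instance masks.
dataset = foz.load_zoo_dataset("coco-2017", split="validation",
                               label_types=["detections", "segmentations"],
                               max_samples=500)
session = fo.launch_app(dataset)  # browse boxes and masks in the browser
session.wait()                    # keep the app open when run as a script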

Pulling a 500-image slice of COCO val2017 through the FiftyOne zoo and launching the in-browser viewer to eyeball boxes and masks before any training run.

However you obtain a dataset, close the loop the same way: verify the checksum or sample count against the official release notes, skim a few dozen samples visually, and pin the version string in your experiment configuration. The catalog above tells you which contract you are signing; this last habit makes sure you actually received the goods.