Section 27.3: Learning on Point Clouds: PointNet & Successors

"They handed me a bag of three-dimensional confetti and asked what it was. No grid, no order, no edges. So I looked at every piece at once, kept only the loudest opinion about each feature, and declared it a chair. I was right, which still surprises me."
A Network That Learned to Think in Sets

Big Picture

To learn directly on a point cloud you need a network that is invariant to the order of its inputs, and the deceptively simple answer is to apply the same small network to every point independently and then aggregate with a symmetric function. A point cloud is an unordered set, so any architecture that consumes it must produce the same answer no matter how the points are listed. PointNet achieves this with two ideas: a shared per-point MLP that lifts each point into a feature vector, and a symmetric pooling operation (a max over all points) that collapses those features into one global descriptor immune to ordering. This section derives why that works, builds PointNet from scratch in PyTorch, and then follows the line of successors (PointNet++, DGCNN, point transformers) that restored the local neighborhood structure the original flat design threw away.

Section 27.2 left us with a point cloud and a problem: the convolution that powers every network in Part III needs a regular grid of neighbors, and a point cloud has none. Voxelizing the cloud (subsection 2 of the previous section) restores the grid but pays the cubic memory curse and quantizes away fine detail. The question PointNet answered, in 2017, was whether a network could consume the raw, unordered points directly. The answer reshaped 3D deep learning, and the core trick is a beautiful piece of architectural reasoning about symmetry.

1. The Permutation Problem Beginner

Let a point cloud be a set $\{x_1, x_2, \ldots, x_N\}$ of $N$ points in 3D. A network that classifies the cloud computes some function $f(\{x_1, \ldots, x_N\})$. Because the set is unordered, $f$ must satisfy permutation invariance: shuffling the points must not change the output. A standard MLP fed the points as a flattened vector $[x_1; x_2; \ldots; x_N]$ violates this immediately, the first weight column only ever sees $x_1$, so reordering the points changes the answer. A 1D convolution along the point index fails for the same reason: it assumes the index carries meaning, and it does not.

The theorem behind PointNet is that any continuous permutation-invariant function can be approximated arbitrarily well in the form

f(\{x_1, \ldots, x_N\}) \approx \gamma\Big(\underset{i=1\ldots N}{\text{MAX}}\; h(x_i)\Big)

where $h$ is a per-point transformation (applied identically to each point), MAX is an element-wise maximum over all the point features (a symmetric function, immune to order), and $\gamma$ is a final transformation of the pooled result. The MAX is doing all the work: because the maximum of a set is the same regardless of how the set is listed, the whole expression is permutation-invariant by construction. PointNet simply makes $h$ and $\gamma$ learnable multilayer perceptrons (MLPs).

Key Insight: A Symmetric Pool Buys Permutation Invariance for Free

The single most important design decision in point-cloud learning is the choice of aggregation. Sum, mean, and max are all symmetric (order-independent), so any of them grants permutation invariance. PointNet uses max because it acts as a feature-wise selector: for each of the learned feature channels, the network keeps only the point that most strongly activates it, so the global descriptor is built from a sparse set of "critical points" that outline the shape. This is why PointNet is robust to missing or extra points: as long as the critical points survive, the descriptor barely changes. Choose your aggregation and you have chosen your invariance.

Fun Note: A Committee That Only Listens to the Loudest Voice

Max-pooling runs the world's most ruthless meeting. Thousands of points each shout their opinion on every feature, and for each feature the network keeps exactly one voice, the loudest, and discards the rest. The shape ends up summarized by a tiny clique of "critical points" that happened to win their channels. Delete eighty percent of the silent majority and the chair is still a chair. Democracy this is not, but it is gloriously order-independent. The illustration below stages exactly that meeting.

A robot moderator at a podium keeps only the single loudest shouting point-character for each topic while dimming the rest, and the chosen loud characters faintly outline a chair, illustrating how PointNet's symmetric max-pool selects a sparse set of critical points to give an order-independent global descriptor. — A symmetric max-pool runs the most ruthless meeting in machine learning: for each feature it keeps only the loudest point, and that tiny clique of critical points is enough to still recognize the chair.

2. Building PointNet From Scratch Intermediate

The architecture follows the formula directly, as Figure 27.3.1 shows. Each point (an $x, y, z$ triple, plus optional features) passes through a shared MLP, implemented as a stack of $1 \times 1$ convolutions so the same weights apply to every point. After lifting each point to a high-dimensional feature, a single max-pool over the point axis produces one global feature vector for the whole cloud. For classification, that vector feeds a small MLP to class scores; for segmentation, the global feature is concatenated back onto every per-point feature so each point can be labeled using both its own identity and the global context.

Figure 27.3.1: The PointNet architecture. A shared MLP (implemented as $1 \times 1$ convolutions, so identical weights touch every point) lifts each point to a 1024-dimensional feature; a single max-pool over the point axis yields a permutation-invariant global descriptor. Classification reads off the global vector; segmentation concatenates it back onto every per-point feature.

The implementation below is the classification branch, complete enough to run. The crucial line is the torch.max over the point dimension, which is the symmetric aggregation of subsection 1. Read the per-point MLP as the function $h$ and the post-pool layers as $\gamma$.

# PointNet classification: a shared per-point MLP (1x1 convs) lifts each point
# to a feature vector, then a single max-pool over the point axis collapses them
# into one permutation-invariant global descriptor before the class head.
import torch
import torch.nn as nn

class PointNetCls(nn.Module):
    """PointNet classification head. Input: (B, 3, N) point clouds. Output: class logits."""
    def __init__(self, num_classes=10):
        super().__init__()
        # Shared per-point MLP as 1x1 convs: same weights applied to every point.
        self.feat = nn.Sequential(
            nn.Conv1d(3, 64, 1),   nn.BatchNorm1d(64),   nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128),  nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, num_classes))

    def forward(self, x):                  # x: (B, 3, N)
        f = self.feat(x)                   # (B, 1024, N) per-point features
        g = torch.max(f, dim=2)[0]         # (B, 1024) symmetric max-pool over points
        return self.head(g)                # (B, num_classes)

model = PointNetCls(num_classes=10)
cloud = torch.randn(4, 3, 2048)            # batch of 4 clouds, 2048 points each
print("logits:", model(cloud).shape)       # logits: torch.Size([4, 10])

# Permutation invariance check: shuffle the points, the output is unchanged.
perm = torch.randperm(2048)
same = torch.allclose(model(cloud), model(cloud[:, :, perm]), atol=1e-5)
print("invariant to point order:", same)   # invariant to point order: True

Code Fragment 1: PointNet classification from scratch. The shared MLP is three $1 \times 1$ convolutions; torch.max(f, dim=2) is the order-independent pool. The final assertion confirms the network's output is identical after the points are randomly shuffled, the permutation invariance proved in subsection 1.

The full PointNet adds small T-Net sub-networks that predict an affine transform to align the input cloud and its features into a canonical pose, improving robustness to rotation. We omit them here for clarity; they are a refinement, not the central idea, and modern successors often drop them in favor of better local modeling.

3. Bringing Back Locality: PointNet++, DGCNN, and Point Transformers Advanced

PointNet's elegance is also its weakness. Because it pools globally in one shot, it sees each point and the whole cloud, but nothing in between: it has no notion of local neighborhoods, the very thing that made CNNs powerful through their hierarchy of receptive fields (Chapter 19). Fine geometric detail, a sharp edge, a small protrusion, is lost in the global max. The successors all answer the same question: how do you give a point-cloud network a sense of locality without a grid?

PointNet++ applies PointNet recursively. It samples a set of centroid points, gathers each one's local neighborhood (the points within a radius), runs a small PointNet on each neighborhood to summarize it, and repeats on the summarized points. This builds exactly the coarse-to-fine hierarchy of a CNN, but on sets: local features at the bottom, global structure at the top. DGCNN takes a graph view: at each layer it builds a k-nearest-neighbor graph (recomputed in feature space, hence "dynamic") and applies an edge convolution that processes each point together with its neighbors' differences, capturing relationships rather than isolated points. Point transformers (2021 onward) replace the pooling with self-attention restricted to local neighborhoods, importing the attention mechanism of Chapter 22 into 3D and currently topping most point-cloud benchmarks.

Practical Example: Autonomous-Driving LiDAR Segmentation Picks Its Battles

Who: a perception team at an autonomous-driving company, 2023, segmenting LiDAR point clouds into road, vehicle, pedestrian, and background. Situation: a single spinning LiDAR sweep is roughly 120,000 points, arriving ten times a second, and the segmentation had to run in real time on an in-car GPU. Problem: a plain PointNet ran fast but was too coarse, it missed thin objects like poles and distant pedestrians because the global max-pool washed out local detail; a full point transformer was accurate but too slow at 120k points to hit the 100-millisecond budget. Decision: they adopted a PointNet++-style hierarchical backbone with sparse voxel features in the early layers (so neighborhoods were cheap to query) and reserved attention for a small final stage. Result: they hit the latency budget while recovering the thin-object accuracy the flat network lost, because the hierarchy reintroduced locality where it mattered. Lesson: the PointNet-to-transformer spectrum is a speed-accuracy dial; on real-time, large-cloud problems you place locality and attention surgically rather than everywhere.

Library Shortcut: PyTorch3D and Torch-Points3D Ship the Operators

The neighborhood queries, farthest-point sampling, and k-NN graph construction that PointNet++ and DGCNN need are easy to get wrong and slow in pure Python. PyTorch3D provides them as fused, differentiable, GPU operators:

from pytorch3d.ops import sample_farthest_points, knn_points, knn_gather
# The "set abstraction" core of PointNet++ as three fused GPU operators:
# sample well-spread centroids, find each centroid's k nearest neighbors,
# and gather those neighbor coordinates into local groups for a mini-PointNet.
# points: (B, N, 3). Sample 512 well-spread centroids, then grab each one's 32 neighbors.
centroids, idx = sample_farthest_points(points, K=512)        # coverage-maximizing sample
knn = knn_points(centroids, points, K=32)                     # 32 nearest neighbors each
neighbors = knn_gather(points, knn.idx)                       # (B, 512, 32, 3) local groups
# These three calls are the "set abstraction" core of PointNet++.

Code Fragment 2: The PointNet++ neighborhood machinery in three PyTorch3D calls. sample_farthest_points picks well-spread centroids, knn_points finds each centroid's nearest neighbors, and knn_gather collects them into local groups, replacing dozens of lines of index juggling and a hand-written k-NN CUDA kernel with tested, differentiable operators.

What would be dozens of lines of index juggling and a hand-written CUDA kernel for the k-NN becomes three tested calls. PyTorch3D (and the Torch-Points3D framework built on it) also bundles chamfer distance, batched point-cloud and mesh containers, and reference implementations of PointNet++, DGCNN, and KPConv, the standard starting point for any point-cloud project today.

Research Frontier: Point Cloud Foundation Models and 3D-Language (2024-2026)

Point-cloud learning is following the same foundation-model trajectory as 2D vision (Chapter 25). Point-MAE and Point-BERT brought masked-autoencoder pretraining to point clouds, learning self-supervised representations by reconstructing masked point patches. The 2024-2026 frontier connects 3D to language and images: ULIP and OpenShape align point-cloud encoders with the CLIP embedding space (the same multimodal space you will meet in Chapter 34), enabling zero-shot 3D classification from text prompts, and PointLLM attaches a point encoder to a large language model so you can ask questions about a 3D object in natural language. The Point Transformer V3 (2024) scaled point attention to outdoor driving scenes with serialization tricks that make attention linear, closing much of the gap between research accuracy and the real-time constraint of the driving example above.

You Could Build This: A Part Segmenter That Labels Chair Legs From Backs

With the segmentation branch of subsection 2 (concatenate the global descriptor back onto every per-point feature) you can build a real part segmenter on the standard ShapeNet-Part benchmark, the model that powers "select all the legs" tools in 3D editors and robotic grasp planners that need to know a mug's handle from its body. Start from the PointNetCls of Code Fragment 1, swap the classification head for a per-point head, train on a few ShapeNet categories, then upgrade the flat backbone to a PointNet++ set-abstraction stack using the three PyTorch3D operators from the library shortcut and watch the thin-part accuracy (chair-leg, airplane-engine boundaries) jump, the exact locality gain of subsection 3 made measurable. Difficulty: advanced. Time: about three to four hours including a short training run. It is interview-ready: reporting per-part mean intersection-over-union before and after adding locality demonstrates that you understand why the global max-pool of subsection 1 throws away the fine structure that PointNet++ recovers.

We have now learned to classify and segment explicit geometry. But every method so far stores the scene as concrete points, voxels, or triangles. The most influential idea of the last five years takes a radically different route: store the entire scene inside the weights of a small neural network and recover any view by querying it. That is neural radiance fields, the subject of Section 27.4.

Exercise 27.3.1: Prove the Pool Is Invariant Conceptual

Show formally that the function $g(\{x_1, \ldots, x_N\}) = \text{MAX}_i\, h(x_i)$ is invariant under any permutation $\pi$ of the points, that is, $g(\{x_{\pi(1)}, \ldots, x_{\pi(N)}\}) = g(\{x_1, \ldots, x_N\})$. Then explain in two or three sentences why replacing MAX with a plain concatenation of the $h(x_i)$ vectors would break the invariance, and why replacing it with a SUM or MEAN would preserve it. What practical property does MAX give (think about robustness to outliers and to changing the number of points) that SUM does not?

Exercise 27.3.2: Find the Critical Points Coding

Extend the PointNetCls of subsection 2 to also return, for each of the 1024 feature channels, the index of the point that won the max-pool. Run it on a point cloud of a simple shape (a sampled cube or sphere) and collect the unique set of winning points across all channels; these are PointNet's "critical points". Visualize them against the full cloud. Then randomly delete 80 percent of the non-critical points and confirm the predicted class is unchanged, demonstrating the robustness claimed in the Key Insight of subsection 1. Write one sentence on why this set is typically a sparse skeleton of the shape.

Exercise 27.3.3: The Locality Gap Analysis

PointNet has no intermediate receptive field between a single point and the entire cloud. Construct (on paper or in code) two point clouds that PointNet cannot distinguish but that have clearly different local structure, for example, the same set of points where one cloud has them clustered into two dense lumps and the other spreads them evenly, with both sharing the same per-channel maxima. Explain how PointNet++ would tell them apart, referencing its neighborhood sampling, and relate the locality gap to the receptive-field hierarchy of Chapter 19.