Description
We have seen CLIP used in 2D scenes, for example for zero-/few-shot anomaly detection. This paper is oriented toward the 3D setting, focusing on transferring 2D CLIP knowledge to 3D point clouds.
Disclaimer - No, this is not a ChatGPT-generated post. I truly want to learn and understand things. However, I have used ChatGPT to comprehend some equations and graphs presented in the paper.
CHECK OUT THE PAPER - Arxiv Link
Technical abbreviations -
- CLIP - Contrastive Language-Image Pre-training
- SPVCNN - Sparse Point-Voxel Convolutional Neural Network
- StCR - Spatial-temporal Consistency Regularization
- SCR - Semantic Consistency Regularization
- KL - Kullback–Leibler divergence
- S3 - Switchable Self-Training Strategy
- ST - Self-Training
- RGB-D - Red, Green, Blue-Depth
- InfoNCE - Information Noise-Contrastive Estimation
- mIoU - Mean Intersection over Union
- PPKT - Pixel-to-Point Knowledge Transfer
- SGD - Stochastic Gradient Descent
Paper discussion
Introduction
Current deep learning methods have shown promising performance on 3D point cloud data, BUT they come with major bottlenecks: 1) they require large-scale data collection and annotation; 2) they fail to recognize novel objects, which leads to even more annotation.
According to the paper, Contrastive Vision-Language Pre-training (CLIP) provides a new perspective that mitigates the above issues in 2D vision.
The paper cites some crucial prior work on CLIP -
- MaskCLIP : explores semantic segmentation based on CLIP, studying how CLIP can help 2D dense prediction tasks, and exhibits encouraging zero-shot semantic segmentation performance.
- PointCLIP : shows that the zero-shot classification ability of CLIP can be generalized from 2D images to 3D point clouds, with impressive performance on zero-shot and few-shot classification tasks.
THIS WORK explores how the rich semantic and visual knowledge in CLIP can benefit 3D semantic segmentation tasks; CLIP-based 3D scene understanding is still under-explored.
So the big question is
- “Can CLIP help a 3D network understand scenes without needing annotations?”
- “Will the performance of the network improve if it is fine-tuned with labeled data?”
The paper “CLIP2Scene” is a pioneer in exploring CLIP for 3D scene understanding.
CLIP2Scene
Since CLIP is pre-trained on 2D images and text, the first concern is the domain gap between 2D images and the 3D point cloud. To counter it, the paper builds pixel-point correspondences and transfers knowledge from 2D pixels to 3D points: basically, domain-gap bridging.
Some previously researched methods suffer from “optimization conflicts”. This basically means: if different parts of an image and a point cloud represent the same object (and thus share semantic meaning), they should ideally be considered similar. However, because they sit at different positions (pixel x_i and point p_j), the InfoNCE loss incorrectly pushes them apart, harming the overall learning process.
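To make the conflict concrete, here is a minimal sketch (mine, not the paper’s code) of a plain pixel-to-point InfoNCE loss in PyTorch: only the exactly matched pair (x_i, p_i) counts as a positive, so two features of the same class at different positions are still pushed apart.

```python
import torch
import torch.nn.functional as F

def pixel_point_infonce(pixel_feats, point_feats, temperature=0.07):
    """Vanilla pixel-to-point InfoNCE (a sketch, not the paper's implementation).

    pixel_feats, point_feats: (N, C) features for N matched pixel-point pairs.
    Pair i is the only positive for row i; every other point is a negative,
    even when it belongs to the same semantic class.
    """
    pixel_feats = F.normalize(pixel_feats, dim=1)
    point_feats = F.normalize(point_feats, dim=1)
    logits = pixel_feats @ point_feats.t() / temperature  # (N, N) similarity matrix
    targets = torch.arange(pixel_feats.size(0))           # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

# If rows 0 and 1 are both "road", row 0 still treats point 1 as a negative:
# that is the optimization conflict described above.
```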
So the paper introduces Semantic Consistency Regularization, which leverages CLIP’s semantic information. Instead of relying on the raw pixel-point mapping, this method generates pixel-text pairs {x_i, t_i} using MaskCLIP, transfers them from pixel-text pairs to point-text pairs, and uses the text semantics to select the positive and negative point samples.
This approach ensures that points are compared based on their semantic similarity rather than just spatial proximity.
Note that the pixel-text mappings are freely available from CLIP without any additional training.
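A minimal sketch of how that selection could look, assuming each point has already inherited a class index from its pixel-text pair (function and variable names are mine, not the paper’s):

```python
import torch
import torch.nn.functional as F

def semantic_consistency_loss(point_feats, text_embeds, point_labels, tau=0.5):
    """Sketch of a point-text semantic consistency objective (illustrative only).

    point_feats : (N, C) point features from the 3D network.
    text_embeds : (K, C) CLIP text embeddings, one per class name.
    point_labels: (N,)  class index each point inherited from its pixel-text pair.
    """
    point_feats = F.normalize(point_feats, dim=1)
    text_embeds = F.normalize(text_embeds, dim=1)
    logits = point_feats @ text_embeds.t() / tau  # (N, K) point-to-text similarities
    return F.cross_entropy(logits, point_labels)  # positive = text of the point's class
```

Because the positive is the class’s text embedding rather than a single spatial correspondence, points of the same class are never treated as negatives of each other.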
Semantic-guided Spatial-temporal Consistency Regularization :
Semantic information helps ensure that the regularization is not just based on geometric or appearance data but also on the meaning of the objects and scenes being observed.
Semantics (meaningful information) assigned to a pixel may be noisy, i.e., the pixel might not perfectly represent the actual object due to factors like lighting, occlusion, or camera angle.
Additionally, the mapping from pixel to point might not always be accurate due to errors in sensor calibration, differences in perspective, or inherent limitations of the sensors.
So the fancy and a bit intimidating term being used is “Semantic-guided Spatial-Temporal Consistency Regularization”. Let us explore what it means in detail :
- Semantic Guidance : uses semantic information (e.g., object labels or features derived from text descriptions) to guide the association between pixels and points.
- Spatial-Temporal Constraints : considers the consistency of these associations over time and space. For example, if a car moves across a series of frames in a video, the points representing that car in the 3D point cloud should have a consistent relationship with pixels.
Grid-wise Correspondence and Feature Fusion :
The first step defines a grid-based structure over the image and the corresponding point cloud: the space is divided into regular grids, so features are handled in manageable segments and local spatial-temporal relationships are maintained.
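A rough sketch of that grid bookkeeping, assuming points from all sweeps (and the back-projected pixel features) are already registered in one coordinate frame and the grid size is a free hyperparameter (names are mine):

```python
import numpy as np
from collections import defaultdict

def group_by_grid(points, grid_size=1.0):
    """Assign each 3D point to a regular grid cell (illustrative sketch).

    points: (N, 3) xyz coordinates in a common frame, possibly from several sweeps.
    Returns {cell_index: [point indices]} so that pixel/point features landing in
    the same cell can later be fused and regularized together.
    """
    cells = np.floor(points / grid_size).astype(np.int64)  # integer cell coordinates
    buckets = defaultdict(list)
    for idx, cell in enumerate(map(tuple, cells)):
        buckets[cell].append(idx)
    return buckets

# Points from sweep t and sweep t+1 that fall into the same cell end up in the same
# bucket, which is what makes the consistency "spatial-temporal".
```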
Semantic-guided Fusion Feature Generation :
For each grid, the features from the image and the point cloud are combined to form a fusion feature.
Attention Weights :
The attention weights adjust the influence of each feature based on its alignment with a semantic template (t_{i,k}).
Here, D is a function measuring the compatibility (or distance) between features, and λ is a temperature parameter that controls the sharpness of the distribution.
To enforce consistency within and across grids, a loss function (L_SSR) is applied, which minimizes the difference between the fused features (f_n) and the ideal semantic representation.
Loss Function (L_SSR) : this might look like a sum of squared differences or another metric measuring the discrepancy between the fused features and a target.
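Putting the fusion, attention weights, and L_SSR together for one grid cell, here is a hedged sketch: cosine similarity stands in for D and a squared-error penalty stands in for the paper’s exact loss, so treat the math as illustrative rather than as the paper’s formulation.

```python
import torch
import torch.nn.functional as F

def fuse_and_regularize(cell_feats, text_template, lam=1.0):
    """Semantic-guided fusion for one grid cell (placeholder math, not the paper's).

    cell_feats   : (M, C) pixel + point features that fall into the same grid cell.
    text_template: (C,)  CLIP text embedding acting as the semantic template t.
    lam          : temperature controlling the sharpness of the attention weights.
    """
    feats = F.normalize(cell_feats, dim=1)
    template = F.normalize(text_template, dim=0)

    compat = feats @ template                       # D(feature, template) as cosine similarity
    attn = torch.softmax(compat / lam, dim=0)       # attention weights over the cell

    fused = (attn.unsqueeze(1) * feats).sum(dim=0)  # fusion feature for the cell

    # L_SSR-style penalty: pull every feature in the cell toward the fused target.
    loss = ((feats - fused.unsqueeze(0)) ** 2).sum(dim=1).mean()
    return fused, loss
```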
Experiments
- Dataset : SemanticKITTI, nuScenes, ScanNet
- Models :
- MaskCLIP : modifies the attention pooling layer of CLIP’s image encoder to extract dense pixel-text correspondences.
- SPVCNN : the 3D network that produces the point-wise features
- Epochs : 20
- Optimizer : SGD with a cosine scheduler (a minimal setup sketch follows this list).
- Temperatures λ & τ : 1, 0.5
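That recipe maps onto a standard PyTorch setup; a sketch under assumed values for the learning rate, momentum, and weight decay, which the post does not specify:

```python
import torch

# Stand-in model; in the paper the 3D backbone is SPVCNN, not a linear layer.
model = torch.nn.Linear(96, 16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,       # lr / momentum / weight decay
                            momentum=0.9, weight_decay=1e-4)  # are assumed values
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)

for epoch in range(20):
    # ... one epoch over pixel-point / point-text pairs would go here ...
    optimizer.step()   # placeholder: a real loop computes a loss and backpropagates first
    scheduler.step()   # cosine decay of the learning rate, once per epoch
```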
Annotation-free Semantic Segmentation
The paper approaches this in the following way :
They used the nuScenes and ScanNet datasets to test the network’s ability to segment scenes without annotations. MaskCLIP is used to turn textual prompts into text embeddings; these embeddings are averaged and used to classify each point in 3D space.
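A sketch of that classification step, assuming the CLIP text embeddings for several prompt templates per class are already computed (shapes and names are mine):

```python
import torch
import torch.nn.functional as F

def classify_points(point_feats, prompt_embeds):
    """Annotation-free point labelling sketch.

    point_feats  : (N, C) features from the pre-trained 3D network.
    prompt_embeds: (K, P, C) CLIP text embeddings, P prompt templates per class.
    """
    class_embeds = F.normalize(prompt_embeds.mean(dim=1), dim=1)  # average prompts per class
    point_feats = F.normalize(point_feats, dim=1)
    sims = point_feats @ class_embeds.t()                         # (N, K) similarities
    return sims.argmax(dim=1)                                     # predicted class per point
```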
Baseline: Initially, the network used a single sweep of point cloud data and was trained using semantic consistency regularization, which aligns the 3D point features with CLIP’s text features.
Different Prompts: They experimented with text prompts from different datasets (nuScenes, SemanticKITTI, Cityscapes) to see if the network could still recognize objects from the nuScenes dataset.
Regularization Techniques: They tested the network’s performance with and without spatial-temporal consistency regularization and semantic consistency regularization. They also tried using a Kullback–Leibler divergence loss, which did not perform well due to noise and calibration errors in the data.
Training Strategies: They evaluated the effect of a Switchable Self-Training Strategy, where the network’s supervision signal was altered during training to improve learning.
Sweep Numbers: They checked how varying the number of sweeps (sequential scans of point clouds) from 1 to 5 affected performance, finding that three sweeps offered a good balance between performance and computational cost.
Annotation-efficient Semantic Segmentation
The pre-trained network is capable of improving its performance even when only a small amount of labeled data is available. This capability is tested against other methods like SLidR and PPKT, which also aim to enhance learning from minimal data using different techniques.
- PPKT : a method that uses a contrastive loss to learn from 2D images and transfer this knowledge to 3D point clouds.
- SLidR : a method designed for LiDAR data in autonomous driving that uses superpixels (clusters of pixels based on similarity) to improve the learning process.
The results show that their method outperforms these existing techniques significantly on the nuScenes dataset, both when very little data (1%) and a full dataset (100%) are used for fine-tuning.