Description -
The idea of anomaly detection is well known throughout industry and the research community. Yet it remains one of the toughest problems to solve, particularly in the context of manufacturing defects, where defect variation is high and data points are scarce.
Disclaimer - No, this is not a ChatGPT-generated post. I truly want to learn and understand things. However, I have used ChatGPT to comprehend some of the equations and graphs presented in the paper.
CHECK OUT THE PAPER - Arxiv Link
Technical abbreviations -
- AC - Anomaly Classification
- AS - Anomaly Segmentation
- AUROC - Area Under the Receiver Operating Characteristic
- AUPR - Area Under the Precision-Recall curve
- CPE - Compositional Prompt Ensemble
- CLS - Class Token
- CNN - Convolutional Neural Network
- CVPR - Conference on Computer Vision and Pattern Recognition
- FP - Feature Map before Pooling
- FW - Feature Map from WinCLIP
- LAION-400M - Dataset of 400 million image-text pairs used for CLIP pre-training
- MVTec-AD - MVTec Anomaly Detection Dataset
- PRO - Per-Region Overlap Score
- ViT - Vision Transformer
- WinCLIP - Window-based CLIP for anomaly segmentation
Problem statement
- In factories, the quality-inspection job of finding defects can be long and tedious, particularly because the wide range of variations makes it difficult to specify what an anomaly is.
- Lack of scalable visual inspection systems.
Paper discussion
Introduction
The paper deals with anomaly segmentation and classification in zero-/few-shot settings using the CLIP vision-language model. Vision-language models have shown promise in zero-shot classification tasks.
- “Normal” and “anomalous” are context dependent. For example, a hole in a dress may be desirable, depending on the fashion choice.
- Language provides information on the particular type of defect and the context needed to define it.
- With pretrained CLIP as the base model, the paper verifies the hypothesis that language aids zero-/few-shot classification and segmentation.
- The paper mentions a challenge: “CLIP is trained to enforce cross-modal alignment only on the global embeddings of image and text”. It means that CLIP processes whole images and corresponding text descriptions to learn a general or “global” representation of each.
- Anomaly segmentation needs pixel-level predictions, for which the paper proposes WinCLIP, which operates at multiple scales.
- Anomaly classification is related to state classification that predicts if an object is normal or anomalous.
WinCLIP & WinCLIP+
WinCLIP : an effective window-based CLIP for efficient zero-shot anomaly segmentation.
WinCLIP+ : additionally benefits from a few normal reference images, on top of the context provided by language-guided prediction.
Anomaly Classification (AC)
The paper introduces a binary zero-shot anomaly classification framework, CLIP-AC.
- Two class prompts, where $o$ is the object-level label, for example “bottle”:
- normal [ o ]
- anomalous [ o ]
- One-Class Design : the model uses a text prompt representing the normal state of an object in the image to compute similarity scores.
- Two-Class Design : this easily outperforms the one-class design, which makes sense because there is more information to compare against.
- CLIP, pretrained on a large web dataset, provides a powerful representation with good alignment between text and images. Still, a specific definition of the anomaly is necessary for good performance.
- For this the paper uses the “Compositional Prompt Ensemble (CPE)”: basically, multiple descriptions that encapsulate different states of an object (e.g., normal, damaged), composed to better capture the various possible anomalies.
- The prompts for CPE can include lists of state words for all objects and/or specific states for specific objects. This allows the model to better understand and classify images based on the context provided by the prompts.
- Compositional Prompt Ensemble - CPE is different from CLIP’s prompt ensemble, which does not explain the object label (e.g., “cat”) and only augments templates selected by trial and error for object classification, including ones unsuitable for anomaly tasks, e.g., “a cartoon [c]”.
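To make CPE concrete, here is a minimal sketch of how the composed prompts and the two-class CLIP-AC score could be computed. The state lists, templates, and the `encode_text` callable are placeholders I made up to stand in for CLIP's text encoder; they are not the paper's exact lists:

```python
import numpy as np

# Illustrative state lists and templates -- placeholders, not the paper's exact lists
NORMAL_STATES = ["{}", "flawless {}", "perfect {}", "{} without defect"]
ANOMALOUS_STATES = ["damaged {}", "broken {}", "{} with defect", "{} with a flaw"]
TEMPLATES = ["a photo of a {}.", "a cropped photo of the {}.", "a close-up photo of a {}."]

def compose_prompts(states, obj):
    """Compositional Prompt Ensemble: combine every state phrase with every template."""
    return [t.format(s.format(obj)) for s in states for t in TEMPLATES]

def ensemble_text_embedding(prompts, encode_text):
    """Average the unit-normalized text embeddings over the prompt ensemble."""
    embs = np.stack([encode_text(p) for p in prompts])  # (n_prompts, d)
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)

def clip_ac_score(image_emb, t_normal, t_anomalous, temperature=0.01):
    """Two-class design: softmax over the two cosine similarities; return P(anomalous)."""
    logits = np.array([image_emb @ t_normal, image_emb @ t_anomalous]) / temperature
    probs = np.exp(logits - logits.max())
    return (probs / probs.sum())[1]
```

Usage would look like `t_normal = ensemble_text_embedding(compose_prompts(NORMAL_STATES, "bottle"), encode_text)`, mirroring the “normal [o]” / “anomalous [o]” prompt pair above.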
WinCLIP for zero-shot AS
Window-based CLIP (WinCLIP) for zero-shot anomaly segmentation predicts pixel-level anomalies. WinCLIP extracts dense visual features with good language alignment and local details for a query image $x$, and then applies the zero-shot anomaly score $a_0$ spatially to obtain the anomaly segmentation map.
- Creating Sliding Windows: Imagine dividing the image into overlapping square areas (windows); see the code sketch after this list.
- Extracting Features: For each window, WinCLIP uses its image understanding capabilities (the encoder $f$) to extract important features. This is like summarizing what’s important or notable in each window, which might include shapes, textures, or patterns.
- The idea of harmonic aggregation of windows: each local window in an image is assigned an anomaly score, which is the similarity between the window’s feature and the text embeddings from the compositional prompt ensemble.
- Each pixel in an image gets an anomaly score from various overlapping windows. A lower score (close to zero) suggests that the pixel is likely to be normal (not anomalous), while a higher score suggests an anomaly.
- Harmonic averaging is particularly effective because it gives more importance to lower anomaly scores, which are closer to zero and indicate normality, helping to refine the segmentation result.
In simpler terms, harmonic averaging helps ensure that a few high anomaly scores don’t overshadow many low scores, which indicate that a pixel is normal, thus improving the accuracy of identifying truly anomalous areas in the image.
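A small sketch of the windowing and harmonic aggregation steps, assuming anomaly scores in (0, 1] over a grid of patch tokens; the grid size, window sizes, and helper names are my own choices for illustration:

```python
import numpy as np

def sliding_window_masks(h, w, k, stride=1):
    """Binary masks for all overlapping k x k windows over an h x w token grid."""
    masks = []
    for i in range(0, h - k + 1, stride):
        for j in range(0, w - k + 1, stride):
            m = np.zeros((h, w), dtype=bool)
            m[i:i + k, j:j + k] = True
            masks.append(m)
    return np.stack(masks)  # (num_windows, h, w)

def harmonic_aggregate(masks, window_scores, eps=1e-8):
    """Per-pixel harmonic mean of the scores of all windows covering that pixel.

    The harmonic mean is dominated by low (normal) scores, so a few high
    scores cannot overshadow many low ones.
    """
    inv = (1.0 / (window_scores + eps))[:, None, None] * masks  # (n, h, w)
    coverage = masks.sum(axis=0)                                # windows per pixel
    return coverage / (inv.sum(axis=0) + eps)

# e.g., small-scale (2x2) windows on a 15x15 token grid
masks = sliding_window_masks(15, 15, k=2)
scores = np.random.uniform(0.01, 1.0, size=len(masks))  # stand-in window scores
anomaly_map = harmonic_aggregate(masks, scores)         # (15, 15)
```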
WinCLIP+ with few-normal-shots
In many cases, the zero-shot method does not really work because what is normal versus anomalous is context dependent. For example, the metal-nut category has an anomaly type labeled “flipped upside-down”, which can only be identified relative to a normal image.
WinCLIP+ is an extension of WinCLIP that produces better and more precise anomaly detections by incorporating normal reference images.
Reference association: this component of WinCLIP+ uses the reference normal images to generate memory features. These are then used to enhance anomaly detection by comparing query images against the memories, looking for deviations that might indicate anomalies.
Three sets of reference memories, denoted $R_s$, $R_m$, and $R_p$, are introduced as small-scale, mid-scale, and penultimate features.
- These are used to help the model differentiate between normal and anomalous patterns
For anomaly segmentation, the model averages the multi-scale predictions from the three types of reference memories, producing a combined prediction that takes into account information from all scales.
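Here is my reading of the reference association as a sketch: patch features from the few normal images form a memory bank, and each query patch is scored by its distance to the closest memory entry. The shapes and the cosine-distance form are assumptions on my part:

```python
import numpy as np

def build_memory(reference_feats):
    """Stack patch features from the normal reference images into a memory bank.

    reference_feats: list of (h, w, d) arrays; returns (num_refs * h * w, d),
    row-normalized so that dot products are cosine similarities.
    """
    feats = np.concatenate([f.reshape(-1, f.shape[-1]) for f in reference_feats])
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

def reference_association(query_feats, memory):
    """Score each query location by its distance to the nearest memory feature.

    query_feats: (h, w, d) unit-normalized patch features of the query image.
    Score per location: min over memory of (1 - cosine similarity) / 2,
    so patches that closely match a normal reference score near 0.
    """
    h, w, d = query_feats.shape
    sims = query_feats.reshape(-1, d) @ memory.T  # (h*w, |R|)
    return ((1.0 - sims.max(axis=1)) / 2.0).reshape(h, w)

# One such map is computed per memory scale (R_s, R_m, R_p);
# segmentation averages the three multi-scale maps.
```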
To perform anomaly classification, we combine the maximum value of M_w and the WinCLIP zero-shot classification score.
- The maximum value of the M_w score captures the spatial evidence from the few-shot references.
- The other score is the CLIP knowledge retrieved via language.
The overall anomaly score is given by
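Reconstructing from the two components described above (this is my reading of the paper's combination, with $a_0(x)$ as the language-based zero-shot score):

$$a_{\text{W+}}(x) = \frac{1}{2}\Big(a_0(x) + \max_{i,j}\big[M_w(x)\big]_{ij}\Big)$$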
Experiments
- Dataset : MVTec-AD, VisA
- Evaluation metrics :
- Classification :
- AUROC
- AUPR
- F1-score at optimal threshold
- Segmentation :
- pAUROC
- PRO
- F1-score at optimal threshold, pixel-wise
Zero-/few-shot anomaly classification
The paper compares zero-shot and few-normal-shot anomaly classification results with prior works on the MVTec-AD and VisA benchmarks.
Zero Shot
- Model : CLIP-AC
- labels : {“normal [o]”, “anomalous [o]”}
- with the prompt ensemble
Few-normal-shot
- WinCLIP+ outperforms prior works.
Zero-/few-shot anomaly segmentation
- No prior works on zero-shot anomaly segmentation.
- WinCLIP+ outperforms prior works in many cases.
Conclusion
- Introduction of a new framework that combines textual descriptions and image data to more accurately identify anomalies.
- Superior performance of WinCLIP and WinCLIP+ models in anomaly detection tasks using minimal training samples.
- Potential future improvements through pre-training on industry-specific data.