Summary
Training robust visual models from unlabeled data is a long-standing problem, as human-provided annotations are often costly, error-prone, and incomplete. Consequently, self-supervised learning (SSL) has become an attractive research direction for learning generic visual representations useful for a wide range of downstream tasks such as classification and semantic segmentation. SSL operates on the principle of pretext tasks: self-generated challenges that encourage models to learn from the data's inherent structure. Initially, these tasks were relatively simple, such as predicting the rotation of an image or restoring its original colors. These early experiments laid the groundwork for more sophisticated techniques that extract a deeper understanding of visual content through, e.g., contrastive learning or deep clustering.
Although these methods have already outperformed supervised representations in numerous downstream tasks, the field continues to advance at an unprecedented pace, introducing many new techniques. Current directions include predictive architectures that remove the need for augmentation, masked image modeling, auto-regressive approaches, leveraging the self-supervision signals in videos, and exploiting the representations of generative models.
With so many new techniques flooding the field, it is important to pause and discuss how we can make optimal use of self-supervised representations in applications, what obstacles remain, and how they might be tackled. The workshop aims to provide space to ask and discuss fundamental, longer-term questions with researchers leading this area. Key questions we aim to tackle include:
- What are the current bottlenecks in self-supervised learning?
- What is the role of SSL in the era of powerful image-text models?
- What can only or never be learned purely from self-supervision?
- What is the role of generative modeling for representation learning?
- Is SSL the new 'pre-pretraining' paradigm, allowing us to scale beyond coupled image-text data?
- What biases emerge in SSL models, and what are the implications?
- What is the role of multi-modal learning for robustness and understanding?
- What is the fundamental role of data augmentation and synthetic data?
This is the third iteration of the SSL-WIN workshop. The workshop will be organized as a half-day event where a series of invited speakers will present their views on how the field needs to evolve in the coming years.
Invited Speakers
Poster Session
List of accepted papers for presentation:
- Disentangling the Effects of Data Augmentation and Format Transform in Self-Supervised Learning of Image Representations - Presented by Neha Kalibhat
- SASSL: Leveraging Neural Style Transfer for Improved Self-Supervised Learning - Presented by Renan A. Rojas-Gomez
- VTCD: Understanding Video Transformers via Universal Concept Discovery - Presented by Matthew Kowal
- Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders - Presented by Renaud Vandeghen
- Deep Spectral Methods for Unsupervised Ultrasound Image Interpretation - Presented by Yordanka Velikova
- MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning - Presented by Vishal Nedungadi
- ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders - Presented by Carlos Hinojosa
- PART: Self-supervised Pretraining with Pairwise Relative Translations - Presented by Melika Ayouhi
- SIGMA: Sinkhorn-Guided Masked Video Modeling - Presented by Mohammadreza Salehi
- UNIC: Universal Classification Models via Multi-teacher Distillation - Presented by Mert Bulent Sariyildiz
- SelEx: Self-Expertise in Fine-Grained Generalized Category Discovery - Presented by Sarah Rastegar
Schedule
The workshop is a half-day event consisting of a poster session and a series of invited talks on recent developments in self-supervised learning by leading experts from academia and industry. The exact schedule will be announced in the coming weeks.