Summary
Training robust visual models from unlabeled data is a long-standing problem, as human-provided annotations are often costly, error-prone, and incomplete. Consequently, self-supervised learning (SSL) has become an attractive research direction for learning generic visual representations useful for a wide range of downstream tasks such as classification and semantic segmentation. SSL operates on the principle of pretext tasks: self-generated challenges that encourage models to learn from the data's inherent structure. Initially, these tasks were relatively simple, such as predicting the rotation of an image or restoring its original colors. These early experiments laid the groundwork for more sophisticated techniques that extract a deeper understanding of visual content through, e.g., contrastive learning or deep clustering.
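To make the idea of a pretext task concrete, here is a minimal, illustrative sketch of rotation prediction, one of the early tasks mentioned above: the model is trained to classify which of four rotations was applied to an unlabeled image. It assumes PyTorch/torchvision; the backbone, hyperparameters, and helper names are placeholders rather than any specific method presented at the workshop.

```python
# Minimal sketch of a rotation-prediction pretext task (illustrative only).
# Assumes PyTorch and torchvision; assumes square input images so all rotations
# keep the same spatial size. Backbone and hyperparameters are placeholders.
import torch
import torch.nn as nn
import torchvision.models as models

encoder = models.resnet18(weights=None)             # backbone to be pretrained
encoder.fc = nn.Linear(encoder.fc.in_features, 4)   # 4 classes: 0, 90, 180, 270 degrees

def rotate_batch(images):
    """Create rotated copies of each image and the matching rotation labels."""
    rotations, labels = [], []
    for k in range(4):                               # rotate by k * 90 degrees
        rotations.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotations), torch.cat(labels)

optimizer = torch.optim.SGD(encoder.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images):
    """One self-supervised step: predict which rotation was applied."""
    inputs, targets = rotate_batch(images)
    logits = encoder(inputs)
    loss = criterion(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

No labels are needed: the supervision signal (the rotation index) is generated from the data itself, and the pretrained encoder can then be reused for downstream tasks.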
Although these methods have already outperformed supervised representations in numerous downstream tasks, the field continues to advance at an unprecedented pace, introducing many new techniques. Among these directions are predictive architectures that remove the need for augmentations, masked image modeling, auto-regressive approaches, leveraging self-supervision signals in videos, and exploiting the representations of generative models.
With so many new techniques flooding the field, it is important to pause and discuss how we can make optimal use of self-supervised representations in applications, as well as what the remaining obstacles are and how they might be tackled. The workshop aims to give space to ask and discuss fundamental, longer-term questions with researchers leading this area. Key questions we aim to tackle include:
- What are the current bottlenecks in self-supervised learning?
- What is the role of SSL in the era of powerful image-text models?
- What can only or never be learned purely from self-supervision?
- What is the role of generative modeling for representation learning?
- Is SSL the new 'pre-pretraining' paradigm, allowing us to scale beyond paired image-text data?
- What biases emerge in SSL models, and what are the implications?
- What is the role of multi-modal learning for robustness and understanding?
- What is the fundamental role of data augmentation and synthetic data?
This is the third iteration of the SSL-WIN workshop. The workshop will be organized as a half-day event where a series of invited speakers will present their views on how the field needs to evolve in the coming years.
Invited Speakers
Schedule
The workshop is a half-day event consisting of a series of invited talks on recent developments in self-supervised learning from leading experts in academia and industry, along with a poster session highlighting recent papers in the field.
Time | Speaker | Talk Title | Slides |
---|---|---|---|
09:00 | Opening | | |
09:00 - 09:30 | Oriane Siméoni | From unsupervised object localization to open-vocabulary semantic segmentation | Slides |
09:30 - 10:00 | Ishan Misra | What world priors do generative visual models learn? | ArXiv |
10:00 - 10:30 | Xinlei Chen | Diffusion Models for Self-Supervised Learning: A Deconstructive Journey | Slides |
10:30 - 11:20 | Poster Session | | |
11:20 - 11:55 | Olivier J. Hénaff | Data curation is the next frontier of self-supervised learning | Slides |
11:55 - 12:30 | Yuki M. Asano | Vision Foundation Models (with academic compute) | Slides |
12:30 - 13:00 | Yutong Bai | Listening to the Data: Visual Learning from the Bottom Up | Slides |
Poster Session
List of accepted papers for presentation:
- Poster 1: Disentangling the Effects of Data Augmentation and Format Transform in Self-Supervised Learning of Image Representations - Presented by Philip Mansfield
- Poster 2: SelEx: Self-Expertise in Fine-Grained Generalized Category Discovery - Presented by Sarah Rastegar
- Poster 3: MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning - Presented by Vishal Nedungadi
- Poster 4: Deep Spectral Methods for Unsupervised Ultrasound Image Interpretation - Presented by Yordanka Velikova
- Poster 5: SASSL: Leveraging Neural Style Transfer for Improved Self-Supervised Learning - Presented by Philip Mansfield
- Poster 6: VTCD: Understanding Video Transformers via Universal Concept Discovery - Presented by Matthew Kowal
- Poster 7: Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders - Presented by Renaud Vandeghen
- Poster 8: ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders - Presented by Carlos Hinojosa
- Poster 9: PART: Self-supervised Pretraining with Pairwise Relative Translations - Presented by Melika Ayouhi
- Poster 10: SIGMA: Sinkhorn-Guided Masked Video Modeling - Presented by Mohammadreza Salehi
- Poster 11: UNIC: Universal Classification Models via Multi-teacher Distillation - Presented by Mert Bulent Sariyildiz
- Poster 12: FroSSL: Frobenius Norm Minimization for Efficient Multiview Self-Supervised Learning - Presented by Oscar Skean
- Poster 13: Self-supervised visual learning from interactions with objects - Presented by Arthur Aubret
- Poster 14: An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels - Presented by Xinlei Chen
- Poster 15: GeneralAD: Anomaly Detection Across Domains by Attending to Distorted Features - Presented by Luc Sträter