Self Supervised Learning: What is Next?

Summary

Training robust visual models from unlabeled data is a long-standing problem, as human-provided annotations are often costly, error-prone, and incomplete. Consequently, self-supervised learning (SSL) has become an attractive research direction for learning generic visual representations that are useful for a wide range of downstream tasks, such as classification and semantic segmentation. SSL operates on the principle of pretext tasks: self-generated challenges that encourage models to learn from the data's inherent structure. Initially, these tasks were relatively simple, such as predicting the rotation of an image or restoring its original colors. These early experiments laid the groundwork for more sophisticated techniques that extract a deeper understanding of visual content through, e.g., contrastive learning or deep clustering.

Although these methods have already outperformed supervised representations in numerous downstream tasks, the field continues to advance at an unprecedented pace, introducing many new techniques. Current directions include predictive architectures that remove the need for augmentation, masked image modeling, auto-regressive approaches, the use of self-supervision signals in videos, and the exploitation of representations learned by generative models.

With so many new techniques flooding the field, it is important to pause and discuss how we can make optimal use of self-supervised representations in applications, as well as which obstacles remain and how they might be tackled. The workshop aims to provide a space to raise and discuss fundamental, longer-term questions with researchers leading this area. Key questions we aim to tackle include:

What are the current bottlenecks in self-supervised learning?
What is the role of SSL in the era of powerful image-text models?
What can only or never be learned purely from self-supervision?
What is the role of generative modeling for representation learning?
Is SSL the new 'pre-pretraining' paradigm, allowing us to scale beyond coupled image-text data?
What biases emerge in SSL models, and what are the implications?
What is the role of multi-modal learning for robustness and understanding?
What is the fundamental role of data augmentation and synthetic data?

This is the third iteration of the SSL-WIN workshop. The workshop will be organized as a half-day event where a series of invited speakers will present their views on how the field needs to evolve in the coming years.

Invited Speakers

Yuki M. Asano
University of Amsterdam

Yutong Bai
UC Berkeley

Xinlei Chen
Meta FAIR

Olivier J. Hénaff
Google DeepMind

Ishan Misra
Meta GenAI

Oriane Siméoni
Meta FAIR (previously valeo.ai)

Schedule

The workshop is a half-day event consisting of a series of invited talks on recent developments in self-supervised learning from leading experts in academia and industry, along with a poster session highlighting recent papers in the field.

Time	Speaker	Talk Title	Slides
09:00		Opening
09:00 - 09:30	Oriane Siméoni	From unsupervised object localization to open-vocabulary semantic segmentation	Slides
09:30 - 10:00	Ishan Misra	What world priors do generative visual models learn?	ArXiv
10:00 - 10:30	Xinlei Chen	Diffusion Models for Self-Supervised Learning: A Deconstructive Journey	Slides
10:30 - 11:20		Poster Session
11:20 - 11:55	Olivier J. Hénaff	Data curation is the next frontier of self-supervised learning	Slides
11:55 - 12:30	Yuki M. Asano	Vision Foundation Models (with academic compute)	Slides
12:30 - 13:00	Yutong Bai	Listening to the Data: Visual Learning from the Bottom Up	Slides