Training robust visual models from unlabeled data is a long-standing problem, as human-provided annotations are often costly, error-prone, and incomplete. Consequently, self-supervised learning (SSL) has become an attractive research direction for learning generic visual representations that are useful for a wide range of downstream tasks, such as classification and semantic segmentation. SSL operates on the principle of pretext tasks: self-generated challenges that encourage models to learn from the data's inherent structure. Initially, these tasks were relatively simple, such as predicting the rotation of an image or restoring its original colors. These early experiments laid the groundwork for more sophisticated techniques that extract a deeper understanding of visual content through, e.g., contrastive learning or deep clustering.

Although these methods have already outperformed supervised representations in numerous downstream tasks, the field continues to advance at an unprecedented pace, introducing many new techniques. Recent directions include predictive architectures that remove the need for augmentation, masked image modeling, auto-regressive approaches, self-supervision signals in videos, and the representations of generative models.

With so many new techniques flooding the field, it is important to pause and discuss how we can make optimal use of self-supervised representations in applications, as well as which obstacles remain and how they might be tackled. The workshop aims to give space to ask and discuss fundamental, longer-term questions with researchers leading this area. Key questions we aim to tackle include:

  • What are the current bottlenecks in self-supervised learning?
  • What is the role of SSL in the era of powerful image-text models?
  • What can be learned only, or never, from self-supervision alone?
  • What is the role of generative modeling for representation learning?
  • Is SSL the new 'pre-pretraining' paradigm, allowing us to scale beyond coupled image-text data?
  • What biases emerge in SSL models, and what are the implications?
  • What is the role of multi-modal learning for robustness and understanding?
  • What is the fundamental role of data augmentation and synthetic data?

This is the third iteration of the SSL-WIN workshop. The workshop will be organized as a half-day event where a series of invited speakers will present their views on how the field needs to evolve in the coming years.


The program will consist of a series of invited talks on recent developments in self-supervised learning from leading experts in academia and industry. This year's workshop will not include a call for papers.




* Tentatively confirmed speakers