The past two years have seen major advances in self-supervised learning, with many new methods reaching astounding performance on standard benchmarks. Moreover, many recent works, often inspired by NLP, have shown the large potential of coupled data sources such as image-text pairs for producing even stronger models capable of zero-shot tasks. We have just witnessed a jump from the "default" single-modal pretraining with CNNs to Transformer-based multi-modal training, and these early developments will surely mature in the coming months. Nevertheless, major challenges remain unresolved, and it is not clear what the next step-change is going to be. In this workshop we want to highlight and discuss seeds of potential research directions, from radically new self-supervision tasks, data sources and paradigms to surprising counter-intuitive results. Through invited speakers and paper oral talks, our goal is to provide a forum for exchanging ideas in which both the leaders in this field and the new, younger generation can contribute equally to discussing its future.
A major goal of unsupervised learning in computer vision is to learn general data representations without labels. To this end, countless pretext tasks, such as image colorization and, more recently, contrastive learning and teacher-student approaches, have been proposed to train neural networks for feature extraction. While these methods are rapidly improving in performance and have surpassed supervised representations on many downstream tasks, many challenges remain and the "next big step" is not apparent.
As the methods mature, the field is now at the point where we have to start discussing how to make optimal use of self-supervised representations in applications, as well as which obstacles remain and how they might be tackled. The workshop aims to give space to ask and discuss fundamental, longer-term questions with researchers who are leading this area. Key questions we aim to tackle include:
- What are the current bottlenecks in self-supervised learning?
- What is the future role of weak supervision, like image-text and video+ASR?
- What can never be learned purely from self-supervision?
- How many modalities do we need for true robustness and understanding?
- What is the fundamental role of data augmentation?
- Which kinds of bias may be captured in the resulting models, and what does this imply?
This is the second iteration of the SSL-WIN workshop. The workshop will be organized as a hybrid full-day event in which invited speakers and the authors of submitted papers selected for oral talks will present their views on how the field needs to evolve in the coming years.