Summary

The year 2020 has seen major advances in self-supervised representation learning, with many new methods reaching high performance on standard benchmarks. With better losses and augmentation methods, this trend will surely continue to advance the field incrementally. However, major challenges remain unresolved, and it is not clear what the next step-change will be. In this half-day workshop we want to highlight, and provide a forum to discuss, potential research directions, from radically new self-supervision tasks to downstream self-supervised and semi-supervised learning approaches.

As the methods mature, the field is now at the point where we have to start discussing how to make optimal use of self-supervised representations in applications, as well as what the remaining obstacles are and how they might be tackled. The workshop aims to give space to ask and discuss fundamental, longer-term questions with researchers who are leading this area. Key questions we aim to tackle include:

  • What can be learned by the current generation of self-supervised techniques? And what, instead, still requires manual supervision?
  • How can we make optimal use of self-supervised learning?
  • Is combining tasks the new way forward?
  • Are images enough? Video and multi-modal data are becoming popular sources of self-supervision.
  • Why do contrastive losses work well and do they scale?
  • How do we evaluate the quality of the learned representations?
  • How can we move to meaningful down-stream tasks that benefit from feature learning?
  • Why do methods such as clustering or contrastive losses work better than those built on interpretable, image-specific tasks such as colorization or jigsaw puzzles?
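To make the "contrastive losses" referenced in the questions above concrete, here is a minimal NumPy sketch of an InfoNCE-style contrastive loss of the kind used by methods such as SimCLR. The function name, batch size, and temperature are illustrative choices for this sketch, not taken from any particular paper:

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE-style contrastive loss: embedding i in z1 should match
    its counterpart i in z2 ('positive') against all others ('negatives')."""
    # L2-normalize so the dot product is cosine similarity.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal: view i of image i matches view i'.
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = info_nce_loss(z, z)                        # two identical 'views'
mismatched = info_nce_loss(z, rng.normal(size=(8, 16)))
print(aligned < mismatched)                          # aligned views score better
```

The loss is low only when each embedding is closer to its own positive than to every negative in the batch, which is why batch composition and the temperature matter so much in practice.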

Videos

Welcome playlist
Aäron van den Oord

TBD

recording
Paolo Favaro

Unsupervised representation learning is becoming a practical and effective approach to avoid massive labeling of data. Recent methods based on self-supervised learning have shown remarkable progress and are now able to build features that are competitive with features built through supervised learning. However, it is still unclear why some methods perform better than others. I will give an overview of methods that have been proposed in the literature and provide some analysis to try to understand what factors might be important in the design of the next generation of self-supervised learning methods.

recording
slides
Carl Doersch

When encountering novelty, like new tasks and new domains, current visual representations struggle to transfer knowledge if trained on standard tasks like ImageNet classification. This talk explores how to build representations which better capture the visual world, and transfer better to new tasks. I'll first discuss Bootstrap Your Own Latent (BYOL), a self-supervised representation learning algorithm based on the 'contrastive' method SimCLR. BYOL outperforms its baseline without 'contrasting' its predictions with any 'negative' data; I'll provide a new perspective on why this avoids a collapse to a trivial solution. Second, I'll present CrossTransformers, which achieves state-of-the-art few-shot fine-grained recognition on Meta-Dataset, via a self-supervised representation that's aware of spatial correspondence.
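The "no negatives" aspect of BYOL described above can be sketched in a few lines. This is a deliberately simplified illustration with toy arrays and made-up names, not the published architecture: the online network regresses onto the target network's embedding of another view, and the target weights trail the online weights via an exponential moving average instead of receiving gradients.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def byol_loss(online_pred, target_proj):
    """BYOL-style regression loss: squared error between L2-normalized
    online predictions and target projections -- no negative pairs."""
    diff = normalize(online_pred) - normalize(target_proj)
    return np.mean(np.sum(diff ** 2, axis=-1))

def ema_update(target_params, online_params, tau=0.99):
    """The target network is not trained by gradients; it is an
    exponential moving average of the online network's parameters."""
    return tau * target_params + (1.0 - tau) * online_params

rng = np.random.default_rng(0)
pred = rng.normal(size=(4, 8))      # toy online predictions
proj = rng.normal(size=(4, 8))      # toy target projections
print(byol_loss(pred, pred))        # perfect prediction -> 0.0
print(byol_loss(pred, proj))        # mismatched pair -> strictly positive
target = ema_update(proj, pred)     # target drifts slowly toward online
```

Since only one side of the loss ever receives gradients and the other is a slow-moving average, the collapse to a constant representation that one might expect without negatives does not occur in practice, which is the puzzle the talk addresses.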

recording
slides
Andrew Zisserman

The talk will describe three phases of self-supervised learning. First, the 'classical' phase, where the goal is semantic representation learning, e.g. training a network on ImageNet for an image representation. The second, 'expansion' phase goes beyond single-modality representation learning. The goals are more general and cover classical computer vision tasks, such as tracking and segmentation; and the data can be multi-modal, such as video with audio. Tasks here include learning joint embeddings, learning to localize objects, and learning to transform videos into discrete objects. The final 'uncurated' phase involves self-supervised learning from uncurated data.

slides
Ishan Misra

In this talk I will present our recent efforts in learning representations that can benefit semantic downstream tasks. Our methods build on two simple yet powerful insights: 1) the representation must be stable under different data augmentations or "views" of the data; 2) the representation must group together instances that co-occur in different views or modalities. I will show that these two insights can be applied to weakly supervised and self-supervised learning, and to image, video, and audio data, to learn highly performant representations. For example, these representations outperform weakly supervised representations trained on billions of images or millions of videos; can outperform ImageNet supervised pretraining on a variety of downstream tasks; and have led to state-of-the-art results on multiple benchmarks. These methods build upon prior work in clustering and contrastive methods for representation learning. I will conclude the talk by presenting shortcomings of our work and some preliminary thoughts on how they may be addressed.

recording
slides
Stella Yu

Unsupervised representation learning has made great strides with invariant mapping and instance-level discrimination, as benchmarked by classification on common datasets. However, these datasets are curated to be distinctive and class-balanced, whereas naturally collected data can be highly correlated within a class and long-tail distributed across classes. The natural grouping of instances conflicts with the fundamental assumption of instance-level discrimination. Contrastive feature learning is thus unstable without grouping, whereas grouping without contrastive feature learning is easily trapped in degeneracy. By integrating grouping into a discriminative metric learning framework, I will show that we can not only outperform the state of the art on various classification, transfer learning, and semi-supervised learning benchmarks with a much smaller (academic :-) compute budget, but also extend the goal of representation learning beyond semantic categorization.

recording
Alexei (Alyosha) Efros

TBD

recording
Deepak Pathak

TBD

recording

Speakers

Organizers