Workshop Description

There is a growing trend in the machine learning community to adopt self-supervised approaches for pre-training deep networks. Self-supervised learning derives training signals from unlabeled corpora via proxy supervised tasks, for example, distinguishing parts of the input signal from distractors, or generating masked input segments conditioned on the unmasked ones. These approaches make it possible to use the tremendous amount of unlabeled data on the web to train large networks and solve complicated tasks. ELMo, BERT, and GPT are famous examples of this direction in NLP.

Recently, self-supervised approaches for speech and audio processing have also been gaining attention. These approaches combine methods that utilize no or partial labels, unpaired text and audio data, contextual text and video supervision, and signals from user interactions. Although self-supervised learning is an active research direction in speech and audio processing, current work is limited to a few problems such as automatic speech recognition, speaker identification, and speech translation, partially due to the diversity of modeling approaches across speech and audio processing problems. Much territory in this research direction remains unexplored.
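To make the masked-prediction proxy task mentioned above concrete, here is a minimal, self-contained sketch using NumPy and synthetic features. All names and the toy predictor are hypothetical illustrations, not any particular published system: real masked-prediction models learn a neural network to reconstruct the masked frames, whereas this sketch substitutes a simple neighbor-averaging heuristic just to show where the training signal comes from.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "audio": 100 frames of 40-dimensional log-mel-like features (synthetic).
frames = rng.standard_normal((100, 40))

# Mask ~15% of frames, as in BERT-style masked-prediction proxy tasks.
mask = rng.random(100) < 0.15
masked_input = frames.copy()
masked_input[mask] = 0.0  # hide the masked frames from the predictor

def predict_masked(x, mask):
    """Toy stand-in for a learned model: predict each masked frame
    as the mean of its unmasked neighbors within a +/-2 frame window."""
    preds = x.copy()
    for i in np.flatnonzero(mask):
        lo, hi = max(0, i - 2), min(len(x), i + 3)
        context = [j for j in range(lo, hi) if not mask[j]]
        if context:
            preds[i] = x[context].mean(axis=0)
    return preds

preds = predict_masked(masked_input, mask)

# Self-supervised loss: reconstruction error on the masked positions only.
# No human labels are needed; the unmasked signal itself supervises training.
loss = float(np.mean((preds[mask] - frames[mask]) ** 2))
print(f"masked frames: {int(mask.sum())}, reconstruction loss: {loss:.3f}")
```

The key point of the sketch is the last two lines: the loss is computed only on masked positions, so the raw waveform-derived features serve as their own training targets.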
This workshop will bring focused discussion of self-supervision to the field of speech and audio processing via several invited talks, oral and poster sessions featuring high-quality papers, and a panel of leading researchers from academia and industry. Alongside research on new self-supervised methods, data, applications, and results, this workshop calls for novel work on understanding, analyzing, and comparing different self-supervision approaches for speech and audio processing. The workshop aims to:

Review existing self-supervised methods and results and inspire new ones,

Motivate the application of self-supervision approaches to more speech and audio processing problems in academia and industry, and encourage discussion among experts and practitioners from the two realms,

Encourage work on methods for understanding learned representations, on comparing different self-supervision approaches, and on comparing self-supervision with the self-training and transfer learning methods that low-resource speech and audio processing has long utilized,

Facilitate communication within the field of speech and audio processing (e.g., among people who attend conferences such as INTERSPEECH and ICASSP) as well as between this field and the broader machine learning community, in order to share knowledge, ideas, and data, and to encourage future collaboration that inspires innovation in the field and the community as a whole.

Call for Papers

We welcome submissions in the area of self-supervised learning for speech and audio processing. Relevant research directions include, but are not limited to:

New self-supervised training approaches

Application of self-supervised models to downstream tasks, such as automatic speech recognition, speech enhancement, speech augmentation, and spoken language understanding

Generalizability of self-supervised models across domains, tasks, or languages

Understanding of why self-supervision methods work for speech and audio, for example:

  • What does the model learn in self-supervised learning tasks?
  • Why do seemingly unrelated self-supervised proxy tasks improve downstream speech application performance?
  • Are some self-supervision proxy tasks suitable for certain downstream applications but not others?

Comparative study on self-supervised learning approaches

Submissions should be in NeurIPS style and between 4 and 8 pages, excluding references. Authors may include supplementary material beyond the 8 pages, but reviewers are not required to review it. Original work is preferred. Submissions already posted on arXiv or similar repositories are acceptable. Papers submitted to other conferences or workshops may also be submitted, but the authors must contact the organizers. Each submission will be reviewed by at least three reviewers. Authors and reviewers are asked to disclose any possible conflicts of interest, and the organizers will manage these conflicts when assigning submissions for review. Reviews will be double-blind. Note that the workshop is not archival, but accepted papers will be hosted on the workshop website. For reproducibility, we also encourage authors to publicly release the code for their experiments.