This year’s International Conference on Acoustics, Speech and Signal Processing (ICASSP) was held in the sunny seaside town of Brighton. It’s amazing to see how much has happened in the field of signal processing and machine learning since last year’s conference in Calgary (see my blog post from last year). One of the year’s most impressive accomplishments came from Katie Bouman (a member of the IEEE Signal Processing Society) and her team, who produced the first-ever image of a black hole using signal processing algorithms that combined data from telescopes located all over the world. Signal processing is pivotal for so much of what we can accomplish with modern technology, making the field more important than ever.
Figure: Brighton pier – location for the ICASSP 2019 welcome reception
Trust, Data Privacy, and the “Internet of Skills”
Sir David John Spiegelhalter, Chair of the Winton Centre for Risk and Evidence Communication, set an important tone right from the start with his opening plenary presentation. He made a compelling case for scientists and engineers working in data-related fields to take on a stronger role in communicating to the public about data and their discipline. In this new era of fake news, “data are being used as rhetorical devices to change our emotions rather than to inform,” he said. Sir David went on to assert that we are the experts in our fields, and we are the ones who best know and understand the promise and limitations of the state of the art. Therefore, we have the responsibility and civic duty to ensure that scientific findings and statistical data are presented to the public in a fair, consistent, and accessible way. Scientific and research groups should aim to establish trustworthiness rather than simply to increase trust, an important nuance emphasized during the talk.
Figure: Sir David John Spiegelhalter’s opening keynote presentation
Trust continued to be a pervasive topic throughout the conference. In her keynote presentation, Corinna Cortes talked about the ways that Google is developing fact-checking techniques and serving clarifications to users when a particular story has been debunked. She also talked about techniques to preserve user data privacy, including so-called “federated learning”, where training steps are performed on a user’s physical device (e.g., their mobile phone) and only the result is sent to a central server, which updates the model weights. Because your raw data never leaves your phone, this could be a major step forward in ensuring data privacy.
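To make the federated learning idea concrete, here is a minimal sketch in plain Python. It assumes a toy linear model and a simple weighted-averaging rule on the server; the function names (`local_update`, `federated_average`, `run_rounds`) and the data are hypothetical illustrations, not the system described in the keynote.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair held privately on one device


def local_update(weights: List[float], data: List[Sample], lr: float = 0.1) -> List[float]:
    """Run one pass of SGD for y ~ w*x + b using only this device's data."""
    w, b = weights
    for x, y in data:
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return [w, b]


def federated_average(client_weights: List[List[float]], client_sizes: List[int]) -> List[float]:
    """Server step: average the clients' weights, weighted by local dataset size."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


def run_rounds(global_weights: List[float], clients: List[List[Sample]], rounds: int) -> List[float]:
    """Each round, devices train locally and send back weights, never raw data."""
    for _ in range(rounds):
        updates = [local_update(global_weights, data) for data in clients]
        sizes = [len(data) for data in clients]
        global_weights = federated_average(updates, sizes)
    return global_weights


# Two hypothetical devices, each holding private samples of y = 2x + 1.
clients = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
]
final = run_rounds([0.0, 0.0], clients, rounds=200)
print(final)  # approaches w = 2, b = 1 in this toy setup
```

Only the updated weights cross the network in this sketch. Production systems typically add further protections, such as secure aggregation, on top of this basic loop.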
Figure: Prof. Mischa Dohler’s keynote presentation on the “Internet of skills”
With the imminent roll-out of 5G in many places around the world, the availability of ultra-low latency broadband connections is opening the door to a whole new category of applications that were not previously possible. Prof. Mischa Dohler, from King’s College London, refers to this category as the “Internet of Skills”. During his keynote address, he called out various new applications with which he is involved, including live music performances with musicians distributed all around the world and surgical operations controlled remotely. Prof. Dohler stated that by combining visual, audio, and haptic technologies into interactive sensory experiences, an increasingly “synchronized reality” can be created.
Figure: Poster and demo by Irish child speech technology company SoapBox Labs
Figure: Conference banquet at The Grand Brighton
My Paper Highlights
There was so much to take in at ICASSP 2019! I primarily attended sessions focused on speech and audio processing, as it happens to be my area of focus. Below, in no particular order, are ten papers that really stood out to me during those sessions.
Google continues to lead the charge in developing end-to-end multilingual speech processing techniques. In this paper, they introduce an alternative to labeling speech as characters or words: they use Unicode bytes instead, which enable much greater consistency across languages, particularly when the languages use different writing systems (e.g., English vs. Japanese).
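As a rough illustration of why bytes give a consistent label set, the sketch below converts transcripts into byte labels. It assumes UTF-8 encoding, which this post does not specify, and the helper names are hypothetical; it is not the paper's implementation.

```python
from typing import List


def to_byte_labels(text: str) -> List[int]:
    """Encode a transcript as a sequence of byte values (0-255), assuming UTF-8."""
    return list(text.encode("utf-8"))


def from_byte_labels(labels: List[int]) -> str:
    """Decode a sequence of byte labels back into text."""
    return bytes(labels).decode("utf-8")


english = to_byte_labels("hello")
japanese = to_byte_labels("こんにちは")

print(english)        # one byte per ASCII character
print(len(japanese))  # each of these hiragana characters takes three bytes
print(from_byte_labels(japanese))
```

Whatever the language, the output vocabulary never exceeds 256 symbols. The trade-off is that non-Latin scripts produce longer label sequences.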
So-called “student-teacher” approaches are getting increased attention as a means of utilizing large volumes of unlabeled data. Nobody (perhaps with the exception of Google) has as much unlabeled data available to them as Amazon does. As a result, it is not surprising that they are actively researching this. In this paper, they highlight challenges and solutions to building a speech recognition system that uses 7,000 hours of labeled and 1,000,000 hours of unlabeled data.
Audio event and sound scene classification continues to be a major topic at ICASSP. Some of the authors of this paper continue to emphasize that better audio event detectors can be built using large volumes of data with noisy labels compared to smaller datasets with perfectly clean labels.
Multitask learning is an attractive technique to use with end-to-end neural networks, both because it finds shared representations for similar tasks and because it is computationally efficient (e.g., using one model instead of three). This paper additionally incorporates an attention mechanism into each task-specific model component, using a similar approach to my team’s 2018 Interspeech paper.
The AudioSet dataset, with its roughly two million hand-labeled 10-second YouTube clips, has enabled significant advancement in audio event detection ever since its release in 2017. The paper here produces state-of-the-art recognition accuracy on this dataset using a novel neural network architecture and pooling technique.
Modern signal processing is largely implemented using machine learning techniques that are highly transferable across different applications (e.g., from image processing to speech processing). As a result, some of the most useful papers are ones which assess and uncover consistent characteristics of model behavior. This was one such paper. Coming out of Yoshua Bengio’s lab in Montreal, the research highlighted the fact that early layers of deep convolutional neural networks find coarser, more transferable features while later layers tend to be more application-specific. This was demonstrated in the case of speech recognition where models were pre-trained on a different language than the target.
Often audio processing applications require accurate detection of events which may be rarely occurring or underrepresented in the available training data. This class imbalance problem is typically tackled using techniques like oversampling the rare classes or using weights in the cost function which additionally penalize rare class errors. This paper proposes a new approach to this involving a clustering technique which splits the overrepresented class into multiple new classes and demonstrates improved rare class accuracy as a result.
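For reference, the conventional cost-function weighting mentioned above (not the clustering approach the paper proposes) can be sketched as follows. The inverse-frequency weighting scheme, the function names, and the class labels are illustrative assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def inverse_frequency_weights(labels: List[str]) -> Dict[str, float]:
    """Weight each class by n / (k * count), so rare classes get larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(
    probs: List[Dict[str, float]], labels: List[str], weights: Dict[str, float]
) -> float:
    """Cross-entropy where each sample's loss is scaled by its class weight."""
    total = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm


# Hypothetical data: nine background clips and one rare "glass_break" event.
labels = ["background"] * 9 + ["glass_break"]
weights = inverse_frequency_weights(labels)

# A lazy model that always favors the common class.
probs = [{"background": 0.9, "glass_break": 0.1} for _ in labels]

uniform = weighted_cross_entropy(probs, labels, {c: 1.0 for c in weights})
weighted = weighted_cross_entropy(probs, labels, weights)
print(weights)
print(uniform, weighted)  # the weighted loss penalizes the missed rare event more
```

With these weights, the single rare sample contributes as much to the loss as all nine common samples combined, so a model can no longer score well by ignoring it.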
Embeddings are everywhere in modern machine learning. The notion of being able to collapse high-dimensional feature data into a lower-dimensional form, which encodes important information about the phenomenon and preserves the concepts of distance and similarity, is very attractive for many applications. In this paper, the authors propose a method of training an embedding that encodes lexical information as well as speaking style and paralinguistics. They go on to demonstrate the value of this approach in applications such as speech recognition and emotion recognition.
The problem of accurately detecting emotional valence (e.g., positive vs. negative emotions) using just acoustic data remains elusive for the speech emotion recognition field. Solving this problem is unlikely to be achieved solely through training ever more complex neural networks with more and more data. One promising avenue seems to be exploiting context. In this paper the authors demonstrate that human listeners can provide more consistent labels for emotion when they are provided clips of speech audio from the same speaker in the right order. More consistent labels and neural network models that utilize context can produce a noticeable improvement in recognition accuracy.
My last, but not least, paper pick concerns the use of a neural network model with a triplet loss function. This technique can be effective for training an embedding in which the samples most similar to a selected sample are the ones closest to it. The authors demonstrate how this can be effective in the speech emotion recognition area.
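For readers unfamiliar with triplet loss, here is a minimal sketch of the standard formulation, assuming Euclidean distance and a margin of 1.0 (some variants use squared distances). The embeddings and emotion labels are made up for illustration and are not taken from the paper.

```python
import math
from typing import Dict, List

Vector = List[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor: Vector, positive: Vector, negative: Vector, margin: float = 1.0) -> float:
    """Zero once the positive is closer to the anchor than the negative by at least the margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def nearest(query: Vector, embeddings: Dict[str, Vector]) -> str:
    """Return the key of the stored embedding closest to the query."""
    return min(embeddings, key=lambda name: euclidean(query, embeddings[name]))


# Well-separated triplet: no loss is incurred.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
# Badly ordered triplet: the loss is large and would drive training.
print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0]))

# Hypothetical trained embeddings used for similarity lookup.
store = {"happy_clip": [0.9, 0.1], "sad_clip": [-0.8, 0.2]}
print(nearest([1.0, 0.0], store))
```

Minimizing this loss over many triplets pulls same-class samples together and pushes different-class samples apart, which is what makes the nearest-neighbor lookup at the end meaningful.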
Another Year, Another Brick in the Signal Processing Wall
ICASSP 2019 may not be as memorable a Brighton event as Pink Floyd’s first concert here, but it did showcase some of the latest research which is directly impacting our digital lives. With the current acceleration of new signal processing and machine learning methods and their inclusion in more and more of contemporary technology, who knows what we will be reflecting on when we meet again next year in Barcelona? Regardless, the future of the signal processing field continues to look bright!