Shreyan Chowdhury
Machine Learning Researcher | Music+AI Engineer
I am a post-doctoral researcher at the Institute of Computational Perception, a research group within the Department of Computer Science at Johannes Kepler University Linz, Austria. I work on Artificial Intelligence applied to Audio and Music.
What I’ve Been Up To…
Publications
Emotion Recognition in Piano Music
On Perceived Emotion in Expressive Piano Performance: Further Experimental Evidence for the Relevance of Mid-level Features
Shreyan Chowdhury, Gerhard Widmer
ISMIR 2021, Virtual
[abstract] | [full paper] | [video]
Despite recent advances in audio content-based music emotion recognition, a question that remains to be explored is whether an algorithm can reliably discern emotional or expressive qualities between different performances of the same piece. In the present work, we analyze several sets of features on their effectiveness in predicting arousal and valence of six different performances (by six famous pianists) of Bach's Well-Tempered Clavier Book 1. These features include low-level acoustic features, score-based features, features extracted using a pre-trained emotion model, and Mid-level perceptual features. We compare their predictive power by evaluating them on several experiments designed to test performance-wise or piece-wise variations of emotion. We find that Mid-level features show significant contribution in performance-wise variation of both arousal and valence -- even better than the pre-trained emotion model. Our findings add to the evidence of Mid-level perceptual features being an important representation of musical attributes for several tasks -- specifically, in this case, for capturing the expressive aspects of music that manifest as perceived emotion of a musical performance.
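As a rough illustration of the kind of feature-set comparison described in the abstract, one could fit a simple regressor per feature set and compare cross-validated predictive power for arousal. The sketch below uses placeholder feature matrices and annotations; `feature_sets`, `arousal`, and all dimensions are hypothetical stand-ins, not the paper's data or code.

```python
# Minimal sketch (not the paper's code): comparing feature sets by how well
# each predicts arousal ratings with a simple linear model.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_clips = 120  # hypothetical number of performance excerpts

feature_sets = {  # hypothetical per-performance feature matrices
    "low_level_acoustic": rng.normal(size=(n_clips, 60)),
    "score_based": rng.normal(size=(n_clips, 20)),
    "pretrained_emotion": rng.normal(size=(n_clips, 8)),
    "mid_level_perceptual": rng.normal(size=(n_clips, 7)),
}
arousal = rng.normal(size=n_clips)  # placeholder arousal annotations

for name, X in feature_sets.items():
    # cross-validated R^2 as a rough measure of predictive power
    scores = cross_val_score(Ridge(alpha=1.0), X, arousal, cv=5, scoring="r2")
    print(f"{name:24s} mean R^2 = {scores.mean():.3f}")
```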
On the Characterization of Expressive Performance in Classical Music: First Results of the Con Espressione Game
Carlos Cancino-Chacón, Silvan Peter, Shreyan Chowdhury, Anna Aljanaki, Gerhard Widmer
ISMIR 2020, Virtual
[abstract] | [full paper]
A piece of music can be expressively performed, or interpreted, in a variety of ways. With the help of an online questionnaire, the Con Espressione Game, we collected some 1,500 descriptions of expressive character relating to 45 performances of 9 excerpts from classical piano pieces, played by different famous pianists. More specifically, listeners were asked to describe, using freely chosen words (preferably: adjectives), how they perceive the expressive character of the different performances. In this paper, we offer a first account of this new data resource for expressive performance research, and provide an exploratory analysis, addressing three main questions: (1) how similarly do different listeners describe a performance of a piece? (2) what are the main dimensions (or axes) for expressive character emerging from this?; and (3) how do measurable parameters of a performance (e.g., tempo, dynamics) and mid- and high-level features that can be predicted by machine learning models (e.g., articulation, arousal) relate to these expressive dimensions? The dataset that we publish along with this paper was enriched by adding hand-corrected score-to-performance alignments, as well as descriptive audio features such as tempo and dynamics curves.
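A minimal sketch of how one might probe question (2), extracting candidate axes of expressive character from free-text descriptions: embed the answers and reduce them to a few dimensions. The TF-IDF/SVD pipeline and the toy descriptions below are illustrative assumptions, not the paper's actual analysis.

```python
# Sketch (assumptions, not the paper's pipeline): candidate expressive axes
# from free-text performance descriptions via TF-IDF + truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

descriptions = [  # toy stand-ins for listener answers
    "gentle flowing dreamy",
    "harsh mechanical rushed",
    "tender calm singing",
    "aggressive loud dramatic",
]

X = TfidfVectorizer().fit_transform(descriptions)   # bag-of-words embedding
svd = TruncatedSVD(n_components=2, random_state=0)  # candidate expressive axes
coords = svd.fit_transform(X)                       # descriptions projected onto the axes
print(coords.shape)                                 # (4, 2)
```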
Explainable Emotion Recognition
Towards Explainable Music Emotion Recognition: The Route via Mid-level Features
Shreyan Chowdhury, Andreu Vall, Verena Haunschmid, Gerhard Widmer
ISMIR 2019, Delft, Netherlands
[abstract] | [full paper] | [demo]
Emotional aspects play an important part in our interaction with music. However, modelling these aspects in MIR systems has been notoriously challenging since emotion is an inherently abstract and subjective experience, thus making it difficult to quantify or predict in the first place, and to make sense of the predictions in the next. In an attempt to create a model that can give a musically meaningful and intuitive explanation for its predictions, we propose a VGG-style deep neural network that learns to predict emotional characteristics of a musical piece together with (and based on) human-interpretable, mid-level perceptual features. We compare this to predicting emotion directly with an identical network that does not take into account the mid-level features and observe that the loss in predictive performance of going through the mid-level features is surprisingly low, on average. The design of our network allows us to visualize the effects of perceptual features on individual emotion predictions, and we argue that the small loss in performance in going through the mid-level features is justified by the gain in explainability of the predictions.
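A minimal PyTorch sketch of the overall idea: a CNN backbone maps a spectrogram to a small set of mid-level features, and a single linear layer maps those to emotion ratings, so each emotion prediction decomposes into interpretable mid-level contributions. The backbone and the layer sizes below are assumptions for illustration, not the paper's exact VGG-style network.

```python
# Sketch of a two-stage "audio -> mid-level -> emotion" model.
import torch
import torch.nn as nn

class Mid2Emotion(nn.Module):
    def __init__(self, n_midlevel=7, n_emotions=8):
        super().__init__()
        self.backbone = nn.Sequential(          # stand-in for the VGG-style CNN
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_midlevel = nn.Linear(32, n_midlevel)
        self.to_emotion = nn.Linear(n_midlevel, n_emotions)  # interpretable weights

    def forward(self, spec):
        mid = self.to_midlevel(self.backbone(spec))
        return mid, self.to_emotion(mid)

model = Mid2Emotion()
mid, emo = model(torch.randn(4, 1, 128, 256))   # batch of mel spectrograms
print(mid.shape, emo.shape)                     # torch.Size([4, 7]) torch.Size([4, 8])
```

Because the final mapping is linear, the weights of `to_emotion` indicate, per emotion dimension, how much each mid-level feature contributes, which is what enables the visualisations mentioned in the abstract.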
Two-level Explanations in Music Emotion Recognition
Verena Haunschmid, Shreyan Chowdhury, Gerhard Widmer
ICML 2019, Machine Learning for Music Discovery Workshop, Long Beach, CA, USA
[abstract] | [full paper] | [demo]
Current ML models for music emotion recognition, while generally working quite well, do not give meaningful or intuitive explanations for their predictions. In this work, we propose a 2-step procedure to arrive at spectrogram-level explanations that connect certain aspects of the audio to interpretable mid-level perceptual features, and these to the actual emotion prediction. That makes it possible to focus on specific musical reasons for a prediction (in terms of perceptual features), and to trace these back to patterns in the audio that can be interpreted visually and acoustically.
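To illustrate the second of the two steps: if the mapping from mid-level features to emotions is linear, the contribution of each feature to a given prediction is simply weight times feature value. The shapes below (8 emotion dimensions, 7 mid-level features) are assumptions, not the exact model.

```python
# Sketch: decomposing one emotion prediction into mid-level contributions.
import torch

W = torch.randn(8, 7)   # hypothetical weights of the mid-level-to-emotion layer
mid = torch.rand(7)     # predicted mid-level features for one clip
effects = W * mid       # effects[e, j]: contribution of mid-level feature j to emotion e
top = int(effects[0].abs().argmax())
print(f"most influential mid-level feature for emotion 0: index {top}")
```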
Tracing Back Music Emotion Predictions to Sound Sources and Intuitive Perceptual Qualities
Shreyan Chowdhury, Verena Praher, Gerhard Widmer
SMC 2021, Virtual
[abstract] | [full paper]
Music emotion recognition is an important task in MIR (Music Information Retrieval) research. Owing to factors like the subjective nature of the task and the variation of emotional cues between musical genres, there are still significant challenges in developing reliable and generalizable models. One important step towards better models would be to understand what a model is actually learning from the data and how the prediction for a particular input is made. In previous work, we have shown how to derive explanations of model predictions in terms of spectrogram image segments that connect to the high-level emotion prediction via a layer of easily interpretable perceptual features. However, that scheme lacks intuitive musical comprehensibility at the spectrogram level. In the present work, we bridge this gap by merging audioLIME -- a source-separation based explainer -- with mid-level perceptual features, thus forming an intuitive connection chain between the input audio and the output emotion predictions. We demonstrate the usefulness of this method by applying it to debug a biased emotion prediction model.
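The sketch below conveys the general source-separation-based explanation idea in LIME style: perturb the mix by dropping separated sources, predict mid-level features for each perturbed mix, and fit a linear surrogate whose coefficients indicate source importance. This is not the audioLIME API; `separate_sources` and `predict_midlevel` are hypothetical stand-ins for a source-separation model and a mid-level predictor.

```python
# Generic sketch of source-separation-based explanation of mid-level features.
import numpy as np
from sklearn.linear_model import Ridge

def explain_sources(audio, separate_sources, predict_midlevel, n_samples=200, seed=0):
    sources = separate_sources(audio)      # e.g. vocals, drums, bass, piano, other
    rng = np.random.default_rng(seed)
    masks = rng.integers(0, 2, size=(n_samples, len(sources)))  # which sources to keep
    mixes = [sum(s for s, keep in zip(sources, row) if keep) if row.any()
             else np.zeros_like(sources[0]) for row in masks]
    preds = np.array([predict_midlevel(mix) for mix in mixes])  # (n_samples, n_midlevel)
    # one linear surrogate per mid-level feature; coefficients = source importances
    return np.stack([Ridge(alpha=1.0).fit(masks, preds[:, j]).coef_
                     for j in range(preds.shape[1])])
```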
Domain Adaptation
Towards Explaining Expressive Qualities in Piano Recordings: Transfer of Explanatory Features via Acoustic Domain Adaptation
Shreyan Chowdhury, Gerhard Widmer
ICASSP 2021, Toronto, Canada
[abstract] | [full paper]
Emotion and expressivity in music have been topics of considerable interest in the field of music information retrieval. In recent years, mid-level perceptual features have been suggested as means to explain computational predictions of musical emotion. We find that the diversity of musical styles and genres in the available dataset for learning these features is not sufficient for models to generalise well to specialised acoustic domains such as solo piano music. In this work, we show that by utilising unsupervised domain adaptation together with receptive-field regularised deep neural networks, it is possible to significantly improve generalisation to this domain. Additionally, we demonstrate that our domain-adapted models can better predict and explain expressive qualities in classical piano performances, as perceived and described by human listeners.
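One common ingredient of unsupervised domain adaptation is a DANN-style gradient reversal layer, which trains the feature extractor to fool a domain classifier and thereby produce domain-invariant features; whether this matches the paper's exact adaptation setup is an assumption. A minimal PyTorch sketch:

```python
# Sketch of a gradient reversal layer for unsupervised domain adaptation.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # flip (and scale) the gradient so the feature extractor is pushed
        # towards features the domain classifier cannot tell apart
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

features = torch.randn(8, 64, requires_grad=True)        # stand-in CNN features
domain_logits = torch.nn.Linear(64, 1)(grad_reverse(features, lam=0.5))
domain_logits.sum().backward()                            # gradients w.r.t. features are reversed
```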
The Con Espressione Game Dataset (1.0.0)
Carlos Cancino-Chacón, Silvan Peter, Shreyan Chowdhury, Anna Aljanaki, Gerhard Widmer
[abstract] | [dataset]
A piece of music can be expressively performed, or interpreted, in a variety of ways. With the help of an online questionnaire, the Con Espressione Game, we collected some 1,500 descriptions of expressive character relating to 45 performances of 9 excerpts from classical piano pieces, played by different famous pianists. More specifically, listeners were asked to describe, using freely chosen words (preferably: adjectives), how they perceive the expressive character of the different performances. The aim of this research is to find the dimensions of musical expression (in Western classical piano music) that can be attributed to a performance, as perceived and described in natural language by listeners.
Emotion and Theme Recognition in Music with Frequency-Aware RF-Regularized CNNs
Khaled Koutini, Shreyan Chowdhury, Verena Haunschmid, Hamid Eghbal-zadeh, Gerhard Widmer
MediaEval Multimedia Benchmark 2019, Sophia Antipolis, France
[abstract] | [full paper]
We present a Receptive Field-(RF)-regularized and Frequency-Aware CNN approach for tagging music with emotion/mood labels. We perform an investigation regarding the impact of the RF of the CNNs on their performance on this dataset. We observe that ResNets with smaller receptive fields – originally adapted for acoustic scene classification – also perform well in the emotion tagging task. We improve the performance of such architectures using techniques such as Frequency Awareness and Shake-Shake regularization, which were used in previous work on general acoustic recognition tasks.
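One simple way to make a 2-D convolution "frequency-aware" is to append an extra input channel that encodes each bin's position along the frequency axis (a CoordConv-style idea). The sketch below illustrates that, without claiming to match the paper's exact layer.

```python
# Sketch: convolution with an added frequency-position channel.
import torch
import torch.nn as nn

class FreqAwareConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + 1, out_ch, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                      # x: (batch, channels, freq, time)
        b, _, f, t = x.shape
        freq_pos = torch.linspace(-1, 1, f, device=x.device)
        freq_map = freq_pos.view(1, 1, f, 1).expand(b, 1, f, t)
        return self.conv(torch.cat([x, freq_map], dim=1))

y = FreqAwareConv(1, 16)(torch.randn(2, 1, 128, 256))
print(y.shape)   # torch.Size([2, 16, 128, 256])
```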
Music Tempo Estimation Using Sub-Band Synchrony
Shreyan Chowdhury, Tanaya Guha, Rajesh M Hegde
INTERSPEECH 2017, Stockholm, Sweden
[abstract] | [full paper] | [poster]
Tempo estimation aims at estimating the pace of a musical piece measured in beats per minute. This paper presents a new tempo estimation method that utilizes coherent energy changes across multiple frequency sub-bands to identify the onsets. A new measure, called the sub-band synchrony, is proposed to detect and quantify the coherent amplitude changes across multiple sub-bands. Given a musical piece, our method first detects the onsets using the sub-band synchrony measure. The periodicity of the resulting onset curve, measured using the autocorrelation function, is used to estimate the tempo value. The performance of the sub-band synchrony based tempo estimation method is evaluated on two music databases. Experimental results indicate a reasonable improvement in performance when compared to conventional methods of tempo estimation.
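A rough sketch of the pipeline described above: half-wave-rectified energy changes per frequency sub-band, a crude count of how many bands rise together as a stand-in for the synchrony measure, and the tempo read off the autocorrelation of the resulting onset curve. The actual sub-band synchrony measure in the paper is more elaborate; everything below is a simplified illustration.

```python
# Simplified illustration of sub-band-based onset detection and tempo estimation.
import numpy as np
from scipy.signal import stft

def estimate_tempo(audio, sr, n_bands=8):
    f, t, Z = stft(audio, fs=sr, nperseg=1024, noverlap=768)
    energy = np.abs(Z) ** 2
    bands = np.array_split(energy, n_bands, axis=0)               # group bins into sub-bands
    flux = np.stack([np.maximum(np.diff(b.sum(axis=0)), 0) for b in bands])
    # crude "synchrony": how many sub-bands rise above their mean flux per frame
    synchrony = (flux > flux.mean(axis=1, keepdims=True)).sum(axis=0)
    onset_curve = synchrony * flux.sum(axis=0)
    ac = np.correlate(onset_curve, onset_curve, mode="full")[len(onset_curve) - 1:]
    hop_s = (1024 - 768) / sr
    lags = np.arange(len(ac)) * hop_s
    valid = (lags > 60 / 200) & (lags < 60 / 40)                  # 40-200 BPM range
    return 60.0 / lags[valid][np.argmax(ac[valid])]

tempo = estimate_tempo(np.random.randn(22050 * 10), sr=22050)     # 10 s of noise as a stand-in
print(f"estimated tempo: {tempo:.1f} BPM")
```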