Two-level explanations for music emotion recognition

1. Predicting emotion and mid-level perceptual features from audio spectrograms

In [1]:
from utils import *

1.1. Select a song from the dataset

In [19]:
path = '/home/shreyan/PROJECTS/midlevel/Soundtracks/set1/set1/mp3/'
mp3_path = 'Soundtrack360_mp3/'
stft_path = 'stft/'
songname = '001.mp3'

song_audio_path = f'{path}{mp3_path}{songname}'
song_spec_path = f'{path}{stft_path}{songname}.spec'
song_stft = pickleload(f'{path}{stft_path}{songname}.spec.stft')
song_filterbank = pickleload(f'{path}{stft_path}{songname}.spec.filterbank')

Listen to the input song

In [21]:
display(Audio(song_audio_path, rate=22050))

Compile the prediction function

In [4]:
prediction_fn_audio = compile_prediction_function_audio(modelfile)
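
The compiled function (implemented in `utils`, not shown here) maps a batch of spectrograms to both outputs of the two-level model: seven mid-level feature predictions and eight emotion predictions, where the emotion head is a linear function of the mid-level outputs. A rough sketch of this interface, with random placeholder parameters `W` and `b` standing in for the trained mid-level-to-emotion layer (in the real model, the mid-level values come from a CNN applied to the spectrogram):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder parameters for the second level of the model:
# 7 mid-level features -> 8 emotion ratings (random, for illustration only).
W = rng.normal(size=(7, 8))   # mid-level-to-emotion weights
b = rng.normal(size=8)        # emotion biases

def prediction_fn_sketch(ml_batch):
    """Map mid-level predictions (batch, 7) to emotion predictions (batch, 8).

    Only the linear emotion head is sketched here; the first level
    (spectrogram -> mid-level) is a CNN in the real model.
    """
    return ml_batch @ W + b

ml = rng.random((1, 7))             # one song's mid-level predictions
emo = prediction_fn_sketch(ml)
print(emo.shape)                    # one rating per emotion
```

Because the second level is linear, each emotion prediction decomposes exactly into per-feature contributions, which is what the effects plots below exploit.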

1.2. Predict mid-level features and emotions

In [5]:
spectrum, _, _, _ = prepare_audio(song_spec_path)
ml_preds, emo_preds = prediction_fn_audio(np.array([spectrum]))
In [6]:
emo_preds = pd.DataFrame(emo_preds)
emo_preds.columns = emo_names
print(f"Emotion predictions for song {songname}")
print(emo_preds.T)
Emotion predictions for song 001.mp3
                0
valence  0.553503
energy   0.677310
tension  0.279063
anger    0.127628
fear     0.021633
happy    0.664131
sad      0.042493
tender   0.295766

HIGH emotions: "energy", "happy"

LOW emotions: "fear", "sad"

In [7]:
ml_preds = pd.DataFrame(ml_preds)
ml_preds.columns = ml_names
print(f"Mid-level predictions for song {songname}")
print(ml_preds.T)
Mid-level predictions for song 001.mp3
                        0
melodiousness    0.637723
articulation     0.886127
r_complexity     0.529747
r_stability      0.895867
dissonance       0.247777
tonal_stability  0.854901
minorness        0.131224

HIGH mid-level features: "articulation", "rhythmic stability", and "tonal stability"

LOW mid-level features: "dissonance", "minorness"

1.3. Obtain effects plots, visualizing the effect of each mid-level prediction on each emotion prediction for this example. Note the positive effect of articulation on the prediction of "energy" and its negative effect on the prediction of "sad".

Effects are calculated by multiplying the weights of the mid-level-to-emotion layer with the corresponding mid-level predictions.

In [8]:
fig, ax = plt.subplots(2, 4, sharey=True, figsize=(25, 8))
emotion_num = 0
for i in range(ax.shape[0]):
    for j in range(ax.shape[1]):
        # Effect of each mid-level feature on the current emotion:
        # mid-level prediction times the corresponding weight of the
        # mid-level-to-emotion layer.
        song_ml_effect = np.multiply(ml_preds, ML2Eweights.transpose()[:, emotion_num])
        ax[i][j].barh(np.arange(7), song_ml_effect.values[0], color='g', alpha=0.6)
        ax[i][j].set_yticks(np.arange(7))
        ax[i][j].set_yticklabels(ml_names)
        ax[i][j].tick_params(axis='y', direction='in')
        ax[i][j].text(.9, .93, emo_names[emotion_num], horizontalalignment='center',
                      transform=ax[i][j].transAxes)
        ax[i][j].axvline(0, alpha=0.5, linestyle='--')
        ax[i][j].yaxis.grid(True)
        ax[i][j].set_xlim(left=-0.5, right=0.5)
        emotion_num += 1
fig.subplots_adjust(wspace=0)
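
The effect computation in the plots above can be sketched numerically for a single emotion: each mid-level prediction is multiplied by its weight into that emotion, and the effects plus the bias sum to the emotion prediction of the linear head. The weights and bias below are illustrative placeholders, not the trained model's parameters:

```python
import numpy as np

# Illustrative values (not the trained model's weights).
ml_pred = np.array([0.64, 0.89, 0.53, 0.90, 0.25, 0.85, 0.13])  # 7 mid-level predictions
w_energy = np.array([-0.1, 0.5, 0.2, 0.1, 0.3, -0.2, -0.4])     # weights into "energy"
b_energy = 0.1                                                   # bias of the "energy" unit

effects = ml_pred * w_energy        # contribution of each mid-level feature
print(effects.round(3))             # one bar per feature in the effects plot
print(effects.sum() + b_energy)     # the "energy" prediction of the linear head
```

A positive effect (e.g. a large articulation prediction times a positive weight) pushes the emotion prediction up; a negative effect pushes it down.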

We now look at the visual and auditory explanations for the high articulation prediction.

2. Using LIME to obtain explanations for mid-level predictions

In [23]:
from skimage.segmentation import felzenszwalb, mark_boundaries

spectrum, spec_orig, start_stop_times, start_stop_frames = prepare_audio(song_spec_path)
start_frame = start_stop_frames[0]
stop_frame = start_stop_frames[1]

song_stft_sliced = song_stft[start_frame:stop_frame,:]

segments = felzenszwalb(spectrum / np.max(np.abs(spectrum)), scale=25, min_size=40)

plt.imshow(np.rot90(spec_orig))
plt.xticks(np.linspace(0,spec_orig.shape[0], 5).astype(int), np.linspace(start_stop_times[0], start_stop_times[1], 5).round(1))
plt.show()

ml_pred, _ = prediction_fn_audio(np.array([spectrum]))
ml_pred = pd.DataFrame(ml_pred)
ml_pred.columns = ml_names
print(ml_pred.T)

list_exp = []
# spectrum = spectrum / np.max(np.abs(spectrum))
print("\n------LIME based analysis-----")
explainer = lime_image.LimeImageExplainer(verbose=True)
explanation, seg = explainer.explain_instance(image=spectrum,
                                              classifier_fn=prediction_fn_audio,
                                              hide_color=0, num_targets=7,
                                              num_samples=50000, for_emotion=False,
                                              segmentation_fn=felzenszwalb,
                                              scale=25, min_size=40
                                              )


from sklearn.metrics import r2_score
print("R2-score for the linear surrogate function", r2_score(explainer.base.right, [i[0] for i in explainer.base.predictions]))

aud_orig = recon_audio(song_stft_sliced, song_filterbank, spec_orig)

analyse_midlevel_i(explanation, 1, spec_orig, song_stft_sliced, song_filterbank, ml_dict, ml_names, prediction_fn_audio)
                        0
melodiousness    0.651893
articulation     0.878682
r_complexity     0.499061
r_stability      0.913261
dissonance       0.219267
tonal_stability  0.888172
minorness        0.107763

------LIME based analysis-----
Running prediction for perturbed inputs: 100%|██████████| 50000/50000 [05:31<00:00, 150.88it/s]
Intercept 0.16587576661573283
Prediction_local [0.14646465]
Right: 0.143271
Intercept 0.40713733567294225
Prediction_local [0.43016197]
Right: 0.41985396
Intercept 0.8010588064320111
Prediction_local [0.82122439]
Right: 0.8262716
Intercept 0.23924252473999996
Prediction_local [0.21404148]
Right: 0.21076453
Intercept 0.7225372183819428
Prediction_local [0.77475787]
Right: 0.7703204
Intercept 0.1499859642253053
Prediction_local [0.11777433]
Right: 0.11702541
Intercept 0.5984838439711856
Prediction_local [0.57097516]
Right: 0.58214384
R2-score for the linear surrogate function 0.9994317566807138
305
Num features for pos = 56
Num features for neg = 15
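
The LIME procedure used above can be sketched as follows: the spectrogram is divided into superpixels, random subsets of superpixels are hidden (here with `hide_color=0`), the model is queried on each perturbed input, and a linear surrogate is fit on the on/off masks; its coefficients rank the segments by importance, and its fit quality corresponds to the R2-score reported above. A minimal sketch with a stand-in black-box function (`felzenszwalb` and `Ridge` are real library calls; `black_box` and the random `image` are placeholders for the model and the spectrogram):

```python
import numpy as np
from skimage.segmentation import felzenszwalb
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
image = rng.random((64, 64))                        # stand-in for the spectrogram
segments = felzenszwalb(image, scale=25, min_size=40)
n_seg = segments.max() + 1

def black_box(img):
    """Stand-in for one mid-level output of the model."""
    return img.mean()

# Perturb: randomly hide segments (hide_color=0) and record the prediction.
masks = rng.integers(0, 2, size=(500, n_seg))       # 1 = segment kept, 0 = hidden
preds = []
for mask in masks:
    perturbed = image * mask[segments]              # zero out hidden segments
    preds.append(black_box(perturbed))

# Fit the local linear surrogate; coefficients rank the segments.
surrogate = Ridge(alpha=1.0).fit(masks, preds)
top_segment = int(np.argmax(surrogate.coef_))
print(f"most positive segment: {top_segment}")
print(f"surrogate fit R2: {surrogate.score(masks, preds):.3f}")
```

The segments with the largest positive coefficients are the regions of the spectrogram that the visual and auditory explanations highlight.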