Music genre classification using CNN: Part 2 - Classification

Namrata Dutt
5 min read · Jun 1, 2022

Learn how to classify music genres using CNNs.

Photo by Marius Masalar on Unsplash

In the previous part, we learned how to extract features from audio samples. Now that we have the different features, we move on to the classification task. First, we will use each feature separately to classify the audio samples, and then we will use an ensemble of all the features for classification.

Step 1: Import libraries
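
A minimal sketch of the imports used throughout this part, assuming a NumPy, scikit-learn, and TensorFlow/Keras stack (your exact set may differ):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

import tensorflow as tf
from tensorflow.keras import layers, models, optimizers, callbacks
```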

Step 2: Import the npz file, extract the features, and split the train-test data

We extracted these features in the previous article and saved them in an npz file. Here, we are simply loading that file.
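
A sketch of this step, assuming the npz file from Part 1 stores the arrays under keys such as 'spectrogram', 'mfcc', 'melspectrogram', and 'labels' (the actual file name, keys, and test fraction may differ):

```python
# Hypothetical file and key names -- adjust to match the file saved in Part 1.
data = np.load('features.npz')
S, mfcc, mel, y = (data['spectrogram'], data['mfcc'],
                   data['melspectrogram'], data['labels'])

# Split by index so all three features share the same train/test partition.
idx_train, idx_test = train_test_split(
    np.arange(len(y)), test_size=0.2, stratify=y, random_state=42)

S_train, S_test = S[idx_train], S[idx_test]
mfcc_train, mfcc_test = mfcc[idx_train], mfcc[idx_test]
mel_train, mel_test = mel[idx_train], mel[idx_test]
y_train, y_test = y[idx_train], y[idx_test]
```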

Step 3: Resizing and Reshaping data

The scaling factor is computed only from the training dataset. During testing, the same maximum (from the training data) is used to scale the testing data.

So, we find the maximum of S_train and divide S_train by it. At test time, we divide S_test by the same maximum of S_train.

After that, we reshape the data into the form (N, row, col, 1), because the CNN expects its input in this shape; the trailing 1 indicates that each image has a single channel. The original shape of the spectrogram array is (944, 1025, 1295).
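
A minimal sketch of this scaling and reshaping (variable names follow the text; the rest are assumptions):

```python
# Scale with the training maximum only, so no test statistics leak into training.
S_max = S_train.max()
S_train = S_train / S_max
S_test = S_test / S_max

# Add the channel dimension: (N, row, col) -> (N, row, col, 1).
S_train = S_train[..., np.newaxis]
S_test = S_test[..., np.newaxis]
```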

The original shape of MFCC is (944, 10, 1293). We first resize both the MFCC train and test data to (944, 120, 600). After that, we reshape the data into (N, row, col, 1) for CNN. Then we standardize the data.
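
One way to sketch the MFCC preprocessing, here using tf.image.resize since TensorFlow is already imported (the original may use a different resizing routine):

```python
# Resize each MFCC matrix to (120, 600). tf.image.resize expects a channel axis,
# so we add one first; the result already has the (N, row, col, 1) shape the CNN needs.
mfcc_train = tf.image.resize(mfcc_train[..., np.newaxis], (120, 600)).numpy()
mfcc_test = tf.image.resize(mfcc_test[..., np.newaxis], (120, 600)).numpy()

# Standardize with statistics computed on the training set only.
mean, std = mfcc_train.mean(), mfcc_train.std()
mfcc_train = (mfcc_train - mean) / std
mfcc_test = (mfcc_test - mean) / std
```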

The original shape of the Mel-Spectrogram is (944, 128, 1293). We first scale the train and test data using the maximum of train data. Then we reshape the data to (N, row, col, 1) for CNN.

Step 4: Save the training and testing features in npz files
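
For example, each split could be saved roughly like this (file and key names are assumptions):

```python
# Save each preprocessed split so the classification scripts can simply reload it.
np.savez_compressed('spectrogram_split.npz', X_train=S_train, X_test=S_test,
                    y_train=y_train, y_test=y_test)
np.savez_compressed('mfcc_split.npz', X_train=mfcc_train, X_test=mfcc_test,
                    y_train=y_train, y_test=y_test)
np.savez_compressed('mel_split.npz', X_train=mel_train, X_test=mel_test,
                    y_train=y_train, y_test=y_test)
```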

For all the models, we set the number of epochs to 100, the batch size to 32, and the learning rate to 0.001.

Step 5: Classification using Spectrogram

We will first load the train-test split data (.npz file) for the spectrogram. Then we define a CNN model for classification. We use ReLU as the activation function in every layer except the last, where we use softmax, and we train with the Adam optimizer. A checkpoint is saved every 5 epochs, which we can fall back on if the model is interrupted during training. After that, we save the trained model.
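
A hedged sketch of this step, combining the hyperparameters above with a generic small CNN; this is not the author's exact architecture, and the layer sizes and file names are assumptions:

```python
NUM_CLASSES = 10               # 10 genres in this dataset
EPOCHS, BATCH_SIZE, LR = 100, 32, 0.001

# Reload the saved spectrogram split (file/key names as in the earlier sketch).
spec = np.load('spectrogram_split.npz')
S_train, S_test = spec['X_train'], spec['X_test']
y_train, y_test = spec['y_train'], spec['y_test']

def build_cnn(input_shape, num_classes=NUM_CLASSES):
    """A generic small CNN: conv/pool blocks, then a dense softmax head."""
    model = models.Sequential([
        layers.Conv2D(16, 3, activation='relu', input_shape=input_shape),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation='relu'),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation='relu'),
        layers.MaxPooling2D(2),
        layers.GlobalAveragePooling2D(),   # keeps parameters manageable for large inputs
        layers.Dense(128, activation='relu'),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation='softmax'),
    ])
    # Assumes integer-encoded genre labels; use categorical_crossentropy for one-hot labels.
    model.compile(optimizer=optimizers.Adam(learning_rate=LR),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model = build_cnn(S_train.shape[1:])

# Save a checkpoint every 5 epochs so training can resume if interrupted.
steps_per_epoch = int(np.ceil(len(S_train) / BATCH_SIZE))
ckpt = callbacks.ModelCheckpoint('spec_ckpt_{epoch:02d}.h5',
                                 save_freq=5 * steps_per_epoch)
model.fit(S_train, y_train, epochs=EPOCHS, batch_size=BATCH_SIZE, callbacks=[ckpt])
model.save('spectrogram_model_1.h5')
```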

Now, we can comment out the model-training part and load the trained model. We compute the accuracy on the training data and then on the test data, and display the confusion matrix for the test data. We achieved an accuracy of 71.96% on the test dataset.
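
A sketch of the evaluation step (file name as in the training sketch above):

```python
# Comment out the training above and simply reload the saved model.
model = models.load_model('spectrogram_model_1.h5')

train_acc = model.evaluate(S_train, y_train, verbose=0)[1]
test_acc = model.evaluate(S_test, y_test, verbose=0)[1]
print(f'train accuracy: {train_acc:.4f}, test accuracy: {test_acc:.4f}')

# Confusion matrix on the test set.
y_pred = np.argmax(model.predict(S_test), axis=1)
print(confusion_matrix(y_test, y_pred))
```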

For this project, we used two spectrogram models; the code below defines the second one. With it, we achieved an accuracy of 73.54% on the test dataset. We will consider both models when taking the ensemble.

The confusion matrix for the best spectrogram model is shown below.

Confusion matrix for the best model of Spectrogram (Image by the Author)

Step 6: Classification using MFCC

For MFCC, we trained three models and reported the accuracy of their ensemble. We also used k-fold cross-validation with k = 10. We achieved an accuracy of 74.07% on the test data.

We again load the train-test split for MFCC and define the model.
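
A sketch of this step, reloading the saved MFCC split, reusing the generic build_cnn() from the spectrogram sketch, and running the 10-fold cross-validation mentioned above (file and key names are assumptions):

```python
from sklearn.model_selection import KFold

# Reload the saved MFCC split (hypothetical file/key names).
mfcc_data = np.load('mfcc_split.npz')
mfcc_train, mfcc_test = mfcc_data['X_train'], mfcc_data['X_test']
y_train, y_test = mfcc_data['y_train'], mfcc_data['y_test']

# 10-fold cross-validation on the training set, reusing the generic build_cnn().
kf = KFold(n_splits=10, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in kf.split(mfcc_train):
    m = build_cnn(mfcc_train.shape[1:])
    m.fit(mfcc_train[train_idx], y_train[train_idx],
          epochs=EPOCHS, batch_size=BATCH_SIZE, verbose=0)
    scores.append(m.evaluate(mfcc_train[val_idx], y_train[val_idx], verbose=0)[1])
print('mean CV accuracy:', np.mean(scores))
```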

Now, since we will be taking an ensemble of the three MFCC models, we define a function that takes a majority vote of their predictions. Then we train all three models.
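
A simple way to implement such a majority vote (a sketch; the original get_majority() may differ):

```python
def get_majority(*predictions):
    """Majority vote across models; each argument is a 1-D array of predicted class labels."""
    stacked = np.stack(predictions, axis=0)              # shape: (n_models, n_samples)
    # For each sample, pick the most frequent class (ties go to the lowest class index).
    return np.array([np.bincount(col).argmax() for col in stacked.T])
```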

After training, we load the three models and pass their predictions (y_pred) into the get_majority() function.
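
For example (the model file names are hypothetical):

```python
# Load the three trained MFCC models and vote on their test predictions.
mfcc_models = [models.load_model(f'mfcc_model_{i}.h5') for i in range(1, 4)]
preds = [np.argmax(m.predict(mfcc_test), axis=1) for m in mfcc_models]

y_pred = get_majority(*preds)
print('MFCC ensemble accuracy:', accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```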

We achieved an accuracy of 74.07% on the test data. The confusion matrix for MFCC is shown below.

Confusion matrix for MFCC (Image by the Author)

Step 7: Classification using Mel-Spectrogram

We achieved an accuracy of 75.13% on the test data. The confusion matrix for Mel-spectrogram is shown below.

Confusion matrix for Mel-spectrogram (Image by the Author)

Step 8: Create an ensemble of Spectrogram, Mel-Spectrogram, and MFCC CNNs

Now, for the ensemble, we load all the models (in a new file, to avoid confusion and mistakes) and compute their predictions (y_pred). The ensemble contains 2 Spectrogram models, 3 MFCC models, and 1 Mel-Spectrogram model. After that, we pass the predictions to the get_majority() function and report the accuracy of the ensemble.
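
A sketch of the ensemble step, with hypothetical file names for the six trained models:

```python
# Hypothetical file names for the six trained models and their matching test inputs.
model_files = ['spectrogram_model_1.h5', 'spectrogram_model_2.h5',
               'mfcc_model_1.h5', 'mfcc_model_2.h5', 'mfcc_model_3.h5',
               'mel_model_1.h5']
test_inputs = [S_test, S_test, mfcc_test, mfcc_test, mfcc_test, mel_test]

preds = [np.argmax(models.load_model(f).predict(x), axis=1)
         for f, x in zip(model_files, test_inputs)]

y_pred = get_majority(*preds)
print('Ensemble accuracy:', accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```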

The accuracy achieved using the ensemble is 79.36%.

The confusion matrix of the ensemble is shown below.

Confusion matrix for the Ensemble (Image by the Author)

Conclusion

We have learned how to use CNNs for music genre classification. The test data was held out separately to avoid any data leakage. First, a model was trained on each feature separately for classification. Then, an ensemble of the different features was created and predictions were made using a majority-voting strategy. We used 2 spectrogram models, 3 MFCC models, and 1 Mel-Spectrogram model. The ensemble achieved an accuracy of 79.36%, which is significantly better than the single-feature models and shows the robustness of the ensemble approach. The most misclassified genre was “Rock”; it was mostly confused with “Metal”, “Disco”, “Blues”, and “Country”. The second most misclassified genre was “Country”, which was confused with “Rock”, “Reggae”, and “Blues”.

The complete code is available on GitHub.

Thanks for reading! I hope you found this article helpful.

Go Gators!🐊
