Is it possible to predict the popularity of a music genre based on historical data?
In this brief essay I use some regression methods in R to understand whether it is possible to predict the popularity of a music genre based on some empirical features and historical data.
The music market is huge nowadays, so it is important to understand which genres are most promising before financing them. The aim of the project is to predict the popularity of each genre, a feature that ranges from 0 to 100.
To tackle this topic I used linear regression methods. The data used for the analysis are available on Kaggle (this is my profile: https://www.kaggle.com/pesssinaluca) or on the Spotify site.
The dataset is composed of 2664 different genres and 14 variables. The analysis rests on the following assumptions:
- Popularity does not depend on the historical moment.
- Popularity isn’t influenced by other industry functions, such as distribution, marketing, etc.
- Popularity can be summarized as a linear function of the other attributes (the linearity assumption).
To better understand the data and how they are distributed, I divided the genres into High and Low popularity and plotted the available features in a radar plot. To see the R code, check my Kaggle profile.
Some features, like speechiness and liveness, don’t seem to depend on whether a genre falls in the high or low popularity group. Others, like acousticness and energy, differ noticeably between the two groups; I will explore these features further in a second step.
Another fancy way to look at the data is a multidimensional plot that uses, in addition to the axes, the size and color of each data point. This plot is also useful for visualizing possible correlations among the variables: in this case, for example, there is a negative correlation between energy and acousticness, and a positive one between acousticness and popularity.
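As a rough illustration of the High/Low split, here is a minimal sketch in R. It runs on a synthetic table, not the Kaggle data, and the median cut-off and the column names are assumptions made for the example; the original analysis may have used a different threshold.

```r
# Synthetic stand-in for the genre table (random values, NOT the Kaggle data).
set.seed(1)
genres <- data.frame(
  popularity   = runif(200, 0, 100),
  acousticness = runif(200),
  energy       = runif(200),
  speechiness  = runif(200),
  liveness     = runif(200)
)

# Label each genre High or Low; the median cut-off is an assumption.
genres$pop_class <- ifelse(genres$popularity > median(genres$popularity),
                           "High", "Low")

# Mean of every feature within each class: the values a radar plot would show.
class_means <- aggregate(
  cbind(acousticness, energy, speechiness, liveness) ~ pop_class,
  data = genres, FUN = mean
)
print(class_means)
```

Each row of `class_means` corresponds to one polygon of the radar plot, so features whose two means are close together are the ones that barely separate the classes.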
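The correlations read off the plot can also be checked numerically. The sketch below is only an illustration on synthetic data: the negative link between energy and acousticness is built into the fake data on purpose, and the column names are assumed.

```r
# Synthetic data with a deliberately built-in negative relation
# between energy and acousticness (NOT the Kaggle data).
set.seed(2)
acousticness <- runif(300)
energy       <- 1 - acousticness + rnorm(300, sd = 0.1)
popularity   <- runif(300, 0, 100)
genres <- data.frame(acousticness, energy, popularity)

# Pairwise Pearson correlations among the numeric features.
cor_mat <- cor(genres)
print(round(cor_mat, 2))
```

A strongly negative entry at `cor_mat["energy", "acousticness"]` is the numeric counterpart of the downward trend seen in the plot.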
Check the outliers
The first thing to do is to take a look at the data and try to identify particular patterns and the outliers that we don’t need for the analysis. In this case I deleted the very unpopular genres with some variables equal to 0.
I also removed some highly distinctive genres, which left a dataset of 1527 genres and 13 variables.
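A filter of this kind could look like the following sketch. The toy table, the column names, and the popularity cut-off `pop_floor` are all hypothetical; the exact rule used in the analysis is not stated above.

```r
# Hypothetical toy table (NOT the Kaggle data).
genres <- data.frame(
  genre        = c("a", "b", "c", "d", "e"),
  popularity   = c(55, 0, 2, 70, 40),
  energy       = c(0.7, 0.0, 0.4, 0.9, 0.5),
  acousticness = c(0.2, 0.0, 0.8, 0.1, 0.0)
)
feature_cols <- c("energy", "acousticness")
pop_floor    <- 5  # assumed cut-off for "very unpopular"

# TRUE for rows where at least one feature is exactly zero.
has_zero <- rowSums(genres[, feature_cols] == 0) > 0

# Drop only the rows that are both very unpopular and have a zero-valued feature.
cleaned <- genres[!(genres$popularity < pop_floor & has_zero), ]
print(cleaned)
```

Requiring both conditions keeps popular genres that legitimately have a feature equal to 0, and unpopular genres whose features look valid.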
Stay Hungry, stay simple
We often build very complex methods to solve our data science problems, but sometimes it is useful to begin with the simplest methods, which can be very effective and fast.
Using a lean approach, I first studied a one-variable model, then a multi-variable one.
I report only the results of the analysis.
Linear models don’t mean linear dependence!
It is fundamental to understand the dependencies among the variables, so it is crucial to study how the target variable relates to each of them. In particular, I report a plot of popularity against loudness: the relation is not linear but quadratic. Knowing this, we can build a better model by adding the quadratic term.
The difference between the two models lies in the formula used for fitting, where we can add the quadratic term. The residual standard error (RSE) is 12.1 for the linear model and 11.07 for the quadratic one.
Using some other variables it is possible to reach an RSE of 8.37 and an R-squared of 0.61. I applied the models to the test set and plotted the results against the real data.
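To make the difference in the formula concrete, here is a minimal sketch with `lm()`. The data are synthetic, with a curved relation built in on purpose, so the RSE values it prints are not the ones reported above.

```r
# Synthetic data with a deliberately curved relation (NOT the Kaggle data).
set.seed(42)
d <- data.frame(loudness = runif(300, -40, 0))
d$popularity <- 70 - 0.04 * (d$loudness + 10)^2 + rnorm(300, sd = 5)

# Straight-line model.
fit_lin <- lm(popularity ~ loudness, data = d)

# Same model plus a quadratic term; I() keeps ^ as arithmetic inside the formula.
fit_quad <- lm(popularity ~ loudness + I(loudness^2), data = d)

# Residual standard error of each fit.
rse_lin  <- summary(fit_lin)$sigma
rse_quad <- summary(fit_quad)$sigma
print(c(linear = rse_lin, quadratic = rse_quad))
```

The model is still linear in its coefficients, which is why `lm()` can fit it: only the dependence on loudness is curved.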
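The multi-variable step with a held-out test set might be sketched as follows. Everything here is an assumption made for illustration: the data are synthetic with arbitrary coefficients, the predictors are just three plausible feature names, and the 80/20 split is a common default rather than the split used in the analysis.

```r
# Synthetic data with arbitrary coefficients (NOT the Kaggle data).
set.seed(7)
n <- 400
d <- data.frame(
  loudness     = runif(n, -40, 0),
  energy       = runif(n),
  acousticness = runif(n)
)
d$popularity <- 50 - 0.03 * (d$loudness + 10)^2 +
  10 * d$energy + 8 * d$acousticness + rnorm(n, sd = 6)

# 80/20 train/test split (assumed proportion).
train_idx <- sample(n, size = round(0.8 * n))
train <- d[train_idx, ]
test  <- d[-train_idx, ]

# Fit on the training set only.
fit <- lm(popularity ~ loudness + I(loudness^2) + energy + acousticness,
          data = train)

# Predict the unseen genres and measure the error on them.
pred      <- predict(fit, newdata = test)
test_rmse <- sqrt(mean((test$popularity - pred)^2))

print(summary(fit)$r.squared)
print(test_rmse)
```

Plotting `pred` against `test$popularity` gives the predicted-versus-real comparison; points close to the diagonal indicate good predictions.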
I will write a post to better investigate the methods used in this brief report.
Some advice for a good data science project:
- Focus on the data: instead of starting to write code and using complex methods, take some time to investigate the raw data. They are beautiful and useful!
- Make easy plots: plotting data is an art; practice it and you will succeed.
- Use simple methods: start with simple methods and add complexity to the model step by step.
- Reach an end: before starting, ask yourself some questions; once you have answered them, your work is finished. Don’t try to obtain something impossible to reach.
- Useful results: once you end your project, check whether your results are useful. This is extremely important!