Need to find the means of the multi-modal normal distribution
- In our day to day lives, we encounter many situations where data is generated with multiple peaks(modes).
- One such problem would be the identification of peak hour times in a public transport systems like metro or buses.
- We need to identify these peaks so that we can target increasing the frequencies of the buses/trains during the peak hours.
Generating random data with multiple peaks
Use the below code to generate multi-modal gaussian distributions
%matplotlib inline import numpy as np import pandas as pd # Generating multiple gaussians distribution1 = np.random.normal(loc=0,scale=1.0,size=(300)) distribution2 = np.random.normal(loc=5,scale=1.0,size=(300)) distribution3 = np.random.normal(loc=10,scale=1.0,size=(300)) distribution4 = np.random.normal(loc=15,scale=1.0,size=(150)) distribution5 = np.random.normal(loc=-10,scale=1.0,size=(10)) combined_distribution = np.concatenate([distribution1,distribution2,distribution3,distribution4,distribution5]) combined_data_dataframe = pd.DataFrame(combined_distribution) combined_data_dataframe.plot(kind='kde')
As you can see, we have created a random gaussian data with multiple means of
0,5,10,15 and -10
What happens if we,just calculate the mean?
print(combined_distribution.mean()) >>> 6.314271309260518
As you can see, the mean does not represent the peak due to multi-modality of the data.
Hence, to find multiple peaks programatically, we use one of the mixture models available in the
scikit-learn package called
Why a mixture model?
- These models are based on the assumption that, there is a presence of another subpopulation within the main population.
- Can approximate the subpopulations while being not computationally heavy for mid sized data (100's of thousands).
- In case of a huge data, approximations about data can be made by sampling a small sample of data randomly and fitting the model on these samples. [See Central Limit Theorem]
Now, the code for it
from sklearn.mixture import GaussianMixture mixture_model = GaussianMixture(n_components=5) mixture_model.fit(combined_distribution.reshape(-1,1)) print(mixture_model.means_.astype(np.int32).reshape(-1)) print(mixture_model.weights_.reshape(-1)) >>> [14 5 10 -9 0] >>> [0.1417197 0.28217055 0.2822368 0.00943396 0.28443898]
- We can combine use these weights as a representation of the density of the data at the peak.
- In order to automatically determine the peaks, we can use a variation of gaussian mixture called
BayesianGaussianMixturealong with the
degrees_of_freedom_attribute to select the proper peaks
- Alternative we can use the
scipy.stats.gaussian_kdeto find the density, but it smooths out the lower density values more than necessary.
Thank you for reading