Finding the means of a multi-modal Gaussian distribution
- In our day-to-day lives, we encounter many situations where data is generated with multiple peaks (modes).
- One such problem is identifying peak-hour times in a public transport system such as a metro or bus network.
- We need to identify these peaks so that we can increase the frequency of buses/trains during peak hours.
Generating random data with multiple peaks
Use the code below to generate a multi-modal Gaussian distribution:
%matplotlib inline
import numpy as np
import pandas as pd
# Generating multiple Gaussians with different means
distribution1 = np.random.normal(loc=0, scale=1.0, size=300)
distribution2 = np.random.normal(loc=5, scale=1.0, size=300)
distribution3 = np.random.normal(loc=10, scale=1.0, size=300)
distribution4 = np.random.normal(loc=15, scale=1.0, size=150)
distribution5 = np.random.normal(loc=-10, scale=1.0, size=10)
combined_distribution = np.concatenate([distribution1, distribution2, distribution3, distribution4, distribution5])
combined_data_dataframe = pd.DataFrame(combined_distribution)
combined_data_dataframe.plot(kind='kde')
As you can see, we have created random Gaussian data with multiple means: 0, 5, 10, 15 and -10.
What happens if we just calculate the mean?
print(combined_distribution.mean())
>>> 6.314271309260518
As you can see, the mean does not represent any of the peaks, because the data is multi-modal.
Hence, to find multiple peaks programmatically, we use one of the mixture models available in the scikit-learn
package, called GaussianMixture.
Why a mixture model?
- These models are based on the assumption that the overall population is composed of several subpopulations.
- They can approximate these subpopulations without being computationally heavy for mid-sized data (hundreds of thousands of points).
- For huge datasets, we can approximate the data by drawing a small random sample and fitting the model on that sample. [See Central Limit Theorem]
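The subsampling idea in the last bullet can be sketched as follows. The large array here is synthetic and purely illustrative (it is not the data generated earlier in this post):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in for a "huge" dataset: one million points drawn from two modes.
big_data = np.concatenate([
    rng.normal(loc=0, scale=1.0, size=500_000),
    rng.normal(loc=5, scale=1.0, size=500_000),
])

# Draw a small random subsample and fit the mixture on it instead of the
# full array, which keeps the fit cheap.
subsample = rng.choice(big_data, size=5_000, replace=False)
model = GaussianMixture(n_components=2, random_state=0)
model.fit(subsample.reshape(-1, 1))

print(np.sort(model.means_.reshape(-1)))  # means close to 0 and 5
```

Even though the model never sees 99.5% of the points, the recovered means land close to the true modes, because the subsample preserves the shape of the distribution.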
Now, the code for it:
from sklearn.mixture import GaussianMixture
mixture_model = GaussianMixture(n_components=5)
mixture_model.fit(combined_distribution.reshape(-1, 1))
print(mixture_model.means_.astype(np.int32).reshape(-1))
print(mixture_model.weights_.reshape(-1))
>>> [14 5 10 -9 0]
>>> [0.1417197 0.28217055 0.2822368 0.00943396 0.28443898]
- We can use these weights as a representation of the density of the data at each peak.
- To determine the number of peaks automatically, we can use a variation of the Gaussian mixture called BayesianGaussianMixture, along with its means_ and degrees_of_freedom_ attributes, to select the proper peaks.
- Alternatively, we can use scipy.stats.gaussian_kde to estimate the density, but it smooths out the lower-density regions more than necessary.
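A minimal sketch of the BayesianGaussianMixture approach, on freshly generated three-peak data (the 0.05 weight threshold is an illustrative choice, not a prescribed value): we deliberately over-specify n_components and let the variational fit drive the weights of unneeded components toward zero, then keep only the components that carry appreciable weight.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Data with three true peaks at 0, 5 and 10.
data = np.concatenate([
    rng.normal(0, 1.0, 300),
    rng.normal(5, 1.0, 300),
    rng.normal(10, 1.0, 300),
]).reshape(-1, 1)

# Over-specify the number of components; the Dirichlet-process prior
# shrinks the weights of components the data does not need.
model = BayesianGaussianMixture(n_components=10, random_state=0, max_iter=500)
model.fit(data)

# Keep only components with non-negligible weight.
keep = model.weights_ > 0.05
peaks = np.sort(model.means_[keep].reshape(-1))
print(peaks)  # approximately 0, 5 and 10
```

This removes the need to know the number of peaks in advance, which is exactly what GaussianMixture's fixed n_components forces on us.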
Thank you for reading