Finding the means of a multi-modal Gaussian distribution

We need to find the means (peaks) of a multi-modal normal distribution.

  • In our day-to-day lives, we encounter many situations where data is generated with multiple peaks (modes).
  • One such problem is identifying peak-hour times in a public transport system such as a metro or bus network.
  • We need to identify these peaks so that we can increase the frequency of buses/trains during peak hours.

Generating random data with multiple peaks

Use the code below to generate a multi-modal Gaussian distribution:

%matplotlib inline

import numpy as np
import pandas as pd

# Generate several Gaussians with different means (modes);
# note the last two have far fewer samples, so their peaks are weaker
distribution1 = np.random.normal(loc=0, scale=1.0, size=300)
distribution2 = np.random.normal(loc=5, scale=1.0, size=300)
distribution3 = np.random.normal(loc=10, scale=1.0, size=300)
distribution4 = np.random.normal(loc=15, scale=1.0, size=150)
distribution5 = np.random.normal(loc=-10, scale=1.0, size=10)

combined_distribution = np.concatenate([distribution1, distribution2,
                                        distribution3, distribution4,
                                        distribution5])

# A kernel density estimate plot makes the modes visible
combined_data_dataframe = pd.DataFrame(combined_distribution)
combined_data_dataframe.plot(kind='kde')

As you can see, we have created random Gaussian data with modes at 0, 5, 10, 15, and -10.

What happens if we just calculate the mean?

print(combined_distribution.mean())
>>> 6.314271309260518

As you can see, the mean does not represent any of the peaks because the data is multi-modal.

Hence, to find multiple peaks programmatically, we use one of the mixture models available in the scikit-learn package, called GaussianMixture.

Why a mixture model?

  • These models assume that the main population is composed of several subpopulations, each following its own Gaussian; the overall density is a weighted sum p(x) = w1*N(x; mu1, sigma1^2) + ... + wK*N(x; muK, sigmaK^2).
  • They can approximate the subpopulations without being computationally heavy for mid-sized data (hundreds of thousands of points).
  • For huge datasets, we can approximate the data by randomly drawing a small sample and fitting the model on that sample, as sketched below. [See Central Limit Theorem]
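
A minimal sketch of that sampling idea (the huge_distribution array and the 5,000-point sample size are made up for illustration, not a recommendation):

from sklearn.mixture import GaussianMixture
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a dataset too large to fit directly: 3 million points, 3 modes
huge_distribution = rng.normal(loc=[0, 5, 10], scale=1.0,
                               size=(1_000_000, 3)).reshape(-1)

# Fit on a small random sample instead of the full data
sample = rng.choice(huge_distribution, size=5_000, replace=False)
sampled_model = GaussianMixture(n_components=3)
sampled_model.fit(sample.reshape(-1, 1))
print(sampled_model.means_.reshape(-1))  # should land near 0, 5 and 10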

Now, the code for it:

from sklearn.mixture import GaussianMixture

# We know the data has 5 modes, so ask for 5 components;
# exact values vary from run to run since neither the data nor the fit is seeded
mixture_model = GaussianMixture(n_components=5)
mixture_model.fit(combined_distribution.reshape(-1, 1))
print(mixture_model.means_.astype(np.int32).reshape(-1))
print(mixture_model.weights_.reshape(-1))
>>> [14  5 10 -9  0]
>>> [0.1417197  0.28217055 0.2822368  0.00943396 0.28443898]
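
To read the two outputs together, we can pair each estimated mean with its weight (a small sketch):

for mean, weight in sorted(zip(mixture_model.means_.reshape(-1),
                               mixture_model.weights_)):
    print(f"peak near {mean:5.1f} carries {weight:.1%} of the data")
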
  • We can use these weights as a representation of the density of the data at each peak.
  • To determine the peaks automatically, we can use a variation of the Gaussian mixture called BayesianGaussianMixture, together with its means_ and degrees_of_freedom_ attributes, to select the proper peaks (see the sketch below).
  • Alternatively, we can use scipy.stats.gaussian_kde to find the density, but it smooths out the lower-density regions more than necessary (a sketch of that follows as well).
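
A minimal sketch of the BayesianGaussianMixture approach. Here components are filtered by their fitted weights_ (degrees_of_freedom_ can be inspected the same way); the n_components upper bound of 10 and the 0.01 weight threshold are assumptions to tune, not fixed recipes:

from sklearn.mixture import BayesianGaussianMixture

# Give a generous upper bound on components; the Dirichlet process prior
# drives the weights of unneeded components toward zero
bgm = BayesianGaussianMixture(n_components=10,
                              weight_concentration_prior_type='dirichlet_process')
bgm.fit(combined_distribution.reshape(-1, 1))

# Keep only components that carry meaningful weight
threshold = 0.01  # assumed cut-off; tune for your data
peaks = bgm.means_.reshape(-1)[bgm.weights_ > threshold]
print(np.sort(peaks))

And a sketch of the gaussian_kde alternative, which treats local maxima of the estimated density as candidate peaks (scipy.signal.find_peaks is assumed available, i.e. SciPy >= 1.1):

from scipy.stats import gaussian_kde
from scipy.signal import find_peaks

kde = gaussian_kde(combined_distribution)
xs = np.linspace(combined_distribution.min(), combined_distribution.max(), 1000)
density = kde(xs)

# Local maxima of the estimated density are candidate peaks; the weak -10
# mode may get smoothed away, illustrating the caveat above
peak_indices, _ = find_peaks(density)
print(xs[peak_indices])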

Thank you for reading