Finding the means of a multi-modal Gaussian distribution

We need to find the means (peaks) of a multi-modal normal distribution.

  • In our day-to-day lives, we encounter many situations where data is generated with multiple peaks (modes).
  • One such problem is identifying peak-hour times in a public transport system such as a metro or bus network.
  • We need to identify these peaks so that we can increase the frequency of buses/trains during peak hours.

Generating random data with multiple peaks

Use the code below to generate a multi-modal Gaussian distribution:

%matplotlib inline

import numpy as np
import pandas as pd

# Generate several Gaussians with different means (modes);
# note the last two have far fewer samples, so their peaks are weaker
distribution1 = np.random.normal(loc=0, scale=1.0, size=300)
distribution2 = np.random.normal(loc=5, scale=1.0, size=300)
distribution3 = np.random.normal(loc=10, scale=1.0, size=300)
distribution4 = np.random.normal(loc=15, scale=1.0, size=150)
distribution5 = np.random.normal(loc=-10, scale=1.0, size=10)

combined_distribution = np.concatenate([distribution1, distribution2,
                                        distribution3, distribution4,
                                        distribution5])

# A kernel density estimate plot makes the modes visible
combined_data_dataframe = pd.DataFrame(combined_distribution)
combined_data_dataframe.plot(kind='kde')

As you can see, we have created random Gaussian data with modes at 0, 5, 10, 15, and -10.

What happens if we just calculate the mean?

print(combined_distribution.mean())
>>> 6.314271309260518

As you can see, the mean does not represent any of the peaks because the data is multi-modal.

Hence, to find multiple peaks programmatically, we use one of the mixture models available in the scikit-learn package, called GaussianMixture.

Why a mixture model?

  • These models assume that the main population is composed of several subpopulations, each following its own Gaussian; the overall density is a weighted sum p(x) = w1*N(x; mu1, sigma1^2) + ... + wK*N(x; muK, sigmaK^2).
  • They can approximate the subpopulations without being computationally heavy for mid-sized data (hundreds of thousands of points).
  • For huge datasets, we can approximate the data by randomly drawing a small sample and fitting the model on that sample, as sketched below. [See Central Limit Theorem]
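
A minimal sketch of that sampling idea (the huge_distribution array and the 5,000-point sample size are made up for illustration, not a recommendation):

from sklearn.mixture import GaussianMixture
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a dataset too large to fit directly: 3 million points, 3 modes
huge_distribution = rng.normal(loc=[0, 5, 10], scale=1.0,
                               size=(1_000_000, 3)).reshape(-1)

# Fit on a small random sample instead of the full data
sample = rng.choice(huge_distribution, size=5_000, replace=False)
sampled_model = GaussianMixture(n_components=3)
sampled_model.fit(sample.reshape(-1, 1))
print(sampled_model.means_.reshape(-1))  # should land near 0, 5 and 10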

Now, the code for it:

from sklearn.mixture import GaussianMixture

# We know the data has 5 modes, so ask for 5 components;
# exact values vary from run to run since neither the data nor the fit is seeded
mixture_model = GaussianMixture(n_components=5)
mixture_model.fit(combined_distribution.reshape(-1, 1))
print(mixture_model.means_.astype(np.int32).reshape(-1))
print(mixture_model.weights_.reshape(-1))
>>> [14  5 10 -9  0]
>>> [0.1417197  0.28217055 0.2822368  0.00943396 0.28443898]
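
To read the two outputs together, we can pair each estimated mean with its weight (a small sketch):

for mean, weight in sorted(zip(mixture_model.means_.reshape(-1),
                               mixture_model.weights_)):
    print(f"peak near {mean:5.1f} carries {weight:.1%} of the data")
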
  • We can use these weights as a representation of the density of the data at each peak.
  • To determine the peaks automatically, we can use a variation of the Gaussian mixture called BayesianGaussianMixture, together with its means_ and degrees_of_freedom_ attributes, to select the proper peaks (see the sketch below).
  • Alternatively, we can use scipy.stats.gaussian_kde to find the density, but it smooths out the lower-density regions more than necessary (a sketch of that follows as well).
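
A minimal sketch of the BayesianGaussianMixture approach. Here components are filtered by their fitted weights_ (degrees_of_freedom_ can be inspected the same way); the n_components upper bound of 10 and the 0.01 weight threshold are assumptions to tune, not fixed recipes:

from sklearn.mixture import BayesianGaussianMixture

# Give a generous upper bound on components; the Dirichlet process prior
# drives the weights of unneeded components toward zero
bgm = BayesianGaussianMixture(n_components=10,
                              weight_concentration_prior_type='dirichlet_process')
bgm.fit(combined_distribution.reshape(-1, 1))

# Keep only components that carry meaningful weight
threshold = 0.01  # assumed cut-off; tune for your data
peaks = bgm.means_.reshape(-1)[bgm.weights_ > threshold]
print(np.sort(peaks))

And a sketch of the gaussian_kde alternative, which treats local maxima of the estimated density as candidate peaks (scipy.signal.find_peaks is assumed available, i.e. SciPy >= 1.1):

from scipy.stats import gaussian_kde
from scipy.signal import find_peaks

kde = gaussian_kde(combined_distribution)
xs = np.linspace(combined_distribution.min(), combined_distribution.max(), 1000)
density = kde(xs)

# Local maxima of the estimated density are candidate peaks; the weak -10
# mode may get smoothed away, illustrating the caveat above
peak_indices, _ = find_peaks(density)
print(xs[peak_indices])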

Thank you for reading