Finding means of multi-modal Gaussian distribution

# Finding means of multi-modal Gaussian distribution

## Need to find the means of the multi-modal normal distribution

• In our day to day lives, we encounter many situations where data is generated with multiple peaks(modes).
• One such problem would be the identification of peak hour times in a public transport systems like metro or buses.
• We need to identify these peaks so that we can target increasing the frequencies of the buses/trains during the peak hours.

## Generating random data with multiple peaks

Use the below code to generate multi-modal gaussian distributions

``````%matplotlib inline

import numpy as np
import pandas as pd

# Generating multiple gaussians
distribution1 = np.random.normal(loc=0,scale=1.0,size=(300))
distribution2 = np.random.normal(loc=5,scale=1.0,size=(300))
distribution3 = np.random.normal(loc=10,scale=1.0,size=(300))
distribution4 = np.random.normal(loc=15,scale=1.0,size=(150))
distribution5 = np.random.normal(loc=-10,scale=1.0,size=(10))

combined_distribution = np.concatenate([distribution1,distribution2,distribution3,distribution4,distribution5])

combined_data_dataframe = pd.DataFrame(combined_distribution)
combined_data_dataframe.plot(kind='kde')
`````` As you can see, we have created a random gaussian data with multiple means of `0,5,10,15 and -10`

## What happens if we,just calculate the mean?

``````print(combined_distribution.mean())
>>> 6.314271309260518
``````

As you can see, the mean does not represent the peak due to multi-modality of the data.

Hence, to find multiple peaks programatically, we use one of the mixture models available in the `scikit-learn` package called `GaussianMixtureModel`

## Why a mixture model?

• These models are based on the assumption that, there is a presence of another subpopulation within the main population.
• Can approximate the subpopulations while being not computationally heavy for mid sized data (100's of thousands).
• In case of a huge data, approximations about data can be made by sampling a small sample of data randomly and fitting the model on these samples. [See Central Limit Theorem]

Now, the code for it

``````from sklearn.mixture import GaussianMixture
mixture_model = GaussianMixture(n_components=5)
mixture_model.fit(combined_distribution.reshape(-1,1))
print(mixture_model.means_.astype(np.int32).reshape(-1))
print(mixture_model.weights_.reshape(-1))
>>> [14  5 10 -9  0]
>>> [0.1417197  0.28217055 0.2822368  0.00943396 0.28443898]
``````
• We can combine use these weights as a representation of the density of the data at the peak.
• In order to automatically determine the peaks, we can use a variation of gaussian mixture called `BayesianGaussianMixture` along with the `means_` and `degrees_of_freedom_` attribute to select the proper peaks
• Alternative we can use the `scipy.stats.gaussian_kde` to find the density, but it smooths out the lower density values more than necessary.