# Finding the means of a multi-modal Gaussian distribution

## Why we need to find the means of a multi-modal normal distribution

- In our day-to-day lives, we encounter many situations where the data has multiple peaks (modes).
- One such problem is identifying the peak hours in a public transport system such as a metro or bus network.
- We need to identify these peaks so that we can increase the frequency of the buses/trains during the peak hours.

## Generating random data with multiple peaks

Use the code below to generate a multi-modal Gaussian distribution:

```
%matplotlib inline
import numpy as np
import pandas as pd
# Generating multiple gaussians
distribution1 = np.random.normal(loc=0,scale=1.0,size=(300))
distribution2 = np.random.normal(loc=5,scale=1.0,size=(300))
distribution3 = np.random.normal(loc=10,scale=1.0,size=(300))
distribution4 = np.random.normal(loc=15,scale=1.0,size=(150))
distribution5 = np.random.normal(loc=-10,scale=1.0,size=(10))
combined_distribution = np.concatenate([distribution1,distribution2,distribution3,distribution4,distribution5])
combined_data_dataframe = pd.DataFrame(combined_distribution)
combined_data_dataframe.plot(kind='kde')
```

As you can see, we have created random Gaussian data with peaks at `0`, `5`, `10`, `15` and `-10`. The peak at `-10` comes from only 10 samples, so it is much smaller than the others.

## What happens if we just calculate the mean?

```
print(combined_distribution.mean())
>>> 6.314271309260518
```

As you can see, the mean does not coincide with any of the peaks, because the data is multi-modal.
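The mean of a mixture is the weighted average of the component means, where each weight is that component's share of the samples. A minimal check, reusing the `loc` and `size` values from the generation code above (the variable names are illustrative):

```python
import numpy as np

# Component means and sample counts, copied from the generation code above
locs = np.array([0, 5, 10, 15, -10])
sizes = np.array([300, 300, 300, 150, 10])

# Each component's weight is its share of the total sample count
weights = sizes / sizes.sum()

# The mean of the mixture is the weighted average of the component means
expected_mean = float(np.dot(weights, locs))
print(round(expected_mean, 3))  # prints 6.274
```

The expected value sits between the peaks at 5 and 10 rather than on any of them, and a single random draw such as the one above will land close to it.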

Hence, to find multiple peaks programmatically, we use one of the mixture models available in the `scikit-learn` package, called `GaussianMixture`.

## Why a mixture model?

- These models are based on the assumption that the overall population is made up of several subpopulations.
- They can approximate the subpopulations without being computationally heavy for mid-sized data (hundreds of thousands of points).
- For very large data sets, the model can be fitted on a small random sample of the data instead of on the full data. [See Central Limit Theorem]
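The sampling idea in the last point can be sketched as follows. The data set, its two peaks, the sample size and the seed are all assumptions made up for the illustration; only `GaussianMixture` and the NumPy calls are real APIs.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is repeatable

# Stand-in for a large data set: two well separated peaks at 0 and 10
big_data = np.concatenate([
    rng.normal(loc=0, scale=1.0, size=100_000),
    rng.normal(loc=10, scale=1.0, size=100_000),
])

# Draw a small random sample without replacement and fit on that instead
sample = rng.choice(big_data, size=5_000, replace=False)
model = GaussianMixture(n_components=2, random_state=0)
model.fit(sample.reshape(-1, 1))

# Sort the estimated means so the printed order is stable
estimated_means = np.sort(model.means_.ravel())
print(np.round(estimated_means, 1))
```

How small the sample can be depends on how close the peaks are and how rare the smallest subpopulation is, so the 5,000 used here should be treated as a starting point rather than a rule.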

Now, the code for it:

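```
from sklearn.mixture import GaussianMixture
# n_components=5 because we generated five Gaussians above
mixture_model = GaussianMixture(n_components=5)
# scikit-learn expects a 2-D array of shape (n_samples, n_features)
mixture_model.fit(combined_distribution.reshape(-1,1))
# astype(np.int32) truncates toward zero, so e.g. 14.9 prints as 14 and -9.8 as -9
print(mixture_model.means_.astype(np.int32).reshape(-1))
print(mixture_model.weights_.reshape(-1))
>>> [14 5 10 -9 0]
>>> [0.1417197 0.28217055 0.2822368 0.00943396 0.28443898]
```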
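The fitted components come back in arbitrary order, which is why the means printed above do not follow the order in which the Gaussians were generated. A minimal sketch that sorts the components by mean and prints each one next to its weight is shown below. It regenerates the data with a fixed seed so that it is self-contained, and `n_init=5` is an added assumption that makes the fit less dependent on its starting point.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)  # fixed seed, only so the sketch is repeatable

# Same recipe as the generated data above: five Gaussians of unequal size
data = np.concatenate([
    rng.normal(loc=0, scale=1.0, size=300),
    rng.normal(loc=5, scale=1.0, size=300),
    rng.normal(loc=10, scale=1.0, size=300),
    rng.normal(loc=15, scale=1.0, size=150),
    rng.normal(loc=-10, scale=1.0, size=10),
])

# n_init=5 reruns the fit from several starting points and keeps the best one
model = GaussianMixture(n_components=5, n_init=5, random_state=0)
model.fit(data.reshape(-1, 1))

# Components are returned in arbitrary order, so sort them by their mean
order = np.argsort(model.means_.ravel())
sorted_means = model.means_.ravel()[order]
sorted_weights = model.weights_[order]

# Print each peak with its weight; the format spec rounds to the nearest value
for mean, weight in zip(sorted_means, sorted_weights):
    print(f"peak near {mean:6.1f}  weight {weight:.3f}")
```

Formatting the means to one decimal place rounds to the nearest value, so the printed peaks are not pulled toward zero the way an integer cast would pull them.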

- We can use these weights as a representation of the density of the data at each peak: each weight is the fraction of the samples assigned to that component.
- To determine the peaks automatically, we can use a variation of the Gaussian mixture called `BayesianGaussianMixture`, along with its `means_` and `degrees_of_freedom_` attributes, to select the proper peaks.
- Alternatively, we can use `scipy.stats.gaussian_kde` to find the density, but it smooths out the lower density values more than necessary.
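One possible way to use `BayesianGaussianMixture` for this is sketched below. In scikit-learn, `degrees_of_freedom_` equals the prior value `degrees_of_freedom_prior_` plus the (soft) number of samples assigned to each component, so their difference acts as an effective sample count. The upper bound of 10 components and the cutoff `min_points` are assumptions chosen for the illustration, not recommended values.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(7)  # fixed seed, only so the sketch is repeatable

# Same recipe as the generated data above: five Gaussians of unequal size
data = np.concatenate([
    rng.normal(loc=0, scale=1.0, size=300),
    rng.normal(loc=5, scale=1.0, size=300),
    rng.normal(loc=10, scale=1.0, size=300),
    rng.normal(loc=15, scale=1.0, size=150),
    rng.normal(loc=-10, scale=1.0, size=10),
])

# n_components is only an upper bound here; 10 is an arbitrary choice
model = BayesianGaussianMixture(n_components=10, max_iter=1000, random_state=0)
model.fit(data.reshape(-1, 1))

# Posterior degrees of freedom minus the prior gives the effective number
# of samples that each component accounts for
effective_counts = model.degrees_of_freedom_ - model.degrees_of_freedom_prior_

# Illustrative cutoff (an assumption): keep components backed by >= 5 samples
min_points = 5
keep = effective_counts >= min_points
selected_means = np.sort(model.means_.ravel()[keep])
print(np.round(selected_means, 1))
```

The number of components that survive the cutoff is not guaranteed to equal the number of true peaks, since a single peak can still be split across two components. The cutoff and the prior settings usually need tuning for the data at hand.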

Thank you for reading
