Ways to calculate outliers in Python Pandas Module

Author: Al-mamun Sarkar Date: 2020-04-01 17:33:02

Ways to calculate outliers in Python Pandas Module. The following code shows how to calculate outliers of DataFrame using pandas module.

 

In [1]:

import pandas as pd
import sys

 

In [4]:

# Create a dataframe with dates as your index
States = ['NY', 'NY', 'NY', 'NY', 'FL', 'FL', 'GA', 'GA', 'FL', 'FL'] 
data = [1.0, 2, 3, 4, 5, 6, 7, 8, 9, 10]

index = pd.date_range('1/1/2012', periods=10, freq='MS')
data_frame_1 = pd.DataFrame(data=data, index=index, columns=['Revenue'])
data_frame_1['State'] = States

# Create a second dataframe
data2 = [10.0, 10.0, 9, 9, 8, 8, 7, 7, 6, 6]
index2 = pd.date_range('1/1/2013', periods=10, freq='MS')
data_frame_2 = pd.DataFrame(data2, index=index2, columns=['Revenue'])
data_frame_2['State'] = States

 

In [5]:

new_data_frame = pd.concat([data_frame_1, data_frame_2])
new_data_frame

Out[5]:

  Revenue State
2012-01-01 1.0 NY
2012-02-01 2.0 NY
2012-03-01 3.0 NY
2012-04-01 4.0 NY
2012-05-01 5.0 FL
2012-06-01 6.0 FL
2012-07-01 7.0 GA
2012-08-01 8.0 GA
2012-09-01 9.0 FL
2012-10-01 10.0 FL
2013-01-01 10.0 NY
2013-02-01 10.0 NY
2013-03-01 9.0 NY
2013-04-01 9.0 NY
2013-05-01 8.0 FL
2013-06-01 8.0 FL
2013-07-01 7.0 GA
2013-08-01 7.0 GA
2013-09-01 6.0 FL
2013-10-01 6.0 FL

 

Calculate Outlier:

Method 1:

In [7]:

newdf = new_data_frame.copy()
newdf['x-Mean'] = abs(newdf['Revenue'] - newdf['Revenue'].mean())
newdf['1.96*std'] = 1.96*newdf['Revenue'].std()
newdf['Outlier'] = abs(newdf['Revenue'] - newdf['Revenue'].mean()) > 1.96*newdf['Revenue'].std()

newdf

Out[7]:

  Revenue State x-Mean 1.96*std Outlier
2012-01-01 1.0 NY 5.75 5.200273 True
2012-02-01 2.0 NY 4.75 5.200273 False
2012-03-01 3.0 NY 3.75 5.200273 False
2012-04-01 4.0 NY 2.75 5.200273 False
2012-05-01 5.0 FL 1.75 5.200273 False
2012-06-01 6.0 FL 0.75 5.200273 False
2012-07-01 7.0 GA 0.25 5.200273 False
2012-08-01 8.0 GA 1.25 5.200273 False
2012-09-01 9.0 FL 2.25 5.200273 False
2012-10-01 10.0 FL 3.25 5.200273 False
2013-01-01 10.0 NY 3.25 5.200273 False
2013-02-01 10.0 NY 3.25 5.200273 False
2013-03-01 9.0 NY 2.25 5.200273 False
2013-04-01 9.0 NY 2.25 5.200273 False
2013-05-01 8.0 FL 1.25 5.200273 False
2013-06-01 8.0 FL 1.25 5.200273 False
2013-07-01 7.0 GA 0.25 5.200273 False
2013-08-01 7.0 GA 0.25 5.200273 False
2013-09-01 6.0 FL 0.75 5.200273 False
2013-10-01 6.0 FL 0.75 5.200273 False

 

Method 2:

In [12]:

newdf = new_data_frame.copy()

State = newdf.groupby('State')

newdf['Outlier'] = State.transform(lambda x: abs(x-x.mean()) > 1.96*x.std())
newdf['x-Mean'] = State.transform(lambda x: abs(x - x.mean()))
newdf['1.96*std'] = State.transform(lambda x: 1.96*x.std())

newdf

Out[12]:

  Revenue State Outlier x-Mean 1.96*std
2012-01-01 1.0 NY False 5.00 7.554813
2012-02-01 2.0 NY False 4.00 7.554813
2012-03-01 3.0 NY False 3.00 7.554813
2012-04-01 4.0 NY False 2.00 7.554813
2012-05-01 5.0 FL False 2.25 3.434996
2012-06-01 6.0 FL False 1.25 3.434996
2012-07-01 7.0 GA False 0.25 0.980000
2012-08-01 8.0 GA False 0.75 0.980000
2012-09-01 9.0 FL False 1.75 3.434996
2012-10-01 10.0 FL False 2.75 3.434996
2013-01-01 10.0 NY False 4.00 7.554813
2013-02-01 10.0 NY False 4.00 7.554813
2013-03-01 9.0 NY False 3.00 7.554813
2013-04-01 9.0 NY False 3.00 7.554813
2013-05-01 8.0 FL False 0.75 3.434996
2013-06-01 8.0 FL False 0.75 3.434996
2013-07-01 7.0 GA False 0.25 0.980000
2013-08-01 7.0 GA False 0.25 0.980000
2013-09-01 6.0 FL False 1.25 3.434996
2013-10-01 6.0 FL False 1.25 3.434996

 

In [15]:

newdf = new_data_frame.copy()

StateMonth = newdf.groupby(['State', lambda x: x.month])

newdf['Outlier'] = StateMonth.transform( lambda x: abs(x-x.mean()) > 1.96*x.std() )
newdf['x-Mean'] = StateMonth.transform( lambda x: abs(x-x.mean()) )
newdf['1.96*std'] = StateMonth.transform( lambda x: 1.96*x.std() )
newdf

Out[15]:

  Revenue State Outlier x-Mean 1.96*std
2012-01-01 1.0 NY False 4.5 12.473364
2012-02-01 2.0 NY False 4.0 11.087434
2012-03-01 3.0 NY False 3.0 8.315576
2012-04-01 4.0 NY False 2.5 6.929646
2012-05-01 5.0 FL False 1.5 4.157788
2012-06-01 6.0 FL False 1.0 2.771859
2012-07-01 7.0 GA False 0.0 0.000000
2012-08-01 8.0 GA False 0.5 1.385929
2012-09-01 9.0 FL False 1.5 4.157788
2012-10-01 10.0 FL False 2.0 5.543717
2013-01-01 10.0 NY False 4.5 12.473364
2013-02-01 10.0 NY False 4.0 11.087434
2013-03-01 9.0 NY False 3.0 8.315576
2013-04-01 9.0 NY False 2.5 6.929646
2013-05-01 8.0 FL False 1.5 4.157788
2013-06-01 8.0 FL False 1.0 2.771859
2013-07-01 7.0 GA False 0.0 0.000000
2013-08-01 7.0 GA False 0.5 1.385929
2013-09-01 6.0 FL False 1.5 4.157788
2013-10-01 6.0 FL False 2.0 5.543717

 

In [10]:

newdf = new_data_frame.copy()

State = newdf.groupby('State')

def s(group):
    group['x-Mean'] = abs(newdf['Revenue'] - newdf['Revenue'].mean())
    group['1.96*std'] = 1.96*newdf['Revenue'].std()
    group['Outlier'] = abs(newdf['Revenue'] - newdf['Revenue'].mean()) > 1.96*newdf['Revenue'].std()
    return group
    
newdf2 = State.apply(s)
newdf2

Out[10]:

  Revenue State x-Mean 1.96*std Outlier
2012-01-01 1.0 NY 5.75 5.200273 True
2012-02-01 2.0 NY 4.75 5.200273 False
2012-03-01 3.0 NY 3.75 5.200273 False
2012-04-01 4.0 NY 2.75 5.200273 False
2012-05-01 5.0 FL 1.75 5.200273 False
2012-06-01 6.0 FL 0.75 5.200273 False
2012-07-01 7.0 GA 0.25 5.200273 False
2012-08-01 8.0 GA 1.25 5.200273 False
2012-09-01 9.0 FL 2.25 5.200273 False
2012-10-01 10.0 FL 3.25 5.200273 False
2013-01-01 10.0 NY 3.25 5.200273 False
2013-02-01 10.0 NY 3.25 5.200273 False
2013-03-01 9.0 NY 2.25 5.200273 False
2013-04-01 9.0 NY 2.25 5.200273 False
2013-05-01 8.0 FL 1.25 5.200273 False
2013-06-01 8.0 FL 1.25 5.200273 False
2013-07-01 7.0 GA 0.25 5.200273 False
2013-08-01 7.0 GA 0.25 5.200273 False
2013-09-01 6.0 FL 0.75 5.200273 False
2013-10-01 6.0 FL 0.75 5.200273 False

 

In [16]:

newdf = new_data_frame.copy()

State = newdf.groupby(['State', lambda x: x.month])

def s(group):
    group['x-Mean'] = abs(newdf['Revenue'] - newdf['Revenue'].mean())
    group['1.96*std'] = 1.96*newdf['Revenue'].std()
    group['Outlier'] = abs(newdf['Revenue'] - newdf['Revenue'].mean()) > 1.96*newdf['Revenue'].std()
    return group
    
newdf2 = State.apply(s)
newdf2

Out[16]:

  Revenue State x-Mean 1.96*std Outlier
2012-01-01 1.0 NY 5.75 5.200273 True
2012-02-01 2.0 NY 4.75 5.200273 False
2012-03-01 3.0 NY 3.75 5.200273 False
2012-04-01 4.0 NY 2.75 5.200273 False
2012-05-01 5.0 FL 1.75 5.200273 False
2012-06-01 6.0 FL 0.75 5.200273 False
2012-07-01 7.0 GA 0.25 5.200273 False
2012-08-01 8.0 GA 1.25 5.200273 False
2012-09-01 9.0 FL 2.25 5.200273 False
2012-10-01 10.0 FL 3.25 5.200273 False
2013-01-01 10.0 NY 3.25 5.200273 False
2013-02-01 10.0 NY 3.25 5.200273 False
2013-03-01 9.0 NY 2.25 5.200273 False
2013-04-01 9.0 NY 2.25 5.200273 False
2013-05-01 8.0 FL 1.25 5.200273 False
2013-06-01 8.0 FL 1.25 5.200273 False
2013-07-01 7.0 GA 0.25 5.200273 False
2013-08-01 7.0 GA 0.25 5.200273 False
2013-09-01 6.0 FL 0.75 5.200273 False
2013-10-01 6.0 FL 0.75 5.200273 False

 

In [17]:

newdf = new_data_frame.copy()

State = newdf.groupby('State')

newdf['Lower'] = State['Revenue'].transform( lambda x: x.quantile(q=.25) - (1.5*(x.quantile(q=.75)-x.quantile(q=.25))) )
newdf['Upper'] = State['Revenue'].transform( lambda x: x.quantile(q=.75) + (1.5*(x.quantile(q=.75)-x.quantile(q=.25))) )
newdf['Outlier'] = (newdf['Revenue'] < newdf['Lower']) | (newdf['Revenue'] > newdf['Upper']) 
newdf

Out[17]:

  Revenue State Lower Upper Outlier
2012-01-01 1.0 NY -7.000 19.000 False
2012-02-01 2.0 NY -7.000 19.000 False
2012-03-01 3.0 NY -7.000 19.000 False
2012-04-01 4.0 NY -7.000 19.000 False
2012-05-01 5.0 FL 2.625 11.625 False
2012-06-01 6.0 FL 2.625 11.625 False
2012-07-01 7.0 GA 6.625 7.625 False
2012-08-01 8.0 GA 6.625 7.625 True
2012-09-01 9.0 FL 2.625 11.625 False
2012-10-01 10.0 FL 2.625 11.625 False
2013-01-01 10.0 NY -7.000 19.000 False
2013-02-01 10.0 NY -7.000 19.000 False
2013-03-01 9.0 NY -7.000 19.000 False
2013-04-01 9.0 NY -7.000 19.000 False
2013-05-01 8.0 FL 2.625 11.625 False
2013-06-01 8.0 FL 2.625 11.625 False
2013-07-01 7.0 GA 6.625 7.625 False
2013-08-01 7.0 GA 6.625 7.625 False
2013-09-01 6.0 FL 2.625 11.625 False
2013-10-01 6.0 FL 2.625 11.625 False