Pandas DataFrame groupby()
We can combine rows based on column values and apply different aggregate functions to each group.
DataFrame.groupby(by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
by : the column label, list of labels, mapping or function over which the groups are formed
axis : (default 0) split along rows; use 1 to group over columns
level : group by a particular level or levels of a MultiIndex
as_index : (default True) use the group labels as the index of the result
sort : (default True) sort the group keys
group_keys : (default True) when calling apply(), add the group keys to the index to identify the pieces
squeeze : (default False) reduce the dimensionality of the return type if possible (deprecated since pandas 1.1, removed in 2.0)
observed : (default False) applies if any of the groupers are Categoricals; if True, show only observed values
dropna : (default True) NA values in the keys are dropped; if False they are treated as keys
Returns a groupby object
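The returned GroupBy object is lazy: no computation happens until we apply an aggregate. We can inspect it or iterate over (key, sub-DataFrame) pairs first. A minimal sketch using the same sample data:

```python
import pandas as pd

# Same sample data used in the examples on this page
sales = pd.DataFrame({
    'sale_id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'c_id':    [2, 2, 1, 4, 2, 3, 2, 3, 2],
    'p_id':    [3, 4, 3, 2, 3, 3, 2, 2, 3],
    'product': ['Monitor', 'CPU', 'Monitor', 'RAM', 'Monitor',
                'Monitor', 'RAM', 'RAM', 'Monitor'],
    'qty':     [2, 1, 3, 2, 3, 2, 3, 2, 2],
    'store':   ['ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC'],
})

my_sale = sales.groupby('product')   # nothing is computed yet
print(my_sale.ngroups)               # 3 distinct products
for name, group in my_sale:          # iterate over (group key, sub-DataFrame)
    print(name, len(group))          # CPU 1, Monitor 5, RAM 3
```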
We used this CSV file for our examples. You can download it and run the script.
sales.csv has six columns: sale_id, c_id (customer id), p_id (product id), product, qty (quantity sold) and store (store name).
You can also use this DataFrame for our demo examples here.
import pandas as pd
my_dict={'sale_id':[1,2,3,4,5,6,7,8,9],
'c_id':[2,2,1,4,2,3,2,3,2],
'p_id':[3,4,3,2,3,3,2,2,3],
'product':['Monitor','CPU','Monitor','RAM','Monitor','Monitor','RAM','RAM','Monitor'],
'qty':[2,1,3,2,3,2,3,2,2],
'store':['ABC','DEF','ABC','DEF','ABC','DEF','ABC','DEF','ABC']}
sales = pd.DataFrame(data=my_dict)
print(sales)
Using groupby on product
import pandas as pd
sales=pd.read_csv("sales.csv")
my_sale=sales.groupby(['product'])
print(my_sale.first())
Output is here
sale_id c_id p_id qty store
product
CPU 2 2 4 1 DEF
Monitor 1 2 3 2 ABC
RAM 4 4 2 2 DEF
Along with each product, first() returns the column values from the first matching row of that group.
groupby with multiple columns
We can apply groupby on two columns (product and store).
my_sale=sales.groupby(['product','store'])
Output
sale_id c_id p_id qty
product store
CPU DEF 2 2 4 1
Monitor ABC 1 2 3 2
DEF 6 3 3 2
RAM ABC 7 2 2 3
DEF 4 4 2 2
Using as_index=False. Note the difference in the product column: the group keys remain as regular columns instead of becoming the index.
my_sale=sales.groupby(['product','store'],as_index=False)
product store sale_id c_id p_id qty
0 CPU DEF 2 2 4 1
1 Monitor ABC 1 2 3 2
2 Monitor DEF 6 3 3 2
3 RAM ABC 7 2 2 3
4 RAM DEF 4 4 2 2
We can apply several of the above parameters together. The squeeze parameter is left out because it was deprecated in pandas 1.1 and removed in 2.0.
my_sale=sales.groupby(['product'],observed=False,as_index=False,group_keys=True,sort=False)
print(my_sale.first())
Output
product sale_id c_id p_id qty store
0 Monitor 1 2 3 2 ABC
1 CPU 2 2 4 1 DEF
2 RAM 4 4 2 2 DEF
Using get_group()
Collect rows of a particular group
my_sale=sales.groupby(['product'])
print(my_sale.get_group('Monitor'))
Output ( rows of all Monitor under product )
sale_id c_id p_id product qty store
0 1 2 3 Monitor 2 ABC
2 3 1 3 Monitor 3 ABC
4 5 2 3 Monitor 3 ABC
5 6 3 3 Monitor 2 DEF
8 9 2 3 Monitor 2 ABC
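When grouping on more than one column, get_group() takes a tuple of key values, one per grouping column. A short sketch with the same sample data:

```python
import pandas as pd

# Same sample data used in the examples on this page
sales = pd.DataFrame({
    'sale_id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'c_id':    [2, 2, 1, 4, 2, 3, 2, 3, 2],
    'p_id':    [3, 4, 3, 2, 3, 3, 2, 2, 3],
    'product': ['Monitor', 'CPU', 'Monitor', 'RAM', 'Monitor',
                'Monitor', 'RAM', 'RAM', 'Monitor'],
    'qty':     [2, 1, 3, 2, 3, 2, 3, 2, 2],
    'store':   ['ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC'],
})

my_sale = sales.groupby(['product', 'store'])
# Tuple key: rows where product is Monitor AND store is DEF
monitor_def = my_sale.get_group(('Monitor', 'DEF'))
print(monitor_def)
```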
Using sum()
We can get the sum of any column by grouping rows under another column. Here we get the total quantity sold for each type of product.
my_sale=sales.groupby(['product'])[['qty']].sum()
Output
qty
product
CPU 1
Monitor 12
RAM 7
We can use two columns for grouping and get the breakup of the sum of quantity sold by store.
my_sale=sales.groupby(['product','store'])[['qty']].sum()
Output
qty
product store
CPU DEF 1
Monitor ABC 10
DEF 2
RAM ABC 3
DEF 4
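The two-level result above can be pivoted so each store becomes its own column by calling unstack() on the summed Series; a sketch with the same sample data (fill_value=0 fills combinations that have no sales):

```python
import pandas as pd

# Same sample data used in the examples on this page
sales = pd.DataFrame({
    'sale_id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'c_id':    [2, 2, 1, 4, 2, 3, 2, 3, 2],
    'p_id':    [3, 4, 3, 2, 3, 3, 2, 2, 3],
    'product': ['Monitor', 'CPU', 'Monitor', 'RAM', 'Monitor',
                'Monitor', 'RAM', 'RAM', 'Monitor'],
    'qty':     [2, 1, 3, 2, 3, 2, 3, 2, 2],
    'store':   ['ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC'],
})

# Move the inner level (store) of the index into columns
qty_table = sales.groupby(['product', 'store'])['qty'].sum().unstack(fill_value=0)
print(qty_table)   # one row per product, one column per store
```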
Using max()
What is the highest quantity sold within each product group?
my_sale=sales.groupby(['product'])[['qty']].max()
Output
qty
product
CPU 1
Monitor 3
RAM 3
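max() tells us the highest quantity but not which sale it came from. idxmax() returns the row label of the maximum in each group, which we can feed back into loc to recover the full rows. A sketch with the same sample data (on ties, idxmax keeps the first occurrence):

```python
import pandas as pd

# Same sample data used in the examples on this page
sales = pd.DataFrame({
    'sale_id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'c_id':    [2, 2, 1, 4, 2, 3, 2, 3, 2],
    'p_id':    [3, 4, 3, 2, 3, 3, 2, 2, 3],
    'product': ['Monitor', 'CPU', 'Monitor', 'RAM', 'Monitor',
                'Monitor', 'RAM', 'RAM', 'Monitor'],
    'qty':     [2, 1, 3, 2, 3, 2, 3, 2, 2],
    'store':   ['ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC'],
})

idx = sales.groupby('product')['qty'].idxmax()   # row label of max qty per product
top_rows = sales.loc[idx]                        # full rows of those sales
print(top_rows[['sale_id', 'product', 'qty']])
```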
Using min()
What is the lowest quantity sold within each product group?
my_sale=sales.groupby(['product'])[['qty']].min()
Output
qty
product
CPU 1
Monitor 2
RAM 2
Using mean()
Get the mean quantity sold for each product.
my_sale=sales.groupby(['product'])[['qty']].mean()
Output
qty
product
CPU 1.000000
Monitor 2.400000
RAM 2.333333
Using std()
Using the standard deviation function std(). Note that CPU returns NaN below because the sample standard deviation is undefined for a group with a single row.
my_sale=sales.groupby(['product'])[['qty']].std()
qty
product
CPU NaN
Monitor 0.547723
RAM 0.577350
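The NaN for CPU comes from std() using the sample formula (ddof=1), which needs at least two values. Passing ddof=0 computes the population standard deviation instead, which is defined even for one row; a sketch:

```python
import pandas as pd

# Same sample data used in the examples on this page
sales = pd.DataFrame({
    'sale_id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'c_id':    [2, 2, 1, 4, 2, 3, 2, 3, 2],
    'p_id':    [3, 4, 3, 2, 3, 3, 2, 2, 3],
    'product': ['Monitor', 'CPU', 'Monitor', 'RAM', 'Monitor',
                'Monitor', 'RAM', 'RAM', 'Monitor'],
    'qty':     [2, 1, 3, 2, 3, 2, 3, 2, 2],
    'store':   ['ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC'],
})

# Population standard deviation: divides by n, so a 1-row group gives 0.0
pop_std = sales.groupby('product')[['qty']].std(ddof=0)
print(pop_std)   # CPU is now 0.0 instead of NaN
```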
Using count()
Total number of sales against each product. This is the number of rows with a non-NA qty value for each product.
my_sale=sales.groupby(['product'])[['qty']].count()
Output
qty
product
CPU 1
Monitor 5
RAM 3
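count() counts non-NA values per column, so it can differ between columns when data is missing; size() counts all rows in each group regardless of NA. A sketch of the difference, where we deliberately blank out one quantity for illustration:

```python
import numpy as np
import pandas as pd

# Same sample data used in the examples on this page
sales = pd.DataFrame({
    'sale_id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'c_id':    [2, 2, 1, 4, 2, 3, 2, 3, 2],
    'p_id':    [3, 4, 3, 2, 3, 3, 2, 2, 3],
    'product': ['Monitor', 'CPU', 'Monitor', 'RAM', 'Monitor',
                'Monitor', 'RAM', 'RAM', 'Monitor'],
    'qty':     [2, 1, 3, 2, 3, 2, 3, 2, 2],
    'store':   ['ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC'],
})

row_counts = sales.groupby('product').size()      # rows per group, NA included
print(row_counts)                                 # CPU 1, Monitor 5, RAM 3

sales2 = sales.copy()
sales2['qty'] = sales2['qty'].astype(float)       # allow NaN in the column
sales2.loc[0, 'qty'] = np.nan                     # blank out one Monitor quantity
qty_counts = sales2.groupby('product')['qty'].count()
print(qty_counts['Monitor'])                      # 4 non-NA values
print(sales2.groupby('product').size()['Monitor'])  # still 5 rows
```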
Using apply()
We can run our own function on each group. Here we check whether the total quantity sold for each product is more than 2.
my_sale=sales.groupby(['product'])
print(my_sale.apply(lambda x: x['qty'].sum() > 2))
Output
product
CPU False
Monitor True
RAM True
dtype: bool
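apply() above only flags each group as True or False; filter() uses the same kind of boolean test to keep or drop the underlying rows themselves. A sketch with the same sample data:

```python
import pandas as pd

# Same sample data used in the examples on this page
sales = pd.DataFrame({
    'sale_id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'c_id':    [2, 2, 1, 4, 2, 3, 2, 3, 2],
    'p_id':    [3, 4, 3, 2, 3, 3, 2, 2, 3],
    'product': ['Monitor', 'CPU', 'Monitor', 'RAM', 'Monitor',
                'Monitor', 'RAM', 'RAM', 'Monitor'],
    'qty':     [2, 1, 3, 2, 3, 2, 3, 2, 2],
    'store':   ['ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC'],
})

# Keep only rows belonging to groups whose total qty exceeds 2
big = sales.groupby('product').filter(lambda x: x['qty'].sum() > 2)
print(sorted(big['product'].unique()))   # ['Monitor', 'RAM'] : the CPU group is dropped
```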
Using different aggregate functions against different columns
One price column is added. The aggregate function sum is used for the qty column and mean is used for the price column. The option as_index=False is kept so the product values appear in all rows.
import pandas as pd
my_dict={'sale_id':[1,2,3,4,5,6,7,8,9],
'c_id':[2,2,1,4,2,3,2,3,2],
'p_id':[3,4,3,2,3,3,2,2,3],
'price':[30,20,40,20,30,40,20,30,40],
'product':['Monitor','CPU','Monitor','RAM','Monitor','Monitor','RAM','RAM','Monitor'],
'qty':[2,1,3,2,3,2,3,2,2],
'store':['ABC','DEF','ABC','DEF','ABC','DEF','ABC','DEF','ABC']}
sales = pd.DataFrame(data=my_dict)
my_sale=sales.groupby(['product','store'],as_index=False).agg({'qty':'sum','price':'mean'})
print(my_sale)
Output
product store qty price
0 CPU DEF 1 20
1 Monitor ABC 10 35
2 Monitor DEF 2 40
3 RAM ABC 3 20
4 RAM DEF 4 25
We can use a dictionary to apply aggregate functions to multiple columns.
d={'qty':'sum','price':'mean'} # dictionary to use
my_sale=sales.groupby(['product','store'],as_index=False).agg(d)
Multiple aggregate values of the same column.
d={'qty':['sum'],'price':['mean','max','min']}
Output is here
product store qty price
sum mean max min
0 CPU DEF 1 20 20 20
1 Monitor ABC 10 35 40 30
2 Monitor DEF 2 40 40 40
3 RAM ABC 3 20 20 20
4 RAM DEF 4 25 30 20
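The two-row column header above (qty/sum, price/mean, ...) is a MultiIndex, which can be awkward to work with. Named aggregation produces flat, self-describing column names instead. A sketch using the same data with the price column; the output names total_qty and avg_price are our own choice:

```python
import pandas as pd

# Same sample data with the price column added
sales = pd.DataFrame({
    'sale_id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'c_id':    [2, 2, 1, 4, 2, 3, 2, 3, 2],
    'p_id':    [3, 4, 3, 2, 3, 3, 2, 2, 3],
    'price':   [30, 20, 40, 20, 30, 40, 20, 30, 40],
    'product': ['Monitor', 'CPU', 'Monitor', 'RAM', 'Monitor',
                'Monitor', 'RAM', 'RAM', 'Monitor'],
    'qty':     [2, 1, 3, 2, 3, 2, 3, 2, 2],
    'store':   ['ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC'],
})

# Keyword name = output column, value = (input column, aggregate function)
summary = sales.groupby(['product', 'store'], as_index=False).agg(
    total_qty=('qty', 'sum'),
    avg_price=('price', 'mean'),
)
print(summary)
```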
Handling Null values
While grouping, NA values in the key columns are dropped by default. To include missing data as a group key we have to use dropna=False.
my_sale=sales.groupby(['product','store'],dropna=False).agg(d)
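A small self-contained sketch of the difference (the data with a missing product value is invented for illustration):

```python
import numpy as np
import pandas as pd

# Two of the four rows have no product value
df = pd.DataFrame({'product': ['Monitor', np.nan, 'Monitor', np.nan],
                   'qty': [2, 1, 3, 4]})

dropped = df.groupby('product')['qty'].sum()                # default dropna=True
print(dropped)        # only Monitor appears; the NaN rows are ignored

kept = df.groupby('product', dropna=False)['qty'].sum()     # NaN kept as a key
print(kept)           # Monitor 5 and a NaN group with 5
```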