Pandas DataFrame groupby()
We can combine rows based on column values and apply different aggregate functions to each group.
DataFrame.groupby(by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
by : the column label, list of labels, mapping or function over which the groups are formed
axis : (default 0) split along rows; use 1 to group over columns
level : group by a particular level or levels of a MultiIndex
as_index : (default True) use the group labels as the index of the result
sort : (default True) sort the group keys
group_keys : (default True) when calling apply(), add the group keys to the index to identify the pieces
squeeze : (default False) reduce the dimensionality of the return type if possible (deprecated since pandas 1.1, removed in 2.0)
observed : (default False) applies if any of the groupers are Categoricals; if True, show only observed values
dropna : (default True) NA values in the keys are dropped; if False they are treated as keys
Returns a groupby object
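The returned GroupBy object is lazy: no computation happens until we apply an aggregate. We can inspect it or iterate over (key, sub-DataFrame) pairs first. A minimal sketch using the same sample data:

```python
import pandas as pd

# Same sample data used in the examples on this page
sales = pd.DataFrame({
    'sale_id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'c_id':    [2, 2, 1, 4, 2, 3, 2, 3, 2],
    'p_id':    [3, 4, 3, 2, 3, 3, 2, 2, 3],
    'product': ['Monitor', 'CPU', 'Monitor', 'RAM', 'Monitor',
                'Monitor', 'RAM', 'RAM', 'Monitor'],
    'qty':     [2, 1, 3, 2, 3, 2, 3, 2, 2],
    'store':   ['ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC'],
})

my_sale = sales.groupby('product')   # nothing is computed yet
print(my_sale.ngroups)               # 3 distinct products
for name, group in my_sale:          # iterate over (group key, sub-DataFrame)
    print(name, len(group))          # CPU 1, Monitor 5, RAM 3
```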
We used this CSV file for our examples. You can download it and run the script.
sales.csv has six columns: sale_id, c_id (customer id), p_id (product id), product, qty (quantity sold) and store (store name).
You can also use this DataFrame for our demo examples here.
import pandas as pd
my_dict={'sale_id':[1,2,3,4,5,6,7,8,9],
'c_id':[2,2,1,4,2,3,2,3,2],
'p_id':[3,4,3,2,3,3,2,2,3],
'product':['Monitor','CPU','Monitor','RAM','Monitor','Monitor','RAM','RAM','Monitor'],
'qty':[2,1,3,2,3,2,3,2,2],
'store':['ABC','DEF','ABC','DEF','ABC','DEF','ABC','DEF','ABC']}
sales = pd.DataFrame(data=my_dict)
print(sales)
Using groupby on product
import pandas as pd
sales=pd.read_csv("sales.csv")
my_sale=sales.groupby(['product'])
print(my_sale.first())
Output is here
sale_id c_id p_id qty store
product
CPU 2 2 4 1 DEF
Monitor 1 2 3 2 ABC
RAM 4 4 2 2 DEF
Along with each product, first() returns the column values from the first matching row of that group.
groupby with multiple columns
We can apply groupby on two columns (product and store).
my_sale=sales.groupby(['product','store'])
Output
sale_id c_id p_id qty
product store
CPU DEF 2 2 4 1
Monitor ABC 1 2 3 2
DEF 6 3 3 2
RAM ABC 7 2 2 3
DEF 4 4 2 2
Using as_index=False. Note the difference in the product column: the group keys remain as regular columns instead of becoming the index.
my_sale=sales.groupby(['product','store'],as_index=False)
product store sale_id c_id p_id qty
0 CPU DEF 2 2 4 1
1 Monitor ABC 1 2 3 2
2 Monitor DEF 6 3 3 2
3 RAM ABC 7 2 2 3
4 RAM DEF 4 4 2 2
We can apply several of the above parameters together. The squeeze parameter is left out because it was deprecated in pandas 1.1 and removed in 2.0.
my_sale=sales.groupby(['product'],observed=False,as_index=False,group_keys=True,sort=False)
print(my_sale.first())
Output
product sale_id c_id p_id qty store
0 Monitor 1 2 3 2 ABC
1 CPU 2 2 4 1 DEF
2 RAM 4 4 2 2 DEF
Using get_group()
Collect rows of a particular group
my_sale=sales.groupby(['product'])
print(my_sale.get_group('Monitor'))
Output ( rows of all Monitor under product )
sale_id c_id p_id product qty store
0 1 2 3 Monitor 2 ABC
2 3 1 3 Monitor 3 ABC
4 5 2 3 Monitor 3 ABC
5 6 3 3 Monitor 2 DEF
8 9 2 3 Monitor 2 ABC
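When grouping on more than one column, get_group() takes a tuple of key values, one per grouping column. A short sketch with the same sample data:

```python
import pandas as pd

# Same sample data used in the examples on this page
sales = pd.DataFrame({
    'sale_id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'c_id':    [2, 2, 1, 4, 2, 3, 2, 3, 2],
    'p_id':    [3, 4, 3, 2, 3, 3, 2, 2, 3],
    'product': ['Monitor', 'CPU', 'Monitor', 'RAM', 'Monitor',
                'Monitor', 'RAM', 'RAM', 'Monitor'],
    'qty':     [2, 1, 3, 2, 3, 2, 3, 2, 2],
    'store':   ['ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC'],
})

my_sale = sales.groupby(['product', 'store'])
# Tuple key: rows where product is Monitor AND store is DEF
monitor_def = my_sale.get_group(('Monitor', 'DEF'))
print(monitor_def)
```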
Using sum()
We can get the sum of any column by grouping rows under another column. Here we get the total quantity sold for each type of product.
my_sale=sales.groupby(['product'])[['qty']].sum()
Output
qty
product
CPU 1
Monitor 12
RAM 7
We can use two columns for grouping and get the breakup of the sum of quantity sold by store.
my_sale=sales.groupby(['product','store'])[['qty']].sum()
Output
qty
product store
CPU DEF 1
Monitor ABC 10
DEF 2
RAM ABC 3
DEF 4
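The two-level result above can be pivoted so each store becomes its own column by calling unstack() on the summed Series; a sketch with the same sample data (fill_value=0 fills combinations that have no sales):

```python
import pandas as pd

# Same sample data used in the examples on this page
sales = pd.DataFrame({
    'sale_id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'c_id':    [2, 2, 1, 4, 2, 3, 2, 3, 2],
    'p_id':    [3, 4, 3, 2, 3, 3, 2, 2, 3],
    'product': ['Monitor', 'CPU', 'Monitor', 'RAM', 'Monitor',
                'Monitor', 'RAM', 'RAM', 'Monitor'],
    'qty':     [2, 1, 3, 2, 3, 2, 3, 2, 2],
    'store':   ['ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC'],
})

# Move the inner level (store) of the index into columns
qty_table = sales.groupby(['product', 'store'])['qty'].sum().unstack(fill_value=0)
print(qty_table)   # one row per product, one column per store
```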
Using max()
What is the highest quantity sold within each product group?
my_sale=sales.groupby(['product'])[['qty']].max()
Output
qty
product
CPU 1
Monitor 3
RAM 3
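max() tells us the highest quantity but not which sale it came from. idxmax() returns the row label of the maximum in each group, which we can feed back into loc to recover the full rows. A sketch with the same sample data (on ties, idxmax keeps the first occurrence):

```python
import pandas as pd

# Same sample data used in the examples on this page
sales = pd.DataFrame({
    'sale_id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'c_id':    [2, 2, 1, 4, 2, 3, 2, 3, 2],
    'p_id':    [3, 4, 3, 2, 3, 3, 2, 2, 3],
    'product': ['Monitor', 'CPU', 'Monitor', 'RAM', 'Monitor',
                'Monitor', 'RAM', 'RAM', 'Monitor'],
    'qty':     [2, 1, 3, 2, 3, 2, 3, 2, 2],
    'store':   ['ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC'],
})

idx = sales.groupby('product')['qty'].idxmax()   # row label of max qty per product
top_rows = sales.loc[idx]                        # full rows of those sales
print(top_rows[['sale_id', 'product', 'qty']])
```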
Using min()
What is the lowest quantity sold within each product group?
my_sale=sales.groupby(['product'])[['qty']].min()
Output
qty
product
CPU 1
Monitor 2
RAM 2
Using mean()
Get the mean quantity sold for each product.
my_sale=sales.groupby(['product'])[['qty']].mean()
Output
qty
product
CPU 1.000000
Monitor 2.400000
RAM 2.333333
Using std()
Using the standard deviation function std(). Note that CPU returns NaN below because the sample standard deviation is undefined for a group with a single row.
my_sale=sales.groupby(['product'])[['qty']].std()
qty
product
CPU NaN
Monitor 0.547723
RAM 0.577350
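The NaN for CPU comes from std() using the sample formula (ddof=1), which needs at least two values. Passing ddof=0 computes the population standard deviation instead, which is defined even for one row; a sketch:

```python
import pandas as pd

# Same sample data used in the examples on this page
sales = pd.DataFrame({
    'sale_id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'c_id':    [2, 2, 1, 4, 2, 3, 2, 3, 2],
    'p_id':    [3, 4, 3, 2, 3, 3, 2, 2, 3],
    'product': ['Monitor', 'CPU', 'Monitor', 'RAM', 'Monitor',
                'Monitor', 'RAM', 'RAM', 'Monitor'],
    'qty':     [2, 1, 3, 2, 3, 2, 3, 2, 2],
    'store':   ['ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC'],
})

# Population standard deviation: divides by n, so a 1-row group gives 0.0
pop_std = sales.groupby('product')[['qty']].std(ddof=0)
print(pop_std)   # CPU is now 0.0 instead of NaN
```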
Using count()
Total number of sales against each product. This is the number of rows with a non-NA qty value for each product.
my_sale=sales.groupby(['product'])[['qty']].count()
Output
qty
product
CPU 1
Monitor 5
RAM 3
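count() counts non-NA values per column, so it can differ between columns when data is missing; size() counts all rows in each group regardless of NA. A sketch of the difference, where we deliberately blank out one quantity for illustration:

```python
import numpy as np
import pandas as pd

# Same sample data used in the examples on this page
sales = pd.DataFrame({
    'sale_id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'c_id':    [2, 2, 1, 4, 2, 3, 2, 3, 2],
    'p_id':    [3, 4, 3, 2, 3, 3, 2, 2, 3],
    'product': ['Monitor', 'CPU', 'Monitor', 'RAM', 'Monitor',
                'Monitor', 'RAM', 'RAM', 'Monitor'],
    'qty':     [2, 1, 3, 2, 3, 2, 3, 2, 2],
    'store':   ['ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC'],
})

row_counts = sales.groupby('product').size()      # rows per group, NA included
print(row_counts)                                 # CPU 1, Monitor 5, RAM 3

sales2 = sales.copy()
sales2['qty'] = sales2['qty'].astype(float)       # allow NaN in the column
sales2.loc[0, 'qty'] = np.nan                     # blank out one Monitor quantity
qty_counts = sales2.groupby('product')['qty'].count()
print(qty_counts['Monitor'])                      # 4 non-NA values
print(sales2.groupby('product').size()['Monitor'])  # still 5 rows
```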
Using apply()
We can run our own function on each group. Here we check whether the total quantity sold for each product is more than 2.
my_sale=sales.groupby(['product'])
print(my_sale.apply(lambda x: x['qty'].sum() > 2))
Output
product
CPU False
Monitor True
RAM True
dtype: bool
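apply() above only flags each group as True or False; filter() uses the same kind of boolean test to keep or drop the underlying rows themselves. A sketch with the same sample data:

```python
import pandas as pd

# Same sample data used in the examples on this page
sales = pd.DataFrame({
    'sale_id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'c_id':    [2, 2, 1, 4, 2, 3, 2, 3, 2],
    'p_id':    [3, 4, 3, 2, 3, 3, 2, 2, 3],
    'product': ['Monitor', 'CPU', 'Monitor', 'RAM', 'Monitor',
                'Monitor', 'RAM', 'RAM', 'Monitor'],
    'qty':     [2, 1, 3, 2, 3, 2, 3, 2, 2],
    'store':   ['ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC'],
})

# Keep only rows belonging to groups whose total qty exceeds 2
big = sales.groupby('product').filter(lambda x: x['qty'].sum() > 2)
print(sorted(big['product'].unique()))   # ['Monitor', 'RAM'] : the CPU group is dropped
```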
Using different aggregate functions against different columns
One price column is added. The aggregate function sum is used for the qty column and mean is used for the price column. The option as_index=False is kept so the product values appear in all rows.
import pandas as pd
my_dict={'sale_id':[1,2,3,4,5,6,7,8,9],
'c_id':[2,2,1,4,2,3,2,3,2],
'p_id':[3,4,3,2,3,3,2,2,3],
'price':[30,20,40,20,30,40,20,30,40],
'product':['Monitor','CPU','Monitor','RAM','Monitor','Monitor','RAM','RAM','Monitor'],
'qty':[2,1,3,2,3,2,3,2,2],
'store':['ABC','DEF','ABC','DEF','ABC','DEF','ABC','DEF','ABC']}
sales = pd.DataFrame(data=my_dict)
my_sale=sales.groupby(['product','store'],as_index=False).agg({'qty':'sum','price':'mean'})
print(my_sale)
Output
product store qty price
0 CPU DEF 1 20
1 Monitor ABC 10 35
2 Monitor DEF 2 40
3 RAM ABC 3 20
4 RAM DEF 4 25
We can use a dictionary to apply aggregate functions to multiple columns.
d={'qty':'sum','price':'mean'} # dictionary to use
my_sale=sales.groupby(['product','store'],as_index=False).agg(d)
Multiple aggregate values of the same column.
d={'qty':['sum'],'price':['mean','max','min']}
Output is here
product store qty price
sum mean max min
0 CPU DEF 1 20 20 20
1 Monitor ABC 10 35 40 30
2 Monitor DEF 2 40 40 40
3 RAM ABC 3 20 20 20
4 RAM DEF 4 25 30 20
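The two-row column header above (qty/sum, price/mean, ...) is a MultiIndex, which can be awkward to work with. Named aggregation produces flat, self-describing column names instead. A sketch using the same data with the price column; the output names total_qty and avg_price are our own choice:

```python
import pandas as pd

# Same sample data with the price column added
sales = pd.DataFrame({
    'sale_id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'c_id':    [2, 2, 1, 4, 2, 3, 2, 3, 2],
    'p_id':    [3, 4, 3, 2, 3, 3, 2, 2, 3],
    'price':   [30, 20, 40, 20, 30, 40, 20, 30, 40],
    'product': ['Monitor', 'CPU', 'Monitor', 'RAM', 'Monitor',
                'Monitor', 'RAM', 'RAM', 'Monitor'],
    'qty':     [2, 1, 3, 2, 3, 2, 3, 2, 2],
    'store':   ['ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC', 'DEF', 'ABC'],
})

# Keyword name = output column, value = (input column, aggregate function)
summary = sales.groupby(['product', 'store'], as_index=False).agg(
    total_qty=('qty', 'sum'),
    avg_price=('price', 'mean'),
)
print(summary)
```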
Handling Null values
While grouping, NA values in the key columns are dropped by default. To include missing data as a group key we have to use dropna=False.
my_sale=sales.groupby(['product','store'],dropna=False).agg(d)
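A small self-contained sketch of the difference (the data with a missing product value is invented for illustration):

```python
import numpy as np
import pandas as pd

# Two of the four rows have no product value
df = pd.DataFrame({'product': ['Monitor', np.nan, 'Monitor', np.nan],
                   'qty': [2, 1, 3, 4]})

dropped = df.groupby('product')['qty'].sum()                # default dropna=True
print(dropped)        # only Monitor appears; the NaN rows are ignored

kept = df.groupby('product', dropna=False)['qty'].sum()     # NaN kept as a key
print(kept)           # Monitor 5 and a NaN group with 5
```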