Segment data into bins
Parameters
x : The one-dimensional input array to be categorized.
bins : The segments to be used for categorization. We can pass an integer (number of equal-width bins), a sequence of scalars (bin edges, which may give non-uniform widths) or an IntervalIndex.
right : Default is True. Whether each bin includes its rightmost edge (see examples below).
labels : Default is None. A list of labels for the bins; its length must match the number of segments or bins.
retbins : Default is False. Whether to also return the bin edges.
precision : int, default 3. The precision used to store and display the bin labels.
include_lowest : Default is False. Whether the first interval is left-inclusive.
duplicates : Default is 'raise'. What to do when bin edges are not unique: 'raise' throws a ValueError, 'drop' removes the repeated edges.
Examples using options
In this example the mark of each student in MATH is used for segmentation. We pass a list of bin edges to make three segments of non-uniform width: from 1 to 50, from 50 to 70 and from 70 to 100.
import pandas as pd
my_dict={'NAME':['Ravi','Raju','Alex','Ron','King','Jack'],
'ID':[1,2,3,4,5,6],
'MATH':[80,40,70,70,60,30],
'ENGLISH':[80,70,40,50,60,30]}
my_data = pd.DataFrame(data=my_dict)
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[1, 50, 70, 100])
print(my_data)
Output
NAME ID MATH ENGLISH my_cut
0 Ravi 1 80 80 (70, 100]
1 Raju 2 40 70 (1, 50]
2 Alex 3 70 40 (50, 70]
3 Ron 4 70 50 (50, 70]
4 King 5 60 60 (50, 70]
5 Jack 6 30 30 (1, 50]
category data type
The column holding the output of cut() has the categorical data type. You can check it like this.
print(my_data['my_cut'].dtypes) # category
Read more about checking data types with the dtypes attribute and about the categorical data type.
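To see what the categorical result contains, we can inspect its categories and count the rows per bin. This is a minimal sketch; the `marks` Series is a stand-in for the MATH column used above, and the names `binned` and `counts` are only for illustration.

```python
import pandas as pd

# Stand-in for my_data['MATH'] from the example above
marks = pd.Series([80, 40, 70, 70, 60, 30])
binned = pd.cut(marks, bins=[1, 50, 70, 100])

print(binned.dtype)           # the categorical dtype
print(binned.cat.categories)  # the three intervals used as categories
print(binned.cat.ordered)     # bins produced by cut() are ordered

# Count how many marks fall in each bin, listed in bin order
counts = binned.value_counts().sort_index()
print(counts)
```

Because the categories are ordered, sorting or grouping by this column follows the bin order, not alphabetical order.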
Bins
How do we decide on the segments for the distribution of values? There are three ways to specify bins.
Fixed width bins : By specifying an integer we say how many segments we want. Here the marks span a range of 50 (from 30 to 80), so with bins=5 we create segments of fixed width 10. The range of x is extended by 0.1% (here 0.05) at the lowest edge so that the minimum value is included, which is why the first bin starts at 29.95.
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=5)
print(my_data)
Output
NAME ID MATH ENGLISH my_cut
0 Ravi 1 80 80 (70.0, 80.0]
1 Raju 2 40 70 (29.95, 40.0]
2 Alex 3 70 40 (60.0, 70.0]
3 Ron 4 70 50 (60.0, 70.0]
4 King 5 60 60 (50.0, 60.0]
5 Jack 6 30 30 (29.95, 40.0]
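To check the edges that cut() computes for an integer bins value, we can ask for them with retbins=True. A small sketch, assuming the same six marks; `marks`, `edges` and `widths` are illustrative names.

```python
import pandas as pd

# Stand-in for my_data['MATH'] from the example above
marks = pd.Series([80, 40, 70, 70, 60, 30])

# retbins=True returns a tuple: the binned values and the array of bin edges
binned, edges = pd.cut(marks, bins=5, retbins=True)

print(edges)    # six edges for five bins; the first one sits just below 30

# Width of each bin: about 10, with the first bin slightly wider
widths = edges[1:] - edges[:-1]
print(widths)
```

The first edge is moved slightly below the minimum mark so that 30 itself lands inside the first right-closed bin.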
Sequence of scalars : We specify the edges of the bins.
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[1,50,70,100])
print(my_data)
Output
NAME ID MATH ENGLISH my_cut
0 Ravi 1 80 80 (70, 100]
1 Raju 2 40 70 (1, 50]
2 Alex 3 70 40 (50, 70]
3 Ron 4 70 50 (50, 70]
4 King 5 60 60 (50, 70]
5 Jack 6 30 30 (1, 50]
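IntervalIndex : Defines the exact bins to be used; the intervals must not overlap. The example here still passes a plain list of edges, this time with narrow bins around the boundary marks.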
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[1,49,50,69,70,79,80,100])
print(my_data)
Output
NAME ID MATH ENGLISH my_cut
0 Ravi 1 80 80 (79, 80]
1 Raju 2 40 70 (1, 49]
2 Alex 3 70 40 (69, 70]
3 Ron 4 70 50 (69, 70]
4 King 5 60 60 (50, 69]
5 Jack 6 30 30 (1, 49]
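The example above uses a list of edges. For comparison, here is a sketch of passing an actual IntervalIndex, built with pd.IntervalIndex.from_tuples(). The interval values are chosen only for illustration, and `marks` is a stand-in for the MATH column.

```python
import pandas as pd

# Stand-in for my_data['MATH'] from the example above
marks = pd.Series([80, 40, 70, 70, 60, 30])

# Contiguous, non-overlapping intervals (closed on the right by default)
bins = pd.IntervalIndex.from_tuples([(1, 50), (50, 70), (70, 100)])
binned = pd.cut(marks, bins=bins)
print(binned)

# Intervals may leave gaps; marks that fall in a gap become NaN
gap_bins = pd.IntervalIndex.from_tuples([(1, 50), (70, 100)])
gapped = pd.cut(marks, bins=gap_bins)
print(gapped)
```

With the second set of intervals, the marks 60 and 70 belong to no bin, so they come back as NaN. The pandas documentation notes that the right and labels options are ignored when bins is an IntervalIndex.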
right
In which bin should we place a mark that falls on the edge of a bin?
Alex got 70 and he is kept in the 50 to 70 segment. We could place him in the 70 to 100 segment instead. For this we have to use the right option. By default right=True, so a MATH mark of 70 is included in the 50 to 70 segment. If we set right=False, a mark of 70 is included in the 70 to 100 segment.
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[1,50,70,100],right=True)
Output
NAME ID MATH ENGLISH my_cut
0 Ravi 1 80 80 (70, 100]
1 Raju 2 40 70 (1, 50]
2 Alex 3 70 40 (50, 70]
3 Ron 4 70 50 (50, 70]
4 King 5 60 60 (50, 70]
5 Jack 6 30 30 (1, 50]
Let us change to right=False
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[1,50,70,100],right=False)
Output
NAME ID MATH ENGLISH my_cut
0 Ravi 1 80 80 [70, 100)
1 Raju 2 40 70 [1, 50)
2 Alex 3 70 40 [70, 100)
3 Ron 4 70 50 [70, 100)
4 King 5 60 60 [50, 70)
5 Jack 6 30 30 [1, 50)
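To compare the two settings side by side, we can bin the same marks twice and place the results in one DataFrame. This is only a sketch; `marks`, `closed_right`, `closed_left` and `top_mark` are illustrative names, and the extra mark of 100 is an assumed value added to show the upper edge.

```python
import pandas as pd

# Stand-in for my_data['MATH'] from the example above
marks = pd.Series([80, 40, 70, 70, 60, 30])
edges = [1, 50, 70, 100]

closed_right = pd.cut(marks, bins=edges, right=True)   # intervals like (50, 70]
closed_left = pd.cut(marks, bins=edges, right=False)   # intervals like [50, 70)

comparison = pd.DataFrame({'MATH': marks,
                           'right=True': closed_right,
                           'right=False': closed_left})
print(comparison)

# With right=False the last bin is [70, 100), so a mark of exactly 100 is left out
top_mark = pd.cut(pd.Series([100]), bins=edges, right=False)
print(top_mark.isna().tolist())
```

Only the marks that sit exactly on an edge (70 here) change bins; all other marks stay in the same segment under both settings.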
Labels
Default is None, which shows the intervals themselves. We can pass a list of labels instead, one label for each segment.
my_labels=['Fail','Second','First']
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[1, 50, 75, 100],labels=my_labels)
print(my_data)
Output
NAME ID MATH ENGLISH my_cut
0 Ravi 1 80 80 First
1 Raju 2 40 70 Fail
2 Alex 3 70 40 Second
3 Ron 4 70 50 Second
4 King 5 60 60 Second
5 Jack 6 30 30 Fail
We can use the sum of two columns as our input array.
my_labels=['Fail','Second','First']
my_data['my_cut'] = pd.cut(x=my_data['MATH']+my_data['ENGLISH'],bins=[1, 100, 150, 200],labels=my_labels)
print(my_data)
Output
NAME ID MATH ENGLISH my_cut
0 Ravi 1 80 80 First
1 Raju 2 40 70 Second
2 Alex 3 70 40 Second
3 Ron 4 70 50 Second
4 King 5 60 60 Second
5 Jack 6 30 30 Fail
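The labels option also accepts False, which returns the integer position of each bin instead of a label or an interval. A minimal sketch with the same marks and edges as the labels example; `marks` and `codes` are illustrative names.

```python
import pandas as pd

# Stand-in for my_data['MATH'] from the example above
marks = pd.Series([80, 40, 70, 70, 60, 30])

# labels=False returns 0 for the first bin, 1 for the second, and so on
codes = pd.cut(marks, bins=[1, 50, 75, 100], labels=False)
print(codes.tolist())
```

Here 0 corresponds to the 'Fail' segment, 1 to 'Second' and 2 to 'First', which can be handy when the bin number is needed for further calculation.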
include_lowest
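Default value is False. It sets whether the first interval is left-inclusive. With False the lowest edge (30 here) is excluded, so a mark of exactly 30 falls outside all bins and we get NaN.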
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[30,60,80,100],include_lowest=False)
Output
NAME ID MATH ENGLISH my_cut
0 Ravi 1 80 80 (60.0, 80.0]
1 Raju 2 40 70 (30.0, 60.0]
2 Alex 3 70 40 (60.0, 80.0]
3 Ron 4 70 50 (60.0, 80.0]
4 King 5 60 60 (30.0, 60.0]
5 Jack 6 30 30 NaN
Let us try include_lowest=True
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[30,60,80,100],include_lowest=True)
Output
NAME ID MATH ENGLISH my_cut
0 Ravi 1 80 80 (60.0, 80.0]
1 Raju 2 40 70 (29.999, 60.0]
2 Alex 3 70 40 (60.0, 80.0]
3 Ron 4 70 50 (60.0, 80.0]
4 King 5 60 60 (29.999, 60.0]
5 Jack 6 30 30 (29.999, 60.0]
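A quick way to see the effect of include_lowest is to check which rows come back as NaN under each setting. This is a sketch with the same marks and edges; `marks`, `without_lowest` and `with_lowest` are illustrative names.

```python
import pandas as pd

# Stand-in for my_data['MATH'] from the example above
marks = pd.Series([80, 40, 70, 70, 60, 30])
edges = [30, 60, 80, 100]

without_lowest = pd.cut(marks, bins=edges, include_lowest=False)
with_lowest = pd.cut(marks, bins=edges, include_lowest=True)

# True marks the rows that fell outside every bin
print(without_lowest.isna().tolist())
print(with_lowest.isna().tolist())
```

Only the mark equal to the lowest edge (30) is affected; every other mark lands in the same segment either way.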
duplicates
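Default value is 'raise', which throws a ValueError when the bin edges are not unique. We can change this to duplicates='drop' to remove the repeated edge, so the bins here become 40 to 50 and 50 to 100.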
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[40,50,50,100],duplicates='drop')
print(my_data)
Output
NAME ID MATH ENGLISH my_cut
0 Ravi 1 80 80 (50.0, 100.0]
1 Raju 2 40 70 NaN
2 Alex 3 70 40 (50.0, 100.0]
3 Ron 4 70 50 (50.0, 100.0]
4 King 5 60 60 (50.0, 100.0]
5 Jack 6 30 30 NaN
Let us change to duplicates='raise'
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[40,50,50,100],duplicates='raise')
Output
This raises a ValueError because the edge 50 appears twice in bins.
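If the bin edges come from data and may contain repeats, we can catch the error and fall back to duplicates='drop'. This is a sketch of one possible approach, not the only way to handle it; `marks`, `message` and `dropped` are illustrative names.

```python
import pandas as pd

# Stand-in for my_data['MATH'] from the example above
marks = pd.Series([80, 40, 70, 70, 60, 30])
edges = [40, 50, 50, 100]   # the edge 50 is repeated

message = None
try:
    pd.cut(marks, bins=edges, duplicates='raise')
except ValueError as err:
    # pandas reports that the bin edges must be unique
    message = str(err)
    print(message)

# Dropping the repeated edge leaves two bins
dropped = pd.cut(marks, bins=edges, duplicates='drop')
print(dropped.cat.categories)
```

After the duplicate is dropped, the marks 30 and 40 still fall outside the bins, because the first bin is open at its left edge of 40.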