Python Pandas cut() to segment and sort values into different bins

Segment data into bins
Parameters

x : The one-dimensional input array to be binned.
bins : The segments to be used for categorization. We can specify an integer (number of equal-width bins), a sequence of bin edges (non-uniform widths are allowed) or an IntervalIndex.
right : Default is True. Whether each bin includes its rightmost edge or not (see examples below).
labels : Default is None. A list of labels can be used for the bins; its length must match the number of segments or bins.
retbins : Default is False. Whether to return the bin edges along with the result or not.
precision : int, default 3. The precision used to store and display the bin labels.
include_lowest : Default is False. Whether the first interval should be left-inclusive or not.
duplicates : Default is 'raise'. If the bin edges are not unique, 'raise' gives a ValueError and 'drop' removes the duplicate edges.
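The retbins option has no example of its own on this page, so here is a minimal sketch of it. The variable names marks, binned and edges are chosen only for illustration; the values are the same MATH marks used in the other examples.

```python
import pandas as pd

# MATH marks of the six students
marks = pd.Series([80, 40, 70, 70, 60, 30])

# retbins=True returns a tuple: the binned values and the bin edges used
binned, edges = pd.cut(x=marks, bins=5, retbins=True)
print(binned)
print(edges)  # 6 edges are needed to describe 5 bins
```

The returned edges can be passed again as bins to apply the same segments to another column.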

Examples using options

In this example the mark of each student in MATH is used for segmentation. We used bins to make 3 non-uniform segments: above 1 up to 50, above 50 up to 70, and above 70 up to 100.

import pandas as pd 
my_dict={'NAME':['Ravi','Raju','Alex','Ron','King','Jack'],
         'ID':[1,2,3,4,5,6],
         'MATH':[80,40,70,70,60,30],
         'ENGLISH':[80,70,40,50,60,30]}
my_data = pd.DataFrame(data=my_dict)
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[1, 50, 70, 100]) 
print(my_data)

Output

   NAME  ID  MATH  ENGLISH     my_cut
0  Ravi   1    80       80  (70, 100]
1  Raju   2    40       70    (1, 50]
2  Alex   3    70       40   (50, 70]
3   Ron   4    70       50   (50, 70]
4  King   5    60       60   (50, 70]
5  Jack   6    30       30    (1, 50]

category data type

The column holding the output of cut() has the categorical data type. You can check it like this.

print(my_data['my_cut'].dtypes) # category

Read more on checking data types with the dtypes attribute and about the categorical data type.
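Because the result is categorical, the bins themselves and the number of values in each bin are easy to inspect. A short sketch, assuming the same MATH marks as above held in a Series named marks (a name used only here):

```python
import pandas as pd

marks = pd.Series([80, 40, 70, 70, 60, 30], name='MATH')
my_cut = pd.cut(x=marks, bins=[1, 50, 70, 100])

print(my_cut.dtype)           # category
print(my_cut.cat.categories)  # the intervals used as categories

# Number of marks falling in each bin
counts = my_cut.value_counts(sort=False)
print(counts)
```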

Bins

How do we decide on the segments for distributing the values? There are three types.

Fixed width bins : By specifying an integer we say how many equal-width segments we want. Here the marks vary from 30 to 80, a range of 50, so with bins=5 we are creating 5 segments of fixed width 10. The range of x is extended by .1% to include the minimum and maximum values, which is why the first bin starts at 29.95.

my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=5) 
print(my_data)

Output

   NAME  ID  MATH  ENGLISH         my_cut
0  Ravi   1    80       80   (70.0, 80.0]
1  Raju   2    40       70  (29.95, 40.0]
2  Alex   3    70       40   (60.0, 70.0]
3   Ron   4    70       50   (60.0, 70.0]
4  King   5    60       60   (50.0, 60.0]
5  Jack   6    30       30  (29.95, 40.0]

Sequence of scalars : We specify the edges of the bins.

my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[1,50,70,100]) 
print(my_data)

Output

   NAME  ID  MATH  ENGLISH     my_cut
0  Ravi   1    80       80  (70, 100]
1  Raju   2    40       70    (1, 50]
2  Alex   3    70       40   (50, 70]
3   Ron   4    70       50   (50, 70]
4  King   5    60       60   (50, 70]
5  Jack   6    30       30    (1, 50]

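IntervalIndex : Defines the exact, non-overlapping bins to use. The example below still passes a list of edges, with narrow bins such as (49, 50] and (69, 70] around the boundary marks; a pd.IntervalIndex object can be passed as bins in the same way.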

my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[1,49,50,69,70,79,80,100]) 
print(my_data)

Output

   NAME  ID  MATH  ENGLISH    my_cut
0  Ravi   1    80       80  (79, 80]
1  Raju   2    40       70   (1, 49]
2  Alex   3    70       40  (69, 70]
3   Ron   4    70       50  (69, 70]
4  King   5    60       60  (50, 69]
5  Jack   6    30       30   (1, 49]
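The example above passes a plain list of edges. As a sketch of the IntervalIndex form, the intervals (0, 50], (50, 70] and (70, 100] below are chosen only for illustration and are built with pd.IntervalIndex.from_tuples().

```python
import pandas as pd

marks = pd.Series([80, 40, 70, 70, 60, 30])

# Three non-overlapping intervals, closed on the right by default
my_bins = pd.IntervalIndex.from_tuples([(0, 50), (50, 70), (70, 100)])

my_cut = pd.cut(x=marks, bins=my_bins)
print(my_cut)
```

Any value that falls outside all of the given intervals gets NaN.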

right

In which bin should we place the marks that fall on the edges of the bins?
Alex got 70 and he is kept in the 50 to 70 segment. We can place him in the 70 to 100 segment also. For this we have to use the right option. By default right=True, so when MATH is 70 it is included in the 50 to 70 segment. If we make right=False then 70 is included in the 70 to 100 segment.

my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[1,50,70,100],right=True)

Output

   NAME  ID  MATH  ENGLISH     my_cut
0  Ravi   1    80       80  (70, 100]
1  Raju   2    40       70    (1, 50]
2  Alex   3    70       40   (50, 70]
3   Ron   4    70       50   (50, 70]
4  King   5    60       60   (50, 70]
5  Jack   6    30       30    (1, 50]

Let us change to right=False

my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[1,50,70,100],right=False)

Output

   NAME  ID  MATH  ENGLISH     my_cut
0  Ravi   1    80       80  [70, 100)
1  Raju   2    40       70    [1, 50)
2  Alex   3    70       40  [70, 100)
3   Ron   4    70       50  [70, 100)
4  King   5    60       60   [50, 70)
5  Jack   6    30       30    [1, 50)
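To see both settings side by side, the sketch below puts the two results in one small DataFrame. The names compare, right_true and right_false are used only for this illustration.

```python
import pandas as pd

marks = pd.Series([80, 40, 70, 70, 60, 30])
edges = [1, 50, 70, 100]

# Same marks and same edges, only the right option differs
compare = pd.DataFrame({
    'MATH': marks,
    'right_true': pd.cut(x=marks, bins=edges, right=True),
    'right_false': pd.cut(x=marks, bins=edges, right=False),
})
print(compare)
```

Only the marks that sit exactly on an edge (70 here) move to a different segment; all other marks stay between the same pair of edges.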

Labels

Default is None. We can pass a list of labels, one for each segment, to name the bins.

my_labels=['Fail','Second','First']
my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[1, 50, 75, 100],labels=my_labels) 
print(my_data)

Output

   NAME  ID  MATH  ENGLISH  my_cut
0  Ravi   1    80       80   First
1  Raju   2    40       70    Fail
2  Alex   3    70       40  Second
3   Ron   4    70       50  Second
4  King   5    60       60  Second
5  Jack   6    30       30    Fail

We can use the sum of two columns as our input array.

my_labels=['Fail','Second','First']
my_data['my_cut'] = pd.cut(x=my_data['MATH']+my_data['ENGLISH'],bins=[1, 100, 150, 200],labels=my_labels) 
print(my_data)

Output

   NAME  ID  MATH  ENGLISH  my_cut
0  Ravi   1    80       80   First
1  Raju   2    40       70  Second
2  Alex   3    70       40  Second
3   Ron   4    70       50  Second
4  King   5    60       60  Second
5  Jack   6    30       30    Fail
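The labels option also accepts False, which returns the integer position of each bin instead of an interval or a name. A minimal sketch with the same marks and the same edges as the first labels example (the name codes is used only here):

```python
import pandas as pd

marks = pd.Series([80, 40, 70, 70, 60, 30])

# labels=False returns the position of the bin: 0 for the first bin, 1 for the second ...
codes = pd.cut(x=marks, bins=[1, 50, 75, 100], labels=False)
print(codes.tolist())  # [2, 0, 1, 1, 1, 0]
```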

include_lowest

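Default value is False. It decides whether the first interval is left-inclusive or not. With include_lowest=False the mark 30 sits exactly on the lowest edge, so it is outside the first interval and we get NaN.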

my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[30,60,80,100],include_lowest=False)

Output

   NAME  ID  MATH  ENGLISH        my_cut
0  Ravi   1    80       80  (60.0, 80.0]
1  Raju   2    40       70  (30.0, 60.0]
2  Alex   3    70       40  (60.0, 80.0]
3   Ron   4    70       50  (60.0, 80.0]
4  King   5    60       60  (30.0, 60.0]
5  Jack   6    30       30           NaN

Let us try include_lowest=True

my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[30,60,80,100],include_lowest=True)

Output

   NAME  ID  MATH  ENGLISH          my_cut
0  Ravi   1    80       80    (60.0, 80.0]
1  Raju   2    40       70  (29.999, 60.0]
2  Alex   3    70       40    (60.0, 80.0]
3   Ron   4    70       50    (60.0, 80.0]
4  King   5    60       60  (29.999, 60.0]
5  Jack   6    30       30  (29.999, 60.0]

duplicates

Default value is 'raise'. When the bin edges are not unique, we can change this to duplicates='drop' to remove the repeated edges.

my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[40,50,50,100],duplicates='drop') 
print(my_data)

Output

   NAME  ID  MATH  ENGLISH         my_cut
0  Ravi   1    80       80  (50.0, 100.0]
1  Raju   2    40       70            NaN
2  Alex   3    70       40  (50.0, 100.0]
3   Ron   4    70       50  (50.0, 100.0]
4  King   5    60       60  (50.0, 100.0]
5  Jack   6    30       30            NaN

Let us change to duplicates='raise'

my_data['my_cut'] = pd.cut(x=my_data['MATH'],bins=[40,50,50,100],duplicates='raise')

Output
This will give a ValueError because the bin edges are not unique.
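If the edges come from user input or from another calculation, the error can be handled with try and except. This is only a sketch; the name error_message is used for illustration.

```python
import pandas as pd

marks = pd.Series([80, 40, 70, 70, 60, 30])

try:
    # The edge 50 is repeated, so duplicates='raise' stops here
    pd.cut(x=marks, bins=[40, 50, 50, 100], duplicates='raise')
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```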

cut() can be used to generate grades for students. Check the exercise on Pandas DataFrame cut() to understand the use of binning.
