describe() returns descriptive statistics for a DataFrame or a Series.
percentiles : List of percentiles to show, each between 0 and 1. Default is 25%, 50% and 75%. We can pass our own list, such as [.45,.68,.89].
include : 'all', a list of data types, or None (the default). Data types to be included in the output.
exclude : Data types to be excluded from the output.
Examples
We will try each option and check the output.
import pandas as pd
import numpy as np  # needed for np.number in the include / exclude examples
my_dict={'NAME':['Ravi','Raju','Alex','Ron','King','Jack'],
'ID':[1,2,3,4,5,6],
'MATH':[80,40,70,70,60,30],
'ENGLISH':[80,70,40,50,60,30]}
my_data = pd.DataFrame(data=my_dict)
print(my_data['MATH'].describe())
Output
count 6.000000
mean 58.333333
std 19.407902
min 30.000000
25% 45.000000
50% 65.000000
75% 70.000000
max 80.000000
We can get the same statistics for the full DataFrame.
print(my_data.describe())
Output
ID MATH ENGLISH
count 6.000000 6.000000 6.000000
mean 3.500000 58.333333 55.000000
std 1.870829 19.407902 18.708287
min 1.000000 30.000000 30.000000
25% 2.250000 45.000000 42.500000
50% 3.500000 65.000000 55.000000
75% 4.750000 70.000000 67.500000
max 6.000000 80.000000 80.000000
You can see that only the numeric columns are included; the one object column, NAME, is left out.
count | Number of non-null values in the column |
mean | Mean of the values in the (numeric) column |
std | Standard deviation of the values in the column |
min | Minimum value appearing in the column |
25% | 25th percentile of the values in the column |
50% | 50th percentile (median) of the values in the column |
75% | 75th percentile of the values in the column |
max | Maximum value appearing in the column |
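Each row in the table corresponds to an ordinary Series method, so the summary can be rebuilt by hand. Here is a minimal sketch using the MATH marks from the example above; the variable names math, summary and manual are only for illustration.

```python
import pandas as pd

# MATH marks from the sample data
math = pd.Series([80, 40, 70, 70, 60, 30], name='MATH')

summary = math.describe()

# Rebuild every row of the summary with the matching Series method.
manual = pd.Series({
    'count': math.count(),        # non-null values
    'mean': math.mean(),
    'std': math.std(),            # sample standard deviation (ddof=1)
    'min': math.min(),
    '25%': math.quantile(0.25),
    '50%': math.quantile(0.50),   # same as math.median()
    '75%': math.quantile(0.75),
    'max': math.max(),
})

print(summary)
print(manual)
```

Both printouts should show the same numbers as the first Output block.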
percentiles
By default we get values for 25%, 50% and 75%. We can select our own percentiles like this: percentiles=[.45,.68,.89]
print(my_data['MATH'].describe(percentiles=[.45,.68,.89]))
Output
count 6.000000
mean 58.333333
std 19.407902
min 30.000000
45% 62.500000
50% 65.000000
68% 70.000000
89% 74.500000
max 80.000000
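The percentile rows can be cross-checked against quantile(), which accepts the same fractions. This is a small sketch with the MATH marks; the names custom and direct are chosen here for illustration.

```python
import pandas as pd

math = pd.Series([80, 40, 70, 70, 60, 30], name='MATH')

# Summary with our own percentiles
custom = math.describe(percentiles=[.45, .68, .89])

# The same fractions passed directly to quantile()
direct = math.quantile([.45, .68, .89])

print(custom[['45%', '68%', '89%']])
print(direct)
```

The two printouts carry the same values; only the labels differ ('45%' in describe(), 0.45 in quantile()).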
include
We can also apply describe() to object type columns.
With include='all' the output includes columns of every type.
print(my_data.describe(include='all'))
Output ( watch the rows unique, top and freq )
NAME ID MATH ENGLISH
count 6 6.000000 6.000000 6.000000
unique 6 NaN NaN NaN
top King NaN NaN NaN
freq 1 NaN NaN NaN
mean NaN 3.500000 58.333333 55.000000
std NaN 1.870829 19.407902 18.708287
min NaN 1.000000 30.000000 30.000000
25% NaN 2.250000 45.000000 42.500000
50% NaN 3.500000 65.000000 55.000000
75% NaN 4.750000 70.000000 67.500000
max NaN 6.000000 80.000000 80.000000
In our example above we don't have any category dtype column so it is not included here. You can see the output with one category column at the end of this page.
unique | Number of distinct values in the column |
top | Most frequently occurring value in the column |
freq | Number of times the top value appears in the column |
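In the NAME column every value appears once, so top is just one of the tied names and freq is 1. With repeated values the rows are easier to read. The sketch below uses a hypothetical list of grades (not part of the sample data above) and compares describe() with value_counts().

```python
import pandas as pd

# Hypothetical grades with repeats, so 'top' and 'freq' are unambiguous.
grades = pd.Series(['b', 'a', 'b', 'c', 'b', 'c'])

desc = grades.describe()
counts = grades.value_counts()   # sorted by frequency, highest first

print(desc)                      # rows: count, unique, top, freq
print(counts)

# 'top' is the first label of value_counts(), 'freq' is its count.
print(counts.index[0], counts.iloc[0])
```

When several values share the highest count, pandas picks one of them for top, which is why the name shown for a column of unique values can differ between runs or versions.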
include=[object]
The alias np.object was removed in NumPy 1.24, so we use the built-in object here.
print(my_data.describe(include=[object]))
Output
NAME
count 6
unique 6
top King
freq 1
include=[np.number]
Show only the numeric columns. ( count, mean, std, min, 25%, 50%, 75%, max )
print(my_data.describe(include=[np.number]))
Output
ID MATH ENGLISH
count 6.000000 6.000000 6.000000
mean 3.500000 58.333333 55.000000
std 1.870829 19.407902 18.708287
min 1.000000 30.000000 30.000000
25% 2.250000 45.000000 42.500000
50% 3.500000 65.000000 55.000000
75% 4.750000 70.000000 67.500000
max 6.000000 80.000000 80.000000
exclude
print(my_data.describe(exclude=['category']))
Output
NAME ID MATH ENGLISH
count 6 6.000000 6.000000 6.000000
unique 6 NaN NaN NaN
top King NaN NaN NaN
freq 1 NaN NaN NaN
mean NaN 3.500000 58.333333 55.000000
std NaN 1.870829 19.407902 18.708287
min NaN 1.000000 30.000000 30.000000
25% NaN 2.250000 45.000000 42.500000
50% NaN 3.500000 65.000000 55.000000
75% NaN 4.750000 70.000000 67.500000
max NaN 6.000000 80.000000 80.000000
There is no category dtype in our example above. Read more with one category dtype at the end of this tutorial.
exclude=[np.number]
print(my_data.describe(exclude=[np.number]))
Output
NAME
count 6
unique 6
top King
freq 1
exclude=[object]
Exclude the object type data. We use the built-in object, as np.object is no longer available in current NumPy.
print(my_data.describe(exclude=[object]))
Output
ID MATH ENGLISH
count 6.000000 6.000000 6.000000
mean 3.500000 58.333333 55.000000
std 1.870829 19.407902 18.708287
min 1.000000 30.000000 30.000000
25% 2.250000 45.000000 42.500000
50% 3.500000 65.000000 55.000000
75% 4.750000 70.000000 67.500000
max 6.000000 80.000000 80.000000
Using category data type
Here is a sample DataFrame with one category dtype column ( grade ).
import pandas as pd
my_dict={'NAME':['Ravi','Raju','Alex','Ron','King','Jack'],
'ID':[1,2,3,4,5,6],
'MATH':[80,40,70,70,60,30],
'ENGLISH':[80,70,40,50,60,30],
'grade':['a', 'c', 'b', 'b','b','c']}
my_data = pd.DataFrame(data=my_dict)
my_data['grade']=my_data['grade'].astype('category')
print(my_data.describe(include='all'))
Output
NAME ID MATH ENGLISH grade
count 6 6.000000 6.000000 6.000000 6
unique 6 NaN NaN NaN 3
top Alex NaN NaN NaN b
freq 1 NaN NaN NaN 3
mean NaN 3.500000 58.333333 55.000000 NaN
std NaN 1.870829 19.407902 18.708287 NaN
min NaN 1.000000 30.000000 30.000000 NaN
25% NaN 2.250000 45.000000 42.500000 NaN
50% NaN 3.500000 65.000000 55.000000 NaN
75% NaN 4.750000 70.000000 67.500000 NaN
max NaN 6.000000 80.000000 80.000000 NaN
print(my_data.describe(include='category'))
Output ( only grade column included here )
grade
count 6
unique 3
top b
freq 3
We can exclude the grade column ( category dtype ) and display the other columns.
print(my_data.describe(exclude='category'))
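To confirm which columns remain after excluding the category dtype, we can look at the column labels of the result. This is a minimal sketch that rebuilds the sample data; the name result is chosen here for illustration.

```python
import pandas as pd

my_dict = {'NAME': ['Ravi', 'Raju', 'Alex', 'Ron', 'King', 'Jack'],
           'ID': [1, 2, 3, 4, 5, 6],
           'MATH': [80, 40, 70, 70, 60, 30],
           'ENGLISH': [80, 70, 40, 50, 60, 30],
           'grade': ['a', 'c', 'b', 'b', 'b', 'c']}
my_data = pd.DataFrame(data=my_dict)
my_data['grade'] = my_data['grade'].astype('category')

# Describe everything except category dtype columns
result = my_data.describe(exclude='category')

print(result.columns.tolist())    # grade should be missing from this list
print('grade' in result.columns)
```

The remaining columns are the same ones shown in the earlier exclude=['category'] output, where the data had no grade column at all.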