describe() to get summary statistics of a Series or DataFrame in Pandas

We can get descriptive statistics of a DataFrame or Series by using describe().

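percentiles : List of numbers between 0 and 1. Default is [.25,.5,.75], shown as 25%, 50% and 75%. We can specify our own list, such as [.45,.68,.89].
include : 'all', a list of data types, or None (default). Data types to be included in the output. With None, only the numeric columns of a DataFrame are described.
exclude : A list of data types to be excluded from the output, or None (default).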

Examples

We will use the options and check the output.

import numpy as np  # needed for np.number in the include / exclude examples
import pandas as pd
my_dict={'NAME':['Ravi','Raju','Alex','Ron','King','Jack'],
         'ID':[1,2,3,4,5,6],
         'MATH':[80,40,70,70,60,30],
         'ENGLISH':[80,70,40,50,60,30]}
my_data = pd.DataFrame(data=my_dict)
print(my_data['MATH'].describe())

Output

count     6.000000
mean     58.333333
std      19.407902
min      30.000000
25%      45.000000
50%      65.000000
75%      70.000000
max      80.000000

We can get the same statistics for the full DataFrame.

print(my_data.describe())

Output

             ID       MATH    ENGLISH
count  6.000000   6.000000   6.000000
mean   3.500000  58.333333  55.000000
std    1.870829  19.407902  18.708287
min    1.000000  30.000000  30.000000
25%    2.250000  45.000000  42.500000
50%    3.500000  65.000000  55.000000
75%    4.750000  70.000000  67.500000
max    6.000000  80.000000  80.000000

You can see that only the numeric columns are included; the one object column, NAME, is left out.

`count`	Number of non-null values in the column
`mean`	Mean of the values in the (numeric) column
`std`	Standard deviation of the values in the column
`min`	Minimum value appearing in the column
`25%`	25th percentile of the values in the column
`50%`	50th percentile (median) of the values in the column
`75%`	75th percentile of the values in the column
`max`	Maximum value appearing in the column
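
To make the rows above concrete, here is a small sketch that rebuilds each statistic of the MATH column with the matching Series method and places it next to the describe() result. The variable names math, summary and manual are only for illustration.

```python
import pandas as pd

math = pd.Series([80, 40, 70, 70, 60, 30], name='MATH')
summary = math.describe()

# Each row of describe() can be reproduced with a dedicated Series method.
manual = pd.Series({
    'count': math.count(),        # non-null values only
    'mean': math.mean(),
    'std': math.std(),            # sample standard deviation (ddof=1)
    'min': math.min(),
    '25%': math.quantile(0.25),
    '50%': math.quantile(0.50),   # same value as math.median()
    '75%': math.quantile(0.75),
    'max': math.max(),
})

# Side-by-side comparison of both results.
print(pd.concat([summary, manual], axis=1, keys=['describe', 'manual']))
```

Both columns should show the same numbers, which is a quick way to check what a given row means.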

percentiles

By default we get values for 25%, 50% and 75%. We can select our own percentiles by passing a list of fractions between 0 and 1, like this: percentiles=[.45,.68,.89]. The output here also keeps the 50% row (the median).

print(my_data['MATH'].describe(percentiles=[.45,.68,.89]))

Output

count     6.000000
mean     58.333333
std      19.407902
min      30.000000
45%      62.500000
50%      65.000000
68%      70.000000
89%      74.500000
max      80.000000
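
As a rough check of where these numbers come from, the sketch below compares each custom percentile row with Series.quantile(), which uses linear interpolation between sorted values by default. The loop and the label formatting are only illustrative.

```python
import pandas as pd

math = pd.Series([80, 40, 70, 70, 60, 30], name='MATH')
custom = math.describe(percentiles=[.45, .68, .89])

# Compare each requested percentile row with Series.quantile().
for q in (.45, .68, .89):
    label = f'{q:.0%}'            # 0.45 -> '45%', matching the row label
    print(label, custom[label], math.quantile(q))
```

For example, the 45% row falls a quarter of the way between the third and fourth sorted values (60 and 70), which gives 62.5.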

include

We can also apply describe() to object type columns. With include='all' the output covers columns of every data type.
Let us try include='all'.

print(my_data.describe(include='all'))

Output (note the rows unique, top and freq)

        NAME        ID       MATH    ENGLISH
count      6  6.000000   6.000000   6.000000
unique     6       NaN        NaN        NaN
top     King       NaN        NaN        NaN
freq       1       NaN        NaN        NaN
mean     NaN  3.500000  58.333333  55.000000
std      NaN  1.870829  19.407902  18.708287
min      NaN  1.000000  30.000000  30.000000
25%      NaN  2.250000  45.000000  42.500000
50%      NaN  3.500000  65.000000  55.000000
75%      NaN  4.750000  70.000000  67.500000
max      NaN  6.000000  80.000000  80.000000

In our example above we don't have any category dtype column so it is not included here. You can see the output with one category column at the end of this page.

`unique`	Number of distinct values in the column
`top`	Most frequently occurring value in the column (picked arbitrarily when several values tie, as with NAME here)
`freq`	Number of times the top value appears in the column
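
Since every NAME above is unique, these three rows are easier to see on a column with repeated values. A minimal sketch, using a made-up grades Series and value_counts() to derive the same figures:

```python
import pandas as pd

# Hypothetical column with repeated values.
grades = pd.Series(['b', 'a', 'b', 'c', 'b'], name='grade')
summary = grades.describe()
print(summary)

# The same figures derived by hand.
counts = grades.value_counts()          # sorted by frequency, descending
print('unique:', grades.nunique())      # number of distinct values
print('top:', counts.index[0])          # most frequent value
print('freq:', counts.iloc[0])          # how often the top value occurs
```

Here 'b' occurs three times, so top is b and freq is 3.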

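include=[object]

Show only the object type columns. The alias np.object was removed in NumPy 1.24, so we pass the built-in object (or the string 'object') instead.

print(my_data.describe(include=[object]))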

Output

        NAME
count      6
unique     6
top     King
freq       1

include=[np.number]

Show only the numeric columns (rows count, mean, std, min, 25%, 50%, 75%, max).

print(my_data.describe(include=[np.number]))

Output

             ID       MATH    ENGLISH
count  6.000000   6.000000   6.000000
mean   3.500000  58.333333  55.000000
std    1.870829  19.407902  18.708287
min    1.000000  30.000000  30.000000
25%    2.250000  45.000000  42.500000
50%    3.500000  65.000000  55.000000
75%    4.750000  70.000000  67.500000
max    6.000000  80.000000  80.000000

exclude

print(my_data.describe(exclude=['category']))

Output

        NAME        ID       MATH    ENGLISH
count      6  6.000000   6.000000   6.000000
unique     6       NaN        NaN        NaN
top     King       NaN        NaN        NaN
freq       1       NaN        NaN        NaN
mean     NaN  3.500000  58.333333  55.000000
std      NaN  1.870829  19.407902  18.708287
min      NaN  1.000000  30.000000  30.000000
25%      NaN  2.250000  45.000000  42.500000
50%      NaN  3.500000  65.000000  55.000000
75%      NaN  4.750000  70.000000  67.500000
max      NaN  6.000000  80.000000  80.000000

There is no category dtype column in our example above, so nothing is dropped and all the object and numeric columns are described. An example with one category dtype column is at the end of this tutorial.

exclude=[np.number]

print(my_data.describe(exclude=[np.number]))

Output

        NAME
count      6
unique     6
top     King
freq       1

exclude=[object]

Exclude the object type columns. As np.object was removed in NumPy 1.24, we use the built-in object here.

print(my_data.describe(exclude=[object]))

Output

             ID       MATH    ENGLISH
count  6.000000   6.000000   6.000000
mean   3.500000  58.333333  55.000000
std    1.870829  19.407902  18.708287
min    1.000000  30.000000  30.000000
25%    2.250000  45.000000  42.500000
50%    3.500000  65.000000  55.000000
75%    4.750000  70.000000  67.500000
max    6.000000  80.000000  80.000000
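
The include and exclude lists accept the same dtype selectors as select_dtypes(), so one way to preview which columns will be described is to run select_dtypes() with the same arguments. The sketch below uses a shortened copy of the sample data; df, numeric_cols and other_cols are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'NAME': ['Ravi', 'Raju', 'Alex'],
                   'ID': [1, 2, 3],
                   'MATH': [80, 40, 70]})

# Columns picked by select_dtypes() with the same selectors.
numeric_cols = df.select_dtypes(include=[np.number]).columns
other_cols = df.select_dtypes(exclude=[np.number]).columns

# describe() reports on the same sets of columns.
included = df.describe(include=[np.number])
excluded = df.describe(exclude=[np.number])
print(list(included.columns), list(numeric_cols))
print(list(excluded.columns), list(other_cols))
```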

Using category data type

Here is sample data with one category dtype column (grade).

import pandas as pd 
my_dict={'NAME':['Ravi','Raju','Alex','Ron','King','Jack'],
         'ID':[1,2,3,4,5,6],
         'MATH':[80,40,70,70,60,30],
         'ENGLISH':[80,70,40,50,60,30],
         'grade':['a', 'c', 'b', 'b','b','c']}
my_data = pd.DataFrame(data=my_dict)
my_data['grade']=my_data['grade'].astype('category')

print(my_data.describe(include='all'))

Output

        NAME        ID       MATH    ENGLISH  grade
count      6  6.000000   6.000000   6.000000      6
unique     6       NaN        NaN        NaN      3
top     Alex       NaN        NaN        NaN      b
freq       1       NaN        NaN        NaN      3
mean     NaN  3.500000  58.333333  55.000000    NaN
std      NaN  1.870829  19.407902  18.708287    NaN
min      NaN  1.000000  30.000000  30.000000    NaN
25%      NaN  2.250000  45.000000  42.500000    NaN
50%      NaN  3.500000  65.000000  55.000000    NaN
75%      NaN  4.750000  70.000000  67.500000    NaN
max      NaN  6.000000  80.000000  80.000000    NaN
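
With a category column in place, the 'category' selector used earlier starts to have an effect. The sketch below, on a reduced copy of the same data, describes only the category column and then everything except it; only_cat and no_cat are illustrative names.

```python
import pandas as pd

my_data = pd.DataFrame({'NAME': ['Ravi', 'Raju', 'Alex', 'Ron', 'King', 'Jack'],
                        'MATH': [80, 40, 70, 70, 60, 30],
                        'grade': ['a', 'c', 'b', 'b', 'b', 'c']})
my_data['grade'] = my_data['grade'].astype('category')

# Only the category column: rows count, unique, top, freq.
only_cat = my_data.describe(include=['category'])
print(only_cat)

# Every column except the category one.
no_cat = my_data.describe(exclude=['category'])
print(list(no_cat.columns))
```

For grade, the value b appears three times, so the top row shows b and the freq row shows 3.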