Frequency
The module provides methods for frequency analytics. It is based on the pandas crosstab function and therefore uses the analogous parameter names index, column, and value.
Accessor
Call the frequency method on a DataFrame through the crm accessor. Minimal working example:
df.crm.frequency(index="GRADE")
The module can create four different frequency tables depending on the exact parameter input:
df.crm.frequency(index="GRADE")
: frequency table per index value (rows).
df.crm.frequency(index="GRADE", column="DATE")
: frequency table per index value (rows) and column value (columns).
df.crm.frequency(index="GRADE", column="DATE", cohort="COHORT")
: frequency table per index value (rows) and column value (columns), with cohort adjustment.
df.crm.frequency(index="GRADE", column="DATE", value="EXPOSURE")
: plain pandas crosstab per index value (rows) and column value (columns), aggregating the value column with aggfunc="sum".
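For orientation, the variants without cohort adjustment can be approximated with plain pandas. The following is a minimal sketch on hypothetical toy data, not the module's implementation. It only illustrates the kind of table each parameter combination refers to.

```python
import pandas as pd

# Hypothetical toy data; column names follow the examples in this section.
df = pd.DataFrame(
    {
        "GRADE": ["A", "A", "B", "B", "B"],
        "DATE": ["2022-12-31", "2023-12-31", "2022-12-31", "2022-12-31", "2023-12-31"],
        "EXPOSURE": [100.0, 150.0, 50.0, 70.0, 30.0],
    }
)

# Counts per index value (comparable to passing index only).
per_index = df["GRADE"].value_counts().sort_index()

# Counts per index value and column value (comparable to index + column).
per_index_and_column = pd.crosstab(index=df["GRADE"], columns=df["DATE"])

# Sum of a value column per cell (comparable to index + column + value).
summed = pd.crosstab(
    index=df["GRADE"],
    columns=df["DATE"],
    values=df["EXPOSURE"],
    aggfunc="sum",
)

print(per_index)
print(per_index_and_column)
print(summed)
```

Note that pandas itself names these parameters index, columns, and values.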
If the index is categorical and should include missing values, "NaN" has to be added explicitly as a category, for example:
bins = [-np.inf, 0, 10, np.inf]
labels = ["0", "0-10", ">10"]
df["EXPOSURE_NEW"] = pd.cut(df["EXPOSURE_OLD"], bins=bins, labels=labels, right=False).values.add_categories("NaN")
df["EXPOSURE_NEW"] = df["EXPOSURE_NEW"].apply(lambda x: x if x in labels else "NaN")
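A self-contained variant of the same idea is sketched below. It uses hypothetical input values and swaps the apply call for Series.cat.add_categories followed by fillna, which is an alternative to the snippet above and not the module's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical input with one missing exposure.
df = pd.DataFrame({"EXPOSURE_OLD": [-5.0, 3.0, 25.0, np.nan]})

bins = [-np.inf, 0, 10, np.inf]
labels = ["0", "0-10", ">10"]

# Bin the values; the missing exposure stays NaN at this point.
binned = pd.cut(df["EXPOSURE_OLD"], bins=bins, labels=labels, right=False)

# Register "NaN" as a valid category, then fill the missing entries with it.
df["EXPOSURE_NEW"] = binned.cat.add_categories("NaN").fillna("NaN")

print(df["EXPOSURE_NEW"].value_counts(sort=False))
```

Because "NaN" is now an ordinary category, it appears as its own row when the column is used as an index.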
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | str | Defines the index column, analogous to pandas crosstab, for example, "GRADE". | required |
column | str | Defines the column whose values become the table columns, analogous to pandas crosstab, for example, "DATE". | None |
value | str | Defines the value column, analogous to pandas crosstab, for example, "EXPOSURE". | None |
cohort | str | Defines the cohort identifier, for example, "COHORT". | None |
Returns:
Type | Description |
---|---|
Frequency | Returns a class called "Frequency" providing frequency analytics methods. |
Methods
table(index_range=None, df_ext=None, sort_by_col=None, sort_by_list=None, sort_asc=True, add_sum=False)
Minimal working example:
df.crm.frequency(index="GRADE").table()
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index_range | list | Extends the index range. For grades, for example, the index can be extended by missing grades via a complete list such as "constants.GRADES" to get the full range. | None |
df_ext | DataFrame | Adds data (columns) to the resulting DataFrame, so the dimensions of df_ext must match it. Ideally, data in df_ext is already given in percentages. | None |
sort_by_col | str | Defines the column to sort by. | None |
sort_by_list | list | Defines the sort order for categorical items which have no intrinsic sorting order. | None |
sort_asc | bool | Sorts the column or list defined above in ascending order. | True |
add_sum | bool | Adds a sum row to the DataFrame. | False |
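The idea behind index_range and sort_by_list can be illustrated with plain pandas. This is a minimal sketch under assumptions: the GRADES list is a hypothetical stand-in for a predefined grade scale, and reindex is used here for illustration only, without implying that the module works this way internally.

```python
import pandas as pd

# Hypothetical full grade scale, standing in for a predefined list of grades.
GRADES = ["AAA", "AA", "A", "BBB", "BB", "B", "CCC", "D"]

observed = pd.Series(["B", "AAA", "B", "BBB", "AAA", "B"], name="GRADE")

# value_counts orders by frequency and omits grades that never occur.
counts = observed.value_counts()

# Reindexing against the full scale imposes the desired order
# and fills unobserved grades with 0.
complete = counts.reindex(GRADES, fill_value=0)

print(complete)
```

Alphabetical sorting would place "B" before "BB" and "BBB", which is why an explicit list is useful for rating scales.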
Returns:
Type | Description |
---|---|
DataFrame | Returns a frequency table. Generally, returns index per columns in absolute and percentage values. |
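The layout of such a table, with absolute counts, percentages, and an optional sum row, can be reproduced with plain pandas. The sketch below uses hypothetical toy data and mirrors the column naming seen in this section (GRADE_ABS, GRADE_PCT). It is an illustration of the output shape, not the module's implementation.

```python
import pandas as pd

# Hypothetical toy data.
grades = pd.Series(["A", "A", "B", "B", "B", "C", "C", "C"], name="GRADE")

# Absolute counts per grade, in alphabetical order.
abs_counts = grades.value_counts().sort_index()

table = pd.DataFrame(
    {
        "GRADE": abs_counts.index,
        "GRADE_ABS": abs_counts.to_numpy(),
        "GRADE_PCT": abs_counts.to_numpy() / abs_counts.sum(),
    }
)

# Append a total row, comparable to add_sum=True.
sum_row = pd.DataFrame(
    {
        "GRADE": ["Sum"],
        "GRADE_ABS": [table["GRADE_ABS"].sum()],
        "GRADE_PCT": [table["GRADE_PCT"].sum()],
    }
)
table = pd.concat([table, sum_row], ignore_index=True)

print(table)
```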
Examples
>>> import credit_risk_modelling as crm
>>> data = crm.load_data.load_data()
>>> data
DATE ID GRADE GRADE_PD OVERRIDE OVERRIDE_PD DEFAULT
0 2019-12-31 10 B 0.1000 B 0.1000 0
1 2019-12-31 100 BBB 0.0090 BB 0.0400 0
2 2019-12-31 1000 BBB 0.0090 BBB 0.0090 0
3 2019-12-31 1001 BBB 0.0090 BBB 0.0090 0
4 2019-12-31 1003 BBB 0.0090 BBB 0.0090 0
... ... ... ... ... ... ... ...
4145 2023-12-31 994 AA 0.0010 AA 0.0010 0
4146 2023-12-31 995 AA 0.0010 AA 0.0010 0
4147 2023-12-31 996 A 0.0020 A 0.0020 0
4148 2023-12-31 998 B 0.1000 B 0.1000 0
4149 2023-12-31 999 AAA 0.0002 AAA 0.0002 0
[4150 rows x 7 columns]
>>> (
...     data
...     .crm.frequency(index="GRADE")
...     .table(add_sum=True)
... )
GRADE GRADE_ABS GRADE_PCT
0 A 503.0 0.121205
1 AA 368.0 0.088675
2 AAA 273.0 0.065783
3 B 591.0 0.142410
4 BB 715.0 0.172289
5 BBB 735.0 0.177108
6 CCC 473.0 0.113976
7 D 492.0 0.118554
8 Sum 4150.0 1.000000
>>> (
...     data
...     .loc[lambda df: df["DATE"].dt.year.isin([2022, 2023])]
...     .crm.frequency(index="GRADE", column="DATE", cohort="ID")
...     .table(sort_by_list=crm.cfg.GRADES)
... )
GRADE 2022-12-31_ABS 2022-12-31_PCT 2023-12-31_ABS 2023-12-31_PCT
0 AAA 63 0.076736 69 0.084044
1 AA 79 0.096224 95 0.115713
2 A 109 0.132765 106 0.129111
3 BBB 126 0.153471 106 0.129111
4 BB 135 0.164434 132 0.160780
5 B 107 0.130329 108 0.131547
6 CCC 94 0.114495 99 0.120585
7 D 108 0.131547 106 0.129111
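In the output above, both date columns add up to the same total (821), which would be consistent with the cohort adjustment restricting the counts to IDs that are present on every date. This reading is an inference from the example output and is not confirmed by the documentation. Under that assumption, a minimal plain-pandas sketch on hypothetical toy data could look as follows.

```python
import pandas as pd

# Hypothetical toy data: IDs 1 and 2 appear on both dates, IDs 3 and 4 on one only.
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 1, 2, 4],
        "DATE": ["2022-12-31"] * 3 + ["2023-12-31"] * 3,
        "GRADE": ["A", "B", "B", "A", "A", "B"],
    }
)

# Assumption: the cohort consists of the IDs observed on every date.
n_dates = df["DATE"].nunique()
dates_per_id = df.groupby("ID")["DATE"].nunique()
cohort_ids = dates_per_id[dates_per_id == n_dates].index

# Count grades per date for cohort members only.
cohort_df = df[df["ID"].isin(cohort_ids)]
cohort_table = pd.crosstab(index=cohort_df["GRADE"], columns=cohort_df["DATE"])

print(cohort_table)
```

With this restriction every date column sums to the cohort size, so the percentage distributions across dates refer to the same population.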