cdpg_anonkit.sanitisation
Classes
Module Contents
- class cdpg_anonkit.sanitisation.SanitiseData
- clip(min_value: float, max_value: float) -> pandas.Series
Clip (limit) the values in a Series to a specified range.
- Parameters:
series (pd.Series) – The input Series to be clipped.
min_value (float) – The minimum value to clip to.
max_value (float) – The maximum value to clip to.
- Returns:
The clipped Series.
- Return type:
pd.Series
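The clipping behaviour described above can be sketched with plain pandas. This is a minimal illustration, not the library's own code: `clip_series` is a hypothetical stand-in, and it assumes the Series is passed as the first argument, as the parameter list suggests even though the rendered signature omits it.

```python
import pandas as pd


def clip_series(series: pd.Series, min_value: float, max_value: float) -> pd.Series:
    """Hypothetical stand-in for the documented clip behaviour."""
    # Values below min_value become min_value; values above max_value become max_value.
    return series.clip(lower=min_value, upper=max_value)


ages = pd.Series([3, 25, 47, 112])
clipped = clip_series(ages, min_value=18, max_value=90)
print(clipped.tolist())  # [18, 25, 47, 90]
```

Clipping bounds the influence of outliers without removing rows, so the Series keeps its original length and index.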
- hash_values(salt: str = '') -> pandas.Series
Hash the values in a Series using the SHA-256 algorithm.
This can be used to pseudonymise values that need to be kept secret. The salt parameter can be used to add a common salt to all values. This can be useful if you want to combine the hashed values with other columns to create a unique identifier.
- Parameters:
series (pd.Series) – The input Series to be hashed.
salt (str, optional) – The salt to add to all values before hashing. Defaults to an empty string.
- Returns:
The hashed Series.
- Return type:
pd.Series
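A rough sketch of salted SHA-256 hashing over a Series, using only `hashlib` and pandas. `hash_series` is a hypothetical helper, and prepending the salt to the string form of each value is an assumption; the library may combine salt and value differently.

```python
import hashlib

import pandas as pd


def hash_series(series: pd.Series, salt: str = "") -> pd.Series:
    """Hypothetical stand-in: SHA-256 hex digest of salt + str(value)."""
    # Assumption: the salt is prepended to the string form of each value.
    return series.map(
        lambda value: hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    )


ids = pd.Series(["alice", "bob", "alice"])
hashed = hash_series(ids, salt="project-salt")

# Equal inputs give equal digests, so hashed columns can still be joined or grouped.
print(hashed[0] == hashed[2])
```

With an empty salt, low-entropy values such as names or phone numbers can be recovered by hashing a list of candidates, so a non-empty salt that is kept private is usually the safer choice.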
- suppress(threshold: int = 5, replacement: str | int | float | None = None) -> pandas.Series
Suppress all values in a Series that occur fewer times than a given threshold.
Each value whose number of occurrences is below the threshold is replaced with the replacement value.
- Parameters:
series (pd.Series) – The input Series to be suppressed.
threshold (int, optional) – The minimum number of occurrences for a value to be kept. Defaults to 5.
replacement (Optional[Union[str, int, float]], optional) – The value to replace suppressed values with. Defaults to None, which means that the values will be replaced with NaN.
- Returns:
The Series with suppressed values.
- Return type:
pd.Series
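The suppression rule above could be approximated as follows. `suppress_rare` is a hypothetical helper written for illustration, again assuming the Series is the first argument; it counts occurrences with `value_counts` and masks rare values with `Series.where`.

```python
import numpy as np
import pandas as pd


def suppress_rare(series: pd.Series, threshold: int = 5, replacement=None) -> pd.Series:
    """Hypothetical stand-in: replace values seen fewer than `threshold` times."""
    # Map every element to the number of times its value occurs in the Series.
    counts = series.map(series.value_counts())
    # A replacement of None is treated as NaN, matching the documented default.
    other = np.nan if replacement is None else replacement
    return series.where(counts >= threshold, other)


cities = pd.Series(["Pune", "Pune", "Pune", "Delhi", "Delhi", "Leh"])
suppressed = suppress_rare(cities, threshold=2, replacement="Other")
print(suppressed.tolist())  # ['Pune', 'Pune', 'Pune', 'Delhi', 'Delhi', 'Other']
```

Values that occur exactly `threshold` times are kept, in line with the description of the threshold as the minimum number of occurrences.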
- sanitise_data(columns_to_sanitise: List[str], sanitisation_rules: Dict[str, Dict[str, str | float | int | List | Dict]], drop_na: bool = False) -> pandas.DataFrame
Sanitise a DataFrame by applying different methods to each column.
- Parameters:
df (pd.DataFrame) – The input DataFrame to be sanitised.
columns_to_sanitise (List[str]) – The columns in the DataFrame to be sanitised.
sanitisation_rules (Dict[str, Dict[str, Union[str, float, int, List, Dict]]]) –
A dictionary that maps each column in columns_to_sanitise to a dictionary that specifies the sanitisation method and parameters for that column. Each inner dictionary should contain the following keys:
- 'method' (str): the sanitisation method to use.
- 'params' (Dict[str, Union[str, float, int, List, Dict]]): the parameters for the sanitisation method.
drop_na (bool, optional) – If True, drop all rows in the DataFrame that have any NaN values in the columns specified in columns_to_sanitise. Defaults to False.
- Returns:
The sanitised DataFrame.
- Return type:
pd.DataFrame
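To make the shape of sanitisation_rules concrete, here is a small rule-dispatch sketch in plain pandas. Everything named here is hypothetical: `apply_rules`, the `HANDLERS` table, and the method names 'clip' and 'suppress' are assumptions chosen for illustration, and the strings the library actually accepts for 'method' are not stated in this reference.

```python
import numpy as np
import pandas as pd


def _clip(series, min_value, max_value):
    return series.clip(lower=min_value, upper=max_value)


def _suppress(series, threshold=5, replacement=None):
    counts = series.map(series.value_counts())
    other = np.nan if replacement is None else replacement
    return series.where(counts >= threshold, other)


# Assumed method names; the real library may use different strings.
HANDLERS = {"clip": _clip, "suppress": _suppress}


def apply_rules(df, columns_to_sanitise, sanitisation_rules, drop_na=False):
    """Hypothetical sketch: apply one rule per column, optionally dropping NaN rows."""
    out = df.copy()  # leave the caller's DataFrame untouched
    for column in columns_to_sanitise:
        rule = sanitisation_rules[column]
        handler = HANDLERS[rule["method"]]
        out[column] = handler(out[column], **rule.get("params", {}))
    if drop_na:
        # Only NaN values in the sanitised columns cause a row to be dropped.
        out = out.dropna(subset=columns_to_sanitise)
    return out


df = pd.DataFrame({"age": [15, 30, 200], "city": ["Pune", "Pune", "Leh"]})
rules = {
    "age": {"method": "clip", "params": {"min_value": 18, "max_value": 90}},
    "city": {"method": "suppress", "params": {"threshold": 2}},
}
sanitised = apply_rules(df, ["age", "city"], rules)
print(sanitised)
```

Passing `drop_na=True` in this sketch removes the row whose city was suppressed to NaN, which mirrors the documented behaviour of dropping rows with NaN in any of the sanitised columns.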