cdpg_anonkit
============

.. py:module:: cdpg_anonkit

.. autoapi-nested-parse::

   A toolkit for data anonymisation.

Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/cdpg_anonkit/aggregation/index
   /autoapi/cdpg_anonkit/generalisation/index
   /autoapi/cdpg_anonkit/noise/index
   /autoapi/cdpg_anonkit/sanitisation/index

Classes
-------

.. autoapisummary::

   cdpg_anonkit.SanitiseData
   cdpg_anonkit.GeneraliseData
   cdpg_anonkit.IncrementalGroupbyAggregator
   cdpg_anonkit.DifferentialPrivacy

Package Contents
----------------

.. py:class:: SanitiseData

   .. py:method:: clip(series: pandas.Series, min_value: float, max_value: float) -> pandas.Series

      Clip (limit) the values in a Series to a specified range.

      :param series: The input Series to be clipped.
      :type series: pd.Series
      :param min_value: The minimum value to clip to.
      :type min_value: float
      :param max_value: The maximum value to clip to.
      :type max_value: float

      :returns: The clipped Series.
      :rtype: pd.Series


   .. py:method:: hash_values(series: pandas.Series, salt: str = '') -> pandas.Series

      Hash the values in a Series using the SHA-256 algorithm.

      This can be used to pseudonymise values that need to be kept secret.
      The ``salt`` parameter adds a common salt to all values before
      hashing, which is useful if you want to combine the hashed values
      with other columns to create a unique identifier.

      :param series: The input Series to be hashed.
      :type series: pd.Series
      :param salt: The salt to add to all values before hashing.
                   Defaults to an empty string.
      :type salt: str, optional

      :returns: The hashed Series.
      :rtype: pd.Series


   .. py:method:: suppress(series: pandas.Series, threshold: int = 5, replacement: Optional[Union[str, int, float]] = None) -> pandas.Series

      Suppress all values in a Series that occur fewer times than a given
      threshold, replacing them with the replacement value.

      :param series: The input Series to be suppressed.
      :type series: pd.Series
      :param threshold: The minimum number of occurrences for a value to
                        be kept. Defaults to 5.
      :type threshold: int, optional
      :param replacement: The value to replace suppressed values with.
                          Defaults to None, which means the values will be
                          replaced with NaN.
      :type replacement: Optional[Union[str, int, float]], optional

      :returns: The Series with suppressed values.
      :rtype: pd.Series


   .. py:method:: sanitise_data(df: pandas.DataFrame, columns_to_sanitise: List[str], sanitisation_rules: Dict[str, Dict[str, Union[str, float, int, List, Dict]]], drop_na: bool = False) -> pandas.DataFrame

      Sanitise a DataFrame by applying a different method to each column.

      :param df: The input DataFrame to be sanitised.
      :type df: pd.DataFrame
      :param columns_to_sanitise: The columns in the DataFrame to be sanitised.
      :type columns_to_sanitise: List[str]
      :param sanitisation_rules: A dictionary that maps each column in
                                 ``columns_to_sanitise`` to a dictionary
                                 specifying the sanitisation method and
                                 parameters for that column, with the
                                 following keys:

                                 * 'method': str, the sanitisation method to use
                                 * 'params': Dict[str, Union[str, float, int, List, Dict]], the parameters for the sanitisation method
      :type sanitisation_rules: Dict[str, Dict[str, Union[str, float, int, List, Dict]]]
      :param drop_na: If True, drop all rows that have any NaN values in
                      the columns specified in ``columns_to_sanitise``.
                      Defaults to False.
      :type drop_na: bool, optional

      :returns: The sanitised DataFrame.
      :rtype: pd.DataFrame
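
   .. rubric:: Example

   A minimal sketch of column-wise sanitisation. The DataFrame, the column
   names, and the rule method strings (``'hash'``, ``'clip'``) are
   illustrative assumptions based on the method descriptions above, as is
   calling ``sanitise_data`` directly through the class:

   .. code-block:: python

      import pandas as pd
      from cdpg_anonkit import SanitiseData

      df = pd.DataFrame({
          'device_id': ['a', 'b', 'a', 'c'],
          'power_kw': [0.4, 120.0, 3.2, 5.1],
      })

      # The 'method'/'params' structure follows the docstring above;
      # the exact method-name spellings here are assumed.
      rules = {
          'device_id': {'method': 'hash', 'params': {'salt': 'project-salt'}},
          'power_kw': {'method': 'clip', 'params': {'min_value': 0.0,
                                                    'max_value': 10.0}},
      }

      clean = SanitiseData.sanitise_data(
          df,
          columns_to_sanitise=['device_id', 'power_kw'],
          sanitisation_rules=rules,
          drop_na=False,
      )
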
.. py:class:: GeneraliseData

   .. py:class:: SpatialGeneraliser

      .. py:method:: format_coordinates(series: pandas.Series) -> Tuple[pandas.Series, pandas.Series]
         :staticmethod:

         Clean coordinate attribute formatting.

         Takes a pandas Series of coordinates and returns a tuple of two
         Series: the first with the latitude and the second with the
         longitude. The coordinates are expected to be in the format
         "[lat, lon]". The function strips any leading or trailing
         whitespace and brackets from the coordinates, splits them into
         two parts, and converts each part to a float. If the coordinate
         string is not in the expected format, a ValueError is raised.

         :param series: The series of coordinates to be cleaned.
         :type series: pd.Series

         :returns: A tuple of two Series, one with the latitude and one
                   with the longitude.
         :rtype: Tuple[pd.Series, pd.Series]


      .. py:method:: generalise_spatial(latitude: pandas.Series, longitude: pandas.Series, spatial_resolution: int) -> pandas.Series
         :staticmethod:

         Generalise a set of coordinates to an H3 index at a given
         resolution.

         :param latitude: The series of latitude values to be generalised.
         :type latitude: pd.Series
         :param longitude: The series of longitude values to be generalised.
         :type longitude: pd.Series
         :param spatial_resolution: The spatial resolution of the H3 index.
                                    Must be between 0 and 15.
         :type spatial_resolution: int

         :returns: A series of H3 indices at the specified resolution.
         :rtype: pd.Series

         :raises ValueError: If the spatial resolution is not between 0
             and 15, or if the latitude or longitude values are not
             between -90 and 90 or -180 and 180 respectively.
         :raises Warning: If the lengths of the latitude and longitude
             series are not equal.
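
      .. rubric:: Example

      A short sketch of spatial generalisation. The coordinate strings and
      the resolution value are illustrative, and the static methods are
      assumed to be reachable through the nested class path shown here:

      .. code-block:: python

         import pandas as pd
         from cdpg_anonkit import GeneraliseData

         coords = pd.Series(['[12.9716, 77.5946]', '[13.0827, 80.2707]'])

         # Parse "[lat, lon]" strings into two float Series
         lat, lon = GeneraliseData.SpatialGeneraliser.format_coordinates(coords)

         # Resolution 7 H3 cells average roughly 5 km^2, coarsening exact
         # positions into neighbourhood-sized areas
         cells = GeneraliseData.SpatialGeneraliser.generalise_spatial(
             lat, lon, spatial_resolution=7
         )
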
   .. py:class:: TemporalGeneraliser

      .. py:method:: format_timestamp(series: pandas.Series) -> pandas.Series
         :staticmethod:

         Convert a pandas Series of timestamps into datetime objects.

         This function takes a Series containing timestamp data and
         converts it into pandas datetime objects. It handles mixed-format
         timestamps and coerces any non-parseable values into NaT
         (Not a Time).

         :param series: The input Series containing timestamp data to be
                        converted.
         :type series: pd.Series

         :returns: A Series where all timestamp values have been converted
                   to datetime objects, with non-parseable values set to
                   NaT.
         :rtype: pd.Series


      .. py:method:: generalise_temporal(data: Union[pandas.Series, pandas.DataFrame], timestamp_col: str = None, temporal_resolution: int = 60) -> pandas.Series
         :staticmethod:

         Generalise timestamp data into specified temporal resolutions.

         This function processes timestamp data, either as a Series or a
         DataFrame, and generalises it into timeslots based on the
         specified temporal resolution. The resolution must be one of
         15, 30, or 60 minutes.

         :param data: The input timestamp data. Can be a pandas Series of
                      datetime objects or a DataFrame containing a column
                      with datetime data.
         :type data: Union[pd.Series, pd.DataFrame]
         :param timestamp_col: The name of the column containing timestamp
                               data in the DataFrame. Must be specified if
                               the input data is a DataFrame. Defaults to
                               None.
         :type timestamp_col: str, optional
         :param temporal_resolution: The temporal resolution in minutes to
                                     which the timestamps should be
                                     generalised. Allowed values are 15,
                                     30, or 60. Defaults to 60.
         :type temporal_resolution: int, optional

         :returns: A pandas Series representing the generalised timeslots,
                   with each entry formatted as 'hour_minute', indicating
                   the start of the timeslot.
         :rtype: pd.Series

         :raises AssertionError: If the temporal resolution is not one of
             the allowed values (15, 30, 60).
         :raises ValueError: If `timestamp_col` is not specified when the
             input data is a DataFrame, if the specified column is not
             found in the DataFrame, or if the timestamps cannot be
             converted to datetime objects.
         :raises TypeError: If the input data is neither a pandas Series
             nor a DataFrame.

         .. rubric:: Example

         .. code-block:: python

            # Using with a Series
            generalise_temporal(ts_series)

            # Using with a DataFrame
            generalise_temporal(df, timestamp_col='timestamp')


   .. py:class:: CategoricalGeneraliser

      .. py:method:: generalise_categorical(data: pandas.Series, bins: Union[int, List[float]], labels: Optional[List[str]] = None) -> pandas.Series
         :staticmethod:

         Generalise a categorical column by binning the values into
         categories.

         :param data: The input Series to be generalised.
         :type data: pd.Series
         :param bins: The number of bins to use, or a list of bin edges.
         :type bins: Union[int, List[float]]
         :param labels: The labels to use for each bin. If not specified,
                        the bin edges will be used as labels.
         :type labels: Optional[List[str]], optional

         :returns: The generalised Series.
         :rtype: pd.Series


.. py:class:: IncrementalGroupbyAggregator(group_columns: Union[str, List[str]], agg_column: str, agg_func: Literal['sum', 'count', 'min', 'max', 'mean'])

   Handles incremental aggregation of large datasets processed in chunks.

   Carefully merges chunk-level statistics to ensure a correct final
   aggregation.

   .. py:attribute:: group_columns

   .. py:attribute:: agg_column

   .. py:attribute:: agg_func

   .. py:attribute:: _group_stats
      :type: Dict[tuple, Dict[str, Any]]

   .. py:method:: _merge_chunk_stats(existing: Dict[str, Any], new_chunk: Dict[str, Any]) -> Dict[str, Any]

      Merge chunk-level statistics into existing statistics.

      :param existing: The existing statistics to merge into.
      :type existing: Dict[str, Any]
      :param new_chunk: The new chunk statistics to merge.
      :type new_chunk: Dict[str, Any]

      :returns: The merged statistics.
      :rtype: Dict[str, Any]

      :raises ValueError: If the aggregation function is not one of
          {'mean', 'sum', 'min', 'max', 'count'}.

   .. py:method:: process_chunk(chunk: pandas.DataFrame)

      Process a chunk of data by performing aggregation and updating
      internal statistics.

      This method validates the chunk's columns, performs a groupby
      aggregation based on the specified aggregation function, and merges
      the computed statistics into the internal storage for incremental
      aggregation.

      :param chunk: A DataFrame representing a chunk of data to be
                    processed. It must contain the columns specified in
                    `self.group_columns` and `self.agg_column`.
      :type chunk: pd.DataFrame

      :raises ValueError: If any of the required columns specified in
          `self.group_columns` or `self.agg_column` are not found in the
          chunk, or if the aggregation function is unsupported.

   .. py:method:: get_final_result() -> pandas.DataFrame

      Return the final result as a DataFrame after all chunks have been
      processed.

      After all chunks have been processed using `process_chunk`, this
      method returns a DataFrame containing the final result of the
      aggregation. The columns of the DataFrame include the group columns
      and the aggregated column, named after the specified aggregation
      function (e.g. 'mean', 'sum', 'min', 'max', or 'count').

      :returns: The final result of the aggregation.
      :rtype: pd.DataFrame
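
   .. rubric:: Example

   A minimal sketch of chunked aggregation. The CSV path and the column
   names are placeholders:

   .. code-block:: python

      import pandas as pd
      from cdpg_anonkit import IncrementalGroupbyAggregator

      agg = IncrementalGroupbyAggregator(
          group_columns='station_id',
          agg_column='occupancy',
          agg_func='mean',
      )

      # Stream the file in chunks so the full dataset never has to fit
      # in memory; per-chunk statistics are merged incrementally
      for chunk in pd.read_csv('readings.csv', chunksize=100_000):
          agg.process_chunk(chunk)

      # One row per group, with the aggregate column named after agg_func
      result = agg.get_final_result()
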
.. py:class:: DifferentialPrivacy(mechanism: Literal['laplace', 'gaussian', 'exponential'], epsilon: float = 1.0, delta: Optional[float] = None, sensitivity: Optional[float] = None)

   Differential privacy mechanism with support for various noise addition
   strategies.

   Focuses on fundamental DP principles with extensibility in mind.

   .. py:attribute:: mechanism

   .. py:attribute:: epsilon
      :value: 1.0

   .. py:attribute:: delta
      :value: None

   .. py:attribute:: _sensitivity
      :value: None

   .. py:method:: clip_count(count: int, lower_bound: int = 0, upper_bound: Optional[int] = None) -> int
      :staticmethod:

      Clip the count to a specified range.

      This static method ensures that the count provided falls within the
      specified lower and upper bounds. If the upper bound is not
      provided, the count is clipped to the lower bound only.

      :param count: The original count to be clipped.
      :type count: int
      :param lower_bound: The minimum value to clip to, by default 0.
      :type lower_bound: int, optional
      :param upper_bound: The maximum value to clip to, by default None.
      :type upper_bound: Optional[int], optional

      :returns: The clipped count, constrained by the specified bounds.
      :rtype: int

   .. py:method:: compute_sensitivity(query_type: str = 'count', lower_bound: int = 0, upper_bound: Optional[int] = None) -> float

      Compute the sensitivity of a query based on its type and bounds.

      Sensitivity measures how much the output of a query can change when
      a single record in the dataset is modified. It is crucial for
      determining the amount of noise to add in differential privacy
      mechanisms.

      :param query_type: The type of query for which sensitivity is being
                         computed. Currently supported: 'count'.
                         Defaults to 'count'.
      :type query_type: str, optional
      :param lower_bound: The minimum value constraint for the query.
                          Defaults to 0.
      :type lower_bound: int, optional
      :param upper_bound: The maximum value constraint for the query.
                          If None, no upper bound is considered.
                          Defaults to None.
      :type upper_bound: Optional[int], optional

      :returns: The sensitivity of the query. For 'count' queries, this
                is 1.0.
      :rtype: float

      :raises ValueError: If the sensitivity computation for the specified
          query_type is not implemented.

   .. py:method:: add_noise(value: Union[int, float], sensitivity: Optional[float] = None, epsilon: Optional[float] = None) -> Union[int, float]

      Add noise to a given value according to the specified differential
      privacy mechanism.

      Depending on the mechanism set during initialisation, this method
      adds noise to the input value to ensure differential privacy.
      Currently, the Laplace mechanism is implemented, with plans to
      support the Gaussian and Exponential mechanisms.

      :param value: The original value to which noise will be added.
      :type value: Union[int, float]
      :param sensitivity: The sensitivity of the query. If not provided,
                          the class-level sensitivity will be used.
                          Defaults to None.
      :type sensitivity: Optional[float], optional
      :param epsilon: The privacy budget. If not provided, the class-level
                      epsilon will be used. Defaults to None.
      :type epsilon: Optional[float], optional

      :returns: The value with added noise according to the specified
                mechanism.
      :rtype: Union[int, float]

      :raises ValueError: If the Gaussian mechanism is selected but delta
          is not specified, or if the mechanism is unsupported.
      :raises NotImplementedError: If the Gaussian or Exponential
          mechanism is selected, as they are not yet implemented.
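
   .. rubric:: Example

   A sketch of a differentially private count with the Laplace mechanism,
   the one mechanism documented as implemented. The raw count, bounds, and
   budget values are illustrative:

   .. code-block:: python

      from cdpg_anonkit import DifferentialPrivacy

      dp = DifferentialPrivacy(mechanism='laplace', epsilon=0.5)

      # Clip the raw count into a known range before adding noise
      count = DifferentialPrivacy.clip_count(412, lower_bound=0,
                                             upper_bound=1000)

      # For a count query, adding or removing one record changes the
      # result by at most 1, so the sensitivity is 1.0
      sensitivity = dp.compute_sensitivity(query_type='count')

      noisy_count = dp.add_noise(count, sensitivity=sensitivity)
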