cdpg_anonkit.aggregation ======================== .. py:module:: cdpg_anonkit.aggregation Classes ------- .. autoapisummary:: cdpg_anonkit.aggregation.IncrementalGroupbyAggregator Functions --------- .. autoapisummary:: cdpg_anonkit.aggregation.example_usage Module Contents --------------- .. py:class:: IncrementalGroupbyAggregator(group_columns: Union[str, List[str]], agg_column: str, agg_func: Literal['sum', 'count', 'min', 'max', 'mean']) Handles incremental aggregation of large datasets processed in chunks. Carefully merges chunk-level statistics to ensure correct final aggregation. .. py:attribute:: group_columns .. py:attribute:: agg_column .. py:attribute:: agg_func .. py:attribute:: _group_stats :type: Dict[tuple, Dict[str, Any]] .. py:method:: _merge_chunk_stats(existing: Dict[str, Any], new_chunk: Dict[str, Any]) -> Dict[str, Any] Merge chunk-level statistics into existing statistics. :param existing: The existing statistics to merge into. :type existing: Dict[str, Any] :param new_chunk: The new chunk statistics to merge. :type new_chunk: Dict[str, Any] :returns: The merged statistics. :rtype: Dict[str, Any] :raises ValueError: If the aggregation function is not one of {'mean', 'sum', 'min', 'max', 'count'}. .. py:method:: process_chunk(chunk: pandas.DataFrame) Process a chunk of data by performing aggregation and updating internal statistics. This method processes a given data chunk by validating its columns, performing groupby aggregation based on the specified aggregation function, and merging the computed statistics into the internal storage for incremental aggregation. :param chunk: A DataFrame representing a chunk of data to be processed. It must contain the columns specified in `self.group_columns` and `self.agg_column`. :type chunk: pd.DataFrame :raises ValueError: If any of the required columns specified in `self.group_columns` or `self.agg_column` are not found in the chunk, or if the aggregation function is unsupported. .. py:method:: get_final_result() -> pandas.DataFrame Return the final result as a DataFrame after all chunks have been processed. After all chunks have been processed using `process_chunk`, this method returns a DataFrame containing the final result of the aggregation. The columns of the DataFrame include the group columns and the aggregated column with a name based on the specified aggregation function (e.g. 'mean', 'sum', 'min', 'max', or 'count'). :returns: The final result of the aggregation. :rtype: pd.DataFrame .. py:function:: example_usage()