cdpg_anonkit.aggregation
Classes
| IncrementalGroupbyAggregator | Handles incremental aggregation of large datasets processed in chunks. |
Functions
Module Contents
- class cdpg_anonkit.aggregation.IncrementalGroupbyAggregator(group_columns: str | List[str], agg_column: str, agg_func: Literal['sum', 'count', 'min', 'max', 'mean'])
- Handles incremental aggregation of large datasets processed in chunks.
- Carefully merges chunk-level statistics to ensure correct final aggregation.
 - group_columns
 - agg_column
 - agg_func
 - _group_stats: Dict[tuple, Dict[str, Any]]
 - _merge_chunk_stats(existing: Dict[str, Any], new_chunk: Dict[str, Any]) -> Dict[str, Any]
- Merge chunk-level statistics into existing statistics.
- Parameters:
- existing (Dict[str, Any]) – The existing statistics to merge into. 
- new_chunk (Dict[str, Any]) – The new chunk statistics to merge. 
 
- Returns:
- The merged statistics. 
- Return type:
- Dict[str, Any] 
- Raises:
- ValueError – If the aggregation function is not one of {‘mean’, ‘sum’, ‘min’, ‘max’, ‘count’}. 
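- Example: the following is a minimal sketch of how such a merge could work, not the library's actual implementation. The function name `merge_chunk_stats`, the explicit `agg_func` argument, and the stats layout (a `"value"` key, or `"sum"` and `"count"` keys for the mean) are all assumptions made for illustration. The point it shows is that a mean cannot be merged from two means alone, so a running sum and count are carried instead.

```python
from typing import Any, Dict


def merge_chunk_stats(
    existing: Dict[str, Any], new_chunk: Dict[str, Any], agg_func: str
) -> Dict[str, Any]:
    """Hypothetical stand-in for _merge_chunk_stats; the stats layout is assumed."""
    if agg_func in ("sum", "count"):
        # Sums and counts from separate chunks simply add up.
        return {"value": existing["value"] + new_chunk["value"]}
    if agg_func == "min":
        return {"value": min(existing["value"], new_chunk["value"])}
    if agg_func == "max":
        return {"value": max(existing["value"], new_chunk["value"])}
    if agg_func == "mean":
        # Carry sum and count; the mean is derived only at the very end.
        return {
            "sum": existing["sum"] + new_chunk["sum"],
            "count": existing["count"] + new_chunk["count"],
        }
    raise ValueError(f"Unsupported aggregation function: {agg_func!r}")


merged = merge_chunk_stats({"sum": 10.0, "count": 4}, {"sum": 2.0, "count": 1}, "mean")
print(merged["sum"] / merged["count"])  # 2.4
```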
 
 - process_chunk(chunk: pandas.DataFrame)
- Process a chunk of data by performing aggregation and updating internal statistics.
- This method validates the chunk's columns, performs a groupby aggregation using the specified aggregation function, and merges the computed statistics into internal storage for incremental aggregation.
- Parameters:
- chunk (pd.DataFrame) – A DataFrame representing a chunk of data to be processed. It must contain the columns specified in self.group_columns and self.agg_column. 
- Raises:
- ValueError – If any of the required columns specified in self.group_columns or self.agg_column are not found in the chunk, or if the aggregation function is unsupported. 
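- Example: the validation and per-chunk groupby described above might look roughly like the sketch below. It is an illustration under stated assumptions rather than the library's code: `summarize_chunk` is a hypothetical helper, and the choice to keep a sum and a count per group (keyed by a tuple, mirroring the `Dict[tuple, Dict[str, Any]]` type of `_group_stats`) is a guess at one workable layout.

```python
from typing import Any, Dict, List, Tuple

import pandas as pd


def summarize_chunk(
    chunk: pd.DataFrame, group_columns: List[str], agg_column: str
) -> Dict[Tuple, Dict[str, Any]]:
    """Hypothetical helper: validate a chunk and compute per-group sum and count."""
    required = [*group_columns, agg_column]
    missing = [col for col in required if col not in chunk.columns]
    if missing:
        raise ValueError(f"Columns not found in chunk: {missing}")

    grouped = chunk.groupby(group_columns)[agg_column].agg(["sum", "count"])

    stats: Dict[Tuple, Dict[str, Any]] = {}
    for key, row in grouped.iterrows():
        # A single group column yields scalar keys; normalise them to tuples.
        key = key if isinstance(key, tuple) else (key,)
        stats[key] = {"sum": float(row["sum"]), "count": int(row["count"])}
    return stats


chunk = pd.DataFrame({"city": ["A", "A", "B"], "age": [30, 40, 50]})
stats = summarize_chunk(chunk, ["city"], "age")
print(stats)
```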
 
 - get_final_result() -> pandas.DataFrame
- Return the final result as a DataFrame after all chunks have been processed.
- Call this method once every chunk has been passed to process_chunk. The returned DataFrame contains the group columns and the aggregated column, whose name is based on the specified aggregation function (e.g. ‘mean’, ‘sum’, ‘min’, ‘max’, or ‘count’).
- Returns:
- The final result of the aggregation. 
- Return type:
- pd.DataFrame 
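- Example: to illustrate what a correct final result means here, the sketch below accumulates per-group sums and counts over small chunks using plain pandas and compares the outcome with a single groupby over the full data. It does not call this class. The sample data, the chunk size of two rows, and the output column name `mean` are assumptions chosen for the illustration.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "city": ["A", "B", "A", "B", "A"],
        "age": [30, 50, 40, 70, 50],
    }
)

# Accumulate per-group sum and count across chunks of two rows each.
totals = None
for start in range(0, len(data), 2):
    chunk = data.iloc[start : start + 2]
    part = chunk.groupby("city")["age"].agg(["sum", "count"])
    # fill_value=0 keeps groups that are absent from the current chunk.
    totals = part if totals is None else totals.add(part, fill_value=0)

# Derive the mean only after every chunk has been merged.
result = (totals["sum"] / totals["count"]).rename("mean").reset_index()

# Reference: the same aggregation computed over the full data in one pass.
expected = data.groupby("city")["age"].mean().rename("mean").reset_index()

print(result)
```

Averaging the per-chunk means directly would weight each chunk equally regardless of its size, which is why the sum and count are merged instead.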
 
 
- cdpg_anonkit.aggregation.example_usage()