This repository contains the codebase for the paper: tbd.
```bibtex
@article{tbd.,
  title={...},
  author={...},
  journal={...},
  year={...}
}
```

User-driven privacy allows individuals to control whether and at what granularity their data is shared, leading to datasets that mix original, generalized, and missing values within the same records and attributes. While such representations are intuitive for privacy, they pose challenges for machine learning, which typically treats non-original values as new categories or as missing, thereby discarding generalization semantics. For learning from such tabular data, we propose novel data transformation strategies that account for heterogeneous anonymizations and evaluate them alongside standard imputation and LLM-based approaches. We employ multiple datasets, privacy configurations, and deployment scenarios, demonstrating that our method reliably regains utility. Our results show that generalized values are preferable to pure suppression, that the best data preparation strategy depends on the scenario, and that consistent data representations are crucial for maintaining downstream utility. Overall, our findings highlight that effective learning is tied to the appropriate handling of anonymized values.
The (anonymized) datasets and LLM imputation/prediction results used in our experiments are publicly available at: Zenodo
Alternatively, generate the datasets anew by following the instructions below.
Main Scripts and Locations:
- `run_all.sh`, `run.job`, `run.py`: The main scripts for running the general experiments with different configurations.
- `llm_runs.sh`, `llm.job`, `llm_src/`: Scripts for the LLM evaluation.
- `_.env_llm`, `.env_llm`: Environment variables for the LLM (API URL, key, model); fill in and rename from `_.env_llm` to `.env_llm`.
- `requirements_*.txt`: Packages needed for installation.
- `DatasetCreation.ipynb`: Notebook for individual generation of the original and anonymized datasets.
- `DatasetCreation_Specialization.ipynb`: Notebook for individual generation of specialized datasets with filtering (optional, normally handled in-memory during experiments).
- `src/`: Contains the core Python source code for all pipeline steps.
  - `DatasetManager.py`: Collection of important dataset information used in the workflows.
  - `spalten/`: Dataset column specifications and generalization hierarchies.
  - `DatasetCreation.py`: Creates dataset versions based on specified privacy preferences (percentages of original, generalized, and missing data).
  - `Main.py`: Contains the main evaluation loop (`run_evaluation`) for model training and testing.
  - Additional modules for data loading, model evaluation, etc.
Data and Output Directory Reference:
- `data/`: Raw and processed datasets
  - `{dataset}/`: Original dataset files
  - `{dataset}/generalization/{percentage}/`: Anonymized datasets for each privacy distribution
  - `{dataset}/forced_generalization/{percentage}/`: Forced generalization preparation output
  - `{dataset}/specialization/{percentage}/`: Optional pre-generated specialization datasets (normally created in-memory during experiments anyway)
- `llm_evaluation/`: LLM evaluation outputs
  - `{percentage}_results/`: Intermediate evaluation results (rename to `{percentage}/` when complete)
  - `{percentage}/`: Final results used by the model evaluation
- `llm_slurm_logs/`: Slurm job logs for the LLM experiments
- `results/`: Overall experiment run outputs
- `slurm_logs/`: Slurm job logs for the overall experiments
We used, and recommend using, a Slurm + Dask combination on a cluster for the general and LLM experiments. Local mode is sufficient for creating anonymized datasets, local LLM setups, and testing.
- Slurm: For distributed job scheduling (not required for dataset generation and small LLM outputs)
- Dask: For parallel processing (not required for dataset generation and LLM outputs but included in requirements file)
- Python 3.x: Version depends on the environment; dependencies are listed in the requirements files
The project provides different requirement files for different use cases:
- `requirements_local.txt`: For local development (Python 3.9.19, includes LLM packages)
- `requirements_cluster.txt`: Reference for the cluster environment as loaded by `run.job`; adjust the job files to your environment (Python/3.9.6 on the cluster, without LLM packages, as they are loaded via `llm.job` for such experiments)
- `requirements_llm.txt`: Only the packages needed for the LLM experiments (Python 3.12.3 on the cluster)
```bash
# Installing packages for local development (for the cluster see run.job and llm.job),
# using a conda environment
conda create -n requirements-local python=3.9.19 -y && \
conda run -n requirements-local pip install -r requirements_local.txt

# If the pip install from the file does not work:
# For platforms with compatibility issues, install several packages via conda-forge
# to ensure support (all versions copied from the requirements file)
conda create -n requirements-local python=3.9.19 -y && \
conda install -n requirements-local numpy=1.23.5 scipy=1.11.3 pandas=1.5.3 scikit-learn=1.2.2 matplotlib=3.5.2 -c conda-forge -y && \
conda run -n requirements-local pip install 'dask[distributed]==2023.1.0' dask-jobqueue==0.8.1 dask-mpi==2022.4.0 aiohttp==3.11.11 fastparquet==2024.11.0 folktables==0.0.12 h5py==3.11.0 jupyterlab==3.6.0 python-dotenv==1.0.0 requests==2.32.3 seaborn==0.13.2 xgboost==2.0.3
```

This section covers how to run specific parts of the experiments and how to extend the codebase with new functionality.
Step 1: Dataset Creation
This step is only needed if you want to create the datasets separately; otherwise it is performed and ensured at the start of each run in `run_all.sh` / `run.job` / `run.py` in Step 3. You may also download the datasets from the Zenodo repository instead of generating them yourself.
Use `DatasetCreation.ipynb`, which provides a convenient wrapper for running the required script (see the example command after the list below):
- Downloads, cleans, and splits the original datasets
- Creates anonymized versions depending on the set percentages
- Creates prepared versions for the forced generalization method
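One way to open the notebook locally, assuming the `requirements-local` conda environment from the installation example above (adjust the environment name if yours differs):

```bash
# Activate the local environment and open the dataset creation notebook in JupyterLab
conda activate requirements-local
jupyter lab DatasetCreation.ipynb
```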
Step 1b: Specialization Datasets (Optional)
Note: Specialization datasets are normally created in-memory during the experiments (Step 3). This step is only needed if you want to pre-generate specialization data for analysis or reuse; the actual specialization workflow still creates the data in memory and does not read from or write to disk.
Use `DatasetCreation_Specialization.ipynb` to create specialized datasets with the record-based approach:
- Loads generalized data from `data/{dataset}/generalization/{percentages}/`
- For each record, generates variants by expanding generalized values and, if desired, applies filtering to select the best variants
- Saves the filtered datasets with descriptive names: `specialization_{mode}_n{duplicates}.csv`
Assuming the default `data/` location, we get the following structure:
- `data/`: original and anonymized data for the four datasets, sorted into subfolders by dataset name
  - original: at the top level of each dataset folder we find a `train.csv` and a `test.csv` with the original data in an 80/20 split, and the following subfolders:
    - `generalization/`: contains the anonymized data, again sorted into subfolders for the different privacy distributions, with train and test data for each
    - `forced_generalization/`: similar structure to `generalization/`, but contains the data prepared by the forced generalization method on the anonymized datasets; it contains the whole dataset (no split), which is then joined into the respective train and test parts by record IDs during the experiments
    - `specialization/`: optional folder for pre-generated specialization datasets (see Step 1b above); normally, specialization is created and handled in-memory during experiments
  - Note: the resulting datasets of other methods are not saved to disk, as they are created and handled in-memory only
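For illustration, a possible layout for one dataset and one privacy distribution; the dataset name, the distribution, and the file names inside the subfolders are examples taken from other parts of this README and may differ from the actual output:

```
data/
└── german/
    ├── train.csv                     # original training split
    ├── test.csv                      # original test split
    ├── generalization/
    │   └── 33-33-34/                 # one privacy distribution
    │       └── ...                   # anonymized train and test data
    ├── forced_generalization/
    │   └── 33-33-34/
    │       └── ...                   # prepared full dataset (no train/test split)
    └── specialization/               # optional, see Step 1b
        └── 33-33-34/
            └── specialization_{mode}_n{duplicates}.csv
```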
Step 2: LLM Evaluation
The project includes LLM-based approaches for data imputation and prediction. While prediction is a method on its own, the imputation outputs of an LLM are needed when using the `llm_imputation` method in Step 3 runs. You may also download them from the Zenodo repository instead of running the evaluation yourself.
What LLM evaluation does:
For each dataset's training and test sets:
- Value Imputation: Predicts missing/generalized values and creates imputed dataset versions (both parts saved as CSV)
- Label Prediction: Directly predicts target labels from anonymized records (results saved as CSV)
Results are saved in `llm_evaluation/{percentage}_results/` directories. When finished, rename them to `llm_evaluation/{percentage}/` for use in the model evaluation.
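For example, for the 33-33-34 distribution used elsewhere in this README:

```bash
# Rename the intermediate results folder so the model evaluation picks it up
mv llm_evaluation/33-33-34_results llm_evaluation/33-33-34
```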
- Configure the LLM API settings in `_.env_llm` and rename it to `.env_llm` (in the project root; an example rename command is shown after this list):

  ```
  LLM_API_BASE_URL="..."
  LLM_API_KEY="..."
  LLM_MODEL="..."
  ```

- Install (at least) the LLM-specific dependencies, or check the cluster dependencies in the `llm.job` file:

  ```bash
  pip install -r requirements_llm.txt
  ```

- Ensure the original and anonymized datasets exist, assuming the default `data/` location (from Step 1, from Step 3 runs (except for the `llm_imputation` method), or downloaded from the Zenodo repository).
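A minimal way to create the configuration file from the template (filling in the values is up to you):

```bash
# Create your private LLM configuration from the template and fill in the values
cp _.env_llm .env_llm   # or mv, to rename the template as described above
```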
Option 1: Local Execution for Testing or Smaller Tasks
- Use `llm_evaluation_test.ipynb` in `llm_src/`
Option 2: Interactive Shell Wrapper (Slurm)
Starts jobs on the cluster that interact with the LLM API. See `llm_runs.sh` and `llm.job`.
```bash
# command: bash llm_runs.sh launch <percentage> <datasets> <partitions> [batch_size] [input_dir]

# Single percentage, multiple datasets
bash llm_runs.sh launch "33-33-34" "Adult-Diabetes-German" 1 10

# Multiple percentages, with partitioning for large datasets
bash llm_runs.sh launch "33-33-34,66-17-17" "Employment" 3 10

# Check job status
bash llm_runs.sh status

# Merge partitioned results after completion (if applicable)
bash llm_runs.sh merge "33-33-34" "Employment" 3
```

Other than the top-level job scripts, the subfolder `llm_src/` contains the LLM-related scripts:
- `llm_runs.sh`: Shell wrapper for launching multiple LLM-related jobs with extra options
- `llm.job`: Slurm job definition file
- `llm_src/llm_evaluation.py`: Main LLM evaluation script
- `llm_src/llm_evaluation_test.ipynb`: Jupyter notebook for small test runs
- `llm_src/llm_evaluation_launcher.py`: Job launcher with partitioning support
- `llm_src/llm_evaluation_merger.py`: Merges results from partitioned runs
- `llm_src/merge_llm_results.py`: Utility script for merging partitioned results
Step 3: Model Training and Evaluation
We highly recommend Slurm + Dask on a cluster. Local execution is still possible with a `LocalCluster`, as implemented in `Main.py`.
Run the complete evaluation pipeline to train and test models on the prepared datasets. Results are automatically saved to run-specific subfolders (based on the `run_all.sh` session) in the `results/` directory and organized following the structure: dataset -> distribution -> JSON files.
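Schematically, with `{run_session}` as a placeholder for the session-based subfolder name:

```
results/{run_session}/{dataset}/{distribution}/*.json
```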
To run a comprehensive set of experiments, you can configure and use the run_all.sh script.
- Configure `run_all.sh`: Open the `run_all.sh` script and modify the arrays (`datasets`, `percentages_list`, `train_methods`, `test_methods`, etc.) by commenting out unwanted runs to define the parameter space for your experiments. Also consider the variables `DATA_DIR` and `batch_size` (and `require_matching_methods` if needed). Cluster-related settings for the Dask `SLURMCluster` constructor need to be changed in `Main.py`. A sketch of such a configuration is shown after this list.
- Execute the script:

  ```bash
  bash run_all.sh
  ```

  This will start submitting jobs through the `run.job` file to the Slurm cluster based on the configurations. Each job will execute the `run.py` script with a different combination of parameters.
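For orientation, a hypothetical excerpt of such a configuration; the array and variable names come from the description above, but the values, spellings, and formats shown here are only assumptions and must be checked against the actual script:

```bash
# Hypothetical run_all.sh excerpt -- values are illustrative, not the shipped defaults
datasets=("german" "adult")                   # datasets to run; comment out unwanted ones
percentages_list=("33-33-34" "66-17-17")      # privacy distributions (original-generalized-missing)
train_methods=("original" "llm_imputation")   # data preparation methods for training
test_methods=("original")                     # data preparation methods for testing
DATA_DIR="./data"                             # location of the (anonymized) datasets
batch_size=10                                 # batch size passed to the submitted jobs
```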
You may use `run.py` to start single experiments without submitting a Slurm job. See `run.py` for the parameters; for the configurations we used, see `run_all.sh` and `run.job`. Local runs use a `LocalCluster` (see `Main.py`), which we generally did not use for the experiments.
Example command:
```bash
python run.py \
--save_dir ./results/my_experiment \
--data_dir ./data \
--dataset german \
--train_method original \
--test_method original \
--percentages "1.0 0.0 0.0" \
--n_workers 1
```

Adding a New Privacy Distribution
- Just run the overall experiments with the dataset and the new percentages, and everything will be created accordingly (or run the individual dataset creation)
- LLM runs on the new distributions are needed for the LLM-based imputation and prediction results
Adding a New Dataset
- Add dataset configuration: Create a config entry in `src/DatasetManager.py` with dataset information, attributes, and anonymization levels.
- Create column specifications: Add a new class in the `src/spalten/` folder (following the pattern of the existing datasets) that defines the generalizations for each column. Import the class in `src/spalten/__init__.py`.
- Configure dataset download: Add dataset information to `download_dataset_if_missing()` in `src/DatasetCreation.py` for automatic download, or add a custom download function (as for the Employment dataset), or manually place the files in the `data/` folder.
- Implement data cleaning: Review `clean_and_split_data()` in `src/Vorverarbeitung.py` and add any dataset-specific cleaning or splitting logic required.
- Update batch scripts: Add the dataset name to the `datasets` array in `run_all.sh` for batch processing.