Emory-AIMS/FusionDP

FusionDP: Foundation Model-Assisted Differentially Private Learning for Partially Sensitive Attributes

Official repository for FusionDP (submitted to VLDB 2026)

📄 arXiv version Official paper coming soon.

Datasets in sensitive domains often contain attributes with heterogeneous privacy requirements, where different features are subject to different access controls and compliance policies. For instance, in health datasets, specific identifiers like age and race are protected health information, while clinical measurements and lab results may be freely used for analytics. However, existing differential privacy (DP) mechanisms apply uniform protection across all features, leading to excessive noise injection that degrades the utility of machine learning models. We propose FusionDP, a framework that enables feature-level privacy control when training over partially sensitive data. FusionDP first leverages foundation models to impute sensitive features from non-sensitive ones, creating a privacy-preserving view of the data. It then uses a modified DP-SGD algorithm that trains on both original and imputed features while formally guaranteeing privacy for sensitive attributes. We evaluate FusionDP on 4 different classification tasks. Compared to standard DP-SGD baselines, FusionDP significantly improves model accuracy while maintaining rigorous feature-level privacy, demonstrating how exploiting feature-level heterogeneity enhances the privacy-utility tradeoff in sensitive data analytics.

Problem setting and training pipeline

We consider a scenario where only a subset of features requires privacy protection, while the rest can be used without restrictions.

Formally, let $\mathcal{X}$ denote the data space, where each data point $x \in \mathcal{X}$ can be decomposed into sensitive (private) components $x_{\text{priv}}$ and non-sensitive (public) components $x_{\text{pub}}$, such that $x = (x_{\text{priv}}, x_{\text{pub}})$. In real-world applications, $x_{\text{priv}}$ may include demographic attributes, rare diagnoses, education history, or identifiable spans in text that pose greater re-identification risks, or sensitive features specified by users. Meanwhile, $x_{\text{pub}}$ encompasses features like lab results, transactions, or non-identifying text tokens that do not require the same level of protection.
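
The decomposition above can be sketched in a few lines of Python. This is only an illustration: the column names and the choice of which ones count as sensitive are hypothetical and are not taken from the repository's scripts.

```python
# Hypothetical sensitive set; the real choice depends on the dataset and policy.
SENSITIVE = ("age", "marital")


def split_record(record, sensitive=SENSITIVE):
    """Decompose a record x into (x_priv, x_pub) by column name."""
    x_priv = {k: v for k, v in record.items() if k in sensitive}
    x_pub = {k: v for k, v in record.items() if k not in sensitive}
    return x_priv, x_pub


# A made-up record used only to show the split.
record = {"age": 41, "marital": "married", "balance": 1500, "duration": 180}
x_priv, x_pub = split_record(record)
print("x_priv:", x_priv)
print("x_pub:", x_pub)
```

Merging the two parts back together recovers the original record, which mirrors $x = (x_{\text{priv}}, x_{\text{pub}})$.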

[Figure: FusionDP training pipeline]

FusionDP is a two-step framework for achieving feature-DP with improved utility; the figure illustrates the training pipeline. First, we use foundation models to generate hybrid samples in which sensitive features are replaced by imputed values. Second, we train the model with a combined loss objective made of a public component (green in the figure) and a private component (red). We clip and add noise only to the gradient of the private loss, which isolates and bounds the contribution of the private features. Within this framework, we improve the private gradient component through gradient calibration and a proposed representation-consistency regularizer that aligns the hidden states of original and hybrid inputs.
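
As a rough illustration of "clip and add noise only to the gradient of the private loss", here is a minimal pure-Python sketch for logistic regression. It is not the repository's implementation. It assumes that the public component is the plain gradient on hybrid inputs and that the private component is the per-example gradient on original inputs. The names `fused_step`, `w_pub`, and `w_priv` are hypothetical. Gradient calibration, the representation-consistency regularizer, and the calibration of `sigma` to a target epsilon are all omitted, so the sketch makes no privacy claim.

```python
import math
import random


def grad_logistic(w, x, y):
    """Gradient of the logistic loss for one example, with y in {0, 1}."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * xi for xi in x]


def clip_l2(g, c):
    """Rescale g so that its L2 norm is at most c."""
    norm = math.sqrt(sum(v * v for v in g))
    scale = min(1.0, c / norm) if norm > 0 else 1.0
    return [v * scale for v in g]


def fused_step(w, originals, hybrids, labels, lr=0.1, c=1.0, sigma=1.0,
               w_pub=1.0, w_priv=1.0, rng=None):
    """One update: un-noised public gradient plus clipped, noised private gradient."""
    rng = rng or random.Random()
    n, d = len(labels), len(w)
    g_pub = [0.0] * d
    g_priv = [0.0] * d
    for x, x_h, y in zip(originals, hybrids, labels):
        # Assumed public component: gradient on the hybrid (imputed) input.
        for j, v in enumerate(grad_logistic(w, x_h, y)):
            g_pub[j] += v
        # Assumed private component: per-example gradient on the original
        # input, clipped so each example contributes at most norm c.
        for j, v in enumerate(clip_l2(grad_logistic(w, x, y), c)):
            g_priv[j] += v
    # Gaussian noise with standard deviation sigma * c, private sum only.
    g_priv = [v + rng.gauss(0.0, sigma * c) for v in g_priv]
    return [wj - lr * (w_pub * gp + w_priv * gq) / n
            for wj, gp, gq in zip(w, g_pub, g_priv)]


# Toy data: the first feature is "sensitive" and replaced by a made-up imputed value.
originals = [[3.0, 4.0], [1.0, -2.0]]
hybrids = [[2.5, 4.0], [1.5, -2.0]]
labels = [1, 0]
print(fused_step([0.0, 0.0], originals, hybrids, labels, rng=random.Random(0)))
```

The point of the split is that the noise scale depends only on the clipping bound of the private term, while the hybrid-input gradient enters the update without added noise.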

Requirements

Install the environment:

conda env create -f environment.yml
conda activate fusiondp

Running FusionDP

Taking the bank marketing dataset as an example, FusionDP can be run in two steps.

First, run python impute_bank.py to impute the sensitive attributes with TabPFN. This produces bank_train.csv, bank_train_imputed.csv, bank_val.csv, and bank_test.csv for training. Then launch training with the following arguments:

python train_fusiondp_bank.py \
    --one_hot --mode fusiondp --epsilon $eps --epochs 10 --max_grad_norm $c \
    --alpha $a --beta $b
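
The command above leaves `$eps`, `$c`, `$a`, and `$b` to be set by the user. One way to sweep over them is to build the argument lists in Python, as in the sketch below. The flag names come from the command above; the grid values and the helper name `build_command` are hypothetical, and the README does not state recommended ranges or the exact roles of `--alpha` and `--beta`.

```python
import itertools
import shlex


def build_command(eps, c, a, b, epochs=10):
    """Return the argv list for one training run with the given settings."""
    return [
        "python", "train_fusiondp_bank.py",
        "--one_hot", "--mode", "fusiondp",
        "--epsilon", str(eps), "--epochs", str(epochs),
        "--max_grad_norm", str(c),
        "--alpha", str(a), "--beta", str(b),
    ]


# Hypothetical grid: two privacy budgets, one value for each other setting.
grid = itertools.product([1.0, 8.0], [1.0], [0.5], [0.5])
commands = [build_command(*values) for values in grid]

for cmd in commands:
    # Print only; pass cmd to subprocess.run(cmd, check=True) to launch it.
    print(shlex.join(cmd))
```

Keeping each command as a list avoids shell quoting issues if it is later handed to `subprocess.run`.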

Datasets

Results

[Figure: two screenshots of experimental results]

Contributors

Linghui Zeng $^{1}$, Ruixuan Liu $^{1}$, Atiquer Rahman Sarkar $^{2}$, Xiaoqian Jiang $^{3}$, Joyce C. Ho $^{1}$, Li Xiong $^{1}$

$^{1}$ Emory University, $^{2}$ University of Manitoba, $^{3}$ University of Texas Health Science Center
