Updated TF attack #142
Open
bzamanlooy wants to merge 11 commits into main from diabetes-tf-attack
+110 −29
11 commits
dda5dc8 diabetes adapeted TF attack (bzamanlooy)
a5b00ed Adapt Tartan Federer attack for diabetes (bzamanlooy)
8b9415a Updated test (bzamanlooy)
616c96d cleaning up and ruff check comment (bzamanlooy)
0b698b1 ruff (bzamanlooy)
200eb88 changed atack numbers with a cpu run to make more stable (bzamanlooy)
c01f3a9 addressed coderabbit comments (bzamanlooy)
a66d5e9 fix mypy issues (bzamanlooy)
fee5134 fix mypy error (bzamanlooy)
53f0535 addressed David's comments (bzamanlooy)
3ac0620 minor update (bzamanlooy)
@@ -3,6 +3,7 @@
 import csv
 import os
 from collections.abc import Generator
+from dataclasses import replace
 from logging import INFO
 from pathlib import Path
 from typing import Any
@@ -98,6 +99,8 @@ def mixed_loss(


 # TODO: Unify this with the Dataset.from_df function.
+# TODO: Noise scale is always called with a value of 0 for the attack. So we should remove it from the f
+# function signature and the function calls.
 def make_dataset_from_df_with_loaded(
     data: pd.DataFrame,
     transformation: Transformations,
@@ -108,7 +111,7 @@ def make_dataset_from_df_with_loaded(
     noise_scale: float = 0,
 ) -> Dataset:
     """
-    Create a dataset using artifacts.
+    Makes a dataset from a dataframe with loaded transformations.

     Args:
         data: Raw data to be used for creating the dataset.
@@ -117,8 +120,8 @@ def make_dataset_from_df_with_loaded(
         table_metadata: Meta data about the table or tables.
         label_encoders: Encoders that were used to encode the categorical data.
         numerical_transform: Transformations that should be applied to the numerical data. Defaults to None.
-        noise_scale: he scale of the noise to add to the categorical features. Noise is drawn from a normal
-            distribution with standard deviation of ``noise_scale``. Defaults to 0.
+        noise_scale: The scale of the noise to add to the categorical features. Noise is drawn from a normal
+            distribution with standard deviation of ``noise_scale``. Defaults to 0.

     Returns:
         A full dataset constructed of the various pieces.
@@ -128,7 +131,7 @@ def make_dataset_from_df_with_loaded(
         is_target_conditioned,
     )
     numerical_features = {DataSplit.TRAIN.value: data[numerical_column_names].values.astype(np.float32)}
-    categorical_features = {DataSplit.TRAIN.value: data[categorical_column_names].to_numpy(dtype=np.str_)}
+    categorical_features = {DataSplit.TRAIN.value: data[categorical_column_names].to_numpy()}
     targets = {DataSplit.TRAIN.value: data[[table_metadata.target_column_name]].values.astype(np.float32)}

     if len(categorical_column_names) > 0:
@@ -153,6 +156,13 @@ def make_dataset_from_df_with_loaded(
             numerical_features = categorical_features

     target_info = TargetInfo(policy=None, mean=None, std=None)

+    # Apply the model's pre-fitted numerical transform directly instead of re-fitting a new one.
+    # Calling transform_dataset() would fit a brand new QuantileTransformer on the MIA data,
+    # which produces a different normalization than the model saw during training, destroying signal.
+    if numerical_transform is not None:
+        numerical_features = {k: numerical_transform.transform(v) for k, v in numerical_features.items()}
bzamanlooy marked this conversation as resolved.

     dataset = Dataset(
         numerical_features=numerical_features,
         categorical_features=None,
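The comment in this hunk argues that re-fitting a normalizer on the attack (MIA) data yields a different normalization than the one the model saw during training. The sketch below illustrates that effect with a deliberately simple, hypothetical `StandardScaler` class written for this example; it is a stand-in for the `QuantileTransformer` mentioned in the diff, not the toolkit's code, and the data values are made up.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class StandardScaler:
    """Hypothetical stand-in for the model's fitted numerical transform."""

    center: float = 0.0
    scale: float = 1.0

    def fit(self, values: list[float]) -> "StandardScaler":
        # Learn the statistics from the data it is fitted on.
        self.center = mean(values)
        self.scale = pstdev(values) or 1.0
        return self

    def transform(self, values: list[float]) -> list[float]:
        # Apply the previously learned statistics without changing them.
        return [(v - self.center) / self.scale for v in values]


# Scaler fitted once on the (made-up) data the model was trained on.
training_values = [10.0, 20.0, 30.0, 40.0, 50.0]
fitted_scaler = StandardScaler().fit(training_values)

# Attack data drawn from a shifted range (also made up).
attack_values = [40.0, 50.0, 60.0]

# Reusing the fitted scaler keeps the model's coordinate system.
reused = fitted_scaler.transform(attack_values)

# Re-fitting on the attack data re-centres it around zero, hiding the shift.
refitted = StandardScaler().fit(attack_values).transform(attack_values)

print("reused:  ", [round(v, 3) for v in reused])
print("refitted:", [round(v, 3) for v in refitted])
```

With the reused scaler the attack points stay above the training mean, whereas the re-fitted scaler maps them to a zero-centred range, so the two inputs handed to the model differ.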
@@ -163,7 +173,9 @@ def make_dataset_from_df_with_loaded(
         categorical_transform=None,
         numerical_transform=numerical_transform,
     )
-    return transform_dataset(dataset, transformation, None)
+    # Use a no-normalization transformation since we've already applied the model's scaler above.
+    transformation_no_norm = replace(transformation, normalization=None)
+    return transform_dataset(dataset, transformation_no_norm, None)


 def get_dataset(
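This hunk relies on `dataclasses.replace` to build a copy of the transformation config with normalization switched off. A minimal sketch of that pattern follows; the `Transformations` class here is a reduced, hypothetical version written for illustration, and only the `normalization` field name is taken from the diff.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Transformations:
    """Hypothetical, reduced config; only `normalization` appears in the diff."""

    normalization: Optional[str] = "quantile"
    cat_encoding: Optional[str] = None  # assumed field, for illustration only
    seed: int = 0  # assumed field, for illustration only


transformation = Transformations(normalization="quantile", seed=42)

# replace() returns a new instance; every field not named is copied unchanged.
transformation_no_norm = replace(transformation, normalization=None)

print(transformation)
print(transformation_no_norm)
```

Because `replace` returns a new object, the caller's `transformation` keeps its original normalization setting, which matters if the same config object is reused elsewhere.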
@@ -394,7 +406,7 @@ def prepare_dataframe(
     return filter_dataframe(merged_data, df_data, columns_for_deduplication)


-def train_tartan_federer_attack_classifier(
+def train_tartan_federer_attack_classifier(  # noqa: PLR0915, PLR0912
     train_indices: list[int],
     val_indices: list[int] | None,
     timesteps: list[int],
@@ -448,7 +460,27 @@ def train_tartan_federer_attack_classifier(
     population_df_for_validation = pd.read_csv(population_data_dir / "population_dataset_for_validating_attack.csv")
     log(INFO, "Population datasets for validating loaded.")

-    noise_dimension = len([col for col in population_df_for_training.columns if "_id" not in col])
+    # Derive noise dimension from the actual diffusion model's num_numerical_features rather
+    # than from the population dataframe column count. The mixed_loss function slices
+    # x[:, :diffusion.num_numerical_features], so the noise vectors must have exactly that length.
+    # We load the first available model to read this value, then discard it.
+    first_model_number = train_indices[0]
+    first_model_dir = model_data_dir / f"{model_type}_{first_model_number}"
+    first_model_path = first_model_dir / target_model_subdir
+
+    if model_type != "tabddpm":
+        raise ValueError(
+            f"Unsupported model_type {model_type}. Tartan Federer Attack is only supported for ClavaDDPM-single-table models."
+        )
+    # TODO: We should read this from the metadata instead.
+    _relation_order = [("None", "trans")]
bzamanlooy marked this conversation as resolved.
+
+    _parent, _child = _relation_order[0]
+    _ckpt_path = first_model_path / f"{_parent}_{_child}_ckpt.pkl"
+    with open(_ckpt_path, "rb") as _f:
+        _probe_model = CustomUnpickler(_f).load()
Collaborator: Any reason we're using the
+    noise_dimension = _probe_model.diffusion.num_numerical_features
+    log(INFO, f"Noise dimension read from diffusion model: {noise_dimension}")

     input_noise = [np.random.normal(size=noise_dimension).tolist() for _ in range(num_noise_per_time_step)]
     input_dimension = len(input_noise) * len(timesteps) * len(additional_timesteps)
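The hunk above states that the noise vectors must match the slice `x[:, :diffusion.num_numerical_features]`, which is why the column count of the population dataframe is no longer used. The standard-library sketch below shows the length check and the `input_dimension` formula from the diff. All concrete numbers, the column names, and the helpers `slice_numerical` and `make_noise` are assumptions made for illustration.

```python
import random


def slice_numerical(batch: list[list[float]], num_numerical_features: int) -> list[list[float]]:
    """Mimic x[:, :num_numerical_features] on a list-of-rows batch."""
    return [row[:num_numerical_features] for row in batch]


def make_noise(noise_dimension: int, num_noise_per_time_step: int, seed: int = 0) -> list[list[float]]:
    """Draw standard-normal noise vectors of a fixed length."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(noise_dimension)] for _ in range(num_noise_per_time_step)]


# Assumed values: the model treats 5 features as numerical, while the
# population dataframe has 8 non-id columns.
num_numerical_features = 5
population_columns = ["account_id", "c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
column_count = len([col for col in population_columns if "_id" not in col])

batch = [[0.1 * i for i in range(column_count)] for _ in range(3)]
x_num = slice_numerical(batch, num_numerical_features)

# Noise sized from the model matches the sliced input; noise sized from the
# column count would be three entries too long.
input_noise = make_noise(num_numerical_features, num_noise_per_time_step=4)
print(len(x_num[0]), len(input_noise[0]), column_count)

# Same formula as in the diff, with made-up timestep lists.
timesteps = [5, 10, 20]
additional_timesteps = [0, 1]
input_dimension = len(input_noise) * len(timesteps) * len(additional_timesteps)
print(input_dimension)
```

Reading the dimension from the model avoids the mismatch whenever the dataframe contains columns that the diffusion model does not treat as numerical features.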
I think the f here is a typo?