Add a reset_anonymization method #559

Closed
pvk-developer opened this issue Oct 3, 2022 · 1 comment · Fixed by #580
Labels
feature request Request for a new feature

@pvk-developer
Member

pvk-developer commented Oct 3, 2022

Problem Description

It would be useful to have a reset_generator parameter for the transformers that generate data during reverse transform (AnonymizedFaker and RegexGenerator). That way, when enforce_uniqueness is enabled, we could start over with a fresh set of values.

This change should propagate to create_anonymized_columns, since that is where these transformers are mainly used.

Expected behavior

from rdt.transformers import AnonymizedFaker

anonymizer = AnonymizedFaker(provider_name='job', function_name='job', enforce_uniqueness=True)
# There are 639 unique jobs in English
data = pd.DataFrame({'job': np.arange(639)})
tr_data = anonymizer.fit_transform(data, 'job')

# With this run we empty the unique values from the current faker instance
rev_data = anonymizer.reverse_transform(tr_data)

# This will not generate new data and will fail. There is no way to reset it currently.
anonymizer.reverse_transform(tr_data)

File ~/.virtualenvs/RDT/lib/python3.8/site-packages/faker/proxy.py:320, in UniqueProxy._wrap.<locals>.wrapper(*args, **kwargs)
    319 else:
--> 320     raise UniquenessException(f"Got duplicated values after {_UNIQUE_ATTEMPTS:,} iterations.")
    322 generated.add(retval)

UniquenessException: Got duplicated values after 1,000 iterations.

The above exception was the direct cause of the following exception:

Error                                     Traceback (most recent call last)
Cell In [1], line 12
      9 rev_data = anonymizer.reverse_transform(tr_data)
     11 # This will not generate new data and will fail. There is no way to reset it currently.
---> 12 anonymizer.reverse_transform(tr_data)

File ~/Projects/sdv-dev/RDT/rdt/transformers/base.py:362, in BaseTransformer.reverse_transform(self, data)
    359 data = data.copy()
    361 columns_data = self._get_columns_data(data, self.output_columns)
--> 362 reversed_data = self._reverse_transform(columns_data)
    363 data = data.drop(self.output_columns, axis=1)
    364 data = self._add_columns_to_data(data, reversed_data, self.columns)

File ~/Projects/sdv-dev/RDT/rdt/transformers/pii/anonymizer.py:149, in AnonymizedFaker._reverse_transform(self, data)
    144     reverse_transformed = np.array([
    145         self._function()
    146         for _ in range(sample_size)
    147     ], dtype=object)
    148 except faker.exceptions.UniquenessException as exception:
--> 149     raise Error(
    150         f'The Faker function you specified is not able to generate {sample_size} unique '
    151         'values. Please use a different Faker function for column '
    152         f"('{self.get_input_column()}')."
    153     ) from exception
    155 return reverse_transformed

Error: The Faker function you specified is not able to generate 639 unique values. Please use a different Faker function for column ('job').

Additional context

If we need to reuse the same transformer to generate more data, there is currently no way to reset its state.
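The failure mode can be sketched without Faker or RDT. The snippet below is only an illustration: `UniquePool`, its `reset()` method, and the three sample values are hypothetical stand-ins, not code from either library. It shows a uniqueness-enforcing generator running out of values and becoming usable again once its state is cleared.

```python
import random


class UniquePool:
    """Toy stand-in for a uniqueness-enforcing generator (hypothetical, not RDT or Faker code)."""

    def __init__(self, values, seed=0):
        self._values = list(values)
        self._seed = seed
        self.reset()

    def reset(self):
        """Forget which values were handed out and restart the random stream."""
        self._random = random.Random(self._seed)
        self._remaining = list(self._values)

    def draw(self):
        """Return a value that has not been handed out since the last reset."""
        if not self._remaining:
            raise RuntimeError('No unique values left; call reset() first.')
        index = self._random.randrange(len(self._remaining))
        return self._remaining.pop(index)


pool = UniquePool(['baker', 'pilot', 'nurse'])
first_pass = [pool.draw() for _ in range(3)]  # uses up every value

try:
    pool.draw()  # a fourth unique value does not exist
    exhausted = False
except RuntimeError:
    exhausted = True

pool.reset()
second_pass = [pool.draw() for _ in range(3)]  # works again after the reset
```

Because this toy `reset()` also re-seeds the random stream, the second pass repeats the first one. Whether a real reset should replay the same values or produce different ones is a design choice this sketch does not settle.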

@pvk-developer added the pending review and feature request labels Oct 3, 2022
@npatki
Contributor

npatki commented Oct 5, 2022

Proposed API

  1. Each relevant transformer (AnonymizedFaker, PseudoAnonymizedFaker, RegexGenerator) should have a reset_anonymization() method that resets the generation. The next time you call reverse_transform, it will start from the beginning.
  2. HyperTransformer should have a reset_anonymization() method that calls reset_anonymization() on each relevant transformer from step 1.
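transformer = AnonymizedFaker(provider_name='job', function_name='job')
# fit_transform takes the column name, as in the example from the issue description
t = transformer.fit_transform(data, 'job')
transformer.reverse_transform(t)
# Proposed method: the next reverse_transform starts from the beginning again
transformer.reset_anonymization()
transformer.reverse_transform(t)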
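A minimal sketch of how the two proposed points could fit together, using toy classes rather than RDT itself. `ToyAnonymizer`, `ToyPassthrough`, and `ToyHyperTransformer` are hypothetical names, and the counter-based generation is an assumption made only so the effect of a reset is easy to see.

```python
import itertools


class ToyAnonymizer:
    """Hypothetical transformer that generates values on reverse_transform."""

    def __init__(self, prefix):
        self.prefix = prefix
        self._counter = itertools.count()

    def reverse_transform(self, num_rows):
        return [f'{self.prefix}_{next(self._counter)}' for _ in range(num_rows)]

    def reset_anonymization(self):
        # Start generating from the beginning again.
        self._counter = itertools.count()


class ToyPassthrough:
    """Hypothetical transformer with no generator state, so nothing to reset."""

    def reverse_transform(self, num_rows):
        return [None] * num_rows


class ToyHyperTransformer:
    """Hypothetical container that forwards the reset to the transformers that support it."""

    def __init__(self, field_transformers):
        self.field_transformers = field_transformers

    def reset_anonymization(self):
        for transformer in self.field_transformers.values():
            reset = getattr(transformer, 'reset_anonymization', None)
            if callable(reset):
                reset()

    def reverse_transform(self, num_rows):
        return {
            column: transformer.reverse_transform(num_rows)
            for column, transformer in self.field_transformers.items()
        }


ht = ToyHyperTransformer({'job': ToyAnonymizer('job'), 'age': ToyPassthrough()})
before = ht.reverse_transform(2)     # first two generated values
continued = ht.reverse_transform(2)  # generation continues where it left off
ht.reset_anonymization()
after = ht.reverse_transform(2)      # same values as the very first call
```

Looking the method up with `getattr` lets transformers without generator state be skipped silently. An explicit list of resettable transformer types would be an equally reasonable design.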

SDV Usage Example: SDV single-table models have a randomize_samples parameter (docs). When it is set to False, the HyperTransformer's anonymization should be reset before sample is called.
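The SDV behavior described above could look roughly like the following. This is a sketch under stated assumptions: `ToyModel` and `ToyHyperTransformer` are hypothetical stand-ins, not SDV or RDT classes, and only the reset-before-sample logic is modeled.

```python
import itertools


class ToyHyperTransformer:
    """Hypothetical stand-in exposing the proposed reset_anonymization() method."""

    def __init__(self):
        self._ids = itertools.count()

    def reset_anonymization(self):
        self._ids = itertools.count()

    def reverse_transform(self, num_rows):
        return [f'id_{next(self._ids)}' for _ in range(num_rows)]


class ToyModel:
    """Hypothetical single-table model; only the reset-before-sample logic is shown."""

    def __init__(self, randomize_samples=True):
        self.randomize_samples = randomize_samples
        self._hyper_transformer = ToyHyperTransformer()

    def sample(self, num_rows):
        # With randomize_samples=False, every call starts from a clean generator state.
        if not self.randomize_samples:
            self._hyper_transformer.reset_anonymization()
        return self._hyper_transformer.reverse_transform(num_rows)


fixed = ToyModel(randomize_samples=False)
fixed_runs = (fixed.sample(2), fixed.sample(2))        # both calls return the same values

varying = ToyModel(randomize_samples=True)
varying_runs = (varying.sample(2), varying.sample(2))  # the second call continues the sequence
```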

@npatki changed the title from "Add a reset_generator parameter" to "Add a reset_anonymization method" Oct 5, 2022
@npatki added this to the 1.3.0 milestone Oct 25, 2022