
Published June 11, 2020

Data Anonymization Tools for Machine Learning and AI


When processing large quantities of information for marketing, sociology, or machine learning development, it is almost inevitable that one will have to work with information that links data points to real individuals. Such information is often referred to as private data. Direct identifiers (names, phone numbers, addresses) and indirect identifiers (demographics, religion, medical and financial information) are just a few of the private data types that must be protected from public view. Their storage and use are heavily regulated by laws such as the CCPA and GDPR, and the implications of breaking them are grievous. Since the beginning of 2020, 72 fines have already been issued for noncompliance with the GDPR, the highest individual one reaching 27.8 million euros.

But what are the options for processing datasets with sensitive information while still complying with regulations and eliminating legal risk? The short answer is anonymization!

What is data anonymization?

Let's expand a bit on that. Data anonymization is the process of protecting sensitive information by changing or erasing the parts of it that could connect a living individual to the data or cross-link a piece of data with other datasets. Its goal is to make everyone the data describes completely anonymous. To ensure that, according to the GDPR, dataset owners must address these types of data disclosure risk:

Singling out
the possibility of isolating an individual in a database
Linkability
the possibility of linking a record in a database to another record in the same or a different database, thereby identifying a pattern or an individual
Inference
the ability to make assumptions and predictions about future behavior based on the information in the dataset, even if the individual is not in it
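To make the first of these risks concrete, here is a minimal sketch of a singling-out check on a toy dataset (the records, field names, and the choice of quasi-identifiers are all hypothetical): a record can be singled out when its combination of quasi-identifiers is unique in the database.

```python
from collections import Counter

# Toy records: (age, zip_code, condition).
# We treat age and zip_code as the quasi-identifiers.
records = [
    (34, "10001", "flu"),
    (34, "10001", "cold"),
    (58, "94105", "asthma"),  # unique (age, zip) pair
]

# Count how often each quasi-identifier combination appears.
counts = Counter((age, zip_code) for age, zip_code, _ in records)

# A record can be singled out if its combination occurs exactly once.
singled_out = [r for r in records if counts[(r[0], r[1])] == 1]
print(singled_out)  # [(58, '94105', 'asthma')]
```

The same counting idea extends to linkability: if a combination is unique in two databases, the two records can be joined into one profile.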

If information can be fully anonymized, it is no longer considered personal and is not covered by data protection laws and regulations. Such data can be collected without consent, processed for any purpose, and stored indefinitely.

What are data anonymization techniques?

Anonymization is a tool that takes data processing outside the scope of data protection legislation and reduces the risk of private data breaches. Below we describe a few of the most common data anonymization techniques and algorithms that organizations use to comply with regulations and protect their users.

Suppression
removing entire columns of data.
Character masking
replacing certain characters or values with a chosen repeating symbol (for instance, hiding the name "John" behind "****" or "xxxx").
Pseudonymization
replacing values with made-up data. To prevent cross-dataset linkage, it is recommended not to reuse a persistent pseudonym for the same individual across datasets.
Swapping
randomly shuffling values in the dataset so that they no longer correspond to the original records.
Generalization
rounding values or reducing the precision of data. This can mean generalizing exact ages into age brackets or making locations less precise (the name of the city instead of coordinates).
Perturbation
modifying records to be slightly different by adding noise, most commonly to indirect numeric identifiers. For example, subtracting 2 from each age record or multiplying all house numbers by 5.
Synthetic data
creating fully or partially synthetic datasets based on the original data.
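A few of the techniques above can be sketched in a handful of lines. This is a minimal illustration on a single made-up record, not a production implementation; the record fields, the bracket width, and the multiply-by-5 perturbation factor are assumptions taken from the examples in the list.

```python
import random

record = {"name": "John", "age": 37, "house_no": 12}

# Character masking: replace every character of the name with "x".
masked_name = "x" * len(record["name"])

# Generalization: reduce the exact age to a 10-year bracket.
decade = record["age"] // 10 * 10
age_bracket = f"{decade}-{decade + 9}"

# Perturbation: transform a numeric identifier by a fixed factor,
# as in the house-number example above.
perturbed_house_no = record["house_no"] * 5

# Pseudonymization: replace the name with a random, non-persistent pseudonym.
pseudonym = f"user-{random.randint(100000, 999999)}"

print(masked_name, age_bracket, perturbed_house_no, pseudonym)
# xxxx 30-39 60 user-…
```

Swapping and synthetic data operate on whole columns or whole datasets rather than single records, so they are omitted here.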

A very important step for organizations and businesses planning to anonymize private data is to assess the potential risk of de-anonymization: tracing the anonymized information back to specific individuals by decrypting it or matching it against publicly available datasets. An appropriate set of measures must be taken to minimize this risk. Sources often stress the need for both technical (encryption keys, mapping tables, etc.) and administrative (limiting unauthorized access, etc.) organizational control over data. There is no one-size-fits-all answer to how your specific anonymization process should be performed. A few steps that help assess the possible solutions include analyzing the nature of the data, its recipients, de-anonymization risk management, and the intended use of the data. Naturally, it is best to run the anonymization process only after a thorough assessment and with a clear plan.
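One common, simple way to quantify de-anonymization risk is k-anonymity: a dataset is k-anonymous if every combination of quasi-identifier values appears at least k times, so no record's group is smaller than k. The sketch below (with hypothetical rows and field names) computes the smallest group size, which is the effective k.

```python
from collections import Counter

def k_anonymity(rows, quasi_ids):
    """Return the smallest group size over all quasi-identifier
    combinations; the dataset is k-anonymous for any k up to this value."""
    counts = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return min(counts.values())

rows = [
    {"age_bracket": "30-39", "city": "Dublin"},
    {"age_bracket": "30-39", "city": "Dublin"},
    {"age_bracket": "50-59", "city": "Cork"},
    {"age_bracket": "50-59", "city": "Cork"},
]
print(k_anonymity(rows, ["age_bracket", "city"]))  # 2
```

If the resulting k is too low, further generalization or suppression of the offending records raises it, at the cost of data utility.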


Written by

Veronika Gladchuk, Editor-at-Large

A pioneer of the Label Your Data blog, Veronika has helped many of us understand the ins and outs of today's cutting-edge technology and opportunities provided by artificial intelligence. She speaks on the most critical issues of data labeling, machine learning, big data, and more. You should definitely read her other articles to plunge into the world of AI and data annotation!