Data Anonymization Tools for Machine Learning and AI
When processing large quantities of information for marketing, sociology, or machine learning development, it is almost inevitable that one will have to work with information that links data points to real individuals. Such information is often referred to as private data. Direct identifiers (names, phone numbers, addresses) and indirect identifiers (demographics, religion, medical and financial information) are just a few of the private data types that are expected to be protected from public view. Their storage and use are heavily regulated by laws such as CCPA and GDPR. The consequences of breaking them are severe. Since the beginning of 2020, there have already been 72 fines issued for noncompliance with GDPR, the highest of which reached 27.8 million euros.
But what are the options for processing datasets that contain sensitive information while still complying with the regulations and reducing legal risks? The short answer is anonymization!
What is data anonymization?
Let's expand a bit on that. Data anonymization is the process of protecting sensitive information by changing or erasing the parts of it that may connect a living individual to the data or cross-connect a piece of data with other datasets. Its goal is to make every person the data describes completely anonymous. To ensure that, according to GDPR, owners of the dataset should address the following types of data disclosure risk:
- Singling out: the possibility to isolate an individual in a database.
- Linkability: the possibility to link a record in a database to another record in the same or another database, thereby identifying a pattern or an individual.
- Inference: the ability to make assumptions and predictions about a person's attributes or future behavior based on the information in the dataset, even if that individual is not in it.
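The singling-out risk can be made concrete with a small check. The sketch below is only an illustration, not a compliance test: it assumes a dataset held as a list of dictionaries and a hypothetical choice of indirect identifiers (`age`, `city`), and it flags every record whose combination of those values appears exactly once.

```python
from collections import Counter


def singled_out(records, quasi_identifiers):
    """Return the records whose combination of quasi-identifier values is unique."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] == 1]


# Hypothetical sample data, invented for illustration only.
records = [
    {"age": 34, "city": "Kyiv", "diagnosis": "flu"},
    {"age": 34, "city": "Kyiv", "diagnosis": "asthma"},
    {"age": 51, "city": "Lviv", "diagnosis": "diabetes"},
]

unique = singled_out(records, ["age", "city"])
print(unique)  # only the (51, "Lviv") record is unique on age and city
```

A record that shares its indirect identifiers with many others is harder to isolate, so the fewer records this check returns, the lower the singling-out risk for the chosen columns.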
If information can be fully anonymized, it is no longer considered personal and is not covered by data protection laws and regulations. Such data can be collected without consent, processed for any purpose, and stored indefinitely.
What are data anonymization techniques?
Anonymization is a tool that takes data processing out of the scope of data protection legislation and reduces the risk of private data breaches. Below we describe a few of the most common data anonymization techniques and algorithms that organizations use to comply with regulations and protect their users.
- Suppression: removing entire columns of data.
- Character masking: replacing certain characters or values with a chosen repeating symbol (for instance, hiding the name "John" with "****" or "xxxx").
- Pseudonymisation: replacing values with made-up data. To prevent cross-dataset linkage, it is recommended not to use persistent pseudonyms for an individual in more than one dataset.
- Swapping: randomly shuffling values in the dataset so that they no longer correspond to the original records.
- Generalization: rounding and reducing the precision of data. This can mean generalizing age values to categories or making locations less precise (the name of a city instead of coordinates).
- Perturbation: modifying the records to be slightly different and adding noise. It is more common for indirect numeric identifiers. For example, changing age records by subtracting 2 from each of them or multiplying all house numbers by 5.
- Synthetic data: creating fully or partially synthetic datasets based on the original data.
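To make the techniques above more tangible, here is a minimal sketch of six of them in plain Python (synthetic data generation is left out because it usually requires a fitted model). Everything in it is an assumption made for illustration: the function names, the sample rows, the 10-year age bands, and the noise range of ±2 are hypothetical choices, not a recommended configuration.

```python
import random
import secrets


def suppress(rows, column):
    """Suppression: drop an entire column."""
    return [{k: v for k, v in row.items() if k != column} for row in rows]


def mask(value, symbol="*"):
    """Character masking: replace every character with a repeating symbol."""
    return symbol * len(value)


def pseudonymise(rows, column):
    """Pseudonymisation: replace each distinct value with a random made-up label.

    The mapping lives only inside this call, so pseudonyms are not reused
    across datasets.
    """
    mapping = {}
    result = []
    for row in rows:
        original = row[column]
        if original not in mapping:
            mapping[original] = "user_" + secrets.token_hex(4)
        result.append({**row, column: mapping[original]})
    return result


def swap(rows, column, rng):
    """Swapping: shuffle one column's values across the records."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


def generalise_age(age, width=10):
    """Generalization: replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def perturb(value, rng, scale=2):
    """Perturbation: add small random integer noise in [-scale, scale]."""
    return value + rng.randint(-scale, scale)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [
    {"name": "John", "phone": "555-0101", "age": 34, "city": "Kyiv"},
    {"name": "Mary", "phone": "555-0102", "age": 51, "city": "Lviv"},
    {"name": "Omar", "phone": "555-0103", "age": 27, "city": "Odesa"},
]

print(mask("John"))        # ****
print(generalise_age(34))  # 30-39

anonymised = suppress(rows, "phone")
anonymised = pseudonymise(anonymised, "name")
anonymised = swap(anonymised, "city", rng)
anonymised = [{**r, "age": perturb(r["age"], rng)} for r in anonymised]
print(anonymised)
```

In practice these steps are combined according to the data and its intended use. Applying all of them, as done here, is only meant to show each one once.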
A very important step for organizations and businesses planning to anonymize private data is to assess the potential risk of de-anonymization, that is, tracing the anonymized information back to specific individuals by decrypting it or matching it with publicly available datasets. An appropriate set of measures must be taken to minimize this risk. Sources often stress the need for both technical (encryption keys, mapping tables, etc.) and administrative (limiting unauthorized access, etc.) organizational control over data. There is no one-size-fits-all answer to how your specific anonymization process should be performed. A few steps that help to assess the possible solutions include analyzing the nature of the data and its recipients, managing the de-anonymization risk, and determining how the data will be used. It is best to run the anonymization process only after a thorough assessment and with a clear plan.
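The matching risk mentioned above can be illustrated with a small sketch. It assumes two hypothetical datasets invented for this example: an "anonymized" one with names removed, and a public one that still contains names, sharing the indirect identifiers `zip` and `birth_year`. Whenever exactly one public entry matches a record on those columns, that record becomes a candidate re-identification.

```python
def link(anonymized, public, quasi_identifiers):
    """Pair each anonymized record with a public name when the match is unambiguous."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    # Index the public dataset by its quasi-identifier values.
    index = {}
    for person in public:
        index.setdefault(key(person), []).append(person)

    matches = []
    for record in anonymized:
        candidates = index.get(key(record), [])
        if len(candidates) == 1:  # exactly one possible person
            matches.append((candidates[0]["name"], record))
    return matches


# Hypothetical data: names were removed, but zip and birth year were kept.
anonymized = [
    {"zip": "01001", "birth_year": 1986, "diagnosis": "asthma"},
    {"zip": "79000", "birth_year": 1990, "diagnosis": "flu"},
]
public = [
    {"name": "Alice Example", "zip": "01001", "birth_year": 1986},
    {"name": "Bob Example", "zip": "79000", "birth_year": 1990},
    {"name": "Carol Example", "zip": "79000", "birth_year": 1990},
]

reidentified = link(anonymized, public, ["zip", "birth_year"])
print(reidentified)  # only the first record matches a single person
```

Running a check like this against the datasets an attacker could plausibly obtain is one way to estimate the de-anonymization risk before data is released. Generalizing or suppressing the shared columns reduces the number of unambiguous matches.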
Written by
A pioneer of the Label Your Data blog, Veronika has helped many of us understand the ins and outs of today's cutting-edge technology and opportunities provided by artificial intelligence. She speaks on the most critical issues of data labeling, machine learning, big data, and more. You should definitely read her other articles to plunge into the world of AI and data annotation!