Archive for March 2020

One of the best features of neural networks and machine learning models is their ability to learn patterns from training data and generalize them to unseen observations. That's where the magic is.
However, there are scenarios in which a model learns those patterns so well that it can disclose some of the data it was trained on. This phenomenon is known as unintended memorization, and it is extremely dangerous.

Think about a language generator that discloses the passwords, credit card numbers, and social security numbers contained in the records it was trained on. Or, more generally, think about a synthetic data generator that can disclose the very training data it is trying to protect.
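The Secret Sharer paper referenced below quantifies this risk with an "exposure" metric: plant a random canary (say, a fake PIN) in the training data and check how highly the trained model ranks it among all possible canaries. The sketch below illustrates the metric only; the model is a hypothetical stand-in function, and the canary value and candidate space are made up.

```python
# Toy sketch of the "exposure" metric: rank a secret canary among all
# candidate secrets by the model's loss. `model_loss` is a hypothetical
# stand-in for a trained network, used only to illustrate the arithmetic.
import math
import random

random.seed(0)
candidates = [f"{i:04d}" for i in range(10000)]  # e.g. all 4-digit PINs
canary = "7341"  # the secret planted in the training data (made up)

def model_loss(secret):
    # A model that memorized the canary assigns it an unusually low loss;
    # every other candidate gets an arbitrary loss here.
    noise = random.random()
    return 0.0 if secret == canary else noise

losses = {s: model_loss(s) for s in candidates}
rank = sorted(losses, key=losses.get).index(canary) + 1

# exposure = log2(|candidates|) - log2(rank): high exposure means the
# canary stands out and could be extracted from the model.
exposure = math.log2(len(candidates)) - math.log2(rank)
print(rank, round(exposure, 2))
```

A memorized canary ranks first, giving the maximum exposure of log2(10000), roughly 13.29 bits; a model that did not memorize it would rank it randomly, giving an exposure near zero.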

In this episode I explain why unintended memorization is a real problem in machine learning. Apart from differentially private training, there is no practical way to mitigate this problem under realistic conditions.
At Pryml we are very aware of this, which is why we have been developing a synthetic data generation technology that is not affected by this issue.
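To make the mitigation concrete, here is a minimal sketch of differentially private training in the DP-SGD style: clip each example's gradient to a fixed norm, then add calibrated Gaussian noise before every update. The toy linear-regression task, the constants, and the variable names are illustrative assumptions, not a production recipe.

```python
# Minimal DP-SGD-style sketch on a toy linear regression problem.
# All hyperparameters below are hypothetical illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=256)

w = np.zeros(5)
clip_norm = 1.0    # maximum per-example gradient norm
noise_mult = 1.1   # noise scale relative to clip_norm (hypothetical)
lr = 0.1

for step in range(200):
    residuals = X @ w - y                # shape (256,)
    grads = residuals[:, None] * X       # per-example gradients, (256, 5)
    # clip each example's gradient so no single record dominates the update
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    grads = grads / np.maximum(1.0, norms / clip_norm)
    # average, then add Gaussian noise calibrated to the clipping bound
    noisy_grad = grads.mean(axis=0) + rng.normal(
        scale=noise_mult * clip_norm / len(X), size=5)
    w -= lr * noisy_grad

# w lands close to true_w despite the clipping and the noise
```

Clipping bounds any single record's influence on each update, and the noise masks what remains; the privacy guarantee actually achieved depends on the noise multiplier, batch size, and number of steps. Production systems typically rely on audited libraries such as Opacus or TensorFlow Privacy rather than hand-rolled loops.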

 

This episode is supported by Harmonizely
Harmonizely lets you build your own unique scheduling page based on your availability, so you can start scheduling meetings in just a couple of minutes.
Get started by connecting your online calendar and configuring your meeting preferences.
Then, start sharing your scheduling page with your invitees!

 

References

The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks
https://www.usenix.org/conference/usenixsecurity19/presentation/carlini

Read Full Post »

In this episode I explain a very effective technique that allows one to infer whether any given record belongs to the (private) training dataset of a target model. The technique is effective because it works on black-box models, for which there is no access to the training data, the model parameters, or the hyperparameters. Such a scenario is very realistic and typical of machine-learning-as-a-service APIs.
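Here is a minimal sketch of the shadow-model version of the attack, assuming only black-box access to the target's confidence scores. The synthetic dataset, the model choices, and the sorted-confidence simplification (the original paper trains one attack model per output class) are all illustrative assumptions.

```python
# Toy shadow-model membership inference attack: train shadow models on
# data from the same distribution, learn what "seen before" confidence
# vectors look like, then query the attack model with the target's output.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=4000, n_features=10, random_state=0)

# Black-box target model, trained on a private split.
X_in, y_in = X[:500], y[:500]            # members of the training set
X_out = X[500:1000]                      # non-members
target = RandomForestClassifier(random_state=0).fit(X_in, y_in)

# Shadow models trained on disjoint data; record their sorted confidence
# vectors for members (label 1) versus non-members (label 0).
attack_X, attack_y = [], []
for i in range(5):
    lo = 1000 + i * 600
    Xs, ys = X[lo:lo + 300], y[lo:lo + 300]   # shadow members
    Xo = X[lo + 300:lo + 600]                 # shadow non-members
    shadow = RandomForestClassifier(random_state=i).fit(Xs, ys)
    attack_X.append(np.sort(shadow.predict_proba(Xs), axis=1))
    attack_y += [1] * len(Xs)
    attack_X.append(np.sort(shadow.predict_proba(Xo), axis=1))
    attack_y += [0] * len(Xo)

attack = LogisticRegression().fit(np.vstack(attack_X), attack_y)

# Query the attack model with the target's confidences.
member_score = attack.predict_proba(
    np.sort(target.predict_proba(X_in), axis=1))[:, 1]
nonmember_score = attack.predict_proba(
    np.sort(target.predict_proba(X_out), axis=1))[:, 1]
# members receive systematically higher membership scores
```

The attack works because an overfit target is systematically more confident on records it has seen, and the shadow models let the attacker learn what that confidence gap looks like without ever touching the target's training data.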

This episode is supported by pryml.io, a platform I am personally working on that enables data sharing without giving up confidentiality. 

 

As promised, below is the schema of the attack explained in the episode.

[Figure: shadow-model-attack.png, schema of the shadow model attack]

References

Membership Inference Attacks Against Machine Learning Models

Read Full Post »

Masking, obfuscating, stripping, shuffling.
All of these techniques try to do one simple thing: keep the data private while sharing it with third parties. Unfortunately, none of them is a silver bullet for confidentiality.
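A toy example of why this fails: strip the names from a dataset, and the remaining quasi-identifiers can still single people out through a classic linkage attack. All records below are fabricated.

```python
# Linkage attack on "masked" data: the name is stripped, but the
# quasi-identifiers (zip, birth year, sex) still single people out.
# Every record here is made up for illustration.

masked_health_data = [
    {"zip": "02139", "birth_year": 1984, "sex": "F", "condition": "asthma"},
    {"zip": "02139", "birth_year": 1984, "sex": "M", "condition": "flu"},
    {"zip": "02141", "birth_year": 1990, "sex": "F", "condition": "diabetes"},
]

public_voter_list = [
    {"name": "Alice", "zip": "02139", "birth_year": 1984, "sex": "F"},
    {"name": "Bob", "zip": "02139", "birth_year": 1984, "sex": "M"},
]

def reidentify(masked, public):
    """Join the two datasets on the quasi-identifiers."""
    hits = []
    for r in masked:
        matches = [p for p in public
                   if (p["zip"], p["birth_year"], p["sex"]) ==
                      (r["zip"], r["birth_year"], r["sex"])]
        if len(matches) == 1:  # unique match means re-identification
            hits.append((matches[0]["name"], r["condition"]))
    return hits

print(reidentify(masked_health_data, public_voter_list))
# -> [('Alice', 'asthma'), ('Bob', 'flu')]
```

This is the same mechanism behind well-known re-identifications of "anonymized" medical and movie-rating datasets: uniqueness across a handful of innocuous attributes is enough.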

All the players in the synthetic data space rely on simplistic techniques that are not secure, might not be compliant, and are risky for production.
At pryml we do things differently. 

Read Full Post »

There are very good reasons why a financial institution should never share its data. In fact, it should never even move its data. Ever.
In this episode I explain why.

Read Full Post »