Archive for the 'privacy and security' Category
One of the best features of neural networks and machine learning models is their ability to memorize patterns from training data and apply them to unseen observations. That's where the magic is.
However, there are scenarios in which the same machine learning models learn patterns so well that they can disclose some of the data they have been trained on. This phenomenon goes under the name of unintended memorization and it is extremely dangerous.
Think about a language generator that discloses the passwords, the credit card numbers and the social security numbers of the records it has been trained on. Or more generally, think about a synthetic data generator that can disclose the training data it is trying to protect.
In this episode I explain why unintended memorization is a real problem in machine learning. Apart from differentially private training, there is no other way to mitigate the problem under realistic conditions.
At Pryml we are very aware of this, which is why we have been developing a synthetic data generation technology that is not affected by this issue.
This episode is supported by Harmonizely.
The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks
There are very good reasons why a financial institution should never share its data. Actually, it should never even move its data. Ever.
In this episode I explain why.
This is the fourth and last episode of the mini series "The dark side of AI".
I am your host Francesco and I’m with Chiara Tonini from London. The title of today’s episode is Bias in the machine
C: Francesco, today we are starting with an infuriating discussion. Are you ready to be angry?
F: Yeah, sure. Is this about Brexit?
No, I don’t talk about that. In 1986, New York City’s Rockefeller University conducted a study on breast and uterine cancers and their link to obesity. Like in all clinical trials up to that point, the subjects of the study were all men.
So Francesco, do you see a problem with this approach?
F: No problem at all, as long as those men had a perfectly healthy uterus.
In medicine, up to the end of the 20th century, medical studies and clinical trials were conducted on men, and medicine dosage and therapy were calculated on men (white men). The female body has historically been considered an exception to, or a variation of, the male body.
F: Like Eve coming from Adam’s rib. I thought we were past that...
When the female body was under analysis, the focus was on the differences between it and the male body, the so-called “bikini approach”: the reproductive organs are different, therefore we study those, and those only. For a long time medicine assumed this was the only difference.
Oh good ...
This has led to a hugely harmful fallout across society. Because women had reproductive organs, they should reproduce, and all else about them was deemed uninteresting. Still today, a woman without children is considered to have somehow betrayed her biological destiny. This somehow does not apply to a man without children, who also has reproductive organs.
F: so this is an example of a very specific type of bias in medicine, regarding clinical trials and medical studies, that is not only harmful for the purposes of these studies, but has ripple effects in all of society
Only in the 2010s did a serious conversation start about the damage caused by not including women in clinical trials. There are many, many examples (which we list in the references for this episode).
Give me one
Researchers consider cardiovascular disease a male disease (they even call it “the widowmaker”), and they conduct studies on male samples. But it turns out that the symptoms of a heart attack, especially the ones leading up to one, are different in women. This led to doctors not recognising, or dismissing, the early symptoms in women.
F: I was reading that women are also subject to chronic pain much more than men: for example migraines, and pain related to endometriosis. But there is extensive evidence now of doctors dismissing women’s pain, as either imaginary, or “inevitable”, like it is a normal state of being and does not need a cure at all.
The failure of the medical community as a whole to recognise this obvious bias up to the 21st century is an example of how insidious the problem of bias is.
There are 3 fundamental types of bias:
- One: Stochastic drift: you train your model on a dataset, and you validate the model on a split of the training set. When you apply your model out in the world, you systematically add bias in the predictions due to the training data being too specific
- Two: The bias in the model, introduced by your choice of the parameters of your model.
- Three: The bias in your training sample: people put training samples together, and people have culture, experience, and prejudice. As we will see today, this is the most dangerous and subtle bias. Today we’ll talk about this bias.
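To make the third kind of bias concrete, here is a minimal sketch with entirely made-up numbers. It assumes a toy "model" that is just a threshold on a single measurement, fitted on samples from one group only and then validated on a split of that same group, as described above. The groups, the measurements and the function names are all hypothetical.

```python
# Toy illustration with invented data: a threshold "model" fitted on one group only.

def accuracy(threshold, samples):
    """Fraction of (x, label) pairs where the rule 'x >= threshold' matches the label."""
    correct = sum((x >= threshold) == label for x, label in samples)
    return correct / len(samples)


def fit_threshold(samples):
    """Pick the observed x value that gives the best accuracy on the samples."""
    candidates = sorted({x for x, _ in samples})
    return max(candidates, key=lambda t: accuracy(t, samples))


# Hypothetical ground truth: the condition shows up at different values in each group.
group_a = [(x, x >= 7) for x in range(1, 11)]  # the group the study sampled
group_b = [(x, x >= 3) for x in range(1, 11)]  # the group nobody sampled

train = group_a[::2]        # x = 1, 3, 5, 7, 9
validation = group_a[1::2]  # x = 2, 4, 6, 8, 10 (a split of the same group)

threshold = fit_threshold(train)
validation_accuracy = accuracy(threshold, validation)
group_b_accuracy = accuracy(threshold, group_b)

print(threshold, validation_accuracy, group_b_accuracy)
```

The validation score looks perfect because the validation split shares the blind spot of the training sample. The error only shows up when the rule is applied to the group that was never sampled.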
Bias is a warping of our understanding of reality. We see reality through the lens of our experience and our culture. Bias can originate in traditions going back centuries, and it can be so ingrained in our way of thinking that we don’t even see it anymore.
F: And let me add, when it comes to machine learning, we see reality through the lens of data. Bias is everywhere, and we could spend hours and hours talking about it. It’s complicated.
It’s about to become more complicated.
F: of course, if I know you…
Let’s throw artificial intelligence in the mix.
F: You know, there was a happier time when this sentence didn’t fill me with a sense of dread...
ImageNet is an online database of over 14 million photos, compiled more than a decade ago at Stanford University. It has been used to train machine learning algorithms for image recognition and computer vision, and it played an important role in the rise of deep learning. We’ve all played with it, right? The cats and dogs classifier when learning TensorFlow? (I am a dog, by the way.)
F: ImageNet has been a critical asset for computer-vision research. There was an annual international competition to create algorithms that could most accurately label subsets of images.
In 2012, a team from the University of Toronto used a Convolutional Neural Network to handily win the top prize. That moment is widely considered a turning point in the development of contemporary AI. The final year of the ImageNet competition was 2017, and accuracy in classifying objects in the limited subset had risen from 71% to 97%. But that subset did not include the “Person” category, where the accuracy was much lower...
ImageNet contained photos of thousands of people, with labels. This included straightforward tags like “teacher,” “dancer” and “plumber”, as well as highly charged labels like “failure, loser” and “slut, slovenly woman, trollop.”
F: Uh Oh.
Then “ImageNet Roulette” was created, by an artist called Trevor Paglen and a Microsoft researcher named Kate Crawford. It was a digital art project, where you could upload your photo and let the classifier identify you, based on the labels of the database. Imagine how well that went.
F: I bet it didn’t work
Of course it didn’t work. Random people were classified as “orphans” or “non-smoker” or “alcoholic”. Somebody with glasses was a “nerd”. Tabong Kima, a 24-year-old African American, was classified as “offender” and “wrongdoer”.
F: and there it is.
Quote from Trevor Paglen: “We want to show how layers of bias and racism and misogyny move from one system to the next. The point is to let people see the work that is being done behind the scenes, to see how we are being processed and categorized all the time.”
F: The ImageNet labels were applied by thousands of unknown people, most likely in the United States, hired by the team from Stanford, and working through the crowdsourcing service Amazon Mechanical Turk. They earned pennies for each photo they labeled, churning through hundreds of labels an hour. The labels were not verified in any way: if a labeler thought someone looked “shady”, that label was just a result of their prejudice, with no basis in reality.
As they did, biases were baked into the database. Paglen quote again: “The way we classify images is a product of our worldview,” he said. “Any kind of classification system is always going to reflect the values of the person doing the classifying.” They defined what a “loser” looked like. And a “slut.” And a “wrongdoer.”
F: The labels originally came from another sprawling collection of data called WordNet, a kind of conceptual dictionary for machines built by researchers at Princeton University in the 1980s. But with these inflammatory labels included, the Stanford researchers may not have realized what they were doing.
What is happening here is the transferring of bias from one system to the next.
Tech jobs, in past decades but still today, predominantly go to white males from a narrow social class. Inevitably, they imprint the technology with their worldview. So their algorithms learn that a person of color is a criminal, and a woman with a certain look is a slut.
I’m not saying they do it on purpose, but the lack of diversity in the tech industry translates into a narrower world view, which has real consequences in the quality of AI systems.
F: Diversity in tech teams is often framed as an equality issue (which of course it is), but there are enormous advantages in it: it creates the cognitive diversity that is reflected in superior products or services.
I believe this is an ongoing problem. In recent months, researchers have shown that face-recognition services from companies like Amazon, Microsoft and IBM can be biased against women and people of color.
Crawford and Paglen argue this:
“In many narratives around AI it is assumed that ongoing technical improvements will resolve all problems and limitations.
But what if the opposite is true? What if the challenge of getting computers to “describe what they see” will always be a problem? The automated interpretation of images is an inherently social and political project, rather than a purely technical one. Understanding the politics within AI systems matters more than ever, as they are quickly moving into the architecture of social institutions: deciding whom to interview for a job, which students are paying attention in class, which suspects to arrest, and much else.”
F: You are using the words “interpretation of images” here, as opposed to “description” or “classification”. Certain images depict something concrete, with an objective reality. Like an apple. But other images… not so much?
ImageNet contains only images corresponding to nouns (not verbs, for example). Noun categories such as “apple” are well defined.
But not all nouns are created equal. Linguist George Lakoff points out that the concept of an “apple” is more nouny than the concept of “light”, which in turn is more nouny than a concept such as “health.”
Nouns occupy various places on an axis from concrete to abstract, and from descriptive to judgmental. The images corresponding to these nouns become more and more ambiguous.
These gradients have been erased in the logic of ImageNet. Everything is flattened out and pinned to a label.
The results can be problematic, illogical, and cruel, especially when it comes to labels applied to people.
F: so when an image is interpreted as Drug Addict, Crazy, Hypocrite, Spinster, Schizophrenic, Mulatto, Red Neck… this is not an objective description of reality, it’s somebody’s worldview coming to the surface.
The selection of images for these categories skews the meaning in ways that are gendered, racialized, ableist, and ageist. ImageNet is an object lesson in what happens when people are categorized like objects.
And this practice has only become more common in recent years, often inside the big AI companies, where there is no way for outsiders to see how images are being ordered and classified.
The bizarre thing about these systems is that they are reminiscent of early 20th century criminologists like Lombroso, of phrenologists (including Nazi scientists), and of physiognomy in general. This was a discipline founded on the assumption that there is a relationship between an image of a person and the character of that person. If you are a murderer, or a Jew, the shape of your head, for instance, will tell.
F: In reaction to these ideas, René Magritte produced that famous painting of the pipe with the tag “This is not a pipe”.
You know that famous photograph of the soldier kissing the nurse at the end of the second world war? The nurse went public about it when she was like 90 years old, and told how this total stranger in the street had grabbed her and kissed her. This is a picture of sexual harassment. And knowing that, it does not seem romantic anymore.
F: not romantic at all indeed
Images do not describe themselves. This is a feature that artists have explored for centuries. We see those images differently when we see how they’re labeled. The correspondence between image, label, and referent is fluid. What’s more, those relations can change over time as the cultural context of an image shifts, and can mean different things depending on who looks, and where they are located. Images are open to interpretation and reinterpretation. Entire subfields of philosophy, art history, and media theory are dedicated to teasing out all the nuances of the unstable relationship between images and meanings.
The common mythos of AI, and of the data it draws on, is that they objectively and scientifically classify the world. But that’s not true: everywhere there is politics, ideology, prejudice, and all of the subjective stuff of history.
F: When we survey the most widely used training sets, we find that this is the rule rather than the exception.
Training sets are the foundation on which contemporary machine-learning systems are built. They are central to how AI systems recognize and interpret the world.
By looking at the construction of these training sets and their underlying structures, we discover many unquestioned assumptions that are shaky and skewed. These assumptions inform the way AI systems work—and fail—to this day.
And the impenetrability of the algorithms, the impossibility of reconstructing the decision-making of a neural network, hides the bias even further from scrutiny. When an algorithm is a black box and you can’t look inside, you have no way of analysing its bias.
And the skewness and bias of these algorithms have real effects in society, which grow the more AI is used in the judicial system, in medicine, in the job market, in security systems based on facial recognition; the list goes on and on.
Last year Google unveiled BERT (Bidirectional Encoder Representations from Transformers). It’s an AI system that learns language: a Natural Language Processing model pre-trained on large amounts of text to build representations of written language.
F: we have an episode in which we explain all that
It was trained on lots and lots of digitized information, as varied as old books, Wikipedia entries and news articles. Decades and even centuries of biases, along with a few new ones, are baked into all that material. So for instance BERT is extremely sexist: it associates almost all professions and positive attributes with men (except for “mom”).
BERT is widely used in industry and academia. For example it can interpret news headlines automatically. Even Google’s search engine uses it.
Try googling “CEO”, and you get a gallery of images of old white men.
F: such a pervasive and flawed AI system can propagate inequality at scale. And it’s super dangerous because it’s subtle. Especially in industry, query results will not be tested and examined for bias. AI is a black box and researchers take results at face value.
There are many cases of algorithm-based discrimination in the job market. Targeting candidates for tech jobs, for instance, may be done by algorithms that do not recognise women as potential candidates. Therefore, women are not exposed to as many job ads as men. Or automated HR systems rank them lower (for the same CV) and screen them out.
In the US, algorithms are used to calculate bail. The majority of the prison population in the US is composed of people of colour, as a result of a systemic bias that goes back centuries. An algorithm learns that a person of colour is more likely to commit a crime, is more likely to not be able to afford bail, is more likely to violate parole. Therefore, people of colour will receive harsher punishments for the same crime. This amplifies this inequality at scale.
Question everything. Never take the predictions of your models at face value. Always question how your training samples have been put together, who put them together, when and in what context. Always remember that your model produces an interpretation of reality, not a faithful depiction.
Treat reality responsibly.
We always hear the word “metadata”, usually in a sentence that goes like this:
Your Honor, I swear, we were not collecting users’ data, just metadata.
Usually the guy saying this sentence is Zuckerberg, but could be anybody from Amazon or Google. “Just” metadata, so no problem. This is one of the biggest lies about the reality of data collection.
F: Ok the first question is, what the hell is metadata?
Metadata is data about data.
F: Ok… still not clear.
Imagine you make a phone call to your mum. How often do you call your mum, Francesco?
F: Every day of course! (coughing)
Good boy! Ok, so let’s talk about today’s phone call. Let’s call “data” the stuff that you and your mum actually said. What did you talk about?
F: She was giving me the recipe for her famous lasagna.
So your mum’s lasagna is the DATA. What is the metadata of this phone call? The lasagna has data of its own attached to it: the date and time when the conversation happened, the duration of the call, the unique hardware identifiers of your phone and your mum’s phone, the identifiers of the two SIM cards, the location of the cell towers that pinged the call, the GPS coordinates of the phones themselves.
F: yeah well, this lasagna comes with a lot of data :)
And this is assuming that this data is not linked to any other data like your Facebook account or your web browsing history. More on that later.
F: Whoa Whoa Whoa, ok. Let’s put a pin in that. Going back to the “basic” metadata that you describe. I think we understand the concept of data about data. I am sure you did your research and you would love to paint me a dystopian nightmare, as always. Tell us, why is this a big deal?
Metadata is a very big deal. In fact, metadata is far more “useful” than the actual data, where by “useful” I mean that it allows a third party to learn about you and your whole life. What I am saying is, the fact that you talk with your mum every day for 15 minutes is telling me more about you than the content of the actual conversations. In a way, the content does not matter. Only the metadata matters.
F: Ok, can you explain this point a bit more?
Imagine this scenario: you work in an office in Brussels, and you go by car. Every day, you use your time in the car while you go home to call your mum. So every day around 6pm, a cell tower along the path from your office to your home pings a call from your phone to your mum’s phone. Someone who is looking at your metadata, knows exactly where you are while you call your mum. Every day you will talk about something different, and it doesn't really matter. Your location will come through loud and clear. A lot of additional information can be deduced from this too: for example, you are moving along a motorway, therefore you have a car. The metadata of a call to mum now becomes information on where you are at 6pm, and the way you travel.
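The scenario above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the `CallRecord` fields, the tower names and the five records are invented, and real call records would carry many more identifiers. The point is that the record has no field for what was said, and the routine still falls out of it.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CallRecord:
    """Metadata of one call. There is no field for the content of the call."""
    caller: str
    callee: str
    start: datetime
    duration_min: int
    cell_tower: str  # hypothetical tower identifier


# Invented records for one working week.
records = [
    CallRecord("francesco", "mum", datetime(2020, 3, 2, 18, 5), 15, "tower_motorway"),
    CallRecord("francesco", "mum", datetime(2020, 3, 3, 18, 10), 14, "tower_motorway"),
    CallRecord("francesco", "mum", datetime(2020, 3, 4, 13, 0), 5, "tower_office"),
    CallRecord("francesco", "mum", datetime(2020, 3, 5, 18, 2), 16, "tower_motorway"),
    CallRecord("francesco", "mum", datetime(2020, 3, 6, 18, 7), 15, "tower_motorway"),
]


def usual_slot(records, caller, callee):
    """Return the most common (hour, tower) pair for calls between two parties."""
    slots = Counter(
        (r.start.hour, r.cell_tower)
        for r in records
        if r.caller == caller and r.callee == callee
    )
    return slots.most_common(1)[0][0]


routine = usual_slot(records, "francesco", "mum")
print(routine)
```

With these invented records the most common slot is hour 18 on the motorway tower, which is exactly the "where are you at 6pm, and how do you travel" information described above.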
F: I see. So metadata about the phone call is, in fact, real data about me.
Exactly. YOU are what is interesting, not your mum’s lasagna.
F: you say so because you haven’t tried my mum’s lasagna. But I totally get your point.
Now, imagine that one day, instead of going straight home, you decide to go somewhere else. Maybe you are secretly looking for another job. Your metadata is recording the fact that after work you visit the offices of a rival company. Maybe you are a journalist and you visit your anonymous source. Your metadata records wherever you go, and one of these places is your secret meeting with your source. Anyone’s metadata can be combined with yours. There will be someone who was with you at the time and place of your secret meeting. Anyone who comes in contact with you can be tagged and monitored. Now their anonymity has been reduced.
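Combining one person's metadata with everyone else's is, at its simplest, a lookup of who shared the same place and time. The sketch below assumes a hypothetical list of sightings of the form (person, cell tower, hour); the names, towers and the `co_located` helper are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical sightings: (device owner, cell tower, hour of day).
sightings = [
    ("journalist", "tower_office", 17),
    ("journalist", "tower_park", 19),
    ("source", "tower_park", 19),
    ("commuter", "tower_office", 17),
    ("commuter", "tower_station", 19),
    ("source", "tower_suburb", 22),
]


def co_located(sightings, target):
    """Map every other person to the (tower, hour) slots they shared with the target."""
    target_slots = {
        (tower, hour) for person, tower, hour in sightings if person == target
    }
    shared = defaultdict(set)
    for person, tower, hour in sightings:
        if person != target and (tower, hour) in target_slots:
            shared[person].add((tower, hour))
    return dict(shared)


contacts = co_located(sightings, "journalist")
print(contacts)
```

Nobody's conversation was recorded here, yet the anonymous source is now on a short list of people who were near the journalist at the time of the meeting.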
F: I get it. So, compared to the content of my conversation, its metadata contains more actionable information. And this is the most useful, and most precious, kind of information about me. What I do, what I like, who I am, beyond the particular conversation.
Precisely. If companies like Facebook or the phone companies had the explicit permission to collect all the users’ data, including all content of conversations, it’s still the metadata that would generate the most actionable information. They would probably throw the content of conversations away. In the vast majority of instances, the content does not matter. Unless you are an actual spy talking about state secrets, nobody cares.
F: Let’s stay on the spy point for a minute. One could say, So what? As I have heard this many times. So what if my metadata contains actionable information, and there are entities that collect it. If I am an honest person, I have nothing to hide.
There are two aspects to the problem of privacy. Government surveillance, and corporate - in other words private - surveillance.
Government surveillance is a topic that has been covered flawlessly by Edward Snowden in his book “Permanent Record”, and in the documentary about his activity, “Citizenfour”. I recommend both, and in fact I think every data scientist should read the one and watch the other.
Let’s just briefly mention the obvious: just because something comes from a government, it does not mean it’s legal or legitimate, or even ethical or moral. What if your government is corrupt, or authoritarian. What if you are a dissident and you are fighting for human rights. What if you are a journalist, trying to uncover government corruption.
F: In other words, it is a false equivalence to say that protecting your privacy has anything to do with having something to hide.
Mass surveillance of private citizens without cause is a danger to individual freedom as well as civil liberties. Government exists to serve its citizens, not the other way around. To freely paraphrase Snowden, as individuals have no power compared to the government, the only way the system works is if the government is completely transparent to the citizens, so that they can collectively change it, and at the same time the single citizens are opaque to the government, so that it cannot abuse its power. But today the opposite happens: we citizens are completely naked and exposed in front of a completely opaque government machine, with secret surveillance programs on us, that we don’t even know exist. We are not free to self-determine, or do anything about government power, really.
F: We could really talk for days and days about government mass surveillance. But let’s go back to metadata, and let’s talk about the commercial use of it. Metadata for sale. You mentioned this term, “corporate surveillance”. It sounds…. Ominous.
We live in privacy hell, Francesco.
F: I get that. According to your research, where can we find metadata?
First of all, metadata is everywhere. We are swimming in it. In each and every interaction between two people that makes use of digital technology, metadata is generated automatically, without the user’s consent. When two people interact, two machines also interact, recording the “context” of this interaction. Who we are, when, where, why, what we want.
F: And that doesn’t seem avoidable. In fact metadata must be generated by devices and software to just work properly. I look at it as an intrinsic component that cannot be removed from the communication system, whatever it is. The problem is who owns it. So tell me, who has such data?
It does not matter, because it’s all for sale. Which means, we are for sale.
F: Ok, holy s**t, this keeps getting darker. Let’s have a practical example, shall we?
Have you booked a flight recently?
F: Yep. I’m going to Berlin, and in fact so are you. For a hackathon, no less.
Have you ever heard of a company called Adara?
F: No… Cannot say that I have.
Adara is a “Predictive Traveler Intelligence” company.
F: sounds pretty pretentious. Kinda douchy.
This came up on the terrifying Twitter account of Wolfie Christl, author among other things of a great report about corporate surveillance for Cracked Labs. Go check him out on Twitter, he’s great.
F: Sure I will add what I find to the show notes of this episode. Oh and by the way you can find all this stuff on datascienceathome.com
Sorry go ahead.
Adara collects data - metadata - about travel-related online searches, purchases, devices, passenger records, loyalty program records. Data from clients that include major airlines, major airports, hotel chains and car rental chains. It creates a profile, a “traveler graph” in real time, for 750 million people around the world. A profile based on personal identifiers.
F: uhh uhh Then what?
Then Adara sells these profiles.
F: Ok… I have to say, the box that I tick giving consent to the third-party use of my personal data when I use an airline website does not quite convey how far my data actually goes.
Consent. LOL. Adara calculates a “traveler value score” based on customer behaviour and needs across the global travel ecosystem, over time.
The score is in the Salesforce Service Cloud, for sale to anyone.
This score, and your profile, determine the personalisation of travel offers and treatment, before purchase, during booking, post purchase, at check in, in airport, at destination.
On their own website, Adara explains how customer service agents for their myriad clients (for example, a front desk agent at a hotel) can instantly see the traveler value score. They will therefore treat you differently based on this score.
F: Oh so if you have money to spend they will treat you differently
The score is used to assess your potential value, to inform service and customer service strategies for you, as well as personalised messaging and relevant offers. And of course, the pricing you see when you look for flights. Low score? Prepare yourself to wait to have your call rerouted to a customer service agent. Would you ever tick a box to give consent to this?
F: Fuck no. How is this even legal? What about the GDPR?
It is, in fact, illegal. Adara is based in the US, but they collect data through data warehouses in the Netherlands. They claim they are GDPR-compliant. However, they collect all the data, and then decide on the specific business use, which is definitely not GDPR compliant.
F: Exactly! According to GDPR, the user has to know in advance the business use of the data they are giving consent for!
With GDPR and future regulations, there is a way to control how the data is used and with what purpose. Regulations are still blurred or undefined when it comes to metadata. For example, there’s no regulation for the number of records in a database or the timestamp when such record was created. As a matter of fact data is useless without metadata.
One cannot even collect data without metadata.
"Certain information (e.g. a recipient's identifier, an encrypted message body, etc.) is transmitted to us solely for the purpose of placing calls or transmitting messages. Unless otherwise stated below, this information is only kept as long as necessary to place each call or transmit each message, and is not used for any other purpose."
This is one of those issues that must be solved with legislation.
But like money laundering, your data is caught in a storm of transactions so intricate that at a certain point, how do you even check...
All participating companies share customer data with each other (a process called value exchange). They let marketers utilize the data, for example to target people after they have searched for flights or hotels. Adara creates audience segments and sells them, for example to Google, for advertisement targeting. The consumer data broker LiveRamp for example lists Adara as a data provider.
F: consumer data broker. I am starting to get what you mean when you say that we are for sale.
Let’s talk about LiveRamp, part of Acxiom.
F: there they go... Acxiom... I heard of them
They self-describe as an “Identity Resolution Platform”.
F: I mean, George Orwell would be proud.
Their mission? “To connect offline data and online data back to a single identifier”. In other words, clients can “resolve all” of their “offline and online identifiers back to the individual consumer”.
Various digital profiles, like the ones generated on social media or when you visit a website, are matched to databases which contain names, postal addresses, email addresses, phone numbers, geo locations and IP addresses, online and mobile identifiers, such as cookie and device IDs.
F: well, all this stuff is possible if and only if someone gets in possession of all these profiles, or well... they purchase them. Still, what the f**k.
A cute example? Imagine you register on any random website but you don’t want to give them your home address. They just buy it from LiveRamp, which gets it from your phone geolocation data - which is for sale. Where does your phone sit still for 12 hours every night? That’s your home address. Easy.
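The "where does your phone sit still every night" trick really is that easy. Here is a minimal sketch, assuming location pings reduced to (hour of day, coarse location cell); the cell names, the night window from 22:00 to 06:00 and the `guess_home` helper are assumptions made for the example.

```python
from collections import Counter

# Hypothetical location pings: (hour of day, coarse location cell).
pings = [
    (23, "cell_A"), (2, "cell_A"), (5, "cell_A"),
    (9, "cell_office"), (13, "cell_office"), (16, "cell_office"),
    (19, "cell_gym"),
    (22, "cell_A"), (3, "cell_A"),
]


def guess_home(pings, night_start=22, night_end=6):
    """Return the location seen most often during the night, or None if there is no night data."""
    night = [cell for hour, cell in pings if hour >= night_start or hour < night_end]
    if not night:
        return None
    return Counter(night).most_common(1)[0][0]


home = guess_home(pings)
print(home)
```

No address was ever typed into a form. The most frequent night-time cell is enough to guess it.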
F: And they definitely know how much time I spend at the gym, without even checking my Instagram! Ok this is another level of creepy.
Clients of LiveRamp can upload their own consumer data to the platform, combine it with data from more than 100 third-party data providers, and then utilize it on more than 500 marketing technology platforms. They can use this data to find and target people with specific characteristics, to recognize and track consumers across devices and platforms, to profile and categorize them, to personalize content for them, and to measure how they behave. For example, clients could “recognize a website visitor” and “provide a customized offer” based on extensive profile data, without requiring said user to log in to the website. Furthermore, LiveRamp has a data store, for other companies to “buy and sell valuable customer data”.
F: What is even the point of giving me the choice to consent to anything online?
In short, there is no point.
F: it seems we are so behind with regulations on data sharing. GDPR is not cutting it, not really. With programmatic advertising we have created a monster that has really grown out of control.
So: our lives are completely transparent to private corporations, which constantly surveil us en masse and exploit all of our data to sell us shit. How does this affect our freedom? How about we just don’t buy it? Can it be that simple? And I would not take a no for an answer here.
F: oh crap!
I’m going to read you a passage from Permanent Record:
Who among us can predict the future? Who would dare to?
The answer to the first question is no one, really, and the answer to the second is everyone, especially every government and business on the planet. This is what that data of ours is used for. Algorithms analyze it for patterns of established behaviour in order to extrapolate behaviours to come, a type of digital prophecy that’s only slightly more accurate than analog methods like palm reading. Once you go digging into the actual technical mechanisms by which predictability is calculated, you come to understand that its science is, in fact, anti-scientific, and fatally misnamed: predictability is actually manipulation.
A website that tells you that because you liked book 1 then you might also like book 2, isn’t offering an educated guess as much as a mechanism of subtle coercion. We can’t allow ourselves to be used in this way, to be used against the future. We can’t permit our data to be used to sell us the very things that must not be sold, such as journalism. [....]
We can’t let the god-like surveillance we’re under be used to “calculate” our citizenship scores, or to “predict” our criminal activity; to tell us what kind of education we can have, or what kind of job we can have [...], to discriminate against us based on our financial, legal, and medical histories, not to mention our ethnicity or race, which are constructs that data often assumes or imposes.
[...] if we allow [our data] to be used to identify us, then it will be used to victimize us, even to modify us - to remake the very essence of our humanity in the image of the technology that seeks its control. Of course, all of the above has already happened.
F: In other words, we are surveilled and our data collected, and used to affect every aspect of our lives - what we read, what movies we watch, where we travel, what we buy, who we date, what we study, where we work… This is a self-fulfilling prophecy for all of humanity, and the prophet is a stupid, imperfect algorithm optimised just to make money.
So I guess my message of today for all Data Scientists out there is this: just… don't.
- Wolfie Christl report https://crackedlabs.org/en/corporate-surveillance
In 2017 a research group at the University of Washington did a study on the Black Lives Matter movement on Twitter. They constructed what they call a “shared audience graph” to analyse the different groups of audiences participating in the debate, and found an alignment of the groups with the political left and political right, as well as clear alignments with groups participating in other debates, like environmental issues, abortion issues and so on. In simple terms, someone who is pro-environment, pro-abortion, left-leaning, is also supportive of the Black Lives Matter movement, and vice versa.
F: Ok, this seems to make sense, right? But… I suspect there is more to this story?
So far, yes…. What they did not expect to find, though, was a pervasive network of Russian accounts participating in the debate, which turned out to be orchestrated by the Internet Research Agency, the not-so-secret Russian secret service agency of internet black ops. The same connected with the US election and Brexit referendum, allegedly.
F: Are we talking about actual spies? Where are you going with this?
Basically, the Russian accounts (part of them human and part of them bots) were infiltrating all aspects of the debate, both on the left and on the right side, and always taking the most extreme stances on any particular aspect of the debate. The aim was to radicalise the conversation, to make it more and more extreme, in a tactic of divide-and-conquer: turn the population against itself in an online civil war, push for policies that normally would be considered too extreme (for instance, give tanks to the police to control riots, force a curfew, try to ban Muslims from your country). Chaos and unrest have repercussions on international trade and relations, and can align to foreign interests.
F: It seems like a pretty indirect and convoluted way of influencing a foreign power…
You might think so, but you are forgetting social media. This sort of operation is directly exploiting a core feature of internet social media platforms. And that feature, I am afraid, is recommender systems.
F: Whoa. Let’s take a step back. Let’s recap the general features of recommender systems, so we are on the same page.
The main purpose of recommender systems is to recommend to people the items that similar people have shown an interest in.
Let’s think about books and readers. The general idea is to find a way to match the best book to the right reader. Amazon is doing it, Netflix is doing it, probably the bookstore down the road does that too, just on a smaller scale.
Some of the most common methods to implement recommender systems use concepts such as cosine/correlation similarity, matrix factorization, neural autoencoders and sequence predictors.
The major issue with recommender systems is their validation. Offline validation works much as it does for many other machine learning methods, but to measure how effective a recommendation really is, one has to recommend a set of items first (in production). And recommending already alters the entire scenario being measured, a bit in the flavour of Heisenberg’s uncertainty principle.
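To make the cosine-similarity idea from the recap concrete, here is a minimal sketch in Python. It is not the method of any particular platform: the readers, books and ratings are invented for illustration, and the function names `cosine` and `recommend` are hypothetical. Each reader is a sparse vector of ratings, and a book is scored by the ratings of other readers, weighted by how similar those readers are.

```python
import math

# Hypothetical toy data (invented for illustration): reader -> {book: rating}.
ratings = {
    "ana":   {"book1": 5, "book2": 4, "book3": 1},
    "ben":   {"book1": 4, "book2": 5, "book4": 4},
    "carla": {"book3": 5, "book5": 4},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors stored as dicts."""
    shared = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def recommend(user, ratings, k=1):
    """Rank books the user has not rated by similarity-weighted ratings of others."""
    scores = {}
    for other, their_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their_ratings)
        for item, rating in their_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:k]

top_for_ana = recommend("ana", ratings)
```

In this toy data "ana" rates books much like "ben", so the book that only "ben" has read ends up at the top of her list. That is the whole mechanism: you are shown what people like you already liked.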
F: In the attention economy, the business model is to monetise the time the user spends on a platform, by showing them ads. Recommender systems are crucial for this purpose.
Chiara, you are saying that these algorithms have effects that are problematic?
As you say, recommender systems exist because the business model of social media platforms is to monetise attention. The most effective way to keep users’ attention is to show them stuff they could show an interest in.
In order to do that, one must segment the audience to find the best content for each user. But then, for each user, how do you keep them engaged, and make them consume more content?
F: You’re going to say the word “filter bubble” very soon.
Spot on. To keep the user on the platform, you start by showing them content that they are interested in, and that agrees with their opinion.
But that is not all. How many videos of the same stuff can you watch, how many articles can you read? You must also escalate the content that the user sees, increasing the wow factor. The content goes from mild to extreme (conspiracy theories, hate speech etc).
The recommended content pushes the user opinion towards more extreme stances. It is hard to see from inside the bubble, but a simple experiment will show it. If you continue to click the first recommended video on YouTube, and you follow the chain of first recommended videos, soon you will find yourself watching stuff you’d never have actively looked for, like conspiracy theories, or alt-right propaganda (or pranks that get progressively more cruel, videos by people committing suicide, and so on).
F: So you are saying that this is not an accident: is this the basis of the optimisation of the recommender system?
Yes, and it’s very effective. But obviously there are consequences.
F: And I’m guessing they are not good.
The collective result of single users being pushed toward more radical stances is a radicalisation of the whole conversation, the disappearance of nuances in the argument, the trivialisation of complex issues. For example, the Brexit debate in 2016 was about trade deals and customs unions, and now it is about remain vs no deal, with almost nothing in between.
F: Yes, the conversation is getting stupider. Is this just a giant accident? Just a sensible system that got out of control?
Yes and no. Recommender systems originate as a tool for boosting commercial revenue, by selling more products. But applied to social media, they have caused an aberration: the recommendation of information, which leads to the so-called filter bubbles, the rise of fake news and disinformation, and the manipulation of the masses.
There is an intense debate in the scientific community about the polarising effects of the internet and social media on the population. One such study is a paper by Johnson et al. It predicts that whether and how a population becomes polarised is dictated by the nature of the underlying competition, rather than the validity of the information that individuals receive or their online bubbles.
F: I would like to stress this finding. This is really f*cked up. Polarisation is not caused by the particular subject nor the way a debate is conducted. But by how legitimate the information seems to the single person. Which means that if I find a way to convince the single individuals about something, I will be in fact manipulating the debate at a community scale or, in some cases, globally!
Oh my god we seem to be so f*cked.
Take for instance the people who believe that the Earth is flat. Or the time it took people to recognise global warming as scientific, despite the fact that the threshold for scientific confirmation was reached decades ago.
F: So, recommender systems let loose on social media platforms amplify controversy and conflict, and fringe opinions. I know I’m not going to like the answer, but I’m going to ask the question anyway.
This is all just an innocent mistake, right?
Last year, the European Data Protection Supervisor published a report on online manipulation at scale.
F: That does not sound good.
The online digital ecosystem has connected people across the world with over 50% of the population on the Internet, albeit very unevenly in terms of geography, wealth and gender. The initial optimism about the potential of internet tools and social media for civic engagement has given way to concern that people are being manipulated. This happens through the combination of constant harvesting of often intimate information about them, and the control over the information they see online according to the category they are put into (so called segmentation of the audience). Arguably since 2016, but probably before, mass manipulation at scale has occurred during democratic elections, by using algorithms to game recommender systems, among other things, to spread misinformation. Remember Cambridge Analytica?
F: I remember. I wish I didn’t. But why does it work? Are we so easy to manipulate?
An interesting point is this. When one receives information collectively, as for example from the television news, it is far less likely that she develops extreme views (like, the Earth is flat), because she would base the discourse on a common understanding of reality. And people call out each other’s bulls*it.
F: Fair enough.
But when one receives information individually, as happens via a recommender system through micro-targeting, then reality has a different manifestation for each audience member, with no common ground. One is far more likely to adopt extreme views, because there is no way to fact check, and because the news feels personal. In fact, such news is tailored to users to push their buttons.
Francesco, if you show me George Clooney shirtless and holding a puppy, and George tells me that the Earth is flat, I might have doubts for a minute. Too personal?
F: That’s good to know about you. I’m more of a cat person. But, experts keep saying that we are moving towards personalisation of everything. While this makes sense for things like personalised medicine, it probably is not that beneficial with many other kinds of recommendations. Especially not the news.
But social media feeds are extremely personalised. What can we do?
Solutions have focused on transparency measures, exposing the source of information while neglecting the accountability of players in the ecosystem who profit from harmful behaviour. But these are band aids on bullet wounds.
The problem is the social media platforms. In October 2019 Zuckerberg was in front of Congress again, because Facebook refuses to fact-check political advertisements, in 2019, after everything that’s happened. At the same time market concentration and the rise of platform dominance threaten media pluralism. This, in turn, leads to a handful of news pieces being repeated and amplified, and to independent journalism being silenced.
F: When I think of a recommender system, I think of Netflix.
- You liked this kind of show in the past, so here are more shows of the same genre
- People like you have liked this other type of show. Hence, here it is for your consideration
This seems relatively benign. Although, if you think some more, you realise that this mechanism will prevent you from actually discovering anything new. It just gives you more of what you are likely to like. But one would not think that this would have world-changing consequences.
If you think of the news, this mechanism becomes lethal: in the mildest form – which is already bad – you will only hear opinions that already align with those of your own peer group. In the worst scenario, you will not hear some news at all, or you will hear a misleading or false version of the news, and you don’t even know that a different version exists.
In the Brexit referendum, misleading or false content (like the famous NHS money that supposedly was going to the EU instead) has been amplified in filter bubbles. Each bubble of people was essentially understanding a different version of the same issue. Brexit was a million different things, depending on your social media feeds.
And of course, there are malicious players in the game, like the Russian Internet Research Agency and Cambridge Analytica, who actively exploited these features in order to swing the vote.
F: Even the traditional media is starting to adopt recommender systems for the news content. This seems like a very bad idea, after all. Is there any other scenario in which recommender systems are not great?
Researchers use recommender systems in a variety of applications.
For instance, in the job market. A recommender system limits exposure to certain information about jobs on the basis of the person’s gender or inferred health status, and therefore it perpetuates discriminatory attitudes and practices. In the US, researchers use recommender systems to calculate the bail fee for people who have been arrested, disproportionately penalising people of colour. This has to do with the training of the algorithm. In an already unequal system (where for instance there are few women in top managerial positions, and more African-Americans in jail than white Americans) a recommender system will by design amplify such inequality.
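The amplification claim can be illustrated with a deliberately naive sketch. Everything here is an assumption made for illustration: the 80/20 split is invented, and `most_likely_group` and `simulate` are hypothetical names, not any real system. The "recommender" simply suggests whichever group was most common in the past, and each suggestion is fed back into the data, as happens when a deployed system is retrained on its own output.

```python
from collections import Counter

# Hypothetical historical data (invented): past managerial hires by gender.
history = ["man"] * 80 + ["woman"] * 20

def most_likely_group(data):
    """A naive 'recommender': always pick the historically most common group."""
    return Counter(data).most_common(1)[0][0]

def simulate(history, rounds):
    """Feed every recommendation back into the data the next one is based on."""
    data = list(history)
    for _ in range(rounds):
        data.append(most_likely_group(data))
    return Counter(data)

before = Counter(history)
after = simulate(history, rounds=100)

share_before = before["woman"] / sum(before.values())
share_after = after["woman"] / sum(after.values())
print(f"share of women before: {share_before:.2f}, after: {share_after:.2f}")
```

Under these assumptions no one in the minority group is ever recommended, so their share of the data shrinks with every round. Real systems are far more sophisticated, but the loop of learning from an unequal past and writing the result back into the future has the same shape.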
F: Recommender systems are part of the problem, and they make everything worse. But the origin of the problem lies somewhere else, I suspect.
Yep. The problem with recommender systems goes even deeper. I would rather connect it to the problem of privacy. Recommender systems only work if they know their audience. They are so powerful because they know everything about us.
We don’t have any privacy anymore. Online players know exactly who we are, our lives are transparent to both corporations and governments. For an excellent analysis of this, read Snowden’s book “Permanent Record”. I highly recommend it.
F: The pun was intended wasn’t it?
With all this information about us, we are put into “categories” for specific purposes: selling us products, influencing our vote. They target us with ads aimed at our specific category, and this generates more discussion and more content on our social media. Recommender systems amplify the targeting by design. They would be much less effective, and much less dangerous, in a world where our lives are private.
F: Social media platforms base their whole business model on “knowing us”. The business model itself is problematic.
As we said in the previous episode, the internet has become centralised, with a handful of platforms controlling most of the traffic. In some countries like Myanmar, internet access itself is provided and controlled by Facebook.
F: Chiara, where’s Myanmar?
In South-East Asia, between India and Thailand.
In effect, the forum for public discourse and the available space for freedom of speech is now bounded by the profit motives of powerful private companies. Due to technical complexity or on the grounds of commercial secrecy, such companies decline to explain how decisions are made. Mostly, they make decisions via recommender algorithms, which amplify bias and segregation. And at the same time the few major platforms with their extraordinary reach offer an easy target for people seeking to use the system for malicious ends.
This is our call to all data scientists out there. Be aware of personalisation in building recommender systems. Personalising is not always beneficial. There are a few cases where it is, e.g. medicine, genetics, drug discovery. There are many other cases where it is detrimental, e.g. news, consumer products/services, opinions.
Personalisation by algorithm, and in particular of the news, leads to a fragmentation of reality that undermines democracy. Collectively we need to push for reining in targeted advertising, and the path to this leads to stricter rules on privacy. As long as we are completely transparent to commercial and governmental players, like we are today, we are vulnerable to lies, misdirection and manipulation.
As Christopher Wylie (the Cambridge Analytica whistleblower) eloquently said, it’s like going on a date, where you know nothing about the other person, but they know absolutely everything about you.
We are left without agency, and without real choice.
In other words, we are f*cked
Black lives matter / Internet Research Agency (IRA) articles:
Johnson et al. “Population polarization dynamics and next-generation social media algorithms” https://arxiv.org/abs/1712.06009
In this episode I am completing the explanation about the integration fitchain-oceanprotocol that allows secure on-premise compute to operate in the decentralized data marketplace designed by Ocean Protocol.
As mentioned in the show, this is a picture that provides a 10000-feet view of the integration.
I hope you enjoy the show!
In this episode I briefly explain how two massive technologies have been merged in 2018 (work in progress :) - one providing secure machine learning on isolated data, the other implementing a decentralized data marketplace.
In this episode I explain:
- How do we make machine learning decentralized and secure?
- How can data owners keep their data private?
- How can we benefit from blockchain technology for AI and machine learning?
I hope you enjoy the show!
fitchain.io decentralized machine learning
Ocean protocol decentralized data marketplace
In this episode I don't talk about data. In fact, I talk about metadata.
While many machine learning models rely on certain amounts of data, e.g. text, images, audio and video, it has been shown how powerful the signal carried by metadata is, that is, all the data that is invisible to the end user.
Behind a tweet of 140 characters there are more than 140 fields of data that draw a much more detailed profile of the sender and the content she is producing... without ever considering the tweet itself.
You are your Metadata: Identification and Obfuscation of Social Media Users using Metadata Information https://www.ucl.ac.uk/~ucfamus/papers/icwsm18.pdf
Attacking deep learning models
Compromising AI for fun and profit
Deep learning models have shown very promising results in computer vision and sound recognition. As more and more deep learning based systems get integrated in disparate domains, they will keep affecting the life of people. Autonomous vehicles, medical imaging and banking applications, surveillance cameras and drones, digital assistants, are only a few real applications where deep learning plays a fundamental role. A malfunction in any of these applications will affect the quality of such integrated systems and compromise the security of the individuals who directly or indirectly use them.
In this episode, we explain how machine learning models can be attacked and what we can do to protect intelligent systems from being compromised.
Humans seem to have reached a cross-point, where they are asked to choose between functionality and privacy. But not both. Not both at all. No data, no service. That’s what companies building personal finance services say. The same applies to marketing companies, social media companies, search engine companies, and healthcare institutions.
In this episode I speak about the reasons to aggregate data for precision medicine, the consequences of such strategies and how researchers and organizations can provide services to individuals while respecting their privacy.
Data is a complex topic, not only related to machine learning algorithms, but also and especially to privacy and security of individuals, the same individuals who create such data just by using the many mobile apps and services that characterize their digital life.
In this episode I am together with B.J. Mendelson, author of “Social Media is Bullshit” from St. Martin’s Press and world-renowned speaker on the myths and realities of today’s Internet platforms. B.J. has a new book about privacy and sent me a free copy of "Privacy, and how to get it back" that I read in just one day. That was enough to realise how much we have in common when it comes to data and data collection.
Posted in privacy and security on Feb 15th, 2017
Talking about security of communication and privacy is never enough, especially when political instabilities are driving leaders towards decisions that will affect people on a global scale
Data science is making the difference also in fraud detection. In this episode I have a conversation with an expert in the field, Engineer Eyad Sibai, who works at iZettle, a fraud detection company
Extracting knowledge from large datasets with large number of variables is always tricky. Dimensionality reduction helps in analyzing high dimensional data, still maintaining most of the information hidden behind complexity. Here are some methods that you must try before further analysis (Part 1).
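As a flavour of what dimensionality reduction does, here is a minimal sketch of principal component analysis (PCA), one common method of this kind. Whether PCA is among the methods covered in the episode is an assumption. The data is synthetic: five observed variables generated from two hidden ones plus a little noise. The function name `pca` is hypothetical, and NumPy is assumed to be installed.

```python
import numpy as np

# Synthetic data (invented): 200 samples, 5 variables driven by 2 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

def pca(X, k):
    """Project X onto its top-k principal components using the SVD."""
    Xc = X - X.mean(axis=0)                      # centre each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)              # variance ratio per component
    return Xc @ Vt[:k].T, explained

Z, explained = pca(X, k=2)
print(Z.shape)                 # reduced data: 200 samples, 2 components
print(explained.round(4))      # share of variance carried by each component
```

Because the five variables were built from two factors, the first two components should carry almost all of the variance, which is exactly the situation in which reducing dimensions loses very little information.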