Monday Dec 23, 2019
The dark side of AI: metadata and the death of privacy (Ep. 91)
Episode transcript
We always hear the word “metadata”, usually in a sentence that goes like this:
Your Honor, I swear, we were not collecting users’ data, just metadata.
Usually the guy saying this sentence is Zuckerberg, but it could be anybody from Amazon or Google. “Just” metadata, so no problem. This is one of the biggest lies about the reality of data collection.
F: Ok the first question is, what the hell is metadata?
Metadata is data about data.
F: Ok… still not clear.
Imagine you make a phone call to your mum. How often do you call your mum, Francesco?
F: Every day of course! (coughing)
Good boy! Ok, so let’s talk about today’s phone call. Let’s call “data” the stuff that you and your mum actually said. What did you talk about?
F: She was giving me the recipe for her famous lasagna.
So your mum’s lasagna is the DATA. What is the metadata of this phone call? The call has data of its own attached to it: the date and time when the conversation happened, the duration of the call, the unique hardware identifiers of your phone and your mum’s phone, the identifiers of the two SIM cards, the location of the cell towers that pinged the call, the GPS coordinates of the phones themselves.
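To make the split concrete, here is a minimal Python sketch of a single call record. All field names and values are invented for illustration, not any real carrier's schema:

```python
# Illustrative sketch: the "data" vs. the "metadata" of one phone call.
# Every field name and value here is invented, not a real carrier schema.
call_record = {
    "data": "Mum's lasagna recipe...",  # the conversation itself
    "metadata": {
        "timestamp": "2019-12-23T18:02:11+01:00",  # date and time of the call
        "duration_seconds": 912,                   # how long it lasted
        "caller_imei": "35-209900-176148-1",       # your phone's hardware ID
        "callee_imei": "35-209900-981724-2",       # mum's phone's hardware ID
        "caller_imsi": "204043312345678",          # your SIM identifier
        "callee_imsi": "204049987654321",          # mum's SIM identifier
        "cell_tower_id": "BRU-4521",               # tower that carried the call
        "gps": (50.8503, 4.3517),                  # the phone's coordinates
    },
}

# Notice how much is known without reading a word of the conversation:
print(sorted(call_record["metadata"].keys()))
```

Everything under "metadata" is generated automatically by the network, whether or not anyone ever listens to the call.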
F: yeah well, this lasagna comes with a lot of data :)
And this is assuming that this data is not linked to any other data, like your Facebook account or your web browsing history. More on that later.
F: Whoa Whoa Whoa, ok. Let’s put a pin in that. Going back to the “basic” metadata that you describe. I think we understand the concept of data about data. I am sure you did your research and you would love to paint me a dystopian nightmare, as always. So tell us: why is this a big deal?
Metadata is a very big deal. In fact, metadata is far more “useful” than the actual data, where by “useful” I mean that it allows a third party to learn about you and your whole life. What I am saying is, the fact that you talk with your mum every day for 15 minutes is telling me more about you than the content of the actual conversations. In a way, the content does not matter. Only the metadata matters.
F: Ok, can you explain this point a bit more?
Imagine this scenario: you work in an office in Brussels, and you go by car. Every day, you use your time in the car on the way home to call your mum. So every day around 6pm, a cell tower along the path from your office to your home pings a call from your phone to your mum’s phone. Someone who is looking at your metadata knows exactly where you are while you call your mum. Every day you will talk about something different, and it doesn’t really matter. Your location will come through loud and clear. A lot of additional information can be deduced from this too: for example, you are moving along a motorway, therefore you have a car. The metadata of a call to mum now becomes information on where you are at 6pm, and the way you travel.
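That inference is trivial to automate. A minimal sketch, using invented call records that contain no content at all:

```python
from collections import Counter
from datetime import datetime

# Invented call metadata: one record per day, no conversation content.
calls = [
    {"ts": "2019-12-16T18:05:00", "tower": "E40-motorway-km12"},
    {"ts": "2019-12-17T18:02:00", "tower": "E40-motorway-km12"},
    {"ts": "2019-12-18T17:58:00", "tower": "E40-motorway-km12"},
    {"ts": "2019-12-19T18:07:00", "tower": "E40-motorway-km12"},
]

# Extract the hour of each call and the tower that carried it.
hours = [datetime.fromisoformat(c["ts"]).hour for c in calls]
towers = Counter(c["tower"] for c in calls)

# The modal (hour, tower) pair is this person's daily routine:
routine_hour = Counter(hours).most_common(1)[0][0]
routine_tower = towers.most_common(1)[0][0]
print(f"Calls mum around {routine_hour}:00 from {routine_tower}")
# A motorway tower at the same hour every day: this person commutes by car.
```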
F: I see. So metadata about the phone call is, in fact, real data about me.
Exactly. YOU are what is interesting, not your mum’s lasagna.
F: you say so because you haven’t tried my mum’s lasagna. But I totally get your point.
Now, imagine that one day, instead of going straight home, you decide to go somewhere else. Maybe you are secretly looking for another job. Your metadata is recording the fact that after work you visit the offices of a rival company. Maybe you are a journalist and you visit your anonymous source. Your metadata records wherever you go, and one of these places is your secret meeting with your source. Anyone’s metadata can be combined with yours. There will be someone who was with you at the time and place of your secret meeting. Anyone who comes in contact with you can be tagged and monitored. Now their anonymity has been reduced.
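Combining two people's metadata is nothing more than a set intersection on (place, time). A sketch with invented location pings:

```python
# Invented location pings: each entry is (cell ID, hour-resolution timestamp).
# In reality these would come from tower pings or GPS, for millions of people.
journalist = {
    ("cell_4411", "2019-12-20T19"),
    ("cell_8102", "2019-12-20T21"),  # the "secret" meeting
}
unknown_contact = {
    ("cell_9935", "2019-12-20T19"),
    ("cell_8102", "2019-12-20T21"),  # same place, same hour
}

# Anyone who shares a place and time with you is now linked to you:
co_locations = journalist & unknown_contact
print(co_locations)
```

One shared (place, hour) pair is enough to tag the anonymous source as "the person who met the journalist".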
F: I get it. So, compared to the content of my conversation, its metadata contains more actionable information. And this is the most useful, and most precious, kind of information about me. What I do, what I like, who I am, beyond the particular conversation.
Precisely. If companies like Facebook or the phone companies had the explicit permission to collect all the users’ data, including all content of conversations, it’s still the metadata that would generate the most actionable information. They would probably throw the content of conversations away. In the vast majority of instances, the content does not matter. Unless you are an actual spy talking about state secrets, nobody cares.
F: Let’s stay on the spy point for a minute. One could say, and I have heard this many times: so what? So what if my metadata contains actionable information, and there are entities that collect it? If I am an honest person, I have nothing to hide.
There are two aspects to the problem of privacy. Government surveillance, and corporate - in other words private - surveillance.
Government surveillance is a topic that has been covered flawlessly by Edward Snowden in his book “Permanent Record”, and in the documentary about him, “Citizenfour”. I recommend both; in fact, I think every data scientist should read and watch them.
Let’s just briefly mention the obvious: just because something comes from a government, it does not mean it’s legal or legitimate, or even ethical or moral. What if your government is corrupt, or authoritarian? What if you are a dissident fighting for human rights? What if you are a journalist, trying to uncover government corruption?
F: In other words, it is a false equivalence to say that protecting your privacy has anything to do with having something to hide.
Mass surveillance of private citizens without cause is a danger to individual freedom as well as civil liberties. Government exists to serve its citizens, not the other way around. To freely paraphrase Snowden: since individuals have no power compared to the government, the system works only if the government is completely transparent to the citizens, so that they can collectively change it, while individual citizens remain opaque to the government, so that it cannot abuse its power. But today the opposite happens: we citizens are completely naked and exposed in front of a completely opaque government machine, which runs secret surveillance programs on us that we don’t even know exist. We are not free to self-determine, or to do anything about government power, really.
F: We could really talk for days and days about government mass surveillance. But let’s go back to metadata, and let’s talk about the commercial use of it. Metadata for sale. You mentioned this term, “corporate surveillance”. It sounds… ominous.
We live in privacy hell, Francesco.
F: I get that. According to your research, where can we find metadata?
First of all, metadata is everywhere. We are swimming in it. In every interaction between two people that makes use of digital technology, metadata is generated automatically, without the user’s consent. When two people interact, two machines also interact, recording the “context” of the interaction: who we are, when, where, why, what we want.
F: And that doesn’t seem avoidable. In fact, metadata must be generated for devices and software to work properly. I look at it as an intrinsic component that cannot be removed from the communication system, whatever it is. The problem is who owns it. So tell me, who has such data?
It does not matter, because it’s all for sale. Which means, we are for sale.
F: Ok, holy s**t, this keeps getting darker. Let’s have a practical example, shall we?
Have you booked a flight recently?
F: Yep. I’m going to Berlin, and in fact so are you. For a hackathon, no less.
Have you ever heard of a company called Adara?
F: No… Cannot say that I have.
Adara is a “Predictive Traveler Intelligence” company.
F: sounds pretty pretentious. Kinda douchy.
This came up on the terrifying Twitter account of Wolfie Christl, author, among other things, of a great report on corporate surveillance for Cracked Labs. Go check him out on Twitter, he’s great.
F: Sure, I will add what I find to the show notes of this episode. Oh, and by the way, you can find all this stuff on datascienceathome.com
Sorry go ahead.
Adara collects data - metadata - about travel-related online searches, purchases, devices, passenger records, loyalty program records. Data from clients that include major airlines, major airports, hotel chains and car rental chains. It creates a profile, a “traveler graph” in real time, for 750 million people around the world. A profile based on personal identifiers.
F: uhh uhh Then what?
Then Adara sells these profiles.
F: Ok… I have to say, the box that I tick giving consent to the third-party use of my personal data when I use an airline website does not quite convey how far my data actually goes.
Consent. LOL. Adara calculates a “traveler value score” based on customer behaviour and needs across the global travel ecosystem, over time.
The score is in the Salesforce Service Cloud, for sale to anyone.
This score, and your profile, determine the personalisation of travel offers and treatment, before purchase, during booking, post purchase, at check in, in airport, at destination.
On their own website, Adara explains how customer service agents for their myriad clients, for example a front desk agent at a hotel, can instantly see the traveler value score. Therefore they will treat you differently based on this score.
F: Oh, so if you have money to spend they will treat you differently.
The score is used to assess your potential value, to inform service and customer service strategies for you, as well as personalised messaging and relevant offers. And of course, the pricing you see when you look for flights. Low score? Prepare to wait while your call is rerouted to a customer service agent. Would you ever tick a box to give consent to this?
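Adara does not publish its scoring formula, so the following is a purely hypothetical Python sketch of how such a score could drive differential treatment. Every field, weight and threshold is invented:

```python
# Hypothetical traveler-value scoring. Weights, fields and thresholds are
# all invented for illustration; Adara's real model is not public.
def traveler_value_score(profile: dict) -> float:
    return (profile["flights_per_year"] * 2.0
            + profile["avg_ticket_price"] / 100.0
            + profile["loyalty_tier"] * 5.0)

def service_treatment(score: float) -> str:
    # Higher score: faster service, better offers. Lower score: the queue.
    if score >= 50:
        return "priority line, upgrade offer"
    if score >= 20:
        return "standard service"
    return "rerouted to hold queue"

frequent_flyer = {"flights_per_year": 30, "avg_ticket_price": 800, "loyalty_tier": 3}
occasional = {"flights_per_year": 2, "avg_ticket_price": 150, "loyalty_tier": 0}

print(service_treatment(traveler_value_score(frequent_flyer)))  # priority line, upgrade offer
print(service_treatment(traveler_value_score(occasional)))      # rerouted to hold queue
```

The point of the sketch: once a single number summarizes you, every touchpoint can discriminate on it instantly.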
F: Fuck no. How is this even legal? What about the GDPR?
It is, in fact, illegal. Adara is based in the US, but they collect data through data warehouses in the Netherlands. They claim they are GDPR-compliant. However, they collect all the data, and then decide on the specific business use, which is definitely not GDPR compliant.
F: exactly! According to the GDPR, the user has to know in advance the business use of the data they are giving consent for!
With GDPR and future regulations, there is a way to control how data is used and for what purpose. But regulations are still blurred or undefined when it comes to metadata. For example, there’s no regulation for the number of records in a database, or the timestamp at which a record was created. As a matter of fact, data is useless without metadata.
One cannot even collect data without metadata.
WhatsApp, Telegram, Facebook Messenger... they all create metadata. So one might say, “I’ve got end-to-end encryption, buddy”. Sure thing. How about the metadata attached to that encrypted gibberish nobody is really interested in? To show you how unavoidable the concept of metadata is: even Signal, developed by the Signal Foundation and widely considered the gold standard of end-to-end encrypted, open-source messaging, can see metadata. Signal claims they just don’t keep it, as they state in Signal’s privacy policy:
"Certain information (e.g. a recipient's identifier, an encrypted message body, etc.) is transmitted to us solely for the purpose of placing calls or transmitting messages. Unless otherwise stated below, this information is only kept as long as necessary to place each call or transmit each message, and is not used for any other purpose."
This is one of those issues that will have to be solved with legislation.
But like money laundering, your data is caught in a storm of transactions so intricate that at a certain point, how do you even check?
All participating companies share customer data with each other (a process called value exchange). They let marketers utilize the data, for example to target people after they have searched for flights or hotels. Adara creates audience segments and sells them, for example to Google, for advertisement targeting. The consumer data broker LiveRamp for example lists Adara as a data provider.
F: consumer data broker. I am starting to get what you mean when you say that we are for sale.
Let’s talk about LiveRamp, part of Acxiom.
F: there they go... Acxiom... I heard of them
They self-describe as an “Identity Resolution Platform”.
F: I mean, George Orwell would be proud.
Their mission? “To connect offline data and online data back to a single identifier”. In other words, clients can “resolve all” of their “offline and online identifiers back to the individual consumer”.
Various digital profiles, like the ones generated on social media or when you visit a website, are matched to databases which contains names, postal addresses, email addresses, phone numbers, geo locations and IP addresses, online and mobile identifiers, such as cookie and device IDs.
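Mechanically, "identity resolution" is just record linkage: joining datasets on shared identifiers, such as a hashed email address. A hypothetical sketch with invented data:

```python
import hashlib

# Invented datasets. Real brokers link on many more keys (device IDs,
# postal addresses, phone numbers); a hashed email is one common join key.
def email_key(email: str) -> str:
    return hashlib.sha256(email.lower().encode()).hexdigest()

offline_db = {  # e.g. collected during a loyalty-card signup
    email_key("jane@example.com"): {
        "name": "Jane Doe",
        "address": "1 Rue X, Brussels",
    },
}
online_event = {  # e.g. emitted during an ordinary website visit
    "cookie_id": "abc-123",
    "email_hash": email_key("jane@example.com"),
}

# The join: an anonymous browsing cookie becomes a named person
# with a home address.
resolved = offline_db.get(online_event["email_hash"])
print(online_event["cookie_id"], "->", resolved["name"])
```

That one dictionary lookup is the "resolution": the more identifiers two datasets share, the more profiles collapse into a single individual.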
F: well, all this stuff is possible if and only if someone gets possession of all these profiles, or well... purchases them. Still, what the f**k.
A cute example? Imagine you register on any random website but you don’t want to give them your home address. They just buy it from LiveRamp, which gets it from your phone geolocation data - which is for sale. Where does your phone sit still for 12 hours every night? That’s your home address. Easy.
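The home-address trick really is that simple: take the modal nighttime location of the device. A sketch with invented pings:

```python
from collections import Counter
from datetime import datetime

# Invented phone pings: (ISO timestamp, location rounded to ~100 m).
pings = [
    ("2019-12-20T02:00", (50.851, 4.352)),  # night
    ("2019-12-20T04:00", (50.851, 4.352)),  # night
    ("2019-12-20T14:00", (50.840, 4.400)),  # daytime, the office
    ("2019-12-21T03:00", (50.851, 4.352)),  # night
]

# Keep only nighttime pings (midnight to 6 am), then take the most
# common location: where the phone sits still every night is "home".
night_locs = [loc for ts, loc in pings
              if 0 <= datetime.fromisoformat(ts).hour < 6]
home = Counter(night_locs).most_common(1)[0][0]
print("Inferred home location cell:", home)
```

From there, a reverse-geocoding lookup turns the coordinates into a street address.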
F: And they definitely know how much time I spend at the gym, without even checking my Instagram! Ok, this is another level of creepy.
Clients of LiveRamp can upload their own consumer data to the platform, combine it with data from hundreds of third-party data providers, and then utilize it on more than 500 marketing technology platforms. They can use this data to find and target people with specific characteristics, to recognize and track consumers across devices and platforms, to profile and categorize them, to personalize content for them, and to measure how they behave. For example, clients could “recognize a website visitor” and “provide a customized offer” based on extensive profile data, without requiring said user to log in to the website. Furthermore, LiveRamp has a data store, for other companies to “buy and sell valuable customer data”.
F: What is even the point of giving me the choice to consent to anything online?
In short, there is no point.
F: it seems we are so behind with regulations on data sharing. GDPR is not cutting it, not really. With programmatic advertising we have created a monster that has really grown out of control.
So: our lives are completely transparent to private corporations, which constantly surveil us en masse and exploit all of our data to sell us shit. How does this affect our freedom? How about we just don’t buy it? Can it be that simple? And I will not take no for an answer here.
Unfortunately, no.
F: oh crap!
I’m going to read you a passage from Permanent Record:
Who among us can predict the future? Who would dare to?
The answer to the first question is no one, really, and the answer to the second is everyone, especially every government and business on the planet. This is what that data of ours is used for. Algorithms analyze it for patterns of established behaviour in order to extrapolate behaviours to come, a type of digital prophecy that’s only slightly more accurate than analog methods like palm reading. Once you go digging into the actual technical mechanisms by which predictability is calculated, you come to understand that its science is, in fact, anti-scientific, and fatally misnamed: predictability is actually manipulation.
A website that tells you that because you liked book 1 then you might also like book 2, isn’t offering an educated guess as much as a mechanism of subtle coercion. We can’t allow ourselves to be used in this way, to be used against the future. We can’t permit our data to be used to sell us the very things that must not be sold, such as journalism. [....]
We can’t let the god-like surveillance we’re under be used to “calculate” our citizenship scores, or to “predict” our criminal activity; to tell us what kind of education we can have, or what kind of job we can have [...], to discriminate against us based on our financial, legal, and medical histories, not to mention our ethnicity or race, which are constructs that data often assumes or imposes.
[...] if we allow [our data] to be used to identify us, then it will be used to victimize us, even to modify us - to remake the very essence of our humanity in the image of the technology that seeks its control. Of course, all of the above has already happened.
F: In other words, we are surveilled and our data collected, and used to affect every aspect of our lives - what we read, what movies we watch, where we travel, what we buy, who we date, what we study, where we work… This is a self-fulfilling prophecy for all of humanity, and the prophet is a stupid, imperfect algorithm optimised just to make money.
So I guess my message of today for all Data Scientists out there is this: just… don't.
References
- Wolfie Christl, “Corporate Surveillance in Everyday Life”, Cracked Labs: https://crackedlabs.org/en/corporate-surveillance