Data Science at Home
Episodes
Thursday Nov 09, 2023
Rolling the Dice: Engineering in an Uncertain World (Ep. 242)
Hey there, engineering enthusiasts! Ever wondered how engineers deal with the wild, unpredictable twists and turns in their projects? In this episode, we're spilling the beans on uncertainty and why it's the secret sauce in every engineering recipe, not just the fancy stuff like deep learning and neural networks!
Join us for a ride through the world of uncertainty quantification. Tune in and let's demystify the unpredictable together!

References
https://www.osti.gov/servlets/purl/1428000
https://arc.aiaa.org/doi/pdf/10.2514/6.2010-124
https://arxiv.org/pdf/2001.10411
Friday Dec 02, 2022
Machine learning is physics (Ep. 211)
What if we borrowed from physics some theories to interpret deep learning, and machine learning in general? Here is a list of plausible ways to interpret our beloved ML models and understand why they work, or why they don't. Enjoy the show!
Our Sponsors
NordPass Business has developed a password manager that will save you a lot of time and energy whenever you need access to business accounts, work across devices (even with other members of your team), need to share sensitive data with your colleagues, or make payments efficiently. All this with the highest standard of cybersecurity technology.
See NordPass Business in action now with a 3-month free trial here: https://nordpass.com/DATASCIENCE with code DATASCIENCE
Amethix works to create and maximize the impact of the world's leading corporations and startups, so they can create a better future for everyone they serve. We provide solutions in AI/ML, Fintech, Healthcare/RWE, and Predictive maintenance.

Monday Jun 13, 2022
Online learning is better than batch, right? Wrong! (Ep. 200)
In this episode I speak about online learning systems and why blindly choosing such a paradigm can lead to very unpredictable and expensive outcomes. Also in this episode, I have to deal with an intruder :)
Links
Birman, K.; Joseph, T. (1987). "Exploiting virtual synchrony in distributed systems". Proceedings of the Eleventh ACM Symposium on Operating Systems Principles (SOSP '87), pp. 123-138. doi:10.1145/41457.37515. ISBN 089791242X. S2CID 7739589.

Saturday Mar 19, 2022
Bayesian Machine Learning with Ravin Kumar (Ep. 191)
This is one episode where passion for math, statistics, and computers merge. I have a very interesting conversation with Ravin, a data scientist at Google, where he uses data to inform decisions.
He has previously worked at Sweetgreen, designing systems that would benefit team members and communities through sustainable and healthy food, and SpaceX, creating tools that would ultimately launch rocket ships.
All opinions in this episode are his own and none of the companies he has worked for are represented.
This episode is brought to you by RailzAI
The Railz API connects to major accounting platforms to provide you with quick access to normalized and analyzed financial data. Get free access to their API and more. Just tell them you came through Data Science at Home podcast.
and by Amethix Technologies
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.

References
Bayesian Modeling and Computation in Python (Chapman & Hall/CRC Texts in Statistical Science) amazon.com
https://twitter.com/canyon289
Tuesday Aug 03, 2021
2 effective ways to explain your predictions (Ep. 163)
Our Sponsor
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.

References
Fisher, Aaron, Cynthia Rudin, and Francesca Dominici. "Model Class Reliance: Variable importance measures for any machine learning model class, from the 'Rashomon' perspective." http://arxiv.org/abs/1801.01489 (2018).
Python SHAP https://github.com/slundberg/shap
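The permutation-based idea behind the Fisher, Rudin & Dominici paper above can be sketched in a few lines. This is an illustrative example only (dataset and model chosen arbitrarily, assuming scikit-learn is available), not code from the episode:

```python
# Illustrative sketch of permutation importance, the model-agnostic idea
# behind the "Model Class Reliance" paper referenced above.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Shuffle one feature at a time and measure how much the test score drops:
# a large drop means the model relies heavily on that feature.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1][:3]:
    print(f"feature {idx}: importance {result.importances_mean[idx]:.3f}")
```

SHAP, the second approach referenced above, follows a similar spirit but attributes each individual prediction to the features via Shapley values rather than global score drops.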
Tuesday Jun 22, 2021
A simple trick for very unbalanced data (Ep. 157)
Data from the real world are never perfectly balanced. In this episode I explain a simple yet effective trick to train models with very unbalanced data. Enjoy the show!
Sponsors
Get one of the best VPNs at a massive discount with coupon code DATASCIENCE. It gives you an 83% discount, which unlocks the best price on the market, plus 3 extra months for free. Here is the link: https://surfshark.deals/DATASCIENCE
References
Leo Breiman, Random Forests, 2001
C. Chen, A. Liaw, L. Breiman, Using Random Forest to Learn Imbalanced Data (2004)
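The Chen, Liaw & Breiman paper above proposes balancing classes inside the random forest itself, either by resampling or by class weights. Below is a minimal sketch of the weighted variant using scikit-learn; the episode's exact trick may differ, and the dataset here is synthetic and purely illustrative:

```python
# Hedged sketch: class weighting in a random forest for a 98/2 imbalanced
# problem, in the spirit of Chen, Liaw & Breiman (2004).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=5000, weights=[0.98, 0.02], flip_y=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# "balanced_subsample" reweights classes within each tree's bootstrap sample,
# so the minority class is no longer drowned out by the majority.
clf = RandomForestClassifier(
    class_weight="balanced_subsample", random_state=0
).fit(X_train, y_train)

print("minority-class recall:", recall_score(y_test, clf.predict(X_test)))
```

With very skewed data, accuracy alone is misleading (a classifier predicting only the majority class is ~98% accurate here), which is why the sketch reports minority-class recall instead.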
Sunday Apr 19, 2020
Why average can get your predictions very wrong (ep. 102)
Whenever people reason about the probability of events, they tend to consider average values between two extremes. In this episode I explain why such a way of approximating is wrong and dangerous, with a numerical example.
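The episode's own numerical example is not reproduced here, but the pitfall (often called the "flaw of averages") can be illustrated with a hypothetical capacity-planning scenario: plugging the average input into a nonlinear function is not the same as averaging the function's outputs.

```python
# Hypothetical illustration: demand is either 0 or 200 units with equal
# probability, and we can serve at most 100 units.
import random

random.seed(0)
capacity = 100
demand = [random.choice([0, 200]) for _ in range(10_000)]

def served(d):
    # Nonlinear: we can never serve more than our capacity.
    return min(d, capacity)

avg_demand = sum(demand) / len(demand)           # close to 100
plan_on_average = served(avg_demand)             # suggests ~100 units served
actual_average = sum(served(d) for d in demand) / len(demand)  # close to 50

print(f"planning on the average predicts ~{plan_on_average:.0f} units served")
print(f"the true average served is ~{actual_average:.0f} units")
```

Reasoning with the average demand predicts roughly double the throughput that is actually achievable, because the "extreme" days (zero demand) are averaged away before the nonlinearity is applied.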
We are moving our community to Slack. See you there!
Saturday Dec 28, 2019
The dark side of AI: bias in the machine (Ep. 92)
This is the fourth and last episode of the mini-series "The dark side of AI". I am your host Francesco, and I'm with Chiara Tonini from London. The title of today's episode is "Bias in the machine".

C: Francesco, today we are starting with an infuriating discussion. Are you ready to be angry?

F: Yeah, sure. Is this about Brexit?

C: No, I don't talk about that. In 1986, New York City's Rockefeller University conducted a study on breast and uterine cancers and their link to obesity. Like in all clinical trials up to that point, the subjects of the study were all men. So Francesco, do you see a problem with this approach?

F: No problem at all, as long as those men had a perfectly healthy uterus.

C: In medicine, up to the end of the 20th century, medical studies and clinical trials were conducted on men, and medicine dosages and therapies were calculated on men (white men). The female body has historically been considered an exception to, or variation of, the male body.

F: Like Eve coming from Adam's rib. I thought we were past that...

C: When the female body has been under analysis, the focus was on the difference between it and the male body, the so-called "bikini approach": the reproductive organs are different, therefore we study those, and those only. For a long time medicine assumed this was the only difference.

F: Oh good...

C: This has led to a hugely harmful fallout across society. Because women had reproductive organs, they should reproduce, and all else about them was deemed uninteresting. Still today, a woman without children is somehow considered to have betrayed her biological destiny. This somehow does not apply to a man without children, who also has reproductive organs.

F: So this is an example of a very specific type of bias in medicine, regarding clinical trials and medical studies, that is not only harmful for the purposes of these studies, but has ripple effects in all of society.

C: Only in the 2010s did a serious conversation start about the damage caused by not including women in clinical trials. There are many, many examples (which we list in the references for this episode).

F: Give me one.

C: Researchers consider cardiovascular disease a male disease - they even call it "the widower". They conduct studies on male samples. But it turns out the symptoms of a heart attack, especially the ones leading up to one, are different in women. This has led to doctors not recognising, or dismissing, the early symptoms in women.

F: I was reading that women are also subject to chronic pain much more than men: for example migraines, and pain related to endometriosis. But there is extensive evidence now of doctors dismissing women's pain as either imaginary or "inevitable", as if it were a normal state of being that does not need a cure at all.

The failure of the medical community as a whole to recognise this obvious bias up to the 21st century is an example of how insidious the problem of bias is.

There are 3 fundamental types of bias:

One: stochastic drift. You train your model on a dataset and validate it on a split of the training set. When you then apply the model out in the world, you systematically add bias to the predictions, because the training data was too specific.
Two: the bias in the model, introduced by your choice of the model's parameters.
Three: the bias in your training sample. People put training samples together, and people have culture, experience, and prejudice. As we will see today, this is the most dangerous and subtle bias, and it is the one we'll talk about.

Bias is a warping of our understanding of reality. We see reality through the lens of our experience and our culture. The origin of bias can date back to traditions going back centuries, and it is so ingrained in our way of thinking that we don't even see it anymore.

F: And let me add, when it comes to machine learning, we see reality through the lens of data. Bias is everywhere, and we could spend hours and hours talking about it. It's complicated.

C: It's about to become more complicated.

F: Of course, if I know you...

C: Let's throw artificial intelligence into the mix.

F: You know, there was a happier time when this sentence didn't fill me with a sense of dread...

C: ImageNet is an online database of over 14 million photos, compiled more than a decade ago at Stanford University. It has been used to train machine learning algorithms for image recognition and computer vision, and played an important role in the rise of deep learning. We've all played with it, right? The cats and dogs classifier when learning TensorFlow? (I am a dog, by the way.)

F: ImageNet has been a critical asset for computer-vision research. There was an annual international competition to create algorithms that could most accurately label subsets of images. In 2012, a team from the University of Toronto used a convolutional neural network to handily win the top prize. That moment is widely considered a turning point in the development of contemporary AI. The final year of the ImageNet competition was 2017, and accuracy in classifying objects in the limited subset had risen from 71% to 97%. But that subset did not include the "Person" category, where the accuracy was much lower... ImageNet contained photos of thousands of people, with labels. These included straightforward tags like "teacher", "dancer" and "plumber", as well as highly charged labels like "failure, loser" and "slut, slovenly woman, trollop".

F: Uh oh.

C: Then "ImageNet Roulette" was created by an artist called Trevor Paglen and a Microsoft researcher named Kate Crawford. It was a digital art project where you could upload your photo and let the classifier identify you, based on the labels of the database. Imagine how well that went.

F: I bet it didn't work.

C: Of course it didn't work. Random people were classified as "orphans", or "non-smoker", or "alcoholic". Somebody with glasses was a "nerd". Tabong Kima, a 24-year-old African American, was classified as "offender" and "wrongdoer".

F: And there it is.

C: Quote from Trevor Paglen: "We want to show how layers of bias and racism and misogyny move from one system to the next. The point is to let people see the work that is being done behind the scenes, to see how we are being processed and categorized all the time."

F: The ImageNet labels were applied by thousands of unknown people, most likely in the United States, hired by the team from Stanford and working through the crowdsourcing service Amazon Mechanical Turk. They earned pennies for each photo they labeled, churning through hundreds of labels an hour. The labels were not verified in any way: if a labeler thought someone looked "shady", that label is just a result of their prejudice and has no basis in reality. As they worked, biases were baked into the database. Paglen again: "The way we classify images is a product of our worldview. Any kind of classification system is always going to reflect the values of the person doing the classifying." They defined what a "loser" looked like. And a "slut". And a "wrongdoer".

F: The labels originally came from another sprawling collection of data called WordNet, a kind of conceptual dictionary for machines built by researchers at Princeton University in the 1980s. But with these inflammatory labels included, the Stanford researchers may not have realized what they were doing.

C: What is happening here is the transfer of bias from one system to the next.

Tech jobs, in past decades but still today, predominantly go to white males from a narrow social class. Inevitably, they imprint the technology with their worldview. So their algorithms learn that a person of color is a criminal, and a woman with a certain look is a slut.

I'm not saying they do it on purpose, but the lack of diversity in the tech industry translates into a narrower world view, which has real consequences for the quality of AI systems.

F: Diversity in tech teams is often framed as an equality issue (which of course it is), but it also brings enormous advantages: it creates a cognitive diversity that is reflected in superior products and services. I believe this is an ongoing problem. In recent months, researchers have shown that face-recognition services from companies like Amazon, Microsoft and IBM can be biased against women and people of color.

Crawford and Paglen argue this: "In many narratives around AI it is assumed that ongoing technical improvements will resolve all problems and limitations. But what if the opposite is true? What if the challenge of getting computers to 'describe what they see' will always be a problem? The automated interpretation of images is an inherently social and political project, rather than a purely technical one. Understanding the politics within AI systems matters more than ever, as they are quickly moving into the architecture of social institutions: deciding whom to interview for a job, which students are paying attention in class, which suspects to arrest, and much else."

F: You are using the words "interpretation of images" here, as opposed to "description" or "classification". Certain images depict something concrete, with an objective reality. Like an apple. But other images... not so much?

ImageNet contains images corresponding only to nouns (not verbs, for example). Noun categories such as "apple" are well defined. But not all nouns are created equal. Linguist George Lakoff points out that the concept of an "apple" is more nouny than the concept of "light", which in turn is more nouny than a concept such as "health". Nouns occupy various places on an axis from concrete to abstract, and from descriptive to judgmental, and the images corresponding to these nouns become more and more ambiguous. These gradients have been erased in the logic of ImageNet. Everything is flattened out and pinned to a label. The results can be problematic, illogical, and cruel, especially when it comes to labels applied to people.

F: So when an image is interpreted as "Drug Addict", "Crazy", "Hypocrite", "Spinster", "Schizophrenic", "Mulatto", "Red Neck"... this is not an objective description of reality; it's somebody's worldview coming to the surface. The selection of images for these categories skews the meaning in ways that are gendered, racialized, ableist, and ageist. ImageNet is an object lesson in what happens when people are categorized like objects. And this practice has only become more common in recent years, often inside the big AI companies, where there is no way for outsiders to see how images are being ordered and classified.

The bizarre thing about these systems is that they are reminiscent of early 20th-century criminologists like Lombroso, of phrenologists (including Nazi scientists), and of physiognomy in general: a discipline founded on the assumption that there is a relationship between an image of a person and the character of that person. If you are a murderer, or a Jew, the shape of your head, for instance, will tell.

F: In reaction to these ideas, René Magritte produced his famous painting of a pipe with the caption "This is not a pipe".

You know that famous photograph of the soldier kissing the nurse at the end of the Second World War? The nurse went public about it when she was around 90 years old, and told how this total stranger in the street had grabbed her and kissed her. It is a picture of sexual harassment. And knowing that, it does not seem romantic anymore.

F: Not romantic at all, indeed.

C: Images do not describe themselves. This is a feature that artists have explored for centuries. We see those images differently when we see how they're labeled. The correspondence between image, label, and referent is fluid. What's more, those relations can change over time as the cultural context of an image shifts, and can mean different things depending on who looks, and where they are located. Images are open to interpretation and reinterpretation. Entire subfields of philosophy, art history, and media theory are dedicated to teasing out all the nuances of the unstable relationship between images and meanings. The common mythos of AI, and of the data it draws on, is that they are objectively and scientifically classifying the world. But it's not true: everywhere there is politics, ideology, prejudice, and all of the subjective stuff of history.

F: When we survey the most widely used training sets, we find that this is the rule rather than the exception. Training sets are the foundation on which contemporary machine-learning systems are built. They are central to how AI systems recognize and interpret the world. By looking at the construction of these training sets and their underlying structures, we discover many unquestioned assumptions that are shaky and skewed. These assumptions inform the way AI systems work, and fail, to this day. And the impenetrability of the algorithms, the impossibility of reconstructing the decision-making of a neural network, hides the bias further from scrutiny. When an algorithm is a black box and you can't look inside, you have no way of analysing its bias.

And the skewness and bias of these algorithms have real effects in society as AI is used more and more in the judicial system, in medicine, in the job market, in security systems based on facial recognition; the list goes on and on.

Last year Google unveiled BERT (Bidirectional Encoder Representations from Transformers). It's an AI system that learns to talk: a Natural Language Processing engine that generates written (or spoken) language.

F: We have an episode in which we explain all that.

They trained it on lots and lots of digitized information, as varied as old books, Wikipedia entries and news articles. Decades and even centuries of biases (along with a few new ones) were baked into all that material. So, for instance, BERT is extremely sexist: it associates almost all professions and positive attributes with being male (except for "mom").

BERT is widely used in industry and academia. For example, it can interpret news headlines automatically. Even Google's search engine uses it.

Try googling "CEO", and you get back a gallery of images of old white men.

F: Such a pervasive and flawed AI system can propagate inequality at scale. And it's super dangerous because it's subtle. Especially in industry, query results will not be tested and examined for bias. AI is a black box, and researchers take results at face value.

There are many cases of algorithm-based discrimination in the job market. Targeting candidates for tech jobs, for instance, may be done by algorithms that do not recognise women as potential candidates; therefore women are not exposed to as many job ads as men. Or automated HR systems will rank them lower (for the same CV) and screen them out.

In the US, algorithms are used to calculate bail. The majority of the prison population in the US is composed of people of colour, as a result of a systemic bias that goes back centuries. An algorithm learns that a person of colour is more likely to commit a crime, more likely to be unable to afford bail, and more likely to violate parole. Therefore, people of colour receive harsher punishments for the same crime, and the algorithm amplifies this inequality at scale.

Conclusion

Question everything, and never take the predictions of your models at face value. Always question how your training samples have been put together, who put them together, when, and in what context. Always remember that your model produces an interpretation of reality, not a faithful depiction. Treat reality responsibly.

Thursday Oct 10, 2019
In this episode I have an amazing conversation with Jimmy Soni and Rob Goodman, authors of "A Mind at Play", a book entirely dedicated to the life and achievements of Claude Shannon. Claude Shannon does not need any introduction. But for those who need a refresher: Shannon is the inventor of the information age.
Have you heard of binary code, entropy in information theory, data compression theory (the stuff behind mp3, mpg, zip, etc.), error correcting codes (the stuff that makes your RAM work well), n-grams, block ciphers, the beta distribution, the uncertainty coefficient?
All that stuff was invented by Claude Shannon :)
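As a tiny taste of one of those inventions, Shannon entropy measures the average information content of a random outcome in bits; a minimal sketch:

```python
# Shannon entropy H = -sum(p * log2(p)): the average number of bits
# needed per outcome of a random source.
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # a fair coin carries exactly 1 bit
print(entropy([0.9, 0.1]))   # a biased coin carries less than 1 bit
print(entropy([0.25] * 4))   # a fair 4-sided die carries 2 bits
```

The more predictable the source, the lower the entropy, which is precisely why predictable data compresses well.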
Articles:
https://medium.com/the-mission/10-000-hours-with-claude-shannon-12-lessons-on-life-and-learning-from-a-genius-e8b9297bee8f
https://medium.com/the-mission/on-claude-shannons-103rd-birthday-here-are-103-memorable-claude-shannon-quotes-maxims-and-843de4c716cf?source=your_stories_page---------------------------
http://nautil.us/issue/51/limits/how-information-got-re_invented
http://nautil.us/issue/50/emergence/claude-shannon-the-las-vegas-cheat
Claude's papers:
https://medium.com/the-mission/a-genius-explains-how-to-be-creative-claude-shannons-long-lost-1952-speech-fbbcb2ebe07f
http://www.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf
A Mind at Play (book links):
http://amzn.to/2pasLMz -- Hardcover
https://amzn.to/2oCfVL0 -- Audio
Tuesday Aug 27, 2019
[RB] Validate neural networks without data with Dr. Charles Martin (Ep. 74)
In this episode, I am with Dr. Charles Martin from Calculation Consulting, a machine learning and data science consulting company based in San Francisco. We speak about the nuts and bolts of deep neural networks and some impressive findings about the way they work.
The questions that Charles answers in the show are essentially two:
Why is regularisation in deep learning seemingly quite different from regularisation in other areas of ML?
How can we dominate DNNs in a theoretically principled way?

References
The WeightWatcher tool for predicting the accuracy of Deep Neural Networks: https://github.com/CalculatedContent/WeightWatcher
Slack channel: https://weightwatcherai.slack.com/
Dr. Charles Martin's blog: http://calculatedcontent.com and channel: https://www.youtube.com/c/calculationconsulting
Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning - Charles H. Martin, Michael W. Mahoney