Data Science at Home
Episodes
Wednesday Aug 21, 2019
How to cluster tabular data with Markov Clustering (Ep. 73)
In this episode I explain how a community detection algorithm known as Markov clustering can be constructed by combining simple concepts like random walks, graphs, and similarity matrices. I also show how one can build a similarity graph from tabular data and then run a community detection algorithm on that graph to find clusters.
You can find a simple hands-on code snippet to play with on the Amethix Blog
Enjoy the show!
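The full hands-on snippet lives on the Amethix blog; as a stand-alone illustration, here is a minimal NumPy sketch of the procedure described above — build an RBF similarity graph from tabular data, then alternate the expansion and inflation steps of Markov clustering. The kernel width, inflation parameter, and iteration count are arbitrary choices of mine:

```python
import numpy as np

def mcl(S, inflation=2.0, iters=50, tol=1e-6):
    """Markov Clustering on a similarity matrix S: alternate expansion and inflation."""
    M = S / S.sum(axis=0)          # column-normalize into a random-walk transition matrix
    for _ in range(iters):
        M = M @ M                  # expansion: take two steps of the random walk
        M = M ** inflation         # inflation: strengthen strong flows, weaken weak ones
        M = M / M.sum(axis=0)      # re-normalize columns
    # at convergence, attractor rows (nonzero diagonal) list the members of each cluster
    clusters = []
    for row in np.where(M.diagonal() > tol)[0]:
        members = frozenset(np.where(M[row] > tol)[0])
        if members not in clusters:
            clusters.append(members)
    return clusters

# toy tabular data: two well-separated groups of points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(8, 0.3, (10, 2))])
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
S = np.exp(-D**2 / 2.0)                               # RBF similarity graph
clusters = mcl(S)  # the two groups end up in separate clusters
```

The inflation exponent controls cluster granularity: larger values break the graph into more, smaller clusters.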
References
[1] S. Fortunato, “Community detection in graphs”, Physics Reports, volume 486, issues 3-5, pages 75-174, February 2010.
[2] Z. Yang, et al., “A Comparative Analysis of Community Detection Algorithms on Artificial Networks”, Scientific Reports volume 6, Article number: 30750 (2016)
[3] S. van Dongen, “A cluster algorithm for graphs”, Technical Report, CWI (Centre for Mathematics and Computer Science), Amsterdam, The Netherlands, 2000.
[4] A. J. Enright, et al., “An efficient algorithm for large-scale detection of protein families”, Nucleic Acids Research, volume 30, issue 7, pages 1575-1584, 2002.
Tuesday Jul 23, 2019
Validate neural networks without data with Dr. Charles Martin (Ep. 70)
In this episode, I am with Dr. Charles Martin from Calculation Consulting, a machine learning and data science consulting company based in San Francisco. We speak about the nuts and bolts of deep neural networks and some impressive findings about the way they work.
The questions that Charles answers in the show are essentially two:
Why is regularisation in deep learning seemingly quite different than regularisation in other areas of ML?
How can we dominate DNNs in a theoretically principled way?
References
The WeightWatcher tool for predicting the accuracy of Deep Neural Networks https://github.com/CalculatedContent/WeightWatcher
Slack channel https://weightwatcherai.slack.com/
Dr. Charles Martin Blog http://calculatedcontent.com and channel https://www.youtube.com/c/calculationconsulting
Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning - Charles H. Martin, Michael W. Mahoney
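One of the paper's core measurements can be reproduced in a few lines of NumPy: compute the eigenvalue spectrum of a layer's correlation matrix and fit a power-law exponent to its tail. The maximum-likelihood estimator and the random stand-in matrix below are my own sketch, not WeightWatcher's actual implementation:

```python
import numpy as np

def powerlaw_alpha(values, x_min):
    """Maximum-likelihood exponent alpha for p(x) ~ x**(-alpha), x >= x_min."""
    tail = values[values >= x_min]
    return 1.0 + len(tail) / float(np.sum(np.log(tail / x_min)))

# in practice W would be a trained layer's weight matrix; a random one stands in here
rng = np.random.default_rng(0)
W = rng.normal(size=(300, 100))
eigs = np.linalg.svd(W, compute_uv=False) ** 2 / W.shape[0]  # spectrum of W.T @ W / N
alpha = powerlaw_alpha(eigs, x_min=np.quantile(eigs, 0.5))   # fit the upper tail
```

Note that a random Gaussian matrix like the one above follows the Marchenko–Pastur law rather than a power law; the heavy, power-law tails appear in well-trained layers, which is precisely the paper's finding.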
Tuesday May 21, 2019
Episode 61: The 4 best use cases of entropy in machine learning
It all starts from physics. The entropy of an isolated system never decreases… everyone, at some point, learned this in a physics class. What does this have to do with machine learning? To find out, listen to the show.
References
Entropy in machine learning https://amethix.com/entropy-in-machine-learning/
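As a quick refresher on how entropy actually gets computed in an ML setting, here is a short sketch (the function names are my own) of Shannon entropy and the information gain used to score decision-tree splits:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a discrete label sequence."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)

# a fair 50/50 label mix carries exactly one bit of entropy,
# and a split that separates the classes perfectly recovers that full bit
```

A pure set has zero entropy, a balanced binary set has one bit, and a tree-building algorithm greedily picks the split with the highest information gain.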
Wednesday Jan 23, 2019
Episode 53: Estimating uncertainty with neural networks
Have you ever wanted to get an estimate of the uncertainty of your neural network? Clearly Bayesian modelling provides a solid framework to estimate uncertainty by design. However, there are many realistic cases in which Bayesian sampling is not really an option and ensemble models can play a role.
In this episode I describe a simple yet effective way to estimate uncertainty, without changing your neural network’s architecture or your machine learning pipeline at all.
The post with mathematical background and sample source code is published here.
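The episode's exact method is in the linked post; as a stand-in illustration of the ensemble principle, here is a bootstrap ensemble of simple polynomial models (my own toy setup, not the post's code). The spread of the members' predictions serves as the uncertainty estimate, and it grows as queries move away from the training data:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy 1-D regression problem
X = rng.uniform(-3, 3, size=100)
y = np.sin(X) + rng.normal(0, 0.1, size=100)

# bootstrap ensemble: refit the same model on resampled data
# (cheap stand-in for an ensemble of networks; the principle is identical)
coeffs = []
for _ in range(30):
    idx = rng.integers(0, len(X), len(X))
    coeffs.append(np.polyfit(X[idx], y[idx], deg=3))

x_query = np.array([0.0, 6.0])   # inside vs. far outside the training range
preds = np.array([np.polyval(c, x_query) for c in coeffs])
pred_mean = preds.mean(axis=0)
pred_std = preds.std(axis=0)     # disagreement across members = uncertainty
```

Inside the training range the members agree and the standard deviation is small; at the extrapolation point they diverge, flagging the prediction as unreliable.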
Thursday Jan 17, 2019
Episode 52: Why do machine learning models fail? [RB]
The success of a machine learning model depends on several factors and events. True generalization to data that the model has never seen before is more a chimera than a reality. But under specific conditions a well-trained machine learning model can generalize well and achieve testing accuracy similar to its training accuracy.
In this episode I explain when and why machine learning models fail to generalize from training to testing data.
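The train/test gap can be made concrete with a toy experiment (my own illustration, not from the episode): fit a high-capacity and a low-capacity model to the same noisy data, then compare how much worse each does on held-out points than on its training points:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = np.sin(3 * x) + rng.normal(0, 0.2, 200)
x_tr, y_tr = x[:100], y[:100]        # training split
x_te, y_te = x[100:], y[100:]        # testing split

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

flexible = np.polyfit(x_tr, y_tr, deg=15)  # enough capacity to chase the noise
simple = np.polyfit(x_tr, y_tr, deg=3)     # constrained model

# generalization gap: test error minus training error
gap_flexible = mse(flexible, x_te, y_te) - mse(flexible, x_tr, y_tr)
gap_simple = mse(simple, x_te, y_te) - mse(simple, x_tr, y_tr)
```

The flexible model achieves the lower training error but the larger gap: it has memorized noise that does not transfer to the test set.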
Tuesday Sep 04, 2018
Episode 46: Why do machine learning models fail? (Part 2)
In this episode I continue the conversation from the previous one, about failing machine learning models.
When data scientists have access to the distributions of the training and testing datasets, it becomes relatively easy to assess whether a model will perform equally well on both. But what happens with private datasets, where no access to the data can be granted?
At fitchain we might have an answer to this fundamental problem.
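When the distributions are accessible, the check can be as simple as a per-feature two-sample Kolmogorov–Smirnov test. A sketch with synthetic data, where the shift is injected by hand into one feature:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(1000, 3))
test = rng.normal(0.0, 1.0, size=(1000, 3))
test[:, 2] += 1.5   # simulate covariate shift in the third feature

# compare each feature's train and test samples; a tiny p-value flags a shift
p_values = np.array([ks_2samp(train[:, j], test[:, j]).pvalue for j in range(3)])
```

A model trained on `train` would see inputs from a different distribution at test time for the shifted feature, which is exactly the failure mode discussed in the episode.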
Tuesday Aug 28, 2018
Episode 45: Why do machine learning models fail?
The success of a machine learning model depends on several factors and events. True generalization to data that the model has never seen before is more a chimera than a reality. But under specific conditions a well-trained machine learning model can generalize well and achieve testing accuracy similar to its training accuracy.
In this episode I explain when and why machine learning models fail to generalize from training to testing data.
Tuesday Oct 03, 2017
Episode 23: Why do ensemble methods work?
Ensemble methods have been designed to improve on the performance of a single model when that model is not very accurate. According to the general definition, ensembling consists of building a number of individual classifiers and then combining or aggregating their predictions into one classifier that is usually stronger than any single one.
The key idea behind ensembling is that some models capture certain aspects of the data well while others capture different aspects. In this episode I show, with a numeric example, why and when ensemble methods work.
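A numeric example in the spirit of the episode: if n independent classifiers are each correct with probability p > 0.5, the accuracy of their majority vote follows from the binomial distribution and exceeds p:

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that more than half of n independent classifiers,
    each correct with probability p, vote for the right answer."""
    k = n // 2 + 1   # minimum number of correct votes for a majority
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

acc = majority_vote_accuracy(0.7, 11)   # 11 classifiers, each 70% accurate: about 0.92
```

The independence assumption is the catch: ensembles of highly correlated models gain far less, which is why diversity among the members matters.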
Wednesday Mar 02, 2016
Episode 9: Markov Chain Monte Carlo with full conditionals
At some point, statistical problems need sampling. Sampling consists of generating observations from a specific probability distribution.
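A classic example of sampling with full conditionals, in the spirit of the episode title, is a Gibbs sampler for a bivariate normal. A minimal sketch, with an arbitrary correlation value:

```python
import numpy as np

# Gibbs sampling a standard bivariate normal with correlation rho, using its
# full conditionals: x | y ~ N(rho*y, 1 - rho^2) and y | x ~ N(rho*x, 1 - rho^2)
rng = np.random.default_rng(0)
rho = 0.8
sd = np.sqrt(1 - rho**2)
n_samples = 20000
samples = np.empty((n_samples, 2))
x = y = 0.0
for i in range(n_samples):
    x = rng.normal(rho * y, sd)   # draw x from its full conditional
    y = rng.normal(rho * x, sd)   # draw y from its full conditional
    samples[i] = x, y

chain = samples[1000:]            # discard burn-in
# the empirical correlation of the chain approaches rho
```

Each update conditions on the current value of the other variable, so only one-dimensional draws are ever needed — the appeal of full conditionals.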
Monday Feb 15, 2016
Episode 8: Frequentists and Bayesians
There are statisticians and there are data scientists… Among statisticians, there are some who just count. Some others who… think differently. In this show we explore the age-old dilemma between frequentists and Bayesians. Given a statistical problem, who’s going to be right?
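A coin-flip example makes the contrast concrete (my own illustration, not from the show): the frequentist reports the maximum-likelihood estimate of the bias, while the Bayesian reports a posterior mean under a prior:

```python
# estimating a coin's bias after observing h heads in n flips
h, n = 7, 10

# frequentist: maximum-likelihood estimate, just the observed frequency
mle = h / n

# Bayesian: uniform Beta(1, 1) prior, giving a Beta(a + h, b + n - h) posterior
a, b = 1, 1
posterior_mean = (a + h) / (a + b + n)   # Laplace's rule of succession
```

With 7 heads in 10 flips the frequentist says 0.7; the Bayesian says 8/12 ≈ 0.67, pulled toward the prior's 0.5. As n grows the two answers converge — the dilemma bites hardest with little data.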