Data Science at Home
Episodes
Wednesday Feb 15, 2023
[RB] Online learning is better than batch, right? Wrong! (Ep. 216)
In this episode I speak about online learning systems and why blindly choosing such a paradigm can lead to very unpredictable and expensive outcomes. Also in this episode, I have to deal with an intruder :)
 
 
Links
Birman, K.; Joseph, T. (1987). "Exploiting virtual synchrony in distributed systems". Proceedings of the Eleventh ACM Symposium on Operating Systems Principles - SOSP '87. pp. 123–138. doi:10.1145/41457.37515. ISBN 089791242X. S2CID 7739589.
 
Tuesday Dec 13, 2022
Edge AI applications for military and space [RB] (Ep. 213)
Our Sponsors
NordPass Business has developed a password manager that will save you a lot of time and energy whenever you need access to business accounts, work across devices, even with the other members of your team, or whenever you need to share sensitive data with your colleagues or make payments efficiently. All this with the highest standard of cyber-secure technology.
See NordPass Business in action now with a 3-month free trial here: https://nordpass.com/DATASCIENCE with code DATASCIENCE
 
 
Amethix works to create and maximize the impact of the world’s leading corporations and startups, so they can create a better future for everyone they serve. We provide solutions in AI/ML, Fintech, Healthcare/RWE, and Predictive maintenance.
 

Tuesday Nov 08, 2022
Evolution of data platforms (Ep. 209)
Let's look at the history of data platforms. How did they evolve? Why? Shall I switch to the latest architecture? Enjoy the show!
 
Our Sponsors
Explore the Complex World of Regulations. Compliance can be overwhelming. Multiple frameworks. Overlapping requirements. Let Arctic Wolf be your guide. Check it out at https://arcticwolf.com/datascience
 
Amethix works to create and maximize the impact of the world’s leading corporations and startups, so they can create a better future for everyone they serve. We provide solutions in AI/ML, Fintech, Healthcare/RWE, and Predictive maintenance.
Wednesday Nov 02, 2022
[RB] Is studying AI in academia a waste of time? (Ep. 208)
Companies and other business entities are actively involved in defining data products and applied research every year. Academia has always played a role in creating new methods and solutions/algorithms in the fields of machine learning and artificial intelligence. However, there is doubt about how powerful and effective such research efforts are. Is studying AI in academia a waste of time?
 
Our Sponsors
Ready to advance your career in data science? University of Cincinnati Online offers nationally recognized educational programs in business analytics and information systems. Predictive Analytics Today named UC as the No.1 MS Data Science school in the country and is nationally recognized with a proven track record of placing students at high-profile companies such as Google, Amazon and P&G. Discover more about the University of Cincinnati’s 100% online master’s degree programs at online.uc.edu/obais 
 
Amethix works to create and maximize the impact of the world’s leading corporations and startups, so they can create a better future for everyone they serve. We provide solutions in AI/ML, Fintech, Healthcare/RWE, and Predictive maintenance.
 

Tuesday Oct 25, 2022
Private machine learning done right (Ep. 207)
There are many solutions to private machine learning. I am pretty confident when I say that the one we are speaking about in this episode is probably one of the most feasible and reliable. I am with Daniel Huynh, CEO of Mithril Security, a graduate of Ecole Polytechnique with a specialisation in AI and data science. He worked at Microsoft on Privacy Enhancing Technologies under the office of the CTO of Microsoft France. He has written articles on homomorphic encryption with the CKKS Explained series (https://blog.openmined.org/ckks-explained-part-1-simple-encoding-and-decoding/). He is now focusing on Confidential Computing at Mithril Security and has written extensive articles on the topic: https://blog.mithrilsecurity.io/.
In this show we speak about confidential computing, SGX and private machine learning.
 
References
Mithril Security: https://www.mithrilsecurity.io/ 
BlindAI GitHub: https://github.com/mithril-security/blindai
Use cases for BlindAI:
Deploy Transformers models with confidentiality: https://blog.mithrilsecurity.io/transformers-with-confidentiality/
Confidential medical image analysis with COVID-Net and BlindAI: https://blog.mithrilsecurity.io/confidential-covidnet-with-blindai/ 
Build a privacy-by-design voice assistant with BlindAI: https://blog.mithrilsecurity.io/privacy-voice-ai-with-blindai/ 
Confidential Computing Explained: https://blog.mithrilsecurity.io/confidential-computing-explained-part-1-introduction/ 
Confidential Computing Consortium: https://confidentialcomputing.io/ 
Confidential Computing White Papers: https://confidentialcomputing.io/white-papers-reports/ 
List of Intel processors with Intel SGX:
https://www.intel.com/content/www/us/en/support/articles/000028173/processors.html
https://github.com/ayeks/SGX-hardware 
Azure Confidential Computing VMs with SGX:
Azure Docs: https://docs.microsoft.com/en-us/azure/confidential-computing/confidential-computing-enclaves
How to deploy BlindAI on Azure: https://docs.mithrilsecurity.io/getting-started/cloud-deployment/azure-dcsv3 
Confidential Computing 101: https://www.youtube.com/watch?v=77U12Ss38Zc 
Rust: https://www.rust-lang.org/ 
ONNX: https://github.com/onnx/onnx 
Tract, a Rust inference engine for ONNX models: https://github.com/sonos/tract 
 

Saturday Oct 15, 2022
Edge AI for applications in military and space (Ep. 206)
Our Sponsors
Ready to advance your career in data science? University of Cincinnati Online offers nationally recognized educational programs in business analytics and information systems. Predictive Analytics Today named UC as the No.1 MS Data Science school in the country and is nationally recognized with a proven track record of placing students at high-profile companies such as Google, Amazon and P&G. Discover more about the University of Cincinnati’s 100% online master’s degree programs at online.uc.edu/obais 
 
Amethix works to create and maximize the impact of the world’s leading corporations and startups, so they can create a better future for everyone they serve. We provide solutions in AI/ML, Fintech, Healthcare/RWE, and Predictive maintenance.
Wednesday Oct 05, 2022
[RB] What are generalist agents and why they can change the AI game (Ep. 205)
That deep learning alone is not sufficient to achieve artificial general intelligence is a more and more accepted statement.
Generalist agents have great properties that can overcome some of the limitations of single-task deep learning models.
Be aware, we are still far from AGI, though.
 
So what are generalist agents?
 
References
https://arxiv.org/pdf/2205.06175
 
 

Wednesday Sep 28, 2022
LIDAR, cameras and autonomous vehicles (Ep. 204)
How does an autonomous vehicle see? How does it sense the road? They are equipped with many sensors, of course. Are they all powerful enough? Small enough to hide them and make your car look beautiful? In this episode I speak about LIDAR, high-resolution cameras and some machine learning methods adapted to a minimal number of sensors.
 
Our Sponsors
Ready to advance your career in data science? University of Cincinnati Online offers nationally recognized educational programs in business analytics and information systems. Predictive Analytics Today named UC as the No.1 MS Data Science school in the country and is nationally recognized with a proven track record of placing students at high-profile companies such as Google, Amazon and P&G. Discover more about the University of Cincinnati’s 100% online master’s degree programs at online.uc.edu/obais 
 
Amethix works to create and maximize the impact of the world’s leading corporations and startups, so they can create a better future for everyone they serve. We provide solutions in AI/ML, Fintech, Healthcare/RWE, and Predictive maintenance.
 
References
https://patents.google.com/patent/US20220043449A1/en?oq=20220043449
 
 

Tuesday Sep 20, 2022
Predicting Out Of Memory Kill events with Machine Learning (Ep. 203)
Sometimes applications crash. Some other times applications crash because memory is exhausted. Such issues exist because of bugs in the code, or heavy memory usage for reasons that were not expected during design and implementation. Can we use machine learning to predict and eventually detect out of memory kills from the operating system?
Apparently, the Netflix app many of us use on a daily basis leverages ML and time series analysis to prevent OOM kills.
Enjoy the show!
Our Sponsors
Explore the Complex World of Regulations. Compliance can be overwhelming. Multiple frameworks. Overlapping requirements. Let Arctic Wolf be your guide. Check it out at https://arcticwolf.com/datascience
 
Amethix works to create and maximize the impact of the world’s leading corporations and startups, so they can create a better future for everyone they serve. We provide solutions in AI/ML, Fintech, Healthcare/RWE, and Predictive maintenance.
 
Transcript
And here we are again with season four of the Data Science at Home podcast. This time we have something for you: if you want to help us shape the data science leaders of the future, we have created the Data Science at Home Ambassador program. Ambassadors are volunteers who are passionate about data science and want to give back to our growing community of data science professionals and enthusiasts. You will be instrumental in helping us achieve our goal of raising awareness about the critical role of data science in cutting-edge technologies. If you want to learn more about this program, visit the Ambassadors page on our website at datascienceathome.com.
Welcome back to another episode of the Data Science at Home podcast. I'm Francesco, podcasting from the regular office of Amethix Technologies, based in Belgium. In this episode, I want to speak about a machine learning problem that has been formulated at Netflix. And for the record, Netflix is not sponsoring this episode. I believe that this is a very well-known problem, a very common one across sectors: how to predict an out-of-memory kill in an application, and how to formulate this as a machine learning problem. So this is very interesting, not just because of Netflix, but because it allows me to explain a few points that, as I said, are kind of invariant across sectors. Regardless of whether your application is a video streaming application, any other communication type of application, a fintech application, energy, or whatever, the out-of-memory kill still occurs.
And what is an out-of-memory kill? Well, it's essentially the extreme event in which the machine doesn't have any memory left. Usually the operating system can start swapping, which means using the SSD or the hard drive as a source of memory. But that, of course, will slow things down a lot. And eventually, when there is a bug or a memory leak, or if there are other applications running on the same machine, there is some kind of limiting factor, something that comes from the operating system most of the time, that kills the application in order to prevent it from monopolizing the entire machine, the hardware of the machine. And so this is a very important problem.
Also, it is important to have an episode about this because there are some strategies that they've used at Netflix that are pretty much in line with what I believe machine learning should be about. Usually people would go for the fancy solution, like extremely accurate predictors or machine learning models with a massive number of parameters that try to figure out whatever is happening on the machine that is running the application, while the solution at Netflix is pretty straightforward, pretty simple. And so one would say, then why make an episode about this? Well, because I think that we need more sobriety when it comes to machine learning, and I believe we still need to spend a lot of time thinking about what data to collect, reasoning about what the problem at hand is and what data can actually feed the particular machine learning model, and then of course move to the actual prediction, that is, the actual model. Most of the time it doesn't need to be one of these super fancy things that you see in the news around chatbots or autonomous gaming agents or drivers and so on and so forth.
So there are essentially two data sets that the people at Netflix focus on, which are consistently different, dramatically different in fact. These are data about device characteristics and capabilities and, of course, data that are collected at runtime and that give you a picture of what's going on in the memory of the device. That's the so-called runtime memory data and out-of-memory kills. The first type of data I would consider very static, because it includes, for example, the device type ID, the version of the software development kit that the application is running, cache capacities, buffer capacities and so on and so forth. It's something that most of the time doesn't change across sessions, and that's why it's considered static.
In contrast, the other type of data, the runtime memory data, is, as the name says, collected at runtime. It varies across the life of the session, so it's very dynamic data. Examples of these records are profile, movie details, playback information, current memory usage, et cetera. This is the data that actually moves, in the sense that it changes depending on how the user is actually using the Netflix application, what profile or what movie detail has been loaded, and so on and so forth.
So the first challenge that the people at Netflix had to deal with was how to combine these two things: very static and usually small tables versus very dynamic and usually large tables or views. Well, there is some sort of join on a key that is performed by the people at Netflix in order to put together these different data resolutions, which is data about the same phenomenon but from different sources and carrying very different signals. The device capabilities are usually captured by the static data, while the other data, the runtime memory and out-of-memory kill data, describe pretty accurately how the user is using that particular application on that particular hardware.
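
To make the join concrete, here is a minimal sketch in plain Python. The field names (`device_type_id`, `sdk_version`, `buffer_mb`, `memory_mb`, `oom_kill`) and the sample values are assumptions for illustration only, not the actual Netflix schema. It performs a left join, so runtime rows from an unknown device are kept with empty static fields.

```python
# Hypothetical static table: one small record per device type.
device_table = {
    "tv-a": {"sdk_version": "4.1", "buffer_mb": 64},
    "tv-b": {"sdk_version": "5.0", "buffer_mb": 128},
}

# Hypothetical runtime rows: many records per session, changing over time.
runtime_rows = [
    {"device_type_id": "tv-a", "t": 0, "memory_mb": 210.0, "oom_kill": False},
    {"device_type_id": "tv-a", "t": 60, "memory_mb": 480.0, "oom_kill": True},
    {"device_type_id": "tv-b", "t": 0, "memory_mb": 300.0, "oom_kill": False},
    {"device_type_id": "tv-x", "t": 0, "memory_mb": 150.0, "oom_kill": False},
]


def join_on_device(rows, devices):
    """Left join: attach the static device fields to every runtime row.

    Rows whose device type is missing from `devices` keep None for the
    static fields, so missing data stays visible downstream.
    """
    static_fields = sorted({name for rec in devices.values() for name in rec})
    joined = []
    for row in rows:
        static = devices.get(row["device_type_id"], {})
        merged = dict(row)
        for name in static_fields:
            merged[name] = static.get(name)
        joined.append(merged)
    return joined


joined = join_on_device(runtime_rows, device_table)
```

Keeping the unmatched rows, rather than dropping them, leaves the decision about missing device data to the later data preparation step.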
Now of course, when it comes to data engineering, there is nothing new that people at Netflix have introduced: dealing with missing data, for example, or incorporating knowledge of devices. It's all stuff that is part of the so-called data cleaning and data collection strategy, or data preparation. That is, whatever you're going to do in order to make that data, or a combination of these data sources, compatible with the way your machine learning model will understand or read that data. If you think of a big data platform, the first challenge you have to deal with is, first of all, how to collect the right amount of information, the right data, but also how to transform this data for your particular big data platform. And that's, again, nothing new, nothing fancy, just basics, what we have been used to seeing for the last decade or more. That's exactly what they do.
And now let me tell you something important. Cybercriminals are evolving. Their techniques and tactics are more advanced, intricate and dangerous than ever before. Industries and governments around the world are fighting back, unveiling new regulations meant to better protect data against this rising threat. Today, the world of cybersecurity compliance is a complex one, and understanding the requirements your organization must adhere to can be a daunting task. But not when the pack has your back. Arctic Wolf, the leader in security operations, is on a mission to end cyber risk by giving organizations the protection, information and confidence they need to protect their people, technology and data. The new interactive compliance portal helps you discover the regulations in your region and industry and start the journey towards achieving and maintaining compliance. Visit arcticwolf.com/datascience to take your first step. That's arcticwolf.com/datascience.
I think that the most important part, though I think all the parts are actually equally important, is the way they treat runtime memory data and out-of-memory kill data, which is by using sliding windows. That's something that is really worth mentioning, because the way you would frame this problem is: something is happening at some point in time, and I have to kind of predict that event. That event is usually an outlier, in the sense that these events are quite rare, fortunately, because otherwise Netflix would not be as usable as we believe it is. So you would like to predict these weird events by looking at a historical view, a certain amount of records that you have before this particular event, which is the kill of the application. So the sliding window approach comes as the most natural thing anyone would do, and that's exactly what the researchers at Netflix have done. So, not unexpectedly in my opinion, they treated this problem as a time series, which is exactly what it is.
Now they, of course, use this sliding window with different horizons, five minutes, four minutes, two minutes, as close as possible to the event. Because maybe there are some, let's say, other dynamics that can arise when you are very close to the event or when you are very far from it. Five minutes away from the out-of-memory kill, you might have some other, let's say, patterns or shapes in the data. So for example, you might have a certain number of allocations that keep growing and growing, but they grow with a certain curve or a certain rate that you can measure when you are five to ten minutes away from the out-of-memory kill. When you are two minutes away from the out-of-memory kill, this trend will probably change. What you would expect is that the memory is already half or more saturated and therefore, for example, the operating system starts swapping, or other things are happening that you are going to measure in this time window. And that would give you a much better picture of what's going on in the, let's say, closest neighborhood of that event. The sliding window and time window approach is definitely worth mentioning because this is something that you can apply, if you think about it, pretty much anywhere, right?
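
As a rough illustration of the sliding window idea, the sketch below cuts a series of memory readings into overlapping windows and summarises each one. The readings, the window size and the three features (mean, max, growth per reading) are assumptions chosen for the example; the episode does not say which features Netflix actually computes.

```python
def sliding_windows(readings, size, step=1):
    """Return consecutive windows of `size` readings, advancing by `step`."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    last_start = len(readings) - size
    return [readings[i:i + size] for i in range(0, last_start + 1, step)]


def window_features(window):
    """Summarise one window: mean, max, and average growth per reading."""
    if len(window) > 1:
        growth = (window[-1] - window[0]) / (len(window) - 1)
    else:
        growth = 0.0
    return {
        "mean": sum(window) / len(window),
        "max": max(window),
        "growth": growth,
    }


# Hypothetical memory usage in MB, one reading per time step.
memory_mb = [100, 110, 125, 145, 170, 200, 240]

windows = sliding_windows(memory_mb, size=3)
features = [window_features(w) for w in windows]

for w, f in zip(windows, features):
    print(w, f)
```

In this made-up series the growth feature increases from window to window, which is the kind of changing trend the episode describes as the kill gets closer.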
What they did, in addition to having a time window, a sliding window, is assign different levels to memory readings that are closer to the out-of-memory kill. These levels get higher and higher as we get closer and closer to the out-of-memory kill. This means that, for example, for a five-minute window we would have a level one, where five minutes means five minutes away from the out-of-memory kill. Four minutes would be a level two, three minutes, which is much closer, would be a level three, and two minutes would be a level four. The level is kind of the severity of the event as we get closer and closer to the actual event, when the application is actually killed.
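
The level scheme can be sketched as a small labelling function. The mapping of five, four, three and two minutes to levels one to four comes from the episode. Two choices are my own assumptions: readings more than five minutes before the kill get level 0, and readings less than two minutes before the kill are treated like the two-minute bucket.

```python
import math

# From the episode: minutes before the kill -> level.
LEVEL_BY_MINUTE = {5: 1, 4: 2, 3: 3, 2: 4}


def level_for(seconds_before_kill):
    """Map the time distance to the kill onto a severity level."""
    if seconds_before_kill < 0:
        raise ValueError("reading is after the kill")
    minutes = math.ceil(seconds_before_kill / 60)
    minutes = max(minutes, 2)  # assumption: under 2 minutes counts as level 4
    return LEVEL_BY_MINUTE.get(minutes, 0)  # assumption: beyond 5 minutes is level 0


def label_readings(timestamps, kill_time):
    """Pair every reading taken at or before the kill with its level."""
    return [(t, level_for(kill_time - t)) for t in timestamps if t <= kill_time]


# Hypothetical session: readings in seconds, kill at t = 600 s.
labels = label_readings([0, 300, 330, 400, 450, 480, 590], kill_time=600)
print(labels)
```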
So looking at this approach, there is nothing new there. I would say even a not-so-seasoned data scientist would have understood that using a sliding window is the way to go. I'm not saying that Netflix engineers are not seasoned enough. Actually they do a great job every day to keep giving us a video streaming platform that never fails, or almost never fails. So spot on there, guys, good job. But looking at this sliding window approach, the direct consequence is that they can do some sort of graphical analysis of the out-of-memory kills versus the memory usage, which can give the reader or the data scientist a very nice picture of what's going on there. And I will definitely report some of the diagrams and graphs in the show notes of this episode on the official website, datascienceathome.com.
But essentially what you can see there is that there might be premature peaks at, let's say, a lower memory reading. Usually these are some kind of false positives or anomalies that should not be there. Then it's possible to set a threshold at which to start lowering the memory usage, because after that threshold something nasty can happen, and usually happens according to your data. Then of course there is another graph that shows a Gaussian distribution, or in fact no sharp peak at all, where out-of-memory kills are more or less normally distributed. And then of course there are the genuine peaks that indicate kills near, let's say, the threshold. Usually you would see most of the out-of-memory kills after that particular threshold of memory usage. Which makes sense, because given a particular device, which means a certain amount of memory, certain memory characteristics, a certain version of the SDK and so on and so forth, you can say: okay, for this device type I have this memory usage threshold, and I see a relatively high number of out-of-memory kills immediately after it. This means that probably that is the threshold you would like to consider as the critical threshold you should never, or almost never, cross.
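
One simple way to turn that picture into a number is sketched below. The episode describes reading the threshold off a graph; using a low percentile of the memory readings at kill time, per device type, is a stand-in I am assuming here so that a single premature peak does not drag the threshold down. The device names, readings and the 20th percentile are all made up for the example.

```python
import math
from collections import defaultdict


def nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def critical_thresholds(kill_events, pct=20):
    """Estimate one memory threshold per device type from (device, memory_mb) kills."""
    by_device = defaultdict(list)
    for device, memory_mb in kill_events:
        by_device[device].append(memory_mb)
    return {device: nearest_rank(mem, pct) for device, mem in by_device.items()}


# Hypothetical kills: tv-a has one premature kill at 120 MB, the rest cluster near 450+.
kill_events = [
    ("tv-a", 120.0), ("tv-a", 452.0), ("tv-a", 455.0), ("tv-a", 458.0),
    ("tv-a", 460.0), ("tv-a", 463.0), ("tv-a", 467.0), ("tv-a", 470.0),
    ("tv-a", 474.0), ("tv-a", 480.0),
    ("tv-b", 900.0), ("tv-b", 910.0), ("tv-b", 905.0), ("tv-b", 915.0),
    ("tv-b", 920.0),
]

thresholds = critical_thresholds(kill_events)
print(thresholds)
```

With these sample values the premature kill at 120 MB is skipped for `tv-a`, and each device type ends up with its own threshold.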
So once you have this picture in front of you, you can start thinking of implementing some mechanisms that monitor the memory usage and, of course, preemptively deallocate things or keep the memory usage as low as possible with respect to the critical threshold. You can start implementing some logic that prevents the application from being killed by the operating system, so that you would in fact reduce the rate of out-of-memory kills overall.
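
A toy version of such a mechanism could look like the following. The `MemoryGuard` class, the 80% safety margin and the oldest-first eviction policy are hypothetical choices for illustration; the episode does not describe how the Netflix app actually frees memory.

```python
class MemoryGuard:
    """Evict cached items once usage passes a soft limit below the critical threshold."""

    def __init__(self, critical_mb, safety_fraction=0.8):
        # Assumption: start freeing memory at 80% of the critical threshold.
        self.soft_limit_mb = critical_mb * safety_fraction
        self.cache = {}  # key -> size in MB; dict order doubles as age order

    def used_mb(self, baseline_mb):
        """Total usage: memory we cannot free plus everything in the cache."""
        return baseline_mb + sum(self.cache.values())

    def enforce(self, baseline_mb):
        """Drop the oldest cache entries until usage is back under the soft limit."""
        evicted = []
        while self.cache and self.used_mb(baseline_mb) > self.soft_limit_mb:
            oldest = next(iter(self.cache))
            del self.cache[oldest]
            evicted.append(oldest)
        return evicted


guard = MemoryGuard(critical_mb=500)
guard.cache.update({"artwork": 50, "trailer": 120, "details": 30})

evicted = guard.enforce(baseline_mb=260)
print(evicted, guard.used_mb(260))
```

In this run usage starts at 460 MB against a 400 MB soft limit, so the two oldest entries are dropped and usage falls to 290 MB.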
Now, as always, and as the engineers also state in their technical blog post, it's much more important for them to predict with a certain amount of false positives rather than false negatives. A false negative means missing an out-of-memory kill that actually occurred but was not predicted. If you are a regular listener of this podcast, that statement should resonate with you, because this is exactly what happens, for example, in healthcare applications: doctors or algorithms that operate in healthcare would definitely prefer to have a few more false positives rather than more false negatives. Missing that someone is sick means that you are not providing a cure, and you're just sending the patient home when he or she is sick. That's a false negative, it's the miss. But what can go wrong with having a false positive? Well, probably you will undergo another test to make sure that the first test is confirmed or not. So having a false positive in this case is relatively okay with respect to having a false negative. And that's exactly what happens with the Netflix application.
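
The preference for false positives over false negatives can be expressed as a cost when picking a decision threshold. In this sketch the scores, labels, candidate thresholds and the 5-to-1 cost ratio are all invented for illustration; the point is only that a heavier penalty on misses pushes the chosen threshold lower.

```python
def confusion(scores, labels, threshold):
    """Count false positives and false negatives at a given decision threshold."""
    pairs = list(zip(scores, labels))
    fp = sum(1 for s, y in pairs if s >= threshold and not y)
    fn = sum(1 for s, y in pairs if s < threshold and y)
    return fp, fn


def pick_threshold(scores, labels, candidates, fn_cost=5.0, fp_cost=1.0):
    """Choose the candidate threshold with the lowest weighted error cost."""
    def cost(threshold):
        fp, fn = confusion(scores, labels, threshold)
        return fn_cost * fn + fp_cost * fp
    return min(candidates, key=cost)


# Hypothetical model scores and whether an out-of-memory kill really followed.
scores = [0.10, 0.30, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]
labels = [False, False, True, False, False, True, True, True]
candidates = [0.2, 0.5, 0.7]

weighted = pick_threshold(scores, labels, candidates)
balanced = pick_threshold(scores, labels, candidates, fn_cost=1.0)
print(weighted, balanced)
```

With misses costing five times as much, the lowest threshold wins even though it raises three false alarms; with equal costs a higher threshold is chosen.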
Now, of course I don't want to say that the Netflix application is as critical as, for example, an application that predicts a cancer from an X-ray, or a disorder or disease of some sort. But what I'm saying is that there are some analogies. When it comes to machine learning and artificial intelligence, and especially data science, the old-school data science, there are several things that are, let's say, invariant across sectors. Two worlds like video streaming and healthcare are of course very different from each other, but when it comes to machine learning and data science applications, there are a lot of analogies there.
And indeed, in terms of the models that they use at Netflix: once they have the sliding window data, they essentially have the ground truth of where the out-of-memory kill happened and what happened before that to the memory of the application or the machine. Well, the models they then use to predict these events are artificial neural networks, XGBoost, AdaBoost (adaptive boosting), Elastic Net with softmax, and so on and so forth. So nothing fancy. As you can see, XGBoost is probably one of the most used. I would have expected even random forest; probably they've tried that. But XGBoost is probably one of the most used models in Kaggle competitions for a reason: because it works, and it leverages a lot the data preparation step, which already solves more than half of the problem.
Thank you so much for listening. I also invite you, as always, to join the Discord channel. You will find a link on the official website, datascienceathome.com. Speak with you next time.
You've been listening to the Data Science at Home podcast. Be sure to subscribe on iTunes, Stitcher, or Podbean to get new, fresh episodes. For more, please follow us on Instagram, Twitter and Facebook, or visit our website at datascienceathome.com.
 
References
https://netflixtechblog.com/formulating-out-of-memory-kill-prediction-on-the-netflix-app-as-a-machine-learning-problem-989599029109

Tuesday Sep 13, 2022
Is studying AI in academia a waste of time? (Ep. 202)
Companies and other business entities are actively involved in defining data products and applied research every year. Academia has always played a role in creating new methods and solutions/algorithms in the fields of machine learning and artificial intelligence. However, there is doubt about how powerful and effective such research efforts are. Is studying AI in academia a waste of time?
 
Our Sponsors
Explore the Complex World of Regulations. Compliance can be overwhelming. Multiple frameworks. Overlapping requirements. Let Arctic Wolf be your guide. Check it out at https://arcticwolf.com/datascience
 
Amethix works to create and maximize the impact of the world’s leading corporations and startups, so they can create a better future for everyone they serve. We provide solutions in AI/ML, Fintech, Healthcare/RWE, and Predictive maintenance.