Sep 20, 2022

Predicting Out Of Memory Kill events with Machine Learning (Ep. 203)

Sometimes applications crash. Some other times applications crash because memory is exhausted. Such issues exist because of bugs in the code, or heavy memory usage for reasons that were not expected during design and implementation.
Can we use machine learning to predict and eventually detect out of memory kills from the operating system?

Apparently, the Netflix app many of us use on a daily basis leverages ML and time series analysis to prevent OOM kills.

Enjoy the show!

Our Sponsors

Explore the Complex World of Regulations. Compliance can be overwhelming. Multiple frameworks. Overlapping requirements. Let Arctic Wolf be your guide.
Check it out at https://arcticwolf.com/datascience

Amethix works to create and maximize the impact of the world’s leading corporations and startups, so they can create a better future for everyone they serve. We provide solutions in AI/ML, Fintech, Healthcare/RWE, and Predictive maintenance.

Transcript

1
00:00:04,150 --> 00:00:09,034
And here we are again with season four of the Data Science at Home podcast.

2
00:00:09,142 --> 00:00:19,170
This time we have something for you: if you want to help us shape the data science leaders of the future, we have created the Data Science at Home Ambassador program.

3
00:00:19,340 --> 00:00:28,378
Ambassadors are volunteers who are passionate about data science and want to give back to our growing community of data science professionals and enthusiasts.

4
00:00:28,534 --> 00:00:37,558
You will be instrumental in helping us achieve our goal of raising awareness about the critical role of data science in cutting edge technologies.

5
00:00:37,714 --> 00:00:45,740
If you want to learn more about this program, visit the Ambassadors page on our website at datascienceathome.com.

6
00:00:46,430 --> 00:00:49,234
Welcome back to another episode of the Data Science at Home podcast.

7
00:00:49,282 --> 00:00:55,426
I'm Francesco, podcasting from the regular office of Amethix Technologies, based in Belgium.

8
00:00:55,618 --> 00:01:02,914
In this episode, I want to speak about a machine learning problem that has been formulated at Netflix.

9
00:01:03,022 --> 00:01:22,038
And for the record, Netflix is not sponsoring this episode, though I still believe that this problem is a very well known problem, a very common one across sectors, which is how to predict an out of memory kill in an application and formulate this problem as a machine learning problem.

10
00:01:22,184 --> 00:01:39,142
So this is something that, as I said, is very interesting, not just because of Netflix, but because it allows me to explain a few points that, as I said, are kind of invariant across sectors.

11
00:01:39,226 --> 00:01:56,218
Regardless of whether your application is a video streaming application or any other communication type of application, or a fintech application, or energy, or whatever, this out of memory kill still occurs.

12
00:01:56,314 --> 00:02:05,622
And what is an out of memory kill? Well, it's essentially the extreme event in which the machine doesn't have any more memory left.

13
00:02:05,756 --> 00:02:16,678
And so usually the operating system can start eventually swapping, which means using the SSD or the hard drive as a source of memory.

14
00:02:16,834 --> 00:02:19,100
But that, of course, will slow things down a lot.

15
00:02:19,430 --> 00:02:45,210
And eventually when there is a bug or a memory leak, or if there are other applications running on the same machine, of course there is some kind of limiting factor that essentially kills the application, something that occurs from the operating system most of the time that kills the application in order to prevent the application from monopolizing the entire machine, the hardware of the machine.

16
00:02:45,710 --> 00:02:48,500
And so this is a very important problem.

17
00:02:49,070 --> 00:03:03,306
Also, it is important to have an episode about this because there are some strategies that they've used at Netflix that are pretty much in line with what I believe machine learning should be about.

18
00:03:03,368 --> 00:03:25,062
And usually people would go for the fancy solution there, like these extremely accurate predictors or machine learning models that have a massive number of parameters and that try to figure out whatever is happening on the machine that is running that application.

19
00:03:25,256 --> 00:03:29,466
While the solution at Netflix is pretty straightforward, it's pretty simple.

20
00:03:29,588 --> 00:03:33,654
And so one would say, then why make an episode about this? Well,

21
00:03:33,692 --> 00:03:45,730
Because I think that we need more sobriety when it comes to machine learning and I believe we still need to spend a lot of time thinking about what data to collect.

22
00:03:45,910 --> 00:03:59,730
Reasoning about what is the problem at hand and what is the data that can actually tickle the particular machine learning model and then of course move to the actual prediction that is the actual model.

23
00:03:59,900 --> 00:04:15,910
Which, most of the time, doesn't need to be one of these super fancy things that you see on the news around chatbots or autonomous gaming agents or drivers and so on and so forth.

24
00:04:16,030 --> 00:04:28,518
So there are essentially two data sets that the people at Netflix focus on, which are considerably different, dramatically different in fact.

25
00:04:28,604 --> 00:04:45,570
These are data about device characteristics and capabilities and, of course, data that are collected at runtime and that give you a picture of what's going on in the memory of the device, right? So that's the so-called runtime memory data and out of memory kills.

26
00:04:45,950 --> 00:05:03,562
So the first type of data I would consider very static, because it includes, for example, the device type ID, the version of the software development kit that the application is running, cache capacities, buffer capacities and so on and so forth.

27
00:05:03,646 --> 00:05:11,190
So it's something that most of the time doesn't change across sessions and so that's why it's considered static.

28
00:05:12,050 --> 00:05:18,430
In contrast, the other type of data, the runtime memory data, is, as the name says, runtime.

29
00:05:18,490 --> 00:05:24,190
So it varies across the life of the session; it's collected at runtime.

30
00:05:24,250 --> 00:05:25,938
So it's very dynamic data.

31
00:05:26,084 --> 00:05:36,298
Examples of these records are profile, movie details, playback information, current memory usage, et cetera, et cetera.

32
00:05:36,334 --> 00:05:56,086
So this is the data that actually moves and moves in the sense that it changes depending on how the user is actually using the Netflix application, what movie or what profile description, what movie detail has been loaded for that particular movie and so on and so forth.

33
00:05:56,218 --> 00:06:15,094
So the first difficulty, the first challenge, that the people at Netflix had to deal with was how to combine these two things: very static and usually small tables versus very dynamic and usually large tables or views.

34
00:06:15,142 --> 00:06:36,702
Well, there is some sort of join on key that is performed by the people at Netflix in order to put together these different data resolutions, right, which is data about the same phenomenon but from different sources and carrying very different signals.
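
The join on key described here can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the column names (device_type_id, sdk_version, buffer_capacity_mb, memory_mb), the sample values and the join_on_device helper are all hypothetical and are not taken from the Netflix post. It attaches the small static device table to every row of the large runtime table and keeps missing device knowledge explicit.

```python
# Hypothetical static table: one row per device type (small, rarely changes).
device_capabilities = {
    "tv-a": {"sdk_version": "4.1", "buffer_capacity_mb": 64},
    "stick-b": {"sdk_version": "5.0", "buffer_capacity_mb": 128},
}

# Hypothetical runtime records: many rows per session (large, changes constantly).
runtime_records = [
    {"device_type_id": "tv-a", "session": 1, "t_sec": 0, "memory_mb": 310.0},
    {"device_type_id": "tv-a", "session": 1, "t_sec": 60, "memory_mb": 342.5},
    {"device_type_id": "stick-b", "session": 2, "t_sec": 0, "memory_mb": 505.0},
    {"device_type_id": "unknown-c", "session": 3, "t_sec": 0, "memory_mb": 120.0},
]


def join_on_device(records, capabilities):
    """Left join: keep every runtime record, attach static columns when known."""
    joined = []
    for record in records:
        static = capabilities.get(record["device_type_id"])
        row = dict(record)
        # Unknown devices keep explicit None values so that missing data
        # can be handled in a later cleaning step.
        row["sdk_version"] = static["sdk_version"] if static else None
        row["buffer_capacity_mb"] = static["buffer_capacity_mb"] if static else None
        joined.append(row)
    return joined


joined_rows = join_on_device(runtime_records, device_capabilities)
```

A left join is used here so that no runtime record is silently dropped when the device is unknown; whether that matches the actual pipeline is an assumption.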

35
00:06:36,896 --> 00:06:48,620
So the device capabilities are usually captured by the static data and, of course, there is the other data, the runtime memory and out of memory kill data.

36
00:06:48,950 --> 00:07:04,162
These are also, as I said, the data that will describe pretty accurately how the user is using that particular application on that particular hardware.

37
00:07:04,306 --> 00:07:17,566
Now of course, when it comes to data engineering, there is nothing new that people at Netflix have introduced: dealing with missing data, for example, or incorporating knowledge of devices.

38
00:07:17,698 --> 00:07:26,062
It's all stuff that is part of the so-called data cleaning and data collection strategy, right? Or data preparation.

39
00:07:26,146 --> 00:07:40,782
That is, whatever you're going to do in order to make that data or a combination of these data sources, let's say, compatible with the way your machine learning model will understand or will read that data.

40
00:07:40,916 --> 00:07:58,638
So if you think of a big data platform, the first step, the first challenge you have to deal with is how can I, first of all, collect the right amount of information, the right data, but also how to transform this data for my particular big data platform.

41
00:07:58,784 --> 00:08:12,798
And that's something that, again, nothing new, nothing fancy, just basics, what we have been used to, what we are used to seeing now for the last decade or more, that's exactly what they do.

42
00:08:12,944 --> 00:08:15,222
And now let me tell you something important.

43
00:08:15,416 --> 00:08:17,278
Cybercriminals are evolving.

44
00:08:17,374 --> 00:08:22,446
Their techniques and tactics are more advanced, intricate and dangerous than ever before.

45
00:08:22,628 --> 00:08:30,630
Industries and governments around the world are fighting back, unveiling new regulations meant to better protect data against this rising threat.

46
00:08:30,950 --> 00:08:39,262
Today, the world of cybersecurity compliance is a complex one, and understanding the requirements your organization must adhere to can be a daunting task.

47
00:08:39,406 --> 00:08:42,178
But not when the pack has your back.

48
00:08:42,214 --> 00:08:53,840
Arctic Wolf, the leader in security operations, is on a mission to end cyber risk by giving organizations the protection, information and confidence they need to protect their people, technology and data.

49
00:08:54,170 --> 00:09:02,734
The new interactive compliance portal helps you discover the regulations in your region and industry and start the journey towards achieving and maintaining compliance.

50
00:09:02,902 --> 00:09:07,542
Visit arcticwolf.com/datascience to take your first step.

51
00:09:07,676 --> 00:09:11,490
That's arcticwolf.com/datascience.

52
00:09:12,050 --> 00:09:18,378
Now for what I think is the most important part, though I think all of them are actually equally important:

53
00:09:18,464 --> 00:09:26,854
the way they treat runtime memory data and out of memory kill data is by using sliding windows.

54
00:09:26,962 --> 00:09:38,718
So that's something that is really worth mentioning, because the way you would frame this problem is something is happening at some point in time and I have to kind of predict that event.

55
00:09:38,864 --> 00:09:49,326
That is usually an outlier in the sense that these events are quite rare, fortunately, because otherwise Netflix would not be as usable as we believe it is.

56
00:09:49,448 --> 00:10:04,110
So you would like to predict these weird events by looking at a historical view or an historical amount of records that you have before this particular event, which is the kill of the application.

57
00:10:04,220 --> 00:10:12,870
So the concept of the sliding window, the sliding window approach is something that comes as the most natural thing anyone would do.

58
00:10:13,040 --> 00:10:18,366
And that's exactly what the researchers at Netflix have done.

59
00:10:18,488 --> 00:10:25,494
So, not unexpectedly in my opinion, they treated this problem as a time series, which is exactly what it is.

60
00:10:25,652 --> 00:10:26,190
Now,

61
00:10:26,300 --> 00:10:26,754
they,

62
00:10:26,852 --> 00:10:27,330
of course,

63
00:10:27,380 --> 00:10:31,426
use this sliding window with different horizons:

64
00:10:31,558 --> 00:10:32,190
five minutes,

65
00:10:32,240 --> 00:10:32,838
four minutes,

66
00:10:32,924 --> 00:10:33,702
two minutes,

67
00:10:33,836 --> 00:10:36,366
as close as possible to the event.

68
00:10:36,548 --> 00:10:38,886
Because maybe there are some,

69
00:10:39,008 --> 00:10:39,762
let's say,

70
00:10:39,896 --> 00:10:45,678
other dynamics that can arise when you are very close to the event or when you are very far from it.

71
00:10:45,704 --> 00:10:50,166
Like, five minutes away from the out of memory kill,

72
00:10:50,348 --> 00:10:51,858
you might have some other,

73
00:10:51,944 --> 00:10:52,410
let's say,

74
00:10:52,460 --> 00:10:55,986
diagrams or shapes in the data.

75
00:10:56,168 --> 00:11:11,310
So for example, you might have a certain number of allocations that keep growing and growing, but eventually they grow with a certain curve or a certain rate that you can measure when you are five to ten minutes away from the out of memory kill.

76
00:11:11,420 --> 00:11:16,566
When you are two minutes away from the out of memory kill, probably this trend will change.

77
00:11:16,688 --> 00:11:30,800
And so probably what you would expect is that the memory is already half or more saturated and therefore, for example, the operating system starts swapping or other things are happening that you are going to measure in this window.

78
00:11:31,550 --> 00:11:39,730
And that would give you a much better picture of what's going on in the, let's say, closest neighborhood of that event.

79
00:11:39,790 --> 00:11:51,042
The time window, or sliding window, approach is definitely worth mentioning because this is something that you can apply, if you think about it, pretty much anywhere, right?
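
To make the idea of measuring a growth rate inside each window concrete, here is a minimal sketch. The memory readings, the window size and the helper names sliding_windows and growth_rate are assumptions made for illustration; the episode does not specify which features were computed per window.

```python
def sliding_windows(readings, window_size, step=1):
    """Yield consecutive overlapping windows of `window_size` readings."""
    for start in range(0, len(readings) - window_size + 1, step):
        yield readings[start:start + window_size]


def growth_rate(window):
    """Average change per step inside one window (a crude slope estimate)."""
    return (window[-1] - window[0]) / (len(window) - 1)


# Hypothetical memory readings in MB, one per minute, ending at an OOM kill.
memory_mb = [300, 310, 320, 330, 345, 370, 410, 470]

# One growth rate per window; later windows are closer to the kill.
rates = [growth_rate(w) for w in sliding_windows(memory_mb, window_size=3)]
```

With these made-up readings the rate stays flat in the early windows and climbs sharply in the last ones, which is the kind of change in trend described above.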

80
00:11:51,116 --> 00:11:52,050
Now, what they did,

81
00:11:52,160 --> 00:12:04,146
in addition to having a time window, a sliding window, was to also assign different levels to memory readings that are closer to the out of memory kill.

82
00:12:04,208 --> 00:12:10,062
And usually these levels are higher and higher as we get closer and closer to the out of memory kill.

83
00:12:10,136 --> 00:12:15,402
So this means that, for example, we would have, for a five minute window, we would have a level one.

84
00:12:15,596 --> 00:12:22,230
Five minutes means five minutes away from the out of memory kill; four minutes would be a level two.

85
00:12:22,280 --> 00:12:37,234
Three minutes, which is much closer, would be a level three, and two minutes would be a level four, which indicates kind of the severity of the event as we get closer and closer to the actual event, when the application is actually killed.
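
The level assignment just described (five minutes is level one, four is level two, three is level three, two is level four) can be written as a small labeling function. Two things here are assumptions that the episode does not state: readings more than five minutes before the kill get level 0, and readings less than two minutes before the kill keep level 4. The function name kill_level and the sample readings are hypothetical.

```python
import math


def kill_level(seconds_before_kill):
    """Map the time remaining before an OOM kill to a severity level."""
    minutes = math.ceil(seconds_before_kill / 60)
    if minutes > 5:
        return 0  # assumption: too far from the kill to be labeled
    if minutes <= 2:
        return 4  # assumption: anything inside two minutes stays at level 4
    return 6 - minutes  # 5 min -> 1, 4 min -> 2, 3 min -> 3


# Hypothetical session: (timestamp in seconds, memory in MB), kill at t = 600 s.
kill_time_sec = 600
readings = [(240, 350.0), (300, 362.0), (360, 380.0),
            (420, 405.0), (480, 450.0), (570, 490.0)]

labeled = [(t, mem, kill_level(kill_time_sec - t)) for t, mem in readings]
levels = [level for _, _, level in labeled]
```

These labels turn the problem into multi-class classification, where a higher predicted class means a kill is expected sooner.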

86
00:12:37,342 --> 00:12:51,474
So by looking at this approach, nothing new there; I would say even a not-so-seasoned data scientist would have understood that using a sliding window is the way to go.

87
00:12:51,632 --> 00:12:55,482
I'm not saying that Netflix engineers are not seasoned enough.

88
00:12:55,556 --> 00:13:04,350
Actually they do a great job every day to keep giving us video streaming platforms that actually never fail or almost never fail.

89
00:13:04,910 --> 00:13:07,460
So spot on there, guys, good job.

90
00:13:07,850 --> 00:13:27,738
But looking at this sliding window approach, the direct consequence of this is that they can plot, they can do some sort of graphical analysis of the out of memory kills versus the memory usage that can give the reader or the data scientist a very nice picture of what's going on there.

91
00:13:27,824 --> 00:13:39,330
And so you would have, for example, and I will definitely report some of the pictures, some of the diagrams and graphs in the show notes of this episode on the official website datascienceathome.com.

92
00:13:39,500 --> 00:13:48,238
But essentially what you can see there is that there might be premature peaks at, let's say, a lower memory reading.

93
00:13:48,334 --> 00:14:08,958
And usually these are some kind of false positives or anomalies that should not be there. Then it's possible to set a threshold at which to start lowering the memory usage, because after that threshold something nasty can happen, and usually happens according to your data.

94
00:14:09,104 --> 00:14:18,740
And then of course there is another graph about the Gaussian distribution or in fact no sharp peak at all.

95
00:14:19,250 --> 00:14:21,898
That is, kills, or out of memory

96
00:14:21,934 --> 00:14:33,754
kills, are more or less normally distributed. And then of course there are the genuine peaks, which indicate kills near, let's say, the threshold.

97
00:14:33,802 --> 00:14:38,758
And so usually you would see that after that particular threshold of memory usage

98
00:14:38,914 --> 00:14:42,142
you see most of the out of memory kills,

99
00:14:42,226 --> 00:14:45,570
which makes sense because, given a particular device,

100
00:14:45,890 --> 00:14:48,298
which means a certain amount of memory,

101
00:14:48,394 --> 00:14:50,338
certain memory characteristics,

102
00:14:50,494 --> 00:14:53,074
a certain version of the SDK and so on and so forth,

103
00:14:53,182 --> 00:14:53,814
you can say:

104
00:14:53,852 --> 00:14:54,090
okay,

105
00:14:54,140 --> 00:15:10,510
well, for this device type I have this memory usage threshold, and I see that I have a relatively high number of out of memory kills immediately after this threshold.

106
00:15:10,570 --> 00:15:18,150
And this means that probably that is the threshold you would like to consider as the critical threshold you should never or almost never cross.
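
One simple way to turn that picture into a number is to take, per device type, the memory level below which only a small fraction of the observed kills occurred. This is a sketch of one possible heuristic, not the method from the Netflix post; the function critical_threshold, the 10 percent fraction and the sample data are all assumptions.

```python
def critical_threshold(kill_memory_mb, fraction_below=0.1):
    """Memory level below which only `fraction_below` of observed kills occurred."""
    ordered = sorted(kill_memory_mb)
    index = int(fraction_below * len(ordered))
    return ordered[index]


# Hypothetical memory readings (MB) recorded at the moment of each OOM kill.
# The 410 reading for "tv-a" plays the role of a premature peak.
kills_by_device = {
    "tv-a": [410, 455, 460, 462, 465, 468, 470, 471, 475, 480],
    "stick-b": [880, 900, 905, 910, 915],
}

thresholds = {
    device: critical_threshold(kills)
    for device, kills in kills_by_device.items()
}
```

Using a low quantile rather than the minimum keeps a single premature kill from dragging the threshold down for the whole device type.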

107
00:15:18,710 --> 00:15:38,758
So once you have this picture in front of you, you can start thinking of implementing some mechanisms that can monitor the memory usage and, of course, kind of preemptively deallocate things or keep the memory usage as low as possible with respect to the critical threshold.

108
00:15:38,794 --> 00:15:53,446
So you can start implementing some logic that prevents the application from being killed by the operating system so that you would in fact reduce the rate of out of memory kills overall.
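
The kind of logic meant here could look like the following sketch: a guard that compares the current memory usage with a soft limit placed below the critical threshold and asks the application to free memory before the operating system steps in. The MemoryGuard class, the 0.9 safety margin and the drop_caches callback are hypothetical names and values chosen for illustration.

```python
class MemoryGuard:
    """Triggers a cleanup callback before memory reaches a critical threshold."""

    def __init__(self, critical_mb, safety_margin=0.9):
        # Act at a soft limit below the critical threshold (assumed margin).
        self.soft_limit_mb = critical_mb * safety_margin

    def check(self, current_mb, free_memory):
        """Return the memory usage after an optional preemptive cleanup."""
        if current_mb >= self.soft_limit_mb:
            freed_mb = free_memory()  # e.g. drop caches, shrink buffers
            return current_mb - freed_mb
        return current_mb


def drop_caches():
    """Hypothetical cleanup that pretends to release 50 MB."""
    return 50.0


guard = MemoryGuard(critical_mb=455.0)
usage_when_safe = guard.check(380.0, drop_caches)   # below soft limit: untouched
usage_when_risky = guard.check(420.0, drop_caches)  # above soft limit: cleaned up
```

In a real application the callback would release actual resources and the guard would be fed by a trained predictor rather than a fixed threshold; both simplifications are deliberate here.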

109
00:15:53,578 --> 00:16:11,410
Now, as always, and as the engineers also state in their blog post, in the technical post, they say: well, it's much more important for us to predict with a certain amount of false positives rather than false negatives.

110
00:16:11,590 --> 00:16:18,718
A false negative means missing an out of memory kill that actually occurred but was not predicted.
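
This preference can be expressed as a rule for choosing the decision threshold of a classifier: accept more false positives as long as recall (the share of real kills that are caught) stays above a target. The sketch below uses made-up scores and labels, and the helper names confusion_counts and pick_threshold are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp, fp, fn


def pick_threshold(scores, y_true, min_recall=1.0):
    """Highest score threshold whose recall still meets `min_recall`."""
    for threshold in sorted(set(scores), reverse=True):
        y_pred = [s >= threshold for s in scores]
        tp, fp, fn = confusion_counts(y_true, y_pred)
        if tp / (tp + fn) >= min_recall:
            return threshold, fp, fn
    return None


# Hypothetical model scores and ground truth (1 = an OOM kill followed).
scores = [0.95, 0.80, 0.60, 0.55, 0.30, 0.20, 0.10]
y_true = [1, 1, 0, 1, 0, 0, 0]

chosen = pick_threshold(scores, y_true, min_recall=1.0)
```

With this toy data the chosen threshold catches every kill at the cost of one false alarm, which is the trade-off described above.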

111
00:16:18,874 --> 00:16:40,462
If you are a regular listener of this podcast, that statement should resonate with you because this is exactly what happens, for example in healthcare applications, which means that doctors or algorithms that operate in healthcare would definitely prefer to have a bit more false positives rather than more false negatives.

112
00:16:40,486 --> 00:16:54,800
Because missing that someone is sick means that you are not providing a cure and you're just sending the patient home when he or she is sick, right?

113
00:16:55,130 --> 00:16:57,618
So that's a false negative, it's the miss.

114
00:16:57,764 --> 00:17:09,486
But having a false positive, what can go wrong with having a false positive? Well, probably you will undergo another test to make sure that the first test is confirmed or not.

115
00:17:09,608 --> 00:17:16,018
So having a false positive in this case is relatively okay with respect to having a false negative.

116
00:17:16,054 --> 00:17:19,398
And that's exactly what happens to the Netflix application.

117
00:17:19,484 --> 00:17:32,094
Now, I don't want to say that, of course, the Netflix application is as critical as, for example, an application that predicts a cancer, or something on an X-ray, or a disorder or disease of some sort.

118
00:17:32,252 --> 00:17:48,090
But what I'm saying is that there are some analogies when it comes to machine learning and artificial intelligence and especially data science, the old school data science, there are several things that kind of are, let's say, invariant across sectors.

119
00:17:48,410 --> 00:17:56,826
And so, you know, two worlds like the media streaming or video streaming and healthcare are of course very different from each other.

120
00:17:56,888 --> 00:18:05,274
But when it comes to machine learning and data science applications, well, there are a lot of analogies there.

121
00:18:05,372 --> 00:18:06,202
And indeed,

122
00:18:06,286 --> 00:18:10,234
in terms of the models that they use at Netflix to predict,

123
00:18:10,342 --> 00:18:24,322
once they have the sliding window data and essentially they have the ground truth of where this out of memory kill happened and what happened before to the memory of the application or the machine,

124
00:18:24,466 --> 00:18:24,774
well,

125
00:18:24,812 --> 00:18:30,514
then the models they use to predict these events are artificial neural networks,

126
00:18:30,622 --> 00:18:31,714
XGBoost,

127
00:18:31,822 --> 00:18:36,742
AdaBoost (or adaptive boosting), Elastic Net with softmax, and so on and so forth.

128
00:18:36,766 --> 00:18:39,226
So nothing fancy.

129
00:18:39,418 --> 00:18:45,046
As you can see, XGBoost is probably one of the most used. I would have expected even random forest.

130
00:18:45,178 --> 00:18:47,120
Probably they do, they've tried that.

131
00:18:47,810 --> 00:18:58,842
But XGBoost is probably one of the most used models in Kaggle competitions for a reason: because it works, and it leverages a lot

132
00:18:58,916 --> 00:19:04,880
the data preparation step, which already solves more than half of the problem.

133
00:19:05,810 --> 00:19:07,270
Thank you so much for listening.

134
00:19:07,330 --> 00:19:11,910
I also invite you, as always, to join the Discord Channel.

135
00:19:12,020 --> 00:19:15,966
You will find a link on the official website datascienceathome.com.

136
00:19:16,148 --> 00:19:17,600
Speak with you next time.

137
00:19:18,350 --> 00:19:21,382
You've been listening to the Data Science at Home podcast.

138
00:19:21,466 --> 00:19:26,050
Be sure to subscribe on iTunes, Stitcher, or Podbean to get new, fresh episodes.

139
00:19:26,110 --> 00:19:31,066
For more, please follow us on Instagram, Twitter and Facebook or visit our website at datascienceathome.com

References

https://netflixtechblog.com/formulating-out-of-memory-kill-prediction-on-the-netflix-app-as-a-machine-learning-problem-989599029109
