A typical Kubernetes cluster has many moving parts, so there are many ways in which it can break. Monitoring tools such as Prometheus are therefore often used to collect usage and health metrics from deployments. However, when there are thousands of deployments in your fleet, inspecting the health metrics of each deployment to diagnose its issues becomes tedious and inefficient.
In this talk, we describe how we applied data science to the health metrics collected from OpenShift clusters to help us proactively identify issues. Specifically, we used clustering to group deployments that behave similarly. We then applied frequent pattern mining to determine the prominent, 'defining' patterns in each group. These patterns help us precisely identify and codify the problem affecting the deployments, so that we can diagnose issues proactively and at scale. In many cases, the patterns found by these methods coincided closely with the rules developed by subject matter experts (SMEs). We therefore believe these techniques can generate actionable insights when added as an extension to your existing monitoring system.
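
To make the two-step idea concrete, here is a minimal sketch in Python: group deployments by similarity, then mine the patterns shared within each group. It is an illustration under stated assumptions, not the pipeline used in this work. The deployment names and symptom labels are hypothetical, the greedy Jaccard-based grouping is a simple stand-in for whichever clustering algorithm is actually used, and the brute-force itemset counting stands in for a real frequent pattern mining algorithm such as Apriori or FP-growth.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: each deployment is reduced to the set of symptoms
# (for example, firing alerts) observed in its health metrics.
DEPLOYMENTS = {
    "cluster-a": {"etcd_slow_fsync", "api_latency_high"},
    "cluster-b": {"etcd_slow_fsync", "api_latency_high", "node_not_ready"},
    "cluster-c": {"etcd_slow_fsync", "api_latency_high"},
    "cluster-d": {"cert_expiring", "operator_degraded"},
    "cluster-e": {"cert_expiring", "operator_degraded", "node_not_ready"},
}


def jaccard(a, b):
    """Jaccard similarity between two symptom sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_similar(deployments, threshold=0.5):
    """Greedy grouping (assumed stand-in for clustering): a deployment joins
    the first group whose first member is similar enough, else starts a new one."""
    groups = []  # list of (leader_symptoms, member_names)
    for name, symptoms in deployments.items():
        for leader, members in groups:
            if jaccard(leader, symptoms) >= threshold:
                members.append(name)
                break
        else:
            groups.append((symptoms, [name]))
    return [members for _, members in groups]


def frequent_patterns(deployments, members, min_support=0.75, max_size=2):
    """Return symptom combinations present in at least min_support of the group."""
    counts = Counter()
    for name in members:
        items = sorted(deployments[name])
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    total = len(members)
    return {p: c / total for p, c in counts.items() if c / total >= min_support}


groups = group_similar(DEPLOYMENTS)
patterns = [frequent_patterns(DEPLOYMENTS, members) for members in groups]

for members, found in zip(groups, patterns):
    print(members, sorted(found))
```

On this toy data, the first group is defined by the co-occurrence of `etcd_slow_fsync` and `api_latency_high`, while `node_not_ready` appears in both groups but falls below the support threshold in each, so it is not reported as a defining pattern of either.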