Partial outage on Hosted Metrics us-west2
Incident Report for Grafana Cloud
Postmortem

RFO: 2018-12-05 Partial outage on Hosted Metrics us-west2

Summary

We experienced a partial outage on the Hosted Metrics us-west2 cluster due to issues on our provider’s cloud platform. A few hours after the cluster recovered, a similar but less impactful event occurred.

Impact

A large number of pods were unschedulable, causing slow query performance and reduced instance availability for a number of customers on the cluster.

Timeline (UTC)

2018-12-05

16:43: We start seeing alerts for hm-us-west2 and begin discussing them in the Slack ops channel

16:44: We identify that four nodes in the cluster are unavailable. This causes four nodes' worth of pods to attempt to reschedule elsewhere, but the cluster does not have enough spare capacity to absorb them, leaving many pods stuck in a Pending state

16:46: We start seeing API server failures on the cluster

16:52: We see that the nodes are ready again, and pending pods are now getting scheduled on the nodes that came back up

16:53: Both the hm-us-west2 and us-west1 clusters report 100% API failures, both in us-west1b

16:58: We see periodic API failures, and some pods are crashlooping and/or getting OOM killed

17:02: Some tsdb-gw pods are crashlooping because their memory limits are not high enough to work through the backlog of metrics (this is an edge case)

17:08: All crashlooping tsdb-gw pods have had their memory limits increased and come back online (a sketch of this kind of change follows the timeline)

17:50: All pods are ready and all alerts have cleared

18:00: The incident is marked as resolved

20:34: We see issues with the cluster again: workloads are not showing in the portal, the portal is reporting API failures, and kubectl commands are not working

20:35: We do not see any nodes go down this time, but there are over a hundred unschedulable pods, so any node unavailability must have been very brief

20:45: The API server begins responding again, and we add a node to increase our pod capacity

20:46: Kafka and Cassandra are recovering slowly, which slightly impacts performance

22:34: Everything has recovered, and the incident is marked as fully resolved
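
As an illustration of the mitigation applied between 17:02 and 17:08, the following is a minimal sketch of raising a memory limit with the Kubernetes Python client. The deployment name, namespace, container name, and the 8Gi value are placeholders for illustration, not the actual values used during the incident.

    # Sketch only: raise the memory limit on the tsdb-gw containers so the
    # pods can work through the backlog of metrics without being OOM killed.
    # Deployment name, namespace, container name, and the 8Gi value are
    # illustrative placeholders.
    from kubernetes import client, config

    config.load_kube_config()
    apps = client.AppsV1Api()

    patch = {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "tsdb-gw",
                            "resources": {"limits": {"memory": "8Gi"}},
                        }
                    ]
                }
            }
        }
    }

    apps.patch_namespaced_deployment(name="tsdb-gw", namespace="default", body=patch)

The same change can equally be made by editing the deployment manifest directly; the point is simply that the limit had to be high enough for the pods to work through the queued metrics on startup.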

Root Causes

Four nodes in the Kubernetes cluster became unavailable at the same time, causing all of the pods running on those nodes to be rescheduled elsewhere. The cluster did not have enough spare capacity to absorb the pods from the nodes that went down. We also saw issues with the Kubernetes API server, possibly because it was overloaded trying to reschedule the pods from the unavailable nodes.
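
To make the capacity shortfall concrete, here is a minimal sketch, assuming access to the cluster with the Kubernetes Python client, that lists Pending pods and each node's allocatable resources. It is not the exact check we ran, just the kind of signal that shows whether the cluster has headroom to absorb rescheduled pods.

    # Sketch only: count pods stuck in Pending and print per-node allocatable
    # resources. A large Pending count with little free allocatable capacity
    # indicates the cluster cannot absorb pods rescheduled off failed nodes.
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    pending = v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending")
    print(f"{len(pending.items)} pods are Pending")

    for node in v1.list_node().items:
        alloc = node.status.allocatable
        print(node.metadata.name, "cpu:", alloc.get("cpu"), "memory:", alloc.get("memory"))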

Resolution

We added a node to the cluster. The nodes that went down came back online and were able to run pods again. We worked with the provider to understand why the nodes went down and how similar failures can be prevented in the future.

Posted Jan 11, 2019 - 21:50 UTC

Resolved
This incident has been resolved.
Posted Dec 05, 2018 - 18:09 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Dec 05, 2018 - 17:10 UTC
Investigating
We've been alerted to problems on the hm-us-west2 cluster and engineers are investigating.
Posted Dec 05, 2018 - 16:53 UTC
This incident affected: Grafana Cloud: Graphite (AWS US West - prod-us-west-0: Querying).