Increased Ingest Latency on Hosted Metrics Graphite US-WEST
Incident Report for Grafana Cloud
Postmortem

RFO: 2018-12-11 us-west Kafka upgrades hindering ingestion

Summary

On Tuesday, December 11th, we experienced a reduction in our Kafka message processing rate due to upgrade preparations being performed in the us-west cluster. This impacted ingestion for some customers on the cluster: data sent to us could not be processed in a timely fashion, which increased ingestion latency and resulted in data arriving with delays of up to 2 hours.

Impact

Customers on the affected cluster experienced high ingestion latency. The increased latency led to some customers waiting up to 2 hours for metrics to be ingested and become available for querying.

Timeline (UTC)

On Tuesday the 11th, we were performing Kafka partition re-balancing in preparation for future Kafka upgrades intended to improve performance and increase stability. We had performed partition re-balancing in the past with similar settings and no adverse effects. However, due to hardware changes and increased average load, these settings caused unanticipated spikes in disk I/O and network bandwidth consumption. This affected customers in different ways depending on their relays' specific configuration settings.

17:10 UTC: as expected, partition re-balancing added load to the cluster. Ingestion times went from their usual mean of ~200 ms (max ~2 seconds) to a mean of ~4 seconds (max ~7 seconds). Since the delays were under 10 seconds, they appeared to be within the tolerance of carbon-relay-ng's default timeout settings, and no action was taken.
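
For reference, the tolerance mentioned above is the relay's per-send timeout: a batch send is only marked as failed if the ingest endpoint takes longer than the timeout to acknowledge the write. The sketch below is illustrative only; the ~10-second threshold is inferred from this report rather than taken from carbon-relay-ng's code, and the comparison is deliberately simplified. It shows why the delays seen at 17:10 stayed within tolerance while the later spikes did not.

    package main

    import (
    	"fmt"
    	"time"
    )

    // defaultSendTimeout is an assumed value; the report implies that delays
    // under ~10 seconds stayed within the relay's default timeout.
    const defaultSendTimeout = 10 * time.Second

    // wouldTimeOut reports whether a send that takes ingestDelay to be
    // acknowledged exceeds the send timeout and is therefore marked failed.
    func wouldTimeOut(ingestDelay time.Duration) bool {
    	return ingestDelay >= defaultSendTimeout
    }

    func main() {
    	// Delays observed during the incident (means and maxima from the timeline).
    	observed := []time.Duration{
    		200 * time.Millisecond, // normal mean
    		2 * time.Second,        // normal max
    		4 * time.Second,        // mean at 17:10
    		7 * time.Second,        // max at 17:10
    		10 * time.Second,       // spikes at 17:55
    	}
    	for _, d := range observed {
    		fmt.Printf("ingest delay %-6v -> send marked failed: %v\n", d, wouldTimeOut(d))
    	}
    }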

17:55 UTC: after initiating partition re-balancing on a large instance, we were alerted to higher than normal sustained ingestion delays, with spikes of up to 10 seconds. This caused some instances of carbon-relay-ng to time out while attempting to send data, resulting in delays for several customers. We did not halt the task mid-re-balance in order to maintain overall stability of the cluster.

20:10 UTC: most of the partition re-balancing on the large instance was complete and ingestion delays dropped to a mean of ~3 seconds (max ~5 seconds). This allowed customers to resume sending data at normal rates.

21:25 UTC: after partition re-balancing on the large instance was complete, the remaining re-balancing tasks were safely stopped. Stopping the Kafka partition re-balancing tasks restored disk I/O and network bandwidth consumption to acceptable levels and prevented any further interruptions.

Root Causes

Re-balancing of Kafka topics caused higher than expected load on the Kafka cluster, which increased ingestion latency to an average of ~4 seconds. For customers running carbon-relay-ng with large batch sizes, the increased latency meant that metric batches could not be ingested completely before carbon-relay-ng's timeout limit was reached. carbon-relay-ng then marked the whole batch write as failed and retried it after a short back-off period. In extreme cases, carbon-relay-ng entered a loop in which it continually resent the same batch, preventing newer metrics from being sent.
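
To make this failure mode concrete, the following simplified sketch (an illustration of the behavior described above, not carbon-relay-ng's actual implementation) shows a sender that treats a timed-out batch write as a whole-batch failure and retries the same batch after a short back-off. While the head batch keeps failing, newer metrics queue up behind it and are never sent.

    package main

    import (
    	"context"
    	"fmt"
    	"time"
    )

    // batch is one group of metric lines sent in a single request.
    type batch []string

    // send simulates a whole-batch write: it succeeds only if the ingest
    // endpoint acknowledges within the timeout; otherwise the entire batch
    // is treated as failed.
    func send(b batch, ingestDelay, timeout time.Duration) error {
    	ctx, cancel := context.WithTimeout(context.Background(), timeout)
    	defer cancel()
    	select {
    	case <-time.After(ingestDelay):
    		return nil
    	case <-ctx.Done():
    		return fmt.Errorf("timed out sending batch of %d metrics", len(b))
    	}
    }

    func main() {
    	// Shortened values so the sketch runs quickly; the incident involved
    	// a ~10 s timeout and sustained multi-second ingest delays.
    	const (
    		timeout     = 2 * time.Second
    		ingestDelay = 3 * time.Second
    		backoff     = 500 * time.Millisecond
    	)

    	queue := []batch{
    		{"metric.a 1", "metric.b 2"}, // head batch, keeps timing out
    		{"metric.c 3"},               // newer metrics stuck behind it
    	}

    	// The relay keeps retrying the head batch as long as it fails; the
    	// newer batch never gets a turn.
    	for attempt := 1; attempt <= 3; attempt++ {
    		if err := send(queue[0], ingestDelay, timeout); err != nil {
    			fmt.Printf("attempt %d: %v; retrying after %v\n", attempt, err, backoff)
    			time.Sleep(backoff)
    			continue
    		}
    		queue = queue[1:] // only on success does the next batch get sent
    	}
    	fmt.Printf("%d batch(es) still queued; newest metrics were never sent\n", len(queue))
    }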

Resolution

The re-balancing script was stopped, but a re-balance of a large topic had already started, and issues persisted until the re-balance of that topic was complete.

Posted Jan 11, 2019 - 21:54 UTC

Resolved
The cluster expansion has been paused and ingest latency has now returned to normal. We are evaluating options to ensure that future expansion does not impact ingest.
Posted Dec 11, 2018 - 21:46 UTC
Identified
An expansion of the Kafka deployment in US-WEST has caused increased ingest latency for Hosted Metrics Graphite, and may result in metrics ingest being delayed. We are working to reduce ingest latency and resolve the problem.
Posted Dec 11, 2018 - 19:13 UTC