On Tuesday, December 11th, we experienced a reduction in our Kafka message processing rate due to upgrade preparations being performed in the us-west cluster. This impacted ingestion (increased latency) for some customers on this cluster, as the data sent to us could not be processed in a timely fashion, resulting in data arriving with delays of up to 2 hours.
Customers experienced high ingestion latency. The increased latency led to some customers waiting up to 2 hours for metrics to be ingested and become available for querying.
On Tuesday, the 11th, we were performing Kafka partition re-balancing in preparation for future Kafka upgrades, to improve performance and increase stability. We had performed partition re-balancing in the past with similar settings and no adverse effects. Due to hardware changes and increased average load, these settings caused unanticipated spikes in disk I/O and network bandwidth consumption. This affected customers in different ways depending on their relays' specific configuration settings.
17:10 UTC: as expected, partition re-balancing added load on the cluster. Ingestion times went from their usual mean of ~200 ms and max of ~2 seconds to a mean of ~4 seconds and a max of ~7 seconds. Since the delays were under 10 seconds, this appeared to be within the tolerance of carbon-relay-ng's default timeout settings and no action was taken.
17:55 UTC: after initiating partition re-balancing on a large instance, we were alerted to higher than normal sustained ingestion delays, with spikes of up to 10 seconds. This caused some instances of carbon-relay-ng to time out while attempting to send data, resulting in delays for several customers. To maintain overall stability in the cluster, we did not halt the task mid-re-balance.
20:10 UTC: most of the partition re-balancing on the large instance was complete and ingestion delays dropped to a mean of ~3 seconds and a max of ~5 seconds. This allowed customers to resume sending data at normal rates.
21:25 UTC: after partition re-balancing on the large instance was complete, the remaining re-balancing tasks were safely stopped. Stopping the Kafka partition re-balancing tasks restored disk I/O and network bandwidth consumption to acceptable levels and prevented any further interruptions.
Re-balancing of Kafka topics caused higher than expected load on the Kafka cluster, which led ingestion latency to increase to an average of ~4 seconds. For customers with carbon-relay-ng configured with large batch sizes, the increased latency meant metric batches could not be ingested completely before carbon-relay-ng's timeout limits were reached. This caused carbon-relay-ng to mark the whole batch write as failed and retry it after a short back-off period. In extreme cases carbon-relay-ng entered a loop in which it continually resent the same batch, preventing newer metrics from being sent.
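To make this failure mode concrete, the following is a minimal sketch of how an all-or-nothing batch timeout can turn elevated per-metric latency into a resend loop. It is a simplified model, not carbon-relay-ng's actual implementation; the batch size, timeout, back-off, and latency values are illustrative assumptions rather than real defaults.

```python
import time
from collections import deque

# Illustrative values only -- not carbon-relay-ng defaults.
BATCH_SIZE = 10_000      # metrics per batch (a "large batch size")
SEND_TIMEOUT = 5.0       # seconds before the relay gives up on a write
BACKOFF = 1.0            # seconds to wait before retrying a failed batch
MAX_ATTEMPTS = 5         # cap for the simulation; the real loop can run indefinitely

def send_batch(batch, per_metric_latency):
    """Pretend to write a batch; return how long the write would take."""
    return len(batch) * per_metric_latency

def relay_loop(incoming, per_metric_latency):
    """Simplified model of a batching relay with an all-or-nothing timeout."""
    queue = deque(incoming)
    while queue:
        batch = [queue.popleft() for _ in range(min(BATCH_SIZE, len(queue)))]
        for attempt in range(1, MAX_ATTEMPTS + 1):
            elapsed = send_batch(batch, per_metric_latency)
            if elapsed <= SEND_TIMEOUT:
                print(f"batch of {len(batch)} sent in {elapsed:.1f}s")
                break
            # The whole batch is marked failed and retried after a back-off,
            # so newer metrics behind it in the queue keep waiting.
            print(f"attempt {attempt}: {elapsed:.1f}s exceeds {SEND_TIMEOUT}s timeout, retrying whole batch")
            time.sleep(BACKOFF)
        else:
            print("batch never fits inside the timeout -- the relay is stuck resending it")
            return

# Normal latency: each batch completes well inside the timeout.
relay_loop(range(20_000), per_metric_latency=0.0002)
# Degraded latency (as during the re-balance): every attempt exceeds the
# timeout, the same batch is resent, and nothing behind it is delivered.
relay_loop(range(20_000), per_metric_latency=0.001)
```

Under normal latency both batches in the sketch complete inside the timeout; under the degraded latency the first batch never fits, is resent repeatedly, and everything queued behind it stalls, which matches the behavior observed for customers using large batch sizes.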
The re-balancing script was stopped, but a re-balance of a large topic had already started, and issues persisted until that topic's re-balance was complete.