Deliveroo and a potential race condition with a Kafka sidecar

How Deliveroo could understand, during an incident, which other services may be affected by a race condition in a dependency.

(Note: this is a retrospectively published case study, and some details may have been modified to share publicly)


At Deliveroo, I was part of an incident in which we discovered that some services using a Kafka sidecar were vulnerable to a race condition due to a dependency of a dependency.

We needed to understand which other service(s) were affected and take action to appropriately protect business critical services from potential issues related to this.

If a service was configured to use the Kafka sidecar container, they would have the version configured by the following YAML configuration, used by Deliveroo's internal deployment tool, Hopper:

    image: "internal.docker.registry/service-name"
        kafka: 1.2.3

This wasn't as trivial to detect, especially in cases where we may have uses of YAML anchors:

# there may be some folks using YAML anchors
sidecars: &sidecars
    # or there could also be comments in here!
    kafka: 1.2.3

    image: "internal.docker.registry/service-name"

This data wouldn't be present by default, as it was a Deliveroo-specific configuration file, but fortunately we had already taken the foresight to consume this data - using Renovate's Custom Manager support - so was readily available to query.

Fortunately, a few weeks prior the team that owned the Kafka container had released a fix for this issue, so we knew that anyone using a version older than 0.3.0 could be affected by the bug.

As well as listing which services were affected, we needed to add the owning team of the repo - as a point of contact for work associated to prevent further issues - as well as the internal definition of how business critical the service is.

Both ownership information and information about business criticality could be found within our internally built Service Registry. Ownership data was already available in dependency-management-data using the owners table but the business criticality required some manual cross-referencing.


We needed to write a query that would:


Let's say that we have the following data:

In the renovate table:

platform organisation repo package_name version current_version package_manager package_file_path datasource dep_types
github deliveroo good-service internal-docker.tld/kafka 0.3.0 0.3.0 dockerfile .hopper.yml docker []
github deliveroo affected-service internal-docker.tld/kafka 0.2.1 0.2.1 dockerfile .hopper.yml docker []
github deliveroo also-affected-service internal-docker.tld/kafka 0.1.0 0.1.0 dockerfile .hopper.yml docker []

In the owners table:

platform organisation repo owner notes updated_at
github deliveroo good-service Restaraunts
github deliveroo affected-service Grocery

And in the repository_metadata table:

platform organisation repo is_monorepo is_fork repository_type repository_usage visibility description additional_metadata
github deliveroo good-service FALSE FALSE service PRIVATE {"tier": "tier_1"}
github deliveroo affected-service FALSE FALSE service PRIVATE {"tier": "tier_1"}
github deliveroo also-affected-service FALSE FALSE service PRIVATE {"tier": "tier_2"}

(Note that in this case, the tier information wasn't sourced from the repository_metadata table, as that didn't exist yet! So instead it was manually correlated. However, this is a good example of how having this data ingested by dependency-management-data can provide additional information that's very useful)


To solve this problem, we could then run:

  json_extract(additional_metadata, '$.tier') as tier
  left join owners on renovate.platform = owners.platform
  and renovate.organisation = owners.organisation
  and renovate.repo = owners.repo
  left join repository_metadata on renovate.platform = repository_metadata.platform
  and renovate.organisation = repository_metadata.organisation
  and renovate.repo = repository_metadata.repo
  -- NOTE: that this is performed with a lexicographical match, which is NOT
  -- likely to be what you are expecting to perform version constraint matching
  -- but this is a good start for these use cases
  renovate.current_version < '0.3'
order by
  tier ASC

Based on the above data, we get the following data:

organisation repo current_version owner tier
deliveroo affected-service 0.2.1 Grocery tier_1
deliveroo also-affected-service 0.1.0 tier_2

This shows us the repos, and an owner (if found) as well as the service's business criticality tier.