Deliveroo and a potential race condition with a Kafka sidecar
How Deliveroo could understand, during an incident, which other services may be affected by a race condition in a dependency.
(Note: this is a retrospectively published case study, and some details may have been modified to share publicly)
Context
At Deliveroo, I was part of an incident in which we discovered that some services using a Kafka sidecar were vulnerable to a race condition due to a dependency of a dependency.
We needed to understand which other service(s) were affected and take action to appropriately protect business critical services from potential issues related to this.
If a service used the Kafka sidecar container, its version was configured via the following YAML, used by Deliveroo's internal deployment tool, Hopper:
app:
  image: "internal.docker.registry/service-name"
  sidecars:
    kafka: 1.2.3
Detecting this wasn't as trivial as it may seem, especially in cases where services used YAML anchors:
# there may be some folks using YAML anchors
sidecars: &sidecars
  # or there could also be comments in here!
  kafka: 1.2.3

app:
  image: "internal.docker.registry/service-name"
  sidecars: *sidecars
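A real YAML parser resolves anchors and aliases at load time, which is why parsing the file structurally beats pattern-matching the raw text. A minimal sketch, assuming PyYAML is available (the document contents below are the two examples above):

```python
# Demonstrates that a YAML parser resolves anchors, so both documents
# produce the same structure - something naive text matching would miss.
import yaml

plain = """
app:
  image: "internal.docker.registry/service-name"
  sidecars:
    kafka: 1.2.3
"""

anchored = """
# there may be some folks using YAML anchors
sidecars: &sidecars
  # or there could also be comments in here!
  kafka: 1.2.3

app:
  image: "internal.docker.registry/service-name"
  sidecars: *sidecars
"""

# Both documents resolve to the same app.sidecars mapping once parsed
assert yaml.safe_load(plain)["app"]["sidecars"] == {"kafka": "1.2.3"}
assert yaml.safe_load(anchored)["app"]["sidecars"] == {"kafka": "1.2.3"}
```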
This data wouldn't be present by default, as it was a Deliveroo-specific configuration file, but fortunately we'd had the foresight to ingest this data - using Renovate's Custom Manager support - so it was readily available to query.
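Renovate's regex-based custom manager lets you teach Renovate to extract versions from arbitrary files. A hypothetical sketch of what such a configuration could look like for the Hopper file - the file path, dependency name, and regex here are illustrative, not Deliveroo's actual config:

```json
{
  "customManagers": [
    {
      "customType": "regex",
      "fileMatch": ["^\\.hopper\\.yml$"],
      "matchStrings": ["kafka:\\s+(?<currentValue>\\S+)"],
      "depNameTemplate": "internal-docker.tld/kafka",
      "datasourceTemplate": "docker"
    }
  ]
}
```

With this in place, each match lands as a row in the renovate table, ready to query.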
Fortunately, a few weeks prior, the team that owned the Kafka container had released a fix for this issue, so we knew that anyone using a version older than 0.3.0 could be affected by the bug.
As well as listing which services were affected, we needed to include each repo's owning team - as a point of contact for follow-up work to prevent further issues - and the internal definition of how business critical the service is.
Both ownership and business criticality information could be found in our internally built Service Registry. Ownership data was already available in dependency-management-data via the owners table, but business criticality required some manual cross-referencing.
Problem
We needed to write a query that would:
- List the services that have defined a Kafka sidecar container
- Filter these to those that are using a version earlier than 0.3.0
- List the owners for those services, if one was present
- Order those services based on the business priority, with the highest priority services first
Data
Let's say that we have the following data:
In the renovate table:
platform | organisation | repo | package_name | version | current_version | package_manager | package_file_path | datasource | dep_types |
---|---|---|---|---|---|---|---|---|---|
github | deliveroo | good-service | internal-docker.tld/kafka | 0.3.0 | 0.3.0 | dockerfile | .hopper.yml | docker | [] |
github | deliveroo | affected-service | internal-docker.tld/kafka | 0.2.1 | 0.2.1 | dockerfile | .hopper.yml | docker | [] |
github | deliveroo | also-affected-service | internal-docker.tld/kafka | 0.1.0 | 0.1.0 | dockerfile | .hopper.yml | docker | [] |
In the owners table:
platform | organisation | repo | owner | notes | updated_at |
---|---|---|---|---|---|
github | deliveroo | good-service | Restaurants | | |
github | deliveroo | affected-service | Grocery | | |
And in the repository_metadata table:
platform | organisation | repo | is_monorepo | is_fork | repository_type | repository_usage | visibility | description | additional_metadata |
---|---|---|---|---|---|---|---|---|---|
github | deliveroo | good-service | FALSE | FALSE | service | | PRIVATE | | {"tier": "tier_1"} |
github | deliveroo | affected-service | FALSE | FALSE | service | | PRIVATE | | {"tier": "tier_1"} |
github | deliveroo | also-affected-service | FALSE | FALSE | service | | PRIVATE | | {"tier": "tier_2"} |
(Note that in this case, the tier information wasn't sourced from the repository_metadata table, as that table didn't exist yet! Instead, it was manually correlated. However, this is a good example of how ingesting this data into dependency-management-data can provide additional, very useful context)
Query
To solve this problem, we could then run:
select
  renovate.organisation,
  renovate.repo,
  current_version,
  owner,
  json_extract(additional_metadata, '$.tier') as tier
from
  renovate
  left join owners on renovate.platform = owners.platform
  and renovate.organisation = owners.organisation
  and renovate.repo = owners.repo
  left join repository_metadata on renovate.platform = repository_metadata.platform
  and renovate.organisation = repository_metadata.organisation
  and renovate.repo = repository_metadata.repo
where
  -- NOTE: this is a lexicographic comparison, which is likely NOT what you
  -- want for proper version constraint matching, but it's a good start for
  -- these use cases
  renovate.current_version < '0.3'
order by
  tier ASC
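To see why the lexicographic caveat matters, compare string ordering with numeric ordering once a version component reaches two digits. A minimal sketch (the `semver_key` helper is illustrative, not part of dependency-management-data):

```python
# Why the string comparison in the query is only an approximation:
# lexicographic ordering breaks once a version component reaches two digits.
def semver_key(version: str) -> tuple:
    """Split '0.10.0' into (0, 10, 0) so comparisons are numeric."""
    return tuple(int(part) for part in version.split("."))

# Lexicographically, "0.10.0" sorts *before* "0.3", so a service on 0.10.0
# would wrongly be flagged by the `current_version < '0.3'` filter...
assert "0.10.0" < "0.3"

# ...whereas a numeric comparison gets it right: 0.10.0 is newer than 0.3.0.
assert semver_key("0.10.0") > semver_key("0.3.0")
assert semver_key("0.2.1") < semver_key("0.3.0")
```

For the versions in this incident (all below 1.0.0 with single-digit components), the string comparison happened to be safe.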
Based on the above data, we get the following results:
organisation | repo | current_version | owner | tier |
---|---|---|---|---|
deliveroo | affected-service | 0.2.1 | Grocery | tier_1 |
deliveroo | also-affected-service | 0.1.0 | | tier_2 |
This shows us the affected repos, each with an owner (if one was found) and the service's business criticality tier.
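The whole flow can be reproduced end-to-end against an in-memory SQLite database. This sketch uses pared-down versions of the three tables (only the columns the query touches; dependency-management-data's real schema has more):

```python
# Recreate minimal renovate/owners/repository_metadata tables and run the
# query from above, including the json_extract() call on additional_metadata.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table renovate (platform, organisation, repo, current_version);
create table owners (platform, organisation, repo, owner);
create table repository_metadata (platform, organisation, repo, additional_metadata);

insert into renovate values
  ('github', 'deliveroo', 'good-service',          '0.3.0'),
  ('github', 'deliveroo', 'affected-service',      '0.2.1'),
  ('github', 'deliveroo', 'also-affected-service', '0.1.0');

insert into owners values
  ('github', 'deliveroo', 'good-service',     'Restaurants'),
  ('github', 'deliveroo', 'affected-service', 'Grocery');

insert into repository_metadata values
  ('github', 'deliveroo', 'good-service',          '{"tier": "tier_1"}'),
  ('github', 'deliveroo', 'affected-service',      '{"tier": "tier_1"}'),
  ('github', 'deliveroo', 'also-affected-service', '{"tier": "tier_2"}');
""")

rows = conn.execute("""
  select renovate.repo, current_version, owner,
         json_extract(additional_metadata, '$.tier') as tier
  from renovate
  left join owners on renovate.platform = owners.platform
    and renovate.organisation = owners.organisation
    and renovate.repo = owners.repo
  left join repository_metadata on renovate.platform = repository_metadata.platform
    and renovate.organisation = repository_metadata.organisation
    and renovate.repo = repository_metadata.repo
  where renovate.current_version < '0.3'
  order by tier asc
""").fetchall()

for row in rows:
    print(row)
# ('affected-service', '0.2.1', 'Grocery', 'tier_1')
# ('also-affected-service', '0.1.0', None, 'tier_2')
```

Note how good-service (on the fixed 0.3.0) drops out of the results, and the left joins keep also-affected-service in the output with a NULL owner rather than silently discarding it.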