Data Collection Patterns
What tools shall I use?
Deciding which of the various tools to use can be a barrier to getting started with dependency-management-data, especially if you're already using a Software Composition Analysis (SCA) platform. Below, you can find a flowchart which will hopefully make the choices clearer.
You'll notice that there's a strong bias towards using renovate-graph. This is because, in practice, the data it provides has proven superior to that from the other tools available, and because support in the underlying Renovate project continues to grow.
A secondary preference is the use of Software Bills of Materials (SBOMs), if your platform produces them.
If you'd like to request a new datasource, please consider raising a feature request.
flowchart TD
    Start(How should I collect data for\nDependency Management Data?) --> P_
    P_{"Are you already \nonboarded to a Software \nComposition \nAnalysis (SCA)\n platform?"}
    P_ --> |Other|Other_GIT[Get in touch!]
    P_ --> |No|P
    P_ --> |Snyk|BetterDataSnyk{Want better data, for\n a little extra work?}
    BetterDataSnyk --> |No|Snyk[Export SBOMs from Snyk]
    BetterDataSnyk --> |Yes|RG
    P_ --> |"OSS Review\n Toolkit (ORT)"|BetterDataORT{Want better data, for\n a little extra work?}
    BetterDataORT --> |No|ORT[Export SBOMs from ORT]
    BetterDataORT --> |Yes|RG
    P{What code \nhosting service\n are you using?}
    P --> |GitHub|GitHub
    P --> |GitLab|GitLab
    P --> |Other|Other
    Other[Get in touch, but you \nshould be able to\n use renovate-graph] --> RG
    RG(Use renovate-graph)
    GitHub{Using GitHub Advanced \n Security \n or Dependabot?} --> |Yes|Dependabot(Use dependabot-graph)
    Dependabot --> BetterDataDependabot -->|Yes| RG
    BetterDataDependabot --> |No| Dependabot
    BetterDataDependabot{Want better data, for\n a little extra work?}
    BetterDataGitLab{Want better data, for\n a little extra work?}
    GitLab{Using GitLab \nEE Dependency\n Scanning?} --> |Yes|BetterDataGitLab
    BetterDataGitLab --> |No|GitLab_SBOM(Use Pipeline-specific\nCycloneDX SBOM exports)
    BetterDataGitLab --> |Yes| RG
Once you've chosen, you'll likely want to follow one of the following links:
- renovate-graph for using Renovate's excellent support for package ecosystems to extract dependency data
- dependabot-graph for using GitHub Advanced Security's Dependabot dependency graph functionality to extract dependency data
- Using dependency-management-data with GitLab's Pipeline-specific CycloneDX SBOM exports
How should I design my data collection process?
Now that you've chosen which tools you want to collect the data with, you need to decide how you're going to set up the actual retrieval of data.
All of this expects that there's a central Git repo that the data exports are committed to; this will be further documented soon.
You don't need to go through each of the stages one-by-one, and can instead jump to the point that fits your organisation.
Once you have the data, you'll want to store it in a central Git repo.
Locally fetched data
The simplest solution, which has been my starting point each time I've worked to onboard a new organisation, is to simply run the data collection process myself, on my local machine, and then push it up to a central Git repo.
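To make that concrete, below is a minimal sketch in Go of what that local loop could look like, assuming you've chosen renovate-graph. The repository list, the renovate-graph flags, and the out/ output directory are all assumptions to adjust for your own setup; check the renovate-graph README for the exact invocation your version expects.

```go
// A minimal sketch of running renovate-graph locally for a handful of
// repositories, then committing the results to the central Git repo.
// The renovate-graph invocation and output directory are assumptions.
package main

import (
	"log"
	"os"
	"os/exec"
)

// run executes a command, streaming its output, and stops on the first failure
func run(name string, args ...string) {
	cmd := exec.Command(name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("%s %v failed: %v", name, args, err)
	}
}

func main() {
	// Hypothetical list of repositories to process
	repos := []string{"my-org/service-a", "my-org/service-b"}

	for _, repo := range repos {
		// Assumes a GitHub token is available in the environment
		run("npx", "@jamietanna/renovate-graph@latest",
			"--token", os.Getenv("GITHUB_TOKEN"), repo)
	}

	// Commit the generated output (assumed to be written to ./out) into the
	// central Git repo this is being run from
	run("git", "add", "out")
	run("git", "commit", "-m", "Update dependency data exports")
	run("git", "push")
}
```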
Centralised, periodic data dump from CI platform
Next, we can start to automate the process. This may be a once-per-day process, depending on how large your organisation's set of projects is.
This uses a CI platform, such as GitHub Actions, GitLab CI, Jenkins, or Buildkite, to orchestrate the execution (with any parallelisation that may be necessary to speed up the process), after which the generated data is pushed to the central Git repo.
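As a rough illustration of the parallelisation aspect, here's a sketch of a bounded worker pool that a scheduled CI job could run. runExport is a placeholder for whichever collection tool you've chosen, and the repository list and pool size are arbitrary.

```go
// A sketch of fan-out parallelisation for a scheduled CI job: a fixed pool of
// goroutines each runs the chosen collection tool against one repo at a time.
package main

import (
	"log"
	"os/exec"
	"sync"
)

// runExport is a hypothetical wrapper around your chosen collection tool
func runExport(repo string) error {
	// Placeholder invocation; replace with your actual tooling and flags
	return exec.Command("npx", "@jamietanna/renovate-graph@latest", repo).Run()
}

func main() {
	repos := []string{"my-org/service-a", "my-org/service-b", "my-org/monolith"}

	jobs := make(chan string)
	var wg sync.WaitGroup

	// A small worker pool keeps resource usage bounded on the CI runner
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for repo := range jobs {
				if err := runExport(repo); err != nil {
					log.Printf("failed to export %s: %v", repo, err)
				}
			}
		}()
	}

	for _, repo := range repos {
		jobs <- repo
	}
	close(jobs)
	wg.Wait()
	// The CI job would then commit + push the generated data to the central repo
}
```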
Example
An example of this can be found with the example repo on GitLab.com.
Periodically updating, with multiple worker nodes
This pattern builds on the previous steps, turning them into more of a "production quality" architecture.
Ideally you would have the following components, which could be classed as microservices, or this could all exist within a single monolithic service:
- Worker: the worker is the component that processes data through the chosen tool, for instance retrieving SBOMs from the SCA platform you're using, or running renovate-graph against your repository. Horizontally scalable, to allow long-running exports (for instance against large monorepos or monoliths) to be processed at the same time as many other repos
- Scheduled executor: the scheduled executor will run e.g. 3 times a day, list all repositories that should be processed, and then trigger worker processes to handle the repos
- Writer (optional): takes the output data from the worker and writes it back to the central Git repo. This allows separating the collection and storage of the data, but can be merged into the worker process if deemed unnecessary.
This may involve queues or an event-based architecture, as in the sketch below.
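Here's a minimal sketch of how those components could fit together, using in-process channels as a stand-in for whatever queue or event bus you'd use in practice. listRepositories, collectData and commitToCentralRepo are hypothetical placeholders for your own implementations, and the schedule and worker count are arbitrary.

```go
// A sketch of the scheduled executor / worker / writer split, with channels
// standing in for a real queue or event bus.
package main

import (
	"log"
	"time"
)

type exportResult struct {
	Repo string
	Data []byte
}

// listRepositories would call your code hosting platform's API
func listRepositories() []string {
	return []string{"my-org/service-a", "my-org/service-b"}
}

// collectData would shell out to renovate-graph, or fetch SBOMs from your SCA platform
func collectData(repo string) []byte { return []byte("...") }

// commitToCentralRepo would write the export into the central Git repo
func commitToCentralRepo(r exportResult) { log.Printf("stored data for %s", r.Repo) }

func main() {
	work := make(chan string)          // stands in for the work queue
	results := make(chan exportResult) // stands in for the writer's queue

	// Workers: horizontally scalable in a real deployment
	for i := 0; i < 3; i++ {
		go func() {
			for repo := range work {
				results <- exportResult{Repo: repo, Data: collectData(repo)}
			}
		}()
	}

	// Writer: serialises writes/pushes to the central Git repo
	go func() {
		for r := range results {
			commitToCentralRepo(r)
		}
	}()

	// Scheduled executor: e.g. every 8 hours, enqueue every repository
	for range time.Tick(8 * time.Hour) {
		for _, repo := range listRepositories() {
			work <- repo
		}
	}
}
```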
... with real-time updates
This is an evolution on top of the Periodically updating, with multiple worker nodes pattern above, and adds on the ability to perform real-time scans of repositories.
This adds an additional component:
- Webhook processor: a component to handle incoming webhook events, verifying the signature, confirming whether it requires a re-scan of a repo, and then, if so, triggering a worker to process it
A pattern that works well for this is to listen for all push events and ignore any pushes to non-default branches. You could additionally look at the changed files within the event, and only trigger a scan if dependency-related files, such as a go.mod or build.gradle.kts, were updated, to limit the amount of processing you're performing.
This requires you to integrate with your source control platform, for instance using a GitHub App that's auto-installed across the organisation, or a GitLab webhook enabled at the Group/Project level.
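As a rough sketch of what the webhook processor could look like for GitHub push events, the following Go program verifies the X-Hub-Signature-256 header, ignores pushes to non-default branches, and only triggers a re-scan when dependency-related files have changed. triggerWorker, the endpoint path, the environment variable name and the list of dependency files are assumptions to adapt to your own setup.

```go
// A sketch of a webhook processor for GitHub push events: verify the
// signature, filter to default-branch pushes that touch dependency files,
// then hand the repo off to a worker.
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"io"
	"log"
	"net/http"
	"os"
	"strings"
)

// Files that should trigger a re-scan when changed (extend for your ecosystems)
var dependencyFiles = []string{"go.mod", "build.gradle.kts", "package-lock.json"}

type pushEvent struct {
	Ref        string `json:"ref"`
	Repository struct {
		FullName      string `json:"full_name"`
		DefaultBranch string `json:"default_branch"`
	} `json:"repository"`
	Commits []struct {
		Added    []string `json:"added"`
		Modified []string `json:"modified"`
	} `json:"commits"`
}

// validSignature checks the HMAC-SHA256 signature GitHub sends with each delivery
func validSignature(body []byte, header, secret string) bool {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write(body)
	expected := "sha256=" + hex.EncodeToString(mac.Sum(nil))
	return hmac.Equal([]byte(expected), []byte(header))
}

func touchesDependencyFiles(ev pushEvent) bool {
	for _, c := range ev.Commits {
		for _, f := range append(c.Added, c.Modified...) {
			for _, dep := range dependencyFiles {
				if strings.HasSuffix(f, dep) {
					return true
				}
			}
		}
	}
	return false
}

// triggerWorker is a hypothetical hook into your worker fleet
func triggerWorker(repo string) { /* enqueue repo for re-scanning */ }

func handleWebhook(w http.ResponseWriter, r *http.Request) {
	body, _ := io.ReadAll(r.Body)
	if !validSignature(body, r.Header.Get("X-Hub-Signature-256"), os.Getenv("WEBHOOK_SECRET")) {
		http.Error(w, "invalid signature", http.StatusUnauthorized)
		return
	}

	var ev pushEvent
	if err := json.Unmarshal(body, &ev); err != nil {
		http.Error(w, "bad payload", http.StatusBadRequest)
		return
	}

	// Only re-scan pushes to the default branch that touch dependency files
	if ev.Ref == "refs/heads/"+ev.Repository.DefaultBranch && touchesDependencyFiles(ev) {
		triggerWorker(ev.Repository.FullName)
	}
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	http.HandleFunc("/webhooks/github", handleWebhook)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```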