Data Collection Patterns
What tools shall I use?
Deciding which of the various tools to use can be a barrier to getting started with dependency-management-data, especially if you're already using a Software Composition Analysis (SCA) platform. Below, you can find a flowchart which will hopefully make the choices clearer.
You'll notice that there's a strong bias towards using renovate-graph. This is because, in practice, the data it provides has proven superior to that from the other tools available, and because support in the underlying Renovate project continues to grow.
A secondary preference is the use of Software Bills of Materials (SBOMs), if your platform produces them.
If you'd like to request a new datasource, please consider raising a feature request.
flowchart TD
    Start(How should I collect data for\nDependency Management Data?) --> P_
    P_{"Are you already \nonboarded to a Software \nComposition \nAnalysis (SCA)\n platform?"}
    P_ --> |Other|Other_GIT[Get in touch!]
    P_ --> |No|P
    P_ --> |Snyk|BetterDataSnyk{Want better data, for\n a little extra work?}
    BetterDataSnyk --> |No|Snyk[Export SBOMs from Snyk]
    BetterDataSnyk --> |Yes|RG
    P_ --> |"OSS Review\n Toolkit (ORT)"|BetterDataORT{Want better data, for\n a little extra work?}
    BetterDataORT --> |No|ORT[Export SBOMs from ORT]
    BetterDataORT --> |Yes|RG
    P{What code \nhosting service\n are you using?}
    P --> |GitHub|GitHub
    P --> |GitLab|GitLab
    P --> |Other|Other
    Other[Get in touch, but you \nshould be able to\n use renovate-graph] --> RG
    RG(Use renovate-graph)
    GitHub{Using GitHub Advanced \n Security \n or Dependabot?} --> |Yes|Dependabot(Use dependabot-graph)
    Dependabot --> BetterDataDependabot -->|Yes| RG
    BetterDataDependabot --> |No| Dependabot
    BetterDataDependabot{Want better data, for\n a little extra work?}
    BetterDataGitLab{Want better data, for\n a little extra work?}
    GitLab{Using GitLab \nEE Dependency\n Scanning?} --> |Yes|BetterDataGitLab
    BetterDataGitLab --> |No|GitLab_SBOM(Use Pipeline-specific\nCycloneDX SBOM exports)
    BetterDataGitLab --> |Yes| RG
Once you've chosen, you'll likely want to follow one of the following links:
- renovate-graph for using Renovate's excellent support for package ecosystems to extract dependency data
- dependabot-graph for using GitHub Advanced Security's Dependabot dependency graph functionality to extract dependency data
- Using dependency-management-data with GitLab's Pipeline-specific CycloneDX SBOM exports
How should I design my data collection process?
Now that you've chosen which tools you want to collect the data with, you need to decide how you're going to set up the actual retrieval of data.
All of this expects that there's a central Git repo that the data exports are committed to; this will be further documented soon.
You don't need to go through each of the stages one-by-one, and can instead jump to the point that fits your organisation.
Once you have the data, you'll want to store it in a central Git repo.
Locally fetched data
The simplest solution, which has been my starting point each time I've worked to onboard a new organisation, is to simply run the data collection process myself, on my local machine, and then push it up to a central Git repo.
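To make that concrete, below is a minimal sketch in Go of what that local loop could look like, assuming you've chosen renovate-graph. The repository list, the renovate-graph flags, and the out/ output directory are all assumptions to adjust for your own setup; check the renovate-graph README for the exact invocation your version expects.

```go
// A minimal sketch of running renovate-graph locally for a handful of
// repositories, then committing the results to the central Git repo.
// The renovate-graph invocation and output directory are assumptions.
package main

import (
	"log"
	"os"
	"os/exec"
)

// run executes a command, streaming its output, and stops on the first failure
func run(name string, args ...string) {
	cmd := exec.Command(name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("%s %v failed: %v", name, args, err)
	}
}

func main() {
	// Hypothetical list of repositories to process
	repos := []string{"my-org/service-a", "my-org/service-b"}

	for _, repo := range repos {
		// Assumes a GitHub token is available in the environment
		run("npx", "@jamietanna/renovate-graph@latest",
			"--token", os.Getenv("GITHUB_TOKEN"), repo)
	}

	// Commit the generated output (assumed to be written to ./out) into the
	// central Git repo this is being run from
	run("git", "add", "out")
	run("git", "commit", "-m", "Update dependency data exports")
	run("git", "push")
}
```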
Centralised, periodic data dump from CI platform
Next, we can start to automate the process. This may be a once-per-day process, depending on how large your organisation's set of projects is.
This uses a CI platform, such as GitHub Actions, GitLab CI, Jenkins, or Buildkite, to orchestrate the execution (with any parallelisation that may be necessary to speed up the process), after which the generated data is pushed to the central Git repo.
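As a rough illustration of the parallelisation aspect, here's a sketch of a bounded worker pool that a scheduled CI job could run. runExport is a placeholder for whichever collection tool you've chosen, and the repository list and pool size are arbitrary.

```go
// A sketch of fan-out parallelisation for a scheduled CI job: a fixed pool of
// goroutines each runs the chosen collection tool against one repo at a time.
package main

import (
	"log"
	"os/exec"
	"sync"
)

// runExport is a hypothetical wrapper around your chosen collection tool
func runExport(repo string) error {
	// Placeholder invocation; replace with your actual tooling and flags
	return exec.Command("npx", "@jamietanna/renovate-graph@latest", repo).Run()
}

func main() {
	repos := []string{"my-org/service-a", "my-org/service-b", "my-org/monolith"}

	jobs := make(chan string)
	var wg sync.WaitGroup

	// A small worker pool keeps resource usage bounded on the CI runner
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for repo := range jobs {
				if err := runExport(repo); err != nil {
					log.Printf("failed to export %s: %v", repo, err)
				}
			}
		}()
	}

	for _, repo := range repos {
		jobs <- repo
	}
	close(jobs)
	wg.Wait()
	// The CI job would then commit + push the generated data to the central repo
}
```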
Example
An example of this can be found with the example repo on GitLab.com.
Periodically updating, with multiple worker nodes
This pattern builds on the previous steps, turning them into more of a "production quality" architecture.
Ideally you would have the following components, which could be classed as microservices, or this could all exist within a single monolithic service:
- Worker: the worker is the component that processes data through the chosen tool, for instance retrieving SBOMs from the SCA platform you're using, or running renovate-graph against your repository. Horizontally scalable, to allow long-running exports (for instance against large monorepos or monoliths) to be processed at the same time as many other repos
- Scheduled executor: the scheduled executor will run e.g. 3 times a day, list all repositories that should be processed, and then trigger worker processes to handle the repos
- Writer (optional): takes the output data from the worker and writes it back to the central Git repo. This allows separating the collection and storage of the data, but can be merged into the worker process if deemed unnecessary.
This may involve queues or an event-based architecture, as in the sketch below.
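Here's a minimal sketch of how those components could fit together, using in-process channels as a stand-in for whatever queue or event bus you'd use in practice. listRepositories, collectData and commitToCentralRepo are hypothetical placeholders for your own implementations, and the schedule and worker count are arbitrary.

```go
// A sketch of the scheduled executor / worker / writer split, with channels
// standing in for a real queue or event bus.
package main

import (
	"log"
	"time"
)

type exportResult struct {
	Repo string
	Data []byte
}

// listRepositories would call your code hosting platform's API
func listRepositories() []string {
	return []string{"my-org/service-a", "my-org/service-b"}
}

// collectData would shell out to renovate-graph, or fetch SBOMs from your SCA platform
func collectData(repo string) []byte { return []byte("...") }

// commitToCentralRepo would write the export into the central Git repo
func commitToCentralRepo(r exportResult) { log.Printf("stored data for %s", r.Repo) }

func main() {
	work := make(chan string)          // stands in for the work queue
	results := make(chan exportResult) // stands in for the writer's queue

	// Workers: horizontally scalable in a real deployment
	for i := 0; i < 3; i++ {
		go func() {
			for repo := range work {
				results <- exportResult{Repo: repo, Data: collectData(repo)}
			}
		}()
	}

	// Writer: serialises writes/pushes to the central Git repo
	go func() {
		for r := range results {
			commitToCentralRepo(r)
		}
	}()

	// Scheduled executor: e.g. every 8 hours, enqueue every repository
	for range time.Tick(8 * time.Hour) {
		for _, repo := range listRepositories() {
			work <- repo
		}
	}
}
```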
... with real-time updates
This is an evolution on top of the Periodically updating, with multiple worker nodes pattern above, and adds on the ability to perform real-time scans of repositories.
This adds an additional component:
- Webhook processor: a component to handle incoming webhook events, verifying the signature, confirming whether it requires a re-scan of a repo, and then, if so, triggering a worker to process it
A pattern that works well for this is to listen for all push events and ignore any pushes to non-default branches. You could additionally look at the changed files within the event, and only trigger a scan if dependency-related files, such as a go.mod or build.gradle.kts, were updated, to limit the amount of processing you're performing.
This requires you to integrate with your source control platform, for instance using a GitHub App that's auto-installed across the organisation, or a GitLab webhook enabled at the Group/Project level.
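As a rough sketch of what the webhook processor could look like for GitHub push events, the following Go program verifies the X-Hub-Signature-256 header, ignores pushes to non-default branches, and only triggers a re-scan when dependency-related files have changed. triggerWorker, the endpoint path, the environment variable name and the list of dependency files are assumptions to adapt to your own setup.

```go
// A sketch of a webhook processor for GitHub push events: verify the
// signature, filter to default-branch pushes that touch dependency files,
// then hand the repo off to a worker.
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"io"
	"log"
	"net/http"
	"os"
	"strings"
)

// Files that should trigger a re-scan when changed (extend for your ecosystems)
var dependencyFiles = []string{"go.mod", "build.gradle.kts", "package-lock.json"}

type pushEvent struct {
	Ref        string `json:"ref"`
	Repository struct {
		FullName      string `json:"full_name"`
		DefaultBranch string `json:"default_branch"`
	} `json:"repository"`
	Commits []struct {
		Added    []string `json:"added"`
		Modified []string `json:"modified"`
	} `json:"commits"`
}

// validSignature checks the HMAC-SHA256 signature GitHub sends with each delivery
func validSignature(body []byte, header, secret string) bool {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write(body)
	expected := "sha256=" + hex.EncodeToString(mac.Sum(nil))
	return hmac.Equal([]byte(expected), []byte(header))
}

func touchesDependencyFiles(ev pushEvent) bool {
	for _, c := range ev.Commits {
		for _, f := range append(c.Added, c.Modified...) {
			for _, dep := range dependencyFiles {
				if strings.HasSuffix(f, dep) {
					return true
				}
			}
		}
	}
	return false
}

// triggerWorker is a hypothetical hook into your worker fleet
func triggerWorker(repo string) { /* enqueue repo for re-scanning */ }

func handleWebhook(w http.ResponseWriter, r *http.Request) {
	body, _ := io.ReadAll(r.Body)
	if !validSignature(body, r.Header.Get("X-Hub-Signature-256"), os.Getenv("WEBHOOK_SECRET")) {
		http.Error(w, "invalid signature", http.StatusUnauthorized)
		return
	}

	var ev pushEvent
	if err := json.Unmarshal(body, &ev); err != nil {
		http.Error(w, "bad payload", http.StatusBadRequest)
		return
	}

	// Only re-scan pushes to the default branch that touch dependency files
	if ev.Ref == "refs/heads/"+ev.Repository.DefaultBranch && touchesDependencyFiles(ev) {
		triggerWorker(ev.Repository.FullName)
	}
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	http.HandleFunc("/webhooks/github", handleWebhook)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```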