GitHub Actions: Data Flow & Data Persistence
In GitHub Actions, by default, data is not inherently persistent or available to the whole pipeline. Every step runs in its own process, and every job runs on its own runner. Whatever data a job produces ends with it.
How do we pass data from one process to another, or save it for the next one?
A short, sweet answer:
Strategy | Data | Scope | Persistence | Explanation | Example |
---|---|---|---|---|---|
env | Values | Job (internal) | Ephemeral | Propagates data between steps in the same job | Pass a boolean to control whether the next step should run |
outputs | Values | Workflow (internal) | Ephemeral | Propagates data between jobs/steps in the same workflow | Pass a deployment id to the next job |
artefacts | Files | Workflow (internal & external) | Persistent | Propagates files between jobs/workflows | Pass the project build to different test jobs running in parallel. Intended for frequently changing data. Files are available for download after the workflow finishes. |
cache | Files | Workflow (internal & external) | Persistent | Propagates files inside and between workflows in the same repository | Cache npm packages for use in different workflow runs. Intended for files that don't change much. |
For a more complete answer, read on.
All the workflow examples in this article can be found as files here, along with a copy of the respective redacted logs.
Using env
It's pretty simple to create a data flow between steps: define a key-value pair and write it to the GITHUB_ENV
environment file, using the appropriate syntax for your shell. See examples below in bash and python:
steps:
  - name: Two ways to set environment variables with bash
    # Warning: in this step, the input is not sanitized or validated
    shell: bash
    run: |
      # No print to the logs.
      random_wiki_article_1=$(curl -L -X GET "https://en.wikipedia.org/api/rest_v1/page/random/summary" | jq .title)
      echo "$random_wiki_article_1"
      echo "ARTICLE_1=$random_wiki_article_1" >> "$GITHUB_ENV"
      # 🐉 Print the variable in the logs: only for non-sensitive data!
      random_wiki_article_2=$(curl -L -X GET "https://en.wikipedia.org/api/rest_v1/page/random/summary" | jq .title)
      echo "ARTICLE_2=$random_wiki_article_2" | tee -a "$GITHUB_ENV"
  - name: Set environment variables with Python
    shell: python
    # if using "write", use \n when creating multiple vars
    # with "print", you can omit \n
    run: |
      from os import environ as env
      with open(env.get('GITHUB_ENV', None), 'a') as ghenv:
          ghenv.write("SUBJECT=Sun\n")
          print("STATE=radiant", file=ghenv)
          print("TIME=today", file=ghenv)
  - name: 🛡️ Retrieving values securely
    # Observe that ARTICLE_1 was not sanitized or validated, so it's vulnerable to injection attacks.
    # The approach below prevents the issue by exposing env.ARTICLE_1 to the script as an intermediate environment variable.
    # It also gives you the chance to rename the variables.
    env:
      WHO: ${{ env.SUBJECT }}
      WHAT: ${{ env.ARTICLE_1 }}
      WHEN: ${{ env.TIME }}
    run: |
      echo "$WHO read about $WHAT $WHEN."
  - name: 🐉 Retrieving values in a potentially vulnerable way
    # This approach is vulnerable to injection attacks!
    # Only use it if you have control over the input.
    shell: bash
    run: |
      echo "${{ env.SUBJECT }} is ${{ env.STATE }} ${{ env.TIME }}."
Debugging tip
To list all the environment variables available in a job, add this tiny step:
- run: env
Using outputs
Outputs are available to all steps in the same job, and to any subsequent job that needs them.
The output is always a Unicode string.
And obviously, jobs that depend on an output will not run in parallel with the job that produces it.
For simplicity, I show how to set the output in bash, but you can use any shell of your choice.
jobs:
  setting-outputs:
    runs-on: ubuntu-latest
    outputs: # Required: declare the outputs at the job level so they're available to other jobs
      person_name: ${{ steps.use-hardcoded-value.outputs.NAME }}
      location: ${{ steps.use-dynamic-value.outputs.LOCATION }}
    steps:
      - id: use-hardcoded-value
        run: |
          echo "NAME=Marcela" >> "$GITHUB_OUTPUT"
      - id: use-dynamic-value
        # note the use of jq -c to get the value as a single line
        run: |
          location=$(curl -H "Accept: application/json" https://randomuser.me/api/ | jq -c .results[].location)
          echo "LOCATION=$location" >> "$GITHUB_OUTPUT"
  retrieving-outputs:
    runs-on: ubuntu-latest
    needs: setting-outputs
    steps:
      - name: Greet to location
        run: |
          COUNTRY=$(echo "$GEODATA" | jq -r .country)
          echo "Hello $NAME, welcome to $COUNTRY!"
        env:
          NAME: ${{ needs.setting-outputs.outputs.person_name }}
          GEODATA: ${{ needs.setting-outputs.outputs.location }}
Even though it's recommended to use env
to pass data between steps, outputs
can be used for that purpose as well. This is useful when a value is required both in the current job and in subsequent jobs.
The previous example showed how to use outputs in different jobs.
To consume an output in the same job as well, simply add a step like the last one in the code below.
jobs:
  extract:
    runs-on: ubuntu-latest
    outputs:
      person_name: ${{ steps.generate-hardcoded-value.outputs.NAME }}
      location: ${{ steps.generate-dynamic-value.outputs.LOCATION }}
    steps:
      - id: generate-hardcoded-value
        run: |
          echo "NAME=Marcela" >> "$GITHUB_OUTPUT"
      - id: generate-dynamic-value
        # note the use of jq -c to get the value as a single line
        run: |
          location=$(curl -H "Accept: application/json" https://randomuser.me/api/ | jq -c .results[].location)
          echo "LOCATION=$location" >> "$GITHUB_OUTPUT"
      - name: Consume output in same job
        run: |
          echo "$PERSON, you're in $GEODATA, so we've updated your timezone to GMT$OFFSET."
        env:
          PERSON: ${{ steps.generate-hardcoded-value.outputs.NAME }}
          # use fromJSON() when filtering the output value at the env level
          # See more about object filtering in
          # https://docs.github.com/en/actions/learn-github-actions/expressions#object-filters
          GEODATA: ${{ fromJSON(steps.generate-dynamic-value.outputs.LOCATION).country }}
          OFFSET: ${{ fromJSON(steps.generate-dynamic-value.outputs.LOCATION).timezone.offset }}
(...)
- An individual output should be 1MB max.
- All outputs combined should not exceed 50MB.
`GITHUB_OUTPUT` expects a one-line string.
If you need a multiline output, assign it to a variable and write it to the output file with heredoc-style delimiters: `echo "PAYLOAD_NAME<<EOF"$'\n'"$payload_var"$'\n'EOF >> "$GITHUB_OUTPUT"`.
Using artefacts
From the docs: "Use artifacts when you want to save files produced by a job to view after a workflow run has ended, such as built binaries or build logs."
Uploading artefacts
You can:
- select one or multiple files to be bundled as an artifact.
- use wildcards, multiple paths and exclusion patterns in the usual GitHub Actions syntax.
- set a retention period for the artefact.
jobs:
  upload:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Upload log files
        uses: actions/upload-artifact@v4
        with:
          name: all-logs # artefact name
          # path to files to be included in the artefact;
          # relative paths are rooted against $GITHUB_WORKSPACE
          path: |
            **/log*.txt
          retention-days: 1
          if-no-files-found: error # force the step to fail if the content for the artefact is not found
Note that the maximum retention period can be defined at the repo, organisation, or enterprise level. There's a max of 90 days for public repos and 400 days for private repos. If you lower the retention period, you'll have more non-billed space ;)
Downloading artefacts
To retrieve the artefact, you can use:
- the GitHub UI
- the GitHub API
- the `gh` CLI (see the sketch after this list)
- the official `actions/download-artifact` action, if you need to retrieve artefacts programmatically. From `v4`, the action allows you to download artefacts from different workflow runs or repos, as long as you provide a token. (🛡️: it's recommended to use a GitHub App rather than a PAT for professional projects.)
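For the `gh` CLI route, here's a minimal sketch run from a local clone; the run id is a placeholder, and the artefact name is the one from the example above:

```bash
# list recent runs to find the id of the run that produced the artefact
gh run list --limit 5

# download the "all-logs" artefact from that run into ./logs (1234567890 is a placeholder id)
gh run download 1234567890 --name all-logs --dir ./logs
```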
Let's see how to retrieve the artefact we created in the previous example using `actions/download-artifact`:
  download:
    runs-on: ubuntu-latest
    needs: upload
    steps:
      - name: Download log files
        id: download-artifacts
        uses: actions/download-artifact@v4
        with:
          name: all-logs # note it uses the name defined in the upload step
      - name: Pass artifact path to python
        shell: python
        run: |
          import os
          from glob import glob

          artifact_path = os.environ.get("ARTIFACT_PATH", "")
          glob_list = glob(artifact_path + "/*.txt")
          for filename in glob_list:
              with open(filename, "r", encoding="UTF-8") as f:
                  content = f.read()
                  print(content)
        env:
          ARTIFACT_PATH: ${{ steps.download-artifacts.outputs.download-path }}
All the zipping and unzipping of the artifacts is automatically handled by the actions.
Deleting artefacts
To delete an artefact, you can:
- use the GitHub UI
- use the GitHub API
- write a custom script using the GitHub API, or use a community action (see the sketch after this list).
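As an illustration of the scripted route, here's a minimal sketch using the `gh` CLI as a thin wrapper around the API (not from the original post). It assumes you run it inside a clone of the repo, so the `{owner}/{repo}` placeholders resolve, and the artefact id is made up:

```bash
# list artefact ids and names for the current repository
gh api repos/{owner}/{repo}/actions/artifacts --jq '.artifacts[] | [.id, .name] | @tsv'

# delete a single artefact by id (123456 is a placeholder)
gh api --method DELETE repos/{owner}/{repo}/actions/artifacts/123456
```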
Using cache
Do not store sensitive information in the cache (beware of configuration files containing secrets), as the cache is accessible to anyone who can create a PR on the repository, even on forks.
When we're handling data that is pretty stable and repeatedly used (like dependencies), we can do better than re-generating them every time: we can cache them for better performance.
In the example below, we're caching the `pip` dependencies for a Python project. Note that we have added the cache step before the `pip install` step: the idea is that the install only runs if there is no valid cache available:
jobs:
  cache:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
          cache: 'pip'
          cache-dependency-path: |
            **/requirements.txt
      - name: Get pip cache dir
        id: pip-cache
        run: |
          echo "dir=$(pip cache dir)" >> "$GITHUB_OUTPUT"
      # The cache action requires a `path` to the cache and a `key`.
      # The `key` is used to retrieve the cache and to recreate it next time.
      - name: Handle cache for Python dependencies
        uses: actions/cache@v3
        id: cache
        with:
          # path: location of files to cache
          path: ${{ steps.pip-cache.outputs.dir }}
          # key: unique id used to retrieve and recreate the cache
          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
      - name: Install dependencies if not found in cache
        if: steps.cache.outputs.cache-hit != 'true'
        run: pip install -r requirements.txt
Setting the cache
The first time the workflow runs, the cache is obviously empty. Therefore, the output `cache-hit` (native to the official `actions/cache` action) will return `false`, which in turn makes our workflow run the install step.
Check logs
However, a small bit of magic happens too: a post-cache step, automatically added by `actions/cache` at the end of the job, will look at the key you provided and add the files to the cache.
Retrieving the cache
As long as nothing has changed in your dependency manifest, the next time `actions/cache` runs for that path and key, the action will find a `cache-hit` and the workflow will safely skip the install step.
Check logs
Updating the cache
Have you noticed the `hashFiles` function used in the `key` argument?
It's a function provided by GitHub Actions that creates a unique hash from the contents of the files matching the given path. When the hash value doesn't match, it means that there was a change in those files, in our case, the dependency manifest.
If the dependencies changed (even a single patch version), the old cache is no good anymore, and the `cache-hit` output will allow `pip install` to run. And then we're back to square one: dependencies are installed and the cache is updated by the post-cache step.
Check the logs
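If you want a fallback when the exact key doesn't match, `actions/cache` also accepts `restore-keys`: a list of key prefixes used to restore the closest older cache. A minimal sketch adapting the step above; the `restore-keys` lines are my addition, not part of the original workflow:

```yaml
      - name: Handle cache for Python dependencies
        uses: actions/cache@v3
        id: cache
        with:
          path: ${{ steps.pip-cache.outputs.dir }}
          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
          # fallback: restore the most recent cache whose key starts with this prefix
          restore-keys: |
            ${{ runner.os }}-pip-
```

Note that on a partial match `cache-hit` is still `false`, so the install step runs anyway, but pip can reuse most of the restored packages.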
Last notes about caching
- If you’re using self-hosted runners, the option to self-store the cache is only available in Enterprise plans.
- The `actions/cache` action manages the cache centrally. This means that the cache is available to (and updatable by) all jobs in the same repository, and even to other workflows.
- Read more about caching strategies here.
That was a long post, phew.
See you later! :)