
GitHub Actions: Data Flow & Data Persistence

· 10 min read
Manu Magalhães
DevSecOps Engineer

In GitHub Actions, data is not inherently persistent or available to the whole pipeline. Every step runs in its own process, and every job runs on its own runner. By default, whatever data emerges in a job ends with it.

How do we pass data from one process to the other, or save it for the next process?

A short, sweet answer:

| Strategy | Data | Scope | Persistence | Explanation | Example |
| --- | --- | --- | --- | --- | --- |
| env | Values | Job (internal) | Ephemeral | Propagates data between steps in the same job | Pass a boolean to control whether the next step should run |
| outputs | Values | Workflow (internal) | Ephemeral | Propagates data between jobs/steps in the same workflow | Pass a deployment id to the next job |
| artefacts | Files | Workflow (internal & external) | Persistent | Propagates files between jobs/workflows. Intended for frequently changing data; files are available for download after the workflow finishes. | Pass the project build to different test jobs running in parallel |
| cache | Files | Workflow (internal & external) | Persistent | Propagates files inside and between workflows in the same repository. Intended for files that don't change much. | Cache npm packages for use in different workflow runs |

For a more complete answer: read on.
All the workflow examples in this article can be found as files here, along with a copy of the respective redacted logs.

Using env

It's pretty simple to create a data flow between steps: define a key-value pair and write it to the GITHUB_ENV environment file, using the appropriate syntax for your shell. See examples below in bash and python:


/.github/workflows/using_env.yaml
    steps:
      - name: Two ways to set environment variable with sh
        # Warning: in this step, the input is not sanitized or validated
        shell: bash
        run: |
          # Option 1: append to GITHUB_ENV without echoing the assignment to the logs.
          random_wiki_article_1=$(curl -L -X GET "https://en.wikipedia.org/api/rest_v1/page/random/summary" | jq .title)
          echo "$random_wiki_article_1"
          echo "ARTICLE_1=$random_wiki_article_1" >> "$GITHUB_ENV"

          # 🐉 Option 2: tee prints the assignment in the logs: only for non-sensitive data!
          random_wiki_article_2=$(curl -L -X GET "https://en.wikipedia.org/api/rest_v1/page/random/summary" | jq .title)
          echo "ARTICLE_2=$random_wiki_article_2" | tee -a "$GITHUB_ENV"

      - name: Set environment variable with python
        shell: python
        # if using "write", use \n when creating multiple vars
        # with "print", you can omit \n
        run: |
          from os import environ as env
          with open(env.get('GITHUB_ENV', None), 'a') as ghenv:
              ghenv.write("SUBJECT=Sun\n")
              print("STATE=radiant", file=ghenv)
              print("TIME=today", file=ghenv)

      - name: 🛡️ Retrieving values securely
        # Observe that ARTICLE_1 was not sanitized or validated, so it's vulnerable to injection attacks.
        # The approach below prevents the issue by passing env.ARTICLE_1 to the script as an environment variable.
        # It also gives you the chance to rename the variables.
        env:
          WHO: ${{ env.SUBJECT }}
          WHAT: ${{ env.ARTICLE_1 }}
          WHEN: ${{ env.TIME }}
        run: |
          echo "$WHO read about $WHAT $WHEN."

      - name: 🐉 Retrieving values in a potentially vulnerable way
        # This approach is vulnerable to injection attacks!
        # Only use it if you have control over the input
        shell: bash
        run: |
          echo "${{ env.SUBJECT }} is ${{ env.STATE }} ${{ env.TIME }}."

Debugging tip

To list all the environment variables available in a job, add this tiny step:

- run: env

Using outputs

Outputs are available to all steps in the same job, and to any subsequent job that needs them.
An output is always a Unicode string.

And obviously, jobs that depend on an output will not run in parallel with the job that produces the output.


For simplicity, I show how to set the output in bash, but you can use any shell of your choice.

/.github/workflows/outputs-for-different-job.yaml
jobs:
  setting-outputs:
    runs-on: ubuntu-latest
    outputs: # Required: name the outputs at the job level so they're available to other jobs
      person_name: ${{ steps.use-hardcoded-value.outputs.NAME }}
      location: ${{ steps.use-dynamic-value.outputs.LOCATION }}
    steps:
      - id: use-hardcoded-value
        run: |
          echo "NAME=Marcela" >> "$GITHUB_OUTPUT"

      - id: use-dynamic-value
        # note the use of jq -c to get the value as a single line
        run: |
          location=$(curl -H "Accept: application/json" https://randomuser.me/api/ | jq -c .results[].location)
          echo "LOCATION=$location" >> "$GITHUB_OUTPUT"

  retrieving-outputs:
    runs-on: ubuntu-latest
    needs: setting-outputs
    steps:
      - name: Greet to location
        run: |
          COUNTRY=$(echo "$GEODATA" | jq -r .country)
          echo "Hello $NAME, welcome to $COUNTRY!"
        env:
          NAME: ${{ needs.setting-outputs.outputs.person_name }}
          GEODATA: ${{ needs.setting-outputs.outputs.location }}

Even though it's recommended to use env to pass data between steps, outputs can be used for that purpose as well. This is useful when a value is required both in the current job and in subsequent jobs.


The previous example showed how to use outputs in different jobs.
To use an output in the same job, simply add a step like "Consume output in same job" below.

/.github/workflows/outputs-for-same-job.yaml
jobs:
  extract:
    runs-on: ubuntu-latest
    outputs:
      person_name: ${{ steps.generate-hardcoded-value.outputs.NAME }}
      location: ${{ steps.generate-dynamic-value.outputs.LOCATION }}
    steps:
      - id: generate-hardcoded-value
        run: |
          echo "NAME=Marcela" >> "$GITHUB_OUTPUT"
      - id: generate-dynamic-value
        # note the use of jq -c to get the value as a single line
        run: |
          location=$(curl -H "Accept: application/json" https://randomuser.me/api/ | jq -c .results[].location)
          echo "LOCATION=$location" >> "$GITHUB_OUTPUT"
      - name: Consume output in same job
        run: |
          echo "$PERSON, you're in $GEODATA, so we've updated your timezone to GMT$OFFSET."
        env:
          PERSON: ${{ steps.generate-hardcoded-value.outputs.NAME }}
          # use fromJSON() when filtering the output value at the env level
          # See more about object filtering in
          # https://docs.github.com/en/actions/learn-github-actions/expressions#object-filters
          GEODATA: ${{ fromJSON(steps.generate-dynamic-value.outputs.LOCATION).country }}
          OFFSET: ${{ fromJSON(steps.generate-dynamic-value.outputs.LOCATION).timezone.offset }}

(...)
Helpful debugging info
  • An individual output is limited to 1 MB.
  • All outputs combined must not exceed 50 MB.

Real life XP

GITHUB_OUTPUT expects a one-line string.
If you need a multiline output, assign it to a variable and write to the output as follows:

echo "PAYLOAD_NAME<<EOF"$'\n'"$payload_var"$'\n'EOF >> "$GITHUB_OUTPUT".

Using artefacts

From the docs: "Use artifacts when you want to save files produced by a job to view after a workflow run has ended, such as built binaries or build logs."

Uploading artefacts

You can:

  • select one or multiple files to be bundled as an artifact.
  • use wildcards, multiple paths and exclusion patterns in the usual GitHub Actions syntax.
  • set a retention period for the artefact.
/.github/workflows/handle-artefacts.yaml
jobs:
  upload:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Upload log files
        uses: actions/upload-artifact@v4
        with:
          name: all-logs # artefact name
          # relative paths are rooted against $GITHUB_WORKSPACE
          path: | # path to files to be included in the artefact
            **/log*.txt
          retention-days: 1
          if-no-files-found: error # force step to fail if the content for the artefact is not found

Note that the maximum retention period can be defined at repo, organisation, or enterprise level. There's a max of 90 days for public repos and 400 days for private repos. If you lower the retention period, you'll have more non-billed space ;)

Downloading artefacts

To retrieve the artefact, you can use:

  • the GitHub UI
  • the GitHub API
  • the gh CLI (a quick sketch follows this list)
  • the official actions/download-artifact action, if you need to retrieve artefacts programmatically. From v4, the action allows you to download artefacts from different workflows or repos, as long as you provide a token. (🛡️: it's recommended to use a GitHub App rather than a PAT for professional projects.)
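
For instance, a minimal gh CLI sketch for the artefact created above (the run id and target directory are illustrative):

# download the artefact named "all-logs" from a specific workflow run
gh run download 1234567890 --name all-logs --dir ./downloaded-logs

If you omit the run id, gh prompts you to pick a recent run interactively.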

Let's see how to retrieve the artefact we created in the previous example using actions/download-artifact:

/.github/workflows/handle-artefacts.yaml
  download:
    runs-on: ubuntu-latest
    needs: upload
    steps:
      - name: Download log files
        id: download-artifacts
        uses: actions/download-artifact@v4
        with:
          name: all-logs # note it uses the name defined in the upload step

      - name: Pass artifact path to python
        shell: python
        run: |
          import os
          from glob import glob
          artifact_path = os.environ.get("ARTIFACT_PATH", "")
          glob_list = glob(artifact_path + "/*.txt")
          for filename in glob_list:
              with open(filename, "r", encoding="UTF-8") as f:
                  content = f.read()
                  print(content)
        env:
          ARTIFACT_PATH: ${{ steps.download-artifacts.outputs.download-path }}

All the zipping and unzipping of the artifacts is automatically handled by the actions.
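
While we're at it, here's a minimal sketch of the cross-repo download mentioned above. The repo, run id, and the CROSS_REPO_TOKEN secret name are all illustrative:

      - name: Download an artefact from another repo
        uses: actions/download-artifact@v4
        with:
          name: all-logs
          repository: my-org/other-repo                   # illustrative repo
          run-id: 1234567890                              # illustrative workflow run id
          github-token: ${{ secrets.CROSS_REPO_TOKEN }}   # illustrative secret name

As noted earlier, prefer a GitHub App installation token over a PAT for professional projects.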

Deleting artefacts

To delete an artefact, you can use:

  • the GitHub UI (the artefact list on the workflow run summary page has a delete option)
  • the GitHub API (DELETE /repos/{owner}/{repo}/actions/artifacts/{artifact_id}), for example via the gh CLI, as sketched below
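
A minimal sketch using gh api against the REST endpoints (owner, repo, and artefact id are illustrative):

# list artefacts to find the id of the one you want to delete
gh api /repos/my-org/my-repo/actions/artifacts --jq '.artifacts[] | {id, name}'
# delete a specific artefact by id
gh api -X DELETE /repos/my-org/my-repo/actions/artifacts/123456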

Using cache

🐉 Security warning

Do not store sensitive information in the cache (beware of configuration files containing secrets), as the cache is accessible to anyone who can create a PR on the repository, even on forks.

When we're handling data that is pretty stable and repeatedly used (like dependencies), we can do better than re-generating them every time: we can cache them for better performance.

In the example below, we're caching the pip dependencies for a Python project. Note that we have added the cache step before the pip install step: the install only happens if a valid cache isn't available:

/.github/workflows/cache.yaml
jobs:
  cache:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
          cache: 'pip'
          cache-dependency-path: |
            **/requirements.txt

      - name: Get pip cache dir
        id: pip-cache
        run: |
          echo "dir=$(pip cache dir)" >> "$GITHUB_OUTPUT"

      - name: Handle cache for Python dependencies
        # The cache action requires a `path` to the cache and a `key`.
        # The `key` is used to retrieve the cache and to recreate it next time.
        uses: actions/cache@v3
        id: cache
        with:
          # path: location of files to cache
          path: ${{ steps.pip-cache.outputs.dir }}
          # key: unique id used to retrieve and recreate the cache
          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}

      - name: Install dependencies if not found in cache
        if: steps.cache.outputs.cache-hit != 'true'
        run: pip install -r requirements.txt

Setting the cache

The first time the workflow runs, the cache is obviously empty. Therefore, the cache-hit output (native to the official actions/cache action) returns false, which in turn makes our workflow run the install step.
Check logs

However, a bit of magic happens too: a post-cache step, automatically added by actions/cache at the end of the job, looks at the key you provided and adds the files to the cache.

Retrieving the cache

As long as nothing has changed in your dependency manifest, the next time actions/cache runs for that path and key, the action will find a cache hit and the workflow will safely skip the install step.
Check logs

Updating the cache

Have you noticed the hashFiles function used in the key argument?
This is a function provided by GitHub Actions that creates a hash from the contents of the files matching a path pattern. When the hash value doesn't match, it means there was a change in those files - in our case, the dependency manifest.

If the dependencies changed (even a single patch), the cache is no longer valid and the cache-hit output will let pip install run. And then we're back to square one: dependencies are installed and the cache is updated by the post-cache step.
Check the logs
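
If you'd rather fall back to an older cache than start cold when the exact key misses, actions/cache also accepts a restore-keys input. A minimal sketch building on the step above (note that a partial match still reports cache-hit as false, so the install step runs, but pip can reuse whatever was restored):

      - name: Handle cache for Python dependencies
        uses: actions/cache@v3
        id: cache
        with:
          path: ${{ steps.pip-cache.outputs.dir }}
          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
          # fall back to the most recent cache whose key matches this prefix
          restore-keys: |
            ${{ runner.os }}-pip-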

Last notes about caching

  • If you’re using self-hosted runners, the option to self-store the cache is only available in Enterprise plans.
  • The actions/cache action manages the cache centrally. This means that the cache is available to (and updatable by) all jobs in the same repository - and even to other workflows.
  • Read more about caching strategies here.

That was a long post, phew.
See you later! :)