GitHub Actions: Data Flow & Data Persistence
In GitHub Actions, by default, data is not inherently persistent or available to the whole pipeline. Every step runs in its own process, and every job runs on its own runner. Whatever data a job produces ends with it.
How do we pass data from one process to another, or save it for the next one?
A short, sweet answer:
Strategy | Data | Scope | Persistence | Explanation | Example |
---|---|---|---|---|---|
env | Values | Job (internal) | Ephemeral | Propagates data between steps in the same job | Pass a boolean to control whether the next step should run |
outputs | Values | Workflow (internal) | Ephemeral | Propagates data between jobs/steps in the same workflow | Pass a deployment id to the next job |
artefacts | Files | Workflow (internal & external) | Persistent | Propagates files between jobs/workflows | Pass the project build to different test jobs running in parallel. Intended for frequently changing data. Files are available for download after the workflow finishes. |
cache | Files | Workflow (internal & external) | Persistent | Propagates files inside and between workflows in the same repository | Cache npm packages for use in different workflow runs. Intended for files that don't change much. |
For a more complete answer, read on.
All the workflow examples in this article can be found as files here, along with a copy of the respective redacted logs.
Using env
It's pretty simple to create a data flow between steps: define a key-value pair and write it to the GITHUB_ENV
environment file, using the appropriate syntax for your shell. See examples below in bash and python:
steps:
  - name: Two ways to set environment variables with bash
    # Warning: in this step, the input is not sanitized or validated
    shell: bash
    run: |
      # No print to the logs.
      random_wiki_article_1=$(curl -L -X GET "https://en.wikipedia.org/api/rest_v1/page/random/summary" | jq .title)
      echo "$random_wiki_article_1"
      echo "ARTICLE_1=$random_wiki_article_1" >> "$GITHUB_ENV"
      # 🐉 Print the variable in the logs: only for non-sensitive data!
      random_wiki_article_2=$(curl -L -X GET "https://en.wikipedia.org/api/rest_v1/page/random/summary" | jq .title)
      echo "ARTICLE_2=$random_wiki_article_2" | tee -a "$GITHUB_ENV"
  - name: Set environment variables with Python
    shell: python
    # if using "write", use \n when creating multiple vars
    # with "print", you can omit \n
    run: |
      from os import environ as env
      with open(env.get('GITHUB_ENV', None), 'a') as ghenv:
          ghenv.write("SUBJECT=Sun\n")
          print("STATE=radiant", file=ghenv)
          print("TIME=today", file=ghenv)
  - name: 🛡️ Retrieving values securely
    # Observe that ARTICLE_1 was not sanitized or validated, so it's vulnerable to injection attacks.
    # The approach below prevents the issue by exposing env.ARTICLE_1 to the script as an intermediate environment variable.
    # It also gives you the chance to rename the variables.
    env:
      WHO: ${{ env.SUBJECT }}
      WHAT: ${{ env.ARTICLE_1 }}
      WHEN: ${{ env.TIME }}
    run: |
      echo "$WHO read about $WHAT $WHEN."
  - name: 🐉 Retrieving values in a potentially vulnerable way
    # This approach is vulnerable to injection attacks!
    # Only use it if you have control over the input.
    shell: bash
    run: |
      echo "${{ env.SUBJECT }} is ${{ env.STATE }} ${{ env.TIME }}."
Debugging tip
To list all the environment variables available in a job, add this tiny step:
- run: env
Using outputs
Outputs are available to all steps in the same job, and to any subsequent job that needs them.
The output is always a Unicode string.
And obviously, jobs that depend on an output will not run in parallel with the job that produces it.
For simplicity, I show how to set the output in bash, but you can use any shell of your choice.
jobs:
  setting-outputs:
    runs-on: ubuntu-latest
    outputs: # Required: declare the outputs at the job level so they're available to other jobs
      person_name: ${{ steps.use-hardcoded-value.outputs.NAME }}
      location: ${{ steps.use-dynamic-value.outputs.LOCATION }}
    steps:
      - id: use-hardcoded-value
        run: |
          echo "NAME=Marcela" >> "$GITHUB_OUTPUT"
      - id: use-dynamic-value
        # note the use of jq -c to get the value as a single line
        run: |
          location=$(curl -H "Accept: application/json" https://randomuser.me/api/ | jq -c .results[].location)
          echo "LOCATION=$location" >> "$GITHUB_OUTPUT"
  retrieving-outputs:
    runs-on: ubuntu-latest
    needs: setting-outputs
    steps:
      - name: Greet to location
        run: |
          COUNTRY=$(echo "$GEODATA" | jq -r .country)
          echo "Hello $NAME, welcome to $COUNTRY!"
        env:
          NAME: ${{ needs.setting-outputs.outputs.person_name }}
          GEODATA: ${{ needs.setting-outputs.outputs.location }}
Even though it's recommended to use env
to pass data between steps, outputs
can be used for that purpose as well. This is useful when a value is required both in the current job and in subsequent jobs.
The previous example showed how to use outputs in different jobs.
To consume an output in the same job as well, simply add a step like the last one in the code below.
jobs:
  extract:
    runs-on: ubuntu-latest
    outputs:
      person_name: ${{ steps.generate-hardcoded-value.outputs.NAME }}
      location: ${{ steps.generate-dynamic-value.outputs.LOCATION }}
    steps:
      - id: generate-hardcoded-value
        run: |
          echo "NAME=Marcela" >> "$GITHUB_OUTPUT"
      - id: generate-dynamic-value
        # note the use of jq -c to get the value as a single line
        run: |
          location=$(curl -H "Accept: application/json" https://randomuser.me/api/ | jq -c .results[].location)
          echo "LOCATION=$location" >> "$GITHUB_OUTPUT"
      - name: Consume output in same job
        run: |
          echo "$PERSON, you're in $GEODATA, so we've updated your timezone to GMT$OFFSET."
        env:
          PERSON: ${{ steps.generate-hardcoded-value.outputs.NAME }}
          # use fromJSON() when filtering the output value at the env level
          # See more about object filtering in
          # https://docs.github.com/en/actions/learn-github-actions/expressions#object-filters
          GEODATA: ${{ fromJSON(steps.generate-dynamic-value.outputs.LOCATION).country }}
          OFFSET: ${{ fromJSON(steps.generate-dynamic-value.outputs.LOCATION).timezone.offset }}
(...)
- An individual output should be 1MB max.
- All outputs combined should not exceed 50MB.
`GITHUB_OUTPUT` expects a one-line string.
If you need a multiline output, assign it to a variable and write it to the output file with heredoc-style delimiters: `echo "PAYLOAD_NAME<<EOF"$'\n'"$payload_var"$'\n'EOF >> "$GITHUB_OUTPUT"`.
Using artefacts
From the docs: "Use artifacts when you want to save files produced by a job to view after a workflow run has ended, such as built binaries or build logs."
Uploading artefacts
You can:
- select one or multiple files to be bundled as an artifact.
- use wildcards, multiple paths and exclusion patterns in the usual GitHub Actions syntax.
- set a retention period for the artefact.
jobs:
  upload:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Upload log files
        uses: actions/upload-artifact@v4
        with:
          name: all-logs # artefact name
          # path to files to be included in the artefact;
          # relative paths are rooted against $GITHUB_WORKSPACE
          path: |
            **/log*.txt
          retention-days: 1
          if-no-files-found: error # force the step to fail if the content for the artefact is not found
Note that the maximum retention period can be defined at the repo, organisation, or enterprise level. There's a max of 90 days for public repos and 400 days for private repos. If you lower the retention period, you'll have more non-billed space ;)
Downloading artefacts
To retrieve the artefact, you can use:
- the GitHub UI
- the GitHub API
- the `gh` CLI (see the sketch after this list)
- the official `actions/download-artifact` action, if you need to retrieve artefacts programmatically. From `v4`, the action allows you to download artefacts from different workflow runs or repos, as long as you provide a token. (🛡️: it's recommended to use a GitHub App rather than a PAT for professional projects.)
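For the `gh` CLI route, here's a minimal sketch run from a local clone; the run id is a placeholder, and the artefact name is the one from the example above:

```bash
# list recent runs to find the id of the run that produced the artefact
gh run list --limit 5

# download the "all-logs" artefact from that run into ./logs (1234567890 is a placeholder id)
gh run download 1234567890 --name all-logs --dir ./logs
```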
Let's see how to retrieve the artefact we created in the previous example using `actions/download-artifact`:
  download:
    runs-on: ubuntu-latest
    needs: upload
    steps:
      - name: Download log files
        id: download-artifacts
        uses: actions/download-artifact@v4
        with:
          name: all-logs # note it uses the name defined in the upload step
      - name: Pass artifact path to python
        shell: python
        run: |
          import os
          from glob import glob

          artifact_path = os.environ.get("ARTIFACT_PATH", "")
          glob_list = glob(artifact_path + "/*.txt")
          for filename in glob_list:
              with open(filename, "r", encoding="UTF-8") as f:
                  content = f.read()
                  print(content)
        env:
          ARTIFACT_PATH: ${{ steps.download-artifacts.outputs.download-path }}
All the zipping and unzipping of the artifacts is automatically handled by the actions.
Deleting artefacts
To delete an artefact, you can:
- use the GitHub UI
- use the GitHub API
- write a custom script using the GitHub API, or use a community action (see the sketch after this list).
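As an illustration of the scripted route, here's a minimal sketch using the `gh` CLI as a thin wrapper around the API (not from the original post). It assumes you run it inside a clone of the repo, so the `{owner}/{repo}` placeholders resolve, and the artefact id is made up:

```bash
# list artefact ids and names for the current repository
gh api repos/{owner}/{repo}/actions/artifacts --jq '.artifacts[] | [.id, .name] | @tsv'

# delete a single artefact by id (123456 is a placeholder)
gh api --method DELETE repos/{owner}/{repo}/actions/artifacts/123456
```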
Using cache
Do not store sensitive information in the cache (beware of configuration files containing secrets), as the cache is accessible to anyone who can create a PR on the repository, even on forks.
When we're handling data that is pretty stable and repeatedly used (like dependencies), we can do better than re-generating them every time: we can cache them for better performance.
In the example below, we're caching the `pip` dependencies for a Python project. Note that we have added the cache step before the `pip install` step: the idea is that the install only runs if there is no valid cache available:
jobs:
  cache:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
          cache: 'pip'
          cache-dependency-path: |
            **/requirements.txt
      - name: Get pip cache dir
        id: pip-cache
        run: |
          echo "dir=$(pip cache dir)" >> "$GITHUB_OUTPUT"
      # The cache action requires a `path` to the cache and a `key`.
      # The `key` is used to retrieve the cache and to recreate it next time.
      - name: Handle cache for Python dependencies
        uses: actions/cache@v3
        id: cache
        with:
          # path: location of files to cache
          path: ${{ steps.pip-cache.outputs.dir }}
          # key: unique id used to retrieve and recreate the cache
          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
      - name: Install dependencies if not found in cache
        if: steps.cache.outputs.cache-hit != 'true'
        run: pip install -r requirements.txt
Setting the cache
The first time the workflow runs, the cache is obviously empty. Therefore, the output `cache-hit` (native to the official `actions/cache` action) will return `false`, which in turn makes our workflow run the install step.
Check logs
However, a small bit of magic happens too: a post-cache step, automatically added by `actions/cache` at the end of the job, will look at the key you provided and add the files to the cache.
Retrieving the cache
As long as nothing has changed in your dependency manifest, the next time `actions/cache` runs for that path and key, the action will find a `cache-hit` and the workflow will safely skip the install step.
Check logs
Updating the cache
Have you noticed the `hashFiles` function used in the `key` argument?
It's a function provided by GitHub Actions that creates a unique hash from the contents of the files matching the given path. When the hash value doesn't match, it means that there was a change in those files, in our case, the dependency manifest.
If the dependencies changed (even a single patch version), the old cache is no good anymore, and the `cache-hit` output will allow `pip install` to run. And then we're back to square one: dependencies are installed and the cache is updated by the post-cache step.
Check the logs
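If you want a fallback when the exact key doesn't match, `actions/cache` also accepts `restore-keys`: a list of key prefixes used to restore the closest older cache. A minimal sketch adapting the step above; the `restore-keys` lines are my addition, not part of the original workflow:

```yaml
      - name: Handle cache for Python dependencies
        uses: actions/cache@v3
        id: cache
        with:
          path: ${{ steps.pip-cache.outputs.dir }}
          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
          # fallback: restore the most recent cache whose key starts with this prefix
          restore-keys: |
            ${{ runner.os }}-pip-
```

Note that on a partial match `cache-hit` is still `false`, so the install step runs anyway, but pip can reuse most of the restored packages.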
Last notes about caching
- If you’re using self-hosted runners, the option to self-store the cache is only available in Enterprise plans.
- The `actions/cache` action manages the cache centrally. This means that the cache is available to (and updatable by) all jobs in the same repository, and even to other workflows.
- Read more about caching strategies here.
That was a long post, phew.
See you later! :)