# CHAPTER 14
Workflow Artifacts and Caching
1. Introduction
GitHub-hosted runners are completely ephemeral. Every time a workflow starts, you receive a brand-new, clean virtual machine. Every time it ends, that machine is destroyed, and the hard drive is wiped clean. While this is fantastic for security and reproducibility, it creates two massive problems:
- 1. If Job 1 compiles the code, how does Job 2 get the compiled files if the machine was destroyed?
- 2. If it takes 3 minutes to download 500MB of `npm` or `composer` packages, why do we have to download them again for every single commit?
2. Learning Objectives
By the end of this chapter, you will be able to:
- Differentiate between the operational purpose of Artifacts and Caching.
- Use `actions/upload-artifact` to persist data after a job finishes.
- Use `actions/download-artifact` to pass data to sequential jobs.
- Use `actions/cache` to speed up dependency installation times.
- Understand how cache keys are dynamically generated based on lock files.
3. Beginner-Friendly Explanation
Imagine a multi-stage factory.
- The Artifact (Passing the Baton): Station 1 bakes a cake. Station 2 frosts the cake. Because Station 1 and Station 2 are in different buildings, Station 1 must put the cake in a delivery box (Upload Artifact) and send it. Station 2 receives the box, opens it (Download Artifact), and applies the frosting. Artifacts are the physical handoff of the final product.
- The Cache (The Tool Shed): Station 1 needs a very specific wrench to fix the oven. The first time, they drive to the hardware store to buy it (takes 30 mins). Before they leave for the day, they put the wrench in a locked shed. The next day, instead of going to the store, they just grab the wrench from the shed (takes 10 seconds). Caching is saving the heavy tools so you don't have to re-download them.
4. Passing Data with Artifacts
Because jobs run on completely separate servers, they cannot share files directly.
If the build job creates app.zip, the deploy job will not be able to find app.zip.
You must explicitly upload the file to GitHub's temporary storage, and the next job must explicitly download it.
Uploading:
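A minimal sketch of the upload and the matching download, assuming a `build` job that produces `app.zip` and a `deploy` job that consumes it. The workflow name, the artifact name `app-package`, the `src/` directory, and the action versions are assumptions chosen for illustration:

```yaml
name: artifact-handoff
on: push

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Hypothetical build step: package the (assumed) src/ directory.
      - name: Build package
        run: zip -r app.zip src/

      # Send app.zip to GitHub's temporary artifact storage.
      - name: Upload build output
        uses: actions/upload-artifact@v4
        with:
          name: app-package
          path: app.zip

  deploy:
    runs-on: ubuntu-latest
    needs: build   # wait for the build job so the artifact exists
    steps:
      # Fetch the artifact by the same name onto this fresh runner.
      - name: Download build output
        uses: actions/download-artifact@v4
        with:
          name: app-package

      - name: Confirm the file arrived
        run: ls -l app.zip
```

The `needs: build` line matters here: without it the two jobs may run in parallel, and the download can start before anything has been uploaded.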
*Note: Artifacts uploaded during a workflow are visible in the GitHub UI and can be manually downloaded by developers as ZIP files!*
5. Speeding Up Builds with Caching
If your `composer install` command takes 2 minutes, you are wasting valuable cloud minutes. Dependencies rarely change between commits. We can cache the `vendor/` or `node_modules/` directories.
How Caching Works:
You provide a "Key" (usually a hash of your composer.lock file).
GitHub checks its storage: "Do I have a saved folder matching this key?"
- Cache Hit: GitHub copies the saved folder to your runner. `composer install` finishes almost immediately because the packages are already in place.
- Cache Miss: Your `composer install` step runs normally, and the cache action then saves the resulting folder for the next time.
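The hit and miss paths can be sketched as a fragment of a job's `steps:` list. This assumes a Composer project whose dependencies live in `vendor/`; the step names, the `id` value, and the action version are illustrative assumptions:

```yaml
# Fragment of a job's `steps:` list, placed after the checkout step.
- name: Restore vendor/ from cache
  id: deps
  uses: actions/cache@v4
  with:
    path: vendor
    # The key changes whenever composer.lock changes.
    key: ${{ runner.os }}-composer-${{ hashFiles('**/composer.lock') }}

# Cache miss: the output is not 'true', so the install runs in full.
# Cache hit: vendor/ was restored for this exact lock file, so the step is skipped.
- name: Install dependencies
  if: steps.deps.outputs.cache-hit != 'true'
  run: composer install --no-interaction --prefer-dist
```

Skipping the install on a hit is one design choice. Running `composer install` unconditionally is also reasonable, since it has little to do when `vendor/` already matches the lock file.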
6. Mini Project: Optimize CI Workflow Speed
Let's optimize a PHP workflow by implementing a robust dependency cache.
Step-by-Step Walkthrough:
- 1. Create `.github/workflows/caching-demo.yml`.
- 2. Paste the following declarative code:
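A minimal sketch of such a workflow, assuming a Composer project with `composer.lock` committed to the repository and a runner image that ships with PHP and Composer. The workflow name, job name, step names, and action versions are assumptions:

```yaml
name: caching-demo
on: push

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Restore vendor/ if a cache exists for this exact composer.lock.
      # On a miss, the action saves vendor/ automatically when the job ends.
      - name: Cache Composer dependencies
        id: composer-cache
        uses: actions/cache@v4
        with:
          path: vendor
          key: ${{ runner.os }}-composer-${{ hashFiles('**/composer.lock') }}

      - name: Install dependencies
        run: composer install --no-interaction --prefer-dist

      # Print the action's cache-hit output so the two runs can be compared.
      - name: Report cache result
        run: echo "Cache hit: ${{ steps.composer-cache.outputs.cache-hit }}"
```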
*Run this workflow twice. The first run will take normal time. The second run will be significantly faster because the cache was utilized!*
7. Real-World Scenarios
A data science team had a Python CI pipeline that required downloading 5 Gigabytes of Machine Learning libraries (like TensorFlow and PyTorch) during the `pip install` phase. Every time a developer pushed a single line of code, the pipeline took 25 minutes just to download the libraries. The developers stopped running tests because it was too slow. A DevOps engineer implemented the `actions/cache` step. Because the dependencies were massive but rarely changed, the cache hit successfully 99% of the time, dropping the pipeline execution time from 25 minutes down to 2 minutes.
8. Best Practices
- Cache Invalidation (The Lock File): In the mini-project, the key was built from `${{ hashFiles('**/composer.lock') }}`. This is what makes invalidation automatic. If a developer doesn't add new packages, the hash remains the same, and the cache is used. If a developer runs `composer require new-package`, the `composer.lock` file changes, so the hash changes. GitHub sees a new key, treats it as a "Cache Miss", completely ignores the old saved folder, and `composer install` correctly downloads the new package.
9. Security Recommendations
- Artifact Exposure: Artifacts are tied to the repository. If you upload an artifact containing a compiled app with hardcoded API keys, anyone with read access to the GitHub repository can download that ZIP file from the Actions tab and extract the keys. Never include sensitive configuration files in uploaded artifacts.
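One way to reduce this exposure is to exclude sensitive files at upload time and shorten how long the artifact is kept. The fragment below is a sketch: the `build/` layout, the file `build/config/secrets.json`, and the artifact name are hypothetical:

```yaml
# Fragment of a job's `steps:` list.
- name: Upload build output without secrets
  uses: actions/upload-artifact@v4
  with:
    name: app-build
    # Include the build directory, then exclude a file that must not leave the runner.
    path: |
      build/
      !build/config/secrets.json
    # Keep the artifact only briefly to limit the download window.
    retention-days: 3
```

Exclusion patterns are a safety net, not a substitute for keeping secrets out of the build output in the first place.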
10. Troubleshooting Tips
- Artifact Size Limits: GitHub imposes storage limits on Artifacts. They are meant for passing code, not for storing 100GB database backups. If your pipeline fails during the upload step, check if you are attempting to upload the entire Linux filesystem instead of just a specific `./build/` directory.
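A narrowly scoped upload step might look like the sketch below. The artifact name and the chosen input values are assumptions; `if-no-files-found` and `compression-level` are inputs of `actions/upload-artifact@v4`:

```yaml
# Fragment of a job's `steps:` list.
- name: Upload only the build directory
  uses: actions/upload-artifact@v4
  with:
    name: build-output
    # Point at the specific output directory, never at / or the whole workspace.
    path: ./build/
    # Fail loudly if the path is wrong instead of uploading nothing.
    if-no-files-found: error
    # Higher compression trades CPU time for a smaller upload (0-9).
    compression-level: 9
```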
11. Exercises
- 1. Explain the functional difference between an Artifact and a Cache in a CI/CD pipeline.
- 2. Why do we use the hash of a lock file (like `package-lock.json`) to generate the Cache Key, rather than just naming the key `my-cache`?
12. FAQs
Q: Do I need to explicitly "save" the cache at the end of the workflow?
A: No! The `actions/cache` step is smart. If it experiences a "Cache Miss" during the workflow, it will automatically run a post-job step to save the folder for the next time, provided the job completes successfully. You don't need to write any extra YAML.
13. Interview Questions
npm install. Architect the YAML step required to cache the node_modules directory, and explain the cache invalidation strategy ensuring outdated packages are not restored.
14. Summary
In Chapter 14, we mastered the manipulation of state across ephemeral environments. We solved the problem of isolated job execution by utilizing `actions/upload-artifact` to physically pass compiled deliverables down the assembly line. More importantly, we radically optimized our pipeline performance by implementing dependency caching. By utilizing intelligent cache keys based on cryptographic file hashes, we eliminated redundant downloads, saving immense amounts of cloud compute time and providing developers with the rapid feedback loops essential for true CI/CD.