Recently, I received a request to take a survey on Continuous Integration tools from a group studying Software Engineering. One of the questions they asked was if I could name a few ways I have seen software builds fail.

I gave them more than a few.

  • GitHub is down.
  • The database our tests are supposed to hit went down.
  • The database we didn’t know our tests hit went down [1].
  • We switched JavaScript tool chains, but the tool chain wasn’t installed on the build box.
  • Our build instance died. We reprovisioned it, but our JavaScript toolchain was missing again. [2]
  • GitHub SSH keys were rolled. They weren’t rolled on the build box.
  • We hit the file system’s maximum number of subdirectories in the /tmp directory [3]. Bonus: This can’t be fixed by running rm -rf /tmp/*. [4]
  • The build process hit the kernel’s sysctl limit on the maximum number of files. [5]
  • The partition where the builds are executed is full. Bonus: There is only one partition, now the machine is unreachable.
  • GitHub is down again.
  • Someone ran sudo rm -rf /var/lib/jenkins. I’ve personally done this. Twice. [6]
  • Double-digit GB files were committed to the code base. [7]
  • The change works on OS X, but the build runs on Linux. [8]
  • The agent on the Windows build machine is down. Bonus: no one knows the password for the Windows build box.
  • There are symlinks in your git repo.
  • The commit includes a new configuration parameter, with defaults for the dev and prod environments, but not for test.
  • An intern encountered their first merge conflict. [9]
  • The build machine has a different JVM version than the development environment. [10]
  • Our version of Ruby was updated when someone installed security updates. [11]
  • GitHub is down… but just the SSH endpoints. [12]
  • Time out. The build hadn’t written to stdout in a while, because it was stuck.
  • Time out. The build hadn’t written to stdout in a while, because stdout was being buffered. [13]
  • The IAM permissions cleanup project. [14]
  • Travis CI is down.
  • The $TERM environment variable suddenly disappeared, and the PostgreSQL CLI hangs without it. [15]
  • That thing, where you’re changing your .travis.yml, and it takes you 5-10 commits to get a passing build.
  • Left-Pad.
  • You build PRs and commits to master, but they have different config files. [16]
  • The build was already broken. [17]
  • It required being run with sudo to pass. [18]
  • GitHub is up! But only the SSH endpoints. The website is down, and we use their OAuth provider to log into Jenkins. [19]
  • A unit test found a bug in our software. [20]

This list should neither be considered comprehensive nor representative of anyone’s experience but my own. Building software is hard, and all any of us can do is try our best. [21]

Special thanks to @vboykis and @tdhopper for their feedback on this post.

Footnotes:

  1. Three months ago, someone added a test which required the Redis instance in the “demo” environment the product team uses to show off new features to the execs. No one knew about the dependency until the execs declared the demos a waste of time and the ops team tore down the demo environment.
  2. We installed the build tool chain manually to get the build working.
  3. Somewhere in core Hadoop, I think HDFS, a temp directory is created every time a certain class is instantiated and never cleaned up. That class was instantiated multiple times for every test case, and the directories piled up because no one realized it was happening. Never had the time to track down the source. Added a cron to delete anything older than a week.
  4. Bash will attempt to expand the ‘*’ wildcard to include every file and directory present, which is more arguments than a single command is allowed to take, so you get “Argument list too long”. There are some workarounds though; one is sketched below.
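    For example, something along these lines iterates over /tmp from Python instead of relying on the shell glob. This is a rough sketch of the approach, not the exact cron from [3]; the one-week cutoff is just the threshold mentioned there.

    # cleanup_tmp.py -- illustrative sketch only
    import os
    import shutil
    import time

    CUTOFF = time.time() - 7 * 24 * 60 * 60  # one week ago

    for name in os.listdir("/tmp"):
        path = os.path.join("/tmp", name)
        try:
            if os.path.getmtime(path) >= CUTOFF:
                continue  # still fresh, leave it alone
            if os.path.isdir(path) and not os.path.islink(path):
                shutil.rmtree(path, ignore_errors=True)
            else:
                os.remove(path)
        except OSError:
            pass  # entry vanished mid-scan or isn't ours to delete; skip it
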
  5. Or open files, or processes, etc.
  6. I was moving the Jenkins home directory to a separate volume with more space because the main partition had filled up. Moved the home dir to /var/lib/jenkins-backup, mounted the volume at /var/lib/jenkins, and ran cp /var/lib/jenkins-backup/* /var/lib/jenkins. A few weeks later I went to delete the backup, but fat-fingered the command and deleted the contents of the volume. Only lost a few weeks of history, thankfully.
  7. Multiple stellar outcomes here. A full disk? Timeouts on the changes being pulled down?
  8. Go allows platform-specific code to be defined in separate files, using a naming convention to identify the platform. There’s no requirement to provide a version of the file for platforms you aren’t running on. The developer committed only the OS X file.
  9. They committed a merge conflict like the one below and pushed to master.
    <<<<<<< HEAD
    def double(arg):
        return arg * 2
    =======
    def double(val):
        return val * 2
    >>>>>>> Renamed function argument
    
  10. There’s an upgrade in progress with a staged roll out to development, then staging, and then production. The build system wasn’t accounted for.
  11. We installed it via the Linux distro’s package manager, and someone ran sudo apt-get upgrade.
  12. So… we can see our changes, we just can’t build them.
  13. Python.
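    Python buffers stdout when it isn’t attached to a terminal, so a build’s output can sit in the buffer long enough to trip a no-output timeout. A minimal sketch of the usual fixes (build_step.py is just a placeholder name):

    import sys

    print("still making progress...", flush=True)  # flush this write immediately (Python 3)
    sys.stdout.flush()                             # or flush whatever is already buffered

    # Or disable buffering for the whole run:
    #   python -u build_step.py
    #   PYTHONUNBUFFERED=1 python build_step.py
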
  14. The AWS keys installed on the build box lost the ability to upload artifacts to S3. Could still perform LIST operations though, so we had that going for us.
  15. When running psql -U postgres -f load-dev-db.sql in this scenario, you get the following output:
    WARNING: terminal is not fully functional
    - (press RETURN)
    

    And the process hangs, waiting for user input. We were able to work around it by redirecting the output to a file and catting it after the psql command completed.

  16. The PR build worked. The build off of master, sadly, did not.
  17. Since Friday last week. It’s Wednesday.
  18. And no one really seems to know why.
  19. So we can’t tell why this build failed. Does that count?
  20. It’s been known to happen on occasion.
  21. Some of the situations described have been fictionalized to protect the innocent, as well as others who did not try their best.