Speed up Docker Image Building with the CDK


In a recent project we observed that build times in our CDK application kept increasing after we moved code from one repo to another. We were using a pipeline that re-used the build hosts and build contexts to speed up builds, because dependencies didn't need to be re-installed every time. In theory that concept was great, but in practice we noticed consecutive builds getting slower and slower. It took us a while to figure out why: our Docker builds were taking progressively longer each time the pipeline ran, because we had missed copying a crucial file when we moved the repo. That file is the topic of this blog post, because I'm sure we aren't the only ones with this problem.

The file I'm talking about is the .dockerignore file, which influences the build process of a Docker image. To understand what it does, let's start with a brief overview of that build process. When you run docker build, Docker requires the path to a directory that contains a Dockerfile. That directory is called the build context. You can also specify a different path for the Dockerfile using the -f parameter, but by default Docker expects it to sit at the root of the build context. If you're not too familiar with Docker: the Dockerfile is essentially the script that determines how a Docker image gets created.
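
As a quick illustration, the two invocations might look like this (the myapp tag and the docker/Dockerfile path are just placeholders for this example):

# Build using the Dockerfile at the root of the build context (".")
docker build -t myapp .

# Use a Dockerfile that lives somewhere else - the final argument
# is still the build context that gets sent to the Docker daemon
docker build -t myapp -f docker/Dockerfile .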

When Docker builds an image, it copies the build context to a working directory (the actual implementation may be a little more complex, but you can think of it as copying). By default this usually means your whole repository gets copied, including your virtual environment (if you're using Python) or your node_modules directory, as well as the .git directory. In the case of the CDK, the cdk.out directory gets copied too. Since cdk synth, and by extension cdk deploy, builds a new Docker image from the source code, lots of files get copied every time you run one of these commands, and that takes time.
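
To make this concrete, here is a minimal sketch of the kind of construct that triggers these Docker builds, assuming a Python CDK v2 project (the ContainerStack class and the AppImage id are made up for this example):

from aws_cdk import Stack
from aws_cdk import aws_ecr_assets as ecr_assets
from constructs import Construct

class ContainerStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # cdk synth / cdk deploy runs "docker build" against this directory.
        # With the repository root as the build context, .git, venv, cdk.out
        # etc. all get sent to the Docker daemon unless .dockerignore
        # excludes them.
        ecr_assets.DockerImageAsset(
            self,
            "AppImage",
            directory=".",
        )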

In our case that was the root cause of our builds taking longer and longer each time we ran them. As I said, the runner and working directory were re-used, which meant that the cdk.out directory grew larger with each build, and the whole thing got copied into the build context every time - very inefficient. The problem was that we had moved repositories a while ago and had forgotten to bring along one tiny file that's hidden by default: the .dockerignore.

So what's this about? If you're familiar with the .gitignore file, .dockerignore is basically the same thing for Docker. If you aren't: the .dockerignore file tells Docker which files it can safely ignore during the build process and therefore doesn't need to copy over. It's a very simple idea that can have significant speed implications depending on your project. Here's an example of a .dockerignore that I use in most of my CDK projects:

# Files/directories in this list won't be copied to
# the docker build context, this saves a lot of time
.git
.env
.pytest_cache
.idea
.vscode
infrastructure
cdk.out
tests
venv
__pycache__

Depending on your project you might want to add other directories (e.g. node_modules) to this list as well. A good starting point for creating this file is Toptal's gitignore.io tool, which helps you build .gitignore files for your language of choice. Since the two file formats are mostly compatible as far as I'm aware, it can be useful for creating a baseline.
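
For a TypeScript/Node.js-based CDK project, for example, a comparable list might look roughly like this (just a sketch - adjust it to whatever your project actually produces):

# Files/directories in this list won't be copied to
# the docker build context
.git
.vscode
cdk.out
node_modules
coverage
dist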

Another thing we learned from this experience: it's a good idea to periodically delete your cdk.out directory - a colleague observed that it had grown to several gigabytes in size.

Hopefully you enjoyed this brief blog post and learned something. If you have questions, concerns or any other feedback, feel free to reach out to me on Twitter (@Maurice_Brg) or any of the other social media channels listed in my bio.

Title image courtesy of Jonas Smith on Unsplash.
