Introduction
Apache Airflow and Docker are two powerful tools that have revolutionized the way we handle data and software deployment. Apache Airflow is an open-source platform that allows you to programmatically author, schedule, and monitor workflows. Docker, on the other hand, is a platform that enables developers to package applications into containers—standardized executable components that combine application source code with the OS libraries and dependencies required to run that code in any environment.
Running Airflow locally with Docker is a great idea for several reasons. First, Docker provides an isolated and consistent environment for your Airflow setup, reducing the chances of encountering issues due to differences in dependencies, libraries, or even OS. Second, Docker makes it easy to version, distribute, and replicate your Airflow setup, which can be particularly useful in a team setting or when moving from development to production.
Setting Up Your Local Environment
To get started, you'll need to have the following software installed on your machine:
- Python (version 3.6 or later)
- Docker (version 20.10.11)
- Docker Compose (version 2.2.1)
Installing Python, Docker, and Docker Compose is straightforward: for Python and Docker, you can follow the official installation guide for your specific OS. Airflow itself will run inside Docker containers, so you don't need to install it on your host machine (although you could install it locally with pip, Python's package installer, if you ever want to run it outside Docker).
Install Python
Apache Airflow is written in Python, so you'll need Python installed on your machine. You can download it from the official Python website. As of writing, Airflow requires Python 3.6 or above.
To check if Python is installed and see its version, open a terminal window and type:
$ python --version
Install Docker
Docker allows us to containerize our Airflow setup. You can download Docker from the official Docker website. Choose the version that's appropriate for your operating system.
After installation, you can check if Docker is installed correctly by opening a terminal window and typing:
$ docker --version
Install Docker Compose
Docker Compose is a tool that allows us to define and manage multi-container Docker applications, which is what our Airflow setup will be. It's typically included with the Docker installation on Windows and Mac, but may need to be installed separately on some Linux distributions. You can check if Docker Compose is installed and see its version by typing:
$ docker-compose --version
If it's not installed, you can follow the official Docker Compose installation guide.
Project Structure
It's a good practice to keep all Airflow-related files in a dedicated directory to maintain a clean and organized project structure.
Here's a suggested structure for your project:
my_project/
│
├── airflow/ # Directory for all Airflow-related files
│ ├── dags/ # Directory to store your Airflow DAGs
│ │ ├── dag1.py
│ │ ├── dag2.py
│ │ └── ...
│ │
│ ├── Dockerfile # Dockerfile for building your custom Airflow image
│ ├── docker-compose.yml # Docker Compose file for defining your services
│
└── ... # Other directories and files for your project
In this structure:
- The airflow/ directory is where you store all your Airflow-related files. This keeps your Airflow setup separate from the rest of your project, making it easier to manage.
- The dags/ directory inside the airflow/ directory is where you store your Airflow DAGs. These are Python scripts that define your workflows (see the example sketch after this list). In your Docker Compose file, you would map this directory to /usr/local/airflow/dags in your Airflow containers.
- The Dockerfile inside the airflow/ directory is used to build your custom Airflow Docker image. This file contains the instructions to initialize the Airflow database and write your custom airflow.cfg file into the image.
- The docker-compose.yml file inside the airflow/ directory is where you define your services (webserver, scheduler, database, etc.) for Docker Compose.
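To make the structure concrete, here is a minimal sketch of what a DAG file such as the hypothetical dag1.py above might contain, assuming the Airflow 2.x Python API (the DAG ID, schedule, and task are placeholders, not part of any required setup):
# dags/dag1.py - a minimal, hypothetical example DAG (Airflow 2.x API)
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dag1",                      # placeholder DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # A single task that simply prints a message
    say_hello = BashOperator(
        task_id="say_hello",
        bash_command="echo 'Hello from Airflow'",
    )
Any Python file you drop into the mapped dags/ directory following this pattern will be picked up automatically by the Airflow scheduler once your containers are running.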
Personalizing Your Airflow-Docker Setup
Before you can run Airflow, you need to initialize its database. In a Dockerized setup, both the initialization of the Airflow database and the customization of the airflow.cfg file can be done inside the Docker image, rather than on the host machine.
To do this, you can use a Dockerfile to build a custom Airflow Docker image. In this Dockerfile, you specify the commands that initialize the Airflow database and customize the airflow.cfg file.
Here's an example Dockerfile:
# Use the official Airflow image as the base
FROM apache/airflow:latest
# Set the AIRFLOW_HOME environment variable
ENV AIRFLOW_HOME=/usr/local/airflow
# Switch to the root user
USER root
# Create the AIRFLOW_HOME directory and change its ownership to the airflow user
RUN mkdir -p ${AIRFLOW_HOME} && chown -R airflow: ${AIRFLOW_HOME}
# Switch back to the airflow user
USER airflow
# Initialize the Airflow database
RUN airflow db init
# Customize the airflow.cfg file
RUN echo "[core]" > ${AIRFLOW_HOME}/airflow.cfg && \
    echo "airflow_home = ${AIRFLOW_HOME}" >> ${AIRFLOW_HOME}/airflow.cfg && \
    echo "executor = LocalExecutor" >> ${AIRFLOW_HOME}/airflow.cfg && \
    echo "" >> ${AIRFLOW_HOME}/airflow.cfg && \
    echo "[webserver]" >> ${AIRFLOW_HOME}/airflow.cfg && \
    echo "base_url = http://localhost:8080" >> ${AIRFLOW_HOME}/airflow.cfg && \
    echo "web_server_host = 0.0.0.0" >> ${AIRFLOW_HOME}/airflow.cfg && \
    echo "web_server_port = 8080" >> ${AIRFLOW_HOME}/airflow.cfg
In this Dockerfile, we first set the AIRFLOW_HOME
environment variable to /usr/local/airflow
. We then switch to the root user using the USER root
directive. This is necessary because we need root permissions to create a directory and change its ownership.
Next, we create the AIRFLOW_HOME
directory and change its ownership to the airflow
user. This is done using the RUN mkdir -p ${AIRFLOW_HOME} && chown -R airflow: ${AIRFLOW_HOME}
command. The -p
option in the mkdir
command ensures that the directory is created if it does not exist.
After that, we switch back to the airflow
user using the USER airflow
directive. This is a good practice for security reasons, as running containers as the root user can pose security risks.
We then initialize the Airflow database using the RUN airflow db init
command.
Finally, we customize the airflow.cfg
file directly within the Docker container. This is done using the RUN
directive with a series of echo
commands, which append our custom settings to the airflow.cfg
file. This approach allows us to customize the airflow.cfg
file without having to create and customize the file on the host machine.
In this Dockerfile, we use the >
operator for the very first echo
that writes to the airflow.cfg
file, and the >>
operator for every line after that. The >
operator overwrites the file with the specified text, while the >>
operator appends the text to the file. Overwriting on the first write discards the default configuration file generated by airflow db init, so airflow.cfg ends up containing only the settings we declare here, with each section listed exactly once.
Here's how the configured sections in your airflow.cfg
file will look:
[core]
# The home directory for Airflow
airflow_home = /usr/local/airflow
# The executor class that Airflow should use. Choices include SequentialExecutor, LocalExecutor, and CeleryExecutor.
executor = LocalExecutor
[webserver]
# The base URL for your Airflow web server
base_url = http://localhost:8080
# The IP address to bind to
web_server_host = 0.0.0.0
# The port to bind to
web_server_port = 8080
Once you've created this Dockerfile, you can build your custom Airflow Docker image using the docker build
command. Here's an example:
$ docker build -t my-airflow-image .
Docker Compose
Docker Compose is a tool that allows you to define and manage multi-container Docker applications. It uses a YAML file to specify the services, networks, and volumes of your application, and then brings all these components up in a single command.
Configuring the Docker Compose File
The Docker Compose file is where you define your application's services. For Airflow, a basic Docker Compose file might include services for the webserver, scheduler, and database. This file should be in the same directory as your Dockerfile.
Here's an example:
version: "3"
services:
  webserver:
    build:
      context: .
      dockerfile: Dockerfile
    command: webserver
    volumes:
      - ./dags:/usr/local/airflow/dags
    ports:
      - "8080:8080"
  scheduler:
    build:
      context: .
      dockerfile: Dockerfile
    command: scheduler
    volumes:
      - ./dags:/usr/local/airflow/dags
  postgres:
    image: postgres:latest
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=airflow
      - POSTGRES_DB=airflow
In this Docker Compose file, we define three services: webserver
, scheduler
, and postgres
. The build
directive tells Docker Compose to build an image using the Dockerfile in the current directory. The volumes
directive maps the ./dags
directory on your host machine to /usr/local/airflow/dags
in the Docker container, allowing you to store your DAGs on your host machine. The ports
directive maps port 8080 on your host machine to port 8080 in the Docker container, allowing you to access the Airflow web server at http://localhost:8080
.
The postgres
service uses the postgres:latest
image and sets environment variables directly in the Docker Compose file. These environment variables are used to configure the Postgres database.
Launching Airflow Services
To launch your Airflow services, you can use the docker-compose up
command. Adding the -d
flag runs the services in the background. Here's the command:
$ docker-compose up -d
This command will start all the services defined in your Docker Compose file. You can check the status of your services using the docker-compose ps
command.
Creating a User in Airflow
Once you've set up Airflow with Docker Compose, you'll need to create a user in order to access the Airflow web interface. This can be done by running a command inside the running webserver container.
First, you need to find the container ID of your running Airflow webserver. You can do this by running the following command:
$ docker ps
This command lists all running Docker containers and their details. Look for the container running the Airflow webserver and note down its container ID.
Next, you can create a new user in Airflow by running the following command:
$ docker exec -it <container-id> airflow users create --username admin --password admin --firstname First --lastname Last --role Admin --email [email protected]
Replace <container-id>
with the container ID you noted down earlier. This command creates a new user with the username 'admin', password 'admin', first name 'First', last name 'Last', role 'Admin', and email '[email protected]'. You should replace these values with your own.
After running this command, you should be able to log into the Airflow web interface using the credentials of the user you just created.
Optimizations and Applications
Enhancing Performance
Optimizing your Airflow and Docker setup can significantly improve your data pipeline's performance. For Airflow, consider using LocalExecutor for parallel task execution and fine-tuning your DAGs to reduce unnecessary tasks. For Docker, ensure your images are as lightweight as possible and use Docker's built-in resource management features to limit CPU and memory usage.
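On the Airflow side, the LocalExecutor can only run tasks in parallel if your DAG actually declares independent tasks. As a rough, hypothetical sketch (the DAG ID and task names are made up for illustration, assuming the Airflow 2.x API), a fan-out/fan-in structure like the following lets the two extract tasks run concurrently:
# Hypothetical DAG with two independent tasks the LocalExecutor can run in parallel
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="parallel_example",          # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_a = BashOperator(task_id="extract_a", bash_command="echo 'extract A'")
    extract_b = BashOperator(task_id="extract_b", bash_command="echo 'extract B'")
    combine = BashOperator(task_id="combine", bash_command="echo 'combine results'")

    # extract_a and extract_b have no dependency on each other, so the
    # LocalExecutor can schedule them at the same time; combine waits for both.
    [extract_a, extract_b] >> combine
Conversely, merging trivial steps that always run back-to-back into a single task is one way to reduce unnecessary tasks and scheduling overhead.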
On the Docker side, you can limit the memory usage of your containers by adding the mem_limit
parameter to your Docker Compose file:
services:
  webserver:
    build:
      context: .
      dockerfile: Dockerfile
    command: webserver
    volumes:
      - ./dags:/usr/local/airflow/dags
    ports:
      - "8080:8080"
    mem_limit: 512m
In addition to memory, you can also manage the CPU resources that a container can use by setting the cpu_shares
parameter:
services:
  webserver:
    build:
      context: .
      dockerfile: Dockerfile
    command: webserver
    volumes:
      - ./dags:/usr/local/airflow/dags
    ports:
      - "8080:8080"
    mem_limit: 512m
    cpu_shares: 512
The cpu_shares
parameter allows you to control the CPU resources that a container can use. The value is a relative weight to other containers. For example, if one container has a value of 1024 and another has a value of 512, the first container will get twice as much CPU time as the second.
This simple addition can have a profound impact on your system's performance, ensuring that your resources are used efficiently and that your data pipeline runs smoothly. It's these small tweaks and optimizations that can make a big difference in the long run, transforming a good data pipeline into a great one.
When and Why I Use Airflow with Docker
In my journey as a Data Engineer, I've had the opportunity to work on a variety of complex projects. One such project involved creating a data pipeline to process and analyze large volumes of data. The complexity of the project was daunting, and the need for a tool that could simplify the process was evident. That's when I discovered Airflow and Docker.
Airflow, with its robust scheduling and orchestration capabilities, was a perfect fit for our data pipeline needs. However, the true game-changer was Docker. Docker allowed us to containerize our Airflow setup, which brought a slew of benefits.
First, Docker made it incredibly easy to collaborate with my team. We were able to share our Docker images and ensure that everyone was working in the same environment. This eliminated the "but it works on my machine" problem and made our development process much smoother.
Second, Docker enabled us to test our Airflow pipelines on our local machines with ease. We could replicate our production environment locally, run our pipelines, and catch any issues early in the development process. This was a significant improvement over our previous workflow, where testing was a cumbersome process.
Last, when it came time to deploy our pipelines to production, Docker made the process seamless. We simply had to push our Docker image to the production server and run it. There was no need to worry about installing dependencies or configuring the server. Everything we needed was packaged within our Docker image.
Using Airflow with Docker was a transformative experience. It not only made our development process more efficient but also allowed us to deliver a high-quality data pipeline that met our project's needs. I would highly recommend this setup to any developer or team working on data pipeline projects.
Case Study: Quizlet's Data Analytics Workflow
Quizlet's adoption of Apache Airflow has revolutionized their analytics ETLs. Initially deployed on a single server, Airflow has streamlined the process of extracting data from Google BigQuery, running analytics in SQL, and storing the results for reports and dashboards. The success of this deployment has led to an expansion of tasks, including training machine learning classifiers, calculating search indexes, A/B testing, and user targeting.
Looking forward, Quizlet plans to enhance their Airflow deployment by migrating the metadata database to a dedicated instance, integrating with Google Cloud Storage, and switching to a distributed queuing system. In essence, Apache Airflow has been a game-changer for Quizlet, empowering them to do more and drive their business forward.
Conclusion
In this article, we've explored how to run Airflow locally using Docker. We've covered everything from setting up your local environment and installing the necessary software, to deploying your Airflow services with Docker Compose. We've also discussed some tips and tricks for optimizing your setup and shared some personal experiences and real-world applications.
Running Airflow with Docker provides a consistent, isolated environment that can be easily versioned, distributed, and replicated. This setup is ideal for managing complex data pipelines, as it combines Airflow's powerful scheduling and orchestration capabilities with Docker's flexibility and isolation. Whether you're a data engineer looking to streamline your workflows, or a team seeking to ensure consistency across your development environment, running Airflow locally with Docker is a powerful solution.
I hope this guide has been helpful and has provided you with the knowledge and confidence to get started with your own Airflow-Docker setup.