Apache Airflow: Directed Acyclic Scheduling for Real Infrastructure
When cron stops scaling, Airflow starts making sense
What is Airflow?
It’s a workflow orchestrator built for engineers who prefer code over configs. DAGs (Directed Acyclic Graphs) define the logic, Python runs the show, and a web UI gives the team visibility without shell access.
Airflow was born at Airbnb, but it’s everywhere now: pipelines, ETLs, ML jobs, CI/CD side workflows, nightly batch processing — any task that has dependencies, retries, inputs, and consequences.
You have three daily jobs that must run in order.
You have twelve more that depend on those three.
Some run on schedule. Some are triggered by files appearing in S3.
One fails every third day unless it’s run after 10 a.m.
This is what cron hates. This is what Airflow handles.
You don’t write bash scripts or YAML chains. You write DAGs in Python — readable, versioned, testable.
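Here is what that looks like in practice, a minimal sketch of the three-jobs-in-order case; the DAG and task names are illustrative, not from any real pipeline:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_batch",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    # Three placeholder jobs; swap the bash_command for real work.
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform = BashOperator(task_id="transform", bash_command="echo transforming")
    load = BashOperator(task_id="load", bash_command="echo loading")

    # Dependencies are code, not crontab timing tricks: extract, then transform, then load.
    extract >> transform >> load

The >> operator is the whole dependency story: the scheduler reads the graph and runs each task in order, retrying it on its own terms.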
Where It’s Being Used
– Data teams building pipelines with dynamic, conditional steps.
– Platform engineers managing image builds and release packaging.
– ML workflows with retraining, deployment, and drift detection logic.
– Event-driven workflows triggered by webhooks, sensors, or external APIs.
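The last pattern is worth a sketch: a DAG that waits on an S3 sensor before doing any work. This assumes the Amazon provider package (apache-airflow-providers-amazon) is installed; the bucket, key, and connection names are placeholders.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

with DAG(
    dag_id="s3_triggered_ingest",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    # Poke S3 until the day's export shows up, then hand off to processing.
    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_name="my-landing-bucket",
        bucket_key="incoming/{{ ds }}/export.csv",
        aws_conn_id="aws_default",
        poke_interval=60,   # check once a minute
        timeout=60 * 60,    # fail the task after an hour of waiting
    )
    process = BashOperator(task_id="process", bash_command="echo processing new file")

    wait_for_file >> process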
Key Characteristics
Feature | What It Actually Delivers |
Python-Based DAGs | Define workflows as Python code — no DSL, no external config layers |
Task-Level Retries | Set retries, timeouts, and failure logic per node in the graph |
UI for Observability | See DAG runs, logs, failed tasks, upcoming jobs — all in one place |
Pluggable Executors | Choose from Celery, Kubernetes, Local — depending on scale and infra |
Sensors and Hooks | React to external systems — S3, HTTP, Hive, Slack, Git, you name it |
XComs and Variables | Pass data between tasks without fragile file sharing |
Logs and Audit Trails | Built-in logging and metadata database for all runs |
Role-Based Access | Control who can trigger, edit, pause, or view jobs |
Scheduler with DAG Awareness | Not just time-based — understands dependencies, execution order |
Airflow API | Trigger jobs programmatically or integrate with external systems |
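Two of those rows, task-level retries and XComs, in one hedged sketch using the TaskFlow API; the names are illustrative:

from datetime import datetime, timedelta

from airflow.decorators import dag, task

@dag(start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False)
def report_pipeline():
    @task(retries=3, retry_delay=timedelta(minutes=10))
    def fetch_row_count() -> int:
        # Return values are pushed to XCom automatically.
        return 42

    @task
    def publish(row_count: int):
        # The argument arrives via XCom, no shared temp files between tasks.
        print(f"rows processed: {row_count}")

    publish(fetch_row_count())

report_pipeline()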
What You Actually Need
– Python 3.8+ (Airflow 2.7.x supports Python 3.8 through 3.11)
– PostgreSQL or MySQL for metadata DB
– Redis (for Celery executor) or Kubernetes (for K8s executor)
– Optional: Docker Compose for quick testing
To install with pip (for local dev/testing):
pip install apache-airflow==2.7.3 \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.3/constraints-3.8.txt"
To initialize the metadata database, create an admin user, and start Airflow locally:
airflow db init
airflow users create --username admin --password admin --role Admin --email admin@example.com
airflow webserver --port 8080
airflow scheduler
Or use Docker Compose:
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.7.3/docker-compose.yaml'
docker compose up -d
Web UI at http://localhost:8080
Default user: admin / admin for the local setup above; the official Docker Compose file defaults to airflow / airflow
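With the web server up, the REST API from the table above is one POST away. A hedged sketch, assuming the basic-auth API backend is enabled (the official Docker Compose file enables it) and a DAG with dag_id example_dag is already loaded:

import requests

# Trigger a run of an existing DAG through the stable REST API.
resp = requests.post(
    "http://localhost:8080/api/v1/dags/example_dag/dagRuns",
    auth=("admin", "admin"),  # or airflow / airflow with the Compose defaults
    json={"conf": {"triggered_by": "external_system"}},
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["dag_run_id"])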
What People Learn After 3 Months
“It’s not a task runner. It’s a system of record for everything we automate.”
“Retries and dependencies feel native — we stopped writing wrappers for everything.”
“Airflow made us rethink what should be automated. We added jobs we used to ignore.”
One Thing to Know
Airflow won’t write the logic for you. It won’t install your packages. It won’t babysit your scripts. But it gives you a structured place to put them — and a framework to treat automation as part of the system, not glue code between crontabs.
It’s not for one-shot jobs. But if your infrastructure thinks in graphs, Airflow is the right answer.