
Run GPU workloads on AWS Batch

Published: 1. October 2024  •  python, aws

AWS Batch is a fully managed service enabling developers to run batch computing workloads. It sits on top of Amazon Elastic Container Service (ECS) and simplifies the process of scheduling, executing and managing containerized batch workloads.

AWS Batch supports two types of Compute Environments: Fargate and EC2. Fargate allows you to run containers without having to manage servers or clusters of Amazon EC2 instances. EC2 Compute Environments, on the other hand, run the containers on virtual servers that you can size and scale to your needs.

In this post, I will show you how to run GPU workloads on AWS Batch. Fargate does not support GPUs, so we must use an EC2 Compute Environment with GPU-enabled instances.

The demo batch job will run on EC2 G5 instances, which are powered by NVIDIA A10G Tensor Core GPUs. These instances are cheaper than the more powerful instances with NVIDIA A100 and V100 GPUs, but still provide good performance for many workloads.

These could be used for machine learning, scientific simulations, computer vision, rendering, and other GPU-accelerated tasks. You could even run LLMs (Large Language Models) on these instances and do batch inference or fine-tuning. In my day job, I use them to train machine learning models with CatBoost.

Prerequisites

To follow along with this blog post, you'll need the Pulumi CLI, Docker, Python with Poetry, and the AWS CLI installed on your system.

Additionally, you'll need an AWS account with appropriate permissions to create and manage AWS resources.

Demo application

In this blog post, I focus on the Pulumi code to provision the resources needed to run batch jobs. The demo application is a simple Python program that trains a CatBoost model on the Adult Income dataset. The program downloads the dataset, trains a classifier, and then uploads the model to an S3 bucket. Dependencies are managed with Poetry. The complete code is in this GitHub repository.
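
For context, here is a minimal sketch of what such a training script can look like. It is not the exact code from the repository; the dataset loading via catboost.datasets, the label column name, and the S3 object key are assumptions.

import os

import boto3
from catboost import CatBoostClassifier
from catboost.datasets import adult

# Load the Adult Income dataset bundled with CatBoost
# (the label column is assumed to be named "income")
train_df, _ = adult()
X = train_df.drop(columns=["income"])
y = train_df["income"]

# CatBoost needs to know which columns are categorical;
# fill missing categorical values so GPU training does not trip over NaN
cat_features = X.select_dtypes(include="object").columns.tolist()
X[cat_features] = X[cat_features].fillna("NA")

# Train on the single GPU that the Batch job requests
model = CatBoostClassifier(task_type="GPU", devices="0", iterations=500)
model.fit(X, y, cat_features=cat_features)

# Upload the trained model to the bucket passed in via the OUTPUT_BUCKET variable
model.save_model("model.cbm")
boto3.client("s3").upload_file("model.cbm", os.environ["OUTPUT_BUCKET"], "model.cbm")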

We need to bundle the application in a Docker image for AWS Batch. The Dockerfile is a standard setup for a Poetry-managed Python project. It installs Python and Poetry, copies the project files, installs the dependencies, and configures the command to run the demo program.

FROM nvidia/cuda:12.6.1-runtime-ubuntu24.04
ARG DEBIAN_FRONTEND=noninteractive

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 \
    python3-pip \
    python3-venv \
    curl \
    && rm -rf /var/lib/apt/lists/*

RUN curl -sSL https://install.python-poetry.org | python3 -
ENV PATH="${PATH}:/root/.local/bin"

WORKDIR /app
COPY pyproject.toml poetry.lock /app/
RUN poetry install --no-interaction --no-ansi
COPY demo /app/demo

CMD ["poetry", "run", "python", "/app/demo"]

Dockerfile

One particular thing to note is that this Dockerfile uses the nvidia/cuda:12.6.1-runtime-ubuntu24.04 base image. This image ships the CUDA runtime libraries needed to run workloads on NVIDIA GPUs; the NVIDIA driver itself comes from the host instance, which we will provision later with the ECS GPU-optimized AMI. Using this base image simplifies the setup, so we don't have to worry about installing the CUDA runtime ourselves.
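
If you have a local machine with an NVIDIA GPU and the NVIDIA Container Toolkit installed, you can quickly sanity-check that the image sees the GPU before pushing it. The image tag below is just a placeholder.

docker build -t ml-demo .
docker run --rm --gpus all ml-demo nvidia-smi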

The rest of the article will focus on the Pulumi code to provision the resources needed to run batch jobs.

Pulumi code

The Pulumi code provisions an S3 bucket for the trained model, an ECR repository and the Docker image, a CloudWatch log group, the required IAM roles and policies, an EC2 launch template, and the AWS Batch Compute Environment, job queue, and job definition. It also looks up the default VPC with its subnets and security groups.

I have written the Pulumi code in Python, but you can use any of the supported languages. I created the initial Pulumi project with pulumi new aws-python and selected Poetry as the dependency manager.
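
All the snippets in the following sections live in main.py and assume these imports at the top of the file:

import pulumi
import pulumi_aws as aws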


S3 Bucket

The S3 bucket is used to store the model that the batch job trains. The following Pulumi code creates an S3 bucket with the name ml-models. Choose a unique name for the bucket, as S3 bucket names must be globally unique.

output_s3_bucket = aws.s3.Bucket(
    "output-s3-bucket",
    bucket="ml-models",
    acl="private",
)

main.py


VPC

AWS Batch resources must always be created in an Amazon Virtual Private Cloud (Amazon VPC). When you set up a new AWS account, AWS automatically creates a default VPC for you. In this example, the Pulumi code retrieves the default VPC and its associated subnets and security groups. You'll see later how these resources are referenced in the batch job resources.

You can also create a new VPC, subnets, and security groups with Pulumi. See the Pulumi documentation for more information.

default_vpc = aws.ec2.get_vpc(default=True)
default_subnets = aws.ec2.get_subnets(
    filters=[
        {
            "name": "vpc-id",
            "values": [default_vpc.id],
        }
    ]
)
vpc_security_groups = aws.ec2.get_security_groups(
    filters=[
        {
            "name": "vpc-id",
            "values": [default_vpc.id],
        }
    ]
)

main.py


ECR Repository and Docker image

The following code creates an ECR repository named ml.

ml_repo = aws.ecr.Repository(
    "ml_ecr",
    name="ml",
    image_scanning_configuration=aws.ecr.RepositoryImageScanningConfigurationArgs(
        scan_on_push=True,
    ),
)

main.py

With Pulumi, we can directly build and push the Docker image to the ECR repository. This functionality is not built into the standard Pulumi libraries. Instead, we have to add the following dependency to the project.

poetry add pulumi-docker-build
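
The snippets below assume the package is imported in main.py under the following alias:

import pulumi_docker_build as docker_build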

The following code first fetches the authorization token needed to push images to the ECR repository. Then, it builds the Docker image and pushes it to ECR.

auth_token = aws.ecr.get_authorization_token_output(registry_id=ml_repo.registry_id)
ml_repo_url = ml_repo.repository_url.apply(lambda url: f"{url}:latest")
ml_image = docker_build.Image(
    "ml-image",
    tags=[ml_repo_url],
    context=docker_build.BuildContextArgs(
        location="../catboost",
    ),
    cache_from=[
        docker_build.CacheFromArgs(
            registry=docker_build.CacheFromRegistryArgs(
                ref=ml_repo_url,
            ),
        )
    ],
    cache_to=[
        docker_build.CacheToArgs(
            inline=docker_build.CacheToInlineArgs(),
        )
    ],
    platforms=[docker_build.Platform.LINUX_AMD64],
    push=True,
    registries=[
        docker_build.RegistryArgs(
            address=ml_repo.repository_url,
            password=auth_token.password,
            username=auth_token.user_name,
        )
    ],
)

main.py


CloudWatch log group

The following step is optional because AWS Batch automatically creates a log group for each job. However, log groups created by AWS have their retention set to never expire, which means the logs are never deleted. Because storing logs in CloudWatch costs a little money, I prefer to set a retention period so the logs are automatically deleted after a specific time.

In this example, we create a log group with a retention period of 30 days.

log_group = aws.cloudwatch.LogGroup(
    "ml-log-group",
    name="/aws/batch/ml",
    retention_in_days=30,
)

main.py


IAM Roles and Policies

The setup requires four roles in total: the Batch service role, the ECS instance role, the ECS task execution role, and the job role.

The first role is the service role for AWS Batch. The managed policy AWSBatchServiceRole is attached to this role. The service role is later assigned to the Compute Environment and gives AWS Batch access to related services, including EC2, Auto Scaling, ECS, and CloudWatch Logs.

service_role = aws.iam.Role(
    "ml-service-role",
    name="ml-service-role",
    assume_role_policy=pulumi.Output.json_dumps(
        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {"Service": "batch.amazonaws.com"},
                    "Action": "sts:AssumeRole",
                }
            ],
        }
    ),
)

service_role_policy_attachment = aws.iam.RolePolicyAttachment(
    "ml-service-role-attachment",
    role=service_role.name,
    policy_arn="arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole",
)

main.py

The second role is the instance role, which is attached to the Compute Environment via an instance profile. This role gets the managed policy AmazonEC2ContainerServiceforEC2Role, which grants the ECS agent running on the instances the permissions it needs, such as registering the instances with the ECS cluster, pulling images, and sending logs. AWS Batch builds on ECS, so the Compute Environment needs these permissions to run the batch jobs.

ecs_instance_role = aws.iam.Role(
    "ecs-instance-role",
    name="ecs-instance-role",
    assume_role_policy=pulumi.Output.json_dumps(
        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {"Service": "ec2.amazonaws.com"},
                    "Action": "sts:AssumeRole",
                }
            ],
        }
    ),
)

ecs_instance_role_policy_attachment = aws.iam.RolePolicyAttachment(
    "ecs-instance-role-policy-attachment",
    role=ecs_instance_role.name,
    policy_arn="arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role",
)

ecs_instance_profile = aws.iam.InstanceProfile(
    "ecs-instance-profile",
    role=ecs_instance_role.name,
)

main.py

The third role is the task execution role. The ECS task that runs on the EC2 instances assumes this role. The role gets the managed policy AmazonECSTaskExecutionRolePolicy, which contains the ECR permissions needed to pull the Docker image from the ECR repository and the CloudWatch Logs permissions needed to write the container logs.

ecs_task_execution_role = aws.iam.Role(
    "ecs_task_execution_role",
    name="ecs-task-execution-role",
    description="Allows ECS and Batch to execute tasks",
    assume_role_policy=pulumi.Output.json_dumps(
        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": "sts:AssumeRole",
                    "Principal": {"Service": "ecs-tasks.amazonaws.com"},
                }
            ],
        }
    ),
)

ecs_task_execution_attach = aws.iam.RolePolicyAttachment(
    "ecs_task_execution_attach",
    role=ecs_task_execution_role.name,
    policy_arn="arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy",
)

main.py

The last role is the job role. This role is assumed by the job when it runs. The demo application only needs permission to upload files to the S3 bucket.

job_role = aws.iam.Role(
    "ml-job-role",
    name="ml-job",
    description="Job role for the ml job",
    assume_role_policy=pulumi.Output.json_dumps(
        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "sts:AssumeRole",
                    "Effect": "Allow",
                    "Principal": {"Service": "ecs-tasks.amazonaws.com"},
                }
            ],
        }
    ),
)

job_role_policy = aws.iam.RolePolicy(
    "ml-job-role-policy",
    role=job_role.name,
    policy=output_s3_bucket.arn.apply(
        lambda arn: pulumi.Output.json_dumps(
            {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Action": ["s3:PutObject"],
                        "Resource": [f"{arn}/*"],
                    }
                ],
            }
        )
    ),
)

main.py


EC2 Launch Template

The EC2 instances that run the batch jobs are created from a launch template. Here, we only configure the instance's root volume to a size of 100 GB. The instance type will be set later in the Compute Environment.

launch_template = aws.ec2.LaunchTemplate(
    "ml-batch",
    name="ml_batch",
    block_device_mappings=[
        aws.ec2.LaunchTemplateBlockDeviceMappingArgs(
            device_name="/dev/xvda",
            ebs=aws.ec2.LaunchTemplateBlockDeviceMappingEbsArgs(
                volume_size=100,
                volume_type="gp2",
                delete_on_termination="true",
            ),
        )
    ],
    metadata_options=aws.ec2.LaunchTemplateMetadataOptionsArgs(
        http_tokens="required",
    ),
)

main.py


Batch job definition, queue, and compute environment

Here, we see the main resources required for an AWS Batch setup. First, the code creates a Compute Environment. This is the environment where the batch jobs run. The Compute Environment is configured with the launch template, the instance types, security groups, and subnets. You can specify multiple instance types, and AWS Batch then chooses the most suitable one based on the job requirements. In this example, we only want to run the batch jobs on g5.2xlarge instances.

The important part is the ec2_configurations field. Here, we specify the ECS_AL2_NVIDIA image type, which selects the GPU-optimized, ECS-optimized Amazon Linux 2 AMI. This AMI ships the NVIDIA drivers and is required to run GPU workloads on the instances.

compute_environment = aws.batch.ComputeEnvironment(
    "ml-compute-environment",
    compute_environment_name="ce_ml",
    compute_resources=aws.batch.ComputeEnvironmentComputeResourcesArgs(
        instance_types=["g5.2xlarge"],
        max_vcpus=8,
        min_vcpus=0,
        desired_vcpus=0,
        type="EC2",
        launch_template=aws.batch.ComputeEnvironmentComputeResourcesLaunchTemplateArgs(
            launch_template_id=launch_template.id,
            version="$Latest",
        ),
        security_group_ids=vpc_security_groups.ids,
        subnets=default_subnets.ids,
        instance_role=ecs_instance_profile.arn,
        ec2_configurations=[
            aws.batch.ComputeEnvironmentComputeResourcesEc2ConfigurationArgs(
                image_type="ECS_AL2_NVIDIA",
            )
        ],
    ),
    service_role=service_role.arn,
    type="MANAGED",
    opts=pulumi.ResourceOptions(depends_on=[log_group]),
)

main.py

Next, the code creates a job queue and associates it with the Compute Environment. Whenever somebody submits a job, it goes into this queue; if resources are available in the Compute Environment, the job is executed, otherwise it waits until resources become available.

job_queue = aws.batch.JobQueue(
    "ml-job-queue",
    name="ml_job_queue",
    state="ENABLED",
    priority=10,
    compute_environments=[compute_environment.arn],
)

main.py

Finally, the code creates a job definition, a template describing a job's requirements, permissions, and environment variables. The job definition is not associated with the Compute Environment or the job queue. Submitting a job with this definition to the job queue runs the job on the associated Compute Environment. You can create multiple job definitions with different resource requirements and Docker images and submit them to the same job queue.

Ensure that the Compute Environment provides the resources required by the job. In this example, we require 8 vCPUs, 30 GB of memory, and 1 GPU. If the Compute Environment lacks resources, the job will be stuck in the queue and never run.

job_definition = aws.batch.JobDefinition(
    "ml-job-definition",
    name="ml-job-definition",
    type="container",
    platform_capabilities=["EC2"],
    timeout=aws.batch.JobDefinitionTimeoutArgs(
        attempt_duration_seconds=30 * 60,
    ),
    container_properties=pulumi.Output.json_dumps(
        {
            "image": ml_repo.repository_url.apply(lambda url: f"{url}:latest"),
            "stopTimeout": 120,
            "resourceRequirements": [
                {"type": "VCPU", "value": "8"},
                {"type": "MEMORY", "value": "30000"},
                {"type": "GPU", "value": "1"},
            ],
            "linuxParameters": {
                "initProcessEnabled": True,
            },
            "environment": [
                {
                    "name": "AWS_DEFAULTS_MODE",
                    "value": "auto",
                },
                {
                    "name": "OUTPUT_BUCKET",
                    "value": output_s3_bucket.bucket.apply(lambda bucket: bucket),
                },
            ],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": log_group.name,
                },
            },
            "executionRoleArn": ecs_task_execution_role.arn,
            "jobRoleArn": job_role.arn,
        }
    ),
)

main.py


Outputs

The following outputs are defined at the end of the Pulumi code. When the Pulumi stack is created, the names of all these resources will be printed to the console. We will need these values to submit jobs to the batch queue.

pulumi.export("ml_repo_url", ml_repo.repository_url)
pulumi.export("output_s3_bucket", output_s3_bucket.bucket)
pulumi.export("compute_environment_arn", compute_environment.arn)
pulumi.export("job_queue_arn", job_queue.arn)
pulumi.export("job_definition_arn", job_definition.arn)

main.py

Starting the batch job

There are many ways to submit a job. You can use the AWS CLI, the AWS SDKs, or the AWS Management Console. Batch jobs can also be started based on a schedule or triggered by an event. Check out this AWS documentation for more information.

The following command starts a job using the AWS CLI.

aws batch submit-job --job-name "ml-job" --job-queue "<job_queue>" --job-definition "<job_definition>" --region <region>
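
Alternatively, you can submit a job from Python with boto3. Below is a minimal sketch that uses the queue and job definition names from the Pulumi code above; the region is a placeholder.

import boto3

batch = boto3.client("batch", region_name="us-east-1")  # adjust the region

# Submit a job to the queue created by the Pulumi stack
response = batch.submit_job(
    jobName="ml-job",
    jobQueue="ml_job_queue",
    jobDefinition="ml-job-definition",
)
job_id = response["jobId"]
print("submitted job:", job_id)

# Check the current status of the job (SUBMITTED, RUNNABLE, RUNNING, SUCCEEDED, ...)
job = batch.describe_jobs(jobs=[job_id])["jobs"][0]
print("status:", job["status"])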

After you have started the job, you can monitor its progress in the AWS Management Console. If something goes wrong, check the logs in the CloudWatch log group.
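
Instead of the console, you can also tail the log group from the terminal with AWS CLI v2:

aws logs tail /aws/batch/ml --follow --region <region>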


This concludes the blog post. We have seen how to set up a batch job on AWS Batch with GPU support. The trickiest part of the setup is the different roles and policies that need to be created, but usually you just attach the standard managed policies to the roles. The crucial one is the job role, which grants your batch job access to the resources it needs.