AWS Lambda, the serverless computing service provided by Amazon Web Services (AWS), allows you to run code without provisioning or managing servers. Various events, such as changes to data in an Amazon S3 bucket, can trigger these functions. AWS Lambda supports multiple programming languages, including Java, JavaScript, Python, and Ruby. Additionally, it provides an OS-only runtime that enables you to use and deploy programs written in any programming language.
In this blog post, I will show you how to write a simple AWS Lambda function using Python, build it with Poetry, and deploy it with Pulumi.
Prerequisites ¶
You need to have the following tools installed on your machine: Python, Poetry, and Pulumi. The AWS CLI is useful for uploading test files but not strictly required.
You also need an AWS account with permission to create Lambda functions and IAM roles. If you want to send messages to a Slack channel, you also need a Slack account and a Slack API token.
Architecture ¶
The architecture of the solution is simple: a Lambda function is triggered whenever a new object is uploaded to a specific S3 bucket. The function downloads and reads the PDF file, splits it into individual pages, and uploads the split pages to an output S3 bucket. At the end of the process, it sends a notification to a Slack channel.
Setting up the project ¶
Poetry is a tool for dependency management and packaging in Python. It allows you to declare the libraries your project depends on and manage them in a virtual environment.
To create a new Poetry project, run the following command:
poetry new splitpdf
Create a new file poetry.toml in the root of the project with the following content:
[virtualenvs]
in-project = true
By default, Poetry creates virtual environments in a cache folder under the user's home directory. The configuration above tells Poetry to create the virtual environment in a .venv folder inside the project directory instead, which makes it easier to bundle the project resources for deployment.
The following commands add the necessary dependencies to the project:
cd splitpdf
poetry add boto3 "boto3-stubs[s3,ssm]" pypdf aws-lambda-powertools slack-sdk
poetry add --group dev "moto[s3]" ruff mypy pytest
These commands install the boto3 library to access AWS services, the pypdf library to manipulate PDF files, and the slack-sdk library to send messages to Slack. I like to use types in my Python code as much as possible, so I add the boto3-stubs library, which contains type hints for boto3. The aws-lambda-powertools library provides utilities and types for building Lambda functions. You can find more information about this library here.
With --group dev, you add dependencies to the development group. These dependencies will not be packaged into the deployment archive later.
Moto is a library that allows you to mock AWS services for unit testing.
Ruff is a Python linter and code formatter, mypy is a static type checker for Python,
and pytest is a testing framework.
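Ruff and mypy can also be configured in pyproject.toml. The settings below are only an illustrative starting point, not taken from the original project:
[tool.ruff]
line-length = 88
target-version = "py311"

[tool.mypy]
python_version = "3.11"
strict = true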
Writing the Lambda function ¶
After the necessary imports, we see some code outside any function. This code is executed when the Lambda function starts. After the initial cold start, the AWS runtime keeps function instances warm (in memory) for a while. When the function is invoked again, it reuses the same instance without rerunning this initialization code.
This is not a concern if you call the Lambda function only a few times a day, but it can reduce runtime and cost if you call the function many times in a short period. As a good practice, always put initialization code outside the handler function.
import os
import boto3
from aws_lambda_powertools.utilities.data_classes import S3Event, event_source
from aws_lambda_powertools.utilities.typing import LambdaContext
from mypy_boto3_s3 import S3Client
from mypy_boto3_ssm.client import SSMClient
from slack_sdk import WebClient
from splitpdf.split import split_pdf
s3_client: S3Client = boto3.client("s3")
output_bucket = os.getenv("OUTPUT_BUCKET", "output-bucket")
slack_api_token_ssm_name = os.getenv("SLACK_API_TOKEN_SSM_NAME", "slack-api-token")
ssm_client: SSMClient = boto3.client("ssm")
slack_token = ssm_client.get_parameter(
    Name=slack_api_token_ssm_name, WithDecryption=True
)["Parameter"]["Value"]
slack_channel = os.getenv("SLACK_CHANNEL", "general")
slack_client = WebClient(token=slack_token)
The initialization code creates the S3 client, reads the output bucket name from the environment variable OUTPUT_BUCKET, reads the Slack API token from the AWS Systems Manager Parameter Store, and creates the Slack client.
Lambda functions written in Python must have a handler function that expects two parameters: event and context. The event parameter contains the data that triggered the function, and the context parameter provides information about the invocation, function, and execution environment.
The @event_source decorator from the aws_lambda_powertools library specifies the type of the event source object that triggered the function. With this decorator, the function receives a typed object as the event parameter instead of a dict object. This allows the code to access the event data in a type-safe way.
@event_source(data_class=S3Event)
def lambda_handler(event: S3Event, context: LambdaContext) -> None:
    for record in event.records:
        input_bucket = record.s3.bucket.name
        input_key = record.s3.get_object.key
        output_key_prefix = input_key.replace(".pdf", "")
        split_pdf(s3_client, input_bucket, input_key, output_bucket, output_key_prefix)
        slack_client.chat_postMessage(
            channel=slack_channel,
            text=f"Processed file {input_key} from bucket {input_bucket}",
        )
The code reads the input bucket and key from the event data and passes all arguments to the split_pdf function, which does the actual work. At the end of the process, the function sends a message to a Slack channel with chat_postMessage.
Here is the implementation of the split_pdf function. It downloads the PDF file from the input bucket, splits it into individual pages, and uploads them to the output bucket.
import io

from mypy_boto3_s3 import S3Client
from pypdf import PdfReader, PdfWriter


def split_pdf(
    s3_client: S3Client,
    input_bucket: str,
    input_key: str,
    output_bucket: str,
    output_key_prefix: str,
) -> None:
    # Download the PDF from the input bucket into an in-memory buffer
    pdf_stream = io.BytesIO()
    s3_client.download_fileobj(input_bucket, input_key, pdf_stream)
    pdf_stream.seek(0)

    reader = PdfReader(pdf_stream)
    for page_num in range(reader.get_num_pages()):
        # Write each page into its own single-page PDF
        writer = PdfWriter()
        writer.add_page(reader.get_page(page_num))
        page_stream = io.BytesIO()
        writer.write(page_stream)
        page_stream.seek(0)

        # Upload the page as <prefix>/00001.pdf, <prefix>/00002.pdf, ...
        output_key = f"{output_key_prefix}/{page_num + 1:05d}.pdf"
        s3_client.upload_fileobj(page_stream, output_bucket, output_key)
Format, lint and test ¶
With the help of ruff we can format our code with the following command:
poetry run ruff format splitpdf tests
To run the linter and type checker, use the following commands:
poetry run ruff check splitpdf tests
poetry run mypy splitpdf tests
The unit test is written with pytest and stored in the tests directory, which Poetry automatically creates during project setup. For this demo application, I created one test for the split function; you can find the code here. The test leverages moto to mock the S3 client so that it avoids making actual calls to AWS services.
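A minimal sketch of what such a test could look like, assuming moto 5's mock_aws decorator and a small two-page PDF generated in memory with pypdf:
import io

import boto3
from moto import mock_aws
from pypdf import PdfWriter

from splitpdf.split import split_pdf


@mock_aws
def test_split_pdf() -> None:
    # Create the input and output buckets in the mocked S3 backend
    s3_client = boto3.client("s3", region_name="us-east-1")
    s3_client.create_bucket(Bucket="input-bucket")
    s3_client.create_bucket(Bucket="output-bucket")

    # Build a small two-page PDF in memory and upload it as the input object
    writer = PdfWriter()
    writer.add_blank_page(width=210, height=297)
    writer.add_blank_page(width=210, height=297)
    pdf_stream = io.BytesIO()
    writer.write(pdf_stream)
    pdf_stream.seek(0)
    s3_client.upload_fileobj(pdf_stream, "input-bucket", "test.pdf")

    split_pdf(s3_client, "input-bucket", "test.pdf", "output-bucket", "test")

    # Expect one object per page in the output bucket
    response = s3_client.list_objects_v2(Bucket="output-bucket", Prefix="test/")
    keys = sorted(obj["Key"] for obj in response.get("Contents", []))
    assert keys == ["test/00001.pdf", "test/00002.pdf"]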
To run the tests, use the following command:
poetry run pytest
Package code for deployment ¶
Before deploying the Lambda function, we must create a zip file containing our Python code and its dependencies. For this purpose, I wrote a Python script that creates the zip file. You could also use a command line tool like zip, but the Python approach has the benefit of being cross-platform.
import os
import shutil


def copy_directory(src: str, dst: str) -> None:
    for item in os.listdir(src):
        s = os.path.join(src, item)
        d = os.path.join(dst, item)
        if os.path.isdir(s):
            if item == "__pycache__":
                continue
            shutil.copytree(s, d, ignore=shutil.ignore_patterns("__pycache__"))
        else:
            shutil.copy2(s, d)


def run_zip() -> None:
    run_command("poetry install --only main --sync")
    shutil.rmtree("dist", ignore_errors=True)
    os.makedirs("dist/lambda-package/splitpdf", exist_ok=True)
    copy_directory(".venv/lib/site-packages", "dist/lambda-package")
    copy_directory("splitpdf", "dist/lambda-package/splitpdf")
    zip_directory("dist/lambda-package", "dist/lambda.zip")
The main part of the zip.py script is the run_zip function. It creates a new directory dist/lambda-package and copies the Python code and its dependencies into this directory. Lastly, it zips the directory to dist/lambda.zip.
Also note that the copy_directory function ignores the __pycache__ directory. AWS recommends not including __pycache__ folders in the zip file because they contain bytecode, which depends on the CPU architecture and Python version.
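The run_command and zip_directory helpers are not shown above. A minimal sketch of how they could look, using only the standard library (this is an assumption, not the original script):
import pathlib
import subprocess
import zipfile


def run_command(command: str) -> None:
    # Run a shell command and abort if it returns a non-zero exit code
    subprocess.run(command, shell=True, check=True)


def zip_directory(src_dir: str, zip_path: str) -> None:
    # Write every file under src_dir into the archive, with paths relative
    # to src_dir so that the zip root matches the package root
    src = pathlib.Path(src_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file in src.rglob("*"):
            if file.is_file():
                archive.write(file, file.relative_to(src))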
Add a new script entry to the pyproject.toml file to run the zip tool.
[tool.poetry.scripts]
zip = "zip:run_zip"
After a poetry install, run the script with the following command:
poetry run zip
Infrastructure as Code with Pulumi ¶
Pulumi is an open-source infrastructure-as-code tool for creating, deploying, and managing cloud infrastructure. The difference between Pulumi and Terraform is that Pulumi started as a tool where you write the infrastructure code in a programming language like Python, TypeScript, or Go, while Terraform uses a declarative language (HCL) to define the infrastructure. Lately, the two tools have been converging: Terraform has added support for writing configuration in programming languages (CDK for Terraform), and Pulumi has added YAML support to define infrastructure declaratively.
The following commands set up a new Pulumi project.
mkdir iac
cd iac
pulumi login --local
pulumi new aws-python
aws-python is a template for an AWS Python project. This example stores the state on the local file system. The state is where Pulumi stores the information about the resources it manages. You can also use Pulumi Cloud, Pulumi's hosted service, or other storage backends like S3 or Azure Blob Storage to store the state.
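For example, to store the state in an existing S3 bucket instead of the local file system, you could log in with (the bucket name is a placeholder):
pulumi login s3://<your-state-bucket>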
The new command asks for a password, which is used to encrypt secrets in the state file. Pulumi supports many different ways to manage secrets. You can find more information in the official Pulumi documentation.
The next command stores the Slack API token as an encrypted secret in the stack configuration.
pulumi config set --secret iac:slack_api_token xoxb-...
I put the following code in the __main__.py file, which pulumi new creates for you. After the necessary imports, the program reads the Slack API token from the stack configuration and creates the two S3 buckets.
import json
import pulumi
import pulumi_aws as aws
from pulumi import ResourceOptions
from pulumi_aws import iam, lambda_, s3, ssm
slack_api_token = pulumi.Config().require_secret("slack_api_token")
# Create S3 buckets for the input and the output
input_bucket = s3.Bucket("rasc-input-bucket")
output_bucket = s3.Bucket("rasc-output-bucket")
It then stores the Slack token in the AWS Systems Manager Parameter Store. As the previous section shows, the Lambda function reads the Slack API token from this service.
# Store the Slack token in the AWS Systems Manager Parameter Store
slack_api_token_parameter = ssm.Parameter(
    "slack_api_token", type="SecureString", value=slack_api_token
)
Next, the code creates a role for the Lambda function, which gives the function permission to access other AWS services. In this example, the program attaches the standard policy AWSLambdaBasicExecutionRole, which permits the function to write logs to CloudWatch. Additionally, it attaches a policy to the role that allows the function to access the S3 buckets and the Slack API token in the Parameter Store. Because the token is stored encrypted in the Parameter Store, the function needs permission to decrypt it (kms:Decrypt).
# Create an IAM role for the Lambda function
splitpdf_lambda_role = aws.iam.Role(
    "lambdaRole",
    assume_role_policy="""{
        "Version": "2012-10-17",
        "Statement": [{
            "Action": "sts:AssumeRole",
            "Principal": {
                "Service": "lambda.amazonaws.com"
            },
            "Effect": "Allow",
            "Sid": ""
        }]
    }""",
)
# Attach policy to the Lambda role
role_policy_attachment = aws.iam.RolePolicyAttachment(
    "lambdaRolePolicyAttachment",
    role=splitpdf_lambda_role.name,
    policy_arn=aws.iam.ManagedPolicy.AWS_LAMBDA_BASIC_EXECUTION_ROLE,
)
# Lambda's policy to access the S3 buckets and the Parameter Store
split_pdf_lambda_policy = iam.RolePolicy(
    "splitPdfLambdaRole",
    role=splitpdf_lambda_role.id,
    policy=pulumi.Output.all(
        input_bucket.arn, output_bucket.arn, slack_api_token_parameter.arn
    ).apply(
        lambda args: json.dumps(
            {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Action": ["s3:GetObject"],
                        "Effect": "Allow",
                        "Resource": f"{args[0]}/*",
                    },
                    {
                        "Action": ["s3:PutObject"],
                        "Effect": "Allow",
                        "Resource": f"{args[1]}/*",
                    },
                    {
                        "Action": ["ssm:GetParameter"],
                        "Effect": "Allow",
                        "Resource": f"{args[2]}",
                    },
                    {
                        "Action": ["kms:Decrypt"],
                        "Effect": "Allow",
                        "Resource": "*",
                    },
                ],
            }
        )
    ),
)
The next code block creates a CloudWatch log group for the Lambda function. This step is optional because CloudWatch automatically creates a log group for each Lambda function. The problem is that the log groups AWS creates have their retention set to never expire, which means the logs are never deleted. I explicitly create log groups with a reasonable retention period to avoid unnecessary costs. In this example, logs are deleted after 30 days.
# Create a CloudWatch log group
log_group = aws.cloudwatch.LogGroup(
    "splitpdfLogGroup",
    name="/aws/lambda/splitpdf",
    retention_in_days=30,
)
Next, we see the configuration for the Lambda function. First, it creates an instance of FileArchive that points to the zip we created in the previous section with the zip.py script. Then, it creates the Lambda function with the lambda_.Function call. code points to the zip file, handler references the lambda_handler function in our Python code, role points to the IAM role created earlier, and runtime specifies the runtime. AWS Lambda currently supports Python 3.8 to 3.12. architectures specifies that the function should run on the ARM64 architecture. I prefer ARM64 because it is cheaper than x86.
The code further configures the necessary environment variables. The timeout specifies how long the function can run before the Lambda runtime terminates it; this avoids runaway costs if the function runs into an infinite loop. Note that the maximum timeout is 15 minutes.
Memory specifies how much memory the function can use. You can configure memory between 128 MB and 10,240 MB. More memory also means more CPU power, so the function runs faster. At 1,769 MB, a function has the equivalent of one vCPU.
lambda_zip = pulumi.FileArchive("../splitpdf/dist/lambda.zip")

# Lambda function
lambda_func = lambda_.Function(
    "splitpdf",
    name="splitpdf",
    code=lambda_zip,
    handler="splitpdf.main.lambda_handler",
    role=splitpdf_lambda_role.arn,
    runtime="python3.11",
    architectures=["arm64"],
    environment=lambda_.FunctionEnvironmentArgs(
        variables={
            "SLACK_CHANNEL": "#general",
            "OUTPUT_BUCKET": output_bucket.bucket,
            "SLACK_API_TOKEN_SSM_NAME": slack_api_token_parameter.name,
        }
    ),
    timeout=60,  # seconds
    memory_size=256,  # MB
    opts=ResourceOptions(depends_on=[log_group]),
)
The following section creates a Lambda permission allowing the S3 input bucket to invoke the Lambda function.
# Lambda permission for S3 to invoke the function
lambda_permission = lambda_.Permission(
    "splitpdfLambdaPermission",
    action="lambda:InvokeFunction",
    function=lambda_func.name,
    principal="s3.amazonaws.com",
    source_arn=input_bucket.arn,
    statement_id="AllowS3Event",
)
Lastly, the Pulumi program creates a bucket notification for the input bucket to invoke the Lambda function whenever a new object is created. It only triggers the function when the object key has the suffix .pdf.
# S3 bucket notification for the input bucket to invoke the Lambda function
bucket_notification = s3.BucketNotification(
    "bucketNotification",
    bucket=input_bucket.id,
    lambda_functions=[
        s3.BucketNotificationLambdaFunctionArgs(
            lambda_function_arn=lambda_func.arn,
            events=["s3:ObjectCreated:*"],
            filter_suffix=".pdf",
        )
    ],
)
At the end of the configuration, the pulumi.export calls print the names of the buckets and the Lambda function to the console.
# Output the names of the buckets and the lambda function
pulumi.export("input_bucket_name", input_bucket.bucket)
pulumi.export("output_bucket_name", output_bucket.bucket)
pulumi.export("lambda_function_name", lambda_func.name)
With this configuration in place, we can deploy the infrastructure with the following command:
pulumi up
If you use aws-vault to manage your AWS credentials, you can run Pulumi with the following command:
aws-vault exec <profile> -- pulumi up
Replace <profile> with the profile name you want to use.
pulumi up does not immediately deploy the infrastructure. It shows you the changes it will make and asks for confirmation. You can then select yes to apply the changes or no to cancel the deployment.
To test the Lambda function, upload a PDF into the input bucket. You can do this with the AWS Management Console or the AWS CLI.
aws s3 cp test.pdf s3://<input-bucket-name>
After a few seconds, you should find the split pages in the output bucket and a message in the Slack channel. If you don't receive a message, check the CloudWatch logs.
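If you want to follow the logs from the command line, the AWS CLI can tail the log group we created earlier:
aws logs tail /aws/lambda/splitpdf --follow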
To destroy all the provisioned resources on AWS, run the following command:
pulumi down
Conclusion ¶
In this blog post, I showed you how to write a simple AWS Lambda function with Python, build it with Poetry, and deploy it with Pulumi. I hope you found this post helpful and that you learned something new. If you have any questions or feedback, send me a message.