Running Python Playwright in AWS Lambda

As someone who has been doing web scraping for a while, I have tried many tools and libraries, but recently I've been mostly using Playwright. It's a great tool that allows you to automate web browsers, and it's very easy to use. However, running Playwright in AWS Lambda can be a bit tricky.

Base Docker Image

The problem with running Playwright in AWS Lambda is that it requires a browser to be installed on the machine. The AWS Lambda base Docker image doesn't come with a browser pre-installed, so you would need to install it yourself, along with all of the system libraries the browser depends on, which is easy to get wrong.

Alternatively, and this is the way I prefer, you can start from the Playwright Docker image published by Microsoft and install the AWS Lambda runtime interface client (awslambdaric) on top of it. The browsers and their dependencies are already set up in that image, so the only thing left to add is the Lambda part.

Dockerfile

To successfully run Playwright in AWS Lambda, you can use the following Dockerfile:

ARG LAMBDA_TASK_ROOT="/var/task"

# Use the official Playwright Docker image as the base image, make sure the version matches the one you are using in your project
FROM mcr.microsoft.com/playwright/python:v1.41.0-jammy AS build-image

# Re-declare ARG so it's available in this stage
ARG LAMBDA_TASK_ROOT
RUN mkdir -p ${LAMBDA_TASK_ROOT}
WORKDIR ${LAMBDA_TASK_ROOT}

# Install aws-lambda-cpp build dependencies
RUN apt-get update && apt-get install -y g++ make cmake unzip libcurl4-openssl-dev

# Install the runtime interface client
RUN pip install --target ${LAMBDA_TASK_ROOT} awslambdaric

# ==== MULTI STAGE BUILD ====

# Use multi-stage build, to keep the final image cleaner
FROM mcr.microsoft.com/playwright/python:v1.41.0-jammy

# Re-declare ARG so it's available in this stage
ARG LAMBDA_TASK_ROOT
WORKDIR ${LAMBDA_TASK_ROOT}

# Copy in the build image dependencies
COPY --from=build-image ${LAMBDA_TASK_ROOT} ${LAMBDA_TASK_ROOT}

# Get the project dependencies
COPY requirements.txt ${LAMBDA_TASK_ROOT}/requirements.txt

# Install the specified packages
RUN pip install -r ${LAMBDA_TASK_ROOT}/requirements.txt

# Copy the function code into the working directory (LAMBDA_TASK_ROOT)
COPY ./src ./

# Assumes your handler function is called lambda_handler and lives in app.py
ENTRYPOINT [ "/usr/bin/python", "-m", "awslambdaric" ]
CMD [ "app.lambda_handler" ]

Even though this isn't required, notice how the following order is used in the second stage:

1. Copy the build image dependencies
2. Get the project dependencies
3. Install the specified packages
4. Copy the function code into the image

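This order takes advantage of Docker's layer caching. Function code changes far more often than dependencies, so copying it last means that as long as requirements.txt hasn't changed, Docker reuses the cached dependency layers and only rebuilds the final one, which speeds up the build.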
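The Dockerfile comment about keeping the image version in sync with your project's Playwright version can be checked before building. The following is a minimal sketch, assuming requirements.txt pins Playwright with an exact `playwright==X.Y.Z` line; the function names are hypothetical helpers, not part of Playwright or Docker.

```python
import re
from typing import Optional


def playwright_version_from_image(image_ref: str) -> str:
    """Extract '1.41.0' from a reference such as '...playwright/python:v1.41.0-jammy'."""
    match = re.search(r":v(\d+\.\d+\.\d+)", image_ref)
    if match is None:
        raise ValueError(f"no version tag found in {image_ref!r}")
    return match.group(1)


def playwright_version_from_requirements(requirements_text: str) -> Optional[str]:
    """Return the version pinned as 'playwright==X.Y.Z', or None if there is no such pin."""
    for line in requirements_text.splitlines():
        match = re.match(r"\s*playwright\s*==\s*(\d+\.\d+\.\d+)", line, re.IGNORECASE)
        if match:
            return match.group(1)
    return None


def versions_match(image_ref: str, requirements_text: str) -> bool:
    """True when the image tag and the pinned package version are identical."""
    pinned = playwright_version_from_requirements(requirements_text)
    return pinned == playwright_version_from_image(image_ref)
```

A check like this could run in CI before `docker build`, so a mismatch between the base image and the pinned package shows up early rather than at invocation time.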

Restrictions

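Lambda's execution environment is more restricted than a regular server. There is no GPU, for example, and Chromium's default multi-process setup can run into problems there. To avoid issues, I suggest passing the --disable-gpu and --single-process arguments to the browser: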

browser = playwright.chromium.launch(headless=True, args=["--disable-gpu", "--single-process"])
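For context, here is a minimal sketch of what the app.py referenced by the Dockerfile's CMD might look like with these launch arguments. It uses Playwright's synchronous API; the "url" event key, the default URL, and the response shape are assumptions made for illustration, not something Lambda or Playwright requires.

```python
# app.py: hypothetical handler matching CMD [ "app.lambda_handler" ]
import json

# Launch arguments suggested above for the Lambda environment
LAUNCH_ARGS = ["--disable-gpu", "--single-process"]


def build_response(url, title):
    """Shape the result as a small JSON response (an assumed format)."""
    return {"statusCode": 200, "body": json.dumps({"url": url, "title": title})}


def lambda_handler(event, context):
    # Imported here so the module can still be imported (for example in
    # unit tests) on machines where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    # "url" is an assumed event key, used only for this example
    url = event.get("url", "https://example.com")

    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(headless=True, args=LAUNCH_ARGS)
        try:
            page = browser.new_page()
            page.goto(url)
            title = page.title()
        finally:
            # Always close the browser so a failed page load doesn't leak the process
            browser.close()

    return build_response(url, title)
```

Launching and closing the browser inside the handler keeps each invocation independent, at the cost of a slower start per call.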

Testing

To make sure everything works, you can emulate the AWS Lambda environment locally using the AWS Lambda Runtime Interface Emulator (aws-lambda-rie). Once you've installed it, you can mount it into the container and run your Lambda function like this (assuming you've installed it in ~/.aws-lambda-rie):

$ docker run -v ~/.aws-lambda-rie:/aws-lambda -p 9000:8080 --entrypoint /aws-lambda/aws-lambda-rie <image> /usr/bin/python -m awslambdaric app.lambda_handler

Then you can invoke your function using curl:

$ curl "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{"key1": "value1", "key2": "value2"}'
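If you'd rather invoke the function from a Python test script, the same request can be made with the standard library. This is a rough sketch that assumes the container from the docker run command above is listening on port 9000 and that the handler returns something JSON-serializable; `build_invocation` and `invoke` are hypothetical helper names.

```python
import json
import urllib.request

# Same endpoint as in the curl command above
RIE_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"


def build_invocation(payload, url=RIE_URL):
    """Build the POST request that carries the event payload as JSON."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=data, method="POST")


def invoke(payload, url=RIE_URL, timeout=60.0):
    """Send the event to the local emulator and decode the JSON response."""
    request = build_invocation(payload, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, `invoke({"key1": "value1", "key2": "value2"})` sends the same event as the curl command. The generous timeout is there because starting a browser can take several seconds.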

Conclusion

Starting from the official Playwright image and adding the Lambda runtime interface client on top gives you a working browser automation function on AWS Lambda without having to install the browser and its dependencies by hand. Testing locally with aws-lambda-rie helps you catch problems before you deploy.