Running Python Playwright in AWS Lambda
As someone who has been doing web scraping for a while, I have tried many tools and libraries, but recently I've been mostly using Playwright. It's a great tool that allows you to automate web browsers, and it's very easy to use. However, running Playwright in AWS Lambda can be a bit tricky.
Base Docker Image
The problem with running Playwright in AWS Lambda is that it requires a browser to be installed on the machine. The AWS Lambda base Docker image doesn't come with a browser pre-installed, so you would need to install it yourself, which is tricky because of all the system dependencies a browser pulls in.
Alternatively, and this is the way I prefer, you can use the official Playwright Docker image from Microsoft and install the AWS Lambda runtime interface client (awslambdaric) on top of it. This way the browsers and their system dependencies are already in place, and you only have to add the Lambda-specific pieces.
Dockerfile
To successfully run Playwright in AWS Lambda, you can use the following Dockerfile:
ARG LAMBDA_TASK_ROOT="/var/task"
# Use the official Playwright Docker image as the base image, make sure the version matches the one you are using in your project
FROM mcr.microsoft.com/playwright/python:v1.41.0-jammy AS build-image
# Re-declare ARG so it's available in this stage
ARG LAMBDA_TASK_ROOT
RUN mkdir -p ${LAMBDA_TASK_ROOT}
WORKDIR ${LAMBDA_TASK_ROOT}
# Install aws-lambda-cpp build dependencies
RUN apt-get update && apt-get install -y g++ make cmake unzip libcurl4-openssl-dev
# Install the runtime interface client
RUN pip install --target ${LAMBDA_TASK_ROOT} awslambdaric
# ==== MULTI STAGE BUILD ====
# Use multi-stage build, to keep the final image cleaner
FROM mcr.microsoft.com/playwright/python:v1.41.0-jammy
# Re-declare ARG so it's available in this stage
ARG LAMBDA_TASK_ROOT
WORKDIR ${LAMBDA_TASK_ROOT}
# Copy in the build image dependencies
COPY --from=build-image ${LAMBDA_TASK_ROOT} ${LAMBDA_TASK_ROOT}
# Get the project dependencies
COPY requirements.txt ${LAMBDA_TASK_ROOT}/requirements.txt
# Install the specified packages
RUN pip install -r ${LAMBDA_TASK_ROOT}/requirements.txt
# Create function directory and copy the function code into it
COPY ./src ./
# Considering your handler function is called lambda_handler in app.py
ENTRYPOINT [ "/usr/bin/python", "-m", "awslambdaric" ]
CMD [ "app.lambda_handler" ]
Even though this isn't required, notice how the following order is used in the second stage:
- Copy the build image dependencies
- Get the project dependencies
- Install the specified packages
- Copy the function code into the image
This way, we are taking advantage of Docker's layer caching. If the dependencies haven't changed, Docker reuses the cached layers for those steps and only rebuilds the layer that copies the function code, which speeds up the build.
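The Dockerfile comment about keeping the image version in line with your project can be checked automatically. The following is a minimal sketch, assuming your requirements.txt pins Playwright with an exact `playwright==X.Y.Z` line; the function names are hypothetical helpers, not part of Playwright or AWS tooling.

```python
import re
from typing import Optional


def image_playwright_version(image: str) -> str:
    """Extract '1.41.0' from an image reference such as '...:v1.41.0-jammy'."""
    match = re.search(r":v(\d+\.\d+\.\d+)", image)
    if match is None:
        raise ValueError(f"no Playwright version found in image tag: {image!r}")
    return match.group(1)


def pinned_playwright_version(requirements_text: str) -> Optional[str]:
    """Return the version from a 'playwright==X.Y.Z' line, or None if not pinned."""
    for line in requirements_text.splitlines():
        match = re.match(r"\s*playwright\s*==\s*([\w.]+)", line)
        if match:
            return match.group(1)
    return None


def versions_match(image: str, requirements_text: str) -> bool:
    """True when the base image tag and the pinned package version agree."""
    return image_playwright_version(image) == pinned_playwright_version(requirements_text)
```

You could call `versions_match` from a small pre-build script and fail the build when it returns False.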
Restrictions
There are some restrictions when running Playwright in AWS Lambda. To avoid issues, I suggest passing the --disable-gpu and --single-process arguments to the browser:
browser = playwright.chromium.launch(headless=True, args=["--disable-gpu", "--single-process"])
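To show where that launch call might live, here is a minimal sketch of an app.py with a lambda_handler, matching the names used in the Dockerfile's CMD. The `url` event key, the default URL, and the returned dictionary are assumptions made for illustration, not something the setup above prescribes.

```python
# app.py: an illustrative handler, assuming the event may carry a "url" key.

# Browser flags suggested in the text above.
LAUNCH_ARGS = ["--disable-gpu", "--single-process"]

# Hypothetical fallback used when the event does not name a page.
DEFAULT_URL = "https://example.com"


def target_url(event):
    """Return the URL to visit, falling back to DEFAULT_URL."""
    if isinstance(event, dict):
        return event.get("url") or DEFAULT_URL
    return DEFAULT_URL


def lambda_handler(event, context):
    # Imported inside the handler so the module (and target_url) can be
    # imported and unit-tested on machines without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(headless=True, args=LAUNCH_ARGS)
        try:
            page = browser.new_page()
            page.goto(target_url(event))
            return {"title": page.title()}
        finally:
            # Always release the browser, even if navigation fails.
            browser.close()
```

Closing the browser in a `finally` block matters here because a warm Lambda container may be reused for later invocations.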
Testing
To make sure everything works, you can emulate the AWS Lambda environment locally using aws-lambda-rie, the AWS Lambda Runtime Interface Emulator. Once you've downloaded it, you can run your Lambda function like this (assuming the binary is in ~/.aws-lambda-rie):
$ docker run -v ~/.aws-lambda-rie:/aws-lambda -p 9000:8080 --entrypoint /aws-lambda/aws-lambda-rie <image> /usr/bin/python -m awslambdaric app.lambda_handler
Then you can invoke your function using curl:
$ curl "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{"key1": "value1", "key2": "value2"}'
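If you would rather drive the emulator from a test script than from curl, the same request can be sketched in Python with only the standard library. The function names and the 60-second timeout below are assumptions chosen for illustration; the URL is the one from the curl command.

```python
import json
import urllib.request

# Same endpoint as in the curl command above.
RIE_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"


def build_invocation(event, url=RIE_URL):
    """Build the POST request that carries the event as a JSON body."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def invoke(event, url=RIE_URL, timeout=60):
    """Send the event to the local emulator and decode its JSON response."""
    request = build_invocation(event, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the container from the docker run command to be running.
    print(invoke({"key1": "value1", "key2": "value2"}))
```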
Conclusion
By building on the Playwright image and following the configuration above, you can run browser automation functions on AWS Lambda without assembling the browser dependencies yourself. Testing locally with aws-lambda-rie helps you catch problems before you deploy.