Packaging External libs for AWS Glue

Cloud computing is a warehouse of vast amounts of computing power, but sometimes unlocking the treasure requires us to solve a few riddles. Here I will discuss one such riddle: using external libraries in AWS Glue.

AWS Glue is a managed service that is tightly integrated with other AWS-native tools. It allows us to build ETL pipelines using either a Python or a Spark framework. As you may know, Python is popular largely because of its vast repository of libraries, which users can leverage to expedite development. However, when we deal with a managed service, we sometimes find our creativity limited.

Prerequisites

We need to create a project to generate the dependency file. The following files need to be created:

  • setup.py
  • requirements.txt

Follow py-packager to generate the above files. In setup.py, update the project name and version, which determine the final dependency file name. Also, add the required package names to requirements.txt.
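For reference, here is a minimal sketch of what such a setup.py might look like. The PACKAGE_NAME and VERSION constants and the package name itself are hypothetical placeholders (the constant style matches the grep commands used later in this post); your py-packager output may differ.

```python
# setup.py - minimal sketch; package name and version are placeholders
from setuptools import setup, find_packages

PACKAGE_NAME = "my-glue-deps"
VERSION = "0.1.0"

# Read the external packages pinned in requirements.txt
with open("requirements.txt") as f:
    requirements = [line.strip() for line in f if line.strip()]

setup(
    name=PACKAGE_NAME,
    version=VERSION,
    packages=find_packages(),
    install_requires=requirements,
)
```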

Python packaging

A Glue Python shell job provides 1 to 16 GB of memory to execute your Python code. If we need external libs, we must provide them to Glue as egg files. Execute the following command to generate the dependency file:

python setup.py bdist_egg

This will generate a file under the dist/ folder, which can be used in a Python shell job.
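Under the hood, an egg is just a zip archive that Python's zipimport machinery can load directly once it is on sys.path, which is essentially what Glue does with the file you supply. A small self-contained sketch (with a hypothetical demo_mod module) illustrates why this works:

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny egg-style archive containing one module (hypothetical names)
egg_path = os.path.join(tempfile.mkdtemp(), "demo-0.1-py3.egg")
with zipfile.ZipFile(egg_path, "w") as zf:
    zf.writestr("demo_mod.py", "def greet():\n    return 'hello from egg'\n")

# Adding the archive to sys.path is enough: zipimport resolves the module
sys.path.insert(0, egg_path)
import demo_mod

print(demo_mod.greet())  # hello from egg
```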

Pyspark packaging

A Glue PySpark job provides a vast amount of computing power, and external modules must be provided to it as dependencies in zip format. We need to execute the following commands to generate the zip for PySpark:

VERSION=$(grep "VERSION = " setup.py | cut -d\" -f2)
PROJECT_NAME=$(grep "PACKAGE_NAME = " setup.py | cut -d\" -f2)
mkdir -p libs dist
pip install -r requirements.txt -t libs
(cd libs; zip -r ../dist/${PROJECT_NAME}-${VERSION}.zip *)
rm -rf libs/
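The grep/cut pipeline above assumes setup.py defines quoted PACKAGE_NAME and VERSION constants. As a sanity check, here is the same extraction sketched in Python against a hypothetical setup.py snippet:

```python
import re

# Hypothetical setup.py contents matching the assumed constant style
setup_py = 'PACKAGE_NAME = "my-glue-deps"\nVERSION = "1.0.0"\n'

name = re.search(r'PACKAGE_NAME = "([^"]+)"', setup_py).group(1)
version = re.search(r'VERSION = "([^"]+)"', setup_py).group(1)

# The commands above would name the archive dist/<name>-<version>.zip
zip_name = f"{name}-{version}.zip"
print(zip_name)  # my-glue-deps-1.0.0.zip
```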

This will generate a file under the dist/ folder which can be used for the PySpark job.

P.S. Python modules that depend on C extensions will not be executable under the PySpark environment.
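One way to catch this early is to scan the built zip for compiled extension files (.so/.pyd) before uploading. A self-contained sketch, using a simulated archive with made-up module names:

```python
import os
import tempfile
import zipfile

# Build a simulated dependency zip: one pure-Python module, one fake C extension
path = os.path.join(tempfile.mkdtemp(), "deps.zip")
with zipfile.ZipFile(path, "w") as zf:
    zf.writestr("pure_mod.py", "X = 1\n")
    zf.writestr("native/_speedups.so", b"\x7fELF")  # stand-in for a C extension

# Any .so/.pyd entries signal modules that will not run in the PySpark job
with zipfile.ZipFile(path) as zf:
    natives = [n for n in zf.namelist() if n.endswith((".so", ".pyd"))]

print(natives)  # ['native/_speedups.so']
```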

To use these files in AWS Glue, one needs to upload them to S3 and add the S3 path under "Python library path" in the job details.

Perfecting my job, innovating is my passion