Apache Spark with AWS Glue metastore and Python Docker image
Base Image: Uses sdaberdaku/spark-with-glue-builder:v3.5.1 as the base image.
Python Version: Uses Python 3.10.14 (the slim-bookworm image variant).
User Setup: Creates a system user 'spark' with a specified UID and GID for running Spark processes.
Java and Other Packages: Installs OpenJDK 17, tini, procps, and gettext-base.
Environment Variables: Sets up environment variables for Java, Spark, and Hadoop.
Spark and Hadoop Installation: Copies Spark and Hadoop binaries from the builder image to the specified directories.
Permissions: Adjusts ownership of Spark and Hadoop directories to the 'spark' user.
Entrypoint Setup: Copies and configures the entrypoint and decommission scripts for Spark.
Python Dependencies: Installs PySpark and other dependencies listed in the requirements.txt file.
This Docker image inherits a set of JARs from its builder image.
The following JAR files are included in the image (under /opt/spark/jars):
- AWS Glue Data Catalog Spark Client JAR
- AWS Java SDK bundle library
- Hadoop AWS library
- Wildfly OpenSSL library
- PostgreSQL library
- Checker Qual
- delta-spark
- antlr4-runtime
- delta-storage
- delta-storage-s3-dynamodb
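To confirm that a given build of the image actually contains the JARs listed above, a small check along the following lines can be run inside the container. This is a minimal sketch: the helper `missing_jars` and the prefixes in `EXPECTED_PREFIXES` are hypothetical names chosen for illustration, and the prefixes are guesses at the usual artifact names, since the exact file names and versions are not given here.

```python
import re
from pathlib import Path
from typing import Iterable, List

# Assumed artifact-name prefixes for the JARs listed above. The real file
# names and versions in the image may differ, so adjust as needed.
EXPECTED_PREFIXES = (
    "aws-glue-datacatalog-spark-client",
    "aws-java-sdk-bundle",
    "hadoop-aws",
    "wildfly-openssl",
    "postgresql",
    "checker-qual",
    "delta-spark",
    "antlr4-runtime",
    "delta-storage",
    "delta-storage-s3-dynamodb",
)


def missing_jars(jar_dir: str, prefixes: Iterable[str] = EXPECTED_PREFIXES) -> List[str]:
    """Return the prefixes for which no matching JAR file exists in jar_dir."""
    names = [path.name for path in Path(jar_dir).glob("*.jar")]
    missing = []
    for prefix in prefixes:
        # Require a version number (or a Scala suffix such as "_2.12") right
        # after the prefix, so "delta-storage" does not match
        # "delta-storage-s3-dynamodb-<version>.jar".
        pattern = re.compile(re.escape(prefix) + r"[-_]\d")
        if not any(pattern.match(name) for name in names):
            missing.append(prefix)
    return missing
```

Inside the container, `missing_jars("/opt/spark/jars")` would be expected to return an empty list if every assumed prefix matches a file; a non-empty result may simply mean a prefix guess is wrong rather than that a JAR is absent.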
Hadoop native libraries are downloaded and installed in the /opt/hadoop directory.
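The Glue client and Delta JARs bundled in the image are typically activated through Spark configuration. The sketch below shows one plausible way to do that from PySpark. The helper names `glue_delta_conf` and `build_session` are hypothetical, and the class names come from the upstream AWS Glue Data Catalog client and Delta Lake projects; whether this image needs additional settings (credentials, region, S3 options) is not covered here and is left as an assumption.

```python
from typing import Dict

# Hive client factory class shipped with the AWS Glue Data Catalog client.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_delta_conf() -> Dict[str, str]:
    """Return Spark settings for a Glue-backed Hive catalog with Delta Lake enabled."""
    return {
        # Use the Hive catalog implementation (equivalent to enableHiveSupport()).
        "spark.sql.catalogImplementation": "hive",
        # Route Hive metastore calls to the AWS Glue Data Catalog.
        "spark.hadoop.hive.metastore.client.factory.class": GLUE_FACTORY,
        # Standard Delta Lake session settings.
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def build_session(app_name: str = "glue-delta-example"):
    """Create a SparkSession using the settings above (requires PySpark)."""
    # Imported lazily so glue_delta_conf() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in glue_delta_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With AWS credentials and a region available to the container, `build_session().sql("SHOW DATABASES").show()` would be expected to list the databases in the Glue Data Catalog.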
Follow these instructions to build the Docker image:
- Clone this repository:
git clone https://github.com/sdaberdaku/spark-glue-python.git
cd spark-glue-python
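- Build the Docker image:
docker build --network host -t sdaberdaku/spark-glue-python:v3.5.1-python3.10.14 .
- Push the image to the registry (requires push access to the sdaberdaku/spark-glue-python repository):
docker push sdaberdaku/spark-glue-python:v3.5.1-python3.10.14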