
Commit a15b601

Added sample junit project (#153)
* Added sample junit project
* Delete examples/GlueUnitTestingLocalSample/README
* Upgraded lang3 to match Glue 4 on production
* Added batch companion
* Renamed
* Update README.md
* Update README.md
* Update README.md
1 parent 20e064f commit a15b601

8 files changed: +472 -0 lines changed
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
## GlueUnitTestingLocalSample

This project can be used as a template for an AWS Glue version 4.0 (PySpark) project with pytest unit tests.

### Prerequisites

To do the setup correctly, the script needs several tools to be installed and available in the system path during the setup; afterwards you just need Python and Java.

- Python3 https://www.python.org/downloads/
- Java JRE with the JAVA_HOME environment variable set. https://docs.aws.amazon.com/corretto/latest/corretto-8-ug/downloads-list.html. Java 8 is recommended, since it is what Glue 4.0 uses, but Java 11 also works since it is backwards compatible with 8.
- Apache Maven https://maven.apache.org/download.cgi
- Git https://git-scm.com/downloads

### Setup

Execute the setup script provided in the base directory.
On a Linux/Mac shell:

    sh setup_venv.sh

On a Microsoft Windows Command Prompt:

    setup_venv.cmd

The script will create a Python virtual environment for the project and install PySpark and the required Glue libraries into it.
If the script fails, you might have to wipe the *venv* directory before rerunning it once you have solved the issue.
Once the setup is complete, you will get a message indicating that the setup is done.
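If you want to double-check that the Glue 4 jars were wired into the environment, a quick sanity check like the one below (a hypothetical snippet, not part of the sample) can be run from the activated venv; it only uses the APIs already exercised by the sample Glue script.

```python
# Hypothetical post-setup check: round-trips a tiny DataFrame through a
# DynamicFrame using the Glue 4 jars copied into the PySpark installation.
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

sc = SparkContext.getOrCreate()
glue_ctx = GlueContext(sc)
df = glue_ctx.spark_session.createDataFrame([(1, "ok")], ["id", "status"])
print(DynamicFrame.fromDF(df, glue_ctx, "check").toDF().collect())
```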
### Run unit tests

On Linux/Mac, activate the virtual environment created by running:

    source venv/bin/activate

On Windows, it is already activated by the setup script; if you need to reactivate it later, run:

    call venv/Scripts/activate

Once the environment is activated and the prompt starts with *(venv)*, simply run the **pytest** command, which will locate and run the sample unit test in the *test* directory that tests the Glue script under the *src* folder.
If all goes as expected, pytest will report that the test has passed and store the test report and coverage files under the *build* directory.

### Sample unit test provided

The project includes a sample test. When you run pytest, it will find the file *test_glue_script.py* in the *test* directory, load the test suite and run the test *test_glue_script*.
It finds the files and the test configuration based on pytest naming conventions.
The test first mocks a catalog source and a Postgres sink, since unit tests shouldn't make external connections.
Then it loads the Glue script in the *src* directory and validates that the data produced is the result of reading and transforming as expected.
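The actual *test_glue_script.py* is not shown in this commit view, but a minimal sketch of how such mocking could be wired with pytest's *monkeypatch* looks like the following. The module name *glue_script*, the sample rows, and the assertions are illustrative assumptions, not the shipped test.

```python
# Hypothetical sketch, NOT the shipped test_glue_script.py: mock the Glue readers,
# writers and Job so importing the script exercises only the transformation logic.
import importlib
import sys
from unittest.mock import MagicMock

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame, DynamicFrameReader, DynamicFrameWriter
from awsglue.job import Job
from pyspark.context import SparkContext


def test_glue_script_sketch(monkeypatch):
    sc = SparkContext.getOrCreate()
    glue_ctx = GlueContext(sc)

    # Fake frame returned instead of reading S3 or the Data Catalog.
    fake_df = glue_ctx.spark_session.createDataFrame(
        [("1", "Jane", "Doe")], ["customerId", "firstname", "lastname"]
    )
    fake_dynf = DynamicFrame.fromDF(fake_df, glue_ctx, "fake_source")

    # The script resolves --JOB_NAME from sys.argv, so provide one.
    monkeypatch.setattr(sys, "argv", ["glue_script.py", "--JOB_NAME", "local-test"])

    # Short-circuit every external read, write and side effect.
    written = []
    monkeypatch.setattr(DynamicFrameReader, "from_options", lambda self, **kw: fake_dynf)
    monkeypatch.setattr(DynamicFrameReader, "from_catalog", lambda self, **kw: fake_dynf)
    monkeypatch.setattr(
        DynamicFrameWriter, "from_options",
        lambda self, frame, **kw: written.append((kw.get("connection_type"), frame)),
    )
    monkeypatch.setattr(GlueContext, "purge_s3_path", MagicMock())
    monkeypatch.setattr(Job, "init", MagicMock())
    monkeypatch.setattr(Job, "commit", MagicMock())

    importlib.import_module("glue_script")  # runs the script top to bottom

    # The Postgres sink should have received the remapped customer columns.
    sinks = dict(written)
    assert set(sinks["postgresql"].toDF().columns) == {"customerId", "firstname", "lastname"}
```

Patching the reader and writer classes at class level keeps the script's top-level code untouched while the test controls every external input and output.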
The result of running the test suite looks like this (using the flag *--disable-warnings* for simplicity):

    ===================================================================================== test session starts =====================================================================================
    platform linux -- Python 3.7.16, pytest-7.4.4, pluggy-1.2.0
    rootdir: /tmp/aws-glue-samples/examples/GlueUnitTestingLocalSample
    configfile: pytest.ini
    plugins: cov-4.1.0
    collected 1 item

    test/test_glue_script.py . [100%]

    ------------------------------------------- generated xml file: /tmp/aws-glue-samples/examples/GlueUnitTestingLocalSample/build/gluetest-report.xml -------------------------------------------

    ---------- coverage: platform linux, python 3.7.16-final-0 -----------
    Coverage XML written to file build/cov.xml

    =============================================================================== 1 passed, 2 warnings in 12.46s ================================================================================
Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.amazonaws</groupId>
    <artifactId>AWSGlueApp</artifactId>
    <version>1.0-SNAPSHOT</version>
    <name>${project.artifactId}</name>
    <description>AWS Glue ETL application</description>

    <properties>
        <scala.version>2.12.7</scala.version>
        <glue.version>4.0.0</glue.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
            <!-- A "provided" dependency, this will be ignored when you package your application -->
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>AWSGlueETL</artifactId>
            <version>${glue.version}</version>
            <!-- A "provided" dependency, this will be ignored when you package your application -->
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>2.17.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-api</artifactId>
            <version>2.17.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.12.0</version>
        </dependency>
    </dependencies>

    <repositories>
        <repository>
            <id>aws-glue-etl-artifacts</id>
            <url>https://aws-glue-etl-artifacts.s3.amazonaws.com/release/</url>
        </repository>
    </repositories>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <plugins>
            <plugin>
                <!-- see http://davidb.github.com/scala-maven-plugin -->
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.4.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>1.6.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>java</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <systemProperties>
                        <systemProperty>
                            <key>spark.master</key>
                            <value>local[*]</value>
                        </systemProperty>
                        <systemProperty>
                            <key>spark.app.name</key>
                            <value>localrun</value>
                        </systemProperty>
                        <systemProperty>
                            <key>org.xerial.snappy.lib.name</key>
                            <value>libsnappyjava.jnilib</value>
                        </systemProperty>
                    </systemProperties>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-enforcer-plugin</artifactId>
                <version>3.0.0-M2</version>
                <executions>
                    <execution>
                        <id>enforce-maven</id>
                        <goals>
                            <goal>enforce</goal>
                        </goals>
                        <configuration>
                            <rules>
                                <requireMavenVersion>
                                    <version>3.5.3</version>
                                </requireMavenVersion>
                            </rules>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <!-- The shade plugin will be helpful in building an uberjar or fatjar.
                 You can use this jar in the AWS Glue runtime environment. For more information, see https://maven.apache.org/plugins/maven-shade-plugin/ -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.2.4</version>
                <configuration>
                    <!-- any other shade configurations -->
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
[pytest]
pythonpath = src
addopts = --cov --junitxml=build/gluetest-report.xml --cov-report xml:build/cov.xml
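The *pythonpath = src* entry is what lets test modules import the Glue job code without packaging it. A throwaway check such as the following (a hypothetical file, not part of this commit, assuming the script under *src* is named *glue_script.py*) would pass once this configuration is picked up:

```python
# Hypothetical test/test_pythonpath.py: verifies that pytest resolves modules
# under src/ thanks to the pythonpath setting in pytest.ini.
import importlib.util


def test_glue_script_module_is_on_the_path():
    assert importlib.util.find_spec("glue_script") is not None
```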
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
@echo off
rem Verify that the required tools are available on the PATH before doing anything.
call python --version
IF %ERRORLEVEL% NEQ 0 (
    echo This script requires a version of Python3 to be installed and included in the PATH
    exit /b 1
)

call mvn --version
IF %ERRORLEVEL% NEQ 0 (
    echo This script requires Maven to be installed and included in the PATH
    exit /b 1
)

call git --version
IF %ERRORLEVEL% NEQ 0 (
    echo This script requires Git to be installed and included in the PATH
    exit /b 1
)

echo Creating Python virtual environment
python -m venv venv
IF %ERRORLEVEL% NEQ 0 (
    echo Failed to create Python virtual environment, cannot continue
    exit /b 1
)
echo Created venv
call venv/Scripts/activate.bat

echo Installing Python modules
pip install pyspark==3.3.0 pytest pytest-cov || exit /b
pip install git+https://github.com/awslabs/aws-glue-libs.git || exit /b

rem Use a small temporary Python script to locate the PySpark installation directory.
set TMP_PY_FILE=tmp_get_path.py
echo import pyspark > %TMP_PY_FILE%
echo import os >> %TMP_PY_FILE%
echo print(os.path.dirname(os.path.realpath(pyspark.__file__))) >> %TMP_PY_FILE%
for /f %%i in ('python %TMP_PY_FILE%') do set PYSPARK_PATH=%%i
del %TMP_PY_FILE%
echo Installed PySpark under: %PYSPARK_PATH%

rem Swap the PySpark jars for the Glue 4 jars resolved through configuration/pom.xml.
echo Replacing jars with the ones from Glue 4
set JARS_PATH=%PYSPARK_PATH%\jars
move %JARS_PATH% %JARS_PATH%_bak
call mvn -f configuration/pom.xml dependency:copy-dependencies -DoutputDirectory=%JARS_PATH%

echo ---------------------------------------------------------------------------------------------------------------------
echo Done, to run the test you can run 'pytest' with the virtual environment activated,
echo if you need to reactivate it later you can run 'call venv/Scripts/activate'
echo ---------------------------------------------------------------------------------------------------------------------
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
#!/bin/bash -eu

# Verify that the required tools are available on the PATH before doing anything.
if ! command -v mvn > /dev/null; then
    echo "This script requires Maven to be installed and included in the PATH"
    exit 1
fi

if ! command -v python3 > /dev/null; then
    echo "This script requires a version of Python3 to be installed and included in the PATH"
    exit 1
fi

if ! command -v git > /dev/null; then
    echo "This script requires Git to be installed and included in the PATH"
    exit 1
fi

python3 -m venv venv
echo "Created venv"
source venv/bin/activate
echo "Installing Python modules"
pip install pyspark==3.3.0 pytest pytest-cov
pip install git+https://github.com/awslabs/aws-glue-libs.git

# Locate the PySpark installation directory inside the virtual environment.
PYSPARK_PATH=$(python -c '
import pyspark
import os
print(os.path.dirname(os.path.realpath(pyspark.__file__)))
')
echo "Installed PySpark under: $PYSPARK_PATH"

# Swap the PySpark jars for the Glue 4 jars resolved through configuration/pom.xml.
echo "Replacing jars with the ones from Glue 4"
JARS_PATH="$PYSPARK_PATH/jars"
mv "$JARS_PATH" "${JARS_PATH}_bak" || true
mvn -f configuration/pom.xml dependency:copy-dependencies -DoutputDirectory="${JARS_PATH}"
echo "---------------------------------------------------------------------------------------------------------------------"
echo "Done, to run the test you can run 'pytest' once you have activated the venv using 'source venv/bin/activate'"
echo "---------------------------------------------------------------------------------------------------------------------"
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import DataFrame
from awsglue.dynamicframe import DynamicFrame

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)


# Read CSV data from S3 and rewrite it to the output path as Parquet.
s3data_dynf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://somebucket/somepath"]},
    format="csv"
)

output_path = "s3://outputbucket/datapath"
glueContext.purge_s3_path(output_path, {"retentionPeriod": 0})
glueContext.write_dynamic_frame.from_options(
    frame=s3data_dynf,
    connection_type="s3",
    format="parquet",
    connection_options={
        "path": output_path,
    },
    format_options={
        "useGlueParquetWriter": True,
    },
)

# Read the customer table from the Data Catalog, remap its columns and write it to Postgres.
customer_df = glueContext.create_dynamic_frame.from_catalog(
    database="corey-reporting-db",
    table_name="cust-corey_customer",
    transformation_ctx="dynamoDbConnection_node1",
)

customer_df = ApplyMapping.apply(frame=customer_df, mappings=[
    ("customerId", "string", "customerId", "int"),
    ("firstname", "string", "firstname", "string"),
    ("lastname", "string", "lastname", "string")
], transformation_ctx="customerMapping")

customerSink = glueContext.write_dynamic_frame.from_options(
    frame=customer_df,
    connection_type='postgresql',
    connection_options={
        "url": "jdbc:postgresql://********.us-east-1.rds.amazonaws.com:5432/corey_reporting",
        "dbtable": "poc.customer",
        "user": "postgres",
        "password": "********"
    },
    transformation_ctx="customerSink"
)

job.commit()
