
Commit a15b601

Added sample junit project (#153)
* Added sample junit project
* Delete examples/GlueUnitTestingLocalSample/README
* Upgraded lang3 to match Glue 4 on production
* Added batch companion
* Renamed
* Update README.md
* Update README.md
* Update README.md
1 parent 20e064f commit a15b601

8 files changed: +472 -0 lines changed
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
## GlueUnitTestingLocalSample

This project can be used as a template for an AWS Glue version 4.0 (PySpark) project with pytest unit tests.

### Prerequisites

To do the setup correctly, the script needs several tools to be installed and available in the system path during the setup; afterwards you just need Python and Java.

- Python3 https://www.python.org/downloads/
- Java JRE with the JAVA_HOME environment variable set. https://docs.aws.amazon.com/corretto/latest/corretto-8-ug/downloads-list.html. Java 8 is recommended, since it is what Glue 4.0 uses, but Java 11 also works since it is backwards compatible with 8.
- Apache Maven https://maven.apache.org/download.cgi
- Git https://git-scm.com/downloads

### Setup

Execute the setup script provided in the base directory.
On a Linux/Mac shell:

    sh setup_venv.sh

On a Microsoft Windows Command Prompt:

    setup_venv.cmd

The script will create a Python virtual environment for the project and install PySpark and the required Glue libraries into it.
If the script fails, you might have to wipe the *venv* directory before rerunning it once you have solved the issue.
Once the setup is complete, you will get a message indicating that the setup is done.
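If you want to double-check that the Glue 4 jars were wired into the environment, a quick sanity check like the one below (a hypothetical snippet, not part of the sample) can be run from the activated venv; it only uses the APIs already exercised by the sample Glue script.

```python
# Hypothetical post-setup check: round-trips a tiny DataFrame through a
# DynamicFrame using the Glue 4 jars copied into the PySpark installation.
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

sc = SparkContext.getOrCreate()
glue_ctx = GlueContext(sc)
df = glue_ctx.spark_session.createDataFrame([(1, "ok")], ["id", "status"])
print(DynamicFrame.fromDF(df, glue_ctx, "check").toDF().collect())
```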
### Run unit tests

On Linux/Mac, activate the virtual environment created by running:

    source venv/bin/activate

On Windows, it is already activated by the setup script; if you need to reactivate it later, run:

    call venv/Scripts/activate

Once the environment is activated and the prompt starts with *(venv)*, simply run the **pytest** command, which will locate and run the sample unit test in the *test* directory that tests the Glue script under the *src* folder.
If all goes as expected, pytest will report that the test has passed and store the test report and coverage files under the *build* directory.

### Sample unit test provided

The project includes a sample test. When you run pytest, it will find the file *test_glue_script.py* in the *test* directory, load the test suite and run the test *test_glue_script*.
It finds the files and the test configuration based on pytest naming conventions.
The test first mocks a catalog source and a Postgres sink, since unit tests shouldn't make external connections.
Then it loads the Glue script in the *src* directory and validates that the data produced is the result of reading and transforming as expected.
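The actual *test_glue_script.py* is not shown in this commit view, but a minimal sketch of how such mocking could be wired with pytest's *monkeypatch* looks like the following. The module name *glue_script*, the sample rows, and the assertions are illustrative assumptions, not the shipped test.

```python
# Hypothetical sketch, NOT the shipped test_glue_script.py: mock the Glue readers,
# writers and Job so importing the script exercises only the transformation logic.
import importlib
import sys
from unittest.mock import MagicMock

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame, DynamicFrameReader, DynamicFrameWriter
from awsglue.job import Job
from pyspark.context import SparkContext


def test_glue_script_sketch(monkeypatch):
    sc = SparkContext.getOrCreate()
    glue_ctx = GlueContext(sc)

    # Fake frame returned instead of reading S3 or the Data Catalog.
    fake_df = glue_ctx.spark_session.createDataFrame(
        [("1", "Jane", "Doe")], ["customerId", "firstname", "lastname"]
    )
    fake_dynf = DynamicFrame.fromDF(fake_df, glue_ctx, "fake_source")

    # The script resolves --JOB_NAME from sys.argv, so provide one.
    monkeypatch.setattr(sys, "argv", ["glue_script.py", "--JOB_NAME", "local-test"])

    # Short-circuit every external read, write and side effect.
    written = []
    monkeypatch.setattr(DynamicFrameReader, "from_options", lambda self, **kw: fake_dynf)
    monkeypatch.setattr(DynamicFrameReader, "from_catalog", lambda self, **kw: fake_dynf)
    monkeypatch.setattr(
        DynamicFrameWriter, "from_options",
        lambda self, frame, **kw: written.append((kw.get("connection_type"), frame)),
    )
    monkeypatch.setattr(GlueContext, "purge_s3_path", MagicMock())
    monkeypatch.setattr(Job, "init", MagicMock())
    monkeypatch.setattr(Job, "commit", MagicMock())

    importlib.import_module("glue_script")  # runs the script top to bottom

    # The Postgres sink should have received the remapped customer columns.
    sinks = dict(written)
    assert set(sinks["postgresql"].toDF().columns) == {"customerId", "firstname", "lastname"}
```

Patching the reader and writer classes at class level keeps the script's top-level code untouched while the test controls every external input and output.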
The result of running the test suite looks like this (using the flag *--disable-warnings* for simplicity):

    ===================================================================================== test session starts =====================================================================================
    platform linux -- Python 3.7.16, pytest-7.4.4, pluggy-1.2.0
    rootdir: /tmp/aws-glue-samples/examples/GlueUnitTestingLocalSample
    configfile: pytest.ini
    plugins: cov-4.1.0
    collected 1 item

    test/test_glue_script.py . [100%]

    ------------------------------------------- generated xml file: /tmp/aws-glue-samples/examples/GlueUnitTestingLocalSample/build/gluetest-report.xml -------------------------------------------

    ---------- coverage: platform linux, python 3.7.16-final-0 -----------
    Coverage XML written to file build/cov.xml

    =============================================================================== 1 passed, 2 warnings in 12.46s ================================================================================
Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.amazonaws</groupId>
    <artifactId>AWSGlueApp</artifactId>
    <version>1.0-SNAPSHOT</version>
    <name>${project.artifactId}</name>
    <description>AWS Glue ETL application</description>

    <properties>
        <scala.version>2.12.7</scala.version>
        <glue.version>4.0.0</glue.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
            <!-- A "provided" dependency, this will be ignored when you package your application -->
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>AWSGlueETL</artifactId>
            <version>${glue.version}</version>
            <!-- A "provided" dependency, this will be ignored when you package your application -->
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>2.17.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-api</artifactId>
            <version>2.17.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.12.0</version>
        </dependency>
    </dependencies>

    <repositories>
        <repository>
            <id>aws-glue-etl-artifacts</id>
            <url>https://aws-glue-etl-artifacts.s3.amazonaws.com/release/</url>
        </repository>
    </repositories>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <plugins>
            <plugin>
                <!-- see http://davidb.github.com/scala-maven-plugin -->
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.4.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>1.6.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>java</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <systemProperties>
                        <systemProperty>
                            <key>spark.master</key>
                            <value>local[*]</value>
                        </systemProperty>
                        <systemProperty>
                            <key>spark.app.name</key>
                            <value>localrun</value>
                        </systemProperty>
                        <systemProperty>
                            <key>org.xerial.snappy.lib.name</key>
                            <value>libsnappyjava.jnilib</value>
                        </systemProperty>
                    </systemProperties>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-enforcer-plugin</artifactId>
                <version>3.0.0-M2</version>
                <executions>
                    <execution>
                        <id>enforce-maven</id>
                        <goals>
                            <goal>enforce</goal>
                        </goals>
                        <configuration>
                            <rules>
                                <requireMavenVersion>
                                    <version>3.5.3</version>
                                </requireMavenVersion>
                            </rules>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <!-- The shade plugin will be helpful in building an uberjar or fatjar.
                 You can use this jar in the AWS Glue runtime environment. For more information, see https://maven.apache.org/plugins/maven-shade-plugin/ -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.2.4</version>
                <configuration>
                    <!-- any other shade configurations -->
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
[pytest]
pythonpath = src
addopts = --cov --junitxml=build/gluetest-report.xml --cov-report xml:build/cov.xml
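The *pythonpath = src* entry is what lets test modules import the Glue job code without packaging it. A throwaway check such as the following (a hypothetical file, not part of this commit, assuming the script under *src* is named *glue_script.py*) would pass once this configuration is picked up:

```python
# Hypothetical test/test_pythonpath.py: verifies that pytest resolves modules
# under src/ thanks to the pythonpath setting in pytest.ini.
import importlib.util


def test_glue_script_module_is_on_the_path():
    assert importlib.util.find_spec("glue_script") is not None
```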
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
@echo off
rem Verify that the required tools are available on the PATH before doing anything.
call python --version
IF %ERRORLEVEL% NEQ 0 (
    echo This script requires a version of Python3 to be installed and included in the PATH
    exit /b 1
)

call mvn --version
IF %ERRORLEVEL% NEQ 0 (
    echo This script requires Maven to be installed and included in the PATH
    exit /b 1
)

call git --version
IF %ERRORLEVEL% NEQ 0 (
    echo This script requires Git to be installed and included in the PATH
    exit /b 1
)

echo Creating Python virtual environment
python -m venv venv
IF %ERRORLEVEL% NEQ 0 (
    echo Failed to create Python virtual environment, cannot continue
    exit /b 1
)
echo Created venv
call venv/Scripts/activate.bat

echo Installing Python modules
pip install pyspark==3.3.0 pytest pytest-cov || exit /b
pip install git+https://github.com/awslabs/aws-glue-libs.git || exit /b

rem Use a small temporary Python script to locate the PySpark installation directory.
set TMP_PY_FILE=tmp_get_path.py
echo import pyspark > %TMP_PY_FILE%
echo import os >> %TMP_PY_FILE%
echo print(os.path.dirname(os.path.realpath(pyspark.__file__))) >> %TMP_PY_FILE%
for /f %%i in ('python %TMP_PY_FILE%') do set PYSPARK_PATH=%%i
del %TMP_PY_FILE%
echo Installed PySpark under: %PYSPARK_PATH%

rem Swap the PySpark jars for the Glue 4 jars resolved through configuration/pom.xml.
echo Replacing jars with the ones from Glue 4
set JARS_PATH=%PYSPARK_PATH%\jars
move %JARS_PATH% %JARS_PATH%_bak
call mvn -f configuration/pom.xml dependency:copy-dependencies -DoutputDirectory=%JARS_PATH%

echo ---------------------------------------------------------------------------------------------------------------------
echo Done, to run the test you can run 'pytest' with the virtual environment activated,
echo if you need to reactivate it later you can run 'call venv/Scripts/activate'
echo ---------------------------------------------------------------------------------------------------------------------
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
#!/bin/bash -eu

# Verify that the required tools are available on the PATH before doing anything.
if ! command -v mvn > /dev/null; then
    echo "This script requires Maven to be installed and included in the PATH"
    exit 1
fi

if ! command -v python3 > /dev/null; then
    echo "This script requires a version of Python3 to be installed and included in the PATH"
    exit 1
fi

if ! command -v git > /dev/null; then
    echo "This script requires Git to be installed and included in the PATH"
    exit 1
fi

python3 -m venv venv
echo "Created venv"
source venv/bin/activate
echo "Installing Python modules"
pip install pyspark==3.3.0 pytest pytest-cov
pip install git+https://github.com/awslabs/aws-glue-libs.git

# Locate the PySpark installation directory inside the virtual environment.
PYSPARK_PATH=$(python -c '
import pyspark
import os
print(os.path.dirname(os.path.realpath(pyspark.__file__)))
')
echo "Installed PySpark under: $PYSPARK_PATH"

# Swap the PySpark jars for the Glue 4 jars resolved through configuration/pom.xml.
echo "Replacing jars with the ones from Glue 4"
JARS_PATH="$PYSPARK_PATH/jars"
mv "$JARS_PATH" "${JARS_PATH}_bak" || true
mvn -f configuration/pom.xml dependency:copy-dependencies -DoutputDirectory="${JARS_PATH}"
echo "---------------------------------------------------------------------------------------------------------------------"
echo "Done, to run the test you can run 'pytest' once you have activated the venv using 'source venv/bin/activate'"
echo "---------------------------------------------------------------------------------------------------------------------"
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import DataFrame
from awsglue.dynamicframe import DynamicFrame

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)


# Read CSV data from S3 and rewrite it to the output path as Parquet.
s3data_dynf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://somebucket/somepath"]},
    format="csv"
)

output_path = "s3://outputbucket/datapath"
glueContext.purge_s3_path(output_path, {"retentionPeriod": 0})
glueContext.write_dynamic_frame.from_options(
    frame=s3data_dynf,
    connection_type="s3",
    format="parquet",
    connection_options={
        "path": output_path,
    },
    format_options={
        "useGlueParquetWriter": True,
    },
)

# Read the customer table from the Data Catalog, remap its columns and write it to Postgres.
customer_df = glueContext.create_dynamic_frame.from_catalog(
    database="corey-reporting-db",
    table_name="cust-corey_customer",
    transformation_ctx="dynamoDbConnection_node1",
)

customer_df = ApplyMapping.apply(frame=customer_df, mappings=[
    ("customerId", "string", "customerId", "int"),
    ("firstname", "string", "firstname", "string"),
    ("lastname", "string", "lastname", "string")
], transformation_ctx="customerMapping")

customerSink = glueContext.write_dynamic_frame.from_options(
    frame=customer_df,
    connection_type='postgresql',
    connection_options={
        "url": "jdbc:postgresql://********.us-east-1.rds.amazonaws.com:5432/corey_reporting",
        "dbtable": "poc.customer",
        "user": "postgres",
        "password": "********"
    },
    transformation_ctx="customerSink"
)

job.commit()
