
Commit c7a5c82

[HUDI-2267] Update docs and infra test configs, add support for graphite (apache#3482)
Co-authored-by: Sivabalan Narayanan <[email protected]>
1 parent 3a150ee commit c7a5c82

7 files changed: +114 −23 lines

docker/compose/docker-compose_hadoop284_hive233_spark244.yml

+11 −2

```diff
@@ -33,7 +33,7 @@ services:
       interval: 30s
       timeout: 10s
       retries: 3
-
+
   datanode1:
     image: apachehudi/hudi-hadoop_2.8.4-datanode:latest
     container_name: datanode1
@@ -84,7 +84,7 @@ services:
       - hive-metastore-postgresql:/var/lib/postgresql
     hostname: hive-metastore-postgresql
     container_name: hive-metastore-postgresql
-
+
   hivemetastore:
     image: apachehudi/hudi-hadoop_2.8.4-hive_2.3.3:latest
     hostname: hivemetastore
@@ -221,6 +221,15 @@ services:
       - ${HUDI_WS}:/var/hoodie/ws
     command: worker
 
+  graphite:
+    container_name: graphite
+    hostname: graphite
+    image: graphiteapp/graphite-statsd
+    ports:
+      - 80:80
+      - 2003-2004:2003-2004
+      - 8126:8126
+
   adhoc-1:
     image: apachehudi/hudi-hadoop_2.8.4-hive_2.3.3-sparkadhoc_2.4.4:latest
     hostname: adhoc-1
```
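
A quick way to sanity-check the new graphite service (an editor's sketch, not part of the commit; it assumes the compose stack above is up and `nc` is available on the host) is to push one datapoint through Carbon's plaintext listener on port 2003 and confirm it appears in the web UI:

```bash
# Hypothetical smoke test for the graphite service defined above.
# Carbon's plaintext protocol takes "<metric.path> <value> <unix-timestamp>".
echo "hudi.smoke.test 42 $(date +%s)" | nc -q0 localhost 2003

# The datapoint should then render via Graphite's HTTP API, e.g.:
#   http://localhost/render?target=hudi.smoke.test&format=json
```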

docker/demo/config/test-suite/complex-dag-cow.yaml

+2 −2

```diff
@@ -49,7 +49,7 @@ dag_content:
     deps: third_insert
   first_validate:
     config:
-      validate_hive: true
+      validate_hive: false
     type: ValidateDatasetNode
     deps: first_hive_sync
   first_upsert:
@@ -76,7 +76,7 @@ dag_content:
     deps: first_delete
   second_validate:
     config:
-      validate_hive: true
+      validate_hive: false
       delete_input_data: true
     type: ValidateDatasetNode
     deps: second_hive_sync
```

docker/demo/config/test-suite/cow-clustering-example.yaml

+2 −2

```diff
@@ -55,7 +55,7 @@ dag_content:
     deps: first_delete
   first_validate:
     config:
-      validate_hive: true
+      validate_hive: false
     type: ValidateDatasetNode
     deps: first_hive_sync
   first_cluster:
@@ -71,6 +71,6 @@ dag_content:
     deps: first_cluster
   second_validate:
     config:
-      validate_hive: true
+      validate_hive: false
     type: ValidateDatasetNode
     deps: second_hive_sync
```

docker/demo/config/test-suite/cow-long-running-example.yaml

+2 −2

```diff
@@ -49,7 +49,7 @@ dag_content:
     deps: third_insert
   first_validate:
     config:
-      validate_hive: true
+      validate_hive: false
     type: ValidateDatasetNode
     deps: first_hive_sync
   first_upsert:
@@ -76,7 +76,7 @@ dag_content:
     deps: first_delete
   second_validate:
     config:
-      validate_hive: true
+      validate_hive: false
       delete_input_data: true
     type: ValidateDatasetNode
     deps: second_hive_sync
```

docker/demo/config/test-suite/cow-long-running-multi-partitions.yaml

+2 −2

```diff
@@ -49,7 +49,7 @@ dag_content:
     deps: third_insert
   first_validate:
     config:
-      validate_hive: true
+      validate_hive: false
     type: ValidateDatasetNode
     deps: first_hive_sync
   first_upsert:
@@ -76,7 +76,7 @@ dag_content:
     deps: first_delete
   second_validate:
     config:
-      validate_hive: true
+      validate_hive: false
       delete_input_data: true
     type: ValidateDatasetNode
     deps: second_hive_sync
```

docker/generate_test_suite.sh

+47 −2

```diff
@@ -16,6 +16,37 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+usage="
+USAGE:
+$(basename "$0") [--help] [--all boolean] -- Script to generate the test suites according to the arguments provided and run them.
+
+where:
+    --help                            show this help text
+    --all                             enable all test suites: medium, long, and clustering (DEFAULT false)
+    --execute_test_suite              whether the generated tests need to execute (DEFAULT true)
+    --medium_num_iterations           number of medium iterations (DEFAULT 20)
+    --long_num_iterations             number of long iterations (DEFAULT 30)
+    --intermittent_delay_mins         delay after every test run (DEFAULT 1)
+    --table_type                      hoodie table type to test (DEFAULT COPY_ON_WRITE)
+    --include_long_test_suite_yaml    include long infra test suite (DEFAULT false)
+    --include_medium_test_suite_yaml  include medium infra test suite (DEFAULT false)
+    --cluster_num_itr                 number of cluster iterations (DEFAULT 30)
+    --include_cluster_yaml            include cluster infra test suite (DEFAULT false)
+    --input_path                      input path for test in docker image (DEFAULT /user/hive/warehouse/hudi-integ-test-suite/input/)
+    --output_path                     output path for test in docker image (DEFAULT /user/hive/warehouse/hudi-integ-test-suite/output/)
+
+Example:
+Note - Execute the command from within the docker folder.
+
+1. To generate and run all test suites:
+   ./generate_test_suite.sh --all true
+2. To only generate test suites:
+   ./generate_test_suite.sh --all true --execute_test_suite false
+3. To run only a specific test suite yaml:
+   ./generate_test_suite.sh --execute_test_suite true --include_medium_test_suite_yaml true
+"
+
 MEDIUM_NUM_ITR=20
 LONG_NUM_ITR=50
 DELAY_MINS=1
@@ -39,6 +70,17 @@ do
 key="$1"
 
 case $key in
+    --help)
+    echo "$usage"
+    exit
+    ;;
+    --all)
+    INCLUDE_LONG_TEST_SUITE="$2"
+    INCLUDE_MEDIUM_TEST_SUITE="$2"
+    INCLUDE_CLUSTER_YAML="$2"
+    shift # past argument
+    shift # past value
+    ;;
     --execute_test_suite)
     EXECUTE_TEST_SUITE="$2"
     shift # past argument
@@ -115,12 +157,15 @@ case $key in
     ;;
     *) # unknown option
     POSITIONAL+=("$1") # save it in an array for later
+    echo "Unknown argument provided - '$1'"
+    echo "$usage"
+    exit 0
     shift # past argument
     ;;
 esac
 done
 set -- "${POSITIONAL[@]}" # restore positional parameters
-
+echo "$POSITIONAL"
 echo "Include Medium test suite $INCLUDE_MEDIUM_TEST_SUITE"
 if $INCLUDE_MEDIUM_TEST_SUITE ; then
     echo "Medium test suite iterations = ${MEDIUM_NUM_ITR}"
@@ -232,7 +277,7 @@ fi
 
 if $EXECUTE_TEST_SUITE ; then
 
-  docker cp $CUR_DIR/../packaging/hudi-integ-test-bundle/target/$JAR_NAME adhoc-2:/opt/
+  docker cp $CUR_DIR/../packaging/hudi-integ-test-bundle/target/"$JAR_NAME" adhoc-2:/opt/
   docker exec -it adhoc-2 /bin/bash rm -rf /opt/staging*
   docker cp demo/config/test-suite/staging/ adhoc-2:/opt/
   docker exec -it adhoc-2 /bin/bash echo "\n============================== Executing sanity test suite ============================== "
```
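
For quick reference, an invocation sketch based on the usage text added above (flag names are as listed there; run from within the docker folder):

```bash
# Show the new help text.
./generate_test_suite.sh --help

# Generate and execute all test suites (medium, long, and clustering).
./generate_test_suite.sh --all true

# Generate and run only the medium suite, overriding its iteration count.
./generate_test_suite.sh --include_medium_test_suite_yaml true --medium_num_iterations 10
```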

hudi-integ-test/README.md

+48 −11

````diff
@@ -177,7 +177,7 @@ cd /opt
 Copy the integration tests jar into the docker container
 
 ```
-docker cp packaging/hudi-integ-test-bundle/target/hudi-integ-test-bundle-0.8.0-SNAPSHOT.jar adhoc-2:/opt
+docker cp packaging/hudi-integ-test-bundle/target/hudi-integ-test-bundle-0.10.0-SNAPSHOT.jar adhoc-2:/opt
 ```
 
 ```
@@ -214,21 +214,29 @@ spark-submit \
 --conf spark.network.timeout=600s \
 --conf spark.yarn.max.executor.failures=10 \
 --conf spark.sql.catalogImplementation=hive \
+--conf spark.driver.extraClassPath=/var/demo/jars/* \
+--conf spark.executor.extraClassPath=/var/demo/jars/* \
 --class org.apache.hudi.integ.testsuite.HoodieTestSuiteJob \
-/opt/hudi-integ-test-bundle-0.8.0-SNAPSHOT.jar \
+/opt/hudi-integ-test-bundle-0.10.0-SNAPSHOT.jar \
 --source-ordering-field test_suite_source_ordering_field \
 --use-deltastreamer \
 --target-base-path /user/hive/warehouse/hudi-integ-test-suite/output \
 --input-base-path /user/hive/warehouse/hudi-integ-test-suite/input \
 --target-table table1 \
 --props file:/var/hoodie/ws/docker/demo/config/test-suite/test.properties \
---schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
+--schemaprovider-class org.apache.hudi.integ.testsuite.schema.TestSuiteFileBasedSchemaProvider \
 --source-class org.apache.hudi.utilities.sources.AvroDFSSource \
 --input-file-size 125829120 \
 --workload-yaml-path file:/var/hoodie/ws/docker/demo/config/test-suite/complex-dag-cow.yaml \
 --workload-generator-classname org.apache.hudi.integ.testsuite.dag.WorkflowDagGenerator \
 --table-type COPY_ON_WRITE \
---compact-scheduling-minshare 1
+--compact-scheduling-minshare 1 \
+--hoodie-conf hoodie.metrics.on=true \
+--hoodie-conf hoodie.metrics.reporter.type=GRAPHITE \
+--hoodie-conf hoodie.metrics.graphite.host=graphite \
+--hoodie-conf hoodie.metrics.graphite.port=2003 \
+--clean-input \
+--clean-output
 ```
 
 Or a Merge-on-Read job:
@@ -253,23 +261,44 @@ spark-submit \
 --conf spark.network.timeout=600s \
 --conf spark.yarn.max.executor.failures=10 \
 --conf spark.sql.catalogImplementation=hive \
+--conf spark.driver.extraClassPath=/var/demo/jars/* \
+--conf spark.executor.extraClassPath=/var/demo/jars/* \
 --class org.apache.hudi.integ.testsuite.HoodieTestSuiteJob \
-/opt/hudi-integ-test-bundle-0.8.0-SNAPSHOT.jar \
+/opt/hudi-integ-test-bundle-0.10.0-SNAPSHOT.jar \
 --source-ordering-field test_suite_source_ordering_field \
 --use-deltastreamer \
 --target-base-path /user/hive/warehouse/hudi-integ-test-suite/output \
 --input-base-path /user/hive/warehouse/hudi-integ-test-suite/input \
 --target-table table1 \
 --props file:/var/hoodie/ws/docker/demo/config/test-suite/test.properties \
---schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
+--schemaprovider-class org.apache.hudi.integ.testsuite.schema.TestSuiteFileBasedSchemaProvider \
 --source-class org.apache.hudi.utilities.sources.AvroDFSSource \
 --input-file-size 125829120 \
 --workload-yaml-path file:/var/hoodie/ws/docker/demo/config/test-suite/complex-dag-mor.yaml \
 --workload-generator-classname org.apache.hudi.integ.testsuite.dag.WorkflowDagGenerator \
 --table-type MERGE_ON_READ \
---compact-scheduling-minshare 1
+--compact-scheduling-minshare 1 \
+--hoodie-conf hoodie.metrics.on=true \
+--hoodie-conf hoodie.metrics.reporter.type=GRAPHITE \
+--hoodie-conf hoodie.metrics.graphite.host=graphite \
+--hoodie-conf hoodie.metrics.graphite.port=2003 \
+--clean-input \
+--clean-output
 ```
 
+## Visualize and inspect the hoodie metrics and performance (local)
+A Graphite server is already set up (and running) in ```docker/setup_demo.sh```.
+
+Open a browser and access the metrics at
+```
+http://localhost:80
+```
+and the dashboard at
+```
+http://localhost/dashboard
+```
+
 ## Running long running test suite in Local Docker environment
 
 For long running test suite, validation has to be done differently. Idea is to run same dag in a repeated manner for
````
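
To poke at the metrics from the command line as well as the dashboard, a hypothetical check (assuming the demo stack with the graphite container is up and a test-suite job ran with hoodie.metrics.on=true):

```bash
# List the metric paths Graphite has received so far.
curl "http://localhost/metrics/find?query=*"

# Render one series as JSON; the exact path depends on your
# hoodie.metrics.graphite.metric.prefix and target table name.
curl "http://localhost/render?target=<your.metric.path>&format=json"
```
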
````diff
@@ -279,12 +308,12 @@ contents both via spark datasource and hive table via spark sql engine. Hive val
 If you have "ValidateDatasetNode" in your dag, do not replace hive jars as instructed above. Spark sql engine does not
 go well w/ hive2* jars. So, after running docker setup, follow the below steps.
 ```
-docker cp packaging/hudi-integ-test-bundle/target/hudi-integ-test-bundle-0.8.0-SNAPSHOT.jar adhoc-2:/opt/
-docker cp demo/config/test-suite/test.properties adhoc-2:/opt/
+docker cp packaging/hudi-integ-test-bundle/target/hudi-integ-test-bundle-0.10.0-SNAPSHOT.jar adhoc-2:/opt/
+docker cp docker/demo/config/test-suite/test.properties adhoc-2:/opt/
 ```
 Also copy your dag of interest to adhoc-2:/opt/
 ```
-docker cp demo/config/test-suite/complex-dag-cow.yaml adhoc-2:/opt/
+docker cp docker/demo/config/test-suite/complex-dag-cow.yaml adhoc-2:/opt/
 ```
 
 For repeated runs, two additional configs need to be set. "dag_rounds" and "dag_intermittent_delay_mins".
@@ -428,7 +457,7 @@ spark-submit \
 --conf spark.driver.extraClassPath=/var/demo/jars/* \
 --conf spark.executor.extraClassPath=/var/demo/jars/* \
 --class org.apache.hudi.integ.testsuite.HoodieTestSuiteJob \
-/opt/hudi-integ-test-bundle-0.8.0-SNAPSHOT.jar \
+/opt/hudi-integ-test-bundle-0.10.0-SNAPSHOT.jar \
 --source-ordering-field test_suite_source_ordering_field \
 --use-deltastreamer \
 --target-base-path /user/hive/warehouse/hudi-integ-test-suite/output \
@@ -446,6 +475,14 @@ spark-submit \
 --clean-output
 ```
 
+If you wish to enable metrics, add the below properties as well:
+```
+--hoodie-conf hoodie.metrics.on=true \
+--hoodie-conf hoodie.metrics.reporter.type=GRAPHITE \
+--hoodie-conf hoodie.metrics.graphite.host=graphite \
+--hoodie-conf hoodie.metrics.graphite.port=2003 \
+```
+
 A few ready-to-use dags are available under docker/demo/config/test-suite/ that could give you an idea for
 long-running dags.
 ```
````
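
Since the same four Graphite properties recur in each spark-submit in this README, one convenience (an editor's sketch assuming the commands are composed in bash; not part of the commit) is to collect them in a shell array and splice them into any of the commands above:

```bash
# Hypothetical helper: bundle the Graphite confs from this commit so they
# can be appended to any spark-submit command shown in the README.
GRAPHITE_CONFS=(
  --hoodie-conf hoodie.metrics.on=true
  --hoodie-conf hoodie.metrics.reporter.type=GRAPHITE
  --hoodie-conf hoodie.metrics.graphite.host=graphite
  --hoodie-conf hoodie.metrics.graphite.port=2003
)

# Usage sketch ("..." stands for the full argument list shown earlier):
# spark-submit ... "${GRAPHITE_CONFS[@]}" --clean-input --clean-output
```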
