Skip to content

Conversation

@adoroszlai
Copy link
Contributor

@adoroszlai adoroszlai commented Aug 2, 2020

What changes were proposed in this pull request?

The problem is caused by this line in test-all.sh:

https://github.com/apache/hadoop-ozone/blob/e219aae7c7bcd8eacd592ca2ccfcf58626477f37/hadoop-ozone/dist/src/main/compose/test-all.sh#L28

because testlib.sh turns on "exit on error":

https://github.com/apache/hadoop-ozone/blob/e219aae7c7bcd8eacd592ca2ccfcf58626477f37/hadoop-ozone/dist/src/main/compose/testlib.sh#L17

So test-all.sh exits after the first failed test.sh execution, ... is FAILED is not present in output.

https://github.com/apache/hadoop-ozone/blob/e219aae7c7bcd8eacd592ca2ccfcf58626477f37/hadoop-ozone/dist/src/main/compose/test-all.sh#L46-L50

The fix is simple: temporarily turn off "exit on error" before running test.sh, then turn it back on. We also need to make rebot return zero code even if some tests failed.

https://issues.apache.org/jira/browse/HDDS-4057

How was this patch tested?

Added two always failing tests and two environments. These can be used to test failure handling in acceptance test runner. Without the fix, only the first test in the first environment is run, and its logs are not copied to target/acceptance:

$ OZONE_ACCEPTANCE_SUITE=failing ./hadoop-ozone/dev-support/checks/acceptance.sh
...
Test1 :: This test always fails                                       | FAIL |
...
Output:  /tmp/smoketest/failing1/result/robot-failing1-failing1-test1-scm.xml

$ ls -1 target/acceptance

With the fix, both tests in both environments are run:

$ OZONE_ACCEPTANCE_SUITE=failing ./hadoop-ozone/dev-support/checks/acceptance.sh
...
Test1 :: This test always fails                                       | FAIL |
...
Output:  /tmp/smoketest/failing1/result/robot-failing1-failing1-test1-scm.xml
...
ERROR: Test execution of ./failing1 is FAILED!!!!
...
Test1 :: This test always fails                                       | FAIL |
...
Output:  /tmp/smoketest/failing2/result/robot-failing2-failing2-test1-scm.xml
...
ERROR: Test execution of ./failing2 is FAILED!!!!

$ ls -1 target/acceptance
docker-failing1-failing1-test1-scm.log
docker-failing2-failing2-test1-scm.log
failing1.xml
failing2.xml
log.html
report.html
summary.html

Regular CI:
https://github.com/adoroszlai/hadoop-ozone/runs/937941098

@adoroszlai adoroszlai self-assigned this Aug 2, 2020
@adoroszlai adoroszlai requested a review from elek August 3, 2020 10:04
@adoroszlai adoroszlai added the bug Something isn't working label Aug 7, 2020
Copy link
Member

@elek elek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to work on it @adoroszlai

As usual, I am fine with the patch, but I have related thoughts to discuss.

  1. set +e and set -e

I think it was a mistake to not using set -e everywhere from the beginning. I mean with --nostatusrc al the other lines supposed to be error free. (or we need to add | tree).

Not clear why do we need the set +e here if we introduce --nostatusrc

  1. failing1-2

I have mixed feeling about including the failing1 directory to here. The original vision with compose was to provide easy to understood examples for the users for different use-cases. From this point of view failing1 is a confusing name as it's test related.

But we already started to add more and more test related stuff (for example upgrade), even if upgrade can be interesting for the users as well.

What I am thinking about is some kind of separations. What about keeping the compose folder and introduce a new smoketest. compose setup should be tested as before, but complex test cases (like a restart test, or log snapshotting) can be moved to the smoketest folder.

Not sure what is the best approach, interested about your opinion.

@adoroszlai
Copy link
Contributor Author

  1. set +e and set -e
  2. failing1-2

I agree it would be nice to implement both: consistent set -e and test env. reorganization. Created HDDS-4100 and HDDS-4101 for these.

Not clear why do we need the set +e here if we introduce --nostatusrc

test.sh execution also needs to "ignore" errors, otherwise we may not reach rebot.

@elek
Copy link
Member

elek commented Aug 12, 2020

test.sh execution also needs to "ignore" errors, otherwise we may not reach rebot.

Yes, that's a good question. The scripts of hadoop-ozone/dev-support/checks/*.sh have very simple contract: in case of failure a non-zero shell code should be returned.

If we would like to keep the same contract for the individual test.sh files we need this set +/-e trick. But in this case we couldn't differentiate between errors (in environment / script) and failures (in robot tests).

Another option what I started to follow is to make test.sh return with 255 only if the test execution failed in the script, and the result of the test can be checked from the robot.xml files.

Don't know which one is better.

Anyway, let me merge this PR and we can continue to think about theses questions.

@elek elek merged commit 7ab53b5 into apache:master Aug 12, 2020
@adoroszlai adoroszlai deleted the HDDS-4057 branch August 12, 2020 10:21
@adoroszlai
Copy link
Contributor Author

Thanks @elek for reviewing and committing it.

errose28 added a commit to errose28/ozone that referenced this pull request Aug 12, 2020
* master: (28 commits)
  HDDS-4037. Incorrect container numberOfKeys and usedBytes in SCM after key deletion (apache#1295)
  HDDS-3232. Include the byteman scripts in the distribution tar file (apache#1309)
  HDDS-4095. Byteman script to debug HCFS performance (apache#1311)
  HDDS-4057. Failed acceptance test missing from bundle (apache#1283)
  HDDS-4040. [OFS] BasicRootedOzoneFileSystem to support batchDelete (apache#1286)
  HDDS-4061. Pending delete blocks are not always included in #BLOCKCOUNT metadata (apache#1288)
  HDDS-4067. Implement toString for OMTransactionInfo (apache#1300)
  HDDS-3878. Make OMHA serviceID optional if one (but only one) is defined in the config (apache#1149)
  HDDS-3833. Use Pipeline choose policy to choose pipeline from exist pipeline list (apache#1096)
  HDDS-3979. Make bufferSize configurable for stream copy (apache#1212)
  HDDS-4048. Show more information while SCM version info mismatch (apache#1278)
  HDDS-4078. Use HDDS InterfaceAudience/Stability annotations (apache#1302)
  HDDS-4034. Add Unit Test for HadoopNestedDirGenerator. (apache#1266)
  HDDS-4076. Translate CSI.md into Chinese (apache#1299)
  HDDS-4046. Extensible subcommands for CLI applications (apache#1276)
  HDDS-4051. Remove whitelist/blacklist terminology from Ozone (apache#1306)
  HDDS-4055. Cleanup GitHub workflow (apache#1282)
  HDDS-4042. Update documentation for the GA release (apache#1269)
  HDDS-4066. Add core-site.xml to intellij configuration (apache#1292)
  HDDS-4073. Remove leftover robot.robot (apache#1297)
  ...
rakeshadr pushed a commit to rakeshadr/hadoop-ozone that referenced this pull request Sep 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants