Merge-on-Red: Implemented YAML log reader alongside the XML ones #11807

ivdiazsa · 2022-12-01T22:49:11Z

Follow up to issue #11559. Now that the Helix machines have the PyYaml dependency installed, we can proceed with the next stage, which is processing the YAML test logs.

For context, our current test logs are written in XML format. The problem with this is that if there's a fatal hang/freeze or crash, the log ends up incomplete and therefore unreadable later on. This motivated adding another log in YAML format. Even if it's "incomplete", it's perfectly readable thanks to YAML's lack of closing indicators.
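A minimal sketch of the difference, assuming a made-up log layout (the element names, keys, and the `tests:` list below are hypothetical, not the real log schema). The XML half uses only the standard library; the YAML half uses PyYAML's `safe_load` and returns `None` when the package is not installed. Note that a YAML log cut off mid-line, for instance inside an open quote, can still fail to parse; the point is only that a log cut off between records stays readable.

```python
import xml.etree.ElementTree as ET

# Hypothetical log contents, cut off the way a crash or hang would leave them.
TRUNCATED_XML = """<assemblies>
  <test name="A" result="Pass" />
  <test name="B" result="Pass" />
  <test name="C"
"""

TRUNCATED_YAML = """tests:
- name: A
  result: Pass
- name: B
  result: Pass
- name: C
"""


def read_xml(text):
    """Return the test names, or None if the XML is not well-formed."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # A missing closing tag makes the whole document unparseable.
        return None
    return [test.get("name") for test in root.iter("test")]


def read_yaml(text):
    """Return the test names, or None if PyYAML is not installed."""
    try:
        import yaml  # PyYAML, a third-party package
    except ImportError:
        return None
    doc = yaml.safe_load(text) or {}
    return [test.get("name") for test in doc.get("tests", [])]


if __name__ == "__main__":
    print("XML :", read_xml(TRUNCATED_XML))
    print("YAML:", read_yaml(TRUNCATED_YAML))
```

With the truncated XML the reader gets nothing back at all, while the YAML reader still sees every record written before the cut, including the partially written last one.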

The full project can be found in the following links, for further information:

Merge-on-Red is thoroughly explained here: [User Story] CI Health: Redefining CI investigations and Health runtime#75243
As for this particular work item, we are tracking it here: [Merge-on-Red] - Accurately log catastrophic test failures and freezes in source-gened test infrastructure runtime#77735 (This one is also linked in the Merge-on-Red issue specified above)

The goal of this PR is to update Helix's scripts so that they can process the YAML logs alongside their existing XML counterparts.

MattGal · 2022-12-01T23:34:20Z

Rats. There's a problem here, namely since you're doing this in the reporter it has to work inside docker containers (log of failure)

I tried to make this an explicit dependency of the helix client scripts (so it'd be forcibly installed inside docker scenarios) but it broke on some specific SLES 12 dependencies, so I settled for just installing it on the helix clients themselves. I'm not sure how to unblock you here, sorry for the inconvenience. I will discuss w/ my team and let you know if I have any useful ideas.

MattGal · 2022-12-02T00:09:54Z

I apologize for any of your time I’ve wasted here, but at this point I don’t see any reasonable way forward for this feature to work in Helix. Specifically, the Helix client .WHL is authored in such a way that it is compatible, without modifications, with literally every mac, linux, and windows machine we run on. This is made more complex by the fact that .NET Core supports some old, long-supported OSes such as SLES, and that changing out the version of Python 3 / PIP on these machines often can result in breaking core operating system functionality and/or requires building python from source.

As such, the work I did end up doing was to make sure some version of pyYaml was on every helix image, but this doesn’t apply to the docker scenarios. In this case, the dependencies of the Helix WHL are installed at the beginning of any docker work item (where we try to pre-install them before running the work item), but expressing this dependency in our WHL prevents the WHL from working on SLES 12 and possibly SLES 15. These distros contain distutils versions of these packages which cannot be updated / downgraded via pip, and which break our installer if this dependency is expressed.

We walk a precarious line trying to keep a set of dependencies that can run the Helix client on all of our OSes and architectures, which is often frustrating for folks who want to use latest and greatest packages / python features. If and when we remove support for SLES, it would be possible to try to put this dependency back into the Helix client WHL and see if it works again. I’ll also be the first to admit I’m still a relative newcomer to python, and I’m willing to try out other approaches: if you think there’s a reasonable way to rectify this problem, I will do my best to try it out and see if we can make something work.

ChadNedzlek · 2022-12-02T00:12:08Z

src/Microsoft.DotNet.Helix/Sdk/tools/azure-pipelines/reporter/formats/yaml.py

+        skip_reason = test.find("reason")
+        res = TestResult(name, u'yaml', type_name, method, time, result, exception_type,
+                         failure_message, stack_trace, skip_reason, attachments)
+        yield res


I think we discussed that we need some sort of terminal detection, to make sure that the tests actually finished and didn't just abort in the middle. Presumably it can go at the end here, and just be something like:

if not contents.get("completed"): yield TestResult("TEST CRASH"...)

I'm not sure what a good name for the fake test is, since you'd presumably want it to be different for different workitems (so that you didn't just get a bunch of "TEST CRASH" tests that you can't tell which actual test execution they are bound to).
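A rough sketch of that idea, under several assumptions: the `TestResult` below is a cut-down stand-in for the reporter's real class, the `tests` and `completed` keys are a hypothetical log layout, and `work_item_name` is a hypothetical parameter used to make the synthetic failure unique per work item.

```python
from collections import namedtuple

# Stand-in for the reporter's TestResult; the real class takes more fields.
TestResult = namedtuple("TestResult", ["name", "kind", "result", "failure_message"])


def read_results(contents, work_item_name):
    """Yield one result per logged test, plus a synthetic failure when the
    log never recorded completion. `contents` is the parsed YAML mapping."""
    for test in contents.get("tests", []):
        yield TestResult(test["name"], "yaml", test.get("result", "Unknown"), None)

    # Assumed terminal marker: the runner writes `completed: true` as its last step.
    if not contents.get("completed"):
        yield TestResult(
            "{}.WorkItemExecution".format(work_item_name),  # hypothetical naming scheme
            "yaml",
            "Fail",
            "Test log ended without a completion marker; the run likely crashed or hung.",
        )


if __name__ == "__main__":
    crashed = {"tests": [{"name": "A", "result": "Pass"}]}
    for result in read_results(crashed, "my-work-item"):
        print(result.name, result.result)
```

Prefixing the fake test with the work item name is one way to keep the synthetic failures distinguishable from each other; any identifier that is unique per work item would serve the same purpose.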

markwilkie · 2022-12-02T21:59:49Z

Thanks @MattGal - I was sorta afraid this might happen. @ivdiazsa - we're hoping to work on a common/shared approach to testing which this likely falls into.

ivdiazsa · 2022-12-02T23:09:59Z

@MattGal This is a very unfortunate turn of events indeed. I really appreciate the detailed explanation of all that's going on nonetheless, Matt. I can't think of a solution off the top of my head as of now, so we will have to discuss this problem with the whole Merge-on-Red team. We'll be in touch in the near future.

agocke · 2022-12-05T19:03:06Z

I'm confused. Why does the yaml parsing logic need to run in the runners themselves? Don't we just need the parsing to run after the run has finished, in order to find the Build Analysis failures? Can't that be done in the orchestration layer?

alexperovich · 2022-12-05T19:05:44Z

The test result parsing happens on the same machine the tests run on, after the test run. This means whatever python packages are needed for it to work must be installed. Changing where the parsing happens would be risky feature work, but doable.

agocke · 2022-12-05T19:33:22Z

Yeah, I guess I would have expected it to be done on a different machine for exactly this reason -- you don't want to be burdened with the machine requirements.

Moreover, if the runner is super slow for whatever reason I'd imagine you don't want to do compute-heavy work on it.

ChadNedzlek · 2022-12-05T20:00:08Z

The main problem is the scale. Having to spin up another machine to send the results would dramatically increase the costs of helix. We already have a machine that can do the work with the file already available. Moving that parsing to another machine means that we need to have another machine and that the intermediate file needs to get transferred there somehow. And versioning becomes complicated as well. Right now, since it's all running on the same machine with the same scripts, there are no N/N-1 versioning problems.

markwilkie · 2022-12-05T21:19:32Z

To reiterate, it's my belief that there's a solution here, but it likely involves another architectural layer, not a bolt-on to Helix.

agocke · 2022-12-06T00:44:26Z

It sounds like a pretty big architectural change. We can think about it, but in the meantime I’m ok with standardizing on the xunit xml format. Our test watchdog can be responsible for writing or fixing up the file if necessary.

Implemented Yaml reader alongside Xunit, Junit, and Trx.

ced8c1c

ivdiazsa requested review from markwilkie, MattGal and ChadNedzlek December 1, 2022 22:49

ivdiazsa added the area-Infrastructure-coreclr label Dec 1, 2022

ChadNedzlek reviewed Dec 2, 2022

ivdiazsa closed this May 3, 2023
