
Conversation


@marksvc marksvc commented Sep 30, 2025

It bothers me to see all the error messages.
My workstation can start it in 15 s. GHA can start it in 30 s.




codecov bot commented Sep 30, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.18%. Comparing base (ced642c) to head (d0a9b97).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3483   +/-   ##
=======================================
  Coverage   82.18%   82.18%           
=======================================
  Files         611      611           
  Lines       36449    36449           
  Branches     6005     6005           
=======================================
  Hits        29957    29957           
  Misses       5623     5623           
  Partials      869      869           


@marksvc marksvc marked this pull request as ready for review September 30, 2025 22:17
@pmachapman pmachapman self-assigned this Sep 30, 2025
@pmachapman pmachapman self-requested a review September 30, 2025 22:29

@pmachapman pmachapman left a comment


@pmachapman reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @pmachapman)


src/SIL.XForge.Scripture/ClientApp/e2e/await-application-startup.mts line 7 at r1 (raw file):

const pollUrl = 'http://localhost:5000/projects';
const pollInterval = 17000;

Wouldn't it be better to poll every second, but reduce the error logging to only log those failures that occur after the 30-second mark? This would help reduce the error messages without a non-standard interval number (i.e. when someone looks at 17000 in the future, they won't realize it is based on your PC's and GHA's performance).

Including an exponential backoff for the interval, i.e. 1000, 2000, 4000, 8000, 16000, is also common practice for polling non-responsive services (see the sketch below).

Code quote:

const pollInterval = 17000;
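
A minimal sketch of the poll-every-second-but-log-late idea (hypothetical names, not the PR's actual code; assumes Node 18+ so the global fetch is available):

```ts
// Hypothetical sketch only: poll every second, but stay quiet about
// failures until a grace period (here 30 s) has elapsed.

const pollUrl = 'http://localhost:5000/projects';
const pollIntervalMs = 1_000;
const quietPeriodMs = 30_000;

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function awaitStartup(): Promise<void> {
  const start = Date.now();
  for (;;) {
    try {
      const response = await fetch(pollUrl);
      if (response.ok) {
        console.log('Startup check passed. Exiting.');
        return;
      }
    } catch (error) {
      // Before the quiet period ends, connection refusals are expected noise.
      if (Date.now() - start >= quietPeriodMs) {
        console.error(`Startup check failed: ${error}`);
      }
    }
    await sleep(pollIntervalMs);
  }
}

await awaitStartup();
```

An exponential-backoff variant would replace the fixed pollIntervalMs with something like Math.min(16_000, 1_000 * 2 ** attempt).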


@Nateowami Nateowami left a comment


Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @marksvc)


-- commits line 4 at r1:
I don't see what's wrong with the current approach. I can't think of a reason to run this locally, and usually if you're looking at it on GHA, it's because the e2e tests are broken, and we're going to have to re-run them. This will often add an unnecessary delay to running the tests, which we'd ideally like to optimize to run much faster.

@marksvc marksvc commented Oct 1, 2025

-- commits line 4 at r1:
One problem is that when I go look at e2e logs, I see

0:00 Startup check failed: error sending request for url (http://localhost:5000/projects): client error (Connect): tcp connect error: Connection refused (os error 111)

0:01 Startup check failed: error sending request for url (http://localhost:5000/projects): client error (Connect): tcp connect error: Connection refused (os error 111)

0:02 Startup check failed: error sending request for url (http://localhost:5000/projects): client error (Connect): tcp connect error: Connection refused (os error 111)

0:03 Startup check failed: error sending request for url (http://localhost:5000/projects): client error (Connect): tcp connect error: Connection refused (os error 111)

0:04 Startup check failed: error sending request for url (http://localhost:5000/projects): client error (Connect): tcp connect error: Connection refused (os error 111)
...

33 times. Early on, when I was still learning and checked the e2e logs, I incorrectly thought the e2e tests never even successfully started, because of all these messages. The little message "Startup check passed. Exiting." at the end is a squeak compared to all the errors :)

The other problem is seeing it on my workstation, which happens more often. On my local computer I start things all the time: a compile job, launching a program, starting some tests. Each gives an indication of starting up, and eventually it's going. But it doesn't say "Error" a dozen times before successfully starting up :-)

Even a message saying "Still waiting", printed every second 33 times, would be an improvement.
I like what Peter suggests: try every second, but don't print the messages until enough time has passed. Then it could try every second for half a minute, without saying Error every time.

I can't think of a reason to run this locally,

I often run it locally. If the GHA job fails, I might be able to reproduce what it is doing more accurately by running the same script.
Another reason I run it locally is that if I use e2e.mts, then

  1. It does stuff in the GUI and sometimes I am concerned that I might have messed it up with my mouse or keyboard.
  2. I need to be more coordinated with SF running locally in the background, since e2e.mts seems to test an already running SF.

I know the e2e.mts script has various configuration etc, but it's been very convenient to just run the bash script that handles it all.

usually if you're looking at it on GHA, it's because the e2e tests are broken, and we're going to have to re-run them

Can you clarify why this might mean the display of 33 error messages at the beginning of the test run isn't undesirable?

This will often add an unnecessary delay to running the tests, which we'd ideally like to optimize to run much faster.

Yes, I see what you're saying here.


@Nateowami Nateowami left a comment


Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @pmachapman)


-- commits line 4 at r1:
It seems like the issue has more to do with how the startup ping failures appear in the log than with their frequency or presence. Perhaps the message could be changed to just say it's not up yet. It's verbose because it's intended to run on the CI.

The canonical way of running the tests, defined in the README, is to run e2e.mts. pre_merge_ci.sh is intended as a wrapper around that for the sake of a CI. I'm not even sure if developers on Windows can run it.

  1. It does stuff in the GUI and sometimes I am concerned that I might have messed it up with my mouse or keyboard.

This is a good argument for adding a new preset called headless that's identical to the default, except for being headless (a sketch follows). I usually run them in headed mode because I want to be able to easily investigate failures, and headless mode makes that nearly impossible.
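
A hypothetical sketch of such a preset, assuming the presets in e2e.mts are keyed by name in a plain object (the real option shape in the repo may differ):

```ts
// Hypothetical only; e2e.mts's actual preset structure may differ.
interface E2eOptions {
  headless: boolean;
  // ...whatever other options the real presets carry...
}

const defaultPreset: E2eOptions = { headless: false };

const presets: Record<string, E2eOptions> = {
  default: defaultPreset,
  // Identical to the default except the browser runs headless.
  headless: { ...defaultPreset, headless: true }
};
```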

  2. I need to be more coordinated with SF running locally in the background, since e2e.mts seems to test an already running SF.

I nearly always have SF running for development purposes, and shutting down the processes to start another is quite slow. Do you not have it running most of the time? I guess I've just assumed it would be difficult to work without SF running.

Can you clarify why this might mean the display of 33 error messages at the beginning of the test run isn't undesirable?

I see them as startup status messages, in a CI script that's intended to be as verbose as possible. Arguably the error messages could be toned down to just say that it can't connect yet, instead of showing the underlying error message.


src/SIL.XForge.Scripture/ClientApp/e2e/await-application-startup.mts line 7 at r1 (raw file):
I don't really want an exponential backoff, since this isn't about a service failure, but just waiting for a service to come up. And the cost of the requests is very low, since it's all on localhost.

reduce the error logging to only log those failures that occur after the 30 second

Taking more than 30 seconds wouldn't be particularly abnormal. In my opinion, showing network failure messages only after an arbitrary time period would make the problem of status messages looking like critical failures worse: they normally wouldn't show up at all, so when they did appear it would deviate from what's normal.
