
Fix: TCP setevent no longer throws bogus error message #2986

Merged: mwinkel-dev merged 6 commits into MDSplus:alpha from mwinkel-dev:mw-2977-setevent on Nov 19, 2025.



@mwinkel-dev (Contributor)

Fixes issue #2977.

mwinkel-dev self-assigned this on Nov 18, 2025.
mwinkel-dev added the labels bug (an unexpected problem or unintended behavior), US Priority, and tool/event (relates to the event tools: wfevent, setevent) on Nov 18, 2025.
@mwinkel-dev (Contributor, Author) commented Nov 18, 2025

This WIP must be carefully scrutinized to ensure it is not a breaking change for the "events" feature of MDSplus. When reviewing this proposed fix, it will likely be useful to study the individual commits.

Commit 9874518: This is the minimum change needed to fix the problem reported by PPPL. Because PR #2288 redefined INVALID_CONNECTION_ID as -1 instead of zero, a connection ID of zero became valid, but the TCP setevent code still treated it as invalid. It therefore reconnected as connection ID 1. That did send the event, but it also generated a bogus error message. This first commit was sufficient to prove that the root cause had been identified. However, it is only a partial fix because it works only when the mds_event_target environment variable lists a single TCP event server.

Commit 23e2581: Updates the MdsEvents.c file to use the INVALID_CONNECTION_ID define throughout, thereby eliminating the confusion caused by the old code (which incorrectly assumes zero is the invalid ID).

Commit 9366af2: The sendRemoteEvent() function returns an MDSplus status code (32-bit integer with three fields) up the call stack, which is what the TCL setevent feature needs. However, the separate setevent utility runs at the operating system prompt and thus should use C exit codes when terminating. For clarity, this is done with the C_OK and C_ERROR defines.

Commits 6721c2e and 1079327: For clarity, the separate wfevent utility was also changed to use C_OK and C_ERROR.

Commit 8d7e819: A significant rewrite of the sendRemoteEvent() function so that it correctly processes the list of servers specified by the mds_event_target environment variable. The existing code retried a connection only for the first failing server it encountered. This rewrite assumes that an effort should be made to send the event to every server in the list, even if some servers cannot be reached. An alternative design would be to terminate as soon as the first failing server is encountered.

@mwinkel-dev (Contributor, Author) commented Nov 18, 2025

To manually test these changes, the following configuration was used.

  • Two VMs, A and B, were used. (The operating system doesn't matter.)
  • Server A had four terminal windows open, with an mdsip process running in each window. (Each mdsip process used a different port.) When server A was a Rocky Linux VM, the software firewall, firewalld, was configured to allow TCP traffic on the various mdsip ports. The firewall was also configured to allow UDP traffic on port 4000.
  • Server B ran a developer build with this WIP. To force TCP events to be used, this statement was executed: export mds_event_target="<server A>:<port 1>;<A>:<port 2>;<A>:<port 3>;<A>:<port 4>". Note that the double quotes and semicolons are important.
  • On server B, two terminal windows were opened. One was used to run the standalone wfevent utility to confirm that TCP events were indeed being sent. The other terminal window was used to run 1) TCL's setevent command and 2) the separate setevent utility.
  • After running a test, the messages written by the mdsip processes on server A were also reviewed.
  • To test a failed connection, wfevent was started in one of B's terminal windows, then one of the mdsip processes on A was terminated, and then TCL's setevent command or the separate setevent utility was issued on B.

free(ansarg.ptr);
}

if (tries >= MAX_TRIES) {
@mwinkel-dev (Contributor, Author) review comment:

Question: When a bad server is encountered, should the function continue sending the event to the remaining servers (as per this WIP)? Or should the function instead immediately terminate?

else
{
int len = (int)((argc > 2) ? strlen(argv[2]) : 0);
status = MDSEvent(argv[1], len, argv[2]);
@mwinkel-dev (Contributor, Author) review comment, Nov 18, 2025:

Note that the old code was returning an MDSplus status code (32 bits, 3 fields) when the program terminates. This is problematic because MDSplusSUCCESS = 65545, and only the low byte of an exit code reaches the shell. As a result, echo $? displays 9, a non-zero value that denotes failure in C.

default:
printhelp(argv[0]);
- return 1;
+ return C_ERROR;
@mwinkel-dev (Contributor, Author) review comment, Nov 18, 2025:

Note that this is a change in behavior. The define is C_ERROR = -1, which echo $? displays as 255.

The conjecture is that very few users have shell scripts that check the exit status of wfevent. Thus, this change is likely very low risk.

Alternatively, we could add a new define, C_FAILURE = 1, and use it instead of C_ERROR.

{
receive_thread_ids[idx] = searchOpenServer(receive_servers[idx]);
- if(receive_thread_ids[idx] < 0)
+ if(receive_thread_ids[idx] <= INVALID_CONNECTION_ID)
@mwinkel-dev (Contributor, Author) review comment:

Note that around line 413 below, an error condition is detected whereupon a socket is set to zero (file descriptor 0, which is normally stdin). That seems a bit odd, but the code works as is, so it is probably OK to keep it.

@mwinkel-dev (Contributor, Author)

There appears to be yet another issue with TCP events, so this WIP definitely needs more work.

To test TCP events, it is advisable to use three virtual machines, call them A, B and C.

A sets mds_event_target to point to B. B runs the mdsip process that relays the TCP event. C sets mds_event_server to B. On all VMs, make sure the software firewall permits TCP traffic on the desired port.

The test consists of the following:

  1. On C, do wfevent test_event
  2. On A, do setevent test_event
  3. On C, check to see if the test_event was received
  4. On B, check the messages displayed by the mdsip process to troubleshoot any issues.

The above test revealed that C would only receive the test_event if B's software firewall was configured to allow UDP traffic on port 4000 (the default for MDSplus UDP events). Because the three VMs are on the same virtual network, it is probable that B is sending out UDP events that are received by A and C.

Josh has confirmed that TCP events should work without using UDP on any of the VMs. (Similar behavior was seen months ago, but it was incorrectly assumed that TCP events were designed to convert to UDP at the target system.)

Thus, additional investigation is required.

@mwinkel-dev (Contributor, Author)

Experiments prove that TCP events are indeed being converted to UDP events at the "target" system. This occurs with both TCL's setevent command and the separate setevent utility. Fixing TCP events will likely be complicated, so that task should probably be split off into a new issue / PR.

Here are the details:

Both TCL and the setevent utility call the MDSEvent() function of MdsEvents.c, which detects the mds_event_target environment variable and eventually causes sendRemoteEvent() to be called. That function sends a TDI expression to the "target" system, which causes the target to evaluate the setevent.fun TDI function. That function in turn causes the "target" to call its own MDSEvent() function. However, on the "target", the mds_event_target and mds_event_server environment variables are undefined, so the target's MDSEvent() calls MDSUdpEvent(). Thus the "target" converts TCP events into UDP events.

References: TDI function setevent.fun; MDSEvent

@mwinkel-dev (Contributor, Author)

A tentative conclusion from additional experiments is that UDP is the backbone of TCP events.

Assume four servers: A, B, C, and D. B and C are on the same network segment. D executes a wfevent test_event command that uses TCP to listen to an mdsip process on C. The A server does a setevent test_event that uses TCP to send an event to an mdsip process on B. B converts the TCP event to a UDP event, which is seen by the mdsip process on C, which in turn then uses TCP to notify D.

It is difficult to test the above configuration when all four systems are VMs on the same virtual network. However, based on experiments and a review of the source code, it seems likely that UDP is the backbone, with TCP used only for the first and last network hops in the chain.

More investigation is required.

@mwinkel-dev (Contributor, Author)

An additional experiment using debug print statements proves the preceding conjecture that TCP events are converted to / from UDP events.

@WhoBrokeTheBuild (Member) left a comment:

These all look like good changes, and I think the risks are minimal. LGTM

mwinkel-dev marked this pull request as ready for review on November 19, 2025 at 17:03.
mwinkel-dev merged commit 89fdb73 into MDSplus:alpha on Nov 19, 2025.
1 check passed
