Fix broken symlink python3 and unmet dependencies warnings. #124

gabrielcocenza · 2023-05-04T15:29:41Z

Co-authored-by: peppepetra [email protected]

agileshaw

LGTM

aieri

LGTM

Currently, slow-running OpenStack API requests (either stuck connecting or still waiting for the actual response) from the periodic DataGatherer task will block HTTPServer connections from being processed. Conversely, a stalled client of the HTTPServer (e.g. opening a telnet session and not sending a request) will block other HTTPServer connections from being processed and will also prevent the DataGatherer task from running.

Observed Symptoms:
- Slow or failed prometheus requests
- Statistics not being updated as often as you would expect
- HTTP 500 responses and BrokenPipeError tracebacks being logged, because we later try to respond to prometheus clients that have already timed out and disconnected the socket

Cause: This happens because we intend to use the eventlet library for asynchronous non-blocking I/O, but not all running code correctly uses the patched or "green" versions of various standard libraries (e.g. socket). As a result, we sometimes block the other tasks from running.

Fix this by ensuring the entire program correctly uses eventlet and green patched functions, by importing eventlet and calling eventlet.patcher.monkey_patch() before importing any other modules.

== History Lesson ==

There have been several incorrect attempts to solve this and some related problems. To try and avoid any further such problems, I have comprehensively documented below the historical issues and why those fixes did not work, both for my understanding and yours :)

1. eventlet implements asynchronous "non-blocking" socket I/O without any code changes to the application, and without using real pthreads, by using co-operative "green threads" from the greenlet library. For this to work correctly, it needs to replace many standard libraries (e.g. socket, time, threading) with an alternative implementation. This applies to both our own code and code within imported modules (e.g. novaclient). This does not happen automatically; you can find the full details at https://eventlet.readthedocs.io/en/latest/patching.html but, as a brief summary, it can be done with 3 different methods:
   - Explicitly importing all relevant modules from eventlet.green
   - Automatically during a single import, with eventlet.patcher.import_patched
   - Automatically during all future imports, with eventlet.patcher.monkey_patch

2. The original Issue canonical#112 found that the process deadlocked with the following error: `greenlet.error: cannot switch to a different thread`. At the time, we used a native Python Thread for the DataGatherer class and separately used the ForkingHTTPServer to allow both functions to operate simultaneously. We did not intend to use eventlet/green threads at all; however, the python-cinderclient library incorrectly imports eventlet.sleep, which results in sometimes using green threads accidentally, hence the error. We attempted to fix that in canonical#115 by explicitly importing the green version of threading.Thread. This avoided the "cannot switch to a different thread" issue by only using green threads and not mixing Python threads and green threads in the same process.

3. After merging canonical#115, it was found that the HTTPServer loop never co-operatively yielded to the DataGatherer's thread, so the stats were never updated. To fix this, canonical#116 imported the green versions of socket, asyncore and time, and also littered a few sleep(0) calls around to force co-operative yielding at various points.

4. In canonical#124 we switched from ForkingHTTPServer to the normal HTTPServer, because sometimes it would fork too many servers and hit the process or system-wide process limit. Though not noted elsewhere, when I reproduce this issue by using the tool `siege` to connect many clients to a server where I have firewalled the nova API connections, I can see that all of those processes are defunct and not actually alive. This is most likely because the process is blocked and the calls to waitpid, which would reap them, never happen.

   Since we are not using the eventlet version of http.server.HTTPServer, without the forked model we now block any time we are handling a server request. Additionally, any time the DataGatherer green thread calls out through the OpenStack API libraries, it uses non-patched versions of socket/requests/urllib3 and also blocks the HTTPServer, which is now inside the same process.

== Testing ==

To test that we now have a working solution, you can:
1. Block access to the Nova API (causes connect to hang for 120 seconds) using this firewall command:
   `iptables -I OUTPUT -p tcp -m state --state NEW --dport 8774 -j DROP`
2. Make many concurrent and repeated requests using siege:
   `while true; do siege http://172.16.0.30:9183/metrics -t 5s -c 5 -d 0.1; done`

When testing with these changes, I never see us block a server or client connection, and all requests take a few milliseconds at most, whether or not the client requests are slow, and even when we open a connection to the server that doesn't send a request.

Fixes: canonical#112, canonical#115, canonical#116, canonical#124, canonical#126

Currently, slow-running OpenStack API requests (either stuck connecting or still waiting for the actual response) from the periodic DataGatherer task will block HTTPServer connections from being processed. Blocked HTTPServer connections will also block both other connections and the DataGatherer task.

Observed Symptoms:
- Slow or failed prometheus requests
- Statistics not being updated as often as you would expect
- HTTP 500 responses and BrokenPipeError tracebacks being logged, because we later try to respond to prometheus clients that have already timed out and disconnected the socket
- Hitting the forked process limit

This happens because, in the current code, we intend to use the eventlet library for asynchronous non-blocking I/O, but we are not using it correctly. All code within the main application and all imported dependencies must import the special eventlet "green" versions of many Python libraries (e.g. socket, time, threading, SimpleHTTPServer), which yield to other green threads whenever they would otherwise block waiting for I/O or to sleep. Currently this does not always happen.

Fix this by importing eventlet and calling eventlet.patcher.monkey_patch() before importing any other modules. This will automatically intercept all future imports (including those inside dependencies) and load the green versions of relevant libraries. Documentation on correctly importing eventlet can be found here: https://eventlet.readthedocs.io/en/latest/patching.html

A detailed and comprehensive analysis of the issue, and of multiple previous attempts to fix it, can be found in Issue canonical#130. If you intend to make further related changes to the use of eventlet, threads or forked processes, please read the detailed history lesson available there.

Fixes: canonical#130, canonical#126, canonical#124, canonical#116, canonical#115, canonical#112

Fix broken symlink python3 and unmet dependencies warnings.

3d6b51f

Co-authored-by: peppepetra [email protected]

gabrielcocenza requested review from peppepetra, Pjack, rgildein and esunar May 4, 2023 15:29

agileshaw approved these changes May 5, 2023

aieri self-requested a review May 5, 2023 16:07

aieri approved these changes May 5, 2023

gabrielcocenza merged commit e276088 into canonical:master May 5, 2023

lathiat mentioned this pull request Jun 10, 2024

Client & server threads block each other due to incorrect eventlet/greenlet imports #130

Closed

lathiat mentioned this pull request Jun 13, 2024

Version 0.1.9 are forking infinitely #123

Closed
