
Conversation

@haroldrandom
Contributor

@haroldrandom haroldrandom commented Apr 11, 2020

Description
As I discovered in #12949, the CLI can crash after running for a long time because it runs out of available file descriptors (fds): the logging fds are never closed properly.

When running tests with pytest, the test run is terminated once the available fds are exhausted.
[screenshot 1 and screenshot 2: lsof output showing many command log file fds left open by the test process, with entries marked (deleted)]

I add a doCleanups hook to unittest.TestCase to forcibly close the unclosed fds held by the CLI's logger (a minimal sketch of the idea follows below).

This reduces the risk of fd leaks, but developers still need to manually call end_cmd_metadata_logging() on AzCliLogging, like:

az_cli.logging.end_cmd_metadata_logging(exit_code)
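For illustration only, here is a minimal sketch of the doCleanups idea; the logger name, the base class, and the handler-closing loop are assumptions built on the standard logging/unittest APIs and on the AzCliLogging class discussed below, not the exact code of this PR:

```python
import logging
import unittest


class CommandLogCleanupTestCase(unittest.TestCase):
    """Hypothetical sketch: force-close file handlers left on the command metadata logger."""

    def doCleanups(self):
        # The logger name is an assumption; azure.cli.core.azlogging.AzCliLogging keeps it
        # in AzCliLogging._COMMAND_METADATA_LOGGER.
        command_logger = logging.getLogger('az_command_data_logger')
        for handler in command_logger.handlers[:]:   # iterate over a copy of the list
            handler.close()                          # releases the underlying fd
            command_logger.removeHandler(handler)
        return super().doCleanups()
```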

Why I say "reducing":

  1. Some test classes still inherit from unittest.TestCase rather than ScenarioTest/LiveScenarioTest, so they can't benefit from this fix.
  2. A log-rotation rule deletes older log files so that at most 25 are kept, to save disk space. If too many commands/tests run at the same time and the CLI instance isn't reclaimed by the Python garbage collector, fds keep leaking until the CLI instance is deleted or the process terminates. See these 2 issues: Logging leaks file handles leading to FD exhaustion for long running scripts #12882, Azure command logging is leaking file handles #10435.

In both scenarios, we can still end up with unclosed, leaking (deleted) files.
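To see why rotated-away log files can still consume fds, here is a small stand-alone illustration (not CLI code, Linux semantics assumed): deleting a file whose fd is still open leaves an entry that lsof reports as (deleted), and the fd stays consumed until it is explicitly closed.

```python
import os
import tempfile

# Simulate log rotation deleting a file while a handler still holds an open fd.
fd, path = tempfile.mkstemp(suffix='.log')
os.unlink(path)                   # the "rotation" removes the directory entry

print(os.fstat(fd).st_nlink)      # 0: no path left, but the fd still counts against the limit
os.close(fd)                      # only closing the fd actually releases it
```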

Testing Guide

  1. Uncomment the newly added lines and run a test module with many test cases, such as test_network_commands.py.
  2. Watch the fds opened by your process: lsof -p {TestProcessID} | grep "azure" (a rough Python equivalent is sketched after this list).
  3. After a while, you will see lots of entries ending with (deleted), like the image above. Those are the leaking fds.
  4. If you run with pytest (pytest -x -v --capture=no) across a large number of tests, the process will crash.
  5. With the lines added in this PR, (deleted) files may still occasionally show up in lsof -p {TestProcessID}, but they disappear very soon.
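The lsof check in step 2 can also be approximated from Python on Linux; this is only a rough sketch, and the /proc layout and the '.log' filter are assumptions rather than anything from this PR:

```python
import os


def list_open_log_fds(pid=None):
    """List (fd, target) pairs for open .log files of a process (Linux only)."""
    pid = pid or os.getpid()
    fd_dir = '/proc/{}/fd'.format(pid)
    entries = []
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue                 # fd was closed while we were scanning
        if '.log' in target:         # the exact filter for CLI command logs is an assumption
            entries.append((fd, target))
    return entries                   # targets ending in ' (deleted)' are the leaked ones
```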


@haroldrandom haroldrandom added this to the S168 milestone Apr 11, 2020
@haroldrandom haroldrandom self-assigned this Apr 11, 2020
@haroldrandom haroldrandom added bug This issue requires a change to an existing behavior in the product in order to be resolved. Core CLI core infrastructure labels Apr 11, 2020
@yonzhan
Collaborator

yonzhan commented Apr 12, 2020

add to S168

@haroldrandom haroldrandom changed the title {Core} Reducing the risk of logger file fd leaking {Core} Reducing the risk of logging file fd leaking Apr 12, 2020
@haroldrandom haroldrandom removed the bug This issue requires a change to an existing behavior in the product in order to be resolved. label Apr 12, 2020
@haroldrandom haroldrandom marked this pull request as ready for review April 12, 2020 11:26
@jiasli
Member

jiasli commented Apr 13, 2020

Handlers are added by azure.cli.core.azlogging.AzCliLogging._init_command_logfile_handlers:

command_metadata_logger.addHandler(logfile_handler)

which is called when EVENT_INVOKER_PRE_CMD_TBL_TRUNCATE is raised

self.cli_ctx.register_event(EVENT_INVOKER_PRE_CMD_TBL_TRUNCATE, AzCliLogging.init_command_file_logging)

self.cli_ctx.raise_event(EVENT_INVOKER_PRE_CMD_TBL_TRUNCATE,
                         load_cmd_tbl_func=self.commands_loader.load_command_table, args=args)

I think we can use the same mechanism to call end_cmd_metadata_logging (a sketch follows at the end of this comment).

Currently, the only place where end_cmd_metadata_logging is called is

az_cli.logging.end_cmd_metadata_logging(exit_code)

which is never called during a ScenarioTest, because ScenarioTest creates its own dummy CLI:

self.cli_ctx = get_dummy_cli()
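A rough sketch of what that suggestion might look like; the choice of EVENT_CLI_POST_EXECUTE from knack.events and the presence of an exit code in the event kwargs are assumptions, not decisions made in this PR:

```python
from knack.events import EVENT_CLI_POST_EXECUTE


def register_end_of_command_logging(cli_ctx, az_logging):
    """Hypothetical: close command-log handlers through the same event mechanism."""

    def _handler(cli_ctx, **kwargs):
        # Whether an exit code is carried in the event kwargs is an assumption.
        az_logging.end_cmd_metadata_logging(kwargs.get('exit_code', 0))

    cli_ctx.register_event(EVENT_CLI_POST_EXECUTE, _handler)
```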

Member

self.command_metadata_logger.removeHandler(handler)

Removing elements from a list while looping over it may cause unexpected results.

Contributor Author

It iterates over a copy made with a slice ([:]), so it won't have side effects.
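A small stand-alone check of that point: iterating over a slice copy lets every handler be removed, whereas removing entries from logger.handlers while iterating over it directly would skip elements.

```python
import logging

logger = logging.getLogger('copy_demo')
logger.addHandler(logging.NullHandler())
logger.addHandler(logging.NullHandler())

for handler in logger.handlers[:]:   # [:] makes a copy, so removal during the loop is safe
    handler.close()
    logger.removeHandler(handler)

assert logger.handlers == []         # every handler was removed
```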

@elpollouk

Will this get cleaned up if the az process returns an error? When I was testing, I seem to remember the call to end_cmd_metadata_logging being skipped if an exception was raised, as it was only called on the successful path rather than from a finally block.

Contributor Author

@haroldrandom haroldrandom Apr 20, 2020

@elpollouk If you run az, it goes through this code in __main__.py:

    elapsed_time = timeit.default_timer() - start_time
    az_cli.logging.end_cmd_metadata_logging(exit_code)
    sys.exit(exit_code)
except KeyboardInterrupt:
    telemetry.set_user_fault('keyboard interrupt')
    sys.exit(1)
except SystemExit as ex:  # some code directly call sys.exit, this is to make sure command metadata is logged
    exit_code = ex.code if ex.code is not None else 1
    try:
        elapsed_time = timeit.default_timer() - start_time
    except NameError:
        pass
    az_cli.logging.end_cmd_metadata_logging(exit_code)
    raise ex
finally:
    telemetry.conclude()
    try:
        logger.info("command ran in %.3f seconds.", elapsed_time)
    except NameError:
        pass

Whenever an exception is raised, the final exception on top of the exception stack will be SystemExit, because there is a hook:

def error(self, message):
    telemetry.set_user_fault('parse error: {}'.format(message))
    args = {'prog': self.prog, 'message': message}
    with CommandLoggerContext(logger):
        logger.error('%(prog)s: error: %(message)s', args)
    self.print_usage(sys.stderr)
    failure_recovery_recommendations = self._get_failure_recovery_recommendations()
    self._suggestion_msg.extend(failure_recovery_recommendations)
    self._print_suggestion_msg(sys.stderr)
    self.exit(2)
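A minimal stand-alone illustration of that claim (not CLI code): argparse's error() calls self.exit(2), so a parse failure reaches __main__.py as SystemExit.

```python
import argparse

parser = argparse.ArgumentParser(prog='az-demo')
parser.add_argument('--required', required=True)

try:
    parser.parse_args([])        # missing argument -> error() -> exit(2) -> SystemExit
except SystemExit as ex:
    print(ex.code)               # 2
```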

With this PR merged, I guess there won't be a leaking problem. But apart from that, there could still be a leak as you said.
That's my understanding; if anything is missing or wrong, could you please give a more detailed report?

@haroldrandom
Contributor Author

haroldrandom commented Apr 13, 2020

@jiasli What event name do you suggest? And where should we raise the end-of-logging event?

@zhoxing-ms
Contributor

I see that the same log handler (deleted) appears repeatedly in the second picture. Could we design the log handler object that handles the same kind of business as a singleton? That would also avoid repeated creation across multiple calls and reduce fd usage.

@haroldrandom
Contributor Author

> I see that the same log handler (deleted) appears repeatedly in the second picture. Could we design the log handler object that handles the same kind of business as a singleton? That would also avoid repeated creation across multiple calls and reduce fd usage.

Good suggestion. Could a singleton affect the logging order? And if the CLI were used concurrently, would it affect anything? That's what I'm concerned about.

@zhoxing-ms
Contributor

zhoxing-ms commented Apr 13, 2020

The idea comes from my earlier experience with a logging framework in Java, where I always created a LogFactory that generates singleton logger objects for the different business areas:

  1. Each logger object performs its logging calls according to the log settings of the corresponding business area and does not affect the log printing order.
  2. Because different business threads use different logger objects, they do not affect each other. And since the configuration of multi-threaded logging for the same business area should be identical, singletons avoid unnecessary re-creation of objects.
  3. If the logger object for the same business area is created repeatedly and often, the singleton pattern is recommended; otherwise, it may be better to create, use, and release it.

However, I don't have a deep understanding of Python's logging; this is just a guess based on my Java experience, so please take it only as a reference.

@haroldrandom haroldrandom marked this pull request as draft April 16, 2020 13:51
@haroldrandom haroldrandom force-pushed the fd-leak-logging branch 2 times, most recently from 52450f5 to 61ab0a3 Compare April 16, 2020 15:58
@haroldrandom
Contributor Author

> The idea comes from my earlier experience with a logging framework in Java, where I always created a LogFactory that generates singleton logger objects for the different business areas:
>
>   1. Each logger object performs its logging calls according to the log settings of the corresponding business area and does not affect the log printing order.
>   2. Because different business threads use different logger objects, they do not affect each other. And since the configuration of multi-threaded logging for the same business area should be identical, singletons avoid unnecessary re-creation of objects.
>   3. If the logger object for the same business area is created repeatedly and often, the singleton pattern is recommended; otherwise, it may be better to create, use, and release it.
>
> However, I don't have a deep understanding of Python's logging; this is just a guess based on my Java experience, so please take it only as a reference.

Got it.
The Python logging module already handles part of this: logging.getLogger(AzCliLogging._COMMAND_METADATA_LOGGER) always returns the same logger instance, so loggers themselves are not duplicated.
But a logger can hold more than one file handler, and different handlers can hold different fds to the same log file if the program runs the same command more than once in quick succession, because the file name's timestamp only has second-level granularity:

time = datetime.datetime.now().time()
time_str = "{:02}-{:02}-{:02}".format(time.hour, time.minute, time.second)
log_name = "{}.{}.{}.{}.{}".format(date_str, time_str, command_str, os.getpid(), "log")
log_file_path = os.path.join(self.command_log_dir, log_name)
logfile_handler = logging.FileHandler(log_file_path)
lfmt = logging.Formatter(_CMD_LOG_LINE_PREFIX + ' %(process)d | %(asctime)s | %(levelname)s | %(name)s | %(message)s') # pylint: disable=line-too-long
logfile_handler.setFormatter(lfmt)
logfile_handler.setLevel(logging.DEBUG)
command_metadata_logger.addHandler(logfile_handler)

As for implementing a singleton at the handler level, I think it's unnecessary for now, because it mainly matters in a concurrent scenario. And if we do decide to make it a singleton, the change would be fairly involved because it touches the logging setup.
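A stand-alone illustration of both points: getLogger always returns the same logger, but every FileHandler opens its own fd even when pointed at the same path. The path here is made up for the demo.

```python
import logging
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'same-second-demo.log')

logger = logging.getLogger('az_fd_demo')
print(logging.getLogger('az_fd_demo') is logger)   # True: no duplicated logger

h1 = logging.FileHandler(path)
h2 = logging.FileHandler(path)                     # same path, separate handler
logger.addHandler(h1)
logger.addHandler(h2)

print(len(logger.handlers))                        # 2 handlers on one logger
print(h1.stream.fileno() != h2.stream.fileno())    # True: two distinct fds for one file

h1.close()
h2.close()
```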

@haroldrandom
Contributor Author

Closed due to a mistaken merge.

@haroldrandom haroldrandom deleted the fd-leak-logging branch April 21, 2020 08:03
@haroldrandom haroldrandom restored the fd-leak-logging branch April 21, 2020 08:32