
Conversation

@haroldrandom
Contributor

@haroldrandom haroldrandom commented Apr 11, 2020

Description
As I discovered in #12949, the CLI can crash after running for a long time because it runs out of available file descriptors (fds): the logging fds are never closed properly.

When running tests with pytest, the test run is terminated once the available fds are exhausted.
[screenshot 1 and screenshot 2: lsof output showing many command log file fds left open by the test process, with entries marked (deleted)]

I add a doCleanups hook to unittest.TestCase to forcibly close the unclosed fds held by the CLI's logger (a minimal sketch of the idea follows below).

This reduces the risk of fd leaks, but developers still need to manually call end_cmd_metadata_logging() on AzCliLogging, like:

az_cli.logging.end_cmd_metadata_logging(exit_code)
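For illustration only, here is a minimal sketch of the doCleanups idea; the logger name, the base class, and the handler-closing loop are assumptions built on the standard logging/unittest APIs and on the AzCliLogging class discussed below, not the exact code of this PR:

```python
import logging
import unittest


class CommandLogCleanupTestCase(unittest.TestCase):
    """Hypothetical sketch: force-close file handlers left on the command metadata logger."""

    def doCleanups(self):
        # The logger name is an assumption; azure.cli.core.azlogging.AzCliLogging keeps it
        # in AzCliLogging._COMMAND_METADATA_LOGGER.
        command_logger = logging.getLogger('az_command_data_logger')
        for handler in command_logger.handlers[:]:   # iterate over a copy of the list
            handler.close()                          # releases the underlying fd
            command_logger.removeHandler(handler)
        return super().doCleanups()
```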

Why I say "reducing":

  1. Some test classes still inherit from unittest.TestCase rather than ScenarioTest/LiveScenarioTest, so they can't benefit from this fix.
  2. A log-rotation rule deletes older log files so that at most 25 are kept, to save disk space. If too many commands/tests run at the same time and the CLI instance isn't reclaimed by the Python garbage collector, fds keep leaking until the CLI instance is deleted or the process terminates. See these 2 issues: Logging leaks file handles leading to FD exhaustion for long running scripts #12882, Azure command logging is leaking file handles #10435.

In both scenarios, we can still end up with unclosed, leaking (deleted) files.
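To see why rotated-away log files can still consume fds, here is a small stand-alone illustration (not CLI code, Linux semantics assumed): deleting a file whose fd is still open leaves an entry that lsof reports as (deleted), and the fd stays consumed until it is explicitly closed.

```python
import os
import tempfile

# Simulate log rotation deleting a file while a handler still holds an open fd.
fd, path = tempfile.mkstemp(suffix='.log')
os.unlink(path)                   # the "rotation" removes the directory entry

print(os.fstat(fd).st_nlink)      # 0: no path left, but the fd still counts against the limit
os.close(fd)                      # only closing the fd actually releases it
```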

Testing Guide

  1. Uncomment the newly added lines and run a test module with many test cases, such as test_network_commands.py.
  2. Watch the fds opened by your process: lsof -p {TestProcessID} | grep "azure" (a rough Python equivalent is sketched after this list).
  3. After a while, you will see lots of entries ending with (deleted), like the image above. Those are the leaking fds.
  4. If you run with pytest (pytest -x -v --capture=no) across a large number of tests, the process will crash.
  5. With the lines added in this PR, (deleted) files may still occasionally show up in lsof -p {TestProcessID}, but they disappear very soon.
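The lsof check in step 2 can also be approximated from Python on Linux; this is only a rough sketch, and the /proc layout and the '.log' filter are assumptions rather than anything from this PR:

```python
import os


def list_open_log_fds(pid=None):
    """List (fd, target) pairs for open .log files of a process (Linux only)."""
    pid = pid or os.getpid()
    fd_dir = '/proc/{}/fd'.format(pid)
    entries = []
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue                 # fd was closed while we were scanning
        if '.log' in target:         # the exact filter for CLI command logs is an assumption
            entries.append((fd, target))
    return entries                   # targets ending in ' (deleted)' are the leaked ones
```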


@haroldrandom haroldrandom added this to the S168 milestone Apr 11, 2020
@haroldrandom haroldrandom self-assigned this Apr 11, 2020
@haroldrandom haroldrandom added bug This issue requires a change to an existing behavior in the product in order to be resolved. Core CLI core infrastructure labels Apr 11, 2020
@yonzhan
Collaborator

yonzhan commented Apr 12, 2020

add to S168

@haroldrandom haroldrandom changed the title {Core} Reducing the risk of logger file fd leaking {Core} Reducing the risk of logging file fd leaking Apr 12, 2020
@haroldrandom haroldrandom removed the bug This issue requires a change to an existing behavior in the product in order to be resolved. label Apr 12, 2020
@haroldrandom haroldrandom marked this pull request as ready for review April 12, 2020 11:26
@jiasli
Member

jiasli commented Apr 13, 2020

Handlers are added by azure.cli.core.azlogging.AzCliLogging._init_command_logfile_handlers:

command_metadata_logger.addHandler(logfile_handler)

which is called when EVENT_INVOKER_PRE_CMD_TBL_TRUNCATE is raised

self.cli_ctx.register_event(EVENT_INVOKER_PRE_CMD_TBL_TRUNCATE, AzCliLogging.init_command_file_logging)

self.cli_ctx.raise_event(EVENT_INVOKER_PRE_CMD_TBL_TRUNCATE,
                         load_cmd_tbl_func=self.commands_loader.load_command_table, args=args)

I think we can use the same mechanism to call end_cmd_metadata_logging (a sketch follows at the end of this comment).

Currently, the only place where end_cmd_metadata_logging is called is

az_cli.logging.end_cmd_metadata_logging(exit_code)

which is never called during a ScenarioTest, because ScenarioTest creates its own dummy CLI:

self.cli_ctx = get_dummy_cli()
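A rough sketch of what that suggestion might look like; the choice of EVENT_CLI_POST_EXECUTE from knack.events and the presence of an exit code in the event kwargs are assumptions, not decisions made in this PR:

```python
from knack.events import EVENT_CLI_POST_EXECUTE


def register_end_of_command_logging(cli_ctx, az_logging):
    """Hypothetical: close command-log handlers through the same event mechanism."""

    def _handler(cli_ctx, **kwargs):
        # Whether an exit code is carried in the event kwargs is an assumption.
        az_logging.end_cmd_metadata_logging(kwargs.get('exit_code', 0))

    cli_ctx.register_event(EVENT_CLI_POST_EXECUTE, _handler)
```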

Member

self.command_metadata_logger.removeHandler(handler)

Removing elements from a list while looping over it may cause unexpected results.

Contributor Author

It iterates over a copy made with a slice ([:]), so it won't have side effects.
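A small stand-alone check of that point: iterating over a slice copy lets every handler be removed, whereas removing entries from logger.handlers while iterating over it directly would skip elements.

```python
import logging

logger = logging.getLogger('copy_demo')
logger.addHandler(logging.NullHandler())
logger.addHandler(logging.NullHandler())

for handler in logger.handlers[:]:   # [:] makes a copy, so removal during the loop is safe
    handler.close()
    logger.removeHandler(handler)

assert logger.handlers == []         # every handler was removed
```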

@elpollouk

Will this get cleaned up if the az process returns an error? When I was testing, I seem to remember the call to end_cmd_metadata_logging being skipped if an exception was raised, as it was only called on the successful path rather than from a finally block.

Contributor Author

@haroldrandom haroldrandom Apr 20, 2020

@elpollouk If you run az, it goes through this code in __main__.py:

    elapsed_time = timeit.default_timer() - start_time
    az_cli.logging.end_cmd_metadata_logging(exit_code)
    sys.exit(exit_code)
except KeyboardInterrupt:
    telemetry.set_user_fault('keyboard interrupt')
    sys.exit(1)
except SystemExit as ex:  # some code directly call sys.exit, this is to make sure command metadata is logged
    exit_code = ex.code if ex.code is not None else 1
    try:
        elapsed_time = timeit.default_timer() - start_time
    except NameError:
        pass
    az_cli.logging.end_cmd_metadata_logging(exit_code)
    raise ex
finally:
    telemetry.conclude()
    try:
        logger.info("command ran in %.3f seconds.", elapsed_time)
    except NameError:
        pass

Whenever an exception is raised, the final exception on top of the exception stack will be SystemExit, because there is a hook:

def error(self, message):
    telemetry.set_user_fault('parse error: {}'.format(message))
    args = {'prog': self.prog, 'message': message}
    with CommandLoggerContext(logger):
        logger.error('%(prog)s: error: %(message)s', args)
    self.print_usage(sys.stderr)
    failure_recovery_recommendations = self._get_failure_recovery_recommendations()
    self._suggestion_msg.extend(failure_recovery_recommendations)
    self._print_suggestion_msg(sys.stderr)
    self.exit(2)
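A minimal stand-alone illustration of that claim (not CLI code): argparse's error() calls self.exit(2), so a parse failure reaches __main__.py as SystemExit.

```python
import argparse

parser = argparse.ArgumentParser(prog='az-demo')
parser.add_argument('--required', required=True)

try:
    parser.parse_args([])        # missing argument -> error() -> exit(2) -> SystemExit
except SystemExit as ex:
    print(ex.code)               # 2
```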

With this PR merged, I guess there won't be a leaking problem. But apart from that, there could still be a leak as you said.
That's my understanding; if anything is missing or wrong, could you please give a more detailed report?

@haroldrandom
Contributor Author

haroldrandom commented Apr 13, 2020

@jiasli What event name do you suggest? And where should we raise the end-of-logging event?

@zhoxing-ms
Contributor

I see that the same log handler (deleted) appears repeatedly in the second picture. Could we design the log handler object that handles the same kind of business as a singleton? That would also avoid repeated creation across multiple calls and reduce fd usage.

@haroldrandom
Contributor Author

> I see that the same log handler (deleted) appears repeatedly in the second picture. Could we design the log handler object that handles the same kind of business as a singleton? That would also avoid repeated creation across multiple calls and reduce fd usage.

Good suggestion. Could a singleton affect the logging order? And if the CLI were used concurrently, would it affect anything? That's what I'm concerned about.

@zhoxing-ms
Contributor

zhoxing-ms commented Apr 13, 2020

The idea comes from my earlier experience with a logging framework in Java, where I always created a LogFactory that generates singleton logger objects for the different business areas:

  1. Each logger object performs its logging calls according to the log settings of the corresponding business area and does not affect the log printing order.
  2. Because different business threads use different logger objects, they do not affect each other. And since the configuration of multi-threaded logging for the same business area should be identical, singletons avoid unnecessary re-creation of objects.
  3. If the logger object for the same business area is created repeatedly and often, the singleton pattern is recommended; otherwise, it may be better to create, use, and release it.

However, I don't have a deep understanding of Python's logging; this is just a guess based on my Java experience, so please take it only as a reference.

@haroldrandom haroldrandom marked this pull request as draft April 16, 2020 13:51
@haroldrandom haroldrandom force-pushed the fd-leak-logging branch 2 times, most recently from 52450f5 to 61ab0a3 Compare April 16, 2020 15:58
@haroldrandom
Contributor Author

> The idea comes from my earlier experience with a logging framework in Java, where I always created a LogFactory that generates singleton logger objects for the different business areas:
>
>   1. Each logger object performs its logging calls according to the log settings of the corresponding business area and does not affect the log printing order.
>   2. Because different business threads use different logger objects, they do not affect each other. And since the configuration of multi-threaded logging for the same business area should be identical, singletons avoid unnecessary re-creation of objects.
>   3. If the logger object for the same business area is created repeatedly and often, the singleton pattern is recommended; otherwise, it may be better to create, use, and release it.
>
> However, I don't have a deep understanding of Python's logging; this is just a guess based on my Java experience, so please take it only as a reference.

Got it.
The Python logging module already handles part of this: logging.getLogger(AzCliLogging._COMMAND_METADATA_LOGGER) always returns the same logger instance, so loggers themselves are not duplicated.
But a logger can hold more than one file handler, and different handlers can hold different fds to the same log file if the program runs the same command more than once in quick succession, because the file name's timestamp only has second-level granularity:

time = datetime.datetime.now().time()
time_str = "{:02}-{:02}-{:02}".format(time.hour, time.minute, time.second)
log_name = "{}.{}.{}.{}.{}".format(date_str, time_str, command_str, os.getpid(), "log")
log_file_path = os.path.join(self.command_log_dir, log_name)
logfile_handler = logging.FileHandler(log_file_path)
lfmt = logging.Formatter(_CMD_LOG_LINE_PREFIX + ' %(process)d | %(asctime)s | %(levelname)s | %(name)s | %(message)s') # pylint: disable=line-too-long
logfile_handler.setFormatter(lfmt)
logfile_handler.setLevel(logging.DEBUG)
command_metadata_logger.addHandler(logfile_handler)

As for implementing a singleton at the handler level, I think it's unnecessary for now, because it mainly matters in a concurrent scenario. And if we do decide to make it a singleton, the change would be fairly involved because it touches the logging setup.
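A stand-alone illustration of both points: getLogger always returns the same logger, but every FileHandler opens its own fd even when pointed at the same path. The path here is made up for the demo.

```python
import logging
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'same-second-demo.log')

logger = logging.getLogger('az_fd_demo')
print(logging.getLogger('az_fd_demo') is logger)   # True: no duplicated logger

h1 = logging.FileHandler(path)
h2 = logging.FileHandler(path)                     # same path, separate handler
logger.addHandler(h1)
logger.addHandler(h2)

print(len(logger.handlers))                        # 2 handlers on one logger
print(h1.stream.fileno() != h2.stream.fileno())    # True: two distinct fds for one file

h1.close()
h2.close()
```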

@haroldrandom
Contributor Author

Closed due to a mistaken merge.

@haroldrandom haroldrandom deleted the fd-leak-logging branch April 21, 2020 08:03
@haroldrandom haroldrandom restored the fd-leak-logging branch April 21, 2020 08:32