Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for JSON formatted logs #5799

Merged
merged 13 commits into from
Oct 14, 2020

Conversation

mriedem
Copy link
Contributor

@mriedem mriedem commented Oct 7, 2020

This adds a new extra package dependency on json-logging
and an environment variable, which when set to "true" regardless
of case, will try to use the json-logging non-web app log formatter.
A NotebookApp.log_json config option is also added which gets
its default value from the JUPYTER_ENABLE_JSON_LOGGING
environment variable.

If the json-logging package is not installed but log_json is True,
something like this will be logged but it will not crash the application:

$ JUPYTER_ENABLE_JSON_LOGGING=true jupyter notebook
Unable to use json logging due to missing packages. Run "pip install notebook[json-logging]" to fix.
Traceback (most recent call last):
  File "/home/osboxes/jupyter/notebook/notebook/notebookapp.py", line 144, in <module>
    import json_logging
ModuleNotFoundError: No module named 'json_logging'

With this working you get logs like this:

{"written_at": "2020-10-07T21:10:51.606Z", "written_ts": 1602105051606265000, "msg": "404 GET /nbextensions/widgets/notebook/js/extension.js (127.0.0.1) 9.26ms referer=http://localhost:8888/notebooks/Untitled.ipynb", "type": "log", "logger": "NotebookApp", "thread": "MainThread", "level": "WARNING", "module": "log", "line_no": 49}
{"written_at": "2020-10-07T21:10:54.443Z", "written_ts": 1602105054443309000, "msg": "Starting buffering for f260ddd7-938c-42d0-ac3b-455bea76694f:49f30b53fc4b4ec6a8f2fb748a171613", "type": "log", "logger": "NotebookApp", "thread": "MainThread", "level": "INFO", "module": "kernelmanager", "line_no": 222}
{"written_at": "2020-10-07T21:10:54.446Z", "written_ts": 1602105054446264000, "msg": "Kernel shutdown: f260ddd7-938c-42d0-ac3b-455bea76694f", "type": "log", "logger": "NotebookApp", "thread": "MainThread", "level": "INFO", "module": "multikernelmanager", "line_no": 201}

Web app requests get logged with separate properties like this:

{"written_at": "2020-10-09T20:59:52.044Z", "written_ts": 1602277192044732000, "msg": "", "type": "log", "logger": "NotebookApp", "thread": "MainThread", "level": "WARNING", "module": "log", "line_no": 50, "status": 404, "method": "GET", "ip": "127.0.0.1", "uri": "/api/kernels/a6dd447b-282a-4e1c-ab65-eb340ad12965/channels?session_id=472907c27ed94e2cbfbae86501f7d159", "request_time": 8.63, "referer": "None"}

The nice thing about that is you can filter request logs on certain error
status codes or request times that take longer than some threshold to
setup alerts in your logging infrastructure (ELK, LogDNA, etc).

Closes #5798

This adds a new extra package dependency on `json-logging`
and an environment variable, which when set to "true" regardless
of case, will try to use the json-logging non-web app log formatter.

If the `json-logging` package is not installed but the environment
variable is True, something like this will be logged but it will not
crash the application:

```
$ ENABLE_JSON_LOGGING=true jupyter notebook
Unable to use json logging due to missing packages. Run "pip install notebook[json-logging]" to fix.
Traceback (most recent call last):
  File "/home/osboxes/jupyter/notebook/notebook/notebookapp.py", line 144, in <module>
    import json_logging
ModuleNotFoundError: No module named 'json_logging'
```

Initially I tried to add a new Bool config option to toggle this but
the problem is (from my limited understanding of traitlets and tornado)
is that `_log_formatter_cls` needs to be set early and trying to se the
log formatter later in `init_logging` is too late - or at least I couldn't
figure out a way to reset the log formatter dynamically (I tried calling
`_log_format_changed` [1] but that didn't work).

With this working you get logs like this:

```
{"written_at": "2020-10-07T21:10:51.606Z", "written_ts": 1602105051606265000, "msg": "404 GET /nbextensions/widgets/notebook/js/extension.js (127.0.0.1) 9.26ms referer=http://localhost:8888/notebooks/Untitled.ipynb", "type": "log", "logger": "NotebookApp", "thread": "MainThread", "level": "WARNING", "module": "log", "line_no": 49}
{"written_at": "2020-10-07T21:10:54.443Z", "written_ts": 1602105054443309000, "msg": "Starting buffering for f260ddd7-938c-42d0-ac3b-455bea76694f:49f30b53fc4b4ec6a8f2fb748a171613", "type": "log", "logger": "NotebookApp", "thread": "MainThread", "level": "INFO", "module": "kernelmanager", "line_no": 222}
{"written_at": "2020-10-07T21:10:54.446Z", "written_ts": 1602105054446264000, "msg": "Kernel shutdown: f260ddd7-938c-42d0-ac3b-455bea76694f", "type": "log", "logger": "NotebookApp", "thread": "MainThread", "level": "INFO", "module": "multikernelmanager", "line_no": 201}
```

An obvious improvement here would to be able to split the `msg` fields apart
so that we can log things like response status code, request method, request
URL, response_time_ms etc. That should be possible with the `JSONLogWebFormatter`
from `json-logging` but when I tried using that I was getting errors from
the library about a correlation id (which is based on a request header we
don't use). The `json-logging` library supports several web frameworks like
Flask but unfortunately does not have built in support for Tornado, but it does
support custom formatters so that might be a follow up option to improve on this.

[1] https://github.com/ipython/traitlets/blob/4.3.3/traitlets/config/application.py#L195

Closes jupyter#5798
@mriedem
Copy link
Contributor Author

mriedem commented Oct 7, 2020

@sdague @rmoe Here is an idea for a start on JSON log formatting though it might not be super useful if we can't split apart the request-specific parts of the msg field.

@mriedem
Copy link
Contributor Author

mriedem commented Oct 7, 2020

This is the error I get when trying to use JSONLogWebFormatter:

--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.6/logging/__init__.py", line 994, in emit
    msg = self.format(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 840, in format
    return fmt.format(record)
  File "/home/osboxes/jupyter/notebook/.tox/lib/python3.6/site-packages/json_logging/__init__.py", line 243, in format
    log_object = self._format_log_object(record, request_util=_request_util)
  File "/home/osboxes/jupyter/notebook/.tox/lib/python3.6/site-packages/json_logging/__init__.py", line 341, in _format_log_object
    "correlation_id": request_util.get_correlation_id(within_formatter=True),
AttributeError: 'NoneType' object has no attribute 'get_correlation_id'

And it looks like that's because I'm not going through one of the init_* framework methods:

https://github.com/bobbui/json-logging-python/blob/1.2.8/json_logging/__init__.py#L132

This gets us closer to logging parts of the request
as separate json fields but not all the way there.

For example:

```json
{"written_at": "2020-10-07T21:52:16.975Z", "written_ts": 1602107536975497000, "msg": "{'status': 404, 'method': 'GET', 'ip': '127.0.0.1', 'uri': '/nbextensions/widgets/notebook/js/extension.js', 'request_time': 1.9807815551757812, 'referer': 'http://localhost:8888/notebooks/test.ipynb'}", "type": "log", "logger": "NotebookApp", "thread": "MainThread", "level": "WARNING", "module": "log", "line_no": 54}
```
@mriedem
Copy link
Contributor Author

mriedem commented Oct 7, 2020

Getting a little further by changing the log_request method so I see something like this now:

{"written_at": "2020-10-07T21:52:16.975Z", "written_ts": 1602107536975497000, "msg": "{'status': 404, 'method': 'GET', 'ip': '127.0.0.1', 'uri': '/nbextensions/widgets/notebook/js/extension.js', 'request_time': 1.9807815551757812, 'referer': 'http://localhost:8888/notebooks/test.ipynb'}", "type": "log", "logger": "NotebookApp", "thread": "MainThread", "level": "WARNING", "module": "log", "line_no": 54}

But the msg is a serialized JSON string now.

notebook/log.py Outdated
@@ -46,5 +50,11 @@ def log_request(handler):
if status >= 500 and status != 502:
# log all headers if it caused an error
log_method(json.dumps(dict(request.headers), indent=2))
log_method(msg.format(**ns))
if _enable_json_logs:
# FIXME: this still logs the msg as a serialized json string,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could use a partial to pass log in here:

log_function=log_request,

Because that'd be the NotebookApp logger which is using our json log formatter.

notebook/log.py Outdated
@@ -46,5 +50,11 @@ def log_request(handler):
if status >= 500 and status != 502:
# log all headers if it caused an error
log_method(json.dumps(dict(request.headers), indent=2))
log_method(msg.format(**ns))
if _enable_json_logs:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just because this is true doesn't mean that json-logging is installed. If it's not, the request log line ends up looking like this:

[W 17:04:57.097 NotebookApp] {'status': 404, 'method': 'GET', 'ip': '127.0.0.1', 'uri': '/nbextensions/widgets/notebook/js/extension.js', 'request_time': 6.743431091308594, 'referer': 'http://localhost:8888/notebooks/Untitled.ipynb?kernel_name=python3'}

Also note that with the json logging we're not formatting the request time to 2 decimal places:

'request_time': 6.743431091308594

@mriedem mriedem marked this pull request as draft October 7, 2020 22:19
@kevin-bates
Copy link
Member

Thanks Matt. This is an interesting conundrum on a couple counts.

  1. The logging formatter is something used very early - introducing potential chicken/egg situations.
  2. The tornado logging is a different facility, yet ideally, we want to leverage the same configurable setting to determine if JSON formatting should be used.

I think these can be resolved in a more application-friendly manner by first introducing a boolean configurable json_logging that can be configured via a configuration file using c.NotebookApp.json_logging = True. We'd also want to tie this to a CLI flag (e.g., --json-logging such that, when present, it enables JSON formatting). You should be able to follow the example of allow_root for these two details.

Traitlets also provide the ability to validate a config option. This is where we could see if the json_logging package can be imported and, if so, "tickle" the current log_format value after setting the log_formatter_cls. The reason for this "tickle" is to trigger the observation that takes place in the traitlets Application class - which will perform a reset using the formatter_cls value - as you had mentioned. Modeling the validation after the 'ip' traitlet can get you started here.

We could let the default value for the json-logging flag still derive from the env - which is helpful in containerized envs and this could be done via a _default_json_logging() entry that returns configurable's default. Again, 'ip' uses a defaulting method that you can model this after.

I apologize for these details, but because there were so many unknowns I had, it was quicker just to try things out. I have some working code - but I'd rather help you navigate this if you don't mind. Otherwise, just let me know.

@mriedem
Copy link
Contributor Author

mriedem commented Oct 8, 2020

@kevin-bates thanks for those detailed pointers.

I'd rather help you navigate this if you don't mind

Yup, no problem and I definitely appreciate the patience with helping me learn here (I don't really have experience with tornado or traitlets, mostly copy/tweak/test when I need to make changes).

@mriedem
Copy link
Contributor Author

mriedem commented Oct 9, 2020

@kevin-bates I worked on your suggestions a bit this morning and captured the diff here since it's not working. A few notes:

  • I added the config option but with name log_format_json so that it would show up alongside log_format and log_datefmt etc in the help output:
--NotebookApp.log_datefmt=<Unicode>
    Default: '%Y-%m-%d %H:%M:%S'
    The date format used by logging formatters for %(asctime)s
--NotebookApp.log_format=<Unicode>
    Default: '[%(name)s]%(highlevel)s %(message)s'
    The Logging format template
--NotebookApp.log_format_json=<Bool>
    Default: False
    Set to True to enable JSON formatted logs. Run "pip install notebook[json-
    logging]" to install the required dependent packages. Can also be set using
    the environment variable ENABLE_JSON_LOGGING=true.
--NotebookApp.log_level=<Enum>
    Default: 30
    Choices: (0, 10, 20, 30, 40, 50, 'DEBUG', 'INFO', 'WARN', 'ERROR', 'CRITICAL')
    Set the log level by value or name.

I think that's useful for someone looking at logging options to have those grouped together for context.

  • I'm assuming I didn't implement _validate_log_format_json correctly because it doesn't seem to be getting hit. I've tried running with both ENABLE_JSON_LOGGING=true jupyter notebook and jupyter notebook --NotebookApp.log_format_json=True. I never see the logging statements I put in _validate_log_format_json though, nor do I see the ImportError traceback if I uninstall json-logging and try to start the server with log_format_json=True. Is there something I'm missing there about when that validation function is supposed to be called?

  • I'm also not sure that I'm "tickling" the log_format option correctly by doing self.log_format = self.log_format.

Thanks again for the help.

@kevin-bates
Copy link
Member

Thanks Matt. The "tickle" should actually change the format. In this case, I removed the color-coding attributes. Here's the code I had put together the other day. Since you've changed the traitlet's name, that change would need to be reflected, as well as perhaps logging that the import failed during the validation.

Lastly, the convention for envs is that they be prefixed with JUPYTER_.

I found no changes necessary other than in notebookapp.py, but would prefer that you confirm the same.

    json_logging = Bool(config=True)

    @default('json_logging')
    def _default_json_logging(self):
        return os.getenv('JUPYTER_ENABLE_JSON_LOGGING', 'false').lower() == 'true'

    @validate('json_logging')
    def _validate_json_logging(self, proposal):
        value = proposal['value']
        if value is True:
            try:
                import json_logging
                self._log_formatter_cls = json_logging.JSONLogFormatter
                self.log_format = u"[%(levelname)1.1s %(asctime)s.%(msecs).03d %(name)s] %(message)s"
            except ImportError:
                value = False
        return value

My concern with the change to log_format_json is that it implies the value should be some kind of format string - similar to log_format, whereas json_logging seems more about the kind of logging that will take place. The concern being that the "format" approach could be confusing. Yes, I would agree whatever the name is should be grouped near the others as its all the same "topic".

@mriedem
Copy link
Contributor Author

mriedem commented Oct 9, 2020

The "tickle" should actually change the format. In this case, I removed the color-coding attributes.

Got it. I'm not totally sure about that solution since if I set log_format specifically and this changed it, I'd probably wonder what's going on if the formatting doesn't show up as expected. If there is no better solution we could at least document something about that in the config help.

the convention for envs is that they be prefixed with JUPYTER_.

Done.

I found no changes necessary other than in notebookapp.py

I see a bug in my code, I wasn't using self._log_formatter_cls.

My concern with the change to log_format_json is that it implies the value should be some kind of format string

I renamed the config option to simply log_json to take the "format" out of it.


I'll tinker with this locally and see what I come up with.

I'm also going to see what happens if we call json_logging.init_non_web(enable_json=True) from init_logging since init_non_web does change the formatter for all of the loggers:

https://github.com/bobbui/json-logging-python/blob/1.2.8/json_logging/__init__.py#L153

This adds the `NotebookApp.log_json` CLI option which defaults to
False. The default value comes from the `JUPYTER_ENABLE_JSON_LOGGING`
environment variable. A validation hook is added such that if json
logging is enabled but the `json-logging` package is not installed,
an error is logged and the effective `log_json` value is False.

If enabled and the `json-logging` package is installed, the `init_logging`
method will initialize the non-web [1] json logging formatter.

[1] https://github.com/bobbui/json-logging-python#21-non-web-application-log
@mriedem
Copy link
Contributor Author

mriedem commented Oct 9, 2020

I'm also going to see what happens if we call json_logging.init_non_web(enable_json=True) from init_logging since init_non_web does change the formatter for all of the loggers

This works and avoids the need to change log_format to trigger the _log_format_changed observer. Updated the PR.

Still need to figure out how to get log.log_request working properly because if you run without the environment variable set it doesn't log the request parameters in json format:

{"written_at": "2020-10-09T19:32:51.825Z", "written_ts": 1602271971825341000, "msg": "404 GET /nbextensions/widgets/notebook/js/extension.js (127.0.0.1) 8.94ms referer=http://localhost:8888/notebooks/Untitled.ipynb", "type": "log", "logger": "NotebookApp", "thread": "MainThread", "level": "WARNING", "module": "log", "line_no": 59}

and if you do it logs the request parameters as a serialized json string:

{"written_at": "2020-10-09T19:34:05.565Z", "written_ts": 1602272045565346000, "msg": "{'status': 404, 'method': 'GET', 'ip': '127.0.0.1', 'uri': '/api/kernels/48ecd2e7-ae9d-4484-8910-70fb46d72491/channels?session_id=5aa3a7dcedf1427e8ff41ae0cb3dae94', 'request_time': 1.8270015716552734, 'referer': 'None'}", "type": "log", "logger": "NotebookApp", "thread": "MainThread", "level": "WARNING", "module": "log", "line_no": 57}

In order to use the same logger from the notebook app we
need to pass it through to the log_request function. To
do that we need to use a partial function so the `log` kwarg
is available. If not provided the default logger is the same
as before, the tornado web access logger.

This is necessary for the `log_json` settings to be used for
both the notebook app logs and the web app request logging.

There is still a todo needed here so that the individual
request fields get formatted separately rather than dumped
into the `msg` field.
@mriedem mriedem marked this pull request as ready for review October 9, 2020 20:15
notebook/notebookapp.py Outdated Show resolved Hide resolved
notebook/log.py Outdated Show resolved Hide resolved
@mriedem
Copy link
Contributor Author

mriedem commented Oct 9, 2020

I'd like to add tests for this but I'm not sure if there is a great way to test it. Maybe something like start the notebook app server with the env var set and verify the logged message can be serialized and then deserialized JSON? Though that would also mean having json-logging installed to run tests.

This updates the log_request method to get the request
properties logged separately if `log_json=True`. This
ends up looking like:

```
{"written_at": "2020-10-09T20:59:52.044Z", "written_ts": 1602277192044732000, "msg": "", "type": "log", "logger": "NotebookApp", "thread": "MainThread", "level": "WARNING", "module": "log", "line_no": 50, "status": 404, "method": "GET", "ip": "127.0.0.1", "uri": "/api/kernels/a6dd447b-282a-4e1c-ab65-eb340ad12965/channels?session_id=472907c27ed94e2cbfbae86501f7d159", "request_time": 8.63, "referer": "None"}
```
notebook/log.py Outdated Show resolved Hide resolved
Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Matt. This is coming along, but I think still too spread out. I've added some comments that should help isolate some of these changes. Thanks.

notebook/notebookapp.py Outdated Show resolved Hide resolved
notebook/notebookapp.py Outdated Show resolved Hide resolved
notebook/notebookapp.py Outdated Show resolved Hide resolved
@@ -250,9 +251,11 @@ def init_settings(self, jupyter_app, kernel_manager, contents_manager,
# collapse $HOME to ~
root_dir = '~' + root_dir[len(home):]

# Use the NotebookApp logger and its formatting for tornado request logging.
log_function = functools.partial(log_request, log=log)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not finding this to be necessary. If I restore to the original, I get the same output.

This PR:
{"written_at": "2020-10-09T21:11:30.439Z", "written_ts": 1602277890439353000, "msg": "200 GET /api/contents?type=directory&_=1602277851454 (127.0.0.1) 22.49ms", "type": "log", "logger": "NotebookApp", "thread": "MainThread", "level": "DEBUG", "module": "log", "line_no": 49}

with changes reverted:
{"written_at": "2020-10-09T21:19:03.578Z", "written_ts": 1602278343578833000, "msg": "200 GET /api/contents?type=directory&_=1602278343433 (127.0.0.1) 12.31ms", "type": "log", "logger": "NotebookApp", "thread": "MainThread", "level": "DEBUG", "module": "log", "line_no": 49}

notebook/notebookapp.py Show resolved Hide resolved
Copy link
Contributor Author

@mriedem mriedem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes coming soon.

notebook/notebookapp.py Show resolved Hide resolved
notebook/notebookapp.py Outdated Show resolved Hide resolved
notebook/notebookapp.py Outdated Show resolved Hide resolved
notebook/notebookapp.py Outdated Show resolved Hide resolved
notebook/log.py Outdated Show resolved Hide resolved
notebook/notebookapp.py Outdated Show resolved Hide resolved
Just use the app logger since that has the default LogFormatter
logging until we change it to JSON.
Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi Matt - this is looking great. Just had the one comment regarding the debug statement that doesn't seem to get displayed.

notebook/notebookapp.py Show resolved Hide resolved
Make sure the default value is handled properly based
on environment variables. Also checks the validation
code based on whether or not json_logging is imported.
Basic test just to make sure log_request handles log_json=True
properly.
The intent of the test is to make sure that log_json defaults
to True when JUPYTER_ENABLE_JSON_LOGGING is set to "true" so
this makes the test more explicit and to pass when json-logging
is not installed so that does not interfere due to validation.
@kevin-bates
Copy link
Member

Hi Matt - would you mind merging with the latest master and pushing that commit? Your changes with legitimate test failures coincided with a general build issue and I'd like to have this PR be 'green' prior to its merge into master. Sorry for the inconvenience.

Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - thank you Matt.

@kevin-bates kevin-bates merged commit 1550410 into jupyter:master Oct 14, 2020
@mriedem mriedem deleted the 5798-json-logging branch October 14, 2020 12:53
@mriedem
Copy link
Contributor Author

mriedem commented Oct 14, 2020

Looks good - thank you Matt.

Thanks for all of your help Kevin.

@mriedem
Copy link
Contributor Author

mriedem commented Oct 14, 2020

@kevin-bates quick question, do you have any idea of a timeline for the next planned release? Looking at recent releases it looks like you've been doing monthly releases which is cool. My team would like to avoid pulling this change in via a git hash so if there is a release planned on the horizon we'll wait, just wondering.

@kevin-bates
Copy link
Member

Would getting a release out in the next couple of weeks be sufficient - say last week of October/first week of November? Although convenient, I'd rather not set a fixed cadence unless there are other maintainers that can assist.

@mriedem
Copy link
Contributor Author

mriedem commented Oct 14, 2020

Would getting a release out in the next couple of weeks be sufficient - say last week of October/first week of November?

Yup, that would be great.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RFE: Add support for JSON logging format
2 participants