Observability mode - output jsonlines

HypothesisWorks · Dec 10, 2023 · 239c836
1 parent 6941cd2
Showing 21 changed files with 645 additions and 42 deletions.
diff --git a/hypothesis-python/RELEASE.rst b/hypothesis-python/RELEASE.rst
@@ -0,0 +1,4 @@
+RELEASE_TYPE: minor
+
+This release adds an experimental :wikipedia:`observability <Observability_(software)>`
+mode.  :doc:`You can read the docs about it here <observability>`.
diff --git a/hypothesis-python/docs/_static/.empty b/hypothesis-python/docs/_static/.empty
diff --git a/hypothesis-python/docs/_static/wrap-in-tables.css b/hypothesis-python/docs/_static/wrap-in-tables.css
@@ -0,0 +1,15 @@
+/* override table width restrictions */
+/* thanks to https://github.com/readthedocs/sphinx_rtd_theme/issues/117#issuecomment-153083280 */
+@media screen and (min-width: 767px) {
+
+    .wy-table-responsive table td {
+        /* !important prevents the common CSS stylesheets from
+            overriding this as on RTD they are loaded after this stylesheet */
+        white-space: normal !important;
+    }
+
+    .wy-table-responsive {
+        overflow: visible !important;
+    }
+
+}
diff --git a/hypothesis-python/docs/changes.rst b/hypothesis-python/docs/changes.rst
@@ -2362,7 +2362,7 @@ Did you know that of the 2\ :superscript:`64` possible floating-point numbers,
 
 While nans *usually* have all zeros in the sign bit and mantissa, this
 `isn't always true <https://wingolog.org/archives/2011/05/18/value-representation-in-javascript-implementations>`__,
-and :wikipedia:`'signaling' nans might trap or error <https://en.wikipedia.org/wiki/NaN#Signaling_NaN>`.
+and :wikipedia:`'signaling' nans might trap or error <NaN#Signaling_NaN>`.
 To help distinguish such errors in e.g. CI logs, Hypothesis now prints ``-nan`` for
 negative nans, and adds a comment like ``# Saw 3 signaling NaNs`` if applicable.
 

diff --git a/hypothesis-python/docs/conf.py b/hypothesis-python/docs/conf.py
@@ -30,6 +30,7 @@
     "hoverxref.extension",
     "sphinx_codeautolink",
     "sphinx_selective_exclude.eager_only",
+    "sphinx-jsonschema",
 ]
 
 templates_path = ["_templates"]
@@ -147,6 +148,8 @@ def setup(app):
 
 html_static_path = ["_static"]
 
+html_css_files = ["wrap-in-tables.css"]
+
 htmlhelp_basename = "Hypothesisdoc"
 
 html_favicon = "../../brand/favicon.ico"

diff --git a/hypothesis-python/docs/index.rst b/hypothesis-python/docs/index.rst
@@ -80,3 +80,4 @@ check out some of the
   support
   packaging
   reproducing
+  observability
diff --git a/hypothesis-python/docs/observability.rst b/hypothesis-python/docs/observability.rst
@@ -0,0 +1,76 @@
+===================
+Observability tools
+===================
+
+.. warning::
+
+    This feature is experimental, and could have breaking changes or even be removed
+    without notice.  Try it out, let us know what you think, but don't rely on it
+    just yet!
+
+
+Motivation
+==========
+
+Understanding what your code is doing - for example, why your test failed - is often
+a frustrating exercise in adding some more instrumentation or logging (or ``print()`` calls)
+and running it again.  The idea of :wikipedia:`observability <Observability_(software)>`
+is to let you answer questions you didn't think of in advance.  In slogan form,
+
+  *Debugging should be a data analysis problem.*
+
+By default, Hypothesis only reports the minimal failing example... but sometimes you might
+want to know something about *all* the examples.  Printing them to the terminal with
+:ref:`verbose output <verbose-output>` might be nice, but isn't always enough.
+This feature gives you an analysis-ready dataframe with one row per test case and
+columns ranging from arguments to code coverage to pass/fail status.
+
+This is deliberately a much lighter-weight and task-specific system than e.g.
+`OpenTelemetry <https://opentelemetry.io/>`__.  It's also less detailed than time-travel
+debuggers such as `rr <https://rr-project.org/>`__ or `pytrace <https://pytrace.com/>`__,
+because there's no good way to compare multiple traces from these tools and their
+Python support is relatively immature.
+
+
+Configuration
+=============
+
+If you set the ``HYPOTHESIS_EXPERIMENTAL_OBSERVABILITY`` environment variable,
+Hypothesis will log various observations to jsonlines files in the
+``.hypothesis/observed/`` directory.  You can load and explore these with e.g.
+:func:`pd.read_json(".hypothesis/observed/*_testcases.jsonl", lines=True) <pandas.read_json>`,
+or by using the :pypi:`sqlite-utils` and :pypi:`datasette` libraries::
+
+    sqlite-utils insert testcases.db testcases .hypothesis/observed/*_testcases.jsonl --nl --flatten
+    datasette serve testcases.db
+
+
+Collecting more information
+---------------------------
+
+If you want to record more information about your test cases than the arguments and
+outcome - for example, was ``x`` a binary tree?  what was the difference between the
+expected and the actual value?  how many queries did it take to find a solution? -
+Hypothesis makes this easy.
+
+:func:`~hypothesis.event` accepts a string label, and optionally a string, int, or
+float observation associated with it.  All events are collected and summarized in
+:ref:`statistics`, as well as included on a per-test-case basis in our observations.
+
+:func:`~hypothesis.target` is a special case of numeric-valued events: as well as
+recording them in observations, Hypothesis will try to maximize the targeted value.
+Knowing that, you can use this to guide the search for failing inputs.
+
+
+Data Format
+===========
+
+We dump observations in `json lines format <https://jsonlines.org/>`__, with each line
+describing either a test case or an information message.  The tables below are derived
+from :download:`this machine-readable JSON schema <schema_observations.json>`, to
+provide both readable and verifiable specifications.
+
+.. jsonschema:: ./schema_observations.json#/oneOf/0
+   :hide_key: /additionalProperties, /type
+.. jsonschema:: ./schema_observations.json#/oneOf/1
+   :hide_key: /additionalProperties, /type
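
The loading step described in observability.rst can be sketched with the standard library alone. This is a minimal sketch, assuming the `.hypothesis/observed/` directory and `*_testcases.jsonl` naming shown in the docs above; the helper names `load_observations` and `count_by_status` are hypothetical and not part of Hypothesis. It expands the glob explicitly rather than relying on the reader function to do so.

```python
import json
from pathlib import Path


def load_observations(directory=".hypothesis/observed", pattern="*_testcases.jsonl"):
    """Return every observation in the matching jsonlines files as a list of dicts."""
    rows = []
    for path in sorted(Path(directory).glob(pattern)):
        with path.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # tolerate blank lines between records
                    rows.append(json.loads(line))
    return rows


def count_by_status(rows):
    """Tally test-case observations by their "status" field, ignoring messages."""
    counts = {}
    for row in rows:
        if row.get("type") == "test_case":
            counts[row["status"]] = counts.get(row["status"], 0) + 1
    return counts
```

If pandas is installed, passing the returned list to `pandas.DataFrame(...)` should give the one-row-per-test-case table that the docs describe.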
diff --git a/hypothesis-python/docs/schema_observations.json b/hypothesis-python/docs/schema_observations.json
@@ -0,0 +1,93 @@
+{
+    "title": "PBT Observations",
+    "description": "PBT Observations define a standard way to communicate what happened when property-based tests were run.  They describe test cases, or general notifications classified as info, alert, or error messages.",
+    "oneOf": [
+        {
+            "title": "Test case",
+            "description": "Describes the inputs to and result of running some test function on a particular input.  The test might have passed, failed, or been abandoned part way through (e.g. because we failed a ``.filter()`` condition).",
+            "type": "object",
+            "properties": {
+                "type": {
+                    "const": "test_case",
+                    "description": "A tag which labels this observation as data about a specific test case."
+                },
+                "status": {
+                    "enum": ["passed", "failed", "gave_up"],
+                    "description": "Whether the test passed, failed, or was aborted before completion (e.g. due to use of ``.filter()``).  Note that if we gave_up partway, values such as arguments and features may be incomplete."
+                },
+                "status_reason": {
+                    "type": "string",
+                    "description": "If non-empty, the reason for which the test failed or was abandoned.  For Hypothesis, this is usually the exception type and location."
+                },
+                "representation": {
+                    "type": "string",
+                    "description": "The string representation of the input."
+                },
+                "arguments": {
+                    "type": "object",
+                    "description": "A structured json-encoded representation of the input.  Hypothesis always provides a dictionary of argument names to json-ified values, including interactive draws from the :func:`~hypothesis.strategies.data` strategy.  In other libraries this can be any object."
+                },
+                "how_generated": {
+                    "type": ["string", "null"],
+                    "description": "How the input was generated, if known.  In Hypothesis this might be an explicit example, generated during a particular phase with some backend, or by replaying the minimal failing example."
+                },
+                "features": {
+                    "type": "object",
+                    "description": "Runtime observations which might help explain what this test case did.  Hypothesis includes target() scores, tags from event(), time spent generating data and running user code, and so on."
+                },
+                "coverage": {
+                    "type": ["object", "null"],
+                    "description": "Mapping of filename to list of covered line numbers, if coverage information is available, or null if not.  Hypothesis deliberately omits stdlib and site-packages code.",
+                    "additionalProperties": {
+                        "type": "array",
+                        "items": {"type": "integer", "minimum": 1},
+                        "uniqueItems": true
+                    }
+                },
+                "metadata": {
+                    "type": "object",
+                    "description": "Arbitrary metadata which might be of interest, but does not semantically fit in 'features'.  For example, Hypothesis includes the traceback for failing tests here."
+                },
+                "property": {
+                    "type": "string",
+                    "description": "The name or representation of the test function we're running."
+                },
+                "run_start": {
+                    "type": "number",
+                    "description": "Unix timestamp at which we started running this test function, so that later analysis can group test cases by run."
+                }
+            },
+            "required": ["type", "status", "status_reason", "representation", "arguments", "how_generated", "features", "coverage", "metadata", "property", "run_start"],
+            "additionalProperties": false
+        },
+        {
+            "title": "Information message",
+            "description": "Info, alert, and error messages correspond to a group of test cases or the overall run, and are intended for humans rather than machine analysis.",
+            "type": "object",
+            "properties": {
+                "type": {
+                    "enum": [ "info", "alert", "error"],
+                    "description": "A tag which labels this observation as general information to show the user.  Hypothesis uses info messages to report statistics; alert or error messages can be provided by plugins."
+                },
+                "title": {
+                    "type": "string",
+                    "description": "The title of this message."
+                },
+                "content": {
+                    "type": "string",
+                    "description": "The body of the message.  May use markdown."
+                },
+                "property": {
+                    "type": "string",
+                    "description": "The name or representation of the test function we're running.  For Hypothesis, usually the Pytest nodeid."
+                },
+                "run_start": {
+                    "type": "number",
+                    "description": "Unix timestamp at which we started running this test function, so that later analysis can group test cases by run."
+                }
+            },
+            "required": [ "type", "title", "content", "property", "run_start"],
+            "additionalProperties": false
+        }
+    ]
+}
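
The shape that the schema above requires can be checked with a few lines of Python. This is a rough sketch, not full JSON Schema validation: it only compares keys against the two `required` arrays (copied from schema_observations.json) and checks the `status` enum. The name `check_observation` is hypothetical; a dedicated validator such as the `jsonschema` package would be more thorough.

```python
# Required keys copied from the "required" arrays in schema_observations.json.
TEST_CASE_KEYS = {
    "type", "status", "status_reason", "representation", "arguments",
    "how_generated", "features", "coverage", "metadata", "property", "run_start",
}
MESSAGE_KEYS = {"type", "title", "content", "property", "run_start"}
STATUSES = {"passed", "failed", "gave_up"}


def check_observation(obs):
    """Return a list of problems; an empty list means the keys look right."""
    kind = obs.get("type")
    if kind == "test_case":
        expected = TEST_CASE_KEYS
    elif kind in {"info", "alert", "error"}:
        expected = MESSAGE_KEYS
    else:
        return [f"unknown type: {kind!r}"]

    problems = []
    present = set(obs)
    if expected - present:
        problems.append(f"missing keys: {sorted(expected - present)}")
    if present - expected:
        # Both variants set "additionalProperties": false.
        problems.append(f"unexpected keys: {sorted(present - expected)}")
    if kind == "test_case" and "status" in obs and obs["status"] not in STATUSES:
        problems.append(f"invalid status: {obs['status']!r}")
    return problems
```

Running each parsed line of an observations file through such a check is one way to catch a producer that drifts from the schema.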
diff --git a/hypothesis-python/src/_hypothesis_pytestplugin.py b/hypothesis-python/src/_hypothesis_pytestplugin.py
@@ -373,6 +373,13 @@ def pytest_terminal_summary(terminalreporter):
                 if fex:
                     failing_examples.append(json.loads(fex))
 
+        from hypothesis.internal.observability import _WROTE_TO
+
+        if _WROTE_TO:
+            terminalreporter.section("Hypothesis")
+            for fname in sorted(_WROTE_TO):
+                terminalreporter.write_line(f"observations written to {fname}")
+
         if failing_examples:
             # This must have been imported already to write the failing examples
             from hypothesis.extra._patching import gc_patches, make_patch, save_patch
@@ -384,7 +391,8 @@ def pytest_terminal_summary(terminalreporter):
             except Exception:
                 # fail gracefully if we hit any filesystem or permissions problems
                 return
-            terminalreporter.section("Hypothesis")
+            if not _WROTE_TO:
+                terminalreporter.section("Hypothesis")
             terminalreporter.write_line(
                 f"`git apply {fname}` to add failing examples to your code."
             )