From f8f32287b9c54adfebb0330f9c6f88e2fc71ad51 Mon Sep 17 00:00:00 2001 From: Grzegorz Bokota Date: Fri, 11 Aug 2023 15:39:35 +0200 Subject: [PATCH 01/12] add nap for telemetry --- docs/_toc.yml | 1 + docs/naps/8-telemetry.md | 181 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 182 insertions(+) create mode 100644 docs/naps/8-telemetry.md diff --git a/docs/_toc.yml b/docs/_toc.yml index a100ff8cc..0913786d7 100644 --- a/docs/_toc.yml +++ b/docs/_toc.yml @@ -145,6 +145,7 @@ subtrees: - file: naps/5-new-logo - file: naps/6-contributable-menus - file: naps/7-key-binding-dispatch + - file: naps/8*telemetry - file: developers/documentation/index subtrees: - entries: diff --git a/docs/naps/8-telemetry.md b/docs/naps/8-telemetry.md new file mode 100644 index 000000000..9d699dff7 --- /dev/null +++ b/docs/naps/8-telemetry.md @@ -0,0 +1,181 @@ +:orphan: + + (nap-8)= + + # NAP-8 — Telemetry + + ```{eval-rst} + :Author: Grzegorz Bokota + :Created: + :Resolution: (required for Accepted | Rejected | Withdrawn) + :Resolved: + :Status: Draft + :Type: Standards Track + :Version effective: (for accepted NAPs) + ``` + + ## Abstract + + This NAP is describing why Telemetry is helpful to the Napari project and describes the architecture and solutions selected to maximize the privacy of our users. + + ## Motivation and Scope + +With the growth of napari, the standard feedback loop through napari community meetings and napari-related events at conferences has reached its capacity. Also, we collect many feature requests for which we cannot find volunteers for implementation. + +To have the possibility of sustainable development of the project we need either have funds to pay contractors or have some company that donates their worker times manageable by core-devs. + +Both these scenarios require our side the ability to provide some information about the estimated number of users to prove to potential founders that their donation/grant will be used in a valuable way. 
+ +Adding the option for monitoring plugin usage allows us to identify the most important plugins and try to establish cooperation with their maintainers to reduce the probability that the plugin will not be ready for a new release. Such monitoring could contain not only the list of installed plugins but also which commands contributions are used most often. + +Also collecting information about data types and their size will provide valuable information about the typical use cases of napari. + +Still, a user need to be able to opt out of such monitoring. And adjust the level of detail of the information that is sent to the napari server. + + + ## Detailed Description + +`napari-telemetry` will be a package responsible for collecting and sending telemetry data to the napari server. It will be installed after user confirmation. It will contain callbacks for collection data, and utils for its storage and sending. Also, this package will contain utils for validation if the user has agreed to telemetry. + +In the main package, there is a need to add code to ask users if they want to enable telemetry. This 1code should be executed only once per environment. + +Telemetry should contain following way to disable it: + +1. uninstall `napari-telemetry` package +2. Environment variable `NAPARI_TELEMETRY=0` +3. Full list of endpoints used for collecting telemetry, that could be filtered on the firewall level. + +The user should be able to adjust the telemetry level of detail. The following levels are proposed: + +1. `none` - no telemetry is collected +2. `basic` - information about the napari version, python version, OS, and CPU architecture is collected and if it is the first report by the user. There is also a user identifier created based on computer details that will be rerendered each week to prevent tracking the user, but allow to not count a user multiple times. +3. 
`middle` - same as in `basic` but also information about the list of installed plugins and their versions is collected. We take care to not collect data about plugins that are not indented to be public. +4. `full` - same as in `middle` but also information about plugin usage by binding to app-model and collect information about called plugins command. Also basic information about data like type (`np.ndarray`, `dask.array`, `zarr.Array`, etc.) and its size is collected. + +There should be a visible indicator that telemetry is enabled (for example on the status bar). + +The second part of this work should be setup the server to collect telemetry data. Next to collecting data, it should provide a basic public dashboard that will allow a community to see aggregated information. + +I propose to have the following data retention policy: + +1) Up to 2 weeks for logs. +2) up 2 months of raw data (1 month of collection, then aggregation and time to validate aggregated data), +3) infinite of aggregated data. + +## Privacy assessment + +During the preparation of this NAP we assume that none of the collected data will be presented in +a form that allows to identify a single user or identify a research area of user. We also select a set of data that will be collected to minimize the possibility of reval of fragile data, but it is impossible to guarantee that it will not be possible to identify a single user (for example checking plugins combination). + +Because of this, we decided to not publish raw data and only show aggregated results. The aggregation will be performed using scripts. Napari core-devs will access raw data only when there will be some errors in the aggregation process. + +We also will publish a list of endpoints for each level of telemetry, so the given level of telemetry could be blocked on the organization level (for example by the rule on the firewall). 
+ + +If someone found that we are publishing some problematic data we will remove them and update the aggregation process to prevent such a situation in the future. +This NAP will be updated to reflect the current state of telemetry. + + +## Related Work + +Total systems: +https://plausible.io/ +https://sentry.io/ +https://opentelemetry.io/ + +Visualizations: +https://github.com/grafana/grafana + + + +## Implementation + +The main thing for implementation should be the low cost of maintenance. So the solution should be as simple as possible. We could either use existing solutions on the server side or implement our own. + +The benefit of the existing solution is that most of the work is already done. The downside is that it may require additional cost of maintenance. This cost may be caused by many features that are not needed for napari and could increase the risk of leaking data. As I check they are implemented in techniques that are not familiar to napari core-devs. So if there will be a decision to use them we should select an SAS solution that will be maintained by the company. + + +For the current, I suggest creating a simple REST API server for collecting the data. +It could be a simple Python FastAPI server that will store data in the SQLite database. + +Data for aggregation should be extracted from the database using a script running on the same machine. + +The output of the aggregation script should be loaded to some existing visualization tool, like grafana. + +It may be nice to host them on separate servers, then even if the data presented on the dashboard will be compromised, the raw data will be not exposed to the world. + +Having both server and aggregation scripts in Python will reduce maintenance costs for napari core-devs. + +We should register the `telemetry.napari.org` domain and use it for the server. The main page contains this NAP and a link to the summary dashboard. 
+ + +The main part of the application side should be implemented in `napari-telemetry` package. +The package should not report in stream mode, but collect data on the disk and send it in batches. This will reduce the risk of leaking data. The package should implement a utility to allow users to preview collected data before sending it to the server. + +In the napari itself following changes should be implemented: + +1) The indicator that shows the telemetry status +2) The dialog that asks a user if they want to enable telemetry +3) code to check if telemetry is enabled (to not load the `napari-telemetry` package if it is disabled) +4) code required to init `napari-telemetry` package + + +## Potential problems + +There is a risk that someone may try to highjack the telemetry module name to have code executed at every napari start. + +I do not expect that it is a high risk, but exists. We could address it by code signing. This will require additional procedures to protect private cryptographic keys. + +Another option is to scan public plugins and their dependencies. This is simpler but will require establishing additional communication channels to be able to warn users about the potential problem. + + + +## Backward Compatibility + + Not relevant + +## Future Work + +A nice extension may be the ability for the steering council to create a certificate of telemetry output that could be given to plugin maintainers to prove to supervisors that their plugin is used by the community. + + + ## Alternatives + + During the discussion, there is a proposal to use the same approach as used in ImageJ. + + Mean that instead of implementing telemetry on the client side we could implement it on the update server side. The advantage and disadvantage of such a solution is that no user could opt out of telemetry. Also, such a method could potentially provide information about the Python version, napari version and list of installed plugins. 
All others will require a mechanism from this NAP. + + It will also require updates on the Napari side as currently we only communicate with the update server when a user opens the plugin manager. Also, to have proper information about installed plugins we will need to send information about the list of installed plugins instead of just downloading the information about all plugins from the server. + + As this solution provides less information, does not allow for opt-out and could cause blacklisting of the update server IP address, I do not recommend it. + + But based on talks that happen during the discussion we may think about more frequent checks for updates to inform users that they could update their Napari or plugin version. For such a change we need to update our update server to provide information per Python version (as some plugins could drop old Python earlier). + + +The second alternative is use a third-party solution like [plausable.io](https://plausible.io/). But from my perspective, it is harder to adjust a set of data that is collected as these services are designed to monitor webpages. + + + ## Discussion + + This section may just be a bullet list including links to any discussions + regarding the NAP, but could also contain additional comments about that + discussion: + + - This includes links to discussion forum threads or relevant GitHub discussions. + + ## References and Footnotes + + All NAPs should be declared as dedicated to the public domain with the CC0 + license [^id3], as in `Copyright`, below, with attribution encouraged with + CC0+BY [^id4]. + + [^id3]: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication, + + + [^id4]: + + ## Copyright + + This document is dedicated to the public domain with the Creative Commons CC0 + license [^id3]. Attribution to this source is encouraged where appropriate, as per + CC0+BY [^id4]. 
\ No newline at end of file From 966430e2483886d02a5856217c179d780152d9b5 Mon Sep 17 00:00:00 2001 From: Grzegorz Bokota Date: Fri, 18 Aug 2023 23:54:27 +0200 Subject: [PATCH 02/12] Apply suggestions from code review Co-authored-by: Draga Doncila Pop <17995243+DragaDoncila@users.noreply.github.com> --- docs/naps/8-telemetry.md | 38 +++++++++++++++++++------------------- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/docs/naps/8-telemetry.md b/docs/naps/8-telemetry.md index 9d699dff7..19b4f2a1d 100644 --- a/docs/naps/8-telemetry.md +++ b/docs/naps/8-telemetry.md @@ -16,17 +16,17 @@ ## Abstract - This NAP is describing why Telemetry is helpful to the Napari project and describes the architecture and solutions selected to maximize the privacy of our users. + This NAP is describes why telemetry is helpful to the napari project and the architecture and solutions selected to maximize the privacy of our users. ## Motivation and Scope With the growth of napari, the standard feedback loop through napari community meetings and napari-related events at conferences has reached its capacity. Also, we collect many feature requests for which we cannot find volunteers for implementation. -To have the possibility of sustainable development of the project we need either have funds to pay contractors or have some company that donates their worker times manageable by core-devs. +To have the possibility of sustainable development of the project we will either need to rely on paid contractors or on companies donating employee time managed by the core devs. -Both these scenarios require our side the ability to provide some information about the estimated number of users to prove to potential founders that their donation/grant will be used in a valuable way. +Both scenarios require us to provide some information about the estimated number of users to prove to potential funders that their donation/grant will be used in a valuable way. 
-Adding the option for monitoring plugin usage allows us to identify the most important plugins and try to establish cooperation with their maintainers to reduce the probability that the plugin will not be ready for a new release. Such monitoring could contain not only the list of installed plugins but also which commands contributions are used most often. +Adding the option for monitoring plugin usage allows us to identify the most important plugins and try to establish cooperation with their maintainers to reduce the probability that the plugin will not be ready for a new napari release. Such monitoring could contain not only the list of installed plugins but also which commands and contributions are used most often. Also collecting information about data types and their size will provide valuable information about the typical use cases of napari. @@ -35,11 +35,11 @@ Still, a user need to be able to opt out of such monitoring. And adjust the leve ## Detailed Description -`napari-telemetry` will be a package responsible for collecting and sending telemetry data to the napari server. It will be installed after user confirmation. It will contain callbacks for collection data, and utils for its storage and sending. Also, this package will contain utils for validation if the user has agreed to telemetry. +`napari-telemetry` will be a package responsible for collecting and sending telemetry data to the napari server. It will be installed after user confirmation. It will contain callbacks for data collection, and utils for storage and sending. Also, this package will contain utils for validating if the user has agreed to telemetry. -In the main package, there is a need to add code to ask users if they want to enable telemetry. This 1code should be executed only once per environment. +In the main package, there is a need to add code to ask users if they want to enable telemetry. This code should be executed only once per environment. 
-Telemetry should contain following way to disable it: +Telemetry should contain following ways to disable it: 1. uninstall `napari-telemetry` package 2. Environment variable `NAPARI_TELEMETRY=0` @@ -49,12 +49,12 @@ The user should be able to adjust the telemetry level of detail. The following l 1. `none` - no telemetry is collected 2. `basic` - information about the napari version, python version, OS, and CPU architecture is collected and if it is the first report by the user. There is also a user identifier created based on computer details that will be rerendered each week to prevent tracking the user, but allow to not count a user multiple times. -3. `middle` - same as in `basic` but also information about the list of installed plugins and their versions is collected. We take care to not collect data about plugins that are not indented to be public. -4. `full` - same as in `middle` but also information about plugin usage by binding to app-model and collect information about called plugins command. Also basic information about data like type (`np.ndarray`, `dask.array`, `zarr.Array`, etc.) and its size is collected. +3. `middle` - same as in `basic` but also information about the list of installed plugins and their versions is collected. We take care to not collect data about plugins that are not intended to be public. +4. `full` - same as in `middle` but also collects information about plugin usage by binding to app-model and logging plugin commands used. Also basic information about data like type (`np.ndarray`, `dask.array`, `zarr.Array`, etc.) and its size is collected. There should be a visible indicator that telemetry is enabled (for example on the status bar). -The second part of this work should be setup the server to collect telemetry data. Next to collecting data, it should provide a basic public dashboard that will allow a community to see aggregated information. +The second part of this work should be to setup the server to collect telemetry data. 
After collecting data, it should provide a basic public dashboard that will allow the community to see aggregated information. I propose to have the following data retention policy: @@ -65,9 +65,9 @@ I propose to have the following data retention policy: ## Privacy assessment During the preparation of this NAP we assume that none of the collected data will be presented in -a form that allows to identify a single user or identify a research area of user. We also select a set of data that will be collected to minimize the possibility of reval of fragile data, but it is impossible to guarantee that it will not be possible to identify a single user (for example checking plugins combination). +a form that allows to identify a single user or identify a research area of user. We also select a set of data that will be collected to minimize the possibility of revealing fragile data, but it is impossible to guarantee that it will not be possible to identify a single user (for example by checking installed plugin combinations). -Because of this, we decided to not publish raw data and only show aggregated results. The aggregation will be performed using scripts. Napari core-devs will access raw data only when there will be some errors in the aggregation process. +Because of this, we propose to not publish raw data and only show aggregated results. The aggregation will be performed using scripts. Napari core devs will access raw data only if there are errors in the aggregation process. We also will publish a list of endpoints for each level of telemetry, so the given level of telemetry could be blocked on the organization level (for example by the rule on the firewall). @@ -92,27 +92,27 @@ https://github.com/grafana/grafana The main thing for implementation should be the low cost of maintenance. So the solution should be as simple as possible. We could either use existing solutions on the server side or implement our own. 
-The benefit of the existing solution is that most of the work is already done. The downside is that it may require additional cost of maintenance. This cost may be caused by many features that are not needed for napari and could increase the risk of leaking data. As I check they are implemented in techniques that are not familiar to napari core-devs. So if there will be a decision to use them we should select an SAS solution that will be maintained by the company. +The benefit of existing solutions is that most of the work is already done. The downside is that it may require additional cost of maintenance. This cost may be caused by many features that are not needed for napari and could increase the risk of leaking data. Quick checks of their code revealed they are implemented in techniques that are not familiar to napari core devs. So, if we decide to use them, we should select an SAS solution that will be maintained by the company. -For the current, I suggest creating a simple REST API server for collecting the data. +For now, I suggest creating a simple REST API server for collecting the data. It could be a simple Python FastAPI server that will store data in the SQLite database. Data for aggregation should be extracted from the database using a script running on the same machine. The output of the aggregation script should be loaded to some existing visualization tool, like grafana. -It may be nice to host them on separate servers, then even if the data presented on the dashboard will be compromised, the raw data will be not exposed to the world. +It may be nice to host raw and aggregate data on separate servers - then even if the data presented on the dashboard is compromised, the raw data will be not exposed to the world. -Having both server and aggregation scripts in Python will reduce maintenance costs for napari core-devs. +Having both server and aggregation scripts in Python will reduce maintenance costs for napari core devs. 
-We should register the `telemetry.napari.org` domain and use it for the server. The main page contains this NAP and a link to the summary dashboard. +We should register the `telemetry.napari.org` domain and use it for the server. The main page will contain this NAP and a link to the summary dashboard. The main part of the application side should be implemented in `napari-telemetry` package. The package should not report in stream mode, but collect data on the disk and send it in batches. This will reduce the risk of leaking data. The package should implement a utility to allow users to preview collected data before sending it to the server. -In the napari itself following changes should be implemented: +In napari itself, the following changes should be implemented: 1) The indicator that shows the telemetry status 2) The dialog that asks a user if they want to enable telemetry @@ -152,7 +152,7 @@ A nice extension may be the ability for the steering council to create a certifi But based on talks that happen during the discussion we may think about more frequent checks for updates to inform users that they could update their Napari or plugin version. For such a change we need to update our update server to provide information per Python version (as some plugins could drop old Python earlier). -The second alternative is use a third-party solution like [plausable.io](https://plausible.io/). But from my perspective, it is harder to adjust a set of data that is collected as these services are designed to monitor webpages. +The second alternative is use a third-party solution like [plausible.io](https://plausible.io/). But from my perspective, it is harder to adjust a set of data that is collected as these services are designed to monitor webpages. 
## Discussion From 33c62e3fbb6e7675d0a6c7cc1f0c800f512c5b99 Mon Sep 17 00:00:00 2001 From: Grzegorz Bokota Date: Fri, 18 Aug 2023 23:56:12 +0200 Subject: [PATCH 03/12] Update docs/naps/8-telemetry.md Co-authored-by: Draga Doncila Pop <17995243+DragaDoncila@users.noreply.github.com> --- docs/naps/8-telemetry.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/naps/8-telemetry.md b/docs/naps/8-telemetry.md index 19b4f2a1d..51fc0daee 100644 --- a/docs/naps/8-telemetry.md +++ b/docs/naps/8-telemetry.md @@ -30,7 +30,7 @@ Adding the option for monitoring plugin usage allows us to identify the most imp Also collecting information about data types and their size will provide valuable information about the typical use cases of napari. -Still, a user need to be able to opt out of such monitoring. And adjust the level of detail of the information that is sent to the napari server. +Still, users need to be able to opt out of such monitoring, and adjust the level of detail of the information that is sent to the napari server. ## Detailed Description From d30adbf6b572e08e0a5df0a4b1236770e09c952d Mon Sep 17 00:00:00 2001 From: Grzegorz Bokota Date: Sat, 19 Aug 2023 00:18:41 +0200 Subject: [PATCH 04/12] add disabling telemetry in settings --- docs/naps/8-telemetry.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/docs/naps/8-telemetry.md b/docs/naps/8-telemetry.md index 51fc0daee..951f81b75 100644 --- a/docs/naps/8-telemetry.md +++ b/docs/naps/8-telemetry.md @@ -41,9 +41,10 @@ In the main package, there is a need to add code to ask users if they want to en Telemetry should contain following ways to disable it: -1. uninstall `napari-telemetry` package -2. Environment variable `NAPARI_TELEMETRY=0` -3. Full list of endpoints used for collecting telemetry, that could be filtered on the firewall level. +1. Disable in settings +2. uninstall `napari-telemetry` package +3. Environment variable `NAPARI_TELEMETRY=0` +4. 
Full list of endpoints used for collecting telemetry, that could be filtered on the firewall level. The user should be able to adjust the telemetry level of detail. The following levels are proposed: From 848465c5b0278b5bb5ce388a61e4d12c749975ca Mon Sep 17 00:00:00 2001 From: Grzegorz Bokota Date: Sat, 19 Aug 2023 00:22:51 +0200 Subject: [PATCH 05/12] fix toc --- docs/_toc.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/_toc.yml b/docs/_toc.yml index 0913786d7..15f2a332f 100644 --- a/docs/_toc.yml +++ b/docs/_toc.yml @@ -145,7 +145,7 @@ subtrees: - file: naps/5-new-logo - file: naps/6-contributable-menus - file: naps/7-key-binding-dispatch - - file: naps/8*telemetry + - file: naps/8-telemetry - file: developers/documentation/index subtrees: - entries: From 20586136b47800074afc710cd66f1b054059c7ce Mon Sep 17 00:00:00 2001 From: Grzegorz Bokota Date: Tue, 12 Sep 2023 11:26:54 +0200 Subject: [PATCH 06/12] clarify data about which plugin will be collected --- docs/naps/8-telemetry.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/naps/8-telemetry.md b/docs/naps/8-telemetry.md index 951f81b75..0c1635e51 100644 --- a/docs/naps/8-telemetry.md +++ b/docs/naps/8-telemetry.md @@ -50,7 +50,7 @@ The user should be able to adjust the telemetry level of detail. The following l 1. `none` - no telemetry is collected 2. `basic` - information about the napari version, python version, OS, and CPU architecture is collected and if it is the first report by the user. There is also a user identifier created based on computer details that will be rerendered each week to prevent tracking the user, but allow to not count a user multiple times. -3. `middle` - same as in `basic` but also information about the list of installed plugins and their versions is collected. We take care to not collect data about plugins that are not intended to be public. +3. 
`middle` - same as in `basic` but also information about the list of installed plugins and their versions is collected. We take care to not collect data about plugins that are not intended to be public, so we will not collect information about plugins searchable as napri plugin using plugin dialog or napri-hub. We also will not collect information about plugins that are installed in non stable version. 4. `full` - same as in `middle` but also collects information about plugin usage by binding to app-model and logging plugin commands used. Also basic information about data like type (`np.ndarray`, `dask.array`, `zarr.Array`, etc.) and its size is collected. There should be a visible indicator that telemetry is enabled (for example on the status bar). From 78e9e1ce38e9723299324cc3f1e03390981e955d Mon Sep 17 00:00:00 2001 From: Grzegorz Bokota Date: Tue, 12 Sep 2023 11:35:46 +0200 Subject: [PATCH 07/12] add gdpr information --- docs/naps/8-telemetry.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/docs/naps/8-telemetry.md b/docs/naps/8-telemetry.md index 0c1635e51..e1dbed3ab 100644 --- a/docs/naps/8-telemetry.md +++ b/docs/naps/8-telemetry.md @@ -130,6 +130,14 @@ I do not expect that it is a high risk, but exists. We could address it by code Another option is to scan public plugins and their dependencies. This is simpler but will require establishing additional communication channels to be able to warn users about the potential problem. +## GDPR compatybility + +I'm almost sure that we will not collect data that are covered by GDPR. But to get better atmosphere +we need to add instruction how user could retrive his unique identifier and setup a process +for requests to remove data from the server. This process is unlikely to be used often, as the life span of the data is short, +but we need to be prepared for such a situation. I suggest using e-mail for that. 
+ + ## Backward Compatibility From 3e17ac637a5caedf77ca0f86a8631634abc10406 Mon Sep 17 00:00:00 2001 From: Grzegorz Bokota Date: Thu, 12 Oct 2023 10:40:21 +0200 Subject: [PATCH 08/12] fixes from review --- docs/naps/8-telemetry.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/naps/8-telemetry.md b/docs/naps/8-telemetry.md index e1dbed3ab..52773fc83 100644 --- a/docs/naps/8-telemetry.md +++ b/docs/naps/8-telemetry.md @@ -66,7 +66,7 @@ I propose to have the following data retention policy: ## Privacy assessment During the preparation of this NAP we assume that none of the collected data will be presented in -a form that allows to identify a single user or identify a research area of user. We also select a set of data that will be collected to minimize the possibility of revealing fragile data, but it is impossible to guarantee that it will not be possible to identify a single user (for example by checking installed plugin combinations). +a form that allows to identify a single user or identify a research area of user. We also select a set of data that will be collected to minimize the possibility of revealing sensitive data, but it is impossible to guarantee that it will not be possible to identify a single user (for example by checking installed plugin combinations). Because of this, we propose to not publish raw data and only show aggregated results. The aggregation will be performed using scripts. Napari core devs will access raw data only if there are errors in the aggregation process. @@ -98,6 +98,7 @@ The benefit of existing solutions is that most of the work is already done. The For now, I suggest creating a simple REST API server for collecting the data. It could be a simple Python FastAPI server that will store data in the SQLite database. +The connection to the server will be encrypted using HTTPS, with a certificate provided by Let's Encrypt. 
Data for aggregation should be extracted from the database using a script running on the same machine. @@ -130,7 +131,7 @@ I do not expect that it is a high risk, but exists. We could address it by code Another option is to scan public plugins and their dependencies. This is simpler but will require establishing additional communication channels to be able to warn users about the potential problem. -## GDPR compatybility +## GDPR compliance I'm almost sure that we will not collect data that are covered by GDPR. But to get better atmosphere we need to add instruction how user could retrive his unique identifier and setup a process From d4c5d9c120361cf6b16de77ebe1a666cf1344f71 Mon Sep 17 00:00:00 2001 From: Grzegorz Bokota Date: Wed, 18 Oct 2023 12:53:42 +0200 Subject: [PATCH 09/12] Apply suggestions from code review Co-authored-by: Draga Doncila Pop <17995243+DragaDoncila@users.noreply.github.com> --- docs/naps/8-telemetry.md | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/docs/naps/8-telemetry.md b/docs/naps/8-telemetry.md index 52773fc83..582852b8e 100644 --- a/docs/naps/8-telemetry.md +++ b/docs/naps/8-telemetry.md @@ -16,7 +16,7 @@ ## Abstract - This NAP is describes why telemetry is helpful to the napari project and the architecture and solutions selected to maximize the privacy of our users. + This NAP describes how telemetry would be used by the napari project and the architecture and solutions proposed to maximize the privacy of our users. ## Motivation and Scope @@ -26,7 +26,7 @@ To have the possibility of sustainable development of the project we will either Both scenarios require us to provide some information about the estimated number of users to prove to potential funders that their donation/grant will be used in a valuable way. 
-Adding the option for monitoring plugin usage allows us to identify the most important plugins and try to establish cooperation with their maintainers to reduce the probability that the plugin will not be ready for a new napari release. Such monitoring could contain not only the list of installed plugins but also which commands and contributions are used most often. +Adding the option for monitoring plugin usage allows us to identify heavily used plugins and try to establish cooperation with their maintainers to reduce the probability that the plugin will not be ready for a new napari release. Such monitoring could contain not only the list of installed plugins but also which commands and contributions are used most often. Also collecting information about data types and their size will provide valuable information about the typical use cases of napari. @@ -42,26 +42,26 @@ In the main package, there is a need to add code to ask users if they want to en Telemetry should contain following ways to disable it: 1. Disable in settings -2. uninstall `napari-telemetry` package +2. Uninstall `napari-telemetry` package 3. Environment variable `NAPARI_TELEMETRY=0` 4. Full list of endpoints used for collecting telemetry, that could be filtered on the firewall level. The user should be able to adjust the telemetry level of detail. The following levels are proposed: 1. `none` - no telemetry is collected -2. `basic` - information about the napari version, python version, OS, and CPU architecture is collected and if it is the first report by the user. There is also a user identifier created based on computer details that will be rerendered each week to prevent tracking the user, but allow to not count a user multiple times. -3. `middle` - same as in `basic` but also information about the list of installed plugins and their versions is collected. 
We take care to not collect data about plugins that are not intended to be public, so we will not collect information about plugins searchable as napri plugin using plugin dialog or napri-hub. We also will not collect information about plugins that are installed in non stable version. +2. `basic` - information about the napari version, python version, OS, and CPU architecture is collected and if it is the first report by the user. There is also a user identifier created based on computer details that will be regenerated each week to prevent tracking the user, but allow us to accurately gauge individual user numbers. +3. `middle` - same as in `basic` but information about the list of installed plugins and their versions is also collected. We take care to not collect data about plugins that are not intended to be public, so we will only collect information about plugins searchable as using plugin dialog or napari hub. We also will not collect information about plugins that are installed in non stable version. 4. `full` - same as in `middle` but also collects information about plugin usage by binding to app-model and logging plugin commands used. Also basic information about data like type (`np.ndarray`, `dask.array`, `zarr.Array`, etc.) and its size is collected. There should be a visible indicator that telemetry is enabled (for example on the status bar). The second part of this work should be to setup the server to collect telemetry data. After collecting data, it should provide a basic public dashboard that will allow the community to see aggregated information. -I propose to have the following data retention policy: +We propose the following data retention policy: 1) Up to 2 weeks for logs. -2) up 2 months of raw data (1 month of collection, then aggregation and time to validate aggregated data), -3) infinite of aggregated data. +2) Up 2 months of raw data (1 month of collection, then aggregation and time to validate aggregated data). 
+3) Infinite of aggregated data. ## Privacy assessment @@ -91,7 +91,7 @@ https://github.com/grafana/grafana ## Implementation -The main thing for implementation should be the low cost of maintenance. So the solution should be as simple as possible. We could either use existing solutions on the server side or implement our own. +The key consideration for implementation should be the low cost of maintenance. So the solution should be as simple as possible. We could either use existing solutions on the server side or implement our own. The benefit of existing solutions is that most of the work is already done. The downside is that it may require additional cost of maintenance. This cost may be caused by many features that are not needed for napari and could increase the risk of leaking data. Quick checks of their code revealed they are implemented in techniques that are not familiar to napari core devs. So, if we decide to use them, we should select an SAS solution that will be maintained by the company. @@ -134,8 +134,8 @@ Another option is to scan public plugins and their dependencies. This is simpler ## GDPR compliance I'm almost sure that we will not collect data that are covered by GDPR. But to get better atmosphere -we need to add instruction how user could retrive his unique identifier and setup a process -for requests to remove data from the server. It is not high propability of usage as life span of data is short, +we need to add instruction how user could retrieve his unique identifier and setup a process +for requests to remove data from the server. It is not high probability of usage as life span of data is short, but we need to be prepared for such a situation. I suggest to use e-mail for that. @@ -153,7 +153,7 @@ A nice extension may be the ability for the steering council to create a certifi During the discussion, there is a proposal to use the same approach as used in ImageJ. 
- Mean that instead of implementing telemetry on the client side we could implement it on the update server side. The advantage and disadvantage of such a solution is that no user could opt out of telemetry. Also, such a method could potentially provide information about the Python version, napari version and list of installed plugins. All others will require a mechanism from this NAP. +This would mean instead of implementing telemetry on the client side we could implement it on the update server side. The advantage and disadvantage of such a solution is that no user could opt out of telemetry. Also, such a method could potentially provide information about the Python version, napari version and list of installed plugins. All others will require a mechanism from this NAP. It will also require updates on the Napari side as currently we only communicate with the update server when a user opens the plugin manager. Also, to have proper information about installed plugins we will need to send information about the list of installed plugins instead of just downloading the information about all plugins from the server. From a5c1cc3e7f1db5b058794a2c51ad3a38bac08a7a Mon Sep 17 00:00:00 2001 From: Grzegorz Bokota Date: Sun, 22 Oct 2023 18:50:19 +0200 Subject: [PATCH 10/12] fox changes from code review --- docs/naps/8-telemetry.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/docs/naps/8-telemetry.md b/docs/naps/8-telemetry.md index 582852b8e..d4970147e 100644 --- a/docs/naps/8-telemetry.md +++ b/docs/naps/8-telemetry.md @@ -6,7 +6,7 @@ ```{eval-rst} :Author: Grzegorz Bokota - :Created: + :Created: 2023-08-11 :Resolution: (required for Accepted | Rejected | Withdrawn) :Resolved: :Status: Draft @@ -31,6 +31,10 @@ Adding the option for monitoring plugin usage allows us to identify heavily used Also collecting information about data types and their size will provide valuable information about the typical use cases of napari. 
Still, users need to be able to opt out of such monitoring, and adjust the level of detail of the information that is sent to the napari server. +Each time when we update the collected data, we should inform users about the changes and provide them with the possibility to opt out of telemetry. + +Users could also provide a temporary agreement for sending telemetry. +Then after a given period of time, the dialog with question will be shown again. ## Detailed Description @@ -44,7 +48,7 @@ Telemetry should contain following ways to disable it: 1. Disable in settings 2. Uninstall `napari-telemetry` package 3. Environment variable `NAPARI_TELEMETRY=0` -4. Full list of endpoints used for collecting telemetry, that could be filtered on the firewall level. +4. System-wide disablement e.g. via firewall filtering for hpc or other environments. The user should be able to adjust the telemetry level of detail. The following levels are proposed: From 0116f73894aa3c83e8abfb2698c6bf0887360337 Mon Sep 17 00:00:00 2001 From: Grzegorz Bokota Date: Sun, 22 Oct 2023 19:16:45 +0200 Subject: [PATCH 11/12] reformat text to reduce need of word wrap --- docs/naps/8-telemetry.md | 168 +++++++++++++++++++++++++++------------ 1 file changed, 119 insertions(+), 49 deletions(-) diff --git a/docs/naps/8-telemetry.md b/docs/naps/8-telemetry.md index d4970147e..5faf03d0b 100644 --- a/docs/naps/8-telemetry.md +++ b/docs/naps/8-telemetry.md @@ -16,22 +16,33 @@ ## Abstract - This NAP describes how telemetry would be used by the napari project and the architecture and solutions proposed to maximize the privacy of our users. +This NAP describes how telemetry would be used by the napari project and the architecture +and solutions proposed to maximize the privacy of our users. ## Motivation and Scope -With the growth of napari, the standard feedback loop through napari community meetings and napari-related events at conferences has reached its capacity. 
Also, we collect many feature requests for which we cannot find volunteers for implementation. +With the growth of napari, +the standard feedback loop through napari community meetings and napari-related events at conferences has reached its capacity. +Also, we collect many feature requests for which we cannot find volunteers for implementation. -To have the possibility of sustainable development of the project we will either need to rely on paid contractors or on companies donating employee time managed by the core devs. +To have the possibility of sustainable development of the project, +we will either need to rely on paid contractors or on companies donating employee time managed by the core devs. -Both scenarios require us to provide some information about the estimated number of users to prove to potential funders that their donation/grant will be used in a valuable way. +Both scenarios require us to provide some information about the estimated number of users to prove to potential +funders that their donation/grant will be used in a valuable way. -Adding the option for monitoring plugin usage allows us to identify heavily used plugins and try to establish cooperation with their maintainers to reduce the probability that the plugin will not be ready for a new napari release. Such monitoring could contain not only the list of installed plugins but also which commands and contributions are used most often. +Adding the option for monitoring plugin usage allows us to identify heavily used plugins and try +to establish cooperation with their maintainers +to reduce the probability that the plugin will not be ready for a new napari release. +Such monitoring could contain not only the list of installed plugins +but also which commands and contributions are used most often. Also collecting information about data types and their size will provide valuable information about the typical use cases of napari. 
-Still, users need to be able to opt out of such monitoring, and adjust the level of detail of the information that is sent to the napari server. -Each time when we update the collected data, we should inform users about the changes and provide them with the possibility to opt out of telemetry. +Still, users need to be able to opt out of such monitoring, +and adjust the level of detail of the information that is sent to the napari server. +Each time when we update the collected data, +we should inform users about the changes and provide them with the possibility to opt out of telemetry. Users could also provide a temporary agreement for sending telemetry. Then after a given period of time, the dialog with question will be shown again. @@ -39,45 +50,69 @@ Then after a given period of time, the dialog with question will be shown again. ## Detailed Description -`napari-telemetry` will be a package responsible for collecting and sending telemetry data to the napari server. It will be installed after user confirmation. It will contain callbacks for data collection, and utils for storage and sending. Also, this package will contain utils for validating if the user has agreed to telemetry. +`napari-telemetry` will be a package responsible for collecting and sending telemetry data to the napari server. +It will be installed after user confirmation. +It will contain callbacks for data collection, and utils for storage and sending. +Also, this package will contain utils for validating if the user has agreed to telemetry. -In the main package, there is a need to add code to ask users if they want to enable telemetry. This code should be executed only once per environment. +In the main package, there is a need to add code to ask users if they want to enable telemetry. +This code should be executed only once per environment. -Telemetry should contain following ways to disable it: +Telemetry should contain the following ways to disable it: 1. Disable in settings 2. 
Uninstall `napari-telemetry` package 3. Environment variable `NAPARI_TELEMETRY=0` -4. System-wide disablement e.g. via firewall filtering for hpc or other environments. +4. System-wide disablement, e.g., via firewall filtering for hpc or other environments. The user should be able to adjust the telemetry level of detail. The following levels are proposed: 1. `none` - no telemetry is collected -2. `basic` - information about the napari version, python version, OS, and CPU architecture is collected and if it is the first report by the user. There is also a user identifier created based on computer details that will be regenerated each week to prevent tracking the user, but allow us to accurately gauge individual user numbers. -3. `middle` - same as in `basic` but information about the list of installed plugins and their versions is also collected. We take care to not collect data about plugins that are not intended to be public, so we will only collect information about plugins searchable as using plugin dialog or napari hub. We also will not collect information about plugins that are installed in non stable version. -4. `full` - same as in `middle` but also collects information about plugin usage by binding to app-model and logging plugin commands used. Also basic information about data like type (`np.ndarray`, `dask.array`, `zarr.Array`, etc.) and its size is collected. - -There should be a visible indicator that telemetry is enabled (for example on the status bar). - -The second part of this work should be to setup the server to collect telemetry data. After collecting data, it should provide a basic public dashboard that will allow the community to see aggregated information. +2. `basic` - information about the napari version, python version, OS, and CPU architecture is collected, + and if it is the first report by the user. 
+ There is also a user identifier created based on computer details + that will be regenerated each week to prevent tracking the user, + but allow us to accurately gauge individual user numbers. +3. `middle` - same as in `basic` plus information about the list of installed plugins and their versions are also collected. + We take care to not collect data about plugins that are not intended to be public, + so we will only collect information about plugins searchable as using plugin dialog or napari hub. + We also will not collect information about plugins that are installed in a non-stable version. +4. `full` - same as in `middle` + plus collects information about plugin usage by binding to app-model and logging plugin commands used. + Additionally basic information about data like types + (`np.ndarray`, `dask.array`, `zarr.Array`, etc.) and its size will be collected. + +There should be a visible indicator that telemetry is enabled (for example, on the status bar). + +The second part of this work should be to set up the server to collect telemetry data. +After collecting data, +it should provide a basic public dashboard that will allow the community to see aggregated information. We propose the following data retention policy: 1) Up to 2 weeks for logs. -2) Up 2 months of raw data (1 month of collection, then aggregation and time to validate aggregated data). +2) Up to 2 months of raw data (1 month of collection, then aggregation and time to validate aggregated data). 3) Infinite of aggregated data. ## Privacy assessment -During the preparation of this NAP we assume that none of the collected data will be presented in -a form that allows to identify a single user or identify a research area of user. We also select a set of data that will be collected to minimize the possibility of revealing sensitive data, but it is impossible to guarantee that it will not be possible to identify a single user (for example by checking installed plugin combinations). 
+During the preparation of this NAP, we assume that none of the collected data will be presented in +a form that allows to identify a single user or identify a research area of user. +We also select a set of data that will be collected to minimize the possibility of revealing sensitive data. +However, it is impossible to guarantee that it will not be possible to identify a single user +(for example, by checking installed plugin combinations). -Because of this, we propose to not publish raw data and only show aggregated results. The aggregation will be performed using scripts. Napari core devs will access raw data only if there are errors in the aggregation process. +Because of this, we propose to not publish raw data and only show aggregated results. +The aggregation will be performed using scripts. +Napari core devs will access raw data only if there are errors in the aggregation process. -We also will publish a list of endpoints for each level of telemetry, so the given level of telemetry could be blocked on the organization level (for example by the rule on the firewall). +We also will publish a list of endpoints for each level of telemetry, +so the given level of telemetry could be blocked on the organization level +(for example, by the rule on the firewall). -If someone found that we are publishing some problematic data we will remove them and update the aggregation process to prevent such a situation in the future. +If someone found that we are publishing some problematic data, we will remove them +and update the aggregation process to prevent such a situation in the future. This NAP will be updated to reflect the current state of telemetry. @@ -95,9 +130,15 @@ https://github.com/grafana/grafana ## Implementation -The key consideration for implementation should be the low cost of maintenance. So the solution should be as simple as possible. We could either use existing solutions on the server side or implement our own. 
+The key consideration for implementation should be the low cost of maintenance. +So the solution should be as simple as possible. +We could either use existing solutions on the server side or implement our own. -The benefit of existing solutions is that most of the work is already done. The downside is that it may require additional cost of maintenance. This cost may be caused by many features that are not needed for napari and could increase the risk of leaking data. Quick checks of their code revealed they are implemented in techniques that are not familiar to napari core devs. So, if we decide to use them, we should select an SAS solution that will be maintained by the company. +The benefit of existing solutions is that most of the work is already done. +The downside is that it may require additional cost of maintenance. +This cost may be caused by many features that are not needed for napari and could increase the risk of leaking data. +Quick checks of their code revealed they are implemented in techniques that are not familiar to napari core devs. +So, if we decide to use them, we should select an SAS solution that will be maintained by the company. For now, I suggest creating a simple REST API server for collecting the data. @@ -108,15 +149,20 @@ Data for aggregation should be extracted from the database using a script runnin The output of the aggregation script should be loaded to some existing visualization tool, like grafana. -It may be nice to host raw and aggregate data on separate servers - then even if the data presented on the dashboard is compromised, the raw data will be not exposed to the world. +It may be nice to host raw and aggregate data on separate servers — +then even if the data presented on the dashboard is compromised, +the raw data will be not exposed to the world. Having both server and aggregation scripts in Python will reduce maintenance costs for napari core devs. 
-We should register the `telemetry.napari.org` domain and use it for the server. The main page will contain this NAP and a link to the summary dashboard. +We should register the `telemetry.napari.org` domain and use it for the server. +The main page will contain this NAP and a link to the summary dashboard. The main part of the application side should be implemented in `napari-telemetry` package. -The package should not report in stream mode, but collect data on the disk and send it in batches. This will reduce the risk of leaking data. The package should implement a utility to allow users to preview collected data before sending it to the server. +The package should not report in stream mode, but collect data on the disk and send it in batches. +This will reduce the risk of leaking data. +The package should implement a utility to allow users to preview collected data before sending it to the server. In napari itself, the following changes should be implemented: @@ -128,18 +174,25 @@ In napari itself, the following changes should be implemented: ## Potential problems -There is a risk that someone may try to highjack the telemetry module name to have code executed at every napari start. +There is a risk +that someone may try to high-jack the telemetry module name to have code executed at every napari start. -I do not expect that it is a high risk, but exists. We could address it by code signing. This will require additional procedures to protect private cryptographic keys. +I do not expect that it is a high risk, but exists. +We could address it by code signing. +This will require additional procedures to protect private cryptographic keys. -Another option is to scan public plugins and their dependencies. This is simpler but will require establishing additional communication channels to be able to warn users about the potential problem. +Another option is to scan public plugins and their dependencies. 
+This is simpler, +but will require establishing additional communication channels to be able to warn users about the potential problem. ## GDPR compliance -I'm almost sure that we will not collect data that are covered by GDPR. But to get better atmosphere -we need to add instruction how user could retrieve his unique identifier and setup a process -for requests to remove data from the server. It is not high probability of usage as life span of data is short, +I'm almost sure that we will not collect data that are covered by GDPR. +But to get a better atmosphere, +we need to add instruction how a user could retrieve their unique identifier and set up a process +for requests to remove data from the server. +It is not a high probability of usage as the life span of data is short, but we need to be prepared for such a situation. I suggest to use e-mail for that. @@ -147,26 +200,43 @@ but we need to be prepared for such a situation. I suggest to use e-mail for tha ## Backward Compatibility Not relevant - + ## Future Work -A nice extension may be the ability for the steering council to create a certificate of telemetry output that could be given to plugin maintainers to prove to supervisors that their plugin is used by the community. +A nice extension may be the ability for the steering council to create a certificate of telemetry output that could be +given to plugin maintainers to prove to supervisors that their plugin is used by the community. - ## Alternatives +## Alternatives - During the discussion, there is a proposal to use the same approach as used in ImageJ. +During the discussion, there is a proposal to use the same approach as used in ImageJ. -This would mean instead of implementing telemetry on the client side we could implement it on the update server side. The advantage and disadvantage of such a solution is that no user could opt out of telemetry. 
Also, such a method could potentially provide information about the Python version, napari version and list of installed plugins. All others will require a mechanism from this NAP. - - It will also require updates on the Napari side as currently we only communicate with the update server when a user opens the plugin manager. Also, to have proper information about installed plugins we will need to send information about the list of installed plugins instead of just downloading the information about all plugins from the server. - - As this solution provides less information, does not allow for opt-out and could cause blacklisting of the update server IP address, I do not recommend it. - - But based on talks that happen during the discussion we may think about more frequent checks for updates to inform users that they could update their Napari or plugin version. For such a change we need to update our update server to provide information per Python version (as some plugins could drop old Python earlier). - - -The second alternative is use a third-party solution like [plausible.io](https://plausible.io/). But from my perspective, it is harder to adjust a set of data that is collected as these services are designed to monitor webpages. +This would mean, instead of implementing telemetry on the client side, we could implement it on the update server side. +The advantage and disadvantage of such a solution is that no user could opt out of telemetry. +Also, such a method could potentially provide information about the Python version, +napari version, and list of installed plugins. +All others will require a mechanism from this NAP. + +It will also require updates on the Napari side +as currently we only communicate with the update server when a user opens the plugin manager. 
+Also, +to have proper information about installed plugins, we will need +to send information about the list of installed plugins +instead of just downloading the information about all plugins from the server. + +As this solution provides less information, +it does not allow for opt-out and could cause ban-listing of the update server IP address, +I do not recommend it. + +But based on talks +that happen during the discussion, we may think about more frequent checks for updates +to inform users that they could update their Napari or plugin version. +For such a change, we need to update our update server to provide information per Python version +(as some plugins could drop old Python earlier). + +The second alternative is use a third-party solution like [plausible.io](https://plausible.io/). +But from my perspective, +it is harder to adjust a set of data that is collected as these services are designed to monitor webpages. ## Discussion From 8469c166f2a00d2efed4f47fe96f1bed3289c1c8 Mon Sep 17 00:00:00 2001 From: Grzegorz Bokota Date: Mon, 23 Oct 2023 12:28:46 +0200 Subject: [PATCH 12/12] remove obsolete section --- docs/naps/8-telemetry.md | 14 -------------- 1 file changed, 14 deletions(-) diff --git a/docs/naps/8-telemetry.md b/docs/naps/8-telemetry.md index 5faf03d0b..36fbc5742 100644 --- a/docs/naps/8-telemetry.md +++ b/docs/naps/8-telemetry.md @@ -172,20 +172,6 @@ In napari itself, the following changes should be implemented: 4) code required to init `napari-telemetry` package -## Potential problems - -There is a risk -that someone may try to high-jack the telemetry module name to have code executed at every napari start. - -I do not expect that it is a high risk, but exists. -We could address it by code signing. -This will require additional procedures to protect private cryptographic keys. - -Another option is to scan public plugins and their dependencies. 
-This is simpler, -but will require establishing additional communication channels to be able to warn users about the potential problem. - - ## GDPR compliance I'm almost sure that we will not collect data that are covered by GDPR.
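
Note on the identifier mentioned under GDPR compliance: the `basic` level in this NAP describes a user identifier created from computer details and regenerated each week. The sketch below shows one way that could work, so a user could also print their current identifier for a removal request. It is only an illustration under stated assumptions: `weekly_user_id`, `local_machine_details`, and the choice of "computer details" are hypothetical and are not taken from napari or `napari-telemetry`.

```python
import hashlib
import platform
import uuid
from datetime import date


def weekly_user_id(machine_details: str, day: date) -> str:
    """Return an identifier that stays the same only within one ISO week."""
    iso_year, iso_week, _ = day.isocalendar()
    # Mixing the ISO year and week into the hashed input changes the
    # identifier every week, so reports from different weeks do not share it.
    payload = f"{machine_details}|{iso_year}-W{iso_week:02d}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def local_machine_details() -> str:
    """Hypothetical choice of 'computer details'; the NAP does not fix them."""
    return f"{uuid.getnode()}|{platform.system()}|{platform.machine()}"


if __name__ == "__main__":
    # What a user could run to look up the identifier for the current week.
    print(weekly_user_id(local_machine_details(), date.today()))
```

Hashing alone does not make such a value anonymous if the inputs are easy to guess, so the actual inputs (and possibly a locally stored random salt) would still need to be decided in the NAP.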