Skip to content

Messing around with clustering error/warnings from spack monitor servers

License

Apache-2.0, MIT licenses found

Licenses found

Apache-2.0
LICENSE-APACHE
MIT
LICENSE-MIT
Notifications You must be signed in to change notification settings

buildsi/spack-monitor-nlp

Repository files navigation

Spack Monitor NLP

docs/clustering.png

This is an effort to see if we can do some kind of clustering using the warning and error messages in the server. The goal will be to:

  1. retrieve current warnings and errors via the spack monitor API
  2. build a word2vec model using them.
  3. output embeddings for each
  4. cluster!

⭐️ View Interface ⭐️

You'll see that the best clustering comes from just using the error or warning messages, and that most of the clusters are boost errors. Could it be that a direct match (e.g., parsing libraries in advance to identify text of errors and using that) is better? Perhaps! And in fact we could do some kind of KNN based on that too. This is more of an unsupervised clustering (we don't have labels).

Usage

$ python -m venv env
$ source env/bin/activate
$ pip install -r requirements.txt

Install umap from conda:

$ conda install -c conda-forge umap-learn
$ pip install umap-learn

Then download data from spack monitor

$ python 1.get_data.py

This will generate a file of errors and warnings!

$ tree data/
data/
├── errors.json
└── warnings.json

We next want to preprocess the data and generate models / vectors!

$ python 2.vectors.py

We are currently only parsing errors, as it's a smaller set and we are more interested in build errors than warnings that clutter the signal. For the "error only" (or parsed) approach we look for strings that have error: and split and take the right side of that. For all other processing methods, we remove paths (e.g., tokenize then remove anything with an os.sep or path separator).

Then generate counts of data (to be put into docs if we want to eventually visualize):

$ python 3.charts.py
Found 30000 errors!
1832 out of 30000 mention 'undefined reference'

Some data will be generated in data, and assets for the web interface will go into docs. The interface allows you to select and see the difference between the models, and clearly just using the error messages (parsed or not) has the strongest signal (best clustering).

And finally, generate a quick plot to show that, if we did KNN for each error, the mean similarity of the closests 10 points (standard deviation not shown, but is calculated if we need):

data/means.png

License

Spack is distributed under the terms of both the MIT license and the Apache License (Version 2.0). Users may choose either license, at their option.

All new contributions must be made under both the MIT and Apache-2.0 licenses.

See LICENSE-MIT, LICENSE-APACHE, COPYRIGHT, and NOTICE for details.

SPDX-License-Identifier: (Apache-2.0 OR MIT)

LLNL-CODE-811652

About

Messing around with clustering error/warnings from spack monitor servers

Resources

License

Apache-2.0, MIT licenses found

Licenses found

Apache-2.0
LICENSE-APACHE
MIT
LICENSE-MIT

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages