-
Notifications
You must be signed in to change notification settings - Fork 109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bunch of performance fixes and speedups #46
Conversation
"currently the gemspec has to have gsl added in order to test the GSL variants, but it is what it is. Maybe somebody else can fix that :)" I'm not sure there's any way around that. Anyway, I'll take a look and merge it asap. |
Could maybe add it as a development dependency for the devs, then for the test environment have it run twice in CI, once with the env NATIVE_VECTOR and once without. Much of this code is definitely long in the tooth, but I'm trying to scope down my changes instead of completely rewriting everything :) |
That's a good idea. I'll look at CI stuff this weekend. Yeah, we're trying to replace things only as needed, but we can discuss rewrites in the future. |
CachedContentNode.new(clean_word_hash, *categories) | ||
else | ||
ContentNode.new(clean_word_hash, *categories) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you think about a new_content_node()
method that abstracts this creation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is already a #node_for_content that serves up either a indexed ContentNode or creates a new one (used for searching/classification, its transient). Creating CachedContentNodes for transient operations is just overhead .. only items in the index need to be CachedContentNodes. I could abstract it, but since its not reused it doesn't seem worthwhile to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok!
One thing to note that I did not take the time to track down, was that with GSL and without |
Do we have a test (or tests!) for |
I tested locally that the changes I made resulted in the same output, but I have no tests other than that. |
@parkr what would it take to set up CI to test gsl and non gsl code? |
Do we need to put the travis modifications into this branch or can we open another issue for that? |
We can do that elsewhere. It may be a bit complicated. |
Ok, @parkr I'm going to call this good to merge. |
Bunch of performance fixes and speedups
classifier-reborn is designed to work with or without [GSL](https://www.gnu.org/software/gsl/) support. https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/docs/index.md?plain=1#L68 If GSL is installed, classifier-reborn will detect it and use it. If GSL is not installed, classifier-reborn will fall back to a pure-ruby implementation. The mechanism for doing so is in `lsi.rb`: https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/lib/classifier-reborn/lsi.rb#L7-L17 Notably, there's a comment there about how to test with/without GSL enabled. > to test the native vector class, try `rake test NATIVE_VECTOR=true` As far as I can tell, this was only ever used for local development/testing, and was never tested in CI (though it was previously discussed [here](jekyll#46 (comment))). I missed this in my last PR (jekyll#195) because I was focused on porting existing testing functionality from TravisCI to GitHub Actions. Now that this is working, I think it's important to expand our CI coverage to test with and without GSL in CI. So, in this PR, I'm doing so by setting `NATIVE_VECTOR` to true or false in our test matrix. While working on this, I noticed some tests in the LSI spec that return early when `$GSL` is not enabled. It would be better for those tests to report as skipped when GSL is not enabled (and this matches the pattern of the redis tests, that report as skipped if redis isn't available).
classifier-reborn is designed to work with or without [GSL](https://www.gnu.org/software/gsl/) support. https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/docs/index.md?plain=1#L68 If GSL is installed, classifier-reborn will detect it and use it. If GSL is not installed, classifier-reborn will fall back to a pure-ruby implementation. The mechanism for doing so is in `lsi.rb`: https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/lib/classifier-reborn/lsi.rb#L7-L17 Notably, there's a comment there about how to test with/without GSL enabled. > to test the native vector class, try `rake test NATIVE_VECTOR=true` As far as I can tell, this was only ever used for local development/testing, and was never tested in CI (though it was previously discussed [here](jekyll#46 (comment))). I missed this in my last PR (jekyll#195) because I was focused on porting existing testing functionality from TravisCI to GitHub Actions. Now that this is working, I think it's important to expand our CI coverage to test with and without GSL in CI. So, in this PR, I'm doing so by setting `NATIVE_VECTOR` to true or false in our test matrix. While working on this, I noticed some tests in the LSI spec that return early when `$GSL` is not enabled. It would be better for those tests to report as skipped when GSL is not enabled (and this matches the pattern of the redis tests, that report as skipped if redis isn't available).
classifier-reborn is designed to work with or without [GSL](https://www.gnu.org/software/gsl/) support. https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/docs/index.md?plain=1#L68 If GSL is installed, classifier-reborn will detect it and use it. If GSL is not installed, classifier-reborn will fall back to a pure-ruby implementation. The mechanism for doing so is in `lsi.rb`: https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/lib/classifier-reborn/lsi.rb#L7-L17 Notably, there's a comment there about how to test with/without GSL enabled. > to test the native vector class, try `rake test NATIVE_VECTOR=true` As far as I can tell, this was only ever used for local development/testing, and was never tested in CI (though adding GSL testing to CI was previously discussed [here](jekyll#46 (comment))). I did not include this in my last PR (jekyll#195) because I was focused on porting existing testing functionality from TravisCI to GitHub Actions. Now that GitHub Actions is working, I think it's important to expand our CI coverage to test with and without GSL in CI. So, in this PR, I'm doing so by setting `NATIVE_VECTOR` to true or false in our test matrix and installing the required `libgsl-dev` package in the Ubuntu test environment. While working on this, I noticed some tests in the LSI spec that return early when `$GSL` is not enabled. It would be better for those tests to report as skipped when GSL is not enabled (and this matches the pattern of the redis tests, that report as skipped if redis isn't available).
classifier-reborn is designed to work with or without [GSL](https://www.gnu.org/software/gsl/) support. https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/docs/index.md?plain=1#L68 If GSL is installed, classifier-reborn will detect it and use it. If GSL is not installed, classifier-reborn will fall back to a pure-ruby implementation. The mechanism for doing so is in `lsi.rb`: https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/lib/classifier-reborn/lsi.rb#L7-L17 Notably, there's a comment there about how to test with/without GSL enabled. > to test the native vector class, try `rake test NATIVE_VECTOR=true` As far as I can tell, this was only ever used for local development/testing, and was never tested in CI (though it was previously discussed [here](jekyll#46 (comment))). I did not include this in my last PR (jekyll#195) because I was focused on porting existing testing functionality from TravisCI to GitHub Actions. Now that GitHub Actions is working, I think it's important to expand our CI coverage to test with and without GSL in CI. So, in this PR, I'm doing so by setting `NATIVE_VECTOR` to true or false in our test matrix. Lucky for us, GitHub already [includes libgsl as pre-installed software](https://github.com/actions/virtual-environments/blob/main/images/linux/Ubuntu2004-Readme.md#installed-apt-packages), so we don't need to do anything special there. While working on this, I noticed some tests in the LSI spec that return early when `$GSL` is not enabled. It would be better for those tests to report as skipped when GSL is not enabled (and this matches the pattern of the redis tests, that report as skipped if redis isn't available).
classifier-reborn is designed to work with or without [GSL](https://www.gnu.org/software/gsl/) support. https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/docs/index.md?plain=1#L68 If GSL is installed, classifier-reborn will detect it and use it. If GSL is not installed, classifier-reborn will fall back to a pure-ruby implementation. The mechanism for doing so is in `lsi.rb`: https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/lib/classifier-reborn/lsi.rb#L7-L17 Notably, there's a comment there about how to test with/without GSL enabled. > to test the native vector class, try `rake test NATIVE_VECTOR=true` As far as I can tell, this was only ever used for local development/testing, and was never tested in CI (though it was previously discussed [here](jekyll#46 (comment))). I did not include this in my last PR (jekyll#195) because I was focused on porting existing testing functionality from TravisCI to GitHub Actions. Now that GitHub Actions is working, I think it's important to expand our CI coverage to test with and without GSL in CI. So, in this PR, I'm doing so by setting `NATIVE_VECTOR` to true or false in our test matrix. Lucky for us, GitHub already [includes libgsl as pre-installed software](https://github.com/actions/virtual-environments/blob/main/images/linux/Ubuntu2004-Readme.md#installed-apt-packages), so we don't need to do anything special there. While working on this, I noticed some tests in the LSI spec that return early when `$GSL` is not enabled. It would be better for those tests to report as skipped when GSL is not enabled (and this matches the pattern of the redis tests, that report as skipped if redis isn't available).
classifier-reborn is designed to work with or without [GSL](https://www.gnu.org/software/gsl/) support. https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/docs/index.md?plain=1#L68 If GSL is installed, classifier-reborn will detect it and use it. If GSL is not installed, classifier-reborn will fall back to a pure-ruby implementation. The mechanism for doing so is in `lsi.rb`: https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/lib/classifier-reborn/lsi.rb#L7-L17 I think it's important to test the classifier-reborn gem with GSL support in CI. One of my goals is to add similar support using `Numo`, and I'd like that to be tested in CI as well, and I want to make sure I don't do anything that could break existing GSL support. As far as I can tell, GSL support was never tested in CI before now (though it was previously discussed [here](jekyll#46 (comment))). I did find a comment about how to test with/without GSL enabled, but I think this was only used for locally. > to test the native vector class, try `rake test NATIVE_VECTOR=true` So, in this PR, I'm expanding our text matrix to test with and without GSL enabled by setting `NATIVE_VECTOR` to true or false. If `NATIVE_VECTOR` is false, we need to install the `gsl` gem (which is not included in our Gemfile since it's optional). Lucky for us, GitHub already [includes libgsl as pre-installed software](https://github.com/actions/virtual-environments/blob/main/images/linux/Ubuntu2004-Readme.md#installed-apt-packages), so we don't need to do anything special for the apt package. Currently, GSL only works with Ruby 2.7. (One of the main reasons I want to add support for Numo is because GSL is becoming difficult to support.) As such, I've excluded other versions of ruby in our test matrix for now. While working on this, I noticed some tests in the LSI spec that return early when `$GSL` is not enabled. It would be better for those tests to report as skipped when GSL is not enabled (and this matches the pattern of the redis tests, that report as skipped if redis isn't available).
classifier-reborn is designed to work with or without [GSL](https://www.gnu.org/software/gsl/) support. https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/docs/index.md?plain=1#L68 If GSL is installed, classifier-reborn will detect it and use it. If GSL is not installed, classifier-reborn will fall back to a pure-ruby implementation. The mechanism for doing so is in `lsi.rb`: https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/lib/classifier-reborn/lsi.rb#L7-L17 I think it's important to test the classifier-reborn gem with GSL support in CI. One of my goals is to add similar support using `Numo`, and I'd like that to be tested in CI as well, and I want to make sure I don't do anything that could break existing GSL support. As far as I can tell, GSL support was never tested in CI before now (though it was previously discussed [here](jekyll#46 (comment))). I did find a comment about how to test with/without GSL enabled, but I think this was only used for locally. > to test the native vector class, try `rake test NATIVE_VECTOR=true` So, in this PR, I'm expanding our text matrix to test with and without GSL enabled by setting `NATIVE_VECTOR` to true or false. If `NATIVE_VECTOR` is false, we need to install the `gsl` gem (which is not included in our Gemfile since it's optional). Lucky for us, GitHub already [includes libgsl as pre-installed software](https://github.com/actions/virtual-environments/blob/main/images/linux/Ubuntu2004-Readme.md#installed-apt-packages), so we don't need to do anything special for the apt package. Currently, GSL only works with Ruby 2.7. (One of the main reasons I want to add support for Numo is because GSL is becoming difficult to support.) As such, I've excluded other versions of ruby in our test matrix for now. While working on this, I noticed some tests in the LSI spec that return early when `$GSL` is not enabled. It would be better for those tests to report as skipped when GSL is not enabled (and this matches the pattern of the redis tests, that report as skipped if redis isn't available).
classifier-reborn is designed to work with or without [GSL](https://www.gnu.org/software/gsl/) support. https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/docs/index.md?plain=1#L68 If GSL is installed, classifier-reborn will detect it and use it. If GSL is not installed, classifier-reborn will fall back to a pure-ruby implementation. The mechanism for doing so is in `lsi.rb`: https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/lib/classifier-reborn/lsi.rb#L7-L17 I think it's important to test the classifier-reborn gem with GSL support in CI. One of my goals is to add similar support using Numo, and I'd like that to be tested in CI as well. Also, I want to make sure I don't do anything along the way that could break existing GSL support. As far as I can tell, GSL support was never tested in CI before now (though it was previously discussed [here](jekyll#46 (comment))). In this PR, I'm expanding our text matrix to test with and without GSL enabled. When the matrix has GSL enabled, we install the `gsl` gem (which is not included in our Gemfile since it's optional). Lucky for us, GitHub already [includes libgsl as pre-installed software](https://github.com/actions/virtual-environments/blob/main/images/linux/Ubuntu2004-Readme.md#installed-apt-packages), so we don't need to do anything special for the apt package. Our gem already has everything needed to build & install. When `matrix.gsl` is false, we won't install the gem and tests will run the native Ruby implementation. When `matrix.gsl` is true, we'll install the gem tests will run the GSL implementation. Currently, GSL only works with Ruby 2.7. (One of the main reasons I want to add support for Numo is because GSL is becoming difficult to support.) As such, I've excluded other versions of ruby in our test matrix for now. They'll still be tested with GSL disabled, but not with it enabled. While working on this, I noticed some tests in the LSI spec that return early when `$GSL` is not enabled. It would be better for those tests to report as skipped when GSL is not enabled (and this matches the pattern of the redis tests, that report as skipped if redis isn't available).
Lot of little things here that add up but I'll try to summarize:
I'm not exactly sure how much faster this is, much of it depends on the size/nature of your corpus and the classifications you are making but for me at around 150 documents and thousands of classifications, its somewhere around 600%.
Its a bit awkward to test things .. currently the gemspec has to have gsl added in order to test the GSL variants, but it is what it is. Maybe somebody else can fix that :)