Bunch of performance fixes and speedups #46

Merged: 1 commit, Sep 15, 2015


kreynolds
Contributor

Lot of little things here that add up but I'll try to summarize:

  • Reduce some unnecessary methods
  • Add optional CachedContentNode (GSL only)
    • Caches the transposed search_vector
    • Has custom marshal_ methods to not save the cache when dumping/loading
    • Transparently compatible with pure ruby
  • Optimized some numeric comparisons and iterators
  • Added cached calculation table when computing raw_vectors
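
To make the CachedContentNode bullets concrete, here is a minimal pure-Ruby sketch of the idea. The class names `SketchContentNode` and `SketchCachedContentNode` are hypothetical stand-ins, plain arrays replace GSL vectors, and the bodies are assumptions rather than the code in this PR. It only illustrates caching a transposed vector and leaving that cache out of Marshal dumps.

```ruby
# Hypothetical stand-in for an indexed document node (not the real ContentNode).
class SketchContentNode
  attr_reader :word_hash, :categories
  attr_accessor :search_vector

  def initialize(word_hash, *categories)
    @word_hash = word_hash
    @categories = categories
  end

  # Rebuilt on every call; this is the repeated work a cache would avoid.
  def transposed_search_vector
    search_vector.map { |value| [value] }
  end
end

# Hypothetical cached variant: same interface, memoized transpose.
class SketchCachedContentNode < SketchContentNode
  def search_vector=(vector)
    @transposed_search_vector = nil # drop the stale cache
    super
  end

  def transposed_search_vector
    @transposed_search_vector ||= super
  end

  # Persist only the real state so the cache never reaches disk.
  def marshal_dump
    [@word_hash, @categories, @search_vector]
  end

  def marshal_load(state)
    @word_hash, @categories, @search_vector = state
  end
end

node = SketchCachedContentNode.new({ dog: 1 }, :pets)
node.search_vector = [1.0, 2.0]
node.transposed_search_vector            # computed once, then reused
copy = Marshal.load(Marshal.dump(node))  # cache is rebuilt lazily on the copy
```

Because the cached class answers the same methods as the plain one, callers should not need to know which kind of node they hold.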

I'm not exactly sure how much faster this is; much of it depends on the size and nature of your corpus and the classifications you are making, but for me, at around 150 documents and thousands of classifications, it's somewhere around 600%.

It's a bit awkward to test things: currently the gemspec has to have gsl added in order to test the GSL variants, but it is what it is. Maybe somebody else can fix that :)

@Ch4s3
Member

Ch4s3 commented Sep 11, 2015

"currently the gemspec has to have gsl added in order to test the GSL variants, but it is what it is. Maybe somebody else can fix that :)"

I'm not sure there's any way around that. Anyway, I'll take a look and merge it asap.

@kreynolds
Contributor Author

Could maybe add it as a development dependency for the devs, then for the test environment have it run twice in CI, once with the env NATIVE_VECTOR and once without. Much of this code is definitely long in the tooth, but I'm trying to scope down my changes instead of completely rewriting everything :)
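
As a rough illustration of how one test run could cover both variants, the sketch below picks a backend from the `NATIVE_VECTOR` environment variable. The helper name `choose_vector_backend` and the returned symbols are hypothetical, and the gem's own loading code may differ. The sketch only assumes that `NATIVE_VECTOR=true` should force the pure Ruby path and that a missing `gsl` gem raises `LoadError`.

```ruby
# Hypothetical helper: decides which vector backend a test run would exercise.
def choose_vector_backend(env = ENV)
  # NATIVE_VECTOR=true forces pure Ruby even when the gsl gem is installed.
  raise LoadError, 'native vector forced' if env['NATIVE_VECTOR'] == 'true'

  require 'gsl' # raises LoadError when the optional gem is absent
  :gsl
rescue LoadError
  :native
end

puts choose_vector_backend('NATIVE_VECTOR' => 'true') # always the native path
puts choose_vector_backend({})                        # depends on whether gsl is installed
```

A CI setup could then run the suite once with the variable set and once without it, which is the double run suggested above.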

@Ch4s3
Member

Ch4s3 commented Sep 11, 2015

That's a good idea. I'll look at CI stuff this weekend. Yeah, we're trying to replace things only as needed, but we can discuss rewrites in the future.

  CachedContentNode.new(clean_word_hash, *categories)
else
  ContentNode.new(clean_word_hash, *categories)
end
Member


what do you think about a new_content_node() method that abstracts this creation?

Contributor Author


There is already a #node_for_content that serves up either an indexed ContentNode or creates a new one (used for searching/classification, so it's transient). Creating CachedContentNodes for transient operations is just overhead; only items in the index need to be CachedContentNodes. I could abstract it, but since it's not reused it doesn't seem worthwhile to me.

Member


Ok!
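
The distinction drawn in this review thread can be sketched as follows. Everything here is hypothetical (`SketchIndex`, `SketchNode`, `SketchCachedNode`, and the `cache_node_vectors` flag are illustration names, not the gem's API). The point is only that indexed items get the cached node type while lookups for unseen text get a cheap throwaway node.

```ruby
# Hypothetical miniature index; the names echo the thread but the bodies are assumptions.
SketchNode = Struct.new(:word_hash, :categories)
class SketchCachedNode < SketchNode; end

class SketchIndex
  def initialize(cache_node_vectors: true)
    @cache_node_vectors = cache_node_vectors
    @items = {}
  end

  # Indexed items live long enough for a cache to pay off.
  def add_item(text, *categories)
    klass = @cache_node_vectors ? SketchCachedNode : SketchNode
    @items[text] = klass.new(word_hash(text), categories)
  end

  # Returns the indexed node if there is one, otherwise a transient plain node.
  def node_for_content(text)
    @items.fetch(text) { SketchNode.new(word_hash(text), []) }
  end

  private

  def word_hash(text)
    text.downcase.split.each_with_object(Hash.new(0)) { |word, counts| counts[word] += 1 }
  end
end

index = SketchIndex.new
index.add_item('dog cat', :pets)
index.node_for_content('dog cat').class # the cached type, because it is indexed
index.node_for_content('bird').class    # the plain type, built for this call only
```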

@kreynolds
Contributor Author

One thing to note, which I did not take the time to track down: with GSL and without CachedContentNode, repeated classifications got progressively slower, so it seems as if search_vector.col has some sort of leak or hidden cost to it. I didn't track it down because the CachedContentNode made it irrelevant, but it's worth noting. On a test 140-document corpus, after 400 classifications of 1-5 word phrases, it was classifying at about 3/s; with the CachedContentNode, it held steady at 120/s without any slowdown.

@parkr
Member

parkr commented Sep 13, 2015

Do we have a test (or tests!) for raw_vector_with? Changes look pretty good to me, but it's always nice to have a computer back up your assumptions.

@kreynolds
Contributor Author

I tested locally that the changes I made resulted in the same output, but I have no tests other than that.
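
A check along the following lines is one way a computer could back up the "same output" claim. The PR does not show the calculation table itself, so this sketch assumes an entropy-style term, `p * log(p)`, computed per word count. The method names are hypothetical and are not `raw_vector_with`.

```ruby
# Assumed per-count term; the real calculation in raw_vector_with may differ.
def entropy_terms_uncached(counts)
  total = counts.sum.to_f
  counts.map { |count| count.zero? ? 0.0 : (count / total) * Math.log(count / total) }
end

# Same computation, but each distinct count is evaluated only once per call.
def entropy_terms_cached(counts)
  total = counts.sum.to_f
  table = {} # valid for this total only, so it is rebuilt on every call
  counts.map do |count|
    table[count] ||= (count.zero? ? 0.0 : (count / total) * Math.log(count / total))
  end
end

counts = [3, 1, 3, 0, 1, 2]
raise 'cached and uncached results differ' unless
  entropy_terms_cached(counts) == entropy_terms_uncached(counts)
```

Exact equality is reasonable here because both paths run the identical floating point operations; a real test against the gem would compare vectors produced before and after the change.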

@Ch4s3
Member

Ch4s3 commented Sep 14, 2015

@parkr what would it take to set up CI to test gsl and non gsl code?

@kreynolds
Contributor Author

Do we need to put the travis modifications into this branch or can we open another issue for that?

@parkr
Member

parkr commented Sep 15, 2015

We can do that elsewhere. It may be a bit complicated.

@Ch4s3
Member

Ch4s3 commented Sep 15, 2015

Ok, @parkr I'm going to call this good to merge.

Ch4s3 added a commit that referenced this pull request Sep 15, 2015
Bunch of performance fixes and speedups
@Ch4s3 Ch4s3 merged commit 55f3178 into jekyll:master Sep 15, 2015
mkasberg added a commit to mkasberg/classifier-reborn that referenced this pull request May 10, 2022
classifier-reborn is designed to work with or without
[GSL](https://www.gnu.org/software/gsl/) support.

https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/docs/index.md?plain=1#L68

If GSL is installed, classifier-reborn will detect it and use it. If GSL
is not installed, classifier-reborn will fall back to a pure-ruby
implementation. The mechanism for doing so is in `lsi.rb`:

https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/lib/classifier-reborn/lsi.rb#L7-L17

Notably, there's a comment there about how to test with/without GSL
enabled.

> to test the native vector class, try `rake test NATIVE_VECTOR=true`

As far as I can tell, this was only ever used for local
development/testing, and was never tested in CI (though it was
previously discussed
[here](jekyll#46 (comment))).
I missed this in my last PR (jekyll#195) because I was focused on porting
existing testing functionality from TravisCI to GitHub Actions. Now that
this is working, I think it's important to expand our CI coverage to
test with and without GSL in CI. So, in this PR, I'm doing so by setting
`NATIVE_VECTOR` to true or false in our test matrix.

While working on this, I noticed some tests in the LSI spec that return
early when `$GSL` is not enabled. It would be better for those tests to
report as skipped when GSL is not enabled (and this matches the pattern
of the redis tests, that report as skipped if redis isn't available).
mkasberg added a commit to mkasberg/classifier-reborn that referenced this pull request May 10, 2022
classifier-reborn is designed to work with or without
[GSL](https://www.gnu.org/software/gsl/) support.

https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/docs/index.md?plain=1#L68

If GSL is installed, classifier-reborn will detect it and use it. If GSL
is not installed, classifier-reborn will fall back to a pure-ruby
implementation. The mechanism for doing so is in `lsi.rb`:

https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/lib/classifier-reborn/lsi.rb#L7-L17

Notably, there's a comment there about how to test with/without GSL
enabled.

> to test the native vector class, try `rake test NATIVE_VECTOR=true`

As far as I can tell, this was only ever used for local
development/testing, and was never tested in CI (though adding GSL
testing to CI was previously discussed
[here](jekyll#46 (comment))).
I did not include this in my last PR (jekyll#195) because I was focused on
porting existing testing functionality from TravisCI to GitHub Actions.
Now that GitHub Actions is working, I think it's important to expand our
CI coverage to test with and without GSL in CI. So, in this PR, I'm
doing so by setting `NATIVE_VECTOR` to true or false in our test matrix
and installing the required `libgsl-dev` package in the Ubuntu test
environment.

While working on this, I noticed some tests in the LSI spec that return
early when `$GSL` is not enabled. It would be better for those tests to
report as skipped when GSL is not enabled (and this matches the pattern
of the redis tests, that report as skipped if redis isn't available).
mkasberg added a commit to mkasberg/classifier-reborn that referenced this pull request May 10, 2022
classifier-reborn is designed to work with or without
[GSL](https://www.gnu.org/software/gsl/) support.

https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/docs/index.md?plain=1#L68

If GSL is installed, classifier-reborn will detect it and use it. If GSL
is not installed, classifier-reborn will fall back to a pure-ruby
implementation. The mechanism for doing so is in `lsi.rb`:

https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/lib/classifier-reborn/lsi.rb#L7-L17

Notably, there's a comment there about how to test with/without GSL
enabled.

> to test the native vector class, try `rake test NATIVE_VECTOR=true`

As far as I can tell, this was only ever used for local
development/testing, and was never tested in CI (though it was
previously discussed
[here](jekyll#46 (comment))).
I did not include this in my last PR (jekyll#195) because I was focused on
porting existing testing functionality from TravisCI to GitHub Actions.
Now that GitHub Actions is working, I think it's important to expand our
CI coverage to test with and without GSL in CI. So, in this PR, I'm
doing so by setting `NATIVE_VECTOR` to true or false in our test matrix.
Lucky for us, GitHub already [includes libgsl as pre-installed
software](https://github.com/actions/virtual-environments/blob/main/images/linux/Ubuntu2004-Readme.md#installed-apt-packages),
so we don't need to do anything special there.

While working on this, I noticed some tests in the LSI spec that return
early when `$GSL` is not enabled. It would be better for those tests to
report as skipped when GSL is not enabled (and this matches the pattern
of the redis tests, that report as skipped if redis isn't available).
mkasberg added a commit to mkasberg/classifier-reborn that referenced this pull request May 10, 2022
classifier-reborn is designed to work with or without
[GSL](https://www.gnu.org/software/gsl/) support.

https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/docs/index.md?plain=1#L68

If GSL is installed, classifier-reborn will detect it and use it. If GSL
is not installed, classifier-reborn will fall back to a pure-ruby
implementation. The mechanism for doing so is in `lsi.rb`:

https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/lib/classifier-reborn/lsi.rb#L7-L17

I think it's important to test the classifier-reborn gem with GSL
support in CI. One of my goals is to add similar support using `Numo`,
and I'd like that to be tested in CI as well, and I want to make sure I
don't do anything that could break existing GSL support. As far as I can
tell, GSL support was never tested in CI before now (though it was
previously discussed
[here](jekyll#46 (comment))).

I did find a comment about how to test with/without GSL enabled, but I
think this was only used locally.

> to test the native vector class, try `rake test NATIVE_VECTOR=true`

So, in this PR, I'm expanding our test matrix to test with and without
GSL enabled by setting `NATIVE_VECTOR` to true or false. If
`NATIVE_VECTOR` is false, we need to install the `gsl` gem (which is not
included in our Gemfile since it's optional). Lucky for us, GitHub
already [includes libgsl as pre-installed
software](https://github.com/actions/virtual-environments/blob/main/images/linux/Ubuntu2004-Readme.md#installed-apt-packages),
so we don't need to do anything special for the apt package.

Currently, GSL only works with Ruby 2.7. (One of the main reasons I want
to add support for Numo is because GSL is becoming difficult to
support.) As such, I've excluded other versions of ruby in our test
matrix for now.

While working on this, I noticed some tests in the LSI spec that return
early when `$GSL` is not enabled. It would be better for those tests to
report as skipped when GSL is not enabled (and this matches the pattern
of the redis tests, that report as skipped if redis isn't available).
mkasberg added a commit to mkasberg/classifier-reborn that referenced this pull request May 11, 2022
classifier-reborn is designed to work with or without
[GSL](https://www.gnu.org/software/gsl/) support.

https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/docs/index.md?plain=1#L68

If GSL is installed, classifier-reborn will detect it and use it. If GSL
is not installed, classifier-reborn will fall back to a pure-ruby
implementation. The mechanism for doing so is in `lsi.rb`:

https://github.com/jekyll/classifier-reborn/blob/99d13af5adf040ba40a6fe77dbe0b28756562fcc/lib/classifier-reborn/lsi.rb#L7-L17

I think it's important to test the classifier-reborn gem with GSL
support in CI. One of my goals is to add similar support using Numo,
and I'd like that to be tested in CI as well. Also, I want to make sure I
don't do anything along the way that could break existing GSL support.
As far as I can tell, GSL support was never tested in CI before now
(though it was previously discussed
[here](jekyll#46 (comment))).

In this PR, I'm expanding our test matrix to test with and without GSL
enabled. When the matrix has GSL enabled, we install the `gsl` gem
(which is not included in our Gemfile since it's optional). Lucky for
us, GitHub already [includes libgsl as pre-installed
software](https://github.com/actions/virtual-environments/blob/main/images/linux/Ubuntu2004-Readme.md#installed-apt-packages),
so we don't need to do anything special for the apt package. Our gem
already has everything needed to build & install. When `matrix.gsl` is
false, we won't install the gem and tests will run the native Ruby
implementation. When `matrix.gsl` is true, we'll install the gem and
tests will run the GSL implementation.

Currently, GSL only works with Ruby 2.7. (One of the main reasons I want
to add support for Numo is because GSL is becoming difficult to
support.) As such, I've excluded other versions of ruby in our test
matrix for now. They'll still be tested with GSL disabled, but not with
it enabled.

While working on this, I noticed some tests in the LSI spec that return
early when `$GSL` is not enabled. It would be better for those tests to
report as skipped when GSL is not enabled (and this matches the pattern
of the redis tests, that report as skipped if redis isn't available).
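
The skip-instead-of-return pattern mentioned in the last paragraph might look like the sketch below. The test class and method names are hypothetical and the assertion is a placeholder; the only things taken from the text are the `$GSL` global and the idea of reporting a skip. Minitest's `skip` is used so the run shows the test as skipped rather than silently passing.

```ruby
require 'minitest'

# Hypothetical test case; the real LSI tests are not reproduced here.
class LsiSkipSketchTest < Minitest::Test
  def test_gsl_only_behaviour
    # An early `return unless $GSL` would count as a pass; skip makes the gap visible.
    skip 'GSL is not enabled' unless $GSL

    assert defined?(GSL), 'expected the GSL constant when $GSL is set'
  end
end

# Run the single test by hand and report whether it was skipped.
result = LsiSkipSketchTest.new(:test_gsl_only_behaviour).run
puts "skipped: #{result.skipped?}"
```

With `require 'minitest/autorun'` in place of the manual run, the same test would show up in the usual summary line as a skip.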