Skip to content

Commit

Permalink
Support Numo Gem for performing SVD
Browse files Browse the repository at this point in the history
**Background:**
The slow step of LSI is computing the SVD (singular value decomposition)
of a matrix. Even with a relatively small collection of documents (say,
about 20 blog posts), the native ruby implementation is too slow to be
usable (taking hours to complete).

To work around this problem, classifier-reborn allows you to optionally
use the `gsl` gem to make use of the [Gnu Scientific
Library](https://www.gnu.org/software/gsl/) when performing matrix
calculations. Computations with this gem perform orders of magnitude
faster than the ruby-only matrix implementation, and they're fast enough
that using LSI with Jekyll finishes in a reasonable amount of time
(seconds).

Unfortunately, [rb-gsl](https://github.com/SciRuby/rb-gsl) is
unmaintained -- there's a commit on main that makes it compatible with
Ruby 3, but nobody has released the gem so the only way to use rb-gsl
with Ruby 3 right now is to specify the git hash in your Gemfile. See
SciRuby/rb-gsl#67. This will be increasingly
problematic because Ruby 2.7 is now in [security
maintenance](https://www.ruby-lang.org/en/news/2022/04/12/ruby-2-7-6-released/)
and will become end of life in less than a year.

Notably, `rb-gsl` depends on the
[narray](https://github.com/masa16/narray#new-version-is-under-development---rubynumonarray)
gem. `narray` is deprecated, and the readme suggests using
`Numo::NArray` instead.

**Changes:**
In this PR, my goal is to provide an alternative matrix implementation
that can perform singular value decomposition quickly and works with
Ruby 3. Doing so will make classifier-reborn compatible with Ruby 3
without depending on the unmaintained/unreleased gsl gem. There aren't
many gems that provide fast matrix support for ruby, but
[Numo](https://github.com/ruby-numo) seems to be more actively
maintained than rb-gsl, and Numo has a working Ruby 3 implementation
that can perform a singular value decomposition, which is exactly what
we need. This requires
[numo-narray](https://github.com/ruby-numo/numo-narray) and
[numo-linalg](https://github.com/ruby-numo/numo-linalg).

My goal is to allow users to (optionally) use classifier-reborn with
Numo/Lapack the same way they'd use it with GSL. That is, the user
should install the `numo-narray` and `numo-linalg` gems (with their
required C libraries), and classifier-reborn will detect and use these
if they are found.
  • Loading branch information
mkasberg committed Jun 4, 2022
1 parent fb5da8e commit aa3ea2c
Show file tree
Hide file tree
Showing 9 changed files with 116 additions and 34 deletions.
18 changes: 12 additions & 6 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -14,17 +14,17 @@ on:

jobs:
ci:
name: "Run Tests (Ruby ${{ matrix.ruby_version }}, GSL: ${{ matrix.gsl }})"
name: "Run Tests (Ruby ${{ matrix.ruby_version }}, Linalg: ${{ matrix.linalg_gem }})"
runs-on: "ubuntu-latest"
env:
# See https://github.com/marketplace/actions/setup-ruby-jruby-and-truffleruby#matrix-of-gemfiles
BUNDLE_GEMFILE: ${{ matrix.gemfile }}
LOAD_GSL: ${{ matrix.gsl }}
LINALG_GEM: ${{ matrix.linalg_gem }}
strategy:
fail-fast: false
matrix:
ruby_version: ["2.7", "3.0", "3.1", "jruby-9.3.4.0"]
gsl: [true, false]
linalg_gem: ["none", "gsl", "numo"]
# We use `include` to assign the correct Gemfile for each ruby_version
include:
- ruby_version: "2.7"
Expand All @@ -39,17 +39,23 @@ jobs:
# Ruby 3.0 does not work with the latest released gsl gem
# https://github.com/SciRuby/rb-gsl/issues/67
- ruby_version: "3.0"
gsl: true
linalg_gem: "gsl"
# Ruby 3.1 does not work with the latest released gsl gem
# https://github.com/SciRuby/rb-gsl/issues/67
- ruby_version: "3.1"
gsl: true
linalg_gem: "gsl"
# jruby-9.3.4.0 doesn't easily build the gsl gem on a GitHub worker. Skipping for now.
- ruby_version: "jruby-9.3.4.0"
gsl: true
linalg_gem: "gsl"
# jruby-9.3.4.0 doesn't easily build the numo gems on a GitHub worker. Skipping for now.
- ruby_version: "jruby-9.3.4.0"
linalg_gem: "numo"
steps:
- name: Checkout Repository
uses: actions/checkout@v3
- name: Install Lapack
if: ${{ matrix.linalg_gem == 'numo' }}
run: sudo apt-get install -y liblapacke-dev libopenblas-dev
- name: "Set up ${{ matrix.label }}"
uses: ruby/setup-ruby@v1
with:
Expand Down
2 changes: 1 addition & 1 deletion .rubocop.yml
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
inherit_from: .rubocop_todo.yml

Style/GlobalVars:
AllowedVariables: [$GSL]
AllowedVariables: [$SVD]

Naming/MethodName:
Exclude:
Expand Down
7 changes: 6 additions & 1 deletion Gemfile
Original file line number Diff line number Diff line change
Expand Up @@ -4,4 +4,9 @@ source 'https://rubygems.org'
gemspec name: 'classifier-reborn'

# For testing with GSL support & bundle exec
gem 'gsl' if ENV['LOAD_GSL'] == 'true'
gem 'gsl' if ENV['LINALG_GEM'] == 'gsl'

if ENV['LINALG_GEM'] == 'numo'
gem 'numo-narray'
gem 'numo-linalg'
end
30 changes: 26 additions & 4 deletions docs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -60,12 +60,34 @@ The only runtime dependency of this gem is Roman Shterenzon's `fast-stemmer` gem
gem install fast-stemmer
```

To speed up `LSI` classification by at least 10x consider installing following libraries.
In addition, it is **recommended** to install either Numo or GSL to speed up LSI classification by at least 10x.

* [GSL - GNU Scientific Library](http://www.gnu.org/software/gsl)
* [Ruby/GSL Gem](https://rubygems.org/gems/gsl)
Note that LSI will work without these libraries, but as soon as they are installed, classifier will make use of them. No configuration changes are needed, we like to keep things ridiculously easy for you.

### Install Numo Gems

[Numo](https://ruby-numo.github.io/narray/) is a set of Numerical Module gems for Ruby that provide a Ruby interface to [LAPACK](http://www.netlib.org/lapack/). If classifier detects that the required Numo gems are installed, it will make use of them to perform LSI faster.

* Install [LAPACKE](https://www.netlib.org/lapack/lapacke.html)
* Ubuntu: `apt-get install liblapacke-dev`
* macOS: (Help wanted to verify installation steps) https://stackoverflow.com/questions/38114201/installing-lapack-and-blas-libraries-for-c-on-mac-os
* Install [OpenBLAS](https://www.openblas.net/)
* Ubuntu: `apt-get install libopenblas-dev`
* macOS: (Help wanted to verify installation steps) https://stackoverflow.com/questions/38114201/installing-lapack-and-blas-libraries-for-c-on-mac-os
* Install the [Numo::NArray](https://ruby-numo.github.io/narray/) and [Numo::Linalg](https://ruby-numo.github.io/linalg/) gems
* `gem install numo-narray numo-linalg`

### Install GSL Gem

**Note:** The `gsl` gem is currently incompatible with Ruby 3. It is recommended to use Numo instead with Ruby 3.

The [GNU Scientific Library (GSL)](http://www.gnu.org/software/gsl) is an alternative to Numo/LAPACK that can be used to improve LSI performance. (You should install one or the other, but both are not required.)

* Install the [GNU Scientific Library](http://www.gnu.org/software/gsl)
* Ubuntu: `apt-get install libgsl-dev`
* Install the [Ruby/GSL Gem](https://rubygems.org/gems/gsl) (or add it to your Gemfile)
* `gem install gsl`

Note that `LSI` will work without these libraries, but as soon as they are installed, classifier will make use of them. No configuration changes are needed, we like to keep things ridiculously easy for you.

## Further Readings

Expand Down
60 changes: 49 additions & 11 deletions lib/classifier-reborn/lsi.rb
Original file line number Diff line number Diff line change
Expand Up @@ -4,16 +4,28 @@
# Copyright:: Copyright (c) 2005 David Fayram II
# License:: LGPL

# Try to load Numo first - it's the most current and the most well-supported.
# Fall back to GSL.
# Fall back to native vector.
begin
raise LoadError if ENV['NATIVE_VECTOR'] == 'true' # to test the native vector class, try `rake test NATIVE_VECTOR=true`
raise LoadError if ENV['GSL'] == 'true' # to test with gsl, try `rake test GSL=true`

require 'gsl' # requires https://github.com/SciRuby/rb-gsl
require_relative 'extensions/vector_serialize'
$GSL = true
require 'numo/narray' # https://ruby-numo.github.io/narray/
require 'numo/linalg' # https://ruby-numo.github.io/linalg/
$SVD = :numo
rescue LoadError
$GSL = false
require_relative 'extensions/vector'
require_relative 'extensions/zero_vector'
begin
raise LoadError if ENV['NATIVE_VECTOR'] == 'true' # to test the native vector class, try `rake test NATIVE_VECTOR=true`

require 'gsl' # requires https://github.com/SciRuby/rb-gsl
require_relative 'extensions/vector_serialize'
$SVD = :gsl
rescue LoadError
$SVD = :ruby
require_relative 'extensions/vector'
require_relative 'extensions/zero_vector'
end
end

require_relative 'lsi/word_list'
Expand Down Expand Up @@ -140,7 +152,15 @@ def build_index(cutoff = 0.75)
doc_list = @items.values
tda = doc_list.collect { |node| node.raw_vector_with(@word_list) }

if $GSL
if $SVD == :numo
tdm = Numo::NArray.asarray(tda.map(&:to_a)).transpose
ntdm = numo_build_reduced_matrix(tdm, cutoff)

ntdm.each_over_axis(1).with_index do |col_vec, i|
doc_list[i].lsi_vector = col_vec
doc_list[i].lsi_norm = col_vec / Numo::Linalg.norm(col_vec)
end
elsif $SVD == :gsl
tdm = GSL::Matrix.alloc(*tda).trans
ntdm = build_reduced_matrix(tdm, cutoff)

Expand Down Expand Up @@ -201,7 +221,9 @@ def proximity_array_for_content(doc, &block)
content_node = node_for_content(doc, &block)
result =
@items.keys.collect do |item|
val = if $GSL
val = if $SVD == :numo
content_node.search_vector.dot(@items[item].transposed_search_vector)
elsif $SVD == :gsl
content_node.search_vector * @items[item].transposed_search_vector
else
(Matrix[content_node.search_vector] * @items[item].search_vector)[0]
Expand All @@ -220,7 +242,8 @@ def proximity_norms_for_content(doc, &block)
return [] if needs_rebuild?

content_node = node_for_content(doc, &block)
if $GSL && content_node.raw_norm.isnan?.all?
if ($SVD == :gsl && content_node.raw_norm.isnan?.all?) ||
($SVD == :numo && content_node.raw_norm.isnan.all?)
puts "There are no documents that are similar to #{doc}"
else
content_node_norms(content_node)
Expand All @@ -230,7 +253,9 @@ def proximity_norms_for_content(doc, &block)
def content_node_norms(content_node)
result =
@items.keys.collect do |item|
val = if $GSL
val = if $SVD == :numo
content_node.search_norm.dot(@items[item].search_norm)
elsif $SVD == :gsl
content_node.search_norm * @items[item].search_norm.col
else
(Matrix[content_node.search_norm] * @items[item].search_norm)[0]
Expand Down Expand Up @@ -332,7 +357,20 @@ def build_reduced_matrix(matrix, cutoff = 0.75)
s[ord] = 0.0 if s[ord] < s_cutoff
end
# Reconstruct the term document matrix, only with reduced rank
u * ($GSL ? GSL::Matrix : ::Matrix).diag(s) * v.trans
u * ($SVD == :gsl ? GSL::Matrix : ::Matrix).diag(s) * v.trans
end

def numo_build_reduced_matrix(matrix, cutoff = 0.75)
s, u, vt = Numo::Linalg.svd(matrix, driver: 'svd', job: 'S')

# TODO: Better than 75% term (as above)
s_cutoff = s.sort.reverse[(s.size * cutoff).round - 1]
s.size.times do |ord|
s[ord] = 0.0 if s[ord] < s_cutoff
end

# Reconstruct the term document matrix, only with reduced rank
u.dot(::Numo::DFloat.eye(s.size) * s).dot(vt)
end

def node_for_content(item, &block)
Expand Down
23 changes: 17 additions & 6 deletions lib/classifier-reborn/lsi/content_node.rb
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,11 @@ def search_vector

# Method to access the transposed search vector
def transposed_search_vector
search_vector.col
if $SVD == :numo
search_vector
else
search_vector.col
end
end

# Use this to fetch the appropriate search vector in normalized form.
Expand All @@ -40,7 +44,9 @@ def search_norm
# Creates the raw vector out of word_hash using word_list as the
# key for mapping the vector space.
def raw_vector_with(word_list)
vec = if $GSL
vec = if $SVD == :numo
Numo::DFloat.zeros(word_list.size)
elsif $SVD == :gsl
GSL::Vector.alloc(word_list.size)
else
Array.new(word_list.size, 0)
Expand All @@ -51,7 +57,9 @@ def raw_vector_with(word_list)
end

# Perform the scaling transform and force floating point arithmetic
if $GSL
if $SVD == :numo
total_words = vec.sum.to_f
elsif $SVD == :gsl
sum = 0.0
vec.each { |v| sum += v }
total_words = sum
Expand All @@ -61,7 +69,7 @@ def raw_vector_with(word_list)

total_unique_words = 0

if $GSL
if [:numo, :gsl].include?($SVD)
vec.each { |word| total_unique_words += 1 if word != 0.0 }
else
total_unique_words = vec.count { |word| word != 0 }
Expand All @@ -85,12 +93,15 @@ def raw_vector_with(word_list)
hash[val] = Math.log(val + 1) / -weighted_total
end

vec.collect! do |val|
vec = vec.map do |val|
cached_calcs[val]
end
end

if $GSL
if $SVD == :numo
@raw_norm = vec / Numo::Linalg.norm(vec)
@raw_vector = vec
elsif $SVD == :gsl
@raw_norm = vec.normalize
@raw_vector = vec
else
Expand Down
2 changes: 1 addition & 1 deletion test/extensions/matrix_test.rb
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

class MatrixTest < Minitest::Test
def test_zero_division
skip "extensions/vector is only used by non-GSL implementation" if $GSL
skip "extensions/vector is only used by non-GSL implementation" if $SVD != :ruby

matrix = Matrix[[1, 0], [0, 1]]
matrix.SV_decomp
Expand Down
2 changes: 1 addition & 1 deletion test/extensions/zero_vector_test.rb
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

class ZeroVectorTest < Minitest::Test
def test_zero?
skip "extensions/zero_vector is only used by non-GSL implementation" if $GSL
skip "extensions/zero_vector is only used by non-GSL implementation" if $SVD != :ruby

vec0 = Vector[]
vec1 = Vector[0]
Expand Down
6 changes: 3 additions & 3 deletions test/lsi/lsi_test.rb
Original file line number Diff line number Diff line change
Expand Up @@ -163,7 +163,7 @@ def test_cached_content_node_option
end

def test_clears_cached_content_node_cache
skip "transposed_search_vector is only used by GSL implementation" unless $GSL
skip "transposed_search_vector is only used by GSL implementation" if $SVD == :ruby

lsi = ClassifierReborn::LSI.new(cache_node_vectors: true)
lsi.add_item @str1, 'Dog'
Expand Down Expand Up @@ -191,8 +191,8 @@ def test_keyword_search
assert_equal %i[dog text deal], lsi.highest_ranked_stems(@str1)
end

def test_invalid_searching_when_using_gsl
skip "Only GSL currently raises invalid search error" unless $GSL
def test_invalid_searching_with_linalg_lib
skip "Only GSL currently raises invalid search error" if $SVD == :ruby

lsi = ClassifierReborn::LSI.new
lsi.add_item @str1, 'Dog'
Expand Down

0 comments on commit aa3ea2c

Please sign in to comment.