Commit e6153b4

Support Numo Gem for performing SVD (#198)
Merge pull request 198
1 parent fb5da8e commit e6153b4

9 files changed: +116 -34 lines changed


.github/workflows/ci.yml

+12 -6
@@ -14,17 +14,17 @@ on:

 jobs:
   ci:
-    name: "Run Tests (Ruby ${{ matrix.ruby_version }}, GSL: ${{ matrix.gsl }})"
+    name: "Run Tests (Ruby ${{ matrix.ruby_version }}, Linalg: ${{ matrix.linalg_gem }})"
     runs-on: "ubuntu-latest"
     env:
       # See https://github.com/marketplace/actions/setup-ruby-jruby-and-truffleruby#matrix-of-gemfiles
       BUNDLE_GEMFILE: ${{ matrix.gemfile }}
-      LOAD_GSL: ${{ matrix.gsl }}
+      LINALG_GEM: ${{ matrix.linalg_gem }}
     strategy:
       fail-fast: false
       matrix:
         ruby_version: ["2.7", "3.0", "3.1", "jruby-9.3.4.0"]
-        gsl: [true, false]
+        linalg_gem: ["none", "gsl", "numo"]
         # We use `include` to assign the correct Gemfile for each ruby_version
         include:
           - ruby_version: "2.7"
@@ -39,17 +39,23 @@ jobs:
           # Ruby 3.0 does not work with the latest released gsl gem
           # https://github.com/SciRuby/rb-gsl/issues/67
           - ruby_version: "3.0"
-            gsl: true
+            linalg_gem: "gsl"
           # Ruby 3.1 does not work with the latest released gsl gem
           # https://github.com/SciRuby/rb-gsl/issues/67
           - ruby_version: "3.1"
-            gsl: true
+            linalg_gem: "gsl"
           # jruby-9.3.4.0 doesn't easily build the gsl gem on a GitHub worker. Skipping for now.
           - ruby_version: "jruby-9.3.4.0"
-            gsl: true
+            linalg_gem: "gsl"
+          # jruby-9.3.4.0 doesn't easily build the numo gems on a GitHub worker. Skipping for now.
+          - ruby_version: "jruby-9.3.4.0"
+            linalg_gem: "numo"
     steps:
       - name: Checkout Repository
         uses: actions/checkout@v3
+      - name: Install Lapack
+        if: ${{ matrix.linalg_gem == 'numo' }}
+        run: sudo apt-get install -y liblapacke-dev libopenblas-dev
       - name: "Set up ${{ matrix.label }}"
         uses: ruby/setup-ruby@v1
         with:

.rubocop.yml

+1 -1
@@ -1,7 +1,7 @@
 inherit_from: .rubocop_todo.yml

 Style/GlobalVars:
-  AllowedVariables: [$GSL]
+  AllowedVariables: [$SVD]

 Naming/MethodName:
   Exclude:

Gemfile

+6 -1
@@ -4,4 +4,9 @@ source 'https://rubygems.org'
 gemspec name: 'classifier-reborn'

 # For testing with GSL support & bundle exec
-gem 'gsl' if ENV['LOAD_GSL'] == 'true'
+gem 'gsl' if ENV['LINALG_GEM'] == 'gsl'
+
+if ENV['LINALG_GEM'] == 'numo'
+  gem 'numo-narray'
+  gem 'numo-linalg'
+end
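
The Gemfile change above selects optional gems from the `LINALG_GEM` environment variable. As a rough illustration of that mapping, here is a minimal sketch written as a plain method. The name `linalg_gems_for` is hypothetical and does not exist in classifier-reborn; it only mirrors the two conditionals shown in the diff.

```ruby
# Hypothetical helper (not part of classifier-reborn): maps a LINALG_GEM
# value to the optional gems the Gemfile in this commit would add.
def linalg_gems_for(value)
  case value
  when 'gsl'  then ['gsl']
  when 'numo' then ['numo-narray', 'numo-linalg']
  else []     # "none", unset, or anything else: no optional gems
  end
end

# Show what the current environment would select.
puts linalg_gems_for(ENV['LINALG_GEM']).inspect
```

Any value other than `gsl` or `numo` (including the CI matrix value `"none"`) yields no extra gems, which matches the pure-Ruby fallback.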

docs/index.md

+26 -4
@@ -60,12 +60,34 @@ The only runtime dependency of this gem is Roman Shterenzon's `fast-stemmer` gem
 gem install fast-stemmer
 ```

-To speed up `LSI` classification by at least 10x consider installing following libraries.
+In addition, it is **recommended** to install either Numo or GSL to speed up LSI classification by at least 10x.

-* [GSL - GNU Scientific Library](http://www.gnu.org/software/gsl)
-* [Ruby/GSL Gem](https://rubygems.org/gems/gsl)
+Note that LSI will work without these libraries, but as soon as they are installed, classifier will make use of them. No configuration changes are needed, we like to keep things ridiculously easy for you.
+
+### Install Numo Gems
+
+[Numo](https://ruby-numo.github.io/narray/) is a set of Numerical Module gems for Ruby that provide a Ruby interface to [LAPACK](http://www.netlib.org/lapack/). If classifier detects that the required Numo gems are installed, it will make use of them to perform LSI faster.
+
+* Install [LAPACKE](https://www.netlib.org/lapack/lapacke.html)
+  * Ubuntu: `apt-get install liblapacke-dev`
+  * macOS: (Help wanted to verify installation steps) https://stackoverflow.com/questions/38114201/installing-lapack-and-blas-libraries-for-c-on-mac-os
+* Install [OpenBLAS](https://www.openblas.net/)
+  * Ubuntu: `apt-get install libopenblas-dev`
+  * macOS: (Help wanted to verify installation steps) https://stackoverflow.com/questions/38114201/installing-lapack-and-blas-libraries-for-c-on-mac-os
+* Install the [Numo::NArray](https://ruby-numo.github.io/narray/) and [Numo::Linalg](https://ruby-numo.github.io/linalg/) gems
+  * `gem install numo-narray numo-linalg`
+
+### Install GSL Gem
+
+**Note:** The `gsl` gem is currently incompatible with Ruby 3. It is recommended to use Numo instead with Ruby 3.
+
+The [GNU Scientific Library (GSL)](http://www.gnu.org/software/gsl) is an alternative to Numo/LAPACK that can be used to improve LSI performance. (You should install one or the other, but both are not required.)
+
+* Install the [GNU Scientific Library](http://www.gnu.org/software/gsl)
+  * Ubuntu: `apt-get install libgsl-dev`
+* Install the [Ruby/GSL Gem](https://rubygems.org/gems/gsl) (or add it to your Gemfile)
+  * `gem install gsl`

-Note that `LSI` will work without these libraries, but as soon as they are installed, classifier will make use of them. No configuration changes are needed, we like to keep things ridiculously easy for you.


 ## Further Readings
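
The docs change above says the classifier picks up Numo or GSL automatically once they are installed. A minimal sketch of such a fallback chain follows, assuming the preference order Numo, then GSL, then pure Ruby that this commit uses. The method name `detect_svd_backend` and the injectable `loader` parameter are hypothetical, added only so the logic can be exercised without the native libraries; the commit itself sets the `$SVD` global with nested `begin`/`rescue LoadError` blocks and also honours the `NATIVE_VECTOR` and `GSL` environment overrides, which this sketch omits.

```ruby
# Hypothetical helper (not part of classifier-reborn): tries each backend's
# libraries in order of preference and returns the first backend that loads.
def detect_svd_backend(loader = ->(lib) { require lib })
  candidates = { numo: ['numo/narray', 'numo/linalg'], gsl: ['gsl'] }
  candidates.each do |backend, libs|
    begin
      libs.each { |lib| loader.call(lib) }
      return backend
    rescue LoadError
      next # this backend is unavailable; try the next one
    end
  end
  :ruby # nothing loaded: fall back to the pure-Ruby implementation
end

puts detect_svd_backend
```

Passing a custom `loader` lambda makes the fallback order easy to check in isolation.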

lib/classifier-reborn/lsi.rb

+49 -11
@@ -4,16 +4,28 @@
 # Copyright:: Copyright (c) 2005 David Fayram II
 # License:: LGPL

+# Try to load Numo first - it's the most current and the most well-supported.
+# Fall back to GSL.
+# Fall back to native vector.
 begin
   raise LoadError if ENV['NATIVE_VECTOR'] == 'true' # to test the native vector class, try `rake test NATIVE_VECTOR=true`
+  raise LoadError if ENV['GSL'] == 'true' # to test with gsl, try `rake test GSL=true`

-  require 'gsl' # requires https://github.com/SciRuby/rb-gsl
-  require_relative 'extensions/vector_serialize'
-  $GSL = true
+  require 'numo/narray' # https://ruby-numo.github.io/narray/
+  require 'numo/linalg' # https://ruby-numo.github.io/linalg/
+  $SVD = :numo
 rescue LoadError
-  $GSL = false
-  require_relative 'extensions/vector'
-  require_relative 'extensions/zero_vector'
+  begin
+    raise LoadError if ENV['NATIVE_VECTOR'] == 'true' # to test the native vector class, try `rake test NATIVE_VECTOR=true`
+
+    require 'gsl' # requires https://github.com/SciRuby/rb-gsl
+    require_relative 'extensions/vector_serialize'
+    $SVD = :gsl
+  rescue LoadError
+    $SVD = :ruby
+    require_relative 'extensions/vector'
+    require_relative 'extensions/zero_vector'
+  end
 end

 require_relative 'lsi/word_list'
@@ -140,7 +152,15 @@ def build_index(cutoff = 0.75)
       doc_list = @items.values
       tda = doc_list.collect { |node| node.raw_vector_with(@word_list) }

-      if $GSL
+      if $SVD == :numo
+        tdm = Numo::NArray.asarray(tda.map(&:to_a)).transpose
+        ntdm = numo_build_reduced_matrix(tdm, cutoff)
+
+        ntdm.each_over_axis(1).with_index do |col_vec, i|
+          doc_list[i].lsi_vector = col_vec
+          doc_list[i].lsi_norm = col_vec / Numo::Linalg.norm(col_vec)
+        end
+      elsif $SVD == :gsl
         tdm = GSL::Matrix.alloc(*tda).trans
         ntdm = build_reduced_matrix(tdm, cutoff)

@@ -201,7 +221,9 @@ def proximity_array_for_content(doc, &block)
       content_node = node_for_content(doc, &block)
       result =
         @items.keys.collect do |item|
-          val = if $GSL
+          val = if $SVD == :numo
+                  content_node.search_vector.dot(@items[item].transposed_search_vector)
+                elsif $SVD == :gsl
                   content_node.search_vector * @items[item].transposed_search_vector
                 else
                   (Matrix[content_node.search_vector] * @items[item].search_vector)[0]
@@ -220,7 +242,8 @@ def proximity_norms_for_content(doc, &block)
       return [] if needs_rebuild?

       content_node = node_for_content(doc, &block)
-      if $GSL && content_node.raw_norm.isnan?.all?
+      if ($SVD == :gsl && content_node.raw_norm.isnan?.all?) ||
+         ($SVD == :numo && content_node.raw_norm.isnan.all?)
         puts "There are no documents that are similar to #{doc}"
       else
         content_node_norms(content_node)
@@ -230,7 +253,9 @@ def proximity_norms_for_content(doc, &block)
     def content_node_norms(content_node)
       result =
         @items.keys.collect do |item|
-          val = if $GSL
+          val = if $SVD == :numo
+                  content_node.search_norm.dot(@items[item].search_norm)
+                elsif $SVD == :gsl
                   content_node.search_norm * @items[item].search_norm.col
                 else
                   (Matrix[content_node.search_norm] * @items[item].search_norm)[0]
@@ -332,7 +357,20 @@ def build_reduced_matrix(matrix, cutoff = 0.75)
         s[ord] = 0.0 if s[ord] < s_cutoff
       end
       # Reconstruct the term document matrix, only with reduced rank
-      u * ($GSL ? GSL::Matrix : ::Matrix).diag(s) * v.trans
+      u * ($SVD == :gsl ? GSL::Matrix : ::Matrix).diag(s) * v.trans
+    end
+
+    def numo_build_reduced_matrix(matrix, cutoff = 0.75)
+      s, u, vt = Numo::Linalg.svd(matrix, driver: 'svd', job: 'S')
+
+      # TODO: Better than 75% term (as above)
+      s_cutoff = s.sort.reverse[(s.size * cutoff).round - 1]
+      s.size.times do |ord|
+        s[ord] = 0.0 if s[ord] < s_cutoff
+      end
+
+      # Reconstruct the term document matrix, only with reduced rank
+      u.dot(::Numo::DFloat.eye(s.size) * s).dot(vt)
     end

     def node_for_content(item, &block)
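
Both `build_reduced_matrix` and the new `numo_build_reduced_matrix` apply the same thresholding step to the singular values before reconstructing the matrix. The sketch below isolates that step on a plain Ruby `Array`, so it runs without Numo or GSL. The method name `truncate_singular_values` is hypothetical; only the cutoff expression is taken from the diff.

```ruby
# Hypothetical helper (not part of classifier-reborn): zeroes singular values
# that fall below the value at the cutoff rank, using the same expression as
# the diff, `s.sort.reverse[(s.size * cutoff).round - 1]`.
def truncate_singular_values(values, cutoff = 0.75)
  threshold = values.sort.reverse[(values.size * cutoff).round - 1]
  values.map { |v| v < threshold ? 0.0 : v }
end

p truncate_singular_values([5.0, 3.0, 2.0, 1.0]) # => [5.0, 3.0, 2.0, 0.0]
```

With four values and the default cutoff of 0.75, the threshold is the third-largest value (2.0), so only the smallest value is zeroed. Unlike the methods in the diff, this sketch returns a new array rather than mutating `s` in place.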

lib/classifier-reborn/lsi/content_node.rb

+17 -6
@@ -29,7 +29,11 @@ def search_vector

     # Method to access the transposed search vector
     def transposed_search_vector
-      search_vector.col
+      if $SVD == :numo
+        search_vector
+      else
+        search_vector.col
+      end
     end

     # Use this to fetch the appropriate search vector in normalized form.
@@ -40,7 +44,9 @@ def search_norm
     # Creates the raw vector out of word_hash using word_list as the
     # key for mapping the vector space.
     def raw_vector_with(word_list)
-      vec = if $GSL
+      vec = if $SVD == :numo
+              Numo::DFloat.zeros(word_list.size)
+            elsif $SVD == :gsl
               GSL::Vector.alloc(word_list.size)
             else
               Array.new(word_list.size, 0)
@@ -51,7 +57,9 @@ def raw_vector_with(word_list)
       end

       # Perform the scaling transform and force floating point arithmetic
-      if $GSL
+      if $SVD == :numo
+        total_words = vec.sum.to_f
+      elsif $SVD == :gsl
         sum = 0.0
         vec.each { |v| sum += v }
         total_words = sum
@@ -61,7 +69,7 @@ def raw_vector_with(word_list)

       total_unique_words = 0

-      if $GSL
+      if [:numo, :gsl].include?($SVD)
         vec.each { |word| total_unique_words += 1 if word != 0.0 }
       else
         total_unique_words = vec.count { |word| word != 0 }
@@ -85,12 +93,15 @@ def raw_vector_with(word_list)
           hash[val] = Math.log(val + 1) / -weighted_total
         end

-        vec.collect! do |val|
+        vec = vec.map do |val|
           cached_calcs[val]
         end
       end

-      if $GSL
+      if $SVD == :numo
+        @raw_norm = vec / Numo::Linalg.norm(vec)
+        @raw_vector = vec
+      elsif $SVD == :gsl
         @raw_norm = vec.normalize
         @raw_vector = vec
       else
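
In the Numo branch above, the normalized vector is computed as `vec / Numo::Linalg.norm(vec)`. The sketch below shows the same idea on a plain Ruby `Array`, assuming the Euclidean (L2) norm. The method name `normalize` is hypothetical and is not the gem's API. It also illustrates why `proximity_norms_for_content` in this commit checks for NaN: an all-zero vector divided by its zero norm yields NaN entries.

```ruby
# Hypothetical helper (not part of classifier-reborn): divides each element
# by the vector's Euclidean norm, as the Numo branch does with
# `vec / Numo::Linalg.norm(vec)`.
def normalize(vec)
  norm = Math.sqrt(vec.sum { |v| v * v })
  vec.map { |v| v / norm }
end

p normalize([3.0, 4.0])              # => [0.6, 0.8]
p normalize([0.0, 0.0]).all?(&:nan?) # => true (0.0 / 0.0 is NaN)
```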

test/extensions/matrix_test.rb

+1 -1
@@ -2,7 +2,7 @@

 class MatrixTest < Minitest::Test
   def test_zero_division
-    skip "extensions/vector is only used by non-GSL implementation" if $GSL
+    skip "extensions/vector is only used by non-GSL implementation" if $SVD != :ruby

     matrix = Matrix[[1, 0], [0, 1]]
     matrix.SV_decomp

test/extensions/zero_vector_test.rb

+1 -1
@@ -2,7 +2,7 @@

 class ZeroVectorTest < Minitest::Test
   def test_zero?
-    skip "extensions/zero_vector is only used by non-GSL implementation" if $GSL
+    skip "extensions/zero_vector is only used by non-GSL implementation" if $SVD != :ruby

     vec0 = Vector[]
     vec1 = Vector[0]

test/lsi/lsi_test.rb

+3 -3
@@ -163,7 +163,7 @@ def test_cached_content_node_option
   end

   def test_clears_cached_content_node_cache
-    skip "transposed_search_vector is only used by GSL implementation" unless $GSL
+    skip "transposed_search_vector is only used by GSL implementation" if $SVD == :ruby

     lsi = ClassifierReborn::LSI.new(cache_node_vectors: true)
     lsi.add_item @str1, 'Dog'
@@ -191,8 +191,8 @@ def test_keyword_search
     assert_equal %i[dog text deal], lsi.highest_ranked_stems(@str1)
   end

-  def test_invalid_searching_when_using_gsl
-    skip "Only GSL currently raises invalid search error" unless $GSL
+  def test_invalid_searching_with_linalg_lib
+    skip "Only GSL currently raises invalid search error" if $SVD == :ruby

     lsi = ClassifierReborn::LSI.new
     lsi.add_item @str1, 'Dog'
