Describe the bug
`cuml.naive_bayes.MultinomialNB` produces different class predictions from scikit-learn on a minimal 2-feature example when trained (a) with a one-shot `fit` and (b) incrementally via `partial_fit` after introducing an unseen class. The behavior contradicts scikit-learn's documented semantics for `MultinomialNB` (Laplace smoothing, incremental updates) and cuML's stated API-alignment goals. cuML's docs list `MultinomialNB` but do not document a `partial_fit` method for it (only `GaussianNB`/`CategoricalNB` explicitly document `partial_fit`), suggesting the incremental path for `MultinomialNB` may be under-documented and potentially incorrect for dense CuPy inputs.
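For context, the quantity at stake is the Laplace-smoothed per-class feature estimate `theta_ci = (N_ci + alpha) / (N_c + alpha * n_features)`, where `N_ci` is the accumulated count of feature `i` in class `c` and `N_c = sum_i N_ci`; across `partial_fit` calls these counts should accumulate rather than be recomputed from the latest batch alone.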
Steps/Code to reproduce bug
```python
# ===== cuML: one-shot fit (dense) =====
import cupy as cp
from cuml.naive_bayes import MultinomialNB as cuMNB

X1 = cp.array([[0, 1], [1, 0]], dtype=cp.float32)  # two samples, two features
y1 = cp.array([0, 1], dtype=cp.int32)
X_add = cp.array([[1, 1]], dtype=cp.float32)       # new sample for a new class
y_add = cp.array([2], dtype=cp.int32)

# A) One-shot fit on first batch (classes {0, 1})
m_fit_1 = cuMNB().fit(X1, y1)
# Expected: [0, 1, 0] for [[0, 1], [1, 0], [1, 1]]
assert int(m_fit_1.predict(cp.array([[0, 1]]))[0]) == 0
assert int(m_fit_1.predict(cp.array([[1, 0]]))[0]) == 1
assert int(m_fit_1.predict(cp.array([[1, 1]]))[0]) == 0

# B) One-shot fit on merged data (classes {0, 1, 2})
X12 = cp.vstack([X1, X_add])
y12 = cp.concatenate([y1, y_add])
m_fit_12 = cuMNB().fit(X12, y12)
# Expected: [0, 1, 2] after seeing class 2
assert int(m_fit_12.predict(cp.array([[0, 1]]))[0]) == 0
assert int(m_fit_12.predict(cp.array([[1, 0]]))[0]) == 1
assert int(m_fit_12.predict(cp.array([[1, 1]]))[0]) == 2

# ===== cuML: partial_fit (dense) =====
classes = cp.array([0, 1, 2], dtype=cp.int32)
m_pf = cuMNB()
m_pf.partial_fit(X1, y1, classes=classes)  # first batch (classes 0, 1)
assert int(m_pf.predict(cp.array([[0, 1]]))[0]) == 0
assert int(m_pf.predict(cp.array([[1, 0]]))[0]) == 1
assert int(m_pf.predict(cp.array([[1, 1]]))[0]) == 0

# Introduce class 2 incrementally
m_pf.partial_fit(X_add, y_add)  # second batch adds class 2
assert int(m_pf.predict(cp.array([[0, 1]]))[0]) == 0
assert int(m_pf.predict(cp.array([[1, 0]]))[0]) == 1  # <-- cuML returns 0 (unexpected)
assert int(m_pf.predict(cp.array([[1, 1]]))[0]) == 2  # <-- cuML returns 0 (unexpected)
```
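For diagnosis, it can help to dump the fitted state after each `partial_fit` call. A minimal sketch, assuming cuML mirrors scikit-learn's learned attributes (`classes_`, `class_count_`, `feature_count_`); these names are an assumption and may differ in cuML:

```python
# Hedged diagnostic: attribute names assume cuML mirrors scikit-learn's
# MultinomialNB; getattr guards against any that cuML does not expose.
for name in ("classes_", "class_count_", "feature_count_"):
    print(name, getattr(m_pf, name, "<not exposed>"))
# If partial_fit accumulates correctly, class_count_ should be [1, 1, 1]
# and feature_count_ should be [[0, 1], [1, 0], [1, 1]] after both batches.
```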
Reference result with scikit-learn (expected semantics):
```python
# ===== scikit-learn: reference behavior =====
import numpy as np
from sklearn.naive_bayes import MultinomialNB as skMNB

X1 = np.array([[0, 1], [1, 0]], dtype=np.float32)
y1 = np.array([0, 1], dtype=np.int32)
X_add = np.array([[1, 1]], dtype=np.float32)
y_add = np.array([2], dtype=np.int32)

# A) One-shot fit on {0, 1}
s1 = skMNB().fit(X1, y1)
assert int(s1.predict([[0, 1]])[0]) == 0
assert int(s1.predict([[1, 0]])[0]) == 1
assert int(s1.predict([[1, 1]])[0]) == 0

# B) One-shot fit on {0, 1, 2} (merged data)
s12 = skMNB().fit(np.vstack([X1, X_add]), np.concatenate([y1, y_add]))
assert int(s12.predict([[0, 1]])[0]) == 0
assert int(s12.predict([[1, 0]])[0]) == 1
assert int(s12.predict([[1, 1]])[0]) == 2  # <-- as expected

# C) partial_fit path aligned with the docs (first call must pass 'classes')
spf = skMNB()
spf.partial_fit(X1, y1, classes=np.array([0, 1, 2], dtype=np.int32))
assert int(spf.predict([[1, 0]])[0]) == 1
assert int(spf.predict([[1, 1]])[0]) == 0
spf.partial_fit(X_add, y_add)
assert int(spf.predict([[1, 1]])[0]) == 2
```
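For comparison, scikit-learn's documented fitted counters after the two `partial_fit` calls show the accumulated state that cuML would need to match:

```python
# These attributes are documented on sklearn.naive_bayes.MultinomialNB.
print(spf.class_count_)    # [1. 1. 1.]
print(spf.feature_count_)  # [[0. 1.]
                           #  [1. 0.]
                           #  [1. 1.]]
```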
- Scikit-learn's `MultinomialNB` explicitly documents `partial_fit` semantics for incremental learning (the first call must pass `classes=`). cuML aims for scikit-learn API compatibility.
- cuML's API page shows `MultinomialNB` (the example uses CSR sparse input) but does not list a `partial_fit` method there; `partial_fit` is explicitly documented on the `GaussianNB`/`CategoricalNB` pages.
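To make the expected incremental semantics concrete independently of either library, here is a minimal NumPy sketch of multinomial NB with a count-accumulating `partial_fit` and Laplace smoothing. This is illustrative only, not cuML's or scikit-learn's implementation:

```python
import numpy as np

class RefMultinomialNB:
    """Minimal reference for the expected partial_fit semantics:
    accumulate raw counts across batches, smooth only at predict time."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def partial_fit(self, X, y, classes=None):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        if classes is not None:  # first call fixes the class set
            self.classes_ = np.asarray(classes)
            self.class_count_ = np.zeros(len(self.classes_))
            self.feature_count_ = np.zeros((len(self.classes_), X.shape[1]))
        for i, c in enumerate(self.classes_):  # accumulate, never overwrite
            mask = y == c
            self.class_count_[i] += mask.sum()
            self.feature_count_[i] += X[mask].sum(axis=0)
        return self

    def predict(self, X):
        sc = self.feature_count_ + self.alpha  # Laplace smoothing
        log_theta = np.log(sc) - np.log(sc.sum(axis=1, keepdims=True))
        with np.errstate(divide="ignore"):  # zero-count classes get -inf prior
            log_prior = np.log(self.class_count_ / self.class_count_.sum())
        jll = np.asarray(X, dtype=float) @ log_theta.T + log_prior
        return self.classes_[jll.argmax(axis=1)]

ref = RefMultinomialNB().partial_fit([[0, 1], [1, 0]], [0, 1], classes=[0, 1, 2])
ref.partial_fit([[1, 1]], [2])  # second batch introduces class 2
print(ref.predict([[0, 1], [1, 0], [1, 1]]))  # -> [0 1 2]
```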
Expected behavior
For identical data and hyperparameters (default `alpha=1.0` Laplace smoothing), cuML `MultinomialNB` should match scikit-learn's predictions for both one-shot `fit` and `partial_fit`. After adding a sample of a new class `2`, the model trained on the merged data or incrementally via `partial_fit` should predict class `2` for `[1,1]` and class `1` for `[1,0]`, consistent with the standard multinomial NB formulation with Laplace/Lidstone smoothing (see `sklearn.naive_bayes.MultinomialNB`).
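As an arithmetic sanity check of that expectation, here are the hand-computed smoothed parameters and joint log-likelihoods for the merged three-sample dataset (standard formulation, uniform prior):

```python
import numpy as np

# With alpha=1 and one sample per class, the smoothed parameters are:
#   class 0: counts [0, 1] -> theta = [1/3, 2/3]
#   class 1: counts [1, 0] -> theta = [2/3, 1/3]
#   class 2: counts [1, 1] -> theta = [1/2, 1/2]
theta = np.array([[1/3, 2/3], [2/3, 1/3], [1/2, 1/2]])
log_prior = np.log(np.full(3, 1/3))  # uniform prior: one sample per class

for x in ([0, 1], [1, 0], [1, 1]):
    jll = np.asarray(x) @ np.log(theta).T + log_prior
    print(x, jll.round(4), "->", int(jll.argmax()))
# [1, 1]: class 2 scores -2.4849 vs. -2.6027 for classes 0 and 1, so the
# correct prediction is 2; [1, 0] similarly resolves to class 1.
```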
Environment details:
- Environment location: Bare-metal
- Linux Distro/Architecture: [Ubuntu 20.04 x86_64]
- GPU Model/Driver: [A800, driver 525.147.05]
- CUDA: [12.9, V12.9.86]
- Python: [3.12.11]
- cuML version where it reproduces: 25.08
- Method of cuDF & cuML install: conda
- The full environment listing is long, so here instead is the command line used to create the virtual environment: `conda create -n rapids-25.08 -c rapidsai -c conda-forge -c nvidia rapids=25.08 python=3.12 'cuda-version>=12.0,<=12.9'`