
[BUG] cuML MultinomialNB mispredicts on dense CuPy inputs and after partial_fit with a new class #7203

@Davidoxn

Description


Describe the bug
cuml.naive_bayes.MultinomialNB produces class predictions that differ from scikit-learn's on a minimal two-feature example, both when trained (a) with a one-shot fit and (b) incrementally via partial_fit after introducing an unseen class. This contradicts scikit-learn's documented semantics for MultinomialNB (Laplace smoothing, incremental updates) and cuML's stated goal of scikit-learn API alignment. cuML's docs list MultinomialNB but do not document a partial_fit method for it (only GaussianNB and CategoricalNB explicitly document partial_fit), suggesting the incremental path for MultinomialNB may be under-documented and potentially incorrect for dense CuPy inputs.

Steps/Code to reproduce bug

# ===== cuML: one-shot fit (dense) =====
import cupy as cp
from cuml.naive_bayes import MultinomialNB as cuMNB

X1 = cp.array([[0, 1], [1, 0]], dtype=cp.float32)   # two samples, two features
y1 = cp.array([0, 1], dtype=cp.int32)
X_add = cp.array([[1, 1]], dtype=cp.float32)        # new sample for a new class
y_add = cp.array([2], dtype=cp.int32)

# A) One-shot fit on first batch (classes {0,1})
m_fit_1 = cuMNB().fit(X1, y1)
# Expected: [0,1,0] for [[0,1],[1,0],[1,1]]
assert int(m_fit_1.predict(cp.array([[0,1]]))) == 0
assert int(m_fit_1.predict(cp.array([[1,0]]))) == 1      
assert int(m_fit_1.predict(cp.array([[1,1]]))) == 0

# B) One-shot fit on merged data (classes {0,1,2})
X12 = cp.vstack([X1, X_add]); y12 = cp.concatenate([y1, y_add])
m_fit_12 = cuMNB().fit(X12, y12)
# Expected: [0,1,2] after seeing class 2
assert int(m_fit_12.predict(cp.array([[0,1]]))) == 0
assert int(m_fit_12.predict(cp.array([[1,0]]))) == 1
assert int(m_fit_12.predict(cp.array([[1,1]]))) == 2      

# ===== cuML: partial_fit (dense) =====
classes = cp.array([0,1,2], dtype=cp.int32)
m_pf = cuMNB()
m_pf.partial_fit(X1, y1, classes=classes)  # first batch (0,1)
assert int(m_pf.predict(cp.array([[0,1]]))) == 0
assert int(m_pf.predict(cp.array([[1,0]]))) == 1          
assert int(m_pf.predict(cp.array([[1,1]]))) == 0
# introduce class 2 incrementally
m_pf.partial_fit(X_add, y_add)                             # second batch adds class 2
assert int(m_pf.predict(cp.array([[0,1]]))) == 0
assert int(m_pf.predict(cp.array([[1,0]]))) == 1          # <-- cuML returns 0 (unexpected)
assert int(m_pf.predict(cp.array([[1,1]]))) == 2          # <-- cuML returns 0 (unexpected)

Reference result with scikit-learn (expected semantics):

# ===== scikit-learn: reference behavior =====
import numpy as np
from sklearn.naive_bayes import MultinomialNB as skMNB

X1 = np.array([[0,1],[1,0]], dtype=np.float32)
y1 = np.array([0,1], dtype=np.int32)
X_add = np.array([[1,1]], dtype=np.float32)
y_add = np.array([2], dtype=np.int32)

# A) one-shot fit({0,1})
s1 = skMNB().fit(X1, y1)
assert int(s1.predict([[0,1]])[0]) == 0
assert int(s1.predict([[1,0]])[0]) == 1
assert int(s1.predict([[1,1]])[0]) == 0

# B) one-shot fit({0,1,2}) on merged data
s12 = skMNB().fit(np.vstack([X1, X_add]), np.concatenate([y1, y_add]))
assert int(s12.predict([[0,1]])[0]) == 0
assert int(s12.predict([[1,0]])[0]) == 1
assert int(s12.predict([[1,1]])[0]) == 2  # <-- as expected

# C) partial_fit path aligned with docs (first call must pass 'classes')
spf = skMNB()
spf.partial_fit(X1, y1, classes=np.array([0,1,2], dtype=np.int32))
assert int(spf.predict([[1,0]])[0]) == 1
assert int(spf.predict([[1,1]])[0]) == 0
spf.partial_fit(X_add, y_add)
assert int(spf.predict([[1,1]])[0]) == 2
  • Scikit-learn’s MultinomialNB explicitly documents partial_fit semantics for incremental learning (first call must pass classes=). cuML aims for scikit-learn API compatibility.
  • cuML’s API page shows MultinomialNB (example uses CSR sparse input) but does not list a partial_fit method there; partial_fit is explicitly documented on GaussianNB/CategoricalNB pages.
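For reference, the incremental semantics the bullets describe reduce to plain count accumulation: two partial_fit calls must leave the model with the same per-class sample and feature counts as a single fit on the merged data. A minimal NumPy sketch of that invariant (the helper names below are illustrative, not cuML or scikit-learn internals):

```python
import numpy as np

classes = np.array([0, 1, 2])

def init_state(n_features):
    # class_count_ and feature_count_ start at zero, as on the first partial_fit call
    return np.zeros(len(classes)), np.zeros((len(classes), n_features))

def partial_fit_counts(class_count, feature_count, X, y):
    # Incremental update: counts simply accumulate across batches
    for c_idx, c in enumerate(classes):
        mask = y == c
        class_count[c_idx] += mask.sum()
        feature_count[c_idx] += X[mask].sum(axis=0)
    return class_count, feature_count

X1 = np.array([[0., 1.], [1., 0.]]); y1 = np.array([0, 1])
X2 = np.array([[1., 1.]]);           y2 = np.array([2])

cc, fc = init_state(2)
cc, fc = partial_fit_counts(cc, fc, X1, y1)   # first batch: classes {0, 1}
cc, fc = partial_fit_counts(cc, fc, X2, y2)   # second batch adds class 2

# Counts after two incremental calls must equal a one-shot fit on merged data
cc_m, fc_m = partial_fit_counts(*init_state(2),
                                np.vstack([X1, X2]),
                                np.concatenate([y1, y2]))
assert np.array_equal(cc, cc_m) and np.array_equal(fc, fc_m)
```

Since the smoothed likelihoods are a pure function of these counts, identical counts imply identical predictions, which is why the incremental and merged-data paths above are expected to agree.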

Expected behavior
For identical data and hyperparameters (default alpha=1.0 Laplace smoothing), cuML's MultinomialNB should match scikit-learn's predictions for both one-shot fit and partial_fit. After adding a sample of a new class 2, a model trained on the merged data or incrementally via partial_fit should predict class 2 for [1, 1] and class 1 for [1, 0], consistent with the standard multinomial NB formulation with Laplace/Lidstone smoothing.
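For concreteness, the expected predictions follow directly from the smoothed multinomial NB math. A minimal NumPy sketch of the per-class joint log-likelihoods on the merged data (default alpha=1.0; this mirrors the textbook formulation, not any library's implementation):

```python
import numpy as np

# Merged training data: classes {0, 1, 2}
X = np.array([[0, 1], [1, 0], [1, 1]], dtype=np.float64)
y = np.array([0, 1, 2])
alpha = 1.0  # default Laplace smoothing

classes = np.unique(y)
# Per-class feature counts, then smoothed log-likelihoods
fc = np.array([X[y == c].sum(axis=0) for c in classes])
smoothed = fc + alpha
log_theta = np.log(smoothed) - np.log(smoothed.sum(axis=1, keepdims=True))
log_prior = np.log(np.array([np.mean(y == c) for c in classes]))

def predict(x):
    # Joint log-likelihood per class: log prior + sum of count-weighted log thetas
    jll = log_prior + x @ log_theta.T
    return int(classes[np.argmax(jll)])

print(predict(np.array([0, 1])))  # -> 0
print(predict(np.array([1, 0])))  # -> 1
print(predict(np.array([1, 1])))  # -> 2
```

For [1, 1], class 0 and class 1 both score log(1/3) + log(2/3) = log(2/9), while class 2 scores 2·log(1/2) = log(1/4) > log(2/9), so class 2 wins, matching the scikit-learn reference above.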

Environment details:

  • Environment location: Bare-metal
  • Linux Distro/Architecture: Ubuntu 20.04 x86_64
  • GPU Model/Driver: A800, driver 525.147.05
  • CUDA: 12.9 (V12.9.86)
  • Python: 3.12.11
  • cuML version where it reproduces: 25.08
  • Method of cuDF & cuML install: conda
    • The full environment listing is very long, so instead here is the command line used to create the virtual environment: conda create -n rapids-25.08 -c rapidsai -c conda-forge -c nvidia rapids=25.08 python=3.12 'cuda-version>=12.0,<=12.9'

Labels: bug