
HashingVectorizer behaves differently from FeatureHasher #963

Open
brian-methodical opened this issue Feb 8, 2023 · 1 comment

Comments

@brian-methodical
Contributor

Describe the issue:
HashingVectorizer behaves differently from FeatureHasher. HashingVectorizer can work directly on raw strings, one string per document, like:

JUNK_FOOD_DOCS = (
    "the pizza pizza beer copyright",
    "the pizza burger beer copyright",
    "the the pizza beer beer copyright",
    "the burger beer beer copyright",
    "the coke burger coke copyright",
    "the coke burger burger",
)

but FeatureHasher expects each sample to be an iterable of strings (one list of tokens per document), like:

JUNK_FOOD_DOCS = [["the", "pizza", "pizza", "beer", "copyright"],
   ["the", "coke", "burger", "burger"] ]
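The difference between the two input shapes matters because a Python string is itself an iterable of one-character strings. The snippet below is a small illustration in plain Python, not code from dask-ml or its tests, and the sample document in it is made up for the example:

```python
# A raw document string is itself an iterable of one-character strings.
doc = "the pizza beer"

# Iterating the string directly yields characters, not words.
chars = list(doc)

# Splitting on whitespace yields word tokens like those in the list-of-lists form.
tokens = doc.split()

print(chars[:3])  # ['t', 'h', 'e']
print(tokens)     # ['the', 'pizza', 'beer']
```

So a consumer that expects "an iterable of strings" per sample would see characters rather than words if it were handed a raw document string.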

Which is the correct behavior:

  1. expect the hasher to parse strings into vectors
  2. fix the test by sending a list of lists of strings to FeatureHasher, instead of the list of strings that the other hasher (HashingVectorizer) expects
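A minimal sketch of what option 2 could look like, assuming a plain whitespace split is an acceptable tokenization for the test data. The helper name `tokenize_docs` is hypothetical and is not part of dask-ml or scikit-learn:

```python
JUNK_FOOD_DOCS = (
    "the pizza pizza beer copyright",
    "the pizza burger beer copyright",
    "the the pizza beer beer copyright",
    "the burger beer beer copyright",
    "the coke burger coke copyright",
    "the coke burger burger",
)


def tokenize_docs(docs):
    """Hypothetical helper: turn each raw document into a list of word tokens."""
    # str.split() with no arguments splits on any run of whitespace.
    return [doc.split() for doc in docs]


token_lists = tokenize_docs(JUNK_FOOD_DOCS)

# Each sample is now a list of strings instead of a single string.
print(token_lists[0])
print(token_lists[-1])
```

The resulting list of lists is the shape that could be passed to a `FeatureHasher` created with `input_type="string"`. Note that a whitespace split is not the same as HashingVectorizer's default tokenizer in general (which, for example, lowercases and drops single-character tokens), so the two hashers would only see the same tokens on simple data like this.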

Minimal Complete Verifiable Example:

See dask-ml/tests/feature_extraction/test_text.py: test_basic()

Anything else we need to know?:

This is illustrated by the failing test.

Environment:

  • Dask version: dask-ml-3.8 conda env dask 2023.1.1
  • Python version: 3.8, 3.9, 3.10
  • Operating System: ubuntu-latest
  • Install method (conda, pip, source): conda
brian-methodical added a commit to brian-methodical/dask-ml that referenced this issue Feb 9, 2023
@brian-methodical
Contributor Author

Here is a possible fix to this test if this is the route we wish to go: main...brian-methodical:dask-ml:fix-tests#diff-824171fe718be1c9bd2d722b5ebc30f71b6dd402568282716078a1c5ec25db1f

mmccarty pushed a commit that referenced this issue Feb 10, 2023
* text matrix

* splitting the string creates the expected input to FeatureHasher #964

* FeatureHasher issue #963

* addressing categories_ type mismatch when auto by explicitly setting dtype on test data to object #964

* reverted to just ubuntu for time saving