Make NNCTPH take in StringProfile or SparseIntegerVector? #4

Open
thiakx opened this issue Jun 21, 2016 · 1 comment

thiakx commented Jun 21, 2016

Hi. I was able to deploy LSHSuperBitNNDescentTextExample successfully on our Spark cluster. I really like the idea of pre-calculating the string profiles via ks.getProfile, and performance is good.

I am now testing NNCTPHExample and trying to feed NNCTPH the pre-calculated string profiles. Unfortunately, it seems that the NNCTPH constructor and .setSimilarity only accept String. Can we make NNCTPH take in StringProfile or SparseIntegerVector? NNCTPHExample is a lot slower than LSHSuperBitNNDescentTextExample, and I suspect it has to recalculate the profiles at every comparison. I also replaced Jaro-Winkler with the more cost-efficient Jaccard index, which improved performance slightly.
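
To make the suspected cost concrete, here is a minimal sketch of the "compute the profile once, compare many times" pattern. It does not use the library's classes: `ShingleProfiles`, `profile` and `jaccard` are hypothetical names, and a plain set of k-character shingles stands in for a real string profile. Whether NNCTPH actually recomputes profiles on every comparison is only a suspicion, not something this sketch demonstrates.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ShingleProfiles {

    // Build the set of k-character shingles of s.
    // Hypothetical stand-in for a precomputed string profile.
    public static Set<String> profile(String s, int k) {
        Set<String> shingles = new HashSet<>();
        for (int i = 0; i + k <= s.length(); i++) {
            shingles.add(s.substring(i, i + k));
        }
        return shingles;
    }

    // Jaccard index: intersection size divided by union size,
    // computed on two profiles that were built beforehand.
    public static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // convention chosen here for two empty profiles
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / union;
    }

    public static void main(String[] args) {
        String[] data = {"ABCDE", "ABCDF", "XYZ"};

        // Each profile is computed exactly once, up front.
        List<Set<String>> profiles = new ArrayList<>();
        for (String s : data) {
            profiles.add(profile(s, 2));
        }

        // Every pairwise comparison reuses the cached profiles.
        for (int i = 0; i < data.length; i++) {
            for (int j = i + 1; j < data.length; j++) {
                System.out.println(data[i] + " vs " + data[j] + ": "
                        + jaccard(profiles.get(i), profiles.get(j)));
            }
        }
    }
}
```

With n strings this builds n profiles instead of rebuilding two profiles for each of the n(n-1)/2 comparisons, which is where the saving would come from if the similarity currently takes raw strings.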

thiakx changed the title from "NNCTPH take in StringProfile or SparseIntegerVector?" to "Make NNCTPH take in StringProfile or SparseIntegerVector?" on Jun 21, 2016

tdebatty (Owner) commented

Hello,

Sorry for the late answer :-/

Your idea is good, but NNCTPH is currently not compatible with this approach:
NNCTPH needs a plain String as input so that it can compute a hash and bin the data into buckets, whereas you would like to compute the similarity on the profile representation of these strings.

One solution would be to refactor NNCTPH so that it takes an interface as input (instead of the Node class). I will run some tests and keep you informed...
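
A rough sketch of what such an interface-based input could look like, under stated assumptions: `BucketingSketch`, `Hashable`, `ProfiledString` and `bin` are hypothetical names, not part of the library, and `String.hashCode()` is only a placeholder for the real hash. The point is that each item exposes its raw string for bucketing while also carrying a precomputed profile that a similarity function could use inside each bucket.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BucketingSketch {

    /** Hypothetical input type: anything that can expose a raw string for hashing. */
    public interface Hashable {
        String hashInput();
    }

    /** Example item holding both the raw string and a profile computed once up front. */
    public static class ProfiledString implements Hashable {
        public final String value;
        public final Set<String> profile;

        public ProfiledString(String value, Set<String> profile) {
            this.value = value;
            this.profile = profile;
        }

        @Override
        public String hashInput() {
            return value;
        }
    }

    /**
     * Bin items by a hash of their raw string.
     * String.hashCode() is a placeholder only; it is NOT a
     * context-triggered piecewise hash and does not preserve similarity.
     */
    public static <T extends Hashable> Map<Integer, List<T>> bin(List<T> items, int buckets) {
        Map<Integer, List<T>> result = new HashMap<>();
        for (T item : items) {
            int bucket = Math.floorMod(item.hashInput().hashCode(), buckets);
            result.computeIfAbsent(bucket, b -> new ArrayList<>()).add(item);
        }
        return result;
    }
}
```

Because `bin` is generic over `Hashable`, the items that come back in each bucket keep their concrete type, so a similarity defined on `ProfiledString.profile` could be applied within a bucket without rebuilding anything from the string.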
