refactor KMeans #3166
Conversation
Refactor. So basically: along with adding some helper functions for training, moving methods around, and some doc changes. The GitHub comparison makes it look messy since it doesn't detect renames :(
void CKMeans::compute_cluster_variances()
I cannot make sense of this compute_cluster_variances. It is not part of the KMeans algorithm, but it leads to additional computation since it is called at the end of train_machine, and it causes a performance hit in some cases. For example, on the corel-histogram dataset, shape=(68040, 32), with k=10:
2.77 s with compute_cluster_variances
1.31 s without compute_cluster_variances
It is impossible to review this. A few high level comments:
We need another dev to give an opinion here before merging. @vigsterkr @lisitsyn?
This idea was discussed a bit here: #2558
Thanks for the comments. I think it should be OK to merge then....
@vigsterkr, from my side this can be merged. What do you think? Shall we maybe put this in a feature branch to allow for some more checks (on buildbot, for example), or merge from here? I think the new structure is fine, and Travis seems to be fine as well. ....
(It has to be rebased against #3217 before merge, though.)
Can you rebase? There will be some minor conflicts from #3217.
OK, I resolved the conflicts. Let's see Travis again.
for (int32_t j=0; j<lhs_size; j++)
    for (int32_t i=0; i<num_centers; i++)
Couldn't we add an OpenMP pragma here?
I think it is not worth it, as the loop is ultra cheap (but maybe investigate!).
Let's address the comments + rebase with the latest develop and see how Travis behaves :)
Can we push this in parallel to the other stuff you are doing, @Saurabh7?
Force-pushed from 8fe51d6 to 97479d2
OK, updated :)
lhs->free_feature_vector(vec, cluster_center_i);
}
/* Weights : Number of points in each cluster */
SGVector<int32_t> weights_set(num_centers);
Is 2^31 = 2147483648, i.e. about 2.1 billion points, enough? :) I mean, in the case of a big dataset it could be more than that, no? I'm just suggesting that maybe uint32_t, or rather int64_t or uint64_t, would be a better choice, or?
Yes, in the case that 2.1 billion points happen to be assigned to one center, it might fall short :) Should I change it to int64_t then?
Isn't weights_set[i] the number of elements assigned to the i-th centre?
Yes, it is. But again, it starts off with all points assigned to center 0, so I guess this should definitely be increased...
Ah OK, yeah, then rather use int64_t.
Force-pushed from 97479d2 to c472419
for (j=0; j<dim; j++)
{
    centers(j, min_cluster)+=
        (vec[j]-centers(j, min_cluster)) / weights_set[min_cluster];
Only one thing: a / operator (division) is usually much more costly than multiplication, so precomputing x = 1.0 / weights_set[min_cluster] outside the loop and then doing
centers(j, min_cluster) += (vec[j] - centers(j, min_cluster)) * x
would be cheaper. Of course, it'd be good to check what the compiler does with -O3.
Force-pushed from c472419 to 802b810
Updated
ALL GREEN! Good job @Saurabh7, let's merge it!