Precision over 100% reported if ground truth contains pairs of identical ids #20
We have a dirty ER workflow, where the EntityMatching graph is generated with similarity_threshold=0.0 (to get all compared edges) and then we optimize the clustering for the optimal similarity_threshold using optuna. We encountered the following: at the top end, where the threshold approaches 1.0 and the clustering therefore does not produce a lot of matches, the reported precision goes beyond 100%. I would have to dig deeper into what exactly causes this, but maybe you have an idea; possibly it is only a bug in edge cases where the number of matches is low.

best
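For context, a self-contained toy version of this setup (this is not pyJedAI's API; the graph, ground truth, and evaluation below are made up for illustration) might look like:

```python
import itertools

import networkx as nx
import optuna

# Toy similarity graph: every compared pair is kept as a weighted edge,
# which is the effect of running the matching step with
# similarity_threshold=0.0.
G = nx.Graph()
G.add_weighted_edges_from([
    ("a1", "a2", 0.95), ("b1", "b2", 0.90),
    ("a1", "b1", 0.30), ("c1", "c2", 0.60),
])
GROUND_TRUTH = {("a1", "a2"), ("b1", "b2"), ("c1", "c2")}

def predicted_pairs(graph, threshold):
    """Drop edges below the threshold, then emit every intra-cluster pair."""
    kept = nx.Graph()
    kept.add_nodes_from(graph)
    kept.add_edges_from(
        (u, v) for u, v, w in graph.edges(data="weight") if w >= threshold
    )
    pairs = set()
    for component in nx.connected_components(kept):
        pairs.update(tuple(sorted(p))
                     for p in itertools.combinations(component, 2))
    return pairs

def objective(trial):
    # Optuna searches the clustering threshold that maximizes F1.
    threshold = trial.suggest_float("similarity_threshold", 0.0, 1.0)
    pred = predicted_pairs(G, threshold)
    tp = len(pred & GROUND_TRUTH)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(GROUND_TRUTH)
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```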
Comments

Some more data. From pyJedAI/src/pyjedai/clustering.py, line 366 in 2e41af4, I printed eval_obj.__dict__.
Ah, I got it. We have matching pairs of the same id in our ground truth, i.e. rows like "id1|id1" in the CSV file. Thinking about it, this is not incorrect: an entity obviously is identical to itself, but I also see that the ground truth is not as clean as it should be. I will clean up the ground truth, but an additional approach might be to check for identity of the ids here: pyJedAI/src/pyjedai/clustering.py, line 362 in 2e41af4, and in that case not increase true_positives, to make the evaluation more robust (see the sketch below). Of course, one would also need to verify, for the clean-clean ER case and the other steps' evaluations, that the calculations remain correct and consistent.
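To illustrate the suggestion (this is a minimal sketch, not pyJedAI's actual code; gt_pairs and cluster_of are hypothetical names): a self-pair always counts as a hit because an entity trivially shares a cluster with itself, while it is never among the predicted distinct-id matches. If the clusterer predicts only 2 pairs but 3 ground-truth self-pairs are counted as true positives, precision = 3/2 = 150%.

```python
# Illustrative sketch, not pyJedAI's internals: count true positives from
# ground-truth pairs, skipping degenerate self-pairs such as ("id1", "id1").
def count_true_positives(gt_pairs, cluster_of):
    """gt_pairs: iterable of (id1, id2); cluster_of: dict id -> cluster label."""
    true_positives = 0
    for id1, id2 in gt_pairs:
        if id1 == id2:
            # A self-pair would always count as matched and inflate
            # true_positives; with precision = true_positives / predicted
            # matches, this can push precision above 100%.
            continue
        c1, c2 = cluster_of.get(id1), cluster_of.get(id2)
        if c1 is not None and c1 == c2:
            true_positives += 1
    return true_positives
```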
We hadn't considered this scenario before. I fully agree that it should be addressed, given the prevalence of errors in real-world data. We will address this by adding a validation check. Thanks for the detailed trace and feedback!
We added a drop_duplicates when we parse the GT file, here: pyJedAI/src/pyjedai/datamodel.py, line 159 in c19399a. I think this will work better. Cheers,
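A minimal sketch of that kind of cleanup at parse time (assuming the ground truth is loaded into a pandas DataFrame; the file name, column names, and separator here are hypothetical, not pyJedAI's actual parsing code):

```python
import pandas as pd

# Hypothetical GT parsing: two id columns separated by "|", as in the
# "id1|id1" rows described above.
gt = pd.read_csv("ground_truth.csv", sep="|", names=["id1", "id2"])

# Drop exact duplicate rows, the fix referenced above in datamodel.py.
gt = gt.drop_duplicates()

# Optionally also drop self-pairs, the robustness check suggested earlier
# in this thread (an entity trivially matches itself).
gt = gt[gt["id1"] != gt["id2"]]
```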