
improve correctness and speed of duplicate filter. #1144

Merged
merged 6 commits into from
Jul 25, 2023

Conversation

tsteven4
Collaborator

The duplicate filter could erroneously delete points that were not duplicates if their CRCs happened to match.

The complexity remains O(n^2), but the constant factor is reduced, resulting in shorter runtimes. The often-useless sort does not change the complexity.

waypt_del(Waypoint*) is inefficient as it requires a search of the list to find the matching waypoint. Support waypt_del with iterators.

This test uses the indicated number of waypoints, each duplicated once, with the order then randomized. The blue trace is without this PR; the red trace just uses iterators to delete the duplicates; the yellow trace also uses QMultiHash for simplicity and correctness.
[Graph: duplicate-filter runtime vs. number of waypoints, with trend lines]

Sorry to rain on the btree forest, but it's 2023 and we have libraries available.
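A minimal sketch of the collision-safe idea, using `std::unordered_multimap` as a stdlib stand-in for QMultiHash; the `Wpt` struct and key function are illustrative, not the real `Waypoint` fields. Candidates are grouped by a cheap hash key, but an entry only counts as a duplicate when the actual fields match, so a key (or CRC) collision alone can never delete a point:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical simplified waypoint; the real Waypoint has many more fields.
struct Wpt {
  std::string name;
  double lat;
  double lon;
};

// Keep the first occurrence of each distinct waypoint. A matching hash key
// only nominates a candidate; the fields themselves decide duplication.
std::vector<Wpt*> keep_unique(const std::vector<Wpt*>& pts) {
  auto key = [](const Wpt* w) {
    return std::hash<std::string>{}(w->name) ^ std::hash<double>{}(w->lat)
           ^ std::hash<double>{}(w->lon);
  };
  std::unordered_multimap<std::size_t, Wpt*> seen;
  std::vector<Wpt*> kept;
  for (Wpt* w : pts) {
    const std::size_t k = key(w);
    bool dup = false;
    auto range = seen.equal_range(k);
    for (auto it = range.first; it != range.second; ++it) {
      const Wpt* o = it->second;
      if (o->name == w->name && o->lat == w->lat && o->lon == w->lon) {
        dup = true;  // a genuine duplicate, not just a key collision
        break;
      }
    }
    if (!dup) {
      seen.emplace(k, w);
      kept.push_back(w);
    }
  }
  return kept;
}
```

Since insertion and lookup in the multimap are amortized O(1), the whole pass is expected O(n) rather than O(n^2).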

The duplicate filter could erroneously delete points that were
not duplicates if the CRCs happened to match.

waypt_del(Waypoint*) is inefficient as it requires a search of the
list to find the matching waypoint.  Support waypt_del with iterators.
@tsteven4 tsteven4 requested a review from robertlipe July 22, 2023 18:34
@tsteven4
Collaborator Author

tsteven4 commented Jul 22, 2023

This would allow us to retire util_crc.cc as well.

@tsteven4
Collaborator Author

We can do better by swapping the waypoint list and just adding back the points we want as in d0bd0fd

All my test-case times shifted for an unexplained reason, so I reran all the cases. The new green line is the latest version, with the waypoint list swap and rebuild.

[Graph: duplicate-filter runtime vs. number of waypoints, updated with the swap-and-rebuild trace]

Collaborator

@robertlipe robertlipe left a comment


This is awesome! Thank you.

I've sort of known our filtering performance was an Achilles heel, but once I got out of the GIS biz, I quit working with data sizes large enough that it bothered me enough to revisit it. I never even noticed that we tossed duplicates based on the (bad) CRC. How did THAT go unnoticed for so long?

You're correct about the bespoke btree. This code was early in our tree, when we were C only and all we had was linked lists. Everything may still look like a hammer, but QMultiHash is a MUCH better nail. (Or something like that.) After I did the initial structure for the filters, Ron Parker (rlp or parkerr in the code) knocked out a couple in short succession based on his own past GIS background. I really think the whole 'exported' concept was part of his own personal geocaching workflow and isn't really useful in the general case. (XmlTag was his, too.) I think we can drop it.

So instead of deleting from the list, it runs over the whole list and builds a Waypointlist of ones to keep and then swaps them all in when we're done? (Just the pointers and not the actual struct Waypoints, right?) Where does the original Waypointlist get deleted/emptied?

I'm sure you've thought through that because you're you, but I'd like to better understand it.

Thank you for tackling this!

P.S. What are you using to make the curve-fitting graphs?

@robertlipe
Collaborator

Actually, is gcdata.exported ever actually set anywhere other than in unicsv? (There we cargo-cult pretty much all members of Waypoint, and unicsv didn't exist in the Parker era; the duplicate filter is OLD.)

It looks like that's now a vestigial read-only variable. We used to parse it in gpx.c, but we haven't done that in years and it's not an official tag in the groundspeak schema.

I don't consider its existence in unicsv as proof that it's used/useful.

Give it a nod of agreement and I'll nuke it.

@GPSBabelDeveloper
Collaborator

GPSBabelDeveloper commented Jul 24, 2023 via email

@codacy-production

Coverage summary from Codacy

Merging #1144 (4ff119c) into master (f73ac5c) - See PR on Codacy

| Coverage variation | Diff coverage |
| --- | --- |
| -0.02% | 86.67% |

Coverage variation details

| Commit | Coverable lines | Covered lines | Coverage |
| --- | --- | --- | --- |
| Common ancestor (f73ac5c) | 23091 | 15995 | 69.27% |
| Head (4ff119c) | 23068 (-23) | 15974 (-21) | 69.25% (-0.02%) |

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details

| | Coverable lines | Covered lines | Diff coverage |
| --- | --- | --- | --- |
| Pull request (#1144) | 30 | 26 | 86.67% |

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified> / <coverable lines added or modified> * 100%


@robertlipe
Collaborator

Once you've landed this, if you agree it's the thing to do, we'll knife 'exported' with:
6996e4d

If you want to just pull that into your commit, consider it pre-approved.

@tsteven4
Collaborator Author

We run through the hash results and mark the waypoints to be deleted (L124). Then we swap the global_waypt_list with a local empty list, oldlist. Immediately after the swap, global_waypt_list is empty and oldlist has all the waypoints. We go through oldlist and either add the waypoints back to global_waypt_list or delete them. The waypoints themselves are never copied; these lists hold waypoint pointers.
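The swap-and-rebuild step can be sketched like this (illustrative names; the real code works on Waypoint/WaypointList and a wpt_flags marker, not this hypothetical `Wpt`):

```cpp
#include <cassert>
#include <list>

// Stand-in for Waypoint; the deletion mark mimics the flag set in the
// marking pass over the hash results.
struct Wpt {
  bool marked_for_deletion = false;
};

// Swap the "global" list into a local, leaving the global empty, then add
// back only the keepers. Only pointers move; waypoints are never copied,
// and each append is O(1), so the rebuild is linear.
void purge_marked(std::list<Wpt*>& global_list) {
  std::list<Wpt*> oldlist;
  oldlist.swap(global_list);       // global_list is now empty
  for (Wpt* w : oldlist) {
    if (w->marked_for_deletion) {
      delete w;                    // drop the duplicate
    } else {
      global_list.push_back(w);    // keep: only the pointer is copied
    }
  }
}
```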

I have known since converting our double-ended queue to WaypointList that waypt_del(Waypoint* wpt) was inefficient, as it involves n/2 operations on average searching the list for wpt. We can avoid that by using iterators. What I didn't realize is that QList::erase(QList::iterator pos) also apparently involves n/2 operations on average! This leads to O(n^2) complexity when we are deleting n/2 points. We can quickly append to a QList, but deleting an element from the middle, even with erase, is slow. This is not true of std::list, which "supports constant time insertion and removal of elements from anywhere in the container". When I created WaypointList I anticipated we might want to back it with std::list instead of QList. If we did, I think we could just use erase (as in intermediate commit 04f534faffc00492851c0d68757718e57cc826860) instead of swapping and rebuilding the global_waypt_list with the wpts we want to keep.
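A small sketch of the erase-while-iterating pattern that std::list makes cheap. Each std::list::erase unlinks one node in constant time and returns an iterator to the successor, so nothing is shifted and no invalidated iterator is touched:

```cpp
#include <cassert>
#include <list>

// Remove even values in one pass over the list.
std::list<int> drop_evens(std::list<int> l) {
  for (auto it = l.begin(); it != l.end();) {
    if (*it % 2 == 0) {
      it = l.erase(it);  // O(1) splice-out; returns the next element
    } else {
      ++it;
    }
  }
  return l;
}
```

With a contiguous container such as QList, the same loop would be quadratic, because each erase shifts the tail of the list.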

I am using LibreOffice Calc to generate the graphs. It supports adding the various trend lines you see, along with their equations and coefficients of determination. One must be careful to include enough points to discriminate between different trend lines when assessing the complexity.

@tsteven4
Collaborator Author

BTW, in my test case, with all wpts using empty gc data, the sort is irrelevant to performance. But I agree, the sort based on exported always struck me as a hidden pet use case. I don't think it is mentioned in the documentation at all.

@robertlipe
Collaborator

robertlipe commented Jul 24, 2023 via email

@tsteven4
Collaborator Author

To quote Qt: "For most applications, QList is the best type to use. It provides very fast appends. If you really need a linked-list, use std::list."
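A tiny illustration of why middle erasure in a contiguous container is linear, using std::vector as a stand-in for QList's array-backed storage: every element after the erased position must shift down one slot.

```cpp
#include <cassert>
#include <vector>

// Erase the second element; all trailing elements are moved left, so a
// single mid-vector erase costs O(n - pos).
std::vector<int> erase_second(std::vector<int> v) {
  v.erase(v.begin() + 1);
  return v;
}
```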

@GPSBabelDeveloper
Collaborator

GPSBabelDeveloper commented Jul 24, 2023 via email

@tsteven4
Collaborator Author

You pushed 6996e4d into this PR. I intend to revert it and merge this. Then you can add it to another PR.

@tsteven4 tsteven4 merged commit 796d517 into GPSBabel:master Jul 25, 2023
20 checks passed
@robertlipe
Collaborator

robertlipe commented Jul 25, 2023 via email

@tsteven4
Collaborator Author

If helpful I can create a branch with your changes on top of the new master, which I think is what you wanted in #1145.

robertlipe added a commit that referenced this pull request Aug 12, 2024
* improve correctness and speed of duplicate filter.

The duplicate filter could erroneously delete points that were
not duplicates if the CRCs happened to match.

waypt_del(Waypoint*) is inefficient as it requires a search of the
list to find the matching waypoint.  Support waypt_del with iterators.

* retire util_crc.cc

* improve duplicate to linear complexity

* polish new list creation.

* Remove final remnants of 'exported'

* Revert "Remove final remnants of 'exported'"

This reverts commit 6996e4d.

---------

Co-authored-by: Robert Lipe <[email protected]>