improve correctness and speed of duplicate filter. #1144
The duplicate filter could erroneously delete points that were not duplicates if the CRCs happened to match. waypt_del(Waypoint*) is inefficient as it requires a search of the list to find the matching waypoint. Support waypt_del with iterators.
This would allow us to retire util_crc.cc as well.
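To illustrate why a hash container avoids the false positives described above, here is a minimal sketch. It is not the filter's actual code: `NamedPoint`, `CollidingHash` and `find_duplicate_names` are hypothetical names, `std::unordered_map` stands in for QMultiHash, and keying on the short name alone is an assumption made for brevity. The point is that the container compares the full key on lookup, so two different keys are never treated as equal even when their hashes collide.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-in for Waypoint.
struct NamedPoint {
  std::string shortname;
  double latitude;
  double longitude;
};

// A deliberately terrible hash: every key lands in the same bucket.
// Used only to show that collisions do not produce false duplicates.
struct CollidingHash {
  std::size_t operator()(const std::string&) const { return 0; }
};

// Returns the indices of points whose shortname already appeared earlier.
// The map compares the full key, not just its hash value.
template <typename Hash = std::hash<std::string>>
std::vector<std::size_t> find_duplicate_names(const std::vector<NamedPoint>& points)
{
  std::unordered_map<std::string, std::size_t, Hash> first_seen;
  std::vector<std::size_t> duplicates;
  for (std::size_t i = 0; i < points.size(); ++i) {
    // insert() reports false in .second if the key was already present.
    if (!first_seen.insert({points[i].shortname, i}).second) {
      duplicates.push_back(i);
    }
  }
  return duplicates;
}
```

Calling `find_duplicate_names<CollidingHash>(points)` forces every key to collide and should still flag only genuine repeats, which is the property a bare CRC comparison lacks.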
We can do better by swapping the waypoint list and just adding back the points we want, as in d0bd0fd. All my test case times shifted for an unexplained reason, so I reran all the cases. The new green line is the latest version, with the waypoint list swap and rebuild.
This is awesome! Thank you.
I've sort of known our filtering performance was an Achilles' heel, but once I got out of the GIS biz, I quit working with data sizes where it bothered me enough to revisit it. I never even noticed that we tossed duplicates of the (bad) CRC. How did THAT go unnoticed for so long?
You're correct about the bespoke btree. This code was early in our tree, when we were C only and all we had was linked lists. Everything may still look like a hammer, but QMultiHash is a MUCH better nail. (Or something like that.) After I did the initial structure for the filters, Ron Parker (rlp or parkerr in the code) knocked out a couple in short succession based on his own past GIS background. I really think the whole 'exported' concept was part of his own personal geocaching workflow and isn't really useful in the general case. (XmlTag was his, too.) I think we can drop it.
So instead of deleting from the list, it runs over the whole list and builds a WaypointList of ones to keep and then swaps them all in when we're done? (Just the pointers and not the actual struct Waypoints, right?) Where does the original WaypointList get deleted/emptied?
I'm sure you've thought through that because you're you, but I'd like to better understand it.
Thank you for tackling this!
P.S. What are you using to make the curve-fitting graphs?
Actually, other than in unicsv (where we cargo cult pretty much all members of Waypoint - and which didn't exist in the Parker era - the duplicate filter is OLD) is gcdata.exported ever actually set? It looks like that's now a vestigial read-only variable. We used to parse it in gpx.c, but we haven't done that in years and it's not an official tag in the groundspeak schema. I don't consider its existence in unicsv as proof that it's used/useful. Give it a nod of agreement and I'll nuke it.
I dug for it.
93c71e4
2003 was early days for Pocket Queries (groundspeak/geocaching GPX files)
and it was common to have to order up many of them to cover an area. It
looks like this was part of his bespoke system for managing multiples so
that records from multiple dates tied into the (taa daa) duplicate
filter so he could end up with one PQ that contained the most recently
updated records. This was presumably easier than doing actual database
stuff and cracking open the logs tags and merging those, but keeping the
most recent top-level geocache information for most recent coords, cache
description, and the much-needed enabled/archived flags so you wouldn't go
into the woods and hunt for a geocache that wasn't there.
These tags didn't exist outside of his personal geocaching tools. I'll prep
a CL that removes this in a few moments...
Once you've landed this, if you agree it's the thing to do, we'll knife 'exported' with: If you want to just pull that into your commit, consider it pre-approved.
We run through the hash results and mark the waypoints to be deleted (L124). Then we swap the global_waypt_list with a local empty list, oldlist. Immediately after the swap global_waypt_list is empty, and oldlist has all the waypoints. We go through oldlist and either add the waypoints back to the global_waypt_list, or delete them. The waypoints themselves are never copied; these lists are lists of waypoint pointers.
I have known since converting our double ended queue to WaypointList that waypt_del(Waypoint* wpt) was inefficient as it involved n/2 operations on average searching the list looking for the wpt. We can avoid that by using iterators. What I didn't realize is that QList::erase(QList::iterator pos) also apparently involves n/2 operations on average! This leads to complexities of O(n^2) as we are deleting n/2 points. We can quickly add to the end of a QList, but deleting an element from the middle, even with erase, is slow.
This is not true of std::list, which "supports constant time insertion and removal of elements from anywhere in the container". When I created WaypointList I anticipated we might want to back it with std::list instead of QList. If we did so then I think we could just use erase (as in intermediate commit 04f534faffc00492851c0d68757718e57cc826860) instead of swapping and rebuilding the global_waypt_list with the wpts we want to keep.
I am using LibreOffice Calc to generate the graphs. It supports adding various trend lines, which you see, along with their equations and the coefficient of determination. One must be careful to add sufficient points to discriminate between different trend lines when assessing the complexity.
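The swap-and-rebuild step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: `Wpt`, `WptList` and `remove_marked` are hypothetical names, `std::vector` stands in for the QList-backed WaypointList, and a plain bool stands in for however the filter marks waypoints for deletion.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for Waypoint.
struct Wpt {
  int id;
  bool marked_for_deletion;
};

// Stand-in for WaypointList: a contiguous list of owning raw pointers.
using WptList = std::vector<Wpt*>;

// Removes every marked waypoint in a single O(n) pass.
// Erasing from the middle of a contiguous list shifts the tail each time,
// so erasing about n/2 elements one by one costs O(n^2) overall.
void remove_marked(WptList& global_list)
{
  WptList oldlist;
  oldlist.swap(global_list);  // global_list is now empty, oldlist holds every pointer
  global_list.reserve(oldlist.size());
  for (Wpt* wpt : oldlist) {
    if (wpt->marked_for_deletion) {
      delete wpt;                  // drop the duplicate
    } else {
      global_list.push_back(wpt);  // cheap append; only the pointer moves
    }
  }
  // oldlist goes out of scope here, releasing only its pointer storage.
}
```

Only pointers move between the two lists, and the temporary list is released when the function returns, which matches the answers given above about copying and about where the original list is emptied.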
BTW, in my test case, with all wpts using empty gc data, the sort is irrelevant to performance. But I agree, the sort based on exported always struck me as a hidden pet case. I don't think it is mentioned in the documentation at all.
Ah, lovely. I hadn't really noticed that waypoint:swap swaps the waypoint
LISTS and not actually waypoints. Dig it!
There was talk within the Qt community about turning down some of the Qt
containers (some are better, some are worse) and/or backing them with
std::foo now that you can count on good implementations everywhere - which
was definitely not the case in Qt3 days. I've fallen out of the insiders
loop and don't know where that wind is now blowing. DateTime and String are
now pretty entrenched, but I think we can swap most of the containers out
without a big deal.
I wouldn't casually mix QFoo and std::Foo, but with a good reason, my
loyalty could swing.
In my followup deleting exported (I couldn't find a sane case that ever set
it...) that sort gets deleted, too. That'll buy some clock cycles back in
your case.
When everything was a linked list, waypt_del() was cheap as it was
basically two pointer moves. Our foundational containers just changed over
the years.
Thanx!
To quote Qt: "For most applications, QList is the best type to use. It provides very fast appends. If you really need a linked-list, use std::list."
There's some truth to that. We started with a linked list.
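For comparison, a minimal sketch of what erase-in-place looks like on a linked list, where each removal is constant time. The function name `erase_even` and the use of plain ints are illustrative assumptions only; the `std::list::erase` call and its returned iterator are standard.

```cpp
#include <cstddef>
#include <list>

// Erases every even value in place and returns how many were removed.
// With std::list each erase is constant time, so the whole pass is O(n).
std::size_t erase_even(std::list<int>& values)
{
  std::size_t removed = 0;
  for (auto it = values.begin(); it != values.end();) {
    if (*it % 2 == 0) {
      it = values.erase(it);  // returns the iterator following the erased element
      ++removed;
    } else {
      ++it;
    }
  }
  return removed;
}
```

The same loop written against a contiguous container would still be correct, but each erase would shift the remaining elements, which is the quadratic behaviour measured earlier in the thread.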
You pushed 6996e4d into this PR. I intend to revert it and merge this. Then you can add it to another PR.
This reverts commit 6996e4d.
That sounds like the opposite of helpful. Sorry and thanx.
On Mon, Jul 24, 2023 at 7:28 PM tsteven4 wrote:
Merged #1144 into master.
If helpful I can create a branch with your changes on top of the new master, which I think is what you wanted in #1145.
* improve correctness and speed of duplicate filter.
  The duplicate filter could erroneously delete points that were not duplicates if the crc's happened to match. waypt_del(Waypoint*) is inefficient as it requires a search of the list to find the matching waypoint. Support waypt_del with iterators.
* retire util_crc.cc
* improve duplicate to linear complexity
* polish new list creation.
* Remove final remnants of 'exported'
* Revert "Remove final remnants of 'exported'"
  This reverts commit 6996e4d.
---------
Co-authored-by: Robert Lipe <[email protected]>
The duplicate filter could erroneously delete points that were not duplicates if the crc's happened to match.
The complexity remains O(n^2); however, the coefficient is reduced, resulting in shorter runtimes. The complexity is not changed by the often useless sort.
waypt_del(Waypoint*) is inefficient as it requires a search of the list to find the matching waypoint. Support waypt_del with iterators.
This test uses the indicated number of waypoints, which were all duplicated, and then the order was randomized. The blue trace is without this PR, the red trace includes just using iterators to delete the duplicates, the yellow trace also uses QMultiHash for simplicity and correctness.
Sorry to rain on the btree forest, but it's 2023 and we have libraries available.