NIP-29: Simple time-based Sync #826

Draft
wants to merge 8 commits into
base: master
from

Conversation

vitorpamplona
Collaborator

@vitorpamplona vitorpamplona commented Oct 16, 2023

This is a simple way to implement a sync between event databases of a client and a relay.

The goal is to be so simple that relays and clients can actually implement from scratch.

It is similar to what StrFry does, but strfry's algorithm has too many options to code, which makes an interoperable implementation very difficult/complex.

Curious to hear from @jb55, @hoytech, @mikedilger, @cameri and others who have spent time on this.

Read: https://github.com/vitorpamplona/nips/blob/negentropy-sync/29.md

Member

@staab staab left a comment


I'm glad to see this coming into the protocol. I think this could be simplified by just using a HASH verb, since filters are used anyway. That way, a client can window more or less granularly as needed.

@hoytech
Contributor

hoytech commented Oct 17, 2023

Interesting choice to encode the week as the base unit of time in the protocol! The 7-day cycle of days we use today has been unbroken since at least 60 CE, and possibly much longer -- if synchronised with Judaism then perhaps a thousand years earlier than that. This makes the week by far the most stable calendrical unit of time.

However, no matter the unit of time you choose, there will be cases where it is sub-optimal. Consider queries that match infrequent events. If they are posted once a week or less, then weekly hashes degrade into simply transferring the event IDs on every sync. Here it would be ideal to have something like monthly or yearly buckets.

Alternatively, consider syncing the comments on a thread. Most of the time a thread will have a flurry of activity in the first couple days after posting, and then go quiet after. So every time a difference is detected by a sync you will need to re-download the full thread (and potentially paginate it, since most relays limit the number of stored events that a REQ can return). Here ideally you'd use hourly or minutely buckets.

As you suggest, it is possible to have custom intervals selected by the protocol. Requiring implementations to do calendar arithmetic seems far too complicated to me. With negentropy, the matching events are divided into (by default) 16 equal-sized (by number of events, not time) buckets, and the starting timestamp for each bucket is sent.

The second problem: Once you detect a difference in the hash of a set of items, what do you do next? If you simply perform the entire query then the amount of data transferred is linear in the entire set for each sync. In the worst case where you run a separate sync for each individual event: quadratic.

With @staab's suggestion, you could split the range into half-week windows, and get each of their hashes, and recurse into the ones that differ. This would result in bandwidth overhead logarithmic in the set size.

This is effectively how negentropy works, except that many ranges can be batched together and worked on concurrently (while strictly adhering to transport frame size limits), and when a small enough range is found, the item IDs are sent directly, rather than being hashed.
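The halving idea described above can be sketched as follows. This only illustrates recursing into time windows whose hashes differ; it is not negentropy itself. The function names, the `min_span` cutoff, and the use of two in-memory event lists in place of real relay round trips are all assumptions made for the example.

```python
import hashlib
import json


def range_hash(events, since, until):
    """Hash the IDs of events with since <= created_at < until.

    Events are ordered by (created_at, id) so both sides hash the same sequence.
    """
    ordered = sorted(events, key=lambda e: (e["created_at"], e["id"]))
    ids = [e["id"] for e in ordered if since <= e["created_at"] < until]
    return hashlib.sha256(json.dumps(ids, separators=(",", ":")).encode()).hexdigest()


def find_differing_ranges(local, remote, since, until, min_span=3600):
    """Return the [since, until) windows whose hashes differ, splitting in half.

    `remote` stands in for the relay; a real client would request each hash
    over the network instead of computing it locally (assumption).
    """
    if range_hash(local, since, until) == range_hash(remote, since, until):
        return []  # identical window, nothing to fetch
    if until - since <= max(min_span, 1):
        return [(since, until)]  # small enough: download this window
    mid = (since + until) // 2
    return (find_differing_ranges(local, remote, since, mid, min_span)
            + find_differing_ranges(local, remote, mid, until, min_span))
```

Only the windows returned by `find_differing_ranges` would need to be downloaded again, which is where the logarithmic overhead comes from when differences are few.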

> strfry's algorithm has way too many options, which makes an interoperable implementation very difficult

I guess you are referring to the complexity of a negentropy implementation? Granted, this is non-trivial. However, there are 3 existing implementations and a pretty decent reference test suite. What language would be most useful for your app?

@vitorpamplona
Collaborator Author

> no matter the unit of time you choose, there will be cases where it is sub-optimal.

Yes, my hope is that it is ok to be suboptimal for simplicity's sake. We will see.

> Once you detect a difference in the hash of a set of items, what do you do next?

Clients would simply download the full week again (we already do that, so it shouldn't be that much of a problem). But I expect hash discrepancies in past weeks to be rare. They would only happen when an event from the past is rebroadcast to the relay or when there was an EOSE issue somewhere and the client stopped asking for a given range of events. Most of the time, hashes should match.

> I guess you are referring to the complexity of a negentropy implementation? Granted, this is non-trivial.

Yes, I spent some time going through it. I think it is very powerful but too hard/open for clients to be able to declare full compliance between two implementations.

Since we have so many people building relays from scratch, I think it is important to keep this as simple as possible. Anybody should be able to code and reply to these calls correctly, even if they are coding from scratch.

@arthurfranca
Contributor

What about the option of relays storing a .seen_at field to events then clients would filter using since/until/limit based on that field instead of on .created_at?

Maybe then clients could store for each filter, just the last moment it requested events from each relay. If it is enough for your syncing use case, it would be much easier for relays to implement.

@vitorpamplona
Collaborator Author

vitorpamplona commented Oct 17, 2023

> What about the option of relays storing a .seen_at field to events then clients would filter using since/until/limit based on that field instead of on .created_at?

It's orthogonal to the solution here. This could help, but it doesn't solve the problem of making sure the past has been fully synced.

@staab
Member

staab commented Oct 20, 2023

Going to restate my comment, why do we need to enforce weekly? Since we're accepting arbitrary filters here, the hardcoded window doesn't seem to do anything to help relays with performance, other than forcing clients to limit the scope of their request. But that's a pretty weak heuristic, since a year of one pubkey's data is going to be less than 10 minutes of global data. So why not just calculate the hash based on the filter, and relays can complain if the filter isn't restrictive enough?

So:

The client sends a HASH-REQ message to the relay with a subscription ID and appropriate filters for the content to be synced.

Request:

["HASH-REQ", <subscription ID string>, <nostr filter>, <nostr filter2>, <nostr filter3>]

The relay calculates the hash and responds with the following:

Response:

["HASH-RES", <subscription ID string>, <SHA256(JSON.stringify([event1.id, event2.id, event3.id, ...])) in hex>]

The client then compares the received hash with the one stored locally and, if they differ, uses the same filter to download all desired events.
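A minimal sketch of how a relay might compute the value in the `HASH-RES` message above. The proposal as quoted does not fix an event order, so sorting by `created_at` and then `id` is an assumption here, following a suggestion made later in this thread; `filter_hash` and `hash_response` are hypothetical names, and the events are assumed to have already been matched against the filters.

```python
import hashlib
import json


def filter_hash(matching_events):
    """SHA-256 over the JSON array of event IDs, hex-encoded.

    Assumption: events are ordered by (created_at, id) so that client and
    relay serialize the same sequence.
    """
    ordered = sorted(matching_events, key=lambda e: (e["created_at"], e["id"]))
    ids = [e["id"] for e in ordered]
    # Compact separators mirror JSON.stringify output for an array of strings.
    payload = json.dumps(ids, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def hash_response(subscription_id, matching_events):
    """Build the HASH-RES message as a JSON string."""
    return json.dumps(["HASH-RES", subscription_id, filter_hash(matching_events)])
```

The client would run `filter_hash` over its own stored events for the same filters and only issue a regular download when the two hex strings differ.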


Ok, so thinking about this more, this is actually ok. A binary search of hashes over a long time period would require a relay to do more work, and clients can always impose their own narrower windows if desired.

@vitorpamplona
Collaborator Author

vitorpamplona commented Oct 20, 2023

All window sizes that I could imagine came with weird problems. Calendar formatting (timezone, leap years, leap seconds, etc.), for instance, is already an issue between implementations in multiple languages. As @hoytech mentioned, "week of the year" seems to be the most stable metric among all calendar types.

An option would be to group by a substring of the first n chars of a stringified .created_at. The client sends the number of chars to be grouped by:

  • 1: * -> group by periods of ~31 years
  • 2: ** -> group by periods of ~3 years
  • 3: *** -> group by periods of 16 and a half weeks
  • 4: **** -> group by periods of 11 days and 13 hours
  • 5: ***** -> group by periods of 1 day and 3 hours
  • 6: ****** -> group by periods of 2 hours and 46 minutes
  • 7: ******* -> group by periods of ~16 minutes
  • 8: ******** -> group by periods of 1 minute and 40 seconds
  • 9: ********* -> group by periods of 10 seconds
  • 10: ********** -> group by seconds

In that way, implementers don't need to format the date but groups are less intuitive.
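The prefix grouping above could be sketched like this. It assumes 10-digit Unix timestamps (values from 10^9 up to 10^10 - 1), which is what makes each dropped digit widen the group by a factor of ten; `bucket_hashes` and `bucket_span_seconds` are hypothetical names, and the per-group hash format is an assumption rather than something the thread specifies.

```python
import hashlib
import json
from collections import defaultdict


def bucket_hashes(events, n_chars):
    """Group event IDs by the first n_chars characters of str(created_at).

    Returns {prefix: sha256 hex of the JSON array of IDs in that group}.
    """
    groups = defaultdict(list)
    for e in sorted(events, key=lambda e: (e["created_at"], e["id"])):
        groups[str(e["created_at"])[:n_chars]].append(e["id"])
    return {
        prefix: hashlib.sha256(
            json.dumps(ids, separators=(",", ":")).encode()
        ).hexdigest()
        for prefix, ids in groups.items()
    }


def bucket_span_seconds(n_chars, digits=10):
    """Width of one group in seconds for a timestamp with `digits` digits."""
    return 10 ** (digits - n_chars)
```

For example, `bucket_span_seconds(5)` is 100000 seconds, which is the "1 day and 3 hours" entry in the list (more precisely 1 day, 3 hours, 46 minutes and 40 seconds).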

But the goal is to make a Sync that is stupidly simple to implement.

@staab
Member

staab commented Oct 20, 2023

Right, I was suggesting omitting time-based chunking entirely. But I convinced myself.

@vitorpamplona
Collaborator Author

Time-based chunking seems right because sync is only useful when reprocessing past events.

We have since/until to deal with downloading events since the last time the user was online. For everything else, sync should be used.

@staab
Member

staab commented Oct 20, 2023

I just reviewed the negentropy protocol, and while it's a little more complex and not as tidy, there are good libraries available for multiple languages. It's also already implemented in strfry, and probably elsewhere. How much better is negentropy in terms of time/space complexity compared with simple sync? I'm inclined to just use what already exists unless there's a good reason not to. @hoytech, I would definitely like your opinion on how best to get this into a NIP.

@vitorpamplona
Collaborator Author

Frankly I don't see any relay dev coding a fully interoperable interface with what strfry has. Much less all the other ad-hoc relays that were coded from scratch that we see out there. Even as a Client, that just needs to code one of the multiple ways to use the protocol, it took me forever to figure out how to code it. Imagine relays that must support all Negentropy options.

That's why I made this PR. The goal is to create something where the end result is very similar in time/costs at a fraction of the complexity of the full protocol.

@jb55
Contributor

jb55 commented Oct 21, 2023 via email

@vitorpamplona
Collaborator Author

> Why not just standardize around that?

If we can get people to actually code it, sure.

@mikedilger
Contributor

mikedilger commented Oct 21, 2023

Sorting by created_at is not sufficient as you'll get multiple events with the same timestamp. You should subsequently sort by id.

If you only got one hash at a time, you could specify any time period with 'since' and 'until'. EDIT: I presume you want the time periods fixed so relays can have the hashes pre-calculated

> What about the option of relays storing a .seen_at field to events then clients would filter using since/until/limit based on that field instead of on .created_at?

> It's orthogonal to the solution here. This could help, but it doesn't solve the problem of making sure the past has been fully synced.

I'm still in favor of seen_at filters and maybe I should re-open that.

If I know that I got all the events from a relay up to seen_at time X, then when I ask again later I know that ALL events I might be missing must have flowed in after seen_at time X, even if the created_at times are all over the place. The only slightly tricky bit is that IF the timestamp I put in for seen_at is in the future according to the relay clock, then it won't have seen everything up to that stamp just yet... so the relay should cap the request to the relay's now and indicate that somehow in the reply. This was suggested long long ago.
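One reading of the capping rule described above, as a small relay-side sketch. The storage layout (a list of `(received_at, event)` pairs), the function name, and returning the capped timestamp alongside the events are all assumptions; the thread does not define a wire format for this.

```python
import time


def events_seen_since(stored, seen_at, now=None):
    """Return events received at or after `seen_at`, capped to the relay's clock.

    stored: list of (received_at, event) pairs kept by the relay (assumption).
    Returns (events, effective_seen_at) so the client can tell that a
    future timestamp was capped to the relay's current time.
    """
    now = int(time.time()) if now is None else now
    effective = min(seen_at, now)  # never trust a timestamp ahead of our clock
    # ">=" may resend events from the boundary second; a client can drop
    # duplicates by event id.
    matches = [event for received_at, event in stored if received_at >= effective]
    return matches, effective
```

A client following this sketch would compare the returned timestamp with the one it sent to notice when its request was capped.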

@vitorpamplona
Collaborator Author

vitorpamplona commented Oct 21, 2023

> Sorting by created_at is not sufficient as you'll get multiple events with the same timestamp. You should subsequently sort by id.

Agreed.

> I'm still in favor of seen_at filters and maybe I should re-open that.

So, basically, relays must store a received_at date for each event and then the seen_at filter is just like since but for the received_at dates instead of created_at?

@mikedilger
Contributor

> So, basically, relays must store a received_at date for each event and then the seen_at filter is just like since but for the received_at dates instead of created_at?

Yep. It would prevent gaps. But I agree there are still other reasons to do what this NIP is suggesting.

@arthurfranca
Contributor

Well don't want to sound repetitive but this is exactly NIP-34 PR as I said above xD It only needs some repo maintainer review

@hoytech
Contributor

hoytech commented Oct 23, 2023

Thanks guys! Currently I still consider negentropy experimental so I don't think we should turn it into a NIP yet. Based on my discussions with the author of the paper negentropy is based on, there is one more change I'm making to the protocol. I'm also going to simplify it slightly, and remove the idSize parameter -- that's pretty much the only exposed option now.

I'm working on an article now about why Range-Based Set Reconciliation (RBSR) is so cool. I think it will be a very important building block for internet protocols in the next few years, since it has many compelling advantages over merkle search trees and other similar approaches. I'll let you all know when my article is ready.

A couple notes:

  • Bucketing by time: Negentropy does not directly bucket by time. For example, imagine you are using a week interval, and detect there are changes, so you try bucketing it by days. But say all the matching events were created in one particular day. In this case you have made no progress (haven't identified any missing events, or recursed into actual event-containing ranges). Negentropy always makes progress in every message in either direction, because it creates equal-sized buckets according to the local number of events (independent of their timestamps).
  • seen_at event metadata: Using this relies on keeping relay-specific state, whereas the content-addressed sync types (both time-bucketing and RBSR) do not. If you connect to many relays this becomes less efficient (because you may be downloading the same event N times). Also, relays can unexpectedly re-build their DBs (potentially throwing away or resetting seen_at metadata), or multiple relays can be behind a load-balancer and not have consistent seen_ats, or their clocks can be changed, or events can be deleted and re-inserted, or any number of other things. In a chaotic environment like nostr, I think syncing by content is preferable.
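The difference between bucketing by time and bucketing by count can be seen in a small sketch. This is only an illustration of the splitting strategy, not the negentropy wire protocol; the function names and the default of 16 buckets applied this way are assumptions.

```python
def split_by_day(events):
    """Group events by UTC day (created_at // 86400)."""
    days = {}
    for e in sorted(events, key=lambda e: (e["created_at"], e["id"])):
        days.setdefault(e["created_at"] // 86400, []).append(e)
    return list(days.values())


def split_by_count(events, buckets=16):
    """Split events into at most `buckets` chunks of near-equal size,
    ordered by (created_at, id), regardless of how timestamps cluster."""
    ordered = sorted(events, key=lambda e: (e["created_at"], e["id"]))
    if not ordered:
        return []
    size = -(-len(ordered) // buckets)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def bucket_starts(chunks):
    """Starting timestamp of each chunk, as would be sent to the other side."""
    return [chunk[0]["created_at"] for chunk in chunks]
```

When every matching event falls on the same day, `split_by_day` returns a single group and narrows nothing, while `split_by_count` still divides the set into smaller ranges that can be compared separately.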

@vitorpamplona
Collaborator Author

vitorpamplona commented Oct 23, 2023

Interesting point on the seen_at.

If Negentropy clusters by a number of events in order, it feels like if there is an added event right at the beginning of the order, all clusters would be affected and the resulting hashes (or resulting created_ats) would be different. Is that correct?

If that is true, then the algorithm on the client side that figures out which clusters to dive into is a bit more complex. Isn't it?

@staab
Member

staab commented Oct 23, 2023

> I think it will be a very important building block for internet protocols in the next few years

This should be the vision for this NIP. Efficient sync is good for more than syncing relays; special-purpose clients or DVMs for fulfilling advanced queries across many relays (search, count, find replies, trending) would be well served by an efficient sync mechanism. If it were possible to do the sync in real time in response to a user request, that would open up a lot of really interesting use cases.

@jb55
Contributor

jb55 commented Oct 24, 2023 via email

@hoytech
Contributor

hoytech commented Oct 24, 2023

> If Negentropy clusters by a number of events in order, it feels like if there is an added event right at the beginning of the order, all clusters would be affected and the resulting hashes (or resulting created_ats) would be different. Is that correct?

> If that is true, then the algorithm on the client side that figures out which clusters to dive into is a bit more complex. Isn't it?

Yes, with the current negentropy version, any stored events that land in a range that you've pre-computed a hash for will need re-computing. This is what I am working on now: Parameterising an incremental hash function with adequate security. This will let you add or subtract event IDs out of a hash without touching any other events. It is literally just treating the hashes as numbers and adding them. There are more secure approaches but they involve complicated elliptic curve dependencies, which I really want to keep out of the negentropy spec. Anyway, finding collisions in hash additions takes a fair amount of resources/time. I wrote a program to attack this as best I could: https://github.com/hoytech/birthday-collisions (and there are several additional countermeasures that will make it harder still).
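The "treat the hashes as numbers and add them" idea can be illustrated as follows. This is a bare sketch of additive accumulation only: the modulus of 2^256 and the function names are assumptions, and it includes none of the parameterisation or countermeasures mentioned above, so it should not be read as the scheme negentropy ends up using.

```python
MODULUS = 2 ** 256  # assumption: wrap at the width of a 32-byte event ID


def add_id(acc, event_id_hex):
    """Fold one event ID (hex string) into the accumulator."""
    return (acc + int(event_id_hex, 16)) % MODULUS


def remove_id(acc, event_id_hex):
    """Take one event ID back out without touching any other ID."""
    return (acc - int(event_id_hex, 16)) % MODULUS


def fingerprint(event_ids):
    """Accumulate a whole set; addition makes the result order-independent."""
    acc = 0
    for event_id in event_ids:
        acc = add_id(acc, event_id)
    return acc
```

Because addition is commutative, inserting an event at the start of a range only requires one `add_id` call on the stored accumulator rather than rehashing the range.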

@hoytech
Contributor

hoytech commented Oct 24, 2023

> Thanks for the heads up! You think it might be ready to spec after this simplification? I will hold off implementing until it's updated.

Yes definitely hold off for now. I think it should be good to go after this update, but maybe some people will have good feedback after I publish the article, so we'll see!

@vitorpamplona
Collaborator Author

> Yes, with the current negentropy version, any stored events that land in a range that you've pre-computed a hash for will need re-computing.

Then, I assume that because it is pre-computed, the base negentropy algo doesn't really have a way to specify the filter you want to sync. Or does it?

For instance, Amethyst would like to sync only the event set where the user is the author or p-tagged + all Kind:0s and all Kind:10002s for follows + follows of follows of the user.

I wasn't sure there was a smart algo somewhere to allow the pre-computation for different filters, so this PR requires computing hashes on the fly all the time. But maybe there is a way to pre-compute them.

@hoytech
Contributor

hoytech commented Oct 24, 2023

Yes, it allows syncing an arbitrary filter. It always computes the hashes when needed: As far as I know there isn't really a way to avoid this given arbitrary filters.

With incremental hashes you could build this into a tree, and that will improve the efficiency of the sync somewhat, but you'll still need to run the query, read in the matching IDs and re-build the tree on each query, which probably isn't worth it unless filters will match millions of results.

@vitorpamplona
Collaborator Author

vitorpamplona commented Oct 24, 2023

Considering the under-development status of @hoytech's approach, do we want to advocate for a simpler version now and deprecate it next year or do we want to wait for it?

I could really use a sync right now but I understand if people want to wait.

@staab
Member

staab commented Oct 24, 2023

Depends on the timeline, I'd be ok with two versions, especially if they're very substantially different in complexity/capability, which it seems these are.

@weex

weex commented Oct 26, 2023

> I could really use a sync right now but I understand if people want to wait.

It would be great to see this and other solutions (like #579) being tried now to save battery and bandwidth for everyone on mobile.

A GitHub Draft of negentropy-sync would also be awesome so that discussion can be easy to follow later.

To the question of implementation complexity, this is what libraries are for and they will come, but first we need specifications.

@vitorpamplona
Collaborator Author

> To the question of implementation complexity, this is what libraries are for and they will come, but first we need specifications.

I am not sure if I agree with this. Depending on very common libraries is ok, like with SHA-256 for instance. Depending on large, complex, and opinionated libraries is not good. They might come, but it could take years to get a good chunk of languages covered by good library options that are fully interoperable with one another.

I would prefer if we made sure Nostr can be easily implemented/assembled in any language.

@Semisol Semisol self-requested a review October 30, 2023 17:45
Collaborator

@Semisol Semisol left a comment


Why not use negentropy, and/or simplify the negentropy spec?

@vitorpamplona
Collaborator Author

vitorpamplona commented Nov 8, 2023

After some testing, I now think the flexibility in window size is important. So, I moved away from a fixed spec with weekly hashes to a spec based on the first n-chars of .created_at. The client can specify how to truncate the timestamp to create groups of multiple sizes, bound by the filters in the subscription.

It looks more complicated, but it is actually simpler than the previous one.


8 participants