feat: Fetcher field selection and optimistic filter evaluation #491

Closed
jsimnz wants to merge 21 commits

Member

@jsimnz jsimnz commented Jun 1, 2022

RELEVANT ISSUE(S)

Resolves #490

DESCRIPTION

Refactors the db document fetcher to handle

  1. Field selection
  2. Filter selection

A side effect is that the fetcher now uses BadgerDB's efficient key-only iteration, which required additional changes in the go-datastore interfaces, BadgerDB, and the BadgerDB datastore interface.

This yields a significant performance improvement (2-4x), depending on the type of query.

This is a pretty notable change, not in its design, but in its side effects on the other packages. We can track the required changes for the dependent repos:
BadgerDB - sourcenetwork/badger#1
go-datastore - sourcenetwork/go-datastore#1

This PR is an extreme WIP, still has lingering artifacts from the several previous attempts at this goal, and debug print statements.

Additionally, many additional benchmarks have been added throughout this effort and need to be cleaned up.

HOW HAS THIS BEEN TESTED?

At the moment it uses the existing integration testing suite. I focused on the query tests; there may be others that this breaks.

CHECKLIST:

  • I have commented the code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the repo-held documentation.
  • I have made sure that the PR title adheres to the conventional commit style (a subset of the ones we use can be found under: tools/configs/chglog/config.yml)

ENVIRONMENT / OS THIS WAS TESTED ON?

Please specify which of the following was this tested on (remove or add your own):

  • Arch Linux
  • Debian Linux
  • MacOS
  • Windows

@jsimnz jsimnz added area/db-system Related to the core system related components of the DB feature New feature or request labels Jun 1, 2022
@jsimnz jsimnz added this to the DefraDB v0.3 milestone Jun 1, 2022
Member Author

jsimnz commented Jun 1, 2022

note: Depending on your preferred review style, you can ignore the commit history, as I kept the previous attempts' progress/WIP commits when trying different approaches (for historical reasons). So you can scope your review to the files changed tab as that will show only the most recent approach

@jsimnz jsimnz requested a review from AndrewSisley June 1, 2022 22:44
Member Author

jsimnz commented Jun 1, 2022

Benchmark results for the bench/query/simple benches only. Base (original) commit used: 9263b5d.

Uses -count 3 for benchmarks, everything else default (e.g. benchtime).

name                                               old time/op  new time/op  delta
_Query_UserSimple_Query1_WithFilter_Sync_1-12       249µs ± 5%   243µs ± 1%   -2.30%  (p=1.000 n=3+3)
_Query_UserSimple_Query1_WithFilter_Sync_10-12      413µs ± 3%   341µs ± 1%  -17.49%  (p=0.100 n=3+3)
_Query_UserSimple_Query1_WithFilter_Sync_100-12    2.25ms ± 0%  1.32ms ± 1%  -41.39%  (p=0.100 n=3+3)
_Query_UserSimple_Query1_WithFilter_Sync_1000-12   21.0ms ± 0%  10.6ms ± 0%  -49.38%  (p=0.100 n=3+3)
_Query_UserSimple_Query1_WithFilter_Sync_10000-12   208ms ± 1%   105ms ± 1%  -49.36%  (p=0.100 n=3+3)
_Query_UserSimple_Query2_WithFilter_Sync_1000-12   20.4ms ± 2%   9.7ms ± 1%  -52.28%  (p=0.100 n=3+3)
_Query_UserSimple_Query2_WithFilter_Sync_10000-12   202ms ± 0%    97ms ± 4%  -51.86%  (p=0.100 n=3+3)
_Query_UserSimple_Query3_WithFilter_Sync_1000-12   20.4ms ± 1%   8.7ms ± 1%  -57.42%  (p=0.100 n=3+3)
_Query_UserSimple_Query3_WithFilter_Sync_10000-12   203ms ± 4%    84ms ± 1%  -58.62%  (p=0.100 n=3+3)
_Query_UserSimple_Query4_WithFilter_Sync_1000-12   20.5ms ± 1%  10.1ms ± 7%  -50.80%  (p=0.100 n=3+3)
_Query_UserSimple_Query4_WithFilter_Sync_10000-12   202ms ± 1%    95ms ± 0%  -52.91%  (p=0.100 n=3+3)
_Query_UserSimple_Query5_WithFilter_Sync_1000-12   20.8ms ± 1%   7.8ms ± 1%  -62.36%  (p=0.100 n=3+3)
_Query_UserSimple_Query5_WithFilter_Sync_10000-12   208ms ± 3%    73ms ± 1%  -64.73%  (p=0.100 n=3+3)
_Query_UserSimple_Query6_WithFilter_Sync_1000-12   20.9ms ± 1%  10.7ms ± 2%  -48.69%  (p=0.100 n=3+3)
_Query_UserSimple_Query6_WithFilter_Sync_10000-12   207ms ± 0%   109ms ± 7%  -47.25%  (p=0.100 n=3+3)
_Query_UserSimple_Query7_WithFilter_Sync_1000-12   20.4ms ± 2%   8.8ms ± 2%  -57.10%  (p=0.100 n=3+3)
_Query_UserSimple_Query7_WithFilter_Sync_10000-12   205ms ± 1%    83ms ± 3%  -59.38%  (p=0.100 n=3+3)
_Query_UserSimple_Query8_WithFilter_Sync_1000-12   20.2ms ± 1%   8.3ms ± 1%  -58.79%  (p=0.100 n=3+3)
_Query_UserSimple_Query8_WithFilter_Sync_10000-12   199ms ± 0%    81ms ± 1%  -59.33%  (p=0.100 n=3+3)
_Query_UserSimple_Query9_WithFilter_Sync_1000-12   20.5ms ± 2%   7.5ms ± 1%  -63.55%  (p=0.100 n=3+3)
_Query_UserSimple_Query9_WithFilter_Sync_10000-12   206ms ± 5%    77ms ±11%  -62.52%  (p=0.100 n=3+3)
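
To relate the delta column to the speedup figure quoted in the description, a relative change can be converted into a "times faster" factor. The sketch below is only illustrative: `speedup` and `deltaPercent` are hypothetical helpers, and the inputs are the rounded timings printed in the table, so the computed percentages can differ slightly from the delta column.

```go
package main

import "fmt"

// speedup reports how many times faster the new timing is than the old one.
func speedup(oldTime, newTime float64) float64 {
	return oldTime / newTime
}

// deltaPercent reports the relative change in percent; negative means faster.
func deltaPercent(oldTime, newTime float64) float64 {
	return (newTime - oldTime) / oldTime * 100
}

func main() {
	// Rounded timings in milliseconds, copied from two rows of the table above.
	rows := []struct {
		name         string
		oldMs, newMs float64
	}{
		{"Query1_WithFilter_Sync_1000", 21.0, 10.6},
		{"Query5_WithFilter_Sync_10000", 208, 73},
	}
	for _, r := range rows {
		fmt.Printf("%s: %.2fx faster (%.1f%%)\n",
			r.name, speedup(r.oldMs, r.newMs), deltaPercent(r.oldMs, r.newMs))
	}
}
```

Applied to the rows with 1000 or more documents, this works out to roughly 1.9x to 2.8x for the benchmarks listed here.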

@@ -120,7 +120,7 @@ func NewDataStoreKey(key string) DataStoreKey {
 	} else {
 		indexOfDocKey = numberOfElements - 1
 	}
-	dataStoreKey.DocKey = elements[indexOfDocKey]
+	dataStoreKey.DocKey = strings.Split(elements[indexOfDocKey], ":")[0]
Contributor

😱

Collaborator

Why so scared @AndrewSisley? lol

Contributor

PTSD coming back from the previous key-code 😆 We used to have some very horrible and unsafe string magic, both in the original, and my first refactor of it #84.

I'm guessing this can be removed, as I didn't spot any other refs to it last night - but I only gave it a quick scan as my mental energy was quite low and it is a complicated PR.

Collaborator

The strings.Split function is pretty safe here. It will always return a slice with a length of at least one, so using [0] will never panic with an index out of range.

Member Author

Actually have no memory of doing this or why it's here. Prob related to a bug I noticed a while back w.r.t. the instance type and doc key not being properly parsed. Will def clean up/make safer

Contributor

looked like that was the case - will comment more if it stays 😆

@@ -156,6 +160,7 @@ type Spans []Span

 // KeyValue is a KV store response containing the resulting core.Key and byte array value
 type KeyValue struct {
+	Res dsq.Result
Contributor

suggestion: I know this is super WIP code, so I'm guessing this isn't the long-term plan - but just in case you miss it in the cleanup or whatever, I really don't think this should live here, and the fetcher might need to define its own internal KeyValue struct or similar instead of leaking this through here.

Member Author

Agreed in keeping this internal to the fetcher

// 2) we have a filter and its a filter field
// 3) we have passed the filter
// then get the value
// otherwise itll be lazy loaded down the line
Contributor

suggestion: I would be really, really cautious about making this lazy outside of the fetcher. This is an I/O/file/system/whatever operation, and making it too lazy could result in some misleading benchmarks and nasty reliability issues, as well as the more obvious leaking of concerns/concepts through to other areas of the codebase.

I think I might be more concerned about this than the required modifications to badger/etc, and would be curious as to how much you think we gain from this.

Member Author

This is lazy within the fetcher. It'll be resolved before fetcher.FetchNext returns.

It's lazy in case the entire document is ignored because the filter doesn't pass

Member Author

In which case there's no point spending the time to copy the value bytes from badger
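
The behaviour described in this thread can be modelled in a few lines. This is a rough sketch under stated assumptions, not the PR's implementation: `lazyValue`, `doc`, and `fetchNext` are hypothetical stand-ins, documents are plain in-memory maps rather than Badger items, and a counter stands in for the cost of copying value bytes. The value of the filter field is read first; the remaining selected fields are only read once the filter has passed, and everything is resolved before the function returns.

```go
package main

import "fmt"

// lazyValue stands in for a KV item whose value is costly to copy.
// Each call to Value counts as one copy of the value bytes.
type lazyValue struct {
	raw   string
	reads *int
}

func (v lazyValue) Value() string {
	*v.reads++
	return v.raw
}

// doc is a hypothetical document model: field name -> lazily read value.
type doc map[string]lazyValue

// fetchNext returns the next document at or after start that passes the
// filter, with the selected fields resolved, plus the index to resume from.
// For rejected documents only the filter field is ever read.
func fetchNext(docs []doc, start int, filterField string,
	pass func(string) bool, selected []string) (map[string]string, int) {
	for i := start; i < len(docs); i++ {
		item, ok := docs[i][filterField]
		if !ok {
			continue
		}
		filterValue := item.Value()
		if !pass(filterValue) {
			continue // remaining fields of this document are never read
		}
		out := make(map[string]string, len(selected))
		for _, f := range selected {
			if f == filterField {
				out[f] = filterValue // reuse the value read for the filter
			} else if v, ok := docs[i][f]; ok {
				out[f] = v.Value()
			}
		}
		return out, i + 1
	}
	return nil, len(docs)
}

func main() {
	reads := 0
	mk := func(s string) lazyValue { return lazyValue{raw: s, reads: &reads} }
	docs := []doc{
		{"name": mk("Alice"), "age": mk("30"), "bio": mk("long text")},
		{"name": mk("Bob"), "age": mk("25"), "bio": mk("long text")},
		{"name": mk("Carol"), "age": mk("30"), "bio": mk("long text")},
	}
	is30 := func(v string) bool { return v == "30" }
	for d, next := fetchNext(docs, 0, "age", is30, []string{"name"}); d != nil; d, next = fetchNext(docs, next, "age", is30, []string{"name"}) {
		fmt.Println(d["name"])
	}
	fmt.Printf("value reads: %d of %d stored values\n", reads, 9)
}
```

Because callers only ever see fully resolved results, the laziness stays inside the fetch step and does not leak to other parts of the codebase.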

Contributor

Ah okay - got it. When cleaning up, it might be worth tweaking this comment then (depending on what the eventual code looks like), as to me it read like the laziness would be leaked outside the fetcher, which is much scarier/more important

@jsimnz jsimnz self-assigned this Jun 3, 2022
@jsimnz jsimnz added the action/no-benchmark Skips the action that runs the benchmark. label Jun 7, 2022
@jsimnz jsimnz modified the milestones: DefraDB v0.3, DefraDB v0.4 Aug 4, 2022
@jsimnz jsimnz modified the milestones: DefraDB v0.4, DefraDB v0.5 Jan 20, 2023
Member Author

jsimnz commented May 12, 2023

Closing as this is too old and there is a new PR with a diff approach #1500

@jsimnz jsimnz closed this May 12, 2023
jsimnz added a commit that referenced this pull request Jun 27, 2023
## Relevant issue(s)

Resolves #490 
Resolves #1582 (indirectly)

## Description

This is a reduced version of #491. It takes a very different approach,
and tries to keep as much of the existing Fetcher structure as possible.

Basically, this will try to eagerly ignore documents that don't pass the
given filter at the fetcher level. This means we can apply more
optimizations than if the filter were applied at the scanNode level, as
before.
Successfully merging this pull request may close these issues.

Fetcher field selection and filter optimization