Fix regression that closed the warc filereader too early #83

Merged
merged 47 commits into from
Nov 10, 2023

@maeb maeb commented Oct 30, 2023

This PR reverts the commit that caused the WARC file reader to be closed prematurely.

In addition, it includes a number of other fixes and dependency updates.

maeb added 30 commits October 26, 2023 14:27
github.com/nlnwa/whatwg-url changed serialization for empty query values
This reverts commit b48b83b.

The commit caused the warc records to be closed too early.
This commit ignores validation errors when loading records because
the record may be valuable even though it is not valid.
This commit adds specific error messages such that it is possible
to differentiate between different errors.
This commit changes the behaviour when searching for a single URL, with or
without a scheme, if the match type is exact or prefix. The search always
covers both http and https to get as many results as possible.

To differentiate between schemes, use the coreserver API.
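
A minimal sketch of the idea behind searching both schemes, assuming a hypothetical helper named `schemeVariants` (not taken from the PR). It only handles lowercase `http://` and `https://` prefixes and does no other URL normalization:

```go
package main

import (
	"fmt"
	"strings"
)

// schemeVariants returns the http and https forms of a URL, whether or not
// the input already carries one of those schemes.
// Hypothetical helper for illustration; not the project's actual code.
func schemeVariants(raw string) []string {
	// Strip a leading https:// or http:// if present; anything else is kept.
	rest := strings.TrimPrefix(strings.TrimPrefix(raw, "https://"), "http://")
	return []string{"http://" + rest, "https://" + rest}
}

func main() {
	for _, u := range schemeVariants("example.com/path") {
		fmt.Println(u)
	}
}
```

Each variant would then be used as a separate exact or prefix lookup, with the results merged.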
This commit refactors the ListStorageRef method
such that all writes to the result channel are handled
in the same code block for better readability.
This commit refactors code to be easier to comprehend.
The cdx response creation is wrapped in a lambda function.
This commit uses the known prefix (key) to narrow the search to prefix.
This commit fixes the test to continue with the next case if the previous
case failed.
This commit replaces CRLF with LF and simplifies the code by
writing an LF after every response.

See also https://jsonlines.org/ which states that LF is standard.
The old test cases are no longer valid after the ssurt package changed
its behavior.
This commit returns an empty response when the search was empty.
This enables the handler to return a 404 and be sure nothing has
already been written to the response.
This commit enables filtering and limiting responses when using tikv
closest api.
This commit comes in the name of readability.
This commit ensures a response when
limit is set to 0 in tikv methods.

Default to TiKV MaxRawKVScanLimit.
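
A small sketch of the defaulting rule, assuming a hypothetical helper `effectiveLimit`. The constant below merely stands in for the TiKV client's `MaxRawKVScanLimit`, and its value here is a placeholder:

```go
package main

import "fmt"

// maxScanLimit stands in for TiKV's MaxRawKVScanLimit.
// The value is a placeholder for illustration; use the client's constant.
const maxScanLimit = 10240

// effectiveLimit maps an unset limit (0) to the maximum scan limit so that
// a scan with limit 0 still returns results.
func effectiveLimit(limit int) int {
	if limit == 0 {
		return maxScanLimit
	}
	return limit
}

func main() {
	fmt.Println(effectiveLimit(0), effectiveLimit(25))
}
```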
This commit skips adding a record to the write batch if the
cdx key is larger than tikv max key size.
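
The guard could look roughly like the following. The `kv` type, the `buildBatch` function, and the `maxKeySize` parameter are all assumptions for the sketch; the real limit comes from the TiKV configuration:

```go
package main

import "fmt"

// kv is a hypothetical key/value pair destined for a write batch.
type kv struct {
	key, value []byte
}

// buildBatch keeps only entries whose key fits within maxKeySize and
// reports how many entries were skipped.
func buildBatch(entries []kv, maxKeySize int) (batch []kv, skipped int) {
	for _, e := range entries {
		if len(e.key) > maxKeySize {
			skipped++ // key too large for the store; leave it out of the batch
			continue
		}
		batch = append(batch, e)
	}
	return batch, skipped
}

func main() {
	entries := []kv{
		{key: []byte("short"), value: []byte("a")},
		{key: []byte("a-much-longer-key"), value: []byte("b")},
	}
	batch, skipped := buildBatch(entries, 8)
	fmt.Println(len(batch), skipped)
}
```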
This commit refactors the badger search methods
such that all writes to the result channel are handled
in the same code block for better readability.
This commit enables filtering and limiting responses when
using badger closest api.
This commit refactors sorted parallel search by inlining
the traversal of sorted items. This reduces the
abstraction level to make it easier to follow the logic.
This commit removes a comment that adds no value.
maeb added 17 commits October 30, 2023 11:40
This commit bumps dependencies and as a consequence of
this updates the error checking that changed in the
github.com/nlnwa/whatwg-url library.
This commit refines the release tag to match semver only and
not any tag starting with the letter 'v'.
This commit changes the test workflow to trigger on all push events.
This commit updates the build base image to golang 1.21.
This commit changes the error messages in the filestorageloader
to provide more information about the error.
This commit returns a special error to signal that the given
WARC-Refers-To id could not be found, meaning it is not in the
index.
This commit fixes the resolve method in the badger implementation to
check whether the key was found. This means that the function will return
an empty storageRef and a nil error if the key is not found.
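
The described contract can be sketched with a map-backed stand-in for the store. Badger itself reports a missing key with `badger.ErrKeyNotFound`; to stay self-contained the example uses its own sentinel, `errKeyNotFound`, and a hypothetical `store` type instead of the real database:

```go
package main

import (
	"errors"
	"fmt"
)

// errKeyNotFound stands in for the store's not-found error.
var errKeyNotFound = errors.New("key not found")

// store is a hypothetical in-memory replacement for the key-value database.
type store map[string]string

func (s store) get(key string) (string, error) {
	v, ok := s[key]
	if !ok {
		return "", errKeyNotFound
	}
	return v, nil
}

// resolve returns an empty storage ref and a nil error when the key is
// absent; any other error is passed on to the caller.
func resolve(s store, id string) (string, error) {
	ref, err := s.get(id)
	if errors.Is(err, errKeyNotFound) {
		return "", nil
	}
	if err != nil {
		return "", err
	}
	return ref, nil
}

func main() {
	s := store{"id-1": "warcfile:example.warc.gz:0"}
	ref, err := resolve(s, "missing")
	fmt.Printf("%q %v\n", ref, err)
}
```

With this contract the caller has to check for an empty ref explicitly, since a nil error no longer implies that a record was found.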
This commit adds a check to see if the value slice is empty in the
resolve method.
This commit changes the behaviour to return HTTP status code 404
instead of 500 if the record referred to by a revisit record could
not be found.
@maeb maeb merged commit a8500e6 into main Nov 10, 2023
3 checks passed
@maeb maeb deleted the fix/close-warc-file-reader-too-early branch November 10, 2023 14:25