-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor GitScanner
and simplify implementation
#5389
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
chrisd8088
force-pushed
the
refactor-gitscanner
branch
from
June 7, 2023 22:48
8c16837
to
5e9cff8
Compare
In commit d0e950d of PR git-lfs#3978 the ScanMultiRangeToRemote() method of the GitScanner structure was added, and the sole caller of the existing ScanRangeToRemote() method was revised to call the new scan method instead. As no additional callers of the existing method have been introduced since then, we can simplify our code by removing the unused method now.
In commit 4049928 of PR git-lfs#4209 the scanStashed() function of the "lfs" package was revised to no longer make use of the *GitScanner argument that was part of its original implementation from commit 2dc718b in the same PR. However, the argument was never dropped from the function's signature, so we can remove it now.
In commit d5fd152 of PR git-lfs#1745 the "git lfs fetch" command, among others, was revised to set the Filter element of the GitScanner structure, and no longer pass a common GitScanner created in the main fetchCommand() function to various other functions such as scanAll() and fetchAll(). As a result, the GitScanner structure created in the fetchCommand() function was no longer used, but it was not fully removed at the time, as was done in the "git lfs checkout" command's checkoutCommand() function, for instance. Therefore we can simply remove this unused structure's initialization now in the fetchCommand() function.
The GitScanner structure and its methods were introduced in PR git-lfs#1670, and in commit bdbca39 of that PR the structure's internal sync.Mutex element was added to protect access to the "remote" and "skippedRefs" elements, as well as the "closed" element. The structure's methods always acquire the mutex before reading or writing these elements, particularly the "remote" one. This type of design is typical of other structures where we expect concurrent access to the structure's data, such as when multiple goroutines attempt to read and write the same data. However, in practice the GitScanner structure's "remote" and "skippedRefs" elements are only ever initialized once, and indeed only initialized in one specific use case by the (*uploadContext).buildGitScanner() method for the "git lfs push" and "git lfs pre-push" commands. Meanwhile, the "closed" element is only used in the Close() method, and this method is only ever called once per structure by our existing code. Moreover, other elements of the GitScanner structure which have been added since its original introduction are set and retrieved without locking the mutex, such as the FoundLockable and PotentialLockables elements, which are checked and read directly in the scanRefsToChan() function of the "lfs" package. As it never the case that multiple goroutines need to read and write to the "remote", "skippedRefs", and "closed" internal elements of the GitScanner structure, we can remove the mutex and the associated code, which simplifies the setup of the structure.
The GitScanner structure and its methods were introduced in PR git-lfs#1670, and in commit bdbca39 of that PR the structure's Close() method was introduced. Unlike other similar structures whose Close() methods should be called to release underlying resources such as channels or I/O streams, the (*GitScanner).Close() method serves only to output an optional performance timing trace metric. This Close() method is not called consistently; for instance, it is never called by the migrateExportCommand() function of the "git lfs migrate" command, and will be skipped by the checkoutCommand() function of the "git lfs checkout" command if an error is returned by the (*GitScanner).ScanTree() method. The utility of the performance timing metric is also undercut by the fact that some commands perform other tasks before and after calling the specific (*GitScanner).Scan*() method they invoke. And in the particular case of the "git lfs prune" command, multiple goroutines are started, each of which runs a different Scan*() method simultaneously with the others, so the final timing metric does not account for their different execution times, just the overall final timing. We can improve the value of the timing metric while also simplifying the calling convention for the GitScanner structure's methods by removing the Close() method, and tracing the performance of each Scan*() method individually. Removing the Close() method clarifies that no underlying resources must be released for the GitScanner structure, and so callers need not try to register a deferred call to the method. This parallels some other conventional Go structures, such as the Scanner structure of the "bufio" package. As well, running a "git lfs prune" command with the GIT_TRACE_PERFORMANCE=1 environment variable set now results in more detailed and useful output, for example: 12:36:51.221526 performance ScanStashed: 0.013632533 s 12:36:51.224494 performance ScanUnpushed: 0.016570280 s 12:36:51.240670 performance ScanTree: 0.017171717 s
The RevListScanner structure of the "git" package is constructed by the NewRevListScanner() function, which takes a ScanRefsOptions structure as an argument. The git.ScanRefsOptions structure encapsulates the large number of flags and other fields which a caller may potentially set for the "git rev-list" scanning operation. The "lfs" package also has a similar ScanRefsOptions structure, which is used by the various Scan*() methods of the GitScanner structure to pass optional flags and fields to the internal functions which implement the scanning operations. However, unlike the "git" package, this ScanRefsOptions structure is not used when constructing a GitScanner, and is moreover never used outside of the "lfs" package. We can therefore remove this structure in favour of a more conventional set of internal elements encapsulated directly in the GitScanner structure. This simplifies the signatures of many of the internal scan*() functions and helps clarify what parts of the GitScanner structure actually need to be exported. As part of this change, we also create a new, dedicated and un-exported nameMap structure, which holds the map of Git object SHAs to their pathspecs, as populated during scan operations. It is then utilized in the existing lockableNameSet structure, whose Check() method is called by the runCatFileBatch() and runCatFileBatchCheck() functions to collect the pathspecs of files locked by other users. Note that the intent to remove the ScanRefsOptions structure from the "lfs" package was discussed in several PRs in 2016, and in particular, this commit's changes achieve some of the goals of the never-merged PR git-lfs#1595, but without introducing an entirely generic "scanner" package. See also: git-lfs#1670 (comment)
In commit 781c7f5 of PR git-lfs#1953 the GitScannerFoundPointer "callback" element of the GitScanner structure was exported from the "lfs" package by renaming it to "FoundPointer". However, this element has never since been utilized outside of the package, so we can simplify the structure's set of exported elements by renaming it to "foundPointer".
The "git lfs push" and "git lfs pre-push" commands utilize the uploadForRefUpdates() function of the "commands" package, which calls the (*uploadContext).buildGitScanner() method to initialize a GitScanner structure with the additional elements necessary to exclude from the results any objects which are reachable from refs known to exist on the given remote. Two of these elements, the FoundLockable callback and the PotentialLockables set of pathspecs locked by other users (as used to report errors or warnings when attempting to push them, specifically when they are not Git LFS objects, for which lock verification is handled separately), were added in commit 9b8bed7 of PR git-lfs#1953, and are initialized following a call to the RemoteForPush() method of the GitScanner structure, which sets two other internal elements of the structure named "remote" and "skippedRefs". We can therefore replace the RemoteForPush() method with a NewGitScannerForPush() function, which allows us to accept all the necessary arguments for push operations in a single function, and to avoid exporting the GitScannerFoundLockable and GitScannerSet elements used to identify locked pathspecs.
In commit 08c5ae6 of PR git-lfs#1953 the runCatFileBatchCheck() and runCatFileBatch() functions of the "lfs" package were updated to write the pathspecs of locked files which are not Git LFS pointers to a dedicated channel created by the catFileBatchCheck() and catFileBatch() wrapper functions, respectively. The scanRefsToChan() function, which calls both of these with a potentially non-nil *lockableNameSet paramter (and is the only caller to do so), starts a goroutine with an anonymous function to read any events on the channel returned by catFileBatchCheck(), and then reads all events on the channel returned by catFileBatch() directly. In the latter case, this would cause scanRefsToChan() to stall indefinitely unless the channel is closed, so the anonymous function started by the runCatFileBatch() function that writes to the channel always closes the channel upon exit. However, the anonymous function started by the runCatFileBatchCheck() function that writes to its lock path channel does not do the same. While this does not cause a stalled program because scanRefsToChan() creates its own anonymous function to read from the channel, that function will not exit until the progam stops. By adding an explicit close() of the channel at the end of the anonymous function started by runCatFileBatchCheck(), we can ensure the anonymous function which reads that channel will also exit as soon as possible.
chrisd8088
force-pushed
the
refactor-gitscanner
branch
from
June 7, 2023 22:57
5e9cff8
to
ac73c0a
Compare
bk2204
approved these changes
Jun 9, 2023
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The
GitScanner
structure was originally introduced in PR #1670 and has been expanded organically over time since then. This PR addresses a number of inconsistencies in the current design, removing unused or unnecessary logic and revising the internal implementation some of the scanning functions.It should be easiest to review this PR commit-by-commit, as each commit has a detailed description and should be logically independent and bisectable.
This PR is motivated principally by the desire to remove the structure's
Close()
method, whose presence implies that it should be called to release resources opened by theGitScanner
methods, such as I/O streams or channels. However, upon inspection, this method simply outputs an optional performance timing tracing metric, one whose utility is undermined by the fact that various callers performs unrelated actions between the time they create aGitScanner
structure and when the callClose()
(if they call it at all). And in the case of thegit lfs prune
command in particular, multipleScan*()
methods are called in parallel goroutines, but only the final overall timing is reported.This PR therefore removes this method, along with the structure's internal mutex. The mutex was introduced in commit bdbca39 of PR #1670, with the apparent expectation that some of the
GitScanner
's elements might be read and written simultaneously by multiple concurrent goroutines. However, in practice, this is never the case, so the mutex is unneeded. As well, not all the current elements of the structure are currently protected by this mutex, as they have been introduced in subsequent PRs which ignored it. Moreover, the two main elements which are protected, namelyremote
andskippedRefs
, are updated exclusively by using theRemoteForPush()
method, and the only caller of this method updates two other elements of the structure directly, as these were added later without any mutex protection.This sole caller of the
RemoteForPush()
method, the(*uploadContext).buildGitScanner()
method in thecommands
package, is used only by thegit lfs pre-push
andgit lfs push
commands, as they need to scan for Git LFS objects which do not exist on a given remote. In this context, theFoundLockable
callback and thePotentialLockables
GitScannerSet
are always set as well; they are also only set in this context. To help clarify these inter-dependencies we replace theRemoteForPush()
method with aNewGitScannerForPush()
function, which accepts all the necessary arguments for push operations in a single function, and avoids the need to export theGitScannerFoundLockable
andGitScannerSet
elements of theGitScanner
structure.We also remove the
ScanRefsOptions
structure in thelfs
package, as it is not needed to encapsulate multiple options when initializing a newGitScanner
structure, unlike the similarScanRefsOptions
structure of thegit
package, which is used to consolidate the many options passed to theNewRevListScanner()
function. This allows us to make the majority of the elements of theGitScanner
structure private to thelfs
package.As part of this change to remove the
lfs.ScanRefsOptions
structure, we also create a new, dedicated and un-exportednameMap
structure, which holds the map of Git object SHAs to their pathspecs, as populated during scan operations. It is then utilized in the existinglockableNameSet
structure, whoseCheck()
method is called by therunCatFileBatch()
function and also called by therunCatFileBatchCheck()
function to collect the pathspecs of files locked by other users.(Note that the intent to remove the
ScanRefsOptions
structure from thelfs
package was discussed in several PRs in 2016, and in particular, this commit's changes achieve some of the goals of the never-merged PR #1595, but without introducing an entirely genericscanner
package.)Finally, we add an explicit
close()
of thelockableCh
channel at the end of the anonymous function started byrunCatFileBatchCheck()
, we can ensure the anonymous function which reads that channel will also exit as soon as possible. This function is started by thescanRefsToChan()
function, which then proceeds to read directly from the similar channel returned by therunCatFileBatch()
function. That read loop would stall indefinitely if its channel was not always closed byrunCatFileBatch()
, whereas the anonymous function in the goroutine which handles the channel returned byrunCatFileBatchCheck()
does not exit until the program exits.