Improve matching performance. #293

Closed
wants to merge 4 commits into from

Commits on Jun 24, 2017

  1. Add C scanner infrastructure.

    This scanner allows low-overhead reading of files. It can currently read
    paths from an array of strings (good for Ruby-based scanners) and from a
    terminator-delimited string (great for command-executor scanners).
    
    This also converts the find scanner and the git scanner to use the
    string-based infrastructure.
    kevincox committed Jun 24, 2017
    c1b730f
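
The terminator-delimited input described in this commit can be illustrated with a minimal sketch. It is not taken from the PR: `scan_terminated` and `path_fn` are hypothetical names, and the choice to skip empty records and to accept a final record with no trailing terminator is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Called once per path. `path` is NOT NUL-terminated; use `len`. */
typedef void (*path_fn)(const char *path, size_t len, void *ctx);

/*
 * Walk `buf` (`buf_len` bytes) and report every record that ends in
 * `terminator`. A final record with no trailing terminator is reported
 * too, and empty records are skipped. `fn` may be NULL to only count.
 * Returns the number of paths reported.
 */
size_t scan_terminated(const char *buf, size_t buf_len, char terminator,
                       path_fn fn, void *ctx)
{
    size_t count = 0;
    const char *p = buf;
    const char *end;

    assert(buf != NULL);
    end = buf + buf_len;
    while (p < end) {
        const char *t = memchr(p, terminator, (size_t)(end - p));
        const char *stop = t ? t : end;
        size_t len = (size_t)(stop - p);

        if (len > 0) {
            if (fn)
                fn(p, len, ctx);
            count++;
        }
        if (!t)
            break;      /* last record had no terminator */
        p = t + 1;      /* step over the terminator */
    }
    return count;
}
```

Because the paths are reported as pointer and length pairs into the original buffer, nothing is copied. A NUL-terminated listing such as the output of `git ls-files -z` would be scanned with `'\0'` as the terminator, and newline-separated output with `'\n'`.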
  2. Add option for custom scanning command.

    This option gives the user ultimate flexibility: they can provide any
    shell command, script or pipeline to generate the file list.
    
    To give the command more context in the future, variables can be set in
    the shell before the user command is called, passing more information to
    the external command. This allows backwards-compatible extension later.
    
    The only drawback I see is that the terminator is "static". For example,
    if a user has a script that switches technique based on context (such as
    the current path), it will always have to use the same terminator.
    Luckily the workaround is simple: use `| tr '\n' '\0'` to convert
    newlines (or whatever the alternative separator is) to nulls and set
    CommandT to always use nulls. This has a slight CPU and memory overhead,
    but since it runs in parallel it should be insignificant on a multi-core
    machine.
    kevincox committed Jun 24, 2017
    f389ccf
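
As a rough illustration of the custom-command idea, the sketch below runs a shell pipeline and captures its raw output, NUL bytes included. It assumes a POSIX system with `popen`; the function name `read_command_output` is hypothetical, and the PR itself may launch the command quite differently.

```c
#define _POSIX_C_SOURCE 200809L /* for popen/pclose */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Run `command` through the shell and return its whole standard output in
 * a malloc'd buffer (the caller frees it). `*out_len` receives the byte
 * count. The data may contain NUL bytes, so it is not a C string.
 * Returns NULL on failure.
 */
char *read_command_output(const char *command, size_t *out_len)
{
    FILE *pipe;
    size_t cap = 4096, len = 0;
    char *buf;

    assert(command != NULL && out_len != NULL);
    pipe = popen(command, "r");
    if (!pipe)
        return NULL;
    buf = malloc(cap);
    if (!buf) {
        pclose(pipe);
        return NULL;
    }
    for (;;) {
        size_t n = fread(buf + len, 1, cap - len, pipe);
        len += n;
        if (n == 0)
            break;                  /* EOF or read error */
        if (len == cap) {           /* buffer full: double it */
            char *grown = realloc(buf, cap * 2);
            if (!grown) {
                free(buf);
                pclose(pipe);
                return NULL;
            }
            buf = grown;
            cap *= 2;
        }
    }
    if (pclose(pipe) == -1) {
        free(buf);
        return NULL;
    }
    *out_len = len;
    return buf;
}
```

Calling it with a pipeline that ends in `| tr '\n' '\0'` yields a buffer in which every path ends in a NUL byte, whatever separator the underlying script produced. This is the normalisation the commit message suggests for scripts that switch technique.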
  3. Improve matching performance.

    This improves matching performance by using a trie to store the paths.
    Hit-or-miss matching is then done iteratively, allowing the pruning of
    subtrees and requiring less work to identify matches.
    
    Additionally this structure usually uses much less memory as common
    prefixes of paths are only stored once.
    
    Testing on benchmarks and use cases shows a 2-10x performance
    improvement for common scenarios. In some edge cases the speedup is
    effectively unbounded, because huge numbers of paths can be pruned.
    This is particularly common when there are a lot of hidden paths.
    
    Other notes:
    - A tiny cache boost might be gained by putting all strings into the same
    	buffer instead of strdup'ing strings separately.
    - I considered putting the path segments inline into the paths_t
    	structure but it performed worse. This was surprising, but is
    	probably due to good cache prediction when the strings are in
    	order in memory.
    - Most of the time is still spent in calculate_match. While we call this
    	function much less now (on every match rather than every string),
    	it is still expensive. This might be acceptable because its
    	complexity provides a useful result order, but it is a good
    	optimization target.
    - Another boost might be gained by post-processing the paths into a
    	single array that stores each path as a delta from the previous
    	one. This would give excellent cache locality but would make
    	skipping over hidden or mask-failing files difficult/impossible.
    	It would also be difficult/impossible to do threaded.
    kevincox committed Jun 24, 2017
    f91f894
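
The trie idea in this commit can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `node_t`, `trie_insert` and `trie_count_matches` are hypothetical names, the hit-or-miss test is a plain case-sensitive subsequence check, and the walk is recursive for brevity whereas the commit describes an iterative one.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One path segment. Paths sharing a prefix share the nodes for it. */
typedef struct node {
    char *segment;
    int is_file;            /* a complete path ends at this node */
    struct node *child;     /* first child */
    struct node *sibling;   /* next node at the same level */
} node_t;

static node_t *find_or_add(node_t *parent, const char *seg, size_t len)
{
    node_t *n;
    for (n = parent->child; n; n = n->sibling)
        if (strlen(n->segment) == len && memcmp(n->segment, seg, len) == 0)
            return n;
    n = calloc(1, sizeof *n);
    assert(n != NULL);              /* sketch: abort on out-of-memory */
    n->segment = malloc(len + 1);
    assert(n->segment != NULL);
    memcpy(n->segment, seg, len);
    n->segment[len] = '\0';
    n->sibling = parent->child;     /* push onto the front of the list */
    parent->child = n;
    return n;
}

/* Insert a '/'-separated path below `root`. Nodes are never freed here. */
void trie_insert(node_t *root, const char *path)
{
    node_t *cur = root;
    while (*path) {
        size_t len = strcspn(path, "/");
        if (len > 0)
            cur = find_or_add(cur, path, len);
        path += len;
        if (*path == '/')
            path++;
    }
    if (cur != root)
        cur->is_file = 1;
}

/*
 * Count files below `n` whose path contains `needle` as a subsequence.
 * `needle` is the part not yet consumed by ancestor segments, so a shared
 * prefix is matched once for all paths under it. Separators are not
 * stored, so a needle containing '/' never matches in this sketch.
 */
size_t trie_count_matches(const node_t *n, const char *needle, int show_hidden)
{
    size_t count = 0;
    const node_t *c;
    for (c = n->child; c; c = c->sibling) {
        const char *rest = needle;
        const char *s;
        if (!show_hidden && c->segment[0] == '.')
            continue;               /* prune the whole hidden subtree */
        for (s = c->segment; *s && *rest; s++)
            if (*s == *rest)
                rest++;
        if (c->is_file && *rest == '\0')
            count++;
        count += trie_count_matches(c, rest, show_hidden);
    }
    return count;
}
```

With hidden paths disabled, a directory such as `.git` is rejected after looking at one character, and none of the paths beneath it are ever visited. That is the kind of pruning the commit message credits for the largest speedups.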
  4. Add limit to matcher benchmark.

    Without a limit, the time is dominated by sorting all the results in a
    single thread. The limit is configurable so it can be raised to test
    merging.
    kevincox committed Jun 24, 2017
    be96c6d
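
To show why a result limit can remove the sorting cost mentioned in this commit, here is one possible approach: keep only the best `limit` scores while scanning instead of sorting everything afterwards. This is an assumption for illustration only. `top_scores` is a hypothetical function, and the PR does not say how the limit is applied internally.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Keep the `limit` highest values from `scores` in `best`, sorted in
 * descending order. `best` must have room for `limit` entries.
 * Returns how many entries of `best` were filled (at most `limit`).
 * The cost is O(n * limit) rather than a full sort of all n results.
 */
size_t top_scores(const double *scores, size_t n, double *best, size_t limit)
{
    size_t kept = 0, i, j;

    assert(scores != NULL && best != NULL);
    for (i = 0; i < n; i++) {
        double s = scores[i];
        if (kept == limit) {
            if (limit == 0 || s <= best[kept - 1])
                continue;   /* no better than the current worst entry */
            kept--;         /* drop the worst entry to make room */
        }
        /* shift smaller entries right, then insert `s` in order */
        for (j = kept; j > 0 && best[j - 1] < s; j--)
            best[j] = best[j - 1];
        best[j] = s;
        kept++;
    }
    return kept;
}
```

With a small limit most candidates are rejected by a single comparison against the current worst entry, so the ordering work stops being the dominant cost.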