Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: fixed reindexing of unchanged files, now uses last_modified stamp #8

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

ErikBjare
Copy link
Member

@ErikBjare ErikBjare commented Jan 18, 2025

Had these unstaged changes, maybe I merged this


Important

Optimize indexing by using file modification timestamps to skip unchanged files, enhancing efficiency in indexer.py, cli.py, and watcher.py.

  • Behavior:
    • Indexer now uses last_modified timestamps to skip unchanged files during indexing in indexer.py.
    • collect_documents() in indexer.py checks for file modifications using stored timestamps.
    • add_documents() updates document metadata with current timestamps after indexing.
  • Functions:
    • Added _normalize_timestamp() and _compare_timestamps() in indexer.py for timestamp handling.
    • Modified collect_documents() to include check_modified parameter for skipping unchanged files.
  • Misc:
    • Updated index() in cli.py to use new document collection logic.
    • Adjusted logging in watcher.py to reflect changes in file processing.

This description was created by Ellipsis for 0b71d3c. It will automatically update as commits are pushed.

Copy link

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Reviewed everything up to 0b71d3c in 54 seconds

More details
  • Looked at 269 lines of code in 3 files
  • Skipped 0 files when reviewing.
  • Skipped posting 4 drafted comments based on config settings.
1. gptme_rag/indexing/indexer.py:186
  • Draft comment:
    The add_documents method updates the last_modified metadata after indexing, but this logic is duplicated in _add_documents. Consider consolidating this logic to avoid redundancy.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The PR description is unclear, but the changes seem to focus on improving the handling of file modification timestamps to avoid unnecessary reindexing of unchanged files. The changes in the Indexer class and related methods are consistent with this intent. However, there are some areas that could be improved for clarity and efficiency.
2. gptme_rag/indexing/indexer.py:910
  • Draft comment:
    Returning a timestamp of 0 on error might lead to incorrect assumptions about file modification times. Consider logging the error and handling it more explicitly.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The _normalize_timestamp method is used to ensure timestamps are in a consistent format. However, the method defaults to a timestamp of 0 if an error occurs, which might not be the best approach as it could lead to incorrect assumptions about file modification times.
3. gptme_rag/indexing/indexer.py:922
  • Draft comment:
    Assuming a file is modified if there's an error comparing timestamps might not always be correct. Consider handling errors more explicitly to avoid false positives.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The _compare_timestamps method is crucial for determining if a file has been modified. The current implementation assumes that any error in comparison means the file is modified, which might not always be the case.
4. gptme_rag/indexing/indexer.py:967
  • Draft comment:
    The logic for checking if a file is modified could be optimized by directly comparing timestamps without converting them to strings. This would improve performance slightly.
  • Reason this comment was not posted:
    Confidence changes required: 33%
    The collect_documents method now includes a check_modified parameter to skip unchanged files. This is a good addition, but the logic for checking modifications could be more efficient.

Workflow ID: wflow_vVOA9BIX9oF4RfSM


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant