Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize and generalize has-notes #623

Closed
bdarcus opened this issue Jun 2, 2022 · 8 comments · Fixed by #634
Closed

Optimize and generalize has-notes #623

bdarcus opened this issue Jun 2, 2022 · 8 comments · Fixed by #634

Comments

@bdarcus
Copy link
Contributor

bdarcus commented Jun 2, 2022

@roshanshariff - I got the notes API stuff merged (via #603), and it works, but the performance discrepancy here is obvious.

ELISP> (citar-select-ref :filter (lambda (key entry) (member key (citar-keys-with-notes))))
ELISP> (citar-select-ref :filter (citar-has-file))

After a bit of experimenting, I think the issue is really just list vs hash performance. If I simply convert citar-keys-with-notes from list to hash, the performance is much better again.

ELISP> (citar-select-ref :filter (lambda (key _entry) (gethash key my/h2)))

So my impulse is just to do that.

Do you have any better suggestions?

For example, possible to generalize citar-has-note more?

TIA.

@roshanshariff
Copy link
Collaborator

Just using a hash table would work, but perhaps the API of citar-has-file might be better? Instead of returning a hash table, it returns a predicate function that takes a citekey and entry and tells you whether there's an associated file:

(defun citar-org-roam--has-note ()
  (let ((keys (make-hash-table :test #'equal)))
    ;; Store keys that have notes in hash table
    (dolist (record (org-roam-db-query
                     [:select ref :from refs :where (= type "cite")]))
      (puthash (car record) t keys))
    ;; Return predicate that queries hash table for given key
    (lambda (citekey _entry)
      (gethash citekey keys))))

The advantage is that the use of a hash table is just an internal implementation detail of this function, and is not part of its public API in any way. I hope that makes sense? I can clarify if needed.

@bdarcus
Copy link
Contributor Author

bdarcus commented Jun 2, 2022

Instead of returning a hash table, it returns a predicate function that takes a citekey and entry and tells you whether there's an associated file.

Just to clarify, more general than a file; an associated note, which may be a file, but could also be a sub-file node.

Your example makes sense to me, but how would I hook that up to citar?

Would do something like this, and have citar-has-file call those?

(setq citar-has-note-functions '(citar-file--has-note citar-org-roam--has-note))

@roshanshariff
Copy link
Collaborator

Yes, exactly, citar-has-note would be something like the code I had in #601 (comment).

@bdarcus
Copy link
Contributor Author

bdarcus commented Jun 2, 2022

Ok, I'll whip up a PR. Thanks.

I'm thinking I'll make those functions public, though, since we'd want to encourage their use for this sort of thing.

bdarcus added a commit that referenced this issue Jun 2, 2022
Revert the keys-with-notes defcustom and code, and instead generalize
'citar-has-note', which now iterates through 'citar-note-backends'.

Fix #623
@bdarcus bdarcus closed this as completed in 1976f29 Jun 2, 2022
@bdarcus bdarcus changed the title Optimizing citar-keys-with-notes Optimize and generalize has-notes Jun 2, 2022
@bdarcus
Copy link
Contributor Author

bdarcus commented Jun 4, 2022

@roshanshariff - follow-up on performance.

So this is much better and faster than the earlier code.

But it's still a fair bit slower than checking the contents of the entry, after citar--format-candidates has already used citar-has-note to add cons cells with this information.

So with this:

(defun my/np1 ()
  (lambda (_key entry)
    (assoc "has-note" entry)))

... this result, and comparison to using citar-has-note:

ELISP> (benchmark-run-compiled 500 (citar--get-candidates nil (my/np1)))
(0.26864172399999997 0 0.0)

ELISP> (benchmark-run-compiled 500 (citar--get-candidates nil (citar-has-note)))
(5.593660889000001 15 2.1500094049999987)

I think the performance difference might be noticeable with large candidate lists and a lot of associated files, and filtering with citar-select-ref.

Any thoughts on what, if anything, to do with this?

Like would it make sense to have two of these functions; one designed for creating the candidates, and the other for filtering after-the-fact?

Maybe I can name them for the contexts they're optimized for; like citar--format-candidates-has-file?

@roshanshariff
Copy link
Collaborator

I'll have to think about it some more, but that benchmark isn't entirely fair. I think the call to citar-has-note shouldn't be part of the benchmark, because it's actually doing the work of scanning the directory/querying the Org-Roam database (500 times in this case). Maybe something like this makes more sense?

(let ((filter (citar-has-note)))
    (benchmark-run-compiled 500 (citar--get-candidates nil filter)))

Though even then I'm not sure if the filter lambda function will be byte-compiled, so it might still not be representative of the actual performance.

I think the question really comes down to whether you want the has-note/has-file metadata to be updated every time you filter, or only when the candidate cache is updated. Currently the cache has the formatted candidates, which necessarily includes the has-note and has-file tags. My feeling is this is a bad idea, because it forces a lot more cache invalidations; for correctness you have to invalidate whenever a note or file is created/deleted. This needs file watches on directories, or somehow knowing when a new Org-Roam note is created, which strikes me as unnecessary complexity.

I suspect that the slowest part of recreating the cache is actually parsing the bib file. If this is the case, then it would be better to just cache the parsed bib file and generate the rest of the metadata as-needed. Or perhaps include the string representation in the cache, but tack on the note and file indicators as-needed. You would only have to invalidate the cache when the actual bib file changes, which would require re-parsing anyway.

It'll take some benchmarks to figure out whether the candidate formatting is fast enough to not cache, and how much time it takes compared to parsing the bib file.

One nice feature of formatting the candidates on the fly is that that you can fill the width of the emacs window, without having to regenerate the cache every time the window is resized...

@bdarcus
Copy link
Contributor Author

bdarcus commented Jun 4, 2022

The original code had no cache, which led to an early feature request #68 #69. It definitely got much more responsive with the cache.

But yeah, there's that trade-off.

With the corrected benchmark:

ELISP> (benchmark-run-compiled 500 (citar--get-candidates nil filter))
(1.157143517 4 0.4320387100000005)

I suspect that the slowest part of recreating the cache is actually parsing the bib file.

I agree.

If this is the case, then it would be better to just cache the parsed bib file and generate the rest of the metadata as-needed.

So in this scenario the cache would be the hash created by parsebib (where key = citekey and value = entry), but citar--format-candidates would then run every time?

Adding the conses to the hash values seems straightforward:

(defvar citar-cache--db
  (parsebib-parse citar-bibliography))

(defun citar-cache--get-entry (citekey)
  "Return entry data for CITEKEY."
  (gethash citekey citar-cache--db))

(defun citar-cache--get-value (citekey field)
  "Return FIELD value for CITEKEY."
  (let ((entry (citar-cache--get-entry citekey)))
    (cdr (assoc field entry))))

(defun citar-cache--has-files-p (citekey)
  (citar-cache--get-value citekey "has-files"))

(defun citar-cache--has-notes-p (citekey)
  (citar-cache--get-value citekey "has-notes"))

(defun citar-cache--entry-update-metadata (citekey)
  "Update the cached bibliographic data CITEKEY with additional metadata."
  (let* ((hasnotes (when (citar-cache--has-notes-p citekey) "has-notes"))
         (hasfiles (when (citar-cache--has-files-p citekey) "has-files")))
    (dolist (metafield (list hasnotes hasfiles))
      (when metafield
        (push (cons metafield t) ; needs to be smarter
              (gethash citekey citar-cache--db))))))

(defun citar-cache--update-metadata ()
  "Update the cached bibliographic data with additional metadata."
  (let ((keys (hash-table-keys citar-cache--db)))
    (dolist (key keys)
      (citar-cache--entry-update-metadata key))))

Or perhaps include the string representation in the cache, but tack on the note and file indicators as-needed.

By "string representation", you mean the candidates created by citar--format-candidates? E.g. in this scenario, the content of the cache wouldn't change; just how it works internally?

My impulse says the first approach is better, mostly because the API becomes more intuitive and consistent, and it would isolate or abstract things specific to completing-read. WDYT?

OTOH, not so sure it's that easy to adapt citar--get-candidates to this, given that it handles the global and local caches, and merges them through seq-concatenate, which I can't see how to do with hash-tables.

The affixation content isn't actually included in the candidates. Not sure, but that might provide an opportunity?

I wonder if this case is suitable for programmed completion?

@bdarcus
Copy link
Contributor Author

bdarcus commented Jun 4, 2022

I'm experimenting on #625.

bdarcus added a commit that referenced this issue Jun 10, 2022
Sorry for this breaking change, but I wanted to get the foundations
right before tagging 1.0.

This completely restructures the core of citar to borrow some code and
ideas from the org-mode oc-basic package.

In particular, it changes to using two primary caches:

- bibliography
- completion

Both of these now use hash tables, rather than lists.

Caching functionality is also changed, and the API now focuses on
citekeys as arguments for key functions.

Finally, citar--parse-bibliography should re-parse bibliography files
upon change.

Fix #623 Close #627
bdarcus added a commit that referenced this issue Jun 10, 2022
Sorry for this breaking change, but I wanted to get the foundations
right before tagging 1.0.

This completely restructures the core of citar to borrow some code and
ideas from the org-mode oc-basic package.

In particular, it changes to using two primary caches:

- bibliography
- completion

Both of these now use hash tables, rather than lists.

Caching functionality is also changed, and the API now focuses on
citekeys as arguments for key functions.

Finally, citar--parse-bibliography should re-parse bibliography files
upon change.

Fix #623 Close #627
bdarcus added a commit that referenced this issue Jun 13, 2022
Sorry for this breaking change, but I wanted to get the foundations
right before tagging 1.0.

This completely restructures the core of citar to borrow some code and
ideas from the org-mode oc-basic package.

In particular, it changes to using two primary caches:

- bibliography
- completion

Both of these now use hash tables, rather than lists.

Caching functionality is also changed, and the API now focuses on
citekeys as arguments for key functions.

Finally, citar--parse-bibliography should re-parse bibliography files
upon change.

Fix #623 Close #627
roshanshariff pushed a commit to roshanshariff/citar that referenced this issue Jun 19, 2022
Sorry for this breaking change, but I wanted to get the foundations
right before tagging 1.0.

This completely restructures the core of citar to borrow some code and
ideas from the org-mode oc-basic package.

In particular, it changes to using two primary caches:

- bibliography
- completion

Both of these now use hash tables, rather than lists.

Caching functionality is also changed, and the API now focuses on
citekeys as arguments for key functions.

Finally, citar--parse-bibliography should re-parse bibliography files
upon change.

Fix emacs-citar#623 Close emacs-citar#627
roshanshariff pushed a commit to roshanshariff/citar that referenced this issue Jun 24, 2022
Sorry for this breaking change, but I wanted to get the foundations
right before tagging 1.0.

This completely restructures the core of citar to borrow some code and
ideas from the org-mode oc-basic package.

In particular, it changes to using two primary caches:

- bibliography
- completion

Both of these now use hash tables, rather than lists.

Caching functionality is also changed, and the API now focuses on
citekeys as arguments for key functions.

Finally, citar--parse-bibliography should re-parse bibliography files
upon change.

Fix emacs-citar#623 Close emacs-citar#627
@bdarcus bdarcus linked a pull request Jun 26, 2022 that will close this issue
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants