Optimize and generalize `has-notes` #623

bdarcus · 2022-06-02T18:59:53Z

@roshanshariff - I got the notes API stuff merged (via #603), and it works, but the performance discrepancy here is obvious.

ELISP> (citar-select-ref :filter (lambda (key entry) (member key (citar-keys-with-notes))))
ELISP> (citar-select-ref :filter (citar-has-file))

After a bit of experimenting, I think the issue is really just list vs hash performance. If I simply convert citar-keys-with-notes from list to hash, the performance is much better again.

ELISP> (citar-select-ref :filter (lambda (key _entry) (gethash key my/h2)))

So my impulse is just to do that.

Do you have any better suggestions?

For example, possible to generalize citar-has-note more?

TIA.

The text was updated successfully, but these errors were encountered:

roshanshariff · 2022-06-02T19:23:19Z

Just using a hash table would work, but perhaps the API of citar-has-file might be better? Instead of returning a hash table, it returns a predicate function that takes a citekey and entry and tells you whether there's an associated file:

(defun citar-org-roam--has-note ()
  (let ((keys (make-hash-table :test #'equal)))
    ;; Store keys that have notes in hash table
    (dolist (record (org-roam-db-query
                     [:select ref :from refs :where (= type "cite")]))
      (puthash (car record) t keys))
    ;; Return predicate that queries hash table for given key
    (lambda (citekey _entry)
      (gethash citekey keys))))

The advantage is that the use of a hash table is just an internal implementation detail of this function, and is not part of its public API in any way. I hope that makes sense? I can clarify if needed.

bdarcus · 2022-06-02T19:33:16Z

Instead of returning a hash table, it returns a predicate function that takes a citekey and entry and tells you whether there's an associated file.

Just to clarify, more general than a file; an associated note, which may be a file, but could also be a sub-file node.

Your example makes sense to me, but how would I hook that up to citar?

Would do something like this, and have citar-has-file call those?

(setq citar-has-note-functions '(citar-file--has-note citar-org-roam--has-note))

roshanshariff · 2022-06-02T19:36:23Z

Yes, exactly, citar-has-note would be something like the code I had in #601 (comment).

bdarcus · 2022-06-02T19:37:29Z

Ok, I'll whip up a PR. Thanks.

I'm thinking I'll make those functions public, though, since we'd want to encourage their use for this sort of thing.

Revert the keys-with-notes defcustom and code, and instead generalize 'citar-has-note', which now iterates through 'citar-note-backends'. Fix #623

bdarcus · 2022-06-04T01:43:40Z

@roshanshariff - follow-up on performance.

So this is much better and faster than the earlier code.

But it's still a fair bit slower than checking the contents of the entry, after citar--format-candidates has already used citar-has-note to add cons cells with this information.

So with this:

(defun my/np1 ()
  (lambda (_key entry)
    (assoc "has-note" entry)))

... this result, and comparison to using citar-has-note:

ELISP> (benchmark-run-compiled 500 (citar--get-candidates nil (my/np1)))
(0.26864172399999997 0 0.0)

ELISP> (benchmark-run-compiled 500 (citar--get-candidates nil (citar-has-note)))
(5.593660889000001 15 2.1500094049999987)

I think the performance difference might be noticeable with large candidate lists and a lot of associated files, and filtering with citar-select-ref.

Any thoughts on what, if anything, to do with this?

Like would it make sense to have two of these functions; one designed for creating the candidates, and the other for filtering after-the-fact?

Maybe I can name them for the contexts they're optimized for; like citar--format-candidates-has-file?

roshanshariff · 2022-06-04T03:10:39Z

I'll have to think about it some more, but that benchmark isn't entirely fair. I think the call to citar-has-note shouldn't be part of the benchmark, because it's actually doing the work of scanning the directory/querying the Org-Roam database (500 times in this case). Maybe something like this makes more sense?

(let ((filter (citar-has-note)))
    (benchmark-run-compiled 500 (citar--get-candidates nil filter)))

Though even then I'm not sure if the filter lambda function will be byte-compiled, so it might still not be representative of the actual performance.

I think the question really comes down to whether you want the has-note/has-file metadata to be updated every time you filter, or only when the candidate cache is updated. Currently the cache has the formatted candidates, which necessarily includes the has-note and has-file tags. My feeling is this is a bad idea, because it forces a lot more cache invalidations; for correctness you have to invalidate whenever a note or file is created/deleted. This needs file watches on directories, or somehow knowing when a new Org-Roam note is created, which strikes me as unnecessary complexity.

I suspect that the slowest part of recreating the cache is actually parsing the bib file. If this is the case, then it would be better to just cache the parsed bib file and generate the rest of the metadata as-needed. Or perhaps include the string representation in the cache, but tack on the note and file indicators as-needed. You would only have to invalidate the cache when the actual bib file changes, which would require re-parsing anyway.

It'll take some benchmarks to figure out whether the candidate formatting is fast enough to not cache, and how much time it takes compared to parsing the bib file.

One nice feature of formatting the candidates on the fly is that that you can fill the width of the emacs window, without having to regenerate the cache every time the window is resized...

bdarcus · 2022-06-04T11:08:50Z

The original code had no cache, which led to an early feature request #68 #69. It definitely got much more responsive with the cache.

But yeah, there's that trade-off.

With the corrected benchmark:

ELISP> (benchmark-run-compiled 500 (citar--get-candidates nil filter))
(1.157143517 4 0.4320387100000005)

I suspect that the slowest part of recreating the cache is actually parsing the bib file.

I agree.

If this is the case, then it would be better to just cache the parsed bib file and generate the rest of the metadata as-needed.

So in this scenario the cache would be the hash created by parsebib (where key = citekey and value = entry), but citar--format-candidates would then run every time?

Adding the conses to the hash values seems straightforward:

(defvar citar-cache--db
  (parsebib-parse citar-bibliography))

(defun citar-cache--get-entry (citekey)
  "Return entry data for CITEKEY."
  (gethash citekey citar-cache--db))

(defun citar-cache--get-value (citekey field)
  "Return FIELD value for CITEKEY."
  (let ((entry (citar-cache--get-entry citekey)))
    (cdr (assoc field entry))))

(defun citar-cache--has-files-p (citekey)
  (citar-cache--get-value citekey "has-files"))

(defun citar-cache--has-notes-p (citekey)
  (citar-cache--get-value citekey "has-notes"))

(defun citar-cache--entry-update-metadata (citekey)
  "Update the cached bibliographic data CITEKEY with additional metadata."
  (let* ((hasnotes (when (citar-cache--has-notes-p citekey) "has-notes"))
         (hasfiles (when (citar-cache--has-files-p citekey) "has-files")))
    (dolist (metafield (list hasnotes hasfiles))
      (when metafield
        (push (cons metafield t) ; needs to be smarter
              (gethash citekey citar-cache--db))))))

(defun citar-cache--update-metadata ()
  "Update the cached bibliographic data with additional metadata."
  (let ((keys (hash-table-keys citar-cache--db)))
    (dolist (key keys)
      (citar-cache--entry-update-metadata key))))

Or perhaps include the string representation in the cache, but tack on the note and file indicators as-needed.

By "string representation", you mean the candidates created by citar--format-candidates? E.g. in this scenario, the content of the cache wouldn't change; just how it works internally?

My impulse says the first approach is better, mostly because the API becomes more intuitive and consistent, and it would isolate or abstract things specific to completing-read. WDYT?

OTOH, not so sure it's that easy to adapt citar--get-candidates to this, given that it handles the global and local caches, and merges them through seq-concatenate, which I can't see how to do with hash-tables.

The affixation content isn't actually included in the candidates. Not sure, but that might provide an opportunity?

I wonder if this case is suitable for programmed completion?

bdarcus · 2022-06-04T19:40:11Z

I'm experimenting on #625.

Sorry for this breaking change, but I wanted to get the foundations right before tagging 1.0. This completely restructures the core of citar to borrow some code and ideas from the org-mode oc-basic package. In particular, it changes to using two primary caches: - bibliography - completion Both of these now use hash tables, rather than lists. Caching functionality is also changed, and the API now focuses on citekeys as arguments for key functions. Finally, citar--parse-bibliography should re-parse bibliography files upon change. Fix #623 Close #627

Sorry for this breaking change, but I wanted to get the foundations right before tagging 1.0. This completely restructures the core of citar to borrow some code and ideas from the org-mode oc-basic package. In particular, it changes to using two primary caches: - bibliography - completion Both of these now use hash tables, rather than lists. Caching functionality is also changed, and the API now focuses on citekeys as arguments for key functions. Finally, citar--parse-bibliography should re-parse bibliography files upon change. Fix emacs-citar#623 Close emacs-citar#627

bdarcus mentioned this issue Jun 2, 2022

General feedback/questions mclear-tools/consult-notes#1

Closed

bdarcus added a commit that referenced this issue Jun 2, 2022

refactor(notes): generalize citar-has-note

359e06e

Revert the keys-with-notes defcustom and code, and instead generalize 'citar-has-note', which now iterates through 'citar-note-backends'. Fix #623

bdarcus closed this as completed in 1976f29 Jun 2, 2022

bdarcus changed the title ~~Optimizing citar-keys-with-notes~~ Optimize and generalize has-notes Jun 2, 2022

bdarcus mentioned this issue Jun 4, 2022

refactor: cache the parsebib hashes #625

Closed

4 tasks

This was referenced Jun 6, 2022

Split data and completion caches, change APIs to use citekeys #627

Closed

Refactor cache architecture #628

Closed

bdarcus linked a pull request Jun 26, 2022 that will close this issue

New caching and formatting functionality #634

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize and generalize `has-notes` #623

Optimize and generalize `has-notes` #623

bdarcus commented Jun 2, 2022 •

edited

Loading

roshanshariff commented Jun 2, 2022

bdarcus commented Jun 2, 2022 •

edited

Loading

roshanshariff commented Jun 2, 2022

bdarcus commented Jun 2, 2022

bdarcus commented Jun 4, 2022 •

edited

Loading

roshanshariff commented Jun 4, 2022

bdarcus commented Jun 4, 2022 •

edited

Loading

bdarcus commented Jun 4, 2022

Optimize and generalize has-notes #623

Optimize and generalize has-notes #623

Comments

bdarcus commented Jun 2, 2022 • edited Loading

roshanshariff commented Jun 2, 2022

bdarcus commented Jun 2, 2022 • edited Loading

roshanshariff commented Jun 2, 2022

bdarcus commented Jun 2, 2022

bdarcus commented Jun 4, 2022 • edited Loading

roshanshariff commented Jun 4, 2022

bdarcus commented Jun 4, 2022 • edited Loading

bdarcus commented Jun 4, 2022

Optimize and generalize `has-notes` #623

Optimize and generalize `has-notes` #623

bdarcus commented Jun 2, 2022 •

edited

Loading

bdarcus commented Jun 2, 2022 •

edited

Loading

bdarcus commented Jun 4, 2022 •

edited

Loading

bdarcus commented Jun 4, 2022 •

edited

Loading