Skip to content

libutil: buffer readLine on unix#14829

Closed
Zaczero wants to merge 3 commits intoNixOS:masterfrom
Zaczero:zaczero/libstore-readline
Closed

libutil: buffer readLine on unix#14829
Zaczero wants to merge 3 commits intoNixOS:masterfrom
Zaczero:zaczero/libstore-readline

Conversation

@Zaczero
Copy link
Member

@Zaczero Zaczero commented Dec 18, 2025

  • readLine() previously performed a syscall per byte, which is slow for large line-oriented reads.
  • BM_ReadLineFile/10000_mean runs 4000x faster

Motivation

Context


Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

- Add a microbenchmark that reads many fixed-width lines from a temp file using readLine().
- readLine() previously performed a syscall per byte, which is slow for large line-oriented reads.
- BM_ReadLineFile/10000_mean runs 4000x faster
@Zaczero Zaczero requested a review from edolstra as a code owner December 18, 2025 04:32
@Zaczero Zaczero changed the title libstore-tests: add benchmark for readLine on files libutil: buffer readLine on unix Dec 18, 2025
@Ericson2314
Copy link
Member

Can we do this without a global variable? I would feel much better about that.

@Zaczero
Copy link
Member Author

Zaczero commented Dec 18, 2025

Can we do this without a global variable? I would feel much better about that.

I assume you mean thread_local std::unordered_map<int, ReadLineState> readLineStates;

Can you please explain why is that not preferred here?

Hmmm... I see it's potentially risky, when several threads operate on the same fd (but that code pattern would be ugly in the first place).

@Ericson2314
Copy link
Member

  1. The global variable is not safe by construction because file descriptor numbers can be reused when things are opened and closed. At the very least it requires the clearing of the cache to be absolute correct

  2. It is not, in general, safe to "overread", as happen with the cache, because the next read might not go through the cache. Looking through the code, I see this in ssh case --- it looks like we read a line or so, and then we pass up the file descriptor to some other arbitrary function or even process. We can't do that unless we can "put back" the rest of the cache, and sadly we cannot do that for an arbitrary file descriptor.

In general, I suspect that these FIXMEs are not all sources of "easy wins". There is a bit of a https://en.wikipedia.org/wiki/Survivorship_bias sometimes where a FIXME might look easy enough, but if it was so easy someone would have fixed it by now. (That is certainly not always the case though, just to be clear.)

@Zaczero
Copy link
Member Author

Zaczero commented Dec 18, 2025

Because of 2 it seems like this FIXME is unrealistic. So it's an architectural issue.

@Zaczero Zaczero closed this Dec 18, 2025
@Ericson2314
Copy link
Member

FWIW the local state approach can still work if it is opt-in. Then, just the call sites where the entire life cycle of reads is known so safety can be judged can use it.

@Zaczero
Copy link
Member Author

Zaczero commented Dec 18, 2025

I think the correct way would be to refactor the usage to some standard interface (I don't know much of C++ abstractions) if one exists. That would allow for cached operations on the fd, with optional "unpacking" into raw fd. Not having a safe, clean fd abstraction prevents cache optimizations that could be then unified across all read operations.

@xokdvium
Copy link
Contributor

We have those. It's called FdSource.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants