Getting time and formatting it seems slow. #94

Open
dpc opened this issue Sep 9, 2016 · 5 comments

@dpc

dpc commented Sep 9, 2016

Working on slog-rs (a logger), I've been doing some benchmarking, and getting a current (formatted) timestamp from chrono is taking a lot of time.

First, I've discovered that just getting the time (without formatting) takes around 60ns in my benchmarks. Does it require two syscalls (one for the time, one for the timezone)? I've filed rust-lang/rust#36358 to ask why it isn't faster.

Second, formatting takes around 500ns, which seems like too much. I wish formatting provided something like fn fmt_rfc3339<W: io::Write>(&self, io: W) -> io::Result<()>, which would not require any allocation. Also, when investigating the code, it seems there are a lot of cloned() calls, and the whole thing is based on StrftimeItems, which, like printf in C, scans the format string every time.

@lifthrasiir
Contributor

I'm currently traveling, so my replies may be slow. That said...

Does it require two syscalls (one for the time, one for the timezone)?

The time zone is a concept local to the current process, normally implemented as part of libc (glibc has tons of code dedicated to this, including a complete reader for tzfile). Right now Chrono depends on libtime for this task, which is, AFAIK, just a call to gettimeofday (a system call) plus localtime_r (which may do a bunch of work the first time it is called, and probably locks an internal data structure).

I'm actually thinking of replacing them with an independent implementation (I've hit some crazy bugs in libtime), and I guess localtime_r can be eliminated entirely---Chrono's Local can be designed to make the common case much faster. As for gettimeofday, I do believe it should be replaced with clock_gettime anyway (which has nanosecond precision), and I completely agree that the vDSO should be used where possible. I will take a note of that :) (By the way, I'm currently blocked on getting a nanosecond-level timestamp on macOS, which seems to be overly complex...)
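
For reference, the direct clock_gettime path would look roughly like this (a minimal sketch assuming the libc crate; on Linux, glibc routes CLOCK_REALTIME through the vDSO where it can):

extern crate libc;

// Read CLOCK_REALTIME with nanosecond precision via libc's clock_gettime.
fn clock_realtime() -> (i64, u32) {
    let mut ts: libc::timespec = unsafe { std::mem::zeroed() };
    let rc = unsafe { libc::clock_gettime(libc::CLOCK_REALTIME, &mut ts) };
    assert_eq!(rc, 0, "clock_gettime failed");
    (ts.tv_sec as i64, ts.tv_nsec as u32)
}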

Second, formatting takes around 500ns, which seems like too much. I wish formatting provided something like fn fmt_rfc3339<W: io::Write>(&self, io: W) -> io::Result<()>, which would not require any allocation.

The following code would be equivalent (untested):

use chrono::format::{Item, Fixed};
datetime.format_with_items([Item::Fixed(Fixed::RFC3339)].iter().cloned())

I think this should be reasonably fast, but it still goes through Rust's general formatting machinery, which is not very fast. (It does not allocate, but it has tons of indirection.) I believe Chrono excels at breaking the time components apart, though.
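
If you want to skip the intermediate String entirely, note that format_with_items (and format) return a DelayedFormat, which implements Display and can therefore be streamed straight into any io::Write with write!. A rough sketch of the helper you asked for (hypothetical, not an existing Chrono API):

use std::io::{self, Write};

use chrono::{DateTime, Utc};
use chrono::format::{Fixed, Item};

// Write an RFC 3339 timestamp into `out` without building a String first.
fn write_rfc3339<W: Write>(dt: &DateTime<Utc>, mut out: W) -> io::Result<()> {
    let items = [Item::Fixed(Fixed::RFC3339)];
    write!(out, "{}", dt.format_with_items(items.iter().cloned()))
}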

Also, when investigating the code, it seems there are a lot of cloned() calls, and the whole thing is based on StrftimeItems, which, like printf in C, scans the format string every time.

cloned() is there to convert &Item to Item, and it is just a memory copy at worst (please let me know if it does otherwise!). You may also be referring to the Clone bound required for the iterator; it is currently there to make DelayedFormat cloneable, but I guess it could be made optional. I don't think either of these affects performance.

The format method intentionally scans the format string every time because avoiding that would risk an allocation. This design assumes that the result of a .format() call is immediately used as a formatting argument, and that advanced users would use the format_with_items method instead, but the latter is a bit rough right now (see #76). Anyway, this cost can be eliminated. Let me know if this doesn't work for you.
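
To make that concrete, pre-parsing the format once would look roughly like this (an untested sketch; the cached items borrow the format string, so it has to outlive them):

use chrono::Utc;
use chrono::format::Item;
use chrono::format::strftime::StrftimeItems;

fn main() {
    // Scan the strftime format only once...
    let format = "%Y-%m-%d %H:%M:%S%.3f";
    let items: Vec<Item> = StrftimeItems::new(format).collect();

    // ...then reuse the parsed items for every record without re-scanning.
    let now = Utc::now();
    println!("{}", now.format_with_items(items.iter().cloned()));
}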

@dpc
Author

dpc commented Sep 10, 2016

Thank you very much for the answer, very informative. I'm not really blocked by this in any way. The timestamp is lazily evaluated - only when required - so it's not a big deal, but obviously it would be better if it were as fast as possible, so I brought it to your attention.

@lifthrasiir
Contributor

lifthrasiir commented Oct 2, 2016

Okay, I've actually tried the vDSO as part of getting rid of the libtime dependency! The result was... a bit surprising to me (until I figured out what the hell was going on). The mandatory source code. On my Linux box, this resulted in...

test bench_cgt_clock       ... [example: (1475435701, 518071307)] bench:          48 ns/iter (+/- 0)
test bench_gtod_clock      ... [example: (1475435705, 536291000)] bench:          48 ns/iter (+/- 0)
test bench_vdso_cgt_clock  ... [example: (1475435711, 409065340)] bench:          46 ns/iter (+/- 0)
test bench_vdso_gtod_clock ... [example: (1475435714, 744960000)] bench:          48 ns/iter (+/- 0)
test bench_x_system_time   ... [example: (1475435718, 637945471)] bench:          58 ns/iter (+/- 0)

Note that the Bencher used here only checks the timer after a full batch of iterations (typically ranging from 100k to 1M) has run, so the Bencher itself should not have affected the measurement. The actual assembly for the tightest loop looks like this:

    f220:       48 8b 44 24 10          mov    0x10(%rsp),%rax
    f225:       31 ff                   xor    %edi,%edi
    f227:       4c 89 fe                mov    %r15,%rsi
    f22a:       ff d0                   callq  *%rax
    f22c:       85 c0                   test   %eax,%eax
    f22e:       75 6f                   jne    f29f <_ZN17system_time_bench20bench_vdso_cgt_clock17h9420e0afd8f9d31cE+0xef>
    f230:       48 ff c3                inc    %rbx
    f233:       48 8b 04 24             mov    (%rsp),%rax
    f237:       8b 4c 24 08             mov    0x8(%rsp),%ecx
    f23b:       48 89 04 24             mov    %rax,(%rsp)
    f23f:       89 4c 24 08             mov    %ecx,0x8(%rsp)
    f243:       4c 39 e3                cmp    %r12,%rbx
    f246:       72 d8                   jb     f220 <_ZN17system_time_bench20bench_vdso_cgt_clock17h9420e0afd8f9d31cE+0x70>

...which gives a fairly good idea of what is actually being measured. The callq *%rax there is calling a vDSO function, and you can see a glimpse of black_box being discarded (f233--f23f). I tried to remove the trailing black_box but it seems hard to do (I should note that it does adversely affect the extra benchmark for SystemTime, adding at least 5ns/iter---something to do with a larger struct?).

So in any case, they are almost identical in run time. But how is this possible? The CloudFlare post mentions a 2x speedup over the system call... And then I realized that I was actually measuring glibc's clock_gettime, which might not make a system call at all! (Go does not use libc, so they had to do the vDSO dance themselves.) Indeed, glibc will use the vDSO (or vsyscall) as much as possible, so it is natural that the two have essentially identical performance characteristics. Phew!

Out of curiosity, I've also benchmarked against musl. In my understanding, musl also uses the vDSO for clock_gettime but not for gettimeofday (glibc does both). And the result is...

test bench_cgt_clock       ... [example: (1475437875, 483293947)] bench:          49 ns/iter (+/- 0)
test bench_gtod_clock      ... [example: (1475437879, 304213000)] bench:          69 ns/iter (+/- 1)
test bench_vdso_cgt_clock  ... [example: (1475437879, 729167421)] bench:          47 ns/iter (+/- 0)
test bench_vdso_gtod_clock ... [example: (1475437882, 771884000)] bench:          48 ns/iter (+/- 0)
test bench_x_system_time   ... [example: (1475437886, 276674770)] bench:          57 ns/iter (+/- 0)

...which shows exactly that.


Takeaway: as long as we are using a common libc (which includes glibc and musl), we automatically get the vDSO. There is indeed a slight benefit in calling the vDSO clock_gettime directly, but only because of less indirection. Nothing to worry about there.
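
For reference, the shape of the benchmark is roughly as follows (a minimal sketch, not the linked source):

#![feature(test)]
extern crate libc;
extern crate test;

use std::time::SystemTime;
use test::{black_box, Bencher};

// clock_gettime through libc; glibc (and musl, for this call) go via the vDSO.
#[bench]
fn bench_cgt_clock(b: &mut Bencher) {
    b.iter(|| {
        let mut ts: libc::timespec = unsafe { std::mem::zeroed() };
        unsafe { libc::clock_gettime(libc::CLOCK_REALTIME, &mut ts) };
        black_box((ts.tv_sec, ts.tv_nsec))
    })
}

// std::time::SystemTime::now() for comparison.
#[bench]
fn bench_x_system_time(b: &mut Bencher) {
    b.iter(|| black_box(SystemTime::now()))
}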

PS. While doing this, I also confirmed that macOS's clock_get_time (via the Mach clock service) is a complete liar. It advertises struct timespec but only has usec precision, and it is 20x slower (i.e. ~1us per call) than gettimeofday, which has the same precision!

PPS. I guess std::time::SystemTime is the way to go if you want maximal performance. It is a struct timespec behind the scenes and does not go through a conversion like Chrono's types do. Eventually I plan to support initializing DateTime<Tz> from SystemTime anyway, so you will be able to convert only when required. (I initially didn't plan this because of leap seconds, but it seems that while some systems tick leap seconds, they do not have any mechanism to expose them. Ugh.)
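
The conversion itself could then happen lazily, along these lines (a sketch ignoring leap seconds; to_datetime is just an illustrative helper, not an existing API):

use std::time::{SystemTime, UNIX_EPOCH};

use chrono::{DateTime, TimeZone, Utc};

// Capture the cheap SystemTime up front; convert to a chrono DateTime only
// when the timestamp actually needs to be formatted or inspected.
fn to_datetime(t: SystemTime) -> DateTime<Utc> {
    let d = t.duration_since(UNIX_EPOCH).expect("time before the Unix epoch");
    Utc.timestamp(d.as_secs() as i64, d.subsec_nanos())
}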

@drrlvn

drrlvn commented Jun 5, 2018

Formatting also seems slow. I tested the format() method compared to manually formatting the fields:

use chrono::{Datelike, Local, TimeZone, Timelike};
use test::Bencher;

#[bench]
fn chrono_format(bench: &mut Bencher) {
    let ts = Local.timestamp(100, 100);
    bench.iter(|| format!("{}", ts.format("%F %T")))
}

#[bench]
fn manual_format(bench: &mut Bencher) {
    let ts = Local.timestamp(100, 100);
    bench.iter(|| {
        format!(
            "{}-{:02}-{:02} {:02}:{:02}:{:02}",
            ts.year(),
            ts.month(),
            ts.day(),
            ts.hour(),
            ts.minute(),
            ts.second()
        )
    })
}

With these results:

test chrono_format ... bench:       1,006 ns/iter (+/- 16)
test manual_format ... bench:         478 ns/iter (+/- 1)

Can format() be optimized somehow?

@quodlibetor
Contributor

I'm sure it can. The formatting/parsing infrastructure is complex. I'm currently much more interested in improving error messages than in improving speed, so while benchmarks and optimizations would be very welcome, I wouldn't be keen on taking changes that speed things up at the cost of making the errors worse.

On the other hand, I'm also interested in rewriting the parsing (and therefore formatting) code essentially from scratch, specifically because errors are pretty opaque right now, so optimizations are welcome in that they might inform a new design.

If you (or anyone) want to work on formatting/parsing performance I'm happy to mentor, but it's pretty low on the priority list of my personal work right now.
