This repository has been archived by the owner on Feb 14, 2023. It is now read-only.

Rare issue where lepton fails, but exits with code 0 #129

Open
leijurv opened this issue Nov 19, 2019 · 5 comments

Comments

@leijurv

leijurv commented Nov 19, 2019

lepton v1.0-1.2.1-183-g3d1bc19
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable

This is the entire output (stderr). I am using lepton as a subprocess, with stdin and stdout piped, to compress hundreds of thousands of jpgs. Lepton's stderr is passed through to my program's stderr. There is a crash approximately one in every 10000. It does not appear to be dependent on the file (rerunning the program works fine through the file that previously failed).

My main issue here is that lepton exits with code 0 even though it failed and wrote 0 bytes to stdout. If lepton fails for any reason, shouldn't it exit with a nonzero code to indicate the error? If not, how can I discover this error, other than searching stderr for this string?
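One way to catch this case without searching stderr, as a minimal sketch: treat empty stdout as a failure even when the exit code is 0, since the failing runs described above wrote 0 bytes. The name `run_checked` is hypothetical, and this sketch captures stderr (rather than passing it through) only so the message can be attached to the error.

```python
import subprocess


def run_checked(cmd, data):
    """Run cmd with data on stdin and return its stdout bytes.

    Raises RuntimeError on a nonzero exit code, and also on a zero exit
    code with empty stdout, which is the silent failure described above.
    """
    proc = subprocess.run(
        cmd, input=data, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    stderr_text = proc.stderr.decode("utf-8", errors="replace").strip()
    if proc.returncode != 0:
        raise RuntimeError(
            "exit code %d: %s" % (proc.returncode, stderr_text)
        )
    if len(proc.stdout) == 0:
        # Exit code 0 but nothing written: do not trust the exit code.
        raise RuntimeError("exit code 0 but empty output: %s" % stderr_text)
    return proc.stdout
```

Here `cmd` is whatever lepton command line the caller already uses, passed as a list of arguments. The empty-output check only works because a successful compression always produces at least some bytes.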

@danielrh
Contributor

I've certainly never seen this.

What command-line flags are you using?

My guess is that lepton is trying to allocate new threads or some other system resource and is failing.
I would try a few things to help diagnose this:
a) Try to avoid using the system allocator: pass --enable-custom-allocator and see if it still happens. The goal is to tell whether the system allocator or the creation of a new thread is causing the issue.
b) Try -singlethread when encoding, and see if you can reproduce it when no threads are being allocated.
c) Try the -skipverify flag, to make sure it's not a bug in the verifier.
If you can reproduce it with either (a) or (b) but not both, we may have an idea of where it's happening.
Alternatively, if you could get a stack trace (maybe by running through valgrind or gdb) so we can see where the error is happening, that would be ideal.
I've run this billions of times with skipverify and custom allocation and have never seen that error, though to be fair I've always decompressed the image again to check the result. (The default settings should do that automatically, unless it's a bug in the validator.)

@leijurv
Author

leijurv commented Dec 16, 2019

Oh yeah, I'm nearly certain that this happened when I was out of RAM, or perhaps out of threads. I had 8 lepton processes running, each invoked as ("lepton", "-allowprogressive", "-memory=2048M", "-threadmemory=256M", "-"). I only have 16 GB of physical RAM.

I don't expect lepton to magically continue working when I run out of RAM or threads, I only filed this issue because of the exit code being 0. It was only caught because I decompress every jpg to make sure it's byte for byte identical.

Lepton should not exit with a 0 exit code if there was an exception thrown, is what I'm suggesting in this issue.

It took a few attempts, but I was able to compress my whole photo library using lepton :) This is just for people in the future who look at the exit code to check if lepton succeeded. It should be nonzero if it fails for any reason, including running out of threads or RAM.
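The byte-for-byte check mentioned above can be sketched roughly as follows. This is an illustration under assumptions, not the code actually used: `roundtrip_ok` and its two command parameters are hypothetical names, and the decompression command line is left to the caller because this thread does not show one.

```python
import subprocess


def roundtrip_ok(compress_cmd, decompress_cmd, original):
    """Return True only if compress-then-decompress reproduces original exactly.

    Both commands read from stdin and write to stdout. Exit codes are
    checked, but the final byte comparison is what catches a failure
    that still exits with code 0.
    """
    packed = subprocess.run(compress_cmd, input=original, stdout=subprocess.PIPE)
    if packed.returncode != 0 or len(packed.stdout) == 0:
        return False
    restored = subprocess.run(
        decompress_cmd, input=packed.stdout, stdout=subprocess.PIPE
    )
    if restored.returncode != 0:
        return False
    # Byte-for-byte comparison against the input.
    return restored.stdout == original
```

For the setup described here, `compress_cmd` would be `["lepton", "-allowprogressive", "-memory=2048M", "-threadmemory=256M", "-"]`. The cost is a second lepton run per file, in exchange for not relying on the exit code at all.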

@danielrh
Contributor

Sorry for the delay here. Do you still have the image that caused this problem, so I can reproduce it and see how it exits with code zero?

@leijurv
Author

leijurv commented Dec 31, 2020

There is a crash approximately one in every 10000. It does not appear to be dependent on the file (rerunning the program works fine through the file that previously failed).

Sorry, I don't really remember, but rereading what I wrote ^

Maybe try greatly reducing the memory or thread memory to cause it intentionally? idk

The use case is https://github.com/leijurv/gb, specifically https://github.com/leijurv/gb/blob/master/compression/lepton.go

@danielrh
Contributor

I've written a custom malloc that fails after N allocations, and looped N all the way up to the point where every allocation succeeds. I can't seem to reproduce this problem. It must be something very specific to the class of images you've encountered. Is there any chance you can dig up an image that caused it to fail, so I can analyze its structure more deeply and come to a hypothesis about what could be causing the erroneous success?
