Improve memory use and performance of multipart parsing #745
Conversation
Memory usage and performance noted in https://discourse.julialang.org/t/http-parse-multipart-form-does-a-lot-of-memory-allocations/66661. This uses some of the same regex tricks we use in the other HTTP parsing code that operates on SubStrings and avoids excessive allocations. We also streamline things to avoid allocating when comparing byte buffers and to drop the unnecessarily allocated arrays in the content-disposition processing. On my machine for current master, I see `35.022 μs (866 allocations: 70.27 KiB)` for parsing a simple multipart request from the tests, and with this PR, I see `3.603 μs (42 allocations: 2.55 KiB)`.
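For readers following along, here is a rough, hedged sketch of the SubString-plus-regex idea described above; the regex and the helper name `parse_disposition_sketch` are made up for illustration and are not the code in this PR.

```julia
# Minimal sketch (not the PR's implementation): match Content-Disposition
# fields with one regex and keep the captures as SubStrings of the original
# header, so no intermediate String or Array copies are allocated.
const CONTENT_DISPOSITION_RE = r"form-data;\s*name=\"([^\"]*)\"(?:;\s*filename=\"([^\"]*)\")?"

function parse_disposition_sketch(header::AbstractString)
    m = match(CONTENT_DISPOSITION_RE, header)
    m === nothing && return nothing
    # m[1] and m[2] are SubStrings viewing into `header` (m[2] is `nothing`
    # when no filename is present); nothing is copied.
    return (name = m[1], filename = m[2])
end

parse_disposition_sketch("Content-Disposition: form-data; name=\"file\"; filename=\"a.txt\"")
# (name = "file", filename = "a.txt")
```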
""" | ||
find_multipart_boundary(bytes, boundaryDelimiter; start::Int=1) | ||
|
||
Find the first and last index of the next boundary delimiting a part, and if | ||
the discovered boundary is the terminating boundary. | ||
""" | ||
function find_multipart_boundary(bytes::AbstractVector{UInt8}, boundaryDelimiter::AbstractVector{UInt8}; start::Int=1) | ||
@inline function find_multipart_boundary(bytes::AbstractVector{UInt8}, boundaryDelimiter::AbstractVector{UInt8}; start::Int=1) |
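As a hedged aside, here is a usage sketch of the helper shown in the hunk above; whether the delimiter includes the leading dashes and the exact shape of the return value are assumptions here, so check the implementation before copying this.

```julia
# Hypothetical usage sketch only; delimiter format is an assumption,
# not taken from the PR.
body     = Vector{UInt8}("--abc\r\nfield data\r\n--abc--\r\n")
boundary = Vector{UInt8}("--abc")

result = find_multipart_boundary(body, boundary; start=1)
# Per the docstring, `result` carries the first and last index of the next
# boundary and whether that boundary is the terminating ("--abc--") one.
@show result
```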
I'm kind of amazed that inlining this function has any performance improvement.
Oh, it doesn't. This was from an iteration where I was returning a SubString instead of indices and it could avoid the allocation if it was inlined.
So should it have been removed?
I can confirm this significantly improves the original request that was being tested/reported on Discourse; it is much larger and sees an even bigger benefit.

Before:

```julia
julia> @benchmark HTTP.parse_multipart_form($r)
BenchmarkTools.Trial: 26 samples with 1 evaluation.
 Range (min … max):  181.312 ms … 235.950 ms  ┊ GC (min … max): 3.11% … 3.33%
 Time  (median):     195.296 ms               ┊ GC (median):    3.16%
 Time  (mean ± σ):   197.687 ms ±  13.029 ms  ┊ GC (mean ± σ):  3.20% ± 0.16%

 Memory estimate: 584.19 MiB, allocs estimate: 4785830.
```

After:

```julia
julia> @benchmark HTTP.parse_multipart_form($r)
BenchmarkTools.Trial: 917 samples with 1 evaluation.
 Range (min … max):  5.131 ms … 9.564 ms    ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.309 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.447 ms ± 448.252 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 2.03 KiB, allocs estimate: 32.
```
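For anyone who wants to reproduce a comparable measurement locally, here is a hedged setup sketch; the payload below is a made-up minimal multipart body rather than the one from the Discourse thread, so absolute numbers will differ.

```julia
using HTTP, BenchmarkTools

# Build a tiny multipart request by hand (illustrative payload only).
boundary = "----formboundary123"
body = "--$boundary\r\n" *
       "Content-Disposition: form-data; name=\"field\"\r\n\r\n" *
       "hello\r\n" *
       "--$boundary--\r\n"

r = HTTP.Request("POST", "/",
                 ["Content-Type" => "multipart/form-data; boundary=$boundary"],
                 Vector{UInt8}(body))

@benchmark HTTP.parse_multipart_form($r)
```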
In `src/parsemultipart.jl` (outdated diff):
```julia
const CR_BYTE = 0x0d # \r
const LF_BYTE = 0x0a # \n
const DASH_BYTE = 0x2d # -
const HTAB_BYTE = 0x09 # \t
const SPACE_BYTE = 0x20
const RETURN_BYTES = [CR_BYTE, LF_BYTE]
const SEMICOLON_BYTE = UInt8(';')
const CRLFCRLF = [CR_BYTE, LF_BYTE, CR_BYTE, LF_BYTE]
```
Make into a tuple (allows the compiler to know its length etc.)? I don't think it is mutated anywhere?
Ok, I unexported
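A hedged sketch of what the tuple suggestion could look like; this is not code from the PR, and call sites that index or compare against these constants would need to accept tuples as well.

```julia
# Sketch only: tuples are immutable and their length is known to the
# compiler, so these fixed byte sequences never touch the heap.
const RETURN_BYTES = (CR_BYTE, LF_BYTE)
const CRLFCRLF     = (CR_BYTE, LF_BYTE, CR_BYTE, LF_BYTE)
```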
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #745      +/-   ##
==========================================
- Coverage   76.89%   76.84%   -0.05%
==========================================
  Files          38       38
  Lines        2445     2514      +69
==========================================
+ Hits         1880     1932      +52
- Misses        565      582      +17
```

Continue to review the full report at Codecov.
Force-pushed from 4d8c65f to a52b672.
Also clean up usages of startread, closeread, startwrite, closewrite so they are qualified to the IOExtras module. Also fix the websocket tests that were hanging since echo.websocket.org doesn't exist anymore. I found Pie Socket, which offers a free websocket testing server if you sign up and get an API key, so that key has been added as a repository secret now. If anyone wants the API key to run the tests on their local machine, I'm happy to share.
```julia
closeread(io) = nothing
if isdefined(Base, :startwrite)
    "$_doc"
    Base.startwrite(io) = nothing
```
This is type piracy now
Yeah, we'll need to make a plan around this, since Base doesn't define a fallback for `closewrite`. Either we add `closewrite(io) = nothing` to Base and check whether it's defined here, or it's also possible we don't need the fallback anymore. I left this as-is for now since it keeps the change non-breaking.
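A hedged sketch of the "check whether Base defines it" approach mentioned above; this illustrates the pattern, it is not the code currently in the PR.

```julia
# Pattern sketch only: reuse Base's generic when it exists, otherwise
# define our own function with a no-op fallback.
if isdefined(Base, :closewrite)
    import Base: closewrite
else
    function closewrite end
    closewrite(io) = nothing
end
```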
Why extend the function at all? You can just keep using it as a separate function and qualify it (or the Base one).
Because then that's breaking; we have all these code examples all over the package where we tell people to write streaming servers doing `closewrite(stream)`.
You can still export it. Downstream users just have to decide whether they want to use `Base.closewrite` or `HTTP.closewrite`.
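For illustration, a hedged sketch of what the qualified call would look like in a downstream streaming handler (the handler itself is hypothetical):

```julia
using HTTP

# Hypothetical streaming handler; `stream` is assumed to be an HTTP stream.
# Qualifying the call makes it unambiguous which `closewrite` is meant,
# regardless of what Base exports.
function handle(stream)
    write(stream, "hello")
    HTTP.closewrite(stream)
end
```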
Personally, I would strongly prefer that we remove the type piracy and just make a breaking release of HTTP.jl.
(I don't think it's actually breaking. I think the burden is on the downstream users to make sure that they properly qualify any names that they use.
But if this is considered breaking, then I'd rather just tag the breaking release so that we can get rid of the type piracy.)
… considered harmful") (#42080) I feel we are heading up against a "`using` crisis" where any new feature that is implemented by exporting a new name (either in Base or a package) becomes a breaking change. This is already happening (JuliaGPU/CUDA.jl#1097, JuliaWeb/HTTP.jl#745) and as projects get bigger and more names are exported, the likelihood of this rapidly increases. The flaw in `using Foo` is fundamental in that you cannot lexically see where a name comes from, so when two packages export the same name, you are screwed. Any code that relies on `using Foo` and then using an exported name from `Foo` is vulnerable to another dependency exporting the same name. Therefore, I think we should start to strongly discourage the use of `using Foo` and only recommend `using Foo` for ephemeral work (e.g. REPL work).

---------

Co-authored-by: Dilum Aluthge <[email protected]>
Co-authored-by: Mason Protter <[email protected]>
Co-authored-by: Max Horn <[email protected]>
Co-authored-by: Matt Bauman <[email protected]>
Co-authored-by: Alex Arslan <[email protected]>
Co-authored-by: Ian Butterworth <[email protected]>
Co-authored-by: Neven Sajko <[email protected]>
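As a small hedged illustration of the style that commit message argues for (the package and names below are chosen only as an example):

```julia
# Fragile: every export of HTTP lands in this namespace, so a future export
# from another dependency (or from Base) can collide with one of them.
using HTTP

# Robust: the origin of each name is spelled out, so new exports elsewhere
# cannot silently change what this code means.
using HTTP: closewrite, startwrite
```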