Add support to combine update operations #25976

urso · 2021-05-27T21:22:17Z

Enhancement

What does this PR do?

With this change the update operations emited by the filestream input
are not directly applied up on ACK, but will be scheduled for
application. A backgroun go-routine picks up all updates from all inputs
that are or have been active and applies the changes to the registry.

This removes the application to the cursor and write to persistent store
from the ACK handler path, which when blocked could produce back
pressure on the outputs, processors or inputs.

When another set of state updates are acked by the output, but the
writer is still busy with past updates, then we merge the new and other
pending updates into a single update operation. This reduces the amount
of in memory and on disk updates we need to execute if the system gets
under pressure.

TODO: performance tests logs input, master, and changes.

Why is it important?

With this change backpressure from the registry writes is mostly
nullified, which removes the need for a registry.flush setting, as we
have globally for the logs input.

limitations: We combine as many updates as we can. If the registry can't
be updates for a long time, there is a chance of publishing duplicates
when filebeat is restarted after a crash. This same issue exists with
the filebeat.registry.flush setting.
On clean shutdown we still serialize all pending updates for already
ACKed events.

Benchmarks

I did run some benchmarks with Filebeat master and this PR using the console output. I'm the complete output, as many subsystem are involved, that can impact performance overall.

Test cases:

SL: single log file. 1GB
L100: send 100 log files (10MB each) concurrently
EL: single log file is 95% of lines being filtered out by exclude_lines
DL: single log file is 95% of lines being filtered out by drop_event processor
EL100: single log file is 95% of lines being filtered out by exclude_lines. 100 log files processed concurrently.
DL100: single log file is 95% of lines being filtered out by drop_event processor. 100 log files processed concurrently.

Note: For the test cases with events being dropped, the output event rate does not exactly tell us how many events
have been processed in total. We estimate the assumed event rate of the inputs by computing rate_in = rate_out * 100 / (100 - filtered_percentage).

Inputs:

log: logs input
fs: filestream input as in master branch
fsack: filestream input with ACK merging

SL:

log:

$ ./scripts/run_console.zsh master filebeat -E max_procs=2  -c input/nasa.yml
CPU=188% User=37,05s Sys=0,68s Total=20,02s MaxRSS=69 InOps=502472 OutOps=816
[ 116k/s] [ 116k/s]

fs:

$ ./scripts/run_console.zsh master filebeat -E max_procs=2  -c input/filestream_nasa.yml
CPU=192% User=37,94s Sys=0,55s Total=20,01s MaxRSS=65 InOps=130048 OutOps=856
[ 146k/s] [ 146k/s]

fsack

$ ./scripts/run_console.zsh fsperf filebeat -E max_procs=2  -c input/filestream_nasa.yml
CPU=182% User=36,03s Sys=0,59s Total=20,03s MaxRSS=66 InOps=0 OutOps=848
[ 142k/s] [ 142k/s]

L100:

log

$ ./scripts/run_console.zsh master filebeat -E max_procs=2  -c input/log.yml
CPU=193% User=38,03s Sys=0,82s Total=20,03s MaxRSS=74 InOps=556736 OutOps=81184
[ 119k/s] [ 119k/s]

fs:

$ ./scripts/run_console.zsh master filebeat -E max_procs=2  -c input/filestream_log.yml
CPU=193% User=38,10s Sys=0,61s Total=20,03s MaxRSS=118 InOps=74240 OutOps=68000
[ 125k/s] [ 125k/s]

fsack

$ ./scripts/run_console.zsh fsperf filebeat -E max_procs=2  -c input/filestream_log.yml
CPU=189% User=37,26s Sys=0,71s Total=20,02s MaxRSS=128 InOps=75776 OutOps=69240
[ 131k/s] [ 131k/s]

EL:

log:

$ ./scripts/run_console.zsh master filebeat -E max_procs=2  -c input/logsremove.yml
CPU=160% User=29,44s Sys=2,74s Total=20,02s MaxRSS=105 InOps=0 OutOps=184
[8,38k/s] [8,38k/s]
rate_in: 167,0k events/s

fs:

$ ./scripts/run_console.zsh master filebeat -E max_procs=2  -c input/filestream_logsremove.yml
CPU=155% User=30,53s Sys=0,60s Total=20,02s MaxRSS=63 InOps=0 OutOps=184
[16,9k/s] [16,9k/s]
rate_in: 338,0k events/s

fsack:

$ ./scripts/run_console.zsh fsperf filebeat -E max_procs=2  -c input/filestream_logsremove.yml
CPU=149% User=29,31s Sys=0,64s Total=20,02s MaxRSS=64 InOps=0 OutOps=192
[16,9k/s] [16,9k/s]
rate_in: 338,0k events/s

DL:

log:

$ ./scripts/run_console.zsh master filebeat -E max_procs=2  -c input/logsremove_drop.yml
CPU=161% User=28,91s Sys=3,37s Total=20,03s MaxRSS=101 InOps=0 OutOps=144
[4,29k/s] [4,29k/s]
rate_in: 86,8k events/s

fs:

$ ./scripts/run_console.zsh master filebeat -E max_procs=2  -c input/filestream_logsremove_drop.yml
CPU=156% User=30,84s Sys=0,52s Total=20,02s MaxRSS=98 InOps=0 OutOps=160
[6,14k/s] [6,14k/s]
rate_in: 122,8k events/s

fsack:

$ ./scripts/run_console.zsh fsperf filebeat -E max_procs=2  -c input/filestream_logsremove_drop.yml
CPU=159% User=31,34s Sys=0,50s Total=20,02s MaxRSS=91 InOps=0 OutOps=144
[7,06k/s] [7,06k/s]
rate_in: 141,2 events/s

EL100:

logs
$ ./scripts/run_console.zsh master filebeat -E max_procs=2 -c input/dir_remove.yml CPU=194% User=28,86s Sys=0,40s Total=15,04s MaxRSS=108 InOps=0 OutOps=6848 [12,8k/s] [12,8k/s] rate_in: 256k events/s

fs

$ ./scripts/run_console.zsh master filebeat -E max_procs=2  -c input/filestream_dir_remove.yml
CPU=186% User=27,57s Sys=0,42s Total=15,02s MaxRSS=84 InOps=0 OutOps=9640
[23,7k/s] [23,7k/s]
rate_in: 474k events/s

fsack

$ ./scripts/run_console.zsh fsperf filebeat -E max_procs=2  -c input/filestream_dir_remove.yml
CPU=183% User=27,17s Sys=0,33s Total=15,02s MaxRSS=83 InOps=0 OutOps=9624
[23,8k/s] [23,8k/s]
rate_in: 476k events/s

DL100:

logs

$ ./scripts/run_console.zsh master filebeat -E max_procs=2  -c input/dir_drop.yml
CPU=188% User=27,65s Sys=0,63s Total=15,02s MaxRSS=108 InOps=0 OutOps=2592
[4,62k/s] [4,62k/s]
rate_in: 92,4k events/s

fs

$ ./scripts/run_console.zsh master filebeat -E max_procs=2  -c input/filestream_dir_drop.yml
CPU=198% User=29,74s Sys=0,23s Total=15,06s MaxRSS=117 InOps=0 OutOps=6640
[8,75k/s] [8,75k/s]
rate_in: 175k events/s

fsack

$ ./scripts/run_console.zsh fsperf filebeat -E max_procs=2  -c input/filestream_dir_drop.yml
CPU=198% User=29,64s Sys=0,25s Total=15,04s MaxRSS=118 InOps=0 OutOps=4504
[10,1k/s] [10,1k/s]
rate_in: 202k events/s

=>
In general, the new filestream input seems to be faster in comparison to the logs input.

Especially the exclude_lines configuration is almost twice as fast as the logs input.
The rate achieved by the inputs with exclude_lines only outperforms the
non-filtered case always.
Note: The tests have been run without any real output that can produce backpressure.
The exclude_lines implementation of the logs inputs produce a status update
per dropped event, that still must be processed by the registry. Slow outputs
can still slow down inputs with lines to be filtered.
The filestream input on the other hand just ignores the line without producing
a status update for the registry, which allows it to keep a higher rate in case
the output is unresponsive.

Filtering event via drop_event + when clause is rather costly. In comparison
the exclude_lines a full event must be produces, the event is then
'normalized', which results in a deep copy with many small allocations, the
when clause must be evaluated and finally the event is dropped via ack
handlers. Next the produced but dropped event might eventually be shipped to the registrar
with or without pending ACKs. At worst this can produce an ACK event per
dropped event (very likely at 95% drop rate). By writing an event per dropped event to the
registry file, we produce additional overhead, which can result in backpressure and
slow down publishing of events.
The full processing generates quite some measurable overhead in all setups.
In case of the logs input, the line read rate is less in comparison to simply
publishing all lines. Due to the architecture of the logs input, this can also
lead to a very long startup phase, which is intermixed with event publishing.
The filestream input can better keep the rate here, and with ACK merging
support (fsack tests), the read rate still is higher, then for the non-filtered case.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

[ ]

How to test this PR locally

Related issues

Use cases

Screenshots

Logs

With this change the update operations emited by the filestream input are not directly applied up on ACK, but will be scheduled for application. A backgroun go-routine picks up all updates from all inputs that are or have been active and applies the changes to the registry. This removes the application to the cursor and write to persistent store from the ACK handler path, which when blocked could produce back pressure on the outputs, processors or inputs. When another set of state updates are acked by the output, but the writer is still busy with past updates, then we merge the new and other pending updates into a single update operation. This reduces the amount of in memory and on disk updates we need to execute if the system gets under pressure. With this change backpressure from the registry writes is mostly nullified, which removes the need for a `registry.flush` setting, as we have globally for the logs input. limitations: We combine as many updates as we can. If the registry can't be updates for a long time, there is a chance of publishing duplicates when filebeat is restarted after a crash. This same issue exists with the `filebeat.registry.flush` setting. On clean shutdown we still serialize all pending updates for already ACKed events.

urso · 2021-05-27T21:25:18Z

filebeat/input/filestream/internal/input-logfile/publish.go

-		return nil, err
-	}
-	resource.pendingCursor = cursor
+	resource.pendingUpdate = updates


we do not apply the update directly anymore. Instead we store the most recent delta with the resource. The delta is applied lazily when the current pendingCursor state is required for real. This removes an 'expensive' operation from the hot path.

urso · 2021-05-27T21:28:44Z

filebeat/input/filestream/internal/input-logfile/update_writer.go

+// Set overwrites key value pair in the pending update operations.
+func (w *updateWriter) Schedule(op *updateOp, n uint) {
+	w.mutex.Lock()
+	defer w.mutex.Unlock()


Schedule is a point of contention here. But it is still better then blocking and waiting for the serialization and IO to be finished.

elasticmachine · 2021-05-27T21:33:35Z

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS

Expand to view the summary

Build stats

Start Time: 2021-07-19T19:46:22.082+0000
Duration: 96 min 12 sec
Commit: 5085316

Test stats 🧪

Test	Results
Failed	0
Passed	14894
Skipped	2313
Total	17207

Trends 🧪

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test	Results
Failed	0
Passed	14894
Skipped	2313
Total	17207

urso · 2021-05-28T15:32:48Z

/test

kvch

I like the general direction of the PR, hence I approved the PR. I am glad loginp.Cursor no longer requires a store instance. \o/

But please provide tests for your changes.

- don't pass complete store, but only channel to acker (only pass required functionality) - separate the update writer into worker and channel to ease testing

elasticmachine · 2021-07-19T19:39:36Z

Pinging @elastic/agent (Team:Agent)

urso · 2021-07-19T19:46:37Z

Added tests. PR is now ready for review.

(cherry picked from commit 1c71c94)

(cherry picked from commit 1c71c94) Co-authored-by: Steffen Siering <[email protected]>

botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label May 27, 2021

urso commented May 27, 2021

View reviewed changes

remove old comment

afdb8ca

lint

9b9b584

urso mentioned this pull request May 28, 2021

Log file collector based on v2 input API #20243

Closed

18 tasks

kvch approved these changes May 31, 2021

View reviewed changes

kvch suggested changes Jun 1, 2021

View reviewed changes

jsoriano added the Team:Elastic-Agent Label for the Agent team label Jun 24, 2021

botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jun 24, 2021

urso added 5 commits June 28, 2021 21:42

Merge branch 'master' into filestream-performance

9cef6fc

Merge branch 'master' into filestream-performance

287024d

refactor/cleanup update writer

2df8ac8

- don't pass complete store, but only channel to acker (only pass required functionality) - separate the update writer into worker and channel to ease testing

add test

cf9c4c4

Add missing license header

2f39e88

urso added the backport-v7.15.0 Automated backport with mergify label Jul 19, 2021

urso requested a review from kvch July 19, 2021 19:39

urso marked this pull request as ready for review July 19, 2021 19:39

add changelog entry

5085316

kvch approved these changes Jul 20, 2021

View reviewed changes

urso merged commit 1c71c94 into elastic:master Jul 21, 2021

urso deleted the filestream-performance branch July 21, 2021 13:01

mergify bot pushed a commit that referenced this pull request Jul 21, 2021

Add support to combine update operations (#25976)

067574f

(cherry picked from commit 1c71c94)

mergify bot mentioned this pull request Jul 21, 2021

[7.x](backport #25976) Add support to combine update operations #27000

Merged

kvch pushed a commit that referenced this pull request Jul 22, 2021

Add support to combine update operations (#25976) (#27000)

702646a

(cherry picked from commit 1c71c94) Co-authored-by: Steffen Siering <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add support to combine update operations #25976

Add support to combine update operations #25976

Uh oh!

urso commented May 27, 2021 •

edited

Loading

Uh oh!

urso May 27, 2021

Uh oh!

urso May 27, 2021

Uh oh!

elasticmachine commented May 27, 2021 •

edited by jenkins-beats-ci bot

Loading

Build stats

Test stats 🧪

Trends 🧪

Test stats 🧪

Uh oh!

urso commented May 28, 2021

Uh oh!

kvch left a comment

Uh oh!

elasticmachine commented Jul 19, 2021

Uh oh!

urso commented Jul 19, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Add support to combine update operations #25976

Add support to combine update operations #25976

Uh oh!

Conversation

urso commented May 27, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Why is it important?

Benchmarks

Checklist

Author's Checklist

How to test this PR locally

Related issues

Use cases

Screenshots

Logs

Uh oh!

urso May 27, 2021

Choose a reason for hiding this comment

Uh oh!

urso May 27, 2021

Choose a reason for hiding this comment

Uh oh!

elasticmachine commented May 27, 2021 • edited by jenkins-beats-ci bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

💚 Build Succeeded

Build stats

Test stats 🧪

Trends 🧪

💚 Flaky test report

Test stats 🧪

Uh oh!

urso commented May 28, 2021

Uh oh!

kvch left a comment

Choose a reason for hiding this comment

Uh oh!

elasticmachine commented Jul 19, 2021

Uh oh!

urso commented Jul 19, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

urso commented May 27, 2021 •

edited

Loading

elasticmachine commented May 27, 2021 •

edited by jenkins-beats-ci bot

Loading