Add upload timeout when using disk buffer to maintain fresh logs on S3 #8

Open · wants to merge 12 commits into `main`
Conversation

@davemarco (Contributor) commented Aug 19, 2024

Description

Adds a timeout when using the disk buffer, to maintain fresh logs on S3. Previously, only the disk buffer size was used as the upload criterion.

Smaller changes include cleaning up the format of the plugin's logs, and adding newlines in IR messages to make decompressed output more readable.

The next PR will add a guard preventing two output instances from sharing the same disk buffer path. For now, I just put a warning in the README.

Timeout goroutines

Adding a timeout is non-trivial. For example, consider a scenario where an application writes a few logs and then never writes again. The logs will be stored in the disk buffer, and the plugin will return execution to the Fluent Bit engine. Since there are no more writes, the plugin never regains execution to send those logs. I could not think of an easy way to solve this problem.

The solution in this PR spawns a goroutine for each tag. The goroutine listens for a timeout event and, when one is received, sends the buffered logs to S3.

Unfortunately, this multithreaded structure introduces synchronization problems. Mutexes were added to prevent an upload from running while a write is in progress.

Next, I added wait groups to ensure the plugin closes its goroutines when the program receives a kill signal. Without wait groups, the goroutines would simply be terminated by the OS. This helps avoid a scenario where the IR/Zstd postamble has been written, the plugin is terminated, and the stream has not yet been uploaded to S3; on restart, the stream would end up with a double postamble. Probably overkill, but I am trying to be safe.

Lastly, this PR should also bring some performance improvement. This was not planned, just a result of the design: since uploading is done in its own goroutine, the plugin can pass execution back to the Fluent Bit engine sooner. This is one reason size-triggered disk buffer uploads also use a goroutine, and why a goroutine is used with the memory buffer as well, even though the memory buffer has no timeout and does not strictly require one.
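The per-tag design described above can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the plugin's actual API: `tagState`, `upload`, and `listen` are invented names standing in for the real buffer state, S3 upload, and listener goroutine.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tagState is a hypothetical stand-in for the plugin's per-tag buffer state.
type tagState struct {
	mu     sync.Mutex // serializes writes against uploads
	buffer []string   // stands in for the disk/memory buffer contents
}

// upload drains the buffer under the mutex; in the real plugin this would
// send an IR/Zstd stream to S3.
func (t *tagState) upload(reason string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if len(t.buffer) == 0 {
		return
	}
	fmt.Printf("[out_clp_s3] uploading %d logs (%s)\n", len(t.buffer), reason)
	t.buffer = nil
}

// listen is the per-tag goroutine: it uploads when the timeout fires and
// exits cleanly when done is closed, signalling wg so shutdown can wait.
func (t *tagState) listen(timeout time.Duration, done <-chan struct{}, wg *sync.WaitGroup) {
	defer wg.Done()
	ticker := time.NewTicker(timeout)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			t.upload("timeout")
		case <-done:
			t.upload("shutdown") // flush before exit to avoid a half-written stream
			return
		}
	}
}

func main() {
	state := &tagState{}
	done := make(chan struct{})
	var wg sync.WaitGroup

	wg.Add(1)
	go state.listen(50*time.Millisecond, done, &wg)

	state.mu.Lock()
	state.buffer = append(state.buffer, "log line 1", "log line 2")
	state.mu.Unlock()

	time.Sleep(100 * time.Millisecond) // let the timeout fire
	close(done)                        // e.g. triggered by a kill signal
	wg.Wait()                          // ensure the listener has flushed before exit
}
```

The mutex covers both the application-write path and the timeout-upload path, and the `done` channel plus `sync.WaitGroup` give the graceful-shutdown behaviour discussed above.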

Log format

Made the log format consistent for all logs: they all now start with "[out_clp_s3] ".

IR msg changes

Decompressed IR streams previously appeared as one concatenated string. I added a space separating the timestamp from the log message, and a newline separating consecutive messages. I spoke about the newline with Kirk earlier but only added it now.
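The change amounts to the decoded layout below. This is an illustrative sketch only (`formatIRMessage` is a hypothetical name; the real plugin encodes these fields into the IR stream rather than formatting strings):

```go
package main

import "fmt"

// formatIRMessage shows the decoded layout described above: a space between
// the timestamp and the log message, and a trailing newline separating
// consecutive messages in the decompressed output.
func formatIRMessage(timestamp, msg string) string {
	return fmt.Sprintf("%s %s\n", timestamp, msg)
}

func main() {
	fmt.Print(formatIRMessage("2024-08-19T23:47:00Z", "request served"))
}
```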

Validation performed

Tested that the timeout worked correctly.

@davemarco davemarco requested a review from davidlion August 19, 2024 23:47
@davidlion (Member) left a comment


For the goroutines, it appears error returns were replaced with logs or panic calls. Let's go back to returning nil/errors through a channel (we can keep all the logs). One event manager shouldn't be able to bring down everything by panicking.

| Key | Description | Default |
|-----|-------------|---------|
| `use_disk_buffer` | Buffer logs on disk prior to sending to S3. See [Disk Buffering](#disk-buffering) for more info. | `TRUE` |
| `disk_buffer_path` | Directory for disk buffer. Path should be unique for each output. | `tmp/out_clp_s3/1/` |
| `upload_size_mb` | Set upload size in MB when disk buffer is enabled. Size refers to the compressed size. | `16` |
| `timeout` | Upload timeout if upload size is not met. For use when disk buffer is enabled. Valid time units are s, m, h. | `15m` |

Let's point people to time.ParseDuration, as it is a more complete description of what is and isn't valid.

Comment on lines +65 to +70
```go
if m.Writer.GetUseDiskBuffer() {
	m.diskUploadListener(config, uploader)
} else {
	m.memoryUploadListener(config, uploader)
}
}
```

Let's refactor this. I could be wrong, but it looks like all of the upload paths should be merged. We should still have a timeout in the memory-buffered version, and we can call CheckEmpty, which should cover the differences.

Essentially, we shouldn't need to know if it is memory or disk buffering in the manager code.
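One shape the merged path could take is sketched below. This is purely hypothetical: only `CheckEmpty` is named in the review; `writer`, `uploadListener`, and the channels are invented stand-ins for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// writer is a hypothetical stand-in for the plugin's buffer writer; only
// CheckEmpty is named in the review, the rest is illustrative.
type writer struct {
	pending int // number of buffered logs awaiting upload
}

func (w *writer) CheckEmpty() bool { return w.pending == 0 }

// uploadListener is a single timeout loop shared by both memory and disk
// buffering: CheckEmpty covers the "nothing to flush" case, so the manager
// never needs to know which buffer type is in use.
func uploadListener(w *writer, timeout time.Duration, stop <-chan struct{}, uploaded chan<- int) {
	ticker := time.NewTicker(timeout)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if w.CheckEmpty() {
				continue
			}
			uploaded <- w.pending // stands in for the S3 upload
			w.pending = 0
		case <-stop:
			return
		}
	}
}

func main() {
	w := &writer{pending: 3}
	stop := make(chan struct{})
	uploaded := make(chan int, 1)
	go uploadListener(w, 20*time.Millisecond, stop, uploaded)
	fmt.Println(<-uploaded) // 3
	close(stop)
}
```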

@kirkrodrigues (Member) commented:

I guess this PR is waiting for changes from @davemarco?
