drain-java is a continuous log template miner: for each log message it extracts tokens and groups them into clusters of tokens. As new log messages are added, drain-java identifies similar tokens and either updates the matching cluster with a new template or creates a new token cluster. Each time a cluster is matched, a counter is incremented.
These clusters are stored in a prefix tree, which is somewhat similar to a trie, except that the tree has a fixed depth in order to avoid long tree traversals. Avoiding deep trees also helps to keep the tree balanced.
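To give an intuition of the structure, here is a simplified sketch (not drain-java's actual internals): a fixed-depth tree can be emulated with a bounded routing key, where messages are first routed by token count, then by their first few tokens, so lookups never walk more than a fixed number of levels.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FixedDepthSketch {
    static final int DEPTH = 4; // bounds traversal regardless of message length

    // leaf "nodes" keyed by the bounded routing key; a real implementation
    // uses nested tree nodes and similarity matching inside each leaf
    final Map<String, List<List<String>>> leaves = new HashMap<>();

    void addLogMessage(String message) {
        var tokens = List.of(message.trim().split("\\s+"));
        // routing key: token count, then at most DEPTH - 1 leading tokens
        var key = new StringBuilder(Integer.toString(tokens.size()));
        for (int i = 0; i < Math.min(DEPTH - 1, tokens.size()); i++) {
            key.append('/').append(tokens.get(i));
        }
        // a real miner would merge the tokens into the closest cluster
        // (updating its template) instead of just appending them
        leaves.computeIfAbsent(key.toString(), k -> new ArrayList<>()).add(tokens);
    }
}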
First, Java 11 is required to run drain-java.
You can consume drain-java as a dependency in your project (io.github.bric3.drain:drain-java-core); currently only snapshots are available, by adding this repository:
repositories {
    maven {
        url("https://oss.sonatype.org/content/repositories/snapshots/")
    }
}
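The dependency declaration could then look like the following; the 0.1.0-SNAPSHOT version below is an assumption based on the jar built later in this document, so check the snapshot repository for the actual version:

dependencies {
    implementation("io.github.bric3.drain:drain-java-core:0.1.0-SNAPSHOT")
}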
Since this tool is not yet released, it needs to be built locally. Note that the built jar is not yet very user-friendly; as this is not a finished product, anything could change.
$ ./gradlew build
$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar -h
tail - drain
Usage: tail [-dfhV] [--verbose] [-n=NUM]
[--parse-after-str=FIXED_STRING_SEPARATOR]
[--parser-after-col=COLUMN] FILE
...
FILE log file
-d, --drain use DRAIN to extract log patterns
-f, --follow output appended data as the file grows
-h, --help Show this help message and exit.
-n, --lines=NUM output the last NUM lines, instead of the last 10; or use
-n 0 to output starting from beginning
--parse-after-str=FIXED_STRING_SEPARATOR
when using DRAIN remove the left part of a log line up to
after the FIXED_STRING_SEPARATOR
--parser-after-col=COLUMN
when using DRAIN remove the left part of a log line up to
COLUMN
-V, --version Print version information and exit.
--verbose Verbose output, mostly for DRAIN or errors
$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar --version
Versioned Command 1.0
Picocli 4.6.3
JVM: 19 (Amazon.com Inc. OpenJDK 64-Bit Server VM 19+36-FR)
OS: Mac OS X 12.6 x86_64
By default, the tool acts similarly to tail and outputs the file to stdout. The tool can follow a file if the --follow option is passed. However, when run with --drain, the tool classifies log lines using DRAIN and outputs the identified clusters.
Note that this tool doesn't handle multiline log messages (like logs that contain a stacktrace).
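For example, to keep watching a growing log file while extracting templates, the drain and follow options can be combined (app.log is a placeholder file name):

$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar -d -f app.log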
On the SSH log data set, the tool can be used this way:
$ java -jar build/libs/drain-java-1.0-SNAPSHOT-all.jar \
-d \ (1)
-n 0 \ (2)
--parse-after-str "]: " (3)
build/resources/test/SSH.log (4)
1. Identify patterns in the log.
2. Start from the beginning of the file (otherwise it starts from the last 10 lines).
3. Remove the left part of each log line (`Dec 10 06:55:46 LabSZ sshd[24200]: `), i.e. effectively ignore variable elements like the timestamp.
4. The log file.
---- Done processing file. Total of 655147 lines, done in 1.588 s, 51 clusters (1)
0010 (size 140768): Failed password for <*> from <*> port <*> ssh2 (2)
0009 (size 140701): pam unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= <*> <*>
0007 (size 68958): Connection closed by <*> [preauth]
0008 (size 46642): Received disconnect from <*> 11: <*> <*> <*>
0014 (size 37963): PAM service(sshd) ignoring max retries; <*> > 3
0012 (size 37298): Disconnecting: Too many authentication failures for <*> [preauth]
0013 (size 37029): PAM <*> more authentication <*> logname= uid=0 euid=0 tty=ssh ruser= <*> <*>
0011 (size 36967): message repeated <*> times: [ Failed password for <*> from <*> port <*> ssh2]
0006 (size 20241): Failed <*> for invalid user <*> from <*> port <*> ssh2
0004 (size 19852): pam unix(sshd:auth): check pass; user unknown
0001 (size 18909): reverse mapping checking getaddrinfo for <*> <*> failed - POSSIBLE BREAK-IN ATTEMPT!
0002 (size 14551): Invalid user <*> from <*>
0003 (size 14551): input userauth request: invalid user <*> [preauth]
0005 (size 14356): pam unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= <*>
0018 (size 1289): PAM <*> more authentication <*> logname= uid=0 euid=0 tty=ssh ruser= <*>
0024 (size 952): fatal: Read from socket failed: Connection reset by peer [preauth]
...
1. 51 types of logs were identified from 655147 lines in 1.588 s.
2. There were 140768 similar log messages with this pattern, with 3 positions where the token is identified as a parameter `<*>`.
On the same dataset, the Java implementation performed roughly 10 times faster. Since my implementation does not yet have masking, the mask configuration was removed from the Drain3 run to keep the comparison fair.
This tool is not yet intended to be used as a library, but for the curious, the DRAIN algorithm can be used this way:
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

var drain = Drain.drainBuilder()
        .additionalDelimiters("_")
        .depth(4)
        .build();
// feed every log line to the miner; try-with-resources closes the stream
try (var lines = Files.lines(Paths.get("build/resources/test/SSH.log"),
                             StandardCharsets.UTF_8)) {
    lines.forEach(drain::parseLogMessage);
}
// do something with the identified clusters
drain.clusters();
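The returned clusters can then be inspected, for example by printing them; this relies on the cluster objects having a readable toString, which is an assumption about the API rather than documented behavior:

// e.g. "0010 (size 140768): Failed password for <*> from <*> port <*> ssh2"
drain.clusters().forEach(System.out::println);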
Pieces of the puzzle are coming in no particular order. I first bootstrapped the code from a simple Java file, then wrote a Java implementation of Drain. Here's what I would like to do next:
- ❏ More unit tests
- ✓ Wire things together
- ❏ More documentation
- ✓ Implement tail follow mode (currently in drain mode the whole file is read and processing stops once finished)
- ❏ In follow drain mode, dump clusters on forced exit (e.g. when hitting ctrl+c)
- ✓ Start reading from the last x lines (like tail -n 30)
- ❏ Implement log masking (e.g. a log may contain an email or an IP address, which may be considered private data)
- ❏ JSON message field extraction
- ❏ How to handle prefixes: dates, log level, etc.; possibly using masking
- ❏ Investigate markers with specific behavior, e.g. log level severity
- ❏ Investigate logs with stacktraces (likely multiline)
- ❏ Improve handling of very long lines
- ❏ Logback appender with a Micrometer counter
I was inspired by a blog article on LogMine from one of my colleagues (many thanks to him for doing the initial research and explaining the concepts); we were both impressed by the log pattern extraction of Datadog's Log Explorer, and his blog post sparked my interest.
After some discussion together, we saw that Drain was a bit superior to LogMine. Googling for Drain in Java didn't yield any results (although I certainly didn't search exhaustively), which triggered the idea to implement this algorithm in Java.
The Drain port is mostly a port of Drain3, done by IBM folks (David Ohana, Moshik Hershcovitch). IBM's Drain3 is itself a fork of the original work by the LogPai team, based on the paper by Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R. Lyu.
I didn't follow up on the other contributors of these projects; reach out if you think you have been omitted.
For reference, here are the links I looked at: