Presentation slides are available here (and a Japanese version here).
Have you ever used grep's -A, -B, or -C options, or pcregrep, or maybe awk or perl one-liners, to search something like multi-line log entries?
Then this command is probably for you.
mlr-grep is like grep, but record-oriented rather than line-oriented: when it finds a match in a line, it prints all lines of the record in which the match is found. In other words, it prints all the lines around the match, delimited by the record separator instead of \n.
And of course, you can specify the record separator using the --rs=REGEX option, the default of which is ^$|(-----*|=====*)$ (a blank line, or a line of four or more dashes or equal signs). This is similar to the -d (delimiter) option of agrep, but our version accepts an arbitrary regex as a record separator.
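The record-oriented idea can be sketched in a few lines of Python. This is only an illustration, not the code of any of the actual implementations: the function names are hypothetical, and it assumes that a line matching the separator regex starts a new record and that the default separator means a blank line or a line ending in four or more dashes or equal signs.

```python
import re

def split_records(lines, rs):
    """Group lines into records; a line matching `rs` starts a new record.

    Assumption for this sketch: the separator line is kept as the first
    line of the record it starts.
    """
    sep = re.compile(rs)
    record = []
    for line in lines:
        if sep.search(line) and record:
            yield record
            record = []
        record.append(line)
    if record:
        yield record

def grep_records(text, pattern, rs=r"^$|(-----*|=====*)$"):
    """Return every record (joined back into a string) with a matching line."""
    pat = re.compile(pattern)
    return [
        "\n".join(rec)
        for rec in split_records(text.splitlines(), rs)
        if any(pat.search(line) for line in rec)
    ]

if __name__ == "__main__":
    sample = "[name1]\nattr1=abc\n\n[name2]\nattr1=xxx\nattr2=456\n\n[name3]\nattr2=789"
    for rec in grep_records(sample, "attr2=456"):
        print(rec)
```

Because the whole record is returned, there is no need to guess a context size as with grep -C.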
Useful for multi-line logs, config files with indents or section headers (like *.ini-like formats), and command output like that of ifconfig or pbsnodes.
There are four implementations, in Haskell, AWK, Rust, and Python, named hmlgrep, amlgrep, rmlgrep, and pymlgrep respectively.
The Haskell version is the fastest, often by a wide margin, especially for sparse matches. It's often as fast as GNU grep and sometimes faster, because it adopted some of GNU grep's performance optimization techniques, such as avoiding breaking input into lines, using the Boyer-Moore algorithm for non-regex patterns, and using mmap to avoid copying buffers.
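To illustrate why avoiding a line-by-line split helps with sparse matches, here is a minimal Python sketch. It is not hmlgrep's code: the function name is hypothetical, the separator is assumed to be a fixed blank line (b"\n\n") rather than a regex, and the needle is assumed to contain no newline. The idea is to search the whole buffer for the pattern first and only then look for the record boundaries around each hit.

```python
def find_records(buf: bytes, needle: bytes, sep: bytes = b"\n\n"):
    """Return the records of `buf` containing `needle`.

    The buffer is never split into lines; record boundaries are located
    only around actual hits, so regions without matches are scanned once
    by the substring search and otherwise left untouched.
    """
    records = []
    pos = 0
    while True:
        hit = buf.find(needle, pos)
        if hit == -1:
            break
        # Nearest separator before the hit marks the start of the record.
        start = buf.rfind(sep, 0, hit)
        start = 0 if start == -1 else start + len(sep)
        # Nearest separator after the hit marks the end of the record.
        end = buf.find(sep, hit)
        end = len(buf) if end == -1 else end
        records.append(buf[start:end])
        # Skip the rest of this record so it is reported only once.
        pos = max(end, hit + len(needle))
    return records

if __name__ == "__main__":
    data = b"a1\na2\n\nb1\nneedle\nb3\n\nc1\n"
    print(find_records(data, b"needle"))
```

Python's mmap objects also provide find and rfind, so the same approach could in principle run over a memory-mapped file without reading it into a separate buffer.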
MLR_GREP [OPTIONS] [--] PATTERN[...] [--] [FILES...]
- -v, --invert-match: Select non-matching lines (same as grep -v).
- -i, --ignore-case: Case-insensitive match (same as grep -i).
- -c, --count: Print number of matches (same as grep -c).
- --color, --mono: (Do not) highlight matches (same as GNU grep --color). Default is to highlight if stdout is a terminal.
- -r, --rs: Set input record separator. Default is ^$|(-----*|=====*)$
- -t, --timestamp: Same as --rs=TIMESTAMP_REGEX, where the regex matches timestamps often used in log files, e.g., 2014-12-31 12:34:56 or Dec 31 12:34:56.
- -a, --and: Print entries that match all (instead of any) of the PATTERNs.
- -h, --help: Print help for the command.
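The difference between the default "any pattern" behavior and --and, and how -v and -c combine with it, can be sketched as follows. This is a hypothetical illustration in Python, not taken from any of the implementations; it assumes that -c counts matching records and that -v inverts the per-record decision.

```python
import re

def record_matches(record, patterns, require_all=False, invert=False):
    """Decide whether one record is selected.

    require_all=False mimics the default (any pattern may match);
    require_all=True mimics --and; invert=True mimics -v.
    """
    hits = [re.search(p, record, re.MULTILINE) is not None for p in patterns]
    matched = all(hits) if require_all else any(hits)
    return matched != invert

def count_matches(records, patterns, **opts):
    """Count selected records (assumed meaning of -c in this sketch)."""
    return sum(1 for r in records if record_matches(r, patterns, **opts))

if __name__ == "__main__":
    records = ["a=1\nb=2", "a=1\nb=3"]
    print(count_matches(records, ["a=1", "b=3"]))
    print(count_matches(records, ["a=1", "b=3"], require_all=True))
```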
For config file like this:
[name1]
attr1=abc
attr2=123

[name2]
attr1=xxx
attr2=456
attr3=zzz

[name3]
attr1=abc
attr2=789
You can use hmlgrep attr2=456 to print the whole [name2] section, without guessing the number of context lines as you would with grep -C 2. (In this case blank lines are parsed as separators, but you can use the --rs option, like hmlgrep --rs='^\[', to change the separator regex.)
$ hmlgrep attr2=456
[name2]
attr1=xxx
attr2=456
attr3=zzz
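For a config file that has no blank lines between sections, a separator such as ^\[ is what makes each section header start a new record. A rough Python equivalent is sketched below; it is only an illustration (the sections helper is hypothetical) and relies on re.split accepting zero-width matches, which requires Python 3.7 or later.

```python
import re

CONFIG = """[name1]
attr1=abc
attr2=123
[name2]
attr1=xxx
attr2=456
attr3=zzz
[name3]
attr1=abc
attr2=789
"""

def sections(text):
    """Split `text` right before every line that starts with '['."""
    # (?m) makes ^ match at each line start; (?=\[) keeps the '[' in the record.
    return [s for s in re.split(r"(?m)^(?=\[)", text) if s]

if __name__ == "__main__":
    for section in sections(CONFIG):
        if "attr2=456" in section:
            print(section, end="")
```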
When you have multi-line log entries that begin with a timestamp:
$ cat logfile.txt
2014-01-23 12:34:56 logentry 1
foo
bar
2014-01-24 12:34:57 logentry 2
abc
def
2014-01-23 12:34:58 logentry 3
123
456
You can use the --timestamp option (or something like --rs='^2014-[0-9][0-9]-[0-9][0-9]') to split such lines into records. Then searching for entries with a keyword is just like using ordinary grep.
$ hmlgrep --timestamp abc logfile.txt
2014-01-24 12:34:57 logentry 2
abc
def
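The timestamp-based splitting can be approximated in Python as below. The regex is a hypothetical stand-in that covers only the two timestamp shapes mentioned above (2014-12-31 12:34:56 and Dec 31 12:34:56); the real TIMESTAMP_REGEX used by the tools may differ.

```python
import re

# Hypothetical stand-in for TIMESTAMP_REGEX (an assumption of this sketch).
TIMESTAMP = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"        # 2014-12-31 12:34:56
    r"|[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2})"    # Dec 31 12:34:56
)

def grep_log(text, keyword):
    """Split `text` into records starting at timestamp lines, then filter."""
    records, current = [], []
    for line in text.splitlines():
        if TIMESTAMP.match(line) and current:
            records.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        records.append("\n".join(current))
    return [r for r in records if keyword in r]

LOG = """2014-01-23 12:34:56 logentry 1
foo
bar
2014-01-24 12:34:57 logentry 2
abc
def
2014-01-23 12:34:58 logentry 3
123
456"""

if __name__ == "__main__":
    for record in grep_log(LOG, "abc"):
        print(record)
```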
Currently we have AWK, Haskell, Python, and Rust implementations. They're roughly equivalent, but have a few minor differences: the AWK version accepts POSIX extended regular expressions and can match across multiple lines, while the Haskell version uses the PCRE regex library and the Python version uses Python's own regex variant with single-line match only.
The Haskell and AWK versions are 'full featured': they are equipped with most of the relevant grep options, such as -i, -c, and -v, and they can highlight matches. They can also handle compressed files like *.gz, *.bz2, and *.xz transparently. (But you need gzip, bzip2, and xz installed for amlgrep.)
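Transparent handling of compressed input usually comes down to picking a reader by file extension. The sketch below shows that idea using Python's standard gzip, bz2, and lzma modules. It is not how amlgrep or hmlgrep do it (amlgrep relies on external tools), and open_text is a hypothetical helper.

```python
import bz2
import gzip
import lzma

# Map of file suffix to a function that opens the file in text mode.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}

def open_text(path):
    """Open `path` for reading text, decompressing based on its suffix."""
    for suffix, opener in OPENERS.items():
        if path.endswith(suffix):
            return opener(path, "rt")
    return open(path, "r")
```

A caller can then iterate over open_text(path) the same way regardless of whether the file is compressed.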
hmlgrep is the Haskell implementation and the most actively maintained one. It is currently the fastest implementation in many use cases, often 5x to 10x faster than the awk version, especially for sparse matches.
amlgrep is the awk implementation. It is fairly fast in most cases, but slow for sparse matches.
Being implemented in awk, it should run on any Unix-like platform, though I have only tested it on Linux (Ubuntu 12.04, RHEL 5.x) and Mac OS X. Note that it requires GNU awk, so it won't run on Mac OS X out of the box with the stock BSD awk.
KNOWN BUG: There is a subtle issue around newline (\n) handling; amlgrep produces slightly wrong output when there's an empty record (two consecutive separator lines). For example, when RS=\n\n, a match consumes both newlines. So when \n\n\n (two empty records with a blank line as separator) appears in the input, the first iteration consumes two characters and leaves a single \n behind, and the next iteration cannot find \n\n.
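The same effect can be reproduced with any non-overlapping separator match. The Python sketch below is an analogy, not amlgrep's code: splitting on the two-character separator loses the second blank line, whereas scanning line by line (one possible alternative, shown only for comparison) sees both separators.

```python
import re

TEXT = "rec1\n\n\nrec2"

# Non-overlapping matching: the first "\n\n" is consumed, and the single
# "\n" left behind can no longer form a separator.
naive = re.split(r"\n\n", TEXT)

def split_on_blank_lines(text):
    """Treat every blank line as a separator, scanning line by line."""
    records, current = [], []
    for line in text.split("\n"):
        if line == "":
            records.append(current)
            current = []
        else:
            current.append(line)
    records.append(current)
    return records

if __name__ == "__main__":
    print(naive)                       # ['rec1', '\nrec2']
    print(split_on_blank_lines(TEXT))  # [['rec1'], [], ['rec2']]
```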
pymlgrep is the Python implementation. It doesn't support --and, and it accepts just one pattern.
It is the slowest, sometimes about 20x slower than the other versions.
rmlgrep is the Rust implementation, with line-oriented reading. Because this version doesn't employ any of the performance optimization techniques hmlgrep does, it is about 10x slower for sparse-match inputs/patterns, but comparably fast for medium to dense matches.
Compared to the AWK version, it is typically 30-40% faster.
This version doesn't support some of the key options, like --color and --invert.
- amlgrep: Just copy it into any directory listed in your $PATH. On non-Linux systems you may also need to install GNU awk. You may also need zcat, bzcat, and xzcat to handle compressed files.
- hmlgrep: Install the Haskell Platform, and run cabal install. You can also use make to build it.
- pymlgrep: Just copy it into any directory listed in your $PATH.
- rmlgrep: Install the Rust SDK, and run cargo build in the rust directory (with --release as needed). You can also use make to build it (it just calls cargo build).
hmlgrep uses regex-pcre (Text.Regex.PCRE) as its underlying regex engine and stringsearch (Data.ByteString.Search) for non-regex patterns. The source code also contains experimental versions using regex-tdfa and haskell-re2 (jmillikin's Haskell wrapper for Google's regex implementation); these are disabled by default and switchable using #ifdef.
PCRE performs best with the current implementation so far, slightly better than re2 and much better than regex-tdfa.
You can use these regex engines just by changing #define PCRE to #define RE2 or #define TDFA (and commenting or uncommenting the corresponding import statements).
Note that it uses a patched version of haskell-re2, which is available at https://github.com/ryo1kato/haskell-re2/tree/matchPos
Run a commandline-level test. It simply compares outputs from the four commands.
Run a simple performance test for several regex/non-regex and sparse/dense patterns on dummy multi-line log data.
You have to generate test data using test/gentestlog.sh first.
$ ./test/gentestlog.sh
$ ./performance.sh
You can change the test data size with ./test/gentestlog.sh 512 (this generates 512MB of test data).
A limited number of unit tests are implemented using doctest.
Just run them with cabal test.
This software is licensed under the terms of the MIT license. (But the libraries it depends on may or may not be. Check the licenses yourself if you're going to distribute binaries.)
- Project page: https://github.com/ryo1kato/mlr-grep
- Author: Ryoichi Kato