Skip to content

erlxml - Erlang XML parsing library based on pugixml

License

Notifications You must be signed in to change notification settings

silviucpp/erlxml

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

71 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

erlxml

erlxml - Erlang XML parsing library based on pugixml

Build Status GitHub Hex.pm

Implementation notes

pugixml is the fastest dom parser available in c++ based on the benchmarks available here. The streaming parser works by dividing the stream into independent stanzas, which are then processed using pugixml. While the splitting algorithm is quite fast, it is designed for simplicity, which currently imposes some limitations on the streaming mode:

  • Does not support CDATA
  • Does not support comments containing special XML characters
  • Does not support DOCTYPE declarations

All of the above limitations apply only to streaming mode and not to DOM parsing mode.

Getting starting:

DOM parsing
erlxml:parse(<<"<foo attr1='bar'>Some Value</foo>">>).

Which results in

{ok,{xmlel,<<"foo">>,
           [{<<"attr1">>,<<"bar">>}],
           [{xmlcdata,<<"Some Value">>}]}}
Generate an XML document from Erlang terms
Xml = {xmlel,<<"foo">>,
    [{<<"attr1">>,<<"bar">>}],  % Attributes
    [{xmlcdata,<<"Some Value">>}]   % Elements
},
erlxml:to_binary(Xml).

Which results in

<<"<foo attr1=\"bar\">Some Value</foo>">>
Streaming parsing
Chunk1 = <<"<stream><foo attr1=\"bar">>,
Chunk2 = <<"\">Some Value</foo></stream>">>,
{ok, Parser} = erlxml:new_stream(),
{ok,[{xmlstreamstart,<<"stream">>,[]}]} = erlxml:parse_stream(Parser, Chunk1),
Rs = erlxml:parse_stream(Parser, Chunk2),
{ok,[{xmlel,<<"foo">>,
        [{<<"attr1">>,<<"bar">>}],
        [{xmlcdata,<<"Some Value">>}]},
     {xmlstreamend,<<"stream">>}]} = Rs.

Options

When you create a stream using new_stream/1 you can specify the following options:

  • stanza_limit - Specify the maximum size a stanza can have. In case the library parses more than this number of bytes without finding a stanza will return and error {error, {max_stanza_limit_hit, binary()}}. Example: {stanza_limit, 65000}. By default, it is 0 that means unlimited.

  • strip_non_utf8 - Will strip from attributes values and node values elements all invalid utf8 characters. This is considered user input and might have malformed chars. Default is false.

Benchmarks

The benchmark code is inside the benchmark folder. The performances are compared against:

All tests are running with three different concurrency levels (how many erlang processes are spawn)

  • C1 (concurrency level 1)
  • C5 (concurrency level 5)
  • C10 (concurrency level 10)
DOM parsing

Parse the same stanza defined in benchmark/benchmark.erl for 600000 times:

make bench_parsing
Library C1 (ms) C5 (ms) C10 (ms)
erlxml 1875.128 417.368 315.65
exml 2417.334 578.226 407.516
fast_xml 24159.517 5854.817 4007.837

Note:

  • Starting version 3.0.0, exml saw significant improvements by replacing Expat with RapidXML.
  • erlxml delivers the best performance, followed by exml, while fast_xml performs the worst (huge difference).
Generate an XML document from Erlang terms

Encode the same erlang term defined in benchmark/benchmark.erl for 600000 times:

make bench_encoding
Library C1 (ms) C5 (ms) C10 (ms)
erlxml 1381.338 322.851 251.936
exml 1333.54 301.625 234.295
fast_xml 1019.238 238.676 198.69

Note:

  • fast_xml delivers the best performance, followed by exml, and erlxml with almost the same performance.
  • erlxml improved encoding performance in version 2.1.0 by removing unnecessary memory copy and string length computing.
Streaming parsing

Test is located in benchmark/benchmark_stream.erl, and will load all stanza's from test/data/stream.txt and run the parsing mode over that stanza's for 60000 times:

make bench_streaming
### engine: erlxml concurrency: 1 -> 2337.112 ms 193.81 MB/sec total bytes processed: 452.96 MB
### engine: erlxml concurrency: 5 -> 598.737 ms 756.52 MB/sec total bytes processed: 452.96 MB
### engine: erlxml concurrency: 10 -> 407.379 ms 1.09 GB/sec total bytes processed: 452.96 MB
### engine: exml concurrency: 1 -> 11790.975 ms 38.42 MB/sec total bytes processed: 452.96 MB
### engine: exml concurrency: 5 -> 2552.339 ms 177.47 MB/sec total bytes processed: 452.96 MB
### engine: exml concurrency: 10 -> 1840.267 ms 246.14 MB/sec total bytes processed: 452.96 MB
### engine: fast_xml concurrency: 1 -> 22677.758 ms 19.97 MB/sec total bytes processed: 452.96 MB
### engine: fast_xml concurrency: 5 -> 5184.096 ms 87.37 MB/sec total bytes processed: 452.96 MB
### engine: fast_xml concurrency: 10 -> 3854.402 ms 117.52 MB/sec total bytes processed: 452.96 MB 
Library C1 (MB/s) C5 (MB/s) C10 (MB/s)
erlxml 193.81 756.52 1090
exml 38.42 177.47 246
fast_xml 19.97 87.37 117

Notes:

  • erlxml is the clear winner.

About

erlxml - Erlang XML parsing library based on pugixml

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages