XML benchmark #334

rossabaker · 2022-06-12T03:35:44Z

An initial benchmark comparing converting a stream with fs2-data-scala-xml and raw scala-xml with SAX. The sample XML is generated from onlinerandomtools.com.

The first commit fixes some bitrot on the CSV benchmark and could be its own PR, if you prefer.

Fixes #333.

rossabaker · 2022-06-12T03:36:13Z

Initial result:

[info] Benchmark                          Mode  Cnt      Score     Error  Units
[info] XmlParserBenchmarks.parseFs2Data   avgt    5  13466.202 ± 884.038  us/op
[info] XmlParserBenchmarks.parseScalaXml  avgt    5   5622.339 ± 382.749  us/op

rossabaker · 2022-06-12T03:44:35Z

The fs2-data parser crashes when I generate CDATA: rossabaker@6851774. I have not minimized this yet.

rossabaker · 2022-06-12T04:13:22Z

[info] XmlParserBenchmarks.parseFs2Data    avgt    5  14186.798 ± 3111.764  us/op
[info] XmlParserBenchmarks.parseSaxReader  avgt    5   5768.727 ±  582.328  us/op
[info] XmlParserBenchmarks.parseSaxStream  avgt    5   6192.101 ±  901.105  us/op
[info] XmlParserBenchmarks.parseSaxString  avgt    5   4732.937 ±  882.768  us/op

rossabaker · 2022-06-12T04:19:53Z

benchmarks/src/main/scala/fs2/data/benchmarks/XmlParserBenchmarks.scala

+    xmlStream
+      .through(fs2.text.utf8.decode)
+      .through(fs2.data.xml.events())
+      .through(fs2.data.xml.dom.documents)
+      .compile
+      .lastOrError
+      .unsafeRunSync()


I just discovered fs2.data.text, but this does not make it faster:

import fs2.data.text.utf8._ xmlStream .through(fs2.data.xml.events()) .through(fs2.data.xml.dom.documents) .compile .lastOrError .unsafeRunSync()

satabin · 2022-06-12T11:41:14Z

Thanks for the contribution.

I will have a look at the crash, the XML parser is complex and still not widely used, so there might be some corner cases not covered correctly yet. I’ll try to fix this.

Regarding parser performance, I did not put the emphasis on it yet, and the document builder is also naive. It would be interesting to compare event parsing vs document building, to identify where the bottleneck might be.

rossabaker · 2022-06-12T21:10:13Z

It looks like most of the time is spent in the event parser:

[info] XmlParserBenchmarks.decode                  avgt    5   1712.777 ±  254.995  us/op
[info] XmlParserBenchmarks.parseFs2DataToEvents    avgt    5  11361.425 ± 1966.170  us/op
[info] XmlParserBenchmarks.parseFs2DataToScalaXml  avgt    5  13519.461 ± 1307.664  us/op

satabin · 2022-06-13T08:34:42Z

Thanks for the measurements. Once crashes are fixed I will try to identify where the bottleneck is in the parser and make an attempt to address it.

satabin · 2022-06-13T11:32:36Z

The fs2-data parser crashes when I generate CDATA: rossabaker@6851774. I have not minimized this yet.

I tried to reproduce it locally, but I could parse it with success (both with and without DOM building). Do you have a stack trace to share?

rossabaker · 2022-06-13T17:46:02Z

Oh, interesting. From that linked commit,

[info] # Benchmark: fs2.data.benchmarks.XmlParserBenchmarks.parseFs2Data
[info] # Run progress: 0.00% complete, ETA 00:00:32
[info] # Fork: 1 of 1
[info] # Warmup Iteration   1: <failure>
[info] fs2.data.xml.XmlException: 'CDATA[' expected
[info] 	at fs2.data.xml.internals.EventParser$.$anonfun$pipe$1(EventParser.scala:39)
[info] 	at fs2.Pull$$anon$2.cont(Pull.scala:183)
[info] 	at fs2.Pull$BindBind.cont(Pull.scala:701)
[info] 	at fs2.Pull$.fs2$Pull$$viewL$1(Pull.scala:872)
[info] 	at fs2.Pull$.fs2$Pull$$go$1(Pull.scala:1195)
[info] 	at fs2.Pull$UnconsRunR$1.$anonfun$out$2(Pull.scala:997)
[info] 	at fs2.Pull$.$anonfun$compile$2(Pull.scala:930)
[info] 	at get @ fs2.internal.Scope.openScope(Scope.scala:281)
[info] 	at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:163)
[info] 	at flatMap @ fs2.Pull$.$anonfun$compile$21(Pull.scala:1182)
[info] 	at update @ fs2.internal.Scope.releaseChildScope(Scope.scala:224)
[info] 	at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:163)
[info] 	at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:163)
[info] 	at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:163)
[info] 	at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:163)
[info] 	at update @ fs2.internal.Scope.releaseChildScope(Scope.scala:224)
[info] 	at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:163)
[info] 	at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:163)
[info] 	at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:163)
[info] 	at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:163)
[info] 	at update @ fs2.internal.Scope.releaseChildScope(Scope.scala:224)
[info] 	at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:163)
[info] 	at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:163)

The parseScalaXml runs fine, as does running xmllint command line on the test resource itself.

--

I also sketched a parser based on aalto-xml, which has incremental properties that make it a great fit for fs2. That parser is not yet production grade, doesn't emit events, and won't work on Scala.js. But it's currently coming in at about 4500us/op. If it's still that fast when productionized, I think it will be strictly better than the scala-xml SAX parsing, but that fs2-data remains compelling for its other features.

satabin · 2022-06-14T07:20:50Z

Do you run the benchmark on scala JVM or scala JS?

armanbilge · 2022-06-14T07:21:35Z

JVM, JMH doesn't run on Scala.js.

satabin · 2022-06-14T08:09:25Z

I think I found the bug. In EventParser line 128:

-              accept(ctx, s, chunkAcc)
+              loop(ctx, sidx, chunkAcc)

Having CDATA at chunk boundary fails currently. Does it fix it for you? If yes I’ll do a proper bugfix PR.

satabin · 2022-06-14T12:24:39Z

Crash should be fixed by #335.

rossabaker · 2022-06-14T19:40:45Z

Nice work. The CDATA is passing. The numbers are all worse, but the XML is bigger. Relative is what's interesting:

[info] XmlParserBenchmarks.decode                  avgt    5   3325.339 ±  605.950  us/op
[info] XmlParserBenchmarks.parseFs2DataToEvents    avgt    5  27683.543 ± 4832.627  us/op
[info] XmlParserBenchmarks.parseFs2DataToScalaXml  avgt    5  33478.473 ± 8531.365  us/op
[info] XmlParserBenchmarks.parseSaxReader          avgt    5  15590.355 ± 7675.151  us/op
[info] XmlParserBenchmarks.parseSaxStream          avgt    5  16077.269 ± 5029.600  us/op
[info] XmlParserBenchmarks.parseSaxString          avgt    5   9001.270 ± 1971.615  us/op

satabin · 2022-06-15T17:31:48Z

Thanks for this awesome work. I think we can merge it, and use it as a baseline to improve performances.

rossabaker added 2 commits June 11, 2022 23:09

Fix CSV benchmarks

f7261a9

Initial XML benchmarks

ca10b83

More variants of SAX parsing

7e136e3

rossabaker commented Jun 12, 2022

View reviewed changes

rossabaker mentioned this pull request Jun 12, 2022

fs2-data-xml streaming parser http4s/http4s-scala-xml#25

Closed

rossabaker added 2 commits June 12, 2022 17:02

Separate event parsing from document creation

5dd339b

Benchmark UTF-8 decoding

7fce287

rossabaker added 2 commits June 14, 2022 15:34

Merge remote-tracking branch 'origin/main' into benchmarks

e0d4499

CDATA that used to make it crash

193b261

satabin merged commit f7da233 into gnieh:main Jun 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

XML benchmark #334

XML benchmark #334

rossabaker commented Jun 12, 2022 •

edited

Loading

rossabaker commented Jun 12, 2022

rossabaker commented Jun 12, 2022

rossabaker commented Jun 12, 2022

rossabaker Jun 12, 2022

satabin commented Jun 12, 2022

rossabaker commented Jun 12, 2022 •

edited

Loading

satabin commented Jun 13, 2022

satabin commented Jun 13, 2022

rossabaker commented Jun 13, 2022

satabin commented Jun 14, 2022

armanbilge commented Jun 14, 2022

satabin commented Jun 14, 2022

satabin commented Jun 14, 2022

rossabaker commented Jun 14, 2022

satabin commented Jun 15, 2022

XML benchmark #334

XML benchmark #334

Conversation

rossabaker commented Jun 12, 2022 • edited Loading

rossabaker commented Jun 12, 2022

rossabaker commented Jun 12, 2022

rossabaker commented Jun 12, 2022

rossabaker Jun 12, 2022

Choose a reason for hiding this comment

satabin commented Jun 12, 2022

rossabaker commented Jun 12, 2022 • edited Loading

satabin commented Jun 13, 2022

satabin commented Jun 13, 2022

rossabaker commented Jun 13, 2022

satabin commented Jun 14, 2022

armanbilge commented Jun 14, 2022

satabin commented Jun 14, 2022

satabin commented Jun 14, 2022

rossabaker commented Jun 14, 2022

satabin commented Jun 15, 2022

rossabaker commented Jun 12, 2022 •

edited

Loading

rossabaker commented Jun 12, 2022 •

edited

Loading