Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XML benchmark #334

Merged
merged 7 commits into from
Jun 16, 2022
Merged

XML benchmark #334

merged 7 commits into from
Jun 16, 2022

Conversation

rossabaker
Copy link
Contributor

@rossabaker rossabaker commented Jun 12, 2022

An initial benchmark comparing converting a stream with fs2-data-scala-xml and raw scala-xml with SAX. The sample XML is generated from onlinerandomtools.com.

The first commit fixes some bitrot on the CSV benchmark and could be its own PR, if you prefer.

Fixes #333.

@rossabaker
Copy link
Contributor Author

Initial result:

[info] Benchmark                          Mode  Cnt      Score     Error  Units
[info] XmlParserBenchmarks.parseFs2Data   avgt    5  13466.202 ± 884.038  us/op
[info] XmlParserBenchmarks.parseScalaXml  avgt    5   5622.339 ± 382.749  us/op

@rossabaker
Copy link
Contributor Author

The fs2-data parser crashes when I generate CDATA: rossabaker@6851774. I have not minimized this yet.

@rossabaker
Copy link
Contributor Author

[info] XmlParserBenchmarks.parseFs2Data    avgt    5  14186.798 ± 3111.764  us/op
[info] XmlParserBenchmarks.parseSaxReader  avgt    5   5768.727 ±  582.328  us/op
[info] XmlParserBenchmarks.parseSaxStream  avgt    5   6192.101 ±  901.105  us/op
[info] XmlParserBenchmarks.parseSaxString  avgt    5   4732.937 ±  882.768  us/op

Comment on lines +28 to +34
xmlStream
.through(fs2.text.utf8.decode)
.through(fs2.data.xml.events())
.through(fs2.data.xml.dom.documents)
.compile
.lastOrError
.unsafeRunSync()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just discovered fs2.data.text, but this does not make it faster:

    import fs2.data.text.utf8._
    xmlStream
      .through(fs2.data.xml.events())
      .through(fs2.data.xml.dom.documents)
      .compile
      .lastOrError
      .unsafeRunSync()

@satabin
Copy link
Member

satabin commented Jun 12, 2022

Thanks for the contribution.

I will have a look at the crash, the XML parser is complex and still not widely used, so there might be some corner cases not covered correctly yet. I’ll try to fix this.

Regarding parser performance, I did not put the emphasis on it yet, and the document builder is also naive. It would be interesting to compare event parsing vs document building, to identify where the bottleneck might be.

@rossabaker
Copy link
Contributor Author

rossabaker commented Jun 12, 2022

It looks like most of the time is spent in the event parser:

[info] XmlParserBenchmarks.decode                  avgt    5   1712.777 ±  254.995  us/op
[info] XmlParserBenchmarks.parseFs2DataToEvents    avgt    5  11361.425 ± 1966.170  us/op
[info] XmlParserBenchmarks.parseFs2DataToScalaXml  avgt    5  13519.461 ± 1307.664  us/op

@satabin
Copy link
Member

satabin commented Jun 13, 2022

Thanks for the measurements. Once crashes are fixed I will try to identify where the bottleneck is in the parser and make an attempt to address it.

@satabin
Copy link
Member

satabin commented Jun 13, 2022

The fs2-data parser crashes when I generate CDATA: rossabaker@6851774. I have not minimized this yet.

I tried to reproduce it locally, but I could parse it with success (both with and without DOM building). Do you have a stack trace to share?

@rossabaker
Copy link
Contributor Author

Oh, interesting. From that linked commit,

[info] # Benchmark: fs2.data.benchmarks.XmlParserBenchmarks.parseFs2Data
[info] # Run progress: 0.00% complete, ETA 00:00:32
[info] # Fork: 1 of 1
[info] # Warmup Iteration   1: <failure>
[info] fs2.data.xml.XmlException: 'CDATA[' expected
[info] 	at fs2.data.xml.internals.EventParser$.$anonfun$pipe$1(EventParser.scala:39)
[info] 	at fs2.Pull$$anon$2.cont(Pull.scala:183)
[info] 	at fs2.Pull$BindBind.cont(Pull.scala:701)
[info] 	at fs2.Pull$.fs2$Pull$$viewL$1(Pull.scala:872)
[info] 	at fs2.Pull$.fs2$Pull$$go$1(Pull.scala:1195)
[info] 	at fs2.Pull$UnconsRunR$1.$anonfun$out$2(Pull.scala:997)
[info] 	at fs2.Pull$.$anonfun$compile$2(Pull.scala:930)
[info] 	at get @ fs2.internal.Scope.openScope(Scope.scala:281)
[info] 	at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:163)
[info] 	at flatMap @ fs2.Pull$.$anonfun$compile$21(Pull.scala:1182)
[info] 	at update @ fs2.internal.Scope.releaseChildScope(Scope.scala:224)
[info] 	at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:163)
[info] 	at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:163)
[info] 	at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:163)
[info] 	at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:163)
[info] 	at update @ fs2.internal.Scope.releaseChildScope(Scope.scala:224)
[info] 	at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:163)
[info] 	at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:163)
[info] 	at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:163)
[info] 	at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:163)
[info] 	at update @ fs2.internal.Scope.releaseChildScope(Scope.scala:224)
[info] 	at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:163)
[info] 	at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:163)

The parseScalaXml runs fine, as does running xmllint command line on the test resource itself.

--

I also sketched a parser based on aalto-xml, which has incremental properties that make it a great fit for fs2. That parser is not yet production grade, doesn't emit events, and won't work on Scala.js. But it's currently coming in at about 4500us/op. If it's still that fast when productionized, I think it will be strictly better than the scala-xml SAX parsing, but that fs2-data remains compelling for its other features.

@satabin
Copy link
Member

satabin commented Jun 14, 2022

Do you run the benchmark on scala JVM or scala JS?

@armanbilge
Copy link
Contributor

JVM, JMH doesn't run on Scala.js.

@satabin
Copy link
Member

satabin commented Jun 14, 2022

I think I found the bug. In EventParser line 128:

-              accept(ctx, s, chunkAcc)
+              loop(ctx, sidx, chunkAcc)

Having CDATA at chunk boundary fails currently. Does it fix it for you? If yes I’ll do a proper bugfix PR.

@satabin
Copy link
Member

satabin commented Jun 14, 2022

Crash should be fixed by #335.

@rossabaker
Copy link
Contributor Author

Nice work. The CDATA is passing. The numbers are all worse, but the XML is bigger. Relative is what's interesting:

[info] XmlParserBenchmarks.decode                  avgt    5   3325.339 ±  605.950  us/op
[info] XmlParserBenchmarks.parseFs2DataToEvents    avgt    5  27683.543 ± 4832.627  us/op
[info] XmlParserBenchmarks.parseFs2DataToScalaXml  avgt    5  33478.473 ± 8531.365  us/op
[info] XmlParserBenchmarks.parseSaxReader          avgt    5  15590.355 ± 7675.151  us/op
[info] XmlParserBenchmarks.parseSaxStream          avgt    5  16077.269 ± 5029.600  us/op
[info] XmlParserBenchmarks.parseSaxString          avgt    5   9001.270 ± 1971.615  us/op

@satabin
Copy link
Member

satabin commented Jun 15, 2022

Thanks for this awesome work. I think we can merge it, and use it as a baseline to improve performances.

@satabin satabin merged commit f7da233 into gnieh:main Jun 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Benchmark for XML parser
3 participants