Bring JSON parser on par with circe-fs2 #491

satabin · 2023-06-26T17:36:26Z

The goal of this PR is to bring performance on par with circe-fs2, so that users can switch fearlessly to fs2-data (see circe/circe-fs2#425).
The existing two-step approach has a major drawback that prevents this: it builds intermediate Token values, which are immediately consumed to build the AST using an instance of Builder.
For people coming from circe-fs2, this can be avoided by adding a new pipe, ast.parse, that directly transforms the input byte/char/string stream into the AST.

To achieve this, the PR takes inspiration from the Facade approach in jawn and models a ChunkAccumulator in the JSON parser. The simplest accumulator wraps the existing VectorBuilder[Token] and produces the token stream. Another implementation (BuilderChunkAccumulator[Json]) builds the AST directly from the parser without instantiating any Token. In the future, it could even be extended to implement filtering at the source, without constructing the Tokens that are not selected. The abstraction is kept internal for now, so that we have more latitude to change the API before stabilizing it.
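To illustrate the idea, here is a minimal, self-contained sketch of an accumulator-driven parser. It is not the fs2-data implementation: the `Token` and `Ast` types, the `ChunkAccumulator` method names, and `TinyParser` are all hypothetical simplifications (a toy grammar of integers and nested arrays) chosen only to show how one parser can feed either a token accumulator or an AST-building accumulator.

```scala
// Hypothetical, simplified model: NOT the fs2-data internal API.
sealed trait Token
object Token {
  case object StartArray extends Token
  case object EndArray extends Token
  final case class Num(value: Int) extends Token
}

sealed trait Ast
object Ast {
  final case class Num(value: Int) extends Ast
  final case class Arr(items: Vector[Ast]) extends Ast
}

// Callbacks invoked by the parser; each implementation decides what to build.
trait ChunkAccumulator[Res] {
  def startArray(): Unit
  def endArray(): Unit
  def number(value: Int): Unit
  def result(): Vector[Res]
}

// Accumulates the flat token stream.
final class TokenAccumulator extends ChunkAccumulator[Token] {
  private val builder = Vector.newBuilder[Token]
  def startArray(): Unit = { builder += Token.StartArray; () }
  def endArray(): Unit = { builder += Token.EndArray; () }
  def number(value: Int): Unit = { builder += Token.Num(value); () }
  def result(): Vector[Token] = builder.result()
}

// Builds AST values directly, without ever instantiating a Token.
final class AstAccumulator extends ChunkAccumulator[Ast] {
  private val done = Vector.newBuilder[Ast]
  // Stack of builders for the arrays that are currently open.
  private var open: List[scala.collection.mutable.Builder[Ast, Vector[Ast]]] = Nil

  // Adds a finished value to the innermost open array, or to the top level.
  private def emit(value: Ast): Unit = open match {
    case current :: _ => current += value; ()
    case Nil          => done += value; ()
  }

  def startArray(): Unit = { open = Vector.newBuilder[Ast] :: open }
  def endArray(): Unit = open match {
    case current :: rest =>
      open = rest
      emit(Ast.Arr(current.result()))
    case Nil => throw new IllegalStateException("unbalanced ']'")
  }
  def number(value: Int): Unit = emit(Ast.Num(value))
  def result(): Vector[Ast] = done.result()
}

object TinyParser {
  // A single parser loop, parameterized by the accumulator.
  def parse[Res](input: String, acc: ChunkAccumulator[Res]): Vector[Res] = {
    var i = 0
    while (i < input.length) {
      input.charAt(i) match {
        case '[' => acc.startArray(); i += 1
        case ']' => acc.endArray(); i += 1
        case c if c.isDigit =>
          val start = i
          while (i < input.length && input.charAt(i).isDigit) i += 1
          acc.number(input.substring(start, i).toInt)
        case c if c.isWhitespace => i += 1
        case c => throw new IllegalArgumentException(s"unexpected '$c' at index $i")
      }
    }
    acc.result()
  }
}
```

Calling `TinyParser.parse("[1 [2 3]] 4", new TokenAccumulator)` yields the flat token sequence, while passing `new AstAccumulator` yields two AST values from the same parsing loop. Because the parser only talks to the callbacks, the AST path never allocates tokens.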

- Implement the parser in terms of ChunkAccumulator
- Implement the simple TokenChunkAccumulator
- Implement the AST building BuilderChunkAccumulator
- Implement ast.parse
- Add scalafix rule to migrate from .through(tokens).through(ast.values) to .through(ast.parse)
- Add ast.parse documentation
- Add cookbook to migrate from circe-fs2
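The migration targeted by the scalafix rule could look like the sketch below. It assumes the `fs2-data-json-circe` module at version 1.8.0 (the milestone of this PR) and assumes that `tokens`, `ast.values`, and `ast.parse` take the explicit type parameters shown; the object and value names are illustrative only. The `//> using` lines are scala-cli directives.

```scala
//> using scala 2.13
//> using dep org.gnieh::fs2-data-json-circe:1.8.0

import fs2.{Fallible, Stream}
import fs2.data.json._
import fs2.data.json.circe._ // assumed to provide the circe Builder[Json] instance
import io.circe.Json

object MigrationSketch {
  // Illustrative input: two whitespace-separated JSON values.
  val input: Stream[Fallible, String] =
    Stream.emit("""{"a": 1} [true, null]""")

  // Before: tokenize first, then build AST values from the tokens.
  val twoSteps: Stream[Fallible, Json] =
    input
      .through(tokens[Fallible, String])
      .through(ast.values[Fallible, Json])

  // After: build AST values directly from the character stream.
  val direct: Stream[Fallible, Json] =
    input.through(ast.parse[Fallible, String, Json])
}
```

Both pipelines are expected to produce the same values; only the intermediate `Token` stream is skipped in the second one.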

Using the new abstraction, here are the results of the benchmark.

Benchmark                                          Mode  Cnt     Score    Error  Units
JsonParserBenchmarks.parseCirceFs2                 avgt   10  5197.451 ± 18.083  us/op
JsonParserBenchmarks.parseJsonFs2DataParse         avgt   10  5259.336 ± 32.579  us/op
JsonParserBenchmarks.parseJsonFs2DataTokens        avgt   10  3531.900 ± 15.076  us/op
JsonParserBenchmarks.parseJsonFs2DataTokensValues  avgt   10  6091.062 ± 94.753  us/op
JsonParserBenchmarks.parseJsonJawn                 avgt   10  1945.402 ± 10.970  us/op
JsonValueBenchmarks.parseEscapedString             avgt   10     2.742 ±  0.021  us/op
JsonValueBenchmarks.parseSimpleString              avgt   10     2.588 ±  0.027  us/op

This abstraction makes it possible to create chunks of various types out of an input stream. One possible accumulator is the `Token` accumulator, which generates the token stream. Leveraging this abstraction, we can then implement a pipe that directly builds AST values instead of going through an intermediate `Token` representation.

Inspired by the `Facade` abstraction from jawn, we can build the AST directly, without emitting intermediate tokens, which makes it faster.

satabin · 2023-07-02T19:22:09Z

@ybasket this PR is ready to be reviewed!

ybasket

Nice work and sorry for the time it took me to review it!

documentation/docs/json/libraries.md

json/src/main/scala/fs2/data/json/internal/JsonTokenParser.scala

scalafix/rules/src/main/scala/fix/JsonParse.scala

The fact that it now returns `Unit` makes it clearer that it is side-effectful.

satabin requested a review from a team as a code owner June 26, 2023 17:36

satabin force-pushed the json/chunk-acc branch from 34456ad to f0d3c5a Compare June 26, 2023 17:38

satabin added enhancement json labels Jun 26, 2023

satabin added this to the 1.8.0 milestone Jun 26, 2023

satabin added 3 commits June 26, 2023 19:48

Add an accumulator that builds an AST directly

ecbd900


Avoid walking twice through accumulated values

3ba90b3

satabin force-pushed the json/chunk-acc branch from f0d3c5a to 3ba90b3 Compare June 26, 2023 17:48

satabin added 5 commits June 26, 2023 22:41

Make it compile with scala 2.12

4c29e68

Add scalafix rule to migrate to parse

bf9860a

Document new parse pipe

2244e09

Fix workflow

3f3abbb

Add migration guide from circe-fs2

1d8edd7

satabin added the documentation label Jul 2, 2023

armanbilge mentioned this pull request Jul 10, 2023

RFC: deprecate in favor of fs2-data-json-circe circe/circe-fs2#425

Merged

ybasket approved these changes Jul 16, 2023


satabin added 2 commits July 17, 2023 18:42

Add link to byte stream decoding

814845d

Make keyword acc function clearer in name and type

037c1d6


satabin force-pushed the json/chunk-acc branch from 1ca07ca to 037c1d6 Compare July 17, 2023 16:53

Merge branch 'main' into json/chunk-acc


66f34e8

satabin merged commit 136b493 into main Jul 17, 2023

satabin deleted the json/chunk-acc branch July 17, 2023 17:12
