
Adds attempt decoding functions #436

Merged
merged 26 commits into gnieh:main
Feb 1, 2023

Conversation

alycklama

@alycklama alycklama commented Jan 11, 2023

@GerretS and I worked on a potential improvement regarding errors as values. We noticed that when parsing CSV files, the default behaviour is to fail the stream whenever an error occurs. For our use case, we would like to aggregate these errors at the end, so that we have all potential errors at once.

There are a few open questions that we would like your input on.

Tests have been added to the csv-generic project
As we wanted to test the decoding of a case class, we added the tests for all our attempt* functions to the csv-generic project. Is that OK, or would you like us to change that? We didn't want to manually create our own decoder within csv/shared.

headersAttempt (bincompat)
The existence of the headersAttempt bincompat version forces us to pass the implicits along explicitly, instead of having them passed implicitly. Is that OK for now? The only solution we saw is to remove the bincompat version, but we have no idea what the side effects of that would be.

Verbose tests
The tests are now a bit verbose, as there's quite a bit of duplication. However, the tests are rather trivial, so we wonder whether you are OK with the setup or whether we should improve it.

We're looking forward to hearing from you!

@alycklama alycklama requested a review from a team as a code owner January 11, 2023 11:12
@alycklama
Author

We noticed that there's a difference between Scala 2 and Scala 3. One of the tests now fails unexpectedly.

    case class TestData(name: String, age: String, description: String)

    val content =
      """name1,age,description
        |John Doe,47,description 1
        |Bob Smith,80,description 2
        |""".stripMargin

In Scala 3, decoding the CSV into the TestData case class will not throw any exception. Instead, the name1 column is ignored, resulting in an empty string value for the name variable in the TestData case class.

In Scala 2, decoding the CSV into the TestData case class will result in the error "unknown column name 'name' in line x".

Any idea if this is intentional?

@ybasket
Collaborator

ybasket commented Jan 11, 2023

We noticed that there's a difference between Scala 2 and Scala 3. One of the tests now fails unexpectedly.

    case class TestData(name: String, age: String, description: String)

    val content =
      """name1,age,description
        |John Doe,47,description 1
        |Bob Smith,80,description 2
        |""".stripMargin

In Scala 3, decoding the CSV into the TestData case class will not throw any exception. Instead, the name1 column is ignored, resulting in an empty string value for the name variable in the TestData case class.

In Scala 2, decoding the CSV into the TestData case class will result in the error "unknown column name 'name' in line x".

Any idea if this is intentional?

Both Scala versions are meant to behave consistently.

Thank you for your PR! I don't yet have the time to look at it closely (should find some the next days though), but will approve the workflow runs if you push more stuff so you get CI feedback until we can properly review it.

Collaborator

@ybasket ybasket left a comment


Is your use case to parse CSV files that have rows with a wrong number of columns in between, or just erroneous data (as in: no number where expected and the like)?

I'm not entirely convinced yet that these combinators are justified on the high level API, as they are easy to do using the low level API and seem to be rather non-standard use cases. But happy to learn more!

As for the implicits, that's fine. Maintaining bincompat is sometimes slightly ugly, but at least it stays internal to the library and we can clean up in fs2-data 2.0 one day.

Test verbosity would be OK, but let's put them where they belong (csv module), maybe in a new file. I'll add some inline comments for the tests as well.

@GerretS

GerretS commented Jan 14, 2023

Is your use case to parse CSV files that have rows with a wrong number of columns in between, or just erroneous data (as in: no number where expected and the like)?

People sometimes create CSV files by manually inputting data into Excel or whatever, and those files become invalid in the weirdest ways. Strings in an expected Int field, commas within an unquoted String and who knows what else. In that situation, it's much more convenient to be able to return a full list of errors for a file so they can fix them all at once than only returning the first error and requiring the user to try over and over again until they fix all of them.

The use case is similar to how you could use Cats Validated to parse several sources of data in parallel and return a list of all errors instead of throwing whatever happens to be the first exception.
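The idea can be sketched without any CSV library at all. This stdlib-only example (the Person type and parseRow helper are illustrative, not from this PR) collects every row error instead of stopping at the first one:

```scala
// Illustrative sketch: parse every row, keep errors as values,
// and report them all at the end instead of failing fast.
case class Person(name: String, age: Int)

def parseRow(line: String): Either[String, Person] =
  line.split(',') match {
    case Array(name, age) =>
      age.trim.toIntOption
        .map(Person(name, _))
        .toRight(s"'$age' is not a number in row '$line'")
    case _ =>
      Left(s"wrong number of columns in row '$line'")
  }

val rows = List("John,47", "Bob,eighty", "Eve")
val (errors, people) = rows.partitionMap(parseRow)
// people contains the one valid row; errors contains both problems.
```

partitionMap splits the results in one pass, so the caller can decide at the end whether any errors are fatal.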

As for your other comments, we'll take a look after the weekend.

@alycklama
Author

It could be a matter of perspective. When using a streaming library, especially in a functional programming language like Scala, I expect errors to be first-class citizens. So it surprised me that the default behaviour is to kill the stream whenever an error is encountered; that's not something I would have expected from a resiliency perspective. In my opinion, returning errors as values is not a novelty feature someone should have to implement via the low-level API. I would have expected this to be available in the high-level API, though, as I mentioned initially, it could all be a matter of perspective. It took quite a while to figure out that aggregating errors is not achievable via the high-level API and that one should use the low-level API instead. It would be great if the documentation reflected this, ideally with a concrete example of how to achieve it.

@ybasket
Collaborator

ybasket commented Jan 16, 2023

We have to differentiate the errors here though:

  • Errors that clearly affect decoding a single row (like no number in a cell though expected)
  • Errors that indicate the whole CSV could/should be considered invalid, like different column count per row, differing from headers

The first group is very easy to handle today, just decode as DecoderResult[A] instead of A (it's available in implicit scope), then you're free to accumulate the errors in the end or ignore them or whatever. Only the second group requires your additions and I'm not so sure these are use cases the library should support high-level. Will think about it (@satabin maybe you have an opinion here as well?).
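Concretely, the first group can be handled like this. A sketch assuming fs2-data's generic derivation (the Item type and its fields are illustrative, not from this PR):

```scala
import fs2._
import fs2.data.csv._
import fs2.data.csv.generic.semiauto._

case class Item(name: String, count: Int)
implicit val itemDecoder: CsvRowDecoder[Item, String] = deriveCsvRowDecoder[Item]

val input =
  """name,count
    |apple,3
    |banana,not-a-number
    |""".stripMargin

// Decoding as DecoderResult[Item] instead of Item keeps per-row decoding
// errors as values, so a bad cell no longer fails the whole stream.
val results = Stream
  .emit(input)
  .covary[Fallible]
  .through(decodeUsingHeaders[DecoderResult[Item]]())
  .compile
  .toList
// One Right per well-formed row, one Left(DecoderError) for the bad cell.
```

This only works when every row has the expected number of columns; structural errors still fail the stream, which is the second group above.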

PS: Sorry for the accidental close/re-open.

@ybasket ybasket closed this Jan 16, 2023
@ybasket ybasket reopened this Jan 16, 2023
@GerretS

GerretS commented Jan 16, 2023

I agree there are two different kinds of errors.

But when a row's column count doesn't match the header's, that could well be a single error for that row, caused by someone accidentally deleting a comma. It would make sense to me to handle those as single-row errors and accumulate them.

@alycklama
Author

No worries for the close/re-open!

If you're dealing with manually created CSV files, then a single row that cannot be decoded is to be expected. That should, however, not affect the entire file from a resiliency point of view.

The only workaround I've found so far is to use decodeSkippingHeaders instead of decodeUsingHeaders. The problem is that the columns in the case class then need to be in exactly the same order as in the CSV file; the same ordering constraint also holds for the low-level API.

If we could get the headers per row via an attemptUsingHeaders, the ordering could differ between the CSV and the case class representation.

How would you create a resilient stream that does not fail, even if the structure is incorrect (e.g. a column is missing for a row) using the low-level API?

    import fs2._
    import fs2.data.csv._
    import fs2.data.csv.generic.semiauto._

    case class Data(name: String, description: String)
    implicit val testDataRowDecoder: RowDecoder[Data] = deriveRowDecoder[Data]
    implicit val testCsvDataRowDecoder: CsvRowDecoder[Data, String] = deriveCsvRowDecoder[Data]

    val csv =
      """description,name
        |hello world,John
        |this is a test
        |""".stripMargin

    val results1 = Stream
      .emit(csv)
      .covary[Fallible]
      .through(decodeSkippingHeaders[DecoderResult[Data]]())
      .compile
      .toList

    val results2 = Stream
      .emit(csv)
      .covary[Fallible]
      .through(decodeUsingHeaders[DecoderResult[Data]]())
      .compile
      .toList

    val results3 = Stream
      .emit(csv)
      .through(lowlevel.rows[Fallible, String](',', QuoteHandling.Literal))
      .map(testDataRowDecoder.apply)
      .compile
      .toList

    // Using decodeSkippingHeaders:
    // results1: Right(List(Right(Data(hello world,John)), Left(fs2.data.csv.DecoderError: unexpect end of row in line 3)))
    //
    // Using decodeUsingHeaders:
    // results2: Left(fs2.data.csv.CsvException: Headers have size 2 but row has size 1. Both numbers must match! in line 3)
    //
    // Using the low-level API:
    // results3: Right(List(Right(Data(description,name)), Right(Data(hello world,John)), Left(fs2.data.csv.DecoderError: unexpect end of row in line 3)))

@ybasket
Collaborator

ybasket commented Jan 18, 2023

How would you create a resilient stream that does not fail, even if the structure is incorrect (e.g. a column is missing for a row) using the low-level API?
Same as you do in your implementations in this PR.

I discussed this with @satabin and we agreed that because varying column count is not generally safe to parse per row (because it can well be a sign of the whole file being corrupted), we would like to namespace the new pipes into a lenient object. That way there's a clear signal (and hopefully documentation) that users need to think carefully about their use case before using them.

So if you could move the new pipes into such an object (similar to the lowlevel one) and fix the tests as discussed above, we'd be good to go :)
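Usage of the relocated pipes would then look roughly like this. A sketch reusing the csv input and Data decoder from the snippet earlier in this thread; the pipe name follows the PR's attempt* naming convention and may differ in the released API:

```scala
// Sketch: a lenient pipe emits one value per row, turning structural
// errors (like a misaligned row) into values instead of failing the stream.
val attempted = Stream
  .emit(csv)
  .covary[Fallible]
  .through(lenient.attemptDecodeUsingHeaders[Data]())
  .compile
  .toList
// Each element would be an Either, letting callers accumulate the
// errors at the end instead of losing the whole stream.
```

The lenient namespace makes the trade-off explicit: callers opt in to per-row recovery knowing a misaligned row can also mean the whole file is corrupt.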

@alycklama alycklama requested review from satabin and ybasket and removed request for satabin and ybasket January 20, 2023 12:41
@alycklama
Author

I've also added a bit of documentation FYI.

Collaborator

@ybasket ybasket left a comment


Thank you for the docs! I left some comments to be resolved still, but we're getting closer to merging 👍

@alycklama alycklama requested review from ybasket and removed request for satabin January 23, 2023 11:09
@alycklama alycklama requested review from satabin and removed request for ybasket January 24, 2023 07:19
@alycklama alycklama requested a review from ybasket January 30, 2023 14:24
@alycklama
Author

@ybasket I've requested a review from you, just to be sure that we're not waiting on each other. If there's something still to be done in this PR, please let me know! 👍

@ybasket
Collaborator

ybasket commented Jan 31, 2023

@ybasket I've requested a review from you, just to be sure that we're not waiting on each other. If there's something still to be done in this PR, please let me know! 👍

👍 I was on vacation the last few days, so sorry for not replying earlier!

Code looks good, I'm just wondering why there are still changes in the generic module – they don't seem to be specific to generic and are covered by other tests, so let's remove or move them? Once that's resolved I'm happy to merge!

Also, I saw during review that we have a specific exception type (<: CsvException) for misaligned rows but weren't using it, so I'll do a follow-up PR and then release a new version with all CSV fixes.

@alycklama
Author

@ybasket No problem! Hope you had nice days off.

I've removed the tests in the generic module.

Also saw during review we have a specific exception type (<: CsvException) for misaligned rows, but weren't using it, so I would do a follow-up PR and then release a new version with all CSV fixes.

What do you mean here? All the Throwable types have been replaced by CsvException, as that's the supertype of the potential errors that can be thrown. What kind of follow-up PR do you expect? And is this something you expect us to do, or something you will pick up?

@ybasket
Collaborator

ybasket commented Jan 31, 2023

I mean the specialised error type here:

class HeaderSizeError(msg: String,

I'll do the follow-up (it was apparently my oversight long ago), no worries. It's also compatible with your changes. And I'll also clean up the small remainder of changes in the generic module that are still in this PR, and merge. Thank you for your work 😃 @alycklama @GerretS

@alycklama
Author

You're welcome! Thank you in advance for approving and merging the PR! 😄

@ybasket ybasket merged commit ac8200c into gnieh:main Feb 1, 2023
@satabin satabin added enhancement New feature or request csv labels Feb 2, 2023
@satabin satabin added this to the 1.7.0 milestone Feb 2, 2023