
Adds attempt decoding functions #436

Merged
merged 26 commits into gnieh:main
Feb 1, 2023

Conversation

alycklama

@alycklama alycklama commented Jan 11, 2023

@GerretS and I worked on a potential improvement regarding errors as values. We noticed that when parsing CSV files, the default behaviour is to fail the stream whenever an error occurs. For our use case, we would like to aggregate these errors at the end, so that we have all potential errors at once.

There are a few open questions that we would like your input on.

Tests have been added to the csv-generic project
As we wanted to test the decoding of a case class, we added the tests for all our attempt* functions to the csv-generic project. Is that OK, or would you like us to change that? We didn't want to manually create our own decoder within csv/shared.

headersAttempt (bincompat)
The existence of the headersAttempt bincompat version forces us to pass the implicits along explicitly, instead of having them passed implicitly. Is that OK for now? The only solution we saw is to remove the bincompat version, but we have no idea what the side effects of that would be.

Verbose tests
The tests are now a bit verbose, as there's quite a bit of duplication. However, the tests are rather trivial, so we wonder whether you are OK with the setup or whether we should improve it.

We're looking forward to hearing from you!

@alycklama alycklama requested a review from a team as a code owner January 11, 2023 11:12
@alycklama
Author

We noticed that there's a difference between Scala 2 and Scala 3. One of the tests now fails unexpectedly.

    case class TestData(name: String, age: String, description: String)

    val content =
      """name1,age,description
        |John Doe,47,description 1
        |Bob Smith,80,description 2
        |""".stripMargin

In Scala 3, decoding the CSV into the TestData case class will not throw any exception. Instead, the name1 column is ignored, resulting in an empty string value for the name variable in the TestData case class.

In Scala 2, decoding the CSV into the TestData case class will result in the error "unknown column name 'name' in line x".

Any idea if this is intentional?

@ybasket
Collaborator

ybasket commented Jan 11, 2023

We noticed that there's a difference between Scala 2 and Scala 3. One of the tests now fails unexpectedly.

    case class TestData(name: String, age: String, description: String)

    val content =
      """name1,age,description
        |John Doe,47,description 1
        |Bob Smith,80,description 2
        |""".stripMargin

In Scala 3, decoding the CSV into the TestData case class will not throw any exception. Instead, the name1 column is ignored, resulting in an empty string value for the name variable in the TestData case class.

In Scala 2, decoding the CSV into the TestData case class will result in the error "unknown column name 'name' in line x".

Any idea if this is intentional?

Both Scala versions are meant to behave consistently.

Thank you for your PR! I don't yet have the time to look at it closely (should find some the next days though), but will approve the workflow runs if you push more stuff so you get CI feedback until we can properly review it.

Collaborator

@ybasket ybasket left a comment


Is your use case to parse CSV files that have rows with a wrong number of columns in between, or just erroneous data (as in: no number where expected and the like)?

I'm not entirely convinced yet that these combinators are justified on the high level API, as they are easy to do using the low level API and seem to be rather non-standard use cases. But happy to learn more!

As for the implicits, that's fine. Maintaining bincompat is sometimes slightly ugly, but at least it stays internal to the library and we can clean up in fs2-data 2.0 one day.

Test verbosity would be OK, but let's put them where they belong (csv module), maybe in a new file. I'll add some inline comments for the tests as well.

@GerretS

GerretS commented Jan 14, 2023

Is your use case to parse CSV files that have rows with a wrong number of columns in between, or just erroneous data (as in: no number where expected and the like)?

People sometimes create CSV files by manually inputting data into Excel or whatever, and those files become invalid in the weirdest ways. Strings in an expected Int field, commas within an unquoted String and who knows what else. In that situation, it's much more convenient to be able to return a full list of errors for a file so they can fix them all at once than only returning the first error and requiring the user to try over and over again until they fix all of them.

The use case is similar to how you could use Cats Validated to parse several sources of data in parallel and return a list of all errors instead of throwing whatever happens to be the first exception.
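The idea can be sketched without any CSV library at all. This stdlib-only example (the Person type and parseRow helper are illustrative, not from this PR) collects every row error instead of stopping at the first one:

```scala
// Illustrative sketch: parse every row, keep errors as values,
// and report them all at the end instead of failing fast.
case class Person(name: String, age: Int)

def parseRow(line: String): Either[String, Person] =
  line.split(',') match {
    case Array(name, age) =>
      age.trim.toIntOption
        .map(Person(name, _))
        .toRight(s"'$age' is not a number in row '$line'")
    case _ =>
      Left(s"wrong number of columns in row '$line'")
  }

val rows = List("John,47", "Bob,eighty", "Eve")
val (errors, people) = rows.partitionMap(parseRow)
// people contains the one valid row; errors contains both problems.
```

partitionMap splits the results in one pass, so the caller can decide at the end whether any errors are fatal.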

As for your other comments, we'll take a look after the weekend.

@alycklama
Author

It could be a matter of perspective. When using a streaming library, especially in a functional programming language like Scala, I expect errors to be first-class citizens. So it surprised me that the default behaviour is to kill the stream whenever an error is encountered; that's not something I would have expected from a resiliency perspective. In my opinion, returning errors as values is not a novelty feature someone should have to implement via the low-level API. I would have expected this to be available in the high-level API, though, as I mentioned initially, it could all be a matter of perspective. It took quite a while to figure out that aggregating errors is not achievable via the high-level API and that one should use the low-level API instead. It would be great if the documentation reflected this, ideally with a concrete example of how to achieve it.

@ybasket
Collaborator

ybasket commented Jan 16, 2023

We have to differentiate the errors here though:

  • Errors that clearly affect decoding a single row (like no number in a cell though expected)
  • Errors that indicate the whole CSV could/should be considered invalid, like different column count per row, differing from headers

The first group is very easy to handle today, just decode as DecoderResult[A] instead of A (it's available in implicit scope), then you're free to accumulate the errors in the end or ignore them or whatever. Only the second group requires your additions and I'm not so sure these are use cases the library should support high-level. Will think about it (@satabin maybe you have an opinion here as well?).
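Concretely, the first group can be handled like this. A sketch assuming fs2-data's generic derivation (the Item type and its fields are illustrative, not from this PR):

```scala
import fs2._
import fs2.data.csv._
import fs2.data.csv.generic.semiauto._

case class Item(name: String, count: Int)
implicit val itemDecoder: CsvRowDecoder[Item, String] = deriveCsvRowDecoder[Item]

val input =
  """name,count
    |apple,3
    |banana,not-a-number
    |""".stripMargin

// Decoding as DecoderResult[Item] instead of Item keeps per-row decoding
// errors as values, so a bad cell no longer fails the whole stream.
val results = Stream
  .emit(input)
  .covary[Fallible]
  .through(decodeUsingHeaders[DecoderResult[Item]]())
  .compile
  .toList
// One Right per well-formed row, one Left(DecoderError) for the bad cell.
```

This only works when every row has the expected number of columns; structural errors still fail the stream, which is the second group above.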

PS: Sorry for the accidental close/re-open.

@ybasket ybasket closed this Jan 16, 2023
@ybasket ybasket reopened this Jan 16, 2023
@GerretS

GerretS commented Jan 16, 2023

I agree there are two different kinds of errors.

But when a row's column count doesn't match the header's, that could well be a single error for that row, caused by someone accidentally deleting a comma. It would make sense to me to handle those as single-row errors and accumulate them.

@alycklama
Author

No worries for the close/re-open!

If you're dealing with manually created CSV files, then a single row that cannot be decoded is to be expected. That should, however, not affect the entire file from a resiliency point of view.

The only workaround I've found so far is to use decodeSkippingHeaders instead of decodeUsingHeaders. The problem is that the columns in the case class then need to be in exactly the same order as in the CSV file; the same ordering constraint also holds for the low-level API.

If we could get the headers per row via an attemptUsingHeaders, the ordering could differ between the CSV and the case class representation.

How would you create a resilient stream that does not fail, even if the structure is incorrect (e.g. a column is missing for a row) using the low-level API?

    import fs2._
    import fs2.data.csv._
    import fs2.data.csv.generic.semiauto._

    case class Data(name: String, description: String)
    implicit val testDataRowDecoder: RowDecoder[Data] = deriveRowDecoder[Data]
    implicit val testCsvDataRowDecoder: CsvRowDecoder[Data, String] = deriveCsvRowDecoder[Data]

    val csv =
      """description,name
        |hello world,John
        |this is a test
        |""".stripMargin

    val results1 = Stream
      .emit(csv)
      .covary[Fallible]
      .through(decodeSkippingHeaders[DecoderResult[Data]]())
      .compile
      .toList

    val results2 = Stream
      .emit(csv)
      .covary[Fallible]
      .through(decodeUsingHeaders[DecoderResult[Data]]())
      .compile
      .toList

    val results3 = Stream
      .emit(csv)
      .through(lowlevel.rows[Fallible, String](',', QuoteHandling.Literal))
      .map(testDataRowDecoder.apply)
      .compile
      .toList

    // Using decodeSkippingHeaders:
    // results1: Right(List(Right(Data(hello world,John)), Left(fs2.data.csv.DecoderError: unexpect end of row in line 3)))
    //
    // Using decodeUsingHeaders:
    // results2: Left(fs2.data.csv.CsvException: Headers have size 2 but row has size 1. Both numbers must match! in line 3)
    //
    // Using the low-level API:
    // results3: Right(List(Right(Data(description,name)), Right(Data(hello world,John)), Left(fs2.data.csv.DecoderError: unexpect end of row in line 3)))

@ybasket
Collaborator

ybasket commented Jan 18, 2023

How would you create a resilient stream that does not fail, even if the structure is incorrect (e.g. a column is missing for a row) using the low-level API?
Same as you do in your implementations in this PR.

I discussed this with @satabin and we agreed that because varying column count is not generally safe to parse per row (because it can well be a sign of the whole file being corrupted), we would like to namespace the new pipes into a lenient object. That way there's a clear signal (and hopefully documentation) that users need to think carefully about their use case before using them.

So if you could move the new pipes into such an object (similar to the lowlevel one) and fix the tests as discussed above, we'd be good to go :)
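Usage of the relocated pipes would then look roughly like this. A sketch reusing the csv input and Data decoder from the snippet earlier in this thread; the pipe name follows the PR's attempt* naming convention and may differ in the released API:

```scala
// Sketch: a lenient pipe emits one value per row, turning structural
// errors (like a misaligned row) into values instead of failing the stream.
val attempted = Stream
  .emit(csv)
  .covary[Fallible]
  .through(lenient.attemptDecodeUsingHeaders[Data]())
  .compile
  .toList
// Each element would be an Either, letting callers accumulate the
// errors at the end instead of losing the whole stream.
```

The lenient namespace makes the trade-off explicit: callers opt in to per-row recovery knowing a misaligned row can also mean the whole file is corrupt.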

@alycklama alycklama requested review from satabin and ybasket and removed request for satabin and ybasket January 20, 2023 12:41
@alycklama
Author

I've also added a bit of documentation FYI.

Collaborator

@ybasket ybasket left a comment


Thank you for the docs! I left some comments to be resolved still, but we're getting closer to merging 👍

@alycklama alycklama requested review from ybasket and removed request for satabin January 23, 2023 11:09
@alycklama alycklama requested review from satabin and removed request for ybasket January 24, 2023 07:19
@alycklama alycklama requested a review from ybasket January 30, 2023 14:24
@alycklama
Author

@ybasket I've requested a review from you, just to be sure that we're not waiting on each other. If there's something still to be done in this PR, please let me know! 👍

@ybasket
Collaborator

ybasket commented Jan 31, 2023

@ybasket I've requested a review from you, just to be sure that we're not waiting on each other. If there's something still to be done in this PR, please let me know! 👍

👍 I was on vacation the last few days, so sorry for not replying earlier!

Code looks good, I'm just wondering why there are still changes in the generic module – they don't seem to be specific to generic and are covered by other tests, so let's remove or move them? Once that's resolved I'm happy to merge!

Also, I saw during review that we have a specific exception type (<: CsvException) for misaligned rows but weren't using it, so I'll do a follow-up PR and then release a new version with all CSV fixes.

@alycklama
Author

@ybasket No problem! Hope you had nice days off.

I've removed the tests in the generic module.

Also saw during review we have a specific exception type (<: CsvException) for misaligned rows, but weren't using it, so I would do a follow-up PR and then release a new version with all CSV fixes.

What do you mean here? All the Throwable types have been replaced by CsvException, as that's the supertype of the potential errors that can be thrown. What kind of follow-up PR do you expect? And is this something you expect us to do, or something you will pick up?

@ybasket
Collaborator

ybasket commented Jan 31, 2023

I mean the specialised error type here:

class HeaderSizeError(msg: String,

I'll do the follow-up (it was apparently my oversight long ago), no worries. It's also compatible with your changes. And I'll also clean up the small remainder of changes in the generic module that are still in this PR, and merge. Thank you for your work 😃 @alycklama @GerretS

@alycklama
Author

You're welcome! Thank you in advance for approving and merging the PR! 😄

@ybasket ybasket merged commit ac8200c into gnieh:main Feb 1, 2023
@satabin satabin added enhancement New feature or request csv labels Feb 2, 2023
@satabin satabin added this to the 1.7.0 milestone Feb 2, 2023