Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nested data in CSV #15

Closed
tbsvttr opened this issue Oct 28, 2022 · 11 comments
Closed

Nested data in CSV #15

tbsvttr opened this issue Oct 28, 2022 · 11 comments

Comments

@tbsvttr
Copy link

tbsvttr commented Oct 28, 2022

Hi! It's me again!

As already mentioned in the other issue, I am very interested in the CSV serialization capabilities of this (cool!) library.

However, it seems that the CSV serialization is not working in a very useful manner when data is nested.

As an example, given this setup:

  address = Address.new(city: "San Francisco")
  person = Person.new(first_name: "John", last_name: "Doe", address: Address.new(city: "San Francisco"))
  person_presenter = PersonPresenter.new(person)

These are the results:

person_presenter.to_csv == <<~CSV.chomp
    John,Doe,,false,[],"{""city""=>""San Francisco"", ""street""=>nil, ""zip""=>nil}"
CSV

person_presenter.to_json == <<~JSON.chomp
    {\"first_name\":\"John\",\"last_name\":\"Doe\",\"married\":false,\"hobbies\":[],\"address\":{\"city\":\"San Francisco\"}}
JSON

person_presenter.to_xml == <<~XML.chomp
    <person>
        <first_name>John</first_name>
        <last_name>Doe</last_name>
        <married>false</married>
        <address>
            <city>San Francisco</city>
        </address>
   </person>
XML

Note how the JSON and XML can sensible handle the nested data structure, but the CSV now actually contains a JSON string in a field. My guess is that this will not be acceptable for most scenarios where CSV is actually used. Also note that in this JSON string, the nil values are rendered, while they are not rendered anywhere else. No render_nil was used anywhere. This lets me guess that this is somewhat of an unhandled or unintended behaviour.

My suggestion would be to flatten the structure so that it will be compatible with CSV. Something like this:

Or with render_nil: true:

<<~CSV.chomp
    John,Doe,false,[],San Francisco,,
CSV

Note that it render an empty string into each field, so that the table which is represented in CSV does maintain is coherence for each row.

With the following headers:

<<~CSV.chomp
    first_name,last_name,married,hobbies,address.city,address.street,address.zip
CSV

What do you think? Or did I overlook some feature of the library?

@kgiszczak
Copy link
Owner

well, that's because CSV is a flat format, it doesn't support "nesting", you can work around it by using methods to extract/generate data, that way you have full control. Here's an example:

class Address < Shale::Mapper
  attribute :street, Shale::Type::String
  attribute :city, Shale::Type::String
end

class Person < Shale::Mapper
  attribute :name, Shale::Type::String
  attribute :address, Address, default: -> { Address.new }

  csv do
    map 'name', to: :name
    map 'street', using: { from: :street_from_csv, to: :street_to_csv }
    map 'city', using: { from: :city_from_csv, to: :city_to_csv }
  end

  def street_from_csv(model, value)
    model.address.street = value
  end

  def street_to_csv(model, doc)
    doc['street'] = model.address.street
  end

  def city_from_csv(model, value)
    model.address.city = value
  end

  def city_to_csv(model, doc)
    doc['city'] = model.address.city
  end
end

person = Person.from_csv(<<~DATA)
John Doe,Oxford Street,London
DATA

# => #<Person:0x2a06a
#    @name="John Doe"
#    @city="London",
#    @street="Oxford Street">

person.to_csv

# => "John Doe,Oxford Street,London"

@tbsvttr
Copy link
Author

tbsvttr commented Oct 28, 2022

Thanks for your quick reply! I just edited my initial post again to improve upon it and showing that there is a bit of a problem with nil values in CSV, too.

well, that's because CSV is a flat format, it doesn't support "nesting",

That is absolutely correct! However, shouldn't default behaviour create a somewhat acceptable and unsurprising result? Just having a JSON string which is created with different rules as the actual to_json result seems kinda wired to me. Why not flatten the structure as I suggested?

you can work around it by using methods to extract/generate data, that way you have full control. Here's an example:

This would actually solve the problem and would be 'good enough'. However, it would create a lot of boilerplate for a lot of people, I guess.

@kgiszczak
Copy link
Owner

Thanks for your quick reply! I just edited my initial post again to improve upon it and showing that there is a bit of a problem with nil values in CSV, too.

That actually is documented (https://github.com/kgiszczak/shale#rendering-nil-values). CSV renders nil values by default (as opposed to other formats), because without it, it would mangle the CSV output. render_nil: false means don't include the field/column at all in the output (so in the case of CSV it would mean one comma less).

E.g. if your first_name = nil and last_name = 'Doe' you would get:

first_name,last_name
Doe

If you really need, you can overwrite this behaviour with explicitly setting render_nil: false - but in the case of CSV you would need that in a very rare circumstances.

That is absolutely correct! However, shouldn't default behaviour create a somewhat acceptable and unsurprising result? Just having a JSON string which is created with different rules as the actual to_json result seems kinda wired to me.

That actually is not a JSON but stringified Ruby Hash, as all formats except XML uses this data structure internally. Underneath to_csv is calling as_csv (or as_json, as_yaml and so on) method that returns this hash and then converts it to CSV. E.g.

person.as_csv

# => { 'first_name' => 'John', `last_name` => 'Doe' }

Why not flatten the structure as I suggested?

However, it would create a lot of boilerplate for a lot of people, I guess

I totally get it, ang it's not the first time this was raised. I tried to implement a solution for this problem, but it opend a can with worms with so much edge cases that I gave up. I might try to approch this problem again in the future, but there's no a quick solution for this.

@tbsvttr
Copy link
Author

tbsvttr commented Oct 31, 2022

I totally get it, ang it's not the first time this was raised. I tried to implement a solution for this problem, but it opend a can with worms with so much edge cases that I gave up. I might try to approch this problem again in the future, but there's no a quick solution for this.

Thanks for your understanding. Can I find your approach back then somewhere? Maybe we can improve upon it, or maybe it is even enough for our use case.

That actually is not a JSON but stringified Ruby Hash, as all formats except XML uses this data structure internally. Underneath to_csv is calling as_csv (or as_json, as_yaml and so on) method that returns this hash and then converts it to CSV.

Ah, I see. So this is some kind of fallback in case of “nested CSV”.

If you really need, you can overwrite this behaviour with explicitly setting render_nil: false - but in the case of CSV you would need that in a very rare circumstances.

You are absolutely right. Not rendering the nil values in a CSV would break the CSV in most cases. I actually suggested this behaviour for the representation of nested data within a CSV. My comment about it showing nil was about that stringified Hash you use internally (which I misunderstood for some wired JSON).

That actually is documented (https://github.com/kgiszczak/shale#rendering-nil-values). CSV renders nil values by default (as opposed to other formats), because without it, it would mangle the CSV output. render_nil: false means don't include the field/column at all in the output (so in the case of CSV it would mean one comma less).

You are right here, too. However, I feel it is a bit inconsistent to have CSV so different behaviour than anything else. Why not make render_nil: true the default behaviour?

@kgiszczak
Copy link
Owner

Thanks for your understanding. Can I find your approach back then somewhere? Maybe we can improve upon it, or maybe it is even enough for our use case.

I made pretty good progress on this and actually came up with a solution that I like and works really well.
It won't be production ready for at least couple of weeks, but I can push a branch with it for you to try out and hear your feedback on it. I'll do it tomorrow.

Ah, I see. So this is some kind of fallback in case of “nested CSV”.

Not really a fallback, more like a side effect of the underlying CSV parser.

You are right here, too. However, I feel it is a bit inconsistent to have CSV so different behaviour than anything else. Why not make render_nil: true the default behaviour?

Because CSV is much different than JSON, YAML, TOML or XML - in these formats you usually don't want to render empty fields/nodes, in CSV you have to, otherwise you mangle the output.

@tbsvttr
Copy link
Author

tbsvttr commented Nov 2, 2022

I made pretty good progress on this and actually came up with a solution that I like and works really well.
It won't be production ready for at least couple of weeks, but I can push a branch with it for you to try out and hear your feedback on it. I'll do it tomorrow.

Thanks, that would be very good!

@tbsvttr
Copy link
Author

tbsvttr commented Nov 10, 2022

@kgiszczak Just a short reminder. :)

@kgiszczak
Copy link
Owner

sorry for the delay, I had a lot of work past few days, but I'm almost done, I need one more day to finish tests and write documentation, it'll be ready after the weekend.

@kgiszczak
Copy link
Owner

Hey @tbsvttr the feature is ready https://github.com/kgiszczak/shale#delegating-fields-to-child-attributes

@tbsvttr
Copy link
Author

tbsvttr commented Nov 14, 2022

@kgiszczak Hey, that is very cool! As far as I can see, this solves the problem with relatively little boilerplate code.

Can you release this as v0.10?

@kgiszczak
Copy link
Owner

I will make a release in a few weeks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants