Implement eland.DataFrame.to_json #661
3d07310 to e86d6f3
Thanks for this!
Thanks! LGTM.
buildkite test this please
Nice! I removed the two linting errors. Sorry about that.
buildkite test this please
buildkite test this please
Dumping an Elastic index to CSV was the right solution for me at first. After a while, though, CSV did not offer all the guarantees I was looking for: a CSV record can, for example, span multiple lines if any of its values contain line breaks.
For that reason, JSON lines was a more suitable format. Pandas' `DataFrame.to_json` can generate `.jsonl` files by supplying `lines=True, orient='records'`. This also lets us reuse the earlier solution that streams output from Elastic and appends it to a file, eliminating the need to fit the entire dataset in memory.

This pull request implements streaming an Elasticsearch index to `.jsonl`, while falling back to running `to_pandas().to_json(...)` in cases where streaming is harder to do.
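
For readers unfamiliar with the approach, here is a minimal sketch of the chunk-and-append idea using plain pandas only. It is not the code in this pull request: `append_chunks_as_jsonl`, `fake_chunks`, and `demo` are hypothetical names, and the in-memory chunks are an assumed stand-in for batches streamed from Elasticsearch.

```python
import json
import os
import tempfile
from typing import Iterable, Iterator, List

import pandas as pd


def append_chunks_as_jsonl(chunks: Iterable[pd.DataFrame], path: str) -> int:
    """Append each chunk to `path` as JSON lines and return the rows written."""
    rows = 0
    with open(path, "a", encoding="utf-8") as fh:
        for chunk in chunks:
            if chunk.empty:
                continue
            text = chunk.to_json(orient="records", lines=True)
            # Normalise the trailing newline so consecutive chunks never
            # merge; whether pandas emits one differs between versions.
            fh.write(text.rstrip("\n") + "\n")
            rows += len(chunk)
    return rows


def fake_chunks() -> Iterator[pd.DataFrame]:
    """Hypothetical stand-in for batches streamed from Elasticsearch."""
    yield pd.DataFrame({"id": [1, 2], "text": ["plain", "two\nlines"]})
    yield pd.DataFrame({"id": [3], "text": ["last"]})


def demo() -> List[dict]:
    """Write the fake chunks to a temporary .jsonl file and read them back."""
    fd, path = tempfile.mkstemp(suffix=".jsonl")
    os.close(fd)
    try:
        append_chunks_as_jsonl(fake_chunks(), path)
        with open(path, encoding="utf-8") as fh:
            # One record per physical line, even for values with line breaks.
            return [json.loads(line) for line in fh]
    finally:
        os.remove(path)
```

Only one chunk is held in memory at a time, and the embedded line break in `"two\nlines"` is escaped inside the JSON string, so each record still occupies exactly one line of the file.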