-
Notifications
You must be signed in to change notification settings - Fork 105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch agg defaults to numeric_only=None #270
Conversation
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually? |
@sethmlarson Please review this and let me know any changes are required. I am happy to do it. >>>import eland as ed
>>>from eland.tests.common import TestData
>>>funcs = ["max", "min", "mean", "sum", "median", "mad", "var", "std"]
>>>filter_data = ["AvgTicketPrice","Cancelled","dayOfWeek", "timestamp","DestCountry"]
>>>data = TestData()
>>>ed_flights = data.ed_flights().filter(filter_data)
>>>ed_flights.agg(funcs,numeric_only=True)
AvgTicketPrice Cancelled dayOfWeek
max 1.199729e+03 1.000000 6.000000
min 1.000205e+02 0.000000 0.000000
mean 6.282537e+02 0.128494 2.835975
sum 8.204365e+06 1678.000000 37035.000000
median 6.403873e+02 0.000000 3.000000
mad 2.134435e+02 NaN 2.000000
var 7.096457e+04 0.111987 3.761279
std 2.664071e+02 0.334664 1.939513 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for this, it's a ton of work and very much appreciated :)
The above comment with NaN
being returned for mad
on bool is all good, agree with you there.
I also forgot to mention sum
in the related issue as it has special behavior as well. sum
should always be the same dtype
as the input except for bool where the sum of bools should be an int64
. I recommend this:
# These aggregations maintain the column datatype
elif pd_agg in {"max", "min", "median", "sum"}:
# 'sum' isn't representable with bool, use int64
if pd_agg == "sum" and field.is_bool:
agg_value = np.int64(agg_value)
else:
agg_value = field.np_dtype.type(agg_value)
In general make sure we're following this ruleset:
- numeric_only=True: Returns all values as float64, NaN/NaT values are removed in single-agg situations
- numeric_only=None: Returns all values as the same dtype where possible, NaN/NaT are removed in single-agg situations
- numeric_only=False: Returns all values as the same dtype where possible, NaN/NaT are preserved
- For data types that are lossy (ie if I throw anything that's not a 0 into
bool()
it'll return True) make sure we're only converting to bool for lossless aggs likemin
,max
, andmedian
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for these changes. Here's some comments and questions for you:
@V1NAY8 Btw I really appreciate everything you're doing here! Recently we announced a program for rewarding community contributors. You should check it out :) |
@sethmlarson elif pd_agg in {"max", "min", "median", "sum"}:
# 'sum' isn't representable with bool, use int64
if pd_agg == "sum" and field.is_bool:
agg_value = np.int64(agg_value)
else:
agg_value = field.np_dtype.type(agg_value) Should this happen for numeric_only=True |
@V1NAY8 I think this is only needed for numeric_only=None/False cases. |
@V1NAY8 Sorry I haven't reviewed this yet, other things are eating up all my time. It is still on my list, thank you for the patience :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is my last review! Thanks for all this work :)
jenkins test this please |
CI is passing btw :) |
jenkins test this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks for the Patience and reviewing this. It came back and forth. Ill make my contributions better to reduce multiple reviews. 😄 |
@V1NAY8 No worries! This was a really big change, it's tough to get things like this through on round one. |
Boom! 🎉 |
Closes #254