Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Types.py for wrapping DH column types and table operations around them. #1088

Merged
merged 17 commits into from
Aug 23, 2021

Conversation

jcferretti
Copy link
Member

Initial version contains enough to (1) rewrite Kafka column specifications (in subsequent PR, not here) and (2) provide a way to create table from an array of rows, with a similar API to Panda's DataFrame ctor.

import deephaven.Types as dh

data = [ [ "Ashley", 92, 94, 93, 3.9 ],
         [ "Jeff",   78, 88, 93, 2.9 ],
         [ "Rita",   87, 81, 84, 3.0 ],
         [ "Zach",   74, 70, 72, 1.8 ] ]
columns = [ ( "Name",    dh.string ),
            ( "Test1",   dh.int_ ),
            ( "Test2",   dh.int_ ),
            ( "Average", dh.int_ ),
            ( "GPA",     dh.double ) ]
grades = dh.table_of(data, columns)

Deephaven - Google Chrome_079

Selection_078

Comment on lines 60 to 62
@_passThrough
def asType(type_str):
return _table_tools_.typeFromName(type_str)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, returning a class is not a proper way to get the type information into jpy. You need to use jpy.get_type(...), as it will then be recognized as a proper python "type".

To demonstrate difference in return types between get_type and something that returns Class:

>>> Integer = jpy.get_type('java.lang.Integer')
>>> type(Integer)
<class 'type'>
>>>
>>> type(Integer(1).getClass())
<class 'java.lang.Class'>

So:

int_ = jpy.get_type('int')
int_array = jpy.get_type('[I')
stringset = jpy.get_type('io.deephaven.db.tables.libs.StringSet')

etc.

Copy link
Member

@devinrsmith devinrsmith Aug 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must be proper python type, like np.

>>> import numpy as np
>>> type(np.int32)
<class 'type'>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, take a peek hopefully now makes more sense.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, take a peek, hopefully now makes more sense.

def asType(type_str):
return _table_tools_.typeFromName(type_str)

bool_ = asType('java.lang.Boolean')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, numba and pandas represent bool_ as a byte; and I don't think they have the concept of a "null" boolean. Does this make our booleans incompatible?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without providing some kind of conversion, probably. Do we need a dh.bool_from_byte method?

raise Exception("when no column definitions are provided in the 'columns' argument, " +
"only an empty table can be created, and no data can be specified; instead " +
"got a non-empty 'data' argument with " + str(data))
return _table_tools_.emptyTable()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empty tables still need a size for number of rows.

return _col_def_.fromGenericType(col_name, data_type, component_type)

@_passThrough
def cols(ts):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method seems unused?

# None until the first _defineSymbols() call
_table_tools_ = None
_col_def_ = None
_python_tools_ = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems unused.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still relevant comment.

col_header = col_header.header(t[0], t[1])
except Exception as e:
raise Exception("Could not create column definition from " + str(t)) from e
return _table_.of(col_header)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is valid, can't call of on a col_header. To create a table with zero rows from a ColumnHeader(s) you can do:

_table_.of(_qst_newtable_.empty(col_header.tableHeader())), or something to that effect.

Alternatively, it's perfectly fine to not handle this as a special case, and just fall through to the general case and create Columns with no data in it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could make the NewTable.empty method a bit more generic, and accept Iterable<ColumnHeader> instead of TableHeader, in which case we can get rid of one layer of adapting:

_table_.of(_qst_newtable_.empty(col_header))

@jcferretti
Copy link
Member Author

jcferretti commented Aug 21, 2021

We added support for creating tables by column, passing in a first argument of dict, like:

#
# By column
#                                                                                                                                                                                                                                                                                                                             

import deephaven.Types as dh

data_by_col = {
    "Name"    : ( dh.string, [ "Ashley", "Jeff", "Rita", "Zach"  ] ),
    "Test1"   : ( dh.int_,   [    92,      78,    87,      74,   ] ),
    "Test2"   : ( dh.int_,   [    94,      88,    81,      70,   ] ),
    "Average" : ( dh.int_,   [    93,      93,    84,      72,   ] ),
    "GPA"     : ( dh.double, [     3.9,     2.9,   3.0,     1.8  ] )
}

grades2 = dh.table_of(data_by_col)


--

#                                                                                                                                                                                                                                                                                                                                                     
# No data.                                                                                                                                                                                                                                                                                                                                            
#                                                                                                                                                                                                                                                                                                                                                     

import deephaven.Types as dh

column_defs = {
    "Name"    : dh.string,
    "Test1"   : dh.int_,
    "Test2"   : dh.int_,
    "Average" : dh.int_,
    "GPA"     : dh.double
}

grades3 = dh.table_of(column_defs)


Deephaven - Google Chrome_087

Deephaven - Google Chrome_086

@devinrsmith
Copy link
Member

So, I think there may be value in trying to directly reference io.deephaven.qst.type.Type as a field in a class that is a python Type. I know I led you down the jpy.get_type(...) route, but I think the adapting between different objects would be easier without having to do the map/class lookups.

For example, instead of having to call fromGenericType (which is incorrect for primitive types), we want to rely on the exact conversions from the QST when applicable:

public static ColumnDefinition<?> from(ColumnHeader<?> header)

And then also, I question if we actually need python support for creating/returning ColumnDefinitions? Or, would sticking w/ ColumnHeader serve us better? I'm not sure where cols(...) is used from.

@devinrsmith
Copy link
Member

Oh, I see the cols() calls from kafka tools now. I wonder if we should try to use ColumnHeader<?> in KafkaTools instead of ColumnDefinition?

@@ -211,7 +215,7 @@ def json(col_defs, mapping:dict = None):
raise Exception("'col_defs' argument needs to be a sequence of tuples, instead got " +
str(col_defs) + " of type " + type(col_defs).__name__)
try:
col_defs = _tuplesListToColDefsList(col_defs)
col_defs = dh.cols(col_defs)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the place where we need cols. The Java side API for Kafka ingestion takes a ColumnDefinition[] for JSON.

@jcferretti
Copy link
Member Author

This latest version has python DataType rebuilt on top of qst.type.Type.
Ready for review followup.
Deephaven - Google Chrome_088

Comment on lines +122 to +123
except Exception as e:
raise Exception("Could not get java class type from " + str(data_type)) from e
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you seen this happen? I think qst.type.Type#clazz should always return successfully.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type annotations help linters but they don't guarantee the type at runtime. For user accessible functions that call here, if the user passes the wrong object in, that's the message they will receive.

Comment on lines 66 to 74
# For more involved types, you can always use the string representation
# of the Java class (Class.getName()) to get a python type for it.
@_passThrough
def typeFromJavaClassName(name : str):
"""
Get the column data type for the corresponding Java type string reprensentation
The string provided should match the output in Java for Class.getName()
for a class visible to the main ClassLoader in the Deephaven engine in use.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These comments are not correct. String[].class.getName is not "java.lang.String[]". name here is a Deephaven-specific "pretty" version.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

stringset = typeFromJavaClassName('io.deephaven.db.tables.libs.StringSet')
datetime = DataType(_qst_type_.instantType())

byte_array = typeFromJavaClassName('byte[]')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can go from a non-array type to an array type like byte.arrayType()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

# For more involved types, you can always use the string representation
# of the Java class (Class.getName()) to get a python type for it.
@_passThrough
def typeFromJavaClassName(name : str):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to allow python user to define arbitrary types? I might suggest "no", in which case, this can become internal.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

# None until the first _defineSymbols() call
_table_tools_ = None
_col_def_ = None
_python_tools_ = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still relevant comment.

Comment on lines +203 to +204
if _isPrimitive(col_type):
return _qst_column_.ofUnsafe(col_name, jvalues)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will break for bool_, as written.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Comment on lines +52 to +54
public static <T> Column<T> of(String name, Type<T> type, T... values) {
return of(name, Array.of(type, values));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice addition.

@devinrsmith devinrsmith self-requested a review August 23, 2021 15:21
devinrsmith
devinrsmith previously approved these changes Aug 23, 2021
cpwright
cpwright previously approved these changes Aug 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants