Refactor internals #110
base: main
Conversation
Thanks 🚀. We've already agreed on the general design a few times, so I focus on other aspects here. I'm mainly concerned about the conflation of subtle breaking changes (and I'm sure I can't catch every single one given the size of this PR), obvious user-API breakages, and internal refactoring. Many (though not all) of the breaking changes feel unrelated to "Refactor internals"; at a minimum, I'm hoping for an exhaustive enumeration to serve as a changelog and as a reviewing aid for me.
# Copyright (c) QuantCo 2023-2024
# SPDX-License-Identifier: BSD-3-Clause

import numpy as np
What is the intention behind these tests? If it is to maintain a precise understanding of upstream dependencies (Spox, onnx shape inference, or onnxruntime), then running them as part of the ndonnx test suite seems quite dubious: changes to ndonnx have no bearing on these tests, so they will always do the same thing until something external changes.
A better strategy would be to run them in CI weekly, or on a pixi update that changes Spox, onnx, or onnxruntime. We could then immediately exploit any fixes or improvements, or mitigate anything that is broken.
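As a rough sketch of what I mean: mark such tests with an `upstream` pytest marker and have only a scheduled CI job opt in. The marker name and environment variable here are made up, not anything that exists in the repo:

```python
# conftest.py -- sketch only; marker and env-var names are hypothetical.
import os

import pytest


def pytest_collection_modifyitems(config, items):
    # Tests marked `upstream` only pin down the behaviour of Spox, onnx
    # shape inference, or onnxruntime. Skip them in regular runs and let
    # a weekly CI job opt in via RUN_UPSTREAM_TESTS=1.
    if os.environ.get("RUN_UPSTREAM_TESTS"):
        return
    skip_upstream = pytest.mark.skip(reason="runs only in scheduled upstream CI")
    for item in items:
        if "upstream" in item.keywords:
            item.add_marker(skip_upstream)
```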
# Only set to `True` temporarily and only if there was a
# deliberate update to the schema.
I don't think we need schema-updating code (and if we did, it would be an independent script, not part of the test suite). Updating the schema breaks backwards compatibility. Any evolution should go into a "v2" schema so as not to break users' input/output wrangling code. Do you have a use case in mind for editing existing schemas?
I backported this code to #111. You are right that the schemas should not be updated. We might add more schemas, though, or one may want to regenerate the files for some other reason. If nothing else, it is good to see how the files were generated.
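For reference, a minimal sketch of the kind of freeze test this implies; the file path and the generation helper are placeholders, not the actual code in #111:

```python
import json
from pathlib import Path

SCHEMA_FILE = Path("schemas/v1.json")  # hypothetical location


def generate_schemas() -> dict:
    # Stand-in for the actual schema-generation logic.
    return {"version": 1, "dtypes": {"int64": {"fields": ["data"]}}}


def test_v1_schemas_are_frozen():
    # Regenerating must reproduce the committed file exactly; any
    # evolution goes into a new v2 file instead of editing this one.
    committed = json.loads(SCHEMA_FILE.read_text())
    assert generate_schemas() == committed
```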
@@ -5,7 +5,7 @@
 [](https://anaconda.org/conda-forge/ndonnx)
 [](https://pypi.org/project/ndonnx)

-An ONNX-backed array library that is compliant with the [Array API](https://data-apis.org/array-api/) standard.
+An ONNX-backed implementation of the [Array API](https://data-apis.org/array-api/) standard.
I don't agree with the rewrite of the README. Implementing the Array API is a very nice feature worth mentioning, but it is not the entire point of the library, nor its biggest utility at the present moment! If that were the only goal, the library wouldn't have an extension module, datetime/timedelta/nullable/user-defined data types, Spox interoperability, and probably other features.
In any case, it might be easier if you move the README change proposal to a new pull request, as the edits are mostly independent of the refactoring exercise (except the `Array` ones).
#
# Union types are exhaustive and don't create ambiguities with respect to user-defined subtypes.
# TODO: Rename
NCoreIntegerDTypes = (
NPrimitiveIntegerDTypes?
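Whatever the final name, the "exhaustive union" idea from the comment could look roughly like this; the dtype class names below are stand-ins, not ndonnx's actual ones:

```python
# Stand-in dtype classes; the real ones live inside ndonnx.
class Int8: ...
class Int16: ...
class Int32: ...
class Int64: ...


# A closed tuple of concrete dtypes: membership is checked by exact
# type, so a user-defined subclass of Int8 does not silently match.
NPrimitiveIntegerDTypes = (Int8, Int16, Int32, Int64)


def is_primitive_integer(dtype: object) -> bool:
    return type(dtype) in NPrimitiveIntegerDTypes
```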
)
self._data = data

def disassemble(self) -> dict[str, Var]:
The optional mask creates a subtle but important breaking change: inference-handling code that previously assumed a one-to-one, static relationship between data type and "fields" sees this invariant violated (I believe). This means you can read the metadata from an ONNX model, but that does NOT tell you everything about how to split up your inputs for the ONNX backend being used; i.e. inspecting the metadata, I have no idea whether the "null" field is present or not, and I also have to check all the inputs to the model (see the sketch after this list).
Do I have this correct?
I think there are two aspects to address before merging this:
- This is a very subtle but undocumented breaking change. Is it truly so necessary that it gets incorporated along with a substantial refactoring of internal behaviour? I understand that there will be perf wins.
- If it is urgent, can we keep the very nice property that the dtype schema/metadata is sufficient to know exactly how to pass things to your ONNX model?
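To make the concern concrete, here is how I read the change; the field names below are illustrative, not ndonnx's actual ones:

```python
# Old invariant: dtype -> fields was static, so the metadata alone told
# you how to split your inputs for the backend.
old_nullable_int64 = {"values": "int64 tensor", "null": "bool tensor"}

# New behaviour: the mask is only materialised when needed, so two
# models with identical dtype metadata can expect different inputs.
new_dense = {"values": "int64 tensor"}
new_masked = {"values": "int64 tensor", "null": "bool tensor"}

# A consumer can no longer derive the input layout from the schema and
# must additionally inspect the model's graph inputs.
assert set(new_dense) != set(new_masked)
```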
[f"{k}: {v}" for k, v in self._tyarray.__ndx_value_repr__().items()] | ||
) | ||
shape = self._tyarray.shape | ||
return f"array({value_repr}, shape={shape}, dtype={self.dtype})" |
The new print formatting gets a little long for constant arrays. Can we remove the shape for these?
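Something along these lines, perhaps, reusing the attributes from the snippet above and assuming some way to detect constants (`is_constant` is made up):

```python
def __repr__(self) -> str:
    value_repr = ", ".join(
        f"{k}: {v}" for k, v in self._tyarray.__ndx_value_repr__().items()
    )
    if getattr(self._tyarray, "is_constant", False):
        # Constant arrays: the printed values already convey the shape.
        return f"array({value_repr}, dtype={self.dtype})"
    return f"array({value_repr}, shape={self._tyarray.shape}, dtype={self.dtype})"
```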
return x.endpoints.shape[:-1]


class List(StructType):
This test was very valuable for probing the limits of the dtype system. I'm sure it's implementable again - can you please bring back the `List` test and/or add some custom-dtype tests that illustrate how dtypes get defined from outside of ndonnx? Right now, everything (categoricals, the object dtype) has been inlined into the library, which won't be the long-term way to support niche dtypes without a clear NumPy counterpart. A sketch of what such a test could look like follows below.
(It might also be a nice fit for the docs, in the user-defined dtypes example that currently has a `...`!)
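As a starting point, the re-introduced test could stay deliberately small; the hooks on `List` and the `argument` constructor are assumptions about whatever interface the refactor ends up exposing:

```python
import ndonnx as ndx

# Wherever the refactor exposes the extension point; this path is assumed.
from ndonnx import StructType


class List(StructType):
    """A variable-length list dtype defined entirely outside ndonnx."""

    # The concrete hooks (fields, parsing, propagation) would follow the
    # refactored dtype interface.


def test_user_defined_dtype_is_usable():
    # The property worth protecting: a third-party dtype works without
    # touching ndonnx internals.
    arr = ndx.argument(shape=("N",), dtype=List())
    assert isinstance(arr.dtype, List)
```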
I also think enumerating the breaking changes (some of which I've commented on already) will help with this.
@@ -1,52 +1,27 @@
Experimental
============
Thanks for already taking care of parts of the docs :). The Spox Integration page looks unedited.
I also think we should indeed ship datetime64 and timedelta64, but not the categorical or "object" dtype. It would be good to remove the pandas dependency added to the test environment and leave that to downstream libraries.
Co-authored-by: Aditya Goel <[email protected]>
This test is back-ported from #110 since it is important that that PR does not break the current schemas.
This PR is a large refactor of the internal workings of ndonnx. The newly introduced concepts are described in a dedicated file.
Breaking changes are limited to functions not found in the Array API standard and can likely be reduced further.
Closes #75.
Closes #74.