
Add support for constructing tensor/ktensor from data without copying #145

Closed
etphipp opened this issue Jun 9, 2023 · 19 comments · Fixed by #182

etphipp (Contributor) commented Jun 9, 2023

This is an issue to discuss adding support for creating tensor and ktensor objects (and probably others) without copying the original data they are constructed from. This would improve efficiency, especially for third-party codes trying to integrate with pyttb (such as GenTen). I have a PR that could be submitted which implements one proposed solution: adding a copy parameter to the from_data methods of tensor and ktensor (defaulting to True to preserve existing behavior). Note also that the various functions for creating the different tensor types from data are not entirely consistent about whether the data is copied.

Another reasonable solution, and one I would argue is preferable, is to never explicitly copy the data in these functions and, if the user wants a copy, have them call the class's copy method instead.
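
For illustration, here is a rough sketch of the two options; the copy keyword shown is the proposed (hypothetical) interface, not the current one:

import numpy as np
import pyttb as ttb

weights = np.ones(2)
fm0 = np.random.rand(4, 2)
fm1 = np.random.rand(5, 2)

# Option 1: add a copy parameter to from_data (proposed signature),
# defaulting to True to preserve current behavior.
K1 = ttb.ktensor.from_data(weights, [fm0, fm1], copy=False)  # wrap without copying

# Option 2: from_data never copies; the caller requests a copy explicitly.
K2 = ttb.ktensor.from_data(weights, [fm0, fm1])  # holds references to the inputs
K3 = K2.copy()                                   # deep copy only on request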

dmdunla (Collaborator) commented Jun 9, 2023

I like the idea of having all from_data methods not make copies and providing an explicit copy method. This is essentially what ktensor.from_tensor_type does.

etphipp (Contributor, Author) commented Jun 9, 2023

I can start working on a PR for that. Scanning through the code, it looks like the arguments are often copied before being passed to from_data anyway. And ktensor already has a copy method.

dmdunla (Collaborator) commented Jun 9, 2023

Here are the constructors with suggested behavior:

  • tensor
    • from_data: no copy
    • from_tensor_type: copy (to support conversion)
    • from_function: new
  • ktensor
    • from_data: no copy
    • from_tensor_type: copy (for consistency)
    • from_factor_matrices: no copy (creates weights of 1 and then calls from_data)
    • from_function: new
    • from_vector: copy (data is reshaped)
  • sptensor
    • from_data: no copy
    • from_tensor_type: copy (for consistency)
    • from_function: new
    • from_aggregator: new/copy (as data may be aggregated and not be the same size afterwards)
  • ttensor
    • from_data: no copy
    • from_tensor_type: copy (for consistency)
  • tenmat
    • from_data: copy
    • from_tensor_type: copy (for consistency)

dmdunla (Collaborator) commented Jun 9, 2023

Because from_tensor_type has different behaviors across classes, achieving consistency will be challenging.

Suggestion: remove the from_tensor_type constructors and create either to_* or as_* methods that map from one class to another. This would limit the behavior to conversion only, as opposed to the current behavior of both converting and calling from_data, sometimes copying and other times not.

@ntjohnson1 @etphipp what do you think about this suggestion? This would break from TTB for MATLAB, but it would also be more aligned with other Python array/data packages, like numpy, scipy, and pandas.
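
As a rough sketch of the suggestion (the to_* names are illustrative, not existing API):

import numpy as np
import pyttb as ttb

K = ttb.ktensor.from_data(np.ones(2), [np.random.rand(4, 2), np.random.rand(5, 2)])

# Instead of a from_tensor_type constructor on the destination class,
# conversion would live on the source object:
T = K.to_tensor()    # ktensor -> dense tensor; conversion necessarily builds new data
S = T.to_sptensor()  # illustrative only: dense tensor -> sptensor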

etphipp (Contributor, Author) commented Jun 9, 2023

I'm not a Python expert by any stretch of the imagination, but that seems reasonable to me. I think people are used to the fact that Python and MATLAB just have different semantics when it comes to copies, so this would align better with what people expect (and be more efficient).

dmdunla (Collaborator) commented Jun 9, 2023

Here are example use cases that could drive this change (and corresponding methods in other classes):

  1. Constructor from data without copy:
     K = ttb.ktensor.from_data(weight, [fm0, fm1])
  2. Reference (no copy):
     K2 = K
  3. Explicit copy:
     K3 = K.copy()
  4. Conversion to other tensor type (may require a copy):
     T = K.to_tensor()

I think this would clean up the inconsistencies across the different classes and class constructors.

ntjohnson1 (Collaborator) commented:

I am 90% aligned with Danny's proposal in the latest comment, but this #145 (comment) seems to suggest a broader change to avoid the copy on construction wherever possible.

Here is numpy's current behavior around copies/references:

import numpy as np
a = list(range(3))
b = np.array(a) # Copy required no matter what since lists aren't contiguous in memory
a[0] = 5
assert b[0] != 5
c = np.asarray(b) # Explicit as to just be a reference if possible
b[0] = 7
assert c[0] == 7
d = np.array(c) # Default constructor always yields copy
c[0] = 9
assert d[0] != 9

I like the as_ approach to highlight the difference from the from_ cases. In Python, copies and references are a little ambiguous, so I think the additional interfaces help make things less surprising. I took a look at pandas and numpy to confirm that this seems to align with their usage.

With the approach above we would have this behavior (since we aren't making a copy):

import numpy as np
from pyttb import tensor
a = np.ones((3,3))
b = tensor.from_data(a)
a[0,0] = 5
assert b[0,0] == 5

as_ seems a little more intuitive: it signals that we are interpreting the original variables and data in a new way through an additional reference. I think to_ suggests something similar, that we are reinterpreting the same data (if possible; in practice I don't think we can switch between our tensor types without copies).

So something like:

  1. Constructor from data:
     a. Without copy:
        K = ttb.as_ktensor(weight, [fm0, fm1]) # or ttb.ktensor.as_ktensor(weight, [fm0, fm1])
     b. With copy:
        K = ttb.ktensor.from_data(weight, [fm0, fm1])
  2. Reference (no copy):
     K2 = K
  3. Explicit copy:
     K3 = K.copy()
  4. Conversion to other tensor type (may require a copy):
     T = K.to_tensor()

dmdunla (Collaborator) commented Jun 13, 2023

Since numpy is mainly focused on a single data class and accepts lots of array-like constructor inputs, this may not be a 1:1 alignment. I do not like the idea of numpy.asarray having different behavior based on whether the input is a numpy.ndarray or something else, because it then makes a copy or not based on the input, not based on which method the user calls. I think pyttb is more aligned with the scipy.sparse data classes, but I am not sure I like the constructor patterns there either; various types of input are accepted in the first argument, and when more than a single input is required you need to wrap it up in a tuple.

I would like to move towards a simpler interface that supports explicit copy requests through the constructor or via the copy class method. Also, our from_data methods could be replaced with (or aliased to) the default, non-empty constructors. Here are some examples:

  1. Constructor from data:
     a. Without copy:
        K = ttb.ktensor(weights, factor_matrices) # limit to a list of factor_matrices rather than allowing a sequence of args
        K = ttb.ktensor(None, factor_matrices)    # would replace ktensor.from_factor_matrices()
     b. With copy:
        K = ttb.ktensor(weights, factor_matrices, copy=True)
        K = ttb.ktensor(None, factor_matrices, copy=True)
  2. Reference (no copy):
     K2 = K
  3. Explicit copy:
     K3 = K.copy()
  4. Conversion to other tensor type (may require a copy):
     T = K.to_tensor()

This would mean that we could have the following constructors:

  • ktensor
    • ktensor(): empty, no copy
    • ktensor(weights, factor_matrices, copy=False): copy on request
    • ktensor.from_function(...): new
    • ktensor.from_vector(...): copy
  • tensor
    • tensor(): empty, no copy
    • tensor(data, copy=False): copy on request
    • from_function: new
  • sptensor
    • sptensor(): empty, no copy
    • sptensor(vals, subs, shape, copy=False): copy on request
    • from_function: new
    • from_aggregator: copy
  • ttensor
    • ttensor(): empty, no copy
    • ttensor(core, factor_matrices, copy=False): copy on request
  • tenmat
    • tenmat(): empty, no copy
    • tenmat(data, rdims, cdims, tshape, copy=False): copy on request

All of the classes with a full method already support to_tensor. There are not many other conversions we need to support, but those can be added through other to_* methods as needed.
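
A small sketch of the intended semantics under this proposal (the constructor signature and copy keyword are the proposed interface, not the current one):

import numpy as np
import pyttb as ttb

data = np.ones((3, 3))

T_ref = ttb.tensor(data)             # proposed default: holds a reference to data
T_cpy = ttb.tensor(data, copy=True)  # proposed: deep copy on request

data[0, 0] = 5.0
# Under the proposal, T_ref sees the change and T_cpy does not.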

ntjohnson1 (Collaborator) commented:

If the global default is going to be no copy, then I think we drop the copy flag entirely and just make it very clear that the default constructors hold a reference to the data (users can always make a copy before handing it to us). That's fine with me, but we probably want to show an example of that somewhere early in the docs to avoid subtle bugs. We will probably also want to update our tests to confirm we are in fact avoiding the copies. Right now I think our from_data methods also don't do any validation; if they become the default option, we probably want to add validation.

This example of references vs. copies caught me the other day, which is why I am more sensitive to it now, I guess:

>>> a = [[0]*2]*3
>>> a[0][0] = 1
>>> a
[[1, 0], [1, 0], [1, 0]]
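
(For anyone hitting the same surprise: the repetition with * reuses the same inner list object three times; a list comprehension creates independent inner lists.)

>>> a = [[0] * 2 for _ in range(3)]
>>> a[0][0] = 1
>>> a
[[1, 0], [0, 0], [0, 0]]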

Nit: for ktensors, if we are trying to drop from_factor_matrices, I think flipping the argument order is nicer, so that when using only factor matrices the None can be left off:

ktensor(factor_matrices=None, weights=None)
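
For example, with that ordering (and assuming weights default to ones when omitted, as from_factor_matrices does today), both of these would work without a placeholder None:

K = ttb.ktensor([fm0, fm1])           # weights default to ones
K = ttb.ktensor([fm0, fm1], weights)  # explicit weights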

dmdunla (Collaborator) commented Jun 13, 2023

I agree; leaving out the copy flag and requiring users to explicitly call .copy() when they want a copy would make it cleaner.

+1 for validation in constructors.

We need to determine the best way to support moving from ktensor.from_data() to ktensor() constructors. Should we use ktensor.__init__(factor_matrices=None, weights=None) to support the empty constructor, from_data, and from_factor_matrices functionality, or is there another way to do this?
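
One possible shape for such an __init__, just as a sketch (validation omitted; this follows the no-copy default discussed above):

import numpy as np

class ktensor:
    def __init__(self, factor_matrices=None, weights=None):
        if factor_matrices is None:
            # Empty constructor: ktensor()
            self.weights = np.array([])
            self.factor_matrices = []
            return
        if weights is None:
            # Replaces from_factor_matrices: weights default to ones
            weights = np.ones(factor_matrices[0].shape[1])
        # Replaces from_data: hold references, no copy
        self.weights = weights
        self.factor_matrices = factor_matrices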

The reference-vs-copy example above from @ntjohnson1 is definitely confusing. It is also opaque without digging into the code; here are the docs:

>>> help(list.__mul__)
Help on wrapper_descriptor:

__mul__(self, value, /)
    Return self*value.

With better documentation, it would be clearer that the nested lists are references to the same object. If we ever need tensor comprehensions with shortcut nested-reference semantics, we'll need to make sure we document the expected behavior.

ntjohnson1 (Collaborator) commented:

I would just go with ktensor.__init__(factor_matrices=None, weights=None) since that seems simplest/clearest. When I add typing to the init, we should also be able to make explicit which combinations we expect to work (see some of our other typed interfaces with similar unions/overloads).
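
For example, typed overloads could spell out the accepted combinations roughly like this (a sketch only):

from typing import List, overload
import numpy as np

class ktensor:
    @overload
    def __init__(self) -> None: ...
    @overload
    def __init__(self, factor_matrices: List[np.ndarray], weights: np.ndarray = ...) -> None: ...
    def __init__(self, factor_matrices=None, weights=None):
        # Single runtime implementation covering the empty, factor-matrices-only,
        # and factor-matrices-plus-weights cases.
        ...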

Sounds like we are in agreement and have a plan for this, then.

tgkolda (Collaborator) commented Jun 13, 2023

It seems that copying should be the default. For the vast majority of people, that will be simpler and prevent problems. Those who want performance can set the copy flag to False.

ntjohnson1 (Collaborator) commented:

I am also fine with copy being opt-out. I just felt copy being opt-in seemed unintuitive.

dmdunla (Collaborator) commented Jun 13, 2023

There does not seem to be a standard at this point across classes in related packages:

  • numpy.array: default is copy=True
  • scipy.sparse.coo_matrix: default is copy=False
  • pandas.DataFrame: default is copy=None (logic based on input type)

So, I suggest that we go with opt-out copying in pyttb classes: the default is copy=True. I think we can keep the .copy() methods, too, as this will help support both reference (K2 = K) and copy (K2 = K.copy()) assignment.
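
Concretely, with copy=True as the default, the behavior would look like this (the copy keyword is the proposed addition, not the current interface):

import numpy as np
import pyttb as ttb

data = np.ones((3, 3))

T = ttb.tensor.from_data(data)                  # proposed default: copies the input
T_ref = ttb.tensor.from_data(data, copy=False)  # opt out of the copy for performance

data[0, 0] = 5.0
# T is unaffected by the change; T_ref sees it.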

etphipp (Contributor, Author) commented Jun 13, 2023

So is this back to what I had originally proposed in PR #138, i.e., adding a copy=True argument to from_data() (probably along with other methods)?

dmdunla (Collaborator) commented Jun 13, 2023

Yes, it just took us a long time to determine that what you proposed is what we wanted to do. 👍

dmdunla self-assigned this Jun 13, 2023
dmdunla added this to the v2.0 milestone Jun 13, 2023
dmdunla added the "bug" and "doing" labels Jun 13, 2023
etphipp (Contributor, Author) commented Jun 14, 2023

So how would you like to proceed with this? I have changes that add a copy=True argument to from_data() for ktensor and tensor. Would you like me to submit that as a PR as a starting point? We can add more changes to it.

dmdunla (Collaborator) commented Jun 14, 2023 via email

etphipp (Contributor, Author) commented Jun 14, 2023

Sounds good, thanks!
