Support for the CategoryMapper operator #941
@etiotto I thought some more about the problem of initializing the data structure for CategoryMapper.

Approach 1: go to a prepare / compute / cleanup model.

The advantage of this approach is that we can easily store all sorts of data to accelerate operations such as CategoryMapper. Because the code to prepare/eval/cleanup is all in the runtime, it will always be in sync.

Approach 2: store the data structure directly in the model.

The advantage of this method is that we preserve all of the info in the model. A possible drawback is that relative indexing might be less efficient (indexing into an array as opposed to following a pointer), and we need to make sure that the method used to store the data is compatible with the method that will retrieve it. I think we currently embed the runtime directly in the .so generated by onnx-mlir, so that should not be a problem, except that we may run the compiler in a different environment (different endianness, ...); since it is the compiler that computes the data structure, we have to be a bit careful about this.

@etiotto @tungld @tjingrant @caoimhinuibrian @doru1004 @chentong319 Can you think of other options? Preferences?
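One way to address the endianness concern with Approach 2 is to serialize the precomputed tables into the model with an explicit byte order, so the bytes the compiler emits are valid regardless of the host it ran on. A minimal sketch (the function names and the `G`/`V` table layout here are hypothetical, not part of onnx-mlir):

```python
import struct

def serialize_tables(G, V):
    # Pack the two tables with an explicit little-endian layout ("<"),
    # so the blob is independent of the compiler host's endianness.
    # Assumes len(G) == len(V); G holds 32-bit ints, V 64-bit ints.
    blob = struct.pack("<I", len(G))
    blob += struct.pack(f"<{len(G)}i", *G)
    blob += struct.pack(f"<{len(V)}q", *V)
    return blob

def deserialize_tables(blob):
    # The runtime reads the same explicit layout back, again
    # independently of its own native byte order.
    (n,) = struct.unpack_from("<I", blob, 0)
    G = list(struct.unpack_from(f"<{n}i", blob, 4))
    V = list(struct.unpack_from(f"<{n}q", blob, 4 + 4 * n))
    return G, V
```

The same idea applies whatever the concrete data structure ends up being: fix the on-disk layout once, and both a big-endian compile host and a little-endian runtime host interpret it identically.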
I looked at the paper and the algorithm looks pretty good: http://stevehanov.ca/blog/index.php?id=119.

In its simplest form, this algorithm takes a simple first hash function to map each key to a bucket. There is a concern of infinite looping, which mainly occurs when the dictionary is pretty small. This can be addressed very simply by relaxing the "minimal" part of the perfect hashing.

```python
size = len(dict)
# Step 1: Place all of the keys into buckets
buckets = [ [] for i in range(size) ]
for key in dict.keys():
    buckets[hash(0, key) % size].append( key )
```

If a lot of the keys map to the same (first hash) bucket, that will give a lot of work to the search for the set-specific second hash. By simply putting the above code in a loop, and repeatedly incrementing the overall size by a small prime number, we should be able to reduce the maximum number of conflicts. Alternatively, we can also "abort" if the search for a suitable second hash takes too long.

The second confusion seems to be about the condition below:

```python
if len(bucket) <= 1: break
```

The thing to understand is that the algorithm works in 2 different ways:

1. For buckets with two or more keys, it searches for a displacement `d` such that the second hash, seeded with `d`, maps every key of the bucket to a free slot.
2. For buckets with a single key, it places the value directly into the next free slot, with no second hash at all.
To distinguish between case 1 and case 2 above, it encodes the entry for the second case as a strictly negative value. That is why the use of this map is as below:

```python
# Look up a value in the hash table, defined by G and V.
def PerfectHashLookup( G, V, key ):
    d = G[hash(0,key) % len(G)]
    if d < 0: return V[-d-1]
    return V[hash(d, key) % len(V)]
```
Finally, we also need to add an OriginalKey array, because we do have lookups for entries that correspond to no dictionary entries (the lookup above always lands on some slot, so we must check that the key stored there is the one we asked for).
And for integration of this with ONNX, we can simply map
Implement the `CategoryMapper` operator.

The operator has one input tensor. The output tensor has the same shape as the input tensor, and is produced by mapping the values of the input tensor. Mapping an input value to an output value is done by matching the input value in one of the two sequence attributes, which have equal length. The operator converts either integers to strings or strings to integers, depending on which default-value attribute is provided; only one default-value attribute should be defined. If the string default value is set, it converts integers to strings; if the int default value is set, it converts strings to integers.
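For concreteness, the semantics described above can be sketched as a small reference function. This is a sketch of the behavior stated in the ONNX spec, not the onnx-mlir implementation; the attribute names (`cats_int64s`, `cats_strings`, `default_int64`, `default_string`) follow the ONNX operator schema.

```python
import numpy as np

def category_mapper(x, cats_int64s, cats_strings,
                    default_int64=-1, default_string="_Unused"):
    # The two category attributes have equal length and define the
    # mapping in both directions; the input dtype selects the direction.
    if x.dtype.kind in ("U", "S", "O"):
        # strings -> ints: unmatched values map to default_int64
        table = dict(zip(cats_strings, cats_int64s))
        out = np.array([table.get(v, default_int64) for v in x.ravel()],
                       dtype=np.int64)
    else:
        # ints -> strings: unmatched values map to default_string
        table = dict(zip(cats_int64s, cats_strings))
        out = np.array([table.get(int(v), default_string) for v in x.ravel()],
                       dtype=object)
    # Output has the same shape as the input.
    return out.reshape(x.shape)
```

For example, `category_mapper(np.array([["cat", "dog"], ["fish", "cat"]]), [0, 1], ["cat", "dog"])` maps the two known strings to their integers and the unknown `"fish"` to the int default.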