
Support for the CategoryMapper operator #941

Closed

etiotto opened this issue Oct 21, 2021 · 3 comments · Fixed by #960, #971 or #1019

etiotto commented Oct 21, 2021

Implement the CategoryMapper operator. The operator has one input tensor, and the output tensor has the same shape as the input tensor. The output is produced by mapping each value in the input tensor: an input value is matched against one of two equal-length sequences provided as attributes, and the value at the same position in the other sequence becomes the output value.

The operator converts either integers to strings or strings to integers, depending on which default value attribute is provided. Only one default value attribute should be defined.
If the string default value is set, it will convert integers to strings. If the int default value is set, it will convert strings to integers.
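
For illustration, a minimal Python sketch of these semantics (string-to-int64 direction, with the attribute names from the ONNX spec; the sample data is made up):

# Sketch of CategoryMapper in the string -> int64 direction.
# Attribute names follow the ONNX spec: cats_strings, cats_int64s, default_int64.
cats_strings  = ["cat", "dog", "bird"]
cats_int64s   = [0, 1, 2]
default_int64 = -1

mapping = dict(zip(cats_strings, cats_int64s))

def category_mapper(x):
    # The output has the same shape as the input; each element is looked up
    # independently and falls back to the default when there is no match.
    return [mapping.get(v, default_int64) for v in x]

print(category_mapper(["dog", "fish", "cat"]))  # [1, -1, 0]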

@AlexandreEichenberger

@etiotto I thought some more about the problem of initializing the data structure for Category Mapper.

Approach 1: adopt a prepare / compute / cleanup model.

  1. The compiler emits code in a separate "prepare" function to initialize the category mapper data structure (essentially a fast dictionary from string to int, for example).
  2. The user has to call "prepare" before ever calling a compute. Prepare returns a data structure that is then passed to the compute and cleanup function invocations.
  3. When calling "compute", special nodes relying on state can locate the needed state in the model data structure.
  4. When the user wants to shut down the model to free memory, they call "cleanup". (A sketch of this call pattern follows below.)
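
A rough sketch of the call pattern this approach implies; every name below is hypothetical and only meant to illustrate the prepare/compute/cleanup split, not an existing onnx-mlir API:

# Hypothetical prepare/compute/cleanup convention (illustrative names only).
class ModelState:
    """State built once by 'prepare', e.g. a fast string -> int dictionary."""
    def __init__(self, keys, values):
        self.category_map = dict(zip(keys, values))

def prepare(keys, values):
    # Build the lookup structures that stateful ops will need at compute time.
    return ModelState(keys, values)

def compute(state, inputs, default=-1):
    # Stateful nodes (e.g. CategoryMapper) locate what they need in 'state'.
    return [state.category_map.get(v, default) for v in inputs]

def cleanup(state):
    # Release whatever 'prepare' allocated.
    state.category_map.clear()

state = prepare(["cat", "dog"], [0, 1])
print(compute(state, ["dog", "fish"]))  # [1, -1]
cleanup(state)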

The advantage of this approach is that we can easily store all sorts of data to accelerate operations such as category mapper. Because the prepare/compute/cleanup code all lives in the runtime, it will always stay in sync.

Approach 2: store the data structure directly in the model.

  1. If we use a data structure that does not use absolute pointers but only relative indices (for example, indices into an array), we can have the compiler prepare the data structure at compile time (by invoking a method that loads the data structure into memory, as we would for a constant).
  2. At run time, we use this data structure like a constant and invoke a runtime function to scan it and return the result.

The advantage of this method is that we preserve all of the info in the model. A possible drawback is that relative indexing might be less efficient (indexing into an array as opposed to following a pointer), and that we need to make sure the method used to store the data is compatible with the method that will retrieve it. But I think we currently embed the runtime directly in the .so generated by onnx-mlir, so that should not be a problem (except that we may run the compiler in a different environment, with a different endianness, etc.; since it is the compiler that computes the data structure, we have to be a bit careful about this).
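
As a purely illustrative sketch of such a pointer-free layout (not the actual encoding, and using Python's builtin hash only as a stand-in), the whole table can be flattened into arrays addressed by integer indices:

# Illustrative only: a lookup table stored as flat arrays plus integer
# indices, so the compiler can emit it like any other constant and the
# runtime can use it without fixing up absolute pointers.
def build(keys, values, num_buckets):
    # "Compile time": distribute entries into buckets, record bucket boundaries.
    buckets = [[] for _ in range(num_buckets)]
    for k, v in zip(keys, values):
        buckets[hash(k) % num_buckets].append((k, v))
    bucket_start, flat_keys, flat_vals = [0], [], []
    for b in buckets:
        for k, v in b:
            flat_keys.append(k)
            flat_vals.append(v)
        bucket_start.append(len(flat_keys))
    return bucket_start, flat_keys, flat_vals

def lookup(bucket_start, flat_keys, flat_vals, key, default=-1):
    # "Run time": index into the flat arrays instead of chasing pointers.
    b = hash(key) % (len(bucket_start) - 1)
    for i in range(bucket_start[b], bucket_start[b + 1]):
        if flat_keys[i] == key:
            return flat_vals[i]
    return default

tables = build(["cat", "dog", "bird"], [0, 1, 2], num_buckets=2)
print(lookup(*tables, "dog"), lookup(*tables, "fish"))  # 1 -1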

@etiotto @tungld @tjingrant @caoimhinuibrian @doru1004 @chentong319 Can you think of other options? Preferences?

@AlexandreEichenberger

@etiotto

I looked at the paper (http://stevehanov.ca/blog/index.php?id=119) and the algorithm looks pretty good.

In its simplest form, this algorithm uses a simple first hash function to map a key k into a bucket, and associates with that bucket an integer number d. This is the first hash map step. The bucket-specific integer d is then combined with the original key k, reusing the same simple hash function, to produce the final slot; this is the second hash map step, and it results in a perfect hash map. Perfect means that no two original keys map to the same entry during the second hash of the key. Because d is defined per first-level bucket, it gives the algorithm a lot of flexibility. The algorithm simply iterates over the values of d from 1 until it finds a value of d that satisfies the perfect hash property (no conflict on the second hash). It potentially iterates to infinity (more on this below).

There is a concern of infinite looping, which mainly occurs when the dictionary is pretty small. This can be addressed very simply by relaxing the "minimal" part of the perfect hashing.

    size = len(dict)

    # Step 1: Place all of the keys into buckets
    buckets = [ [] for i in range(size) ]

    for key in dict.keys():
        buckets[hash(0, key) % size].append( key )

If a lot of the keys map to the same (first hash map) bucket, that gives the bucket-specific d a lot of work to do to disambiguate all these entries. We can alleviate this simply by increasing size by a prime number (say 5, 13, ...), as this increases the total number of buckets and thus should reduce the primary hash map conflicts. Since the problem seems acute with small dictionaries, growing a small dictionary's hash map by a few entries seems pretty OK.

By simply putting the above code in a loop and repeatedly incrementing the overall size by a small prime number, we should be able to reduce the maximum number of conflicts until max-bucket-size / size < 0.3 or some other threshold.

Alternatively, we can also "abort" if the search for a suitable d becomes too long, in which case we discard the current solution and restart with a larger size. That would be the simplest way to guarantee termination (because eventually the hash map is so large that there are no conflicts).
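
To make this concrete, here is a hedged sketch of the whole construction with the abort-and-grow strategy folded in, following the blog post's algorithm; the FNV-style mixing function is the one the post uses, while max_d and grow_by are arbitrary illustrative choices (nothing here is existing onnx-mlir code):

def fnv_hash(d, key):
    # FNV-style mixing function from the blog post; d selects the hash.
    if d == 0:
        d = 0x01000193
    for c in key:
        d = ((d * 0x01000193) ^ ord(c)) & 0xFFFFFFFF
    return d

def create_perfect_hash(src, max_d=1000, grow_by=5):
    keys = list(src.keys())
    size = len(src)
    while True:
        # Step 1: place all keys into buckets using the primary hash.
        buckets = [[] for _ in range(size)]
        for key in keys:
            buckets[fnv_hash(0, key) % size].append(key)
        buckets.sort(key=len, reverse=True)   # hardest buckets first
        G = [0] * size
        V = [None] * size
        ok = True
        for bucket in buckets:
            if len(bucket) <= 1:
                break
            # Step 2: search for a d that places this bucket without conflicts.
            d, slots = 1, []
            while True:
                if d > max_d:                 # search drags on: abort ...
                    ok = False
                    break
                slot = fnv_hash(d, bucket[len(slots)]) % size
                if V[slot] is not None or slot in slots:
                    d, slots = d + 1, []      # conflict: try the next d
                else:
                    slots.append(slot)
                    if len(slots) == len(bucket):
                        break
            if not ok:
                break
            G[fnv_hash(0, bucket[0]) % size] = d
            for key, slot in zip(bucket, slots):
                V[slot] = src[key]
        if not ok:
            size += grow_by                   # ... and restart with a larger table
            continue
        # Singleton buckets: store a direct (negative) index instead of a d.
        free = [i for i in range(size) if V[i] is None]
        for bucket in buckets:
            if len(bucket) == 1:
                slot = free.pop()
                V[slot] = src[bucket[0]]
                G[fnv_hash(0, bucket[0]) % size] = -slot - 1
        return G, V

G, V = create_perfect_hash({"cat": 0, "dog": 1, "bird": 2})
# G and V can then be queried with a lookup like the PerfectHashLookup
# function quoted further down (with hash = fnv_hash).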

The second point of confusion seems to be about the condition below.

        if len(bucket) <= 1: break

The thing to understand is that the algorithm works in 2 different ways.

  1. When two or more keys collide in the primary hash map, it uses d to distinguish between the conflicting entries.
  2. When there is only a single key in a bucket, the d value is simply an index to that single entry of the hash table. So in this case there is no need to rehash; the lookup reads the entry directly.

To distinguish between case 1 and case 2 above, it encodes the entry for the second case as a strictly negative value. That is why this map is used as shown below.

# Look up a value in the hash table, defined by G and V.
def PerfectHashLookup( G, V, key ):
    d = G[hash(0,key) % len(G)]
    if d < 0: return V[-d-1]
    return V[hash(d, key) % len(V)]

d is the result of the first hash map lookup. When it is negative, we simply return V[-d-1], that is, a direct access into the value array V. When it is positive, we use d and the key for the secondary hash map. A subtle point in the above code is that G (the table of d values) and V may be of different sizes (as seen by the distinct moduli applied to the hash results). So we could grow only the G array in the above description to prevent infinite looping and still preserve a minimal value table.

Finally, we also need to add an OriginalKey array, because lookups can land on entries that correspond to no dictionary entry (keys absent from the dictionary still hash somewhere), and we need to detect that case.

@AlexandreEichenberger

And for integration of this with ONNX, we can simply map CategoryMapper(keys, values) to CategoryMapperImpl(keys, values, G), where G is an array of integer constants computed by the algorithm, and keys and values are the original keys and values reordered by the algorithm to match the slot where it expects each key and value. We can rely on the current onnx-mlir to encode these 3 arrays, and a runtime function can be used. So this should be pretty simple to do.
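
A hedged end-to-end sketch of what such a CategoryMapperImpl runtime helper could look like (the name, signature, and default handling are illustrative only); keeping the reordered keys around doubles as the OriginalKey check mentioned above, so inputs that are not in the dictionary fall back to the default value:

def fnv_hash(d, key):
    # Same FNV-style mixing function as in the construction sketch above.
    if d == 0:
        d = 0x01000193
    for c in key:
        d = ((d * 0x01000193) ^ ord(c)) & 0xFFFFFFFF
    return d

def category_mapper_impl(G, keys, values, x, default=-1):
    # G, keys, and values are compile-time constants: G holds the per-bucket d
    # values (negative means a direct index), while keys/values are reordered
    # so that entry i sits in the slot it hashes to. Comparing against
    # keys[slot] rejects inputs that are not in the dictionary, which then
    # map to the default value.
    out = []
    for query in x:
        d = G[fnv_hash(0, query) % len(G)]
        slot = -d - 1 if d < 0 else fnv_hash(d, query) % len(values)
        out.append(values[slot] if keys[slot] == query else default)
    return out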
