Use dtype=int for the indices returned by TopK #11031

sxjscience · 2018-05-23T08:13:49Z

The dtype of ret_indices in "topk" is currently real_t. Should we revise it to int to avoid possible loss of precision? (https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/ordering_op-inl.h#L432)

The text was updated successfully, but these errors were encountered:

marcoabreu · 2018-05-23T17:30:04Z

@asmushetzel

asitstands · 2018-05-25T14:27:25Z

This could break existing code. Sometimes the resulting array of indices is subject to another operation. Then the result can become different if the dtype of the index array changes. Or it can make an error as some opertions can be applied to only floating point arrays.

nd.topk(nd.arange(10), k=2).mean()

Currently the result is [8.5], but the result is [8] if the dtype is integer.

nd.linalg.gemm2(nd.topk(x, k=2), y)

Currently this is allowed, but it'll cause an error if the dtype changes to integer.

multinomial has a similar but opposite issue. It returns an index array of integer type, but it would be convenient to have floating point array sometimes (PR #10970). I think that it would be better if we can make multinomial and topk consistent.

sxjscience · 2018-05-29T06:25:45Z

Yes, this would be backward incompatible (So I create the issue to discuss how we should proceed). Using int32 will avoid bugs in the future when users run topk on some large tensors.

asmushetzel · 2018-05-29T12:07:59Z

A few general remarks about indices in mxnet:

There is no specific convention in MXNet about what datatype to be used when encoding indices. It is basically left to the user or inferred from other data. This is generally fine and consistent. For example look at all operators that are indeed consumers of index arrays (pick, take, gather_nd):
They are all written in a way such that the datatype of the indices can be anything, as long as it can be cast to an int internally. In particular, the datatype does not have to be compliant with other inputs to such operators (i.e. take() will accept the data array to be of type double and the index array of type int/float/half_t/whatever). I think this is a really good design and we should not impose any restrictions that the incoming indices must be of integral type. Neither that every index should be integral.

The situation is slightly different for operators that actually produce index arrays (without having any index array as input). Such operators are topk, argsort, argmax, argmin, argmax_channel (hope I didn't forget one here). Currently they infer the index type (used in the output array of the operator) from the datatype used in the input. But that is weird as there is no objective correlation between data type and index type. If data input is float, why should be then the index type also a float? Moreover it can cause weird problems: If you want to get the index of the maximum element of a 10M large array of floats, then the float-type used as output would not be able to represent all potential indices (while an int would be able). To make such thing work, you currently would have to cast the entire data input to doubles, then receive indices as a double, then cast back to int. And even that may not work as the operator may not support doubles (which is the case for topk).

So my proposal would be to make a change for all index-producing operators (listed above) to allow explicit setting of the output index type by a parameter.

We may decide to keep backward compatibility by making the default behaviour match the current behaviour or choose int as the default and accepting non-backward compatibility.

But we should do something;-)

sxjscience · 2018-05-29T12:34:12Z

Sounds good. We can add a argument called ‘idx_dtype’, which is set to be backward-compatible by default. Get Outlook for iOS<https://aka.ms/o0ukef>

…

________________________________ From: moin <[email protected]> Sent: Tuesday, May 29, 2018 8:08:21 PM To: apache/incubator-mxnet Cc: Xingjian SHI; Author Subject: Re: [apache/incubator-mxnet] Use dtype=int for the indices returned by TopK (#11031) A few general remarks about indices in mxnet: There is no specific convention in MXNet about what datatype to be used when encoding indices. It is basically left to the user or inferred from other data. This is generally fine and consistent. For example look at all operators that are indeed consumers of index arrays (pick, take, gather_nd): They are all written in a way such that the datatype of the indices can be anything, as long as it can be cast to an int internally. In particular, the datatype does not have to be compliant with other inputs to such operators (i.e. take() will accept the data array to be of type double and the index array of type int/float/half_t/whatever). I think this is a really good design and we should not impose any restrictions that the incoming indices must be of integral type. Neither that every index should be integral. The situation is slightly different for operators that actually produce index arrays (without having any index array as input). Such operators are topk, argsort, argmax, argmin, argmax_channel (hope I didn't forget one here). Currently they infer the index type (used in the output array of the operator) from the datatype used in the input. But that is weird as there is no objective correlation between data type and index type. If data input is float, why should be then the index type also a float? Moreover it can cause weird problems: If you want to get the index of the maximum element of a 10M large array of floats, then the float-type used as output would not be able to represent all potential indices (while an int would be able). To make such thing work, you currently would have to cast the entire data input to doubles, then receive indices as a double, then cast back to int. And even that may not work as the operator may not support doubles (which is the case for topk). So my proposal would be to make a change for all index-producing operators (listed above) to allow explicit setting of the output index type by a parameter. In order to keep backward compatibility, the default should stay as is (inferred from data etc). — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub<#11031 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AE8D7k1F_IUwfgiNxu41J0qBAAdm107Wks5t3To1gaJpZM4UJ-Oc>.

asitstands · 2018-05-29T14:33:00Z

I think that random.multinomial also need to be in the list. I'm positive on the introducion of the new name idx_dtype, but using the old dtype also looks not bad for the consistency, as the operators in random package use dtype to specify the data type of the output array. We need to set the backward compatible default currently, but I agree on changing it as integer in 2.0. Anyway an index is an integer by its nature.

sxjscience · 2018-05-29T16:04:27Z

Yes, dtype should be better. I can try to add the dtype option for topk first. One issue is that the compiling speed would be slower if we use a nested macro to consider all the combinations. Get Outlook for iOS<https://aka.ms/o0ukef>

…

________________________________ From: Deokjae Lee <[email protected]> Sent: Tuesday, May 29, 2018 10:33:26 PM To: apache/incubator-mxnet Cc: Xingjian SHI; Author Subject: Re: [apache/incubator-mxnet] Use dtype=int for the indices returned by TopK (#11031) I think that random.multinomial also need to be in the list. I'm positive on the introducion of the new name idx_dtype, but using the old dtype also looks not bad for the consistency, as the operators in random package use dtype to specify the data type of the output array. We need to set the backward compatible default currently, but I agree on changing it as integer in 2.0. Anyway an index is an integer by its nature. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub<#11031 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AE8D7venndZZjQAn9yfZMgWOs0XeROLLks5t3Vw2gaJpZM4UJ-Oc>.

asmushetzel · 2018-05-29T21:30:29Z

I think we should make it consistent from the beginning, i.e. add all affected operators in one shot.

szha added the Operator label May 24, 2018

szha mentioned this issue May 29, 2018

[Discussion] MXNet 2.0 Roadmap (was: APIs that might be a good idea to break in 2.0) #9686

Closed

sxjscience mentioned this issue Jun 4, 2018

[MXNET-507] Set dtype=int32 for ret_indices in ordering ops #11134

Closed

8 tasks

sxjscience mentioned this issue Jul 30, 2018

[Operator] Make sure the result of topk is sorted #10204

Closed

sxjscience closed this as completed Aug 23, 2018

szha mentioned this issue Sep 13, 2019

[RFC] Apache MXNet 2.0 Roadmap #16167

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use dtype=int for the indices returned by TopK #11031

Use dtype=int for the indices returned by TopK #11031

sxjscience commented May 23, 2018

marcoabreu commented May 23, 2018

asitstands commented May 25, 2018 •

edited

Loading

sxjscience commented May 29, 2018

asmushetzel commented May 29, 2018 •

edited

Loading

sxjscience commented May 29, 2018 via email

asitstands commented May 29, 2018

sxjscience commented May 29, 2018 via email

asmushetzel commented May 29, 2018

Use dtype=int for the indices returned by TopK #11031

Use dtype=int for the indices returned by TopK #11031

Comments

sxjscience commented May 23, 2018

marcoabreu commented May 23, 2018

asitstands commented May 25, 2018 • edited Loading

sxjscience commented May 29, 2018

asmushetzel commented May 29, 2018 • edited Loading

sxjscience commented May 29, 2018 via email

asitstands commented May 29, 2018

sxjscience commented May 29, 2018 via email

asmushetzel commented May 29, 2018

asitstands commented May 25, 2018 •

edited

Loading

asmushetzel commented May 29, 2018 •

edited

Loading