Use of "uint64" for the "hipscat ID" and the lack of 64-bit unsigned types in IVOA standards #317

Closed
gpdf opened this issue Jul 29, 2024 · 8 comments
@gpdf

gpdf commented Jul 29, 2024

I have been experimenting with adapting "hipscat" datasets (concretely, https://data.lsdb.io/unstable/gaia_dr3/gaia/ , as a sample) to use in IVOA contexts.

I've run across an awkward issue: it appears that there are two "internal" columns in the dataset, Dir and Npix, which are declared uint64[pyarrow]. This is a perfectly reasonable choice, except for an awkward feature of the type system of existing astronomy standards:

The VOTable type primitives do not include an unsigned 64-bit integer (neither does FITS, though in FITS there's an ugly workaround using BZERO or TZEROn).

This means that it's unsafe to simply query a "hipscat" dataset and return a result including either of those two columns through VOTable, i.e., through any current IVOA protocol, if the high-order bit is set.

If the Npix column indeed contains IDs in your (HEALPix_id, counter) scheme, in which the HEALPixel ID is in the high-order 42 bits of the word, then the high-order (sign) bit will be set frequently, perhaps for 1/3 of the space of IDs (the largest possible ID appears to be 0xBFFF FFFF FFFF FFFF, if the counter is maxed out).
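
A quick sketch of the arithmetic behind those figures, assuming the layout described above (an order-19 HEALPix index in the top 42 bits with a 22-bit counter below it). The constant and function names are illustrative and are not taken from the hipscat code.

```rust
// Assumed layout: 42-bit order-19 HEALPix index, then a 22-bit counter.
const COUNTER_BITS: u32 = 22;
// Number of order-19 pixels: 12 * 4^19 = 12 * 2^38.
const N_PIXELS: u64 = 12 << 38;

/// Largest ID: the last pixel with the counter maxed out.
fn max_id() -> u64 {
    ((N_PIXELS - 1) << COUNTER_BITS) | ((1u64 << COUNTER_BITS) - 1)
}

/// First pixel index whose IDs have bit 63 (the sign bit) set.
fn first_pixel_with_msb() -> u64 {
    (1u64 << 63) >> COUNTER_BITS
}

fn main() {
    println!("max id = {:#018X}", max_id());
    let with_msb = N_PIXELS - first_pixel_with_msb();
    println!("pixels with MSB set: {} of {}", with_msb, N_PIXELS);
}
```

Under these assumptions, the pixels from index 8 * 2^38 upward (4 of the 12 base pixels) produce IDs with the sign bit set, which is where the 1/3 estimate comes from.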

As a member of the team for Rubin, which is expecting to require 64-bit IDs and is likely to eventually use every single bit, I find the inability to easily transport unsigned 64-bit values in the standards undesirable, but it is what it is for now.

If your ID scheme is non-negotiable, I think we will need to specify a standard interpretation for how "hipscat" IDs are expected to be represented in schemas in which only signed 64-bit integers are available.

I am assuming that you will find neither of the following alternatives acceptable? Both would permanently avoid the issue.

  1. Order-19 HEALPixel IDs (0.4" across) with 21-bit disambiguation counters (2M values)
  2. Order-18 HEALPixel IDs (0.8" across) with 23-bit disambiguation counters (8M values)
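
To illustrate why either alternative avoids the issue, here is a minimal sketch that packs the largest possible pixel index and counter for each layout and checks that the result fits in a signed 64-bit integer. The `pack` helper is hypothetical and only mirrors the bit widths listed above.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: pack a HEALPix index and a counter into a non-negative i64.
/// `counter_bits` is 21 for the order-19 layout or 23 for the order-18 layout.
fn pack(pixel: u64, counter: u64, counter_bits: u32) -> i64 {
    assert!(counter < (1u64 << counter_bits), "counter out of range");
    let id = (pixel << counter_bits) | counter;
    // The conversion fails only if bit 63 is set, which these layouts rule out.
    i64::try_from(id).expect("ID does not fit in a signed 64-bit integer")
}

fn main() {
    let last_o19 = (12u64 << 38) - 1; // last order-19 pixel index
    let last_o18 = (12u64 << 36) - 1; // last order-18 pixel index
    println!("{:#X}", pack(last_o19, (1 << 21) - 1, 21));
    println!("{:#X}", pack(last_o18, (1 << 23) - 1, 23));
}
```

Both layouts use 63 bits in total (42 + 21 or 40 + 23), so the sign bit is never set.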

Note that the sole uint8 column, Norder, is not a problem. VOTable and FITS both have an unsigned byte type: unsignedByte and "B" (BITPIX=8), respectively.

@gpdf
Author

gpdf commented Jul 30, 2024

I have had a relevant realization:

VOTable 1.4, Section 6 states:

  • 64-Bit Integer If the value of the datatype attribute specifies datatype "long", the data in the BINARY/BINARY2 serialization shall consist of big-endian twos-complement signed 64-bit integers contained in eight bytes, with the most significant byte first, and subsequent bytes in order of decreasing significance. The representation of a Long Integer in the TABLEDATA serialization is either its decimal representation between -9223372036854775808 and 9223372036854775807 made of an optional - or + sign followed by digits, or its hexadecimal representation when starting with 0x and followed by 1 to 16 hexadigits. (emphasis mine)

I'd forgotten that hexadecimal representations were an option. This means that a conforming service can, at least, transmit the bit pattern of an unsigned 64-bit integer without any complication, either in BINARY2 or by using the hexadecimal option to TABLEDATA. It will still be an issue for clients to know what to do when they receive the data.
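
A minimal sketch of the hexadecimal option, assuming a writer that emits `0x` followed by up to 16 hex digits as in the quoted text. The function names are illustrative and do not come from any VOTable library.

```rust
/// Format a 64-bit pattern as a TABLEDATA hex cell: "0x" plus 1 to 16 hex digits.
fn to_tabledata_hex(v: u64) -> String {
    format!("0x{:X}", v)
}

/// Parse such a cell back into the same bit pattern.
fn from_tabledata_hex(s: &str) -> Option<u64> {
    let digits = s.strip_prefix("0x")?;
    u64::from_str_radix(digits, 16).ok()
}

fn main() {
    let id: u64 = 0xBFFF_FFFF_FFFF_FFFF;
    let cell = to_tabledata_hex(id);
    println!("{}", cell);
    assert_eq!(from_tabledata_hex(&cell), Some(id));
    // A client limited to signed longs sees the same bits as a negative number.
    println!("{}", id as i64);
}
```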

@delucchi-cmu
Contributor

I think we can get away with the first option, to simply reduce the counter space. However, this would make all existing hipscat datasets incompatible. It might make the most sense to hold off on this until we perform a wholesale renaming, so that the _hipscat_index can be wholly invalid in IVOA (both based on the name, and requiring an unsigned int), and the _bleepbloop_index can use a slightly smaller counter and fit inside a signed int64.

@delucchi-cmu delucchi-cmu moved this to Todo in HATS / LSDB Sep 11, 2024
@nevencaplar nevencaplar added this to the HATS 0.4 milestone Sep 17, 2024
@smcguire-cmu smcguire-cmu self-assigned this Sep 24, 2024
@fxpineau

I do not see any problem keeping uint64, on the contrary.

I don't know why VOTable does not support unsigned integers (except the unsignedByte datatype) or the signed byte.
Because the VOTable standard has been developed during the Java hype? More probably because FITS does not support them (Why? Which historical reasons?).

One way to support those types may be to rely on the xtype attribute (just as there is a limited set of stored types, but many logical types, in Parquet). I don't know if such xtypes already exist, @mbtaylor?

I tend to think that the tool converting from Parquet to VOTable will have the responsibility to cast unsigned integers into signed integers. It is a no-operation: the bit pattern is preserved. The difference will be that integers with the MSB set will be represented as negative numbers once converted into a String. The reverse operation (parsing a negative value into a signed integer and then casting it into an unsigned integer) will give back the original bit pattern.
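
The round trip described above can be sketched as follows. The `write_cell` and `read_cell` names are hypothetical stand-ins for whatever a Parquet-to-VOTable converter and a client would actually do.

```rust
/// Writer side: reinterpret the unsigned value as signed, then print it in decimal.
fn write_cell(v: u64) -> String {
    (v as i64).to_string()
}

/// Reader side: parse the signed decimal, then reinterpret the bits as unsigned.
fn read_cell(s: &str) -> Result<u64, std::num::ParseIntError> {
    s.parse::<i64>().map(|x| x as u64)
}

fn main() {
    let id: u64 = 0xBFFF_FFFF_FFFF_FFFF;
    let cell = write_cell(id);
    // The MSB is set, so the decimal text starts with a minus sign.
    println!("{}", cell);
    assert_eq!(read_cell(&cell).unwrap(), id);
}
```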

In the case of _hipscat_index, working from Java (only signed integers), you can retrieve the HEALPix index parsing the signed integer and then using the "unsigned right shift" >>> instead of the "right shift" >>.
But in other languages (supporting both signed and unsigned integers), it would be best to keep the unsigned type (else the right shift will preserve the MSB).
See this Rust example:

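fn main() {
    // Same bit pattern, interpreted as signed and as unsigned.
    println!("{:b}", -128_i8);
    println!("{:b}", 128_u8);
    // Signed right shift is arithmetic: the sign bit is copied in.
    println!("{:b}", -128_i8 >> 1);
    // Unsigned right shift is logical: a zero is shifted in.
    println!("{:b}", 128_u8 >> 1);
}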

the output is:

10000000
10000000
11000000
1000000

To perform operations to retrieve the HEALPix index, we have to work on the unsigned type (in cases where the MSB is set).
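
The same point at 64 bits, as a sketch: a value read as a signed long is cast back to unsigned before shifting. The 22-bit counter width is an assumption based on the 42-bit HEALPix field mentioned earlier in this thread, and the function names are illustrative.

```rust
// Assumed counter width: 64 bits minus the 42-bit HEALPix field.
const COUNTER_BITS: u32 = 22;

/// Recover the HEALPix index from a value stored as a signed long.
fn healpix_index(stored: i64) -> u64 {
    // Cast first, so that the shift is logical (zero-filling).
    (stored as u64) >> COUNTER_BITS
}

/// What happens without the cast: the arithmetic shift copies the sign bit in.
fn arithmetic_shift(stored: i64) -> i64 {
    stored >> COUNTER_BITS
}

fn main() {
    let stored = 0xBFFF_FFFF_FFFF_FFFFu64 as i64; // MSB set, so negative
    println!("logical:    {}", healpix_index(stored));
    println!("arithmetic: {}", arithmetic_shift(stored));
}
```

This is the Rust counterpart of using `>>>` rather than `>>` in Java.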

@mbtaylor

The historical reason for FITS and hence VOTable supporting only signed integers is probably inherited from FORTRAN 77, which has no unsigned integer types. What the thinking behind the VOTable roster of integer types was (1-byte unsigned, 2-, 4- and 8-byte signed), I don't know.

As pointed out above, FITS can effectively encode unsigned types by suitable use of the BZERO/TZEROnnn headers. Introducing an xtype="unsigned" for VOTable would be in the spirit of this convention, as well as of parquet logical types; I think it would be rather a good solution for this problem. I don't believe that anybody else has made this suggestion.
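
For comparison, a sketch of the FITS offset convention, in which the physical value is the stored signed value plus a zero point of 2^63. The helper names are illustrative. Unlike the plain cast discussed earlier in this thread, this mapping flips the top bit, so it preserves numeric ordering but not the raw bit pattern.

```rust
// Zero point for unsigned 64-bit columns: 2^63.
const TZERO: u64 = 1 << 63;

/// physical - TZERO, computed by flipping the top bit to avoid overflow.
fn to_fits_stored(physical: u64) -> i64 {
    (physical ^ TZERO) as i64
}

/// stored + TZERO, again computed as a top-bit flip.
fn from_fits_stored(stored: i64) -> u64 {
    (stored as u64) ^ TZERO
}

fn main() {
    for &v in &[0u64, 1, TZERO, u64::MAX] {
        let stored = to_fits_stored(v);
        println!("{} -> {}", v, stored);
        assert_eq!(from_fits_stored(stored), v);
    }
}
```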

Dealing with unsigned 64-bit integers in Java would be pretty painful. I don't think you can get round this in general using bit shifts given that all the bits may be in use. Although there are ways to deal with arbitrary precision integers in Java, topcat/stilts will probably not be able to do arithmetic on unsigned 64-bit integers. But I doubt that arithmetic is required.

I think the right way to think about these values is as bit patterns, which as Gregory points out can be naturally expressed in VOTable TABLEDATA in hexadecimal, and which fit naturally into BINARY/BINARY2 anyway. Then if an unsigned xtype is present in the VOTable serialization, languages which can cope with unsigned longs can do so, and languages which can't will just do the best they can (represent some of the values as negative). In absence of an unsigned xtype, some of the values will just look like they are negative, which probably doesn't matter too much since the only manipulation of these values is likely to be bitwise.

If you can avoid using all 64 bits, it would probably simplify some use of this data. But if not, most things will probably work, and if we introduce an unsigned xtype, better still.

This may be a short-lived issue, since it relates only to fields for which exactly 64 (or 32 or 16) bits are required. I was talking to people at ROE yesterday about a 22-digit decimal field they wanted to expose using TAP. But I don't suggest that we introduce an int96 type in VOTable any time soon.

@fxpineau

Sorry @mbtaylor, but I still don't get the point about hexadecimal.

It would be useful to directly write, in a TABLEDATA, a string representation of an unsigned integer which is compatible with a signed integer.
But it does not seem to be the right thing to do from my perspective.

For me, we should first convert the votable-not-supported-unsigned-integer into the votable-supported-signed integer (i.e. a cast which is a zero cost operation preserving the bit pattern here) and then write the string representation of the signed integer (in decimal, hexadecimal, octal or binary).
Then, when reading the TABLEDATA into memory, we parse the string representation of the signed-integer (leading to the original bit pattern) and, if we have the xtype=unsigned info and a tool supporting it, we can cast the signed-integer into an unsigned integer (again, the cast is a no-operation and preserves the bit pattern).

There will be no performance difference between the two approaches, since casting is a no-operation here.
And the second approach is more general, since it can also be used for "logical" types that have no string representation compatible with the "stored" type.
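
A sketch of the reader side of this approach. It assumes a hypothetical xtype value `"unsigned"`, which is only a proposal in this thread and not part of any current standard; the `LongValue` enum and `read_long` function are likewise illustrative.

```rust
/// In-memory value of a datatype="long" cell after applying the xtype, if any.
#[derive(Debug, PartialEq)]
enum LongValue {
    Signed(i64),
    Unsigned(u64),
}

/// Parse the stored (signed) type first, then reinterpret if the xtype is understood.
fn read_long(cell: &str, xtype: Option<&str>) -> Result<LongValue, std::num::ParseIntError> {
    let stored: i64 = cell.parse()?;
    Ok(match xtype {
        // Hypothetical xtype value: the cast preserves the bit pattern.
        Some("unsigned") => LongValue::Unsigned(stored as u64),
        // Unknown or absent xtype: keep the signed value as-is.
        _ => LongValue::Signed(stored),
    })
}

fn main() {
    println!("{:?}", read_long("-1", Some("unsigned")));
    println!("{:?}", read_long("-1", None));
}
```

A client that does not recognise the xtype still round-trips the value correctly, since it never alters the stored bits.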

@mbtaylor

@fxpineau you've expressed it more clearly than me - this is exactly what I meant. You're right that the hexadecimal nature of the serialization doesn't really change anything, it just seems a bit less confusing because it doesn't have a (potential) minus sign at the start.

@pdowler

pdowler commented Jan 7, 2025

Interesting. The concept behind xtype is that a reader/client that understands the xtype can interpret the value. A client that does not understand the xtype can still read/write the value (e.g. round-trip correctly).

Something like datatype="long" xtype="unsigned" would make a lot of sense with hex serialization (which I had never noticed before), but I think plain decimal serialization would be subtle and confusing. Since xtype(s) defined in DALI usually also specify the serialization (because most are values with structure), unsigned could be explicitly limited to hex.


The downside to using xtype to signal unsigned is that it does not enable use of unsigned in other (structured) types like datatype="long" xtype="interval", which can only be a signed long interval. I don't expect there are immediate use cases for that, but it seems like a hint that maybe unsigned belongs over in datatype.


On the implementation side, I agree with Mark that for Java it would be easy enough to read/write unsigned long correctly, but doing math would be problematic. I don't think that's a reason to not do something to support uint64.

@fxpineau

fxpineau commented Jan 8, 2025

Hex is only one of the possible ASCII presentations for an integer. But we don't care about the exact ASCII presentation, see section 4.2 of the VOTable document:

The VOTable format is meant for transferring, storing, and processing tabular data, and is not intended for presentation purposes: therefore (in contrast to Astrores) we generally avoid giving rules on presentation, such as formatting.

Having negative decimal values would be confusing if someone tries to get the unsigned value directly from the ASCII representation. But from my current understanding (I am not at all familiar with xtype usage, so I may have a naïve/incomplete view): one first deserializes from XML-ASCII to the in-memory VOTable datatype, and then transforms/re-interprets it into the in-memory xtype datatype. Again: it is very similar to parquet, with VOTable datatype = parquet stored type, and VOTable xtype = parquet logical type.

Having negative values for a datatype=long column is ok.
And if a user expects unsigned values (i.e. no negative values), it means that they know about the xtype mechanism, so they can quite easily spot that negative values = MSB set.
Am I missing something?
