Use of "uint64" for the "hipscat ID" and the lack of 64-bit unsigned types in IVOA standards #317
Comments
I have had a relevant realization: VOTable 1.4, Section 6 states:
I'd forgotten that hexadecimal representations were an option. This means that a conforming service can, at least, transmit the bit pattern of an unsigned 64-bit integer without any complication, either in
I think we can get away with the first option, to simply reduce the counter space. However, this would make all existing hipscat datasets incompatible. It might make the most sense to hold off on this until we perform a whole-scale renaming, so that the
I do not see any problem keeping

I don't know why VOTable does not support unsigned integers (except the

One way to support those types may be to rely on the

I tend to think that the tool converting from Parquet to VOTable will have the responsibility to cast unsigned integers into signed integers. It is a no-operation: the bit pattern is preserved. The difference will be that integers with the MSB set will be represented as negative numbers once converted into a String. The reverse operation (parsing a negative value into an integer and then casting it into an unsigned integer) will give back the original bit pattern.

In the case of

```rust
fn main() {
    // Same 8-bit pattern, printed from a signed and an unsigned type.
    println!("{:b}", -128_i8);
    println!("{:b}", 128_u8);
    // Right shift: arithmetic (sign-extending) on i8, logical on u8.
    println!("{:b}", -128_i8 >> 1);
    println!("{:b}", 128_u8 >> 1);
}
```

the output is:

```
10000000
10000000
11000000
1000000
```

To perform operations to retrieve the HEALPix index, we have to work on the unsigned type (in case the MSB is set).
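The same shift behaviour applies to 64-bit values. The following is only a sketch: the 22-bit shift assumes the layout mentioned in the issue description (HEALPix index in the high-order 42 bits), and `COUNTER_BITS` and `high_bits` are hypothetical names, not part of hipscat.

```rust
/// Assumed number of low-order counter bits (64 - 42); taken from the issue
/// description, not from the hipscat code.
const COUNTER_BITS: u32 = 22;

/// Hypothetical helper: recover the high-order field from a value that was
/// transported as a signed 64-bit integer.
fn high_bits(transported: i64) -> u64 {
    // Reinterpret the bits as unsigned first, so that `>>` is a logical shift.
    (transported as u64) >> COUNTER_BITS
}

fn main() {
    let original: u64 = 0xBFFF_FFFF_FFFF_FFFF;
    let transported = original as i64; // negative, same bit pattern

    // Shifting the signed value directly sign-extends and corrupts the field.
    let sign_extended = (transported >> COUNTER_BITS) as u64;

    println!("{:#x}", high_bits(transported));
    println!("{:#x}", sign_extended);
}
```

Casting back to `u64` before shifting is free at run time, so a reader that receives negative values loses nothing as long as it reinterprets them before any bitwise work.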
The historical reason for FITS and hence VOTable supporting only signed integers is probably inherited from FORTRAN 77, which has no unsigned integer types. What the thinking behind the VOTable roster of integer types was (1-byte unsigned, 2-, 4- and 8-byte signed), I don't know. As pointed out above, FITS can effectively encode unsigned types by suitable use of the BZERO/TZEROnnn headers.

Introducing an

Dealing with unsigned 64-bit integers in Java would be pretty painful. I don't think you can get round this in general using bit shifts, given that all the bits may be in use. Although there are ways to deal with arbitrary precision integers in Java, topcat/stilts will probably not be able to do arithmetic on unsigned 64-bit integers.

But I doubt if arithmetic is required. I think the right way to think about these values is as bit patterns, which as Gregory points out can be naturally expressed in VOTable TABLEDATA in hexadecimal, and which fit naturally into BINARY/BINARY2 anyway. Then if an unsigned xtype is present in the VOTable serialization, languages which can cope with unsigned longs can do so, and languages which can't will just do the best they can (represent some of the values as negative). In the absence of an unsigned xtype, some of the values will just look like they are negative, which probably doesn't matter too much since the only manipulation of these values is likely to be bitwise.

If you can avoid using all 64 bits, it would probably simplify some use of this data. But if not, most things will probably work, and if we introduce an unsigned xtype, better still.

This may be a short-lived issue, since it relates only to fields for which exactly 64 (or 32 or 16) bits are required. I was talking to people at ROE yesterday about a 22-digit decimal field they wanted to expose using TAP. But I don't suggest that we introduce an int96 type in VOTable any time soon.
Sorry @mbtaylor, but I still don't get the point about hexadecimal. It would be useful to directly write, in a TABLEDATA, a string representation of an unsigned integer which is compatible with a signed integer. For me, we should first convert the unsigned integer that VOTable does not support into the signed integer that it does support (i.e. a cast, which is a zero-cost operation preserving the bit pattern here) and then write the string representation of the signed integer (in decimal, hexadecimal, octal or binary). There will be no performance difference between the two approaches, since casting is a no-operation here.
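The cast-then-serialize approach described above can be sketched as follows. The function names `to_signed_text` and `from_signed_text` are hypothetical, chosen only for illustration; the casts and parsing are plain Rust standard library behaviour.

```rust
use std::num::ParseIntError;

/// Writer side: reinterpret the unsigned value as signed (no bits change),
/// then emit the usual decimal text of the signed value.
fn to_signed_text(id: u64) -> String {
    (id as i64).to_string()
}

/// Reader side: parse the signed decimal text, then reinterpret as unsigned.
fn from_signed_text(text: &str) -> Result<u64, ParseIntError> {
    text.parse::<i64>().map(|v| v as u64)
}

fn main() {
    let id: u64 = 0xBFFF_FFFF_FFFF_FFFF;
    let text = to_signed_text(id);
    println!("{}", text); // a negative decimal, because the MSB is set

    let back = from_signed_text(&text).expect("valid signed decimal");
    assert_eq!(back, id); // the original bit pattern is recovered
}
```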
@fxpineau you've expressed it more clearly than me - this is exactly what I meant. You're right that the hexadecimal nature of the serialization doesn't really change anything, it just seems a bit less confusing because it doesn't have a (potential) minus sign at the start.
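To illustrate why the hexadecimal form avoids the minus sign, here is a small sketch. It assumes a `0x`-prefixed hexadecimal text form; `to_hex_text` and `from_hex_text` are hypothetical helper names.

```rust
/// Format the bit pattern as `0x`-prefixed hexadecimal text.
fn to_hex_text(id: u64) -> String {
    format!("{:#x}", id)
}

/// Parse `0x`-prefixed hexadecimal text back into the 64-bit pattern.
/// Returns `None` if the prefix is missing or the digits are invalid.
fn from_hex_text(text: &str) -> Option<u64> {
    let digits = text.strip_prefix("0x")?;
    u64::from_str_radix(digits, 16).ok()
}

fn main() {
    let id: u64 = 0xBFFF_FFFF_FFFF_FFFF;

    // Rust prints signed integers in hexadecimal as their two's-complement
    // bit pattern, so both views of the value give the same text.
    let unsigned_view = to_hex_text(id);
    let signed_view = format!("{:#x}", id as i64);
    assert_eq!(unsigned_view, signed_view);

    println!("{}", unsigned_view);
    assert_eq!(from_hex_text(&unsigned_view), Some(id));
}
```

The text is identical whether the writer holds the value as signed or unsigned, which is why the choice of hexadecimal changes the appearance but not the underlying cast.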
Interesting. The concept behind xtype is that a reader/client that understands the xtype can interpret the value. A client that does not understand the xtype can still read/write the value (e.g. round-trip correctly). something like The downside to using xtype to signal unsigned is that it does not enable use of unsigned in other (structured) types like On the implementation side, I agree with Mark that for Java it would be easy enough to read/write unsigned long correctly, but doing math would be problematic. I don't think that's a reason to not do something to support uint64. |
Hex is only one of the possible ASCII presentations for an integer. But we don't care about the exact ASCII presentation, see section 4.2 of the VOTable document:
Having negative decimal values would be confusing if someone tries to get the unsigned value directly from the ASCII representation. But from my current understanding (I am not familiar at all with xtype usage, so I may have a naïve or incomplete view): one first deserializes from XML-ASCII to the in-memory VOTable datatype, and then transforms/re-interprets it into the in-memory xtype datatype. Again: it is very similar to Parquet, with the VOTable datatype corresponding to the Parquet stored type and the VOTable xtype corresponding to the Parquet logical type. Having negative values for a
I have been experimenting with adapting "hipscat" datasets (concretely, https://data.lsdb.io/unstable/gaia_dr3/gaia/ , as a sample) to use in IVOA contexts.
I've run across an awkward issue: it appears that there are two "internal" columns in the dataset, `Dir` and `Npix`, which are declared `uint64[pyarrow]`. This is a perfectly reasonable choice, except for an awkward feature of the type system of existing astronomy standards:

The VOTable type primitives do not include an unsigned 64-bit integer (neither does FITS, though in FITS there's an ugly workaround using `BZERO` or `TZEROn`).

This means that it's unsafe to simply query a "hipscat" dataset and return a result including either of those two columns through VOTable, i.e., through any current IVOA protocol, if the high-order bit is set.
If the `Npix` column indeed contains IDs in your (HEALPix_id, counter) scheme, in which the HEALPixel ID is in the high-order 42 bits of the word, it will indeed be the case that the high-order (sign) bit will be set frequently, perhaps for 1/3 of the space of IDs (the largest ID possible appears to be 0xBFFF FFFF FFFF FFFF, if the counter is maxed out).

As a member of the team for Rubin, which is expecting to require 64-bit IDs and is likely to eventually use every single bit, I find the inability to easily transport unsigned 64-bit values in the standards undesirable, but it is what it is for now.
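The "1/3 of the space of IDs" estimate can be checked with a little arithmetic. This sketch assumes only the maximum ID quoted above and that IDs start at zero; the names `MAX_ID`, `sign_bit_set`, `id_count` and `negative_id_count` are hypothetical.

```rust
/// Largest ID quoted in the issue text.
const MAX_ID: u64 = 0xBFFF_FFFF_FFFF_FFFF;

/// True when the value would read as negative in a signed 64-bit column.
fn sign_bit_set(id: u64) -> bool {
    id >> 63 == 1
}

/// Number of IDs in 0..=MAX_ID (u128 avoids overflow in the arithmetic).
fn id_count() -> u128 {
    MAX_ID as u128 + 1
}

/// Number of those IDs at or above 2^63, i.e. with the sign bit set.
fn negative_id_count() -> u128 {
    id_count() - (1u128 << 63)
}

fn main() {
    assert!(sign_bit_set(MAX_ID));
    assert!(!sign_bit_set(0x7FFF_FFFF_FFFF_FFFF));
    // Exactly one third of the ID space has the sign bit set.
    assert_eq!(negative_id_count() * 3, id_count());
    println!("{} of {}", negative_id_count(), id_count());
}
```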
If your ID scheme is non-negotiable, I think we will need to specify a standard interpretation for how "hipscat" IDs are expected to be represented in schemas in which only signed 64-bit integers are available.
I am assuming that you will find neither of the following alternatives acceptable? Both would permanently avoid the issue.
Note that the sole `uint8` column, `Norder`, is not a problem. VOTable and FITS both have an unsigned byte type: `unsignedByte` and "`B`" (`BITPIX=8`), respectively.