Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

how to handle DALI xtypes #10

Open
pdowler opened this issue Jan 6, 2025 · 12 comments
Open

how to handle DALI xtypes #10

pdowler opened this issue Jan 6, 2025 · 12 comments

Comments

@pdowler
Copy link
Member

pdowler commented Jan 6, 2025

I don't see any mention of xtypes in the document and I'm curious about how to serialize things like
datatype="char" arraysize="*" xtype="timestamp" or datatype="double" arraysize="2" xtype="point".

The obvious choice appears to be to format the date value as specified in DALI (iso8601) and stick it into a string column in the parquet file, but parquet has a TIMESTAMP logical type that is an int64 so there's a mismatch between VOTable and parquet serialization... so if we set the VOTable metadata in DALI-style we are requiring correct implementation of the parquet datatype is definitive logic because it is not possible to make the VOTable metadata correct. We could say datatype="long" xtype="timestamp" so the VOTable metadata can be made to match, but then we'd likely need to add that option to DALI (should be OK).

There is a similar mismatch with draft DALI xtype="uuid" (char) and parquet UUID (byte array).

For point it seems a little easier, just parquet type array of double and rely on the VOTable metadata to infer "point".


Does section 2.1, specifically "serialize the input table to parquet in the usual way" need a little more detail and maybe DALI need some options to play nice with parquet?

@mbtaylor
Copy link
Contributor

mbtaylor commented Jan 7, 2025

In general: add xtypes to the VOTable where appropriate, in the same way as ucds, units etc. So that covers the xtype="point" case. For TIMESTAMP and UUID parquet logical datatypes I agree they cannot be marked up with xtypes according to their current DALI definitions, and modifying DALI to accommodate parquet-friendly serializations could be a good idea. But unless/until DALI gets such modifications, reasonable client behaviour would be to spot that these columns are timestamps/UUIDs based on the parquet logical data types and treat them as such, even though the VOTable does not so label them. I'd say that the existing text covers this:

Readers must treat the parquet data and datatypes as authoritative, and may make use of as much VOTable column metadata as is feasible

But we could add a sentence in sec 2.3 to make that explicit, e.g.

xtype attributes can and should be included in VOTable FIELD elements where appropriate, but only where consistent with the declared VOTable datatypes.

It would be possible to add text talking about different ways to approach e.g. timestamp serialization (either write as a parquet TIMESTAMP with no xtype or as a string with xtype="timestamp"), but I wouldn't want to advise that one or the other of those is preferred, so I don't see much point. What extra detail in sec 2.1 did you have in mind?

@fxpineau
Copy link

fxpineau commented Jan 7, 2025

To the question "how to serialize [VOTable] things like [...] [in Parquet]?", the straightforward answer in this context is, from my current understanding: such matter is so far out of the scope of the document.
It is stated in the comments of this PR: the document is about "adding VO semantic metadata to a parquet file" and covers neither "how to serialize a VOTable in (VO)Parquet" nor "how to serialize a VOParquet file in VOTable".

@pdowler, although it is also out of the scope here, following this issue about serializing Parquet 64-bit unsigned integer into VOTable, what's your opinion about allowing "xtype=unsigned" in VOTable (for short/int/long)? How/where to do it? Vocabulary?

@pdowler
Copy link
Member Author

pdowler commented Jan 7, 2025

We are coming at this from the point of view of "how to serialize a TAP query result in voparquet" so that it is equivalent to the VOTable output, so I think "how to serialize" is relevant. As noted above, there are two ways to serialize some kinds of values (uuid, timestamp, maybe others) so unless we recommend something we're likely to get a mix of usage and maybe strange behaviours.

My opinion: we should use the parquet logical types where applicable (well, we have to because STRING is a logical type) because that is what people using parquet right already will have done :-)

For our TAP implementation specifically, I do not want to put wrong metadata in the empty VOTable and ideally the VOTable metadata would be the same for both real VOTable and voparquet output. That doesn't have to be required by the doc. I agree with the sentiment/practicality of the parquet datatypes are authoritative , but it should be possible for someone who cares about such details to get them all right/consistent and it currently is not possible. I don't know if it can be made possible.

So, do we explore "making consistency possible"? There could be consequences we end up not liking...

@mbtaylor
Copy link
Contributor

mbtaylor commented Jan 8, 2025

People will come at this from different angles. Some are primarily concerned with writing parquet files but would like to embed some extra metadata as a bonus, while others like you are primarily writing VOTables but would like to offer a parquet-format alternative. Taking timestamps as an example, in the former case it would make sense to use the parquet TIMESTAMP logical type on an INT64 column with no indication in the VOTable metadata that it's a timestamp, while in the latter case you'd write a parquet STRING with xtype="timestamp" in the VOTable. I'd say either of those is OK, and either way a VOParquet reader has enough information to work out (though it's not guaranteed to notice) that the column represents a timestamp.

I agree that the VOParquet document should not encourage strange behaviours, but IMO it's not a problem (at least, not one to be solved here) if we get a mix of usage.

I don't know if I understand what you're getting at by "consistency" here. There's no solution that guarantees avoidance of wrong metadata in the empty VOTable in all cases, because there are certainly parquet columns that can't be represented in VOTable (the current text advises to pretend these are strings in the empty VOTable). In many cases the data models will match up and there's no problem. In others there are mismatches and it's down to the VOParquet producer whether they want to go with a more parquet-like or more VOTable-like solution.

@pdowler
Copy link
Member Author

pdowler commented Jan 8, 2025

For "consistency" I would just like it to be possible to supply consistent parquet and votable metadata about types. I also want it to be possible to create the best possible parquet files. As is, providers have to chose one of:

  • votable datatype="char" arraysize="*" xtype="timestamp" and parquet STRING column - substandard parquet
  • votable datatype="char" arraysize="*" xtype="timestamp" and parquet TIMESTAMP column - best parquet, bad votable meta that clients must ignore, so saying xtype="timestamp" here is essentially useless
  • votable dataype="long" and parquet TIMESTAMP column - best parquet, substandard votable meta

The catch is that at some point we'll also want to attach additional metadata to that timestamp column and I expect that starts to look funny if it doesn't look like a timestamp value in the votable.

Anyway, for now we will likely implement the last option (best parquet) for timestamp and uuid.

@fxpineau
Copy link

fxpineau commented Jan 8, 2025

VOTable is supposed to be a transfer format, not a display format, see section 4.2 of the VOTable standard:

The VOTable format is meant for transferring, storing, and processing tabular data, and is not intended for presentation purposes: therefore (in contrast to Astrores) we generally avoid giving rules on presentation, such as formatting.

So I would strongly favor allowing for datatype="long" xtype="timestamp" in VOTable (i.e. to store unix epoch values).

That said, I do think (like Mark, if I understand correctly) that any option with VOTable xtype="timestamp" and Parquet TIMESTAMP is ok (other cases are sub-optimal).
The important information is that the targeted in-memory type is a timestamp object.
Then, VOTable datatype="char" arraysize="*" or datatype="long" is only about the timestamp representation in the VOTable, and it does not have to be the same as the internal Parquet representation.
The choice will therefore depend on how you would like the column to appear when converting back from VOParquet to VOTable.

Edit: And, if the VOTable to Parquet converting tool is not timestamp aware, it will use the VOTable datatype and convert it into a Parquet String (if datatype="char" arraysize="*" ) or a Parquet Int64 (if datatype="long" ) and let the xtype="timestamp" in the metadata. When reading such a VOParquet file, a timestamp aware tool will be able to handle in-memory a timestamp column either from a Parquet String column or from a Parquet Int64 column (or, obvioulsy, from a Parquet timestamp column if the VOT-->Parquet conversion tool was also timestamp aware).
The only potential "problem" I foresee is a timestamp unaware conversion tool dropping a Parquet TIMESTAMP column when converting to VOTable (but, instead of dropping, the tool may choose to provide the string representation of the column...).

@mbtaylor
Copy link
Contributor

mbtaylor commented Jan 9, 2025

My intention was that the column descriptions in the data-less VOTable must (or at least should) be consistent, where possible, with the parquet data. The recommendation to pretend that columns are VOTable strings when they are not is only supposed to apply when no VOTable datatype is applicable. So it would not be correct to annotate a parquet int64/TIMESTAMP column with datatype="char" arraysize="*" (with or without xtype="timestamp"); it has to be datatype="long".

This is not absolutely explicit in the current text, but it does say

Columns containing scalars and 1-d arrays of booleans, unsigned 8-bit integers, signed 16/32/64-bit integers and strings map straightforwardly to VOTable, but other types such as unsigned 64-bit integers and structured data do not.

Do people disagree with this intention? It doesn't make a huge amount of difference since readers are allowed to disregard VOTable metadata, but if such type mismatches are supposed to have some semantic value that should be spelled out. Should the text be more explicit that best-efforts correspondence of VOTable datatype and parquet type is required?

It would be possible to update DALI so that the timestamp xtype can be used for long values as well as string values, but there is a problem with that: you'd have to decide on particular semantics, which could only correspond to (at most) one of the unit choices (MILLIS, MICROS or NANOS) for the parquet TIMESTAMP logical type.

@mbtaylor
Copy link
Contributor

mbtaylor commented Jan 9, 2025

I don't know why I've only just realised, but for TIMESTAMP there is a much better way to go about this: use a VOTable TIMESYS element, which is meant for exactly this purpose, attaching temporal semantics (including those that are glossed over by parquet TIMESTAMP) to a numeric column.

Then you can write your parquet column as e.g. physical/logical type int64/TIMESTAMP(MILLIS,true), and your VOTable something like:

   <RESOURCE>
     <TIMESYS timeorigin="2440587.5" timescale="UTC" refposition="BARYCENTER" ID="t0"/>
     <TABLE>
       <FIELD datatype="long" name="time" ucd="time.epoch" unit="ms" ref="t0"/>
       ...
     </TABLE>
   </RESOURCE>

@mbtaylor
Copy link
Contributor

I don't want this discussion to postpone too much publication of the Note, since we agreed that the priority is to get something out there ASAP, even if it requires some revision or refinement at a later date.

So I suggest that either we leave this discussion where it is and move on with publication, or I draft some additional text covering these points and we review that in the next few days.

If I draft an amendment I think it would look like:

  • a note in sec 2.3 as per my comment above that xtypes are permitted if used in a way that's consistent with the rest of the VOTable metadata

  • a new section 2.4 that gives more specific advice for annotating some particular parquet column types. I think that would mainly apply to TIMESTAMP and DATE logical types (advise to use TIMESYS), but there may be some comments to make about other types as well.

Opinions?

@pdowler
Copy link
Member Author

pdowler commented Jan 13, 2025

Here is the scenario I am thinking through...

In a TAP service, the tap_schema has a column with datatype="char" arraysize="23*" xtype="timestamp" which is the normal database timestamp column that you query and get values in the DALI specified iso8601 form. If the caller asks for VOTable output, the metadata in the result matches the expectations set by tap_schema. If the caller asks for parquet output, then they get a column using parquet logical type timestamp and the embedded VOTable metadata should say datatype="long" ... that doesn't match what the TAP service said that column would be but I think that implementation is better than the alternatives. Also, since parquet timestamp has a boolean flag for UTC (vs local), I think most cases can be covered by the basic usage.

I previously suggested possibly having two options for timestamps allowed by DALI, but that doesn't really help because the tap_schema is only able to provide one of them and the TAP service is going to pick that one for normal VOTable output and the other for voparquet output.

It is true that the caller kind of requested that alternate serialization by specifying "parquet" output, or at least allowing for that option.

There is a very general expectation that VOTable metadata (eg the MAXREC=0 usage) is a legit way to look at column metadata in a service and figure out how to formulate queries. In TAP there are 3-4 other ways but in a lot of services there are only 2 ways (they can all have empty VOTable and tableset in the registry)... in voparquet we have choices that can make the empty VOTable describe the data and differ slightly from the source of the data, but there is wide usage of empty VOTable where that may be surprising. Is that worth mentioning/highlighting?


I don't think a new section 2.4 would do much. People that use local timezone and know about TIMESYS will already know they can do that, so mentioning it again here isn't that useful. Parquet timestamp type has a UTC flag and most use (certainly all of ours) would just set that and be done.

So, probably no need to to anything else for the first published version of the note unless you agree on my "empty VOTable" qualms above.

@mbtaylor
Copy link
Contributor

I wouldn't expect an empty VOTable to be reliable about metadata for non-VOTable output formats, so IMO that's not a major concern, though admittedly the empty-VOTable trick is not one that I'm accustomed to use.

The (main) point of using TIMESYS for annotating a VOTable long-valued timestamp field is not about timezones, it's about defining the zero-point (timeorigin attribute), which is not otherwise possible to infer from the VOTable markup. But if you already know you've got a parquet TIMESTAMP then you don't need VOTable for that information.

Possibly I'll add a general comment about xtype usage then, but unless anybody else says different it doesn't seem like significant changes to the text are indicated by this discussion.

@pdowler
Copy link
Member Author

pdowler commented Jan 13, 2025

agreed, good for now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants