-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
how to handle DALI xtypes #10
Comments
In general: add xtypes to the VOTable where appropriate, in the same way as ucds, units etc. So that covers the
But we could add a sentence in sec 2.3 to make that explicit, e.g.
It would be possible to add text talking about different ways to approach e.g. timestamp serialization (either write as a parquet TIMESTAMP with no xtype or as a string with xtype="timestamp"), but I wouldn't want to advise that one or the other of those is preferred, so I don't see much point. What extra detail in sec 2.1 did you have in mind? |
To the question "how to serialize [VOTable] things like [...] [in Parquet]?", the straightforward answer in this context is, from my current understanding: such matter is so far out of the scope of the document. @pdowler, although it is also out of the scope here, following this issue about serializing Parquet 64-bit unsigned integer into VOTable, what's your opinion about allowing "xtype=unsigned" in VOTable (for short/int/long)? How/where to do it? Vocabulary? |
We are coming at this from the point of view of "how to serialize a TAP query result in voparquet" so that it is equivalent to the VOTable output, so I think "how to serialize" is relevant. As noted above, there are two ways to serialize some kinds of values (uuid, timestamp, maybe others) so unless we recommend something we're likely to get a mix of usage and maybe strange behaviours. My opinion: we should use the parquet logical types where applicable (well, we have to because STRING is a logical type) because that is what people using parquet right already will have done :-) For our TAP implementation specifically, I do not want to put wrong metadata in the empty VOTable and ideally the VOTable metadata would be the same for both real VOTable and voparquet output. That doesn't have to be required by the doc. I agree with the sentiment/practicality of the parquet datatypes are authoritative , but it should be possible for someone who cares about such details to get them all right/consistent and it currently is not possible. I don't know if it can be made possible. So, do we explore "making consistency possible"? There could be consequences we end up not liking... |
People will come at this from different angles. Some are primarily concerned with writing parquet files but would like to embed some extra metadata as a bonus, while others like you are primarily writing VOTables but would like to offer a parquet-format alternative. Taking timestamps as an example, in the former case it would make sense to use the parquet TIMESTAMP logical type on an INT64 column with no indication in the VOTable metadata that it's a timestamp, while in the latter case you'd write a parquet STRING with I agree that the VOParquet document should not encourage strange behaviours, but IMO it's not a problem (at least, not one to be solved here) if we get a mix of usage. I don't know if I understand what you're getting at by "consistency" here. There's no solution that guarantees avoidance of wrong metadata in the empty VOTable in all cases, because there are certainly parquet columns that can't be represented in VOTable (the current text advises to pretend these are strings in the empty VOTable). In many cases the data models will match up and there's no problem. In others there are mismatches and it's down to the VOParquet producer whether they want to go with a more parquet-like or more VOTable-like solution. |
For "consistency" I would just like it to be possible to supply consistent parquet and votable metadata about types. I also want it to be possible to create the best possible parquet files. As is, providers have to chose one of:
The catch is that at some point we'll also want to attach additional metadata to that timestamp column and I expect that starts to look funny if it doesn't look like a timestamp value in the votable. Anyway, for now we will likely implement the last option (best parquet) for timestamp and uuid. |
VOTable is supposed to be a transfer format, not a display format, see section 4.2 of the VOTable standard:
So I would strongly favor allowing for That said, I do think (like Mark, if I understand correctly) that any option with VOTable Edit: And, if the VOTable to Parquet converting tool is not |
My intention was that the column descriptions in the data-less VOTable must (or at least should) be consistent, where possible, with the parquet data. The recommendation to pretend that columns are VOTable strings when they are not is only supposed to apply when no VOTable This is not absolutely explicit in the current text, but it does say
Do people disagree with this intention? It doesn't make a huge amount of difference since readers are allowed to disregard VOTable metadata, but if such type mismatches are supposed to have some semantic value that should be spelled out. Should the text be more explicit that best-efforts correspondence of VOTable datatype and parquet type is required? It would be possible to update DALI so that the timestamp xtype can be used for |
I don't know why I've only just realised, but for TIMESTAMP there is a much better way to go about this: use a VOTable TIMESYS element, which is meant for exactly this purpose, attaching temporal semantics (including those that are glossed over by parquet TIMESTAMP) to a numeric column. Then you can write your parquet column as e.g. physical/logical type
|
I don't want this discussion to postpone too much publication of the Note, since we agreed that the priority is to get something out there ASAP, even if it requires some revision or refinement at a later date. So I suggest that either we leave this discussion where it is and move on with publication, or I draft some additional text covering these points and we review that in the next few days. If I draft an amendment I think it would look like:
Opinions? |
Here is the scenario I am thinking through... In a TAP service, the tap_schema has a column with I previously suggested possibly having two options for timestamps allowed by DALI, but that doesn't really help because the tap_schema is only able to provide one of them and the TAP service is going to pick that one for normal VOTable output and the other for voparquet output. It is true that the caller kind of requested that alternate serialization by specifying "parquet" output, or at least allowing for that option. There is a very general expectation that VOTable metadata (eg the MAXREC=0 usage) is a legit way to look at column metadata in a service and figure out how to formulate queries. In TAP there are 3-4 other ways but in a lot of services there are only 2 ways (they can all have empty VOTable and tableset in the registry)... in voparquet we have choices that can make the empty VOTable describe the data and differ slightly from the source of the data, but there is wide usage of empty VOTable where that may be surprising. Is that worth mentioning/highlighting? I don't think a new section 2.4 would do much. People that use local timezone and know about TIMESYS will already know they can do that, so mentioning it again here isn't that useful. Parquet timestamp type has a UTC flag and most use (certainly all of ours) would just set that and be done. So, probably no need to to anything else for the first published version of the note unless you agree on my "empty VOTable" qualms above. |
I wouldn't expect an empty VOTable to be reliable about metadata for non-VOTable output formats, so IMO that's not a major concern, though admittedly the empty-VOTable trick is not one that I'm accustomed to use. The (main) point of using TIMESYS for annotating a VOTable long-valued timestamp field is not about timezones, it's about defining the zero-point ( Possibly I'll add a general comment about xtype usage then, but unless anybody else says different it doesn't seem like significant changes to the text are indicated by this discussion. |
agreed, good for now |
I don't see any mention of xtypes in the document and I'm curious about how to serialize things like
datatype="char" arraysize="*" xtype="timestamp"
ordatatype="double" arraysize="2" xtype="point"
.The obvious choice appears to be to format the date value as specified in DALI (iso8601) and stick it into a string column in the parquet file, but parquet has a TIMESTAMP logical type that is an int64 so there's a mismatch between VOTable and parquet serialization... so if we set the VOTable metadata in DALI-style we are requiring correct implementation of the parquet datatype is definitive logic because it is not possible to make the VOTable metadata correct. We could say
datatype="long" xtype="timestamp"
so the VOTable metadata can be made to match, but then we'd likely need to add that option to DALI (should be OK).There is a similar mismatch with draft DALI xtype="uuid" (char) and parquet UUID (byte array).
For
point
it seems a little easier, just parquet type array of double and rely on the VOTable metadata to infer "point".Does section 2.1, specifically "serialize the input table to parquet in the usual way" need a little more detail and maybe DALI need some options to play nice with parquet?
The text was updated successfully, but these errors were encountered: