Problem when validating xsd:float #140

Closed
tobiasschweizer opened this issue Apr 5, 2022 · 13 comments

@tobiasschweizer

Hi there,

Validating an xsd:float gives me an unexpected validation report.
I am using "PySHACL Version: 0.19.0".

Example:

shapes graph "shapes.json":

{
  "@context": {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "prov": "http://www.w3.org/ns/prov#",
    "dcat": "http://www.w3.org/ns/dcat#",
    "sh": "http://www.w3.org/ns/shacl#",
    "shsh": "http://www.w3.org/ns/shacl-shacl#",
    "dcterms": "http://purl.org/dc/terms/",
    "schema": "http://schema.org/",
    "rescs": "http://rescs.org/"
  },
  "@graph": [
    {
      "@id": "rescs:dash/monetaryamount/MonetaryAmountShape",
      "@type": "sh:NodeShape",
      "rdfs:comment": {
        "@type": "xsd:string",
        "@value": "A monetary value or range. This type can be used to describe an amount of money such as $50 USD, or a range as in describing a bank account being suitable for a balance between £1,000 and £1,000,000 GBP, or the value of a salary, etc. It is recommended to use [[PriceSpecification]] Types to describe the price of an Offer, Invoice, etc."
      },
      "rdfs:label": {
        "@type": "xsd:string",
        "@value": "Monetary amount"
      },
      "sh:property": {
        "sh:datatype": {
          "@id": "xsd:float"
        },
        "sh:description": "The value of the quantitative value or property value node.\\\\n\\\\n* For [[QuantitativeValue]] and [[MonetaryAmount]], the recommended type for values is 'Number'.\\\\n* For [[PropertyValue]], it can be 'Text;', 'Number', 'Boolean', or 'StructuredValue'.\\\\n* Use values from 0123456789 (Unicode 'DIGIT ZERO' (U+0030) to 'DIGIT NINE' (U+0039)) rather than superficially similiar Unicode symbols.\\\\n* Use '.' (Unicode 'FULL STOP' (U+002E)) rather than ',' to indicate a decimal point. Avoid using these symbols as a readability separator.",
        "sh:maxCount": {
          "@type": "xsd:integer",
          "@value": 1
        },
        "sh:minCount": {
          "@type": "xsd:integer",
          "@value": 1
        },
        "sh:minExclusive": 0,
        "sh:name": "value",
        "sh:path": {
          "@id": "schema:value"
        }
      },
      "sh:targetClass": {
        "@id": "schema:MonetaryAmount"
      }
    }
  ]
}

data sample "monetaryamount.json":

{
  "@context": {
    "@vocab": "http://schema.org/",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },
  "@type": "MonetaryAmount",
  "value": {
    "@type": "xsd:float",
    "@value": 100000
  }
}

pyshacl -sf json-ld -s shapes.json -df json-ld monetaryamount.json gives me:

Validation Report
Conforms: False
Results (1):
Constraint Violation in DatatypeConstraintComponent (http://www.w3.org/ns/shacl#DatatypeConstraintComponent):
Severity: sh:Violation
Source Shape: [ sh:datatype xsd:float ; sh:description Literal("The value of the quantitative value or property value node.\n\n* For [[QuantitativeValue]] and [[MonetaryAmount]], the recommended type for values is 'Number'.\n* For [[PropertyValue]], it can be 'Text;', 'Number', 'Boolean', or 'StructuredValue'.\n* Use values from 0123456789 (Unicode 'DIGIT ZERO' (U+0030) to 'DIGIT NINE' (U+0039)) rather than superficially similiar Unicode symbols.\n* Use '.' (Unicode 'FULL STOP' (U+002E)) rather than ',' to indicate a decimal point. Avoid using these symbols as a readability separator.") ; sh:maxCount Literal("1", datatype=xsd:integer) ; sh:minCount Literal("1", datatype=xsd:integer) ; sh:minExclusive Literal("0", datatype=xsd:integer) ; sh:name Literal("value") ; sh:path schema1:value ]
Focus Node: [ :value Literal("100000", datatype=xsd:float) ; rdf:type :MonetaryAmount ]
Value Node: Literal("100000", datatype=xsd:float)
Result Path: schema1:value
Message: Value is not Literal with datatype xsd:float

Changing the @value to 100000.0 or "100000" makes it pass. However, I think all three variants should be valid, no?
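The three variants already differ at the JSON level, before any RDF processing happens. A minimal sketch using only Python's standard `json` module, on the assumption (not confirmed in this thread) that the JSON-LD parser decodes the document the same way:

```python
import json

# The three ways of writing the amount that were tried in this report.
variants = ['100000', '100000.0', '"100000"']

# Decode each JSON fragment and record the Python type it maps to.
decoded_types = {text: type(json.loads(text)).__name__ for text in variants}

for text, type_name in decoded_types.items():
    print(f"{text:>10} -> {type_name}")
```

Only the first variant arrives as a Python `int`, which matches the one case that fails validation.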

I tried the example above on https://shacl.org/playground/ which worked fine.

Could you tell me whether I am doing something wrong or this is a bug?

Thanks a lot!

@tobiasschweizer
Author

tobiasschweizer commented Apr 5, 2022

For example, -1E4, 1267.43233E12, 12.78e-2, 12 , -0, 0 and INF are all legal literals for float.

https://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#float

My assumption is that 100000 is implicitly 100000.0 when typed as xsd:float.
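As a rough check of that assumption at the lexical level, here is a sketch of a pattern for xsd:float lexical forms. The regular expression is a simplified approximation written for this illustration, and `is_float_lexical` is a hypothetical helper; neither is taken from PySHACL or RDFLib.

```python
import re

# Simplified pattern for the xsd:float lexical space (an approximation):
# a decimal mantissa with an optional exponent, or INF, -INF, NaN.
XSD_FLOAT_LEXICAL = re.compile(
    r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?|-?INF|NaN"
)


def is_float_lexical(text: str) -> bool:
    """Return True if text, after trimming whitespace, looks like an xsd:float."""
    return XSD_FLOAT_LEXICAL.fullmatch(text.strip()) is not None


# The examples listed in the XML Schema specification quoted above.
spec_examples = ["-1E4", "1267.43233E12", "12.78e-2", "12 ", "-0", "0", "INF"]

for example in spec_examples + ["100000", "cat"]:
    print(repr(example), is_float_lexical(example))
```

Under this approximation the string "100000" is a legal float lexical form, so the failure is unlikely to come from the lexical space itself.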

Could it be that 100000 is actually represented as an int in Python?

elif datatype_rule == XSD_float:
    return isinstance(value, float)
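The question can be checked in plain Python, independently of PySHACL. The variable names below are illustrative only:

```python
# In Python, int is not a subclass of float, so an integer value fails
# an isinstance(..., float) test even when it is numerically equal.
int_value = 100000
float_value = 100000.0

int_passes = isinstance(int_value, float)
float_passes = isinstance(float_value, float)
values_equal = int_value == float_value

print(int_passes, float_passes, values_equal)
```

So if the literal's Python value is an `int`, a check of this shape rejects it regardless of the declared datatype.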

@ashleysommer
Collaborator

ashleysommer commented Apr 6, 2022

Hi @tobiasschweizer
Thanks for the bug report. I think this is a bug in the RDFLib JSON-LD parser. Is it possible for you to test the same example but encoded in Turtle format, to see if the issue remains?

@tobiasschweizer
Author

Sure, I will try this and come back to you asap.

@tobiasschweizer
Author

tobiasschweizer commented Apr 6, 2022

I tried the following which worked fine:

"monetaryamount.ttl"

<http://www.example.com/1> a <http://schema.org/MonetaryAmount> ;
  <http://schema.org/value> "100000"^^<http://www.w3.org/2001/XMLSchema#float> .

pyshacl -sf json-ld -s shapes.json -df turtle monetaryamount.ttl
Validation Report
Conforms: True

@ashleysommer
Collaborator

Ok, great. Thanks, that suggests the bug lies in the JSON-LD parser. I'll create a corresponding bug in the RDFLib bug tracker.

@tobiasschweizer
Author

Ok, thanks. Let me know if I can be of further assistance to substantiate the report.

@ashleysommer
Collaborator

Hi @tobiasschweizer
I finally got a chance to do some testing on this.
A simple test:

    import rdflib

    my_json = """
{
  "@context": {
    "@vocab": "http://schema.org/",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },
  "@type": "MonetaryAmount",
  "value": {
    "@type": "xsd:float",
    "@value": 100000
  }
}
    """
    g = rdflib.Graph()
    g.parse(data=my_json, format="json-ld")
    g.print()

This prints

@prefix : <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

[] a :MonetaryAmount ;
    :value "100000"^^xsd:float .

So it appears there is no bug in the JSON-LD parser: it parses the amount as an xsd:float, and when the graph is serialized back to Turtle, it is still an xsd:float. The issue must lie elsewhere. I'll look into it further.

@ashleysommer
Collaborator

Ok, I've worked out one key difference between the JSON-LD example and the Turtle example.

Even though the datatype of both is xsd:float, the "lexical value" of the data in the Turtle version is a string ("100000"), and the lexical value in the JSON-LD version is an integer (100000).

When setting up a Literal value, RDFLib can parse a lexical string into a real value matching the datatype, but only when the lexical value is a string.

This has never been an issue in the past because in Turtle and other RDF data formats, the lexical value of any typed value is always a string. In JSON-LD, however, it can clearly be something other than a string.

As a simple example, replace the @value in your JSON-LD with a string:

{
  "@context": {
    "@vocab": "http://schema.org/",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },
  "@type": "MonetaryAmount",
  "value": {
    "@type": "xsd:float",
    "@value": "100000"
  }
}

You will see your example now passes as expected.

So now I understand that this issue lies somewhere between the JSON-LD parser and RDFLib's handling of Literal lexical values. It could be fixed by adding an extra translation step in the JSON-LD parser, or by adding an extra conversion of non-string lexicals in RDFLib. It might also be easier to fix it at the PySHACL level by modifying how the datatype constraint works, allowing more kinds of values for xsd:float and xsd:double.

@ajnelson-nist
Contributor

Hi @ashleysommer ,

Apologies for butting in, but I saw a notice for this fly by and remembered an issue with default datatypes I'd encountered a while ago. Some data from my community was getting flagged after we made an "All non-integer numbers are now xsd:decimal" decision. The standards-section citations are in this commit:

casework/CASE-Examples@af9d622

@ashleysommer
Collaborator

ashleysommer commented Apr 8, 2022

Thanks @ajnelson-nist
That's great to see. Personally, I too always try to use xsd:decimal wherever possible rather than xsd:float or xsd:double. Floats and doubles are plagued by implementation issues: they are treated differently in different programming languages, and it is easy to run into the issue we see in this thread. For example, should the lexical of 100000 be converted to a float? A float in Python is really a double. So given that the datatype is xsd:float, should it still fail validation? Should it really be xsd:double?
I believe the current way that RDFLib handles it is probably fine. After all, there's nothing stopping you from writing:
"cat"^^xsd:float
RDFLib will happily accept that as a real Literal value, because that's what you've specified. The value will still be "cat", and the datatype will still be xsd:float, but it would fail the PySHACL datatype constraint of xsd:float.

The issue described above is similar. The lexical is an int but the datatype is xsd:float. RDFLib doesn't care: the value is still an int and the datatype is still xsd:float, but as we see, it does fail the datatype constraint.

Because xsd:decimal always has a string as its lexical form (there are some decimals that cannot be represented as an int, float, or double) and RDFLib parses it to a Python Decimal when loaded, adopting this practice avoids the class of issues seen here.
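The points about float width and decimals can be illustrated with the standard library alone. This is a sketch of the general numeric behaviour, not of anything RDFLib or PySHACL does internally; `to_single_precision` is a helper introduced only for this example.

```python
import struct
from decimal import Decimal


def to_single_precision(x: float) -> float:
    """Round a Python float (a 64-bit double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


# 0.1 cannot be stored exactly in 32 bits, so the round trip changes it.
changed_by_round_trip = to_single_precision(0.1) != 0.1

# 100000 is exactly representable in both widths.
unchanged_integer = to_single_precision(100000.0) == 100000.0

# Binary floats carry representation error; Decimal does not in this case.
float_sum_exact = (0.1 + 0.2) == 0.3
decimal_sum_exact = (Decimal("0.1") + Decimal("0.2")) == Decimal("0.3")

print(changed_by_round_trip, unchanged_integer, float_sum_exact, decimal_sum_exact)
```

A Python `float` therefore only approximates what an xsd:float would hold, while `Decimal` built from a string keeps the written value exactly.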

@tobiasschweizer
Author

Thanks @ashleysommer for looking into this.
So if I understand correctly, instead of "@value": 100000 we could simply write "@value": "100000" to sidestep the problem.

So maybe the source of the problem lies in the isinstance check mentioned above? 100000 is represented as an int in Python, which is not an instance of float.

Maybe the relations between numeric types need to be taken into account here. I am no Python expert, but I remember that in Java you can assign an int to a variable of type double but not the other way around. So wouldn't the solution be to accept both int and float when doing the check for xsd:float?
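A sketch of what such a relaxed check might look like. `is_acceptable_xsd_float` is a hypothetical function written for this comment; it is not PySHACL's implementation, and whether accepting ints is the right behaviour is still an open question in this thread.

```python
def is_acceptable_xsd_float(value: object) -> bool:
    """Hypothetical relaxed check: accept Python floats and plain ints."""
    # bool is a subclass of int in Python, so exclude it explicitly;
    # otherwise True and False would pass as numbers.
    if isinstance(value, bool):
        return False
    return isinstance(value, (int, float))


for candidate in (100000, 100000.0, True, "cat"):
    print(repr(candidate), is_acceptable_xsd_float(candidate))
```

The explicit `bool` exclusion is a design choice of this sketch: without it, a value parsed as a boolean would also satisfy the numeric check.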

@tobiasschweizer
Author

@ajnelson-nist this is somewhat off-topic, but aren't you working on lambdamusic/Ontospy#107? :-)

@ashleysommer
Collaborator

ashleysommer commented Apr 8, 2022

So if I understand correctly, instead of "@value": 100000 we could simply write "@value": "100000" to sidestep the problem.

That's right. If it is possible to do that in your data files, that is the easiest way forward.

It works because when RDFLib creates a new Literal object, it has special rules for the case where the lexical value is a string. When it is a string and the literal has a known XSD datatype attached, RDFLib attempts to parse the string into that type, so the value of the literal will be 100000 as a Python float. On the other hand, when the lexical is an int, RDFLib doesn't know it can convert it, so it keeps the value as an int.
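The behaviour described here can be modelled in a few lines. This is a toy model written for illustration, assuming the rule is "parse only when the input is a string"; `python_value` and `LEXICAL_PARSERS` are hypothetical names and this is not RDFLib's actual code.

```python
XSD_FLOAT = "http://www.w3.org/2001/XMLSchema#float"

# Hypothetical converter table: datatype IRI -> parser for a lexical string.
LEXICAL_PARSERS = {XSD_FLOAT: float}


def python_value(lexical_or_value, datatype):
    """Toy model: only string input is parsed according to the datatype."""
    if isinstance(lexical_or_value, str):
        parser = LEXICAL_PARSERS.get(datatype)
        if parser is None:
            return lexical_or_value
        try:
            return parser(lexical_or_value)
        except ValueError:
            # Ill-typed lexical such as "cat": this toy model keeps the
            # string unchanged, mirroring the description in this thread.
            return lexical_or_value
    # Non-string input is kept as the Python object it already is.
    return lexical_or_value


from_string = python_value("100000", XSD_FLOAT)
from_int = python_value(100000, XSD_FLOAT)

print(type(from_string).__name__, type(from_int).__name__)
```

In this model the string input ends up as a `float` and passes an `isinstance(value, float)` check, while the integer input stays an `int` and fails it, which is the difference observed between the two @value forms.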
