fix serialiser makes invalid turtle curis#3364
Conversation
|
Suggested location for fix where the entire string should be perform backslash escaping for any character which will fail to parse later (includes % character). The list of characters that need backslash escaping is at https://www.w3.org/TR/turtle/#grammar-production-PN_LOCAL_ESC |
…cial characters in local names of IRIs
|
pre-commit.ci autofix |
for more information, see https://pre-commit.ci
…ackslash escaping
|
This looks fine to me at least. I have compared it to the discussions in the corresponding issue. Can you test every to be escaped character? |
I will address these point on Friday. I actually have a feeling some other magic code deals with every character except percent but haven't had a chance to get back to dev environment to test thoroughly. I will split each test into its own file if needed.... |
rdflib/plugins/serializers/turtle.py
Outdated
| prefix, namespace, local = parts | ||
|
|
||
| local = local.replace(r"(", r"\(").replace(r")", r"\)") | ||
| local = re.sub(r"[\"'~!$&\(\)*+,;=/\?#@%]", r"\\\0", local) |
There was a problem hiding this comment.
Thanks! Should this RE be compiled somewhere to make it faster?
I don't know whether there are perf test somewhere though.
e.g.
# After imports ...
RE_ESCAPE_CHARS = re.compile(r"[\"'~!$&\(\)*+,;=/\?#@%]")
...
# in the function ...
local RE_ESCAPE_CHARS.sub(r"\\\0", local)
|
These are the canonical references that I need to check when I get back to this:
|
|
Removing the following from the test file as none of these triples failed roundtrip and I assume there is probably test coverage elsewhere for these. It was just the percent sign being escaped that failed. |
|
Ready again. The test and fix is now very precisely targeted to escaped percent character in a localname. This works in Jena turtle serialiser/parser just fine. Rdflib must have missed it due to percent character appearing in the grammar specifically to percent-escape other/unprintable characters with 2-digit hexadecimal sequence. The fix is using a precompiled regex to detect percent (%) characters not followed by 2-digit hex sequence. Such characters are replaced by blackslash plus percent character. As requested also took opportunity to rename getQName function to get_pname where it appeared to be in fact getting a Prefixed Name. This was relevant to or touched four serialisers: Turtle, Long turtle, Trig and N3. |
|
Am I expected to rebase and selectively squash to clean up commits or will they get squashed on merge? |
|
Bump - anyone out there? |
|
Hi @lisat-dstg, thanks for providing a fix. I haven't reviewed your code just yet but I have approved the running of the validate workflows on this PR. There appears to be a mypy error. Do you mind taking a look? If you're unsure of anything, please reach out and I'll try to respond promptly. |
Thanks @edmondchuc I think I've fixed but having trouble with dev envt. Can you please rerun the CI pipelines? Thx |
|
@lisat-dstg I've triggered the CI again. Sorry for the delay. Your changes look great. It correctly escapes |
|
I've added the |
Fixes issue #1395
So far just the breaking test (1 new test artefact generates 4 breaking tests) -- one each for formats: trig, ttl, turtle and n3