RDF parser Bug with Unicode Character when Export. #3383

Closed
MichelDiz opened this issue May 7, 2019 · 8 comments · Fixed by #3424
Labels: kind/bug (Something is broken.)

@MichelDiz

MichelDiz commented May 7, 2019

If you suspect this could be a bug, follow the template.

  • What version of Dgraph are you using?
    1.0.14, v1.0.15-rc4 and Master

  • Have you tried reproducing the issue with latest release?
    yes

  • What is the hardware spec (RAM, OS)?
    32GB, Darwin.

  • Steps to reproduce the issue (command/config used to run Dgraph).

This happened during an import of a Twitter dataset from flock. The load was aborted:

[22:42:02-0300] Elapsed: 23m45s Txns: 41480 N-Quads: 41480000 N-Quads/s [last 5s]: 30200 Aborts: 952
[22:42:07-0300] Elapsed: 23m50s Txns: 41679 N-Quads: 41679000 N-Quads/s [last 5s]: 39800 Aborts: 952
2019/05/06 22:42:09 while parsing line "<0x273fdd> <description> \"Attention : ces jours-ci, Twitter pourra devenir instable, avec souvent des pro~po_~{po ─ ~ ®o n~poã_\\a~{o┼[po ╣y¿ po¿w4k*¿*n~p┌blèmes\\r\\nǝuuɐd uǝ ʇsǝ lı 'ʇsǝ ʎ ɐɔ\"^^<xs:string> .\n": while lexing <0x273fdd> <description> "Attention : ces jours-ci, Twitter pourra devenir instable, avec souvent des pro~po_~{po ─ ~ ®o n~poã_\a~{o┼[po ╣y¿ po¿w4k*¿*n~p┌blèmes\r\nǝuuɐd uǝ ʇsǝ lı 'ʇsǝ ʎ ɐɔ"^^<xs:string> . at line 1 column 25: Invalid escape character : 'a' in literal

To reproduce it, run a mutation like:

{
    set {
     <_:uid2> <pred> "\u0007"^^<xs:string> .
   }
}

Then export via http://localhost:8080/admin/export. The exported RDF contains an invalid escape:
<0x1> <description> "\a"^^<xs:string> .

When you try to reimport the RDF, you get a lexing error:
"while lexing <_:0x1> <description> \"\\a\"^^<xs:string> . at line 1 column 22: Invalid escape character : 'a' in literal"

To solve this in part, we should force (auto)escaping, or recommend that users do it at the application level.

If you escape the string, mutate, and export, you get the desired result:

{
    set {
     <_:uid2> <pred> "\\u0007"^^<xs:string> .
   }
}

RDF exported:
<0x1> <pred> "\\u0007"^^<xs:string> .
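
The auto-escape idea suggested above could be sketched roughly as follows. This is only an illustration, not Dgraph's exporter: `escapeLiteral` is a hypothetical helper name, and the choice to emit `\uXXXX` for every control character without a standard N-Quads short escape is an assumption about what the lexer accepts (the mutation above shows that `\u0007` is accepted on input).

```go
package main

import (
	"fmt"
	"strings"
)

// escapeLiteral is a hypothetical helper (not part of Dgraph).
// It escapes s using only the short escapes that N-Quads defines
// (\t \b \n \r \f \" \\) and falls back to \uXXXX for any other
// control character, so BEL is written as \u0007 and never as \a.
func escapeLiteral(s string) string {
	var b strings.Builder
	for _, r := range s {
		switch r {
		case '\t':
			b.WriteString(`\t`)
		case '\b':
			b.WriteString(`\b`)
		case '\n':
			b.WriteString(`\n`)
		case '\r':
			b.WriteString(`\r`)
		case '\f':
			b.WriteString(`\f`)
		case '"':
			b.WriteString(`\"`)
		case '\\':
			b.WriteString(`\\`)
		default:
			if r < 0x20 || r == 0x7f {
				// Control character with no short escape in N-Quads.
				fmt.Fprintf(&b, `\u%04X`, r)
			} else {
				b.WriteRune(r)
			}
		}
	}
	return b.String()
}

func main() {
	// The stored value is the single character BEL (U+0007).
	fmt.Printf("<0x1> <pred> \"%s\"^^<xs:string> .\n", escapeLiteral("\u0007"))
}
```

With an escaper along these lines the exported line would carry `\u0007`, which is the same form the original mutation used, so a reimport would not hit the `\a` lexing error.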

@MichelDiz MichelDiz added the kind/bug Something is broken. label May 7, 2019
@codexnull codexnull self-assigned this May 7, 2019
@codexnull

I'll take a look at this. It shouldn't be too hard to fix.

@MichelDiz

MichelDiz commented May 7, 2019

BTW, the error only appears in the export. If you query for the node, the string comes back correctly. I guess the JSON export might handle this correctly; I did not test it because it is not ready yet.

@codexnull

No, the problem exists with JSON as well.

Exported:

[
  {"uid":"0x1","pred":"\a"}
]

dgraph live:

Processing data file "/tmp/dgr/IDX0/export/dgraph.r192.u0508.1651/g01.json.gz"
2019/05/08 09:53:20 invalid character 'a' in string escape code
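
The JSON failure can be reproduced outside Dgraph with Go's standard `encoding/json` package, since `\a` is not an escape that JSON defines. The sketch below is only an illustration: `decodeExport` is a hypothetical helper, and the assumption is that the loader decodes the file with a standard JSON decoder.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeExport is a hypothetical helper: it tries to decode one exported
// JSON array of string-valued objects and returns the decoded records
// together with any decode error.
func decodeExport(data string) ([]map[string]string, error) {
	var out []map[string]string
	err := json.Unmarshal([]byte(data), &out)
	return out, err
}

func main() {
	// As exported: \a is not a valid JSON escape sequence.
	_, err := decodeExport(`[{"uid":"0x1","pred":"\a"}]`)
	fmt.Println("exported form:", err)

	// The same value written with a JSON-legal escape decodes fine.
	recs, err := decodeExport(`[{"uid":"0x1","pred":"\u0007"}]`)
	fmt.Println("\\u0007 form:", err, []byte(recs[0]["pred"]))
}
```

The first call fails and the second succeeds, which suggests the JSON export needs to emit `\u0007` for BEL rather than `\a`.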

@codexnull codexnull reopened this May 8, 2019
@codexnull

Also, I don't think the problem is just with the export. Doing

{
    set {
     <_:uid2> <pred> "\a"^^<xs:string> .
   }
}

in Ratel will also give the same error.

@MichelDiz

This error with \a occurs because you need to escape the backslash, though:
<_:uid2> <pred> "\\a"^^<xs:string> .

When you export, it will be fine:
<0x5> <pred> "\\a"^^<xs:string> .

The whole issue is solved if you escape as the JSON standard recommends. So either we add an escape and unescape step around the parser, or we warn users that this task is theirs.

@MichelDiz

Or we could add a new feature like:
[Image: Screenshot 2019-05-08 at 15.43.06]

@codexnull

codexnull commented May 8, 2019

I understand that using "\\a" works, but that's not doing the same thing. "\\a" puts the two characters '\' (ASCII 92) and 'a' (ASCII 97) into the string. "\u0007" (and "\a") inserts the single character BEL (ASCII 7). That's why "\u0007" is converted to just "\a" on export.
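
The distinction can be checked with Go string literals, which happen to use the same spellings. This is just a small illustration; `byteValues` is a hypothetical helper name.

```go
package main

import "fmt"

// byteValues is a hypothetical helper that returns the raw bytes of s,
// so the stored content of each literal can be compared directly.
func byteValues(s string) []byte {
	return []byte(s)
}

func main() {
	fmt.Println(byteValues("\\a"))     // [92 97]: backslash followed by 'a'
	fmt.Println(byteValues("\u0007"))  // [7]: the single BEL character
	fmt.Println("\u0007" == "\a")      // true: two spellings of the same character
}
```

Escaping at the application level therefore stores a different two-character value, it does not round-trip the BEL character.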

@MichelDiz

Another very curious related case:

See this (regarding this link: do these tests pass at all?)

{
    set {
     <_:alice> <lives> "\x02 wonderland" .
     <_:alice2> <lives> "\x07 wonderland" .
   }
}

If you send \x02 and \x07 as JavaScript-style escapes, the query result shows them as Java-style \u escapes (\u0002 and \u0007). But when you export, only \x02 comes back as a JavaScript-style escape, while \x07 becomes \a. That's weird.

<0x6> <lives> "\x02 wonderland" .
<0x7> <lives> "\a wonderland" .
{
    set {
     <_:alice> <lives> "\v\t\b\n\r\f\"\\" .
   }
}

All results:

{
  "data": {
    "q": [
      {
        "uid": "0x6",
        "lives": "\u0002 wonderland"
      },
      {
        "uid": "0x7",
        "lives": "\u0007 wonderland"
      },
      {
        "uid": "0x9",
        "lives": "\u000b\t\b\n\r\f\"\\"
      }
    ]
  },
  "extensions": {
    "server_latency": {
      "parsing_ns": 15797,
      "processing_ns": 2276682,
      "encoding_ns": 503333
    },
    "txn": {
      "start_ts": 128
    }
  }
}

The exported RDF

<0x6> <lives> "\x02 wonderland" .
<0x7> <lives> "\a wonderland" .
<0x9> <lives> "\v\t\b\n\r\f\"\\" .
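
The asymmetry between \x02 and \x07 matches how Go's own string quoting behaves, so one possible explanation (an assumption, not confirmed in this thread) is that the export path quotes values with something like `strconv.Quote`, while query results go through a JSON encoder. A minimal sketch comparing the two, with hypothetical helper names `goQuote` and `jsonQuote`:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// goQuote quotes s as a Go string literal. Go has a short escape for
// BEL (\a) but none for U+0002, which it writes as \x02.
func goQuote(s string) string {
	return strconv.Quote(s)
}

// jsonQuote quotes s as a JSON string. JSON has no \a or \x escapes,
// so both control characters are written as \u00XX.
func jsonQuote(s string) string {
	b, err := json.Marshal(s)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	for _, s := range []string{"\x02 wonderland", "\x07 wonderland"} {
		fmt.Printf("go: %s\tjson: %s\n", goQuote(s), jsonQuote(s))
	}
}
```

Under that assumption, the Go-style quoting reproduces the exported lines for 0x6 and 0x7 exactly, and the JSON quoting reproduces the query result, which would explain why only the export is affected.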
