
Wikidata interlanguage #35

Merged: 4 commits into dbpedia:master, Apr 10, 2013

Conversation

@ninniuz (Contributor) commented Apr 8, 2013

Simple proposal for #30

Had to include the Lift JSON library because the built-in Scala JSON library is slow (about 5 ms per page, while it took less than 1 ms with Lift).

Tested with wikidatawiki-20130330-pages-articles.xml.bz2 and got 14.677.279 interwiki links.

@jcsahnwaldt (Contributor)

Hi Andrea!

Good stuff! Here are my comments. The first few are minor, but there are a few big ones further down. :-)

WikipediaDumpParser - the 0.8 schema added <model> and <format> elements. I thought we needed both, but then I read the documentation: format is more specific than model, so we can ignore model. I guess you found out the same thing. :-) Maybe we should add a comment about this.

WikiPageFormat - I'm usually all for type safety etc., but now I see there are many possible formats, and they may even differ between MediaWiki installations, so maybe in this case we don't need an enumeration. I think we can just drop this class and use the format string directly. Constants for the most common formats could live in object WikiPage. (By the way, Java enums are great, but I found that Scala Enumerations don't work well. I don't use them anymore.)
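
For what it's worth, a minimal sketch of what such constants might look like; the exact format strings are my assumption based on common MediaWiki content formats, not something defined in this pull request:

```scala
object WikiPage {
  // Hypothetical constants for the most common <format> values.
  // The exact strings are assumptions, not taken from the framework.
  val WikitextFormat = "text/x-wiki"
  val JsonFormat = "application/json"
}
```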

XMLSource - I'm not sure, but maybe (rev \ "format").text throws a NullPointerException if there is no format element, which may happen with older dump formats (pre 0.8).
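
If it does turn out to be a problem, something along these lines would handle the missing element explicitly (a sketch; formatOf is just an illustrative helper name):

```scala
import scala.xml.Node

// Fall back to the empty string when <format> is absent (pre-0.8 dumps),
// instead of relying on what .text does for a missing element.
def formatOf(rev: Node): String =
  (rev \ "format").headOption.map(_.text).getOrElse("")
```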

JsonWikiParser - use parse() instead of parseOpt(), let the exception fly. With parseOpt, you lose the original exception, which makes debugging much harder.
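
For reference, the difference with Lift JSON looks roughly like this (illustrative snippet, not code from the pull request):

```scala
import net.liftweb.json._

val pageText = """{"links": {"dewiki": "Foo"}}""" // hypothetical page content

// parseOpt swallows the parse error and only tells you that parsing failed:
val maybeJson: Option[JValue] = parseOpt(pageText)

// parse throws on malformed input, so the original exception and its
// message reach the caller (and the logs), which is what we want here:
val json: JValue = parse(pageText)
```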

Well, that was the small stuff... now it gets bigger.

JsonWikiParser - the current InterLanguageLinksExtractor generates quads for all languages; there's no special treatment for English. When we extract links from Wikidata, we should generate triples stating that this Wikidata item links to these Wikipedia pages. Something like this:

<http://data.dbpedia.org/resource/Q1234> owl:sameAs <http://de.dbpedia.org/resource/Foo>
<http://data.dbpedia.org/resource/Q1234> owl:sameAs <http://fr.dbpedia.org/resource/Bar>

In post-processing, we can transform them into triples like

<http://de.dbpedia.org/resource/Foo> owl:sameAs <http://fr.dbpedia.org/resource/Bar>
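
Just to illustrate the post-processing idea (a sketch that pairs up the links of each Wikidata item in memory; the real step would of course work over the extracted dataset files):

```scala
// Input: (wikidataItemUri, wikipediaResourceUri) pairs from the sameAs dataset.
// Output: one sameAs pair per ordered combination of Wikipedia resources
// that share the same Wikidata item.
def crossLanguageLinks(links: Seq[(String, String)]): Seq[(String, String)] =
  links.groupBy(_._1).values.toSeq.flatMap { group =>
    val pages = group.map(_._2)
    for (a <- pages; b <- pages if a != b) yield (a, b)
  }
```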

Ok. This was an important point because our choice of subject URI affects many other decisions. Now for the really big stuff...

Representing JSON text as a PageNode with all its sub nodes may be doable for inter-language links, but in general it will be rather confusing and error-prone. A PageNode is an AST for wikitext. Lift builds an AST for JSON, and our new extractors should work on that AST.

Most of your code in JsonWikiParser and InterLanguageLinksExtractor would then move to a new class InterLanguageLinksJsonExtractor whose extract method takes a JValue and returns quads.
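
Very roughly, such a class could look like the sketch below. The JSON layout ("links" mapping site ids to titles) and the tuple-based quad representation are assumptions for illustration only, not the actual Wikidata format or the framework's Quad API:

```scala
import net.liftweb.json._

class InterLanguageLinksJsonExtractor {

  private val sameAs = "http://www.w3.org/2002/07/owl#sameAs"

  // Returns (subject, predicate, object) tuples as stand-ins for quads.
  def extract(json: JValue, subjectUri: String): Seq[(String, String, String)] =
    json \ "links" match {
      case JObject(fields) =>
        fields.collect { case JField(site, JString(title)) =>
          val lang = site.stripSuffix("wiki")          // "dewiki" -> "de"
          val page = title.replace(' ', '_')
          (subjectUri, sameAs, "http://" + lang + ".dbpedia.org/resource/" + page)
        }
      case _ => Seq.empty
    }
}
```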

All right... It is getting late. I have a pretty clear picture of how we should change our framework to deal with different formats in a clean, comprehensible and extensible way, but there's no time and space to explain it here and now. I'll just leave this comment as it is. We should discuss the stuff by mail.

@jcsahnwaldt (Contributor)

By the way, cool that you tested the JSON parser speed. I wonder what Scala JSON is doing to be so incredibly slow.

Now for the next part. Here's the big picture. :-)

First I'll try to describe the current object structure for the dump extraction. Then I'll try to explain how we could adapt it.

Both designs make heavy use of the chain-of-responsibility and composite patterns.

If you have any questions, don't hesitate to send a mail to dbpedia-developers.

Current Design

During startup, ConfigLoader creates a WikiParser object. It also reads the configuration file and creates a CompositeExtractor that contains a list of extractors. We wrap the CompositeExtractor in a RootExtractor and pass the RootExtractor and the WikiParser object to an ExtractionJob object.

When the extraction runs, the ExtractionJob receives a WikiPage object (unparsed text and metadata) from the XML dump parser, runs it through the WikiParser, which returns a PageNode tree, and passes the PageNode to the RootExtractor. The RootExtractor is just a silly little class that creates a PageContext and the subject URI for the page and passes them to the CompositeExtractor. (The PageContext is responsible for creating new subject URIs when needed. We should probably move its code into PageNode or Node. That's a relatively minor change, but it may be necessary to make larger changes easier.)

The CompositeExtractor contains a list of extractors and does the actual work: it simply passes the PageNode to each of its extractors, collects all the quads into a single list and returns them.
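
In code, the composite step boils down to something like this (a sketch with stand-in types; the real signatures in the framework differ slightly):

```scala
// Stand-ins for the framework's types, just to make the sketch self-contained.
trait Quad
trait PageNode
trait PageContext
trait Extractor {
  def extract(node: PageNode, subjectUri: String, context: PageContext): Seq[Quad]
}

// The composite simply fans the page out to its children and concatenates the quads.
class CompositeExtractor(extractors: Seq[Extractor]) extends Extractor {
  def extract(node: PageNode, subjectUri: String, context: PageContext): Seq[Quad] =
    extractors.flatMap(_.extract(node, subjectUri, context))
}
```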

In a nutshell, this is what happens for each page during the extraction:

ExtractionJob:

  • receive WikiPage from XML dump parser
  • parse page text: WikiPage -> PageNode
  • pass PageNode to RootExtractor
  • when RootExtractor returns list of quads, serialize them

RootExtractor:

  • create subject URI and PageContext
  • pass PageNode, subject URI and PageContext to CompositeExtractor

CompositeExtractor:

  • pass PageNode, subject URI and PageContext to each extractor
  • collect quads and return them

I think that's about it.

New Design

Here's how a new, more flexible structure would work:

ExtractionJob:

  • receive WikiPage from XML dump parser
  • pass WikiPage to main CompositeExtractor
  • when CompositeExtractor returns list of quads, serialize them

There will be multiple instances of CompositeExtractor. The main CompositeExtractor contains a list of ParsingExtractor objects and simply passes the WikiPage to each of them.

main CompositeExtractor:

  • pass WikiPage to each ParsingExtractor
  • collect quads and return them

ParsingExtractors are new classes that we add for this design. A ParsingExtractor parses a WikiPage and passes the result to the next extractor. For each format, there is a separate ParsingExtractor class (see the sketch after the two outlines below). For example:

WikitextParsingExtractor:

  • If page format is wikitext, run WikiPage object through WikiParser and pass PageNode to next extractor
  • Otherwise, do nothing

JsonParsingExtractor:

  • If page format is JSON, run WikiPage object through JSON parser and pass JValue to next extractor
  • Otherwise, do nothing
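
A sketch of the two ParsingExtractors described above (all the types and format strings here are stand-ins and assumptions, not the framework's actual API):

```scala
import net.liftweb.json._

// Stand-ins for the framework's types.
trait Quad
trait PageNode
case class WikiPage(format: String, source: String)

trait Extractor[N] { def extract(input: N): Seq[Quad] }

// Parses wikitext pages and hands the PageNode to the next extractor;
// pages in any other format are ignored.
class WikitextParsingExtractor(wikiParser: String => PageNode, next: Extractor[PageNode])
  extends Extractor[WikiPage] {
  def extract(page: WikiPage): Seq[Quad] =
    if (page.format == "text/x-wiki") next.extract(wikiParser(page.source)) else Seq.empty
}

// Parses JSON pages (e.g. Wikidata items) and hands the Lift JValue to the next extractor.
class JsonParsingExtractor(next: Extractor[JValue]) extends Extractor[WikiPage] {
  def extract(page: WikiPage): Seq[Quad] =
    if (page.format == "application/json") next.extract(parse(page.source)) else Seq.empty
}
```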

The "next extractor" will usually also be a CompositeExtractor. For example:

CompositeExtractor for WikitextParsingExtractor:

  • pass PageNode to each extractor - almost exactly what we do now
  • collect quads and return them

(If we don't move the PageContext code to PageNode or Node, we may have to insert the old RootExtractor into the chain here.)

CompositeExtractor for JsonParsingExtractor:

  • pass JValue to each extractor - at the moment, your new InterLanguageLinksJsonExtractor will be the only one.
  • collect quads and return them

In addition to the ParsingExtractor objects, the main CompositeExtractor may also contain another CompositeExtractor that simply passes the unparsed WikiPage object to the extractors that don't need parsed page content.

Those extractor classes (and maybe some more, I'm not sure) currently also receive a PageNode object (just because it was simpler to let them implement the same interface as all other extractors), but all they need is a WikiTitle, which is also contained in WikiPage. Basically, we just have to change their generic type from Extractor[PageNode] to Extractor[WikiPage].

Class names

Currently, the names of the classes and traits in the mappings package are a bit chaotic: the root of the type hierarchy is called Mapping, which we should rename to Extractor, and the specific trait for extractors that handle PageNode objects is called Extractor, which we should rename to PageNodeExtractor or WikitextExtractor.

@jcsahnwaldt (Contributor)

Your number of 14.677.279 links looks good. In DBpedia 3.8, we had 13.184.401 links from English to other languages.

@jimkont (Member) commented Apr 10, 2013

The discussion on the framework refactoring has moved to https://sourceforge.net/mailarchive/message.php?msg_id=30706291. Comments specific to this pull request will continue here.


@jcsahnwaldt (Contributor)

The link above only works with login. Strange. Here's the list where we discuss this: https://lists.sourceforge.net/lists/listinfo/dbpedia-developers

kurzum added a commit that referenced this pull request on Apr 10, 2013: "working code merged. We will create issues, so we do not forget to include JC's suggestions and can work on them consecutively."

@kurzum merged commit 4de893c into dbpedia:master on Apr 10, 2013
@jcsahnwaldt (Contributor)

Here are a few drawings that I hope make things clearer in connection with the text above.

The arrows show which type of data objects are passed from one extractor to the next, or passed to and returned from a parser.

I omitted the return types of the extractors because they're always lists of quads.

Current Design

[Diagram: extraction_design_old]

New Design

[Diagram: extraction_design_new]

@jcsahnwaldt (Contributor)

Hi Andrea,

we decided to merge your pull request. Thanks for your many contributions!

Your code is a good start for handling Wikidata. In the long run, we will have to refactor some stuff - trying to handle JSON data through the API of our Wikitext AST won't work well in general - but for now it's fine.

Thanks!

Christopher

@ninniuz deleted the Wikidata_Interlanguage branch on April 11, 2013.
@jimkont modified the milestone (pastReleases) on Mar 19, 2015.