
Wikidata interlanguage #35

Merged: 4 commits into dbpedia:master, Apr 10, 2013

Conversation

@ninniuz (Contributor) commented Apr 8, 2013

Simple proposal for #30

Had to include the Lift JSON library because the built-in Scala JSON library is slow (about 5 ms per page, while it took less than 1 ms with Lift).

Tested with wikidatawiki-20130330-pages-articles.xml.bz2 and got 14.677.279 interwiki links.

@jcsahnwaldt (Contributor)

Hi Andrea!

Good stuff! Here are my comments. The first few are minor, but there are a few big ones further down. :-)

WikipediaDumpParser - the 0.8 schema added <model> and <format> elements. I thought we needed both, but then I read the documentation: format is more specific than model, so we can ignore model. I guess you found out the same thing. :-) Maybe we should add a comment about this.

WikiPageFormat - I'm usually all for type safety etc., but now I see there are many possible formats, and they may even differ between MediaWiki installations, so maybe in this case we don't need an enumeration. I think we can just drop this class and use the format string directly. Constants for the most common formats could live in object WikiPage. (By the way, Java enums are great, but I found that Scala Enumerations don't work well. I don't use them anymore.)
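
For what it's worth, a minimal sketch of what such constants might look like; the exact format strings are my assumption based on common MediaWiki content formats, not something defined in this pull request:

```scala
object WikiPage {
  // Hypothetical constants for the most common <format> values.
  // The exact strings are assumptions, not taken from the framework.
  val WikitextFormat = "text/x-wiki"
  val JsonFormat = "application/json"
}
```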

XMLSource - I'm not sure, but maybe (rev \ "format").text throws a NullPointerException if there is no format element, which may happen with older dump formats (pre 0.8).
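
If it does turn out to be a problem, something along these lines would handle the missing element explicitly (a sketch; formatOf is just an illustrative helper name):

```scala
import scala.xml.Node

// Fall back to the empty string when <format> is absent (pre-0.8 dumps),
// instead of relying on what .text does for a missing element.
def formatOf(rev: Node): String =
  (rev \ "format").headOption.map(_.text).getOrElse("")
```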

JsonWikiParser - use parse() instead of parseOpt(), let the exception fly. With parseOpt, you lose the original exception, which makes debugging much harder.
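
For reference, the difference with Lift JSON looks roughly like this (illustrative snippet, not code from the pull request):

```scala
import net.liftweb.json._

val pageText = """{"links": {"dewiki": "Foo"}}""" // hypothetical page content

// parseOpt swallows the parse error and only tells you that parsing failed:
val maybeJson: Option[JValue] = parseOpt(pageText)

// parse throws on malformed input, so the original exception and its
// message reach the caller (and the logs), which is what we want here:
val json: JValue = parse(pageText)
```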

Well, that was the small stuff... now it gets bigger.

JsonWikiParser - the current InterLanguageLinksExtractor generates quads for all languages; there's no special treatment for English. When we extract links from Wikidata, we should generate triples stating that this Wikidata item links to these Wikipedia pages. Something like this:

<http://data.dbpedia.org/resource/Q1234> owl:sameAs <http://de.dbpedia.org/resource/Foo>
<http://data.dbpedia.org/resource/Q1234> owl:sameAs <http://fr.dbpedia.org/resource/Bar>

In post-processing, we can transform them into triples like

<http://de.dbpedia.org/resource/Foo> owl:sameAs <http://fr.dbpedia.org/resource/Bar>
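
Just to illustrate the post-processing idea (a sketch that pairs up the links of each Wikidata item in memory; the real step would of course work over the extracted dataset files):

```scala
// Input: (wikidataItemUri, wikipediaResourceUri) pairs from the sameAs dataset.
// Output: one sameAs pair per ordered combination of Wikipedia resources
// that share the same Wikidata item.
def crossLanguageLinks(links: Seq[(String, String)]): Seq[(String, String)] =
  links.groupBy(_._1).values.toSeq.flatMap { group =>
    val pages = group.map(_._2)
    for (a <- pages; b <- pages if a != b) yield (a, b)
  }
```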

Ok. This was an important point because our choice of subject URI affects many other decisions. Now for the really big stuff...

Representing JSON text as a PageNode with all its sub nodes may be doable for inter-language links, but in general it will be rather confusing and error-prone. A PageNode is an AST for wikitext. Lift builds an AST for JSON, and our new extractors should work on that AST.

Most of your code in JsonWikiParser and InterLanguageLinksExtractor would then move to a new class InterLanguageLinksJsonExtractor whose extract method takes a JValue and returns quads.
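
Very roughly, such a class could look like the sketch below. The JSON layout ("links" mapping site ids to titles) and the tuple-based quad representation are assumptions for illustration only, not the actual Wikidata format or the framework's Quad API:

```scala
import net.liftweb.json._

class InterLanguageLinksJsonExtractor {

  private val sameAs = "http://www.w3.org/2002/07/owl#sameAs"

  // Returns (subject, predicate, object) tuples as stand-ins for quads.
  def extract(json: JValue, subjectUri: String): Seq[(String, String, String)] =
    json \ "links" match {
      case JObject(fields) =>
        fields.collect { case JField(site, JString(title)) =>
          val lang = site.stripSuffix("wiki")          // "dewiki" -> "de"
          val page = title.replace(' ', '_')
          (subjectUri, sameAs, "http://" + lang + ".dbpedia.org/resource/" + page)
        }
      case _ => Seq.empty
    }
}
```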

All right... It is getting late. I have a pretty clear picture of how we should change our framework to deal with different formats in a clean, comprehensible and extensible way, but there's no time and space to explain it here and now. I'll just leave this comment as it is. We should discuss the stuff by mail.

@jcsahnwaldt (Contributor)

By the way, cool that you tested the JSON parser speed. I wonder what Scala JSON is doing to be so incredibly slow.

Now for the next part. Here's the big picture. :-)

First I'll try to describe the current object structure for the dump extraction. Then I'll try to explain how we could adapt it.

Both designs make heavy use of the chain-of-responsibility and composite patterns.

If you have any questions, don't hesitate to send a mail to dbpedia-developers.

Current Design

During startup, ConfigLoader creates a WikiParser object. It also reads the configuration file and creates a CompositeExtractor that contains a list of extractors. We wrap the CompositeExtractor in a RootExtractor and pass the RootExtractor and the WikiParser object to an ExtractionJob object.

When the extraction runs, the ExtractionJob receives a WikiPage object (unparsed text and metadata) from the XML dump parser, runs it through the WikiParser, which returns a PageNode tree, and passes the PageNode to the RootExtractor. The RootExtractor is just a silly little class that creates a PageContext and the subject URI for the page and passes them to the CompositeExtractor. (The PageContext is responsible for creating new subject URIs when needed. We should probably move its code into PageNode or Node. That's a relatively minor change, but it may be necessary to make larger changes easier.)

The CompositeExtractor contains a list of extractors and does the actual work: it simply passes the PageNode to each of its extractors, collects all the quads into a single list and returns them.
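
In code, the composite step boils down to something like this (a sketch with stand-in types; the real signatures in the framework differ slightly):

```scala
// Stand-ins for the framework's types, just to make the sketch self-contained.
trait Quad
trait PageNode
trait PageContext
trait Extractor {
  def extract(node: PageNode, subjectUri: String, context: PageContext): Seq[Quad]
}

// The composite simply fans the page out to its children and concatenates the quads.
class CompositeExtractor(extractors: Seq[Extractor]) extends Extractor {
  def extract(node: PageNode, subjectUri: String, context: PageContext): Seq[Quad] =
    extractors.flatMap(_.extract(node, subjectUri, context))
}
```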

In a nutshell, this is what happens for each page during the extraction:

ExtractionJob:

  • receive WikiPage from XML dump parser
  • parse page text: WikiPage -> PageNode
  • pass PageNode to RootExtractor
  • when RootExtractor returns list of quads, serialize them

RootExtractor:

  • create subject URI and PageContext
  • pass PageNode, subject URI and PageContext to CompositeExtractor

CompositeExtractor:

  • pass PageNode, subject URI and PageContext to each extractor
  • collect quads and return them

I think that's about it.

New Design

Here's how a new, more flexible structure would work:

ExtractionJob:

  • receive WikiPage from XML dump parser
  • pass WikiPage to main CompositeExtractor
  • when CompositeExtractor returns list of quads, serialize them

There will be multiple instances of CompositeExtractor. The main CompositeExtractor contains a list of ParsingExtractor objects and simply passes the WikiPage to each of them.

main CompositeExtractor:

  • pass WikiPage to each ParsingExtractor
  • collect quads and return them

ParsingExtractors are new classes that we add for this design. A ParsingExtractor parses a WikiPage and passes the result to the next extractor. For each format, there is a separate ParsingExtractor class (see the sketch after the two outlines below). For example:

WikitextParsingExtractor:

  • If page format is wikitext, run WikiPage object through WikiParser and pass PageNode to next extractor
  • Otherwise, do nothing

JsonParsingExtractor:

  • If page format is JSON, run WikiPage object through JSON parser and pass JValue to next extractor
  • Otherwise, do nothing
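
A sketch of the two ParsingExtractors described above (all the types and format strings here are stand-ins and assumptions, not the framework's actual API):

```scala
import net.liftweb.json._

// Stand-ins for the framework's types.
trait Quad
trait PageNode
case class WikiPage(format: String, source: String)

trait Extractor[N] { def extract(input: N): Seq[Quad] }

// Parses wikitext pages and hands the PageNode to the next extractor;
// pages in any other format are ignored.
class WikitextParsingExtractor(wikiParser: String => PageNode, next: Extractor[PageNode])
  extends Extractor[WikiPage] {
  def extract(page: WikiPage): Seq[Quad] =
    if (page.format == "text/x-wiki") next.extract(wikiParser(page.source)) else Seq.empty
}

// Parses JSON pages (e.g. Wikidata items) and hands the Lift JValue to the next extractor.
class JsonParsingExtractor(next: Extractor[JValue]) extends Extractor[WikiPage] {
  def extract(page: WikiPage): Seq[Quad] =
    if (page.format == "application/json") next.extract(parse(page.source)) else Seq.empty
}
```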

The "next extractor" will usually also be a CompositeExtractor. For example:

CompositeExtractor for WikitextParsingExtractor:

  • pass PageNode to each extractor - almost exactly what we do now
  • collect quads and return them

(If we don't move the PageContext code to PageNode or Node, we may have to insert the old RootExtractor into the chain here.)

CompositeExtractor for JsonParsingExtractor:

  • pass JValue to each extractor - at the moment, your new InterLanguageLinksJsonExtractor will be the only one.
  • collect quads and return them

In addition to the ParsingExtractor objects, the main CompositeExtractor may also contain another CompositeExtractor that simply passes the unparsed WikiPage object to the extractors that don't need parsed page content.

Those extractor classes (and maybe some more, I'm not sure) currently also receive a PageNode object (just because it was simpler to let them implement the same interface as all other extractors), but all they need is a WikiTitle, which is also contained in WikiPage. Basically, we just have to change their generic type from Extractor[PageNode] to Extractor[WikiPage].

Class names

Currently, the names of the classes and traits in the mappings package are a bit chaotic: the root of the type hierarchy is called Mapping, which we should rename to Extractor, and the specific trait for extractors that handle PageNode objects is called Extractor, which we should rename to PageNodeExtractor or WikitextExtractor.

@jcsahnwaldt (Contributor)

Your number of 14.677.279 links looks good. In DBpedia 3.8, we had 13.184.401 links from English to other languages.

@jimkont (Member) commented Apr 10, 2013

The discussion on the framework refactoring has moved to https://sourceforge.net/mailarchive/message.php?msg_id=30706291. Comments specific to this pull request will continue here.


@jcsahnwaldt (Contributor)

The link above only works with login. Strange. Here's the list where we discuss this: https://lists.sourceforge.net/lists/listinfo/dbpedia-developers

kurzum added a commit that referenced this pull request on Apr 10, 2013: "working code merged. We will create issues, so we do not forget to include JC's suggestions and can work on them consecutively."

@kurzum merged commit 4de893c into dbpedia:master on Apr 10, 2013
@jcsahnwaldt (Contributor)

Here are a few drawings that I hope make things clearer in connection with the text above.

The arrows show which type of data objects are passed from one extractor to the next, or passed to and returned from a parser.

I omitted the return types of the extractors because they're always lists of quads.

Current Design

[Diagram: extraction_design_old]

New Design

[Diagram: extraction_design_new]

@jcsahnwaldt (Contributor)

Hi Andrea,

we decided to merge your pull request. Thanks for your many contributions!

Your code is a good start for handling Wikidata. In the long run, we will have to refactor some stuff - trying to handle JSON data through the API of our Wikitext AST won't work well in general - but for now it's fine.

Thanks!

Christopher

@ninniuz deleted the Wikidata_Interlanguage branch on April 11, 2013.
@jimkont modified the milestone (pastReleases) on Mar 19, 2015.