Wikidata interlanguage #35
Conversation
Hi Andrea! Good stuff! Here are my comments. The first few are minor, but there are a few big ones further down. :-)

- WikipediaDumpParser - the 0.8 schema added
- WikiPageFormat - I'm usually all for type safety etc., but now I saw there are many possible formats, and they may even differ for each MediaWiki, so maybe in this case we don't need an enumeration. I think we can just drop this class and use the format string directly. Constants for the most common formats could live in object WikiPage. (By the way, Java enums are great, but I found that Scala Enumerations don't work well. I don't use them anymore.)
- XMLSource - I'm not sure, but maybe
- JsonWikiParser - use parse() instead of parseOpt() and let the exception fly. With parseOpt, you lose the original exception, which makes debugging much harder.

Well, that was the small stuff... now it gets bigger.

JsonWikiParser - the current InterLanguageLinksExtractor generates quads for all languages; there's no special treatment for English. When we extract links from Wikidata, we should generate triples stating that this Wikidata item links to these Wikipedia items. Something like this:
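A hypothetical illustration of that shape - the URIs and the predicate name below are placeholders, not the exact namespaces or properties the framework uses, and the quads are reduced to plain tuples:

```scala
// Hypothetical illustration only: the Wikidata item (Q64 = Berlin) is the subject of every
// link, and the objects are the language-specific resources. URIs and predicate are placeholders.
val wikidataLinks = Seq(
  ("http://wikidata.dbpedia.org/resource/Q64", "interLanguageLink", "http://dbpedia.org/resource/Berlin"),
  ("http://wikidata.dbpedia.org/resource/Q64", "interLanguageLink", "http://de.dbpedia.org/resource/Berlin"),
  ("http://wikidata.dbpedia.org/resource/Q64", "interLanguageLink", "http://fr.dbpedia.org/resource/Berlin")
)
```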
In post-processing, we can transform them into triples that link the language-specific pages directly to each other.
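A sketch of what that transformation could look like, continuing the placeholder representation from above (group by Wikidata item, then emit pairwise links); this is an illustration, not the framework's actual post-processing code:

```scala
// Hypothetical sketch: group the Wikidata-centric links by item and emit pairwise
// language-to-language links. URIs and the predicate are placeholders.
val wikidataLinks = Seq(
  ("http://wikidata.dbpedia.org/resource/Q64", "interLanguageLink", "http://dbpedia.org/resource/Berlin"),
  ("http://wikidata.dbpedia.org/resource/Q64", "interLanguageLink", "http://de.dbpedia.org/resource/Berlin")
)

val languageLinks = for {
  (_, links) <- wikidataLinks.groupBy(_._1).toSeq
  (_, _, a)  <- links
  (_, _, b)  <- links
  if a != b
} yield (a, "interLanguageLink", b)
// e.g. ("http://dbpedia.org/resource/Berlin", "interLanguageLink", "http://de.dbpedia.org/resource/Berlin")
```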
Ok. This was an important point, because our choice of subject URI affects many other decisions.

Now for the really big stuff... Representing JSON text as a PageNode with all its sub-nodes may be doable for inter-language links, but in general it will be rather confusing and error-prone. A PageNode is an AST for wikitext. Lift builds an AST for JSON, and our new extractors should work on that AST. Most of your code in JsonWikiParser and InterLanguageLinksExtractor would then move to a new class InterLanguageLinksJsonExtractor whose extract method takes a JValue and returns quads.

All right... It is getting late. I have a pretty clear picture of how we should change our framework to deal with different formats in a clean, comprehensible and extensible way, but there's no time and space to explain it here and now. I'll just leave this comment as it is. We should discuss the stuff by mail.
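A minimal sketch of what such a JSON-AST extractor could look like. The input is Lift's JValue; the "links" layout of the JSON and the pair-based return type (instead of the framework's Quad objects) are simplifying assumptions for illustration only:

```scala
import net.liftweb.json.JsonAST._
import net.liftweb.json.JsonParser.parse

// Hedged sketch: an extractor that works directly on Lift's JSON AST instead of a PageNode.
// The "links" layout and the (site, title) return type are assumptions, not the real format or API.
class InterLanguageLinksJsonExtractor {
  def extract(json: JValue): List[(String, String)] =
    json \ "links" match {
      case JObject(fields) =>
        fields.collect { case JField(site, JString(title)) => (site, title) }
      case _ => Nil
    }
}

// Usage: parse once with Lift, then hand the resulting AST to the extractor.
object Demo extends App {
  val extractor = new InterLanguageLinksJsonExtractor
  println(extractor.extract(parse("""{"links": {"enwiki": "Berlin", "dewiki": "Berlin"}}""")))
}
```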
By the way, cool that you tested the JSON parser speed. I wonder what Scala JSON is doing to be so incredibly slow.

Now for the next part. Here's the big picture. :-) First I'll try to describe the current object structure for the dump extraction. Then I'll try to explain how we could adapt it. Both designs make heavy use of the chain-of-responsibility and composite patterns. If you have any questions, don't hesitate to send a mail to dbpedia-developers.

Current Design

During startup, ConfigLoader creates a WikiParser object. It also reads the configuration file and creates a CompositeExtractor that contains a list of extractors. We wrap the CompositeExtractor in a RootExtractor and pass the RootExtractor and the WikiParser object to an ExtractionJob object.

When the extraction runs, the ExtractionJob receives a WikiPage object (unparsed text and metadata) from the XML dump parser, runs it through the WikiParser, which returns a PageNode tree, and passes the PageNode to the RootExtractor. The RootExtractor is just a silly little class that creates a PageContext and the subject URI for the page and passes them to the CompositeExtractor. (The PageContext is responsible for creating new subject URIs when needed. We should probably move its code into PageNode or Node. That's a relatively minor change, but it may be necessary to make larger changes easier.) The CompositeExtractor contains a list of extractors and does the actual work: it simply passes the PageNode to each of its extractors, collects all the quads into a single list and returns them.

In a nutshell, this is what happens for each page during the extraction (sketched in code below):
- ExtractionJob: parses the WikiPage with the WikiParser and passes the resulting PageNode to the RootExtractor.
- RootExtractor: creates the PageContext and the subject URI and passes them, together with the PageNode, to the CompositeExtractor.
- CompositeExtractor: passes the PageNode to each of its extractors and collects their quads into one list.
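A compact, hedged sketch of that chain. The class names follow the description above, but the types are reduced to stubs, quads are simplified to plain strings, and the signatures are not the framework's real ones:

```scala
// Simplified sketch of the current chain; stub types, not the framework's real API.
case class WikiPage(title: String, source: String)  // unparsed text plus metadata
case class PageNode(title: String)                  // the wikitext AST produced by WikiParser

// Quads are simplified to plain strings here.
class CompositeExtractor(extractors: Seq[PageNode => Seq[String]]) {
  def extract(node: PageNode): Seq[String] = extractors.flatMap(_(node))
}

class RootExtractor(composite: CompositeExtractor) {
  def extract(node: PageNode): Seq[String] = {
    // would create the PageContext and the subject URI for the page, then delegate
    composite.extract(node)
  }
}

class ExtractionJob(parse: WikiPage => PageNode, root: RootExtractor) {
  def run(page: WikiPage): Seq[String] = root.extract(parse(page))
}
```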
I think that's about it.

New Design

Here's how a new, more flexible structure would work (a code sketch follows after this list):
- ExtractionJob: no longer parses anything itself; it passes the unparsed WikiPage to the main CompositeExtractor.
- There will be multiple instances of CompositeExtractor. The main CompositeExtractor contains a list of ParsingExtractor objects and simply passes the WikiPage to each of them.
- ParsingExtractors are new classes that we add for this design. A ParsingExtractor parses a WikiPage and passes the result to the next extractor. For each format, there is a separate ParsingExtractor class, for example WikitextParsingExtractor and JsonParsingExtractor.
- The "next extractor" will usually also be a CompositeExtractor: one for the WikitextParsingExtractor, holding the extractors that work on PageNode objects (if we don't move the PageContext code to PageNode or Node, we may have to insert the old RootExtractor into the chain here), and one for the JsonParsingExtractor, holding the extractors that work on the JSON AST.
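A hedged sketch of the proposed chain, following the description above. Types are stubs, quads are plain strings, and the format strings are used only for routing in this sketch; none of this is the framework's actual API:

```scala
// Hedged sketch of the proposed design; stub types and simplified signatures.
case class WikiPage(title: String, format: String, source: String)
case class PageNode(title: String)   // wikitext AST (stub)
case class JsonNode(source: String)  // stand-in for Lift's JValue

trait Extractor[I] { def extract(input: I): Seq[String] }  // quads simplified to strings

// A CompositeExtractor just fans its input out to a list of extractors of the same type.
class CompositeExtractor[I](extractors: Seq[Extractor[I]]) extends Extractor[I] {
  def extract(input: I): Seq[String] = extractors.flatMap(_.extract(input))
}

// A ParsingExtractor parses the WikiPage into a format-specific AST and delegates
// to the next extractor (usually another CompositeExtractor).
class WikitextParsingExtractor(next: Extractor[PageNode]) extends Extractor[WikiPage] {
  def extract(page: WikiPage): Seq[String] =
    if (page.format == "text/x-wiki") next.extract(PageNode(page.title)) else Seq.empty
}

class JsonParsingExtractor(next: Extractor[JsonNode]) extends Extractor[WikiPage] {
  def extract(page: WikiPage): Seq[String] =
    if (page.format == "application/json") next.extract(JsonNode(page.source)) else Seq.empty
}

// The ExtractionJob no longer parses anything; it hands the unparsed WikiPage to the
// main CompositeExtractor, which contains the ParsingExtractors.
class ExtractionJob(main: Extractor[WikiPage]) {
  def run(page: WikiPage): Seq[String] = main.extract(page)
}
```

Wiring would then be, roughly, new ExtractionJob(new CompositeExtractor(Seq(new WikitextParsingExtractor(pageNodeExtractors), new JsonParsingExtractor(jsonExtractors)))), with the two inner composites holding the format-specific extractors.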
In addition to the ParsingExtractor objects, the main CompositeExtractor may also contain another CompositeExtractor that simply passes the unparsed WikiPage object to the extractors that don't need a parsed page. These extractor classes (and maybe some more, I'm not sure) currently also receive a PageNode object (just because it was simpler to let them implement the same interface as all other extractors), but all they need is a WikiTitle, which is also contained in WikiPage. Basically, we just have to change their generic type from Extractor[PageNode] to Extractor[WikiPage].

Class names

Currently, the names of the classes and traits in the mappings package are a bit chaotic: the root of the type hierarchy is called Mapping; we should rename it to Extractor. The specific trait for extractors that handle PageNode objects is called Extractor; we should rename it to PageNodeExtractor or WikitextExtractor.
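A compact sketch of how the renamed hierarchy could look. The trait names follow the proposal above; the stub types and string-based quads are simplifications, not the framework's current code:

```scala
// Sketch of the proposed naming; stub types, quads simplified to strings.
case class WikiPage(title: String)
case class PageNode(title: String)

trait Extractor[I] { def extract(input: I): Seq[String] }  // root of the hierarchy (was: Mapping)
trait PageNodeExtractor extends Extractor[PageNode]        // was: Extractor
trait WikiPageExtractor extends Extractor[WikiPage]        // for extractors that only need the title
```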
Your number of 14.677.279 links looks good. In DBpedia 3.8, we had 13.184.401 links from English to other languages.
The discussion on the framework refactoring is moved to the dbpedia-developers mailing list.
Kontokostas Dimitris
The link above only works with login. Strange. Here's the list where we discuss this: https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
Working code merged. We will create issues so we do not forget to include JC's suggestions and can work on them consecutively.
Here are a few drawings that I hope make things clearer in connection with the text above. The arrows show which type of data object is passed from one extractor to the next, or passed to and returned from a parser. I omitted the return types of the extractors because they're always lists of quads.

[Diagram: Current Design]
[Diagram: New Design]
Hi Andrea, we decided to merge your pull request. Thanks for your many contributions! Your code is a good start for handling Wikidata. In the long run, we will have to refactor some stuff - trying to handle JSON data through the API of our wikitext AST won't work well in general - but for now it's fine. Thanks! Christopher
Simple proposal for #30
Had to include the Lift JSON library, as the built-in Scala JSON library is slow (about 5 ms per page, while it took less than 1 ms with Lift).
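For reference, a minimal sketch of the two parser calls being compared. The sample JSON is a made-up stand-in for a Wikidata page, and the timings in the comments are just the numbers reported above, not a proper benchmark:

```scala
import net.liftweb.json.JsonParser.{parse => liftParse}
import scala.util.parsing.json.JSON

object ParserComparison extends App {
  val json = """{"links": {"enwiki": "Berlin", "dewiki": "Berlin"}}"""

  // Scala's bundled parser (scala.util.parsing.json), reported above at roughly 5 ms per page
  val builtIn: Option[Any] = JSON.parseFull(json)

  // Lift JSON, reported above at under 1 ms per page; returns a JValue AST
  val lifted = liftParse(json)

  println(builtIn)
  println(lifted)
}
```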
Tested with wikidatawiki-20130330-pages-articles.xml.bz2 and got 14.677.279 interwiki links.