Query for set information #2

Open
4 tasks
kevinlul opened this issue Jul 26, 2022 · 10 comments

@kevinlul
Contributor

kevinlul commented Jul 26, 2022

Collecting set information is the last piece for YAML Yugi to exceed parity with other solutions. Unlike the other data collected so far, which live in flat categories, sets are indexed on Yugipedia in hierarchical categories: the target category for sets may not directly contain the article about a set, because categories can be nested. When querying the MediaWiki API, only the immediate members of a category are returned. These include the names of child categories, but not the members of those child categories. Therefore, new code is required to download entire category hierarchies and subscribe to updates on them. Category hierarchies are allowed to contain cycles, and while this is not expected of the categories for sets, our code should remain correct if cycles are encountered and must not fall into an infinite loop.

Design

Either create a new script or extend the current full download script to recursively download a targeted category without falling into infinite loops. For example, after fetching https://yugipedia.com/api.php?action=query&redirects=true&generator=categorymembers&prop=revisions&rvprop=content&format=json&formatversion=2&gcmlimit=50&gcmtitle=Category:Yu-Gi-Oh!_Master_Duel_sets, the ns=14 category items in the response should be stored in an ordered set for follow-up requests once the current category is completely downloaded.
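A minimal sketch of that traversal, not the repository's actual code: `crawl_categories` and the `fetch_members` callable are hypothetical names, and `fetch_members` stands in for whatever performs the paginated `generator=categorymembers` requests. A plain `dict` is used as the ordered set, since it preserves insertion order.

```python
from collections import deque
from typing import Callable, Iterable

CATEGORY_NAMESPACE = 14  # MediaWiki namespace number for Category: pages


def crawl_categories(
    root: str,
    fetch_members: Callable[[str], Iterable[dict]],
) -> tuple[list[str], list[dict]]:
    """Visit `root` and each descendant category exactly once.

    `fetch_members` is a hypothetical callable that returns every member
    page of one category as dicts with at least "ns" and "title" keys.
    """
    discovered: dict[str, None] = {root: None}  # dict used as an ordered set
    pending = deque([root])
    pages: list[dict] = []
    while pending:
        category = pending.popleft()
        for page in fetch_members(category):
            if page["ns"] == CATEGORY_NAMESPACE:
                # Queue a child category only the first time it is seen,
                # so a cycle cannot cause an infinite loop.
                if page["title"] not in discovered:
                    discovered[page["title"]] = None
                    pending.append(page["title"])
            else:
                pages.append(page)
    return list(discovered), pages
```

The returned category list could be cached for later incremental runs. An article that belongs to several categories appears once per category in `pages`, so a real implementation may want to deduplicate by title or pageid.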

To subscribe to incremental updates, the existing script can be used, but each time, it should be called with all the known descendant categories cached from the last full download, in addition to the top-level category itself. This is because the MediaWiki API only provides the immediate parent categories of an article, not all ancestor categories.
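As a rough illustration of why the cached descendant list is needed, the check below decides whether a changed page belongs to the tracked hierarchy using only its immediate parent categories. The function name is hypothetical, and it assumes the page was fetched with `prop=categories` and `formatversion=2`, so its parents appear as a `categories` list of objects with a `title` field. This is a sketch of the idea, not how incremental.py currently works.

```python
def in_tracked_hierarchy(page: dict, known_categories: set[str]) -> bool:
    """Return True if any immediate parent category of `page` is tracked.

    `known_categories` is assumed to hold the top-level category plus every
    descendant category cached from the last full download.
    """
    parents = {category["title"] for category in page.get("categories", [])}
    return not parents.isdisjoint(known_categories)
```

With only the top-level category in `known_categories`, a set article filed under a nested child category would be missed, which is the gap the cached list closes.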

Subtasks

@kevinlul
Contributor Author

kevinlul commented Apr 11, 2024

Notes:

  • Recursive full download needs to return a list of found categories (ns=14) after each page downloaded
  • In the main loop, this is appended to an OrderedSet
  • There's an additional outer loop iterating over the OrderedSet, thus fetching all categories, without infinite recursion in the case of cycles, because the category will already be in the set and have been iterated

Should I just add the second return value for the list of categories or switch to OOP?

@xyj-3

xyj-3 commented Jul 20, 2024

Is OrderedSet a specific thing? Also what do you mean by "add the second return value for the list of categories or switch to OOP"?

Also do you want the new downloaded files to be flat in the top level category folder or nested?

@xyj-3

xyj-3 commented Jul 21, 2024

How do gcmcontinue and grccontinue work? When do you use them, and what value do you give them?

@kevinlul
Contributor Author

I'm describing the changes that need to happen to the main logic in the download function in https://github.com/DawnbrandBots/yaml-yugipedia/blob/master/src/utils.py

Currently the category is specified to the MediaWiki API by the gcmtitle URL parameter in main.py. However, this only retrieves direct members of the category, so the download logic needs to keep track of child category pages that were retrieved, to be downloaded by another request to the MediaWiki API. I mentioned an OrderedSet because that is one way to keep track of the categories already downloaded and newly discovered in order to avoid infinite looping.

gcmcontinue and grccontinue are pagination tokens returned by the MediaWiki API when a generator is used and the results don't fit in a single response. In the download scripts, the token is populated from the previous response so that all pages are downloaded, but it can also be provided on the command line to resume a previous set of downloads from the middle. The parameter varies by script. For main.py, the generator categorymembers is used, so the parameter is gcmcontinue. For incremental.py, the generator recentchanges is used, so the parameter is grccontinue.

https://yugipedia.com/api.php
https://www.mediawiki.org/wiki/API:Query#query:generator
https://www.mediawiki.org/w/api.php?action=help&modules=query%2Bcategorymembers
https://www.mediawiki.org/w/api.php?action=help&modules=query%2Brecentchanges
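To make the continuation tokens concrete, here is a small sketch of a pagination loop. It is not the code in utils.py: `fetch_all_batches` and `get_json` are hypothetical names, with `get_json` standing in for a function that sends the query parameters to api.php and returns the decoded JSON. It relies on the MediaWiki behaviour that a partial response carries a top-level `continue` object whose fields are meant to be merged into the next request.

```python
from typing import Callable, Iterator


def fetch_all_batches(
    params: dict,
    get_json: Callable[[dict], dict],
) -> Iterator[dict]:
    """Yield every response batch for one query, following `continue`.

    `params` may already contain a gcmcontinue or grccontinue value, for
    example one supplied on the command line to resume a download.
    """
    request = dict(params)
    while True:
        response = get_json(request)
        yield response
        if "continue" not in response:
            break  # no more batches
        # Merge the returned tokens (such as gcmcontinue) into the
        # original parameters for the next request.
        request = {**params, **response["continue"]}
```

Because the whole `continue` object is merged, the same loop works for either generator without naming gcmcontinue or grccontinue explicitly.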

@xyj-3

xyj-3 commented Jul 24, 2024

I got it to work with recursion and a second return value but it doesn't look that nice right now so I'm considering restructuring it.

The biggest issue so far is actually getting an identifier for the top category for preventing loops. The generator=categorymembers doesn't return any info about the category itself.

I figure you can use pageid or title to detect looping. In that case, it looks like you have to do one of the following:

  • Change your categories.txt to be a list of pageids
  • Make a request for every category to get the pageid/title
  • Do some string manipulation to transform between "TCG_Speed_Duel_Forbidden_%26_Limited_Lists" and "Category:TCG Speed Duel Forbidden & Limited Lists"

What do you think? Do you have a preference? Otherwise I'll probably go with making another request for every category.
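For reference, the string-manipulation option could look roughly like the sketch below. The helper name `category_title` is made up for illustration, and it assumes the entries in categories.txt are percent-encoded with underscores in place of spaces, as in the example above. It does not reproduce every MediaWiki title normalization rule.

```python
from urllib.parse import unquote


def category_title(slug: str) -> str:
    """Convert a URL-style category name to the title form the API returns."""
    # Decode percent escapes such as %26, then swap underscores for spaces.
    name = unquote(slug).replace("_", " ")
    # Leave the prefix alone if the entry already includes it.
    if name.startswith("Category:"):
        return name
    return f"Category:{name}"
```

The result can then be compared directly against the `title` field of ns=14 items when checking for loops.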

@kevinlul
Contributor Author

Feel free to restructure. I already anticipated it would be necessary and there's actually very little code in this repository. The only interface that needs to be respected for full downloads is the command-line interface. Everything else is an implementation detail.
