Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions .changelog/current/2711-fix-microdata-parser.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
# Fix

- Parse Microdata correctly using itemscope properties
10 changes: 5 additions & 5 deletions lib/Helper/HTMLParser/HttpMicrodataParser.php
Original file line number Diff line number Diff line change
Expand Up @@ -13,9 +13,9 @@
/**
* This class is an AbstractHtmlParser which tries to extract micro data from the HTML page.
* @author Christian Wolf
* @todo Nutrition data is missing

Check warning on line 16 in lib/Helper/HTMLParser/HttpMicrodataParser.php

View workflow job for this annotation

GitHub Actions / Check for added todo messages

Found @todo Nutrition data is missing
* @todo Category needs checking

Check warning on line 17 in lib/Helper/HTMLParser/HttpMicrodataParser.php

View workflow job for this annotation

GitHub Actions / Check for added todo messages

Found @todo Category needs checking
* @todo Tools need to be imported

Check warning on line 18 in lib/Helper/HTMLParser/HttpMicrodataParser.php

View workflow job for this annotation

GitHub Actions / Check for added todo messages

Found @todo Tools need to be imported
*/
class HttpMicrodataParser extends AbstractHtmlParser {
/**
Expand All @@ -35,10 +35,10 @@
public function parse(DOMDocument $document, ?string $url): array {
$this->xpath = new DOMXPath($document);

$selectorHttp = "//*[@itemtype='http://schema.org/Recipe']";
$selectorHttps = "//*[@itemtype='https://schema.org/Recipe']";
$selectorHttpWww = "//*[@itemtype='http://www.schema.org/Recipe']";
$selectorHttpsWww = "//*[@itemtype='https://www.schema.org/Recipe']";
$selectorHttp = "//*[@itemscope and @itemtype='http://schema.org/Recipe']";
$selectorHttps = "//*[@itemscope and @itemtype='https://schema.org/Recipe']";
$selectorHttpWww = "//*[@itemscope and @itemtype='http://www.schema.org/Recipe']";
$selectorHttpsWww = "//*[@itemscope and @itemtype='https://www.schema.org/Recipe']";
$xpathSelector = "$selectorHttp | $selectorHttps | $selectorHttpWww | $selectorHttpsWww";

$recipes = $this->xpath->query($xpathSelector);
Expand Down Expand Up @@ -208,7 +208,7 @@
* @return DOMNodeList A list of all found child nodes with the given property
*/
private function searchChildEntries(DOMNode $recipeNode, string $prop): DOMNodeList {
return $this->xpath->query(".//*[@itemprop='$prop']", $recipeNode);
return $this->xpath->query(".//*[@itemprop='$prop' and count(ancestor::*[@itemscope]) = 1]", $recipeNode);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unsure about this XPath stuff right now. Especially the hard-coded 1 makes me nervous.

Also one test case, that I can think of (and that might cause problems) but I have not found a concrete example of: Let the top-most element be a non-recipe. Like a WebPage. The WebPage is about a Recipe. That way, the Recipe is no longer the top-most element. How will ancestor::*[@itemscope] then behave? Will it detect the recipe's props or the Webpages?

Maybe you could even create a test case (similar to the PR introduction) and add similarly to the test suite?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the failing test cases fail because... they are not actually valid Microdata. If you run them through https://validator.schema.org/ you will see that it can't see any elements either, same goes for https://foolip.org/microdatajs/live/. The spec says:

The itemtype attribute can only be specified on elements which have an itemscope attribute specified.
https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/itemtype

But these files specify an itemtype without an itemscope. I added this requirement to make sure the lookup for the parent itemscope works correctly. We can try to relax this requirement, but if even the official validator can't see them, then I don't expect this happens often in the real world. How would you like me to proceed here, add the missing itemscope to the test cases, or make the parser more lenient?

This is the case for: caseB.html, caseC.html, caseD.html, caseE.html, caseImageAttribute.html, caseImageContent.html, caseIngredient.html and caseInstruction.html.


The only actually failing test case is caseFix1209.html, and it seems to be exactly the issue with the parent WebPage itemscope you mentioned. I'm not really good with XPath so I wrongly assumed the ancestor query would be limited to the subtree of the node specified through $recipeNode, but this it not the case. Sadly I can't find a way around this without using XPath 2.0, which is not supported by DOMXPath. The easiest way I can see to fix this is to create a child DOMDocument containing only the recipe using DOMDocument::importNode, create a separate DOMXPath instance for it, and keep the current query that checks if we are not deeper than 1 itemscope from the root. Would you like me to do that, or perhaps we should replace the XPath queries with manual iteration over the nodes in PHP?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think, I fixed the cases you mentioned (B through E, Image*, as well as Ingredient and Instruction). I think I fixed these in the tests.

Regarding the test Fix1209, this was added for #1209 and its fix in #1220. I am a bit more concerned as this might break the existing code somewhat. For the given test case, this is maybe even a good break (as it actually separates the instructions at least partially). However, it drops the first instruction without notice.

I marked it as incomplete for the meantime. If you see any good way to fix this or to work around it, feel free to give me a hint (or better open another PR 😉 )


Regarding the new parser routines, please do not put too much effort into it. There is a plan to rework the complete parser routines (also for the JSON+LD and other parsers). So, this would be wasted time. You are welcome to help with the rework if you want to. The point is to put the parsers on a common base.

}

/**
Expand Down
19 changes: 15 additions & 4 deletions tests/Unit/Helper/HTMLParser/HttpMicrodataParserTest.php
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,11 @@
'caseD' => ['caseD.html',true,'caseD.json'],
'caseE' => ['caseE.html',true,'caseE.json'],

'caseIssue1209' => ['caseFix1209.html',true,'caseFix1209.json'],
'caseIssue1209' => ['caseFix1209.html',true,'caseFix1209.json', true],
'caseIssue1617a' => ['caseIssue1617a.html',true,'caseIssue1617a.json'],
'caseIssue1617b' => ['caseIssue1617b.html',true,'caseIssue1617b.json'],
'caseIssue1617c' => ['caseIssue1617c.html',true,'caseIssue1617c.json'],
'casePR2711' => ['case2711.html', true, 'case2711.json'],
];
}

Expand All @@ -49,8 +53,13 @@
* @param mixed $filename
* @param mixed $valid
* @param mixed $jsonFile
* @param bool $skipped
*/
public function testHTMLFile($filename, $valid, $jsonFile): void {
public function testHTMLFile($filename, $valid, $jsonFile, $skipped = false): void {
if ($skipped) {
$this->markTestIncomplete();

Check warning on line 60 in tests/Unit/Helper/HTMLParser/HttpMicrodataParserTest.php

View workflow job for this annotation

GitHub Actions / Run the unittests (mysql, 31, 8.4, apache)

The test testHTMLFile with data set "caseIssue1209" is skipped.

Check warning on line 60 in tests/Unit/Helper/HTMLParser/HttpMicrodataParserTest.php

View workflow job for this annotation

GitHub Actions / Run the unittests (sqlite, 29, 8.2, apache)

The test testHTMLFile with data set "caseIssue1209" is skipped.

Check warning on line 60 in tests/Unit/Helper/HTMLParser/HttpMicrodataParserTest.php

View workflow job for this annotation

GitHub Actions / Run the unittests (pgsql, 30, 8.3, nginx)

The test testHTMLFile with data set "caseIssue1209" is skipped.

Check warning on line 60 in tests/Unit/Helper/HTMLParser/HttpMicrodataParserTest.php

View workflow job for this annotation

GitHub Actions / Run the tests against the master branch of the NC server

The test testHTMLFile with data set "caseIssue1209" is skipped.
}

$l = $this->createStub(IL10N::class);

$parser = new HttpMicrodataParser($l);
Expand All @@ -66,12 +75,14 @@
$jsonDest = file_get_contents(__DIR__ . "/res_Microdata/$jsonFile");
$expected = json_decode($jsonDest, true);

// $this->markTestSkipped();

// print_r(json_encode($res));

$this->assertTrue($valid);
$this->assertTrue($valid, 'Expected HtmlParsingException as the input was invalid. No exception was thrown.');
$this->assertEquals($expected, $res);
} catch (HtmlParsingException $ex) {
$this->assertFalse($valid);
$this->assertFalse($valid, 'Not expected HtmlParsingException was triggered.');
}
}

Expand Down
10 changes: 10 additions & 0 deletions tests/Unit/Helper/HTMLParser/res_Microdata/case2711.html
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
<html>
<body>
<div itemscope itemtype="https://schema.org/Recipe">
<span itemprop="name">Sample recipe name</span>
<div itemprop="author" itemscope itemtype="https://schema.org/Person">
<span itemprop="name">Sample person</span>
</div>
</div>
</body>
</html>
6 changes: 6 additions & 0 deletions tests/Unit/Helper/HTMLParser/res_Microdata/case2711.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
{
"@type": "Recipe",
"@context": "http://schema.org",
"name": "Sample recipe name",
"recipeInstructions": []
}
2 changes: 1 addition & 1 deletion tests/Unit/Helper/HTMLParser/res_Microdata/caseB.html
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
</head>

<body>
<article itemtype='https://schema.org/Recipe'>Foo</article>
<article itemscope="" itemtype='https://schema.org/Recipe'>Foo</article>
Bar

Baz
Expand Down
2 changes: 1 addition & 1 deletion tests/Unit/Helper/HTMLParser/res_Microdata/caseD.html
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
</head>

<body>
<article itemtype='https://schema.org/Recipe'>
<article itemscope="" itemtype='https://schema.org/Recipe'>
Foo
<span itemprop='name'>MyName</span>
<span itemprop='keywords'>foo,bar,baz</span>
Expand Down
2 changes: 1 addition & 1 deletion tests/Unit/Helper/HTMLParser/res_Microdata/caseE.html
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
</head>

<body>
<article itemtype='https://schema.org/Recipe'>
<article itemscope="" itemtype='https://schema.org/Recipe'>
Foo
<span itemprop='name'>MyName</span>
<span itemprop='keywords'>foo,bar,baz</span>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
</head>

<body>
<article itemtype='https://schema.org/Recipe'>
<article itemscope="" itemtype='https://schema.org/Recipe'>
Foo
<img itemprop='%IMAGE_NAME%' src="A">
<img itemprop='%IMAGE_NAME%' src='B'>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
</head>

<body>
<article itemtype='https://schema.org/Recipe'>
<article itemscope="" itemtype='https://schema.org/Recipe'>
Foo
<span itemprop='%IMAGE_NAME%'>A</span>
<span itemprop='%IMAGE_NAME%'>B</span>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
</head>

<body>
<article itemtype='https://schema.org/Recipe'>
<article itemscope="" itemtype='https://schema.org/Recipe'>
Foo
<ul>
<li itemprop='%INGREDIENT_NAME%'>A</li>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
</head>

<body>
<article itemtype='https://schema.org/Recipe'>
<article itemscope="" itemtype='https://schema.org/Recipe'>
Foo
<ul>
<li itemprop='%INSTRUCTION_NAME%'>A</li>
Expand Down
2,056 changes: 2,056 additions & 0 deletions tests/Unit/Helper/HTMLParser/res_Microdata/caseIssue1617a.html

Large diffs are not rendered by default.

31 changes: 31 additions & 0 deletions tests/Unit/Helper/HTMLParser/res_Microdata/caseIssue1617a.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
{
"@type": "Recipe",
"@context": "http://schema.org",
"name": "Eier-Curry",
"keywords": "Leben,Rezepte,Eier- & Mehlspeisen,Eier,Ernährungsformen,Vegetarische Rezepte,Rezepte nach Zutaten,Reis,Curry,Kokosmilch,Reisachwald",
"recipeYield": "4",
"recipeIngredient": [
"200 g Langkornreis",
"500 ml Wasser",
"etwas Salz",
"20 g Butter",
"6 Eier (Größe M)",
"2 Zwiebeln, rot",
"1 Knoblauchzehe",
"1 Stück Ingwer, ca. 2 cm",
"1 Chilischote",
"400 g gehackte Tomaten",
"400 ml Kokosmilch",
"30 g Butter",
"2 EL Currypulver",
"1 TL Kreuzkümmel, gemahlen",
"1 TL Koriander, gemahlen",
"1 TL Senfsamen",
"etwas Salz",
"1 EL Zucker",
"1 Bund Koriander oder Petersilie"
],
"recipeInstructions": [
"1. Reis kalt abbrausen. Reis, Wasser und Salz in einen Topf geben, aufkochen lassen. Temperatur herunterschalten und ca. 20-25 Minuten quellen lassen. ,2. Eier anpieken und in kochendem Wasser ca. 7-8 Minuten wachsweich garen. Danach die Eier abgießen und kalt abbrausen. ,3. Die Schalen von Zwiebeln und Knoblauch abziehen. Zwiebeln fein würfeln, Knoblauch hacken. ,4. Ingwer schälen und ebenfalls fein hacken. Chilischote halbieren, entkernen und hacken. ,5. Butter erhitzen (wahlweise etwas Fett von der Kokosmilch verwenden). Zwiebeln darin andünsten bis goldgelb sind. ,6. Knoblauch, Chili und Ingwer zugeben und ebenfalls kurz andünsten. Danach Currypulver, Kreuzkümmel, Koriander und Senfsamen untermischen und erhitzen. Zucker darüberstreuen. Tomaten unterrühren und alles, abgedeckt, ca. 20 Minuten köcheln lassen. ,7. Deckel abnehmen, Kokosmilch unterrühren und ohne Deckel weitere ca. 10 Minuten köcheln. Mit Salz abschmecken. ,8. Eier schälen, halbieren, in die Soße geben und erwärmen. Koriander abbrausen, trockenschütteln und fein schneiden. ,9. Butter unter den Reis mischen. Koriander über das Curry streuen. Mit dem Reis anrichten und servieren."
]
}
2 changes: 2 additions & 0 deletions tests/Unit/Helper/HTMLParser/res_Microdata/caseIssue1617a.url
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
https://www.swrfernsehen.de/kaffee-oder-tee/rezepte/eier-curry-100.html

217 changes: 217 additions & 0 deletions tests/Unit/Helper/HTMLParser/res_Microdata/caseIssue1617b.html

Large diffs are not rendered by default.

32 changes: 32 additions & 0 deletions tests/Unit/Helper/HTMLParser/res_Microdata/caseIssue1617b.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
{
"@type": "Recipe",
"@context": "http://schema.org",
"name": "Butter chicken",
"keywords": "Butter chicken, dania mięsne, obiad, Sylwester",
"recipeYield": "około 950 g dania (bez ryżu)",
"image": [
"https://cdn.aniagotuje.com/pictures/articles/2024/01/55155580-v-1500x1500.jpg"
],
"recipeIngredient": [
"1 bardzo duża pierś z kurczaka - do 800 g",
"170 g kremowego jogurtu naturalnego - pół szklanki",
"2 ząbki czosnku - około 10 g",
"około 2 cm korzenia imbiru lub 1 łyżeczka suszonego w proszku",
"2 łyżki oliwy lub innego oleju do smażenia - około 20 ml",
"przyprawy: 1 płaska łyżka garam masala; po 1/4 łyżeczki kuminu i soli; po sporej szczypcie chili i kurkumy",
"1 duża cebula - do 250 g",
"250 g pomidorów krojonych bez skórki - mogą być z puszki",
"150 ml śmietanki kremówki 30 % - około 3/5 szklanki",
"5 ząbków czosnku - około 25 g",
"2 łyżki masła klarowanego - do 40 g",
"przyprawy: 1 płaska łyżka garam masala; po płaskiej łyżeczce imbiru, soli, cukru, kuminu; po pół płaskiej łyżeczki chili i mielonej kolendry"
],
"recipeInstructions": [
"Krok 1: Zamarynuj kurczaka\n \n \n Bardzo dużą pierś z kurczaka oczyść z żyłek, czy chrząstek, a następnie pokrój na kawałki nie mniejsze niż 2 cm. Do miski z mięsem dodaj około pół szklanki kremowego jogurtu naturalnego oraz dwa ząbki czosnku (obrane i przepuszczone przez praskę lub drobno posiekane). Tak samo przygotuj i dodaj około 2 cm korzenia imbiru (można zastąpić łyżeczką imbiru suszonego). Dodaj też 1 płaską łyżkę mieszanki przypraw garam masala, po 1/4 łyżeczki kuminu (kminu rzymskiego) i soli oraz po sporej szczypcie chili i kurkumy. \n Całość dokładnie wymieszaj i odłóż obok na blacie, na godzinę. Jeśli planujesz szykować danie później to miskę trzymaj w lodówce - nawet przez całą noc (pod przykryciem). ",
"Krok 2: Usmaż kawałki mięsa\n \n \n Miskę wyjmij wcześniej z lodówki, by mięso osiągnęło temperaturę pomieszczenia. Nagrzej dużą patelnię z grubym dnem. Wlej dwie łyżki oliwy lub innego oleju do smażenia. Na patelni kładź kawałki kurczaka (razem z marynatą) i smaż przez kilka minut na rumiano. Mięso może się mocniej zarumienić. Usmażone zdejmij na miskę i odstaw na bok. Z patelni usuń resztki po smażeniu i użyj jej do szykowania sosu (bez mycia). ",
"Krok 3: Szykuj sos\n \n \n Na patelnię po mięsie wyłóż dwie solidne łyżki masła klarowanego. Cebulę obierz i pokrój w kostkę lub cienkie piórka i wyłóż na patelnię. Podsmażaj ją na średniej mocy palnika przez około 5 minut. Może się lekko zarumienić. Następnie wyłóż też obrane i przepuszczone przez praskę lub posiekane ząbki czosnku (5 sztuk). Całość wymieszaj i dodaj przyprawy: 1 płaska łyżka garam masala; po płaskiej łyżeczce imbiru, soli, cukru, kminu; po pół płaskiej łyżeczki chili i mielonej kolendry.\n Od razu dodaj również 250 gramów pomidorów krojonych bez skórki (mogą być z puszki). Wymieszaj sos. Zredukuj moc palnika, by tylko lekko mrugał i gotuj go na patelni bez przykrywki jeszcze minimum 10-15 minut. Sos ma zgęstnieć.Porada: Jeśli masz suszoną kozieradkę, to możesz dodać szczyptę razem z innymi przyprawami. Uważaj przy dodawaniu chili do dania. Moc użytego chili w proszku bywa bardzo różna. Lepiej jest dać go mniej i w razie potrzeby dołożyć pod sam koniec szykowania butter chicken. ",
"Krok 4: Zmiksuj sos\n \n \n Cały gorący sos umieść w malakserze i zmiksuj na gładko. Możesz tez użyć blendera ręcznego typu żyrafa. Patelni nie myj. ",
"Krok 5: Połącz składniki i gotuj\n \n \n Zmiksowany sos ponownie umieść na patelni. Wlej 150 ml śmietanki kremówki 30 % (około 3/5 szklanki). Wyłóż również całe podsmażone mięso z miski (razem z sosem, który zebrał się na dnie miski). Wymieszaj butter chicken i gotuj na niskiej patelni przez około 10 minut. Sos jeszcze zgęstnieje i oblepi mięso. \n Sprawdź smak gotowego dania i w razie potrzeby dodaj więcej wybranych przypraw. ",
"Krok 6: Dopraw i podawaj\n \n \n Butter chicken posyp posiekaną kolendrą lub natką pietruszki i podawaj z gorącym ryżem lub z domowym chlebem Naan! Tutaj przepis na chlebek Naan. "
]
}
2 changes: 2 additions & 0 deletions tests/Unit/Helper/HTMLParser/res_Microdata/caseIssue1617b.url
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
https://aniagotuje.pl/przepis/butter-chicken

Loading
Loading