Extract parsing into its own service class #553
Merged
30 commits
| Commit | Message | Author |
|---|---|---|
| 9fa557a | Created basic structure for a download service | christianlupus |
| f01786f | Save HTML in object field for later recovery | christianlupus |
| 3855fe2 | Integrating download service partially with existing RecipeService | christianlupus |
| a98a1a2 | Corrected casing of class names | christianlupus |
| 145d9f8 | Basic HTML parser structure plus a JSON+LD parser created | christianlupus |
| e06c657 | Created basic Microdata parser | christianlupus |
| fe7533e | Include extraction service in import routines | christianlupus |
| 0f35d9a | Fixed some bugs in the code | christianlupus |
| a7dda3d | Added Changelog | christianlupus |
| e5fc5bd | Writing test cases for new code | christianlupus |
| 0ee3f89 | Created tests for JSON+LD metadata parser | christianlupus |
| 2f0a3d1 | Reanmed resource folder | christianlupus |
| 11aad12 | Added test for microdata parser | christianlupus |
| 74c9c4b | Typo in filename | christianlupus |
| fad8dcf | Added test for html decoder | christianlupus |
| 9267ff4 | Updated test namespace | christianlupus |
| 77c78c3 | Added parser for HTML files | christianlupus |
| 7b9f60b | Adding a download service | christianlupus |
| 4fe07d0 | Added ImportException to code coverage report | christianlupus |
| ae683ab | Make test code compatible with PHP 7.3 | christianlupus |
| 19b6f43 | Corrected code styling | christianlupus |
| e3a57a7 | Fix #724 | christianlupus |
| 07ada44 | Added test case | christianlupus |
| 7b0a5b1 | Fixing code style after big rebase | christianlupus |
| c608c49 | Apply suggestions regarding typos and language from code review | christianlupus |
| 457e88d | Fix PR checks | christianlupus |
| aac0600 | Apply suggestions from code review | christianlupus |
| 0f6749f | Fix some manual corrections as suggested in code review | christianlupus |
| d008b4f | Fixed test cases | christianlupus |
| 8651abd | Corrected Workflow | christianlupus |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| <?php | ||
|
|
||
| namespace OCA\Cookbook\Exception; | ||
|
|
||
| class HtmlParsingException extends \Exception { | ||
| public function __construct($message = '', $code = 0, $previous = null) { | ||
| parent::__construct($message, $code, $previous); | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| <?php | ||
|
|
||
| namespace OCA\Cookbook\Exception; | ||
|
|
||
| class ImportException extends \Exception { | ||
| public function __construct($message = '', $code = 0, $previous = null) { | ||
| parent::__construct($message, $code, $previous); | ||
| } | ||
| } |
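
The two exception classes above only forward their arguments to `\Exception`. A minimal sketch of how they might be used together, assuming a parser failure should surface to callers as an import failure: the helper `importRecipe()` and the message prefix are hypothetical and not part of this PR, and the classes are redeclared without namespaces so that the snippet runs on its own.

```php
<?php

// Stand-ins for the two exception classes in the diff (namespaces omitted).
class HtmlParsingException extends \Exception {}
class ImportException extends \Exception {}

// Hypothetical helper, not part of the PR: runs a parser callback and
// converts a parsing failure into an import failure.
function importRecipe(callable $parser): array {
    try {
        return $parser();
    } catch (HtmlParsingException $ex) {
        // Pass the original exception as $previous so it stays reachable
        // through getPrevious().
        throw new ImportException('Import failed: ' . $ex->getMessage(), 0, $ex);
    }
}

try {
    importRecipe(function (): array {
        throw new HtmlParsingException('No recipe was found.');
    });
} catch (ImportException $ex) {
    echo $ex->getMessage(), PHP_EOL;             // Import failed: No recipe was found.
    echo get_class($ex->getPrevious()), PHP_EOL; // HtmlParsingException
}
```

Chaining through `$previous` keeps the low-level cause available for logging while callers only need to catch one exception type.
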
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,15 @@ | ||
| <?php | ||
|
|
||
| namespace OCA\Cookbook\Helper\HTMLFilter; | ||
|
|
||
| abstract class AbstractHtmlFilter { | ||
|
|
||
| /** | ||
| * Filter the HTML according to the rules of this class | ||
| * | ||
| * This method operates on the original HTML code, which is passed by reference, and may therefore modify the HTML string. | ||
| * | ||
| * @param string $html The HTML code to be filtered | ||
| */ | ||
| abstract public function apply(string &$html): void; | ||
| } |
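
Because `apply()` takes the HTML by reference, several filters can be run one after another on the same string. The following is a rough sketch of that idea, not code from this PR: `StripCommentsFilter` and `applyFilters()` are hypothetical names, and the abstract class is redeclared without its namespace to keep the snippet self-contained.

```php
<?php

// Stand-in mirroring the abstract class from the diff (namespace omitted).
abstract class AbstractHtmlFilter {
    abstract public function apply(string &$html): void;
}

// Hypothetical filter, for illustration only: strips HTML comments.
class StripCommentsFilter extends AbstractHtmlFilter {
    public function apply(string &$html): void {
        // preg_replace() returns null on error; keep the input in that case.
        $html = preg_replace('/<!--.*?-->/s', '', $html) ?? $html;
    }
}

// Hypothetical helper: run the given filters in order on one string.
function applyFilters(array $filters, string $html): string {
    foreach ($filters as $filter) {
        $filter->apply($html); // modifies $html in place
    }
    return $html;
}

$result = applyFilters([new StripCommentsFilter()], '<p>Flour<!-- ad --></p>');
echo $result, PHP_EOL; // <p>Flour</p>
```
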
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| <?php | ||
|
|
||
| namespace OCA\Cookbook\Helper\HTMLFilter; | ||
|
|
||
| class HtmlEntityDecodeFilter extends AbstractHtmlFilter { | ||
| public function apply(string &$html): void { | ||
| $html = html_entity_decode($html); | ||
| } | ||
| } |
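
The filter above calls `html_entity_decode()` with its default flags. Those defaults changed in PHP 8.1 (from `ENT_COMPAT` to `ENT_QUOTES | ENT_SUBSTITUTE`, both with `ENT_HTML401`), so single-quote entities may be handled differently depending on the PHP version. A small sketch with explicit flags, assuming UTF-8 input (the sample string is made up):

```php
<?php

$raw = '&lt;b&gt;Salt &amp; pepper&lt;/b&gt; &#039;to taste&#039;';

// Explicit flags make the result independent of the PHP version's defaults.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');
echo $decoded, PHP_EOL; // <b>Salt & pepper</b> 'to taste'

// With ENT_COMPAT (the default before PHP 8.1) single quotes stay encoded.
$compat = html_entity_decode($raw, ENT_COMPAT | ENT_HTML401, 'UTF-8');
echo $compat, PHP_EOL; // <b>Salt & pepper</b> &#039;to taste&#039;
```
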
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,27 @@ | ||
| <?php | ||
|
|
||
| namespace OCA\Cookbook\Helper\HTMLParser; | ||
|
|
||
| use OCA\Cookbook\Exception\HtmlParsingException; | ||
| use OCP\IL10N; | ||
|
|
||
| abstract class AbstractHtmlParser { | ||
|
|
||
| /** | ||
| * @var IL10N | ||
| */ | ||
| protected $l; | ||
|
|
||
| public function __construct(IL10N $l10n) { | ||
| $this->l = $l10n; | ||
| } | ||
|
|
||
| /** | ||
| * Extract the recipe from the given document. | ||
| * | ||
| * @param \DOMDocument $document The document to parse | ||
| * @return array The JSON content in the document as a PHP array | ||
| * @throws HtmlParsingException If the parsing was not successful | ||
| */ | ||
| abstract public function parse(\DOMDocument $document): array; | ||
| } |
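
`parse()` expects a ready-made `\DOMDocument`. How the PR builds that document is not visible in this part of the diff, so the following is only a sketch of one common way to load real-world HTML with PHP's DOM extension while keeping libxml warnings out of the output. The sample markup is made up.

```php
<?php

$html = '<html><head><title>Pancakes</title></head><body><p>Mix & bake</p></body></html>';

// Collect libxml warnings internally instead of emitting PHP warnings for
// sloppy markup (such as the unescaped "&" above).
$previous = libxml_use_internal_errors(true);

$document = new \DOMDocument();
$loaded = $document->loadHTML($html);

// Discard the collected warnings and restore the previous error handling.
libxml_clear_errors();
libxml_use_internal_errors($previous);

if ($loaded === false) {
    throw new \RuntimeException('HTML could not be loaded.');
}

$title = $document->getElementsByTagName('title')->item(0)->textContent;
echo $title, PHP_EOL; // Pancakes
```

A document loaded this way could then be handed to any concrete `AbstractHtmlParser` subclass.
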
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| <?php | ||
|
|
||
| namespace OCA\Cookbook\Helper\HTMLParser; | ||
|
|
||
| class AttributeNotFoundException extends \Exception { | ||
| public function __construct($message = '', $code = 0, $previous = null) { | ||
| parent::__construct($message, $code, $previous); | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,187 @@ | ||
| <?php | ||
|
|
||
| namespace OCA\Cookbook\Helper\HTMLParser; | ||
|
|
||
| use OCA\Cookbook\Exception\HtmlParsingException; | ||
| use OCP\IL10N; | ||
| use OCA\Cookbook\Service\JsonService; | ||
|
|
||
| /** | ||
| * This class is an AbstractHtmlParser which tries to extract a JSON-LD script from the HTML page. | ||
| * @author Christian Wolf | ||
| */ | ||
| class HttpJsonLdParser extends AbstractHtmlParser { | ||
|
|
||
| /** | ||
| * @var JsonService | ||
| */ | ||
| private $jsonService; | ||
|
|
||
| public function __construct(IL10N $l10n, JsonService $jsonService) { | ||
| parent::__construct($l10n); | ||
|
|
||
| $this->jsonService = $jsonService; | ||
| } | ||
|
|
||
| public function parse(\DOMDocument $document): array { | ||
| $xpath = new \DOMXPath($document); | ||
|
|
||
| $json_ld_elements = $xpath->query("//*[@type='application/ld+json']"); | ||
|
|
||
| foreach ($json_ld_elements as $json_ld_element) { | ||
| if (!$json_ld_element || !$json_ld_element->nodeValue) { | ||
| continue; | ||
| } | ||
|
|
||
| try { | ||
| return $this->parseJsonLdElement($json_ld_element); | ||
| } catch (HtmlParsingException $ex) { | ||
| // Parsing failed for this element. Let's see if there are more... | ||
| } | ||
| } | ||
|
|
||
| throw new HtmlParsingException($this->l->t('Could not find recipe in HTML code.')); | ||
| } | ||
|
|
||
| /** | ||
| * Parse a JSON-LD element in the DOM tree for a recipe | ||
| * | ||
| * @param \DOMNode $node The node to parse | ||
| * @throws HtmlParsingException The node does not contain a valid recipe | ||
| * @return array The recipe as an associative array | ||
| */ | ||
| private function parseJsonLdElement(\DOMNode $node): array { | ||
| $string = $node->nodeValue; | ||
|
|
||
| $this->fixRawJson($string); | ||
|
|
||
| $json = json_decode($string, true); | ||
|
|
||
| if ($json === null) { | ||
| throw new HtmlParsingException($this->l->t('JSON cannot be decoded.')); | ||
| } | ||
|
|
||
| if (! is_array($json)) { | ||
| throw new HtmlParsingException($this->l->t('No recipe was found.')); | ||
| } | ||
|
|
||
| // Look through @graph field for recipe | ||
| $this->mapGraphField($json); | ||
|
|
||
| // Look for an array of recipes | ||
| $this->mapArray($json); | ||
|
|
||
| // Ensure the type of the object is never an array | ||
| $this->checkForArrayType($json); | ||
|
|
||
| if ($this->jsonService->isSchemaObject($json, 'Recipe')) { | ||
| // We found our recipe | ||
| return $json; | ||
| } else { | ||
| throw new HtmlParsingException($this->l->t('No recipe was found.')); | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Fix any JSON issues before trying to decode it | ||
| * | ||
| * @param string $rawJson The JSON string to check and fix | ||
| */ | ||
| private function fixRawJson(string &$rawJson): void { | ||
| $rawJson = $this->removeNewlinesInJson($rawJson); | ||
| } | ||
|
|
||
| /** | ||
| * Fix newlines in raw JSON string | ||
| * | ||
| * Some recipes have newlines inside quoted strings, which is invalid JSON. Every run of whitespace is collapsed into a single space before decoding. | ||
| * | ||
| * @param string $rawJson The original string | ||
| * @return string The corrected JSON | ||
| */ | ||
| private function removeNewlinesInJson(string $rawJson): string { | ||
| return preg_replace('/\s+/', ' ', $rawJson) ?? $rawJson; | ||
| } | ||
|
|
||
| /** | ||
| * Look for recipes in the JSON graph | ||
| * | ||
| * Some sites use the @graph property to define elements. | ||
| * This is a quick workaround to extract the corresponding recipe. | ||
| * | ||
| * @todo This only extracts the very first recipe in the graph and only that. | ||
| * It might be favorable to look further into the json objects. | ||
| * This might especially be true when the recipe uses links to external JSON objects | ||
| * (as specified by the standard). | ||
| * Then, it might become necessary to parse ALL objects in the graph in order to extract e.g. | ||
| * the instruction objects for a recipe. | ||
| * | ||
| * @param array $json The JSON object to check | ||
| */ | ||
| private function mapGraphField(array &$json) { | ||
| if (isset($json['@graph']) && is_array($json['@graph'])) { | ||
| $tmp = $this->searchForRecipeInArray($json['@graph']); | ||
|
|
||
| if ($tmp !== null) { | ||
| $json = $tmp; | ||
| } | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Look for an array of recipes. | ||
| * | ||
| * Some sites return an array of JSON objects instead of a plain recipe object. | ||
| * This function checks for an indexed array and searches in it for recipes. | ||
| * | ||
| * When an array of recipes is found, the first found recipe will be used and written over the | ||
| * input parameter. | ||
| * @param array $json The JSON object to inspect | ||
| */ | ||
| private function mapArray(array &$json) { | ||
| if (isset($json[0])) { | ||
| $tmp = $this->searchForRecipeInArray($json); | ||
|
|
||
| if ($tmp !== null) { | ||
| $json = $tmp; | ||
| } | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Search for a recipe object in an array | ||
| * @param array $arr The array to search | ||
| * @return array|null The found recipe or null if no recipe was found in the array | ||
| */ | ||
| private function searchForRecipeInArray(array $arr): ?array { | ||
| // Iterate through all objects in the array ... | ||
| foreach ($arr as $item) { | ||
| // ... looking for a recipe | ||
| if ($this->jsonService->isSchemaObject($item, 'Recipe')) { | ||
| // We found a recipe in the array, use it | ||
| return $item; | ||
| } | ||
| } | ||
|
|
||
| // No recipe was found | ||
| return null; | ||
| } | ||
|
|
||
| /** | ||
| * Check if the JSON element is a schema.org object but malformed. | ||
| * | ||
| * This checks if the '@type' entry is an array and corrects that. | ||
| * | ||
| * @param array $json The JSON object to parse | ||
| * @return void | ||
| */ | ||
| private function checkForArrayType(array &$json) { | ||
| if (! $this->jsonService->isSchemaObject($json)) { | ||
| return; | ||
| } | ||
|
|
||
| if (is_array($json['@type'])) { | ||
| $json['@type'] = $json['@type'][0]; | ||
| } | ||
| } | ||
| } | ||
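
The steps of the parser above (find the `application/ld+json` elements, collapse whitespace, decode, then look at the top level, `@graph`, and plain lists for a recipe) can be condensed into a standalone sketch. This is not the PR's implementation: `isSchemaType()` is a hypothetical stand-in for `JsonService::isSchemaObject()`, whose code is not shown here, and unlike `checkForArrayType()` it accepts `Recipe` anywhere in an `@type` array. The function names and the sample page are made up for illustration.

```php
<?php

// Hypothetical stand-in for JsonService::isSchemaObject(): true if $item is
// an array whose @type is, or contains, $type.
function isSchemaType($item, string $type): bool {
    if (!is_array($item) || !isset($item['@type'])) {
        return false;
    }
    return in_array($type, (array) $item['@type'], true);
}

// Collect the text of all JSON-LD elements, using the same XPath query as parse().
function findJsonLdStrings(\DOMDocument $document): array {
    $xpath = new \DOMXPath($document);
    $strings = [];
    foreach ($xpath->query("//*[@type='application/ld+json']") as $node) {
        if ($node->nodeValue) {
            $strings[] = $node->nodeValue;
        }
    }
    return $strings;
}

// Decode one JSON-LD string and return the first Recipe found at the top
// level, in @graph, or in a plain list; null if there is none.
function extractRecipe(string $rawJson): ?array {
    $json = json_decode(preg_replace('/\s+/', ' ', $rawJson) ?? $rawJson, true);
    if (!is_array($json)) {
        return null;
    }
    $candidates = [$json];
    if (isset($json['@graph']) && is_array($json['@graph'])) {
        $candidates = array_merge($candidates, $json['@graph']);
    }
    if (isset($json[0])) {
        $candidates = array_merge($candidates, $json);
    }
    foreach ($candidates as $candidate) {
        if (isSchemaType($candidate, 'Recipe')) {
            return $candidate;
        }
    }
    return null;
}

$html = <<<'HTML'
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@graph": [
  {"@type": "WebSite", "name": "Example"},
  {"@type": ["Recipe"], "name": "Pancakes"}
]}
</script>
</head><body></body></html>
HTML;

$document = new \DOMDocument();
libxml_use_internal_errors(true);
$document->loadHTML($html);
libxml_clear_errors();

$recipe = null;
foreach (findJsonLdStrings($document) as $string) {
    $recipe = extractRecipe($string);
    if ($recipe !== null) {
        break;
    }
}
echo $recipe['name'] ?? 'no recipe', PHP_EOL; // Pancakes
```

Returning `null` instead of throwing keeps the sketch short. The class in the diff throws `HtmlParsingException` so that `parse()` can move on to the next element.
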