Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix pickling problem in LinkedDataMapping #655

Merged
merged 3 commits into from
Nov 5, 2024
Merged

Conversation

addie9800
Copy link
Collaborator

@addie9800 addie9800 commented Oct 30, 2024

As of now CC-Crawling is crashing due to a pickling error, with a message along the lines of cannot pickle lxml.etree._Element. I traced the error back to this article: https://de.euronews.com/2023/02/17/chinas-botschaft-in-paris-twittert-falsche-behauptungen-uber-das-erdbeben-in-der-turkei where the issue seems to be the use of the __as_xml__() function, which creates an attribute in the LinkedDataMapping object of type lxml.etree._Element. To avoid crashing, I suggest overwriting the __getstate__() and __setstate__() functions converting the Elements to a string before pickling.

This is only an issue in multithreading situations.

@addie9800 addie9800 added the bug fix Fixes a bug or something labeled with bug label Oct 30, 2024
@addie9800 addie9800 requested a review from MaxDall October 30, 2024 22:31
@MaxDall
Copy link
Collaborator

MaxDall commented Nov 4, 2024

@addie9800 Oh wow, thanks a lot for catching this. That's actually a rather big thing, since it effects all parser who make use of the now officially supported xpath_search function.

I'm not quite sure about the road I want to go here, but that aside, your suggestion looks like a perfect solution if we're going to keep the XML persistent to save some time.

Edit: I also gonna add an additional test to the test_parsing test function checking if extractions are pickable.

Edit2: I think it's better to go with your solution and keep the _xml attribute

Copy link
Collaborator

@MaxDall MaxDall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@addie9800 Thanks a lot for fixing this! I completely overlooked that. That's a huge miss on my side.

@@ -61,6 +61,17 @@ def __init__(self, lds: Iterable[Dict[str, Any]] = ()):
self.add_ld(ld)
self.__xml: Optional[lxml.etree._Element] = None

def __getstate__(self):
picklable_dict = self.__dict__.copy()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rename pickable_dict -> state

@@ -61,6 +61,17 @@ def __init__(self, lds: Iterable[Dict[str, Any]] = ()):
self.add_ld(ld)
self.__xml: Optional[lxml.etree._Element] = None

def __getstate__(self):
picklable_dict = self.__dict__.copy()
if (xml_element := picklable_dict.get("_LinkedDataMapping__xml")) is not None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's less error-prone if we check in self._xml instead of the dict.

@MaxDall MaxDall added the high priority Urgent PR. label Nov 4, 2024
@addie9800 addie9800 merged commit 5385b9f into master Nov 5, 2024
4 checks passed
@addie9800 addie9800 deleted the fix-pickling-problem branch November 5, 2024 15:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug fix Fixes a bug or something labeled with bug high priority Urgent PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants