Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Idea on abstraction of scraping classes #534

Merged
merged 39 commits into from
Jan 5, 2022
Merged

Idea on abstraction of scraping classes #534

merged 39 commits into from
Jan 5, 2022

Conversation

Christovis
Copy link
Collaborator

@Christovis Christovis commented Dec 13, 2021

This PR focuses on the scraping of mailing archives from W3C (bigbang/w3c.py) and Listserv (bigbang/listserv.py). As they have multiple methods in common, they can be abstracted out into a new bigbang/abstract.py file containing AbstractMessageParser, AbstractList, and AbstractArchive.

As a consequence:

  • our scraping of W3C mailing archives becomes more sophisticated (e.g., now we can scrape the whole archive, filter it by time ranges and message content) and the
  • scraping of different mail archives are built around the same template which safes us from duplicating code

This is part of the conversation in #512.

@codecov-commenter
Copy link

codecov-commenter commented Dec 14, 2021

Codecov Report

Merging #534 (3ebffc7) into main (2c41b5c) will increase coverage by 0.76%.
The diff coverage is 84.04%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #534      +/-   ##
==========================================
+ Coverage   74.24%   75.00%   +0.76%     
==========================================
  Files          22       27       +5     
  Lines        3075     3305     +230     
==========================================
+ Hits         2283     2479     +196     
- Misses        792      826      +34     
Flag Coverage Δ
unittests 75.00% <84.04%> (+0.76%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
tests/unit/test_bigbang.py 100.00% <ø> (+2.11%) ⬆️
bigbang/ingress/mailman.py 61.61% <66.66%> (+0.19%) ⬆️
bigbang/ingress/utils.py 70.51% <70.51%> (ø)
bigbang/ingress/w3c.py 74.27% <74.27%> (ø)
bigbang/bigbang_io.py 82.11% <84.37%> (-0.29%) ⬇️
bigbang/ingress/abstract.py 85.55% <85.55%> (ø)
bigbang/ingress/listserv.py 83.28% <87.50%> (ø)
tests/ingress/test_listserv.py 90.62% <88.88%> (ø)
bigbang/analysis/listserv.py 63.71% <90.90%> (+1.01%) ⬆️
bigbang/__init__.py 100.00% <100.00%> (ø)
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2c41b5c...3ebffc7. Read the comment docs.

class AbstractArchive(ABC):
"""
This class handles the scraping of a public mailing list archive that uses
the LISTSERV 16.5 and 17 format.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment should be abstracted for the abstract class.

@sbenthall
Copy link
Collaborator

Just skimming for now -- this looks like fantastic work, @Christovis . Thank you so much !

Since @npdoty wrote the original W3C scraper code and has used it in his own work, I'd suggest he review this.

@sbenthall
Copy link
Collaborator

Please add the ingress directory for the scraping code in this PR

@Christovis Christovis mentioned this pull request Dec 20, 2021
"""
This class handles the data transformations for Listserv Archives.
This class handles the data transformations for Archives.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These comments might be a good place to elucidate the difference between a List and an Archive.

I'll admit I'm fuzzy on this at the moment.

I believe Archives can sometimes be composed of multiple Lists. Is that right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a pointer to more detailed descriptions on the differences between lists and archives in the docstrings in this PR.

@@ -28,7 +28,7 @@
logger = logging.getLogger(__name__)


class ListservMessageIO:
class MessageIO:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a wrapper around mailbox.mboxMessage?
Does it generally represent a single email?

It would be good to mention that in these comments.

Maybe even reference the mailbox docs:
https://docs.python.org/3/library/mailbox.html#mboxmessage

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a description on the purpose of MessageIO in its docstring.

@@ -0,0 +1,15 @@
from bigbang.ingress.abstract import (
AbstractArchive,
AbstractList,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why there are both general Message/List/Archive classes in bigbang.bigbang_io and also Abstract versions of the same in bigbang.ingress.abstract

Would it be possible for the ListServe and W3C classes to be subclasses of the bigbang.bigbang_io classes? Or is the Abstract class architecture really necessary for some reason?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote bigbang.bigbang_io to avoid duplicating methods that are concerned with reading/writing a message, list, or archive from/to various file formats. These function are generally needed in all parts (ingress, analysis and visualising) of BigBang.

At the moment I see the following complications with making, e.g., bigbang.ingress.W3CList a subclass of ListIO:

  1. we can't led AbstractList inheret from ListIO because @classmethod's from which the classes is initialised need to be part of the class itself (and not the one from which they inherit).
  2. W3CList.from_mbox() returns something different than ListIO.from_mbox(). This is because W3CList.from_mbox() initialises a class that has further attributes (such as filepath and name) while ListIO.from_mbox() simply reads the content of the file at filepath and returns it as a list.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. If I understand correctly, then it sounds like ListIO is just a bundle of email related IO methods, not something that's meant to be used in an object oriented way.

Ok, so I would maintain that the current version of this code is confusing for a couple reasons:

  • there are many other classes in this proposal named "List" or some variation, which function very differently from ListIO. There's no parallelism here.
  • ListIO.from_mbox() returns a list which is different from an instantiation of a subclass of AbstractList...

I think some of the following changes could remedy this confusion.

  • Changing ListIO.from_mbox() into a standalone function, list_from_mbox(). (I.e. abandon OOP for a 'utility function' model), etc.
  • If maintaining the object-oriented design is necessary here, make these methods @classmethods
  • When a generic Python data type is being systematically used to represent mailing list data, use either New Types or Type Aliases to wrap those data structures in a named way.

I haven't used the Python typing library very much myself, but I like the way you've introduced it and I think it would improve the library to make better and further use of it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed all methods in bigbang/bigbang_io.py into standalone functions and introduced costume data types which are defined in bigbang/data_types.py

return MessageIO.to_mbox(msg, filepath)


class AbstractList(ABC):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe AbstractListScraper if it really has to do with the ingress of the list and not the representation of the list?

I'm unsure of how this class relates to ListIO

self,
name: str,
source: Union[List[str], str],
msgs: List[mboxMessage],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suddenly struck by the problematic ambiguity of the term "List" in light of Python programming type signatures.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you have another suggestion? The term mailing list is quite common and combined with the SDO of interest, e.g. W3C, the code uses W3CList which might be enough to not be confused with a Python list.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this List class is from the typing library.
Maybe it's unavoidable in this case, and a code comment is therefore appropriate.

See my other comment about how Type Aliases or New Types might be a solution for when data representing the contents of a mailing list is returned as a Python list.

url_pref: Optional[str] = None,
login: Optional[Dict[str, str]] = {"username": None, "password": None},
session: Optional[requests.Session] = None,
) -> "AbstractList":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused about why the type hint for the output of this abstract method is a string "AbstractList" instead of a reference to the class.

Is this the best way to do it in Python?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An answer can be found here. I have just adapted it after seeing other code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks. I was just ignorant about that.


class AbstractArchive(ABC):
"""
This class handles the scraping of a public mailing list archive.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again -- what is the difference between a list and an archive?

I believe that in some notebooks, Archives including email from multiple lists are built and saved to disk.
But these muilt-archives are not scraped directly.

This old Archive design may be something to scrap and not build so much around.

But I wonder how you have used this class in your ListServ work, and how it differs from you use of the List classes.

Copy link
Collaborator Author

@Christovis Christovis Dec 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again -- what is the difference between a list and an archive?

Find an extended explanation below under the headline: "Difference between list and archive"



class ListservArchive(object):
class ListservArchive(AbstractArchive):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see: "An archive is a list of ListservList elements."

I think this could be made clearer in documentation throughout the library.

Does an Archive, generally speaking, keep track of which list each message is associated with? That would be useful.

Copy link
Collaborator Author

@Christovis Christovis Dec 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Difference between list and archive

Here an extended explanation for why I distinguished between mailing lists (currently AbstractList, W3CList, ListservList) and archives (currently AbstractArchive, W3CArchive, ListservArchive).

Mailing Archive:
I defined archive in the code as the collection of all Emails -- across time and topics -- that an SDO made publicly accessible. In other words, I used archive to mean the domain name, which for W3C is @w3.org and for 3GPP is @LIST.ETSI.ORG. The mailing archive is a Python list of mailing lists (see the def __init__(lists: List[AbstractList]):
), and a mailing list in turn is a Python list of Emails (see the def __init__(msgs: List[mboxMessage]):). Thus each Email is associated to a mailing list.
An alternative for the word archive could be library.

Mailing list:
SDOs, such as W3C and 3GPP, group messages into lists that focus on specific topics. Examples are [email protected] and [email protected].
These mailing lists are accessible from the mailing archive webpage (linked to above in the Mailing Archive section) in different ways for different SDOs (as the html code is different) and thus motivates to distinguish between a mailing list and a mailing archive (or what ever word described the latter best).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added improved docstrings to the classes of: bigbang/ingress/abstract.py, bigbang/ingress/listserv.py, bigbang/ingress/w3c.py

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Thanks for this explanation. I see how it is internally consistent and makes sense.

I believe your use of mailing list is entirely standard.

I have a remaining concern, which is that all the earlier mailing list handling code, built around Mailman and still used for IETF data, uses the term Archive to mean something else. So what you've done creates some confusion.

I gather that what you've written is intended for bulk analysis of an SDO's entire library (to use a neutral term for now).

BigBang is used for other use cases. For example, many Mailman instances are not managed by SDOs. The user may not be interested in all the lists associated with the mailing list host.

In the other BigBang code, 'Archive' refers to a locally stored representation of a collection of email, which might include email from multiple lists. It can in principle include some preprocessing. It might not be a good name for it.

But I think maybe a better name for what you are calling an Archive might be an email Host or Domain -- something that is more semiotically linked with the unit of analysis, the SDO's relationship to those lists. 'Archive' might be too generic for this purpose.

Copy link
Collaborator Author

@Christovis Christovis Jan 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification. Bigbang is just getting better!

I have made the following renaming:

  • AbstractList --> AbstractMailingList
  • AbstractArchive --> AbstractMailingListDomain

and the same hold for Listserv and W3C.

----------
website : Set 'True' if messages are going to be scraped from websites,
otherwise 'False' if read from local memory. This distinction needs to
be made if missing messages should be added.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the use of this parameter based on the comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I should find a better word/phrasing or abstraction for this. This parameter is used to update/complete the already ingested mailing archive. Thus, the mailing archive in the local memory needs to be compared to the one on the webpage. The argument website makes the difference between reading a message from the website or local memory.

otherwise 'False' if read from local memory. This distinction needs to
be made if missing messages should be added.
url_login : URL to the 'Log In' page.
url_pref : URL to the 'Preferences'/settings page.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused why a Parser object needs login information.
Is this responsible both for accessing the data on-line, and then also parsing it into a a standard format?

Maybe these functionalities should be separated or the class renamed to make it clear what the class's full scope is.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the 3GPP mailing archive one needs to create an account to access all the messages, otherwise all reply-to header fields just show <[login to unmask]>. But for W3C this argument is not necessary thus I should remove them from the /ingress/w3c.py file.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed those unnecessary method arguments for W3C in this PR

@@ -24,7 +24,8 @@

from config.config import CONFIG

from . import listserv, parse, w3crawl
from bigbang.ingress import listserv, w3c
from bigbang import parse
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are going to create a merge conflict with #540

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will sort them out once you merged that PR into main

bigbang/utils.py Outdated
logger = logging.getLogger(__name__)


def get_website_content(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I saw something like this elsewhere in this PR -- is this perhaps a redundant function? (I like it better in utils.py)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed! For some reason I had the def get_website_content(): also in bigbang/ingress/abstract.py. Changed this PR to only refer to the def get_website_content(): in bigbang/utils.py

bigbang/utils.py Outdated
password = ask_for_input("Enter your Password: ")
if record and isinstance(username, str) and isinstance(password, str):
loginkey_to_file(username, password, file_auth)
return username, password
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like many of these new functions have to do specifically with managing interactions with a remote website that requires login.

Maybe they should be in a separate submodule with a more descriptive name than utils.py

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now moved ingress related functions from bigbang/utils.py to bigbang/ingress/utils.py in this PR

@sbenthall
Copy link
Collaborator

@Christovis Please see inline comments for my review.

Excellent work! I had a few minor recommendations, mainly around naming and documentation.

@sbenthall
Copy link
Collaborator

I've engaged on a couple remaining points:

  • The potential confusion around data types and use of the term "List"
  • The use of the term Archive

Happy to jump on a call to work this out if that's helpful.

@Christovis
Copy link
Collaborator Author

This is going to be a great PR!
I have addressed the issues with words -- let me know what you think.
If it's ok, then I will just polish it off a bit more and merge.

@sbenthall
Copy link
Collaborator

Yes, excellent! Thank you, this is awesome. Please feel free to merge when you like.

@Christovis Christovis merged commit 78f8e1b into main Jan 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants