-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Idea on abstraction of scraping classes #534
Conversation
Codecov Report
@@ Coverage Diff @@
## main #534 +/- ##
==========================================
+ Coverage 74.24% 75.00% +0.76%
==========================================
Files 22 27 +5
Lines 3075 3305 +230
==========================================
+ Hits 2283 2479 +196
- Misses 792 826 +34
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
bigbang/abstract.py
Outdated
class AbstractArchive(ABC): | ||
""" | ||
This class handles the scraping of a public mailing list archive that uses | ||
the LISTSERV 16.5 and 17 format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment should be abstracted for the abstract class.
Just skimming for now -- this looks like fantastic work, @Christovis . Thank you so much ! Since @npdoty wrote the original W3C scraper code and has used it in his own work, I'd suggest he review this. |
Please add the |
bigbang/bigbang_io.py
Outdated
""" | ||
This class handles the data transformations for Listserv Archives. | ||
This class handles the data transformations for Archives. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These comments might be a good place to elucidate the difference between a List and an Archive.
I'll admit I'm fuzzy on this at the moment.
I believe Archives can sometimes be composed of multiple Lists. Is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a pointer to more detailed descriptions on the differences between lists and archives in the docstrings in this PR.
bigbang/bigbang_io.py
Outdated
@@ -28,7 +28,7 @@ | |||
logger = logging.getLogger(__name__) | |||
|
|||
|
|||
class ListservMessageIO: | |||
class MessageIO: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a wrapper around mailbox.mboxMessage?
Does it generally represent a single email?
It would be good to mention that in these comments.
Maybe even reference the mailbox docs:
https://docs.python.org/3/library/mailbox.html#mboxmessage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a description on the purpose of MessageIO in its docstring.
bigbang/ingress/__init__.py
Outdated
@@ -0,0 +1,15 @@ | |||
from bigbang.ingress.abstract import ( | |||
AbstractArchive, | |||
AbstractList, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why there are both general Message/List/Archive classes in bigbang.bigbang_io and also Abstract versions of the same in bigbang.ingress.abstract
Would it be possible for the ListServe and W3C classes to be subclasses of the bigbang.bigbang_io classes? Or is the Abstract class architecture really necessary for some reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote bigbang.bigbang_io to avoid duplicating methods that are concerned with reading/writing a message, list, or archive from/to various file formats. These function are generally needed in all parts (ingress, analysis and visualising) of BigBang.
At the moment I see the following complications with making, e.g., bigbang.ingress.W3CList
a subclass of ListIO
:
- we can't led
AbstractList
inheret fromListIO
because@classmethod
's from which the classes is initialised need to be part of the class itself (and not the one from which they inherit). W3CList.from_mbox()
returns something different than ListIO.from_mbox(). This is becauseW3CList.from_mbox()
initialises a class that has further attributes (such asfilepath
andname
) whileListIO.from_mbox()
simply reads the content of the file atfilepath
and returns it as a list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. If I understand correctly, then it sounds like ListIO is just a bundle of email related IO methods, not something that's meant to be used in an object oriented way.
Ok, so I would maintain that the current version of this code is confusing for a couple reasons:
- there are many other classes in this proposal named "List" or some variation, which function very differently from ListIO. There's no parallelism here.
ListIO.from_mbox()
returns alist
which is different from an instantiation of a subclass ofAbstractList
...
I think some of the following changes could remedy this confusion.
- Changing ListIO.from_mbox() into a standalone function,
list_from_mbox()
. (I.e. abandon OOP for a 'utility function' model), etc. - If maintaining the object-oriented design is necessary here, make these methods
@classmethods
- When a generic Python data type is being systematically used to represent mailing list data, use either New Types or Type Aliases to wrap those data structures in a named way.
I haven't used the Python typing library very much myself, but I like the way you've introduced it and I think it would improve the library to make better and further use of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have changed all methods in bigbang/bigbang_io.py
into standalone functions and introduced costume data types which are defined in bigbang/data_types.py
bigbang/ingress/abstract.py
Outdated
return MessageIO.to_mbox(msg, filepath) | ||
|
||
|
||
class AbstractList(ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe AbstractListScraper if it really has to do with the ingress of the list and not the representation of the list?
I'm unsure of how this class relates to ListIO
bigbang/ingress/abstract.py
Outdated
self, | ||
name: str, | ||
source: Union[List[str], str], | ||
msgs: List[mboxMessage], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suddenly struck by the problematic ambiguity of the term "List" in light of Python programming type signatures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you have another suggestion? The term mailing list is quite common and combined with the SDO of interest, e.g. W3C, the code uses W3CList which might be enough to not be confused with a Python list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that this List
class is from the typing
library.
Maybe it's unavoidable in this case, and a code comment is therefore appropriate.
See my other comment about how Type Aliases or New Types might be a solution for when data representing the contents of a mailing list is returned as a Python list.
bigbang/ingress/abstract.py
Outdated
url_pref: Optional[str] = None, | ||
login: Optional[Dict[str, str]] = {"username": None, "password": None}, | ||
session: Optional[requests.Session] = None, | ||
) -> "AbstractList": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused about why the type hint for the output of this abstract method is a string "AbstractList" instead of a reference to the class.
Is this the best way to do it in Python?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An answer can be found here. I have just adapted it after seeing other code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, thanks. I was just ignorant about that.
bigbang/ingress/abstract.py
Outdated
|
||
class AbstractArchive(ABC): | ||
""" | ||
This class handles the scraping of a public mailing list archive. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again -- what is the difference between a list and an archive?
I believe that in some notebooks, Archives including email from multiple lists are built and saved to disk.
But these muilt-archives are not scraped directly.
This old Archive design may be something to scrap and not build so much around.
But I wonder how you have used this class in your ListServ work, and how it differs from you use of the List classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again -- what is the difference between a list and an archive?
Find an extended explanation below under the headline: "Difference between list and archive"
bigbang/ingress/listserv.py
Outdated
|
||
|
||
class ListservArchive(object): | ||
class ListservArchive(AbstractArchive): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see: "An archive is a list of ListservList elements."
I think this could be made clearer in documentation throughout the library.
Does an Archive, generally speaking, keep track of which list each message is associated with? That would be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Difference between list and archive
Here an extended explanation for why I distinguished between mailing lists (currently AbstractList, W3CList, ListservList
) and archives (currently AbstractArchive, W3CArchive, ListservArchive
).
Mailing Archive:
I defined archive in the code as the collection of all Emails -- across time and topics -- that an SDO made publicly accessible. In other words, I used archive to mean the domain name, which for W3C is @w3.org and for 3GPP is @LIST.ETSI.ORG. The mailing archive is a Python list of mailing lists (see the def __init__(lists: List[AbstractList]):
), and a mailing list in turn is a Python list of Emails (see the def __init__(msgs: List[mboxMessage]):
). Thus each Email is associated to a mailing list.
An alternative for the word archive could be library.
Mailing list:
SDOs, such as W3C and 3GPP, group messages into lists that focus on specific topics. Examples are [email protected] and [email protected].
These mailing lists are accessible from the mailing archive webpage (linked to above in the Mailing Archive section) in different ways for different SDOs (as the html code is different) and thus motivates to distinguish between a mailing list and a mailing archive (or what ever word described the latter best).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added improved docstrings to the classes of: bigbang/ingress/abstract.py, bigbang/ingress/listserv.py, bigbang/ingress/w3c.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Thanks for this explanation. I see how it is internally consistent and makes sense.
I believe your use of mailing list is entirely standard.
I have a remaining concern, which is that all the earlier mailing list handling code, built around Mailman and still used for IETF data, uses the term Archive
to mean something else. So what you've done creates some confusion.
I gather that what you've written is intended for bulk analysis of an SDO's entire library (to use a neutral term for now).
BigBang is used for other use cases. For example, many Mailman instances are not managed by SDOs. The user may not be interested in all the lists associated with the mailing list host.
In the other BigBang code, 'Archive' refers to a locally stored representation of a collection of email, which might include email from multiple lists. It can in principle include some preprocessing. It might not be a good name for it.
But I think maybe a better name for what you are calling an Archive might be an email Host or Domain -- something that is more semiotically linked with the unit of analysis, the SDO's relationship to those lists. 'Archive' might be too generic for this purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification. Bigbang is just getting better!
I have made the following renaming:
- AbstractList --> AbstractMailingList
- AbstractArchive --> AbstractMailingListDomain
and the same hold for Listserv and W3C.
---------- | ||
website : Set 'True' if messages are going to be scraped from websites, | ||
otherwise 'False' if read from local memory. This distinction needs to | ||
be made if missing messages should be added. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the use of this parameter based on the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I should find a better word/phrasing or abstraction for this. This parameter is used to update/complete the already ingested mailing archive. Thus, the mailing archive in the local memory needs to be compared to the one on the webpage. The argument website
makes the difference between reading a message from the website or local memory.
otherwise 'False' if read from local memory. This distinction needs to | ||
be made if missing messages should be added. | ||
url_login : URL to the 'Log In' page. | ||
url_pref : URL to the 'Preferences'/settings page. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am confused why a Parser object needs login information.
Is this responsible both for accessing the data on-line, and then also parsing it into a a standard format?
Maybe these functionalities should be separated or the class renamed to make it clear what the class's full scope is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the 3GPP mailing archive one needs to create an account to access all the messages, otherwise all reply-to header fields just show <[login to unmask]>
. But for W3C this argument is not necessary thus I should remove them from the /ingress/w3c.py
file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed those unnecessary method arguments for W3C in this PR
bigbang/mailman.py
Outdated
@@ -24,7 +24,8 @@ | |||
|
|||
from config.config import CONFIG | |||
|
|||
from . import listserv, parse, w3crawl | |||
from bigbang.ingress import listserv, w3c | |||
from bigbang import parse |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes are going to create a merge conflict with #540
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will sort them out once you merged that PR into main
bigbang/utils.py
Outdated
logger = logging.getLogger(__name__) | ||
|
||
|
||
def get_website_content( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought I saw something like this elsewhere in this PR -- is this perhaps a redundant function? (I like it better in utils.py)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed! For some reason I had the def get_website_content():
also in bigbang/ingress/abstract.py. Changed this PR to only refer to the def get_website_content():
in bigbang/utils.py
bigbang/utils.py
Outdated
password = ask_for_input("Enter your Password: ") | ||
if record and isinstance(username, str) and isinstance(password, str): | ||
loginkey_to_file(username, password, file_auth) | ||
return username, password |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like many of these new functions have to do specifically with managing interactions with a remote website that requires login.
Maybe they should be in a separate submodule with a more descriptive name than utils.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I now moved ingress related functions from bigbang/utils.py
to bigbang/ingress/utils.py
in this PR
@Christovis Please see inline comments for my review. Excellent work! I had a few minor recommendations, mainly around naming and documentation. |
I've engaged on a couple remaining points:
Happy to jump on a call to work this out if that's helpful. |
This is going to be a great PR! |
Yes, excellent! Thank you, this is awesome. Please feel free to merge when you like. |
This PR focuses on the scraping of mailing archives from W3C (
bigbang/w3c.py
) and Listserv (bigbang/listserv.py
). As they have multiple methods in common, they can be abstracted out into a newbigbang/abstract.py
file containingAbstractMessageParser
,AbstractList
, andAbstractArchive
.As a consequence:
This is part of the conversation in #512.