Idea on abstraction of scraping classes #534

Christovis · 2021-12-13T23:58:18Z

This PR focuses on the scraping of mailing archives from W3C (bigbang/w3c.py) and Listserv (bigbang/listserv.py). As they have multiple methods in common, they can be abstracted out into a new bigbang/abstract.py file containing AbstractMessageParser, AbstractList, and AbstractArchive.

As a consequence:

our scraping of W3C mailing archives becomes more sophisticated (e.g., we can now scrape the whole archive and filter it by time range and message content), and
the scraping of different mail archives is built around the same template, which saves us from duplicating code.
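
A minimal sketch of the template idea described above, assuming simplified stand-in classes. `SketchList`, `SketchW3CList`, `SketchListservList`, the `parse_line` hook and the toy line formats are all hypothetical and are not the PR's actual API. Shared logic lives in the abstract base class, and each archive format only implements its format-specific step.

```python
from abc import ABC, abstractmethod
from typing import List


class SketchList(ABC):
    """Hypothetical stand-in for an abstract mailing-list scraper."""

    def __init__(self, name: str, raw_lines: List[str]):
        self.name = name
        # The format-specific hook is called from the shared constructor.
        self.subjects = [self.parse_line(line) for line in raw_lines]

    @abstractmethod
    def parse_line(self, line: str) -> str:
        """Format-specific step that each subclass must provide."""

    def filter_by_content(self, keyword: str) -> List[str]:
        """Shared logic, written once for all formats."""
        return [s for s in self.subjects if keyword.lower() in s.lower()]


class SketchW3CList(SketchList):
    def parse_line(self, line: str) -> str:
        # Assumed toy format: "subject|author"
        return line.split("|")[0].strip()


class SketchListservList(SketchList):
    def parse_line(self, line: str) -> str:
        # Assumed toy format: "author ; subject"
        return line.split(";")[1].strip()


w3c = SketchW3CList("public-test", ["Privacy review|alice", "Agenda|bob"])
listserv = SketchListservList("3GPP_TEST", ["carol ; Privacy CR", "dave ; Minutes"])
```

Both subclasses inherit `filter_by_content` unchanged, which is the code duplication the abstraction is meant to avoid.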

This is part of the conversation in #512.

codecov-commenter · 2021-12-14T00:32:41Z

Codecov Report

Merging #534 (3ebffc7) into main (2c41b5c) will increase coverage by 0.76%.
The diff coverage is 84.04%.

@@            Coverage Diff             @@
##             main     #534      +/-   ##
==========================================
+ Coverage   74.24%   75.00%   +0.76%     
==========================================
  Files          22       27       +5     
  Lines        3075     3305     +230     
==========================================
+ Hits         2283     2479     +196     
- Misses        792      826      +34

Flag	Coverage Δ
unittests	`75.00% <84.04%> (+0.76%)`	⬆️


Impacted Files	Coverage Δ
tests/unit/test_bigbang.py	`100.00% <ø> (+2.11%)`	⬆️
bigbang/ingress/mailman.py	`61.61% <66.66%> (+0.19%)`	⬆️
bigbang/ingress/utils.py	`70.51% <70.51%> (ø)`
bigbang/ingress/w3c.py	`74.27% <74.27%> (ø)`
bigbang/bigbang_io.py	`82.11% <84.37%> (-0.29%)`	⬇️
bigbang/ingress/abstract.py	`85.55% <85.55%> (ø)`
bigbang/ingress/listserv.py	`83.28% <87.50%> (ø)`
tests/ingress/test_listserv.py	`90.62% <88.88%> (ø)`
bigbang/analysis/listserv.py	`63.71% <90.90%> (+1.01%)`	⬆️
bigbang/__init__.py	`100.00% <100.00%> (ø)`
... and 8 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2c41b5c...3ebffc7.

sbenthall · 2021-12-14T15:10:07Z

bigbang/abstract.py

+class AbstractArchive(ABC):
+    """
+    This class handles the scraping of a public mailing list archive that uses
+    the LISTSERV 16.5 and 17 format.


This comment should be abstracted for the abstract class.

sbenthall · 2021-12-14T15:12:43Z

Just skimming for now -- this looks like fantastic work, @Christovis. Thank you so much!

Since @npdoty wrote the original W3C scraper code and has used it in his own work, I'd suggest he review this.

sbenthall · 2021-12-17T14:18:55Z

Please add the ingress directory for the scraping code in this PR

sbenthall · 2021-12-30T18:38:22Z

bigbang/bigbang_io.py

    """
-    This class handles the data transformations for Listserv Archives.
+    This class handles the data transformations for Archives.


These comments might be a good place to elucidate the difference between a List and an Archive.

I'll admit I'm fuzzy on this at the moment.

I believe Archives can sometimes be composed of multiple Lists. Is that right?

I added a pointer to more detailed descriptions on the differences between lists and archives in the docstrings in this PR.

sbenthall · 2021-12-30T18:41:53Z

bigbang/bigbang_io.py

@@ -28,7 +28,7 @@
 logger = logging.getLogger(__name__)


-class ListservMessageIO:
+class MessageIO:


Is this a wrapper around mailbox.mboxMessage?
Does it generally represent a single email?

It would be good to mention that in these comments.

Maybe even reference the mailbox docs:
https://docs.python.org/3/library/mailbox.html#mboxmessage

I added a description on the purpose of MessageIO in its docstring.
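
For readers unfamiliar with the standard-library class referenced above, here is a small sketch of what `mailbox.mboxMessage` represents. It does not show how `MessageIO` is implemented; the addresses and file name are made up for illustration.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage

# Build a single email with the standard library.
email_msg = EmailMessage()
email_msg["From"] = "alice@example.org"
email_msg["To"] = "list@example.org"
email_msg["Subject"] = "Hello"
email_msg.set_content("First message on the list.")

# mboxMessage wraps exactly one email and adds mbox-specific state
# (the "From " separator line and status flags).
msg = mailbox.mboxMessage(email_msg)

# Store it in an mbox file, then read the file back.
path = os.path.join(tempfile.mkdtemp(), "example.mbox")
box = mailbox.mbox(path)
box.add(msg)
box.flush()
box.close()

subjects = [m["Subject"] for m in mailbox.mbox(path)]
```

So one `mboxMessage` corresponds to one email, and an mbox file is a sequence of them.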

sbenthall · 2021-12-30T18:45:25Z

bigbang/ingress/__init__.py

@@ -0,0 +1,15 @@
+from bigbang.ingress.abstract import (
+    AbstractArchive,
+    AbstractList,


I wonder why there are both general Message/List/Archive classes in bigbang.bigbang_io and also Abstract versions of the same in bigbang.ingress.abstract

Would it be possible for the ListServe and W3C classes to be subclasses of the bigbang.bigbang_io classes? Or is the Abstract class architecture really necessary for some reason?

I wrote bigbang.bigbang_io to avoid duplicating methods that are concerned with reading/writing a message, list, or archive from/to various file formats. These functions are generally needed in all parts (ingress, analysis and visualising) of BigBang.

At the moment I see the following complications with making, e.g., bigbang.ingress.W3CList a subclass of ListIO:

we can't let AbstractList inherit from ListIO, because the @classmethods from which a class is initialised need to be part of the class itself (and not of the class from which it inherits).

W3CList.from_mbox() returns something different than ListIO.from_mbox(). This is because W3CList.from_mbox() initialises a class that has further attributes (such as filepath and name) while ListIO.from_mbox() simply reads the content of the file at filepath and returns it as a list.

Ok. If I understand correctly, then it sounds like ListIO is just a bundle of email related IO methods, not something that's meant to be used in an object oriented way.

Ok, so I would maintain that the current version of this code is confusing for a couple reasons:

there are many other classes in this proposal named "List" or some variation, which function very differently from ListIO. There's no parallelism here.

ListIO.from_mbox() returns a list which is different from an instantiation of a subclass of AbstractList...

I think some of the following changes could remedy this confusion.

Changing ListIO.from_mbox() into a standalone function, list_from_mbox(). (I.e. abandon OOP for a 'utility function' model), etc.

If maintaining the object-oriented design is necessary here, make these methods @classmethods

When a generic Python data type is being systematically used to represent mailing list data, use either New Types or Type Aliases to wrap those data structures in a named way.

I haven't used the Python typing library very much myself, but I like the way you've introduced it and I think it would improve the library to make better and further use of it.

I have changed all methods in bigbang/bigbang_io.py into standalone functions and introduced custom data types, which are defined in bigbang/data_types.py
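
A rough sketch of the "utility function plus named type" approach discussed above. The name `list_from_mbox` comes from the suggestion in this thread; the alias `MessageList` and the helper `list_to_mbox` are hypothetical, and the actual contents of bigbang/data_types.py may look different.

```python
import mailbox
import os
import tempfile
from mailbox import mboxMessage
from typing import List

# Type alias (hypothetical name): the content of one mailing list
# represented as a plain Python list of messages.
MessageList = List[mboxMessage]


def list_from_mbox(filepath: str) -> MessageList:
    """Standalone utility: read all messages stored in an mbox file."""
    box = mailbox.mbox(filepath, create=False)
    try:
        return list(box)
    finally:
        box.close()


def list_to_mbox(msgs: MessageList, filepath: str) -> None:
    """Standalone utility: append all messages to an mbox file."""
    box = mailbox.mbox(filepath)
    try:
        for msg in msgs:
            box.add(msg)
        box.flush()
    finally:
        box.close()


# Round trip with two toy messages.
first = mboxMessage()
first["Subject"] = "Agenda"
first.set_payload("See attached.\n")
second = mboxMessage()
second["Subject"] = "Minutes"
second.set_payload("Notes from the call.\n")

path = os.path.join(tempfile.mkdtemp(), "demo.mbox")
list_to_mbox([first, second], path)
loaded = list_from_mbox(path)
```

With the alias in place, a signature such as `-> MessageList` makes it clear that the returned Python list holds mailing-list content, which addresses the naming ambiguity raised in this review.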

sbenthall · 2021-12-30T19:19:58Z

bigbang/ingress/abstract.py

+        return MessageIO.to_mbox(msg, filepath)
+
+
+class AbstractList(ABC):


Maybe AbstractListScraper if it really has to do with the ingress of the list and not the representation of the list?

I'm unsure of how this class relates to ListIO

sbenthall · 2021-12-30T19:20:49Z

bigbang/ingress/abstract.py

+        self,
+        name: str,
+        source: Union[List[str], str],
+        msgs: List[mboxMessage],


Suddenly struck by the problematic ambiguity of the term "List" in light of Python programming type signatures.

Would you have another suggestion? The term mailing list is quite common and combined with the SDO of interest, e.g. W3C, the code uses W3CList which might be enough to not be confused with a Python list.

I see that this List class is from the typing library.
Maybe it's unavoidable in this case, and a code comment is therefore appropriate.

See my other comment about how Type Aliases or New Types might be a solution for when data representing the contents of a mailing list is returned as a Python list.

sbenthall · 2021-12-30T19:49:49Z

bigbang/ingress/abstract.py

+        url_pref: Optional[str] = None,
+        login: Optional[Dict[str, str]] = {"username": None, "password": None},
+        session: Optional[requests.Session] = None,
+    ) -> "AbstractList":


I'm confused about why the type hint for the output of this abstract method is a string "AbstractList" instead of a reference to the class.

Is this the best way to do it in Python?

An answer can be found here. I have just adapted it after seeing other code.

Ok, thanks. I was just ignorant about that.
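
For context, a short sketch of why the return annotation is quoted. The class and method names (`SketchAbstractList`, `SketchW3CList`, `from_messages`) are hypothetical stand-ins, not the PR's code. While a class body is being executed, the class name is not yet bound, so an unquoted `-> SketchAbstractList` would raise a `NameError`; a string annotation (a forward reference) avoids that.

```python
from abc import ABC, abstractmethod
from typing import List


class SketchAbstractList(ABC):
    def __init__(self, name: str, msgs: List[str]):
        self.name = name
        self.msgs = msgs

    @classmethod
    @abstractmethod
    def from_messages(cls, name: str, msgs: List[str]) -> "SketchAbstractList":
        """Quoted annotation: the class name does not exist yet at this point."""


class SketchW3CList(SketchAbstractList):
    @classmethod
    def from_messages(cls, name: str, msgs: List[str]) -> "SketchW3CList":
        # cls is the concrete subclass, so this returns a SketchW3CList.
        return cls(name, msgs)


mlist = SketchW3CList.from_messages("public-test", ["msg 1", "msg 2"])
```

An alternative is to put `from __future__ import annotations` at the top of the module, which defers evaluation of all annotations so the quotes can be dropped.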

sbenthall · 2021-12-30T19:53:24Z

bigbang/ingress/abstract.py

+
+class AbstractArchive(ABC):
+    """
+    This class handles the scraping of a public mailing list archive.


Again -- what is the difference between a list and an archive?

I believe that in some notebooks, Archives including email from multiple lists are built and saved to disk.
But these multi-archives are not scraped directly.

This old Archive design may be something to scrap and not build so much around.

But I wonder how you have used this class in your ListServ work, and how it differs from your use of the List classes.

Again -- what is the difference between a list and an archive?

Find an extended explanation below under the headline: "Difference between list and archive"

sbenthall · 2021-12-30T19:56:21Z

bigbang/ingress/listserv.py


-
-class ListservArchive(object):
+class ListservArchive(AbstractArchive):


I see: "An archive is a list of ListservList elements."

I think this could be made clearer in documentation throughout the library.

Does an Archive, generally speaking, keep track of which list each message is associated with? That would be useful.

Difference between list and archive

Here is an extended explanation of why I distinguished between mailing lists (currently AbstractList, W3CList, ListservList) and archives (currently AbstractArchive, W3CArchive, ListservArchive).

Mailing Archive:
I defined archive in the code as the collection of all emails -- across time and topics -- that an SDO has made publicly accessible. In other words, I used archive to mean the domain name, which for W3C is @w3.org and for 3GPP is @LIST.ETSI.ORG. The mailing archive is a Python list of mailing lists (see def __init__(lists: List[AbstractList]):), and a mailing list in turn is a Python list of emails (see def __init__(msgs: List[mboxMessage]):). Thus each email is associated with a mailing list.
An alternative for the word archive could be library.

Mailing list:
SDOs, such as W3C and 3GPP, group messages into lists that focus on specific topics. Examples are [email protected] and [email protected].
These mailing lists are accessible from the mailing archive webpage (linked to above in the Mailing Archive section) in different ways for different SDOs (as the HTML code differs), which motivates distinguishing between a mailing list and a mailing archive (or whatever word describes the latter best).

I added improved docstrings to the classes of: bigbang/ingress/abstract.py, bigbang/ingress/listserv.py, bigbang/ingress/w3c.py
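
The nesting described above (an archive holds mailing lists, a mailing list holds emails) can be sketched as follows. `SketchMailingList`, `SketchArchive`, `subjects_by_list` and `make_msg` are simplified, hypothetical stand-ins rather than the classes in this PR, and the list names are invented.

```python
from dataclasses import dataclass, field
from mailbox import mboxMessage
from typing import Dict, List


@dataclass
class SketchMailingList:
    """One topic-specific list: a Python list of emails."""
    name: str
    msgs: List[mboxMessage] = field(default_factory=list)


@dataclass
class SketchArchive:
    """Everything published under one domain: a Python list of mailing lists."""
    name: str
    lists: List[SketchMailingList] = field(default_factory=list)

    def subjects_by_list(self) -> Dict[str, List[str]]:
        # Each email stays attached to the list it was scraped from.
        return {ml.name: [m["Subject"] for m in ml.msgs] for ml in self.lists}


def make_msg(subject: str) -> mboxMessage:
    msg = mboxMessage()
    msg["Subject"] = subject
    return msg


archive = SketchArchive(
    name="example.org",
    lists=[
        SketchMailingList("list-a", [make_msg("Agenda"), make_msg("Minutes")]),
        SketchMailingList("list-b", [make_msg("Draft review")]),
    ],
)
```

In this shape, the list membership of every message can be recovered from the container it sits in, which is one way to answer the question of whether an archive keeps track of which list each message belongs to.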

Ok. Thanks for this explanation. I see how it is internally consistent and makes sense.

I believe your use of mailing list is entirely standard.

I have a remaining concern, which is that all the earlier mailing list handling code, built around Mailman and still used for IETF data, uses the term Archive to mean something else. So what you've done creates some confusion.

I gather that what you've written is intended for bulk analysis of an SDO's entire library (to use a neutral term for now).

BigBang is used for other use cases. For example, many Mailman instances are not managed by SDOs. The user may not be interested in all the lists associated with the mailing list host.

In the other BigBang code, 'Archive' refers to a locally stored representation of a collection of email, which might include email from multiple lists. It can in principle include some preprocessing. It might not be a good name for it.

But I think maybe a better name for what you are calling an Archive might be an email Host or Domain -- something that is more semiotically linked with the unit of analysis, the SDO's relationship to those lists. 'Archive' might be too generic for this purpose.

Thanks for the clarification. Bigbang is just getting better!

I have made the following renaming:

AbstractList --> AbstractMailingList

AbstractArchive --> AbstractMailingListDomain

and the same holds for Listserv and W3C.

sbenthall · 2021-12-30T20:15:55Z

bigbang/ingress/w3c.py

+    ----------
+    website : Set 'True' if messages are going to be scraped from websites,
+        otherwise 'False' if read from local memory. This distinction needs to
+        be made if missing messages should be added.


I don't understand the use of this parameter based on the comment.

Yes, I should find a better word/phrasing or abstraction for this. This parameter is used to update/complete the already ingested mailing archive. Thus, the mailing archive in the local memory needs to be compared to the one on the webpage. The argument website makes the difference between reading a message from the website or local memory.
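
To make the update step more concrete, here is one possible way to compare a locally stored list with the one on the website. This is only an assumption about how such a comparison could work, not what the PR does; `find_missing_message_urls` and the example URLs are hypothetical.

```python
from typing import Iterable, List


def find_missing_message_urls(
    local_urls: Iterable[str], remote_urls: Iterable[str]
) -> List[str]:
    """Return the remote message URLs not yet stored locally, in remote order."""
    known = set(local_urls)
    return [url for url in remote_urls if url not in known]


# URLs of messages already ingested and stored locally (made-up values).
local = [
    "https://example.org/list/0001.html",
    "https://example.org/list/0002.html",
]
# URLs currently listed on the website (made-up values).
remote = local + ["https://example.org/list/0003.html"]

missing = find_missing_message_urls(local, remote)
```

Only the messages in `missing` would then need to be fetched from the website, while all others can be read from local storage.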

sbenthall · 2021-12-30T20:18:07Z

bigbang/ingress/w3c.py

+        otherwise 'False' if read from local memory. This distinction needs to
+        be made if missing messages should be added.
+    url_login : URL to the 'Log In' page.
+    url_pref : URL to the 'Preferences'/settings page.


I am confused why a Parser object needs login information.
Is this responsible both for accessing the data online, and then also parsing it into a standard format?

Maybe these functionalities should be separated or the class renamed to make it clear what the class's full scope is.

For the 3GPP mailing archive one needs to create an account to access all the messages, otherwise all reply-to header fields just show <[login to unmask]>. But for W3C these arguments are not necessary, so I should remove them from the /ingress/w3c.py file.

removed those unnecessary method arguments for W3C in this PR

sbenthall · 2021-12-30T20:20:29Z

bigbang/mailman.py

@@ -24,7 +24,8 @@

 from config.config import CONFIG

-from . import listserv, parse, w3crawl
+from bigbang.ingress import listserv, w3c
+from bigbang import parse


These changes are going to create a merge conflict with #540

I will sort them out once you have merged that PR into main

sbenthall · 2021-12-30T20:21:18Z

bigbang/utils.py

+logger = logging.getLogger(__name__)
+
+
+def get_website_content(


I thought I saw something like this elsewhere in this PR -- is this perhaps a redundant function? (I like it better in utils.py)

Indeed! For some reason I had the def get_website_content(): also in bigbang/ingress/abstract.py. Changed this PR to only refer to the def get_website_content(): in bigbang/utils.py

sbenthall · 2021-12-30T20:22:10Z

bigbang/utils.py

+        password = ask_for_input("Enter your Password: ")
+    if record and isinstance(username, str) and isinstance(password, str):
+        loginkey_to_file(username, password, file_auth)
+    return username, password


Looks like many of these new functions have to do specifically with managing interactions with a remote website that requires login.

Maybe they should be in a separate submodule with a more descriptive name than utils.py

I have now moved ingress-related functions from bigbang/utils.py to bigbang/ingress/utils.py in this PR

sbenthall · 2021-12-30T20:23:16Z

@Christovis Please see inline comments for my review.

Excellent work! I had a few minor recommendations, mainly around naming and documentation.

sbenthall · 2022-01-03T16:58:24Z

I've engaged on a couple remaining points:

The potential confusion around data types and use of the term "List"
The use of the term Archive

Happy to jump on a call to work this out if that's helpful.

Christovis · 2022-01-04T13:18:34Z

This is going to be a great PR!
I have addressed the issues with words -- let me know what you think.
If it's ok, then I will just polish it off a bit more and merge.

sbenthall · 2022-01-04T22:24:37Z

Yes, excellent! Thank you, this is awesome. Please feel free to merge when you like.

Christovis added 14 commits December 10, 2021 13:35

intermediate commit

d746453

Merge branch 'main' into dev_scraping_abc

3fc384c

refactor w3c to make unit testing easier

e440344

refactored W3C Email message scraping

57fc530

refactor W3C mailing list scraping

3d7b909

refactor of W3C mailing list scraping done

c90ee8a

tets new W3C mailing list scraping

58b55ed

start refactoring W3C archive scraping, name i/o more general

2906e15

implemented W3C archive scraping, remove ConversationKG function

5031ce7

W3C implementation and testing complete

4efb02e

rename file

594644a

creating AbstractMessageParser

7905cb8

add abstractmethod flags

32fad0d

abstraction v1 for scraping complete

0e263ff

Christovis requested a review from sbenthall December 13, 2021 23:58

Christovis mentioned this pull request Dec 13, 2021

combine load_data and open_list_archives ? #512

Open

Christovis added 3 commits December 14, 2021 00:01

correct for all tests

fca6355

Merge branch 'main' into dev_scraping_abc

e888ac9

bug fix

7cad009

sbenthall reviewed Dec 14, 2021

sbenthall requested a review from npdoty December 14, 2021 15:12

sbenthall mentioned this pull request Dec 14, 2021

moving domaine entropy calculation from notebooks to library. fixes #529 #535

Merged

Christovis added 3 commits December 17, 2021 13:27

improved test for address field resolution

00c0163

improved test for address field resolution

b85a527

correct docstrings

a62ffce

Christovis mentioned this pull request Dec 20, 2021

w3c scraping hangs #538

Closed

move code into ingress folder

d26e4e6

sbenthall reviewed Dec 30, 2021

Christovis added 5 commits January 3, 2022 10:39

remove login functionality for W3C

1f2860d

use "def get_website_content():" from bigbang/utils.py instead

e03db95

move ingress related functions from bigbang/utils.py to bigbang/ingre…

91bf516

…ss/utils.py

improve docstrings on definition of "list" and "archive"

b1084da

improved docstrings for bigbang_io

93cd146

Christovis added 5 commits January 4, 2022 09:30

remove irrelevant imports

11381a7

sync with main

e7da5bc

change bigbang_io from a more OOP to a more FP style

31ecdc2

renaming List to MailingList and Archive to MailingListDomain and int…

ac9c61a

…roducing costume data types

bug fix

de0a985

use costume data types in bigbang-io

3ebffc7

Christovis merged commit 78f8e1b into main Jan 5, 2022

final polishing

9ef9816
