
Notebooks #137 (Open)

bisd98 wants to merge 13 commits into main
Conversation

@bisd98 (Member) commented Apr 28, 2024

No description provided.


github-actions (bot) requested a review from bafaurazan on April 28, 2024 08:44

@pgronkievitz (Member) left a comment:

didn't check notebooks yet

Member commented:

question: why does this require a separate .gitignore?


article = Article("")

def __init__(self, url_list: Union[List[str], str], depth: int = 1):

Member commented:

nitpick: use modern typing syntax

Suggested change:
- def __init__(self, url_list: Union[List[str], str], depth: int = 1):
+ def __init__(self, url_list: list[str] | str, depth: int = 1):

self.depth = depth

@staticmethod
def newspaper_extractor(html):

Member commented:

issue: untyped
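
For illustration, the typed signature might look like this (the str return annotation is an assumption based on the joined-string value the method returns):

@staticmethod
def newspaper_extractor(html: str) -> str:
    ...  # body unchanged; only the annotations are added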

Comment on lines 129 to 136
if text_splitter is None:
    _text_splitter: TextSplitter = SpacyTextSplitter(
        pipeline="pl_core_news_sm",
        chunk_size=chunk,
        chunk_overlap=chunk_overlap,
    )
else:
    _text_splitter = text_splitter

Member commented:

suggestion:

Suggested change:
- if text_splitter is None:
-     _text_splitter: TextSplitter = SpacyTextSplitter(
-         pipeline="pl_core_news_sm",
-         chunk_size=chunk,
-         chunk_overlap=chunk_overlap,
-     )
- else:
-     _text_splitter = text_splitter
+ _text_splitter = text_splitter or SpacyTextSplitter(
+     pipeline="pl_core_news_sm",
+     chunk_size=chunk,
+     chunk_overlap=chunk_overlap,
+ )

Comment on lines 137 to 142
docs = self.load()
docs = reduce(
    lambda data, method: method(data),
    [CleanWebLoader.junk_remover, CleanWebLoader.ds_converter],
    docs,
)

Member commented:

suggestion:

Suggested change:
- docs = self.load()
- docs = reduce(
-     lambda data, method: method(data),
-     [CleanWebLoader.junk_remover, CleanWebLoader.ds_converter],
-     docs,
- )
+ docs = reduce(
+     lambda data, method: method(data),
+     [CleanWebLoader.junk_remover, CleanWebLoader.ds_converter],
+     self.load(),
+ )

loader = LocalDataLoader.loaders[d_type](file_path)
try:
    docs.append(loader.load()[0])
except Exception as e:

Member commented:

issue: use specific exceptions, not a generic Exception, as it'll silently swallow exceptions you didn't foresee
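
For instance, a sketch of the narrower form (the concrete exception types here are assumptions and should be matched to what each loader actually raises):

import logging

loader = LocalDataLoader.loaders[d_type](file_path)
try:
    docs.append(loader.load()[0])
except (FileNotFoundError, ValueError) as e:
    # illustrative types only; catch whatever the specific loader raises
    logging.warning("Skipping %s: %s", file_path, e)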


[tool.poetry.dev-dependencies]
black = "*"
flake8 = "*"

Member commented:

issue: missing newline at end of file

requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

[tool.poetry.dev-dependencies]

Member commented:

issue: legacy syntax, don't use it
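
With Poetry 1.2+, the dev dependencies are declared as a group instead (a sketch, carrying over the entries from the snippet above):

[tool.poetry.group.dev.dependencies]
black = "*"
flake8 = "*"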

Comment on lines 19 to 20
black = "*"
flake8 = "*"

Member commented:

issue: be more specific about required versions
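
For example, with caret constraints (the version numbers below are purely illustrative):

black = "^24.4"
flake8 = "^7.0"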

docs.extend(local_loader.load())

for doc in docs:
requests.post(url, json=doc)

Member commented:

issue: missing newline at end of file

@bisd98 (Member, Author) left a comment:

Fixed the code as suggested. Fixing the exceptions and module versions requires updating and testing the current code.

@pgronkievitz (Member) left a comment:

please use black, ruff (with as many rules enabled as possible) and pyright with strict typing
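
A sketch of how that could be wired up in pyproject.toml (the blanket ALL selection is an aggressive default, not the only option; both tables use documented ruff and pyright settings):

[tool.ruff.lint]
select = ["ALL"]  # enable every rule group, then ignore selectively

[tool.pyright]
typeCheckingMode = "strict"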


"""

article = Article("")

Member commented:

nitpick: are you aware this instance will be shared and may cause weird bugs with overwrites between objects? (this object is created once and will be shared between CleanWebLoader instances)
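
A minimal sketch of the per-instance alternative (the self.url_list assignment is an assumption; only self.depth appears in the diff above):

def __init__(self, url_list: list[str] | str, depth: int = 1):
    self.article = Article("")  # one parser per loader instance, no shared state
    self.url_list = url_list
    self.depth = depth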

Comment on lines +41 to +84
@staticmethod
def newspaper_extractor(html: str):
    """
    Extracts and cleans text content from HTML using the 'newspaper' library.

    :param html: HTML content to be processed.
    :return: Cleaned and concatenated text extracted from the HTML.
    """
    CleanWebLoader.article.set_html(html)
    CleanWebLoader.article.parse()
    return " ".join(CleanWebLoader.article.text.split())

@staticmethod
def ds_converter(docs: list[Document]):
    """
    Converts a list of documents into a specific data structure.

    :param docs: List of documents to be converted.
    :return: List of dictionaries, each representing a document with 'text' key.
    """
    return [{"text": doc.page_content} for doc in docs]

@staticmethod
def junk_remover(docs: list[Document]):
    """
    Identifies and returns a list of suspected junk documents based on specific criteria.

    :param docs: A list of documents, where each document is represented as a dictionary.
        Each dictionary should have a "text" key containing the text content of the document.
    :return: A list of suspected junk documents based on the criteria of having less than
        300 characters or having the same text as another document in the input list.
    """
    junk_docs = [doc for doc in docs if len(doc.page_content) < 300]
    seen_texts = set()
    clear_docs = []
    for doc in docs:
        if "title" not in doc.metadata.keys():
            junk_docs.append(doc)
        elif doc.page_content not in seen_texts and doc not in junk_docs:
            clear_docs.append(doc)
            seen_texts.add(doc.page_content)
        else:
            pass
    return clear_docs

Member commented:

issue: those should not be static; add a self argument and use self.article.* instead of CleanWebLoader.article.*.
ds_converter and junk_remover should be separate from this class, as they've got nothing to do with it.
Also: TYPING, please check with pyright set to strict.
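
For illustration, one way to restructure (names and return annotations are assumptions; pyright in strict mode will want explicit return types everywhere):

def ds_converter(docs: list[Document]) -> list[dict[str, str]]:
    """Module-level helper: convert documents to the {"text": ...} shape."""
    return [{"text": doc.page_content} for doc in docs]

# inside CleanWebLoader, as an instance method:
def newspaper_extractor(self, html: str) -> str:
    self.article.set_html(html)
    self.article.parse()
    return " ".join(self.article.text.split())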

@@ -0,0 +1,140 @@
from newspaper import Article
from functools import reduce
from typing import List, Optional

Member commented:

nitpick: both are unnecessary; list[type] and type | None work just fine

:param chunk_overlap: Overlap size between chunks (default is 80).
:return: List of dictionaries, each representing a document with 'text' key.
"""
_text_splitter: text_splitter or TextSplitter = SpacyTextSplitter(

Member commented:

issue: that's some weird-ass typing, wtf
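
Presumably the intended form is the one from the earlier suggestion, with the annotation on the left-hand side:

_text_splitter: TextSplitter = text_splitter or SpacyTextSplitter(
    pipeline="pl_core_news_sm",
    chunk_size=chunk,
    chunk_overlap=chunk_overlap,
)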

Member (Author) commented:

My bad, I missed it

Comment on lines +12 to +17
loaders = {
    ".pdf": PyPDFLoader,
    ".json": JSONLoader,
    ".txt": TextLoader,
    ".csv": CSVLoader,
}

Member commented:

nitpick: move this to __init__
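
A sketch of that move (the self.path assignment is an assumption taken from the constructor below; the loader classes are the langchain imports already used in this module):

def __init__(self, path: list[str] | str):
    self.loaders = {
        ".pdf": PyPDFLoader,
        ".json": JSONLoader,
        ".txt": TextLoader,
        ".csv": CSVLoader,
    }
    self.path = path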

".csv": CSVLoader,
}

def __init__(self, path: Union[List[str], str]):

Member commented:

suggestion:

Suggested change:
- def __init__(self, path: Union[List[str], str]):
+ def __init__(self, path: list[str] | str):

Comment on lines +28 to +55
@staticmethod
def ds_converter(docs):
    """
    Converts a list of documents into a specific data structure.

    :param docs: List of documents to be converted.
    :return: List of dictionaries, each representing a document with 'text' and 'url' keys.
    """
    return [{"text": doc.page_content} for doc in docs]

@staticmethod
def junk_remover(docs):
    """
    Identifies and returns a list of suspected junk documents based on specific criteria.

    :param docs: A list of documents, where each document is represented as a dictionary.
        Each dictionary should have a "text" key containing the text content of the document.
    :return: A list of suspected junk documents based on the criteria of having less than
        300 characters or having the same text as another document in the input list.
    """
    junk_docs = [doc for doc in docs if len(doc.page_content) < 300]
    seen_texts = {}
    clear_docs = []
    for doc in docs:
        if doc.page_content not in seen_texts and doc not in junk_docs:
            clear_docs.append(doc)
            seen_texts.add(doc.page_content)
    return clear_docs

Member commented:

suggestion: yep, those should be shared; if you want them inside this class, make them a mixin
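
A sketch of the mixin route (DocCleaningMixin is a hypothetical name; the shared bodies live in one place):

class DocCleaningMixin:
    @staticmethod
    def ds_converter(docs: list[Document]) -> list[dict[str, str]]:
        return [{"text": doc.page_content} for doc in docs]

    @staticmethod
    def junk_remover(docs: list[Document]) -> list[Document]:
        ...  # shared implementation defined once here

class CleanWebLoader(DocCleaningMixin): ...
class LocalDataLoader(DocCleaningMixin): ...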

def test_ds_converter(clean_web_loader):
    docs = [Document(page_content="Document 1"), Document(page_content="Document 2")]
    expected_output = [{"text": "Document 1"}, {"text": "Document 2"}]
    assert clean_web_loader.ds_converter(docs) == expected_output

Member commented:

issue: tests should be clean functions with no side effects; this will break once example.com changes (or this fixture is entirely unnecessary)
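
A fixture-free version might look like this (assuming ds_converter stays reachable as a static method or module-level function):

def test_ds_converter():
    docs = [Document(page_content="Document 1"), Document(page_content="Document 2")]
    expected_output = [{"text": "Document 1"}, {"text": "Document 2"}]
    assert CleanWebLoader.ds_converter(docs) == expected_output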
