
Notebooks #137 (Open)

bisd98 wants to merge 13 commits into main
Conversation

@bisd98 (Member) commented Apr 28, 2024

No description provided.


github-actions (bot) requested a review from bafaurazan on April 28, 2024 08:44

@pgronkievitz (Member) left a comment:

didn't check notebooks yet

Member commented:

question: why does this require a separate .gitignore?


article = Article("")

def __init__(self, url_list: Union[List[str], str], depth: int = 1):

Member commented:

nitpick: use modern typing syntax

Suggested change:
- def __init__(self, url_list: Union[List[str], str], depth: int = 1):
+ def __init__(self, url_list: list[str] | str, depth: int = 1):

self.depth = depth

@staticmethod
def newspaper_extractor(html):

Member commented:

issue: untyped
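
For illustration, the typed signature might look like this (the str return annotation is an assumption based on the joined-string value the method returns):

@staticmethod
def newspaper_extractor(html: str) -> str:
    ...  # body unchanged; only the annotations are added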

Comment on lines 129 to 136
if text_splitter is None:
    _text_splitter: TextSplitter = SpacyTextSplitter(
        pipeline="pl_core_news_sm",
        chunk_size=chunk,
        chunk_overlap=chunk_overlap,
    )
else:
    _text_splitter = text_splitter

Member commented:

suggestion:

Suggested change:
- if text_splitter is None:
-     _text_splitter: TextSplitter = SpacyTextSplitter(
-         pipeline="pl_core_news_sm",
-         chunk_size=chunk,
-         chunk_overlap=chunk_overlap,
-     )
- else:
-     _text_splitter = text_splitter
+ _text_splitter = text_splitter or SpacyTextSplitter(
+     pipeline="pl_core_news_sm",
+     chunk_size=chunk,
+     chunk_overlap=chunk_overlap,
+ )

Comment on lines 137 to 142
docs = self.load()
docs = reduce(
    lambda data, method: method(data),
    [CleanWebLoader.junk_remover, CleanWebLoader.ds_converter],
    docs,
)

Member commented:

suggestion:

Suggested change:
- docs = self.load()
- docs = reduce(
-     lambda data, method: method(data),
-     [CleanWebLoader.junk_remover, CleanWebLoader.ds_converter],
-     docs,
- )
+ docs = reduce(
+     lambda data, method: method(data),
+     [CleanWebLoader.junk_remover, CleanWebLoader.ds_converter],
+     self.load(),
+ )

loader = LocalDataLoader.loaders[d_type](file_path)
try:
    docs.append(loader.load()[0])
except Exception as e:

Member commented:

issue: use specific exceptions, not a generic Exception, as it'll silently swallow exceptions you didn't foresee
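
For instance, a sketch of the narrower form (the concrete exception types here are assumptions and should be matched to what each loader actually raises):

import logging

loader = LocalDataLoader.loaders[d_type](file_path)
try:
    docs.append(loader.load()[0])
except (FileNotFoundError, ValueError) as e:
    # illustrative types only; catch whatever the specific loader raises
    logging.warning("Skipping %s: %s", file_path, e)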


[tool.poetry.dev-dependencies]
black = "*"
flake8 = "*"

Member commented:

issue: missing newline at end of file

requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

[tool.poetry.dev-dependencies]

Member commented:

issue: legacy syntax, don't use it
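
With Poetry 1.2+, the dev dependencies are declared as a group instead (a sketch, carrying over the entries from the snippet above):

[tool.poetry.group.dev.dependencies]
black = "*"
flake8 = "*"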

Comment on lines 19 to 20
black = "*"
flake8 = "*"

Member commented:

issue: be more specific about required versions
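
For example, with caret constraints (the version numbers below are purely illustrative):

black = "^24.4"
flake8 = "^7.0"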

docs.extend(local_loader.load())

for doc in docs:
requests.post(url, json=doc)

Member commented:

issue: missing newline at end of file

@bisd98 (Member, Author) left a comment:

Fixed the code as suggested. Fixing the exceptions and module versions requires updating and testing the current code.

@pgronkievitz (Member) left a comment:

please use black, ruff (with as many rules enabled as possible) and pyright with strict typing
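
A sketch of how that could be wired up in pyproject.toml (the blanket ALL selection is an aggressive default, not the only option; both tables use documented ruff and pyright settings):

[tool.ruff.lint]
select = ["ALL"]  # enable every rule group, then ignore selectively

[tool.pyright]
typeCheckingMode = "strict"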


"""

article = Article("")

Member commented:

nitpick: are you aware this instance will be shared and may cause weird bugs with overwrites between objects? (this object is created once and will be shared between CleanWebLoader instances)
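
A minimal sketch of the per-instance alternative (the self.url_list assignment is an assumption; only self.depth appears in the diff above):

def __init__(self, url_list: list[str] | str, depth: int = 1):
    self.article = Article("")  # one parser per loader instance, no shared state
    self.url_list = url_list
    self.depth = depth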

Comment on lines +41 to +84
@staticmethod
def newspaper_extractor(html: str):
    """
    Extracts and cleans text content from HTML using the 'newspaper' library.

    :param html: HTML content to be processed.
    :return: Cleaned and concatenated text extracted from the HTML.
    """
    CleanWebLoader.article.set_html(html)
    CleanWebLoader.article.parse()
    return " ".join(CleanWebLoader.article.text.split())

@staticmethod
def ds_converter(docs: list[Document]):
    """
    Converts a list of documents into a specific data structure.

    :param docs: List of documents to be converted.
    :return: List of dictionaries, each representing a document with 'text' key.
    """
    return [{"text": doc.page_content} for doc in docs]

@staticmethod
def junk_remover(docs: list[Document]):
    """
    Identifies and returns a list of suspected junk documents based on specific criteria.

    :param docs: A list of documents, where each document is represented as a dictionary.
        Each dictionary should have a "text" key containing the text content of the document.
    :return: A list of suspected junk documents based on the criteria of having less than
        300 characters or having the same text as another document in the input list.
    """
    junk_docs = [doc for doc in docs if len(doc.page_content) < 300]
    seen_texts = set()
    clear_docs = []
    for doc in docs:
        if "title" not in doc.metadata.keys():
            junk_docs.append(doc)
        elif doc.page_content not in seen_texts and doc not in junk_docs:
            clear_docs.append(doc)
            seen_texts.add(doc.page_content)
        else:
            pass
    return clear_docs

Member commented:

issue: those should not be static; add a self argument and use self.article.* instead of CleanWebLoader.article.*.
ds_converter and junk_remover should be separate from this class, as they've got nothing to do with it.
Also: TYPING, please check with pyright set to strict.
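
For illustration, one way to restructure (names and return annotations are assumptions; pyright in strict mode will want explicit return types everywhere):

def ds_converter(docs: list[Document]) -> list[dict[str, str]]:
    """Module-level helper: convert documents to the {"text": ...} shape."""
    return [{"text": doc.page_content} for doc in docs]

# inside CleanWebLoader, as an instance method:
def newspaper_extractor(self, html: str) -> str:
    self.article.set_html(html)
    self.article.parse()
    return " ".join(self.article.text.split())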

@@ -0,0 +1,140 @@
from newspaper import Article
from functools import reduce
from typing import List, Optional

Member commented:

nitpick: both are unnecessary; list[type] and type | None work just fine

:param chunk_overlap: Overlap size between chunks (default is 80).
:return: List of dictionaries, each representing a document with 'text' key.
"""
_text_splitter: text_splitter or TextSplitter = SpacyTextSplitter(

Member commented:

issue: that's some weird-ass typing, wtf
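
Presumably the intended form is the one from the earlier suggestion, with the annotation on the left-hand side:

_text_splitter: TextSplitter = text_splitter or SpacyTextSplitter(
    pipeline="pl_core_news_sm",
    chunk_size=chunk,
    chunk_overlap=chunk_overlap,
)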

Member (Author) commented:

My bad, I missed it

Comment on lines +12 to +17
loaders = {
    ".pdf": PyPDFLoader,
    ".json": JSONLoader,
    ".txt": TextLoader,
    ".csv": CSVLoader,
}

Member commented:

nitpick: move this to __init__
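
A sketch of that move (the self.path assignment is an assumption taken from the constructor below; the loader classes are the langchain imports already used in this module):

def __init__(self, path: list[str] | str):
    self.loaders = {
        ".pdf": PyPDFLoader,
        ".json": JSONLoader,
        ".txt": TextLoader,
        ".csv": CSVLoader,
    }
    self.path = path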

".csv": CSVLoader,
}

def __init__(self, path: Union[List[str], str]):

Member commented:

suggestion:

Suggested change:
- def __init__(self, path: Union[List[str], str]):
+ def __init__(self, path: list[str] | str):

Comment on lines +28 to +55
@staticmethod
def ds_converter(docs):
    """
    Converts a list of documents into a specific data structure.

    :param docs: List of documents to be converted.
    :return: List of dictionaries, each representing a document with 'text' and 'url' keys.
    """
    return [{"text": doc.page_content} for doc in docs]

@staticmethod
def junk_remover(docs):
    """
    Identifies and returns a list of suspected junk documents based on specific criteria.

    :param docs: A list of documents, where each document is represented as a dictionary.
        Each dictionary should have a "text" key containing the text content of the document.
    :return: A list of suspected junk documents based on the criteria of having less than
        300 characters or having the same text as another document in the input list.
    """
    junk_docs = [doc for doc in docs if len(doc.page_content) < 300]
    seen_texts = {}
    clear_docs = []
    for doc in docs:
        if doc.page_content not in seen_texts and doc not in junk_docs:
            clear_docs.append(doc)
            seen_texts.add(doc.page_content)
    return clear_docs

Member commented:

suggestion: yep, those should be shared; if you want them inside this class, make them a mixin
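
A sketch of the mixin route (DocCleaningMixin is a hypothetical name; the shared bodies live in one place):

class DocCleaningMixin:
    @staticmethod
    def ds_converter(docs: list[Document]) -> list[dict[str, str]]:
        return [{"text": doc.page_content} for doc in docs]

    @staticmethod
    def junk_remover(docs: list[Document]) -> list[Document]:
        ...  # shared implementation defined once here

class CleanWebLoader(DocCleaningMixin): ...
class LocalDataLoader(DocCleaningMixin): ...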

def test_ds_converter(clean_web_loader):
    docs = [Document(page_content="Document 1"), Document(page_content="Document 2")]
    expected_output = [{"text": "Document 1"}, {"text": "Document 2"}]
    assert clean_web_loader.ds_converter(docs) == expected_output

Member commented:

issue: tests should be clean functions with no side effects; this will break once example.com changes (or this fixture is entirely unnecessary)
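
A fixture-free version might look like this (assuming ds_converter stays reachable as a static method or module-level function):

def test_ds_converter():
    docs = [Document(page_content="Document 1"), Document(page_content="Document 2")]
    expected_output = [{"text": "Document 1"}, {"text": "Document 2"}]
    assert CleanWebLoader.ds_converter(docs) == expected_output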
