
[NLP] Fix URL normalization #65

Open
ronentk opened this issue Apr 30, 2024 · 3 comments

ronentk commented Apr 30, 2024

Example - https://twitter.com/marielgoddu/status/1784709899357716521
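For context on what a fix might look like: a minimal sketch of tweet-URL canonicalization, assuming the goal is to map host variants (e.g. `x.com`, `mobile.twitter.com`) onto one canonical host and drop tracking query parameters. The function name and host list are illustrative guesses, not the repo's actual implementation.

```python
from urllib.parse import urlparse, urlunparse

# Hypothetical host list; not from the repo.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "x.com", "www.x.com", "mobile.twitter.com"}

def normalize_tweet_url(url: str) -> str:
    """Canonicalize a tweet URL: force https, a single host, no query/fragment."""
    parts = urlparse(url)
    if parts.netloc.lower() in TWITTER_HOSTS:
        # Rebuild with canonical host and the bare status path only.
        return urlunparse(("https", "twitter.com", parts.path.rstrip("/"), "", "", ""))
    return url  # leave non-Twitter URLs untouched
```

Under this sketch, `https://x.com/marielgoddu/status/1784709899357716521?s=20` and the example URL above would normalize to the same string.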

@ronentk ronentk self-assigned this Apr 30, 2024
ShaRefOh (Contributor) commented

@ronentk Is this error connected to the issue?
Traceback (most recent call last):
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/evaluation/mulitchain_filter_evaluation.py", line 208, in <module>
    pred_labels(df=df, config=config)
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/evaluation/mulitchain_filter_evaluation.py", line 54, in pred_labels
    results = model.batch_process_ref_posts(inputs=inputs, active_list=["keywords", "topics"], batch_size=10)
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/parsers/multi_chain_parser.py", line 213, in batch_process_ref_posts
    md_dict = extract_posts_ref_metadata_dict(
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 160, in extract_posts_ref_metadata_dict
    md_dict = extract_all_metadata_to_dict(
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 129, in extract_all_metadata_to_dict
    md_list = extract_all_metadata_by_type(
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 119, in extract_all_metadata_by_type
    return extract_urls_citoid_metadata(target_urls, max_summary_length)
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 101, in extract_urls_citoid_metadata
    return normalize_citoid_metadata(target_urls, metadatas_raw, max_summary_length)
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 30, in normalize_citoid_metadata
    metadata["original_url"] = url
TypeError: 'ContentTypeError' object does not support item assignment

I'm getting this now when running the batches on the dataset.
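The traceback suggests that when a Citoid request fails, the extractor leaves an exception object (aiohttp's `ContentTypeError`) in `metadatas_raw` where a dict was expected, and `normalize_citoid_metadata` then attempts `metadata["original_url"] = url` on it. A minimal defensive sketch, with names modeled on the traceback; the skip-and-record policy for failed entries is an assumption, not the repo's actual behavior:

```python
def normalize_citoid_metadata(target_urls, metadatas_raw, max_summary_length):
    """Pair each URL with its raw Citoid metadata, tolerating failed fetches."""
    normalized = []
    for url, metadata in zip(target_urls, metadatas_raw):
        if not isinstance(metadata, dict):
            # A failed fetch can yield an exception object (e.g. aiohttp's
            # ContentTypeError) instead of parsed JSON; record a stub
            # entry rather than crashing with a TypeError.
            normalized.append({"original_url": url, "error": repr(metadata)})
            continue
        metadata["original_url"] = url
        # ... the rest of the existing normalization (summary truncation
        # to max_summary_length, etc.) would go here ...
        normalized.append(metadata)
    return normalized
```

This keeps one output entry per input URL, so batch runs surface failed URLs in the results instead of aborting mid-batch.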


ronentk commented Apr 30, 2024

@ShaRefOh Not sure; please open a separate issue with steps to reproduce the error (a list of URLs or something similar).

ShaRefOh (Contributor) replied

OK, but for that I will need to go through the posts one by one instead of using the batch parser function.
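One way to avoid stepping through posts manually: a small hypothetical helper that feeds URLs one at a time into the extractor from the traceback and collects those that raise. `extract_fn` is a placeholder for `extract_urls_citoid_metadata` or a similar single-batch entry point; the helper itself is not part of the repo.

```python
def find_failing_urls(urls, extract_fn):
    """Run the extractor on each URL individually and collect failures."""
    failing = []
    for url in urls:
        try:
            extract_fn([url])  # single-element batch isolates the culprit
        except Exception as exc:
            failing.append((url, repr(exc)))
    return failing
```

The resulting `(url, error)` pairs would be exactly the reproduction list requested above.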
