
[NLP] Fix URL normalization #65

Open
ronentk opened this issue Apr 30, 2024 · 3 comments

ronentk commented Apr 30, 2024

Example - https://twitter.com/marielgoddu/status/1784709899357716521
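For context on what a fix might look like: a minimal sketch of tweet-URL canonicalization, assuming the goal is to map host variants (e.g. `x.com`, `mobile.twitter.com`) onto one canonical host and drop tracking query parameters. The function name and host list are illustrative guesses, not the repo's actual implementation.

```python
from urllib.parse import urlparse, urlunparse

# Hypothetical host list; not from the repo.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "x.com", "www.x.com", "mobile.twitter.com"}

def normalize_tweet_url(url: str) -> str:
    """Canonicalize a tweet URL: force https, a single host, no query/fragment."""
    parts = urlparse(url)
    if parts.netloc.lower() in TWITTER_HOSTS:
        # Rebuild with canonical host and the bare status path only.
        return urlunparse(("https", "twitter.com", parts.path.rstrip("/"), "", "", ""))
    return url  # leave non-Twitter URLs untouched
```

Under this sketch, `https://x.com/marielgoddu/status/1784709899357716521?s=20` and the example URL above would normalize to the same string.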

@ronentk ronentk self-assigned this Apr 30, 2024
ShaRefOh (Contributor) commented

@ronentk Is this error connected to the issue?
Traceback (most recent call last):
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/evaluation/mulitchain_filter_evaluation.py", line 208, in <module>
    pred_labels(df=df, config=config)
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/evaluation/mulitchain_filter_evaluation.py", line 54, in pred_labels
    results = model.batch_process_ref_posts(inputs=inputs, active_list=["keywords", "topics"], batch_size=10)
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/parsers/multi_chain_parser.py", line 213, in batch_process_ref_posts
    md_dict = extract_posts_ref_metadata_dict(
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 160, in extract_posts_ref_metadata_dict
    md_dict = extract_all_metadata_to_dict(
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 129, in extract_all_metadata_to_dict
    md_list = extract_all_metadata_by_type(
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 119, in extract_all_metadata_by_type
    return extract_urls_citoid_metadata(target_urls, max_summary_length)
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 101, in extract_urls_citoid_metadata
    return normalize_citoid_metadata(target_urls, metadatas_raw, max_summary_length)
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 30, in normalize_citoid_metadata
    metadata["original_url"] = url
TypeError: 'ContentTypeError' object does not support item assignment

I'm getting this now when running the batches on the dataset.
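The traceback suggests that when a Citoid request fails, the extractor leaves an exception object (aiohttp's `ContentTypeError`) in `metadatas_raw` where a dict was expected, and `normalize_citoid_metadata` then attempts `metadata["original_url"] = url` on it. A minimal defensive sketch, with names modeled on the traceback; the skip-and-record policy for failed entries is an assumption, not the repo's actual behavior:

```python
def normalize_citoid_metadata(target_urls, metadatas_raw, max_summary_length):
    """Pair each URL with its raw Citoid metadata, tolerating failed fetches."""
    normalized = []
    for url, metadata in zip(target_urls, metadatas_raw):
        if not isinstance(metadata, dict):
            # A failed fetch can yield an exception object (e.g. aiohttp's
            # ContentTypeError) instead of parsed JSON; record a stub
            # entry rather than crashing with a TypeError.
            normalized.append({"original_url": url, "error": repr(metadata)})
            continue
        metadata["original_url"] = url
        # ... the rest of the existing normalization (summary truncation
        # to max_summary_length, etc.) would go here ...
        normalized.append(metadata)
    return normalized
```

This keeps one output entry per input URL, so batch runs surface failed URLs in the results instead of aborting mid-batch.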


ronentk commented Apr 30, 2024

@ShaRefOh Not sure; please open a separate issue with steps to reproduce the error (a list of URLs or something similar).

ShaRefOh (Contributor) replied

OK, but for that I will need to go through the posts one by one instead of using the batch parser function.
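One way to avoid stepping through posts manually: a small hypothetical helper that feeds URLs one at a time into the extractor from the traceback and collects those that raise. `extract_fn` is a placeholder for `extract_urls_citoid_metadata` or a similar single-batch entry point; the helper itself is not part of the repo.

```python
def find_failing_urls(urls, extract_fn):
    """Run the extractor on each URL individually and collect failures."""
    failing = []
    for url in urls:
        try:
            extract_fn([url])  # single-element batch isolates the culprit
        except Exception as exc:
            failing.append((url, repr(exc)))
    return failing
```

The resulting `(url, error)` pairs would be exactly the reproduction list requested above.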
