Some ideas #300
Replies: 2 comments 10 replies
-
@antoineeripret Thank you so much for the inputs. Always happy to get feedback/suggestions. My thoughts:
import advertools as adv
import pandas as pd

crawldf = pd.read_json('output_file.jl', lines=True)
urldf = adv.url_to_df(crawldf['url'])
# Boolean flag columns: True where the URL belongs to the segment
crawldf['product'] = urldf['dir_1'].eq('producto')
crawldf['shoes'] = urldf['dir_1'].eq('producto') & urldf['dir_2'].eq('shoes')
crawldf['offers'] = urldf['url'].str.contains('oferta')
Or was there something else you wanted to achieve with this?
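If a single segment column is preferred over several boolean columns, the dictionary idea could be sketched roughly as follows. This is only an illustration using the standard library, not an existing advertools feature: the `SEGMENTS` mapping, the `segment_url` helper, and the regex patterns are all hypothetical names and values made up for this example.

```python
import re
from urllib.parse import urlsplit

# Hypothetical mapping of segment name -> regex tested against the URL path.
# Order matters: the first matching pattern wins, so the more specific
# patterns are listed first.
SEGMENTS = {
    "shoes": r"^/producto/shoes(/|$)",
    "product": r"^/producto(/|$)",
    "offers": r"oferta",
}


def segment_url(url, segments=SEGMENTS, default="other"):
    """Return the name of the first segment whose pattern matches the URL path."""
    path = urlsplit(url).path
    for name, pattern in segments.items():
        if re.search(pattern, path):
            return name
    return default


print(segment_url("https://www.example.com/producto/shoes/nike"))
```

Applied to a crawl DataFrame, this would be something like `crawldf['segment'] = crawldf['url'].map(segment_url)`.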
Let me know your thoughts, and we'll discuss further. Thanks again!
-
Hi @eliasdabbas,
Long time no speak :) I've been using your library for some projects lately and I wondered whether it would be possible to add the meta robots tag to the default elements retrieved by the spider.py file. Right now, I often add it like this:
import advertools as adv

adv.crawl(
    "https://www.example.com",
    output_file='crawl_results.jl',
    follow_links=True,
    xpath_selectors={
        # meta robots tag
        'meta_robots': '//meta[@name="robots"]/@content',
    },
)
While I understand that you obviously can't add every on-page element by default, this one would be pretty useful to have without writing custom code because:
Happy to create a PR if the idea makes sense to you :) Thank you!
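For context, once a `meta_robots` column is available, a typical next step is flagging non-indexable pages. The sketch below shows one way to do that; `parse_meta_robots` and `is_indexable` are hypothetical helpers written for this example, not part of advertools. It assumes that multiple values found on one page are joined with `@@` in the crawl output, and that missing values arrive as something other than a string (for example NaN).

```python
import re


def parse_meta_robots(content):
    """Split a meta robots value into a set of lowercase directives."""
    # Missing values (None, NaN) are not strings and yield an empty set.
    if not isinstance(content, str):
        return set()
    # Split on commas and on the assumed '@@' multi-value separator.
    parts = re.split(r",|@@", content)
    return {part.strip().lower() for part in parts if part.strip()}


def is_indexable(content):
    """Return False when the directives contain 'noindex' or 'none'."""
    return not ({"noindex", "none"} & parse_meta_robots(content))


print(is_indexable("NOINDEX, nofollow"))
```

On a crawl DataFrame this could be applied with `crawldf['meta_robots'].map(is_indexable)`.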
-
Hi @eliasdabbas!
Firstly, thank you for creating and maintaining a library that is now essential for me. I really appreciate that.
I have a couple of features that I'd love to discuss with you. I'll try to keep my comment as short as possible; let me know what you think.
dict for our architecture and have a new column using this segmentation. For instance, the following dictionary could be passed as an argument to the adv.crawl method.
client-secrets.json key, we could add this information for a crawl or a sitemap_to_df method call. Using this library or something custom-built if you don't want to add dependencies. I'm not sure about GA, because the GA4 API is still fresh and may be updated during the upcoming months.
sitemap_to_df and url_to_df methods together with a pd.merge to have all the information I need. Wouldn't it make sense to add the columns you already have in url_to_df when you call sitemap_to_df?
Thanks in advance for your feedback!
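For reference, the sitemap_to_df + url_to_df + pd.merge workflow mentioned in the last point might look roughly like the sketch below. To keep it runnable offline, two small hand-made DataFrames stand in for the real outputs of `adv.sitemap_to_df(...)` and `adv.url_to_df(...)`; the URLs and dates are made up, and only the `loc`, `url`, `dir_1` and `dir_2` column names are taken from what those functions return.

```python
import pandas as pd

urls = [
    "https://www.example.com/producto/shoes",
    "https://www.example.com/blog/news",
]

# Stand-in for adv.sitemap_to_df(...): sitemap URLs live in the `loc` column.
sitemap_df = pd.DataFrame({"loc": urls, "lastmod": ["2023-01-01", "2023-02-01"]})

# Stand-in for adv.url_to_df(sitemap_df["loc"]): one row per URL,
# with the path split into dir_1, dir_2, ...
url_df = pd.DataFrame(
    {"url": urls, "dir_1": ["producto", "blog"], "dir_2": ["shoes", "news"]}
)

# Attach the URL components to the sitemap rows by matching loc to url.
merged = pd.merge(sitemap_df, url_df, left_on="loc", right_on="url", how="left")
print(merged[["loc", "lastmod", "dir_1", "dir_2"]])
```

A left merge keeps every sitemap row even if a URL could not be matched.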