no output #264
Hello Elias,
Just ran the code. It seems they block everything through their robots.txt. You can ignore that if you want by using a special setting (it's up to you to make sure you are complying with their rules and accepting the consequences). It's also very good to always save the logs of your crawl so you can check what might have happened if there are any issues:

```python
import advertools as adv

adv.crawl('https://example.com', 'output.jl',
          custom_settings={'ROBOTSTXT_OBEY': False, 'LOG_FILE': 'mycrawl.log'})
```

Using your URL:

```python
adv.crawl(
    'https://api-adresse.data.gouv.fr/search/?q=12+oulevard+de+reuilly+75012+paris',
    output_file='data_gouv_fr_result.jl',
    custom_settings={
        'CONCURRENT_ITEMS': 50,
        'ROBOTSTXT_OBEY': False
    })
```
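If you want to see what their robots.txt actually blocks before deciding to override it, the standard library's `urllib.robotparser` can test a URL against the rules. A minimal sketch using hypothetical robots.txt content (not the live file):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that disallows everything (illustration only)
rules = """User-agent: *
Disallow: /""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A blanket "Disallow: /" blocks every path for every user agent
print(rp.can_fetch('*', 'https://api-adresse.data.gouv.fr/search/'))  # False
```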
```python
import pandas as pd

df = pd.read_json('/Users/me/Desktop/temp/data_gouv_fr_result.jl', lines=True)
df
```
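To skim the saved log for problems afterwards, plain Python is enough; a sketch with made-up Scrapy-style log lines (in practice you'd read `mycrawl.log` instead):

```python
# Sample log content for illustration only; real crawls would read the file
# saved via the LOG_FILE setting
sample_log = """\
2024-01-01 10:00:00 [scrapy.core.engine] INFO: Spider opened
2024-01-01 10:00:01 [scrapy.downloadermiddlewares.retry] WARNING: Retrying <GET https://example.com> (failed 1 times)
2024-01-01 10:00:02 [scrapy.core.engine] INFO: Closing spider (finished)
"""

# Keep only the lines flagged as warnings or errors
issues = [line for line in sample_log.splitlines()
          if ' WARNING: ' in line or ' ERROR: ' in line]
for line in issues:
    print(line)
```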
We now need to parse the JSON values in the `body_text` column:

```python
import json

json.loads(df['body_text'][0])
```
```
{'type': 'FeatureCollection',
 'version': 'draft',
 'features': [{'type': 'Feature',
   'geometry': {'type': 'Point', 'coordinates': [2.390824, 48.838917]},
   'properties': {'label': '12 Boulevard de Reuilly 75012 Paris',
    'score': 0.8170901298701299,
    'housenumber': '12',
    'id': '75112_8174_00012',
    'name': '12 Boulevard de Reuilly',
    'postcode': '75012',
    'citycode': '75112',
    'x': 655287.75,
    'y': 6860046.52,
    'city': 'Paris',
    'district': 'Paris 12e Arrondissement',
    'context': '75, Paris, Île-de-France',
    'type': 'housenumber',
    'importance': 0.75942,
    'street': 'Boulevard de Reuilly'}},
  {'type': 'Feature',
   'geometry': {'type': 'Point', 'coordinates': [2.384823, 48.849557]},
   'properties': {'label': '12 Rue de Reuilly 75012 Paris',
    'score': 0.6023925307125307,
    'housenumber': '12',
    'id': '75112_8175_00012',
    'name': '12 Rue de Reuilly',
    'postcode': '75012',
    'citycode': '75112',
    'x': 654856.52,
    'y': 6861233,
    'city': 'Paris',
    'district': 'Paris 12e Arrondissement',
    'context': '75, Paris, Île-de-France',
    'type': 'housenumber',
    'importance': 0.78848,
    'street': 'Rue de Reuilly'}},
  {'type': 'Feature',
   'geometry': {'type': 'Point', 'coordinates': [2.395798, 48.841871]},
   'properties': {'label': 'Rue de la Gare de Reuilly 75012 Paris',
    'score': 0.4766209090909091,
    'id': '75112_3955',
    'name': 'Rue de la Gare de Reuilly',
    'postcode': '75012',
    'citycode': '75112',
    'x': 655655.34,
    'y': 6860372.17,
    'city': 'Paris',
    'district': 'Paris 12e Arrondissement',
    'context': '75, Paris, Île-de-France',
    'type': 'street',
    'importance': 0.74283,
    'street': 'Rue de la Gare de Reuilly'}}],
 'attribution': 'BAN',
 'licence': 'ETALAB-2.0',
 'query': '12 oulevard de reuilly 75012 paris',
 'limit': 5}
```

Convert to a DataFrame:

```python
pd.DataFrame(json.loads(df['body_text'][0]))
```
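As a self-contained illustration of that conversion, here's the same shape of response with synthetic values (made-up labels and scores, not the real API output):

```python
import json
import pandas as pd

# Synthetic FeatureCollection mimicking the structure of the API response
body_text = json.dumps({
    'type': 'FeatureCollection',
    'version': 'draft',
    'features': [
        {'type': 'Feature',
         'properties': {'label': '12 Boulevard de Reuilly 75012 Paris',
                        'score': 0.82}},
        {'type': 'Feature',
         'properties': {'label': '12 Rue de Reuilly 75012 Paris',
                        'score': 0.60}},
    ],
    'query': '12 oulevard de reuilly 75012 paris',
    'limit': 5,
})

df_result = pd.DataFrame(json.loads(body_text))
# One row per feature; scalar keys like 'query' and 'limit' are broadcast
# across all rows
print(df_result.shape)
```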
Now flatten the nested `features` with `pd.json_normalize`:

```python
pd.json_normalize(pd.DataFrame(json.loads(df['body_text'][0]))['features'])
```
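To make the flattening step concrete, here's a self-contained sketch on a tiny synthetic feature list (made-up values, same nesting as the API response):

```python
import pandas as pd

# Two synthetic GeoJSON-style features with nested dicts
features = [
    {'type': 'Feature',
     'geometry': {'type': 'Point', 'coordinates': [2.39, 48.84]},
     'properties': {'label': '12 Boulevard de Reuilly 75012 Paris',
                    'score': 0.82}},
    {'type': 'Feature',
     'geometry': {'type': 'Point', 'coordinates': [2.38, 48.85]},
     'properties': {'label': '12 Rue de Reuilly 75012 Paris',
                    'score': 0.60}},
]

flat = pd.json_normalize(features)
# Nested keys become dotted column names, e.g. 'properties.label' and
# 'geometry.coordinates'
print(flat.columns.tolist())
```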
And Happy Birthday!