no output #264
Hello Elias,
Just ran the code. It seems they block everything through their robots.txt. You can ignore that if you want by using a special setting (it's up to you to make sure you are complying with their rules and accepting the consequences). It's also very good to always save the logs of your crawl so you can check what might have happened if there are any issues:

```python
import advertools as adv

adv.crawl('https://example.com', 'output.jl',
          custom_settings={'ROBOTSTXT_OBEY': False, 'LOG_FILE': 'mycrawl.log'})
```

Using your URL:

```python
adv.crawl(
    'https://api-adresse.data.gouv.fr/search/?q=12+oulevard+de+reuilly+75012+paris',
    output_file='data_gouv_fr_result.jl',
    custom_settings={
        'CONCURRENT_ITEMS': 50,
        'ROBOTSTXT_OBEY': False
    })
```
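If you want to see what their robots.txt actually blocks before deciding to override it, the standard library's `urllib.robotparser` can test a URL against the rules. A minimal sketch using hypothetical robots.txt content (not the live file):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that disallows everything (illustration only)
rules = """User-agent: *
Disallow: /""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A blanket "Disallow: /" blocks every path for every user agent
print(rp.can_fetch('*', 'https://api-adresse.data.gouv.fr/search/'))  # False
```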
```python
import pandas as pd

df = pd.read_json('/Users/me/Desktop/temp/data_gouv_fr_result.jl', lines=True)
df
```
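To skim the saved log for problems afterwards, plain Python is enough; a sketch with made-up Scrapy-style log lines (in practice you'd read `mycrawl.log` instead):

```python
# Sample log content for illustration only; real crawls would read the file
# saved via the LOG_FILE setting
sample_log = """\
2024-01-01 10:00:00 [scrapy.core.engine] INFO: Spider opened
2024-01-01 10:00:01 [scrapy.downloadermiddlewares.retry] WARNING: Retrying <GET https://example.com> (failed 1 times)
2024-01-01 10:00:02 [scrapy.core.engine] INFO: Closing spider (finished)
"""

# Keep only the lines flagged as warnings or errors
issues = [line for line in sample_log.splitlines()
          if ' WARNING: ' in line or ' ERROR: ' in line]
for line in issues:
    print(line)
```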
We now need to parse the JSON values in the `body_text` column:

```python
import json

json.loads(df['body_text'][0])
```
```
{'type': 'FeatureCollection',
 'version': 'draft',
 'features': [{'type': 'Feature',
   'geometry': {'type': 'Point', 'coordinates': [2.390824, 48.838917]},
   'properties': {'label': '12 Boulevard de Reuilly 75012 Paris',
    'score': 0.8170901298701299,
    'housenumber': '12',
    'id': '75112_8174_00012',
    'name': '12 Boulevard de Reuilly',
    'postcode': '75012',
    'citycode': '75112',
    'x': 655287.75,
    'y': 6860046.52,
    'city': 'Paris',
    'district': 'Paris 12e Arrondissement',
    'context': '75, Paris, Île-de-France',
    'type': 'housenumber',
    'importance': 0.75942,
    'street': 'Boulevard de Reuilly'}},
  {'type': 'Feature',
   'geometry': {'type': 'Point', 'coordinates': [2.384823, 48.849557]},
   'properties': {'label': '12 Rue de Reuilly 75012 Paris',
    'score': 0.6023925307125307,
    'housenumber': '12',
    'id': '75112_8175_00012',
    'name': '12 Rue de Reuilly',
    'postcode': '75012',
    'citycode': '75112',
    'x': 654856.52,
    'y': 6861233,
    'city': 'Paris',
    'district': 'Paris 12e Arrondissement',
    'context': '75, Paris, Île-de-France',
    'type': 'housenumber',
    'importance': 0.78848,
    'street': 'Rue de Reuilly'}},
  {'type': 'Feature',
   'geometry': {'type': 'Point', 'coordinates': [2.395798, 48.841871]},
   'properties': {'label': 'Rue de la Gare de Reuilly 75012 Paris',
    'score': 0.4766209090909091,
    'id': '75112_3955',
    'name': 'Rue de la Gare de Reuilly',
    'postcode': '75012',
    'citycode': '75112',
    'x': 655655.34,
    'y': 6860372.17,
    'city': 'Paris',
    'district': 'Paris 12e Arrondissement',
    'context': '75, Paris, Île-de-France',
    'type': 'street',
    'importance': 0.74283,
    'street': 'Rue de la Gare de Reuilly'}}],
 'attribution': 'BAN',
 'licence': 'ETALAB-2.0',
 'query': '12 oulevard de reuilly 75012 paris',
 'limit': 5}
```

Convert to a DataFrame:

```python
pd.DataFrame(json.loads(df['body_text'][0]))
```
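As a self-contained illustration of that conversion, here's the same shape of response with synthetic values (made-up labels and scores, not the real API output):

```python
import json
import pandas as pd

# Synthetic FeatureCollection mimicking the structure of the API response
body_text = json.dumps({
    'type': 'FeatureCollection',
    'version': 'draft',
    'features': [
        {'type': 'Feature',
         'properties': {'label': '12 Boulevard de Reuilly 75012 Paris',
                        'score': 0.82}},
        {'type': 'Feature',
         'properties': {'label': '12 Rue de Reuilly 75012 Paris',
                        'score': 0.60}},
    ],
    'query': '12 oulevard de reuilly 75012 paris',
    'limit': 5,
})

df_result = pd.DataFrame(json.loads(body_text))
# One row per feature; scalar keys like 'query' and 'limit' are broadcast
# across all rows
print(df_result.shape)
```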
Now flatten the nested `features` with `pd.json_normalize`:

```python
pd.json_normalize(pd.DataFrame(json.loads(df['body_text'][0]))['features'])
```
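To make the flattening step concrete, here's a self-contained sketch on a tiny synthetic feature list (made-up values, same nesting as the API response):

```python
import pandas as pd

# Two synthetic GeoJSON-style features with nested dicts
features = [
    {'type': 'Feature',
     'geometry': {'type': 'Point', 'coordinates': [2.39, 48.84]},
     'properties': {'label': '12 Boulevard de Reuilly 75012 Paris',
                    'score': 0.82}},
    {'type': 'Feature',
     'geometry': {'type': 'Point', 'coordinates': [2.38, 48.85]},
     'properties': {'label': '12 Rue de Reuilly 75012 Paris',
                    'score': 0.60}},
]

flat = pd.json_normalize(features)
# Nested keys become dotted column names, e.g. 'properties.label' and
# 'geometry.coordinates'
print(flat.columns.tolist())
```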
And Happy Birthday!