---
layout: default
title: Asynchronous programming
permalink: /asyncio
---
Asynchronous programming is becoming more and more popular in Python since the introduction in Python 3.5 (with PEP 492) of two keywords: `await` and `async`.
It is important to be aware of a few facts about Python performance:
- the GIL (Global Interpreter Lock) in the CPython implementation puts a strong limitation on Python performance: programs with heavy computational requirements cannot be accelerated with multithreading. Multiprocessing is an option, but it comes with its own set of limitations;
- multithreading nevertheless remains relevant when a program uses a lot of blocking calls. These include I/O calls (writing a file, sending a request on the web, accessing USB devices): the GIL is released during these blocking calls, which allows the program to perform other tasks while the first task waits for a blocking call to complete.
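As an illustration of this second point, here is a minimal sketch with the standard `concurrent.futures.ThreadPoolExecutor` (the httpbin URL is an arbitrary endpoint chosen here to simulate a one-second blocking call):

```python
from concurrent.futures import ThreadPoolExecutor

import requests

# each call blocks for about one second on the network
def fetch(url):
    return requests.get(url).status_code

urls = ["https://httpbin.org/delay/1"] * 5

# the GIL is released while each thread waits for its response:
# about 1 second in total with 5 threads, instead of about 5 sequentially
with ThreadPoolExecutor(max_workers=5) as executor:
    print(list(executor.map(fetch, urls)))
```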
Asynchronous programming, also known under the name of the Python module `asyncio`, provides an efficient single-threaded implementation of programs made of blocking calls.
*(Illustration: a Spongebob meme, which rather depicts multiprocessing.)*
The tl;dr version of `asyncio` goes as follows:
- blocking calls are annotated with the `await` keyword. The Python interpreter (and its event loop) will put this function on hold until the reply comes, and proceed with other asynchronous calls;
- functions with an `await` keyword in their implementation must be prefixed with the `async` keyword; an `async` function must be `await`ed. This may sound like a chicken-and-egg situation, and it may indeed be confusing before you get used to it.
So let's start with a function which does nothing more than sleep:
```python
async def count():
    print("one")
    await asyncio.sleep(1)
    print("two")
```
If you run it once, it will take... one second:
```python
>>> import asyncio
>>> loop = asyncio.get_event_loop()  # the loop in charge of sequencing async calls
>>> loop.run_until_complete(count())
one
two
```
But if you run several calls together, it will also take one second. Check the printing order: the loop schedules the next call of `count()` when it hits an `await` instruction:
```python
>>> loop.run_until_complete(asyncio.gather(count(), count(), count()))
one
one
one
two
two
two
```
Note that in a Jupyter notebook, where an event loop is already running, `loop.run_until_complete` would fail with the following exception:

```text
RuntimeError: This event loop is already running.
```

In that case, you can `await` the coroutines directly in a cell:

```python
await asyncio.gather(count(), count(), count())
```
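In a plain script, a simpler entry point has been available since Python 3.7: `asyncio.run()` creates the event loop, runs a coroutine until completion and closes the loop for you. A minimal sketch, reusing the `count()` function above:

```python
import asyncio

async def main():
    # run the three coroutines concurrently, as with run_until_complete above
    await asyncio.gather(count(), count(), count())

asyncio.run(main())  # prints "one" three times, then "two" three times
```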
In practice, many libraries made of blocking calls provide an asynchronous version of their code, which becomes relevant if you need to make many small blocking calls, e.g. many small downloads, or many calls to a database.
`requests` is the most common library for synchronous HTTP requests. For this example, let's download all flags of the world from https://flagcdn.com/. The full list of flag codes is available at the following URL:
```python
import requests

c = requests.get("https://flagcdn.com/fr/codes.json")
c.raise_for_status()
codes = c.json()
# >>> codes
# {'ad': 'Andorre', 'ae': 'Émirats arabes unis', 'af': 'Afghanistan',
#  'ag': 'Antigua-et-Barbuda', 'ai': 'Anguilla', 'al': 'Albanie',
#  'am': 'Arménie', 'ao': 'Angola', 'aq': 'Antarctique', 'ar': 'Argentine', ...}
```
Now we can time the synchronous download of all flags:
```python
from tqdm import tqdm

for c in tqdm(codes.keys()):
    r = requests.get(f"https://flagcdn.com/256x192/{c}.png")
    r.raise_for_status()
    # ignoring content for this example
```

```text
100%|█████████████████████████████████████████████████████████████| 306/306 [01:15<00:00, 3.77it/s]
```
One of the most widespread libraries for asynchronous web requests is `aiohttp`, whose syntax is somewhat similar. The corresponding code reads as follows:
```python
import asyncio
import time

import aiohttp

async def fetch(code, session):
    async with session.get(f"https://flagcdn.com/256x192/{code}.png") as resp:
        return await resp.read()

async def main():
    t0 = time.time()
    async with aiohttp.ClientSession() as session:
        futures = [fetch(code, session) for code in codes]
        for response in await asyncio.gather(*futures):
            data = response  # ignoring content for this example
    print(f"done in {time.time() - t0:.5f}s")

asyncio.run(main())
```

```text
done in 0.52194s
```

The whole download runs more than a hundred times faster than the synchronous version above.
If you sit behind a proxy, the syntax differs slightly between the two libraries:

```python
# with requests
requests.get(url, proxies={"http": proxy, "https": proxy})

# with aiohttp
async with session.get(url, proxy=proxy) as resp:
    ...
```
We will implement a particular case of web crawling in this example, with a breadth-first exploration of a graph.
- Background: Queen Victoria is considered the grandmother of Europe. She is the ancestor of many famous reigning people today. We will write a program to find the relationship between cousins in this extended family. A first example could be to explore the genealogy of Queen Elizabeth II and the Duke of Edinburgh. Yes, they are extended cousins.
- We will use the Wikidata API to explore the structure of the relationships between pages. Starting from a Wikipedia entry, you will find a *Wikidata item* link in the left panel.
- Pick the final identifier in the URL (here `Q9439`) and replace it in the JSON URL:
  - Wikidata item: https://www.wikidata.org/wiki/Special:EntityPage/Q9439
  - JSON file: https://www.wikidata.org/wiki/Special:EntityData/Q9439.json

  Explore the JSON and find specific relationships in the `claims` dictionary: `P22` for the "father" relationship, `P25` for the "mother" relationship and `P40` for the "children" relationship. Find new identifiers for members of the extended family in those dictionaries (see the sketch after this list).
- Explore the entries for all neighbours of the current entry. Pay attention to stick to a breadth-first exploration: explore all kin directly related to Queen Victoria, then all kin with two degrees of relationship, etc.
- Draw the genealogical subtree (consider the `networkx` package) with Queen Victoria, Queen Elizabeth II and the Duke of Edinburgh.
- Extend the graph with another grandfather of Europe, Christian IX of Denmark. The late British royal couple were also cousins through this branch. Look at their relationships with other cousins, like Nicholas II (the last tsar of Russia) or Felipe VI (the current King of Spain).
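As a possible starting point for the first steps, here is a sketch of how the JSON entries can be fetched with `aiohttp` and explored level by level; the helpers `relatives` and `bfs` are made-up names for this illustration, not the notebook solution:

```python
import asyncio

import aiohttp

async def relatives(qid, session):
    """Return the identifiers of the parents and children of entry `qid`."""
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    async with session.get(url) as resp:
        entity = (await resp.json())["entities"][qid]
    neighbours = set()
    for prop in ("P22", "P25", "P40"):  # father, mother, children
        for claim in entity["claims"].get(prop, []):
            value = claim["mainsnak"].get("datavalue")
            if value is not None:  # skip "unknown value" claims
                neighbours.add(value["value"]["id"])
    return neighbours

async def bfs(start, max_depth):
    """Breadth-first: all entries at degree n before any entry at degree n + 1."""
    seen = {start: 0}  # identifier -> degree of relationship
    frontier = {start}
    async with aiohttp.ClientSession() as session:
        for depth in range(1, max_depth + 1):
            # fetch all entries of the current level concurrently
            results = await asyncio.gather(
                *(relatives(qid, session) for qid in frontier)
            )
            frontier = set().union(*results) - set(seen)
            for qid in frontier:
                seen[qid] = depth
    return seen

seen = asyncio.run(bfs("Q9439", 2))  # Q9439 is Queen Victoria
```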
You will find a suggested solution in the `asyncio.ipynb` notebook.