Asynchronous programming

Asynchronous programming has become increasingly popular in Python since Python 3.5 introduced (with PEP 492) two keywords: await and async.

It is important to be aware of a few facts about Python performance:

  • the GIL (Global Interpreter Lock) of the CPython implementation strongly limits Python performance: compute-bound programs cannot be accelerated with multithreading. Multiprocessing is an option, but it comes with its own set of limitations;

  • Multithreading remains relevant, however, when a program makes many blocking calls. These include I/O calls (writing a file, sending a request on the web, accessing USB devices): the interpreter releases the GIL during these blocking calls, which allows the program to perform other tasks while the first task waits for its blocking call to complete.
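
The second point can be checked with a minimal sketch, assuming `time.sleep` as a stand-in for a real I/O call (the names `blocking_call` and `run_in_threads` are hypothetical, chosen for this illustration). Three blocking calls run in a thread pool should take about as long as a single one, because the GIL is released while each thread waits:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_call(delay: float) -> float:
    """Stand-in for an I/O call: time.sleep releases the GIL while waiting."""
    time.sleep(delay)
    return delay


def run_in_threads(n_calls: int = 3, delay: float = 0.2) -> float:
    """Run n_calls blocking calls in a thread pool; return the elapsed seconds."""
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_calls) as pool:
        # list() forces all results, so we wait for every call to finish
        list(pool.map(blocking_call, [delay] * n_calls))
    return time.perf_counter() - t0


elapsed = run_in_threads()
print(f"3 blocking calls of 0.2s in threads: {elapsed:.2f}s")
```

Replacing `time.sleep` with a pure-Python computation would not show the same benefit, since compute-bound threads compete for the GIL.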

Asynchronous programming, also known by the name of the Python module asyncio, provides an efficient single-threaded way to run programs made of blocking calls.

If the following Spongebob is rather an illustration of multiprocessing:

[image: Spongebob]

asynchronous programming is more about vacuuming while the dishwasher cleans instead of waiting for it to finish before doing your next chores.

The tl;dr version of asyncio goes as follows:

  • blocking calls are annotated with the await keyword. The event loop puts the calling function on hold until the reply comes, and proceeds with other asynchronous calls in the meantime;

  • functions with an await keyword in their implementation must be prefixed with the async keyword; an async function must itself be awaited. This may sound like a chicken-and-egg problem, and it can be confusing until you get used to it.
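
The two rules above can be illustrated with a minimal sketch (the function names `answer` and `caller` are hypothetical). Calling an async function does not run it: it only builds a coroutine object, which does its work once it is awaited or handed to the event loop:

```python
import asyncio
import inspect


async def answer() -> int:
    await asyncio.sleep(0)  # yields control to the event loop once
    return 42


async def caller() -> int:
    # `await` is only allowed inside an `async def` function
    return await answer()


coro = answer()  # nothing runs yet: this is only a coroutine object
is_coro = inspect.iscoroutine(coro)
coro.close()  # discard it cleanly (avoids a "never awaited" warning)

result = asyncio.run(caller())  # the event loop drives the coroutine to completion
print(is_coro, result)  # prints: True 42
```

The chain of awaits has to start somewhere: `asyncio.run` is the usual entry point in a script, since it creates the event loop and runs the outermost coroutine.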

So let's start with a function which does nothing more than sleep:

async def count():
    print("one")
    await asyncio.sleep(1)
    print("two")

If you run it once, it will take... one second:

>>> import asyncio
>>>
>>> loop = asyncio.new_event_loop()  # the loop in charge of sequencing async calls
>>> loop.run_until_complete(count())
one
two

But if you run several calls together, it will also take about one second. Check the printing order: the loop schedules the next call of count() as soon as it hits an await instruction:

>>> loop.run_until_complete(asyncio.gather(count(), count(), count()))
one
one
one
two
two
two
Warning    Jupyter notebooks run in an asynchronous environment where an event loop is already running in the background. It is therefore not possible to run the code above as is.
You would get the following exception:
RuntimeError: This event loop is already running.
It is however possible to run a cell with a top-level `await` keyword. The following code is valid in a Jupyter cell but not at the top level of a regular Python script:
await asyncio.gather(count(), count(), count())
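
To check the one-second claim in a regular Python script, here is a minimal sketch that times the three concurrent calls. The `delay` parameter and the `timed_gather` helper are additions for this illustration, not part of the original example:

```python
import asyncio
import time


async def count(delay: float = 1.0) -> None:
    print("one")
    await asyncio.sleep(delay)
    print("two")


async def timed_gather(n: int = 3, delay: float = 1.0) -> float:
    """Run n copies of count() concurrently and return the elapsed seconds."""
    t0 = time.perf_counter()
    await asyncio.gather(*(count(delay) for _ in range(n)))
    return time.perf_counter() - t0


# asyncio.run creates an event loop, runs the coroutine, then closes the loop
elapsed = asyncio.run(timed_gather(n=3, delay=1.0))
print(f"3 calls took {elapsed:.2f}s")
```

In a Jupyter cell, the last two lines would become `elapsed = await timed_gather(n=3, delay=1.0)` followed by the same print, since the loop is already running there.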

In practice, many libraries made of blocking calls provide an asynchronous version of their code, which becomes relevant if you need to make many small blocking calls, e.g. many small downloads, or many calls to a database.

Comparison between blocking and non-blocking downloads

requests is the most common library for synchronous HTTP requests. For this example, let's download all the flags of the world from https://flagcdn.com/.

The full list of country codes can be retrieved as follows:

import requests

c = requests.get("https://flagcdn.com/fr/codes.json")
c.raise_for_status()
codes = c.json()
# >>> codes {'ad': 'Andorre', 'ae': 'Émirats arabes unis', 'af': 'Afghanistan',
# 'ag': 'Antigua-et-Barbuda', 'ai': 'Anguilla', 'al': 'Albanie', 'am':
# 'Arménie', 'ao': 'Angola', 'aq': 'Antarctique', 'ar': 'Argentine', ...

Now we can time the synchronous download of all flags:

from tqdm import tqdm

for code in tqdm(codes.keys()):
    r = requests.get(f'https://flagcdn.com/256x192/{code}.png')
    r.raise_for_status()
    # ignoring content for this example
100%|█████████████████████████████████████████████████████████████| 306/306 [01:15<00:00,  3.77it/s]

One of the most widespread libraries for asynchronous web requests is aiohttp, whose syntax is somewhat similar. The equivalent code would be:

import asyncio
import time

import aiohttp

async def fetch(code, session):
    async with session.get(f"https://flagcdn.com/256x192/{code}.png") as resp:
        return await resp.read()


async def main():
    t0 = time.time()
    async with aiohttp.ClientSession() as session:
        futures = [fetch(code, session) for code in codes]
        for response in await asyncio.gather(*futures):
            data = response
    print(f"done in {time.time() - t0:.5f}s")


asyncio.run(main())
done in 0.52194s
Note    This approach leads to a speedup of nearly 150 (from about 75 s down to about 0.5 s). Such a large speedup makes particular sense here, with a lot of small blocking requests.
Warning    If you run this code behind a proxy, you may need to pass the proxy explicitly.
# with requests
requests.get(url, proxies={"http": proxy, "https": proxy})
# with aiohttp
async with session.get(url, proxy=proxy) as resp:
    ...
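
The main() coroutine above starts one request per flag at once. When a server or a network does not cope well with that many simultaneous connections, one common option is to bound the concurrency with an `asyncio.Semaphore`. The sketch below is only an illustration under stated assumptions: `asyncio.sleep` stands in for the network wait, and `fake_download`, `download_all`, `max_concurrent` and the placeholder codes are hypothetical names, not part of the original example.

```python
import asyncio


async def fake_download(code: str, limit: asyncio.Semaphore, stats: dict) -> str:
    """Simulated request: asyncio.sleep stands in for the network wait."""
    async with limit:  # at most `max_concurrent` tasks are inside this block
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        await asyncio.sleep(0.01)
        stats["running"] -= 1
    return code


async def download_all(codes: list, max_concurrent: int = 10) -> tuple:
    """Run all simulated downloads; return the results and the peak concurrency."""
    limit = asyncio.Semaphore(max_concurrent)
    stats = {"running": 0, "peak": 0}
    # gather returns the results in the same order as its arguments
    results = await asyncio.gather(*(fake_download(c, limit, stats) for c in codes))
    return results, stats["peak"]


fake_codes = [f"code{i}" for i in range(50)]  # placeholder codes, not real flags
results, peak = asyncio.run(download_all(fake_codes, max_concurrent=10))
print(len(results), "downloads, peak concurrency:", peak)
```

With a real aiohttp session, the same `async with limit:` block would wrap the `session.get` call inside `fetch`.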

Exercise

We will implement a particular case of web crawling in this example, with a breadth-first exploration of a graph.

  • Background: Queen Victoria is considered the grandmother of Europe. She is the ancestor of many famous reigning monarchs today. We will write a program to find the relationship between cousins in this extended family. A first example could be to explore the genealogy of Queen Elizabeth II and the Duke of Edinburgh. Yes, they are distant cousins.

  • We will use the Wikidata API to explore the structure of the relationships between pages. On a Wikipedia entry, you will find a Wikidata item link in the left panel:

[image: the Wikidata item link in the left panel of a Wikipedia page]

  • Pick the final identifier in the URL (here Q9439) and replace it in the JSON URL

    URL
    Wikidata item https://www.wikidata.org/wiki/Special:EntityPage/Q9439
    JSON file https://www.wikidata.org/wiki/Special:EntityData/Q9439.json

    Explore the JSON and find specific relationships in the claims dictionary: P22 for the "father" relationship, P25 for the "mother" relationship and P40 for the "child" relationship. Find new identifiers for members of the extended family in those dictionaries.

  • Explore the entries for all neighbours of the current entry. Take care to stick to breadth-first exploration: explore all relatives directly related to Queen Victoria, then all relatives two degrees away, etc.

  • Draw the genealogical subtree (consider the networkx package) with Queen Victoria, Queen Elizabeth II and the Duke of Edinburgh.

  • Extend the graph with another grandfather of Europe, Christian IX of Denmark. The late British royal couple were also cousins through this branch. Look at their relationships with other cousins, like Nicholas II of Russia (the last tsar of Russia), or Felipe VI of Spain, the current King of Spain.
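
As a starting point, the extraction and exploration steps listed above can be sketched on a toy graph, without any network access. This is not the notebook's solution. It assumes that each statement in the claims dictionary stores its target identifier under `mainsnak` → `datavalue` → `value` → `id`, and that in the downloaded file the entity itself sits under an `entities` key indexed by its identifier: check both against the JSON you actually get. The names `related_ids`, `breadth_first`, `claim` and `TOY` are hypothetical, and the identifiers are made up.

```python
from collections import deque

FAMILY_PROPERTIES = ("P22", "P25", "P40")  # father, mother, child


def related_ids(entity: dict) -> list:
    """Collect the identifiers found under P22/P25/P40 in an entity's claims."""
    found = []
    for prop in FAMILY_PROPERTIES:
        for statement in entity.get("claims", {}).get(prop, []):
            # statements without a known value carry no "datavalue" entry
            value = statement.get("mainsnak", {}).get("datavalue", {}).get("value", {})
            if "id" in value:
                found.append(value["id"])
    return found


def breadth_first(start: str, get_entity, max_depth: int = 2) -> dict:
    """Return {identifier: degree of relationship}, visiting level by level."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if depth[current] == max_depth:
            continue  # do not expand beyond the requested degree
        for neighbour in related_ids(get_entity(current)):
            if neighbour not in depth:  # first visit gives the shortest path
                depth[neighbour] = depth[current] + 1
                queue.append(neighbour)
    return depth


def claim(target: str) -> dict:
    """Build one statement in the shape assumed above (toy helper)."""
    return {"mainsnak": {"datavalue": {"value": {"id": target}}}}


# Toy entities with made-up identifiers, NOT real Wikidata content
TOY = {
    "A": {"claims": {"P40": [claim("B"), claim("C")]}},
    "B": {"claims": {"P22": [claim("A")], "P40": [claim("D")]}},
    "C": {"claims": {"P22": [claim("A")]}},
    "D": {"claims": {"P22": [claim("B")]}},
}

degrees = breadth_first("A", TOY.__getitem__, max_depth=2)
print(degrees)  # prints: {'A': 0, 'B': 1, 'C': 1, 'D': 2}
```

In the asynchronous version, `get_entity` would become a coroutine downloading the JSON file, and all entries of a given degree could be fetched together with `asyncio.gather` before moving on to the next degree.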

You will find a suggested solution in the asyncio.ipynb notebook.
