| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "id": "44bdd8b4", |
| 6 | + "metadata": {}, |
| 7 | + "source": [ |
| 8 | + "# PROGRES 2024 - Mini-Projet 3\n", |
| 9 | + "# DHT CAN (Content-Addressable Network)\n",
| 10 | + "\n", |
| 11 | + "Fabien Mathieu - [email protected]\n", |
| 12 | + "\n", |
| 13 | + "Sébastien Tixeuil - [email protected]" |
| 14 | + ] |
| 15 | + }, |
| 16 | + { |
| 17 | + "cell_type": "markdown", |
| 18 | + "id": "f27c5f0d", |
| 19 | + "metadata": {}, |
| 20 | + "source": [ |
| 21 | + "The goal of this mini-project is to build a DHT simulator and to test its performance:\n", |
| 22 | + "table construction, distributed content management, load balancing, and traffic analysis.\n",
| 23 | + "\n", |
| 24 | + "The table will be used to manage Wikipedia pages. **As an exception**, the peer in charge of a page will also be responsible for retrieving the content of the page and making it available. Recall that this is not the typical behavior of a DHT, which is only supposed to point to the peers owning the content. This deviation is introduced here to reduce the implementation complexity of the mini-project."
| 25 | + ] |
| 26 | + }, |
| 27 | + { |
| 28 | + "cell_type": "markdown", |
| 29 | + "id": "38aba0fe", |
| 30 | + "metadata": {}, |
| 31 | + "source": [ |
| 32 | + "# Rules" |
| 33 | + ] |
| 34 | + }, |
| 35 | + { |
| 36 | + "cell_type": "markdown", |
| 37 | + "id": "4720e742", |
| 38 | + "metadata": {}, |
| 39 | + "source": [ |
| 40 | + "1. Cite your sources\n", |
| 41 | + "2. One file to rule them all\n", |
| 42 | + "3. Explain\n", |
| 43 | + "4. Execute your code\n", |
| 44 | + "\n", |
| 45 | + "\n", |
| 46 | + "https://github.com/balouf/progres/blob/main/rules.ipynb" |
| 47 | + ] |
| 48 | + }, |
| 49 | + { |
| 50 | + "cell_type": "markdown", |
| 51 | + "id": "b9df04ae", |
| 52 | + "metadata": {}, |
| 53 | + "source": [ |
| 54 | + "# Exercise 1. Creation of a Peer class" |
| 55 | + ] |
| 56 | + }, |
| 57 | + { |
| 58 | + "cell_type": "markdown", |
| 59 | + "id": "f6a5e41f", |
| 60 | + "metadata": {}, |
| 61 | + "source": [ |
| 62 | + "Implement in Python a `Peer` class that simulates a CAN DHT, as seen in the course.\n",
| 63 | + "\n", |
| 64 | + "The communication will be simulated through a dictionary `dht` mapping `id` -> `Peer`: knowing the id of a peer is enough to contact it (in real life, the IP address and port would be needed in addition to the id), and the network calls within the DHT will be represented by looking up the id of the destination in this dictionary.\n",
| 65 | + "\n", |
| 66 | + "The hash function to use is SHA1, available from the `hashlib` standard library: https://docs.python.org/3/library/hashlib.html\n",
| 67 | + "\n",
| 68 | + "SHA1 produces 160-bit hashes.\n",
| 69 | + "\n", |
| 70 | + "To simplify things, we will assume that the CAN space is a *square* of side $2^{80}$. *Square* means that you do not wrap around from the top edge to the bottom edge, or from the left edge to the right edge, as you would on a *torus*.\n",
| 71 | + "\n", |
| 72 | + "The hash will be split in two: the first half will represent the `x` coordinate inside the square, the second half the `y` coordinate inside the square (a possible conversion is sketched below)."
| 73 | + ] |
| 74 | + }, |
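+ {
+ "cell_type": "markdown",
+ "id": "c0de0001",
+ "metadata": {},
+ "source": [
+ "As an illustration, here is one possible way to turn a key into CAN coordinates (a minimal sketch; the helper name `hash_coordinates` is our choice, not imposed by the subject):\n",
+ "\n",
+ "```python\n",
+ "import hashlib\n",
+ "\n",
+ "def hash_coordinates(key):\n",
+ "    # 160-bit SHA1 digest of the key, seen as an integer\n",
+ "    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)\n",
+ "    # Most significant 80 bits -> x, least significant 80 bits -> y\n",
+ "    return h >> 80, h % 2**80\n",
+ "```"
+ ]
+ },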
| 75 | + { |
| 76 | + "cell_type": "markdown", |
| 77 | + "id": "d1da7497", |
| 78 | + "metadata": {}, |
| 79 | + "source": [ |
| 80 | + "Instances of the `Peer` class will have to implement (at least) the following attributes / methods (a possible skeleton is sketched after this list):\n",
| 81 | + "- `id`: peer identifier (integer)\n", |
| 82 | + "- `area`: the coordinates of the area the peer is responsible for, e.g. a tuple $(x_\min, x_\max, y_\min, y_\max)$\n",
| 83 | + "- `neighbors`: the list of the peer's neighbors. Each neighbor should be represented by a tuple (id, responsibility_area)\n",
| 84 | + "- `is_in_charge(key)`: a function that indicates whether the peer is in charge of `key`\n", |
| 85 | + "- `lookup(key, dht)`: a function that returns the id of the peer in charge of `key` as well as the distance, in number of hops, that separates the peer performing the *lookup* from the peer in charge.\n", |
| 86 | + "- `replace(old_neighbor, new_neighbor, new_neighbor_area)`: a function that removes `old_neighbor` from `neighbors` and replaces it with `new_neighbor` (and its area)\n",
| 87 | + "- `share_area_with(other_peer, dht)`: a function that splits the responsibility area in two\n",
| 88 | + " - If the current area is a square, split it vertically; otherwise, split it horizontally\n",
| 89 | + " - Each of the two peers (this one and the other) updates its neighbor list to keep only the ones side-adjacent to it. Remember that the two peers are also side-adjacent to each other. Also note that a neighbor may end up in both lists\n",
| 90 | + " - Don't forget to ask the neighbors concerned to replace the current peer with `other_peer` in their own neighbor lists\n",
| 91 | + " - Split the table between the two peers (see below for the table)\n",
| 92 | + "- `join(dht, entry_point)`: if the dht is empty, the peer should insert itself with the full responsibility area, i.e. $(0, 2^{80}, 0, 2^{80})$. Otherwise, it should ask through the entry_point for the peer responsible for its id and ask that peer to share its area.\n",
| 93 | + "- `table`: see below \n", |
| 94 | + "- `publish_page(page_title, dht)`: see below\n", |
| 95 | + "- `retrieve_page(page_title, dht)`: see below\n", |
| 96 | + "- `search(search_string, dht)`: see below" |
| 97 | + ] |
| 98 | + }, |
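+ {
+ "cell_type": "markdown",
+ "id": "c0de0002",
+ "metadata": {},
+ "source": [
+ "As announced above, a possible (non-mandatory) skeleton of the class, with the bodies left to write:\n",
+ "\n",
+ "```python\n",
+ "class Peer:\n",
+ "    def __init__(self, id):\n",
+ "        self.id = id\n",
+ "        self.area = None      # (x_min, x_max, y_min, y_max)\n",
+ "        self.neighbors = []   # list of (id, area) tuples\n",
+ "        self.table = {}       # coordinates -> set of titles / page content\n",
+ "\n",
+ "    def is_in_charge(self, key): ...\n",
+ "    def lookup(self, key, dht): ...\n",
+ "    def replace(self, old_neighbor, new_neighbor, new_neighbor_area): ...\n",
+ "    def share_area_with(self, other_peer, dht): ...\n",
+ "    def join(self, dht, entry_point): ...\n",
+ "    def publish_page(self, page_title, dht): ...\n",
+ "    def retrieve_page(self, page_title, dht): ...\n",
+ "    def search(self, search_string, dht): ...\n",
+ "```"
+ ]
+ },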
| 99 | + { |
| 100 | + "cell_type": "markdown", |
| 101 | + "id": "ae3f59eb", |
| 102 | + "metadata": {}, |
| 103 | + "source": [ |
| 104 | + "The `table` attribute contains the dictionary of keys managed by the peer; the table may contain two types of keys:\n", |
| 105 | + " - word keys: to a given word we associate the hash coordinates of the word prefixed by `word:` (e.g. `computer` is represented by the SHA1 hash coordinates of `word:computer`, i.e. (688843315192981992985613, 1084012856129654940495670)). The associated value in the table is a Python `set`, which will typically contain the titles of the Wikipedia pages that contain the word.\n",
| 106 | + " - page keys: to a given Wikipedia page we associate the hash coordinates of the page title prefixed by `page:` (for example `Circuit (computer science)` is represented by the SHA1 hash coordinates of `page:Circuit (computer science)`, i.e. (1042659198402659147098924, 95359921870751321224488)). The associated value in the table is the content of the page. Both kinds of keys are illustrated below."
| 107 | + ] |
| 108 | + }, |
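+ {
+ "cell_type": "markdown",
+ "id": "c0de0003",
+ "metadata": {},
+ "source": [
+ "For example, with the `hash_coordinates` helper sketched earlier, the two kinds of keys could be built as follows:\n",
+ "\n",
+ "```python\n",
+ "word_key = hash_coordinates(\"word:computer\")\n",
+ "page_key = hash_coordinates(\"page:Circuit (computer science)\")\n",
+ "```"
+ ]
+ },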
| 109 | + { |
| 110 | + "cell_type": "markdown", |
| 111 | + "id": "9956f32c", |
| 112 | + "metadata": {}, |
| 113 | + "source": [ |
| 114 | + "The `publish_page` method should perform the following actions:\n", |
| 115 | + "- Separate the title into words written in lower case. For example, `Circuit (computer science)` becomes `[\"circuit\", \"computer\", \"science\"]`, `Object-oriented programming` becomes `[\"object\", \"oriented\", \"programming\"]`.\n", |
| 116 | + "- For each word, find the peer responsible for the word key. If the key already exists, add the page title to the associated set; otherwise, initialize the key with a one-element set containing the page title. Remember that the update is done by the responsible peer, not by the peer calling the method.\n",
| 117 | + "\n", |
| 121 | + "The `publish_page` method is not in charge of retrieving the content of the page itself. A possible word-splitting helper is sketched below."
| 122 | + ] |
| 123 | + }, |
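+ {
+ "cell_type": "markdown",
+ "id": "c0de0004",
+ "metadata": {},
+ "source": [
+ "A possible word-splitting helper (a sketch; it assumes that any non-alphanumeric character is a separator, which matches the two examples above):\n",
+ "\n",
+ "```python\n",
+ "import re\n",
+ "\n",
+ "def title_to_words(title):\n",
+ "    # Lowercase, then keep maximal runs of alphanumeric characters\n",
+ "    return re.findall(r\"\\w+\", title.lower())\n",
+ "\n",
+ "title_to_words(\"Circuit (computer science)\")  # ['circuit', 'computer', 'science']\n",
+ "```"
+ ]
+ },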
| 124 | + { |
| 125 | + "cell_type": "markdown", |
| 126 | + "id": "809019cd", |
| 127 | + "metadata": {}, |
| 128 | + "source": [ |
| 129 | + "The `retrieve_page` method is responsible for providing the content. It should perform the following actions:\n", |
| 130 | + "- If the peer is responsible for the page, return the content:\n", |
| 131 | + " - If the key is present in the table, return the associated value\n", |
| 132 | + " - If the key is not present, use the `wikipedia` module to retrieve the page from the title (function `wikipedia.page`), store the content (attribute `content` of the result) in the table and return it\n",
| 133 | + " - In case of an error from the wikipedia module (this can happen, but quite rarely), store nothing and return `Page not found`.\n",
| 134 | + "- If the peer is not responsible for the page, retrieve the content from the responsible peer. A possible fetch helper is sketched below."
| 135 | + ] |
| 136 | + }, |
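+ {
+ "cell_type": "markdown",
+ "id": "c0de0005",
+ "metadata": {},
+ "source": [
+ "The Wikipedia call of the responsible peer could look like this (a sketch; the helper name `fetch_content` is ours, and any exception from the module is mapped to `Page not found` as requested):\n",
+ "\n",
+ "```python\n",
+ "import wikipedia\n",
+ "\n",
+ "def fetch_content(title):\n",
+ "    try:\n",
+ "        # wikipedia.page returns a WikipediaPage; its content attribute holds the text\n",
+ "        return wikipedia.page(title).content\n",
+ "    except Exception:\n",
+ "        return \"Page not found\"\n",
+ "```"
+ ]
+ },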
| 137 | + { |
| 138 | + "cell_type": "markdown", |
| 139 | + "id": "50cc8dcc", |
| 140 | + "metadata": {}, |
| 141 | + "source": [ |
| 142 | + "The `search` method should:\n",
| 143 | + "- Separate the search string into words written in lower case.\n",
| 144 | + "- Retrieve, for each word, the set of Wikipedia pages containing the word (if the word is not indexed, it is the empty set).\n",
| 145 | + "- Return the list of titles of pages containing all the words of the search.\n",
| 146 | + "- If the list is empty but at least one word is indexed, issue a warning (see https://docs.python.org/3/library/logging.html) and return the union (titles of pages containing at least one of the words of the search).\n",
| 147 | + "- If no word is indexed, issue a warning that none of the words were found and return an empty list.\n",
+ "\n",
+ "A possible way to combine the word sets is sketched below."
| 148 | + ] |
| 149 | + }, |
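+ {
+ "cell_type": "markdown",
+ "id": "c0de0006",
+ "metadata": {},
+ "source": [
+ "The combination logic could be sketched as follows (assuming `word_sets` is the list of sets retrieved for the words of the search, one set per word):\n",
+ "\n",
+ "```python\n",
+ "import logging\n",
+ "\n",
+ "def combine_results(word_sets):\n",
+ "    indexed = [s for s in word_sets if s]\n",
+ "    if not indexed:\n",
+ "        logging.warning(\"None of the words were found\")\n",
+ "        return []\n",
+ "    # Titles of pages containing all the words\n",
+ "    intersection = set.intersection(*word_sets)\n",
+ "    if intersection:\n",
+ "        return list(intersection)\n",
+ "    logging.warning(\"No page contains all the words, returning the union instead\")\n",
+ "    return list(set.union(*indexed))\n",
+ "```"
+ ]
+ },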
| 150 | + { |
| 151 | + "cell_type": "markdown", |
| 152 | + "id": "b316ea7e", |
| 153 | + "metadata": {}, |
| 154 | + "source": [ |
| 155 | + "The initialization of a peer will take one argument: its `id`." |
| 156 | + ] |
| 157 | + }, |
| 158 | + { |
| 159 | + "cell_type": "markdown", |
| 160 | + "id": "87a099a6", |
| 161 | + "metadata": {}, |
| 162 | + "source": [ |
| 163 | + "**Note**: `table`, `publish_page`, `retrieve_page` and `search` will only be used from Exercise 3 on. You can code a first version that does not contain these attributes/methods, and update it when you reach Exercise 3."
| 164 | + ] |
| 165 | + }, |
| 166 | + { |
| 167 | + "cell_type": "markdown", |
| 168 | + "id": "bde11615", |
| 169 | + "metadata": {}, |
| 170 | + "source": [ |
| 171 | + "# Exercise 2. DHT creation" |
| 172 | + ] |
| 173 | + }, |
| 174 | + { |
| 175 | + "cell_type": "markdown", |
| 176 | + "id": "abda4dda", |
| 177 | + "metadata": {}, |
| 178 | + "source": [ |
| 179 | + "From an empty `dht` dictionary, insert 10000 peers with identifiers 0 to 9999, in this order, using peer 0 as the entry point (a possible script is sketched below).\n",
| 180 | + "- What is the average size of the neighbor lists? The size of the smallest list? The size of the largest list?\n",
| 181 | + "- What is the average logarithm of the areas (in base 2)? The logarithm of the smallest area? The logarithm of the largest area?\n",
| 182 | + "- From the peer with identifier 42, use the `lookup` method to measure the distance (in hops) that separates it from the peer in charge of each of the 10000 peer identifiers. What is the average distance? The greatest distance?\n",
| 183 | + "- Are the results in line with what you saw in the course? Explain your answer."
| 184 | + ] |
| 185 | + }, |
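+ {
+ "cell_type": "markdown",
+ "id": "c0de0007",
+ "metadata": {},
+ "source": [
+ "A possible construction script (a sketch, assuming the `Peer` interface of Exercise 1):\n",
+ "\n",
+ "```python\n",
+ "from math import log2\n",
+ "from statistics import mean\n",
+ "\n",
+ "dht = {}\n",
+ "for i in range(10000):\n",
+ "    Peer(i).join(dht, 0)\n",
+ "\n",
+ "sizes = [len(peer.neighbors) for peer in dht.values()]\n",
+ "print(mean(sizes), min(sizes), max(sizes))\n",
+ "\n",
+ "log_areas = [log2((x_max - x_min) * (y_max - y_min))\n",
+ "             for x_min, x_max, y_min, y_max in (peer.area for peer in dht.values())]\n",
+ "print(mean(log_areas), min(log_areas), max(log_areas))\n",
+ "```"
+ ]
+ },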
| 186 | + { |
| 187 | + "cell_type": "markdown", |
| 188 | + "id": "e162d542", |
| 189 | + "metadata": {}, |
| 190 | + "source": [ |
| 191 | + "# Exercise 3. Publishing pages from Wikipedia" |
| 192 | + ] |
| 193 | + }, |
| 194 | + { |
| 195 | + "cell_type": "markdown", |
| 196 | + "id": "1a8d948e", |
| 197 | + "metadata": {}, |
| 198 | + "source": [ |
| 199 | + "Using the wikipedia module and the `links` attribute of `WikipediaPage` objects, retrieve the set of pages located at most 2 hops away from the `Computer science` page, i.e.:\n",
| 200 | + "- `Computer science`\n", |
| 201 | + "- The titles of the pages referenced by the `Computer science` page\n", |
| 202 | + "- The titles of the pages referenced by the pages referenced by the `Computer science` page\n", |
| 203 | + "- After eliminating duplicates, save the list of titles to a file (so you can re-use it later) and publish all the titles in the DHT. You can use an arbitrary peer to call the method. A possible crawling sketch is given below.\n",
| 204 | + "\n", |
| 205 | + "- How many words are indexed in total?\n", |
| 206 | + "- What is the average size of the peer tables? The size of the largest table?" |
| 207 | + ] |
| 208 | + }, |
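+ {
+ "cell_type": "markdown",
+ "id": "c0de0008",
+ "metadata": {},
+ "source": [
+ "A possible crawling sketch (beware: it issues thousands of Wikipedia requests, so it takes a while; the file name `titles.txt` is our choice):\n",
+ "\n",
+ "```python\n",
+ "import wikipedia\n",
+ "\n",
+ "root = wikipedia.page(\"Computer science\")\n",
+ "titles = {root.title} | set(root.links)  # root and pages 1 hop away\n",
+ "for link in root.links:\n",
+ "    try:\n",
+ "        titles |= set(wikipedia.page(link).links)  # pages 2 hops away\n",
+ "    except Exception:\n",
+ "        pass  # skip disambiguation or missing pages\n",
+ "\n",
+ "with open(\"titles.txt\", \"w\", encoding=\"utf-8\") as f:\n",
+ "    f.write(\"\\n\".join(sorted(titles)))\n",
+ "```"
+ ]
+ },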
| 209 | + { |
| 210 | + "cell_type": "markdown", |
| 211 | + "id": "c0a039f9", |
| 212 | + "metadata": {}, |
| 213 | + "source": [ |
| 214 | + "# Exercise 4. Using the DHT" |
| 215 | + ] |
| 216 | + }, |
| 217 | + { |
| 218 | + "cell_type": "markdown", |
| 219 | + "id": "7af57281", |
| 220 | + "metadata": {}, |
| 221 | + "source": [ |
| 222 | + "- Start by manually testing how the DHT works: from any peer (e.g. ID 42), perform some searches (`search`) and retrieve some content (`retrieve_page`). Do not hesitate to test partial searches (empty intersection) and unsuccessful searches.\n", |
| 223 | + "- Programmatically, retrieve from a peer the content of 2000 pages among those published (you can use the list of titles you have saved; a possible sketch is given below). What is the largest number of articles indexed by a peer?"
| 224 | + ] |
| 225 | + },
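+ {
+ "cell_type": "markdown",
+ "id": "c0de0009",
+ "metadata": {},
+ "source": [
+ "A possible retrieval sketch (assuming the `titles.txt` file saved in Exercise 3 and the `dht` built previously):\n",
+ "\n",
+ "```python\n",
+ "import random\n",
+ "\n",
+ "with open(\"titles.txt\", encoding=\"utf-8\") as f:\n",
+ "    titles = f.read().splitlines()\n",
+ "\n",
+ "for title in random.sample(titles, 2000):\n",
+ "    dht[42].retrieve_page(title, dht)\n",
+ "```"
+ ]
+ }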
| 226 | + ], |
| 227 | + "metadata": { |
| 228 | + "kernelspec": { |
| 229 | + "display_name": "Python 3 (ipykernel)", |
| 230 | + "language": "python", |
| 231 | + "name": "python3" |
| 232 | + }, |
| 233 | + "language_info": { |
| 234 | + "codemirror_mode": { |
| 235 | + "name": "ipython", |
| 236 | + "version": 3 |
| 237 | + }, |
| 238 | + "file_extension": ".py", |
| 239 | + "mimetype": "text/x-python", |
| 240 | + "name": "python", |
| 241 | + "nbconvert_exporter": "python", |
| 242 | + "pygments_lexer": "ipython3", |
| 243 | + "version": "3.11.4" |
| 244 | + }, |
| 245 | + "toc": { |
| 246 | + "base_numbering": 1, |
| 247 | + "nav_menu": {}, |
| 248 | + "number_sections": true, |
| 249 | + "sideBar": true, |
| 250 | + "skip_h1_title": true, |
| 251 | + "title_cell": "Table of Contents", |
| 252 | + "title_sidebar": "Contents", |
| 253 | + "toc_cell": false, |
| 254 | + "toc_position": {}, |
| 255 | + "toc_section_display": true, |
| 256 | + "toc_window_display": false |
| 257 | + } |
| 258 | + }, |
| 259 | + "nbformat": 4, |
| 260 | + "nbformat_minor": 5 |
| 261 | +} |