diff --git a/Makefile b/Makefile
index 5b2608a3..133b1533 100644
--- a/Makefile
+++ b/Makefile
@@ -26,13 +26,19 @@ clean:
 	make -C apps/temps clean
 	make -C apps/POLite/heat-gals clean
 	make -C apps/POLite/heat-sync clean
+	make -C apps/POLite/heat-cube-sync clean
+	make -C apps/POLite/heat-grid-sync clean
 	make -C apps/POLite/asp-gals clean
 	make -C apps/POLite/asp-sync clean
-	make -C apps/POLite/asp-pc clean
 	make -C apps/POLite/pagerank-sync clean
 	make -C apps/POLite/pagerank-gals clean
+	make -C apps/POLite/sssp-sync clean
 	make -C apps/POLite/sssp-async clean
-	make -C apps/POLite/ping-test clean
 	make -C apps/POLite/clocktree-async clean
+	make -C apps/POLite/izhikevich-gals clean
+	make -C apps/POLite/izhikevich-sync clean
+	make -C apps/POLite/pressure-sync clean
+	make -C apps/POLite/hashmin-sync clean
+	make -C apps/POLite/progrouters clean
 	make -C bin clean
 	make -C tests clean
diff --git a/README.md b/README.md
index 00f6a84b..a66aed56 100644
--- a/README.md
+++ b/README.md
@@ -1,13 +1,12 @@
-# Tinsel 0.7.1
+# Tinsel 0.8
 
 Tinsel is a [RISC-V](https://riscv.org/)-based manythread
 message-passing architecture designed for FPGA clusters.  It is being
 developed as part of the [POETS
 Project](https://poets-project.org/about) (Partial Ordered Event
-Triggered Systems).  This manual describes the architecture and
-associated APIs.  Further background can be found in our [FPL 2019
-paper](doc/fpl-2019-paper.pdf), which presents Tinsel 0.6.  If you're
-a POETS Partner, you can access a machine running Tinsel in the [POETS
+Triggered Systems).  Further background can be found in our [FPL 2019
+paper](doc/fpl-2019-paper.pdf).  If you're a POETS Partner, you can
+access a machine running Tinsel in the [POETS
 Cloud](https://github.com/POETSII/poets-cloud).  
 
 ## Release Log
@@ -27,15 +26,19 @@ Released on 10 Sep 2018 and maintained in the
 * [v0.5](https://github.com/POETSII/tinsel/releases/tag/v0.5):
 Released on 8 Jan 2019 and maintained in the
 [tinsel-0.5.1 branch](https://github.com/POETSII/tinsel/tree/tinsel-0.5.1).
-(Hardware idle-detection.)
+(Hardware termination-detection.)
 * [v0.6](https://github.com/POETSII/tinsel/releases/tag/v0.6):
 Released on 11 Apr 2019 and maintained in the
 [tinsel-0.6.3 branch](https://github.com/POETSII/tinsel/tree/tinsel-0.6.3).
 (Multi-box cluster.)
 * [v0.7](https://github.com/POETSII/tinsel/releases/tag/v0.7):
 Released on 2 Dec 2019 and maintained in the
+[tinsel-0.7.1 branch](https://github.com/POETSII/tinsel/tree/tinsel-0.7.1).
+(Local hardware multicast.)
+* [v0.8](https://github.com/POETSII/tinsel/releases/tag/v0.8):
+Released on 24 Jun 2020 and maintained in the
 [master branch](https://github.com/POETSII/tinsel/).
-(Localised hardware multicast.)
+(Global hardware multicast.)
 
 ## Contents
 
@@ -45,8 +48,9 @@ Released on 2 Dec 2019 and maintained in the
 * [4. Tinsel Cache](#4-tinsel-cache)
 * [5. Tinsel Mailbox](#5-tinsel-mailbox)
 * [6. Tinsel Network](#6-tinsel-network)
-* [7. Tinsel HostLink](#7-tinsel-hostlink)
-* [8. POLite API](#8-polite-api)
+* [7. Tinsel Router](#7-tinsel-router)
+* [8. Tinsel HostLink](#8-tinsel-hostlink)
+* [9. POLite API](#9-polite-api)
 
 ## Appendices
 
@@ -62,24 +66,19 @@ Released on 2 Dec 2019 and maintained in the
 ## 1. Overview
 
 On the [POETS Project](https://poets-project.org/about), we are
-looking at ways to accelerate applications that can be expressed as
-large numbers of small processes communicating by message-passing.
-Our first attempt is based around a manythread RISC-V architecture
-called Tinsel running on an FPGA cluster.  Tinsel aims to support
-irregular applications that have heavy memory and communication
-demands, but fairly modest compute requrements.  The main features are:
+looking at ways to accelerate applications that are naturally
+expressed as a large number of small processes communicating by
+message-passing.  Our first attempt is based around a manythread
+RISC-V architecture called Tinsel, running on an FPGA cluster.  The
+main features are:
 
   * **Multithreading**.  A critical aspect of the design
     is to tolerate latency as cleanly as possible.  This includes the
-    latencies arising from: floating-point on Stratix V FPGAs
-    (tens of cycles); off-chip memories; deep pipelines
-    (keeping Fmax high); and sharing of resources between cores
+    latencies arising from floating-point on Stratix V FPGAs
+    (tens of cycles), off-chip memories, deep pipelines
+    (keeping Fmax high), and sharing of resources between cores
     (such as caches, mailboxes, and FPUs).
 
-  * **Caches**.  To keep the programming model simple, we have opted
-    to use thread-partitioned data caches to optimise access to
-    off-chip memory rather than DMA. 
-
   * **Message-passing**. Although there is a requirement to support a
     large amount of memory, it is not necessary to provide the
     illusion of a single shared memory space: message-passing is intended
@@ -87,17 +86,22 @@ demands, but fairly modest compute requrements.  The main features are:
     instructions for sending and receiving messages 
     between any two threads in the cluster.
 
-  * **Hardware termination detection**.  A global termination event is
+  * **Hardware termination-detection**.  A global termination event is
     triggered when every thread indicates termination and no messages
     are in-flight.  Termination can be interpreted as termination of a
     time step, or termination of the application, supporting
     both synchronous and asynchronous event-driven systems.
 
-  * **Localised hardware multicast**.  Threads can send a message to
-    multiple colocated destination threads simultaneously, greatly reducing
+  * **Local hardware multicast**.  Threads can send a message to
+    multiple collocated destination threads simultaneously, greatly reducing
     the number of inter-thread messages in applications exhibiting good
     locality of communication.
 
+  * **Global hardware multicast**.  Programmable routers
+    automatically propagate messages to any number of destination
+    threads distributed throughout the cluster, minimising inter-FPGA
+    bandwidth usage for distributed fanouts.
+
   * **Host communication**. Tinsel threads communicate with x86
     machines distributed throughout the FPGA cluster, for command and
     control, via PCI Express and USB.
@@ -106,7 +110,7 @@ demands, but fairly modest compute requrements.  The main features are:
     include custom accelerators written in SystemVerilog.
 
 This repository also includes a prototype high-level vertex-centric
-programming API for Tinsel, called [POLite](#8-polite-api).
+programming API for Tinsel, called [POLite](#9-polite-api).
 
 ## 2. High-Level Structure
 
@@ -133,11 +137,13 @@ accelerators](doc/custom) in tiles.
 
 #### Tinsel FPGA
 
-Each FPGA contains two *Tinsel Slices*, with each slice typically
+Each FPGA contains two *Tinsel Slices*, with each slice by default
 comprising eight tiles connected to one 4GB DDR3 DIMM and two 8MB
 QDRII+ SRAMs.  All tiles are connected together via a routers to form
-a 2D NoC.  At the edges of the NoC are the inter-FPGA reliable
-links.
+a 2D NoC.  The NoC is connected to the inter-FPGA links using a
+*per-board programmable router*.  Note that the per-board router also
+has connections to off-chip memory: this is where the programmable
+routing tables are stored.
 
 <img align="center" src="doc/figures/fpga.png">
 
@@ -418,16 +424,22 @@ has reached the destination or none of it has.  As one would expect,
 shorter messages consume less bandwidth than longer ones.  The size of
 a flit is defined by `LogWordsPerFlit`.
 
-At the heart of a mailbox is a memory-mapped *scratchpad* that
-stores both incoming and outgoing messages.  The capacity of the
-scratchpad is defined by `LogMsgsPerMailbox`.  Each thread connected
-to the mailbox has one message slot reserved for sending messages.
-The address of this slot is obtained using the following Tinsel API
-call.
+At the heart of a mailbox is a memory-mapped *scratchpad* that stores
+both incoming and outgoing messages.  The capacity of the scratchpad
+is defined by `LogMsgsPerMailbox`.  Each thread connected to the
+mailbox has one or two message slots reserved for sending messages.
+(By default, only a single send slot is reserved; the extra send slot
+may be optionally reserved at power-up via a parameter to the
+[HostLink](#8-tinsel-hostlink) constructor.)  The addresses of these
+slots are obtained using the following Tinsel API calls.
 
 ```c
-// Get pointer to thread's message slot reserved for sending.
+// Get pointer to thread's message slot reserved for sending
 volatile void* tinselSendSlot();
+
+// Get pointer to thread's extra message slot reserved for sending
+// (Assumes that HostLink has requested the extra slot)
+volatile void* tinselSendSlotExtra();
 ```
 
 Once a thread has written a message to the scratchpad, it can trigger
@@ -544,7 +556,7 @@ Tinsel also provides a function
   int tinselIdle(bool vote);
 ```
 
-which blocks until either
+for global termination detection, which blocks until either
 
   1. a message is available to receive, or
 
@@ -639,7 +651,208 @@ communication.  And since we are using the links point-to-point,
 almost all of the ethernet header fields can be used for our own
 purposes, resulting in very little overhead on the wire.
 
-## 7. Tinsel HostLink
+## 7. Tinsel Router
+
+Tinsel provides a programmable router on each FPGA board to support
+*global* multicasting.  Programmable routers automatically propagate
+messages to any number of destination threads distributed throughout
+the cluster, minimising inter-FPGA bandwidth usage for distributed
+fanouts, and offloading work from the cores.  Further background can
+be found in [PIP 24](doc/PIP-0024-global-multicast.md).
+
+To support programmable routers, the destination component of a
+message is generalised so that it can be (1) a thread id; or (2) a
+*routing key*.  A message, sent by a thread, containing a routing
+key as a destination will go to a per-board router on the same
+FPGA.  The router will use the key as an index into a DRAM-based
+routing table and automatically propagate the message towards all the
+destinations associated with that key. 
+
+A **routing key** is a 32-bit value consisting of a board-local *ram
+id*, a *pointer*, and a *size*:
+
+```sv
+// 32-bit routing key (MSB to LSB)
+typedef struct {
+  // Which off-chip RAM on this board?
+  Bit#(`LogDRAMsPerBoard) ram;
+  // Pointer to array of routing beats containing routing records
+  Bit#(`LogBeatsPerDRAM) ptr;
+  // Number of beats in the array
+  Bit#(`LogRoutingEntryLen) numBeats;
+} RoutingKey;
+```
+
+To send a message using a routing key as the destination, a new Tinsel
+API call is provided:
+
+```c
+// Send message at addr using given routing key 
+inline void tinselKeySend(uint32_t key, volatile void* addr);
+```
+
+When a message reaches the per-board router, the `ptr` field of the
+routing key is used as an index into DRAM, where a sequence of 256-bit
+**routing beats** are found.  The `numBeats` field of the routing key
+indicates how many contiguous routing beats there are.  The value of
+`numBeats` may be zero, in which case there are no destinations
+associated with the key.
+
+A routing beat consists of a *size* and a sequence of five 48-bit
+*routing chunks*:
+
+```sv
+// 256-bit routing beat (aligned, MSB to LSB)
+typedef struct {
+  // Number of routing records present in this beat
+  Bit#(16) size;
+  // Five 48-bit record chunks
+  Vector#(5, Bit#(48)) chunks;
+} RoutingBeat;
+```
+
+The *size* must lie in the range 1 to 5 inclusive (0 is disallowed).
+A **routing record** consists of one or two routing chunks, depending
+on the **record type**.
+
+All byte orderings are little endian.  For example, the order of bytes
+in a routing beat is as follows.
+
+Byte | Contents
+---- | --------
+31:  | Upper byte of size (i.e. number of records in beat)
+30:  | Lower byte of size
+29:  | Upper byte of first chunk
+...  | ...
+24:  | Lower byte of first chunk
+23:  | Upper byte of second chunk
+...  | ...
+18:  | Lower byte of second chunk
+17:  | Upper byte of third chunk
+...  | ...
+12:  | Lower byte of third chunk
+11:  | Upper byte of fourth chunk
+...  | ...
+ 6:  | Lower byte of fourth chunk
+ 5:  | Upper byte of fifth chunk
+...  | ...
+ 0:  | Lower byte of fifth chunk
+
+Clearly, both routing keys and routing beats have a maximum size.
+However, in principle there is no limit to the number of records
+associated with a key, due to the possibility of *indirection records*
+(see below).
+
+There are five types of routing record, defined below.
+
+**48-bit Unicast Router-to-Mailbox (URM1):**
+
+```sv
+typedef struct {
+  // Record type (URM1 == 0)
+  Bit#(3) tag;
+  // Mailbox destination
+  Bit#(4) mbox;
+  // Mailbox-local thread identifier
+  Bit#(6) thread;
+  // Unused
+  Bit#(3) unused;
+  // Local key. The first word of the message
+  // payload is overwritten with this.
+  Bit#(32) localKey;
+} URM1Record;
+```
+
+The `localKey` can be used for anything, but might encode the
+destination thread-local device identifier, or edge identifier, or
+both.  The `mbox` field is currently 4 bits (two Y bits followed by
+two X bits), but there are spare bits available to increase the size
+of this field in future if necessary.
+
+**96-bit Unicast Router-to-Mailbox (URM2):**
+
+```sv
+typedef struct {
+  // Record type (URM2 == 1)
+  Bit#(3) tag;
+  // Mailbox destination
+  Bit#(4) mbox;
+  // Mailbox-local thread identifier
+  Bit#(6) thread;
+  // Currently unused
+  Bit#(19) unused;
+  // Local key. The first two words of the message
+  // payload are overwritten with this.
+  Bit#(64) localKey;
+} URM2Record;
+```
+
+This is the same as a URM1 record except the local key is 64 bits in
+size.
+
+**48-bit Router-to-Router (RR):**
+
+```sv
+typedef struct {
+  // Record type (RR == 2)
+  Bit#(3) tag;
+  // Direction (N,S,E,W == 0,1,2,3)
+  Bit#(2) dir;
+  // Currently unused
+  Bit#(11) unused;
+  // New 32-bit routing key that will replace the one in the
+  // current message for the next hop of the message's journey
+  Bit#(32) newKey;
+} RRRecord;
+```
+
+The `newKey` field will replace the key in the current message for the
+next hop of the message's journey.  Introducing a new key at each hop
+simplifies the mapping process (keeping it quick).
+
+**96-bit Multicast Router-to-Mailbox (MRM):**
+
+```sv
+typedef struct {
+  // Record type (MRM == 3)
+  Bit#(3) tag;
+  // Mailbox destination
+  Bit#(4) mbox;
+  // Currently unused
+  Bit#(9) unused;
+  // Local key. The least-significant half-word
+  // of the message is replaced with this
+  Bit#(16) localKey;
+  // Mailbox-local destination mask
+  Bit#(64) destMask;
+} MRMRecord;
+```
+
+**48-bit Indirection (IND):**
+
+```sv
+// 48-bit Indirection (IND) record
+// Note the restrictions on IND records:
+// 1. At most one IND record per key lookup
+// 2. A max-sized key lookup must contain an IND record
+typedef struct {
+  // Record type (IND == 4)
+  Bit#(3) tag;
+  // Currently unused
+  Bit#(13) unused;
+  // New 32-bit routing key for new set of records on current router
+  Bit#(32) newKey;
+} INDRecord;
+```
+
+Indirection records can be used to handle large fanouts that need more
+routing beats than the `numBeats` field of the routing key can express.
+
+Finally, it is worth noting that when using programmable routers,
+there is an added responsibility for the programmer to use a
+deadlock-free routing scheme, such as dimension-ordered routing.
+
+## 8. Tinsel HostLink
 
 *HostLink* is the means by which Tinsel cores running on a mesh of
 FPGA boards communicate with a *host PC*.  It comprises three main
@@ -647,7 +860,7 @@ communication channels:
 
 * An FPGA *bridge board* that connects the host PC inside a POETS box
 (PCI Express) to the FPGA mesh (SFP+).  Using this high-bandwidth
-channel (10Gbps), the host PC can efficiently send messages to any
+channel (2 x 10Gbps), the host PC can efficiently send messages to any
 Tinsel thread and vice-versa.
 
 * A set of *debug links* connecting the host PC inside a POETS box to
@@ -662,34 +875,45 @@ each FPGA's *power management module* via separate USB UART cables.
 These connections can be used to power-on/power-off each FPGA and to
 monitor power consumption, temperature, and fan tachometer.
 
-HostLink supports multiple POETS boxes, but requires that one of these
-boxes is designated as the **master box**.  Currently, all messages
-are injected/extracted to/from the FPGA network via the master box's
-bridge board.
-
-A Tinsel application typically consists of two programs: one which
-runs on the RISC-V cores, linked against the [Tinsel
+HostLink allows multiple POETS boxes to be used to run an application,
+but requires that one of these boxes is designated as the **master
+box**.  A Tinsel application typically consists of two programs: one
+which runs on the RISC-V cores, linked against the [Tinsel
 API](#f-tinsel-api), and the other which runs on the host PC of the
 master box, linked against the [HostLink API](#g-hostlink-api).  The
 HostLink API is implemented as a C++ class called `HostLink`.  The
 constructor for this class first powers up all the worker FPGAs (which
-are by default powered down).  On power-up the FPGAs are automatically
-programmed using the Tinsel bit-file residing in flash memory, and are
-ready to be used within a few seconds, as soon as the `HostLink`
-constructor returns.
+are by default powered down).  On power-up, the FPGAs are
+automatically programmed using the Tinsel bit-file residing in flash
+memory, and are ready to be used within a few seconds, as soon as the
+`HostLink` constructor returns.
 
 The `HostLink` constructor is overloaded:
 
 ```cpp
 HostLink::HostLink();
 HostLink::HostLink(uint32_t numBoxesX, uint32_t numBoxesY);
+HostLink::HostLink(HostLinkParams params);
 ```
 
 If it is called without any arguments, then it assumes that a single
-box is to be used.  Alternatively, the user may request multiple
-boxes by specifying the width and height of the box sub-mesh they
-wish to use.  (The box from which the application is started is
-considered as the origin of this sub-mesh.)
+box is to be used.  Alternatively, the user may request multiple boxes
+by specifying the width and height of the box sub-mesh they wish to
+use.  (The box from which the application is started, i.e. the master
+box, is considered as the origin of this sub-mesh.)  The most
+general constructor takes a `HostLinkParams` structure as an argument,
+which allows additional options to be specified.
+
+```cpp
+// HostLink parameters
+struct HostLinkParams {
+  // Number of boxes to use (default is 1x1)
+  uint32_t numBoxesX;
+  uint32_t numBoxesY;
+  // Enable use of tinselSendSlotExtra() on threads (default is false)
+  bool useExtraSendSlot;
+};
+```
 
 HostLink methods for sending and receiving messages on the host PC are
 as follows.
@@ -711,6 +935,12 @@ bool HostLink::canRecv();
 // Receive a message (blocking), given size of message in bytes
 // Any bytes beyond numBytes up to the next message boundary will be ignored
 void HostLink::recvMsg(void* msg, uint32_t numBytes);
+
+// Send a message using routing key (blocking)
+bool HostLink::keySend(uint32_t key, uint32_t numFlits, void* msg);
+
+// Try to send using routing key (non-blocking, returns true on success)
+bool HostLink::keyTrySend(uint32_t key, uint32_t numFlits, void* msg);
 ```
 
 The `send` method allows a message consisting of multiple flits to be
@@ -895,7 +1125,7 @@ not be called.  When the application returns from `main()`, all but
 one thread on each core are killed and the remaining threads reenter
 the boot loader.
 
-## 8. POLite API
+## 9. POLite API
 
 POLite is a layer of abstraction that takes care of mapping arbitrary
 task graphs onto the Tinsel overlay, completely hiding architectural
@@ -1069,16 +1299,24 @@ by each thread.
 After mapping, POLite writes the graph into cluster memory and
 triggers execution.  By default, vertex states are written into the
 off-chip QDRII+ SRAMs, and edge lists are written in the DDR3 DRAMs.
-This default behaviour can be modified by setting the boolean flags
-`graph.mapVerticesToDRAM`, `graph.mapInEdgesToDRAM`,
-`graph.mapOutEdgesToDRAM` accordingly (true means "map to DRAM" and
-false means "map to SRAM").  Once the application is up and running,
-the host and the graph vertices can continue to communicate: any
-vertex can send messages to the host via the `HostPin` or the `finish`
-handler, and the host can send messages to any vertex.
+This default behaviour can be modified by adjusting the following
+flags of the `PGraph` class.
+
+  Flag                     | Default
+  ------------------------ | -------
+  `mapVerticesToDRAM`      | `false`
+  `mapInEdgeHeadersToDRAM` | `true`
+  `mapInEdgeRestToDRAM`    | `true`
+  `mapOutEdgesToDRAM`      | `true`
+
+A value of `true` means "map to DRAM", while `false` means "map to
+(off-chip) SRAM".  Once the application is up and running, the host
+and the graph vertices can continue to communicate: any vertex can
+send messages to the host via the `HostPin` or the `finish` handler,
+and the host can send messages to any vertex.
 
 **Softswitch**. Central to POLite is an event loop running on each
-Tinsel thread, which we call **the softswitch** as it effectively
+Tinsel thread, which we call the softswitch as it effectively
 context-switches between vertices mapped to the same thread.  The
 softswitch has four main responsibilities: (1) to maintain a queue of
 vertices wanting to send; (2) to implement multicast sends over a pin
@@ -1087,14 +1325,34 @@ messages efficiently between vertices running on the same thread and
 on different threads; and (4) to invoke the vertex handlers when
 required, to meet the semantics of the POLite library.
 
-**Limitations**. POLite provides several important features of the
-vertex-centric paradigm, but there are some limitations. One of the
-features of the Pregel framework is the ability for vertices to add
-and remove vertices and edges at runtime -- but currently, POLite only
-supports static graphs.  And for large *non-localised* fan-outs, a
-hierarchical hardware or software multicast feature may be desirable
-(where messages get forked at intermediate stages along the way to the
-destinations).
+**POLite static parameters**. The following macros can be defined,
+before the first instance of `#include <POLite.h>`, to control some
+aspects of POLite behaviour.
+
+  Macro                     | Meaning
+  ---------                 | -------
+  `POLITE_NUM_PINS`         | Max number of pins per vertex (default 1)
+  `POLITE_DUMP_STATS`       | Dump stats upon completion
+  `POLITE_COUNT_MSGS`       | Include message counts in stats dump
+  `POLITE_EDGES_PER_HEADER` | Lower this for large edge states (default 6)
+
+**POLite dynamic parameters**.  The following environment variables can
+be set, to control some aspects of POLite behaviour.
+
+  Environment variable | Meaning
+  -------------------- | -------
+  `HOSTLINK_BOXES_X`   | Size of box mesh to use in X dimension
+  `HOSTLINK_BOXES_Y`   | Size of box mesh to use in Y dimension
+  `POLITE_BOARDS_X`    | Size of board mesh to use in X dimension
+  `POLITE_BOARDS_Y`    | Size of board mesh to use in Y dimension
+  `POLITE_CHATTY`      | Set to `1` to enable emission of mapper stats
+  `POLITE_PLACER`      | Use `metis`, `random`, `bfs`, or `direct` placement
+
+**Limitations**. POLite is primarily intended as a prototype library
+for hardware evaluation purposes. It occupies a single, simple point
+in a wider, richer design space.  In particular, it doesn't support
+dynamic creation of vertices and edges, and it hasn't been optimised
+to deal with highly non-uniform fanouts.
 
 ## A. DE5-Net Synthesis Report
 
@@ -1111,9 +1369,10 @@ The default Tinsel configuration on a single DE5-Net board contains:
   * four QDRII+ SRAM controllers
   * four 10Gbps reliable links
   * one termination/idle detector
+  * one 8x8 programmable router
   * a JTAG UART
 
-The clock frequency is 225MHz and the resource utilisation is 74% of
+The clock frequency is 215MHz and the resource utilisation is 84% of
 the DE5-Net.
 
 ## B. Tinsel Parameters
@@ -1143,9 +1402,9 @@ the DE5-Net.
   `MeshXLenWithinBox`      |       3 | Boards in X dimension within box
   `MeshYLenWithinBox`      |       2 | Boards in Y dimension within box
   `EnablePerfCount`        |    True | Enable performance counters
-  `ClockFreq`              |     225 | Clock frequency in MHz
+  `ClockFreq`              |     215 | Clock frequency in MHz
 
-Further parameters can be found in [config.py](config.py).
+A full list of parameters can be found in [config.py](config.py).
 
 ## C. Tinsel Memory Map
 
@@ -1204,15 +1463,20 @@ separate memory regions (which they are not).
 
 Optional performance-counter CSRs (when `EnablePerfCount` is `True`):
 
-  Name             | CSR    | R/W | Function
-  ---------------- | ------ | --- | --------
-  `PerfCount`      | 0xc07  | W   | Reset(0)/Start(1)/Stop(2) all counters
-  `MissCount`      | 0xc08  | R   | Cache miss count
-  `HitCount`       | 0xc09  | R   | Cache hit count
-  `WritebackCount` | 0xc0a  | R   | Cache writeback count
-  `CPUIdleCount`   | 0xc0b  | R   | CPU idle-cycle count (lower 32 bits)
-  `CPUIdleCountU`  | 0xc0c  | R   | CPU idle-cycle count (upper 8 bits)
-  `CycleU`         | 0xc0d  | R   | Cycle counter (upper 8 bits)
+ Name                  | CSR    | R/W | Function
+ ----------------      | ------ | --- | --------
+ `PerfCount`           | 0xc07  | W   | Reset(0)/Start(1)/Stop(2) all counters
+ `MissCount`           | 0xc08  | R   | Cache miss count
+ `HitCount`            | 0xc09  | R   | Cache hit count
+ `WritebackCount`      | 0xc0a  | R   | Cache writeback count
+ `CPUIdleCount`        | 0xc0b  | R   | CPU idle-cycle count (lower 32 bits)
+ `CPUIdleCountU`       | 0xc0c  | R   | CPU idle-cycle count (upper 8 bits)
+ `CycleU`              | 0xc0d  | R   | Cycle counter (upper 8 bits)
+ `ProgRouterSent`      | 0xc0e  | R   | Total msgs sent by ProgRouter
+ `ProgRouterSentInter` | 0xc0f  | R   | Inter-board msgs sent by ProgRouter
+
+Note that `ProgRouterSent` and `ProgRouterSentInter` are only valid
+from thread zero on each board.
 
 Tinsel also supports the following custom instructions.
 
@@ -1258,6 +1522,13 @@ inline void tinselFlushLine(uint32_t lineNum, uint32_t way);
 // (A message of length n is comprised of n+1 flits)
 inline void tinselSetLen(uint32_t n);
 
+// Get pointer to thread's message slot reserved for sending
+volatile void* tinselSendSlot();
+
+// Get pointer to thread's extra message slot reserved for sending
+// (Assumes that HostLink has requested the extra slot)
+volatile void* tinselSendSlotExtra();
+
 // Determine if calling thread can send a message
 inline uint32_t tinselCanSend();
 
@@ -1273,6 +1544,9 @@ inline void tinselMulticast(
 // (Address must be aligned on message boundary)
 inline void tinselSend(uint32_t dest, volatile void* addr);
 
+// Send message at address using given routing key
+inline void tinselKeySend(uint32_t key, volatile void* addr);
+
 // Determine if calling thread can receive a message
 inline uint32_t tinselCanRecv();
 
@@ -1352,6 +1626,14 @@ inline uint32_t tinselCPUIdleCountU();
 // Read cycle counter (upper 8 bits)
 inline uint32_t tinselCycleCountU();
 
+// Performance counter: number of messages emitted by ProgRouter
+// (Only valid from thread zero on each board)
+inline uint32_t tinselProgRouterSent();
+
+// Performance counter: number of inter-board messages emitted by ProgRouter
+// (Only valid from thread zero on each board)
+inline uint32_t tinselProgRouterSentInterBoard();
+
 // Address construction
 inline uint32_t tinselToAddr(
          uint32_t boardX, uint32_t boardY,
@@ -1410,6 +1692,12 @@ class HostLink {
   // Any bytes beyond numBytes up to the next message boundary will be ignored
   void recvMsg(void* msg, uint32_t numBytes);
 
+  // Send a message using routing key (blocking by default)
+  bool keySend(uint32_t key, uint32_t numFlits, void* msg, bool block = true);
+
+  // Try to send using routing key (non-blocking, returns true on success)
+  bool keyTrySend(uint32_t key, uint32_t numFlits, void* msg);
+
   // Bulk send and receive
   // ---------------------
 
@@ -1476,14 +1764,24 @@ class HostLink {
   // Trigger application execution on all started threads on given core
   void goOne(uint32_t meshX, uint32_t meshY, uint32_t coreId);
 };
+
+// HostLink parameters (used by the most general HostLink constructor)
+struct HostLinkParams {
+  // Number of boxes to use (default is 1x1)
+  uint32_t numBoxesX;
+  uint32_t numBoxesY;
+  // Enable use of tinselSendSlotExtra() on threads (default is false)
+  bool useExtraSendSlot;
+};
 ```
 
 ```cpp
 class DebugLink {
  public:
 
-  // Constructor
+  // Constructors
   DebugLink(uint32_t numBoxesX, uint32_t numBoxesY);
+  DebugLink(DebugLinkParams params);
 
   // On given board, set destination core and thread
   void setDest(uint32_t boardX, uint32_t boardY,
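
Reviewer note on the README changes above (section 7, Tinsel Router): the routing-beat byte layout and the URM1 record fields can be illustrated with a small host-side sketch in C. This is only a sketch under stated assumptions, not code from the Tinsel repository. The helper names `urm1Chunk` and `packBeat` are hypothetical. It assumes that the first field of each record struct occupies the most-significant bits (as the "MSB to LSB" comments on `RoutingKey` and `RoutingBeat` suggest), and that records fill the beat starting from the first chunk. Only single-chunk (48-bit) records are handled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Build a 48-bit URM1 chunk (hypothetical helper).
// Assumed bit layout, first struct field in the most-significant bits:
//   tag[47:45] mbox[44:41] thread[40:35] unused[34:32] localKey[31:0]
static uint64_t urm1Chunk(uint32_t mbox, uint32_t thread, uint32_t localKey)
{
  assert(mbox < 16 && thread < 64);
  uint64_t tag = 0; // URM1 == 0
  return (tag << 45)
       | ((uint64_t) mbox << 41)
       | ((uint64_t) thread << 35)
       | (uint64_t) localKey;
}

// Pack n single-chunk records into a 32-byte routing beat (hypothetical
// helper). beat[k] holds byte k of the README's byte table: the size sits
// in bytes 31..30, the first chunk in bytes 29..24, and the fifth chunk
// in bytes 5..0. Returns 0 on success, -1 if n is outside 1..5.
static int packBeat(uint8_t beat[32], const uint64_t* chunks, uint32_t n)
{
  if (n < 1 || n > 5) return -1;
  memset(beat, 0, 32);
  beat[31] = (uint8_t) (n >> 8); // Upper byte of size
  beat[30] = (uint8_t) n;        // Lower byte of size
  for (uint32_t i = 0; i < n; i++) {
    uint32_t base = 24 - 6 * i;  // Lowest byte of chunk i
    for (uint32_t b = 0; b < 6; b++)
      beat[base + b] = (uint8_t) (chunks[i] >> (8 * b));
  }
  return 0;
}
```

Under these assumptions, a beat holding one URM1 record for mailbox 3, thread 5 carries the low byte of `localKey` at byte 24 and the record tag in the top bits of byte 29. Two-chunk records (URM2, MRM) would need extra handling, because the beat's size field counts records rather than chunks.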
diff --git a/apps/POLite/asp-gals/ASP.h b/apps/POLite/asp-gals/ASP.h
index 42462622..f69dfa3d 100644
--- a/apps/POLite/asp-gals/ASP.h
+++ b/apps/POLite/asp-gals/ASP.h
@@ -9,8 +9,8 @@
 #ifndef _ASP_H_
 #define _ASP_H_
 
-//#define POLITE_DUMP_STATS
-//#define POLITE_COUNT_MSGS
+#define POLITE_DUMP_STATS
+#define POLITE_COUNT_MSGS
 
 // Lightweight POETS frontend
 #include <POLite.h>
diff --git a/apps/POLite/asp-gals/Run.cpp b/apps/POLite/asp-gals/Run.cpp
index d50821ce..4c00e1da 100644
--- a/apps/POLite/asp-gals/Run.cpp
+++ b/apps/POLite/asp-gals/Run.cpp
@@ -51,7 +51,8 @@ int main(int argc, char**argv)
   // Create random set of source nodes
   uint32_t numSources = NUM_SOURCES*32;
   uint32_t sources[numSources];
-  randomSet(numSources, sources, graph.numDevices);
+  //randomSet(numSources, sources, graph.numDevices);
+  for (int i = 0; i < numSources; i++) sources[i] = i;
 
   // Initialise devices
   for (PDeviceId i = 0; i < graph.numDevices; i++) {
@@ -102,7 +103,9 @@ int main(int argc, char**argv)
   // Display time
   timersub(&finish, &start, &diff);
   double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+  #ifndef POLITE_DUMP_STATS
   printf("Time = %lf\n", duration);
+  #endif
 
   return 0;
 }
diff --git a/apps/POLite/asp-pc/Makefile b/apps/POLite/asp-pc/Makefile
index 0cf7448f..bf9439f3 100644
--- a/apps/POLite/asp-pc/Makefile
+++ b/apps/POLite/asp-pc/Makefile
@@ -1,10 +1,10 @@
 # SPDX-License-Identifier: BSD-2-Clause
-all: asp GenHypercube GenTree GenGeoGraph
+all: asp GenHypercube GenTree
 
 INC=../../../../include
 
 asp: asp.cpp
-	g++ -fopenmp -D_DEFAULT_SOURCE -I$(INC) -O3 asp.cpp -o asp
+	g++ -I$(INC) -O3 asp.cpp -o asp
 
 GenHypercube: GenHypercube.hs
 	ghc -O2 --make GenHypercube.hs
@@ -12,8 +12,5 @@ GenHypercube: GenHypercube.hs
 GenTree: GenTree.hs
 	ghc -O2 --make GenTree.hs
 
-GenGeoGraph: GenGeoGraph.cpp
-	g++ -O2 -lstdc++ GenGeoGraph.cpp -o GenGeoGraph
-
 clean:
-	rm -f asp GenHypercube GenTree GenGeoGraph *.hi *.o
+	rm -f asp GenHypercube GenTree *.hi *.o
diff --git a/apps/POLite/asp-pc/asp-push.cpp b/apps/POLite/asp-pc/asp-push.cpp
new file mode 100644
index 00000000..a75f6628
--- /dev/null
+++ b/apps/POLite/asp-pc/asp-push.cpp
@@ -0,0 +1,180 @@
+// SPDX-License-Identifier: BSD-2-Clause
+#include "RandomSet.h"
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <assert.h>
+#include <sys/time.h>
+
+// Number of nodes and edges
+uint32_t numNodes;
+uint32_t numEdges;
+
+// Mapping from node id to array of neighbouring node ids
+// First element of each array holds the number of neighbours
+uint32_t** neighbours;
+
+// Mapping from node id to bit vector of reaching nodes
+uint64_t** reaching;
+uint64_t** reachingNext;
+
+// Number of 64-bit words in reaching vector
+const uint64_t vectorSize = 1;
+
+void readGraph(const char* filename, bool undirected)
+{
+  // Read edges
+  FILE* fp = fopen(filename, "rt");
+  if (fp == NULL) {
+    fprintf(stderr, "Can't open '%s'\n", filename);
+    exit(EXIT_FAILURE);
+  }
+
+  // Note: each line is read as "dst src", i.e. edge direction is
+  // reversed (as in the pull-based version). The ssp() function
+  // below pushes each node's reaching set to its neighbours.
+
+  // Count number of nodes and edges
+  numEdges = 0;
+  numNodes = 0;
+  int ret;
+  while (1) {
+    uint32_t src, dst;
+    ret = fscanf(fp, "%u %u", &dst, &src);
+    if (ret != 2) break;
+    numEdges++;
+    numNodes = src >= numNodes ? src+1 : numNodes;
+    numNodes = dst >= numNodes ? dst+1 : numNodes;
+  }
+  rewind(fp);
+
+  // Create mapping from node id to number of neighbours
+  uint32_t* count = (uint32_t*) calloc(numNodes, sizeof(uint32_t));
+  for (int i = 0; i < numEdges; i++) {
+    uint32_t src, dst;
+    ret = fscanf(fp, "%u %u", &dst, &src);
+    count[src]++;
+    if (undirected) count[dst]++;
+  }
+
+  // Create mapping from node id to neighbours
+  neighbours = (uint32_t**) calloc(numNodes, sizeof(uint32_t*));
+  rewind(fp);
+  for (int i = 0; i < numNodes; i++) {
+    neighbours[i] = (uint32_t*) calloc(count[i]+1, sizeof(uint32_t));
+    neighbours[i][0] = count[i];
+  }
+  for (int i = 0; i < numEdges; i++) {
+    uint32_t src, dst;
+    ret = fscanf(fp, "%u %u", &dst, &src);
+    neighbours[src][count[src]--] = dst;
+    if (undirected) neighbours[dst][count[dst]--] = src;
+  }
+
+  // Create mapping from node id to bit vector of reaching nodes
+  reaching = (uint64_t**) calloc(numNodes, sizeof(uint64_t*));
+  reachingNext = (uint64_t**) calloc(numNodes, sizeof(uint64_t*));
+  for (int i = 0; i < numNodes; i++) {
+    reaching[i] = (uint64_t*) calloc(vectorSize, sizeof(uint64_t));
+    reachingNext[i] = (uint64_t*) calloc(vectorSize, sizeof(uint64_t));
+  }
+
+  // Release
+  free(count);
+  fclose(fp);
+}
+
+// Compute sum of all shortest paths from given sources
+uint64_t ssp(uint32_t numSources, uint32_t* sources)
+{
+  // Sum of distances
+  uint64_t sum = 0;
+
+  // Initialise reaching vector for each node
+  for (int i = 0; i < numNodes; i++) {
+    for (int j = 0; j < vectorSize; j++) {
+      reaching[i][j] = 0;
+      reachingNext[i][j] = 0;
+    }
+  }
+  for (int i = 0; i < numSources; i++) {
+    uint32_t src = sources[i];
+    reaching[src][i/64] |= (uint64_t) 1 << (i%64);
+  }
+
+  int* queue = new int [numNodes];
+  int queueSize = 0;
+  for (int i = 0; i < numNodes; i++) queue[queueSize++] = i;
+
+  // Distance increases on each iteration
+  uint32_t dist = 1;
+
+  while (queueSize > 0) {
+    // For each node
+    for (int i = 0; i < queueSize; i++) {
+      int me = queue[i];
+      // For each neighbour
+      uint32_t numNeighbours = neighbours[me][0];
+      for (int j = 1; j <= numNeighbours; j++) {
+        uint32_t n = neighbours[me][j];
+        // For each chunk
+        for (int k = 0; k < vectorSize; k++) {
+          if (reaching[me][k] & ~reachingNext[n][k])
+            reachingNext[n][k] |= reaching[me][k];
+        }
+      }
+    }
+
+    // For each node, update reaching vector
+    queueSize = 0;
+    for (int i = 0; i < numNodes; i++) {
+      for (int k = 0; k < vectorSize; k++) {
+        uint64_t diff = reachingNext[i][k] & ~reaching[i][k];
+        if (diff) {
+          queue[queueSize++] = i;
+          uint32_t n = __builtin_popcountll(diff);
+          sum += (uint64_t) n * dist;
+          reaching[i][k] |= reachingNext[i][k];
+        }
+      }
+    }
+    dist++;
+  }
+
+  return sum;
+}
+
+int main(int argc, char**argv)
+{
+  if (argc != 2) {
+    printf("Specify edges file\n");
+    exit(EXIT_FAILURE);
+  }
+  bool undirected = false;
+  readGraph(argv[1], undirected);
+  printf("Nodes: %u.  Edges: %u\n", numNodes, numEdges);
+
+  uint32_t numSources = 64*vectorSize;
+  assert(numSources < numNodes);
+  uint32_t sources[numSources];
+  for (int i = 0; i < numSources; i++) sources[i] = i;
+  //randomSet(numSources, sources, numNodes);
+
+  struct timeval start, finish, diff;
+
+  uint64_t sum = 0;
+  const int nodesPerVector = 64 * vectorSize;
+  gettimeofday(&start, NULL);
+  sum = ssp(numSources, sources);
+  gettimeofday(&finish, NULL);
+
+  printf("Sum of subset of shortest paths = %llu\n", (unsigned long long) sum);
+ 
+  timersub(&finish, &start, &diff);
+  double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+  printf("Time = %lf\n", duration);
+
+  return 0;
+}
diff --git a/apps/POLite/asp-sync/Run.cpp b/apps/POLite/asp-sync/Run.cpp
index 25082646..518a33b5 100644
--- a/apps/POLite/asp-sync/Run.cpp
+++ b/apps/POLite/asp-sync/Run.cpp
@@ -19,9 +19,11 @@ int main(int argc, char**argv)
   // Read network
   EdgeList net;
   net.read(argv[1]);
-
+  
   // Print max fan-out
   printf("Max fan-out = %d\n", net.maxFanOut());
+  printf("Min fan-out = %d\n", net.minFanOut());
+  assert(net.minFanOut() > 0);
 
   // Check that parameters make sense
   assert(32*N <= net.numNodes);
@@ -97,7 +99,9 @@ int main(int argc, char**argv)
   // Display time
   timersub(&finish, &start, &diff);
   double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+  #ifndef POLITE_DUMP_STATS
   printf("Time = %lf\n", duration);
+  #endif
 
   return 0;
 }
diff --git a/apps/POLite/asp-tiles-sync/Run.cpp b/apps/POLite/asp-tiles-sync/Run.cpp
index 049d83a8..cdc2bb14 100644
--- a/apps/POLite/asp-tiles-sync/Run.cpp
+++ b/apps/POLite/asp-tiles-sync/Run.cpp
@@ -135,11 +135,11 @@ int main(int argc, char**argv)
   double duration;
   timersub(&finishCompute, &startCompute, &diff);
   duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
-  printf("Time (compute) = %lf\n", duration);
+  printf("Time (compute, including stats transfer over UART) = %lf\n", duration);
   gettimeofday(&finishAll, NULL);
   timersub(&finishAll, &startAll, &diff);
   duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
-  printf("Time (all) = %lf\n", duration);
+  printf("Time (all, including stats transfer over UART) = %lf\n", duration);
 
   return 0;
 }
diff --git a/apps/POLite/clocktree-async/Run.cpp b/apps/POLite/clocktree-async/Run.cpp
index 270c9b48..02f76723 100644
--- a/apps/POLite/clocktree-async/Run.cpp
+++ b/apps/POLite/clocktree-async/Run.cpp
@@ -93,7 +93,9 @@ int main(int argc, char** argv)
   // Display time
   timersub(&finish, &start, &diff);
   double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+  #ifndef POLITE_DUMP_STATS
   printf("Time = %lf\n", duration);
+  #endif
 
   return 0;
 }
diff --git a/apps/POLite/hashmin-sync/Run.cpp b/apps/POLite/hashmin-sync/Run.cpp
index cb6a7ced..eab92eff 100644
--- a/apps/POLite/hashmin-sync/Run.cpp
+++ b/apps/POLite/hashmin-sync/Run.cpp
@@ -82,7 +82,9 @@ int main(int argc, char**argv)
   // Display time
   timersub(&finish, &start, &diff);
   double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+  #ifndef POLITE_DUMP_STATS
   printf("Time = %lf\n", duration);
+  #endif
 
   return 0;
 }
diff --git a/apps/POLite/heat-cube-sync/Run.cpp b/apps/POLite/heat-cube-sync/Run.cpp
index aaa42c39..1163f01b 100644
--- a/apps/POLite/heat-cube-sync/Run.cpp
+++ b/apps/POLite/heat-cube-sync/Run.cpp
@@ -76,7 +76,9 @@ int main()
   // Display time
   timersub(&finish, &start, &diff);
   double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+  #ifndef POLITE_DUMP_STATS
   printf("Time = %lf\n", duration);
+  #endif
 
   return 0;
 }
diff --git a/apps/POLite/heat-gals/Heat.h b/apps/POLite/heat-gals/Heat.h
index 12ca9574..600b4d00 100644
--- a/apps/POLite/heat-gals/Heat.h
+++ b/apps/POLite/heat-gals/Heat.h
@@ -2,6 +2,8 @@
 #ifndef _HEAT_H_
 #define _HEAT_H_
 
+#define POLITE_DUMP_STATS
+#define POLITE_COUNT_MSGS
 #include <POLite.h>
 
 struct HeatMessage {
@@ -10,7 +12,7 @@ struct HeatMessage {
   // Time step
   uint32_t time;
   // Temperature at sender
-  uint32_t val;
+  float val;
 };
 
 struct HeatState {
@@ -21,9 +23,9 @@ struct HeatState {
   // Current time step of device
   uint32_t time;
   // Current temperature of device
-  uint32_t val;
+  float val;
   // Accumulator for temperatures received at times t and t+1
-  uint32_t acc, accNext;
+  float acc, accNext;
   // Count messages sent and received
   uint8_t sent, received, receivedNext;
   // Is the temperature of this device constant?
@@ -45,7 +47,7 @@ struct HeatDevice : PDevice<HeatState, None, HeatMessage> {
     // Proceed to next time step?
     if (s->sent && s->received == s->fanIn) {
       s->time--;
-      if (!s->isConstant) s->val = s->acc >> 2;
+      if (!s->isConstant) s->val = s->acc / (float) s->fanIn;
       s->acc = s->accNext;
       s->received = s->receivedNext;
       s->accNext = s->receivedNext = 0;
diff --git a/apps/POLite/heat-gals/Makefile b/apps/POLite/heat-gals/Makefile
index 0c343edd..86430b66 100644
--- a/apps/POLite/heat-gals/Makefile
+++ b/apps/POLite/heat-gals/Makefile
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: BSD-2-Clause
 APP_CPP = Heat.cpp
 APP_HDR = Heat.h
-RUN_CPP = Run.cpp Colours.cpp
-RUN_H = Colours.h
+RUN_CPP = Run.cpp
+RUN_H =
 
 include ../util/polite.mk
diff --git a/apps/POLite/heat-gals/Run.cpp b/apps/POLite/heat-gals/Run.cpp
index 0a08505b..44c2f921 100644
--- a/apps/POLite/heat-gals/Run.cpp
+++ b/apps/POLite/heat-gals/Run.cpp
@@ -1,17 +1,31 @@
 // SPDX-License-Identifier: BSD-2-Clause
 #include "Heat.h"
-#include "Colours.h"
 
 #include <HostLink.h>
 #include <POLite.h>
+#include <EdgeList.h>
 #include <sys/time.h>
 
-int main()
+int main(int argc, char **argv)
 {
   // Parameters
-  const uint32_t width  = 256;
-  const uint32_t height = 256;
-  const uint32_t time   = 1000;
+  const uint32_t time = 1000;
+
+  // Read in the example edge list and create data structure
+  if (argc != 2) {
+    printf("Specify edge file\n");
+    exit(EXIT_FAILURE);
+  }
+
+  // Load in the edge list file
+  printf("Loading in the graph..."); fflush(stdout);
+  EdgeList net;
+  net.read(argv[1]);
+  printf(" done\n");
+
+  // Print min and max fan-out
+  printf("Min fan-out = %d\n", net.minFanOut());
+  printf("Max fan-out = %d\n", net.maxFanOut());
 
   // Connection to tinsel machine
   HostLink hostLink;
@@ -19,58 +33,32 @@ int main()
   // Create POETS graph
   PGraph<HeatDevice, HeatState, None, HeatMessage> graph;
 
-  // Create 2D mesh of devices
-  PDeviceId **mesh = new PDeviceId* [height];
-  for (uint32_t y = 0; y < height; y++) {
-    mesh[y] = new PDeviceId [width];
-    for (uint32_t x = 0; x < width; x++)
-      mesh[y][x] = graph.newDevice();
+  // Create nodes in POETS graph
+  for (uint32_t i = 0; i < net.numNodes; i++) {
+    PDeviceId id = graph.newDevice();
+    assert(i == id);
   }
 
-  // Add edges
-  for (uint32_t y = 0; y < height; y++)
-    for (uint32_t x = 0; x < width; x++) {
-      if (x < width-1) {
-        graph.addEdge(mesh[y][x],   0, mesh[y][x+1]);
-        graph.addEdge(mesh[y][x+1], 0, mesh[y][x]);
-      }
-      if (y < height-1) {
-        graph.addEdge(mesh[y][x],   0, mesh[y+1][x]);
-        graph.addEdge(mesh[y+1][x], 0, mesh[y][x]);
-      }
-    }
+  // Create connections in POETS graph
+  for (uint32_t i = 0; i < net.numNodes; i++) {
+    uint32_t numNeighbours = net.neighbours[i][0];
+    for (uint32_t j = 0; j < numNeighbours; j++)
+      graph.addEdge(i, 0, net.neighbours[i][j+1]);
+  }
 
   // Prepare mapping from graph to hardware
   graph.map();
 
-  // Set device ids
-  for (uint32_t y = 0; y < height; y++)
-    for (uint32_t x = 0; x < width; x++)
-      graph.devices[mesh[y][x]]->state.id = mesh[y][x];
-
-  // Initialise time and fanIn fields
+  // Initialise state of each device (random initial temperature)
+  srand(1);
   for (PDeviceId i = 0; i < graph.numDevices; i++) {
+    int r = rand() % 255;
+    graph.devices[i]->state.id = i;
     graph.devices[i]->state.time = time;
+    graph.devices[i]->state.val = (float) r;
+    graph.devices[i]->state.isConstant = false;
     graph.devices[i]->state.fanIn = graph.fanIn(i);
   }
- 
-  // Apply constant heat at north edge
-  // Apply constant cool at south edge
-  for (uint32_t x = 0; x < width; x++) {
-    graph.devices[mesh[0][x]]->state.val = 255 << 16;
-    graph.devices[mesh[0][x]]->state.isConstant = true;
-    graph.devices[mesh[height-1][x]]->state.val = 40 << 16;
-    graph.devices[mesh[height-1][x]]->state.isConstant = true;
-  }
-
-  // Apply constant heat at west edge
-  // Apply constant cool at east edge
-  for (uint32_t y = 0; y < height; y++) {
-    graph.devices[mesh[y][0]]->state.val = 255 << 16;
-    graph.devices[mesh[y][0]]->state.isConstant = true;
-    graph.devices[mesh[y][width-1]]->state.val = 40 << 16;
-    graph.devices[mesh[y][width-1]]->state.isConstant = true;
-  }
 
   // Write graph down to tinsel machine via HostLink
   graph.write(&hostLink);
@@ -84,8 +72,11 @@ int main()
   struct timeval start, finish, diff;
   gettimeofday(&start, NULL);
 
+  // Consume performance stats
+  politeSaveStats(&hostLink, "stats.txt");
+
   // Allocate array to contain final value of each device
-  uint32_t* pixels = new uint32_t [graph.numDevices];
+  float* pixels = new float [graph.numDevices];
 
   // Receive final value of each device
   for (uint32_t i = 0; i < graph.numDevices; i++) {
@@ -97,25 +88,19 @@ int main()
     pixels[msg.payload.from] = msg.payload.val;
   }
 
+  // Display final values of first ten devices
+  for (uint32_t i = 0; i < 10; i++) {
+    if (i < graph.numDevices) {
+      printf("%u: %f\n", i, pixels[i]);
+    }
+  }
+
   // Display time
   timersub(&finish, &start, &diff);
   double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+  #ifndef POLITE_DUMP_STATS
   printf("Time = %lf\n", duration);
-
-  // Emit image
-  FILE* fp = fopen("out.ppm", "wt");
-  if (fp == NULL) {
-    printf("Can't open output file for writing\n");
-    return -1;
-  }
-  fprintf(fp, "P3\n%d %d\n255\n", width, height);
-  for (uint32_t y = 0; y < height; y++)
-    for (uint32_t x = 0; x < width; x++) {
-      uint32_t val = (pixels[mesh[y][x]] >> 16) & 0xff;
-      fprintf(fp, "%d %d %d\n",
-        colours[val*3], colours[val*3+1], colours[val*3+2]);
-    }
-  fclose(fp);
+  #endif
 
   return 0;
 }
diff --git a/apps/POLite/heat-gals/Colours.cpp b/apps/POLite/heat-grid-sync/Colours.cpp
similarity index 100%
rename from apps/POLite/heat-gals/Colours.cpp
rename to apps/POLite/heat-grid-sync/Colours.cpp
diff --git a/apps/POLite/heat-gals/Colours.h b/apps/POLite/heat-grid-sync/Colours.h
similarity index 100%
rename from apps/POLite/heat-gals/Colours.h
rename to apps/POLite/heat-grid-sync/Colours.h
diff --git a/apps/POLite/ping-test/ping.cpp b/apps/POLite/heat-grid-sync/Heat.cpp
similarity index 57%
rename from apps/POLite/ping-test/ping.cpp
rename to apps/POLite/heat-grid-sync/Heat.cpp
index 74960d36..b2b4fc3e 100644
--- a/apps/POLite/ping-test/ping.cpp
+++ b/apps/POLite/heat-grid-sync/Heat.cpp
@@ -1,21 +1,21 @@
 // SPDX-License-Identifier: BSD-2-Clause
-#include "ping.h"
+#include "Heat.h"
 
 #include <tinsel.h>
 #include <POLite.h>
 
 typedef PThread<
-          PingDevice,
-          PingState,     // State
+          HeatDevice,
+          HeatState,    // State
           None,         // Edge label
-          PingMessage    // Message
-        > PingThread;
+          HeatMessage   // Message
+        > HeatThread;
 
 int main()
 {
   // Point thread structure at base of thread's heap
-  PingThread* thread = (PingThread*) tinselHeapBaseSRAM();
-
+  HeatThread* thread = (HeatThread*) tinselHeapBaseSRAM();
+  
   // Invoke interpreter
   thread->run();
 
diff --git a/apps/POLite/heat-grid-sync/Heat.h b/apps/POLite/heat-grid-sync/Heat.h
new file mode 100644
index 00000000..b3a63a93
--- /dev/null
+++ b/apps/POLite/heat-grid-sync/Heat.h
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: BSD-2-Clause
+#ifndef _HEAT_H_
+#define _HEAT_H_
+
+#include <POLite.h>
+
+struct HeatMessage {
+  // Sender id
+  uint32_t from;
+  // Time step
+  uint32_t time;
+  // Temperature at sender
+  uint32_t val;
+};
+
+struct HeatState {
+  // Device id
+  uint32_t id;
+  // Current time step of device
+  uint32_t time;
+  // Current temperature of device
+  uint32_t val, acc;
+  // Is the temperature of this device constant?
+  bool isConstant;
+};
+
+struct HeatDevice : PDevice<HeatState, None, HeatMessage> {
+
+  // Called once by POLite at start of execution
+  inline void init() {
+    *readyToSend = Pin(0);
+  }
+
+  // Send handler
+  inline void send(volatile HeatMessage* msg) {
+    msg->from = s->id;
+    msg->time = s->time;
+    msg->val = s->val;
+    *readyToSend = No;
+  }
+
+  // Receive handler
+  inline void recv(HeatMessage* msg, None* edge) {
+    s->acc += msg->val;
+  }
+
+  // Called by POLite when system becomes idle
+  inline bool step() {
+    // Execution complete?
+    if (s->time == 0) {
+      *readyToSend = No;
+      return false;
+    }
+    else {
+      s->time--;
+      if (!s->isConstant) s->val = s->acc >> 2;
+      s->acc = 0;
+      *readyToSend = Pin(0);
+      return true;
+    }
+  }
+
+  // Optionally send message to host on termination
+  inline bool finish(volatile HeatMessage* msg) {
+    msg->from = s->id;
+    msg->val = s->val;
+    return true;
+  }
+};
+
+#endif
diff --git a/apps/POLite/heat-grid-sync/Makefile b/apps/POLite/heat-grid-sync/Makefile
new file mode 100644
index 00000000..0c343edd
--- /dev/null
+++ b/apps/POLite/heat-grid-sync/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: BSD-2-Clause
+APP_CPP = Heat.cpp
+APP_HDR = Heat.h
+RUN_CPP = Run.cpp Colours.cpp
+RUN_H = Colours.h
+
+include ../util/polite.mk
diff --git a/apps/POLite/heat-grid-sync/Run.cpp b/apps/POLite/heat-grid-sync/Run.cpp
new file mode 100644
index 00000000..a938a446
--- /dev/null
+++ b/apps/POLite/heat-grid-sync/Run.cpp
@@ -0,0 +1,119 @@
+// SPDX-License-Identifier: BSD-2-Clause
+#include "Heat.h"
+#include "Colours.h"
+
+#include <HostLink.h>
+#include <POLite.h>
+#include <sys/time.h>
+
+int main()
+{
+  // Parameters
+  const uint32_t width  = 256;
+  const uint32_t height = 256;
+  const uint32_t time   = 1000;
+
+  // Connection to tinsel machine
+  HostLink hostLink;
+
+  // Create POETS graph
+  PGraph<HeatDevice, HeatState, None, HeatMessage> graph;
+
+  // Create 2D mesh of devices
+  PDeviceId **mesh = new PDeviceId* [height];
+  for (uint32_t y = 0; y < height; y++) {
+    mesh[y] = new PDeviceId [width];
+    for (uint32_t x = 0; x < width; x++)
+      mesh[y][x] = graph.newDevice();
+  }
+
+  // Add edges
+  for (uint32_t y = 0; y < height; y++)
+    for (uint32_t x = 0; x < width; x++) {
+      if (x < width-1) {
+        graph.addEdge(mesh[y][x],   0, mesh[y][x+1]);
+        graph.addEdge(mesh[y][x+1], 0, mesh[y][x]);
+      }
+      if (y < height-1) {
+        graph.addEdge(mesh[y][x],   0, mesh[y+1][x]);
+        graph.addEdge(mesh[y+1][x], 0, mesh[y][x]);
+      }
+    }
+
+  // Prepare mapping from graph to hardware
+  graph.map();
+
+  // Set device ids
+  for (uint32_t y = 0; y < height; y++)
+    for (uint32_t x = 0; x < width; x++)
+      graph.devices[mesh[y][x]]->state.id = mesh[y][x];
+
+  // Specify number of time steps to run on each device
+  for (PDeviceId i = 0; i < graph.numDevices; i++)
+    graph.devices[i]->state.time = time;
+ 
+  // Apply constant heat at north edge
+  // Apply constant cool at south edge
+  for (uint32_t x = 0; x < width; x++) {
+    graph.devices[mesh[0][x]]->state.val = 255 << 16;
+    graph.devices[mesh[0][x]]->state.isConstant = true;
+    graph.devices[mesh[height-1][x]]->state.val = 40 << 16;
+    graph.devices[mesh[height-1][x]]->state.isConstant = true;
+  }
+
+  // Apply constant heat at west edge
+  // Apply constant cool at east edge
+  for (uint32_t y = 0; y < height; y++) {
+    graph.devices[mesh[y][0]]->state.val = 255 << 16;
+    graph.devices[mesh[y][0]]->state.isConstant = true;
+    graph.devices[mesh[y][width-1]]->state.val = 40 << 16;
+    graph.devices[mesh[y][width-1]]->state.isConstant = true;
+  }
+
+  // Write graph down to tinsel machine via HostLink
+  graph.write(&hostLink);
+
+  // Load code and trigger execution
+  hostLink.boot("code.v", "data.v");
+  hostLink.go();
+  printf("Starting\n");
+
+  // Start timer
+  struct timeval start, finish, diff;
+  gettimeofday(&start, NULL);
+
+  // Allocate array to contain final value of each device
+  uint32_t* pixels = new uint32_t [graph.numDevices];
+
+  // Receive final value of each device
+  for (uint32_t i = 0; i < graph.numDevices; i++) {
+    // Receive message
+    PMessage<HeatMessage> msg;
+    hostLink.recvMsg(&msg, sizeof(msg));
+    if (i == 0) gettimeofday(&finish, NULL);
+    // Save final value
+    pixels[msg.payload.from] = msg.payload.val;
+  }
+
+  // Display time
+  timersub(&finish, &start, &diff);
+  double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+  printf("Time = %lf\n", duration);
+
+  // Emit image
+  FILE* fp = fopen("out.ppm", "wt");
+  if (fp == NULL) {
+    printf("Can't open output file for writing\n");
+    return -1;
+  }
+  fprintf(fp, "P3\n%d %d\n255\n", width, height);
+  for (uint32_t y = 0; y < height; y++)
+    for (uint32_t x = 0; x < width; x++) {
+      uint32_t val = (pixels[mesh[y][x]] >> 16) & 0xff;
+      fprintf(fp, "%d %d %d\n",
+        colours[val*3], colours[val*3+1], colours[val*3+2]);
+    }
+  fclose(fp);
+
+  return 0;
+}
diff --git a/apps/POLite/heat-pc/Makefile b/apps/POLite/heat-pc/Makefile
new file mode 100644
index 00000000..235863ef
--- /dev/null
+++ b/apps/POLite/heat-pc/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: BSD-2-Clause
+all: heat
+
+INC=../../../include
+
+heat: heat.cpp
+	g++ -I$(INC) -O3 heat.cpp -o heat
+
+.PHONY: clean
+clean:
+	rm -f heat
diff --git a/apps/POLite/heat-pc/heat.cpp b/apps/POLite/heat-pc/heat.cpp
new file mode 100644
index 00000000..194766ac
--- /dev/null
+++ b/apps/POLite/heat-pc/heat.cpp
@@ -0,0 +1,63 @@
+// SPDX-License-Identifier: BSD-2-Clause
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <assert.h>
+#include <sys/time.h>
+#include <EdgeList.h>
+
+int main(int argc, char**argv)
+{
+  if (argc != 2) {
+    printf("Specify edges file\n");
+    exit(EXIT_FAILURE);
+  }
+
+  // Read network
+  EdgeList net;
+  net.read(argv[1]);
+
+  // Create states
+  float* heat = new float [net.numNodes];
+  float* heatNext = new float [net.numNodes];
+  srand(1);
+  for (int i = 0; i < net.numNodes; i++) {
+    int r = rand() % 255;
+    heat[i] = (float) r;
+  }
+
+  // Start timer
+  printf("Started\n");
+  struct timeval start, finish, diff;
+  gettimeofday(&start, NULL);
+
+  for (int t = 0; t < 100; t++) {
+    for (int i = 0; i < net.numNodes; i++) {
+      uint32_t numNeighbours = net.neighbours[i][0];
+      float acc = 0.0;
+      for (uint32_t j = 0; j < numNeighbours; j++) {
+        uint32_t neighbour = net.neighbours[i][j+1];
+        acc += heat[neighbour];
+      }
+      heatNext[i] = acc / (float) numNeighbours;
+    }
+    float* tmp = heat; heat = heatNext; heatNext = tmp;
+  }
+
+  // Stop timer
+  gettimeofday(&finish, NULL);
+
+  // Display final values of first ten devices
+  for (uint32_t i = 0; i < 10; i++) {
+    if (i < net.numNodes)
+      printf("%u: %f\n", i, heat[i]);
+  }
+
+  // Display time
+  timersub(&finish, &start, &diff);
+  double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+  printf("Time = %lf\n", duration);
+
+  return 0;
+}
diff --git a/apps/POLite/heat-sync/Colours.cpp b/apps/POLite/heat-sync/Colours.cpp
deleted file mode 100644
index 93b49740..00000000
--- a/apps/POLite/heat-sync/Colours.cpp
+++ /dev/null
@@ -1,71 +0,0 @@
-// SPDX-License-Identifier: BSD-2-Clause
-#include <stdint.h>
-
-// 256 x RGB colours representing heat intensities
-uint8_t colours[] = {
-  0x00, 0x00, 0x76, 0x00, 0x00, 0x7a, 0x00, 0x00, 0x7f, 0x00, 0x00, 0x83,
-  0x00, 0x00, 0x88, 0x00, 0x00, 0x8c, 0x00, 0x00, 0x91, 0x00, 0x00, 0x95,
-  0x00, 0x00, 0x9a, 0x00, 0x00, 0x9e, 0x00, 0x00, 0xa3, 0x00, 0x00, 0xa3,
-  0x00, 0x00, 0xa7, 0x00, 0x00, 0xac, 0x00, 0x00, 0xb0, 0x00, 0x00, 0xb5,
-  0x00, 0x00, 0xb9, 0x00, 0x00, 0xbe, 0x00, 0x00, 0xc2, 0x00, 0x00, 0xc7,
-  0x00, 0x00, 0xcb, 0x00, 0x00, 0xd0, 0x00, 0x00, 0xd4, 0x00, 0x00, 0xd9,
-  0x00, 0x00, 0xde, 0x00, 0x00, 0xe2, 0x00, 0x00, 0xe7, 0x00, 0x00, 0xeb,
-  0x00, 0x00, 0xf0, 0x00, 0x00, 0xf4, 0x00, 0x00, 0xf9, 0x00, 0x00, 0xfd,
-  0x00, 0x03, 0xff, 0x00, 0x07, 0xff, 0x00, 0x0c, 0xff, 0x00, 0x10, 0xff,
-  0x00, 0x15, 0xff, 0x00, 0x19, 0xff, 0x00, 0x1e, 0xff, 0x00, 0x22, 0xff,
-  0x00, 0x27, 0xff, 0x00, 0x2b, 0xff, 0x00, 0x30, 0xff, 0x00, 0x34, 0xff,
-  0x00, 0x39, 0xff, 0x00, 0x3d, 0xff, 0x00, 0x42, 0xff, 0x00, 0x47, 0xff,
-  0x00, 0x4b, 0xff, 0x00, 0x50, 0xff, 0x00, 0x54, 0xff, 0x00, 0x59, 0xff,
-  0x00, 0x5d, 0xff, 0x00, 0x62, 0xff, 0x00, 0x66, 0xff, 0x00, 0x6b, 0xff,
-  0x00, 0x6f, 0xff, 0x00, 0x74, 0xff, 0x00, 0x78, 0xff, 0x00, 0x7d, 0xff,
-  0x00, 0x81, 0xff, 0x00, 0x86, 0xff, 0x00, 0x8a, 0xff, 0x00, 0x8f, 0xff,
-  0x00, 0x93, 0xff, 0x00, 0x98, 0xff, 0x00, 0x9c, 0xff, 0x00, 0xa1, 0xff,
-  0x00, 0xa5, 0xff, 0x00, 0xaa, 0xff, 0x00, 0xaf, 0xff, 0x00, 0xb3, 0xff,
-  0x00, 0xb8, 0xff, 0x00, 0xbc, 0xff, 0x00, 0xc1, 0xff, 0x00, 0xc5, 0xff,
-  0x00, 0xca, 0xff, 0x00, 0xce, 0xff, 0x00, 0xd3, 0xff, 0x00, 0xd7, 0xff,
-  0x00, 0xdc, 0xff, 0x00, 0xe0, 0xff, 0x00, 0xe5, 0xff, 0x00, 0xe9, 0xff,
-  0x00, 0xee, 0xff, 0x00, 0xf2, 0xff, 0x00, 0xf7, 0xff, 0x00, 0xfb, 0xff,
-  0x00, 0xff, 0xff, 0x00, 0xff, 0xfa, 0x00, 0xff, 0xf5, 0x00, 0xff, 0xf1,
-  0x00, 0xff, 0xec, 0x00, 0xff, 0xe7, 0x00, 0xff, 0xe3, 0x00, 0xff, 0xde,
-  0x00, 0xff, 0xda, 0x00, 0xff, 0xd5, 0x00, 0xff, 0xd1, 0x00, 0xff, 0xcc,
-  0x00, 0xff, 0xc8, 0x00, 0xff, 0xc3, 0x00, 0xff, 0xbf, 0x00, 0xff, 0xba,
-  0x00, 0xff, 0xb6, 0x00, 0xff, 0xb1, 0x00, 0xff, 0xad, 0x00, 0xff, 0xa8,
-  0x00, 0xff, 0xa4, 0x00, 0xff, 0x9f, 0x00, 0xff, 0x9b, 0x00, 0xff, 0x96,
-  0x00, 0xff, 0x92, 0x00, 0xff, 0x8d, 0x00, 0xff, 0x89, 0x00, 0xff, 0x84,
-  0x00, 0xff, 0x80, 0x00, 0xff, 0x7b, 0x00, 0xff, 0x76, 0x00, 0xff, 0x72,
-  0x00, 0xff, 0x6d, 0x00, 0xff, 0x69, 0x00, 0xff, 0x64, 0x00, 0xff, 0x60,
-  0x00, 0xff, 0x5b, 0x00, 0xff, 0x57, 0x00, 0xff, 0x52, 0x00, 0xff, 0x4e,
-  0x00, 0xff, 0x49, 0x00, 0xff, 0x45, 0x00, 0xff, 0x40, 0x00, 0xff, 0x3c,
-  0x00, 0xff, 0x37, 0x00, 0xff, 0x33, 0x00, 0xff, 0x2e, 0x00, 0xff, 0x2a,
-  0x00, 0xff, 0x25, 0x00, 0xff, 0x21, 0x00, 0xff, 0x1c, 0x00, 0xff, 0x18,
-  0x00, 0xff, 0x13, 0x00, 0xff, 0x0e, 0x00, 0xff, 0x0a, 0x00, 0xff, 0x05,
-  0x00, 0xff, 0x01, 0x04, 0xff, 0x00, 0x08, 0xff, 0x00, 0x0d, 0xff, 0x00,
-  0x11, 0xff, 0x00, 0x16, 0xff, 0x00, 0x1a, 0xff, 0x00, 0x1f, 0xff, 0x00,
-  0x23, 0xff, 0x00, 0x28, 0xff, 0x00, 0x2c, 0xff, 0x00, 0x31, 0xff, 0x00,
-  0x35, 0xff, 0x00, 0x3a, 0xff, 0x00, 0x3e, 0xff, 0x00, 0x43, 0xff, 0x00,
-  0x47, 0xff, 0x00, 0x4c, 0xff, 0x00, 0x50, 0xff, 0x00, 0x55, 0xff, 0x00,
-  0x5a, 0xff, 0x00, 0x5e, 0xff, 0x00, 0x63, 0xff, 0x00, 0x67, 0xff, 0x00,
-  0x6c, 0xff, 0x00, 0x70, 0xff, 0x00, 0x75, 0xff, 0x00, 0x79, 0xff, 0x00,
-  0x7e, 0xff, 0x00, 0x82, 0xff, 0x00, 0x87, 0xff, 0x00, 0x8b, 0xff, 0x00,
-  0x90, 0xff, 0x00, 0x94, 0xff, 0x00, 0x99, 0xff, 0x00, 0x9d, 0xff, 0x00,
-  0xa2, 0xff, 0x00, 0xa6, 0xff, 0x00, 0xab, 0xff, 0x00, 0xaf, 0xff, 0x00,
-  0xb4, 0xff, 0x00, 0xb8, 0xff, 0x00, 0xbd, 0xff, 0x00, 0xc2, 0xff, 0x00,
-  0xc6, 0xff, 0x00, 0xcb, 0xff, 0x00, 0xcf, 0xff, 0x00, 0xd4, 0xff, 0x00,
-  0xd8, 0xff, 0x00, 0xdd, 0xff, 0x00, 0xe1, 0xff, 0x00, 0xe6, 0xff, 0x00,
-  0xea, 0xff, 0x00, 0xef, 0xff, 0x00, 0xf3, 0xff, 0x00, 0xf8, 0xff, 0x00,
-  0xfc, 0xff, 0x00, 0xff, 0xfd, 0x00, 0xff, 0xf9, 0x00, 0xff, 0xf4, 0x00,
-  0xff, 0xf0, 0x00, 0xff, 0xeb, 0x00, 0xff, 0xe7, 0x00, 0xff, 0xe2, 0x00,
-  0xff, 0xde, 0x00, 0xff, 0xd9, 0x00, 0xff, 0xd5, 0x00, 0xff, 0xd0, 0x00,
-  0xff, 0xcb, 0x00, 0xff, 0xc7, 0x00, 0xff, 0xc2, 0x00, 0xff, 0xbe, 0x00,
-  0xff, 0xb9, 0x00, 0xff, 0xb5, 0x00, 0xff, 0xb0, 0x00, 0xff, 0xac, 0x00,
-  0xff, 0xa7, 0x00, 0xff, 0xa3, 0x00, 0xff, 0x9e, 0x00, 0xff, 0x9a, 0x00,
-  0xff, 0x95, 0x00, 0xff, 0x91, 0x00, 0xff, 0x8c, 0x00, 0xff, 0x88, 0x00,
-  0xff, 0x83, 0x00, 0xff, 0x7f, 0x00, 0xff, 0x7a, 0x00, 0xff, 0x76, 0x00,
-  0xff, 0x71, 0x00, 0xff, 0x6d, 0x00, 0xff, 0x68, 0x00, 0xff, 0x63, 0x00,
-  0xff, 0x5f, 0x00, 0xff, 0x5a, 0x00, 0xff, 0x56, 0x00, 0xff, 0x51, 0x00,
-  0xff, 0x4d, 0x00, 0xff, 0x48, 0x00, 0xff, 0x44, 0x00, 0xff, 0x3f, 0x00,
-  0xff, 0x3b, 0x00, 0xff, 0x36, 0x00, 0xff, 0x32, 0x00, 0xff, 0x2d, 0x00,
-  0xff, 0x29, 0x00, 0xff, 0x24, 0x00, 0xff, 0x20, 0x00, 0xff, 0x1b, 0x00,
-  0xff, 0x17, 0x00, 0xff, 0x12, 0x00, 0xff, 0x0e, 0x00, 0xff, 0x09, 0x00,
-  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
-};
diff --git a/apps/POLite/heat-sync/Colours.h b/apps/POLite/heat-sync/Colours.h
deleted file mode 100644
index fc34e04c..00000000
--- a/apps/POLite/heat-sync/Colours.h
+++ /dev/null
@@ -1,10 +0,0 @@
-// SPDX-License-Identifier: BSD-2-Clause
-#ifndef _COLOURS_H_
-#define _COLOURS_H_
-
-#include <stdint.h>
-
-// 256 x RGB colours representing heat intensities
-extern uint8_t colours[];
-
-#endif
diff --git a/apps/POLite/heat-sync/Heat.h b/apps/POLite/heat-sync/Heat.h
index b3a63a93..8dc926b3 100644
--- a/apps/POLite/heat-sync/Heat.h
+++ b/apps/POLite/heat-sync/Heat.h
@@ -2,24 +2,26 @@
 #ifndef _HEAT_H_
 #define _HEAT_H_
 
+#define POLITE_DUMP_STATS
+#define POLITE_COUNT_MSGS
 #include <POLite.h>
 
 struct HeatMessage {
   // Sender id
   uint32_t from;
-  // Time step
-  uint32_t time;
   // Temperature at sender
-  uint32_t val;
+  float val;
 };
 
 struct HeatState {
   // Device id
   uint32_t id;
-  // Current time step of device
-  uint32_t time;
   // Current temperature of device
-  uint32_t val, acc;
+  float val, acc;
+  // Time step
+  uint16_t time;
+  // Number of neighbours
+  uint16_t numNeighbours;
   // Is the temperature of this device constant?
   bool isConstant;
 };
@@ -34,7 +36,6 @@ struct HeatDevice : PDevice<HeatState, None, HeatMessage> {
   // Send handler
   inline void send(volatile HeatMessage* msg) {
     msg->from = s->id;
-    msg->time = s->time;
     msg->val = s->val;
     *readyToSend = No;
   }
@@ -42,6 +43,7 @@ struct HeatDevice : PDevice<HeatState, None, HeatMessage> {
   // Receive handler
   inline void recv(HeatMessage* msg, None* edge) {
     s->acc += msg->val;
+    s->numNeighbours++;
   }
 
   // Called by POLite when system becomes idle
@@ -53,8 +55,9 @@ struct HeatDevice : PDevice<HeatState, None, HeatMessage> {
     }
     else {
       s->time--;
-      if (!s->isConstant) s->val = s->acc >> 2;
-      s->acc = 0;
+      if (!s->isConstant) s->val = s->acc / (float) s->numNeighbours;
+      s->acc = 0.0;
+      s->numNeighbours = 0;
       *readyToSend = Pin(0);
       return true;
     }
diff --git a/apps/POLite/heat-sync/Makefile b/apps/POLite/heat-sync/Makefile
index 0c343edd..f44d5b09 100644
--- a/apps/POLite/heat-sync/Makefile
+++ b/apps/POLite/heat-sync/Makefile
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: BSD-2-Clause
 APP_CPP = Heat.cpp
 APP_HDR = Heat.h
-RUN_CPP = Run.cpp Colours.cpp
-RUN_H = Colours.h
+RUN_CPP = Run.cpp
+RUN_H =
 
 include ../util/polite.mk
diff --git a/apps/POLite/heat-sync/Run.cpp b/apps/POLite/heat-sync/Run.cpp
index a938a446..ed978e39 100644
--- a/apps/POLite/heat-sync/Run.cpp
+++ b/apps/POLite/heat-sync/Run.cpp
@@ -1,17 +1,30 @@
 // SPDX-License-Identifier: BSD-2-Clause
 #include "Heat.h"
-#include "Colours.h"
 
 #include <HostLink.h>
 #include <POLite.h>
+#include <EdgeList.h>
 #include <sys/time.h>
 
-int main()
+int main(int argc, char **argv)
 {
-  // Parameters
-  const uint32_t width  = 256;
-  const uint32_t height = 256;
-  const uint32_t time   = 1000;
+  const uint32_t time = 1000;
+
+  // Read in the example edge list and create data structure
+  if (argc != 2) {
+    printf("Specify edge file\n");
+    exit(EXIT_FAILURE);
+  }
+
+  // Load in the edge list file
+  printf("Loading in the graph..."); fflush(stdout);
+  EdgeList net;
+  net.read(argv[1]);
+  printf(" done\n");
+
+  // Print min and max fan-out
+  printf("Min fan-out = %d\n", net.minFanOut());
+  printf("Max fan-out = %d\n", net.maxFanOut());
 
   // Connection to tinsel machine
   HostLink hostLink;
@@ -19,55 +32,31 @@ int main()
   // Create POETS graph
   PGraph<HeatDevice, HeatState, None, HeatMessage> graph;
 
-  // Create 2D mesh of devices
-  PDeviceId **mesh = new PDeviceId* [height];
-  for (uint32_t y = 0; y < height; y++) {
-    mesh[y] = new PDeviceId [width];
-    for (uint32_t x = 0; x < width; x++)
-      mesh[y][x] = graph.newDevice();
+  // Create nodes in POETS graph
+  for (uint32_t i = 0; i < net.numNodes; i++) {
+    PDeviceId id = graph.newDevice();
+    assert(i == id);
   }
 
-  // Add edges
-  for (uint32_t y = 0; y < height; y++)
-    for (uint32_t x = 0; x < width; x++) {
-      if (x < width-1) {
-        graph.addEdge(mesh[y][x],   0, mesh[y][x+1]);
-        graph.addEdge(mesh[y][x+1], 0, mesh[y][x]);
-      }
-      if (y < height-1) {
-        graph.addEdge(mesh[y][x],   0, mesh[y+1][x]);
-        graph.addEdge(mesh[y+1][x], 0, mesh[y][x]);
-      }
-    }
+  // Create connections in POETS graph
+  for (uint32_t i = 0; i < net.numNodes; i++) {
+    uint32_t numNeighbours = net.neighbours[i][0];
+    for (uint32_t j = 0; j < numNeighbours; j++)
+      graph.addEdge(i, 0, net.neighbours[i][j+1]);
+  }
 
   // Prepare mapping from graph to hardware
   graph.map();
 
-  // Set device ids
-  for (uint32_t y = 0; y < height; y++)
-    for (uint32_t x = 0; x < width; x++)
-      graph.devices[mesh[y][x]]->state.id = mesh[y][x];
-
   // Specify number of time steps to run on each device
-  for (PDeviceId i = 0; i < graph.numDevices; i++)
+  srand(1);
+  for (PDeviceId i = 0; i < graph.numDevices; i++) {
+    int r = rand() % 255;
+    graph.devices[i]->state.id = i;
     graph.devices[i]->state.time = time;
- 
-  // Apply constant heat at north edge
-  // Apply constant cool at south edge
-  for (uint32_t x = 0; x < width; x++) {
-    graph.devices[mesh[0][x]]->state.val = 255 << 16;
-    graph.devices[mesh[0][x]]->state.isConstant = true;
-    graph.devices[mesh[height-1][x]]->state.val = 40 << 16;
-    graph.devices[mesh[height-1][x]]->state.isConstant = true;
-  }
-
-  // Apply constant heat at west edge
-  // Apply constant cool at east edge
-  for (uint32_t y = 0; y < height; y++) {
-    graph.devices[mesh[y][0]]->state.val = 255 << 16;
-    graph.devices[mesh[y][0]]->state.isConstant = true;
-    graph.devices[mesh[y][width-1]]->state.val = 40 << 16;
-    graph.devices[mesh[y][width-1]]->state.isConstant = true;
+    graph.devices[i]->state.val = (float) r;
+    graph.devices[i]->state.isConstant = false;
+    //graph.devices[i]->state.fanOut = graph.fanOut(i);
   }
 
   // Write graph down to tinsel machine via HostLink
@@ -82,8 +71,11 @@ int main()
   struct timeval start, finish, diff;
   gettimeofday(&start, NULL);
 
+  // Consume performance stats
+  politeSaveStats(&hostLink, "stats.txt");
+
   // Allocate array to contain final value of each device
-  uint32_t* pixels = new uint32_t [graph.numDevices];
+  float* pixels = new float [graph.numDevices];
 
   // Receive final value of each device
   for (uint32_t i = 0; i < graph.numDevices; i++) {
@@ -95,25 +87,19 @@ int main()
     pixels[msg.payload.from] = msg.payload.val;
   }
 
+  // Display final values of first ten devices
+  for (uint32_t i = 0; i < 10; i++) {
+    if (i < graph.numDevices) {
+      printf("%d: %f\n", i, pixels[i]);
+    }
+  }
+
   // Display time
   timersub(&finish, &start, &diff);
   double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+  #ifndef POLITE_DUMP_STATS
   printf("Time = %lf\n", duration);
-
-  // Emit image
-  FILE* fp = fopen("out.ppm", "wt");
-  if (fp == NULL) {
-    printf("Can't open output file for writing\n");
-    return -1;
-  }
-  fprintf(fp, "P3\n%d %d\n255\n", width, height);
-  for (uint32_t y = 0; y < height; y++)
-    for (uint32_t x = 0; x < width; x++) {
-      uint32_t val = (pixels[mesh[y][x]] >> 16) & 0xff;
-      fprintf(fp, "%d %d %d\n",
-        colours[val*3], colours[val*3+1], colours[val*3+2]);
-    }
-  fclose(fp);
+  #endif
 
   return 0;
 }
diff --git a/apps/POLite/izhikevich-gals/Izhikevich.cpp b/apps/POLite/izhikevich-gals/Izhikevich.cpp
new file mode 100644
index 00000000..8533062a
--- /dev/null
+++ b/apps/POLite/izhikevich-gals/Izhikevich.cpp
@@ -0,0 +1,23 @@
+// SPDX-License-Identifier: BSD-2-Clause
+#include "Izhikevich.h"
+
+#include <tinsel.h>
+#include <POLite.h>
+
+typedef PThread<
+          IzhikevichDevice,
+          IzhikevichState,    // State
+          Weight,             // Edge label
+          IzhikevichMsg       // Message
+        > IzhikevichThread;
+
+int main()
+{
+  // Point thread structure at base of thread's heap
+  IzhikevichThread* thread = (IzhikevichThread*) tinselHeapBaseSRAM();
+  
+  // Invoke interpreter
+  thread->run();
+
+  return 0;
+}
diff --git a/apps/POLite/izhikevich-gals/Izhikevich.h b/apps/POLite/izhikevich-gals/Izhikevich.h
new file mode 100644
index 00000000..701af341
--- /dev/null
+++ b/apps/POLite/izhikevich-gals/Izhikevich.h
@@ -0,0 +1,115 @@
+// SPDX-License-Identifier: BSD-2-Clause
+// (Based on code by David Thomas)
+#ifndef _Izhikevich_H_
+#define _Izhikevich_H_
+
+#define POLITE_DUMP_STATS
+#define POLITE_COUNT_MSGS
+#include <POLite.h>
+#include "RNG.h"
+
+// Number of time steps to run for
+#define NUM_STEPS 100
+
+// Vertex state
+struct IzhikevichState {
+  // Random-number-generator state
+  uint32_t rng;
+  // Neuron state
+  float u, v, I, acc, accNext;
+  uint32_t spikeCount;
+  // Protocol
+  bool sent;
+  uint16_t received, receivedNext, fanIn, time;
+  // Neuron properties
+  float a, b, c, d, Ir;
+};
+
+// Edge weight type
+typedef float Weight;
+
+// Message type
+struct IzhikevichMsg {
+  // Did the sender spike or not?
+  bool spike;
+  // Time step of sender
+  uint16_t time;
+  // Number of times sender has spiked
+  uint32_t spikeCount;
+};
+
+// Vertex behaviour
+struct IzhikevichDevice : PDevice<IzhikevichState,Weight,IzhikevichMsg> {
+  inline void init() {
+    s->v = -65.0f;
+    s->u = s->b * s->v;
+    s->I = s->Ir * grng(s->rng);
+    *readyToSend = Pin(0);
+  }
+
+  // We call this on every state change
+  inline void change() {
+    // Execution complete?
+    if (s->time == NUM_STEPS) return;
+
+    // Proceed to next time step?
+    if (s->sent && s->received == s->fanIn) {
+      s->time++;
+      s->I += s->acc;
+      s->acc = s->accNext;
+      s->accNext = 0;
+      s->received = s->receivedNext;
+      s->receivedNext = 0;
+      s->sent = false;
+      *readyToSend = s->time == (NUM_STEPS+1) ? No : Pin(0);
+    }
+  }
+
+  // Send handler
+  inline void send(volatile IzhikevichMsg* msg) {
+    bool spike = false;
+    float &v = s->v;
+    float &u = s->u;
+    float &I = s->I;
+    v = v+0.5*(0.04*v*v+5*v+140-u+I); // Step 0.5 ms
+    v = v+0.5*(0.04*v*v+5*v+140-u+I); // for numerical
+    u = u + s->a*(s->b*v-u);          // stability
+    if (v >= 30.0) {
+      v = s->c;
+      u += s->d;
+      s->spikeCount++;
+      spike = true;
+    }
+    s->I = s->Ir * grng(s->rng);
+    msg->time = s->time;
+    msg->spike = spike;
+    msg->spikeCount = s->spikeCount;
+    s->sent = true;
+    *readyToSend = No;
+    change();
+  }
+
+  // Receive handler
+  inline void recv(IzhikevichMsg* msg, Weight* weight) {
+    if (msg->time == s->time) {
+      if (msg->spike) s->acc += *weight;
+      s->received++;
+      change();
+    }
+    else {
+      if (msg->spike) s->accNext += *weight;
+      s->receivedNext++;
+    }
+  }
+
+  inline bool step() {
+    return false;
+  }
+
+  inline bool finish(IzhikevichMsg* msg) {
+    msg->spikeCount = s->spikeCount;
+    return true;
+  }
+};
+
+#endif
diff --git a/apps/POLite/ping-test/Makefile b/apps/POLite/izhikevich-gals/Makefile
similarity index 63%
rename from apps/POLite/ping-test/Makefile
rename to apps/POLite/izhikevich-gals/Makefile
index 7e85d2c6..5ba3d9e3 100644
--- a/apps/POLite/ping-test/Makefile
+++ b/apps/POLite/izhikevich-gals/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: BSD-2-Clause
-APP_CPP = ping.cpp
-APP_HDR = ping.h
+APP_CPP = Izhikevich.cpp 
+APP_HDR = Izhikevich.h
 RUN_CPP = Run.cpp
 
 include ../util/polite.mk
diff --git a/apps/POLite/izhikevich-gals/RNG.h b/apps/POLite/izhikevich-gals/RNG.h
new file mode 100644
index 00000000..61b719b3
--- /dev/null
+++ b/apps/POLite/izhikevich-gals/RNG.h
@@ -0,0 +1,23 @@
+#ifndef _RNG_H_
+#define _RNG_H_
+
+inline uint32_t urng(uint32_t &state) {
+  state = state*1664525+1013904223;
+  return state;
+}
+
+// World's crappiest gaussian (courtesy of dt10!)
+inline float grng(uint32_t &state) {
+  uint32_t u=urng(state);
+  int32_t acc=0;
+  for(unsigned i=0;i<8;i++){
+    acc += u&0xf;
+    u=u>>4;
+  }
+  // a four-bit uniform has mean 7.5 and variance ((15-0+1)^2-1)/12 = 85/4
+  // sum of eight such uniforms has mean 8*7.5=60 and variance 8*85/4=170
+  const float scale=0.07669649888473704; // == 1/sqrt(170)
+  return (acc-60.0f) * scale;
+}
+
+#endif
diff --git a/apps/POLite/izhikevich-gals/Run.cpp b/apps/POLite/izhikevich-gals/Run.cpp
new file mode 100644
index 00000000..e542881f
--- /dev/null
+++ b/apps/POLite/izhikevich-gals/Run.cpp
@@ -0,0 +1,132 @@
+// SPDX-License-Identifier: BSD-2-Clause
+#include "Izhikevich.h"
+
+#include <HostLink.h>
+#include <POLite.h>
+
+#include <EdgeList.h>
+#include <assert.h>
+#include <sys/time.h>
+#include <config.h>
+
+inline double urand() { return (double) rand() / RAND_MAX; }
+
+int main(int argc, char**argv)
+{
+  if (argc != 2) {
+    printf("Specify edges file\n");
+    exit(EXIT_FAILURE);
+  }
+
+  // Read network
+  EdgeList net;
+  net.read(argv[1]);
+
+  // Connection to tinsel machine
+  HostLink hostLink;
+
+  // Create POETS graph
+  PGraph<IzhikevichDevice, IzhikevichState, Weight, IzhikevichMsg> graph;
+
+  // Create nodes in POETS graph
+  for (uint32_t i = 0; i < net.numNodes; i++) {
+    PDeviceId id = graph.newDevice();
+    assert(i == id);
+  }
+
+  // Fraction of neurons that are excitatory (the rest are inhibitory)
+  double excitatory = 0.8;
+
+  // Mark each neuron as excitatory (or inhibitory)
+  srand(1);
+  bool* excite = new bool [net.numNodes];
+  for (int i = 0; i < net.numNodes; i++)
+    excite[i] = urand() < excitatory;
+
+  // Create connections in POETS graph
+  for (uint32_t i = 0; i < net.numNodes; i++) {
+    uint32_t numNeighbours = net.neighbours[i][0];
+    for (uint32_t j = 0; j < numNeighbours; j++) {
+      float weight = excite[i] ? 0.5 * urand() : -urand();
+      graph.addLabelledEdge(weight, i, 0, net.neighbours[i][j+1]);
+    }
+  }
+
+  // Add a zero-weight back-edge for each edge that lacks a reverse edge
+  // (Needed for GALS synchronisation)
+  for (uint32_t i = 0; i < net.numNodes; i++) {
+    for (uint32_t j = 0; j < net.neighbours[i][0]; j++) {
+      uint32_t n = net.neighbours[i][j+1];
+      // TODO: can be more efficient here
+      bool needBackEdge = true;
+      for (uint32_t k = 0; k < net.neighbours[n][0]; k++)
+        if (net.neighbours[n][k+1] == i) needBackEdge = false;
+      if (needBackEdge) graph.addLabelledEdge(0.0, n, 0, i);
+    }
+  }
+
+  // Prepare mapping from graph to hardware
+  graph.map();
+
+  srand(2);
+  // Initialise devices
+  for (PDeviceId i = 0; i < graph.numDevices; i++) {
+    IzhikevichState* n = &graph.devices[i]->state;
+    n->rng = (int32_t) (urand()*((double) (1<<31)));
+    n->fanIn = graph.fanIn(i);
+    if (excite[i]) {
+      float re = (float) urand();
+      n->a = 0.02;
+      n->b = 0.2;
+      n->c = -65+15*re*re;
+      n->d = 8-6*re*re;
+      n->Ir = 5;
+    }
+    else {
+      float ri = (float) urand();
+      n->a = 0.02+0.08*ri;
+      n->b = 0.25-0.05*ri;
+      n->c = -65;
+      n->d = 2;
+      n->Ir = 2;
+    }
+  }
+
+  // Write graph down to tinsel machine via HostLink
+  graph.write(&hostLink);
+
+  // Load code and trigger execution
+  hostLink.boot("code.v", "data.v");
+  hostLink.go();
+
+  // Timer
+  printf("Started\n");
+  struct timeval start, finish, diff;
+  gettimeofday(&start, NULL);
+
+  // Consume performance stats
+  politeSaveStats(&hostLink, "stats.txt");
+
+  int64_t sum = 0;
+  // Receive final spike count from each device
+  for (uint32_t i = 0; i < graph.numDevices; i++) {
+    // Receive message
+    PMessage<IzhikevichMsg> msg;
+    hostLink.recvMsg(&msg, sizeof(msg));
+    if (i == 0) gettimeofday(&finish, NULL);
+    // Accumulate
+    sum += msg.payload.spikeCount;
+  }
+
+  // Emit result
+  printf("Total spikes = %ld\n", sum);
+
+  // Display time
+  timersub(&finish, &start, &diff);
+  double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+  #ifndef POLITE_DUMP_STATS
+  printf("Time = %lf\n", duration);
+  #endif
+
+  return 0;
+}
diff --git a/apps/POLite/izhikevich-pc/Izhikevich.cpp b/apps/POLite/izhikevich-pc/Izhikevich.cpp
new file mode 100644
index 00000000..b4f03ed5
--- /dev/null
+++ b/apps/POLite/izhikevich-pc/Izhikevich.cpp
@@ -0,0 +1,139 @@
+// SPDX-License-Identifier: BSD-2-Clause
+// (Based on code by David Thomas)
+
+#include <EdgeList.h>
+#include <assert.h>
+#include <sys/time.h>
+#include "RNG.h"
+
+#define NUM_STEPS 100
+
+// Neuron 
+struct Neuron {
+  // Random-number-generator state
+  uint32_t rng;
+  // Neuron state
+  float u, v, I, spikeCount;
+  // Neuron properties
+  float a, b, c, d, Ir;
+};
+
+int main(int argc, char**argv)
+{
+  if (argc != 2) {
+    printf("Specify edges file\n");
+    exit(EXIT_FAILURE);
+  }
+
+  // Read network
+  EdgeList net;
+  net.read(argv[1]);
+
+  // Fraction of neurons that are excitatory (the rest are inhibitory)
+  double excitatory = 0.8;
+
+  // Mark each neuron as excitatory (or inhibitory)
+  srand(1);
+  bool* excite = new bool [net.numNodes];
+  for (int i = 0; i < net.numNodes; i++) {
+    excite[i] = urand() < excitatory;
+  }
+
+  // Edge weights
+  float** weight = new float* [net.numNodes];
+  for (int i = 0; i < net.numNodes; i++) {
+    uint32_t numEdges = net.neighbours[i][0];
+    weight[i] = new float [numEdges];
+    for (int j = 0; j < numEdges; j++) {
+      weight[i][j] = excite[i] ? 0.5 * urand() : -urand();
+    }
+  }
+
+  // State for each neuron
+  srand(2);
+  Neuron* neuron = new Neuron [net.numNodes];
+  for (int i = 0; i < net.numNodes; i++) {
+    Neuron* n = &neuron[i];
+    n->rng = (int32_t) (urand()*((double) (1<<31)));
+    if (excite[i]) {
+      float re = (float) urand();
+      n->a = 0.02;
+      n->b = 0.2;
+      n->c = -65+15*re*re;
+      n->d = 8-6*re*re;
+      n->Ir = 5;
+    }
+    else {
+      float ri = (float) urand();
+      n->a = 0.02+0.08*ri;
+      n->b = 0.25-0.05*ri;
+      n->c = -65;
+      n->d = 2;
+      n->Ir = 2;
+    }
+  }
+
+  // Spike array
+  bool* spike = new bool [net.numNodes];
+
+  // Initialisation
+  for (int i = 0; i < net.numNodes; i++) {
+    Neuron* n = &neuron[i];
+    n->v = -65.0;
+    n->u = n->b * n->v;
+    n->I = n->Ir * grng(n->rng);
+  }
+
+  // Timer
+  printf("Started\n");
+  struct timeval start, finish, diff;
+  gettimeofday(&start, NULL);
+
+  // Simulation
+  int64_t totalSpikes = 0;
+  for (int t = 0; t <= NUM_STEPS; t++) {
+    // Update state
+    for (int i = 0; i < net.numNodes; i++) {
+      spike[i] = false;
+      Neuron* n = &neuron[i];
+      float &v = n->v;
+      float &u = n->u;
+      float &I = n->I;
+      v = v+0.5*(0.04*v*v+5*v+140-u+I); // Step 0.5 ms
+      v = v+0.5*(0.04*v*v+5*v+140-u+I); // for numerical
+      u = u + n->a*(n->b*v-u);          // stability
+      if (v >= 30.0) {
+        n->v = n->c;
+        n->u += n->d;
+        spike[i] = true;
+      }
+      n->I = n->Ir * grng(n->rng);
+    }
+    // Update I-values
+    uint32_t spikes = 0;
+    for (int i = 0; i < net.numNodes; i++) {
+      Neuron* n = &neuron[i];
+      if (spike[i]) {
+        spikes++;
+        n->spikeCount++;
+        uint32_t numEdges = net.neighbours[i][0];
+        uint32_t* dst = &net.neighbours[i][1];
+        for (int j = 0; j < numEdges; j++) {
+          neuron[dst[j]].I += weight[i][j];
+        }
+      }
+    }
+    //printf("%d: %d\n", t, spikes);
+    totalSpikes += spikes;
+  }
+  gettimeofday(&finish, NULL);
+
+  printf("Total spikes: %ld\n", totalSpikes);
+
+  // Display time
+  timersub(&finish, &start, &diff);
+  double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+  printf("Time = %lf\n", duration);
+
+  return 0;
+}
diff --git a/apps/POLite/izhikevich-pc/Makefile b/apps/POLite/izhikevich-pc/Makefile
new file mode 100644
index 00000000..52c92c74
--- /dev/null
+++ b/apps/POLite/izhikevich-pc/Makefile
@@ -0,0 +1,6 @@
+Izhikevich: Izhikevich.cpp RNG.h
+	g++ -I../../../include -O2 Izhikevich.cpp -o Izhikevich
+
+.PHONY: clean
+clean:
+	rm -f Izhikevich
diff --git a/apps/POLite/izhikevich-pc/RNG.h b/apps/POLite/izhikevich-pc/RNG.h
new file mode 100644
index 00000000..decc32f1
--- /dev/null
+++ b/apps/POLite/izhikevich-pc/RNG.h
@@ -0,0 +1,27 @@
+#ifndef _RNG_H_
+#define _RNG_H_
+
+inline uint32_t urng(uint32_t &state) {
+  state = state*1664525+1013904223;
+  return state;
+}
+
+// World's crappiest gaussian (courtesy of dt10!)
+inline float grng(uint32_t &state) {
+  uint32_t u=urng(state);
+  int32_t acc=0;
+  for(unsigned i=0;i<8;i++){
+    acc += u&0xf;
+    u=u>>4;
+  }
+  // a four-bit uniform has mean 7.5 and variance ((15-0+1)^2-1)/12 = 85/4
+  // sum of eight such uniforms has mean 8*7.5=60 and variance 8*85/4=170
+  const float scale=0.07669649888473704; // == 1/sqrt(170)
+  return (acc-60.0f) * scale;
+}
+
+inline double urand() {
+  return (double) rand() / RAND_MAX;
+}
+
+#endif
diff --git a/apps/POLite/izhikevich-sync/Izhikevich.cpp b/apps/POLite/izhikevich-sync/Izhikevich.cpp
new file mode 100644
index 00000000..8533062a
--- /dev/null
+++ b/apps/POLite/izhikevich-sync/Izhikevich.cpp
@@ -0,0 +1,23 @@
+// SPDX-License-Identifier: BSD-2-Clause
+#include "Izhikevich.h"
+
+#include <tinsel.h>
+#include <POLite.h>
+
+typedef PThread<
+          IzhikevichDevice,
+          IzhikevichState,    // State
+          Weight,             // Edge label
+          IzhikevichMsg       // Message
+        > IzhikevichThread;
+
+int main()
+{
+  // Point thread structure at base of thread's heap
+  IzhikevichThread* thread = (IzhikevichThread*) tinselHeapBaseSRAM();
+  
+  // Invoke interpreter
+  thread->run();
+
+  return 0;
+}
diff --git a/apps/POLite/izhikevich-sync/Izhikevich.h b/apps/POLite/izhikevich-sync/Izhikevich.h
new file mode 100644
index 00000000..150a4afa
--- /dev/null
+++ b/apps/POLite/izhikevich-sync/Izhikevich.h
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: BSD-2-Clause
+// (Based on code by David Thomas)
+#ifndef _Izhikevich_H_
+#define _Izhikevich_H_
+
+#define POLITE_DUMP_STATS
+#define POLITE_COUNT_MSGS
+
+#include <POLite.h>
+#include "RNG.h"
+
+// Number of time steps to run for
+#define NUM_STEPS 100
+
+// Vertex state
+struct IzhikevichState {
+  // Random-number-generator state
+  uint32_t rng;
+  // Neuron state
+  float u, v, I;
+  uint32_t spikeCount;
+  // Neuron properties
+  float a, b, c, d, Ir;
+};
+
+// Edge weight type
+typedef float Weight;
+
+// Message type
+struct IzhikevichMsg {
+  // Number of times sender has spiked
+  uint32_t spikeCount;
+};
+
+// Vertex behaviour
+struct IzhikevichDevice : PDevice<IzhikevichState,Weight,IzhikevichMsg> {
+  inline void init() {
+    s->v = -65.0f;
+    s->u = s->b * s->v;
+    s->I = s->Ir * grng(s->rng);
+    *readyToSend = No;
+  }
+  inline void send(IzhikevichMsg* msg) {
+    s->spikeCount++;
+    msg->spikeCount = s->spikeCount;
+    *readyToSend = No;
+  }
+  inline void recv(IzhikevichMsg* msg, Weight* weight) {
+    s->I += *weight;
+  }
+  inline bool step() {
+    float &v = s->v;
+    float &u = s->u;
+    float &I = s->I;
+    v = v+0.5*(0.04*v*v+5*v+140-u+I); // Step 0.5 ms
+    v = v+0.5*(0.04*v*v+5*v+140-u+I); // for numerical
+    u = u + s->a*(s->b*v-u);          // stability
+    if (v >= 30.0) {
+      v = s->c;
+      u += s->d;
+      *readyToSend = Pin(0);
+    }
+    s->I = s->Ir * grng(s->rng);
+    return (time < NUM_STEPS);
+  }
+  inline bool finish(IzhikevichMsg* msg) {
+    msg->spikeCount = s->spikeCount;
+    return true;
+  }
+};
+
+#endif
diff --git a/apps/POLite/izhikevich-sync/Makefile b/apps/POLite/izhikevich-sync/Makefile
new file mode 100644
index 00000000..5ba3d9e3
--- /dev/null
+++ b/apps/POLite/izhikevich-sync/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: BSD-2-Clause
+APP_CPP = Izhikevich.cpp 
+APP_HDR = Izhikevich.h
+RUN_CPP = Run.cpp
+
+include ../util/polite.mk
diff --git a/apps/POLite/izhikevich-sync/RNG.h b/apps/POLite/izhikevich-sync/RNG.h
new file mode 100644
index 00000000..61b719b3
--- /dev/null
+++ b/apps/POLite/izhikevich-sync/RNG.h
@@ -0,0 +1,23 @@
+#ifndef _RNG_H_
+#define _RNG_H_
+
+inline uint32_t urng(uint32_t &state) {
+  state = state*1664525+1013904223;
+  return state;
+}
+
+// World's crappiest gaussian (courtesy of dt10!)
+inline float grng(uint32_t &state) {
+  uint32_t u=urng(state);
+  int32_t acc=0;
+  for(unsigned i=0;i<8;i++){
+    acc += u&0xf;
+    u=u>>4;
+  }
+  // a four-bit uniform has mean 7.5 and variance ((15-0+1)^2-1)/12 = 85/4
+  // sum of eight such uniforms has mean 8*7.5=60 and variance 8*85/4=170
+  const float scale=0.07669649888473704; // == 1/sqrt(170)
+  return (acc-60.0f) * scale;
+}
+
+#endif
diff --git a/apps/POLite/izhikevich-sync/Run.cpp b/apps/POLite/izhikevich-sync/Run.cpp
new file mode 100644
index 00000000..0693b8c3
--- /dev/null
+++ b/apps/POLite/izhikevich-sync/Run.cpp
@@ -0,0 +1,120 @@
+// SPDX-License-Identifier: BSD-2-Clause
+#include "Izhikevich.h"
+
+#include <HostLink.h>
+#include <POLite.h>
+
+#include <EdgeList.h>
+#include <assert.h>
+#include <sys/time.h>
+#include <config.h>
+
+inline double urand() { return (double) rand() / RAND_MAX; }
+
+int main(int argc, char**argv)
+{
+  if (argc != 2) {
+    printf("Specify edges file\n");
+    exit(EXIT_FAILURE);
+  }
+
+  // Read network
+  EdgeList net;
+  net.read(argv[1]);
+  printf("Max fan-out = %d\n", net.maxFanOut());
+  printf("Min fan-out = %d\n", net.minFanOut());
+
+  // Connection to tinsel machine
+  HostLink hostLink;
+
+  // Create POETS graph
+  PGraph<IzhikevichDevice, IzhikevichState, Weight, IzhikevichMsg> graph;
+
+  // Create nodes in POETS graph
+  for (uint32_t i = 0; i < net.numNodes; i++) {
+    PDeviceId id = graph.newDevice();
+    assert(i == id);
+  }
+
+  // Fraction of neurons that are excitatory (the rest are inhibitory)
+  double excitatory = 0.8;
+
+  // Mark each neuron as excitatory (or inhibitory)
+  srand(1);
+  bool* excite = new bool [net.numNodes];
+  for (int i = 0; i < net.numNodes; i++)
+    excite[i] = urand() < excitatory;
+
+  // Create connections in POETS graph
+  for (uint32_t i = 0; i < net.numNodes; i++) {
+    uint32_t numNeighbours = net.neighbours[i][0];
+    for (uint32_t j = 0; j < numNeighbours; j++) {
+      float weight = excite[i] ? 0.5 * urand() : -urand();
+      graph.addLabelledEdge(weight, i, 0, net.neighbours[i][j+1]);
+    }
+  }
+
+  // Prepare mapping from graph to hardware
+  graph.map();
+
+  srand(2);
+  // Initialise devices
+  for (PDeviceId i = 0; i < graph.numDevices; i++) {
+    IzhikevichState* n = &graph.devices[i]->state;
+    n->rng = (int32_t) (urand()*((double) (1<<31)));
+    if (excite[i]) {
+      float re = (float) urand();
+      n->a = 0.02;
+      n->b = 0.2;
+      n->c = -65+15*re*re;
+      n->d = 8-6*re*re;
+      n->Ir = 5;
+    }
+    else {
+      float ri = (float) urand();
+      n->a = 0.02+0.08*ri;
+      n->b = 0.25-0.05*ri;
+      n->c = -65;
+      n->d = 2;
+      n->Ir = 2;
+    }
+  }
+
+  // Write graph down to tinsel machine via HostLink
+  graph.write(&hostLink);
+
+  // Load code and trigger execution
+  hostLink.boot("code.v", "data.v");
+  hostLink.go();
+
+  // Timer
+  printf("Started\n");
+  struct timeval start, finish, diff;
+  gettimeofday(&start, NULL);
+
+  // Consume performance stats
+  politeSaveStats(&hostLink, "stats.txt");
+
+  int64_t sum = 0;
+  // Receive final spike count from each device
+  for (uint32_t i = 0; i < graph.numDevices; i++) {
+    // Receive message
+    PMessage<IzhikevichMsg> msg;
+    hostLink.recvMsg(&msg, sizeof(msg));
+    if (i == 0) gettimeofday(&finish, NULL);
+    // Accumulate
+    sum += msg.payload.spikeCount;
+  }
+
+  // Emit result
+  printf("Total spikes = %ld\n", sum);
+
+  // Display time
+  timersub(&finish, &start, &diff);
+  double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+  #ifndef POLITE_DUMP_STATS
+  printf("Time = %lf\n", duration);
+  #endif
+
+  return 0;
+}
diff --git a/apps/POLite/pagerank-sync/Run.cpp b/apps/POLite/pagerank-sync/Run.cpp
index 435a0750..1b0eb356 100644
--- a/apps/POLite/pagerank-sync/Run.cpp
+++ b/apps/POLite/pagerank-sync/Run.cpp
@@ -28,7 +28,8 @@ int main(int argc, char **argv)
   net.read(argv[1]);
   printf(" done\n");
 
-  // Print max fan-out
+  // Print fan-out
+  printf("Min fan-out = %d\n", net.minFanOut());
   printf("Max fan-out = %d\n", net.maxFanOut());
   
   // Create nodes in POETS graph
diff --git a/apps/POLite/ping-test/Run.cpp b/apps/POLite/ping-test/Run.cpp
deleted file mode 100644
index 57ac5441..00000000
--- a/apps/POLite/ping-test/Run.cpp
+++ /dev/null
@@ -1,57 +0,0 @@
-// SPDX-License-Identifier: BSD-2-Clause
-#include "ping.h"
-
-#include <HostLink.h>
-#include <POLite.h>
-#include <EdgeList.h>
-#include <assert.h>
-#include <sys/time.h>
-#include <config.h>
-
-int main(int argc, char**argv)
-{
-  // Connection to tinsel machine
-  HostLink hostLink;
-
-  // Create POETS graph
-  PGraph<PingDevice, PingState, None, PingMessage> graph;
-
-  // Create single ping device
-  PDeviceId id = graph.newDevice();
-
-  // Prepare mapping from graph to hardware
-  graph.map();
-
-  // Write graph down to tinsel machine via HostLink
-  graph.write(&hostLink);
-
-  // Load code and trigger execution
-  hostLink.boot("code.v", "data.v");
-  hostLink.go();
-
-  printf("Ping started\n");
-
-  // Consume performance stats
-  //politeSaveStats(&hostLink, "stats.txt");
-
-  int test = 0;
-  int deviceAddr = graph.toDeviceAddr[id];
-  printf("deviceAddr = %d\n", deviceAddr);
-  while (test < 100) {
-    // Send ping
-    PMessage<PingMessage> sendMsg;
-    sendMsg.devId = getLocalDeviceId(deviceAddr);
-    sendMsg.payload.test = test;
-    hostLink.send(getThreadId(deviceAddr), 1, &sendMsg);
-    printf("Sent %d to device\n", sendMsg.payload.test);
-
-    // Receive pong
-    PMessage<PingMessage> recvMsg;
-    hostLink.recvMsg(&recvMsg, sizeof(recvMsg));
-    printf("Received %d from device\n", recvMsg.payload.test);
-
-    test++;
-  }
-
-  return 0;
-}
diff --git a/apps/POLite/ping-test/ping.h b/apps/POLite/ping-test/ping.h
deleted file mode 100644
index 3d4c17de..00000000
--- a/apps/POLite/ping-test/ping.h
+++ /dev/null
@@ -1,54 +0,0 @@
-// SPDX-License-Identifier: BSD-2-Clause
-// Test messaging between host and threads.
-
-#ifndef _ping_H_
-#define _ping_H_
-
-//#define POLITE_DUMP_STATS
-//#define POLITE_COUNT_MSGS
-
-// Lightweight POETS frontend
-#include <POLite.h>
-
-struct PingMessage {
-  uint32_t test;
-};
-
-struct PingState {
-  // Number received to be sent back to host
-  uint32_t test;
-};
-
-struct PingDevice : PDevice<PingState, None, PingMessage> {
-  // Called once by POLite at start of execution
-  void init() {
-    // Do nothing until a message is received from the host
-    *readyToSend = No;
-  }
-
-  // Receive handler
-  inline void recv(PingMessage* msg, None* edge) {
-    // Store number from host to send back to host
-    s->test = msg->test;
-    *readyToSend = HostPin;
-  }
-
-  // Send handler
-  inline void send(volatile PingMessage* msg) {
-    // Put received value back in message for host to check
-    msg->test = s->test;
-    *readyToSend = No;
-  }
-
-  // Called by POLite when system becomes idle
-  inline bool step() {
-    return true; // Never terminate
-  }
-
-  // Optionally send message to host on termination
-  inline bool finish(volatile PingMessage* msg) {
-    return false;
-  }
-};
-
-#endif
diff --git a/apps/POLite/progrouters/Makefile b/apps/POLite/progrouters/Makefile
new file mode 100644
index 00000000..9c0837be
--- /dev/null
+++ b/apps/POLite/progrouters/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: BSD-2-Clause
+APP_CPP = ProgRoutersTest.cpp
+APP_HDR = 
+RUN_CPP = Run.cpp
+RUN_H = 
+
+include ../util/polite.mk
diff --git a/apps/POLite/progrouters/ProgRoutersTest.cpp b/apps/POLite/progrouters/ProgRoutersTest.cpp
new file mode 100644
index 00000000..109565df
--- /dev/null
+++ b/apps/POLite/progrouters/ProgRoutersTest.cpp
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: BSD-2-Clause
+#include <tinsel.h>
+
+int main()
+{
+  // Get thread id
+  int me = tinselId();
+
+  // Sample outgoing message
+  volatile uint32_t* msgOut = (uint32_t*) tinselSendSlot();
+  msgOut[0] = 0x10;
+  msgOut[1] = 0x20;
+  msgOut[2] = 0x30;
+  msgOut[3] = 0x40;
+  msgOut[4] = 0x50;
+  msgOut[5] = 0x60;
+  msgOut[6] = 0x70;
+  msgOut[7] = 0x80;
+
+  // On thread 0, send to key supplied by host
+  if (me == 0) {
+    tinselSetLen(1);
+    tinselWaitUntil(TINSEL_CAN_RECV);
+    volatile uint32_t* msgIn = (uint32_t*) tinselRecv();
+    uint32_t key = msgIn[0];
+    tinselFree(msgIn);
+    
+    tinselWaitUntil(TINSEL_CAN_SEND);
+    tinselKeySend(key, msgOut);
+  }
+
+  // Print anything received
+  while (1) {
+    tinselWaitUntil(TINSEL_CAN_RECV);
+    volatile uint32_t* msgIn = (uint32_t*) tinselRecv();
+    printf("%x %x %x %x %x %x %x %x\n",
+        msgIn[0], msgIn[1], msgIn[2], msgIn[3]
+      , msgIn[4], msgIn[5], msgIn[6], msgIn[7]);
+    tinselFree(msgIn);
+  }
+
+  return 0;
+}
diff --git a/apps/POLite/progrouters/Run.cpp b/apps/POLite/progrouters/Run.cpp
new file mode 100644
index 00000000..c2b27bd2
--- /dev/null
+++ b/apps/POLite/progrouters/Run.cpp
@@ -0,0 +1,47 @@
+// SPDX-License-Identifier: BSD-2-Clause
+#include <HostLink.h>
+#include <POLite.h>
+
+int main(int argc, char **argv)
+{
+  // Connection to tinsel machine
+  HostLink hostLink;
+
+  // Create routing tables
+  ProgRouterMesh mesh(2, 1);
+
+  // Board (1, 0)
+  for (int i = 0; i < 2; i++) {
+    uint64_t mask = 1ul << i;
+    mesh.table[0][1].addMRM(1, 0, mask >> 32, mask, 0xf0f0);
+  }
+  uint32_t key01 = mesh.table[0][1].genKey();
+
+  // Board (0, 0)
+  for (int i = 0; i < 2; i++) {
+    uint64_t mask = 1ul << i;
+    mesh.table[0][0].addMRM(1, 0, mask >> 32, mask, 0xf0f0);
+  }
+  for (int i = 0; i < 2; i++) {
+    uint64_t mask = 1ul << i;
+    mesh.table[0][0].addMRM(1, 1, mask >> 32, mask, 0xf0f0);
+  }
+  mesh.table[0][0].addRR(2, key01); // East
+  uint32_t key00 = mesh.table[0][0].genKey();
+
+  // Transfer routing tables to FPGAs
+  mesh.write(&hostLink);
+
+  // Load code and trigger execution
+  hostLink.boot("code.v", "data.v");
+  hostLink.go();
+
+  // Send key
+  printf("Sending key %x\n", key00);
+  uint32_t msg[1 << TinselLogWordsPerMsg];
+  msg[0] = key00;
+  hostLink.send(0, 1, msg);
+
+  hostLink.dumpStdOut();
+  return 0;
+}
diff --git a/apps/POLite/sssp-async/Run.cpp b/apps/POLite/sssp-async/Run.cpp
index c7953795..37ffcb4e 100644
--- a/apps/POLite/sssp-async/Run.cpp
+++ b/apps/POLite/sssp-async/Run.cpp
@@ -20,8 +20,9 @@ int main(int argc, char**argv)
   EdgeList net;
   net.read(argv[1]);
 
-  // Print max fan-out
+  // Print fan-out
   printf("Max fan-out = %d\n", net.maxFanOut());
+  printf("Min fan-out = %d\n", net.minFanOut());
 
   // Connection to tinsel machine
   HostLink hostLink;
@@ -86,7 +87,9 @@ int main(int argc, char**argv)
   // Display time
   timersub(&finish, &start, &diff);
   double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+  #ifndef POLITE_DUMP_STATS
   printf("Time = %lf\n", duration);
+  #endif
 
   return 0;
 }
diff --git a/apps/POLite/sssp-pc/Makefile b/apps/POLite/sssp-pc/Makefile
new file mode 100644
index 00000000..2ddbeca3
--- /dev/null
+++ b/apps/POLite/sssp-pc/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: BSD-2-Clause
+all: sssp
+
+INC=../../../include
+
+sssp: sssp.cpp
+	g++ -I$(INC) -O3 sssp.cpp -o sssp
+
+.PHONY: clean
+clean:
+	rm -f sssp
diff --git a/apps/POLite/sssp-pc/sssp.cpp b/apps/POLite/sssp-pc/sssp.cpp
new file mode 100644
index 00000000..9012f49e
--- /dev/null
+++ b/apps/POLite/sssp-pc/sssp.cpp
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: BSD-2-Clause
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <assert.h>
+#include <sys/time.h>
+#include <EdgeList.h>
+
+int main(int argc, char**argv)
+{
+  if (argc != 2) {
+    printf("Specify edges file\n");
+    exit(EXIT_FAILURE);
+  }
+
+  // Read network
+  EdgeList net;
+  net.read(argv[1]);
+
+  // Create weights
+  srand(1);
+  uint32_t** weights = new uint32_t* [net.numNodes];
+  for (uint32_t i = 0; i < net.numNodes; i++) {
+    uint32_t numNeighbours = net.neighbours[i][0];
+    weights[i] = new uint32_t [numNeighbours];
+    for (uint32_t j = 0; j < numNeighbours; j++) {
+      weights[i][j] = rand() % 100;
+    }
+  }
+
+  // Create states
+  uint32_t* dist = new uint32_t [net.numNodes];
+  int* queue = new int [net.numNodes];
+  int queueSize = 0;
+  int* queueNext = new int [net.numNodes];
+  int queueSizeNext = 0;
+  bool* inQueue = new bool [net.numNodes];
+  for (int i = 0; i < net.numNodes; i++) {
+    inQueue[i] = false;
+    dist[i] = 0x7fffffff;
+  }
+
+  // Set source vertex
+  dist[2] = 0;
+  queue[queueSize++] = 2;
+ 
+  // Start timer
+  printf("Started\n");
+  struct timeval start, finish, diff;
+  gettimeofday(&start, NULL);
+
+  int iters = 0;
+  while (queueSize > 0) {
+    for (int i = 0; i < queueSize; i++) {
+      uint32_t me = queue[i];
+      uint32_t numNeighbours = net.neighbours[me][0];
+      for (uint32_t j = 0; j < numNeighbours; j++) {
+        uint32_t neighbour = net.neighbours[me][j+1];
+        uint32_t newDist = dist[me] + weights[me][j];
+        if (newDist < dist[neighbour]) {
+          dist[neighbour] = newDist;
+          if (!inQueue[neighbour]) {
+            queueNext[queueSizeNext++] = neighbour;
+            inQueue[neighbour] = true;
+          }
+        }
+      }
+    }
+    queueSize = queueSizeNext;
+    queueSizeNext = 0;
+    int* tmp = queue; queue = queueNext; queueNext = tmp;
+    for (int i = 0; i < queueSize; i++) inQueue[queue[i]] = false;
+    iters++;
+  }
+
+  // Stop timer
+  gettimeofday(&finish, NULL);
+
+  uint64_t sum = 0;
+  for (int i = 0; i < net.numNodes; i++)
+    sum += dist[i];
+  printf("Sum of distances = %llu\n", (unsigned long long) sum);
+  printf("Iterations = %d\n", iters);
+
+  // Display time
+  timersub(&finish, &start, &diff);
+  double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+  printf("Time = %lf\n", duration);
+
+  return 0;
+}
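Note on the loop in `sssp.cpp` above: it is a frontier-based relaxation. Each round relaxes the outgoing edges of the current queue and collects every improved vertex, at most once, into the queue for the next round. The same idea can be sketched with standard containers as follows. This is an illustrative sketch, not the benchmark code: `Edge` and `ssspFrontier` are hypothetical names, and the graph is a plain adjacency list rather than an `EdgeList`.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical edge type for this sketch: (destination, weight).
using Edge = std::pair<uint32_t, uint32_t>;

// Frontier-based relaxation: each round relaxes the outgoing edges of
// the current frontier and collects improved vertices, at most once
// each, into the frontier for the next round.
inline std::vector<uint32_t> ssspFrontier(
    const std::vector<std::vector<Edge>>& graph, uint32_t source) {
  const uint32_t inf = 0x7fffffff;  // same "unreached" value as above
  assert(source < graph.size());
  std::vector<uint32_t> dist(graph.size(), inf);
  std::vector<bool> inNext(graph.size(), false);
  std::vector<uint32_t> frontier{source}, next;
  dist[source] = 0;
  while (!frontier.empty()) {
    for (uint32_t me : frontier) {
      for (const Edge& e : graph[me]) {
        uint32_t newDist = dist[me] + e.second;
        if (newDist < dist[e.first]) {
          dist[e.first] = newDist;
          if (!inNext[e.first]) {
            inNext[e.first] = true;
            next.push_back(e.first);
          }
        }
      }
    }
    // Clear the membership flags, then make `next` the new frontier.
    for (uint32_t v : next) inNext[v] = false;
    frontier.swap(next);
    next.clear();
  }
  return dist;
}
```

A vertex can re-enter the frontier in a later round if its distance improves again, which is why the membership flags are reset after every round.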
diff --git a/apps/POLite/sssp-sync/Run.cpp b/apps/POLite/sssp-sync/Run.cpp
index c7953795..37ffcb4e 100644
--- a/apps/POLite/sssp-sync/Run.cpp
+++ b/apps/POLite/sssp-sync/Run.cpp
@@ -20,8 +20,9 @@ int main(int argc, char**argv)
   EdgeList net;
   net.read(argv[1]);
 
-  // Print max fan-out
+  // Print fan-out
   printf("Max fan-out = %d\n", net.maxFanOut());
+  printf("Min fan-out = %d\n", net.minFanOut());
 
   // Connection to tinsel machine
   HostLink hostLink;
@@ -86,7 +87,9 @@ int main(int argc, char**argv)
   // Display time
   timersub(&finish, &start, &diff);
   double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+  #ifndef POLITE_DUMP_STATS
   printf("Time = %lf\n", duration);
+  #endif
 
   return 0;
 }
diff --git a/apps/POLite/util/genld.sh b/apps/POLite/util/genld.sh
index 0350108e..474e5694 100755
--- a/apps/POLite/util/genld.sh
+++ b/apps/POLite/util/genld.sh
@@ -18,7 +18,7 @@ OUTPUT_ARCH( "riscv" )
 MEMORY
 {
   instrs  : ORIGIN = $MaxBootImageBytes, LENGTH = $MaxInstrBytes
-  globals : ORIGIN = $DRAMBase, LENGTH = $DRAMGlobalsLength
+  globals : ORIGIN = $DRAMBase, LENGTH = $POLiteDRAMGlobalsLength
 }
 
 SECTIONS
diff --git a/apps/POLite/util/polite.mk b/apps/POLite/util/polite.mk
index a1d96f83..4abe32ee 100644
--- a/apps/POLite/util/polite.mk
+++ b/apps/POLite/util/polite.mk
@@ -51,7 +51,7 @@ $(HL)/%.o:
 
 $(BUILD)/run: $(RUN_CPP) $(RUN_H) $(HL)/*.o
 	g++ -std=c++11 -O2 -I $(INC) -I $(HL) -o $(BUILD)/run $(RUN_CPP) $(HL)/*.o \
-	  -lmetis -fno-exceptions
+	  -lmetis -fno-exceptions -fopenmp
 
 $(BUILD)/sim: $(RUN_CPP) $(RUN_H) $(HL)/sim/*.o
 	g++ -O2 -I $(INC) -I $(HL) -o $(BUILD)/sim $(RUN_CPP) $(HL)/sim/*.o \
diff --git a/apps/POLite/util/sumstats.awk b/apps/POLite/util/sumstats.awk
index 4d037cca..719699aa 100755
--- a/apps/POLite/util/sumstats.awk
+++ b/apps/POLite/util/sumstats.awk
@@ -10,10 +10,12 @@ BEGIN {
   cacheCount = 0;
   coreCount = 0;
   cacheLineSize = 32;
-  intraThreadSendCount = 0;
-  interThreadSendCount = 0;
-  interBoardSendCount = 0;
-  fmax = 225000000;
+  msgsReceived = 0;
+  msgsSent = 0;
+  progRouterSent = 0;
+  progRouterSentInter = 0;
+  blockedSends = 0;
+  fmax = 215000000;
   if (boardsX == "" || boardsY == "") {
     boardsX = 3;
     boardsY = 2;
@@ -48,13 +50,18 @@ BEGIN {
         coreCount = coreCount+1;
       }
       # Per-thread message counts
-      else if (match($0, /(.*) LS:(.*),TS:(.*),BS:(.*)/, fields)) {
-        ls=strtonum("0x"fields[2]);
-        ts=strtonum("0x"fields[3]);
-        bs=strtonum("0x"fields[4]);
-        intraThreadSendCount = intraThreadSendCount+ls;
-        interThreadSendCount = interThreadSendCount+ts;
-        interBoardSendCount = interBoardSendCount+bs;
+      else if (match($0, /(.*) MS:(.*),MR:(.*),PR:(.*),PRI:(.*),BL:(.*)/,
+                 fields)) {
+        ms=strtonum("0x"fields[2]);
+        mr=strtonum("0x"fields[3]);
+        pr=strtonum("0x"fields[4]);
+        pri=strtonum("0x"fields[5]);
+        bl=strtonum("0x"fields[6]);
+        msgsSent = msgsSent + ms;
+        msgsReceived = msgsReceived + mr;
+        progRouterSent = progRouterSent + pr;
+        progRouterSentInter = progRouterSentInter + pri;
+        blockedSends = blockedSends + bl;
       }
     }
   }
@@ -70,7 +77,14 @@ END {
   bytes = cacheLineSize * (missCount + writebackCount)
   print "Off-chip memory (GBytes/s): ", ((1/time) * bytes)/1000000000
   print "CPU util (%): ", (1-(cpuIdleCount/cycleCount))*100
-  print "Intra-thread messages: ", intraThreadSendCount
-  print "Inter-thread messages: ", interThreadSendCount
-  print "Inter-board messages: ", interBoardSendCount
+  print "Msgs received: ", msgsReceived
+  print "Msgs sent by threads: ", msgsSent
+  print "Msgs injected by ProgRouter:", progRouterSent
+  print "Inter-board msgs:", progRouterSentInter
+  print "Blocked sends:", blockedSends
+  print ""
+  print "Notes:"
+  print "  * ProgRouter injections include inter-board msgs"
+  print "  * Memory bandwidth does not include lookups by ProgRouter"
+  print "  * If runtime > 40s approx, hit/miss counts may overflow"
 }
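Note on the new per-thread counter format matched by `sumstats.awk` above: a line carries five hexadecimal counters labelled `MS`, `MR`, `PR`, `PRI` and `BL`. The sketch below parses one such line in C++. `ThreadMsgStats` and `parseStatsLine` are hypothetical names, and the text before ` MS:` is treated as an opaque prefix. One difference from the awk pattern: `strstr` finds the first ` MS:` in the line, whereas the greedy `(.*)` prefix matches the last. The two agree when the label occurs once.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical container for the five per-thread counters.
struct ThreadMsgStats {
  uint32_t sent;             // MS
  uint32_t received;         // MR
  uint32_t progRouter;       // PR
  uint32_t progRouterInter;  // PRI
  uint32_t blocked;          // BL
};

// Parse "<prefix> MS:<hex>,MR:<hex>,PR:<hex>,PRI:<hex>,BL:<hex>".
// Returns false if the line does not carry all five counters.
inline bool parseStatsLine(const char* line, ThreadMsgStats* out) {
  const char* p = std::strstr(line, " MS:");
  if (p == nullptr) return false;
  unsigned ms, mr, pr, pri, bl;
  int n = std::sscanf(p, " MS:%x,MR:%x,PR:%x,PRI:%x,BL:%x",
                      &ms, &mr, &pr, &pri, &bl);
  if (n != 5) return false;
  out->sent = ms;
  out->received = mr;
  out->progRouter = pr;
  out->progRouterInter = pri;
  out->blocked = bl;
  return true;
}
```

Summing the parsed fields over all lines would reproduce the totals printed in the `END` block of the script.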
diff --git a/config.py b/config.py
index 74c7f63e..6500be58 100755
--- a/config.py
+++ b/config.py
@@ -161,6 +161,16 @@ def quoted(s): return "'\"" + s + "\"'"
 p["SRAMLogMaxInFlight"] = 5
 p["SRAMStoreLatency"] = 2
 
+# Programmable router parameters:
+p["LogRoutingEntryLen"] = 5 # Bit width of the beat count in a routing key
+p["ProgRouterMaxBurst"] = 4
+p["FetcherLogIndQueueSize"] = 1
+p["FetcherLogBeatBufferSize"] = 5
+p["FetcherLogFlitBufferSize"] = 5
+p["FetcherLogMsgsPerFlitBuffer"] = (
+  p["FetcherLogFlitBufferSize"] - p["LogMaxFlitsPerMsg"])
+p["FetcherMsgsPerFlitBuffer"] = 2 ** p["FetcherLogMsgsPerFlitBuffer"]
+
 # Enable performance counters
 p["EnablePerfCount"] = True
 
@@ -178,7 +188,7 @@ def quoted(s): return "'\"" + s + "\"'"
 p["UseCustomAccelerator"] = False
 
 # Clock frequency (in MHz)
-p["ClockFreq"] = 225
+p["ClockFreq"] = 215
 
 #==============================================================================
 # Derived Parameters
@@ -300,6 +310,7 @@ def quoted(s): return "'\"" + s + "\"'"
 
 # Cores per board
 p["LogCoresPerBoard"] = p["LogCoresPerMailbox"] + p["LogMailboxesPerBoard"]
+p["LogCoresPerBoard1"] = p["LogCoresPerBoard"] + 1
 p["CoresPerBoard"] = 2**p["LogCoresPerBoard"]
 
 # Threads per core
@@ -356,10 +367,21 @@ def quoted(s): return "'\"" + s + "\"'"
 # DRAM base and length
 p["DRAMBase"] = 3 * (2 ** p["LogBytesPerSRAM"])
 p["DRAMGlobalsLength"] = 2 ** (p["LogBytesPerDRAM"] - 1) - p["DRAMBase"]
+p["POLiteDRAMGlobalsLength"] = 2 ** 14
+p["POLiteProgRouterBase"] = p["DRAMBase"] + p["POLiteDRAMGlobalsLength"]
+p["POLiteProgRouterLength"] = (p["DRAMGlobalsLength"] -
+                                 p["POLiteDRAMGlobalsLength"])
+
+# POLite globals
 
 # Number of FPGA boards per box (including bridge board)
 p["BoardsPerBox"] = p["MeshXLenWithinBox"] * p["MeshYLenWithinBox"] + 1
 
+# Parameters for programmable routers
+# (and the routing-record fetchers they contain)
+p["FetchersPerProgRouter"] = 4 + p["MailboxMeshXLen"]
+p["LogFetcherFlitBufferSize"] = 5
+
 #==============================================================================
 # Main 
 #==============================================================================
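Note on the POLite DRAM parameters added in `config.py` above: the globals region is capped at 2^14 bytes and the remainder of the former globals area is handed to the programmable router tables. The arithmetic can be checked with the sketch below. `PoliteDramLayout` and `politeDramLayout` are hypothetical names, and the two log sizes used in the usage note are placeholders, not the real Tinsel configuration values.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of the derived POLite DRAM regions.
struct PoliteDramLayout {
  uint64_t globalsBase;       // DRAMBase
  uint64_t globalsLength;     // POLiteDRAMGlobalsLength
  uint64_t progRouterBase;    // POLiteProgRouterBase
  uint64_t progRouterLength;  // POLiteProgRouterLength
};

// Mirror the derivation in config.py for given log sizes.
inline PoliteDramLayout politeDramLayout(unsigned logBytesPerSRAM,
                                         unsigned logBytesPerDRAM) {
  uint64_t dramBase = 3 * (uint64_t(1) << logBytesPerSRAM);
  uint64_t dramGlobalsLength =
      (uint64_t(1) << (logBytesPerDRAM - 1)) - dramBase;
  uint64_t politeGlobalsLength = uint64_t(1) << 14;
  assert(dramGlobalsLength >= politeGlobalsLength);
  PoliteDramLayout l;
  l.globalsBase = dramBase;
  l.globalsLength = politeGlobalsLength;
  l.progRouterBase = dramBase + politeGlobalsLength;
  l.progRouterLength = dramGlobalsLength - politeGlobalsLength;
  return l;
}
```

For example, with placeholder values `logBytesPerSRAM = 20` and `logBytesPerDRAM = 30`, the router region starts 16 KiB after `DRAMBase` and ends exactly at the half-way point of DRAM, which is where the original globals region ended.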
diff --git a/de5/S5_DDR3_QSYS.qsys b/de5/S5_DDR3_QSYS.qsys
index 0695a737..4d8e3a49 100644
--- a/de5/S5_DDR3_QSYS.qsys
+++ b/de5/S5_DDR3_QSYS.qsys
@@ -891,7 +891,7 @@
   <parameter name="MEM_CK_PHASE" value="0.0" />
   <parameter name="MEM_CK_WIDTH" value="1" />
   <parameter name="MEM_CLK_EN_WIDTH" value="1" />
-  <parameter name="MEM_CLK_FREQ" value="450.0" />
+  <parameter name="MEM_CLK_FREQ" value="430.0" />
   <parameter name="MEM_CLK_FREQ_MAX" value="800.0" />
   <parameter name="MEM_COL_ADDR_WIDTH" value="10" />
   <parameter name="MEM_CS_WIDTH" value="1" />
@@ -1214,7 +1214,7 @@
   <parameter name="MEM_CK_PHASE" value="0.0" />
   <parameter name="MEM_CK_WIDTH" value="1" />
   <parameter name="MEM_CLK_EN_WIDTH" value="1" />
-  <parameter name="MEM_CLK_FREQ" value="450.0" />
+  <parameter name="MEM_CLK_FREQ" value="430.0" />
   <parameter name="MEM_CLK_FREQ_MAX" value="800.0" />
   <parameter name="MEM_COL_ADDR_WIDTH" value="10" />
   <parameter name="MEM_CS_WIDTH" value="1" />
diff --git a/doc/PIP-0024-global-multicast.md b/doc/PIP-0024-global-multicast.md
new file mode 100644
index 00000000..65105f71
--- /dev/null
+++ b/doc/PIP-0024-global-multicast.md
@@ -0,0 +1,226 @@
+# PIP-0024: Programmable routers and global multicast
+
+Author: Matthew Naylor
+
+This proposal replaces PIP 21.
+
+## Proposal
+
+We propose to generalise the destination component of a message so
+that it can be (1) a thread id; or (2) a **routing key**.  A message,
+sent by a thread, containing a routing key as a destination will go to
+a **per-board router** on the same FPGA.  The router will use the key
+as an index into a DRAM-based routing table and automatically
+propagate the message towards all the destinations associated with
+that key.
+
+## Motivation/Rationale
+
+PIP 22 resulted in a *mailbox-level* multicast feature, implemented in
+Tinsel 0.7.  It enables each thread to send a message
+simultaneously to any subset of the 64 threads on a destination
+mailbox.  It works well when graphs exhibit good locality, with
+destination vertices often collocated on the same mailbox.
+
+However, it has a few drawbacks:
+
+  1. Costly graph partitioning algorithms are needed to identify
+     locality. This is problematic for graphs with billions of edges
+     and vertices, because mapping time may significantly outweigh
+     execution time.  (Indeed, graph partitioning is itself an
+     interesting application for the hardware.)
+
+  2. In some graphs there are limits to how well destination vertices
+     can be collocated after partitioning.  For example, *small-world
+     graphs* contain some extremely large, highly-distributed fanouts.
+
+A *global multicast* feature should reduce the need to find optimal
+partitions for very large graphs, and support distributed fanouts.  It
+should also move work away from the cores and into the hardware
+routers: the softswitch no longer needs to iterate over the outgoing
+edges of a pin.  While providing these improvements, it is also
+important to maintain the advantages of the existing mailbox-level
+multicast, for applications in which the mapping time is not a
+concern.
+
+## Functional overview
+
+A **routing key** is a 32-bit value consisting of a *ram id*, an
+*address*, and a *size*:
+
+```sv
+// 32-bit routing key (MSB to LSB)
+typedef struct {
+  // Which off-chip RAM on this board?
+  Bit#(`LogDRAMsPerBoard) ram;
+  // Pointer to array of routing beats containing routing records
+  Bit#(`LogBeatsPerDRAM) ptr;
+  // Number of beats in the array
+  Bit#(`LogRoutingEntryLen) numBeats;
+} RoutingKey;
+```
+
+When a message reaches the per-board router, the `ptr` field of the
+routing key is used as an index into DRAM, where a sequence of 256-bit
+**routing beats** is found.  The `numBeats` field of the routing key
+indicates how many contiguous routing beats there are.  Knowing the
+size before the lookup makes the hardware simpler and more efficient,
+e.g. it can avoid blocking on responses and issue a burst of an
+appropriate size.  The value of `numBeats` may be zero.
+
+A routing beat consists of a *size* and a sequence of five 48-bit
+*routing chunks*:
+
+```sv
+// 256-bit routing beat (aligned, MSB to LSB)
+typedef struct {
+  // Number of routing records present in this beat
+  Bit#(16) size;
+  // Five 48-bit record chunks
+  Vector#(5, Bit#(48)) chunks;
+} RoutingBeat;
+```
+
+The *size* must lie in the range 1 to 5 inclusive (0 is disallowed).
+A **routing record** consists of one or two routing chunks, depending
+on the **record type**.
+
+All byte orderings are little endian.  For example, the order of bytes
+in a routing beat is as follows.
+
+```
+Byte  Contents
+----  --------
+31:   Upper byte of size (i.e. number of records in beat)
+30:   Lower byte of size
+29:   Upper byte of first chunk
+      ...
+24:   Lower byte of first chunk
+23:   Upper byte of second chunk
+      ...
+18:   Lower byte of second chunk
+17:   Upper byte of third chunk
+      ...
+12:   Lower byte of third chunk
+11:   Upper byte of fourth chunk
+      ...
+ 6:   Lower byte of fourth chunk
+ 5:   Upper byte of fifth chunk
+      ...
+ 0:   Lower byte of fifth chunk
+```
+
+Clearly, both routing keys and routing beats have a maximum size.
+However, in principle there is no limit to the number of records
+associated with a key, due to the possibility of *indirection records*
+(see below).
+
+There are five types of routing record, defined below.
+
+**48-bit Unicast Router-to-Mailbox (URM1).**
+
+```sv
+typedef struct {
+  // Record type (URM1 == 0)
+  Bit#(3) tag;
+  // Mailbox destination
+  Bit#(4) mbox;
+  // Mailbox-local thread identifier
+  Bit#(6) thread;
+  // Unused
+  Bit#(3) unused;
+  // Local key. The first word of the message
+  // payload is overwritten with this.
+  Bit#(32) localKey;
+} URM1Record;
+```
+
+The `localKey` can be used for anything, but might encode the
+destination thread-local device identifier, or edge identifier, or
+both.  The `mbox` field is currently 4 bits (two Y bits followed by
+two X bits), but there are spare bits available to increase the size
+of this field in future if necessary.
+
+**96-bit Unicast Router-to-Mailbox (URM2).**
+
+```sv
+typedef struct {
+  // Record type (URM2 == 1)
+  Bit#(3) tag;
+  // Mailbox destination
+  Bit#(4) mbox;
+  // Mailbox-local thread identifier
+  Bit#(6) thread;
+  // Currently unused
+  Bit#(19) unused;
+  // Local key. The first two words of the message
+  // payload are overwritten with this.
+  Bit#(64) localKey;
+} URM2Record;
+```
+
+This is the same as a URM1 record except that the local key is 64 bits in
+size.
+
+**48-bit Router-to-Router (RR).**
+
+```sv
+typedef struct {
+  // Record type (RR == 2)
+  Bit#(3) tag;
+  // Direction (N,S,E,W == 0,1,2,3)
+  Bit#(2) dir;
+  // Currently unused
+  Bit#(11) unused;
+  // New 32-bit routing key that will replace the one in the
+  // current message for the next hop of the message's journey
+  Bit#(32) newKey;
+} RRRecord;
+```
+
+The `newKey` field will replace the key in the current message for the
+next hop of the message's journey.  Introducing a new key at each hop
+simplifies the mapping process (keeping it quick).
+
+**96-bit Multicast Router-to-Mailbox (MRM).**
+
+```sv
+typedef struct {
+  // Record type (MRM == 3)
+  Bit#(3) tag;
+  // Mailbox destination
+  Bit#(4) mbox;
+  // Currently unused
+  Bit#(9) unused;
+  // Local key. The least-significant half-word
+  // of the message is replaced with this
+  Bit#(16) localKey;
+  // Mailbox-local destination mask
+  Bit#(64) destMask;
+} MRMRecord;
+```
+
+**48-bit Indirection (IND).**
+
+```sv
+// 48-bit Indirection (IND) record
+// Note the restrictions on IND records:
+// 1. At most one IND record per key lookup
+// 2. A max-sized key lookup must contain an IND record
+typedef struct {
+  // Record type (IND == 4)
+  Bit#(3) tag;
+  // Currently unused
+  Bit#(13) unused;
+  // New 32-bit routing key for new set of records on current router
+  Bit#(32) newKey;
+} INDRecord;
+```
+
+Indirection records can be used to handle large fanouts that need more
+routing beats than the `numBeats` field of a routing key can express.
+
+## Impact
+
+Since use of routing keys is optional, existing applications will
+continue to work unmodified.
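Note on the record and beat layouts in PIP-0024 above: the following sketch packs a URM1 record into 48 bits and places single-chunk records into a 32-byte routing beat according to the byte table in the proposal. It is an illustration only. `packURM1` and `packBeat` are hypothetical names, and the sketch assumes that record fields are packed MSB-first in declaration order, as the proposal states for routing keys and routing beats.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Pack a URM1 record into the low 48 bits of a 64-bit word.
// Assumed layout (MSB to LSB):
//   tag(3) | mbox(4) | thread(6) | unused(3) | localKey(32)
inline uint64_t packURM1(uint32_t mbox, uint32_t thread, uint32_t localKey) {
  assert(mbox < 16 && thread < 64);
  uint64_t rec = 0;              // tag (3 bits): URM1 == 0
  rec = (rec << 4) | mbox;       // mailbox destination
  rec = (rec << 6) | thread;     // mailbox-local thread id
  rec = (rec << 3);              // unused bits
  rec = (rec << 32) | localKey;  // local key
  return rec;
}

// Fill a 32-byte routing beat with n single-chunk records.
// Bytes 31..30 hold the size; chunk i occupies bytes (24-6i)..(29-6i),
// least-significant byte first.
inline void packBeat(uint8_t beat[32], const uint64_t* chunks, unsigned n) {
  assert(n >= 1 && n <= 5);      // a size of 0 is disallowed
  std::memset(beat, 0, 32);
  beat[31] = 0;                  // upper byte of size
  beat[30] = uint8_t(n);         // lower byte of size
  for (unsigned i = 0; i < n; i++) {
    unsigned base = 24 - 6 * i;  // lowest byte of chunk i
    for (unsigned b = 0; b < 6; b++)
      beat[base + b] = uint8_t(chunks[i] >> (8 * b));
  }
}
```

Two-chunk records (URM2 and MRM) would occupy two adjacent chunk slots while still counting as one record in the size field. The sketch does not cover that case.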
diff --git a/doc/custom/ExampleAccelerator.sv b/doc/custom/ExampleAccelerator.sv
index 34a97fc2..acc73455 100644
--- a/doc/custom/ExampleAccelerator.sv
+++ b/doc/custom/ExampleAccelerator.sv
@@ -5,6 +5,7 @@
 
 typedef struct packed {
   logic acc;
+  logic isKey;
   logic host;
   logic hostDir;
   logic [`TinselMeshYBits-1:0] boardY;
diff --git a/doc/custom/README.md b/doc/custom/README.md
index c380f9c9..fde29010 100644
--- a/doc/custom/README.md
+++ b/doc/custom/README.md
@@ -74,6 +74,7 @@ custom accelerator or a mailbox.
 ```sv
 typedef struct packed {
   logic acc;
+  logic isKey;
   logic host;
   logic hostDir;
   logic [`TinselMeshYBits-1:0] boardY;
diff --git a/doc/figures/fpga.png b/doc/figures/fpga.png
index f4d60fbb..71a4c97f 100644
Binary files a/doc/figures/fpga.png and b/doc/figures/fpga.png differ
diff --git a/doc/figures/fpga.tex b/doc/figures/fpga.tex
index 02922a0f..9eafda95 100644
--- a/doc/figures/fpga.tex
+++ b/doc/figures/fpga.tex
@@ -14,15 +14,6 @@
   \definecolor{myorange}{RGB}{197,90,17}
   \definecolor{mygreen}{RGB}{84,130,53}
 
-  \node[fill=gray!20,rounded corners,
-        minimum width=6.3cm,minimum height=4.8cm] (border0)
-    at (4.5,2.0) {};
-  \node[fill=white,rounded corners,
-        minimum width=5.8cm,minimum height=4.1cm] (border1)
-    at (4.5,1.8) {};
-  \node[fill=none,color=black] at (4.5,6.4)
-    {\footnotesize{inter-FPGA reliable links}};
-
   \node[fill=myblue,rounded corners] (tile00)
      at (0,0) {\footnotesize{tile}};
   \node[rectangle,sharp corners,fill=black] (router00)
@@ -123,16 +114,16 @@
   \draw[arrows=-,color=mygreen] (tile13) to (mem13);
 
   \node[rounded corners,fill=mygreen]
-    (ram0) at (1.7,-1.6) {\footnotesize{off-chip RAM}};
+    (ram0) at (1.3,-1.8) {\footnotesize{off-chip RAM}};
 
-  \draw[arrows=-,color=mygreen] (mem00) to ([xshift=-7mm]ram0.north);
-  \draw[arrows=-,color=mygreen] (mem01) to ([xshift=-5mm]ram0.north);
-  \draw[arrows=-,color=mygreen] (mem02) to ([xshift=-3mm]ram0.north);
-  \draw[arrows=-,color=mygreen] (mem03) to ([xshift=-1mm]ram0.north);
-  \draw[arrows=-,color=mygreen] (mem10) to ([xshift=7mm]ram0.north);
-  \draw[arrows=-,color=mygreen] (mem11) to ([xshift=5mm]ram0.north);
-  \draw[arrows=-,color=mygreen] (mem12) to ([xshift=3mm]ram0.north);
-  \draw[arrows=-,color=mygreen] (mem13) to ([xshift=1mm]ram0.north);
+  \draw[arrows=-,color=mygreen] (mem00) to ([xshift=-3mm]ram0.north);
+  \draw[arrows=-,color=mygreen] (mem01) to ([xshift=-1mm]ram0.north);
+  \draw[arrows=-,color=mygreen] (mem02) to ([xshift=1mm]ram0.north);
+  \draw[arrows=-,color=mygreen] (mem03) to ([xshift=3mm]ram0.north);
+  \draw[arrows=-,color=mygreen] (mem10) to ([xshift=11mm]ram0.north);
+  \draw[arrows=-,color=mygreen] (mem11) to ([xshift=9mm]ram0.north);
+  \draw[arrows=-,color=mygreen] (mem12) to ([xshift=7mm]ram0.north);
+  \draw[arrows=-,color=mygreen] (mem13) to ([xshift=5mm]ram0.north);
 
   \coordinate[] (south0b) at (4.3, -0.9) {};
   \coordinate[] (south0a) at (-0.83, -0.9) {};
@@ -282,16 +273,16 @@
   \draw[arrows=-,color=mygreen] (tile33) to (memb13);
 
   \node[rounded corners,fill=mygreen]
-    (ram1) at (7.57,-1.6) {\footnotesize{off-chip RAM}};
+    (ram1) at (7.97,-1.8) {\footnotesize{off-chip RAM}};
 
-  \draw[arrows=-,color=mygreen] (memb00) to ([xshift=-7mm]ram1.north);
-  \draw[arrows=-,color=mygreen] (memb01) to ([xshift=-5mm]ram1.north);
-  \draw[arrows=-,color=mygreen] (memb02) to ([xshift=-3mm]ram1.north);
-  \draw[arrows=-,color=mygreen] (memb03) to ([xshift=-1mm]ram1.north);
-  \draw[arrows=-,color=mygreen] (memb10) to ([xshift=7mm]ram1.north);
-  \draw[arrows=-,color=mygreen] (memb11) to ([xshift=5mm]ram1.north);
-  \draw[arrows=-,color=mygreen] (memb12) to ([xshift=3mm]ram1.north);
-  \draw[arrows=-,color=mygreen] (memb13) to ([xshift=1mm]ram1.north);
+  \draw[arrows=-,color=mygreen] (memb00) to ([xshift=-11mm]ram1.north);
+  \draw[arrows=-,color=mygreen] (memb01) to ([xshift=-9mm]ram1.north);
+  \draw[arrows=-,color=mygreen] (memb02) to ([xshift=-7mm]ram1.north);
+  \draw[arrows=-,color=mygreen] (memb03) to ([xshift=-5mm]ram1.north);
+  \draw[arrows=-,color=mygreen] (memb10) to ([xshift=3mm]ram1.north);
+  \draw[arrows=-,color=mygreen] (memb11) to ([xshift=1mm]ram1.north);
+  \draw[arrows=-,color=mygreen] (memb12) to ([xshift=-1mm]ram1.north);
+  \draw[arrows=-,color=mygreen] (memb13) to ([xshift=-3mm]ram1.north);
 
 
 
@@ -359,33 +350,20 @@
   \coordinate[] (south2c) at (4.7, -2.3) {};
   \draw[arrows=-,color=black] (south2b) to (south2c);
 
-  \draw[arrows=-,color=black] (router00.west) to
-    ([xshift=-2.3mm]router00.west);
-  \draw[arrows=-,color=black] (router01.west) to
-    ([xshift=-2.3mm]router01.west);
-  \draw[arrows=-,color=black] (router02.west) to
-    ([xshift=-2.3mm]router02.west);
-  \draw[arrows=-,color=black] (router03.west) to
-    ([xshift=-2.3mm]router03.west);
-
-  \draw[arrows=-,color=black] (router30.east) to
-    ([xshift=14.4mm]router30.east);
-  \draw[arrows=-,color=black] (router31.east) to
-    ([xshift=14.4mm]router31.east);
-  \draw[arrows=-,color=black] (router32.east) to
-    ([xshift=14.4mm]router32.east);
-  \draw[arrows=-,color=black] (router33.east) to
-    ([xshift=14.4mm]router33.east);
-
-  \draw[arrows=-,color=black] (router03.north) to
-    ([yshift=2mm]router03.north);
-  \draw[arrows=-,color=black] (router13.north) to
-    ([yshift=2mm]router13.north);
-  \draw[arrows=-,color=black] (router23.north) to
-    ([yshift=2mm]router23.north);
-  \draw[arrows=-,color=black] (router33.north) to
-    ([yshift=2mm]router33.north);
+  \node[rounded corners,fill=myorange,minimum height=0.5cm] (boardrouter)
+   at (4.63cm,-1.8cm) {\footnotesize{board}\\[-1mm]\footnotesize{router}};
+
+  \node[rounded corners,fill=gray!20, text=black,minimum width=5.25cm] (links)
+    at (4.63cm, -3.2cm) {\footnotesize{inter-FPGA reliable links}};
+
+  \draw[arrows=-,color=black] (links.north) to (boardrouter.south);
+
+  % Is the board router connected to off-chip RAM?
+  \draw[arrows=-,color=mygreen] (ram0.east) to (boardrouter.west);
+  \draw[arrows=-,color=mygreen] (ram1.west) to (boardrouter.east);
+
 
 \end{tikzpicture}
 
+
 \end{document}
diff --git a/doc/figures/logo.png b/doc/figures/logo.png
new file mode 100644
index 00000000..8271002b
Binary files /dev/null and b/doc/figures/logo.png differ
diff --git a/hostlink/DebugLink.cpp b/hostlink/DebugLink.cpp
index f838441d..0031969c 100644
--- a/hostlink/DebugLink.cpp
+++ b/hostlink/DebugLink.cpp
@@ -60,10 +60,10 @@ void DebugLink::putPacket(int x, int y, BoardCtrlPkt* pkt)
 }
 
 // Constructor
-DebugLink::DebugLink(uint32_t numBoxesX, uint32_t numBoxesY)
+DebugLink::DebugLink(DebugLinkParams p)
 {
-  boxMeshXLen = numBoxesX;
-  boxMeshYLen = numBoxesY;
+  boxMeshXLen = p.numBoxesX;
+  boxMeshYLen = p.numBoxesY;
   get_tryNextX = 0;
   get_tryNextY = 0;
 
@@ -105,11 +105,11 @@ DebugLink::DebugLink(uint32_t numBoxesX, uint32_t numBoxesY)
                     "But is has a box X coordinate of %i\n", thisBoxX);
     exit(EXIT_FAILURE);
   }
-  if ((thisBoxX+numBoxesX-1) >= TinselBoxMeshXLen ||
-      (thisBoxY+numBoxesY-1) >= TinselBoxMeshYLen) {
+  if ((thisBoxX+p.numBoxesX-1) >= TinselBoxMeshXLen ||
+      (thisBoxY+p.numBoxesY-1) >= TinselBoxMeshYLen) {
     fprintf(stderr, "Requested box sub-mesh of size %ix%i "
                     "is not valid from box %s\n",
-                    numBoxesX, numBoxesY, hostname);
+                    p.numBoxesX, p.numBoxesY, hostname);
     exit(EXIT_FAILURE);
   }
 
@@ -187,6 +187,8 @@ DebugLink::DebugLink(uint32_t numBoxesX, uint32_t numBoxesY)
       if (y == 0) pkt.payload[2] |= 2;
       if (thisBoxX == 0 && boxMeshXLen == 1) pkt.payload[2] |= 4;
       if (thisBoxX == 1 && boxMeshXLen == 1) pkt.payload[2] |= 8;
+      // Reserve extra send slot?
+      pkt.payload[2] |= p.useExtraSendSlot ? 0x10 : 0;
       // Send commands to each board
       for (int b = 0; b < TinselBoardsPerBox; b++) {
         pkt.linkId = b;
diff --git a/hostlink/DebugLink.h b/hostlink/DebugLink.h
index fd3c8291..18d352dc 100644
--- a/hostlink/DebugLink.h
+++ b/hostlink/DebugLink.h
@@ -8,6 +8,13 @@
 #include "BoardCtrl.h"
 #include "DebugLinkFormat.h"
 
+// DebugLink parameters
+struct DebugLinkParams {
+  uint32_t numBoxesX;
+  uint32_t numBoxesY;
+  bool useExtraSendSlot;
+};
+
 class DebugLink {
 
   // Location of this box with full box mesh
@@ -46,7 +53,7 @@ class DebugLink {
   int meshYLen;
 
   // Constructor
-  DebugLink(uint32_t numBoxesX, uint32_t numBoxesY);
+  DebugLink(DebugLinkParams params);
 
   // On given board, set destination core and thread
   void setDest(uint32_t boardX, uint32_t boardY,
diff --git a/hostlink/HostLink.cpp b/hostlink/HostLink.cpp
index aa4d3af6..dd896f4d 100644
--- a/hostlink/HostLink.cpp
+++ b/hostlink/HostLink.cpp
@@ -60,9 +60,11 @@ static int connectToPCIeStream(const char* socketPath)
 }
 
 // Internal constructor
-void HostLink::constructor(uint32_t numBoxesX, uint32_t numBoxesY)
+void HostLink::constructor(HostLinkParams p)
 {
-  if (numBoxesX > TinselBoxMeshXLen || numBoxesY > TinselBoxMeshYLen) {
+  useExtraSendSlot = p.useExtraSendSlot;
+
+  if (p.numBoxesX > TinselBoxMeshXLen || p.numBoxesY > TinselBoxMeshYLen) {
     fprintf(stderr, "Number of boxes requested exceeds those available\n");
     exit(EXIT_FAILURE);
   }
@@ -92,7 +94,11 @@ void HostLink::constructor(uint32_t numBoxesX, uint32_t numBoxesY)
   #endif
 
   // Create DebugLink
-  debugLink = new DebugLink(numBoxesX, numBoxesY);
+  DebugLinkParams debugLinkParams;
+  debugLinkParams.numBoxesX = p.numBoxesX;
+  debugLinkParams.numBoxesY = p.numBoxesY;
+  debugLinkParams.useExtraSendSlot = p.useExtraSendSlot;
+  debugLink = new DebugLink(debugLinkParams);
 
   // Set board mesh dimensions
   meshXLen = debugLink->meshXLen;
@@ -145,12 +151,25 @@ HostLink::HostLink()
   int x = str ? atoi(str) : 1;
   str = getenv("HOSTLINK_BOXES_Y");
   int y = str ? atoi(str) : 1;
-  constructor(x, y);
+  HostLinkParams params;
+  params.numBoxesX = x;
+  params.numBoxesY = y;
+  params.useExtraSendSlot = false;
+  constructor(params);
 }
 
 HostLink::HostLink(uint32_t numBoxesX, uint32_t numBoxesY)
 {
-  constructor(numBoxesX, numBoxesY);
+  HostLinkParams params;
+  params.numBoxesX = numBoxesX;
+  params.numBoxesY = numBoxesY;
+  params.useExtraSendSlot = false;
+  constructor(params);
+}
+
+HostLink::HostLink(HostLinkParams params)
+{
+  constructor(params);
 }
 
 // Destructor
@@ -218,8 +237,9 @@ void HostLink::fromAddr(uint32_t addr, uint32_t* meshX, uint32_t* meshY,
   *meshY = addr;
 }
 
-// Inject a message via PCIe (blocking by default)
-bool HostLink::send(uint32_t dest, uint32_t numFlits, void* payload, bool block)
+// Internal helper for sending messages
+bool HostLink::sendHelper(uint32_t dest, uint32_t numFlits, void* payload,
+       bool block, uint32_t key)
 {
   assert(useSendBuffer ? block : true);
 
@@ -242,7 +262,7 @@ bool HostLink::send(uint32_t dest, uint32_t numFlits, void* payload, bool block)
     buffer[0] = dest;
     buffer[1] = 0;
     buffer[2] = (numFlits-1) << 24;
-    buffer[3] = 0;
+    buffer[3] = key;
 
     // Fill in message payload
     memcpy(&buffer[4], payload, numFlits*16);
@@ -285,6 +305,13 @@ bool HostLink::send(uint32_t dest, uint32_t numFlits, void* payload, bool block)
   }
 }
 
+
+// Inject a message via PCIe (blocking by default)
+bool HostLink::send(uint32_t dest, uint32_t numFlits, void* msg, bool block)
+{
+  return sendHelper(dest, numFlits, msg, block, 0);
+}
+
 // Flush the send buffer
 void HostLink::flush()
 {
@@ -298,7 +325,28 @@ void HostLink::flush()
 // Try to send a message (non-blocking, returns true on success)
 bool HostLink::trySend(uint32_t dest, uint32_t numFlits, void* msg)
 {
-  return send(dest, numFlits, msg, false);
+  return sendHelper(dest, numFlits, msg, false, 0);
+}
+
+// Send a message using routing key (blocking by default)
+bool HostLink::keySend(uint32_t key, uint32_t numFlits,
+       void* msg, bool block)
+{
+  uint32_t useRoutingKey = 1 << (
+    TinselLogThreadsPerCore + TinselLogCoresPerMailbox +
+    TinselMailboxMeshXBits + TinselMailboxMeshYBits +
+    TinselMeshXBits + TinselMeshYBits + 2);
+  return sendHelper(useRoutingKey, numFlits, msg, block, key);
+}
+
+// Try to send using routing key (non-blocking, returns true on success)
+bool HostLink::keyTrySend(uint32_t key, uint32_t numFlits, void* msg)
+{
+  uint32_t useRoutingKey = 1 << (
+    TinselLogThreadsPerCore + TinselLogCoresPerMailbox +
+    TinselMailboxMeshXBits + TinselMailboxMeshYBits +
+    TinselMeshXBits + TinselMeshYBits + 2);
+  return sendHelper(useRoutingKey, numFlits, msg, false, key);
 }
 
 // Receive a message via PCIe (blocking)
diff --git a/hostlink/HostLink.h b/hostlink/HostLink.h
index 81c9b32f..41d78303 100644
--- a/hostlink/HostLink.h
+++ b/hostlink/HostLink.h
@@ -16,6 +16,13 @@
 #define PCIESTREAM      "pciestream"
 #define PCIESTREAM_SIM  "tinsel.b-1.1"
 
+// HostLink parameters
+struct HostLinkParams {
+  uint32_t numBoxesX;
+  uint32_t numBoxesY;
+  bool useExtraSendSlot;
+};
+
 class HostLink {
   // Lock file for acquring exclusive access to PCIeStream
   int lockFile;
@@ -33,8 +40,15 @@ class HostLink {
   char* sendBuffer;
   int sendBufferLen;
 
+  // Request an extra send slot when bringing up Tinsel FPGAs
+  bool useExtraSendSlot;
+
   // Internal constructor
-  void constructor(uint32_t numBoxesX, uint32_t numBoxesY);
+  void constructor(HostLinkParams params);
+
+  // Internal helper for sending messages
+  bool sendHelper(uint32_t dest, uint32_t numFlits, void* payload,
+         bool block, uint32_t key);
  public:
   // Dimensions of board mesh
   int meshXLen;
@@ -43,6 +57,7 @@ class HostLink {
   // Constructors
   HostLink();
   HostLink(uint32_t numBoxesX, uint32_t numBoxesY);
+  HostLink(HostLinkParams params);
 
   // Destructor
   ~HostLink();
@@ -65,6 +80,12 @@ class HostLink {
   // Try to send a message (non-blocking, returns true on success)
   bool trySend(uint32_t dest, uint32_t numFlits, void* msg);
 
+  // Send a message using routing key (blocking by default)
+  bool keySend(uint32_t key, uint32_t numFlits, void* msg, bool block = true);
+
+  // Try to send using routing key (non-blocking, returns true on success)
+  bool keyTrySend(uint32_t key, uint32_t numFlits, void* msg);
+
   // Receive a max-sized message (blocking)
   void recv(void* msg);
 
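Note on `keySend` and `keyTrySend` above: both set a single flag bit in the destination word and pass the routing key separately. The flag sits two bits above the full thread address, which appears consistent with the `isKey`, `host`, `hostDir` field order added to the address struct in `doc/custom`. The sketch below reproduces the bit computation outside of Tinsel. `AddrWidths` and `routingKeyDest` are hypothetical names, and the widths in the usage note are placeholders standing in for the generated `Tinsel*` macros.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bundle of address field widths (in bits).
struct AddrWidths {
  unsigned logThreadsPerCore;
  unsigned logCoresPerMailbox;
  unsigned mailboxMeshXBits;
  unsigned mailboxMeshYBits;
  unsigned meshXBits;
  unsigned meshYBits;
};

// Destination word with only the "use routing key" flag set.
inline uint32_t routingKeyDest(const AddrWidths& w) {
  unsigned threadAddrBits =
      w.logThreadsPerCore + w.logCoresPerMailbox +
      w.mailboxMeshXBits + w.mailboxMeshYBits +
      w.meshXBits + w.meshYBits;
  assert(threadAddrBits + 2 < 32);
  // Skip the two bits directly above the thread address.
  return uint32_t(1) << (threadAddrBits + 2);
}
```

With placeholder widths `{4, 2, 2, 2, 2, 2}` the thread address is 14 bits wide, so the flag lands on bit 16. Since the same expression is repeated in `keySend` and `keyTrySend`, a shared helper of this kind could remove the duplication.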
diff --git a/include/EdgeList.h b/include/EdgeList.h
index 7d03bb8f..ebd5d37f 100644
--- a/include/EdgeList.h
+++ b/include/EdgeList.h
@@ -3,8 +3,11 @@
 #define _NETWORK_H_
 
 #include <stdio.h>
-#include <stdlib.h>
 #include <stdint.h>
+#include <assert.h>
+#include <iostream>
+#include <fstream>
+#include <vector>
 
 struct EdgeList {
   // Number of nodes and edges
@@ -18,50 +21,42 @@ struct EdgeList {
   // Read network from file
   void read(const char* filename)
   {
-    // Read edges
-    FILE* fp = fopen(filename, "rt");
-    if (fp == NULL) {
-      fprintf(stderr, "Can't open '%s'\n", filename);
-      exit(EXIT_FAILURE);
-    }
+    std::fstream file(filename, std::ios_base::in);
+    std::vector<uint32_t> vec;
 
     // Count number of nodes and edges
     numEdges = 0;
     numNodes = 0;
-    int ret;
-    while (1) {
-      uint32_t src, dst;
-      ret = fscanf(fp, "%d %d", &src, &dst);
-      if (ret == EOF) break;
+    uint32_t numInts = 0;
+    uint32_t val;
+    while (file >> val) {
+      vec.push_back(val);
+      numNodes = val >= numNodes ? val+1 : numNodes;
       numEdges++;
-      numNodes = src >= numNodes ? src+1 : numNodes;
-      numNodes = dst >= numNodes ? dst+1 : numNodes;
     }
-    rewind(fp);
+    assert((numEdges&1) == 0);
+    numEdges >>= 1;
 
     uint32_t* count = (uint32_t*) calloc(numNodes, sizeof(uint32_t));
-    for (int i = 0; i < numEdges; i++) {
-      uint32_t src, dst;
-      ret = fscanf(fp, "%d %d", &src, &dst);
-      count[src]++;
+    for (size_t i = 0; i < vec.size(); i+=2) {
+      count[vec[i]]++;
     }
 
     // Create mapping from node id to neighbours
     neighbours = (uint32_t**) calloc(numNodes, sizeof(uint32_t*));
-    rewind(fp);
     for (int i = 0; i < numNodes; i++) {
       neighbours[i] = (uint32_t*) calloc(count[i]+1, sizeof(uint32_t));
       neighbours[i][0] = count[i];
     }
-    for (int i = 0; i < numEdges; i++) {
-      uint32_t src, dst;
-      ret = fscanf(fp, "%d %d", &src, &dst);
+    for (size_t i = 0; i < vec.size(); i+=2) {
+      uint32_t src = vec[i];
+      uint32_t dst = vec[i+1];
       neighbours[src][count[src]--] = dst;
     }
  
     // Release
     free(count);
-    fclose(fp);
+    file.close();
   }
 
   // Determine max fan-out
@@ -73,6 +68,17 @@ struct EdgeList {
     }
     return max;
   }
+
+  // Determine min fan-out
+  uint32_t minFanOut() {
+    uint32_t min = ~0;
+    for (uint32_t i = 0; i < numNodes; i++) {
+      uint32_t numNeighbours = neighbours[i][0];
+      if (numNeighbours < min) min = numNeighbours;
+    }
+    return min;
+  }
+
 };
 
 #endif
diff --git a/include/POLite.h b/include/POLite.h
index d12a0e73..f053e440 100644
--- a/include/POLite.h
+++ b/include/POLite.h
@@ -9,10 +9,10 @@
   #include <POLite/PDevice.h>
 #else
   #include <POLite/PDevice.h>
+  #include <POLite/PGraph.h>
   #include <POLite/Seq.h>
   #include <POLite/Graph.h>
   #include <POLite/Placer.h>
-  #include <POLite/PGraph.h>
 #endif
 
 #endif
diff --git a/include/POLite/Bitmap.h b/include/POLite/Bitmap.h
new file mode 100644
index 00000000..9271bc07
--- /dev/null
+++ b/include/POLite/Bitmap.h
@@ -0,0 +1,59 @@
+#ifndef _BITMAP_H_
+#define _BITMAP_H_
+
+#include <stdint.h>
+#include <assert.h>
+#include <POLite/Seq.h>
+
+struct Bitmap {
+  // Bitmap contents (sequence of 64-bit words)
+  Seq<uint64_t>* contents;
+
+  // Index of first non-full word in bitmap
+  uint32_t firstFree;
+
+  // Constructor
+  Bitmap() {
+    contents = new Seq<uint64_t> (16);
+    firstFree = 0;
+  }
+
+  // Destructor
+  ~Bitmap() {
+    if (contents) delete contents;
+  }
+
+  // Get value of word at given index, return 0 if out-of-bounds
+  inline uint64_t getWord(uint32_t index) {
+    return index >= contents->numElems ? 0ul : contents->elems[index];
+  }
+
+  // Find index of next free word in bitmap starting from given word index
+  inline uint32_t nextFreeWordFrom(uint32_t start) {
+    for (uint32_t i = start; i < contents->numElems; i++)
+      if (~contents->elems[i] != 0ul) return i;
+    return contents->numElems;
+  }
+
+  // Set bit at given index and bit offset in bitmap
+  inline void setBit(uint32_t wordIndex, uint32_t bitIndex) {
+    for (uint32_t i = contents->numElems; i <= wordIndex; i++)
+      contents->append(0ul);
+    contents->elems[wordIndex] |= 1ull << bitIndex;
+    if (wordIndex == firstFree) {
+      firstFree = nextFreeWordFrom(firstFree);
+    }
+  }
+
+  // Find index of next zero bit, and flip that bit
+  inline uint32_t grabNextBit() {
+    uint64_t word = getWord(firstFree);
+    assert(~word != 0ul);
+    uint32_t bit = __builtin_ctzll(~word);
+    uint32_t result = 64*firstFree + bit;
+    setBit(firstFree, bit);
+    return result;
+  }
+};
+
+#endif
diff --git a/include/POLite/PDevice.h b/include/POLite/PDevice.h
index 9eefda3a..508207bd 100644
--- a/include/POLite/PDevice.h
+++ b/include/POLite/PDevice.h
@@ -22,14 +22,22 @@
 #define POLITE_NUM_PINS 1
 #endif
 
-// Macros for performance stats
+// The local-multicast key points to a list of incoming edges.  Some
+// of those edges are stored in a header, the rest in an array at a
+// different location.  The number stored in the header is controlled
+// by the following parameter.  If it's too low, we risk wasting
+// memory bandwidth.  If it's too high, we risk wasting memory.  
+// The minimum value is 0.  For large edge state sizes, use 0.
+#ifndef POLITE_EDGES_PER_HEADER
+#define POLITE_EDGES_PER_HEADER 6
+#endif
+
+// Macros for performance stats:
 //   POLITE_DUMP_STATS - dump performance stats on termination
-//   POLITE_COUNT_MSGS - include message counts of performance stats
+//   POLITE_COUNT_MSGS - include message counts in performance stats
 
 // Thread-local device id
 typedef uint16_t PLocalDeviceId;
-#define InvalidLocalDevId 0xffff
-#define UnusedLocalDevId 0xfffe
 
 // Thread id
 typedef uint32_t PThreadId;
@@ -54,7 +62,7 @@ inline PLocalDeviceId getLocalDeviceId(PDeviceAddr addr) { return addr >> 19; }
 // What's the max allowed local device address?
 inline uint32_t maxLocalDeviceId() { return 8192; }
 
-// Routing key
+// Local multicast key
 typedef uint16_t Key;
 #define InvalidKey 0xffff
 
@@ -102,8 +110,8 @@ template <typename S> struct ALIGNED PState {
 
 // Message structure
 template <typename M> struct PMessage {
-  // Source-based routing key
-  Key key;
+  // Destination key
+  uint16_t destKey;
   // Application message
   M payload;
 };
@@ -119,15 +127,15 @@ struct POutEdge {
   uint32_t threadMaskHigh;
 };
 
-// An incoming edge to a device (labelleled)
+// An incoming edge to a device
 template <typename E> struct PInEdge {
   // Destination device
   PLocalDeviceId devId;
-  // Edge info
+  // Edge data
   E edge;
 };
 
-// An incoming edge to a device (unlabelleled)
+// An incoming edge to a device (unlabelled)
 template <> struct PInEdge<None> {
   union {
     // Destination device
@@ -137,15 +145,17 @@ template <> struct PInEdge<None> {
   };
 };
 
-// Helper function: Count board hops between two threads
-inline uint32_t hopsBetween(uint32_t t0, uint32_t t1) {
-  uint32_t xmask = ((1<<TinselMeshXBits)-1);
-  int32_t y0 = t0 >> (TinselLogThreadsPerBoard + TinselMeshXBits);
-  int32_t x0 = (t0 >> TinselLogThreadsPerBoard) & xmask;
-  int32_t y1 = t1 >> (TinselLogThreadsPerBoard + TinselMeshXBits);
-  int32_t x1 = (t1 >> TinselLogThreadsPerBoard) & xmask;
-  return (abs(x0-x1) + abs(y0-y1));
-}
+// Header for a list of incoming edges (fixed size structure to
+// support fast construction/packing of local-multicast tables)
+template <typename E> struct PInHeader {
+  // Number of receivers
+  uint16_t numReceivers;
+  // Pointer to remaining edges in inTableRest,
+  // if they don't all fit in the header
+  uint16_t restIndex;
+  // Edges stored in the header, to make good use of cached data
+  PInEdge<E> edges[POLITE_EDGES_PER_HEADER];
+};
 
 // Generic thread structure
 template <typename DeviceType,
@@ -161,7 +171,8 @@ template <typename DeviceType,
   PTR(PState<S>) devices;
   // Pointer to base of routing tables
   PTR(POutEdge) outTableBase;
-  PTR(PInEdge<E>) inTableBase;
+  PTR(PInHeader<E>) inTableHeaderBase;
+  PTR(PInEdge<E>) inTableRestBase;
   // Array of local device ids are ready to send
   PTR(PLocalDeviceId) senders;
   // This array is accessed in a LIFO manner
@@ -170,11 +181,11 @@ template <typename DeviceType,
   // Count number of messages sent
   #ifdef POLITE_COUNT_MSGS
   // Total messages sent
-  uint32_t intraThreadSendCount;
-  // Total messages sent between threads
-  uint32_t interThreadSendCount;
-  // Messages sent between threads on different boards
-  uint32_t interBoardSendCount;
+  uint32_t msgsSent;
+  // Total messages received
+  uint32_t msgsReceived;
+  // Number of times we wanted to send but couldn't
+  uint32_t blockedSends;
   #endif
 
   #ifdef TINSEL
@@ -211,8 +222,14 @@ template <typename DeviceType,
     }
     // Per-thread performance counters
     #ifdef POLITE_COUNT_MSGS
-    printf("LS:%x,TS:%x,BS:%x\n", intraThreadSendCount,
-             interThreadSendCount, interBoardSendCount);
+    uint32_t intraBoardId = me & ((1<<TinselLogThreadsPerBoard) - 1);
+    uint32_t progRouterSent =
+      intraBoardId == 0 ? tinselProgRouterSent() : 0;
+    uint32_t progRouterSentInter =
+      intraBoardId == 0 ? tinselProgRouterSentInterBoard() : 0;
+    printf("MS:%x,MR:%x,PR:%x,PRI:%x,BL:%x\n",
+      msgsSent, msgsReceived, progRouterSent,
+        progRouterSentInter, blockedSends);
     #endif
   }
 
@@ -257,20 +274,21 @@ template <typename DeviceType,
         if (tinselCanSend()) {
           PMessage<M>* m = (PMessage<M>*) tinselSendSlot();
           // Send message
-          m->key = outEdge->key;
+          m->destKey = outEdge->key;
           tinselMulticast(outEdge->mbox, outEdge->threadMaskHigh,
             outEdge->threadMaskLow, m);
           #ifdef POLITE_COUNT_MSGS
-          interThreadSendCount++;
-          interBoardSendCount +=
-            hopsBetween(outEdge->mbox << TinselLogThreadsPerMailbox,
-              tinselId());
+          msgsSent++;
           #endif
           // Move to next neighbour
           outEdge++;
         }
-        else
+        else {
+          #ifdef POLITE_COUNT_MSGS
+          blockedSends++;
+          #endif
           tinselWaitUntil(TINSEL_CAN_SEND|TINSEL_CAN_RECV);
+        }
       }
       else if (sendersTop != senders) {
         if (tinselCanSend()) {
@@ -292,8 +310,12 @@ template <typename DeviceType,
               devices[src].pinBase[pin-2]
             ];
         }
-        else
+        else {
+          #ifdef POLITE_COUNT_MSGS
+          blockedSends++;
+          #endif
           tinselWaitUntil(TINSEL_CAN_SEND|TINSEL_CAN_RECV);
+        }
       }
       else {
         // Idle detection
@@ -318,8 +340,14 @@ template <typename DeviceType,
       // Step 2: try to receive
       while (tinselCanRecv()) {
         PMessage<M>* inMsg = (PMessage<M>*) tinselRecv();
-        PInEdge<E>* inEdge = &inTableBase[inMsg->key];
-        while (inEdge->devId != InvalidLocalDevId) {
+        PInHeader<E>* inHeader = &inTableHeaderBase[inMsg->destKey];
+        // Determine number and location of edges/receivers
+        uint32_t numReceivers = inHeader->numReceivers;
+        PInEdge<E>* inEdge = inHeader->edges;
+        // For each receiver
+        for (uint32_t i = 0; i < numReceivers; i++) {
+          if (i == POLITE_EDGES_PER_HEADER)
+            inEdge = &inTableRestBase[inHeader->restIndex];
           // Lookup destination device
           PLocalDeviceId id = inEdge->devId;
           DeviceType dev = getDevice(id);
@@ -332,7 +360,7 @@ template <typename DeviceType,
             *(sendersTop++) = id;
           inEdge++;
           #ifdef POLITE_COUNT_MSGS
-          intraThreadSendCount++;
+          msgsReceived++;
           #endif
         }
         tinselFree(inMsg);
diff --git a/include/POLite/PGraph.h b/include/POLite/PGraph.h
index 4181c3da..1a67e2ef 100644
--- a/include/POLite/PGraph.h
+++ b/include/POLite/PGraph.h
@@ -12,8 +12,10 @@
 #include <POLite/Seq.h>
 #include <POLite/Graph.h>
 #include <POLite/Placer.h>
+#include <POLite/Bitmap.h>
+#include <POLite/ProgRouters.h>
 #include <type_traits>
-#include "Seq.h"
+#include <tinsel-interface.h>
 
 // Nodes of a POETS graph are devices
 typedef NodeId PDeviceId;
@@ -24,9 +26,27 @@ template <typename E> struct PReceiverGroup {
   // Thread id where all the receivers reside
   uint32_t threadId;
   // A sequence of receiving devices on that thread
-  Seq<PInEdge<E>>* receivers;
+  SmallSeq<PInEdge<E>> receivers;
 };
 
+// This structure holds info about an edge destination
+struct PEdgeDest {
+  // Index of edge in outgoing edge list
+  uint32_t index;
+  // Destination device
+  PDeviceId dest;
+  // Address where destination is located
+  PDeviceAddr addr;
+};
+
+// Comparison function for PEdgeDest
+// (Useful to sort destinations by thread id of destination)
+inline int cmpEdgeDest(const void* e0, const void* e1) {
+  uint32_t t0 = getThreadId(((const PEdgeDest*) e0)->addr);
+  uint32_t t1 = getThreadId(((const PEdgeDest*) e1)->addr);
+  return (t0 > t1) - (t0 < t1);  // negative, zero or positive, as qsort requires
+}
+
 // POETS graph
 template <typename DeviceType,
           typename S, typename E, typename M> class PGraph {
@@ -59,8 +79,19 @@ template <typename DeviceType,
   // Multicast routing tables:
   // Sequence of outgoing edges for every (device, pin) pair
   Seq<POutEdge>*** outTable;
-  // Sequence of incoming edges for every thread
-  Seq<PInEdge<E>>** inTable;
+  // Sequence of in-edge headers, for each thread
+  Seq<PInHeader<E>>** inTableHeaders;
+  // Remaining in-edges that don't fit in the header table, for each thread
+  Seq<PInEdge<E>>** inTableRest;
+  // Bitmap denoting used space in header table, for each thread
+  Bitmap** inTableBitmaps;
+
+  // Programmable routing tables
+  ProgRouterMesh* progRouterTables;
+
+  // Receiver groups (used internally by some methods, but declared once
+  // to avoid repeated allocation)
+  PReceiverGroup<E> groups[TinselThreadsPerMailbox];
 
   // Generic constructor
   void constructor(uint32_t lenX, uint32_t lenY) {
@@ -79,18 +110,29 @@ template <typename DeviceType,
     vertexMem = NULL;
     vertexMemSize = NULL;
     vertexMemBase = NULL;
-    inEdgeMem = NULL;
-    inEdgeMemSize = NULL;
-    inEdgeMemBase = NULL;
+    inEdgeHeaderMem = NULL;
+    inEdgeHeaderMemSize = NULL;
+    inEdgeHeaderMemBase = NULL;
+    inEdgeRestMem = NULL;
+    inEdgeRestMemSize = NULL;
+    inEdgeRestMemBase = NULL;
     outEdgeMem = NULL;
     outEdgeMemSize = NULL;
     outEdgeMemBase = NULL;
     mapVerticesToDRAM = false;
-    mapInEdgesToDRAM = true;
+    mapInEdgeHeadersToDRAM = true;
+    mapInEdgeRestToDRAM = true;
     mapOutEdgesToDRAM = true;
     outTable = NULL;
-    inTable = NULL;
+    inTableHeaders = NULL;
+    inTableRest = NULL;
+    inTableBitmaps = NULL;
+    progRouterTables = NULL;
     chatty = 0;
+    str = getenv("POLITE_CHATTY");
+    if (str != NULL) {
+      chatty = !strcmp(str, "0") ? 0 : 1;
+    }
   }
 
  public:
@@ -124,14 +166,18 @@ template <typename DeviceType,
 
   // Each thread's in-edge and out-edge regions
   // (Not valid until the mapper is called)
-  uint8_t** inEdgeMem;      uint8_t** outEdgeMem;
-  uint32_t* inEdgeMemSize;  uint32_t* outEdgeMemSize;
-  uint32_t* inEdgeMemBase;  uint32_t* outEdgeMemBase;
+  uint8_t** inEdgeHeaderMem;      uint8_t** inEdgeRestMem;
+  uint32_t* inEdgeHeaderMemSize;  uint32_t* inEdgeRestMemSize;
+  uint32_t* inEdgeHeaderMemBase;  uint32_t* inEdgeRestMemBase;
+  uint8_t** outEdgeMem;
+  uint32_t* outEdgeMemSize;
+  uint32_t* outEdgeMemBase;
 
   // Where to map the various regions
   // (If false, map to SRAM instead)
   bool mapVerticesToDRAM;
-  bool mapInEdgesToDRAM;
+  bool mapInEdgeHeadersToDRAM;
+  bool mapInEdgeRestToDRAM;
   bool mapOutEdgesToDRAM;
 
   // Allow mapper to print useful information to stdout
@@ -186,9 +232,14 @@ template <typename DeviceType,
     threadMem = (uint8_t**) calloc(TinselMaxThreads, sizeof(uint8_t*));
     threadMemSize = (uint32_t*) calloc(TinselMaxThreads, sizeof(uint32_t));
     threadMemBase = (uint32_t*) calloc(TinselMaxThreads, sizeof(uint32_t));
-    inEdgeMem = (uint8_t**) calloc(TinselMaxThreads, sizeof(uint8_t*));
-    inEdgeMemSize = (uint32_t*) calloc(TinselMaxThreads, sizeof(uint32_t));
-    inEdgeMemBase = (uint32_t*) calloc(TinselMaxThreads, sizeof(uint32_t));
+    inEdgeHeaderMem = (uint8_t**) calloc(TinselMaxThreads, sizeof(uint8_t*));
+    inEdgeHeaderMemSize =
+      (uint32_t*) calloc(TinselMaxThreads, sizeof(uint32_t));
+    inEdgeHeaderMemBase =
+      (uint32_t*) calloc(TinselMaxThreads, sizeof(uint32_t));
+    inEdgeRestMem = (uint8_t**) calloc(TinselMaxThreads, sizeof(uint8_t*));
+    inEdgeRestMemSize = (uint32_t*) calloc(TinselMaxThreads, sizeof(uint32_t));
+    inEdgeRestMemBase = (uint32_t*) calloc(TinselMaxThreads, sizeof(uint32_t));
     outEdgeMem = (uint8_t**) calloc(TinselMaxThreads, sizeof(uint8_t*));
     outEdgeMemSize = (uint32_t*) calloc(TinselMaxThreads, sizeof(uint32_t));
     outEdgeMemBase = (uint32_t*) calloc(TinselMaxThreads, sizeof(uint32_t));
@@ -198,7 +249,8 @@ template <typename DeviceType,
       // partition.  The total partition size is larger as it includes
       // uninitialised portions.
       uint32_t sizeVMem = 0;
-      uint32_t sizeEIMem = 0;
+      uint32_t sizeEIHeaderMem = 0;
+      uint32_t sizeEIRestMem = 0;
       uint32_t sizeEOMem = 0;
       uint32_t sizeTMem = 0;
       // Add space for thread structure (always stored in SRAM)
@@ -209,10 +261,15 @@ template <typename DeviceType,
         // Add space for device
         sizeVMem = sizeVMem + sizeof(PState<S>);
       }
-      // Add space for incoming edge table
-      if (inTable[threadId]) {
-        sizeEIMem = inTable[threadId]->numElems * sizeof(PInEdge<E>);
-        sizeEIMem = wordAlign(sizeEIMem);
+      // Add space for incoming edge tables
+      if (inTableHeaders[threadId]) {
+        sizeEIHeaderMem = inTableHeaders[threadId]->numElems *
+                            sizeof(PInHeader<E>);
+        sizeEIHeaderMem = wordAlign(sizeEIHeaderMem);
+      }
+      if (inTableRest[threadId]) {
+        sizeEIRestMem = inTableRest[threadId]->numElems * sizeof(PInEdge<E>);
+        sizeEIRestMem = wordAlign(sizeEIRestMem);
       }
       // Add space for outgoing edge table
       for (uint32_t devNum = 0; devNum < numDevs; devNum++) {
@@ -231,8 +288,10 @@ template <typename DeviceType,
       uint32_t totalSizeDRAM = 0;
       if (mapVerticesToDRAM) totalSizeDRAM += totalSizeVMem;
                         else totalSizeSRAM += totalSizeVMem;
-      if (mapInEdgesToDRAM)  totalSizeDRAM += sizeEIMem;
-                        else totalSizeSRAM += sizeEIMem;
+      if (mapInEdgeHeadersToDRAM) totalSizeDRAM += sizeEIHeaderMem;
+                             else totalSizeSRAM += sizeEIHeaderMem;
+      if (mapInEdgeRestToDRAM) totalSizeDRAM += sizeEIRestMem;
+                          else totalSizeSRAM += sizeEIRestMem;
       if (mapOutEdgesToDRAM) totalSizeDRAM += sizeEOMem;
                         else totalSizeSRAM += sizeEOMem;
       if (totalSizeDRAM > maxDRAMSize) {
@@ -246,14 +305,17 @@ template <typename DeviceType,
       // Allocate space for the initialised portion of the partition
       assert((sizeVMem%4) == 0);
       assert((sizeTMem%4) == 0);
-      assert((sizeEIMem%4) == 0);
+      assert((sizeEIHeaderMem%4) == 0);
+      assert((sizeEIRestMem%4) == 0);
       assert((sizeEOMem%4) == 0);
       vertexMem[threadId] = (uint8_t*) calloc(sizeVMem, 1);
       vertexMemSize[threadId] = sizeVMem;
       threadMem[threadId] = (uint8_t*) calloc(sizeTMem, 1);
       threadMemSize[threadId] = sizeTMem;
-      inEdgeMem[threadId] = (uint8_t*) calloc(sizeEIMem, 1);
-      inEdgeMemSize[threadId] = sizeEIMem;
+      inEdgeHeaderMem[threadId] = (uint8_t*) calloc(sizeEIHeaderMem, 1);
+      inEdgeHeaderMemSize[threadId] = sizeEIHeaderMem;
+      inEdgeRestMem[threadId] = (uint8_t*) calloc(sizeEIRestMem, 1);
+      inEdgeRestMemSize[threadId] = sizeEIRestMem;
       outEdgeMem[threadId] = (uint8_t*) calloc(sizeEOMem, 1);
       outEdgeMemSize[threadId] = sizeEOMem;
       // Tinsel address of base of partition
@@ -275,13 +337,21 @@ template <typename DeviceType,
         vertexMemBase[threadId] = sramBase;
         sramBase += totalSizeVMem;
       }
-      if (mapInEdgesToDRAM) {
-        inEdgeMemBase[threadId] = dramBase;
-        dramBase += sizeEIMem;
+      if (mapInEdgeHeadersToDRAM) {
+        inEdgeHeaderMemBase[threadId] = dramBase;
+        dramBase += sizeEIHeaderMem;
+      }
+      else {
+        inEdgeHeaderMemBase[threadId] = sramBase;
+        sramBase += sizeEIHeaderMem;
+      }
+      if (mapInEdgeRestToDRAM) {
+        inEdgeRestMemBase[threadId] = dramBase;
+        dramBase += sizeEIRestMem;
       }
       else {
-        inEdgeMemBase[threadId] = sramBase;
-        sramBase += sizeEIMem;
+        inEdgeRestMemBase[threadId] = sramBase;
+        sramBase += sizeEIRestMem;
       }
       if (mapOutEdgesToDRAM) {
         outEdgeMemBase[threadId] = dramBase;
@@ -311,7 +381,8 @@ template <typename DeviceType,
       thread->devices = vertexMemBase[threadId];
       // Set tinsel address of base of edge tables
       thread->outTableBase = outEdgeMemBase[threadId];
-      thread->inTableBase = inEdgeMemBase[threadId];
+      thread->inTableHeaderBase = inEdgeHeaderMemBase[threadId];
+      thread->inTableRestBase = inEdgeRestMemBase[threadId];
       // Add space for each device on thread
       uint32_t numDevs = numDevicesOnThread[threadId];
       for (uint32_t devNum = 0; devNum < numDevs; devNum++) {
@@ -337,11 +408,18 @@ template <typename DeviceType,
         }
       }
       // Intialise thread's in edges
-      PInEdge<E>* inEdgeArray = (PInEdge<E>*) inEdgeMem[threadId];
-      Seq<PInEdge<E>>* edges = inTable[threadId];
+      PInHeader<E>* inEdgeHeaderArray =
+        (PInHeader<E>*) inEdgeHeaderMem[threadId];
+      Seq<PInHeader<E>>* headers = inTableHeaders[threadId];
+      if (headers)
+        for (uint32_t i = 0; i < headers->numElems; i++) {
+          inEdgeHeaderArray[i] = headers->elems[i];
+        }
+      PInEdge<E>* inEdgeRestArray = (PInEdge<E>*) inEdgeRestMem[threadId];
+      Seq<PInEdge<E>>* edges = inTableRest[threadId];
       if (edges)
         for (uint32_t i = 0; i < edges->numElems; i++) {
-          inEdgeArray[i] = edges->elems[i];
+          inEdgeRestArray[i] = edges->elems[i];
         }
       // At this point, check that next pointers line up with heap sizes
       if (nextVMem != vertexMemSize[threadId]) {
@@ -368,12 +446,27 @@ template <typename DeviceType,
   // Allocate routing tables
   // (Only valid after mapper is called)
   void allocateRoutingTables() {
-    // Receiver-side tables
-    inTable = (Seq<PInEdge<E>>**)
+    // Receiver-side tables (headers)
+    inTableHeaders = (Seq<PInHeader<E>>**)
+      calloc(TinselMaxThreads,sizeof(Seq<PInHeader<E>>*));
+    for (uint32_t t = 0; t < TinselMaxThreads; t++) {
+      if (numDevicesOnThread[t] != 0)
+        inTableHeaders[t] = new SmallSeq<PInHeader<E>>;
+    }
+
+    // Receiver-side tables (rest)
+    inTableRest = (Seq<PInEdge<E>>**)
       calloc(TinselMaxThreads,sizeof(Seq<PInEdge<E>>*));
     for (uint32_t t = 0; t < TinselMaxThreads; t++) {
       if (numDevicesOnThread[t] != 0)
-        inTable[t] = new SmallSeq<PInEdge<E>>;
+        inTableRest[t] = new SmallSeq<PInEdge<E>>;
+    }
+
+    // Receiver-side tables (bitmaps)
+    inTableBitmaps = (Bitmap**) calloc(TinselMaxThreads,sizeof(Bitmap*));
+    for (uint32_t t = 0; t < TinselMaxThreads; t++) {
+      if (numDevicesOnThread[t] != 0)
+        inTableBitmaps[t] = new Bitmap;
     }
 
     // Sender-side tables
@@ -386,174 +479,232 @@ template <typename DeviceType,
     }
   }
 
-  // Pack a receivers array
-  // Input: an in-edge sequence for each thread in a mailbox.
-  // Input array may contain lots of holes (0-element sequences)
-  // Output: a sequence of receiver groups
-  // Output array contains no empty receiver groups
-  void createReceiverGroups(
-        uint32_t mbox,
-        Seq<PInEdge<E>>* receivers,
-        Seq<PReceiverGroup<E>>* groups) {
-    groups->clear();
-    for (uint32_t i = 0; i < 64; i++) {
-      if (receivers[i].numElems > 0) {
-        // Add receiver group
-        PReceiverGroup<E> g;
-        g.threadId = (mbox << TinselLogThreadsPerMailbox) | i;
-        g.receivers = &receivers[i];
-        groups->append(g);
-      }
+  // Determine local-multicast routing key for given set of receivers
+  // (The key must be the same for all receivers)
+  uint32_t findKey(uint32_t numGroups) { 
+    // Fast path (single receiver)
+    if (numGroups == 1) {
+      Bitmap* bm = inTableBitmaps[groups[0].threadId];
+      return bm->grabNextBit();
     }
-  }
 
-  // Determine routing key for given set of receivers
-  // (The key must be the same for all receivers)
-  uint32_t findKey(Seq<PReceiverGroup<E>>* receivers) { 
-    uint32_t key = 0;
-
-    bool found = false;
-    while (!found) {
-      found = true; 
-      for (uint32_t i = 0; i < receivers->numElems; i++) {
-        PReceiverGroup<E> g = receivers->elems[i];
-        uint32_t numReceivers = g.receivers->numElems;
-        if (numReceivers > 0) {
-          // Lookup thread id of receiver
-          uint32_t t = g.threadId;
-          // Lookup table size for this thread
-          uint32_t tableSize = inTable[t]->numElems;
-          // Move to next receiver when we find a space
-          if (key >= tableSize) continue;
-          // Is there space at the current key?
-          // (Need space for numReceivers plus null terminator)
-          bool space = true;
-          for (int j = 0; j < numReceivers+1; j++) {
-            if ((key+j) >= tableSize) break;
-            if (inTable[t]->elems[key+j].devId != UnusedLocalDevId) {
-              found = false;
-              key = key+j+1;
-              break;
-            }
-          }
-        }
+    // Determine starting index for key search
+    uint32_t index = 0;
+    for (uint32_t i = 0; i < numGroups; i++) {
+      PReceiverGroup<E>* g = &groups[i];
+      Bitmap* bm = inTableBitmaps[g->threadId];
+      if (bm->firstFree > index) index = bm->firstFree;
+    }
+
+    // Find key that is available for all receivers
+    uint64_t mask;
+    retry:
+      mask = 0ul;
+      for (uint32_t i = 0; i < numGroups; i++) {
+        PReceiverGroup<E>* g = &groups[i];
+        Bitmap* bm = inTableBitmaps[g->threadId];
+        mask |= bm->getWord(index);
+        if (~mask == 0ul) { index++; goto retry; }
       }
+
+    // Mark key as taken in each bitmap
+    uint32_t bit = __builtin_ctzll(~mask);
+    for (uint32_t i = 0; i < numGroups; i++) {
+      PReceiverGroup<E>* g = &groups[i];
+      Bitmap* bm = inTableBitmaps[g->threadId];
+      bm->setBit(index, bit);
     }
-    return key;
+    return 64*index + bit;
   }
 
   // Add entries to the input tables for the given receivers
   // (Only valid after mapper is called)
-  uint32_t addInTableEntries(Seq<PReceiverGroup<E>>* receivers) {
-    uint32_t key = findKey(receivers);
-    if (key >= 0xfffe) {
+  uint32_t addInTableEntries(uint32_t numGroups) {
+    uint32_t key = findKey(numGroups);
+    if (key >= 0xffff) {
       printf("Routing key exceeds 16 bits\n");
       exit(EXIT_FAILURE);
     }
-    PInEdge<E> null, unused;
-    null.devId = InvalidLocalDevId;
-    unused.devId = UnusedLocalDevId;
-    // Now that a key with sufficient space has been found, populate the tables
-    for (uint32_t i = 0; i < receivers->numElems; i++) {
-      PReceiverGroup<E> g = receivers->elems[i];
-      uint32_t numReceivers = g.receivers->numElems;
-      if (numReceivers > 0) {
-        // Lookup thread id of receiver
-        uint32_t t = g.threadId;
-        // Lookup table size for this thread
-        uint32_t tableSize = inTable[t]->numElems;
-        // Make sure inTable is big enough for new entries
-        for (uint32_t j = tableSize; j < (key+numReceivers+1); j++)
-          inTable[t]->append(unused);
-        // Add receivers to thread's inTable
-        for (uint32_t j = 0; j < numReceivers; j++) {
-          inTable[t]->elems[key+j] = g.receivers->elems[j];
+    // Populate inTableHeaders and inTableRest using the key
+    for (uint32_t i = 0; i < numGroups; i++) {
+      PReceiverGroup<E>* g = &groups[i];
+      uint32_t numEdges = g->receivers.numElems;
+      PInEdge<E>* edgePtr = g->receivers.elems;
+      if (numEdges > 0) {
+        // Determine thread id of receiver
+        uint32_t t = g->threadId;
+        // Extend table
+        Seq<PInHeader<E>>* headers = inTableHeaders[t];
+        if (key >= headers->numElems)
+          headers->extendBy(key + 1 - headers->numElems);
+        // Fill in header
+        PInHeader<E>* header = &inTableHeaders[t]->elems[key];
+        header->numReceivers = numEdges;
+        if (inTableRest[t]->numElems > 0xffff) {
+          printf("In-table index exceeds 16 bits\n");
+          exit(EXIT_FAILURE);
+        }
+        header->restIndex = inTableRest[t]->numElems;
+        uint32_t numHeaderEdges = numEdges < POLITE_EDGES_PER_HEADER ?
+          numEdges : POLITE_EDGES_PER_HEADER;
+        for (uint32_t j = 0; j < numHeaderEdges; j++) {
+          header->edges[j] = *edgePtr;
+          edgePtr++;
+        }
+        numEdges -= numHeaderEdges;
+        // Overflow into rest memory if header not big enough
+        for (uint32_t j = 0; j < numEdges; j++) {
+          inTableRest[t]->append(*edgePtr);
+          edgePtr++;
         }
-        inTable[t]->elems[key+numReceivers] = null;
       }
     }
     return key;
   }
 
+  // Split edge list into board-local and non-board-local destinations,
+  // then sort each list by destination thread id
+  // (Only valid after mapper is called)
+  void splitDests(PDeviceId devId, PinId pinId,
+                    Seq<PEdgeDest>* local, Seq<PEdgeDest>* nonLocal) {
+    local->clear();
+    nonLocal->clear();
+    PDeviceAddr devAddr = toDeviceAddr[devId];
+    uint32_t devBoard = getThreadId(devAddr) >> TinselLogThreadsPerBoard;
+    // Split destinations into local/non-local
+    Seq<PDeviceId>* dests = graph.outgoing->elems[devId];
+    Seq<PinId>* pinIds = graph.pins->elems[devId];
+    for (uint32_t d = 0; d < dests->numElems; d++) {
+      if (pinIds->elems[d] == pinId) {
+        PEdgeDest e;
+        e.index = d;
+        e.dest = dests->elems[d];
+        e.addr = toDeviceAddr[e.dest];
+        uint32_t destBoard = getThreadId(e.addr) >> TinselLogThreadsPerBoard;
+        if (devBoard == destBoard)
+          local->append(e);
+        else
+          nonLocal->append(e);
+      }
+    }
+    // Sort local list
+    qsort(local->elems, local->numElems, sizeof(PEdgeDest), cmpEdgeDest);
+    // Sort non-local list
+    qsort(nonLocal->elems, nonLocal->numElems, sizeof(PEdgeDest), cmpEdgeDest);
+  }
+
+  // Compute table updates for destinations for given device
+  // (Only valid after mapper is called)
+  void computeTables(Seq<PEdgeDest>* dests, uint32_t d,
+         Seq<PRoutingDest>* out) {
+    out->clear();
+    uint32_t index = 0;
+    while (index < dests->numElems) {
+      // New set of receiver groups on same mailbox
+      uint32_t threadMaskLow = 0;
+      uint32_t threadMaskHigh = 0;
+      uint32_t nextGroup = 0;
+      // Current mailbox & thread being considered
+      PDeviceAddr mbox = getThreadId(dests->elems[index].addr) >>
+                           TinselLogThreadsPerMailbox;
+      uint32_t thread = getThreadId(dests->elems[index].addr) &
+                          ((1<<TinselLogThreadsPerMailbox)-1);
+      // Determine edges targeting same mailbox
+      while (index < dests->numElems) {
+        PEdgeDest* edge = &dests->elems[index];
+        // Determine destination mailbox address and mailbox-local thread
+        uint32_t destMailbox = getThreadId(edge->addr) >>
+                                 TinselLogThreadsPerMailbox;
+        uint32_t destThread = getThreadId(edge->addr) &
+                                 ((1<<TinselLogThreadsPerMailbox)-1);
+        // Does destination match current destination?
+        if (destMailbox == mbox) {
+          if (destThread == thread) {
+            // Add to current receiver group
+            PInEdge<E> in;
+            in.devId = getLocalDeviceId(edge->addr);
+            Seq<E>* edges = edgeLabels.elems[d];
+            if (! std::is_same<E, None>::value)
+              in.edge = edges->elems[edge->index];
+            // Update current receiver group
+            groups[nextGroup].receivers.append(in);
+            groups[nextGroup].threadId = getThreadId(edge->addr);
+            if (thread < 32) threadMaskLow |= 1 << thread;
+            if (thread >= 32) threadMaskHigh |= 1 << (thread-32);
+            index++;
+          }
+          else {
+            // Start new receiver group
+            thread = destThread;
+            nextGroup++;
+            assert(nextGroup < TinselThreadsPerMailbox);
+          }
+        }
+        else break;
+      }
+      // Add input table entries
+      uint32_t key = addInTableEntries(nextGroup+1);
+      // Add output entry
+      PRoutingDest dest;
+      dest.kind = PRDestKindMRM;
+      dest.mbox = mbox;
+      dest.mrm.key = key;
+      dest.mrm.threadMaskLow = threadMaskLow;
+      dest.mrm.threadMaskHigh = threadMaskHigh;
+      out->append(dest);
+      // Clear receiver groups, for a new iteration
+      for (uint32_t i = 0; i <= nextGroup; i++) groups[i].receivers.clear();
+    }
+  }
+
   // Compute routing tables
   // (Only valid after mapper is called)
   void computeRoutingTables() {
-    // Routing table stats
-    uint64_t totalOutEdges = 0;
+    // Edge destinations (local to sender board, or not)
+    Seq<PEdgeDest> local;
+    Seq<PEdgeDest> nonLocal;
 
-    // Sequence of local device ids, for each multicast destiation
-    SmallSeq<PInEdge<E>> receivers[64];
+    // Routing destinations
+    Seq<PRoutingDest> dests;
 
-    // Sequence of receiver groups
-    // (A more compact representation of the receivers array)
-    SmallSeq<PReceiverGroup<E>> groups;
+    // Allocate per-board programmable routing tables
+    progRouterTables = new ProgRouterMesh(numBoardsX, numBoardsY);
 
     // For each device
     for (uint32_t d = 0; d < numDevices; d++) {
       // For each pin
       for (uint32_t p = 0; p < POLITE_NUM_PINS; p++) {
-        Seq<PDeviceId> dests = *(graph.outgoing->elems[d]);
-        Seq<E> edges = *(edgeLabels.elems[d]);
-        // While destinations are remaining
-        while (dests.numElems > 0) {
-          // Clear receivers
-          for (uint32_t i = 0; i < 64; i++) receivers[i].clear();
-          uint32_t threadMaskLow = 0;
-          uint32_t threadMaskHigh = 0;
-          // Current mailbox being considered
-          PDeviceAddr mbox = getThreadId(toDeviceAddr[dests.elems[0]]) >>
-                               TinselLogThreadsPerMailbox;
-          // For each destination
-          uint32_t destsRemaining = 0;
-          for (uint32_t i = 0; i < dests.numElems; i++) {
-            // Determine destination mailbox address and mailbox-local thread
-            PDeviceId destId = dests.elems[i];
-            PDeviceAddr destAddr = toDeviceAddr[destId];
-            uint32_t destMailbox = getThreadId(destAddr) >>
-                                     TinselLogThreadsPerMailbox;
-            uint32_t destThread = getThreadId(destAddr) &
-                                     ((1<<TinselLogThreadsPerMailbox)-1);
-            // Does destination match current destination?
-            if (destMailbox == mbox) {
-              PInEdge<E> edge;
-              edge.devId = getLocalDeviceId(destAddr);
-              if (! std::is_same<E, None>::value) edge.edge = edges.elems[i];
-              receivers[destThread].append(edge);
-              if (destThread < 32) threadMaskLow |= 1 << destThread;
-              if (destThread >= 32) threadMaskHigh |= 1 << (destThread-32);
-            }
-            else {
-              // Add destination back into sequence
-              dests.elems[destsRemaining] = dests.elems[i];
-              edges.elems[destsRemaining] = edges.elems[i];
-              destsRemaining++;
-            }
-          }
-          // Create receiver groups
-          createReceiverGroups(mbox, receivers, &groups);
-          // Add input table entries
-          uint32_t key = addInTableEntries(&groups);
-          // Add output table entry
+        // Split edge lists into local/non-local and sort by target thread id
+        splitDests(d, p, &local, &nonLocal);
+        // Deal with board-local connections
+        computeTables(&local, d, &dests);
+        for (uint32_t i = 0; i < dests.numElems; i++) {
+          PRoutingDest dest = dests.elems[i];
           POutEdge edge;
-          edge.mbox = mbox;
-          edge.key = key;
-          edge.threadMaskLow = threadMaskLow;
-          edge.threadMaskHigh = threadMaskHigh;
+          edge.mbox = dest.mbox;
+          edge.key = dest.mrm.key;
+          edge.threadMaskLow = dest.mrm.threadMaskLow;
+          edge.threadMaskHigh = dest.mrm.threadMaskHigh;
           outTable[d][p]->append(edge);
-          // Prepare for new output table entry
-          dests.numElems = destsRemaining;
-          edges.numElems = destsRemaining;
-          totalOutEdges++;
         }
-        // Add output edge terminator
+        // Deal with non-board-local connections
+        computeTables(&nonLocal, d, &dests);
+        uint32_t src = getThreadId(toDeviceAddr[d]) >>
+          TinselLogThreadsPerMailbox;
+        uint32_t key = progRouterTables->addDestsFromBoard(src, &dests);
+        POutEdge edge;
+        edge.mbox = tinselUseRoutingKey();
+        edge.key = 0;
+        edge.threadMaskLow = key;
+        edge.threadMaskHigh = 0;
+        outTable[d][p]->append(edge);
+        // Add output list terminator
         POutEdge term;
         term.key = InvalidKey;
         outTable[d][p]->append(term);
       }
     }
-    //printf("Average edges per pin: %lu\n",
-    //  totalOutEdges / (numDevices * POLITE_NUM_PINS);
-  }  
+  }
 
   // Release all structures
   void releaseAll() {
@@ -575,21 +726,38 @@ template <typename DeviceType,
       free(threadMemSize);
       free(threadMemBase);
       for (uint32_t t = 0; t < TinselMaxThreads; t++)
-        if (inEdgeMem[t] != NULL) free(inEdgeMem[t]);
-      free(inEdgeMem);
-      free(inEdgeMemSize);
-      free(inEdgeMemBase);
+        if (inEdgeHeaderMem[t] != NULL) free(inEdgeHeaderMem[t]);
+      free(inEdgeHeaderMem);
+      free(inEdgeHeaderMemSize);
+      free(inEdgeHeaderMemBase);
+      for (uint32_t t = 0; t < TinselMaxThreads; t++)
+        if (inEdgeRestMem[t] != NULL) free(inEdgeRestMem[t]);
+      free(inEdgeRestMem);
+      free(inEdgeRestMemSize);
+      free(inEdgeRestMemBase);
       for (uint32_t t = 0; t < TinselMaxThreads; t++)
         if (outEdgeMem[t] != NULL) free(outEdgeMem[t]);
       free(outEdgeMem);
       free(outEdgeMemSize);
       free(outEdgeMemBase);
     }
-    if (inTable != NULL) {
+    if (inTableHeaders != NULL) {
+      for (uint32_t t = 0; t < TinselMaxThreads; t++)
+        if (inTableHeaders[t] != NULL) delete inTableHeaders[t];
+      free(inTableHeaders);
+      inTableHeaders = NULL;
+    }
+    if (inTableRest != NULL) {
       for (uint32_t t = 0; t < TinselMaxThreads; t++)
-        if (inTable[t] != NULL) delete inTable[t];
-      free(inTable);
-      inTable = NULL;
+        if (inTableRest[t] != NULL) delete inTableRest[t];
+      free(inTableRest);
+      inTableRest = NULL;
+    }
+    if (inTableBitmaps != NULL) {
+      for (uint32_t t = 0; t < TinselMaxThreads; t++)
+        if (inTableBitmaps[t] != NULL) delete inTableBitmaps[t];
+      free(inTableBitmaps);
+      inTableBitmaps = NULL;
     }
     if (outTable != NULL) {
       for (uint32_t d = 0; d < numDevices; d++) {
@@ -601,6 +769,7 @@ template <typename DeviceType,
       free(outTable);
       outTable = NULL;
     }
+    if (progRouterTables != NULL) { delete progRouterTables; progRouterTables = NULL; }
   }
 
   // Implement mapping to tinsel threads
@@ -627,6 +796,7 @@ template <typename DeviceType,
     boards.place(placerEffort);
 
     // For each board
+    #pragma omp parallel for collapse(2)
     for (uint32_t boardY = 0; boardY < numBoardsY; boardY++) {
       for (uint32_t boardX = 0; boardX < numBoardsX; boardX++) {
         // Partition into subgraphs, one per mailbox
@@ -810,8 +980,11 @@ template <typename DeviceType,
     hostLink->useSendBuffer = true;
     writeRAM(hostLink, vertexMem, vertexMemSize, vertexMemBase);
     writeRAM(hostLink, threadMem, threadMemSize, threadMemBase);
-    writeRAM(hostLink, inEdgeMem, inEdgeMemSize, inEdgeMemBase);
+    writeRAM(hostLink, inEdgeHeaderMem,
+               inEdgeHeaderMemSize, inEdgeHeaderMemBase);
+    writeRAM(hostLink, inEdgeRestMem, inEdgeRestMemSize, inEdgeRestMemBase);
     writeRAM(hostLink, outEdgeMem, outEdgeMemSize, outEdgeMemBase);
+    progRouterTables->write(hostLink);
     hostLink->flush();
     hostLink->useSendBuffer = useSendBufferOld;
 
@@ -835,7 +1008,6 @@ template <typename DeviceType,
   uint32_t fanOut(PDeviceId id) {
     return graph.fanOut(id);
   }
-
 };
 
 // Read performance stats and store in file
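Reviewer note (not part of the patch): the mailbox grouping that computeTables performs on a sorted destination list can be sketched in isolation. This is a simplified sketch under stated assumptions: `kLogThreadsPerMailbox`, `MailboxDest`, and `groupByMailbox` are hypothetical names, the value 6 is assumed for illustration, and the per-thread receiver groups and input-table keys that the patch also builds are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed value for illustration; the real constant comes from Tinsel's config
const uint32_t kLogThreadsPerMailbox = 6;

// One multicast destination: a mailbox plus a 64-bit mask of its threads
struct MailboxDest {
  uint32_t mbox;
  uint32_t threadMaskLow;
  uint32_t threadMaskHigh;
};

// Group thread ids (sorted ascending) into one entry per mailbox
std::vector<MailboxDest> groupByMailbox(const std::vector<uint32_t>& threadIds) {
  std::vector<MailboxDest> out;
  for (uint32_t id : threadIds) {
    uint32_t mbox = id >> kLogThreadsPerMailbox;
    uint32_t thread = id & ((1u << kLogThreadsPerMailbox) - 1);
    // The input must be sorted, so mailbox numbers never decrease
    assert(out.empty() || out.back().mbox <= mbox);
    // Start a new entry whenever the mailbox changes
    if (out.empty() || out.back().mbox != mbox)
      out.push_back(MailboxDest{mbox, 0, 0});
    // Set the thread's bit in the low or high half of the mask
    if (thread < 32) out.back().threadMaskLow |= 1u << thread;
    else out.back().threadMaskHigh |= 1u << (thread - 32);
  }
  return out;
}
```

Because the list is sorted first (as splitDests does with qsort), a single pass is enough; the removed code instead rescanned the remaining destinations once per mailbox.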
diff --git a/include/POLite/Placer.h b/include/POLite/Placer.h
index 32aec831..57c00444 100644
--- a/include/POLite/Placer.h
+++ b/include/POLite/Placer.h
@@ -5,11 +5,23 @@
 #include <stdint.h>
 #include <metis.h>
 #include <POLite/Graph.h>
+#include <queue>
+#include <omp.h>
 
 typedef uint32_t PartitionId;
 
 // Partition and place a graph on a 2D mesh
 struct Placer {
+  // Select between different methods
+  enum Method {
+    Default,
+    Metis,
+    Random,
+    Direct,
+    BFS
+  };
+  const Method defaultMethod=Metis;
+
   // The graph being placed
   Graph* graph;
 
@@ -41,8 +53,40 @@ struct Placer {
   uint32_t* yCoordSaved;
   uint64_t savedCost;
 
+  // Random numbers
+  unsigned int seed;
+  void setRand(unsigned int s) { seed = s; }
+  int getRand() { return rand_r(&seed); }
+
+  // Controls which strategy is used
+  Method method = Default;
+
+  // Select placer method
+  void chooseMethod()
+  {
+    auto e = getenv("POLITE_PLACER");
+    if (e) {
+      if (!strcmp(e, "metis"))
+        method=Metis;
+      else if (!strcmp(e, "random"))
+        method=Random;
+      else if (!strcmp(e, "direct"))
+        method=Direct;
+      else if (!strcmp(e, "bfs"))
+        method=BFS;
+      else if (!strcmp(e, "default") || *e == '\0')
+        method=Default;
+      else {
+        fprintf(stderr, "Don't understand placer method: %s\n", e);
+        exit(EXIT_FAILURE);
+      }
+    }
+    if (method == Default)
+      method = defaultMethod;
+  }
+
   // Partition the graph using Metis
-  void partition() {
+  void partitionMetis() {
     // Compute total number of edges
     uint32_t numEdges = 0;
     for (uint32_t i = 0; i < graph->incoming->numElems; i++) {
@@ -116,6 +160,96 @@ struct Placer {
     free(parts);
   }
 
+  // Partition the graph randomly
+  void partitionRandom() {
+    uint32_t numVertices = graph->incoming->numElems;
+    uint32_t numParts = width * height;
+
+    // Populate result array
+    for (uint32_t i = 0; i < numVertices; i++) {
+      partitions[i] = getRand() % numParts;
+    }
+  }
+
+  // Partition the graph using direct mapping
+  void partitionDirect() {
+    uint32_t numVertices = graph->incoming->numElems;
+    uint32_t numParts = width * height;
+    uint32_t partSize = (numVertices + numParts) / numParts;
+
+    // Populate result array
+    for (uint32_t i = 0; i < numVertices; i++) {
+      partitions[i] = i / partSize;
+    }
+  }
+
+  // Partition the graph using repeated BFS
+  void partitionBFS() {
+    uint32_t numVertices = graph->incoming->numElems;
+    uint32_t numParts = width * height;
+    uint32_t partSize = (numVertices + numParts) / numParts;
+
+    // Visited bit for each vertex
+    bool* seen = new bool [numVertices];
+    memset(seen, 0, numVertices);
+
+    // Next vertex to visit
+    uint32_t nextUnseen = 0;
+
+    // Next partition id
+    uint32_t nextPart = 0;
+
+    while (nextUnseen < numVertices) {
+      // Frontier
+      std::queue<uint32_t> frontier;
+      uint32_t count = 0;
+
+      while (nextUnseen < numVertices && count < partSize) {
+        // Size-bounded BFS from nextUnseen
+        frontier.push(nextUnseen);
+        while (count < partSize && !frontier.empty()) {
+          uint32_t v = frontier.front();
+          frontier.pop();
+          if (!seen[v]) {
+            seen[v] = true;
+            partitions[v] = nextPart;
+            count++;
+            // Add unvisited neighbours of v to the frontier
+            Seq<uint32_t>* dests = graph->outgoing->elems[v];
+            for (uint32_t i = 0; i < dests->numElems; i++) {
+              uint32_t w = dests->elems[i];
+              if (!seen[w]) frontier.push(w);
+            }
+          }
+        }
+        while (nextUnseen < numVertices && seen[nextUnseen]) nextUnseen++;
+      }
+
+      nextPart++;
+    }
+
+    delete [] seen;
+  }
+
+  void partition()
+  {
+    switch(method){
+    case Default:
+    case Metis:
+      partitionMetis();
+      break;
+    case Random:
+      partitionRandom();
+      break;
+    case Direct:
+      partitionDirect();
+      break;
+    case BFS:
+      partitionBFS();
+      break;
+    }
+  }
+
   // Create subgraph for each partition
   void computeSubgraphs() {
     uint32_t numPartitions = width*height;
@@ -179,7 +313,7 @@ struct Placer {
     // Random mapping
     for (uint32_t y = 0; y < height; y++) {
       for (uint32_t x = 0; x < width; x++) {
-        int index = rand() % numPartitions;
+        int index = getRand() % numPartitions;
         PartitionId p = pids[index];
         mapping[y][x] = p;
         xCoord[p] = x;
@@ -295,6 +429,8 @@ struct Placer {
     graph = g;
     width = w;
     height = h;
+    // Random seed
+    setRand(1 + omp_get_thread_num());
     // Allocate the partitions array
     partitions = new PartitionId [g->incoming->numElems];
     // Allocate subgraphs
@@ -316,6 +452,8 @@ struct Placer {
     yCoord = new uint32_t [width*height];
     xCoordSaved = new uint32_t [width*height];
     yCoordSaved = new uint32_t [width*height];
+    // Pick a placement method, or select default
+    chooseMethod();
     // Partition the graph using Metis
     partition();
     // Compute subgraphs, one per partition
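Reviewer note (not part of the patch): the repeated size-bounded BFS used by partitionBFS can be tried on a plain adjacency list. This is a minimal sketch, assuming a hypothetical `AdjList` type in place of POLite's Graph and a hypothetical free function `bfsPartition`; the loop structure and the partSize formula follow the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical stand-in for POLite's Graph: adj[v] lists out-neighbours of v
using AdjList = std::vector<std::vector<uint32_t>>;

// Assign each vertex a partition id by repeated size-bounded BFS
std::vector<uint32_t> bfsPartition(const AdjList& adj, uint32_t numParts) {
  assert(numParts > 0);
  const uint32_t numVertices = static_cast<uint32_t>(adj.size());
  // Same size bound as the patch
  const uint32_t partSize = (numVertices + numParts) / numParts;
  std::vector<uint32_t> part(numVertices, 0);
  std::vector<bool> seen(numVertices, false);
  uint32_t nextUnseen = 0, nextPart = 0;
  while (nextUnseen < numVertices) {
    // Fresh frontier for each partition
    std::queue<uint32_t> frontier;
    uint32_t count = 0;
    while (nextUnseen < numVertices && count < partSize) {
      // Restart the search from the lowest unvisited vertex
      frontier.push(nextUnseen);
      while (count < partSize && !frontier.empty()) {
        uint32_t v = frontier.front();
        frontier.pop();
        if (seen[v]) continue;
        seen[v] = true;
        part[v] = nextPart;
        count++;
        for (uint32_t w : adj[v])
          if (!seen[w]) frontier.push(w);
      }
      while (nextUnseen < numVertices && seen[nextUnseen]) nextUnseen++;
    }
    nextPart++;
  }
  return part;
}
```

For a directed chain 0 -> 1 -> 2 -> 3 split two ways, partSize is 3, so the first three vertices land in partition 0 and the last in partition 1.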
diff --git a/include/POLite/ProgRouters.h b/include/POLite/ProgRouters.h
new file mode 100644
index 00000000..9890c43e
--- /dev/null
+++ b/include/POLite/ProgRouters.h
@@ -0,0 +1,413 @@
+// SPDX-License-Identifier: BSD-2-Clause
+#ifndef _PROGROUTERS_H_
+#define _PROGROUTERS_H_
+
+#include <assert.h>
+#include <config.h>
+#include <HostLink.h>
+#include <POLite.h>
+#include <POLite/Seq.h>
+#include <boot.h>
+
+// =============================
+// Per-board programmable router
+// =============================
+
+class ProgRouter {
+
+  // Number of chunks used so far in current beat
+  uint32_t numChunks;
+
+  // Number of records used so far in current beat
+  uint32_t numRecords;
+
+  // Number of beats associated with current key
+  uint32_t numBeats;
+
+  // Index of RAM currently being used
+  uint32_t currentRAM;
+
+  // Pointer to previously created indirection
+  // (We need indirections to handle record sequences of 31 beats or more)
+  uint8_t* prevInd;
+
+  // Move on to the next beat
+  void nextBeat() {
+    // Set number of records in current beat
+    uint32_t beatBase = table[currentRAM]->numElems - 32;
+    uint8_t* beat = &table[currentRAM]->elems[beatBase];
+    beat[31] = 0;
+    beat[30] = numRecords;
+    numChunks = numRecords = 0;
+    // Allocate new beat, and check for overflow
+    numBeats++;
+    table[currentRAM]->extendBy(32);
+    if (table[currentRAM]->numElems >= (TinselPOLiteProgRouterLength-1024)) {
+      printf("ProgRouter out of memory\n");
+      exit(EXIT_FAILURE);
+    }
+    // We need indirections to handle sequences of 31 beats or more
+    if ((numBeats % 31) == 0) {
+      // Set previous indirection, if there is one
+      if (prevInd) {
+        uint32_t key = TinselPOLiteProgRouterBase +
+                         table[currentRAM]->numElems - 31*32;
+        if (currentRAM) key |= 0x80000000;
+        key |= 31;
+        setIND(prevInd, key);
+      }
+      prevInd = addIND();
+    }
+  }
+
+  // Get current record pointer for 48-bit entry
+  inline uint8_t* currentRecord48() {
+    uint32_t beatBase = (table[currentRAM]->numElems-32) + 6*(4-numChunks);
+    return &table[currentRAM]->elems[beatBase];
+  }
+
+  // Get current record pointer for 96-bit entry
+  inline uint8_t* currentRecord96() {
+    uint32_t beatBase = (table[currentRAM]->numElems-32) + 6*(3-numChunks);
+    return &table[currentRAM]->elems[beatBase];
+  }
+
+ public:
+
+  // A table holding encoded routing beats for each RAM
+  Seq<uint8_t>** table;
+
+  // Constructor
+  ProgRouter() {
+    // Currently we assume two RAMs per board
+    assert(TinselDRAMsPerBoard == 2);
+    // Initialise member variables
+    prevInd = NULL;
+    numBeats = 1;
+    numChunks = numRecords = currentRAM = 0;
+    // Allocate one sequence per RAM
+    table = new Seq<uint8_t>* [TinselDRAMsPerBoard];
+    // Initially each sequence is 32KB
+    for (int i = 0; i < TinselDRAMsPerBoard; i++) {
+      table[i] = new Seq<uint8_t> (1 << 15);
+      // Allocate first beat
+      table[i]->extendBy(32);
+    }
+  }
+
+  // Destructor
+  ~ProgRouter() {
+    for (int i = 0; i < TinselDRAMsPerBoard; i++) delete table[i];
+    delete [] table;
+  }
+
+  // Generate a new key for the records added
+  uint32_t genKey() {
+    // Determine index of first beat in record sequence
+    uint32_t index = table[currentRAM]->numElems - numBeats*32;
+    // Determine final key length
+    uint32_t finalKeyLen = prevInd ? 31 : numBeats;
+    // Insert outstanding indirection, if there is one
+    if (prevInd) {
+      // Set previous indirection to latest block of beats
+      uint32_t indKey = TinselPOLiteProgRouterBase +
+        table[currentRAM]->numElems - (numBeats%31)*32;
+      if (currentRAM) indKey |= 0x80000000;
+      indKey |= (numBeats%31);
+      setIND(prevInd, indKey);
+    }
+    // Determine final key
+    uint32_t key = TinselPOLiteProgRouterBase + index;
+    if (currentRAM) key |= 0x80000000;
+    key |= finalKeyLen;
+    // Move to next beat
+    nextBeat();
+    numBeats = 1;
+    prevInd = NULL;
+    // Pick smaller RAM for next key
+    currentRAM = table[0]->numElems < table[1]->numElems ? 0 : 1;
+    return key;
+  }
+
+  // Add an IND record to the table
+  // Return a pointer to the indirection key,
+  // so it can be set later by the caller
+  uint8_t* addIND() {
+    if (numChunks == 5) nextBeat();
+    uint8_t* ptr = currentRecord48();
+    ptr[5] = 4 << 5;
+    numChunks++;
+    numRecords++;
+    return ptr;
+  }
+
+  // Set indirection key
+  void setIND(uint8_t* ind, uint32_t key) {
+    ind[0] = key;
+    ind[1] = key >> 8;
+    ind[2] = key >> 16;
+    ind[3] = key >> 24;
+  }
+
+  // Add an MRM record to the table
+  void addMRM(uint32_t mboxX, uint32_t mboxY,
+                uint32_t threadsHigh, uint32_t threadsLow,
+                  uint16_t localKey) {
+    if (numChunks >= 4) nextBeat();
+    uint8_t* ptr = currentRecord96();
+    ptr[0] = threadsLow;
+    ptr[1] = threadsLow >> 8;
+    ptr[2] = threadsLow >> 16;
+    ptr[3] = threadsLow >> 24;
+    ptr[4] = threadsHigh;
+    ptr[5] = threadsHigh >> 8;
+    ptr[6] = threadsHigh >> 16;
+    ptr[7] = threadsHigh >> 24;
+    ptr[8] = localKey;
+    ptr[9] = localKey >> 8;
+    ptr[11] = (3 << 5) | (mboxY << 3) | (mboxX << 1);
+    numChunks += 2;
+    numRecords++;
+  }
+
+  // Add an RR record to the table
+  void addRR(uint32_t dir, uint32_t key) {
+    if (numChunks == 5) nextBeat();
+    uint8_t* ptr = currentRecord48();
+    ptr[0] = key;
+    ptr[1] = key >> 8;
+    ptr[2] = key >> 16;
+    ptr[3] = key >> 24;
+    ptr[5] = (2 << 5) | (dir << 3);
+    numChunks++;
+    numRecords++;
+  }
+
+  // Add a URM1 record to the table
+  void addURM1(uint32_t mboxX, uint32_t mboxY,
+                 uint32_t threadId, uint32_t key) {
+    if (numChunks == 5) nextBeat();
+    uint8_t* ptr = currentRecord48();
+    ptr[0] = key;
+    ptr[1] = key >> 8;
+    ptr[2] = key >> 16;
+    ptr[3] = key >> 24;
+    ptr[4] = (threadId << 3);
+    ptr[5] = (mboxY << 3) | (mboxX << 1) | (threadId >> 5);
+    numChunks++;
+    numRecords++;
+  }
+};
+
+// ==================================
+// Data type for routing destinations
+// ==================================
+
+enum PRoutingDestKind { PRDestKindURM1, PRDestKindMRM };
+
+// URM1 routing destination
+struct PRoutingDestURM1 {
+  // Mailbox-local thread
+  uint16_t threadId;
+  // Thread-local routing key
+  uint32_t key;
+};
+
+// MRM routing destination
+struct PRoutingDestMRM {
+  // Thread-local routing key
+  uint16_t key;
+  // Destination threads
+  uint32_t threadMaskLow;
+  uint32_t threadMaskHigh;
+};
+
+// Routing destination
+struct PRoutingDest {
+  PRoutingDestKind kind;
+  // Destination mailbox
+  uint32_t mbox;
+  // URM1 or MRM destination
+  union {
+    PRoutingDestURM1 urm1;
+    PRoutingDestMRM mrm;
+  };
+};
+
+// Extract board X coord from routing dest
+inline uint32_t destX(uint32_t mbox) {
+  uint32_t x = mbox >> (TinselMailboxMeshXBits + TinselMailboxMeshYBits);
+  return x & ((1<<TinselMeshXBits) - 1);
+}
+
+// Extract board Y coord from routing dest
+inline uint32_t destY(uint32_t mbox) {
+  uint32_t y = mbox >> (TinselMailboxMeshXBits +
+                 TinselMailboxMeshYBits + TinselMeshXBits);
+  return y & ((1<<TinselMeshYBits) - 1);
+}
+
+// Extract board-local mailbox X coord from routing dest
+inline uint32_t destMboxX(uint32_t mbox) {
+  return mbox & ((1<<TinselMailboxMeshXBits) - 1);
+}
+
+// Extract board-local mailbox Y coord from routing dest
+inline uint32_t destMboxY(uint32_t mbox) {
+  return (mbox >> TinselMailboxMeshXBits) &
+           ((1<<TinselMailboxMeshYBits) - 1);
+}
+
+// ============================
+// Mesh of programmable routers
+// ============================
+
+class ProgRouterMesh {
+  // Board mesh dimensions
+  uint32_t boardsX;
+  uint32_t boardsY;
+
+ public:
+  // 2D array of tables
+  ProgRouter** table;
+
+  // Constructor
+  ProgRouterMesh(uint32_t numBoardsX, uint32_t numBoardsY) {
+    boardsX = numBoardsX;
+    boardsY = numBoardsY;
+    table = new ProgRouter* [numBoardsY];
+    for (int y = 0; y < numBoardsY; y++)
+      table[y] = new ProgRouter [numBoardsX];
+  }
+
+  // Add routing destinations from given sender board
+  // Returns routing key
+  uint32_t addDestsFromBoardXY(uint32_t senderX, uint32_t senderY,
+                                 Seq<PRoutingDest>* dests) {
+    if (dests->numElems == 0) return 0;
+
+    // Categorise dests into local, N, S, E, and W groups
+    Seq<PRoutingDest> local(dests->numElems);
+    Seq<PRoutingDest> north(dests->numElems);
+    Seq<PRoutingDest> south(dests->numElems);
+    Seq<PRoutingDest> east(dests->numElems);
+    Seq<PRoutingDest> west(dests->numElems);
+    for (int i = 0; i < dests->numElems; i++) {
+      PRoutingDest dest = dests->elems[i];
+      uint32_t receiverX = destX(dest.mbox);
+      uint32_t receiverY = destY(dest.mbox);
+      if (receiverX < senderX) west.append(dest);
+      else if (receiverX > senderX) east.append(dest);
+      else if (receiverY < senderY) south.append(dest);
+      else if (receiverY > senderY) north.append(dest);
+      else local.append(dest);
+    }
+
+    // Recurse on non-local groups and add RR records on return
+    if (north.numElems > 0) {
+      uint32_t key = addDestsFromBoardXY(senderX, senderY+1, &north);
+      table[senderY][senderX].addRR(0, key);
+    }
+    if (south.numElems > 0) {
+      uint32_t key = addDestsFromBoardXY(senderX, senderY-1, &south);
+      table[senderY][senderX].addRR(1, key);
+    }
+    if (east.numElems > 0) {
+      uint32_t key = addDestsFromBoardXY(senderX+1, senderY, &east);
+      table[senderY][senderX].addRR(2, key);
+    }
+    if (west.numElems > 0) {
+      uint32_t key = addDestsFromBoardXY(senderX-1, senderY, &west);
+      table[senderY][senderX].addRR(3, key);
+    }
+
+    // Add local records
+    for (int i = 0; i < local.numElems; i++) {
+      PRoutingDest dest = local.elems[i];
+      if (dest.kind == PRDestKindMRM) {
+        table[senderY][senderX].addMRM(destMboxX(dest.mbox),
+          destMboxY(dest.mbox), dest.mrm.threadMaskHigh,
+          dest.mrm.threadMaskLow, dest.mrm.key);
+      }
+      else if (dest.kind == PRDestKindURM1) {
+        table[senderY][senderX].addURM1(destMboxX(dest.mbox),
+          destMboxY(dest.mbox), dest.urm1.threadId, dest.urm1.key);
+      }
+      else {
+        fprintf(stderr, "ProgRouters.h: unknown routing record kind\n");
+        exit(EXIT_FAILURE);
+      }
+    }
+
+    return table[senderY][senderX].genKey();
+  }
+
+  // Add routing destinations from given global mailbox id
+  uint32_t addDestsFromBoard(uint32_t mbox, Seq<PRoutingDest>* dests) {
+    return addDestsFromBoardXY(destX(mbox), destY(mbox), dests);
+  }
+
+  // Write routing tables to memory via HostLink
+  void write(HostLink* hostLink) {
+    // Request to boot loader
+    BootReq req;
+
+    // Compute number of cores per DRAM
+    const uint32_t coresPerDRAM = 1 <<
+      (TinselLogCoresPerDCache + TinselLogDCachesPerDRAM);
+
+    // Initialise write address for each routing table
+    for (int y = 0; y < boardsY; y++) {
+      for (int x = 0; x < boardsX; x++) {
+        for (int i = 0; i < TinselDRAMsPerBoard; i++) {
+          // Use one core to initialise each DRAM
+          uint32_t dest = hostLink->toAddr(x, y, coresPerDRAM * i, 0);
+          req.cmd = SetAddrCmd;
+          req.numArgs = 1;
+          req.args[0] = TinselPOLiteProgRouterBase;
+          hostLink->send(dest, 1, &req);
+          // Ensure space for an extra 32 bytes in each
+          // table so we don't have to check for overflow below
+          // when consuming the tables in chunks of 12 bytes
+          table[y][x].table[i]->ensureSpaceFor(32);
+        }
+      }
+    }
+
+    // Write each routing table
+    bool allDone = false;
+    uint32_t offset = 0;
+    while (! allDone) {
+      allDone = true;
+      for (int y = 0; y < boardsY; y++) {
+        for (int x = 0; x < boardsX; x++) {
+          for (int i = 0; i < TinselDRAMsPerBoard; i++) {
+            Seq<uint8_t>* seq = table[y][x].table[i];
+            if (offset < seq->numElems) {
+              uint32_t dest = hostLink->toAddr(x, y, coresPerDRAM * i, 0);
+              uint8_t* base = &seq->elems[offset];
+              allDone = false;
+              req.cmd = StoreCmd;
+              req.numArgs = 3;
+              req.args[0] = ((uint32_t*) base)[0];
+              req.args[1] = ((uint32_t*) base)[1];
+              req.args[2] = ((uint32_t*) base)[2];
+              hostLink->send(dest, 1, &req);
+            }
+          }
+        }
+      }
+      offset += 12;
+    }
+  }
+
+  // Destructor
+  ~ProgRouterMesh() {
+     for (int y = 0; y < boardsY; y++)
+       delete [] table[y];
+     delete [] table;
+  }
+};
+
+
+#endif
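Reviewer note (not part of the patch): addDestsFromBoardXY compares X coordinates before Y coordinates, so a message travels east or west until the column matches and only then north or south. The sketch below traces that order for a single destination. `Hop`, `nextHop`, and `route` are hypothetical names introduced here; the comparisons mirror the ones in the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class Hop { Local, North, South, East, West };

// Same comparison order as the patch: resolve X first, then Y
Hop nextHop(uint32_t senderX, uint32_t senderY,
            uint32_t receiverX, uint32_t receiverY) {
  if (receiverX < senderX) return Hop::West;
  if (receiverX > senderX) return Hop::East;
  if (receiverY < senderY) return Hop::South;
  if (receiverY > senderY) return Hop::North;
  return Hop::Local;
}

// List the hops a message takes from board (x, y) to the receiver board
std::vector<Hop> route(uint32_t x, uint32_t y,
                       uint32_t receiverX, uint32_t receiverY) {
  std::vector<Hop> hops;
  for (;;) {
    Hop h = nextHop(x, y, receiverX, receiverY);
    if (h == Hop::Local) break;
    hops.push_back(h);
    switch (h) {
      case Hop::West:  assert(x > 0); x--; break;
      case Hop::East:  x++; break;
      case Hop::South: assert(y > 0); y--; break;
      case Hop::North: y++; break;
      case Hop::Local: break;
    }
  }
  return hops;
}
```

In the patch each hop corresponds to one RR record added on the way back from the recursion, and the Local case corresponds to the MRM or URM1 records written on the destination board.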
diff --git a/include/POLite/Seq.h b/include/POLite/Seq.h
index b6cb61f1..23a7616c 100644
--- a/include/POLite/Seq.h
+++ b/include/POLite/Seq.h
@@ -45,12 +45,26 @@ template <class T> class Seq
       elems = newElems;
     }
 
+    // Extend size of sequence by N
+    void extendBy(int n)
+    {
+      numElems += n;
+      if (numElems > maxElems)
+        setCapacity(numElems*2);
+    }
+
     // Extend size of sequence by one
     void extend()
     {
-      numElems++;
-      if (numElems > maxElems)
-        setCapacity(maxElems*2);
+      extendBy(1);
+    }
+
+    // Ensure space for a further N elements
+    void ensureSpaceFor(int n)
+    {
+      int newNumElems = numElems + n;
+      if (newNumElems > maxElems)
+        setCapacity(newNumElems*2);
     }
 
     // Append
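Reviewer note (not part of the patch): the difference between extendBy and ensureSpaceFor is that the first changes numElems while the second only reserves capacity. The sketch below shows this on a hypothetical cut-down `MiniSeq`, which is not the POLite class. It grows the buffer before bumping the count so that the copy never reads past the old buffer; whether the real setCapacity needs that ordering is not visible in this hunk.

```cpp
#include <cassert>

// Hypothetical cut-down Seq<T>, for illustration only
template <class T> struct MiniSeq {
  int maxElems;   // Allocated capacity
  int numElems;   // Elements in use
  T* elems;

  explicit MiniSeq(int initialSize)
    : maxElems(initialSize), numElems(0), elems(new T[initialSize]()) {}
  ~MiniSeq() { delete [] elems; }
  MiniSeq(const MiniSeq&) = delete;
  MiniSeq& operator=(const MiniSeq&) = delete;

  // Reallocate to hold n elements, preserving the current contents
  void setCapacity(int n) {
    assert(n >= numElems);
    T* newElems = new T[n]();
    for (int i = 0; i < numElems; i++) newElems[i] = elems[i];
    delete [] elems;
    elems = newElems;
    maxElems = n;
  }

  // Reserve room for n more elements without changing numElems
  void ensureSpaceFor(int n) {
    int newNumElems = numElems + n;
    if (newNumElems > maxElems) setCapacity(newNumElems * 2);
  }

  // Grow the sequence by n value-initialised elements
  void extendBy(int n) {
    ensureSpaceFor(n);
    numElems += n;
  }
};
```

ProgRouterMesh::write relies on the reserve-only behaviour: it asks for 32 spare bytes so that reading the table in 12-byte chunks cannot run off the end of the buffer.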
diff --git a/include/tinsel-interface.h b/include/tinsel-interface.h
index 93b5ec96..21dfdfcb 100644
--- a/include/tinsel-interface.h
+++ b/include/tinsel-interface.h
@@ -166,7 +166,7 @@ INLINE uint32_t tinselAccId(
            uint32_t tileX, uint32_t tileY)
 {
   uint32_t addr;
-  addr = 0x4;
+  addr = 0x8;
   addr = (addr << TinselMeshYBits) | boardY;
   addr = (addr << TinselMeshXBits) | boardX;
   addr = (addr << TinselMailboxMeshYBits) | tileY;
@@ -175,4 +175,13 @@ INLINE uint32_t tinselAccId(
   return addr;
 }
 
+// Special address to signify use of routing key
+INLINE uint32_t tinselUseRoutingKey()
+{
+  // Special address to signify use of routing key
+  return 1 <<
+    (TinselMailboxMeshYBits + TinselMailboxMeshXBits +
+     TinselMeshXBits + TinselMeshYBits + 2);
+}
+
 #endif
diff --git a/include/tinsel.h b/include/tinsel.h
index 9ebd8451..0b88844d 100644
--- a/include/tinsel.h
+++ b/include/tinsel.h
@@ -28,13 +28,15 @@
 #define CSR_FLUSH       "0xc01"
 
 // Performance counter CSRs
-#define CSR_PERFCOUNT     "0xc07"
-#define CSR_MISSCOUNT     "0xc08"
-#define CSR_HITCOUNT      "0xc09"
-#define CSR_WBCOUNT       "0xc0a"
-#define CSR_CPUIDLECOUNT  "0xc0b"
-#define CSR_CPUIDLECOUNTU "0xc0c"
-#define CSR_CYCLEU        "0xc0d"
+#define CSR_PERFCOUNT           "0xc07"
+#define CSR_MISSCOUNT           "0xc08"
+#define CSR_HITCOUNT            "0xc09"
+#define CSR_WBCOUNT             "0xc0a"
+#define CSR_CPUIDLECOUNT        "0xc0b"
+#define CSR_CPUIDLECOUNTU       "0xc0c"
+#define CSR_CYCLEU              "0xc0d"
+#define CSR_PROGROUTERSENT      "0xc0e"
+#define CSR_PROGROUTERSENTINTER "0xc0f"
 
 // Get globally unique thread id of caller
 INLINE uint32_t tinselId()
@@ -127,6 +129,18 @@ INLINE volatile void* tinselSendSlot()
   return mb_scratchpad_base + (threadId << TinselLogBytesPerMsg);
 }
 
+// Get pointer to thread's extra message slot reserved for sending
+// (Assumes that HostLink has requested the extra slot)
+INLINE volatile void* tinselSendSlotExtra()
+{
+  volatile char* mb_scratchpad_base =
+    (volatile char*) (1 << TinselLogBytesPerMailbox);
+  uint32_t threadId = tinselId() &
+    ((1<<TinselLogThreadsPerMailbox) - 1);
+  return mb_scratchpad_base +
+           ((TinselThreadsPerMailbox+threadId) << TinselLogBytesPerMsg);
+}
+
 // Determine if calling thread can send a message
 INLINE int tinselCanSend()
 {
@@ -176,6 +190,12 @@ INLINE void tinselSend(int dest, volatile void* addr)
   tinselMulticast(dest >> 6, high, low, addr);
 }
 
+// Send message at addr using given routing key
+INLINE void tinselKeySend(int key, volatile void* addr)
+{
+  tinselMulticast(tinselUseRoutingKey(), 0, key, addr);
+}
+
 // Receive message
 INLINE volatile void* tinselRecv()
 {
@@ -270,7 +290,7 @@ INLINE uint32_t tinselWritebackCount()
   return n;
 }
 
-// Performance counter:: get the CPU-idle count
+// Performance counter: get the CPU-idle count
 INLINE uint32_t tinselCPUIdleCount()
 {
   uint32_t n;
@@ -294,6 +314,22 @@ INLINE uint32_t tinselCycleCountU()
   return n;
 }
 
+// Performance counter: number of messages emitted by ProgRouter
+INLINE uint32_t tinselProgRouterSent()
+{
+  uint32_t n;
+  asm volatile ("csrrw %0, " CSR_PROGROUTERSENT ", zero" : "=r"(n));
+  return n;
+}
+
+// Performance counter: number of inter-board messages emitted by ProgRouter
+INLINE uint32_t tinselProgRouterSentInterBoard()
+{
+  uint32_t n;
+  asm volatile ("csrrw %0, " CSR_PROGROUTERSENTINTER ", zero" : "=r"(n));
+  return n;
+}
+
 // Get address of any specified host
 // (This Y coordinate specifies the row of the FPGA mesh that the
 // host is connected to, and the X coordinate specifies whether it is
diff --git a/rtl/Connections.bsv b/rtl/Connections.bsv
new file mode 100644
index 00000000..7f542acc
--- /dev/null
+++ b/rtl/Connections.bsv
@@ -0,0 +1,151 @@
+package Connections;
+
+import Vector      :: *;
+import OffChipRAM  :: *;
+import Interface   :: *;
+import DRAM        :: *;
+import Queue       :: *;
+import DCache      :: *;
+import DCacheTypes :: *;
+import Util        :: *;
+import ProgRouter  :: *;
+import Core        :: *;
+
+// ============================================================================
+// DCache <-> Core connections
+// ============================================================================
+
+module connectCoresToDCache#(
+         Vector#(`CoresPerDCache, DCacheClient) clients,
+         DCache dcache) ();
+
+  // Connect requests
+  function getDCacheReqOut(client) = client.dcacheReqOut;
+  let dcacheReqs <- mkMergeTree(Fair,
+                      mkUGShiftQueue1(QueueOptFmax),
+                      map(getDCacheReqOut, clients));
+  connectUsing(mkUGQueue, dcacheReqs, dcache.reqIn);
+
+  // Connect responses
+  function Bit#(`LogCoresPerDCache) getDCacheRespKey(DCacheResp resp) =
+    truncateLSB(resp.id);
+  function getDCacheRespIn(client) = client.dcacheRespIn;
+  let dcacheResps <- mkResponseDistributor(
+                      getDCacheRespKey,
+                      mkUGShiftQueue1(QueueOptFmax),
+                      map(getDCacheRespIn, clients));
+  connectDirect(dcache.respOut, dcacheResps);
+
+  // Connect performance-counter wires
+  rule connectPerfCountWires;
+    clients[0].incMissCount(dcache.incMissCount);
+    clients[0].incHitCount(dcache.incHitCount);
+    clients[0].incWritebackCount(dcache.incWritebackCount);
+    for (Integer i = 1; i < `CoresPerDCache; i=i+1) begin
+      clients[i].incMissCount(False);
+      clients[i].incHitCount(False);
+      clients[i].incWritebackCount(False);
+    end
+  endrule
+
+endmodule
+
+// ============================================================================
+// Off-chip RAM connections
+// ============================================================================
+
+module connectClientsToOffChipRAM#(
+  // Data caches
+  Vector#(`DCachesPerDRAM, DCache) caches,
+  // Reqs and resps from ProgRouter's fetchers
+  Vector#(`FetchersPerProgRouter, BOut#(DRAMReq)) routerReqs,
+  Vector#(`FetchersPerProgRouter, In#(DRAMResp)) routerResps,
+  // Off-chip memory
+  OffChipRAM ram) ();
+
+  // Count the number of outstanding fetcher requests
+  // Used to throttle the fetcher requests to avoid starving/blocking
+  // the cache requests
+  Integer throttleCount = 2 ** (`DRAMLogMaxInFlight - 1);
+  Count#(`DRAMLogMaxInFlight) fetcherCount <- mkCount(throttleCount);
+
+  // Merge cache requests
+  function getReqOut(cache) = cache.reqOut;
+  Out#(DRAMReq) cacheReqs <-
+    mkMergeTreeB(Fair,
+      mkUGShiftQueue1(QueueOptFmax),
+      map(getReqOut, caches));
+  Queue#(DRAMReq) cacheReqsQueue <- mkUGQueue;
+  connectToQueue(cacheReqs, cacheReqsQueue);
+  BOut#(DRAMReq) cacheReqsB = queueToBOut(cacheReqsQueue);
+
+  // Merge router requests
+  Out#(DRAMReq) fetcherReqs <-
+    mkMergeTreeB(Fair,
+      mkUGShiftQueue1(QueueOptFmax),
+      routerReqs);
+  Queue#(DRAMReq) fetcherReqsQueue <- mkUGQueue;
+  connectToQueue(fetcherReqs, fetcherReqsQueue);
+  BOut#(DRAMReq) fetcherReqsB = queueToBOut(fetcherReqsQueue);
+
+  // Update count on router request
+  BOut#(DRAMReq) fetcherReqsIncCountB =
+    interface BOut
+      method Action get =
+        action
+          fetcherReqsB.get;
+          fetcherCount.incBy(zeroExtend(fetcherReqsB.value.burst));
+        endaction;
+      method Bool valid = fetcherReqsB.valid && 
+        zeroExtend(fetcherReqsB.value.burst) <= fetcherCount.available;
+      method DRAMReq value = fetcherReqsB.value;
+    endinterface;
+
+  // Merge cache and router requests, and connect to off-chip RAM
+  let reqs <- mkMergeTwoB(Fair, cacheReqsB, fetcherReqsIncCountB);
+  connectUsing(mkUGQueue, reqs, ram.reqIn);
+
+  // Connect load responses
+  function DRAMClientId getRespKey(DRAMResp resp) = resp.id;
+  function getRespIn(cache) = cache.respIn;
+  let ramResps <- mkResponseDistributor(
+                    getRespKey,
+                    mkUGShiftQueue2(QueueOptFmax),
+                    append(map(getRespIn, caches), routerResps));
+
+  // Update count on response
+  BOut#(DRAMResp) ramRespOutDecCount =
+    interface BOut
+      method Action get =
+        action
+          ram.respOut.get;
+          if (ram.respOut.value.id >= fromInteger(`DCachesPerDRAM))
+            fetcherCount.dec;
+        endaction;
+      method Bool valid = ram.respOut.valid;
+      method DRAMResp value = ram.respOut.value;
+    endinterface;
+
+  // Connect responses from off-chip RAM
+  connectDirect(ramRespOutDecCount, ramResps);
+
+endmodule
+
+// ============================================================================
+// ProgRouter performance counter connections
+// ============================================================================
+
+module connectProgRouterPerfCountersToCores#(
+         ProgRouterPerfCounters counters, Vector#(n, Core) cores) (Empty);
+  rule connect;
+    // Only core zero can access the ProgRouter perf counters
+    cores[0].progRouterPerfClient.incSent(counters.incSent);
+    cores[0].progRouterPerfClient.incSentInterBoard(counters.incSentInterBoard);
+    for (Integer i = 1; i < valueOf(n); i=i+1) begin
+      cores[i].progRouterPerfClient.incSent(?);
+      cores[i].progRouterPerfClient.incSentInterBoard(?);
+    end
+  endrule
+endmodule
+
+endpackage
diff --git a/rtl/Core.bsv b/rtl/Core.bsv
index 1d35d278..4c454c98 100644
--- a/rtl/Core.bsv
+++ b/rtl/Core.bsv
@@ -25,6 +25,7 @@ import FPUOps       :: *;
 import InstrMem     :: *;
 import DCacheTypes  :: *;
 import IdleDetector :: *;
+import ProgRouter   :: *;
 
 // ============================================================================
 // Control/status registers (CSRs) supported
@@ -60,15 +61,17 @@ import IdleDetector :: *;
 // Performance Counter CSRs (Optional)
 // ============================================================================
 
-// Name            | CSR    | R/W | Function
-// --------------- | ------ | --- | --------
-// PerfCount       | 0xc07  | W   | Reset(0)/Start(1)/Stop(2) all counters
-// MissCount       | 0xc08  | R   | Cache miss count
-// HitCount        | 0xc09  | R   | Cache hit count
-// WritebackCount  | 0xc0a  | R   | Cache writeback count
-// CPUIdleCount    | 0xc0b  | R   | CPU idle-cycle count (lower 32 bits)
-// CPUIdleCountU   | 0xc0c  | R   | CPU idle-cycle count (upper 8 bits)
-// CycleU          | 0xc0d  | R   | Cycle counter (upper 8 bits)
+// Name                | CSR    | R/W | Function
+// ------------------- | ------ | --- | --------
+// PerfCount           | 0xc07  | W   | Reset(0)/Start(1)/Stop(2) all counters
+// MissCount           | 0xc08  | R   | Cache miss count
+// HitCount            | 0xc09  | R   | Cache hit count
+// WritebackCount      | 0xc0a  | R   | Cache writeback count
+// CPUIdleCount        | 0xc0b  | R   | CPU idle-cycle count (lower 32 bits)
+// CPUIdleCountU       | 0xc0c  | R   | CPU idle-cycle count (upper 8 bits)
+// CycleU              | 0xc0d  | R   | Cycle counter (upper 8 bits)
+// ProgRouterSent      | 0xc0e  | R   | Msgs sent by ProgRouter
+// ProgRouterSentInter | 0xc0f  | R   | Inter-board msgs sent by ProgRouter
 
 // ============================================================================
 // Types
@@ -505,12 +508,13 @@ endfunction
 // ============================================================================
 
 interface Core;
-  interface DCacheClient       dcacheClient;
-  interface MailboxClient      mailboxClient;
-  interface DebugLinkClient    debugLinkClient;
-  interface FPUClient          fpuClient;
-  interface InstrMemClient     instrMemClient;
-  interface IdleDetectorClient idleClient;
+  interface DCacheClient         dcacheClient;
+  interface MailboxClient        mailboxClient;
+  interface DebugLinkClient      debugLinkClient;
+  interface FPUClient            fpuClient;
+  interface InstrMemClient       instrMemClient;
+  interface IdleDetectorClient   idleClient;
+  interface ProgRouterPerfClient progRouterPerfClient;
 
   // Each core can see its board id
   (* always_ready, always_enabled *)
@@ -676,18 +680,27 @@ module mkCore#(CoreId myId) (Core);
   Reg#(Bit#(32)) hitCount       <- mkConfigReg(0);
   Reg#(Bit#(32)) writebackCount <- mkConfigReg(0);
   Reg#(Bit#(40)) cpuIdleCount   <- mkConfigReg(0);
+  // Only core zero maintains the following two counters
+  Reg#(Bit#(32)) progRouterSent <- mkConfigReg(0);
+  Reg#(Bit#(32)) progRouterSentInterBoard <- mkConfigReg(0);
 
   // Indexable vector of performance counters
-  Vector#(6, Bit#(32)) perfCounters =
+  Vector#(8, Bit#(32)) perfCounters =
     vector(missCount, hitCount, writebackCount, cpuIdleCount[31:0],
              zeroExtend(cpuIdleCount[39:32]),
-             zeroExtend(cycleCount[39:32]));
+             zeroExtend(cycleCount[39:32]),
+             myId == 0 ? progRouterSent : ?,
+             myId == 0 ? progRouterSentInterBoard : ?);
 
   // Increment wires
   Wire#(Bool) incMissCountWire      <- mkDWire(False);
   Wire#(Bool) incHitCountWire       <- mkDWire(False);
   Wire#(Bool) incWritebackCountWire <- mkDWire(False);
   Wire#(Bool) incCPUIdleCountWire   <- mkDWire(False);
+  Wire#(Bit#(LogFetchersPerProgRouter))
+    incProgRouterSent <- mkBypassWire;
+  Wire#(Bit#(LogFetchersPerProgRouter))
+    incProgRouterSentInterBoard <- mkBypassWire;
 
   // Update performance counters
   rule updatePerfCounters;
@@ -696,11 +709,20 @@ module mkCore#(CoreId myId) (Core);
       hitCount       <= 0;
       writebackCount <= 0;
       cpuIdleCount   <= 0;
+      if (myId == 0) begin
+        progRouterSent <= 0;
+        progRouterSentInterBoard <= 0;
+      end
     end else if (perfCountEnabled) begin
       if (incMissCountWire) missCount <= missCount+1;
       if (incHitCountWire) hitCount <= hitCount+1;
       if (incWritebackCountWire) writebackCount <= writebackCount+1;
       if (incCPUIdleCountWire) cpuIdleCount <= cpuIdleCount+1;
+      if (myId == 0) begin
+        progRouterSent <= progRouterSent + zeroExtend(incProgRouterSent);
+        progRouterSentInterBoard <= progRouterSentInterBoard +
+          zeroExtend(incProgRouterSentInterBoard);
+      end
     end
   endrule
   `endif
@@ -1321,6 +1343,19 @@ module mkCore#(CoreId myId) (Core);
     method Bool idleStage1Ack = mailbox.idleStage1Ack;
   endinterface
 
+  interface ProgRouterPerfClient progRouterPerfClient;
+    method Action incSent(Bit#(LogFetchersPerProgRouter) amount);
+      `ifdef EnablePerfCount
+        incProgRouterSent <= amount;
+      `endif
+    endmethod
+    method Action incSentInterBoard(Bit#(LogFetchersPerProgRouter) amount);
+      `ifdef EnablePerfCount
+        incProgRouterSentInterBoard <= amount;
+      `endif
+    endmethod
+  endinterface
+
 endmodule
 
 endpackage
diff --git a/rtl/DCache.bsv b/rtl/DCache.bsv
index 3162aade..e972a858 100644
--- a/rtl/DCache.bsv
+++ b/rtl/DCache.bsv
@@ -437,9 +437,11 @@ module mkDCache#(DCacheId myId) (DCache);
     // This rule either consumes a flush request or a memory response
     let flush = flushQueue.dataOut;
     let resp = respPort.value;
+    InflightDCacheReqInfo info = unpack(truncate(resp.info));
+    Bit#(`LogBeatsPerLine) beat = truncate(resp.beat);
     lineWriteDataWire <= resp.data;
-    lineWriteIndexWire <= beatIndex(resp.info.beat, resp.info.req.id,
-                            resp.info.req.addr, resp.info.way);
+    lineWriteIndexWire <= beatIndex(beat, info.req.id,
+                            info.req.addr, info.way);
     // Ready to consume flush queue?
     if (flushQueue.canDeq && flushQueue.canPeek) begin
       flush.req.cmd.isFlush = False;
@@ -453,14 +455,14 @@ module mkDCache#(DCacheId myId) (DCache);
       // Remove item from fill queue and feed associated request (which
       // will definitely hit if it starts again from the beginning of
       // the pipeline) back to beginning of the pipeline
-      if (allHigh(resp.info.beat))
+      if (allHigh(beat))
         feedbackTrigger <= True;
       // Write new line data to dataMem
       // (The write parameters are set outside condition for better timing)
       lineWriteReqWire <= True;
       respPort.get;
       // Set feedback request
-      feedbackReq <= resp.info.req;
+      feedbackReq <= info.req;
     end
   endrule
 
@@ -492,11 +494,10 @@ module mkDCache#(DCacheId myId) (DCache);
     InflightDCacheReqInfo info;
     info.req = miss.req;
     info.way = miss.evictWay;
-    info.beat = ?;
     // Create memory request
     DRAMReq memReq;
     memReq.isStore = !isLoad;
-    memReq.id = myId;
+    memReq.id = zeroExtend(myId);
     memReq.addr = {isLoad ? readLineAddr : writeLineAddr, reqBeat};
     memReq.data = isLoad ? {?, pack(info)} : dataMem.dataOutA;
     memReq.burst = isLoad ? `BeatsPerLine : 1;
@@ -589,66 +590,6 @@ interface DCacheClient;
   method Action incWritebackCount(Bool inc);
 endinterface
 
-// ============================================================================
-// Connections
-// ============================================================================
-
-module connectCoresToDCache#(
-         Vector#(`CoresPerDCache, DCacheClient) clients,
-         DCache dcache) ();
-
-  // Connect requests
-  function getDCacheReqOut(client) = client.dcacheReqOut;
-  let dcacheReqs <- mkMergeTree(Fair,
-                      mkUGShiftQueue1(QueueOptFmax),
-                      map(getDCacheReqOut, clients));
-  connectUsing(mkUGQueue, dcacheReqs, dcache.reqIn);
-
-  // Connect responses
-  function Bit#(`LogCoresPerDCache) getDCacheRespKey(DCacheResp resp) =
-    truncateLSB(resp.id);
-  function getDCacheRespIn(client) = client.dcacheRespIn;
-  let dcacheResps <- mkResponseDistributor(
-                      getDCacheRespKey,
-                      mkUGShiftQueue1(QueueOptFmax),
-                      map(getDCacheRespIn, clients));
-  connectDirect(dcache.respOut, dcacheResps);
-
-  // Connect performance-counter wires
-  rule connectPerfCountWires;
-    clients[0].incMissCount(dcache.incMissCount);
-    clients[0].incHitCount(dcache.incHitCount);
-    clients[0].incWritebackCount(dcache.incWritebackCount);
-    for (Integer i = 1; i < `CoresPerDCache; i=i+1) begin
-      clients[i].incMissCount(False);
-      clients[i].incHitCount(False);
-      clients[i].incWritebackCount(False);
-    end
-  endrule
-
-endmodule
-
-module connectDCachesToOffChipRAM#(
-         Vector#(`DCachesPerDRAM, DCache) caches, OffChipRAM ram) ();
-
-  // Connect requests
-  function getReqOut(cache) = cache.reqOut;
-  let reqs <- mkMergeTreeB(Fair,
-                mkUGShiftQueue1(QueueOptFmax),
-                map(getReqOut, caches));
-  connectUsing(mkUGQueue, reqs, ram.reqIn);
-
-  // Connect load responses
-  function DCacheId getRespKey(DRAMResp resp) = resp.id;
-  function getRespIn(cache) = cache.respIn;
-  let ramResps <- mkResponseDistributor(
-                    getRespKey,
-                    mkUGShiftQueue2(QueueOptFmax),
-                    map(getRespIn, caches));
-  connectDirect(ram.respOut, ramResps);
-
-endmodule
-
 // ============================================================================
 // Dummy cache
 // ============================================================================
diff --git a/rtl/DCacheTypes.bsv b/rtl/DCacheTypes.bsv
index fa6ba407..4ddd809f 100644
--- a/rtl/DCacheTypes.bsv
+++ b/rtl/DCacheTypes.bsv
@@ -43,7 +43,6 @@ typedef struct {
 typedef struct {
   DCacheReq req;
   Way way;
-  Bit#(`LogBeatsPerLine) beat;
 } InflightDCacheReqInfo deriving (Bits);
 
 endpackage
diff --git a/rtl/DE5BridgeTop.bsv b/rtl/DE5BridgeTop.bsv
index 5dce9e25..15e2ba8f 100644
--- a/rtl/DE5BridgeTop.bsv
+++ b/rtl/DE5BridgeTop.bsv
@@ -12,9 +12,10 @@
 //   1. DA: Destination address (4 bytes)
 //   2. NM: Number of messages that follow minus one (4 bytes)
 //   3. FM: Number of flit payloads per message minus one (1 byte)
-//   4. Padding (7 bytes)
-//   5. (NM+1)*(FM+1) flit payloads ((NM+1)*(FM+1)*BytesPerFlit bytes)
-//   6. Goto step 1
+//   4. Padding (3 bytes)
+//   5. Routing key (optional, 4 bytes)
+//   6. (NM+1)*(FM+1) flit payloads ((NM+1)*(FM+1)*BytesPerFlit bytes)
+//   7. Goto step 1
 //
 // The format of the data stream in the FPGA->PC direction is simply
 // raw flit payloads.
@@ -161,6 +162,7 @@ module de5BridgeTop (DE5BridgeTop);
   Reg#(Bit#(32)) fromPCIeDA    <- mkConfigRegU;
   Reg#(Bit#(32)) fromPCIeNM    <- mkConfigRegU;
   Reg#(Bit#(8))  fromPCIeFM    <- mkConfigRegU;
+  Reg#(Bit#(32)) fromPCIeKey   <- mkConfigRegU;
   Reg#(Bit#(1))  toLinkState   <- mkConfigReg(0);
 
   Reg#(Bit#(32)) messageCount  <- mkConfigReg(0);
@@ -182,6 +184,7 @@ module de5BridgeTop (DE5BridgeTop);
         fromPCIeDA <= data[31:0];
         fromPCIeNM <= data[63:32];
         fromPCIeFM <= data[95:88];
+        fromPCIeKey <= data[127:96];
         toLinkState <= 1;
         fromPCIe.get;
       end
@@ -203,6 +206,10 @@ module de5BridgeTop (DE5BridgeTop);
         Flit flit;
         flit.dest.addr = unpack(truncate(fromPCIeDA[31:`LogThreadsPerMailbox]));
         flit.dest.threads = pack(destThreads);
+        // If address says to use routing key, then use it
+        if (flit.dest.addr.isKey) begin
+          flit.dest.threads = zeroExtend(fromPCIeKey);
+        end
         flit.payload = fromPCIe.value;
         flit.notFinalFlit = True;
         flit.isIdleToken = False;
diff --git a/rtl/DE5Top.bsv b/rtl/DE5Top.bsv
index 2173526d..bb35bc19 100644
--- a/rtl/DE5Top.bsv
+++ b/rtl/DE5Top.bsv
@@ -22,6 +22,7 @@ import InstrMem     :: *;
 import NarrowSRAM   :: *;
 import OffChipRAM   :: *;
 import IdleDetector :: *;
+import Connections  :: *;
 
 // ============================================================================
 // Interface
@@ -114,10 +115,6 @@ module de5Top (DE5Top);
     for (Integer j = 0; j < `DCachesPerDRAM; j=j+1)
       connectCoresToDCache(map(dcacheClient, cores[i][j]), dcaches[i][j]);
 
-  // Connect data caches to DRAM
-  for (Integer i = 0; i < `DRAMsPerBoard; i=i+1)
-    connectDCachesToOffChipRAM(dcaches[i], rams[i]);
-
   // Create FPUs
   Vector#(`FPUsPerBoard, FPU) fpus;
   for (Integer i = 0; i < `FPUsPerBoard; i=i+1)
@@ -143,10 +140,6 @@ module de5Top (DE5Top);
   // Create idle-detector
   IdleDetector idle <- mkIdleDetector;
 
-  // Connect cores to idle-detector
-  function idleClient(core) = core.idleClient;
-  connectCoresToIdleDetector(map(idleClient, vecOfCores), idle);
-
   // Create mailboxes
   Vector#(`MailboxMeshYLen,
     Vector#(`MailboxMeshXLen, Mailbox)) mailboxes =
@@ -155,6 +148,13 @@ module de5Top (DE5Top);
     for (Integer x = 0; x < `MailboxMeshXLen; x=x+1)
       mailboxes[y][x] <- mkMailboxAcc(debugLink.getBoardId(), x, y);
 
+  // Initialise mailbox send slots
+  rule initSendSlots;
+    for (Integer y = 0; y < `MailboxMeshYLen; y=y+1)
+      for (Integer x = 0; x < `MailboxMeshXLen; x=x+1)
+        mailboxes[y][x].initSendSlots(debugLink.useExtraSendSlot);
+  endrule
+
   // Connect cores to mailboxes
   for (Integer y = 0; y < `MailboxMeshYLen; y=y+1)
     for (Integer x = 0; x < `MailboxMeshXLen; x=x+1) begin
@@ -167,13 +167,27 @@ module de5Top (DE5Top);
       connectCoresToMailbox(map(mailboxClient, cs), mailboxes[y][x]);
     end
 
-  // Create mesh of mailboxes
+  // Create network-on-chip
   function MailboxNet mailboxNet(Mailbox mbox) = mbox.net;
-  ExtNetwork net <- mkMailboxMesh(
-                      debugLink.getBoardId(),
-                      debugLink.linkEnable,
-                      map(map(mailboxNet), mailboxes),
-                      idle);
+  NoC noc <- mkNoC(
+    debugLink.getBoardId(),
+    debugLink.linkEnable,
+    map(map(mailboxNet), mailboxes),
+    idle);
+
+  // Connect cores and ProgRouter fetchers to idle-detector
+  function idleClient(core) = core.idleClient;
+  connectClientsToIdleDetector(
+    map(idleClient, vecOfCores), noc.activities, idle);
+
+  // Connections to off-chip RAMs
+  for (Integer i = 0; i < `DRAMsPerBoard; i=i+1)
+    connectClientsToOffChipRAM(dcaches[i],
+      noc.dramReqs[i], noc.dramResps[i], rams[i]);
+
+  // Connect ProgRouter performance counters to cores
+  connectProgRouterPerfCountersToCores(noc.progRouterPerfCounters,
+    concat(concat(cores)));
 
   // Set board ids
   rule setBoardIds;
@@ -199,10 +213,10 @@ module de5Top (DE5Top);
   interface dramIfcs = map(getDRAMExtIfc, rams);
   interface sramIfcs = concat(map(getSRAMExtIfcs, rams));
   interface jtagIfc  = debugLink.jtagAvalon;
-  interface northMac = net.north;
-  interface southMac = net.south;
-  interface eastMac  = net.east;
-  interface westMac  = net.west;
+  interface northMac = noc.north;
+  interface southMac = noc.south;
+  interface eastMac  = noc.east;
+  interface westMac  = noc.west;
   method Action setBoardId(Bit#(4) id);
     localBoardId <= id;
   endmethod
diff --git a/rtl/DRAM.bsv b/rtl/DRAM.bsv
index b9bab54e..406cfe89 100644
--- a/rtl/DRAM.bsv
+++ b/rtl/DRAM.bsv
@@ -5,8 +5,11 @@ package DRAM;
 // Types
 // ============================================================================
 
+// DRAM client id
+typedef Bit#(TLog#(TAdd#(`DCachesPerDRAM,`FetchersPerProgRouter))) DRAMClientId;
+
 // DRAM request id
-typedef DCacheId DRAMReqId;
+typedef DRAMClientId DRAMReqId;
 
 // DRAM request
 typedef struct {
@@ -22,8 +25,13 @@ typedef struct {
 typedef struct {
   DRAMReqId id;
   Bit#(`BeatWidth) data;
-  InflightDCacheReqInfo info;
+  // Which beat is it?
   Bool finalBeat;
+  Bit#(`BeatBurstWidth) beat;
+  // Data from original load request
+  // (Can be largely ignored and optimised away, but
+  // can also hold useful info about the original request)
+  Bit#(`BeatWidth) info;
 } DRAMResp deriving (Bits);
 
 // DRAM identifier
@@ -80,7 +88,6 @@ import Util        :: *;
 import Interface   :: *;
 import Queue       :: *;
 import Assert      :: *;
-import DCacheTypes :: *;
 
 // Types
 // -----
@@ -151,8 +158,8 @@ module mkDRAM#(RAMId id) (DRAM);
           DRAMResp resp;
           resp.id = req.id;
           resp.data = pack(elems);
-          resp.info = unpack(truncate(req.data));
-          resp.info.beat = truncate(burstCount);
+          resp.info = req.data;
+          resp.beat = burstCount;
           resp.finalBeat = finalBeat;
           resps.enq(resp);
           decOutstanding.send;
@@ -219,7 +226,6 @@ import Interface   :: *;
 import Assert      :: *;
 import Util        :: *;
 import Assert      :: *;
-import DCacheTypes :: *;
 
 // Types
 // -----
@@ -244,7 +250,7 @@ endinterface
 typedef struct {
   DRAMReqId id;
   Bit#(`BeatBurstWidth) burst;
-  InflightDCacheReqInfo info;
+  Bit#(`BeatWidth) info;
 } DRAMInFlightReq deriving (Bits);
 
 // Implementation
@@ -309,7 +315,7 @@ module mkDRAM#(t id) (DRAM);
           DRAMInFlightReq inflightReq;
           inflightReq.id = req.id;
           inflightReq.burst = req.burst;
-          inflightReq.info = unpack(truncate(req.data));
+          inflightReq.info = req.data;
           inFlight.enq(inflightReq);
           inFlightCount.incBy(zeroExtend(req.burst));
         end
@@ -336,7 +342,7 @@ module mkDRAM#(t id) (DRAM);
       DRAMResp resp;
       resp.id = inFlight.dataOut.id;
       resp.info = inFlight.dataOut.info;
-      resp.info.beat = truncate(burstCount-1);
+      resp.beat = truncate(burstCount-1);
       resp.data = respBuffer.dataOut;
       resp.finalBeat = burstCount == inFlight.dataOut.burst;
       return resp;
diff --git a/rtl/DebugLink.bsv b/rtl/DebugLink.bsv
index 676696e7..a09236b5 100644
--- a/rtl/DebugLink.bsv
+++ b/rtl/DebugLink.bsv
@@ -13,16 +13,18 @@ package DebugLink;
 // Commands sent from the host PC to DebugLink typically consist of a
 // few bytes over the JTAG UART.
 //
-//   QueryIn: tag (1 byte), board offset (1 byte), edge disable (1 byte)
-//   -------------------------------------------------------------------
+//   QueryIn: tag (1 byte), board offset (1 byte), config (1 byte)
+//   -------------------------------------------------------------
 //
 //   Sets the X offset (offset[3:0]) and the Y offset (offset[7:4])
 //   of the board id (to support multiple boxes).
 //   Disable the specified inter-FPGA links:
-//     * disable[0]: disable links on north side of box
-//     * disable[1]: disable links on south side of box
-//     * disable[2]: disable links on east side of box
-//     * disable[3]: disable links on west side of box
+//     * config[0]: disable links on north side of box
+//     * config[1]: disable links on south side of box
+//     * config[2]: disable links on east side of box
+//     * config[3]: disable links on west side of box
+//   Enable extra send slot:
+//     * config[4]: reserve extra send slot
 //   Responds with a QueryOut (see below).
 //
 //   SetDest: tag (1 byte), thread id (1 byte), core id (1 byte)
@@ -202,9 +204,13 @@ interface DebugLink;
   // Get board id via DebugLink
   (* always_ready, always_enabled *)
   method BoardId getBoardId();
-  // Optionally disable each inter-FPGA link via DebugLink
+  // Config option: disable each inter-FPGA link via DebugLink
+  // (Allows sandboxing of boxes or groups of boxes)
   (* always_ready, always_enabled *)
   method Vector#(4, Bool) linkEnable;
+  // Config option: reserve extra send slot per thread in mailbox
+  (* always_ready, always_enabled *)
+  method Option#(Bool) useExtraSendSlot;
 endinterface
 
 module mkDebugLink#(
@@ -224,6 +230,11 @@ module mkDebugLink#(
   // (Initially, all disabled)
   Reg#(Vector#(4, Bool)) linkEnableReg <- mkConfigReg(replicate(False));
 
+  // Config option: reserve extra send slot in mailbox?
+  // Use a chain of registers to aid propagation on chip
+  Vector#(3, Reg#(Option#(Bool))) useExtraSendSlotReg <-
+     replicateM(mkConfigReg(Option {valid : False, value: False}));
+
   // Ports
   InPort#(Bit#(8)) fromJtag <- mkInPort;
   OutPort#(Bit#(8)) toJtag <- mkOutPort;
@@ -331,6 +342,9 @@ module mkDebugLink#(
         // Disable west link?
         if (x == 0 && edgeEn[3] == 1) linkEn[3] = False;
         linkEnableReg <= linkEn;
+        // Reserve extra send slot?
+        useExtraSendSlotReg[2] <=
+          Option {valid: True, value: fromJtag.value[4] == 1};
         respondFlag <= True;
         respondCmd <= cmdQueryIn;
         recvState <= 0;
@@ -404,6 +418,11 @@ module mkDebugLink#(
     end
   endrule
 
+  // Propagate extra send slot option through chain of registers (for timing)
+  rule chain;
+    for (Integer i = 0; i < 2; i=i+1)
+      useExtraSendSlotReg[i] <= useExtraSendSlotReg[i+1];
+  endrule
 
   `ifndef SIMULATE
   interface jtagAvalon = uart.jtagAvalon;
@@ -411,7 +430,7 @@ module mkDebugLink#(
 
   method BoardId getBoardId() = boardId;
   method Vector#(4, Bool) linkEnable = linkEnableReg;
-
+  method Option#(Bool) useExtraSendSlot = useExtraSendSlotReg[0];
 endmodule
 
 endpackage
diff --git a/rtl/GenInit.sh b/rtl/GenInit.sh
deleted file mode 100755
index ad2a6e0c..00000000
--- a/rtl/GenInit.sh
+++ /dev/null
@@ -1,19 +0,0 @@
-#!/bin/bash
-
-# Generate memory initialisation files
-
-# Load config parameters
-while read -r EXPORT; do
-  eval $EXPORT
-done <<< `python ../config.py envs`
-
-MaxSlot=$(((2**LogMsgsPerMailbox) - 1))
-ThreadsPerMailbox=$((2**$LogThreadsPerMailbox))
-
-# Emit hex file
-for I in $(seq $ThreadsPerMailbox $MaxSlot); do
-  printf "%x\n" $I
-done >> FreeSlots.hex
-
-# Emit MIF file
-../bin/hex-to-mif.py FreeSlots.hex $LogMsgsPerMailbox > ../de5/FreeSlots.mif
diff --git a/rtl/Globals.bsv b/rtl/Globals.bsv
index a2648a23..d240aa2c 100644
--- a/rtl/Globals.bsv
+++ b/rtl/Globals.bsv
@@ -20,10 +20,13 @@ typedef struct {
 // destination board, it is routed either left or right depending
 // the contents of the host bit.  This is to support bridge boards
 // connected at the east/west rims of the FPGA mesh.
+// The 'isKey' bit means that the destination is a routing key, held
+// in the bottom 32 bits of the 'NetAddr'.
 // The 'acc' bit means message is routed to a custom accelerator rather
 // than a mailbox.
 typedef struct {
   Bool acc;
+  Bool isKey;
   Option#(Bit#(1)) host;
   BoardId board;
   MailboxId mbox;
@@ -42,6 +45,9 @@ typedef struct {
 
 function MailboxId getMailboxId(NetAddr addr) = addr.addr.mbox;
 
+// Extract routing key from network address
+function Bit#(32) getRoutingKeyRaw(NetAddr addr) = truncate(pack(addr));
+
 // ============================================================================
 // Messages
 // ============================================================================
@@ -63,7 +69,7 @@ typedef struct {
   Bool notFinalFlit;
   // Is this a special packet for idle-detection?
   Bool isIdleToken;
-} Flit deriving (Bits);
+} Flit deriving (Bits, FShow);
 
 // A padded flit is a multiple of 64 bits
 // (i.e. the data width of the 10G MAC interface)
diff --git a/rtl/IdleDetector.bsv b/rtl/IdleDetector.bsv
index 0307f198..59e4b530 100644
--- a/rtl/IdleDetector.bsv
+++ b/rtl/IdleDetector.bsv
@@ -18,14 +18,16 @@
 // The implementation below is based on Safra's termination detection
 // algorithm (EWD998).
 
-import Mailbox   :: *;
-import Globals   :: *;
-import Interface :: *;
-import Queue     :: *;
-import Vector    :: *;
-import ConfigReg :: *;
-import Util      :: *;
-import DReg      :: *;
+import Mailbox    :: *;
+import Globals    :: *;
+import Interface  :: *;
+import Queue      :: *;
+import Vector     :: *;
+import ConfigReg  :: *;
+import Util       :: *;
+import DReg       :: *;
+import ProgRouter :: *;
+import Assert     :: *;
 
 // The total number of messages sent by all threads on an FPGA minus
 // the total number of messages received by all threads on an FPGA.
@@ -221,6 +223,7 @@ module mkIdleDetector (IdleDetector);
       NetAddr {
         addr: MailboxNetAddr {
           acc: False,
+          isKey: False,
           host: option(True, 0),
           board: BoardId { y: 0, x: 0 },
           mbox: MailboxId { y: 0, x: 0 }
@@ -301,33 +304,6 @@ module mkIdleDetector (IdleDetector);
 
 endmodule
 
-// Pipelined reduction tree
-module mkPipelinedReductionTree#(
-         function a reduce(a x, a y),
-         a init,
-         List#(a) xs)
-       (a) provisos(Bits#(a, _));
-  Integer len = List::length(xs);
-  if (len == 0)
-    return error("mkSumList applied to empty list");
-  else if (len == 1)
-    return xs[0];
-  else begin
-    List#(a) ys = xs;
-    List#(a) reduced = Nil;
-    for (Integer i = 0; i < len; i=i+2) begin
-      Reg#(a) r <- mkConfigReg(init);
-      rule assignOut;
-        r <= reduce(ys[0], ys[1]);
-      endrule
-      ys = List::drop(2, ys);
-      reduced = Cons(readReg(r), reduced);
-    end
-    a res <- mkPipelinedReductionTree(reduce, init, reduced);
-    return res;
-  end
-endmodule
-
 interface IdleDetectorClient;
   method Bit#(1) incSent;
   method Bit#(1) incReceived;
@@ -342,22 +318,33 @@ interface IdleDetectorClient;
   method Bool idleStage1Ack;
 endinterface
 
-// Connect cores to idle detector
-module connectCoresToIdleDetector#(
-         Vector#(n, IdleDetectorClient) core, IdleDetector detector) ()
-           provisos (Log#(n, log_n), Add#(log_n, 1, m), Add#(_a, m, 62));
+// Connect cores and fetchers to idle detector
+module connectClientsToIdleDetector#(
+         Vector#(`CoresPerBoard, IdleDetectorClient) core,
+         Vector#(`FetchersPerProgRouter, FetcherActivity) fetcher,
+         IdleDetector detector) ()
+           provisos (Mul#(2, `CoresPerBoard, n));
+
+  staticAssert(2**`LogCoresPerBoard1 > `CoresPerBoard+`FetchersPerProgRouter,
+    "connectClientsToIdleDetector: insufficient width");
 
   // Sum "incSent" wires from each core
-  Vector#(n, Bit#(m)) incSents = newVector;
-  for (Integer i = 0; i < valueOf(n); i=i+1)
+  Vector#(n, Bit#(`LogCoresPerBoard1)) incSents = replicate(0);
+  for (Integer i = 0; i < `CoresPerBoard; i=i+1)
     incSents[i] = zeroExtend(core[i].incSent);
-  Bit#(m) incSent <- mkPipelinedReductionTree( \+ , 0, toList(incSents));
+  for (Integer i = 0; i < `FetchersPerProgRouter; i=i+1)
+    incSents[`CoresPerBoard+i] = zeroExtend(fetcher[i].incSent);
+  Bit#(`LogCoresPerBoard1) incSent <-
+    mkPipelinedReductionTree( \+ , 0, toList(incSents));
 
   // Sum "incRecv" wires from each core
-  Vector#(n, Bit#(m)) incRecvs = newVector;
-  for (Integer i = 0; i < valueOf(n); i=i+1)
+  Vector#(n, Bit#(`LogCoresPerBoard1)) incRecvs = replicate(0);
+  for (Integer i = 0; i < `CoresPerBoard; i=i+1)
     incRecvs[i] = zeroExtend(core[i].incReceived);
-  Bit#(m) incRecv <- mkPipelinedReductionTree( \+ , 0, toList(incRecvs));
+  for (Integer i = 0; i < `FetchersPerProgRouter; i=i+1)
+    incRecvs[`CoresPerBoard+i] = zeroExtend(fetcher[i].incReceived);
+  Bit#(`LogCoresPerBoard1) incRecv <-
+    mkPipelinedReductionTree( \+ , 0, toList(incRecvs));
 
   // Maintain the total count
   Reg#(MsgCount) count <- mkConfigReg(0);
@@ -368,16 +355,18 @@ module connectCoresToIdleDetector#(
   endrule
 
   // OR the "active" wires from each core
-  Vector#(n, Bool) actives = newVector;
-  for (Integer i = 0; i < valueOf(n); i=i+1)
+  Vector#(n, Bool) actives = replicate(False);
+  for (Integer i = 0; i < `CoresPerBoard; i=i+1)
     actives[i] = core[i].active;
+  for (Integer i = 0; i < `FetchersPerProgRouter; i=i+1)
+    actives[`CoresPerBoard+i] = fetcher[i].active;
   Bool anyActive <- mkPipelinedReductionTree( \|| , True, toList(actives));
 
-  // OR the "vote" wires from each core
-  Vector#(n, Bool) votes = newVector;
-  for (Integer i = 0; i < valueOf(n); i=i+1)
+  // AND the "vote" wires from each core
+  Vector#(n, Bool) votes = replicate(True);
+  for (Integer i = 0; i < `CoresPerBoard; i=i+1)
     votes[i] = core[i].vote;
-  Bool unanamous <- mkPipelinedReductionTree( \&& , False, toList(votes));
+  Bool voteDecision <- mkPipelinedReductionTree( \&& , False, toList(votes));
 
   // Register the result
   Reg#(Bool) active <- mkConfigReg(True);
@@ -385,24 +374,25 @@ module connectCoresToIdleDetector#(
   
   rule updateActive;
     active <= anyActive;
-    vote <= unanamous;
+    vote <= voteDecision;
   endrule
 
   // Counter number of stage 1 acks
-  Reg#(Bit#(m)) numAcks <- mkConfigReg(0);
+  Reg#(Bit#(`LogCoresPerBoard1)) numAcks <- mkConfigReg(0);
 
   // Sum stage 1 ack wires from each core
-  Vector#(n, Bit#(m)) incAcks = newVector;
-  for (Integer i = 0; i < valueOf(n); i=i+1)
+  Vector#(`CoresPerBoard, Bit#(`LogCoresPerBoard1)) incAcks = newVector;
+  for (Integer i = 0; i < `CoresPerBoard; i=i+1)
     incAcks[i] = zeroExtend(pack(core[i].idleStage1Ack));
-  Bit#(m) incAck <- mkPipelinedReductionTree( \+ , 0, toList(incAcks));
+  Bit#(`LogCoresPerBoard1) incAck <-
+    mkPipelinedReductionTree( \+ , 0, toList(incAcks));
 
   // Stage 1 output ack
   Wire#(Bool) stage1AckWire <- mkDWire(False);
 
   rule updateAcks;
-    Bit#(m) total = numAcks + incAck;
-    if (total == fromInteger(valueOf(n))) begin
+    Bit#(`LogCoresPerBoard1) total = numAcks + incAck;
+    if (total == `CoresPerBoard) begin
       numAcks <= 0;
       stage1AckWire <= True;
     end else begin
@@ -418,7 +408,7 @@ module connectCoresToIdleDetector#(
     detector.idle.voteIn(vote);
     detector.idle.ackStage1(stage1AckWire);
 
-    for (Integer i = 0; i < valueOf(n); i=i+1) begin
+    for (Integer i = 0; i < `CoresPerBoard; i=i+1) begin
       core[i].idleDetectedStage1(detector.idle.detectedStage1);
       core[i].idleVoteStage1(detector.idle.voteStage1);
       core[i].idleDetectedStage2(detector.idle.detectedStage2);
@@ -538,6 +528,7 @@ module mkIdleDetectMaster (IdleDetectMaster);
       NetAddr {
         addr: MailboxNetAddr {
           acc: False,
+          isKey: False,
           host: option(False, 0),
           board: BoardId { y: truncate(boardY), x: truncate(boardX) },
           mbox: MailboxId { y: 0, x: 0 }
diff --git a/rtl/Interface.bsv b/rtl/Interface.bsv
index c3d16860..a7cd0e91 100644
--- a/rtl/Interface.bsv
+++ b/rtl/Interface.bsv
@@ -212,6 +212,14 @@ module onBOut#(function u f(t x), BOut#(t) out) (BOut#(u));
   method u value = f(out.value);
 endmodule
 
+// Convert BOut to Out
+function Out#(t) fromBOut(BOut#(t) out) =
+  interface Out
+    method Action tryGet = out.get;
+    method Bool valid = out.valid;
+    method t value = out.value;
+  endinterface;
+
 // A null In port accepts and discards all inputs
 module mkNullIn (In#(t));
   method Action tryPut(u val); endmethod
@@ -248,6 +256,14 @@ function BOut#(t) enableBOut(Bool en, BOut#(t) out) =
     method t value = out.value;
   endinterface;
 
+// Convert queue to BOut interface
+function BOut#(t) queueToBOut(SizedQueue#(n, t) q) =
+  interface BOut
+    method Action get = q.deq;
+    method Bool valid = q.canDeq && q.canPeek;
+    method t value = q.dataOut;
+  endinterface;
+
 // =============================================================================
 // Merge unit
 // =============================================================================
@@ -396,7 +412,7 @@ module mkMergeTreeB#(MergeMethod m, module#(SizedQueue#(d, t)) mkQ,
     xs = List::cons(x, xs);
   end
 
-  let out <- mkMergeTreeList(m, mkQ, xs);
+  let out <- mkMergeTreeList(m, mkQ, List::reverse(xs));
   return out;
 endmodule
 
@@ -578,7 +594,7 @@ module mkDeserialiser (Deserialiser#(typeIn, typeOut))
 endmodule
 
 // =============================================================================
-// Expansion and reduction connectors
+// Reduction connectors
 // =============================================================================
 
 // Reduce a list of interfaces down to a given number of interfaces,
@@ -651,31 +667,4 @@ module reduceConnect#(
 
 endmodule
 
-// Connect 'from' ports to 'to' ports,
-// where 'length(from)' may be less than 'length(to)'.
-// Works by wiring null to any unused 'to' ports.
-module expandConnect#(List#(Out#(t)) from, List#(In#(t)) to) ()
-         provisos (Bits#(t, twidth));
-
-  // Count inputs and outputs
-  Integer numFrom = List::length(from);
-  Integer numTo = List::length(to);
-  Integer q = numTo/numFrom;
-
-  for (Integer i = 0; i < numTo; i=i+1) begin
-    if (q == 0) begin
-      // Connect input
-      connectUsing(mkUGShiftQueue1(QueueOptFmax), from[i], to[i]);
-    end else if ((i%q) == 0) begin
-      // Connect input
-      connectUsing(mkUGShiftQueue1(QueueOptFmax), from[i/q], to[i]);
-    end else begin
-      // Connect terminator
-      BOut#(t) nullOut <- mkNullBOut;
-      connectDirect(nullOut, to[i]);
-    end
-  end
-  
-endmodule
-
 endpackage
diff --git a/rtl/Mailbox.bsv b/rtl/Mailbox.bsv
index 0398b0e2..e08b1b9a 100644
--- a/rtl/Mailbox.bsv
+++ b/rtl/Mailbox.bsv
@@ -260,6 +260,9 @@ interface Mailbox;
   (* always_ready *) method Bit#(1) freeDone;
   // Network-side interface
   interface MailboxNet            net;
+  // Initialise send slots (use extra send slot?)
+  (* always_ready, always_enabled *)
+  method Action initSendSlots(Option#(Bool) useExtraSendSlot);
 endinterface
 
 // Combined receive request/response interface
@@ -292,6 +295,45 @@ module mkMailbox (Mailbox);
   Vector#(`CoresPerMailbox, InPort#(ReceiveReq)) rxReqPorts <-
     replicateM(mkInPort);
 
+  // Initialise free slots
+  // =====================
+
+  // Set of currently-unused message slots
+  // By default, the first ThreadsPerMailbox slots are reserved for sending
+  // Optionally, the first 2*ThreadsPerMailbox slots are reserved for sending
+  SizedQueue#(`LogMsgsPerMailbox, Bit#(`LogMsgsPerMailbox))
+    freeSlots <- mkUGSizedQueuePrefetch;
+
+  // Reserve extra send slot?
+  Wire#(Option#(Bool)) useExtraSendSlot <- mkBypassWire;
+
+  // State of free slot initialiser
+  Reg#(Bit#(1)) freeSlotsInitState <- mkConfigReg(0);
+
+  // Have the free slots been initialised yet?
+  Reg#(Bool) freeSlotsInitDone <- mkConfigReg(False);
+
+  // Next slot to insert into free slot queue
+  Reg#(Bit#(`LogMsgsPerMailbox)) freeSlotsInitNext <- mkConfigRegU;
+
+  // Wait until config option available, which tells us how
+  // many slots to reserve for sending
+  rule initFreeSlots0 (freeSlotsInitState == 0);
+    if (useExtraSendSlot.valid) begin
+      freeSlotsInitNext <= useExtraSendSlot.value ?
+        fromInteger(2*`ThreadsPerMailbox) : `ThreadsPerMailbox;
+      freeSlotsInitState <= 1;
+    end
+  endrule
+
+  // Initialise free slots
+  rule initFreeSlots1 (!freeSlotsInitDone && freeSlotsInitState == 1);
+    freeSlots.enq(freeSlotsInitNext);
+    freeSlotsInitNext <= freeSlotsInitNext + 1;
+    if (freeSlotsInitNext == fromInteger(2**`LogMsgsPerMailbox - 1))
+      freeSlotsInitDone <= True;
+  endrule
+
   // Message access unit
   // ===================
 
@@ -336,15 +378,6 @@ module mkMailbox (Mailbox);
   Reg#(RefCount) refCountReg <- mkConfigRegU;
   Reg#(Bit#(`LogMsgsPerMailbox)) refCountSlot <- mkConfigRegU;
 
-  // Set of currently-unused message slots
-  // (The first ThreadsPerMailbox slots are reserved for sending)
-  QueueOpts freeSlotsOpts;
-  freeSlotsOpts.style = "AUTO";
-  freeSlotsOpts.size = 2**`LogMsgsPerMailbox - `ThreadsPerMailbox;
-  freeSlotsOpts.file = Valid("FreeSlots");
-  SizedQueue#(`LogMsgsPerMailbox, Bit#(`LogMsgsPerMailbox))
-    freeSlots <- mkUGSizedQueuePrefetchOpts(freeSlotsOpts);
-
   // Multicast buffer
   Vector#(`CoresPerMailbox,
     SizedQueue#(`LogMulticastBufferSize, MulticastBufferEntry))
@@ -598,7 +631,7 @@ module mkMailbox (Mailbox);
   // to a message slot is freed
   Reg#(Bit#(1)) freeDoneReg <- mkDReg(0);
 
-  rule free (freeReqPort.canGet);
+  rule free (freeReqPort.canGet && freeSlotsInitDone);
     FreeReq req = freeReqPort.value;
     // Process request in two cycles
     let count = refCount.dataOutB;
@@ -667,6 +700,10 @@ module mkMailbox (Mailbox);
     endinterface
   endinterface
 
+  method Action initSendSlots(Option#(Bool) useExtra);
+    useExtraSendSlot <= useExtra;
+  endmethod
+
 endmodule
 
 // =============================================================================
@@ -1138,14 +1175,16 @@ import "BVI" ExternalTinselAccelerator =
 
 `ifndef UseCustomAccelerator
 
-module mkMailboxAcc#(BoardId boardId, Integer tileX, Integer tileY) (Mailbox);
+module mkMailboxAcc#(BoardId boardId,
+         Integer tileX, Integer tileY) (Mailbox);
   Mailbox mbox <- mkMailbox;
   return mbox;
 endmodule
 
 `else
 
-module mkMailboxAcc#(BoardId boardId, Integer tileX, Integer tileY) (Mailbox);
+module mkMailboxAcc#(BoardId boardId,
+         Integer tileX, Integer tileY) (Mailbox);
   // Instantiate standard mailbox
   Mailbox mbox <- mkMailbox;
 
diff --git a/rtl/Makefile b/rtl/Makefile
index cc521bae..e938b015 100644
--- a/rtl/Makefile
+++ b/rtl/Makefile
@@ -11,7 +11,7 @@ DEFS = $(shell python ../config.py defs)
 BSC = bsc
 BSCFLAGS = -wait-for-license -suppress-warnings S0015 \
            -suppress-warnings G0023 \
-           -steps-warn-interval 500000 -check-assert \
+           -steps-warn-interval 750000 -check-assert \
            +RTS -K32M -RTS
 
 # Top level module
@@ -28,13 +28,13 @@ sim: $(TOPMOD) $(HOSTTOPMOD)
 .PHONY: verilog
 verilog: $(TOPMOD).v $(HOSTTOPMOD).v
 
-$(TOPMOD): *.bsv *.c InstrMem.hex FreeSlots.hex
+$(TOPMOD): *.bsv *.c InstrMem.hex
 	make -C $(TINSEL_ROOT)/apps/boot
 	make -C $(TINSEL_ROOT)/hostlink udsock
 	$(BSC) $(BSCFLAGS) $(DEFS) -D SIMULATE -sim -g $(TOPMOD) -u $(TOPFILE)
 	$(BSC) $(BSCFLAGS) -sim -o $(TOPMOD) -e $(TOPMOD) *.c
 
-$(TOPMOD).v: *.bsv $(QP)/InstrMem.mif $(QP)/FreeSlots.mif
+$(TOPMOD).v: *.bsv $(QP)/InstrMem.mif
 	make -C $(TINSEL_ROOT)/apps/boot
 	$(BSC) $(BSCFLAGS) -opt-undetermined-vals -unspecified-to X \
          $(DEFS) -u -verilog -g $(TOPMOD) $(TOPFILE)
@@ -63,12 +63,6 @@ InstrMem.hex:
 $(QP)/InstrMem.mif:
 	make -C $(TINSEL_ROOT)/apps/boot
 
-FreeSlots.hex: GenInit.sh
-	./GenInit.sh
-
-$(QP)/FreeSlots.mif: GenInit.sh
-	./GenInit.sh
-
 .PHONY: test-mem
 test-mem: testMem
 
@@ -83,7 +77,6 @@ clean:
 	rm -f de5Top.v mkCore.v mkDCache.v mkMailbox.v mkDebugLinkRouter.v
 	rm -f mkFPU.v mkMeshRouter.v
 	rm -f de5BridgeTop.v
-	rm -f FreeSlots.hex ../de5/FreeSlots.mif
 	rm -rf test-mem-log
 	rm -rf test-mailbox-log
 	rm -rf test-array-of-queue-log
diff --git a/rtl/NarrowSRAM.bsv b/rtl/NarrowSRAM.bsv
index d0651392..4e51be85 100644
--- a/rtl/NarrowSRAM.bsv
+++ b/rtl/NarrowSRAM.bsv
@@ -1,22 +1,21 @@
 // SPDX-License-Identifier: BSD-2-Clause
 package NarrowSRAM;
 
-import DCacheTypes :: *;
-import Util        :: *;
+import Util :: *;
 
 // ============================================================================
 // Types
 // ============================================================================
 
 // SRAM request id
-typedef Bit#(`LogDCachesPerDRAM) SRAMReqId;
+typedef Bit#(TLog#(TAdd#(`DCachesPerDRAM,`FetchersPerProgRouter))) SRAMReqId;
 
 // SRAM load request
 typedef struct {
   SRAMReqId id;
   Bit#(`SRAMAddrWidth) addr;
   Bit#(`SRAMBurstWidth) burst;
-  InflightDCacheReqInfo info;
+  Bit#(`BeatWidth) info;
 } SRAMLoadReq deriving (Bits);
 
 // SRAM store request
@@ -31,7 +30,7 @@ typedef struct {
 typedef struct {
   SRAMReqId id;
   Bit#(`SRAMDataWidth) data;
-  InflightDCacheReqInfo info;
+  Bit#(`BeatWidth) info;
 } SRAMResp deriving (Bits);
 
 // ============================================================================
@@ -140,7 +139,6 @@ module mkSRAM#(RAMId id) (SRAM);
         resp.id = req.id;
         resp.data = pack(elems);
         resp.info = req.info;
-        resp.info.beat = truncate(loadBurstCount);
         resps.enq(resp);
         inFlightCount.dec;
       end
@@ -243,7 +241,7 @@ endinterface
 typedef struct {
   SRAMReqId id;
   Bit#(`SRAMBurstWidth) burst;
-  InflightDCacheReqInfo info;
+  Bit#(`BeatWidth) info;
 } SRAMInFlightReq deriving (Bits);
 
 // SRAM Implementation
diff --git a/rtl/Network.bsv b/rtl/Network.bsv
index 3efbb480..07d9adfd 100644
--- a/rtl/Network.bsv
+++ b/rtl/Network.bsv
@@ -23,6 +23,9 @@ import Socket       :: *;
 import Util         :: *;
 import IdleDetector :: *;
 import FlitMerger   :: *;
+import OffChipRAM   :: *;
+import DRAM         :: *;
+import ProgRouter   :: *;
 
 // =============================================================================
 // Mesh Router
@@ -146,11 +149,9 @@ module mkMeshRouter#(MailboxId m) (MeshRouter);
 
   // Routing function
   function Route route(NetAddr a);
-         if (a.addr.board.y < b.y) return Down;
-    else if (a.addr.board.y > b.y) return Up;
-    else if (a.addr.host.valid) return a.addr.host.value == 0 ? Left : Right;
-    else if (a.addr.board.x < b.x) return Left;
-    else if (a.addr.board.x > b.x) return Right;
+         if (a.addr.board != b)   return Down;
+    else if (a.addr.isKey)        return Down;
+    else if (a.addr.host.valid)   return Down;
     else if (a.addr.mbox.y < m.y) return Down;
     else if (a.addr.mbox.y > m.y) return Up;
     else if (a.addr.mbox.x < m.x) return Left;
@@ -271,27 +272,35 @@ module mkBoardLink#(Bool en, SocketId id) (BoardLink);
 endmodule
 
 // =============================================================================
-// Mailbox Mesh
+// Network-on-chip
 // =============================================================================
 
-// Interface to external (off-board) network
-interface ExtNetwork;
-`ifndef SIMULATE
-  // Avalon interfaces to 10G MACs
+// NoC interface
+interface NoC;
+  `ifndef SIMULATE
+  // Avalon interfaces to 10G MACs (inter-FPGA links)
   interface Vector#(`NumNorthSouthLinks, AvalonMac) north;
   interface Vector#(`NumNorthSouthLinks, AvalonMac) south;
   interface Vector#(`NumEastWestLinks, AvalonMac) east;
   interface Vector#(`NumEastWestLinks, AvalonMac) west;
-`endif
+  `endif
+  // Connections to off-chip memory (for the programmable router)
+  interface Vector#(`DRAMsPerBoard,
+    Vector#(`FetchersPerProgRouter, BOut#(DRAMReq))) dramReqs;
+  interface Vector#(`DRAMsPerBoard,
+    Vector#(`FetchersPerProgRouter, In#(DRAMResp))) dramResps;
+  // ProgRouter fetcher activities & performance counters
+  interface Vector#(`FetchersPerProgRouter, FetcherActivity) activities;
+  interface ProgRouterPerfCounters progRouterPerfCounters;
 endinterface
 
-module mkMailboxMesh#(
+module mkNoC#(
          BoardId boardId,
          Vector#(4, Bool) linkEnable,
          Vector#(`MailboxMeshYLen,
            Vector#(`MailboxMeshXLen, MailboxNet)) mailboxes,
          IdleDetector idle)
-       (ExtNetwork);
+       (NoC);
 
   // Create off-board links
   Vector#(`NumNorthSouthLinks, BoardLink) northLink <-
@@ -303,6 +312,9 @@ module mkMailboxMesh#(
   Vector#(`NumEastWestLinks, BoardLink) westLink <-
     mapM(mkBoardLink(linkEnable[3]), westSocket);
 
+  // Dimension-ordered routers
+  // -------------------------
+
   // Create mailbox routers
   Vector#(`MailboxMeshYLen,
     Vector#(`MailboxMeshXLen, MeshRouter)) routers =
@@ -362,79 +374,43 @@ module mkMailboxMesh#(
                      routers[y+1][x].bottomOut, routers[y][x].topIn);
   end
 
-  // Connect north links
-  // -------------------
+  // Programmable board router
+  // -------------------------
 
-  // Extract mesh top inputs and outputs
-  List#(In#(Flit)) topInList = Nil;
-  List#(Out#(Flit)) topOutList = Nil;
-  for (Integer x = `MailboxMeshXLen-1; x >= 0; x=x-1) begin
-    topOutList = Cons(routers[`MailboxMeshYLen-1][x].topOut, topOutList);
-    topInList = Cons(routers[`MailboxMeshYLen-1][x].topIn, topInList);
-  end
+  // Programmable router
+  ProgRouter boardRouter <- mkProgRouter(boardId);
 
-  // Connect the outgoing links
-  function In#(Flit) getFlitIn(BoardLink link) = link.flitIn;
-  reduceConnect(mkFlitMerger,
-    topOutList, List::map(getFlitIn, toList(northLink)));
-  
-  // Connect the incoming links
-  function Out#(Flit) getFlitOut(BoardLink link) = link.flitOut;
-  expandConnect(List::map(getFlitOut, toList(northLink)), topInList);
-
-  // Connect south links
-  // -------------------
-
-  // Extract mesh bottom inputs and outputs
-  List#(In#(Flit)) botInList = Nil;
-  List#(Out#(Flit)) botOutList = Nil;
-  for (Integer x = `MailboxMeshXLen-1; x >= 0; x=x-1) begin
-    botOutList = Cons(routers[0][x].bottomOut, botOutList);
-    botInList = Cons(routers[0][x].bottomIn, botInList);
-  end
+  // Connect board router to north link
+  connectDirect(boardRouter.flitOut[0], northLink[0].flitIn);
+  connectUsing(mkUGShiftQueue1(QueueOptFmax),
+    northLink[0].flitOut, boardRouter.flitIn[0]);
 
-  // Connect the outgoing links
-  reduceConnect(mkFlitMerger, botOutList,
-    List::map(getFlitIn, toList(southLink)));
-  
-  // Connect the incoming links
-  expandConnect(List::map(getFlitOut, toList(southLink)), botInList);
-
-  // Connect east links
-  // ------------------
-
-  // Extract mesh right inputs and outputs
-  List#(In#(Flit)) rightInList = Nil;
-  List#(Out#(Flit)) rightOutList = Nil;
-  for (Integer y = `MailboxMeshYLen-1; y >= 0; y=y-1) begin
-    rightOutList = Cons(routers[y][`MailboxMeshXLen-1].rightOut, rightOutList);
-    rightInList = Cons(routers[y][`MailboxMeshXLen-1].rightIn, rightInList);
-  end
+  // Connect board router to south link
+  connectDirect(boardRouter.flitOut[1], southLink[0].flitIn);
+  connectUsing(mkUGShiftQueue1(QueueOptFmax),
+    southLink[0].flitOut, boardRouter.flitIn[1]);
 
-  // Connect the outgoing links
-  reduceConnect(mkFlitMerger,
-    rightOutList, List::map(getFlitIn, toList(eastLink)));
-  
-  // Connect the incoming links
-  expandConnect(List::map(getFlitOut, toList(eastLink)), rightInList);
-
-  // Connect west links
-  // ------------------
-
-   // Extract mesh right inputs and outputs
-  List#(In#(Flit)) leftInList = Nil;
-  List#(Out#(Flit)) leftOutList = Nil;
-  for (Integer y = `MailboxMeshYLen-1; y >= 0; y=y-1) begin
-    leftOutList = Cons(routers[y][0].leftOut, leftOutList);
-    leftInList = Cons(routers[y][0].leftIn, leftInList);
-  end
+  // Connect board router to east link
+  connectDirect(boardRouter.flitOut[2], eastLink[0].flitIn);
+  connectUsing(mkUGShiftQueue1(QueueOptFmax),
+    eastLink[0].flitOut, boardRouter.flitIn[2]);
 
-  // Connect the outgoing links
-  reduceConnect(mkFlitMerger,
-    leftOutList, List::map(getFlitIn, toList(westLink)));
-  
-  // Connect the incoming links
-  expandConnect(List::map(getFlitOut, toList(westLink)), leftInList);
+  // Connect board router to west link
+  connectDirect(boardRouter.flitOut[3], westLink[0].flitIn);
+  connectUsing(mkUGShiftQueue1(QueueOptFmax),
+    westLink[0].flitOut, boardRouter.flitIn[3]);
+
+  // Connect mailbox mesh south rim to board router
+  for (Integer i = 0; i < `MailboxMeshXLen; i=i+1)
+    connectUsing(mkUGShiftQueue1(QueueOptFmax),
+      routers[0][i].bottomOut, boardRouter.flitIn[4+i]);
+
+  // Connect board router to mailbox mesh south rim
+  function In#(Flit) getBottomIn(MeshRouter r) = r.bottomIn;
+  Vector#(`MailboxMeshXLen, In#(Flit)) southRimInPorts =
+    map(getBottomIn, routers[0]);
+  for (Integer i = 0; i < `MailboxMeshXLen; i=i+1)
+    connectDirect(boardRouter.flitOut[4+i], southRimInPorts[i]);
 
   // Detect inter-board activity
   // ---------------------------
@@ -465,13 +441,31 @@ module mkMailboxMesh#(
     idle.idle.interBoardActivity(activityReg);
   endrule
 
-`ifndef SIMULATE
+  // Interfaces
+  // ----------
+
+  function In#(t) getIn(InPort#(t) p) = p.in;
+
+  `ifndef SIMULATE
   function AvalonMac getMac(BoardLink link) = link.avalonMac;
   interface north = Vector::map(getMac, northLink);
   interface south = Vector::map(getMac, southLink);
   interface east = Vector::map(getMac, eastLink);
   interface west = Vector::map(getMac, westLink);
-`endif
+  `endif
+
+  // Requests to off-chip memory
+  interface dramReqs = boardRouter.ramReqs;
+
+  // Responses from off-chip memory
+  interface dramResps = boardRouter.ramResps;
+
+  // Fetcher activities
+  interface activities = boardRouter.activities;
+
+  // Performance counters
+  interface ProgRouterPerfCounters progRouterPerfCounters =
+    boardRouter.perfCounters;
 
 endmodule
 
diff --git a/rtl/ProgRouter.bsv b/rtl/ProgRouter.bsv
new file mode 100644
index 00000000..6e531261
--- /dev/null
+++ b/rtl/ProgRouter.bsv
@@ -0,0 +1,948 @@
+// SPDX-License-Identifier: BSD-2-Clause
+// Functions, data types, and modules for programmable routers
+package ProgRouter;
+
+import Globals   :: *;
+import Util      :: *;
+import DRAM      :: *;
+import Vector    :: *;
+import Queue     :: *;
+import Interface :: *;
+import BlockRam  :: *;
+import Assert    :: *;
+import Util      :: *;
+import DReg      :: *;
+
+// =============================================================================
+// Routing keys and beats
+// =============================================================================
+
+// A routing record is either 48 bits or 96 bits in size (aligned on a
+// 48-bit or 96-bit boundary respectively). Multiple records are
+// packed into a 256-bit DRAM beat (aligned on a 256-bit boundary).
+// The most significant 16 bits of the beat contain a count of the
+// number of records in the beat (in the range 1 to 5 inclusive). The
+// remaining 240 bits contain records. The first record lies in the
+// least-significant bits of the beat. The size portion of the routing
+// key contains the number of contiguous DRAM beats holding all
+// records for the key.
+
+// 256-bit routing beat
+typedef struct {
+  // Number of records present
+  Bit#(16) size;
+  // The 48-bit record chunks
+  Vector#(5, Bit#(48)) chunks;
+} RoutingBeat deriving (Bits, FShow);
+
+// 32-bit routing key
+typedef struct {
+  // Which off-chip RAM?
+  Bit#(`LogDRAMsPerBoard) ram;
+  // Pointer to array of routing beats containing routing records
+  Bit#(`LogBeatsPerDRAM) ptr;
+  // Number of beats in the array
+  Bit#(`LogRoutingEntryLen) numBeats;
+} RoutingKey deriving (Bits, FShow);
+
+// Extract routing key from an address
+function RoutingKey getRoutingKey(NetAddr addr) =
+  unpack(getRoutingKeyRaw(addr));
+
+// =============================================================================
+// Types of routing record
+// =============================================================================
+
+typedef enum {
+  URM1 = 3'd0, // 48-bit Unicast Router-to-Mailbox
+  URM2 = 3'd1, // 96-bit Unicast Router-to-Mailbox
+  RR   = 3'd2, // 48-bit Router-to-Router
+  MRM  = 3'd3, // 96-bit Multicast Router-to-Mailbox
+  IND  = 3'd4  // 48-bit Indirection
+} RoutingRecordTag deriving (Bits, Eq, FShow);
+
+typedef enum {
+  NORTH = 2'd0,
+  SOUTH = 2'd1,
+  EAST  = 2'd2,
+  WEST  = 2'd3
+} RoutingDir deriving (Bits, Eq);
+
+// 48-bit Unicast Router-to-Mailbox (URM1) record
+typedef struct {
+  // Record type
+  RoutingRecordTag tag;
+  // Mailbox destination
+  Bit#(4) mbox;
+  // Mailbox-local thread identifier
+  Bit#(6) thread;
+  // Unused
+  Bit#(3) unused;
+  // Local key. The first word of the message
+  // payload is overwritten with this.
+  Bit#(32) localKey;
+} URM1Record deriving (Bits, FShow);
+
+// 96-bit Unicast Router-to-Mailbox (URM2) record
+typedef struct {
+  // Record type
+  RoutingRecordTag tag;
+  // Mailbox destination
+  Bit#(4) mbox;
+  // Mailbox-local thread identifier
+  Bit#(6) thread;
+  // Currently unused
+  Bit#(19) unused;
+  // Local key. The first two words of the message
+  // payload are overwritten with this.
+  Bit#(64) localKey;
+} URM2Record deriving (Bits);
+
+// 48-bit Router-to-Router (RR) record
+typedef struct {
+  // Record type
+  RoutingRecordTag tag;
+  // Direction (N, S, E, or W)
+  RoutingDir dir;
+  // Currently unused
+  Bit#(11) unused;
+  // New 32-bit routing key that will replace the one in the
+  // current message for the next hop of the message's journey
+  Bit#(32) newKey;
+} RRRecord deriving (Bits);
+
+// 96-bit Multicast Router-to-Mailbox (MRM) record
+typedef struct {
+  // Record type
+  RoutingRecordTag tag;
+  // Mailbox destination
+  Bit#(4) mbox;
+  // Currently unused
+  Bit#(9) unused;
+  // Local key. The least-significant half-word
+  // of the message is replaced with this
+  Bit#(16) localKey;
+  // Mailbox-local destination mask
+  Bit#(64) destMask;
+} MRMRecord deriving (Bits);
+
+// 48-bit Indirection (IND) record
+// Note the restrictions on IND records:
+// 1. At most one IND record per key lookup
+// 2. A max-sized key lookup must contain an IND record
+typedef struct {
+  // Record type
+  RoutingRecordTag tag;
+  // Currently unused
+  Bit#(13) unused;
+  // New 32-bit routing key for new set of records on current router
+  Bit#(32) newKey;
+} INDRecord deriving (Bits);
+
+// =============================================================================
+// Internal types
+// =============================================================================
+
+// It is sometimes convenient (though redundant) to record a routing
+// decision for a flit internally within the programmable router
+typedef struct {
+  // Normal flit
+  Flit flit;
+  // Routing decision for flit
+  RoutingDecision decision;
+} RoutedFlit deriving (Bits, FShow);
+
+// Routing decision
+typedef enum {
+  RouteNorth,
+  RouteSouth,
+  RouteEast,
+  RouteWest,
+  RouteNoC
+} RoutingDecision deriving (Bits, Eq, FShow);
+
+// Elements of the indirection queue inside each fetcher
+typedef struct {
+  // The indirection
+  RoutingKey key;
+  // The location of the message in the flit buffer
+  FetcherFlitBufferMsgAddr addr;
+} IndQueueEntry deriving (Bits, FShow);
+
+// =============================================================================
+// Design
+// =============================================================================
+
+// In the following diagram N/S/E/W are the inter-FPGA links and
+// L0..L3 are links at one edge of the NoC.  Depending on the NoC
+// dimensions, there may be more or fewer than four links on a single
+// NoC edge, but the diagram assumes four.
+
+//
+//   N     S     E     W     L0    L1    L2    L3     Input flits
+//   |     |     |     |     |     |     |     |
+// +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
+// | F | | F | | F | | F | | F | | F | | F | | F |    Fetchers
+// +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
+//   |     |     |     |     |     |     |     |
+// +-------------------------------------------+      
+// |                 Crossbar                  |      Routing
+// +-------------------------------------------+      
+//   |     |     |     |     |     |     |     |  
+//   N     S     E     W     L0    L1    L2    L3     Output queues
+
+// The core functionality is implemented in the fetchers, which:
+//   (1) extract routing keys from incoming flits;
+//   (2) look up the keys in RAM;
+//   (3) interpret the resulting routing records; and
+//   (4) emit the interpreted flits.
+
+// The key property of these fetchers is that they act entirely
+// independently of each other: each one can make progress even if
+// another is blocked.  This leads to duplicated logic resources, but
+// is necessary to avoid deadlock.
+
+// As the routers are fully programmable, it is possible for the
+// programmer to introduce deadlock using an ill-defined routing
+// scheme, e.g. where a flit arrives on (say) link N and requires a
+// flit to be sent back along the same direction N.  However, the
+// hardware does guarantee deadlock-freedom if the routing scheme is
+// based on dimension-ordered routing.
+
+// After the fetchers have interpreted the flits, they are fed to a
+// fair crossbar which organises them by destination into output
+// queues.
+
+// =============================================================================
+// Fetcher
+// =============================================================================
+
+// Flit address in a fetcher's flit buffer
+typedef Bit#(`FetcherLogFlitBufferSize) FetcherFlitBufferAddr;
+
+// Message address in a fetcher's flit buffer
+typedef Bit#(`FetcherLogMsgsPerFlitBuffer) FetcherFlitBufferMsgAddr;
+
+// This structure contains information about an in-flight memory
+// request from a fetcher.  When a fetcher issues a memory load
+// request, this info is packed into the unused data field of the
+// request.  When the memory subsystem responds, it passes back the
+// same info in an extra field inside the memory response structure.
+// Maintaining info about an inflight request inside the request
+// itself provides an easy way to handle out-of-order responses from
+// memory.
+typedef struct {
+  // Message address in the fetcher's flit buffer
+  FetcherFlitBufferMsgAddr msgAddr;
+  // How many beats in the burst?
+  Bit#(`BeatBurstWidth) burst;
+  // Is this the final burst of routing records for the current key?
+  Bool finalBurst;
+  // Are we processing a max-sized key (which must contain an IND record)?
+  Bool isMaxSizedKey;
+} InflightFetcherReqInfo deriving (Bits, FShow);
+
+// Routing beat, tagged with the beat number in the DRAM burst
+typedef struct {
+  // Beat
+  RoutingBeat beat;
+  // Beat number
+  Bit#(`BeatBurstWidth) beatNum;
+  // Inflight request info
+  InflightFetcherReqInfo info;
+} NumberedRoutingBeat deriving (Bits, FShow);
+
+// Fetcher interface
+interface Fetcher;
+  // Incoming and outgoing flits
+  interface In#(Flit) flitIn;
+  interface BOut#(RoutedFlit) flitOut;
+  // Off-chip RAM connections
+  interface Vector#(`DRAMsPerBoard, BOut#(DRAMReq)) ramReqs;
+  interface Vector#(`DRAMsPerBoard, In#(DRAMResp)) ramResps;
+  // Activity
+  interface FetcherActivity activity;
+endinterface
+
+// Fetcher activity for performance counters and termination detection
+(* always_ready *)
+interface FetcherActivity;
+  // Increment number of sent messages
+  method Bit#(1) incSent;
+  // Increment number of messages sent to another board
+  method Bit#(1) incSentInterBoard;
+  // Increment number of received messages
+  method Bit#(1) incReceived;
+  // Active (in the termination-detection sense)?
+  method Bool active;
+endinterface
+
+// Fetcher module
+module mkFetcher#(BoardId boardId, Integer fetcherId) (Fetcher);
+
+  // Flit input port
+  InPort#(Flit) flitInPort <- mkInPort;
+
+  // RAM request queues
+  Vector#(`DRAMsPerBoard, Queue1#(DRAMReq)) ramReqQueue <-
+    replicateM(mkUGShiftQueue(QueueOptFmax));
+
+  // Flit buffer
+  BlockRamOpts flitBufferOpts =
+    BlockRamOpts {
+      readDuringWrite: DontCare,
+      style: "AUTO",
+      registerDataOut: False,
+      initFile: Invalid
+    };
+  BlockRam#(FetcherFlitBufferAddr, Flit) flitBuffer <-
+    mkBlockRamOpts(flitBufferOpts);
+
+  // Beat buffer
+  SizedQueue#(`FetcherLogBeatBufferSize, NumberedRoutingBeat)
+    beatBuffer <- mkUGSizedQueue;
+
+  // Track length of beat buffer, so that we don't overfetch
+  Count#(TAdd#(`FetcherLogBeatBufferSize, 1)) beatBufferLen <-
+      mkCount(2 ** `FetcherLogBeatBufferSize);
+
+  // For flits whose destinations are *not* routing keys
+  Queue1#(RoutedFlit) flitBypassQueue <- mkUGShiftQueue(QueueOptFmax);
+
+  // For flits whose destinations are routing keys
+  Queue1#(RoutedFlit) flitProcessedQueue <- mkUGShiftQueue(QueueOptFmax);
+
+  // Final output queue for flits
+  Queue1#(RoutedFlit) flitOutQueue <- mkUGShiftQueue(QueueOptFmax);
+
+  // Indirection queue and size
+  SizedQueue#(`FetcherLogIndQueueSize, IndQueueEntry) indQueue <-
+    mkUGShiftQueue(QueueOptFmax);
+  Count#(TAdd#(`FetcherLogIndQueueSize, 1)) indQueueLen <-
+      mkCount(2 ** `FetcherLogIndQueueSize);
+
+  // Activity
+  Reg#(Bit#(1)) incSentReg <- mkDReg(0);
+  Reg#(Bit#(1)) incSentInterBoardReg <- mkDReg(0);
+  Reg#(Bit#(1)) incReceivedReg <- mkDReg(0);
+
+  // Stage 1: consume input message
+  // ------------------------------
+
+  // Consumer state
+  // State 0: pass through flits that don't contain routing keys
+  // State 1: buffer flits that do contain routing keys
+  // State 2: fetch routing beats
+  Reg#(Bit#(2)) consumeState <- mkReg(0);
+
+  // Count number of flits of message consumed so far
+  Reg#(Bit#(`LogMaxFlitsPerMsg)) consumeFlitCount <- mkReg(0);
+
+  // Flit slot allocator
+  Vector#(`FetcherMsgsPerFlitBuffer, SetReset) flitBufferUsedSlots <-
+    replicateM(mkSetReset(False));
+
+  // Chosen message slot
+  Reg#(FetcherFlitBufferMsgAddr) chosenReg <- mkRegU;
+
+  // Routing key of message consumed
+  Reg#(RoutingKey) consumeKey <- mkRegU;
+
+  // Maintain count of routing beats fetched so far
+  Reg#(Bit#(`LogRoutingEntryLen)) fetchBeatCount <- mkReg(0);
+
+  // Track when messages are bypassing fetcher, to keep the bypass atomic
+  Reg#(Bool) bypassInProgress <- mkReg(False);
+
+  // State 0: pass through flits that don't contain routing keys
+  rule consumeMessage0 (consumeState == 0);
+    Flit flit = flitInPort.value;
+    // Find unused message slot
+    Bool found = False;
+    FetcherFlitBufferMsgAddr chosen = ?;
+    for (Integer i = 0; i < `FetcherMsgsPerFlitBuffer; i=i+1)
+      if (! flitBufferUsedSlots[i].value) begin
+        found = True;
+        chosen = fromInteger(i);
+      end
+    // Initialise counters for subsequent states
+    consumeFlitCount <= 0;
+    fetchBeatCount <= 0;
+    // First, try to consume indirection
+    if (indQueue.canDeq && indQueue.canPeek && !bypassInProgress) begin
+      IndQueueEntry ind = indQueue.dataOut;
+      // Consume
+      indQueue.deq;
+      // Release space in indQueue, unless we have another max-sized key
+      if (!allHigh(ind.key.numBeats))
+        indQueueLen.dec;
+      // Jump straight to fetch state, as message already in flit buffer
+      chosenReg <= ind.addr;
+      consumeKey <= ind.key;
+      // Proceed only if key size is non-zero
+      if (ind.key.numBeats != 0)
+        consumeState <= 2;
+    end else begin
+      chosenReg <= chosen;
+      // Otherwise, try to consume flit
+      if (flitInPort.canGet) begin
+        if (flit.dest.addr.isKey) begin
+          if (found) begin
+            RoutingKey key = getRoutingKey(flit.dest);
+            // For a full-size key, we must reserve space in the indQueue
+            if (allHigh(key.numBeats)) begin
+              if (indQueueLen.notFull) begin
+                indQueueLen.inc;
+                consumeState <= 1;
+              end
+            end else
+              consumeState <= 1;
+          end
+        end else if (flitBypassQueue.notFull) begin
+          flitInPort.get;
+          bypassInProgress <= flit.notFinalFlit;
+          // Make routing decision
+          RoutingDecision decision = RouteNoC;
+          MailboxNetAddr addr = flit.dest.addr;
+          if (addr.board.y < boardId.y) decision = RouteSouth;
+          else if (addr.board.y > boardId.y) decision = RouteNorth;
+          else if (addr.host.valid)
+            decision = addr.host.value == 0 ? RouteWest : RouteEast;
+          else if (addr.board.x < boardId.x) decision = RouteWest;
+          else if (addr.board.x > boardId.x) decision = RouteEast;
+          // Insert into bypass queue
+          flitBypassQueue.enq(RoutedFlit { decision: decision, flit: flit});
+        end
+      end
+    end
+  endrule
+
+  // State 1: buffer flits that do contain routing keys
+  rule consumeMessage1 (consumeState == 1);
+    Flit flit = flitInPort.value;
+    if (flitInPort.canGet) begin
+      flitInPort.get;
+      RoutingKey key = getRoutingKey(flit.dest);
+      consumeKey <= key;
+      // Write to flit buffer
+      flitBuffer.write({chosenReg, consumeFlitCount}, flit);
+      consumeFlitCount <= consumeFlitCount + 1;
+      // On final flit, move to fetch state
+      if (! flit.notFinalFlit) begin
+        // Ignore keys with zero beats
+        if (key.numBeats == 0) begin
+          consumeState <= 0;
+          incReceivedReg <= 1;
+        end else begin
+          consumeState <= 2;
+          // Claim chosen slot
+          flitBufferUsedSlots[chosenReg].set;
+        end
+      end
+    end
+  endrule
+
+  // State 2: fetch routing beats
+  rule consumeMessage2 (consumeState == 2);
+    // Is this the final burst of beats for the current key?
+    Bool finished = (consumeKey.numBeats-fetchBeatCount) <= `ProgRouterMaxBurst;
+    // Prepare inflight RAM request info
+    // (to handle out-of-order responses from the RAMs)
+    InflightFetcherReqInfo info;
+    info.msgAddr = chosenReg;
+    info.burst = truncate(
+      min(consumeKey.numBeats - fetchBeatCount, `ProgRouterMaxBurst));
+    info.finalBurst = finished;
+    info.isMaxSizedKey = allHigh(consumeKey.numBeats);
+    // Prepare RAM request
+    DRAMReq req;
+    req.isStore = False;
+    req.id = fromInteger(`DCachesPerDRAM + fetcherId);
+    req.addr = {1'b0, consumeKey.ptr + zeroExtend(fetchBeatCount)};
+    req.data = {?, pack(info)};
+    req.burst = info.burst;
+    // Don't overfetch (beat buffer has finite size)
+    if (ramReqQueue[consumeKey.ram].notFull &&
+          beatBufferLen.available >= zeroExtend(req.burst)) begin
+      ramReqQueue[consumeKey.ram].enq(req);
+      fetchBeatCount <= fetchBeatCount + zeroExtend(req.burst);
+      beatBufferLen.incBy(zeroExtend(req.burst));
+      if (finished) begin
+        consumeState <= 0;
+        incReceivedReg <= 1;
+      end
+    end
+  endrule
+
+  // Stage 2: interpret routing beats
+  // --------------------------------
+
+  // Merge responses from each RAM
+  staticAssert(`DRAMsPerBoard == 2,
+    "Fetcher: need to generalise number of RAMs used");
+  MergeUnit#(NumberedRoutingBeat) ramRespMerger <- mkMergeUnitFair;
+
+  // Convert a RAM response to a numbered routing beat
+  function NumberedRoutingBeat fromDRAMResp(DRAMResp resp) =
+    NumberedRoutingBeat {
+      beat: unpack(resp.data)
+    , beatNum: resp.beat
+    , info: unpack(truncate(resp.info))
+    };
+
+  // Create RAM response input interfaces for this module
+  In#(DRAMResp) respA <- onIn(fromDRAMResp, ramRespMerger.inA);
+  In#(DRAMResp) respB <- onIn(fromDRAMResp, ramRespMerger.inB);
+  Vector#(`DRAMsPerBoard, In#(DRAMResp)) ramRespsOut = vector(respA, respB);
+
+  // Connect the merger to the beat buffer
+  connectToQueue(ramRespMerger.out, beatBuffer);
+
+  // Count number of flits of message emitted so far
+  Reg#(Bit#(`LogMaxFlitsPerMsg)) emitFlitCount <- mkReg(0);
+
+  // Count number of records processed so far in current beat
+  Reg#(Bit#(3)) recordCount <- mkReg(0);
+
+  // (Shift) register holding current routing beat
+  Reg#(NumberedRoutingBeat) beatReg <- mkRegU;
+
+  // Interpreter state
+  // 0: register the routing beat and fetch first flit
+  // 1: interpret flits
+  Reg#(Bit#(1)) interpreterState <- mkReg(0);
+
+  // State 0: register the routing beat and fetch first flit
+  rule interpreter0 (interpreterState == 0);
+    let beat = beatBuffer.dataOut;
+    InflightFetcherReqInfo info = beat.info;
+    // Consume beat
+    if (beatBuffer.canDeq && beatBuffer.canPeek) begin
+      beatReg <= beat;
+      beatBuffer.deq;
+      beatBufferLen.dec;
+      interpreterState <= 1;
+    end
+    // Load first flit
+    flitBuffer.read({info.msgAddr, 0});
+    emitFlitCount <= 0;
+    recordCount <= 0;
+  endrule
+
+  // State 1: interpret flits
+  rule interpreter1 (interpreterState == 1);
+    // Extract details of registered routing beat
+    let beat = beatReg.beat;
+    let beatNum = beatReg.beatNum;
+    let info = beatReg.info;
+    // Extract tag from next record
+    RoutingRecordTag tag = unpack(truncateLSB(beat.chunks[4]));
+    // Is this the first flit of a message?
+    Bool firstFlit = emitFlitCount == 0;
+    // Modify flit by interpreting routing key
+    RoutingDecision decision = ?;
+    Flit flit = flitBuffer.dataOut;
+    // Unless otherwise stated (e.g. RR records),
+    // flits emitted will be destined for this board
+    flit.dest.addr.board = boardId;
+    case (tag)
+      // 48-bit Unicast Router-to-Mailbox
+      URM1: begin
+        URM1Record rec = unpack(beat.chunks[4]);
+        flit.dest.addr.isKey = False;
+        flit.dest.addr.mbox.x = unpack(truncate(rec.mbox[1:0]));
+        flit.dest.addr.mbox.y = unpack(truncate(rec.mbox[3:2]));
+        Vector#(`ThreadsPerMailbox, Bool) threadMask = newVector;
+        for (Integer j = 0; j < `ThreadsPerMailbox; j=j+1)
+          threadMask[j] = rec.thread == fromInteger(j);
+        flit.dest.threads = pack(threadMask);
+        // Replace first word of message with local key
+        if (firstFlit)
+          flit.payload = {truncateLSB(flit.payload), rec.localKey};
+        decision = RouteNoC;
+      end
+      // 96-bit Unicast Router-to-Mailbox
+      URM2: begin
+        URM2Record rec = unpack({beat.chunks[4], beat.chunks[3]});
+        flit.dest.addr.isKey = False;
+        flit.dest.addr.mbox.x = unpack(truncate(rec.mbox[1:0]));
+        flit.dest.addr.mbox.y = unpack(truncate(rec.mbox[3:2]));
+        Vector#(`ThreadsPerMailbox, Bool) threadMask = newVector;
+        for (Integer j = 0; j < `ThreadsPerMailbox; j=j+1)
+          threadMask[j] = rec.thread == fromInteger(j);
+        flit.dest.threads = pack(threadMask);
+        // Replace first two words of message with local key
+        if (firstFlit)
+          flit.payload = {truncateLSB(flit.payload), rec.localKey};
+        decision = RouteNoC;
+      end
+      // 48-bit Router-to-Router
+      RR: begin
+        RRRecord rec = unpack(beat.chunks[4]);
+        case (rec.dir)
+          NORTH: begin
+            decision = RouteNorth;
+            flit.dest.addr.board = BoardId {x: boardId.x, y: boardId.y+1};
+          end
+          SOUTH: begin
+            decision = RouteSouth;
+            flit.dest.addr.board = BoardId {x: boardId.x, y: boardId.y-1};
+          end
+          EAST: begin
+            decision = RouteEast;
+            flit.dest.addr.board = BoardId {x: boardId.x+1, y: boardId.y};
+          end
+          WEST: begin
+            decision = RouteWest;
+            flit.dest.addr.board = BoardId {x: boardId.x-1, y: boardId.y};
+          end
+        endcase
+        flit.dest.threads = {?, rec.newKey};
+      end
+      // 96-bit Multicast Router-to-Mailbox
+      MRM: begin
+        MRMRecord rec = unpack({beat.chunks[4], beat.chunks[3]});
+        flit.dest.addr.isKey = False;
+        flit.dest.addr.mbox.x = unpack(truncate(rec.mbox[1:0]));
+        flit.dest.addr.mbox.y = unpack(truncate(rec.mbox[3:2]));
+        flit.dest.threads = rec.destMask;
+        // Replace first half-word of message with local key
+        if (firstFlit)
+          flit.payload = {truncateLSB(flit.payload), rec.localKey};
+        decision = RouteNoC;
+      end
+      // 48-bit Indirection
+      IND: begin end
+    endcase
+    // Is output queue ready for new flit?
+    Bool emit = flitProcessedQueue.notFull;
+    let newFlitCount = emitFlitCount;
+    // Consume routing record
+    if (emit) begin
+      // Only enqueue if not an IND record
+      if (tag != IND)
+        flitProcessedQueue.enq(RoutedFlit { decision: decision, flit: flit });
+      // Shift beat to point to next record
+      RoutingBeat newBeat = beat;
+      Bool doubleChunk = unpack(pack(tag)[0]);
+      if (doubleChunk) begin
+        for (Integer i = 4; i > 1; i=i-1)
+          newBeat.chunks[i] = beat.chunks[i-2];
+      end else begin
+        for (Integer i = 4; i > 0; i=i-1)
+          newBeat.chunks[i] = beat.chunks[i-1];
+      end
+      // Is this the final flit in the message?
+      if (flit.notFinalFlit)
+        newFlitCount = emitFlitCount + 1;
+      else begin
+        // Move to next record
+        recordCount <= recordCount + 1;
+        beatReg <= NumberedRoutingBeat {
+          beat: newBeat, beatNum: beatNum, info: info };
+        // Handle IND record: insert into indirection queue
+        if (tag == IND) begin
+          myAssert(indQueue.notFull, "Restrictions on IND records violated");
+          INDRecord ind = unpack(beat.chunks[4]);
+          indQueue.enq(IndQueueEntry
+            { key: unpack(ind.newKey), addr: info.msgAddr });
+        end
+        // Is this the final record in the beat?
+        if ((recordCount+1) == truncate(beat.size)) begin
+          interpreterState <= 0;
+          // Have we finished with this message yet?
+          if (info.finalBurst && info.burst == (beatNum+1)) begin
+            // Reclaim message slot in flit buffer
+            // (Don't do this when we have an indirection to process)
+            if (! info.isMaxSizedKey)
+              flitBufferUsedSlots[info.msgAddr].clear;
+          end
+        end
+        incSentReg <= 1;
+        if (tag == RR) incSentInterBoardReg <= 1;
+        newFlitCount = 0;
+      end
+    end
+    // Issue flit load request
+    flitBuffer.read({info.msgAddr, newFlitCount});
+    emitFlitCount <= newFlitCount;
+  endrule
+
+  // Stage 3: merge output queues
+  // ----------------------------
+
+  // We want to merge messages, not flits
+  // Are we in the middle of consuming a message?
+  Reg#(Bool) mergeInProgress <- mkReg(False);
+  Reg#(Bool) prevFromBypass <- mkReg(False);
+
+  rule merge (flitOutQueue.notFull);
+    // Favour the bypass queue
+    Bool chooseBypass = mergeInProgress ? prevFromBypass :
+      flitBypassQueue.canDeq;
+    if (chooseBypass) begin
+      if (flitBypassQueue.canDeq) begin
+        flitBypassQueue.deq;
+        flitOutQueue.enq(flitBypassQueue.dataOut);
+        mergeInProgress <= flitBypassQueue.dataOut.flit.notFinalFlit;
+        prevFromBypass <= True;
+      end
+    end else if (flitProcessedQueue.canDeq) begin
+      flitProcessedQueue.deq;
+      flitOutQueue.enq(flitProcessedQueue.dataOut);
+      mergeInProgress <= flitProcessedQueue.dataOut.flit.notFinalFlit;
+      prevFromBypass <= False;
+    end
+  endrule
+
+  // Interfaces
+  // ----------
+
+  interface flitIn = flitInPort.in;
+  interface flitOut = queueToBOut(flitOutQueue);
+  interface ramReqs = map(queueToBOut, ramReqQueue);
+  interface ramResps = ramRespsOut;
+
+  interface FetcherActivity activity;
+    method Bit#(1) incSent = incSentReg;
+    method Bit#(1) incSentInterBoard = incSentInterBoardReg;
+    method Bit#(1) incReceived = incReceivedReg;
+    method Bool active =
+      beatBufferLen.value != 0 || consumeState != 0;
+  endinterface
+
+endmodule
+
+// =============================================================================
+// Crossbar
+// =============================================================================
+
+// Selector function for a mux in the programmable router crossbar
+typedef function Bool selector(RoutedFlit flit) SelectorFunc;
+
+module mkProgRouterCrossbar#(
+         Vector#(numOut, SelectorFunc) f,
+         Vector#(numIn, BOut#(RoutedFlit)) out)
+           (Vector#(numOut, BOut#(RoutedFlit)))
+             provisos(Add#(a__, 1, numIn));
+
+  // Input ports
+  Vector#(numIn, InPort#(RoutedFlit)) inPort <- replicateM(mkInPort);
+
+  // Connect up input ports
+  for (Integer i = 0; i < valueOf(numIn); i=i+1)
+    connectDirect(out[i], inPort[i].in);
+
+  // Consume wires, for each input port
+  Vector#(numIn, PulseWire) consumeWire <- replicateM(mkPulseWireOR);
+
+  // Keep track of service history for flit sources (for fair selection)
+  Vector#(numOut, Reg#(Bit#(numIn))) hist <- replicateM(mkReg(0));
+
+  // Current choice of flit source
+  Vector#(numOut, Reg#(Bit#(numIn))) choiceReg <- replicateM(mkReg(0));
+
+  // Output queues
+  Vector#(numOut, Queue#(RoutedFlit)) outQueue <-
+    replicateM(mkUGShiftQueue(QueueOptFmax));
+
+  // Selector mux for each out queue
+  for (Integer i = 0; i < valueOf(numOut); i=i+1) begin
+
+    rule select;
+      // Vector of input flits and available flits
+      Vector#(numIn, RoutedFlit) flits = newVector;
+      Vector#(numIn, Bool) nextAvails = newVector;
+      Bool avail = False;
+      for (Integer j = 0; j < valueOf(numIn); j=j+1) begin
+        flits[j] = inPort[j].value;
+        nextAvails[j] = inPort[j].canGet && f[i](inPort[j].value)
+                          && choiceReg[i][j] == 0;
+        avail = avail || (choiceReg[i][j] == 1 && inPort[j].canGet);
+      end
+      Bit#(numIn) nextAvail = pack(nextAvails);
+      // Choose a new source using fair scheduler
+      match {.newHist, .nextChoice} = sched(hist[i], nextAvail);
+      // Select a flit
+      RoutedFlit flit = oneHotSelect(unpack(choiceReg[i]), flits);
+      // Consume a flit
+      if (avail) begin
+        if (outQueue[i].notFull) begin
+          // Pass chosen flit to out queue
+          outQueue[i].enq(flit);
+          // On final flit of message
+          if (!flit.flit.notFinalFlit) begin
+            choiceReg[i] <= nextChoice;
+            hist[i] <= newHist;
+          end
+        end
+      end else if (choiceReg[i] == 0) begin
+        choiceReg[i] <= nextChoice;
+        hist[i] <= newHist;
+      end
+      // Consume from chosen source
+      for (Integer j = 0; j < valueOf(numIn); j=j+1)
+        if (inPort[j].canGet && choiceReg[i][j] == 1 && outQueue[i].notFull)
+          consumeWire[j].send;
+    endrule
+
+  end
+
+  // Consume from flit sources
+  rule consumeFlitSources;
+    for (Integer j = 0; j < valueOf(numIn); j=j+1)
+      if (consumeWire[j]) inPort[j].get;
+  endrule
+
+  return map(queueToBOut, outQueue);
+endmodule
+
+
+// =============================================================================
+// Splitter
+// =============================================================================
+
+// Split a single stream in two based on a predicate
+module splitFlits#(SelectorFunc f, BOut#(RoutedFlit) out)
+                  (Tuple2#(BOut#(Flit), BOut#(Flit)));
+
+  // Consume wire
+  PulseWire consumeWire <- mkPulseWireOR;
+
+  // Output streams
+  BOut#(Flit) outYes =
+    interface BOut
+      method Action get = consumeWire.send;
+      method Bool valid = out.valid && f(out.value);
+      method Flit value = out.value.flit;
+    endinterface;
+  BOut#(Flit) outNo =
+    interface BOut
+      method Action get = consumeWire.send;
+      method Bool valid = out.valid && !f(out.value);
+      method Flit value = out.value.flit;
+    endinterface;
+
+  // Consume
+  rule consume;
+    if (consumeWire) out.get;
+  endrule
+
+  return tuple2(outYes, outNo);
+endmodule
+
+// =============================================================================
+// Programmable router
+// =============================================================================
+
+// Enough bits to store a count of the number of fetchers
+typedef TLog#(TAdd#(`FetchersPerProgRouter, 1)) LogFetchersPerProgRouter;
+
+// ProgRouter's performance counters
+(* always_ready, always_enabled *)
+interface ProgRouterPerfCounters;
+  method Bit#(LogFetchersPerProgRouter) incSent;
+  method Bit#(LogFetchersPerProgRouter) incSentInterBoard;
+endinterface
+
+interface ProgRouter;
+  // Incoming and outgoing flits
+  interface Vector#(`FetchersPerProgRouter, In#(Flit)) flitIn;
+  interface Vector#(`FetchersPerProgRouter, BOut#(Flit)) flitOut;
+
+  // Interface to off-chip memory
+  interface Vector#(`DRAMsPerBoard,
+    Vector#(`FetchersPerProgRouter, BOut#(DRAMReq))) ramReqs;
+  interface Vector#(`DRAMsPerBoard,
+    Vector#(`FetchersPerProgRouter, In#(DRAMResp))) ramResps;
+
+  // Activities & performance counters
+  interface Vector#(`FetchersPerProgRouter, FetcherActivity) activities;
+  interface ProgRouterPerfCounters perfCounters;
+endinterface
+
+module mkProgRouter#(BoardId boardId) (ProgRouter);
+
+  // Fetchers
+  Vector#(`FetchersPerProgRouter, Fetcher) fetchers = newVector;
+  for (Integer i = 0; i < `FetchersPerProgRouter; i=i+1)
+    fetchers[i] <- mkFetcher(boardId, i);
+
+  // Crossbar routing functions
+  function Bit#(`MailboxMeshXBits) xcoord(RoutedFlit rf) =
+    zeroExtend(rf.flit.dest.addr.mbox.x);
+  function Bool routeN(RoutedFlit rf) = rf.decision == RouteNorth;
+  function Bool routeS(RoutedFlit rf) = rf.decision == RouteSouth;
+  function Bool routeE(RoutedFlit rf) = rf.decision == RouteEast;
+  function Bool routeW(RoutedFlit rf) = rf.decision == RouteWest;
+  function Bool routeL(Bit#(`MailboxMeshXBits) x, RoutedFlit rf) =
+    rf.decision == RouteNoC && xcoord(rf) == x;
+  Vector#(`FetchersPerProgRouter, SelectorFunc) funcs;
+  funcs[0] = routeN; funcs[1] = routeS;
+  funcs[2] = routeE; funcs[3] = routeW;
+  for (Integer i = 0; i < `MailboxMeshXLen; i=i+1)
+    funcs[4+i] = routeL(fromInteger(i));
+
+  // Crossbar
+  function BOut#(RoutedFlit) getFetcherFlitOut(Fetcher f) = f.flitOut;
+  Vector#(`FetchersPerProgRouter, BOut#(RoutedFlit)) fetcherOuts =
+    map(getFetcherFlitOut, fetchers);
+  Vector#(`FetchersPerProgRouter, BOut#(RoutedFlit))
+    crossbarOuts <- mkProgRouterCrossbar(funcs, fetcherOuts);
+  Vector#(`FetchersPerProgRouter, BOut#(Flit)) crossbarOutFlits;
+  function Flit toFlit (RoutedFlit rf) = rf.flit;
+  for (Integer i = 0; i < `FetchersPerProgRouter; i=i+1)
+    crossbarOutFlits[i] <- onBOut(toFlit, crossbarOuts[i]);
+
+  // Flit input interfaces
+  Vector#(`FetchersPerProgRouter, In#(Flit)) flitInIfc = newVector;
+  for (Integer i = 0; i < `FetchersPerProgRouter; i=i+1)
+    flitInIfc[i] = fetchers[i].flitIn;
+
+  // RAM interfaces
+  Vector#(`DRAMsPerBoard, Vector#(`FetchersPerProgRouter, In#(DRAMResp)))
+    ramRespIfc = replicate(newVector);
+  Vector#(`DRAMsPerBoard, Vector#(`FetchersPerProgRouter, BOut#(DRAMReq)))
+    ramReqIfc = replicate(newVector);
+  for (Integer i = 0; i < `DRAMsPerBoard; i=i+1)
+    for (Integer j = 0; j < `FetchersPerProgRouter; j=j+1) begin
+      ramReqIfc[i][j] = fetchers[j].ramReqs[i];
+      ramRespIfc[i][j] = fetchers[j].ramResps[i];
+    end
+
+  // Performance counters
+  Vector#(TExp#(TLog#(`FetchersPerProgRouter)),
+    Bit#(LogFetchersPerProgRouter)) incSents = replicate(0);
+  Vector#(TExp#(TLog#(`FetchersPerProgRouter)),
+    Bit#(LogFetchersPerProgRouter)) incSentsInterBoard = replicate(0);
+  for (Integer i = 0; i < `FetchersPerProgRouter; i=i+1) begin
+    incSents[i] = zeroExtend(fetchers[i].activity.incSent);
+    incSentsInterBoard[i] =
+      zeroExtend(fetchers[i].activity.incSentInterBoard);
+  end
+  Bit#(LogFetchersPerProgRouter) numSent <-
+    mkPipelinedReductionTree( \+ , 0, toList(incSents));
+  Bit#(LogFetchersPerProgRouter) numSentInterBoard <-
+    mkPipelinedReductionTree( \+ , 0, toList(incSentsInterBoard));
+
+  function FetcherActivity getActivity(Fetcher f) = f.activity;
+  interface flitIn = flitInIfc;
+  interface flitOut = crossbarOutFlits;
+  interface ramReqs = ramReqIfc;
+  interface ramResps = ramRespIfc;
+  interface activities = map(getActivity, fetchers);
+  interface ProgRouterPerfCounters perfCounters;
+    method incSent = numSent;
+    method incSentInterBoard = numSentInterBoard;
+  endinterface
+
+endmodule
+
+// For core(s) to access ProgRouter's performance counters
+(* always_ready, always_enabled *)
+interface ProgRouterPerfClient;
+  method Action incSent(Bit#(LogFetchersPerProgRouter) amount);
+  method Action incSentInterBoard(Bit#(LogFetchersPerProgRouter) amount);
+endinterface
+
+endpackage
diff --git a/rtl/Util.bsv b/rtl/Util.bsv
index 7ac885c3..f45ece48 100644
--- a/rtl/Util.bsv
+++ b/rtl/Util.bsv
@@ -254,4 +254,51 @@ module mkBuffer#(Integer n, dataT init, dataT inp) (dataT)
   return regs[n-1];
 endmodule
 
+// Isolate first hot bit
+function Bit#(n) firstHot(Bit#(n) x) = x & (~x + 1);
+
+// Function for fair scheduling of n tasks
+function Tuple2#(Bit#(n), Bit#(n)) sched(Bit#(n) hist, Bit#(n) avail);
+  // First choice: an available bit that's not in the history
+  Bit#(n) first = firstHot(avail & ~hist);
+  // Second choice: any available bit
+  Bit#(n) second = firstHot(avail);
+
+  // Return new history, and chosen bit
+  if (first != 0) begin
+    // Return first choice, and update history
+    return tuple2(hist | first, first);
+  end else begin
+    // Return second choice, and reset history
+    return tuple2(second, second);
+  end
+endfunction
+
+// Pipelined reduction tree (list length must be a power of two)
+module mkPipelinedReductionTree#(
+         function a reduce(a x, a y),
+         a init,
+         List#(a) xs)
+       (a) provisos(Bits#(a, _));
+  Integer len = List::length(xs);
+  if (len == 0)
+    return error("mkPipelinedReductionTree applied to empty list");
+  else if (len == 1)
+    return xs[0];
+  else begin
+    List#(a) ys = xs;
+    List#(a) reduced = Nil;
+    for (Integer i = 0; i < len; i=i+2) begin
+      Reg#(a) r <- mkConfigReg(init);
+      rule assignOut;
+        r <= reduce(ys[0], ys[1]);
+      endrule
+      ys = List::drop(2, ys);
+      reduced = Cons(readReg(r), reduced);
+    end
+    a res <- mkPipelinedReductionTree(reduce, init, reduced);
+    return res;
+  end
+endmodule
+
 endpackage
diff --git a/rtl/WideSRAM.bsv b/rtl/WideSRAM.bsv
index a3816a38..04af1dc7 100644
--- a/rtl/WideSRAM.bsv
+++ b/rtl/WideSRAM.bsv
@@ -108,6 +108,7 @@ module mkWideSRAM#(RAMId id) (WideSRAM);
         respOut.data = pack(data);
         respOut.info = respIn.info;
         respOut.finalBeat = True;
+        respOut.beat = 0;
         respQueue.enq(respOut);
         respCount <= 0;
       end