diff --git a/Makefile b/Makefile
index 5b2608a3..133b1533 100644
--- a/Makefile
+++ b/Makefile
@@ -26,13 +26,19 @@ clean:
make -C apps/temps clean
make -C apps/POLite/heat-gals clean
make -C apps/POLite/heat-sync clean
+ make -C apps/POLite/heat-cube-sync clean
+ make -C apps/POLite/heat-grid-sync clean
make -C apps/POLite/asp-gals clean
make -C apps/POLite/asp-sync clean
- make -C apps/POLite/asp-pc clean
make -C apps/POLite/pagerank-sync clean
make -C apps/POLite/pagerank-gals clean
+ make -C apps/POLite/sssp-sync clean
make -C apps/POLite/sssp-async clean
- make -C apps/POLite/ping-test clean
make -C apps/POLite/clocktree-async clean
+ make -C apps/POLite/izhikevich-gals clean
+ make -C apps/POLite/izhikevich-sync clean
+ make -C apps/POLite/pressure-sync clean
+ make -C apps/POLite/hashmin-sync clean
+ make -C apps/POLite/progrouters clean
make -C bin clean
make -C tests clean
diff --git a/README.md b/README.md
index 00f6a84b..a66aed56 100644
--- a/README.md
+++ b/README.md
@@ -1,13 +1,12 @@
-# Tinsel 0.7.1
+# Tinsel 0.8
Tinsel is a [RISC-V](https://riscv.org/)-based manythread
message-passing architecture designed for FPGA clusters. It is being
developed as part of the [POETS
Project](https://poets-project.org/about) (Partial Ordered Event
-Triggered Systems). This manual describes the architecture and
-associated APIs. Further background can be found in our [FPL 2019
-paper](doc/fpl-2019-paper.pdf), which presents Tinsel 0.6. If you're
-a POETS Partner, you can access a machine running Tinsel in the [POETS
+Triggered Systems). Further background can be found in our [FPL 2019
+paper](doc/fpl-2019-paper.pdf). If you're a POETS Partner, you can
+access a machine running Tinsel in the [POETS
Cloud](https://github.com/POETSII/poets-cloud).
## Release Log
@@ -27,15 +26,19 @@ Released on 10 Sep 2018 and maintained in the
* [v0.5](https://github.com/POETSII/tinsel/releases/tag/v0.5):
Released on 8 Jan 2019 and maintained in the
[tinsel-0.5.1 branch](https://github.com/POETSII/tinsel/tree/tinsel-0.5.1).
-(Hardware idle-detection.)
+(Hardware termination-detection.)
* [v0.6](https://github.com/POETSII/tinsel/releases/tag/v0.6):
Released on 11 Apr 2019 and maintained in the
[tinsel-0.6.3 branch](https://github.com/POETSII/tinsel/tree/tinsel-0.6.3).
(Multi-box cluster.)
* [v0.7](https://github.com/POETSII/tinsel/releases/tag/v0.7):
Released on 2 Dec 2019 and maintained in the
+[tinsel-0.7.1 branch](https://github.com/POETSII/tinsel/tree/tinsel-0.7.1).
+(Local hardware multicast.)
+* [v0.8](https://github.com/POETSII/tinsel/releases/tag/v0.8):
+Released on 24 Jun 2020 and maintained in the
[master branch](https://github.com/POETSII/tinsel/).
-(Localised hardware multicast.)
+(Global hardware multicast.)
## Contents
@@ -45,8 +48,9 @@ Released on 2 Dec 2019 and maintained in the
* [4. Tinsel Cache](#4-tinsel-cache)
* [5. Tinsel Mailbox](#5-tinsel-mailbox)
* [6. Tinsel Network](#6-tinsel-network)
-* [7. Tinsel HostLink](#7-tinsel-hostlink)
-* [8. POLite API](#8-polite-api)
+* [7. Tinsel Router](#7-tinsel-router)
+* [8. Tinsel HostLink](#8-tinsel-hostlink)
+* [9. POLite API](#9-polite-api)
## Appendices
@@ -62,24 +66,19 @@ Released on 2 Dec 2019 and maintained in the
## 1. Overview
On the [POETS Project](https://poets-project.org/about), we are
-looking at ways to accelerate applications that can be expressed as
-large numbers of small processes communicating by message-passing.
-Our first attempt is based around a manythread RISC-V architecture
-called Tinsel running on an FPGA cluster. Tinsel aims to support
-irregular applications that have heavy memory and communication
-demands, but fairly modest compute requrements. The main features are:
+looking at ways to accelerate applications that are naturally
+expressed as a large number of small processes communicating by
+message-passing. Our first attempt is based around a manythread
+RISC-V architecture called Tinsel, running on an FPGA cluster. The
+main features are:
* **Multithreading**. A critical aspect of the design
is to tolerate latency as cleanly as possible. This includes the
- latencies arising from: floating-point on Stratix V FPGAs
- (tens of cycles); off-chip memories; deep pipelines
- (keeping Fmax high); and sharing of resources between cores
+ latencies arising from floating-point on Stratix V FPGAs
+ (tens of cycles), off-chip memories, deep pipelines
+ (keeping Fmax high), and sharing of resources between cores
(such as caches, mailboxes, and FPUs).
- * **Caches**. To keep the programming model simple, we have opted
- to use thread-partitioned data caches to optimise access to
- off-chip memory rather than DMA.
-
* **Message-passing**. Although there is a requirement to support a
large amount of memory, it is not necessary to provide the
illusion of a single shared memory space: message-passing is intended
@@ -87,17 +86,22 @@ demands, but fairly modest compute requrements. The main features are:
instructions for sending and receiving messages
between any two threads in the cluster.
- * **Hardware termination detection**. A global termination event is
+ * **Hardware termination-detection**. A global termination event is
triggered when every thread indicates termination and no messages
are in-flight. Termination can be interpreted as termination of a
time step, or termination of the application, supporting
both synchronous and asynchronous event-driven systems.
- * **Localised hardware multicast**. Threads can send a message to
- multiple colocated destination threads simultaneously, greatly reducing
+ * **Local hardware multicast**. Threads can send a message to
+ multiple collocated destination threads simultaneously, greatly reducing
the number of inter-thread messages in applications exhibiting good
locality of communication.
+ * **Global hardware multicast**. Programmable routers
+ automatically propagate messages to any number of destination
+ threads distributed throughout the cluster, minimising inter-FPGA
+ bandwidth usage for distributed fanouts.
+
* **Host communication**. Tinsel threads communicate with x86
machines distributed throughout the FPGA cluster, for command and
control, via PCI Express and USB.
@@ -106,7 +110,7 @@ demands, but fairly modest compute requrements. The main features are:
include custom accelerators written in SystemVerilog.
This repository also includes a prototype high-level vertex-centric
-programming API for Tinsel, called [POLite](#8-polite-api).
+programming API for Tinsel, called [POLite](#9-polite-api).
## 2. High-Level Structure
@@ -133,11 +137,13 @@ accelerators](doc/custom) in tiles.
#### Tinsel FPGA
-Each FPGA contains two *Tinsel Slices*, with each slice typically
+Each FPGA contains two *Tinsel Slices*, with each slice by default
comprising eight tiles connected to one 4GB DDR3 DIMM and two 8MB
QDRII+ SRAMs. All tiles are connected together via routers to form
-a 2D NoC. At the edges of the NoC are the inter-FPGA reliable
-links.
+a 2D NoC. The NoC is connected to the inter-FPGA links using a
+*per-board programmable router*. Note that the per-board router also
+has connections to off-chip memory: this is where the programmable
+routing tables are stored.
@@ -418,16 +424,22 @@ has reached the destination or none of it has. As one would expect,
shorter messages consume less bandwidth than longer ones. The size of
a flit is defined by `LogWordsPerFlit`.
-At the heart of a mailbox is a memory-mapped *scratchpad* that
-stores both incoming and outgoing messages. The capacity of the
-scratchpad is defined by `LogMsgsPerMailbox`. Each thread connected
-to the mailbox has one message slot reserved for sending messages.
-The address of this slot is obtained using the following Tinsel API
-call.
+At the heart of a mailbox is a memory-mapped *scratchpad* that stores
+both incoming and outgoing messages. The capacity of the scratchpad
+is defined by `LogMsgsPerMailbox`. Each thread connected to the
+mailbox has one or two message slots reserved for sending messages.
+(By default, only a single send slot is reserved; the extra send slot
+may be optionally reserved at power-up via a parameter to the
+[HostLink](#8-tinsel-hostlink) constructor.) The addresses of these
+slots are obtained using the following Tinsel API calls.
```c
-// Get pointer to thread's message slot reserved for sending.
+// Get pointer to thread's message slot reserved for sending
volatile void* tinselSendSlot();
+
+// Get pointer to thread's extra message slot reserved for sending
+// (Assumes that HostLink has requested the extra slot)
+volatile void* tinselSendSlotExtra();
```
Once a thread has written a message to the scratchpad, it can trigger
@@ -544,7 +556,7 @@ Tinsel also provides a function
int tinselIdle(bool vote);
```
-which blocks until either
+for global termination detection, which blocks until either
1. a message is available to receive, or
@@ -639,7 +651,208 @@ communication. And since we are using the links point-to-point,
almost all of the ethernet header fields can be used for our own
purposes, resulting in very little overhead on the wire.
-## 7. Tinsel HostLink
+## 7. Tinsel Router
+
+Tinsel provides a programmable router on each FPGA board to support
+*global* multicasting. Programmable routers automatically propagate
+messages to any number of destination threads distributed throughout
+the cluster, minimising inter-FPGA bandwidth usage for distributed
+fanouts, and offloading work from the cores. Further background can
+be found in [PIP 24](doc/PIP-0024-global-multicast.md).
+
+To support programmable routers, the destination component of a
+message is generalised so that it can be (1) a thread id; or (2) a
+*routing key*. A message sent by a thread with a routing key as its
+destination goes to the per-board router on the same FPGA. The router
+uses the key as an index into a DRAM-based routing table and
+automatically propagates the message towards all the
+destinations associated with that key.
+
+A **routing key** is a 32-bit value consisting of a board-local *ram
+id*, a *pointer*, and a *size*:
+
+```sv
+// 32-bit routing key (MSB to LSB)
+typedef struct {
+ // Which off-chip RAM on this board?
+ Bit#(`LogDRAMsPerBoard) ram;
+ // Pointer to array of routing beats containing routing records
+ Bit#(`LogBeatsPerDRAM) ptr;
+ // Number of beats in the array
+ Bit#(`LogRoutingEntryLen) numBeats;
+} RoutingKey;
+```
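
As a rough host-side illustration of the layout above, the three fields can be packed into a 32-bit key as sketched below. The helper name `makeRoutingKey` and the concrete field widths (1, 26 and 5 bits) are assumptions chosen only so that the fields sum to 32 bits; the real widths are determined by `LogDRAMsPerBoard`, `LogBeatsPerDRAM` and `LogRoutingEntryLen`.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative field widths only (assumption): the real values come from
// LogDRAMsPerBoard, LogBeatsPerDRAM and LogRoutingEntryLen.
constexpr unsigned kRamBits  = 1;
constexpr unsigned kPtrBits  = 26;
constexpr unsigned kSizeBits = 5;
static_assert(kRamBits + kPtrBits + kSizeBits == 32, "key must be 32 bits");

// Pack the three fields MSB to LSB: ram, then ptr, then numBeats.
// (Hypothetical helper, not part of the Tinsel or HostLink API.)
inline uint32_t makeRoutingKey(uint32_t ram, uint32_t ptr, uint32_t numBeats) {
  assert(ram < (1u << kRamBits));
  assert(ptr < (1u << kPtrBits));
  assert(numBeats < (1u << kSizeBits));
  return (ram << (kPtrBits + kSizeBits)) | (ptr << kSizeBits) | numBeats;
}
```

With these assumed widths, `makeRoutingKey(1, 0x10, 3)` selects RAM 1, beat pointer `0x10`, and a lookup of three beats.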
+
+To send a message using a routing key as the destination, a new Tinsel
+API call is provided:
+
+```c
+// Send message at addr using given routing key
+inline void tinselKeySend(uint32_t key, volatile void* addr);
+```
+
+When a message reaches the per-board router, the `ptr` field of the
+routing key is used as an index into DRAM, where a sequence of 256-bit
+**routing beats** is found. The `numBeats` field of the routing key
+indicates how many contiguous routing beats there are. The value of
+`numBeats` may be zero, in which case there are no destinations
+associated with the key.
+
+A routing beat consists of a *size* and a sequence of five 48-bit
+*routing chunks*:
+
+```sv
+// 256-bit routing beat (aligned, MSB to LSB)
+typedef struct {
+ // Number of routing records present in this beat
+ Bit#(16) size;
+ // Five 48-bit record chunks
+ Vector#(5, Bit#(48)) chunks;
+} RoutingBeat;
+```
+
+The *size* must lie in the range 1 to 5 inclusive (0 is disallowed).
+A **routing record** consists of one or two routing chunks, depending
+on the **record type**.
+
+All byte orderings are little endian. For example, the order of bytes
+in a routing beat is as follows.
+
+Byte | Contents
+---- | --------
+31: | Upper byte of size (i.e. number of records in beat)
+30: | Lower byte of size
+29: | Upper byte of first chunk
+... | ...
+24: | Lower byte of first chunk
+23: | Upper byte of second chunk
+... | ...
+18: | Lower byte of second chunk
+17: | Upper byte of third chunk
+... | ...
+12: | Lower byte of third chunk
+11: | Upper byte of fourth chunk
+... | ...
+ 6: | Lower byte of fourth chunk
+ 5: | Upper byte of fifth chunk
+... | ...
+ 0: | Lower byte of fifth chunk
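
The byte layout in the table above could be produced on the host along the following lines. This is only a sketch: `packRoutingBeat` is a hypothetical helper, and it assumes each 48-bit chunk is held in the low bits of a `uint64_t`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Serialise one 256-bit routing beat into 32 bytes, following the table:
// byte 31 holds the upper byte of size, byte 0 the lower byte of chunk five.
// (Hypothetical helper; chunks are 48-bit values in the low bits of uint64_t.)
inline std::array<uint8_t, 32> packRoutingBeat(
    uint16_t size, const std::array<uint64_t, 5>& chunks) {
  assert(size >= 1 && size <= 5);  // 0 is disallowed
  std::array<uint8_t, 32> beat{};
  beat[30] = static_cast<uint8_t>(size & 0xff);
  beat[31] = static_cast<uint8_t>(size >> 8);
  for (std::size_t c = 0; c < 5; c++) {
    assert((chunks[c] >> 48) == 0);  // chunk must fit in 48 bits
    std::size_t base = 24 - 6 * c;   // index of the chunk's lower byte
    for (std::size_t b = 0; b < 6; b++)
      beat[base + b] = static_cast<uint8_t>((chunks[c] >> (8 * b)) & 0xff);
  }
  return beat;
}
```

The resulting 32-byte array can then be written to DRAM at a 32-byte-aligned address.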
+
+Clearly, both routing keys and routing beats have a maximum size.
+However, in principle there is no limit to the number of records
+associated with a key, due to the possibility of *indirection records*
+(see below).
+
+There are five types of routing record, defined below.
+
+**48-bit Unicast Router-to-Mailbox (URM1):**
+
+```sv
+typedef struct {
+ // Record type (URM1 == 0)
+ Bit#(3) tag;
+ // Mailbox destination
+ Bit#(4) mbox;
+ // Mailbox-local thread identifier
+ Bit#(6) thread;
+ // Unused
+ Bit#(3) unused;
+ // Local key. The first word of the message
+ // payload is overwritten with this.
+ Bit#(32) localKey;
+} URM1Record;
+```
+
+The `localKey` can be used for anything, but might encode the
+destination thread-local device identifier, or edge identifier, or
+both. The `mbox` field is currently 4 bits (two Y bits followed by
+two X bits), but there are spare bits available to increase the size
+of this field in future if necessary.
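
For illustration, a URM1 record might be assembled on the host as follows. The helper `makeURM1` is hypothetical, and the sketch assumes that the struct fields are packed MSB to LSB in declaration order, as for the routing key and routing beat.

```cpp
#include <cassert>
#include <cstdint>

// Build a 48-bit URM1 record in the low bits of a uint64_t, assuming the
// struct fields are packed MSB to LSB in declaration order:
//   tag[47:45] mbox[44:41] thread[40:35] unused[34:32] localKey[31:0]
// (Hypothetical helper, not part of the Tinsel or HostLink API.)
inline uint64_t makeURM1(uint32_t mboxX, uint32_t mboxY,
                         uint32_t thread, uint32_t localKey) {
  assert(mboxX < 4 && mboxY < 4 && thread < 64);
  const uint64_t tag = 0;                      // URM1 == 0
  const uint64_t mbox = (mboxY << 2) | mboxX;  // two Y bits, then two X bits
  return (tag << 45) | (mbox << 41) |
         (static_cast<uint64_t>(thread) << 35) | localKey;
}
```

The returned value occupies one 48-bit chunk of a routing beat.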
+
+**96-bit Unicast Router-to-Mailbox (URM2):**
+
+```sv
+typedef struct {
+ // Record type (URM2 == 1)
+ Bit#(3) tag;
+ // Mailbox destination
+ Bit#(4) mbox;
+ // Mailbox-local thread identifier
+ Bit#(6) thread;
+ // Currently unused
+ Bit#(19) unused;
+ // Local key. The first two words of the message
+ // payload are overwritten with this.
+ Bit#(64) localKey;
+} URM2Record;
+```
+
+This is the same as a URM1 record, except that the local key is 64 bits
+in size.
+
+**48-bit Router-to-Router (RR):**
+
+```sv
+typedef struct {
+ // Record type (RR == 2)
+ Bit#(3) tag;
+ // Direction (N,S,E,W == 0,1,2,3)
+ Bit#(2) dir;
+ // Currently unused
+ Bit#(11) unused;
+ // New 32-bit routing key that will replace the one in the
+ // current message for the next hop of the message's journey
+ Bit#(32) newKey;
+} RRRecord;
+```
+
+The `newKey` field will replace the key in the current message for the
+next hop of the message's journey. Introducing a new key at each hop
+simplifies the mapping process (keeping it quick).
+
+**96-bit Multicast Router-to-Mailbox (MRM):**
+
+```sv
+typedef struct {
+ // Record type (MRM == 3)
+ Bit#(3) tag;
+ // Mailbox destination
+ Bit#(4) mbox;
+ // Currently unused
+ Bit#(9) unused;
+ // Local key. The least-significant half-word
+ // of the message is replaced with this
+ Bit#(16) localKey;
+ // Mailbox-local destination mask
+ Bit#(64) destMask;
+} MRMRecord;
+```
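
A destination mask for an MRM record could be built as in the sketch below. It assumes that bit `t` of `destMask` selects mailbox-local thread `t`, which the struct above does not state explicitly; `makeDestMask` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Build a 64-bit mailbox-local destination mask from a list of thread ids.
// Assumption: bit t of the mask selects mailbox-local thread t.
// (Hypothetical helper, not part of the Tinsel or HostLink API.)
inline uint64_t makeDestMask(std::initializer_list<unsigned> threads) {
  uint64_t mask = 0;
  for (unsigned t : threads) {
    assert(t < 64);
    mask |= uint64_t{1} << t;
  }
  return mask;
}
```

A single MRM record built this way can address up to 64 threads in one mailbox, so it can replace many URM1 records when destinations are clustered.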
+
+**48-bit Indirection (IND):**
+
+```sv
+// 48-bit Indirection (IND) record
+// Note the restrictions on IND records:
+// 1. At most one IND record per key lookup
+// 2. A max-sized key lookup must contain an IND record
+typedef struct {
+ // Record type (IND == 4)
+ Bit#(3) tag;
+ // Currently unused
+ Bit#(13) unused;
+ // New 32-bit routing key for new set of records on current router
+ Bit#(32) newKey;
+} INDRecord;
+```
+
+Indirection records can be used to handle large fanouts that need more
+routing beats than the size portion of the routing key can express.
+
+Finally, it is worth noting that when using programmable routers,
+there is an added responsibility for the programmer to use a
+deadlock-free routing scheme, such as dimension-ordered routing.
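
As a minimal sketch of dimension-ordered routing, a mapper might choose the direction of each RR record as follows. The function `nextHop`, the `Here` marker, and the orientation of the axes (E increases X, N increases Y) are assumptions made for illustration.

```cpp
#include <cstdint>

// Direction encoding taken from the RR record (N,S,E,W == 0,1,2,3).
// 'Here' is a local marker meaning "already at the destination board".
enum class Dir : uint8_t { N = 0, S = 1, E = 2, W = 3, Here = 4 };

// Dimension-ordered (X-then-Y) routing on a 2D board mesh.
// Assumption: E increases X and N increases Y.
inline Dir nextHop(uint32_t x, uint32_t y, uint32_t destX, uint32_t destY) {
  if (x < destX) return Dir::E;
  if (x > destX) return Dir::W;
  if (y < destY) return Dir::N;
  if (y > destY) return Dir::S;
  return Dir::Here;
}
```

Because a message finishes all its X hops before taking any Y hop, it never turns from the Y dimension back into X, which is what rules out cyclic dependencies between links on a mesh.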
+
+## 8. Tinsel HostLink
*HostLink* is the means by which Tinsel cores running on a mesh of
FPGA boards communicate with a *host PC*. It comprises three main
@@ -647,7 +860,7 @@ communication channels:
* An FPGA *bridge board* that connects the host PC inside a POETS box
(PCI Express) to the FPGA mesh (SFP+). Using this high-bandwidth
-channel (10Gbps), the host PC can efficiently send messages to any
+channel (2 x 10Gbps), the host PC can efficiently send messages to any
Tinsel thread and vice-versa.
* A set of *debug links* connecting the host PC inside a POETS box to
@@ -662,34 +875,45 @@ each FPGA's *power management module* via separate USB UART cables.
These connections can be used to power-on/power-off each FPGA and to
monitor power consumption, temperature, and fan tachometer.
-HostLink supports multiple POETS boxes, but requires that one of these
-boxes is designated as the **master box**. Currently, all messages
-are injected/extracted to/from the FPGA network via the master box's
-bridge board.
-
-A Tinsel application typically consists of two programs: one which
-runs on the RISC-V cores, linked against the [Tinsel
+HostLink allows multiple POETS boxes to be used to run an application,
+but requires that one of these boxes is designated as the **master
+box**. A Tinsel application typically consists of two programs: one
+which runs on the RISC-V cores, linked against the [Tinsel
API](#f-tinsel-api), and the other which runs on the host PC of the
master box, linked against the [HostLink API](#g-hostlink-api). The
HostLink API is implemented as a C++ class called `HostLink`. The
constructor for this class first powers up all the worker FPGAs (which
-are by default powered down). On power-up the FPGAs are automatically
-programmed using the Tinsel bit-file residing in flash memory, and are
-ready to be used within a few seconds, as soon as the `HostLink`
-constructor returns.
+are by default powered down). On power-up, the FPGAs are
+automatically programmed using the Tinsel bit-file residing in flash
+memory, and are ready to be used within a few seconds, as soon as the
+`HostLink` constructor returns.
The `HostLink` constructor is overloaded:
```cpp
HostLink::HostLink();
HostLink::HostLink(uint32_t numBoxesX, uint32_t numBoxesY);
+HostLink::HostLink(HostLinkParams params);
```
If it is called without any arguments, then it assumes that a single
-box is to be used. Alternatively, the user may request multiple
-boxes by specifying the width and height of the box sub-mesh they
-wish to use. (The box from which the application is started is
-considered as the origin of this sub-mesh.)
+box is to be used. Alternatively, the user may request multiple boxes
+by specifying the width and height of the box sub-mesh they wish to
+use. (The box from which the application is started, i.e. the master
+box, is considered as the origin of this sub-mesh.) The most
+general constructor takes a `HostLinkParams` structure as an argument,
+which allows additional options to be specified.
+
+```cpp
+// HostLink parameters
+struct HostLinkParams {
+ // Number of boxes to use (default is 1x1)
+ uint32_t numBoxesX;
+ uint32_t numBoxesY;
+ // Enable use of tinselSendSlotExtra() on threads (default is false)
+ bool useExtraSendSlot;
+};
+```
HostLink methods for sending and receiving messages on the host PC are
as follows.
@@ -711,6 +935,12 @@ bool HostLink::canRecv();
// Receive a message (blocking), given size of message in bytes
// Any bytes beyond numBytes up to the next message boundary will be ignored
void HostLink::recvMsg(void* msg, uint32_t numBytes);
+
+// Send a message using routing key (blocking)
+bool HostLink::keySend(uint32_t key, uint32_t numFlits, void* msg);
+
+// Try to send using routing key (non-blocking, returns true on success)
+bool HostLink::keyTrySend(uint32_t key, uint32_t numFlits, void* msg);
```
The `send` method allows a message consisting of multiple flits to be
@@ -895,7 +1125,7 @@ not be called. When the application returns from `main()`, all but
one thread on each core are killed and the remaining threads reenter
the boot loader.
-## 8. POLite API
+## 9. POLite API
POLite is a layer of abstraction that takes care of mapping arbitrary
task graphs onto the Tinsel overlay, completely hiding architectural
@@ -1069,16 +1299,24 @@ by each thread.
After mapping, POLite writes the graph into cluster memory and
triggers execution. By default, vertex states are written into the
off-chip QDRII+ SRAMs, and edge lists are written in the DDR3 DRAMs.
-This default behaviour can be modified by setting the boolean flags
-`graph.mapVerticesToDRAM`, `graph.mapInEdgesToDRAM`,
-`graph.mapOutEdgesToDRAM` accordingly (true means "map to DRAM" and
-false means "map to SRAM"). Once the application is up and running,
-the host and the graph vertices can continue to communicate: any
-vertex can send messages to the host via the `HostPin` or the `finish`
-handler, and the host can send messages to any vertex.
+This default behaviour can be modified by adjusting the following
+flags of the `PGraph` class.
+
+ Flag | Default
+ ------------------------ | -------
+ `mapVerticesToDRAM` | `false`
+ `mapInEdgeHeadersToDRAM` | `true`
+ `mapInEdgeRestToDRAM` | `true`
+ `mapOutEdgesToDRAM` | `true`
+
+A value of `true` means "map to DRAM", while `false` means "map to
+(off-chip) SRAM". Once the application is up and running, the host
+and the graph vertices can continue to communicate: any vertex can
+send messages to the host via the `HostPin` or the `finish` handler,
+and the host can send messages to any vertex.
**Softswitch**. Central to POLite is an event loop running on each
-Tinsel thread, which we call **the softswitch** as it effectively
+Tinsel thread, which we call the softswitch as it effectively
context-switches between vertices mapped to the same thread. The
softswitch has four main responsibilities: (1) to maintain a queue of
vertices wanting to send; (2) to implement multicast sends over a pin
@@ -1087,14 +1325,34 @@ messages efficiently between vertices running on the same thread and
on different threads; and (4) to invoke the vertex handlers when
required, to meet the semantics of the POLite library.
-**Limitations**. POLite provides several important features of the
-vertex-centric paradigm, but there are some limitations. One of the
-features of the Pregel framework is the ability for vertices to add
-and remove vertices and edges at runtime -- but currently, POLite only
-supports static graphs. And for large *non-localised* fan-outs, a
-hierarchical hardware or software multicast feature may be desirable
-(where messages get forked at intermediate stages along the way to the
-destinations).
+**POLite static parameters**. The following macros can be defined,
+before the first instance of `#include `, to control some
+aspects of POLite behaviour.
+
+ Macro | Meaning
+ --------- | -------
+ `POLITE_NUM_PINS` | Max number of pins per vertex (default 1)
+ `POLITE_DUMP_STATS` | Dump stats upon completion
+ `POLITE_COUNT_MSGS` | Include message counts in stats dump
+ `POLITE_EDGES_PER_HEADER` | Lower this for large edge states (default 6)
+
+**POLite dynamic parameters**. The following environment variables can
+be set, to control some aspects of POLite behaviour.
+
+ Environment variable | Meaning
+ -------------------- | -------
+ `HOSTLINK_BOXES_X` | Size of box mesh to use in X dimension
+ `HOSTLINK_BOXES_Y` | Size of box mesh to use in Y dimension
+ `POLITE_BOARDS_X` | Size of board mesh to use in X dimension
+ `POLITE_BOARDS_Y` | Size of board mesh to use in Y dimension
+ `POLITE_CHATTY` | Set to `1` to enable emission of mapper stats
+ `POLITE_PLACER` | Use `metis`, `random`, `bfs`, or `direct` placement
+
+**Limitations**. POLite is primarily intended as a prototype library
+for hardware evaluation purposes. It occupies a single, simple point
+in a wider, richer design space. In particular, it doesn't support
+dynamic creation of vertices and edges, and it hasn't been optimised
+to deal with highly non-uniform fanouts.
## A. DE5-Net Synthesis Report
@@ -1111,9 +1369,10 @@ The default Tinsel configuration on a single DE5-Net board contains:
* four QDRII+ SRAM controllers
* four 10Gbps reliable links
* one termination/idle detector
+ * one 8x8 programmable router
* a JTAG UART
-The clock frequency is 225MHz and the resource utilisation is 74% of
+The clock frequency is 215MHz and the resource utilisation is 84% of
the DE5-Net.
## B. Tinsel Parameters
@@ -1143,9 +1402,9 @@ the DE5-Net.
`MeshXLenWithinBox` | 3 | Boards in X dimension within box
`MeshYLenWithinBox` | 2 | Boards in Y dimension within box
`EnablePerfCount` | True | Enable performance counters
- `ClockFreq` | 225 | Clock frequency in MHz
+ `ClockFreq` | 215 | Clock frequency in MHz
-Further parameters can be found in [config.py](config.py).
+A full list of parameters can be found in [config.py](config.py).
## C. Tinsel Memory Map
@@ -1204,15 +1463,20 @@ separate memory regions (which they are not).
Optional performance-counter CSRs (when `EnablePerfCount` is `True`):
- Name | CSR | R/W | Function
- ---------------- | ------ | --- | --------
- `PerfCount` | 0xc07 | W | Reset(0)/Start(1)/Stop(2) all counters
- `MissCount` | 0xc08 | R | Cache miss count
- `HitCount` | 0xc09 | R | Cache hit count
- `WritebackCount` | 0xc0a | R | Cache writeback count
- `CPUIdleCount` | 0xc0b | R | CPU idle-cycle count (lower 32 bits)
- `CPUIdleCountU` | 0xc0c | R | CPU idle-cycle count (upper 8 bits)
- `CycleU` | 0xc0d | R | Cycle counter (upper 8 bits)
+ Name | CSR | R/W | Function
+ ---------------- | ------ | --- | --------
+ `PerfCount` | 0xc07 | W | Reset(0)/Start(1)/Stop(2) all counters
+ `MissCount` | 0xc08 | R | Cache miss count
+ `HitCount` | 0xc09 | R | Cache hit count
+ `WritebackCount` | 0xc0a | R | Cache writeback count
+ `CPUIdleCount` | 0xc0b | R | CPU idle-cycle count (lower 32 bits)
+ `CPUIdleCountU` | 0xc0c | R | CPU idle-cycle count (upper 8 bits)
+ `CycleU` | 0xc0d | R | Cycle counter (upper 8 bits)
+ `ProgRouterSent` | 0xc0e | R | Total msgs sent by ProgRouter
+ `ProgRouterSentInter` | 0xc0f | R | Inter-board msgs sent by ProgRouter
+
+Note that `ProgRouterSent` and `ProgRouterSentInter` are only valid
+from thread zero on each board.
Tinsel also supports the following custom instructions.
@@ -1258,6 +1522,13 @@ inline void tinselFlushLine(uint32_t lineNum, uint32_t way);
// (A message of length n is comprised of n+1 flits)
inline void tinselSetLen(uint32_t n);
+// Get pointer to thread's message slot reserved for sending
+volatile void* tinselSendSlot();
+
+// Get pointer to thread's extra message slot reserved for sending
+// (Assumes that HostLink has requested the extra slot)
+volatile void* tinselSendSlotExtra();
+
// Determine if calling thread can send a message
inline uint32_t tinselCanSend();
@@ -1273,6 +1544,9 @@ inline void tinselMulticast(
// (Address must be aligned on message boundary)
inline void tinselSend(uint32_t dest, volatile void* addr);
+// Send message at address using given routing key
+inline void tinselKeySend(uint32_t key, volatile void* addr);
+
// Determine if calling thread can receive a message
inline uint32_t tinselCanRecv();
@@ -1352,6 +1626,14 @@ inline uint32_t tinselCPUIdleCountU();
// Read cycle counter (upper 8 bits)
inline uint32_t tinselCycleCountU();
+// Performance counter: number of messages emitted by ProgRouter
+// (Only valid from thread zero on each board)
+inline uint32_t tinselProgRouterSent();
+
+// Performance counter: number of inter-board messages emitted by ProgRouter
+// (Only valid from thread zero on each board)
+inline uint32_t tinselProgRouterSentInterBoard();
+
// Address construction
inline uint32_t tinselToAddr(
uint32_t boardX, uint32_t boardY,
@@ -1410,6 +1692,12 @@ class HostLink {
// Any bytes beyond numBytes up to the next message boundary will be ignored
void recvMsg(void* msg, uint32_t numBytes);
+ // Send a message using routing key (blocking by default)
+ bool keySend(uint32_t key, uint32_t numFlits, void* msg, bool block = true);
+
+ // Try to send using routing key (non-blocking, returns true on success)
+ bool keyTrySend(uint32_t key, uint32_t numFlits, void* msg);
+
// Bulk send and receive
// ---------------------
@@ -1476,14 +1764,24 @@ class HostLink {
// Trigger application execution on all started threads on given core
void goOne(uint32_t meshX, uint32_t meshY, uint32_t coreId);
};
+
+// HostLink parameters (used by the most general HostLink constructor)
+struct HostLinkParams {
+ // Number of boxes to use (default is 1x1)
+ uint32_t numBoxesX;
+ uint32_t numBoxesY;
+ // Enable use of tinselSendSlotExtra() on threads (default is false)
+ bool useExtraSendSlot;
+};
```
```cpp
class DebugLink {
public:
- // Constructor
+ // Constructors
DebugLink(uint32_t numBoxesX, uint32_t numBoxesY);
+ DebugLink(DebugLinkParams params);
// On given board, set destination core and thread
void setDest(uint32_t boardX, uint32_t boardY,
diff --git a/apps/POLite/asp-gals/ASP.h b/apps/POLite/asp-gals/ASP.h
index 42462622..f69dfa3d 100644
--- a/apps/POLite/asp-gals/ASP.h
+++ b/apps/POLite/asp-gals/ASP.h
@@ -9,8 +9,8 @@
#ifndef _ASP_H_
#define _ASP_H_
-//#define POLITE_DUMP_STATS
-//#define POLITE_COUNT_MSGS
+#define POLITE_DUMP_STATS
+#define POLITE_COUNT_MSGS
// Lightweight POETS frontend
#include
diff --git a/apps/POLite/asp-gals/Run.cpp b/apps/POLite/asp-gals/Run.cpp
index d50821ce..4c00e1da 100644
--- a/apps/POLite/asp-gals/Run.cpp
+++ b/apps/POLite/asp-gals/Run.cpp
@@ -51,7 +51,8 @@ int main(int argc, char**argv)
// Create random set of source nodes
uint32_t numSources = NUM_SOURCES*32;
uint32_t sources[numSources];
- randomSet(numSources, sources, graph.numDevices);
+ //randomSet(numSources, sources, graph.numDevices);
+ for (int i = 0; i < numSources; i++) sources[i] = i;
// Initialise devices
for (PDeviceId i = 0; i < graph.numDevices; i++) {
@@ -102,7 +103,9 @@ int main(int argc, char**argv)
// Display time
timersub(&finish, &start, &diff);
double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+ #ifndef POLITE_DUMP_STATS
printf("Time = %lf\n", duration);
+ #endif
return 0;
}
diff --git a/apps/POLite/asp-pc/Makefile b/apps/POLite/asp-pc/Makefile
index 0cf7448f..bf9439f3 100644
--- a/apps/POLite/asp-pc/Makefile
+++ b/apps/POLite/asp-pc/Makefile
@@ -1,10 +1,10 @@
# SPDX-License-Identifier: BSD-2-Clause
-all: asp GenHypercube GenTree GenGeoGraph
+all: asp GenHypercube GenTree
INC=../../../../include
asp: asp.cpp
- g++ -fopenmp -D_DEFAULT_SOURCE -I$(INC) -O3 asp.cpp -o asp
+ g++ -I$(INC) -O3 asp.cpp -o asp
GenHypercube: GenHypercube.hs
ghc -O2 --make GenHypercube.hs
@@ -12,8 +12,5 @@ GenHypercube: GenHypercube.hs
GenTree: GenTree.hs
ghc -O2 --make GenTree.hs
-GenGeoGraph: GenGeoGraph.cpp
- g++ -O2 -lstdc++ GenGeoGraph.cpp -o GenGeoGraph
-
clean:
- rm -f asp GenHypercube GenTree GenGeoGraph *.hi *.o
+ rm -f asp GenHypercube GenTree *.hi *.o
diff --git a/apps/POLite/asp-pc/asp-push.cpp b/apps/POLite/asp-pc/asp-push.cpp
new file mode 100644
index 00000000..a75f6628
--- /dev/null
+++ b/apps/POLite/asp-pc/asp-push.cpp
@@ -0,0 +1,180 @@
+// SPDX-License-Identifier: BSD-2-Clause
+#include "RandomSet.h"
+
+#include
+#include
+#include
+#include
+#include
+#include
+
+// Number of nodes and edges
+uint32_t numNodes;
+uint32_t numEdges;
+
+// Mapping from node id to array of neighbouring node ids
+// First element of each array holds the number of neighbours
+uint32_t** neighbours;
+
+// Mapping from node id to bit vector of reaching nodes
+uint64_t** reaching;
+uint64_t** reachingNext;
+
+// Number of 64-bit words in reaching vector
+const uint64_t vectorSize = 1;
+
+void readGraph(const char* filename, bool undirected)
+{
+ // Read edges
+ FILE* fp = fopen(filename, "rt");
+ if (fp == NULL) {
+ fprintf(stderr, "Can't open '%s'\n", filename);
+ exit(EXIT_FAILURE);
+ }
+
+ // Note: we use a "pull" algorithm (rather than "push") to
+ // avoid parallel writes to the same address, hence we reverse
+ // the direction of the edges here.
+
+ // Count number of nodes and edges
+ numEdges = 0;
+ numNodes = 0;
+ int ret;
+ while (1) {
+ uint32_t src, dst;
+ ret = fscanf(fp, "%d %d", &dst, &src);
+ if (ret == EOF) break;
+ numEdges++;
+ numNodes = src >= numNodes ? src+1 : numNodes;
+ numNodes = dst >= numNodes ? dst+1 : numNodes;
+ }
+ rewind(fp);
+
+ // Create mapping from node id to number of neighbours
+ uint32_t* count = (uint32_t*) calloc(numNodes, sizeof(uint32_t));
+ for (int i = 0; i < numEdges; i++) {
+ uint32_t src, dst;
+ ret = fscanf(fp, "%d %d", &dst, &src);
+ count[src]++;
+ if (undirected) count[dst]++;
+ }
+
+ // Create mapping from node id to neighbours
+ neighbours = (uint32_t**) calloc(numNodes, sizeof(uint32_t*));
+ rewind(fp);
+ for (int i = 0; i < numNodes; i++) {
+ neighbours[i] = (uint32_t*) calloc(count[i]+1, sizeof(uint32_t));
+ neighbours[i][0] = count[i];
+ }
+ for (int i = 0; i < numEdges; i++) {
+ uint32_t src, dst;
+ ret = fscanf(fp, "%u %u", &dst, &src);
+ neighbours[src][count[src]--] = dst;
+ if (undirected) neighbours[dst][count[dst]--] = src;
+ }
+
+ // Create mapping from node id to bit vector of reaching nodes
+ reaching = (uint64_t**) calloc(numNodes, sizeof(uint64_t*));
+ reachingNext = (uint64_t**) calloc(numNodes, sizeof(uint64_t*));
+ for (int i = 0; i < numNodes; i++) {
+ reaching[i] = (uint64_t*) calloc(vectorSize, sizeof(uint64_t));
+ reachingNext[i] = (uint64_t*) calloc(vectorSize, sizeof(uint64_t));
+ }
+
+ // Release
+ free(count);
+ fclose(fp);
+}
+
+// Compute sum of all shortest paths from given sources
+uint64_t ssp(uint32_t numSources, uint32_t* sources)
+{
+ // Sum of distances
+ uint64_t sum = 0;
+
+ // Initialise reaching vector for each node
+ for (int i = 0; i < numNodes; i++) {
+ for (int j = 0; j < vectorSize; j++) {
+ reaching[i][j] = 0;
+ reachingNext[i][j] = 0;
+ }
+ }
+ for (int i = 0; i < numSources; i++) {
+ uint32_t src = sources[i];
+ reaching[src][i/64] |= 1ull << (i%64);
+ }
+
+ int* queue = new int [numNodes];
+ int queueSize = 0;
+ for (int i = 0; i < numNodes; i++) queue[queueSize++] = i;
+
+ // Distance increases on each iteration
+ uint32_t dist = 1;
+
+ while (queueSize > 0) {
+ // For each node
+ for (int i = 0; i < queueSize; i++) {
+ int me = queue[i];
+ // For each neighbour
+ uint32_t numNeighbours = neighbours[me][0];
+ for (int j = 1; j <= numNeighbours; j++) {
+ uint32_t n = neighbours[me][j];
+ // For each chunk
+ for (int k = 0; k < vectorSize; k++) {
+ if (reaching[me][k] & ~reachingNext[n][k])
+ reachingNext[n][k] |= reaching[me][k];
+ }
+ }
+ }
+
+ // For each node, update reaching vector
+ queueSize = 0;
+ for (int i = 0; i < numNodes; i++) {
+ for (int k = 0; k < vectorSize; k++) {
+ uint64_t diff = reachingNext[i][k] & ~reaching[i][k];
+ if (diff) {
+ queue[queueSize++] = i;
+ uint32_t n = __builtin_popcountll(diff);
+ sum += (uint64_t) n * dist;
+ reaching[i][k] |= reachingNext[i][k];
+ }
+ }
+ }
+ dist++;
+ }
+
+ return sum;
+}
+
+int main(int argc, char**argv)
+{
+ if (argc != 2) {
+ printf("Specify edges file\n");
+ exit(EXIT_FAILURE);
+ }
+ bool undirected = false;
+ readGraph(argv[1], undirected);
+ printf("Nodes: %u. Edges: %u\n", numNodes, numEdges);
+
+ uint32_t numSources = 64*vectorSize;
+ assert(numSources < numNodes);
+ uint32_t sources[numSources];
+ for (int i = 0; i < numSources; i++) sources[i] = i;
+ //randomSet(numSources, sources, numNodes);
+
+ struct timeval start, finish, diff;
+
+ uint64_t sum = 0;
+ const int nodesPerVector = 64 * vectorSize;
+ gettimeofday(&start, NULL);
+ sum = ssp(numSources, sources);
+ gettimeofday(&finish, NULL);
+
+ printf("Sum of subset of shortest paths = %lu\n", sum);
+
+ timersub(&finish, &start, &diff);
+ double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+ printf("Time = %lf\n", duration);
+
+ return 0;
+}
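Reviewer note: the bit-vector technique used by ssp() above can be sketched in isolation as follows. This is a minimal sketch under stated assumptions, not the patch's code: the name `sumOfDistances` and the use of `std::vector` are hypothetical, at most 64 sources are supported (one bit per source), and `__builtin_popcountll` assumes GCC or Clang, as in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the patch): sum of shortest-path
// distances from up to 64 sources, tracked as one bit per source.
// adj[v] lists the nodes that v pushes its reaching set to.
uint64_t sumOfDistances(const std::vector<std::vector<uint32_t>>& adj,
                        const std::vector<uint32_t>& sources)
{
  assert(sources.size() <= 64);
  const size_t n = adj.size();
  std::vector<uint64_t> reaching(n, 0);
  for (size_t i = 0; i < sources.size(); i++)
    reaching[sources[i]] |= uint64_t(1) << i;
  std::vector<uint64_t> next = reaching;

  uint64_t sum = 0;
  bool changed = true;
  for (uint32_t dist = 1; changed; dist++) {
    // Push each node's current reaching set to its neighbours
    for (size_t v = 0; v < n; v++)
      for (uint32_t w : adj[v]) next[w] |= reaching[v];
    // Bits that are new at a node were first seen at distance `dist`
    changed = false;
    for (size_t v = 0; v < n; v++) {
      uint64_t diff = next[v] & ~reaching[v];
      if (diff) {
        changed = true;
        sum += uint64_t(__builtin_popcountll(diff)) * dist;
        reaching[v] = next[v];
      }
    }
  }
  return sum;
}
```

For the path 0 -> 1 -> 2 with source 0, the distances are 1 and 2, so the function returns 3. Widening the popcount to 64 bits before multiplying keeps the product from wrapping on large graphs.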
diff --git a/apps/POLite/asp-sync/Run.cpp b/apps/POLite/asp-sync/Run.cpp
index 25082646..518a33b5 100644
--- a/apps/POLite/asp-sync/Run.cpp
+++ b/apps/POLite/asp-sync/Run.cpp
@@ -19,9 +19,11 @@ int main(int argc, char**argv)
// Read network
EdgeList net;
net.read(argv[1]);
-
+
// Print max fan-out
printf("Max fan-out = %d\n", net.maxFanOut());
+ printf("Min fan-out = %d\n", net.minFanOut());
+ assert(net.minFanOut() > 0);
// Check that parameters make sense
assert(32*N <= net.numNodes);
@@ -97,7 +99,9 @@ int main(int argc, char**argv)
// Display time
timersub(&finish, &start, &diff);
double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+ #ifndef POLITE_DUMP_STATS
printf("Time = %lf\n", duration);
+ #endif
return 0;
}
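Reviewer note: the duration computation repeated in each Run.cpp could be factored as below. This is only a sketch; `toSeconds` is a hypothetical name that does not exist in the patch, and `struct timeval` comes from the POSIX header `<sys/time.h>`.

```cpp
#include <cassert>
#include <sys/time.h>

// Hypothetical helper: convert a timeval (e.g. the result of timersub)
// into seconds as a double, mirroring the expression used in Run.cpp.
inline double toSeconds(const struct timeval& tv)
{
  assert(tv.tv_usec >= 0 && tv.tv_usec < 1000000);
  return (double) tv.tv_sec + (double) tv.tv_usec / 1000000.0;
}
```

A caller would write `timersub(&finish, &start, &diff);` followed by `double duration = toSeconds(diff);`.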
diff --git a/apps/POLite/asp-tiles-sync/Run.cpp b/apps/POLite/asp-tiles-sync/Run.cpp
index 049d83a8..cdc2bb14 100644
--- a/apps/POLite/asp-tiles-sync/Run.cpp
+++ b/apps/POLite/asp-tiles-sync/Run.cpp
@@ -135,11 +135,11 @@ int main(int argc, char**argv)
double duration;
timersub(&finishCompute, &startCompute, &diff);
duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
- printf("Time (compute) = %lf\n", duration);
+ printf("Time (compute, including stats transfer over UART) = %lf\n", duration);
gettimeofday(&finishAll, NULL);
timersub(&finishAll, &startAll, &diff);
duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
- printf("Time (all) = %lf\n", duration);
+ printf("Time (all, including stats transfer over UART) = %lf\n", duration);
return 0;
}
diff --git a/apps/POLite/clocktree-async/Run.cpp b/apps/POLite/clocktree-async/Run.cpp
index 270c9b48..02f76723 100644
--- a/apps/POLite/clocktree-async/Run.cpp
+++ b/apps/POLite/clocktree-async/Run.cpp
@@ -93,7 +93,9 @@ int main(int argc, char** argv)
// Display time
timersub(&finish, &start, &diff);
double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+ #ifndef POLITE_DUMP_STATS
printf("Time = %lf\n", duration);
+ #endif
return 0;
}
diff --git a/apps/POLite/hashmin-sync/Run.cpp b/apps/POLite/hashmin-sync/Run.cpp
index cb6a7ced..eab92eff 100644
--- a/apps/POLite/hashmin-sync/Run.cpp
+++ b/apps/POLite/hashmin-sync/Run.cpp
@@ -82,7 +82,9 @@ int main(int argc, char**argv)
// Display time
timersub(&finish, &start, &diff);
double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+ #ifndef POLITE_DUMP_STATS
printf("Time = %lf\n", duration);
+ #endif
return 0;
}
diff --git a/apps/POLite/heat-cube-sync/Run.cpp b/apps/POLite/heat-cube-sync/Run.cpp
index aaa42c39..1163f01b 100644
--- a/apps/POLite/heat-cube-sync/Run.cpp
+++ b/apps/POLite/heat-cube-sync/Run.cpp
@@ -76,7 +76,9 @@ int main()
// Display time
timersub(&finish, &start, &diff);
double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+ #ifndef POLITE_DUMP_STATS
printf("Time = %lf\n", duration);
+ #endif
return 0;
}
diff --git a/apps/POLite/heat-gals/Heat.h b/apps/POLite/heat-gals/Heat.h
index 12ca9574..600b4d00 100644
--- a/apps/POLite/heat-gals/Heat.h
+++ b/apps/POLite/heat-gals/Heat.h
@@ -2,6 +2,8 @@
#ifndef _HEAT_H_
#define _HEAT_H_
+#define POLITE_DUMP_STATS
+#define POLITE_COUNT_MSGS
#include
struct HeatMessage {
@@ -10,7 +12,7 @@ struct HeatMessage {
// Time step
uint32_t time;
// Temperature at sender
- uint32_t val;
+ float val;
};
struct HeatState {
@@ -21,9 +23,9 @@ struct HeatState {
// Current time step of device
uint32_t time;
// Current temperature of device
- uint32_t val;
+ float val;
// Accumulator for temperatures received at times t and t+1
- uint32_t acc, accNext;
+ float acc, accNext;
// Count messages sent and received
uint8_t sent, received, receivedNext;
// Is the temperature of this device constant?
@@ -45,7 +47,7 @@ struct HeatDevice : PDevice {
// Proceed to next time step?
if (s->sent && s->received == s->fanIn) {
s->time--;
- if (!s->isConstant) s->val = s->acc >> 2;
+ if (!s->isConstant) s->val = s->acc / (float) s->fanIn;
s->acc = s->accNext;
s->received = s->receivedNext;
s->accNext = s->receivedNext = 0;
diff --git a/apps/POLite/heat-gals/Makefile b/apps/POLite/heat-gals/Makefile
index 0c343edd..86430b66 100644
--- a/apps/POLite/heat-gals/Makefile
+++ b/apps/POLite/heat-gals/Makefile
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: BSD-2-Clause
APP_CPP = Heat.cpp
APP_HDR = Heat.h
-RUN_CPP = Run.cpp Colours.cpp
-RUN_H = Colours.h
+RUN_CPP = Run.cpp
+RUN_H =
include ../util/polite.mk
diff --git a/apps/POLite/heat-gals/Run.cpp b/apps/POLite/heat-gals/Run.cpp
index 0a08505b..44c2f921 100644
--- a/apps/POLite/heat-gals/Run.cpp
+++ b/apps/POLite/heat-gals/Run.cpp
@@ -1,17 +1,31 @@
// SPDX-License-Identifier: BSD-2-Clause
#include "Heat.h"
-#include "Colours.h"
#include
#include
+#include
#include
-int main()
+int main(int argc, char **argv)
{
// Parameters
- const uint32_t width = 256;
- const uint32_t height = 256;
- const uint32_t time = 1000;
+ const uint32_t time = 1000;
+
+ // Read in the example edge list and create data structure
+ if (argc != 2) {
+ printf("Specify edge file\n");
+ exit(EXIT_FAILURE);
+ }
+
+ // Load in the edge list file
+ printf("Loading in the graph..."); fflush(stdout);
+ EdgeList net;
+ net.read(argv[1]);
+ printf(" done\n");
+
+ // Print min and max fan-out
+ printf("Min fan-out = %d\n", net.minFanOut());
+ printf("Max fan-out = %d\n", net.maxFanOut());
// Connection to tinsel machine
HostLink hostLink;
@@ -19,58 +33,32 @@ int main()
// Create POETS graph
PGraph graph;
- // Create 2D mesh of devices
- PDeviceId **mesh = new PDeviceId* [height];
- for (uint32_t y = 0; y < height; y++) {
- mesh[y] = new PDeviceId [width];
- for (uint32_t x = 0; x < width; x++)
- mesh[y][x] = graph.newDevice();
+ // Create nodes in POETS graph
+ for (uint32_t i = 0; i < net.numNodes; i++) {
+ PDeviceId id = graph.newDevice();
+ assert(i == id);
}
- // Add edges
- for (uint32_t y = 0; y < height; y++)
- for (uint32_t x = 0; x < width; x++) {
- if (x < width-1) {
- graph.addEdge(mesh[y][x], 0, mesh[y][x+1]);
- graph.addEdge(mesh[y][x+1], 0, mesh[y][x]);
- }
- if (y < height-1) {
- graph.addEdge(mesh[y][x], 0, mesh[y+1][x]);
- graph.addEdge(mesh[y+1][x], 0, mesh[y][x]);
- }
- }
+ // Create connections in POETS graph
+ for (uint32_t i = 0; i < net.numNodes; i++) {
+ uint32_t numNeighbours = net.neighbours[i][0];
+ for (uint32_t j = 0; j < numNeighbours; j++)
+ graph.addEdge(i, 0, net.neighbours[i][j+1]);
+ }
// Prepare mapping from graph to hardware
graph.map();
- // Set device ids
- for (uint32_t y = 0; y < height; y++)
- for (uint32_t x = 0; x < width; x++)
- graph.devices[mesh[y][x]]->state.id = mesh[y][x];
-
- // Initialise time and fanIn fields
+ // Initialise id, time steps, temperature and fan-in of each device
+ srand(1);
for (PDeviceId i = 0; i < graph.numDevices; i++) {
+ int r = rand() % 255;
+ graph.devices[i]->state.id = i;
graph.devices[i]->state.time = time;
+ graph.devices[i]->state.val = (float) r;
+ graph.devices[i]->state.isConstant = false;
graph.devices[i]->state.fanIn = graph.fanIn(i);
}
-
- // Apply constant heat at north edge
- // Apply constant cool at south edge
- for (uint32_t x = 0; x < width; x++) {
- graph.devices[mesh[0][x]]->state.val = 255 << 16;
- graph.devices[mesh[0][x]]->state.isConstant = true;
- graph.devices[mesh[height-1][x]]->state.val = 40 << 16;
- graph.devices[mesh[height-1][x]]->state.isConstant = true;
- }
-
- // Apply constant heat at west edge
- // Apply constant cool at east edge
- for (uint32_t y = 0; y < height; y++) {
- graph.devices[mesh[y][0]]->state.val = 255 << 16;
- graph.devices[mesh[y][0]]->state.isConstant = true;
- graph.devices[mesh[y][width-1]]->state.val = 40 << 16;
- graph.devices[mesh[y][width-1]]->state.isConstant = true;
- }
// Write graph down to tinsel machine via HostLink
graph.write(&hostLink);
@@ -84,8 +72,11 @@ int main()
struct timeval start, finish, diff;
gettimeofday(&start, NULL);
+ // Consume performance stats
+ politeSaveStats(&hostLink, "stats.txt");
+
// Allocate array to contain final value of each device
- uint32_t* pixels = new uint32_t [graph.numDevices];
+ float* pixels = new float [graph.numDevices];
// Receive final value of each device
for (uint32_t i = 0; i < graph.numDevices; i++) {
@@ -97,25 +88,19 @@ int main()
pixels[msg.payload.from] = msg.payload.val;
}
+ // Display final values of first ten devices
+ for (uint32_t i = 0; i < 10; i++) {
+ if (i < graph.numDevices) {
+ printf("%u: %f\n", i, pixels[i]);
+ }
+ }
+
// Display time
timersub(&finish, &start, &diff);
double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+ #ifndef POLITE_DUMP_STATS
printf("Time = %lf\n", duration);
-
- // Emit image
- FILE* fp = fopen("out.ppm", "wt");
- if (fp == NULL) {
- printf("Can't open output file for writing\n");
- return -1;
- }
- fprintf(fp, "P3\n%d %d\n255\n", width, height);
- for (uint32_t y = 0; y < height; y++)
- for (uint32_t x = 0; x < width; x++) {
- uint32_t val = (pixels[mesh[y][x]] >> 16) & 0xff;
- fprintf(fp, "%d %d %d\n",
- colours[val*3], colours[val*3+1], colours[val*3+2]);
- }
- fclose(fp);
+ #endif
return 0;
}
diff --git a/apps/POLite/heat-gals/Colours.cpp b/apps/POLite/heat-grid-sync/Colours.cpp
similarity index 100%
rename from apps/POLite/heat-gals/Colours.cpp
rename to apps/POLite/heat-grid-sync/Colours.cpp
diff --git a/apps/POLite/heat-gals/Colours.h b/apps/POLite/heat-grid-sync/Colours.h
similarity index 100%
rename from apps/POLite/heat-gals/Colours.h
rename to apps/POLite/heat-grid-sync/Colours.h
diff --git a/apps/POLite/ping-test/ping.cpp b/apps/POLite/heat-grid-sync/Heat.cpp
similarity index 57%
rename from apps/POLite/ping-test/ping.cpp
rename to apps/POLite/heat-grid-sync/Heat.cpp
index 74960d36..b2b4fc3e 100644
--- a/apps/POLite/ping-test/ping.cpp
+++ b/apps/POLite/heat-grid-sync/Heat.cpp
@@ -1,21 +1,21 @@
// SPDX-License-Identifier: BSD-2-Clause
-#include "ping.h"
+#include "Heat.h"
#include
#include
typedef PThread<
- PingDevice,
- PingState, // State
+ HeatDevice,
+ HeatState, // State
None, // Edge label
- PingMessage // Message
- > PingThread;
+ HeatMessage // Message
+ > HeatThread;
int main()
{
// Point thread structure at base of thread's heap
- PingThread* thread = (PingThread*) tinselHeapBaseSRAM();
-
+ HeatThread* thread = (HeatThread*) tinselHeapBaseSRAM();
+
// Invoke interpreter
thread->run();
diff --git a/apps/POLite/heat-grid-sync/Heat.h b/apps/POLite/heat-grid-sync/Heat.h
new file mode 100644
index 00000000..b3a63a93
--- /dev/null
+++ b/apps/POLite/heat-grid-sync/Heat.h
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: BSD-2-Clause
+#ifndef _HEAT_H_
+#define _HEAT_H_
+
+#include
+
+struct HeatMessage {
+ // Sender id
+ uint32_t from;
+ // Time step
+ uint32_t time;
+ // Temperature at sender
+ uint32_t val;
+};
+
+struct HeatState {
+ // Device id
+ uint32_t id;
+ // Current time step of device
+ uint32_t time;
+ // Current temperature of device
+ uint32_t val, acc;
+ // Is the temperature of this device constant?
+ bool isConstant;
+};
+
+struct HeatDevice : PDevice {
+
+ // Called once by POLite at start of execution
+ inline void init() {
+ *readyToSend = Pin(0);
+ }
+
+ // Send handler
+ inline void send(volatile HeatMessage* msg) {
+ msg->from = s->id;
+ msg->time = s->time;
+ msg->val = s->val;
+ *readyToSend = No;
+ }
+
+ // Receive handler
+ inline void recv(HeatMessage* msg, None* edge) {
+ s->acc += msg->val;
+ }
+
+ // Called by POLite when system becomes idle
+ inline bool step() {
+ // Execution complete?
+ if (s->time == 0) {
+ *readyToSend = No;
+ return false;
+ }
+ else {
+ s->time--;
+ if (!s->isConstant) s->val = s->acc >> 2;
+ s->acc = 0;
+ *readyToSend = Pin(0);
+ return true;
+ }
+ }
+
+ // Optionally send message to host on termination
+ inline bool finish(volatile HeatMessage* msg) {
+ msg->from = s->id;
+ msg->val = s->val;
+ return true;
+ }
+};
+
+#endif
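Reviewer note: the grid variant keeps temperatures as unsigned integers shifted left by 16 bits (Run.cpp writes `255 << 16` and reads back `(val >> 16) & 0xff`), and `acc >> 2` divides the sum of four neighbours by four. The sketch below illustrates that convention. The helper names are hypothetical, and it assumes exactly four neighbours per non-constant device, which appears to hold here because every border device of the mesh is marked constant.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers illustrating the 16.16 fixed-point convention.

// Integer temperature to fixed point (16 fractional bits)
inline uint32_t toFixed(uint32_t t)
{
  assert(t <= 0xff);
  return t << 16;
}

// Average of exactly four fixed-point neighbours, as in `acc >> 2`
inline uint32_t average4(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
  return (a + b + c + d) >> 2;
}

// Fixed point back to a 0..255 colour index
inline uint32_t toColourIndex(uint32_t fixed)
{
  return (fixed >> 16) & 0xff;
}
```

Four values of at most `255 << 16` sum to less than 2^32, so the accumulator cannot overflow in a single step.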
diff --git a/apps/POLite/heat-grid-sync/Makefile b/apps/POLite/heat-grid-sync/Makefile
new file mode 100644
index 00000000..0c343edd
--- /dev/null
+++ b/apps/POLite/heat-grid-sync/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: BSD-2-Clause
+APP_CPP = Heat.cpp
+APP_HDR = Heat.h
+RUN_CPP = Run.cpp Colours.cpp
+RUN_H = Colours.h
+
+include ../util/polite.mk
diff --git a/apps/POLite/heat-grid-sync/Run.cpp b/apps/POLite/heat-grid-sync/Run.cpp
new file mode 100644
index 00000000..a938a446
--- /dev/null
+++ b/apps/POLite/heat-grid-sync/Run.cpp
@@ -0,0 +1,119 @@
+// SPDX-License-Identifier: BSD-2-Clause
+#include "Heat.h"
+#include "Colours.h"
+
+#include
+#include
+#include
+
+int main()
+{
+ // Parameters
+ const uint32_t width = 256;
+ const uint32_t height = 256;
+ const uint32_t time = 1000;
+
+ // Connection to tinsel machine
+ HostLink hostLink;
+
+ // Create POETS graph
+ PGraph graph;
+
+ // Create 2D mesh of devices
+ PDeviceId **mesh = new PDeviceId* [height];
+ for (uint32_t y = 0; y < height; y++) {
+ mesh[y] = new PDeviceId [width];
+ for (uint32_t x = 0; x < width; x++)
+ mesh[y][x] = graph.newDevice();
+ }
+
+ // Add edges
+ for (uint32_t y = 0; y < height; y++)
+ for (uint32_t x = 0; x < width; x++) {
+ if (x < width-1) {
+ graph.addEdge(mesh[y][x], 0, mesh[y][x+1]);
+ graph.addEdge(mesh[y][x+1], 0, mesh[y][x]);
+ }
+ if (y < height-1) {
+ graph.addEdge(mesh[y][x], 0, mesh[y+1][x]);
+ graph.addEdge(mesh[y+1][x], 0, mesh[y][x]);
+ }
+ }
+
+ // Prepare mapping from graph to hardware
+ graph.map();
+
+ // Set device ids
+ for (uint32_t y = 0; y < height; y++)
+ for (uint32_t x = 0; x < width; x++)
+ graph.devices[mesh[y][x]]->state.id = mesh[y][x];
+
+ // Specify number of time steps to run on each device
+ for (PDeviceId i = 0; i < graph.numDevices; i++)
+ graph.devices[i]->state.time = time;
+
+ // Apply constant heat at north edge
+ // Apply constant cool at south edge
+ for (uint32_t x = 0; x < width; x++) {
+ graph.devices[mesh[0][x]]->state.val = 255 << 16;
+ graph.devices[mesh[0][x]]->state.isConstant = true;
+ graph.devices[mesh[height-1][x]]->state.val = 40 << 16;
+ graph.devices[mesh[height-1][x]]->state.isConstant = true;
+ }
+
+ // Apply constant heat at west edge
+ // Apply constant cool at east edge
+ for (uint32_t y = 0; y < height; y++) {
+ graph.devices[mesh[y][0]]->state.val = 255 << 16;
+ graph.devices[mesh[y][0]]->state.isConstant = true;
+ graph.devices[mesh[y][width-1]]->state.val = 40 << 16;
+ graph.devices[mesh[y][width-1]]->state.isConstant = true;
+ }
+
+ // Write graph down to tinsel machine via HostLink
+ graph.write(&hostLink);
+
+ // Load code and trigger execution
+ hostLink.boot("code.v", "data.v");
+ hostLink.go();
+ printf("Starting\n");
+
+ // Start timer
+ struct timeval start, finish, diff;
+ gettimeofday(&start, NULL);
+
+ // Allocate array to contain final value of each device
+ uint32_t* pixels = new uint32_t [graph.numDevices];
+
+ // Receive final value of each device
+ for (uint32_t i = 0; i < graph.numDevices; i++) {
+ // Receive message
+ PMessage msg;
+ hostLink.recvMsg(&msg, sizeof(msg));
+ if (i == 0) gettimeofday(&finish, NULL);
+ // Save final value
+ pixels[msg.payload.from] = msg.payload.val;
+ }
+
+ // Display time
+ timersub(&finish, &start, &diff);
+ double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+ printf("Time = %lf\n", duration);
+
+ // Emit image
+ FILE* fp = fopen("out.ppm", "wt");
+ if (fp == NULL) {
+ printf("Can't open output file for writing\n");
+ return -1;
+ }
+ fprintf(fp, "P3\n%d %d\n255\n", width, height);
+ for (uint32_t y = 0; y < height; y++)
+ for (uint32_t x = 0; x < width; x++) {
+ uint32_t val = (pixels[mesh[y][x]] >> 16) & 0xff;
+ fprintf(fp, "%d %d %d\n",
+ colours[val*3], colours[val*3+1], colours[val*3+2]);
+ }
+ fclose(fp);
+
+ return 0;
+}
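Reviewer note: the mesh construction above adds each connection in both directions. A minimal sketch of the same pattern without the POLite API follows. `meshEdges` is a hypothetical name, and device ids are assumed to be `y*width + x`, whereas the patch uses whatever ids `graph.newDevice()` returns.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: directed edges of a width x height mesh, where the
// device at (x, y) is assumed to have id y*width + x.
std::vector<std::pair<uint32_t, uint32_t>> meshEdges(uint32_t width,
                                                     uint32_t height)
{
  assert(width > 0 && height > 0);
  std::vector<std::pair<uint32_t, uint32_t>> edges;
  for (uint32_t y = 0; y < height; y++)
    for (uint32_t x = 0; x < width; x++) {
      uint32_t me = y * width + x;
      // Connect to the east neighbour, in both directions
      if (x < width - 1) {
        edges.push_back({me, me + 1});
        edges.push_back({me + 1, me});
      }
      // Connect to the south neighbour, in both directions
      if (y < height - 1) {
        edges.push_back({me, me + width});
        edges.push_back({me + width, me});
      }
    }
  return edges;
}
```

For the 256 x 256 mesh used above this yields 2 * (255*256 + 256*255) = 261120 directed edges.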
diff --git a/apps/POLite/heat-pc/Makefile b/apps/POLite/heat-pc/Makefile
new file mode 100644
index 00000000..235863ef
--- /dev/null
+++ b/apps/POLite/heat-pc/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: BSD-2-Clause
+all: heat
+
+INC=../../../include
+
+heat: heat.cpp
+ g++ -I$(INC) -O3 heat.cpp -o heat
+
+.PHONY: clean
+clean:
+ rm -f heat
diff --git a/apps/POLite/heat-pc/heat.cpp b/apps/POLite/heat-pc/heat.cpp
new file mode 100644
index 00000000..194766ac
--- /dev/null
+++ b/apps/POLite/heat-pc/heat.cpp
@@ -0,0 +1,63 @@
+// SPDX-License-Identifier: BSD-2-Clause
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+int main(int argc, char**argv)
+{
+ if (argc != 2) {
+ printf("Specify edges file\n");
+ exit(EXIT_FAILURE);
+ }
+
+ // Read network
+ EdgeList net;
+ net.read(argv[1]);
+
+ // Create states
+ float* heat = new float [net.numNodes];
+ float* heatNext = new float [net.numNodes];
+ srand(1);
+ for (int i = 0; i < net.numNodes; i++) {
+ int r = rand() % 255;
+ heat[i] = (float) r;
+ }
+
+ // Start timer
+ printf("Started\n");
+ struct timeval start, finish, diff;
+ gettimeofday(&start, NULL);
+
+ for (int t = 0; t < 100; t++) {
+ for (int i = 0; i < net.numNodes; i++) {
+ uint32_t numNeighbours = net.neighbours[i][0];
+ float acc = 0.0;
+ for (uint32_t j = 0; j < numNeighbours; j++) {
+ uint32_t neighbour = net.neighbours[i][j+1];
+ acc += heat[neighbour];
+ }
+ heatNext[i] = acc / (float) numNeighbours;
+ }
+ float* tmp = heat; heat = heatNext; heatNext = tmp;
+ }
+
+ // Stop timer
+ gettimeofday(&finish, NULL);
+
+ // Display final values of first ten devices
+ for (uint32_t i = 0; i < 10; i++) {
+ if (i < net.numNodes)
+ printf("%u: %f\n", i, heat[i]);
+ }
+
+ // Display time
+ timersub(&finish, &start, &diff);
+ double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+ printf("Time = %lf\n", duration);
+
+ return 0;
+}
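Reviewer note: the update loop in heat.cpp is a double-buffered neighbour average. The sketch below shows the same scheme on plain vectors. `diffuse` is a hypothetical name, and the guard for nodes without neighbours is an assumption added for illustration: the patch divides by the neighbour count unconditionally.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: run `steps` rounds of neighbour averaging.
// nbrs[i] lists the neighbours whose heat node i averages.
std::vector<float> diffuse(const std::vector<std::vector<unsigned>>& nbrs,
                           std::vector<float> heat, int steps)
{
  assert(nbrs.size() == heat.size());
  std::vector<float> next(heat.size());
  for (int t = 0; t < steps; t++) {
    for (size_t i = 0; i < nbrs.size(); i++) {
      float acc = 0.0f;
      for (unsigned j : nbrs[i]) acc += heat[j];
      // Assumption: an isolated node keeps its value rather than
      // dividing by zero
      next[i] = nbrs[i].empty() ? heat[i] : acc / (float) nbrs[i].size();
    }
    // Swap buffers so the next round reads this round's results
    std::swap(heat, next);
  }
  return heat;
}
```

On a triangle with initial heat {0, 3, 6}, one step gives {4.5, 3, 1.5}: every node takes the mean of the other two.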
diff --git a/apps/POLite/heat-sync/Colours.cpp b/apps/POLite/heat-sync/Colours.cpp
deleted file mode 100644
index 93b49740..00000000
--- a/apps/POLite/heat-sync/Colours.cpp
+++ /dev/null
@@ -1,71 +0,0 @@
-// SPDX-License-Identifier: BSD-2-Clause
-#include
-
-// 256 x RGB colours representing heat intensities
-uint8_t colours[] = {
- 0x00, 0x00, 0x76, 0x00, 0x00, 0x7a, 0x00, 0x00, 0x7f, 0x00, 0x00, 0x83,
- 0x00, 0x00, 0x88, 0x00, 0x00, 0x8c, 0x00, 0x00, 0x91, 0x00, 0x00, 0x95,
- 0x00, 0x00, 0x9a, 0x00, 0x00, 0x9e, 0x00, 0x00, 0xa3, 0x00, 0x00, 0xa3,
- 0x00, 0x00, 0xa7, 0x00, 0x00, 0xac, 0x00, 0x00, 0xb0, 0x00, 0x00, 0xb5,
- 0x00, 0x00, 0xb9, 0x00, 0x00, 0xbe, 0x00, 0x00, 0xc2, 0x00, 0x00, 0xc7,
- 0x00, 0x00, 0xcb, 0x00, 0x00, 0xd0, 0x00, 0x00, 0xd4, 0x00, 0x00, 0xd9,
- 0x00, 0x00, 0xde, 0x00, 0x00, 0xe2, 0x00, 0x00, 0xe7, 0x00, 0x00, 0xeb,
- 0x00, 0x00, 0xf0, 0x00, 0x00, 0xf4, 0x00, 0x00, 0xf9, 0x00, 0x00, 0xfd,
- 0x00, 0x03, 0xff, 0x00, 0x07, 0xff, 0x00, 0x0c, 0xff, 0x00, 0x10, 0xff,
- 0x00, 0x15, 0xff, 0x00, 0x19, 0xff, 0x00, 0x1e, 0xff, 0x00, 0x22, 0xff,
- 0x00, 0x27, 0xff, 0x00, 0x2b, 0xff, 0x00, 0x30, 0xff, 0x00, 0x34, 0xff,
- 0x00, 0x39, 0xff, 0x00, 0x3d, 0xff, 0x00, 0x42, 0xff, 0x00, 0x47, 0xff,
- 0x00, 0x4b, 0xff, 0x00, 0x50, 0xff, 0x00, 0x54, 0xff, 0x00, 0x59, 0xff,
- 0x00, 0x5d, 0xff, 0x00, 0x62, 0xff, 0x00, 0x66, 0xff, 0x00, 0x6b, 0xff,
- 0x00, 0x6f, 0xff, 0x00, 0x74, 0xff, 0x00, 0x78, 0xff, 0x00, 0x7d, 0xff,
- 0x00, 0x81, 0xff, 0x00, 0x86, 0xff, 0x00, 0x8a, 0xff, 0x00, 0x8f, 0xff,
- 0x00, 0x93, 0xff, 0x00, 0x98, 0xff, 0x00, 0x9c, 0xff, 0x00, 0xa1, 0xff,
- 0x00, 0xa5, 0xff, 0x00, 0xaa, 0xff, 0x00, 0xaf, 0xff, 0x00, 0xb3, 0xff,
- 0x00, 0xb8, 0xff, 0x00, 0xbc, 0xff, 0x00, 0xc1, 0xff, 0x00, 0xc5, 0xff,
- 0x00, 0xca, 0xff, 0x00, 0xce, 0xff, 0x00, 0xd3, 0xff, 0x00, 0xd7, 0xff,
- 0x00, 0xdc, 0xff, 0x00, 0xe0, 0xff, 0x00, 0xe5, 0xff, 0x00, 0xe9, 0xff,
- 0x00, 0xee, 0xff, 0x00, 0xf2, 0xff, 0x00, 0xf7, 0xff, 0x00, 0xfb, 0xff,
- 0x00, 0xff, 0xff, 0x00, 0xff, 0xfa, 0x00, 0xff, 0xf5, 0x00, 0xff, 0xf1,
- 0x00, 0xff, 0xec, 0x00, 0xff, 0xe7, 0x00, 0xff, 0xe3, 0x00, 0xff, 0xde,
- 0x00, 0xff, 0xda, 0x00, 0xff, 0xd5, 0x00, 0xff, 0xd1, 0x00, 0xff, 0xcc,
- 0x00, 0xff, 0xc8, 0x00, 0xff, 0xc3, 0x00, 0xff, 0xbf, 0x00, 0xff, 0xba,
- 0x00, 0xff, 0xb6, 0x00, 0xff, 0xb1, 0x00, 0xff, 0xad, 0x00, 0xff, 0xa8,
- 0x00, 0xff, 0xa4, 0x00, 0xff, 0x9f, 0x00, 0xff, 0x9b, 0x00, 0xff, 0x96,
- 0x00, 0xff, 0x92, 0x00, 0xff, 0x8d, 0x00, 0xff, 0x89, 0x00, 0xff, 0x84,
- 0x00, 0xff, 0x80, 0x00, 0xff, 0x7b, 0x00, 0xff, 0x76, 0x00, 0xff, 0x72,
- 0x00, 0xff, 0x6d, 0x00, 0xff, 0x69, 0x00, 0xff, 0x64, 0x00, 0xff, 0x60,
- 0x00, 0xff, 0x5b, 0x00, 0xff, 0x57, 0x00, 0xff, 0x52, 0x00, 0xff, 0x4e,
- 0x00, 0xff, 0x49, 0x00, 0xff, 0x45, 0x00, 0xff, 0x40, 0x00, 0xff, 0x3c,
- 0x00, 0xff, 0x37, 0x00, 0xff, 0x33, 0x00, 0xff, 0x2e, 0x00, 0xff, 0x2a,
- 0x00, 0xff, 0x25, 0x00, 0xff, 0x21, 0x00, 0xff, 0x1c, 0x00, 0xff, 0x18,
- 0x00, 0xff, 0x13, 0x00, 0xff, 0x0e, 0x00, 0xff, 0x0a, 0x00, 0xff, 0x05,
- 0x00, 0xff, 0x01, 0x04, 0xff, 0x00, 0x08, 0xff, 0x00, 0x0d, 0xff, 0x00,
- 0x11, 0xff, 0x00, 0x16, 0xff, 0x00, 0x1a, 0xff, 0x00, 0x1f, 0xff, 0x00,
- 0x23, 0xff, 0x00, 0x28, 0xff, 0x00, 0x2c, 0xff, 0x00, 0x31, 0xff, 0x00,
- 0x35, 0xff, 0x00, 0x3a, 0xff, 0x00, 0x3e, 0xff, 0x00, 0x43, 0xff, 0x00,
- 0x47, 0xff, 0x00, 0x4c, 0xff, 0x00, 0x50, 0xff, 0x00, 0x55, 0xff, 0x00,
- 0x5a, 0xff, 0x00, 0x5e, 0xff, 0x00, 0x63, 0xff, 0x00, 0x67, 0xff, 0x00,
- 0x6c, 0xff, 0x00, 0x70, 0xff, 0x00, 0x75, 0xff, 0x00, 0x79, 0xff, 0x00,
- 0x7e, 0xff, 0x00, 0x82, 0xff, 0x00, 0x87, 0xff, 0x00, 0x8b, 0xff, 0x00,
- 0x90, 0xff, 0x00, 0x94, 0xff, 0x00, 0x99, 0xff, 0x00, 0x9d, 0xff, 0x00,
- 0xa2, 0xff, 0x00, 0xa6, 0xff, 0x00, 0xab, 0xff, 0x00, 0xaf, 0xff, 0x00,
- 0xb4, 0xff, 0x00, 0xb8, 0xff, 0x00, 0xbd, 0xff, 0x00, 0xc2, 0xff, 0x00,
- 0xc6, 0xff, 0x00, 0xcb, 0xff, 0x00, 0xcf, 0xff, 0x00, 0xd4, 0xff, 0x00,
- 0xd8, 0xff, 0x00, 0xdd, 0xff, 0x00, 0xe1, 0xff, 0x00, 0xe6, 0xff, 0x00,
- 0xea, 0xff, 0x00, 0xef, 0xff, 0x00, 0xf3, 0xff, 0x00, 0xf8, 0xff, 0x00,
- 0xfc, 0xff, 0x00, 0xff, 0xfd, 0x00, 0xff, 0xf9, 0x00, 0xff, 0xf4, 0x00,
- 0xff, 0xf0, 0x00, 0xff, 0xeb, 0x00, 0xff, 0xe7, 0x00, 0xff, 0xe2, 0x00,
- 0xff, 0xde, 0x00, 0xff, 0xd9, 0x00, 0xff, 0xd5, 0x00, 0xff, 0xd0, 0x00,
- 0xff, 0xcb, 0x00, 0xff, 0xc7, 0x00, 0xff, 0xc2, 0x00, 0xff, 0xbe, 0x00,
- 0xff, 0xb9, 0x00, 0xff, 0xb5, 0x00, 0xff, 0xb0, 0x00, 0xff, 0xac, 0x00,
- 0xff, 0xa7, 0x00, 0xff, 0xa3, 0x00, 0xff, 0x9e, 0x00, 0xff, 0x9a, 0x00,
- 0xff, 0x95, 0x00, 0xff, 0x91, 0x00, 0xff, 0x8c, 0x00, 0xff, 0x88, 0x00,
- 0xff, 0x83, 0x00, 0xff, 0x7f, 0x00, 0xff, 0x7a, 0x00, 0xff, 0x76, 0x00,
- 0xff, 0x71, 0x00, 0xff, 0x6d, 0x00, 0xff, 0x68, 0x00, 0xff, 0x63, 0x00,
- 0xff, 0x5f, 0x00, 0xff, 0x5a, 0x00, 0xff, 0x56, 0x00, 0xff, 0x51, 0x00,
- 0xff, 0x4d, 0x00, 0xff, 0x48, 0x00, 0xff, 0x44, 0x00, 0xff, 0x3f, 0x00,
- 0xff, 0x3b, 0x00, 0xff, 0x36, 0x00, 0xff, 0x32, 0x00, 0xff, 0x2d, 0x00,
- 0xff, 0x29, 0x00, 0xff, 0x24, 0x00, 0xff, 0x20, 0x00, 0xff, 0x1b, 0x00,
- 0xff, 0x17, 0x00, 0xff, 0x12, 0x00, 0xff, 0x0e, 0x00, 0xff, 0x09, 0x00,
- 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
-};
diff --git a/apps/POLite/heat-sync/Colours.h b/apps/POLite/heat-sync/Colours.h
deleted file mode 100644
index fc34e04c..00000000
--- a/apps/POLite/heat-sync/Colours.h
+++ /dev/null
@@ -1,10 +0,0 @@
-// SPDX-License-Identifier: BSD-2-Clause
-#ifndef _COLOURS_H_
-#define _COLOURS_H_
-
-#include
-
-// 256 x RGB colours representing heat intensities
-extern uint8_t colours[];
-
-#endif
diff --git a/apps/POLite/heat-sync/Heat.h b/apps/POLite/heat-sync/Heat.h
index b3a63a93..8dc926b3 100644
--- a/apps/POLite/heat-sync/Heat.h
+++ b/apps/POLite/heat-sync/Heat.h
@@ -2,24 +2,26 @@
#ifndef _HEAT_H_
#define _HEAT_H_
+#define POLITE_DUMP_STATS
+#define POLITE_COUNT_MSGS
#include
struct HeatMessage {
// Sender id
uint32_t from;
- // Time step
- uint32_t time;
// Temperature at sender
- uint32_t val;
+ float val;
};
struct HeatState {
// Device id
uint32_t id;
- // Current time step of device
- uint32_t time;
// Current temperature of device
- uint32_t val, acc;
+ float val, acc;
+ // Time step
+ uint16_t time;
+ // Number of neighbours
+ uint16_t numNeighbours;
// Is the temperature of this device constant?
bool isConstant;
};
@@ -34,7 +36,6 @@ struct HeatDevice : PDevice {
// Send handler
inline void send(volatile HeatMessage* msg) {
msg->from = s->id;
- msg->time = s->time;
msg->val = s->val;
*readyToSend = No;
}
@@ -42,6 +43,7 @@ struct HeatDevice : PDevice {
// Receive handler
inline void recv(HeatMessage* msg, None* edge) {
s->acc += msg->val;
+ s->numNeighbours++;
}
// Called by POLite when system becomes idle
@@ -53,8 +55,9 @@ struct HeatDevice : PDevice {
}
else {
s->time--;
- if (!s->isConstant) s->val = s->acc >> 2;
- s->acc = 0;
+ if (!s->isConstant) s->val = s->acc / (float) s->numNeighbours;
+ s->acc = 0.0;
+ s->numNeighbours = 0;
*readyToSend = Pin(0);
return true;
}
diff --git a/apps/POLite/heat-sync/Makefile b/apps/POLite/heat-sync/Makefile
index 0c343edd..f44d5b09 100644
--- a/apps/POLite/heat-sync/Makefile
+++ b/apps/POLite/heat-sync/Makefile
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: BSD-2-Clause
APP_CPP = Heat.cpp
APP_HDR = Heat.h
-RUN_CPP = Run.cpp Colours.cpp
-RUN_H = Colours.h
+RUN_CPP = Run.cpp
+RUN_H =
include ../util/polite.mk
diff --git a/apps/POLite/heat-sync/Run.cpp b/apps/POLite/heat-sync/Run.cpp
index a938a446..ed978e39 100644
--- a/apps/POLite/heat-sync/Run.cpp
+++ b/apps/POLite/heat-sync/Run.cpp
@@ -1,17 +1,30 @@
// SPDX-License-Identifier: BSD-2-Clause
#include "Heat.h"
-#include "Colours.h"
#include
#include
+#include
#include
-int main()
+int main(int argc, char **argv)
{
- // Parameters
- const uint32_t width = 256;
- const uint32_t height = 256;
- const uint32_t time = 1000;
+ const uint32_t time = 1000;
+
+ // Read in the example edge list and create data structure
+ if (argc != 2) {
+ printf("Specify edge file\n");
+ exit(EXIT_FAILURE);
+ }
+
+ // Load in the edge list file
+ printf("Loading in the graph..."); fflush(stdout);
+ EdgeList net;
+ net.read(argv[1]);
+ printf(" done\n");
+
+ // Print min and max fan-out
+ printf("Min fan-out = %d\n", net.minFanOut());
+ printf("Max fan-out = %d\n", net.maxFanOut());
// Connection to tinsel machine
HostLink hostLink;
@@ -19,55 +32,31 @@ int main()
// Create POETS graph
PGraph graph;
- // Create 2D mesh of devices
- PDeviceId **mesh = new PDeviceId* [height];
- for (uint32_t y = 0; y < height; y++) {
- mesh[y] = new PDeviceId [width];
- for (uint32_t x = 0; x < width; x++)
- mesh[y][x] = graph.newDevice();
+ // Create nodes in POETS graph
+ for (uint32_t i = 0; i < net.numNodes; i++) {
+ PDeviceId id = graph.newDevice();
+ assert(i == id);
}
- // Add edges
- for (uint32_t y = 0; y < height; y++)
- for (uint32_t x = 0; x < width; x++) {
- if (x < width-1) {
- graph.addEdge(mesh[y][x], 0, mesh[y][x+1]);
- graph.addEdge(mesh[y][x+1], 0, mesh[y][x]);
- }
- if (y < height-1) {
- graph.addEdge(mesh[y][x], 0, mesh[y+1][x]);
- graph.addEdge(mesh[y+1][x], 0, mesh[y][x]);
- }
- }
+ // Create connections in POETS graph
+ for (uint32_t i = 0; i < net.numNodes; i++) {
+ uint32_t numNeighbours = net.neighbours[i][0];
+ for (uint32_t j = 0; j < numNeighbours; j++)
+ graph.addEdge(i, 0, net.neighbours[i][j+1]);
+ }
// Prepare mapping from graph to hardware
graph.map();
- // Set device ids
- for (uint32_t y = 0; y < height; y++)
- for (uint32_t x = 0; x < width; x++)
- graph.devices[mesh[y][x]]->state.id = mesh[y][x];
-
// Specify number of time steps to run on each device
- for (PDeviceId i = 0; i < graph.numDevices; i++)
+ srand(1);
+ for (PDeviceId i = 0; i < graph.numDevices; i++) {
+ int r = rand() % 255;
+ graph.devices[i]->state.id = i;
graph.devices[i]->state.time = time;
-
- // Apply constant heat at north edge
- // Apply constant cool at south edge
- for (uint32_t x = 0; x < width; x++) {
- graph.devices[mesh[0][x]]->state.val = 255 << 16;
- graph.devices[mesh[0][x]]->state.isConstant = true;
- graph.devices[mesh[height-1][x]]->state.val = 40 << 16;
- graph.devices[mesh[height-1][x]]->state.isConstant = true;
- }
-
- // Apply constant heat at west edge
- // Apply constant cool at east edge
- for (uint32_t y = 0; y < height; y++) {
- graph.devices[mesh[y][0]]->state.val = 255 << 16;
- graph.devices[mesh[y][0]]->state.isConstant = true;
- graph.devices[mesh[y][width-1]]->state.val = 40 << 16;
- graph.devices[mesh[y][width-1]]->state.isConstant = true;
+ graph.devices[i]->state.val = (float) r;
+ graph.devices[i]->state.isConstant = false;
+ //graph.devices[i]->state.fanOut = graph.fanOut(i);
}
// Write graph down to tinsel machine via HostLink
@@ -82,8 +71,11 @@ int main()
struct timeval start, finish, diff;
gettimeofday(&start, NULL);
+ // Consume performance stats
+ politeSaveStats(&hostLink, "stats.txt");
+
// Allocate array to contain final value of each device
- uint32_t* pixels = new uint32_t [graph.numDevices];
+ float* pixels = new float [graph.numDevices];
// Receive final value of each device
for (uint32_t i = 0; i < graph.numDevices; i++) {
@@ -95,25 +87,19 @@ int main()
pixels[msg.payload.from] = msg.payload.val;
}
+ // Display final values of first ten devices
+ for (uint32_t i = 0; i < 10; i++) {
+ if (i < graph.numDevices) {
+ printf("%u: %f\n", i, pixels[i]);
+ }
+ }
+
// Display time
timersub(&finish, &start, &diff);
double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+ #ifndef POLITE_DUMP_STATS
printf("Time = %lf\n", duration);
-
- // Emit image
- FILE* fp = fopen("out.ppm", "wt");
- if (fp == NULL) {
- printf("Can't open output file for writing\n");
- return -1;
- }
- fprintf(fp, "P3\n%d %d\n255\n", width, height);
- for (uint32_t y = 0; y < height; y++)
- for (uint32_t x = 0; x < width; x++) {
- uint32_t val = (pixels[mesh[y][x]] >> 16) & 0xff;
- fprintf(fp, "%d %d %d\n",
- colours[val*3], colours[val*3+1], colours[val*3+2]);
- }
- fclose(fp);
+ #endif
return 0;
}
diff --git a/apps/POLite/izhikevich-gals/Izhikevich.cpp b/apps/POLite/izhikevich-gals/Izhikevich.cpp
new file mode 100644
index 00000000..8533062a
--- /dev/null
+++ b/apps/POLite/izhikevich-gals/Izhikevich.cpp
@@ -0,0 +1,23 @@
+// SPDX-License-Identifier: BSD-2-Clause
+#include "Izhikevich.h"
+
+#include
+#include
+
+typedef PThread<
+ IzhikevichDevice,
+ IzhikevichState, // State
+ Weight, // Edge label
+ IzhikevichMsg // Message
+ > IzhikevichThread;
+
+int main()
+{
+ // Point thread structure at base of thread's heap
+ IzhikevichThread* thread = (IzhikevichThread*) tinselHeapBaseSRAM();
+
+ // Invoke interpreter
+ thread->run();
+
+ return 0;
+}
diff --git a/apps/POLite/izhikevich-gals/Izhikevich.h b/apps/POLite/izhikevich-gals/Izhikevich.h
new file mode 100644
index 00000000..701af341
--- /dev/null
+++ b/apps/POLite/izhikevich-gals/Izhikevich.h
@@ -0,0 +1,115 @@
+// SPDX-License-Identifier: BSD-2-Clause
+// (Based on code by David Thomas)
+#ifndef _Izhikevich_H_
+#define _Izhikevich_H_
+
+#define POLITE_DUMP_STATS
+#define POLITE_COUNT_MSGS
+#include
+#include "RNG.h"
+
+// Number of time steps to run for
+#define NUM_STEPS 100
+
+// Vertex state
+struct IzhikevichState {
+ // Random-number-generator state
+ uint32_t rng;
+ // Neuron state
+ float u, v, I, acc, accNext;
+ uint32_t spikeCount;
+ // Protocol
+ bool sent;
+ uint16_t received, receivedNext, fanIn, time;
+ // Neuron properties
+ float a, b, c, d, Ir;
+};
+
+// Edge weight type
+typedef float Weight;
+
+// Message type
+struct IzhikevichMsg {
+ // Did the sender spike or not?
+ bool spike;
+ // Time step of sender
+ uint16_t time;
+ // Number of times sender has spiked
+ uint32_t spikeCount;
+};
+
+// Vertex behaviour
+struct IzhikevichDevice : PDevice {
+ inline void init() {
+ s->v = -65.0f;
+ s->u = s->b * s->v;
+ s->I = s->Ir * grng(s->rng);
+ *readyToSend = Pin(0);
+ }
+
+ // We call this on every state change
+ inline void change() {
+ // Execution complete?
+ if (s->time == NUM_STEPS) return;
+
+ // Proceed to next time step?
+ if (s->sent && s->received == s->fanIn) {
+ s->time++;
+ s->I += s->acc;
+ s->acc = s->accNext;
+ s->accNext = 0;
+ s->received = s->receivedNext;
+ s->receivedNext = 0;
+ s->sent = false;
+ *readyToSend = s->time == (NUM_STEPS+1) ? No : Pin(0);
+ }
+ }
+
+ // Send handler
+ inline void send(volatile IzhikevichMsg* msg) {
+ bool spike = false;
+ float &v = s->v;
+ float &u = s->u;
+ float &I = s->I;
+ v = v+0.5*(0.04*v*v+5*v+140-u+I); // Step 0.5 ms
+ v = v+0.5*(0.04*v*v+5*v+140-u+I); // for numerical
+ u = u + s->a*(s->b*v-u); // stability
+ if (v >= 30.0) {
+ v = s->c;
+ u += s->d;
+ s->spikeCount++;
+ spike = true;
+ }
+ s->I = s->Ir * grng(s->rng);
+ msg->time = s->time;
+ msg->spike = spike;
+ msg->spikeCount = s->spikeCount;
+ s->sent = true;
+ *readyToSend = No;
+ change();
+ }
+
+ // Receive handler
+ inline void recv(IzhikevichMsg* msg, Weight* weight) {
+ if (msg->time == s->time) {
+ if (msg->spike) s->acc += *weight;
+ s->received++;
+ change();
+ }
+ else {
+ if (msg->spike) s->accNext += *weight;
+ s->receivedNext++;
+ }
+ }
+
+ inline bool step() {
+ return false;
+ }
+
+ inline bool finish(IzhikevichMsg* msg) {
+ msg->spikeCount = s->spikeCount;
+ return true;
+ }
+};
+
+#endif
diff --git a/apps/POLite/ping-test/Makefile b/apps/POLite/izhikevich-gals/Makefile
similarity index 63%
rename from apps/POLite/ping-test/Makefile
rename to apps/POLite/izhikevich-gals/Makefile
index 7e85d2c6..5ba3d9e3 100644
--- a/apps/POLite/ping-test/Makefile
+++ b/apps/POLite/izhikevich-gals/Makefile
@@ -1,6 +1,6 @@
# SPDX-License-Identifier: BSD-2-Clause
-APP_CPP = ping.cpp
-APP_HDR = ping.h
+APP_CPP = Izhikevich.cpp
+APP_HDR = Izhikevich.h
RUN_CPP = Run.cpp
include ../util/polite.mk
diff --git a/apps/POLite/izhikevich-gals/RNG.h b/apps/POLite/izhikevich-gals/RNG.h
new file mode 100644
index 00000000..61b719b3
--- /dev/null
+++ b/apps/POLite/izhikevich-gals/RNG.h
@@ -0,0 +1,23 @@
+#ifndef _RNG_H_
+#define _RNG_H_
+
+inline uint32_t urng(uint32_t &state) {
+ state = state*1664525+1013904223;
+ return state;
+}
+
+// World's crappiest gaussian (courtesy of dt10!)
+inline float grng(uint32_t &state) {
+ uint32_t u=urng(state);
+ int32_t acc=0;
+ for(unsigned i=0;i<8;i++){
+ acc += u&0xf;
+ u=u>>4;
+ }
+ // a four-bit uniform has mean 7.5 and variance ((15-0+1)^2-1)/12 = 85/4
+ // sum of eight uniforms has mean 8*7.5=60 and variance of 8*85/4=170
+ const float scale=0.07669649888473704; // == 1/sqrt(170)
+ return (acc-60.0f) * scale;
+}
+
+#endif
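Review note on RNG.h: the constants in grng can be checked on the host without a Tinsel machine. The sketch below copies urng and grng from the hunk above and adds a helper, nibbleSumMoments, that computes the exact mean and variance of a sum of 4-bit uniform values by convolution. The helper name and the Moments struct are assumptions made for illustration and are not part of this patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Same generators as RNG.h in the hunk above.
inline uint32_t urng(uint32_t &state) {
  state = state*1664525+1013904223;
  return state;
}

inline float grng(uint32_t &state) {
  uint32_t u = urng(state);
  int32_t acc = 0;
  for (unsigned i = 0; i < 8; i++) {
    acc += u & 0xf;   // add one 4-bit nibble (0..15)
    u = u >> 4;
  }
  const float scale = 0.07669649888473704; // == 1/sqrt(170)
  return (acc - 60.0f) * scale;
}

// Hypothetical helper (not in the patch): exact mean and variance of the
// sum of n independent uniform values in 0..15, computed by convolution.
struct Moments { double mean; double variance; };

inline Moments nibbleSumMoments(int n) {
  assert(n >= 0);
  std::vector<double> dist(1, 1.0);   // for n = 0 the sum is always 0
  for (int k = 0; k < n; k++) {
    std::vector<double> next(dist.size() + 15, 0.0);
    for (std::size_t s = 0; s < dist.size(); s++)
      for (int v = 0; v < 16; v++)
        next[s + v] += dist[s] / 16.0;
    dist = next;
  }
  Moments m = {0.0, 0.0};
  for (std::size_t s = 0; s < dist.size(); s++)
    m.mean += (double) s * dist[s];
  for (std::size_t s = 0; s < dist.size(); s++) {
    double d = (double) s - m.mean;
    m.variance += d * d * dist[s];
  }
  return m;
}
```

Calling nibbleSumMoments(8) gives the mean and variance that grng subtracts and scales by, so the result is roughly zero-mean with unit variance. Because only eight nibbles are summed, the output is bounded to about +/-4.6 and is a coarse approximation of a Gaussian.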
diff --git a/apps/POLite/izhikevich-gals/Run.cpp b/apps/POLite/izhikevich-gals/Run.cpp
new file mode 100644
index 00000000..e542881f
--- /dev/null
+++ b/apps/POLite/izhikevich-gals/Run.cpp
@@ -0,0 +1,132 @@
+// SPDX-License-Identifier: BSD-2-Clause
+#include "Izhikevich.h"
+
+#include
+#include
+
+#include
+#include
+#include
+#include
+
+inline double urand() { return (double) rand() / RAND_MAX; }
+
+int main(int argc, char**argv)
+{
+ if (argc != 2) {
+ printf("Specify edges file\n");
+ exit(EXIT_FAILURE);
+ }
+
+ // Read network
+ EdgeList net;
+ net.read(argv[1]);
+
+ // Connection to tinsel machine
+ HostLink hostLink;
+
+ // Create POETS graph
+ PGraph graph;
+
+ // Create nodes in POETS graph
+ for (uint32_t i = 0; i < net.numNodes; i++) {
+ PDeviceId id = graph.newDevice();
+ assert(i == id);
+ }
+
+ // Fraction of neurons that are excitatory (the rest are inhibitory)
+ double excitatory = 0.8;
+
+ // Mark each neuron as excitatory (or inhibitory)
+ srand(1);
+ bool* excite = new bool [net.numNodes];
+ for (int i = 0; i < net.numNodes; i++)
+ excite[i] = urand() < excitatory;
+
+ // Create connections in POETS graph
+ for (uint32_t i = 0; i < net.numNodes; i++) {
+ uint32_t numNeighbours = net.neighbours[i][0];
+ for (uint32_t j = 0; j < numNeighbours; j++) {
+ float weight = excite[i] ? 0.5 * urand() : -urand();
+ graph.addLabelledEdge(weight, i, 0, net.neighbours[i][j+1]);
+ }
+ }
+
+ // Add zero-weight back-edges for any directed edges
+ // (For GALS synchronisation)
+ for (uint32_t i = 0; i < net.numNodes; i++) {
+ for (uint32_t j = 0; j < net.neighbours[i][0]; j++) {
+ uint32_t n = net.neighbours[i][j+1];
+ // TODO: can be more efficient here
+ bool needBackEdge = true;
+ for (uint32_t k = 0; k < net.neighbours[n][0]; k++)
+ if (net.neighbours[n][k+1] == i) needBackEdge = false;
+ if (needBackEdge) graph.addLabelledEdge(0.0, n, 0, i);
+ }
+ }
+
+ // Prepare mapping from graph to hardware
+ graph.map();
+
+ srand(2);
+ // Initialise devices
+ for (PDeviceId i = 0; i < graph.numDevices; i++) {
+ IzhikevichState* n = &graph.devices[i]->state;
+ n->rng = (int32_t) (urand()*((double) (1<<31)));
+ n->fanIn = graph.fanIn(i);
+ if (excite[i]) {
+ float re = (float) urand();
+ n->a = 0.02;
+ n->b = 0.2;
+ n->c = -65+15*re*re;
+ n->d = 8-6*re*re;
+ n->Ir = 5;
+ }
+ else {
+ float ri = (float) urand();
+ n->a = 0.02+0.08*ri;
+ n->b = 0.25-0.05*ri;
+ n->c = -65;
+ n->d = 2;
+ n->Ir = 2;
+ }
+ }
+
+ // Write graph down to tinsel machine via HostLink
+ graph.write(&hostLink);
+
+ // Load code and trigger execution
+ hostLink.boot("code.v", "data.v");
+ hostLink.go();
+
+ // Timer
+ printf("Started\n");
+ struct timeval start, finish, diff;
+ gettimeofday(&start, NULL);
+
+ // Consume performance stats
+ politeSaveStats(&hostLink, "stats.txt");
+
+ int64_t sum = 0;
+ // Receive final spike count from each device
+ for (uint32_t i = 0; i < graph.numDevices; i++) {
+ // Receive message
+ PMessage msg;
+ hostLink.recvMsg(&msg, sizeof(msg));
+ if (i == 0) gettimeofday(&finish, NULL);
+ // Accumulate
+ sum += msg.payload.spikeCount;
+ }
+
+ // Emit result
+ printf("Total spikes = %ld\n", sum);
+
+ // Display time
+ timersub(&finish, &start, &diff);
+ double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+ #ifndef POLITE_DUMP_STATS
+ printf("Time = %lf\n", duration);
+ #endif
+
+ return 0;
+}
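Review note on the back-edge loop in izhikevich-gals/Run.cpp: the "TODO: can be more efficient here" scan is quadratic in the fan-out of each node. One possible alternative, sketched below, is to record every directed edge in a hash set first and then test for the reverse edge in constant expected time. The AdjList typedef and the missingBackEdges helper are hypothetical names. The adjacency layout (element 0 holds the out-degree, followed by the destinations) is assumed from how the patch indexes net.neighbours.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <utility>
#include <vector>

// neighbours[i][0] holds the out-degree of node i and neighbours[i][1..]
// its destinations (layout assumed from the indexing used in the patch).
typedef std::vector<std::vector<uint32_t> > AdjList;

// Hypothetical helper (not in the patch): returns one (src, dst) pair for
// each directed edge dst -> src that has no matching edge src -> dst.
inline std::vector<std::pair<uint32_t, uint32_t> >
missingBackEdges(const AdjList& neighbours)
{
  // Pack each directed edge (a, b) into a single 64-bit key
  std::unordered_set<uint64_t> edges;
  for (uint32_t i = 0; i < neighbours.size(); i++) {
    assert(!neighbours[i].empty());
    for (uint32_t j = 0; j < neighbours[i][0]; j++)
      edges.insert(((uint64_t) i << 32) | neighbours[i][j+1]);
  }

  // For every edge i -> n, look up the reverse edge n -> i
  std::vector<std::pair<uint32_t, uint32_t> > result;
  for (uint32_t i = 0; i < neighbours.size(); i++) {
    for (uint32_t j = 0; j < neighbours[i][0]; j++) {
      uint32_t n = neighbours[i][j+1];
      if (edges.count(((uint64_t) n << 32) | i) == 0)
        result.push_back(std::make_pair(n, i));
    }
  }
  return result;
}
```

Each returned pair (n, i) corresponds to one graph.addLabelledEdge(0.0, n, 0, i) call in the original loop. Whether the extra memory for the set is worthwhile depends on the fan-out of the networks being loaded.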
diff --git a/apps/POLite/izhikevich-pc/Izhikevich.cpp b/apps/POLite/izhikevich-pc/Izhikevich.cpp
new file mode 100644
index 00000000..b4f03ed5
--- /dev/null
+++ b/apps/POLite/izhikevich-pc/Izhikevich.cpp
@@ -0,0 +1,139 @@
+// SPDX-License-Identifier: BSD-2-Clause
+// (Based on code by David Thomas)
+
+#include
+#include
+#include
+#include "RNG.h"
+
+#define NUM_STEPS 100
+
+// Neuron
+struct Neuron {
+ // Random-number-generator state
+ uint32_t rng;
+ // Neuron state
+ float u, v, I, spikeCount;
+ // Neuron properties
+ float a, b, c, d, Ir;
+};
+
+int main(int argc, char**argv)
+{
+ if (argc != 2) {
+ printf("Specify edges file\n");
+ exit(EXIT_FAILURE);
+ }
+
+ // Read network
+ EdgeList net;
+ net.read(argv[1]);
+
+ // Fraction of neurons that are excitatory (the rest are inhibitory)
+ double excitatory = 0.8;
+
+ // Mark each neuron as excitatory (or inhibitory)
+ srand(1);
+ bool* excite = new bool [net.numNodes];
+ for (int i = 0; i < net.numNodes; i++) {
+ excite[i] = urand() < excitatory;
+ }
+
+ // Edge weights
+ float** weight = new float* [net.numNodes];
+ for (int i = 0; i < net.numNodes; i++) {
+ uint32_t numEdges = net.neighbours[i][0];
+ weight[i] = new float [numEdges];
+ for (int j = 0; j < numEdges; j++) {
+ weight[i][j] = excite[i] ? 0.5 * urand() : -urand();
+ }
+ }
+
+ // State for each neuron
+ srand(2);
+ Neuron* neuron = new Neuron [net.numNodes]();
+ for (int i = 0; i < net.numNodes; i++) {
+ Neuron* n = &neuron[i];
+ n->rng = (int32_t) (urand()*((double) (1<<31)));
+ if (excite[i]) {
+ float re = (float) urand();
+ n->a = 0.02;
+ n->b = 0.2;
+ n->c = -65+15*re*re;
+ n->d = 8-6*re*re;
+ n->Ir = 5;
+ }
+ else {
+ float ri = (float) urand();
+ n->a = 0.02+0.08*ri;
+ n->b = 0.25-0.05*ri;
+ n->c = -65;
+ n->d = 2;
+ n->Ir = 2;
+ }
+ }
+
+ // Spike array
+ bool* spike = new bool [net.numNodes];
+
+ // Initialisation
+ for (int i = 0; i < net.numNodes; i++) {
+ Neuron* n = &neuron[i];
+ n->v = -65.0;
+ n->u = n->b * n->v;
+ n->I = n->Ir * grng(n->rng);
+ }
+
+ // Timer
+ printf("Started\n");
+ struct timeval start, finish, diff;
+ gettimeofday(&start, NULL);
+
+ // Simulation
+ int64_t totalSpikes = 0;
+ for (int t = 0; t <= NUM_STEPS; t++) {
+ // Update state
+ for (int i = 0; i < net.numNodes; i++) {
+ spike[i] = false;
+ Neuron* n = &neuron[i];
+ float &v = n->v;
+ float &u = n->u;
+ float &I = n->I;
+ v = v+0.5*(0.04*v*v+5*v+140-u+I); // Step 0.5 ms
+ v = v+0.5*(0.04*v*v+5*v+140-u+I); // for numerical
+ u = u + n->a*(n->b*v-u); // stability
+ if (v >= 30.0) {
+ n->v = n->c;
+ n->u += n->d;
+ spike[i] = true;
+ }
+ n->I = n->Ir * grng(n->rng);
+ }
+ // Update I-values
+ uint32_t spikes = 0;
+ for (int i = 0; i < net.numNodes; i++) {
+ Neuron* n = &neuron[i];
+ if (spike[i]) {
+ spikes++;
+ n->spikeCount++;
+ uint32_t numEdges = net.neighbours[i][0];
+ uint32_t* dst = &net.neighbours[i][1];
+ for (int j = 0; j < numEdges; j++) {
+ neuron[dst[j]].I += weight[i][j];
+ }
+ }
+ }
+ //printf("%d: %d\n", t, spikes);
+ totalSpikes += spikes;
+ }
+ gettimeofday(&finish, NULL);
+
+ printf("Total spikes: %ld\n", totalSpikes);
+
+ // Display time
+ timersub(&finish, &start, &diff);
+ double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+ printf("Time = %lf\n", duration);
+
+ return 0;
+}
diff --git a/apps/POLite/izhikevich-pc/Makefile b/apps/POLite/izhikevich-pc/Makefile
new file mode 100644
index 00000000..52c92c74
--- /dev/null
+++ b/apps/POLite/izhikevich-pc/Makefile
@@ -0,0 +1,6 @@
+Izhikevich: Izhikevich.cpp RNG.h
+ g++ -I../../../include -O2 Izhikevich.cpp -o Izhikevich
+
+.PHONY: clean
+clean:
+ rm Izhikevich
diff --git a/apps/POLite/izhikevich-pc/RNG.h b/apps/POLite/izhikevich-pc/RNG.h
new file mode 100644
index 00000000..decc32f1
--- /dev/null
+++ b/apps/POLite/izhikevich-pc/RNG.h
@@ -0,0 +1,27 @@
+#ifndef _RNG_H_
+#define _RNG_H_
+
+inline uint32_t urng(uint32_t &state) {
+ state = state*1664525+1013904223;
+ return state;
+}
+
+// World's crappiest gaussian (courtesy of dt10!)
+inline float grng(uint32_t &state) {
+ uint32_t u=urng(state);
+ int32_t acc=0;
+ for(unsigned i=0;i<8;i++){
+ acc += u&0xf;
+ u=u>>4;
+ }
+ // a four-bit uniform has mean 7.5 and variance ((15-0+1)^2-1)/12 = 85/4
+ // sum of eight uniforms has mean 8*7.5=60 and variance of 8*85/4=170
+ const float scale=0.07669649888473704; // == 1/sqrt(170)
+ return (acc-60.0f) * scale;
+}
+
+inline double urand() {
+ return (double) rand() / RAND_MAX;
+}
+
+#endif
diff --git a/apps/POLite/izhikevich-sync/Izhikevich.cpp b/apps/POLite/izhikevich-sync/Izhikevich.cpp
new file mode 100644
index 00000000..8533062a
--- /dev/null
+++ b/apps/POLite/izhikevich-sync/Izhikevich.cpp
@@ -0,0 +1,23 @@
+// SPDX-License-Identifier: BSD-2-Clause
+#include "Izhikevich.h"
+
+#include
+#include
+
+typedef PThread<
+ IzhikevichDevice,
+ IzhikevichState, // State
+ Weight, // Edge label
+ IzhikevichMsg // Message
+ > IzhikevichThread;
+
+int main()
+{
+ // Point thread structure at base of thread's heap
+ IzhikevichThread* thread = (IzhikevichThread*) tinselHeapBaseSRAM();
+
+ // Invoke interpreter
+ thread->run();
+
+ return 0;
+}
diff --git a/apps/POLite/izhikevich-sync/Izhikevich.h b/apps/POLite/izhikevich-sync/Izhikevich.h
new file mode 100644
index 00000000..150a4afa
--- /dev/null
+++ b/apps/POLite/izhikevich-sync/Izhikevich.h
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: BSD-2-Clause
+// (Based on code by David Thomas)
+#ifndef _Izhikevich_H_
+#define _Izhikevich_H_
+
+#define POLITE_DUMP_STATS
+#define POLITE_COUNT_MSGS
+
+#include
+#include "RNG.h"
+
+// Number of time steps to run for
+#define NUM_STEPS 100
+
+// Vertex state
+struct IzhikevichState {
+ // Random-number-generator state
+ uint32_t rng;
+ // Neuron state
+ float u, v, I;
+ uint32_t spikeCount;
+ // Neuron properties
+ float a, b, c, d, Ir;
+};
+
+// Edge weight type
+typedef float Weight;
+
+// Message type
+struct IzhikevichMsg {
+ // Number of times sender has spiked
+ uint32_t spikeCount;
+};
+
+// Vertex behaviour
+struct IzhikevichDevice : PDevice {
+ inline void init() {
+ s->v = -65.0f;
+ s->u = s->b * s->v;
+ s->I = s->Ir * grng(s->rng);
+ *readyToSend = No;
+ }
+ inline void send(IzhikevichMsg* msg) {
+ s->spikeCount++;
+ msg->spikeCount = s->spikeCount;
+ *readyToSend = No;
+ }
+ inline void recv(IzhikevichMsg* msg, Weight* weight) {
+ s->I += *weight;
+ }
+ inline bool step() {
+ float &v = s->v;
+ float &u = s->u;
+ float &I = s->I;
+ v = v+0.5*(0.04*v*v+5*v+140-u+I); // Step 0.5 ms
+ v = v+0.5*(0.04*v*v+5*v+140-u+I); // for numerical
+ u = u + s->a*(s->b*v-u); // stability
+ if (v >= 30.0) {
+ v = s->c;
+ u += s->d;
+ *readyToSend = Pin(0);
+ }
+ s->I = s->Ir * grng(s->rng);
+ return (time < NUM_STEPS);
+ }
+ inline bool finish(IzhikevichMsg* msg) {
+ msg->spikeCount = s->spikeCount;
+ return true;
+ }
+};
+
+#endif
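Review note on the neuron update: the same three-line Izhikevich update appears in the GALS, sync and PC variants. The sketch below isolates it as a host-side function so it can be exercised on its own. The struct names NeuronVars and NeuronParams and the function izhikevichStep are hypothetical and not part of this patch. The arithmetic is copied from the step() handler above.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical containers for the fields used by the update
struct NeuronVars   { float u, v; };
struct NeuronParams { float a, b, c, d; };

// Advance one neuron by one time step (two 0.5 ms half-steps for v, as in
// the patch). Returns true if the neuron spiked and was reset.
inline bool izhikevichStep(NeuronVars& s, const NeuronParams& p, float I) {
  assert(std::isfinite(I));           // input current must be a real number
  float &v = s.v;
  float &u = s.u;
  v = v+0.5*(0.04*v*v+5*v+140-u+I);   // Step 0.5 ms
  v = v+0.5*(0.04*v*v+5*v+140-u+I);   // for numerical
  u = u + p.a*(p.b*v-u);              // stability
  if (v >= 30.0) {
    v = p.c;                          // reset membrane potential
    u += p.d;                         // bump recovery variable
    return true;
  }
  return false;
}
```

Starting from the initial state used by init() (v = -65, u = b*v) with zero input current, one call should leave the neuron below threshold. A state already near 30 mV should spike and reset to c.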
diff --git a/apps/POLite/izhikevich-sync/Makefile b/apps/POLite/izhikevich-sync/Makefile
new file mode 100644
index 00000000..5ba3d9e3
--- /dev/null
+++ b/apps/POLite/izhikevich-sync/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: BSD-2-Clause
+APP_CPP = Izhikevich.cpp
+APP_HDR = Izhikevich.h
+RUN_CPP = Run.cpp
+
+include ../util/polite.mk
diff --git a/apps/POLite/izhikevich-sync/RNG.h b/apps/POLite/izhikevich-sync/RNG.h
new file mode 100644
index 00000000..61b719b3
--- /dev/null
+++ b/apps/POLite/izhikevich-sync/RNG.h
@@ -0,0 +1,23 @@
+#ifndef _RNG_H_
+#define _RNG_H_
+
+inline uint32_t urng(uint32_t &state) {
+ state = state*1664525+1013904223;
+ return state;
+}
+
+// World's crappiest gaussian (courtesy of dt10!)
+inline float grng(uint32_t &state) {
+ uint32_t u=urng(state);
+ int32_t acc=0;
+ for(unsigned i=0;i<8;i++){
+ acc += u&0xf;
+ u=u>>4;
+ }
+ // a four-bit uniform has mean 7.5 and variance ((15-0+1)^2-1)/12 = 85/4
+ // sum of eight uniforms has mean 8*7.5=60 and variance of 8*85/4=170
+ const float scale=0.07669649888473704; // == 1/sqrt(170)
+ return (acc-60.0f) * scale;
+}
+
+#endif
diff --git a/apps/POLite/izhikevich-sync/Run.cpp b/apps/POLite/izhikevich-sync/Run.cpp
new file mode 100644
index 00000000..0693b8c3
--- /dev/null
+++ b/apps/POLite/izhikevich-sync/Run.cpp
@@ -0,0 +1,120 @@
+// SPDX-License-Identifier: BSD-2-Clause
+#include "Izhikevich.h"
+
+#include
+#include
+
+#include
+#include
+#include
+#include
+
+inline double urand() { return (double) rand() / RAND_MAX; }
+
+int main(int argc, char**argv)
+{
+ if (argc != 2) {
+ printf("Specify edges file\n");
+ exit(EXIT_FAILURE);
+ }
+
+ // Read network
+ EdgeList net;
+ net.read(argv[1]);
+ printf("Max fan-out = %d\n", net.maxFanOut());
+ printf("Min fan-out = %d\n", net.minFanOut());
+
+ // Connection to tinsel machine
+ HostLink hostLink;
+
+ // Create POETS graph
+ PGraph graph;
+
+ // Create nodes in POETS graph
+ for (uint32_t i = 0; i < net.numNodes; i++) {
+ PDeviceId id = graph.newDevice();
+ assert(i == id);
+ }
+
+ // Fraction of neurons that are excitatory (the rest are inhibitory)
+ double excitatory = 0.8;
+
+ // Mark each neuron as excitatory (or inhibitory)
+ srand(1);
+ bool* excite = new bool [net.numNodes];
+ for (int i = 0; i < net.numNodes; i++)
+ excite[i] = urand() < excitatory;
+
+ // Create connections in POETS graph
+ for (uint32_t i = 0; i < net.numNodes; i++) {
+ uint32_t numNeighbours = net.neighbours[i][0];
+ for (uint32_t j = 0; j < numNeighbours; j++) {
+ float weight = excite[i] ? 0.5 * urand() : -urand();
+ graph.addLabelledEdge(weight, i, 0, net.neighbours[i][j+1]);
+ }
+ }
+
+ // Prepare mapping from graph to hardware
+ graph.map();
+
+ srand(2);
+ // Initialise devices
+ for (PDeviceId i = 0; i < graph.numDevices; i++) {
+ IzhikevichState* n = &graph.devices[i]->state;
+ n->rng = (int32_t) (urand()*((double) (1<<31)));
+ if (excite[i]) {
+ float re = (float) urand();
+ n->a = 0.02;
+ n->b = 0.2;
+ n->c = -65+15*re*re;
+ n->d = 8-6*re*re;
+ n->Ir = 5;
+ }
+ else {
+ float ri = (float) urand();
+ n->a = 0.02+0.08*ri;
+ n->b = 0.25-0.05*ri;
+ n->c = -65;
+ n->d = 2;
+ n->Ir = 2;
+ }
+ }
+
+ // Write graph down to tinsel machine via HostLink
+ graph.write(&hostLink);
+
+ // Load code and trigger execution
+ hostLink.boot("code.v", "data.v");
+ hostLink.go();
+
+ // Timer
+ printf("Started\n");
+ struct timeval start, finish, diff;
+ gettimeofday(&start, NULL);
+
+ // Consume performance stats
+ politeSaveStats(&hostLink, "stats.txt");
+
+ int64_t sum = 0;
+ // Receive final spike count from each device
+ for (uint32_t i = 0; i < graph.numDevices; i++) {
+ // Receive message
+ PMessage msg;
+ hostLink.recvMsg(&msg, sizeof(msg));
+ if (i == 0) gettimeofday(&finish, NULL);
+ // Accumulate
+ sum += msg.payload.spikeCount;
+ }
+
+ // Emit result
+ printf("Total spikes = %ld\n", sum);
+
+ // Display time
+ timersub(&finish, &start, &diff);
+ double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+ #ifndef POLITE_DUMP_STATS
+ printf("Time = %lf\n", duration);
+ #endif
+
+ return 0;
+}
diff --git a/apps/POLite/pagerank-sync/Run.cpp b/apps/POLite/pagerank-sync/Run.cpp
index 435a0750..1b0eb356 100644
--- a/apps/POLite/pagerank-sync/Run.cpp
+++ b/apps/POLite/pagerank-sync/Run.cpp
@@ -28,7 +28,8 @@ int main(int argc, char **argv)
net.read(argv[1]);
printf(" done\n");
- // Print max fan-out
+ // Print fan-out
+ printf("Min fan-out = %d\n", net.minFanOut());
printf("Max fan-out = %d\n", net.maxFanOut());
// Create nodes in POETS graph
diff --git a/apps/POLite/ping-test/Run.cpp b/apps/POLite/ping-test/Run.cpp
deleted file mode 100644
index 57ac5441..00000000
--- a/apps/POLite/ping-test/Run.cpp
+++ /dev/null
@@ -1,57 +0,0 @@
-// SPDX-License-Identifier: BSD-2-Clause
-#include "ping.h"
-
-#include
-#include
-#include
-#include
-#include
-#include
-
-int main(int argc, char**argv)
-{
- // Connection to tinsel machine
- HostLink hostLink;
-
- // Create POETS graph
- PGraph graph;
-
- // Create single ping device
- PDeviceId id = graph.newDevice();
-
- // Prepare mapping from graph to hardware
- graph.map();
-
- // Write graph down to tinsel machine via HostLink
- graph.write(&hostLink);
-
- // Load code and trigger execution
- hostLink.boot("code.v", "data.v");
- hostLink.go();
-
- printf("Ping started\n");
-
- // Consume performance stats
- //politeSaveStats(&hostLink, "stats.txt");
-
- int test = 0;
- int deviceAddr = graph.toDeviceAddr[id];
- printf("deviceAddr = %d\n", deviceAddr);
- while (test < 100) {
- // Send ping
- PMessage sendMsg;
- sendMsg.devId = getLocalDeviceId(deviceAddr);
- sendMsg.payload.test = test;
- hostLink.send(getThreadId(deviceAddr), 1, &sendMsg);
- printf("Sent %d to device\n", sendMsg.payload.test);
-
- // Receive pong
- PMessage recvMsg;
- hostLink.recvMsg(&recvMsg, sizeof(recvMsg));
- printf("Received %d from device\n", recvMsg.payload.test);
-
- test++;
- }
-
- return 0;
-}
diff --git a/apps/POLite/ping-test/ping.h b/apps/POLite/ping-test/ping.h
deleted file mode 100644
index 3d4c17de..00000000
--- a/apps/POLite/ping-test/ping.h
+++ /dev/null
@@ -1,54 +0,0 @@
-// SPDX-License-Identifier: BSD-2-Clause
-// Test messaging between host and threads.
-
-#ifndef _ping_H_
-#define _ping_H_
-
-//#define POLITE_DUMP_STATS
-//#define POLITE_COUNT_MSGS
-
-// Lightweight POETS frontend
-#include
-
-struct PingMessage {
- uint32_t test;
-};
-
-struct PingState {
- // Number received to be sent back to host
- uint32_t test;
-};
-
-struct PingDevice : PDevice {
- // Called once by POLite at start of execution
- void init() {
- // Do nothing until a message is received from the host
- *readyToSend = No;
- }
-
- // Receive handler
- inline void recv(PingMessage* msg, None* edge) {
- // Store number from host to send back to host
- s->test = msg->test;
- *readyToSend = HostPin;
- }
-
- // Send handler
- inline void send(volatile PingMessage* msg) {
- // Put received value back in message for host to check
- msg->test = s->test;
- *readyToSend = No;
- }
-
- // Called by POLite when system becomes idle
- inline bool step() {
- return true; // Never terminate
- }
-
- // Optionally send message to host on termination
- inline bool finish(volatile PingMessage* msg) {
- return false;
- }
-};
-
-#endif
diff --git a/apps/POLite/progrouters/Makefile b/apps/POLite/progrouters/Makefile
new file mode 100644
index 00000000..9c0837be
--- /dev/null
+++ b/apps/POLite/progrouters/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: BSD-2-Clause
+APP_CPP = ProgRoutersTest.cpp
+APP_HDR =
+RUN_CPP = Run.cpp
+RUN_H =
+
+include ../util/polite.mk
diff --git a/apps/POLite/progrouters/ProgRoutersTest.cpp b/apps/POLite/progrouters/ProgRoutersTest.cpp
new file mode 100644
index 00000000..109565df
--- /dev/null
+++ b/apps/POLite/progrouters/ProgRoutersTest.cpp
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: BSD-2-Clause
+#include
+
+int main()
+{
+ // Get thread id
+ int me = tinselId();
+
+ // Sample outgoing message
+ volatile uint32_t* msgOut = (uint32_t*) tinselSendSlot();
+ msgOut[0] = 0x10;
+ msgOut[1] = 0x20;
+ msgOut[2] = 0x30;
+ msgOut[3] = 0x40;
+ msgOut[4] = 0x50;
+ msgOut[5] = 0x60;
+ msgOut[6] = 0x70;
+ msgOut[7] = 0x80;
+
+ // On thread 0, send to key supplied by host
+ if (me == 0) {
+ tinselSetLen(1);
+ tinselWaitUntil(TINSEL_CAN_RECV);
+ volatile uint32_t* msgIn = (uint32_t*) tinselRecv();
+ uint32_t key = msgIn[0];
+ tinselFree(msgIn);
+
+ tinselWaitUntil(TINSEL_CAN_SEND);
+ tinselKeySend(key, msgOut);
+ }
+
+ // Print anything received
+ while (1) {
+ tinselWaitUntil(TINSEL_CAN_RECV);
+ volatile uint32_t* msgIn = (uint32_t*) tinselRecv();
+ printf("%x %x %x %x %x %x %x %x\n",
+ msgIn[0], msgIn[1], msgIn[2], msgIn[3]
+ , msgIn[4], msgIn[5], msgIn[6], msgIn[7]);
+ tinselFree(msgIn);
+ }
+
+ return 0;
+}
diff --git a/apps/POLite/progrouters/Run.cpp b/apps/POLite/progrouters/Run.cpp
new file mode 100644
index 00000000..c2b27bd2
--- /dev/null
+++ b/apps/POLite/progrouters/Run.cpp
@@ -0,0 +1,47 @@
+// SPDX-License-Identifier: BSD-2-Clause
+#include
+#include
+
+int main(int argc, char **argv)
+{
+ // Connection to tinsel machine
+ HostLink hostLink;
+
+ // Create routing tables
+ ProgRouterMesh mesh(2, 1);
+
+ // Board (1, 0)
+ for (int i = 0; i < 2; i++) {
+ uint64_t mask = 1ul << i;
+ mesh.table[0][1].addMRM(1, 0, mask >> 32, mask, 0xf0f0);
+ }
+ uint32_t key01 = mesh.table[0][1].genKey();
+
+ // Board (0, 0)
+ for (int i = 0; i < 2; i++) {
+ uint64_t mask = 1ul << i;
+ mesh.table[0][0].addMRM(1, 0, mask >> 32, mask, 0xf0f0);
+ }
+ for (int i = 0; i < 2; i++) {
+ uint64_t mask = 1ul << i;
+ mesh.table[0][0].addMRM(1, 1, mask >> 32, mask, 0xf0f0);
+ }
+ mesh.table[0][0].addRR(2, key01); // East
+ uint32_t key00 = mesh.table[0][0].genKey();
+
+ // Transfer routing tables to FPGAs
+ mesh.write(&hostLink);
+
+ // Load code and trigger execution
+ hostLink.boot("code.v", "data.v");
+ hostLink.go();
+
+ // Send key
+ printf("Sending key %x\n", key00);
+ uint32_t msg[1 << TinselLogWordsPerMsg];
+ msg[0] = key00;
+ hostLink.send(0, 1, msg);
+
+ hostLink.dumpStdOut();
+ return 0;
+}
diff --git a/apps/POLite/sssp-async/Run.cpp b/apps/POLite/sssp-async/Run.cpp
index c7953795..37ffcb4e 100644
--- a/apps/POLite/sssp-async/Run.cpp
+++ b/apps/POLite/sssp-async/Run.cpp
@@ -20,8 +20,9 @@ int main(int argc, char**argv)
EdgeList net;
net.read(argv[1]);
- // Print max fan-out
+ // Print fan-out
printf("Max fan-out = %d\n", net.maxFanOut());
+ printf("Min fan-out = %d\n", net.minFanOut());
// Connection to tinsel machine
HostLink hostLink;
@@ -86,7 +87,9 @@ int main(int argc, char**argv)
// Display time
timersub(&finish, &start, &diff);
double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+ #ifndef POLITE_DUMP_STATS
printf("Time = %lf\n", duration);
+ #endif
return 0;
}
diff --git a/apps/POLite/sssp-pc/Makefile b/apps/POLite/sssp-pc/Makefile
new file mode 100644
index 00000000..2ddbeca3
--- /dev/null
+++ b/apps/POLite/sssp-pc/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: BSD-2-Clause
+all: sssp
+
+INC=../../../include
+
+sssp: sssp.cpp
+ g++ -I$(INC) -O3 sssp.cpp -o sssp
+
+.PHONY: clean
+clean:
+ rm sssp
diff --git a/apps/POLite/sssp-pc/sssp.cpp b/apps/POLite/sssp-pc/sssp.cpp
new file mode 100644
index 00000000..9012f49e
--- /dev/null
+++ b/apps/POLite/sssp-pc/sssp.cpp
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: BSD-2-Clause
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+int main(int argc, char**argv)
+{
+ if (argc != 2) {
+ printf("Specify edges file\n");
+ exit(EXIT_FAILURE);
+ }
+
+ // Read network
+ EdgeList net;
+ net.read(argv[1]);
+
+ // Create weights
+ srand(1);
+ uint32_t** weights = new uint32_t* [net.numNodes];
+ for (uint32_t i = 0; i < net.numNodes; i++) {
+ uint32_t numNeighbours = net.neighbours[i][0];
+ weights[i] = new uint32_t [numNeighbours];
+ for (uint32_t j = 0; j < numNeighbours; j++) {
+ weights[i][j] = rand() % 100;
+ }
+ }
+
+ // Create states
+ uint32_t* dist = new uint32_t [net.numNodes];
+ int* queue = new int [net.numNodes];
+ int queueSize = 0;
+ int* queueNext = new int [net.numNodes];
+ int queueSizeNext = 0;
+ bool* inQueue = new bool [net.numNodes];
+ for (int i = 0; i < net.numNodes; i++) {
+ inQueue[i] = false;
+ dist[i] = 0x7fffffff;
+ }
+
+ // Set source vertex
+ dist[2] = 0;
+ queue[queueSize++] = 2;
+
+ // Start timer
+ printf("Started\n");
+ struct timeval start, finish, diff;
+ gettimeofday(&start, NULL);
+
+ int iters = 0;
+ while (queueSize > 0) {
+ for (int i = 0; i < queueSize; i++) {
+ uint32_t me = queue[i];
+ uint32_t numNeighbours = net.neighbours[me][0];
+ for (uint32_t j = 0; j < numNeighbours; j++) {
+ uint32_t neighbour = net.neighbours[me][j+1];
+ uint32_t newDist = dist[me] + weights[me][j];
+ if (newDist < dist[neighbour]) {
+ dist[neighbour] = newDist;
+ if (!inQueue[neighbour]) {
+ queueNext[queueSizeNext++] = neighbour;
+ inQueue[neighbour] = true;
+ }
+ }
+ }
+ }
+ queueSize = queueSizeNext;
+ queueSizeNext = 0;
+ int32_t* tmp = queue; queue = queueNext; queueNext = tmp;
+ for (int i = 0; i < queueSize; i++) inQueue[queue[i]] = false;
+ iters++;
+ }
+
+ // Stop timer
+ gettimeofday(&finish, NULL);
+
+ uint64_t sum = 0;
+ for (int i = 0; i < net.numNodes; i++)
+ sum += dist[i];
+ printf("Sum of distances = %lu\n", sum);
+ printf("Iterations = %d\n", iters);
+
+ // Display time
+ timersub(&finish, &start, &diff);
+ double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+ printf("Time = %lf\n", duration);
+
+ return 0;
+}
diff --git a/apps/POLite/sssp-sync/Run.cpp b/apps/POLite/sssp-sync/Run.cpp
index c7953795..37ffcb4e 100644
--- a/apps/POLite/sssp-sync/Run.cpp
+++ b/apps/POLite/sssp-sync/Run.cpp
@@ -20,8 +20,9 @@ int main(int argc, char**argv)
EdgeList net;
net.read(argv[1]);
- // Print max fan-out
+ // Print fan-out
printf("Max fan-out = %d\n", net.maxFanOut());
+ printf("Min fan-out = %d\n", net.minFanOut());
// Connection to tinsel machine
HostLink hostLink;
@@ -86,7 +87,9 @@ int main(int argc, char**argv)
// Display time
timersub(&finish, &start, &diff);
double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0;
+ #ifndef POLITE_DUMP_STATS
printf("Time = %lf\n", duration);
+ #endif
return 0;
}
diff --git a/apps/POLite/util/genld.sh b/apps/POLite/util/genld.sh
index 0350108e..474e5694 100755
--- a/apps/POLite/util/genld.sh
+++ b/apps/POLite/util/genld.sh
@@ -18,7 +18,7 @@ OUTPUT_ARCH( "riscv" )
MEMORY
{
instrs : ORIGIN = $MaxBootImageBytes, LENGTH = $MaxInstrBytes
- globals : ORIGIN = $DRAMBase, LENGTH = $DRAMGlobalsLength
+ globals : ORIGIN = $DRAMBase, LENGTH = $POLiteDRAMGlobalsLength
}
SECTIONS
diff --git a/apps/POLite/util/polite.mk b/apps/POLite/util/polite.mk
index a1d96f83..4abe32ee 100644
--- a/apps/POLite/util/polite.mk
+++ b/apps/POLite/util/polite.mk
@@ -51,7 +51,7 @@ $(HL)/%.o:
$(BUILD)/run: $(RUN_CPP) $(RUN_H) $(HL)/*.o
g++ -std=c++11 -O2 -I $(INC) -I $(HL) -o $(BUILD)/run $(RUN_CPP) $(HL)/*.o \
- -lmetis -fno-exceptions
+ -lmetis -fno-exceptions -fopenmp
$(BUILD)/sim: $(RUN_CPP) $(RUN_H) $(HL)/sim/*.o
g++ -O2 -I $(INC) -I $(HL) -o $(BUILD)/sim $(RUN_CPP) $(HL)/sim/*.o \
diff --git a/apps/POLite/util/sumstats.awk b/apps/POLite/util/sumstats.awk
index 4d037cca..719699aa 100755
--- a/apps/POLite/util/sumstats.awk
+++ b/apps/POLite/util/sumstats.awk
@@ -10,10 +10,12 @@ BEGIN {
cacheCount = 0;
coreCount = 0;
cacheLineSize = 32;
- intraThreadSendCount = 0;
- interThreadSendCount = 0;
- interBoardSendCount = 0;
- fmax = 225000000;
+ msgsReceived = 0;
+ msgsSent = 0;
+ progRouterSent = 0;
+ progRouterSentInter = 0;
+ blockedSends = 0;
+ fmax = 215000000;
if (boardsX == "" || boardsY == "") {
boardsX = 3;
boardsY = 2;
@@ -48,13 +50,18 @@ BEGIN {
coreCount = coreCount+1;
}
# Per-thread message counts
- else if (match($0, /(.*) LS:(.*),TS:(.*),BS:(.*)/, fields)) {
- ls=strtonum("0x"fields[2]);
- ts=strtonum("0x"fields[3]);
- bs=strtonum("0x"fields[4]);
- intraThreadSendCount = intraThreadSendCount+ls;
- interThreadSendCount = interThreadSendCount+ts;
- interBoardSendCount = interBoardSendCount+bs;
+ else if (match($0, /(.*) MS:(.*),MR:(.*),PR:(.*),PRI:(.*),BL:(.*)/,
+ fields)) {
+ ms=strtonum("0x"fields[2]);
+ mr=strtonum("0x"fields[3]);
+ pr=strtonum("0x"fields[4]);
+ pri=strtonum("0x"fields[5]);
+ bl=strtonum("0x"fields[6]);
+ msgsSent = msgsSent + ms;
+ msgsReceived = msgsReceived + mr;
+ progRouterSent = progRouterSent + pr;
+ progRouterSentInter = progRouterSentInter + pri;
+ blockedSends = blockedSends + bl;
}
}
}
@@ -70,7 +77,14 @@ END {
bytes = cacheLineSize * (missCount + writebackCount)
print "Off-chip memory (GBytes/s): ", ((1/time) * bytes)/1000000000
print "CPU util (%): ", (1-(cpuIdleCount/cycleCount))*100
- print "Intra-thread messages: ", intraThreadSendCount
- print "Inter-thread messages: ", interThreadSendCount
- print "Inter-board messages: ", interBoardSendCount
+ print "Msgs received: ", msgsReceived
+ print "Msgs sent by threads: ", msgsSent
+ print "Msgs injected by ProgRouter:", progRouterSent
+ print "Inter-board msgs:", progRouterSentInter
+ print "Blocked sends:", blockedSends
+ print ""
+ print "Notes:"
+ print " * ProgRouter injections include inter-board msgs"
+ print " * Memory bandwidth does not include lookups by ProgRouter"
+ print " * If runtime > 40s approx, hit/miss counts may overflow"
}
diff --git a/config.py b/config.py
index 74c7f63e..6500be58 100755
--- a/config.py
+++ b/config.py
@@ -161,6 +161,16 @@ def quoted(s): return "'\"" + s + "\"'"
p["SRAMLogMaxInFlight"] = 5
p["SRAMStoreLatency"] = 2
+# Programmable router parameters:
+p["LogRoutingEntryLen"] = 5 # Number of beats in a routing table entry
+p["ProgRouterMaxBurst"] = 4
+p["FetcherLogIndQueueSize"] = 1
+p["FetcherLogBeatBufferSize"] = 5
+p["FetcherLogFlitBufferSize"] = 5
+p["FetcherLogMsgsPerFlitBuffer"] = (
+ p["FetcherLogFlitBufferSize"] - p["LogMaxFlitsPerMsg"])
+p["FetcherMsgsPerFlitBuffer"] = 2 ** p["FetcherLogMsgsPerFlitBuffer"]
+
# Enable performance counters
p["EnablePerfCount"] = True
@@ -178,7 +188,7 @@ def quoted(s): return "'\"" + s + "\"'"
p["UseCustomAccelerator"] = False
# Clock frequency (in MHz)
-p["ClockFreq"] = 225
+p["ClockFreq"] = 215
#==============================================================================
# Derived Parameters
@@ -300,6 +310,7 @@ def quoted(s): return "'\"" + s + "\"'"
# Cores per board
p["LogCoresPerBoard"] = p["LogCoresPerMailbox"] + p["LogMailboxesPerBoard"]
+p["LogCoresPerBoard1"] = p["LogCoresPerBoard"] + 1
p["CoresPerBoard"] = 2**p["LogCoresPerBoard"]
# Threads per core
@@ -356,10 +367,21 @@ def quoted(s): return "'\"" + s + "\"'"
# DRAM base and length
p["DRAMBase"] = 3 * (2 ** p["LogBytesPerSRAM"])
p["DRAMGlobalsLength"] = 2 ** (p["LogBytesPerDRAM"] - 1) - p["DRAMBase"]
+p["POLiteDRAMGlobalsLength"] = 2 ** 14
+p["POLiteProgRouterBase"] = p["DRAMBase"] + p["POLiteDRAMGlobalsLength"]
+p["POLiteProgRouterLength"] = (p["DRAMGlobalsLength"] -
+ p["POLiteDRAMGlobalsLength"])
+
+# POLite globals
# Number of FPGA boards per box (including bridge board)
p["BoardsPerBox"] = p["MeshXLenWithinBox"] * p["MeshYLenWithinBox"] + 1
+# Parameters for programmable routers
+# (and the routing-record fetchers they contain)
+p["FetchersPerProgRouter"] = 4 + p["MailboxMeshXLen"]
+p["LogFetcherFlitBufferSize"] = 5
+
#==============================================================================
# Main
#==============================================================================
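
The new `POLiteProgRouterBase` and `POLiteProgRouterLength` parameters carve the region after the first `2 ** 14` bytes of DRAM globals into space for routing tables. As a minimal sketch of that arithmetic (written in C++ rather than Python, with `computeDramLayout` and `DramLayout` being hypothetical names, and the two log sizes supplied by the caller rather than taken from the real config):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the derived values; field names mirror config.py
struct DramLayout {
  uint64_t dramBase;             // DRAMBase
  uint64_t globalsLength;        // DRAMGlobalsLength
  uint64_t politeGlobalsLength;  // POLiteDRAMGlobalsLength
  uint64_t progRouterBase;       // POLiteProgRouterBase
  uint64_t progRouterLength;     // POLiteProgRouterLength
};

// Recompute the derived parameters from the two log sizes.
// The log sizes are inputs here; their real values live in config.py.
inline DramLayout computeDramLayout(unsigned logBytesPerSRAM,
                                    unsigned logBytesPerDRAM) {
  DramLayout l;
  l.dramBase = 3 * (uint64_t(1) << logBytesPerSRAM);
  l.globalsLength = (uint64_t(1) << (logBytesPerDRAM - 1)) - l.dramBase;
  l.politeGlobalsLength = uint64_t(1) << 14;
  // The POLite globals must fit inside the globals region
  assert(l.globalsLength > l.politeGlobalsLength);
  l.progRouterBase = l.dramBase + l.politeGlobalsLength;
  l.progRouterLength = l.globalsLength - l.politeGlobalsLength;
  return l;
}
```

Whatever sizes are used, `progRouterBase + progRouterLength` should equal `2 ** (LogBytesPerDRAM - 1)`, i.e. the routing tables run to the end of the globals region.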
diff --git a/de5/S5_DDR3_QSYS.qsys b/de5/S5_DDR3_QSYS.qsys
index 0695a737..4d8e3a49 100644
--- a/de5/S5_DDR3_QSYS.qsys
+++ b/de5/S5_DDR3_QSYS.qsys
@@ -891,7 +891,7 @@
-
+
@@ -1214,7 +1214,7 @@
-
+
diff --git a/doc/PIP-0024-global-multicast.md b/doc/PIP-0024-global-multicast.md
new file mode 100644
index 00000000..65105f71
--- /dev/null
+++ b/doc/PIP-0024-global-multicast.md
@@ -0,0 +1,226 @@
+# PIP-0024: Programmable routers and global multicast
+
+Author: Matthew Naylor
+
+This proposal replaces PIP 21.
+
+## Proposal
+
+We propose to generalise the destination component of a message so
+that it can be (1) a thread id; or (2) a **routing key**. A message,
+sent by a thread, containing a routing key as a destination will go to
+a **per-board router** on the same FPGA. The router will use the key
+as an index into a DRAM-based routing table and automatically
+propagate the message towards all the destinations associated with
+that key.
+
+## Motivation/Rationale
+
+PIP 22 resulted in a *mailbox-level* multicast feature, implemented in
+Tinsel 0.7. It enables each thread to send a message
+simultaneously to any subset of the 64 threads on a destination
+mailbox. It works well when graphs exhibit good locality, with
+destination vertices often collocated on the same mailbox.
+
+However, it has a few drawbacks:
+
+ 1. Costly graph partitioning algorithms are needed to identify
+ locality. This is problematic for graphs with billions of edges
+ and vertices, because mapping time may significantly outweigh
+ execution time. (Indeed, graph partitioning is itself an
+ interesting application for the hardware.)
+
+ 2. In some graphs there are limits to how well destination vertices
+ can be collocated after partitioning. For example, *small-world
+ graphs* contain some extremely large, highly-distributed fanouts.
+
+A *global multicast* feature should reduce the need to find optimal
+partitions for very large graphs, and support distributed fanouts. It
+should also move work away from the cores and into the hardware
+routers: the softswitch no longer needs to iterate over the outgoing
+edges of a pin. While providing these improvements, it is also
+important to maintain the advantages of the existing mailbox-level
+multicast, for applications in which the mapping time is not a
+concern.
+
+## Functional overview
+
+A **routing key** is a 32-bit value consisting of a *ram id*, an
+*address*, and a *size*:
+
+```sv
+// 32-bit routing key (MSB to LSB)
+typedef struct {
+ // Which off-chip RAM on this board?
+ Bit#(`LogDRAMsPerBoard) ram;
+ // Pointer to array of routing beats containing routing records
+ Bit#(`LogBeatsPerDRAM) ptr;
+ // Number of beats in the array
+ Bit#(`LogRoutingEntryLen) numBeats;
+} RoutingKey;
+```
+
+When a message reaches the per-board router, the `ptr` field of the
+routing key is used as an index into DRAM, where a sequence of 256-bit
+**routing beats** are found. The `numBeats` field of the routing key
+indicates how many contiguous routing beats there are. Knowing the
+size before the lookup makes the hardware simpler and more efficient,
+e.g. it can avoid blocking on responses and issue a burst of an
+appropriate size. The value of `numBeats` may be zero.
+
+A routing beat consists of a *size* and a sequence of five 48-bit
+*routing chunks*:
+
+```sv
+// 256-bit routing beat (aligned, MSB to LSB)
+typedef struct {
+ // Number of routing records present in this beat
+ Bit#(16) size;
+ // Five 48-bit record chunks
+ Vector#(5, Bit#(48)) chunks;
+} RoutingBeat;
+```
+
+The *size* must lie in the range 1 to 5 inclusive (0 is disallowed).
+A **routing record** consists of one or two routing chunks, depending
+on the **record type**.
+
+All byte orderings are little endian. For example, the order of bytes
+in a routing beat is as follows.
+
+```
+Byte Contents
+---- --------
+31: Upper byte of size (i.e. number of records in beat)
+30: Lower byte of size
+29: Upper byte of first chunk
+ ...
+24: Lower byte of first chunk
+23: Upper byte of second chunk
+ ...
+18: Lower byte of second chunk
+17: Upper byte of third chunk
+ ...
+12: Lower byte of third chunk
+11: Upper byte of fourth chunk
+ ...
+ 6: Lower byte of fourth chunk
+ 5: Upper byte of fifth chunk
+ ...
+ 0: Lower byte of fifth chunk
+```
+
+Clearly, both routing keys and routing beats have a maximum size.
+However, in principle there is no limit to the number of records
+associated with a key, due to the possibility of *indirection records*
+(see below).
+
+There are five types of routing record, defined below.
+
+**48-bit Unicast Router-to-Mailbox (URM1).**
+
+```sv
+typedef struct {
+ // Record type (URM1 == 0)
+ Bit#(3) tag;
+ // Mailbox destination
+ Bit#(4) mbox;
+ // Mailbox-local thread identifier
+ Bit#(6) thread;
+ // Unused
+ Bit#(3) unused;
+ // Local key. The first word of the message
+ // payload is overwritten with this.
+ Bit#(32) localKey;
+} URM1Record;
+```
+
+The `localKey` can be used for anything, but might encode the
+destination thread-local device identifier, or edge identifier, or
+both. The `mbox` field is currently 4 bits (two Y bits followed by
+two X bits), but there are spare bits available to increase the size
+of this field in future if necessary.
+
+**96-bit Unicast Router-to-Mailbox (URM2).**
+
+```sv
+typedef struct {
+ // Record type (URM2 == 1)
+ Bit#(3) tag;
+ // Mailbox destination
+ Bit#(4) mbox;
+ // Mailbox-local thread identifier
+ Bit#(6) thread;
+ // Currently unused
+ Bit#(19) unused;
+ // Local key. The first two words of the message
+ // payload are overwritten with this.
+ Bit#(64) localKey;
+} URM2Record;
+```
+
+This is the same as a URM1 record except the local key is 64 bits in
+size.
+
+**48-bit Router-to-Router (RR).**
+
+```sv
+typedef struct {
+ // Record type (RR == 2)
+ Bit#(3) tag;
+ // Direction (N,S,E,W == 0,1,2,3)
+ Bit#(2) dir;
+ // Currently unused
+ Bit#(11) unused;
+ // New 32-bit routing key that will replace the one in the
+ // current message for the next hop of the message's journey
+ Bit#(32) newKey;
+} RRRecord;
+```
+
+The `newKey` field will replace the key in the current message for the
+next hop of the message's journey. Introducing a new key at each hop
+simplifies the mapping process (keeping it quick).
+
+**96-bit Multicast Router-to-Mailbox (MRM).**
+
+```sv
+typedef struct {
+ // Record type (MRM == 3)
+ Bit#(3) tag;
+ // Mailbox destination
+ Bit#(4) mbox;
+ // Currently unused
+ Bit#(9) unused;
+ // Local key. The least-significant half-word
+ // of the message is replaced with this
+ Bit#(16) localKey;
+ // Mailbox-local destination mask
+ Bit#(64) destMask;
+} MRMRecord;
+```
+
+**48-bit Indirection (IND).**
+
+```sv
+// 48-bit Indirection (IND) record
+// Note the restrictions on IND records:
+// 1. At most one IND record per key lookup
+// 2. A max-sized key lookup must contain an IND record
+typedef struct {
+ // Record type (IND == 4)
+ Bit#(3) tag;
+ // Currently unused
+ Bit#(13) unused;
+ // New 32-bit routing key for new set of records on current router
+ Bit#(32) newKey;
+} INDRecord;
+```
+
+Indirection records can be used to handle large fanouts that need more
+routing beats than the `numBeats` field of a routing key can express.
+
+## Impact
+
+Since use of routing keys is optional, existing applications will
+continue to work unmodified.
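
The record and beat layouts in PIP-0024 can be checked with a small host-side packer. The following is a minimal sketch, assuming only the field widths and byte order stated in the proposal; the names `packURM1`, `setChunk`, `setSize` and `RoutingBeat` are hypothetical and are not part of the Tinsel API.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Record tag for URM1, as given in the proposal
constexpr uint64_t kTagURM1 = 0;

// Pack a URM1 record into a 48-bit chunk, fields laid out MSB to LSB:
// tag[47:45] mbox[44:41] thread[40:35] unused[34:32] localKey[31:0]
inline uint64_t packURM1(uint32_t mbox, uint32_t thread, uint32_t localKey) {
  assert(mbox < 16 && thread < 64);
  return (kTagURM1 << 45) | (uint64_t(mbox) << 41) |
         (uint64_t(thread) << 35) | uint64_t(localKey);
}

// A 256-bit routing beat as 32 bytes, indexed as in the proposal's byte table
using RoutingBeat = std::array<uint8_t, 32>;

// Store chunk number `index` (0 = first chunk) in little-endian byte order.
// The first chunk occupies bytes 24..29, the fifth chunk bytes 0..5.
inline void setChunk(RoutingBeat& beat, unsigned index, uint64_t chunk) {
  assert(index < 5);
  unsigned base = 24 - 6 * index;
  for (unsigned i = 0; i < 6; i++)
    beat[base + i] = uint8_t(chunk >> (8 * i));
}

// Store the number of records in the beat (bytes 30 and 31, little endian)
inline void setSize(RoutingBeat& beat, uint16_t numRecords) {
  assert(numRecords >= 1 && numRecords <= 5);
  beat[30] = uint8_t(numRecords);
  beat[31] = uint8_t(numRecords >> 8);
}
```

A 96-bit record (URM2 or MRM) would occupy two consecutive chunks, so a beat holding such records carries fewer than five records.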
diff --git a/doc/custom/ExampleAccelerator.sv b/doc/custom/ExampleAccelerator.sv
index 34a97fc2..acc73455 100644
--- a/doc/custom/ExampleAccelerator.sv
+++ b/doc/custom/ExampleAccelerator.sv
@@ -5,6 +5,7 @@
typedef struct packed {
logic acc;
+ logic isKey;
logic host;
logic hostDir;
logic [`TinselMeshYBits-1:0] boardY;
diff --git a/doc/custom/README.md b/doc/custom/README.md
index c380f9c9..fde29010 100644
--- a/doc/custom/README.md
+++ b/doc/custom/README.md
@@ -74,6 +74,7 @@ custom accelerator or a mailbox.
```sv
typedef struct packed {
logic acc;
+ logic isKey;
logic host;
logic hostDir;
logic [`TinselMeshYBits-1:0] boardY;
diff --git a/doc/figures/fpga.png b/doc/figures/fpga.png
index f4d60fbb..71a4c97f 100644
Binary files a/doc/figures/fpga.png and b/doc/figures/fpga.png differ
diff --git a/doc/figures/fpga.tex b/doc/figures/fpga.tex
index 02922a0f..9eafda95 100644
--- a/doc/figures/fpga.tex
+++ b/doc/figures/fpga.tex
@@ -14,15 +14,6 @@
\definecolor{myorange}{RGB}{197,90,17}
\definecolor{mygreen}{RGB}{84,130,53}
- \node[fill=gray!20,rounded corners,
- minimum width=6.3cm,minimum height=4.8cm] (border0)
- at (4.5,2.0) {};
- \node[fill=white,rounded corners,
- minimum width=5.8cm,minimum height=4.1cm] (border1)
- at (4.5,1.8) {};
- \node[fill=none,color=black] at (4.5,6.4)
- {\footnotesize{inter-FPGA reliable links}};
-
\node[fill=myblue,rounded corners] (tile00)
at (0,0) {\footnotesize{tile}};
\node[rectangle,sharp corners,fill=black] (router00)
@@ -123,16 +114,16 @@
\draw[arrows=-,color=mygreen] (tile13) to (mem13);
\node[rounded corners,fill=mygreen]
- (ram0) at (1.7,-1.6) {\footnotesize{off-chip RAM}};
+ (ram0) at (1.3,-1.8) {\footnotesize{off-chip RAM}};
- \draw[arrows=-,color=mygreen] (mem00) to ([xshift=-7mm]ram0.north);
- \draw[arrows=-,color=mygreen] (mem01) to ([xshift=-5mm]ram0.north);
- \draw[arrows=-,color=mygreen] (mem02) to ([xshift=-3mm]ram0.north);
- \draw[arrows=-,color=mygreen] (mem03) to ([xshift=-1mm]ram0.north);
- \draw[arrows=-,color=mygreen] (mem10) to ([xshift=7mm]ram0.north);
- \draw[arrows=-,color=mygreen] (mem11) to ([xshift=5mm]ram0.north);
- \draw[arrows=-,color=mygreen] (mem12) to ([xshift=3mm]ram0.north);
- \draw[arrows=-,color=mygreen] (mem13) to ([xshift=1mm]ram0.north);
+ \draw[arrows=-,color=mygreen] (mem00) to ([xshift=-3mm]ram0.north);
+ \draw[arrows=-,color=mygreen] (mem01) to ([xshift=-1mm]ram0.north);
+ \draw[arrows=-,color=mygreen] (mem02) to ([xshift=1mm]ram0.north);
+ \draw[arrows=-,color=mygreen] (mem03) to ([xshift=3mm]ram0.north);
+ \draw[arrows=-,color=mygreen] (mem10) to ([xshift=11mm]ram0.north);
+ \draw[arrows=-,color=mygreen] (mem11) to ([xshift=9mm]ram0.north);
+ \draw[arrows=-,color=mygreen] (mem12) to ([xshift=7mm]ram0.north);
+ \draw[arrows=-,color=mygreen] (mem13) to ([xshift=5mm]ram0.north);
\coordinate[] (south0b) at (4.3, -0.9) {};
\coordinate[] (south0a) at (-0.83, -0.9) {};
@@ -282,16 +273,16 @@
\draw[arrows=-,color=mygreen] (tile33) to (memb13);
\node[rounded corners,fill=mygreen]
- (ram1) at (7.57,-1.6) {\footnotesize{off-chip RAM}};
+ (ram1) at (7.97,-1.8) {\footnotesize{off-chip RAM}};
- \draw[arrows=-,color=mygreen] (memb00) to ([xshift=-7mm]ram1.north);
- \draw[arrows=-,color=mygreen] (memb01) to ([xshift=-5mm]ram1.north);
- \draw[arrows=-,color=mygreen] (memb02) to ([xshift=-3mm]ram1.north);
- \draw[arrows=-,color=mygreen] (memb03) to ([xshift=-1mm]ram1.north);
- \draw[arrows=-,color=mygreen] (memb10) to ([xshift=7mm]ram1.north);
- \draw[arrows=-,color=mygreen] (memb11) to ([xshift=5mm]ram1.north);
- \draw[arrows=-,color=mygreen] (memb12) to ([xshift=3mm]ram1.north);
- \draw[arrows=-,color=mygreen] (memb13) to ([xshift=1mm]ram1.north);
+ \draw[arrows=-,color=mygreen] (memb00) to ([xshift=-11mm]ram1.north);
+ \draw[arrows=-,color=mygreen] (memb01) to ([xshift=-9mm]ram1.north);
+ \draw[arrows=-,color=mygreen] (memb02) to ([xshift=-7mm]ram1.north);
+ \draw[arrows=-,color=mygreen] (memb03) to ([xshift=-5mm]ram1.north);
+ \draw[arrows=-,color=mygreen] (memb10) to ([xshift=3mm]ram1.north);
+ \draw[arrows=-,color=mygreen] (memb11) to ([xshift=1mm]ram1.north);
+ \draw[arrows=-,color=mygreen] (memb12) to ([xshift=-1mm]ram1.north);
+ \draw[arrows=-,color=mygreen] (memb13) to ([xshift=-3mm]ram1.north);
@@ -359,33 +350,20 @@
\coordinate[] (south2c) at (4.7, -2.3) {};
\draw[arrows=-,color=black] (south2b) to (south2c);
- \draw[arrows=-,color=black] (router00.west) to
- ([xshift=-2.3mm]router00.west);
- \draw[arrows=-,color=black] (router01.west) to
- ([xshift=-2.3mm]router01.west);
- \draw[arrows=-,color=black] (router02.west) to
- ([xshift=-2.3mm]router02.west);
- \draw[arrows=-,color=black] (router03.west) to
- ([xshift=-2.3mm]router03.west);
-
- \draw[arrows=-,color=black] (router30.east) to
- ([xshift=14.4mm]router30.east);
- \draw[arrows=-,color=black] (router31.east) to
- ([xshift=14.4mm]router31.east);
- \draw[arrows=-,color=black] (router32.east) to
- ([xshift=14.4mm]router32.east);
- \draw[arrows=-,color=black] (router33.east) to
- ([xshift=14.4mm]router33.east);
-
- \draw[arrows=-,color=black] (router03.north) to
- ([yshift=2mm]router03.north);
- \draw[arrows=-,color=black] (router13.north) to
- ([yshift=2mm]router13.north);
- \draw[arrows=-,color=black] (router23.north) to
- ([yshift=2mm]router23.north);
- \draw[arrows=-,color=black] (router33.north) to
- ([yshift=2mm]router33.north);
+ \node[rounded corners,fill=myorange,minimum height=0.5cm] (boardrouter)
+ at (4.63cm,-1.8cm) {\footnotesize{board}\\[-1mm]\footnotesize{router}};
+
+ \node[rounded corners,fill=gray!20, text=black,minimum width=5.25cm] (links)
+ at (4.63cm, -3.2cm) {\footnotesize{inter-FPGA reliable links}};
+
+ \draw[arrows=-,color=black] (links.north) to (boardrouter.south);
+
+ % Is the board router connected to off-chip RAM?
+ \draw[arrows=-,color=mygreen] (ram0.east) to (boardrouter.west);
+ \draw[arrows=-,color=mygreen] (ram1.west) to (boardrouter.east);
+
\end{tikzpicture}
+
\end{document}
diff --git a/doc/figures/logo.png b/doc/figures/logo.png
new file mode 100644
index 00000000..8271002b
Binary files /dev/null and b/doc/figures/logo.png differ
diff --git a/hostlink/DebugLink.cpp b/hostlink/DebugLink.cpp
index f838441d..0031969c 100644
--- a/hostlink/DebugLink.cpp
+++ b/hostlink/DebugLink.cpp
@@ -60,10 +60,10 @@ void DebugLink::putPacket(int x, int y, BoardCtrlPkt* pkt)
}
// Constructor
-DebugLink::DebugLink(uint32_t numBoxesX, uint32_t numBoxesY)
+DebugLink::DebugLink(DebugLinkParams p)
{
- boxMeshXLen = numBoxesX;
- boxMeshYLen = numBoxesY;
+ boxMeshXLen = p.numBoxesX;
+ boxMeshYLen = p.numBoxesY;
get_tryNextX = 0;
get_tryNextY = 0;
@@ -105,11 +105,11 @@ DebugLink::DebugLink(uint32_t numBoxesX, uint32_t numBoxesY)
"But is has a box X coordinate of %i\n", thisBoxX);
exit(EXIT_FAILURE);
}
- if ((thisBoxX+numBoxesX-1) >= TinselBoxMeshXLen ||
- (thisBoxY+numBoxesY-1) >= TinselBoxMeshYLen) {
+ if ((thisBoxX+p.numBoxesX-1) >= TinselBoxMeshXLen ||
+ (thisBoxY+p.numBoxesY-1) >= TinselBoxMeshYLen) {
fprintf(stderr, "Requested box sub-mesh of size %ix%i "
"is not valid from box %s\n",
- numBoxesX, numBoxesY, hostname);
+ p.numBoxesX, p.numBoxesY, hostname);
exit(EXIT_FAILURE);
}
@@ -187,6 +187,8 @@ DebugLink::DebugLink(uint32_t numBoxesX, uint32_t numBoxesY)
if (y == 0) pkt.payload[2] |= 2;
if (thisBoxX == 0 && boxMeshXLen == 1) pkt.payload[2] |= 4;
if (thisBoxX == 1 && boxMeshXLen == 1) pkt.payload[2] |= 8;
+ // Reserve extra send slot?
+ pkt.payload[2] |= p.useExtraSendSlot ? 0x10 : 0;
// Send commands to each board
for (int b = 0; b < TinselBoardsPerBox; b++) {
pkt.linkId = b;
diff --git a/hostlink/DebugLink.h b/hostlink/DebugLink.h
index fd3c8291..18d352dc 100644
--- a/hostlink/DebugLink.h
+++ b/hostlink/DebugLink.h
@@ -8,6 +8,13 @@
#include "BoardCtrl.h"
#include "DebugLinkFormat.h"
+// DebugLink parameters
+struct DebugLinkParams {
+ uint32_t numBoxesX;
+ uint32_t numBoxesY;
+ bool useExtraSendSlot;
+};
+
class DebugLink {
// Location of this box with full box mesh
@@ -46,7 +53,7 @@ class DebugLink {
int meshYLen;
// Constructor
- DebugLink(uint32_t numBoxesX, uint32_t numBoxesY);
+ DebugLink(DebugLinkParams params);
// On given board, set destination core and thread
void setDest(uint32_t boardX, uint32_t boardY,
diff --git a/hostlink/HostLink.cpp b/hostlink/HostLink.cpp
index aa4d3af6..dd896f4d 100644
--- a/hostlink/HostLink.cpp
+++ b/hostlink/HostLink.cpp
@@ -60,9 +60,11 @@ static int connectToPCIeStream(const char* socketPath)
}
// Internal constructor
-void HostLink::constructor(uint32_t numBoxesX, uint32_t numBoxesY)
+void HostLink::constructor(HostLinkParams p)
{
- if (numBoxesX > TinselBoxMeshXLen || numBoxesY > TinselBoxMeshYLen) {
+ useExtraSendSlot = p.useExtraSendSlot;
+
+ if (p.numBoxesX > TinselBoxMeshXLen || p.numBoxesY > TinselBoxMeshYLen) {
fprintf(stderr, "Number of boxes requested exceeds those available\n");
exit(EXIT_FAILURE);
}
@@ -92,7 +94,11 @@ void HostLink::constructor(uint32_t numBoxesX, uint32_t numBoxesY)
#endif
// Create DebugLink
- debugLink = new DebugLink(numBoxesX, numBoxesY);
+ DebugLinkParams debugLinkParams;
+ debugLinkParams.numBoxesX = p.numBoxesX;
+ debugLinkParams.numBoxesY = p.numBoxesY;
+ debugLinkParams.useExtraSendSlot = p.useExtraSendSlot;
+ debugLink = new DebugLink(debugLinkParams);
// Set board mesh dimensions
meshXLen = debugLink->meshXLen;
@@ -145,12 +151,25 @@ HostLink::HostLink()
int x = str ? atoi(str) : 1;
str = getenv("HOSTLINK_BOXES_Y");
int y = str ? atoi(str) : 1;
- constructor(x, y);
+ HostLinkParams params;
+ params.numBoxesX = x;
+ params.numBoxesY = y;
+ params.useExtraSendSlot = false;
+ constructor(params);
}
HostLink::HostLink(uint32_t numBoxesX, uint32_t numBoxesY)
{
- constructor(numBoxesX, numBoxesY);
+ HostLinkParams params;
+ params.numBoxesX = numBoxesX;
+ params.numBoxesY = numBoxesY;
+ params.useExtraSendSlot = false;
+ constructor(params);
+}
+
+HostLink::HostLink(HostLinkParams params)
+{
+ constructor(params);
}
// Destructor
@@ -218,8 +237,9 @@ void HostLink::fromAddr(uint32_t addr, uint32_t* meshX, uint32_t* meshY,
*meshY = addr;
}
-// Inject a message via PCIe (blocking by default)
-bool HostLink::send(uint32_t dest, uint32_t numFlits, void* payload, bool block)
+// Internal helper for sending messages
+bool HostLink::sendHelper(uint32_t dest, uint32_t numFlits, void* payload,
+ bool block, uint32_t key)
{
assert(useSendBuffer ? block : true);
@@ -242,7 +262,7 @@ bool HostLink::send(uint32_t dest, uint32_t numFlits, void* payload, bool block)
buffer[0] = dest;
buffer[1] = 0;
buffer[2] = (numFlits-1) << 24;
- buffer[3] = 0;
+ buffer[3] = key;
// Fill in message payload
memcpy(&buffer[4], payload, numFlits*16);
@@ -285,6 +305,13 @@ bool HostLink::send(uint32_t dest, uint32_t numFlits, void* payload, bool block)
}
}
+
+// Inject a message via PCIe (blocking by default)
+bool HostLink::send(uint32_t dest, uint32_t numFlits, void* msg, bool block)
+{
+ return sendHelper(dest, numFlits, msg, block, 0);
+}
+
// Flush the send buffer
void HostLink::flush()
{
@@ -298,7 +325,28 @@ void HostLink::flush()
// Try to send a message (non-blocking, returns true on success)
bool HostLink::trySend(uint32_t dest, uint32_t numFlits, void* msg)
{
- return send(dest, numFlits, msg, false);
+ return sendHelper(dest, numFlits, msg, false, 0);
+}
+
+// Send a message using routing key (blocking by default)
+bool HostLink::keySend(uint32_t key, uint32_t numFlits,
+ void* msg, bool block)
+{
+ uint32_t useRoutingKey = 1 << (
+ TinselLogThreadsPerCore + TinselLogCoresPerMailbox +
+ TinselMailboxMeshXBits + TinselMailboxMeshYBits +
+ TinselMeshXBits + TinselMeshYBits + 2);
+ return sendHelper(useRoutingKey, numFlits, msg, block, key);
+}
+
+// Try to send using routing key (non-blocking, returns true on success)
+bool HostLink::keyTrySend(uint32_t key, uint32_t numFlits, void* msg)
+{
+ uint32_t useRoutingKey = 1 << (
+ TinselLogThreadsPerCore + TinselLogCoresPerMailbox +
+ TinselMailboxMeshXBits + TinselMailboxMeshYBits +
+ TinselMeshXBits + TinselMeshYBits + 2);
+ return sendHelper(useRoutingKey, numFlits, msg, false, key);
}
// Receive a message via PCIe (blocking)
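
`keySend` and `keyTrySend` both set a single flag bit in the destination word and pass the routing key in the fourth header word. The sketch below reproduces that header construction in isolation. `AddrWidths`, `routingKeyFlag` and `keyMessageHeader` are hypothetical names, and the field widths are caller-supplied rather than the real `Tinsel*` constants. The `+ 2` appears to skip the `hostDir` and `host` bits that sit below `isKey` in the address struct shown in `doc/custom`, but that reading is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Widths of the address fields below the flag bits (caller-supplied here;
// the real code uses the generated Tinsel config constants)
struct AddrWidths {
  unsigned logThreadsPerCore;
  unsigned logCoresPerMailbox;
  unsigned mailboxMeshXBits;
  unsigned mailboxMeshYBits;
  unsigned meshXBits;
  unsigned meshYBits;
};

// Destination word that marks a message as key-addressed
inline uint32_t routingKeyFlag(const AddrWidths& w) {
  unsigned shift = w.logThreadsPerCore + w.logCoresPerMailbox +
                   w.mailboxMeshXBits + w.mailboxMeshYBits +
                   w.meshXBits + w.meshYBits + 2;
  assert(shift < 32);
  return uint32_t(1) << shift;
}

// First four words of the PCIe buffer, mirroring sendHelper in the diff:
// destination, zero, (numFlits-1) in the top byte, routing key
inline std::array<uint32_t, 4> keyMessageHeader(const AddrWidths& w,
                                                uint32_t key,
                                                uint32_t numFlits) {
  assert(numFlits >= 1);
  std::array<uint32_t, 4> header = {{
    routingKeyFlag(w), 0u, (numFlits - 1) << 24, key
  }};
  return header;
}
```

Using an unsigned literal for the shift avoids signed overflow if the flag ever lands on bit 31; the patch itself uses a plain `1 << (...)`.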
diff --git a/hostlink/HostLink.h b/hostlink/HostLink.h
index 81c9b32f..41d78303 100644
--- a/hostlink/HostLink.h
+++ b/hostlink/HostLink.h
@@ -16,6 +16,13 @@
#define PCIESTREAM "pciestream"
#define PCIESTREAM_SIM "tinsel.b-1.1"
+// HostLink parameters
+struct HostLinkParams {
+ uint32_t numBoxesX;
+ uint32_t numBoxesY;
+ bool useExtraSendSlot;
+};
+
class HostLink {
// Lock file for acquring exclusive access to PCIeStream
int lockFile;
@@ -33,8 +40,15 @@ class HostLink {
char* sendBuffer;
int sendBufferLen;
+ // Request an extra send slot when bringing up Tinsel FPGAs
+ bool useExtraSendSlot;
+
// Internal constructor
- void constructor(uint32_t numBoxesX, uint32_t numBoxesY);
+ void constructor(HostLinkParams params);
+
+ // Internal helper for sending messages
+ bool sendHelper(uint32_t dest, uint32_t numFlits, void* payload,
+ bool block, uint32_t key);
public:
// Dimensions of board mesh
int meshXLen;
@@ -43,6 +57,7 @@ class HostLink {
// Constructors
HostLink();
HostLink(uint32_t numBoxesX, uint32_t numBoxesY);
+ HostLink(HostLinkParams params);
// Destructor
~HostLink();
@@ -65,6 +80,12 @@ class HostLink {
// Try to send a message (non-blocking, returns true on success)
bool trySend(uint32_t dest, uint32_t numFlits, void* msg);
+ // Send a message using routing key (blocking by default)
+ bool keySend(uint32_t key, uint32_t numFlits, void* msg, bool block = true);
+
+ // Try to send using routing key (non-blocking, returns true on success)
+ bool keyTrySend(uint32_t key, uint32_t numFlits, void* msg);
+
// Receive a max-sized message (blocking)
void recv(void* msg);
diff --git a/include/EdgeList.h b/include/EdgeList.h
index 7d03bb8f..ebd5d37f 100644
--- a/include/EdgeList.h
+++ b/include/EdgeList.h
@@ -3,8 +3,11 @@
#define _NETWORK_H_
#include
-#include
#include
+#include
+#include
+#include
+#include
struct EdgeList {
// Number of nodes and edges
@@ -18,50 +21,42 @@ struct EdgeList {
// Read network from file
void read(const char* filename)
{
- // Read edges
- FILE* fp = fopen(filename, "rt");
- if (fp == NULL) {
- fprintf(stderr, "Can't open '%s'\n", filename);
- exit(EXIT_FAILURE);
- }
+ std::fstream file(filename, std::ios_base::in);
+ std::vector<uint32_t> vec;
// Count number of nodes and edges
numEdges = 0;
numNodes = 0;
- int ret;
- while (1) {
- uint32_t src, dst;
- ret = fscanf(fp, "%d %d", &src, &dst);
- if (ret == EOF) break;
+ uint32_t numInts = 0;
+ uint32_t val;
+ while (file >> val) {
+ vec.push_back(val);
+ numNodes = val >= numNodes ? val+1 : numNodes;
numEdges++;
- numNodes = src >= numNodes ? src+1 : numNodes;
- numNodes = dst >= numNodes ? dst+1 : numNodes;
}
- rewind(fp);
+ assert((numEdges&1) == 0);
+ numEdges >>= 1;
uint32_t* count = (uint32_t*) calloc(numNodes, sizeof(uint32_t));
- for (int i = 0; i < numEdges; i++) {
- uint32_t src, dst;
- ret = fscanf(fp, "%d %d", &src, &dst);
- count[src]++;
+ for (int i = 0; i < vec.size(); i+=2) {
+ count[vec[i]]++;
}
// Create mapping from node id to neighbours
neighbours = (uint32_t**) calloc(numNodes, sizeof(uint32_t*));
- rewind(fp);
for (int i = 0; i < numNodes; i++) {
neighbours[i] = (uint32_t*) calloc(count[i]+1, sizeof(uint32_t));
neighbours[i][0] = count[i];
}
- for (int i = 0; i < numEdges; i++) {
- uint32_t src, dst;
- ret = fscanf(fp, "%d %d", &src, &dst);
+ for (int i = 0; i < vec.size(); i+=2) {
+ uint32_t src = vec[i];
+ uint32_t dst = vec[i+1];
neighbours[src][count[src]--] = dst;
}
// Release
free(count);
- fclose(fp);
+ file.close();
}
// Determine max fan-out
@@ -73,6 +68,17 @@ struct EdgeList {
}
return max;
}
+
+ // Determine min fan-out
+ uint32_t minFanOut() {
+ uint32_t min = ~0;
+ for (uint32_t i = 0; i < numNodes; i++) {
+ uint32_t numNeighbours = neighbours[i][0];
+ if (numNeighbours < min) min = numNeighbours;
+ }
+ return min;
+ }
+
};
#endif
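
The rewritten `EdgeList::read` makes a single pass over the file, buffering every integer, and then builds the neighbour arrays from the buffer instead of rewinding the file twice. A minimal sketch of the same single-pass idea follows. It is not the patch's code: it uses `std::vector` adjacency lists in place of the `calloc`-based arrays, reads from any `std::istream`, and the names `ParsedEdges` and `parseEdgeList` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <vector>

// Result of parsing a whitespace-separated list of "src dst" pairs
struct ParsedEdges {
  uint32_t numNodes = 0;
  uint32_t numEdges = 0;
  std::vector<std::vector<uint32_t>> neighbours;
};

inline ParsedEdges parseEdgeList(std::istream& in) {
  // Single pass: buffer all integers and track the largest node id seen
  std::vector<uint32_t> vals;
  uint32_t numNodes = 0;
  uint32_t v;
  while (in >> v) {
    vals.push_back(v);
    if (v >= numNodes) numNodes = v + 1;
  }
  // Every edge contributes exactly two integers
  assert(vals.size() % 2 == 0);

  ParsedEdges g;
  g.numNodes = numNodes;
  g.numEdges = static_cast<uint32_t>(vals.size() / 2);
  g.neighbours.resize(numNodes);
  for (std::size_t i = 0; i < vals.size(); i += 2)
    g.neighbours[vals[i]].push_back(vals[i + 1]);
  return g;
}
```

For example, parsing `std::istringstream("0 1\n0 2\n2 1\n")` gives three nodes and three edges. Unlike the patch, which fills each neighbour array from the back, this sketch keeps neighbours in file order.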
diff --git a/include/POLite.h b/include/POLite.h
index d12a0e73..f053e440 100644
--- a/include/POLite.h
+++ b/include/POLite.h
@@ -9,10 +9,10 @@
#include
#else
#include
+ #include
#include
#include
#include
- #include
#endif
#endif
diff --git a/include/POLite/Bitmap.h b/include/POLite/Bitmap.h
new file mode 100644
index 00000000..9271bc07
--- /dev/null
+++ b/include/POLite/Bitmap.h
@@ -0,0 +1,59 @@
+#ifndef _BITMAP_H_
+#define _BITMAP_H_
+
+#include
+#include
+#include
+
+struct Bitmap {
+ // Bitmap contents (sequence of 64-bit words)
+ Seq<uint64_t>* contents;
+
+ // Index of first non-full word in bitmap
+ uint32_t firstFree;
+
+ // Constructor
+ Bitmap() {
+ contents = new Seq<uint64_t> (16);
+ firstFree = 0;
+ }
+
+ // Destructor
+ ~Bitmap() {
+ if (contents) delete contents;
+ }
+
+ // Get value of word at given index, return 0 if out-of-bounds
+ inline uint64_t getWord(uint32_t index) {
+ return index >= contents->numElems ? 0ul : contents->elems[index];
+ }
+
+ // Find index of next free word in bitmap starting from given word index
+ inline uint32_t nextFreeWordFrom(uint32_t start) {
+ for (uint32_t i = start; i < contents->numElems; i++)
+ if (~contents->elems[i] != 0ul) return i;
+ return contents->numElems;
+ }
+
+ // Set bit at given index and bit offset in bitmap
+ inline void setBit(uint32_t wordIndex, uint32_t bitIndex) {
+ for (uint32_t i = contents->numElems; i <= wordIndex; i++)
+ contents->append(0ul);
+ contents->elems[wordIndex] |= 1ul << bitIndex;
+ if (wordIndex == firstFree) {
+ firstFree = nextFreeWordFrom(firstFree);
+ }
+ }
+
+ // Find index of next zero bit, and flip that bit
+ inline uint32_t grabNextBit() {
+ uint64_t word = getWord(firstFree);
+ assert(~word != 0ul);
+ uint32_t bit = __builtin_ctzll(~word);
+ uint32_t result = 64*firstFree + bit;
+ setBit(firstFree, bit);
+ return result;
+ }
+};
+
+#endif
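
`Bitmap` acts as a first-fit slot allocator: `firstFree` tracks the lowest word that still has a zero bit, and `grabNextBit` claims the lowest zero bit in that word. A minimal standalone sketch of the same technique is shown below, using `std::vector<uint64_t>` in place of POLite's `Seq` (an assumption made so that it compiles on its own). `BitmapSketch` is a hypothetical name, and `__builtin_ctzll` is a GCC/Clang builtin, as in the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct BitmapSketch {
  // Bitmap contents; words beyond the end are treated as all-zero
  std::vector<uint64_t> words;
  // Index of the first word that still contains a zero bit
  uint32_t firstFree = 0;

  uint64_t getWord(uint32_t index) const {
    return index >= words.size() ? 0 : words[index];
  }

  // Set a bit, growing the bitmap if needed, and keep firstFree up to date
  void setBit(uint32_t wordIndex, uint32_t bitIndex) {
    assert(bitIndex < 64);
    if (wordIndex >= words.size()) words.resize(wordIndex + 1, 0);
    words[wordIndex] |= uint64_t(1) << bitIndex;
    while (firstFree < words.size() && ~words[firstFree] == 0) firstFree++;
  }

  // Claim the lowest zero bit and return its global bit index
  uint32_t grabNextBit() {
    uint64_t word = getWord(firstFree);
    assert(~word != 0);
    uint32_t bit = static_cast<uint32_t>(__builtin_ctzll(~word));
    uint32_t result = 64 * firstFree + bit;
    setBit(firstFree, bit);
    return result;
  }
};
```

Successive calls to `grabNextBit` on a fresh bitmap hand out 0, 1, 2 and so on, skipping any bits that were already claimed through `setBit`.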
diff --git a/include/POLite/PDevice.h b/include/POLite/PDevice.h
index 9eefda3a..508207bd 100644
--- a/include/POLite/PDevice.h
+++ b/include/POLite/PDevice.h
@@ -22,14 +22,22 @@
#define POLITE_NUM_PINS 1
#endif
-// Macros for performance stats
+// The local-multicast key points to a list of incoming edges. Some
+// of those edges are stored in a header, the rest in an array at a
+// different location. The number stored in the header is controlled
+// by the following parameter. If it's too low, we risk wasting
+// memory bandwidth. If it's too high, we risk wasting memory.
+// The minimum value is 0. For large edge state sizes, use 0.
+#ifndef POLITE_EDGES_PER_HEADER
+#define POLITE_EDGES_PER_HEADER 6
+#endif
+
+// Macros for performance stats:
// POLITE_DUMP_STATS - dump performance stats on termination
-// POLITE_COUNT_MSGS - include message counts of performance stats
+// POLITE_COUNT_MSGS - include message counts in performance stats
// Thread-local device id
typedef uint16_t PLocalDeviceId;
-#define InvalidLocalDevId 0xffff
-#define UnusedLocalDevId 0xfffe
// Thread id
typedef uint32_t PThreadId;
@@ -54,7 +62,7 @@ inline PLocalDeviceId getLocalDeviceId(PDeviceAddr addr) { return addr >> 19; }
// What's the max allowed local device address?
inline uint32_t maxLocalDeviceId() { return 8192; }
-// Routing key
+// Local multicast key
typedef uint16_t Key;
#define InvalidKey 0xffff
@@ -102,8 +110,8 @@ template struct ALIGNED PState {
// Message structure
template struct PMessage {
- // Source-based routing key
- Key key;
+ // Destination key
+ uint16_t destKey;
// Application message
M payload;
};
@@ -119,15 +127,15 @@ struct POutEdge {
uint32_t threadMaskHigh;
};
-// An incoming edge to a device (labelleled)
+// An incoming edge to a device
template struct PInEdge {
// Destination device
PLocalDeviceId devId;
- // Edge info
+ // Edge data
E edge;
};
-// An incoming edge to a device (unlabelleled)
+// An incoming edge to a device (unlabelled)
template <> struct PInEdge {
union {
// Destination device
@@ -137,15 +145,17 @@ template <> struct PInEdge {
};
};
-// Helper function: Count board hops between two threads
-inline uint32_t hopsBetween(uint32_t t0, uint32_t t1) {
- uint32_t xmask = ((1<<TinselMeshXBits)-1);
- int32_t y0 = t0 >> (TinselLogThreadsPerBoard + TinselMeshXBits);
- int32_t x0 = (t0 >> TinselLogThreadsPerBoard) & xmask;
- int32_t y1 = t1 >> (TinselLogThreadsPerBoard + TinselMeshXBits);
- int32_t x1 = (t1 >> TinselLogThreadsPerBoard) & xmask;
- return (abs(x0-x1) + abs(y0-y1));
-}
+// Header for a list of incoming edges (fixed size structure to
+// support fast construction/packing of local-multicast tables)
+template struct PInHeader {
+ // Number of receivers
+ uint16_t numReceivers;
+ // Pointer to remaining edges in inTableRest,
+ // if they don't all fit in the header
+ uint16_t restIndex;
+ // Edges stored in the header, to make good use of cached data
+ PInEdge edges[POLITE_EDGES_PER_HEADER];
+};
// Generic thread structure
template ) devices;
// Pointer to base of routing tables
PTR(POutEdge) outTableBase;
- PTR(PInEdge) inTableBase;
+ PTR(PInHeader) inTableHeaderBase;
+ PTR(PInEdge) inTableRestBase;
// Array of local device ids are ready to send
PTR(PLocalDeviceId) senders;
// This array is accessed in a LIFO manner
@@ -170,11 +181,11 @@ template * m = (PMessage*) tinselSendSlot();
// Send message
- m->key = outEdge->key;
+ m->destKey = outEdge->key;
tinselMulticast(outEdge->mbox, outEdge->threadMaskHigh,
outEdge->threadMaskLow, m);
#ifdef POLITE_COUNT_MSGS
- interThreadSendCount++;
- interBoardSendCount +=
- hopsBetween(outEdge->mbox << TinselLogThreadsPerMailbox,
- tinselId());
+ msgsSent++;
#endif
// Move to next neighbour
outEdge++;
}
- else
+ else {
+ #ifdef POLITE_COUNT_MSGS
+ blockedSends++;
+ #endif
tinselWaitUntil(TINSEL_CAN_SEND|TINSEL_CAN_RECV);
+ }
}
else if (sendersTop != senders) {
if (tinselCanSend()) {
@@ -292,8 +310,12 @@ template * inMsg = (PMessage*) tinselRecv();
- PInEdge* inEdge = &inTableBase[inMsg->key];
- while (inEdge->devId != InvalidLocalDevId) {
+ PInHeader* inHeader = &inTableHeaderBase[inMsg->destKey];
+ // Determine number and location of edges/receivers
+ uint32_t numReceivers = inHeader->numReceivers;
+ PInEdge* inEdge = inHeader->edges;
+ // For each receiver
+ for (uint32_t i = 0; i < numReceivers; i++) {
+ if (i == POLITE_EDGES_PER_HEADER)
+ inEdge = &inTableRestBase[inHeader->restIndex];
// Lookup destination device
PLocalDeviceId id = inEdge->devId;
DeviceType dev = getDevice(id);
@@ -332,7 +360,7 @@ template
#include
#include
+#include
+#include
#include
-#include "Seq.h"
+#include
// Nodes of a POETS graph are devices
typedef NodeId PDeviceId;
@@ -24,9 +26,27 @@ template struct PReceiverGroup {
// Thread id where all the receivers reside
uint32_t threadId;
// A sequence of receiving devices on that thread
- Seq<PInEdge<E>>* receivers;
+ SmallSeq<PInEdge<E>> receivers;
};
+// This structure holds info about an edge destination
+struct PEdgeDest {
+ // Index of edge in outgoing edge list
+ uint32_t index;
+ // Destination device
+ PDeviceId dest;
+ // Address where destination is located
+ PDeviceAddr addr;
+};
+
+// Comparison function for PEdgeDest, for use with qsort
+// (Sorts destinations by ascending thread id of destination)
+inline int cmpEdgeDest(const void* e0, const void* e1) {
+ uint32_t t0 = getThreadId(((const PEdgeDest*) e0)->addr);
+ uint32_t t1 = getThreadId(((const PEdgeDest*) e1)->addr);
+ return (t0 > t1) - (t0 < t1);
+}
+
// POETS graph
template class PGraph {
@@ -59,8 +79,19 @@ template *** outTable;
- // Sequence of incoming edges for every thread
- Seq<PInEdge<E>>** inTable;
+ // Sequence of in-edge headers, for each thread
+ Seq<PInHeader<E>>** inTableHeaders;
+ // Remaining in-edges that don't fit in the header table, for each thread
+ Seq<PInEdge<E>>** inTableRest;
+ // Bitmap denoting used space in header table, for each thread
+ Bitmap** inTableBitmaps;
+
+ // Programmable routing tables
+ ProgRouterMesh* progRouterTables;
+
+ // Receiver groups (used internally by some methods, but declared once
+ // to avoid repeated allocation)
+ PReceiverGroup<E> groups[TinselThreadsPerMailbox];
// Generic constructor
void constructor(uint32_t lenX, uint32_t lenY) {
@@ -79,18 +110,29 @@ template );
}
- // Add space for incoming edge table
- if (inTable[threadId]) {
- sizeEIMem = inTable[threadId]->numElems * sizeof(PInEdge<E>);
- sizeEIMem = wordAlign(sizeEIMem);
+ // Add space for incoming edge tables
+ if (inTableHeaders[threadId]) {
+ sizeEIHeaderMem = inTableHeaders[threadId]->numElems *
+ sizeof(PInHeader<E>);
+ sizeEIHeaderMem = wordAlign(sizeEIHeaderMem);
+ }
+ if (inTableRest[threadId]) {
+ sizeEIRestMem = inTableRest[threadId]->numElems * sizeof(PInEdge<E>);
+ sizeEIRestMem = wordAlign(sizeEIRestMem);
}
// Add space for outgoing edge table
for (uint32_t devNum = 0; devNum < numDevs; devNum++) {
@@ -231,8 +288,10 @@ template maxDRAMSize) {
@@ -246,14 +305,17 @@ template devices = vertexMemBase[threadId];
// Set tinsel address of base of edge tables
thread->outTableBase = outEdgeMemBase[threadId];
- thread->inTableBase = inEdgeMemBase[threadId];
+ thread->inTableHeaderBase = inEdgeHeaderMemBase[threadId];
+ thread->inTableRestBase = inEdgeRestMemBase[threadId];
// Add space for each device on thread
uint32_t numDevs = numDevicesOnThread[threadId];
for (uint32_t devNum = 0; devNum < numDevs; devNum++) {
@@ -337,11 +408,18 @@ template * inEdgeArray = (PInEdge*) inEdgeMem[threadId];
- Seq<PInEdge<E>>* edges = inTable[threadId];
+ PInHeader<E>* inEdgeHeaderArray =
+ (PInHeader<E>*) inEdgeHeaderMem[threadId];
+ Seq<PInHeader<E>>* headers = inTableHeaders[threadId];
+ if (headers)
+ for (uint32_t i = 0; i < headers->numElems; i++) {
+ inEdgeHeaderArray[i] = headers->elems[i];
+ }
+ PInEdge<E>* inEdgeRestArray = (PInEdge<E>*) inEdgeRestMem[threadId];
+ Seq<PInEdge<E>>* edges = inTableRest[threadId];
if (edges)
for (uint32_t i = 0; i < edges->numElems; i++) {
- inEdgeArray[i] = edges->elems[i];
+ inEdgeRestArray[i] = edges->elems[i];
}
// At this point, check that next pointers line up with heap sizes
if (nextVMem != vertexMemSize[threadId]) {
@@ -368,12 +446,27 @@ template >**)
+ // Receiver-side tables (headers)
+ inTableHeaders = (Seq<PInHeader<E>>**)
+ calloc(TinselMaxThreads,sizeof(Seq<PInHeader<E>>*));
+ for (uint32_t t = 0; t < TinselMaxThreads; t++) {
+ if (numDevicesOnThread[t] != 0)
+ inTableHeaders[t] = new SmallSeq<PInHeader<E>>;
+ }
+
+ // Receiver-side tables (rest)
+ inTableRest = (Seq<PInEdge<E>>**)
calloc(TinselMaxThreads,sizeof(Seq<PInEdge<E>>*));
for (uint32_t t = 0; t < TinselMaxThreads; t++) {
if (numDevicesOnThread[t] != 0)
- inTable[t] = new SmallSeq<PInEdge<E>>;
+ inTableRest[t] = new SmallSeq<PInEdge<E>>;
+ }
+
+ // Receiver-side tables (bitmaps)
+ inTableBitmaps = (Bitmap**) calloc(TinselMaxThreads,sizeof(Bitmap*));
+ for (uint32_t t = 0; t < TinselMaxThreads; t++) {
+ if (numDevicesOnThread[t] != 0)
+ inTableBitmaps[t] = new Bitmap;
}
// Sender-side tables
@@ -386,174 +479,232 @@ template >* receivers,
- Seq>* groups) {
- groups->clear();
- for (uint32_t i = 0; i < 64; i++) {
- if (receivers[i].numElems > 0) {
- // Add receiver group
- PReceiverGroup g;
- g.threadId = (mbox << TinselLogThreadsPerMailbox) | i;
- g.receivers = &receivers[i];
- groups->append(g);
- }
+ // Determine local-multicast routing key for given set of receivers
+ // (The key must be the same for all receivers)
+ uint32_t findKey(uint32_t numGroups) {
+ // Fast path (single receiver)
+ if (numGroups == 1) {
+ Bitmap* bm = inTableBitmaps[groups[0].threadId];
+ return bm->grabNextBit();
}
- }
- // Determine routing key for given set of receivers
- // (The key must be the same for all receivers)
- uint32_t findKey(Seq>* receivers) {
- uint32_t key = 0;
-
- bool found = false;
- while (!found) {
- found = true;
- for (uint32_t i = 0; i < receivers->numElems; i++) {
- PReceiverGroup g = receivers->elems[i];
- uint32_t numReceivers = g.receivers->numElems;
- if (numReceivers > 0) {
- // Lookup thread id of receiver
- uint32_t t = g.threadId;
- // Lookup table size for this thread
- uint32_t tableSize = inTable[t]->numElems;
- // Move to next receiver when we find a space
- if (key >= tableSize) continue;
- // Is there space at the current key?
- // (Need space for numReceivers plus null terminator)
- bool space = true;
- for (int j = 0; j < numReceivers+1; j++) {
- if ((key+j) >= tableSize) break;
- if (inTable[t]->elems[key+j].devId != UnusedLocalDevId) {
- found = false;
- key = key+j+1;
- break;
- }
- }
- }
+ // Determine starting index for key search
+ uint32_t index = 0;
+ for (uint32_t i = 0; i < numGroups; i++) {
+ PReceiverGroup<E>* g = &groups[i];
+ Bitmap* bm = inTableBitmaps[g->threadId];
+ if (bm->firstFree > index) index = bm->firstFree;
+ }
+
+ // Find key that is available for all receivers
+ uint64_t mask;
+ retry:
+ mask = 0ul;
+ for (uint32_t i = 0; i < numGroups; i++) {
+ PReceiverGroup<E>* g = &groups[i];
+ Bitmap* bm = inTableBitmaps[g->threadId];
+ mask |= bm->getWord(index);
+ if (~mask == 0ul) { index++; goto retry; }
}
+
+ // Mark key as taken in each bitmap
+ uint32_t bit = __builtin_ctzll(~mask);
+ for (uint32_t i = 0; i < numGroups; i++) {
+ PReceiverGroup<E>* g = &groups[i];
+ Bitmap* bm = inTableBitmaps[g->threadId];
+ bm->setBit(index, bit);
}
- return key;
+ return 64*index + bit;
}
// Add entries to the input tables for the given receivers
// (Only valid after mapper is called)
- uint32_t addInTableEntries(Seq>* receivers) {
- uint32_t key = findKey(receivers);
- if (key >= 0xfffe) {
+ uint32_t addInTableEntries(uint32_t numGroups) {
+ uint32_t key = findKey(numGroups);
+ if (key >= 0xffff) {
printf("Routing key exceeds 16 bits\n");
exit(EXIT_FAILURE);
}
- PInEdge null, unused;
- null.devId = InvalidLocalDevId;
- unused.devId = UnusedLocalDevId;
- // Now that a key with sufficient space has been found, populate the tables
- for (uint32_t i = 0; i < receivers->numElems; i++) {
- PReceiverGroup g = receivers->elems[i];
- uint32_t numReceivers = g.receivers->numElems;
- if (numReceivers > 0) {
- // Lookup thread id of receiver
- uint32_t t = g.threadId;
- // Lookup table size for this thread
- uint32_t tableSize = inTable[t]->numElems;
- // Make sure inTable is big enough for new entries
- for (uint32_t j = tableSize; j < (key+numReceivers+1); j++)
- inTable[t]->append(unused);
- // Add receivers to thread's inTable
- for (uint32_t j = 0; j < numReceivers; j++) {
- inTable[t]->elems[key+j] = g.receivers->elems[j];
+ // Populate inTableHeaders and inTableRest using the key
+ for (uint32_t i = 0; i < numGroups; i++) {
+ PReceiverGroup<E>* g = &groups[i];
+ uint32_t numEdges = g->receivers.numElems;
+ PInEdge<E>* edgePtr = g->receivers.elems;
+ if (numEdges > 0) {
+ // Determine thread id of receiver
+ uint32_t t = g->threadId;
+ // Extend table
+ Seq<PInHeader<E>>* headers = inTableHeaders[t];
+ if (key >= headers->numElems)
+ headers->extendBy(key + 1 - headers->numElems);
+ // Fill in header
+ PInHeader<E>* header = &inTableHeaders[t]->elems[key];
+ header->numReceivers = numEdges;
+ if (inTableRest[t]->numElems > 0xffff) {
+ printf("In-table index exceeds 16 bits\n");
+ exit(EXIT_FAILURE);
+ }
+ header->restIndex = inTableRest[t]->numElems;
+ uint32_t numHeaderEdges = numEdges < POLITE_EDGES_PER_HEADER ?
+ numEdges : POLITE_EDGES_PER_HEADER;
+ for (uint32_t j = 0; j < numHeaderEdges; j++) {
+ header->edges[j] = *edgePtr;
+ edgePtr++;
+ }
+ numEdges -= numHeaderEdges;
+ // Overflow into rest memory if header not big enough
+ for (uint32_t j = 0; j < numEdges; j++) {
+ inTableRest[t]->append(*edgePtr);
+ edgePtr++;
}
- inTable[t]->elems[key+numReceivers] = null;
}
}
return key;
}
+ // Split edge list into board-local and non-board-local destinations,
+ // and sort each list by destination thread id
+ // (Only valid after mapper is called)
+ void splitDests(PDeviceId devId, PinId pinId,
+ Seq<PEdgeDest>* local, Seq<PEdgeDest>* nonLocal) {
+ local->clear();
+ nonLocal->clear();
+ PDeviceAddr devAddr = toDeviceAddr[devId];
+ uint32_t devBoard = getThreadId(devAddr) >> TinselLogThreadsPerBoard;
+ // Split destinations into local/non-local
+ Seq<PDeviceId>* dests = graph.outgoing->elems[devId];
+ Seq<PinId>* pinIds = graph.pins->elems[devId];
+ for (uint32_t d = 0; d < dests->numElems; d++) {
+ if (pinIds->elems[d] == pinId) {
+ PEdgeDest e;
+ e.index = d;
+ e.dest = dests->elems[d];
+ e.addr = toDeviceAddr[e.dest];
+ uint32_t destBoard = getThreadId(e.addr) >> TinselLogThreadsPerBoard;
+ if (devBoard == destBoard)
+ local->append(e);
+ else
+ nonLocal->append(e);
+ }
+ }
+ // Sort local list
+ qsort(local->elems, local->numElems, sizeof(PEdgeDest), cmpEdgeDest);
+ // Sort non-local list
+ qsort(nonLocal->elems, nonLocal->numElems, sizeof(PEdgeDest), cmpEdgeDest);
+ }
+
+ // Compute table updates for destinations for given device
+ // (Only valid after mapper is called)
+ void computeTables(Seq<PEdgeDest>* dests, uint32_t d,
+ Seq<PRoutingDest>* out) {
+ out->clear();
+ uint32_t index = 0;
+ while (index < dests->numElems) {
+ // New set of receiver groups on same mailbox
+ uint32_t threadMaskLow = 0;
+ uint32_t threadMaskHigh = 0;
+ uint32_t nextGroup = 0;
+ // Current mailbox & thread being considered
+ PDeviceAddr mbox = getThreadId(dests->elems[index].addr) >>
+ TinselLogThreadsPerMailbox;
+ uint32_t thread = getThreadId(dests->elems[index].addr) &
+ ((1<numElems) {
+ PEdgeDest* edge = &dests->elems[index];
+ // Determine destination mailbox address and mailbox-local thread
+ uint32_t destMailbox = getThreadId(edge->addr) >>
+ TinselLogThreadsPerMailbox;
+ uint32_t destThread = getThreadId(edge->addr) &
+ ((1< in;
+ in.devId = getLocalDeviceId(edge->addr);
+ Seq* edges = edgeLabels.elems[d];
+ if (! std::is_same::value)
+ in.edge = edges->elems[edge->index];
+ // Update current receiver group
+ groups[nextGroup].receivers.append(in);
+ groups[nextGroup].threadId = getThreadId(edge->addr);
+ if (thread < 32) threadMaskLow |= 1 << thread;
+ if (thread >= 32) threadMaskHigh |= 1 << (thread-32);
+ index++;
+ }
+ else {
+ // Start new receiver group
+ thread = destThread;
+ nextGroup++;
+ assert(nextGroup < TinselThreadsPerMailbox);
+ }
+ }
+ else break;
+ }
+ // Add input table entries
+ uint32_t key = addInTableEntries(nextGroup+1);
+ // Add output entry
+ PRoutingDest dest;
+ dest.kind = PRDestKindMRM;
+ dest.mbox = mbox;
+ dest.mrm.key = key;
+ dest.mrm.threadMaskLow = threadMaskLow;
+ dest.mrm.threadMaskHigh = threadMaskHigh;
+ out->append(dest);
+ // Clear receiver groups, for a new iteration
+ for (uint32_t i = 0; i <= nextGroup; i++) groups[i].receivers.clear();
+ }
+ }
+
// Compute routing tables
// (Only valid after mapper is called)
void computeRoutingTables() {
- // Routing table stats
- uint64_t totalOutEdges = 0;
+ // Edge destinations (local to sender board, or not)
+ Seq<PEdgeDest> local;
+ Seq<PEdgeDest> nonLocal;
- // Sequence of local device ids, for each multicast destination
- SmallSeq<PInEdge<E>> receivers[64];
+ // Routing destinations
+ Seq<PRoutingDest> dests;
- // Sequence of receiver groups
- // (A more compact representation of the receivers array)
- SmallSeq<PReceiverGroup<E>> groups;
+ // Allocate per-board programmable routing tables
+ progRouterTables = new ProgRouterMesh(numBoardsX, numBoardsY);
// For each device
for (uint32_t d = 0; d < numDevices; d++) {
// For each pin
for (uint32_t p = 0; p < POLITE_NUM_PINS; p++) {
- Seq dests = *(graph.outgoing->elems[d]);
- Seq edges = *(edgeLabels.elems[d]);
- // While destinations are remaining
- while (dests.numElems > 0) {
- // Clear receivers
- for (uint32_t i = 0; i < 64; i++) receivers[i].clear();
- uint32_t threadMaskLow = 0;
- uint32_t threadMaskHigh = 0;
- // Current mailbox being considered
- PDeviceAddr mbox = getThreadId(toDeviceAddr[dests.elems[0]]) >>
- TinselLogThreadsPerMailbox;
- // For each destination
- uint32_t destsRemaining = 0;
- for (uint32_t i = 0; i < dests.numElems; i++) {
- // Determine destination mailbox address and mailbox-local thread
- PDeviceId destId = dests.elems[i];
- PDeviceAddr destAddr = toDeviceAddr[destId];
- uint32_t destMailbox = getThreadId(destAddr) >>
- TinselLogThreadsPerMailbox;
- uint32_t destThread = getThreadId(destAddr) &
- ((1< edge;
- edge.devId = getLocalDeviceId(destAddr);
- if (! std::is_same::value) edge.edge = edges.elems[i];
- receivers[destThread].append(edge);
- if (destThread < 32) threadMaskLow |= 1 << destThread;
- if (destThread >= 32) threadMaskHigh |= 1 << (destThread-32);
- }
- else {
- // Add destination back into sequence
- dests.elems[destsRemaining] = dests.elems[i];
- edges.elems[destsRemaining] = edges.elems[i];
- destsRemaining++;
- }
- }
- // Create receiver groups
- createReceiverGroups(mbox, receivers, &groups);
- // Add input table entries
- uint32_t key = addInTableEntries(&groups);
- // Add output table entry
+ // Split edge lists into local/non-local and sort by target thread id
+ splitDests(d, p, &local, &nonLocal);
+ // Deal with board-local connections
+ computeTables(&local, d, &dests);
+ for (uint32_t i = 0; i < dests.numElems; i++) {
+ PRoutingDest dest = dests.elems[i];
POutEdge edge;
- edge.mbox = mbox;
- edge.key = key;
- edge.threadMaskLow = threadMaskLow;
- edge.threadMaskHigh = threadMaskHigh;
+ edge.mbox = dest.mbox;
+ edge.key = dest.mrm.key;
+ edge.threadMaskLow = dest.mrm.threadMaskLow;
+ edge.threadMaskHigh = dest.mrm.threadMaskHigh;
outTable[d][p]->append(edge);
- // Prepare for new output table entry
- dests.numElems = destsRemaining;
- edges.numElems = destsRemaining;
- totalOutEdges++;
}
- // Add output edge terminator
+ // Deal with non-board-local connections
+ computeTables(&nonLocal, d, &dests);
+ uint32_t src = getThreadId(toDeviceAddr[d]) >>
+ TinselLogThreadsPerMailbox;
+ uint32_t key = progRouterTables->addDestsFromBoard(src, &dests);
+ POutEdge edge;
+ edge.mbox = tinselUseRoutingKey();
+ edge.key = 0;
+ edge.threadMaskLow = key;
+ edge.threadMaskHigh = 0;
+ outTable[d][p]->append(edge);
+ // Add output list terminator
POutEdge term;
term.key = InvalidKey;
outTable[d][p]->append(term);
}
}
- //printf("Average edges per pin: %lu\n",
- // totalOutEdges / (numDevices * POLITE_NUM_PINS);
- }
+ }
// Release all structures
void releaseAll() {
@@ -575,21 +726,38 @@ template useSendBuffer = true;
writeRAM(hostLink, vertexMem, vertexMemSize, vertexMemBase);
writeRAM(hostLink, threadMem, threadMemSize, threadMemBase);
- writeRAM(hostLink, inEdgeMem, inEdgeMemSize, inEdgeMemBase);
+ writeRAM(hostLink, inEdgeHeaderMem,
+ inEdgeHeaderMemSize, inEdgeHeaderMemBase);
+ writeRAM(hostLink, inEdgeRestMem, inEdgeRestMemSize, inEdgeRestMemBase);
writeRAM(hostLink, outEdgeMem, outEdgeMemSize, outEdgeMemBase);
+ progRouterTables->write(hostLink);
hostLink->flush();
hostLink->useSendBuffer = useSendBufferOld;
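Reviewer note on findKey (PGraph.h): the new search ORs together one 64-bit word from each receiver thread's bitmap, so any zero bit in the result is a key that is free on all of them. The following is a minimal standalone sketch of that idea, not the code from this change. `KeyBitmap` and `allocSharedKey` are hypothetical names standing in for the `Bitmap` class and `findKey`. Unlike the diff, the sketch always starts at word 0 rather than at the largest `firstFree`, and it finds the lowest zero bit with a loop instead of `__builtin_ctzll`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the Bitmap class used in the diff:
// one bit per routing key, 64 keys per word, growing on demand.
struct KeyBitmap {
  std::vector<uint64_t> words;

  // Return word i, treating words past the current end as entirely free
  uint64_t getWord(size_t i) const {
    return i < words.size() ? words[i] : 0;
  }

  // Mark bit `bit` of word `i` as taken, growing the bitmap if needed
  void setBit(size_t i, unsigned bit) {
    assert(bit < 64);
    if (i >= words.size()) words.resize(i + 1, 0);
    words[i] |= uint64_t(1) << bit;
  }
};

// Find the lowest key that is free in every bitmap and claim it in all of them
uint32_t allocSharedKey(std::vector<KeyBitmap*>& bitmaps) {
  assert(!bitmaps.empty());
  for (size_t index = 0; ; index++) {
    // OR the words together: a zero bit is free for every receiver
    uint64_t mask = 0;
    for (KeyBitmap* bm : bitmaps) mask |= bm->getWord(index);
    if (~mask == 0) continue;  // no common free bit in this word
    // Locate the lowest zero bit of the combined mask
    unsigned bit = 0;
    while ((mask >> bit) & 1) bit++;
    for (KeyBitmap* bm : bitmaps) bm->setBit(index, bit);
    return uint32_t(64 * index + bit);
  }
}
```

With two bitmaps holding `0b011` and `0b100`, the first shared key is 3, because bits 0 to 2 are each taken by one of the two receivers.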
@@ -835,7 +1008,6 @@ template
#include
#include
+#include
+#include
typedef uint32_t PartitionId;
// Partition and place a graph on a 2D mesh
struct Placer {
+ // Select between different methods
+ enum Method {
+ Default,
+ Metis,
+ Random,
+ Direct,
+ BFS
+ };
+ const Method defaultMethod=Metis;
+
// The graph being placed
Graph* graph;
@@ -41,8 +53,40 @@ struct Placer {
uint32_t* yCoordSaved;
uint64_t savedCost;
+ // Random numbers
+ unsigned int seed;
+ void setRand(unsigned int s) { seed = s; }
+ int getRand() { return rand_r(&seed); }
+
+ // Controls which strategy is used
+ Method method = Default;
+
+ // Select placer method
+ void chooseMethod()
+ {
+ auto e = getenv("POLITE_PLACER");
+ if (e) {
+ if (!strcmp(e, "metis"))
+ method=Metis;
+ else if (!strcmp(e, "random"))
+ method=Random;
+ else if (!strcmp(e, "direct"))
+ method=Direct;
+ else if (!strcmp(e, "bfs"))
+ method=BFS;
+ else if (!strcmp(e, "default") || *e == '\0')
+ method=Default;
+ else {
+ fprintf(stderr, "Unrecognised placer method: %s\n", e);
+ exit(EXIT_FAILURE);
+ }
+ }
+ if (method == Default)
+ method = defaultMethod;
+ }
+
// Partition the graph using Metis
- void partition() {
+ void partitionMetis() {
// Compute total number of edges
uint32_t numEdges = 0;
for (uint32_t i = 0; i < graph->incoming->numElems; i++) {
@@ -116,6 +160,96 @@ struct Placer {
free(parts);
}
+ // Partition the graph randomly
+ void partitionRandom() {
+ uint32_t numVertices = graph->incoming->numElems;
+ uint32_t numParts = width * height;
+
+ // Populate result array
+ for (uint32_t i = 0; i < numVertices; i++) {
+ partitions[i] = getRand() % numParts;
+ }
+ }
+
+ // Partition the graph using direct mapping
+ void partitionDirect() {
+ uint32_t numVertices = graph->incoming->numElems;
+ uint32_t numParts = width * height;
+ uint32_t partSize = (numVertices + numParts) / numParts;
+
+ // Populate result array
+ for (uint32_t i = 0; i < numVertices; i++) {
+ partitions[i] = i / partSize;
+ }
+ }
+
+ // Partition the graph using repeated BFS
+ void partitionBFS() {
+ uint32_t numVertices = graph->incoming->numElems;
+ uint32_t numParts = width * height;
+ uint32_t partSize = (numVertices + numParts) / numParts;
+
+ // Visited bit for each vertex
+ bool* seen = new bool [numVertices];
+ memset(seen, 0, numVertices);
+
+ // Next vertex to visit
+ uint32_t nextUnseen = 0;
+
+ // Next partition id
+ uint32_t nextPart = 0;
+
+ while (nextUnseen < numVertices) {
+ // Frontier
+ std::queue<uint32_t> frontier;
+ uint32_t count = 0;
+
+ while (nextUnseen < numVertices && count < partSize) {
+ // Size-bounded BFS from nextUnseen
+ frontier.push(nextUnseen);
+ while (count < partSize && !frontier.empty()) {
+ uint32_t v = frontier.front();
+ frontier.pop();
+ if (!seen[v]) {
+ seen[v] = true;
+ partitions[v] = nextPart;
+ count++;
+ // Add unvisited neighbours of v to the frontier
+ Seq<NodeId>* dests = graph->outgoing->elems[v];
+ for (uint32_t i = 0; i < dests->numElems; i++) {
+ uint32_t w = dests->elems[i];
+ if (!seen[w]) frontier.push(w);
+ }
+ }
+ }
+ while (nextUnseen < numVertices && seen[nextUnseen]) nextUnseen++;
+ }
+
+ nextPart++;
+ }
+
+ delete [] seen;
+ }
+
+ void partition()
+ {
+ switch(method){
+ case Default:
+ case Metis:
+ partitionMetis();
+ break;
+ case Random:
+ partitionRandom();
+ break;
+ case Direct:
+ partitionDirect();
+ break;
+ case BFS:
+ partitionBFS();
+ break;
+ }
+ }
+
// Create subgraph for each partition
void computeSubgraphs() {
uint32_t numPartitions = width*height;
@@ -179,7 +313,7 @@ struct Placer {
// Random mapping
for (uint32_t y = 0; y < height; y++) {
for (uint32_t x = 0; x < width; x++) {
- int index = rand() % numPartitions;
+ int index = getRand() % numPartitions;
PartitionId p = pids[index];
mapping[y][x] = p;
xCoord[p] = x;
@@ -295,6 +429,8 @@ struct Placer {
graph = g;
width = w;
height = h;
+ // Random seed
+ setRand(1 + omp_get_thread_num());
// Allocate the partitions array
partitions = new PartitionId [g->incoming->numElems];
// Allocate subgraphs
@@ -316,6 +452,8 @@ struct Placer {
yCoord = new uint32_t [width*height];
xCoordSaved = new uint32_t [width*height];
yCoordSaved = new uint32_t [width*height];
+ // Pick a placement method, or select default
+ chooseMethod();
// Partition the graph using Metis
partition();
// Compute subgraphs, one per partition
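Reviewer note on partitionBFS (Placer.h): the new method fills one partition at a time by breadth-first search, reseeding from the lowest unvisited vertex whenever the frontier runs dry before the partition is full. Below is a small standalone sketch of the same loop structure on plain `std::vector` adjacency lists. The function signature and the `adj` parameter are illustrative assumptions, standing in for `graph->outgoing` and the `partitions` array.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

// Assign each vertex to a partition by repeated size-bounded BFS.
// adj[v] lists the out-neighbours of v (illustrative stand-in for
// graph->outgoing in the diff).
std::vector<uint32_t> partitionBFS(
    const std::vector<std::vector<uint32_t>>& adj, uint32_t numParts) {
  assert(numParts > 0);
  uint32_t n = static_cast<uint32_t>(adj.size());
  // Same sizing rule as the diff: slightly more than n / numParts
  uint32_t partSize = (n + numParts) / numParts;
  std::vector<uint32_t> part(n, 0);
  std::vector<bool> seen(n, false);
  uint32_t nextUnseen = 0, nextPart = 0;
  while (nextUnseen < n) {
    std::queue<uint32_t> frontier;
    uint32_t count = 0;
    // Keep seeding new searches until this partition is full
    while (nextUnseen < n && count < partSize) {
      frontier.push(nextUnseen);
      while (count < partSize && !frontier.empty()) {
        uint32_t v = frontier.front();
        frontier.pop();
        if (seen[v]) continue;
        seen[v] = true;
        part[v] = nextPart;
        count++;
        // Queue unvisited neighbours of v
        for (uint32_t w : adj[v])
          if (!seen[w]) frontier.push(w);
      }
      // Skip over vertices that have already been placed
      while (nextUnseen < n && seen[nextUnseen]) nextUnseen++;
    }
    nextPart++;
  }
  return part;
}
```

On a four-vertex path split into two parts, `partSize` is 3, so the first three vertices share partition 0 and the last one lands in partition 1.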
diff --git a/include/POLite/ProgRouters.h b/include/POLite/ProgRouters.h
new file mode 100644
index 00000000..9890c43e
--- /dev/null
+++ b/include/POLite/ProgRouters.h
@@ -0,0 +1,413 @@
+// SPDX-License-Identifier: BSD-2-Clause
+#ifndef _PROGROUTERS_H_
+#define _PROGROUTERS_H_
+
+#include
+#include
+#include
+#include
+#include
+#include
+
+// =============================
+// Per-board programmable router
+// =============================
+
+class ProgRouter {
+
+ // Number of chunks used so far in current beat
+ uint32_t numChunks;
+
+ // Number of records used so far in current beat
+ uint32_t numRecords;
+
+ // Number of beats associated with current key
+ uint32_t numBeats;
+
+ // Index of RAM currently being used
+ uint32_t currentRAM;
+
+ // Pointer to previously created indirection
+ // (We need indirections to handle record sequences of 31 beats or more)
+ uint8_t* prevInd;
+
+ // Move on to the next beat
+ void nextBeat() {
+ // Set number of records in current beat
+ uint32_t beatBase = table[currentRAM]->numElems - 32;
+ uint8_t* beat = &table[currentRAM]->elems[beatBase];
+ beat[31] = 0;
+ beat[30] = numRecords;
+ numChunks = numRecords = 0;
+ // Allocate new beat, and check for overflow
+ numBeats++;
+ table[currentRAM]->extendBy(32);
+ if (table[currentRAM]->numElems >= (TinselPOLiteProgRouterLength-1024)) {
+ printf("ProgRouter out of memory\n");
+ exit(EXIT_FAILURE);
+ }
+ // We need indirections to handle sequences of 31 beats or more
+ if ((numBeats % 31) == 0) {
+ // Set previous indirection, if there is one
+ if (prevInd) {
+ uint32_t key = TinselPOLiteProgRouterBase +
+ table[currentRAM]->numElems - 31*32;
+ if (currentRAM) key |= 0x80000000;
+ key |= 31;
+ setIND(prevInd, key);
+ }
+ prevInd = addIND();
+ }
+ }
+
+ // Get current record pointer for 48-bit entry
+ inline uint8_t* currentRecord48() {
+ uint32_t beatBase = (table[currentRAM]->numElems-32) + 6*(4-numChunks);
+ return &table[currentRAM]->elems[beatBase];
+ }
+
+ // Get current record pointer for 96-bit entry
+ inline uint8_t* currentRecord96() {
+ uint32_t beatBase = (table[currentRAM]->numElems-32) + 6*(3-numChunks);
+ return &table[currentRAM]->elems[beatBase];
+ }
+
+ public:
+
+ // A table holding encoded routing beats for each RAM
+ Seq<uint8_t>** table;
+
+ // Constructor
+ ProgRouter() {
+ // Currently we assume two RAMs per board
+ assert(TinselDRAMsPerBoard == 2);
+ // Initialise member variables
+ prevInd = NULL;
+ numBeats = 1;
+ numChunks = numRecords = currentRAM = 0;
+ // Allocate one sequence per RAM
+ table = new Seq<uint8_t>* [TinselDRAMsPerBoard];
+ // Initially each sequence has a capacity of 32KB
+ for (int i = 0; i < TinselDRAMsPerBoard; i++) {
+ table[i] = new Seq<uint8_t> (1 << 15);
+ // Allocate first beat
+ table[i]->extendBy(32);
+ }
+ }
+
+ // Destructor
+ ~ProgRouter() {
+ for (int i = 0; i < TinselDRAMsPerBoard; i++) delete table[i];
+ delete [] table;
+ }
+
+ // Generate a new key for the records added
+ uint32_t genKey() {
+ // Determine index of first beat in record sequence
+ uint32_t index = table[currentRAM]->numElems - numBeats*32;
+ // Determine final key length
+ uint32_t finalKeyLen = prevInd ? 31 : numBeats;
+ // Insert outstanding indirection, if there is one
+ if (prevInd) {
+ // Set previous indirection to latest block of beats
+ uint32_t indKey = TinselPOLiteProgRouterBase +
+ table[currentRAM]->numElems - (numBeats%31)*32;
+ if (currentRAM) indKey |= 0x80000000;
+ indKey |= (numBeats%31);
+ setIND(prevInd, indKey);
+ }
+ // Determine final key
+ uint32_t key = TinselPOLiteProgRouterBase + index;
+ if (currentRAM) key |= 0x80000000;
+ key |= finalKeyLen;
+ // Move to next beat
+ nextBeat();
+ numBeats = 1;
+ prevInd = NULL;
+ // Pick smaller RAM for next key
+ currentRAM = table[0]->numElems < table[1]->numElems ? 0 : 1;
+ return key;
+ }
+
+ // Add an IND record to the table
+ // Return a pointer to the indirection key,
+ // so it can be set later by the caller
+ uint8_t* addIND() {
+ if (numChunks == 5) nextBeat();
+ uint8_t* ptr = currentRecord48();
+ ptr[5] = 4 << 5;
+ numChunks++;
+ numRecords++;
+ return ptr;
+ }
+
+ // Set indirection key
+ void setIND(uint8_t* ind, uint32_t key) {
+ ind[0] = key;
+ ind[1] = key >> 8;
+ ind[2] = key >> 16;
+ ind[3] = key >> 24;
+ }
+
+ // Add an MRM record to the table
+ void addMRM(uint32_t mboxX, uint32_t mboxY,
+ uint32_t threadsHigh, uint32_t threadsLow,
+ uint16_t localKey) {
+ if (numChunks >= 4) nextBeat();
+ uint8_t* ptr = currentRecord96();
+ ptr[0] = threadsLow;
+ ptr[1] = threadsLow >> 8;
+ ptr[2] = threadsLow >> 16;
+ ptr[3] = threadsLow >> 24;
+ ptr[4] = threadsHigh;
+ ptr[5] = threadsHigh >> 8;
+ ptr[6] = threadsHigh >> 16;
+ ptr[7] = threadsHigh >> 24;
+ ptr[8] = localKey;
+ ptr[9] = localKey >> 8;
+ ptr[11] = (3 << 5) | (mboxY << 3) | (mboxX << 1);
+ numChunks += 2;
+ numRecords++;
+ }
+
+ // Add an RR record to the table
+ void addRR(uint32_t dir, uint32_t key) {
+ if (numChunks == 5) nextBeat();
+ uint8_t* ptr = currentRecord48();
+ ptr[0] = key;
+ ptr[1] = key >> 8;
+ ptr[2] = key >> 16;
+ ptr[3] = key >> 24;
+ ptr[5] = (2 << 5) | (dir << 3);
+ numChunks++;
+ numRecords++;
+ }
+
+ // Add a URM1 record to the table
+ void addURM1(uint32_t mboxX, uint32_t mboxY,
+ uint32_t threadId, uint32_t key) {
+ if (numChunks == 5) nextBeat();
+ uint8_t* ptr = currentRecord48();
+ ptr[0] = key;
+ ptr[1] = key >> 8;
+ ptr[2] = key >> 16;
+ ptr[3] = key >> 24;
+ ptr[4] = (threadId << 3);
+ ptr[5] = (mboxY << 3) | (mboxX << 1) | (threadId >> 5);
+ numChunks++;
+ numRecords++;
+ }
+};
+
+// ==================================
+// Data type for routing destinations
+// ==================================
+
+enum PRoutingDestKind { PRDestKindURM1, PRDestKindMRM };
+
+// URM1 routing destination
+struct PRoutingDestURM1 {
+ // Mailbox-local thread
+ uint16_t threadId;
+ // Thread-local routing key
+ uint32_t key;
+};
+
+// MRM routing destination
+struct PRoutingDestMRM {
+ // Thread-local routing key
+ uint16_t key;
+ // Destination threads
+ uint32_t threadMaskLow;
+ uint32_t threadMaskHigh;
+};
+
+// Routing destination
+struct PRoutingDest {
+ PRoutingDestKind kind;
+ // Destination mailbox
+ uint32_t mbox;
+ // URM1 or MRM destination
+ union {
+ PRoutingDestURM1 urm1;
+ PRoutingDestMRM mrm;
+ };
+};
+
+// Extract board X coord from routing dest
+inline uint32_t destX(uint32_t mbox) {
+ uint32_t x = mbox >> (TinselMailboxMeshXBits + TinselMailboxMeshYBits);
+ return x & ((1<> (TinselMailboxMeshXBits +
+ TinselMailboxMeshYBits + TinselMeshXBits);
+ return y & ((1<> TinselMailboxMeshXBits) &
+ ((1<* dests) {
+ if (dests->numElems == 0) return 0;
+
+ // Categorise dests into local, N, S, E, and W groups
+ Seq<PRoutingDest> local(dests->numElems);
+ Seq<PRoutingDest> north(dests->numElems);
+ Seq<PRoutingDest> south(dests->numElems);
+ Seq<PRoutingDest> east(dests->numElems);
+ Seq<PRoutingDest> west(dests->numElems);
+ for (int i = 0; i < dests->numElems; i++) {
+ PRoutingDest dest = dests->elems[i];
+ uint32_t receiverX = destX(dest.mbox);
+ uint32_t receiverY = destY(dest.mbox);
+ if (receiverX < senderX) west.append(dest);
+ else if (receiverX > senderX) east.append(dest);
+ else if (receiverY < senderY) south.append(dest);
+ else if (receiverY > senderY) north.append(dest);
+ else local.append(dest);
+ }
+
+ // Recurse on non-local groups and add RR records on return
+ if (north.numElems > 0) {
+ uint32_t key = addDestsFromBoardXY(senderX, senderY+1, &north);
+ table[senderY][senderX].addRR(0, key);
+ }
+ if (south.numElems > 0) {
+ uint32_t key = addDestsFromBoardXY(senderX, senderY-1, &south);
+ table[senderY][senderX].addRR(1, key);
+ }
+ if (east.numElems > 0) {
+ uint32_t key = addDestsFromBoardXY(senderX+1, senderY, &east);
+ table[senderY][senderX].addRR(2, key);
+ }
+ if (west.numElems > 0) {
+ uint32_t key = addDestsFromBoardXY(senderX-1, senderY, &west);
+ table[senderY][senderX].addRR(3, key);
+ }
+
+ // Add local records
+ for (int i = 0; i < local.numElems; i++) {
+ PRoutingDest dest = local.elems[i];
+ if (dest.kind == PRDestKindMRM) {
+ table[senderY][senderX].addMRM(destMboxX(dest.mbox),
+ destMboxY(dest.mbox), dest.mrm.threadMaskHigh,
+ dest.mrm.threadMaskLow, dest.mrm.key);
+ }
+ else if (dest.kind == PRDestKindURM1) {
+ table[senderY][senderX].addURM1(destMboxX(dest.mbox),
+ destMboxY(dest.mbox), dest.urm1.threadId, dest.urm1.key);
+ }
+ else {
+ fprintf(stderr, "ProgRouters.h: unknown routing record kind\n");
+ exit(EXIT_FAILURE);
+ }
+ }
+
+ return table[senderY][senderX].genKey();
+ }
+
+ // Add routing destinations from given global mailbox id
+ uint32_t addDestsFromBoard(uint32_t mbox, Seq<PRoutingDest>* dests) {
+ return addDestsFromBoardXY(destX(mbox), destY(mbox), dests);
+ }
+
+ // Write routing tables to memory via HostLink
+ void write(HostLink* hostLink) {
+ // Request to boot loader
+ BootReq req;
+
+ // Compute number of cores per DRAM
+ const uint32_t coresPerDRAM = 1 <<
+ (TinselLogCoresPerDCache + TinselLogDCachesPerDRAM);
+
+ // Initialise write address for each routing table
+ for (int y = 0; y < boardsY; y++) {
+ for (int x = 0; x < boardsX; x++) {
+ for (int i = 0; i < TinselDRAMsPerBoard; i++) {
+ // Use one core to initialise each DRAM
+ uint32_t dest = hostLink->toAddr(x, y, coresPerDRAM * i, 0);
+ req.cmd = SetAddrCmd;
+ req.numArgs = 1;
+ req.args[0] = TinselPOLiteProgRouterBase;
+ hostLink->send(dest, 1, &req);
+ // Ensure space for an extra 32 bytes in each
+ // table so we don't have to check for overflow below
+ // when consuming the tables in chunks of 12 bytes
+ table[y][x].table[i]->ensureSpaceFor(32);
+ }
+ }
+ }
+
+ // Write each routing table
+ bool allDone = false;
+ uint32_t offset = 0;
+ while (! allDone) {
+ allDone = true;
+ for (int y = 0; y < boardsY; y++) {
+ for (int x = 0; x < boardsX; x++) {
+ for (int i = 0; i < TinselDRAMsPerBoard; i++) {
+ Seq<uint8_t>* seq = table[y][x].table[i];
+ if (offset < seq->numElems) {
+ uint32_t dest = hostLink->toAddr(x, y, coresPerDRAM * i, 0);
+ uint8_t* base = &seq->elems[offset];
+ allDone = false;
+ req.cmd = StoreCmd;
+ req.numArgs = 3;
+ req.args[0] = ((uint32_t*) base)[0];
+ req.args[1] = ((uint32_t*) base)[1];
+ req.args[2] = ((uint32_t*) base)[2];
+ hostLink->send(dest, 1, &req);
+ }
+ }
+ }
+ }
+ offset += 12;
+ }
+ }
+
+ // Destructor
+ ~ProgRouterMesh() {
+ for (int y = 0; y < boardsY; y++)
+ delete [] table[y];
+ delete [] table;
+ }
+};
+
+
+#endif
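Reviewer note on addMRM (ProgRouters.h): the byte layout of an MRM record is easier to check in isolation. The sketch below packs one record into a 12-byte array using the same offsets as addMRM: thread mask low in bytes 0 to 3, thread mask high in bytes 4 to 7, local key in bytes 8 and 9, and the tag plus mailbox coordinates in byte 11. `packMRM` is a hypothetical helper, and the 2-bit limit on each mailbox coordinate is an assumption inferred from the shift amounts, not something the diff states.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Pack an MRM record into 12 bytes, following the byte offsets in addMRM
std::array<uint8_t, 12> packMRM(uint32_t mboxX, uint32_t mboxY,
                                uint32_t threadsHigh, uint32_t threadsLow,
                                uint16_t localKey) {
  // Assumption: each mailbox coordinate fits in 2 bits
  assert(mboxX < 4 && mboxY < 4);
  std::array<uint8_t, 12> rec{};  // zero-initialised, so byte 10 stays 0
  // Thread masks are stored little-endian
  for (int i = 0; i < 4; i++) {
    rec[i]     = uint8_t(threadsLow  >> (8 * i));
    rec[4 + i] = uint8_t(threadsHigh >> (8 * i));
  }
  rec[8] = uint8_t(localKey);
  rec[9] = uint8_t(localKey >> 8);
  // Top byte: record tag 3 in bits 7..5, then mailbox Y and mailbox X
  rec[11] = uint8_t((3u << 5) | (mboxY << 3) | (mboxX << 1));
  return rec;
}
```

Returning a zeroed array by value keeps the unused byte 10 deterministic, whereas addMRM writes in place and leaves that byte as whatever the table already held.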
diff --git a/include/POLite/Seq.h b/include/POLite/Seq.h
index b6cb61f1..23a7616c 100644
--- a/include/POLite/Seq.h
+++ b/include/POLite/Seq.h
@@ -45,12 +45,26 @@ template class Seq
elems = newElems;
}
+ // Extend size of sequence by N
+ void extendBy(int n)
+ {
+ numElems += n;
+ if (numElems > maxElems)
+ setCapacity(numElems*2);
+ }
+
// Extend size of sequence by one
void extend()
{
- numElems++;
- if (numElems > maxElems)
- setCapacity(maxElems*2);
+ extendBy(1);
+ }
+
+ // Ensure space for a further N elements
+ void ensureSpaceFor(int n)
+ {
+ int newNumElems = numElems + n;
+ if (newNumElems > maxElems)
+ setCapacity(newNumElems*2);
}
// Append
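Reviewer note on extendBy and ensureSpaceFor (Seq.h): both grow the buffer to twice the required element count, and they differ only in whether `numElems` changes. A minimal sketch of that relationship follows, using a hypothetical `IntSeq` fixed to `int` elements. Its `setCapacity` is an assumption about the general shape of the existing method, which this diff does not show.

```cpp
#include <algorithm>
#include <cassert>

// Minimal growable sequence of ints, loosely modelled on Seq in the diff
struct IntSeq {
  int maxElems;   // allocated capacity
  int numElems;   // elements currently in use
  int* elems;

  explicit IntSeq(int capacity)
      : maxElems(capacity), numElems(0), elems(new int[capacity]()) {
    assert(capacity > 0);
  }
  ~IntSeq() { delete[] elems; }
  IntSeq(const IntSeq&) = delete;
  IntSeq& operator=(const IntSeq&) = delete;

  // Reallocate to hold n elements, preserving the elements in use
  void setCapacity(int n) {
    assert(n >= numElems);
    int* newElems = new int[n]();
    std::copy(elems, elems + numElems, newElems);
    delete[] elems;
    elems = newElems;
    maxElems = n;
  }

  // Reserve room for a further n elements without changing numElems
  void ensureSpaceFor(int n) {
    int needed = numElems + n;
    if (needed > maxElems) setCapacity(needed * 2);
  }

  // Grow the sequence by n elements (new slots are zero here)
  void extendBy(int n) {
    assert(n >= 0);
    ensureSpaceFor(n);
    numElems += n;
  }
};
```

Writing `extendBy` in terms of `ensureSpaceFor` means the copy in `setCapacity` always runs while `numElems` still counts only valid elements; this is a design choice of the sketch, since the diff increments `numElems` first.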
diff --git a/include/tinsel-interface.h b/include/tinsel-interface.h
index 93b5ec96..21dfdfcb 100644
--- a/include/tinsel-interface.h
+++ b/include/tinsel-interface.h
@@ -166,7 +166,7 @@ INLINE uint32_t tinselAccId(
uint32_t tileX, uint32_t tileY)
{
uint32_t addr;
- addr = 0x4;
+ addr = 0x8;
addr = (addr << TinselMeshYBits) | boardY;
addr = (addr << TinselMeshXBits) | boardX;
addr = (addr << TinselMailboxMeshYBits) | tileY;
@@ -175,4 +175,13 @@ INLINE uint32_t tinselAccId(
return addr;
}
+// Special address to signify use of routing key
+INLINE uint32_t tinselUseRoutingKey()
+{
+ // Special address to signify use of routing key
+ return 1 <<
+ (TinselMailboxMeshYBits + TinselMailboxMeshXBits +
+ TinselMeshXBits + TinselMeshYBits + 2);
+}
+
#endif
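The address arithmetic in `tinselUseRoutingKey()` can be checked with concrete numbers. The sketch below uses made-up field widths (the real `TinselMesh*Bits` and `TinselMailboxMesh*Bits` values come from the Tinsel configuration and are not shown in this diff). It assumes that the `+ 2` skips the two host bits, which matches the `NetAddr` field order `acc, isKey, host, board, mbox` introduced elsewhere in this change.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical field widths, for illustration only
constexpr uint32_t MeshXBits = 3;
constexpr uint32_t MeshYBits = 3;
constexpr uint32_t MailboxMeshXBits = 2;
constexpr uint32_t MailboxMeshYBits = 2;

// Width of the board and mailbox coordinates in a mailbox-level address
constexpr uint32_t BoardAndMailboxBits =
    MailboxMeshYBits + MailboxMeshXBits + MeshXBits + MeshYBits;

// Mirrors the expression in tinselUseRoutingKey()
constexpr uint32_t useRoutingKey() {
  return 1u << (BoardAndMailboxBits + 2);
}

// Test the routing-key flag of a mailbox-level address
constexpr bool isRoutingKeyAddr(uint32_t addr) {
  return ((addr >> (BoardAndMailboxBits + 2)) & 1u) != 0;
}

// Accelerator prefix (0x8 after this change) shifted above the coordinates
constexpr uint32_t accPrefix() { return 0x8u << BoardAndMailboxBits; }

// The key flag is prefix value 0x4, one bit below the accelerator flag
static_assert(useRoutingKey() == (0x4u << BoardAndMailboxBits),
              "key flag is prefix 0x4");
static_assert((useRoutingKey() & accPrefix()) == 0,
              "key flag and accelerator flag do not overlap");
```

Under these assumptions, the move of the accelerator prefix from `0x4` to `0x8` in `tinselAccId` makes room for the new flag.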
diff --git a/include/tinsel.h b/include/tinsel.h
index 9ebd8451..0b88844d 100644
--- a/include/tinsel.h
+++ b/include/tinsel.h
@@ -28,13 +28,15 @@
#define CSR_FLUSH "0xc01"
// Performance counter CSRs
-#define CSR_PERFCOUNT "0xc07"
-#define CSR_MISSCOUNT "0xc08"
-#define CSR_HITCOUNT "0xc09"
-#define CSR_WBCOUNT "0xc0a"
-#define CSR_CPUIDLECOUNT "0xc0b"
-#define CSR_CPUIDLECOUNTU "0xc0c"
-#define CSR_CYCLEU "0xc0d"
+#define CSR_PERFCOUNT "0xc07"
+#define CSR_MISSCOUNT "0xc08"
+#define CSR_HITCOUNT "0xc09"
+#define CSR_WBCOUNT "0xc0a"
+#define CSR_CPUIDLECOUNT "0xc0b"
+#define CSR_CPUIDLECOUNTU "0xc0c"
+#define CSR_CYCLEU "0xc0d"
+#define CSR_PROGROUTERSENT "0xc0e"
+#define CSR_PROGROUTERSENTINTER "0xc0f"
// Get globally unique thread id of caller
INLINE uint32_t tinselId()
@@ -127,6 +129,18 @@ INLINE volatile void* tinselSendSlot()
return mb_scratchpad_base + (threadId << TinselLogBytesPerMsg);
}
+// Get pointer to thread's extra message slot reserved for sending
+// (Assumes that HostLink has requested the extra slot)
+INLINE volatile void* tinselSendSlotExtra()
+{
+ volatile char* mb_scratchpad_base =
+ (volatile char*) (1 << TinselLogBytesPerMailbox);
+ uint32_t threadId = tinselId() &
+ ((1<> 6, high, low, addr);
}
+// Send message at addr using given routing key
+INLINE void tinselKeySend(int key, volatile void* addr)
+{
+ tinselMulticast(tinselUseRoutingKey(), 0, key, addr);
+}
+
// Receive message
INLINE volatile void* tinselRecv()
{
@@ -270,7 +290,7 @@ INLINE uint32_t tinselWritebackCount()
return n;
}
-// Performance counter:: get the CPU-idle count
+// Performance counter: get the CPU-idle count
INLINE uint32_t tinselCPUIdleCount()
{
uint32_t n;
@@ -294,6 +314,22 @@ INLINE uint32_t tinselCycleCountU()
return n;
}
+// Performance counter: number of messages emitted by ProgRouter
+INLINE uint32_t tinselProgRouterSent()
+{
+ uint32_t n;
+ asm volatile ("csrrw %0, " CSR_PROGROUTERSENT ", zero" : "=r"(n));
+ return n;
+}
+
+// Performance counter: number of inter-board messages emitted by ProgRouter
+INLINE uint32_t tinselProgRouterSentInterBoard()
+{
+ uint32_t n;
+ asm volatile ("csrrw %0, " CSR_PROGROUTERSENTINTER ", zero" : "=r"(n));
+ return n;
+}
+
// Get address of any specified host
// (This Y coordinate specifies the row of the FPGA mesh that the
// host is connected to, and the X coordinate specifies whether it is
diff --git a/rtl/Connections.bsv b/rtl/Connections.bsv
new file mode 100644
index 00000000..7f542acc
--- /dev/null
+++ b/rtl/Connections.bsv
@@ -0,0 +1,151 @@
+package Connections;
+
+import Vector :: *;
+import OffChipRAM :: *;
+import Interface :: *;
+import DRAM :: *;
+import Queue :: *;
+import DCache :: *;
+import DCacheTypes :: *;
+import Util :: *;
+import ProgRouter :: *;
+import Core :: *;
+
+// ============================================================================
+// DCache <-> Core connections
+// ============================================================================
+
+module connectCoresToDCache#(
+ Vector#(`CoresPerDCache, DCacheClient) clients,
+ DCache dcache) ();
+
+ // Connect requests
+ function getDCacheReqOut(client) = client.dcacheReqOut;
+ let dcacheReqs <- mkMergeTree(Fair,
+ mkUGShiftQueue1(QueueOptFmax),
+ map(getDCacheReqOut, clients));
+ connectUsing(mkUGQueue, dcacheReqs, dcache.reqIn);
+
+ // Connect responses
+ function Bit#(`LogCoresPerDCache) getDCacheRespKey(DCacheResp resp) =
+ truncateLSB(resp.id);
+ function getDCacheRespIn(client) = client.dcacheRespIn;
+ let dcacheResps <- mkResponseDistributor(
+ getDCacheRespKey,
+ mkUGShiftQueue1(QueueOptFmax),
+ map(getDCacheRespIn, clients));
+ connectDirect(dcache.respOut, dcacheResps);
+
+ // Connect performance-counter wires
+ rule connectPerfCountWires;
+ clients[0].incMissCount(dcache.incMissCount);
+ clients[0].incHitCount(dcache.incHitCount);
+ clients[0].incWritebackCount(dcache.incWritebackCount);
+ for (Integer i = 1; i < `CoresPerDCache; i=i+1) begin
+ clients[i].incMissCount(False);
+ clients[i].incHitCount(False);
+ clients[i].incWritebackCount(False);
+ end
+ endrule
+
+endmodule
+
+// ============================================================================
+// Off-chip RAM connections
+// ============================================================================
+
+module connectClientsToOffChipRAM#(
+ // Data caches
+ Vector#(`DCachesPerDRAM, DCache) caches,
+ // Reqs and resps from ProgRouter's fetchers
+ Vector#(`FetchersPerProgRouter, BOut#(DRAMReq)) routerReqs,
+ Vector#(`FetchersPerProgRouter, In#(DRAMResp)) routerResps,
+ // Off-chip memory
+ OffChipRAM ram) ();
+
+ // Count the number of outstanding fetcher requests
+ // Used to throttle the fetcher requests to avoid starving/blocking
+ // the cache requests
+ Integer throttleCount = 2 ** (`DRAMLogMaxInFlight - 1);
+ Count#(`DRAMLogMaxInFlight) fetcherCount <- mkCount(throttleCount);
+
+ // Merge cache requests
+ function getReqOut(cache) = cache.reqOut;
+ Out#(DRAMReq) cacheReqs <-
+ mkMergeTreeB(Fair,
+ mkUGShiftQueue1(QueueOptFmax),
+ map(getReqOut, caches));
+ Queue#(DRAMReq) cacheReqsQueue <- mkUGQueue;
+ connectToQueue(cacheReqs, cacheReqsQueue);
+ BOut#(DRAMReq) cacheReqsB = queueToBOut(cacheReqsQueue);
+
+ // Merge router requests
+ Out#(DRAMReq) fetcherReqs <-
+ mkMergeTreeB(Fair,
+ mkUGShiftQueue1(QueueOptFmax),
+ routerReqs);
+ Queue#(DRAMReq) fetcherReqsQueue <- mkUGQueue;
+ connectToQueue(fetcherReqs, fetcherReqsQueue);
+ BOut#(DRAMReq) fetcherReqsB = queueToBOut(fetcherReqsQueue);
+
+ // Update count on router request
+ BOut#(DRAMReq) fetcherReqsIncCountB =
+ interface BOut
+ method Action get =
+ action
+ fetcherReqsB.get;
+ fetcherCount.incBy(zeroExtend(fetcherReqsB.value.burst));
+ endaction;
+ method Bool valid = fetcherReqsB.valid &&
+ zeroExtend(fetcherReqsB.value.burst) <= fetcherCount.available;
+ method DRAMReq value = fetcherReqsB.value;
+ endinterface;
+
+ // Merge cache and router requests, and connect to off-chip RAM
+ let reqs <- mkMergeTwoB(Fair, cacheReqsB, fetcherReqsIncCountB);
+ connectUsing(mkUGQueue, reqs, ram.reqIn);
+
+ // Connect load responses
+ function DRAMClientId getRespKey(DRAMResp resp) = resp.id;
+ function getRespIn(cache) = cache.respIn;
+ let ramResps <- mkResponseDistributor(
+ getRespKey,
+ mkUGShiftQueue2(QueueOptFmax),
+ append(map(getRespIn, caches), routerResps));
+
+ // Update count on response
+ BOut#(DRAMResp) ramRespOutDecCount =
+ interface BOut
+ method Action get =
+ action
+ ram.respOut.get;
+ if (ram.respOut.value.id >= fromInteger(`DCachesPerDRAM))
+ fetcherCount.dec;
+ endaction;
+ method Bool valid = ram.respOut.valid;
+ method DRAMResp value = ram.respOut.value;
+ endinterface;
+
+ // Connect responses from off-chip RAM
+ connectDirect(ramRespOutDecCount, ramResps);
+
+endmodule
+
+// ============================================================================
+// ProgRouter performance counter connections
+// ============================================================================
+
+module connectProgRouterPerfCountersToCores#(
+ ProgRouterPerfCounters counters, Vector#(n, Core) cores) (Empty);
+ rule connect;
+ // Only core zero can access the ProgRouter perf counters
+ cores[0].progRouterPerfClient.incSent(counters.incSent);
+ cores[0].progRouterPerfClient.incSentInterBoard(counters.incSentInterBoard);
+ for (Integer i = 1; i < valueOf(n); i=i+1) begin
+ cores[i].progRouterPerfClient.incSent(?);
+ cores[i].progRouterPerfClient.incSentInterBoard(?);
+ end
+ endrule
+endmodule
+
+endpackage
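The throttling in `connectClientsToOffChipRAM` can be modelled in software to make the accounting explicit. The sketch below is a behavioural model, not a translation of the Bluespec: `FetcherThrottle` and its methods are hypothetical names, and it assumes that the `Count` module's `available` is the limit minus the current count. A fetcher request is admitted only if its whole burst fits, and the count drops by one for each response beat whose client id belongs to a fetcher (ids at or above the number of data caches).

```cpp
#include <cassert>

// Software model of the fetcher throttle (hypothetical class)
class FetcherThrottle {
  int limit_;     // 2^(logMaxInFlight-1): half of the DRAM's in-flight budget
  int inFlight_;  // Outstanding fetcher beats

 public:
  explicit FetcherThrottle(int logMaxInFlight)
      : limit_(1 << (logMaxInFlight - 1)), inFlight_(0) {}

  int available() const { return limit_ - inFlight_; }

  // Admit a fetcher request only if the whole burst fits in the budget
  bool tryIssue(int burst) {
    if (burst > available()) return false;
    inFlight_ += burst;
    return true;
  }

  // Called once per response beat; only fetcher clients free a slot
  void onResponseBeat(int clientId, int dcachesPerDRAM) {
    if (clientId >= dcachesPerDRAM) {
      assert(inFlight_ > 0);
      inFlight_--;
    }
  }
};
```

Because the limit is half of the in-flight capacity, cache requests always have room left in this model, which is the stated purpose of the throttle.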
diff --git a/rtl/Core.bsv b/rtl/Core.bsv
index 1d35d278..4c454c98 100644
--- a/rtl/Core.bsv
+++ b/rtl/Core.bsv
@@ -25,6 +25,7 @@ import FPUOps :: *;
import InstrMem :: *;
import DCacheTypes :: *;
import IdleDetector :: *;
+import ProgRouter :: *;
// ============================================================================
// Control/status registers (CSRs) supported
@@ -60,15 +61,17 @@ import IdleDetector :: *;
// Performance Counter CSRs (Optional)
// ============================================================================
-// Name | CSR | R/W | Function
-// --------------- | ------ | --- | --------
-// PerfCount | 0xc07 | W | Reset(0)/Start(1)/Stop(2) all counters
-// MissCount | 0xc08 | R | Cache miss count
-// HitCount | 0xc09 | R | Cache hit count
-// WritebackCount | 0xc0a | R | Cache writeback count
-// CPUIdleCount | 0xc0b | R | CPU idle-cycle count (lower 32 bits)
-// CPUIdleCountU | 0xc0c | R | CPU idle-cycle count (upper 8 bits)
-// CycleU | 0xc0d | R | Cycle counter (upper 8 bits)
+// Name | CSR | R/W | Function
+// ------------------- | ------ | --- | --------
+// PerfCount | 0xc07 | W | Reset(0)/Start(1)/Stop(2) all counters
+// MissCount | 0xc08 | R | Cache miss count
+// HitCount | 0xc09 | R | Cache hit count
+// WritebackCount | 0xc0a | R | Cache writeback count
+// CPUIdleCount | 0xc0b | R | CPU idle-cycle count (lower 32 bits)
+// CPUIdleCountU | 0xc0c | R | CPU idle-cycle count (upper 8 bits)
+// CycleU | 0xc0d | R | Cycle counter (upper 8 bits)
+// ProgRouterSent | 0xc0e | R | Msgs sent by ProgRouter
+// ProgRouterSentInter | 0xc0f | R | Inter-board msgs sent by ProgRouter
// ============================================================================
// Types
@@ -505,12 +508,13 @@ endfunction
// ============================================================================
interface Core;
- interface DCacheClient dcacheClient;
- interface MailboxClient mailboxClient;
- interface DebugLinkClient debugLinkClient;
- interface FPUClient fpuClient;
- interface InstrMemClient instrMemClient;
- interface IdleDetectorClient idleClient;
+ interface DCacheClient dcacheClient;
+ interface MailboxClient mailboxClient;
+ interface DebugLinkClient debugLinkClient;
+ interface FPUClient fpuClient;
+ interface InstrMemClient instrMemClient;
+ interface IdleDetectorClient idleClient;
+ interface ProgRouterPerfClient progRouterPerfClient;
// Each core can see its board id
(* always_ready, always_enabled *)
@@ -676,18 +680,27 @@ module mkCore#(CoreId myId) (Core);
Reg#(Bit#(32)) hitCount <- mkConfigReg(0);
Reg#(Bit#(32)) writebackCount <- mkConfigReg(0);
Reg#(Bit#(40)) cpuIdleCount <- mkConfigReg(0);
+ // Only core zero maintains the following two counters
+ Reg#(Bit#(32)) progRouterSent <- mkConfigReg(0);
+ Reg#(Bit#(32)) progRouterSentInterBoard <- mkConfigReg(0);
// Indexable vector of performance counters
- Vector#(6, Bit#(32)) perfCounters =
+ Vector#(8, Bit#(32)) perfCounters =
vector(missCount, hitCount, writebackCount, cpuIdleCount[31:0],
zeroExtend(cpuIdleCount[39:32]),
- zeroExtend(cycleCount[39:32]));
+ zeroExtend(cycleCount[39:32]),
+ myId == 0 ? progRouterSent : ?,
+ myId == 0 ? progRouterSentInterBoard : ?);
// Increment wires
Wire#(Bool) incMissCountWire <- mkDWire(False);
Wire#(Bool) incHitCountWire <- mkDWire(False);
Wire#(Bool) incWritebackCountWire <- mkDWire(False);
Wire#(Bool) incCPUIdleCountWire <- mkDWire(False);
+ Wire#(Bit#(LogFetchersPerProgRouter))
+ incProgRouterSent <- mkBypassWire;
+ Wire#(Bit#(LogFetchersPerProgRouter))
+ incProgRouterSentInterBoard <- mkBypassWire;
// Update performance counters
rule updatePerfCounters;
@@ -696,11 +709,20 @@ module mkCore#(CoreId myId) (Core);
hitCount <= 0;
writebackCount <= 0;
cpuIdleCount <= 0;
+ if (myId == 0) begin
+ progRouterSent <= 0;
+ progRouterSentInterBoard <= 0;
+ end
end else if (perfCountEnabled) begin
if (incMissCountWire) missCount <= missCount+1;
if (incHitCountWire) hitCount <= hitCount+1;
if (incWritebackCountWire) writebackCount <= writebackCount+1;
if (incCPUIdleCountWire) cpuIdleCount <= cpuIdleCount+1;
+ if (myId == 0) begin
+ progRouterSent <= progRouterSent + zeroExtend(incProgRouterSent);
+ progRouterSentInterBoard <= progRouterSentInterBoard +
+ zeroExtend(incProgRouterSentInterBoard);
+ end
end
endrule
`endif
@@ -1321,6 +1343,19 @@ module mkCore#(CoreId myId) (Core);
method Bool idleStage1Ack = mailbox.idleStage1Ack;
endinterface
+ interface ProgRouterPerfClient progRouterPerfClient;
+ method Action incSent(Bit#(LogFetchersPerProgRouter) amount);
+ `ifdef EnablePerfCount
+ incProgRouterSent <= amount;
+ `endif
+ endmethod
+ method Action incSentInterBoard(Bit#(LogFetchersPerProgRouter) amount);
+ `ifdef EnablePerfCount
+ incProgRouterSentInterBoard <= amount;
+ `endif
+ endmethod
+ endinterface
+
endmodule
endpackage
diff --git a/rtl/DCache.bsv b/rtl/DCache.bsv
index 3162aade..e972a858 100644
--- a/rtl/DCache.bsv
+++ b/rtl/DCache.bsv
@@ -437,9 +437,11 @@ module mkDCache#(DCacheId myId) (DCache);
// This rule either consumes a flush request or a memory response
let flush = flushQueue.dataOut;
let resp = respPort.value;
+ InflightDCacheReqInfo info = unpack(truncate(resp.info));
+ Bit#(`LogBeatsPerLine) beat = truncate(resp.beat);
lineWriteDataWire <= resp.data;
- lineWriteIndexWire <= beatIndex(resp.info.beat, resp.info.req.id,
- resp.info.req.addr, resp.info.way);
+ lineWriteIndexWire <= beatIndex(beat, info.req.id,
+ info.req.addr, info.way);
// Ready to consume flush queue?
if (flushQueue.canDeq && flushQueue.canPeek) begin
flush.req.cmd.isFlush = False;
@@ -453,14 +455,14 @@ module mkDCache#(DCacheId myId) (DCache);
// Remove item from fill queue and feed associated request (which
// will definitely hit if it starts again from the beginning of
// the pipeline) back to beginning of the pipeline
- if (allHigh(resp.info.beat))
+ if (allHigh(beat))
feedbackTrigger <= True;
// Write new line data to dataMem
// (The write parameters are set outside condition for better timing)
lineWriteReqWire <= True;
respPort.get;
// Set feedback request
- feedbackReq <= resp.info.req;
+ feedbackReq <= info.req;
end
endrule
@@ -492,11 +494,10 @@ module mkDCache#(DCacheId myId) (DCache);
InflightDCacheReqInfo info;
info.req = miss.req;
info.way = miss.evictWay;
- info.beat = ?;
// Create memory request
DRAMReq memReq;
memReq.isStore = !isLoad;
- memReq.id = myId;
+ memReq.id = zeroExtend(myId);
memReq.addr = {isLoad ? readLineAddr : writeLineAddr, reqBeat};
memReq.data = isLoad ? {?, pack(info)} : dataMem.dataOutA;
memReq.burst = isLoad ? `BeatsPerLine : 1;
@@ -589,66 +590,6 @@ interface DCacheClient;
method Action incWritebackCount(Bool inc);
endinterface
-// ============================================================================
-// Connections
-// ============================================================================
-
-module connectCoresToDCache#(
- Vector#(`CoresPerDCache, DCacheClient) clients,
- DCache dcache) ();
-
- // Connect requests
- function getDCacheReqOut(client) = client.dcacheReqOut;
- let dcacheReqs <- mkMergeTree(Fair,
- mkUGShiftQueue1(QueueOptFmax),
- map(getDCacheReqOut, clients));
- connectUsing(mkUGQueue, dcacheReqs, dcache.reqIn);
-
- // Connect responses
- function Bit#(`LogCoresPerDCache) getDCacheRespKey(DCacheResp resp) =
- truncateLSB(resp.id);
- function getDCacheRespIn(client) = client.dcacheRespIn;
- let dcacheResps <- mkResponseDistributor(
- getDCacheRespKey,
- mkUGShiftQueue1(QueueOptFmax),
- map(getDCacheRespIn, clients));
- connectDirect(dcache.respOut, dcacheResps);
-
- // Connect performance-counter wires
- rule connectPerfCountWires;
- clients[0].incMissCount(dcache.incMissCount);
- clients[0].incHitCount(dcache.incHitCount);
- clients[0].incWritebackCount(dcache.incWritebackCount);
- for (Integer i = 1; i < `CoresPerDCache; i=i+1) begin
- clients[i].incMissCount(False);
- clients[i].incHitCount(False);
- clients[i].incWritebackCount(False);
- end
- endrule
-
-endmodule
-
-module connectDCachesToOffChipRAM#(
- Vector#(`DCachesPerDRAM, DCache) caches, OffChipRAM ram) ();
-
- // Connect requests
- function getReqOut(cache) = cache.reqOut;
- let reqs <- mkMergeTreeB(Fair,
- mkUGShiftQueue1(QueueOptFmax),
- map(getReqOut, caches));
- connectUsing(mkUGQueue, reqs, ram.reqIn);
-
- // Connect load responses
- function DCacheId getRespKey(DRAMResp resp) = resp.id;
- function getRespIn(cache) = cache.respIn;
- let ramResps <- mkResponseDistributor(
- getRespKey,
- mkUGShiftQueue2(QueueOptFmax),
- map(getRespIn, caches));
- connectDirect(ram.respOut, ramResps);
-
-endmodule
-
// ============================================================================
// Dummy cache
// ============================================================================
diff --git a/rtl/DCacheTypes.bsv b/rtl/DCacheTypes.bsv
index fa6ba407..4ddd809f 100644
--- a/rtl/DCacheTypes.bsv
+++ b/rtl/DCacheTypes.bsv
@@ -43,7 +43,6 @@ typedef struct {
typedef struct {
DCacheReq req;
Way way;
- Bit#(`LogBeatsPerLine) beat;
} InflightDCacheReqInfo deriving (Bits);
endpackage
diff --git a/rtl/DE5BridgeTop.bsv b/rtl/DE5BridgeTop.bsv
index 5dce9e25..15e2ba8f 100644
--- a/rtl/DE5BridgeTop.bsv
+++ b/rtl/DE5BridgeTop.bsv
@@ -12,9 +12,10 @@
// 1. DA: Destination address (4 bytes)
// 2. NM: Number of messages that follow minus one (4 bytes)
// 3. FM: Number of flit payloads per message minus one (1 byte)
-// 4. Padding (7 bytes)
-// 5. (NM+1)*(FM+1) flit payloads ((NM+1)*(FM+1)*BytesPerFlit bytes)
-// 6. Goto step 1
+// 4. Padding (3 bytes)
+// 5. Routing key (optional, 4 bytes)
+// 6. (NM+1)*(FM+1) flit payloads ((NM+1)*(FM+1)*BytesPerFlit bytes)
+// 7. Goto step 1
//
// The format of the data stream in the FPGA->PC direction is simply
// raw flit payloads.
@@ -161,6 +162,7 @@ module de5BridgeTop (DE5BridgeTop);
Reg#(Bit#(32)) fromPCIeDA <- mkConfigRegU;
Reg#(Bit#(32)) fromPCIeNM <- mkConfigRegU;
Reg#(Bit#(8)) fromPCIeFM <- mkConfigRegU;
+ Reg#(Bit#(32)) fromPCIeKey <- mkConfigRegU;
Reg#(Bit#(1)) toLinkState <- mkConfigReg(0);
Reg#(Bit#(32)) messageCount <- mkConfigReg(0);
@@ -182,6 +184,7 @@ module de5BridgeTop (DE5BridgeTop);
fromPCIeDA <= data[31:0];
fromPCIeNM <= data[63:32];
fromPCIeFM <= data[95:88];
+ fromPCIeKey <= data[127:96];
toLinkState <= 1;
fromPCIe.get;
end
@@ -203,6 +206,10 @@ module de5BridgeTop (DE5BridgeTop);
Flit flit;
flit.dest.addr = unpack(truncate(fromPCIeDA[31:`LogThreadsPerMailbox]));
flit.dest.threads = pack(destThreads);
+ // If address says to use routing key, then use it
+ if (flit.dest.addr.isKey) begin
+ flit.dest.threads = zeroExtend(fromPCIeKey);
+ end
flit.payload = fromPCIe.value;
flit.notFinalFlit = True;
flit.isIdleToken = False;
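The PC-to-FPGA header handled here is one 128-bit word with fields at fixed bit positions: `data[31:0]` for the destination address, `data[63:32]` for the message count, `data[95:88]` for the flit count, and `data[127:96]` for the new routing key. A minimal sketch of packing and unpacking those fields follows, representing the word as four 32-bit lanes. The struct and function names are hypothetical, and how the lanes map onto bytes on the PCIe stream is not shown in this diff.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Fields of the bridge header (hypothetical struct)
struct BridgeHeader {
  uint32_t destAddr;         // DA
  uint32_t numMsgsMinusOne;  // NM
  uint8_t flitsMinusOne;     // FM
  uint32_t routingKey;       // Used only when the address has 'isKey' set
};

// Lane i holds bits (32*i+31):(32*i) of the 128-bit header word
using HeaderWord = std::array<uint32_t, 4>;

HeaderWord packHeader(const BridgeHeader& h) {
  HeaderWord w = {{0, 0, 0, 0}};
  w[0] = h.destAddr;                                     // data[31:0]
  w[1] = h.numMsgsMinusOne;                              // data[63:32]
  w[2] = static_cast<uint32_t>(h.flitsMinusOne) << 24;   // data[95:88]
  w[3] = h.routingKey;                                   // data[127:96]
  return w;
}

BridgeHeader unpackHeader(const HeaderWord& w) {
  BridgeHeader h;
  h.destAddr = w[0];
  h.numMsgsMinusOne = w[1];
  h.flitsMinusOne = static_cast<uint8_t>(w[2] >> 24);
  h.routingKey = w[3];
  return h;
}
```

The three padding bytes are bits 87:64 of the word, which this sketch leaves at zero.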
diff --git a/rtl/DE5Top.bsv b/rtl/DE5Top.bsv
index 2173526d..bb35bc19 100644
--- a/rtl/DE5Top.bsv
+++ b/rtl/DE5Top.bsv
@@ -22,6 +22,7 @@ import InstrMem :: *;
import NarrowSRAM :: *;
import OffChipRAM :: *;
import IdleDetector :: *;
+import Connections :: *;
// ============================================================================
// Interface
@@ -114,10 +115,6 @@ module de5Top (DE5Top);
for (Integer j = 0; j < `DCachesPerDRAM; j=j+1)
connectCoresToDCache(map(dcacheClient, cores[i][j]), dcaches[i][j]);
- // Connect data caches to DRAM
- for (Integer i = 0; i < `DRAMsPerBoard; i=i+1)
- connectDCachesToOffChipRAM(dcaches[i], rams[i]);
-
// Create FPUs
Vector#(`FPUsPerBoard, FPU) fpus;
for (Integer i = 0; i < `FPUsPerBoard; i=i+1)
@@ -143,10 +140,6 @@ module de5Top (DE5Top);
// Create idle-detector
IdleDetector idle <- mkIdleDetector;
- // Connect cores to idle-detector
- function idleClient(core) = core.idleClient;
- connectCoresToIdleDetector(map(idleClient, vecOfCores), idle);
-
// Create mailboxes
Vector#(`MailboxMeshYLen,
Vector#(`MailboxMeshXLen, Mailbox)) mailboxes =
@@ -155,6 +148,13 @@ module de5Top (DE5Top);
for (Integer x = 0; x < `MailboxMeshXLen; x=x+1)
mailboxes[y][x] <- mkMailboxAcc(debugLink.getBoardId(), x, y);
+ // Initialise mailbox send slots
+ rule initSendSlots;
+ for (Integer y = 0; y < `MailboxMeshYLen; y=y+1)
+ for (Integer x = 0; x < `MailboxMeshXLen; x=x+1)
+ mailboxes[y][x].initSendSlots(debugLink.useExtraSendSlot);
+ endrule
+
// Connect cores to mailboxes
for (Integer y = 0; y < `MailboxMeshYLen; y=y+1)
for (Integer x = 0; x < `MailboxMeshXLen; x=x+1) begin
@@ -167,13 +167,27 @@ module de5Top (DE5Top);
connectCoresToMailbox(map(mailboxClient, cs), mailboxes[y][x]);
end
- // Create mesh of mailboxes
+ // Create network-on-chip
function MailboxNet mailboxNet(Mailbox mbox) = mbox.net;
- ExtNetwork net <- mkMailboxMesh(
- debugLink.getBoardId(),
- debugLink.linkEnable,
- map(map(mailboxNet), mailboxes),
- idle);
+ NoC noc <- mkNoC(
+ debugLink.getBoardId(),
+ debugLink.linkEnable,
+ map(map(mailboxNet), mailboxes),
+ idle);
+
+ // Connect cores and ProgRouter fetchers to idle-detector
+ function idleClient(core) = core.idleClient;
+ connectClientsToIdleDetector(
+ map(idleClient, vecOfCores), noc.activities, idle);
+
+ // Connections to off-chip RAMs
+ for (Integer i = 0; i < `DRAMsPerBoard; i=i+1)
+ connectClientsToOffChipRAM(dcaches[i],
+ noc.dramReqs[i], noc.dramResps[i], rams[i]);
+
+ // Connect ProgRouter performance counters to cores
+ connectProgRouterPerfCountersToCores(noc.progRouterPerfCounters,
+ concat(concat(cores)));
// Set board ids
rule setBoardIds;
@@ -199,10 +213,10 @@ module de5Top (DE5Top);
interface dramIfcs = map(getDRAMExtIfc, rams);
interface sramIfcs = concat(map(getSRAMExtIfcs, rams));
interface jtagIfc = debugLink.jtagAvalon;
- interface northMac = net.north;
- interface southMac = net.south;
- interface eastMac = net.east;
- interface westMac = net.west;
+ interface northMac = noc.north;
+ interface southMac = noc.south;
+ interface eastMac = noc.east;
+ interface westMac = noc.west;
method Action setBoardId(Bit#(4) id);
localBoardId <= id;
endmethod
diff --git a/rtl/DRAM.bsv b/rtl/DRAM.bsv
index b9bab54e..406cfe89 100644
--- a/rtl/DRAM.bsv
+++ b/rtl/DRAM.bsv
@@ -5,8 +5,11 @@ package DRAM;
// Types
// ============================================================================
+// DRAM client id
+typedef Bit#(TLog#(TAdd#(`DCachesPerDRAM,`FetchersPerProgRouter))) DRAMClientId;
+
// DRAM request id
-typedef DCacheId DRAMReqId;
+typedef DRAMClientId DRAMReqId;
// DRAM request
typedef struct {
@@ -22,8 +25,13 @@ typedef struct {
typedef struct {
DRAMReqId id;
Bit#(`BeatWidth) data;
- InflightDCacheReqInfo info;
+ // Which beat is it?
Bool finalBeat;
+ Bit#(`BeatBurstWidth) beat;
+ // Data from original load request
+ // (Can be largely ignored and optimised away, but
+ // can also hold useful info about the original request)
+ Bit#(`BeatWidth) info;
} DRAMResp deriving (Bits);
// DRAM identifier
@@ -80,7 +88,6 @@ import Util :: *;
import Interface :: *;
import Queue :: *;
import Assert :: *;
-import DCacheTypes :: *;
// Types
// -----
@@ -151,8 +158,8 @@ module mkDRAM#(RAMId id) (DRAM);
DRAMResp resp;
resp.id = req.id;
resp.data = pack(elems);
- resp.info = unpack(truncate(req.data));
- resp.info.beat = truncate(burstCount);
+ resp.info = req.data;
+ resp.beat = burstCount;
resp.finalBeat = finalBeat;
resps.enq(resp);
decOutstanding.send;
@@ -219,7 +226,6 @@ import Interface :: *;
import Assert :: *;
import Util :: *;
import Assert :: *;
-import DCacheTypes :: *;
// Types
// -----
@@ -244,7 +250,7 @@ endinterface
typedef struct {
DRAMReqId id;
Bit#(`BeatBurstWidth) burst;
- InflightDCacheReqInfo info;
+ Bit#(`BeatWidth) info;
} DRAMInFlightReq deriving (Bits);
// Implementation
@@ -309,7 +315,7 @@ module mkDRAM#(t id) (DRAM);
DRAMInFlightReq inflightReq;
inflightReq.id = req.id;
inflightReq.burst = req.burst;
- inflightReq.info = unpack(truncate(req.data));
+ inflightReq.info = req.data;
inFlight.enq(inflightReq);
inFlightCount.incBy(zeroExtend(req.burst));
end
@@ -336,7 +342,7 @@ module mkDRAM#(t id) (DRAM);
DRAMResp resp;
resp.id = inFlight.dataOut.id;
resp.info = inFlight.dataOut.info;
- resp.info.beat = truncate(burstCount-1);
+ resp.beat = truncate(burstCount-1);
resp.data = respBuffer.dataOut;
resp.finalBeat = burstCount == inFlight.dataOut.burst;
return resp;
diff --git a/rtl/DebugLink.bsv b/rtl/DebugLink.bsv
index 676696e7..a09236b5 100644
--- a/rtl/DebugLink.bsv
+++ b/rtl/DebugLink.bsv
@@ -13,16 +13,18 @@ package DebugLink;
// Commands sent from the host PC to DebugLink typically consist of a
// few bytes over the JTAG UART.
//
-// QueryIn: tag (1 byte), board offset (1 byte), edge disable (1 byte)
-// -------------------------------------------------------------------
+// QueryIn: tag (1 byte), board offset (1 byte), config (1 byte)
+// -------------------------------------------------------------
//
// Sets the X offset (offset[3:0]) and the Y offset (offset[7:4])
// of the board id (to support multiple boxes).
// Disable the specified inter-FPGA links:
-// * disable[0]: disable links on north side of box
-// * disable[1]: disable links on south side of box
-// * disable[2]: disable links on east side of box
-// * disable[3]: disable links on west side of box
+// * config[0]: disable links on north side of box
+// * config[1]: disable links on south side of box
+// * config[2]: disable links on east side of box
+// * config[3]: disable links on west side of box
+// Enable extra send slot:
+// * config[4]: reserve extra send slot
// Responds with a QueryOut (see below).
//
// SetDest: tag (1 byte), thread id (1 byte), core id (1 byte)
@@ -202,9 +204,13 @@ interface DebugLink;
// Get board id via DebugLink
(* always_ready, always_enabled *)
method BoardId getBoardId();
- // Optionally disable each inter-FPGA link via DebugLink
+ // Config option: disable each inter-FPGA link via DebugLink
+ // (Allows sandboxing of boxes or groups of boxes)
(* always_ready, always_enabled *)
method Vector#(4, Bool) linkEnable;
+ // Config option: reserve extra send slot per thread in mailbox
+ (* always_ready, always_enabled *)
+ method Option#(Bool) useExtraSendSlot;
endinterface
module mkDebugLink#(
@@ -224,6 +230,11 @@ module mkDebugLink#(
// (Initially, all disabled)
Reg#(Vector#(4, Bool)) linkEnableReg <- mkConfigReg(replicate(False));
+ // Config option: reserve extra send slot in mailbox?
+ // Use a chain of registers to aid propagation on chip
+ Vector#(3, Reg#(Option#(Bool))) useExtraSendSlotReg <-
+ replicateM(mkConfigReg(Option {valid : False, value: False}));
+
// Ports
InPort#(Bit#(8)) fromJtag <- mkInPort;
OutPort#(Bit#(8)) toJtag <- mkOutPort;
@@ -331,6 +342,9 @@ module mkDebugLink#(
// Disable west link?
if (x == 0 && edgeEn[3] == 1) linkEn[3] = False;
linkEnableReg <= linkEn;
+ // Reserve extra send slot?
+ useExtraSendSlotReg[2] <=
+ Option {valid: True, value: fromJtag.value[4] == 1};
respondFlag <= True;
respondCmd <= cmdQueryIn;
recvState <= 0;
@@ -404,6 +418,11 @@ module mkDebugLink#(
end
endrule
+ // Propagate extra send slot option through chain of registers (for timing)
+ rule chain;
+ for (Integer i = 0; i < 2; i=i+1)
+ useExtraSendSlotReg[i] <= useExtraSendSlotReg[i+1];
+ endrule
`ifndef SIMULATE
interface jtagAvalon = uart.jtagAvalon;
@@ -411,7 +430,7 @@ module mkDebugLink#(
method BoardId getBoardId() = boardId;
method Vector#(4, Bool) linkEnable = linkEnableReg;
-
+ method Option#(Bool) useExtraSendSlot = useExtraSendSlotReg[0];
endmodule
endpackage
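The `QueryIn` config byte documented in this file packs four link-disable flags and the new extra-send-slot flag into bits 0 to 4. A host-side encoder could look like the sketch below. The function name `makeQueryConfig` is hypothetical and is not taken from the HostLink or DebugLink sources.

```cpp
#include <cassert>
#include <cstdint>

// Build the QueryIn config byte (hypothetical helper):
//   bit 0: disable links on north side of box
//   bit 1: disable links on south side of box
//   bit 2: disable links on east side of box
//   bit 3: disable links on west side of box
//   bit 4: reserve extra send slot
uint8_t makeQueryConfig(bool disableNorth, bool disableSouth,
                        bool disableEast, bool disableWest,
                        bool extraSendSlot) {
  uint8_t config = 0;
  if (disableNorth) config |= 1u << 0;
  if (disableSouth) config |= 1u << 1;
  if (disableEast) config |= 1u << 2;
  if (disableWest) config |= 1u << 3;
  if (extraSendSlot) config |= 1u << 4;
  return config;
}
```

On the FPGA side, bit 4 is the value that `mkDebugLink` reads as `fromJtag.value[4]` and forwards through the register chain to `useExtraSendSlot`.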
diff --git a/rtl/GenInit.sh b/rtl/GenInit.sh
deleted file mode 100755
index ad2a6e0c..00000000
--- a/rtl/GenInit.sh
+++ /dev/null
@@ -1,19 +0,0 @@
-#!/bin/bash
-
-# Generate memory initialisation files
-
-# Load config parameters
-while read -r EXPORT; do
- eval $EXPORT
-done <<< `python ../config.py envs`
-
-MaxSlot=$(((2**LogMsgsPerMailbox) - 1))
-ThreadsPerMailbox=$((2**$LogThreadsPerMailbox))
-
-# Emit hex file
-for I in $(seq $ThreadsPerMailbox $MaxSlot); do
- printf "%x\n" $I
-done >> FreeSlots.hex
-
-# Emit MIF file
-../bin/hex-to-mif.py FreeSlots.hex $LogMsgsPerMailbox > ../de5/FreeSlots.mif
diff --git a/rtl/Globals.bsv b/rtl/Globals.bsv
index a2648a23..d240aa2c 100644
--- a/rtl/Globals.bsv
+++ b/rtl/Globals.bsv
@@ -20,10 +20,13 @@ typedef struct {
// destination board, it is routed either left or right depending
// the contents of the host bit. This is to support bridge boards
// connected at the east/west rims of the FPGA mesh.
+// The 'isKey' bit means that the destination is a routing key, held
+// in the bottom 32 bits of the 'NetAddr'.
// The 'acc' bit means message is routed to a custom accelerator rather
// than a mailbox.
typedef struct {
Bool acc;
+ Bool isKey;
Option#(Bit#(1)) host;
BoardId board;
MailboxId mbox;
@@ -42,6 +45,9 @@ typedef struct {
function MailboxId getMailboxId(NetAddr addr) = addr.addr.mbox;
+// Extract routing key from network address
+function Bit#(32) getRoutingKeyRaw(NetAddr addr) = truncate(pack(addr));
+
// ============================================================================
// Messages
// ============================================================================
@@ -63,7 +69,7 @@ typedef struct {
Bool notFinalFlit;
// Is this a special packet for idle-detection?
Bool isIdleToken;
-} Flit deriving (Bits);
+} Flit deriving (Bits, FShow);
// A padded flit is a multiple of 64 bits
// (i.e. the data width of the 10G MAC interface)
diff --git a/rtl/IdleDetector.bsv b/rtl/IdleDetector.bsv
index 0307f198..59e4b530 100644
--- a/rtl/IdleDetector.bsv
+++ b/rtl/IdleDetector.bsv
@@ -18,14 +18,16 @@
// The implementation below is based on Safra's termination detection
// algorithm (EWD998).
-import Mailbox :: *;
-import Globals :: *;
-import Interface :: *;
-import Queue :: *;
-import Vector :: *;
-import ConfigReg :: *;
-import Util :: *;
-import DReg :: *;
+import Mailbox :: *;
+import Globals :: *;
+import Interface :: *;
+import Queue :: *;
+import Vector :: *;
+import ConfigReg :: *;
+import Util :: *;
+import DReg :: *;
+import ProgRouter :: *;
+import Assert :: *;
// The total number of messages sent by all threads on an FPGA minus
// the total number of messages received by all threads on an FPGA.
@@ -221,6 +223,7 @@ module mkIdleDetector (IdleDetector);
NetAddr {
addr: MailboxNetAddr {
acc: False,
+ isKey: False,
host: option(True, 0),
board: BoardId { y: 0, x: 0 },
mbox: MailboxId { y: 0, x: 0 }
@@ -301,33 +304,6 @@ module mkIdleDetector (IdleDetector);
endmodule
-// Pipelined reduction tree
-module mkPipelinedReductionTree#(
- function a reduce(a x, a y),
- a init,
- List#(a) xs)
- (a) provisos(Bits#(a, _));
- Integer len = List::length(xs);
- if (len == 0)
- return error("mkSumList applied to empty list");
- else if (len == 1)
- return xs[0];
- else begin
- List#(a) ys = xs;
- List#(a) reduced = Nil;
- for (Integer i = 0; i < len; i=i+2) begin
- Reg#(a) r <- mkConfigReg(init);
- rule assignOut;
- r <= reduce(ys[0], ys[1]);
- endrule
- ys = List::drop(2, ys);
- reduced = Cons(readReg(r), reduced);
- end
- a res <- mkPipelinedReductionTree(reduce, init, reduced);
- return res;
- end
-endmodule
-
interface IdleDetectorClient;
method Bit#(1) incSent;
method Bit#(1) incReceived;
@@ -342,22 +318,33 @@ interface IdleDetectorClient;
method Bool idleStage1Ack;
endinterface
-// Connect cores to idle detector
-module connectCoresToIdleDetector#(
- Vector#(n, IdleDetectorClient) core, IdleDetector detector) ()
- provisos (Log#(n, log_n), Add#(log_n, 1, m), Add#(_a, m, 62));
+// Connect cores and fetchers to idle detector
+module connectClientsToIdleDetector#(
+ Vector#(`CoresPerBoard, IdleDetectorClient) core,
+ Vector#(`FetchersPerProgRouter, FetcherActivity) fetcher,
+ IdleDetector detector) ()
+ provisos (Mul#(2, `CoresPerBoard, n));
+
+ staticAssert(2**`LogCoresPerBoard1 > `CoresPerBoard+`FetchersPerProgRouter,
+ "connectClientsToIdleDetector: insufficient width");
// Sum "incSent" wires from each core
- Vector#(n, Bit#(m)) incSents = newVector;
- for (Integer i = 0; i < valueOf(n); i=i+1)
+ Vector#(n, Bit#(`LogCoresPerBoard1)) incSents = replicate(0);
+ for (Integer i = 0; i < `CoresPerBoard; i=i+1)
incSents[i] = zeroExtend(core[i].incSent);
- Bit#(m) incSent <- mkPipelinedReductionTree( \+ , 0, toList(incSents));
+ for (Integer i = 0; i < `FetchersPerProgRouter; i=i+1)
+ incSents[`CoresPerBoard+i] = zeroExtend(fetcher[i].incSent);
+ Bit#(`LogCoresPerBoard1) incSent <-
+ mkPipelinedReductionTree( \+ , 0, toList(incSents));
// Sum "incRecv" wires from each core
- Vector#(n, Bit#(m)) incRecvs = newVector;
- for (Integer i = 0; i < valueOf(n); i=i+1)
+ Vector#(n, Bit#(`LogCoresPerBoard1)) incRecvs = replicate(0);
+ for (Integer i = 0; i < `CoresPerBoard; i=i+1)
incRecvs[i] = zeroExtend(core[i].incReceived);
- Bit#(m) incRecv <- mkPipelinedReductionTree( \+ , 0, toList(incRecvs));
+ for (Integer i = 0; i < `FetchersPerProgRouter; i=i+1)
+ incRecvs[`CoresPerBoard+i] = zeroExtend(fetcher[i].incReceived);
+ Bit#(`LogCoresPerBoard1) incRecv <-
+ mkPipelinedReductionTree( \+ , 0, toList(incRecvs));
// Maintain the total count
Reg#(MsgCount) count <- mkConfigReg(0);
@@ -368,16 +355,18 @@ module connectCoresToIdleDetector#(
endrule
// OR the "active" wires from each core
- Vector#(n, Bool) actives = newVector;
- for (Integer i = 0; i < valueOf(n); i=i+1)
+ Vector#(n, Bool) actives = replicate(False);
+ for (Integer i = 0; i < `CoresPerBoard; i=i+1)
actives[i] = core[i].active;
+ for (Integer i = 0; i < `FetchersPerProgRouter; i=i+1)
+ actives[`CoresPerBoard+i] = fetcher[i].active;
Bool anyActive <- mkPipelinedReductionTree( \|| , True, toList(actives));
- // OR the "vote" wires from each core
- Vector#(n, Bool) votes = newVector;
- for (Integer i = 0; i < valueOf(n); i=i+1)
+ // AND the "vote" wires from each core
+ Vector#(n, Bool) votes = replicate(True);
+ for (Integer i = 0; i < `CoresPerBoard; i=i+1)
votes[i] = core[i].vote;
- Bool unanamous <- mkPipelinedReductionTree( \&& , False, toList(votes));
+ Bool voteDecision <- mkPipelinedReductionTree( \&& , False, toList(votes));
// Register the result
Reg#(Bool) active <- mkConfigReg(True);
@@ -385,24 +374,25 @@ module connectCoresToIdleDetector#(
rule updateActive;
active <= anyActive;
- vote <= unanamous;
+ vote <= voteDecision;
endrule
// Counter number of stage 1 acks
- Reg#(Bit#(m)) numAcks <- mkConfigReg(0);
+ Reg#(Bit#(`LogCoresPerBoard1)) numAcks <- mkConfigReg(0);
// Sum stage 1 ack wires from each core
- Vector#(n, Bit#(m)) incAcks = newVector;
- for (Integer i = 0; i < valueOf(n); i=i+1)
+ Vector#(`CoresPerBoard, Bit#(`LogCoresPerBoard1)) incAcks = newVector;
+ for (Integer i = 0; i < `CoresPerBoard; i=i+1)
incAcks[i] = zeroExtend(pack(core[i].idleStage1Ack));
- Bit#(m) incAck <- mkPipelinedReductionTree( \+ , 0, toList(incAcks));
+ Bit#(`LogCoresPerBoard1) incAck <-
+ mkPipelinedReductionTree( \+ , 0, toList(incAcks));
// Stage 1 output ack
Wire#(Bool) stage1AckWire <- mkDWire(False);
rule updateAcks;
- Bit#(m) total = numAcks + incAck;
- if (total == fromInteger(valueOf(n))) begin
+ Bit#(`LogCoresPerBoard1) total = numAcks + incAck;
+ if (total == `CoresPerBoard) begin
numAcks <= 0;
stage1AckWire <= True;
end else begin
@@ -418,7 +408,7 @@ module connectCoresToIdleDetector#(
detector.idle.voteIn(vote);
detector.idle.ackStage1(stage1AckWire);
- for (Integer i = 0; i < valueOf(n); i=i+1) begin
+ for (Integer i = 0; i < `CoresPerBoard; i=i+1) begin
core[i].idleDetectedStage1(detector.idle.detectedStage1);
core[i].idleVoteStage1(detector.idle.voteStage1);
core[i].idleDetectedStage2(detector.idle.detectedStage2);
@@ -538,6 +528,7 @@ module mkIdleDetectMaster (IdleDetectMaster);
NetAddr {
addr: MailboxNetAddr {
acc: False,
+ isKey: False,
host: option(False, 0),
board: BoardId { y: truncate(boardY), x: truncate(boardX) },
mbox: MailboxId { y: 0, x: 0 }
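Reviewer note on the `staticAssert` added to `connectClientsToIdleDetector`: the summed `incSent`/`incReceived` wires are `LogCoresPerBoard1` bits wide, so the assertion checks that the worst-case sum (every core and every fetcher asserting in the same cycle) still fits. The sketch below is a plain Python model of that check, not code from the repository. The parameter values in the usage note are made up for illustration; the real ones come from `config.py`, which this diff does not show.

```python
def counter_width_ok(log_cores_per_board_1, cores_per_board,
                     fetchers_per_prog_router):
    """Model of the staticAssert condition.

    A counter of width w holds the values 0 .. 2**w - 1, so it can hold
    the worst-case sum of one-bit increments only if 2**w is strictly
    greater than the number of clients.
    """
    worst_case_sum = cores_per_board + fetchers_per_prog_router
    return 2 ** log_cores_per_board_1 > worst_case_sum


def fetcher_slots_in_range(cores_per_board, fetchers_per_prog_router):
    """The client vectors have n = 2 * CoresPerBoard entries and fetcher i
    is stored at index CoresPerBoard + i, so every fetcher index is in
    range only if there are no more fetchers than cores."""
    n = 2 * cores_per_board
    return cores_per_board + fetchers_per_prog_router <= n
```

With hypothetical values of 64 cores and 8 fetchers, a 7-bit counter passes the check and a 6-bit counter does not.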
diff --git a/rtl/Interface.bsv b/rtl/Interface.bsv
index c3d16860..a7cd0e91 100644
--- a/rtl/Interface.bsv
+++ b/rtl/Interface.bsv
@@ -212,6 +212,14 @@ module onBOut#(function u f(t x), BOut#(t) out) (BOut#(u));
method u value = f(out.value);
endmodule
+// Convert BOut to Out
+function Out#(t) fromBOut(BOut#(t) out) =
+ interface Out
+ method Action tryGet = out.get;
+ method Bool valid = out.valid;
+ method t value = out.value;
+ endinterface;
+
// A null In port accepts and discards all inputs
module mkNullIn (In#(t));
method Action tryPut(u val); endmethod
@@ -248,6 +256,14 @@ function BOut#(t) enableBOut(Bool en, BOut#(t) out) =
method t value = out.value;
endinterface;
+// Convert queue to BOut interface
+function BOut#(t) queueToBOut(SizedQueue#(n, t) q) =
+ interface BOut
+ method Action get = q.deq;
+ method Bool valid = q.canDeq && q.canPeek;
+ method t value = q.dataOut;
+ endinterface;
+
// =============================================================================
// Merge unit
// =============================================================================
@@ -396,7 +412,7 @@ module mkMergeTreeB#(MergeMethod m, module#(SizedQueue#(d, t)) mkQ,
xs = List::cons(x, xs);
end
- let out <- mkMergeTreeList(m, mkQ, xs);
+ let out <- mkMergeTreeList(m, mkQ, List::reverse(xs));
return out;
endmodule
@@ -578,7 +594,7 @@ module mkDeserialiser (Deserialiser#(typeIn, typeOut))
endmodule
// =============================================================================
-// Expansion and reduction connectors
+// Reduction connectors
// =============================================================================
// Reduce a list of interfaces down to a given number of interfaces,
@@ -651,31 +667,4 @@ module reduceConnect#(
endmodule
-// Connect 'from' ports to 'to' ports,
-// where 'length(from)' may be less than 'length(to)'.
-// Works by wiring null to any unused 'to' ports.
-module expandConnect#(List#(Out#(t)) from, List#(In#(t)) to) ()
- provisos (Bits#(t, twidth));
-
- // Count inputs and outputs
- Integer numFrom = List::length(from);
- Integer numTo = List::length(to);
- Integer q = numTo/numFrom;
-
- for (Integer i = 0; i < numTo; i=i+1) begin
- if (q == 0) begin
- // Connect input
- connectUsing(mkUGShiftQueue1(QueueOptFmax), from[i], to[i]);
- end else if ((i%q) == 0) begin
- // Connect input
- connectUsing(mkUGShiftQueue1(QueueOptFmax), from[i/q], to[i]);
- end else begin
- // Connect terminator
- BOut#(t) nullOut <- mkNullBOut;
- connectDirect(nullOut, to[i]);
- end
- end
-
-endmodule
-
endpackage
diff --git a/rtl/Mailbox.bsv b/rtl/Mailbox.bsv
index 0398b0e2..e08b1b9a 100644
--- a/rtl/Mailbox.bsv
+++ b/rtl/Mailbox.bsv
@@ -260,6 +260,9 @@ interface Mailbox;
(* always_ready *) method Bit#(1) freeDone;
// Network-side interface
interface MailboxNet net;
+ // Initialise send slots (use extra send slot?)
+ (* always_ready, always_enabled *)
+ method Action initSendSlots(Option#(Bool) useExtraSendSlot);
endinterface
// Combined receive request/response interface
@@ -292,6 +295,45 @@ module mkMailbox (Mailbox);
Vector#(`CoresPerMailbox, InPort#(ReceiveReq)) rxReqPorts <-
replicateM(mkInPort);
+ // Initialise free slots
+ // =====================
+
+ // Set of currently-unused message slots
+ // By default, the first ThreadsPerMailbox slots are reserved for sending
+ // Optionally, the first 2*ThreadsPerMailbox slots are reserved for sending
+ SizedQueue#(`LogMsgsPerMailbox, Bit#(`LogMsgsPerMailbox))
+ freeSlots <- mkUGSizedQueuePrefetch;
+
+ // Reserve extra send slot?
+ Wire#(Option#(Bool)) useExtraSendSlot <- mkBypassWire;
+
+ // State of free slot initialiser
+ Reg#(Bit#(1)) freeSlotsInitState <- mkConfigReg(0);
+
+ // Have the free slots been initialised yet?
+ Reg#(Bool) freeSlotsInitDone <- mkConfigReg(False);
+
+ // Next slot to insert into free slot queue
+ Reg#(Bit#(`LogMsgsPerMailbox)) freeSlotsInitNext <- mkConfigRegU;
+
+ // Wait until config option available, which tells us how
+ // many slots to reserve for sending
+ rule initFreeSlots0 (freeSlotsInitState == 0);
+ if (useExtraSendSlot.valid) begin
+ freeSlotsInitNext <= useExtraSendSlot.value ?
+ fromInteger(2*`ThreadsPerMailbox) : `ThreadsPerMailbox;
+ freeSlotsInitState <= 1;
+ end
+ endrule
+
+ // Initialise free slots
+ rule initFreeSlots1 (!freeSlotsInitDone && freeSlotsInitState == 1);
+ freeSlots.enq(freeSlotsInitNext);
+ freeSlotsInitNext <= freeSlotsInitNext + 1;
+ if (freeSlotsInitNext == fromInteger(2**`LogMsgsPerMailbox - 1))
+ freeSlotsInitDone <= True;
+ endrule
+
// Message access unit
// ===================
@@ -336,15 +378,6 @@ module mkMailbox (Mailbox);
Reg#(RefCount) refCountReg <- mkConfigRegU;
Reg#(Bit#(`LogMsgsPerMailbox)) refCountSlot <- mkConfigRegU;
- // Set of currently-unused message slots
- // (The first ThreadsPerMailbox slots are reserved for sending)
- QueueOpts freeSlotsOpts;
- freeSlotsOpts.style = "AUTO";
- freeSlotsOpts.size = 2**`LogMsgsPerMailbox - `ThreadsPerMailbox;
- freeSlotsOpts.file = Valid("FreeSlots");
- SizedQueue#(`LogMsgsPerMailbox, Bit#(`LogMsgsPerMailbox))
- freeSlots <- mkUGSizedQueuePrefetchOpts(freeSlotsOpts);
-
// Multicast buffer
Vector#(`CoresPerMailbox,
SizedQueue#(`LogMulticastBufferSize, MulticastBufferEntry))
@@ -598,7 +631,7 @@ module mkMailbox (Mailbox);
// to a message slot is freed
Reg#(Bit#(1)) freeDoneReg <- mkDReg(0);
- rule free (freeReqPort.canGet);
+ rule free (freeReqPort.canGet && freeSlotsInitDone);
FreeReq req = freeReqPort.value;
// Process request in two cycles
let count = refCount.dataOutB;
@@ -667,6 +700,10 @@ module mkMailbox (Mailbox);
endinterface
endinterface
+ method Action initSendSlots(Option#(Bool) useExtra);
+ useExtraSendSlot <= useExtra;
+ endmethod
+
endmodule
// =============================================================================
@@ -1138,14 +1175,16 @@ import "BVI" ExternalTinselAccelerator =
`ifndef UseCustomAccelerator
-module mkMailboxAcc#(BoardId boardId, Integer tileX, Integer tileY) (Mailbox);
+module mkMailboxAcc#(BoardId boardId,
+ Integer tileX, Integer tileY) (Mailbox);
Mailbox mbox <- mkMailbox;
return mbox;
endmodule
`else
-module mkMailboxAcc#(BoardId boardId, Integer tileX, Integer tileY) (Mailbox);
+module mkMailboxAcc#(BoardId boardId,
+ Integer tileX, Integer tileY) (Mailbox);
// Instantiate standard mailbox
Mailbox mbox <- mkMailbox;
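Reviewer note on the new free-slot initialiser in `mkMailbox`: the `FreeSlots` init file is replaced by the two rules `initFreeSlots0` and `initFreeSlots1`, which fill the queue at run time once `useExtraSendSlot` becomes valid. The sketch below models in Python which slot numbers end up in the free-slot queue. It is an illustration of the rules as they appear in this hunk, not repository code, and the parameter values in the usage note are hypothetical.

```python
def initial_free_slots(log_msgs_per_mailbox, threads_per_mailbox,
                       use_extra_send_slot):
    """Return the slot numbers enqueued by the initialiser, in order.

    The first ThreadsPerMailbox slots (or the first 2 * ThreadsPerMailbox
    when the extra send slot is enabled) are reserved for sending and are
    never enqueued. Enqueuing starts at the first unreserved slot and stops
    after slot 2**LogMsgsPerMailbox - 1 has been enqueued.
    """
    if use_extra_send_slot:
        first_free = 2 * threads_per_mailbox
    else:
        first_free = threads_per_mailbox
    last_slot = 2 ** log_msgs_per_mailbox - 1
    return list(range(first_free, last_slot + 1))
```

For example, with a hypothetical 16-slot mailbox and 4 threads, slots 4 to 15 are free by default and slots 8 to 15 are free when the extra send slot is enabled. The model assumes the reserved region is smaller than the mailbox. The `free` rule is additionally guarded by `freeSlotsInitDone`, so no slot is returned to the queue before initialisation finishes.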
diff --git a/rtl/Makefile b/rtl/Makefile
index cc521bae..e938b015 100644
--- a/rtl/Makefile
+++ b/rtl/Makefile
@@ -11,7 +11,7 @@ DEFS = $(shell python ../config.py defs)
BSC = bsc
BSCFLAGS = -wait-for-license -suppress-warnings S0015 \
-suppress-warnings G0023 \
- -steps-warn-interval 500000 -check-assert \
+ -steps-warn-interval 750000 -check-assert \
+RTS -K32M -RTS
# Top level module
@@ -28,13 +28,13 @@ sim: $(TOPMOD) $(HOSTTOPMOD)
.PHONY: verilog
verilog: $(TOPMOD).v $(HOSTTOPMOD).v
-$(TOPMOD): *.bsv *.c InstrMem.hex FreeSlots.hex
+$(TOPMOD): *.bsv *.c InstrMem.hex
make -C $(TINSEL_ROOT)/apps/boot
make -C $(TINSEL_ROOT)/hostlink udsock
$(BSC) $(BSCFLAGS) $(DEFS) -D SIMULATE -sim -g $(TOPMOD) -u $(TOPFILE)
$(BSC) $(BSCFLAGS) -sim -o $(TOPMOD) -e $(TOPMOD) *.c
-$(TOPMOD).v: *.bsv $(QP)/InstrMem.mif $(QP)/FreeSlots.mif
+$(TOPMOD).v: *.bsv $(QP)/InstrMem.mif
make -C $(TINSEL_ROOT)/apps/boot
$(BSC) $(BSCFLAGS) -opt-undetermined-vals -unspecified-to X \
$(DEFS) -u -verilog -g $(TOPMOD) $(TOPFILE)
@@ -63,12 +63,6 @@ InstrMem.hex:
$(QP)/InstrMem.mif:
make -C $(TINSEL_ROOT)/apps/boot
-FreeSlots.hex: GenInit.sh
- ./GenInit.sh
-
-$(QP)/FreeSlots.mif: GenInit.sh
- ./GenInit.sh
-
.PHONY: test-mem
test-mem: testMem
@@ -83,7 +77,6 @@ clean:
rm -f de5Top.v mkCore.v mkDCache.v mkMailbox.v mkDebugLinkRouter.v
rm -f mkFPU.v mkMeshRouter.v
rm -f de5BridgeTop.v
- rm -f FreeSlots.hex ../de5/FreeSlots.mif
rm -rf test-mem-log
rm -rf test-mailbox-log
rm -rf test-array-of-queue-log
diff --git a/rtl/NarrowSRAM.bsv b/rtl/NarrowSRAM.bsv
index d0651392..4e51be85 100644
--- a/rtl/NarrowSRAM.bsv
+++ b/rtl/NarrowSRAM.bsv
@@ -1,22 +1,21 @@
// SPDX-License-Identifier: BSD-2-Clause
package NarrowSRAM;
-import DCacheTypes :: *;
-import Util :: *;
+import Util :: *;
// ============================================================================
// Types
// ============================================================================
// SRAM request id
-typedef Bit#(`LogDCachesPerDRAM) SRAMReqId;
+typedef Bit#(TLog#(TAdd#(`DCachesPerDRAM,`FetchersPerProgRouter))) SRAMReqId;
// SRAM load request
typedef struct {
SRAMReqId id;
Bit#(`SRAMAddrWidth) addr;
Bit#(`SRAMBurstWidth) burst;
- InflightDCacheReqInfo info;
+ Bit#(`BeatWidth) info;
} SRAMLoadReq deriving (Bits);
// SRAM store request
@@ -31,7 +30,7 @@ typedef struct {
typedef struct {
SRAMReqId id;
Bit#(`SRAMDataWidth) data;
- InflightDCacheReqInfo info;
+ Bit#(`BeatWidth) info;
} SRAMResp deriving (Bits);
// ============================================================================
@@ -140,7 +139,6 @@ module mkSRAM#(RAMId id) (SRAM);
resp.id = req.id;
resp.data = pack(elems);
resp.info = req.info;
- resp.info.beat = truncate(loadBurstCount);
resps.enq(resp);
inFlightCount.dec;
end
@@ -243,7 +241,7 @@ endinterface
typedef struct {
SRAMReqId id;
Bit#(`SRAMBurstWidth) burst;
- InflightDCacheReqInfo info;
+ Bit#(`BeatWidth) info;
} SRAMInFlightReq deriving (Bits);
// SRAM Implementation
diff --git a/rtl/Network.bsv b/rtl/Network.bsv
index 3efbb480..07d9adfd 100644
--- a/rtl/Network.bsv
+++ b/rtl/Network.bsv
@@ -23,6 +23,9 @@ import Socket :: *;
import Util :: *;
import IdleDetector :: *;
import FlitMerger :: *;
+import OffChipRAM :: *;
+import DRAM :: *;
+import ProgRouter :: *;
// =============================================================================
// Mesh Router
@@ -146,11 +149,9 @@ module mkMeshRouter#(MailboxId m) (MeshRouter);
// Routing function
function Route route(NetAddr a);
- if (a.addr.board.y < b.y) return Down;
- else if (a.addr.board.y > b.y) return Up;
- else if (a.addr.host.valid) return a.addr.host.value == 0 ? Left : Right;
- else if (a.addr.board.x < b.x) return Left;
- else if (a.addr.board.x > b.x) return Right;
+ if (a.addr.board != b) return Down;
+ else if (a.addr.isKey) return Down;
+ else if (a.addr.host.valid) return Down;
else if (a.addr.mbox.y < m.y) return Down;
else if (a.addr.mbox.y > m.y) return Up;
else if (a.addr.mbox.x < m.x) return Left;
@@ -271,27 +272,35 @@ module mkBoardLink#(Bool en, SocketId id) (BoardLink);
endmodule
// =============================================================================
-// Mailbox Mesh
+// Network-on-chip
// =============================================================================
-// Interface to external (off-board) network
-interface ExtNetwork;
-`ifndef SIMULATE
- // Avalon interfaces to 10G MACs
+// NoC interface
+interface NoC;
+ `ifndef SIMULATE
+ // Avalon interfaces to 10G MACs (inter-FPGA links)
interface Vector#(`NumNorthSouthLinks, AvalonMac) north;
interface Vector#(`NumNorthSouthLinks, AvalonMac) south;
interface Vector#(`NumEastWestLinks, AvalonMac) east;
interface Vector#(`NumEastWestLinks, AvalonMac) west;
-`endif
+ `endif
+ // Connections to off-chip memory (for the programmable router)
+ interface Vector#(`DRAMsPerBoard,
+ Vector#(`FetchersPerProgRouter, BOut#(DRAMReq))) dramReqs;
+ interface Vector#(`DRAMsPerBoard,
+ Vector#(`FetchersPerProgRouter, In#(DRAMResp))) dramResps;
+ // ProgRouter fetcher activities & performance counters
+ interface Vector#(`FetchersPerProgRouter, FetcherActivity) activities;
+ interface ProgRouterPerfCounters progRouterPerfCounters;
endinterface
-module mkMailboxMesh#(
+module mkNoC#(
BoardId boardId,
Vector#(4, Bool) linkEnable,
Vector#(`MailboxMeshYLen,
Vector#(`MailboxMeshXLen, MailboxNet)) mailboxes,
IdleDetector idle)
- (ExtNetwork);
+ (NoC);
// Create off-board links
Vector#(`NumNorthSouthLinks, BoardLink) northLink <-
@@ -303,6 +312,9 @@ module mkMailboxMesh#(
Vector#(`NumEastWestLinks, BoardLink) westLink <-
mapM(mkBoardLink(linkEnable[3]), westSocket);
+ // Dimension-ordered routers
+ // -------------------------
+
// Create mailbox routers
Vector#(`MailboxMeshYLen,
Vector#(`MailboxMeshXLen, MeshRouter)) routers =
@@ -362,79 +374,43 @@ module mkMailboxMesh#(
routers[y+1][x].bottomOut, routers[y][x].topIn);
end
- // Connect north links
- // -------------------
+ // Programmable board router
+ // -------------------------
- // Extract mesh top inputs and outputs
- List#(In#(Flit)) topInList = Nil;
- List#(Out#(Flit)) topOutList = Nil;
- for (Integer x = `MailboxMeshXLen-1; x >= 0; x=x-1) begin
- topOutList = Cons(routers[`MailboxMeshYLen-1][x].topOut, topOutList);
- topInList = Cons(routers[`MailboxMeshYLen-1][x].topIn, topInList);
- end
+ // Programmable router
+ ProgRouter boardRouter <- mkProgRouter(boardId);
- // Connect the outgoing links
- function In#(Flit) getFlitIn(BoardLink link) = link.flitIn;
- reduceConnect(mkFlitMerger,
- topOutList, List::map(getFlitIn, toList(northLink)));
-
- // Connect the incoming links
- function Out#(Flit) getFlitOut(BoardLink link) = link.flitOut;
- expandConnect(List::map(getFlitOut, toList(northLink)), topInList);
-
- // Connect south links
- // -------------------
-
- // Extract mesh bottom inputs and outputs
- List#(In#(Flit)) botInList = Nil;
- List#(Out#(Flit)) botOutList = Nil;
- for (Integer x = `MailboxMeshXLen-1; x >= 0; x=x-1) begin
- botOutList = Cons(routers[0][x].bottomOut, botOutList);
- botInList = Cons(routers[0][x].bottomIn, botInList);
- end
+ // Connect board router to north link
+ connectDirect(boardRouter.flitOut[0], northLink[0].flitIn);
+ connectUsing(mkUGShiftQueue1(QueueOptFmax),
+ northLink[0].flitOut, boardRouter.flitIn[0]);
- // Connect the outgoing links
- reduceConnect(mkFlitMerger, botOutList,
- List::map(getFlitIn, toList(southLink)));
-
- // Connect the incoming links
- expandConnect(List::map(getFlitOut, toList(southLink)), botInList);
-
- // Connect east links
- // ------------------
-
- // Extract mesh right inputs and outputs
- List#(In#(Flit)) rightInList = Nil;
- List#(Out#(Flit)) rightOutList = Nil;
- for (Integer y = `MailboxMeshYLen-1; y >= 0; y=y-1) begin
- rightOutList = Cons(routers[y][`MailboxMeshXLen-1].rightOut, rightOutList);
- rightInList = Cons(routers[y][`MailboxMeshXLen-1].rightIn, rightInList);
- end
+ // Connect board router to south link
+ connectDirect(boardRouter.flitOut[1], southLink[0].flitIn);
+ connectUsing(mkUGShiftQueue1(QueueOptFmax),
+ southLink[0].flitOut, boardRouter.flitIn[1]);
- // Connect the outgoing links
- reduceConnect(mkFlitMerger,
- rightOutList, List::map(getFlitIn, toList(eastLink)));
-
- // Connect the incoming links
- expandConnect(List::map(getFlitOut, toList(eastLink)), rightInList);
-
- // Connect west links
- // ------------------
-
- // Extract mesh right inputs and outputs
- List#(In#(Flit)) leftInList = Nil;
- List#(Out#(Flit)) leftOutList = Nil;
- for (Integer y = `MailboxMeshYLen-1; y >= 0; y=y-1) begin
- leftOutList = Cons(routers[y][0].leftOut, leftOutList);
- leftInList = Cons(routers[y][0].leftIn, leftInList);
- end
+ // Connect board router to east link
+ connectDirect(boardRouter.flitOut[2], eastLink[0].flitIn);
+ connectUsing(mkUGShiftQueue1(QueueOptFmax),
+ eastLink[0].flitOut, boardRouter.flitIn[2]);
- // Connect the outgoing links
- reduceConnect(mkFlitMerger,
- leftOutList, List::map(getFlitIn, toList(westLink)));
-
- // Connect the incoming links
- expandConnect(List::map(getFlitOut, toList(westLink)), leftInList);
+ // Connect board router to west link
+ connectDirect(boardRouter.flitOut[3], westLink[0].flitIn);
+ connectUsing(mkUGShiftQueue1(QueueOptFmax),
+ westLink[0].flitOut, boardRouter.flitIn[3]);
+
+ // Connect mailbox mesh south rim to board router
+ for (Integer i = 0; i < `MailboxMeshXLen; i=i+1)
+ connectUsing(mkUGShiftQueue1(QueueOptFmax),
+ routers[0][i].bottomOut, boardRouter.flitIn[4+i]);
+
+ // Connect board router to mailbox mesh south rim
+ function In#(Flit) getBottomIn(MeshRouter r) = r.bottomIn;
+ Vector#(`MailboxMeshXLen, In#(Flit)) southRimInPorts =
+ map(getBottomIn, routers[0]);
+ for (Integer i = 0; i < `MailboxMeshXLen; i=i+1)
+ connectDirect(boardRouter.flitOut[4+i], southRimInPorts[i]);
// Detect inter-board activity
// ---------------------------
@@ -465,13 +441,31 @@ module mkMailboxMesh#(
idle.idle.interBoardActivity(activityReg);
endrule
-`ifndef SIMULATE
+ // Interfaces
+ // ----------
+
+ function In#(t) getIn(InPort#(t) p) = p.in;
+
+ `ifndef SIMULATE
function AvalonMac getMac(BoardLink link) = link.avalonMac;
interface north = Vector::map(getMac, northLink);
interface south = Vector::map(getMac, southLink);
interface east = Vector::map(getMac, eastLink);
interface west = Vector::map(getMac, westLink);
-`endif
+ `endif
+
+ // Requests to off-chip memory
+ interface dramReqs = boardRouter.ramReqs;
+
+ // Responses from off-chip memory
+ interface dramResps = boardRouter.ramResps;
+
+ // Fetcher activities
+ interface activities = boardRouter.activities;
+
+ // Performance counters
+ interface ProgRouterPerfCounters progRouterPerfCounters =
+ boardRouter.perfCounters;
endmodule
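Reviewer note on the revised `route` function in `mkMeshRouter`: the mesh routers no longer pick an inter-board direction themselves. Any flit that is addressed to another board, carries a routing key, or is bound for the host is sent `Down`, towards the south rim where the programmable board router is attached. The Python sketch below models only the branches visible in this hunk. The `Addr` tuple and its field names are invented for the illustration, and cases that fall past the visible lines return `None` rather than guessing.

```python
from collections import namedtuple

# Hypothetical flattened view of the NetAddr fields that route() inspects
Addr = namedtuple("Addr", "board is_key host_valid mbox_x mbox_y")


def route(addr, this_board, this_mbox_x, this_mbox_y):
    """Model of the visible branches of the mesh router's route function."""
    # Anything not deliverable on this board's mesh goes to the south rim
    if addr.board != this_board:
        return "Down"
    if addr.is_key:
        return "Down"
    if addr.host_valid:
        return "Down"
    # Otherwise route within the mailbox mesh, Y dimension first
    if addr.mbox_y < this_mbox_y:
        return "Down"
    if addr.mbox_y > this_mbox_y:
        return "Up"
    if addr.mbox_x < this_mbox_x:
        return "Left"
    # Remaining branches are outside the lines shown in the hunk
    return None
```

The old board-coordinate comparisons have not disappeared: the same Y-then-X ordering and the host check reappear in the bypass path of `mkFetcher` in `ProgRouter.bsv`, which is where the inter-board decision is now made.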
diff --git a/rtl/ProgRouter.bsv b/rtl/ProgRouter.bsv
new file mode 100644
index 00000000..6e531261
--- /dev/null
+++ b/rtl/ProgRouter.bsv
@@ -0,0 +1,948 @@
+// SPDX-License-Identifier: BSD-2-Clause
+// Functions, data types, and modules for programmable routers
+package ProgRouter;
+
+import Globals :: *;
+import Util :: *;
+import DRAM :: *;
+import Vector :: *;
+import Queue :: *;
+import Interface :: *;
+import BlockRam :: *;
+import Assert :: *;
+import Util :: *;
+import DReg :: *;
+
+// =============================================================================
+// Routing keys and beats
+// =============================================================================
+
+// A routing record is either 48 bits or 96 bits in size (aligned on a
+// 48-bit or 96-bit boundary respectively). Multiple records are
+// packed into a 256-bit DRAM beat (aligned on a 256-bit boundary).
+// The most significant 16 bits of the beat contain a count of the
+// number of records in the beat (in the range 1 to 5 inclusive). The
+// remaining 240 bits contain records. The first record lies in the
+// least-significant bits of the beat. The size portion of the routing
+// key contains the number of contiguous DRAM beats holding all
+// records for the key.
+
+// 256-bit routing beat
+typedef struct {
+ // Number of records present
+ Bit#(16) size;
+ // The 48-bit record chunks
+ Vector#(5, Bit#(48)) chunks;
+} RoutingBeat deriving (Bits, FShow);
+
+// 32-bit routing key
+typedef struct {
+ // Which off-chip RAM?
+ Bit#(`LogDRAMsPerBoard) ram;
+ // Pointer to array of routing beats containing routing records
+ Bit#(`LogBeatsPerDRAM) ptr;
+ // Number of beats in the array
+ Bit#(`LogRoutingEntryLen) numBeats;
+} RoutingKey deriving (Bits, FShow);
+
+// Extract routing key from an address
+function RoutingKey getRoutingKey(NetAddr addr) =
+ unpack(getRoutingKeyRaw(addr));
+
+// =============================================================================
+// Types of routing record
+// =============================================================================
+
+typedef enum {
+ URM1 = 3'd0, // 48-bit Unicast Router-to-Mailbox
+ URM2 = 3'd1, // 96-bit Unicast Router-to-Mailbox
+ RR = 3'd2, // 48-bit Router-to-Router
+ MRM = 3'd3, // 96-bit Multicast Router-to-Mailbox
+ IND = 3'd4 // 48-bit Indirection
+} RoutingRecordTag deriving (Bits, Eq, FShow);
+
+typedef enum {
+ NORTH = 2'd0,
+ SOUTH = 2'd1,
+ EAST = 2'd2,
+ WEST = 2'd3
+} RoutingDir deriving (Bits, Eq);
+
+// 48-bit Unicast Router-to-Mailbox (URM1) record
+typedef struct {
+ // Record type
+ RoutingRecordTag tag;
+ // Mailbox destination
+ Bit#(4) mbox;
+ // Mailbox-local thread identifier
+ Bit#(6) thread;
+ // Unused
+ Bit#(3) unused;
+ // Local key. The first word of the message
+ // payload is overwritten with this.
+ Bit#(32) localKey;
+} URM1Record deriving (Bits, FShow);
+
+// 96-bit Unicast Router-to-Mailbox (URM2) record
+typedef struct {
+ // Record type
+ RoutingRecordTag tag;
+ // Mailbox destination
+ Bit#(4) mbox;
+ // Mailbox-local thread identifier
+ Bit#(6) thread;
+ // Currently unused
+ Bit#(19) unused;
+ // Local key. The first two words of the message
+ // payload are overwritten with this.
+ Bit#(64) localKey;
+} URM2Record deriving (Bits);
+
+// 48-bit Router-to-Router (RR) record
+typedef struct {
+ // Record type
+ RoutingRecordTag tag;
+ // Direction (N, S, E, or W)
+ RoutingDir dir;
+ // Currently unused
+ Bit#(11) unused;
+ // New 32-bit routing key that will replace the one in the
+ // current message for the next hop of the message's journey
+ Bit#(32) newKey;
+} RRRecord deriving (Bits);
+
+// 96-bit Multicast Router-to-Mailbox (MRM) record
+typedef struct {
+ // Record type
+ RoutingRecordTag tag;
+ // Mailbox destination
+ Bit#(4) mbox;
+ // Currently unused
+ Bit#(9) unused;
+ // Local key. The least-significant half-word
+ // of the message is replaced with this
+ Bit#(16) localKey;
+ // Mailbox-local destination mask
+ Bit#(64) destMask;
+} MRMRecord deriving (Bits);
+
+// 48-bit Indirection (IND) record
+// Note the restrictions on IND records:
+// 1. At most one IND record per key lookup
+// 2. A max-sized key lookup must contain an IND record
+typedef struct {
+ // Record type
+ RoutingRecordTag tag;
+ // Currently unused
+ Bit#(13) unused;
+ // New 32-bit routing key for new set of records on current router
+ Bit#(32) newKey;
+} INDRecord deriving (Bits);
+
+// =============================================================================
+// Internal types
+// =============================================================================
+
+// It is sometimes convenient (though redundant) to record a routing
+// decision for a flit internally within the programmable router
+typedef struct {
+ // Normal flit
+ Flit flit;
+ // Routing decision for flit
+ RoutingDecision decision;
+} RoutedFlit deriving (Bits, FShow);
+
+// Routing decision
+typedef enum {
+ RouteNorth,
+ RouteSouth,
+ RouteEast,
+ RouteWest,
+ RouteNoC
+} RoutingDecision deriving (Bits, Eq, FShow);
+
+// Elements of the indirection queue inside each fetcher
+typedef struct {
+ // The indirection
+ RoutingKey key;
+ // The location of the message in the flit buffer
+ FetcherFlitBufferMsgAddr addr;
+} IndQueueEntry deriving (Bits, FShow);
+
+// =============================================================================
+// Design
+// =============================================================================
+
+// In the following diagram N/S/E/W are the inter-FPGA links and
+// L0..L3 are links at one edge of the NoC. Depending on the NoC
+// dimensions, there may be more or fewer than four links on a single
+// NoC edge, but the diagram assumes four.
+
+//
+// N S E W L0 L1 L2 L3 Input flits
+// | | | | | | | |
+// +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
+// | F | | F | | F | | F | | F | | F | | F | | F | Fetchers
+// +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
+// | | | | | | | |
+// +-------------------------------------------+
+// | Crossbar | Routing
+// +-------------------------------------------+
+// | | | | | | | |
+// N S E W L0 L1 L2 L3 Output queues
+
+// The core functionality is implemented in the fetchers, which:
+// (1) extract routing keys from incoming flits;
+// (2) look up the keys in RAM;
+// (3) interpret the resulting routing records; and
+// (4) emit the interpreted flits.
+
+// The key property of these fetchers is that they act entirely
+// independently of each other: each one can make progress even if
+// another is blocked. This leads to duplicated logic resources, but
+// is necessary to avoid deadlock.
+
+// As the routers are fully programmable, it is possible for the
+// programmer to introduce deadlock using an ill-defined routing
+// scheme, e.g. where a flit arrives on (say) link N and requires a
+// flit to be sent back along the same direction N. However, the
+// hardware does guarantee deadlock-freedom if the routing scheme is
+// based on dimension-ordered routing.
+
+// After the fetchers have interpreted the flits, they are fed to a
+// fair crossbar which organises them by destination into output
+// queues.
+
+// =============================================================================
+// Fetcher
+// =============================================================================
+
+// Flit address in a fetcher's flit buffer
+typedef Bit#(`FetcherLogFlitBufferSize) FetcherFlitBufferAddr;
+
+// Message address in a fetcher's flit buffer
+typedef Bit#(`FetcherLogMsgsPerFlitBuffer) FetcherFlitBufferMsgAddr;
+
+// This structure contains information about an in-flight memory
+// request from a fetcher. When a fetcher issues a memory load
+// request, this info is packed into the unused data field of the
+// request. When the memory subsystem responds, it passes back the
+// same info in an extra field inside the memory response structure.
+// Maintaining info about an in-flight request inside the request
+// itself provides an easy way to handle out-of-order responses from
+// memory.
+typedef struct {
+ // Message address in the fetcher's flit buffer
+ FetcherFlitBufferMsgAddr msgAddr;
+ // How many beats in the burst?
+ Bit#(`BeatBurstWidth) burst;
+ // Is this the final burst of routing records for the current key?
+ Bool finalBurst;
+ // Are we processing a max-sized key (which must contain an IND record)?
+ Bool isMaxSizedKey;
+} InflightFetcherReqInfo deriving (Bits, FShow);
+
+// Routing beat, tagged with the beat number in the DRAM burst
+typedef struct {
+ // Beat
+ RoutingBeat beat;
+ // Beat number
+ Bit#(`BeatBurstWidth) beatNum;
+ // Inflight request info
+ InflightFetcherReqInfo info;
+} NumberedRoutingBeat deriving (Bits, FShow);
+
+// Fetcher interface
+interface Fetcher;
+ // Incoming and outgoing flits
+ interface In#(Flit) flitIn;
+ interface BOut#(RoutedFlit) flitOut;
+ // Off-chip RAM connections
+ interface Vector#(`DRAMsPerBoard, BOut#(DRAMReq)) ramReqs;
+ interface Vector#(`DRAMsPerBoard, In#(DRAMResp)) ramResps;
+ // Activity
+ interface FetcherActivity activity;
+endinterface
+
+// Fetcher activity for performance counters and termination detection
+(* always_ready *)
+interface FetcherActivity;
+ // Increment number of sent messages
+ method Bit#(1) incSent;
+ // Increment number of messages sent to another board
+ method Bit#(1) incSentInterBoard;
+ // Increment number of received messages
+ method Bit#(1) incReceived;
+ // Active (in the termination-detection sense)?
+ method Bool active;
+endinterface
+
+// Fetcher module
+module mkFetcher#(BoardId boardId, Integer fetcherId) (Fetcher);
+
+ // Flit input port
+ InPort#(Flit) flitInPort <- mkInPort;
+
+ // RAM request queues
+ Vector#(`DRAMsPerBoard, Queue1#(DRAMReq)) ramReqQueue <-
+ replicateM(mkUGShiftQueue(QueueOptFmax));
+
+ // Flit buffer
+ BlockRamOpts flitBufferOpts =
+ BlockRamOpts {
+ readDuringWrite: DontCare,
+ style: "AUTO",
+ registerDataOut: False,
+ initFile: Invalid
+ };
+ BlockRam#(FetcherFlitBufferAddr, Flit) flitBuffer <-
+ mkBlockRamOpts(flitBufferOpts);
+
+ // Beat buffer
+ SizedQueue#(`FetcherLogBeatBufferSize, NumberedRoutingBeat)
+ beatBuffer <- mkUGSizedQueue;
+
+ // Track length of beat buffer, so that we don't overfetch
+ Count#(TAdd#(`FetcherLogBeatBufferSize, 1)) beatBufferLen <-
+ mkCount(2 ** `FetcherLogBeatBufferSize);
+
+ // For flits whose destinations are *not* routing keys
+ Queue1#(RoutedFlit) flitBypassQueue <- mkUGShiftQueue(QueueOptFmax);
+
+ // For flits whose destinations are routing keys
+ Queue1#(RoutedFlit) flitProcessedQueue <- mkUGShiftQueue(QueueOptFmax);
+
+ // Final output queue for flits
+ Queue1#(RoutedFlit) flitOutQueue <- mkUGShiftQueue(QueueOptFmax);
+
+ // Indirection queue and size
+ SizedQueue#(`FetcherLogIndQueueSize, IndQueueEntry) indQueue <-
+ mkUGShiftQueue(QueueOptFmax);
+ Count#(TAdd#(`FetcherLogIndQueueSize, 1)) indQueueLen <-
+ mkCount(2 ** `FetcherLogIndQueueSize);
+
+ // Activity
+ Reg#(Bit#(1)) incSentReg <- mkDReg(0);
+ Reg#(Bit#(1)) incSentInterBoardReg <- mkDReg(0);
+ Reg#(Bit#(1)) incReceivedReg <- mkDReg(0);
+
+ // Stage 1: consume input message
+ // ------------------------------
+
+ // Consumer state
+ // State 0: pass through flits that don't contain routing keys
+ // State 1: buffer flits that do contain routing keys
+ // State 2: fetch routing beats
+ Reg#(Bit#(2)) consumeState <- mkReg(0);
+
+ // Count number of flits of message consumed so far
+ Reg#(Bit#(`LogMaxFlitsPerMsg)) consumeFlitCount <- mkReg(0);
+
+ // Flit slot allocator
+ Vector#(`FetcherMsgsPerFlitBuffer, SetReset) flitBufferUsedSlots <-
+ replicateM(mkSetReset(False));
+
+ // Chosen message slot
+ Reg#(FetcherFlitBufferMsgAddr) chosenReg <- mkRegU;
+
+ // Routing key of message consumed
+ Reg#(RoutingKey) consumeKey <- mkRegU;
+
+ // Maintain count of routing beats fetched so far
+ Reg#(Bit#(`LogRoutingEntryLen)) fetchBeatCount <- mkReg(0);
+
+ // Track when messages are bypassing fetcher, to keep the bypass atomic
+ Reg#(Bool) bypassInProgress <- mkReg(False);
+
+ // State 0: pass through flits that don't contain routing keys
+ rule consumeMessage0 (consumeState == 0);
+ Flit flit = flitInPort.value;
+ // Find unused message slot
+ Bool found = False;
+ FetcherFlitBufferMsgAddr chosen = ?;
+ for (Integer i = 0; i < `FetcherMsgsPerFlitBuffer; i=i+1)
+ if (! flitBufferUsedSlots[i].value) begin
+ found = True;
+ chosen = fromInteger(i);
+ end
+ // Initialise counters for subsequent states
+ consumeFlitCount <= 0;
+ fetchBeatCount <= 0;
+ // First, try to consume indirection
+ if (indQueue.canDeq && indQueue.canPeek && !bypassInProgress) begin
+ IndQueueEntry ind = indQueue.dataOut;
+ // Consume
+ indQueue.deq;
+ // Release space in indQueue, unless we have another max-sized key
+ if (!allHigh(ind.key.numBeats))
+ indQueueLen.dec;
+ // Jump straight to fetch state, as message already in flit buffer
+ chosenReg <= ind.addr;
+ consumeKey <= ind.key;
+ // Proceed only if key size is non-zero
+ if (ind.key.numBeats != 0)
+ consumeState <= 2;
+ end else begin
+ chosenReg <= chosen;
+ // Otherwise, try to consume flit
+ if (flitInPort.canGet) begin
+ if (flit.dest.addr.isKey) begin
+ if (found) begin
+ RoutingKey key = getRoutingKey(flit.dest);
+ // For a full-size key, we must reserve space in the indQueue
+ if (allHigh(key.numBeats)) begin
+ if (indQueueLen.notFull) begin
+ indQueueLen.inc;
+ consumeState <= 1;
+ end
+ end else
+ consumeState <= 1;
+ end
+ end else if (flitBypassQueue.notFull) begin
+ flitInPort.get;
+ bypassInProgress <= flit.notFinalFlit;
+ // Make routing decision
+ RoutingDecision decision = RouteNoC;
+ MailboxNetAddr addr = flit.dest.addr;
+ if (addr.board.y < boardId.y) decision = RouteSouth;
+ else if (addr.board.y > boardId.y) decision = RouteNorth;
+ else if (addr.host.valid)
+ decision = addr.host.value == 0 ? RouteWest : RouteEast;
+ else if (addr.board.x < boardId.x) decision = RouteWest;
+ else if (addr.board.x > boardId.x) decision = RouteEast;
+ // Insert into bypass queue
+ flitBypassQueue.enq(RoutedFlit { decision: decision, flit: flit});
+ end
+ end
+ end
+ endrule
+
+ // State 1: buffer flits that do contain routing keys
+ rule consumeMessage1 (consumeState == 1);
+ Flit flit = flitInPort.value;
+ if (flitInPort.canGet) begin
+ flitInPort.get;
+ RoutingKey key = getRoutingKey(flit.dest);
+ consumeKey <= key;
+ // Write to flit buffer
+ flitBuffer.write({chosenReg, consumeFlitCount}, flit);
+ consumeFlitCount <= consumeFlitCount + 1;
+ // On final flit, move to fetch state
+ if (! flit.notFinalFlit) begin
+ // Ignore keys with zero beats
+ if (key.numBeats == 0) begin
+ consumeState <= 0;
+ incReceivedReg <= 1;
+ end else begin
+ consumeState <= 2;
+ // Claim chosen slot
+ flitBufferUsedSlots[chosenReg].set;
+ end
+ end
+ end
+ endrule
+
+ // State 2: fetch routing beats
+ rule consumeMessage2 (consumeState == 2);
+ // Will this burst fetch the last of the remaining beats?
+ Bool finished = (consumeKey.numBeats-fetchBeatCount) <= `ProgRouterMaxBurst;
+ // Prepare inflight RAM request info
+ // (to handle out of order resps from the RAMs)
+ InflightFetcherReqInfo info;
+ info.msgAddr = chosenReg;
+ info.burst = truncate(
+ min(consumeKey.numBeats - fetchBeatCount, `ProgRouterMaxBurst));
+ info.finalBurst = finished;
+ info.isMaxSizedKey = allHigh(consumeKey.numBeats);
+ // Prepare RAM request
+ DRAMReq req;
+ req.isStore = False;
+ req.id = fromInteger(`DCachesPerDRAM + fetcherId);
+ req.addr = {1'b0, consumeKey.ptr + zeroExtend(fetchBeatCount)};
+ req.data = {?, pack(info)};
+ req.burst = info.burst;
+ // Don't overfetch (beat buffer has finite size)
+ if (ramReqQueue[consumeKey.ram].notFull &&
+ beatBufferLen.available >= zeroExtend(req.burst)) begin
+ ramReqQueue[consumeKey.ram].enq(req);
+ fetchBeatCount <= fetchBeatCount + zeroExtend(req.burst);
+ beatBufferLen.incBy(zeroExtend(req.burst));
+ if (finished) begin
+ consumeState <= 0;
+ incReceivedReg <= 1;
+ end
+ end
+ endrule
+
+ // Stage 2: interpret routing beats
+ // --------------------------------
+
+ // Merge responses from each RAM
+ staticAssert(`DRAMsPerBoard == 2,
+ "Fetcher: need to generalise number of RAMs used");
+ MergeUnit#(NumberedRoutingBeat) ramRespMerger <- mkMergeUnitFair;
+
+ // Convert a RAM response to a numbered routing beat
+ function NumberedRoutingBeat fromDRAMResp(DRAMResp resp) =
+ NumberedRoutingBeat {
+ beat: unpack(resp.data)
+ , beatNum: resp.beat
+ , info: unpack(truncate(resp.info))
+ };
+
+ // Create RAM response input interfaces for this module
+ In#(DRAMResp) respA <- onIn(fromDRAMResp, ramRespMerger.inA);
+ In#(DRAMResp) respB <- onIn(fromDRAMResp, ramRespMerger.inB);
+ Vector#(`DRAMsPerBoard, In#(DRAMResp)) ramRespsOut = vector(respA, respB);
+
+ // Connect the merger to the beat buffer
+ connectToQueue(ramRespMerger.out, beatBuffer);
+
+ // Count number of flits of message emitted so far
+ Reg#(Bit#(`LogMaxFlitsPerMsg)) emitFlitCount <- mkReg(0);
+
+ // Count number of records processed so far in current beat
+ Reg#(Bit#(3)) recordCount <- mkReg(0);
+
+ // (Shift) register holding current routing beat
+ Reg#(NumberedRoutingBeat) beatReg <- mkRegU;
+
+ // Interpreter state
+ // 0: register the routing beat and fetch first flit
+ // 1: interpret flits
+ Reg#(Bit#(1)) interpreterState <- mkReg(0);
+
+ // State 0: register the routing beat and fetch first flit
+ rule interpreter0 (interpreterState == 0);
+ let beat = beatBuffer.dataOut;
+ InflightFetcherReqInfo info = beat.info;
+ // Consume beat
+ if (beatBuffer.canDeq && beatBuffer.canPeek) begin
+ beatReg <= beat;
+ beatBuffer.deq;
+ beatBufferLen.dec;
+ interpreterState <= 1;
+ end
+ // Load first flit
+ flitBuffer.read({info.msgAddr, 0});
+ emitFlitCount <= 0;
+ recordCount <= 0;
+ endrule
+
+ // State 1: interpret flits
+ rule interpreter1 (interpreterState == 1);
+ // Extract details of registered routing beat
+ let beat = beatReg.beat;
+ let beatNum = beatReg.beatNum;
+ let info = beatReg.info;
+ // Extract tag from next record
+ RoutingRecordTag tag = unpack(truncateLSB(beat.chunks[4]));
+ // Is this the first flit of a message?
+ Bool firstFlit = emitFlitCount == 0;
+ // Modify flit by interpreting routing key
+ RoutingDecision decision = ?;
+ Flit flit = flitBuffer.dataOut;
+ // Unless otherwise stated (e.g. RR records),
+ // flits emitted will be destined for this board
+ flit.dest.addr.board = boardId;
+ case (tag)
+ // 48-bit Unicast Router-to-Mailbox
+ URM1: begin
+ URM1Record rec = unpack(beat.chunks[4]);
+ flit.dest.addr.isKey = False;
+ flit.dest.addr.mbox.x = unpack(truncate(rec.mbox[1:0]));
+ flit.dest.addr.mbox.y = unpack(truncate(rec.mbox[3:2]));
+ Vector#(`ThreadsPerMailbox, Bool) threadMask = newVector;
+ for (Integer j = 0; j < `ThreadsPerMailbox; j=j+1)
+ threadMask[j] = rec.thread == fromInteger(j);
+ flit.dest.threads = pack(threadMask);
+ // Replace first word of message with local key
+ if (firstFlit)
+ flit.payload = {truncateLSB(flit.payload), rec.localKey};
+ decision = RouteNoC;
+ end
+ // 96-bit Unicast Router-to-Mailbox
+ URM2: begin
+ URM2Record rec = unpack({beat.chunks[4], beat.chunks[3]});
+ flit.dest.addr.isKey = False;
+ flit.dest.addr.mbox.x = unpack(truncate(rec.mbox[1:0]));
+ flit.dest.addr.mbox.y = unpack(truncate(rec.mbox[3:2]));
+ Vector#(`ThreadsPerMailbox, Bool) threadMask = newVector;
+ for (Integer j = 0; j < `ThreadsPerMailbox; j=j+1)
+ threadMask[j] = rec.thread == fromInteger(j);
+ flit.dest.threads = pack(threadMask);
+ // Replace first two words of message with local key
+ if (firstFlit)
+ flit.payload = {truncateLSB(flit.payload), rec.localKey};
+ decision = RouteNoC;
+ end
+ // 48-bit Router-to-Router
+ RR: begin
+ RRRecord rec = unpack(beat.chunks[4]);
+ case (rec.dir)
+ NORTH: begin
+ decision = RouteNorth;
+ flit.dest.addr.board = BoardId {x: boardId.x, y: boardId.y+1};
+ end
+ SOUTH: begin
+ decision = RouteSouth;
+ flit.dest.addr.board = BoardId {x: boardId.x, y: boardId.y-1};
+ end
+ EAST: begin
+ decision = RouteEast;
+ flit.dest.addr.board = BoardId {x: boardId.x+1, y: boardId.y};
+ end
+ WEST: begin
+ decision = RouteWest;
+ flit.dest.addr.board = BoardId {x: boardId.x-1, y: boardId.y};
+ end
+ endcase
+ flit.dest.threads = {?, rec.newKey};
+ end
+ // 96-bit Multicast Router-to-Mailbox
+ MRM: begin
+ MRMRecord rec = unpack({beat.chunks[4], beat.chunks[3]});
+ flit.dest.addr.isKey = False;
+ flit.dest.addr.mbox.x = unpack(truncate(rec.mbox[1:0]));
+ flit.dest.addr.mbox.y = unpack(truncate(rec.mbox[3:2]));
+ flit.dest.threads = rec.destMask;
+ // Replace first half-word of message with local key
+ if (firstFlit)
+ flit.payload = {truncateLSB(flit.payload), rec.localKey};
+ decision = RouteNoC;
+ end
+ // 48-bit Indirection
+ IND: begin end
+ endcase
+ // Is output queue ready for new flit?
+ Bool emit = flitProcessedQueue.notFull;
+ let newFlitCount = emitFlitCount;
+ // Consume routing record
+ if (emit) begin
+ // Only enqueue if not an IND record
+ if (tag != IND)
+ flitProcessedQueue.enq(RoutedFlit { decision: decision, flit: flit });
+ // Shift beat to point to next record
+ RoutingBeat newBeat = beat;
+ Bool doubleChunk = unpack(pack(tag)[0]);
+ if (doubleChunk) begin
+ for (Integer i = 4; i > 1; i=i-1)
+ newBeat.chunks[i] = beat.chunks[i-2];
+ end else begin
+ for (Integer i = 4; i > 0; i=i-1)
+ newBeat.chunks[i] = beat.chunks[i-1];
+ end
+ // Is this the final flit in the message?
+ if (flit.notFinalFlit)
+ newFlitCount = emitFlitCount + 1;
+ else begin
+ // Move to next record
+ recordCount <= recordCount + 1;
+ beatReg <= NumberedRoutingBeat {
+ beat: newBeat, beatNum: beatNum, info: info };
+ // Handle IND record: insert into indirection queue
+ if (tag == IND) begin
+ myAssert(indQueue.notFull, "Restrictions on IND records violated");
+ INDRecord ind = unpack(beat.chunks[4]);
+ indQueue.enq(IndQueueEntry
+ { key: unpack(ind.newKey), addr: info.msgAddr });
+ end
+ // Is this the final record in the beat?
+ if ((recordCount+1) == truncate(beat.size)) begin
+ interpreterState <= 0;
+ // Have we finished with this message yet?
+ if (info.finalBurst && info.burst == (beatNum+1)) begin
+ // Reclaim message slot in flit buffer
+ // (Don't do this when we have an indirection to process)
+ if (! info.isMaxSizedKey)
+ flitBufferUsedSlots[info.msgAddr].clear;
+ end
+ end
+ incSentReg <= 1;
+ if (tag == RR) incSentInterBoardReg <= 1;
+ newFlitCount = 0;
+ end
+ end
+ // Issue flit load request
+ flitBuffer.read({info.msgAddr, newFlitCount});
+ emitFlitCount <= newFlitCount;
+ endrule
+
+ // Stage 3: merge output queues
+ // ----------------------------
+
+ // We want to merge messages, not flits
+ // Are we in the middle of consuming a message?
+ Reg#(Bool) mergeInProgress <- mkReg(False);
+ Reg#(Bool) prevFromBypass <- mkReg(False);
+
+ rule merge (flitOutQueue.notFull);
+ // Favour the bypass queue
+ Bool chooseBypass = mergeInProgress ? prevFromBypass :
+ flitBypassQueue.canDeq;
+ if (chooseBypass) begin
+ if (flitBypassQueue.canDeq) begin
+ flitBypassQueue.deq;
+ flitOutQueue.enq(flitBypassQueue.dataOut);
+ mergeInProgress <= flitBypassQueue.dataOut.flit.notFinalFlit;
+ prevFromBypass <= True;
+ end
+ end else if (flitProcessedQueue.canDeq) begin
+ flitProcessedQueue.deq;
+ flitOutQueue.enq(flitProcessedQueue.dataOut);
+ mergeInProgress <= flitProcessedQueue.dataOut.flit.notFinalFlit;
+ prevFromBypass <= False;
+ end
+ endrule
+
+ // Interfaces
+ // ----------
+
+ interface flitIn = flitInPort.in;
+ interface flitOut = queueToBOut(flitOutQueue);
+ interface ramReqs = map(queueToBOut, ramReqQueue);
+ interface ramResps = ramRespsOut;
+
+ interface FetcherActivity activity;
+ method Bit#(1) incSent = incSentReg;
+ method Bit#(1) incSentInterBoard = incSentInterBoardReg;
+ method Bit#(1) incReceived = incReceivedReg;
+ method Bool active =
+ beatBufferLen.value != 0 || consumeState != 0;
+ endinterface
+
+endmodule
+
+// =============================================================================
+// Crossbar
+// =============================================================================
+
+// Selector function for a mux in the programmable router crossbar
+typedef function Bool selector(RoutedFlit flit) SelectorFunc;
+
+module mkProgRouterCrossbar#(
+ Vector#(numOut, SelectorFunc) f,
+ Vector#(numIn, BOut#(RoutedFlit)) out)
+ (Vector#(numOut, BOut#(RoutedFlit)))
+ provisos(Add#(a__, 1, numIn));
+
+ // Input ports
+ Vector#(numIn, InPort#(RoutedFlit)) inPort <- replicateM(mkInPort);
+
+ // Connect up input ports
+ for (Integer i = 0; i < valueOf(numIn); i=i+1)
+ connectDirect(out[i], inPort[i].in);
+
+ // Consume wires, one for each input port
+ Vector#(numIn, PulseWire) consumeWire <- replicateM(mkPulseWireOR);
+
+ // Keep track of service history for flit sources (for fair selection)
+ Vector#(numOut, Reg#(Bit#(numIn))) hist <- replicateM(mkReg(0));
+
+ // Current choice of flit source
+ Vector#(numOut, Reg#(Bit#(numIn))) choiceReg <- replicateM(mkReg(0));
+
+ // Output queues
+ Vector#(numOut, Queue#(RoutedFlit)) outQueue <-
+ replicateM(mkUGShiftQueue(QueueOptFmax));
+
+ // Selector mux for each out queue
+ for (Integer i = 0; i < valueOf(numOut); i=i+1) begin
+
+ rule select;
+ // Vector of input flits and available flits
+ Vector#(numIn, RoutedFlit) flits = newVector;
+ Vector#(numIn, Bool) nextAvails = newVector;
+ Bool avail = False;
+ for (Integer j = 0; j < valueOf(numIn); j=j+1) begin
+ flits[j] = inPort[j].value;
+ nextAvails[j] = inPort[j].canGet && f[i](inPort[j].value)
+ && choiceReg[i][j] == 0;
+ avail = avail || (choiceReg[i][j] == 1 && inPort[j].canGet);
+ end
+ Bit#(numIn) nextAvail = pack(nextAvails);
+ // Choose a new source using fair scheduler
+ match {.newHist, .nextChoice} = sched(hist[i], nextAvail);
+ // Select a flit
+ RoutedFlit flit = oneHotSelect(unpack(choiceReg[i]), flits);
+ // Consume a flit
+ if (avail) begin
+ if (outQueue[i].notFull) begin
+ // Pass chosen flit to out queue
+ outQueue[i].enq(flit);
+ // On final flit of message
+ if (!flit.flit.notFinalFlit) begin
+ choiceReg[i] <= nextChoice;
+ hist[i] <= newHist;
+ end
+ end
+ end else if (choiceReg[i] == 0) begin
+ choiceReg[i] <= nextChoice;
+ hist[i] <= newHist;
+ end
+ // Consume from chosen source
+ for (Integer j = 0; j < valueOf(numIn); j=j+1)
+ if (inPort[j].canGet && choiceReg[i][j] == 1 && outQueue[i].notFull)
+ consumeWire[j].send;
+ endrule
+
+ end
+
+ // Consume from flit sources
+ rule consumeFlitSources;
+ for (Integer j = 0; j < valueOf(numIn); j=j+1)
+ if (consumeWire[j]) inPort[j].get;
+ endrule
+
+ return map(queueToBOut, outQueue);
+endmodule
+
+
+// =============================================================================
+// Splitter
+// =============================================================================
+
+// Split a single stream in two based on a predicate
+module splitFlits#(SelectorFunc f, BOut#(RoutedFlit) out)
+ (Tuple2#(BOut#(Flit), BOut#(Flit)));
+
+ // Consume wire
+ PulseWire consumeWire <- mkPulseWireOR;
+
+ // Output streams
+ BOut#(Flit) outYes =
+ interface BOut
+ method Action get = consumeWire.send;
+ method Bool valid = out.valid && f(out.value);
+ method Flit value = out.value.flit;
+ endinterface;
+ BOut#(Flit) outNo =
+ interface BOut
+ method Action get = consumeWire.send;
+ method Bool valid = out.valid && !f(out.value);
+ method Flit value = out.value.flit;
+ endinterface;
+
+ // Consume
+ rule consume;
+ if (consumeWire) out.get;
+ endrule
+
+ return tuple2(outYes, outNo);
+endmodule
+
+// =============================================================================
+// Programmable router
+// =============================================================================
+
+// Enough bits to store a count of the number of fetchers
+typedef TLog#(TAdd#(`FetchersPerProgRouter, 1)) LogFetchersPerProgRouter;
+
+// ProgRouter's performance counters
+(* always_ready, always_enabled *)
+interface ProgRouterPerfCounters;
+ method Bit#(LogFetchersPerProgRouter) incSent;
+ method Bit#(LogFetchersPerProgRouter) incSentInterBoard;
+endinterface
+
+interface ProgRouter;
+ // Incoming and outgoing flits
+ interface Vector#(`FetchersPerProgRouter, In#(Flit)) flitIn;
+ interface Vector#(`FetchersPerProgRouter, BOut#(Flit)) flitOut;
+
+ // Interface to off-chip memory
+ interface Vector#(`DRAMsPerBoard,
+ Vector#(`FetchersPerProgRouter, BOut#(DRAMReq))) ramReqs;
+ interface Vector#(`DRAMsPerBoard,
+ Vector#(`FetchersPerProgRouter, In#(DRAMResp))) ramResps;
+
+ // Activities & performance counters
+ interface Vector#(`FetchersPerProgRouter, FetcherActivity) activities;
+ interface ProgRouterPerfCounters perfCounters;
+endinterface
+
+module mkProgRouter#(BoardId boardId) (ProgRouter);
+
+ // Fetchers
+ Vector#(`FetchersPerProgRouter, Fetcher) fetchers = newVector;
+ for (Integer i = 0; i < `FetchersPerProgRouter; i=i+1)
+ fetchers[i] <- mkFetcher(boardId, i);
+
+ // Crossbar routing functions
+ function Bit#(`MailboxMeshXBits) xcoord(RoutedFlit rf) =
+ zeroExtend(rf.flit.dest.addr.mbox.x);
+ function Bool routeN(RoutedFlit rf) = rf.decision == RouteNorth;
+ function Bool routeS(RoutedFlit rf) = rf.decision == RouteSouth;
+ function Bool routeE(RoutedFlit rf) = rf.decision == RouteEast;
+ function Bool routeW(RoutedFlit rf) = rf.decision == RouteWest;
+ function Bool routeL(Bit#(`MailboxMeshXBits) x, RoutedFlit rf) =
+ rf.decision == RouteNoC && xcoord(rf) == x;
+ Vector#(`FetchersPerProgRouter, SelectorFunc) funcs;
+ funcs[0] = routeN; funcs[1] = routeS;
+ funcs[2] = routeE; funcs[3] = routeW;
+ for (Integer i = 0; i < `MailboxMeshXLen; i=i+1)
+ funcs[4+i] = routeL(fromInteger(i));
+
+ // Crossbar
+ function BOut#(RoutedFlit) getFetcherFlitOut(Fetcher f) = f.flitOut;
+ Vector#(`FetchersPerProgRouter, BOut#(RoutedFlit)) fetcherOuts =
+ map(getFetcherFlitOut, fetchers);
+ Vector#(`FetchersPerProgRouter, BOut#(RoutedFlit))
+ crossbarOuts <- mkProgRouterCrossbar(funcs, fetcherOuts);
+ Vector#(`FetchersPerProgRouter, BOut#(Flit)) crossbarOutFlits;
+ function Flit toFlit (RoutedFlit rf) = rf.flit;
+ for (Integer i = 0; i < `FetchersPerProgRouter; i=i+1)
+ crossbarOutFlits[i] <- onBOut(toFlit, crossbarOuts[i]);
+
+ // Flit input interfaces
+ Vector#(`FetchersPerProgRouter, In#(Flit)) flitInIfc = newVector;
+ for (Integer i = 0; i < `FetchersPerProgRouter; i=i+1)
+ flitInIfc[i] = fetchers[i].flitIn;
+
+ // RAM interfaces
+ Vector#(`DRAMsPerBoard, Vector#(`FetchersPerProgRouter, In#(DRAMResp)))
+ ramRespIfc = replicate(newVector);
+ Vector#(`DRAMsPerBoard, Vector#(`FetchersPerProgRouter, BOut#(DRAMReq)))
+ ramReqIfc = replicate(newVector);
+ for (Integer i = 0; i < `DRAMsPerBoard; i=i+1)
+ for (Integer j = 0; j < `FetchersPerProgRouter; j=j+1) begin
+ ramReqIfc[i][j] = fetchers[j].ramReqs[i];
+ ramRespIfc[i][j] = fetchers[j].ramResps[i];
+ end
+
+ // Performance counters
+ Vector#(TExp#(TLog#(`FetchersPerProgRouter)),
+ Bit#(LogFetchersPerProgRouter)) incSents = replicate(0);
+ Vector#(TExp#(TLog#(`FetchersPerProgRouter)),
+ Bit#(LogFetchersPerProgRouter)) incSentsInterBoard = replicate(0);
+ for (Integer i = 0; i < `FetchersPerProgRouter; i=i+1) begin
+ incSents[i] = zeroExtend(fetchers[i].activity.incSent);
+ incSentsInterBoard[i] =
+ zeroExtend(fetchers[i].activity.incSentInterBoard);
+ end
+ Bit#(LogFetchersPerProgRouter) numSent <-
+ mkPipelinedReductionTree( \+ , 0, toList(incSents));
+ Bit#(LogFetchersPerProgRouter) numSentInterBoard <-
+ mkPipelinedReductionTree( \+ , 0, toList(incSentsInterBoard));
+
+ function FetcherActivity getActivity(Fetcher f) = f.activity;
+ interface flitIn = flitInIfc;
+ interface flitOut = crossbarOutFlits;
+ interface ramReqs = ramReqIfc;
+ interface ramResps = ramRespIfc;
+ interface activities = map(getActivity, fetchers);
+ interface ProgRouterPerfCounters perfCounters;
+ method incSent = numSent;
+ method incSentInterBoard = numSentInterBoard;
+ endinterface
+
+endmodule
+
+// For core(s) to access ProgRouter's performance counters
+(* always_ready, always_enabled *)
+interface ProgRouterPerfClient;
+ method Action incSent(Bit#(LogFetchersPerProgRouter) amount);
+ method Action incSentInterBoard(Bit#(LogFetchersPerProgRouter) amount);
+endinterface
+
+endpackage
diff --git a/rtl/Util.bsv b/rtl/Util.bsv
index 7ac885c3..f45ece48 100644
--- a/rtl/Util.bsv
+++ b/rtl/Util.bsv
@@ -254,4 +254,51 @@ module mkBuffer#(Integer n, dataT init, dataT inp) (dataT)
return regs[n-1];
endmodule
+// Isolate first hot bit
+function Bit#(n) firstHot(Bit#(n) x) = x & (~x + 1);
+
+// Function for fair scheduling of n tasks
+function Tuple2#(Bit#(n), Bit#(n)) sched(Bit#(n) hist, Bit#(n) avail);
+ // First choice: an available bit that's not in the history
+ Bit#(n) first = firstHot(avail & ~hist);
+ // Second choice: any available bit
+ Bit#(n) second = firstHot(avail);
+
+ // Return new history, and chosen bit
+ if (first != 0) begin
+ // Return first choice, and update history
+ return tuple2(hist | first, first);
+ end else begin
+ // Return second choice, and reset history
+ return tuple2(second, second);
+ end
+endfunction
+
+// Pipelined reduction tree
+module mkPipelinedReductionTree#(
+ function a reduce(a x, a y),
+ a init,
+ List#(a) xs)
+ (a) provisos(Bits#(a, _));
+ Integer len = List::length(xs);
+ if (len == 0)
+ return error("mkPipelinedReductionTree applied to empty list");
+ else if (len == 1)
+ return xs[0];
+ else begin
+ List#(a) ys = xs;
+ List#(a) reduced = Nil;
+ for (Integer i = 0; i < len; i=i+2) begin
+ Reg#(a) r <- mkConfigReg(init);
+ rule assignOut;
+ r <= reduce(ys[0], ys[1]);
+ endrule
+ ys = List::drop(2, ys);
+ reduced = Cons(readReg(r), reduced);
+ end
+ a res <- mkPipelinedReductionTree(reduce, init, reduced);
+ return res;
+ end
+endmodule
+
endpackage
diff --git a/rtl/WideSRAM.bsv b/rtl/WideSRAM.bsv
index a3816a38..04af1dc7 100644
--- a/rtl/WideSRAM.bsv
+++ b/rtl/WideSRAM.bsv
@@ -108,6 +108,7 @@ module mkWideSRAM#(RAMId id) (WideSRAM);
respOut.data = pack(data);
respOut.info = respIn.info;
respOut.finalBeat = True;
+ respOut.beat = 0;
respQueue.enq(respOut);
respCount <= 0;
end