diff --git a/Makefile b/Makefile index 5b2608a3..133b1533 100644 --- a/Makefile +++ b/Makefile @@ -26,13 +26,19 @@ clean: make -C apps/temps clean make -C apps/POLite/heat-gals clean make -C apps/POLite/heat-sync clean + make -C apps/POLite/heat-cube-sync clean + make -C apps/POLite/heat-grid-sync clean make -C apps/POLite/asp-gals clean make -C apps/POLite/asp-sync clean - make -C apps/POLite/asp-pc clean make -C apps/POLite/pagerank-sync clean make -C apps/POLite/pagerank-gals clean + make -C apps/POLite/sssp-sync clean make -C apps/POLite/sssp-async clean - make -C apps/POLite/ping-test clean make -C apps/POLite/clocktree-async clean + make -C apps/POLite/izhikevich-gals clean + make -C apps/POLite/izhikevich-sync clean + make -C apps/POLite/pressure-sync clean + make -C apps/POLite/hashmin-sync clean + make -C apps/POLite/progrouters clean make -C bin clean make -C tests clean diff --git a/README.md b/README.md index 00f6a84b..a66aed56 100644 --- a/README.md +++ b/README.md @@ -1,13 +1,12 @@ -# Tinsel 0.7.1 +# Tinsel 0.8 Tinsel is a [RISC-V](https://riscv.org/)-based manythread message-passing architecture designed for FPGA clusters. It is being developed as part of the [POETS Project](https://poets-project.org/about) (Partial Ordered Event -Triggered Systems). This manual describes the architecture and -associated APIs. Further background can be found in our [FPL 2019 -paper](doc/fpl-2019-paper.pdf), which presents Tinsel 0.6. If you're -a POETS Partner, you can access a machine running Tinsel in the [POETS +Triggered Systems). Further background can be found in our [FPL 2019 +paper](doc/fpl-2019-paper.pdf). If you're a POETS Partner, you can +access a machine running Tinsel in the [POETS Cloud](https://github.com/POETSII/poets-cloud). 
## Release Log @@ -27,15 +26,19 @@ Released on 10 Sep 2018 and maintained in the * [v0.5](https://github.com/POETSII/tinsel/releases/tag/v0.5): Released on 8 Jan 2019 and maintained in the [tinsel-0.5.1 branch](https://github.com/POETSII/tinsel/tree/tinsel-0.5.1). -(Hardware idle-detection.) +(Hardware termination-detection.) * [v0.6](https://github.com/POETSII/tinsel/releases/tag/v0.6): Released on 11 Apr 2019 and maintained in the [tinsel-0.6.3 branch](https://github.com/POETSII/tinsel/tree/tinsel-0.6.3). (Multi-box cluster.) * [v0.7](https://github.com/POETSII/tinsel/releases/tag/v0.7): Released on 2 Dec 2019 and maintained in the +[tinsel-0.7.1 branch](https://github.com/POETSII/tinsel/tree/tinsel-0.7.1). +(Local hardware multicast.) +* [v0.8](https://github.com/POETSII/tinsel/releases/tag/v0.8): +Released on 24 Jun 2020 and maintained in the [master branch](https://github.com/POETSII/tinsel/). -(Localised hardware multicast.) +(Global hardware multicast.) ## Contents @@ -45,8 +48,9 @@ Released on 2 Dec 2019 and maintained in the * [4. Tinsel Cache](#4-tinsel-cache) * [5. Tinsel Mailbox](#5-tinsel-mailbox) * [6. Tinsel Network](#6-tinsel-network) -* [7. Tinsel HostLink](#7-tinsel-hostlink) -* [8. POLite API](#8-polite-api) +* [7. Tinsel Router](#7-tinsel-router) +* [8. Tinsel HostLink](#8-tinsel-hostlink) +* [9. POLite API](#9-polite-api) ## Appendices @@ -62,24 +66,19 @@ Released on 2 Dec 2019 and maintained in the ## 1. Overview On the [POETS Project](https://poets-project.org/about), we are -looking at ways to accelerate applications that can be expressed as -large numbers of small processes communicating by message-passing. -Our first attempt is based around a manythread RISC-V architecture -called Tinsel running on an FPGA cluster. Tinsel aims to support -irregular applications that have heavy memory and communication -demands, but fairly modest compute requrements. 
The main features are: +looking at ways to accelerate applications that are naturally +expressed as a large number of small processes communicating by +message-passing. Our first attempt is based around a manythread +RISC-V architecture called Tinsel, running on an FPGA cluster. The +main features are: * **Multithreading**. A critical aspect of the design is to tolerate latency as cleanly as possible. This includes the - latencies arising from: floating-point on Stratix V FPGAs - (tens of cycles); off-chip memories; deep pipelines - (keeping Fmax high); and sharing of resources between cores + latencies arising from floating-point on Stratix V FPGAs + (tens of cycles), off-chip memories, deep pipelines + (keeping Fmax high), and sharing of resources between cores (such as caches, mailboxes, and FPUs). - * **Caches**. To keep the programming model simple, we have opted - to use thread-partitioned data caches to optimise access to - off-chip memory rather than DMA. - * **Message-passing**. Although there is a requirement to support a large amount of memory, it is not necessary to provide the illusion of a single shared memory space: message-passing is intended @@ -87,17 +86,22 @@ demands, but fairly modest compute requrements. The main features are: instructions for sending and receiving messages between any two threads in the cluster. - * **Hardware termination detection**. A global termination event is + * **Hardware termination-detection**. A global termination event is triggered when every thread indicates termination and no messages are in-flight. Termination can be interpreted as termination of a time step, or termination of the application, supporting both synchronous and asynchronous event-driven systems. - * **Localised hardware multicast**. Threads can send a message to - multiple colocated destination threads simultaneously, greatly reducing + * **Local hardware multicast**. 
Threads can send a message to + multiple collocated destination threads simultaneously, greatly reducing the number of inter-thread messages in applications exhibiting good locality of communication. + * **Global hardware multicast**. Programmable routers + automatically propagate messages to any number of destination + threads distributed throughout the cluster, minimising inter-FPGA + bandwidth usage for distributed fanouts. + * **Host communication**. Tinsel threads communicate with x86 machines distributed throughout the FPGA cluster, for command and control, via PCI Express and USB. @@ -106,7 +110,7 @@ demands, but fairly modest compute requrements. The main features are: include custom accelerators written in SystemVerilog. This repository also includes a prototype high-level vertex-centric -programming API for Tinsel, called [POLite](#8-polite-api). +programming API for Tinsel, called [POLite](#9-polite-api). ## 2. High-Level Structure @@ -133,11 +137,13 @@ accelerators](doc/custom) in tiles. #### Tinsel FPGA -Each FPGA contains two *Tinsel Slices*, with each slice typically +Each FPGA contains two *Tinsel Slices*, with each slice by default comprising eight tiles connected to one 4GB DDR3 DIMM and two 8MB QDRII+ SRAMs. All tiles are connected together via routers to form -a 2D NoC. At the edges of the NoC are the inter-FPGA reliable -links. +a 2D NoC. The NoC is connected to the inter-FPGA links using a +*per-board programmable router*. Note that the per-board router also +has connections to off-chip memory: this is where the programmable +routing tables are stored. @@ -418,16 +424,22 @@ has reached the destination or none of it has. As one would expect, shorter messages consume less bandwidth than longer ones. The size of a flit is defined by `LogWordsPerFlit`. -At the heart of a mailbox is a memory-mapped *scratchpad* that -stores both incoming and outgoing messages. The capacity of the -scratchpad is defined by `LogMsgsPerMailbox`. 
Each thread connected -to the mailbox has one message slot reserved for sending messages. -The address of this slot is obtained using the following Tinsel API -call. +At the heart of a mailbox is a memory-mapped *scratchpad* that stores +both incoming and outgoing messages. The capacity of the scratchpad +is defined by `LogMsgsPerMailbox`. Each thread connected to the +mailbox has one or two message slots reserved for sending messages. +(By default, only a single send slot is reserved; the extra send slot +may be optionally reserved at power-up via a parameter to the +[HostLink](#8-tinsel-hostlink) constructor.) The addresses of these +slots are obtained using the following Tinsel API calls. ```c -// Get pointer to thread's message slot reserved for sending. +// Get pointer to thread's message slot reserved for sending volatile void* tinselSendSlot(); + +// Get pointer to thread's extra message slot reserved for sending +// (Assumes that HostLink has requested the extra slot) +volatile void* tinselSendSlotExtra(); ``` Once a thread has written a message to the scratchpad, it can trigger @@ -544,7 +556,7 @@ Tinsel also provides a function int tinselIdle(bool vote); ``` -which blocks until either +for global termination detection, which blocks until either 1. a message is available to receive, or @@ -639,7 +651,208 @@ communication. And since we are using the links point-to-point, almost all of the ethernet header fields can be used for our own purposes, resulting in very little overhead on the wire. -## 7. Tinsel HostLink +## 7. Tinsel Router + +Tinsel provides a programmable router on each FPGA board to support +*global* multicasting. Programmable routers automatically propagate +messages to any number of destination threads distributed throughout +the cluster, minimising inter-FPGA bandwidth usage for distributed +fanouts, and offloading work from the cores. Further background can +be found in [PIP 24](doc/PIP-0024-global-multicast.md). 
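
As a rough illustration of the record and beat layouts defined in this section, the following C sketch packs a 48-bit URM1 record and stores it as the first chunk of a 32-byte routing beat. The helper names (`packURM1`, `setChunk`, `setBeatSize`) are hypothetical and not part of the Tinsel API. The sketch also assumes that record fields are laid out MSB to LSB in the order they are listed, as stated for the key and beat structs.

```c
#include <stdint.h>

// Pack a URM1 record into the low 48 bits of a 64-bit word.
// Assumed field order (MSB to LSB):
//   tag(3), mbox(4), thread(6), unused(3), localKey(32)
static uint64_t packURM1(uint32_t mbox, uint32_t thread, uint32_t localKey)
{
  uint64_t rec = 0;                      // tag field: URM1 == 0
  rec = (rec << 4) | (mbox & 0xf);       // mailbox destination
  rec = (rec << 6) | (thread & 0x3f);    // mailbox-local thread id
  rec = (rec << 3);                      // unused bits, left as zero
  rec = (rec << 32) | localKey;          // 32-bit local key
  return rec;
}

// Store a 48-bit chunk in a 32-byte beat (idx 0 is the first chunk).
// Per the byte table, the first chunk occupies bytes 24..29, the
// second bytes 18..23, and so on, with the lower byte first.
static void setChunk(uint8_t beat[32], int idx, uint64_t chunk)
{
  int base = 24 - 6 * idx;
  for (int i = 0; i < 6; i++)
    beat[base + i] = (uint8_t) (chunk >> (8 * i));
}

// Store the record count in bytes 30 (lower) and 31 (upper).
static void setBeatSize(uint8_t beat[32], uint16_t size)
{
  beat[30] = (uint8_t) (size & 0xff);
  beat[31] = (uint8_t) (size >> 8);
}
```

A host-side mapper might fill beats this way before writing them to the DRAM region that a routing key points to; how the beats are actually written to DRAM is outside the scope of this sketch.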
+ +To support programmable routers, the destination component of a +message is generalised so that it can be (1) a thread id; or (2) a +*routing key*. A message, sent by a thread, containing a routing +key as a destination will go to a per-board router on the same +FPGA. The router will use the key as an index into a DRAM-based +routing table and automatically propagate the message towards all the +destinations associated with that key. + +A **routing key** is a 32-bit value consisting of a board-local *ram +id*, a *pointer*, and a *size*: + +```sv +// 32-bit routing key (MSB to LSB) +typedef struct { + // Which off-chip RAM on this board? + Bit#(`LogDRAMsPerBoard) ram; + // Pointer to array of routing beats containing routing records + Bit#(`LogBeatsPerDRAM) ptr; + // Number of beats in the array + Bit#(`LogRoutingEntryLen) numBeats; +} RoutingKey; +``` + +To send a message using a routing key as the destination, a new Tinsel +API call is provided: + +```c +// Send message at addr using given routing key +inline void tinselKeySend(uint32_t key, volatile void* addr); +``` + +When a message reaches the per-board router, the `ptr` field of the +routing key is used as an index into DRAM, where a sequence of 256-bit +**routing beats** are found. The `numBeats` field of the routing key +indicates how many contiguous routing beats there are. The value of +`numBeats` may be zero, in which case there are no destinations +associated with the key. + +A routing beat consists of a *size* and a sequence of five 48-bit +*routing chunks*: + +```sv +// 256-bit routing beat (aligned, MSB to LSB) +typedef struct { + // Number of routing records present in this beat + Bit#(16) size; + // Five 48-bit record chunks + Vector#(5, Bit#(48)) chunks; +} RoutingBeat; +``` + +The *size* must lie in the range 1 to 5 inclusive (0 is disallowed). +A **routing record** consists of one or two routing chunks, depending +on the **record type**. + +All byte orderings are little endian. 
For example, the order of bytes +in a routing beat is as follows. + +Byte | Contents +---- | -------- +31: | Upper byte of size (i.e. number of records in beat) +30: | Lower byte of size +29: | Upper byte of first chunk +... | ... +24: | Lower byte of first chunk +23: | Upper byte of second chunk +... | ... +18: | Lower byte of second chunk +17: | Upper byte of third chunk +... | ... +12: | Lower byte of third chunk +11: | Upper byte of fourth chunk +... | ... + 6: | Lower byte of fourth chunk + 5: | Upper byte of fifth chunk +... | ... + 0: | Lower byte of fifth chunk + +Clearly, both routing keys and routing beats have a maximum size. +However, in principle there is no limit to the number of records +associated with a key, due to the possibility of *indirection records* +(see below). + +There are five types of routing record, defined below. + +**48-bit Unicast Router-to-Mailbox (URM1):** + +```sv +typedef struct { + // Record type (URM1 == 0) + Bit#(3) tag; + // Mailbox destination + Bit#(4) mbox; + // Mailbox-local thread identifier + Bit#(6) thread; + // Unused + Bit#(3) unused; + // Local key. The first word of the message + // payload is overwritten with this. + Bit#(32) localKey; +} URM1Record; +``` + +The `localKey` can be used for anything, but might encode the +destination thread-local device identifier, or edge identifier, or +both. The `mbox` field is currently 4 bits (two Y bits followed by +two X bits), but there are spare bits available to increase the size +of this field in future if necessary. + +**96-bit Unicast Router-to-Mailbox (URM2):** + +```sv +typedef struct { + // Record type (URM2 == 1) + Bit#(3) tag; + // Mailbox destination + Bit#(4) mbox; + // Mailbox-local thread identifier + Bit#(6) thread; + // Currently unused + Bit#(19) unused; + // Local key. The first two words of the message + // payload is overwritten with this. 
+ Bit#(64) localKey; +} URM2Record; +``` + +This is the same as a URM1 record except the local key is 64-bits in +size. + +**48-bit Router-to-Router (RR):** + +```sv +typedef struct { + // Record type (RR == 2) + Bit#(3) tag; + // Direction (N,S,E,W == 0,1,2,3) + Bit#(2) dir; + // Currently unused + Bit#(11) unused; + // New 32-bit routing key that will replace the one in the + // current message for the next hop of the message's journey + Bit#(32) newKey; +} RRRecord; +``` + +The `newKey` field will replace the key in the current message for the +next hop of the message's journey. Introducing a new key at each hop +simplifies the mapping process (keeping it quick). + +**96-bit Multicast Router-to-Mailbox (MRM):** + +```sv +typedef struct { + // Record type (MRM == 3) + Bit#(3) tag; + // Mailbox destination + Bit#(4) mbox; + // Currently unused + Bit#(9) unused; + // Local key. The least-significant half-word + // of the message is replaced with this + Bit#(16) localKey; + // Mailbox-local destination mask + Bit#(64) destMask; +} MRMRecord; +``` + +**48-bit Indirection (IND):** + +```sv +// 48-bit Indirection (IND) record +// Note the restrictions on IND records: +// 1. At most one IND record per key lookup +// 2. A max-sized key lookup must contain an IND record +typedef struct { + // Record type (IND == 4) + Bit#(3) tag; + // Currently unused + Bit#(13) unused; + // New 32-bit routing key for new set of records on current router + Bit#(32) newKey; +} INDRecord; +``` + +Indirection records can be used to handle large fanouts, which exceed +the number of bits available in the size portion of the routing key. + +Finally, it is worth noting that when using programmable routers, +there is an added responsibility for the programmer to use a +deadlock-free routing scheme, such as dimension-ordered routing. + +## 8. Tinsel HostLink *HostLink* is the means by which Tinsel cores running on a mesh of FPGA boards communicate with a *host PC*. 
It comprises three main @@ -647,7 +860,7 @@ communication channels: * An FPGA *bridge board* that connects the host PC inside a POETS box (PCI Express) to the FPGA mesh (SFP+). Using this high-bandwidth -channel (10Gbps), the host PC can efficiently send messages to any +channel (2 x 10Gbps), the host PC can efficiently send messages to any Tinsel thread and vice-versa. * A set of *debug links* connecting the host PC inside a POETS box to @@ -662,34 +875,45 @@ each FPGA's *power management module* via separate USB UART cables. These connections can be used to power-on/power-off each FPGA and to monitor power consumption, temperature, and fan tachometer. -HostLink supports multiple POETS boxes, but requires that one of these -boxes is designated as the **master box**. Currently, all messages -are injected/extracted to/from the FPGA network via the master box's -bridge board. - -A Tinsel application typically consists of two programs: one which -runs on the RISC-V cores, linked against the [Tinsel +HostLink allows multiple POETS boxes to be used to run an application, +but requires that one of these boxes is designated as the **master +box**. A Tinsel application typically consists of two programs: one +which runs on the RISC-V cores, linked against the [Tinsel API](#f-tinsel-api), and the other which runs on the host PC of the master box, linked against the [HostLink API](#g-hostlink-api). The HostLink API is implemented as a C++ class called `HostLink`. The constructor for this class first powers up all the worker FPGAs (which -are by default powered down). On power-up the FPGAs are automatically -programmed using the Tinsel bit-file residing in flash memory, and are -ready to be used within a few seconds, as soon as the `HostLink` -constructor returns. +are by default powered down). 
On power-up, the FPGAs are +automatically programmed using the Tinsel bit-file residing in flash +memory, and are ready to be used within a few seconds, as soon as the +`HostLink` constructor returns. The `HostLink` constructor is overloaded: ```cpp HostLink::HostLink(); HostLink::HostLink(uint32_t numBoxesX, uint32_t numBoxesY); +HostLink::HostLink(HostLinkParams params); ``` If it is called without any arguments, then it assumes that a single -box is to be used. Alternatively, the user may request multiple -boxes by specifying the width and height of the box sub-mesh they -wish to use. (The box from which the application is started is -considered as the origin of this sub-mesh.) +box is to be used. Alternatively, the user may request multiple boxes +by specifying the width and height of the box sub-mesh they wish to +use. (The box from which the application is started, i.e. the master +box, is considered as the origin of this sub-mesh.) The most +general constructor takes a `HostLinkParams` structure as an argument, +which allows additional options to be specified. + +```cpp +// HostLink parameters +struct HostLinkParams { + // Number of boxes to use (default is 1x1) + uint32_t numBoxesX; + uint32_t numBoxesY; + // Enable use of tinselSendSlotExtra() on threads (default is false) + bool useExtraSendSlot; +}; +``` HostLink methods for sending and receiving messages on the host PC are as follows. 
@@ -711,6 +935,12 @@ bool HostLink::canRecv(); // Receive a message (blocking), given size of message in bytes // Any bytes beyond numBytes up to the next message boundary will be ignored void HostLink::recvMsg(void* msg, uint32_t numBytes); + +// Send a message using routing key (blocking) +bool HostLink::keySend(uint32_t key, uint32_t numFlits, void* msg); + +// Try to send using routing key (non-blocking, returns true on success) +bool HostLink::keyTrySend(uint32_t key, uint32_t numFlits, void* msg); ``` The `send` method allows a message consisting of multiple flits to be @@ -895,7 +1125,7 @@ not be called. When the application returns from `main()`, all but one thread on each core are killed and the remaining threads reenter the boot loader. -## 8. POLite API +## 9. POLite API POLite is a layer of abstraction that takes care of mapping arbitrary task graphs onto the Tinsel overlay, completely hiding architectural @@ -1069,16 +1299,24 @@ by each thread. After mapping, POLite writes the graph into cluster memory and triggers execution. By default, vertex states are written into the off-chip QDRII+ SRAMs, and edge lists are written in the DDR3 DRAMs. -This default behaviour can be modified by setting the boolean flags -`graph.mapVerticesToDRAM`, `graph.mapInEdgesToDRAM`, -`graph.mapOutEdgesToDRAM` accordingly (true means "map to DRAM" and -false means "map to SRAM"). Once the application is up and running, -the host and the graph vertices can continue to communicate: any -vertex can send messages to the host via the `HostPin` or the `finish` -handler, and the host can send messages to any vertex. +This default behaviour can be modified by adjusting the following +flags of the `PGraph` class. + + Flag | Default + ------------------------ | ------- + `mapVerticesToDRAM` | `false` + `mapInEdgeHeadersToDRAM` | `true` + `mapInEdgeRestToDRAM` | `true` + `mapOutEdgesToDRAM` | `true` + +A value of `true` means "map to DRAM", while `false` means "map to +(off-chip) SRAM". 
Once the application is up and running, the host +and the graph vertices can continue to communicate: any vertex can +send messages to the host via the `HostPin` or the `finish` handler, +and the host can send messages to any vertex. **Softswitch**. Central to POLite is an event loop running on each -Tinsel thread, which we call **the softswitch** as it effectively +Tinsel thread, which we call the softswitch as it effectively context-switches between vertices mapped to the same thread. The softswitch has four main responsibilities: (1) to maintain a queue of vertices wanting to send; (2) to implement multicast sends over a pin @@ -1087,14 +1325,34 @@ messages efficiently between vertices running on the same thread and on different threads; and (4) to invoke the vertex handlers when required, to meet the semantics of the POLite library. -**Limitations**. POLite provides several important features of the -vertex-centric paradigm, but there are some limitations. One of the -features of the Pregel framework is the ability for vertices to add -and remove vertices and edges at runtime -- but currently, POLite only -supports static graphs. And for large *non-localised* fan-outs, a -hierarchical hardware or software multicast feature may be desirable -(where messages get forked at intermediate stages along the way to the -destinations). +**POLite static parameters**. The following macros can be defined, +before the first instance of `#include `, to control some +aspects of POLite behaviour. + + Macro | Meaning + --------- | ------- + `POLITE_NUM_PINS` | Max number of pins per vertex (default 1) + `POLITE_DUMP_STATS` | Dump stats upon completion + `POLITE_COUNT_MSGS` | Include message counts in stats dump + `POLITE_EDGES_PER_HEADER` | Lower this for large edge states (default 6) + +**POLite dynamic parameters**. The following environment variables can +be set, to control some aspects of POLite behaviour. 
+ + Environment variable | Meaning + -------------------- | ------- + `HOSTLINK_BOXES_X` | Size of box mesh to use in X dimension + `HOSTLINK_BOXES_Y` | Size of box mesh to use in Y dimension + `POLITE_BOARDS_X` | Size of board mesh to use in X dimension + `POLITE_BOARDS_Y` | Size of board mesh to use in Y dimension + `POLITE_CHATTY` | Set to `1` to enable emission of mapper stats + `POLITE_PLACER` | Use `metis`, `random`, `bfs`, or `direct` placement + +**Limitations**. POLite is primarily intended as a prototype library +for hardware evaluation purposes. It occupies a single, simple point +in a wider, richer design space. In particular, it doesn't support +dynamic creation of vertices and edges, and it hasn't been optimised +to deal with highly non-uniform fanouts. ## A. DE5-Net Synthesis Report @@ -1111,9 +1369,10 @@ The default Tinsel configuration on a single DE5-Net board contains: * four QDRII+ SRAM controllers * four 10Gbps reliable links * one termination/idle detector + * one 8x8 programmable router * a JTAG UART -The clock frequency is 225MHz and the resource utilisation is 74% of +The clock frequency is 215MHz and the resource utilisation is 84% of the DE5-Net. ## B. Tinsel Parameters @@ -1143,9 +1402,9 @@ the DE5-Net. `MeshXLenWithinBox` | 3 | Boards in X dimension within box `MeshYLenWithinBox` | 2 | Boards in Y dimension within box `EnablePerfCount` | True | Enable performance counters - `ClockFreq` | 225 | Clock frequency in MHz + `ClockFreq` | 215 | Clock frequency in MHz -Further parameters can be found in [config.py](config.py). +A full list of parameters can be found in [config.py](config.py). ## C. Tinsel Memory Map @@ -1204,15 +1463,20 @@ separate memory regions (which they are not). 
Optional performance-counter CSRs (when `EnablePerfCount` is `True`): - Name | CSR | R/W | Function - ---------------- | ------ | --- | -------- - `PerfCount` | 0xc07 | W | Reset(0)/Start(1)/Stop(2) all counters - `MissCount` | 0xc08 | R | Cache miss count - `HitCount` | 0xc09 | R | Cache hit count - `WritebackCount` | 0xc0a | R | Cache writeback count - `CPUIdleCount` | 0xc0b | R | CPU idle-cycle count (lower 32 bits) - `CPUIdleCountU` | 0xc0c | R | CPU idle-cycle count (upper 8 bits) - `CycleU` | 0xc0d | R | Cycle counter (upper 8 bits) + Name | CSR | R/W | Function + ---------------- | ------ | --- | -------- + `PerfCount` | 0xc07 | W | Reset(0)/Start(1)/Stop(2) all counters + `MissCount` | 0xc08 | R | Cache miss count + `HitCount` | 0xc09 | R | Cache hit count + `WritebackCount` | 0xc0a | R | Cache writeback count + `CPUIdleCount` | 0xc0b | R | CPU idle-cycle count (lower 32 bits) + `CPUIdleCountU` | 0xc0c | R | CPU idle-cycle count (upper 8 bits) + `CycleU` | 0xc0d | R | Cycle counter (upper 8 bits) + `ProgRouterSent` | 0xc0e | R | Total msgs sent by ProgRouter + `ProgRouterSentInter` | 0xc0f | R | Inter-board msgs sent by ProgRouter + +Note that `ProgRouterSent` and `ProgRouterSentInter` are only valid +from thread zero on each board. Tinsel also supports the following custom instructions. 
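
Returning to the performance counters listed above: `CPUIdleCount` and `CPUIdleCountU` hold the lower 32 bits and upper 8 bits of a single 40-bit count. A minimal sketch of combining the two halves is shown below; the helper name `combine40` is hypothetical and not part of the Tinsel API, and the two arguments stand in for values already read from the CSRs.

```c
#include <stdint.h>

// Combine the lower 32 bits and upper 8 bits of a 40-bit counter.
// Only the low 8 bits of 'upper' are used.
static uint64_t combine40(uint32_t lower, uint32_t upper)
{
  return ((uint64_t) (upper & 0xff) << 32) | lower;
}
```

If the counters are still running between the two reads, the lower half may wrap in between, so it is probably safest to stop the counters (via `PerfCount`) before reading both halves.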
@@ -1258,6 +1522,13 @@ inline void tinselFlushLine(uint32_t lineNum, uint32_t way); // (A message of length n is comprised of n+1 flits) inline void tinselSetLen(uint32_t n); +// Get pointer to thread's message slot reserved for sending +volatile void* tinselSendSlot(); + +// Get pointer to thread's extra message slot reserved for sending +// (Assumes that HostLink has requested the extra slot) +volatile void* tinselSendSlotExtra(); + // Determine if calling thread can send a message inline uint32_t tinselCanSend(); @@ -1273,6 +1544,9 @@ inline void tinselMulticast( // (Address must be aligned on message boundary) inline void tinselSend(uint32_t dest, volatile void* addr); +// Send message at address using given routing key +inline void tinselKeySend(uint32_t key, volatile void* addr); + // Determine if calling thread can receive a message inline uint32_t tinselCanRecv(); @@ -1352,6 +1626,14 @@ inline uint32_t tinselCPUIdleCountU(); // Read cycle counter (upper 8 bits) inline uint32_t tinselCycleCountU(); +// Performance counter: number of messages emitted by ProgRouter +// (Only valid from thread zero on each board) +inline uint32_t tinselProgRouterSent(); + +// Performance counter: number of inter-board messages emitted by ProgRouter +// (Only valid from thread zero on each board) +inline uint32_t tinselProgRouterSentInterBoard(); + // Address construction inline uint32_t tinselToAddr( uint32_t boardX, uint32_t boardY, @@ -1410,6 +1692,12 @@ class HostLink { // Any bytes beyond numBytes up to the next message boundary will be ignored void recvMsg(void* msg, uint32_t numBytes); + // Send a message using routing key (blocking by default) + bool keySend(uint32_t key, uint32_t numFlits, void* msg, bool block = true); + + // Try to send using routing key (non-blocking, returns true on success) + bool keyTrySend(uint32_t key, uint32_t numFlits, void* msg); + // Bulk send and receive // --------------------- @@ -1476,14 +1764,24 @@ class HostLink { // Trigger 
application execution on all started threads on given core void goOne(uint32_t meshX, uint32_t meshY, uint32_t coreId); }; + +// HostLink parameters (used by the most general HostLink constructor) +struct HostLinkParams { + // Number of boxes to use (default is 1x1) + uint32_t numBoxesX; + uint32_t numBoxesY; + // Enable use of tinselSendSlotExtra() on threads (default is false) + bool useExtraSendSlot; +}; ``` ```cpp class DebugLink { public: - // Constructor + // Constructors DebugLink(uint32_t numBoxesX, uint32_t numBoxesY); + DebugLink(DebugLinkParams params); // On given board, set destination core and thread void setDest(uint32_t boardX, uint32_t boardY, diff --git a/apps/POLite/asp-gals/ASP.h b/apps/POLite/asp-gals/ASP.h index 42462622..f69dfa3d 100644 --- a/apps/POLite/asp-gals/ASP.h +++ b/apps/POLite/asp-gals/ASP.h @@ -9,8 +9,8 @@ #ifndef _ASP_H_ #define _ASP_H_ -//#define POLITE_DUMP_STATS -//#define POLITE_COUNT_MSGS +#define POLITE_DUMP_STATS +#define POLITE_COUNT_MSGS // Lightweight POETS frontend #include diff --git a/apps/POLite/asp-gals/Run.cpp b/apps/POLite/asp-gals/Run.cpp index d50821ce..4c00e1da 100644 --- a/apps/POLite/asp-gals/Run.cpp +++ b/apps/POLite/asp-gals/Run.cpp @@ -51,7 +51,8 @@ int main(int argc, char**argv) // Create random set of source nodes uint32_t numSources = NUM_SOURCES*32; uint32_t sources[numSources]; - randomSet(numSources, sources, graph.numDevices); + //randomSet(numSources, sources, graph.numDevices); + for (int i = 0; i < numSources; i++) sources[i] = i; // Initialise devices for (PDeviceId i = 0; i < graph.numDevices; i++) { @@ -102,7 +103,9 @@ int main(int argc, char**argv) // Display time timersub(&finish, &start, &diff); double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0; + #ifndef POLITE_DUMP_STATS printf("Time = %lf\n", duration); + #endif return 0; } diff --git a/apps/POLite/asp-pc/Makefile b/apps/POLite/asp-pc/Makefile index 0cf7448f..bf9439f3 100644 --- a/apps/POLite/asp-pc/Makefile +++ 
b/apps/POLite/asp-pc/Makefile @@ -1,10 +1,10 @@ # SPDX-License-Identifier: BSD-2-Clause -all: asp GenHypercube GenTree GenGeoGraph +all: asp GenHypercube GenTree INC=../../../../include asp: asp.cpp - g++ -fopenmp -D_DEFAULT_SOURCE -I$(INC) -O3 asp.cpp -o asp + g++ -I$(INC) -O3 asp.cpp -o asp GenHypercube: GenHypercube.hs ghc -O2 --make GenHypercube.hs @@ -12,8 +12,5 @@ GenHypercube: GenHypercube.hs GenTree: GenTree.hs ghc -O2 --make GenTree.hs -GenGeoGraph: GenGeoGraph.cpp - g++ -O2 -lstdc++ GenGeoGraph.cpp -o GenGeoGraph - clean: - rm -f asp GenHypercube GenTree GenGeoGraph *.hi *.o + rm -f asp GenHypercube GenTree *.hi *.o diff --git a/apps/POLite/asp-pc/asp-push.cpp b/apps/POLite/asp-pc/asp-push.cpp new file mode 100644 index 00000000..a75f6628 --- /dev/null +++ b/apps/POLite/asp-pc/asp-push.cpp @@ -0,0 +1,180 @@ +// SPDX-License-Identifier: BSD-2-Clause +#include "RandomSet.h" + +#include +#include +#include +#include +#include +#include + +// Number of nodes and edges +uint32_t numNodes; +uint32_t numEdges; + +// Mapping from node id to array of neighbouring node ids +// First element of each array holds the number of neighbours +uint32_t** neighbours; + +// Mapping from node id to bit vector of reaching nodes +uint64_t** reaching; +uint64_t** reachingNext; + +// Number of 64-bit words in reaching vector +const uint64_t vectorSize = 1; + +void readGraph(const char* filename, bool undirected) +{ + // Read edges + FILE* fp = fopen(filename, "rt"); + if (fp == NULL) { + fprintf(stderr, "Can't open '%s'\n", filename); + exit(EXIT_FAILURE); + } + + // Note: we use a "pull" algorithm (rather than "push") to + // avoid parallel writes to the same address, hence we reverse + // the direction of the edges here. + + // Count number of nodes and edges + numEdges = 0; + numNodes = 0; + int ret; + while (1) { + uint32_t src, dst; + ret = fscanf(fp, "%d %d", &dst, &src); + if (ret == EOF) break; + numEdges++; + numNodes = src >= numNodes ? 
src+1 : numNodes; + numNodes = dst >= numNodes ? dst+1 : numNodes; + } + rewind(fp); + + // Create mapping from node id to number of neighbours + uint32_t* count = (uint32_t*) calloc(numNodes, sizeof(uint32_t)); + for (int i = 0; i < numEdges; i++) { + uint32_t src, dst; + ret = fscanf(fp, "%d %d", &dst, &src); + count[src]++; + if (undirected) count[dst]++; + } + + // Create mapping from node id to neighbours + neighbours = (uint32_t**) calloc(numNodes, sizeof(uint32_t*)); + rewind(fp); + for (int i = 0; i < numNodes; i++) { + neighbours[i] = (uint32_t*) calloc(count[i]+1, sizeof(uint32_t)); + neighbours[i][0] = count[i]; + } + for (int i = 0; i < numEdges; i++) { + uint32_t src, dst; + ret = fscanf(fp, "%d %d", &dst, &src); + neighbours[src][count[src]--] = dst; + if (undirected) neighbours[dst][count[dst]--] = src; + } + + // Create mapping from node id to bit vector of reaching nodes + reaching = (uint64_t**) calloc(numNodes, sizeof(uint64_t*)); + reachingNext = (uint64_t**) calloc(numNodes, sizeof(uint64_t*)); + for (int i = 0; i < numNodes; i++) { + reaching[i] = (uint64_t*) calloc(vectorSize, sizeof(uint64_t)); + reachingNext[i] = (uint64_t*) calloc(vectorSize, sizeof(uint64_t)); + } + + // Release + free(count); + fclose(fp); +} + +// Compute sum of all shortest paths from given sources +uint64_t ssp(uint32_t numSources, uint32_t* sources) +{ + // Sum of distances + uint64_t sum = 0; + + // Initialise reaching vector for each node + for (int i = 0; i < numNodes; i++) { + for (int j = 0; j < vectorSize; j++) { + reaching[i][j] = 0; + reachingNext[i][j] = 0; + } + } + for (int i = 0; i < numSources; i++) { + uint32_t src = sources[i]; + reaching[src][i/64] |= 1ul << (i%64); + } + + int* queue = new int [numNodes]; + int queueSize = 0; + for (int i = 0; i < numNodes; i++) queue[queueSize++] = i; + + // Distance increases on each iteration + uint32_t dist = 1; + + while (queueSize > 0) { + // For each node + for (int i = 0; i < queueSize; i++) { + int me = 
queue[i]; + // For each neighbour + uint32_t numNeighbours = neighbours[me][0]; + for (int j = 1; j <= numNeighbours; j++) { + uint32_t n = neighbours[me][j]; + // For each chunk + for (int k = 0; k < vectorSize; k++) { + if (reaching[me][k] & ~reachingNext[n][k]) + reachingNext[n][k] |= reaching[me][k]; + } + } + } + + // For each node, update reaching vector + queueSize = 0; + for (int i = 0; i < numNodes; i++) { + for (int k = 0; k < vectorSize; k++) { + uint64_t diff = reachingNext[i][k] & ~reaching[i][k]; + if (diff) { + queue[queueSize++] = i; + uint32_t n = __builtin_popcountll(diff); + sum += n * dist; + reaching[i][k] |= reachingNext[i][k]; + } + } + } + dist++; + } + + return sum; +} + +int main(int argc, char**argv) +{ + if (argc != 2) { + printf("Specify edges file\n"); + exit(EXIT_FAILURE); + } + bool undirected = false; + readGraph(argv[1], undirected); + printf("Nodes: %u. Edges: %u\n", numNodes, numEdges); + + uint32_t numSources = 64*vectorSize; + assert(numSources < numNodes); + uint32_t sources[numSources]; + for (int i = 0; i < numSources; i++) sources[i] = i; + //randomSet(numSources, sources, numNodes); + + struct timeval start, finish, diff; + + uint64_t sum = 0; + const int nodesPerVector = 64 * vectorSize; + gettimeofday(&start, NULL); + sum = ssp(numSources, sources); + gettimeofday(&finish, NULL); + + printf("Sum of subset of shortest paths = %lu\n", sum); + + timersub(&finish, &start, &diff); + double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0; + printf("Time = %lf\n", duration); + + return 0; +} diff --git a/apps/POLite/asp-sync/Run.cpp b/apps/POLite/asp-sync/Run.cpp index 25082646..518a33b5 100644 --- a/apps/POLite/asp-sync/Run.cpp +++ b/apps/POLite/asp-sync/Run.cpp @@ -19,9 +19,11 @@ int main(int argc, char**argv) // Read network EdgeList net; net.read(argv[1]); - + // Print max fan-out printf("Max fan-out = %d\n", net.maxFanOut()); + printf("Min fan-out = %d\n", net.minFanOut()); + assert(net.minFanOut() > 
0); // Check that parameters make sense assert(32*N <= net.numNodes); @@ -97,7 +99,9 @@ int main(int argc, char**argv) // Display time timersub(&finish, &start, &diff); double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0; + #ifndef POLITE_DUMP_STATS printf("Time = %lf\n", duration); + #endif return 0; } diff --git a/apps/POLite/asp-tiles-sync/Run.cpp b/apps/POLite/asp-tiles-sync/Run.cpp index 049d83a8..cdc2bb14 100644 --- a/apps/POLite/asp-tiles-sync/Run.cpp +++ b/apps/POLite/asp-tiles-sync/Run.cpp @@ -135,11 +135,11 @@ int main(int argc, char**argv) double duration; timersub(&finishCompute, &startCompute, &diff); duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0; - printf("Time (compute) = %lf\n", duration); + printf("Time (compute, including stats transfer over UART) = %lf\n", duration); gettimeofday(&finishAll, NULL); timersub(&finishAll, &startAll, &diff); duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0; - printf("Time (all) = %lf\n", duration); + printf("Time (all, including stats transfer over UART) = %lf\n", duration); return 0; } diff --git a/apps/POLite/clocktree-async/Run.cpp b/apps/POLite/clocktree-async/Run.cpp index 270c9b48..02f76723 100644 --- a/apps/POLite/clocktree-async/Run.cpp +++ b/apps/POLite/clocktree-async/Run.cpp @@ -93,7 +93,9 @@ int main(int argc, char** argv) // Display time timersub(&finish, &start, &diff); double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0; + #ifndef POLITE_DUMP_STATS printf("Time = %lf\n", duration); + #endif return 0; } diff --git a/apps/POLite/hashmin-sync/Run.cpp b/apps/POLite/hashmin-sync/Run.cpp index cb6a7ced..eab92eff 100644 --- a/apps/POLite/hashmin-sync/Run.cpp +++ b/apps/POLite/hashmin-sync/Run.cpp @@ -82,7 +82,9 @@ int main(int argc, char**argv) // Display time timersub(&finish, &start, &diff); double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0; + #ifndef POLITE_DUMP_STATS printf("Time = %lf\n", 
duration); + #endif return 0; } diff --git a/apps/POLite/heat-cube-sync/Run.cpp b/apps/POLite/heat-cube-sync/Run.cpp index aaa42c39..1163f01b 100644 --- a/apps/POLite/heat-cube-sync/Run.cpp +++ b/apps/POLite/heat-cube-sync/Run.cpp @@ -76,7 +76,9 @@ int main() // Display time timersub(&finish, &start, &diff); double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0; + #ifndef POLITE_DUMP_STATS printf("Time = %lf\n", duration); + #endif return 0; } diff --git a/apps/POLite/heat-gals/Heat.h b/apps/POLite/heat-gals/Heat.h index 12ca9574..600b4d00 100644 --- a/apps/POLite/heat-gals/Heat.h +++ b/apps/POLite/heat-gals/Heat.h @@ -2,6 +2,8 @@ #ifndef _HEAT_H_ #define _HEAT_H_ +#define POLITE_DUMP_STATS +#define POLITE_COUNT_MSGS #include struct HeatMessage { @@ -10,7 +12,7 @@ struct HeatMessage { // Time step uint32_t time; // Temperature at sender - uint32_t val; + float val; }; struct HeatState { @@ -21,9 +23,9 @@ struct HeatState { // Current time step of device uint32_t time; // Current temperature of device - uint32_t val; + float val; // Accumulator for temperatures received at times t and t+1 - uint32_t acc, accNext; + float acc, accNext; // Count messages sent and received uint8_t sent, received, receivedNext; // Is the temperature of this device constant? @@ -45,7 +47,7 @@ struct HeatDevice : PDevice { // Proceed to next time step? 
if (s->sent && s->received == s->fanIn) { s->time--; - if (!s->isConstant) s->val = s->acc >> 2; + if (!s->isConstant) s->val = s->acc / (float) s->fanIn; s->acc = s->accNext; s->received = s->receivedNext; s->accNext = s->receivedNext = 0; diff --git a/apps/POLite/heat-gals/Makefile b/apps/POLite/heat-gals/Makefile index 0c343edd..86430b66 100644 --- a/apps/POLite/heat-gals/Makefile +++ b/apps/POLite/heat-gals/Makefile @@ -1,7 +1,7 @@ # SPDX-License-Identifier: BSD-2-Clause APP_CPP = Heat.cpp APP_HDR = Heat.h -RUN_CPP = Run.cpp Colours.cpp -RUN_H = Colours.h +RUN_CPP = Run.cpp +RUN_H = include ../util/polite.mk diff --git a/apps/POLite/heat-gals/Run.cpp b/apps/POLite/heat-gals/Run.cpp index 0a08505b..44c2f921 100644 --- a/apps/POLite/heat-gals/Run.cpp +++ b/apps/POLite/heat-gals/Run.cpp @@ -1,17 +1,31 @@ // SPDX-License-Identifier: BSD-2-Clause #include "Heat.h" -#include "Colours.h" #include #include +#include #include -int main() +int main(int argc, char **argv) { // Parameters - const uint32_t width = 256; - const uint32_t height = 256; - const uint32_t time = 1000; + const uint32_t time = 1000; + + // Read in the example edge list and create data structure + if (argc != 2) { + printf("Specify edge file\n"); + exit(EXIT_FAILURE); + } + + // Load in the edge list file + printf("Loading in the graph..."); fflush(stdout); + EdgeList net; + net.read(argv[1]); + printf(" done\n"); + + // Print max fan-out + printf("Min fan-out = %d\n", net.minFanOut()); + printf("Max fan-out = %d\n", net.maxFanOut()); // Connection to tinsel machine HostLink hostLink; @@ -19,58 +33,32 @@ int main() // Create POETS graph PGraph graph; - // Create 2D mesh of devices - PDeviceId **mesh = new PDeviceId* [height]; - for (uint32_t y = 0; y < height; y++) { - mesh[y] = new PDeviceId [width]; - for (uint32_t x = 0; x < width; x++) - mesh[y][x] = graph.newDevice(); + // Create nodes in POETS graph + for (uint32_t i = 0; i < net.numNodes; i++) { + PDeviceId id = graph.newDevice(); + assert(i 
== id); } - // Add edges - for (uint32_t y = 0; y < height; y++) - for (uint32_t x = 0; x < width; x++) { - if (x < width-1) { - graph.addEdge(mesh[y][x], 0, mesh[y][x+1]); - graph.addEdge(mesh[y][x+1], 0, mesh[y][x]); - } - if (y < height-1) { - graph.addEdge(mesh[y][x], 0, mesh[y+1][x]); - graph.addEdge(mesh[y+1][x], 0, mesh[y][x]); - } - } + // Create connections in POETS graph + for (uint32_t i = 0; i < net.numNodes; i++) { + uint32_t numNeighbours = net.neighbours[i][0]; + for (uint32_t j = 0; j < numNeighbours; j++) + graph.addEdge(i, 0, net.neighbours[i][j+1]); + } // Prepare mapping from graph to hardware graph.map(); - // Set device ids - for (uint32_t y = 0; y < height; y++) - for (uint32_t x = 0; x < width; x++) - graph.devices[mesh[y][x]]->state.id = mesh[y][x]; - - // Initialise time and fanIn fields + // Specify number of time steps to run on each device + srand(1); for (PDeviceId i = 0; i < graph.numDevices; i++) { + int r = rand() % 255; + graph.devices[i]->state.id = i; graph.devices[i]->state.time = time; + graph.devices[i]->state.val = (float) r; + graph.devices[i]->state.isConstant = false; graph.devices[i]->state.fanIn = graph.fanIn(i); } - - // Apply constant heat at north edge - // Apply constant cool at south edge - for (uint32_t x = 0; x < width; x++) { - graph.devices[mesh[0][x]]->state.val = 255 << 16; - graph.devices[mesh[0][x]]->state.isConstant = true; - graph.devices[mesh[height-1][x]]->state.val = 40 << 16; - graph.devices[mesh[height-1][x]]->state.isConstant = true; - } - - // Apply constant heat at west edge - // Apply constant cool at east edge - for (uint32_t y = 0; y < height; y++) { - graph.devices[mesh[y][0]]->state.val = 255 << 16; - graph.devices[mesh[y][0]]->state.isConstant = true; - graph.devices[mesh[y][width-1]]->state.val = 40 << 16; - graph.devices[mesh[y][width-1]]->state.isConstant = true; - } // Write graph down to tinsel machine via HostLink graph.write(&hostLink); @@ -84,8 +72,11 @@ int main() struct timeval 
start, finish, diff; gettimeofday(&start, NULL); + // Consume performance stats + politeSaveStats(&hostLink, "stats.txt"); + // Allocate array to contain final value of each device - uint32_t* pixels = new uint32_t [graph.numDevices]; + float* pixels = new float [graph.numDevices]; // Receive final value of each device for (uint32_t i = 0; i < graph.numDevices; i++) { @@ -97,25 +88,19 @@ int main() pixels[msg.payload.from] = msg.payload.val; } + // Display final values of first ten devices + for (uint32_t i = 0; i < 10; i++) { + if (i < graph.numDevices) { + printf("%d: %f\n", i, pixels[i]); + } + } + // Display time timersub(&finish, &start, &diff); double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0; + #ifndef POLITE_DUMP_STATS printf("Time = %lf\n", duration); - - // Emit image - FILE* fp = fopen("out.ppm", "wt"); - if (fp == NULL) { - printf("Can't open output file for writing\n"); - return -1; - } - fprintf(fp, "P3\n%d %d\n255\n", width, height); - for (uint32_t y = 0; y < height; y++) - for (uint32_t x = 0; x < width; x++) { - uint32_t val = (pixels[mesh[y][x]] >> 16) & 0xff; - fprintf(fp, "%d %d %d\n", - colours[val*3], colours[val*3+1], colours[val*3+2]); - } - fclose(fp); + #endif return 0; } diff --git a/apps/POLite/heat-gals/Colours.cpp b/apps/POLite/heat-grid-sync/Colours.cpp similarity index 100% rename from apps/POLite/heat-gals/Colours.cpp rename to apps/POLite/heat-grid-sync/Colours.cpp diff --git a/apps/POLite/heat-gals/Colours.h b/apps/POLite/heat-grid-sync/Colours.h similarity index 100% rename from apps/POLite/heat-gals/Colours.h rename to apps/POLite/heat-grid-sync/Colours.h diff --git a/apps/POLite/ping-test/ping.cpp b/apps/POLite/heat-grid-sync/Heat.cpp similarity index 57% rename from apps/POLite/ping-test/ping.cpp rename to apps/POLite/heat-grid-sync/Heat.cpp index 74960d36..b2b4fc3e 100644 --- a/apps/POLite/ping-test/ping.cpp +++ b/apps/POLite/heat-grid-sync/Heat.cpp @@ -1,21 +1,21 @@ // SPDX-License-Identifier: 
BSD-2-Clause -#include "ping.h" +#include "Heat.h" #include #include typedef PThread< - PingDevice, - PingState, // State + HeatDevice, + HeatState, // State None, // Edge label - PingMessage // Message - > PingThread; + HeatMessage // Message + > HeatThread; int main() { // Point thread structure at base of thread's heap - PingThread* thread = (PingThread*) tinselHeapBaseSRAM(); - + HeatThread* thread = (HeatThread*) tinselHeapBaseSRAM(); + // Invoke interpreter thread->run(); diff --git a/apps/POLite/heat-grid-sync/Heat.h b/apps/POLite/heat-grid-sync/Heat.h new file mode 100644 index 00000000..b3a63a93 --- /dev/null +++ b/apps/POLite/heat-grid-sync/Heat.h @@ -0,0 +1,71 @@ +// SPDX-License-Identifier: BSD-2-Clause +#ifndef _HEAT_H_ +#define _HEAT_H_ + +#include + +struct HeatMessage { + // Sender id + uint32_t from; + // Time step + uint32_t time; + // Temperature at sender + uint32_t val; +}; + +struct HeatState { + // Device id + uint32_t id; + // Current time step of device + uint32_t time; + // Current temperature of device + uint32_t val, acc; + // Is the temperature of this device constant? + bool isConstant; +}; + +struct HeatDevice : PDevice { + + // Called once by POLite at start of execution + inline void init() { + *readyToSend = Pin(0); + } + + // Send handler + inline void send(volatile HeatMessage* msg) { + msg->from = s->id; + msg->time = s->time; + msg->val = s->val; + *readyToSend = No; + } + + // Receive handler + inline void recv(HeatMessage* msg, None* edge) { + s->acc += msg->val; + } + + // Called by POLite when system becomes idle + inline bool step() { + // Execution complete? 
+ if (s->time == 0) { + *readyToSend = No; + return false; + } + else { + s->time--; + if (!s->isConstant) s->val = s->acc >> 2; + s->acc = 0; + *readyToSend = Pin(0); + return true; + } + } + + // Optionally send message to host on termination + inline bool finish(volatile HeatMessage* msg) { + msg->from = s->id; + msg->val = s->val; + return true; + } +}; + +#endif diff --git a/apps/POLite/heat-grid-sync/Makefile b/apps/POLite/heat-grid-sync/Makefile new file mode 100644 index 00000000..0c343edd --- /dev/null +++ b/apps/POLite/heat-grid-sync/Makefile @@ -0,0 +1,7 @@ +# SPDX-License-Identifier: BSD-2-Clause +APP_CPP = Heat.cpp +APP_HDR = Heat.h +RUN_CPP = Run.cpp Colours.cpp +RUN_H = Colours.h + +include ../util/polite.mk diff --git a/apps/POLite/heat-grid-sync/Run.cpp b/apps/POLite/heat-grid-sync/Run.cpp new file mode 100644 index 00000000..a938a446 --- /dev/null +++ b/apps/POLite/heat-grid-sync/Run.cpp @@ -0,0 +1,119 @@ +// SPDX-License-Identifier: BSD-2-Clause +#include "Heat.h" +#include "Colours.h" + +#include +#include +#include + +int main() +{ + // Parameters + const uint32_t width = 256; + const uint32_t height = 256; + const uint32_t time = 1000; + + // Connection to tinsel machine + HostLink hostLink; + + // Create POETS graph + PGraph graph; + + // Create 2D mesh of devices + PDeviceId **mesh = new PDeviceId* [height]; + for (uint32_t y = 0; y < height; y++) { + mesh[y] = new PDeviceId [width]; + for (uint32_t x = 0; x < width; x++) + mesh[y][x] = graph.newDevice(); + } + + // Add edges + for (uint32_t y = 0; y < height; y++) + for (uint32_t x = 0; x < width; x++) { + if (x < width-1) { + graph.addEdge(mesh[y][x], 0, mesh[y][x+1]); + graph.addEdge(mesh[y][x+1], 0, mesh[y][x]); + } + if (y < height-1) { + graph.addEdge(mesh[y][x], 0, mesh[y+1][x]); + graph.addEdge(mesh[y+1][x], 0, mesh[y][x]); + } + } + + // Prepare mapping from graph to hardware + graph.map(); + + // Set device ids + for (uint32_t y = 0; y < height; y++) + for (uint32_t x = 0; x < 
width; x++) + graph.devices[mesh[y][x]]->state.id = mesh[y][x]; + + // Specify number of time steps to run on each device + for (PDeviceId i = 0; i < graph.numDevices; i++) + graph.devices[i]->state.time = time; + + // Apply constant heat at north edge + // Apply constant cool at south edge + for (uint32_t x = 0; x < width; x++) { + graph.devices[mesh[0][x]]->state.val = 255 << 16; + graph.devices[mesh[0][x]]->state.isConstant = true; + graph.devices[mesh[height-1][x]]->state.val = 40 << 16; + graph.devices[mesh[height-1][x]]->state.isConstant = true; + } + + // Apply constant heat at west edge + // Apply constant cool at east edge + for (uint32_t y = 0; y < height; y++) { + graph.devices[mesh[y][0]]->state.val = 255 << 16; + graph.devices[mesh[y][0]]->state.isConstant = true; + graph.devices[mesh[y][width-1]]->state.val = 40 << 16; + graph.devices[mesh[y][width-1]]->state.isConstant = true; + } + + // Write graph down to tinsel machine via HostLink + graph.write(&hostLink); + + // Load code and trigger execution + hostLink.boot("code.v", "data.v"); + hostLink.go(); + printf("Starting\n"); + + // Start timer + struct timeval start, finish, diff; + gettimeofday(&start, NULL); + + // Allocate array to contain final value of each device + uint32_t* pixels = new uint32_t [graph.numDevices]; + + // Receive final value of each device + for (uint32_t i = 0; i < graph.numDevices; i++) { + // Receive message + PMessage msg; + hostLink.recvMsg(&msg, sizeof(msg)); + if (i == 0) gettimeofday(&finish, NULL); + // Save final value + pixels[msg.payload.from] = msg.payload.val; + } + + // Display time + timersub(&finish, &start, &diff); + double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0; + printf("Time = %lf\n", duration); + + // Emit image + FILE* fp = fopen("out.ppm", "wt"); + if (fp == NULL) { + printf("Can't open output file for writing\n"); + return -1; + } + fprintf(fp, "P3\n%d %d\n255\n", width, height); + for (uint32_t y = 0; y < height; y++) + 
for (uint32_t x = 0; x < width; x++) { + uint32_t val = (pixels[mesh[y][x]] >> 16) & 0xff; + fprintf(fp, "%d %d %d\n", + colours[val*3], colours[val*3+1], colours[val*3+2]); + } + fclose(fp); + + return 0; +} diff --git a/apps/POLite/heat-pc/Makefile b/apps/POLite/heat-pc/Makefile new file mode 100644 index 00000000..235863ef --- /dev/null +++ b/apps/POLite/heat-pc/Makefile @@ -0,0 +1,11 @@ +# SPDX-License-Identifier: BSD-2-Clause +all: heat + +INC=../../../include + +heat: heat.cpp + g++ -I$(INC) -O3 heat.cpp -o heat + +.PHONY: clean +clean: + rm heat diff --git a/apps/POLite/heat-pc/heat.cpp b/apps/POLite/heat-pc/heat.cpp new file mode 100644 index 00000000..194766ac --- /dev/null +++ b/apps/POLite/heat-pc/heat.cpp @@ -0,0 +1,63 @@ +// SPDX-License-Identifier: BSD-2-Clause +#include +#include +#include +#include +#include +#include +#include + +int main(int argc, char**argv) +{ + if (argc != 2) { + printf("Specify edges file\n"); + exit(EXIT_FAILURE); + } + + // Read network + EdgeList net; + net.read(argv[1]); + + // Create states + float* heat = new float [net.numNodes]; + float* heatNext = new float [net.numNodes]; + srand(1); + for (int i = 0; i < net.numNodes; i++) { + int r = rand() % 255; + heat[i] = (float) r; + } + + // Start timer + printf("Started\n"); + struct timeval start, finish, diff; + gettimeofday(&start, NULL); + + for (int t = 0; t < 100; t++) { + for (int i = 0; i < net.numNodes; i++) { + uint32_t numNeighbours = net.neighbours[i][0]; + float acc = 0.0; + for (uint32_t j = 0; j < numNeighbours; j++) { + uint32_t neighbour = net.neighbours[i][j+1]; + acc += heat[neighbour]; + } + heatNext[i] = acc / (float) numNeighbours; + } + float* tmp = heat; heat = heatNext; heatNext = tmp; + } + + // Stop timer + gettimeofday(&finish, NULL); + + // Display final values of first ten devices + for (uint32_t i = 0; i < 10; i++) { + if (i < net.numNodes) + printf("%d: %f\n", i, heat[i]); + } + + // Display time + timersub(&finish, &start, &diff); + double 
duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0; + printf("Time = %lf\n", duration); + + return 0; +} diff --git a/apps/POLite/heat-sync/Colours.cpp b/apps/POLite/heat-sync/Colours.cpp deleted file mode 100644 index 93b49740..00000000 --- a/apps/POLite/heat-sync/Colours.cpp +++ /dev/null @@ -1,71 +0,0 @@ -// SPDX-License-Identifier: BSD-2-Clause -#include - -// 256 x RGB colours representing heat intensities -uint8_t colours[] = { - 0x00, 0x00, 0x76, 0x00, 0x00, 0x7a, 0x00, 0x00, 0x7f, 0x00, 0x00, 0x83, - 0x00, 0x00, 0x88, 0x00, 0x00, 0x8c, 0x00, 0x00, 0x91, 0x00, 0x00, 0x95, - 0x00, 0x00, 0x9a, 0x00, 0x00, 0x9e, 0x00, 0x00, 0xa3, 0x00, 0x00, 0xa3, - 0x00, 0x00, 0xa7, 0x00, 0x00, 0xac, 0x00, 0x00, 0xb0, 0x00, 0x00, 0xb5, - 0x00, 0x00, 0xb9, 0x00, 0x00, 0xbe, 0x00, 0x00, 0xc2, 0x00, 0x00, 0xc7, - 0x00, 0x00, 0xcb, 0x00, 0x00, 0xd0, 0x00, 0x00, 0xd4, 0x00, 0x00, 0xd9, - 0x00, 0x00, 0xde, 0x00, 0x00, 0xe2, 0x00, 0x00, 0xe7, 0x00, 0x00, 0xeb, - 0x00, 0x00, 0xf0, 0x00, 0x00, 0xf4, 0x00, 0x00, 0xf9, 0x00, 0x00, 0xfd, - 0x00, 0x03, 0xff, 0x00, 0x07, 0xff, 0x00, 0x0c, 0xff, 0x00, 0x10, 0xff, - 0x00, 0x15, 0xff, 0x00, 0x19, 0xff, 0x00, 0x1e, 0xff, 0x00, 0x22, 0xff, - 0x00, 0x27, 0xff, 0x00, 0x2b, 0xff, 0x00, 0x30, 0xff, 0x00, 0x34, 0xff, - 0x00, 0x39, 0xff, 0x00, 0x3d, 0xff, 0x00, 0x42, 0xff, 0x00, 0x47, 0xff, - 0x00, 0x4b, 0xff, 0x00, 0x50, 0xff, 0x00, 0x54, 0xff, 0x00, 0x59, 0xff, - 0x00, 0x5d, 0xff, 0x00, 0x62, 0xff, 0x00, 0x66, 0xff, 0x00, 0x6b, 0xff, - 0x00, 0x6f, 0xff, 0x00, 0x74, 0xff, 0x00, 0x78, 0xff, 0x00, 0x7d, 0xff, - 0x00, 0x81, 0xff, 0x00, 0x86, 0xff, 0x00, 0x8a, 0xff, 0x00, 0x8f, 0xff, - 0x00, 0x93, 0xff, 0x00, 0x98, 0xff, 0x00, 0x9c, 0xff, 0x00, 0xa1, 0xff, - 0x00, 0xa5, 0xff, 0x00, 0xaa, 0xff, 0x00, 0xaf, 0xff, 0x00, 0xb3, 0xff, - 0x00, 0xb8, 0xff, 0x00, 0xbc, 0xff, 0x00, 0xc1, 0xff, 0x00, 0xc5, 0xff, - 0x00, 0xca, 0xff, 0x00, 0xce, 0xff, 0x00, 0xd3, 0xff, 0x00, 0xd7, 0xff, - 0x00, 0xdc, 0xff, 0x00, 0xe0, 0xff, 0x00, 0xe5, 0xff, 0x00, 
0xe9, 0xff, - 0x00, 0xee, 0xff, 0x00, 0xf2, 0xff, 0x00, 0xf7, 0xff, 0x00, 0xfb, 0xff, - 0x00, 0xff, 0xff, 0x00, 0xff, 0xfa, 0x00, 0xff, 0xf5, 0x00, 0xff, 0xf1, - 0x00, 0xff, 0xec, 0x00, 0xff, 0xe7, 0x00, 0xff, 0xe3, 0x00, 0xff, 0xde, - 0x00, 0xff, 0xda, 0x00, 0xff, 0xd5, 0x00, 0xff, 0xd1, 0x00, 0xff, 0xcc, - 0x00, 0xff, 0xc8, 0x00, 0xff, 0xc3, 0x00, 0xff, 0xbf, 0x00, 0xff, 0xba, - 0x00, 0xff, 0xb6, 0x00, 0xff, 0xb1, 0x00, 0xff, 0xad, 0x00, 0xff, 0xa8, - 0x00, 0xff, 0xa4, 0x00, 0xff, 0x9f, 0x00, 0xff, 0x9b, 0x00, 0xff, 0x96, - 0x00, 0xff, 0x92, 0x00, 0xff, 0x8d, 0x00, 0xff, 0x89, 0x00, 0xff, 0x84, - 0x00, 0xff, 0x80, 0x00, 0xff, 0x7b, 0x00, 0xff, 0x76, 0x00, 0xff, 0x72, - 0x00, 0xff, 0x6d, 0x00, 0xff, 0x69, 0x00, 0xff, 0x64, 0x00, 0xff, 0x60, - 0x00, 0xff, 0x5b, 0x00, 0xff, 0x57, 0x00, 0xff, 0x52, 0x00, 0xff, 0x4e, - 0x00, 0xff, 0x49, 0x00, 0xff, 0x45, 0x00, 0xff, 0x40, 0x00, 0xff, 0x3c, - 0x00, 0xff, 0x37, 0x00, 0xff, 0x33, 0x00, 0xff, 0x2e, 0x00, 0xff, 0x2a, - 0x00, 0xff, 0x25, 0x00, 0xff, 0x21, 0x00, 0xff, 0x1c, 0x00, 0xff, 0x18, - 0x00, 0xff, 0x13, 0x00, 0xff, 0x0e, 0x00, 0xff, 0x0a, 0x00, 0xff, 0x05, - 0x00, 0xff, 0x01, 0x04, 0xff, 0x00, 0x08, 0xff, 0x00, 0x0d, 0xff, 0x00, - 0x11, 0xff, 0x00, 0x16, 0xff, 0x00, 0x1a, 0xff, 0x00, 0x1f, 0xff, 0x00, - 0x23, 0xff, 0x00, 0x28, 0xff, 0x00, 0x2c, 0xff, 0x00, 0x31, 0xff, 0x00, - 0x35, 0xff, 0x00, 0x3a, 0xff, 0x00, 0x3e, 0xff, 0x00, 0x43, 0xff, 0x00, - 0x47, 0xff, 0x00, 0x4c, 0xff, 0x00, 0x50, 0xff, 0x00, 0x55, 0xff, 0x00, - 0x5a, 0xff, 0x00, 0x5e, 0xff, 0x00, 0x63, 0xff, 0x00, 0x67, 0xff, 0x00, - 0x6c, 0xff, 0x00, 0x70, 0xff, 0x00, 0x75, 0xff, 0x00, 0x79, 0xff, 0x00, - 0x7e, 0xff, 0x00, 0x82, 0xff, 0x00, 0x87, 0xff, 0x00, 0x8b, 0xff, 0x00, - 0x90, 0xff, 0x00, 0x94, 0xff, 0x00, 0x99, 0xff, 0x00, 0x9d, 0xff, 0x00, - 0xa2, 0xff, 0x00, 0xa6, 0xff, 0x00, 0xab, 0xff, 0x00, 0xaf, 0xff, 0x00, - 0xb4, 0xff, 0x00, 0xb8, 0xff, 0x00, 0xbd, 0xff, 0x00, 0xc2, 0xff, 0x00, - 0xc6, 0xff, 0x00, 0xcb, 0xff, 0x00, 0xcf, 0xff, 0x00, 0xd4, 
0xff, 0x00, - 0xd8, 0xff, 0x00, 0xdd, 0xff, 0x00, 0xe1, 0xff, 0x00, 0xe6, 0xff, 0x00, - 0xea, 0xff, 0x00, 0xef, 0xff, 0x00, 0xf3, 0xff, 0x00, 0xf8, 0xff, 0x00, - 0xfc, 0xff, 0x00, 0xff, 0xfd, 0x00, 0xff, 0xf9, 0x00, 0xff, 0xf4, 0x00, - 0xff, 0xf0, 0x00, 0xff, 0xeb, 0x00, 0xff, 0xe7, 0x00, 0xff, 0xe2, 0x00, - 0xff, 0xde, 0x00, 0xff, 0xd9, 0x00, 0xff, 0xd5, 0x00, 0xff, 0xd0, 0x00, - 0xff, 0xcb, 0x00, 0xff, 0xc7, 0x00, 0xff, 0xc2, 0x00, 0xff, 0xbe, 0x00, - 0xff, 0xb9, 0x00, 0xff, 0xb5, 0x00, 0xff, 0xb0, 0x00, 0xff, 0xac, 0x00, - 0xff, 0xa7, 0x00, 0xff, 0xa3, 0x00, 0xff, 0x9e, 0x00, 0xff, 0x9a, 0x00, - 0xff, 0x95, 0x00, 0xff, 0x91, 0x00, 0xff, 0x8c, 0x00, 0xff, 0x88, 0x00, - 0xff, 0x83, 0x00, 0xff, 0x7f, 0x00, 0xff, 0x7a, 0x00, 0xff, 0x76, 0x00, - 0xff, 0x71, 0x00, 0xff, 0x6d, 0x00, 0xff, 0x68, 0x00, 0xff, 0x63, 0x00, - 0xff, 0x5f, 0x00, 0xff, 0x5a, 0x00, 0xff, 0x56, 0x00, 0xff, 0x51, 0x00, - 0xff, 0x4d, 0x00, 0xff, 0x48, 0x00, 0xff, 0x44, 0x00, 0xff, 0x3f, 0x00, - 0xff, 0x3b, 0x00, 0xff, 0x36, 0x00, 0xff, 0x32, 0x00, 0xff, 0x2d, 0x00, - 0xff, 0x29, 0x00, 0xff, 0x24, 0x00, 0xff, 0x20, 0x00, 0xff, 0x1b, 0x00, - 0xff, 0x17, 0x00, 0xff, 0x12, 0x00, 0xff, 0x0e, 0x00, 0xff, 0x09, 0x00, - 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 -}; diff --git a/apps/POLite/heat-sync/Colours.h b/apps/POLite/heat-sync/Colours.h deleted file mode 100644 index fc34e04c..00000000 --- a/apps/POLite/heat-sync/Colours.h +++ /dev/null @@ -1,10 +0,0 @@ -// SPDX-License-Identifier: BSD-2-Clause -#ifndef _COLOURS_H_ -#define _COLOURS_H_ - -#include - -// 256 x RGB colours representing heat intensities -extern uint8_t colours[]; - -#endif diff --git a/apps/POLite/heat-sync/Heat.h b/apps/POLite/heat-sync/Heat.h index b3a63a93..8dc926b3 100644 --- a/apps/POLite/heat-sync/Heat.h +++ b/apps/POLite/heat-sync/Heat.h @@ -2,24 +2,26 @@ #ifndef _HEAT_H_ #define _HEAT_H_ +#define POLITE_DUMP_STATS +#define POLITE_COUNT_MSGS #include struct HeatMessage { // Sender id uint32_t from; - 
// Time step - uint32_t time; // Temperature at sender - uint32_t val; + float val; }; struct HeatState { // Device id uint32_t id; - // Current time step of device - uint32_t time; // Current temperature of device - uint32_t val, acc; + float val, acc; + // Time step + uint16_t time; + // Number of neighbours + uint16_t numNeighbours; // Is the temperature of this device constant? bool isConstant; }; @@ -34,7 +36,6 @@ struct HeatDevice : PDevice { // Send handler inline void send(volatile HeatMessage* msg) { msg->from = s->id; - msg->time = s->time; msg->val = s->val; *readyToSend = No; } @@ -42,6 +43,7 @@ struct HeatDevice : PDevice { // Receive handler inline void recv(HeatMessage* msg, None* edge) { s->acc += msg->val; + s->numNeighbours++; } // Called by POLite when system becomes idle @@ -53,8 +55,9 @@ struct HeatDevice : PDevice { } else { s->time--; - if (!s->isConstant) s->val = s->acc >> 2; - s->acc = 0; + if (!s->isConstant) s->val = s->acc / (float) s->numNeighbours; + s->acc = 0.0; + s->numNeighbours = 0; *readyToSend = Pin(0); return true; } diff --git a/apps/POLite/heat-sync/Makefile b/apps/POLite/heat-sync/Makefile index 0c343edd..f44d5b09 100644 --- a/apps/POLite/heat-sync/Makefile +++ b/apps/POLite/heat-sync/Makefile @@ -1,7 +1,7 @@ # SPDX-License-Identifier: BSD-2-Clause APP_CPP = Heat.cpp APP_HDR = Heat.h -RUN_CPP = Run.cpp Colours.cpp -RUN_H = Colours.h +RUN_CPP = Run.cpp +RUN_H = include ../util/polite.mk diff --git a/apps/POLite/heat-sync/Run.cpp b/apps/POLite/heat-sync/Run.cpp index a938a446..ed978e39 100644 --- a/apps/POLite/heat-sync/Run.cpp +++ b/apps/POLite/heat-sync/Run.cpp @@ -1,17 +1,30 @@ // SPDX-License-Identifier: BSD-2-Clause #include "Heat.h" -#include "Colours.h" #include #include +#include #include -int main() +int main(int argc, char **argv) { - // Parameters - const uint32_t width = 256; - const uint32_t height = 256; - const uint32_t time = 1000; + const uint32_t time = 1000; + + // Read in the example edge list and create 
data structure + if (argc != 2) { + printf("Specify edge file\n"); + exit(EXIT_FAILURE); + } + + // Load in the edge list file + printf("Loading in the graph..."); fflush(stdout); + EdgeList net; + net.read(argv[1]); + printf(" done\n"); + + // Print max fan-out + printf("Min fan-out = %d\n", net.minFanOut()); + printf("Max fan-out = %d\n", net.maxFanOut()); // Connection to tinsel machine HostLink hostLink; @@ -19,55 +32,31 @@ int main() // Create POETS graph PGraph graph; - // Create 2D mesh of devices - PDeviceId **mesh = new PDeviceId* [height]; - for (uint32_t y = 0; y < height; y++) { - mesh[y] = new PDeviceId [width]; - for (uint32_t x = 0; x < width; x++) - mesh[y][x] = graph.newDevice(); + // Create nodes in POETS graph + for (uint32_t i = 0; i < net.numNodes; i++) { + PDeviceId id = graph.newDevice(); + assert(i == id); } - // Add edges - for (uint32_t y = 0; y < height; y++) - for (uint32_t x = 0; x < width; x++) { - if (x < width-1) { - graph.addEdge(mesh[y][x], 0, mesh[y][x+1]); - graph.addEdge(mesh[y][x+1], 0, mesh[y][x]); - } - if (y < height-1) { - graph.addEdge(mesh[y][x], 0, mesh[y+1][x]); - graph.addEdge(mesh[y+1][x], 0, mesh[y][x]); - } - } + // Create connections in POETS graph + for (uint32_t i = 0; i < net.numNodes; i++) { + uint32_t numNeighbours = net.neighbours[i][0]; + for (uint32_t j = 0; j < numNeighbours; j++) + graph.addEdge(i, 0, net.neighbours[i][j+1]); + } // Prepare mapping from graph to hardware graph.map(); - // Set device ids - for (uint32_t y = 0; y < height; y++) - for (uint32_t x = 0; x < width; x++) - graph.devices[mesh[y][x]]->state.id = mesh[y][x]; - // Specify number of time steps to run on each device - for (PDeviceId i = 0; i < graph.numDevices; i++) + srand(1); + for (PDeviceId i = 0; i < graph.numDevices; i++) { + int r = rand() % 255; + graph.devices[i]->state.id = i; graph.devices[i]->state.time = time; - - // Apply constant heat at north edge - // Apply constant cool at south edge - for (uint32_t x = 0; x < width; 
x++) { - graph.devices[mesh[0][x]]->state.val = 255 << 16; - graph.devices[mesh[0][x]]->state.isConstant = true; - graph.devices[mesh[height-1][x]]->state.val = 40 << 16; - graph.devices[mesh[height-1][x]]->state.isConstant = true; - } - - // Apply constant heat at west edge - // Apply constant cool at east edge - for (uint32_t y = 0; y < height; y++) { - graph.devices[mesh[y][0]]->state.val = 255 << 16; - graph.devices[mesh[y][0]]->state.isConstant = true; - graph.devices[mesh[y][width-1]]->state.val = 40 << 16; - graph.devices[mesh[y][width-1]]->state.isConstant = true; + graph.devices[i]->state.val = (float) r; + graph.devices[i]->state.isConstant = false; + //graph.devices[i]->state.fanOut = graph.fanOut(i); } // Write graph down to tinsel machine via HostLink @@ -82,8 +71,11 @@ int main() struct timeval start, finish, diff; gettimeofday(&start, NULL); + // Consume performance stats + politeSaveStats(&hostLink, "stats.txt"); + // Allocate array to contain final value of each device - uint32_t* pixels = new uint32_t [graph.numDevices]; + float* pixels = new float [graph.numDevices]; // Receive final value of each device for (uint32_t i = 0; i < graph.numDevices; i++) { @@ -95,25 +87,19 @@ int main() pixels[msg.payload.from] = msg.payload.val; } + // Display final values of first ten devices + for (uint32_t i = 0; i < 10; i++) { + if (i < graph.numDevices) { + printf("%d: %f\n", i, pixels[i]); + } + } + // Display time timersub(&finish, &start, &diff); double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0; + #ifndef POLITE_DUMP_STATS printf("Time = %lf\n", duration); - - // Emit image - FILE* fp = fopen("out.ppm", "wt"); - if (fp == NULL) { - printf("Can't open output file for writing\n"); - return -1; - } - fprintf(fp, "P3\n%d %d\n255\n", width, height); - for (uint32_t y = 0; y < height; y++) - for (uint32_t x = 0; x < width; x++) { - uint32_t val = (pixels[mesh[y][x]] >> 16) & 0xff; - fprintf(fp, "%d %d %d\n", - colours[val*3], 
colours[val*3+1], colours[val*3+2]); - } - fclose(fp); + #endif return 0; } diff --git a/apps/POLite/izhikevich-gals/Izhikevich.cpp b/apps/POLite/izhikevich-gals/Izhikevich.cpp new file mode 100644 index 00000000..8533062a --- /dev/null +++ b/apps/POLite/izhikevich-gals/Izhikevich.cpp @@ -0,0 +1,23 @@ +// SPDX-License-Identifier: BSD-2-Clause +#include "Izhikevich.h" + +#include +#include + +typedef PThread< + IzhikevichDevice, + IzhikevichState, // State + Weight, // Edge label + IzhikevichMsg // Message + > IzhikevichThread; + +int main() +{ + // Point thread structure at base of thread's heap + IzhikevichThread* thread = (IzhikevichThread*) tinselHeapBaseSRAM(); + + // Invoke interpreter + thread->run(); + + return 0; +} diff --git a/apps/POLite/izhikevich-gals/Izhikevich.h b/apps/POLite/izhikevich-gals/Izhikevich.h new file mode 100644 index 00000000..701af341 --- /dev/null +++ b/apps/POLite/izhikevich-gals/Izhikevich.h @@ -0,0 +1,115 @@ +// SPDX-License-Identifier: BSD-2-Clause +// (Based on code by David Thomas) +#ifndef _Izhikevich_H_ +#define _Izhikevich_H_ + +#define POLITE_DUMP_STATS +#define POLITE_COUNT_MSGS +#include +#include "RNG.h" + +// Number of time steps to run for +#define NUM_STEPS 100 + +// Vertex state +struct IzhikevichState { + // Random-number-generator state + uint32_t rng; + // Neuron state + float u, v, I, acc, accNext; + uint32_t spikeCount; + // Protocol + bool sent; + uint16_t received, receivedNext, fanIn, time; + // Neuron properties + float a, b, c, d, Ir; +}; + +// Edge weight type +typedef float Weight; + +// Message type +struct IzhikevichMsg { + // Did the sender spike or not? 
+ bool spike; + // Time step of sender + uint16_t time; + // Number of times sender has spiked + uint32_t spikeCount; +}; + +// Vertex behaviour +struct IzhikevichDevice : PDevice { + inline void init() { + s->v = -65.0f; + s->u = s->b * s->v; + s->I = s->Ir * grng(s->rng); + *readyToSend = Pin(0); + } + + // We call this on every state change + inline void change() { + // Execution complete? + if (s->time == NUM_STEPS) return; + + // Proceed to next time step? + if (s->sent && s->received == s->fanIn) { + s->time++; + s->I += s->acc; + s->acc = s->accNext; + s->accNext = 0; + s->received = s->receivedNext; + s->receivedNext = 0; + s->sent = false; + *readyToSend = s->time == (NUM_STEPS+1) ? No : Pin(0); + } + } + + // Send handler + inline void send(volatile IzhikevichMsg* msg) { + bool spike = false; + float &v = s->v; + float &u = s->u; + float &I = s->I; + v = v+0.5*(0.04*v*v+5*v+140-u+I); // Step 0.5 ms + v = v+0.5*(0.04*v*v+5*v+140-u+I); // for numerical + u = u + s->a*(s->b*v-u); // stability + if (v >= 30.0) { + v = s->c; + u += s->d; + s->spikeCount++; + spike = true; + } + s->I = s->Ir * grng(s->rng); + msg->time = s->time; + msg->spike = spike; + msg->spikeCount = s->spikeCount; + s->sent = true; + *readyToSend = No; + change(); + } + + // Receive handler + inline void recv(IzhikevichMsg* msg, Weight* weight) { + if (msg->time == s->time) { + if (msg->spike) s->acc += *weight; + s->received++; + change(); + } + else { + if (msg->spike) s->accNext += *weight; + s->receivedNext++; + } + } + + inline bool step() { + return false; + } + + inline bool finish(IzhikevichMsg* msg) { + msg->spikeCount = s->spikeCount; + return true; + } +}; + +#endif diff --git a/apps/POLite/ping-test/Makefile b/apps/POLite/izhikevich-gals/Makefile similarity index 63% rename from apps/POLite/ping-test/Makefile rename to apps/POLite/izhikevich-gals/Makefile index 7e85d2c6..5ba3d9e3 100644 --- a/apps/POLite/ping-test/Makefile +++ b/apps/POLite/izhikevich-gals/Makefile @@ -1,6 +1,6 
@@ # SPDX-License-Identifier: BSD-2-Clause -APP_CPP = ping.cpp -APP_HDR = ping.h +APP_CPP = Izhikevich.cpp +APP_HDR = Izhikevich.h RUN_CPP = Run.cpp include ../util/polite.mk diff --git a/apps/POLite/izhikevich-gals/RNG.h b/apps/POLite/izhikevich-gals/RNG.h new file mode 100644 index 00000000..61b719b3 --- /dev/null +++ b/apps/POLite/izhikevich-gals/RNG.h @@ -0,0 +1,23 @@ +#ifndef _RNG_H_ +#define _RNG_H_ + +inline uint32_t urng(uint32_t &state) { + state = state*1664525+1013904223; + return state; +} + +// World's crappiest gaussian (courtesy of dt10!) +inline float grng(uint32_t &state) { + uint32_t u=urng(state); + int32_t acc=0; + for(unsigned i=0;i<8;i++){ + acc += u&0xf; + u=u>>4; + } + // a four-bit uniform has mean 7.5 and variance ((15-0+1)^2-1)/12 = 85/4 + // sum of eight uniforms has mean 8*7.5=60 and variance of 8*85/4=170 + const float scale=0.07669649888473704; // == 1/sqrt(170) + return (acc-60.0f) * scale; +} + +#endif diff --git a/apps/POLite/izhikevich-gals/Run.cpp b/apps/POLite/izhikevich-gals/Run.cpp new file mode 100644 index 00000000..e542881f --- /dev/null +++ b/apps/POLite/izhikevich-gals/Run.cpp @@ -0,0 +1,132 @@ +// SPDX-License-Identifier: BSD-2-Clause +#include "Izhikevich.h" + +#include +#include + +#include +#include +#include +#include + +inline double urand() { return (double) rand() / RAND_MAX; } + +int main(int argc, char**argv) +{ + if (argc != 2) { + printf("Specify edges file\n"); + exit(EXIT_FAILURE); + } + + // Read network + EdgeList net; + net.read(argv[1]); + + // Connection to tinsel machine + HostLink hostLink; + + // Create POETS graph + PGraph graph; + + // Create nodes in POETS graph + for (uint32_t i = 0; i < net.numNodes; i++) { + PDeviceId id = graph.newDevice(); + assert(i == id); + } + + // Ratio of excitatory to inhibitory neurons + double excitatory = 0.8; + + // Mark each neuron as excitatory (or inhibitory) + srand(1); + bool* excite = new bool [net.numNodes]; + for (int i = 0; i < net.numNodes; i++) + 
excite[i] = urand() < excitatory; + + // Create connections in POETS graph + for (uint32_t i = 0; i < net.numNodes; i++) { + uint32_t numNeighbours = net.neighbours[i][0]; + for (uint32_t j = 0; j < numNeighbours; j++) { + float weight = excite[i] ? 0.5 * urand() : -urand(); + graph.addLabelledEdge(weight, i, 0, net.neighbours[i][j+1]); + } + } + + // Add zero-weight back-edges for any directed edges + // (For GALS synchronisation) + for (uint32_t i = 0; i < net.numNodes; i++) { + for (uint32_t j = 0; j < net.neighbours[i][0]; j++) { + uint32_t n = net.neighbours[i][j+1]; + // TODO: can be more efficient here + bool needBackEdge = true; + for (uint32_t k = 0; k < net.neighbours[n][0]; k++) + if (net.neighbours[n][k+1] == i) needBackEdge = false; + if (needBackEdge) graph.addLabelledEdge(0.0, n, 0, i); + } + } + + // Prepare mapping from graph to hardware + graph.map(); + + srand(2); + // Initialise devices + for (PDeviceId i = 0; i < graph.numDevices; i++) { + IzhikevichState* n = &graph.devices[i]->state; + n->rng = (int32_t) (urand()*((double) (1<<31))); + n->fanIn = graph.fanIn(i); + if (excite[i]) { + float re = (float) urand(); + n->a = 0.02; + n->b = 0.2; + n->c = -65+15*re*re; + n->d = 8-6*re*re; + n->Ir = 5; + } + else { + float ri = (float) urand(); + n->a = 0.02+0.08*ri; + n->b = 0.25-0.05*ri; + n->c = -65; + n->d = 2; + n->Ir = 2; + } + } + + // Write graph down to tinsel machine via HostLink + graph.write(&hostLink); + + // Load code and trigger execution + hostLink.boot("code.v", "data.v"); + hostLink.go(); + + // Timer + printf("Started\n"); + struct timeval start, finish, diff; + gettimeofday(&start, NULL); + + // Consume performance stats + politeSaveStats(&hostLink, "stats.txt"); + + int64_t sum = 0; + // Receive final distance to each vertex + for (uint32_t i = 0; i < graph.numDevices; i++) { + // Receive message + PMessage msg; + hostLink.recvMsg(&msg, sizeof(msg)); + if (i == 0) gettimeofday(&finish, NULL); + // Accumulate + sum += 
msg.payload.spikeCount; + } + + // Emit result + printf("Total spikes = %ld\n", sum); + + // Display time + timersub(&finish, &start, &diff); + double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0; + #ifndef POLITE_DUMP_STATS + printf("Time = %lf\n", duration); + #endif + + return 0; +} diff --git a/apps/POLite/izhikevich-pc/Izhikevich.cpp b/apps/POLite/izhikevich-pc/Izhikevich.cpp new file mode 100644 index 00000000..b4f03ed5 --- /dev/null +++ b/apps/POLite/izhikevich-pc/Izhikevich.cpp @@ -0,0 +1,139 @@ +// SPDX-License-Identifier: BSD-2-Clause +// (Based on code by David Thomas) + +#include +#include +#include +#include "RNG.h" + +#define NUM_STEPS 100 + +// Neuron +struct Neuron { + // Random-number-generator state + uint32_t rng; + // Neuron state + float u, v, I, spikeCount; + // Neuron properties + float a, b, c, d, Ir; +}; + +int main(int argc, char**argv) +{ + if (argc != 2) { + printf("Specify edges file\n"); + exit(EXIT_FAILURE); + } + + // Read network + EdgeList net; + net.read(argv[1]); + + // Ratio of excitatory to inhibitory neurons + double excitatory = 0.8; + + // Mark each neuron as excitatory (or inhibiatory) + srand(1); + bool* excite = new bool [net.numNodes]; + for (int i = 0; i < net.numNodes; i++) { + excite[i] = urand() < excitatory; + } + + // Edge weights + float** weight = new float* [net.numNodes]; + for (int i = 0; i < net.numNodes; i++) { + uint32_t numEdges = net.neighbours[i][0]; + weight[i] = new float [numEdges]; + for (int j = 0; j < numEdges; j++) { + weight[i][j] = excite[i] ? 
0.5 * urand() : -urand(); + } + } + + // State for each neuron + srand(2); + Neuron* neuron = new Neuron [net.numNodes]; + for (int i = 0; i < net.numNodes; i++) { + Neuron* n = &neuron[i]; + n->rng = (int32_t) (urand()*((double) (1<<31))); + if (excite[i]) { + float re = (float) urand(); + n->a = 0.02; + n->b = 0.2; + n->c = -65+15*re*re; + n->d = 8-6*re*re; + n->Ir = 5; + } + else { + float ri = (float) urand(); + n->a = 0.02+0.08*ri; + n->b = 0.25-0.05*ri; + n->c = -65; + n->d = 2; + n->Ir = 2; + } + } + + // Spike array + bool* spike = new bool [net.numNodes]; + + // Initialisation + for (int i = 0; i < net.numNodes; i++) { + Neuron* n = &neuron[i]; + n->v = -65.0; + n->u = n->b * n->v; + n->I = n->Ir * grng(n->rng); + } + + // Timer + printf("Started\n"); + struct timeval start, finish, diff; + gettimeofday(&start, NULL); + + // Simulation + int64_t totalSpikes = 0; + for (int t = 0; t <= NUM_STEPS; t++) { + // Update state + for (int i = 0; i < net.numNodes; i++) { + spike[i] = false; + Neuron* n = &neuron[i]; + float &v = n->v; + float &u = n->u; + float &I = n->I; + v = v+0.5*(0.04*v*v+5*v+140-u+I); // Step 0.5 ms + v = v+0.5*(0.04*v*v+5*v+140-u+I); // for numerical + u = u + n->a*(n->b*v-u); // stability + if (v >= 30.0) { + n->v = n->c; + n->u += n->d; + spike[i] = true; + } + n->I = n->Ir * grng(n->rng); + } + // Update I-values + uint32_t spikes = 0; + for (int i = 0; i < net.numNodes; i++) { + Neuron* n = &neuron[i]; + if (spike[i]) { + spikes++; + n->spikeCount++; + uint32_t numEdges = net.neighbours[i][0]; + uint32_t* dst = &net.neighbours[i][1]; + for (int j = 0; j < numEdges; j++) { + neuron[dst[j]].I += weight[i][j]; + } + } + } + //printf("%d: %d\n", t, spikes); + totalSpikes += spikes; + } + gettimeofday(&finish, NULL); + + printf("Total spikes: %ld\n", totalSpikes); + + // Display time + timersub(&finish, &start, &diff); + double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0; + printf("Time = %lf\n", duration); + + return 
0; +} diff --git a/apps/POLite/izhikevich-pc/Makefile b/apps/POLite/izhikevich-pc/Makefile new file mode 100644 index 00000000..52c92c74 --- /dev/null +++ b/apps/POLite/izhikevich-pc/Makefile @@ -0,0 +1,6 @@ +Izhikevich: Izhikevich.cpp RNG.h + g++ -I../../../include -O2 Izhikevich.cpp -o Izhikevich + +.PHONY: clean +clean: + rm Izhikevich diff --git a/apps/POLite/izhikevich-pc/RNG.h b/apps/POLite/izhikevich-pc/RNG.h new file mode 100644 index 00000000..decc32f1 --- /dev/null +++ b/apps/POLite/izhikevich-pc/RNG.h @@ -0,0 +1,27 @@ +#ifndef _RNG_H_ +#define _RNG_H_ + +inline uint32_t urng(uint32_t &state) { + state = state*1664525+1013904223; + return state; +} + +// World's crappiest gaussian (courtesy of dt10!) +inline float grng(uint32_t &state) { + uint32_t u=urng(state); + int32_t acc=0; + for(unsigned i=0;i<8;i++){ + acc += u&0xf; + u=u>>4; + } + // a four-bit uniform has mean 7.5 and variance ((15-0+1)^2-1)/12 = 85/4 + // sum of eight uniforms has mean 8*7.5=60 and variance of 8*85/4=170 + const float scale=0.07669649888473704; // == 1/sqrt(170) + return (acc-60.0f) * scale; +} + +inline double urand() { + return (double) rand() / RAND_MAX; +} + +#endif diff --git a/apps/POLite/izhikevich-sync/Izhikevich.cpp b/apps/POLite/izhikevich-sync/Izhikevich.cpp new file mode 100644 index 00000000..8533062a --- /dev/null +++ b/apps/POLite/izhikevich-sync/Izhikevich.cpp @@ -0,0 +1,23 @@ +// SPDX-License-Identifier: BSD-2-Clause +#include "Izhikevich.h" + +#include +#include + +typedef PThread< + IzhikevichDevice, + IzhikevichState, // State + Weight, // Edge label + IzhikevichMsg // Message + > IzhikevichThread; + +int main() +{ + // Point thread structure at base of thread's heap + IzhikevichThread* thread = (IzhikevichThread*) tinselHeapBaseSRAM(); + + // Invoke interpreter + thread->run(); + + return 0; +} diff --git a/apps/POLite/izhikevich-sync/Izhikevich.h b/apps/POLite/izhikevich-sync/Izhikevich.h new file mode 100644 index 00000000..150a4afa --- /dev/null +++ 
b/apps/POLite/izhikevich-sync/Izhikevich.h @@ -0,0 +1,72 @@ +// SPDX-License-Identifier: BSD-2-Clause +// (Based on code by David Thomas) +#ifndef _Izhikevich_H_ +#define _Izhikevich_H_ + +#define POLITE_DUMP_STATS +#define POLITE_COUNT_MSGS + +#include +#include "RNG.h" + +// Number of time steps to run for +#define NUM_STEPS 100 + +// Vertex state +struct IzhikevichState { + // Random-number-generator state + uint32_t rng; + // Neuron state + float u, v, I; + uint32_t spikeCount; + // Neuron properties + float a, b, c, d, Ir; +}; + +// Edge weight type +typedef float Weight; + +// Message type +struct IzhikevichMsg { + // Number of times sender has spiked + uint32_t spikeCount; +}; + +// Vertex behaviour +struct IzhikevichDevice : PDevice { + inline void init() { + s->v = -65.0f; + s->u = s->b * s->v; + s->I = s->Ir * grng(s->rng); + *readyToSend = No; + } + inline void send(IzhikevichMsg* msg) { + s->spikeCount++; + msg->spikeCount = s->spikeCount; + *readyToSend = No; + } + inline void recv(IzhikevichMsg* msg, Weight* weight) { + s->I += *weight; + } + inline bool step() { + float &v = s->v; + float &u = s->u; + float &I = s->I; + v = v+0.5*(0.04*v*v+5*v+140-u+I); // Step 0.5 ms + v = v+0.5*(0.04*v*v+5*v+140-u+I); // for numerical + u = u + s->a*(s->b*v-u); // stability + if (v >= 30.0) { + v = s->c; + u += s->d; + *readyToSend = Pin(0); + } + s->I = s->Ir * grng(s->rng); + return (time < NUM_STEPS); + } + inline bool finish(IzhikevichMsg* msg) { + msg->spikeCount = s->spikeCount; + return true; + } +}; + +#endif diff --git a/apps/POLite/izhikevich-sync/Makefile b/apps/POLite/izhikevich-sync/Makefile new file mode 100644 index 00000000..5ba3d9e3 --- /dev/null +++ b/apps/POLite/izhikevich-sync/Makefile @@ -0,0 +1,6 @@ +# SPDX-License-Identifier: BSD-2-Clause +APP_CPP = Izhikevich.cpp +APP_HDR = Izhikevich.h +RUN_CPP = Run.cpp + +include ../util/polite.mk diff --git a/apps/POLite/izhikevich-sync/RNG.h b/apps/POLite/izhikevich-sync/RNG.h new file mode 100644 index 
00000000..61b719b3 --- /dev/null +++ b/apps/POLite/izhikevich-sync/RNG.h @@ -0,0 +1,23 @@ +#ifndef _RNG_H_ +#define _RNG_H_ + +inline uint32_t urng(uint32_t &state) { + state = state*1664525+1013904223; + return state; +} + +// World's crappiest gaussian (courtesy of dt10!) +inline float grng(uint32_t &state) { + uint32_t u=urng(state); + int32_t acc=0; + for(unsigned i=0;i<8;i++){ + acc += u&0xf; + u=u>>4; + } + // a four-bit uniform has mean 7.5 and variance ((15-0+1)^2-1)/12 = 85/4 + // sum of eight uniforms has mean 8*7.5=60 and variance of 8*85/4=170 + const float scale=0.07669649888473704; // == 1/sqrt(170) + return (acc-60.0f) * scale; +} + +#endif diff --git a/apps/POLite/izhikevich-sync/Run.cpp b/apps/POLite/izhikevich-sync/Run.cpp new file mode 100644 index 00000000..0693b8c3 --- /dev/null +++ b/apps/POLite/izhikevich-sync/Run.cpp @@ -0,0 +1,120 @@ +// SPDX-License-Identifier: BSD-2-Clause +#include "Izhikevich.h" + +#include +#include + +#include +#include +#include +#include + +inline double urand() { return (double) rand() / RAND_MAX; } + +int main(int argc, char**argv) +{ + if (argc != 2) { + printf("Specify edges file\n"); + exit(EXIT_FAILURE); + } + + // Read network + EdgeList net; + net.read(argv[1]); + printf("Max fan-out = %d\n", net.maxFanOut()); + printf("Min fan-out = %d\n", net.minFanOut()); + + // Connection to tinsel machine + HostLink hostLink; + + // Create POETS graph + PGraph graph; + + // Create nodes in POETS graph + for (uint32_t i = 0; i < net.numNodes; i++) { + PDeviceId id = graph.newDevice(); + assert(i == id); + } + + // Ratio of excitatory to inhibitory neurons + double excitatory = 0.8; + + // Mark each neuron as excitatory (or inhibitory) + srand(1); + bool* excite = new bool [net.numNodes]; + for (int i = 0; i < net.numNodes; i++) + excite[i] = urand() < excitatory; + + // Create connections in POETS graph + for (uint32_t i = 0; i < net.numNodes; i++) { + uint32_t numNeighbours = net.neighbours[i][0]; + for (uint32_t j = 0; 
j < numNeighbours; j++) { + float weight = excite[i] ? 0.5 * urand() : -urand(); + graph.addLabelledEdge(weight, i, 0, net.neighbours[i][j+1]); + } + } + + // Prepare mapping from graph to hardware + graph.map(); + + srand(2); + // Initialise devices + for (PDeviceId i = 0; i < graph.numDevices; i++) { + IzhikevichState* n = &graph.devices[i]->state; + n->rng = (int32_t) (urand()*((double) (1<<31))); + if (excite[i]) { + float re = (float) urand(); + n->a = 0.02; + n->b = 0.2; + n->c = -65+15*re*re; + n->d = 8-6*re*re; + n->Ir = 5; + } + else { + float ri = (float) urand(); + n->a = 0.02+0.08*ri; + n->b = 0.25-0.05*ri; + n->c = -65; + n->d = 2; + n->Ir = 2; + } + } + + // Write graph down to tinsel machine via HostLink + graph.write(&hostLink); + + // Load code and trigger execution + hostLink.boot("code.v", "data.v"); + hostLink.go(); + + // Timer + printf("Started\n"); + struct timeval start, finish, diff; + gettimeofday(&start, NULL); + + // Consume performance stats + politeSaveStats(&hostLink, "stats.txt"); + + int64_t sum = 0; + // Receive final distance to each vertex + for (uint32_t i = 0; i < graph.numDevices; i++) { + // Receive message + PMessage msg; + hostLink.recvMsg(&msg, sizeof(msg)); + if (i == 0) gettimeofday(&finish, NULL); + // Accumulate + sum += msg.payload.spikeCount; + } + + // Emit result + printf("Total spikes = %ld\n", sum); + + // Display time + timersub(&finish, &start, &diff); + double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0; + #ifndef POLITE_DUMP_STATS + printf("Time = %lf\n", duration); + #endif + + return 0; +} diff --git a/apps/POLite/pagerank-sync/Run.cpp b/apps/POLite/pagerank-sync/Run.cpp index 435a0750..1b0eb356 100644 --- a/apps/POLite/pagerank-sync/Run.cpp +++ b/apps/POLite/pagerank-sync/Run.cpp @@ -28,7 +28,8 @@ int main(int argc, char **argv) net.read(argv[1]); printf(" done\n"); - // Print max fan-out + // Print fan-out + printf("Min fan-out = %d\n", net.minFanOut()); printf("Max fan-out = 
%d\n", net.maxFanOut()); // Create nodes in POETS graph diff --git a/apps/POLite/ping-test/Run.cpp b/apps/POLite/ping-test/Run.cpp deleted file mode 100644 index 57ac5441..00000000 --- a/apps/POLite/ping-test/Run.cpp +++ /dev/null @@ -1,57 +0,0 @@ -// SPDX-License-Identifier: BSD-2-Clause -#include "ping.h" - -#include -#include -#include -#include -#include -#include - -int main(int argc, char**argv) -{ - // Connection to tinsel machine - HostLink hostLink; - - // Create POETS graph - PGraph graph; - - // Create single ping device - PDeviceId id = graph.newDevice(); - - // Prepare mapping from graph to hardware - graph.map(); - - // Write graph down to tinsel machine via HostLink - graph.write(&hostLink); - - // Load code and trigger execution - hostLink.boot("code.v", "data.v"); - hostLink.go(); - - printf("Ping started\n"); - - // Consume performance stats - //politeSaveStats(&hostLink, "stats.txt"); - - int test = 0; - int deviceAddr = graph.toDeviceAddr[id]; - printf("deviceAddr = %d\n", deviceAddr); - while (test < 100) { - // Send ping - PMessage sendMsg; - sendMsg.devId = getLocalDeviceId(deviceAddr); - sendMsg.payload.test = test; - hostLink.send(getThreadId(deviceAddr), 1, &sendMsg); - printf("Sent %d to device\n", sendMsg.payload.test); - - // Receive pong - PMessage recvMsg; - hostLink.recvMsg(&recvMsg, sizeof(recvMsg)); - printf("Received %d from device\n", recvMsg.payload.test); - - test++; - } - - return 0; -} diff --git a/apps/POLite/ping-test/ping.h b/apps/POLite/ping-test/ping.h deleted file mode 100644 index 3d4c17de..00000000 --- a/apps/POLite/ping-test/ping.h +++ /dev/null @@ -1,54 +0,0 @@ -// SPDX-License-Identifier: BSD-2-Clause -// Test messaging between host and threads. 
- -#ifndef _ping_H_ -#define _ping_H_ - -//#define POLITE_DUMP_STATS -//#define POLITE_COUNT_MSGS - -// Lightweight POETS frontend -#include - -struct PingMessage { - uint32_t test; -}; - -struct PingState { - // Number received to be sent back to host - uint32_t test; -}; - -struct PingDevice : PDevice { - // Called once by POLite at start of execution - void init() { - // Do nothing until a message is received from the host - *readyToSend = No; - } - - // Receive handler - inline void recv(PingMessage* msg, None* edge) { - // Store number from host to send back to host - s->test = msg->test; - *readyToSend = HostPin; - } - - // Send handler - inline void send(volatile PingMessage* msg) { - // Put received value back in message for host to check - msg->test = s->test; - *readyToSend = No; - } - - // Called by POLite when system becomes idle - inline bool step() { - return true; // Never terminate - } - - // Optionally send message to host on termination - inline bool finish(volatile PingMessage* msg) { - return false; - } -}; - -#endif diff --git a/apps/POLite/progrouters/Makefile b/apps/POLite/progrouters/Makefile new file mode 100644 index 00000000..9c0837be --- /dev/null +++ b/apps/POLite/progrouters/Makefile @@ -0,0 +1,7 @@ +# SPDX-License-Identifier: BSD-2-Clause +APP_CPP = ProgRoutersTest.cpp +APP_HDR = +RUN_CPP = Run.cpp +RUN_H = + +include ../util/polite.mk diff --git a/apps/POLite/progrouters/ProgRoutersTest.cpp b/apps/POLite/progrouters/ProgRoutersTest.cpp new file mode 100644 index 00000000..109565df --- /dev/null +++ b/apps/POLite/progrouters/ProgRoutersTest.cpp @@ -0,0 +1,43 @@ +// SPDX-License-Identifier: BSD-2-Clause +#include + +int main() +{ + // Get thread id + int me = tinselId(); + + // Sample outgoing message + volatile uint32_t* msgOut = (uint32_t*) tinselSendSlot(); + msgOut[0] = 0x10; + msgOut[1] = 0x20; + msgOut[2] = 0x30; + msgOut[3] = 0x40; + msgOut[4] = 0x50; + msgOut[5] = 0x60; + msgOut[6] = 0x70; + msgOut[7] = 0x80; + + // On thread 
0, send to key supplied by host + if (me == 0) { + tinselSetLen(1); + tinselWaitUntil(TINSEL_CAN_RECV); + volatile uint32_t* msgIn = (uint32_t*) tinselRecv(); + uint32_t key = msgIn[0]; + tinselFree(msgIn); + + tinselWaitUntil(TINSEL_CAN_SEND); + tinselKeySend(key, msgOut); + } + + // Print anything received + while (1) { + tinselWaitUntil(TINSEL_CAN_RECV); + volatile uint32_t* msgIn = (uint32_t*) tinselRecv(); + printf("%x %x %x %x %x %x %x %x\n", + msgIn[0], msgIn[1], msgIn[2], msgIn[3] + , msgIn[4], msgIn[5], msgIn[6], msgIn[7]); + tinselFree(msgIn); + } + + return 0; +} diff --git a/apps/POLite/progrouters/Run.cpp b/apps/POLite/progrouters/Run.cpp new file mode 100644 index 00000000..c2b27bd2 --- /dev/null +++ b/apps/POLite/progrouters/Run.cpp @@ -0,0 +1,47 @@ +// SPDX-License-Identifier: BSD-2-Clause +#include +#include + +int main(int argc, char **argv) +{ + // Connection to tinsel machine + HostLink hostLink; + + // Create routing tables + ProgRouterMesh mesh(2, 1); + + // Board (1, 0) + for (int i = 0; i < 2; i++) { + uint64_t mask = 1ul << i; + mesh.table[0][1].addMRM(1, 0, mask >> 32, mask, 0xf0f0); + } + uint32_t key01 = mesh.table[0][1].genKey(); + + // Board (0, 0) + for (int i = 0; i < 2; i++) { + uint64_t mask = 1ul << i; + mesh.table[0][0].addMRM(1, 0, mask >> 32, mask, 0xf0f0); + } + for (int i = 0; i < 2; i++) { + uint64_t mask = 1ul << i; + mesh.table[0][0].addMRM(1, 1, mask >> 32, mask, 0xf0f0); + } + mesh.table[0][0].addRR(2, key01); // East + uint32_t key00 = mesh.table[0][0].genKey(); + + // Transfer routing tables to FPGAs + mesh.write(&hostLink); + + // Load code and trigger execution + hostLink.boot("code.v", "data.v"); + hostLink.go(); + + // Send key + printf("Sending key %x\n", key00); + uint32_t msg[1 << TinselLogWordsPerMsg]; + msg[0] = key00; + hostLink.send(0, 1, msg); + + hostLink.dumpStdOut(); + return 0; +} diff --git a/apps/POLite/sssp-async/Run.cpp b/apps/POLite/sssp-async/Run.cpp index c7953795..37ffcb4e 100644 --- 
a/apps/POLite/sssp-async/Run.cpp +++ b/apps/POLite/sssp-async/Run.cpp @@ -20,8 +20,9 @@ int main(int argc, char**argv) EdgeList net; net.read(argv[1]); - // Print max fan-out + // Print fan-out printf("Max fan-out = %d\n", net.maxFanOut()); + printf("Min fan-out = %d\n", net.minFanOut()); // Connection to tinsel machine HostLink hostLink; @@ -86,7 +87,9 @@ int main(int argc, char**argv) // Display time timersub(&finish, &start, &diff); double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0; + #ifndef POLITE_DUMP_STATS printf("Time = %lf\n", duration); + #endif return 0; } diff --git a/apps/POLite/sssp-pc/Makefile b/apps/POLite/sssp-pc/Makefile new file mode 100644 index 00000000..2ddbeca3 --- /dev/null +++ b/apps/POLite/sssp-pc/Makefile @@ -0,0 +1,11 @@ +# SPDX-License-Identifier: BSD-2-Clause +all: sssp + +INC=../../../include + +sssp: sssp.cpp + g++ -I$(INC) -O3 sssp.cpp -o sssp + +.PHONY: clean +clean: + rm sssp diff --git a/apps/POLite/sssp-pc/sssp.cpp b/apps/POLite/sssp-pc/sssp.cpp new file mode 100644 index 00000000..9012f49e --- /dev/null +++ b/apps/POLite/sssp-pc/sssp.cpp @@ -0,0 +1,92 @@ +// SPDX-License-Identifier: BSD-2-Clause +#include +#include +#include +#include +#include +#include +#include + +int main(int argc, char**argv) +{ + if (argc != 2) { + printf("Specify edges file\n"); + exit(EXIT_FAILURE); + } + + // Read network + EdgeList net; + net.read(argv[1]); + + // Create weights + srand(1); + uint32_t** weights = new uint32_t* [net.numNodes]; + for (uint32_t i = 0; i < net.numNodes; i++) { + uint32_t numNeighbours = net.neighbours[i][0]; + weights[i] = new uint32_t [numNeighbours]; + for (uint32_t j = 0; j < numNeighbours; j++) { + weights[i][j] = rand() % 100; + } + } + + // Create states + uint32_t* dist = new uint32_t [net.numNodes]; + int* queue = new int [net.numNodes]; + int queueSize = 0; + int* queueNext = new int [net.numNodes]; + int queueSizeNext = 0; + bool* inQueue = new bool [net.numNodes]; + for (int i = 0; i < 
net.numNodes; i++) { + inQueue[i] = false; + dist[i] = 0x7fffffff; + } + + // Set source vertex + dist[2] = 0; + queue[queueSize++] = 2; + + // Start timer + printf("Started\n"); + struct timeval start, finish, diff; + gettimeofday(&start, NULL); + + int iters = 0; + while (queueSize > 0) { + for (int i = 0; i < queueSize; i++) { + uint32_t me = queue[i]; + uint32_t numNeighbours = net.neighbours[me][0]; + for (uint32_t j = 0; j < numNeighbours; j++) { + uint32_t neighbour = net.neighbours[me][j+1]; + uint32_t newDist = dist[me] + weights[me][j]; + if (newDist < dist[neighbour]) { + dist[neighbour] = newDist; + if (!inQueue[neighbour]) { + queueNext[queueSizeNext++] = neighbour; + inQueue[neighbour] = true; + } + } + } + } + queueSize = queueSizeNext; + queueSizeNext = 0; + int32_t* tmp = queue; queue = queueNext; queueNext = tmp; + for (int i = 0; i < queueSize; i++) inQueue[queue[i]] = false; + iters++; + } + + // Stop timer + gettimeofday(&finish, NULL); + + uint64_t sum = 0; + for (int i = 0; i < net.numNodes; i++) + sum += dist[i]; + printf("Sum of distances = %ld\n", sum); + printf("Iterations = %d\n", iters); + + // Display time + timersub(&finish, &start, &diff); + double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0; + printf("Time = %lf\n", duration); + + return 0; +} diff --git a/apps/POLite/sssp-sync/Run.cpp b/apps/POLite/sssp-sync/Run.cpp index c7953795..37ffcb4e 100644 --- a/apps/POLite/sssp-sync/Run.cpp +++ b/apps/POLite/sssp-sync/Run.cpp @@ -20,8 +20,9 @@ int main(int argc, char**argv) EdgeList net; net.read(argv[1]); - // Print max fan-out + // Print fan-out printf("Max fan-out = %d\n", net.maxFanOut()); + printf("Min fan-out = %d\n", net.minFanOut()); // Connection to tinsel machine HostLink hostLink; @@ -86,7 +87,9 @@ int main(int argc, char**argv) // Display time timersub(&finish, &start, &diff); double duration = (double) diff.tv_sec + (double) diff.tv_usec / 1000000.0; + #ifndef POLITE_DUMP_STATS printf("Time = %lf\n", 
duration); + #endif return 0; } diff --git a/apps/POLite/util/genld.sh b/apps/POLite/util/genld.sh index 0350108e..474e5694 100755 --- a/apps/POLite/util/genld.sh +++ b/apps/POLite/util/genld.sh @@ -18,7 +18,7 @@ OUTPUT_ARCH( "riscv" ) MEMORY { instrs : ORIGIN = $MaxBootImageBytes, LENGTH = $MaxInstrBytes - globals : ORIGIN = $DRAMBase, LENGTH = $DRAMGlobalsLength + globals : ORIGIN = $DRAMBase, LENGTH = $POLiteDRAMGlobalsLength } SECTIONS diff --git a/apps/POLite/util/polite.mk b/apps/POLite/util/polite.mk index a1d96f83..4abe32ee 100644 --- a/apps/POLite/util/polite.mk +++ b/apps/POLite/util/polite.mk @@ -51,7 +51,7 @@ $(HL)/%.o: $(BUILD)/run: $(RUN_CPP) $(RUN_H) $(HL)/*.o g++ -std=c++11 -O2 -I $(INC) -I $(HL) -o $(BUILD)/run $(RUN_CPP) $(HL)/*.o \ - -lmetis -fno-exceptions + -lmetis -fno-exceptions -fopenmp $(BUILD)/sim: $(RUN_CPP) $(RUN_H) $(HL)/sim/*.o g++ -O2 -I $(INC) -I $(HL) -o $(BUILD)/sim $(RUN_CPP) $(HL)/sim/*.o \ diff --git a/apps/POLite/util/sumstats.awk b/apps/POLite/util/sumstats.awk index 4d037cca..719699aa 100755 --- a/apps/POLite/util/sumstats.awk +++ b/apps/POLite/util/sumstats.awk @@ -10,10 +10,12 @@ BEGIN { cacheCount = 0; coreCount = 0; cacheLineSize = 32; - intraThreadSendCount = 0; - interThreadSendCount = 0; - interBoardSendCount = 0; - fmax = 225000000; + msgsReceived = 0; + msgsSent = 0; + progRouterSent = 0; + progRouterSentInter = 0; + blockedSends = 0; + fmax = 215000000; if (boardsX == "" || boardsY == "") { boardsX = 3; boardsY = 2; @@ -48,13 +50,18 @@ BEGIN { coreCount = coreCount+1; } # Per-thread message counts - else if (match($0, /(.*) LS:(.*),TS:(.*),BS:(.*)/, fields)) { - ls=strtonum("0x"fields[2]); - ts=strtonum("0x"fields[3]); - bs=strtonum("0x"fields[4]); - intraThreadSendCount = intraThreadSendCount+ls; - interThreadSendCount = interThreadSendCount+ts; - interBoardSendCount = interBoardSendCount+bs; + else if (match($0, /(.*) MS:(.*),MR:(.*),PR:(.*),PRI:(.*),BL:(.*)/, + fields)) { + ms=strtonum("0x"fields[2]); + 
mr=strtonum("0x"fields[3]); + pr=strtonum("0x"fields[4]); + pri=strtonum("0x"fields[5]); + bl=strtonum("0x"fields[6]); + msgsSent = msgsSent + ms; + msgsReceived = msgsReceived + mr; + progRouterSent = progRouterSent + pr; + progRouterSentInter = progRouterSentInter + pri; + blockedSends = blockedSends + bl; } } } @@ -70,7 +77,14 @@ END { bytes = cacheLineSize * (missCount + writebackCount) print "Off-chip memory (GBytes/s): ", ((1/time) * bytes)/1000000000 print "CPU util (%): ", (1-(cpuIdleCount/cycleCount))*100 - print "Intra-thread messages: ", intraThreadSendCount - print "Inter-thread messages: ", interThreadSendCount - print "Inter-board messages: ", interBoardSendCount + print "Msgs received: ", msgsReceived + print "Msgs sent by threads: ", msgsSent + print "Msgs injected by ProgRouter:", progRouterSent + print "Inter-board msgs:", progRouterSentInter + print "Blocked sends:", blockedSends + print "" + print "Notes:" + print " * ProgRouter injections includes inter-board msgs" + print " * Memory bandwidth does not include lookups by ProgRouter" + print " * If runtime > 40s approx, hit/miss counts may overflow" } diff --git a/config.py b/config.py index 74c7f63e..6500be58 100755 --- a/config.py +++ b/config.py @@ -161,6 +161,16 @@ def quoted(s): return "'\"" + s + "\"'" p["SRAMLogMaxInFlight"] = 5 p["SRAMStoreLatency"] = 2 +# Programmable router parameters: +p["LogRoutingEntryLen"] = 5 # Number of beats in a routing table entry +p["ProgRouterMaxBurst"] = 4 +p["FetcherLogIndQueueSize"] = 1 +p["FetcherLogBeatBufferSize"] = 5 +p["FetcherLogFlitBufferSize"] = 5 +p["FetcherLogMsgsPerFlitBuffer"] = ( + p["FetcherLogFlitBufferSize"] - p["LogMaxFlitsPerMsg"]) +p["FetcherMsgsPerFlitBuffer"] = 2 ** p["FetcherLogMsgsPerFlitBuffer"] + # Enable performance counters p["EnablePerfCount"] = True @@ -178,7 +188,7 @@ def quoted(s): return "'\"" + s + "\"'" p["UseCustomAccelerator"] = False # Clock frequency (in MHz) -p["ClockFreq"] = 225 +p["ClockFreq"] = 215 
#============================================================================== # Derived Parameters @@ -300,6 +310,7 @@ def quoted(s): return "'\"" + s + "\"'" # Cores per board p["LogCoresPerBoard"] = p["LogCoresPerMailbox"] + p["LogMailboxesPerBoard"] +p["LogCoresPerBoard1"] = p["LogCoresPerBoard"] + 1 p["CoresPerBoard"] = 2**p["LogCoresPerBoard"] # Threads per core @@ -356,10 +367,21 @@ def quoted(s): return "'\"" + s + "\"'" # DRAM base and length p["DRAMBase"] = 3 * (2 ** p["LogBytesPerSRAM"]) p["DRAMGlobalsLength"] = 2 ** (p["LogBytesPerDRAM"] - 1) - p["DRAMBase"] +p["POLiteDRAMGlobalsLength"] = 2 ** 14 +p["POLiteProgRouterBase"] = p["DRAMBase"] + p["POLiteDRAMGlobalsLength"] +p["POLiteProgRouterLength"] = (p["DRAMGlobalsLength"] - + p["POLiteDRAMGlobalsLength"]) + +# POLite globals # Number of FPGA boards per box (including bridge board) p["BoardsPerBox"] = p["MeshXLenWithinBox"] * p["MeshYLenWithinBox"] + 1 +# Parameters for programmable routers +# (and the routing-record fetchers they contain) +p["FetchersPerProgRouter"] = 4 + p["MailboxMeshXLen"] +p["LogFetcherFlitBufferSize"] = 5 + #============================================================================== # Main #============================================================================== diff --git a/de5/S5_DDR3_QSYS.qsys b/de5/S5_DDR3_QSYS.qsys index 0695a737..4d8e3a49 100644 --- a/de5/S5_DDR3_QSYS.qsys +++ b/de5/S5_DDR3_QSYS.qsys @@ -891,7 +891,7 @@ - + @@ -1214,7 +1214,7 @@ - + diff --git a/doc/PIP-0024-global-multicast.md b/doc/PIP-0024-global-multicast.md new file mode 100644 index 00000000..65105f71 --- /dev/null +++ b/doc/PIP-0024-global-multicast.md @@ -0,0 +1,226 @@ +# PIP-0024: Programmable routers and global multicast + +Author: Matthew Naylor + +This proposal replaces PIP 21. + +## Proposal + +We propose to generalise the destination component of a message so +that it can be (1) a thread id; or (2) a **routing key**. 
A message, +sent by a thread, containing a routing key as a destination will go to +a **per-board router** on the same FPGA. The router will use the key +as an index into a DRAM-based routing table and automatically +propagate the message towards all the destinations associated with +that key. + +## Motivation/Rationale + +PIP 22 resulted in a *mailbox-level* multicast feature, implemented in +Tinsel 0.7. It enables each thread to send a message +simultaneously to any subset of the 64 threads on a destination +mailbox. It works well when graphs exhibit good locality, with +destination vertices often collocated on the same mailbox. + +However, it has a few drawbacks: + + 1. Costly graph partitioning algorithms are needed to identify + locality. This is problematic for graphs with billions of edges + and vertices, because mapping time may significantly outweigh + execution time. (Indeed, graph partitioning is itself an + interesting application for the hardware.) + + 2. In some graphs there are limits to how well destination vertices + can be collocated after partitioning. For example, *small-world + graphs* contain some extremely large, highly-distributed fanouts. + +A *global multicast* feature should reduce the need to find optimal +partitions for very large graphs, and support distributed fanouts. It +should also move work away from the cores and into the hardware +routers: the softswitch no longer needs to iterate over the outgoing +edges of a pin. While providing these improvements, it is also +important to maintain the advantages of the existing mailbox-level +multicast, for applications in which the mapping time is not a +concern. + +## Functional overview + +A **routing key** is a 32-bit value consisting of a *ram id*, an +*address*, and a *size*: + +```sv +// 32-bit routing key (MSB to LSB) +typedef struct { + // Which off-chip RAM on this board?
+ Bit#(`LogDRAMsPerBoard) ram; + // Pointer to array of routing beats containing routing records + Bit#(`LogBeatsPerDRAM) ptr; + // Number of beats in the array + Bit#(`LogRoutingEntryLen) numBeats; +} RoutingKey; +``` + +When a message reaches the per-board router, the `ptr` field of the +routing key is used as an index into DRAM, where a sequence of 256-bit +**routing beats** are found. The `numBeats` field of the routing key +indicates how many contiguous routing beats there are. Knowing the +size before the lookup makes the hardware simpler and more efficient, +e.g. it can avoid blocking on responses and issue a burst of an +appropriate size. The value of `numBeats` may be zero. + +A routing beat consists of a *size* and a sequence of five 48-bit +*routing chunks*: + +```sv +// 256-bit routing beat (aligned, MSB to LSB) +typedef struct { + // Number of routing records present in this beat + Bit#(16) size; + // Five 48-bit record chunks + Vector#(5, Bit#(48)) chunks; +} RoutingBeat; +``` + +The *size* must lie in the range 1 to 5 inclusive (0 is disallowed). +A **routing record** consists of one or two routing chunks, depending +on the **record type**. + +All byte orderings are little endian. For example, the order of bytes +in a routing beat is as follows. + +``` +Byte Contents +---- -------- +31: Upper byte of length (i.e. number of records in beat) +30: Lower byte of length +29: Upper byte of first chunk + ... +24: Lower byte of first chunk +23: Upper byte of second chunk + ... +18: Lower byte of second chunk +17: Upper byte of third chunk + ... +12: Lower byte of third chunk +11: Upper byte of fourth chunk + ... + 6: Lower byte of fourth chunk + 5: Upper byte of fifth chunk + ... + 0: Lower byte of fifth chunk +``` + +Clearly, both routing keys and routing beats have a maximum size. +However, in principle there is no limit to the number of records +associated with a key, due to the possibility of *indirection records* +(see below). 
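The byte layout above can be sketched in host-side C++ as follows. This is only an illustrative sketch, not code from the Tinsel repository: the names `RoutingBeatBytes` and `packBeat` are hypothetical, and it assumes a host that builds each 256-bit beat as a 32-byte array before writing it to DRAM. Chunk 0 corresponds to the "first chunk" (bytes 29 down to 24) and chunk 4 to the "fifth chunk" (bytes 5 down to 0).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical container for one 256-bit routing beat (32 bytes).
struct RoutingBeatBytes {
  uint8_t bytes[32];
};

// Pack a record count and five 48-bit chunks into a beat, following the
// byte table in the proposal: the count occupies bytes 31..30 and chunk c
// occupies the six bytes starting at offset 24 - 6*c (little endian).
// Only the low 48 bits of each chunk value are used.
inline RoutingBeatBytes packBeat(uint16_t numRecords,
                                 const uint64_t chunks[5]) {
  // The proposal requires between 1 and 5 records per beat
  assert(numRecords >= 1 && numRecords <= 5);
  RoutingBeatBytes beat;
  std::memset(beat.bytes, 0, sizeof(beat.bytes));
  beat.bytes[30] = numRecords & 0xff;  // Lower byte of length
  beat.bytes[31] = numRecords >> 8;    // Upper byte of length
  for (int c = 0; c < 5; c++) {
    int base = 24 - 6 * c;             // Lowest byte of chunk c
    for (int b = 0; b < 6; b++)
      beat.bytes[base + b] = (chunks[c] >> (8 * b)) & 0xff;
  }
  return beat;
}
```

A 96-bit record would occupy two adjacent chunks. How its two halves are ordered across those chunks is not stated in this part of the proposal, so the sketch leaves that to the caller.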
+ +There are five types of routing record, defined below. + +**48-bit Unicast Router-to-Mailbox (URM1).** + +```sv +typedef struct { + // Record type (URM1 == 0) + Bit#(3) tag; + // Mailbox destination + Bit#(4) mbox; + // Mailbox-local thread identifier + Bit#(6) thread; + // Unused + Bit#(3) unused; + // Local key. The first word of the message + // payload is overwritten with this. + Bit#(32) localKey; +} URM1Record; +``` + +The `localKey` can be used for anything, but might encode the +destination thread-local device identifier, or edge identifier, or +both. The `mbox` field is currently 4 bits (two Y bits followed by +two X bits), but there are spare bits available to increase the size +of this field in future if necessary. + +**96-bit Unicast Router-to-Mailbox (URM2).** + +```sv +typedef struct { + // Record type (URM2 == 1) + Bit#(3) tag; + // Mailbox destination + Bit#(4) mbox; + // Mailbox-local thread identifier + Bit#(6) thread; + // Currently unused + Bit#(19) unused; + // Local key. The first two words of the message + // payload is overwritten with this. + Bit#(64) localKey; +} URM2Record; +``` + +This is the same as a URM1 record except the local key is 64-bits in +size. + +**48-bit Router-to-Router (RR).** + +```sv +typedef struct { + // Record type (RR == 2) + Bit#(3) tag; + // Direction (N,S,E,W == 0,1,2,3) + Bit#(2) dir; + // Currently unused + Bit#(11) unused; + // New 32-bit routing key that will replace the one in the + // current message for the next hop of the message's journey + Bit#(32) newKey; +} RRRecord; +``` + +The `newKey` field will replace the key in the current message for the +next hop of the message's journey. Introducing a new key at each hop +simplifies the mapping process (keeping it quick). + +**96-bit Multicast Router-to-Mailbox (MRM).** + +```sv +typedef struct { + // Record type (MRM == 3) + Bit#(3) tag; + // Mailbox destination + Bit#(4) mbox; + // Currently unused + Bit#(9) unused; + // Local key. 
The least-significant half-word + // of the message is replaced with this + Bit#(16) localKey; + // Mailbox-local destination mask + Bit#(64) destMask; +} MRMRecord; +``` + +**48-bit Indirection (IND).** + +```sv +// 48-bit Indirection (IND) record +// Note the restrictions on IND records: +// 1. At most one IND record per key lookup +// 2. A max-sized key lookup must contain an IND record +typedef struct { + // Record type (IND == 4) + Bit#(3) tag; + // Currently unused + Bit#(13) unused; + // New 32-bit routing key for new set of records on current router + Bit#(32) newKey; +} INDRecord; +``` + +Indirection records can be used to handle large fanouts, which exceed +the number of bits available in the size portion of the routing key. + +## Impact + +Since use of routing keys is optional, existing applications will +continue to work unmodified. diff --git a/doc/custom/ExampleAccelerator.sv b/doc/custom/ExampleAccelerator.sv index 34a97fc2..acc73455 100644 --- a/doc/custom/ExampleAccelerator.sv +++ b/doc/custom/ExampleAccelerator.sv @@ -5,6 +5,7 @@ typedef struct packed { logic acc; + logic isKey; logic host; logic hostDir; logic [`TinselMeshYBits-1:0] boardY; diff --git a/doc/custom/README.md b/doc/custom/README.md index c380f9c9..fde29010 100644 --- a/doc/custom/README.md +++ b/doc/custom/README.md @@ -74,6 +74,7 @@ custom accelerator or a mailbox. 
```sv typedef struct packed { logic acc; + logic isKey; logic host; logic hostDir; logic [`TinselMeshYBits-1:0] boardY; diff --git a/doc/figures/fpga.png b/doc/figures/fpga.png index f4d60fbb..71a4c97f 100644 Binary files a/doc/figures/fpga.png and b/doc/figures/fpga.png differ diff --git a/doc/figures/fpga.tex b/doc/figures/fpga.tex index 02922a0f..9eafda95 100644 --- a/doc/figures/fpga.tex +++ b/doc/figures/fpga.tex @@ -14,15 +14,6 @@ \definecolor{myorange}{RGB}{197,90,17} \definecolor{mygreen}{RGB}{84,130,53} - \node[fill=gray!20,rounded corners, - minimum width=6.3cm,minimum height=4.8cm] (border0) - at (4.5,2.0) {}; - \node[fill=white,rounded corners, - minimum width=5.8cm,minimum height=4.1cm] (border1) - at (4.5,1.8) {}; - \node[fill=none,color=black] at (4.5,6.4) - {\footnotesize{inter-FPGA reliable links}}; - \node[fill=myblue,rounded corners] (tile00) at (0,0) {\footnotesize{tile}}; \node[rectangle,sharp corners,fill=black] (router00) @@ -123,16 +114,16 @@ \draw[arrows=-,color=mygreen] (tile13) to (mem13); \node[rounded corners,fill=mygreen] - (ram0) at (1.7,-1.6) {\footnotesize{off-chip RAM}}; + (ram0) at (1.3,-1.8) {\footnotesize{off-chip RAM}}; - \draw[arrows=-,color=mygreen] (mem00) to ([xshift=-7mm]ram0.north); - \draw[arrows=-,color=mygreen] (mem01) to ([xshift=-5mm]ram0.north); - \draw[arrows=-,color=mygreen] (mem02) to ([xshift=-3mm]ram0.north); - \draw[arrows=-,color=mygreen] (mem03) to ([xshift=-1mm]ram0.north); - \draw[arrows=-,color=mygreen] (mem10) to ([xshift=7mm]ram0.north); - \draw[arrows=-,color=mygreen] (mem11) to ([xshift=5mm]ram0.north); - \draw[arrows=-,color=mygreen] (mem12) to ([xshift=3mm]ram0.north); - \draw[arrows=-,color=mygreen] (mem13) to ([xshift=1mm]ram0.north); + \draw[arrows=-,color=mygreen] (mem00) to ([xshift=-3mm]ram0.north); + \draw[arrows=-,color=mygreen] (mem01) to ([xshift=-1mm]ram0.north); + \draw[arrows=-,color=mygreen] (mem02) to ([xshift=1mm]ram0.north); + \draw[arrows=-,color=mygreen] (mem03) to 
([xshift=3mm]ram0.north); + \draw[arrows=-,color=mygreen] (mem10) to ([xshift=11mm]ram0.north); + \draw[arrows=-,color=mygreen] (mem11) to ([xshift=9mm]ram0.north); + \draw[arrows=-,color=mygreen] (mem12) to ([xshift=7mm]ram0.north); + \draw[arrows=-,color=mygreen] (mem13) to ([xshift=5mm]ram0.north); \coordinate[] (south0b) at (4.3, -0.9) {}; \coordinate[] (south0a) at (-0.83, -0.9) {}; @@ -282,16 +273,16 @@ \draw[arrows=-,color=mygreen] (tile33) to (memb13); \node[rounded corners,fill=mygreen] - (ram1) at (7.57,-1.6) {\footnotesize{off-chip RAM}}; + (ram1) at (7.97,-1.8) {\footnotesize{off-chip RAM}}; - \draw[arrows=-,color=mygreen] (memb00) to ([xshift=-7mm]ram1.north); - \draw[arrows=-,color=mygreen] (memb01) to ([xshift=-5mm]ram1.north); - \draw[arrows=-,color=mygreen] (memb02) to ([xshift=-3mm]ram1.north); - \draw[arrows=-,color=mygreen] (memb03) to ([xshift=-1mm]ram1.north); - \draw[arrows=-,color=mygreen] (memb10) to ([xshift=7mm]ram1.north); - \draw[arrows=-,color=mygreen] (memb11) to ([xshift=5mm]ram1.north); - \draw[arrows=-,color=mygreen] (memb12) to ([xshift=3mm]ram1.north); - \draw[arrows=-,color=mygreen] (memb13) to ([xshift=1mm]ram1.north); + \draw[arrows=-,color=mygreen] (memb00) to ([xshift=-11mm]ram1.north); + \draw[arrows=-,color=mygreen] (memb01) to ([xshift=-9mm]ram1.north); + \draw[arrows=-,color=mygreen] (memb02) to ([xshift=-7mm]ram1.north); + \draw[arrows=-,color=mygreen] (memb03) to ([xshift=-5mm]ram1.north); + \draw[arrows=-,color=mygreen] (memb10) to ([xshift=3mm]ram1.north); + \draw[arrows=-,color=mygreen] (memb11) to ([xshift=1mm]ram1.north); + \draw[arrows=-,color=mygreen] (memb12) to ([xshift=-1mm]ram1.north); + \draw[arrows=-,color=mygreen] (memb13) to ([xshift=-3mm]ram1.north); @@ -359,33 +350,20 @@ \coordinate[] (south2c) at (4.7, -2.3) {}; \draw[arrows=-,color=black] (south2b) to (south2c); - \draw[arrows=-,color=black] (router00.west) to - ([xshift=-2.3mm]router00.west); - \draw[arrows=-,color=black] (router01.west) to - 
([xshift=-2.3mm]router01.west); - \draw[arrows=-,color=black] (router02.west) to - ([xshift=-2.3mm]router02.west); - \draw[arrows=-,color=black] (router03.west) to - ([xshift=-2.3mm]router03.west); - - \draw[arrows=-,color=black] (router30.east) to - ([xshift=14.4mm]router30.east); - \draw[arrows=-,color=black] (router31.east) to - ([xshift=14.4mm]router31.east); - \draw[arrows=-,color=black] (router32.east) to - ([xshift=14.4mm]router32.east); - \draw[arrows=-,color=black] (router33.east) to - ([xshift=14.4mm]router33.east); - - \draw[arrows=-,color=black] (router03.north) to - ([yshift=2mm]router03.north); - \draw[arrows=-,color=black] (router13.north) to - ([yshift=2mm]router13.north); - \draw[arrows=-,color=black] (router23.north) to - ([yshift=2mm]router23.north); - \draw[arrows=-,color=black] (router33.north) to - ([yshift=2mm]router33.north); + \node[rounded corners,fill=myorange,minimum height=0.5cm] (boardrouter) + at (4.63cm,-1.8cm) {\footnotesize{board}\\[-1mm]\footnotesize{router}}; + + \node[rounded corners,fill=gray!20, text=black,minimum width=5.25cm] (links) + at (4.63cm, -3.2cm) {\footnotesize{inter-FPGA reliable links}}; + + \draw[arrows=-,color=black] (links.north) to (boardrouter.south); + + % Is the board router connected to off-chip RAM? 
+ \draw[arrows=-,color=mygreen] (ram0.east) to (boardrouter.west); + \draw[arrows=-,color=mygreen] (ram1.west) to (boardrouter.east); + \end{tikzpicture} + \end{document} diff --git a/doc/figures/logo.png b/doc/figures/logo.png new file mode 100644 index 00000000..8271002b Binary files /dev/null and b/doc/figures/logo.png differ diff --git a/hostlink/DebugLink.cpp b/hostlink/DebugLink.cpp index f838441d..0031969c 100644 --- a/hostlink/DebugLink.cpp +++ b/hostlink/DebugLink.cpp @@ -60,10 +60,10 @@ void DebugLink::putPacket(int x, int y, BoardCtrlPkt* pkt) } // Constructor -DebugLink::DebugLink(uint32_t numBoxesX, uint32_t numBoxesY) +DebugLink::DebugLink(DebugLinkParams p) { - boxMeshXLen = numBoxesX; - boxMeshYLen = numBoxesY; + boxMeshXLen = p.numBoxesX; + boxMeshYLen = p.numBoxesY; get_tryNextX = 0; get_tryNextY = 0; @@ -105,11 +105,11 @@ DebugLink::DebugLink(uint32_t numBoxesX, uint32_t numBoxesY) "But is has a box X coordinate of %i\n", thisBoxX); exit(EXIT_FAILURE); } - if ((thisBoxX+numBoxesX-1) >= TinselBoxMeshXLen || - (thisBoxY+numBoxesY-1) >= TinselBoxMeshYLen) { + if ((thisBoxX+p.numBoxesX-1) >= TinselBoxMeshXLen || + (thisBoxY+p.numBoxesY-1) >= TinselBoxMeshYLen) { fprintf(stderr, "Requested box sub-mesh of size %ix%i " "is not valid from box %s\n", - numBoxesX, numBoxesY, hostname); + p.numBoxesX, p.numBoxesY, hostname); exit(EXIT_FAILURE); } @@ -187,6 +187,8 @@ DebugLink::DebugLink(uint32_t numBoxesX, uint32_t numBoxesY) if (y == 0) pkt.payload[2] |= 2; if (thisBoxX == 0 && boxMeshXLen == 1) pkt.payload[2] |= 4; if (thisBoxX == 1 && boxMeshXLen == 1) pkt.payload[2] |= 8; + // Reserve extra send slot? + pkt.payload[2] |= p.useExtraSendSlot ? 
0x10 : 0; // Send commands to each board for (int b = 0; b < TinselBoardsPerBox; b++) { pkt.linkId = b; diff --git a/hostlink/DebugLink.h b/hostlink/DebugLink.h index fd3c8291..18d352dc 100644 --- a/hostlink/DebugLink.h +++ b/hostlink/DebugLink.h @@ -8,6 +8,13 @@ #include "BoardCtrl.h" #include "DebugLinkFormat.h" +// DebugLinkH parameters +struct DebugLinkParams { + uint32_t numBoxesX; + uint32_t numBoxesY; + bool useExtraSendSlot; +}; + class DebugLink { // Location of this box with full box mesh @@ -46,7 +53,7 @@ class DebugLink { int meshYLen; // Constructor - DebugLink(uint32_t numBoxesX, uint32_t numBoxesY); + DebugLink(DebugLinkParams params); // On given board, set destination core and thread void setDest(uint32_t boardX, uint32_t boardY, diff --git a/hostlink/HostLink.cpp b/hostlink/HostLink.cpp index aa4d3af6..dd896f4d 100644 --- a/hostlink/HostLink.cpp +++ b/hostlink/HostLink.cpp @@ -60,9 +60,11 @@ static int connectToPCIeStream(const char* socketPath) } // Internal constructor -void HostLink::constructor(uint32_t numBoxesX, uint32_t numBoxesY) +void HostLink::constructor(HostLinkParams p) { - if (numBoxesX > TinselBoxMeshXLen || numBoxesY > TinselBoxMeshYLen) { + useExtraSendSlot = p.useExtraSendSlot; + + if (p.numBoxesX > TinselBoxMeshXLen || p.numBoxesY > TinselBoxMeshYLen) { fprintf(stderr, "Number of boxes requested exceeds those available\n"); exit(EXIT_FAILURE); } @@ -92,7 +94,11 @@ void HostLink::constructor(uint32_t numBoxesX, uint32_t numBoxesY) #endif // Create DebugLink - debugLink = new DebugLink(numBoxesX, numBoxesY); + DebugLinkParams debugLinkParams; + debugLinkParams.numBoxesX = p.numBoxesX; + debugLinkParams.numBoxesY = p.numBoxesY; + debugLinkParams.useExtraSendSlot = p.useExtraSendSlot; + debugLink = new DebugLink(debugLinkParams); // Set board mesh dimensions meshXLen = debugLink->meshXLen; @@ -145,12 +151,25 @@ HostLink::HostLink() int x = str ? atoi(str) : 1; str = getenv("HOSTLINK_BOXES_Y"); int y = str ? 
atoi(str) : 1; - constructor(x, y); + HostLinkParams params; + params.numBoxesX = x; + params.numBoxesY = y; + params.useExtraSendSlot = false; + constructor(params); } HostLink::HostLink(uint32_t numBoxesX, uint32_t numBoxesY) { - constructor(numBoxesX, numBoxesY); + HostLinkParams params; + params.numBoxesX = numBoxesX; + params.numBoxesY = numBoxesY; + params.useExtraSendSlot = false; + constructor(params); +} + +HostLink::HostLink(HostLinkParams params) +{ + constructor(params); } // Destructor @@ -218,8 +237,9 @@ void HostLink::fromAddr(uint32_t addr, uint32_t* meshX, uint32_t* meshY, *meshY = addr; } -// Inject a message via PCIe (blocking by default) -bool HostLink::send(uint32_t dest, uint32_t numFlits, void* payload, bool block) +// Internal helper for sending messages +bool HostLink::sendHelper(uint32_t dest, uint32_t numFlits, void* payload, + bool block, uint32_t key) { assert(useSendBuffer ? block : true); @@ -242,7 +262,7 @@ bool HostLink::send(uint32_t dest, uint32_t numFlits, void* payload, bool block) buffer[0] = dest; buffer[1] = 0; buffer[2] = (numFlits-1) << 24; - buffer[3] = 0; + buffer[3] = key; // Fill in message payload memcpy(&buffer[4], payload, numFlits*16); @@ -285,6 +305,13 @@ bool HostLink::send(uint32_t dest, uint32_t numFlits, void* payload, bool block) } } + +// Inject a message via PCIe (blocking by default) +bool HostLink::send(uint32_t dest, uint32_t numFlits, void* msg, bool block) +{ + return sendHelper(dest, numFlits, msg, block, 0); +} + // Flush the send buffer void HostLink::flush() { @@ -298,7 +325,28 @@ void HostLink::flush() // Try to send a message (non-blocking, returns true on success) bool HostLink::trySend(uint32_t dest, uint32_t numFlits, void* msg) { - return send(dest, numFlits, msg, false); + return sendHelper(dest, numFlits, msg, false, 0); +} + +// Send a message using routing key (blocking by default) +bool HostLink::keySend(uint32_t key, uint32_t numFlits, + void* msg, bool block) +{ + uint32_t useRoutingKey 
= 1 << ( + TinselLogThreadsPerCore + TinselLogCoresPerMailbox + + TinselMailboxMeshXBits + TinselMailboxMeshYBits + + TinselMeshXBits + TinselMeshYBits + 2); + return sendHelper(useRoutingKey, numFlits, msg, block, key); +} + +// Try to send using routing key (non-blocking, returns true on success) +bool HostLink::keyTrySend(uint32_t key, uint32_t numFlits, void* msg) +{ + uint32_t useRoutingKey = 1 << ( + TinselLogThreadsPerCore + TinselLogCoresPerMailbox + + TinselMailboxMeshXBits + TinselMailboxMeshYBits + + TinselMeshXBits + TinselMeshYBits + 2); + return sendHelper(useRoutingKey, numFlits, msg, false, key); } // Receive a message via PCIe (blocking) diff --git a/hostlink/HostLink.h b/hostlink/HostLink.h index 81c9b32f..41d78303 100644 --- a/hostlink/HostLink.h +++ b/hostlink/HostLink.h @@ -16,6 +16,13 @@ #define PCIESTREAM "pciestream" #define PCIESTREAM_SIM "tinsel.b-1.1" +// HostLink parameters +struct HostLinkParams { + uint32_t numBoxesX; + uint32_t numBoxesY; + bool useExtraSendSlot; +}; + class HostLink { // Lock file for acquring exclusive access to PCIeStream int lockFile; @@ -33,8 +40,15 @@ class HostLink { char* sendBuffer; int sendBufferLen; + // Request an extra send slot when bringing up Tinsel FPGAs + bool useExtraSendSlot; + // Internal constructor - void constructor(uint32_t numBoxesX, uint32_t numBoxesY); + void constructor(HostLinkParams params); + + // Internal helper for sending messages + bool sendHelper(uint32_t dest, uint32_t numFlits, void* payload, + bool block, uint32_t key); public: // Dimensions of board mesh int meshXLen; @@ -43,6 +57,7 @@ class HostLink { // Constructors HostLink(); HostLink(uint32_t numBoxesX, uint32_t numBoxesY); + HostLink(HostLinkParams params); // Destructor ~HostLink(); @@ -65,6 +80,12 @@ class HostLink { // Try to send a message (non-blocking, returns true on success) bool trySend(uint32_t dest, uint32_t numFlits, void* msg); + // Send a message using routing key (blocking by default) + bool 
keySend(uint32_t key, uint32_t numFlits, void* msg, bool block = true); + + // Try to send using routing key (non-blocking, returns true on success) + bool keyTrySend(uint32_t key, uint32_t numFlits, void* msg); + // Receive a max-sized message (blocking) void recv(void* msg); diff --git a/include/EdgeList.h b/include/EdgeList.h index 7d03bb8f..ebd5d37f 100644 --- a/include/EdgeList.h +++ b/include/EdgeList.h @@ -3,8 +3,11 @@ #define _NETWORK_H_ #include -#include #include +#include +#include +#include +#include struct EdgeList { // Number of nodes and edges @@ -18,50 +21,42 @@ struct EdgeList { // Read network from file void read(const char* filename) { - // Read edges - FILE* fp = fopen(filename, "rt"); - if (fp == NULL) { - fprintf(stderr, "Can't open '%s'\n", filename); - exit(EXIT_FAILURE); - } + std::fstream file(filename, std::ios_base::in); + std::vector vec; // Count number of nodes and edges numEdges = 0; numNodes = 0; - int ret; - while (1) { - uint32_t src, dst; - ret = fscanf(fp, "%d %d", &src, &dst); - if (ret == EOF) break; + uint32_t numInts = 0; + uint32_t val; + while (file >> val) { + vec.push_back(val); + numNodes = val >= numNodes ? val+1 : numNodes; numEdges++; - numNodes = src >= numNodes ? src+1 : numNodes; - numNodes = dst >= numNodes ? 
dst+1 : numNodes; } - rewind(fp); + assert((numEdges&1) == 0); + numEdges >>= 1; uint32_t* count = (uint32_t*) calloc(numNodes, sizeof(uint32_t)); - for (int i = 0; i < numEdges; i++) { - uint32_t src, dst; - ret = fscanf(fp, "%d %d", &src, &dst); - count[src]++; + for (int i = 0; i < vec.size(); i+=2) { + count[vec[i]]++; } // Create mapping from node id to neighbours neighbours = (uint32_t**) calloc(numNodes, sizeof(uint32_t*)); - rewind(fp); for (int i = 0; i < numNodes; i++) { neighbours[i] = (uint32_t*) calloc(count[i]+1, sizeof(uint32_t)); neighbours[i][0] = count[i]; } - for (int i = 0; i < numEdges; i++) { - uint32_t src, dst; - ret = fscanf(fp, "%d %d", &src, &dst); + for (int i = 0; i < vec.size(); i+=2) { + uint32_t src = vec[i]; + uint32_t dst = vec[i+1]; neighbours[src][count[src]--] = dst; } // Release free(count); - fclose(fp); + file.close(); } // Determine max fan-out @@ -73,6 +68,17 @@ struct EdgeList { } return max; } + + // Determine min fan-out + uint32_t minFanOut() { + uint32_t min = ~0; + for (uint32_t i = 0; i < numNodes; i++) { + uint32_t numNeighbours = neighbours[i][0]; + if (numNeighbours < min) min = numNeighbours; + } + return min; + } + }; #endif diff --git a/include/POLite.h b/include/POLite.h index d12a0e73..f053e440 100644 --- a/include/POLite.h +++ b/include/POLite.h @@ -9,10 +9,10 @@ #include #else #include + #include #include #include #include - #include #endif #endif diff --git a/include/POLite/Bitmap.h b/include/POLite/Bitmap.h new file mode 100644 index 00000000..9271bc07 --- /dev/null +++ b/include/POLite/Bitmap.h @@ -0,0 +1,59 @@ +#ifndef _BITMAP_H_ +#define _BITMAP_H_ + +#include +#include +#include + +struct Bitmap { + // Bitmap contents (sequence of 64-bit words) + Seq* contents; + + // Index of first non-full word in bitmap + uint32_t firstFree; + + // Constructor + Bitmap() { + contents = new Seq (16); + firstFree = 0; + } + + // Destructor + ~Bitmap() { + if (contents) delete contents; + } + + // Get value of word at 
given index, return 0 if out-of-bounds + inline uint64_t getWord(uint32_t index) { + return index >= contents->numElems ? 0ul : contents->elems[index]; + } + + // Find index of next free word in bitmap starting from given word index + inline uint32_t nextFreeWordFrom(uint32_t start) { + for (uint32_t i = start; i < contents->numElems; i++) + if (~contents->elems[i] != 0ul) return i; + return contents->numElems; + } + + // Set bit at given index and bit offset in bitmap + inline void setBit(uint32_t wordIndex, uint32_t bitIndex) { + for (uint32_t i = contents->numElems; i <= wordIndex; i++) + contents->append(0ul); + contents->elems[wordIndex] |= 1ul << bitIndex; + if (wordIndex == firstFree) { + firstFree = nextFreeWordFrom(firstFree); + } + } + + // Find index of next zero bit, and flip that bit + inline uint32_t grabNextBit() { + uint64_t word = getWord(firstFree); + assert(~word != 0ul); + uint32_t bit = __builtin_ctzll(~word); + uint32_t result = 64*firstFree + bit; + setBit(firstFree, bit); + return result; + } +}; + +#endif diff --git a/include/POLite/PDevice.h b/include/POLite/PDevice.h index 9eefda3a..508207bd 100644 --- a/include/POLite/PDevice.h +++ b/include/POLite/PDevice.h @@ -22,14 +22,22 @@ #define POLITE_NUM_PINS 1 #endif -// Macros for performance stats +// The local-multicast key points to a list of incoming edges. Some +// of those edges are stored in a header, the rest in an array at a +// different location. The number stored in the header is controlled +// by the following parameter. If it's too low, we risk wasting +// memory bandwidth. If it's too high, we risk wasting memory. +// The minimum value is 0. For large edge state sizes, use 0. 
+#ifndef POLITE_EDGES_PER_HEADER +#define POLITE_EDGES_PER_HEADER 6 +#endif + +// Macros for performance stats: // POLITE_DUMP_STATS - dump performance stats on termination -// POLITE_COUNT_MSGS - include message counts of performance stats +// POLITE_COUNT_MSGS - include message counts in performance stats // Thread-local device id typedef uint16_t PLocalDeviceId; -#define InvalidLocalDevId 0xffff -#define UnusedLocalDevId 0xfffe // Thread id typedef uint32_t PThreadId; @@ -54,7 +62,7 @@ inline PLocalDeviceId getLocalDeviceId(PDeviceAddr addr) { return addr >> 19; } // What's the max allowed local device address? inline uint32_t maxLocalDeviceId() { return 8192; } -// Routing key +// Local multicast key typedef uint16_t Key; #define InvalidKey 0xffff @@ -102,8 +110,8 @@ template struct ALIGNED PState { // Message structure template struct PMessage { - // Source-based routing key - Key key; + // Destination key + uint16_t destKey; // Application message M payload; }; @@ -119,15 +127,15 @@ struct POutEdge { uint32_t threadMaskHigh; }; -// An incoming edge to a device (labelleled) +// An incoming edge to a device template struct PInEdge { // Destination device PLocalDeviceId devId; - // Edge info + // Edge data E edge; }; -// An incoming edge to a device (unlabelleled) +// An incoming edge to a device (unlabelled) template <> struct PInEdge { union { // Destination device @@ -137,15 +145,17 @@ template <> struct PInEdge { }; }; -// Helper function: Count board hops between two threads -inline uint32_t hopsBetween(uint32_t t0, uint32_t t1) { - uint32_t xmask = ((1<> (TinselLogThreadsPerBoard + TinselMeshXBits); - int32_t x0 = (t0 >> TinselLogThreadsPerBoard) & xmask; - int32_t y1 = t1 >> (TinselLogThreadsPerBoard + TinselMeshXBits); - int32_t x1 = (t1 >> TinselLogThreadsPerBoard) & xmask; - return (abs(x0-x1) + abs(y0-y1)); -} +// Header for a list of incoming edges (fixed size structure to +// support fast construction/packing of local-multicast tables) +template 
struct PInHeader { + // Number of receivers + uint16_t numReceivers; + // Pointer to remaining edges in inTableRest, + // if they don't all fit in the header + uint16_t restIndex; + // Edges stored in the header, to make good use of cached data + PInEdge edges[POLITE_EDGES_PER_HEADER]; +}; // Generic thread structure template ) devices; // Pointer to base of routing tables PTR(POutEdge) outTableBase; - PTR(PInEdge) inTableBase; + PTR(PInHeader) inTableHeaderBase; + PTR(PInEdge) inTableRestBase; // Array of local device ids are ready to send PTR(PLocalDeviceId) senders; // This array is accessed in a LIFO manner @@ -170,11 +181,11 @@ template * m = (PMessage*) tinselSendSlot(); // Send message - m->key = outEdge->key; + m->destKey = outEdge->key; tinselMulticast(outEdge->mbox, outEdge->threadMaskHigh, outEdge->threadMaskLow, m); #ifdef POLITE_COUNT_MSGS - interThreadSendCount++; - interBoardSendCount += - hopsBetween(outEdge->mbox << TinselLogThreadsPerMailbox, - tinselId()); + msgsSent++; #endif // Move to next neighbour outEdge++; } - else + else { + #ifdef POLITE_COUNT_MSGS + blockedSends++; + #endif tinselWaitUntil(TINSEL_CAN_SEND|TINSEL_CAN_RECV); + } } else if (sendersTop != senders) { if (tinselCanSend()) { @@ -292,8 +310,12 @@ template * inMsg = (PMessage*) tinselRecv(); - PInEdge* inEdge = &inTableBase[inMsg->key]; - while (inEdge->devId != InvalidLocalDevId) { + PInHeader* inHeader = &inTableHeaderBase[inMsg->destKey]; + // Determine number and location of edges/receivers + uint32_t numReceivers = inHeader->numReceivers; + PInEdge* inEdge = inHeader->edges; + // For each receiver + for (uint32_t i = 0; i < numReceivers; i++) { + if (i == POLITE_EDGES_PER_HEADER) + inEdge = &inTableRestBase[inHeader->restIndex]; // Lookup destination device PLocalDeviceId id = inEdge->devId; DeviceType dev = getDevice(id); @@ -332,7 +360,7 @@ template #include #include +#include +#include #include -#include "Seq.h" +#include // Nodes of a POETS graph are devices typedef 
NodeId PDeviceId; @@ -24,9 +26,27 @@ template struct PReceiverGroup { // Thread id where all the receivers reside uint32_t threadId; // A sequence of receiving devices on that thread - Seq>* receivers; + SmallSeq> receivers; }; +// This structure holds info about an edge destination +struct PEdgeDest { + // Index of edge in outgoing edge list + uint32_t index; + // Destination device + PDeviceId dest; + // Address where destination is located + PDeviceAddr addr; +}; + +// Comparison function for PEdgeDest +// (Useful to sort destinations by thread id of destination) +inline int cmpEdgeDest(const void* e0, const void* e1) { + PEdgeDest* d0 = (PEdgeDest*) e0; + PEdgeDest* d1 = (PEdgeDest*) e1; + return getThreadId(d0->addr) < getThreadId(d1->addr); +} + // POETS graph template class PGraph { @@ -59,8 +79,19 @@ template *** outTable; - // Sequence of incoming edges for every thread - Seq>** inTable; + // Sequence of in-edge headers, for each thread + Seq>** inTableHeaders; + // Remaining in-edges that don't fit in the header table, for each thread + Seq>** inTableRest; + // Bitmap denoting used space in header table, for each thread + Bitmap** inTableBitmaps; + + // Programmable routing tables + ProgRouterMesh* progRouterTables; + + // Receiver groups (used internally by some methods, but declared once + // to avoid repeated allocation) + PReceiverGroup groups[TinselThreadsPerMailbox]; // Generic constructor void constructor(uint32_t lenX, uint32_t lenY) { @@ -79,18 +110,29 @@ template ); } - // Add space for incoming edge table - if (inTable[threadId]) { - sizeEIMem = inTable[threadId]->numElems * sizeof(PInEdge); - sizeEIMem = wordAlign(sizeEIMem); + // Add space for incoming edge tables + if (inTableHeaders[threadId]) { + sizeEIHeaderMem = inTableHeaders[threadId]->numElems * + sizeof(PInHeader); + sizeEIHeaderMem = wordAlign(sizeEIHeaderMem); + } + if (inTableRest[threadId]) { + sizeEIRestMem = inTableRest[threadId]->numElems * sizeof(PInEdge); + sizeEIRestMem = 
wordAlign(sizeEIRestMem); } // Add space for outgoing edge table for (uint32_t devNum = 0; devNum < numDevs; devNum++) { @@ -231,8 +288,10 @@ template maxDRAMSize) { @@ -246,14 +305,17 @@ template devices = vertexMemBase[threadId]; // Set tinsel address of base of edge tables thread->outTableBase = outEdgeMemBase[threadId]; - thread->inTableBase = inEdgeMemBase[threadId]; + thread->inTableHeaderBase = inEdgeHeaderMemBase[threadId]; + thread->inTableRestBase = inEdgeRestMemBase[threadId]; // Add space for each device on thread uint32_t numDevs = numDevicesOnThread[threadId]; for (uint32_t devNum = 0; devNum < numDevs; devNum++) { @@ -337,11 +408,18 @@ template * inEdgeArray = (PInEdge*) inEdgeMem[threadId]; - Seq>* edges = inTable[threadId]; + PInHeader* inEdgeHeaderArray = + (PInHeader*) inEdgeHeaderMem[threadId]; + Seq>* headers = inTableHeaders[threadId]; + if (headers) + for (uint32_t i = 0; i < headers->numElems; i++) { + inEdgeHeaderArray[i] = headers->elems[i]; + } + PInEdge* inEdgeRestArray = (PInEdge*) inEdgeRestMem[threadId]; + Seq>* edges = inTableRest[threadId]; if (edges) for (uint32_t i = 0; i < edges->numElems; i++) { - inEdgeArray[i] = edges->elems[i]; + inEdgeRestArray[i] = edges->elems[i]; } // At this point, check that next pointers line up with heap sizes if (nextVMem != vertexMemSize[threadId]) { @@ -368,12 +446,27 @@ template >**) + // Receiver-side tables (headers) + inTableHeaders = (Seq>**) + calloc(TinselMaxThreads,sizeof(Seq>*)); + for (uint32_t t = 0; t < TinselMaxThreads; t++) { + if (numDevicesOnThread[t] != 0) + inTableHeaders[t] = new SmallSeq>; + } + + // Receiver-side tables (rest) + inTableRest = (Seq>**) calloc(TinselMaxThreads,sizeof(Seq>*)); for (uint32_t t = 0; t < TinselMaxThreads; t++) { if (numDevicesOnThread[t] != 0) - inTable[t] = new SmallSeq>; + inTableRest[t] = new SmallSeq>; + } + + // Receiver-side tables (bitmaps) + inTableBitmaps = (Bitmap**) calloc(TinselMaxThreads,sizeof(Bitmap*)); + for (uint32_t t = 0; t < 
TinselMaxThreads; t++) { + if (numDevicesOnThread[t] != 0) + inTableBitmaps[t] = new Bitmap; } // Sender-side tables @@ -386,174 +479,232 @@ template >* receivers, - Seq>* groups) { - groups->clear(); - for (uint32_t i = 0; i < 64; i++) { - if (receivers[i].numElems > 0) { - // Add receiver group - PReceiverGroup g; - g.threadId = (mbox << TinselLogThreadsPerMailbox) | i; - g.receivers = &receivers[i]; - groups->append(g); - } + // Determine local-multicast routing key for given set of receivers + // (The key must be the same for all receivers) + uint32_t findKey(uint32_t numGroups) { + // Fast path (single receiver) + if (numGroups == 1) { + Bitmap* bm = inTableBitmaps[groups[0].threadId]; + return bm->grabNextBit(); } - } - // Determine routing key for given set of receivers - // (The key must be the same for all receivers) - uint32_t findKey(Seq>* receivers) { - uint32_t key = 0; - - bool found = false; - while (!found) { - found = true; - for (uint32_t i = 0; i < receivers->numElems; i++) { - PReceiverGroup g = receivers->elems[i]; - uint32_t numReceivers = g.receivers->numElems; - if (numReceivers > 0) { - // Lookup thread id of receiver - uint32_t t = g.threadId; - // Lookup table size for this thread - uint32_t tableSize = inTable[t]->numElems; - // Move to next receiver when we find a space - if (key >= tableSize) continue; - // Is there space at the current key? 
- // (Need space for numReceivers plus null terminator) - bool space = true; - for (int j = 0; j < numReceivers+1; j++) { - if ((key+j) >= tableSize) break; - if (inTable[t]->elems[key+j].devId != UnusedLocalDevId) { - found = false; - key = key+j+1; - break; - } - } - } + // Determine starting index for key search + uint32_t index = 0; + for (uint32_t i = 0; i < numGroups; i++) { + PReceiverGroup* g = &groups[i]; + Bitmap* bm = inTableBitmaps[g->threadId]; + if (bm->firstFree > index) index = bm->firstFree; + } + + // Find key that is available for all receivers + uint64_t mask; + retry: + mask = 0ul; + for (uint32_t i = 0; i < numGroups; i++) { + PReceiverGroup* g = &groups[i]; + Bitmap* bm = inTableBitmaps[g->threadId]; + mask |= bm->getWord(index); + if (~mask == 0ul) { index++; goto retry; } } + + // Mark key as taken in each bitmap + uint32_t bit = __builtin_ctzll(~mask); + for (uint32_t i = 0; i < numGroups; i++) { + PReceiverGroup* g = &groups[i]; + Bitmap* bm = inTableBitmaps[g->threadId]; + bm->setBit(index, bit); } - return key; + return 64*index + bit; } // Add entries to the input tables for the given receivers // (Only valid after mapper is called) - uint32_t addInTableEntries(Seq>* receivers) { - uint32_t key = findKey(receivers); - if (key >= 0xfffe) { + uint32_t addInTableEntries(uint32_t numGroups) { + uint32_t key = findKey(numGroups); + if (key >= 0xffff) { printf("Routing key exceeds 16 bits\n"); exit(EXIT_FAILURE); } - PInEdge null, unused; - null.devId = InvalidLocalDevId; - unused.devId = UnusedLocalDevId; - // Now that a key with sufficient space has been found, populate the tables - for (uint32_t i = 0; i < receivers->numElems; i++) { - PReceiverGroup g = receivers->elems[i]; - uint32_t numReceivers = g.receivers->numElems; - if (numReceivers > 0) { - // Lookup thread id of receiver - uint32_t t = g.threadId; - // Lookup table size for this thread - uint32_t tableSize = inTable[t]->numElems; - // Make sure inTable is big enough for new 
entries - for (uint32_t j = tableSize; j < (key+numReceivers+1); j++) - inTable[t]->append(unused); - // Add receivers to thread's inTable - for (uint32_t j = 0; j < numReceivers; j++) { - inTable[t]->elems[key+j] = g.receivers->elems[j]; + // Populate inTableHeaders and inTableRest using the key + for (uint32_t i = 0; i < numGroups; i++) { + PReceiverGroup* g = &groups[i]; + uint32_t numEdges = g->receivers.numElems; + PInEdge* edgePtr = g->receivers.elems; + if (numEdges > 0) { + // Determine thread id of receiver + uint32_t t = g->threadId; + // Extend table + Seq>* headers = inTableHeaders[t]; + if (key >= headers->numElems) + headers->extendBy(key + 1 - headers->numElems); + // Fill in header + PInHeader* header = &inTableHeaders[t]->elems[key]; + header->numReceivers = numEdges; + if (inTableRest[t]->numElems > 0xffff) { + printf("In-table index exceeds 16 bits\n"); + exit(EXIT_FAILURE); + } + header->restIndex = inTableRest[t]->numElems; + uint32_t numHeaderEdges = numEdges < POLITE_EDGES_PER_HEADER ? 
+ numEdges : POLITE_EDGES_PER_HEADER; + for (uint32_t j = 0; j < numHeaderEdges; j++) { + header->edges[j] = *edgePtr; + edgePtr++; + } + numEdges -= numHeaderEdges; + // Overflow into rest memory if header not big enough + for (uint32_t j = 0; j < numEdges; j++) { + inTableRest[t]->append(*edgePtr); + edgePtr++; } - inTable[t]->elems[key+numReceivers] = null; } } return key; } + // Split edge list into board-local and non-board-local destinations + // And sort each list by destination thread id + // (Only valid after mapper is called) + void splitDests(PDeviceId devId, PinId pinId, + Seq* local, Seq* nonLocal) { + local->clear(); + nonLocal->clear(); + PDeviceAddr devAddr = toDeviceAddr[devId]; + uint32_t devBoard = getThreadId(devAddr) >> TinselLogThreadsPerBoard; + // Split destinations into local/non-local + Seq* dests = graph.outgoing->elems[devId]; + Seq* pinIds = graph.pins->elems[devId]; + for (uint32_t d = 0; d < dests->numElems; d++) { + if (pinIds->elems[d] == pinId) { + PEdgeDest e; + e.index = d; + e.dest = dests->elems[d]; + e.addr = toDeviceAddr[e.dest]; + uint32_t destBoard = getThreadId(e.addr) >> TinselLogThreadsPerBoard; + if (devBoard == destBoard) + local->append(e); + else + nonLocal->append(e); + } + } + // Sort local list + qsort(local->elems, local->numElems, sizeof(PEdgeDest), cmpEdgeDest); + // Sort non-local list + qsort(nonLocal->elems, nonLocal->numElems, sizeof(PEdgeDest), cmpEdgeDest); + } + + // Compute table updates for destinations for given device + // (Only valid after mapper is called) + void computeTables(Seq* dests, uint32_t d, + Seq* out) { + out->clear(); + uint32_t index = 0; + while (index < dests->numElems) { + // New set of receiver groups on same mailbox + uint32_t threadMaskLow = 0; + uint32_t threadMaskHigh = 0; + uint32_t nextGroup = 0; + // Current mailbox & thread being considered + PDeviceAddr mbox = getThreadId(dests->elems[index].addr) >> + TinselLogThreadsPerMailbox; + uint32_t thread = 
getThreadId(dests->elems[index].addr) & + ((1<numElems) { + PEdgeDest* edge = &dests->elems[index]; + // Determine destination mailbox address and mailbox-local thread + uint32_t destMailbox = getThreadId(edge->addr) >> + TinselLogThreadsPerMailbox; + uint32_t destThread = getThreadId(edge->addr) & + ((1< in; + in.devId = getLocalDeviceId(edge->addr); + Seq* edges = edgeLabels.elems[d]; + if (! std::is_same::value) + in.edge = edges->elems[edge->index]; + // Update current receiver group + groups[nextGroup].receivers.append(in); + groups[nextGroup].threadId = getThreadId(edge->addr); + if (thread < 32) threadMaskLow |= 1 << thread; + if (thread >= 32) threadMaskHigh |= 1 << (thread-32); + index++; + } + else { + // Start new receiver group + thread = destThread; + nextGroup++; + assert(nextGroup < TinselThreadsPerMailbox); + } + } + else break; + } + // Add input table entries + uint32_t key = addInTableEntries(nextGroup+1); + // Add output entry + PRoutingDest dest; + dest.kind = PRDestKindMRM; + dest.mbox = mbox; + dest.mrm.key = key; + dest.mrm.threadMaskLow = threadMaskLow; + dest.mrm.threadMaskHigh = threadMaskHigh; + out->append(dest); + // Clear receiver groups, for a new iteration + for (uint32_t i = 0; i <= nextGroup; i++) groups[i].receivers.clear(); + } + } + // Compute routing tables // (Only valid after mapper is called) void computeRoutingTables() { - // Routing table stats - uint64_t totalOutEdges = 0; + // Edge destinations (local to sender board, or not) + Seq local; + Seq nonLocal; - // Sequence of local device ids, for each multicast destiation - SmallSeq> receivers[64]; + // Routing destinations + Seq dests; - // Sequence of receiver groups - // (A more compact representation of the receivers array) - SmallSeq> groups; + // Allocate per-board programmable routing tables + progRouterTables = new ProgRouterMesh(numBoardsX, numBoardsY); // For each device for (uint32_t d = 0; d < numDevices; d++) { // For each pin for (uint32_t p = 0; p < 
POLITE_NUM_PINS; p++) { - Seq dests = *(graph.outgoing->elems[d]); - Seq edges = *(edgeLabels.elems[d]); - // While destinations are remaining - while (dests.numElems > 0) { - // Clear receivers - for (uint32_t i = 0; i < 64; i++) receivers[i].clear(); - uint32_t threadMaskLow = 0; - uint32_t threadMaskHigh = 0; - // Current mailbox being considered - PDeviceAddr mbox = getThreadId(toDeviceAddr[dests.elems[0]]) >> - TinselLogThreadsPerMailbox; - // For each destination - uint32_t destsRemaining = 0; - for (uint32_t i = 0; i < dests.numElems; i++) { - // Determine destination mailbox address and mailbox-local thread - PDeviceId destId = dests.elems[i]; - PDeviceAddr destAddr = toDeviceAddr[destId]; - uint32_t destMailbox = getThreadId(destAddr) >> - TinselLogThreadsPerMailbox; - uint32_t destThread = getThreadId(destAddr) & - ((1< edge; - edge.devId = getLocalDeviceId(destAddr); - if (! std::is_same::value) edge.edge = edges.elems[i]; - receivers[destThread].append(edge); - if (destThread < 32) threadMaskLow |= 1 << destThread; - if (destThread >= 32) threadMaskHigh |= 1 << (destThread-32); - } - else { - // Add destination back into sequence - dests.elems[destsRemaining] = dests.elems[i]; - edges.elems[destsRemaining] = edges.elems[i]; - destsRemaining++; - } - } - // Create receiver groups - createReceiverGroups(mbox, receivers, &groups); - // Add input table entries - uint32_t key = addInTableEntries(&groups); - // Add output table entry + // Split edge lists into local/non-local and sort by target thread id + splitDests(d, p, &local, &nonLocal); + // Deal with board-local connections + computeTables(&local, d, &dests); + for (uint32_t i = 0; i < dests.numElems; i++) { + PRoutingDest dest = dests.elems[i]; POutEdge edge; - edge.mbox = mbox; - edge.key = key; - edge.threadMaskLow = threadMaskLow; - edge.threadMaskHigh = threadMaskHigh; + edge.mbox = dest.mbox; + edge.key = dest.mrm.key; + edge.threadMaskLow = dest.mrm.threadMaskLow; + edge.threadMaskHigh = 
dest.mrm.threadMaskHigh; outTable[d][p]->append(edge); - // Prepare for new output table entry - dests.numElems = destsRemaining; - edges.numElems = destsRemaining; - totalOutEdges++; } - // Add output edge terminator + // Deal with non-board-local connections + computeTables(&nonLocal, d, &dests); + uint32_t src = getThreadId(toDeviceAddr[d]) >> + TinselLogThreadsPerMailbox; + uint32_t key = progRouterTables->addDestsFromBoard(src, &dests); + POutEdge edge; + edge.mbox = tinselUseRoutingKey(); + edge.key = 0; + edge.threadMaskLow = key; + edge.threadMaskHigh = 0; + outTable[d][p]->append(edge); + // Add output list terminator POutEdge term; term.key = InvalidKey; outTable[d][p]->append(term); } } - //printf("Average edges per pin: %lu\n", - // totalOutEdges / (numDevices * POLITE_NUM_PINS); - } + } // Release all structures void releaseAll() { @@ -575,21 +726,38 @@ template useSendBuffer = true; writeRAM(hostLink, vertexMem, vertexMemSize, vertexMemBase); writeRAM(hostLink, threadMem, threadMemSize, threadMemBase); - writeRAM(hostLink, inEdgeMem, inEdgeMemSize, inEdgeMemBase); + writeRAM(hostLink, inEdgeHeaderMem, + inEdgeHeaderMemSize, inEdgeHeaderMemBase); + writeRAM(hostLink, inEdgeRestMem, inEdgeRestMemSize, inEdgeRestMemBase); writeRAM(hostLink, outEdgeMem, outEdgeMemSize, outEdgeMemBase); + progRouterTables->write(hostLink); hostLink->flush(); hostLink->useSendBuffer = useSendBufferOld; @@ -835,7 +1008,6 @@ template #include #include +#include +#include typedef uint32_t PartitionId; // Partition and place a graph on a 2D mesh struct Placer { + // Select between different methods + enum Method { + Default, + Metis, + Random, + Direct, + BFS + }; + const Method defaultMethod=Metis; + // The graph being placed Graph* graph; @@ -41,8 +53,40 @@ struct Placer { uint32_t* yCoordSaved; uint64_t savedCost; + // Random numbers + unsigned int seed; + void setRand(unsigned int s) { seed = s; }; + int getRand() { return rand_r(&seed); } + + // Controls which strategy is 
used + Method method = Default; + + // Select placer method + void chooseMethod() + { + auto e = getenv("POLITE_PLACER"); + if (e) { + if (!strcmp(e, "metis")) + method=Metis; + else if (!strcmp(e, "random")) + method=Random; + else if (!strcmp(e, "direct")) + method=Direct; + else if (!strcmp(e, "bfs")) + method=BFS; + else if (!strcmp(e, "default") || *e == '\0') + method=Default; + else { + fprintf(stderr, "Don't understand placer method : %s\n", e); + exit(EXIT_FAILURE); + } + } + if (method == Default) + method = defaultMethod; + } + // Partition the graph using Metis - void partition() { + void partitionMetis() { // Compute total number of edges uint32_t numEdges = 0; for (uint32_t i = 0; i < graph->incoming->numElems; i++) { @@ -116,6 +160,96 @@ struct Placer { free(parts); } + // Partition the graph randomly + void partitionRandom() { + uint32_t numVertices = graph->incoming->numElems; + uint32_t numParts = width * height; + + // Populate result array + for (uint32_t i = 0; i < numVertices; i++) { + partitions[i] = getRand() % numParts; + } + } + + // Partition the graph using direct mapping + void partitionDirect() { + uint32_t numVertices = graph->incoming->numElems; + uint32_t numParts = width * height; + uint32_t partSize = (numVertices + numParts) / numParts; + + // Populate result array + for (uint32_t i = 0; i < numVertices; i++) { + partitions[i] = i / partSize; + } + } + + // Partition the graph using repeated BFS + void partitionBFS() { + uint32_t numVertices = graph->incoming->numElems; + uint32_t numParts = width * height; + uint32_t partSize = (numVertices + numParts) / numParts; + + // Visited bit for each vertex + bool* seen = new bool [numVertices]; + memset(seen, 0, numVertices); + + // Next vertex to visit + uint32_t nextUnseen = 0; + + // Next partition id + uint32_t nextPart = 0; + + while (nextUnseen < numVertices) { + // Frontier + std::queue frontier; + uint32_t count = 0; + + while (nextUnseen < numVertices && count < partSize) { + 
// Size-bounded BFS from nextUnseen + frontier.push(nextUnseen); + while (count < partSize && !frontier.empty()) { + uint32_t v = frontier.front(); + frontier.pop(); + if (!seen[v]) { + seen[v] = true; + partitions[v] = nextPart; + count++; + // Add unvisited neighbours of v to the frontier + Seq* dests = graph->outgoing->elems[v]; + for (uint32_t i = 0; i < dests->numElems; i++) { + uint32_t w = dests->elems[i]; + if (!seen[w]) frontier.push(w); + } + } + } + while (nextUnseen < numVertices && seen[nextUnseen]) nextUnseen++; + } + + nextPart++; + } + + delete [] seen; + } + + void partition() + { + switch(method){ + case Default: + case Metis: + partitionMetis(); + break; + case Random: + partitionRandom(); + break; + case Direct: + partitionDirect(); + break; + case BFS: + partitionBFS(); + break; + } + } + // Create subgraph for each partition void computeSubgraphs() { uint32_t numPartitions = width*height; @@ -179,7 +313,7 @@ struct Placer { // Random mapping for (uint32_t y = 0; y < height; y++) { for (uint32_t x = 0; x < width; x++) { - int index = rand() % numPartitions; + int index = getRand() % numPartitions; PartitionId p = pids[index]; mapping[y][x] = p; xCoord[p] = x; @@ -295,6 +429,8 @@ struct Placer { graph = g; width = w; height = h; + // Random seed + setRand(1 + omp_get_thread_num()); // Allocate the partitions array partitions = new PartitionId [g->incoming->numElems]; // Allocate subgraphs @@ -316,6 +452,8 @@ struct Placer { yCoord = new uint32_t [width*height]; xCoordSaved = new uint32_t [width*height]; yCoordSaved = new uint32_t [width*height]; + // Pick a placement method, or select default + chooseMethod(); // Partition the graph using Metis partition(); // Compute subgraphs, one per partition diff --git a/include/POLite/ProgRouters.h b/include/POLite/ProgRouters.h new file mode 100644 index 00000000..9890c43e --- /dev/null +++ b/include/POLite/ProgRouters.h @@ -0,0 +1,413 @@ +// SPDX-License-Identifier: BSD-2-Clause +#ifndef _PROGROUTERS_H_ 
+#define _PROGROUTERS_H_ + +#include +#include +#include +#include +#include +#include + +// ============================= +// Per-board programmable router +// ============================= + +class ProgRouter { + + // Number of chunks used so far in current beat + uint32_t numChunks; + + // Number of records used so far in current beat + uint32_t numRecords; + + // Number of beats associated with current key + uint32_t numBeats; + + // Index of RAM currently being used + uint32_t currentRAM; + + // Pointer to previously created indirection + // (We need indirections to handle record sequences of 31 beats or more) + uint8_t* prevInd; + + // Move on to the next beat + void nextBeat() { + // Set number of records in current beat + uint32_t beatBase = table[currentRAM]->numElems - 32; + uint8_t* beat = &table[currentRAM]->elems[beatBase]; + beat[31] = 0; + beat[30] = numRecords; + numChunks = numRecords = 0; + // Allocate new beat, and check for overflow + numBeats++; + table[currentRAM]->extendBy(32); + if (table[currentRAM]->numElems >= (TinselPOLiteProgRouterLength-1024)) { + printf("ProgRouter out of memory\n"); + exit(EXIT_FAILURE); + } + // We need indirections to handle sequences of 31 beats or more + if ((numBeats % 31) == 0) { + // Set previous indirection, if there is one + if (prevInd) { + uint32_t key = TinselPOLiteProgRouterBase + + table[currentRAM]->numElems - 31*32; + if (currentRAM) key |= 0x80000000; + key |= 31; + setIND(prevInd, key); + } + prevInd = addIND(); + } + } + + // Get current record pointer for 48-bit entry + inline uint8_t* currentRecord48() { + uint32_t beatBase = (table[currentRAM]->numElems-32) + 6*(4-numChunks); + return &table[currentRAM]->elems[beatBase]; + } + + // Get current record pointer for 96-bit entry + inline uint8_t* currentRecord96() { + uint32_t beatBase = (table[currentRAM]->numElems-32) + 6*(3-numChunks); + return &table[currentRAM]->elems[beatBase]; + } + + public: + + // A table holding encoded routing beats for 
each RAM + Seq** table; + + // Constructor + ProgRouter() { + // Currently we assume two RAMs per board + assert(TinselDRAMsPerBoard == 2); + // Initialise member variables + prevInd = NULL; + numBeats = 1; + numChunks = numRecords = currentRAM = 0; + // Allocate one sequence per RAM + table = new Seq* [TinselDRAMsPerBoard]; + // Initially each sequence is 32KB + for (int i = 0; i < TinselDRAMsPerBoard; i++) { + table[i] = new Seq (1 << 15); + // Allocate first beat + table[i]->extendBy(32); + } + } + + // Destructor + ~ProgRouter() { + for (int i = 0; i < TinselDRAMsPerBoard; i++) delete table[i]; + delete [] table; + } + + // Generate a new key for the records added + uint32_t genKey() { + // Determine index of first beat in record sequence + uint32_t index = table[currentRAM]->numElems - numBeats*32; + // Determine final key length + uint32_t finalKeyLen = prevInd ? 31 : numBeats; + // Insert outstanding indirection, if there is one + if (prevInd) { + // Set previous indirection to latest block of beats + uint32_t indKey = TinselPOLiteProgRouterBase + + table[currentRAM]->numElems - (numBeats%31)*32; + if (currentRAM) indKey |= 0x80000000; + indKey |= (numBeats%31); + setIND(prevInd, indKey); + } + // Determine final key + uint32_t key = TinselPOLiteProgRouterBase + index; + if (currentRAM) key |= 0x80000000; + key |= finalKeyLen; + // Move to next beat + nextBeat(); + numBeats = 1; + prevInd = NULL; + // Pick smaller RAM for next key + currentRAM = table[0]->numElems < table[1]->numElems ? 
0 : 1; + return key; + } + + // Add an IND record to the table + // Return a pointer to the indirection key, + // so it can be set later by the caller + uint8_t* addIND() { + if (numChunks == 5) nextBeat(); + uint8_t* ptr = currentRecord48(); + ptr[5] = 4 << 5; + numChunks++; + numRecords++; + return ptr; + } + + // Set indirection key + void setIND(uint8_t* ind, uint32_t key) { + ind[0] = key; + ind[1] = key >> 8; + ind[2] = key >> 16; + ind[3] = key >> 24; + } + + // Add an MRM record to the table + void addMRM(uint32_t mboxX, uint32_t mboxY, + uint32_t threadsHigh, uint32_t threadsLow, + uint16_t localKey) { + if (numChunks >= 4) nextBeat(); + uint8_t* ptr = currentRecord96(); + ptr[0] = threadsLow; + ptr[1] = threadsLow >> 8; + ptr[2] = threadsLow >> 16; + ptr[3] = threadsLow >> 24; + ptr[4] = threadsHigh; + ptr[5] = threadsHigh >> 8; + ptr[6] = threadsHigh >> 16; + ptr[7] = threadsHigh >> 24; + ptr[8] = localKey; + ptr[9] = localKey >> 8; + ptr[11] = (3 << 5) | (mboxY << 3) | (mboxX << 1); + numChunks += 2; + numRecords++; + } + + // Add an RR record to the table + void addRR(uint32_t dir, uint32_t key) { + if (numChunks == 5) nextBeat(); + uint8_t* ptr = currentRecord48(); + ptr[0] = key; + ptr[1] = key >> 8; + ptr[2] = key >> 16; + ptr[3] = key >> 24; + ptr[5] = (2 << 5) | (dir << 3); + numChunks++; + numRecords++; + } + + // Add a URM1 record to the table + void addURM1(uint32_t mboxX, uint32_t mboxY, + uint32_t threadId, uint32_t key) { + if (numChunks == 5) nextBeat(); + uint8_t* ptr = currentRecord48(); + ptr[0] = key; + ptr[1] = key >> 8; + ptr[2] = key >> 16; + ptr[3] = key >> 24; + ptr[4] = (threadId << 3); + ptr[5] = (mboxY << 3) | (mboxX << 1) | (threadId >> 5); + numChunks++; + numRecords++; + } +}; + +// ================================== +// Data type for routing destinations +// ================================== + +enum PRoutingDestKind { PRDestKindURM1, PRDestKindMRM }; + +// URM1 routing destination +struct PRoutingDestURM1 { + // 
Mailbox-local thread + uint16_t threadId; + // Thread-local routing key + uint32_t key; +}; + +// MRM routing destination +struct PRoutingDestMRM { + // Thread-local routing key + uint16_t key; + // Destination threads + uint32_t threadMaskLow; + uint32_t threadMaskHigh; +}; + +// Routing destination +struct PRoutingDest { + PRoutingDestKind kind; + // Destination mailbox + uint32_t mbox; + // URM1 or MRM destination + union { + PRoutingDestURM1 urm1; + PRoutingDestMRM mrm; + }; +}; + +// Extract board X coord from routing dest +inline uint32_t destX(uint32_t mbox) { + uint32_t x = mbox >> (TinselMailboxMeshXBits + TinselMailboxMeshYBits); + return x & ((1<> (TinselMailboxMeshXBits + + TinselMailboxMeshYBits + TinselMeshXBits); + return y & ((1<> TinselMailboxMeshXBits) & + ((1<* dests) { + if (dests->numElems == 0) return 0; + + // Categorise dests into local, N, S, E, and W groups + Seq local(dests->numElems); + Seq north(dests->numElems); + Seq south(dests->numElems); + Seq east(dests->numElems); + Seq west(dests->numElems); + for (int i = 0; i < dests->numElems; i++) { + PRoutingDest dest = dests->elems[i]; + uint32_t receiverX = destX(dest.mbox); + uint32_t receiverY = destY(dest.mbox); + if (receiverX < senderX) west.append(dest); + else if (receiverX > senderX) east.append(dest); + else if (receiverY < senderY) south.append(dest); + else if (receiverY > senderY) north.append(dest); + else local.append(dest); + } + + // Recurse on non-local groups and add RR records on return + if (north.numElems > 0) { + uint32_t key = addDestsFromBoardXY(senderX, senderY+1, &north); + table[senderY][senderX].addRR(0, key); + } + if (south.numElems > 0) { + uint32_t key = addDestsFromBoardXY(senderX, senderY-1, &south); + table[senderY][senderX].addRR(1, key); + } + if (east.numElems > 0) { + uint32_t key = addDestsFromBoardXY(senderX+1, senderY, &east); + table[senderY][senderX].addRR(2, key); + } + if (west.numElems > 0) { + uint32_t key = addDestsFromBoardXY(senderX-1, 
senderY, &west); + table[senderY][senderX].addRR(3, key); + } + + // Add local records + for (int i = 0; i < local.numElems; i++) { + PRoutingDest dest = local.elems[i]; + if (dest.kind == PRDestKindMRM) { + table[senderY][senderX].addMRM(destMboxX(dest.mbox), + destMboxY(dest.mbox), dest.mrm.threadMaskHigh, + dest.mrm.threadMaskLow, dest.mrm.key); + } + else if (dest.kind == PRDestKindURM1) { + table[senderY][senderX].addURM1(destMboxX(dest.mbox), + destMboxY(dest.mbox), dest.urm1.threadId, dest.urm1.key); + } + else { + fprintf(stderr, "ProgRouters.h: unknown routing record kind\n"); + exit(EXIT_FAILURE); + } + } + + return table[senderY][senderX].genKey(); + } + + // Add routing destinations from given global mailbox id + uint32_t addDestsFromBoard(uint32_t mbox, Seq* dests) { + return addDestsFromBoardXY(destX(mbox), destY(mbox), dests); + } + + // Write routing tables to memory via HostLink + void write(HostLink* hostLink) { + // Request to boot loader + BootReq req; + + // Compute number of cores per DRAM + const uint32_t coresPerDRAM = 1 << + (TinselLogCoresPerDCache + TinselLogDCachesPerDRAM); + + // Initialise write address for each routing table + for (int y = 0; y < boardsY; y++) { + for (int x = 0; x < boardsX; x++) { + for (int i = 0; i < TinselDRAMsPerBoard; i++) { + // Use one core to initialise each DRAM + uint32_t dest = hostLink->toAddr(x, y, coresPerDRAM * i, 0); + req.cmd = SetAddrCmd; + req.numArgs = 1; + req.args[0] = TinselPOLiteProgRouterBase; + hostLink->send(dest, 1, &req); + // Ensure space for an extra 32 bytes in each + // table so we don't have to check for overflow below + // when consuming the tables in chunks of 12 bytes + table[y][x].table[i]->ensureSpaceFor(32); + } + } + } + + // Write each routing table + bool allDone = false; + uint32_t offset = 0; + while (! 
allDone) { + allDone = true; + for (int y = 0; y < boardsY; y++) { + for (int x = 0; x < boardsX; x++) { + for (int i = 0; i < TinselDRAMsPerBoard; i++) { + Seq* seq = table[y][x].table[i]; + if (offset < seq->numElems) { + uint32_t dest = hostLink->toAddr(x, y, coresPerDRAM * i, 0); + uint8_t* base = &seq->elems[offset]; + allDone = false; + req.cmd = StoreCmd; + req.numArgs = 3; + req.args[0] = ((uint32_t*) base)[0]; + req.args[1] = ((uint32_t*) base)[1]; + req.args[2] = ((uint32_t*) base)[2]; + hostLink->send(dest, 1, &req); + } + } + } + } + offset += 12; + } + } + + // Destructor + ~ProgRouterMesh() { + for (int y = 0; y < boardsY; y++) + delete [] table[y]; + delete [] table; + } +}; + + +#endif diff --git a/include/POLite/Seq.h b/include/POLite/Seq.h index b6cb61f1..23a7616c 100644 --- a/include/POLite/Seq.h +++ b/include/POLite/Seq.h @@ -45,12 +45,26 @@ template class Seq elems = newElems; } + // Extend size of sequence by N + void extendBy(int n) + { + numElems += n; + if (numElems > maxElems) + setCapacity(numElems*2); + } + // Extend size of sequence by one void extend() { - numElems++; - if (numElems > maxElems) - setCapacity(maxElems*2); + extendBy(1); + } + + // Ensure space for a further N elements + void ensureSpaceFor(int n) + { + int newNumElems = numElems + n; + if (newNumElems > maxElems) + setCapacity(newNumElems*2); } // Append diff --git a/include/tinsel-interface.h b/include/tinsel-interface.h index 93b5ec96..21dfdfcb 100644 --- a/include/tinsel-interface.h +++ b/include/tinsel-interface.h @@ -166,7 +166,7 @@ INLINE uint32_t tinselAccId( uint32_t tileX, uint32_t tileY) { uint32_t addr; - addr = 0x4; + addr = 0x8; addr = (addr << TinselMeshYBits) | boardY; addr = (addr << TinselMeshXBits) | boardX; addr = (addr << TinselMailboxMeshYBits) | tileY; @@ -175,4 +175,13 @@ INLINE uint32_t tinselAccId( return addr; } +// Special address to signify use of routing key +INLINE uint32_t tinselUseRoutingKey() +{ + // Special address to signify use of 
routing key + return 1 << + (TinselMailboxMeshYBits + TinselMailboxMeshXBits + + TinselMeshXBits + TinselMeshYBits + 2); +} + #endif diff --git a/include/tinsel.h b/include/tinsel.h index 9ebd8451..0b88844d 100644 --- a/include/tinsel.h +++ b/include/tinsel.h @@ -28,13 +28,15 @@ #define CSR_FLUSH "0xc01" // Performance counter CSRs -#define CSR_PERFCOUNT "0xc07" -#define CSR_MISSCOUNT "0xc08" -#define CSR_HITCOUNT "0xc09" -#define CSR_WBCOUNT "0xc0a" -#define CSR_CPUIDLECOUNT "0xc0b" -#define CSR_CPUIDLECOUNTU "0xc0c" -#define CSR_CYCLEU "0xc0d" +#define CSR_PERFCOUNT "0xc07" +#define CSR_MISSCOUNT "0xc08" +#define CSR_HITCOUNT "0xc09" +#define CSR_WBCOUNT "0xc0a" +#define CSR_CPUIDLECOUNT "0xc0b" +#define CSR_CPUIDLECOUNTU "0xc0c" +#define CSR_CYCLEU "0xc0d" +#define CSR_PROGROUTERSENT "0xc0e" +#define CSR_PROGROUTERSENTINTER "0xc0f" // Get globally unique thread id of caller INLINE uint32_t tinselId() @@ -127,6 +129,18 @@ INLINE volatile void* tinselSendSlot() return mb_scratchpad_base + (threadId << TinselLogBytesPerMsg); } +// Get pointer to thread's extra message slot reserved for sending +// (Assumes that HostLink has requested the extra slot) +INLINE volatile void* tinselSendSlotExtra() +{ + volatile char* mb_scratchpad_base = + (volatile char*) (1 << TinselLogBytesPerMailbox); + uint32_t threadId = tinselId() & + ((1<> 6, high, low, addr); } +// Send message at addr using given routing key +INLINE void tinselKeySend(int key, volatile void* addr) +{ + tinselMulticast(tinselUseRoutingKey(), 0, key, addr); +} + // Receive message INLINE volatile void* tinselRecv() { @@ -270,7 +290,7 @@ INLINE uint32_t tinselWritebackCount() return n; } -// Performance counter:: get the CPU-idle count +// Performance counter: get the CPU-idle count INLINE uint32_t tinselCPUIdleCount() { uint32_t n; @@ -294,6 +314,22 @@ INLINE uint32_t tinselCycleCountU() return n; } +// Performance counter: number of messages emitted by ProgRouter +INLINE uint32_t tinselProgRouterSent() +{ + 
uint32_t n; + asm volatile ("csrrw %0, " CSR_PROGROUTERSENT ", zero" : "=r"(n)); + return n; +} + +// Performance counter: number of inter-board messages emitted by ProgRouter +INLINE uint32_t tinselProgRouterSentInterBoard() +{ + uint32_t n; + asm volatile ("csrrw %0, " CSR_PROGROUTERSENTINTER ", zero" : "=r"(n)); + return n; +} + // Get address of any specified host // (This Y coordinate specifies the row of the FPGA mesh that the // host is connected to, and the X coordinate specifies whether it is diff --git a/rtl/Connections.bsv b/rtl/Connections.bsv new file mode 100644 index 00000000..7f542acc --- /dev/null +++ b/rtl/Connections.bsv @@ -0,0 +1,151 @@ +package Connections; + +import Vector :: *; +import OffChipRAM :: *; +import Interface :: *; +import DRAM :: *; +import Queue :: *; +import DCache :: *; +import DCacheTypes :: *; +import Util :: *; +import ProgRouter :: *; +import Core :: *; + +// ============================================================================ +// DCache <-> Core connections +// ============================================================================ + +module connectCoresToDCache#( + Vector#(`CoresPerDCache, DCacheClient) clients, + DCache dcache) (); + + // Connect requests + function getDCacheReqOut(client) = client.dcacheReqOut; + let dcacheReqs <- mkMergeTree(Fair, + mkUGShiftQueue1(QueueOptFmax), + map(getDCacheReqOut, clients)); + connectUsing(mkUGQueue, dcacheReqs, dcache.reqIn); + + // Connect responses + function Bit#(`LogCoresPerDCache) getDCacheRespKey(DCacheResp resp) = + truncateLSB(resp.id); + function getDCacheRespIn(client) = client.dcacheRespIn; + let dcacheResps <- mkResponseDistributor( + getDCacheRespKey, + mkUGShiftQueue1(QueueOptFmax), + map(getDCacheRespIn, clients)); + connectDirect(dcache.respOut, dcacheResps); + + // Connect performance-counter wires + rule connectPerfCountWires; + clients[0].incMissCount(dcache.incMissCount); + clients[0].incHitCount(dcache.incHitCount); + 
clients[0].incWritebackCount(dcache.incWritebackCount); + for (Integer i = 1; i < `CoresPerDCache; i=i+1) begin + clients[i].incMissCount(False); + clients[i].incHitCount(False); + clients[i].incWritebackCount(False); + end + endrule + +endmodule + +// ============================================================================ +// Off-chip RAM connections +// ============================================================================ + +module connectClientsToOffChipRAM#( + // Data caches + Vector#(`DCachesPerDRAM, DCache) caches, + // Reqs and resps from ProgRouter's fetchers + Vector#(`FetchersPerProgRouter, BOut#(DRAMReq)) routerReqs, + Vector#(`FetchersPerProgRouter, In#(DRAMResp)) routerResps, + // Off-chip memory + OffChipRAM ram) (); + + // Count the number of outstanding fetcher requests + // Used to throttle the fetcher requests to avoid starving/blocking + // the cache requests + Integer throttleCount = 2 ** (`DRAMLogMaxInFlight - 1); + Count#(`DRAMLogMaxInFlight) fetcherCount <- mkCount(throttleCount); + + // Merge cache requests + function getReqOut(cache) = cache.reqOut; + Out#(DRAMReq) cacheReqs <- + mkMergeTreeB(Fair, + mkUGShiftQueue1(QueueOptFmax), + map(getReqOut, caches)); + Queue#(DRAMReq) cacheReqsQueue <- mkUGQueue; + connectToQueue(cacheReqs, cacheReqsQueue); + BOut#(DRAMReq) cacheReqsB = queueToBOut(cacheReqsQueue); + + // Merge router requests + Out#(DRAMReq) fetcherReqs <- + mkMergeTreeB(Fair, + mkUGShiftQueue1(QueueOptFmax), + routerReqs); + Queue#(DRAMReq) fetcherReqsQueue <- mkUGQueue; + connectToQueue(fetcherReqs, fetcherReqsQueue); + BOut#(DRAMReq) fetcherReqsB = queueToBOut(fetcherReqsQueue); + + // Update count on router request + BOut#(DRAMReq) fetcherReqsIncCountB = + interface BOut + method Action get = + action + fetcherReqsB.get; + fetcherCount.incBy(zeroExtend(fetcherReqsB.value.burst)); + endaction; + method Bool valid = fetcherReqsB.valid && + zeroExtend(fetcherReqsB.value.burst) <= fetcherCount.available; + method DRAMReq 
value = fetcherReqsB.value; + endinterface; + + // Merge cache and router requests, and connect to off-chip RAM + let reqs <- mkMergeTwoB(Fair, cacheReqsB, fetcherReqsIncCountB); + connectUsing(mkUGQueue, reqs, ram.reqIn); + + // Connect load responses + function DRAMClientId getRespKey(DRAMResp resp) = resp.id; + function getRespIn(cache) = cache.respIn; + let ramResps <- mkResponseDistributor( + getRespKey, + mkUGShiftQueue2(QueueOptFmax), + append(map(getRespIn, caches), routerResps)); + + // Update count on response + BOut#(DRAMResp) ramRespOutDecCount = + interface BOut + method Action get = + action + ram.respOut.get; + if (ram.respOut.value.id >= fromInteger(`DCachesPerDRAM)) + fetcherCount.dec; + endaction; + method Bool valid = ram.respOut.valid; + method DRAMResp value = ram.respOut.value; + endinterface; + + // Connect responses from off-chip RAM + connectDirect(ramRespOutDecCount, ramResps); + +endmodule + +// ============================================================================ +// ProgRouter performance counter connections +// ============================================================================ + +module connectProgRouterPerfCountersToCores#( + ProgRouterPerfCounters counters, Vector#(n, Core) cores) (Empty); + rule connect; + // Only core zero can access the ProgRouter perf counters + cores[0].progRouterPerfClient.incSent(counters.incSent); + cores[0].progRouterPerfClient.incSentInterBoard(counters.incSentInterBoard); + for (Integer i = 1; i < valueOf(n); i=i+1) begin + cores[i].progRouterPerfClient.incSent(?); + cores[i].progRouterPerfClient.incSentInterBoard(?); + end + endrule +endmodule + +endpackage diff --git a/rtl/Core.bsv b/rtl/Core.bsv index 1d35d278..4c454c98 100644 --- a/rtl/Core.bsv +++ b/rtl/Core.bsv @@ -25,6 +25,7 @@ import FPUOps :: *; import InstrMem :: *; import DCacheTypes :: *; import IdleDetector :: *; +import ProgRouter :: *; // ============================================================================ // 
Control/status registers (CSRs) supported @@ -60,15 +61,17 @@ import IdleDetector :: *; // Performance Counter CSRs (Optional) // ============================================================================ -// Name | CSR | R/W | Function -// --------------- | ------ | --- | -------- -// PerfCount | 0xc07 | W | Reset(0)/Start(1)/Stop(2) all counters -// MissCount | 0xc08 | R | Cache miss count -// HitCount | 0xc09 | R | Cache hit count -// WritebackCount | 0xc0a | R | Cache writeback count -// CPUIdleCount | 0xc0b | R | CPU idle-cycle count (lower 32 bits) -// CPUIdleCountU | 0xc0c | R | CPU idle-cycle count (upper 8 bits) -// CycleU | 0xc0d | R | Cycle counter (upper 8 bits) +// Name | CSR | R/W | Function +// ------------------- | ------ | --- | -------- +// PerfCount | 0xc07 | W | Reset(0)/Start(1)/Stop(2) all counters +// MissCount | 0xc08 | R | Cache miss count +// HitCount | 0xc09 | R | Cache hit count +// WritebackCount | 0xc0a | R | Cache writeback count +// CPUIdleCount | 0xc0b | R | CPU idle-cycle count (lower 32 bits) +// CPUIdleCountU | 0xc0c | R | CPU idle-cycle count (upper 8 bits) +// CycleU | 0xc0d | R | Cycle counter (upper 8 bits) +// ProgRouterSent | 0xc0e | R | Msgs sent by ProgRouter +// ProgRouterSentInter | 0xc0f | R | Inter-board msgs sent by ProgRouter // ============================================================================ // Types @@ -505,12 +508,13 @@ endfunction // ============================================================================ interface Core; - interface DCacheClient dcacheClient; - interface MailboxClient mailboxClient; - interface DebugLinkClient debugLinkClient; - interface FPUClient fpuClient; - interface InstrMemClient instrMemClient; - interface IdleDetectorClient idleClient; + interface DCacheClient dcacheClient; + interface MailboxClient mailboxClient; + interface DebugLinkClient debugLinkClient; + interface FPUClient fpuClient; + interface InstrMemClient instrMemClient; + interface IdleDetectorClient 
idleClient; + interface ProgRouterPerfClient progRouterPerfClient; // Each core can see its board id (* always_ready, always_enabled *) @@ -676,18 +680,27 @@ module mkCore#(CoreId myId) (Core); Reg#(Bit#(32)) hitCount <- mkConfigReg(0); Reg#(Bit#(32)) writebackCount <- mkConfigReg(0); Reg#(Bit#(40)) cpuIdleCount <- mkConfigReg(0); + // Only core zero maintains the following two counters + Reg#(Bit#(32)) progRouterSent <- mkConfigReg(0); + Reg#(Bit#(32)) progRouterSentInterBoard <- mkConfigReg(0); // Indexable vector of performance counters - Vector#(6, Bit#(32)) perfCounters = + Vector#(8, Bit#(32)) perfCounters = vector(missCount, hitCount, writebackCount, cpuIdleCount[31:0], zeroExtend(cpuIdleCount[39:32]), - zeroExtend(cycleCount[39:32])); + zeroExtend(cycleCount[39:32]), + myId == 0 ? progRouterSent : ?, + myId == 0 ? progRouterSentInterBoard : ?); // Increment wires Wire#(Bool) incMissCountWire <- mkDWire(False); Wire#(Bool) incHitCountWire <- mkDWire(False); Wire#(Bool) incWritebackCountWire <- mkDWire(False); Wire#(Bool) incCPUIdleCountWire <- mkDWire(False); + Wire#(Bit#(LogFetchersPerProgRouter)) + incProgRouterSent <- mkBypassWire; + Wire#(Bit#(LogFetchersPerProgRouter)) + incProgRouterSentInterBoard <- mkBypassWire; // Update performance counters rule updatePerfCounters; @@ -696,11 +709,20 @@ module mkCore#(CoreId myId) (Core); hitCount <= 0; writebackCount <= 0; cpuIdleCount <= 0; + if (myId == 0) begin + progRouterSent <= 0; + progRouterSentInterBoard <= 0; + end end else if (perfCountEnabled) begin if (incMissCountWire) missCount <= missCount+1; if (incHitCountWire) hitCount <= hitCount+1; if (incWritebackCountWire) writebackCount <= writebackCount+1; if (incCPUIdleCountWire) cpuIdleCount <= cpuIdleCount+1; + if (myId == 0) begin + progRouterSent <= progRouterSent + zeroExtend(incProgRouterSent); + progRouterSentInterBoard <= progRouterSentInterBoard + + zeroExtend(incProgRouterSentInterBoard); + end end endrule `endif @@ -1321,6 +1343,19 @@ module 
mkCore#(CoreId myId) (Core); method Bool idleStage1Ack = mailbox.idleStage1Ack; endinterface + interface ProgRouterPerfClient progRouterPerfClient; + method Action incSent(Bit#(LogFetchersPerProgRouter) amount); + `ifdef EnablePerfCount + incProgRouterSent <= amount; + `endif + endmethod + method Action incSentInterBoard(Bit#(LogFetchersPerProgRouter) amount); + `ifdef EnablePerfCount + incProgRouterSentInterBoard <= amount; + `endif + endmethod + endinterface + endmodule endpackage diff --git a/rtl/DCache.bsv b/rtl/DCache.bsv index 3162aade..e972a858 100644 --- a/rtl/DCache.bsv +++ b/rtl/DCache.bsv @@ -437,9 +437,11 @@ module mkDCache#(DCacheId myId) (DCache); // This rule either consumes a flush request or a memory response let flush = flushQueue.dataOut; let resp = respPort.value; + InflightDCacheReqInfo info = unpack(truncate(resp.info)); + Bit#(`LogBeatsPerLine) beat = truncate(resp.beat); lineWriteDataWire <= resp.data; - lineWriteIndexWire <= beatIndex(resp.info.beat, resp.info.req.id, - resp.info.req.addr, resp.info.way); + lineWriteIndexWire <= beatIndex(beat, info.req.id, + info.req.addr, info.way); // Ready to consume flush queue? 
if (flushQueue.canDeq && flushQueue.canPeek) begin flush.req.cmd.isFlush = False; @@ -453,14 +455,14 @@ module mkDCache#(DCacheId myId) (DCache); // Remove item from fill queue and feed associated request (which // will definitely hit if it starts again from the beginning of // the pipeline) back to beginning of the pipeline - if (allHigh(resp.info.beat)) + if (allHigh(beat)) feedbackTrigger <= True; // Write new line data to dataMem // (The write parameters are set outside condition for better timing) lineWriteReqWire <= True; respPort.get; // Set feedback request - feedbackReq <= resp.info.req; + feedbackReq <= info.req; end endrule @@ -492,11 +494,10 @@ module mkDCache#(DCacheId myId) (DCache); InflightDCacheReqInfo info; info.req = miss.req; info.way = miss.evictWay; - info.beat = ?; // Create memory request DRAMReq memReq; memReq.isStore = !isLoad; - memReq.id = myId; + memReq.id = zeroExtend(myId); memReq.addr = {isLoad ? readLineAddr : writeLineAddr, reqBeat}; memReq.data = isLoad ? {?, pack(info)} : dataMem.dataOutA; memReq.burst = isLoad ? 
`BeatsPerLine : 1; @@ -589,66 +590,6 @@ interface DCacheClient; method Action incWritebackCount(Bool inc); endinterface -// ============================================================================ -// Connections -// ============================================================================ - -module connectCoresToDCache#( - Vector#(`CoresPerDCache, DCacheClient) clients, - DCache dcache) (); - - // Connect requests - function getDCacheReqOut(client) = client.dcacheReqOut; - let dcacheReqs <- mkMergeTree(Fair, - mkUGShiftQueue1(QueueOptFmax), - map(getDCacheReqOut, clients)); - connectUsing(mkUGQueue, dcacheReqs, dcache.reqIn); - - // Connect responses - function Bit#(`LogCoresPerDCache) getDCacheRespKey(DCacheResp resp) = - truncateLSB(resp.id); - function getDCacheRespIn(client) = client.dcacheRespIn; - let dcacheResps <- mkResponseDistributor( - getDCacheRespKey, - mkUGShiftQueue1(QueueOptFmax), - map(getDCacheRespIn, clients)); - connectDirect(dcache.respOut, dcacheResps); - - // Connect performance-counter wires - rule connectPerfCountWires; - clients[0].incMissCount(dcache.incMissCount); - clients[0].incHitCount(dcache.incHitCount); - clients[0].incWritebackCount(dcache.incWritebackCount); - for (Integer i = 1; i < `CoresPerDCache; i=i+1) begin - clients[i].incMissCount(False); - clients[i].incHitCount(False); - clients[i].incWritebackCount(False); - end - endrule - -endmodule - -module connectDCachesToOffChipRAM#( - Vector#(`DCachesPerDRAM, DCache) caches, OffChipRAM ram) (); - - // Connect requests - function getReqOut(cache) = cache.reqOut; - let reqs <- mkMergeTreeB(Fair, - mkUGShiftQueue1(QueueOptFmax), - map(getReqOut, caches)); - connectUsing(mkUGQueue, reqs, ram.reqIn); - - // Connect load responses - function DCacheId getRespKey(DRAMResp resp) = resp.id; - function getRespIn(cache) = cache.respIn; - let ramResps <- mkResponseDistributor( - getRespKey, - mkUGShiftQueue2(QueueOptFmax), - map(getRespIn, caches)); - connectDirect(ram.respOut, 
ramResps); - -endmodule - // ============================================================================ // Dummy cache // ============================================================================ diff --git a/rtl/DCacheTypes.bsv b/rtl/DCacheTypes.bsv index fa6ba407..4ddd809f 100644 --- a/rtl/DCacheTypes.bsv +++ b/rtl/DCacheTypes.bsv @@ -43,7 +43,6 @@ typedef struct { typedef struct { DCacheReq req; Way way; - Bit#(`LogBeatsPerLine) beat; } InflightDCacheReqInfo deriving (Bits); endpackage diff --git a/rtl/DE5BridgeTop.bsv b/rtl/DE5BridgeTop.bsv index 5dce9e25..15e2ba8f 100644 --- a/rtl/DE5BridgeTop.bsv +++ b/rtl/DE5BridgeTop.bsv @@ -12,9 +12,10 @@ // 1. DA: Destination address (4 bytes) // 2. NM: Number of messages that follow minus one (4 bytes) // 3. FM: Number of flit payloads per message minus one (1 byte) -// 4. Padding (7 bytes) -// 5. (NM+1)*(FM+1) flit payloads ((NM+1)*(FM+1)*BytesPerFlit bytes) -// 6. Goto step 1 +// 4. Padding (3 bytes) +// 5. Routing key (optional, 4 bytes) +// 6. (NM+1)*(FM+1) flit payloads ((NM+1)*(FM+1)*BytesPerFlit bytes) +// 7. Goto step 1 // // The format of the data stream in the FPGA->PC direction is simply // raw flit payloads. 
@@ -161,6 +162,7 @@ module de5BridgeTop (DE5BridgeTop); Reg#(Bit#(32)) fromPCIeDA <- mkConfigRegU; Reg#(Bit#(32)) fromPCIeNM <- mkConfigRegU; Reg#(Bit#(8)) fromPCIeFM <- mkConfigRegU; + Reg#(Bit#(32)) fromPCIeKey <- mkConfigRegU; Reg#(Bit#(1)) toLinkState <- mkConfigReg(0); Reg#(Bit#(32)) messageCount <- mkConfigReg(0); @@ -182,6 +184,7 @@ module de5BridgeTop (DE5BridgeTop); fromPCIeDA <= data[31:0]; fromPCIeNM <= data[63:32]; fromPCIeFM <= data[95:88]; + fromPCIeKey <= data[127:96]; toLinkState <= 1; fromPCIe.get; end @@ -203,6 +206,10 @@ module de5BridgeTop (DE5BridgeTop); Flit flit; flit.dest.addr = unpack(truncate(fromPCIeDA[31:`LogThreadsPerMailbox])); flit.dest.threads = pack(destThreads); + // If address says to use routing key, then use it + if (flit.dest.addr.isKey) begin + flit.dest.threads = zeroExtend(fromPCIeKey); + end flit.payload = fromPCIe.value; flit.notFinalFlit = True; flit.isIdleToken = False; diff --git a/rtl/DE5Top.bsv b/rtl/DE5Top.bsv index 2173526d..bb35bc19 100644 --- a/rtl/DE5Top.bsv +++ b/rtl/DE5Top.bsv @@ -22,6 +22,7 @@ import InstrMem :: *; import NarrowSRAM :: *; import OffChipRAM :: *; import IdleDetector :: *; +import Connections :: *; // ============================================================================ // Interface @@ -114,10 +115,6 @@ module de5Top (DE5Top); for (Integer j = 0; j < `DCachesPerDRAM; j=j+1) connectCoresToDCache(map(dcacheClient, cores[i][j]), dcaches[i][j]); - // Connect data caches to DRAM - for (Integer i = 0; i < `DRAMsPerBoard; i=i+1) - connectDCachesToOffChipRAM(dcaches[i], rams[i]); - // Create FPUs Vector#(`FPUsPerBoard, FPU) fpus; for (Integer i = 0; i < `FPUsPerBoard; i=i+1) @@ -143,10 +140,6 @@ module de5Top (DE5Top); // Create idle-detector IdleDetector idle <- mkIdleDetector; - // Connect cores to idle-detector - function idleClient(core) = core.idleClient; - connectCoresToIdleDetector(map(idleClient, vecOfCores), idle); - // Create mailboxes Vector#(`MailboxMeshYLen, Vector#(`MailboxMeshXLen, 
Mailbox)) mailboxes = @@ -155,6 +148,13 @@ module de5Top (DE5Top); for (Integer x = 0; x < `MailboxMeshXLen; x=x+1) mailboxes[y][x] <- mkMailboxAcc(debugLink.getBoardId(), x, y); + // Initialise mailbox send slots + rule initSendSlots; + for (Integer y = 0; y < `MailboxMeshYLen; y=y+1) + for (Integer x = 0; x < `MailboxMeshXLen; x=x+1) + mailboxes[y][x].initSendSlots(debugLink.useExtraSendSlot); + endrule + // Connect cores to mailboxes for (Integer y = 0; y < `MailboxMeshYLen; y=y+1) for (Integer x = 0; x < `MailboxMeshXLen; x=x+1) begin @@ -167,13 +167,27 @@ module de5Top (DE5Top); connectCoresToMailbox(map(mailboxClient, cs), mailboxes[y][x]); end - // Create mesh of mailboxes + // Create network-on-chip function MailboxNet mailboxNet(Mailbox mbox) = mbox.net; - ExtNetwork net <- mkMailboxMesh( - debugLink.getBoardId(), - debugLink.linkEnable, - map(map(mailboxNet), mailboxes), - idle); + NoC noc <- mkNoC( + debugLink.getBoardId(), + debugLink.linkEnable, + map(map(mailboxNet), mailboxes), + idle); + + // Connect cores and ProgRouter fetchers to idle-detector + function idleClient(core) = core.idleClient; + connectClientsToIdleDetector( + map(idleClient, vecOfCores), noc.activities, idle); + + // Connections to off-chip RAMs + for (Integer i = 0; i < `DRAMsPerBoard; i=i+1) + connectClientsToOffChipRAM(dcaches[i], + noc.dramReqs[i], noc.dramResps[i], rams[i]); + + // Connects ProgRouter performance counters to cores + connectProgRouterPerfCountersToCores(noc.progRouterPerfCounters, + concat(concat(cores))); // Set board ids rule setBoardIds; @@ -199,10 +213,10 @@ module de5Top (DE5Top); interface dramIfcs = map(getDRAMExtIfc, rams); interface sramIfcs = concat(map(getSRAMExtIfcs, rams)); interface jtagIfc = debugLink.jtagAvalon; - interface northMac = net.north; - interface southMac = net.south; - interface eastMac = net.east; - interface westMac = net.west; + interface northMac = noc.north; + interface southMac = noc.south; + interface eastMac = noc.east; + 
interface westMac = noc.west; method Action setBoardId(Bit#(4) id); localBoardId <= id; endmethod diff --git a/rtl/DRAM.bsv b/rtl/DRAM.bsv index b9bab54e..406cfe89 100644 --- a/rtl/DRAM.bsv +++ b/rtl/DRAM.bsv @@ -5,8 +5,11 @@ package DRAM; // Types // ============================================================================ +// DRAM client id +typedef Bit#(TLog#(TAdd#(`DCachesPerDRAM,`FetchersPerProgRouter))) DRAMClientId; + // DRAM request id -typedef DCacheId DRAMReqId; +typedef DRAMClientId DRAMReqId; // DRAM request typedef struct { @@ -22,8 +25,13 @@ typedef struct { typedef struct { DRAMReqId id; Bit#(`BeatWidth) data; - InflightDCacheReqInfo info; + // Which beat is it? Bool finalBeat; + Bit#(`BeatBurstWidth) beat; + // Data from original load request + // (Can be largely ignored and optimised away, but + // can also hold useful info about the original request) + Bit#(`BeatWidth) info; } DRAMResp deriving (Bits); // DRAM identifier @@ -80,7 +88,6 @@ import Util :: *; import Interface :: *; import Queue :: *; import Assert :: *; -import DCacheTypes :: *; // Types // ----- @@ -151,8 +158,8 @@ module mkDRAM#(RAMId id) (DRAM); DRAMResp resp; resp.id = req.id; resp.data = pack(elems); - resp.info = unpack(truncate(req.data)); - resp.info.beat = truncate(burstCount); + resp.info = req.data; + resp.beat = burstCount; resp.finalBeat = finalBeat; resps.enq(resp); decOutstanding.send; @@ -219,7 +226,6 @@ import Interface :: *; import Assert :: *; import Util :: *; import Assert :: *; -import DCacheTypes :: *; // Types // ----- @@ -244,7 +250,7 @@ endinterface typedef struct { DRAMReqId id; Bit#(`BeatBurstWidth) burst; - InflightDCacheReqInfo info; + Bit#(`BeatWidth) info; } DRAMInFlightReq deriving (Bits); // Implementation @@ -309,7 +315,7 @@ module mkDRAM#(t id) (DRAM); DRAMInFlightReq inflightReq; inflightReq.id = req.id; inflightReq.burst = req.burst; - inflightReq.info = unpack(truncate(req.data)); + inflightReq.info = req.data; inFlight.enq(inflightReq); 
inFlightCount.incBy(zeroExtend(req.burst)); end @@ -336,7 +342,7 @@ module mkDRAM#(t id) (DRAM); DRAMResp resp; resp.id = inFlight.dataOut.id; resp.info = inFlight.dataOut.info; - resp.info.beat = truncate(burstCount-1); + resp.beat = truncate(burstCount-1); resp.data = respBuffer.dataOut; resp.finalBeat = burstCount == inFlight.dataOut.burst; return resp; diff --git a/rtl/DebugLink.bsv b/rtl/DebugLink.bsv index 676696e7..a09236b5 100644 --- a/rtl/DebugLink.bsv +++ b/rtl/DebugLink.bsv @@ -13,16 +13,18 @@ package DebugLink; // Commands sent from the host PC to DebugLink typically consist of a // few bytes over the JTAG UART. // -// QueryIn: tag (1 byte), board offset (1 byte), edge disable (1 byte) -// ------------------------------------------------------------------- +// QueryIn: tag (1 byte), board offset (1 byte), config (1 byte) +// ------------------------------------------------------------- // // Sets the X offset (offset[3:0]) and the Y offset (offset[7:4]) // of the board id (to support multiple boxes). // Disable the specified inter-FPGA links: -// * disable[0]: disable links on north side of box -// * disable[1]: disable links on south side of box -// * disable[2]: disable links on east side of box -// * disable[3]: disable links on west side of box +// * config[0]: disable links on north side of box +// * config[1]: disable links on south side of box +// * config[2]: disable links on east side of box +// * config[3]: disable links on west side of box +// Enable extra send slot: +// * config[4]: reserve extra send slot // Responds with a QueryOut (see below). 
// // SetDest: tag (1 byte), thread id (1 byte), core id (1 byte) @@ -202,9 +204,13 @@ interface DebugLink; // Get board id via DebugLink (* always_ready, always_enabled *) method BoardId getBoardId(); - // Optionally disable each inter-FPGA link via DebugLink + // Config option: disable each inter-FPGA link via DebugLink + // (Allows sandboxing of boxes or groups of boxes) (* always_ready, always_enabled *) method Vector#(4, Bool) linkEnable; + // Config option: reserve extra send slot per thread in mailbox + (* always_ready, always_enabled *) + method Option#(Bool) useExtraSendSlot; endinterface module mkDebugLink#( @@ -224,6 +230,11 @@ module mkDebugLink#( // (Initially, all disabled) Reg#(Vector#(4, Bool)) linkEnableReg <- mkConfigReg(replicate(False)); + // Config option: reserve extra send slot in mailbox? + // Use a chain of registers to aid propagation on chip + Vector#(3, Reg#(Option#(Bool))) useExtraSendSlotReg <- + replicateM(mkConfigReg(Option {valid : False, value: False})); + // Ports InPort#(Bit#(8)) fromJtag <- mkInPort; OutPort#(Bit#(8)) toJtag <- mkOutPort; @@ -331,6 +342,9 @@ module mkDebugLink#( // Disable west link? if (x == 0 && edgeEn[3] == 1) linkEn[3] = False; linkEnableReg <= linkEn; + // Reserve extra send slot? 
+ useExtraSendSlotReg[2] <= + Option {valid: True, value: fromJtag.value[4] == 1}; respondFlag <= True; respondCmd <= cmdQueryIn; recvState <= 0; @@ -404,6 +418,11 @@ module mkDebugLink#( end endrule + // Propagate extra send slot option through chain of registers (for timing) + rule chain; + for (Integer i = 0; i < 2; i=i+1) + useExtraSendSlotReg[i] <= useExtraSendSlotReg[i+1]; + endrule `ifndef SIMULATE interface jtagAvalon = uart.jtagAvalon; @@ -411,7 +430,7 @@ module mkDebugLink#( method BoardId getBoardId() = boardId; method Vector#(4, Bool) linkEnable = linkEnableReg; - + method Option#(Bool) useExtraSendSlot = useExtraSendSlotReg[0]; endmodule endpackage diff --git a/rtl/GenInit.sh b/rtl/GenInit.sh deleted file mode 100755 index ad2a6e0c..00000000 --- a/rtl/GenInit.sh +++ /dev/null @@ -1,19 +0,0 @@ -#!/bin/bash - -# Generate memory initialisation files - -# Load config parameters -while read -r EXPORT; do - eval $EXPORT -done <<< `python ../config.py envs` - -MaxSlot=$(((2**LogMsgsPerMailbox) - 1)) -ThreadsPerMailbox=$((2**$LogThreadsPerMailbox)) - -# Emit hex file -for I in $(seq $ThreadsPerMailbox $MaxSlot); do - printf "%x\n" $I -done >> FreeSlots.hex - -# Emit MIF file -../bin/hex-to-mif.py FreeSlots.hex $LogMsgsPerMailbox > ../de5/FreeSlots.mif diff --git a/rtl/Globals.bsv b/rtl/Globals.bsv index a2648a23..d240aa2c 100644 --- a/rtl/Globals.bsv +++ b/rtl/Globals.bsv @@ -20,10 +20,13 @@ typedef struct { // destination board, it is routed either left or right depending // the contents of the host bit. This is to support bridge boards // connected at the east/west rims of the FPGA mesh. +// The 'isKey' bit means that the destination is a routing key, held +// in the bottom 32 bits of the 'NetAddr'. // The 'acc' bit means message is routed to a custom accelerator rather // than a mailbox. 
typedef struct { Bool acc; + Bool isKey; Option#(Bit#(1)) host; BoardId board; MailboxId mbox; @@ -42,6 +45,9 @@ typedef struct { function MailboxId getMailboxId(NetAddr addr) = addr.addr.mbox; +// Extract routing key from network address +function Bit#(32) getRoutingKeyRaw(NetAddr addr) = truncate(pack(addr)); + // ============================================================================ // Messages // ============================================================================ @@ -63,7 +69,7 @@ typedef struct { Bool notFinalFlit; // Is this a special packet for idle-detection? Bool isIdleToken; -} Flit deriving (Bits); +} Flit deriving (Bits, FShow); // A padded flit is a multiple of 64 bits // (i.e. the data width of the 10G MAC interface) diff --git a/rtl/IdleDetector.bsv b/rtl/IdleDetector.bsv index 0307f198..59e4b530 100644 --- a/rtl/IdleDetector.bsv +++ b/rtl/IdleDetector.bsv @@ -18,14 +18,16 @@ // The implementation below is based on Safra's termination detection // algorithm (EWD998). -import Mailbox :: *; -import Globals :: *; -import Interface :: *; -import Queue :: *; -import Vector :: *; -import ConfigReg :: *; -import Util :: *; -import DReg :: *; +import Mailbox :: *; +import Globals :: *; +import Interface :: *; +import Queue :: *; +import Vector :: *; +import ConfigReg :: *; +import Util :: *; +import DReg :: *; +import ProgRouter :: *; +import Assert :: *; // The total number of messages sent by all threads on an FPGA minus // the total number of messages received by all threads on an FPGA. 
@@ -221,6 +223,7 @@ module mkIdleDetector (IdleDetector); NetAddr { addr: MailboxNetAddr { acc: False, + isKey: False, host: option(True, 0), board: BoardId { y: 0, x: 0 }, mbox: MailboxId { y: 0, x: 0 } @@ -301,33 +304,6 @@ module mkIdleDetector (IdleDetector); endmodule -// Pipelined reduction tree -module mkPipelinedReductionTree#( - function a reduce(a x, a y), - a init, - List#(a) xs) - (a) provisos(Bits#(a, _)); - Integer len = List::length(xs); - if (len == 0) - return error("mkSumList applied to empty list"); - else if (len == 1) - return xs[0]; - else begin - List#(a) ys = xs; - List#(a) reduced = Nil; - for (Integer i = 0; i < len; i=i+2) begin - Reg#(a) r <- mkConfigReg(init); - rule assignOut; - r <= reduce(ys[0], ys[1]); - endrule - ys = List::drop(2, ys); - reduced = Cons(readReg(r), reduced); - end - a res <- mkPipelinedReductionTree(reduce, init, reduced); - return res; - end -endmodule - interface IdleDetectorClient; method Bit#(1) incSent; method Bit#(1) incReceived; @@ -342,22 +318,33 @@ interface IdleDetectorClient; method Bool idleStage1Ack; endinterface -// Connect cores to idle detector -module connectCoresToIdleDetector#( - Vector#(n, IdleDetectorClient) core, IdleDetector detector) () - provisos (Log#(n, log_n), Add#(log_n, 1, m), Add#(_a, m, 62)); +// Connect cores and fetchers to idle detector +module connectClientsToIdleDetector#( + Vector#(`CoresPerBoard, IdleDetectorClient) core, + Vector#(`FetchersPerProgRouter, FetcherActivity) fetcher, + IdleDetector detector) () + provisos (Mul#(2, `CoresPerBoard, n)); + + staticAssert(2**`LogCoresPerBoard1 > `CoresPerBoard+`FetchersPerProgRouter, + "connectCoresToIdleDetector: insufficient width"); // Sum "incSent" wires from each core - Vector#(n, Bit#(m)) incSents = newVector; - for (Integer i = 0; i < valueOf(n); i=i+1) + Vector#(n, Bit#(`LogCoresPerBoard1)) incSents = replicate(0); + for (Integer i = 0; i < `CoresPerBoard; i=i+1) incSents[i] = zeroExtend(core[i].incSent); - Bit#(m) incSent <- 
mkPipelinedReductionTree( \+ , 0, toList(incSents)); + for (Integer i = 0; i < `FetchersPerProgRouter; i=i+1) + incSents[`CoresPerBoard+i] = zeroExtend(fetcher[i].incSent); + Bit#(`LogCoresPerBoard1) incSent <- + mkPipelinedReductionTree( \+ , 0, toList(incSents)); // Sum "incRecv" wires from each core - Vector#(n, Bit#(m)) incRecvs = newVector; - for (Integer i = 0; i < valueOf(n); i=i+1) + Vector#(n, Bit#(`LogCoresPerBoard1)) incRecvs = replicate(0); + for (Integer i = 0; i < `CoresPerBoard; i=i+1) incRecvs[i] = zeroExtend(core[i].incReceived); - Bit#(m) incRecv <- mkPipelinedReductionTree( \+ , 0, toList(incRecvs)); + for (Integer i = 0; i < `FetchersPerProgRouter; i=i+1) + incRecvs[`CoresPerBoard+i] = zeroExtend(fetcher[i].incReceived); + Bit#(`LogCoresPerBoard1) incRecv <- + mkPipelinedReductionTree( \+ , 0, toList(incRecvs)); // Maintain the total count Reg#(MsgCount) count <- mkConfigReg(0); @@ -368,16 +355,18 @@ module connectCoresToIdleDetector#( endrule // OR the "active" wires from each core - Vector#(n, Bool) actives = newVector; - for (Integer i = 0; i < valueOf(n); i=i+1) + Vector#(n, Bool) actives = replicate(False); + for (Integer i = 0; i < `CoresPerBoard; i=i+1) actives[i] = core[i].active; + for (Integer i = 0; i < `FetchersPerProgRouter; i=i+1) + actives[`CoresPerBoard+i] = fetcher[i].active; Bool anyActive <- mkPipelinedReductionTree( \|| , True, toList(actives)); - // OR the "vote" wires from each core - Vector#(n, Bool) votes = newVector; - for (Integer i = 0; i < valueOf(n); i=i+1) + // AND the "vote" wires from each core + Vector#(n, Bool) votes = replicate(True); + for (Integer i = 0; i < `CoresPerBoard; i=i+1) votes[i] = core[i].vote; - Bool unanamous <- mkPipelinedReductionTree( \&& , False, toList(votes)); + Bool voteDecision <- mkPipelinedReductionTree( \&& , False, toList(votes)); // Register the result Reg#(Bool) active <- mkConfigReg(True); @@ -385,24 +374,25 @@ module connectCoresToIdleDetector#( rule updateActive; active <= 
anyActive; - vote <= unanamous; + vote <= voteDecision; endrule // Counter number of stage 1 acks - Reg#(Bit#(m)) numAcks <- mkConfigReg(0); + Reg#(Bit#(`LogCoresPerBoard1)) numAcks <- mkConfigReg(0); // Sum stage 1 ack wires from each core - Vector#(n, Bit#(m)) incAcks = newVector; - for (Integer i = 0; i < valueOf(n); i=i+1) + Vector#(`CoresPerBoard, Bit#(`LogCoresPerBoard1)) incAcks = newVector; + for (Integer i = 0; i < `CoresPerBoard; i=i+1) incAcks[i] = zeroExtend(pack(core[i].idleStage1Ack)); - Bit#(m) incAck <- mkPipelinedReductionTree( \+ , 0, toList(incAcks)); + Bit#(`LogCoresPerBoard1) incAck <- + mkPipelinedReductionTree( \+ , 0, toList(incAcks)); // Stage 1 output ack Wire#(Bool) stage1AckWire <- mkDWire(False); rule updateAcks; - Bit#(m) total = numAcks + incAck; - if (total == fromInteger(valueOf(n))) begin + Bit#(`LogCoresPerBoard1) total = numAcks + incAck; + if (total == `CoresPerBoard) begin numAcks <= 0; stage1AckWire <= True; end else begin @@ -418,7 +408,7 @@ module connectCoresToIdleDetector#( detector.idle.voteIn(vote); detector.idle.ackStage1(stage1AckWire); - for (Integer i = 0; i < valueOf(n); i=i+1) begin + for (Integer i = 0; i < `CoresPerBoard; i=i+1) begin core[i].idleDetectedStage1(detector.idle.detectedStage1); core[i].idleVoteStage1(detector.idle.voteStage1); core[i].idleDetectedStage2(detector.idle.detectedStage2); @@ -538,6 +528,7 @@ module mkIdleDetectMaster (IdleDetectMaster); NetAddr { addr: MailboxNetAddr { acc: False, + isKey: False, host: option(False, 0), board: BoardId { y: truncate(boardY), x: truncate(boardX) }, mbox: MailboxId { y: 0, x: 0 } diff --git a/rtl/Interface.bsv b/rtl/Interface.bsv index c3d16860..a7cd0e91 100644 --- a/rtl/Interface.bsv +++ b/rtl/Interface.bsv @@ -212,6 +212,14 @@ module onBOut#(function u f(t x), BOut#(t) out) (BOut#(u)); method u value = f(out.value); endmodule +// Convert BOut to Out +function Out#(t) fromBOut(BOut#(t) out) = + interface Out + method Action tryGet = out.get; + method Bool 
valid = out.valid; + method t value = out.value; + endinterface; + // A null In port accepts and discards all inputs module mkNullIn (In#(t)); method Action tryPut(u val); endmethod @@ -248,6 +256,14 @@ function BOut#(t) enableBOut(Bool en, BOut#(t) out) = method t value = out.value; endinterface; +// Convert queue to BOut interface +function BOut#(t) queueToBOut(SizedQueue#(n, t) q) = + interface BOut + method Action get = q.deq; + method Bool valid = q.canDeq && q.canPeek; + method t value = q.dataOut; + endinterface; + // ============================================================================= // Merge unit // ============================================================================= @@ -396,7 +412,7 @@ module mkMergeTreeB#(MergeMethod m, module#(SizedQueue#(d, t)) mkQ, xs = List::cons(x, xs); end - let out <- mkMergeTreeList(m, mkQ, xs); + let out <- mkMergeTreeList(m, mkQ, List::reverse(xs)); return out; endmodule @@ -578,7 +594,7 @@ module mkDeserialiser (Deserialiser#(typeIn, typeOut)) endmodule // ============================================================================= -// Expansion and reduction connectors +// Reduction connectors // ============================================================================= // Reduce a list of interfaces down to a given number of interfaces, @@ -651,31 +667,4 @@ module reduceConnect#( endmodule -// Connect 'from' ports to 'to' ports, -// where 'length(from)' may be less than 'length(to)'. -// Works by wiring null to any unused 'to' ports. 
-module expandConnect#(List#(Out#(t)) from, List#(In#(t)) to) () - provisos (Bits#(t, twidth)); - - // Count inputs and outputs - Integer numFrom = List::length(from); - Integer numTo = List::length(to); - Integer q = numTo/numFrom; - - for (Integer i = 0; i < numTo; i=i+1) begin - if (q == 0) begin - // Connect input - connectUsing(mkUGShiftQueue1(QueueOptFmax), from[i], to[i]); - end else if ((i%q) == 0) begin - // Connect input - connectUsing(mkUGShiftQueue1(QueueOptFmax), from[i/q], to[i]); - end else begin - // Connect terminator - BOut#(t) nullOut <- mkNullBOut; - connectDirect(nullOut, to[i]); - end - end - -endmodule - endpackage diff --git a/rtl/Mailbox.bsv b/rtl/Mailbox.bsv index 0398b0e2..e08b1b9a 100644 --- a/rtl/Mailbox.bsv +++ b/rtl/Mailbox.bsv @@ -260,6 +260,9 @@ interface Mailbox; (* always_ready *) method Bit#(1) freeDone; // Network-side interface interface MailboxNet net; + // Initialise send slots (use extra send slot?) + (* always_ready, always_enabled *) + method Action initSendSlots(Option#(Bool) useExtraSendSlot); endinterface // Combined receive request/response interface @@ -292,6 +295,45 @@ module mkMailbox (Mailbox); Vector#(`CoresPerMailbox, InPort#(ReceiveReq)) rxReqPorts <- replicateM(mkInPort); + // Initialise free slots + // ===================== + + // Set of currently-unused message slots + // By default, the first ThreadsPerMailbox slots are reserved for sending + // Optionally, the first 2*ThreadsPerMailbox slots are reserved for sending + SizedQueue#(`LogMsgsPerMailbox, Bit#(`LogMsgsPerMailbox)) + freeSlots <- mkUGSizedQueuePrefetch; + + // Reserve extra send slot? + Wire#(Option#(Bool)) useExtraSendSlot <- mkBypassWire; + + // State of free slot initialiser + Reg#(Bit#(1)) freeSlotsInitState <- mkConfigReg(0); + + // Have the free slots been initialised yet? 
+ Reg#(Bool) freeSlotsInitDone <- mkConfigReg(False); + + // Next slot to insert into free slot queue + Reg#(Bit#(`LogMsgsPerMailbox)) freeSlotsInitNext <- mkConfigRegU; + + // Wait until config option available, which tells us how + // many slots to reserve for sending + rule initFreeSlots0 (freeSlotsInitState == 0); + if (useExtraSendSlot.valid) begin + freeSlotsInitNext <= useExtraSendSlot.value ? + fromInteger(2*`ThreadsPerMailbox) : `ThreadsPerMailbox; + freeSlotsInitState <= 1; + end + endrule + + // Initialise free slots + rule initFreeSlots1 (!freeSlotsInitDone && freeSlotsInitState == 1); + freeSlots.enq(freeSlotsInitNext); + freeSlotsInitNext <= freeSlotsInitNext + 1; + if (freeSlotsInitNext == fromInteger(2**`LogMsgsPerMailbox - 1)) + freeSlotsInitDone <= True; + endrule + // Message access unit // =================== @@ -336,15 +378,6 @@ module mkMailbox (Mailbox); Reg#(RefCount) refCountReg <- mkConfigRegU; Reg#(Bit#(`LogMsgsPerMailbox)) refCountSlot <- mkConfigRegU; - // Set of currently-unused message slots - // (The first ThreadsPerMailbox slots are reserved for sending) - QueueOpts freeSlotsOpts; - freeSlotsOpts.style = "AUTO"; - freeSlotsOpts.size = 2**`LogMsgsPerMailbox - `ThreadsPerMailbox; - freeSlotsOpts.file = Valid("FreeSlots"); - SizedQueue#(`LogMsgsPerMailbox, Bit#(`LogMsgsPerMailbox)) - freeSlots <- mkUGSizedQueuePrefetchOpts(freeSlotsOpts); - // Multicast buffer Vector#(`CoresPerMailbox, SizedQueue#(`LogMulticastBufferSize, MulticastBufferEntry)) @@ -598,7 +631,7 @@ module mkMailbox (Mailbox); // to a message slot is freed Reg#(Bit#(1)) freeDoneReg <- mkDReg(0); - rule free (freeReqPort.canGet); + rule free (freeReqPort.canGet && freeSlotsInitDone); FreeReq req = freeReqPort.value; // Process request in two cycles let count = refCount.dataOutB; @@ -667,6 +700,10 @@ module mkMailbox (Mailbox); endinterface endinterface + method Action initSendSlots(Option#(Bool) useExtra); + useExtraSendSlot <= useExtra; + endmethod + endmodule // 
============================================================================= @@ -1138,14 +1175,16 @@ import "BVI" ExternalTinselAccelerator = `ifndef UseCustomAccelerator -module mkMailboxAcc#(BoardId boardId, Integer tileX, Integer tileY) (Mailbox); +module mkMailboxAcc#(BoardId boardId, + Integer tileX, Integer tileY) (Mailbox); Mailbox mbox <- mkMailbox; return mbox; endmodule `else -module mkMailboxAcc#(BoardId boardId, Integer tileX, Integer tileY) (Mailbox); +module mkMailboxAcc#(BoardId boardId, + Integer tileX, Integer tileY) (Mailbox); // Instantiate standard mailbox Mailbox mbox <- mkMailbox; diff --git a/rtl/Makefile b/rtl/Makefile index cc521bae..e938b015 100644 --- a/rtl/Makefile +++ b/rtl/Makefile @@ -11,7 +11,7 @@ DEFS = $(shell python ../config.py defs) BSC = bsc BSCFLAGS = -wait-for-license -suppress-warnings S0015 \ -suppress-warnings G0023 \ - -steps-warn-interval 500000 -check-assert \ + -steps-warn-interval 750000 -check-assert \ +RTS -K32M -RTS # Top level module @@ -28,13 +28,13 @@ sim: $(TOPMOD) $(HOSTTOPMOD) .PHONY: verilog verilog: $(TOPMOD).v $(HOSTTOPMOD).v -$(TOPMOD): *.bsv *.c InstrMem.hex FreeSlots.hex +$(TOPMOD): *.bsv *.c InstrMem.hex make -C $(TINSEL_ROOT)/apps/boot make -C $(TINSEL_ROOT)/hostlink udsock $(BSC) $(BSCFLAGS) $(DEFS) -D SIMULATE -sim -g $(TOPMOD) -u $(TOPFILE) $(BSC) $(BSCFLAGS) -sim -o $(TOPMOD) -e $(TOPMOD) *.c -$(TOPMOD).v: *.bsv $(QP)/InstrMem.mif $(QP)/FreeSlots.mif +$(TOPMOD).v: *.bsv $(QP)/InstrMem.mif make -C $(TINSEL_ROOT)/apps/boot $(BSC) $(BSCFLAGS) -opt-undetermined-vals -unspecified-to X \ $(DEFS) -u -verilog -g $(TOPMOD) $(TOPFILE) @@ -63,12 +63,6 @@ InstrMem.hex: $(QP)/InstrMem.mif: make -C $(TINSEL_ROOT)/apps/boot -FreeSlots.hex: GenInit.sh - ./GenInit.sh - -$(QP)/FreeSlots.mif: GenInit.sh - ./GenInit.sh - .PHONY: test-mem test-mem: testMem @@ -83,7 +77,6 @@ clean: rm -f de5Top.v mkCore.v mkDCache.v mkMailbox.v mkDebugLinkRouter.v rm -f mkFPU.v mkMeshRouter.v rm -f de5BridgeTop.v - rm -f FreeSlots.hex 
../de5/FreeSlots.mif rm -rf test-mem-log rm -rf test-mailbox-log rm -rf test-array-of-queue-log diff --git a/rtl/NarrowSRAM.bsv b/rtl/NarrowSRAM.bsv index d0651392..4e51be85 100644 --- a/rtl/NarrowSRAM.bsv +++ b/rtl/NarrowSRAM.bsv @@ -1,22 +1,21 @@ // SPDX-License-Identifier: BSD-2-Clause package NarrowSRAM; -import DCacheTypes :: *; -import Util :: *; +import Util :: *; // ============================================================================ // Types // ============================================================================ // SRAM request id -typedef Bit#(`LogDCachesPerDRAM) SRAMReqId; +typedef Bit#(TLog#(TAdd#(`DCachesPerDRAM,`FetchersPerProgRouter))) SRAMReqId; // SRAM load request typedef struct { SRAMReqId id; Bit#(`SRAMAddrWidth) addr; Bit#(`SRAMBurstWidth) burst; - InflightDCacheReqInfo info; + Bit#(`BeatWidth) info; } SRAMLoadReq deriving (Bits); // SRAM store request @@ -31,7 +30,7 @@ typedef struct { typedef struct { SRAMReqId id; Bit#(`SRAMDataWidth) data; - InflightDCacheReqInfo info; + Bit#(`BeatWidth) info; } SRAMResp deriving (Bits); // ============================================================================ @@ -140,7 +139,6 @@ module mkSRAM#(RAMId id) (SRAM); resp.id = req.id; resp.data = pack(elems); resp.info = req.info; - resp.info.beat = truncate(loadBurstCount); resps.enq(resp); inFlightCount.dec; end @@ -243,7 +241,7 @@ endinterface typedef struct { SRAMReqId id; Bit#(`SRAMBurstWidth) burst; - InflightDCacheReqInfo info; + Bit#(`BeatWidth) info; } SRAMInFlightReq deriving (Bits); // SRAM Implementation diff --git a/rtl/Network.bsv b/rtl/Network.bsv index 3efbb480..07d9adfd 100644 --- a/rtl/Network.bsv +++ b/rtl/Network.bsv @@ -23,6 +23,9 @@ import Socket :: *; import Util :: *; import IdleDetector :: *; import FlitMerger :: *; +import OffChipRAM :: *; +import DRAM :: *; +import ProgRouter :: *; // ============================================================================= // Mesh Router @@ -146,11 +149,9 @@ module 
mkMeshRouter#(MailboxId m) (MeshRouter); // Routing function function Route route(NetAddr a); - if (a.addr.board.y < b.y) return Down; - else if (a.addr.board.y > b.y) return Up; - else if (a.addr.host.valid) return a.addr.host.value == 0 ? Left : Right; - else if (a.addr.board.x < b.x) return Left; - else if (a.addr.board.x > b.x) return Right; + if (a.addr.board != b) return Down; + else if (a.addr.isKey) return Down; + else if (a.addr.host.valid) return Down; else if (a.addr.mbox.y < m.y) return Down; else if (a.addr.mbox.y > m.y) return Up; else if (a.addr.mbox.x < m.x) return Left; @@ -271,27 +272,35 @@ module mkBoardLink#(Bool en, SocketId id) (BoardLink); endmodule // ============================================================================= -// Mailbox Mesh +// Network-on-chip // ============================================================================= -// Interface to external (off-board) network -interface ExtNetwork; -`ifndef SIMULATE - // Avalon interfaces to 10G MACs +// NoC interface +interface NoC; + `ifndef SIMULATE + // Avalon interfaces to 10G MACs (inter-FPGA links) interface Vector#(`NumNorthSouthLinks, AvalonMac) north; interface Vector#(`NumNorthSouthLinks, AvalonMac) south; interface Vector#(`NumEastWestLinks, AvalonMac) east; interface Vector#(`NumEastWestLinks, AvalonMac) west; -`endif + `endif + // Connections to off-chip memory (for the programmable router) + interface Vector#(`DRAMsPerBoard, + Vector#(`FetchersPerProgRouter, BOut#(DRAMReq))) dramReqs; + interface Vector#(`DRAMsPerBoard, + Vector#(`FetchersPerProgRouter, In#(DRAMResp))) dramResps; + // ProgRouter fetcher activities & performance counters + interface Vector#(`FetchersPerProgRouter, FetcherActivity) activities; + interface ProgRouterPerfCounters progRouterPerfCounters; endinterface -module mkMailboxMesh#( +module mkNoC#( BoardId boardId, Vector#(4, Bool) linkEnable, Vector#(`MailboxMeshYLen, Vector#(`MailboxMeshXLen, MailboxNet)) mailboxes, IdleDetector idle) - 
(ExtNetwork); + (NoC); // Create off-board links Vector#(`NumNorthSouthLinks, BoardLink) northLink <- @@ -303,6 +312,9 @@ module mkMailboxMesh#( Vector#(`NumEastWestLinks, BoardLink) westLink <- mapM(mkBoardLink(linkEnable[3]), westSocket); + // Dimension-ordered routers + // ------------------------- + // Create mailbox routers Vector#(`MailboxMeshYLen, Vector#(`MailboxMeshXLen, MeshRouter)) routers = @@ -362,79 +374,43 @@ module mkMailboxMesh#( routers[y+1][x].bottomOut, routers[y][x].topIn); end - // Connect north links - // ------------------- + // Programmable board router + // ------------------------- - // Extract mesh top inputs and outputs - List#(In#(Flit)) topInList = Nil; - List#(Out#(Flit)) topOutList = Nil; - for (Integer x = `MailboxMeshXLen-1; x >= 0; x=x-1) begin - topOutList = Cons(routers[`MailboxMeshYLen-1][x].topOut, topOutList); - topInList = Cons(routers[`MailboxMeshYLen-1][x].topIn, topInList); - end + // Programmable router + ProgRouter boardRouter <- mkProgRouter(boardId); - // Connect the outgoing links - function In#(Flit) getFlitIn(BoardLink link) = link.flitIn; - reduceConnect(mkFlitMerger, - topOutList, List::map(getFlitIn, toList(northLink))); - - // Connect the incoming links - function Out#(Flit) getFlitOut(BoardLink link) = link.flitOut; - expandConnect(List::map(getFlitOut, toList(northLink)), topInList); - - // Connect south links - // ------------------- - - // Extract mesh bottom inputs and outputs - List#(In#(Flit)) botInList = Nil; - List#(Out#(Flit)) botOutList = Nil; - for (Integer x = `MailboxMeshXLen-1; x >= 0; x=x-1) begin - botOutList = Cons(routers[0][x].bottomOut, botOutList); - botInList = Cons(routers[0][x].bottomIn, botInList); - end + // Connect board router to north link + connectDirect(boardRouter.flitOut[0], northLink[0].flitIn); + connectUsing(mkUGShiftQueue1(QueueOptFmax), + northLink[0].flitOut, boardRouter.flitIn[0]); - // Connect the outgoing links - reduceConnect(mkFlitMerger, botOutList, - 
List::map(getFlitIn, toList(southLink))); - - // Connect the incoming links - expandConnect(List::map(getFlitOut, toList(southLink)), botInList); - - // Connect east links - // ------------------ - - // Extract mesh right inputs and outputs - List#(In#(Flit)) rightInList = Nil; - List#(Out#(Flit)) rightOutList = Nil; - for (Integer y = `MailboxMeshYLen-1; y >= 0; y=y-1) begin - rightOutList = Cons(routers[y][`MailboxMeshXLen-1].rightOut, rightOutList); - rightInList = Cons(routers[y][`MailboxMeshXLen-1].rightIn, rightInList); - end + // Connect board router to south link + connectDirect(boardRouter.flitOut[1], southLink[0].flitIn); + connectUsing(mkUGShiftQueue1(QueueOptFmax), + southLink[0].flitOut, boardRouter.flitIn[1]); - // Connect the outgoing links - reduceConnect(mkFlitMerger, - rightOutList, List::map(getFlitIn, toList(eastLink))); - - // Connect the incoming links - expandConnect(List::map(getFlitOut, toList(eastLink)), rightInList); - - // Connect west links - // ------------------ - - // Extract mesh right inputs and outputs - List#(In#(Flit)) leftInList = Nil; - List#(Out#(Flit)) leftOutList = Nil; - for (Integer y = `MailboxMeshYLen-1; y >= 0; y=y-1) begin - leftOutList = Cons(routers[y][0].leftOut, leftOutList); - leftInList = Cons(routers[y][0].leftIn, leftInList); - end + // Connect board router to east link + connectDirect(boardRouter.flitOut[2], eastLink[0].flitIn); + connectUsing(mkUGShiftQueue1(QueueOptFmax), + eastLink[0].flitOut, boardRouter.flitIn[2]); - // Connect the outgoing links - reduceConnect(mkFlitMerger, - leftOutList, List::map(getFlitIn, toList(westLink))); - - // Connect the incoming links - expandConnect(List::map(getFlitOut, toList(westLink)), leftInList); + // Connect board router to west link + connectDirect(boardRouter.flitOut[3], westLink[0].flitIn); + connectUsing(mkUGShiftQueue1(QueueOptFmax), + westLink[0].flitOut, boardRouter.flitIn[3]); + + // Connect mailbox mesh south rim to board router + for (Integer i = 0; i < 
`MailboxMeshXLen; i=i+1) + connectUsing(mkUGShiftQueue1(QueueOptFmax), + routers[0][i].bottomOut, boardRouter.flitIn[4+i]); + + // Connect board router to mailbox mesh south rim + function In#(Flit) getBottomIn(MeshRouter r) = r.bottomIn; + Vector#(`MailboxMeshXLen, In#(Flit)) southRimInPorts = + map(getBottomIn, routers[0]); + for (Integer i = 0; i < `MailboxMeshXLen; i=i+1) + connectDirect(boardRouter.flitOut[4+i], southRimInPorts[i]); // Detect inter-board activity // --------------------------- @@ -465,13 +441,31 @@ module mkMailboxMesh#( idle.idle.interBoardActivity(activityReg); endrule -`ifndef SIMULATE + // Interfaces + // ---------- + + function In#(t) getIn(InPort#(t) p) = p.in; + + `ifndef SIMULATE function AvalonMac getMac(BoardLink link) = link.avalonMac; interface north = Vector::map(getMac, northLink); interface south = Vector::map(getMac, southLink); interface east = Vector::map(getMac, eastLink); interface west = Vector::map(getMac, westLink); -`endif + `endif + + // Requests to off-chip memory + interface dramReqs = boardRouter.ramReqs; + + // Responses from off-chip memory + interface dramResps = boardRouter.ramResps; + + // Fetcher activities + interface activities = boardRouter.activities; + + // Performance counters + interface ProgRouterPerfCounters progRouterPerfCounters = + boardRouter.perfCounters; endmodule diff --git a/rtl/ProgRouter.bsv b/rtl/ProgRouter.bsv new file mode 100644 index 00000000..6e531261 --- /dev/null +++ b/rtl/ProgRouter.bsv @@ -0,0 +1,948 @@ +// SPDX-License-Identifier: BSD-2-Clause +// Functions, data types, and modules for programmable routers +package ProgRouter; + +import Globals :: *; +import Util :: *; +import DRAM :: *; +import Vector :: *; +import Queue :: *; +import Interface :: *; +import BlockRam :: *; +import Assert :: *; +import Util :: *; +import DReg :: *; + +// ============================================================================= +// Routing keys and beats +// 
============================================================================= + +// A routing record is either 48 bits or 96 bits in size (aligned on a +// 48-bit or 96-bit boundary respectively). Multiple records are +// packed into a 256-bit DRAM beat (aligned on a 256-bit boundary). +// The most significant 16 bits of the beat contain a count of the +// number of records in the beat (in the range 1 to 5 inclusive). The +// remaining 240 bits contain records. The first record lies in the +// least-significant bits of the beat. The size portion of the routing +// key contains the number of contiguous DRAM beats holding all +// records for the key. + +// 256-bit routing beat +typedef struct { + // Number of records present + Bit#(16) size; + // The 48-bit record chunks + Vector#(5, Bit#(48)) chunks; +} RoutingBeat deriving (Bits, FShow); + +// 32-bit routing key +typedef struct { + // Which off-chip RAM? + Bit#(`LogDRAMsPerBoard) ram; + // Pointer to array of routing beats containing routing records + Bit#(`LogBeatsPerDRAM) ptr; + // Number of beats in the array + Bit#(`LogRoutingEntryLen) numBeats; +} RoutingKey deriving (Bits, FShow); + +// Extract routing key from an address +function RoutingKey getRoutingKey(NetAddr addr) = + unpack(getRoutingKeyRaw(addr)); + +// ============================================================================= +// Types of routing record +// ============================================================================= + +typedef enum { + URM1 = 3'd0, // 48-bit Unicast Router-to-Mailbox + URM2 = 3'd1, // 96-bit Unicast Router-to-Mailbox + RR = 3'd2, // 48-bit Router-to-Router + MRM = 3'd3, // 96-bit Multicast Router-to-Mailbox + IND = 3'd4 // 48-bit Indirection +} RoutingRecordTag deriving (Bits, Eq, FShow); + +typedef enum { + NORTH = 2'd0, + SOUTH = 2'd1, + EAST = 2'd2, + WEST = 2'd3 +} RoutingDir deriving (Bits, Eq); + +// 48-bit Unicast Router-to-Mailbox (URM1) record +typedef struct { + // Record type + RoutingRecordTag tag; + 
// Mailbox destination + Bit#(4) mbox; + // Mailbox-local thread identifier + Bit#(6) thread; + // Unused + Bit#(3) unused; + // Local key. The first word of the message + // payload is overwritten with this. + Bit#(32) localKey; +} URM1Record deriving (Bits, FShow); + +// 96-bit Unicast Router-to-Mailbox (URM2) record +typedef struct { + // Record type + RoutingRecordTag tag; + // Mailbox destination + Bit#(4) mbox; + // Mailbox-local thread identifier + Bit#(6) thread; + // Currently unused + Bit#(19) unused; + // Local key. The first two words of the message + // payload are overwritten with this. + Bit#(64) localKey; +} URM2Record deriving (Bits); + +// 48-bit Router-to-Router (RR) record +typedef struct { + // Record type + RoutingRecordTag tag; + // Direction (N, S, E, or W) + RoutingDir dir; + // Currently unused + Bit#(11) unused; + // New 32-bit routing key that will replace the one in the + // current message for the next hop of the message's journey + Bit#(32) newKey; +} RRRecord deriving (Bits); + +// 96-bit Multicast Router-to-Mailbox (MRM) record +typedef struct { + // Record type + RoutingRecordTag tag; + // Mailbox destination + Bit#(4) mbox; + // Currently unused + Bit#(9) unused; + // Local key. The least-significant half-word + // of the message is replaced with this + Bit#(16) localKey; + // Mailbox-local destination mask + Bit#(64) destMask; +} MRMRecord deriving (Bits); + +// 48-bit Indirection (IND) record +// Note the restrictions on IND records: +// 1. At most one IND record per key lookup +// 2.
A max-sized key lookup must contain an IND record +typedef struct { + // Record type + RoutingRecordTag tag; + // Currently unused + Bit#(13) unused; + // New 32-bit routing key for new set of records on current router + Bit#(32) newKey; +} INDRecord deriving (Bits); + +// ============================================================================= +// Internal types +// ============================================================================= + +// It is sometimes convenient (though redundant) to record a routing +// decision for a flit internally within the programmable router +typedef struct { + // Normal flit + Flit flit; + // Routing decision for flit + RoutingDecision decision; +} RoutedFlit deriving (Bits, FShow); + +// Routing decision +typedef enum { + RouteNorth, + RouteSouth, + RouteEast, + RouteWest, + RouteNoC +} RoutingDecision deriving (Bits, Eq, FShow); + +// Elements of the indirection queue inside each fetcher +typedef struct { + // The indirection + RoutingKey key; + // The location of the message in the flit buffer + FetcherFlitBufferMsgAddr addr; +} IndQueueEntry deriving (Bits, FShow); + +// ============================================================================= +// Design +// ============================================================================= + +// In the following diagram N/S/E/W are the inter-FPGA links and +// L0..L3 are links at one edge of the NoC. Depending on the NoC +// dimensions, there may be more or fewer than four links on a single +// NoC edge, but the diagram assumes four.
+ +// +// N S E W L0 L1 L2 L3 Input flits +// | | | | | | | | +// +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +// | F | | F | | F | | F | | F | | F | | F | | F | Fetchers +// +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +// | | | | | | | | +// +-------------------------------------------+ +// | Crossbar | Routing +// +-------------------------------------------+ +// | | | | | | | | +// N S E W L0 L1 L2 L3 Output queues + +// The core functionality is implemented in the fetchers, which: +// (1) extract routing keys from incoming flits; +// (2) look up the keys in RAM; +// (3) interpret the resulting routing records; and +// (4) emit the interpreted flits. + +// The key property of these fetchers is that they act entirely +// independently of each other: each one can make progress even if +// another is blocked. This leads to duplicated logic resources, but +// is necessary to avoid deadlock. + +// As the routers are fully programmable, it is possible for the +// programmer to introduce deadlock using an ill-defined routing +// scheme, e.g. where a flit arrives on (say) link N and requires a +// flit to be sent back along the same direction N. However, the +// hardware does guarantee deadlock-freedom if the routing scheme is +// based on dimension-ordered routing. + +// After the fetchers have interpreted the flits, they are fed to a +// fair crossbar which organises them by destination into output +// queues. + +// ============================================================================= +// Fetcher +// ============================================================================= + +// Flit address in a fetcher's flit buffer +typedef Bit#(`FetcherLogFlitBufferSize) FetcherFlitBufferAddr; + +// Message address in a fetcher's flit buffer +typedef Bit#(`FetcherLogMsgsPerFlitBuffer) FetcherFlitBufferMsgAddr; + +// This structure contains information about an in-flight memory +// request from a fetcher.
When a fetcher issues a memory load +// request, this info is packed into the unused data field of the +// request. When the memory subsystem responds, it passes back the +// same info in an extra field inside the memory response structure. +// Maintaining info about an inflight request inside the request +// itself provides an easy way to handle out-of-order responses from +// memory. +typedef struct { + // Message address in the fetcher's flit buffer + FetcherFlitBufferMsgAddr msgAddr; + // How many beats in the burst? + Bit#(`BeatBurstWidth) burst; + // Is this the final burst of routing records for the current key? + Bool finalBurst; + // Are we processing a max-sized key (which must contain an IND record)? + Bool isMaxSizedKey; +} InflightFetcherReqInfo deriving (Bits, FShow); + +// Routing beat, tagged with the beat number in the DRAM burst +typedef struct { + // Beat + RoutingBeat beat; + // Beat number + Bit#(`BeatBurstWidth) beatNum; + // Inflight request info + InflightFetcherReqInfo info; +} NumberedRoutingBeat deriving (Bits, FShow); + +// Fetcher interface +interface Fetcher; + // Incoming and outgoing flits + interface In#(Flit) flitIn; + interface BOut#(RoutedFlit) flitOut; + // Off-chip RAM connections + interface Vector#(`DRAMsPerBoard, BOut#(DRAMReq)) ramReqs; + interface Vector#(`DRAMsPerBoard, In#(DRAMResp)) ramResps; + // Activity + interface FetcherActivity activity; +endinterface + +// Fetcher activity for performance counters and termination detection +(* always_ready *) +interface FetcherActivity; + // Increment number of sent messages + method Bit#(1) incSent; + // Increment number of messages sent to another board + method Bit#(1) incSentInterBoard; + // Increment number of received messages + method Bit#(1) incReceived; + // Active (in the termination-detection sense)? 
+ method Bool active; +endinterface + +// Fetcher module +module mkFetcher#(BoardId boardId, Integer fetcherId) (Fetcher); + + // Flit input port + InPort#(Flit) flitInPort <- mkInPort; + + // RAM request queues + Vector#(`DRAMsPerBoard, Queue1#(DRAMReq)) ramReqQueue <- + replicateM(mkUGShiftQueue(QueueOptFmax)); + + // Flit buffer + BlockRamOpts flitBufferOpts = + BlockRamOpts { + readDuringWrite: DontCare, + style: "AUTO", + registerDataOut: False, + initFile: Invalid + }; + BlockRam#(FetcherFlitBufferAddr, Flit) flitBuffer <- + mkBlockRamOpts(flitBufferOpts); + + // Beat buffer + SizedQueue#(`FetcherLogBeatBufferSize, NumberedRoutingBeat) + beatBuffer <- mkUGSizedQueue; + + // Track length of beat buffer, so that we don't overfetch + Count#(TAdd#(`FetcherLogBeatBufferSize, 1)) beatBufferLen <- + mkCount(2 ** `FetcherLogBeatBufferSize); + + // For flits whose destinations are *not* routing keys + Queue1#(RoutedFlit) flitBypassQueue <- mkUGShiftQueue(QueueOptFmax); + + // For flits whose destinations are routing keys + Queue1#(RoutedFlit) flitProcessedQueue <- mkUGShiftQueue(QueueOptFmax); + + // Final output queue for flits + Queue1#(RoutedFlit) flitOutQueue <- mkUGShiftQueue(QueueOptFmax); + + // Indirection queue and size + SizedQueue#(`FetcherLogIndQueueSize, IndQueueEntry) indQueue <- + mkUGShiftQueue(QueueOptFmax); + Count#(TAdd#(`FetcherLogIndQueueSize, 1)) indQueueLen <- + mkCount(2 ** `FetcherLogIndQueueSize); + + // Activity + Reg#(Bit#(1)) incSentReg <- mkDReg(0); + Reg#(Bit#(1)) incSentInterBoardReg <- mkDReg(0); + Reg#(Bit#(1)) incReceivedReg <- mkDReg(0); + + // Stage 1: consume input message + // ------------------------------ + + // Consumer state + // State 0: pass through flits that don't contain routing keys + // State 1: buffer flits that do contain routing keys + // State 2: fetch routing beats + Reg#(Bit#(2)) consumeState <- mkReg(0); + + // Count number of flits of message consumed so far + Reg#(Bit#(`LogMaxFlitsPerMsg)) consumeFlitCount <- 
mkReg(0); + + // Flit slot allocator + Vector#(`FetcherMsgsPerFlitBuffer, SetReset) flitBufferUsedSlots <- + replicateM(mkSetReset(False)); + + // Chosen message slot + Reg#(FetcherFlitBufferMsgAddr) chosenReg <- mkRegU; + + // Routing key of message consumed + Reg#(RoutingKey) consumeKey <- mkRegU; + + // Maintain count of routing beats fetched so far + Reg#(Bit#(`LogRoutingEntryLen)) fetchBeatCount <- mkReg(0); + + // Track when messages are bypassing fetcher, to keep the bypass atomic + Reg#(Bool) bypassInProgress <- mkReg(False); + + // State 0: pass through flits that don't contain routing keys + rule consumeMessage0 (consumeState == 0); + Flit flit = flitInPort.value; + // Find unused message slot + Bool found = False; + FetcherFlitBufferMsgAddr chosen = ?; + for (Integer i = 0; i < `FetcherMsgsPerFlitBuffer; i=i+1) + if (! flitBufferUsedSlots[i].value) begin + found = True; + chosen = fromInteger(i); + end + // Initialise counters for subsequent states + consumeFlitCount <= 0; + fetchBeatCount <= 0; + // First, try to consume indirection + if (indQueue.canDeq && indQueue.canPeek && !bypassInProgress) begin + IndQueueEntry ind = indQueue.dataOut; + // Consume + indQueue.deq; + // Release space in indQueue, unless we have another max-sized key + if (!allHigh(ind.key.numBeats)) + indQueueLen.dec; + // Jump straight to fetch state, as message already in flit buffer + chosenReg <= ind.addr; + consumeKey <= ind.key; + // Proceed only if key size is non-zero + if (ind.key.numBeats != 0) + consumeState <= 2; + end else begin + chosenReg <= chosen; + // Otherwise, try to consume flit + if (flitInPort.canGet) begin + if (flit.dest.addr.isKey) begin + if (found) begin + RoutingKey key = getRoutingKey(flit.dest); + // For a full-size key, we must reserve space in the indQueue + if (allHigh(key.numBeats)) begin + if (indQueueLen.notFull) begin + indQueueLen.inc; + consumeState <= 1; + end + end else + consumeState <= 1; + end + end else if (flitBypassQueue.notFull) begin 
+ flitInPort.get; + bypassInProgress <= flit.notFinalFlit; + // Make routing decision + RoutingDecision decision = RouteNoC; + MailboxNetAddr addr = flit.dest.addr; + if (addr.board.y < boardId.y) decision = RouteSouth; + else if (addr.board.y > boardId.y) decision = RouteNorth; + else if (addr.host.valid) + decision = addr.host.value == 0 ? RouteWest : RouteEast; + else if (addr.board.x < boardId.x) decision = RouteWest; + else if (addr.board.x > boardId.x) decision = RouteEast; + // Insert into bypass queue + flitBypassQueue.enq(RoutedFlit { decision: decision, flit: flit}); + end + end + end + endrule + + // State 1: buffer flits that do contain routing keys + rule consumeMessage1 (consumeState == 1); + Flit flit = flitInPort.value; + if (flitInPort.canGet) begin + flitInPort.get; + RoutingKey key = getRoutingKey(flit.dest); + consumeKey <= key; + // Write to flit buffer + flitBuffer.write({chosenReg, consumeFlitCount}, flit); + consumeFlitCount <= consumeFlitCount + 1; + // On final flit, move to fetch state + if (! flit.notFinalFlit) begin + // Ignore keys with zero beats + if (key.numBeats == 0) begin + consumeState <= 0; + incReceivedReg <= 1; + end else begin + consumeState <= 2; + // Claim chosen slot + flitBufferUsedSlots[chosenReg].set; + end + end + end + endrule + + // State 2: fetch routing beats + rule consumeMessage2 (consumeState == 2); + // Have we finished fetching beats? 
+ Bool finished = (consumeKey.numBeats-fetchBeatCount) <= `ProgRouterMaxBurst; + // Prepare inflight RAM request info + // (to handle out of order resps from the RAMs) + InflightFetcherReqInfo info; + info.msgAddr = chosenReg; + info.burst = truncate( + min(consumeKey.numBeats - fetchBeatCount, `ProgRouterMaxBurst)); + info.finalBurst = finished; + info.isMaxSizedKey = allHigh(consumeKey.numBeats); + // Prepare RAM request + DRAMReq req; + req.isStore = False; + req.id = fromInteger(`DCachesPerDRAM + fetcherId); + req.addr = {1'b0, consumeKey.ptr + zeroExtend(fetchBeatCount)}; + req.data = {?, pack(info)}; + req.burst = info.burst; + // Don't overfetch (beat buffer has finite size) + if (ramReqQueue[consumeKey.ram].notFull && + beatBufferLen.available >= zeroExtend(req.burst)) begin + ramReqQueue[consumeKey.ram].enq(req); + fetchBeatCount <= fetchBeatCount + zeroExtend(req.burst); + beatBufferLen.incBy(zeroExtend(req.burst)); + if (finished) begin + consumeState <= 0; + incReceivedReg <= 1; + end + end + endrule + + // Stage 2: interpret routing beats + // -------------------------------- + + // Merge responses from each RAM + staticAssert(`DRAMsPerBoard == 2, + "Fetcher: need to generalise number of RAMs used"); + MergeUnit#(NumberedRoutingBeat) ramRespMerger <- mkMergeUnitFair; + + // Convert a RAM response to a numbered routing beat + function NumberedRoutingBeat fromDRAMResp(DRAMResp resp) = + NumberedRoutingBeat { + beat: unpack(resp.data) + , beatNum: resp.beat + , info: unpack(truncate(resp.info)) + }; + + // Create RAM response input interfaces for this module + In#(DRAMResp) respA <- onIn(fromDRAMResp, ramRespMerger.inA); + In#(DRAMResp) respB <- onIn(fromDRAMResp, ramRespMerger.inB); + Vector#(`DRAMsPerBoard, In#(DRAMResp)) ramRespsOut = vector(respA, respB); + + // Connect the merger to the beat buffer + connectToQueue(ramRespMerger.out, beatBuffer); + + // Count number of flits of message emitted so far + Reg#(Bit#(`LogMaxFlitsPerMsg)) emitFlitCount <- 
mkReg(0); + + // Count number of records processed so far in current beat + Reg#(Bit#(3)) recordCount <- mkReg(0); + + // (Shift) register holding current routing beat + Reg#(NumberedRoutingBeat) beatReg <- mkRegU; + + // Interpreter state + // 0: register the routing beat and fetch first flit + // 1: interpret flits + Reg#(Bit#(1)) interpreterState <- mkReg(0); + + // State 0: register the routing beat and fetch first flit + rule interpreter0 (interpreterState == 0); + let beat = beatBuffer.dataOut; + InflightFetcherReqInfo info = beat.info; + // Consume beat + if (beatBuffer.canDeq && beatBuffer.canPeek) begin + beatReg <= beat; + beatBuffer.deq; + beatBufferLen.dec; + interpreterState <= 1; + end + // Load first flit + flitBuffer.read({info.msgAddr, 0}); + emitFlitCount <= 0; + recordCount <= 0; + endrule + + // State 1: interpret flits + rule interpreter1 (interpreterState == 1); + // Extract details of registered routing beat + let beat = beatReg.beat; + let beatNum = beatReg.beatNum; + let info = beatReg.info; + // Extract tag from next record + RoutingRecordTag tag = unpack(truncateLSB(beat.chunks[4])); + // Is this the first flit of a message? + Bool firstFlit = emitFlitCount == 0; + // Modify flit by interpreting routing key + RoutingDecision decision = ?; + Flit flit = flitBuffer.dataOut; + // Unless otherwise stated (e.g. 
RR records), + // flits emitted will be destined for this board + flit.dest.addr.board = boardId; + case (tag) + // 48-bit Unicast Router-to-Mailbox + URM1: begin + URM1Record rec = unpack(beat.chunks[4]); + flit.dest.addr.isKey = False; + flit.dest.addr.mbox.x = unpack(truncate(rec.mbox[1:0])); + flit.dest.addr.mbox.y = unpack(truncate(rec.mbox[3:2])); + Vector#(`ThreadsPerMailbox, Bool) threadMask = newVector; + for (Integer j = 0; j < `ThreadsPerMailbox; j=j+1) + threadMask[j] = rec.thread == fromInteger(j); + flit.dest.threads = pack(threadMask); + // Replace first word of message with local key + if (firstFlit) + flit.payload = {truncateLSB(flit.payload), rec.localKey}; + decision = RouteNoC; + end + // 96-bit Unicast Router-to-Mailbox + URM2: begin + URM2Record rec = unpack({beat.chunks[4], beat.chunks[3]}); + flit.dest.addr.isKey = False; + flit.dest.addr.mbox.x = unpack(truncate(rec.mbox[1:0])); + flit.dest.addr.mbox.y = unpack(truncate(rec.mbox[3:2])); + Vector#(`ThreadsPerMailbox, Bool) threadMask = newVector; + for (Integer j = 0; j < `ThreadsPerMailbox; j=j+1) + threadMask[j] = rec.thread == fromInteger(j); + flit.dest.threads = pack(threadMask); + // Replace first two words of message with local key + if (firstFlit) + flit.payload = {truncateLSB(flit.payload), rec.localKey}; + decision = RouteNoC; + end + // 48-bit Router-to-Router + RR: begin + RRRecord rec = unpack(beat.chunks[4]); + case (rec.dir) + NORTH: begin + decision = RouteNorth; + flit.dest.addr.board = BoardId {x: boardId.x, y: boardId.y+1}; + end + SOUTH: begin + decision = RouteSouth; + flit.dest.addr.board = BoardId {x: boardId.x, y: boardId.y-1}; + end + EAST: begin + decision = RouteEast; + flit.dest.addr.board = BoardId {x: boardId.x+1, y: boardId.y}; + end + WEST: begin + decision = RouteWest; + flit.dest.addr.board = BoardId {x: boardId.x-1, y: boardId.y}; + end + endcase + flit.dest.threads = {?, rec.newKey}; + end + // 96-bit Multicast Router-to-Mailbox + MRM: begin + MRMRecord 
rec = unpack({beat.chunks[4], beat.chunks[3]}); + flit.dest.addr.isKey = False; + flit.dest.addr.mbox.x = unpack(truncate(rec.mbox[1:0])); + flit.dest.addr.mbox.y = unpack(truncate(rec.mbox[3:2])); + flit.dest.threads = rec.destMask; + // Replace first half-word of message with local key + if (firstFlit) + flit.payload = {truncateLSB(flit.payload), rec.localKey}; + decision = RouteNoC; + end + // 48-bit Indirection + IND: begin end + endcase + // Is output queue ready for new flit? + Bool emit = flitProcessedQueue.notFull; + let newFlitCount = emitFlitCount; + // Consume routing record + if (emit) begin + // Only enqueue if not an IND record + if (tag != IND) + flitProcessedQueue.enq(RoutedFlit { decision: decision, flit: flit }); + // Shift beat to point to next record + RoutingBeat newBeat = beat; + Bool doubleChunk = unpack(pack(tag)[0]); + if (doubleChunk) begin + for (Integer i = 4; i > 1; i=i-1) + newBeat.chunks[i] = beat.chunks[i-2]; + end else begin + for (Integer i = 4; i > 0; i=i-1) + newBeat.chunks[i] = beat.chunks[i-1]; + end + // Is this the final flit in the message? + if (flit.notFinalFlit) + newFlitCount = emitFlitCount + 1; + else begin + // Move to next record + recordCount <= recordCount + 1; + beatReg <= NumberedRoutingBeat { + beat: newBeat, beatNum: beatNum, info: info }; + // Handle IND record: insert into indirection queue + if (tag == IND) begin + myAssert(indQueue.notFull, "Restrictions on IND records violated"); + INDRecord ind = unpack(beat.chunks[4]); + indQueue.enq(IndQueueEntry + { key: unpack(ind.newKey), addr: info.msgAddr }); + end + // Is this the final record in the beat? + if ((recordCount+1) == truncate(beat.size)) begin + interpreterState <= 0; + // Have we finished with this message yet? + if (info.finalBurst && info.burst == (beatNum+1)) begin + // Reclaim message slot in flit buffer + // (Don't do this when we have an indirection to process) + if (! 
info.isMaxSizedKey) + flitBufferUsedSlots[info.msgAddr].clear; + end + end + incSentReg <= 1; + if (tag == RR) incSentInterBoardReg <= 1; + newFlitCount = 0; + end + end + // Issue flit load request + flitBuffer.read({info.msgAddr, newFlitCount}); + emitFlitCount <= newFlitCount; + endrule + + // Stage 3: merge output queues + // ---------------------------- + + // We want to merge messages, not flits + // Are we in the middle of consuming a message? + Reg#(Bool) mergeInProgress <- mkReg(False); + Reg#(Bool) prevFromBypass <- mkReg(False); + + rule merge (flitOutQueue.notFull); + // Favour the bypass queue + Bool chooseBypass = mergeInProgress ? prevFromBypass : + flitBypassQueue.canDeq; + if (chooseBypass) begin + if (flitBypassQueue.canDeq) begin + flitBypassQueue.deq; + flitOutQueue.enq(flitBypassQueue.dataOut); + mergeInProgress <= flitBypassQueue.dataOut.flit.notFinalFlit; + prevFromBypass <= True; + end + end else if (flitProcessedQueue.canDeq) begin + flitProcessedQueue.deq; + flitOutQueue.enq(flitProcessedQueue.dataOut); + mergeInProgress <= flitProcessedQueue.dataOut.flit.notFinalFlit; + prevFromBypass <= False; + end + endrule + + // Interfaces + // ----------- + + interface flitIn = flitInPort.in; + interface flitOut = queueToBOut(flitOutQueue); + interface ramReqs = map(queueToBOut, ramReqQueue); + interface ramResps = ramRespsOut; + + interface FetcherActivity activity; + method Bit#(1) incSent = incSentReg; + method Bit#(1) incSentInterBoard = incSentInterBoardReg; + method Bit#(1) incReceived = incReceivedReg; + method Bool active = + beatBufferLen.value != 0 || consumeState != 0; + endinterface + +endmodule + +// ============================================================================= +// Crossbar +// ============================================================================= + +// Selector function for a mux in the programmable router crossbar +typedef function Bool selector(RoutedFlit flit) SelectorFunc; + +module mkProgRouterCrossbar#( + 
Vector#(numOut, SelectorFunc) f, + Vector#(numIn, BOut#(RoutedFlit)) out) + (Vector#(numOut, BOut#(RoutedFlit))) + provisos(Add#(a__, 1, numIn)); + + // Input ports + Vector#(numIn, InPort#(RoutedFlit)) inPort <- replicateM(mkInPort); + + // Connect up input ports + for (Integer i = 0; i < valueOf(numIn); i=i+1) + connectDirect(out[i], inPort[i].in); + + // Consume wires, for each input port + Vector#(numIn, PulseWire) consumeWire <- replicateM(mkPulseWireOR); + + // Keep track of service history for flit sources (for fair selection) + Vector#(numOut, Reg#(Bit#(numIn))) hist <- replicateM(mkReg(0)); + + // Current choice of flit source + Vector#(numOut, Reg#(Bit#(numIn))) choiceReg <- replicateM(mkReg(0)); + + // Output queues + Vector#(numOut, Queue#(RoutedFlit)) outQueue <- + replicateM(mkUGShiftQueue(QueueOptFmax)); + + // Selector mux for each out queue + for (Integer i = 0; i < valueOf(numOut); i=i+1) begin + + rule select; + // Vector of input flits and available flits + Vector#(numIn, RoutedFlit) flits = newVector; + Vector#(numIn, Bool) nextAvails = newVector; + Bool avail = False; + for (Integer j = 0; j < valueOf(numIn); j=j+1) begin + flits[j] = inPort[j].value; + nextAvails[j] = inPort[j].canGet && f[i](inPort[j].value) + && choiceReg[i][j] == 0; + avail = avail || (choiceReg[i][j] == 1 && inPort[j].canGet); + end + Bit#(numIn) nextAvail = pack(nextAvails); + // Choose a new source using fair scheduler + match {.newHist, .nextChoice} = sched(hist[i], nextAvail); + // Select a flit + RoutedFlit flit = oneHotSelect(unpack(choiceReg[i]), flits); + // Consume a flit + if (avail) begin + if (outQueue[i].notFull) begin + // Pass chosen flit to out queue + outQueue[i].enq(flit); + // On final flit of message + if (!flit.flit.notFinalFlit) begin + choiceReg[i] <= nextChoice; + hist[i] <= newHist; + end + end + end else if (choiceReg[i] == 0) begin + choiceReg[i] <= nextChoice; + hist[i] <= newHist; + end + // Consume from chosen source + for (Integer j = 0; j <
valueOf(numIn); j=j+1) + if (inPort[j].canGet && choiceReg[i][j] == 1 && outQueue[i].notFull) + consumeWire[j].send; + endrule + + end + + // Consume from flit sources + rule consumeFlitSources; + for (Integer j = 0; j < valueOf(numIn); j=j+1) + if (consumeWire[j]) inPort[j].get; + endrule + + return map(queueToBOut, outQueue); +endmodule + + +// ============================================================================= +// Splitter +// ============================================================================= + +// Split a single stream in two based on a predicate +module splitFlits#(SelectorFunc f, BOut#(RoutedFlit) out) + (Tuple2#(BOut#(Flit), BOut#(Flit))); + + // Consume wire + PulseWire consumeWire <- mkPulseWireOR; + + // Output streams + BOut#(Flit) outYes = + interface BOut + method Action get = consumeWire.send; + method Bool valid = out.valid && f(out.value); + method Flit value = out.value.flit; + endinterface; + BOut#(Flit) outNo = + interface BOut + method Action get = consumeWire.send; + method Bool valid = out.valid && !f(out.value); + method Flit value = out.value.flit; + endinterface; + + // Consume + rule consume; + if (consumeWire) out.get; + endrule + + return tuple2(outYes, outNo); +endmodule + +// ============================================================================= +// Programmable router +// ============================================================================= + +// Enough bits to store a count of the number of fetchers +typedef TLog#(TAdd#(`FetchersPerProgRouter, 1)) LogFetchersPerProgRouter; + +// ProgRouter's performance counters +(* always_ready, always_enabled *) +interface ProgRouterPerfCounters; + method Bit#(LogFetchersPerProgRouter) incSent; + method Bit#(LogFetchersPerProgRouter) incSentInterBoard; +endinterface + +interface ProgRouter; + // Incoming and outgoing flits + interface Vector#(`FetchersPerProgRouter, In#(Flit)) flitIn; + interface Vector#(`FetchersPerProgRouter, BOut#(Flit)) flitOut; + + // 
Interface to off-chip memory + interface Vector#(`DRAMsPerBoard, + Vector#(`FetchersPerProgRouter, BOut#(DRAMReq))) ramReqs; + interface Vector#(`DRAMsPerBoard, + Vector#(`FetchersPerProgRouter, In#(DRAMResp))) ramResps; + + // Activities & performance counters + interface Vector#(`FetchersPerProgRouter, FetcherActivity) activities; + interface ProgRouterPerfCounters perfCounters; +endinterface + +module mkProgRouter#(BoardId boardId) (ProgRouter); + + // Fetchers + Vector#(`FetchersPerProgRouter, Fetcher) fetchers = newVector; + for (Integer i = 0; i < `FetchersPerProgRouter; i=i+1) + fetchers[i] <- mkFetcher(boardId, i); + + // Crossbar routing functions + function Bit#(`MailboxMeshXBits) xcoord(RoutedFlit rf) = + zeroExtend(rf.flit.dest.addr.mbox.x); + function Bool routeN(RoutedFlit rf) = rf.decision == RouteNorth; + function Bool routeS(RoutedFlit rf) = rf.decision == RouteSouth; + function Bool routeE(RoutedFlit rf) = rf.decision == RouteEast; + function Bool routeW(RoutedFlit rf) = rf.decision == RouteWest; + function Bool routeL(Bit#(`MailboxMeshXBits) x, RoutedFlit rf) = + rf.decision == RouteNoC && xcoord(rf) == x; + Vector#(`FetchersPerProgRouter, SelectorFunc) funcs; + funcs[0] = routeN; funcs[1] = routeS; + funcs[2] = routeE; funcs[3] = routeW; + for (Integer i = 0; i < `MailboxMeshXLen; i=i+1) + funcs[4+i] = routeL(fromInteger(i)); + + // Crossbar + function BOut#(RoutedFlit) getFetcherFlitOut(Fetcher f) = f.flitOut; + Vector#(`FetchersPerProgRouter, BOut#(RoutedFlit)) fetcherOuts = + map(getFetcherFlitOut, fetchers); + Vector#(`FetchersPerProgRouter, BOut#(RoutedFlit)) + crossbarOuts <- mkProgRouterCrossbar(funcs, fetcherOuts); + Vector#(`FetchersPerProgRouter, BOut#(Flit)) crossbarOutFlits; + function Flit toFlit (RoutedFlit rf) = rf.flit; + for (Integer i = 0; i < `FetchersPerProgRouter; i=i+1) + crossbarOutFlits[i] <- onBOut(toFlit, crossbarOuts[i]); + + // Flit input interfaces + Vector#(`FetchersPerProgRouter, In#(Flit)) flitInIfc = newVector; + 
for (Integer i = 0; i < `FetchersPerProgRouter; i=i+1) + flitInIfc[i] = fetchers[i].flitIn; + + // RAM interfaces + Vector#(`DRAMsPerBoard, Vector#(`FetchersPerProgRouter, In#(DRAMResp))) + ramRespIfc = replicate(newVector); + Vector#(`DRAMsPerBoard, Vector#(`FetchersPerProgRouter, BOut#(DRAMReq))) + ramReqIfc = replicate(newVector); + for (Integer i = 0; i < `DRAMsPerBoard; i=i+1) + for (Integer j = 0; j < `FetchersPerProgRouter; j=j+1) begin + ramReqIfc[i][j] = fetchers[j].ramReqs[i]; + ramRespIfc[i][j] = fetchers[j].ramResps[i]; + end + + // Performance counters + Vector#(TExp#(TLog#(`FetchersPerProgRouter)), + Bit#(LogFetchersPerProgRouter)) incSents = replicate(0); + Vector#(TExp#(TLog#(`FetchersPerProgRouter)), + Bit#(LogFetchersPerProgRouter)) incSentsInterBoard = replicate(0); + for (Integer i = 0; i < `FetchersPerProgRouter; i=i+1) begin + incSents[i] = zeroExtend(fetchers[i].activity.incSent); + incSentsInterBoard[i] = + zeroExtend(fetchers[i].activity.incSentInterBoard); + end + Bit#(LogFetchersPerProgRouter) numSent <- + mkPipelinedReductionTree( \+ , 0, toList(incSents)); + Bit#(LogFetchersPerProgRouter) numSentInterBoard <- + mkPipelinedReductionTree( \+ , 0, toList(incSentsInterBoard)); + + function FetcherActivity getActivity(Fetcher f) = f.activity; + interface flitIn = flitInIfc; + interface flitOut = crossbarOutFlits; + interface ramReqs = ramReqIfc; + interface ramResps = ramRespIfc; + interface activities = map(getActivity, fetchers); + interface ProgRouterPerfCounters perfCounters; + method incSent = numSent; + method incSentInterBoard = numSentInterBoard; + endinterface + +endmodule + +// For core(s) to access ProgRouter's performance counters +(* always_ready, always_enabled *) +interface ProgRouterPerfClient; + method Action incSent(Bit#(LogFetchersPerProgRouter) amount); + method Action incSentInterBoard(Bit#(LogFetchersPerProgRouter) amount); +endinterface + +endpackage diff --git a/rtl/Util.bsv b/rtl/Util.bsv index 7ac885c3..f45ece48 
100644 --- a/rtl/Util.bsv +++ b/rtl/Util.bsv @@ -254,4 +254,51 @@ module mkBuffer#(Integer n, dataT init, dataT inp) (dataT) return regs[n-1]; endmodule +// Isolate first hot bit +function Bit#(n) firstHot(Bit#(n) x) = x & (~x + 1); + +// Function for fair scheduling of n tasks +function Tuple2#(Bit#(n), Bit#(n)) sched(Bit#(n) hist, Bit#(n) avail); + // First choice: an available bit that's not in the history + Bit#(n) first = firstHot(avail & ~hist); + // Second choice: any available bit + Bit#(n) second = firstHot(avail); + + // Return new history, and chosen bit + if (first != 0) begin + // Return first choice, and update history + return tuple2(hist | first, first); + end else begin + // Return second choice, and reset history + return tuple2(second, second); + end +endfunction + +// Pipelined reduction tree +module mkPipelinedReductionTree#( + function a reduce(a x, a y), + a init, + List#(a) xs) + (a) provisos(Bits#(a, _)); + Integer len = List::length(xs); + if (len == 0) + return error("mkPipelinedReductionTree applied to empty list"); + else if (len == 1) + return xs[0]; + else begin + List#(a) ys = xs; + List#(a) reduced = Nil; + for (Integer i = 0; i < len; i=i+2) begin + Reg#(a) r <- mkConfigReg(init); + rule assignOut; + r <= reduce(ys[0], ys[1]); + endrule + ys = List::drop(2, ys); + reduced = Cons(readReg(r), reduced); + end + a res <- mkPipelinedReductionTree(reduce, init, reduced); + return res; + end +endmodule + endpackage diff --git a/rtl/WideSRAM.bsv b/rtl/WideSRAM.bsv index a3816a38..04af1dc7 100644 --- a/rtl/WideSRAM.bsv +++ b/rtl/WideSRAM.bsv @@ -108,6 +108,7 @@ module mkWideSRAM#(RAMId id) (WideSRAM); respOut.data = pack(data); respOut.info = respIn.info; respOut.finalBeat = True; + respOut.beat = 0; respQueue.enq(respOut); respCount <= 0; end