In today's digital landscape, the ability to efficiently and accurately capture high volumes of network traffic is critical for a variety of domains, including cybersecurity, network optimization, and high-frequency trading (HFT). High-rate packet capture systems are designed to address the challenges of processing and storing immense quantities of network packet data with minimal latency and error. These systems play a vital role in maintaining the integrity, performance, and security of modern networks.
This project explores the development of a robust high-rate packet capture system, delving into the underlying technologies, methodologies, and challenges associated with high-speed data acquisition. The integration of advanced hardware configurations, optimized software solutions, and precision timing mechanisms aims to meet the stringent requirements of ultra-low latency (ULL) environments, particularly those critical to HFT operations.
Through this report, I examine the foundational principles of packet-switching networks, the architectural considerations for ULL data processing, and the practical applications of packet capture systems in industries where nanoseconds can define success or failure. Additionally, I address the technical and operational hurdles involved in implementing such systems, including efficient data storage, error handling, and scalability.
This report contributes to the advancement of high-rate packet capture technologies by bridging theoretical insights with practical implementations, paving the way for enhanced network performance and strategic decision-making in real-time data-driven environments.
- The OSI Model
- Packets and Networks
- Time Synchronization
- Specific use cases for packet capture
- HFT setup
- Exchange and trading firm architecture
- Co-location (traders' systems hosted in the same data centers as the exchanges)
- Synchronization of clocks across data centers
- Backtesting with historical data
- Electronic exchange architecture
- Regulatory Requirements: MiFID II
- Why do financial trading firms capture network data?
- Primary types of data used by HFT firms
- Electrical characteristics of network technologies
- Anecdote: challenges in capturing high-speed data
- Challenges in packet capture for ULL environments
- Methods of packet capture
- Recording packets on a computer
- Clocks and timestamping
- Issues and optimization in packet capture
- Specialized packet capture techniques
- Difficulties of writing A LOT of data
- Specialized devices (besides AMD's Solarflare/Xilinx devices)
- Packet recording formats
- Methods of configuring a NIC
- Computer architecture: PCIe bus, NICs, and the use of various NICs
- SolarCapture
- Storage
The Open Systems Interconnection (OSI) Model (also known as the OSI Reference Model) is a reference model that conceptualizes the standard communication functions of a telecommunications or computing system without focusing on its underlying internal structure and technology. It is a cornerstone of networking because it simplifies troubleshooting: breaking a problem down and narrowing it to one or more layers of the OSI model avoids a great deal of unnecessary work, especially when identifying the origin of attacks, exploits, bugs, and other network issues.
The OSI Model was developed starting in the late 1970s to support the emergence of the diverse computer networking methods that competed to be in the large national networking efforts in France, the United Kingdom, and the United States. In the 1980s, the OSI Model became a working product of the Open Systems Interconnection group at the International Organization for Standardization (ISO).
The OSI Model is often memorized with the following mnemonic: "Please Do Not Throw Sausage Pizza Away", which stands for:
- Please - Physical Layer
- Do - Data Link Layer
- Not - Network Layer
- Throw - Transport Layer
- Sausage - Session Layer
- Pizza - Presentation Layer
- Away - Application Layer
Both the name (e.g. Session Layer) and number (e.g. Layer 5) are used interchangeably.
Starting from the Application layer and working down through each layer, the layers are defined as follows:
The Application layer (Layer 7) is the top-most layer of the OSI model and serves as the interface between the end-user applications and the network. It interacts with the operating system or application whenever the user chooses to transfer files, read messages, or perform other network-related activities (e.g., visit a website). These applications call the lower layers to fetch and deliver their data.
Common protocols that operate at the Application layer include:
- HyperText Transfer Protocol (HTTP)
- File Transfer Protocol (FTP)
- Simple Mail Transfer Protocol (SMTP)
The Presentation layer (Layer 6) is responsible for data representation and encryption: it takes data provided by the Application layer and converts it into a standard format that the other layers can understand.
Common tasks at the presentation layer include:
- Data encryption and decryption
- Reformatting
- Compression and decompression
Common protocols and standards that operate at the Presentation layer include:
- Secure Sockets Layer (SSL)
- Transport Layer Security (TLS)
- American Standard Code for Information Interchange (ASCII)
- Joint Photographic Experts Group (JPEG)
The Session layer (Layer 5) is responsible for inter-host communication and establishes, maintains, and terminates user connections.
Common protocols that operate at the Session layer include:
- Network Basic Input/Output System (NetBIOS)
- Server Message Block (SMB)
The Transport layer (Layer 4) is responsible for end-to-end connections and connection reliability.
Overall, the Transport layer is responsible for:
- Detecting and correcting connection-related errors
- Controlling the flow of data
- Sequencing data
- Determining the size of a packet, also known as a datagram.
When sending data, the Transport layer may break the received data into smaller pieces (called segments) for transmission, and uniquely number them. When receiving data, the Transport layer is responsible for making sure the data arrives intact (not damaged) and then putting everything together in its original order before handing the data off to the Session layer.
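A toy sketch of this segment-and-reassemble behavior in Python (hypothetical helper names, not a real TCP implementation — a real stack also handles acknowledgements, retransmission, and checksums):

```python
import random

def segment(data: bytes, mss: int):
    """Split a byte stream into numbered segments: (sequence number, chunk)."""
    count = (len(data) + mss - 1) // mss
    return [(i, data[i * mss:(i + 1) * mss]) for i in range(count)]

def reassemble(segments):
    """Restore the original order using the sequence numbers, regardless of arrival order."""
    return b"".join(chunk for _, chunk in sorted(segments))

message = b"high-rate packet capture"
segments = segment(message, mss=5)   # mss (maximum segment size) of 5 bytes is arbitrary
random.shuffle(segments)             # segments may arrive out of order
assert reassemble(segments) == message
```

The sorting step is the essence of sequencing: the receiver does not care in what order the pieces arrived, only that every numbered piece is present.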
Common protocols that operate at the Transport layer include:
- Transmission Control Protocol (TCP)
- User Datagram Protocol (UDP)
The Network layer (Layer 3) is responsible for path determination and logical (IP) addressing, handling tasks such as:
- Delivering packets
- Providing logical addressing (e.g., Internet Protocol (IP) addresses)
- Determining the best path for a packet
A communications session does not necessarily occur between two systems on the same network; sometimes, those systems are literally half a world away from each other. In such cases, the Network layer contains the mechanisms that map out the best route (or path) a data packet can travel on a network. The route includes every device that handles the packet between its source and its destination, including routers, switches, and firewalls, for that session.
Common protocols that operate at the Network layer include:
- Internet Protocol version 4 (IPv4) and version 6 (IPv6)
- Internet Control Message Protocol version 4 (ICMPv4) and version 6 (ICMPv6)
Common hardware that operates at the Network layer includes:
- Routers
- Layer 3 switches
The Data Link layer (Layer 2) is responsible for Media Access Control (MAC) and Logical Link Control (LLC) (physical addressing) and is where the rules, processes, and mechanisms for sending and receiving data over a local area network (LAN) are defined.
Common tasks the Data Link layer is responsible for include:
- Accessing the transmission media
- Hardware addressing
- Detecting Data Link-related errors
- Controlling the flow of frames, which are the basic packaging for LAN traffic as it travels across the medium.
A common protocol that operates on the Data Link layer is the Address Resolution Protocol (ARP).
Common hardware that operates at the Data Link layer includes:
- Bridges, which are devices that connect two network segments by analyzing incoming frames and making decisions about where to direct them based on each frame's address.
- Switches, which are essentially high-speed, multi-port bridges; a port is an opening on computer networking equipment that cables plug into. Their purpose is to connect wired hardware in a LAN to one another to share data.
- Network interface cards (NICs), which are computer hardware that connects a computer to a computer network. A NIC plugs into an expansion slot or is integrated on the motherboard and allows systems to communicate over a network, either by using cables or wirelessly.
The Physical layer (Layer 1) is responsible for media, signal, and binary transmission. It includes all the procedures and mechanisms needed both to place data onto the network's medium for transmission and to receive the data sent to your system on that same medium. For example, bits are converted into electrical or light impulses through a process known as encoding, which occurs at the Physical layer over some transmission medium.
Common hardware and standards that operate at the Physical layer include:
- Cabling
- Repeaters, which regenerate a signal before it becomes unreadable due to transmission power loss, extending a network's reach.
- Modems
- Adapter cards
- Physical standards
- IEEE 802.3 (Ethernet)
- IEEE 802.11 (Wireless)
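As a quick reference, the seven layers and a few of the protocols named above can be collected into a small lookup table (an illustrative subset only, not an exhaustive mapping):

```python
# The seven OSI layers, keyed by layer number, with example protocols/standards
# drawn from the sections above.
OSI_LAYERS = {
    7: ("Application",  ["HTTP", "FTP", "SMTP"]),
    6: ("Presentation", ["SSL/TLS", "ASCII", "JPEG"]),
    5: ("Session",      ["NetBIOS", "SMB"]),
    4: ("Transport",    ["TCP", "UDP"]),
    3: ("Network",      ["IPv4", "IPv6", "ICMP"]),
    2: ("Data Link",    ["ARP"]),
    1: ("Physical",     ["IEEE 802.3", "IEEE 802.11"]),
}

def layer_of(protocol: str) -> int:
    """Return the OSI layer number at which a listed protocol operates."""
    for number, (_, protocols) in OSI_LAYERS.items():
        if protocol in protocols:
            return number
    raise KeyError(protocol)

assert layer_of("UDP") == 4
assert OSI_LAYERS[layer_of("IPv4")][0] == "Network"
```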
While the OSI Model is a widely recognized framework for understanding network communications, it has also faced (valid) criticisms over the years, including:
- Lack of implementation: Although the OSI Model was developed as a theoretical framework to standardize network communication, it has not been widely implemented in practice. Instead, the TCP/IP model is the de facto standard for networking.
- Limited practicality: The OSI Model was designed as a general-purpose communication system, but it did not fully consider the practical needs of real-world networks. As a result, it is often criticized for being overly theoretical and for not addressing the practical concerns of network engineers and administrators. For example, there are no Session or Presentation layers in modern networks; the concepts exist, but not as layers and not with the functionality those layers envisioned. The Session layer, for instance, called for "synchronization points" to synchronize transactions. This model never worked in practice, and synchronization on the Internet is vastly more complex, with most organizations designing their own implementations.
- Lack of flexibility: The OSI Model is often criticized for being inflexible and not adaptable to new technologies or emerging trends. As a result, it has not kept pace with the rapid changes in networking technologies and the increasing demand for more flexible and dynamic network designs and architectures.
A packet, also known as a datagram, acts as an envelope for Transmission Control Protocol (TCP) and User Datagram Protocol (UDP) data. Packets contain the information routers need to transfer data between different local area network (LAN) segments (dividing one LAN into parts, with each part called a LAN segment) over a packet-switched network. Similar to a real-life package, each network packet includes control data and the user data being transferred. Control data, also known as the header, includes data for delivering the payload, such as the source and destination network addresses, error detection codes, and sequencing information, and is used by networking hardware to direct the packet to its destination and ensure data integrity. User data, also known as the payload, is the data carried on behalf of an application. The payload is extracted and used by an operating system, application software, or higher-layer protocols.
Network packets are crucial because they enable the efficient and reliable transfer of data over complex networks. Instead of sending large files or streams of data as a single, continuous block—which could monopolize network resources and be more susceptible to errors—data is broken into smaller, manageable pieces. These packets can then be transmitted independently and reassembled at the destination, allowing for better utilization of network bandwidth and resources. Often, when a user sends a file across a network, it gets transferred in smaller data packets, not in one piece. For example, a 5MB file will be divided into some number of packets (e.g., 3), each with a source and destination address (e.g., Source and Destination IP Addresses), the number of packets in the entire data file (3), and the sequence number so the packets can be reassembled at their destination once they have all been received (e.g., "hey receiver, this is packet 1 of 3").
While crucial as a data structure, packets need a routing mechanism to ensure proper network communications. Packet-switching is one such mechanism, in which packets play a fundamental role. Packet-switching is a method of grouping data into packets that are transmitted over a digital network: data is routed and transferred by means of addressed packets so that a channel is occupied only during the transmission of a packet, and upon completion of the transmission, the channel is made available for other traffic. This means that packets can take different paths to reach the same destination, optimizing efficiency and reliability.
Packet-switched networks are:
- Efficient: Packets can be routed through the least congested paths, making optimal use of bandwidth.
- Reliable: If a particular path fails or becomes congested, packets can be rerouted through alternative paths.
- Scalable: Networks can easily accommodate growth, as packets find the best available routes without the need for dedicated circuits.
Thus, packet-switched networks contrast with circuit-switched networks, where a dedicated communication path is established between two endpoints for the duration of the session (e.g., traditional telephone networks). Packet-switching is more flexible and efficient for data transmission, making it the primary basis for data communications in computer networks worldwide. For HFT, the ubiquity of packet-switched networks attests to their flexibility and efficiency, giving HFT firms the ability to route data efficiently and adapt to network conditions. This efficiency and adaptability, combined with ultra-low latency (ULL) switches, ensures that orders, trades, and market data updates are transmitted with minimal latency. As a result, packet-switched networks with ULL switches improve the competitiveness and profitability of trading strategies, as timely information leads to better decision-making and faster execution.
A packet includes 3 components:
- A header, with control information, such as the source address of the sending host; the destination address of the receiving host; the length of the packet (in bytes); the sequencing (or proper ordering) of the individual packets; and the type of data being carried by the packet (e.g., Layer 7 application data).
- The data/body/payload: includes the data that is carried on behalf of an application.
- The trailer/footer: provides mechanisms for detecting errors during transmission and correcting them, along with information to verify the packet’s contents (e.g., Cyclic Redundancy Check (CRC)).
To easily conceptualize a packet, the idea of a postal letter is frequently used:
- The header is like the envelope
- The payload is the entire content(s) inside the envelope (e.g., a love letter), and
- The footer is your signature at the bottom of the love letter.
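The three components above can be illustrated with a toy packet format: a minimal fixed header, an arbitrary payload, and a CRC-32 trailer verified on receipt. The field layout here is invented for illustration and matches no real protocol:

```python
import struct
import zlib

def build_packet(src: int, dst: int, seq: int, payload: bytes) -> bytes:
    """Toy packet: header (source, destination, length, sequence) + payload + CRC-32 trailer."""
    header = struct.pack("!IIHH", src, dst, len(payload), seq)
    trailer = struct.pack("!I", zlib.crc32(header + payload))
    return header + payload + trailer

def parse_packet(packet: bytes):
    """Verify the trailer CRC, then unpack the header and extract the payload."""
    body, (crc,) = packet[:-4], struct.unpack("!I", packet[-4:])
    if zlib.crc32(body) != crc:
        raise ValueError("corrupted packet")
    src, dst, length, seq = struct.unpack("!IIHH", body[:12])
    return src, dst, seq, body[12:12 + length]

pkt = build_packet(src=0x0A000001, dst=0x0A000002, seq=1, payload=b"love letter")
assert parse_packet(pkt) == (0x0A000001, 0x0A000002, 1, b"love letter")
```

Flipping a single bit anywhere in the header or payload changes the CRC-32, so `parse_packet` rejects the packet — the same role the trailer's error-detection code plays in real frames.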
Packets operate at Layer 3 (the Network Layer) of the OSI Model:
- From the sender's perspective, data moves down the layers of OSI model, with each layer adding header or trailer information.
- Data travels up layers at the receiving system, with each layer removing the header or trailer information placed by the corresponding sender layer.
A packet can also be created using various software, also known as packet crafting tools (e.g., Scapy, hping, Nmap).
There are various types of packets, but a few are most important for HFT firms.
Ethernet II frames are the most popular frame type used today. They operate at the Data Link layer (Layer 2) of the OSI model and are used within LANs.
The core structure of an Ethernet frame consists of 7 fields:
- Preamble (7 bytes): A sequence of alternating 1s and 0s used to synchronize the receivers on the network before the actual data arrives; it allows devices to prepare for the reception of a frame.
- Start Frame Delimiter (SFD) (1 byte): Indicates the start of the frame and helps align the data bits properly.
- Destination MAC Address (part of the frame Header) (6 bytes): The physical hardware address of the Network Interface Card (NIC) to which the frame is being sent.
- Source MAC Address (part of the frame Header) (6 bytes): The physical hardware address of the NIC from which the message originated.
- Frame Type (2 bytes) (part of the frame Header): Specifies the type of higher-layer protocol data that follows the header in the data portion of the Ethernet frame (e.g., 0x0806 in this field means ARP data follows in the Data portion of the frame).
- Data/Payload (46 - 1500 bytes): Contains the encapsulated data, such as an IP packet or other higher-layer data. The minimum size ensures proper collision detection.
- Frame Check Sequence (FCS) (4 bytes): A Cyclic Redundancy Check (CRC) value used for error detection to ensure data integrity.
To summarize, Ethernet frames encapsulate data for transmission over the physical medium. They handle addressing within the LAN and ensure that data is delivered to the correct device. The FCS helps detect any errors that may have occurred during transmission, prompting retransmission if necessary. In HFT environments, any delay or error in data transmission can lead to missed trading opportunities or financial losses. Network engineers in HFT firms focus on optimizing Ethernet frame handling by minimizing Ethernet collisions, reducing latency through ULL switches, and ensuring that hardware components are capable of handling the required data rates (e.g. 100 Mbps to 100 Gbps, depending on the Ethernet cable standard used).
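The addressable part of the frame layout above can be decoded with Python's struct module. Note that typical capture tools see neither the preamble/SFD nor, often, the FCS, so this sketch decodes only the 14-byte header:

```python
import struct

def parse_ethernet_ii(frame: bytes):
    """Decode the 14-byte Ethernet II header: destination MAC, source MAC, Frame Type."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    fmt_mac = lambda b: ":".join(f"{octet:02x}" for octet in b)
    return fmt_mac(dst), fmt_mac(src), ethertype, frame[14:]

# A hand-crafted example frame: broadcast destination, Frame Type 0x0806 (ARP),
# padded to the 46-byte minimum payload required for collision detection.
frame = (bytes.fromhex("ffffffffffff")    # destination MAC (broadcast)
         + bytes.fromhex("020000000001")  # source MAC (locally administered)
         + b"\x08\x06"                    # Frame Type: ARP
         + b"\x00" * 46)                  # minimum-size payload

dst, src, ethertype, payload = parse_ethernet_ii(frame)
assert dst == "ff:ff:ff:ff:ff:ff"
assert ethertype == 0x0806   # ARP data follows in the payload
```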
IP packets operate at Layer 3 (the Network Layer) and are responsible for routing data across interconnected networks, such as the Internet. IP packets, also known as IP datagrams, act as an envelope for TCP, UDP, and higher-layer protocol data/information (e.g., a webpage request). The IP packets contain logical addressing information (e.g., source/destination IP addresses) necessary for hosts to transfer the packets between different network segments.
There are two main Internet Protocols: Internet Protocol version 4 (IPv4) and Internet Protocol version 6 (IPv6).
IPv4 Header Structure
Specified in RFC 791, IPv4 is a connectionless protocol: each packet is treated independently of the others. An IPv4 packet has the following structure:
- Version (4 bits): Identifies the version of IP (e.g., v4 or v6) used to generate the datagram. This field ensures compatibility between devices that may be running different versions of IP (a dual-stack system runs both IPv4 and IPv6 software).
- Header Length (4 bits): Specifies the length of the IP header for the packet. This helps the receiving host determine where in the IP datagram the data actually starts (the data portion begins immediately after the IP header ends). The value in this field is a multiplier of 4 bytes (e.g., a value of 5 means the IP header is $5 \times 4 = 20$ bytes long). The maximum value is 15 (a 60-byte header).
- Type of Service (ToS) (8 bits): Defines the priority and Quality of Service (QoS) parameters for the packet. This field is now referred to as Differentiated Services (DiffServ) and Explicit Congestion Notification (ECN). The updated DiffServ field allows for up to 64 values and a greater range of packet-forwarding behaviors (i.e., per-hop behaviors (PHBs)). DiffServ defines a Class of Service (CoS) that engages in traffic classification, a more scalable and flexible approach than QoS, since DiffServ allocates network resources on a per-class basis and sophisticated network operations only need to be implemented at network boundaries or hosts. Marking a packet with a high Differentiated Services Code Point (DSCP) value gives the packet Expedited Forwarding (EF) treatment, which is suited to traffic with strict QoS requirements for latency, packet loss, and jitter (Sources: https://www.techtarget.com/whatis/definition/Differentiated-Services-DiffServ-or-DS, https://www.cisco.com/c/en/us/products/ios-nx-os-software/differentiated-services/index.html). ECN enables end-to-end congestion notification between two endpoints on TCP/IP-based networks: it notifies networks about congestion with the goal of reducing packet loss and delay by making the sending device decrease its transmission rate until the congestion clears, without dropping packets. (Source: https://www.juniper.net/documentation/us/en/software/junos/cos/topics/concept/cos-qfx-series-explicit-congestion-notification-understanding.html)
- Total Length (16 bits): Identifies the total length of the IP datagram, including the header and data (in bytes). The value in this field cannot exceed 65,535 bytes: a 16-bit field can hold $2^{16}$ distinct values, and since the first possible value is 0, the maximum is $2^{16} - 1 = 65{,}535$.
- Identification (16 bits): Uniquely identifies the group of fragments of a single IP datagram. This field (and the next two, Flags and Fragment Offset) ensures data is rebuilt properly on the receiving end. IP can break a packet it receives from a higher-level protocol into smaller packets (fragments), depending on the maximum packet size supported by the underlying network transmission technology; on the receiving end, these fragments must be reassembled.
- Flags (3 bits): Signifies fragmentation options (e.g., whether or not fragmentation is allowed). The sender can also use this field to tell the receiving host that more fragments are on the way (via the More Fragments (MF) flag).
- Fragment Offset (13 bits): Identifies where a fragment belongs in the incoming set of fragments by assigning a number (the offset) to each one. The receiving host uses these numbers to reassemble the data correctly. The field is measured in units of 8-byte blocks: for example, a value of 2 means the fragment's data is placed 16 bytes into the reassembled packet. This allows a maximum offset of 65,528 bytes. The first fragment has an offset of 0. This field only applies when fragmentation is used.
- Time to Live (TTL) (8 bits): Sets an upper limit on the number of routers through which a datagram can pass, to prevent it from circulating indefinitely. The initial TTL value is set as a system default in the TCP/IP stack implementation of each OS vendor, with each OS using its own TTL value. Each router that handles an IP datagram is required to decrement the TTL by 1; when it reaches 0, the datagram is discarded and the sender is notified with an error message. This prevents packets from getting caught in routing loops forever (a routing loop occurs when a packet is continually routed through the same routers over and over, never reaching its destination).
- Protocol (8 bits): Indicates the protocol that follows the IPv4 header (e.g., ICMP, TCP, UDP).
- Header Checksum (16 bits): Contains a computed value used by the receiving host to verify the integrity of the packet's header information.
- Source IP Address (32 bits): Also known as the Network Layer address, it specifies the IPv4 address of the sending host.
- Destination IP Address (32 bits): Specifies the recipient of the packet.
- Options (variable length): Contains additional header options, such as optional routing and timing information.
IPv4 Data
What follows the IPv4 header is the data portion of the packet, which encapsulates the original application data sent by the source host, plus information added by any other layers (e.g. TCP, UDP). The size of the data portion of a packet varies in length.
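As a concrete illustration of the fields above, the fixed 20-byte IPv4 header can be decoded with the standard struct module (a sketch that ignores IP options and assumes a well-formed header; the sample bytes are hand-crafted):

```python
import socket
import struct

def parse_ipv4_header(raw: bytes) -> dict:
    """Decode the fixed 20-byte IPv4 header described above."""
    (ver_ihl, tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_ihl >> 4,
        "header_len": (ver_ihl & 0x0F) * 4,        # Header Length is in 4-byte words
        "total_len": total_len,
        "flags": flags_frag >> 13,
        "frag_offset": (flags_frag & 0x1FFF) * 8,  # Fragment Offset is in 8-byte blocks
        "ttl": ttl,
        "protocol": proto,                         # e.g., 6 = TCP, 17 = UDP
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }

# 20-byte header: version 4, IHL 5, total length 28, TTL 64,
# protocol 17 (UDP), 10.0.0.1 -> 10.0.0.2
raw = bytes.fromhex("4500001c00010000401166ce0a0000010a000002")
h = parse_ipv4_header(raw)
assert h["version"] == 4 and h["header_len"] == 20
assert h["protocol"] == 17 and h["src"] == "10.0.0.1"
```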
IPv6 Features and Differences from IPv4
Like IPv4, IPv6 operates at Layer 3 (the Network Layer) of the OSI model; however, IPv6 has many features that improve on and differentiate it from IPv4:
- Most transport- and application-layer protocols need little or no change to operate over IPv6. Exceptions are application protocols that embed network-layer addresses (e.g., FTP, Network Time Protocol (NTPv3)).
- IPv6 specifies a new packet format, designed to minimize packet-header processing. Since the headers of IPv4 packets and IPv6 packets are significantly different, the two protocols are not interoperable.
- IPv6 has a larger address space. An IPv6 address is 128 bits long ($2^{128}$, or about $3.4 \times 10^{38}$, possible addresses), compared to 32 bits in IPv4 ($2^{32}$ possible addresses). The longer addresses allow for a systematic and hierarchical allocation of addresses and efficient route aggregation.
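The address-space comparison is easy to check with Python's arbitrary-precision integers:

```python
ipv4_addresses = 2 ** 32
ipv6_addresses = 2 ** 128

assert ipv4_addresses == 4_294_967_296
# 2^128 is roughly 3.4 x 10^38, as stated above
assert f"{ipv6_addresses:.1e}" == "3.4e+38"
```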
- Auto-configuration: IPv6 hosts can configure themselves automatically with help from a router on the local link, or dynamically via a DHCPv6 server.
- Multicast: Multicast is part of the base specification of IPv6, unlike in IPv4, where multicast is optional (although usually implemented).
- Broadcast: IPv6 does not implement broadcast. Instead, IPv6 treats broadcast as a special case of multicast.
- Mandatory network-layer security: Internet Protocol Security (IPsec), the protocol for IP encryption and authentication, forms an integral part of the base protocol suite in IPv6.
- Simplified processing by routers: A number of simplifications have been made to the IPv6 packet header, and packet forwarding has been simplified to make packet processing by routers simpler and more efficient (the IPv4 header was inefficient because routing required analysis of every IPv4 header field). The IPv6 header is not protected by a checksum; instead, integrity protection is assumed to be provided by a transport-layer checksum. The TTL field of IPv4 has been renamed Hop Limit, reflecting the fact that routers are no longer expected to compute the time a packet has spent in a queue.
- Larger header: The size of the IPv6 header has doubled, from 20 bytes for a minimum-sized IPv4 header to 40 bytes.
An IPv6 packet is composed of the following parts:
- Mandatory Base Header (40 bytes).
- Followed by the payload (up to 65,535 bytes), which includes optional IPv6 Extension Headers and data from upper layer protocols.
IPv6 Base Header Structure
Designed for routing efficiency, the IPv6 base header has the following fields:
- Version (4 bits): Just like IPv4, identifies the version of IP used to generate the datagram, ensuring compatibility between systems that may be running different versions of the Internet Protocol.
- Traffic Class (8 bits): Defines the priority of the packet with respect to other packets from the same source. For example, if one of two consecutive datagrams must be discarded due to congestion, the datagram with the lower priority is discarded. IPv6 divides traffic into two broad categories: congestion-controlled and non-congestion-controlled. With congestion-controlled traffic, the source adapts itself to the traffic slowdown, and packets may arrive delayed, lost, or out of order; such data are assigned priorities from 0 to 7, with 0 the lowest (no priority set) and 7 the highest (control traffic). Non-congestion-controlled traffic expects minimum delay: discarding packets is not desirable, and retransmission is in most cases impossible, so the source does not adapt to congestion. Real-time audio and video are examples of this type of traffic, which is assigned priorities from 8 to 15. Generally, data containing less redundancy (e.g., low-fidelity audio or video) can be given a higher priority (15), while data containing more redundancy (e.g., high-fidelity audio or video) are given a lower priority (8).
- Flow Label (20 bits): Designed to provide special handling for a particular flow of data. A sequence of packets, sent from a particular source to a particular destination, that needs special handling by routers is called a flow of packets; the combination of the source address and the value of the flow label uniquely defines a flow. To a router, a flow is a sequence of packets that share the same characteristics (e.g., traveling the same path, using the same resources, having the same security requirements). In its simplest form, a flow label can be used to speed up packet processing by a router: instead of consulting the routing table and running a routing algorithm to determine the next hop, the router can simply look up the next hop in a flow label table. In its more sophisticated form, a flow label can support the transmission of real-time audio and video, which in digital form require resources such as high bandwidth, large buffers, and long processing time. A process can reserve these resources beforehand to guarantee that real-time data will not be delayed by a lack of resources; such reservations require other protocols (e.g., the Real-Time Protocol (RTP) and the Resource Reservation Protocol (RSVP)).
- Payload Length (16 bits): Defines the length of the data portion of the IPv6 datagram, including any optional extension headers and upper-layer data. With 16 bits, a payload of up to 65,535 bytes can be indicated. For payloads greater than 65,535 bytes, the Payload Length field is set to 0 and the Jumbo Payload option is used in the Hop-by-Hop Options extension header, described briefly below.
- Next Header (8 bits): Defines the payload that follows the base header in the datagram (similar to the Protocol field in IPv4). It is either one of the optional extension headers used by IP or the header of an encapsulated packet (e.g., UDP, TCP, ICMP).
- Hop Limit (8 bits): Serves the same purpose as the TTL field in IPv4.
- Source Address (16 bytes): Identifies the original source of the datagram.
- Destination Address (16 bytes): Identifies the final destination of the datagram. However, if source routing is used, this field contains the address of the next router.
- (Optional) Extension Headers: Used to give more functionality to the IPv6 datagram. The base header can be followed by up to 6 extension headers, which are similar to the IPv4 Options field. With IPv6, delivery and forwarding options are moved to the extension headers, and each datagram includes extension headers only for the facilities it uses. The only extension header that must be processed at each intermediate router is the Hop-by-Hop Options extension header. This design increases IPv6 header-processing speed and improves the performance of forwarding IPv6 packets. The 6 extension headers are:
- Hop-by-Hop Option
- Source Routing
- Fragmentation
- Authentication
- Encapsulation Security Protocol (ESP)
- Destination Option
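The 40-byte base header described above can be unpacked in a few lines of Python. The sketch below is illustrative only: the function name and the hand-built sample header are invented for the example, not taken from any real capture.

```python
import struct

def parse_ipv6_header(data: bytes) -> dict:
    """Parse the fixed 40-byte IPv6 base header described above."""
    if len(data) < 40:
        raise ValueError("IPv6 base header is 40 bytes")
    ver_tc_flow, payload_len, next_header, hop_limit = struct.unpack("!IHBB", data[:8])
    return {
        "version": ver_tc_flow >> 28,
        "traffic_class": (ver_tc_flow >> 20) & 0xFF,
        "flow_label": ver_tc_flow & 0xFFFFF,
        "payload_length": payload_len,
        "next_header": next_header,  # e.g., 17 = UDP, 6 = TCP
        "hop_limit": hop_limit,
        "src": data[8:24].hex(),
        "dst": data[24:40].hex(),
    }

# Hand-built sample: version 6, an arbitrary flow label, 8-byte payload,
# Next Header = UDP (17), Hop Limit = 64, all-zero addresses.
hdr = struct.pack("!IHBB", (6 << 28) | 0x12345, 8, 17, 64) + bytes(32)
print(parse_ipv6_header(hdr)["next_header"])  # 17
```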
For HFT, the speed of an IP packet delivery is paramount. Network engineers strive to minimize the number of hops and optimize routing paths to reduce latency. Techniques such as IP route optimization, traffic engineering, and the use of dedicated private networks are employed to ensure that data travels the most direct path with the least delay. Additionally, features like DiffServ can be used to prioritize HFT traffic, such as market data updates or trade execution signals, over less critical data.
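As a sketch of DiffServ-style prioritization at the socket level, the snippet below marks a UDP socket's outgoing traffic with a DSCP code point via the standard IP_TOS socket option. The choice of Expedited Forwarding (46) for latency-sensitive traffic is an assumption for the example; real deployments use whatever code points their network policy defines.

```python
import socket

# DSCP "Expedited Forwarding" (46), a class commonly used for
# latency-sensitive traffic. The DSCP occupies the upper 6 bits of the
# former ToS byte, hence the shift by 2 (46 << 2 == 184).
DSCP_EF = 46

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, DSCP_EF << 2)
tos = sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS)
sock.close()
print(tos)
```

Every datagram sent on the socket then carries the marking, which DiffServ-aware switches and routers can use to place the traffic in a higher-priority queue.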
Similar to IP, UDP is also a connectionless communications protocol, but it is designed as a best-effort mode of communication. UDP does not provide any guarantees on upper-layer data delivery, nor does it retransmit lost or corrupted messages. It is primarily used to establish low-latency and loss-tolerating connections between processes on host systems. UDP speeds up transmissions by enabling the transfer of data before an agreement is provided by the receiving host, and it is ideal for delivering large quantities of data in a short amount of time (e.g., live audio and video transmission over the internet). Thus, UDP is the go-to Layer 4 protocol for time-sensitive communications: DNS lookups, VoIP, and video and audio playback. It includes several attributes that make it beneficial for use with applications that can tolerate data loss:
- It allows segments to be dropped and received in a different order than they were transmitted, making it suitable for real-time applications where latency might be a concern.
- It can be used in applications where speed (rather than reliability) is important, e.g., with transaction-based protocols such as DNS and the Network Time Protocol (NTP).
- It can be used where a large number of clients are connected and where real-time error correction isn't necessary (e.g., gaming, voice or video conferencing, streaming media).
UDP is also used by: DNS, DHCP, Trivial File Transfer Program (TFTP), and Simple Network Management Protocol (SNMP).
UDP Header Structure
UDP operates at Layer 4 (the Transport Layer) of the OSI model and is composed of 4 fields:
- Source Port (16 bits): Identifies the sending port number and should be assumed to be the port number in any reply.
- Destination Port (16 bits): Identifies the destination port number.
- Length (16 bits): Specifies the length of the UDP header and the UDP data.
- (Optional) Checksum (16 bits): Used for error-checking of the UDP header and UDP data.
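The four fields above map directly onto an 8-byte struct. The sketch below builds and parses one; the helper names and sample ports are invented for the example, and the checksum is left at 0 ("not computed"), which is permitted for UDP over IPv4.

```python
import struct

def build_udp_header(src_port: int, dst_port: int, payload: bytes) -> bytes:
    """Build the 8-byte UDP header; Length covers header + data."""
    length = 8 + len(payload)
    return struct.pack("!HHHH", src_port, dst_port, length, 0)

def parse_udp_header(data: bytes) -> dict:
    src, dst, length, checksum = struct.unpack("!HHHH", data[:8])
    return {"src_port": src, "dst_port": dst, "length": length, "checksum": checksum}

# 17-byte payload, so the Length field is 8 + 17 = 25.
hdr = build_udp_header(53124, 53, b"example-dns-query")
print(parse_udp_header(hdr))
```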
For HFT firms, UDP can be used for transmitting market data feeds because of UDP's low latency. While the lack of reliability mechanisms can lead to data loss, HFT systems can implement their own error-handling and data verification methods to mitigate this risk. For example, UDP Multicast is commonly used for collocated exchange customers for market-data distribution, often distributing the data in a binary format or an easy-to-parse text format. Two predominant binary formats are ITCH and OUCH, both sacrificing flexibility (fixed-length offsets) for speed (very simple parsing). (Source: https://dl.acm.org/doi/pdf/10.1145/2523426.2536492)
Just like IP and UDP, TCP also operates at Layer 4 (the Transport Layer) of the OSI model. TCP's main function is to establish and maintain host-to-host communication by which applications can reliably exchange data. TCP is the primary internet transport protocol for applications that need guaranteed delivery of data. Thus, TCP is considered connection-oriented, which means that the two applications using TCP (normally a client and a server) must establish and maintain a connection until the applications at each end have finished exchanging messages (via the TCP 3-Way Handshake mechanism). Additional functionality of TCP includes:
- Segmenting, or the breaking up of application data for transmission across a network.
- Assigning a unique Sequence Number to each segment of application data. This Sequence Number comes in handy when the receiving machine tries to reassemble all the pieces.
- Assigning a port number that functions as the address of the application that is sending/receiving the data (much like UDP).
- Tracking the sequence of received TCP segments.
- Ensuring that data received wasn't damaged in transit (and if so, retransmitting that data as many times as needed).
- Acknowledging that a segment or segments was/were received undamaged, and
- Regulating the rate at which the source machine sends data (flow control), which helps prevent the network (and communicating hosts) from getting bogged down when congestion begins.
TCP Header Structure
Like its UDP counterpart, TCP segments are encapsulated in the payload portion of an IP datagram. TCP segments are composed of 11 fields:
- Source Port (16 bits): Indicates the port number of the source host.
- Destination Port (16 bits): Indicates the port number of the destination host.
- Sequence Number (32 bits): Keeps track of both transmitted and received segments in a TCP communication.
- Acknowledgement Number (32 bits): Used to confirm receipt of packets via a return segment (also known as an ACK segment) to the sender.
- Header Length, A.K.A. the Offset or Data Offset (4 bits): Specifies the length of the TCP header. The value in this field lets the receiving host know where the data portion of the TCP segment begins.
- Reserved (3 bits): This field is rarely used and is set to 0.
- Flags (9 bits): A collection of 9 one-bit fields that signal special conditions (e.g., SYN, ACK, FIN, RST, PSH, URG). A segment is often named for the flag it carries (e.g., a SYN segment or an ACK segment).
- Window, A.K.A. Sliding Window Size (16 bits): Used to provide flow control by designating the size of the receive window.
- Checksum (16 bits): Allows the receiving host to determine whether the TCP segment became corrupted during transmission.
- Urgent Pointer (16 bits): Indicates the location in the payload/data where urgent data resides (if the URG flag is set).
- Options (0 to 320 bits): Specifies special options (e.g., the Maximum Segment Size (MSS) of a frame/packet a network can handle).
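The fixed 20-byte portion of this header can likewise be decoded with a short struct parse. This is an illustrative sketch; the function name and the hand-built SYN segment are invented for the example.

```python
import struct

def parse_tcp_header(data: bytes) -> dict:
    """Parse the fixed 20-byte portion of a TCP header."""
    src, dst, seq, ack, off_flags, window, checksum, urg = struct.unpack(
        "!HHIIHHHH", data[:20]
    )
    data_offset = (off_flags >> 12) & 0xF  # header length in 32-bit words
    flags = off_flags & 0x1FF              # the 9 flag bits
    return {
        "src_port": src, "dst_port": dst,
        "seq": seq, "ack": ack,
        "header_len_bytes": data_offset * 4,
        "syn": bool(flags & 0x02),
        "ack_flag": bool(flags & 0x10),
        "fin": bool(flags & 0x01),
        "window": window, "checksum": checksum, "urgent_ptr": urg,
    }

# Hand-built SYN segment: Data Offset 5 (20 bytes, no options), SYN flag set.
off_flags = (5 << 12) | 0x002
hdr = struct.pack("!HHIIHHHH", 49152, 80, 1000, 0, off_flags, 65535, 0, 0)
print(parse_tcp_header(hdr)["syn"])  # True
```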
While TCP's reliability can be beneficial, its overhead can introduce unacceptable latency for some HFT operations. As stated previously, UDP Multicast is commonly used to distribute market data to collocated exchange customers; however, TCP is still often used for non-collocated customers. Some markets, like foreign exchange (FX) markets, distribute all market data over TCP using the Financial Information eXchange (FIX) protocol. (Source: https://dl.acm.org/doi/pdf/10.1145/2523426.2536492)
Nagle's Algorithm
Nagle's Algorithm is a mechanism in the TCP/IP protocol stack designed to improve the efficiency of network communication, especially when sending small packets of data. It aims to reduce the overhead associated with transmitting a large number of small packets by batching them together whenever possible.
- Problem context: When an application sends small chunks of data (e.g., one character at a time), each data chunk can result in a separate TCP segment. The separate TCP segments lead to inefficient network usage because of the overhead associated with the headers in each TCP segment.
- Minimum Ethernet frame: Each TCP segment transmitted must include:
- Ethernet Header: 14 bytes
- IP Header: 20 bytes
- TCP Header: 20 bytes
- Payload: Typically small in scenarios like telnet/SSH, e.g., 1 byte.
The result of these multiple separate TCP segments is a minimum frame size of 64 bytes after padding, even if the payload is only 1 byte. The efficiency in such a case is only 1/64 ≈ 1.6%.
- Nagle's solution: The algorithm states that a sender should have only one outstanding small packet that has not been acknowledged at a time. If new data needs to be sent, it is buffered until:
- An acknowledgment (ACK) for the previous packet is received, or
- The buffer fills (i.e., "Nagle's threshold" is reached), and the buffered data can be sent as one larger segment.
Thus, through Nagle's algorithm, the buffering of small TCP segments reduces the number of packets sent and increases the overall efficiency of network bandwidth usage.
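The buffering rule can be sketched as a small decision function. This is simplified pseudologic under assumed conditions, not the actual kernel implementation; the MSS constant and function name are invented for the example.

```python
# Simplified sketch of Nagle's decision rule: small writes are buffered
# while an unacknowledged small segment is still in flight.
MSS = 1460  # assumed maximum segment size, in bytes

def nagle_should_send(buffered_bytes: int, unacked_small_segment: bool) -> bool:
    if buffered_bytes >= MSS:        # a full-sized segment is always sent
        return True
    if not unacked_small_segment:    # nothing small in flight: send immediately
        return True
    return False                     # otherwise keep buffering until an ACK arrives

print(nagle_should_send(1, unacked_small_segment=True))     # False: buffered
print(nagle_should_send(1, unacked_small_segment=False))    # True: sent now
print(nagle_should_send(1460, unacked_small_segment=True))  # True: full segment
```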
- Example of avoiding Nagle's Algorithm for distributed ULL systems:
- Imagine you are writing an app like telnet or ssh… These applications often involve transmitting individual keystrokes as the user types. Sending a separate packet for each keystroke would result in significant inefficiency due to the high header-to-payload ratio.
- Do you transmit an entire packet for every single keystroked character? Without Nagle's Algorithm, each character would result in a separate TCP segment. With the algorithm, the stack buffers the characters and transmits them together, waiting until one of the conditions for sending the data is met.
- Minimum sized Ethernet frame for TCP/IP: The headers (Ethernet, IP, TCP) add up to 54 bytes before payload and padding. Adding a 1-byte payload and padding to the minimum Ethernet frame size results in 64 bytes. Efficiency for a 1-byte payload: 1 byte / 64 bytes ≈ 1.6%.
- The TCP stack waits for more data… Nagle's Algorithm forces the TCP stack to delay sending small data chunks. The TCP stack waits to batch additional data into the packet, potentially increasing bandwidth efficiency but adding latency.
- Useful for increasing bandwidth… For applications sending small amounts of data, the algorithm increases efficiency by reducing the number of packets. However, in latency-sensitive applications, like HFT, the added time delay can be detrimental.
- Your code calls send(orderPayload)… When calling send, the payload is queued by the TCP stack but may not immediately be sent on the wire. It depends on:
  - Whether there are outstanding unacknowledged packets.
  - The availability of enough data to form a larger packet.
- When did the data actually get sent out on the wire…? The exact timing of the data being sent depends on the network stack's state (acknowledgments, buffer availability, etc.). This timing uncertainty can impact applications that rely on precise control over data transmission.
Therefore, to reduce the latency of packet transmission, especially in distributed ULL environments, it is advised to disable Nagle's Algorithm, which can be done in Linux by enabling the TCP_NODELAY option on the socket.
In addition to increasing packet-transmission latency, Nagle's algorithm is not effective in handling bursts of network traffic, since the delays in packet transmission may lead to increased network congestion.
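A minimal sketch of disabling Nagle's Algorithm from user space, using the standard TCP_NODELAY socket option mentioned above. The socket here is never connected, since only the option handling is being illustrated.

```python
import socket

# Disable Nagle's Algorithm on a TCP socket: with TCP_NODELAY set, small
# writes go to the wire immediately instead of being buffered for batching.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sock.close()
print(nodelay)
```

The trade-off is deliberate: the sender gives up Nagle's bandwidth savings in exchange for deterministic, immediate transmission, which is exactly the priority in ULL environments.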
References:
1. Awati, Rahul. (2023, May). Differentiated Services (DiffServ or DS). Retrieved from https://www.techtarget.com/whatis/definition/Differentiated-Services-DiffServ-or-DS
2. Cisco. (n.d.). Differentiated Services. Retrieved from https://www.cisco.com/c/en/us/products/ios-nx-os-software/differentiated-services/index.html
3. Juniper Networks. (2024, September 9). Understanding CoS Explicit Congestion Notification. Retrieved from https://www.juniper.net/documentation/us/en/software/junos/cos/topics/concept/cos-qfx-series-explicit-congestion-notification-understanding.html
4. Loveless, Jacob. (2013). Barbarians at the Gateways. Retrieved from https://dl.acm.org/doi/pdf/10.1145/2523426.2536492
5. Brooker, Marc. (2024, May 9). It's Always TCP_NODELAY. Every Damn Time. Marc's Blog. Retrieved from https://brooker.co.za/blog/2024/05/09/nagle.html
6. Arvey, Stanley. (2023, April 3). Uncovering Nagle's TCP Algorithm: Technical Overview. Orhan Ergun. Retrieved from https://orhanergun.net/nagles-tcp-algorithm
Unicast communication involves a one-to-one connection between a single sender and single destination. Each destination address uniquely identifies a single receiver endpoint.
Characteristics:
- Direct communication between two network nodes.
- Uses unique IP and MAC addresses for sender and receiver.
- The most common form of communication on the Internet.
Advantages:
- Ensures privacy and security, as data is not broadcasted to other devices.
- Simplifies error handling and acknowledgement processes.
- Provides a dedicated communication channel, reducing the risk of interference.
Unicast can be used for direct orders and trade confirmations between trading systems and exchanges.
Broadcast communication involves sending data from one sender to all devices on the LAN. It uses a special broadcast address that all nodes listen to.
The Ethernet broadcast address is distinguished by having all of its bits set to 1. (Source: https://www.sciencedirect.com/topics/computer-science/broadcast-address#:~:text=The%20Ethernet%20broadcast%20address%20is,hosts%20on%20the%20local%20subnet)
Characteristics:
- Effective within local network segments (broadcast domains).
- Commonly used by protocols like Address Resolution Protocol (ARP) and Dynamic Host Configuration Protocol (DHCP) for network discovery.
Advantages:
- Simplifies processes where information needs to reach all devices, such as ARP requests and DHCP offers.
- Reduces the complexity of network management tasks by enabling devices to announce their presence or request information from multiple devices simultaneously (ARP).
Disadvantages:
- Can lead to network congestion if overused, as all devices must process broadcast traffic.
- Not suitable for large-scale networks due to scalability issues and potential security concerns.
Broadcasted packets are generally avoided in HFT networks due to their potential for increased latency and unnecessary network load. Network engineers minimize broadcast domains through VLAN segmentation and limit broadcast traffic using network policies and configurations.
Multicast communication allows one sender to simultaneously transmit data to multiple specific receivers who are part of a multicast group. It is more efficient than broadcasting when data needs to be sent to multiple, but not all, recipients.
Characteristics:
- Receivers join multicast groups to receive data.
- Data is sent once by the sender and distributed to multiple recipients by network devices that support multicast routing.
Advantages:
- Efficient bandwidth usage, reducing the network load compared to unicast transmissions to multiple recipients.
- Ideal for applications where the same data needs to be delivered to multiple systems simultaneously, like market data feeds.
Multicast is extensively used in HFT to disseminate market data feeds from exchanges to trading systems. Exchanges broadcast price updates, trade information, and order book changes to all subscribers using multicast, which ensures that all participants receive the data simultaneously, allowing trading systems to react to market changes as quickly as possible.
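At the socket level, subscribing to such a feed means joining a multicast group. The sketch below shows the standard IP_ADD_MEMBERSHIP mechanics; the group address and port are invented placeholders, not any exchange's real feed.

```python
import socket
import struct

# Placeholder group address/port for illustration only.
GROUP, PORT = "239.1.1.1", 30001

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))

# ip_mreq: 4-byte group address + 4-byte local interface (0.0.0.0 = any).
mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
try:
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    joined = True   # sock.recvfrom(65535) would now see datagrams sent to the group
except OSError:
    joined = False  # hosts without a multicast route may refuse the join
sock.close()
print(len(mreq), joined)
```

Note that the sender transmits each datagram once; replication to all joined receivers is done by the multicast-capable switches and routers in between.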

The Evolution of Ethernet
Source: https://www.techtarget.com/rms/onlineImages/evolution_of_ethernet_mobile.jpg
Ethernet has become the backbone of LANs due to its evolution, standardization, and adaptability to performance demands.
In the 1970s:
- Developed at Xerox PARC by Robert Metcalfe and David Boggs.
- Initially operated at 2.94 Mbps using thick coaxial cable (10BASE5).
- Used Carrier Sense Multiple Access with Collision Detection (CSMA/CD) to connect multiple computers over a shared medium.
In the 1980s:
- DIX (Digital Equipment Corporation, Intel, Xerox) consortium standardized Ethernet.
- IEEE 802.3 formalized Ethernet standards.
- Introduction of 10BASE2 and 10BASE-T cabling reduced installation complexity and costs.
- Ethernet surpassed Token Ring and ARCNET in cost-effectiveness and ease of use.
In the 1990s:
- Fast Ethernet (100BASE-TX) increased speeds to 100 Mbps.
- Adoption of Cat5 cables enhanced performance.
- Ethernet switches replaced hubs, reducing collisions and improving performance.
In the 2000s:
- Gigabit Ethernet (1000BASE-T) delivered 1 Gbps speeds over copper and fiber optics.
- Deployed widely in enterprises and data centers.
- IEEE 802.3ae standardized 10 Gigabit Ethernet (10GbE) for high-speed needs.
From the 2010s-Present:
- Development of 40GbE, 100GbE, 25GbE, 50GbE, 200GbE, and 400GbE using PAM4 modulation.
- Fiber optics and Cat8 copper cabling supported higher speeds.
- Introduction of Power over Ethernet (PoE) and Ethernet Virtual Private Networks (EVPN).
- Standardization and interoperability: Ethernet standards, maintained by IEEE 802.3, ensure device compatibility across manufacturers. This standardization fosters competition, drives innovation, and reduces costs, making Ethernet the default choice for LANs.
- Scalability and flexibility: Ethernet supports speeds from 10 Mbps to 400 Gbps and various media types, making it adaptable to different network sizes. It can be used in small offices or scaled to large data centers and metropolitan networks.
- Cost-effectiveness: The widespread adoption of Ethernet has led to reduced equipment costs, making it affordable for both small and large organizations.
- Ease of deployment and management: Ethernet is easy to install and manage with structured cabling systems. IT professionals are well-versed in Ethernet, ensuring reliable support, and mature tools are available for management.
- Performance and reliability: Advances in Ethernet increase speed and reliability, with features like full-duplex, flow control, and link aggregation improving performance and fault tolerance.
In HFT, Ethernet is the primary technology used for network connectivity due to its high speeds and low latency capabilities. High-performance Ethernet switches and NICs are critical components in HFT infrastructure. Vendors specializing in low-latency Ethernet equipment, such as Arista Networks, Cisco Systems, and Juniper Networks, provide solutions tailored to the demands of HFT firms. These devices often support features like cut-through switching, where the switch starts forwarding a frame before it is fully received, reducing latency.
While Ethernet is versatile and widely used, certain applications in HFT and data centers require specialized networking technologies to meet ULL and high-throughput requirements. Two of these alternatives are Fiber Channel, which is often used for storage area networks (SANs), and InfiniBand, which is often used in high-performance computing (HPC) environments, e.g., scientific computing, AI, cloud data centers, and, of course, HFT.
Fiber Channel (FC) is a high-speed networking technology primarily used for storage area networks (SANs). It facilitates the transfer of data between computer systems and storage devices, offering high throughput and low latency for such systems.
Characteristics:
- Supports speeds up to 128 Gbps (Source: https://fibrechannel.org/wp-content/uploads/2023/06/FCIA-128GFC-Webcast-Final-v1.pdf).
- Provides in-order, lossless delivery of block data (Source: https://www.snia.org/education/what-is-fibre-channel)
- Deployed for low latency applications best suited to block-based storage (Source: https://www.snia.org/education/what-is-fibre-channel)
Popular Vendors:
- Broadcom Inc. (formerly Emulex and Brocade): Provides Fiber Channel Host Bus Adapters (HBAs) and switches. (Source: https://www.broadcom.com/products/storage/fibre-channel-host-bus-adapters)
- Cisco Systems: Offers Fiber Channel switches. (Source: https://www.cisco.com/c/en/us/products/interfaces-modules/mds-9000-48-port-8-gbps-advanced-fibre-channel-switching-module/index.html)
- Dell EMC: Offers host bus adapters (HBAs) which incorporate Fiber Channel via PCIe (Source: https://www.dell.com/en-us/shop/fibre-channel-hbas/ar/7761).
- Hewlett Packard Enterprise (HPE): Offers storage switches, SANs, and other networking equipment with Fiber Channel support. (Source: https://buy.hpe.com/us/en/storage/storage-networking/c/304608)
- IBM: Offers enterprise SANs with Fiber Channel connectivity. (Source: https://www.ibm.com/storage-area-network?_ga=2.243073927.400418427.1684156226-82144775.1666370910&_gl=1*frug94*_ga*ODIxNDQ3NzUuMTY2NjM3MDkxMA..*_ga_FYECCCS21D*MTY4NDE1NjIyNS44LjEuMTY4NDE2MDYyMS4wLjAuMA..)
For applications in HFT, when used with SANs, Fiber Channel provides rapid access to large volumes of historical data via low-latency storage, which is essential for backtesting trading algorithms, risk management, and recording transactions. Enterprise-class Fiber Channel SANs offer the necessary performance and reliability for these tasks (Source: https://fibrechannel.org/overview/). However, as reliable as Fiber Channel is, it has been gradually phased out in favor of InfiniBand.
InfiniBand is a high-performance network architecture commonly used in High-Performance Computing (HPC) environments, supercomputers, and data centers requiring ULL and high bandwidth. It is now a networking industry-standard specification, defining an I/O architecture used to interconnect servers, communications infrastructure equipment, storage (and thus can be used to replace Fiber Channel systems), and embedded systems (Source: https://www.infinibandta.org/about-infiniband/).
Characteristics:
- Dominated the global 2023 Top 100 supercomputer rankings (Source: https://community.fs.com/article/exploring-the-significance-of-infiniband-networking-and-hdr-in-supercomputing.html)
- Provides throughput up to 2.4 Tbps (via 12X Link eXtended Data Rate (XDR) InfiniBand (Source: https://community.fs.com/article/need-for-speed-–-infiniband-network-bandwidth-evolution.html)) with extremely low latency (that can go below 100 nanoseconds (Source: https://community.fs.com/article/exploring-the-significance-of-infiniband-networking-and-hdr-in-supercomputing.html)).
- Supports Remote Direct Memory Access (RDMA), allowing direct memory access from the memory of one host to another, thus removing CPU overhead which offers ULL.
- Highly scalable, supporting thousands of nodes in a fabric.
- Offers features like Quality of Service (QoS), partitioning, Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) support (to offload collective operations from CPUs and GPUs to the network switch), and error detection and correction mechanisms via Self-Healing Networking (Source: https://community.fs.com/article/key-advantages-of-infiniband-technology.html).
Popular Vendors:
- NVIDIA Corporation: Acquired Mellanox Technologies in 2020, NVIDIA offers a comprehensive range of InfiniBand products, including network adapters, switches, and cables, providing end-to-end solutions. (Source: https://www.nvidia.com/en-us/networking/products/infiniband/)
- Hewlett Packard Enterprise (HPE): HPE provides servers and systems with InfiniBand connectivity, commonly utilized in high-performance computing (HPC) clusters. They offer various InfiniBand options, including adapters and switches. (Source: https://buy.hpe.com/us/en/options/networking-options/infiniband-options/infiniband-options/hpe-infiniband-options/p/1014827455)
- Oracle Corporation: Oracle employs InfiniBand in its engineered systems, such as Exadata and Oracle Cloud Infrastructure, to ensure high-throughput and low-latency connectivity (Source: https://www.oracle.com/database/technologies/exadata/hardware/rdmanetwork/)
- Lenovo: Lenovo offers switches and adapters with InfiniBand options for HPC applications. (Source: https://www.lenovo.com/us/outletus/en/c/data-center/networking/infiniband)
In HFT, InfiniBand is used in environments where the lowest possible latency is required, such as connecting servers within a trading firm's data center or co-location facility. The RDMA capability of InfiniBand reduces CPU overhead, allowing trading applications to process data more quickly and efficiently. This reduction in latency can provide a competitive edge over Ethernet in executing trades. However, the complexity and cost of InfiniBand infrastructure, which requires dedicated InfiniBand NICs and switches (Source: https://www.naddod.com/blog/what-is-rdma-roce-vs-infiniband-vs-iwar-difference?srsltid=AfmBOoobgNz01ip5e6WUWoODWYfeIOGdOgDN1vAV-OyJtuTbnCPA5KC4), mean that it is typically reserved for critical-path components where the performance gains justify the investment. Thus, Fiber Channel can still hold a presence in certain data centers with storage-centric applications where its established infrastructure is valued. Nonetheless, for HPC environments and networking systems at data-center scale, InfiniBand is the preferred choice over Ethernet and Fiber Channel, offering significantly higher bandwidth and lower latency. Even OpenAI used an InfiniBand network, built within Microsoft Azure, to train ChatGPT (Source: https://www.naddod.com/blog/differences-between-infiniband-and-ethernet-networks?srsltid=AfmBOoqj6itv2HQMFm2SiEsstkv8wxhFaJJCeqJdkihimhtHqozlMBYW).
Other Alternatives:
Technologies like RDMA over Converged Ethernet (RoCE) and iWARP aim to bring the benefits of RDMA to Ethernet networks. These protocols enable low-latency, high-throughput communication over Ethernet infrastructure, providing a middle ground between the cost-effectiveness of Ethernet and the performance of InfiniBand. HFT firms may consider these technologies to enhance performance while leveraging existing Ethernet infrastructure (Source: https://lwn.net/Articles/914992/).
In HFT, algorithms execute trades based on the analysis of market data, often capitalizing on price discrepancies that exist across different markets and instruments. These trade opportunities usually exist for mere nanoseconds before being corrected by the market. Thus, the precise timing of data acquisition, data processing, and trade execution is crucial.
Highly accurate and precise time synchronization enables:
- Regulatory compliance: Exchanges and HFT firms can meet stringent regulations that require precise and accurate timestamping of trades and market data. For example, 2020's FINRA CAT requires firms' clocks to be maintained within 100 µs of NIST's atomic clock, and 2018's MiFID II allows a maximum divergence from UTC of 100 µs for algorithmic HFT techniques (Sources: The Significance of Accurate Timekeeping and Synchronization in Trading Systems, September 2024, Francisco Girela-López, Safran Electronics & Defense Spain; https://www.esma.europa.eu/sites/default/files/library/2016-1452_guidelines_mifid_ii_transaction_reporting.pdf).
- Network monitoring and traceability: Improves monitoring and traceability of an HFT firm's network systems by ensuring that the timestamping of network packets is highly precise and accurate (Source: Girela-López, 2024).
- Improved pricing: Enhanced reaction times to data feeds from multiple exchanges can help HFT traders and firms secure better pricing than their competition by taking advantage of latency arbitrage (Source: Girela-López, 2024).
- Enhanced trade execution: Capturing higher-quality data means better ML and AI predictions, resulting in better trading algorithms (Source: Girela-López, 2024).
- Risk management: Backtesting (described in further detail later in the report) using higher-quality data enables a more accurate analysis of trading strategies and performance.
Conversely, imprecise timing can result in significant consequences in HFT, including:
- Regulatory penalties
- Market manipulation risks
- Errors in trade execution, and
- Data inconsistencies
Highly accurate time synchronization across distributed systems is complicated primarily because nanosecond discrepancies can have significant impacts on HFT operations. The primary methods for network time synchronization include the Network Time Protocol (NTP), the Precision Time Protocol (PTP), Intel's Precision Time Measurement (PTM), and emerging technologies like photonic time synchronization.
Developed in the 1980s by Dr. David L. Mills at the University of Delaware, NTP is one of the oldest and most widely used parts of the TCP/IP suite and is used for synchronizing clocks over packet-switched, variable-latency networks. Operating over User Datagram Protocol (UDP) port 123, NTP can synchronize clocks within milliseconds of Coordinated Universal Time (UTC) over the Internet. It is currently on version 4 (NTPv4) (Sources: https://www.techtarget.com/searchnetworking/definition/Network-Time-Protocol#:~:text=Network%20Time%20Protocol%20(NTP)%20is,programs%20that%20run%20on%20computers; https://en.wikipedia.org/wiki/Network_Time_Protocol).
Operation:
NTP has a hierarchical system of layers of clock sources, called strata, where a stratum defines how many hops away a device is from an authoritative time source (Source: https://networklessons.com/cisco/ccnp-encor-350-401/cisco-network-time-protocol-ntp):
- Stratum 0: High-precision timekeeping reference clocks receive true time from a dedicated transmitter (e.g., atomic clocks) or a satellite navigation system (e.g., GNSS) (Source: https://www.techtarget.com/searchnetworking/definition/Network-Time-Protocol#:~:text=Network%20Time%20Protocol%20(NTP)%20is,programs%20that%20run%20on%20computers.).
- Stratum 1: Known as primary time servers, these servers have a direct one-on-one connection with a Stratum 0 device, "achieve microsecond-level synchronization with Stratum 0 clocks, and connect to other Stratum 1 servers for quick sanity tests and backup" (Source: https://www.masterclock.com/network-timing-technology-ntp-vs-ptp.html).
- Stratum 2 and below: These servers are synchronized over the network to higher-stratum servers. They can connect to multiple primary time servers (e.g., stratum 3 device <-- time <-- stratum 2 device <-- time <-- stratum 1 device, and so on) for tighter synchronization and improved accuracy.
NTP supports a maximum of 15 strata, but accuracy is reduced with each additional stratum away from 0 (Source: https://www.masterclock.com/network-timing-technology-ntp-vs-ptp.html).
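Underneath, an NTP exchange reduces to four timestamps, from which the client computes its clock offset and the round-trip delay. The sketch below shows the standard formulas; the timestamp values are invented for illustration.

```python
# NTP clock-offset and round-trip-delay calculation from the four timestamps
# of one exchange: T1 = client send, T2 = server receive, T3 = server send,
# T4 = client receive. All values in seconds.
def ntp_offset_delay(t1, t2, t3, t4):
    offset = ((t2 - t1) + (t3 - t4)) / 2   # how far the client clock is off
    delay = (t4 - t1) - (t3 - t2)          # round trip minus server processing
    return offset, delay

# Invented scenario: client clock 5 ms behind, symmetric 10 ms one-way travel,
# 1 ms of server processing.
t1 = 100.000
t2 = 100.015   # 10 ms travel + 5 ms server-clock lead
t3 = 100.016
t4 = 100.021   # 10 ms travel back, minus the 5 ms lead
offset, delay = ntp_offset_delay(t1, t2, t3, t4)
print(round(offset, 6), round(delay, 6))  # 0.005 0.02
```

The offset formula is only exact when the outbound and return delays are equal; asymmetric paths bias the result, which is one reason NTP cannot reach nanosecond accuracy over the public Internet.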
Relating back to HFT, NTP is known to introduce jitter and delays from its software-based timestamping, which cannot meet the extremely precise (nanosecond and below) timing requirements of HFT systems.
NTP was never meant to be accurate to the nanosecond. Thus, PTP was invented to address ULL time synchronization; note, however, that it assumes network path delays are symmetric.
PTP is an IEEE/IEC standardized protocol defined in IEEE 1588 (Source: https://networklessons.com/cisco/ccnp-encor-350-401/introduction-to-precision-time-protocol-ptp) and offers significantly higher precision than NTP by utilizing hardware timestamping and specialized network equipment. The synchronization process involves "ToD (Time of Day) offset correction and frequency correction" between timeTransmitter and timeReceiver device/clock (Source: https://www.intel.com/content/www/us/en/docs/programmable/683410/current/precision-time-protocol-ptp-synchronization.html). PTP devices/clocks timestamp the length of time that synchronization messages spend in each device, which accounts for device/clock latency (Source: https://www.masterclock.com/network-timing-technology-ntp-vs-ptp.html).
Operation:
PTP operates by exchanging messages between a timeTransmitter (formerly known as "master") clock and a timeReceiver (formerly known as "slave") clock in a hierarchy established by the "Best Master Clock Algorithm" (BMCA), also known as the Best TimeTransmitter Clock Algorithm (BTCA) (Source: https://networklessons.com/cisco/ccnp-encor-350-401/introduction-to-precision-time-protocol-ptp). The protocol accounts for network delays by measuring the time it takes for messages to travel between devices:

PTP master slave clock synchronization messages
Source: https://networklessons.com/cisco/ccnp-encor-350-401/introduction-to-precision-time-protocol-ptp

PTP synchronization diagram showing offset and delay calculation.
Source: https://www.mobatime.com/article/ptp-precision-time-protocol
- Sync Message: The timeTransmitter clock sends a `Sync` message with a timestamp of when it was sent.
- Follow_Up Message: For two-step clocks, the timeTransmitter sends a `Follow_Up` message containing precise transmission timestamps.
- Delay_Request: The timeReceiver clock sends a `Delay_Request` message to the timeTransmitter.
- Delay_Response: The timeTransmitter replies with a `Delay_Response` message containing the reception timestamp of the `Delay_Request`.
- Delay Calculation: The timeReceiver calculates the path delay and adjusts its clock accordingly.
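The message exchange above yields four timestamps from which the timeReceiver derives both its clock offset and the path delay. A minimal sketch of that arithmetic in plain Python (the timestamp values below are hypothetical, and a symmetric path delay is assumed, as PTP does):

```python
# Standard PTP arithmetic from the four exchange timestamps (nanoseconds).
# t1: Sync transmitted by the timeTransmitter
# t2: Sync received by the timeReceiver
# t3: Delay_Request transmitted by the timeReceiver
# t4: Delay_Request received by the timeTransmitter
def ptp_offset_and_delay(t1: int, t2: int, t3: int, t4: int) -> tuple[float, float]:
    offset = ((t2 - t1) - (t4 - t3)) / 2  # timeReceiver clock minus timeTransmitter clock
    delay = ((t2 - t1) + (t4 - t3)) / 2   # estimated one-way path delay
    return offset, delay

# Hypothetical values: the receiver runs 500 ns ahead over a 1000 ns path.
offset, delay = ptp_offset_and_delay(t1=0, t2=1500, t3=3000, t4=3500)
print(offset, delay)  # 500.0 1000.0
```

The timeReceiver would then steer its clock by the computed offset; real implementations filter many such samples rather than trusting a single exchange.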

Clock comparison procedure for the BMCA.
Source: https://blog.meinbergglobal.com/2022/02/01/bmca-deep-dive-part-1/
PTP has four different clock types, each of which can take the timeTransmitter or timeReceiver role:
- Grandmaster clock (GMC)
Generalized PTP over Layer 3 unicast.
Source: https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst9400/software/release/17-13/configuration_guide/lyr2/b_1713_lyr2_9400_cg/configuring_generalized_precision_time_protocol.html
The GMC is the primary source of time in PTP, functioning as a timing reference, and is connected to a reliable time source, such as GNSS or an atomic clock. The GMC always has the timeTransmitter role on its interface(s); therefore, all other clocks synchronize directly or indirectly with the GMC.
- Ordinary clock (OC)
The OC runs PTP on only one of its interfaces. This interface can have the [timeReceiver] or [timeTransmitter] role. The OC is usually an end device that needs its time synchronized. (Source: https://networklessons.com/cisco/ccnp-encor-350-401/introduction-to-precision-time-protocol-ptp)
- Boundary clock (BC)
Conceptual models of a Boundary Clock (BC) in the Grandmaster state, an Ordinary Clock in the Grandmaster state, and a BC not acting as the Grandmaster.
Source: https://blog.meinbergglobal.com/2022/02/01/bmca-deep-dive-part-1/
A BC runs PTP on two or more interfaces. It can synchronize one network segment with another. The upstream interface that connects to the GMC has the timeReceiver role, while the downstream interfaces that connect to other clocks have the timeTransmitter role. A BC also sits between the GMC and other BCs or OCs, and each interface can connect to a different VLAN to synchronize time across VLANs (Source: https://networklessons.com/cisco/ccnp-encor-350-401/introduction-to-precision-time-protocol-ptp).
As an analogy, BCs are like a fridge holding onto ice cubes (i.e., PTP message packets) to prevent them from melting (i.e., from suffering too much latency) (Source: https://engineering.fb.com/2022/11/21/production-engineering/future-computing-ptp/). Adding BCs to the network also has a scalability advantage because it prevents all OCs from having to talk with the GMC directly (Source: https://networklessons.com/cisco/ccnp-encor-350-401/introduction-to-precision-time-protocol-ptp).
A PTP GMC connected to a single BC, which is connected to two OCs using two VLANs.
Source: https://networklessons.com/cisco/ccnp-encor-350-401/introduction-to-precision-time-protocol-ptp
A PTP BC hierarchy cascade, illustrating the scalability of BCs.
Source: https://networklessons.com/cisco/ccnp-encor-350-401/introduction-to-precision-time-protocol-ptp
However, the more BCs you add, the greater the chance that clock accuracy degrades along the chain; therefore, boundary clocks are only suitable for networks with a small number of switches (Source: https://networklessons.com/cisco/ccnp-encor-350-401/introduction-to-precision-time-protocol-ptp).
- Transparent clock (TC)
TCs were introduced in PTPv2 with the goal of forwarding PTP messages. A TC cannot be a source clock like a GMC or BC. Instead, TCs forward PTP messages within a VLAN but not between VLANs (Source: https://networklessons.com/cisco/ccnp-encor-350-401/introduction-to-precision-time-protocol-ptp).
As an analogy, if ice (i.e., PTP message packets) in a fridge (i.e., a BC) is already a bit melted, the fridge (BC) only keeps it from melting (i.e., from suffering too much network latency) a bit further; TCs mitigate this network latency by measuring and adjusting for time delays to improve synchronization, sort of "like insulation on pipes" (Source: https://engineering.fb.com/2022/11/21/production-engineering/future-computing-ptp/). Thus, by adjusting the correction field of a PTP message, which is used to compensate for time delays, TCs can precisely account for any time discrepancies that would otherwise degrade synchronization accuracy.
A PTP TC between a GMC and two OCs using two VLANs.
Source: https://networklessons.com/cisco/ccnp-encor-350-401/introduction-to-precision-time-protocol-ptp
2. Hardware Timestamping:
For most companies, NTP's time resolution is sufficient; however, since NTP is software-based, every timestamp request has to wait on the local operating system (OS), introducing latency that degrades accuracy. PTP provides a far more precise level of time synchronization because it performs hardware timestamping at the network interface level (Source: https://www.masterclock.com/network-timing-technology-ntp-vs-ptp.html).
BCs can be eliminated from a system so that each device in the network communicates directly with the GMC (Source: https://engineering.fb.com/2022/11/21/production-engineering/future-computing-ptp/). However, precisely syncing every device to the GMC then requires GNSS receivers, such as the u-blox RCB-F9T GNSS time module. That module integrates with Meta's custom, open-source Time Card, which connects over PCIe and provides PTP network timestamping through a hardware/software bridge between its GNSS receiver and its atomic clock (Source: https://opencomputeproject.github.io/Time-Appliance-Project/docs/time-card/introduction#bridge).
Hence, NICs and switches equipped with PTP support can timestamp with high precision, minimizing software-induced latency and jitter.
3. Profiles and Extensions:
PTP includes various profiles which are tailored for specific industry applications:
- Default PTP Profile: The default option, which is for general-purpose synchronization (Source: https://networklessons.com/cisco/ccnp-encor-350-401/introduction-to-precision-time-protocol-ptp).
- Telecom Profile: Defined by the ITU-T under the G.8265.1, G.8275.1, and G.8275.2 recommendations, it's used in telecommunication networks (Source: https://networklessons.com/cisco/ccnp-encor-350-401/introduction-to-precision-time-protocol-ptp).
- Power Profile: Defined under the IEEE C37.238 standard, the power profile is intended for power utility networks and their system applications, especially electric grid measurements and control systems (Source: https://networklessons.com/cisco/ccnp-encor-350-401/introduction-to-precision-time-protocol-ptp).
- 802.1AS: Audio Video Bridging over Ethernet (AVB) is a set of standards that describe how to run real-time content such as audio and video over Ethernet networks. 802.1AS explains how to use PTP for AVB (Source: https://networklessons.com/cisco/ccnp-encor-350-401/introduction-to-precision-time-protocol-ptp).
Extensions like the White Rabbit timing system (used at CERN) further enhance PTP, achieving sub-nanosecond accuracy and picosecond precision by running synchronization over fiber. With its optimizations, White Rabbit also ensures network resiliency, providing automatic failover between GPS sources at different trading sites, and precise monitoring by incorporating time references and GNSS time backups over fiber (Source: The Significance of Accurate Timekeeping and Synchronization in Trading Systems — September 2024 - Francisco Girela-López from Safran Electronics and Defense Spain).
4. Notable Features:
PTP offers several notable features:
- Supports multiple outputs including PTP, NTP, PPS, PPO, 10MHz, SMPTE, IRIG-B, IRIG-A, IRIG-E, NMEA 0183, NENA (Source: https://www.masterclock.com/network-timing-technology-ntp-vs-ptp.html)
- Can provide accuracy to within 15 nanoseconds of UTC (Source: https://www.masterclock.com/network-timing-technology-ntp-vs-ptp.html)
- Offers SSH configuration with AES 256 encryption (Source: https://www.masterclock.com/network-timing-technology-ntp-vs-ptp.html)
- Includes IPv4/IPv6 network compatibility (Source: https://www.masterclock.com/network-timing-technology-ntp-vs-ptp.html)
- Can function as an NTP client and/or server (Source: https://www.masterclock.com/network-timing-technology-ntp-vs-ptp.html)
Another advantage of PTP is that if your network runs on Ethernet or IP, you can use your existing network for time synchronization, since "PTP can run directly over Ethernet or on top of IP with UDP for transport" (Source: https://networklessons.com/cisco/ccnp-encor-350-401/introduction-to-precision-time-protocol-ptp).
5. PTP Challenges:
Implementing PTP requires compatible hardware throughout the network, which can be costly. PTP-compatible hardware includes not just switches and routers but also end devices, like servers, NICs, and GNSS receivers, that are capable of supporting PTP timestamps. The required specialized hardware can be significantly more expensive than non-PTP-enabled devices, making the initial setup cost quite high. Also, network asymmetry and variable time delays still pose challenges, though they are significantly mitigated compared to NTP. In asymmetrical network paths, PTP's accuracy is reduced because packets take different routes to and from their destinations, which leads to variable time delays. Hence, poorly designed networks can introduce small inaccuracies, which is a risk in the complex, high-traffic networks of HFT. Variable time delays can also be introduced by network congestion, buffering, or processing delays in intermediate devices (such as having too many BCs). Additionally, accuracy may be lost when the clock is accessed in application software because "synchronization protocols using hardware timestamping synchronize the network device clock" (Source: https://www.electronicdesign.com/technologies/embedded/article/21276422/pci-sig-boost-time-synchronization-accuracy-with-the-pcie-ptm-protocol). Electronic Design further explains,
Accessing that clock in application software usually requires a relatively slow memory-mapped I/O (MMIO) read. Because it’s unknown when the device clock is actually sampled during the read operation, the time value received by application software is inaccurate by up to one half the MMIO read latency. An MMIO read of the network device clock may take several microseconds to complete. This means that the 1-µs accurate PTP clock is practically unusable by application software.
To be clear, while PTP includes mechanisms like hardware-based timestamping to reduce the impact of these delays, achieving nanosecond and sub-nanosecond accuracy requires careful management and design of the network architecture. Accurate calibration of devices (which can be enhanced through PTP extensions like White Rabbit), careful network design, and continuous monitoring and optimization are essential to achieve the desired levels of synchronization precision.
PCI Express (PCIe) Precision Time Measurement (PTM) is a supported feature in the PCI-SIG PCI Express 3.0 specification and is defined as a new protocol for the timing measurement and synchronization of messages for time-sensitive media and server applications within a distributed system (Source: https://docs.nvidia.com/networking/display/nvidia5ttechnologyusermanualv10/precision+time+measurement). In other words, PCIe PTM is an optional feature in the PCIe specifications for time synchronization over standard PCIe connections (Source: https://www.youtube.com/watch?v=lcbs9PRMjs0), facilitating time synchronization at the nanosecond level.
Operation:
As described by Electronic Design,
The PTM protocol distributes PTM master time from the PTM root [port] to each PTM-capable endpoint device, supplying a common time reference used to correlate endpoint device clocks. Using the common time reference, any endpoint device clock can be correlated with any other endpoint device clock. PTM master time is propagated from the upstream device to the downstream device for each PCIe link in the path to the endpoint device. PTM propagates time using protocol-specific Transaction Layer Packets (TLPs) that must be timestamped on transmission and reception. This requires hardware timestamping in every port in the path between the PTM root port and the endpoint device, including switch ports.
Protocol:
PTM operates over the PCIe bus by propagating master time to requesting devices through a series of PTM dialogs. Each dialog involves a PTM Request Transaction Layer Packet (TLP) initiated by the downstream device. This request is timestamped upon transmission (T1) by the requesting upstream port and upon reception (T2) by the responding downstream port. The downstream port responds with a PTM Response or PTM ResponseD TLP, which is timestamped on transmit (T3) and on reception (T4) (Source: https://www.electronicdesign.com/technologies/embedded/article/21276422/pci-sig-boost-time-synchronization-accuracy-with-the-pcie-ptm-protocol).
The PTM protocol assumes symmetrical upstream and downstream PCIe link delays when computing link delay and master time. The link delay is calculated from these timestamps as:

$$\text{Link Delay} = \frac{(T4 - T1) - (T3 - T2)}{2}$$
However, devices like PCIe retimers can introduce asymmetry, which would require adjustments by offsetting the link delay by half the asymmetry to maintain accuracy (Source: https://www.electronicdesign.com/technologies/embedded/article/21276422/pci-sig-boost-time-synchronization-accuracy-with-the-pcie-ptm-protocol).
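That symmetric-delay computation, together with the half-asymmetry correction described above for devices like retimers, can be sketched in plain Python (the timestamp and asymmetry values below are hypothetical):

```python
# PTM link delay from one dialog's four timestamps (nanoseconds).
# t1: PTM Request transmitted    t2: PTM Request received
# t3: PTM Response transmitted   t4: PTM Response received
def ptm_link_delay(t1: int, t2: int, t3: int, t4: int,
                   asymmetry_ns: float = 0.0) -> float:
    # Round-trip time minus the responder's turnaround time, halved.
    symmetric = ((t4 - t1) - (t3 - t2)) / 2
    # A known asymmetry (e.g., introduced by a retimer) offsets the result by half.
    return symmetric - asymmetry_ns / 2

print(ptm_link_delay(0, 100, 160, 260))                   # 100.0
print(ptm_link_delay(0, 100, 160, 260, asymmetry_ns=20))  # 90.0
```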
Enhancements for Accuracy:
Duplicate PTM messages can lead to mismatched timestamps and inaccuracies. The PTM protocol addresses duplicate PTM messages by using the most recent transmit or receive timestamps and invalidating potentially mismatched ones. The Enhanced PTM (ePTM) capability further improves accuracy by adding requirements to invalidate timestamps when duplicates are detected or messages are replayed. Thus, ePTM is recommended for all PTM-capable devices and required for devices supporting Flit mode (Source: https://www.electronicdesign.com/technologies/embedded/article/21276422/pci-sig-boost-time-synchronization-accuracy-with-the-pcie-ptm-protocol).
Device Roles and Capabilities:
Devices indicate their PTM capabilities through a PTM capability register. The key roles are:
- Requester: Operates on upstream ports, obtaining time from the downstream port. Endpoint devices are typically requesters.
- Responder: Operates on downstream ports, providing PTM master time to upstream ports. Switches and root complexes act as responders on all downstream ports and as requesters on upstream ports.
Devices that support ePTM indicate this capability in their registers. Responders capable of being a PTM root (source of master time) would also indicate that in their capability flags (Source: https://www.electronicdesign.com/technologies/embedded/article/21276422/pci-sig-boost-time-synchronization-accuracy-with-the-pcie-ptm-protocol).
Configuration and Synchronization:
PTM-capable devices are configured using the PTM control register, which enables PTM and designates a device as a PTM root if applicable (Source: https://www.electronicdesign.com/technologies/embedded/article/21276422/pci-sig-boost-time-synchronization-accuracy-with-the-pcie-ptm-protocol). Effective synchronization requires:
- The farthest upstream PTM-capable device should be configured as the PTM root.
- PTM must be enabled on the endpoint and all devices upstream to the PTM root.
- Multiple PTM roots can exist if the root complex doesn't support PTM, but this may lead to unsynchronized master times, which complicates clock correlation across downstream devices.
The effective granularity field in the control register reports the accuracy of the propagated PTM master, helping application software determine synchronization precision at the endpoint (Source: https://www.electronicdesign.com/technologies/embedded/article/21276422/pci-sig-boost-time-synchronization-accuracy-with-the-pcie-ptm-protocol).
PTM-capable switches propagate their master time from upstream to downstream ports by adjusting their internal clocks to match the PTM master, accounting for link delays and processing times. If a switch isn't configured as a PTM root, it must invalidate its clock after 10 ms to prevent drift unless it is phase-locked with the PTM root device. (Source: https://www.electronicdesign.com/technologies/embedded/article/21276422/pci-sig-boost-time-synchronization-accuracy-with-the-pcie-ptm-protocol).
Relevance to HFT:
Regulatory bodies require strict adherence to time synchronization standards:
- United States: U.S. stock markets mandate the use of the National Institute of Standards and Technology (NIST) Disciplined Clock (NISTDC) for timestamping. Automated high-frequency trades require timestamp accuracy within 50 ms.
- European Union: Trades must reference atomic clocks contributing to UTC, with automatic trades requiring accuracy within 100 µs.
- Multiple jurisdictions: Trades crossing multiple jurisdictions must comply with the most stringent requirement, which necessitates timestamp accuracy within 100 µs relative to an international clock.
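The "most stringent requirement" rule is simple to apply mechanically; a small sketch in plain Python (tolerance values taken from the text above, jurisdiction labels illustrative):

```python
# Timestamp-accuracy tolerances, in microseconds, as stated above.
TOLERANCE_US = {
    "United States": 50_000,  # 50 ms for automated high-frequency trades
    "European Union": 100,    # 100 µs for automatic trades
}

def required_accuracy_us(jurisdictions: list[str]) -> int:
    """A trade crossing jurisdictions must meet the smallest tolerance."""
    return min(TOLERANCE_US[j] for j in jurisdictions)

print(required_accuracy_us(["United States"]))                    # 50000
print(required_accuracy_us(["United States", "European Union"]))  # 100
```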
The bottom line is that PTM offers significant advantages over PTP: higher precision, lower latency, improved reliability, and regulatory compliance. PTM provides hardware-level time synchronization over PCIe, unlike PTP, which operates over Ethernet and is more susceptible to network-induced latency and jitter.

Measured nanosecond accuracy of PCIe PTM support in an embedded processor. Notice that the jitter is always kept within nanoseconds once the PCIe reference clock is connected.
Source: https://www.youtube.com/watch?v=lcbs9PRMjs0
PTM allows for precise and direct communication between hardware components such as CPUs, NICs, and storage devices, ensuring all parts of an HFT system operate in unison and further reducing latency. Because this direct communication avoids the complexities of network traffic and congestion inherent in PTP, PTM offers a more stable and reliable synchronization method.
Potential challenges of PTM include integration costs, since existing systems might need substantial modifications or replacements, such as PTM-capable NICs like NVIDIA's ConnectX-7. Vendor support for PTM is also limited, which restricts the availability of compatible devices and increases procurement costs. A list of various products and manufacturers can be found at OpenCompute's PTM Readiness page. Nonetheless, investing in PTM could offer long-term competitive advantages in trading speed and accuracy, potentially offsetting the initial higher costs through increased profitability.
Photonic time synchronization represents the cutting edge of time distribution, utilizing optical (i.e., light wavelength) signals to achieve time synchronization with femtosecond ($10^{-15}$ s) precision.
Operation:
- Optical Two-Way Time Transfer: Timing signals are sent in both directions over the same fiber, allowing for the measurement and compensation of delays (Source: https://www.nist.gov/programs-projects/optical-two-way-time-frequency-transfer).
- Frequency Combs: Employing laser-based frequency combs generates a spectrum of optical frequencies, providing an ultra-stable timing reference (Source: https://www.nist.gov/topics/physics/optical-frequency-combs#Timekeeping).
Applications of photonic time sync:
Though still largely in the research and development phase, photonic synchronization has applications in advanced communication systems; for example, a recent October 18, 2024 paper describes its use for space-based networks and satellites (Source: https://pubs.aip.org/aip/app/article/9/10/100903/3317543/Classical-and-quantum-frequency-combs-for). Owing to their ultra-precise measurements of wavelength "ticks", frequency combs have also recently been used in more advanced atomic clocks and in the even more advanced nuclear clocks (Source: https://www.nist.gov/news-events/news/2024/09/major-leap-nuclear-clock-paves-way-ultraprecise-timekeeping).
In regard to HFT, photonic time sync is not yet practical for widespread use due to cost and complexity. However, photonic synchronization could revolutionize timekeeping by providing unprecedented precision, such as in its nascent use in space-based networks (Source: https://pubs.aip.org/aip/app/article/9/10/100903/3317543/Classical-and-quantum-frequency-combs-for). As technology advances, it may become a viable option for ULL trading networks, especially for firms seeking a competitive edge through technological innovation.
From the discussions above on time synchronization methods, it is clear that distributing time over networks faces several common challenges. Those challenges include:
1. Variable Latency:
Networks inherently suffer from variable delays due to factors like routing paths, congestion, and physical distance, making consistent timing accuracy difficult to achieve. Even in high-speed networks, microbursts of traffic can introduce latency spikes. Additionally, there exist several distinct kinds of time delay, such as processing, propagation, serialization, and queueing delays:
- Processing delay: The length of time it takes a router to process the packet header (Source: https://en.wikipedia.org/wiki/Network_delay).
- Propagation Delay: The length of time it takes for the first bit to travel over a data link between the sender and receiver (Source: https://apposite-tech.com/latency/).
- Serialization Delay: Refers to the time difference between the transmission of the first and last byte in a packet (Source: https://apposite-tech.com/latency/). It can also be conceptualized as the time it takes to push the packet's bits onto the data link (Source: https://en.wikipedia.org/wiki/Network_delay).
- Queueing Delay: The length of time a packet sits in a routing queue due to network congestion (Source: https://apposite-tech.com/latency/).
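These components sum along a path; a back-of-the-envelope sketch in plain Python (the link parameters are illustrative, and ~2×10^8 m/s is the usual approximation for light in fiber):

```python
# One-way latency as the sum of the four delay components listed above.
SPEED_IN_FIBER_M_PER_S = 2e8  # roughly 2/3 the speed of light in vacuum

def one_way_latency_s(distance_m: float, packet_bits: int, link_bps: float,
                      processing_s: float = 0.0, queueing_s: float = 0.0) -> float:
    propagation = distance_m / SPEED_IN_FIBER_M_PER_S  # first bit in flight
    serialization = packet_bits / link_bps             # pushing all bits onto the link
    return processing_s + propagation + serialization + queueing_s

# Illustrative: a 1500-byte frame over a 10 Gbps link spanning 100 km of fiber.
latency = one_way_latency_s(distance_m=100_000, packet_bits=1500 * 8, link_bps=10e9)
print(f"{latency * 1e6:.1f} µs")  # 501.2 µs (500 µs propagation + 1.2 µs serialization)
```

Note how propagation dominates at this distance, which is why HFT firms fight for physical proximity to exchanges.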
2. Jitter:
NIST defines jitter as "non-uniform delays that can cause packets to arrive and be processed out of sequence" (Source: https://csrc.nist.gov/glossary/term/jitter). In time protocols that rely on packet exchange (e.g., NTP and PTP), jitter can introduce errors in calculated offsets, leading to inaccurate clock adjustments. Sources of jitter include network congestion, variable processing times, poor hardware performance, and not implementing packet prioritization (Source: https://www.ir.com/guides/what-is-network-jitter). Overall, jitter increases uncertainty in timing measurements.
Shown below is an equation that illustrates how to calculate jitter from a collection of packets, usually extracted from one or more `.pcap` or `.pcapng` files. The equation uses:

- $N$ : Total number of collected packets
- $D_{i}$ : Delay of the $i$-th packet
- $\overline{D}$ : Average delay of all measured packets

$$J = \frac{1}{N}\sum_{i=1}^{N}\left|D_{i} - \overline{D}\right|$$
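A minimal sketch of the jitter calculation in plain Python, computing the mean absolute deviation of per-packet delays (the delay values below are hypothetical; in practice they would be extracted from capture files):

```python
# Jitter as the mean absolute deviation of measured packet delays.
def jitter(delays: list[float]) -> float:
    n = len(delays)
    mean = sum(delays) / n                         # average delay over all packets
    return sum(abs(d - mean) for d in delays) / n  # mean deviation from that average

# Hypothetical per-packet delays in microseconds:
delays_us = [100.0, 102.0, 98.0, 101.0, 99.0]
print(jitter(delays_us))  # 1.2
```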
- Measuring Jitter with AMD's `sysjitter` tool:
  - Definition: AMD defines the Solarflare `sysjitter` utility tool as:

    The Solarflare `sysjitter` utility measures the extent to which a system introduces packet jitter and how that jitter impacts user-level processes. `sysjitter` runs a separate thread on each CPU core, measures elapsed time when the thread is de-scheduled from the CPU core and produces summary statistics for each CPU core.

    You can download `sysjitter` from https://github.com/Xilinx-CNS/cns-sysjitter. After downloading, review the `sysjitter` README file for instructions on building and running `sysjitter`.

    Note: Be sure to run `sysjitter` when the system is idle.
- AMD's Tuning Guide: Low Latency Tuning for AMD EPYC CPU-Powered Servers
- Using Solarflare's `sysjitter` utility tool:

  Note: For advanced tuning or troubleshooting, consult AMD's documentation or contact their support team for assistance.

  - Installation and Preparation:
    - Download and Review the Tool:
      - Obtain `sysjitter` from its GitHub repository.
      - Refer to the README file included in the repository for detailed installation instructions.
    - Install the Utility:
      - Follow the installation steps outlined in the README file.
    - Save the Script:
      - Save the `run_sysjit.txt` script (available in the tuning documentation) as `/opt/run_sysjit.sh`.
      - Adjust the script to match your system's `sysjitter` binary installation location. For example: `/opt/LowLatency_Jitter/AMD/sysjitter-1.4/sysjitter`
    - Make the Script Executable:
      - Run the following command to ensure the script is executable: `chmod +x /opt/run_sysjit.sh`
  - System Check and Preparation:
    - Verify System Readiness:
      - Ensure the system is idle with a load average close to `0.00`. Run: `uptime`
      - Examples:
        - Ready system: `load average: 0.00, 0.01, 0.02`
        - Not ready system: `load average: 5.03, 4.07, 0.88`
      - Wait for load averages to stabilize before proceeding.
    - Check TuneD Profile:
      - Confirm the system is configured with the `AMDLowLatency` TuneD profile: `tuned-adm active`
      - Example output: `Preset profile: AMDLowLatency`
    - Run Housekeeping Script:
      - AMD recommends running the housekeeping script: `/usr/local/bin/oneshot_script.sh`
  - Running `sysjitter`:
    - Navigate to the Directory:
      - Go to the `/opt` directory where the `run_sysjit.sh` script is saved: `cd /opt`
    - Execute the Script:
      - Run the script with two arguments: `./run_sysjit.sh 100 605`
      - Arguments:
        - `100`: Ignores interrupts shorter than 100 ns.
        - `605`: Specifies the duration in seconds (10 minutes and 5 seconds in this example).
      - Start with shorter runs (e.g., 65 seconds or 305 seconds) to quickly analyze jitter-related issues.
  - Output and Analysis:
    - Locate the Output:
      - `sysjitter` creates a directory with the current date and time, such as: `/lowlatency/20240626134405PDT/`
      - Example files:
        - Raw data: `sysjitter.amd-lowlat.20240626132446PDT.txt`
        - Formatted output: `sysjitter.amd-lowlat.20240626132446PDT.tab`
    - Sample Output:
      - Output files include statistics for individual cores, such as: `sysjitter.amd-lowlat.20240626132446PDT.01` and `sysjitter.amd-lowlat.20240626132446PDT.02`
      - Review these files to identify jitter events and analyze core-specific behavior.
    - Interpretation:
      - Examine the `.tab` file for a formatted summary of jitter events.
      - Use the per-core statistics to assess the time-history of jitter events across isolated cores.
  - Tips for Effective Use:
    - System Isolation: Run `sysjitter` when the system is not actively handling other workloads.
    - Incremental Runs: Start with shorter durations to identify key patterns before conducting longer investigations.
    - Core Monitoring: Focus on isolated cores (e.g., cores 1-7 and 9-15) for precise jitter analysis.
    - Detailed Analysis: Use output files to identify interrupt trends and isolate potential bottlenecks.
3. Asymmetry:

Example diagram of asymmetric versus symmetric IP routing
Source: https://www.thenetworkdna.com/2023/12/asymmetric-vs-symmetric-ip-routing.html
Network paths often have different delays in the upstream and downstream directions due to routing asymmetries or different transmission speeds. There are two main sources of asymmetry: asymmetrical routing and media conversion. Asymmetrical routing refers to packets taking different paths in each direction of a full-duplex data transmission (Source: https://www.auvik.com/franklyit/blog/asymmetric-routing-issue/). Media conversion asymmetry refers to the differing delays of transmission media; for example, fiber-optic cable offers higher speed and lower latency than twisted-pair copper cable.
4. Other Challenges:
Other notable challenges include network load, hardware limitations, and security concerns.
- High network load, i.e. network congestion from insufficient network bandwidth or packet loss, can exacerbate network latency and jitter.
- Accurate timestamping of packets depends on the capabilities of the hardware.
- Time synchronization systems are vulnerable to network attacks such as time spoofing attacks, Denial-of-Service (DoS) and Distributed DoS (DDoS) attacks, and Man-in-the-Middle (MITM) attacks.
  - Time spoofing attacks involve supplying a network with an incorrect time (Source: https://www.bodet-time.com/resources/blog/1755-maintaining-network-security-when-using-ntp-time-synchronization.html).
  - DoS/DDoS attacks involve a malicious attempt to disrupt the normal traffic of a targeted server, service, or network by overwhelming the target or its surrounding infrastructure with a flood of Internet traffic (Source: https://www.cloudflare.com/learning/ddos/what-is-a-ddos-attack/#:~:text=A%20distributed%20denial%2Dof%2Dservice%20(DDoS)%20attack%20is,a%20flood%20of%20Internet%20traffic.).
  - A MITM attack intercepts communication between two systems or devices (Source: https://owasp.org/www-community/attacks/Manipulator-in-the-middle_attack).
On August 22, 2013, the Nasdaq experienced a 3-hour trading halt due to a software glitch affecting time synchronization. The cause was attributed to a software bug in the Securities Information Processor (SIP), which was overwhelmed by an unexpected surge in traffic from the NYSE's Arca system, a fully automated exchange market that uses a central limit order book (CLOB) to automatically match buy and sell orders in the market. The incident highlighted the critical importance of accurate timing and led to increased scrutiny of time synchronization practices in financial markets (Source: https://en.wikipedia.org/wiki/August_2013_NASDAQ_flash_freeze).
At the heart of any time synchronization system is the oscillator, a device that generates a repetitive signal used to maintain time. The stability and accuracy of an oscillator directly impact the precision of the clock it powers.
The most basic type of oscillator, a crystal oscillator (XO), uses the mechanical resonance of a vibrating quartz crystal to create an electrical signal with a precise frequency. XOs operate through the piezoelectric effect, an electromechanical effect by which an applied electrical charge produces a mechanical force that changes the shape of the crystal and, conversely, a mechanical force applied to the crystal produces an electrical charge (Source: https://www.electronics-tutorials.ws/oscillator/crystal.html).
The accuracy of an XO can be influenced by temperature changes, which can cause frequency to slightly drift. Thus, XOs are commonly used in applications where moderate accuracy is sufficient, like embedded controller clocks (Sources: https://www.dynamicengineers.com/content/what-is-the-difference-between-xo-and-tcxo-and-ocxo, https://www.circuitcrush.com/crystal-oscillator-tutorial/).
Operation:
- Oscillation: When an alternating electric field is applied to a quartz crystal wafer, it vibrates at a specific resonant frequency, determined by the physical dimensions of the crystal (Source: https://www.electricity-magnetism.org/crystal-oscillators/).
- Stability: The resonant frequency of a quartz crystal is extremely stable and reliable, making it ideal for use in a XO, providing a constant clock signal for electronic devices (Source: https://www.electricity-magnetism.org/crystal-oscillators/).
Since XOs are the most basic type, XOs are rather cost-effective. However, XOs are limited in that they are not suitable for applications which require high stability over varying temperatures.
A temperature-compensated crystal oscillator (TCXO) offers improved stability over a standard XO because it compensates for fluctuating environmental temperatures.
Operation:
- Oscillation: A TCXO uses a mechanism to adjust its oscillator's frequency for variation in temperature (Source: https://www.xtaltq.com/news/the-basic-characteristics-of-tcxo-ocxo-vcxo.html).
- Temperature-compensated stability: When temperature changes, the TCXO compensates by adjusting its frequency to maintain stability, making TCXOs more accurate than XOs in environments with varying temperatures. TCXOs have a typical temperature stability of ±0.2 ppm to ±2.0 ppm, with an aging rate between ±0.5 ppm/year and ±2.0 ppm/year. In time terms, the clock can drift by 0.2 µs to 2 µs for each second of elapsed time under varying temperatures, and the drift rate worsens by a further 0.5 ppm to 2 ppm each year as the crystal ages (Source: https://blog.bliley.com/quartz-crystal-oscillators-guide-ocxo-tcxo-vcxo-clocks).
This stability in the crystal's resonant frequency under varying temperatures makes it ideal for use in communication devices, GNSS, and other time-critical applications (Source: https://www.dynamicengineers.com/content/what-is-the-difference-between-xo-and-tcxo-and-ocxo).
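The ppm figures quoted for these oscillators translate directly into accumulated time error. A minimal back-of-the-envelope sketch (the part values are the ones quoted above, used purely for illustration):

```python
def drift_seconds(ppm: float, elapsed_s: float) -> float:
    """Worst-case time error accumulated over elapsed_s seconds by a
    clock whose frequency is off by ppm parts per million."""
    return ppm * 1e-6 * elapsed_s

# A ±2.0 ppm TCXO can be off by up to 2 µs after one second of
# elapsed time, and by roughly 0.17 s after a full day if its
# error is never corrected by an external time source.
worst_case_day = drift_seconds(2.0, 86_400)
```

This is why even a "good" quartz oscillator must be disciplined by an outside reference (GNSS, NTP, PTP) in any system that cares about absolute time.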
Unlike TCXOs and OCXOs, VCXOs are not designed around stability over temperature; they are placed on a circuit board to receive a frequency from another device or application, whereas TCXOs and OCXOs are integrated into a circuit board to provide a stable frequency signal (Source: https://ecsxtal.com/news-resources/vcxos-vs-tcxos-vs-ocxos/).
Operation:
- A VCXO applies an external voltage to shift the oscillator's frequency up or down to match the shifted frequency from the side of data transmission (Source: https://ecsxtal.com/news-resources/vcxos-vs-tcxos-vs-ocxos/).
- VCXOs offer frequency deviation ranges from ±10 ppm to as much as ±2000 ppm, i.e. the output can be pulled off nominal by 10 µs to 2000 µs per second of elapsed time, with an aging rate between ±1 ppm/year and ±5 ppm/year (Source: https://blog.bliley.com/quartz-crystal-oscillators-guide-ocxo-tcxo-vcxo-clocks).
- When working with a VCXO, pullability must be considered: the extent to which the oscillator's frequency can be altered through external voltage control (Source: https://ecsxtal.com/news-resources/vcxos-vs-tcxos-vs-ocxos/).
The output frequency of a VCXO can shift with a change in voltage control, although it is highly dependent on the oscillator circuit, making VCXOs widely used in electronics where a stable but electrically tunable oscillator is required (Source: https://www.xtaltq.com/news/the-basic-characteristics-of-tcxo-ocxo-vcxo.html).
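Pullability can be pictured as a simple linear mapping from control voltage to frequency offset. The sketch below is illustrative only; the nominal frequency, 0-3.3 V control range, and ±100 ppm pull range are assumed values, not figures from any particular datasheet:

```python
def vcxo_frequency(nominal_hz: float, control_v: float,
                   v_min: float = 0.0, v_max: float = 3.3,
                   pull_ppm: float = 100.0) -> float:
    """Idealized linear VCXO model: the control voltage pulls the
    output within +/- pull_ppm of nominal (hypothetical part values)."""
    # Map the control voltage onto [-1, +1] around the midpoint.
    x = 2.0 * (control_v - v_min) / (v_max - v_min) - 1.0
    return nominal_hz * (1.0 + x * pull_ppm * 1e-6)
```

At the midpoint voltage the output sits at nominal; driving the control pin to either rail pulls the frequency by the full ±100 ppm, which is how a receiving circuit tracks a transmitter's slightly shifted clock.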
OCXOs are the most stable type of oscillator because they work proactively: the XO is enclosed in a built-in temperature-controlled chamber (oven) that holds the crystal at a very stable temperature (Source: https://www.dynamicengineers.com/content/what-is-the-difference-between-xo-and-tcxo-and-ocxo).
Operation:
An OCXO contains two key parts to control the temperature range: a thermistor and a comparator circuit. The thermistor records the temperature of the circuitry, and the comparator circuit adjusts the oscillator's voltage to bring the temperature back to its predetermined point (Source: https://ecsxtal.com/news-resources/vcxos-vs-tcxos-vs-ocxos/).
An OCXO has a typical temperature stability on the order of parts per billion, roughly one to two orders of magnitude tighter than a TCXO's. As a result, OCXOs are used in highly precise applications where that level of temperature stability is required.
For the highest levels of precision, atomic clocks are employed. Atomic clocks use the consistent resonance frequency of atomic energy transitions to measure time. In electronics, the resonance frequency is expressed when a circuit exhibits a maximum oscillatory response at a specific frequency (Source: https://resources.pcb.cadence.com/blog/2021-what-is-resonant-frequency). In terms of energy transitions, the resonance frequency can be understood as the frequency of electromagnetic radiation that matches the energy difference of the hyperfine transition of the atom. Hyperfine transitions, commonly termed "hyperfine structure" in atomic physics, refer to the small energy level shifts caused by the interaction between the magnetic field from the atom's nucleus (nuclear spin) and its surrounding electrons, where each transition is specific to each type of atom (Sources: https://chem.libretexts.org/Bookshelves/Physical_and_Theoretical_Chemistry_Textbook_Maps/Supplemental_Modules_(Physical_and_Theoretical_Chemistry)/Quantum_Mechanics/13%3A_Fine_and_Hyperfine_Structure/Hyperfine_Structure, https://en.wikipedia.org/wiki/Hyperfine_structure#:~:text=The%20term%20transition%20frequency%20denotes,h%20is%20the%20Planck%20constant.).
Most atomic clocks combine a quartz XO with a set of atoms to achieve greater accuracy than traditional clocks. At its core, an atomic clock works by tuning light or microwave waves to the specific resonant frequency of a specific set of atoms, causing hyperfine transitions, i.e. causing the electrons in those atoms to jump between energy states. The steady oscillations of this light at the resonant frequency are counted to create a "tick" of time. Unlike mechanical or quartz clocks, atomic clocks achieve exceptional accuracy, down to sub-nanosecond precision, because atoms of a specific element always have the same natural frequency, providing a stable and universal measurement of time (Sources: https://www.nasa.gov/missions/tech-demonstration/deep-space-atomic-clock/what-is-an-atomic-clock/, https://www.nist.gov/atomic-clocks/how-do-atomic-clocks-work).
As the most accurate timekeeping devices ever created, atomic clocks have led to significant advances in science and technology, playing a critical role in the development of the Global Positioning System (GPS), which requires extremely precise timekeeping to synchronize the clocks of GPS satellites with GPS receivers on Earth. Atomic clocks have many other applications, such as synchronizing the timing of TV video signals and monitoring the control and frequency of power grids. In every application, atomic clocks ensure that data is transmitted accurately and efficiently, even over long distances (Source: https://syncworks.com/background-history-of-atomic-clocks/).
The U.S. Navy (USN) defines a cesium atomic clock as follows:
A cesium atomic clock is a device that uses as a reference the exact frequency of the microwave spectral line emitted by atoms of the metallic element cesium.
- Source: http://tycho.usno.navy.mil/cesium.html
The cesium-133 standard is the primary standard for timekeeping and, in 1967, was adopted by the 13th General Conference on Weights and Measures to define the International System (SI) unit of time, the second.
Operation:
Using atoms of cesium-133, a soft silvery-gold alkali metal, the cesium standard defines the second by having an XO monitor cesium's natural resonance frequency of $9{,}192.631770$ MHz, i.e. the 9,192,631,770 cycles of microwave radiation needed to cause a hyperfine transition. The cesium clock was revolutionary because, according to the USN, it became "the most accurate realization of a unit that mankind" had ever achieved, showcasing its unparalleled accuracy (Source: http://tycho.usno.navy.mil/cesium.html).
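The cesium definition of the second is, at bottom, a counting exercise: one second elapses once 9,192,631,770 cycles of the hyperfine radiation have been counted. Stated as plain arithmetic:

```python
# SI definition: cycles of Cs-133 hyperfine radiation per second.
CESIUM_HZ = 9_192_631_770

def seconds_elapsed(cycles_counted: int) -> float:
    """Time elapsed once cycles_counted cycles have been observed."""
    return cycles_counted / CESIUM_HZ

# One full count of 9,192,631,770 cycles is exactly one SI second;
# each individual cycle lasts about 108.8 picoseconds.
one_second = seconds_elapsed(CESIUM_HZ)
```

Counting such short cycles is also why an atomic clock still needs a quartz XO: the crystal provides the continuously running signal that is disciplined to, and counted against, the atomic resonance.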
Other atomic standards, such as rubidium and the hydrogen maser, offer further trade-offs among accuracy, stability, portability, and cost.
The rubidium (Rb) standard uses atoms of rubidium-87, a whitish-grey alkali metal, in atomic clocks, offering precise time and frequency references based on the resonance frequency of Rb atoms. Rb atomic clocks are the most inexpensive, compact, and widely produced atomic clocks, and are commonly used in GNSS systems like GPS (Sources: https://www.worldscientific.com/worldscibooks/10.1142/11249?srsltid=AfmBOop4V0DXpBy8YuKyngwx9lTbbdbc3YTkOQsF7K4izmEfM2BugFW9#t=aboutBook, https://en.wikipedia.org/wiki/Rubidium_standard).
Operation:
Rb frequency standards operate by having an XO monitor Rb-87's hyperfine transition frequency of $6{,}834.682610904$ MHz, i.e. roughly $6.835$ GHz (Sources: https://en.wikipedia.org/wiki/Rubidium_standard#:~:text=A%20rubidium%20standard%20or%20rubidium,to%20control%20the%20output%20frequency., https://www.wriley.com/A%20History%20of%20the%20Rubidium%20Frequency%20Standard.pdf). Rb clocks have the advantage of portability, achieving an accuracy of about 1 part in $10^{12}$ in a transportable instrument, which makes them useful for carrying time from one cesium clock to another to synchronize the two (Source: http://hyperphysics.phy-astr.gsu.edu/hbase/acloc.html). Thanks to the combination of an OCXO and Rb's natural physical properties, the Rb standard offers high short-term and long-term stability in its resonance frequency, reducing frequency drift and ensuring consistent performance. As a result, Rb atomic clocks offer a long operational life, reducing the need for frequent maintenance and replacement (Source: https://freqelec.com/rubidiumatomicfrequencystandards/).
Applications:
The inexpensiveness and consistent performance of Rb atomic clocks make them suitable even for military use. For example, the U.S. Naval Observatory (USNO) uses continuously running Rb Fountain Clocks, built on an advanced timekeeping technique that uses lasers to cool and trap atoms in a high vacuum, to provide day-to-day precision measured at the femtosecond ($10^{-15}$ s) level, making them the most precise operational clocks in the world as of December 2020 (Source: https://www.cnmoc.usff.navy.mil/Our-Commands/United-States-Naval-Observatory/Precise-Time-Department/The-USNO-Master-Clock/The-USNO-Master-Clock/Rubidium-Fountain-Clocks/).
Cost:
Miniature Rb atomic clocks, from SparkFun, can go for just under $2,000 (Source: https://www.sparkfun.com/products/14830), while full server-sized Rb atomic clocks can go between $4,000 (from Pro Studio Connection) and just under $7,000 (from B&H) (Sources: https://prostudioconnection.com/products/endrun-meridian-ii-timebase-rubidium-gps-ptp-ntp-network-time-server-clock-1pps?currency=USD&utm_source=google&utm_medium=cpc&utm_campaign=google%2Bshopping&gad_source=1&gbraid=0AAAAAC6RA5uAEDOpVDwdHQNKKEOivrLbU&gclid=Cj0KCQjwpvK4BhDUARIsADHt9sTYUVS46Cvhi0qshI_Pi0v3IRVbdKdf-0HiCzSTDwQhocasKkVD2KIaAnd9EALw_wcB, https://www.bhphotovideo.com/c/product/1223815-REG/antelope_10mx_isochrone_rubidium_atomic_clock.html/?ap=y&ap=y&smpadsrd=&smpm=ba_f2_lar&smp=y&lsft=BI%3A6879&gad_source=1&gbraid=0AAAAAD7yMh1TRJmRGjwhEdZIVYZGbhE0R&gclid=Cj0KCQjwpvK4BhDUARIsADHt9sSQNVJTwGYUDhYKq9mGOEiv6KH4YsvgTmwoobYBi8XirkNci2II_oMaAmhuEALw_wcB).
The hydrogen maser standard uses a specific type of maser, a device for microwave amplification by stimulated emission of radiation (Source: https://en.wikipedia.org/wiki/Maser), that uses the properties of a hydrogen atom to serve as a precision frequency reference (Source: https://en.wikipedia.org/wiki/Hydrogen_maser).
Operation:
Hydrogen masers operate at the resonance frequency of the hydrogen atom, which is $1{,}420.405752$ MHz (Source: https://www.sciencedirect.com/topics/earth-and-planetary-sciences/hydrogen-maser). The 2003 Encyclopedia of Physical Science and Technology describes the operation of hydrogen masers as follows: A hydrogen maser works by sending hydrogen gas through a magnetic gate that allows certain energy states to pass through. The atoms that make it through the gate enter a storage bulb surrounded by a tuned, resonant cavity. Once inside the bulb, some atoms drop to a lower energy level, releasing photons of microwave frequency. These photons stimulate other atoms to drop their energy level, and they in turn release additional photons. In this manner, a self-sustaining microwave field builds up in the bulb. The tuned cavity around the bulb helps to redirect photons back into the system to keep the oscillation going. The result is a microwave signal that is locked to the resonance frequency of the hydrogen atom and that is continually emitted as long as new atoms are fed into the system.
- Source: https://www.sciencedirect.com/topics/earth-and-planetary-sciences/hydrogen-maser
As usual, an XO monitors the hydrogen atom's resonance frequency.
Cost:
Active hydrogen masers can be quite expensive, selling for $145,000 on BMI Surplus (Source: https://bmisurplus.com/product/symmetricom-mhm-2010-meter-frequency/?srsltid=AfmBOop1VI67aVsDkiEVEXg1ttSHmqJ9gY13OTyqbPNQ_4N6mPV_SkY4), which can make them unsuitable for less-capitalized HFT firms. However, they are commonly used in space satellites, such as the passive hydrogen maser (Source: https://space.leonardo.com/en/), which generally face more stringent time synchronization requirements.
Applications:
Hydrogen masers have become mainstays in navigation satellites such as GPS, Galileo, and GLONASS (described more later). Hydrogen masers suffer from frequency-pulling effects: externally sourced frequency disturbances that interfere with the oscillator and cause the resonance frequency to shift toward the interfering source (Source: https://en.wikipedia.org/wiki/Injection_locking#:~:text=Injection%20(aka%20frequency)%20pulling%20occurs,inherent%20periodicity%20of%20an%20oscillator.). Consequently, due to the common frequency-pulling effect produced by hydrogen atoms colliding with the maser's container wall, hydrogen masers are not viewed as a primary frequency standard the way cesium and Rb are (Source: https://www.physics.harvard.edu/sites/projects.iq.harvard.edu/files/physics/files/2022-maser.pdf). Instead, hydrogen masers are used as "flywheel" oscillators for demanding applications, steered by primary clocks such as cesium fountain clocks (NIST-F1 and NIST-F2): the signals of a cluster of multiple hydrogen masers are averaged to form the time and frequency standard, and the masers' long-term stability is then maintained by comparison against the primary fountain clocks on a monthly time scale (Source: https://www.physics.harvard.edu/sites/projects.iq.harvard.edu/files/physics/files/2022-maser.pdf).
For their outstanding short-term stability, the USNO also incorporates hydrogen masers in the design of its Master Clock (Source: https://www.cnmoc.usff.navy.mil/Our-Commands/United-States-Naval-Observatory/Precise-Time-Department/The-USNO-Master-Clock/The-USNO-Master-Clock/Hydrogen-Masers-at-the-USNO/).
CSACs are a revolutionary technology: compact, low-power atomic clocks small enough to fit on a PCB. The Department of Defense's research and development agency, the Defense Advanced Research Projects Agency (DARPA), which funded the development of CSACs, reported that commercially available CSACs "achieved a hundredfold size reduction [by being smaller than a single coffee bean] while consuming 50 times less power than traditional atomic clocks" (Source: https://www.nist.gov/noac/success-stories/success-story-chip-scale-atomic-clock).
Operation:
The primary innovation of a CSAC is its microelectromechanical system to keep time:
- It uses a low-power semiconductor laser to shine a beam of infrared light.
- The infrared light is modulated by the built-in microwave oscillator.
- The oscillating infrared light is sent through a capsule of cesium atoms onto a photodetector.
- When the oscillator is at the precise frequency of the hyperfine transition, the optical absorption of the cesium atoms is reduced, increasing the output of the photodetector.
- The output of the photodetector is used as feedback in a frequency-locked loop-circuit to keep the oscillator at the correct frequency (Source: https://en.wikipedia.org/wiki/Chip-scale_atomic_clock).
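The feedback loop described above can be sketched as a toy simulation. Everything here is illustrative (the Lorentzian detector response, linewidth, and dither step are invented numbers, not a real CSAC's parameters), but it shows the core idea: dither the oscillator and steer it toward the side where the photodetector output rises, locking onto the resonance:

```python
# Toy frequency-locked loop steering an oscillator onto an atomic line.
RESONANCE_HZ = 9_192_631_770.0  # Cs-133 hyperfine transition

def photodetector(freq_hz: float, linewidth_hz: float = 1_000.0) -> float:
    """Lorentzian response: output peaks when the oscillator hits
    resonance (absorption of the laser light is minimized there)."""
    detune = (freq_hz - RESONANCE_HZ) / linewidth_hz
    return 1.0 / (1.0 + detune * detune)

def lock(freq_hz: float, step_hz: float = 100.0, iterations: int = 200) -> float:
    """Dither the frequency and step toward the side with more light;
    a real loop uses continuous feedback rather than discrete steps."""
    for _ in range(iterations):
        up = photodetector(freq_hz + step_hz)
        down = photodetector(freq_hz - step_hz)
        freq_hz += step_hz if up > down else -step_hz
    return freq_hz
```

Starting the oscillator a few kilohertz off resonance, the loop walks it back to within one dither step of the atomic line, which is the job the CSAC's frequency-locked loop circuit performs continuously in hardware.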
By placing the microchip-sized CSAC directly on a circuit board, CSACs give a GNSS receiver its own timing signal, significantly reducing the effects of jamming by avoiding the synchronization process of transferring timing signals across devices (Source: https://www.nist.gov/noac/success-stories/success-story-chip-scale-atomic-clock). Unlike traditional oscillators, CSACs maintain accurate timekeeping even in the absence of a GPS signal, making them incredibly reliable in environments where nanosecond-level of precision is crucial (Source: https://novotech.com/pages/chip-scale-atomic-clock-csac).
Cost:
CSACs go from $1,072 (Microchip's Developer Kit CSAC) to over $8,000 for highly advanced CSAC models (such as Microchip's CSAC-SA45S) (Source: https://www.microchipdirect.com/product/search/all/CSAC-SA65/product/search/all/).
Applications:
The unmatched pairing of precision and portability has led to CSACs being found in satellites and in military systems such as improvised explosive device (IED) jammers, GNSS receivers, and unmanned aerial vehicles (UAVs) (Source: https://www.microsemi.com/product-directory/5207).
The most recent application of CSACs is the Open Time Server Project's Time Card, which can incorporate a CSAC on a PCB together with a GNSS receiver to provide accurate GNSS-enabled time to an NTP- or PTP-enabled network. The Time Card is an open-source solution delivered via PCIe (Source: https://github.com/opencomputeproject/Time-Appliance-Project/blob/master/Time-Card/README.md).
Innovations:
Newer innovations of CSACs achieve 100x the accuracy of OCXOs and up to 10,000x the accuracy of TCXOs (Source: https://www.microsemi.com/product-directory/5207).
Optical trapping techniques have led to higher-performance atomic standards, chiefly the "atomic fountain", and have produced cesium and rubidium standards of unprecedented stability, such as the USNO's Rubidium Fountain Clock. More recent still are the techniques of laser-trapping single ions and forming stable "optical lattices", in which atoms are trapped in laser-generated standing waves. Over a decade ago, these two techniques had already produced lab-scale standards of unprecedented stability.
Optical lattice clocks:
The laser maze of a caesium fountain, the tuning fork for America’s official atomic clocks
Source: https://www.ft.com/content/625d2043-a5a4-4d6d-bbe9-42e524a211dd
The conventional atomic clock locks the frequency of a microwave oscillator to a specific transition of cesium atoms by firing microwaves at a group of atoms and measuring the hyperfine transitions. The precision of these measurements is improved by repeating the process many times and averaging away instability, the atoms' internal variation in ticking rate. The higher the transition frequency of an atom, the quicker this averaging can be done. Thus, optical lattice clocks improve over standard cesium clocks by operating at much higher frequencies: they operate at the optical frequencies of intense laser light rather than at microwave frequencies, significantly speeding up the averaging away of resonance frequency instabilities (Source: https://www.optica-opn.org/home/newsroom/2024/july/strontium_lattice_is_now_the_world_s_most_accurate_clock/). In short, optical frequencies divide time into smaller units and thus can offer greater accuracy (Source: https://www.nist.gov/news-events/news/2019/10/jila-team-demonstrates-model-system-distribution-more-accurate-time-signals).
Operation:
The most common optical lattice clock traps thousands of strontium-87 atoms in a vertical lattice of laser light and can achieve stabilities of $10^{-19}$ (Source: https://www.optica-opn.org/home/newsroom/2024/july/strontium_lattice_is_now_the_world_s_most_accurate_clock/). A frequency comb is used to transfer the resonance frequency stability from a silicon cavity (which captures the laser beams) to a prestabilized laser that probes the strontium lattice clock and synchronizes the light with the atoms' ticking (Source: https://www.nist.gov/news-events/news/2019/10/jila-team-demonstrates-model-system-distribution-more-accurate-time-signals).
These strontium lattice clocks are so stable that one "would neither gain nor lose one second in some 15 billion years", roughly the age of the universe. This level of precision allows the measurement of the gravitational shift produced by raising a clock just 2 centimeters above the Earth's surface (Source: https://www.nist.gov/news-events/news/2015/04/getting-better-all-time-jila-strontium-atomic-clock-sets-new-records).
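The 2-centimeter claim can be sanity-checked with the weak-field gravitational time-dilation formula, $\Delta f / f \approx g \Delta h / c^2$, a standard general-relativity approximation (the sketch below is back-of-the-envelope arithmetic, not a metrology-grade calculation):

```python
G_ACCEL = 9.80665        # standard gravity, m/s^2
C_LIGHT = 299_792_458.0  # speed of light, m/s

def gravitational_shift(delta_h_m: float) -> float:
    """Fractional frequency shift between two clocks separated
    vertically by delta_h_m meters (weak-field approximation)."""
    return G_ACCEL * delta_h_m / C_LIGHT**2

# Raising a clock by 2 cm shifts its rate by roughly 2e-18,
# which is resolvable by a lattice clock stable at the 1e-19 level.
shift_2cm = gravitational_shift(0.02)
```

That the shift (about $2 \times 10^{-18}$) sits an order of magnitude above the clock's $10^{-19}$ stability floor is exactly why the measurement is feasible.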
Nuclear clocks:
The latest innovation in precise timekeeping was reported by the University of Colorado Boulder's Joint Institute for Laboratory Astrophysics (JILA) and NIST on September 4, 2024: a nuclear clock, which uses high-frequency light from ultraviolet lasers to excite the nucleus of a thorium-229 atom between nuclear energy states. To more precisely measure frequency cycles, nuclear clocks also employ optical frequency combs. Compared with the electrons used in atomic clocks, the nucleus is much less affected by outside disturbances such as stray electromagnetic fields, and is possibly easier to make portable. Additionally, the higher frequency of light required to excite thorium-229 means more wave cycles per second than traditional forms of light, which yields a greater number of "ticks" per second and therefore more precise timekeeping (Sources: https://www.nist.gov/news-events/news/2024/09/major-leap-nuclear-clock-paves-way-ultraprecise-timekeeping, https://www.ft.com/content/625d2043-a5a4-4d6d-bbe9-42e524a211dd).
Operation:
Briefly, a nuclear clock has the following parts:
- A thorium-229 nuclear transition to provide the clock's “ticks”
- A laser to create precise energy jumps between the individual quantum states of the nucleus, and
- A frequency comb for direct measurements of these “ticks” (Source: https://www.nist.gov/news-events/news/2024/09/major-leap-nuclear-clock-paves-way-ultraprecise-timekeeping)
The thorium nuclear clock achieves a level of precision "that is one million times higher than the previous wavelength-based measurement". Moreover, the researchers established the first direct frequency link between a nuclear transition of the nuclear clock and an atomic (strontium) clock, which is a crucial step towards integrating a nuclear clock with existing timekeeping systems (Source: https://www.nist.gov/news-events/news/2024/09/major-leap-nuclear-clock-paves-way-ultraprecise-timekeeping).
Applications:
The enhanced accuracy of nuclear clocks could lead to:
- More precise navigation systems (with or without GPS, e.g. via CSACs)
- Faster internet speeds
- More reliable network connections
- More secure digital communications, and
- Even tests of whether nature's physical constants are truly constant, enhancing particle physics without the need for large-scale particle accelerator facilities (Source: https://www.nist.gov/news-events/news/2024/09/major-leap-nuclear-clock-paves-way-ultraprecise-timekeeping).
Coordinated Universal Time, abbreviated as UTC, is the worldwide reference time scale computed by France's Bureau International des Poids et Mesures (BIPM) - the international organization dealing with matters related to measurement science and measurement standards. UTC is based on about 450 atomic clocks, which are maintained in 85 national time labs around the world. These 450 atomic clocks provide regular measurement data to BIPM, as well as the local real-time approximations of UTC, known as UTC(k), for national use (Source: https://www.itu.int/hub/wp-content/uploads/sites/4/2023/05/PatriciaTavella_Fig1.jpg.optimal.jpg).
Calculating UTC:
First, a weighted average of all the designated atomic clocks is computed to produce International Atomic Time (TAI). The algorithm involves estimation, prediction, and validation for each type of clock. Measurements comparing clocks at a distance are based either on GNSS or on other techniques, such as two-way satellite time and frequency transfer, or on optical fibers. These measurements all need to be processed to compensate for time delays due, for example, to ionospheric distortions (discussed later), the gravitational field, or the movement of satellites (Source: https://www.itu.int/hub/wp-content/uploads/sites/4/2023/05/PatriciaTavella_Fig1.jpg.optimal.jpg).
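The weighting step can be illustrated with a grossly simplified sketch: given each clock's reading (expressed as an offset from some common reference) and a weight reflecting its stability, the ensemble time is their weighted mean. The real BIPM algorithm also predicts and validates each clock's behavior, which is omitted here:

```python
def ensemble_offset_ns(offsets_ns: list[float], weights: list[float]) -> float:
    """Weighted mean of clock offsets (ns from a common reference).
    More stable clocks receive higher weight, as in the computation
    of a mean atomic time scale (heavily simplified illustration)."""
    total = sum(weights)
    return sum(o * w for o, w in zip(offsets_ns, weights)) / total

# Three clocks reading +10, +20 and +40 ns; the third is judged
# twice as stable as the others, so it pulls the mean toward it.
mean_ns = ensemble_offset_ns([10.0, 20.0, 40.0], [1.0, 1.0, 2.0])
```

The same weighted-mean idea reappears below in how the USNO and NIST compute their own mean timescales from their clock ensembles.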

Obtaining UTC from the offset, i.e. UTC-UTC(k).
Source: https://www.itu.int/hub/2023/07/coordinated-universal-time-an-overview/
Ultimately, UTC is obtained from TAI by adding or removing a leap second as necessary and maintaining the same ticking of the atomic second (Source: https://www.itu.int/hub/wp-content/uploads/sites/4/2023/05/PatriciaTavella_Fig1.jpg.optimal.jpg).
The International Earth Rotation and Reference Systems Service (IERS) determines and publishes the difference between UTC and the Earth's rotation angle indicated by UT1 (Universal Time 1, defined as the mean solar time at 0° longitude (Source: https://crf.usno.navy.mil/ut1-utc)). Whenever this difference approaches 0.9 seconds, a new leap second is announced and applied in all time labs (Source: https://www.itu.int/hub/wp-content/uploads/sites/4/2023/05/PatriciaTavella_Fig1.jpg.optimal.jpg).
UTC in the United States:
In the United States, two primary institutions maintain official time scales.
UTC (USNO):
USNO maintains its own time scale, used primarily by the U.S. Department of Defense. UTC (USNO) operates as an ensemble of atomic clocks, consisting of hydrogen masers, cesium clocks, and rubidium fountain clocks, from which a mean timescale is computed to compensate for frequency drift. Specifically, USNO's calculation of the mean timescale incorporates each clock's weight (relative to the clock's stability), frequency rate, and frequency drift (relative to the mean of the original clock ensemble) (Source: https://www.cnmoc.usff.navy.mil/Our-Commands/United-States-Naval-Observatory/Precise-Time-Department/The-USNO-Master-Clock/USNO-Time-Scales/).
The USNO's Master Clock is the source of UTC (USNO), serving as the lead reference to which all time measurements can be corrected, if necessary. However, most of the time, the differences between these timing systems are about 1 nanosecond or less (Source: https://www.cnmoc.usff.navy.mil/Our-Commands/United-States-Naval-Observatory/Precise-Time-Department/The-USNO-Master-Clock/USNO-Time-Scales/). In relation to international time scales, i.e. the BIPM's computed international UTC, UTC (USNO) has been kept within 26 nanoseconds of it by frequency steering the Master Clocks to the USNO's extrapolation of UTC (Source: https://www.cnmoc.usff.navy.mil/Our-Commands/United-States-Naval-Observatory/Precise-Time-Department/The-USNO-Master-Clock/International-Time-Scales-and-the-BIPM/).
In other words, the USNO's reference clocks are real-time approximations of UTC: the USNO's official real-time reference clock is steered in the short term to the mean time scale of its atomic clock ensemble, which is itself steered to an extrapolation of UTC (Source: https://www.cnmoc.usff.navy.mil/Our-Commands/United-States-Naval-Observatory/Precise-Time-Department/The-USNO-Master-Clock/International-Time-Scales-and-the-BIPM/).
UTC (NIST):
NIST also maintains its own time scale, UTC (NIST), which comprises an ensemble of cesium beam and hydrogen maser atomic clocks. Both types of clocks are regularly calibrated against NIST's primary frequency standard. The number of clocks in the time scale is typically around 10, though it varies. Similar to USNO's time scale, the outputs of NIST's ensemble of atomic clocks are combined as a weighted average to arrive at a single output, with the most stable clocks assigned the most weight in the calculation of the average (Source: https://www.nist.gov/pml/time-and-frequency-division/how-utcnist-related-coordinated-universal-time-utc-international).
UTC (NIST) serves as a national standard for resonance frequency, time interval, and time-of-day, and is continuously compared to the time and frequency standards located around the world (Source: https://www.nist.gov/pml/time-and-frequency-division/how-utcnist-related-coordinated-universal-time-utc-international).
"Coordinated" time:
As described by the USNO, the world's timing centers, including USNO, submit their clock measurements to BIPM, which then uses them to compute a free-running (unsteered) mean time scale. The BIPM then applies frequency corrections to the mean time scale, i.e. "steers" it, based on two kinds of measurements: measurements intended to keep the International System's (SI's) basic unit of time, the second, constant; and measurements from primary frequency standards. The result of these frequency corrections is another time scale, TAI. Adding leap seconds to TAI produces UTC: the world's timing centers agree to keep their real-time time scales closely synchronized, i.e. "coordinated", with UTC. Hence, all these atomic time scales are called Coordinated Universal Time (UTC), of which NIST's version is UTC (NIST) and USNO's version is UTC (USNO) (Source: https://www.cnmoc.usff.navy.mil/Our-Commands/United-States-Naval-Observatory/Precise-Time-Department/The-USNO-Master-Clock/International-Time-Scales-and-the-BIPM/). Since NIST operates as a timing center, just like the USNO, clocks in the UTC (NIST) time scale also contribute to TAI and the official Coordinated Universal Time (UTC) (Source: https://www.nist.gov/pml/time-and-frequency-division/how-utcnist-related-coordinated-universal-time-utc-international).
UTC (NIST) and UTC (USNO) are kept in very close agreement, typically within 20 nanoseconds, and both can be considered official sources of time in the United States (Source: https://www.nist.gov/pml/time-and-frequency-division/how-utcnist-related-coordinated-universal-time-utc-international). A list of recent differences between UTC (USNO, Master Clock) and UTC (NIST), i.e. UTC(USNO) - UTC(NIST), is published by NIST.
Everyone HATES leap seconds:
It is important to note that leap seconds result from irregularities in Earth's rotation, which have been attributed in part to climate change and melting ice caps. Rather than addressing Earth's environmental ecosystems, however, technologists have adapted computer systems to cope (Source: https://engineering.fb.com/2022/07/25/production-engineering/its-time-to-leave-the-leap-second-in-the-past/).
Recall that leap seconds are scheduled to be inserted into or deleted from the UTC time scale at irregular intervals to keep UTC synchronized with Earth's rotation. When a leap second is inserted, most Unix-like OS kernels simply step the time back by 1 second at the beginning of the leap second, so the last second of the UTC day is repeated and duplicate timestamps can occur. However, many distributed applications get confused if the system time is stepped back by leap second insertions or deletions (Source: https://docs.ntpsec.org/latest/leapsmear.html). The most common example is a leap second insertion producing the unusual timestamp sequence
23:59:59 → 23:59:60 → 00:00:00
which can crash programs and even corrupt data due to weird timestamps in data storage. For example, in 2012, Reddit experienced a massive outage because of a leap second, which made the site inaccessible for 30 to 40 minutes. The leap second confused the high-resolution timer (hrtimer), sparking hyperactivity on the servers, which locked up the machines' CPUs. More recently, in 2017, Cloudflare's public DNS was affected by a leap second at midnight UTC on New Year's Day. The root cause of the bug was the assumption that time could not go backward: their Go code took the upstream values and fed them to Go's rand.Int63n() function, which promptly panicked because the argument was negative, causing the DNS server to fail (Source: https://engineering.fb.com/2022/07/25/production-engineering/its-time-to-leave-the-leap-second-in-the-past/).
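The Cloudflare failure mode is easy to reproduce in miniature: a clock stepping backward makes a computed interval negative, and that negative value is then handed to a random-number call that requires a non-negative bound. The defensive fix, sketched here in Python (Cloudflare's actual code and fix were in Go), is to clamp before use:

```python
import random

def safe_jitter(interval_s: float) -> float:
    """Pick a random delay in [0, interval_s]. If a clock step (e.g. a
    leap second) made the computed interval negative, clamp it to zero
    instead of passing a negative bound to the RNG -- the unclamped
    version of this pattern is what made Go's rand.Int63n panic."""
    return random.uniform(0.0, max(interval_s, 0.0))

# A leap-second-induced negative interval now degrades gracefully
# to "no delay" rather than crashing the process.
```

The broader lesson is the same one Meta draws in the cited post: any code that derives durations from wall-clock timestamps must assume time can appear to go backward.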
Thus, many tech organizations smear the leap second over hours to avoid this confusion, a disruption imposed multiple times each decade; however, not all organizations use the same smearing technique:
Google:
Since 2008, Google has been smearing leap seconds. They perform a "24-hour linear smear from noon to noon UTC" (Source: https://developers.google.com/time/smear). The smear is centered on the leap second at midnight, so from noon the day before to noon the day after, each second is about $11.6$ ppm longer (i.e., $\frac{1\,\text{s}}{24 \cdot 60 \cdot 60} \approx 11.574$ µs longer). The difference is too small for most of Google's services to be bothered with, and by centering the smear on midnight, the clock is never more than half a second off: just before midnight it is half a second behind, and just after midnight it is half a second ahead (Source: https://www.explainxkcd.com/wiki/index.php/2266:_Leap_Smearing).
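As a sketch (not Google's actual implementation), a noon-to-noon linear smear for an inserted leap second can be modeled as a clock whose offset from the pre-leap time scale grows linearly, reaching a full second at the end of the 24-hour window; the function and constant names here are illustrative:

```python
# Illustrative model of a 24-hour linear leap smear (insertion case).
# The smeared clock lags the pre-leap time scale by an offset that grows
# linearly from 0 s (noon before the leap) to 1 s (noon after the leap).

SMEAR_WINDOW_S = 24 * 60 * 60  # noon-to-noon window, in seconds

def smear_offset(t: float) -> float:
    """Offset (s) of the smeared clock at t seconds into the window."""
    if t <= 0:
        return 0.0
    if t >= SMEAR_WINDOW_S:
        return 1.0
    return t / SMEAR_WINDOW_S

# Each smeared second is stretched by 1/86400 s, i.e. ~11.574 us (~11.6 ppm).
stretch_us = 1e6 / SMEAR_WINDOW_S
print(round(stretch_us, 3))               # 11.574
print(smear_offset(SMEAR_WINDOW_S // 2))  # 0.5 (half a second off at midnight)
```

At mid-window (midnight, when the leap second officially occurs) the smeared clock is exactly half a second off, matching the description above.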
-
Amazon:
For AWS, Amazon uses the same leap smear as Google, smearing the leap second linearly over a 24-hour period, from noon to noon (Source: https://aws.amazon.com/blogs/aws/look-before-you-leap-the-coming-leap-second-and-aws/).
AWS offers the Amazon Time Sync Service, which is accessible from all EC2 instances and used by various AWS services. There are two versions of the Amazon Time Sync Service: local and public. Both automatically smear any leap seconds that are added to UTC, and both use AWS' fleet of satellite-connected and atomic reference clocks in each AWS region to deliver accurate readings of UTC. AWS recommends using the local Amazon Time Sync Service for EC2 instances to achieve the best performance (Source: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html).
AWS uses the NTP IPv4 endpoint by default for Amazon Linux AMIs, but EC2 instances can be reconfigured to use the PTP hardware clock provided by the local Amazon Time Sync Service. Reconfiguring an EC2 instance to use PTP or NTP connections does not require any VPC configuration changes, and the instance does not require Internet access (Source: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configure-ec2-ntp.html).
-
Meta:
As of July 25, 2022, Meta has stopped future introductions of leap seconds into their systems altogether, smearing the leap second "throughout 17 hours, starting at 00:00:00 UTC based on the time zone data (tzdata) package content". As for the algorithm, Meta uses quadratic smearing, as opposed to Google's and AWS' linear smearing (Source: https://engineering.fb.com/2022/07/25/production-engineering/its-time-to-leave-the-leap-second-in-the-past/).
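To illustrate why a quadratic smear can be attractive, the sketch below uses a piecewise-quadratic ramp whose rate of change is zero at both ends of the window and largest mid-window, so the clock frequency eases in and out of the smear. This is NOT Meta's published implementation; the exact profile is an assumed example, with only the 17-hour window taken from their description:

```python
# Illustrative quadratic leap smear over a 17-hour window (insertion case).
# Unlike a linear smear, the offset ramps in and out smoothly: the frequency
# adjustment is zero at the window edges and maximal at the midpoint.

T = 17 * 60 * 60  # 17-hour window, per Meta's description; profile is assumed

def quadratic_smear_offset(t: float) -> float:
    """Offset (s) of the smeared clock: 0 at t=0, rising smoothly to 1 s at t=T."""
    if t <= 0:
        return 0.0
    if t >= T:
        return 1.0
    x = t / T
    # Two parabolic arcs joined at the midpoint (a "smoothstep"-style ramp).
    return 2 * x * x if x < 0.5 else 1 - 2 * (1 - x) * (1 - x)

print(quadratic_smear_offset(T / 2))  # 0.5: half the second absorbed mid-window
```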
-
Windows:
Microsoft also implements its own leap-second handling, smearing over the last two seconds of the day (Sources: https://www.itu.int/hub/2023/07/coordinated-universal-time-an-overview/, https://learn.microsoft.com/en-us/troubleshoot/windows-server/active-directory/time-service-treats-leap-second).
When the Windows Time service receives a packet that includes a leap second, it does not act on the Leap Indicator. (The Leap Indicator signals whether an impending leap second is to be inserted or deleted in the last minute of the current day.) Therefore, after the leap second occurs, an NTP client running the Windows Time service is one second faster than the actual time. This time difference is resolved at the next time synchronization.
Leap seconds eliminated by 2035:
It is worth mentioning that in 2022, the General Conference on Weights and Measures (CGPM), the primary international authority responsible for maintaining and developing the SI (International System of Units), agreed to eliminate leap seconds by or before 2035, due to many of the same issues mentioned earlier (Source: https://www.bipm.org/en/cgpm-2022/resolution-4). The BIPM has outlined the following notes that have warranted the elimination of leap seconds:
- the accepted maximum value of the difference (UT1-UTC) has been under discussion for many years, because the consequent introduction of leap seconds creates discontinuities that risk causing serious malfunctions in critical digital infrastructure, including the Global Navigation Satellite Systems (GNSSs), telecommunications, and energy transmission systems;
- operators of digital networks and GNSSs have developed and applied different methods to introduce the leap second, which do not follow any agreed standards;
- the implementation of these different uncoordinated methods threatens the resilience of the synchronization capabilities that underpin critical national infrastructures;
- the use of these different methods leads to confusion that puts at risk the recognition of UTC as the unique reference time scale, as well as the role of National Metrology Institutes (and Designated Institutes) as sources of traceability to national and international metrological standards;
- recent observations of the Earth's rotation rate indicate the possible need for the first negative leap second, whose insertion has never been foreseen or tested;
- the Consultative Committee for Time and Frequency (CCTF) has conducted an extensive survey amongst metrological, scientific, and technology institutions and other stakeholders, and the feedback has confirmed the understanding that actions should be taken to address the discontinuities in UTC.
The intent of these changes is to keep UTC in alignment with Earth's rotation and to guarantee UTC's usefulness for at least another 100 years. By using a new tolerance value for the UT1-UTC offset, UTC remains efficient and effective in serving current and future timing applications (Source: https://www.itu.int/hub/2023/07/coordinated-universal-time-an-overview/).
gps.gov defines GNSS as "a general term describing any satellite constellation that provides positioning, navigation, and timing (PNT) services on a global or regional basis" (Source: https://www.gps.gov/systems/gnss/). A GNSS comprises satellites broadcasting signals that encode their location in space and time, networks of ground control stations, and receivers that calculate ground positions by trilateration (Source: https://www.unoosa.org/oosa/de/ourwork/psa/gnss/gnss.html). Ground- or space-based GNSS receivers detect, decode, and process ranging codes and carrier phase transmitted from orbiting GNSS satellites to determine the receivers' 3-dimensional locations and to calculate precise time. The accuracy of a GNSS receiver's location depends on the receiver itself and the post-processing of the satellite data (Source: https://cddis.nasa.gov/Data_and_Derived_Products/GNSS/GNSS_data_and_product_archive.html).
GNSS can also refer to satellite-based augmentation systems, which are systems that aid GPS by adding improvements to PNT that are not inherently a part of GPS itself (Source: https://www.gps.gov/systems/augmentations/), but there are too many international augmentation systems to list and describe (some of the U.S. augmentation systems used for GPS are described in this section) (Source: https://www.gps.gov/systems/gnss/).
GNSSs have proven so useful that they are employed in all forms of transportation: space stations, aviation, maritime, rail, road, and mass transit. The value of the PNT data provided by GNSSs is clear from how crucial it has become to telecommunications, land surveying, law enforcement, emergency response, precision agriculture, mining, finance, scientific research, and other fields (Source: https://www.unoosa.org/oosa/de/ourwork/psa/gnss/gnss.html).
-
Key GNSS Satellite Constellations:
Four key global satellite constellations of GNSS include:
-
Global Positioning System (GPS):
-
Operated/Managed by: United States (DoD)
-
Constellation: 31 satellites
The 31 GPS satellites orbit Earth in 12-hour circular orbits at an altitude of ~11,000 miles, providing users with accurate PNT anywhere and in all weather conditions (Source: https://www.faa.gov/about/office_org/headquarters_offices/ato/service_units/techops/navservices/gnss/gps, https://cddis.nasa.gov/Techniques/GNSS/GNSS_Overview.html).
-
Coverage: Global
From any point on Earth, at least 6 GPS satellites are observable nearly 100% of the time (Source: https://cddis.nasa.gov/Techniques/GNSS/GNSS_Overview.html).
-
Characteristics: Uses Code Division Multiple Access (CDMA)
CDMA leverages spread-spectrum technology to allow multiple users to occupy the same time and frequency allocations in a given frequency band, where each user's data is spread across the bandwidth and tagged with a unique code to differentiate it from other data within the same band (Source: https://novotech.com/pages/code-division-multiple-access-cdma#:~:text=Support-,What%20is%20Code%20Division%20Multiple%20Access%20(CDMA)%20in%20the%20World,allocations%20in%20a%20given%20band.).
-
Time Standard: GPS Time (GPST)
GPST is offset from UTC by a fixed number of seconds and operates on a continuous time scale, i.e. with no leap seconds, and is defined by the GPS Control segment on the basis of a set of atomic clocks at the Monitor Stations and onboard satellites (Source: https://gssc.esa.int/navipedia/index.php/Time_References_in_GNSS).
More formally, GPST is the exact number of seconds since January 6th, 1980 at 00:00:00 UTC (midnight), and since it is not adjusted by leap seconds, GPS is now ahead of UTC by 18 seconds (Source: http://leapsecond.com/java/gpsclock.htm). GPST is synchronized with UTC (USNO) at the 1 µs level (modulo one second), but is actually kept within 25 ns (Source: https://gssc.esa.int/navipedia/index.php/Time_References_in_GNSS).
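The fixed 18-second offset makes conversion between GPST and UTC a matter of simple arithmetic. A minimal sketch, assuming the current (post-2017) leap-second count is hardcoded; a real implementation would consult a leap-second table:

```python
# Sketch: converting between GPS Time (GPST) and UTC using the current
# fixed 18-second offset. The leap-second count is hardcoded here as an
# assumption valid since 2017; historical conversions need a lookup table.
from datetime import datetime, timedelta, timezone

GPS_EPOCH = datetime(1980, 1, 6, 0, 0, 0, tzinfo=timezone.utc)
GPS_UTC_LEAP_OFFSET = 18  # seconds GPST is currently ahead of UTC

def utc_to_gps_seconds(utc: datetime) -> float:
    """Seconds elapsed on the GPST scale at the given UTC instant."""
    return (utc - GPS_EPOCH).total_seconds() + GPS_UTC_LEAP_OFFSET

def gps_seconds_to_utc(gps_seconds: float) -> datetime:
    """UTC instant corresponding to a GPST second count."""
    return GPS_EPOCH + timedelta(seconds=gps_seconds - GPS_UTC_LEAP_OFFSET)

t = datetime(2024, 1, 1, tzinfo=timezone.utc)
assert gps_seconds_to_utc(utc_to_gps_seconds(t)) == t  # round-trip check
```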
-
-
GLObal NAvigation Satellite System (GLONASS):
-
Operated/Managed by: Russia
-
Constellation: 26 satellites
24 satellites are in operation and 2 are in the flight-test phase. They operate similarly to U.S. GPS in terms of satellite constellation, orbits, and signal structure (Source: https://cddis.nasa.gov/Techniques/GNSS/GNSS_Overview.html).
-
Coverage: Global
-
Characteristics: Uses Frequency Division Multiple Access (FDMA)
FDMA assigns each satellite a specific carrier frequency, guaranteeing signal separation since each signal is transmitted in a dedicated frequency slot. However, FDMA requires higher complexity and cost in antenna and receiver design, related to the implementation of the different band-pass filters and calibration. Over the years, GLONASS has progressively included more CDMA signals in its signal plan (Source: https://gssc.esa.int/navipedia/index.php/CDMA_FDMA_Techniques).
-
Time Standard: GLONASS Time (GLONASST)
GLONASST is closely aligned with UTC (SU) and implements leap seconds (Source: https://gssc.esa.int/navipedia/index.php/Time_References_in_GNSS).
-
-
Galileo:
-
Operated/Managed by: European Union
-
Constellation: 30 satellites
27 Medium Earth Orbit (MEO) satellites are in operation; 3 satellites are spares.
-
Coverage: Global
-
Time Standard: Galileo System Time (GST)
GST is a continuous time scale maintained by the Galileo Central Segment and is synchronized with TAI with a nominal offset below 50 ns (Source: https://gssc.esa.int/navipedia/index.php/Time_References_in_GNSS).
-
-
BeiDou:
-
Operated/Managed by: China
-
Constellation: 35 satellites
Includes 5 Geostationary Earth Orbit (GEO), 3 Inclined Geo-Synchronous Orbit (IGSO), and 27 MEO satellites (Source: https://cddis.nasa.gov/Techniques/GNSS/GNSS_Overview.html).
-
Coverage: Global
-
Time Standard: BeiDou Time (BDT)
BDT is a continuous time scale starting at 00:00:00 UTC on January 1st, 2006. To stay as consistent as possible with UTC, BDT may be steered via a frequency adjustment after a period of time (more than 30 days) according to the situation, but the adjustment may not exceed $5 \times 10^{-15}$ (Source: https://gssc.esa.int/navipedia/index.php/Time_References_in_GNSS).
-
In addition to these global satellite constellations, Regional Navigation Satellite Systems (RNSSs) offer service only to specific regions. These RNSSs include:
-
Japan's Quasi-Zenith Satellite System (QZSS):
A four-satellite regional satellite navigation system and satellite-based augmentation system developed by the Japanese government to enhance the US-operated GPS in the Asia-Oceania region, with a focus on Japan (Source: https://cddis.nasa.gov/Techniques/GNSS/GNSS_Overview.html). QZSS plans to expand to a seven-satellite constellation in 2024 to 2025 (Source: https://qzss.go.jp/en/overview/services/seven-satellite.html).
-
The Indian Regional Navigation Satellite System (IRNSS):
An autonomous regional satellite navigation system, comprising 5 satellites, that provides accurate real-time positioning and timing services for the Indian subcontinent (Source: https://cddis.nasa.gov/Techniques/GNSS/GNSS_Overview.html).
-
Using signals from space, each GNSS transmits ranging and timing data to GNSS-enabled receivers, which then use this data to determine location (Source: https://www.euspa.europa.eu/eu-space-programme/galileo/what-gnss).
-
GNSS atomic clocks:
In addition to determining and providing longitude, latitude, and altitude data, GPS provides timing data (Source: https://www.gps.gov/applications/timing/). GPS/GNSS satellites include 3 to 4 atomic clocks that are monitored and controlled to ensure that they are highly synchronized and traceable to UTC (Sources: https://safran-navigation-timing.com/guide-to-gps-gnss-clock-synchronization/).
From the section on Types of Oscillators and Atomic Clocks, it is clear that XOs enable accurate time synchronization by providing a stable frequency reference through the mechanical resonance of vibrating quartz crystals. However, due to their susceptibility to temperature-induced frequency drift, XOs alone are insufficient for the extreme precision required in GNSS systems. Therefore, GNSS satellite clocks are "disciplined" to, or combined with, more advanced oscillators like TCXOs and OCXOs, usually incorporating rubidium or hydrogen maser clocks for their superior short-term stability. For example, Galileo's satellites use a passive hydrogen maser as the master clock and a rubidium clock as a second, independent clock (Source: https://www.esa.int/Applications/Satellite_navigation/Galileo/Galileo_s_clocks).
-
Vendors of GNSS TCXOs and OCXOs:
Various manufacturers sell GNSS-disciplined TCXOs and OCXOs. Some GNSS TCXOs include Jauch's GPS TCXO and SiTime's SiT5155 Super-TCXO. Some GNSS OCXOs include Safran's GXClok-500, Abracon's ABCM-60 GNSS OCXO, and various Microchip GNSS Disciplined Oscillator (GNSSDO) Modules with OCXOs and Atomic Clocks.
For synchronization, the GNSS signal is received, processed by a local master clock, time server, or primary reference, and passed on to the downstream devices, systems or networks so that their local clocks are also synchronized to UTC (Source: https://safran-navigation-timing.com/guide-to-gps-gnss-clock-synchronization/). Recall that UTC is synchronized by highly accurate cesium fountain clocks, hydrogen masers, or rubidium fountain clocks; for example, NIST's cesium fountain clocks, F-3 and F-4, and hydrogen masers, or USNO's ensemble of rubidium fountain clocks, cesium-beams, and hydrogen masers are all used to establish a master clock that are close in time to the predicted UTC (Sources: https://www.nist.gov/pml/time-and-frequency-division/time-realization/cesium-fountain-atomic-clocks, https://www.cnmoc.usff.navy.mil/Our-Commands/United-States-Naval-Observatory/Precise-Time-Department/The-USNO-Master-Clock/The-USNO-Master-Clock/).
As a result, one of the key benefits of using GNSS satellites for time synchronization is that the widespread availability of their embedded atomic clocks removes the need to own or operate a local atomic clock (Sources: https://www.gps.gov/applications/timing/, https://safran-navigation-timing.com/guide-to-gps-gnss-clock-synchronization/).
-
-
GNSS almanac and ephemeris data:
A GNSS almanac is a regularly updated schedule of satellite orbital parameters for use by GNSS receivers, consisting of coarse orbit and status information covering every satellite in the constellation, the relevant ionospheric model and time-related information. For example, the GPS almanac provides the necessary corrections to relate GPS time to UTC. The major role of the GNSS almanac is to help a GNSS receiver to acquire satellite signals from a cold or warm start by providing data on which satellites will be visible at any given time, together with their appropriate positions. The ionospheric model contained within the almanac is essential for single-frequency receivers to correct for ionospheric distortions, the largest error source for GPS receivers (Source: https://www.spirent.com/blogs/2011-05-12_gps_almanac).
GNSS satellite ephemeris data allows the receiver to compute the position of the satellite to pinpoint the exact location of the satellite at the time that the satellite transmitted its time (Source: https://gisresources.com/everything-you-need-to-know-about-gps-l1-l2-and-l5-frequencies/).
-
GNSS trilateration:
GNSS trilateration is the process of using multiple satellites to determine an object's precise 3-dimensional location. It works by measuring the time it takes for a GNSS signal to travel from several nearby satellites to the object's location. A single satellite is not sufficient, as each satellite can only constrain the receiver to a range (a sphere of possible locations around the satellite). To precisely track an object's longitude, latitude, and altitude, the spherical ranges captured from a combination of satellites and ground-based antennas are used (Source: https://www.jpl.nasa.gov/edu/teach/activity/tracking-spacecraft-with-trilateration/#:~:text=GPS%20uses%20what%20is%20called,location%20to%20several%20nearby%20satellites.).
To solidify the concept of GNSS trilateration, the images below are used as an illustration:
One satellite can only identify the distance between you and the satellite. Image credit: NASA/JPL-Caltech
Source: https://www.jpl.nasa.gov/edu/teach/activity/tracking-spacecraft-with-trilateration/#:~:text=GPS%20uses%20what%20is%20called,location%20to%20several%20nearby%20satellites
Adding in a second satellite helps us narrow down the location to one of two intersecting points. Image credit: NASA/JPL-Caltech
Source: https://www.jpl.nasa.gov/edu/teach/activity/tracking-spacecraft-with-trilateration/#:~:text=GPS%20uses%20what%20is%20called,location%20to%20several%20nearby%20satellites
A third satellite allows us to pinpoint a single location at the spot where all three circles intersect. Image credit: NASA/JPL-Caltech
Source: https://www.jpl.nasa.gov/edu/teach/activity/tracking-spacecraft-with-trilateration/#:~:text=GPS%20uses%20what%20is%20called,location%20to%20several%20nearby%20satellites
To find the location of an object in space, we use spherical ranges provided by a combination of ground-based antennas and satellites. Image credit: NASA/JPL-Caltech
Source: https://www.jpl.nasa.gov/edu/teach/activity/tracking-spacecraft-with-trilateration/#:~:text=GPS%20uses%20what%20is%20called,location%20to%20several%20nearby%20satellites
Determining the location of a person using GNSS trilateration. Image credit: u-blox
Source: https://www.u-blox.com/en/blogs/insights/gnss-time-synchronization-development
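To make the geometry above concrete, here is a minimal 2-D trilateration sketch. (The 3-D GNSS case adds a third coordinate plus a receiver clock-bias unknown, which is why four satellites are needed in practice.) The anchor positions and ranges are made-up example values:

```python
# Minimal 2-D trilateration: given three known anchor positions and
# measured ranges to each, recover the unknown position by subtracting
# circle equations pairwise, which yields a 2x2 linear system.

def trilaterate_2d(anchors, ranges):
    """anchors: [(x1,y1),(x2,y2),(x3,y3)]; ranges: [r1, r2, r3]."""
    (x1, y1), (x2, y2), (x3, y3) = anchors
    r1, r2, r3 = ranges
    # Linear system A @ [x, y] = b, from (circle_i) - (circle_1)
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = r1**2 - r2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = r1**2 - r3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21  # nonzero when anchors are not collinear
    return ((b1 * a22 - b2 * a12) / det, (a11 * b2 - a21 * b1) / det)

# Receiver at (3, 4); ranges measured to anchors at (0,0), (10,0), (0,10).
x, y = trilaterate_2d([(0, 0), (10, 0), (0, 10)],
                      [5.0, 65 ** 0.5, 45 ** 0.5])
print(round(x, 6), round(y, 6))  # 3.0 4.0
```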
-
GNSS signal:
GNSS satellites continuously transmit navigation signals in 2 or more frequencies in L band, where L-band refers to a segment of the electromagnetic spectrum with frequencies ranging between 1 - 2 GHz (Source: https://www.sparkfun.com/news/8954).
The main GNSS signals components are (Source: https://gssc.esa.int/navipedia/index.php?title=GNSS_signal):
- Carrier: Radio frequency sinusoidal signal at a given frequency.
- Ranging code: Sequences of 0s and 1s that allow the receiver to determine the travel time of the radio signal from satellite to receiver. They are called Pseudo-Random Noise (PRN) sequences or PRN codes.
- Navigation data: A binary-coded message providing information on the satellite ephemeris (Keplerian elements or satellite position and velocity), clock bias parameters, almanac (with a reduced accuracy ephemeris data set), satellite health status, and other complementary information.
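The role of the ranging code can be illustrated with a toy correlation experiment: the receiver slides a local replica of the satellite's PRN sequence against the received signal, and the shift with the highest correlation reveals the signal's travel time in chips. The sequence below is random ±1 values rather than a real GPS Gold code, and the delay and parameters are illustrative:

```python
import random

random.seed(42)
N = 1023                  # C/A-code length in chips
CHIP_RATE = 1.023e6       # C/A chipping rate (chips/s)
C = 299_792_458.0         # speed of light (m/s)

# Toy PRN sequence of +/-1 chips (a real receiver uses the satellite's Gold code).
prn = [random.choice((-1, 1)) for _ in range(N)]

# Simulate a received signal: the same code, circularly delayed by 137 chips.
true_delay = 137
received = [prn[(i - true_delay) % N] for i in range(N)]

def correlate(shift):
    """Correlation of the local replica, delayed by `shift`, with the signal."""
    return sum(prn[(i - shift) % N] * received[i] for i in range(N))

# The correlation peaks (value N) at the true delay; elsewhere it stays small.
est_delay = max(range(N), key=correlate)
pseudorange_m = est_delay / CHIP_RATE * C  # delay in chips -> meters
print(est_delay)  # 137
```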
-
Measuring time:
In regard to time, GNSSs rely strongly on measuring the time of arrival of propagating radio signals. Thus, each GNSS has its own time reference to which all elements of the Space, Control, and User segments are synchronized, as are most GNSS-based applications. GNSS time scales include GPS Time (GPST), GLONASS Time (GLONASST), Galileo System Time (GST), and BeiDou Time (BDT) (Source: https://gssc.esa.int/navipedia/index.php?title=Atomic_Time).
Most GNSSs, excluding GLONASS (Russia's GNSS), opted at the outset to synchronize their clocks and time scales with UTC without adding any leap seconds, to avoid the discontinuity risks posed by insertions or deletions of leap seconds (Source: https://www.itu.int/hub/2023/07/coordinated-universal-time-an-overview/).
-
Assessing GNSS performance:
GNSS performance can be assessed using 4 criteria (Source: https://www.euspa.europa.eu/eu-space-programme/galileo/what-gnss):
- Accuracy: the difference between a receiver’s measured and real position, speed or time.
- Integrity: a system’s capacity to provide a threshold of confidence and, in the event of an anomaly in the positioning data, an alarm.
- Continuity: a system’s ability to function without interruption.
- Availability: the percentage of time a signal fulfils the above accuracy, integrity and continuity criteria.
GNSS performance can be improved using satellite-based augmentation systems (Source: https://www.euspa.europa.eu/eu-space-programme/galileo/what-gnss).
-
GPSTest:
Applications like GPSTest visualize satellite positions, signal strengths, and timing data, aiding in monitoring, diagnostics, and optimization of GNSS signals.
Evolution of GPS:
The GPS system has been developed in stages, known as "Blocks," where each generation of satellites introduced significant advancements in features, durability, and performance. The table below outlines the key characteristics and innovations associated with each Block of GPS satellites:
| Block | Launch Period | Features | Notes | Sources |
|---|---|---|---|---|
| Block I | 1978-1985 | Prototype satellites for testing and validation; limited lifespan; basic functionality | Selective Availability (S/A) was not implemented in Block I satellites | Navipedia |
| Block II/IIA | 1989-1997 | C/A code on L1 frequency for civil users; improved reliability and lifespan (7.5 years); precise P(Y) code on L1 & L2 frequencies for military users | Block II/IIA have no satellites in operation | GPS.gov |
| Block IIR/IIR-M | IIR: 1997-2004; IIR-M: 2005-2009 | IIR-M offers a 2nd civil signal on L2 (L2C); flexible power levels for military signals; new military signals (M-code) for enhanced jamming resistance | Block IIR satellites included rubidium clocks, with a planned lifespan of 10 years but an average lifespan of 18 years; Block IIR-M modernized with M-code and L2C | GPS.gov, Navipedia |
| Block IIF | 2010-2016 | Longer lifespan (12 years); new civilian signal on L5 frequency; enhanced atomic clock performance; improved accuracy, signal strength, and quality | Enhanced atomic clocks: 2 radiation-hardened rubidium and 1 cesium clock; Block IIF has 12 operational satellites | GPS.gov, Navipedia |
| GPS III | 2018 onwards | Enhanced accuracy, security, and reliability; improved anti-jamming through increased M-code coverage; new civilian signal (L1C), compatible with other GNSS; increased lifespan (15 years) | Designed and built by Lockheed Martin; provides 3× the accuracy and 8× the anti-jamming capability of existing satellites; modular design allows easy addition of new technology and capabilities | Stanford PNT Presentation, Lockheed Martin News |
GPS Frequencies: L1, L2, and L5:
GPS operates across several frequencies, each serving distinct purposes and user bases, including both civilian and military applications. The following table summarizes the primary GPS frequencies, their uses, and the key features that enhance positioning accuracy and signal reliability:
| Frequency | Purpose | Signals | Features | Sources |
|---|---|---|---|---|
| L1 (1575.42 MHz) | Original civilian GPS frequency | Coarse/Acquisition (C/A) code for all users; P(Y) code (restricted) | Basic positioning and timing services; affected by ionospheric delays and multipath errors. The P(Y) code is used only in military applications, offering better interference rejection than the C/A code, making military GPS more robust than civilian GPS. | GIS Resources |
| L2 (1227.60 MHz) | Initially reserved for military use | P(Y) code and M-code for military applications; civilian L2C on modern receivers | Higher precision due to better ionospheric correction capabilities. The civilian signal (L2C) was introduced with Block IIR-M satellites. By comparing signals at L1 and L2, receivers can correct for ionospheric delay, enhancing precision. | SparkFun |
| L5 (1176.45 MHz) | New civilian frequency for safety-of-life applications | Stronger signal transmission; wider bandwidth (24 MHz); advanced signal design with better error correction | Strong resistance to jamming and spoofing; improved ionospheric correction when used with L1; designed for aviation, maritime, and surveying industries. Addresses high interference and multipath effects common in dense urban environments, enhancing reliability. | GIS Resources |
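The dual-frequency correction mentioned in the table exploits the fact that first-order ionospheric delay scales as 1/f²: combining pseudoranges from two frequencies cancels that term entirely. A sketch of the standard ionosphere-free combination, with made-up range and TEC values:

```python
# Ionosphere-free pseudorange combination:
#   P_IF = (f1^2 * P1 - f2^2 * P2) / (f1^2 - f2^2)
# Because first-order ionospheric delay scales as 1/f^2, it cancels exactly.

F_L1 = 1575.42e6  # Hz
F_L2 = 1227.60e6  # Hz

def iono_free(p1: float, p2: float) -> float:
    """First-order ionosphere-free combination of L1/L2 pseudoranges."""
    return (F_L1**2 * p1 - F_L2**2 * p2) / (F_L1**2 - F_L2**2)

# Simulated measurement: true geometric range plus a 1/f^2 ionospheric delay.
true_range = 20_200_000.0      # meters (illustrative)
K = 40.3 * 2e17                # 40.3 * TEC, with TEC = 2e17 el/m^2 (assumed)
p1 = true_range + K / F_L1**2  # L1 pseudorange, delayed by ~3.2 m
p2 = true_range + K / F_L2**2  # L2 pseudorange, delayed more (lower frequency)
print(abs(iono_free(p1, p2) - true_range) < 1e-3)  # True: delay cancelled
```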
Augmentation Systems:
There is a wide range of different augmentation systems available worldwide that are provided by both government and commercial entities (Source: https://www.gps.gov/systems/augmentations/). To meet specific requirements, the U.S. government has fielded a number of publicly available GPS augmentation systems, including, but not limited to:
-
Wide Area Augmentation System (WAAS):
A regional space-based augmentation system (SBAS) operated by the Federal Aviation Administration (FAA), supporting aircraft navigation across North America (Source: https://www.gps.gov/systems/augmentations/).
-
Continuously Operating Reference Stations (CORS):
Archives and distributes GPS data for precise positioning tied to the National Spatial Reference System. Over 200 private, public, and academic organizations contribute data from almost 2,000 GPS tracking stations to CORS. The U.S. CORS network is managed by the National Oceanic and Atmospheric Administration (Source: https://www.gps.gov/systems/augmentations/).
-
Global Differential GPS (GDGPS):
A high accuracy GPS augmentation system developed by NASA Jet Propulsion Laboratory (JPL) to support the real-time positioning, timing, and determination requirements of NASA science missions (Source: https://www.gps.gov/systems/augmentations/).
-
International GNSS Service (IGS):
A network of more than 350 GPS monitoring stations run by 200 contributing organizations in 80 countries, with the mission of providing the highest quality data and products as the standard for GNSS in support of Earth science research, multidisciplinary applications, and education, and of facilitating other applications benefitting society. Approximately 100 IGS stations transmit their tracking data within 1 hour of collection (Source: https://www.gps.gov/systems/augmentations/).
-
Nationwide Differential GPS System (NDGPS):
Was a ground-based augmentation system that provided increased accuracy and integrity of GPS information to users on U.S. waterways. As of June 30, 2020, NDGPS service has been discontinued due to the termination of Selective Availability and the rollout of the new GPS III satellites, both of which reduced the necessity of the NDGPS as an augmentation approach for close harbors (Source: https://www.gps.gov/systems/augmentations/).
Ionospheric distortion/delay:
Ionospheric delay refers to the slowing and bending of Global Navigation Satellite System (GNSS) signals as they traverse the Earth's ionosphere—a layer filled with charged particles. This phenomenon can introduce significant errors in GNSS positioning, typically around ±5 meters, but potentially more during periods of high ionospheric activity (Source: https://novatel.com/an-introduction-to-gnss/gnss-error-sources).
The extent of ionospheric delay varies based on several factors:
-
Solar Activity: Increased solar radiation enhances ionization, leading to greater signal delays (Source: https://www.e-education.psu.edu/geog862/node/1715).
-
Time of Day: Delays are generally more pronounced during the day due to higher ionization levels (Source: https://www.e-education.psu.edu/geog862/node/1715).
-
Geographical Location: Regions near the magnetic equator and high latitudes experience more significant ionospheric effects (Source: https://galileognss.eu/the-ionosphere-effect-to-gnss-signals/).
-
Correcting for ionospheric distortions:
To mitigate these errors, dual-frequency GNSS receivers compare signals at different frequencies to estimate and correct for ionospheric delays. Single-frequency receivers often rely on ionospheric models to approximate and reduce these errors. Lower-frequency signals, like GPS's L1, experience more significant delays compared to higher-frequency signals, such as L5. This relationship is inversely proportional to the square of the signal's frequency, as given by the following equation (Source: https://www.e-education.psu.edu/geog862/node/1715):
$$I_f = \frac{40.3 \cdot \mathrm{TEC}}{f^{2}}$$
Consequently, the ionospheric delay at L5 is approximately 80% larger than at L1 (Source: https://www.e-education.psu.edu/geog862/node/1715).
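The ~80% figure follows directly from the inverse-square relationship; a quick check using the published L1 and L5 center frequencies:

```python
# Ionospheric delay scales as 1/f^2, so the ratio of delays at two
# frequencies is the inverse ratio of their squared frequencies.
F_L1 = 1575.42e6  # Hz
F_L5 = 1176.45e6  # Hz

ratio = (F_L1 / F_L5) ** 2  # delay at L5 relative to delay at L1
print(round(ratio, 3))      # 1.793, i.e. ~79% more delay at L5
```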
-
How PPS signals operate:
Pulse Per Second (PPS) signals are precise electrical pulses occurring at the start of each second, commonly used to synchronize clocks in electronic devices. These signals are typically generated by GNSS receivers synchronized to atomic clocks in satellites, producing TTL-level pulses with sharp rising edges. The PPS signals are transmitted via coaxial cables or other mediums to connected devices, which use the rising edge of the PPS signal to align their internal clocks (Source: https://en.wikipedia.org/wiki/Pulse-per-second_signal).
-
Benefits and considerations of PPS synchronization:
The advantages of PPS signals include simplicity in implementation and integration, sub-microsecond synchronization accuracy, reliability due to reduced susceptibility to network-induced delays compared to packet-based synchronization, and low latency achieved through direct electrical connections (Source: https://www.ntp.org/documentation/4.2.8-series/pps/).
When implementing PPS signals, considerations include ensuring signal integrity by maintaining clean signal edges and minimal noise, accounting for propagation delays in cables (approximately 5 ns per meter), and preventing ground loops and electrical interference to maintain isolation (Source: https://tf.nist.gov/general/pdf/1498.pdf).
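The propagation-delay consideration above lends itself to a simple compensation step: since the PPS edge left its source before it was observed, the receiver subtracts the cable delay from the observed edge time. The ~5 ns/m figure is the rule of thumb cited above; the function names are illustrative:

```python
# Compensating a PPS timestamp for coaxial-cable propagation delay.
# Rule of thumb from the text: ~5 ns of delay per meter of cable.

NS_PER_METER = 5.0  # approximate propagation delay in coax (assumed constant)

def cable_delay_ns(cable_length_m: float) -> float:
    """Approximate one-way propagation delay through the cable."""
    return cable_length_m * NS_PER_METER

def true_pps_edge_ns(observed_edge_ns: float, cable_length_m: float) -> float:
    """The PPS edge left the source earlier than it was observed."""
    return observed_edge_ns - cable_delay_ns(cable_length_m)

# A 30 m cable run delays the edge by ~150 ns.
print(cable_delay_ns(30.0))  # 150.0
```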
-
Role of PPS in HFT:
In high-frequency trading (HFT) systems, PPS signals play a crucial role in synchronizing servers, ensuring all servers in a data center share the same time reference. They provide accurate timing for logging and transaction records, synchronize network devices like switches and routers to minimize timing discrepancies, and are utilized in dedicated hardware such as packet capture cards and time-sensitive applications (Source: https://www.fmad.io/blog/pps-time-synchronization).
Surveillance in high-rate packet capture emphasizes the need for network monitoring to maintain security in network systems, with activities revolving around capturing and analyzing network traffic to detect and respond to threats (Source: https://media.defense.gov/2022/Jun/15/2003018261/-1/-1/0/CTR_NSA_NETWORK_INFRASTRUCTURE_SECURITY_GUIDE_20220615.PDF).
Some of the most common use cases for network surveillance are listed and described below:
-
Prevent and mitigate cyber risks:
-
Network monitoring systems like Snort or Zeek can capture and analyze traffic at high rates to detect and prevent Distributed Denial of Service (DDoS) attacks.
- By capturing packets in real-time, network monitoring systems can identify unusually high volumes of inbound traffic targeting specific resources and trigger alerts or automated mitigation responses, such as rate-limiting or IP blocking, to reduce the impact on the network.
-
Intrusion Prevention Systems (IPS) equipped with high-packet capture rates can identify and prevent malicious activities such as SQL injection or cross-site scripting attempts by detecting specific patterns within network packets that match known attack signatures, halting potential threats before they reach vulnerable systems.
-
-
Gather intelligence:
-
Packet capture tools help organizations gather intelligence on potential attackers by capturing metadata such as IP addresses, timestamps, and session details.
- For instance, if repeated unauthorized login attempts are detected, these tools can log the origin IP addresses, which analysts can further investigate to determine if they belong to known threat actors or botnets.
-
In cybersecurity operations centers (CSOCs), high-packet rate capture tools are used to gather intelligence on evolving threats by continuously collecting data.
- Analysts can then study packet data over time to identify patterns and build threat models, enabling them to anticipate future attacks and strengthen defenses.
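The metadata-tallying step described above can be sketched as follows; the log entries, field layout, and threshold are hypothetical.

```python
# Illustrative only: tally repeated unauthorized-login sources from captured
# session metadata so analysts can prioritize investigation.
from collections import Counter

failed_logins = [  # (timestamp, source_ip) -- hypothetical capture metadata
    ("2024-01-01T00:00:01", "203.0.113.7"),
    ("2024-01-01T00:00:02", "203.0.113.7"),
    ("2024-01-01T00:00:03", "198.51.100.9"),
    ("2024-01-01T00:00:04", "203.0.113.7"),
]

counts = Counter(ip for _, ip in failed_logins)
# Flag sources with 3 or more failed attempts for further investigation.
suspects = [ip for ip, n in counts.most_common() if n >= 3]
print(suspects)  # ['203.0.113.7']
```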
-
-
Monitor communications:
-
Organizations can use high-rate packet capture to monitor data exfiltration attempts by inspecting outgoing traffic for large, unusual data flows.
- For instance, if an internal device suddenly begins sending substantial amounts of data to an external IP, the monitoring system can alert security teams to investigate, as this behavior might indicate data theft.
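A hedged sketch of this outbound-volume check: compare each internal host's observed egress against a historical baseline. The baselines, traffic figures, and the `flag_exfiltration` helper are illustrative assumptions.

```python
# Sketch of an exfiltration check: flag internal hosts whose outbound byte
# volume far exceeds their historical baseline. All numbers are illustrative.

baseline_bytes = {"10.0.0.12": 5_000_000, "10.0.0.13": 2_000_000}   # typical daily egress
observed_bytes = {"10.0.0.12": 4_800_000, "10.0.0.13": 90_000_000}  # today's capture

def flag_exfiltration(baseline: dict, observed: dict, multiplier: int = 10) -> list:
    """Return hosts sending more than `multiplier` times their baseline."""
    return [host for host, sent in observed.items()
            if sent > multiplier * baseline.get(host, 0)]

print(flag_exfiltration(baseline_bytes, observed_bytes))  # ['10.0.0.13']
```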
-
In compliance monitoring, such as ensuring adherence to HIPAA or GDPR, packet capture enables organizations to track communications for unauthorized data sharing.
- Network administrators can set up filters to capture and monitor packets containing sensitive information, ensuring that it doesn’t leave the network improperly or without encryption.
-
-
National surveillance practices:
High-rate packet capture is also a critical component of national security and intelligence operations conducted by agencies like the United States' National Security Agency (NSA) and governmental bodies in China. These entities employ packet capture technologies to analyze network traffic for intelligence gathering, identifying suspicious activities, tracking individuals, and monitoring potential threats, all while adhering to their respective legal guidelines and surveillance procedures.
-
NSA Surveillance Programs:
The NSA utilizes high-rate packet capture to monitor global communications and internet traffic for signals intelligence (SIGINT). Programs such as PRISM and Upstream Collection involve capturing raw data packets traversing global networks.
-
PRISM Program:
Initiated under the authority of the Foreign Intelligence Surveillance Act (FISA), PRISM collects internet communications from U.S. internet companies. The NSA targets foreign nationals outside the United States by obtaining court orders to collect emails, chats, videos, and file transfers for intelligence analysis (Source: The Washington Post).
-
Upstream Collection:
This program taps directly into the internet's backbone infrastructure to capture raw data packets as they transit global fiber-optic cables. The NSA collects both metadata and content, which are then filtered using specific selectors like email addresses or phone numbers associated with foreign targets (Source: Privacy and Civil Liberties Oversight Board).
-
XKeyscore:
A comprehensive system used by the NSA to search and analyze global internet data collected from various sources. XKeyscore enables analysts to query vast databases containing emails, online chats, browsing histories, and other internet activities without prior authorization.
-
Functionality:
XKeyscore captures and indexes raw data packets, allowing analysts to perform real-time and retrospective searches based on metadata and content. It can retrieve nearly all internet activities of a user, including emails, social media interactions, and browsing history (Source: The Guardian).
-
Usage:
The system is used to track individuals, identify new targets, and monitor potential threats by analyzing large volumes of data. It has been instrumental in counter-terrorism operations but has raised significant privacy and civil liberties concerns due to the breadth of data accessible to analysts.
-
Legal Framework:
Operations involving XKeyscore are conducted under legal authorities such as Executive Order 12333 and FISA. However, disclosures have raised questions about oversight, adherence to legal standards, and the protection of privacy rights (Source: ProPublica).
-
These activities are governed by legal frameworks such as Section 702 of the FISA Amendments Act and Executive Order 12333. Oversight is provided by the Foreign Intelligence Surveillance Court (FISC), congressional intelligence committees, and internal compliance mechanisms to ensure adherence to legal standards and protection of citizens' privacy rights.
-
-
China's Surveillance Mechanisms:
China employs extensive network monitoring and packet capture techniques to control information flow, maintain internal security, and enforce censorship.
-
Great Firewall of China:
This is a combination of legislative actions and technologies used to regulate the internet domestically. Deep Packet Inspection (DPI) is employed to filter and block access to certain websites and services, monitor internet traffic, and enforce content censorship based on government policies (Source: Council on Foreign Relations).
-
Golden Shield Project:
Also known as the National Public Security Work Informational Project, it integrates surveillance technologies to monitor communications, track individuals, and collect data on potential threats. Packet capture and analysis are key components used to scrutinize internet activities and enforce laws (Source: Amnesty International).
Chinese surveillance activities are conducted under laws like the Cybersecurity Law and the National Intelligence Law, which require network operators to store data locally and assist government agencies when requested. These laws grant authorities broad powers to conduct surveillance with limited transparency and oversight (Sources: Stanford's DigiChina on China's Cybersecurity Law of the PRC, U.S. Department of Homeland Security's Data Security Business Advisory on the PRC, U.S. National Counterintelligence and Security Center (NCSC)'s Business Risk 2023 Report on the PRC).
-
Both the NSA and Chinese surveillance agencies capture raw data packets to:
-
Identify suspicious activity:
By analyzing network traffic and packet contents, they detect anomalies indicating cyber threats, unauthorized communications, or activities deemed harmful to national security.
-
Track individuals:
Packet capture allows collection of metadata and content used to trace activities of specific individuals, such as suspects in criminal investigations or foreign intelligence targets.
-
Monitor potential threats:
Continuous monitoring enables agencies to stay alert to emerging threats like cyber-attacks, espionage, or terrorism, facilitating proactive measures.
Adherence to legal guidelines and surveillance procedures:
-
NSA:
Surveillance activities must comply with U.S. laws like FISA, which requires court authorization for targeting and mandates minimization procedures to protect U.S. persons' privacy. Oversight is conducted by the FISC, the Privacy and Civil Liberties Oversight Board, and congressional committees (Source: Office of the Director of National Intelligence).
-
China:
The government's surveillance operations are backed by laws that compel cooperation from citizens and companies. While aimed at national security, these laws often lack transparency and have been criticized for infringing on privacy and freedom of expression (Source: Human Rights Watch).
-
There are a myriad of surveillance strategies, some for small and medium-sized businesses (SMBs) and others for enterprise-level corporations and government institutions, each requiring different kinds of implementation and maintenance strategies. A detailed survey of these strategies is left to the reader to uncover. Nonetheless, a couple of strategies and surveillance tools are described below:
-
Surveillance strategies:
-
Common surveillance strategies include:
-
Network Intrusion Detection Systems (NIDS):
A NIDS is a listen-only security tool designed to monitor network traffic for suspicious activity, anomalies, or known malicious patterns that could indicate potential security threats, such as unauthorized access, malware, or attacks. It operates by analyzing the data packets transmitted over the network and comparing them against a database of known attack signatures, or by identifying anomalous behavior. Because NIDS are listen-only and take no preventative action, they are not sufficient on their own for intrusion prevention (Source: https://www.paloaltonetworks.com/cyberpedia/what-is-an-intrusion-detection-system-ids).
Examples: Snort and Zeek (introduced above).
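The signature-matching core of a NIDS can be sketched as below. The byte patterns are simplified illustrations, not real rule content from Snort or any other rule set.

```python
# Toy signature-based matching: compare a packet payload against known
# attack byte patterns. Real NIDS rules carry far more context (ports,
# protocol state, offsets); these signatures are illustrative only.

SIGNATURES = {
    b"' OR 1=1": "possible SQL injection",
    b"<script>": "possible cross-site scripting",
}

def match_signatures(payload: bytes) -> list:
    """Return the alerts whose byte patterns appear in the payload."""
    return [alert for pattern, alert in SIGNATURES.items() if pattern in payload]

print(match_signatures(b"GET /login?user=admin' OR 1=1--"))  # ['possible SQL injection']
```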
-
Traffic inspection tools/systems:
Such systems typically employ high packet rate capture to allow organizations to monitor network packets in real-time, providing insights into any anomalies or unusual traffic patterns that may indicate a security breach, enabling granular traffic analysis for detailed threat identification and response (Source: https://media.defense.gov/2022/Jun/15/2003018261/-1/-1/0/CTR_NSA_NETWORK_INFRASTRUCTURE_SECURITY_GUIDE_20220615.PDF).
Examples: Wireshark and tcpdump (described under surveillance tools below).
-
-
-
Surveillance tools:
Some popular packet capture tools that aid surveillance include:
- nmap ("Network Mapper") — a free and open source utility for network discovery and security auditing;
- tcpdump — a powerful command-line packet analyzer; and
- Wireshark — known as the world's most powerful network and packet analyzer.
More information on these surveillance tools can be found on their respective documentation pages, which are linked from each tool's name.
In corporate cybersecurity, recording all inbound and outbound traffic through packet capture offers several key benefits to network security:
-
Network forensics:
Corporations can use captured network packets to discover malicious activities and communications related to specific events, such as cyber-attacks, data breaches, network intrusions, and malicious eavesdropping (Source: https://fidelissecurity.com/threatgeek/network-security/network-forensics/).
- Key benefits:
- Analyze captured packets to investigate malicious activities like cyberattacks, data breaches, network intrusions, and eavesdropping.
- Provides a historical record to discover and address vulnerabilities (Source: https://fidelissecurity.com/threatgeek/network-security/network-forensics/)
-
Intrusion detection:
Through analyzing packet data, organizations can detect anomalies indicative of potential threats, such as unusual data transfers or unauthorized access attempts. Proactive approaches are essential for mitigating risks before they escalate into significant security incidents. With careful planning, proactive intrusion detection also improves network engineers' ability to identify attack vectors and prevent future cyberattacks or intrusions (Source: https://fidelissecurity.com/threatgeek/network-security/network-forensics/).
- Key benefits:
- Detect anomalies indicative of potential threats, such as unusual data transfers or unauthorized access attempts.
- Enables organizations to proactively mitigate risks and identify attack vectors (Source: https://fidelissecurity.com/threatgeek/network-security/network-forensics/).
-
Incident response and post-incident analysis:
In the aftermath of a security breach, having a detailed record of network traffic is crucial for understanding the scope and impact of the incident. Packet capture provides a comprehensive log of data exchanges, enabling forensic analysis to determine what information was compromised and how the breach occurred. Additionally, detailed records of network packets help teams understand the extent of an attack so that containment and recovery plans can be organized and carried out (Source: https://fidelissecurity.com/threatgeek/network-security/network-forensics/).
- Key benefits:
- Capture traffic logs to understand the scope and impact of breaches.
- Supports forensic analysis to determine compromised information and improve containment and recovery efforts (Source: https://fidelissecurity.com/threatgeek/network-security/network-forensics/).
-
Compliance & auditing through evidence gathering:
Packet capture assists organizations in meeting their compliance and regulatory standards by providing verifiable records of data flows and access attempts. By reviewing incident response plans, policies, and procedures, organizations can ensure they are complying with federal laws, regulations, and guidance. Through compliance and audits, damage to organizational reputation can be minimized by avoiding legal consequences and public scrutiny. Additionally, reporting systems must be transparent enough to ensure that employees feel safe to report unethical and abusive behavior without fear of retaliation. Routine risk assessments must also be performed to improve on establishing a culture of compliance (Source: https://nvlpubs.nist.gov/nistpubs/specialpublications/nist.sp.800-61r2.pdf, https://www.faa.gov/regulationspolicies/rulemaking/committees/documents/section-103-organization-designation, https://www.justice.gov/criminal/criminal-fraud/page/file/937501/dl?inline).
- Key benefits:
- Provide verifiable records of data flows and access attempts for regulatory requirements.
- Strengthen policies to minimize reputational damage and avoid legal consequences (Source: https://nvlpubs.nist.gov/nistpubs/specialpublications/nist.sp.800-61r2.pdf, https://www.faa.gov/regulationspolicies/rulemaking/committees/documents/section-103-organization-designation, https://www.justice.gov/criminal/criminal-fraud/page/file/937501/dl?inline).
Some recommendations for improving intrusion detection, post-incident analysis, and compliance & auditing include:
-
Physical access control:
Control physical access to your computers and create user accounts for each employee. Since laptops can be a particularly easy target for theft or can be lost, locking them when unattended is key. Another recommendation is to make sure that a separate user account is created for each employee and to ensure that each account requires strong passwords, with administrative privileges only given to trusted IT staff and key personnel (Source: https://www.fcc.gov/communications-business-opportunities/cybersecurity-small-businesses).
-
Limited digital access control:
Do not provide any one employee with access to all data systems. Employees should only be given access to the specific data systems that they need for their jobs, and should not be able to install any software without permission (Source: https://www.fcc.gov/communications-business-opportunities/cybersecurity-small-businesses).
-
Regular backups:
Regularly backing up data on all computers improves record-keeping of network traffic. Critical data includes word processing documents, electronic spreadsheets, databases, financial files, human resources files, and accounts receivable/payable files. Back up data automatically if possible, or at least weekly, and store the copies either offsite or in the cloud (Source: https://www.fcc.gov/communications-business-opportunities/cybersecurity-small-businesses).
-
Mobile device action plan:
Since mobile devices can create significant security and management challenges, especially if they hold confidential information or can access the corporate network, it is crucial to require users to password-protect their devices, encrypt their data, and install security apps to prevent criminals from stealing information while the phone is on public networks. Additionally, proper reporting procedures are necessary to ensure tracking of lost or stolen equipment (Source: https://www.fcc.gov/communications-business-opportunities/cybersecurity-small-businesses).
Network Operations Centers (NOCs) rely on high-rate packet capture to maintain optimal network performance. By analyzing captured packets, engineers can:
-
Diagnose Issues:
Network diagnostics involve inspecting captured packets to identify dropped or reordered packets that may indicate hardware faults or congestion. Packet analyzers can capture and examine network packets to identify where latency and other issues are occurring (Source: https://www.liveaction.com/resources/blog-post/how-packet-analyzers-help-identify-application-performance-issues/).
-
Monitor Network Performance:
Network performance monitoring helps teams understand how a network is performing, providing information such as when, how, and who is sending what data to whom. Capturing packets aids forensic analysts by providing real-time and historical visibility into network traffic behavior. Network performance monitoring is frequently paired with network diagnostics, since both use packet capture to understand the root causes of network issues (Sources: https://www.riverbed.com/faq/network-performance-monitoring-and-diagnostics/, https://www.endace.com/learn/what-is-network-packet-capture).
-
Optimize Performance:
Using packet capture files to identify time periods of network latency or abnormal spikes of traffic helps NOC teams to make decisions about network configurations and upgrades to reduce latency and to fix network security issues. (Sources: https://www.endace.com/learn/what-is-network-packet-capture, https://www.solarwinds.com/resources/it-glossary/pcap#Why-should-IT-teams-use-network-packet-capture-tools?).
Beyond packet capture, network engineers can use packet counters offered by compatible NICs to monitor network activity.
-
What is a packet counter?
A packet counter:
-
Counts packets:
A packet counter tallies the number of packets sent and received (Source: https://www.kernel.org/doc/html/latest/networking/statistics.html#rx_dropped).
-
Monitors traffic volume:
With tools like Microsoft's `pktmon counters`, NOC teams can "confirm the presence of expected traffic and get a high-level view" of traffic activity (Source: https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/pktmon-counters).
-
Detects errors:
With a packet counter, NOC teams can identify the number of bad, corrupted, or dropped packets during transmission relative to the number of packets received (Source: https://www.kernel.org/doc/html/latest/networking/statistics.html#rx_dropped).
-
-
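The counter behavior described above can be modeled with a toy class: it accumulates totals only (packets, bytes, drops) and keeps no per-packet detail. This is an illustration, not a real NIC driver interface.

```python
# A toy packet counter mirroring the NIC statistics described above.
# It tracks running totals only; individual packets are not retained.

class PacketCounter:
    def __init__(self):
        self.rx_packets = 0
        self.rx_bytes = 0
        self.rx_dropped = 0

    def on_packet(self, size: int, dropped: bool = False) -> None:
        """Update totals for one received (or dropped) packet."""
        if dropped:
            self.rx_dropped += 1
        else:
            self.rx_packets += 1
            self.rx_bytes += size

c = PacketCounter()
for size, dropped in [(1500, False), (64, False), (1500, True)]:
    c.on_packet(size, dropped)
print(c.rx_packets, c.rx_bytes, c.rx_dropped)  # 2 1564 1
```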
Lack of granularity:
Packet counters provide only a high-level view of traffic activity and therefore do not offer detailed information about individual packets (Source: https://www.usenix.org/system/files/conference/hotcloud18/hotcloud18-paper-michel.pdf, https://www.kernel.org/doc/html/latest/networking/statistics.html#rx_dropped).
A few examples of using Microsoft's `pktmon counters` command are provided below; the examples were generated with ChatGPT.
Note that the examples below provide high-level information that is quite different from what you would find in packet capture files like `.pcap` and `.pcapng`. Nonetheless, they have their own use for NOC teams.
-
1. Display All Counters
pktmon counters
-
Example Output:
    Type  ID  Name              Packets  Bytes  Drops
    Flow  1   Ethernet Adapter  120,000  15 MB  500
    Flow  2   Wi-Fi             85,000   10 MB  250
    Drop  1   Filtered Out      0        0      750
    Drop  2   Blocked by ACL    0        0      50
-
-
2. Display Flow Counters Only
pktmon counters --type flow
-
Example Output:
    ID  Name              Packets  Bytes
    1   Ethernet Adapter  120,000  15 MB
    2   Wi-Fi             85,000   10 MB
-
-
3. Display Drop Counters Only
pktmon counters --type drop
-
Example Output:
    ID  Drop Reason     Drops
    1   Filtered Out    750
    2   Blocked by ACL  50
-
-
4. Include Hidden Counters
pktmon counters --include-hidden
-
Example Output:
    Type    ID  Name              Packets  Bytes  Drops
    Flow    1   Ethernet Adapter  120,000  15 MB  500
    Flow    2   Wi-Fi             85,000   10 MB  250
    Hidden  3   Internal Bridge   15,000   1 MB   10
    Drop    1   Filtered Out      0        0      750
    Drop    2   Blocked by ACL    0        0      50
-
-
5. Zero Out Counters After Displaying
pktmon counters --zero
-
Example Output (First Run):
    Type  ID  Name              Packets  Bytes  Drops
    Flow  1   Ethernet Adapter  120,000  15 MB  500
    Flow  2   Wi-Fi             85,000   10 MB  250
    Drop  1   Filtered Out      0        0      750
    Drop  2   Blocked by ACL    0        0      50
-
Example Output (Second Run):
    Type  ID  Name              Packets  Bytes  Drops
    Flow  1   Ethernet Adapter  0        0      0
    Flow  2   Wi-Fi             0        0      0
    Drop  1   Filtered Out      0        0      0
    Drop  2   Blocked by ACL    0        0      0
-
-
6. Show Detailed Drop Reasons
pktmon counters --drop-reason
-
Example Output:
    Drop Reason       Drops
    Filtered Out      750
    Blocked by ACL    50
    Checksum Error    20
    Invalid Protocol  5
-
-
7. Live Monitoring with Refresh Rate
pktmon counters --live --refresh-rate 5
-
Example Output (updates every 5 seconds):
    [Update 1]
    Type  ID  Name              Packets  Bytes    Drops
    Flow  1   Ethernet Adapter  120,000  15 MB    500
    Flow  2   Wi-Fi             85,000   10 MB    250

    [Update 2]
    Type  ID  Name              Packets  Bytes    Drops
    Flow  1   Ethernet Adapter  121,000  15.5 MB  505
    Flow  2   Wi-Fi             86,000   10.2 MB  255

    [Update 3]
    Type  ID  Name              Packets  Bytes    Drops
    Flow  1   Ethernet Adapter  122,000  16 MB    510
    Flow  2   Wi-Fi             87,000   10.4 MB  260
-
-
8. Output Counters in JSON Format
pktmon counters --json
-
Example Output:
{ "counters": [ { "type": "Flow", "id": 1, "name": "Ethernet Adapter", "packets": 120000, "bytes": 15728640, "drops": 500 }, { "type": "Flow", "id": 2, "name": "Wi-Fi", "packets": 85000, "bytes": 10485760, "drops": 250 }, { "type": "Drop", "id": 1, "drop_reason": "Filtered Out", "drops": 750 }, { "type": "Drop", "id": 2, "drop_reason": "Blocked by ACL", "drops": 50 } ] }
-
-
9. Combined Use Case
pktmon counters --type drop --drop-reason --live --zero --refresh-rate 2
-
Example Output (refreshes every 2 seconds):
    [Update 1]
    Drop Reason     Drops
    Filtered Out    750
    Blocked by ACL  50
    Checksum Error  20

    [Update 2]
    Drop Reason     Drops
    Filtered Out    5
    Blocked by ACL  3
    Checksum Error  1

    [Update 3]
    Drop Reason     Drops
    Filtered Out    2
    Blocked by ACL  0
    Checksum Error  0
-
-
-
Cannot detect missing UDP packets:
Because UDP does not retransmit lost packets, even when errors occur, packet counters are unable to detect missing UDP packets (Source: https://www.twilio.com/en-us/blog/understanding-packet-loss-and-how-fix-it#Packet-loss-in-TCP-vs-UDP).
-
No Payload Visibility:
Packet counters do not provide access to packet contents or headers. For examples of the information that packet counters do provide, see the `pktmon counters` code examples above.
To supplement packet counters, additional packet analysis tools ought to be used to ensure maximum visibility into every packet, even at large scales. One such tool is a stream processor for packet analytics, which provides "a flexible and scalable way to process unbounded streams of packets in real-time" (Source: https://www.usenix.org/system/files/conference/hotcloud18/hotcloud18-paper-michel.pdf).
Stream processor system architecture for packet analysis.
Source: https://www.usenix.org/system/files/conference/hotcloud18/hotcloud18-paper-michel.pdf
Detailed stream processor architecture for packet analysis.
Source: https://www.usenix.org/system/files/conference/hotcloud18/hotcloud18-paper-michel.pdf
In high-frequency trading, even minor network issues can have amplified effects due to the speed and volume of transactions. Packet capture allows for:
- Nanosecond-level analysis: Detecting and correcting issues that occur at incredibly short timescales.
- Protocol optimization: Fine-tuning protocols to reduce latency.
- Real-time monitoring: Immediate detection of anomalies that could impact trading algorithms.
By detecting microbursts of incoming or outgoing packets, NOC teams can identify possible network intrusions, security issues, and performance problems. With nanosecond accuracy, packet analysis can improve network reliability by enhancing detection mechanisms. Using stream processors for packet analysis, NOC teams can monitor network activity in real time, detecting latency and response-time issues while identifying peak bandwidth usage, all of which is crucial for optimizing network performance and reallocating resources from low-activity periods to high-activity periods. With real-time packet analysis at the nanosecond level, jitter and packet loss are also easier to detect and mitigate (Sources: https://wwwx.cisco.com/c/en/us/products/collateral/cloud-systems-management/provider-connectivity-assurance/provider-connect-test-monitor-so.html, https://www.usenix.org/system/files/conference/hotcloud18/hotcloud18-paper-michel.pdf, https://www.liveaction.com/resources/blog-post/how-packet-analyzers-help-identify-application-performance-issues/).
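A minimal sketch of microburst detection over nanosecond timestamps: count packets inside a short rolling window and flag windows that exceed a burst threshold. The window size, threshold, and timestamps are illustrative assumptions.

```python
# Illustrative microburst detection on nanosecond packet timestamps.
from collections import deque

def find_microbursts(timestamps_ns, window_ns: int = 1_000, threshold: int = 3):
    """Return timestamps at which more than `threshold` packets arrived
    within the preceding `window_ns` nanoseconds."""
    window = deque()
    bursts = []
    for ts in timestamps_ns:
        window.append(ts)
        while window and ts - window[0] > window_ns:
            window.popleft()  # evict packets older than the window
        if len(window) > threshold:
            bursts.append(ts)
    return bursts

# Four packets within 300 ns constitute a burst under these settings.
print(find_microbursts([0, 100, 200, 300, 5_000]))  # [300]
```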
HFT requires extremely high speed and precision in network operations, currently at the nanosecond level, where nanosecond-level packet capture plays a critical role in optimizing network resources and ensuring high reliability. Nanosecond-level packet capture requires several elements:
-
Extremely accurate timestamping at the nanosecond level allows for:
-
Latency measurement:
By ordering packets by extremely precise timestamps, latency in data transmission can be measured and minimized to maintain a competitive advantage (Source: https://www.timebeat.app/post/ultra-low-latency-trading-capturing-timestamps-in-nanoseconds, The Significance of Accurate Timekeeping and Synchronization in Trading Systems — September 2024 - Francisco Girela-López from Safran Electronics and Defense Spain).
-
Order sequencing:
Similar to latency measurement, extremely precise timestamps allow trade-order packets to be sequenced throughout their life cycle with high confidence that they are recorded in the order they were received (Source: https://www.catnmsplan.com). This is vital for accurate trade execution and for compliance with regulations like MiFID II, which requires up to microsecond granularity (Source: https://www.esma.europa.eu/sites/default/files/library/2016-1452_guidelines_mifid_ii_transaction_reporting.pdf), and FINRA CAT, which requires timestamp granularity down to nanoseconds (Source: https://www.finra.org/rules-guidance/notices/20-41) (Sources: https://www.timebeat.app/post/ultra-low-latency-trading-capturing-timestamps-in-nanoseconds, The Significance of Accurate Timekeeping and Synchronization in Trading Systems — September 2024 - Francisco Girela-López from Safran Electronics and Defense Spain).
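A small illustration of timestamp-based sequencing: events are ordered by their nanosecond capture timestamps so the recorded life cycle matches arrival order. The event data is hypothetical.

```python
# Hypothetical capture events: (nanosecond timestamp, description).
events = [
    (1_700_000_000_000_000_200, "order B received"),
    (1_700_000_000_000_000_050, "order A received"),
    (1_700_000_000_000_000_900, "order A filled"),
]

# Sorting by the nanosecond timestamp reconstructs the true sequence,
# which is what granular timestamping makes possible in the first place.
sequenced = sorted(events)
labels = [label for _, label in sequenced]
print(labels)  # ['order A received', 'order B received', 'order A filled']
```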
-
-
Extremely high reliability:
-
Real-time synchronization and monitoring:
With atomic clocks, trade synchronization can be achieved with sub-microsecond accuracy. Together with real-time network monitoring, HFT firms can immediately detect anomalies or failures to improve the customer experience and Mean Time To Repair (MTTR) (Source: https://www.timebeat.app/post/ultra-low-latency-trading-capturing-timestamps-in-nanoseconds, https://blog.niagaranetworks.com/blog/packet-timestamping).
-
-
Extremely high-quality data for backtesting:
Extremely precise packet capture with comprehensive logs of past trades aids trade analysis and strategy development. For example, LSEG's Tick History - PCAP solution provides "raw network packets sent and received by trading systems, providing greater granularity and detailed insight" into market activity, offering all available levels of data feeds: Level 1 (Top of Book), Level 2 (Depth of Book), and Level 3 (Market by Order), each with nanosecond timestamping (Source: https://www.lseg.com/en/insights/fx/revolutionising-fx-price-transparency-with-tick-history-pcap, IE 421 Lecture Notes).
-
Optimization of network resources
Through identifying low-latency periods, network resources can be optimized to allocate bandwidth efficiently.
Note:
This section provides a brief and general overview of an electronic exchange.
For a more detailed examination of exchange/trading firm architecture, please read the Electronic exchange architecture section which goes over the architecture of CME's Globex, Eurex's optimizations in network latency, and passive Traffic Analysis Points/Test Access Points (TAPs).
The architecture of exchanges typically consists of:
-
Gateways (GWs)
- Gateways perform basic network throttling to ensure that incoming requests are controlled, preventing system overloads and ensuring fair access to resources.
-
Ticker Plants (TPs) / Market Data Generators (MDGs)
- TPs take the trade and order-book data produced by the OMEs and disseminate it as market data feeds. TPs play a crucial role in processing and distributing market data efficiently, providing real-time updates to traders and other market participants.
-
Order Matching Engines (OMEs)
- OMEs are generally responsible for handling a specific group of assets. They match buy and sell orders, ensuring trades are executed according to market rules and participants' priorities.
-
Drop-Copy (DC)
- DC systems are used to manage firm-wide risk by providing a consolidated view of activities across the firm. For instance, if you want an additional machine to monitor all trading activities across the organization without relying on individual applications, a DC system is best suited to provide this functionality.
- DC systems are widely used in finance to support reconciliation processes. Firms use these to assess their current positions across trading desks, buildings, or the entire organization.
- Many firms have internal systems to monitor customer activities, and DCs enhance this by identifying risky trades and providing immediate insights into potential exposures.
Simple exchange diagram:
Shown below is a simple diagram of an exchange that was presented in IE 421 High-Frequency Trading Tech, taught by Professor David Lariviere. It depicts:
- GWs connected to an ESB to bi-directionally communicate with OMEs
- GWs connected to OC to bi-directionally communicate with exchange clients (Cs)
- ESB connected to TPs, which are then connected to the exchange OC to provide data feeds

Simple exchange architecture shown in IE 421: High-Frequency Trading Tech (University of Illinois at Urbana-Champaign).
Source: IE 421: High-Frequency Trading Tech, University of Illinois at Urbana-Champaign
Visual Map of Exchanges:
Understanding the geographical distribution of exchanges is essential for network engineers in HFT. The physical distance between data centers directly impacts latency, influencing trading strategies and infrastructure investments. Visual maps, such as those provided by HFT Tracker and Quincy Data, illustrate the locations of major exchanges and the network paths connecting them.
-
HFT Tracker:
HFT Tracker offers an interactive map displaying the microwave routes between key U.S. financial data centers. The map helps to visualize the wireless connections between the largest exchanges and how the largest data centers are typically co-located to ensure the fastest data transfer rates possible.
-
Quincy Data:
Quincy Data provides low-latency market data services and offers a global coverage map that showcases its network. Quincy Data's map highlights its low-latency, microwave-enabled market data services available at major exchanges. Similar to the HFT Tracker map, the Quincy Data map emphasizes the importance of well-connected data services in ensuring high data quality and the broadest market data feeds available to market participants and clients.
-
The New Jersey Triangle:
The "New Jersey Triangle" refers to the three major U.S. equities data centers in northern New Jersey: NYSE in Mahwah, Nasdaq in Carteret, and the Equinix facilities in Secaucus. The triangle's vertices are connected by some of the most latency-optimized microwave and fiber routes in the world, making them a focal point of the maps above.
References:
- HFT Tracker. (n.d.). Interactive Maps. Retrieved from https://www.hfttracker.com/
- Quincy Data. (n.d.). Product Page. Retrieved from https://www.quincy-data.com/product-page/#map
Synchronizing clocks across different data centers is achieved using remote GNSS-enabled time servers, which distribute time data across a data center's machines, or across a firm's machines distributed over multiple data centers.
There are several methods to achieve time synchronization across data centers. One method is to use a service like White Rabbit's Time-as-a-Service (TaaS) to synchronize time across one or more data centers. To recall from the Time Synchronization section, extremely precise time synchronization is achieved with GNSS receivers and PTP or PTM time-sync protocols. Since GNSS receivers have their own atomic clocks, they serve as precision timing systems. With White Rabbit's IEEE 1588 PTP-enabled TaaS time servers, an HFT firm's time synchronization system can achieve sub-nanosecond accuracy and reduce the number of grandmaster clocks (GMs) by using a shared White Rabbit time server that provides the highly accurate time data. By linking clocks, whether Boundary Clocks (BCs) or GMs, to the remote White Rabbit time server, the resiliency and accuracy of the time synchronization system can be drastically improved.

Multiple GNSS compared through White Rabbit links
Source: https://www.youtube.com/watch?v=V7mdB3ildPQ
The diagrams above are borrowed from a lecture titled "Distributing Time Synchronization in the Datacenter" by IEEE Xplore author Francisco Girela-López of Safran Electronics and Defense Spain.
To view more of Francisco Girela-López's research, please visit his IEEE Xplore Author Profile.
A constant challenge for time synchronization across data centers is "clock drift": temperature changes and other factors gradually degrade a clock's accuracy. Timestamp granularity is another issue, because it varies across hardware systems and components; inconsistent granularity (e.g. some devices timestamping at 1 nanosecond versus 0.1 nanosecond) introduces discrepancies that undermine synchronization accuracy. Moreover, the complexity of distributed systems requires careful design and implementation of time synchronization protocols to ensure that every component and clock is unified and synchronized across the entire system.
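As a rough illustration of the granularity problem, the sketch below (hypothetical values and function name) quantizes the same physical event to two different device granularities and shows the discrepancy that results:

```python
# Hypothetical illustration: two capture devices timestamp the same event,
# but one quantizes to 1 ns (1000 ps) and the other to 0.1 ns (100 ps).
def quantize(ts_ps: int, granularity_ps: int) -> int:
    """Truncate a picosecond timestamp to a device's granularity."""
    return ts_ps - (ts_ps % granularity_ps)

event_ps = 1_234_567_891_234  # the same physical event, in picoseconds

coarse = quantize(event_ps, 1000)  # device with 1 ns granularity
fine = quantize(event_ps, 100)     # device with 0.1 ns granularity

# The two devices disagree about when the event happened purely because
# of timestamp granularity, not because of clock error.
discrepancy_ps = fine - coarse
print(f"coarse={coarse} ps, fine={fine} ps, discrepancy={discrepancy_ps} ps")
```

At nanosecond scales this disagreement is on the same order as the synchronization accuracy itself, which is why mixed-granularity hardware undermines the system as a whole.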
References:
- Safran Electronics and Defense Spain, Francisco Girela-López. (2024, September). The Significance of Accurate Timekeeping and Synchronization in Trading Systems [Live Lecture].
- YouTube. (2024, February 21). Distributing Time Synchronization in the Datacenter [Video]. Retrieved from https://www.youtube.com/watch?v=V7mdB3ildPQ
- YouTube. (2024, November 2). Synchronization in the Datacenter [Video]. Retrieved from https://www.youtube.com/watch?v=3gUvZikFePA
- Safran Navigation & Timing. (n.d.). Timekeeping and Synchronization in Trading Systems. Retrieved from https://safran-navigation-timing.com/timekeeping-and-synchronization-in-trading-systems/#:~:text=A%20delay%20of%20a%20few,efficient%20trading%20for%20all%20participants
Backtesting involves simulating a trading strategy using historical data to evaluate its effectiveness. In HFT, backtesting requires highly reliable historical data. Extremely precise timestamping is required when capturing packets for both public and private market data to ensure high quality historical data. After performing latency analysis from captured packets, latency adjustments further improve the quality of data to accurately reflect past market conditions.
-
Challenges in Backtesting:
-
Data synchronization:
Historical data from different exchanges must be synchronized to the exact nanosecond to ensure accuracy. As stated previously, synchronization can be improved with timing systems that leverage GNSS receivers or more advanced timing systems that use White Rabbit's TaaS.
-
Latency adjustments:
Traders must account for the transmission time between data centers, adjusting timestamps to reflect the delay that would have occurred in real trading. For latency adjustments to be accurate, backtesting simulations must model the HFT firm's specific data centers.
-
Data volume:
The sheer volume of data generated at high frequencies requires robust storage, usually with RAID storage, and high-rate packet capture processing capabilities, which can be achieved with packet stream processing.
-
Adjusting backtesting for transmission latency:
When backtesting from a specific location, traders need to adjust the timestamps of all data captured in other data centers to account for the latency between those centers and the simulation location. Location-dependent backtesting simulations ensure that backtests accurately reflect the sequence and timing of market events as they would have been observed in real-time trading, as if trading against a real co-located data center.
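The timestamp adjustment described above can be sketched as follows. The site names, latency values (taken from the case study in this report), and the (timestamp, payload) layout are illustrative assumptions, not a real feed format:

```python
# Sketch: shift timestamps of packets captured at a remote data center so a
# backtest run "from" a chosen simulation site sees them when they would
# actually have arrived there. One-way latencies are in nanoseconds.
ONE_WAY_LATENCY_NS = {
    ("mahwah", "carteret"): 180_000,    # ~180 µs (assumed, per case study)
    ("aurora", "carteret"): 3_982_000,  # ~3.982 ms (assumed, per case study)
}

def adjust_for_backtest(packets, source_site, sim_site):
    """Re-timestamp (ts_ns, payload) packets as observed at sim_site."""
    delay = ONE_WAY_LATENCY_NS.get((source_site, sim_site), 0)
    return [(ts + delay, payload) for ts, payload in packets]

captured = [(1_000_000, b"quote"), (1_000_500, b"trade")]
print(adjust_for_backtest(captured, "aurora", "carteret"))
```

Applying this shift to every remote feed before merging preserves the order in which events would have been observed at the simulation location.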
-
Case study: latency calculations from NASDAQ (Carteret)
This case study examines the latency from NASDAQ's data center in Carteret, New Jersey, to other major exchanges.
-
Latency to NY4 (Secaucus):
- Distance: Approximately 16.15 miles
- Latency: ~90 µs
-
Latency to Mahwah (NYSE):
- Distance: Approximately 35.60 miles
- Latency: ~180 µs
-
Latency to 350 E Cermak (Chicago, ICE):
- Distance: Approximately 798 miles
- Latency: ~7.25 milliseconds
-
Latency to Aurora (CME):
- Distance: Approximately 740 miles
- Latency: ~3.982 milliseconds
Latency can be further reduced using microwave transmission, which is faster than fiber due to the straighter path and higher speed of signal propagation through air.
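These figures can be sanity-checked against the physics. The sketch below computes the theoretical straight-line propagation delay over fiber (signal slowed by the glass, refractive index assumed ~1.468) versus microwave through air (~speed of light). Real routes are longer than straight lines, which is why measured fiber latencies (e.g. ~7.25 ms to Chicago) exceed this lower bound:

```python
# Theoretical straight-line propagation delay: fiber vs. microwave.
C_KM_PER_S = 299_792.458             # speed of light in vacuum, km/s
FIBER_KM_PER_S = C_KM_PER_S / 1.468  # assumed refractive index of silica fiber
MILES_TO_KM = 1.609344

def one_way_latency_us(miles: float, speed_km_per_s: float) -> float:
    """One-way straight-line propagation delay in microseconds."""
    return miles * MILES_TO_KM / speed_km_per_s * 1e6

d = 798  # Carteret -> 350 E Cermak (Chicago), miles, from the case study above
print(f"fiber (theoretical):     {one_way_latency_us(d, FIBER_KM_PER_S):.0f} us")
print(f"microwave (theoretical): {one_way_latency_us(d, C_KM_PER_S):.0f} us")
```

The gap between the two speeds (roughly a factor of 1.5) is the physical basis for the microwave advantage, before even accounting for microwave's straighter routing.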
-
References:
- Baxtel. (n.d.). Equinix Secaucus: NY2, NY4, NY5, NY6. Retrieved from https://baxtel.com/data-center/equinix-secaucus-ny2-ny4-ny5-ny6
- Nasdaq. (2020, April 9). Time is Relativity: What Physics Has to Say About Market Infrastructure. Retrieved from https://www.nasdaq.com/articles/time-is-relativity%3A-what-physics-has-to-say-about-market-infrastructure-2020-04-09
- McKay Brothers. (2016, May 13). Quincy: Latency Reductions ‘Nearing Perfection’ on Aurora-NJ Data Network. Retrieved from https://www.mckay-brothers.com/quincy-latency-reductions-nearing-perfection-on-aurora-nj-data-network/
-
CME Globex "GLinks"
The architecture of electronic trading systems like CME Globex is crucial to understand. Below is a brief overview of the "GLink" infrastructure in Aurora, Illinois, designed to facilitate HFT.
-
Overview of GLink Architecture:
GLink architecture network topology
Source: https://cmegroupclientsite.atlassian.net/wiki/spaces/EPICSANDBOX/pages/46115155/GLink+Architecture+-+Aurora
As a spine-and-leaf topology, the GLink architecture is designed to be deterministic and protected against Denial of Service (DoS) attacks. "GLinks" are pairs of fiber-optic links that connect traders to the CME Globex matching engine. There are 24 pairs of customer-facing GLink switches that connect to spine switches, which in turn feed out data at typically 100 Gbps.
-
Physical Layer (L1) Overview:
- Customer Access (GLink switches):
- Arista 7060s, 10 Gbps Ethernet.
- Dual switch connection per customer (A/B pairs).
- Connected to three spines at 100 Gbps (no cross-feeding between network A/B spines).
- Deployment: 24 switch pairs.
- Spine Switches:
- Arista 7060s (multicast) and 7260s (non-multicast), 100 Gbps.
- Handle market data (multicast) and order entry (unicast) traffic.
- Deployment: 4 switches.
- Gateway Access (MSGW/CGW):
- Arista 7060s, 10 Gbps Ethernet.
- Connected to non-multicast spines at 100 Gbps.
- Deployment: MSGW - 4 switches; CGW - 2 switches.
- WAN Distribution:
- 100 Gbps connectivity to all spines, routing market data to network A/B spines.
- Deployment: 2 switches.
-
Data Link Layer (L2) Highlights
- Supports 10 GbE interfaces.
- Customer connections are bandwidth-limited to 1 Gbps via policing.
- No VLANs; all nodes use routable Layer 3 addresses.
- No use of Spanning Tree Protocol (STP).
- Operates in "store-and-forward" mode to manage packet queuing and forwarding.
-
Network Layer (L3) Key Points:
- Routing and Path Behavior:
- Active-standby routing for MSGW servers via non-multicast spines.
- Symmetric return path traffic to maintain session consistency.
- Traffic routing can leverage BGP for specific and summary routes.
- Packet Handling:
- Reordering allowed within spine layer across sessions.
- Final packet ordering ensured at MSGW layer.
- Latency variance between spines is minimal (hundreds of nanoseconds).
- Performance:
- Nominal latency: ~3 microseconds for spine-and-leaf switches.
- Oversubscription rates: 0.48:1 (worst case) to 0.24:1 (best case).
- Use of Arista 7060 with Broadcom Tomahawk (SOC) for shared memory queue monitoring.
-
Policing Overview:
GLink architecture policing overview
Source: https://cmegroupclientsite.atlassian.net/wiki/spaces/EPICSANDBOX/pages/46115155/GLink+Architecture+-+Aurora
- Traffic Ingress Policing:
- Coloring policy of incoming packet rates:
- Green:
- < 750 Mbps: Packets marked "normal" (AF11).
- None of these packets are dropped.
- Yellow:
- Packets marked "discard eligible" (AF12) at 750 Mbps–1 Gbps.
- These packets are possibly dropped between the client's/customer's server and the GW Access switch.
- Red:
- Packets are dropped at > 1 Gbps by the Ingress Policer.
- These packets are dropped between the client's/customer's server and the GLink 10G server.
- Mechanics:
- Two-rate, three-color marker (RFC 2698).
- Uses token-bucket style crediting for metering and policing.
- Thresholds:
- Committed Information Rate (CIR): 750 Mbps (mark AF12).
- Peak Information Rate (PIR): 1 Gbps (drop).
- Burst Sizes: CBS = 500 KB, PBS = 625 KB.
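The two-rate three-color marker can be sketched as a pair of token buckets. This is a simplified, color-blind rendering of RFC 2698 using the GLink thresholds listed above; it is not CME's actual implementation:

```python
# Minimal color-blind sketch of RFC 2698's two-rate three-color marker,
# parameterized with the GLink values above (CIR 750 Mbps, PIR 1 Gbps,
# CBS 500 KB, PBS 625 KB).
class TrTCM:
    def __init__(self, cir_bps, pir_bps, cbs_bytes, pbs_bytes):
        self.cir, self.pir = cir_bps / 8, pir_bps / 8  # refill rates, bytes/s
        self.cbs, self.pbs = cbs_bytes, pbs_bytes      # bucket capacities
        self.tc, self.tp = cbs_bytes, pbs_bytes        # buckets start full
        self.last = 0.0

    def color(self, t, size):
        """Mark a packet of `size` bytes arriving at time `t` (seconds)."""
        dt, self.last = t - self.last, t
        self.tc = min(self.cbs, self.tc + self.cir * dt)  # credit CIR bucket
        self.tp = min(self.pbs, self.tp + self.pir * dt)  # credit PIR bucket
        if self.tp < size:
            return "red"      # exceeds PIR: dropped by the ingress policer
        if self.tc < size:
            self.tp -= size
            return "yellow"   # between CIR and PIR: marked discard eligible (AF12)
        self.tc -= size
        self.tp -= size
        return "green"        # within CIR: marked normal (AF11)

m = TrTCM(750e6, 1e9, 500_000, 625_000)
print(m.color(0.0, 1500))  # a lone MTU-sized packet fits both buckets
```

A sustained burst drains the CIR bucket first (packets go yellow), then the PIR bucket (packets go red and are dropped), matching the green/yellow/red policy in the table above.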
-
Network redundancy with A and B networks:
Exchanges often implement redundant network architectures to ensure reliability and uptime. The GLink architecture is no different. The GLink architecture provides:
-
Simultaneous routing:
At the Network Layer (L3), traffic can be routed over networks A and B simultaneously.
-
Redundancy:
Networks A and B provide alternative paths for data, reducing the risk of a single point of failure. With both networks, arbitration can be performed between them: HFT firms ought to record the data feeds from both networks A and B so that reactions to market movements can be based on whichever network delivers data first, rather than on any single network's feed. The downside of this network redundancy is that it immediately doubles the cost.
Redundancy is crucial for maintaining reliable and efficient network operations. It ensures fault tolerance, allowing one network to maintain connectivity if the other experiences issues. Additionally, redundancy enables load balancing by distributing traffic across both networks to optimize performance.
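A/B feed arbitration can be sketched as follows; the (sequence number, arrival timestamp, payload) tuples are a hypothetical feed format, not an exchange's actual wire format:

```python
# Sketch of A/B feed arbitration: keep whichever copy of each sequence
# number arrives first, regardless of which network delivered it.
def arbitrate(feed_a, feed_b):
    """Merge (seq, arrival_ts, payload) tuples; earliest copy of each seq wins."""
    best = {}
    for seq, ts, payload in list(feed_a) + list(feed_b):
        if seq not in best or ts < best[seq][0]:
            best[seq] = (ts, payload)
    return [(seq,) + best[seq] for seq in sorted(best)]

a = [(1, 100, "msg1"), (2, 210, "msg2")]                  # network A arrivals
b = [(1, 105, "msg1"), (2, 205, "msg2"), (3, 300, "msg3")]  # network B arrivals
print(arbitrate(a, b))
```

Note that arbitration also masks single-network loss: sequence 3 above is recovered even though network A never delivered it.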
-
Network saturation issues:
With 24 switches each operating at 10 Gbps, the aggregate potential throughput is 240 Gbps. However, if the spine only outputs at 100 Gbps, there is a risk of network saturation, leading to packet loss and increased latency. Saturation occurs when the switches collectively try to send more than 100 Gbps of data to the spine: the spine acts as the bottleneck and cannot process or forward traffic at a greater rate. As traffic increases, packets from the switches queue up at the spine, waiting to be forwarded. If the incoming rate consistently exceeds 100 Gbps, the spine's buffers eventually overflow, leading to packet drops. Dropped packets require retransmissions in protocols like TCP, increasing latency and reducing network efficiency.
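The arithmetic above can be made concrete with a trivial sketch. The capacity and load figures come from the text; the linear backlog model is a deliberate simplification that ignores buffer sizes and burst shape:

```python
# Simple illustration of spine saturation: sustained offered load beyond
# the 100 Gbps uplink accumulates as queue backlog until buffers overflow.
def spine_backlog_gbits(offered_gbps, capacity_gbps=100, seconds=1.0):
    """Gigabits queued after `seconds` of sustained overload (0 if none)."""
    return max(0.0, (offered_gbps - capacity_gbps) * seconds)

print(spine_backlog_gbits(240))  # all 24 leaves bursting 10 Gbps at once
print(spine_backlog_gbits(80))   # under capacity: no backlog
```

In practice real spine buffers hold far less than a second of overload, so even brief synchronized bursts from the leaves are enough to trigger drops.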
-
WSJ articles on HFT data sources:
The Wall Street Journal has reported on HFT firms exploiting tiny differences in data transmission times. In the past, a lawsuit filed against CME Group alleged that CME offered a 'secret, private, and special data feed' that brought unfair profits to those with access. However, these private data feeds are simply feeds offered only to customers actually trading, containing proprietary details about the customer's own trading activity: prior trades, fill prices, execution times, outstanding orders in the market, and details on order types. By analyzing the private data feeds first, traders can gain a competitive advantage of a microsecond or less. In practice, most HFT traders rely on a combination of public and private data feeds.
-
Passive network Traffic Analysis Points/Test Access Points (TAPs):
One of the most popular ways to monitor network traffic is to use a Network Traffic Analysis Point/Test Access Point (TAP). Network TAPs are external hardware devices placed between two network devices, usually a router and a switch; a TAP copies all network traffic passing through the data link (e.g. Ethernet) in real time and sends the copy to a monitoring and/or analysis tool, such as an Intrusion Detection System (IDS), a network analyzer (e.g. Wireshark), or a packet sniffer (e.g. tcpdump, Wireshark). There are two types of network TAPs, but the focus here is on passive network TAPs because of their benefits relative to their maintenance costs. Some key benefits of passive TAPs include:
-
Powerless relay:
A passive network TAP does not require power to operate, so if a device in the network loses power, traffic can still flow between the network ports and reach the passive TAP. Fiber optic passive TAPs require no power at all to start or operate, and although copper passive TAPs do require power when in use, there is still no physical separation between the network ports, so even copper TAPs remain operational in the event of a power outage. Passive TAPs thus ensure uninterrupted traffic flow because they rely on purely optical or hardware mechanisms.
-
Comprehensive network visibility:
Since network TAPs capture all network packets, including errors and malformed packets, passive TAPs provide visibility of network activity that enhances a packet capture system on a network. When multiple passive TAPs are deployed across a network, they can also capture incoming and outgoing traffic separately and simultaneously, enhancing the comprehensiveness of captured packets.
-
Non-intrusive monitoring:
As the name implies, passive TAPs passively duplicate network traffic data for monitoring without interfering with network communication or altering any network data; passive TAPs collect the raw network traffic data.
-
Hardware-based:
A passive TAP has no IP address and isn't addressable on a network, making it more secure against remote attacks.
-
Unidirectional data flow:
TAPs are often designed for one-way packet capture of traffic flows, from the network TAP to the monitoring device, to prevent accidental injection of traffic back into the network.
In HFT exchange architectures, passive TAPs are a common way to ensure the timestamps of captured packets are extremely accurate. When implementing passive TAPs in an HFT exchange architecture, they are strategically deployed where highly accurate monitoring and timestamping are needed. Passive TAPs are typically placed between GNSS receivers and the network timing systems to verify precise time synchronization across all components. They can be installed between GWs, TPs, and OMEs to capture and analyze market data feeds and order execution processes in real time. Additionally, passive TAPs are used at the client interface points connecting the exchange to clients'/customers' servers to monitor trades and orders without impacting performance.
Cost of Passive TAPs:
-
Fiber optic passive TAP hardware can range from:
-
From LightOptics:
-
From L.com:
-
From FS.com:
- $149 → FHD® Fiber TAP Cassette, OS2 Single Mode, 8 x LC Duplex Live Ports, 4 x LC Duplex TAP Ports, 50/50 Split Ratio (Live/TAP), 1/10/40/100G
- $499 → FHD® Fiber TAP Cassette, OS2 Single Mode, 12 x LC Duplex Live Ports, 2 x MTP®-12 Male Live Ports, 2 x MTP®-12 Male TAP Ports, 70/30 Split Ratio (Live/TAP), 10/40/100G
-
Ethernet, or copper, passive TAP hardware can range from:
-
From Dualcomm:
-
DIY:
- There's an article from Instructables that provides a 7-step guide to making your own passive network TAP. This article should only be used as a high-level hobbyist's demonstration of how passive network TAPs are constructed.
-
Insights into trading system dynamics: Eurex:
Eurex, one of the world's leading derivatives exchanges, published a September 2024 report on its trading system, Deutsche Börse's T7®, which employs several methods to optimize network latency in its exchange architecture.
T7® Latency Composition diagram
Source: https://www.eurex.com/resource/blob/48918/9d4d29a403418f093c584b48c43990a7/data/presentation_insights-into-trading-system-dynamics_en.pdf
-
Infrastructure design:
-
Co-location and network design:
The T7® platform offers 10 Gbps co-location connections with normalized cable lengths to minimize latency deviations. Similar to CME's GLink architecture, T7® has two redundant and independent order entry network halves, A and B, to ensure deterministic paths for data transmission.
-
Switching layers:
Introduction of a mid-layer switch, the Cisco 3550T, enhances distribution of market data to the Access Layer switches, reducing internal latency variances across switch ports.
-
Precision time synchronization:
The T7® system has built-in White Rabbit support, enabling timestamp accuracy below 1 nanosecond. The timestamps are provided by the High Precision Timestamp file service to allow market participants to measure their latencies precisely, specifically at data path-points t_3a, t_3d, and t_9d for each request leading to an EOBI market data update.
-
Data dissemination:
-
Enhanced Order Book Interface (EOBI):
The T7® system provides real-time, granular order book updates with the lowest latency. EOBI data is disseminated directly from the Matching Engine, ensuring fast availability.
-
Speculative triggering mitigation:
Recalling Differentiated Services Code Point (DSCP) flags from section 2. IP Packets (Network Layer - Layer 3), techniques like DSCP flags in market data packets and Discard IP ranges help prevent unnecessary packet processing by marking potential speculative triggers early in the IP header of a market data packet. DSCP flags indicate execution summaries and/or widening or narrowing of the bid/ask spread from market orders (not quotes). Note that a packet's response may be modified in-flight, after it is read.
-
Latency monitoring and transparency:
With each response, the T7® system provides participants up to six timestamps in real time, along with key timestamps on every market data update. These real-time timestamps provide performance insights.
-
Optimization of data processing:
Consolidating processes such as the Enhanced Market Data Interface (EMDI) into the Matching Engine reduces complexity by reducing the number of failover scenarios and enabling faster, more deterministic distribution of data. Use of switches' 'cut-through' mode and FPGA-based solutions minimizes packet processing delays.
-
Hardware upgrades:
Regular refreshes of infrastructure, such as replacing switches and packet capture devices with lower-latency alternatives, ensures the system remains state-of-the-art.
-
References:
- CME Group. (n.d.). GLink Architecture – Aurora. Retrieved from https://cmegroupclientsite.atlassian.net/wiki/spaces/EPICSANDBOX/pages/46115155/GLink+Architecture+-+Aurora.
- Mackenzie, M. (2013). High-Speed Traders Exploit Loophole. The Wall Street Journal.
- Osipovich, Alexander. (2018, February 18). High-Speed Traders Profit From Return of Loophole at CME.
- Profitap. (2018, October 10). The Difference Between Passive and Active Network TAPs. Retrieved from https://insights.profitap.com/passive-vs-active-network-taps
- LightOptics. (n.d.). Active Tap vs Passive Tap. Retrieved from https://www.lightoptics.co.uk/blogs/news/active-tap-vs-passive-tap
- Eurex. (2016). Insights into Trading System Dynamics. Retrieved from https://www.eurex.com/resource/blob/48918/4f5fd0386f272f89219bd66c0a546d09/data/presentation_insights-into-trading-system-dynamics_en.pdf
The Markets in Financial Instruments Directive II (MiFID II) is a European regulation that has had and still has significant implications for HFT firms. Its enforcement began on January 3rd, 2018 $^1$. Below is a summary of its most important regulatory requirements.
-
Overview of MiFID II Regulations:
-
Scope:
MiFID II aims to increase market transparency, better protect investors, reinforce confidence in markets, address unregulated areas, and reduce systemic market risk, especially in financial and commodity derivatives markets and in over-the-counter (OTC) markets. It pursues these aims by monitoring orders and detecting instances of market abuse and manipulation, enforcing strict latency requirements on clock synchronization systems $^1$.
On the level of regulatory comprehensiveness that is deemed necessary, MiFID II explicitly states,
It is necessary to establish a comprehensive regulatory regime governing the execution of transactions in financial instruments irrespective of the trading methods used to conclude those transactions so as to ensure a high quality of execution of investor transactions and to uphold the integrity and overall efficiency of the financial system. A coherent and risk-sensitive framework for regulating the main types of order-execution arrangement currently active in the European financial marketplace should be provided for. It is necessary to recognise the emergence of a new generation of organised trading systems alongside regulated markets which should be subjected to obligations designed to preserve the efficient and orderly functioning of financial markets and to ensure that such organised trading systems do not benefit from regulatory loopholes $^2$.
Thus, MiFID II emphasizes the regulation of HFT systems to ensure that they "do not benefit from regulatory loopholes".
-
Timestamping requirements:
-
Operators of trading venues are required to synchronize their clocks to Coordinated Universal Time (UTC) with "GW-to-GW latency time of the trading system" at <= 1 millisecond, with a maximum divergence from UTC of 100 µs, and a timestamp granularity of 1 µs or better $^1$.
-
Members or participants of a trading venue that engage in HFT algorithmic trading techniques must abide by a maximum divergence from UTC of 100 µs and a timestamp granularity of 1 µs or better $^1$.
-
Impact on HFT firms:
-
Hardware and software upgrades:
HFT firms had to invest in advanced hardware and software capable of meeting the high precision and synchronization standards, including deploying atomic clocks, GNSS receivers, and advanced time-sync protocols like PTP and PTM.
-
Infrastructure overhaul:
Network infrastructure needed to be overhauled to handle the low-latency and high-precision demands, including optimizing network configurations and employing advanced data centers.
-
Compliance and monitoring:
- Continuous monitoring and compliance reporting were added to HFT systems to establish robust audit trails that adhere to the regulatory standards. Technologies like passive TAPs, RAID storage, and high-performance NICs help keep such compliance and monitoring tools efficient.
- For clock synchronization, if GNSS is used, the European Securities and Markets Authority (ESMA) states that risks of atmospheric interference, intentional jamming, and spoofing must be mitigated, including accounting for how long the HFT system can remain in compliance during such attacks. Thus, continuous monitoring, alerting, and reporting of clock health, status, and performance metrics is important to ensure compliance.
-
References:
- Official Journal of the European Union. (2016, June 6). Commission Delegated Regulation (EU) 2017/574. Retrieved from https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32017R0574
- European Securities and Markets Authority. (2024, November 8). MiFID II - Recital. Retrieved from https://www.esma.europa.eu/publications-and-data/interactive-single-rulebook/mifid-ii/recital
- Safran. (n.d.). MIFID II Clock Sync Requirements . Retrieved from https://safran-navigation-timing.com/mifid-ii-clock-sync-requirements/
Financial trading firms are prime targets for cyberattacks due to the vast amounts of capital they manage and the critical infrastructure they rely on. Consequently, network and infrastructure security is paramount to ensure the capital HFT firms manage is not put at risk of loss or market manipulation. Described below are two incidents that should guide an HFT firm's cybersecurity planning and operations.
-
Russia's "Digital Bomb" on the NASDAQ:
One of the most notable financial cybersecurity incidents was the 2010 breach of NASDAQ's computer systems by Russian hackers. Although it never detonated, the attackers installed what investigators described as a "cybergrenade" capable of causing significant disruption to the U.S. economy. The FBI's network traffic monitoring system was alerted to the custom-made malware in October 2010; the malware had the potential to spy, steal data, and cause digital destruction. The case was still ongoing in 2014, four years after the initial investigation, and eventually identified the Russian government's direct involvement. One attacker, Aleksandr Kalinin of St. Petersburg, Russia, had been caught by the U.S. Secret Service and FBI relentlessly attacking NASDAQ computers in the three years prior to the breach, between 2007 and 2010.
-
Lessons learned:
Even though the "digital bomb" was never activated, this cyber-incident made clear that hackers can terrorize financial markets by potentially halting trades for a day or tanking financial markets.
-
Continuous network monitoring:
This cyber-incident underscores the importance of continuous network monitoring and extremely accurate, synchronized packet capture systems, which help digital forensics personnel and network engineers detect unusual network traffic patterns indicative of cyber intrusions, allowing countermeasures to be taken as quickly as possible.
-
U.S. Air Force drone breached through home router vulnerability:
Another serious cyber-incident was first identified on June 1, 2018. It involved criminal activity on the deep and dark web, identified by Recorded Future's Insikt Group. Sensitive data from a U.S. Air Force (USAF) MQ-9 Reaper drone, a kind of unmanned aerial vehicle (UAV), was compromised due to a vulnerability in Netgear routers, and attempts were made to sell sensitive USAF documents on a dark web hacking forum. The Netgear router vulnerability had first been identified in early 2016; a hacker exploited it through improperly configured File Transfer Protocol (FTP) login credentials, gaining access to a USAF captain's computer. The stolen documents included maintenance materials and a list of USAF personnel with privileged access to the UAV's systems.
Even though the documents were not classified information, their exposure provided adversaries with entry points for malicious activity, opening up attack vectors on the drone's technical capabilities. Consequently, there are various lessons that must be learned from this national cybersecurity incident that put our country's national defense at a great risk:
-
Lessons learned:
-
Regularly update and patch systems:
Ensure all system software and hardware components are up to date with the latest patches to mitigate known vulnerabilities. Also monitor announcements of new vulnerabilities, from government organizations such as the Cybersecurity & Infrastructure Security Agency's (CISA's) Known Exploited Vulnerabilities Catalog and NIST's National Vulnerability Database (NVD), and from third-party cybersecurity organizations such as MITRE's ATT&CK knowledge base and the OWASP Foundation's Top 10 and other OWASP security projects, for the software and hardware components currently in use.
-
Secure configuration of network devices:
Ideally through properly designed security protocols, securely and properly configure devices, especially those with remote access or Internet capabilities, to prevent unauthorized entry points.
-
Implement strong authentication and access control mechanisms:
Use robust authentication and access control mechanisms, such as multi-factor authentication, Access Control Lists (ACLs) with models like Mandatory Access Control (MAC) and Role-Based Access Control (RBAC), single sign-on (SSO), and other logical and physical controls such as mantraps.
-
Conduct continuous network monitoring:
Utilize high-rate packet capturing and network monitoring tools to detect and respond to suspicious activities promptly.
-
Educate personnel on cybersecurity best practices:
Provide regular training of relevant personnel to identify and mitigate potential security threats. Cybersecurity training ought to include educating on the importance of secure and cautious network configurations and how to recognize phishing attempts.
-
Significance of network security and high-rate packet capturing:
-
Network security:
The exploitation of a known network vulnerability underscores the necessity for organizations to regularly update and patch systems, configure devices securely with controlled and properly designed security protocols, and monitor network and hardware systems and their firmware and software for vulnerabilities to prevent unauthorized access.
-
High-rate and real-time packet capturing:
Implementing a high-rate packet capture system allows organizations to monitor and analyze network traffic at the greatest speeds to detect unusual network activity, such as unauthorized data exfiltration. Real-time packet capture is the ideal rate of capture, enabling near real-time responses to potential security breaches or network intrusions.
-
Backtesting aids the development and refinement of trading algorithms: trading strategies are simulated on historical data to evaluate their effectiveness before they are deployed to live market conditions.
Capturing network packets with extremely accurate timestamps allows HFT firms to:
-
Reconstruct historical market conditions:
By recording market data feeds with extremely accurate timestamps, firms can replay past events with the utmost precision and test how their algorithms would have performed, confident that market conditions are mirrored, with respect to time, as closely as possible.
-
Optimize algorithms:
With a high-visibility packet capturing system that captures packets with nanosecond-level timestamps, HFT firms can analyze the response times and decision-making processes of trading systems to optimize for bottlenecks in network latency and network resources.
-
Ensure compliance:
Captured packets with nanosecond-level timestamps aid compliance by ensuring records of orders, trades, and data feeds can be reported with great confidence in their reliability; reporting with nanosecond-level data helps demonstrate adherence to regulatory requirements governing trading practices.
-
Co-location:
As discussed in the previous section, Backtesting with historical data, backtesting is best performed when simulating against the firm's co-located data centers. Location-dependent backtesting simulations ensure that backtests accurately reflect the sequence and timing of market events as they would have been observed in real-time trading, as if trading against a real co-located data center, or, ideally, against multiple co-located data centers.
The granularity and accuracy of the captured packets directly impact the reliability of backtesting results. Therefore, network packet capture systems must be capable of handling high data volumes without loss, ideally at a nanosecond-level accuracy and below. Femtosecond-level accuracy in a packet capture system may be achieved using photonic time-sync methods, which can be referenced in the appropriate section in this report, titled 4. Photonic Time Sync.
Real-time monitoring of network performance is crucial since any delay, packet loss, or network anomaly can result in missed trading opportunities or worse, financial losses.
Key aspects of real-time monitoring include:
-
Dropped packets:
Monitoring for packet loss to ensure data integrity. Packet loss can lead to incomplete market data. The risk of packet loss grows with the rate of packet capture: as the capture rate increases, so does the risk of dropping packets.
-
Out-of-order packets:
Detecting packets arriving out of sequence, which can disrupt the processing of market data feeds. If a market data feed were to emit out-of-order packets, price information would be inaccurate when observed by market participants, leading to malformed orders and trades and, inevitably, financial losses.
-
Latency spikes:
Identifying sudden increases in network latency that can delay trade execution. Latency spikes are usually referred to as bursts, or micro-bursts, where the rate of incoming packets exceeds the available throughput. Micro-bursts cause packets to form a queue, which results in higher latencies as each incoming packet must wait for the packets ahead of it to be processed from the queue.
Capturing and analyzing network packets in real-time lets firms quickly identify and address issues, maintaining optimal network performance.
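The queueing effect of a micro-burst can be illustrated with a toy model; the inter-arrival gap and service time below are illustrative numbers, not measurements:

```python
# Toy model of a micro-burst: packets arrive every 50 ns, but the
# system needs 80 ns to process each one, so a queue builds and
# per-packet delay grows over the burst.
ARRIVAL_GAP_NS = 50   # inter-arrival time during the burst (assumed)
SERVICE_NS = 80       # per-packet processing time (assumed)

finish = 0
delays = []
for i in range(10):
    arrival = i * ARRIVAL_GAP_NS
    start = max(arrival, finish)      # wait for packets ahead in the queue
    finish = start + SERVICE_NS
    delays.append(finish - arrival)   # total latency seen by this packet

print(delays)  # [80, 110, 140, ...] -- each packet waits 30 ns longer
```

Because service time exceeds the arrival gap, each successive packet in the burst observes a latency 30 ns higher than the previous one, which is exactly the latency-spike signature described above.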
Latency can be defined as the delay between a market event and the corresponding action by a trading system. It is a critical factor in HFT.
Capturing network packets with ULL enables HFT firms to:
- Measure "Tick-to-Trade" times: Calculate the time from receiving a market data "tick" to executing a trade.
- Benchmark performance: Compare system performance against industry standards or competitors.
- Identify bottlenecks: Pinpoint areas in the network or trading system that cause delays.
Continuously benchmarking performance improves the ability for HFT firms to optimize their infrastructure to achieve the lowest possible latency.
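As a sketch of the "Tick-to-Trade" measurement above, suppose captured nanosecond timestamps for market-data ticks and the orders they triggered have already been matched by an identifier; the IDs and values here are hypothetical:

```python
# Hypothetical capture data: tick id -> nanosecond timestamp at which
# the market-data packet and the resulting order packet were captured.
tick_ns = {101: 34_200_000_001_250, 102: 34_200_000_004_900}
order_ns = {101: 34_200_000_002_750, 102: 34_200_000_006_100}

# Tick-to-trade latency is the per-pair difference of capture timestamps.
latencies = {i: order_ns[i] - tick_ns[i] for i in tick_ns}
worst = max(latencies.values())

print(latencies)                          # {101: 1500, 102: 1200}
print(f"worst tick-to-trade: {worst} ns")
```

The accuracy of these differences is bounded by the accuracy of the capture timestamps themselves, which is why nanosecond-level timestamping matters for benchmarking.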
Several companies specialize in providing network packet capture solutions tailored to the needs of ULL and HFT firms.
-
Solarflare:
Solarflare, acquired by Xilinx in April 2019 (Xilinx was then acquired by AMD on February 14, 2022), developed high-performance network interface cards (NICs) and software for low-latency networking. Their solutions focus on reducing latency and jitter, making them suitable for HFT applications.
-
AMD:
AMD offers various packet capture solutions for ULL and HFT uses, such as FPGAs, Ethernet adapters, ULL accelerator cards, and more. Additionally, since acquiring Xilinx, AMD sells Ethernet adapters under the Xilinx brand.
-
Corvil (acquired by Pico):
Corvil, acquired by Pico in July 2019, offered analytics solutions that provided real-time visibility into network performance and trading activities. Now part of Pico, the Pico platform captures and analyzes network data to help firms optimize financial markets infrastructure and ensure compliant operations.
-
Napatech:
Napatech specializes in FPGA-based SmartNICs designed for ULL packet capture and packet processing. Additionally, Napatech offers software, like Link-Capture, to enhance the performance of their SmartNICs, further improving the throughput and accuracy of packet capture, enabling real-time analysis of network traffic with nanosecond-level latency, even supporting the DPDK packet processing software library.
-
Mellanox:
Mellanox Technologies, acquired by NVIDIA in 2020, provided high-speed networking solutions, with the most notable being the high-performance and proprietary InfiniBand data link technology. In addition to InfiniBand, Mellanox offered NICs, switches, and other technologies that supported ULL networking.
-
NVIDIA:
Since acquiring Mellanox, NVIDIA offers InfiniBand networking solutions, in addition to various other networking solutions that can improve the performance of packet capture systems. Such technologies include accelerated Ethernet switches; InfiniBand switches, adapters, Data Processing Units (DPUs), routers and gateways, cables and receivers; network accelerators, NICs, Ethernet adapters, GPU-accelerated compute, and more.
These companies contribute significantly to the HFT industry by providing the hardware and software necessary for highly efficient network packet capture and analysis.
References:
- Pagliery, J. (2014, July 17). Russian hackers placed 'digital bomb' in Nasdaq - report. CNN Money. Retrieved from https://money.cnn.com/2014/07/17/technology/security/nasdaq-hack/index.html
- Barysevich, Andrei. (2018, July 10). Military Reaper Drone Documents Leaked on the Dark Web. Recorded Future. Retrieved from https://www.recordedfuture.com/blog/reaper-drone-documents-leaked
- AMD. (2022, February 14). AMD Completes Acquisition of Xilinx. Retrieved from https://www.amd.com/en/newsroom/press-releases/2022-2-14-amd-completes-acquisition-of-xilinx.html
- Pico. (2019, July 9). Pico to Acquire Corvil, Creating the New Benchmark for Technology Services in Financial Markets. Retrieved from https://www.pico.net/press-release/pico-to-acquire-corvil-creating-the-new-benchmark-for-technology-services-in-financial-markets/
- Napatech. (n.d.). About Us. Retrieved from https://www.napatech.com/about/
- NVIDIA. (n.d.). End-to-End Networking Solutions. Retrieved from https://www.nvidia.com/en-us/networking/
- Eurex. (2016). Insights into Trading System Dynamics. Retrieved from https://www.eurex.com/resource/blob/48918/4f5fd0386f272f89219bd66c0a546d09/data/presentation_insights-into-trading-system-dynamics_en.pdf
- Sims, Tara. Xilinx. (2019, April 24). Xilinx to Acquire Solarflare. Retrieved from https://www.prnewswire.com/news-releases/xilinx-to-acquire-solarflare-300837025.html
- AMD. (n.d.). Ethernet Adapters. Retrieved from https://www.xilinx.com/products/boards-and-kits/ethernet-adapters.html
- NVIDIA. (n.d.). Ethernet Network Adapters - ConnectX NICs. Retrieved from https://www.nvidia.com/en-us/networking/ethernet-adapters/
- NVIDIA. (n.d.). Intelligent Trading with GPU-Accelerated Computing. Retrieved from https://www.nvidia.com/en-us/industries/finance/ai-trading-brief/
- Napatech. (n.d.). Link Capture Software. Retrieved from https://www.napatech.com/products/link-capture-software/
-
Market data
Market data includes real-time information on prices, volumes, and orders from exchanges.
-
UDP Multicast:
Market data is commonly disseminated using UDP multicast. UDP (User Datagram Protocol) allows for ULL transmission, and multicast enables efficient distribution to multiple recipients simultaneously. A multicast setup is ideal for broadcasting market data to numerous subscribers on the network, such as when multiple market participants are consuming the same data feed.
-
Examples from exchanges:
-
NASDAQ ITCH:
The NASDAQ TotalView ITCH product uses a binary data format, the ITCH format, designed to optimize speed at the cost of flexibility. The ITCH format's efficiency results directly from its choice of a fixed-length offset structure. Described below is an overview of the ITCH format:
-
Key features of TotalView ITCH's binary data format:
-
Fixed-length offsets in message formats:
-
Each message in the ITCH feed adheres to a predefined format with fixed offsets for each field. For example, the Message Type field is always located at offset 0.
-
This fixed-length structure enables extremely fast parsing, as a system can directly access any field within a message using its known offset without additional calculations or lookups.
-
Subsequent fields follow specific positions in a strict sequence:
NASDAQ ITCH - System Event Message
Source: https://www.nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/NQTVITCHSpecification.pdf
NASDAQ ITCH - System Event Codes - Daily
Source: https://www.nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/NQTVITCHSpecification.pdf
- In addition to System Event Messages and daily System Event Codes, the TotalView ITCH feed provides many other ITCH message formats to describe orders added to, removed from, and executed on NASDAQ as well as message formats to disseminate Cross and Stock Directory information. Those message formats include:
-
System Event Messages
-
Stock Related Messages:
-
Stock Directory
NASDAQ ITCH - Stock Directory
Source: https://www.nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/NQTVITCHSpecification.pdf
-
Stock Trading Action
NASDAQ ITCH - Stock Trading Action
Source: https://www.nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/NQTVITCHSpecification.pdf
-
Reg SHO Short Sale Price Test Restricted Indicator
-
Market Participant Position
NASDAQ ITCH - Market Participant Position
Source: https://www.nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/NQTVITCHSpecification.pdf
-
Market-Wide Circuit Breaker (MWCB) Messaging
-
Quoting Period Update
-
Limit Up – Limit Down (LULD) Auction Collar
-
Operational Halt
-
-
Add Order Message (where MPID = Market Participant ID)
-
Add Order – No MPID Attribution
-
Add Order with MPID Attribution
NASDAQ ITCH - Add Order - MPID Attribution Message
Source: https://www.nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/NQTVITCHSpecification.pdf
-
-
Modify Order Messages
-
Order Executed Message
NASDAQ ITCH - Order Executed Message
Source: https://www.nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/NQTVITCHSpecification.pdf
-
Order Executed With Price Message
-
Order Cancel Message
-
Order Delete Message
-
Order Replace Message
-
-
Trade Messages
-
Trade Message (Non-Cross)
NASDAQ ITCH - Trade Message (Non-Cross)
Source: https://www.nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/NQTVITCHSpecification.pdf
-
Cross Trade Message
-
Broken Trade / Order Execution Message
-
-
Net Order Imbalance Indicator (NOII) Message
-
Direct Listing with Capital Raise Price Discovery Message
-
-
-
Big-endian binary encoding:
- All numeric fields are encoded in big-endian (network byte order) format, which ensures compatibility across systems and allows for rapid processing.
-
Field types and sizes:
- Integer fields, such as timestamps and stock locate codes, have fixed sizes (e.g., 2 bytes, 4 bytes, 8 bytes), while alphanumeric fields like stock symbols are padded to their maximum lengths.
- The absence of variable-length fields simplifies memory allocation and reduces parsing overhead.
-
Dynamic yet daily-static stock locate codes:
- Instruments are identified by dynamically assigned stock locate codes, which act as low-integer indices. These codes are recalibrated daily but remain static during the trading session, allowing for efficient mapping of securities without ambiguity.
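Because locate codes are small integers that stay fixed for the trading session, a flat array lookup suffices for code-to-symbol mapping; the codes, symbols, and table size below are made up for illustration:

```python
# Stock locate codes act as low-integer indices, so a plain list
# gives O(1) code-to-symbol mapping with no hashing.
symbols = [None] * 10_000   # index = stock locate code (hypothetical table size)
symbols[1] = "AAPL"         # entries would be filled from Stock Directory messages
symbols[2] = "MSFT"

def symbol_for(locate_code: int) -> str:
    return symbols[locate_code]

print(symbol_for(2))  # MSFT
```

Rebuilding this table at the start of each session mirrors the daily recalibration of locate codes described above.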
-
Message granularity and atomicity:
- ITCH provides granular messages for all market events, such as order additions, modifications, cancellations, and executions (see the list of message formats shown above under the "Fixed-length offsets in message formats" section). Each message encapsulates a specific update, ensuring that processing systems can handle events incrementally and efficiently.
-
-
Sacrifices in flexibility:
The simplicity of ITCH's binary format, while advantageous for speed, comes at the cost of flexibility:
-
Rigid message structure:
- The fixed-length and predefined offsets mean that any changes or additions to message formats require a new version of the specification and potentially significant updates to consumer systems.
-
Limited extensibility:
- Adding new fields or modifying existing ones is non-trivial. For example, introducing a new data element might necessitate redefining offsets and recalibrating parsers across all client systems.
-
Complex version management:
- Backward compatibility is limited. New message types or field definitions can only be introduced with care to avoid disrupting existing subscribers who might not yet support updated formats.
-
-
Speed optimizations through fixed-length parsing:
-
Reduced overhead:
- The absence of delimiters, variable-length encodings, or textual representations minimizes computational overhead.
-
Predictable processing:
- The fixed-length format ensures consistent processing times for each message type, aiding in the predictability and scalability of systems consuming the data feed.
-
-
Examples of fixed-length message parsing:
From the specification:
-
System event message (Length: 12 bytes)
-
Fields:
- Message Type (1 byte, offset 0)
- Stock Locate (2 bytes, offset 1)
- Tracking Number (2 bytes, offset 3)
- Timestamp (6 bytes, offset 5)
- Event Code (1 byte, offset 11)
-
Parsing this message involves directly reading bytes from known offsets, e.g., the Event Code can be accessed with message[11].
-
-
Add order message (No MPID Attribution, Length: 36 bytes)
-
Fields:
- Message Type (1 byte, offset 0)
- Stock Locate (2 bytes, offset 1)
- Tracking Number (2 bytes, offset 3)
- Timestamp (6 bytes, offset 5)
- Order Reference Number (8 bytes, offset 11)
- Buy/Sell Indicator (1 byte, offset 19)
- Shares (4 bytes, offset 20)
- Stock Symbol (8 bytes, offset 24)
- Price (4 bytes, offset 32)
-
This structure allows parsers to decode orders with minimal computational effort.
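The fixed offsets translate directly into code. Below is a minimal Python sketch of an Add Order (no MPID Attribution) parser following the field layout listed above; the four-implied-decimal price interpretation follows the ITCH specification:

```python
import struct

def parse_add_order(message: bytes) -> dict:
    """Parse a 36-byte ITCH 5.0 Add Order (no MPID Attribution) message.

    Every field sits at a fixed offset in big-endian byte order, so
    parsing is a handful of direct reads with no scanning.
    """
    assert len(message) == 36 and message[0:1] == b"A"
    stock_locate, tracking = struct.unpack_from(">HH", message, 1)
    timestamp_ns = int.from_bytes(message[5:11], "big")  # 6-byte timestamp
    order_ref, side, shares, symbol, price = struct.unpack_from(">QcI8sI", message, 11)
    return {
        "stock_locate": stock_locate,
        "tracking_number": tracking,
        "timestamp_ns": timestamp_ns,
        "order_reference": order_ref,
        "side": side.decode(),               # 'B' or 'S'
        "shares": shares,
        "symbol": symbol.decode().rstrip(),  # space-padded alphanumeric
        "price": price / 10_000,             # four implied decimal places
    }
```

Note there is no loop or search anywhere: every read lands on a known offset, which is what makes this style of parsing so amenable to both software and FPGA implementations.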
-
-
-
Use in high-performance scenarios:
-
Real-time processing:
- ITCH's format is optimized for low-latency applications. Since timestamps are represented in nanoseconds, the format lets trading systems measure and react to market events at nanosecond granularity.
-
Scalability:
- The efficiency of parsing ITCH messages makes it feasible to process millions of messages per second, a requirement for modern electronic markets.
-
Compatibility with hardware acceleration:
- The format's simplicity makes it suitable for hardware-based processing, such as FPGA implementations, where fixed-length parsing translates directly into hardware logic. NASDAQ offers a TotalView ITCH FPGA feed, but it is only available through the MoldUDP64 protocol, one of three higher-level data feed protocol options provided for the TotalView ITCH data feed.
-
NASDAQ's TotalView ITCH product demonstrates a deliberate trade-off: sacrificing flexibility in its data format to achieve unparalleled speed and simplicity in parsing. Adhering to a fixed-length, offset-driven structure enables ITCH to support real-time market data processing, essential for high-frequency trading and other latency-sensitive applications. While this approach imposes constraints on extensibility and adaptability, its benefits in performance and predictability make it a widely recognized protocol of modern financial markets.
-
-
IEX DEEP:
Launched on May 15, 2017, the Investors Exchange (IEX) offers the Depth of Book and Last Sale (DEEP) feed, which provides detailed insights into aggregated order book data and trade executions. Similar to ITCH, it uses UDP multicast for efficient dissemination. Described below is an overview of the DEEP feed:
-
Understanding IEX's DEEP feed
DEEP delivers real-time aggregated size information for all displayed orders resting on the IEX order book at each price level. DEEP also provides last sale information for executions on IEX. Notably, DEEP does not disclose the number or size of individual orders at any price level and excludes non-displayed orders or non-displayed portions of reserve orders.
The DEEP feed includes several key components:
-
Aggregated order book data:
Offers the total size of resting displayed orders at each price point, segmented by buy and sell sides.
-
Last sale information:
Details the price and size of the most recent trades executed on IEX.
-
Administrative messages:
Provides updates on trading status, short sale restrictions, operational halts, and security events.
-
Auction information:
For IEX-listed securities, DEEP supplies data on current price, size, imbalance information, auction collars, and other relevant details about upcoming auctions.
-
-
DEEP's relevance to HFT
HFT strategies depend on the swift processing of vast amounts of market data to identify and capitalize on fleeting trading opportunities. DEEP's comprehensive and timely data enables HFT firms to:
-
Monitor market depth:
Accessing aggregated order sizes at various price levels gives traders fine-grained visibility into market liquidity and potential price movements.
-
Track trade executions:
Real-time last sale information allows firms to observe recent trading activity, aiding price discovery and strategy adjustments.
-
Respond to market events:
Administrative messages inform traders of changes in trading status, halts, or other events, enabling prompt strategy modifications.
-
-
Network packet structure and delivery
Similar to TotalView ITCH, DEEP is a multicast feed with support for recovering dropped packets/messages: TCP or UDP unicast options are available for data retransmission via the IEX Transport Protocol (IEX-TP). This connectivity design ensures efficient and reliable data transmission.
Each DEEP message is variable in length and includes a Message Length field for framing. The messages are encapsulated within IEX-TP, which handles sequencing and delivery guarantees. This structure allows HFT systems to process and interpret the data efficiently.
-
Implementation considerations for HFT firms
To effectively utilize DEEP, HFT firms should:
-
Establish direct connectivity:
Connecting directly to IEX's data centers can minimize latency, providing a competitive edge.
-
Develop efficient parsers:
Implementing parsers that can quickly decode DEEP's variable-length messages is essential for timely data processing. A few open-source parsers can be found online; however, developing one from scratch is ideal to ensure market data from IEX is parsed into the exact format a system expects for backtesting.
-
Handle data retransmissions:
Incorporating mechanisms to request and process retransmissions, e.g. to handle dropped packets, ensures data completeness and accuracy.
-
Monitor administrative messages:
Staying informed about trading status changes and other events allows for rapid and well-informed strategy adjustments.
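Parsing DEEP's variable-length messages largely reduces to walking length-prefixed blocks. The sketch below assumes IEX-TP's convention of a little-endian 2-byte message length preceding each message within a segment payload:

```python
import struct

def iter_messages(payload: bytes):
    """Yield each message from a payload of length-prefixed blocks.

    Assumes every block starts with a little-endian uint16 Message
    Length followed by that many bytes of message data.
    """
    offset = 0
    while offset + 2 <= len(payload):
        (length,) = struct.unpack_from("<H", payload, offset)
        offset += 2
        if offset + length > len(payload):
            break  # truncated payload; a real system would request retransmission
        yield payload[offset:offset + length]
        offset += length
```

For example, a payload of b"\x03\x00abc\x02\x00xy" yields b"abc" and then b"xy"; the sequencing and gap detection that trigger retransmission requests live in the IEX-TP layer around this loop.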
-
-
-
-
References:
- NASDAQ. (2013, Aug 2). Nasdaq TotalView-ITCH 5.0. Retrieved from http://nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/ITCHspecification.pdf
- IEX Exchange. (2017, April 19). Introducing DEEP, the IEX Depth of Book and Last Sale Feed. Retrieved from https://iextrading.com/trading/alerts/2017/011/
- IEX Exchange. (n.d.). Depth of Book and Last Sale (DEEP) Feed. Retrieved from https://www.iexexchange.io/products/market-data-connectivity

Ethernet Alliance - Ethernet Applications
Source: https://ethernetalliance.org/wp-content/uploads/2024/03/2024-Ethernet-Roadmap-Digital-Version-March-2024.pdf
Understanding the electrical workings of different Ethernet standards is essential for network engineers in ULL systems.
Established by the IEEE 802.3ab standard, 1000BASE-T, or 1 Gigabit Ethernet (GbE), is the most common networking standard, as it is supported by almost all modern equipment and offers sufficient performance for most common applications. 1000BASE-T also quickly replaced earlier Ethernet standards like 10BASE-T (10 Mbps) and 100BASE-T (100 Mbps). Below are a couple of important characteristics of 1000BASE-T Ethernet:
-
PAM-5 Encoding:
1 GbE uses pulse-amplitude modulation with five levels (PAM-5): -2, -1, 0, +1, +2. Four levels represent two bits per symbol; the fifth level supports forward error correction (FEC).
-
Analog Signaling:
The use of multiple voltage levels introduces analog characteristics, making the signal more susceptible to noise. With hybrids or echo cancelers, 1000BASE-T achieves full-duplex transmission, which allows simultaneous symbol transmission and reception on one wire pair.
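The 1 Gbit/s line rate follows directly from the PAM-5 scheme: four wire pairs each signal at 125 Mbaud, and four of the five levels encode two data bits per symbol:

```python
# Where 1000BASE-T's line rate comes from.
PAIRS = 4                 # Cat5 cabling provides four wire pairs, all used
SYMBOL_RATE_MBAUD = 125   # symbols per second per pair (in millions)
BITS_PER_SYMBOL = 2       # 4 of the 5 PAM-5 levels carry data bits

rate_mbps = PAIRS * SYMBOL_RATE_MBAUD * BITS_PER_SYMBOL
print(rate_mbps)  # 1000
```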

10 Gigabit Ethernet versions
Source: https://www.techtarget.com/searchnetworking/definition/10-Gigabit-Ethernet
Established by the IEEE 802.3ae standard for fiber optic cables (specification 802.3ak outlines the standard for 10GbE over twisted copper pairs) to supplement the base Ethernet standard, IEEE 802.3, 10GbE operates exclusively in full-duplex mode (half-duplex operation does not exist in 10GbE, since the IEEE specification only defines full-duplex, point-to-point links), enabling bi-directional data transmission on the same signal carrier. Additionally, 10GbE removes the need for the Carrier Sense Multiple Access/Collision Detection (CSMA/CD) protocol used in earlier Ethernet versions. 10GbE has been widely adopted in data centers and trading environments, with gradual adoption in small-business LANs, thanks to single-mode fiber's ability to work over long distances, e.g. 40 kilometers (about 25 miles). Similar to 1GbE, a couple of key benefits of 10GbE are listed below:
-
Serial Transmission:
10GbE typically uses serial transmission methods like 10GBASE-R, where data is transmitted over a single serial data stream. AMD offers its own 10GBASE-R data links that integrate with serial interfaces.
-
Reduced Latency:
Serial transmission and simplified encoding schemes reduce latency. 10GBASE-R devices offered by AMD provide 1588 hardware timestamping support that is necessary for PTP time-sync systems; however, these devices only support LAN mode.
40GbE was established by the IEEE 802.3ba-2010 standard. For firms requiring higher bandwidth, 40GbE offers increased capacity. It is suitable for handling multiple data feeds or high-volume trading strategies.
-
4x10 Gbps:
Electrically, 40GbE often consists of four parallel 10 Gbps lanes. This method allows for easier scaling from existing 10GbE technologies. Various vendors offer 4x10 GbE products, such as Cisco's 40GBASE Quad Small Form-Factor Pluggable (QSFP) modules with a 4x10GBASE design and the Approved Networks 40GBASE-PLR4 QSFP+ transceiver.
-
25 Gbps Ethernet (25GbE):
Providing 2.5 times the bandwidth of 10GbE, 25GbE devices target high-performance networking applications with higher bandwidth needs. For example, the Mellanox ConnectX-5 Ethernet adapter supports 10/25GbE data rates via an SFP28 transceiver.
-
Backwards compatibility:
Additionally, 25GbE is backwards compatible with the 10GbE standard, since SFP28 is compatible with SFP+ (enhanced small form-factor pluggable), the SFP transceiver version that 10GbE uses.
-
Cost efficiency:
25GbE has roughly the same power consumption as 10GbE while offering 2.5 times the bandwidth, which translates to energy cost savings per unit of bandwidth when comparing 25GbE to 10GbE devices. 25GbE also offers greater port density than 40GbE while, again, providing lower costs and power requirements.
Since 2021, data centers have been upgrading to 25GbE devices, making them a more common network technology for high-performance networking applications.
-
-
100 Gbps Ethernet (100GbE):
Established by the IEEE 802.3ba-2010 standard, 100GbE devices are on the advanced end of Ethernet devices, usually reserved for AI, machine learning, big data, and cloud networking applications, such as the Mellanox ConnectX-6 Ethernet adapter which supports 100GbE via two QSFP28 transceiver ports. Modulation schemes can break up the single 100GbE data lane into four data lanes of 25 GbE each. Some key benefits of 100 GbE include:
-
Handling high network traffic loads:
One of the main benefits of 100GbE is its ability to handle high levels of traffic by multiple devices connected on the same network, where traffic in such large-scale networks would involve complex network requests and network updates in real-time.
-
Backwards compatibility:
100GbE is backwards compatible with devices such as switches, NICs, ASICs, processors, and other networking equipment, which saves costs through the re-use of previously owned transceivers and modules.
-
Reduces network complexity:
With the ability to handle greater network traffic demands, 100GbE can reduce the number of network nodes in an enterprise network by consolidating cabling, servers, and other networking equipment. One example is spine networks, such as CME Globex's GLink architecture, which reduce the load on the core of a firm's overall network.
-
Long distance connections:
With single-mode fiber optic 100GbE cables, connections can reach up to 60 miles.
Before upgrading to 100GbE, network engineers must ensure cables support 100GbE speeds and that devices in the network are compatible with it. Otherwise, time synchronization across the network can be impaired, leading to increased latency and packet loss.
-
References:
- EDN. (2003, April 1). PAM5 Encoding. Retrieved from https://www.edn.com/what-pam5-means-to-you/
- Awati, Rahul; Kirvan, Paul. TechTarget. (2021, June). 10 Gigabit Ethernet (10 GbE). Retrieved from https://www.techtarget.com/searchnetworking/definition/10-Gigabit-Ethernet
- RF Wireless World. (n.d.). Difference Between 10GBASE-T,10GBASE-R,10GBASE-X And 10GBASE-W. Retrieved from https://www.rfwireless-world.com/Terminology/10GBASE-T-vs-10GBASE-R-vs-10GBASE-X-vs-10GBASE-W.html
- Wright, Gavin. (2021, August). 1000BASE-T (Gigabit Ethernet). Retrieved from https://www.techtarget.com/searchnetworking/definition/1000BASE-T
- Ethernet Alliance. (2024). 2024 Ethernet Roadmap. Retrieved from https://ethernetalliance.org/technology/ethernet-roadmap/
- AMD. (n.d.). 10 Gigabit Ethernet PCS/PMA (10GBASE-R). Retrieved from https://www.xilinx.com/products/intellectual-property/10gbase-r.html
- Cisco. (n.d.). Cisco 40GBASE QSFP Modules Data Sheet. Retrieved from https://www.cisco.com/c/en/us/products/collateral/interfaces-modules/transceiver-modules/data_sheet_c78-660083.html
- Approved Networks. (n.d.). 40GBASE-PLR4 QSFP+ (4X10) SMF 1310nm 10km DDM Transceiver. Retrieved from https://approvednetworks.com/products/40gbase-qsfp-plr4-4x10-10km-ddm-transceiver.html?srsltid=AfmBOopGXnHfl6zjtFUQ6kXflmwhCuMDWzanKJuLNtBnExzoL8frioYT
- Watts, David. Lenovo. (2024, August). ThinkSystem Mellanox ConnectX-5 EN 10/25GbE SFP28 Ethernet Adapter. Retrieved from https://lenovopress.lenovo.com/lp1351-thinksystem-mellanox-connectx5-en-25gbe-sfp28-ethernet-adapter
- Migelle. FS. (2021, March 1). Is 25GbE the New 10GbE?. Retrieved from https://community.fs.com/article/is-25gbe-the-new-10gbe.html
- Watts, David. Lenovo. (2024, August). ThinkSystem Mellanox ConnectX-6 Dx 100GbE QSFP56 Ethernet Adapter. Retrieved from https://lenovopress.lenovo.com/lp1352-thinksystem-mellanox-connectx-6-dx-100gbe-qsfp56-ethernet-adapter
- Awati, Rahul. TechTarget. (2021, September). 100 Gigabit Ethernet (100 GbE). Retrieved from https://www.techtarget.com/searchnetworking/definition/100-Gigabit-Ethernet-100GbE
Data links communicate either one-way or bi-directionally. Half-duplex and full-duplex describe this kind of networking communication.

Half-duplex vs full-duplex
Source: https://www.techtarget.com/searchnetworking/answer/The-difference-between-half-duplex-and-full-duplex
-
Half-duplex communication:
Half-duplex only allows one-way data transmission at a time. Due to its one-way nature, half-duplex networks require mechanisms to avoid data collisions, such as CSMA/CD, which checks whether a transmission is already in progress before trying to send data down the wire.
-
Full duplex communication:
Full duplex allows simultaneous transmission and reception of data on the same data link. Since there is no risk of data collision, data transfers are completed quickly.
-
Prevalence:
Most modern network links operate in full-duplex mode to maximize data flow. Recent Ethernet standards, i.e. 10GbE and up, no longer define half-duplex operation.
-
-
Implications for data capture:
-
Double the data rate:
Capturing both transmit (TX) and receive (RX) data effectively doubles the bandwidth requirements. Therefore, it is advised to use switches, NICs, and other networking equipment that can handle the higher bandwidth, such as 10GbE, 25GbE, 40GbE, and 100GbE.
-
Example:
For a single 10 Gbps link, capturing both directions requires handling up to 20 Gbps of data. A packet capture system must therefore be sized for the aggregate data rate of all monitored network activity, not the specified throughput of individual devices. Consequently, monitoring and analyzing peak network traffic rates is crucial in a high-rate packet capture system, and both can be drastically improved with the ULL timestamping provided by PTP, PTM, and photonic time synchronization methods.
-
As network speeds increase to 100 Gbps and beyond, high-speed packet capture becomes more complex: the risk of dropped packets rises significantly, degrading network visibility and reliability.
References:
- Burke, John; Partsenidis, Chris. TechTarget. (2019, November 13). What's the difference between half-duplex and full-duplex?. Retrieved from https://www.techtarget.com/searchnetworking/answer/The-difference-between-half-duplex-and-full-duplex
A real-world example highlights the complexities involved in network packet capture for HFT.
Consider a Napatech card with 4x10 Gbps ports to capture:
-
PCIe Bandwidth Limitations:
-
PCIe v2 x8:
The Napatech cards utilized PCI Express (PCIe) version 2 with eight lanes (x8).
-
Bandwidth Constraints:
PCIe v2 x8 has a maximum theoretical bandwidth of approximately 4 GB/s (500 MB/s per lane, unidirectional), which is less than the required 40 Gbps (4 ports x 10 Gbps, i.e. roughly 5 GB/s) of network bandwidth.
-
-
The bottleneck:
This discrepancy meant that the packet capture cards could not handle the full 40 Gbps of network traffic without potential data loss. For ULL applications and HFT firms, any packet loss is unacceptable due to the critical nature of the data.
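The shortfall is straightforward arithmetic; the sketch below compares the aggregate line rate against PCIe 2.0 x8 host bandwidth, where 5 GT/s per lane with 8b/10b encoding yields 4 Gbit/s of usable data per lane per direction:

```python
# Aggregate capture rate vs. PCIe 2.0 x8 throughput, per direction.
LANES = 8
PCIE2_USABLE_GBPS_PER_LANE = 4.0   # 5 GT/s minus 8b/10b encoding overhead
pcie_gbps = LANES * PCIE2_USABLE_GBPS_PER_LANE   # 32 Gbit/s (= 4 GB/s)

PORTS = 4
PORT_RATE_GBPS = 10.0
capture_gbps = PORTS * PORT_RATE_GBPS            # 40 Gbit/s at line rate

print(f"PCIe 2.0 x8 bus: {pcie_gbps:.0f} Gbit/s")
print(f"4x10GbE capture: {capture_gbps:.0f} Gbit/s")
print(f"shortfall:       {capture_gbps - pcie_gbps:.0f} Gbit/s")
```

Even before protocol and DMA overheads, the bus is 8 Gbit/s short of line rate, so sustained full-rate traffic on all four ports must drop packets somewhere.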
-
Solutions and considerations:
To overcome such challenges, network engineers might:
-
Upgrade to PCIe v3, v4, or v5:
Newer versions of PCIe offer higher bandwidth per lane, alleviating the bottleneck. Shown below is a table listing the x32 bandwidths of different PCIe generations:

| PCIe Generation | Bandwidth | Transfer Rate | Frequency |
| --- | --- | --- | --- |
| PCIe 1.0 x32 | 8 GB/s | 2.5 GT/s | 2.5 GHz |
| PCIe 2.0 x32 | 16 GB/s | 5 GT/s | 5 GHz |
| PCIe 3.0 x32 | 32 GB/s | 8 GT/s | 8 GHz |
| PCIe 4.0 x32 | 64 GB/s | 16 GT/s | 16 GHz |
| PCIe 5.0 x32 | 128 GB/s | 32 GT/s | 32 GHz |
-
The expected bandwidth for x8 is shown below. Note: as with the table above, the listed bandwidths are per-direction (unidirectional).

| PCIe Generation | Bandwidth |
| --- | --- |
| PCIe 1.0 x8 | 2 GB/s |
| PCIe 2.0 x8 | 4 GB/s |
| PCIe 3.0 x8 | 8 GB/s |
| PCIe 4.0 x8 | 16 GB/s |
| PCIe 5.0 x8 | 32 GB/s |
-
-
Distribute Traffic:
Use multiple packet capture cards or systems to distribute the load.
-
Optimize Data Capture:
Implement efficient data handling techniques to reduce overhead.
This anecdote illustrates the need for careful planning and state-of-the-art hardware in network packet capture for HFT.
-
References:
- PCI-SIG. (n.d.). PCI Express® Technology. Retrieved from https://pcisig.com/specifications/pciexpress/technology
- Wikipedia. (n.d.). PCI Express 2.0. Retrieved from https://en.wikipedia.org/wiki/PCI_Express#cite_note-PCIExpressPressRelease-65
- George. FS. (2024, September 3). PCIe 5.0 vs. PCIe 4.0: Which One to Choose?. Retrieved from https://community.fs.com/article/pcie-50-vs-pcie-40-which-one-to-choose.html
Capturing network packets with ULL is the norm in industries such as network security, real-time analytics, and high-frequency trading. Packet capture involves intercepting and logging network traffic as packets transit the network medium or data link. In high-rate environments, packet capture becomes increasingly challenging due to the volume of data and the need for minimal impact on network performance, i.e. minimizing packet loss, jitter, and latency. Described below are several challenges and methods of packet capture.
Capturing packets in ULL settings involves several hurdles:
- High data rates: The volume of data can overwhelm capture mechanisms, leading to dropped packets.
- Timing precision: Accurate timestamping is crucial for latency measurements and network packet analysis.
- Minimal impact: Packet capture should not interfere with the normal operation of the network or the devices involved.
Several methods exist to mitigate each of these challenges to ensure that high data rates are manageable. Some of these methods are described below:
-
Split packet data streams:
-
Splitting the packet stream into $N$ parts can reduce the load on the packet capture system. Additionally, cloning the packet capture system $N$ times, one clone for each part so that each clone gets $\frac{1}{N}$ of the traffic, can improve the ability to handle higher packet data rates. -
Splitting the packet data stream can be done with a hardware or software load balancer, which ensures that all the packets that are part of the same data transmission stay together. Some vendors of load balancers include F5, HAProxy, and AWS Elastic Load Balancing (ELB).
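The flow-affinity requirement (packets belonging to the same transmission must land on the same capture clone) is typically met by hashing each packet's flow 5-tuple to pick a clone. A minimal sketch of that idea using a simple CRC32 hash (real balancers and NICs use hardware hashes such as Toeplitz; the function and tuple layout here are illustrative):

```python
import zlib

def capture_worker(five_tuple: tuple, n_workers: int) -> int:
    """Map a flow 5-tuple (src IP, dst IP, protocol, src port, dst port)
    to one of n_workers capture clones; same flow -> same worker."""
    key = "|".join(map(str, five_tuple)).encode()
    return zlib.crc32(key) % n_workers

# Every packet of this flow deterministically hashes to the same clone.
flow = ("10.0.0.1", "10.0.0.2", "tcp", 49152, 443)
print(capture_worker(flow, 4))
```

Because the mapping depends only on the 5-tuple, a TCP connection or UDP conversation is never split across clones, so per-flow state (sequence tracking, reassembly) stays local to one capture system.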
-
-
Turn off DNS resolution:
-
DNS resolution turns an IP address into a displayable hostname. Although helpful for producing human-readable source and destination addresses in the packet header, an individual DNS lookup can take anywhere from several milliseconds to multiple seconds. Thus, turning off DNS resolution reduces the latency of a packet capture system by shaving off the time spent on DNS lookups.
-
DNS resolution can be turned off in the following tools:
-
`tcpdump`: Add the `-n` flag to disable DNS lookups. For example: `tcpdump -n -i eth0`
-
Wireshark:
In the top-panel menu, go to `View` → `Name Resolution` and uncheck the `Resolve Network Addresses` option. Disabling DNS lookups can also be done at the time of packet capture by going to `Capture Options`, clicking the `Options` tab, and unchecking the `Resolve network names` option:
-
-
-
Turn off port resolution:
-
Although not as much of a reduction in latency, disabling port number lookups can shave off additional time to reduce latency in packet capture systems.
-
Port resolution can be turned off in the following tools:
-
-
`tcpdump`: Add a second `-n` flag to disable port lookups. For example: `tcpdump -nn -i eth0`
-
Wireshark:
In the top-panel menu, go to `View` → `Name Resolution` and uncheck the `Resolve Transport Addresses` option. Similar to DNS lookups, port lookups can also be disabled at the time of packet capture by going to the top-panel menu and selecting `Capture` → `Options`, clicking the `Options` tab, and unchecking the `Resolve transport names` option.
-
-
-
-
Passive TAPs:
- As the name implies, passive Traffic Analysis Points/Test Access Points (TAPs) are non-intrusive to data transmission, making them perfect for packet capture systems to ensure normal operations of a network.
-
Reduce unnecessary packet processing:
- To further reduce packet capture latency, packet capture systems ought to focus on minimizing the average processing time per packet. Listed below are several processes that may or may not be necessary as the amount of processing and memory usage goes up:
- Breaking up the packet into fields of interest (not optional; may be significant processing time)
- IP address
- Protocol and port (including flags)
- Keeping track of the TCP connection and UDP conversation state (including ICMP errors)
- DNS lookups
- Reassembling the TCP connections
- Inspecting the packet payload (actual content being transmitted)
- Extracting user content out of the payload (such as downloaded files)
- Decrypting the payload content
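The first item on that list, breaking the packet into fields of interest, can be sketched in a few lines. A minimal illustration assuming untagged Ethernet II frames carrying IPv4 (no VLAN tags; the helper name and synthetic frame are ours):

```python
import struct

def fields_of_interest(frame: bytes) -> dict:
    """Extract only src/dst IP, protocol, and ports from an Ethernet II + IPv4 frame."""
    if struct.unpack("!H", frame[12:14])[0] != 0x0800:  # EtherType: IPv4 only
        return {}
    ihl = (frame[14] & 0x0F) * 4          # IPv4 header length in bytes
    l4 = 14 + ihl                         # start of the TCP/UDP header
    src_port, dst_port = struct.unpack("!HH", frame[l4:l4 + 4])
    return {
        "src": ".".join(map(str, frame[26:30])),
        "dst": ".".join(map(str, frame[30:34])),
        "proto": frame[23],               # 6 = TCP, 17 = UDP
        "sport": src_port,
        "dport": dst_port,
    }

# Synthetic frame: Ethernet header + minimal IPv4 header + TCP ports.
eth = b"\xaa" * 6 + b"\xbb" * 6 + struct.pack("!H", 0x0800)
ip = bytes([0x45, 0]) + struct.pack("!H", 40) + b"\x00" * 4 \
     + bytes([64, 6]) + b"\x00\x00" + bytes([10, 0, 0, 1]) + bytes([10, 0, 0, 2])
tcp = struct.pack("!HH", 49152, 443) + b"\x00" * 16
print(fields_of_interest(eth + ip + tcp))
# {'src': '10.0.0.1', 'dst': '10.0.0.2', 'proto': 6, 'sport': 49152, 'dport': 443}
```

Even this "cheap" step touches several header offsets per packet, which is why the heavier steps further down the list (state tracking, reassembly, payload inspection) should only be enabled when actually needed.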
-
Limit storage writes:
-
Writing to storage, particularly to storage disks, is one of the slowest operations a computer performs, so limiting storage writes is crucial. For example, packet loss can spike when disk writes are heavy because heavy disk writes block other tasks until they complete.
-
In addition to task-blocking from heavy disk writes, a large volume of storage writes can queue up, filling the packet capture system's buffers, which will lead to latency spikes.
-
Limiting storage writes in packet capture tools:
-
`tcpdump`: The `-w` flag tells `tcpdump` to write to disk, for example: `tcpdump -i eth0 -w capture.pcap` (the filename here is a placeholder). Include the `-w` flag only intentionally, ensuring that packet capture writes to disk only what is absolutely necessary. -
Wireshark:
Wireshark stores captured packets to disk by default, so the best approach to reduce storage writes is to use Wireshark's built-in packet filters to limit the number of captured packets that are written to storage.
-
-
-
Limit screen writes:
-
Displaying summaries for each packet can demand effort from software libraries and GUIs, which can significantly slow down packet processing.
-
Instead of using GUIs to display packet data, use text-mode equivalents.
-
Wireshark:
Wireshark has
tshark
, the command-line network protocol analyzer that functions much liketcpdump
. -
Linux:
Further reductions in processing can be done in Linux by running the text-mode packet sniffer right on the Linux console with no GUI, sending each line of text right to the screen with far fewer libraries needed for displaying the packet data.
-
-
Use the
screen
utility to hold program output:-
With the `screen` utility, packet sniffers can capture packets while their output stays hidden; `screen` absorbs the program's output at extremely high speed, allowing the packet capture program to run faster than it would on the regular console. -
For example, running:
screen -S capture -R
creates a new screen session named `capture`, which can be reconnected to later. In the command prompt, `tcpdump` or `tshark` can be run to start the packet capture process. Then, once the packet capture process begins, the screen session can be disconnected from by entering `Ctrl`+`A` then `D`. The packet capture process will continue running in the background, sending its output to the `screen` utility while never making it to any graphical interface. -
tcpdump
:The
tcpdump
command has its own flag,-q
, to reduce the amount of processing on each packet and the amount written to the screen. -
Wireshark:
Wireshark can reduce outputs to its display by unchecking the:
-
Update list of packets in real-time
, -
Automatically scroll during live capture
, and Show capture information during live capture
options under the top-panel menu
Capture
→Options
. Disabling these capture options ensures that with heavy network traffic, Wireshark is focused on packet capture and not on updating the screen. -
-
-
-
Turn off unnecessary packet capture processes:
-
Packet capture is a time-sensitive task, so reducing the number of concurrent programs running on the same machine will drastically improve the performance of the packet capture system, especially if those concurrent programs put a lot of strain on the CPU, memory, network, and storage disk.
-
If heavy concurrent processing tasks cannot be reduced, move them to another machine to free up resources on the machine(s) doing the packet capture.
-
-
Raise or lower CPU priority:
-
On Linux, Unix, and Mac systems, the
nice
command is a crucial tool for managing process priorities, which is especially important for improving the speed of packet capture. By adjusting a program's priority, you can ensure that packet capture tools receive processor time before other processes, allowing them to respond swiftly to incoming packets. -
Typically, running a program with the
nice
command and a positive value lowers its priority, telling the kernel to allocate CPU time to other processes first. However, you can do the opposite by using a negative value withnice
, effectively increasing a program's priority. For example:sudo nice -n -10 top
This command elevates the
top
program's priority, allowing it to run before other tasks. Because you're asking the operating system to prioritize this program over others, you need root privileges, hence the use ofsudo
. -
For packet capture systems, this CPU prioritization is vital. Packet sniffers need to handle network traffic in real-time, and any delay can result in missed packets, affecting the accuracy of data capture. By assigning them a higher priority, you ensure they get immediate access to the CPU when needed.
-
The
nice
levels range from+19
(lowest priority) to-20
(highest priority), with0
being the default whennice
isn't used. Adjusting these levels can significantly impact how the system allocates CPU time among processes. While increasing the priority of packet capture tools may slow down other applications, this trade-off is acceptable when accurate and timely packet capture is the primary goal. -
You can verify the priority adjustments by running
top
, which displays running processes along with theirNI
(nice) values.-
Using
nice
withtcpdump
:sudo nice -n -20 tcpdump ...
gives it the highest priority level, enabling it to process packets promptly.
-
Using
nice
with Wireshark:If you're using Wireshark and start it from the command line, you can elevate its priority with:
sudo nice -n -10 wireshark
If you start Wireshark from a menu and can't set the priority upfront, you can adjust it after launching using the
renice
command:sudo renice -n -20 -p $(pidof Wireshark) $(pidof wireshark)
-
Ensuring tools like
tcpdump
and Wireshark have immediate access to CPU resources reduces latency of a packet capture system. -
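The same priority mechanism behind the `nice` command is available to programs through the `nice(2)` system call, which Python exposes as `os.nice`. A minimal sketch that lowers the current process's priority (raising it, i.e. a negative increment, would require root, as described above):

```python
import os

# An increment of 0 makes no change and simply reports the current nice value.
current = os.nice(0)
print("nice value before:", current)

# Add 5 to the nice value: lower this process's CPU priority.
# The kernel caps nice values at +19, hence the min() in the check below.
lowered = os.nice(5)
print("nice value after:", lowered)
assert lowered == min(current + 5, 19)
```

A capture tool would do the inverse under root (a negative increment), which is exactly what `sudo nice -n -20 tcpdump ...` arranges from the shell.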
-
Raise or lower disk priority:
-
The
nice
andrenice
commands effectively adjust CPU priority for processes, ensuring that critical tasks like packet sniffers respond quickly while less urgent programs wait. Similar control over disk access is available with theionice
utility to adjust disk I/O priority. -
Adjusting disk priority is crucial when a system is simultaneously sniffing packets and performing disk-intensive tasks like backups. A backup program, which heavily reads from disk, can cause the packet sniffer to pause during its attempts to write data, potentially leading to lost packets. By using
ionice
, you can prevent such conflicts and improve the speed of packet capture. -
Using
ionice
:To prioritize the packet sniffer, you can raise its disk priority using
ionice -c Realtime
:sudo ionice -c Realtime tcpdump -w /packets.pcap
Or lower the disk priority of the backup program using
ionice -c Idle
:ionice -c Idle -p `pidof backup_program_name`
Applying both adjustments maximizes effectiveness:
sudo ionice -c Realtime tcpdump -w /packets.pcap
ionice -c Idle -p `pidof backup_program_name`
This approach ensures the packet capture tool maintains priority access to the disk, reducing the chance of lost packets, while the backup process may take longer due to lower priority.
-
Similar to CPU priority adjustments, you can start a new program with modified disk priority or change the priority of a running process using the
-p
flag. Note that raising disk priority requiressudo
privileges. -
Using
ionice
with Wireshark:When starting Wireshark from the command line, you can run it under
ionice
:sudo ionice -c Realtime wireshark
If Wireshark is started from a menu, adjust its disk priority after launch:
sudo ionice -c Realtime -p `pidof Wireshark` `pidof wireshark`
-
-
Use dedicated storage:
- Dedicated disks for packet capture are strongly recommended to avoid conflicts over disk access. With a dedicated capture disk, the Linux kernel can access multiple storage drives simultaneously without conflict.
- Using dedicated storage with
ionice
can significantly improve the speed of packet capture by drastically reducing conflicts in storage disk writes.
-
Split packet capture and processing over multiple machines:
-
To improve the speed of packet capture, efficiency can be improved by separating the capturing and analysis processes across multiple machines:
-
Instead of using a single program on one machine to capture and analyze packets simultaneously, use a lightweight packet capture tool — like
tcpdump
— that does minimal processing; more advanced packet analysis/capture programs, like Wireshark, can display and process them later. Run this tool at high CPU and disk priority on a dedicated system that is performing few other tasks. Employ a Berkeley Packet Filter (BPF) to limit the capture to only the necessary packets, reducing data volume and speeding up the process. -
Next, set up one or more separate computers for analysis. These can be physical machines, virtual machines, or cloud servers. Organize the packet capture workflow so that each analysis machine processes different
.pcap
or.pcapng
files, which can be as simple as distributing files via a shared network drive. -
When transferring captured packets to the analysis systems, run the transfer at low CPU and disk priority to avoid impacting the capture process. Ensure this transfer occurs over a different network segment than the one being monitored to prevent any interference or packet loss.
Distributing the workload and optimizing at each step can significantly improve the speed and efficiency of a packet capture system. As network traffic increases, scaling up can be achieved by adding more analysis machines or utilizing cloud resources. Running analysis tasks in virtual machines allows for easy cloning and efficient resource sharing, further improving performance.
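The `.pcap` files being handed between machines have a simple, stable layout (a 24-byte global header followed by per-packet records), which is what makes this kind of file-based distribution easy. A minimal sketch that writes and re-reads a classic pcap global header in memory (synthetic data only, no live capture required):

```python
import io
import struct

# Classic little-endian pcap global header: magic, version major/minor,
# thiszone, sigfigs, snaplen, and the link-layer type (1 = Ethernet).
PCAP_MAGIC = 0xA1B2C3D4

def write_pcap_header(buf, snaplen=65535, linktype=1):
    buf.write(struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, snaplen, linktype))

def read_pcap_header(buf):
    magic, major, minor, _tz, _sigfigs, snaplen, linktype = struct.unpack(
        "<IHHiIII", buf.read(24))
    assert magic == PCAP_MAGIC, "not a little-endian classic pcap file"
    return {"version": f"{major}.{minor}", "snaplen": snaplen, "linktype": linktype}

buf = io.BytesIO()
write_pcap_header(buf)
buf.seek(0)
print(read_pcap_header(buf))  # {'version': '2.4', 'snaplen': 65535, 'linktype': 1}
```

Because every record in the file is self-describing, analysis machines can process the transferred files independently and in parallel with no coordination beyond file distribution.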
-
-
-
Choice of OS:
- A packet capture system will have the greatest efficiency and speed on Linux because of the Linux kernel's ability to process packets using multiple processor cores, spreading out the network load and letting it handle more packets per second.
-
An advanced analysis of the packets-per-second (pps) performance of Linux's networking stack can be found on Cloudflare's Blog from 2015, which outlines how to achieve 1 million UDP pps:
-
Baseline performance:
- A naive approach using a single receive queue and IP address achieved approximately 370kpps. By pinning processes to specific CPUs, performance consistency improved, peaking at around 370kpps.
-
Multi-queue NIC utilization:
- A multi-queue NIC (in Cloudflare's tests, a 10G NIC by Solarflare) distributes incoming packets across several receive queues, each pinned to a specific CPU. This multi-queue setup improved performance to around 650kpps when packets were distributed across multiple IP addresses.
-
NUMA node considerations:
- NUMA can significantly reduce latency and increase throughput by keeping memory close to the processors that use it, reducing cross-NUMA traffic and improving cache locality, CPU pinning, and multiple RX queue support through the use of multiple CPU-assigned NUMA nodes.
- Performance varied significantly depending on the NUMA (Non-Uniform Memory Access) node configuration. Best performance (430kpps per core) was achieved when processes and RX queues were aligned within the same NUMA node.
-
SO_REUSEPORT for scaling:
- The
SO_REUSEPORT
socket option allows multiple processes to bind to the same port, reducing contention and allowing each to handle its own receive buffer. This strategy achieves over 1Mpps and even up to 1.4Mpps in optimal conditions with well-aligned RX queues and NUMA configurations.
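The `SO_REUSEPORT` behavior is easy to demonstrate: without the option, the second `bind()` below would fail with "Address already in use"; with it, the kernel hashes each incoming datagram to exactly one of the sharing sockets. A minimal Linux sketch (loopback address and payload are arbitrary):

```python
import select
import socket

def reuseport_udp_socket(port: int) -> socket.socket:
    """A UDP socket that can share its port with other SO_REUSEPORT sockets."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    return s

# Two "worker" sockets sharing one UDP port, each with its own receive buffer.
s1 = reuseport_udp_socket(0)                 # port 0: let the kernel pick
port = s1.getsockname()[1]
s2 = reuseport_udp_socket(port)              # second bind succeeds via SO_REUSEPORT

# The kernel delivers each datagram to exactly one of the sharing sockets.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"tick", ("127.0.0.1", port))
readable, _, _ = select.select([s1, s2], [], [], 1.0)
assert len(readable) == 1                    # delivered to exactly one worker
print(readable[0].recvfrom(64))
```

In a real receiver, each worker process would own one such socket (and be pinned to its own CPU), which is how the per-socket receive buffers eliminate the contention described above.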
-
Challenges and limits:
-
Even with optimized settings, the kernel and application may drop packets if CPU resources are insufficient or if hashing collisions occur at the NIC or `SO_REUSEPORT` layers.
- To achieve 1 million UDP packets per second on Linux, multiple receive IP addresses and multi-queue network cards are required. The naive approach using a single receive queue and IP address can only reach 370kpps, but by distributing packets across multiple queues, up to 650kpps can be achieved.
-
-
-
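The CPU pinning used throughout these experiments can be done with `taskset` from the shell or, from inside a process, with the `sched_setaffinity(2)` call that Python exposes as `os.sched_setaffinity`. A minimal Linux-only sketch pinning the current process to CPU 0:

```python
import os

# Record the original CPU set so it can be restored afterwards.
original = os.sched_getaffinity(0)   # 0 = the calling process
print("allowed CPUs before:", sorted(original))

os.sched_setaffinity(0, {0})         # pin to CPU 0 only
print("allowed CPUs after:", sorted(os.sched_getaffinity(0)))

os.sched_setaffinity(0, original)    # restore the original affinity
```

Pinning a receive worker to the same CPU that services its RX queue's interrupts is what keeps the packet path on one cache hierarchy and, on multi-socket machines, on one NUMA node.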
Lastly, packet capture systems ought to group processes based on whether they occur during or after a capture. Most of Wireshark's heavy processing features, for example, run after packet capture is complete. Therefore, it is best to first identify whether the bottleneck in the packet capture system occurs during or after capture, and to start optimizing packet processing from there.
References:
- Tcpdump & Libpcap. (n.d.). Tcpdump and Libpcap. Retrieved from https://www.tcpdump.org/.
- Stearns, Bill. Active Countermeasures. (2020, July 14). Improving Packet Capture Performance – 1 of 3. Retrieved from https://www.activecountermeasures.com/improving-packet-capture-performance-1-of-3/
- Stearns, Bill. Active Countermeasures. (2020, August 18). Improving Packet Capture Performance – 2 of 3. Retrieved from https://www.activecountermeasures.com/improving-packet-capture-performance-2-of-3/
- Stearns, Bill. Active Countermeasures. (2020, September 16). Improving Packet Capture Performance – 3 of 3. Retrieved from https://www.activecountermeasures.com/improving-packet-capture-performance-2-of-3/
- Wireshark. (n.d.). tshark(1). Retrieved from https://www.wireshark.org/docs/man-pages/tshark.html
- Majkowski, Marek. Cloudflare Blog. (2015, June 16). How to receive a million packets per second. Retrieved from https://blog.cloudflare.com/how-to-receive-a-million-packets/
- Awati, Rahul. (2022, September). TechTarget. non-uniform memory access (NUMA). Retrieved from https://www.techtarget.com/whatis/definition/NUMA-non-uniform-memory-access
Several methods exist for capturing packets, each with its advantages and drawbacks in ULL environments.
-
Capturing on the host itself:
-
Benefits:
- Capturing packets directly on the host machine is the most straightforward method. Tools like `tcpdump` and Wireshark's `tshark` and `dumpcap` can intercept packets as they pass through the network interface.
-
Drawbacks of capturing on the host:
-
Performance Impact:
Packet capture consumes CPU and memory resources, potentially degrading the host's performance.
-
Scalability Issues:
Equipping all hosts with packet capture capabilities can be cost-prohibitive and inefficient.
-
-
-
Switch port mirroring:
A mirrored port, or the Cisco-specific Switched Port ANalyzer (SPAN), involves duplicating network traffic from a dedicated "mirror" port (or an entire VLAN) on a switch to another "monitor" port where the packets can be analyzed. The port mirroring feature requires a switch to be configurable or "managed" by a web or CLI management tool.
-
Benefits:
- Cheapest and easiest method of capturing packets since port mirroring is available on most networking devices.
- Can be configured and turned on/off with a few simple CLI commands or through a web management interface.
- Can provide aggregated access to multiple source/"mirror" ports or even entire VLANs (mirroring an entire VLAN is not advised because of bandwidth limitations).
- A dedicated switch for SPAN/port mirroring sessions can isolate mirrored traffic from production traffic.
- Comes with no data link interruption
- Production data links do not need to be disconnected to implement a SPAN/mirrored port. Switch configuration is the only necessary step.
-
Issues with traditional port mirroring:
-
Limited SPAN bandwidth:
- The "monitor" may become a bottleneck if it cannot handle the aggregated traffic.
- A "monitor" port can only transmit data towards the capture device, i.e. it can only receive packets.
- SPAN/"mirror" ports may send or receive packets.
- High traffic rates will result in packet loss when the mirrored port reaches its limit.
- Packets will be dropped if the "monitor" port receives packets at a rate higher than its bandwidth.
- Traffic from SPAN/port mirroring is considered low-priority by a switch because it is not part of normal traffic.
-
Switch CPU limitations
- Adding more "mirror" ports to a SPAN session increases the CPU load which can result in packet loss or latency spikes.
-
Insufficient precision:
- Packet loss from high bandwidth:
- If a SPAN/mirrored port operates at a lower speed than the original ports being mirrored, the SPAN/mirrored port cannot keep up with the incoming traffic, leading to queued packets.
- Packet queuing worsens timestamp accuracy.
- Degradations in timestamp accuracy worsen latency measurements.
- High switch CPU load also increases latency and can result in packet loss.
-
Difficult to troubleshoot network issues:
- When traffic rates are high, packet loss makes it difficult to troubleshoot network issues or outages since the source of packet loss is opaque, i.e. whether packets were dropped by the SPAN/mirrored port or by the network will be unclear.
- With the many possible cases of dropped packets, it is difficult to prove a packet's existence and delivery.
- SPAN/port mirror timings are distorted by the mirror process, so trusting mirrored packet timestamps on a micro- to nanosecond-level range is impossible for latency-sensitive environments.
- Susceptible to packet manipulation.
- SPAN/mirrored ports can get compromised which can result in hidden malicious packets or dropped packets.
-
-
-
Traffic Analysis Points/Test Access Points (TAPs):
A simple analogy to understand how TAPs operate is the Man-in-the-Middle (MITM) attack; TAPs can be thought of as eavesdropping on the packets transmitted through the data link, copying all the packets sent (`Tx`) and received (`Rx`).
One critical thing to note about hardware TAPs is that they need to be physically inserted onto/into the data link, which requires planning maintenance windows for the network to go offline during the insertion process. Therefore, it is advised to integrate TAPs at the start of building the packet capture system, starting with thorough testing of each TAP device before deploying any. Ideally, all TAPs should be full-duplex so that packets traveling down both directions of the data link communication are captured.
Some key benefits of TAPs are described below:
-
Benefits:
- Most reliable and accurate way to capture network packets.
- Offers lossless packet capture, making them ideal for network forensics and security compliance operations.
- Avoid introducing time delays in packet data transmission, i.e. TAPs avoid packet queuing.
- Significantly greater precision than SPAN/mirrored ports.
- Can operate without any power.
- Fiber optic passive TAPs provide this benefit (copper passive TAPs still require power).
- Full-duplex TAPs handle send (`Tx`) and receive (`Rx`) data on separate channels, reducing latency.
- Avoids additional load on the network switch.
- Resistant to packet manipulation or hidden malicious packets.
-
Considerations:
-
Fiber or copper:
- Fiber optic TAPs operate fully without power, decreasing the time to restore the network and ensuring that any packets still in transit will be captured.
- They passively split the actual light signal in the fiber, enabling ULL measurements without electronic interference.
- During a power outage, copper TAPs will require a re-synchronization of the data link communication.
- Fiber optic TAPs operate fully without power, decreasing the time to restore the network and ensuring that any packets still in transit will be captured.
-
Optical splitting ratios:
- Determines the amount of light deflected by the mirrors for capture:
X%
for production,Y%
for capture/monitoring - Common splitting ratios include 50/50, 70/30, and 80/20.
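Each splitting ratio maps directly to an optical insertion loss on the two legs, which matters for the link's light budget. A small worked calculation using the standard decibel conversion (ideal splitter; real devices add some excess loss on top):

```python
import math

def insertion_loss_db(fraction: float) -> float:
    """Ideal insertion loss in dB when a leg receives `fraction` of the input light."""
    return -10 * math.log10(fraction)

for prod, mon in [(0.5, 0.5), (0.7, 0.3), (0.8, 0.2)]:
    print(f"{int(prod * 100)}/{int(mon * 100)} split: "
          f"production leg ~{insertion_loss_db(prod):.1f} dB, "
          f"monitor leg ~{insertion_loss_db(mon):.1f} dB")
```

For example, an 80/20 split costs the production link only about 1 dB but leaves the monitor leg roughly 7 dB down, so the capture NIC's receiver sensitivity needs to tolerate that weaker signal.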
-
There are two kinds of TAPs: passive and aggregation:
-
Passive TAPs:
Passive taps are hardware devices inserted between network segments to monitor traffic without altering it, simply collecting the raw packet data.
The key benefits are listed below, with the full description of each benefit fully described in the Electronic exchange architecture section, under the Passive network Traffic Analysis Points/Test Access Points (TAPs) sub-section:
- Benefits:
- Powerless operation
- Comprehensive visibility
- Non-intrusive
- Hardware-based security
- Unidirectional flow
- Cheap, especially when using basic fiber TAP/splitters
- Benefits:
-
Aggregation TAPs:
As the name implies, aggregation TAPs merge sent and received packets into a single aggregated output. Thus, an aggregation TAP can save on NIC costs since you only need a single NIC to capture packets emitted from the aggregated output.
-
Benefits:
- Passive TAPs often feed into aggregation TAPs, like the Arista 7150, combining the benefits of passive monitoring with advanced processing capabilities.
- Saves on cost for NICs, requiring a single NIC per aggregation TAP.
- With a single NIC, packets are never out-of-order, an issue that is introduced when using more than one NIC in a packet capture system.
-
Caveats:
- Packet loss if the send (`Tx`)/receive (`Rx`) bandwidth exceeds the aggregated output bandwidth.
  - Example: if each direction of the tapped link runs at 1Gbps but the aggregated output bandwidth is also 1Gbps, the connection will incur packet loss, since the total bi-directional (send & receive) bandwidth is 2Gbps.
- More complicated, and thus, more expensive than passive TAPs.
-
-
-
Layer 2 Switches:
Layer 2 switches operate at the data link layer, providing ULL connectivity options.
Modern switches, like the Arista 7150 series, can function as an aggregation switch, consolidating traffic from multiple sources with precise timing information. The Arista 7150 Series is among the best that money can buy, with zero network impact and real <=1-nanosecond-accurate timestamps.
One important benefit from an aggregation switch is timestamp injection:
-
Hardware timestamp injection:
- Aggregation TAPs address accuracy issues in time synchronization by injecting hardware-based timestamps into packets.
- Timestamp injections ensure accurate time ordering despite potential time delays from packet queuing.
Two popular aggregation switches are:
-
Arista 7150 Series:
Optimized for ULL systems, the Arista 7150 series offers the same ULL characteristics at all packet sizes, even when features such as L3, ACL, QoS, Multicast, Port Mirroring, LANZ+ and Time-Stamping are enabled. The 7150S also supports cut-through mode at 100Mb and 1GbE speeds at low latency for legacy connections.
Benefits listed from the 7150 Series' product brief include:
- Wire-speed low-latency NAT: Reduces NAT latency by tens of microseconds compared to traditional high-latency solutions.
- IEEE 1588 precision time protocol: Provides hardware-based timing for accurate in-band time distribution with nanosecond accuracy.
- Integrated high precision oscillator: Ensures highly accurate timing with extended holdover.
- Latency and application analysis (LANZ): Detects, captures, and streams microbursts and transient congestion at microsecond rates.
- Advanced multi-port mirroring suite: Avoids costly SPAN/TAP aggregators with in-switch capturing, filtering, and time-stamping.
- Wire-speed VXLAN Gateway: Enables next-generation Data Center virtualization.
- AgilePorts: Adapts from 10G to 40G without costly upgrades.
Another important feature of the Arista 7150 Series is how it applies highly accurate nanosecond-level timestamps to packets by utilizing the Frame Check Sequence (FCS) field of a packet's Ethernet frame:
-
Benefits:
- Enables precise packet timing within nanoseconds.
- Maintains high-speed processing with minimal latency.
- Avoids congestion effects by applying timestamps early.
-
How timestamping works:
-
Location:
Timestamps are applied in the MAC hardware of the switch, which processes the earliest stages of packet handling.
-
Mechanism:
The FCS field (a 32-bit value) is repurposed to store the timestamp when a packet arrives at the MAC layer. This replacement of the existing FCS occurs before traffic is aggregated, ensuring accurate capture of the arrival timestamp.
-
Operations on timestamped frames:
-
Removal or replacement:
- Timestamped frames without a valid FCS can be dropped.
- Faulty frames are handled to ensure they do not impact performance.
There are two types of timestamping modes:
-
Replace mode:
Arista 7150 Series - FCS Replace Mode. Notice how the timestamp completely replaces the 32-bit FCS.
Source: https://arista.my.site.com/AristaCommunity/s/article/timestamping-deep-dive-frequent-questions-and-tips-on-integration
- The existing 32-bit FCS is completely replaced by the timestamp.
- Since the original frame size is preserved, there is no latency impact downstream.
- Downstream devices must recognize that the FCS field is now invalid, since it is now a timestamp.
- Cut-through switches will forward the Ethernet frame but may increment the checksum error counters on the transit interfaces.
-
Append mode:
Arista 7150 Series - FCS Append Mode. Notice how in this mode, the timestamp is appended after the Ethernet frame data but before the FCS.
Source: https://arista.my.site.com/AristaCommunity/s/article/timestamping-deep-dive-frequent-questions-and-tips-on-integration
- Insertion of a 4-byte timestamp is made between the Ethernet frame's data payload and the frame's 32-bit FCS.
- The old FCS is discarded, then the switch recalculates a new FCS, appending it to the end of the Ethernet frame.
- Headers of any nested protocols (e.g. TCP, UDP) are not updated.
- Downstream applications can access the inserted timestamp by reading the last 32-bits of the Ethernet frame payload.
-
Egress handling:
- Timestamps are written on ingress by the MAC, either in place of the FCS or between the end of the L2 payload and the FCS, but the software configuration to enable timestamping is applied on egress to ensure the timestamp field can be adjusted for compatibility with tools not designed to interpret timestamp-modified FCS fields.
- Extra FCS data may be appended to the frame payload for downstream applications requiring fault tolerance.
-
-
Application timing:
- Timestamps are applied as the MAC processes the first byte of the frame, marking the packet's exact arrival.
- Time synchronization to UTC is achievable using keyframe timestamp mechanisms, precise counters, and PTP clock synchronization for maximum accuracy.
- It is recommended to use consistent cable/fiber lengths between each traffic source and the aggregator to ensure accurate comparisons between timestamped packets arriving on multiple devices.
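For append mode specifically, a capture application that receives the frame after the NIC has stripped the FCS can recover the raw timestamp from the trailing 4 bytes. A minimal sketch (synthetic frame; network byte order is assumed here, and interpreting the raw counter value, e.g. its tick rate and keyframe correlation to UTC, is device-specific and not modeled):

```python
import struct

def extract_appended_timestamp(frame: bytes) -> tuple:
    """Split an append-mode frame (FCS already stripped by the NIC) into the
    original payload and the raw 32-bit timestamp the switch appended."""
    payload, raw_ts = frame[:-4], struct.unpack("!I", frame[-4:])[0]
    return payload, raw_ts

# Synthetic example: 60 bytes of frame data followed by a fabricated timestamp.
synthetic = b"\x00" * 60 + struct.pack("!I", 0x1234ABCD)
payload, ts = extract_appended_timestamp(synthetic)
print(hex(ts))  # 0x1234abcd
```

Converting the raw counter to wall-clock time then uses the keyframe and PTP mechanisms described above, which periodically relate the counter to UTC.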
-
-
-
Why it’s effective
- Ingress timestamping ensures deterministic, fully parallel performance in accurately marking (within < 10 nanoseconds) every packet as the Ethernet frame enters the switch before packets are aggregated.
- Avoids congestion issues such as packet queuing delays as packets are aggregated.
- Zero changes in latency or jitter by adding timestamps in parallel.
- The implementation supports both ingress and egress timestamping configurations, allowing flexibility for applications.
-
Arista 7130 MetaMux:
Another aggregation switch for ULL environments is the Metamako Mux. The company, Metamako, was acquired by Arista Networks in 2018, integrating their ULL products into Arista's portfolio. Below is a brief description of the current MetaMux model.
Diagram of an Arista 7130 MetaMux integrated into a financial exchange for trading
Source: https://www.arista.com/en/products/7130-meta-mux
Benefits listed from MetaMux's official product brief include:
- Ultra-fast multiplexing
  - Multiplexing/packet-aggregation in 39 nanoseconds.
  - Aggregates streams into a single stream for exchanges, with configurable N:1 multiplexers.
- Deterministic
  - Provides consistent latency of ±7 ns for optimal execution environments.
- Complete packet statistics
  - Offers port-level counters for accounting, diagnostics, and troubleshooting.
- Support for BGP and PIM
  - Enables compatibility with Layer 3 network devices.
- Easy to monitor and manage
  - Includes tools such as:
    - Ethernet counters on each port
    - Integrated Linux management processor
    - Streaming telemetry via InfluxDB
    - Web-based GUI
    - Command-line interface (CLI) via SSH, Telnet, or serial connection
    - Local and remote logging via Syslog
    - JSON-RPC API
    - SNMP (v1, v2, v3) support
    - NETCONF support
References:
- Endace. (n.d.). What is Network Packet Capture?. Retrieved from https://www.endace.com/learn/what-is-network-packet-capture
- Jasper. (2016, November 11). Packet Foo. The Network Capture Playbook Part 4 – SPAN Port In-Depth. Retrieved from https://blog.packet-foo.com/2016/11/the-network-capture-playbook-part-4-span-port-in-depth/
- Jasper. (2016, December 12). Packet Foo. The Network Capture Playbook Part 5 – Network TAP Basics. Retrieved from https://blog.packet-foo.com/2016/12/the-network-capture-playbook-part-5-network-tap-basics/
- Arista Networks. (n.d.). Arista 7150 Series Network Switch - Quick Look. Retrieved from https://www.arista.com/en/products/7150-series-network-switch-datasheet.
- Arista Networks. (2018, September 12). Arista Networks Acquires Metamako. Retrieved from https://www.arista.com/en/company/news/press-release/6070-pr-20180912.
- Woods, Kevin. (2021, January 20). Kentik. What Is Port Mirroring? SPAN Ports Explained. Retrieved from https://www.kentik.com/blog/what-is-port-mirroring-span-explained/
- Arista Networks. (n.d.). Arista 7130 MetaMux. Retrieved from https://www.arista.com/en/products/7130-meta-mux
- Arista Community Central. (2014, January 20). Timestamping on the 7150 Series. Retrieved from https://arista.my.site.com/AristaCommunity/s/article/timestamping-on-the-7150-series
- FMADIO. (2024, September 14). Network Architecture to Capture Packets. Retrieved from https://www.fmad.io/blog/10g-tap-span-mirror
Once packets are captured via one of the above methods, they need to be recorded and analyzed on a computing device.
Two of the most popular packet capture command-line tools available are libpcap's `tcpdump` and Wireshark's `dumpcap`:
- Using libpcap's `tcpdump`: `libpcap` is a system-independent interface for user-level packet capture. `tcpdump` is a command-line utility that utilizes `libpcap` to capture and display packet headers.
  - Relevant `tcpdump` options:
    - `-s [snaplen]`: Specifies the snapshot length, i.e., the number of bytes of each packet to capture. Capturing only headers (layers 1-4 or 5) reduces overhead.
    - `-j [tstamp_type]`: Chooses the timestamping method, which can be crucial for latency analysis.
  - Example: `tcpdump -i eth0 -s 128 -j adapter_unsynced -w capture.pcap`
    - `-i eth0`: Specifies the interface.
    - `-s 128`: Captures the first 128 bytes of each packet.
    - `-j adapter_unsynced`: Uses hardware timestamping from the network adapter.
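To illustrate why a small snaplen matters at high rates, here is a rough back-of-envelope sketch of the disk write rate with and without snapping. The traffic profile (1M packets/s averaging 800 bytes) is an assumption chosen for illustration; the 16-byte figure is the standard per-packet record header in a classic `.pcap` file.

```python
# Back-of-envelope estimate of disk write rate when snapping packets with
# tcpdump's -s option. The traffic profile numbers are assumptions.

PCAP_RECORD_HDR = 16  # bytes of per-packet metadata in a classic .pcap file

def disk_rate_bytes_per_sec(pps, avg_pkt_bytes, snaplen):
    """Bytes/s written to disk for pps packets/s at a given snaplen."""
    stored = min(avg_pkt_bytes, snaplen) + PCAP_RECORD_HDR
    return pps * stored

pps = 1_000_000                                    # assumed 1M packets/s
full = disk_rate_bytes_per_sec(pps, 800, 65535)    # default: whole packets
snapped = disk_rate_bytes_per_sec(pps, 800, 128)   # -s 128: headers only
print(full, snapped)  # -> 816000000 144000000 (816 MB/s vs 144 MB/s)
```

Under these assumptions, `-s 128` cuts the sustained disk rate by more than 5x, which is often the difference between lossless capture and drops on spinning or modest NVMe storage.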
- Using Wireshark's `dumpcap`: Wireshark's `dumpcap` is designed for high-speed, efficient capture to `.pcap` or `.pcapng` files, especially when paired with the Wireshark GUI. `dumpcap` does not analyze or decode packets directly; it delegates those tasks to Wireshark or `tshark`. Uniquely, `dumpcap` leverages memory-mapped I/O and ring buffer support to avoid disk space issues where raw packet data needs to be saved at very high rates over long time periods AND with minimal packet loss for later analysis.
- Comparing `dumpcap` to `tcpdump`:
  - libpcap's `tcpdump`:
    - Supports split captures with the `-C` flag (file size) and the `-W` flag (number of files).
    - Captures packet data only in `.pcap` files.
    - Fully standalone, but can provide `.pcap` files to Wireshark for analysis.
    - Provides real-time network debugging and analysis, decoding directly from the command line.
    - Useful for immediate troubleshooting and debugging without requiring a GUI like Wireshark.
    - Suitable for environments where installing Wireshark is not feasible, like IoT devices or remote servers (e.g., AWS EC2 instances or Google Cloud servers).
    - Highly customizable output with options for verbosity (`-v`), timestamp formats, and protocol decoding.
  - Wireshark's `dumpcap`:
    - Supports split captures using its ring buffer, which is designed for long-term captures without running out of disk space.
    - Split captures done with ring buffers are more efficient than `tcpdump`'s `-C` and `-W` flags.
    - Optimized for high-performance packet capture with minimal packet loss.
    - Often used in conjunction with Wireshark and `tshark` for in-depth analysis; it is not typically used standalone.
    - Ideal where raw packet data needs to be saved for later analysis.
    - Focuses on high-performance packet capture without adding real-time decoding overhead.
    - Minimal configuration options due to its emphasis on high-performance capture.
- Using `dumpcap` with `nice` and `ionice`:
  `sudo ionice -c2 -n0 nice -n-10 dumpcap [options]`
  - Explanation:
    - `sudo`: Required for `dumpcap` to access network interfaces and for `ionice` to change I/O priorities.
    - `ionice` options:
      - `-c2`: Sets the scheduling class to "Best-Effort".
      - `-n0`: Sets the highest priority within the Best-Effort class.
    - `nice` options:
      - `-n-10`: Sets the CPU priority to a higher level (negative values increase priority).
    - `dumpcap [options]`: Replace `[options]` with your specific `dumpcap` command-line arguments (e.g., interface, file output, buffer size).
- Example Use Case:
  Capture packets on interface `eth0` and write to a file with high CPU and I/O priority:
  `sudo ionice -c2 -n0 nice -n-10 dumpcap -i eth0 -w /path/to/output.pcap`
- Ring buffer example:
  For a long-term capture with a ring buffer to manage disk space:
  `sudo ionice -c2 -n0 nice -n-10 dumpcap -i eth0 -b filesize:100000 -b files:10 -w /path/to/output.pcap`
  - `-b filesize:100000`: Splits files into ~100 MB chunks (the value is in kB).
  - `-b files:10`: Limits the number of files to 10, overwriting the oldest file when the limit is reached.
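The ring buffer bounds disk usage by design: the worst-case footprint is simply the per-file size limit times the file count. A quick sketch (treating dumpcap's `filesize` kB unit as 1000 bytes, which is an assumption about the unit convention):

```python
# Worst-case disk footprint of a dumpcap ring buffer.
# dumpcap's -b filesize: value is in kB, so filesize:100000 is ~100 MB/file.
# Treating kB as 1000 bytes here is an assumption about the unit convention.

def ring_buffer_bytes(filesize_kb, num_files):
    """Upper bound on disk used before the oldest file is overwritten."""
    return filesize_kb * 1000 * num_files

total = ring_buffer_bytes(100_000, 10)
print(total)  # -> 1000000000 (~1 GB ceiling for the example above)
```

Sizing the ring this way lets you pick `filesize`/`files` from your available disk budget and expected capture rate, rather than discovering the limit when the volume fills.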
- Adjusting priority dynamically:
  You can adjust the priority of a running `dumpcap` process:
  - Find the process ID (PID): `ps aux | grep dumpcap`
  - Adjust CPU priority: `sudo renice -10 <PID>`
  - Adjust I/O priority: `sudo ionice -c2 -n0 -p <PID>`
- Best practices for using `dumpcap` with `nice` and `ionice`:
  - Always run `dumpcap` with appropriate permissions (usually `sudo`).
  - Use ring buffers (`-b`) for long captures to avoid disk space issues.
  - Use `nice` and `ionice` together when capturing to balance performance and system resource usage.
An additional helpful tool for packet capture workflows is `tcpreplay`.
- The `tcpreplay` suite is a collection of open-source utilities primarily designed for replaying, modifying, and analyzing previously captured network traffic at variable speeds. While not a packet capture tool per se (you use other tools like `tcpdump` or Wireshark to actually capture traffic), the `tcpreplay` utilities enable you to take existing network traces (in `.pcap` format) and send them back onto the network or through software systems at adjustable rates. This makes them invaluable for testing intrusion detection systems (IDS/IPS), firewalls, load balancers, switches, routers, and other network devices or software stacks to ensure they can handle specific traffic scenarios, including very high packet rates.
- Key Tools in the `tcpreplay` Suite:
  - `tcpreplay`: The primary utility, used to replay captured packets from a pcap file onto a live network interface. It allows control over the transmission speed, including the ability to send packets at the original capture speed, at a user-specified packets-per-second rate, or at maximum line rate for stress testing.
  - `tcpprep`: A pre-processing tool that classifies traffic and creates cache files used by `tcpreplay`. `tcpprep` can analyze a pcap and divide traffic into two sides, client and server. This classification enables more intelligent replay scenarios (e.g., simulating a client-server conversation accurately).
  - `tcprewrite`: A utility for rewriting various fields in the packets prior to replay. Common rewrites include:
    - Changing MAC or IP addresses
    - Modifying VLAN tags
    - Adjusting TCP/UDP ports
    - Recalculating checksums as needed
    This is critical if you want to replay traffic into a different network environment than where it was originally captured.
  - `tcpbridge`: Bridges network traffic from one interface to another, optionally applying transformations much like `tcprewrite`. This can be useful for testing inline devices when you don't have a captured file but want to filter or modify traffic on the fly.
  - `capinfos` (not part of the tcpreplay suite but often used): Although not included in the `tcpreplay` suite (it comes with Wireshark), `capinfos` is helpful for understanding details about a capture file before using it with tcpreplay. Within the `tcpreplay` suite itself, there is `tcpcapinfo`, which provides statistics and details about a pcap.
- Common Use Cases:
  - Performance & Load Testing: By taking a representative pcap file of your production network traffic, you can replay it at higher and higher speeds to determine the maximum throughput your device or application can handle before performance degrades.
  - Security Device Testing: Test IDS/IPS systems, firewalls, or other security appliances by feeding them previously captured attack traffic at different rates. This simulates realistic load conditions and validates detection capabilities.
  - Network Device Regression Testing: When you deploy a new firmware version on a router or switch, you might want to ensure it still handles the same workload. By replaying known packet captures, you can confirm the device's performance has not regressed.
  - Protocol Testing and Validation: Developers can use `tcpreplay` to re-inject complex traffic patterns to debug protocol stacks or to confirm that their software reacts correctly to certain kinds of malformed or edge-case packets.
- High-Rate Packet Replay: Key Considerations:
  When using `tcpreplay` at very high rates, several factors come into play:
  - Hardware Limitations: Ensure the system used for replaying packets has sufficient CPU, memory bandwidth, and a network interface card (NIC) capable of high packet-per-second (pps) output. High-end NICs or specialized capture/playback hardware may be required to reach line-rate replay at 10GbE or beyond.
  - NIC Configuration: Tuning your NIC, enabling features like RSS (Receive Side Scaling), disabling interrupt moderation (if it conflicts with timing accuracy), or using kernel bypass frameworks (like DPDK or PF_RING ZC for advanced scenarios) can improve performance.
  - Proper Use of Timing Options:
    - `--pps`: Set packets per second to a specific value.
    - `--topspeed`: Attempt to replay as fast as the system can send packets.
    - `--mbps`: Set the send rate in megabits per second.
    Adjusting these parameters ensures you hit the desired test load. For stress tests, `--topspeed` can push your system's limits.
  - Timestamp Accuracy: If your goal is to faithfully reproduce original capture timing, you must ensure `tcpreplay` can accurately respect packet timestamps. On busy systems or at very high replay rates, achieving true fidelity can be challenging.
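Before launching a long run, it helps to estimate how long a replay will take under the `--pps` and `--mbps` options. A quick sketch (the capture's packet count and byte size below are hypothetical):

```python
# Estimated wall-clock duration of a tcpreplay run under a fixed rate cap.
# The example capture (10M packets, 8 GB) is hypothetical.

def duration_at_pps(num_packets, pps):
    """Seconds to replay num_packets at a --pps cap."""
    return num_packets / pps

def duration_at_mbps(total_bytes, mbps):
    """Seconds to replay total_bytes at an --mbps cap."""
    return (total_bytes * 8) / (mbps * 1_000_000)

print(duration_at_pps(10_000_000, 1_000_000))  # -> 10.0 s at --pps=1000000
print(duration_at_mbps(8_000_000_000, 100))    # -> 640.0 s at --mbps=100
```

Note that whichever of the two caps is lower effectively governs the run, so checking both estimates tells you which option is actually rate-limiting your test.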
-
-
- Step-by-Step Usage Guide:
  - Obtain a Capture File: Use a tool like `tcpdump` to capture traffic:
    `tcpdump -i eth0 -w traffic.pcap`
    You now have `traffic.pcap` as your baseline file.
  - Pre-Process the Capture (Optional): If you need to separate client and server traffic for more realistic bi-directional replay, use `tcpprep`:
    `tcpprep --auto=bridge --pcap=traffic.pcap --cachefile=traffic.cache`
    This command analyzes `traffic.pcap` and produces `traffic.cache`, which classifies each packet. The `--auto=bridge` mode is a straightforward classification, but other modes exist depending on your network topology.
  - Rewrite Traffic as Needed: If the traffic was captured in an environment with different IP or MAC addresses than your test lab, use `tcprewrite`:
    `tcprewrite --infile=traffic.pcap --outfile=traffic_modified.pcap --dstipmap=192.168.1.0/24:10.0.0.0/24 --enet-dmac=aa:bb:cc:dd:ee:ff`
    This changes the destination IP subnet from `192.168.1.x` to `10.0.0.x` and sets the destination MAC address. The output is a new pcap you'll replay.
  - Replay the Traffic: Use `tcpreplay` to send the packets onto a given interface. For example:
    `tcpreplay --intf1=eth1 --cachefile=traffic.cache traffic_modified.pcap`
    If no cache file is needed, simply omit `--cachefile`. For controlling the replay rate, consider:
    `tcpreplay --intf1=eth1 --mbps=100 traffic_modified.pcap`
    This attempts to send at 100 Mbps. Or use:
    `tcpreplay --intf1=eth1 --pps=1000000 traffic_modified.pcap`
    to send at 1,000,000 packets per second. For maximum-speed testing:
    `tcpreplay --intf1=eth1 --topspeed traffic_modified.pcap`
    This will replay as fast as possible.
  - Increasing Flows Per Second with AF_XDP: As of version 4.5.1, you can achieve line-speed transmission on newer Linux kernels by using the `--xdp` options. No kernel modifications are required.
-
-
- Editing Traffic with `tcprewrite`: Modify `pcap` files to customize traffic for replay scenarios:
  - Change IP addresses:
    `tcprewrite --infile=traffic.pcap --outfile=rewritten.pcap --srcipmap=10.0.0.0/8:192.168.0.0/8`
  - Change MAC addresses:
    `tcprewrite --infile=traffic.pcap --outfile=rewritten.pcap --enet-smac=00:11:22:33:44:55 --enet-dmac=66:77:88:99:AA:BB`
  - Add VLAN tags:
    `tcprewrite --infile=traffic.pcap --outfile=rewritten.pcap --vlan-add=100`
- Optimizing Replay with `tcpprep`: Generate cache files to determine client/server traffic splitting:
  `tcpprep --auto=bridge --pcap=traffic.pcap --cachefile=cachefile.cache`
  Replay using the cache:
  `tcpreplay --intf1=eth0 --cachefile=cachefile.cache traffic.pcap`
- Bridging Traffic with `tcpbridge`: Pass live traffic between two network interfaces:
  `tcpbridge --intf1=eth0 --intf2=eth1`
-
- Analyzing `pcap` Files: Use `capinfos` for a quick summary:
  `capinfos traffic.pcap`
  Other analysis methods:
  - Monitor the receiving system's (e.g., IDS or firewall) logs and CPU usage.
  - Use `tcpdump` or `ifstat` on the receiver side to confirm the load.
  - Check packet counters, dropped packets, and device performance metrics.
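The numbers `capinfos` reports come straight from the `.pcap` file layout: a 24-byte global header (magic, version, snaplen, link type) followed by a 16-byte record header per packet. A minimal, self-contained sketch of that layout, building a tiny two-packet capture in memory and parsing it back:

```python
# Minimal classic .pcap reader illustrating the layout capinfos summarizes:
# a 24-byte global header, then 16-byte per-packet record headers.
import struct
import io

def count_packets(f):
    """Return (snaplen, packet_count) for a little-endian microsecond pcap."""
    magic, _v_major, _v_minor, _tz, _sig, snaplen, _link = struct.unpack(
        "<IHHiIII", f.read(24))
    assert magic == 0xA1B2C3D4, "not a little-endian microsecond pcap"
    count = 0
    while (hdr := f.read(16)):
        _sec, _usec, incl_len, _orig_len = struct.unpack("<IIII", hdr)
        f.read(incl_len)  # skip the captured packet bytes
        count += 1
    return snaplen, count

# Build a tiny in-memory pcap with two 4-byte packets, then parse it:
buf = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 128, 1)
for _ in range(2):
    buf += struct.pack("<IIII", 0, 0, 4, 60) + b"\x00" * 4
print(count_packets(io.BytesIO(buf)))  # -> (128, 2)
```

Note this handles only the classic little-endian `.pcap` format; `.pcapng` uses a different block-based structure.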
-
- Debugging and Testing:
  - Use `--stats=5` to print replay statistics every 5 seconds:
    `tcpreplay --intf1=eth0 --stats=5 traffic.pcap`
  - Use dry-run mode to verify configuration without sending packets:
    `tcpreplay --intf1=eth0 --dry-run traffic.pcap`
-
-
-
- Best Practices:
  - Start with a Lower Rate: Begin by replaying at a modest rate and increase incrementally. This helps identify the point at which device performance starts to degrade.
  - Use Realistic Traffic Mixes: A single pcap from a simple environment might not stress your device the same way as a real-world blend of protocols and packet sizes.
  - Document Your Test Setup: Note the versions of `tcpreplay` and NIC drivers, OS kernel parameters, and hardware specifications to ensure repeatability.
  - Leverage Multiple CPU Cores: Consider running multiple instances of `tcpreplay` in parallel on different CPU cores to achieve even higher aggregate replay rates, provided you have multiple NICs or a NIC with multi-queue support.
-
-
On-demand packet sniffing vs. continuous packet capture
-
On-demand packet sniffing:
-
Key considerations:
- Useful for immediate troubleshooting of outages or performance issues.
- Involves portable devices or software with limited storage attached when problems occur.
- Requires manual setup, potentially causing delays.
-
Accuracy:
May miss intermittent issues; critical attack stages can occur before capture starts.
-
Efficiency:
Less efficient due to manual intervention and limited storage capacity.
-
-
Continuous packet capture:
-
Key considerations:
- Uses a rotating buffer on large RAID arrays to record packets continuously.
- Enables investigation of past events (hours, days, or weeks ago).
- Ideal for cybersecurity threat analysis and root cause discovery.
-
Accuracy:
High, as it captures all network activity continuously.
-
Efficiency:
Efficient for comprehensive monitoring despite higher storage requirements.
-
-
-
Triggered packet capture
-
Key considerations:
- Captures packets only when specific conditions or alerts occur, such as when a cyber-threat is detected.
- Limited storage (often just a few GB of RAM) hampers thorough investigations.
- Ineffective against new, undefined threats (zero-day attacks).
-
Accuracy:
Compromised if triggers aren't predefined; critical data may be missed.
-
Efficiency:
Conserves storage but risks missing important events due to reliance on triggers.
-
-
Truncated packet capture
-
Key considerations:
- Stores only packet headers, discarding payloads to save space (also known as snapping or slicing).
- Simple truncation can omit vital information (e.g., exfiltrated files, TLS handshakes).
- SmartTruncation™:
- Selectively truncates encrypted payloads while retaining important data like TLS handshakes.
- Balances storage savings with minimal information loss.
-
Accuracy:
Reduced with basic truncation; improved with advanced methods, like SmartTruncation™ or with a custom-designed packet-truncation system.
-
-
Filtered packet capture
-
Key considerations:
- Employs filters to capture only relevant packets, maximizing storage usage.
- Uses complex L2/3/4 filters and application layer filters to reduce the total number of packets sent to the packet capture system.
- Focuses on critical services (e.g. finance networks) or excludes non-essential traffic (e.g., YouTube).
-
Accuracy:
High for selected data; may miss unfiltered or unexpected traffic.
-
Efficiency:
Enhanced by reducing unnecessary data capture, which extends recording periods.
-
-
Enterprise-class packet capture
-
Key considerations:
- Handles sensitive data securely with high reliability and uptime.
- Essential features:
- Continuous operation: Designed for 24/7/365 performance under real-world stress.
- Scalability: Operates across large, distributed networks.
- Reliability: High redundancy to prevent outages during critical times.
- Integration: Works with enterprise systems for authentication, authorization, and logging.
- Security: Restricts access to sensitive data through robust controls.
- Central Management: Offers centralized operation, management, and search capabilities.
- Up-to-Date: Maintains latest patches and security updates.
-
Accuracy:
Ensures comprehensive data capture with minimal loss.
-
Efficiency:
Maximizes through enterprise-grade infrastructure and centralized management.
-
References:
- Wireshark. (n.d.). dumpcap(1). Retrieved from https://www.wireshark.org/docs/man-pages/dumpcap.html
- Tcpdump & Libpcap. (n.d.). Tcpdump and Libpcap. Retrieved from https://www.tcpdump.org/.
- Stearns, Bill. Active Countermeasures. (2020, September 16). Improving Packet Capture Performance – 3 of 3. Retrieved from https://www.activecountermeasures.com/improving-packet-capture-performance-2-of-3/
- Endace. (n.d.). What is Network Packet Capture?. Retrieved from https://www.endace.com/learn/what-is-network-packet-capture
- Palo Alto Networks. (n.d.). Take a Threat Packet Capture. Retrieved from https://docs.paloaltonetworks.com/pan-os/11-1/pan-os-admin/monitoring/take-packet-captures/take-a-threat-packet-capture#id7e4dc92e-d3ce-4e2b-b180-8bf1566fb221
- Tcpreplay. (n.d.). Tcpreplay Overview. Retrieved from https://tcpreplay.appneta.com/wiki/overview.html
Accurate timestamping is critical for ULL packet capture, both when measuring latency and capturing packets with extremely high accuracy and precision.
-
System clock vs. NIC hardware clock:
-
System clock:
A computer's "system clock handles all synchronization within a computer system" $^{10}$. Listed below are key considerations of a computer's system clock:
- May not offer the precision required for ULL applications.
- Is subject to scheduling delays and context switches.
- Is an oscillating signal that alternates between zero and one.
- Regulates and synchronizes all CPU operations to ensure consistency.
A more general summary of a computer's system clock is described below:
-
Clock characteristics:
One Clock Period of a System Clock
Source: The Art of Assembly Language by Randall Hyde, 1st Edition
- Clock Period: The duration of one complete cycle of the clock (from 0 to 1 and back).
- Clock Frequency: The number of cycles completed in one second, measured in Hertz (Hz).
-
CPU clock frequency examples:
- A 1 MHz clock has a period of 1 µs.
- A 50 MHz clock has a period of 20 nanoseconds.
- Modern processors achieve very high clock frequencies, significantly improving performance.
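The examples above follow directly from the reciprocal relationship between clock period and clock frequency, sketched below (the 4 GHz figure is an illustrative modern-CPU value):

```python
# Clock period is the reciprocal of clock frequency: period = 1 / frequency.

def period_ns(freq_hz):
    """Duration of one clock cycle, in nanoseconds."""
    return 1e9 / freq_hz

print(period_ns(1_000_000))      # 1 MHz  -> 1000.0 ns (1 us)
print(period_ns(50_000_000))     # 50 MHz -> 20.0 ns
print(period_ns(4_000_000_000))  # 4 GHz  -> 0.25 ns (illustrative modern CPU)
```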
-
Rising and falling edges:
- CPU tasks start on either the rising edge (0 to 1) or the falling edge (1 to 0) of the clock.
-
Memory access synchronization:
- Reading or writing to memory aligns with the system clock.
- Access speed depends on the memory device and its compatibility with the CPU clock speed.
-
Memory access time:
- Older CPUs (e.g., 8088/8086) require multiple clock cycles for memory access.
- Modern CPUs achieve faster access due to improved design and reduced cycle requirements.
-
Memory read/write process:
- Read: The CPU places the address on the address bus; data is fetched and made available on the data bus.
- Write: The CPU sends the address and data, and the memory subsystem stores the data before the clock period ends.
-
Performance considerations:
- Faster memory subsystems align better with high-frequency CPUs.
- Slow memory can bottleneck CPU operations, causing system inefficiencies.
-
NIC hardware clock (PHC):
-
PTP Hardware Clock (PHC):
Linux has a standardized method for developing PTP user-space programs, synchronizing Linux with external clocks, and providing both kernel and user-space interfaces, enabling complete PTP functionality. Some key features include:
-
Basic Clock Operations:
- Set and get time.
- Atomic clock offset adjustment.
- Clock frequency adjustment.
-
Ancillary Features:
- Timestamp external events.
- Configure periodic output signals from user space.
- Access Low Pass Filter (LPF) functionality from user space.
- Synchronize Linux system time via the PPS subsystem.
On a lower level, PTP Hardware Clock (PHC) has been thoroughly described by NVIDIA, with descriptions of PHC taken from NVIDIA's documentation and shared below:
-
Overview of PHC:
The PHC is implemented in most modern network adapters as a free-running clock starting at 0 at boot time. It reduces the software overhead and errors associated with timestamp conversion by the Linux kernel.
-
Key features and benefits:
Timestamp conversion in Software
Source: NVIDIA - https://docs.nvidia.com/networking/display/nvidia5ttechnologyusermanualv10/real+time+clock
-
Direct timestamps:
- Eliminates the need for software-based timestamp translation.
- Applications can access accurate timestamps directly.
-
Increased accuracy:
- Hardware control loops are tighter, leading to faster stabilization and more accurate timestamps.
-
NIC awareness:
- Hardware is aware of the network time and capable of performing "time-aware" operations such as:
- Accurate scheduling
- Packet pacing
- Time-based steering
-
Real-time clock implementation:
-
NVIDIA ConnectX-6 Dx and above:
Timestamp conversion in Hardware (ConnectX-6 Dx)
Source: NVIDIA - https://docs.nvidia.com/networking/display/nvidia5ttechnologyusermanualv10/real+time+clock
- Hardware includes a true real-time clock in PTP format (UTC/TAI).
-
Clock Discipline:
PTP Hardware Clock Discipline
Source: NVIDIA - https://docs.nvidia.com/networking/display/nvidia5ttechnologyusermanualv10/real+time+clock
Disciplined Clock Behavior
Source: NVIDIA - https://docs.nvidia.com/networking/display/nvidia5ttechnologyusermanualv10/real+time+clock
- Performed by the PTP servo, which can be `ptp4l` or another commercial PTP stack.
- Uses standard Linux APIs or the POSIX Clock API for discipline.
-
-
Implications:
-
Packet timestamping:
- Packets are timestamped in the UTC/TAI timescale, eliminating software-based timestamp errors.
-
Clock drift correction:
- A digital phase-locked loop (PLL) in hardware feeds the local oscillator and adjusts for clock drift.
-
Error correction:
- PTP daemon calculates clock drift and error, adjusting the clock's frequency to maintain synchronization.
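The drift-correction loop described above can be sketched as a toy proportional-integral (PI) servo, the control structure commonly used by PTP software servos. This is a simplified model, not `ptp4l`'s actual implementation: the gains, the one-second update interval, and the simulated 50 ppm drift are all illustrative assumptions.

```python
# Toy PI servo disciplining a drifting clock, a simplified model of the
# frequency-adjustment loop a PTP servo performs. Gains and the simulated
# 50 ppm oscillator drift are illustrative assumptions.

KP, KI = 0.7, 0.3  # assumed proportional / integral gains

def discipline(true_drift_ppm, steps=50):
    """Simulate `steps` one-second servo updates; return final
    (time offset in us, applied frequency correction in ppm)."""
    freq_adj_ppm = 0.0  # correction currently applied to the oscillator
    integral = 0.0
    offset_us = 0.0     # accumulated time error, in microseconds
    for _ in range(steps):
        # Each second the clock gains (drift - correction) us of error:
        offset_us += true_drift_ppm - freq_adj_ppm
        integral += offset_us
        freq_adj_ppm = KP * offset_us + KI * integral
    return offset_us, freq_adj_ppm

offset, adj = discipline(50.0)
print(offset, adj)  # offset decays toward 0; adj converges toward 50 ppm
```

The servo drives the measured offset to zero while the integral term "learns" the oscillator's true frequency error, which is exactly why a disciplined PHC holds time well even between correction messages.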
Additionally, a PHC can be set up on KVM guests, such as AWS EC2 instances, where the PHC device improves time synchronization by closely tracking the hypervisor's host clock, reducing overhead compared to network-based NTP synchronization. The PHC device provides efficient, low-overhead time readings and enhances timekeeping accuracy in KVM guests. Steps to enable PHC on AWS KVM-based instances are provided below.
-
Pre-requisites:
- Load the `ptp_kvm` driver: `modprobe ptp_kvm`
- Verify the presence of the `/dev/ptp0` device file, which links to the PHC in KVM.
- Ensure the driver loads at boot by adding it to `/etc/modules` or the appropriate system configuration file.
-
-
Setting up NTP to use the PHC device:
- For `chrony`: `chrony` is an implementation of NTP and an alternative to `ntpd`. `chrony` has built-in support for a PHC device as a reference clock.
  - Configuration: Add the PHC device as a reference clock in `/etc/chrony/chrony.conf`:
    `refclock PHC /dev/ptp0 poll 0 delay 0.0004 stratum 1`
    - `poll 0`: Polls every second.
    - `delay`: Adjusts root delay to align with the host clock.
    - `stratum`: Ensures the device is treated appropriately in the synchronization hierarchy.
  - Post-setup: Restart the `chronyd` service and verify the PHC source using:
    `chronyc -n sources`
-
- For `ntpd`: `ntpd` is an OS daemon implementing NTP v4. `ntpd` is "capable of synchronizing time to a theoretical precision of about 232 picoseconds. In practice, this limit is unattainable due to quantum limits on the clock speed of ballistic-electron logic" $^{6}$.
  - Setup requirements:
    - Install the `linuxptp` package to bridge PTP and NTP using the `phc2sys` utility.
    - Configure `phc2sys` to synchronize the PHC to NTP's shared memory (SHM) driver.
  - Service configuration: Create a systemd service for `phc2sys` (e.g., `/etc/systemd/system/phc2sys.service`):

    [Unit]
    Description=Synchronize PTP hardware clock (PHC) to NTP SHM driver

    [Service]
    ExecStart=/usr/sbin/phc2sys -E ntpshm -s /dev/ptp0 -O 0 -l 5
    Restart=always

    [Install]
    WantedBy=ntp.service

    Then start and enable the service.
  - NTP configuration: Add the SHM driver to `/etc/ntp.conf`:

    server 127.127.28.0
    fudge 127.127.28.0 stratum 1

    Restart `ntpd` and verify the SHM source using: `ntpq -np`
-
-
-
-
Summary of system timestamp offsets of AWS Nitro System's ENA PHC and KVM PHC:
AWS provides a feature to enhance networking of EC2 instances called the Elastic Network Adapter (ENA). Enabling ENA for an EC2 instance is only available on AWS Nitro-based instances. Described below is a comparison of the system timestamp offset (of the client with respect to the server) between ENA PHC devices and regular KVM PHC devices:
-
Key highlights:
-
KVM PHC on AWS:
- Using the KVM PHC, system timestamp offset typically ranged between ± 25 µs with occasional spikes reaching +60 or -140 µs.
- Root dispersion, a field of an NTP packet that indicates how much error is added due to other factors, peaked for the KVM PHC at approximately 95 µs, with 95th percentiles (on 15-minute averages) not exceeding 20 µs.
- Frequency error ranged between -1 and -6.5 parts per million (ppm), meaning the clock drifts by between 1 µs and 6.5 µs per second if uncorrected.
-
Nitro ENA PHC:
- System timestamp offsets achieved higher precision, consistently staying within single-digit microseconds, rarely exceeding ± 2 microseconds.
- Frequency error for the ENA PHC was more stable, ranging between -3.9 and -5.2 ppm.
- Root dispersion was much lower, varying between approximately 0.6 and 1.8 microseconds, with only one instance exceeding 2 microseconds.
-
Comparison:
- The ENA PHC consistently outperformed the KVM PHC in terms of both precision and stability.
- The ENA PHC exhibited a smaller frequency range and lower offsets, making it the preferred option for highly accurate time synchronization.
- Both ENA PHC and regular KVM PHC devices provided similarly low root dispersion due to local low-latency characteristics.
-
AWS' Nitro system's ENA PHC is the most consistent and precise for time synchronization in AWS. The AWS KVM PHC, while less precise, still delivers excellent results and is a viable option where ENA PHC is unavailable, especially given its compatibility with a wide range of AWS instance types.
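The ppm figures above translate directly into accumulated time error: an uncorrected clock with a frequency error of f ppm gains or loses f microseconds every second. A quick sketch using the -5 ppm ballpark reported for these devices:

```python
# Frequency error in parts per million maps directly to time drift:
# an uncorrected f-ppm clock gains/loses f microseconds per second.

def drift_us(freq_error_ppm, elapsed_s):
    """Accumulated time error (us) after elapsed_s seconds, uncorrected."""
    return freq_error_ppm * elapsed_s

print(drift_us(-5.0, 1))     # -> -5.0 us after one second
print(drift_us(-5.0, 3600))  # -> -18000.0 us (-18 ms) after an hour
```

This is why continuous disciplining matters: even a few ppm of oscillator error accumulates to milliseconds within an hour if the servo stops correcting.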
-
-
-
Synchronization via PTP/PPS:
-
Precision Time Protocol (PTP):
More detailed information can be found in an earlier section of this paper, under the section on PTP titled, 2. Precision Time Protocol (PTP).
- Defined in IEEE 1588 standard.
- Achieves highly accurate clock synchronization throughout a network.
- Measures and corrects timing offsets to ensure accurate time synchronization.
- Exchanges messages between a GMC (i.e. the reference clock) and receiver clocks (Ordinary Clocks (OCs), Boundary Clocks (BCs), or Transparent Clocks (TCs)).
- Time synchronization to µs- or nanosecond-level precision.
- Applications in:
- Telecommunications
- Financial trading
- Industrial automation
- Electric/Power grid monitoring
-
Pulse Per Second (PPS):
More detailed information can be found in an earlier section of this paper, under the section titled, Pulse Per Second (PPS) synchronization.
-
How PPS signals operate:
- Precise timing pulses: PPS signals are electrical pulses occurring precisely at the start of each second.
- Clock synchronization: Used to synchronize clocks in electronic devices.
- Generated by GNSS receivers: Typically produced by GNSS receivers synchronized with atomic clocks in satellites.
- TTL-level pulses: Emit Transistor-Transistor Logic (TTL) level pulses with sharp rising edges.
- Transmission medium: Sent via coaxial cables or other mediums to connected devices.
- Internal clock alignment: Devices use the rising edge of the PPS signal to align their internal clocks.
-
Benefits and considerations of PPS synchronization:
-
Benefits:
- Simplicity: Easy to implement and integrate into systems.
- High accuracy: Offers sub-microsecond synchronization accuracy.
- Reliability: Less susceptible to network-induced delays compared to packet-based methods.
- Low-latency: Achieved through direct electrical connections.
-
Considerations:
- Signal integrity: Maintain clean signal edges and minimize noise.
- Propagation delays: Account for delays in cables (approximately 5 nanoseconds per meter).
- Electrical isolation: Prevent ground loops and electrical interference.
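The propagation-delay consideration above is easy to quantify: at roughly 5 ns per meter of coax, each receiver should offset its PPS edge by the cable length times that per-meter delay. A small sketch (the 5 ns/m constant is the approximate figure cited above; exact values depend on the cable's velocity factor):

```python
# PPS cable-delay compensation: signals propagate at roughly 5 ns per meter
# in coaxial cable (~0.7c). The exact value depends on the cable's velocity
# factor, so 5.0 here is the approximate figure, not a universal constant.

NS_PER_METER = 5.0

def cable_delay_ns(length_m):
    """Delay to subtract from a PPS edge arriving over length_m of coax."""
    return length_m * NS_PER_METER

print(cable_delay_ns(30))  # -> 150.0 ns for a 30 m run
```

For sub-100 ns timestamping targets, a 30 m run's 150 ns of uncompensated delay would dominate the error budget, which is why matched cable lengths (or explicit per-port delay compensation) are standard practice.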
-
Role of PPS in HFT:
- Server synchronization: Ensures all servers in a data center share the same time reference.
- Accurate logging: Provides precise timing for transaction records and event logging.
- Network device synchronization: Aligns switches and routers to minimize timing discrepancies.
- Dedicated hardware use: Employed in packet capture cards and time-sensitive applications.
- Critical for performance: Essential for the timing precision required in HFT operations.
-
-
-
Meta Time Card project:
Time Card device with a single GNSS receiver and a MAC
Source: https://github.com/opencomputeproject/Time-Appliance-Project/tree/master/Time-Card
Meta's Time Card project incorporates a chip-scale atomic clock (CSAC) on a PCB with a GNSS receiver to provide accurate GNSS-derived time for an NTP- or PTP-enabled network. The Time Card is an open-source PCIe solution for building an Open Time Server. It provides PTP network timestamping via a hardware/software bridge between its GNSS receiver and its atomic clock.
-
Time Card overview:
Open Time Server System Diagram integrating Meta's Time Card
Source: https://ieeexplore.ieee.org/document/9918379
- An open-source project by Meta provides a PCIe card for precise time synchronization.
- Implements PTP and offers nanosecond-level accuracy.
The general idea is that the Time Card is connected via PCIe to the Open Time Server and provides Time of Day (TOD) via a `/dev/ptpX` (e.g., `/dev/ptp0`) interface. Using a `/dev/ptpX` interface, `phc2sys` will continuously synchronize the PTP hardware clock (PHC) on the network card from the atomic clock on the Time Card. This provides a precision of < 1 µs.
For extremely high precision, the 1 PPS output of the Time Card should be connected to the 1 PPS input of the NIC. In this setup, `ts2phc` can provide < 100 nanoseconds of precision.
Features:
- Hardware timestamping with nanosecond-level accuracy.
- Integration with the open-source community for widespread adoption.
- Leap second awareness.
- GNSS input.
- Holdover.
- Time of Day (ToD).
- Optional Precision Time Measurement (PTM) protocol support.
- Integrates easily with ULL NICs such as the NVIDIA ConnectX-6 Dx.
-
Hardware:
-
The GNSS receiver can be a product from u-blox or any other vendor, as long as it provides PPS output and TOD in a suitable format.
- The recommended module is the u-blox RCB-F9T GNSS time module
- Security precautions should be taken for GNSS receivers to protect against jamming and spoofing attacks.
-
For the atomic clock, a high-quality oscillator (XO) should be used, such as an OCXO or TCXO.
-
The bridge between the GNSS receiver and the atomic clock can be implemented in software or hardware, with the hardware implementation being the main goal of the Time Card.
Time Card's implementation of its bridge in hardware
Source: https://opencomputeproject.github.io/Time-Appliance-Project/docs/time-card/introduction
-
-
Software:
-
Linux operating system with the `ptp_ocp` driver (included in Linux kernel 5.12 and newer). The driver may require the VT-d CPU flag to be enabled in the BIOS.
Time Card Linux driver $^{14}$:
The PCIe cards can be assembled even on a home PC, as long as it has enough PCIe slots available.
The Time Card driver is included in Linux kernel 5.15 or newer. Or, it can be built from the OCP GitHub repository on kernel 5.12 or newer. The driver will expose several devices, including the PHC clock, GNSS, PPS, and atomic clock serial:
$ ls -l /sys/class/timecard/ocp0/
lrwxrwxrwx. 1 root 0    Aug  3 19:49 device -> ../../../0000:04:00.0/
-r--r--r--. 1 root 4096 Aug  3 19:49 gnss_sync
lrwxrwxrwx. 1 root 0    Aug  3 19:49 i2c -> ../../xiic-i2c.1024/i2c-2/
lrwxrwxrwx. 1 root 0    Aug  3 19:49 pps -> ../../../../../virtual/pps/pps1/
lrwxrwxrwx. 1 root 0    Aug  3 19:49 ptp -> ../../ptp/ptp2/
lrwxrwxrwx. 1 root 0    Aug  3 19:49 ttyGNSS -> ../../tty/ttyS7/
lrwxrwxrwx. 1 root 0    Aug  3 19:49 ttyMAC -> ../../tty/ttyS8/
The driver also allows monitoring of the Time Card, the GNSS receiver, and the atomic clock status, as well as flashing a new FPGA bitstream, using the `devlink` CLI.
The last thing to do is to configure the NTP and/or PTP server to use the Time Card as a reference clock. To configure `chrony`, specify the `refclock` attribute:
$ grep refclock /etc/chrony.conf
refclock PHC /dev/ptp2 tai poll 0 trust
And enjoy a very precise and stable NTP Stratum 1 server:
$ chronyc sources
210 Number of sources = 1
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
#* PHC0                          0    0   377     1     +4ns[   +4ns] +/-   36ns
For the PTP server (for example, `ptp4u`), one will first need to synchronize the Time Card PHC with the NIC PHC. This can be done easily with the `phc2sys` tool, which syncs the clock values with high precision, usually staying within single digits of nanoseconds:
$ phc2sys -s /dev/ptp2 -c eth0 -O 0 -m
For greater precision, it's recommended to connect the Time Card and the NIC to PCIe lanes on the same CPU. For even greater precision, one can connect the PPS output of the Time Card to the PPS input of the NIC.
-
More detailed hardware and NIC information about the Time Card and Open Time Server can be found at OpenCompute's Time Appliance Project repository.
-
References:
- Winfield, Jack. "High Speed Packet Capture". 2. Precision Time Protocol (PTP). (2024, November 27). IE 421: High-Frequency Trading Tech, University of Illinois at Urbana-Champaign.
- Winfield, Jack. "High Speed Packet Capture". Pulse Per Second (PPS) synchronization. (2024, November 27). IE 421: High-Frequency Trading Tech, University of Illinois at Urbana-Champaign.
- The Linux Kernel GitHub repository. (2023, June 20). Documentation: driver-api: ptp PTP hardware clock infrastructure for Linux. Retrieved from https://github.com/torvalds/linux/blob/master/Documentation/driver-api/ptp.rst
- Gear, Paul. (2024, April 25). Paul's blog. VM timekeeping: Using the PTP Hardware Clock on KVM. Retrieved from https://www.libertysys.com.au/2024/04/vm-timekeeping-using-the-ptp-hardware-clock-on-kvm/
- NVIDIA. (2023, May 23). Real Time Clock. Retrieved from https://docs.nvidia.com/networking/display/nvidia5ttechnologyusermanualv10/real+time+clock
- NTPSec Documentation. (2024, November 25). ntpd - Network Time Protocol (NTP) Daemon. Retrieved from https://docs.ntpsec.org/latest/ntpd.html
- Gear, Paul. (2024, May 04). Paul's blog. AWS microsecond-accurate time: a second look. Retrieved from https://www.libertysys.com.au/2024/05/aws-microsecond-accurate-time-second-look/#test-process
- AWS. (2024, November). Enable enhanced networking with ENA on your EC2 instances. Retrieved from https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking-ena.html
- Arnold, Douglas. (2021, February 25). Meinberg Global blog. The Root of All Timing: Understanding root delay and root dispersion in NTP. Retrieved from https://blog.meinbergglobal.com/2021/02/25/the-root-of-all-timing-understanding-root-delay-and-root-dispersion-in-ntp/
- Hyde, Randall. (2003, September 1). The Art of Assembly Language. San Francisco: No Starch Press.
- Byagowi, A., Meier, S., Schaub, T., & Sotiropoulos, I. (2022). Time Card and Open Time Server. 2022 IEEE International Symposium on Precision Clock Synchronization for Measurement, Control, and Communication (ISPCS), Vienna, Austria, pp. 1-6. doi: 10.1109/ISPCS55791.2022.9918379
- OpenCompute Project's Time Appliance Project. (2024, September). Time Card. Retrieved from https://github.com/opencomputeproject/Time-Appliance-Project/blob/master/Time-Card/README.md
- OpenCompute Project's Time Appliance Project. (2023) Open Time Server. Retrieved from https://github.com/opencomputeproject/Time-Appliance-Project/tree/master/Open-Time-Server/
- Byagowi, Ahmad, & Obleukhov, Oleg. (2021, August 11). Engineering at Meta. Open-sourcing a more precise time appliance. Retrieved from https://engineering.fb.com/2021/08/11/open-source/time-appliance/
Dropped packets are common in high-rate packet capture, often caused by system limitations or inefficiencies in handling network traffic. These issues can compromise the integrity and accuracy of the captured data.
To understand how packet capture issues arise, and where optimizations apply, it is important to understand packet flow. Below is an overview of packet flow from the perspective of the Linux networking stack.
-
Overview of packet flow:
Source: The Path of a Packet Through the Linux Kernel, Technical University of Munich, Germany (Seminar IITM WS 23)
When a NIC receives packets, it utilizes ring buffers — circular buffers shared between the device driver and the NIC — to store these packets temporarily.
- The circular ring buffer can be seen located within the Linux driver in both the ingress and egress diagrams of a packet.
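A fixed-size ring of this kind can be modeled with a short sketch. This is illustrative Python, not kernel code: real rings hold DMA descriptors rather than objects, and a real NIC drops packets when no free descriptors remain (some capture rings instead overwrite the oldest data).

```python
# Illustrative model of a fixed-size ring buffer like the RX/TX rings
# shared between a NIC and its driver (Python objects stand in for
# DMA descriptors).

class RingBuffer:
    def __init__(self, size: int):
        self.slots = [None] * size
        self.head = 0   # producer (NIC) writes here
        self.tail = 0   # consumer (driver) reads here
        self.size = size

    def produce(self, pkt) -> bool:
        """NIC side: write a packet; drop it if the ring is full."""
        if (self.head + 1) % self.size == self.tail:
            return False            # ring full -> packet dropped
        self.slots[self.head] = pkt
        self.head = (self.head + 1) % self.size
        return True

    def consume(self):
        """Driver side: take the oldest packet, or None if empty."""
        if self.tail == self.head:
            return None
        pkt = self.slots[self.tail]
        self.tail = (self.tail + 1) % self.size
        return pkt

ring = RingBuffer(4)                       # 4 slots -> holds at most 3 packets
drops = [ring.produce(f"pkt{i}") for i in range(5)]
print(drops)                               # [True, True, True, False, False]
print(ring.consume())                      # pkt0
```

The full-ring branch is exactly where high-rate capture loses packets: if the driver cannot drain the ring as fast as the NIC fills it, `produce` fails and the packet is gone.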
The NIC writes incoming packets into the receive (RX) ring buffer, from which the device driver processes and transfers them into the kernel's networking stack. Conversely, outgoing packets are placed into the transmit (TX) ring buffer before being sent over the network.
As a higher-level summary of a 2023 research paper from the Technical University of Munich, Germany, titled The Path of a Packet Through the Linux Kernel, a packet's ingress and egress path through the Linux networking stack is described below:
-
Ingress (incoming or RX) path:
Source: The Path of a Packet Through the Linux Kernel, Technical University of Munich, Germany (Seminar IITM WS 23)
The ingress path involves receiving packets from the NIC and delivering them to user-space applications.
-
Ethernet Layer:
- The NIC copies the packet to system memory via DMA and notifies the kernel through an interrupt.
- The kernel creates an
sk_buff
to hold the packet and processes Ethernet headers, removing them before passing the packet up.- Socket buffers are an encapsulated structure, containing metadata and pointers to the actual packet data, enabling the kernel to manage and process network packets efficiently.
-
IP Layer:
- The packet enters
ip_rcv()
, where basic header validations occur (e.g., length, checksum). - The routing process determines the next steps:
-
Local Delivery: If the packet is addressed to the local machine, it proceeds to
ip_local_deliver()
for further processing. -
Forwarding: If not destined for the local machine, it is forwarded using
ip_forward()
. - Multicast: Special handling for multicast addresses.
- Fragmented packets are reassembled.
-
Local Delivery: If the packet is addressed to the local machine, it proceeds to
- The packet enters
-
Transport Layer:
-
TCP:
- Packets are validated, sequence numbers checked, and associated with the correct socket using
__inet_lookup_skb()
. - The TCP state machine manages connection states and queues the packet in the socket receive queue for user-space consumption.
- Packets are validated, sequence numbers checked, and associated with the correct socket using
-
UDP:
- Packets are routed to the appropriate socket and queued after checksum validation.
-
TCP:
-
Socket Layer:
- User-space applications read packets using system calls like
read()
orrecvfrom()
. - The kernel dequeues packets from the socket receive queue, applies security policies, and copies data to user-space buffers.
- User-space applications read packets using system calls like
-
-
Egress (outgoing or TX) path:
Source: The Path of a Packet Through the Linux Kernel, Technical University of Munich, Germany (Seminar IITM WS 23)
The egress path describes the process of sending packets from an application to the NIC.
-
Socket Layer:
- Packets originate in user space and are passed to the kernel through system calls like
write()
orsendto()
. - The
sock_sendmsg()
function processes the packet metadata, applies security filters (e.g., SELinux), and forwards the packet to the transport layer.
- Packets originate in user space and are passed to the kernel through system calls like
-
Transport Layer:
-
TCP:
- Handles connection setup, segmentation of data into packets based on Maximum Segment Size (MSS), and retransmission logic.
- Builds TCP headers, updates metadata, and appends the packet to the socket write queue.
-
UDP:
- Constructs simpler headers without connection state management, optionally batching datagrams for performance.
- Routes packets using
ip_route_output_flow()
and appends data to thesk_buff
.
-
TCP:
-
IP Layer:
- The kernel determines routing (e.g., via the Forwarding Information Base or cached routes).
- IP headers are constructed, and hooks like
NF_INET_POST_ROUTING
for Netfilter may be applied. - Packets are fragmented if they exceed the Maximum Transmission Unit (MTU).
-
Ethernet Layer:
- The
sk_buff
metadata is updated with the MAC header and passed through the queuing discipline (qdisc). - After validation (e.g., checksums, VLAN tagging), packets are added to the NIC's transmit (
TX
) buffer. - The NIC's DMA engine transfers the packet from memory to the network.
- The
-
-
Dropped packets and inefficiencies:
-
Causes:
-
Kernel buffer overflows:
-
Issues:
There are some issues with excessively large kernel buffers. These issues are described below:
- Excessively large kernel buffers can increase latency, cause packet jitter, and reduce network throughput.
- A.K.A. "bufferbloat", this issue occurs when an overallocated buffer space fills up faster than packets can be processed, causing delays in packet transmission.
- An overflow of Linux's circular ring buffer simply overwrites existing data.
- Insufficient kernel buffer sizes fail to accommodate bursts of high traffic, resulting in packet loss.
- Larger or dynamically adjusted kernel buffer sizes can mitigate this issue but require careful tuning to balance memory usage and performance.
- TCP congestion control algorithms can back off after drops and take time to reset before the TCP connection ramps back up to speed and re-fills the packet buffers.
- TCP congestion control has each (IP) source use returning ACKs as signals that it is safe to transmit, probing how much capacity is available in the network so that TCP knows how many packets can safely be in transit.
- The "ping of death" attack, a type of DoS attack, sends an IP datagram (e.g., ICMP, TCP, or UDP over IPv4/IPv6) larger than the maximum allowable size which, upon reassembly, causes buffer overflows that can potentially crash or freeze systems, or allow code injection.
-
-
CPU bottlenecks:
- High CPU usage occurs when the system struggles to process packets in real time.
- Packet capture involves copying data, filtering, and possibly forwarding, all of which consume CPU resources.
- Multi-core CPUs and parallel processing techniques can alleviate bottlenecks by distributing the workload.
- However, having 50 CPU cores does not mean all 50 should be used for packet processing.
-
Interrupt overhead:
Frequent CPU interrupts for each incoming packet can overwhelm the computer system.
-
-
Context switching with the OS:
Efficient packet capture is hindered by frequent transitions between kernel space (where the OS operates) and user space (where applications operate). Each transition involves context switching, which adds latency and processing overhead.
-
Kernel/user space transitions
Captured packets typically reside in kernel memory and must be transferred to user-space applications for processing.
-
System calls (syscalls):
Each transfer involves system calls, data copying, and synchronization, all of which add overhead.
-
Monitoring syscalls with
strace
:strace
is a command-line utility that tracks system calls made by a program and the signals it receives. It intercepts and logs syscalls, showing the exact sequence, arguments passed, and results returned for each syscall during the execution of a program.-
How
strace
helps understand "when and where" syscalls happen:-
When:
By displaying timestamps or execution order,
strace
shows the timing and sequence of syscalls, helping to identify their frequency and duration. -
Where:
By tracing the program’s execution and associating syscalls with specific parts of the code, it highlights the locations in the code triggering syscalls.
-
-
Using
strace
to keep syscalls to a minimum:-
Identifying unnecessary syscalls:
-
strace
reveals redundant or repetitive syscalls (e.g., multiple calls to open the same file or excessive I/O operations) that can be optimized or eliminated.
-
-
Improving I/O Operations:
- By analyzing syscalls related to I/O (like
read
,write
,poll
), developers can identify inefficiencies, such as small and frequent reads or writes, and batch them to reduce syscall overhead.
- By analyzing syscalls related to I/O (like
-
Optimizing Context Switching:
- Since syscalls involve context switching,
strace
can pinpoint hotspots where context switching occurs frequently. This insight can guide efforts to minimize such operations by caching data, reusing resources, or consolidating syscalls.
- Since syscalls involve context switching,
-
Fine-Tuning System Interaction:
-
strace
can help identify if a program unnecessarily interacts with the kernel (e.g., excessive file checks withstat
) and suggest ways to avoid kernel calls unless absolutely necessary.
-
Example:
If a program makes multiple
open
andclose
calls for the same file,strace
will log these calls. You can then modify the program to open the file once, reuse the file descriptor, and close it only when done, reducing the syscall count.By systematically using
strace
to profile and optimize your application, you can minimize syscalls, enhance performance, and reduce CPU overhead. -
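The open-once/reuse-the-descriptor optimization described above can be sketched as follows. Syscall counts are tallied by hand in Python for illustration; in practice, `strace -c ./program` would show the reduction directly.

```python
# Sketch of the fd-reuse optimization: instead of an open/read/close
# syscall triple per record, open once, reuse the descriptor, and close
# when done. Counts here are tallied manually for illustration.

import os, tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 100)
os.close(fd)

def naive_reads(n: int) -> int:
    """open/read/close on every iteration: 3 syscalls per record."""
    calls = 0
    for _ in range(n):
        f = os.open(path, os.O_RDONLY); calls += 1
        os.read(f, 10);                 calls += 1
        os.close(f);                    calls += 1
    return calls

def batched_reads(n: int) -> int:
    """One open and one close shared by all reads: n + 2 syscalls."""
    f = os.open(path, os.O_RDONLY); calls = 1
    for _ in range(n):
        os.pread(f, 10, 0); calls += 1   # positioned read, no seek syscall
    os.close(f); calls += 1
    return calls

print(naive_reads(10), batched_reads(10))  # 30 12
```

Running both variants under `strace -c` would show the same ratio in the `open`/`close` columns, confirming where the saved context switches come from.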
-
-
-
-
Best practices for optimizing
tcpdump
:-
Kernel parameter optimization:
-
Increase buffer sizes:
Adjust socket buffer sizes to accommodate bursts of traffic.
sysctl -w net.core.rmem_max=33554432 sysctl -w net.core.wmem_max=33554432
-
Adjust memory limits:
Ensure the system allows sufficient memory for network buffers.
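From user space, a capture application can also request a larger per-socket receive buffer. Below is a minimal sketch; the 4 MB figure is an arbitrary illustration, and Linux caps the grant at `net.core.rmem_max` (the sysctl tuned above) unless the privileged `SO_RCVBUFFORCE` option is used.

```python
# Minimal sketch: request a larger per-socket receive buffer with
# SO_RCVBUF. The kernel caps the grant at net.core.rmem_max.

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)

# Linux doubles the requested value to account for bookkeeping overhead,
# and silently caps it, so read back what was actually granted.
granted = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(granted)
s.close()
```

Reading the value back matters: a silent cap at the sysctl limit is a common reason "increasing the buffer" appears to have no effect.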
-
Debugging oversized buffers:
-
Using the `ping` utility.
From Wikipedia:
The size of a buffer serving a bottleneck, i.e. an oversized buffer, can be measured using the `ping` utility.
First, the other host should be pinged continuously; then, a several-seconds-long download from it should be started and stopped a few times.
By design, the TCP congestion avoidance algorithm will rapidly fill up the bottleneck on the route. If downloading (and uploading, respectively) correlates with a direct and important increase of the round trip time reported by ping, then it demonstrates that the buffer of the current bottleneck in the download (and upload, respectively) direction is bloated.
Since the increase of the round trip time is caused by the buffer on the bottleneck, the maximum increase gives a rough estimation of its size in milliseconds $^{5}$.
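That estimate can be turned into a quick back-of-the-envelope calculation: the sustained RTT increase under load approximates the queueing delay at the bottleneck, so buffer size ≈ delay × bottleneck rate. The RTTs and link rate below are made-up illustration values.

```python
# Rough bufferbloat estimate from the ping experiment described above:
# buffer_bytes ≈ (loaded RTT - idle RTT) * bottleneck rate.

def buffer_estimate_bytes(idle_rtt_ms: float, loaded_rtt_ms: float,
                          bottleneck_bps: int) -> int:
    queue_delay_s = (loaded_rtt_ms - idle_rtt_ms) / 1000.0
    return int(queue_delay_s * bottleneck_bps / 8)   # bits -> bytes

# 20 ms idle vs 520 ms under load on a 50 Mbit/s link:
print(buffer_estimate_bytes(20, 520, 50_000_000))    # 3125000 (~3 MB)
```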
-
Using
traceroute
.From Wikipedia:
In the previous example, using an advanced `traceroute` tool instead of simple pinging (for example, MTR) will not only demonstrate the existence of a bloated buffer on the bottleneck, but will also pinpoint its location in the network.
`traceroute` can show the existence and pinpoint the location of a bloated buffer by displaying the route (path) and measuring transit delays of packets across the network. The history of the route is recorded as round-trip times of the packets received from each successive host (remote node) in the route (path) $^{5}$.
An online version of the `traceroute` tool can be used at https://traceroute-online.com, which provides an advanced visual that maps and enriches the `traceroute` output. It also provides the Autonomous System Number (ASN), which identifies a network block on the internet, and geolocation data, which provides approximate geographic coordinates of an IP address (country, city, and sometimes even longitude and latitude). Thus, https://traceroute-online.com can be an additional helpful tool for network security and traffic analysis when maintaining a packet capture system.
-
-
-
Other solutions:
-
Zero-copy techniques:
- Bypass traditional data copying between kernel and user space.
- Use technologies like Linux’s
mmap
, which maps kernel memory directly to user-space memory, enabling applications to access packet data without additional copies.
-
Packet capture libraries:
These packet capture libraries are described in further detail in the following section, Specialized packet capture techniques.
- Tools like DPDK (Data Plane Development Kit) and PF_RING optimize packet capture by minimizing context switches and maximizing throughput.
- These libraries often leverage polling-based approaches to avoid interrupt overhead altogether.
-
Dedicated hardware:
- Specialized network interface cards (NICs) with on-board processing capabilities can handle packet filtering and queuing, reducing the load on the CPU.
-
-
-
NIC (or kernel and CPU) parameter optimization:
-
(NIC) Checksum offloading:
-
Check offload features
Use `ethtool` to inspect checksum offloading and see whether it needs to be disabled:
ethtool -k <interface>
where
<interface>
iseth0
,wlan0
, etc. This will show you a list of offload features, including:-
rx-checksumming
: Receive checksum offloading -
tx-checksumming
: Transmit checksum offloading
Example output:
rx-checksumming: on tx-checksumming: on
-
-
Test network behavior
If you suspect issues with kernel checksum offloading (e.g., packet corruption or checksum errors in tools like Wireshark or tcpdump), note the current state of checksum offloading.
-
Disable kernel checksumming (if needed) To disable unnecessary offloading features that may interfere with packet capture, run:
ethtool -K <interface> rx off tx off
This command disables both receive and transmit checksum offloading for the specified interface.
-
Verify Changes
Run the
ethtool -k <interface>
command again to confirm thatrx-checksumming
andtx-checksumming
are now set tooff
. -
Testing without offloading
After disabling checksumming, test network traffic again to determine if the issue is resolved. If disabling checksum offloading resolves your issue, it might indicate a problem with your network driver or hardware.
-
Permanent changes
To make this change persistent across reboots, you will need to include the
ethtool
command in a startup script or network configuration file, asethtool
settings are typically not retained after a reboot.For example, you can add the command to a systemd service or
/etc/network/interfaces
(on Debian-based systems).
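What the NIC computes when `tx-checksumming` is on is the 16-bit ones'-complement Internet checksum (RFC 1071). A sketch of that calculation in Python, not tied to any particular NIC, shows what is being offloaded:

```python
# Sketch of the Internet checksum (RFC 1071) that checksum offloading
# moves from the CPU onto the NIC.

def internet_checksum(data: bytes) -> int:
    if len(data) % 2:                    # pad odd-length input with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold carry back in
    return ~total & 0xFFFF               # ones' complement of the sum

payload = b"\x45\x00\x00\x28"            # first 4 bytes of a sample IPv4 header
csum = internet_checksum(payload)
print(hex(csum))                         # 0xbad7

# Verification property: data that includes its own correct checksum
# field sums to zero.
print(internet_checksum(payload + csum.to_bytes(2, "big")))  # 0
```

The zero-sum property is how a receiver (or Wireshark) validates a packet, and why captures taken with offloading enabled often show "incorrect" checksums: the NIC fills the field in after the capture tap point.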
-
-
(NIC/CPU/kernel) Interrupt Request (IRQ) coalescing:
Interrupts are signals sent by hardware devices (NICs, storage drives, etc.) to the CPU to request attention or processing. By default, interrupts might all be handled by a single CPU core, significantly increasing latency, so unnecessary interrupts should be avoided as much as possible.
- This technique batches multiple packets before notifying the kernel via an interrupt, reducing interrupt overhead.
- However, disabling IRQ coalescing may be necessary in low-latency applications where delay is critical.
-
(CPU) IRQ steering:
Distributing interrupts across multiple CPU cores prevents a single core from becoming a bottleneck.
IRQ steering carefully controls which cores will process which interrupts by distributing hardware interrupts across multiple CPUs or processor cores in a computer system. Some key benefits of IRQ steering are:
- Prevents a single CPU core from bottlenecking with high interrupt loads, such as servers handling many network connections.
- Enables efficient use of multi-core CPUs.
Key considerations:
-
For interrupts that are not latency-critical, steer them to a designated "dump core".
- Assign non-time-sensitive interrupts to the "dump core".
- This frees the remaining cores for latency-critical tasks.
-
CPU core affinity/`isolcpus`:
- `isolcpus` is a Linux kernel boot parameter that defines a set of CPUs to be isolated from the kernel's scheduler, preventing the scheduler from automatically placing processes on those CPUs.
Especially helpful for ULL environments to designate CPUs to run specific tasks without interference from other processes.
-
Attempt to ensure that the code for a particular thread fits in the L1 instruction cache to avoid costly cache misses.
-
Usage:
To isolate CPUs, add the
isolcpus
parameter to your kernel boot configuration with a list of CPUs you wish to isolate. The list can include individual CPU numbers or ranges. For example:isolcpus=1,2,4-6
This command isolates CPUs 1, 2, 4, 5, and 6. After setting this parameter and rebooting, the kernel's scheduler will not assign processes to these CPUs automatically. Recall that isolating CPUs means the scheduler won't manage them, so you need to manually assign processes to these CPUs. This can be done using commands like
taskset
(a command used to retrieve or set a process's CPU affinity, given its PID) or by configuring applications to set CPU affinity.More information can be found on the Linux kernel's official documentation page:
Linux kernel documentation on the `isolcpus` parameter
Source: https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
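Beyond `taskset`, a process can manage its own affinity directly. A minimal sketch using the Linux-only `os.sched_*` calls: it reads the current CPU set, pins the process to one of its allowed CPUs (chosen arbitrarily here), then restores the original mask.

```python
# Minimal sketch of taskset-style CPU pinning from inside a process,
# using the Linux-only os.sched_getaffinity / os.sched_setaffinity calls.

import os

allowed = os.sched_getaffinity(0)      # 0 = the calling process
print(sorted(allowed))

# Pin ourselves to a single allowed CPU, then restore the original mask.
one_cpu = {min(allowed)}
os.sched_setaffinity(0, one_cpu)
assert os.sched_getaffinity(0) == one_cpu
os.sched_setaffinity(0, allowed)       # restore
```

On an `isolcpus`-isolated core, this is exactly the manual placement step the documentation describes: the scheduler will not move work onto the core, so the application must pin itself there.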
-
Using
irqbalance
for IRQ steering:- Tools like
irqbalance
on Linux can automatically optimize IRQ distribution -
irqbalance
is a CLI tool that distribute hardware interrupts across processors on a multiprocessor system to improve performance, running as a daemon by default.
-
-
(NIC) Receive-Side Scaling (RSS):
RSS is a networking technology designed to distribute incoming network traffic across multiple CPU cores in a system. Without RSS, all network packets might be processed by a single CPU core, potentially creating a performance bottleneck on systems with high network traffic.
RSS uses hashing algorithms (typically based on packet headers such as source/destination IP addresses and ports) to ensure that packets belonging to the same flow are directed to the same CPU core. This consistency prevents issues like out-of-order packet processing while improving overall throughput by leveraging multiple CPU cores.
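The flow-hashing idea can be illustrated with a simplified sketch of the Toeplitz hash commonly used for RSS. This is an illustration, not a NIC's exact implementation: real hardware indexes an indirection table with the hash's low-order bits, and the key and flow tuple below are arbitrary.

```python
# Simplified sketch of the Toeplitz hash used by RSS: for every set bit i
# of the input, XOR in the 32-bit window of the key starting at bit i.

import os

def toeplitz_hash(key: bytes, data: bytes) -> int:
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i in range(len(data) * 8):
        if data[i // 8] & (0x80 >> (i % 8)):           # bit i of the input set?
            result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return result

# Hash a (src ip, dst ip, src port, dst port) tuple and pick an RX queue.
key = os.urandom(40)                                   # NICs use a 40-byte key
flow = (bytes([10, 0, 0, 1, 10, 0, 0, 2])
        + (2794).to_bytes(2, "big") + (1766).to_bytes(2, "big"))
queue = toeplitz_hash(key, flow) % 4                   # e.g. 4 RSS queues

# The hash is deterministic, so the same flow always lands on the same
# queue -- this is what preserves per-flow packet ordering.
assert queue == toeplitz_hash(key, flow) % 4
```

Because the hash depends only on the flow tuple, reordering can only happen across flows, never within one, which is the property RSS needs to scale safely across cores.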
-
How does
ethtool
help with RSS?ethtool
is a powerful tool for managing and inspecting the network interface card (NIC) settings, including RSS. Here’s how it helps with RSS:-
Check if the NIC supports RSS and the number of available hardware queues:
ethtool -l <interface>
If RSS is supported, the NIC can use multiple queues for packet processing.
Example output:
Channel parameters for eth0: Pre-set maximums: RX: 8 TX: 8 Other: 0 Combined: 8 Current hardware settings: RX: 4 TX: 4 Other: 0 Combined: 4
- Pre-set maximums: Maximum supported queues.
- Current hardware settings: Queues currently in use.
-
-
Enable/Modify RSS queues:
ethtool
allows the modification of the number of receive (rx
) and transmit (tx
) queues based on the number of CPU cores.Adjust the number of active RSS queues to match the number of CPU cores by running:
ethtool -L <interface> combined <num-queues>
Example: Set 8 queues for
eth0
:ethtool -L eth0 combined 8
-
View RSS hash key and indirection table:
The RSS hash key and indirection table determine how packets are distributed to queues:
ethtool -x <interface>
Example output:
RX flow hash indirection table for eth0 with 4 RX queues: 0: 0 1: 1 2: 2 3: 3 RSS hash key: a1 b2 c3 d4 e5 f6 ...
-
Set RSS hash parameters:
Configure which parts of a packet are hashed for RSS:
ethtool -N <interface> rx-flow-hash udp4 sdfn
This
ethtool
example modifies the RSS hash settings for UDP IPv4 packets. -
How RSS works with queues:
- Each RSS queue corresponds to a specific CPU core (or set of cores) to handle incoming packets.
- By inspecting or modifying RSS queue settings with
ethtool
, network performance can be optimized through:- An increase in the number of queues to utilize more CPU cores.
- Balancing of queues across cores using IRQ steering (
/proc/irq
).
-
Example workflow:
-
Check NIC RSS support:
ethtool -l eth0
-
View and tune active queues:
ethtool -L eth0 combined 8
-
Inspect indirection table:
ethtool -x eth0
-
Balance queues across cores (using IRQ steering):
echo 1 > /proc/irq/45/smp_affinity
The Linux Kernel documentation describes
/proc/irq/45/smp_affinity
as:/proc/irq/IRQ#/smp_affinity
and/proc/irq/IRQ#/smp_affinity_list
specify which target CPUs are permitted for a given IRQ source. It’s a bitmask (smp_affinity
) or cpu list (smp_affinity_list
) of allowed CPUs. It’s not allowed to turn off all CPUs, and if an IRQ controller does not support IRQ affinity then the value will not change from the default of all cpus $^{13}$.
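A small helper clarifies the two `/proc/irq` formats quoted above. This is a hypothetical utility (not part of any tool) converting between the cpu-list form (`smp_affinity_list`) and the hex bitmask form (`smp_affinity`).

```python
# Hypothetical helpers: convert between the cpu-list (smp_affinity_list)
# and hex-bitmask (smp_affinity) forms used under /proc/irq/IRQ#/.

def cpus_to_mask(cpus) -> str:
    """CPU list -> hex bitmask string (bit c set for each CPU c)."""
    mask = 0
    for c in cpus:
        mask |= 1 << c
    return format(mask, "x")

def mask_to_cpus(mask_hex: str):
    """Hex bitmask string -> sorted CPU list."""
    mask = int(mask_hex, 16)
    return [c for c in range(mask.bit_length()) if mask & (1 << c)]

print(cpus_to_mask([0]))      # "1" -> what `echo 1 > .../smp_affinity` sets
print(cpus_to_mask([2, 3]))   # "c"
print(mask_to_cpus("c"))      # [2, 3]
```

So the `echo 1 > /proc/irq/45/smp_affinity` example above pins IRQ 45 to CPU 0 only; writing `c` instead would allow CPUs 2 and 3.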
-
The combination of RSS and
ethtool
ensures that network traffic is efficiently distributed and processed, optimizing the system’s performance. -
-
Optimizing kernel scheduling interrupts:
-
`NO_HZ`:
`NO_HZ` is a Linux kernel configuration option and boot parameter that reduces the number of scheduling-clock interrupts, also known as "scheduling-clock ticks" or simply "ticks". The reduction is achieved by allowing the kernel to avoid periodic timer ticks on idle cores, or even on active cores in some configurations (like CONFIG_NO_HZ_FULL=y).
Reducing ticks helps minimize OS jitter and improve energy efficiency.
From Linux's official documentation on timers $^{19}$:
There are three main ways of managing scheduling-clock interrupts (also known as "scheduling-clock ticks" or simply "ticks"):
-
Never omit scheduling-clock ticks (
CONFIG_HZ_PERIODIC=y
orCONFIG_NO_HZ=n
for older kernels). You normally will -not- want to choose this option. -
Omit scheduling-clock ticks on idle CPUs (
CONFIG_NO_HZ_IDLE=y
orCONFIG_NO_HZ=y
for older kernels). This is the most common approach, and should be the default. -
Omit scheduling-clock ticks on CPUs that are either idle or that have only one runnable task (
CONFIG_NO_HZ_FULL=y
). Unless you are running realtime applications or certain types of HPC workloads, you will normally -not- want this option.
- Linux documentation: NO_HZ: Reducing Scheduling-Clock Ticks
More information can be found on Linux's official GitHub repository, under the timers' documentation.
-
-
chrt
:chrt
is a command-line tool, working at the process level, used to manage process scheduling attributes and prioritize tasks, indirectly reducing the effect of scheduling interrupts on critical processes.By setting real-time scheduling policies with
chrt
(e.g.,SCHED_FIFO
orSCHED_RR
), you can give processes predictable and high-priority access to the CPU.While setting real-time scheduling policies doesn't directly disable scheduling interrupts, it can minimize their impact on critical processes by ensuring those processes get priority over others. Therefore,
chrt
influences how interrupts and other processes affect critical tasks rather than outright disabling scheduling interrupts.-
Example: Setting real-time scheduling for a program
Assume you have a program called
critical_program
that needs to run with the FIFO real-time scheduling policy and a priority of 50.-
Step-by-step:
-
Run the program with
chrt
:sudo chrt -f 50 ./critical_program
-
-f
specifies the SCHED_FIFO scheduling policy. -
50
is the real-time priority (ranges from 1 [lowest] to 99 [highest]). -
./critical_program
is the program to execute.
-
-
Check the scheduling policy and priority of a running process:
If
critical_program
is already running and its PID is1234
, you can inspect its scheduling policy and priority:chrt -p 1234
Output:
pid 1234's current scheduling policy: SCHED_FIFO pid 1234's current scheduling priority: 50
-
Change the scheduling policy and priority of an existing process:
If
critical_program
is running and you want to change its scheduling policy and priority to SCHED_RR with priority 60:sudo chrt -r -p 60 1234
-
-r
specifies the SCHED_RR (round-robin) scheduling policy. -
-p
modifies the attributes of an existing process by its PID (1234
in this case). -
60
is the new priority.
-
-
-
Key considerations:
-
Root privileges: Modifying real-time scheduling policies requires root permissions, hence the use of
sudo
. - Real-time scheduling warning: Misusing real-time scheduling (e.g., assigning too many processes high priority) can starve non-critical processes, potentially making the system unresponsive.
-
Root privileges: Modifying real-time scheduling policies requires root permissions, hence the use of
-
-
-
-
NUMA optimizations:
Typical NUMA architecture
Source: "The Effect of NUMA Tunings on CPU Performance" - Christopher Hollowell et al 2015 J. Phys.: Conf. Ser. 664 092010
For NUMA aware OSs, the benefit of NUMA is that each CPU has its own local RAM that it can effectively access independently of other CPUs in the system.
-
Potential issues:
- Memory and PCIe latencies across CPU cores:
- In NUMA architectures, different CPU cores experience varying latencies when accessing memory and PCIe devices.
- Variance in latencies arises because each processor has its own local memory, leading to faster access times compared to non-local memory.
-
Optimizations when using NUMA:
-
Rely on Single-Producer Single-Consumer (SPSC) data structures and lockless queues:
- Utilizing SPSC data structures and lock-free queues can minimize latency.
- Lock-free algorithms, such as unbounded SPSC queues, reduce synchronization delays, thereby enhancing producer-consumer coordination.
- Locks introduce latency due to the overhead of managing access control.
- In contrast, lock-free structures allow for more efficient data exchange between threads.
- Multiple-Producer Single-Consumer (MPSC) or Multiple-Producer Multiple-Consumer (MPMC) configurations often require locks, which can add latency.
- Therefore, SPSC setups are preferable when aiming to reduce latency.
-
Single vs. multi-socket configurations:
- In dual-socket systems, it's beneficial to assign one socket to handle operating system tasks, keeping the second socket's cache "cleaner" for latency-critical processes.
- This dual-socket approach helps maintain cache efficiency and reduces memory access latencies.
- Having a large number of cores doesn't necessarily mean all should be utilized simultaneously.
- Effective NUMA optimization involves strategic core usage to prevent resource contention and maintain optimal performance.
- Memory/CPU pinning with `numactl`:
  `numactl` handles memory and CPU pinning by providing control over the placement of processes and memory allocations on specific NUMA nodes. These NUMA scheduling and memory placement policies avoid the latency associated with accessing memory on a remote node after a process has been moved across nodes. Here's how it works:
  - CPU pinning:
    `numactl` can pin a process or thread to specific CPUs that belong to a particular NUMA node. By restricting execution to CPUs on the same node, it prevents the kernel from moving the process to another node. The example below binds the process to CPUs 0 through 3:
    numactl --physcpubind=0-3 ./my_program
  - Memory pinning:
    `numactl` ensures that memory allocations for a process are made from a specific NUMA node. This avoids situations where memory is allocated on one node but accessed by a CPU on another, which would result in cross-node communication and increased latency. The example below ensures that all memory allocations for the process are made from NUMA node 0:
    numactl --membind=0 ./my_program
  - Combined (memory and CPU) pinning:
    `numactl` can bind both CPUs and memory together, ensuring that a process's threads run on CPUs of a specific NUMA node and that memory is allocated on the same node. The example below binds the process to NUMA node 0 for both computation and memory:
    numactl --cpunodebind=0 --membind=0 ./my_program
  - Avoiding cross-node access:
    By enforcing such bindings:
    - The kernel cannot move the process to a CPU on another NUMA node.
    - Memory accesses remain local to the node's controller, avoiding the delay of fetching memory across nodes.
- HugePages:
  HugePages is a feature of modern operating systems that allocates large memory pages, as opposed to the default small pages. This memory management strategy is often used in high-performance computing environments where the demands on memory and processing throughput are significant.
-
How HugePages work:
- Default Page Size: Most systems use a default memory page size of 4 KB.
- Huge Page Size: HugePages significantly increase this size, typically to 2 MB or 1 GB, depending on the hardware and configuration.
- Memory Allocation: HugePages pre-allocate memory at boot or runtime and reserve it for specific applications. Once allocated, this memory cannot be used by other processes.
- Translation Lookaside Buffer (TLB): The TLB caches mappings between virtual memory and physical memory. With HugePages, fewer mappings are needed, reducing TLB misses and improving efficiency.
-
How HugePages help maximize throughput in packet capture systems:
In packet capture systems, throughput is often limited by memory and CPU performance due to the high rate of packet processing. HugePages help by addressing the following key areas:
- Reduced Translation Lookaside Buffer (TLB) misses:
  The TLB is a specialized memory cache built into the CPU that stores recent virtual-to-physical address translations. HugePages increase the size of memory blocks, so the same amount of memory can be covered by far fewer TLB entries. Thus, HugePages lower the overhead of managing memory in systems with large amounts of RAM.
- Packet capture systems frequently access memory to process data.
- Using HugePages reduces the number of memory pages required for the same amount of data, resulting in fewer TLB lookups and misses.
- This improves memory access speed, a critical factor in high-throughput environments.
-
Improved memory bandwidth:
- HugePages minimize the overhead of managing a large number of small pages, freeing up CPU resources.
- This allows more CPU cycles to be dedicated to processing packets rather than handling memory management.
-
Decreased CPU overhead:
- With fewer pages, the kernel spends less time managing page tables and handling page faults.
- This reduces the CPU load, enabling the system to handle higher packet rates.
-
Reduced fragmentation:
- Allocating large, contiguous memory regions reduces fragmentation.
- This ensures that applications like packet capture systems have consistent and predictable memory performance.
-
Enhanced DMA (Direct Memory Access):
- Packet capture systems often rely on DMA to move data directly from network cards to memory.
- HugePages ensure that these memory areas are contiguous, simplifying DMA operations and avoiding unnecessary overhead.
-
Better NUMA performance:
- In systems with NUMA architecture, HugePages can reduce cross-node memory accesses, optimizing performance for packet capture workloads.
-
-
Typical use cases for HugePages in packet capture:
-
DPDK (Data Plane Development Kit):
- Many packet processing frameworks, such as DPDK, explicitly require HugePages to achieve optimal performance.
-
PF_RING and Netmap:
- These frameworks also benefit from HugePages, as they focus on high-speed packet processing and forwarding.
- Set up and use HugePages on Linux:
To set up and use HugePages on a Linux system, you'll need to perform system-level configuration and modify your application code to allocate memory using HugePages. Below are step-by-step instructions and code examples to help you get started.
-
Setting up HugePages:
- Check current HugePages configuration:
  Use the following command to view the current HugePages settings:
  grep Huge /proc/meminfo
  This displays information such as `HugePages_Total`, `HugePages_Free`, and `Hugepagesize`.
- Allocate HugePages:
  Decide how many HugePages you need and allocate them. For example, to allocate 128 HugePages:
  sudo sysctl -w vm.nr_hugepages=128
  Alternatively, you can write directly to the proc filesystem:
  echo 128 | sudo tee /proc/sys/vm/nr_hugepages
- Make the allocation persistent:
  To ensure the HugePages allocation persists after a reboot, add the following line to /etc/sysctl.conf:
  vm.nr_hugepages=128
  Then, reload the sysctl settings:
  sudo sysctl -p
- Mount the hugetlbfs filesystem:
  Create a mount point and mount the `hugetlbfs` filesystem:
  sudo mkdir /mnt/hugepages
  sudo mount -t hugetlbfs none /mnt/hugepages
  To make this mount persistent across reboots, add the following line to /etc/fstab:
  none /mnt/hugepages hugetlbfs defaults 0 0
- Adjust permissions (if necessary):
  If non-root users need to access HugePages, adjust the permissions of the mount point:
  sudo chmod 777 /mnt/hugepages
- Using HugePages in applications:
-
In C programs:
-
Example 1: Allocating HugePages with a file descriptor:
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define LENGTH (2 * 1024 * 1024) // 2 MB

int main() {
    int fd = open("/mnt/hugepages/hugepagefile", O_CREAT | O_RDWR, 0755);
    if (fd < 0) {
        perror("open");
        exit(EXIT_FAILURE);
    }
    if (ftruncate(fd, LENGTH) == -1) {
        perror("ftruncate");
        exit(EXIT_FAILURE);
    }
    void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        exit(EXIT_FAILURE);
    }
    sprintf(addr, "Hello, HugePages!");
    printf("%s\n", (char *)addr);
    munmap(addr, LENGTH);
    close(fd);
    unlink("/mnt/hugepages/hugepagefile");
    return 0;
}
Steps:
- Open a file in the HugePages mount point to get a file descriptor.
- Set the file size to match the HugePage size using `ftruncate`.
- Map the file into memory with `mmap`, using HugePages.
- Use the memory as needed.
- Clean up by unmapping memory, closing the file, and deleting it.
Example 2: Using anonymous HugePages with
MAP_HUGETLB
:#include <stdio.h> #include <stdlib.h> #include <sys/mman.h> #include <string.h> #define LENGTH (2 * 1024 * 1024) // 2 MB int main() { void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0); if (addr == MAP_FAILED) { perror("mmap"); exit(EXIT_FAILURE); } strcpy(addr, "Hello, Anonymous HugePages!"); printf("%s\n", (char *)addr); munmap(addr, LENGTH); return 0; }
Steps:
- Use `mmap` with `MAP_ANONYMOUS` and `MAP_HUGETLB` to allocate HugePages without a backing file.
- Perform memory operations as needed.
- Unmap the memory when done.
-
Verifying HugePages usage:
After running your application, check if HugePages are being utilized:
grep Huge /proc/meminfo
Look for changes in the values of `HugePages_Total`, `HugePages_Free`, and `HugePages_Rsvd` to confirm usage.
- Summary:
  - Allocate HugePages by setting `vm.nr_hugepages` to the desired number.
  - Mount `hugetlbfs` to a directory (e.g., /mnt/hugepages) for applications to use.
  - Adjust permissions if necessary for user access.
  - Modify application code to allocate memory using HugePages, either through file-backed mappings or anonymous mappings with `MAP_HUGETLB`.
  - For Java applications, enable HugePages with the `-XX:+UseLargePages` JVM option.
  - Verify usage by checking HugePages statistics in /proc/meminfo.
References:
- ntop PF_RING documentation. (2024). PF_RING ZC (Zero Copy). Retrieved from https://www.ntop.org/guides/pf_ring/zc.html
- A. B. Narappa, F. Parola, S. Qi and K. K. Ramakrishnan. (2024). Z-Stack: A High-Performance DPDK-Based Zero-Copy TCP/IP Protocol Stack. 2024 IEEE 30th International Symposium on Local and Metropolitan Area Networks (LANMAN), Boston, MA, USA, pp. 100-105. doi: 10.1109/LANMAN61958.2024.10621881
- DPDK Project. (n.d.). About DPDK. Retrieved from https://www.dpdk.org/about/
- Wikipedia. (n.d.). Bufferbloat. Retrieved from https://en.wikipedia.org/wiki/Bufferbloat
- Lariviere, David, Clinical Professor of Financial Engineering; IE 421: High-Frequency Trading Tech; University of Illinois at Urbana-Champaign. (Spring 2024).
- Computer Networks: A Systems Approach. (2024). 6.3 TCP Congestion Control. Retrieved from https://book.systemsapproach.org/congestion/tcpcc.html
- traceroute-online.com. (n.d.). Traceroute Online - Trace and Map the Packets Path. Retrieved from https://traceroute-online.com
- Red Hat Documentation. (n.d.). 3.4 irqbalance. Retrieved from https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/6/html/performance_tuning_guide/sect-red_hat_enterprise_linux-performance_tuning_guide-tool_reference-irqbalance
- The Linux Kernel GitHub repository. (2024, November 27). Documentation: admin-guide: kernel-parameters. The kernel's command-line parameters. Retrieved from https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/kernel-parameters.rst
- The Linux Kernel GitHub repository. (2020, May 15). Documentation: core-api: irq: irq-affinity. SMP IRQ affinity. Retrieved from https://github.com/torvalds/linux/blob/master/Documentation/core-api/irq/irq-affinity.rst
- Aldinucci, M., Danelutto, M., Kilpatrick, P., Meneghin, M., Torquati, M. (2012). An Efficient Unbounded Lock-Free Queue for Multi-core Systems. Euro-Par 2012 Parallel Processing. Lecture Notes in Computer Science, vol 7484. Springer, Berlin, Heidelberg. doi: 10.1007/978-3-642-32820-6_65
- Hollowell, C., et al. (2015). The Effect of NUMA Tunings on CPU Performance. J. Phys.: Conf. Ser. 664 092010. doi: 10.1088/1742-6596/664/9/092010. Retrieved from https://iopscience.iop.org/article/10.1088/1742-6596/664/9/092010
- Cloudflare. (n.d.). Ping of death DDoS attack. Retrieved from https://www.cloudflare.com/learning/ddos/ping-of-death-ddos-attack/
- Bainbridge, Jamie and Maxwell, Jon. (2015, March 25). Red Hat Enterprise Linux Network Performance Tuning Guide. Retrieved from https://access.redhat.com/sites/default/files/attachments/20150325_network_performance_tuning.pdf
- Stephan, Alexander and Wüstrich, Lars. (2023). The Path of a Packet Through the Linux Kernel. Technical University of Munich, Germany. Seminar IITM WS 23. doi: 10.2313/NET-2024-04-1_16
- Linux man page. (n.d.). numactl(8). Retrieved from https://linux.die.net/man/8/numactl
- Speice, Bradley. (2019, July 1). On building high performance systems. Retrieved from https://speice.io/2019/06/high-performance-systems/
- Linux man page. (n.d.). strace(1). Retrieved from https://linux.die.net/man/1/strace
- The Linux Kernel GitHub repository. (2024, April 28). Documentation: timers: NO_HZ: Reducing Scheduling-Clock Ticks. Retrieved from https://github.com/torvalds/linux/blob/master/Documentation/timers/no_hz.rst
- Linux man page. (n.d.). chrt(1). Retrieved from https://linux.die.net/man/1/chrt
- Ashwathnarayana, Satyadeep. (2023, May 4). Netdata. Understanding Huge Pages: Optimizing Memory Usage. Retrieved from https://www.netdata.cloud/blog/understanding-huge-pages/
- Red Hat Documentation. (n.d.). Chapter 9. What huge pages do and how they are consumed by applications. Retrieved from https://docs.redhat.com/en/documentation/openshift_container_platform/4.1/html/scalability_and_performance/what-huge-pages-do-and-how-they-are-consumed
- Red Hat Documentation. (n.d.). Chapter 36. Configuring huge pages. Retrieved from https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/8/html/monitoring_and_managing_system_status_and_performance/configuring-huge-pages_monitoring-and-managing-system-status-and-performance#parameters-for-reserving-hugetlb-pages-at-boot-time_configuring-huge-pages
- The Linux Kernel GitHub repository. (2024, April 25). Documentation: admin-guide: mm: hugetlbpage. HugeTLB Pages. Retrieved from https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/mm/hugetlbpage.rst
The traditional kernel network stack has several inherent limitations in handling high-speed network traffic. To overcome these limitations, specialized techniques and libraries have been developed.
- Overhead in kernel networking stack:
  - Traditional packet capture tools (e.g., `tcpdump`, Wireshark's `dumpcap`) rely on the kernel networking stack to process packets. This involves multiple stages:
    - Interrupts generated by the NIC.
    - Packet copying between buffers (NIC to kernel, kernel to user space).
    - Protocol stack processing (e.g., Ethernet, IP, TCP/UDP parsing).
    - System calls for transferring packets from kernel space to user space.
  - These operations introduce latency and consume CPU cycles, which become bottlenecks at high packet rates.
-
CPU interrupt overload:
- At high packet rates (e.g., 10 Gbps or 100 Gbps), the sheer volume of interrupts generated by the NIC can overwhelm the CPU.
- Processing each interrupt separately is inefficient and can lead to packet drops, especially when interrupt coalescing is insufficient.
-
Copying between kernel and user space:
- Packets captured by the NIC are first stored in kernel buffers and then copied to user space for analysis.
- This data copying introduces additional latency and reduces throughput.
-
Limited buffering:
- Kernel buffer sizes are often limited, leading to buffer overruns when the packet rate exceeds the kernel's processing capability.
-
Context switching overhead:
- Packet capture typically involves frequent context switches between user space (application) and kernel space (network stack).
- These context switches further degrade performance at high traffic rates.
Kernel bypass techniques, among the simplest and most impactful ways to gain a performance speed-up, address these issues by allowing applications to access packets directly from the NIC, bypassing the traditional kernel networking stack.
Key benefits include:
-
Reduced Latency and CPU Overhead
- Bypassing the kernel avoids unnecessary processing steps (e.g., protocol parsing, system calls).
- Direct memory access (DMA) allows the NIC to write packets directly into user-space memory.
-
Higher Throughput
- By eliminating kernel overhead, applications can process packets at rates closer to the physical bandwidth of the NIC.
- Efficient polling mechanisms reduce interrupt overhead.
-
Customizable Packet Processing
- Applications can implement lightweight, application-specific processing pipelines without the general-purpose constraints of the kernel stack.
Examples of kernel bypass solutions:
-
DPDK (Data Plane Development Kit):
-
Overview:
DPDK is a set of libraries and drivers for fast packet processing that bypasses the kernel network stack by creating a fast path from the NIC to the application in user space. This NIC-to-user-space fast path eliminates the context switching otherwise needed to move frames between kernel space and user space. Additional gains in processing speed come from sidestepping the kernel network driver and the penalties it introduces. Moreover, DPDK leverages a Poll Mode Driver (PMD) at the data-link layer (Layer 2), run by a dedicated CPU core, to constantly poll the NIC for new network packets, rather than having the NIC raise an interrupt to the CPU.
DPDK supports many NICs and processor architectures and both FreeBSD and Linux.
A higher-level overview of DPDK — from high-performance computing company, Trenton Systems — is described below:
-
Initialization:
The DPDK application initializes by configuring the environment and initializing the necessary DPDK libraries. This involves setting up memory management, creating memory pools, and configuring the desired PMDs (Poll Mode Drivers) for the NICs.
-
(Data-Link Layer/Layer 2) Poll Mode Drivers (PMDs):
PMDs are a key component of DPDK. They provide optimized drivers for various NICs, allowing direct access to network devices from user space. PMDs are responsible for controlling the NICs, receiving and transmitting packets, and managing the underlying hardware resources efficiently.
-
Memory Management:
DPDK offers a memory management framework that allows applications to efficiently allocate and manage memory for packet buffers. It includes features like huge pages and memory pools. Huge pages provide large memory pages, reducing the overhead of memory management and improving performance. Memory pools are pre-allocated memory regions that can be used to efficiently allocate packet buffers.
-
Packet Processing:
Once the initialization is complete, the DPDK application can start processing packets. This typically involves the following steps:
-
Receiving Packets:
DPDK applications use the PMDs to receive packets from the NICs. The PMDs fetch packets directly from the NICs' receive queues into memory buffers.
-
Packet Processing:
DPDK provides libraries and APIs for packet manipulation, classification, and I/O operations. Applications can perform tasks such as packet parsing, modification, filtering, and forwarding. DPDK offers optimized functions for these operations to achieve high performance.
-
Transmitting Packets:
After processing the packets, the application can use the PMDs to transmit packets back to the NIC for onward transmission. The PMDs take packets from memory buffers and place them into the NIC's transmit queues.
-
-
Multi-Core Support:
DPDK is designed to fully utilize the processing power of multi-core processors. Applications can leverage DPDK's multi-threading capabilities to distribute packet processing across multiple cores. This involves creating multiple execution threads and assigning specific tasks to each thread. DPDK provides synchronization mechanisms, such as locks and queues, to coordinate the work of different threads.
-
Integration and Networking Applications:
DPDK can be integrated with other networking components and frameworks. It is often used in conjunction with software-defined networking (SDN) controllers, virtual switches, and network functions virtualization (NFV) infrastructure to build high-performance networking applications. DPDK provides APIs and libraries that enable integration with these components.
- Trenton Systems, What is DPDK (Data Plane Development Kit)? $^{9}$
Next, the following sections describe key components of DPDK, some of its key features, its common use cases, and an example workflow.
-
Key components of DPDK:
-
EAL (Environment Abstraction Layer): Provides a basic interface between DPDK applications and the underlying hardware, abstracting away specifics of the operating system and hardware differences.
- Manages hugepage memory for efficient packet buffers.
- Initializes and configures multiple CPU cores for packet processing.
- Memory Management: Includes hugepage support, memory pools, and buffer management (the packet buffer manager, `librte_mbuf`), essential for efficient packet processing.
  - `librte_mbuf` (packet buffer manager):
    - Handles message/memory buffers with `mbuf`, the primary data structure for storing packets.
    - Allocated from hugepage memory for efficient memory access.
(Data-Link Layer/Layer 2) Poll Mode Drivers (PMDs): These are optimized Data-Link Layer/Layer 2 drivers for various network interfaces, bypassing the kernel’s network stack to reduce latency and increase throughput.
- Enables direct interaction with network interfaces (NICs).
-
Ring Buffers: Utilized for efficient queueing mechanisms, allowing high-speed inter-process communication.
- Used for inter-core communication.
- Implements lockless queues for high-speed packet transfers between cores.
-
APIs for Packet Processing: Offers a set of functions and libraries for packet manipulation, including header parsing, packet classification, and packet forwarding.
-
Crypto and Security: Provides libraries and drivers to support cryptographic operations and secure communication.
-
Eventdev and Timers: For event-driven programming and time management functionalities, aiding in scheduling and execution of tasks in a timely manner.
-
-
Features:
- Processes packets directly in user-space drivers, avoiding kernel overhead.
- Data-Link Layer/Layer 2 Poll Mode Drivers (PMDs) reduce interrupt overhead.
- High throughput and extremely low latency, processing packets with speeds of millions of packets per second (pps) per core.
- Supports all major NICs and AWS Nitro cards.
- Even works with SmartNICs, like those from NVIDIA or Napatech.
- Supported by Arista's virtual router, vEOS Router, through its DPDK Mode
- Leverages NIC features like Receive Side Scaling (RSS) and hardware queues.
- Flexible enough to implement custom packet processing logic.
- Other notable features:
- Hugepage memory:
- Reduces Translation Lookaside Buffer (TLB) misses.
- Optimizes memory access speeds for packet buffers.
- Zero-Copy (ZC) mechanism:
- Avoids copying packets between buffers, reducing overhead.
- Cache optimization:
- Minimizes CPU cache misses by aligning memory and using prefetching techniques.
- NUMA awareness:
- Ensures packets are processed by CPU cores and memory within the same NUMA node for efficiency.
- Batch processing:
- Handles packets in batches to minimize function call overhead.
- Hugepage memory:
-
Use cases:
Although DPDK is an obvious choice for applications requiring high-speed packet processing, it also serves other networking use cases:
- High-speed packet forwarding (e.g., software-based routers, switches).
- Network Function Virtualization (NFV) applications.
- Traffic generators and monitoring tools (e.g. Cisco's TRex Realistic Traffic Generator).
- TRex supports about 10-30 million packets per second (Mpps) per core, scalable with the number of cores.
- Load balancers and firewalls.
-
Example DPDK workflow:
-
Initialization:
- Reserve hugepage memory.
- Bind NICs to DPDK-compatible drivers (e.g., vfio-pci).
- Initialize EAL and assign CPU cores for packet processing.
-
NIC configuration:
- Configure NIC ports for packet reception (RX) and transmission (TX).
- Set up queues on each NIC port for RX and TX.
- DPDK can configure NICs with multiple RX queues during initialization.
- Each RX queue is associated with a specific CPU core or a group of cores for packet processing.
- DPDK can assign individual CPU cores to poll specific RX queues.
- This eliminates contention for packet processing and ensures that each core operates independently, improving scalability.
-
Packet processing:
-
Packet reception (RX):
- NIC receives packets and places them in hardware queues.
- PMD polls the queues, fetches packets, and places them in Mbuf structures.
-
Packet processing:
- User-defined logic processes packets (e.g., forwarding, filtering).
- Packets can be modified, dropped, or routed based on the application.
-
Packet transmission (TX):
- Processed packets are placed in TX queues.
- PMD transmits packets from the TX queues to the NIC.
-
Scaling:
- Multicore processing: Assign multiple CPU cores to process different queues or stages.
- Load balancing: Distribute packet processing workloads across cores.
-
DPDK and Vector Packet Processing (VPP) $^{13}$:
Vector Packet Processing (VPP) was originally contributed by Cisco as an open-source project. Most implementations of VPP today leverage DPDK as a plug-in to accelerate getting packets into user space via DPDK PMDs. A brief overview of VPP and its integration with DPDK, sourced from Asterfusion, is shared below.
-
Vector Packet Processing (VPP)
VPP, part of the Fast Data Input/Output (FD.io) project, is a user-space network stack designed for high-speed processing. It operates efficiently on architectures like x86, ARM, and Power by leveraging vector processing techniques:
- Batch Processing: VPP processes a batch or "vector" of packets simultaneously at each node, significantly reducing resource preparation and context-switching overhead.
- SIMD Parallelism: Modern CPUs' Single Instruction Multiple Data (SIMD) capabilities are utilized to perform operations on multiple data points simultaneously, enhancing efficiency.
- Optimized Cache Usage: By loading multiple packets into the CPU's cache simultaneously, VPP minimizes memory access delays, further boosting performance.
-
Integration with DPDK:
SR-IOV (Single Root I/O Virtualization) is an extension to the PCI standard that creates virtual functions (VFs), each of which can be treated as a separate virtual PCI device. Each VF can be assigned to a VM or a container, and each VF has dedicated queues for incoming packets.
(Figure: VPP with DPDK and SR-IOV. Source: https://cloudswit.ch/blogs/what-are-dpdk-vpp-and-their-advantages/)
DPDK provides the foundational framework for packet processing by bypassing the Linux kernel and working directly in user space. It complements VPP by:
- Direct Hardware Access: DPDK enables VPP to directly interface with network hardware, eliminating kernel overhead and improving throughput.
- Efficient Memory Management: With DPDK's advanced memory mapping, VPP can directly access network buffers, reducing memory copy operations and context switches.
- Layered Functionality: While DPDK handles Layer 2 tasks efficiently, VPP extends this capability to Layers 3-7, providing a comprehensive user-space networking solution.
-
Synergistic benefits:
The integration of DPDK and VPP results in a seamless, high-performance networking stack that offers:
- Reduced Latency: By bypassing the kernel and avoiding context switching.
- Enhanced Throughput: Leveraging DPDK’s Poll Mode Driver (PMD) and VPP’s vector processing.
- Scalability: Optimized for multi-core processors to handle increased network traffic.
- Setting up DPDK on cloud VMs:
-
AWS:
-
DPDK driver for Elastic Network Adapter (ENA) $^{15}$:
Amazon provides a comprehensive guide to the DPDK (Data Plane Development Kit) driver for Amazon's Elastic Network Adapter (ENA). To set up DPDK with ENA using Poll Mode Drivers (PMDs), the process involves the following steps:
-
Prerequisites:
- Update kernel and install dependencies:
  - Update the kernel to ensure compatibility with the latest DPDK versions.
  - Install essential tools like `kernel-devel`, `kernel-headers`, `git`, and Python modules (`meson`, `ninja`, `pyelftools`).
- Configure modules:
  - Use the `igb_uio` or `vfio-pci` kernel module for ENA, ensuring Write Combining (WC) is enabled for optimal performance with ENAv2 hardware.
- Key setup steps:
  - Clone and build DPDK.
  - Configure environment:
    - Allocate hugepages for DPDK to optimize memory usage.
    - Bind ENA devices to the appropriate kernel module (e.g., `igb_uio` or `vfio-pci`) using `dpdk-devbind.py`.
  - Verify configuration.
  - Test and optimize:
    - Execute the `testpmd` application to validate functionality.
    - Use runtime options (`devargs`) to tweak PMD behavior for specific use cases, such as enabling large low-latency queue (LLQ) headers or adjusting transmission timeout settings.
- Advanced configurations:
  - Modify RSS (Receive Side Scaling) settings for efficient packet distribution across multiple Rx queues.
  - Enable enhanced logging for debugging by configuring build arguments.
- Tips for performance:
  More performance tips can be found under section 12, Performance Tuning; some key tips from that section are shared below:
- Utilize jumbo frames and optimize Tx/Rx paths for high throughput.
- Spread traffic across multiple queues for better resource utilization.
- Adjust ring sizes and enable RSS redirection to handle traffic spikes efficiently.
Overall, the setup of DPDK through AWS's ENA ensures high performance and compatibility when deploying DPDK applications on AWS instances using ENA PMDs.
- DPDK on AWS and DPDK optimization $^{14}$:
In an article written by AWS Certified Solutions Architect and DevOps Engineer Marc Richards, packet processing performance of the Linux kernel versus DPDK is compared using a simple HTTP benchmark. In the article, Marc gives a brief overview of getting DPDK on AWS working with his Seastar-based application. Some key highlights from Marc on getting DPDK up and running on AWS are shared below:
- DPDK needs to be able to take over an entire network interface, so in addition to the primary interface used to connect to the instance via SSH (`eth0`/`ens5`), you will also need to attach a secondary interface dedicated to DPDK (`eth1`/`ens6`).
- DPDK relies on one of two available kernel frameworks for exposing direct device access to user-land: VFIO or UIO. VFIO is the recommended choice, and it is available by default on recent kernels. By default, VFIO depends on hardware IOMMU support to ensure that direct memory access happens in a secure way; however, IOMMU support is only available for *.metal EC2 instances. For non-metal instances, VFIO supports running without IOMMU by setting `enable_unsafe_noiommu_mode=1` when loading the kernel module.
Seastar uses DPDK 19.05, which is a little outdated at this point. The AWS ENA driver has a set of patches for DPDK 19.05 which must be applied to get Seastar running on AWS. I backported the patches to my DPDK fork for convenience.
-
Last but not least, I encountered a bug in the DPDK/ENA driver that resulted in the following error message:
runtime error: ena_queue_start(): Failed to populate rx ring
. This issue was fixed in the DPDK codebase last year so I backported the change to my DPDK fork. On 5th+ generation instances the ENA hardware/driver supports a LLQ (Low Latency Queue) mode for improved performance. When using these instances, it is strongly recommended that you enable the write combining feature of the respective kernel module (VFIO or UIO), otherwise, performance will suffer due to slow PCI transactions.
The VFIO module doesn't support write combining by default, but the ENA team provides a patch and a script to automate the process of adding WC support to the kernel module. I originally had a couple issues getting it working with kernel 5.15 but the ENA team was pretty responsive about getting them fixed. The team also recently indicated they intend to upstream the VFIO patch which will hopefully make things even more painless in the future.
Most importantly, enabling write combining brings performance from 1.19M req/s to 1.51M req/s, a 27% performance increase.
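Putting Marc's VFIO notes into concrete form, a minimal sketch for a non-metal instance might look like the following. The interface name `ens6` and the PCI address `0000:00:06.0` are placeholders for the secondary ENA interface, and the no-IOMMU mode deliberately weakens DMA isolation:

```shell
# Load vfio-pci and allow it to run without an IOMMU (non-metal instances only)
sudo modprobe vfio-pci
echo 1 | sudo tee /sys/module/vfio/parameters/enable_unsafe_noiommu_mode

# Detach the secondary interface from the kernel and hand it to DPDK
sudo ip link set dev ens6 down
sudo dpdk-devbind.py --bind=vfio-pci 0000:00:06.0

# Confirm which driver each NIC is now bound to
dpdk-devbind.py --status
```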
Microsoft Azure $^{10}$:
Microsoft Azure has its own documentation page on setting up the DPDK library in a Linux VM. The documentation walks through:
- A manual installation process of DPDK.
- Configuring the environment:
  - Setting up hugepages for each NUMA node.
  - Viewing the available MAC and IP addresses, with `ifconfig`, to find the VF network interface.
  - Using the `ethtool -i <vf interface name>` command to find which PCIe interface to use for the VF.
  - Loading `ib_uverbs` on each reboot with `modprobe -a ib_uverbs`.
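The environment steps above can be sketched as a shell sequence. This is an illustration under assumptions, not a transcript of the Azure docs: the NUMA node (`node0`), hugepage count, and interface name `eth1` all depend on the VM size:

```shell
# Reserve 1024 x 2 MB hugepages on NUMA node 0 (repeat per node on multi-node VMs)
echo 1024 | sudo tee /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
sudo mkdir -p /mnt/huge
sudo mount -t hugetlbfs nodev /mnt/huge

# Identify the VF interface by its MAC/IP, then map it to a PCIe address
ip addr show
ethtool -i eth1     # the "bus-info" field gives the PCIe address

# Load ib_uverbs now, and arrange for it to load on every boot
sudo modprobe -a ib_uverbs
echo ib_uverbs | sudo tee /etc/modules-load.d/ib_uverbs.conf
```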
- Setting up the PMD, either NetVSC PMD or Failsafe PMD
- Testing the PMD:
- Testing a single sender/single receiver by printing the packets-per-second statistics. On the TX side, run the following command:

```shell
testpmd \
  -l <core-list> \
  -n <num of mem channels> \
  -w <pci address of the device you plan to use> \
  -- --port-topology=chained \
  --nb-cores <number of cores to use for test pmd> \
  --forward-mode=txonly \
  --eth-peer=<port id>,<receiver peer MAC address>
```
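Since the test involves a sender/receiver pair, the receiving side needs a matching `testpmd` instance in `rxonly` forwarding mode. The following is a sketch mirroring the TX command's placeholders, not a verbatim command from the Azure documentation:

```shell
# RX side: receive-only forwarding; testpmd prints Rx-pps statistics while running
testpmd \
  -l <core-list> \
  -n <num of mem channels> \
  -w <pci address of the device you plan to use> \
  -- --port-topology=chained \
  --nb-cores <number of cores to use for test pmd> \
  --forward-mode=rxonly
```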