SMaRTT-REPS: Sender-based Marked Rapidly-adapting Trimmed & Timed Transport with Recycled Entropies


With the rapid growth of machine learning (ML) workloads in datacenters, existing congestion control (CC) algorithms fail to deliver the required performance at scale. ML traffic is bursty and bulk-synchronous and thus requires quick reaction and strong fairness. We show that existing CC algorithms that use delay as a main signal react too slowly and are not always fair. We design SMaRTT, a simple sender-based CC algorithm that combines delay, ECN, and optional packet trimming for fast and precise window adjustments. At the core of SMaRTT lies the novel QuickAdapt algorithm that accurately estimates the bandwidth at the receiver. We show how to combine SMaRTT with a new per-packet traffic load-balancing algorithm called REPS to effectively reroute packets around congested hotspots as well as flaky or failing links. Our evaluation shows that SMaRTT alone outperforms EQDS, Swift, BBR, and MPRDMA by up to 50% on modern datacenter networks.

1 Introduction↩︎

In the quickly-evolving landscape of AI-centric datacenters, the demand for large-scale and high-performance computing (HPC) capabilities has risen to unprecedented heights. This paradigm shift is demonstrated by the exponential growth of large-scale AI training [1], [2] and the proliferation of HPC offerings in the cloud [3]. The demand for high throughput at an affordable cost and low latency has become extremely important, and a crucial component to achieving those targets lies in network infrastructure and protocols.

A notable manifestation of this demand is found in the statistics - a staggering 70% of Azure’s traffic is carried by Remote Direct Memory Access (RDMA) technology [4], and all major cloud providers are relying on similar technologies [3], [5], [6]. Yet, as we witness this transformation, it is worth acknowledging that existing protocols may not suit the demands of large-scale high-bandwidth networking. For example, RDMA over Converged Ethernet (RoCE) is affected by issues like excessive switch buffer requirements for Priority Flow Control (PFC), PFC storms, and in-order delivery requirements [7], [8].

1.1 Motivation↩︎

Similar limitations can be found in CC algorithms and to drive our design choices, we identified the following requirements for a modern high-performance CC algorithm.


2MiB Flows


32 MiB Flows

Figure 1: Comparison of SMaRTT with recent CC algorithms on an 8:1 oversubscribed fat tree running a permutation. The first number under each algorithm indicates the overall completion time, while the second the time difference between the fastest and slowest flow. The vertical gray line represents the ideal case where all flows terminate as fast as possible..

1.1.1 Fairness↩︎

In this paper, we refer with the term fairness to the ability of a CC algorithm to allocate the available bandwidth evenly between competing flows. Minimizing jitter and keeping a low and consistent flow completion time (FCT) is particularly important in the context of AI and HPC workloads that exhibit tightly coordinated and synchronized communication patterns. In such systems, performance is often dictated by the slowest node in the system [9][11], as "straggler" flows and growing variability in FCT can have a disproportionately high impact on overall job progress.

This is exacerbated by the presence of bursty traffic, which is often common in datacenters. For example, 80% of RPCs in Google datacenters fit in 1 bandwidth-delay product (BDP), and 89% fit in 4 BDP [12]. This behavior is also common in some AI workloads where nodes need to frequently exchange few KiB of data [13]. Algorithms relying on delay to detect congestion often react too late to such transient bursts [12], [14], leading to unfair treatment of concurrently running flows and increasing the completion time.

We show this effect in Fig. 1 (a), where we report the Cumulative Distribution Function (CDF) of the Flow Completion Time (FCT) of 2MiB flows in a permutation scenario with 1,024 nodes, for a 8:1 oversubscribed network (more details on the setup in Sec. 4). We can notice how, with Swift [15], a delay-based algorithm, the FCT of the slowest flow is 18% longer than the fastest flow. An even more pronounced effect can be observed with BBR [16] (which we observed needs up to seven round trip times (RTTs) to adjust its sending rate). MPRDMA [17], a per-packet ECN based mechanism suffers from its inherit unfairness which is more visible for small messages. As we discuss further in Sec. 1.1.2, vanilla EQDS [18], although more fair, does not manage fabric congestion as effectively as sender-based CC algorithms. On the other hand, the algorithm we propose (SMaRTT) quickly and precisely reacts to congestion by relying on different congestion signals (delay, ECN markings, and packet trimming), improving fairness and overall performance. Finally, we also show how using SMaRTT together with EQDS can improve the performance of the latter (more details in Sec. 5.1).

1.1.2 Fabric Congestion Management↩︎

Many CC algorithms have been optimized for managing congestion at the final hop caused by incast [18], [19] due to its relevance for traditional TCP-based workloads [20]. However, with the emergence of new workloads, those algorithms provide sub-optimal performance, since they might have limited visibility of congestion happening in the network fabric. This type of congestion is often the result of oversubscription (i.e., traffic exceeding the capacity of the network), common at upper tiers of datacenter networks [6], [12], but can also occur in a non-oversubscribed setting as a result of ECMP collisions [21][23]. This is shown in Fig. 1 (b), where we report the CDF of the FCT of 32MiB flows in a permutation scenario with 1,024 nodes, for an 8:1 oversubscribed network. In this scenario, flows are big enough to allow Swift and MPRDMA to react effectively to congestion. However, FCT with EQDS is higher due to its inability to adapt to fabric congestion, hence underutilizing network bandwidth due to trimmed packets and re-transmissions. As we show in Sec. 5.1, this can be improved by complementing EQDS with SMaRTT.

1.1.3 Ease of Deployment↩︎

Last, the constant increase in the number of nodes used by workloads and the number of network flows per node [24] limits the amount of memory that can be used to track the flow state. This is even more critical when we consider that network bandwidth is rapidly reaching the Tbit/s range and that, for performance reasons, (part of) the CC algorithm would be implemented in Network Interface Cards (NICs). This limits the amount of memory the algorithm can use and the complexity of the algorithms that can be executed (at 800Gbit/s, with 4KiB packets, the NIC needs to process one packet every 40 nanoseconds).

The escalating bandwidth requirements for network switches, currently at 51.2 Tbit/s and beyond, impose constraints on the complexity of features these switches can support. Additionally, gradually deploying new switch generations or incorporating switches from diverse vendors with varying features is common in datacenters [25]. Consequently, CC algorithms relying on specific new features demanding switch hardware support [26][28] might have difficulties getting adopted by datacenter providers.

1.2 Contributions↩︎

To fulfill these requirements, we introduce a new CC algorithm which we call SMaRTT (Sender-based Marked Rapidly-adapting Trimmed & Timed Transport), together with a novel load balancing scheme called REPS (Recycled Entropy Packet Spraying). SMaRTT is a sender-based CC algorithm running over lossy Ethernet. To be easily deployable, SMaRTT does not rely on any advanced in-band network telemetry (INT). It instead leverages a combination of ECN-marking and round-trip-time (RTT) measurements, two signals that have been commonly used in a mutually exclusive manner despite each of them possessing distinctive valuable benefits that complement one another (as we show in Sec. 3.1). Last, SMaRTT efficiently uses memory and compute resources, since it only needs to store a few bytes per flow and does not require any pull queue or specific control packets, as common in many receiver-driven algorithms. Specifically, SMaRTT-REPS makes the following key contributions:

A novel mechanism that can quickly converge to the achievable bandwidth capacity at any bottleneck, typically within one RTT of receiving a severe congestion signal (Sec. 3.3). QuickAdapt relies on packet trimming or, in the case where it is not supported, on timeouts (with minimal performance impact, Sec. 4.2). Using QuickAdapt, SMaRTT can react to severe congestion as quickly as a receiver-based protocol but without requiring any special support on the receiver side (e.g., pull queues). We also argue that this feature can be retrofitted to any transport mechanism operating within tight RTT bounds, thus significantly reducing tail latencies. In particular, QuickAdapt can reduce queue occupancy much earlier than other sender-based mechanisms.

A new congestion window management technique to improve bandwidth fairness (Sec. 3.2.1 and Sec. 3.2.3). In a nutshell, the congestion window is increased or decreased more conservatively (and proportionally to the currently acknowledged packet size) when the ECN and delay congestion signal disagree on the presence of congestion. To reduce any small degree of unfairness that might be present, SMaRTT rapidly increases the congestion window of flows (Sec. 3.4) if no congestion is detected for a period of time.

A powerful and lightweight adaptive load balancing mechanism called Recycled Entropy Packet Spraying (REPS, Sec. 3.6). REPS can sustain near-optimal performance at high bandwidth utilization levels and effectively route around path blackholes. Unlike complex in-network load balancing implementations [29], [30], REPS does not require any switch support besides ECMP. We demonstrate how SMaRTT-REPS effectively orchestrates congestion control and adaptive load balancing. On one side, the CC algorithm does not inhibit early yet mild congestion signals, which are crucial for load balancing. On the other side, load balancing does not aggressively shift substantial portions of traffic away to different paths, which could make the received congestion signals irrelevant. We demonstrate REPS in a real deployment on 128 nodes connected through a two-tier fat-tree network. ’

2 Background↩︎

The behavior of CC algorithms is mostly determined by how they detect (Sec. 2.1) and react (Sec. 2.2) to congestion.

2.1 Congestion Signals↩︎

End-to-end delay accurately approximates congestion. Delay can be computed either by the sender, measuring the RTT (e.g., in Swift [15] and TCP [31]), or by the receiver, annotating acknowledgment packets (ACK) packets with one-way delay information (e.g., in TIMELY [32] or DX [33]). We assume in the following that accurate delay measurements are computed through NIC timestamping [4], [15].

With explicit congestion notification (ECN), switches set a bit in the traffic class field of the IP header when a packet experiences congestion. The receiver sends the marked packet back to the sender, which adjusts its rate accordingly [34][36]. Because ECN notifies congestion using a single bit, it provides less information than time-based signals, which instead account for the exact queuing delay along the path. Switches can use different policies to decide if a packet must be marked. For example, in random early detection (RED) [37], switches randomly mark packets with a probability linear in the switch queue size, if that size is within two thresholds (\(K_{min}\) and \(K_{max}\)). Although ECN was designed to mark packets when they are enqueued [38], it has been shown that doing that when dequeued allows CC algorithms to react faster. Dequeue marking can be easily implemented on most existing switches [39], and in the rest of the paper, we assume that switches use RED and mark ECN packets when dequeued. Although ECN-based CC algorithms can react quicker than delay-based ones, they perform poorly when dealing with incast and can be challenging to tune [15].

Packet losses have extensively been used to detect severe congestion [37], [40]. However, by relying on losses only, the algorithm would react too late to congestion. Also, losses are often detected through timeouts, which are hard to tune and can cause unnecessary retransmissions.

With packet trimming, a switch can remove parts of the packet (e.g., payload) rather than dropping it. Packet trimming retains essential information (e.g., headers), allowing the host to quickly detect and react to congestion [18], [19], [41]. Researchers demonstrated that packet trimming can be implemented on switches such as Intel Tofino, the Broadcom Trident 4 and NVidia Spectrum 2 by leveraging their ability to reroute packets to an alternative port when the initial egress queue becomes saturated [42].

Whereas ECN and packet trimming can be seen as a simple form of in-network telemetry, other protocols report more detailed information, like maximum queue occupancy [26], [43] or links load [44] along the path.

SMaRTT uses both ECN and delay as congestion signals to react quickly and accurately to congestion. To efficiently manage severe congestion scenarios, SMaRTT relies on trimming or, if it is not supported, on timeouts. SMaRTT does not rely on complex in-network telemetry.

2.2 Rate Control↩︎

The decision on how to react to congestion can be made either by the receiver [18], [19], [45][48], the sender [15], [34], [35], [49] or the switch [27], [28]. Whereas receiver-based algorithms can precisely regulate the transmission of the different senders (e.g., in incast scenarios), they require extra control packets and data structures (e.g., pull queues) on the receiver side. Also, some of these algorithms can fall short when dealing with fabric congestion [15]. On the other hand, sender-based schemes do not require such extra complexity but cannot precisely regulate the transmission rate during incast traffic due to lack of interaction with the other senders.

SMaRTT is a sender-based mechanism that, thanks to our novel QuickAdapt algorithm (Sec. 3.3), can precisely regulate the transmission rate in incast scenarios. At the same time, SMaRTT is resource-efficient since it does not require pull queues. As already discussed, to improve deployability, SMaRTT does not rely on any complex or new feature to be implemented in switches.

3 SMaRTT Design↩︎

SMaRTT is a high-performance, easy-to-deploy, and lightweight CC algorithm that can quickly react to congestion happening both at the last hop (incast) and in any other part of the network. SMaRTT keeps a number of in-flight bytes equal to the size of a congestion window (cwnd). SMaRTT updates the window depending on the estimated network congestion, striving to keep packets RTT lower than a pre-determined and constant target RTT. This can be expressed relatively to the base RTT (i.e., the RTT on an empty network). The base RTT be computed when the datacenter network is set up (once for each possible hop count) or recomputed dynamically like what is done by Swift [15]. In the rest of the paper, we assume a target RTT equal to \(1.5\)x the base RTT. Consequently, we can also easily compute a specific target RTT for each hop count by setting it to 50% more than the base RTT (details in Sec. 3.5).

SMaRTT relies on widely available features like ECMP, ECN marking, and packet trimming. ECMP and ECN marking are available on most datacenter switches [7], [50], [51], whereas packet trimming can be implemented on existing switches [42]. If ECMP is not supported, we can still run the CC algorithm without the REPS load balancing. If packet trimming is not supported, SMaRTT falls back to timeouts, with limited performance impact (Sec. 4.2). Similarly to EQDS [18], we assume that each switch port has a high-priority queue to forward control packets such as trims and ACKs, and a lower-priority queue to forward data packets. By doing so, packets carrying potential congestion information are prioritized so that SMaRTT can react quickly.

This work mainly focuses on widely used fat tree networks [3], [52]. However, SMaRTT could be easily adapted to work with other topologies such as torus, Dragonfly [6], [53], [54], BCube [55], SlimFly [56], HammingMesh [13] and others. Moreover, we intentionally limit buffering to a worst-case scenario, assuming that each per-port queue is one bandwidth-delay product (BDP) in size.

In the following, we motivate the choice of using both delay and ECN as congestion signal (Sec. 3.1), and then describe the main control loop (Sec. 3.2). We then present QuickAdapt (Sec. 3.3) and FastIncrease (Sec. 3.4), two techniques we introduce to further improve SMaRTT reactiveness. Eventually, we discuss how to select SMaRTT parameters (Sec. 3.5), and how we couple it with the novel REPS load balancing algorithm (Sec. 3.6).

3.1 SMaRTT Congestion Signals↩︎

Choosing the congestion signal is critical for the whole design of the protocol, and SMaRTT uses both ECN and delay (RTT). SMaRTT can also leverage packet trimming to react to severe congestion, which we discuss in Sec. 3.3. In Fig. 2, we compare the full version of SMaRTT with a simple CC using ECN and delay, by simulating 8:1 incast traffic on 1,024 nodes fat tree with 800Gbps links. In all the simulation results, we use the htsim packet-level network simulator [19]. We use the same congestion control logic for both ECN only and RTT only (we decrease the congestion window by half an MTU per packet at most in response to high delay or ECN marking). We observe that the congestion windows decrease faster when using ECN since, when using delay, the receiver gets the congestion signal only after the packet crosses the entire queue. On the other hand, RTT only is more fair. Finally we show SMaRTT as reference. With it, we take advantage of both signals and also use QuickAdapt (Sec /3.3) to immediately adjust its windows to the right value.

Figure 2: Comparing SMaRTT, ECN and delay-based congestion signals during an incast. The windows start at 1.42MiB (not shown in the figure).

On the other hand, delay is a more precise and fine-grained congestion signal due to its multi-bit nature and can thus estimate congestion more accurately. Moreover, it improves the algorithm’s fairness since it is not affected by the randomness inherently present with ECN. We show this in Fig. 3 (a), where we report the distribution of FCTs of the previous experiment. We can see how delay is more fair but ECN is slightly faster to react. Combining the two allows us to get some benefits from both, reducing the overall completion time. This is important as the last finishing flow would delay tightly coupled applications such as ML training workloads.



Figure 3: Impact of congestion signals and reaction granularity on SMaRTT performance. Running a 8:1 Incast of 8MiB on 1024 nodes at 800Gbps with 4KiB MTU..

Last, it is important to determine how often to react to congestion signals. Many algorithms like MPRDMA react once per ACK to improve responsiveness. However, it might be hard to keep up when increasing network bandwidth. For this reason, other algorithms like Swift or DCQCN react once per period of time (e.g., once per RTT). As shown in Fig. 3 (b), reacting on a per-packet basis can result in slightly better performance. Indeed, the congestion window size is adapted more smoothly, improving reaction times and reducing unfairness. We also observe how, even reacting only once every 50 packets does not significantly impact SMaRTT performance, which is within the 5% of the per-packet reaction scenario.

3.2 Main Control Loop↩︎

In SMaRTT, the sender recomputes the congestion window size every time it receives an ACK. When an ACK is received, four different scenarios can occur depending on the severity of congestion (or lack thereof), as reported by the two congestion signals: Fair Decrease (Sec. 3.2.1), Multiplicative Decrease (Sec. 3.2.2), Fair Increase (Sec. 3.2.3), Multiplicative Increase (Sec. 3.2.4).

3.2.1 Fair Decrease (FD)↩︎

The ACK is ECN-marked, but the reported RTT is smaller than the target one. In this case, queues are building up, but congestion has not yet affected the acknowledged packet. SMaRTT proactively reacts by gently decreasing the window regardless of the packet’s RTT. To improve fairness (see Sec. 4.3), the congestion window is decreased more aggressively for flows with larger windows (and thus, likely, more in-flight bytes) as follows: \[\begin{align} \mathit{cwnd} \mathrel{-}= \frac{\mathit{cwnd}}{\mathit{bdp}} \cdot \mathit{fd} \cdot \mathit{p.size} \end{align}\] where \(\mathit{bdp}\) is the bandwidth-delay product (computed considering the base RTT), \(\mathit{fd}\) is fair decrease constant, and \(\mathit{p.size}\) is the size of the acknowledged packet. Intuitively, the higher \(\mathit{fd}\), the more aggressively the congestion window is reduced. We discuss how to choose \(\mathit{fd}\) in Sec. 3.5.

3.2.2 Multiplicative Decrease (MD)↩︎

The ACK is ECN-marked and reports an RTT higher than the target. This indicates a worst-case scenario and, as a consequence, SMaRTT rapidly reduces the congestion window proportionally to the difference between the target and measured RTT, as follows: \[\label{eq:md} \begin{align} \mathit{cwnd} \mathrel{-}= \min \Big(\mathit{size}, \frac{\mathit{p.rtt} - \mathit{trtt}}{\mathit{p.rtt}} \cdot \mathit{md} \cdot \mathit{p.size}\Big) \end{align}\tag{1}\] where \(\mathit{trtt}\) and \(\mathit{p.rtt}\) are the target and measured RTT, respectively. SMaRTT additionally applies Fair Decrease to improve fairness (Sec. 3.2.1, not shown explicitly in Eq. 1 ). In any case, the multiplicative decrease reduces \(\mathit{cwnd}\) at most by the acknowledged packet’s size. \(\mathit{md}\) is a multiplicative decrease constant selected so that when the RTT is twice the target RTT, the congestion window is decreased aggressively by a packet for each ACK. Thus, we want to have \(\frac{\mathit{p.rtt} - \mathit{trtt}}{\mathit{p.rtt}} \cdot \mathit{md} \cdot \mathit{p.size} = \mathit{p.size}\), which implies \(\mathit{md}=2\).

3.2.3 Fair Increase (FI)↩︎

The ACK is not ECN-marked but reports an RTT higher than the target one. This might indicate that the delay signal is outdated and the queues are not growing significantly. Consequently, SMaRTT gently increases the congestion window, ignoring the delay signal. To improve fairness (see Sec. 4.3), the window is increased more aggressively for flows with smaller windows as follows:

\[\begin{align} \mathit{cwnd} \mathrel{+}= \frac{\mathit{p.size}}{\mathit{cwnd}} \cdot \mathit{mtu} \cdot \mathit{fi} \end{align}\] where \(\mathit{fi}\) is a fair increase constant and \(\mathit{mtu}\) is the Maximum Transmission Unit. The higher \(\mathit{fi}\), the more aggressively the congestion window is increased. \(\mathit{fi}\) impacts small flows more than large ones, and we discuss how to choose \(\mathit{fi}\) in Sec. 3.5.

3.2.4 Multiplicative Increase (MI)↩︎

The received ACK is not ECN-marked and reports an RTT smaller than the target one. In this case, SMaRTT increases the congestion window proportionally to the difference between the target and measured RTTs, as follows:

\[\label{eq:mi} \begin{align} \mathit{cwnd} \mathrel{+}= \min\Big(\mathit{size}, \frac{\mathit{trtt} - \mathit{p.rtt}}{\mathit{p.rtt}} \cdot \frac{\mathit{p.size}}{\mathit{cwnd}} \cdot \mathit{mtu} \cdot \mathit{mi}\Big) \end{align}\tag{2}\]

\(\mathit{mi}\) is a multiplicative increase constant selected to increase the congestion window by at most one MTU per RTT. If we assume \(\mathit{cwnd}/\mathit{p.size}\) ACKs are received in an RTT, SMaRTT would increase the window size by \(\frac{\mathit{trtt} - \mathit{p.rtt}}{\mathit{p.rtt}} \cdot \mathit{mtu} \cdot \mathit{mi}\) per RTT. Because we want this quantity to be smaller than an MTU, \(\mathit{mi} \leq \frac{\mathit{rtt}}{\mathit{trtt} - \mathit{p.rtt}}\). Since \(\mathit{p.rtt} \geq \mathit{brtt}\), where \(\mathit{brtt}\) is the base RTT, we set \(\mathit{mi} = \frac{\mathit{brtt}}{\mathit{trtt}-\mathit{brtt}}\). This multiplicative increase is followed by a fair increase to improve fairness (Sec. 3.2.3, not shown explicitly in Eq. 2 ).

3.2.5 Overall Picture↩︎

Algorithm 4 summarizes the main SMaRTT logic. So far, we discussed lines [lst:mainstart]-[lst:mainstop], while we discuss the remaining parts in the next sections. It is worth remarking that SMaRTT has a low memory footprint (19 bytes per flow, and 28 bytes globally) to ensure scalability and ease of hardware implementation on NICs.

Figure 4: SMaRTT Pseudocode

3.3 QuickAdapt↩︎

The four cases presented in Sec. 3.2 describe how SMaRTT reacts when receiving an ACK. However, if packets are dropped, SMaRTT reacts to such heavy congestion by quickly adjusting the congestion window using a novel technique, which we call QuickAdapt. QuickAdapt behaves similarly to receiver-based CC algorithms without needing special support on the receiver side (e.g., pull queues) or per-packet states like in BBR [16].

By default, SMaRTT relies on packet trimming to detect packet drops. Traditionally, when using packet trimming, the switch removes the payload and forwards the header if the queue size exceeds a threshold [19]. This allows congestion information to be communicated to the receiver even on congested networks. With SMaRTT, we trim packets only if they would otherwise be dropped because of a full buffer.

QuickAdapt counts how many bytes have been received over the last \(\mathit{trtt}\) and, if a loss is detected, it sets the congestion window to that value. It is possible to scale that value by a constant (\(\mathit{qa\_scaling}\)) depending on the desired queue occupancy (see Sec. 3.5). Because the sudden change caused by QuickAdapt already factors in the subsequent feedback, the receiver ignores the congestion information carried by all the unacked packets received from when QuickAdapt is triggered. By doing so, QuickAdapt does not overreact to congestion and, for the same reason, it is applied at most once per target RTT. The QuickAdapt pseudocode is presented in Algorithm 5, while the interaction with the rest of the SMaRTT logic is visible in the previously presented Algorithm 4. Figure 6 shows a simplified example of QuickAdapt in action with a 2:1 incast.

Figure 5: QuickAdapt Pseudocode QuickAdapt without trimming

Although packet trimming can be implemented on many existing switches [42], we also want SMaRTT to be compatible with switches lacking such support. If trimming is not supported, SMaRTT falls back to a simple timeout-based approach, requiring minimal changes to the overall logic (it reacts to timeouts like it does on the reception of a trimmed packet). We compare SMaRTT performance with and without trimming in Sec. 4.2. Because SMaRTT strives to keep the RTT below the target one, it uses aggressive timeouts to detect packet losses. When using timeouts instead of trimming, we also double the value of the \(\mathit{md}\) constant. By doing so, SMaRTT reacts more aggressively to congestion and reduces the probability of packet drops, which can decrease performance and reduce link efficiency.

Figure 6: QuickAdapt behavior for a 2:1 incast. Senders have a starting window of 4 packets. When a sender gets a trimmed packet it sets trigger_qa to true. After trtt, it changes its window size to the number of Bytes acknowledged during this time window (2 packets in this case).

3.4 FastIncrease↩︎

FastIncrease quickly reclaims available bandwidth after the termination of some flows (e.g., in case of incast with uneven sizes). FastIncrease is triggered when at least \(\mathit{cwnd}\) subsequent bytes did not experience any congestion (i.e., the RTT is close to the base RTT, and the ACKs are not ECN marked). In that case, as long as FastIncrease stays active, the congestion window is increased by \(k\) MTUs for each ACK. \(k\) is a constant that we set equal to \(2\) (after proper tuning) (see Sec. 3.5). We show the FastIncrease pseudocode in Algorithm 7 and how it interacts with the rest of the SMaRTT logic in Algorithm 4.

Figure 7: FastIncrease Pseudocode

3.5 Parameter Selection↩︎

SMaRTT enforces the congestion window size within a predefined range. We assume a maximum window size of 1.25 BDP (higher than 1 to manage transient bursts) and a minimum window size of 1 MTU. Modern link speeds and delays (with a BDP around 900KiB [7] and 4KiB MTUs) allow incast with hundreds (200 in this example) of senders targeting a single receiver to be easily supported. Higher incast degrees can be gracefully handled by SMaRTT at reduced link efficiency due to the potentially higher number of trimmed or dropped packets. Alternatively, SMaRTT can integrate a pacer-based solution similar to the one introduced by Swift for cases when the window would drop below one MTU [15]. However, it is worth remarking that such high-degree incast are unlikely to appear in modern AI workloads where traffic is dominated by highly optimized collective operations that limit or avoid incast [13].

We test SMaRTT using RED queueing discipline, with \(K_{min}\) and \(K_{max}\) set to \(20\%\) and \(80\%\) of the switch queue size respectively [37].

\(\mathit{trtt}\) is selected to be equal to \(1.5x\) the base RTT. Because we assume a queue size equal to 1 BDP, this corresponds to having half-full switch queues, thus halfway between \(K_{min}\) and \(K_{max}\). By doing so, SMaRTT allows the REPS load balancing to react by forwarding packets on different paths, before taking more drastic actions involving a reduction of the congestion window.

Selecting \(\mathit{qa\_scaling} = 1\) would let the RTT stabilize around the target RTT, corresponding to a switch queue size to roughly in-between \(K_{min}\) and \(K_{max}\). We set it instead to \(0.8\), so that the RTT drops to \(0.8\) the target RTT (i.e., \(0.8 \cdot 1.5 \mathit{brtt} = 1.2 \mathit{brtt}\)), corresponding to a switch queue size around \(K_{min}\), thus reducing the probability of triggering ECN marking.

We set the \(\mathit{fi}\) and \(\mathit{fd}\) parameters (see Sec. 3.2) to \(0.25\) and \(0.8\) respectively for 100Gbps links 4096B MTU. This selection comes from tuning over hundreds of combinations in all test scenarios of Sec. 4.

Last, some values need to be scaled depending on the network bandwidth and latency since changing these would affect the BDP and, thus, the maximum for congestion windows. Consequently, the congestion window can be increased more aggressively on networks with higher \(\mathit{bdp}\). Namely, the \(\mathit{fi}\) and \(\mathit{mi}\) parameters need to be multiplied by \(\gamma = \frac{\mathit{bdp}}{reference\_bdp}\), where \(\mathit{reference\_bdp}\) is the network \(\mathit{bdp}\) for a 100Gbps network with 12us RTT. On the other hand, to be more conservative, both \(\mathit{fd}\) and \(\mathit{md}\) are fixed regardless of link speed, as we always want to react strongly in case of congestion.

3.6 Recycled Entropy Packet Spray (REPS)↩︎

We couple SMaRTT with a novel per-packet load balancing mechanism, which we call REPS, which does not need any switch support besides ECMP [57]. With ECMP, each switch computes a hash function over some packet header fields to determine which path to forward the packet on. One of these fields is the entropy (e.g., the source port or the IPv6 Flow Label), which REPS changes per-packet.

For the first \(\mathit{bdp}\) of packets, REPS explores all possible entropies. The receiver copies the entropies from the received packet to the ACK, thus forwarding them back to the sender. Whenever an ACK arrives, if the ACK is not ECN-marked, the entropy carried by the ACK is cached. Intuitively, because the corresponding data packet did not experience congestion, we can send other packets on the same path, and REPS will thus reuse it for subsequent packets. Otherwise, if the ACK was ECN-marked, REPS selects the next entropy. This can be further extended by defining additional policies (e.g., excluding bad entropies after some time or storing entropies in a priority queue based on their RTTs and/or ECN marking), but we do not explore this further in this work. We evaluate REPS’ advantages versus traditional packet spraying and ECMP in Sec. 4.1. Algorithm 8 describes the REPS pseudocode, and Fig. 9 (a) shows an example of REPS’ behavior.

Figure 8: REPS Pseudocode



Figure 9: (a) Example of REPS behavior. We assume the sender already cycled over all the possible entropies. (b) Impact of WTD on SMaRTT..

3.6.1 Interaction between CC and load balancing↩︎

REPS can reduce the impact of fabric congestion (e.g., caused by ECMP hashing collisions) without reducing the congestion window size. This is especially relevant in permutation workloads running on non-oversubscribed networks where, in principle, the congestion can be avoided by properly balancing the traffic. To allow REPS to react to such congestion, SMaRTT does not react to ECN-marked packets unless several marked packets are received. In a nutshell, SMaRTT keeps an exponential moving average tracking ECN-marked packets by updating, for each packet, a variable \(\mathit{avg\_wtd} = \alpha \cdot \mathit{p.ecn} \mathrel{+} (1 - \alpha) \cdot \mathit{avg\_wtd}\). SMaRTT does not decrease the congestion window as long as \(\mathit{avg\_wtd} < 0.25\) (i.e. 25 percent or more of the acks indicate congestion, which is a sufficiently large threshold to rule out congestion caused by transient load imbalance). We call this feature Wait to Decrease (WTD), and we show in Fig. 9 (b) how it significantly reduces the FCT of a permutation workload running on a 1,024 nodes non-oversubscribed fat tree.

4 Evaluation↩︎ Simulation Setup

We ran simulations using and extending the htsim packet-level network simulator [19]. We also performed a set of experiments (Sec. 4.1.1) on a real deployment, which we discuss later. Our simulations consider four different fat-tree topologies, two non-blocking (one with 1,024 and the other with 128 nodes), and three with 1,024 nodes and 2:1, 4:1, and 8:1 oversubscription ratios, due to the relevance of such topologies in production datacenters [58]. We set a 4KiB MTU size, a 800 Gbps network bandwidth, and a 400ns switch traversal latency, consistent with the latest generation of switches [54], [59]. For the sake of simplicity, we assume that all links have the same length and, hence, the same latency, equal to 600ns. State-of-the-art Comparison

We evaluate SMaRTT-REPS against EQDS, Swift, MPRDMA, and BBR. To have a fair comparison and focus on congestion control, we enable trimming and multi-pathing for all the algorithms. Indeed, running Swift or BBR in their vanilla version using a single path would result in bad performance, mostly due to routing rather than congestion control. We assessed this experimentally by comparing the single-path vanilla version of Swift (using PLB [60]), a version using oblivious packet spraying, and one using REPS. We found out that the version using REPS consistently outperforms the others since PLB reacts to congestion only after several RTTs, and is thus not suitable for the scenarios we target. We thus use REPS for all the algorithms and all the experiments, except when stated otherwise. We do not compare SMaRTT with DCTCP and DCQCN since it has already been shown that EQDS consistently outperforms them [18]. Analyzed Workloads

We consider the following three traffic patterns: (i) incast: relevant for traditional datacenter workloads, that often send requests to a large number of workers, and must then handle their near-simultaneous responses; (ii) permutation: simulates point to point connections between pairs of nodes, selected so that each packet crosses the core switches, to emulate a worst case scenario and stress the load balancer; (iii) alltoall: relevant for many AI workloads such as Mixture of Experts models.

In the following, we first analyze the performance gain obtained through the use of REPS (Sec. 4.1), as well as the advantages of using packet trimming (Sec. 4.2). Then, we compare SMaRTT’s performance with state-of-the-art algorithms on incast (Sec. 4.3), permutation (Sec. 4.4), and alltoall (Sec. 4.5) patterns.

4.1 REPS↩︎

In Fig. 10 (a), we compare SMaRTT’s performance using different load balancing algorithms (REPS, ECMP [57], PLB [60], [61], and oblivious packet spraying [62]). We simulate a 32MiB permutation pattern on the 4:1 oversubscribed fat tree. We observe that, due to better fairness, REPS improves the completion time by roughly 10% compared to oblivious spraying. We also observe that with per-flow ECMP, the job completion time is 1.5x times higher than that of packet spraying and REPS due to more flow collisions (not shown in the plot). PLB also struggle but it is able to achieve a much better fairness compared to ECMP. Moreover, by analyzing the packet RTT distribution in Fig. 10 (b). We observe that REPS is characterized by RTTs closer to the target one and with a lower variability, indicating better network utilization.



Figure 10: 32MiB permutation running on a 8:1 oversubscribed fat tree. Flow completion times and RTT distribution of different load balancing algorithms..




Figure 11: REPS impact on goodput (a) and FCT (b) in an asymmetric network, and on a link failure scenario (c)..

4.1.1 Handling Faults and Network Asymmetries↩︎

REPS is also effective in dealing with network faults and asymmetries. To assess such capabilities, we implemented REPS in a production-grade FPGA-based RDMA-capable NIC. We then compared REPS against a baseline oblivious packet spray approach. The experimental setup consists of a two-tier fat-tree network with 100G FPGA-based NICs and Tomahawk3-based switches. Tier 0 (T0) downlinks are 100G and baseline Tier0-Tier1 uplinks are 400G. Endpoints run 4MB ring-based AllReduce operations on RDMA, with the logical ring laid out such that all connections traverse the Tier 1 (T1) spine and maximize pressure on the network.

In the first experiment, we connect 16 endpoints through two T0 switches (8 endpoints each) with a total of 4 links to a pair of T1 switches. To demonstrate the adaptiveness of REPS, we changed the link speed of one T0-T1 link from 400G to 200G, creating asymmetry in the network. Fig. 11 (a) shows the per-flow goodput as observed by the application. Oblivious packet spray sends packets across all paths (including those crossing the 200G link) with equal probability and is ultimately capped by the slower 200G path. Indeed, the congestion signal produced from the slower path traffic limits the total flow goodput to the bottlenecked path (thus severely underutilizing the remaining capacity of the 400G links).

In contrast, REPS can gracefully adapt in such a scenario as the cached entropies will reflect the network asymmetry and result in a path distribution that is skewed to tailor to the relative capacity of the available paths. In this example, REPS can reach high utilization with average per-flow goodput within 5% of the ideal fair-share target. We also report in Fig. 11 (b) hardware-collected FCT distributions, showing that REPS can maintain a lower and tighter FCT distribution, significantly reducing average and tail FCTs.

Figure 12: Flow completion time in a 16:1 512KiB incast with and without trimming.

Figure 13: Increase in runtime due to the lack of packet trimming support for different workloads on a 1,024-node fat tree. Perm. stands for permutation with an oversubscribed network. Incasts are run without over-subscription.




Figure 14: Relative incast performance for different message sizes and incast degrees..

To demonstrate the robustness and resilience of REPS in the context of link failures, Fig. 11 (c) shows total packet drops (average across four runs each) observed in a large-scale 128-endpoint run (endpoints split across 2 T0s connected through 8 T1s) where we abruptly bring down a T0-T1 link during the experiment. While the network is trying to recover from the impact of this event (which in our environment can take in the order of 100s of milliseconds), the oblivious spraying continues sending packets across all paths (including those affected by the link that went down). On the other hand, REPS quickly adapts (within the order of an RTT) and avoids sending packets down the affected paths as the entropy cache is replenished from packets traversing healthy paths. In these sample runs, REPS exhibited a much more predictable behavior and reduced the impact of link failure, providing an order of magnitude reduction in the total number of packet drops.

4.2 Trimming vs. Timeouts↩︎

As discussed in Sec. 3.3, SMaRTT relies on timeouts to detect packet drops when the switches do not support trimming. We now evaluate the impact of the lack of trimming. In Fig. 12, we report an incast scenario on small (512 KiB) messages, showing that the lack of trimming increases the FCT by slightly more than one base RTT (equal to 8.6 \(\mu\)s in our setup).

We investigate this further in Fig. 13, reporting the increase in FCT due to the lack of trimming support, showing that the additional delay is often smaller than one base RTT, and always within two base RTTs on all the scenarios we evaluated. We also observed that less than 0.2% packets were unnecessarily re-transmitted (not shown in the plot).

4.3 Incast↩︎

Fig. 14 shows the relative performance of SMaRTT in incast scenarios ranging from an 8:1 incast to a 100:1 incast for varying message sizes. We note that EQDS, being a receiver-based mechanism, can easily achieve good performance. Indeed, it can precisely communicate the window size to each sender and schedule them to be perfectly fair.

On the other hand, sender-based mechanisms like MPRDMA, SMaRTT, and Swift perform slightly worse, particularly true for medium sized messages. Small messages do not overrun switch buffers, whereas for large enough messages all the algorithms have enough time to converge to the correct rate. The zone in between, usually for messages slightly smaller than BDP, is subject to random drop events that affect sender-based mechanisms more. Moreover, MPRDMA seems to have slightly worse performance due to its lack of fairness features. Finally, BBR struggles for medium-sized messages since it needs more time to converge to the correct rate.

4.4 Permutation↩︎

In Fig. 1, we show the results for a 2 MiB and 32 MiB permutation traffic pattern on an 8:1 oversubscribed fat tree with 1,024 nodes. It is worth noting that, when using a relatively small message of 2 MiB, just slightly bigger than the BDP, SMaRTT outperforms all other algorithms while being fair (it can quickly react thanks to ECN). In Fig. 15 (a) and Fig. 15 (b), we show what happens on 2:1 and 4:1 oversubscribed fat trees respectively. In these cases, SMaRTT still outperforms the other algorithms, but not as much as in other cases, a smaller oversubscription ratio is easier to manage even for simpler CC algorithm.

In Fig. 15 (c) we report instead the results for a scenario where multiple permutations run in parallel (which can be the case with a Butterfly AllReduce or a windowed AllToAll). In this case, the performance gap between SMaRTT and EQDS increases up to 50%, while all the other sender based-algorithms perform better but are still about 10% slower than SMaRTT. Finally, we show in Fig. 15 (d) a scenario with 2MiB flows, except for one that is bigger and sends 4MiB. This scenario shows how SMaRTT outperforms other sender-based mechanisms by almost 30% thanks to its capacity to reclaim bandwidth quickly (see Sec. 3.4). This is also the case for EQDS, which outperforms MPRDMA, BBR, and Swift since it can also reclaim unused bandwidth quickly.





Figure 15: Flow completion time with for different permutations on fat trees with different oversubscription ratios..

Generally, we can notice that for high oversubscription ratios, the performance of EQDS degrades even more due to the high number of trimmed packets. To further emphasize this point, we measure the number of trimmed packets for EQDS and report that, in the worst case scenario just mentioned, it can generate up to 155x more trimmed packets than SMaRTT, resulting in wasted bandwidth and resources. The EQDS authors state that in such cases a CC algorithm on the sender side would benefit its performance, and we explore that in Sec. 5.1.

4.5 Alltoall↩︎

We also evaluate performance for alltoall communication, a collective operation commonly used in AI workloads [13]. Due to the large number of messages generated, we use the smaller topology to reduce simulation times. We use a windowed algorithm for the alltoall to have at least \(k\) active flows per node at any time. We show the results in an oversubscribed fat tree, as a non blocking tree topology can theoretically handle alltoall traffic without problems for congestion control. We report the results of our evaluation in Fig. 16. We observe that EQDS performance drops as the number of parallel connection increases, consistent with the multiple permutation results in Fig. 15 (c). On the other hand, sender-based mechanisms handle that case better, with SMaRTT taking the lead being only 6% behind the ideal time.

Figure 16: The number on top of each bar indicates the time distance to the ideal completion time (grey dotted line). k indicates the number of active flows at any given time for a sender.

5 Discussion↩︎

5.1 Augmenting EQDS with SMaRTT↩︎

One of the main drawbacks of using a receiver-based mechanism like EQDS, as shown in the previous results, is its bad management of fabric congestion. As suggested in the EQDS paper, it should be possible to improve EQDS’ performance by complementing it with a sender-based CC algorithm. To do so, we explore using EQDS with SMaRTT, as well as with a DCTCP-style CC algorithm (MPRDMA). While running in combination with EQDS, the role of the sender-based congestion control is to limit its sending rate by capping the congestion window and adapting as needed.

In theory, this combination should give us most of the benefits of both world: perfect incast management thanks to the receiver-based CC and the ability to deal with the other forms of congestion thanks to a solid sender-based CC. In this work, we only show a simple scenario where such combinations seem beneficial for EQDS. A more in-depth analysis of such a combination is necessary but outside the scope of this paper.


All flows 2MiB.


One longer flow at 64 MiB.

Figure 17: Augmenting EQDS with sender-based mechanisms..

In Fig. 17 we report a permutation scenario on an 8:1 oversubscribed fat tree, with 2 MiB and 32 MiB flows. We can see that SMaRTT improves EQDS performance. For 2MiB flows, the improvement introduced by SMaRTT is higher than that of MPRDMA, thanks to its better fairness. When running with 32 MiB flows except for one being 64 MiB, we can also see how we do well on our own but also help EQDS.

6 Related Work↩︎

Researchers proposed several datacenter CC algorithms, and we discuss in the following how SMaRTT differs from some of the most recently proposed ones. Receiver-based algorithms such as NDP [19], EQDS [18], pHost [45], ExpressPass [46], SMSRP [47], and Homa [48] maintain end-to-end credits that are used to adjust sender flow rates. They can effectively manage incast scenarios by precisely regulating the transmission rates of the different senders. We have shown, however, that they can struggle when dealing with fabric congestion unless they are complemented with a sender-based mechanism. SMaRTT can instead deal with fabric congestion and incast but without requiring pull queues.

On the other hand, sender-based CC algorithms can deal better with fabric congestion, but might not be as responsive when confronted with sudden workload shift (e.g. incast or dynamic job arrivals or departures). Some of those algorithms like DCTCP [63], Hull [64], and D\(^2\)TCP [65] rely on ECN marking. For example, DCQCN [34] combines ECN and Priority-based Flow Control (PFC) to avoid packet losses and react to congestion quickly. However, PFC is hard to tune and can cause PFC storms [7], [8]. Differently from DCQCN, SMaRTT is designed for lossy networks and relies on packet trimming to avoid switches dropping packets, thus avoiding all the issues related to the use of PFC. Other sender-based algorithms Swift [15] and TIMELY [32] detect congestion using delay but, as we show in this paper, they cannot react as fast as ECN-based algorithms.

Last, some works recently demonstrated that having an unfair congestion control can improve performance of jobs running ML training workloads [66]. These works are orthogonal to SMaRTT, since they target fairness between different jobs, whereas SMaRTT improves fairness between flows belonging to the same job.

7 Conclusion↩︎

This paper proposes SMaRTT, a sender- and window-based CC algorithm using ECN and delay as congestion signals and packet trimming to detect packet losses. When trimming is not supported SMaRTT falls back to a timeout-based mechanism, which we show adds at most two base RTT of delay. We couple SMaRTT with REPS, a simple and resource-efficient load balancer that significantly improves the FCT, as also shown in our evaluation on a real testbed. REPS can drastically improve the load balancing performance compared to traditional per-flow ECMP or oblivious packet spraying at minimal complexity cost. We also show that REPS can effectively handle faults and network asymmetries using a real network. We evaluate SMaRTT on several workloads, showing that it constantly either outperforms or matches the performance of EQDS, Swift, BBR, and MPRDMA by up to 50%.


Neil C. Thompson, Kristjan Greenewald, Keeheon Lee, and Gabriel F. Manso.2022. The Computational Limits of Deep Learning. (2022).
D. Hernandez D. Amodei.2018. The Computational Limits of Deep Learning. (2018).
Daniele De Sensi, Tiziano De Matteis, Konstantin Taranov, Salvatore Di Girolamo, Tobias Rahn, and Torsten Hoefler.2022. . Proc. ACM Meas. Anal. Comput. Syst.6, 3, Article 49(Dec.2022), 27 pages.
Wei Bai, Shanim Sainul Abdeen, Ankit Agrawal, Krishan Kumar Attre, Paramvir Bahl, Ameya Bhagat, Gowri Bhaskara, Tanya Brokhman, Lei Cao, Ahmad Cheema, Rebecca Chow, Jeff Cohen, Mahmoud Elhaddad, Vivek Ette, Igal Figlin, Daniel Firestone, Mathew George, Ilya German, Lakhmeet Ghai, Eric Green, Albert Greenberg, Manish Gupta, Randy Haagens, Matthew Hendel, Ridwan Howlader, Neetha John, Julia Johnstone, Tom Jolly, Greg Kramer, David Kruse, Ankit Kumar, Erica Lan, Ivan Lee, Avi Levy, Marina Lipshteyn, Xin Liu, Chen Liu, Guohan Lu, Yuemin Lu, Xiakun Lu, Vadim Makhervaks, Ulad Malashanka, David A. Maltz, Ilias Marinos, Rohan Mehta, Sharda Murthi, Anup Namdhari, Aaron Ogus, Jitendra Padhye, Madhav Pandya, Douglas Phillips, Adrian Power, Suraj Puri, Shachar Raindel, Jordan Rhee, Anthony Russo, Maneesh Sah, Ali Sheriff, Chris Sparacino, Ashutosh Srivastava, Weixiang Sun, Nick Swanson, Fuhou Tian, Lukasz Tomczyk, Vamsi Vadlamuri, Alec Wolman, Ying Xie, Joyce Yom, Lihua Yuan, Yanzhao Zhang, and Brian Zill.2023. . In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). USENIX Association, Boston, MA, 49–67.
Leah Shalev, Hani Ayoub, Nafea Bshara, and Erez Sabbag.2020. . IEEE Micro40, 6(2020), 67–73.
Dan Gibson, Hema Hariharan, Eric Lance, Moray McLaren, Behnam Montazeri, Arjun Singh, Stephen Wang, Hassan M. G. Wassel, Zhehua Wu, Sunghwan Yoo, Raghuraman Balasubramanian, Prashant Chandra, Michael Cutforth, Peter Cuy, David Decotigny, Rakesh Gautam, Alex Iriza, Milo M. K. Martin, Rick Roy, Zuowei Shen, Ming Tan, Ye Tang, Monica Wong-Chan, Joe Zbiciak, and Amin Vahdat.2022. . In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). USENIX Association, Renton, WA, 1249–1266.
Torsten Hoefler, Duncan Roweth, Keith Underwood, Robert Alverson, Mark Griswold, Vahid Tabatabaee, Mohan Kalkunte, Surendra Anubolu, Siyuan Shen, Moray McLaren, Abdul Kabbani, and Steve Scott.2023. . Computer56, 7(2023), 67–77.
Radhika Mittal, Alexander Shpiner, Aurojit Panda, Eitan Zahavi, Arvind Krishnamurthy, Sylvia Ratnasamy, and Scott Shenker.2018. . In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication(SIGCOMM ’18). Association for Computing Machinery, New York, NY, USA, 313–326.
Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine.2009. . Parallel Processing Letters (PPL)19, 4(Aug.2009), 573–593.
Daniele De Sensi, Salvatore Di Girolamo, and Torsten Hoefler.2019. . In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC19).
Jeffrey Dean Luiz André Barroso.2013. . Commun. ACM56, 2(feb2013), 74–80.
Serhat Arslan, Yuliang Li, Gautam Kumar, and Nandita Dukkipati.2023. . In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). USENIX Association, Boston, MA, 219–236.
Torsten Hoefler, Tommaso Bonato, Daniele De Sensi, Salvatore Di Girolamo, Shigang Li, Marco Heddes, Jon Belk, Deepak Goel, Miguel Castro, and Steve Scott.2022. . In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis(SC ’22). IEEE Press, Article 11, 18 pages.
Qiao Zhang, Vincent Liu, Hongyi Zeng, and Arvind Krishnamurthy.2017. . In Proceedings of the 2017 Internet Measurement Conference(IMC ’17). Association for Computing Machinery, New York, NY, USA, 78–85.
Gautam Kumar, Nandita Dukkipati, Keon Jang, Hassan Wassel, Xian Wu, Behnam Montazeri, Yaogong Wang, Kevin Springborn, Christopher Alfeld, Mike Ryan, David J. Wetherall, and Amin Vahdat.2020. .
Neal Cardwell, Yuchung Cheng, C. Stephen Gunn, Soheil Hassas Yeganeh, and Van Jacobson.2017. . Commun. ACM60(2017), 58–66.
Yuanwei Lu, Guo Chen, Bojie Li, Kun Tan, Yongqiang Xiong, Peng Cheng, Jiansong Zhang, Enhong Chen, and Thomas Moscibroda.2018. . In Proceedings of the 15th USENIX Conference on Networked Systems Design and Implementation(NSDI’18). USENIX Association, USA, 357–371.
Vladimir Olteanu, Haggai Eran, Dragos Dumitrescu, Adrian Popa, Cristi Baciu, Mark Silberstein, Georgios Nikolaidis, Mark Handley, and Costin Raiciu.2022. . In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). USENIX Association, Renton, WA, 761–777.
Mark Handley, Costin Raiciu, Alexandru Agache, Andrei Voinescu, Andrew W. Moore, Gianni Antichi, and Marcin Wójcik.2017. . In Proceedings of the Conference of the ACM Special Interest Group on Data Communication(SIGCOMM ’17). Association for Computing Machinery, New York, NY, USA, 29–42.
Yanpei Chen, Rean Griffith, Junda Liu, Randy H. Katz, and Anthony D. Joseph.2009. . In Proceedings of the 1st ACM Workshop on Research on Enterprise Networking(WREN ’09). Association for Computing Machinery, New York, NY, USA, 73–82.
Mohammad Alizadeh, Tom Edsall, Sarang Dharmapurikar, Ramanan Vaidyanathan, Kevin Chu, Andy Fingerhut, Vinh The Lam, Francis Matus, Rong Pan, Navindra Yadav, and George Varghese.2014. . In Proceedings of the 2014 ACM Conference on SIGCOMM(SIGCOMM ’14). Association for Computing Machinery, New York, NY, USA, 503–514.
Soudeh Ghorbani, Zibin Yang, P. Brighten Godfrey, Yashar Ganjali, and Amin Firoozshahian.2017. . In Proceedings of the Conference of the ACM Special Interest Group on Data Communication(SIGCOMM ’17). Association for Computing Machinery, New York, NY, USA, 225–238.
Jiaxin Cao, Rui Xia, Pengkun Yang, Chuanxiong Guo, Guohan Lu, Lihua Yuan, Yixin Zheng, Haitao Wu, Yongqiang Xiong, and Dave Maltz.2013. . In Proceedings of the Ninth ACM Conference on Emerging Networking Experiments and Technologies(CoNEXT ’13). Association for Computing Machinery, New York, NY, USA, 49–60.
Arjun Roy, Hongyi Zeng, Jasmeet Bagga, George Porter, and Alex C. Snoeren.2015. . SIGCOMM Comput. Commun. Rev.45, 4(aug2015), 123–137.
Rachee Singh, Muqeet Mukhtar, Ashay Krishna, Aniruddha Parkhi, Jitendra Padhye, and David Maltz.2021. . SIGCOMM Comput. Commun. Rev.51, 2(may2021), 2–9.
Yuliang Li, Rui Miao, Hongqiang Harry Liu, Yan Zhuang, Fei Feng, Lingbo Tang, Zheng Cao, Ming Zhang, Frank Kelly, Mohammad Alizadeh, and Minlan Yu.2019. . In Proceedings of the ACM Special Interest Group on Data Communication(SIGCOMM ’19). Association for Computing Machinery, New York, NY, USA, 44–58.
Parvin Taheri, Danushka Menikkumbura, Erico Vanini, Sonia Fahmy, Patrick Eugster, and Tom Edsall.2020. . In Proceedings of the 16th International Conference on Emerging Networking EXperiments and Technologies(CoNEXT ’20). Association for Computing Machinery, New York, NY, USA, 17–30.
Prateesh Goyal, Preey Shah, Kevin Zhao, Georgios Nikolaidis, Mohammad Alizadeh, and Thomas E. Anderson.2022. . In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). USENIX Association, Renton, WA, 779–805.
Daniele De Sensi, Salvatore Di Girolamo, and Torsten Hoefler.2019. . In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis(SC ’19). Association for Computing Machinery, New York, NY, USA, Article 16, 32 pages.
José Rocher-González, Ernst Gunnar Gran, Sven-Arne Reinemo, Tor Skeie, Jesús Escudero-Sahuquillo, Pedro Javier García, and Francisco J. Quiles Flor.2022. . In 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid). 463–472.
V. Cerf R. Kahn.1974. . IEEE Transactions on Communications22, 5(1974), 637–648.
Radhika Mittal, Terry Lam, Nandita Dukkipati, Emily Blem, Hassan Wassel, Monia Ghobadi, Amin Vahdat, Yaogong Wang, David Wetherall, and David Zats.2015. . In Sigcomm ’15.
Changhyun Lee, Chunjong Park, Keon Jang, Sue Moon, and Dongsu Han.2017. . IEEE/ACM Transactions on Networking25, 1(2017), 335–348.
Yibo Zhu, Yibo Zhu, Haggai Eran, Daniel Firestone, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, Ming Zhang, and Jitu Padhye.2015. . In SIGCOMM (sigcomm ed.). ACM - Association for Computing Machinery.
Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan.2010. . SIGCOMM Comput. Commun. Rev.40, 4(aug2010), 63–74.
Jin Ye, Renzhang Liu, Ziqi Xie, Luting Feng, and Sen Liu.2019. . In 2019 28th International Conference on Computer Communication and Networks (ICCCN). 1–10.
S. Floyd V. Jacobson.1993. . IEEE/ACM Transactions on Networking1, 4(1993), 397–413.
Sally Floyd, Dr. K. K. Ramakrishnan, and David L. Black.2001. The Addition of Explicit Congestion Notification (ECN) to IP. RFC 3168. (Sept.2001).
Haitao Wu, Chuanxiong Guo, Yongqiang Xiong, and Yongguang Zhang.2012. . In ACM CoNEXT’12. ACM.
Kathleen Nichols Van Jacobson.2012. Queue10, 5(may2012), 20–34.
Peng Cheng, Fengyuan Ren, Ran Shu, and Chuang Lin.2014. . In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14). USENIX Association, Seattle, WA, 17–28.
Popa Adrian, Dumitrescu Dragos, Handley Mark, Nikolaidis Georgios, Lee Jeongkeun, and Raiciu Costin.2022. Implementing packet trimming support in hardware. (2022).
Alessio Sacco, Antonino Angi, Flavio Esposito, and Guido Marchetto.2023. . In 2023 IEEE 24th International Conference on High Performance Switching and Routing (HPSR). 83–88.
Jin Wang, Dongzhi Yuan, Wangqing Luo, Shuying Rao, R. Simon Sherratt, and Jinbin Hu.2023. . Computers, Materials & Continua75, 1(2023), 1195–1212.
Peter X. Gao, Akshay Narayan, Gautam Kumar, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker.2015. . In Proceedings of the 11th ACM Conference on Emerging Networking Experiments and Technologies(CoNEXT ’15). Association for Computing Machinery, New York, NY, USA, Article 1, 12 pages.
Zejia Zhou, Dezun Dong, Shan Huang, and Zihao Wei.2019. . In 2019 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom). 46–52.
Nan Jiang, Larry Dennison, and William J. Dally.2015. . In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis(SC ’15). Association for Computing Machinery, New York, NY, USA, Article 35, 12 pages.
Behnam Montazeri, Yilong Li, Mohammad Alizadeh, and John Ousterhout.2018. . In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication(SIGCOMM ’18). Association for Computing Machinery, New York, NY, USA, 221–235.
Xiaolong Zhong, Jiao Zhang, Yali Zhang, Zixuan Guan, and Zirui Wan.2022. . In IEEE INFOCOM 2022 - IEEE Conference on Computer Communications. 2228–2237.
Broadcom.2024. Deploying AI/ML training clusters with IP/Ethernet. (2024).
Nvidia.2024. Networking for the Era of AI: The Network Defines the Data Center. (2024).
Leon Poutievski, Omid Mashayekhi, Joon Ong, Arjun Singh, Mukarram Tariq, Rui Wang, Jianan Zhang, Virginia Beauregard, Patrick Conner, Steve Gribble, Rishi Kapoor, Stephen Kratzer, Nanfang Li, Hong Liu, Karthik Nagaraj, Jason Ornstein, Samir Sawhney, Ryohei Urata, Lorenzo Vicisano, Kevin Yasumura, Shidong Zhang, Junlan Zhou, and Amin Vahdat.2022. . In Proceedings of ACM SIGCOMM 2022.
John Kim, Wiliam J. Dally, Steve Scott, and Dennis Abts.2008. . In 2008 International Symposium on Computer Architecture. 77–88.
Daniele De Sensi, Salvatore Di Girolamo, Kim H. McMahon, Duncan Roweth, and Torsten Hoefler.2020. . In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. 1–14.
Chuanxiong Guo, Guohan Lu, Dan Li, Haitao Wu, Xuan Zhang, Yunfeng Shi, Chen Tian, Yongguang Zhang, and Songwu Lu.2009. . In Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication(SIGCOMM ’09). Association for Computing Machinery, New York, NY, USA, 63–74.
Maciej Besta Torsten Hoefler.2014. . In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis(SC ’14). IEEE Press, 348–359.
C. Hopps.2009. Analysis of an Equal-Cost Multi-Path Algorithm. RFC 2992. (Nov.2009).
Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, Ashby Armistead, Roy Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, Anand Kanagala, Jeff Provost, Jason Simmons, Eiichi Tanda, Jim Wanderer, Urs Hölzle, Stephen Stuart, and Amin Vahdat.2015. . In Sigcomm ’15.
Broadcom.2024. Tomahawk 5 Switch. (2024).
Mubashir Adnan Qureshi, Yuchung Cheng, Qianwen Yin, Qiaobin Fu, Gautam Kumar, Masoud Moshref, Junhua Yan, Van Jacobson, David Wetherall, and Abdul Kabbani.2022. . In Proceedings of the ACM SIGCOMM 2022 Conference(SIGCOMM ’22). Association for Computing Machinery, New York, NY, USA, 207–218.
Abdul Kabbani, Balajee Vamanan, Jahangir Hasan, and Fabien Duchene.2014. . In Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies(CoNEXT ’14). Association for Computing Machinery, New York, NY, USA, 149–160.
Advait Dixit, Pawan Prakash, Y. Charlie Hu, and Ramana Rao Kompella.2013. . In 2013 Proceedings IEEE INFOCOM. 2130–2138.
Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan.2010. . In Proceedings of the ACM SIGCOMM 2010 Conference(SIGCOMM ’10). Association for Computing Machinery, New York, NY, USA, 63–74.
Mohammad Alizadeh, Abdul Kabbani, Tom Edsall, Balaji Prabhakar, Amin Vahdat, and Masato Yasuda.2012. . In 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12). USENIX Association, San Jose, CA, 253–266.
Balajee Vamanan, Jahangir Hasan, and T.N. Vijaykumar.2012. . In Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication(SIGCOMM ’12). Association for Computing Machinery, New York, NY, USA, 115–126.
Sudarsanan Rajasekaran, Manya Ghobadi, Gautam Kumar, and Aditya Akella.2022. . In Proceedings of the 21st ACM Workshop on Hot Topics in Networks(HotNets ’22). Association for Computing Machinery, New York, NY, USA, 235–242.