计算机 · 2021年12月17日 0

BBR RFC

文档地址

https://tools.ietf.org/id/draft-cardwell-iccrg-bbr-congestion-control-00.html

https://datatracker.ietf.org/doc/html/draft-cardwell-iccrg-bbr-congestion-control-00

虽说这只是一个draft,但是总体没差(主要是文档美观,看着心情好)。

Introduction

Terminology

BBR.send_quantum: The maximum size of a data aggregate scheduled and transmitted together.

关于round trip

BBR里说到持续一个rtt时间时,需要注意区分究竟是持续一个packet-timed round trip还是持续一个wall clock time round trip。前者是持续一个自然的rtt,也就是收到发出去的包的ack,后者则是计时经过rtt这么久。

在PROBE_BW阶段,只有6个pacing_gain为1的phase是需要保持1个wall clock time的round trip。

对于pacing_gain为1.25的带宽上探阶段,退出条件为:出现丢包或者inflight达到1.25BDP。

对于pacing_gain为0.75的排空阶段,退出条件为:持续时间超过一个wall clock time的rtt或者inflight小于等于BDP。

Desing Overview

Network Path Model

Target Operating Point

Control Parameters

State Machine Design Overview

State Transition Diagram

             |
             V
    +---> Startup  ----+
    |        |         |
    |        V         |
    |      Drain   ----+
    |        |         |
    |        V         |
    +---> ProbeBW -----+
    |      ^    |      |
    |      |    |      |
    |      +----+      |
    |                  |
    +---- ProbeRTT <---+

Algorithm Organization

Initialization

 BBROnConnectionInit():
    BBRInit()

Per-ACK Steps

 BBRUpdateOnACK():
    BBRUpdateModelAndState()
    BBRUpdateControlParameters()

  BBRUpdateModelAndState():
    BBRUpdateBtlBw()
    BBRCheckCyclePhase()
    BBRCheckFullPipe()
    BBRCheckDrain()
    BBRUpdateRTprop()
    BBRCheckProbeRTT()

  BBRUpdateControlParameters()
    BBRSetPacingRate()
    BBRSetSendQuantum()
    BBRSetCwnd()

Per-Transmit Steps

 BBROnTransmit():
    BBRHandleRestartFromIdle()

Environment and Usage

Detailed Algorithm

Maintaining the Network Path Model

BBR.BtlBw

Delivery Rate Samples for Estimating BBR.BtlBw

有一篇专门讲计算Delivery Rate的文档

BBR.BtlBw Max Filter

BBR的瓶颈带宽过期时间是用的packet timed round trip,瓶颈带宽超过10个round才会过期;

Tracking Time for the BBR.BtlBw Max Filter

连接初始化:

 BBRInitRoundCounting():
    BBR.next_round_delivered = 0
    BBR.round_start = false
    BBR.round_count = 0

每发送一个包:

 packet.delivered = BBR.delivered

每接收一个ACK:

 BBRUpdateRound():
    BBR.delivered += packet.size
    if (packet.delivered >= BBR.next_round_delivered)
      BBR.next_round_delivered = BBR.delivered
      BBR.round_count++
      BBR.round_start = true
    else
      BBR.round_start = false

BBR.BtlBw and Application-limited Delivery Rate Samples

Updating the BBR.BtlBw Max Filter

 BBRUpdateBtlBw()
    BBRUpdateRound()
    if (rs.delivery_rate >= BBR.BtlBw || ! rs.is_app_limited) 
        BBR.BtlBw = update_windowed_max_filter(
                      filter=BBR.BtlBwFilter,
                      value=rs.delivery_rate,
                      time=BBR.round_count,
                      window_length=BtlBwFilterLen)

BBR.RTprop

Round-Trip Time Samples for Estimating BBR.RTprop

对于一个ack确认多个已发送数据包的情形,BBR取最后发送的那一个包的发送时间来计算RTT,以此来逼近minRtt。

The only divergence from RTT estimation for retransmission timeouts is in the case where a given acknowledgment ACKs more than one data packet. In order to be conservative and schedule long timeouts to avoid spurious retransmissions, the maximum among such potential RTT samples is typically used for computing retransmission timeouts; i.e., SRTT is typically calculated using the data packet with the earliest transmission time. By contrast, in order for BBR to try to reach the minimum amount of data in flight to fill the pipe, BBR uses the minimum among such potential RTT samples; i.e., BBR calculates the RTT using the data packet with the latest transmission time.

BBR.RTprop Min Filter

使用10s作为minRtt的失效时间。

Updating the BBR.RTprop Min Filter

记录每个已发送包的发送时间:

packet.send_time = Now()

每次收到ACK时记录接收时间:

packet.rtt = Now() - packet.send_time

更新minRtt:

 BBRUpdateRTprop()
    BBR.rtprop_expired =
      Now() > BBR.rtprop_stamp + RTpropFilterLen
    if (packet.rtt >= 0 and
       (packet.rtt <= BBR.RTprop or BBR.rtprop_expired))
      BBR.RTprop = packet.rtt
      BBR.rtprop_stamp = Now()

BBR Control Parameters

Pacing Rate

计算每个包的发送时间:

next_send_time = Now() + packet.size / pacing_rate

设置初始发送速率:

 BBRInitPacingRate():
    nominal_bandwidth = InitialCwnd / (SRTT ? SRTT : 1ms)
    BBR.pacing_rate =  BBR.pacing_gain * nominal_bandwidth

每次收到ACK时更新pacing_rate:

 BBRSetPacingRateWithGain(pacing_gain):
    rate = pacing_gain * BBR.BtlBw
    if (BBR.filled_pipe || rate > BBR.pacing_rate)
      BBR.pacing_rate = rate

  BBRSetPacingRate():
    BBRSetPacingRateWithGain(BBR.pacing_gain)

Send Quantum

高带宽情况下实现精准的pacing可能会增加CPU的消耗,因此可以根据带宽调整每次发包的数据量:

 BBRSetSendQuantum():
    if (BBR.pacing_rate < 1.2 Mbps)
      BBR.send_quantum = 1 * MSS
    else if (BBR.pacing_rate < 24 Mbps)
      BBR.send_quantum  = 2 * MSS
    else
      BBR.send_quantum  = min(BBR.pacing_rate * 1ms, 64KBytes)

Congestion Window

Initial Congestion Window

Target cwnd

inflight数总是不能超过target_cwnd。每次收到ACK后都需要更新target_cwnd:

 BBRInflight(gain):
    if (BBR.RTprop == Inf)
      return InitialCwnd /* no valid RTT samples yet */
    quanta = 3*BBR.send_quantum
    estimated_bdp = BBR.BtlBw * BBR.RTprop
    return gain * estimated_bdp + quanta

  BBRUpdateTargetCwnd():
    BBR.target_cwnd = BBRInflight(BBR.cwnd_gain)

Minimum cwnd for pipelining

为了使得管道被充分利用,BBR提出设置4MSS的cwnd下限。

Modulating cwnd in Loss Recovery

出现重传定时器超时情况时,BBR需要进入loss recovery状态,先将cwnd调整为1,然后再按照下面的Core cwnd Adjustment Mechanism进行调整:

 BBR.prior_cwnd = BBRSaveCwnd()
  cwnd = 1

进入Fast Recovery时,在第一个round内,保证cwnd与不丢包的发送速率相匹配,从第二个round开始则保证cwnd与2倍不丢包的发送速率相匹配:

 BBR.prior_cwnd = BBRSaveCwnd()
  cwnd = packets_in_flight + max(packets_delivered, 1)
  BBR.packet_conservation = true

在Fast Recovery阶段,每次收到ACK时,

 BBRModulateCwndForRecovery():
    if (packets_lost > 0)
      cwnd = max(cwnd - packets_lost, 1)
    if (BBR.packet_conservation)
      cwnd = max(cwnd, packets_in_flight + packets_delivered)

上面的packets_delivered表示每次收到ACK时新确认收到的包数,packets_lost为每次收到ACKS新确认的丢包数。

Fast Recovery持续一个round trip后:

 BBR.packet_conservation = false

退出Loss Recovery或者Fast Recovery时,需要恢复cwnd:

 BBR.packet_conservation = false
  BBRRestoreCwnd()
 BBRSaveCwnd():
    if (not InLossRecovery() and BBR.state != ProbeRTT)
      return cwnd
    else
      return max(BBR.prior_cwnd, cwnd)

  BBRRestoreCwnd():
    cwnd = max(cwnd, BBR.prior_cwnd)

Modulating cwnd in ProbeRTT

进入ProbeRTT状态时调整cwnd:

 BBRModulateCwndForProbeRTT():
    if (BBR.state == ProbeRTT)
      cwnd = min(cwnd, BBRMinPipeCwnd)

Core cwnd Adjustment Mechanism

 BBRSetCwnd():
    BBRUpdateTargetCwnd()
    BBRModulateCwndForRecovery()
    if (not BBR.packet_conservation) {
      if (BBR.filled_pipe)
        cwnd = min(cwnd + packets_delivered, BBR.target_cwnd)
      else if (cwnd < BBR.target_cwnd || BBR.delivered < InitialCwnd)
        cwnd = cwnd + packets_delivered
      cwnd = max(cwnd, BBRMinPipeCwnd)
    }
    BBRModulateCwndForProbeRTT()

State Machine

Initialization Steps

连接建立时:

 BBRInit():
    init_windowed_max_filter(filter=BBR.BtlBwFilter, value=0, time=0)
    BBR.rtprop = SRTT ? SRTT : Inf
    BBR.rtprop_stamp = Now()
    BBR.probe_rtt_done_stamp = 0
    BBR.probe_rtt_round_done = false
    BBR.packet_conservation = false
    BBR.prior_cwnd = 0
    BBR.idle_restart = false
    BBRInitRoundCounting()
    BBRInitFullPipe()
    BBRInitPacingRate()
    BBREnterStartup()

STARTUP

Startup Dynamics

 BBREnterStartup():
    BBR.state = Startup
    BBR.pacing_gain = BBRHighGain
    BBR.cwnd_gain = BBRHighGain

Estimating When Startup has Filled the Pipe

初始化:

 BBRInitFullPipe():
    BBR.filled_pipe = false
    BBR.full_bw = 0
    BBR.full_bw_count = 0

在每个round开始时检查是否已经有3个round带宽没有上升25%:

 BBRCheckFullPipe():
    if BBR.filled_pipe or
       not BBR.round_start or rs.is_app_limited
      return  // no need to check for a full pipe now
    if (BBR.BtlBw >= BBR.full_bw * 1.25)  // BBR.BtlBw still growing?
      BBR.full_bw = BBR.BtlBw    // record new baseline level
      BBR.full_bw_count = 0
      return
    BBR.full_bw_count++ // another round w/o much growth
    if (BBR.full_bw_count >= 3)
      BBR.filled_pipe = true

为什么是3个round?这个和tcp的receive window autotuning机制有关:

BBR waits three rounds in order to have solid evidence that the sender is not detecting a delivery-rate plateau that was temporarily imposed by the receive window. Allowing three rounds provides time for the receiver’s receive-window autotuning to open up the receive window and for the BBR sender to realize that BBR.BtlBw should be higher: in the first round the receive-window autotuning algorithm grows the receive window; in the second round the sender fills the higher receive window; in the third round the sender gets higher delivery-rate samples. This three-round threshold was validated by YouTube experimental data.

Drain

为了减少STARTUP阶段引起的链路堆积,需要进入Drain阶段,在持续一个RTT或者inflight小于等于BDP时退出DRAIN。

 BBREnterDrain():
    BBR.state = Drain
    BBR.pacing_gain = 1/BBRHighGain  // pace slowly
    BBR.cwnd_gain = bbr_high_gain    // maintain cwnd
 BBRCheckDrain():
    if (BBR.state == Startup and BBR.filled_pipe)
      BBREnterDrain()
    if (BBR.state == Drain and packets_in_flight <= BBRInflight(1.0))
      BBREnterProbeBW()  // we estimate queue is drained

问题,为什么DRAIN阶段的pacing_gain是1/BBRHighGain?

ProbeBW

Gain Cycling Dynamics

pacing_gain按照[1.25,0.75,1,1,1,1,1,1]循环。cwnd_gain始终为2。

Gain Cycling Randomization

为了公平性,进入ProbeBW时,从0,2,3,4,5,6,7里随便选一个phase开始。不选pacing_gain为0.75这个phase是因为这个phase是为了减少链路堆积而设计的,如果进入ProbeBW时以0.75的pacing_gain这个phase开始,那么只会降低链路的利用率。

Gain Cycling Algorithm

进入ProbeBW时:

 BBREnterProbeBW():
    BBR.state = ProbeBW
    BBR.pacing_gain = 1
    BBR.cwnd_gain = 2
    BBR.cycle_index = BBRGainCycleLen - 1 - random_int_in_range(0..6)
    BBRAdvanceCyclePhase()

每次收到ACK时,检查是否需要advance cycle phase:

 BBRCheckCyclePhase():
    if (BBR.sate == ProbeBW and BBRIsNextCyclePhase()
      BBRAdvanceCyclePhase() 

  BBRAdvanceCyclePhase():
    BBR.cycle_stamp = Now()
    BBR.cycle_index = (BBR.cycle_index + 1) % BBRGainCycleLen
    pacing_gain_cycle = [5/4, 3/4, 1, 1, 1, 1, 1, 1]
    BBR.pacing_gain = pacing_gain_cycle[BBR.cycle_index]

  BBRIsNextCyclePhase():
    is_full_length = (Now() - BBR.cycle_stamp) > BBR.RTprop
    if (BBR.pacing_gain == 1)
      return is_full_length
    if (BBR.pacing_gain > 1)
      return is_full_length and
                (packets_lost > 0 or
                 prior_inflight >= BBRInflight(BBR.pacing_gain))
    else  //  (BBR.pacing_gain < 1)
      return is_full_length or
                 prior_inflight <= BBRInflight(1)

prior_in_flight和in_flight的区别:prior_in_flight是处理ack之前的in_flight,即prior_in_flight比in_flight要大一点。

Restarting From Idle

从Idle状态恢复时,BBR直接以估计带宽为pacing rate发送数据:

 BBRHandleRestartFromIdle():
    if (packets_in_flight == 0 and C.app_limited)
      BBR.idle_start = true
      if (BBR.state == ProbeBW)
        BBRSetPacingRateWithGain(1)

ProbeRTT

每次收到ACK,BBR都需要检查是否进入ProbeRTT状态,

 BBRCheckProbeRTT():
    if (BBR.state != ProbeRTT and
        BBR.rtprop_expired and
        not BBR.idle_restart)
      BBREnterProbeRTT()
      BBRSaveCwnd()
      BBR.probe_rtt_done_stamp = 0
    if (BBR.state == ProbeRTT)
      BBRHandleProbeRTT()
    BBR.idle_restart = false

  BBREnterProbeRTT():
    BBR.state = ProbeRTT
    BBR.pacing_gain = 1
    BBR.cwnd_gain = 1

  BBRHandleProbeRTT():
    /* Ignore low rate samples during ProbeRTT: */
    C.app_limited =
      (BW.delivered + packets_in_flight) ? : 1
    if (BBR.probe_rtt_done_stamp == 0 and
        packets_in_flight <= BBRMinPipeCwnd)
      BBR.probe_rtt_done_stamp =
        Now() + ProbeRTTDuration
      BBR.probe_rtt_round_done = false
      BBR.next_round_delivered = BBR.delivered
    else if (BBR.probe_rtt_done_stamp != 0)
      if (BBR.round_start)
        BBR.probe_rtt_round_done = true
      if (BBR.probe_rtt_round_done and
          Now() > BBR.probe_rtt_done_stamp)
        BBR.rtprop_stamp = Now()
        BBRRestoreCwnd()
        BBRExitProbeRTT()

  BBRExitProbeRTT():
    if (BBR.filled_pipe)
      BBREnterProbeBW()
    else
      BBREnterStartup()

如果连接是从idle状态恢复的(BBR.idle_restart is true),那么即使minRtt已经expire也不会进入ProbeRTT,因为连接处于idle状态时已经把链路排空了。