While I was out chasing computer history last week, the Linux 3.3 kernel was released. And a very interesting release it is, though not for its vaunted re-inclusion of certain Android kernel hacks. I think that modest move is being overblown in the press.  No, Linux 3.3 appears to be the first OS to really take a shot at reducing the problem of bufferbloat. It’s not the answer to this scourge, but it will help some, especially since Linux is so popular for high volume servers.

Bufferbloat, as you’ll recall from my 2011 predictions column, is the result of our misguided attempt to protect streaming applications (now 80 percent of Internet packets) by putting large memory buffers in modems, routers, network cards, and applications. These cascading buffers interfere with each other and with the flow control built into TCP from the very beginning, ultimately breaking that flow control, making things far worse than they’d be if all those buffers simply didn’t exist.

Bufferbloat was named by Jim Gettys of Bell Labs, who has become our chief defender against the scourge, attempting to coordinate what’s become a global response to the problem.

Linux 3.3 isn’t the total solution to bufferbloat but it’s a big step, particularly for servers.

Prepare for technospeak.

One issue is the very large ring buffers described above.  A typical device driver has these buffers set at 200-300 packets, a figure derived a decade ago as a worst case to allow devices to drive Gig-Ethernet flat-out using small packets. But not all packets are small, and there’s the rub.

Because these rings are necessarily expressed in packets, rather than in bytes, the length of time to transmit the packet can be radically different and this meant the arbitrary buffers can be up to 20 times larger than they need to be when sending big packets.  These rings are often constrained to be powers of two in size, and the size can’t easily be changed at runtime without dropping packets.

So the Linux 3.3 kernel now implements Byte Queue Limits (BQL) which controls how many bytes, rather than how many packets, go to the ring buffers at once.  The buffer size can now depend on the size of the packets — many for small packets, or only a few for large packets. Buffers get smaller and life gets better as a result.

BQL is currently implemented for Ethernet where the buffers can be sized so we can take advantage of smart hardware at high bandwidth. But BQL is not available for wireless networks, nor is it likely to be, simply because wireless bandwidth varies too much to express limits in terms of bytes.

According to Jim Gettys, the ultimate answer to bufferbloat is Active Queue Management (AQM), which isn’t yet ready for prime time but may be soon. Here’s Jim’s explanation from a message he sent me last week:

The purpose of buffers is to absorb bursts of traffic, which often occur in a network. You’d prefer to keep buffers almost empty even when running the link at its full speed to minimize latency; in fact, TCP running ideally can deliver packets close to “just in time” for the next transmit opportunity, so if TCP is running at the link speed, even though there is a buffer, the buffer could conceivably be kept (nearly) empty, and impose little delay.

But any size of unmanaged buffer can and will fill, and stay full. After all, TCP is designed to run “as fast as possible”.  So no matter what size “dumb” buffer you have, it can/will add latency; how much depends on its size.

In the face of variable bandwidth, and today’s internet with CDN’s, the traditional 100ms rule of thumb sizing of buffers (already excessive for good telephony) is nonsense. You don’t know the delay, you don’t know the bandwidth, so you really, really don’t know the bandwidth-delay product to size the buffers with…..

What AQM does is monitor the buffer, and signal the end points to slow down any time the buffer starts to fill, either due to that one transfer or competing transfers, by dropping or marking packets.  So the buffer is kept (almost) empty, except when it is handling a burst of traffic. So the steady state latency of the buffer, rather than being the size of the buffer, is set by the size of the bursts in traffic.  The size of the buffer becomes almost irrelevant.

Any link without AQM at a bottleneck is really defective; we must use an AQM algorithm everywhere….

The classic AQM algorithm is known as RED. It, however, is defective and requires manual tuning and can hurt you if it is mis-tuned.  As a result, it’s not present in most edge devices, and not even turned on in many ISP networks where it should be.  

What we need is an AQM algorithm that does the right thing without manual tuning, capable of dealing with varying bandwidth.  I’ve seen simulations of an algorithm that apparently works, but it’s not yet available and has not been tested or fully simulated.

This is Bob again. So Linux 3.3 with BQL is good but not good enough. AQM is required but that’s still two years away from shipping in devices while the mystery algorithm is tested and the network diplomacy begins.