Avoiding Internet Timer Synchronization
The Internet is a complex distributed process in which computers interact with one another over pathways that introduce delays and errors. There are many feedback loops. Some of those loops are simple to identify, such as the handshaking between the end points of a TCP connection. Other loops are harder to identify, such as the interaction of timers in ARP (address resolution), DNS (domain name system), and routing protocols (such as OSPF and BGP).
Positive feedback loops are often merely annoying. But they can also cause more severe problems, such as connectivity failures.
In addition, ameliorative measures, for example the manual or automatic damping of BGP route flapping, can make it difficult to re-establish network service once the root cause of a problem has been corrected. Removing the ameliorative measure may require human administrative intervention.
This note does not address the general problem of Internet resonance and positive feedback loops. Rather, it deals with the narrower issue of event synchronization, i.e., the tendency of independent periodic events on the Internet to lock themselves into various forms of resonance.
We have all experienced a form of synchronization when driving on Interstate highways in rural areas: "banana drivers". Bananas come in bunches, hence the name for what happens when a number of independent drivers come together to form long-lived clumps of automobiles and trucks on what is, on average, an underloaded highway.
Internet packets can form banana bunches as well. Packet switches, routers, and Wi-Fi access points, especially those that manage traffic through various Quality of Service rules, can act like traffic signals on highways by changing an input flow of nicely spaced packets into output bursts of closely spaced, bunched-up packets.
Internet protocols often contain timeout settings, for example for the periodic refreshing of ARP caches, DHCP addresses, or routing announcements. These timeouts may range from a few seconds to many days.
No matter the length of the period, over time the events triggered by these timeouts tend to merge into a lock-step pulsation of Internet traffic.
For example, it was observed back in the 1990s that the Internet had a 30-second heartbeat caused by the 30-second timeout in RIP, the protocol that was used for local routing in those days. Even though each RIP speaker used its own clock, over time those clocks became synchronized so that every 30 seconds networks were pulsed with a burst of RIP traffic.
Internet audio and video streams often use very strictly timed packets. Voice over IP streams often have a 200+ byte packet every 20 milliseconds (50 packets per second). Video streams often have a train of large packets 30 (actually 29.97) times per second.
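The figures above can be checked with a few lines of arithmetic. This is only a sketch: it assumes a 200-byte packet as a lower bound (the text says 200+ bytes), it ignores any lower-layer framing, and the variable names are purely illustrative.

```python
# Rough arithmetic for the stream timings described above.
VOIP_PACKET_BYTES = 200      # assumed lower bound; real packets may be larger
VOIP_INTERVAL_S = 0.020      # one packet every 20 milliseconds

voip_packets_per_second = round(1 / VOIP_INTERVAL_S)
voip_bits_per_second = VOIP_PACKET_BYTES * 8 * voip_packets_per_second

VIDEO_FRAMES_PER_SECOND = 29.97
video_interval_ms = 1000 / VIDEO_FRAMES_PER_SECOND

print(f"VoIP: {voip_packets_per_second} packets/s, "
      f"at least {voip_bits_per_second} bit/s")
print(f"Video: one packet train roughly every {video_interval_ms:.2f} ms")
```

Even a single voice stream is a strictly periodic source, which is why such streams are sensitive to bursts created elsewhere in the network.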
Most of us have heard the story of how army troops are told to break out of lock-step marching when they cross suspension bridges because otherwise the bridge might begin to resonate and collapse. And we have all seen the resonance-induced failure of the Tacoma Narrows Bridge.
These things happen on the Internet. But the impact is often masked because it is manifested by bursts of lost packets, moments of congestion, and transient losses of connectivity.
Those of us who write Internet code can do something about this. We can try to avoid contributing to these resonances.
There is a rule of Internet code writing that is almost never expressed. It is even less often acknowledged. And it is almost never implemented.
This is the rule that every timeout period should be randomized +/- 50%.
In other words, if there is code that cycles every 30 seconds, then rather than firing in lock step at precise 30-second intervals, the code should arrange for each actual interval to vary randomly between a lower bound of 15 seconds and an upper bound of 45 seconds.
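As a minimal sketch of this rule in Python, the helper below draws each delay uniformly from the range base +/- 50%. The names jittered_interval and schedule and the spread parameter are illustrative assumptions, not part of any protocol specification, and a uniform distribution is just one reasonable choice.

```python
import random


def jittered_interval(base_seconds, spread=0.5, rng=random):
    """Return a delay drawn uniformly from base*(1-spread) to base*(1+spread)."""
    if base_seconds <= 0:
        raise ValueError("base_seconds must be positive")
    if not 0 <= spread < 1:
        raise ValueError("spread must be in the range [0, 1)")
    low = base_seconds * (1.0 - spread)
    high = base_seconds * (1.0 + spread)
    return rng.uniform(low, high)


def schedule(base_seconds, count, spread=0.5, rng=random):
    """Return `count` successive firing times, with every gap jittered independently."""
    times = []
    now = 0.0
    for _ in range(count):
        # Draw a fresh random delay for every cycle, not just once at startup.
        now += jittered_interval(base_seconds, spread, rng)
        times.append(now)
    return times


if __name__ == "__main__":
    # A nominal 30-second cycle; each gap falls between 15 and 45 seconds.
    for t in schedule(30, 5):
        print(f"fire at t = {t:.1f} s")
```

The important design choice is that a new delay is drawn on every cycle. Randomizing only the first interval merely shifts the phase once, and the timers remain free to drift back into lock step.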
Sometimes this kind of randomization would be inappropriate. For example, randomizing the 20 millisecond interval used for VoIP would be harmful because it would be perceived as jitter and thus add to the conversational delays experienced by users.
On the other hand, randomization should be applied as a matter of course in routing or discovery protocols.
One who is implementing a randomizer need not go to great lengths to create true random values. The goal here is to discourage synchronization, not to hide data.
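To illustrate how little is needed, here is a sketch of a tiny linear congruential generator used only for timer jitter. The class name CheapJitter and the idea of seeding it from a per-node value (for example, an integer derived from a hardware address) are assumptions made for this example. The multiplier and increment are the widely published 32-bit constants from Numerical Recipes. Such a generator is unsuitable for anything security related.

```python
class CheapJitter:
    """Tiny linear congruential generator for timer jitter only (not cryptographic)."""

    _A = 1664525        # multiplier (Numerical Recipes 32-bit LCG)
    _C = 1013904223     # increment
    _M = 2 ** 32        # modulus

    def __init__(self, seed):
        # The seed should differ between nodes; otherwise every node would
        # produce the same sequence of "random" delays.
        self._state = seed % self._M

    def next_fraction(self):
        """Advance the generator and return a value in [0, 1)."""
        self._state = (self._A * self._state + self._C) % self._M
        return self._state / self._M

    def interval(self, base_seconds, spread=0.5):
        """Return a delay in [base*(1-spread), base*(1+spread))."""
        low = base_seconds * (1.0 - spread)
        return low + self.next_fraction() * (2.0 * spread * base_seconds)


if __name__ == "__main__":
    # Two hypothetical nodes with different seeds get different delays.
    node_a = CheapJitter(seed=0x00AA)
    node_b = CheapJitter(seed=0x00AB)
    for _ in range(3):
        print(f"{node_a.interval(30):.2f} s   {node_b.interval(30):.2f} s")
```

The one thing worth getting right is the seed: if every device starts from the same value, the jitter itself is synchronized and the exercise is pointless.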
IWL's KMAX and Maxwell products give code writers tools that they can use to create traffic bursts so that they can develop, test, and deploy remedial approaches.