Causes and Correlation of Network Impairments

We have passed through the halcyon days of the Internet's childhood. During that period we saw the growth of applications that fit well within the Internet's basic communications parameters -- applications such as e-mail, the World Wide Web, and even highly buffered, non-conversational (one-way) streaming audio and video. These applications tolerate the kinds of packet transit delays and packet loss rates that are found on the Internet.

Time-Sensitive Network Applications

However, we are now beginning to deploy network-based applications that place greater demands on the underlying network. Conversational applications, such as voice-over-IP (VOIP), and storage area networks (SANs, e.g. iSCSI) are sensitive to the time it takes for packets to transit the network; how much that time varies (jitter); and how often packets are lost, corrupted, reordered, or replicated.

It will be increasingly necessary to engineer these new applications with full recognition of the characteristics of the underlying network. These new applications may, and frequently will, require that the underlying network itself be engineered, tuned, and operated to meet defined service level agreements (SLAs).

Testing the Limits of App Performance

Few tools exist for application designers and their customers to discover and explore the boundaries of network behavior within which their applications work or do not work. In 1492 Columbus looked west across the Atlantic Ocean and said, "there is land out there," but he was unable to say where or how far. To find the answers, Columbus had to take on the risk and expense of actually deploying his ships and sailors. Fortunately we are in a somewhat better position than Columbus was - there are new tools that let us explore the boundaries of our networks and the limitations of our new applications without undertaking the risk and expense of an exploratory deployment.

Addressing the Limits of Network Performance

The purpose of this paper is to discuss the ways in which networks may be imperfect and how we evaluate and deal with those limitations.

In addition, it has become common practice for network equipment vendors to make sweeping claims that new classes of applications require the deployment of that vendor's latest equipment. We find that such claims are either unwarranted or made without knowledge of the actual requirements of the applications in question.

This paper advocates that customers create testbeds in which they may ascertain the actual service requirements of their present and planned applications. The resulting data will give the customer the necessary information to understand whether the existing network infrastructure is adequate, whether it needs to be upgraded, and what service level agreements should be established with network providers. This approach could result in enormous cost savings and reduced deployment times.

What Do We Mean by the "Imperfect Network"?

There is no such thing as a perfect network. Physics and engineering impose limitations that are random (such as noise on copper or fiber-optic cables, or hardware and software failures in routers and switches) or predictable (such as the propagation speed of electrical pulses on wires or of light in optical fiber).

There are other sources of imperfection: Packets may be lost or delayed by transient congestion in switching elements of the net. Packets may be lost, replicated, or reordered by changes in routing or transitions between slow-path and fast-path routing mechanisms. Packet reordering may also be caused by "load balancing" of traffic between a pair of routers using a set of parallel links.

These conditions tend to occur in bursts that span periods of time ranging from a few milliseconds to a few minutes. However, longer periods are not atypical - congestive losses can last as long as the competing packet flows fight over some scarce resource, typically buffers, in a switching device. Instability in Internet routing can cause bursts of lost or reordered packets as route tables are adjusted. There are times when there is no usable route for packets to flow from some point A to some other point B on the net. And reordering caused by parallel telecommunications links can last for as long as those links are in place.
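Bursty loss of this kind is often modeled with a two-state Gilbert-Elliott chain: losses cluster while the model sits in its "bad" state. The sketch below is illustrative only; the transition and loss probabilities are arbitrary assumptions, not measured values.

```python
import random

def gilbert_elliott(n_packets, p_good_to_bad=0.01, p_bad_to_good=0.25,
                    loss_in_bad=0.5, seed=42):
    """Simulate bursty packet loss with a two-state Gilbert-Elliott model.

    In the GOOD state packets are delivered; in the BAD state each packet
    is lost with probability `loss_in_bad`.  Returns a list of booleans
    (True = packet delivered).
    """
    rng = random.Random(seed)
    state_bad = False
    delivered = []
    for _ in range(n_packets):
        # Possibly change state before handling this packet.
        if state_bad:
            if rng.random() < p_bad_to_good:
                state_bad = False
        else:
            if rng.random() < p_good_to_bad:
                state_bad = True
        lost = state_bad and rng.random() < loss_in_bad
        delivered.append(not lost)
    return delivered

trace = gilbert_elliott(10_000)
loss_rate = 1 - sum(trace) / len(trace)
```

Unlike a uniform random drop, a trace generated this way shows runs of consecutive losses, which is what loss-sensitive applications actually experience.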

Many people tend to think of these as unusual or rare conditions. In the core of the Internet, a place of large data pipes, high powered switches and routers, and (usually) good traffic engineering, these conditions are infrequent, but they do occur. However, if one considers packet paths that pass through the periphery of the net - as most packet paths do - then one encounters overloaded exchanges and links, older and under provisioned equipment, and lack of 24x7 monitoring and operational coverage.

Around the year 2000, network traffic was low relative to capacity, and packets flowed across the Internet with few delays and relatively few points of trouble-causing congestion.

Today's networks carry considerably more high-bandwidth traffic (e.g. video), and as more demands are placed on limited network resources, flaws become more obvious. New applications will make increasing demands on the net for reliable and timely packet transport.

The Ways That Networks Err

There are many sources of the imperfect network. The following table defines various ways in which networks err.

Packet Loss

This is simply the disappearance of a packet that was transmitted or ought to have been transmitted. Some media have mechanisms to recover lost packets (generally with some additional delay). For purposes of this note, packets recovered by such media are not considered lost.

Packet Delay

Delay is the amount of time that elapses between the time a packet is transmitted and the time it is received. There is always delay - the speed of light in a vacuum places the lower limit on how small the packet delay can be. The actual propagation time of a packet across a telecommunication link varies considerably depending on the media.

Some media have complex access mechanisms. For example, CSMA controlled media, such as Ethernets, have fairly intricate access procedures that cause delays often larger than the raw propagation time.

In addition to propagation time, packets are delayed as the bits exit and enter computer or switch/router interfaces, as packets spend time in various queues in various switching and routing devices, and as software (or hardware) in those devices examines the packets, including any option fields, and deals with them.

Packets are sometimes delayed simply because a computer, router, or switch needs to do something else first, such as doing an ARP handshake to obtain the next hop's MAC address or looking up forwarding information.
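The components above can be summed into a rough back-of-the-envelope delay estimate. The figures in this sketch are illustrative assumptions (light in fiber travels at roughly two-thirds the vacuum speed of light, about 200 km per millisecond); queuing and processing delays vary with load and are supplied by the caller.

```python
def one_way_delay_ms(distance_km, packet_bytes, link_mbps,
                     queuing_ms=0.0, processing_ms=0.0):
    """Estimate one-way packet delay as the sum of its classic components:

    - propagation: signal travel time over the medium (~200 km/ms in fiber)
    - serialization: time to clock the packet's bits onto the link
    - queuing and processing: supplied by the caller, since they vary with load
    """
    propagation_ms = distance_km / 200.0
    serialization_ms = (packet_bytes * 8) / (link_mbps * 1000.0)
    return propagation_ms + serialization_ms + queuing_ms + processing_ms

# A 1500-byte packet over 3,000 km of fiber on a 10 Mb/s access link:
d = one_way_delay_ms(3000, 1500, 10)   # 15 ms propagation + 1.2 ms serialization
```

Even this simple arithmetic shows that on long paths, propagation dominates; no amount of equipment upgrading can reduce it.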

Packet Delay Variation (Jitter)

Jitter is a measure of the variation in the packet delay experienced by a number of packets. Some packets may experience an unimpeded ride through the net while others may encounter various delays.

Jitter is often, but not always, accompanied by some degree of packet loss.

Jitter is important because it has a significant impact on how long software can expect to wait for data to arrive. That, in turn, has an impact on software buffering requirements. And for media streams, such as voice or video, in which late data is unusable, jitter affects algorithms that are used to create elastic shock-absorbing buffers to provide a smooth play-out of the media stream despite the packet jitter.

There are various formulas used to compute jitter. These formulas vary depending on the emphasis that one wants to put on more recent versus older transit variations.

The idea of jitter may be easier to comprehend if one thinks of it in terms of something familiar - such as the time to drive to work in the morning. Sometimes the trip is fast. Sometimes it is slow. If you are trying to estimate how much the travel time varies you may want to focus mainly on your experience over the last few days, what happened last year is probably irrelevant. A person doing highway planning may be more concerned in the variation as measured over a period of months or even longer.
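One widely used formula of this recency-weighted kind is the RTP interarrival jitter estimator defined in RFC 3550, which smooths the differences in packet transit times with a gain of 1/16 so that recent variation dominates older history:

```python
def interarrival_jitter(send_times, recv_times):
    """RFC 3550 interarrival jitter: an exponentially weighted moving
    average of transit-time differences, J = J + (|D| - J)/16, where D is
    the change in transit time between consecutive packets.
    """
    jitter = 0.0
    prev_transit = None
    for s, r in zip(send_times, recv_times):
        transit = r - s
        if prev_transit is not None:
            jitter += (abs(transit - prev_transit) - jitter) / 16.0
        prev_transit = transit
    return jitter
```

With perfectly regular delivery the estimate stays at zero; any variation in transit time pushes it up, and it decays again as conditions settle, mirroring the "last few days of commuting" intuition above.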

Packet Duplication

Duplication occurs when one packet becomes two (or more) identical packets.

Packet Corruption

Corruption occurs when the contents of a packet are damaged but the packet continues to flow towards the destination rather than being discarded (and thus lost).
Causes of Network Errors

Packet Loss
  • Line noise
  • Bad cables/connectors
  • Collisions on shared media
  • Full/half duplex mismatch on media
  • Telecommunications link outages, clock instability, ATM cell loss
  • Inadequate input or output buffering
  • Blocking in switching fabrics
  • Hardware failures (perhaps accompanied by insufficient fast fail-over to working hardware)
  • ARP cache timeouts
  • Routing reconvergence, instability, or failures
  • Route changes on MPLS and ATM when there is a time gap between the old path and the new
  • Misconfiguration (perhaps transient as a device is being reconfigured)
  • Software failures or errors (including obsolete software)
  • Poor IP reassembly software
  • Firewalls
  • Tunnels with inadequate MTUs, thus causing packets to be discarded

Packet Delay
  • Distance (signal propagation time)
  • Low-bandwidth links
  • Long queues in switching devices
  • Slow-path packet forwarding
  • Switches with serialization delays (i.e. store-and-forward rather than cut-through)
  • QoS traffic shaping
  • Firewall processing

Packet Delay Variation (Jitter)
  • Media access delays (e.g. CSMA)
  • Full/half duplex mismatch on media
  • Layer 2 topology changes
  • Routing changes
  • Transient routing loops
  • Fast-path/slow-path transitions in routers
  • Packet option processing
  • ICMP messages generated as a result of packet reception
  • ARP cache timeouts
  • Transient congestion in switching devices
  • Transient higher-priority packets occupying queues and telecommunications links
  • Transient congestion from same-priority traffic
  • Parallel multi-links (load-balanced links), particularly when used in conjunction with QoS or other priority algorithms; jitter also occurs when the individual links forming the parallel/load-balanced multi-link are not equally provisioned, or when noise and errors hit one link but not others
  • IP fragmentation (the receiver cannot deliver a reassembled IP packet until all the fragments arrive)
  • QoS processing and traffic shaping
  • Firewall inspection
  • Tunnels with inadequate MTUs, thus causing fragmentation
  • Packet interception (typically for purposes of web caching or law enforcement)

How Do Impairments Manifest Themselves?

There is no single way in which network impairments make themselves visible. For example, in applications that move a lot of data over a long-distance TCP connection, packet loss, jitter, and reordering tend to trigger TCP's congestion avoidance algorithms and thus cause a considerable reduction in throughput. In voice-over-IP (VOIP) applications, jitter and delay combine with the result that the people trying to converse end up speaking over one another. VOIP voice quality can degrade in the face of any impairment. Even the perceived responsiveness of web browsing can degrade significantly if DNS query packets are lost or delayed.
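The effect of loss on bulk TCP throughput can be approximated with the well-known Mathis et al. model, rate ≈ (MSS/RTT) × (1.22/√p). The sketch below is a rough upper bound under that model's assumptions, not a prediction for any particular TCP stack:

```python
import math

def mathis_throughput_mbps(mss_bytes, rtt_ms, loss_rate):
    """Rough upper bound on steady-state TCP throughput from the Mathis
    et al. model.  Even a small loss rate sharply caps throughput on a
    long-RTT path because the rate scales with 1/sqrt(loss).
    """
    rate_bps = (mss_bytes * 8 / (rtt_ms / 1000.0)) * (1.22 / math.sqrt(loss_rate))
    return rate_bps / 1e6

# A 1460-byte MSS, 100 ms RTT, and only 0.1% loss yields roughly 4.5 Mb/s,
# no matter how fast the underlying links are:
r = mathis_throughput_mbps(1460, 100, 0.001)
```

This is why a transcontinental transfer can crawl on a multi-gigabit path: the loss rate and round-trip time, not the link capacity, set the ceiling.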

What About Your Own Network?

Even small networks can be impaired. Any net with more than a few switches and routers, and particularly any net with out-of-campus connections, is likely to experience impaired service.

In many cases, smaller networks may be more subject to impairments than larger networks that are monitored by 24x7 Network Operations Centers (NOCs). It isn't that the larger networks are more immune; it's just that on the larger networks somebody (other than the users) is watching and may be able to notice problems, isolate the causes, and initiate a repair.

How Can I Tell if My Network Has a Problem With These Things?

As we mentioned earlier, the sensitivity of applications to network impairments varies widely with the nature of the application and to a lesser degree with the quality of the implementation of the protocol stacks used by the application.

So, There Are Really Two Questions:

Is My Network Impaired?

Are My Apps and Devices Affected by These Network Impairments?

Since nearly every network has some level of impaired service, perhaps the pragmatic approach is to inventory the applications running on your network in order to create some kind of service level definition. Impairments that don't rise to the level where they erode that level of service are impairments that may be safe to ignore. It would, however, be necessary to review that service level as new applications are added, old ones removed, and as the overall traffic demands and patterns change. Even the introduction of a new router or switch may change the behavior of the net.

Service Level Definition for VoIP

Let's assume for the moment that you do come up with a service level definition. For example, for voice-over-IP applications you might come up with the following:

  • The combined one-way transit time (delay) and transit variation (jitter) between any two phones must be less than 150 milliseconds (there is an ITU recommendation to this effect). (Given the hysteresis that is built into the jitter-compensation algorithms of many VOIP phones, one may be tempted to try to define a limit on the rate of change of the jitter. However, doing so may be an exercise in futility because there are so few means to control that parameter.)
  • Packet loss must be less than 1 packet lost out of every 1,000. (Since packet loss tends to occur in bursts, it may be necessary to define a sliding time window in which the loss is to be evaluated.)
  • No more than 1 packet out of every 10,000 may be out of order by more than one packet (i.e. no more than one later packet may arrive at the destination beforehand.)
  • No more than 1 packet out of every 100,000 may be duplicated.
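A service level definition like the one above lends itself to a simple automated check against measured path statistics. The field names in this sketch are hypothetical; the thresholds mirror the bullets: delay plus jitter under 150 ms, loss under 1/1,000, reordering under 1/10,000, duplication under 1/100,000.

```python
def check_voip_sla(stats):
    """Return the list of service-level clauses that the measured
    statistics violate (an empty list means the path passes)."""
    failures = []
    if stats["delay_ms"] + stats["jitter_ms"] >= 150:
        failures.append("delay+jitter")
    if stats["loss_rate"] >= 1 / 1_000:
        failures.append("loss")
    if stats["reorder_rate"] >= 1 / 10_000:
        failures.append("reordering")
    if stats["dup_rate"] >= 1 / 100_000:
        failures.append("duplication")
    return failures

# 120 ms delay and 40 ms jitter individually look fine, but their sum
# violates the combined 150 ms budget:
bad = check_voip_sla({"delay_ms": 120, "jitter_ms": 40,
                      "loss_rate": 0.0005, "reorder_rate": 0.0,
                      "dup_rate": 0.0})
```

Note how the combined delay-plus-jitter clause can fail even when each quantity looks acceptable on its own; that interaction is easy to miss without an explicit check.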

How would you measure whether your network meets these service levels? But, more importantly, how would you even know in the first place whether these numbers are actually useful and whether they represent the kinds of services your applications actually need? It can cost a great deal of time and money to over-engineer a network to provide service levels that your applications do not need. And it can be more than simply embarrassing to discover after the fact that your shiny new (and expensive) network, even though it meets the service definitions, doesn't do the trick.

The key is to be able to build a testbed so that you can evaluate how applications behave in the presence of controllable and known degrees of impaired traffic.

With such a testbed you could have more confidence that your service level definitions are, in fact, representative of what you need from your production networks.

Implementing a Test Bed

There are a number of ways one could go about building a testbed. One can build a miniature version of a proposed network and hope that it adequately reflects the behavior of the full scale network. This approach is expensive and inflexible.

Another approach is to use mathematical simulations and models. These methods take a great deal of expertise to design, implement, and evaluate. And in many instances the results may be rather detached from reality. The software found inside network devices frequently, indeed, almost always, does not act with mathematical precision, or indeed with anything that even approximates that kind of precision.

The approach that we advocate is to use tools that actually produce, under controlled and repeatable conditions, a variety of network impairments so that the proposed applications can be tested and evaluated under near real-life conditions.
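The core of such a tool can be sketched in a few lines. A real impairment device operates on live traffic in the forwarding path; the illustrative version below applies a controlled, seeded (hence repeatable) profile of loss, delay, and jitter to a list of packet send times, with all parameter values being arbitrary assumptions.

```python
import random

def impair(timestamps_ms, loss_rate=0.01, base_delay_ms=30.0,
           jitter_ms=10.0, seed=1):
    """Apply a controlled impairment profile to a list of send timestamps:
    drop packets at `loss_rate`, and delay survivors by a base delay plus
    uniformly distributed jitter.  Returns (packet index, arrival time)
    pairs for the packets that survive.  The seed makes runs repeatable.
    """
    rng = random.Random(seed)
    arrivals = []
    for i, t in enumerate(timestamps_ms):
        if rng.random() < loss_rate:
            continue                      # packet lost
        delay = base_delay_ms + rng.uniform(0, jitter_ms)
        arrivals.append((i, t + delay))
    return arrivals

# 100 packets sent at 20 ms intervals, pushed through the profile:
out = impair([i * 20.0 for i in range(100)])
```

Because the profile is seeded, the same impairment sequence can be replayed while the application under test is varied, which is exactly the repeatability a testbed needs.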

There are several tools available for inducing impairments into networks. Most share an ability to create some or all of the types of impairments described above. Most of these tools either manipulate all packets indiscriminately or classify packets into flows, most frequently defined by a simple 5-tuple scheme (source IP address, destination IP address, IP protocol type (e.g. TCP or UDP), source port, destination port).
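The 5-tuple classification such tools use can be sketched as follows. The packet field names here are hypothetical; a real tool would extract these values from the IP and TCP/UDP headers.

```python
from collections import defaultdict

def flow_key(pkt):
    """Reduce a packet to the classic 5-tuple that defines its flow."""
    return (pkt["src_ip"], pkt["dst_ip"], pkt["proto"],
            pkt["src_port"], pkt["dst_port"])

def group_flows(packets):
    """Group packets by flow so that impairments can be applied per flow
    rather than to all traffic indiscriminately."""
    flows = defaultdict(list)
    for pkt in packets:
        flows[flow_key(pkt)].append(pkt)
    return flows

pkts = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "proto": "UDP",
     "src_port": 5004, "dst_port": 5004},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "proto": "TCP",
     "src_port": 40000, "dst_port": 80},
]
flows = group_flows(pkts)   # two distinct flows between the same hosts
```

The limitation is visible in the sketch itself: the 5-tuple sees only headers, so two flows with identical tuples but very different payload patterns are treated the same, which is why purely tuple-based tools cannot probe pattern-sensitive behavior.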

Those kinds of tools are often adequate when one is testing general classes of equipment. However, experience with networks has taught us that many network applications have sensitivity to certain patterns of impairments. In fact it is this kind of sensitivity that frequently allows crackers to break into equipment or to create denial of service attacks.

However, there is only one tool that is able to dig more deeply into flows and allow one to create impairments that might expose these pattern-based flaws. That tool is InterWorking Labs' Maxwell, The Network Impairment System.

What Can Be Done About Network Impairments?

We can deal with network impairments in two ways - we can make the network better or we can make the applications better.

David Isenberg's now-famous paper, "Rise of the Stupid Network" (the actual paper is at rageboy.com/stupidnet and hyperorg.com/misc/stupidnet) could be construed as an argument for pushing the burden onto the applications while leaving the underlying network as simple as possible. There is much merit in this approach - in fact, it is only in the applications that most of us, including application vendors, have any ability to control the quality of our Internet experience.

There are times when one simply cannot build an application without obtaining better-quality network services. Many time- and distance-sensitive applications will not work effectively, regardless of the application engineering, without guaranteed service level agreements.

At the same time, it is possible to design devices and applications at the edge of the network to better compensate for and handle impaired network conditions.

The best approach with either the stupid network or a network engineered for high quality packet delivery is to create pre-deployment testbeds. A pre-deployment testbed allows us to evaluate the range of impairments that our existing and future applications can tolerate. With that knowledge we can better understand the engineering and investment tradeoffs between building more sophisticated applications versus demanding (and obtaining) improved service levels from our network infrastructures.

Contact Us

© Copyright 2018 InterWorking Labs, Inc. dba IWL.
Web: https://iwl.com
Phone: +1.831.460.7010
Email: info@iwl.com