Network Considerations for Remote Vehicle Operation

Components of End-to-End Remote Control Delay

The Society of Automotive Engineers (SAE) has defined several autonomous driving levels. These span the range from fully driver operated to fully automatic.

Auto manufacturers aspire to fully automatic control. But, being pragmatic, they are first aiming to achieve SAE Level 3.

Level 3 means that the vehicle is not completely autonomous; a human must take over under some conditions and situations. The vehicle is intended to operate under its own control until it finds itself in a situation beyond its capabilities, then it hands control over to a human driver. This article does not deal with the significant and difficult issues of how those transitions of control are made.

When the vehicle is under human control, the human operator may be either in the vehicle itself or at a Remote Control Driving Operations Center (RCDOC).

Given the cost and inconvenience of having an on-board driver, the focus of attention is remotely operated driving.

Remotely operated driving puts a lot of stress on the communications path between the human driver in the RCDOC and the vehicle. The vehicle is moving in real time through the real world. Time is of the essence. The responsiveness and accuracy of the command and control loop between the vehicle and the RCDOC is critical.

5G Does Not Solve This Problem

5G does not solve the latency problem of remote vehicle operation. No network is instantaneous; the speed of light is finite. And 5G does nothing to reduce operator reaction time, processing time, or most of the delays in the network path itself.

5G is to the network like widening your driveway is to transportation; you may be able to get in and out of your driveway faster, but that will not improve your commute to work.

Why? Because there's a lot more involved, and your driveway is just a small piece of the overall commute.

Let's look at this control loop in more detail.

There are several main components of this control loop, each with sub-components.

The main components are these:

  • Sensing of vehicle events, status, and environment.

  • Transmission of the collected data to the operator in the RCDOC.

  • Operator evaluation and response.

  • Transmission of the control instructions back from the RCDOC to the vehicle.

  • Execution of those instructions.

Sensing Vehicle Status and Environment

When in remote operation mode, the vehicle will take note of its own status and of the environment around it. The vehicle will act as the eyes and ears of the driver in the RCDOC.

Something happens in the real world near the vehicle

Let's posit that something happens that requires handling by the driver in the RCDOC.

This begins with the event itself — perhaps an object of an unknown kind has just stepped into the roadway in front of the vehicle.

It takes time for the vehicle to even capture data about the event.

Audio, video, and LIDAR sensors themselves operate on microsecond scales. However, in a digital world data tends to be collected in chunks. For audio these chunks are usually about 50 milliseconds long. Video chunks are often about 33 milliseconds long.

The average capture latency is about half of the collection interval, so about 25 milliseconds for audio and 16.5ms for video.
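The figures above follow directly from the chunk sizes. A small Python sketch (the chunk durations are the ones quoted above):

```python
# The average capture latency is half the collection interval:
# an event is equally likely to occur anywhere within a chunk.
def avg_capture_latency_ms(chunk_ms: float) -> float:
    return chunk_ms / 2.0

audio_chunk_ms = 50.0   # audio chunk duration quoted above
video_chunk_ms = 33.0   # roughly one frame at 30 frames/second

print(avg_capture_latency_ms(audio_chunk_ms))  # 25.0
print(avg_capture_latency_ms(video_chunk_ms))  # 16.5
```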

Translating Sensor Data Into Digital Form

A sound, video, or LIDAR chunk of data is just a bag of bits; it must be encoded and formatted for transmission. Generally this is done by a software or hardware component called a "codec" (for "coder-decoder").

Codecs are not instantaneous. The time to encode data can range from microseconds to milliseconds, and as one might expect, encoding of audio is usually rather faster than encoding of video or LIDAR images. This does not mean that the audio can be transmitted sooner - it is often important that audio and video/LIDAR information be in synchronization when presented to the remote operator. Absence of good synchronization can confuse the remote driver or impede that driver's ability to recognize developing situations. Consider for example how confusing it could be to remotely drive a vehicle through a crowded area if the front, side, and rear cameras were not in sync with one another and also not in sync with the audio.

Transmission of the Collected Data to the Operator in the RCDOC

The sensor data has to be transmitted over the network. There are a lot of steps here, each of which adds time.

Encoded sensor data needs to be wrapped into packets; generally this is so fast, a few microseconds, that we don't need to consider it further.

Getting the data onto the network medium

Some amount of time may be required to get those packets of encoded data out of the vehicle and onto the network medium.

The total time for a packet to flow from its source (in this case the vehicle) to its destination (here, the remote operator) is its "latency". That latency is usually not a constant, it may vary from packet to packet. That variation is called "jitter".

(Often the word "delay" is used to reflect a base, fixed component of latency and the word "jitter" is used to reflect a statistical measure of the variations from that base component as experienced by a large number of packets.)

This variable medium-access time adds to both the delay and the jitter experienced by the data.
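One common way to separate the two components is to treat the minimum observed latency as the base delay and the excursions above it as jitter. A simplified sketch (the sample latencies are invented for illustration):

```python
# Sketch: split per-packet one-way latency into a fixed "delay"
# component (the minimum observed) and a "jitter" component
# (each packet's excursion above that minimum).
def delay_and_jitter(latencies_ms):
    base = min(latencies_ms)                    # fixed component
    excursions = [l - base for l in latencies_ms]
    return base, max(excursions)                # base delay, peak jitter

samples = [42.0, 45.5, 41.8, 60.2, 43.1]        # invented measurements
base, peak = delay_and_jitter(samples)
print(base, round(peak, 1))
```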

Jitter itself may, and often does, change. For example the onset of external network traffic may cause a vehicle-to-operator path to suddenly shift from a low-jitter regime to one in which jitter becomes large. Similarly, the relaxation of contending third-party network traffic might cause a high-jitter connection to become a nice low-jitter one. Jitter itself is hard for the computer at the receiving end of the connection to handle, but large up-and-down shifts in jitter can be wrenching.

Delay is bad; jitter is worse.

A device (such as our remote operator's computer) usually needs relatively little, and relatively simple, software to deal with delay. However, that same computer usually needs much more software, often quite complex (and bug prone), to deal with jitter and with changes in jitter.

So what are some of the things that create delay and jitter of outbound packets?

  • Outbound device interface queue blocking

  • Protocol (such as TCP) operations that attempt to avoid congestion.

Additional latency and jitter will be created as the data crosses the network itself. This is usually caused by the onset or relaxation of third party network traffic competing for network bandwidth resources, by naturally occurring interference (such as rain obscuring radio links), by device failures, or by exercise of control by a network operator.

Often as cross-network delay increases, so does the rate of lost packets as intermediate devices become congested and are forced to discard packets. Lost packets induce a kind of "Waiting for Godot" delay effect in that a receiver may refrain from processing subsequent packets while it waits for the lost packet.
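The "Waiting for Godot" effect can be sketched in a few lines of Python: an in-order receiver cannot deliver anything past the hole left by a lost packet.

```python
# Sketch: head-of-line blocking in an in-order protocol such as TCP.
# One lost packet stalls delivery of everything received after it.
def deliverable(received_seqs, next_expected):
    """Return the packets that can be handed to the application in order."""
    out = []
    for seq in sorted(received_seqs):
        if seq == next_expected:
            out.append(seq)
            next_expected += 1
        elif seq > next_expected:
            break   # gap: everything after the hole must wait
    return out

# Packets 1, 2, 4, 5 arrived; packet 3 was lost in transit.
print(deliverable({1, 2, 4, 5}, 1))  # [1, 2] -- 4 and 5 wait for 3
```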

Outbound/Transmit interface blocking

Outbound device interface blocking occurs when more data is being generated than can be sent on the outbound network interface.

On modern, wired networks this is often not a problem because network interfaces are quite fast compared to the amount of data being sent.

However, wireless interfaces tend to be slower (and of variable speed). There can be competition with other devices for a shared radio channel. Or there may be radio-frequency conditions or noise that prevent immediate access to the channel. The time scale of this blocking can range from almost zero up to hundreds of milliseconds. (This is one of the numbers that 5G wireless and Wi-Fi 6 are trying to reduce.)

In addition, many modern devices contain an excessive amount of outbound queue buffering, which can result in outbound data being held in long queues — which can be many seconds in length — before it is actually transmitted. This is known as "bufferbloat". (See our video on testing for bufferbloat.)

Protocol induced slowdown

Protocols such as TCP try to be good citizens of the network, so they are conservative about putting packets onto the network medium.

TCP restricts the amount of unacknowledged data in accord with a "window" provided by the receiving side of the connection.

When a TCP session begins, it tries to get a sense of the available bandwidth. To do this, it gradually ramps up its transmission speed — this is called "slow start". And if TCP notices an increase in response time from the other end, or a loss of data acknowledgments, the sending TCP will back off in order to avoid adding more traffic to what may be a congestion situation out in the network.

As a result, outgoing TCP data may be held pending transmission for a variable amount of time. This adds both packet delay and jitter.
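A rough sketch of why slow start adds delay (the segment counts and round-trip time are illustrative, not a real TCP implementation):

```python
# Sketch: TCP "slow start" roughly doubles the congestion window
# (cwnd) once per round trip until it reaches the desired rate.
def slow_start_rtts(target_segments, init_cwnd=10):
    cwnd, rtts = init_cwnd, 0
    while cwnd < target_segments:
        cwnd *= 2          # one doubling per round-trip time
        rtts += 1
    return rtts

# Illustrative: ramping from 10 to 640 segments per RTT takes
# 6 round trips; at a 50 ms RTT that is 300 ms of ramp-up.
print(slow_start_rtts(640) * 50)  # 300
```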

Noise (or packet loss) on the network may induce TCP to retransmit data that seems to have gone awry. This also adds to both delay and jitter.

Data Transmission Time

Data transmission time is the time for a data packet to traverse the network. In a well tuned system this would, in conjunction with sensor capture and rendering delays, be the main component of overall application latency.

This is the time component that most people believe they are measuring with tools such as "ping". Networks are complicated, dynamic things; simple measurements of network performance can be misleading. (See Limitations of ICMP Echo for Network Measurement.)

Data transmission time is highly dependent on intermediate devices (such as routers and switches) and on distance.

Intermediate devices are subject to competing traffic loads and consequently can produce highly variable delay (as well as data loss when they become congested). Changes of network packet routing can produce sudden and significant changes of latency (as well as packet duplication and mis-ordering.)

The speed of light is simply something we all have to live with. Even at full speed-of-light, there is about one millisecond of latency for every 300 kilometers (186 miles). However, many network media carry data at less than speed-of-light rates. Fiber optic links tend to run at about 60% to 70% of the speed of light. Signals travel closer to the speed of light over copper or radio than they do over fiber. (This is the prime reason why flash-traders of financial instruments tend to prefer long-haul links that use speed-of-light microwave transmission over slower fiber optics.)
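The arithmetic is simple enough to sketch (the coast-to-coast distance and the 67% velocity factor are illustrative assumptions):

```python
# Sketch: one-way propagation delay, assuming signals in fiber
# travel at roughly 67% of the speed of light in a vacuum.
C_KM_PER_MS = 300.0        # speed of light: ~300 km per millisecond

def propagation_delay_ms(distance_km, velocity_factor=0.67):
    return distance_km / (C_KM_PER_MS * velocity_factor)

# At full speed of light, 300 km costs 1 ms:
print(propagation_delay_ms(300, velocity_factor=1.0))   # 1.0
# Over ~4,100 km of fiber (roughly a US coast-to-coast great circle):
print(round(propagation_delay_ms(4100), 1))             # 20.4
```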

In addition, IP packet fragmentation and re-assembly are generally very troublesome, both in the time they consume and in their susceptibility to programming errors; a well designed system will use techniques such as Path MTU discovery to avoid this realm of trouble. Suffice it to say that the MTU of your outbound interface may well overstate the actual path MTU, so it is wise to be careful.

Reception Time

Once a data packet enters the recipient's computer (in our case the computer of the controller) that data undergoes several delay-causing steps.

First the network interface card (NIC) itself may hold the data packet for a short while. Modern NICs often eschew generating interrupts in favor of more efficient operating system polling. So a received packet may linger a while in the NIC before the operating system even notices that it has arrived. The time scale of this is generally under a millisecond, but as you may have begun to notice, all of these delays are beginning to add up.

Second, the received packet must be checked for integrity and subjected to protocol processing.

Integrity checking is often done with specialized hardware or in the receiving NIC. So the time is often negligible, although on low powered IoT devices in which the CPU does most of the work, this could take a few tens of microseconds.

Protocol processing is highly variable. TCP may (and usually does) refrain from delivering received data to the user application if some prior data has not yet arrived. However, if data packets are arriving smoothly and without error this element of delay can be quite low.

Rendering Time

In order to make sense of the sensor data, that data must be put back into a condition that is usable by a human (or automated) operator.

For audio this means re-assembling the received audio packets into a time-ordered and metronome-clocked stream that can be fed into a sound rendering device. Time ordering is easy (unless data has been lost in transit) but clocking can be very hard. (More on this below.)

For video something similar happens: the received video packets must be re-assembled into a time-ordered and metronome-clocked stream that can be fed into a video rendering device.

Human eyes and ears do not always react well to imperfect data. We are better attuned to audio or visual gaps and noise than we are to varying rates. Consider how hard it would be to play a video game if the tempo of the sound and video were randomly changing.

We've all seen bad lip sync in movies. It can be quite annoying. But it can be more than merely annoying. Because light moves faster than sound the human perception system is adapted to seeing an action slightly before hearing the sound of that action. In a remote control driving situation we do not want to reduce the effectiveness of the operators by burdening their eyes and ears with distracting synchronization problems.

Missing audio data has to be patched with null data, but unless care is taken that fill-in audio can create a buzz and make the audio hard or impossible for human operators to comprehend.

Video is often transmitted in small square portions of the screen. Lost video data often results in visible, distracting blotches.

Metronome time-clocking audio or video is hard. The difficulty is the variable delay (jitter) that has accumulated as the data came from the vehicle sensors and wound its way through the network.

In general a receiving device accommodates jitter in two ways:

First, it creates a late-arrival window that discards arriving data that is simply too late to be usefully rendered to the user.

And second, it creates a queue of buffers that holds packets long enough that even the largest expected jitter can be masked and the received data clocked out to the user at a smooth rate. This means that if the sensor-to-receiver maximum jitter is X milliseconds, then the receiver must have a jitter compensation queue in which received data is aged for at least X milliseconds before being presented to the rendering device.

(As was noted previously, there can be, and often are, sudden major shifts in jitter. Thus, for instance, the value for X in the prior paragraph could suddenly change. That is a difficult thing for software to handle at all, much less handle gracefully without generating artifacts in the media presented to the human operator of our vehicle. For example, one of the weak points in many TCP stacks is the code that regulates the reaction to changes in end-to-end delays and packet loss.)
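A minimal sketch of such a jitter compensation queue follows. A fixed depth X is assumed here; real implementations adapt the depth as jitter changes.

```python
# Sketch of a fixed-depth jitter buffer: hold each packet until
# its media timestamp plus the buffer depth (the maximum expected
# jitter) has elapsed, then release packets in timestamp order.
import heapq

class JitterBuffer:
    def __init__(self, depth_ms):
        self.depth_ms = depth_ms
        self.heap = []          # (media timestamp, payload)

    def push(self, media_ts_ms, payload):
        heapq.heappush(self.heap, (media_ts_ms, payload))

    def pop_ready(self, now_ms, stream_start_ms):
        """Release packets whose play-out time has arrived."""
        out = []
        while self.heap:
            ts, payload = self.heap[0]
            playout = stream_start_ms + ts + self.depth_ms
            if playout <= now_ms:
                out.append(heapq.heappop(self.heap)[1])
            else:
                break           # everything later is not yet due
        return out

buf = JitterBuffer(depth_ms=40)
buf.push(0, "a")
buf.push(20, "b")
print(buf.pop_ready(now_ms=45, stream_start_ms=0))   # ['a']
print(buf.pop_ready(now_ms=70, stream_start_ms=0))   # ['b']
```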

This jitter compensation thus becomes an additional component of end-to-end system delay that must be added to the actual end-to-end delay. This is often overlooked.

And when audio and video are being synchronized, the more prompt data, whether audio or video, must be delayed until the more tardy data can be made ready.
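The synchronization rule is simple to state in code (a trivial sketch; the readiness times are invented):

```python
# Sketch: to keep audio and video in sync, the prompt stream is
# held until the tardy stream is ready; both play out together.
def synced_playout_ms(audio_ready_ms, video_ready_ms):
    return max(audio_ready_ms, video_ready_ms)

# The audio is ready first but must wait 40 ms for the video:
print(synced_playout_ms(audio_ready_ms=80, video_ready_ms=120))  # 120
```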

This effort to create presentable data can consume a fair amount of time. And that time means that when an event is sensed on the vehicle there will be a potentially significant delay before the human operator even sees or hears of that event.

Operator Response Time

Even at our best, humans take hundreds of milliseconds to notice and respond to audio or visual events.

Operators may not have a lot of time to respond. Even for a vehicle moving at walking speed, the time window may be insufficient to avoid a collision with an oncoming vehicle or pedestrian.

(It will be interesting to see how RCDOCs keep their human operators alert and how they will accommodate multiple near-simultaneous, and perhaps unrelated, events on different vehicles.)

Transmission of Operator Commands

Sending control commands back to the remote vehicle is simpler in some ways than the remote-to-operator direction and harder in others.

Control actions usually must be sent by a reliable means. That means TCP (or an equivalent), or multiple transmissions over UDP of commands that are idempotent.
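One way to make UDP-carried commands safe to resend is to use absolute set-points plus sequence numbers, so that duplicates and stale copies are harmless. A sketch, with invented names:

```python
# Sketch: sequence-numbered, absolute commands are idempotent --
# applying a duplicate or out-of-date copy changes nothing.
class CommandReceiver:
    def __init__(self):
        self.last_seq = -1
        self.steering_deg = 0.0

    def apply(self, seq, steering_deg):
        """Apply an absolute steering set-point, newest wins."""
        if seq <= self.last_seq:
            return False        # duplicate or stale: ignore safely
        self.last_seq = seq
        self.steering_deg = steering_deg
        return True

rx = CommandReceiver()
rx.apply(1, 5.0)
rx.apply(1, 5.0)    # retransmitted copy is ignored
rx.apply(2, -3.0)
print(rx.steering_deg)   # -3.0
```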

Commands can be issued immediately, without any delay to wait for a capture device or to do media encoding.

Operator-to-vehicle control does not usually involve the need to do jitter compensation or metronome clocking of data. Rather, commands can be performed immediately upon receipt. (Unless those commands involve some sort of audio or video that the vehicle is to enunciate to third parties.)

All of this means that the operator-to-vehicle path is usually rather quicker than the vehicle-to-operator path.

Execution of Operator Commands

For our purposes we need only recognize that many operator commands will affect mechanical devices and that mechanical time scales can be orders of magnitude slower than electronic time scales.


When we put all of this together we can see that network latency is but one factor in the overall end-to-end responsiveness of a remote-control situation, such as remote-control driving of a vehicle.

Network jitter can easily double the effective end-to-end time to detect and respond to events.

Given typical network delays and jitter, typical media metronome timings, and human sense and reaction times we can see that a remote control driving system will generally have to accommodate several hundred milliseconds between event detection and the eventual response.

This response lag can be significant even for slow-moving vehicles: A scooter rolling down a sidewalk at 5kph — walking speed — is moving at roughly 1.4 meters/second. If we assume that people walking towards the scooter are moving at roughly the same rate the closure speed is roughly 3 meters/second. For a delivery cart moving down a sidewalk, this suggests that a remote control loop delay that exceeds half a second could be too slow to avoid collisions.
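The closure arithmetic above can be sketched directly (figures from the text):

```python
# Sketch: ground closed during the control-loop delay, using the
# closure-speed figure from the scooter example.
def closure_distance_m(closure_speed_mps, loop_delay_s):
    return closure_speed_mps * loop_delay_s

# ~3 m/s closure speed and a half-second control loop:
print(closure_distance_m(3.0, 0.5))  # 1.5
```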

For automobiles moving at highway speeds, the allowable control loop delay may have to be much shorter. But to do that will require careful attention to every element that could add even small amounts of latency or jitter. Special audio/video hardware with low capture time may have to be coupled with potentially imperfect network protocols (e.g., use UDP rather than TCP.) And well-tuned, even dedicated, network media, with low jitter as well as low delay may have to be used rather than the shared public internet.

This may prove difficult to engineer or expensive to deploy. That could (and probably will) increase the pressure to make vehicles more autonomous and self-reliant. SAE Level 3 may prove to be unworkable in practice.

In the meantime anyone deploying a remote-control approach should take care to fully test their system to know its limits.

Tools such as IWL's KMAX can be a valuable way to create ranges of potential network conditions in the development lab so that the remote control system can be deeply tested before undertaking the trouble and expense of field testing.

© 2020 InterWorking Labs, Inc. dba IWL. ALL RIGHTS RESERVED.