Wednesday, August 22, 2007

The Truth about Cellular IP

I have been very busy analyzing real-world telemetry traffic over cellular IP, and unfortunately I am now 100% convinced that you cannot effectively use most (any?) off-the-shelf "Ethernet" software tools to talk to remote Ethernet devices over cellular IP. The bottom line is that - unless your host app is custom written to be data-cost and time-delay sensitive - your data costs will be bloated by the nature of the tool. Even something as "obvious" as adding compression doesn't solve the problem, because telemetry packets tend to be too small for effective compression. For example: an 8-byte Modbus/RTU request becomes 12 bytes after ZIP-style compression. Plus this does nothing to reduce the 104 bytes of TCP overhead or the 28 bytes of UDP overhead. None of the cellular providers allow use of RFC-style TCP header compression (such as Van Jacobson compression), since it requires all of the infrastructure to maintain copies of the headers, etc.
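A quick sanity check in Python makes the point (the exact output size depends on the compressor - zlib here - but an 8-byte frame always comes back bigger, not smaller):

    import zlib

    # An 8-byte Modbus/RTU read request: slave 1, function 3,
    # starting register 0, count 2, plus the 2-byte CRC.
    request = bytes([0x01, 0x03, 0x00, 0x00, 0x00, 0x02, 0xC4, 0x0B])

    compressed = zlib.compress(request, 9)
    print(len(request), "->", len(compressed))
    # Prints something like "8 -> 16" - the "compressed" frame is larger,
    # because there is nothing to squeeze out of 8 bytes of binary data.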

So I have been working on "reduction" solutions - how to obtain the effect of moving "X" IP packets but only moving "X-minus-a-bunch" of actual IP packets.

Tunneling TCP thru UDP
The most promising and generic form of reduction is to tunnel TCP/IP via UDP/IP over cellular. The host application talks TCP/IP to a local proxy, which acts as the TCP end-point. All of the TCP SYN, ACK and keepalive traffic is limited to the local Ethernet. The local proxy then initiates a UDP "session" with a remote proxy over cellular & we instantly see a 60-90% reduction in data costs. The remote proxy initiates a TCP/IP connection to the remote Ethernet device, which again isolates the extra TCP overhead to the remote Ethernet.
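Here is a bare-bones sketch of the local half of such a proxy in Python. The addresses, ports and the simple one-request-at-a-time relaying are my own illustrative assumptions - a real proxy would add sequencing, fragmentation handling and per-connection bookkeeping - but it shows where the TCP chatter stops and the lean UDP traffic starts:

    import socket

    LISTEN_ADDR = ("0.0.0.0", 502)       # host app connects here over local TCP
    REMOTE_PROXY = ("10.8.0.2", 7001)    # remote proxy across the cellular link

    def local_proxy():
        tcp_srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        tcp_srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        tcp_srv.bind(LISTEN_ADDR)
        tcp_srv.listen(1)

        udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        udp.settimeout(10.0)             # allow for multi-second cellular latency

        while True:
            conn, _ = tcp_srv.accept()   # the TCP handshake stays on the local LAN
            with conn:
                while True:
                    req = conn.recv(2048)
                    if not req:
                        break            # host app closed its socket
                    udp.sendto(req, REMOTE_PROXY)      # one datagram over cellular
                    try:
                        resp, _ = udp.recvfrom(2048)   # one datagram back
                    except socket.timeout:
                        break            # let the host app's own timeout logic decide
                    conn.sendall(resp)

    if __name__ == "__main__":
        local_proxy()

The remote proxy is the mirror image: it receives each datagram, passes it to the remote Ethernet device over a TCP connection that never leaves the remote LAN, and sends the reply back as another datagram.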

The reaction of non-IA network engineers to this idea is predictable and a bit humorous after a while. They immediately say "You cannot do that!!! UDP/IP is unreliable!!! You'll break something!!! You are committing a mortal sin!!!" But in reality none of the IA protocols leverage the reliability of TCP anyway. For example, Rockwell RSLogix doesn't send a program block to a ControlLogix and blindly assume it was successful once the TCP acknowledgement from the peer is processed. Instead, RSLogix sits (literally blocks) and waits for a successful CIP response on a single CIP connection. So if the local proxy returns a TCP ACK to the RSLogix host and the CIP request is lost within the UDP/IP tunnel ... eventually RSLogix times out the CIP connection and the application (and/or user) restarts the transfer.
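To make that concrete, here is a minimal sketch of what application-level reliability looks like. I'm using Modbus/TCP's transaction ID as a stand-in for CIP's connection checking because it is simpler to show; the function name and timeout value are mine:

    import socket
    import struct

    def read_registers(sock, transaction_id, unit=1, start=0, count=2, timeout=10.0):
        """Issue one Modbus/TCP read and wait for the protocol response.

        Success is a reply carrying the matching transaction ID before the
        application's own timeout expires - the transport's ACK (if any)
        never enters into the decision.
        """
        pdu = struct.pack(">BHH", 0x03, start, count)                 # function 3 request
        adu = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit) + pdu
        sock.settimeout(timeout)                                      # app-level timeout
        sock.send(adu)
        reply = sock.recv(260)                                        # raises on timeout
        if len(reply) < 9 or struct.unpack(">H", reply[:2])[0] != transaction_id:
            raise TimeoutError("no matching response - the app retries or alarms")
        return reply[7:]                                              # function code + data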

Fortunately, cellular is very reliable - in all of my tests sending 10,000 UDP packets I rarely lost even 1 packet, and I'm not sure if such a rare loss is due to cellular or just my test script hiccupping & dropping a packet. Plus cellular tends to have only very bursty error problems. In other words, you won't lose 1 packet per 10,000; instead you'll lose all packets for 5 minutes, or 5 random packets out of a group of 10 sent. This shotgun-style damage tends to confuse TCP/IP state machines to the point that they abort the connection anyway. In truth, in all of my Wireshark/Ethereal trace reviewing I have never seen a single situation where a TCP retry did anything but add data cost; every TCP retransmission just results in a "Duplicate ACK" showing up a few packets below in the trace & a doubling of the cost of that block of data.

So overall, anyone planning to use cellular should first investigate if they can use UDP/IP instead of TCP/IP.

TCP Problem #1 - added cost for pointless ACK
As mentioned above, real-world analysis of telemetry use of TCP shows the TCP ACK isn't useful; worse, embedded TCP devices tend to sub-optimize the ACK timing to "speed up" data transmission and recovery. Almost universally, moving an IA protocol via TCP/IP results in 4 TCP packets instead of the idealized 3.

  • Your app sends a TCP request (request data size + 40-52 bytes of overhead)
  • 800-1100 msec later your app receives a TCP ACK without data (another 40-52 bytes of overhead)
  • 10-40 msec later your app receives the protocol response (response data size + 40-52 bytes of overhead)
  • Within a few msec, your app sends the TCP ACK without data.

So what would have been a 2-packet transaction with only 56 bytes of overhead under UDP/IP, or what should have been a 3-packet transaction with 120-156 bytes of overhead under ideal TCP/IP, usually becomes a 4-packet transaction with 160-208 bytes of overhead. Yes, there is a TCP socket option and the concept of a "Delayed ACK Timer" to prevent the first empty TCP ACK from being returned over cellular, but few embedded products use this since it adds code complexity and it slows down overall data communications. At least in the IA world it seems everyone wants their Ethernet product - costing 2-4 times more than their serial product - to appear lightning fast. So they ignore the TCP community's decades of hard-earned experience and "hack" their TCP stacks to favor fast local Ethernet performance.

So this is where the instant 60-90% data cost savings of using UDP over TCP comes from: UDP has smaller headers and results in fewer packets being sent. Since the cellular IP system is "encapsulating" your TCP/IP packets in a manner similar to PPP, the entire IP header, TCP or UDP header, and your data are all considered billable payload.
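The header arithmetic alone accounts for most of that range (the rest comes from the retries, SYN storms and keepalives covered below). A quick back-of-the-envelope check in Python, using the packet counts and header sizes from the 4-packet pattern above:

    udp_overhead = 2 * 28      # 2 packets x (20-byte IP + 8-byte UDP) = 56 bytes
    tcp_real_min = 4 * 40      # the observed 4-packet pattern, plain 40-byte headers
    tcp_real_max = 4 * 52      # same pattern with the 12-byte timestamp option

    for label, tcp in (("40-byte TCP headers", tcp_real_min),
                       ("52-byte TCP headers", tcp_real_max)):
        saving = 100 * (1 - udp_overhead / tcp)
        print(f"{label}: {tcp} -> {udp_overhead} bytes ({saving:.0f}% less overhead)")
    # 40-byte TCP headers: 160 -> 56 bytes (65% less overhead)
    # 52-byte TCP headers: 208 -> 56 bytes (73% less overhead)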

There is also a myth, propagated to this day, that the TCP ACK causes retries to occur more rapidly out in the wide-area-network infrastructure. The rhetoric goes, "If the 3rd and 4th router link is congested and the TCP data packet is lost, then the 3rd router will retransmit ... which is faster ..." Perhaps this was true back in the 1980s, but today the 3rd and 4th routers (and all of the other 20 to 30 routers in a cellular end-to-end path) are just tossing IP packets upstream with no awareness of the packets' function. In reality, it is only the TCP state machines within your host machine and within your remote device that have any ability to retransmit anything.

TCP Problem #2 - added cost for premature retry
The TCP RFCs include many dynamic timers that automatically adjust themselves based on real-world performance. This is actually pretty neat: it means that if the TCP ACK and response times tend to be longer than normal, then the TCP state machine slowly increases the delay before retransmission. But I've seen 3 problems with this.

  1. The most effective way to leverage this auto-adjustment is to include the 12-byte TCP header option that time-stamps every packet. Linux systems add this by default, and installing one of many PLC engineering tools on a Windows computer causes Windows to also start always using it. The setting is generally global - you either have 40-byte TCP headers or 52-byte TCP headers forever. So for small telemetry packets, this adds a disproportionately large increase in data costs.
  2. Many embedded devices (PLC, RTU and I/O devices) have "hacked" the TCP ACK sub-system to force connection failure to be detected faster than the standard 3-4 minutes. For example, I worked with one large PLC company which expected TCP socket failure in less than 1 second, so they forced TCP retransmission in hundreds of msec and without any normal exponential backoff between retries. This is totally unusable over cellular; you will end up with 30% to 90% of your data traffic being premature retries and responses to premature retries. I have literally seen Wireshark/Ethereal traces which are mainly black lines with red text - the default coloring used to show TCP "problems" such as lost-frag, retrans, dup-ack, etc.
  3. The latency of cellular is abnormal by an order of magnitude. Even browsing the internet or doing a telemetry polling test over DSL/cable broadband averages latencies in the 100-150 msec range. That is what Windows or Linux defines as "slow/bad" - not the 800 to 3500 msec of cellular. So even watching a Windows or Linux TCP state machine auto-adjust its retransmission delay over time, you will not see it achieve a 100% effective setting which eliminates wasted TCP retransmissions. The delay seems to top out at about 1.5 to 1.8 seconds, which is just too close to the actual "normal" latency range. So again, use of UDP/IP frees the user from data costs associated with TCP legacy assumptions - both the mainstream MIS/IT market variety of assumptions and the misapplied IA vendors' "speed-ups". A retry pacing better matched to cellular latency is sketched below.
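Since the stack's own auto-adjustment never gets there, the retry pacing has to live in the application (or the proxy). Here is a minimal sketch of cellular-friendly pacing - the 5-second starting timeout and 3 retries are simply my illustrative values, chosen to sit above the worst "normal" latency quoted above:

    import socket

    def poll_with_backoff(udp, remote, request, base_timeout=5.0, retries=3):
        """Send one poll and retry with exponential backoff.

        The first wait is already longer than the worst normal cellular
        round trip (~3.5 s), and each retry doubles it, so we never pay
        for a retransmission the original packet would have answered.
        """
        timeout = base_timeout
        for _ in range(retries + 1):
            udp.settimeout(timeout)
            udp.sendto(request, remote)
            try:
                response, _ = udp.recvfrom(2048)
                return response
            except socket.timeout:
                timeout *= 2               # back off instead of hammering the link
        return None                        # genuinely unreachable - alarm, don't retry forever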

TCP Problem #3 - uncontrollable SYN/Socket Opens

Given the way all cellular systems "park" inactive cellular data devices, it is exceedingly rare to ever see a host app open a new TCP socket without prematurely retrying/retransmitting the SYN packet. This is because you are virtually guaranteed that it will take about 2.5 seconds for the data device to be given active airwave resources and return the SYN+ACK response. This has NOTHING to do with the "always connected" feature Digi and others claim. The data device (even when parked) is fully connected by IP and fully authenticated by the system - it is "always connected". However, the local cell tower only has finite airwave resources, so any device (cell phone or data device) which is idle for 3 to 45 seconds is "parked" without any preallocated airwave resources. Literally, when the TCP SYN shows up, the cell tower has to use the control channel to tell the data device to request airwave resources, and only after these are requested and allocated can the data device receive and respond to the TCP socket-open request.

But that's not the real problem related to TCP socket opens ... the real problem is yet another case of IA vendors sub-optimizing TCP behavior for fast local Ethernet performance. For example, I once had a customer who normally paid about $40 per month receive a $2000 bill one month. It turns out they had powered down the remote site for 3+ days, and the off-the-shelf 3rd-party host application they used would try to reopen the TCP socket every 5 seconds!!! Windows would send the initial TCP SYN to start the open; since the remote was off-line, Windows would retransmit this TCP SYN a few seconds later. After a total of 5 seconds, the application would ABORT this TCP socket attempt and start a new one. So this host app was pushing 24 billable TCP packets per minute out to a remote site that was powered down. This was nothing the host app vendor documented, nor was it anything a user could configure or override. The user could configure the host app to ONLY poll once per 5 minutes; but the user had no control over this runaway TCP SYN/Open behavior.

Tunneling TCP through UDP effectively decouples the TCP SYN/Open from cellular data charges. The first TCP Syn/Open request to the local proxy would succeed even if the remote IP site is offline. No retries would be required. Even if the host app attempts to retry the data poll every 5 seconds, this is something the UDP proxy can be configured to "resist". If the user truly wants data packets to only move every few minutes, that is something the UDP proxy can easily enforce.

TCP Problem #4 - sub-optimized TCP keepalive

The final problem I'll discuss (by no means the "last" problem with TCP) is that many embedded IA devices have relatively fast TCP keepalives hard-coded to speed up lost-socket detection. While this is an admirable goal, a Rockwell PLC sending out a TCP keepalive at a fixed 45-second interval can create up to 6MB of monthly traffic by doing this. Siemens S7 PLCs seem to issue a TCP keepalive every 60 seconds - a bit better, but not by much. Maybe such a heart-beat is useful to know the remote is accessible, but given the reliability of cell phones (when was the last time you had a dropped call or no signal ...) you'll get a lot of false alarms if you treat every missed packet as something requiring maintenance's attention.
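The arithmetic behind those numbers is simple - each keepalive is an empty TCP packet plus the empty ACK it provokes, and both are billable (assuming 52-byte headers, the worst case discussed above):

    SECONDS_PER_MONTH = 30 * 24 * 3600

    def keepalive_cost_mb(interval_s, bytes_per_packet=52):
        exchanges = SECONDS_PER_MONTH / interval_s   # one probe + one ACK per exchange
        return exchanges * 2 * bytes_per_packet / 1e6

    print(keepalive_cost_mb(45))   # ~6.0 MB/month for a 45-second keepalive
    print(keepalive_cost_mb(60))   # ~4.5 MB/month for a 60-second keepalive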

Again, tunneling TCP through UDP effectively eliminates this automatic, possibly uncontrollable use of TCP keepalive. If your process can handle being talked to once an hour, then the cost of TCP socket opens and closes, as well as any TCP keepalive, is all wasted investment.

Not only this, but the cellular providers do NOT want users who send a simple, rather empty packet every 30 to 60 seconds - this is literally the worst kind of customer, as it forces the cell tower to "waste" one of its very limited airwave resources with almost no income returned to the carrier. From what I hear, carriers either want customers who talk constantly and pay huge monthly fees (say $90 to $350/month), or they want customers who rarely talk and pay a small fee (even just $5/mo) but cost the carrier virtually no direct expenses.

Putting this in "restaurant terms":

  • A cellular data device that talks constantly but pays for a large plan is like a restaurant patron who sits at a table, constantly ordering more food and paying a larger bill.
  • A cellular data device that rarely talks is like the restaurant patron who comes in once a month, sits at a table, orders a meal, pays and then vacates the table.
  • A cellular data device that keeps an idle channel open full time but rarely talks is like the restaurant patron who sits at a table in the restaurant, reading the paper but rarely ordering food or paying a bill.

In fact, in private chats with carrier account people, I have heard several times that they have been told directly to prefer either customers who talk constantly on large plans or those who talk at most once an hour (better, once a day) on small plans. Customers planning to talk every few minutes have been defined as bad investments. It may be fair to say that after years of building up the data-plan customer base, the cellular carriers have come to understand that the REAL cost of data plans is not the bulk data bytes moved; it is the percentage of time the device consumes (or squats on) 1-of-N scarce airwave resources in proportion to the monthly fee paid.