Friday, November 17, 2006

TCP/IP Encapsulation Limited by Distance?

Summary: serial tunneling or TCP encapsulation is NOT directly affected by distance. However, it is affected by "hops", meaning how many routers and network segments the traffic goes through. To reduce the effect "hops" have on your TCP/IP Encapsulation, you need to set the correct options in your Digi Device Server. These are NOT the standard defaults, since what works best for a Wide-Area-Network is not what works best for local direct Ethernet links.

What is serial TCP/IP encapsulation? It is also called serial bridging or serial tunneling. Think of it as the modern IP equivalent to old short-haul modems or leased line modems. At each end you have a "modem", you connect a serial cable to each, and you create a virtual serial link running from end to end.
Diagram showing serial tunnel

During a webinar I gave, a traffic-industry user asked if serial TCP encapsulation is limited by distance. If he serial bridges between 2 intersections of a road, things work fine. However, if he tries the same serial bridge between an intersection and the home office, then serial bridging does NOT work. So he was wondering "what is the distance limit for serial bridging or encapsulation?"

The simple answer is "There is no distance limitation". However, the longer the distance you move your serial encapsulation, the more routers and network segments (hops) your traffic moves through. The more hops your traffic moves through, the more variable latency (or delay or jitter) is introduced between consecutive network packets. This affects the timing of the serial data at the far end, and different protocols and software implementations react differently to it.

Let me just throw some hypothetical numbers together. Let's say the device sends data as a block of 100 bytes; the receiving device will loop and collect this data. Of course the receiving device cannot just wait for 100 bytes - what if the line breaks? It could sit there forever waiting for the end of the message. So various timers are coded to enable the receiving device to understand failure and abort receiving. Let's say the receiving device waits at most 20 milliseconds for the next byte. On a direct serial link a timeout like this is very common and quite safe - once the sending device starts sending data it is very unlikely that even a 2 or 3 millisecond gap will appear between bytes.
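To make those hypothetical numbers concrete, here is a minimal Python sketch of such a receiver. It is only an illustration: the function name, the list of (time, byte) arrivals and the roughly 1.04 ms per byte spacing (what 9600 baud would give) are my assumptions, not taken from any real device.

```python
def receive_message(arrivals, expected_len=100, gap_timeout_ms=20.0):
    """Collect bytes until expected_len is reached or a gap exceeds the timeout.

    arrivals: list of (time_ms, byte_value) tuples in arrival order.
    Returns (data, ok); ok is False if the receiver gave up early.
    """
    data = bytearray()
    last_time = None
    for time_ms, value in arrivals:
        # Abort as soon as the silence since the previous byte is too long.
        if last_time is not None and time_ms - last_time > gap_timeout_ms:
            return bytes(data), False
        data.append(value)
        last_time = time_ms
        if len(data) == expected_len:
            return bytes(data), True
    # The line went quiet before the message finished.
    return bytes(data), False


# Direct serial link: bytes arrive back to back, about 1.04 ms apart (assumed).
direct = [(i * 1.04, i) for i in range(100)]

# Same message, but a 30 ms delay opens up after the first 25 bytes.
gapped = [(i * 1.04 + (30.0 if i >= 25 else 0.0), i) for i in range(100)]
```

With the direct arrivals the receiver collects all 100 bytes; with the gapped arrivals it gives up after 25 bytes, even though the rest of the message was on its way.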

Enter serial encapsulation - either by radio or Ethernet or any IP-based media. All of these technologies are packet-based and most include some form of error-retry. So now the serial bytes collect in a buffer up to some point, then a chunk of them move together as one packet. Ideally, the full message moves as a single packet. However, if the message becomes split between 2 or more packets it is possible a gap will be detected by the receiving device. So for illustration we'll say the 100 bytes is split into 4 x 25-byte packets. On a single-hop network, there is much less opportunity for timing delays to open gaps in the final serial data. This diagram shows a small gap between the 25th and 26th bytes:
Diagram shows less variability with single IP hop

But running the serial encapsulation through many network hops greatly increases the accumulated delays added to each packet, since each hop has the opportunity to increase the latency and lag. This diagram shows the much larger gaps that may occur when the packets are turned back into serial traffic at the remote end:
Diagram shows more variability with multiple IP hops
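The effect in the two diagrams can be sketched in a few lines of Python. This is a rough model under stated assumptions, not a measurement: the function name, the 1 ms per byte serial rate and the per-packet delay values are all invented for illustration, and packets are assumed to be delivered in order as TCP would do.

```python
def serial_gaps_ms(packet_delays_ms, bytes_per_packet=25, byte_time_ms=1.0):
    """Return the idle gap on the remote serial line before each packet after the first.

    Packet k is sent once its last byte has been collected, and arrives
    packet_delays_ms[k] later. The remote end replays each packet onto the
    serial line at byte_time_ms per byte, in order.
    """
    gaps = []
    line_free_at = None  # time the remote serial line finishes the previous packet
    for k, delay in enumerate(packet_delays_ms):
        arrive = (k + 1) * bytes_per_packet * byte_time_ms + delay
        if line_free_at is None:
            start = arrive
        else:
            start = max(arrive, line_free_at)
            gaps.append(start - line_free_at)
        line_free_at = start + bytes_per_packet * byte_time_ms
    return gaps


single_hop = serial_gaps_ms([5, 8, 5, 5])     # delays vary by a few ms
many_hops = serial_gaps_ms([5, 40, 12, 60])   # delays vary by tens of ms
```

What matters is not the size of the delay but how much it varies from packet to packet: a constant delay produces no gaps at all, while the second set of delays opens a gap well past a 20 millisecond inter-byte timeout.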

How to solve this problem? On the Digi Connect products, you'll be using one of these serial port profiles: TCP Sockets or Serial Bridge. Under the Advanced Serial Settings you need to enable the check box labeled: [ ] Send data only under any of the following conditions. If you do not check this option, the Digi Device Server purposely fragments the serial data into many TCP segments to provide more realistic end-to-end performance on direct Ethernet links. However, since you want to move data through a wide-area network, you are less concerned with raw throughput than with preventing message fragmentation. You may also be interested in creating fewer TCP/IP packets to send more serial data. Changing this setting accomplishes both of these things.

You now have 2 options to define when TCP packets are moved:
  • "Send when data is present on the serial line" allows you to define an end-of-message pattern such as a carriage-return (\r or \r\n, etc).
  • "Send after the following number of idle milliseconds" allows you to define an idle or quiet time to wait before sending data. This second option is generally safest and I find a value of 10 or 25 milliseconds to be ideal with most automated devices.

Note: do NOT change the setting in the "Send after the following number of bytes" field! This is rarely useful and it does NOT mean (when unchecked) that the Digi Device Server must see 1024 bytes before it sends anything. I have had too many users change this to 1 and then wonder why they have a huge amount of network traffic!
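The idle-time option is easy to picture in code. The sketch below is NOT the Digi firmware's logic, just a minimal Python illustration of the idea, with a hypothetical function name and made-up arrival times: serial bytes are held until the line has been quiet for the idle time, then sent as one packet.

```python
def packetize_by_idle(arrivals, idle_ms=10.0):
    """Group (time_ms, byte) pairs into packets, closing a packet after idle_ms of silence."""
    packets = []
    current = bytearray()
    last_time = None
    for time_ms, value in arrivals:
        # A quiet period at least idle_ms long ends the packet being collected.
        if current and time_ms - last_time >= idle_ms:
            packets.append(bytes(current))
            current = bytearray()
        current.append(value)
        last_time = time_ms
    if current:
        packets.append(bytes(current))
    return packets


# Two 100-byte messages at 1 ms per byte, about 50 ms of silence between them.
arrivals = [(i * 1.0, i % 256) for i in range(100)]
arrivals += [(150.0 + i * 1.0, i % 256) for i in range(100)]
packets = packetize_by_idle(arrivals, idle_ms=10.0)
```

With a 10 millisecond idle time each 100-byte message travels as a single packet, so no gap can open up in the middle of a message at the far end.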

Friday, November 10, 2006

Application Pitfalls: Serial DF1 over WAN

Digi's wireless group was asking me why AB/DF1 didn't always work over radio when per the specification, DF1 has a nice end-of-message pattern. One would think moving serial DF1 through radio or cellular-IP would be natural and painless.

However, the problem I see watching Windows applications use the serial API (via the PortMon utility) is that they ASSUME a small delay or gap between the (DLE)(ACK) bytes and the response from the slave.

So the application uses the incorrect algorithm:
  1. Read 2 Bytes
  2. Ask Windows to notify application when more data comes
  3. Loop, reading buffered data and waiting until full response seen
The problem with this algorithm is that it assumes the response will NOT have been received by the time the application sees the 2-byte (DLE)(ACK). Because we're dealing with radio or IP systems that make an effort to packetize data, there is a high probability that the (DLE)(ACK) and the slave response arrived at the same time within the same packet without any noticeable gap. Therefore, Windows NEVER notifies the application that more data has arrived ... because no more ever comes! The full response has already been received and buffered. The application makes the false assumption that measurable time will pass between step #1 and the start of the loop in step #3.

The correct algorithm would be:
  1. Read 2 Bytes
  2. Loop, reading buffered data and waiting until full response seen
This works because it handles all three situations: no response, a response already received and fully buffered, and a response trickling in over time.
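Here is a minimal Python sketch of that second algorithm. Several things are assumptions on my part: the port object only needs a read(n) method and an in_waiting property (pySerial's Serial class offers both), the response is assumed to use DF1 full-duplex framing of (DLE)(STX) ... (DLE)(ETX) followed by a one-byte BCC, and the function names and timeout values are hypothetical.

```python
import time

DLE, STX, ETX, ACK = 0x10, 0x02, 0x03, 0x06


def frame_complete(buf, check_len=1):
    """True once buf holds a frame ending in DLE ETX plus check_len check bytes."""
    i = 0
    while i + 1 < len(buf):
        if buf[i] == DLE:
            if buf[i + 1] == ETX:
                return len(buf) >= i + 2 + check_len
            i += 2  # skip DLE STX, or a doubled DLE inside the data
        else:
            i += 1
    return False


def read_response(port, timeout_s=1.0, poll_s=0.01):
    """Step 1: read the 2-byte (DLE)(ACK). Step 2: loop on buffered data."""
    if port.read(2) != bytes([DLE, ACK]):
        return None
    buf = bytearray()
    deadline = time.monotonic() + timeout_s
    while True:
        # Take whatever is already buffered; do not wait to be told
        # that "more data" has arrived, because it may never be said.
        buf += port.read(port.in_waiting)
        if frame_complete(buf):
            return bytes(buf)
        if time.monotonic() >= deadline:
            return None  # no response, or only part of one
        time.sleep(poll_s)
```

The loop checks the buffer before it ever sleeps, so a response that arrived in the same packet as the (DLE)(ACK) is returned immediately, and a response that trickles in is picked up on a later pass.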

Thursday, November 09, 2006

City-wide WiFi - it's not Ethernet

One of my customers is struggling to IP-enable a few dozen Ethernet PLCs via one of these newfangled city-wide WiFi systems that are all the rage now. It looked good on paper, but they can only keep the PLCs online for about 20 minutes at a time.

Why? Is the WiFi system defective? Of course not ... it is just behaving more like a Wide-Area-Network than a Local-Area-Network. I am not directly involved in this, but I'd wager the problem is neither the WiFi nor the PLC. The problem is the host software making Ethernet LAN assumptions about the system. I should offer to go out and do some latency tests; I'd wager the system has high and variable latency more like satellite or cellular.

So industrial control application developers beware: migrating your Ethernet-enabled, LAN-friendly applications to be true IP-enabled, WAN-friendly applications will become more important every time another city announces the installation of a city-wide wireless infrastructure.

Monday, November 06, 2006

Cellular-IP Friendly Apps - Retrying Socket Opens

Most industrial applications allow the user to set a slow poll rate – such as one poll per 5 minutes. This allows a user to budget a cell plan at 5MB per month and be quite assured of not going over. Unfortunately, this steady-state poll rate is unrelated to initial TCP/IP socket connection opens!

If the remote device is powered down or the TCP socket open fails for any reason, most applications will attempt to reopen the TCP socket continuously. On Ethernet this may make sense; the more frequently the open is retried, the sooner the failed connection will recover. Most Ethernet-based applications will retry opening a TCP socket every 5 to 30 seconds forever. However, for cellular you are paying for all traffic entering the cellular system. It is not Cingular or Sprint or Verizon's fault your remote device is off-line. You will be billed for each and every TCP retry. I have literally seen applications create up to 1000 MB of traffic each day attempting to reopen a TCP socket to an unreachable remote IP. On a 5 MB per month plan, this 1GB of overage could easily cost you $1000 or more for the month!

Recommendation: all applications must include a user-settable option to delay attempts to reopen TCP sockets. This value can default to no-delay, but users must be able to set a delay of at least 1 hour between retries. This enables the user to define and stay within their data usage budget regardless of success or failure of the TCP connection.

Impact: On Ethernet this should have no direct consequences since the recommended default is no delay. Cellular users must adjust this retry delay to match their data traffic expectations and their cell plan budget.
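As a rough idea of what such an option might look like inside an application, here is a minimal Python sketch. The function name and its parameters are hypothetical; socket.create_connection is the standard library call, and the connect and sleep arguments exist only so the logic can be exercised without a real network.

```python
import socket
import time


def open_with_retry_delay(host, port, retry_delay_s=0.0, connect_timeout_s=20.0,
                          max_attempts=None, connect=socket.create_connection,
                          sleep=time.sleep):
    """Try to open a TCP socket, waiting retry_delay_s between failed attempts.

    retry_delay_s defaults to no delay (the Ethernet-friendly behavior);
    a cellular user would set it to 300, 3600 or more.
    Returns the connected socket, or None if max_attempts is exhausted.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        try:
            return connect((host, port), timeout=connect_timeout_s)
        except OSError:
            if max_attempts is not None and attempts >= max_attempts:
                break
            # Every failed open costs traffic on a cell plan, so wait
            # for the user-configured time before trying again.
            sleep(retry_delay_s)
    return None
```

A cellular user might call this as open_with_retry_delay(host, 502, retry_delay_s=3600.0), while an Ethernet user leaves the default alone.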

For example, an application polling 10 Modbus registers per 5 minutes via TCP/IP creates about 198 bytes per poll. This works out to 2376 bytes per hour or a little under 2 MB per month. This is a very safe poll rate when paying for a 5 MB per month plan.
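The arithmetic behind those figures, as a small Python sketch. The function name is hypothetical and a 30-day month is assumed.

```python
def poll_traffic(bytes_per_poll, poll_interval_minutes, days_per_month=30):
    """Return (bytes per hour, bytes per month) for a steady poll rate."""
    polls_per_hour = 60 / poll_interval_minutes
    per_hour = bytes_per_poll * polls_per_hour
    per_month = per_hour * 24 * days_per_month
    return per_hour, per_month


# 198 bytes per poll, one poll every 5 minutes.
per_hour, per_month = poll_traffic(198, 5)
```

Here per_hour comes to 2376 bytes and per_month to 1,710,720 bytes, which is the "little under 2 MB" above.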

Therefore the desired TCP reconnect scenario should also create no more than 2400 bytes per hour. Consider that a 20-second timeout under Windows creates at least 120 bytes of traffic to an off-line remote. Windows sends a 40-byte [SYN] packet and retries the same 40-byte [SYN] in roughly 3 and then 8 seconds from the previous [SYN] packet. Increasing the timeout to 30 or more seconds creates a fourth 40-byte [SYN] packet sent about 18 seconds after the third. So forcing an application to only attempt one connection per 5 minutes will create from 1440 to 1920 bytes of traffic per hour. This will not break our budgeted cell plan.
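The same kind of check works for the reconnect traffic. This Python sketch uses only the numbers quoted above (40-byte [SYN] packets, 3 or 4 of them per failed open); the function and constant names are hypothetical.

```python
SYN_BYTES = 40  # size of one [SYN] packet, as quoted above


def reconnect_bytes_per_hour(syns_per_attempt, attempt_interval_s):
    """Traffic per hour created by failed TCP opens to an off-line remote."""
    attempts_per_hour = 3600 // attempt_interval_s
    return SYN_BYTES * syns_per_attempt * attempts_per_hour


# One attempt per 5 minutes: 3 SYNs (20-second timeout) or 4 SYNs (30+ seconds).
low = reconnect_bytes_per_hour(3, 300)
high = reconnect_bytes_per_hour(4, 300)
```

Both results stay under the 2400 bytes per hour budget. Plug in a retry interval of 5 to 30 seconds instead and the hourly total grows by a factor of 10 to 60, which is how an unreachable remote eats a cell plan.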