Summary: Network infrastructure changes can create unexpected havoc (surprised?)
We had a customer come in with a complaint recently about some old products - a 4-port terminal server (a TS4) running
Modbus bridging firmware. Things had gone on smoothly for years, but suddenly they started having problems which power-cycling the units solved ... obviously it's a
Digi problem, right?
Well, I'm not going to run through all of the issues, but I'll highlight one which will affect more and more people over time. Remember the old days - first you had dumb
dialup modems, then you had to pay a lot for various error-detection standards, and now all analog modems have a big chip which does a hundred different standard things and you
CANNOT buy any low-cost analog dial-up modem which is any dumber than state-of-the-art. Such creep occurs in Ethernet switches also - once hubs were cheap and switches cost a king's ransom. Now even the "cheap" $19 home switches purchased in big-box stores are auto-sense, auto-cross, auto-
everything. The same things happens - once the features are rolled into software or firmware, future revs of the product inherit such features for free ... said another way, it is often cheaper to sell lots of one powerful model than handle the marketing and support costs of a dozen models of various power and feature-set.
Well,
this customer had been upgrading their wide-area-network infrastructure - a tangle of routed
IP-based leased lines, old analog lines, radio/microwave links and so one, and as part of the improvement the various router/PPP end-points had added
stateful awareness of packets ... part of an increase in US government security demands for utilities. After all, when you have hundreds of raw Ethernet ports scattered across rural American protected with little more than a padlock and chain-link fence, one has to consider the possibility of someone trying to 'jack' back into the 100% private network from a remote location.
The issue with such an addition is that routers with firewalls commonly default to a 5-minute
TCP life rule. Every
stateful firewall creates a table entry for every
TCP or
UDP activity going "out" from a safer network and into a wilder, less safe network - your home
DSL/Cable router does the same thing. NAT (etc) is optional, but bottom line is the firewall needs to understand and agree to the context of every packet trying to pass through.
For a
TCP socket this usually means at least 1 packet (data or
ACK or
Keepalive) must move every 5 minutes to refresh the context and inform the firewall that the
TCP socket is still authorized. If no packets are seen, the fire wall discards the context and the socket is now ignored and 100% blocked without comment. Of course both
TCP peers might think the socket is alive and well, but they will never be able to use it again.
So back to our TS4 off in some remote field - a host computer has 4
TCP sockets open to it through various routed
subnets and for whatever reason the host doesn't send any new packets for 6 minutes. The TS4 has no reason to talk since it is connected to
Modbus slaves. So 6 minutes later the host issues 4
Modbus requests on 4 sockets, which hit the
stateful firewall. The firewall looks up the context (called forwarding or trigger rules in some systems) and discovers there is no existing context for these
TCP packets. Well, once upon a time the firewall would just assume since these were safe since they came from the safe side, and thus auto-create a new context, yet modern
stateful firewalls might NOT do this since it was a common hacker exploit to fool firmwalls to move malformed packets. Thus modern security demands that the first packet of any new
TCP socket be a [SYN] packet. All four packets are silently discarded as unknown (
note: this is a configurable behavior in most firewalls - to auto-recreate a context for an 'old' TCP socket continuing or to only allow this for new TCP sockets.)
The TS4 never sees the four new request; the host sees neither an [
ACK] nor a [
RST] of the
TCP socket, so it retries after 3 seconds, then 6 and so on. Eventually it comprehends the
TCP socket has died and opens four new
TCP sockets. These are allowed through the firewall since they contain the correct TCP state as being new sockets. The TS4 sees the four new socket requests and now has eight (8) sockets connected.
Now you might say "What a minute - the four old sockets were closed!". Well, yes - the HOST knows they are dead and closed them, but the TS4 does not know this. Six minutes later, the same is repeated and the TS4 now has 12, then 16, then 20 sockets open. Eventually the TS4 runs out of resources, and the customer's open recovery option is a hard reboot of the TS4. This is of course worse if the host only
occasionally takes longer than 5 minutes to poll, since the increase in TS4 sockets could take an unpredictable amount of time to build up.
How to solve this problem?
- The first line of defense in the modern IP age is to always, ALWAYS enable TCP keepalives with an aggressive 4 minute 30 second idle time in any product you install. Even if you don't have stateful packet inspection within your IP network today, that doesn't mean the next time someone replaces a router or serial PPP end-point you won't gain one. Unfortunately the Digi products usually default to a 2 hour TCP Keepalive - which is also disabled (for historical reasons). Yet having a 4.5 minute keepalive would have saved the TS4 because the first time the host delayed longer than 4.5 minutes to poll, the TS4 would have sent a TCP keepalive packet back through the stateful firewall, which the host returns and the firewall's context for this socket is refreshed. Thus when the host does finally send the next Modbus poll after 6 minutes (1.5 minutes after the TCP keepalive exchange) the firewall is satisfied with the packets and forwards them out to the TS4. Everyone is happy. Plus even if the TCP socket has been aborted by the firewall, the TS4's TCP keepalive will be silently discarded and the TS4 will retry and eventually comprehend that the TCP socket is no longer valid. This is the purpose of TCP Keepalive - to allow a TCP peer with no data to move to test (and refresh) the health of an idle connection.
- The second possibility is unique to the Digi IA/Modbus application. You can enable a setting referred to as "idletimeout" on incoming client or outgoing server connections. Unlike the TCP Keepalive which create traffic and keeps an idle socket open, the idletimeout literally just aborts an idle socket without giving it any chance to prove it is healthy. So setting a 5 minute idle timeout in the TS4 would cause it to just assume any incoming Modbus client (master) connection which has NOT sent a new request is bad. This setting would also have saved the TS4, forcing the four old sockets to be aborted before the TS4 had a chance to build up a herd of dead sockets.
So understanding TCP Keepalive is a critical skill for all modern industrial Ethernet users, after all more and more facilities are breaking their floors and functional area into distinct subnets with stateful firewall protection to keep the Accountants from changing the color of the widgets being produced - and to keep the production engineers from printing on the accounting departments laser printer which uses red ink :-).
More complete discussion of TCP Keepalive is here (one of dozens of web site you can find in a web search).