Alan Stewart-Brown, VP Sales EMEA at Opengear, discusses how out-of-band management reduces mean time to repair (MTTR) in distributed IT environments.
Where are my devices located? The wave of transformation brought about by digitisation, the Internet of Things and increasingly decentralised hardware landscapes and systems brings new challenges for IT departments.
They have to manage distributed network environments for branches, production sites and private clouds in order to maintain permanent connectivity and avoid the consequential costs of network downtime.
Today this is almost always handled centrally from the data centre at company headquarters. But the problem is – if the primary network (‘in band’) fails or Internet connectivity is not ensured by the service provider, the limits of standard remote maintenance are quickly reached.
In the legacy world, this would mean a technician has to go on site. Once there, the technician may find that the fix is anything from a simple reboot to ordering a replacement device.
The modern alternative is ‘out-of-band’ management. Independent of the function and connectivity of the local network, it ensures that companies can connect to their IT hardware at all times, test to see if the issue can be resolved remotely and so build resiliency into IT and ultimately reduce downtime.
But let’s take this from the bottom up: branches extend the reach of a company and new sites are normally a sign that business is going well. It is equally clear that remote IT infrastructures place a significantly higher burden on network administrators and bring added responsibility.
The IT department has to ensure that infrastructures, including network devices such as routers, switches, WAN optimisation solutions, firewalls and all distributed applications and servers, are installed and maintained correctly.
It has to integrate remote devices in the central network management, authentication, authorisation and accounting systems to ensure quality management and security.
It is in the nature of things that a distributed network has more weak points and so is more susceptible to interruptions to services. And only very large branches will have an IT expert on site.
But at a time when distributed corporate networks are connected to mission-critical data and services via the cloud, the costs of a network failure can be huge even for a shop or a small sales office.
Failure of the primary network: When in-band tools fail
With centralised monitoring and remote access to distributed resources, administrators in many organisations can now manage dozens or even hundreds of sites efficiently. In-band tools such as Telnet (teletype network) are commonly used for remote access and maintenance, but they depend on the availability of the network. Access is not possible when major technical issues such as network or connectivity failures occur.
A single hardware malfunction can be enough to trigger several days’ downtime. Normally a company will send a technician on site. It often happens that the problem is resolved in a few minutes, for example by rebooting the device.
But in other cases, the troubleshooting process has only just begun. In any case valuable time is lost in travelling from the location of the IT department, time when staff cannot work productively or critical systems are not accessible.
That is why out-of-band management of remote network hardware is worth its weight in gold to network administrators of distributed infrastructures. It means they retain control over all components, even when in-band management strategies are unsuccessful. It also builds resiliency across the network, because administrators know they can quickly gain access to remote networks even after a critical failure.
Out-of-band solutions with cellular failover
The principle is simple. Network devices are normally fitted with serial interfaces which can be interrogated independently of the network and give the administrator a complete picture of the status of the device. Many of the important management functions such as firmware updates are only available via serial console interfaces anyway. If a device no longer responds to a command, administrators can carry out a hard reboot via the control system for the power supply.
Out-of-band management allows admins to maintain and manage components such as servers, WAN devices, network devices and power supply units and resolve malfunctions via remote access. If there is an issue with connectivity, out-of-band solutions offer a failover path. Today this is normally a cellular link (4G LTE or 3G), which provides a fast connection. If cellular coverage cannot be guaranteed, companies can fall back on a slower modem connection.
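The failover order described above (primary network first, then cellular, then modem) can be sketched in a few lines of Python. This is only an illustration of the principle, not how any particular product implements it: the function name choose_management_path, the link names and the stand-in health checks are all assumptions made for the example.

```python
from typing import Callable, Optional, Sequence, Tuple

# A link is a (name, health_check) pair. The names and checks are
# hypothetical; a real console server would probe actual interfaces.
Link = Tuple[str, Callable[[], bool]]


def choose_management_path(links: Sequence[Link]) -> Optional[str]:
    """Return the name of the first link whose health check passes.

    Links are expected in order of preference, fastest first.
    Returns None if no link is available.
    """
    for name, is_up in links:
        if is_up():
            return name
    return None


# Example scenario: the in-band network has failed, cellular is up.
links = [
    ("primary", lambda: False),   # in-band network is down
    ("cellular", lambda: True),   # 4G LTE or 3G failover
    ("modem", lambda: True),      # slower modem connection
]
selected = choose_management_path(links)
print(selected)
```

Ordering the list by preference keeps the policy in one place: adding or removing a fallback path means editing the list, not the selection logic.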
So out-of-band management can give administrators continuous remote access to critical components such as network switches, routers, power distribution units and a growing number of security appliances such as firewalls and encryption tools.
Out-of-band solutions often remove the need for an on-site visit altogether. If a visit does prove necessary, the remote diagnosis means the technician usually arrives with the right spare part or replacement device in hand and can resolve the issue quickly. Mean Time to Repair (MTTR) is reduced considerably.
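For readers who want to put a number on this, MTTR is simply the total repair time divided by the number of incidents. The short Python sketch below shows the calculation. The repair times in it are invented purely for illustration and are not measurements from any real deployment.

```python
from typing import Sequence


def mean_time_to_repair(repair_hours: Sequence[float]) -> float:
    """Return MTTR: total repair time divided by the number of incidents."""
    if not repair_hours:
        raise ValueError("at least one incident is required")
    return sum(repair_hours) / len(repair_hours)


# Hypothetical repair times in hours for three incidents.
# With on-site visits only, every incident includes travel time.
on_site_only = [6.0, 8.0, 10.0]
# With out-of-band access, two incidents are fixed remotely and
# one still needs a visit.
with_out_of_band = [0.5, 1.0, 6.0]

mttr_on_site = mean_time_to_repair(on_site_only)
mttr_out_of_band = mean_time_to_repair(with_out_of_band)
print(mttr_on_site, mttr_out_of_band)
```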
Business continuity at the hardware level: automatic monitoring and troubleshooting
But out-of-band management has advantages beyond urgent crises. It also makes it possible to identify and resolve issues automatically even before they affect local data traffic. Modern solutions integrate an auto-responder system to rectify network failures by using diagnostic and repair aids for problems that occur frequently.
This means that simple tasks can be carried out automatically, for example identifying a router that cannot be contacted, notifying the administrator via SMS or e-mail, and rebooting the router. Out-of-band console servers can also be configured to power down critical devices properly if the rack temperature is too high, or if the UPS (uninterruptible power supply) detects a power outage and battery performance drops below a defined threshold.
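The rules just described can be written down as a small decision function. The Python sketch below is a simplified illustration and not the configuration syntax of any real console server: the SiteStatus fields, the threshold values and the action names are all assumptions. It also assumes that a shutdown takes priority over a reboot and that the administrator is notified in both cases.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds; real values depend on the site and hardware.
MAX_RACK_TEMP_C = 40.0
MIN_BATTERY_PERCENT = 20.0


@dataclass
class SiteStatus:
    router_reachable: bool
    rack_temp_c: float
    on_battery: bool          # True if the UPS reports a power outage
    battery_percent: float


def plan_actions(status: SiteStatus) -> List[str]:
    """Return the automatic actions for the given site status."""
    overheated = status.rack_temp_c > MAX_RACK_TEMP_C
    battery_low = (
        status.on_battery
        and status.battery_percent < MIN_BATTERY_PERCENT
    )
    # Assumption: a controlled shutdown takes priority, since rebooting
    # a router in an overheated or unpowered rack would not help.
    if overheated or battery_low:
        return ["notify_admin", "graceful_shutdown"]
    if not status.router_reachable:
        return ["notify_admin", "reboot_router"]
    return []


# Example: the router has stopped responding, everything else is normal.
actions = plan_actions(SiteStatus(False, 25.0, False, 100.0))
print(actions)
```

Keeping the decision separate from the execution of each action makes the rules easy to test before they are allowed to power-cycle real equipment.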
The virtual hardware administrator
These self-healing measures use recovery scripts that run without human intervention. By connecting sensors, it is possible to monitor practically the whole infrastructure, including physical conditions such as temperature, humidity, smoke or vibration, and to automate power management.
This automation installs a so-called virtual hardware administrator at every distributed site and, in addition to ensuring stability in everyday operation, it also minimises the scope for human error and the attack surface for cybersabotage.
Because of the increasing complexity of IT infrastructures due to M2M, cloud computing and the Internet of Things and in order to avoid high downtime costs, the rapid identification and resolution of issues of connectivity to distributed infrastructures has become a major task for companies.
Secure remote access via out-of-band management to servers, WAN devices, network devices and power devices makes it possible to identify and resolve many problems before they impact users or systems. Administrators need a framework so they can manage remote components from a central location and monitor their status. Such a framework normally pays for itself once it has avoided a few hours of downtime and the first service call.