Software bugs, errors in configuration files, and the famous "fat finger" - all of them are responsible for service interruptions in cloud-based services, data centers, enterprise networks, and any other IT system, large or small.
Too many things can go wrong. Even if an organization prepares for everything that is humanly possible, there are still monkeys. This is a lesson that the people at KenGen, Kenya's electricity generating company, learned the hard way.
After investigating a blackout on June 7 that left the country without power for several hours, the company announced that "a monkey climbed onto the roof of the Gitaru power station and fell onto a transformer, tripping it. This caused other machines at the station to trip on overload, resulting in a loss of more than 180 MW from the plant and triggering a nationwide blackout."
Monkey-based damage is rare, of course, but other forms of damage are not. According to backup and disaster recovery firm Quorum, for example, three quarters of downtime is caused by hardware failures, software failures, and human error. You may not be able to easily prevent monkey errors, but there is certainly something companies can do to remediate, recover from, and perhaps even prevent outages based primarily on human error - as well as disruptions due to hardware faults, software defects, and even "unknown" causes.
Natural disasters such as hurricanes and earthquakes get all the PR when major service disruptions occur and networks, banking sites, and cloud-based services are no longer available. According to many studies, however, the real culprits behind failures are much more down to earth - and controllable. Quorum, for example, states that three quarters of downtime is caused by hardware or software failures, and by human error. It is this last category that many in the IT world are most concerned with.
Are that many "incompetent" IT workers running today's networks? Are typing errors in configuration files, missing patches, and even the "fat finger" - a few of the hallmarks of what is cheerfully called "human error" - really so prevalent? And if so, what can companies do to remediate, recover from, and perhaps even prevent outages caused primarily by human error? Before the barrage of angry emails and legal threats starts flowing, IT managers must develop a coping strategy to manage the inevitable. Here are some ideas how.
Coordinate
Once an interruption has been detected, the response team must be ready to act, which means it needs a leader who can assign remediation tasks to team members. Along with this, there must be an awareness of how systems are interconnected: team members should understand that the actions they take are likely to affect many users in a chain reaction that has the potential to disrupt work even further.
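One way to make that interconnection concrete is to keep even a rough dependency map that responders can query when an outage starts. The following is a minimal sketch in Python; the service names and the dependency map are hypothetical illustrations, not taken from any particular product.

```python
from collections import deque

# Hypothetical map of which services depend on which (A -> services that consume A).
# In a real environment this would come from a CMDB or a discovery tool.
DEPENDENTS = {
    "storage-array": ["database"],
    "database": ["auth-service", "billing"],
    "auth-service": ["web-frontend", "mobile-api"],
    "billing": ["web-frontend"],
    "web-frontend": [],
    "mobile-api": [],
}

def blast_radius(failed_service):
    """Return every service that can be affected, directly or indirectly,
    by the failure of `failed_service` (breadth-first walk of dependents)."""
    affected, queue = set(), deque([failed_service])
    while queue:
        current = queue.popleft()
        for dependent in DEPENDENTS.get(current, []):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return sorted(affected)

if __name__ == "__main__":
    # Example: a database failure ripples out to everything built on top of it.
    print(blast_radius("database"))
    # -> ['auth-service', 'billing', 'mobile-api', 'web-frontend']
```

Even a simple list like this gives the incident leader a quick view of the chain reaction before handing out remediation tasks.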
Communicate
In the midst of an emergency, there is a tendency for workers to hunker down and focus on the horror unfolding before them, often ignoring the emails, angry service calls, and Twitter messages directed at them. Obviously, all resources must be put into fixing things, but that does not mean the company should go dark during a crisis.
Quite the opposite: acknowledging that there is a problem and telling customers and clients how it is being handled can ease the bad feelings an outage generates. Customers may still be upset by the interruption of service, but they are more likely to accept the inevitable apologies that follow a crisis if they see that the department or company has not abandoned them.
Educate
When the crisis is finally under control, the finger-pointing begins, and much time and energy is devoted to researching what happened, writing reports about what happened, briefing executives on what happened, justifying and explaining what happened to the board, and so on.
While much of this is, unfortunately, unavoidable, prepared managers will also turn the incident into a learning opportunity, putting systems and processes in place to deal with such events the next time they occur.
Prevent
While it is important to be prepared to deal with an outage, avoiding one altogether is what IT managers should aim for first. Doing the same thing and expecting different results will not succeed; to be fully prepared, a new approach and outside help may be needed. Systems that can analyze an IT ecosystem and map the interactions between its various components - hardware, software, configuration files, updates, and everything else the network includes - save a great deal of time, effort, and headache later.
Automated IT operations analysis systems can provide advance information on potential problems: whether a specific software update will rewrite configuration files in a way that could disrupt the service, what the implications of adding a new replication server are, and more. The system sends an alert when a risk is detected, allowing a rollback or other mitigation measures if necessary.
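As a rough illustration of the idea - not a description of any specific product - the sketch below checks a proposed change against a map of which services read which configuration files and raises an alert when the change would touch a file a running service depends on. All file paths, service names, and update names are hypothetical.

```python
# Minimal sketch of a pre-change risk check, assuming we maintain a map of
# which configuration files each service reads. All names are hypothetical.
CONFIG_READERS = {
    "/etc/haproxy/haproxy.cfg": ["load-balancer"],
    "/etc/myapp/db.conf": ["web-frontend", "reporting"],
    "/etc/ntp.conf": ["all-hosts"],
}

def assess_change(update_name, files_rewritten):
    """Return alert messages for every service that reads a file
    the proposed update would rewrite."""
    alerts = []
    for path in files_rewritten:
        for service in CONFIG_READERS.get(path, []):
            alerts.append(
                f"RISK: update '{update_name}' rewrites {path}, "
                f"which service '{service}' depends on - review before rollout."
            )
    return alerts

if __name__ == "__main__":
    for alert in assess_change("myapp-2.4.1", ["/etc/myapp/db.conf"]):
        print(alert)
```

Commercial analysis tools do far more than this, of course, but the principle is the same: surface the risk before the change goes live, so it can be reviewed or rolled back instead of discovered during an outage.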
By alerting staff to potential problems before they occur, automated analysis of IT operations can prevent interruptions altogether - unless, of course, a monkey gets into the works. At a minimum, these systems remove "unknown" from the list of failure causes.