Lead Analyst: Cal Braunstein
The Royal Bank of Scotland (RBS) group, which includes NatWest and Ulster Bank, recently experienced a massive week-long outage caused by an IT failure. Retail customers were unable to make or receive payments, greatly impairing their ability to process wages, mortgages, and other transactions and damaging the reputations of both the bank and its customers. The bank's retail customer account system utilizes CA Inc.'s CA-7 batch scheduling software. What should have been a routine procedure and straightforward upgrade performed by operations staff was inadvertently turned into a major catastrophe.
The story is that an operator running the end-of-day overnight batch cycle accidentally erased the entire scheduling queue. This error required the re-entry of the entire queue – a complex process requiring an in-depth understanding of the core system's processes and detailed knowledge of the legacy software. All this had to be completed within the overnight batch processing window, which for most firms is tight and leaves little room for error correction and reruns. This proved impossible, especially as pent-up demand and payment instructions built up in the queue, knocking other RBS systems, such as online banking, out of service. Eventually RBS had to rerun the previous day's transactions before new ones could be entered into the system. The delays and a backlog of up to 100 million transactions fed upon themselves, extending the outage over multiple days.
RFG notes that many observers pointed the finger at the bank's legacy mainframe systems – both the hardware and the software. However, RFG believes this is not the real story. The vast majority of banks run their retail customer account systems on mainframes and legacy software every day, and this kind of failure is a rare event. RBS runs on System z servers, so one cannot claim the bank is relying on outdated iron.
The real culprits are the bank's processes and personnel management. The multi-year banking crisis that RBS (and others) went through caused the firm to undertake cost-cutting measures over the past few years. IT organizations were not exempt from the staffing actions, and many of the IT jobs were outsourced to a team in India. Reports state that the person responsible for the error was part of this team, although an RBS executive claims otherwise. Outsourced or not, two things are evident: the staff was inexperienced and inadequately trained for the task, and no processes and procedures existed to identify the problems and correct them rapidly. The issues here are not technology but people and process.
RFG POV: The RBS business environment is not unique. Because of the financial meltdown that began in 2008, banks, other financial institutions, and enterprises of all types have been forced to slice budgets across multiple years, and IT budgets are no exception. For many companies this cost cutting continues. However, that does not mean IT is no longer accountable and responsible for its actions – it has a fiduciary responsibility to keep the business running regardless of the disaster. RBS did not properly staff and/or train its operations crews and did not have appropriate procedures in place to prevent such a failure. In many organizations procedures are not well documented, and smooth operations depend upon the institutional knowledge and skills of senior staff; when cuts come, these high-priced administrators and operators are frequently the first to go. IT executives should proceed cautiously when "rightsizing" staff and ensure that key skills and institutional knowledge are not lost in the process. Documentation tends to be an IT Achilles' heel. IT executives need to ensure that all procedures are well documented and tested and that staff are fully trained on them. As the proverb goes, an ounce of prevention is worth a pound of cure.