How To Estimate Your Identity and Access Management Recovery Time Objective
Based on an interview with Kayla Williams, CISO @Devo.
Navigating the intricacies of identity and access management systems, like those offered by Okta, requires an understanding of the roles both you and your vendor play. One critical element is the recovery time in case of a disaster, known as the Recovery Time Objective (RTO). It’s essential to recognize that the actual recovery process – restoring systems after a disruption – largely depends on the customer’s ability to respond and manage their data and configurations effectively, making RTO predominantly a customer responsibility.
This doesn’t mean IAM vendors lack responsibility. Vendors play a crucial role in ensuring the uptime of their services, maintaining robust and secure infrastructure, and meeting their own RTO to bring the core services back online in case of a system-wide issue. For instance, Okta commits to high service availability, providing reliable and secure access to all applications, without any planned downtime.
Consider this sobering statistic.
25% of businesses experiencing a disaster cease operations entirely.
The complexity of your IT infrastructure significantly impacts your RTO. As you map out a comprehensive recovery plan and assign critical roles within your team, understanding RTO’s impact on your specific business line, the potential pitfalls of stabilization, and the consequences for industries with stringent regulations becomes paramount.
Estimating RTO involves various factors, from industry norms to the specific reasons for an outage. Despite these complexities, with the right guidance and proactive planning, you can navigate any challenges.
Take a deep breath.
With guidance from Kayla Williams, CISO at Devo, we’ll equip you with tools to understand your RTO and navigate recovery. Let’s dive into the key elements influencing recovery time across four interconnected topics:
1. IAM Disruptions: Industry-Specific Impact
The impact of IAM (Identity and Access Management) downtime varies widely depending on your organization’s industry, niche, and vertical.
When we spoke to Kayla Williams, CISO of Devo, about the specifics of RTO and the organizational impact of IAM downtime, she brought us back to the notorious Okta breach and tied it to the customer’s responsibility for estimating RTO:
“I don’t know the exact time frames in which they were back up and operating, but just because Okta was up and running doesn’t mean that the organizations that were relying on Okta were able to apply any patches, fix the stabilization issues, or repair the operational impact.”
If your operational agility and business continuity depend directly on frictionless third-party IAM, downtime could not only be costly, it could bring your operations to a hard stop.
Did you know that just under 50% of organizations that experience downtime are prevented from delivering services to their customers? Even more staggering, in an industry like automotive manufacturing, downtime on production lines can cost $22,000 per minute, while overall industrial manufacturing downtime can amount to a loss of $50 billion annually.
Those are some pretty serious numbers to swallow. The weight of these numbers anchors us to the reality that the more your IAM’s downtime impacts external stakeholders and customers, the higher the cost, and the more vulnerable your business becomes.
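To put the per-minute figure in perspective, here is a back-of-the-envelope sketch in Python. It assumes, as a simplification, that cost scales linearly with outage length; the function name and the outage durations are ours, chosen purely for illustration.

```python
# Per-minute cost quoted above for automotive production lines.
COST_PER_MINUTE_USD = 22_000


def downtime_cost(minutes: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Estimate outage cost, assuming cost grows linearly with duration."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


# Illustrative outage lengths, in hours.
for hours in (1, 4, 8):
    print(f"{hours} h outage: ${downtime_cost(hours * 60):,.0f}")
```

Real losses rarely grow in a straight line, since contractual penalties and reputational damage tend to compound, so treat the result as a floor rather than a forecast.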
2. Stabilization Issues
Stabilization issues refer to disruptions in the regular workflow of an organization.
This could be anything from a minor delay in processing due to a failed login attempt, to significant issues such as a breach or compromise of the IAM system. These issues destabilize the operational consistency and reliability of the system, hence the term “stabilization issues”. In the context of an IAM system, these could be incidents where unauthorized individuals gain access to resources, users cannot access resources they are entitled to, or digital identities are managed inconsistently.
“Stabilization issues can have a broad impact on the operational aspect of the organization. So that has implications as well in terms of how you're going to respond, how your SOC (Security Operations Center) is going to respond, if they can respond. It’s important to consider the specific scenario and its implications on the organization,” said Kayla.
If a stabilization issue is severe, the SOC might become overwhelmed.
This can lead to longer recovery times and more complications.
Considering the specific scenario and its implications on the organization is important when planning the RTO. For example, if a company finds out that an unauthorized user has gained access to the system, the company must estimate how long it would take to fix this problem. If it’s a critical system where every minute of downtime means huge losses, the RTO should be very short, and the SOC team needs to have mechanisms in place to respond quickly.
3. Regulated Industries
For regulated industries, the stakes are high. Compliance risks are significant, and the consequences of prolonged downtime can be severe.
CISO Kayla Williams makes sure to offer some guidance on timelines:
“Ideally, if you’re in a regulated industry, you’re going to want your RTO to be max 8 hours - one full business day, plus consider if you’re global, that there are different chunks of 8 hours.
There is absolutely a lot of risk involved in these IAM downtime situations. If you can’t get into your [IT] environment, how can you respond if there’s an actual incident? That has implications as to how your SOC is going to respond and whether they even can respond to the disaster and downtime and its business impact.”
Medical or pharmaceutical companies that handle time-sensitive ingredients are one example of a regulated industry that would feel the pain of extended downtime. The SLAs (Service Level Agreements) you have in place with customers can be your lifeboat, or not, because most of them determine how much IAM downtime your business can afford.
4. Post Recovery Procedure
Break Down Your Recovery Into Components
Create a full breakdown of your organization’s recovery process and take account of the various components involved. Different scenarios bring different repercussions, challenges, and recovery timeframes. A cyberattack and a human error, for example, will affect recovery time differently, as will a natural disaster or a system failure.
Here are some steps that can help optimize recovery procedures and processes:
- Prepare a practical disaster recovery strategy in advance. This helps streamline the processes needed during any form of downtime. Tie in all of your lifeboat and rescue channels and resources: communication processes, templates, collaborating departments and teamwork, and even how to inform internal personnel of the scenario without sending the organization into a panic.
- Reinstate application access and management based on priorities. That means carefully and fully considering which mission-critical assets must be restored before “business as usual” is your actual state of operation. If that means restoring or creating new credentials for all users, so be it. Sometimes a “whatever it takes” attitude is what it takes to come out on top of the corporate survival game when it comes to downtime.
- IT security should always be at the forefront of the recovery process. Just remember, that doesn’t mean compromising a seamless user experience once mission-critical applications are accessible. This is particularly relevant to controls such as MFA (Multi-Factor Authentication), SSO (Single Sign-On), and PAM (Privileged Access Management), which should be reinstated in line with your risk tolerance.
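One way to make the breakdown concrete is to list the recovery steps for each scenario, estimate how long each takes, and compare the total against your target RTO. The Python sketch below does that under loud assumptions: every scenario, step name, and duration is hypothetical, steps are assumed to run one after another, and the 8-hour target is the ceiling suggested above for regulated industries.

```python
from dataclasses import dataclass

# 8-hour ceiling suggested above for regulated industries, in minutes.
TARGET_RTO_MINUTES = 8 * 60


@dataclass(frozen=True)
class RecoveryStep:
    name: str
    minutes: int   # estimated duration (hypothetical)
    priority: int  # 1 = restore first


# Hypothetical scenarios and estimates, for illustration only.
SCENARIOS = {
    "cyberattack": [
        RecoveryStep("Notify staff via out-of-band channel", 30, 1),
        RecoveryStep("Rotate credentials for all users", 240, 2),
        RecoveryStep("Restore SSO for mission-critical apps", 120, 3),
        RecoveryStep("Re-enable MFA and PAM policies", 150, 4),
    ],
    "human_error": [
        RecoveryStep("Notify staff via out-of-band channel", 15, 1),
        RecoveryStep("Roll back misconfigured policy", 45, 2),
        RecoveryStep("Verify access to mission-critical apps", 60, 3),
    ],
}


def estimate_rto(steps):
    """Sum step durations, assuming the steps run sequentially."""
    return sum(step.minutes for step in steps)


def report(scenario):
    """Summarize one scenario's estimate against the target RTO."""
    steps = sorted(SCENARIOS[scenario], key=lambda s: s.priority)
    total = estimate_rto(steps)
    status = "within" if total <= TARGET_RTO_MINUTES else "exceeds"
    return f"{scenario}: {total} min ({status} {TARGET_RTO_MINUTES}-min target)"


for name in SCENARIOS:
    print(report(name))
```

If a scenario exceeds the target, the step list shows where to look first: the longest steps are the candidates for automation, parallel work, or pre-staged resources.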
Once you’ve meticulously dissected your recovery into its various components, it’s crucial to understand the larger business implications, which leads us to the importance of conducting a thorough Business Impact Assessment.
Conduct A Business Impact Assessment
In our interview, CISO Kayla Williams shared her insight on setting a target RTO for your organization, including where a business impact assessment fits into the process:
“The best approach to determine the average recovery time would be to conduct a business impact assessment to understand the upstream and downstream impacts, vendor involvement – stack ranking them based on your priorities, and SLA requirements.
Looking across the upstream and downstream of your operations, and which departments need your attention most critically for operations to recoup.
Having a clear understanding of where everyone sits and the upstream and downstream impact is important. It supports the process of gaining consensus amongst the leaders in your organization. This will be beneficial to work towards a common denominator and identify which processes will break during and after a disaster.
A lot of organizations only think downstream. But the flow of operations goes both ways.”
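As a rough illustration of the upstream and downstream view described in the quote, the Python sketch below models systems as a small dependency map and stack ranks them by how many other systems break when each one goes down. The system names and dependencies are hypothetical, and counting affected systems is only one possible ranking criterion; a real assessment would also weigh SLA requirements and vendor involvement.

```python
# Hypothetical dependency map: each system lists what it depends on (its upstream).
DEPENDS_ON = {
    "identity_provider": [],
    "hr_system": ["identity_provider"],
    "billing": ["identity_provider", "hr_system"],
    "customer_portal": ["identity_provider", "billing"],
}


def upstream(system, graph=DEPENDS_ON):
    """All systems that `system` relies on, directly or indirectly."""
    seen = set()
    stack = list(graph[system])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return seen


def downstream(system, graph=DEPENDS_ON):
    """All systems that break, directly or indirectly, if `system` is down."""
    return {other for other in graph if system in upstream(other, graph)}


def stack_rank(graph=DEPENDS_ON):
    """Order systems by downstream impact, largest first (ties broken by name)."""
    return sorted(graph, key=lambda s: (-len(downstream(s, graph)), s))


for system in stack_rank():
    print(f"{system}: upstream={sorted(upstream(system))}, "
          f"downstream={sorted(downstream(system))}")
```

In this toy map the identity provider ranks first because every other system sits downstream of it, which mirrors the article's point that IAM downtime can stall the whole organization.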
Conclusion:
Understanding RTO is crucial for any organization. It not only affects operational continuity but also impacts stakeholder trust and organizational reputation.