Lessons Learned From the CrowdStrike Outage


July’s chaos demonstrated clearly why Observability and Site Reliability Engineering (SRE) are critical for your business


On the morning of Friday 19 July, screens at some of the world’s biggest businesses, broadcasters and public service organisations flickered with error messages and blue screens at boot up.

Supermarkets and shops were unable to take payments, flights were grounded and news bulletins were suspended for the morning as the outage hit Windows machines, starting in Australia and spreading through Asia and Europe before reaching the East Coast of the USA.

What caused the outage?

The issue was quickly traced to just one software company: CrowdStrike.

CrowdStrike helps companies find and prevent security breaches. Launched in 2011, the Texas-based company has around 29,000 customers, including more than half of the companies listed in the Fortune 1000.

At 10.45am, George Kurtz, CEO of CrowdStrike, posted on X, the social media platform formerly known as Twitter, seeking to reassure customers in the face of panic:

“CrowdStrike is actively working with customers impacted by a defect found in a single content update for Windows hosts. Mac and Linux hosts are not impacted. This is not a security incident or cyberattack. The issue has been identified, isolated and a fix has been deployed. We refer customers to the support portal for the latest updates and will continue to provide complete and continuous updates on our website. We further recommend organisations ensure they’re communicating with CrowdStrike representatives through official channels. Our team is fully mobilised to ensure the security and stability of CrowdStrike customers.”

While the world breathed a sigh of relief that this was not the work of hostile agents, questions immediately began to be raised: how did it happen? Could it have been avoided? What was the extent of the damage? And when would normal service be resumed?

The cause of what has been reported as “the single largest outage in history” was traced to a defect, described by the company as “a bad sensor configuration update” within CrowdStrike’s “Falcon” cybersecurity defence software for Windows hosts.


How could this have been avoided?

As the world picks up the pieces and reflects on the sobering realisation of how the dominoes of interconnected IT systems can fall so easily, questions are being asked about whether this could have been avoided in the first place and whether another such incident is waiting to happen.

The issue is that security software requires constant updates: threats are constantly evolving and, with Generative AI, this evolution is happening more quickly than ever. As a result, updates and patches need to be shipped more often, with shorter lead times. As the CrowdStrike incident has shown, the risk is that an incorrectly configured update can prove as disruptive as a cyberattack.

With updates being produced at pace, there is also less time for them to be properly tested. In CrowdStrike’s case, it is evident that the update did not receive full scrutiny in test environments before being deployed to customers in the real world.

One further aspect that caused such wide-scale disruption was that the update was deployed globally rather than to a small test population first. CrowdStrike has been criticised for its failure to roll out the update to a small number of customers in the first instance, which would have highlighted the issue early. Other commentators within the industry have said that such a trial deployment is not standard operating procedure for what was considered a “minor” update.

Clearly, this approach may come under review in the days and weeks ahead.
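For illustration only, the sketch below shows what a ring-based “canary” rollout gate might look like: the update is promoted fleet-wide only if each smaller ring stays healthy. The ring names, fleet fractions, thresholds and telemetry hooks are assumptions made for the example, not a description of CrowdStrike’s actual release process.

```python
# Minimal sketch of a ring-based (staged) rollout gate, assuming hypothetical
# deploy/halt callbacks and a telemetry query. Ring names, fleet fractions and
# error thresholds are illustrative, not CrowdStrike's actual release process.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Ring:
    name: str
    fleet_fraction: float   # share of hosts that receive the update in this ring
    max_error_rate: float   # abort threshold for this ring

ROLLOUT_RINGS = [
    Ring("canary", 0.01, 0.001),   # ~1% of hosts first
    Ring("early", 0.10, 0.001),
    Ring("broad", 1.00, 0.001),
]

def observed_error_rate(ring: Ring) -> float:
    """Placeholder: in practice, query telemetry (crash reports, agent
    heartbeats) for the hosts in this ring after the update lands."""
    raise NotImplementedError

def staged_rollout(deploy: Callable[[Ring], None],
                   halt: Callable[[Ring], None]) -> bool:
    """Promote the update ring by ring, halting on the first bad signal."""
    for ring in ROLLOUT_RINGS:
        deploy(ring)
        if observed_error_rate(ring) > ring.max_error_rate:
            halt(ring)   # stop before the defect reaches the whole fleet
            return False
    return True
```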

The damage is not limited to business disruption: CrowdStrike’s share price has dropped by nearly a quarter and its reputation has been seriously harmed. Reputational damage has also struck several other companies, particularly those considered large enough to have had contingency and continuity plans in place for precisely this kind of event. Many customers of these multinationals will be asking how they could be so exposed to such a seemingly simple error.

How can organisations avoid falling victim to similar issues?

Friday’s out-of-the-blue outage was caused by a company with a global reputation for providing stability and security. Businesses should therefore remember that risk is a dial, not a switch: there is no such thing as “risk free”. This is where Observability and Site Reliability Engineering (SRE) come into play: both are essential for businesses that want to respond quickly and effectively.

Here's why Observability and SRE are must-haves:

Faster Outage Detection and Diagnosis:

  • Observability: Provides a comprehensive view of your IT infrastructure. Metrics, logs and traces collected across systems offer real-time insights. This allows you to identify anomalies and potential issues before they snowball into full-blown outages (see the sketch below).
  • SRE: Proactively monitors system health through automated tools and dashboards. When an outage occurs, SRE teams can pinpoint the root cause quickly, minimising downtime.
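As a rough illustration of the kind of anomaly check that observability data makes possible, the sketch below flags a metric window that sits far outside its recent baseline. The sample numbers, window sizes and z-score threshold are assumptions for the example, not a specific monitoring product’s API.

```python
# Minimal sketch of an error-rate anomaly check over collected metrics.
# The sample numbers, window sizes and z-score threshold are illustrative
# assumptions, not any particular vendor's API.
from statistics import mean, stdev

def is_anomalous(recent_window: list[float],
                 baseline_window: list[float],
                 z_threshold: float = 3.0) -> bool:
    """Flag the recent window if it sits far outside the baseline."""
    baseline_mean = mean(baseline_window)
    baseline_sd = stdev(baseline_window) or 1e-9   # avoid divide-by-zero
    z_score = (mean(recent_window) - baseline_mean) / baseline_sd
    return z_score > z_threshold

# Example: a sudden spike in boot-failure reports after an update
baseline = [2, 3, 2, 4, 3, 2, 3]   # errors per minute over the last hour
recent = [40, 55, 61]              # errors per minute in the last few minutes
if is_anomalous(recent, baseline):
    print("Alert: error rate far above baseline; investigate the latest change")
```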

Improved Communication and Decision-Making:

  • Observability: Provides a single source of truth for the state of your systems. This shared view enables clear communication between IT and business stakeholders. Everyone understands the scope of the outage and can make informed decisions about recovery priorities.
  • SRE: Focuses on establishing clear communication protocols during outages. With readily available data from observability tools, SRE teams can effectively communicate the situation, estimated recovery time and potential workarounds.

Enhanced Recovery Efficiency:

  • Observability: Helps pinpoint the affected systems and isolate the problem. This allows for targeted recovery efforts, minimising the time it takes to bring critical systems back online.
  • SRE: Emphasises automation and infrastructure as code. This means recovery procedures can be automated, streamlining the process and reducing human error during a stressful outage situation (see the sketch below).
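To make the automation point concrete, here is a minimal sketch of a remediation loop an SRE team might codify: check each host, roll unhealthy ones back to a known-good version and hand anything still broken to a human. The helper functions are hypothetical placeholders rather than a real vendor API.

```python
# Minimal sketch of a codified remediation loop. The helpers
# check_host_health() and roll_back_update() are hypothetical placeholders,
# not a real CrowdStrike or monitoring-vendor API.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("remediation")

def check_host_health(host: str) -> bool:
    """Placeholder: query monitoring for the host's heartbeat/boot status."""
    raise NotImplementedError

def roll_back_update(host: str, version: str) -> None:
    """Placeholder: pin the host back to a known-good content version."""
    raise NotImplementedError

def remediate(hosts: list[str], known_good_version: str) -> list[str]:
    """Roll unhealthy hosts back to a known-good version, logging every step."""
    still_broken = []
    for host in hosts:
        try:
            if check_host_health(host):
                continue                      # healthy host, nothing to do
            log.info("Rolling back %s to %s", host, known_good_version)
            roll_back_update(host, known_good_version)
            if not check_host_health(host):
                still_broken.append(host)     # automation could not recover it
        except Exception:
            log.exception("Remediation failed for %s", host)
            still_broken.append(host)
    return still_broken                       # these hosts need manual recovery
```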

Continuous Improvement and Learning:

  • Observability: Provides data for post-incident analysis. By analysing logs and metrics, teams can understand the root cause of the outage and identify preventative measures to avoid similar issues in the future (see the sketch below).
  • SRE: Promotes a culture of continuous learning and improvement. By analysing outage data, SRE teams can refine their practices, strengthen infrastructure resilience and improve future responses.
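As a small illustration of post-incident analysis, the sketch below counts failures by the update version that was active on each host during the incident window; a version that dominates the counts is a strong root-cause candidate for the follow-up review. The log schema shown is an assumption made for the example.

```python
# Minimal sketch of a post-incident query: count failures by the content
# version active on each host during the incident window. The record fields
# ("timestamp", "event", "content_version") are an assumed log schema.
from collections import Counter
from datetime import datetime

def failures_by_version(log_records: list[dict],
                        window_start: datetime,
                        window_end: datetime) -> Counter:
    counts: Counter = Counter()
    for record in log_records:
        in_window = window_start <= record["timestamp"] <= window_end
        if in_window and record["event"] == "boot_failure":
            counts[record["content_version"]] += 1
    return counts   # the dominant version is a strong root-cause candidate
```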

In the context of the CrowdStrike incident, Observability and SRE are demonstrably essential practices that empower businesses to weather IT shutdowns with greater efficiency, minimise downtime and emerge stronger from disruptions. By providing real-time insights, clear communication and automated recovery procedures, these disciplines ensure business continuity and a smoother ride through the inevitable bumps on the IT road.

Known unknowns

While it is still early days, there are outstanding questions that CrowdStrike needs to answer:

We still do not know how many customers were affected, or what the long-term impact of the incident on individuals may be.

As discussed above, will CrowdStrike (and other security companies) be stricter about how software updates are rolled out in future? Will a staged rollout be mandatory - and what will this mean for the security of those last to receive updates?

CrowdStrike still needs to answer questions about why problems with the software update were not properly checked, identified and corrected before release. Was this human error, or does it point to a more serious organisational failure within CrowdStrike?

Once the dust has settled, there will be a long and complex reckoning. We may learn some uncomfortable truths about our vulnerability to mistakes, let alone hostile attacks. One thing is for certain: a spotlight has been shone on Observability and how organisations should be looking after themselves when those they rely on to keep them safe, fail.





