On the morning of Friday 19 July, screens at some of the world’s biggest businesses, broadcasters and public service organisations flickered with error messages and blue screens at boot-up.
Supermarkets and shops were unable to take payments, flights were grounded and news bulletins were suspended for the morning as the outage hit Windows machines, starting in Australia and sweeping through Asia and Europe before reaching the East Coast of the USA.
The issue was quickly traced to just one software company: CrowdStrike.
CrowdStrike helps companies find and prevent security breaches. Launched in 2011, the Texas-based company has around 29,000 customers, including more than half of the companies in the Fortune 1000.
At 10.45am on X, the social media platform formerly known as Twitter, George Kurtz, CEO of CrowdStrike, sought to offer reassurance in the face of panic:
“CrowdStrike is actively working with customers impacted by a defect found in a single content update for Windows hosts. Mac and Linux hosts are not impacted. This is not a security incident or cyberattack. The issue has been identified, isolated and a fix has been deployed. We refer customers to the support portal for the latest updates and will continue to provide complete and continuous updates on our website. We further recommend organisations ensure they’re communicating with CrowdStrike representatives through official channels. Our team is fully mobilised to ensure the security and stability of CrowdStrike customers.”
While the world breathed a sigh of relief that this was not the work of hostile agents, questions were immediately raised: how did it happen? Could it have been avoided? What was the extent of the damage? And when would normal service be resumed?
The cause of what has been reported as “the single largest outage in history” was traced to a defect, described by the company as “a bad sensor configuration update”, within CrowdStrike’s “Falcon” cybersecurity defence software for Windows hosts.
As the world picks up the pieces and reflects on the sobering realisation of how easily the dominoes of interconnected IT systems can fall, questions are being asked about whether this could have been avoided in the first place and whether another such incident is waiting to happen.
The issue is that security software requires constant updates: the threats are constantly evolving and, with Generative AI, this evolution is happening more quickly than ever. As a result, updates and patches need to be made more often, with shorter lead times. As the CrowdStrike incident has shown, the risk is that an incorrectly configured update can prove as disruptive as a cyberattack.
With updates being produced at pace, there is also less lead time for them to be properly tested. In the case of CrowdStrike, it is evident that the update did not receive full scrutiny in test environments before being deployed to customers in the real world.
A further reason for such wide-scale disruption was that the update was deployed globally rather than to a small test population first. CrowdStrike has been criticised for its failure to roll out the update to a small number of customers in the first instance, which would have highlighted the issue early. Other commentators within the industry have said that such a trial deployment is not standard operating procedure for what was considered a “minor” update.
Clearly, this approach may come under review in the days and weeks ahead.
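To illustrate the staged (or “canary”) rollout idea discussed above, here is a minimal sketch in Python. It is an assumption-laden illustration, not CrowdStrike’s actual deployment pipeline: the stage percentages, the error budget and the health_check() helper are all hypothetical stand-ins for real fleet telemetry.

```python
import random

# Progressively larger cohorts: 1% canary, then 10%, 50%, and finally everyone.
ROLLOUT_STAGES = [0.01, 0.10, 0.50, 1.00]
ERROR_BUDGET = 0.02  # halt if more than 2% of newly updated hosts look unhealthy


def health_check(host: str) -> bool:
    """Hypothetical stand-in for real telemetry (crash reports, agent heartbeats, etc.)."""
    return random.random() > 0.001  # assume a 0.1% baseline failure rate


def staged_rollout(hosts: list[str]) -> bool:
    """Push an update to successively larger cohorts, stopping on bad signals."""
    updated = 0
    for stage in ROLLOUT_STAGES:
        cohort = hosts[updated:int(len(hosts) * stage)]
        if not cohort:
            continue
        failures = sum(not health_check(host) for host in cohort)
        updated += len(cohort)
        if failures / len(cohort) > ERROR_BUDGET:
            print(f"Stage {stage:.0%}: {failures} unhealthy hosts - halting rollout")
            return False
        print(f"Stage {stage:.0%}: {len(cohort)} hosts updated and healthy")
        # In practice the rollout would soak for hours or days before widening.
    return True


if __name__ == "__main__":
    fleet = [f"host-{i:05d}" for i in range(10_000)]
    staged_rollout(fleet)
```

The point of the sketch is simply that a bad update confined to the first cohort would trip the error budget and stop there, rather than reaching every host at once.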
The damage is not limited to business disruption: CrowdStrike’s share price has dropped by nearly a quarter and its reputation has been seriously harmed. Reputational damage has also struck several other companies, particularly those considered large enough to have had contingency and continuity plans in place for precisely this kind of event. Many customers of these multinationals will be asking how they could be so exposed to such a seemingly simple error.
Given that Friday’s out-of-the-blue outage was caused by a company with a global reputation for providing stability and security, businesses should remember that risk is a dial, not a switch: there is no such thing as “risk free”. This is where observability and Site Reliability Engineering (SRE) come into play: both are essential for businesses that seek to respond quickly and effectively.
Here's why Observability and SRE are must-haves:
Faster Outage Detection and Diagnosis: real-time telemetry lets teams pinpoint which systems are failing, and why, in minutes rather than hours (see the sketch after this list).
Improved Communication and Decision-Making: a shared, accurate picture of system health supports clear updates to customers, executives and regulators while an incident is still unfolding.
Enhanced Recovery Efficiency: rehearsed runbooks and automated recovery procedures cut the time between diagnosis and restoration of service.
Continuous Improvement and Learning: blameless post-incident reviews feed the lessons of each outage back into processes, tooling and future updates.
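To make the first of these points concrete, here is a minimal sketch of a synthetic probe that raises an alert after repeated failed health checks. The endpoint URL, the thresholds and the notify() stub are hypothetical assumptions; in practice an SRE team would rely on dedicated monitoring and paging tools rather than a hand-rolled loop.

```python
import time
import urllib.error
import urllib.request

PROBE_URL = "https://status.example.internal/healthz"  # hypothetical endpoint
CHECK_INTERVAL_S = 30
FAILURES_BEFORE_ALERT = 3  # require consecutive failures to avoid paging on a blip


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the dependency answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, connection resets and socket timeouts
        return False


def notify(message: str) -> None:
    """Hypothetical stand-in for a real pager or incident-management integration."""
    print(f"ALERT: {message}")


def watch() -> None:
    """Check the dependency on a fixed interval, alerting on sustained failure."""
    consecutive_failures = 0
    while True:
        if probe(PROBE_URL):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures == FAILURES_BEFORE_ALERT:
                notify(f"{PROBE_URL} has failed {consecutive_failures} consecutive checks")
        time.sleep(CHECK_INTERVAL_S)


if __name__ == "__main__":
    watch()
```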
In the context of the CrowdStrike incident, Observability and SRE are demonstrably essential practices that empower businesses to weather IT shutdowns with greater efficiency, minimise downtime and emerge stronger from disruptions. By providing real-time insights, clear communication and automated recovery procedures, these disciplines ensure business continuity and a smoother ride through the inevitable bumps on the IT road.
While it is still early days, there are outstanding questions that CrowdStrike needs to answer:
We still do not know how many customers were affected, or what the long-term impact of the incident may be on individuals.
As discussed above, will CrowdStrike (and other security companies) be stricter about how software updates are rolled out in future? Will a staged rollout become mandatory - and what would this mean for the security of those last to receive updates?
CrowdStrike still needs to answer questions about why problems with the software update were not identified and corrected in checks before release. Was this human error, or does it point to a more serious organisational failure within CrowdStrike?
Once the dust has settled, there will be a long and complex reckoning. We may learn some uncomfortable truths about our vulnerability to mistakes, let alone hostile attacks. One thing is for certain: a spotlight has been shone on Observability and how organisations should look after themselves when those they rely on to keep them safe fail.