Reflecting on the CrowdStrike Incident: It’s Not Them, It’s Us
August 13th, 2024 | By Pedro Fortuna | 10 min read
This article was originally published on LinkedIn, on July 20th 2024. It’s been updated and reposted here, having incorporated information from the RCA report that was released by CrowdStrike.
The now infamous CrowdStrike incident was accidental rather than an intentional attack. A misconstructed "content" update was distributed, automatically updating thousands of Windows servers and computers with CrowdStrike’s Falcon sensor installed. The situation has since been resolved, as 99% of all servers are back online, but it’s important to reflect on the incident and understand what it means. A lot of the discussion is focusing on the vendor angle, and what they could have done to prevent this. I believe it’s equally important to reflect on what companies could’ve done better.
What caused the CrowdStrike incident?
Falcon hooks into the Microsoft Windows OS as a Windows kernel process, which gives it high privileges to monitor system operations in real time. However, a logic flaw in Falcon sensor versions 7.11 and above caused it to crash when processing a faulty content update. Because of this deep integration with the Windows kernel, the crash took the operating system down with it, producing a Blue Screen of Death (BSOD) and the consequences we witnessed this past July.
What can be done to minimize the risks of these outages?
Anyone in the software development business knows that, regardless of how good your testing and QA processes are, sooner or later you’ll ship a bug to production. It’s a risk that everyone - developers and users alike - accepts as intrinsic to a world where software development is prone to human error. That being said, there’s a lot that the software manufacturer could have done to minimize the risk of this happening.
Some examples include:
Employ file integrity validations
Create file checksums to detect any corruption potentially introduced between the moment a file is created and the moment it is about to be used. Formalize a structure for update files or ‘content’ that can be validated before use. It is now known that “the sensor expected 20 input fields, while the update provided 21 input fields”. Detecting this mismatch before applying the update would have saved the day. Another approach could focus on code signing of updates during testing: anything that tampers with or corrupts the update after that point would not have the key to generate a valid signature. The sensor must validate this signature and otherwise reject the update (both checks are sketched after this list);
Proper testing before shipping
Install the update on test machines before shipping it. CrowdStrike might have done this, and the update could still have been corrupted after testing and before installation; however, since the RCA report makes no mention of a corrupted update file, we have to assume it was a misconstructed content update combined with a failure to validate it before applying it. CrowdStrike’s official RCA does not clarify exactly what their testing procedures are;
Staged rollout updates
Before issuing an update to all customers, roll it out to a small fraction of them first and confirm that it landed well. This limits the impact of any problem (a simple bucketing scheme for this is sketched after this list);
Reboot loop protection
A simple technique for kernel-level drivers: on boot, the driver first checks whether it crashed on the previous run and, if so, avoids loading fully again, preventing an endless reboot loop (also sketched after this list);
Move the agent into user space
The BSOD is caused by a crash in kernel-space code. Moving as much of the agent as possible to user space would avoid it, as the program would simply crash without taking the whole system down with it;
Move to a memory-safe language
The crash itself was an out-of-bounds memory read, exactly the class of bug that a memory-safe language prevents; Rust would be most people's top preference here.
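To make the first of these items concrete, here is a minimal sketch in Rust of what pre-flight validation of a content update could look like. The update format, the 20-field constant, and all names are hypothetical rather than CrowdStrike’s actual design (it assumes the widely used sha2 and hex crates); the point is simply that both the checksum and the schema are verified before the content ever reaches the code that interprets it.

```rust
// Hypothetical pre-flight validation of a content update before it is applied.
// Assumed dependencies (Cargo.toml): sha2 = "0.10", hex = "0.4"
use sha2::{Digest, Sha256};

/// The number of input fields this sensor build knows how to interpret
/// (illustrative; echoes the "20 vs. 21 fields" mismatch from the RCA).
const EXPECTED_FIELDS: usize = 20;

#[derive(Debug)]
enum UpdateError {
    ChecksumMismatch,
    UnexpectedFieldCount { expected: usize, found: usize },
}

/// Reject the payload unless its SHA-256 matches the checksum published
/// alongside it (ideally inside a signed manifest).
fn verify_checksum(payload: &[u8], expected_hex: &str) -> Result<(), UpdateError> {
    let digest = Sha256::digest(payload);
    if hex::encode(digest) == expected_hex.to_lowercase() {
        Ok(())
    } else {
        Err(UpdateError::ChecksumMismatch)
    }
}

/// Reject the update unless every record has exactly the field count this
/// sensor version expects.
fn verify_schema(records: &[Vec<String>]) -> Result<(), UpdateError> {
    for record in records {
        if record.len() != EXPECTED_FIELDS {
            return Err(UpdateError::UnexpectedFieldCount {
                expected: EXPECTED_FIELDS,
                found: record.len(),
            });
        }
    }
    Ok(())
}

/// Only hand the content to the interpreter after both checks pass.
fn apply_update(
    payload: &[u8],
    expected_hex: &str,
    records: &[Vec<String>],
) -> Result<(), UpdateError> {
    verify_checksum(payload, expected_hex)?;
    verify_schema(records)?;
    // ...parse and activate the update here...
    Ok(())
}

fn main() {
    let payload = b"example content update bytes";
    let checksum = hex::encode(Sha256::digest(payload)); // normally shipped with the update
    let records = vec![vec!["field".to_string(); 21]]; // 21 fields: should be rejected
    println!("{:?}", apply_update(payload, &checksum, &records));
}
```

The same gate is the natural place for signature verification: check a signature over the payload against a public key baked into the sensor, and refuse to parse anything that fails.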
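The staged-rollout item can be gated with something as simple as deterministic bucketing. This is a sketch under assumed names (the device ids, release label, and ring percentages are made up), not a description of how CrowdStrike’s cloud actually schedules updates:

```rust
// Hypothetical server-side gate for a staged rollout.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Map a (device, release) pair to a stable bucket in 0..100, so each release
/// gets a different, but deterministic, canary population.
fn bucket(device_id: &str, release: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    device_id.hash(&mut hasher);
    release.hash(&mut hasher);
    hasher.finish() % 100
}

/// A device receives the update only once the rollout percentage covers its
/// bucket. The percentage starts small (e.g. 1%) and is raised ring by ring,
/// only after telemetry from the previous ring looks healthy.
fn should_update(device_id: &str, release: &str, rollout_percent: u64) -> bool {
    bucket(device_id, release) < rollout_percent
}

fn main() {
    for id in ["host-001", "host-002", "host-003", "host-004"] {
        println!("{id} at 1%: {}", should_update(id, "example-release-1", 1));
    }
}
```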
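And for reboot-loop protection, the core idea fits in a few lines: record that a risky initialization is in progress, clear the record once it succeeds, and fall back to a safe mode if a previous run left the record behind. The sketch below is user-space Rust with a hypothetical marker file; a real kernel driver would keep this state through OS facilities (such as the registry or boot-time configuration) instead.

```rust
// Hypothetical reboot-loop guard, sketched as a user-space program.
use std::fs;
use std::path::Path;

const CRASH_MARKER: &str = "agent.initializing"; // written before risky init

fn main() {
    let marker = Path::new(CRASH_MARKER);

    if marker.exists() {
        // The previous run never finished initializing, so it most likely
        // crashed. Skip the risky work instead of taking the host down again.
        eprintln!("previous startup did not complete; running in safe mode");
        run_safe_mode();
        return;
    }

    // Record that a potentially risky initialization is in progress.
    fs::write(marker, b"init in progress").expect("cannot write crash marker");

    initialize_risky_components();

    // Initialization survived: clear the marker so the next boot is normal.
    let _ = fs::remove_file(marker);

    run_full_agent();
}

fn initialize_risky_components() {
    // e.g. parse the most recent content update.
}

fn run_full_agent() {
    // Normal operation: hooks, telemetry, detections.
}

fn run_safe_mode() {
    // Minimal functionality: report the failure and wait for a fixed update,
    // ideally rolling back to the last known-good content version.
}
```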
Is it fair to say that companies that use CrowdStrike could’ve done better?
It’s difficult to say what the right answer is. Let’s think about it carefully. When a company decides to install software like CrowdStrike on its servers, it means that it trusts this vendor not to do anything malicious. After all, it’s an $86B company. Why would it jeopardize its reputation?
I bet you are already spotting the fallacy here. The biggest problem isn’t the vendor’s desires or motivations, but rather unintentional failures, such as failing to properly manage the integrity of updates or to validate them - or being compromised by a threat actor. These issues, combined with the trust that companies place in vendors, can lead to significant damage.
Should we blindly trust the vendors we use? Auto-updates, especially those related to security, are definitely a trend. Is this the right approach? I believe it is. Auto-updates offer more benefits than risks, and the problem is not that they can fail us. The problem is that we don’t always treat updates to the software we depend on as part of our production system. If we do, then we must ensure updates don’t break our services. The same goes for security testing: we should run static analysis, dynamic analysis, and other types of security checks even if the only change is a patch to a third-party dependency.
We can never be sure that we’ll catch every issue before shipping. Therefore, we need to assume that components we use might cause dependent services to become unavailable. IT managers know how to plan for business continuity, often by having secondary systems in place for situations like this, and they know better than to make those secondary systems exact copies of the primary, which would fail in exactly the same way. At the end of the day, this is also a business decision: weighing the cost of redundancy against the risk of a catastrophic failure may lead a company to simply accept the risk and its impact.
The scale of this incident made me think about what this means for cyber resilience.
In light of the world’s current economic struggles, is there a trend towards “monoculturization” in cyber defense? If there is, that could mean that companies are increasingly relying on a single vendor (or fewer vendors) to defend their assets. It doesn’t help that we have been seeing a lot of consolidation in the cybersecurity sector, with increased M&A activity in response to the economic downturn.
Lack of diversity doesn’t cause the problems, but it makes the consequences harsher when they happen. It arguably attracts more threat actors, as they prefer focusing on more ubiquitous systems that offer them more bang for the buck.
So here we are blaming the vendor when we have so many things in our control that would have helped:
Introduce redundancy (with diversity) to be more resilient to failures and impose a much higher cost to attackers, who then have to defeat multiple systems to succeed.
Treat the supply chain as part of our application - if we are applying a patch to a component our production system depends on, then we should test it as thoroughly as we test any other app we develop and deploy. Patch a staging environment first; only if the service passes the tests do we apply the patch to the production environment.
A "content update" is still a production update - some people pointed out that it wasn’t a driver update but just a ‘content update.’ However, there’s a possibility that a content update, or any type of update, could trigger the existing driver in an unexpected way, potentially causing a catastrophic failure. Therefore, any update should be treated as a production change that requires thorough testing (including security testing).
Isolate components that we use so that in the event of failure, they don’t compromise the whole system. In the case of CrowdStrike, this is easier said than done, as it is deployed to monitor every system. However, the principle still stands for other components that might fail.
Avoid "monoculturization" inside our organization. Ensure that in our threat model, we consider every dependency our production systems have, which can be compromised or become unavailable. How should we address it?
As lawsuits start to unfold, and while we should demand accountability from CrowdStrike, this incident should also serve as a warning that we shouldn’t treat third-party components in our supply chain as another company’s problem. We should plan for failure in these components the same way we plan for failure in the components we develop. It’s not about dismissing CrowdStrike’s responsibility; it’s about admitting that there are things we can control and focusing on those.
So, it’s not them, it’s us.