CrowdStrike Outage – Full Story

It seemed like the world's largest cyberattack, yet the culprit was the very software designed to protect us.

Even iconic landmarks such as the Times Square billboards were offline. Almost no one was immune.

July 19, 2024
I had just lain down at 1:30 AM and was about ready to close my eyes.
Then, at 1:36 AM, my phone lit up.
It was a message from Tom, the NOC tech on-call.

“Steve, you up? Something is really wrong, I just got 1,000+ Nagios alerts in my inbox.”

I ran down to my laptop and tried to log in, but instead of my desktop I was greeted with a blue screen that read…

“csagent.sys failed”.

I immediately recognized it as the CrowdStrike driver and realized this was going to be an unfortunately exciting night. From my cell phone, I started a call with the NOC tech and told him what had just happened to my computer. I then called Djamel, the new SysAdmin who had started just two weeks earlier; as he tried to log in, the Blue Screen of Death gave him the same “csagent.sys failed” message.

At that point I realized we had to raise the alarm. As we began our research, we immediately started trying to contact other administrators, NOC techs, and senior leadership to bring them up to speed and assemble a larger team. But since it was 2 AM, we had difficulty reaching people.
So rather than waste time waiting for a supervisor, I took the lead and began pulling more people into my response team. My small but diligent team consisted of Mike, the Security Lead; Tom, the NOC tech on call; Djamel, the new SysAdmin; and myself.

We started to research and discovered this wasn’t just our company – this was global. Reports started flooding in from social media with the first reports coming in from New Zealand and Australia, where banks were reporting outages.

Our Response

For the next 12 hours, I coordinated the multi-team, multi-department, recovery effort, teaching techniques on the fly as we encountered new scenarios. After getting the repair process documented, we all started working on servers together, dividing up the workload.

By 2:11 AM we knew what the problem was and how to resolve it.

By 2:30 AM I had assembled a team, and we had our own machines patched and ready to attack the growing list of servers.

We continued trying to contact more people every 30 to 60 minutes, slowly adding one or two members every hour.

Our NOC tech had recently audited all the critical servers and understood the careful order in which some of them needed to be brought back online to prevent data loss, so he began outlining the dependencies for the servers that we currently had down. While he was mapping out that sequence, I was teaching our Security Lead how to boot a server into Safe Mode and how to use the command prompt to navigate to the CrowdStrike directory and purge that corrupt channel file.
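The per-machine fix we were teaching can be sketched in code. The widely publicized remediation was to boot into Safe Mode and delete the channel file(s) matching C-00000291*.sys from C:\Windows\System32\drivers\CrowdStrike; the Python below is an illustrative sketch of that purge step (the directory is passed in as a parameter, and the helper name is my own), not the exact commands we ran:

```python
from pathlib import Path

def purge_channel_291(drivers_dir: str) -> list:
    """Delete files matching the bad channel file pattern and return their names.

    On a real host, drivers_dir would be C:\\Windows\\System32\\drivers\\CrowdStrike,
    reached after booting into Safe Mode (hypothetical helper for illustration).
    """
    removed = []
    for f in Path(drivers_dir).glob("C-00000291*.sys"):
        f.unlink()               # remove the corrupt channel file
        removed.append(f.name)
    return sorted(removed)
```

After the purge, a normal reboot let Windows start cleanly, since the agent no longer loaded the corrupt file.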

For servers where we’d lost the local administrator passwords over the years, I walked the team through breaking into those accounts using tools and techniques I’d acquired over my 15+ years in the IT field. The NOC tech’s recent audit work proved critical as we brought servers back online in the specific sequence he had mapped out. Some of it was obvious, like database servers before app servers, but some microservices added complexity. In our case, a specific database replication server had to be brought online before the core database servers, and core db-server-2 had to be brought online before core db-server-1; otherwise, the replication could become too far out of sync.

While there was a painful list of which servers to bring up when, going into further detail would breach NDAs and could pose a security risk.

But in short, we had several groups of servers, each with two to eight dependent servers and microservices.
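The ordered bring-up described above is, in effect, a topological sort over a server-dependency graph. A minimal sketch using Python’s standard-library graphlib (the server names and dependencies here are invented for illustration; the real map stays under NDA):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Invented example: each server maps to the servers that must be up BEFORE it.
deps = {
    "db-replica":   [],                              # replication server first
    "db-server-2":  ["db-replica"],
    "db-server-1":  ["db-server-2"],                 # db-2 before db-1 (replication quirk)
    "app-server":   ["db-server-1", "db-server-2"],
    "web-frontend": ["app-server"],
}

# static_order() yields a valid boot sequence respecting every dependency.
boot_order = list(TopologicalSorter(deps).static_order())
```

Keeping the dependency map as data, the way our NOC tech’s audit effectively did, means the boot sequence can be recomputed instead of memorized.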

By 7 AM, the Help Desk started to receive a flood of calls. At this point we had 12 people on our rapid response team.

By 8:30 AM we had restored all critical business systems in the proper order, with no data loss or corruption. Even as we continued to add managers and supervisors of different departments, I was still Incident Commander, leading the way on assigning SysAdmins, Engineers, and Technicians to various servers or to different teams to train others on how to remedy the problem.

About 12 hours later, at 2:11 PM, I was finally, reluctantly, tagged out by numerous team members and the Director of Security (Lollino).

Lollino: “Steve, why don’t you get some rest?”
I responded that I was fine and just needed some food and caffeine.

Lollino:
“Steve, I may not be your direct supervisor, but I’m not asking you, I’m telling you: you need to sleep. You need to get rest. We’ve got this. You did a great job handling this so far, now go to sleep.”

Several other team members chimed in agreeing with Lollino.

So I conceded and said my goodbyes and finally went to bed.

Global Impact

As the day went on, it seemed like 25% of the Western world was under a digital attack, yet the culprit was the very software designed to protect us. There were outages everywhere, from Delta Air Lines globally to local small school systems, banks, and hospitals.


Retail giants, small hometown MSPs, and numerous government institutions were all struggling. If you used CrowdStrike, now the whole world knew. Nearly everyone running CrowdStrike was dead in the water.

While organizations worldwide faced at least three to five days of downtime, my crisis leadership and technical expertise limited the total disruption to our core business services to JUST 1.5 hours.

Aftermath

Later multiple coworkers informed me that multiple managers and several other employees said: “If Steve wasn’t there we would have been screwed.”

Then someone else added, “I don’t think anyone else would have known how to break into the admin account on those servers. I didn’t even know that was possible. That was just… wow, bravo.”

My immediate manager later told me that the Senior Director of IT said:
“Steve really saved the day back there, that was truly impressive.”

Cause

The cause of all of this was poor architecture and change control on the part of CrowdStrike.

Basically, CrowdStrike performed integrity and functionality tests on its core Windows agent, but did NOT perform integrity and functionality tests on Channel Files, which are CrowdStrike’s modern-day equivalent of a virus database that the agents use.

Details

In short, once installed, the CrowdStrike agent runs as part of the core Windows kernel (the core of the operating system), so if the agent fails to function properly, Windows crashes or fails to start.

So when CrowdStrike released a faulty file, Channel File 291, every CrowdStrike agent that was online immediately loaded the bad Channel File, and the system halted. The same thing happened to machines that were powered off: as soon as the computer was powered on, Windows loaded CrowdStrike before almost anything else, and the computer crashed.

Channel Files

A Channel File is basically the equivalent of a virus database, although it is much more advanced than a simple database of virus signatures.

A Channel File in CrowdStrike is a collection of threat-specific metrics combined with behavioral-analysis patterns and a set of OS parameters to look for. Channel Files live outside of kernel space and therefore are not subject to driver signing, despite being handled by a CrowdStrike component that runs in the kernel space of the Windows OS.

The official code-integrity checks that CrowdStrike performed prior to releasing an update NEVER included checking the structural integrity of the Channel Files. So a Channel File with excessive or missing parameters could be released, and there was nothing preventing that bad Channel File from being sent out globally.

THIS WAS EXACTLY WHAT HAPPENED ON JULY 19, 2024.
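To illustrate the missing safeguard (the field count and function name below are invented, not CrowdStrike’s actual format): even a trivial structural check on a Channel File’s parameter list before release would have rejected a file with missing or excessive parameters.

```python
# Hypothetical sketch: a pre-release structural check on a channel file's
# parameter list. The expected count is invented for illustration.
EXPECTED_PARAM_COUNT = 21

def channel_file_is_well_formed(params: list) -> bool:
    """Reject channel files with missing, excessive, or empty parameters."""
    return (len(params) == EXPECTED_PARAM_COUNT
            and all(p is not None for p in params))
```

A gate this simple, run before any global push, is the kind of check the incident showed was absent for Channel Files.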

This also explains why this change bypassed all change control mechanisms that CrowdStrike allows companies to implement. 

In CrowdStrike, all change-control mechanisms available to companies applied specifically to the CrowdStrike AGENT and NOT to Channel Files.

Channel Files were assumed to be harmless despite being a critical part of the CrowdStrike agent’s ability to function.

So even if a company configured all servers to receive the latest CrowdStrike agent update two to three weeks after a test group, it would not matter: CrowdStrike forced all agents globally to immediately receive and use any updated Channel Files as soon as they were released.
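That gap can be shown with a toy model (entirely hypothetical names and logic, not CrowdStrike’s code): customer staging policies gate agent updates, while Channel File updates bypass the policy and apply immediately.

```python
from dataclasses import dataclass, field

# Toy model of the change-control gap (hypothetical, for illustration only).
@dataclass
class Host:
    agent_version: str
    channel_files: dict = field(default_factory=dict)
    delay_agent_updates: bool = True     # customer's staged-rollout policy

def apply_update(host: Host, update: dict) -> str:
    if update["kind"] == "agent":
        # Agent updates respect the customer's staging policy.
        if host.delay_agent_updates:
            return "deferred by customer policy"
        host.agent_version = update["version"]
        return "agent updated"
    # Channel files: no policy check; pushed to every online host at once.
    host.channel_files[update["name"]] = update["data"]
    return "channel file applied immediately"
```

In this model, the only update path a customer can slow down is the agent’s; the Channel File path has no knob at all, which is exactly why the staged rollouts companies had configured offered no protection on July 19.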