THE global Microsoft outage sent shockwaves across the world and, in the Philippines, left a trail of disrupted flights, stalled transactions, and frustrated customers.
The July 19 outage, which can be considered the largest in IT history, was triggered by a botched software update from security vendor CrowdStrike that affected millions of Windows systems around the world. Affected systems were unable to boot and displayed the infamous blue screen of death (BSOD).
At Ninoy Aquino International Airport (NAIA), long queues snaked through the terminals as Cebu Pacific and AirAsia Philippines scrambled to cope with the fallout.
Sarah, a harried traveler bound for Cebu, recounted her ordeal: “The check-in kiosks were down, and the lines were insane. We were stuck for hours, missing our flight and scrambling to rebook.” The airline staff, overwhelmed by the sudden influx of passengers, worked tirelessly to manually process check-ins and rebookings.
Meanwhile, at a Land Bank branch in Makati, a long line of customers waited anxiously to access their accounts. “I need to withdraw money for an emergency,” lamented a visibly distressed woman, a barangay kagawad in San Pablo, Laguna. “I’ve been here for over an hour, and there’s no end in sight.”
In another part of the city, a small business owner named Carlos R. faced a different kind of crisis, which he described in a post on Instagram. His online transactions were frozen due to the outage, leaving him unable to pay suppliers or process customer orders. “This is a nightmare,” he groaned. “I’m losing money by the minute, and there’s nothing I can do about it.”
Carlos resorted to calling his clients and suppliers individually to explain the situation and arrange alternative payment methods. “It’s a hassle, but I have to keep my business running,” he said with determination.
The outage stemmed from a seemingly innocuous sensor configuration update that CrowdStrike automatically deployed to all Falcon-connected Windows systems. The update delivered a faulty file, “C-00000291*.sys,” containing a single-line error. When the Falcon sensor, which runs as a Windows kernel driver, processed the file, it hit a logic error that crashed the operating system and produced the BSOD.
The impact was compounded because the Falcon driver loads early in the Windows boot sequence. Each restart loaded the flawed update again, perpetuating the BSOD loop and leaving affected computers unable to boot properly.
At the core of the issue was a null pointer dereference error within the CrowdStrike update’s code. A null pointer is a variable that does not point to any valid memory location.
In this instance, a coding oversight produced such a null pointer, and a missing null check allowed the code to attempt to read data from that invalid location. The illegal memory access triggered a memory violation, and because the code was running inside the Windows kernel, the operating system halted the entire machine as a safety measure.
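For readers who want to see what a missing null check looks like, here is a minimal sketch in Python. It is an analogy only, not CrowdStrike's code: Python's None stands in for a null pointer, and the Rule class and every function name below are hypothetical. The unchecked version fails the moment a lookup comes back empty, while the checked version falls back to a safe default.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Rule:
    """Hypothetical detection rule loaded from an update file."""
    name: str
    pattern: str


def find_rule(rules: Dict[str, Rule], name: str) -> Optional[Rule]:
    # Returns None when the update lacks the requested rule;
    # None plays the role of a null pointer in this analogy.
    return rules.get(name)


def pattern_unchecked(rules: Dict[str, Rule], name: str) -> str:
    # No check: if find_rule returns None, the attribute access below
    # raises AttributeError, much as a null dereference faults in C.
    return find_rule(rules, name).pattern


def pattern_checked(rules: Dict[str, Rule], name: str, default: str = "") -> str:
    rule = find_rule(rules, name)
    if rule is None:
        # The null check: degrade gracefully instead of crashing.
        return default
    return rule.pattern
```

In ordinary application code, a failure like this ends one program. Inside an operating system kernel there is no such cushion, which is why the same class of mistake can take down the whole machine.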
CrowdStrike swiftly acknowledged the problem and released a public statement, accompanied by a temporary workaround to mitigate the damage. Simultaneously, Microsoft collaborated with CrowdStrike and external developers to expedite a permanent solution. Microsoft also issued technical guidance through the Windows Message Center, instructing users on how to recover their systems.
The eventual fix involved a patch from CrowdStrike that addressed the null pointer error and prevented further crashes. Additionally, the problematic update was rolled back to ensure no new systems were affected.
The repercussions of the outage were felt worldwide, affecting businesses, healthcare providers, airlines, financial institutions, and countless individuals. Critical operations were disrupted, causing significant financial losses and operational setbacks.
Delta and United Airlines were the worst-affected US airlines. Health services in Britain, Israel and Germany were also impacted on Friday, with some services cancelled. The massive outage has put a spotlight on the vulnerability of global computer networks, showing how a single glitch can cause global chaos.
The CrowdStrike-Microsoft outage underscores the importance of rigorous testing and quality assurance in software development, especially for updates that are automatically deployed to a vast number of systems. The incident highlights the need for robust error handling mechanisms to prevent a single point of failure from cascading into a widespread outage.
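One common safeguard for automatically deployed updates is a staged rollout, in which an update reaches a small group of machines first and stops spreading if too many of them fail. The Python sketch below illustrates the idea under stated assumptions: the function name, the ring layout, and the 1 percent failure threshold are all hypothetical, and nothing here describes how CrowdStrike or Microsoft actually ship updates.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    deploy: Callable[[str], bool],
    max_failure_rate: float = 0.01,
) -> List[str]:
    """Deploy ring by ring and stop once a ring's failure rate is too high.

    `deploy` is a caller-supplied function that returns True when a host
    is healthy after the update. Returns the hosts updated successfully.
    """
    updated: List[str] = []
    for ring in rings:
        if not ring:
            continue  # nothing to deploy in an empty ring
        results = [(host, deploy(host)) for host in ring]
        failures = sum(1 for _, ok in results if not ok)
        updated.extend(host for host, ok in results if ok)
        if failures / len(ring) > max_failure_rate:
            break  # halt: later rings never receive the update
    return updated
```

If the first ring is a handful of test machines, a faulty update stops there, and the larger rings that follow never receive it.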
Moreover, the incident emphasizes the significance of transparent communication and swift response in mitigating the impact of such events. CrowdStrike and Microsoft’s rapid actions and collaboration in addressing the issue were crucial in minimizing the disruption and restoring services.