7 lines of code crashed Bilibili for 3 hours, all because of "a scheming 0"

Hacker technology

By Nell Jonas • Published 2 years ago • 6 min read

That is to say, a and b are repeatedly replaced by b and the remainder a % b until b == 0, at which point this statement in the function:

if b == 0 then return a end

takes effect, and the result is returned.

Based on this mathematical principle, let's look at this code again, and there seems to be no problem:
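A sketch of the recursive function the text is describing, reconstructed from the two statements quoted in this article (`if b == 0 then return a end` and `return _gcd(b, a % b)`). The exact layout in lua-resty-balancer may differ, and the sample call is only an illustration:

```lua
-- Euclid's algorithm: gcd(a, b) = gcd(b, a % b), stopping when b reaches 0.
local _gcd
_gcd = function(a, b)
  if b == 0 then
    return a
  end
  return _gcd(b, a % b)
end

print(_gcd(12, 18))  --> 6
```

With numeric inputs the remainder shrinks on every call, so b eventually becomes 0 and the function returns.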

But what if the b entered is a string "0"?

Bilibili's technical analysis article mentioned that the code for this accident was written in Lua. Lua has several characteristics:

• It is a dynamically typed language: variables do not need a declared type, you simply assign a value to them.

• When Lua performs arithmetic on a numeric string, it attempts to convert the string to a number. The == comparison performs no such conversion.

• In the Lua runtime involved here (LuaJIT, like Lua 5.1, stores every number as a double), the result of the operation n % 0 is nan (Not a Number).
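The three characteristics can be checked in a few lines. This is a minimal sketch; the variable names are illustrative, and the dividend is written as `10.0` so that the snippet also yields nan on newer Lua versions, where integer modulo by zero raises an error instead:

```lua
local weight = "0"            -- no type declaration: the variable simply holds a string

print(type(weight))           --> string
print(weight == 0)            --> false  (== does not convert the string)
print(weight + 1 == 1)        --> true   (arithmetic converts "0" to a number)

local r = 10.0 % weight       -- modulo by (converted) zero
print(r ~= r)                 --> true   (only nan is unequal to itself)
```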

Let's simulate this process:

1. When b is the string "0", the gcd function does not check its type, so at the conditional "0" is not equal to 0. The statement "return _gcd(b, a % b)" is therefore executed, and since a % "0" is nan, it calls _gcd("0", nan).

2. _gcd("0", nan) is executed. nan is not equal to 0 either, and "0" % nan is nan, so the next call is _gcd(nan, nan).

From here there is no way out: the b == 0 condition in the conditional can never be met, so an endless loop appears.
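The steps above can be replayed safely by walking the recursion as a loop with a call limit. This is an illustrative sketch, not the balancer's code: `simulate_gcd`, `args_after`, `is_nan`, and the limit of 100 are made-up names and values, and the float literal `10.0` keeps the behaviour the same across Lua versions:

```lua
local function is_nan(v) return v ~= v end

-- Replays the calls _gcd(a, b) would make, giving up after `limit` calls.
-- Returns whether `b == 0` was ever reached, and how many calls were made.
local function simulate_gcd(a, b, limit)
  for call = 1, limit do
    if b == 0 then
      return true, call
    end
    a, b = b, a % b          -- arguments of the next `_gcd(b, a % b)` call
  end
  return false, limit
end

-- Returns the arguments the recursion would hold after n steps.
local function args_after(a, b, n)
  for _ = 1, n do a, b = b, a % b end
  return a, b
end

print(simulate_gcd(12, 18, 100))      --> true   4
print(simulate_gcd(10.0, "0", 100))   --> false  100

local a2, b2 = args_after(10.0, "0", 2)
print(is_nan(a2), is_nan(b2))         --> true   true
```

Note that `return _gcd(b, a % b)` is a tail call, which Lua executes without growing the stack, so the real function spins forever rather than failing with a stack overflow.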

In other words, the program spins frantically, consuming 100% of the CPU for a result it will never get, so naturally other users' requests cannot be processed.

So the question is, how on earth did this "0" get in?

The official explanation is:

In a certain release mode, an application's instance weight is briefly adjusted to 0, and the weight the registry returns to the SLB (load balancer) is then the string "0". This release mode is used only in the production environment, and only very rarely, so the problem was not triggered during the earlier grayscale rollout of the SLB.

In the balance_by_lua phase, the SLB passes the service IP, port and weight stored in shared memory to the lua-resty-balancer module as parameters for selecting an upstream server. When a node has weight = "0", the input parameter b received by the _gcd function in the balancer module may be "0".
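One way to keep a string weight from reaching the arithmetic is to normalize it at the boundary. The sketch below is an assumption about how such a guard could look, not Bilibili's actual fix; `normalize_weight` is a hypothetical helper:

```lua
-- Converts a raw weight (number or numeric string) to a number, rejecting
-- anything that is not a non-negative whole number.
local function normalize_weight(raw)
  local w = tonumber(raw)                       -- "0" -> 0, "abc" -> nil
  if w == nil or w ~= w or w < 0 or w % 1 ~= 0 then
    return nil, "invalid weight: " .. tostring(raw)
  end
  return w
end

print(normalize_weight("0") == 0)     --> true
print(normalize_weight("abc") == nil) --> true
print(normalize_weight(1.5) == nil)   --> true
```

After normalization the value compared in `b == 0` is a real number, so a weight of "0" takes the terminating branch.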

How the bug was located

With the benefit of hindsight (playing "Zhuge Liang after the fact"), the root cause of Bilibili's complete collapse looks almost trivial: "is that all?"

But from the programmers' point of view, things were really not that simple.

At 22:52 that evening, when most programmers had just got off work or had not yet left (doge), Bilibili's operations staff received an alert that services were unavailable, and immediately suspected a problem with infrastructure such as the data center, the network, the layer 4 LB, or the layer 7 SLB.

They immediately opened an emergency voice conference with the relevant technical staff and began handling the incident.

Five minutes later, operations found that CPU usage on the layer 7 SLB in the main data center, which carries all online business, had reached 100% and that it could not handle user requests. After ruling out the other infrastructure, they pinned the fault on this layer.

(Layer 7 SLB refers to load balancing based on application-layer information such as the URL. Load balancing distributes client requests across a server cluster using an algorithm, thereby reducing the pressure on each server.)

In the middle of the emergency there was also a side episode: programmers working remotely from home logged in to the VPN but could not reach the internal network, so they had to make more calls and go through a "green channel" before everyone was online (because one of the domain names involved was proxied by the malfunctioning SLB).

By this time 25 minutes had passed, and the emergency repair officially began.

First, operations hot-restarted the SLB, which did not recover. They then tried rejecting user traffic and cold-restarting the SLB, but the CPU stayed at 100% and it still did not recover.

Operations then found that a large number of requests to the SLB in the multi-active data center were timing out, although its CPU was not overloaded. Just as they were about to restart that SLB, people in the internal chat group reported that the main site's services had recovered, and that video playback, recommendations, comments, feeds and other functions were basically back to normal.

It was 23:23, 31 minutes after the incident began.

It is worth mentioning that these functions were restored precisely because the "high-availability disaster recovery architecture", which netizens mocked at the time of the incident, did its job.

As for why this line of defense did not work at first, you and I may share a little of the blame.

Put simply, when Bilibili would not load, everyone started refreshing like crazy. CDN back-to-origin retries plus user retries caused Bilibili's traffic to surge more than fourfold and the number of connections to jump to the 10-million level, which overloaded the multi-active SLB as well.

However, not all services had a multi-active architecture, so the incident was not yet fully resolved.

Over the next half hour the team tried many things, including rolling back the Lua code released in the previous two weeks or so, but none of it restored the remaining services.

By midnight the team was out of options: "Never mind how the bug came about, let's restore all the services first."

The approach was simple and crude: operations spent an hour building a brand-new SLB cluster.

At 1:00 in the morning, the new cluster was finally ready:

On the one hand, some staff gradually switched the traffic of core businesses such as live streaming, e-commerce, comics and payments to the new cluster and restored all services (everything was done by 01:50, bringing the nearly 3-hour outage to a provisional end).

On the other hand, others continued to analyze the cause of the bug.

After they produced detailed flame graph data with a profiling tool, the troublemaker "0" finally showed a trace of itself:

The CPU hotspot was clearly concentrated in a call into the lua-resty-balancer module. After a certain execution, the module's _gcd function returned an unexpected value: NaN.

At the same time, they also found the triggering condition: a container IP with weight=0.

They suspected that this function had triggered a bug in the JIT compiler, producing a wrong result and falling into an endless loop that drove the SLB CPU to 100%.

So JIT compilation was turned off globally, temporarily sidestepping the risk. By the time everything was settled it was almost 4 o'clock, and everyone could finally get some sleep.

The next day nobody was idle. After reproducing the bug in an offline environment, they found that the JIT compiler was not the problem: in a certain special release mode of the service, the weight of a container instance was 0, and this 0 was a string.

As mentioned earlier, in the dynamic language Lua the string "0" is converted to a number in arithmetic but not in the == comparison, so the function takes the wrong branch, falls into an endless loop, and triggers a crash on a scale Bilibili had never seen before.

Is recursion to blame, or the weakly typed language?

Many netizens still remember the incident vividly: some recall assuming the problem was on their end and switching between phone and computer to no avail, while others remember that it was on the trending searches five minutes later.

Everyone was surprised that such a simple endless loop could bring down such a large website.

However, some pointed out that endless loops are not uncommon. What is rare is for one to occur at the SLB layer, in the traffic distribution path, where, unlike a back-end problem, it cannot be fixed with a quick restart.

To avoid this situation, some people think recursion should be used with care, or that a counter should be set so that the function returns directly once it reaches a value the business is unlikely to reach.
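The counter idea could look roughly like the following. This is a hedged sketch rather than anything from the balancer module: `gcd_bounded` and `MAX_DEPTH` are hypothetical names, and the cap of 64 assumes weights are small enough that a legitimate computation finishes in far fewer steps:

```lua
local MAX_DEPTH = 64   -- hypothetical cap, chosen to be unreachable for sane weights

-- Recursive gcd that gives up instead of recursing forever.
local function gcd_bounded(a, b, depth)
  depth = depth or 0
  if b == 0 then
    return a
  end
  if depth >= MAX_DEPTH then
    return nil, "gcd did not converge"
  end
  return gcd_bounded(b, a % b, depth + 1)
end

print(gcd_bounded(12, 18))            --> 6
print(gcd_bounded(10.0, "0") == nil)  --> true
```

The cap turns a silent CPU spin into an error value the caller can log and handle, though it does not address the string weight that caused the problem in the first place.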

Others think that recursion is not to blame here, but ma
