A cloud provider's servers crashing? It sounds unbelievable, but it recently happened to me. I am documenting this incident only as a cautionary tale for others, not to generalize from a single case or to dismiss the provider's overall quality of service. The provider's name has been redacted.

Incident Timeline

Note: everything below comes from the provider's notices; times are US Mountain time (UTC-07:00).

March 6, 2023

SJC is out of hardware resources.

The new SSDs, RAM, and servers are on the way.

  1. Two nodes will be rebooted next weekend (March 17 or 19) for a RAM upgrade.
  2. The block storage will be doubled, and snapshots will be available this weekend (March 10 or 12).
  3. Your VM might be cold-migrated (one reboot) to a new node without notice due to tight hardware resources.
  4. New instance purchases will be available the day after the action (March 11 or 13) if action no. 2 is completed on time.

Summary: with hardware resources exhausted, they planned to upgrade the data center's hardware; VMs would be migrated and rebooted.

March 8, 2023

We have to perform an emergency reboot on some SJC nodes. It will be done very soon.

Summary: no specific reason was given for rebooting some of the nodes, but something was clearly wrong. At this point, service had been interrupted for roughly 13 minutes.

[Screenshot: 2023-03-08-uptime]

March 9, 2023

01:03 PM

We've noticed I/O errors in SJC; investigating. We'll keep you posted.

01:10 PM

The network component we use applied a layer3+4 hash policy to the bond interface, which is not supported over InfiniBand.

This caused a disconnection dead-loop across the entire SJC Ceph cluster.

We've removed the config applied by that component and locked it.
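
"Hash Layer3+4" here refers to the Linux bonding driver's xmit_hash_policy parameter, which picks a bond member by hashing IP addresses and TCP/UDP ports. A minimal sketch of where such a setting typically lives is below; the file path, bonding mode, and values are illustrative assumptions, not the provider's actual configuration.

```
# /etc/modprobe.d/bonding.conf (hypothetical example)
# xmit_hash_policy=layer3+4 distributes traffic by hashing L3/L4 headers; per
# the notice above, this does not work over their InfiniBand fabric, so the
# setting applied by the component was removed and locked.
options bonding mode=802.3ad miimon=100 xmit_hash_policy=layer3+4
```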

01:14 PM

We are experiencing extremely high load in SJC; the new hardware is on the way.

The new NVMe block storage hardware will be installed tomorrow.

08:08 PM

We are working on restoring the Ceph OSDs; the problem has been found, but recovery will still take more time.

09:11 PM

We are still working on it; we suggest not rebooting if your system is still able to run, since I/O is currently suspended.

At this point I went to the console and took a screenshot:

[Screenshot: 2023-03-09-console]

09:41 PM

Remote hands are on the way to the SJC location to install the hardware needed for the repair.

The SLA resolution will be posted after the repair is done.

10:02 PM

[Screenshot: 2023-03-09-uptime]

Summary: disks had failed, service was down, and they were still working on it.

March 10, 2023

12:43 AM

OSD recovery, backfill in progress.

02:05 AM

Step 1 still needs ~4 hrs; 70% of VMs will return to normal.

Step 2 will take another ~4 hrs; 99% of VMs will return to normal.

Step 3 needs a whole day; it only affects I/O performance, not uptime.

The SLA is lower than what the TOS offers. Reimbursement will be issued case by case; please submit a ticket after the event ends.

We are deeply sorry for the recent SLA drop that may have caused inconvenience to your business operations. We understand the importance of our services to your business and we take full responsibility for this interruption.

The fault report will be posted after the event.

07:25 AM

Ceph does not allow running after only a partial recovery; step 2 is in progress.

01:14 PM

Step 2 complete;

Because one OSD failed to recover and the data had diverged over time, 13/512 (2.5390625%) of the data could not be recovered.

Once again, we apologize for any inconvenience or concern that this may have caused. We value your trust and we will continue to work hard to earn and maintain it.

05:19 PM

[The ISP posted its compensation plan]

06:20 PM

The initial Summary:

~March 1

On or about March 1, [ISP Name] San Jose received a large number of VM orders (almost double the number of VMs at the time).

~March 3

[ISP Name] had noticed the tight resources and immediately stopped accepting new orders.

Memory resources were released to the two new nodes that were purchased last month.

Available storage resources were already below 30% at that time.

~March 6

On March 6, we increased the OSD set-full-ratio from 90% to 95% in order to prevent I/O outages.

But this was still not enough to solve the problem, and we had already ordered enough P5510/P5520 7.68 TB drives on March 3.

FedEx was expected to deliver on March 7, and we were scheduled to install these SSDs on March 8.

Due to the California weather, the delivery was delayed to March 9, and we planned to install the SSDs immediately on March 10 to relieve the pressure.
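
The full-ratio change described under "~March 6" corresponds to a single monitor-level command in recent Ceph releases. The line below illustrates that kind of adjustment; it is not the provider's actual invocation.

```
# Raise the cluster "full" threshold from the default 0.90 to 0.95.
# This only buys headroom; it adds no capacity, and OSDs can still fill up.
ceph osd set-full-ratio 0.95
```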

~March 8

On the night of March 8, we completed network maintenance, which caused the OSDs to restart.

Also, because the OSDs were over capacity, BlueStore did not have enough space to allocate its 4% log during startup, so those OSDs refused to start. At this point the impact was still limited to reduced I/O performance.

~March 9

Due to continued writes, on the morning of March 9 another OSD failed and triggered backfill, which set off a chain reaction in which a third OSD was written full and then failed to start. This eventually led to the current condition.

We immediately arranged the on-site installation on March 9, but some PGs were still lost.

=== Tech Notes

  • San Jose uses [ISP Name]'s latest tech stack. We did not know that BlueStore uses 4% of the total OSD as a log; we thought it was included in the data space.
    Once the data uses up all the space, the log cannot be allocated during initialization, which leads to failure.
  • San Jose had never seen such a high VM growth rate before; the doubled order volume gave us limited time to upgrade.

=== Management Notes

  • [ISP Name] will prepare to upgrade a location once resource usage exceeds 60%.
  • [ISP Name] will reject new orders if we cannot immediately keep resource usage below 80%.
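
To put the 4% figure from the Tech Notes into perspective: if it applies to the whole device, then on a 7.68 TB OSD (the P5510/P5520 drives mentioned above) BlueStore would need roughly 0.04 × 7.68 TB ≈ 307 GB of free space for its log/metadata just to start, which is easy to lose once data is allowed to fill the disk.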

Summary: the data center lost roughly 2.54% of its data; the provider offered a compensation plan and was very apologetic.

At this point the console showed that Linux was missing system files and could not boot properly:

[Screenshot: 2023-03-10-console01]

Repair Attempt

March 10, 2023

After an initial investigation by me and a friend, we found that the main partition was not damaged, but critical system files were missing, so the system could not boot. The disk had to be mounted in a fresh system environment (commonly known as rescue mode) before any repair could be attempted, but since the ISP's panel did not offer this feature, I contacted support.

[Screenshot: 2023-03-10-console02]
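
For reference, once a damaged disk is attached to a working system (the "rescue mode" mentioned above), the usual sequence is to mount it and optionally chroot into it. A rough sketch follows; the device name and mount point are assumptions and will differ on a real system.

```
# Assuming the damaged root filesystem appears as /dev/vdb1 in the rescue system.
mount /dev/vdb1 /mnt
# Bind the pseudo-filesystems so tools inside the chroot behave normally.
mount --bind /dev  /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys  /mnt/sys
chroot /mnt /bin/bash    # inspect logs, reinstall packages, or just copy data out
```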

March 12, 2023

Two days later, a technician notified me that the main partition had been mounted in a new system and provided login credentials. On inspection, the original system libraries turned out to be severely damaged; repairing them would have been extremely costly, so I gave up on the repair. I then checked the integrity of the remaining data, found that a large number of files had been lost in a scattered, uneven pattern, and copied out whatever data was still accessible.

[Screenshot: 2023-03-12-ssh]

Redeployment

March 10, 2023

After confirming that the system was damaged, I immediately bought a new server from another provider and began rebuilding services from the latest backup. Because neither my friend nor I had ever been through anything like this, the automated scripts had only been backing up core files before the crash, so the rebuild was quite tedious.

[Screenshot: od-backup-list]

March 11, 2023

All services were restored.

March 12, 2023

Rewrote all the backup scripts to make sure every workspace file is backed up on schedule.
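
A minimal sketch of what such a scheduled full-workspace backup might look like in Python is below; the paths, archive naming, and retention count are placeholders for illustration, not my actual configuration.

```python
#!/usr/bin/env python3
"""Archive every workspace directory into a dated tarball and prune old copies."""
import tarfile
import time
from pathlib import Path

# Placeholder locations; adjust to the real workspace and backup destinations.
WORKSPACES = [Path("/srv/www"), Path("/home/user/projects")]
BACKUP_DIR = Path("/backup")
KEEP = 14  # number of daily archives to keep

def backup() -> None:
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y-%m-%d")
    archive = BACKUP_DIR / f"workspace-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for ws in WORKSPACES:
            if ws.exists():
                # arcname keeps the archive layout readable (one top-level dir per workspace)
                tar.add(ws, arcname=ws.name)
    # Prune the oldest archives beyond the retention window.
    archives = sorted(BACKUP_DIR.glob("workspace-*.tar.gz"))
    for old in archives[:-KEEP]:
        old.unlink()

if __name__ == "__main__":
    backup()
```

It is meant to be run from cron or a systemd timer so the backups actually happen on schedule.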

Requesting a Refund

March 14, 2023

I had casually renewed for an extra month earlier, but after the incident I no longer wanted to use their service, so once the data backup was complete I asked support for a refund. They agreed to a full refund and shut down the server.

Summary

Lessons:

  • Always have backups, and back up everything as completely as possible.
  • Not every provider has the capability or budget for disaster recovery; once servers in a data center fail, data loss is a very real possibility.

Losses:

  • Some relatively unimportant Python scripts
  • Nearly three days of spring break