Software/Scripts GitHub Availability Report: September 2023

Git

In September, we experienced two incidents that resulted in degraded performance across GitHub services.

September 5 16:24 UTC (lasting 19 minutes)

On September 5, from 16:24-16:43 UTC, multiple GitHub services were down or degraded due to an outage in one of our primary databases. The primary host for a shared datastore for GitHub experienced an underlying file system write error, which affected availability for the majority of public-facing GitHub services. SAML login was affected, as was access to GitHub Actions, GitHub Issues, pull requests, GitHub Pages, GitHub API, Webhooks, GitHub Codespaces, and GitHub Packages.

The primary database suffered a partial host failure when the disk storage for the operating system became unreachable. In this case, our automatic failover was unable to detect the partial file system failure mode. We mitigated by manually failing over to a healthy host, initiated 17 minutes after our first alert and completed 2 minutes later.

With the incident mitigated, we have worked to assess more detailed impact and resilience improvements to each affected service to reduce the scope of any future incident with this shared dependency. Some of those are complete and the rest will be completed within our standard repair item SLAs. To increase the resiliency of our system, we have improved our automation that will detect and initiate a failover for this type of partial host failure. Additionally, we have identified a source of resource contention that is consistent with this type of failure and patched a fix to reduce the likelihood of recurrence.
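The report does not describe how the improved automation detects a partial host failure, so the following is only a minimal sketch of one common approach: a health check that actually writes to disk instead of only checking that the host responds. The function names `disk_is_writable` and `should_fail_over`, and the three-probe threshold, are hypothetical and not taken from GitHub's tooling.

```python
import os
import tempfile


def disk_is_writable(directory: str) -> bool:
    """Return True if a small file can be written, synced, and read back."""
    try:
        fd, path = tempfile.mkstemp(dir=directory, prefix="probe-")
    except OSError:
        # Could not even create the file: treat as a failed probe.
        return False
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"ok")
            f.flush()
            os.fsync(f.fileno())  # force the write down to the device
        with open(path, "rb") as f:
            return f.read() == b"ok"
    except OSError:
        return False
    finally:
        try:
            os.remove(path)
        except OSError:
            pass


def should_fail_over(probe_results: list[bool], threshold: int = 3) -> bool:
    """Hypothetical policy: fail over once the last `threshold` probes all failed."""
    recent = probe_results[-threshold:]
    return len(recent) == threshold and not any(recent)
```

A monitoring loop could call `disk_is_writable` on the data and operating system volumes every few seconds and pass the history to `should_fail_over`. Requiring several consecutive failures is a design choice to avoid failing over on a single transient error.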

September 19 20:36 UTC (lasting 7 hours 30 minutes)

On September 19 at 20:36 UTC, while migrating the primary datastore for GitHub Projects, an incident occurred that disrupted 95% of GitHub Projects data availability for 3.5 hours. A misconfigured index constraint on the primary GitHub Projects database table caused GitHub Projects to become fully unavailable between 20:36 UTC and 00:06 UTC. By 00:06, we restored GitHub Projects data to its state from the beginning of the incident. New project data created by users while the incident was being mitigated was fully recovered and available to users by 04:28 UTC.

In addition, a database replication interruption caused by our remediation steps created limited availability for some Git Operations, APIs, and GitHub Issues for 1.25 hours from 21:48 UTC to 23:00 UTC.

To prevent similar incidents in the future, we have improved validation of data migrations in testing and during rollout. We have evaluated and are making improvements to the constraints for any data migration to prevent the unexpected behavior that led to this data loss. To reduce the time to mitigate similar incidents, we are also in the process of rolling out improvements to reduce both the time to restore data and fix replication issues.
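The report gives no detail on the schema or on how migrations are validated, so this is only an illustrative sketch of one kind of check: run the migration inside a transaction and roll back if the table loses rows or a constraint fails. It uses Python's built-in `sqlite3` module. The table `project_items`, the function `migrate_with_row_check`, and the sample statements are all hypothetical.

```python
import sqlite3


def migrate_with_row_check(conn, table, statements):
    """Apply `statements` in one transaction; roll back if `table` loses rows.

    Assumes `conn` was opened with isolation_level=None so that
    BEGIN/COMMIT/ROLLBACK are controlled explicitly here.
    `table` must be a trusted identifier, since it is interpolated into SQL.
    """
    count_sql = f"SELECT COUNT(*) FROM {table}"
    before = conn.execute(count_sql).fetchone()[0]
    conn.execute("BEGIN")
    try:
        for stmt in statements:
            conn.execute(stmt)
        after = conn.execute(count_sql).fetchone()[0]
        if after != before:
            raise RuntimeError(f"row count changed: {before} -> {after}")
        conn.execute("COMMIT")
        return True
    except (sqlite3.Error, RuntimeError):
        # Undo everything, including index changes, if anything went wrong.
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        return False


def make_demo_db():
    """Build a small in-memory table with a duplicate project_id (demo data)."""
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE project_items (id INTEGER PRIMARY KEY, project_id INTEGER, title TEXT)"
    )
    conn.executemany(
        "INSERT INTO project_items (project_id, title) VALUES (?, ?)",
        [(1, "a"), (1, "b"), (2, "c")],
    )
    return conn
```

For example, a migration that adds a unique index on `project_id` to the demo table would be rejected, because the existing duplicate rows violate the new constraint, and the table would be left unchanged.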



Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

 

AI G

Thank you for sharing the GitHub Availability Report for September 2023. It provides important information about two incidents that occurred during that month and the measures taken to mitigate the impact and prevent similar incidents in the future.

The first incident, on September 5, lasted 19 minutes and was caused by an outage in one of GitHub's primary databases. A file system write error affected the availability of various GitHub services, including SAML login, GitHub Actions, GitHub Issues, pull requests, GitHub Pages, GitHub API, Webhooks, GitHub Codespaces, and GitHub Packages. The automatic failover mechanism was unable to detect the partial file system failure, so manual intervention was required to fail over to a healthy host. After mitigating the incident, GitHub improved its failover automation and patched a source of resource contention to reduce the likelihood of similar incidents.

The second incident occurred on September 19 and lasted for 7 hours and 30 minutes. It was related to the migration of the primary datastore for GitHub Projects. A misconfigured index constraint caused GitHub Projects to become fully unavailable for more than three hours. The data was eventually restored to its previous state, and new project data created during the incident was recovered and made available to users. Additionally, a database replication interruption occurred during the remediation steps, resulting in limited availability for some Git Operations, APIs, and GitHub Issues for 1.25 hours. In response to this incident, GitHub has improved its data migration validation, data constraints, and replication issue mitigation to prevent similar incidents in the future and reduce the time to restore data.

It is reassuring to see that GitHub is actively working on improving the resiliency and availability of its services. By implementing automation, refining failover mechanisms, and enhancing data migration processes, GitHub aims to minimize the impact of any future incidents. Users can stay updated on GitHub's service status through the GitHub status page and learn more about their ongoing efforts through the GitHub Engineering Blog.

Thank you for sharing the GitHub Availability Report, and it's good to see GitHub's commitment to transparency in providing this information.
 