If you have read about affordance, you will remember that a solution should be self-documenting – that it should afford a solution.

Bad puns about affording luxury vacations aside, some hasty developers might cite this concept to frame their meticulous, perhaps even exemplary, code as self-evident, even to those of us whom they otherwise consider incapable.

The ethics and economics of such a position notwithstanding, there's much to be gained from understanding the use-cases of someone visiting your project's codebase.

The documentation on the Gitlab's

Outline for a Process

  • Why
  • Goals
  • Overview (i.e. Failover)
    • Documentation
    • Roles
    • Priorities
  • Process
    • Label Taxonomy
      • Workflow
      • Sequencing
      • Workstreams
      • Teams
  • Issue Triage Queries
    • Decision matrix
  • Related
  • Prep

Issue Templates as documentation:

  • process.md
  • preflight.md
  • tests.md
  • runbooks.md

📖 doc 📁 folder 📘

🐺 Coordinator 🔪 Chef-Runner ☎️ Comms-Handler 🐘 Database-Wrangler ☁️ Cloud-conductor 🏆 Quality-Manager ↩️ Fail-back Handler 🎩 Head-Honcho

🗺️ Board

```
Useful Links | 📖 GCP Migration Project Doc | 📖 GCP Migration Weekly Call | 📁 GCP Project Docs | 📘 Architecture Docs | 📘 Status Reports

GitLab GCP Migration Project

Why are we doing this?

We see a number of advantages for moving from Azure to Google Cloud Platform (GCP):

  1. Reliability and performance

    1. GCP offers a low-latency 10Gbps interconnect across the board.
    2. GCP offers a global Anycast network as part of their load balancing service.
    3. GCP also has a track record of exceeding their uptime SLAs for compute VMs.
  2. Google Kubernetes Engine

    GitLab 10.1 introduced built-in support for Google Kubernetes Engine. We expect GKE usage to grow significantly, and it makes sense to bring GitLab.com closer to GCP.

  3. Pricing

    Google offers sustained use discounts and per second billing, which has saved us a significant amount with shared runners on GitLab.com.

Related articles:

  • https://venturebeat.com/2018/04/06/why-and-how-gitlab-abandoned-microsoft-azure-for-google-cloud/

Goal

Goals of the GCP Migration Project

In order of descending priority. Most important goals at the top.

  1. Use the opportunity of an inter-cloud migration to make GitLab.com suitable for mission critical client workloads
  2. Migrate GitLab.com from the Microsoft Azure Cloud platform to the Google Cloud while keeping downtime to a minimum
  3. Use the same helm charts for GitLab.com as our EEP customers use
    • The goal here is for customers to be able to spin up a 10-person GitLab EEP instance in Kubernetes and scale it up to 100k users (or more) with little effort.
  4. Use the migration as a marketing opportunity for GitLab Inc through creation of technical content

More details are available in the GCP Migration Project Doc.

Failover

The GCP Migration project relies heavily on GitLab's Geo feature to maintain a secondary GitLab instance in Google Cloud Platform (GCP).

The process of promoting the secondary instance in GCP to the primary and switching DNS over to point to the new Primary in GCP is called Planned Failover.

Failover Documentation

The failover procedure is documented as issue templates:

| Document | Description | Instances per Failover |
| --- | --- | --- |
| failover.md | The primary failover tracker. | One |
| preflight_checks.md | The pre-flight checklist. | One or two |
| test_plan.md | The quality assurance test document. | One |
| Runbooks | The runbooks to resolve issues. | N/A |

Failover Roles

Staging failovers, or rehearsals, will alternate between the lead and the backups. The production failover will be run by the lead, unless they are unable to attend for some reason.

| Role | Description | Lead | Backup | Access Required |
| --- | --- | --- | --- | --- |
| 🐺 Coordinator | The conductor of the event. Additionally responsible for replication and verification of all data. | @nick.thomas | @toon, @digitalmoksha | admin & rails |
| 🔪 Chef-Runner | Snapshot staging machines, changes gitlab.rb, executes gitlab-ctl command (through chef/knife) | @ahmadsherif | @eReGeBe | ssh & chef |
| ☎️ Comms-Handler | External comms | @dawsmith | | twitter |
| 🐘 Database-Wrangler | Complete the migration | @ibaum | @jarv | ssh & chef |
| ☁️ Cloud-conductor | Changes settings in GCP and Azure consoles. Handles DNS changes | @ahmadsherif | @eReGeBe | azure & gcp console |
| 🏆 Quality-Manager | Owns the during- and post-failover quality assurance | @meks | @rymai | admin |
| ↩️ Fail-back Handler (Staging Only) | Fail-back, discarding changes to GCP | @ahmadsherif | @eReGeBe | azure & gcp |
| 🎩 Head-Honcho (Production Only) | Executive-level decision maker | @edjdev | @sytses | |

Failover Priorities

The GCP Migration goals are stated above. However, the failover is complex and technical issues may arise. In order to make decisions quickly, these are the priorities for the failover, in order of descending priority:

  1. Protect the integrity of data
  2. Ensure that all critical features are functioning correctly
    • For a list of what's considered "critical", review the "during blackout" features in the QA Plan
  3. Migrate GitLab.com from Azure to Google Cloud Platform
  4. Ensure that all features are functioning correctly
  5. Do not exceed the time limits of the announced blackout window

Project Process

Label Taxonomy

Workflow (🗺️ Board)

| Status | Description | Label |
| --- | --- | --- |
| Planning | Issue not ready for assignment or execution | ~"Planning" |
| Ready | Issue is ready for execution, awaiting assignment | ~"Ready" |
| Blocked | Issue is blocked. When you are blocked please signal by assigning this label and clearly indicating the blocker. | ~"blocked" |
| In Progress | Issue is being actively worked on | ~"In Progress" |

Burndown from 15 May 2018.

Sequencing (🗺️ Board)

Most issues can be broadly broken down into pre-migration or post-migration tasks, depending on whether they need to be undertaken before the failover event, or after.

| Sequencing | Label | Board |
| --- | --- | --- |
| Premigration | ~"Premigration" | Premigration Workflow Board |
| Postmigration | ~"Postmigration" | Postmigration Workflow Board |

Burndown from 15 May 2018.

Workstreams (🗺️ Board)

Issues are categorized into several streams of work.

| Workstream | Label |
| --- | --- |
| Failover Testing | ~"Workstream: Failover Testing" |
| Logging and Monitoring | ~"Workstream: Logging and Monitoring" |
| Post Failover | ~"Workstream: Post Failover" |
| Staging | ~"Workstream: Staging" |

Burndown from 15 May 2018.

Teams (🗺️ Board)

Each team involved in the effort has a label associated with the issues they are responsible for.

| Team | Label | Board |
| --- | --- | --- |
| Production | ~"Team:Production" | Production Team Workflow Board |
| Geo | ~"Team:Geo" | Geo Team Workflow Board |
| Security | ~"Team:Security" | Security Team Workflow Board |
| Quality | ~"Team:Quality" | Quality Team Workflow Board |

Burndown from 15 May 2018.

Issue Triage Queries

  1. Issues without Labels - check for untriaged issues
  2. In Progress, No Milestone - Ready, but unscheduled
  3. In Progress, No Assignee - check for issues that are ~"In Progress" without an assignee
  4. In Progress Issues - check for issues that have been ~"In Progress" for too long
  5. Ready Issues without Weight - issues that are ~Ready, but have not been weighed
  6. Ready Issues with a Started Milestone - upcoming scheduled work
  7. Issues Awaiting More Information - issues that appear to have stalled and are awaiting more information from the assignee or another team member
  8. Deadlocked Issues - issues that are not making progress towards resolution
  9. Failover Originated - issues that were raised through the failover rehearsal

Eisenhower Decision Matrix Triage

  1. Do - Do it now. Issues that are ~"Importance:High" and ~"Urgency:High"
  2. Decide - Schedule a time to do it. Issues that are ~"Importance:High" and ~"Urgency:Low"
  3. Delegate - Who can do it for you? Issues that are ~"Importance:Low" and ~"Urgency:High"
  4. Delete - Eliminate it. Issues that are ~"Importance:Low" and ~"Urgency:Low"

Related Projects

  1. Cloud Native GitLab Helm Charts: https://gitlab.com/charts/helm.gitlab.io
  2. Automate the lifecycle of environments for GitLab.com: https://gitlab.com/gitlab-com/environments
  3. GitLab.com Infrastructure: https://gitlab.com/gitlab-com/infrastructure
  4. GitLab CE: https://gitlab.com/gitlab-org/gitlab-ce

Preparing for a Failover Run

Before a failover, the coordinator needs to log in to the deploy host:

  • deploy-01-sv-gprd.c.gitlab-production.internal for production
  • deploy-01-sv-gstg.c.gitlab-staging-1.internal for staging

Then carry out the following steps:

  1. Set up bin/source_vars: test -f /opt/gitlab-migration/migration/bin/source_vars || sudo cp /opt/gitlab-migration/migration/bin/source_vars_template.sh /opt/gitlab-migration/migration/bin/source_vars
  2. Configure bin/source_vars: sudo vi /opt/gitlab-migration/migration/bin/source_vars. The variables are explained in the file. Since it contains secrets, this file should not be checked in (it is .gitignore'd).
  3. Verify the configuration: /opt/gitlab-migration/migration/bin/verify-failover-config. You should receive a message indicating success.
  4. Set up the workflow issues: run /opt/gitlab-migration/migration/bin/start-failover-procedure.sh. This will set up several issues in the issue tracker for performing the checks, failover, tests, etc.
    • Any variables in the template in the format __VARIABLE__ will be substituted with their values from the bin/source_vars file, saving manual effort.
```
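
The last step of the README above mentions that `__VARIABLE__` placeholders in the issue templates are filled in from `bin/source_vars`. To make that idea concrete, here is a minimal Python sketch of how such a substitution could work. It is an illustration only, not the project's actual `start-failover-procedure.sh`: the function names `parse_vars` and `render`, the assumption that the vars file holds plain `KEY=value` lines, and the variable names `ENVIRONMENT` and `COORDINATOR` are all hypothetical.

```python
import re

# Hypothetical sketch: the real script may parse and substitute differently.
# Matches placeholders such as __ENVIRONMENT__ and captures the name inside.
PLACEHOLDER = re.compile(r"__([A-Z][A-Z0-9_]*?)__")


def parse_vars(text):
    """Parse simple KEY=value lines, ignoring blanks and # comments.

    Assumes a shell-style vars file; an optional leading 'export' and
    surrounding quotes around the value are stripped.
    """
    variables = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if sep:
            variables[key.strip()] = value.strip().strip('"').strip("'")
    return variables


def render(template, variables):
    """Replace each __NAME__ with variables['NAME'].

    Unknown placeholders are left untouched so that a missing value is
    visible in the generated issue instead of silently disappearing.
    """
    def substitute(match):
        return variables.get(match.group(1), match.group(0))

    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    # Hypothetical variable names and values, for illustration only.
    source_vars = 'export ENVIRONMENT="staging"\nCOORDINATOR=@example\n'
    template = "Failover rehearsal on __ENVIRONMENT__, run by __COORDINATOR__"
    print(render(template, parse_vars(source_vars)))
```

Leaving unknown placeholders in place is a deliberate choice in this sketch: a leftover `__NAME__` in a generated issue is an obvious cue that the vars file is incomplete.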