If you have read about affordance, you will remember that a solution should be self-documenting – that it should afford a solution.
Bad puns about affording luxury vacations aside, some hasty developers might cite this concept to frame their meticulous, perhaps even exemplary, code as self-evident, even to those of us whom they otherwise consider incapable.
The ethics and economics of such a position notwithstanding, there's much to be gained from understanding the use cases of someone visiting your project's codebase.
The documentation on the Gitlab's
Outline for a Process
- Overview (ie. Failover)
- Label Taxonomy
- Issue Triage Queries
- Decision matrix
Issue Templates as documentation.
- process.md
- preflight.md
- tests.md
- runbooks.md
📖 doc 📁 folder 📘
- 🐺 Coordinator
- 🔪 Chef-Runner
- ☎️ Comms-Handler
- 🐘 Database-Wrangler
- ☁️ Cloud-conductor
- 🏆 Quality-Manager
- ↩️ Fail-back Handler
- 🎩 Head-Honcho
GitLab GCP Migration Project
Why are we doing this?
We see a number of advantages for moving from Azure to Google Cloud Platform (GCP):
Reliability and performance
Google Kubernetes Engine
GitLab 10.1 introduced built-in support for Google Kubernetes Engine. We expect GKE usage to grow significantly, and it makes sense to bring GitLab.com closer to GCP.
Goals of the GCP Migration Project
In order of descending priority. Most important goals at the top.
- Use the opportunity of an inter-cloud migration to make GitLab.com suitable for mission critical client workloads
- Migrate GitLab.com from the Microsoft Azure Cloud platform to the Google Cloud while keeping downtime to a minimum
- Use the same Helm charts for GitLab.com as our EEP customers use
  - The goal here is for customers to be able to spin up a 10-person GitLab EEP instance in Kubernetes and scale it up to 100k users (or more) with little effort.
- Use the migration as a marketing opportunity for GitLab Inc through creation of technical content
More details are available in the GCP Migration Project Doc.
The GCP Migration project relies heavily on GitLab's Geo feature to maintain a secondary GitLab instance in Google Cloud Platform (GCP).
The process of promoting the secondary instance in GCP to the primary and switching DNS over to point to the new Primary in GCP is called Planned Failover.
The failover procedure is documented as issue templates:
|Document||Description||Instances per Failover|
||The primary failover tracker.||One|
||The pre-flight checklist.||One or two|
||The quality assurance test document.||One|
||The runbooks to resolve issues.||N/A|
Staging failovers, or rehearsals, will alternate between the lead and the backups. The production failover will be run by the lead, unless they are unable to attend for some reason.
|🐺 Coordinator||The conductor of the event. Additionally responsible for replication and verification of all email@example.com||@toon, @digitalmoksha||admin & rails|
|🔪 Chef-Runner||Snapshot staging machines, changes||@ahmadsherif||@eReGeBe||ssh & chef|
|☎️ Comms-Handler||External comms||@dawsmith|
|🐘 Database-Wrangler||Complete the migration||@ibaum||@jarv||ssh & chef|
|☁️ Cloud-conductor||Changes settings in GCP and Azure consoles. Handles DNS changes||@ahmadsherif||@eReGeBe||azure & gcp console|
|🏆 Quality-Manager||Owns the during- and post- failover quality assurance||@meks||@rymai||admin|
|↩️ Fail-back Handler (Staging Only)||Fail-back, discarding changes to GCP||@ahmadsherif||@eReGeBe||azure & gcp|
|🎩 Head-Honcho (Production Only)||Executive-level decision maker||@edjdev||@sytses|
The GCP Migration goals are stated above. However, the failover is complex and technical issues may arise. In order to make decisions quickly, these are the priorities for the failover, in order of descending priority:
- Protect the integrity of data
- Ensure that all critical features are functioning correctly
  - For a list of what's considered "critical", review the "during blackout" features in the QA Plan
- Migrate GitLab.com from Azure to Google Cloud Platform
- Ensure that all features are functioning correctly
- Do not exceed the time limits of the announced blackout window
Workflow (🗺️ Board)
|Planning||Issue not ready for assignment or execution||~"Planning"|
|Ready||Issue is ready for execution, awaiting assignment||~"Ready"|
|Blocked||Issue is blocked. When you are blocked please signal by assigning this label and clearly indicating the blocker.||~"blocked"|
|In Progress||Issue is being actively worked on||~"In Progress"|
Sequencing (🗺️ Board)
Most issues can be broadly broken down into pre-migration or post-migration tasks, depending on whether they need to be undertaken before the failover event, or after.
|Premigration||~"Premigration"||Premigration Workflow Board|
|Postmigration||~"Postmigration"||Postmigration Workflow Board|
Workstreams (🗺️ Board)
Issues are categorized into several streams of work.
|Failover Testing||~"Workstream: Failover Testing"|
|Logging and Monitoring||~"Workstream: Logging and Monitoring"|
|Post Failover||~"Workstream: Post Failover"|
Teams (🗺️ Board)
Each team involved in the effort has a label associated with the issues they are responsible for.
|Production||~"Team:Production"||Production Team Workflow Board|
|Geo||~"Team:Geo"||Geo Team Workflow Board|
|Security||~"Team:Security"||Security Team Workflow Board|
|Quality||~"Team:Quality"||Quality Team Workflow Board|
Issue Triage Queries
- Issues without Labels - check for untriaged issues
- In Progress, No Milestone - Ready, but unscheduled
- In Progress, No Assignee - check for issues that are ~"In Progress" without an assignee
- In Progress Issues - check for issues that have been ~"In Progress" for too long
- Ready Issues without Weight - issues that are ~Ready, but have not been assigned a weight
- Ready Issues with a Started Milestone - upcoming scheduled work
- Issues Awaiting More Information - issues that appear to have stalled and are awaiting more information from the assignee or another team member
- Deadlocked Issues - issues that are not making progress towards resolution
- Failover Originated - issues that were raised through the failover rehearsal
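A few of the triage queries above can be expressed as simple filters, which may help when reasoning about what each saved search is meant to catch. The following is only a sketch: the `Issue` class and its fields (`labels`, `assignee`, `milestone`, `weight`) are hypothetical stand-ins for whatever the issue tracker actually returns, not the project's real tooling.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Issue:
    """Hypothetical, minimal model of a tracked issue (an assumption)."""
    title: str
    labels: set = field(default_factory=set)
    assignee: Optional[str] = None
    milestone: Optional[str] = None
    weight: Optional[int] = None


def without_labels(issues):
    """Issues without Labels: candidates for triage."""
    return [i for i in issues if not i.labels]


def in_progress_no_assignee(issues):
    """In Progress, No Assignee: work nobody has claimed."""
    return [i for i in issues if "In Progress" in i.labels and i.assignee is None]


def ready_without_weight(issues):
    """Ready Issues without Weight: ready, but not yet assigned a weight."""
    return [i for i in issues if "Ready" in i.labels and i.weight is None]
```

Each function mirrors one bullet in the list, so a new triage query would be one more small predicate rather than a change to the existing ones.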
Eisenhower Decision Matrix Triage
- Do - Do it now. Issues that are ~"Importance:High" and ~"Urgency:High"
- Decide - Schedule a time to do it. Issues that are ~"Importance:High" and ~"Urgency:Low"
- Delegate - Who can do it for you? Issues that are ~"Importance:Low" and ~"Urgency:High"
- Delete - Eliminate it. Issues that are ~"Importance:Low" and ~"Urgency:Low"
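The four quadrants map directly onto label pairs, so the matrix can be sketched as a lookup table. This is a minimal illustration, assuming each issue carries at most one `Importance:` label and one `Urgency:` label; the function names are hypothetical and an issue missing either label is treated as not yet classified.

```python
# Quadrant names keyed by (importance, urgency), as described in the list above.
QUADRANTS = {
    ("High", "High"): "Do",
    ("High", "Low"): "Decide",
    ("Low", "High"): "Delegate",
    ("Low", "Low"): "Delete",
}


def _level(labels, prefix):
    """Return the value after 'prefix:' (e.g. 'High'), or None if absent."""
    for label in labels:
        if label.startswith(prefix + ":"):
            return label.split(":", 1)[1]
    return None


def eisenhower_quadrant(labels):
    """Map an issue's labels to a quadrant, or None if it is not classified."""
    key = (_level(labels, "Importance"), _level(labels, "Urgency"))
    return QUADRANTS.get(key)
```

Returning `None` for unclassified issues, rather than defaulting to "Delete", keeps an unlabelled issue from being silently discarded.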
- Cloud Native GitLab Helm Charts: https://gitlab.com/charts/helm.gitlab.io
- Automate the lifecycle of environments for GitLab.com: https://gitlab.com/gitlab-com/environments
- GitLab.com Infrastructure: https://gitlab.com/gitlab-com/infrastructure
- GitLab CE: https://gitlab.com/gitlab-org/gitlab-ce
Preparing for a Failover Run
Before a failover, the coordinator needs to log in to the deploy host:
deploy-01-sv-gprd.c.gitlab-production.internal for production
deploy-01-sv-gstg.c.gitlab-staging-1.internal for staging
Then carry out the following steps:
test -f /opt/gitlab-migration/migration/bin/source_vars || sudo cp /opt/gitlab-migration/migration/bin/source_vars_template.sh /opt/gitlab-migration/migration/bin/source_vars
sudo vi /opt/gitlab-migration/migration/bin/source_vars: The variables are explained in the file. Since this contains secrets, this file should not be checked in. (it's
/opt/gitlab-migration/migration/bin/verify-failover-config: You should receive a message indicating success
- Set up the workflow issues: Run /opt/gitlab-migration/migration/bin/start-failover-procedure.sh. This will set up several issues in the issue tracker for performing the checks, failover, tests, etc.
- Any variables in the template in the format __VARIABLE__ will be substituted with their values from the bin/source_vars file, saving manual effort.
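To make the `__VARIABLE__` substitution concrete, here is a minimal sketch of how such placeholders could be filled in. It is not the project's actual script: the `render` function, the variable names, and the values are all hypothetical, and the variables are passed in as a plain dictionary rather than read from `bin/source_vars`, whose format is not shown here.

```python
import re

# Matches placeholders such as __ENVIRONMENT__ or __FAILOVER_DATE__:
# upper-case words joined by single underscores, wrapped in double underscores.
PLACEHOLDER = re.compile(r"__([A-Z0-9]+(?:_[A-Z0-9]+)*)__")


def render(template, variables):
    """Replace each __NAME__ with variables['NAME'].

    Unknown placeholders are left untouched so they stay visible
    in the rendered issue instead of disappearing silently.
    """
    def replace(match):
        return variables.get(match.group(1), match.group(0))

    return PLACEHOLDER.sub(replace, template)


# Hypothetical template text and values, for illustration only.
template = "Failover of __ENVIRONMENT__ on __FAILOVER_DATE__ (see __UNKNOWN__)"
values = {"ENVIRONMENT": "staging", "FAILOVER_DATE": "2000-01-01"}
rendered = render(template, values)
```

Leaving unknown placeholders in place is a design choice: a leftover `__UNKNOWN__` in a generated issue is easy to spot, whereas an empty string would hide a missing variable.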