
Safe GitLab upgrades with required stops

A GitLab Omnibus major upgrade stays easier to reason about when each required stop gets its own rollback point, validation pass, and background-migration soak.

Context

Upgrading a self-managed GitLab Omnibus node across major versions is rarely risky because of one dramatic step.

The real risk comes from stacking too many moving parts at once:

  • package changes
  • PostgreSQL compatibility
  • service restarts
  • background migrations
  • weak rollback boundaries

For a single-node GitLab with embedded PostgreSQL, the safer question is not “how fast can the instance reach the latest version?” It is “how small can each failure domain be while still making forward progress?”

Decision / Insight

Use GitLab’s required upgrade stops as hard operational boundaries, and treat each stop as its own mini-migration.

The important pattern is not only the version sequence. It is the discipline around each stop:

  1. create a fresh rollback point,
  2. validate the current state before changing anything,
  3. upgrade exactly to the next required stop,
  4. wait for the instance to stabilize,
  5. let batched background migrations drain before moving again.

That shape reduces ambiguity during the upgrade and keeps rollback decisions simple.
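
Step 3 in practice means a pinned package install, never a bare upgrade to whatever is latest. A minimal sketch for one stop on a Debian-based host, using this run's second stop as the example version:

  # Install exactly the next required stop, nothing newer.
  sudo apt-get update
  sudo apt-get install -y gitlab-ee=18.2.8-ee.0

  # Optional guard: hold the package so a routine apt run cannot skip a stop.
  sudo apt-mark hold gitlab-ee

Holding the package between stops is an extra safeguard, not something the upgrade itself requires.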

Breakdown

The pattern that proved stable in practice was:

  • 17.11.3-ee.0 -> 17.11.7-ee.0
  • 17.11.7-ee.0 -> 18.2.8-ee.0
  • 18.2.8-ee.0 -> 18.5.5-ee.0
  • 18.5.5-ee.0 -> 18.8.7-ee.0
  • 18.8.7-ee.0 -> 18.9.3-ee.0

Each stop used the same control loop:

  • fresh EBS snapshot of the GitLab root volume
  • fresh logical backup
  • fresh copies of gitlab.rb and gitlab-secrets.json
  • runner pause during the package window
  • post-upgrade validation before resuming normal traffic
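
A condensed sketch of that preparation for one stop, assuming an EC2-backed host; the volume ID and staging directory are placeholders:

  # 1. Block-level rollback point for the GitLab root volume.
  aws ec2 create-snapshot \
    --volume-id vol-0123456789abcdef0 \
    --description "pre-upgrade 17.11.7 -> 18.2.8"

  # 2. Logical backup (repositories, database, uploads).
  sudo gitlab-backup create

  # 3. Config and secrets, which the backup tarball does not include.
  sudo mkdir -p /root/pre-18.2.8
  sudo cp /etc/gitlab/gitlab.rb /etc/gitlab/gitlab-secrets.json /root/pre-18.2.8/

  # 4. Pause CI for the package window.
  sudo gitlab-runner stop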

Three operational details mattered more than expected.

Background migrations are a real gate

Do not treat a stop as complete just because apt finished and the UI loaded.

GitLab may still be draining batched background migrations created by that stop. Moving on before the pending count reaches zero carries uncertainty into the next package window and makes troubleshooting harder.
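
One way to make that gate mechanical is to poll the table GitLab uses to track batched migrations, with the same query the preflight list below relies on:

  # Block until no batched background migrations remain in flight.
  while :; do
    pending=$(sudo gitlab-psql -Atc \
      "SELECT count(*) FROM batched_background_migrations WHERE status NOT IN (3, 6);")
    [ "$pending" = "0" ] && break
    echo "still draining: $pending batched migrations pending"
    sleep 60
  done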

A transient 502 is not an automatic rollback trigger

During several stops, GitLab returned 502 Bad Gateway while Puma was still preloading and gitlab.socket had not been recreated yet.

That was normal startup behavior. The right check was:

  • is Puma still progressing?
  • has the Unix socket appeared?
  • has 127.0.0.1:8080 opened?

Rollback should wait for evidence of a real crash loop or a failed boot path, not just the presence of a temporary 502.
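
A small readiness probe, run before any rollback decision, might look like this sketch; the socket path is the Omnibus default, and port 8080 is the default Puma listener:

  # Distinguish "Puma still booting" from "Puma actually failed".
  sudo gitlab-ctl status puma    # a crash loop shows rapidly resetting uptimes

  test -S /var/opt/gitlab/gitlab-rails/sockets/gitlab.socket \
    && echo "socket recreated" || echo "socket not back yet"

  curl -s -o /dev/null -w 'puma http: %{http_code}\n' http://127.0.0.1:8080/-/health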

Swap is an airbag, not a performance strategy

Adding a small swap file helped protect the node during package install, reconfigure, and migration work.

That does not mean swap is the desired steady state. It means the host had a safety buffer while memory pressure temporarily increased.
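
A minimal version of that airbag, with the 2 GiB size as an arbitrary placeholder:

  # Temporary swap file for headroom during install, reconfigure, and migrations.
  sudo fallocate -l 2G /swapfile
  sudo chmod 600 /swapfile
  sudo mkswap /swapfile
  sudo swapon /swapfile

  # After the upgrade, if swap should not persist:
  #   sudo swapoff /swapfile && sudo rm /swapfile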

Implementation

For this upgrade shape, the most useful preflight checks were:

  • gitlab-ctl status
  • gitlab-rake gitlab:check SANITIZE=true
  • gitlab-rake gitlab:doctor:secrets
  • gitlab-psql -d postgres -Atc 'show server_version;'
  • curl -skI https://127.0.0.1/-/health
  • curl -skI https://<gitlab-host>/users/sign_in
  • gitlab-psql -Atc "SELECT count(*) FROM batched_background_migrations WHERE status NOT IN (3, 6);"
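
Chained into a single gate, those checks might look like this sketch; <gitlab-host> stays a placeholder, and any failure blocks the stop:

  #!/usr/bin/env bash
  set -euo pipefail

  sudo gitlab-ctl status
  sudo gitlab-rake gitlab:check SANITIZE=true
  sudo gitlab-rake gitlab:doctor:secrets
  sudo gitlab-psql -d postgres -Atc 'show server_version;'
  curl -skI --fail https://127.0.0.1/-/health
  curl -skI --fail https://<gitlab-host>/users/sign_in

  pending=$(sudo gitlab-psql -Atc \
    "SELECT count(*) FROM batched_background_migrations WHERE status NOT IN (3, 6);")
  [ "$pending" = "0" ] || { echo "batched migrations still pending: $pending"; exit 1; }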

The most useful post-upgrade checks were:

  • package version matches the intended stop exactly
  • core services stay in the runit run state
  • health and sign-in return 200
  • runner verifies successfully again
  • a safe Git read such as git ls-remote works
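
Scripted, the post-upgrade pass has the same shape; the target version, group, and repo below are placeholders:

  # Package version must match the intended stop exactly.
  dpkg-query -W -f='${Version}\n' gitlab-ee    # e.g. 18.2.8-ee.0

  # Core services should sit in the runit "run" state.
  sudo gitlab-ctl status

  # Health and sign-in should both return 200.
  curl -sk -o /dev/null -w 'health:  %{http_code}\n' https://127.0.0.1/-/health
  curl -sk -o /dev/null -w 'sign_in: %{http_code}\n' https://<gitlab-host>/users/sign_in

  # Runner connectivity and a safe Git read.
  sudo gitlab-runner verify
  git ls-remote https://<gitlab-host>/<group>/<repo>.git HEAD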

The final validated state for this run was:

  • GitLab 18.9.3-ee.0
  • PostgreSQL 16.11
  • runner 18.9.0

Reusable Takeaway

For GitLab Omnibus on a single node, a low-risk major upgrade pattern is:

  1. follow required stops exactly,
  2. create a new rollback point for every stop,
  3. treat background migrations as a hard gate,
  4. expect a temporary 502 while Puma recreates the Rails socket,
  5. resume the runner only after the application is actually healthy.

The non-obvious win is not only safer rollback. It is better operational clarity.

When each stop has its own evidence, rollback asset, and stabilization window, the upgrade stops feeling like one long gamble and starts behaving like a sequence of bounded maintenance tasks.