
Cluster Tenant Handling

Status: Draft | Last updated: 2025-07-15 | Version: v0.0.3


Table of Contents

  1. Overview
  2. Environment Architecture
  3. Provisioning Model
  4. Migration System
  5. Versioning & Coordination
  6. Lifecycle Management
  7. Platform Integration
  8. Appendix

1. Overview

This document describes a Kubernetes-native, operator-driven workflow for Arematics infrastructure provisioning, supporting multi-tenant and geo-aware deployments. The system is modular, secure, and environment-aware, enabling:

  • Automated provisioning of databases and schemas for every tenant, app, and (optional) country.
  • Application of service-versioned SQL migrations embedded within each service container image.
  • Tracking of schema versions per service and support for controlled upgrades (opt-in).
  • Safe operation across test, QA, and prod environments.

Operators are deployed per-cluster but share the same code base. Image tags (e.g., v1.2.3-test) isolate config differences. The system supports both global and country-scoped apps, robust migration and rollback mechanics, and integrates with the platform backend for provisioning and deletion flows.


2. Environment Architecture

2.1 Environment Layout

Env   | K8s cluster | Postgres instance(s) | Vault namespace
test  | test        | shared dev instance  | vault-test
qa    | qa          | dedicated staging    | vault-qa
prod  | prod        | regional HA pairs    | vault-prod

2.2 Data-Isolation Model

Each tenant receives one main Postgres database (e.g., tenant_xyz). Apps with country scoping receive an additional database per country (e.g., tenant_xyz_de). Within each DB, schemas are created per service and per app, named <service>-<app> (e.g., entity-crm).

┌────────────────────────────────────┐
│ DB: tenant_xyz (organization)      │
│ ├── auth         (base service)    │
│ ├── entity       (base service)    │
│ ├── payments     (base service)    │
│ └── entity-crm   (app, no country) │
└────────────────────────────────────┘

┌─────────────────────────────────────────┐
│ DB: tenant_xyz_de (country-specific DB) │
│ ├── entity-crm   (schema)               │
│ └── payments-crm (schema)               │
└─────────────────────────────────────────┘
  • One Postgres DB per organization (tenant)
  • Extra DB per tenant+country (only when an app is country-scoped)
  • Schemas isolate services and apps

2.3 Country-Scope Transitions

  • Downgrade (country ➜ global): Select origin country → copy data into new service-app schema → keep old country DB read-only for 30 days.
  • Upgrade (global ➜ country): Select target origin → split data; other countries start empty.

Cleanup jobs delete obsolete DBs/schemas after retention expiry.


3. Provisioning Model

3.1 Custom Resource Definitions (CRDs) & Operators

CRD                 | Trigger Type       | Key actions
TenantProvisionJob  | K8s Job            | Create DB tenant_<slug>; create base service schemas; run one-time provisioning tasks
AppProvisionJob     | K8s Job            | Resolve target DB; create service-app schemas; run one-time provisioning tasks
TenantSchemaVersion | Migration Operator | Track currentVersion / desiredVersion; handle upgrades, retries, dry-runs

Provisioning jobs are:

  • Executed once and idempotent (safe to re-run)
  • Destroyed or garbage collected after success
  • Reversible via a dedicated deletion job, not via finalizers
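
As a rough sketch of what such an idempotent job can look like (Go with pgx; the connection URL, role, and naming below are illustrative assumptions, not the actual job implementation): the database is only created if it is missing, and schemas use IF NOT EXISTS, so re-running the job after a partial failure is harmless.

// provision_sketch.go — minimal idempotent tenant provisioning sketch (assumed DSN/names).
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/jackc/pgx/v5"
)

func provisionTenant(ctx context.Context, baseURL, tenantSlug string, services []string) error {
	// Connect to the maintenance DB to create the tenant DB.
	admin, err := pgx.Connect(ctx, baseURL+"/postgres")
	if err != nil {
		return err
	}
	defer admin.Close(ctx)

	dbName := "tenant_" + tenantSlug

	// CREATE DATABASE has no IF NOT EXISTS, so check pg_database first (keeps re-runs idempotent).
	var exists bool
	if err := admin.QueryRow(ctx,
		"SELECT EXISTS (SELECT 1 FROM pg_database WHERE datname = $1)", dbName).Scan(&exists); err != nil {
		return err
	}
	if !exists {
		if _, err := admin.Exec(ctx, fmt.Sprintf("CREATE DATABASE %q", dbName)); err != nil {
			return err
		}
	}

	// Connect to the tenant DB and create base service schemas; IF NOT EXISTS keeps this re-runnable.
	tenant, err := pgx.Connect(ctx, baseURL+"/"+dbName)
	if err != nil {
		return err
	}
	defer tenant.Close(ctx)

	for _, svc := range services {
		if _, err := tenant.Exec(ctx, fmt.Sprintf("CREATE SCHEMA IF NOT EXISTS %q", svc)); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	if err := provisionTenant(context.Background(),
		"postgres://infra@localhost:5432", "xyz",
		[]string{"auth", "entity", "payments"}); err != nil {
		log.Fatal(err)
	}
}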

Reconciler vs Job Model

  • Tenant and App provisioning use Kubernetes Jobs created from CRDs.
  • A central controller (optional) may monitor Job completion and status for observability.
  • The Migration Operator is a long-lived reconciler for version tracking, upgrades, dry-runs, and dependency checks.

3.2 Vault Secret Handling & Schema Registry

All K8s Jobs used for provisioning (e.g., schema creation, backup) are responsible for writing necessary secrets into Vault (per-tenant path). Secrets are created once and reused unless explicitly rotated.
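
A minimal sketch of the write-once behavior, assuming the HashiCorp Vault Go client and a KV v2 mount at secret/ with a per-tenant path layout (the path and field names are assumptions):

// vault_sketch.go — write tenant DB credentials to Vault only if nothing is stored yet.
package main

import (
	"fmt"
	"log"

	vault "github.com/hashicorp/vault/api"
)

func ensureTenantSecret(client *vault.Client, tenant string, creds map[string]interface{}) error {
	path := fmt.Sprintf("secret/data/tenants/%s/postgres", tenant) // assumed KV v2 path layout

	// "Created once and reused unless explicitly rotated": only write when nothing exists yet.
	existing, err := client.Logical().Read(path)
	if err != nil {
		return err
	}
	if existing != nil && existing.Data != nil {
		return nil // secret already present; leave it untouched
	}

	_, err = client.Logical().Write(path, map[string]interface{}{"data": creds})
	return err
}

func main() {
	client, err := vault.NewClient(vault.DefaultConfig()) // picks up VAULT_ADDR / VAULT_TOKEN
	if err != nil {
		log.Fatal(err)
	}
	if err := ensureTenantSecret(client, "tenant_xyz", map[string]interface{}{
		"username": "tenant_xyz_app",
		"password": "generated-at-provision-time", // placeholder
	}); err != nil {
		log.Fatal(err)
	}
}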

Each tenant DB includes a schema_registry table for tracking migration state:

CREATE TABLE schema_registry (
    service              text,
    schema_name          text,
    current_version      text,
    desired_version      text,
    status               text,        -- pending | migrating | complete | failed
    retention_expires_at timestamptz  -- set when a schema enters a retention window (see 6.1)
);

Operators update this table and CRD status fields in lock-step.
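
A minimal sketch of that lock-step update, assuming a Go operator that uses pgx for the registry row and a dynamic client for the CRD status subresource (the CRD group/version below are assumptions):

// status_sketch.go — keep schema_registry and the CRD status in lock-step.
package operator

import (
	"context"
	"fmt"

	"github.com/jackc/pgx/v5"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

var tenantSchemaVersionGVR = schema.GroupVersionResource{
	Group:    "arematics.com", // assumed CRD group
	Version:  "v1alpha1",      // assumed version
	Resource: "tenantschemaversions",
}

func setMigrationStatus(ctx context.Context, db *pgx.Conn, dyn dynamic.Interface,
	namespace, crName, service, schemaName, phase string) error {

	// 1) Update the per-tenant registry table.
	if _, err := db.Exec(ctx,
		`UPDATE schema_registry SET status = $1 WHERE service = $2 AND schema_name = $3`,
		phase, service, schemaName); err != nil {
		return err
	}

	// 2) Patch the CRD status subresource so Kubernetes reflects the same phase.
	patch := []byte(fmt.Sprintf(`{"status":{"phase":%q}}`, phase))
	_, err := dyn.Resource(tenantSchemaVersionGVR).Namespace(namespace).
		Patch(ctx, crName, types.MergePatchType, patch, metav1.PatchOptions{}, "status")
	return err
}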

3.3 Service Discovery & Targeting

  • Operators locate service pods for migration by custom labels:
    arematics.com/service-name: <service-name>
    arematics.com/service-version: <semver>
  • Operators exec into the first Ready pod matching label selectors in authorized namespaces (e.g., arematics-*).
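
A minimal sketch of this lookup with client-go, assuming the labels above and that only Ready pods are acceptable exec targets:

// discovery_sketch.go — locate the first Ready pod for a service/version via the documented labels.
package operator

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func findReadyServicePod(ctx context.Context, cs kubernetes.Interface,
	namespace, service, version string) (*corev1.Pod, error) {

	selector := fmt.Sprintf("arematics.com/service-name=%s,arematics.com/service-version=%s",
		service, version)

	pods, err := cs.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return nil, err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		for _, cond := range pod.Status.Conditions {
			// Exec targets must report Ready=True.
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				return pod, nil
			}
		}
	}
	return nil, fmt.Errorf("no Ready pod for %s@%s in %s", service, version, namespace)
}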

4. Migration System

4.1 Migration Delivery & Execution

  • Each service image contains its own migrations under /migrations/<service>/*.sql.
  • The operator execs into the running service Pod, reads the SQL files, and pipes them to Postgres via pgx (see the sketch after this list).
  • Dry-Run Mode: Set spec.dryRun=true on TenantSchemaVersion to execute migrations inside a transaction and roll back, emitting a diff report.
    • Each SQL migration file must contain a -- ROLLBACK: section describing how to undo the changes. Operator runs each file inside a transaction by default.
  • If backupBeforeMigration: true is set, a snapshot job is triggered before applying the migration:
    • PostgreSQL-level dumps via pg_dump to tenant S3 buckets
    • Optional: Velero for volume snapshots
  • Jobs respect a configured timeout (default: 10m) and retry twice on failure before blocking migration with a warning.
  • Operators emit backup completion logs and Prometheus metrics (tenant_backup_duration_seconds, tenant_backup_failed_total).
  • Cluster-wide concurrency limit (default: 3 parallel migrations), configurable via MigrationControllerConfig CRD or environment variable. Operators queue excess jobs and expose this via migration_queue_length metric.
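
A minimal sketch of the exec-and-apply step referenced above, assuming client-go's remotecommand executor and pgx; the path follows the /migrations/<service>/*.sql convention, while error handling and per-file sequencing are simplified:

// exec_sketch.go — read embedded migration files out of a running service pod and apply them via pgx.
package operator

import (
	"bytes"
	"context"
	"fmt"

	"github.com/jackc/pgx/v5"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// readMigrations cats every SQL file under /migrations/<service>/ inside the target pod.
func readMigrations(ctx context.Context, cfg *rest.Config, cs kubernetes.Interface,
	namespace, podName, service string) (string, error) {

	req := cs.CoreV1().RESTClient().Post().
		Resource("pods").Name(podName).Namespace(namespace).SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Command: []string{"sh", "-c", fmt.Sprintf("cat /migrations/%s/*.sql", service)},
			Stdout:  true,
			Stderr:  true,
		}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(cfg, "POST", req.URL())
	if err != nil {
		return "", err
	}
	var stdout, stderr bytes.Buffer
	// StreamWithContext needs a reasonably recent client-go.
	if err := exec.StreamWithContext(ctx, remotecommand.StreamOptions{Stdout: &stdout, Stderr: &stderr}); err != nil {
		return "", fmt.Errorf("exec failed: %w (stderr: %s)", err, stderr.String())
	}
	return stdout.String(), nil
}

// applyInTransaction pipes the collected SQL to Postgres inside a transaction, as noted above.
func applyInTransaction(ctx context.Context, conn *pgx.Conn, sql string) error {
	tx, err := conn.Begin(ctx)
	if err != nil {
		return err
	}
	defer tx.Rollback(ctx) // returns a "closed" error after Commit, which is safely ignored
	if _, err := tx.Exec(ctx, sql); err != nil {
		return err
	}
	return tx.Commit(ctx)
}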

Migration Execution (Locking, Rollback, Backup, Metrics)

  • The operator holds a PostgreSQL advisory lock (pg_advisory_lock(hashtext('tenant:service'))) per schema for the duration of the migration phase.
  • The operator gracefully skips or retries if the lock is held by another executor (see the sketch after this list).
  • Migration run updates both schema_registry.status in Postgres and status.phase in the relevant CRD.
  • States: pending, migrating, complete, failed, blocked
  • Failed states are surfaced to Prometheus and UI. Retry requires manual intervention.
  • Retrying a failed migration:
    • Locks the schema
    • Re-runs all SQL steps not yet marked as applied (based on file naming)
    • Transitions state back to migrating → complete on success
  • Admin kubectl plugin may list failed schemas, trigger retry, and view logs.
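
A minimal sketch of the lock handling, using the non-blocking pg_try_advisory_lock variant so the operator can skip or requeue instead of waiting (key derivation mirrors hashtext('tenant:service')):

// lock_sketch.go — take the per-schema advisory lock before migrating; skip gracefully when held elsewhere.
package operator

import (
	"context"
	"fmt"

	"github.com/jackc/pgx/v5"
)

// tryMigrateWithLock returns (false, nil) when the lock is held by another executor,
// so the caller can requeue instead of failing the migration.
func tryMigrateWithLock(ctx context.Context, conn *pgx.Conn, tenant, service string,
	migrate func(ctx context.Context) error) (bool, error) {

	key := fmt.Sprintf("%s:%s", tenant, service)

	var acquired bool
	// Non-blocking variant; pg_advisory_lock from the list above would block until released.
	if err := conn.QueryRow(ctx,
		"SELECT pg_try_advisory_lock(hashtext($1))", key).Scan(&acquired); err != nil {
		return false, err
	}
	if !acquired {
		return false, nil // lock held by another executor: skip / retry later
	}
	defer conn.Exec(ctx, "SELECT pg_advisory_unlock(hashtext($1))", key)

	return true, migrate(ctx)
}

pg_try_advisory_lock was chosen here so a held lock results in a requeue rather than a blocked worker; the blocking pg_advisory_lock named above is equally valid if the operator prefers to wait.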

4.2 Schema-Registry Table

See Provisioning Model.

4.3 Migration UI & Workflow

Feature Overview

Feature                      | Triggered by | Available in UI? | Automation Option?
List current schema versions | Platform     | ✅ Yes           | N/A
Check for updates            | Operator     | ✅ Yes           | ✅ Background scan
Dry-run migration            | Tenant/Admin | ✅ Yes           | ✅ (via CRD)
Apply migration              | Tenant/Admin | ✅ Yes           | ❌ Manual only
Retry failed migration       | Admin only   | ✅ Yes           | ✅ (admin tools)
View logs/status/history     | Platform     | ✅ Yes           | ✅ via CRD + Prometheus

Migration Journey per Schema

START
  │  (background scan or manual)
  ▼
Check for Upgrade Availability
  ├── No  → Stay on current version
  └── Yes
       ▼
     Dry-Run Option → Show Diff Output
       ▼
     Apply Migration
       ▼
     Track Status: migrating → complete / failed
       ├── Success → Done
       └── Failure → Retry via UI (admin only)

UI Components

  • Schema Overview: Shows schema name, current/desired version, status, last migration time, logs (preview), filter/search, warnings for failed/blocked.
  • Dry-Run Interface: The Try Migration to vX.Y.Z button executes the SQL in a transaction (rolled back) and shows a diff output:

    + ALTER TABLE user ADD COLUMN tfa_enabled BOOLEAN;
    - DROP TABLE old_sessions;

    Actions: Cancel, Apply Migration, Download SQL Diff

  • Apply Migration: Triggers a patch:

      spec:
        desiredVersion: vX.Y.Z
        dryRun: false

    The UI shows progress, logs, and status transitions. If backupBeforeMigration: true, the UI prompts for backup confirmation.
  • Retry Failed Migration: Admin-only; the Retry Migration button re-applies the patch for the same version.

RBAC/Access Control

Role           | Can Migrate | Can Dry Run | Can Retry
Tenant Admin   | ✅          | ✅          | ❌
Platform Admin | ✅          | ✅          | ✅

API/Backend Integration

Action                     | How it's triggered
Start dry-run              | PATCH CRD: dryRun: true
Start real migration       | PATCH CRD: dryRun: false, set desiredVersion
Retry failed migration     | PATCH same desiredVersion again
Read status and logs       | From CRD .status and schema_registry table
Check dependency conflicts | Surfaced via .status (blocked)
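
A minimal sketch of these PATCH calls from the backend side, using a dynamic client and a merge patch (the CRD group/version are assumptions):

// patch_sketch.go — trigger a dry-run or real migration by patching the TenantSchemaVersion spec.
package platform

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

var tenantSchemaVersionGVR = schema.GroupVersionResource{
	Group:    "arematics.com", // assumed CRD group
	Version:  "v1alpha1",      // assumed version
	Resource: "tenantschemaversions",
}

func requestMigration(ctx context.Context, dyn dynamic.Interface,
	namespace, name, desiredVersion string, dryRun bool) error {

	// Merge-patch only the two spec fields the table above mentions.
	patch := []byte(fmt.Sprintf(`{"spec":{"desiredVersion":%q,"dryRun":%t}}`, desiredVersion, dryRun))
	_, err := dyn.Resource(tenantSchemaVersionGVR).Namespace(namespace).
		Patch(ctx, name, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}

Calling requestMigration with dryRun=true starts a dry-run, calling it with dryRun=false applies the migration, and repeating the call with the same desiredVersion is the retry path.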

5. Versioning & Coordination

5.1 Service Pods per Version

  • The deployment controller keeps ≥1 Pod for each service version still used by at least one tenant.
  • Version Deprecation policy: Notify tenants N days ahead → auto-upgrade or block.

5.2 Multi-Service Coordination & Version Planning

  • Services may depend on specific versions of others (e.g., auth@1.4 requires payments@2.0).
  • Each service version may define dependency rules in dependencies.yml located in /migrations/<service>/.
  • Migration Operator checks rules before applying migrations; if dependency not met:
    • Blocks migration, sets schema status to blocked, emits event + Prometheus alert.
  • Cross-service dependencies must be tracked cluster-wide.

Example:

# /migrations/entity/dependencies.yml
requires:
  - service: auth
    minVersion: v1.2.0
  - service: payments
    minVersion: v2.1.0

  • A central service version registry table (e.g. infra.service_versions) at cluster level tracks which service versions must be deployed to satisfy all tenant needs. This registry can drive GitOps pipelines or Helm values per namespace.
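
A minimal sketch of the dependency check, assuming the dependencies.yml shape above and that the operator already knows which service versions are deployed (that lookup is left abstract):

// deps_sketch.go — block a migration when a required service version is not deployed.
package operator

import (
	"fmt"
	"os"

	"golang.org/x/mod/semver"
	"gopkg.in/yaml.v3"
)

type dependencyFile struct {
	Requires []struct {
		Service    string `yaml:"service"`
		MinVersion string `yaml:"minVersion"`
	} `yaml:"requires"`
}

// checkDependencies returns an error naming the first unmet requirement; the operator would then
// set the schema status to "blocked" and emit an event plus a Prometheus alert.
func checkDependencies(path string, deployed map[string]string) error {
	raw, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	var deps dependencyFile
	if err := yaml.Unmarshal(raw, &deps); err != nil {
		return err
	}
	for _, req := range deps.Requires {
		current, ok := deployed[req.Service]
		if !ok || semver.Compare(current, req.MinVersion) < 0 {
			return fmt.Errorf("dependency not met: %s requires >= %s (deployed: %q)",
				req.Service, req.MinVersion, current)
		}
	}
	return nil
}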

5.3 Version Compatibility & Dependencies

  • Service-A@v1.4 can declare it requires service-B >= v2.0.
  • Dependencies encoded in a YAML file inside the image.
  • Operator blocks migration if dependency not satisfied.

6. Lifecycle Management

6.1 Retention Policies & Schema Lifecycle

Certain transitions (e.g., downgrade from country-level to app-level) require retention of old schemas:

retentionPolicy:
  schemaTransition: 30d      # Retain downgraded country schemas
  deprecatedVersions: 60d    # Keep old versions active for rollbacks or audits

  • Cleanup Jobs scan these policies and delete data beyond expiration.
  • Per-tenant or per-app override via annotations or CRD fields.
  • Policies tracked per schema and surfaced via schema_registry.retention_expires_at.
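
A minimal sketch of such a cleanup pass over schema_registry.retention_expires_at (simplified: a real job would also handle whole country DBs and per-tenant overrides):

// cleanup_sketch.go — drop schemas whose retention window has expired.
package cleanup

import (
	"context"
	"fmt"

	"github.com/jackc/pgx/v5"
)

func cleanupExpiredSchemas(ctx context.Context, conn *pgx.Conn) error {
	rows, err := conn.Query(ctx,
		`SELECT schema_name FROM schema_registry
		  WHERE retention_expires_at IS NOT NULL AND retention_expires_at < now()`)
	if err != nil {
		return err
	}
	var expired []string
	for rows.Next() {
		var name string
		if err := rows.Scan(&name); err != nil {
			return err
		}
		expired = append(expired, name)
	}
	rows.Close()
	if err := rows.Err(); err != nil {
		return err
	}

	for _, name := range expired {
		// CASCADE removes the objects inside the schema as well.
		if _, err := conn.Exec(ctx, fmt.Sprintf("DROP SCHEMA IF EXISTS %q CASCADE", name)); err != nil {
			return err
		}
	}
	return nil
}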

6.2 Failure Recovery & Deletion Policy

  • Tenant deletion policies must comply with legal requirements (e.g., GDPR):

      deletionPolicy:
        mode: retained | immediate | manual
        gracePeriod: 30d

    • retained: preserves data for gracePeriod before purging
    • immediate: drops all schemas and databases after the deletion request
    • manual: requires manual intervention to finalize
  • The setting is stored in the CRD metadata and reflected in the cleanup controller.

6.3 Failure Recovery & Migration State Management

  • Every migration run updates both schema_registry.status in Postgres and status.phase in CRD.
  • States: pending, migrating, complete, failed
  • Failed states surfaced to Prometheus + UI. Retry must be triggered manually.
  • Retrying a failed migration locks the schema, re-runs unapplied steps, and transitions state on success.
  • Admin kubectl plugin may list failed schemas, trigger retries, and view logs.

7. Platform Integration

While operators automate infrastructure provisioning inside Kubernetes, they rely on CRDs being created to trigger actions.

To connect the Arematics Platform backend to the infrastructure layer:

  • The Platform backend must create or update:
    • TenantProvisionJob CRDs when a new organization is registered
    • AppProvisionJob CRDs when a user creates or modifies an app
  • Typically performed via:
    • Secure service account with CRD write access in infra-system namespace
    • Kubernetes API via client-go or HTTP
    • Optionally: platform-side webhook service forwarding to Kubernetes operator proxy

Example flow:

User signs up ➜ Arematics backend creates TenantProvisionJob ➜ CRD created ➜ Job triggers provisioning
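
A minimal sketch of that backend call, using a dynamic client with an unstructured object (the CRD group/version and spec fields are assumptions; the authoritative schema lives with the operators):

// create_crd_sketch.go — platform backend creating a TenantProvisionJob to kick off provisioning.
package platform

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var tenantProvisionJobGVR = schema.GroupVersionResource{
	Group:    "arematics.com", // assumed
	Version:  "v1alpha1",      // assumed
	Resource: "tenantprovisionjobs",
}

func createTenantProvisionJob(ctx context.Context, dyn dynamic.Interface, tenantSlug string) error {
	cr := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "arematics.com/v1alpha1",
		"kind":       "TenantProvisionJob",
		"metadata": map[string]interface{}{
			"name":      "provision-" + tenantSlug,
			"namespace": "infra-system",
			"labels":    map[string]interface{}{"arematics.com/tenant": tenantSlug},
		},
		"spec": map[string]interface{}{
			"tenantSlug":   tenantSlug,
			"baseServices": []interface{}{"auth", "entity", "payments"},
		},
	}}
	// Uses the service account with CRD write access in infra-system mentioned above.
	_, err := dyn.Resource(tenantProvisionJobGVR).Namespace("infra-system").
		Create(ctx, cr, metav1.CreateOptions{})
	return err
}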

Additional Considerations:

  • Audit logs should capture who created each CRD
  • Tenants can query infra status via platform (mirroring CRD status fields)
  • Tenant deletion should trigger CRD cleanup (finalizers may be needed)
  • All job CRDs are idempotent. In case of failure (e.g., partial outage), orphaned DBs/schemas can be deleted and the job retried. Since data creation happens after provisioning completes, there is no critical data loss at this stage.

8. Appendix

8.1 RBAC Table

Role           | Can Migrate | Can Dry Run | Can Retry
Tenant Admin   | ✅          | ✅          | ❌
Platform Admin | ✅          | ✅          | ✅

8.2 Metrics Overview Table

Metric                             | Type      | Labels
tenant_migration_duration_seconds  | Histogram | cluster, service, tenant, version
tenant_migration_total             | Counter   | outcome (success/failed)
tenant_schema_version              | Gauge     | tenant, service, schema, version
tenant_backup_duration_seconds     | Histogram | tenant, cluster, outcome
tenant_backup_failed_total         | Counter   | tenant, cluster

  • Exposed via /metrics on the operator Pod.
  • Grafana dashboard: Schema Version Heat-map + Migration Latency.
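
A minimal sketch of how the operator could register and expose a subset of these metrics with client_golang (label sets copied from the table; buckets and example values are illustrative):

// metrics_sketch.go — register migration metrics and serve them on /metrics.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	migrationDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "tenant_migration_duration_seconds",
		Help:    "Duration of tenant schema migrations.",
		Buckets: prometheus.DefBuckets,
	}, []string{"cluster", "service", "tenant", "version"})

	migrationTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "tenant_migration_total",
		Help: "Completed migrations by outcome.",
	}, []string{"outcome"})

	schemaVersion = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "tenant_schema_version",
		Help: "Currently deployed schema version per tenant/service/schema.",
	}, []string{"tenant", "service", "schema", "version"})
)

func main() {
	prometheus.MustRegister(migrationDuration, migrationTotal, schemaVersion)

	// Example updates the operator would make after a migration run.
	migrationDuration.WithLabelValues("prod", "entity", "tenant_xyz", "v1.5.0").Observe(42.3)
	migrationTotal.WithLabelValues("success").Inc()
	schemaVersion.WithLabelValues("tenant_xyz", "entity", "entity-crm", "v1.5.0").Set(1)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}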

8.3 UI API Endpoint Summary

Action                     | How it's triggered
Start dry-run              | PATCH CRD: dryRun: true
Start real migration       | PATCH CRD: dryRun: false, set desiredVersion
Retry failed migration     | PATCH same desiredVersion again
Read status and logs       | From CRD .status and schema_registry table
Check dependency conflicts | Surfaced via .status (blocked)

Notes & Clarifications

  1. Schema Initialization: Migrations must include DDL and may optionally include initial fixture data via init_data.sql.
  2. Multi-service Version Alignment: Managed by the Migration Operator. autoUpgrade: true enables automatic minor upgrades.
  3. Country-Scoped CRDs: A single App CRD with countryScoped: true controls creation of tenant_xyz_<country> DBs. Schemas inside follow normal service-app structure.
  4. Migration Sequencing: Dependencies must be respected; optional field migrationOrder can define sequencing explicitly.
  5. Migration Rollback: All SQL files must include a -- ROLLBACK: section describing reverse operations. Migrations are executed in isolated transactions.
  6. Pre-Backup: Backups are required before risky migrations. Use backupBeforeMigration: true to trigger platform hooks or pg_dump/Velero scripts.
  7. No Cross-Region Sync: Country-level databases are isolated. Any sync is left to higher-level process orchestration.