diff --git a/_layouts/principle.html b/_layouts/principle.html index 84263ce..465a1a7 100644 --- a/_layouts/principle.html +++ b/_layouts/principle.html @@ -9,6 +9,18 @@
{{ content }} + {% if page.examples %} +
    + {% for example in page.examples %} +
  1. + {{ example.title }} — {{ example.body | markdownify | remove: '

    ' | remove: '

    ' | strip }}
    + + + +
  2. + {% endfor %} +
+ {% endif %}
diff --git a/assets/css/style.css b/assets/css/style.css index a72e3da..f49fa0e 100644 --- a/assets/css/style.css +++ b/assets/css/style.css @@ -461,6 +461,31 @@ button:focus-visible { box-shadow: var(--shadow-md); } +.example-anchor { + margin-left: auto; + align-self: flex-start; + flex-shrink: 0; + display: inline-flex; + align-items: center; + padding: .2rem; + border-radius: var(--radius-sm); + color: var(--color-text-muted); + opacity: 0; + transition: opacity .15s, color .15s; + text-decoration: none; +} + +.principle-content ol li:hover .example-anchor, +.example-anchor:focus { + opacity: 1; +} + +.example-anchor:hover, +.example-anchor:focus { + color: var(--color-accent-dark); + text-decoration: none; +} + .principle-content strong { color: var(--color-primary); font-weight: 600; } .principle-content ul { diff --git a/examples/01-make-the-right-thing-the-easy-thing.md b/examples/01-make-the-right-thing-the-easy-thing.md index de16da7..c2f6965 100644 --- a/examples/01-make-the-right-thing-the-easy-thing.md +++ b/examples/01-make-the-right-thing-the-easy-thing.md @@ -1,6 +1,47 @@ --- layout: principle title: "Principle 1: Make the right thing the easy thing — not the only thing" +examples: + - title: "Scaffold with security baked in" + body: "Provide a `create-service` CLI command that generates a new microservice with TLS, structured logging, and secret injection pre-configured so developers never start from a blank, insecure slate." + - title: "Pre-approved base images" + body: "Publish a curated set of hardened container base images in the internal registry. Developers pull `internal/node:20` and get a patched, minimal image without thinking about CVEs." + - title: "Lint rules as guardrails, not blockers" + body: "Ship a shared linter configuration that flags insecure patterns (e.g., hardcoded secrets, `HTTP` instead of `HTTPS`) in CI with clear fix instructions, rather than silently passing or hard-failing with no guidance." + - title: "Default resource limits in Kubernetes" + body: "Apply a `LimitRange` and `ResourceQuota` to every namespace by default so that new workloads automatically get sensible CPU and memory boundaries without requiring a ticket." + - title: "One-command local environment" + body: "Provide a `platform dev up` command that starts a local stack (database, message broker, stub services) with a single command, eliminating multi-page setup guides." + - title: "Automated dependency updates" + body: "Run Dependabot or Renovate pre-configured across all repositories so teams receive automatic PRs for dependency updates rather than having to remember to check." + - title: "Secret rotation without code changes" + body: "Integrate secrets management so that rotating a database password is a platform operation, not a developer task requiring a deployment." + - title: "Built-in distributed tracing" + body: "Inject an OpenTelemetry sidecar or SDK wrapper by default so every service emits traces without developers wiring up instrumentation manually." + - title: "Self-healing deployments" + body: "Configure liveness and readiness probes in the golden-path deployment template so Kubernetes restarts unhealthy pods automatically without developer intervention." + - title: "Compliance-as-code in CI" + body: "Embed Open Policy Agent (OPA) checks in the CI pipeline that block deployments with missing labels, wrong image registries, or overly permissive RBAC — with a link to the approved fix." + - title: "Network policies by default" + body: "Apply a default-deny network policy to all new namespaces so that only explicitly declared traffic is permitted, removing the need for developers to think about lateral movement risks." + - title: "Escape hatch via `platform.override.yaml`" + body: "Allow teams to declare justified exceptions in a versioned override file reviewed during pull requests, so diverging from the golden path is possible but auditable." + - title: "Preview environments on every PR" + body: "Automatically spin up an ephemeral environment for every pull request so reviewers can test changes without manually deploying to shared environments." + - title: "Runbook links in alerts" + body: "Attach a `runbook_url` label to every alert rule so that on-call engineers land directly on remediation steps instead of searching Confluence at 3 AM." + - title: "Database migrations in the deployment pipeline" + body: "Include a migration step in the golden-path pipeline that runs schema changes safely before rolling out the application, so developers don't run `psql` by hand in production." + - title: "Role-based access templates" + body: "Offer pre-defined IAM/RBAC role bundles (`developer`, `operator`, `read-only`) that teams request rather than composing permissions from scratch, reducing misconfigurations." + - title: "Cost visibility in pull requests" + body: "Post an automated cost estimate comment on infrastructure PRs so engineers see the financial impact of their changes before merging." + - title: "Shared observability dashboards" + body: "Pre-build Grafana dashboards for the standard golden-path stack (RED metrics, JVM heap, DB connections) so teams have useful dashboards from day one without building their own." + - title: "Centralised log aggregation by default" + body: "Route container stdout/stderr to the central log platform automatically via a DaemonSet, so developers `kubectl logs` in dev and query Loki/Splunk in production without extra setup." + - title: "Graceful degradation patterns in the SDK" + body: "Provide a platform SDK with a built-in circuit breaker and fallback interface so developers get resilience patterns without implementing Hystrix from scratch." --- # Principle 1: Make the right thing the easy thing — not the only thing @@ -8,43 +49,3 @@ title: "Principle 1: Make the right thing the easy thing — not the only thing" > Every platform decision should answer: does this help developers focus on delivering business value, or does it add friction they must carry? Golden paths guide developers towards safe, reliable outcomes — with security and compliance built in by design, invisible rather than adversarial. They must always include escape hatches: teams that need to diverge responsibly should be supported, not punished. ## 20 Practical Examples - -1. **Scaffold with security baked in** — Provide a `create-service` CLI command that generates a new microservice with TLS, structured logging, and secret injection pre-configured so developers never start from a blank, insecure slate. - -2. **Pre-approved base images** — Publish a curated set of hardened container base images in the internal registry. Developers pull `internal/node:20` and get a patched, minimal image without thinking about CVEs. - -3. **Lint rules as guardrails, not blockers** — Ship a shared linter configuration that flags insecure patterns (e.g., hardcoded secrets, `HTTP` instead of `HTTPS`) in CI with clear fix instructions, rather than silently passing or hard-failing with no guidance. - -4. **Default resource limits in Kubernetes** — Apply a `LimitRange` and `ResourceQuota` to every namespace by default so that new workloads automatically get sensible CPU and memory boundaries without requiring a ticket. - -5. **One-command local environment** — Provide a `platform dev up` command that starts a local stack (database, message broker, stub services) with a single command, eliminating multi-page setup guides. - -6. **Automated dependency updates** — Run Dependabot or Renovate pre-configured across all repositories so teams receive automatic PRs for dependency updates rather than having to remember to check. - -7. **Secret rotation without code changes** — Integrate secrets management so that rotating a database password is a platform operation, not a developer task requiring a deployment. - -8. **Built-in distributed tracing** — Inject an OpenTelemetry sidecar or SDK wrapper by default so every service emits traces without developers wiring up instrumentation manually. - -9. **Self-healing deployments** — Configure liveness and readiness probes in the golden-path deployment template so Kubernetes restarts unhealthy pods automatically without developer intervention. - -10. **Compliance-as-code in CI** — Embed Open Policy Agent (OPA) checks in the CI pipeline that block deployments with missing labels, wrong image registries, or overly permissive RBAC — with a link to the approved fix. - -11. **Network policies by default** — Apply a default-deny network policy to all new namespaces so that only explicitly declared traffic is permitted, removing the need for developers to think about lateral movement risks. - -12. **Escape hatch via `platform.override.yaml`** — Allow teams to declare justified exceptions in a versioned override file reviewed during pull requests, so diverging from the golden path is possible but auditable. - -13. **Preview environments on every PR** — Automatically spin up an ephemeral environment for every pull request so reviewers can test changes without manually deploying to shared environments. - -14. **Runbook links in alerts** — Attach a `runbook_url` label to every alert rule so that on-call engineers land directly on remediation steps instead of searching Confluence at 3 AM. - -15. **Database migrations in the deployment pipeline** — Include a migration step in the golden-path pipeline that runs schema changes safely before rolling out the application, so developers don't run `psql` by hand in production. - -16. **Role-based access templates** — Offer pre-defined IAM/RBAC role bundles (`developer`, `operator`, `read-only`) that teams request rather than composing permissions from scratch, reducing misconfigurations. - -17. **Cost visibility in pull requests** — Post an automated cost estimate comment on infrastructure PRs so engineers see the financial impact of their changes before merging. - -18. **Shared observability dashboards** — Pre-build Grafana dashboards for the standard golden-path stack (RED metrics, JVM heap, DB connections) so teams have useful dashboards from day one without building their own. - -19. **Centralised log aggregation by default** — Route container stdout/stderr to the central log platform automatically via a DaemonSet, so developers `kubectl logs` in dev and query Loki/Splunk in production without extra setup. - -20. **Graceful degradation patterns in the SDK** — Provide a platform SDK with a built-in circuit breaker and fallback interface so developers get resilience patterns without implementing Hystrix from scratch. diff --git a/examples/02-treat-the-platform-as-a-living-product.md b/examples/02-treat-the-platform-as-a-living-product.md index 11f4192..1434459 100644 --- a/examples/02-treat-the-platform-as-a-living-product.md +++ b/examples/02-treat-the-platform-as-a-living-product.md @@ -1,6 +1,47 @@ --- layout: principle title: "Principle 2: Treat the internal developer platform as a living product" +examples: + - title: "Publish a public platform roadmap" + body: "Maintain a `ROADMAP.md` or a GitHub Projects board visible to all engineering teams so everyone can see what is coming, what is in progress, and what has been delivered." + - title: "Run quarterly platform demos" + body: "Hold a short (30-minute) show-and-tell each quarter where the platform team walks through new capabilities, using real workflows from product teams as demos." + - title: "Set up a `#platform-feedback` Slack channel" + body: "Create a dedicated channel where developers can report pain points, request features, and get answers — monitored daily by a rotating platform team member." + - title: "Conduct developer satisfaction surveys" + body: "Send a brief (5-question) quarterly survey to all platform users measuring ease of use, reliability, and overall satisfaction. Share results openly." + - title: "Maintain a changelog" + body: "Keep a `CHANGELOG.md` updated with every platform release so teams know what changed, what was fixed, and what was deprecated without having to ask." + - title: "Define platform personas" + body: "Document two or three user personas (e.g., \"the solo feature dev\", \"the on-call SRE\") and review every roadmap item against their needs to avoid building for imaginary users." + - title: "Run monthly office hours" + body: "Dedicate one hour per month where any engineer can join a video call or walk-in session to ask questions, report issues, or discuss upcoming needs directly with the platform team." + - title: "Track and publish platform adoption metrics" + body: "Show the percentage of teams using the golden-path pipeline, number of services onboarded, and active portal users in a visible dashboard updated weekly." + - title: "Prioritise bugs reported by users first" + body: "Triage user-reported issues within one business day and communicate a fix timeline, reinforcing that developer feedback drives the roadmap." + - title: "Interview new joiners after onboarding" + body: "Schedule a 20-minute retrospective with every new engineer one week after they start, specifically asking what was hard about using the platform for the first time." + - title: "Announce deprecations via the same channels developers use" + body: "Post deprecation notices in Slack, the changelog, and the developer portal simultaneously so no team learns about a breaking change only at the deadline." + - title: "Measure time-to-first-deployment for new teams" + body: "Track how long it takes a new team to deploy their first service end-to-end. Use this as the headline product metric for the platform team." + - title: "Assign a platform product owner" + body: "Have a dedicated product owner (not a tech lead wearing both hats) who maintains the backlog, talks to users, and says no to low-value work." + - title: "Link platform tickets to user stories" + body: "Write platform backlog items as \"As a developer, I want to … so that …\" to keep the team focused on user value rather than technical deliverables." + - title: "Hold a retrospective after every major incident" + body: "When a platform outage affects product teams, share a blameless post-mortem and publish the follow-up action items publicly to build trust." + - title: "Create a \"first use\" experience checklist" + body: "Provide a short getting-started checklist in the developer portal so that a developer new to the platform knows the five things to do in their first hour." + - title: "Version the platform API" + body: "Treat internal platform APIs like external ones — version them, maintain backwards compatibility within a version, and provide migration guides when breaking changes are necessary." + - title: "Track feature requests in the open" + body: "Use a public (internal) GitHub issue tracker so developers can upvote existing requests rather than creating duplicates, and so the team can see demand signals at a glance." + - title: "Set and publish SLOs for the platform itself" + body: "Define service level objectives for the CI pipeline, artifact registry, and developer portal — and report against them monthly to demonstrate operational reliability." + - title: "Celebrate wins with product teams" + body: "When a platform improvement measurably reduces a team's deployment time or incident rate, share the story in a company update to make the value of platform investment visible to leadership." --- # Principle 2: Treat the internal developer platform as a living product @@ -8,43 +49,3 @@ title: "Principle 2: Treat the internal developer platform as a living product" > Platforms have no delivery date. They have users, roadmaps, feedback loops, and product lifecycles — and they must continuously evolve to remain useful. The platform team must continuously communicate value, build stakeholder buy-in, and demonstrate impact — or the platform will be unused, deprecated, or made mandatory by force, each a failure mode. ## 20 Practical Examples - -1. **Publish a public platform roadmap** — Maintain a `ROADMAP.md` or a GitHub Projects board visible to all engineering teams so everyone can see what is coming, what is in progress, and what has been delivered. - -2. **Run quarterly platform demos** — Hold a short (30-minute) show-and-tell each quarter where the platform team walks through new capabilities, using real workflows from product teams as demos. - -3. **Set up a `#platform-feedback` Slack channel** — Create a dedicated channel where developers can report pain points, request features, and get answers — monitored daily by a rotating platform team member. - -4. **Conduct developer satisfaction surveys** — Send a brief (5-question) quarterly survey to all platform users measuring ease of use, reliability, and overall satisfaction. Share results openly. - -5. **Maintain a changelog** — Keep a `CHANGELOG.md` updated with every platform release so teams know what changed, what was fixed, and what was deprecated without having to ask. - -6. **Define platform personas** — Document two or three user personas (e.g., "the solo feature dev", "the on-call SRE") and review every roadmap item against their needs to avoid building for imaginary users. - -7. **Run monthly office hours** — Dedicate one hour per month where any engineer can join a video call or walk-in session to ask questions, report issues, or discuss upcoming needs directly with the platform team. - -8. **Track and publish platform adoption metrics** — Show the percentage of teams using the golden-path pipeline, number of services onboarded, and active portal users in a visible dashboard updated weekly. - -9. **Prioritise bugs reported by users first** — Triage user-reported issues within one business day and communicate a fix timeline, reinforcing that developer feedback drives the roadmap. - -10. **Interview new joiners after onboarding** — Schedule a 20-minute retrospective with every new engineer one week after they start, specifically asking what was hard about using the platform for the first time. - -11. **Announce deprecations via the same channels developers use** — Post deprecation notices in Slack, the changelog, and the developer portal simultaneously so no team learns about a breaking change only at the deadline. - -12. **Measure time-to-first-deployment for new teams** — Track how long it takes a new team to deploy their first service end-to-end. Use this as the headline product metric for the platform team. - -13. **Assign a platform product owner** — Have a dedicated product owner (not a tech lead wearing both hats) who maintains the backlog, talks to users, and says no to low-value work. - -14. **Link platform tickets to user stories** — Write platform backlog items as "As a developer, I want to … so that …" to keep the team focused on user value rather than technical deliverables. - -15. **Hold a retrospective after every major incident** — When a platform outage affects product teams, share a blameless post-mortem and publish the follow-up action items publicly to build trust. - -16. **Create a "first use" experience checklist** — Provide a short getting-started checklist in the developer portal so that a developer new to the platform knows the five things to do in their first hour. - -17. **Version the platform API** — Treat internal platform APIs like external ones — version them, maintain backwards compatibility within a version, and provide migration guides when breaking changes are necessary. - -18. **Track feature requests in the open** — Use a public (internal) GitHub issue tracker so developers can upvote existing requests rather than creating duplicates, and so the team can see demand signals at a glance. - -19. **Set and publish SLOs for the platform itself** — Define service level objectives for the CI pipeline, artifact registry, and developer portal — and report against them monthly to demonstrate operational reliability. - -20. **Celebrate wins with product teams** — When a platform improvement measurably reduces a team's deployment time or incident rate, share the story in a company update to make the value of platform investment visible to leadership. diff --git a/examples/03-measure-from-day-one.md b/examples/03-measure-from-day-one.md index 8568cd4..00d038f 100644 --- a/examples/03-measure-from-day-one.md +++ b/examples/03-measure-from-day-one.md @@ -1,6 +1,47 @@ --- layout: principle title: "Principle 3: Measure from day one" +examples: + - title: "Instrument the golden-path pipeline from launch" + body: "Add counters to your CI/CD pipeline on day one that track builds triggered, success rate, and median duration. Store results in a time-series database so you can show trends from week one." + - title: "Track DORA metrics automatically" + body: "Configure your deployment tooling to emit deployment frequency, lead time for changes, change failure rate, and mean time to restore (MTTR) without requiring teams to fill in spreadsheets." + - title: "Create a platform health dashboard" + body: "Build a Grafana (or equivalent) dashboard visible to everyone that shows live platform SLO compliance, current queue depths, and error rates across all shared services." + - title: "Measure golden-path adoption rate" + body: "Calculate the percentage of active services that use the standard pipeline, base images, and secret management each week, and make it a team KPI." + - title: "Set a lead-time baseline before any changes" + body: "Before making platform improvements, measure current average lead time for a change to reach production. Use it as the before/after benchmark for every initiative." + - title: "Alert on adoption regressions" + body: "Set an alert if golden-path adoption drops by more than 5% week-over-week so the team can investigate whether teams are abandoning the platform or working around it." + - title: "Tag every deployment with a team identifier" + body: "Require a `team` label on every workload so you can slice adoption, cost, and reliability metrics by team without manual reconciliation." + - title: "Log every escape-hatch usage" + body: "When a team uses a platform override or exception, record it automatically. Review these logs monthly to identify gaps in the golden path worth fixing." + - title: "Track time-to-first-deployment" + body: "Record the timestamp when a team's repository is created and when their first successful production deployment occurs. The gap is your onboarding lead time metric." + - title: "Publish a weekly metrics digest" + body: "Send a short automated email or Slack summary every Monday with the previous week's platform metrics — adoption, pipeline reliability, and top errors — to keep stakeholders informed without meetings." + - title: "Measure and display documentation usage" + body: "Instrument developer portal page views and search queries. Pages with high views but low task completion signal confusing documentation worth improving." + - title: "Count support requests per platform feature" + body: "Track how many Slack questions or tickets are raised about each platform component. High question volume on a feature signals poor discoverability or documentation." + - title: "Measure change failure rate per team" + body: "Compute how often a deployment requires a hotfix or rollback per team so you can identify teams that would benefit most from stronger golden-path guardrails." + - title: "Set a developer sentiment target" + body: "Define a target score (e.g., NPS ≥ 30) for developer satisfaction with the platform and measure it each quarter so it carries the same weight as reliability metrics." + - title: "Track mean time to onboard a new service" + body: "Measure the average time from \"repository created\" to \"first deployment through the golden path.\" Use it as a north-star metric for reducing onboarding friction." + - title: "Monitor platform component error budgets" + body: "Define an error budget for the CI system (e.g., 99.5% pipeline success rate) and stop adding new features when the budget is exhausted — fix reliability first." + - title: "Instrument the developer portal search" + body: "Record every search query with zero results. Each no-result query is a capability gap or a documentation gap worth triaging." + - title: "Correlate platform upgrades with incident rates" + body: "After every platform version upgrade, compare the change failure rate in the two weeks before and after. Make this comparison part of the release review." + - title: "Report cost per team automatically" + body: "Use resource tags to generate a monthly cost breakdown per team. Share it with engineering managers so cost awareness becomes a shared responsibility without manual chargeback processes." + - title: "Review metrics in every sprint retrospective" + body: "Include at least one platform metric in the team's regular retrospective to make measurement a habit rather than a quarterly exercise." --- # Principle 3: Measure from day one @@ -8,43 +49,3 @@ title: "Principle 3: Measure from day one" > Technical sophistication without adoption delivers nothing. Track adoption, lead time, change failure rate, and developer sentiment from the start. Feedback loops and telemetry are not afterthoughts — they are how platforms improve. ## 20 Practical Examples - -1. **Instrument the golden-path pipeline from launch** — Add counters to your CI/CD pipeline on day one that track builds triggered, success rate, and median duration. Store results in a time-series database so you can show trends from week one. - -2. **Track DORA metrics automatically** — Configure your deployment tooling to emit deployment frequency, lead time for changes, change failure rate, and mean time to restore (MTTR) without requiring teams to fill in spreadsheets. - -3. **Create a platform health dashboard** — Build a Grafana (or equivalent) dashboard visible to everyone that shows live platform SLO compliance, current queue depths, and error rates across all shared services. - -4. **Measure golden-path adoption rate** — Calculate the percentage of active services that use the standard pipeline, base images, and secret management each week, and make it a team KPI. - -5. **Set a lead-time baseline before any changes** — Before making platform improvements, measure current average lead time for a change to reach production. Use it as the before/after benchmark for every initiative. - -6. **Alert on adoption regressions** — Set an alert if golden-path adoption drops by more than 5% week-over-week so the team can investigate whether teams are abandoning the platform or working around it. - -7. **Tag every deployment with a team identifier** — Require a `team` label on every workload so you can slice adoption, cost, and reliability metrics by team without manual reconciliation. - -8. **Log every escape-hatch usage** — When a team uses a platform override or exception, record it automatically. Review these logs monthly to identify gaps in the golden path worth fixing. - -9. **Track time-to-first-deployment** — Record the timestamp when a team's repository is created and when their first successful production deployment occurs. The gap is your onboarding lead time metric. - -10. **Publish a weekly metrics digest** — Send a short automated email or Slack summary every Monday with the previous week's platform metrics — adoption, pipeline reliability, and top errors — to keep stakeholders informed without meetings. - -11. **Measure and display documentation usage** — Instrument developer portal page views and search queries. Pages with high views but low task completion signal confusing documentation worth improving. - -12. **Count support requests per platform feature** — Track how many Slack questions or tickets are raised about each platform component. High question volume on a feature signals poor discoverability or documentation. - -13. **Measure change failure rate per team** — Compute how often a deployment requires a hotfix or rollback per team so you can identify teams that would benefit most from stronger golden-path guardrails. - -14. **Set a developer sentiment target** — Define a target score (e.g., NPS ≥ 30) for developer satisfaction with the platform and measure it each quarter so it carries the same weight as reliability metrics. - -15. **Track mean time to onboard a new service** — Measure the average time from "repository created" to "first deployment through the golden path." Use it as a north-star metric for reducing onboarding friction. - -16. **Monitor platform component error budgets** — Define an error budget for the CI system (e.g., 99.5% pipeline success rate) and stop adding new features when the budget is exhausted — fix reliability first. - -17. **Instrument the developer portal search** — Record every search query with zero results. Each no-result query is a capability gap or a documentation gap worth triaging. - -18. **Correlate platform upgrades with incident rates** — After every platform version upgrade, compare the change failure rate in the two weeks before and after. Make this comparison part of the release review. - -19. **Report cost per team automatically** — Use resource tags to generate a monthly cost breakdown per team. Share it with engineering managers so cost awareness becomes a shared responsibility without manual chargeback processes. - -20. **Review metrics in every sprint retrospective** — Include at least one platform metric in the team's regular retrospective to make measurement a habit rather than a quarterly exercise. diff --git a/examples/04-manage-everything-as-code.md b/examples/04-manage-everything-as-code.md index a4a0e8e..fe2d22b 100644 --- a/examples/04-manage-everything-as-code.md +++ b/examples/04-manage-everything-as-code.md @@ -1,6 +1,47 @@ --- layout: principle title: "Principle 4: Manage everything as code" +examples: + - title: "Store Terraform in dedicated, purpose-scoped repositories" + body: "Give each platform capability or service domain its own versioned repository for infrastructure modules, environment configuration, and variable files. Dedicated repositories keep ownership clear, limit blast radius, and allow teams to release infrastructure changes independently rather than serialising through a single monorepo pipeline." + - title: "Version-control Kubernetes manifests with GitOps" + body: "Use Argo CD or Flux to reconcile cluster state from a Git repository so that the only way to change production is through a reviewed and merged pull request." + - title: "Define CI/CD pipelines in YAML committed to the repo" + body: "Store every GitHub Actions workflow, Jenkins pipeline, or Tekton Task in the repository it belongs to so pipeline changes are reviewed alongside code changes." + - title: "Policy-as-code with OPA/Rego" + body: "Write all security and compliance policies as Rego rules committed to version control, tested with `opa test`, and enforced in CI rather than documented in a wiki that drifts." + - title: "Manage DNS records as code" + body: "Define all DNS entries in Terraform or Pulumi rather than clicking through a web console, so every record change is reviewable, auditable, and reproducible." + - title: "Define Grafana dashboards in JSON/YAML" + body: "Export dashboards as code (using Grafonnet or the Terraform Grafana provider) and store them in Git so dashboards cannot be silently modified or lost." + - title: "Store alert rules alongside service code" + body: "Keep Prometheus alerting rules in the same repository as the service they monitor so on-call runbooks, alert thresholds, and service code evolve together." + - title: "Manage Kubernetes RBAC as code" + body: "Define all `RoleBinding` and `ClusterRoleBinding` resources in Terraform or Kustomize overlays so that privilege escalations are always a reviewed change, not a manual `kubectl apply`." + - title: "Automate certificate management via code" + body: "Use cert-manager with configuration committed to Git rather than uploading PEM files through a UI, so certificate lifecycle (issue, renew, revoke) is fully automated and auditable." + - title: "Codify network firewall rules" + body: "Represent all security group and firewall rules as infrastructure-as-code rather than console-created rules. Enable `terraform plan` in CI to catch rule regressions before they reach production." + - title: "Write golden-path templates as versioned Helm charts or Kustomize bases" + body: "Store shared deployment templates in a versioned chart repository so every consuming service references a pinned version and upgrades are deliberate." + - title: "Run `terraform validate` and `tflint` in CI" + body: "Automatically validate and lint all Terraform on every pull request so syntax errors and style violations are caught before review, not during apply." + - title: "Store secret rotation scripts in version control" + body: "Keep automation for rotating credentials in Git (without the secrets themselves) so rotation procedures are auditable, repeatable, and not locked in one engineer's head." + - title: "Use `atlantis` or similar for infrastructure review" + body: "Run Terraform plan output automatically as a pull request comment so reviewers see the exact resource diff before approving an infrastructure change." + - title: "Track configuration drift with scheduled reconciliation" + body: "Run a nightly job that compares live cloud state against the declared state in Git and pages on-call if unmanaged resources are detected." + - title: "Manage IAM roles and policies as code" + body: "Define all IAM roles in Terraform or CloudFormation so that permission changes require a PR, pass OPA checks, and are logged in the audit trail automatically." + - title: "Store Slack or PagerDuty notification configs as code" + body: "Define escalation policies, on-call schedules, and alert routing rules in version-controlled configuration (e.g., the PagerDuty Terraform provider) so they are reproducible and reviewable." + - title: "Codify environment promotion gates" + body: "Express promotion criteria (e.g., \"all integration tests green, DORA metrics within SLO\") as code in the pipeline definition, not as informal team norms that vary over time." + - title: "Commit Dependabot/Renovate configuration to the repo" + body: "Store the dependency update configuration in `.github/dependabot.yml` or `renovate.json` so every project gets automated updates without manual enablement per repository." + - title: "Write infrastructure tests with Terratest or Checkov" + body: "Add automated tests for your Terraform modules that verify expected resources are created and security constraints are met, and run them on every PR to the platform infrastructure repo." --- # Principle 4: Manage everything as code @@ -8,43 +49,3 @@ title: "Principle 4: Manage everything as code" > Infrastructure, pipelines, security policies, and golden path configurations belong in version control — testable, reviewable, and repeatable. Code is the source of truth until the abstraction leaks; when it does, the gap between declared and actual state is a risk that must be managed. ## 20 Practical Examples - -1. **Store Terraform in dedicated, purpose-scoped repositories** — Give each platform capability or service domain its own versioned repository for infrastructure modules, environment configuration, and variable files. Dedicated repositories keep ownership clear, limit blast radius, and allow teams to release infrastructure changes independently rather than serialising through a single monorepo pipeline. - -2. **Version-control Kubernetes manifests with GitOps** — Use Argo CD or Flux to reconcile cluster state from a Git repository so that the only way to change production is through a reviewed and merged pull request. - -3. **Define CI/CD pipelines in YAML committed to the repo** — Store every GitHub Actions workflow, Jenkins pipeline, or Tekton Task in the repository it belongs to so pipeline changes are reviewed alongside code changes. - -4. **Policy-as-code with OPA/Rego** — Write all security and compliance policies as Rego rules committed to version control, tested with `opa test`, and enforced in CI rather than documented in a wiki that drifts. - -5. **Manage DNS records as code** — Define all DNS entries in Terraform or Pulumi rather than clicking through a web console, so every record change is reviewable, auditable, and reproducible. - -6. **Define Grafana dashboards in JSON/YAML** — Export dashboards as code (using Grafonnet or the Terraform Grafana provider) and store them in Git so dashboards cannot be silently modified or lost. - -7. **Store alert rules alongside service code** — Keep Prometheus alerting rules in the same repository as the service they monitor so on-call runbooks, alert thresholds, and service code evolve together. - -8. **Manage Kubernetes RBAC as code** — Define all `RoleBinding` and `ClusterRoleBinding` resources in Terraform or Kustomize overlays so that privilege escalations are always a reviewed change, not a manual `kubectl apply`. - -9. **Automate certificate management via code** — Use cert-manager with configuration committed to Git rather than uploading PEM files through a UI, so certificate lifecycle (issue, renew, revoke) is fully automated and auditable. - -10. **Codify network firewall rules** — Represent all security group and firewall rules as infrastructure-as-code rather than console-created rules. Enable `terraform plan` in CI to catch rule regressions before they reach production. - -11. **Write golden-path templates as versioned Helm charts or Kustomize bases** — Store shared deployment templates in a versioned chart repository so every consuming service references a pinned version and upgrades are deliberate. - -12. **Run `terraform validate` and `tflint` in CI** — Automatically validate and lint all Terraform on every pull request so syntax errors and style violations are caught before review, not during apply. - -13. **Store secret rotation scripts in version control** — Keep automation for rotating credentials in Git (without the secrets themselves) so rotation procedures are auditable, repeatable, and not locked in one engineer's head. - -14. **Use `atlantis` or similar for infrastructure review** — Run Terraform plan output automatically as a pull request comment so reviewers see the exact resource diff before approving an infrastructure change. - -15. **Track configuration drift with scheduled reconciliation** — Run a nightly job that compares live cloud state against the declared state in Git and pages on-call if unmanaged resources are detected. - -16. **Manage IAM roles and policies as code** — Define all IAM roles in Terraform or CloudFormation so that permission changes require a PR, pass OPA checks, and are logged in the audit trail automatically. - -17. **Store Slack or PagerDuty notification configs as code** — Define escalation policies, on-call schedules, and alert routing rules in version-controlled configuration (e.g., the PagerDuty Terraform provider) so they are reproducible and reviewable. - -18. **Codify environment promotion gates** — Express promotion criteria (e.g., "all integration tests green, DORA metrics within SLO") as code in the pipeline definition, not as informal team norms that vary over time. - -19. **Commit Dependabot/Renovate configuration to the repo** — Store the dependency update configuration in `.github/dependabot.yml` or `renovate.json` so every project gets automated updates without manual enablement per repository. - -20. **Write infrastructure tests with Terratest or Checkov** — Add automated tests for your Terraform modules that verify expected resources are created and security constraints are met, and run them on every PR to the platform infrastructure repo. diff --git a/examples/05-right-size-your-platform.md b/examples/05-right-size-your-platform.md index e717a1e..64572b1 100644 --- a/examples/05-right-size-your-platform.md +++ b/examples/05-right-size-your-platform.md @@ -1,6 +1,47 @@ --- layout: principle title: "Principle 5: Right-size your platform" +examples: + - title: "Start with a shared CI pipeline, not a service mesh" + body: "For a team of 10 engineers, a standardised GitHub Actions workflow delivers more value than Istio traffic management. Add complexity only when the problem it solves is real and present." + - title: "Use a managed Kubernetes service before building your own control plane" + body: "Let a cloud provider handle etcd, API server upgrades, and node provisioning until you have hundreds of clusters and genuine control plane requirements." + - title: "Replace a complex internal portal with a well-maintained README" + body: "If your platform serves three teams, a thorough `README.md` with working examples is more effective than a custom React portal that takes six months to build." + - title: "Use GitHub Actions before adopting Tekton" + body: "GitHub Actions covers the needs of most organisations without requiring Kubernetes expertise. Adopt a more complex pipeline engine only when specific limitations are hit." + - title: "Choose a managed secret store over building one" + body: "Use AWS Secrets Manager, HashiCorp Vault Cloud, or Azure Key Vault before building a custom secrets service. Managed services come with audit logs, rotation, and SLAs for free." + - title: "Deploy to a single cloud region initially" + body: "Multi-region active-active deployments solve latency and availability problems at scale. For most organisations, a single region with cross-zone redundancy is the right starting point." + - title: "Use a monorepo before introducing a complex microservice mesh" + body: "A single repository with a clear module structure is easier to operate for small teams. Split repositories and service meshes when Conway's Law demands it, not before." + - title: "Adopt a PaaS for non-differentiating workloads" + body: "Use Heroku, Render, or Fly.io for internal tools and low-traffic services instead of running full Kubernetes clusters for workloads that do not need that control." + - title: "Document the \"size trigger\" for each platform component" + body: "Write down the specific condition (e.g., \"when we exceed 50 services\") that would justify adding the next layer of complexity, so the team evaluates objectively rather than speculatively." + - title: "Avoid multitenancy until you have multiple tenants" + body: "Building a multi-tenant platform control plane before having two or more real tenants introduces premature complexity. Single-tenant first is a valid architecture choice." + - title: "Use labels and namespaces before a full service catalogue" + body: "Tag Kubernetes workloads with `team`, `env`, and `tier` labels early so you have structured metadata to build on — without committing to a full service catalogue before the need is proven." + - title: "Standardise on one observability stack" + body: "Pick one logging, metrics, and tracing tool per layer and use it everywhere rather than supporting multiple stacks to satisfy every team's preference. Complexity in observability hides operational risk." + - title: "Leverage cloud-native load balancers before building ingress controllers" + body: "AWS ALB or GCP Load Balancer covers the vast majority of ingress needs. Build custom ingress only when WAF rules, advanced routing, or cost at scale make it necessary." + - title: "Write a one-page platform scope document" + body: "Clearly state what the platform does and does not do. Publish it so teams do not expect capabilities the platform was never designed to provide." + - title: "Resist requests to support every language runtime" + body: "Define a supported runtime matrix (e.g., JVM, Node.js, Python) and decline to build golden paths for runtimes used by fewer than two teams until demand justifies the maintenance cost." + - title: "Use feature flags before building a full experimentation platform" + body: "A simple LaunchDarkly integration or a homegrown boolean flag table solves most A/B testing needs at small scale. A full-blown experimentation platform is a later-stage problem." + - title: "Cap the number of platform components in flight" + body: "Limit the platform team's work in progress (e.g., no more than three major initiatives at once) to ensure quality and adoption of each component before adding new ones." + - title: "Evaluate build vs. buy against total cost of ownership" + body: "Before building a custom tool, estimate maintenance cost over three years including on-call burden, upgrades, and documentation. Compare it honestly against a vendor or open-source alternative." + - title: "Deprecate underused platform components" + body: "If a capability has fewer than two active consumers after six months, question whether it belongs in the platform. Removing it reduces cognitive load for the entire organisation." + - title: "Re-evaluate platform architecture when team size doubles" + body: "Schedule a deliberate platform architecture review when the engineering organisation doubles in size. What was right-sized at 20 engineers may be under-scaled at 40 or over-engineered at 15." --- # Principle 5: Right-size your platform @@ -8,43 +49,3 @@ title: "Principle 5: Right-size your platform" > Match complexity to organisational need. A simple pipeline platform is as valid as a fully-orchestrated internal developer platform. Build for the problem you have today, not the organisation you may one day become. ## 20 Practical Examples - -1. **Start with a shared CI pipeline, not a service mesh** — For a team of 10 engineers, a standardised GitHub Actions workflow delivers more value than Istio traffic management. Add complexity only when the problem it solves is real and present. - -2. **Use a managed Kubernetes service before building your own control plane** — Let a cloud provider handle etcd, API server upgrades, and node provisioning until you have hundreds of clusters and genuine control plane requirements. - -3. **Replace a complex internal portal with a well-maintained README** — If your platform serves three teams, a thorough `README.md` with working examples is more effective than a custom React portal that takes six months to build. - -4. **Use GitHub Actions before adopting Tekton** — GitHub Actions covers the needs of most organisations without requiring Kubernetes expertise. Adopt a more complex pipeline engine only when specific limitations are hit. - -5. **Choose a managed secret store over building one** — Use AWS Secrets Manager, HashiCorp Vault Cloud, or Azure Key Vault before building a custom secrets service. Managed services come with audit logs, rotation, and SLAs for free. - -6. **Deploy to a single cloud region initially** — Multi-region active-active deployments solve latency and availability problems at scale. For most organisations, a single region with cross-zone redundancy is the right starting point. - -7. **Use a monorepo before introducing a complex microservice mesh** — A single repository with a clear module structure is easier to operate for small teams. Split repositories and service meshes when Conway's Law demands it, not before. - -8. **Adopt a PaaS for non-differentiating workloads** — Use Heroku, Render, or Fly.io for internal tools and low-traffic services instead of running full Kubernetes clusters for workloads that do not need that control. - -9. **Document the "size trigger" for each platform component** — Write down the specific condition (e.g., "when we exceed 50 services") that would justify adding the next layer of complexity, so the team evaluates objectively rather than speculatively. - -10. **Avoid multitenancy until you have multiple tenants** — Building a multi-tenant platform control plane before having two or more real tenants introduces premature complexity. Single-tenant first is a valid architecture choice. - -11. **Use labels and namespaces before a full service catalogue** — Tag Kubernetes workloads with `team`, `env`, and `tier` labels early so you have structured metadata to build on — without committing to a full service catalogue before the need is proven. - -12. **Standardise on one observability stack** — Pick one logging, metrics, and tracing tool per layer and use it everywhere rather than supporting multiple stacks to satisfy every team's preference. Complexity in observability hides operational risk. - -13. **Leverage cloud-native load balancers before building ingress controllers** — AWS ALB or GCP Load Balancer covers the vast majority of ingress needs. Build custom ingress only when WAF rules, advanced routing, or cost at scale make it necessary. - -14. **Write a one-page platform scope document** — Clearly state what the platform does and does not do. Publish it so teams do not expect capabilities the platform was never designed to provide. - -15. **Resist requests to support every language runtime** — Define a supported runtime matrix (e.g., JVM, Node.js, Python) and decline to build golden paths for runtimes used by fewer than two teams until demand justifies the maintenance cost. - -16. **Use feature flags before building a full experimentation platform** — A simple LaunchDarkly integration or a homegrown boolean flag table solves most A/B testing needs at small scale. A full-blown experimentation platform is a later-stage problem. - -17. **Cap the number of platform components in flight** — Limit the platform team's work in progress (e.g., no more than three major initiatives at once) to ensure quality and adoption of each component before adding new ones. - -18. **Evaluate build vs. buy against total cost of ownership** — Before building a custom tool, estimate maintenance cost over three years including on-call burden, upgrades, and documentation. Compare it honestly against a vendor or open-source alternative. - -19. **Deprecate underused platform components** — If a capability has fewer than two active consumers after six months, question whether it belongs in the platform. Removing it reduces cognitive load for the entire organisation. - -20. **Re-evaluate platform architecture when team size doubles** — Schedule a deliberate platform architecture review when the engineering organisation doubles in size. What was right-sized at 20 engineers may be under-scaled at 40 or over-engineered at 15. diff --git a/examples/06-start-with-a-minimal-viable-platform.md b/examples/06-start-with-a-minimal-viable-platform.md index 6488526..6baa0d4 100644 --- a/examples/06-start-with-a-minimal-viable-platform.md +++ b/examples/06-start-with-a-minimal-viable-platform.md @@ -1,6 +1,47 @@ --- layout: principle title: "Principle 6: Start with a minimal viable platform" +examples: + - title: "Launch with a single golden-path pipeline" + body: "Release a working CI/CD pipeline for one language (e.g., Java on Kubernetes) to production, gather feedback from real teams, and add the next language stack only after the first is stable and adopted." + - title: "Deploy the MVP with one pilot team" + body: "Onboard a single willing product team as your first user before any broader rollout. Their feedback will surface usability problems that reviews and demos never catch." + - title: "Use a simple YAML config to describe a service" + body: "Start with a minimal `service.yaml` (name, team, language, port) as the platform contract rather than designing a comprehensive schema upfront. Add fields only when teams ask for them." + - title: "Provide a working hello-world template" + body: "Create one fully functional example repository that can be forked, renamed, and deployed in under 30 minutes. Ship that before building the template catalogue." + - title: "Skip the portal, start with the CLI" + body: "A `platform` CLI that wraps common tasks (create, deploy, logs) gets into developers' hands faster than building a full web portal. Add the portal once the workflows are proven via the CLI." + - title: "Cut scope at the first sign of schedule pressure" + body: "When an MVP is at risk of slipping, remove features rather than delay the release. An on-time MVP that covers 60% of needs beats a comprehensive platform that arrives six months late." + - title: "Limit the MVP to one environment" + body: "Deliver a production environment first. Staging and development environment support can come in the next iteration, once the production deployment path is validated." + - title: "Write the user guide before building the feature" + body: "Draft the documentation for a capability before implementing it. If you cannot describe how a developer would use it in plain language, the design is probably not simple enough to ship." + - title: "Use a shared spreadsheet as the first service catalogue" + body: "Before building a service catalogue database or portal, maintain a shared spreadsheet of service name, owner, language, and repo. It is useful immediately and reveals what a real catalogue needs to store." + - title: "Release behind a feature flag" + body: "Roll out MVP platform capabilities to opt-in teams via a feature flag so you can gather feedback and iterate without blocking teams still on the old path." + - title: "Define \"done\" for the MVP before starting" + body: "Write three acceptance criteria (e.g., \"a new engineer can deploy a service without help in under 45 minutes\") and resist scope creep until all three are met and validated." + - title: "Ship runbooks before automation" + body: "Document the manual steps for a complex operation (e.g., certificate rotation) as a runbook first. Automate it in the next iteration once you have validated the steps are correct." + - title: "Use existing tools as the first version" + body: "Before writing a custom deployment operator, validate the workflow using `kubectl`, shell scripts, and Makefiles. Build the operator only when the manual process is proven and the automation need is clear." + - title: "Timebox spikes to two days" + body: "When evaluating a technology choice for the MVP, timebox the investigation to two days and make a decision with available information. Avoid analysis paralysis on tooling that can be changed later." + - title: "Set a hard ship date for the MVP" + body: "Commit publicly to an MVP launch date with leadership and pilot teams. The deadline creates the pressure to cut scope to essentials rather than gold-plating." + - title: "Measure the MVP against one headline metric" + body: "Pick a single metric (e.g., time-to-first-deployment) as the MVP success criterion. Resist adding more metrics until the first one is being hit consistently." + - title: "Onboard three teams before adding new capabilities" + body: "After launching the MVP, focus entirely on getting three product teams using it end-to-end before adding any new feature. Adoption depth beats feature breadth." + - title: "Collect feedback with a five-question form after first use" + body: "After a developer deploys their first service through the platform, send a short form asking what was hard, what was missing, and what worked well. Act on the top two themes." + - title: "Publish the MVP changelog with known limitations" + body: "Be transparent about what the MVP does not yet support. Teams trust a platform that is honest about its gaps more than one that over-promises." + - title: "Plan the next iteration based on MVP feedback" + body: "Do not plan iteration two until the MVP has been used by real teams for at least two weeks. Use observed friction points and user feedback to prioritise the next set of capabilities." --- # Principle 6: Start with a minimal viable platform @@ -8,43 +49,3 @@ title: "Principle 6: Start with a minimal viable platform" > Ship the thinnest platform that delivers real value to real teams, then iterate. A working golden path for one use case beats a comprehensive platform that is six months from release. Validate with actual users before building the next layer. ## 20 Practical Examples - -1. **Launch with a single golden-path pipeline** — Release a working CI/CD pipeline for one language (e.g., Java on Kubernetes) to production, gather feedback from real teams, and add the next language stack only after the first is stable and adopted. - -2. **Deploy the MVP with one pilot team** — Onboard a single willing product team as your first user before any broader rollout. Their feedback will surface usability problems that reviews and demos never catch. - -3. **Use a simple YAML config to describe a service** — Start with a minimal `service.yaml` (name, team, language, port) as the platform contract rather than designing a comprehensive schema upfront. Add fields only when teams ask for them. - -4. **Provide a working hello-world template** — Create one fully functional example repository that can be forked, renamed, and deployed in under 30 minutes. Ship that before building the template catalogue. - -5. **Skip the portal, start with the CLI** — A `platform` CLI that wraps common tasks (create, deploy, logs) gets into developers' hands faster than building a full web portal. Add the portal once the workflows are proven via the CLI. - -6. **Cut scope at the first sign of schedule pressure** — When an MVP is at risk of slipping, remove features rather than delay the release. An on-time MVP that covers 60% of needs beats a comprehensive platform that arrives six months late. - -7. **Limit the MVP to one environment** — Deliver a production environment first. Staging and development environment support can come in the next iteration, once the production deployment path is validated. - -8. **Write the user guide before building the feature** — Draft the documentation for a capability before implementing it. If you cannot describe how a developer would use it in plain language, the design is probably not simple enough to ship. - -9. **Use a shared spreadsheet as the first service catalogue** — Before building a service catalogue database or portal, maintain a shared spreadsheet of service name, owner, language, and repo. It is useful immediately and reveals what a real catalogue needs to store. - -10. **Release behind a feature flag** — Roll out MVP platform capabilities to opt-in teams via a feature flag so you can gather feedback and iterate without blocking teams still on the old path. - -11. **Define "done" for the MVP before starting** — Write three acceptance criteria (e.g., "a new engineer can deploy a service without help in under 45 minutes") and resist scope creep until all three are met and validated. - -12. **Ship runbooks before automation** — Document the manual steps for a complex operation (e.g., certificate rotation) as a runbook first. Automate it in the next iteration once you have validated the steps are correct. - -13. **Use existing tools as the first version** — Before writing a custom deployment operator, validate the workflow using `kubectl`, shell scripts, and Makefiles. Build the operator only when the manual process is proven and the automation need is clear. - -14. **Timebox spikes to two days** — When evaluating a technology choice for the MVP, timebox the investigation to two days and make a decision with available information. Avoid analysis paralysis on tooling that can be changed later. - -15. **Set a hard ship date for the MVP** — Commit publicly to an MVP launch date with leadership and pilot teams. The deadline creates the pressure to cut scope to essentials rather than gold-plating. - -16. **Measure the MVP against one headline metric** — Pick a single metric (e.g., time-to-first-deployment) as the MVP success criterion. Resist adding more metrics until the first one is being hit consistently. - -17. **Onboard three teams before adding new capabilities** — After launching the MVP, focus entirely on getting three product teams using it end-to-end before adding any new feature. Adoption depth beats feature breadth. - -18. **Collect feedback with a five-question form after first use** — After a developer deploys their first service through the platform, send a short form asking what was hard, what was missing, and what worked well. Act on the top two themes. - -19. **Publish the MVP changelog with known limitations** — Be transparent about what the MVP does not yet support. Teams trust a platform that is honest about its gaps more than one that over-promises. - -20. **Plan the next iteration based on MVP feedback** — Do not plan iteration two until the MVP has been used by real teams for at least two weeks. Use observed friction points and user feedback to prioritise the next set of capabilities. diff --git a/examples/07-adopt-before-you-build.md b/examples/07-adopt-before-you-build.md index 85868c2..191c347 100644 --- a/examples/07-adopt-before-you-build.md +++ b/examples/07-adopt-before-you-build.md @@ -1,6 +1,47 @@ --- layout: principle title: "Principle 7: Adopt before you build" +examples: + - title: "Use OpenTelemetry before writing a custom tracing SDK" + body: "Adopt the vendor-neutral OpenTelemetry standard for metrics, logs, and traces rather than building an in-house instrumentation library that only your organisation knows." + - title: "Adopt Argo CD before writing a custom deployment controller" + body: "GitOps controllers like Argo CD and Flux are mature, well-documented, and widely supported. Use them before building a bespoke deployment operator." + - title: "Use Helm charts before writing raw Kubernetes manifest generators" + body: "Helm is the de-facto Kubernetes package manager. Adopt it (or Kustomize) before developing a custom templating system that developers must learn from scratch." + - title: "Adopt Backstage before building a custom developer portal" + body: "Backstage is a mature, extensible internal developer portal. Evaluate it seriously before committing engineering months to a bespoke portal with fewer features." + - title: "Use Terraform modules from the public registry" + body: "Leverage community-maintained Terraform modules for common AWS, GCP, or Azure resources before writing your own. Fork only when the community module does not meet your constraints." + - title: "Adopt cert-manager for TLS before writing certificate automation scripts" + body: "cert-manager is a production-grade Kubernetes add-on. Use it to automate certificate lifecycle management rather than writing cron jobs that interact with the CA API directly." + - title: "Use Trivy or Grype for container scanning before building a scanner" + body: "Mature open-source image scanners exist and integrate with CI pipelines in minutes. Adopt one before writing custom vulnerability scanning tooling." + - title: "Leverage cloud provider IAM before building an authorisation service" + body: "AWS IAM, GCP IAM, and Azure RBAC cover most access control needs. Build a custom authorisation layer only when your requirements genuinely exceed what cloud IAM provides." + - title: "Use Renovate or Dependabot before writing dependency update scripts" + body: "Automated dependency management tools exist and support dozens of ecosystems. Adopt them rather than writing bespoke update scripts that require ongoing maintenance." + - title: "Adopt HashiCorp Vault before building a secrets service" + body: "Vault offers dynamic secrets, lease management, audit logging, and extensive integrations. Evaluate it before designing a secrets service from scratch." + - title: "Use Gateway API before writing a custom ingress controller" + body: "Kubernetes Gateway API provides standardised, extensible ingress configuration. Adopt it before building a proprietary ingress abstraction that teams must learn in addition to Kubernetes." + - title: "Adopt SLSA before creating a bespoke supply-chain security framework" + body: "SLSA (Supply-chain Levels for Software Artifacts) provides an industry-standard maturity model for software supply chain security. Follow it rather than inventing an internal equivalent." + - title: "Use OPA/Rego for policy-as-code before building a custom policy engine" + body: "OPA is widely adopted, well-documented, and supported by most CI/CD tools. Adopt it before writing a bespoke rule engine that only your team understands." + - title: "Leverage cloud-native autoscaling before building custom scaling logic" + body: "Kubernetes HPA, VPA, and KEDA cover the vast majority of scaling use cases. Adopt them before writing a custom scaling controller." + - title: "Use managed Postgres before building a database provisioning service" + body: "Cloud-managed databases (RDS, Cloud SQL, Azure Database) handle patching, backups, and failover. Provision them via Terraform before building a database-as-a-service layer." + - title: "Adopt trunk-based development conventions before writing branch management automation" + body: "Establish team conventions for branching and release models using documented standards before building tooling to enforce them." + - title: "Use cloud provider cost management tools before building a chargeback system" + body: "AWS Cost Explorer, GCP Billing, and Azure Cost Management provide tagging-based cost allocation. Use them before building a custom chargeback system." + - title: "Adopt conventional commits before writing a custom changelog generator" + body: "The conventional commits specification and tools like `release-please` or `semantic-release` handle changelog generation and versioning. Adopt them before writing bespoke tooling." + - title: "Use OWASP dependency-check before writing a security audit tool" + body: "OWASP dependency-check and similar tools audit third-party dependencies against known vulnerability databases. Integrate them into CI before building custom audit pipelines." + - title: "Audit your bespoke tools annually against the market" + body: "Every year, review each internally-built platform component against available open-source and commercial alternatives. Retire and replace when the market has caught up — maintaining bespoke tools is a cost that compounds." --- # Principle 7: Adopt before you build @@ -8,43 +49,3 @@ title: "Principle 7: Adopt before you build" > Apply established patterns, open standards, and proven tools before writing bespoke solutions. The platform engineering community has solved most foundational problems. Build only where genuine differentiation is needed — and retire what no longer serves. ## 20 Practical Examples - -1. **Use OpenTelemetry before writing a custom tracing SDK** — Adopt the vendor-neutral OpenTelemetry standard for metrics, logs, and traces rather than building an in-house instrumentation library that only your organisation knows. - -2. **Adopt Argo CD before writing a custom deployment controller** — GitOps controllers like Argo CD and Flux are mature, well-documented, and widely supported. Use them before building a bespoke deployment operator. - -3. **Use Helm charts before writing raw Kubernetes manifest generators** — Helm is the de-facto Kubernetes package manager. Adopt it (or Kustomize) before developing a custom templating system that developers must learn from scratch. - -4. **Adopt Backstage before building a custom developer portal** — Backstage is a mature, extensible internal developer portal. Evaluate it seriously before committing engineering months to a bespoke portal with fewer features. - -5. **Use Terraform modules from the public registry** — Leverage community-maintained Terraform modules for common AWS, GCP, or Azure resources before writing your own. Fork only when the community module does not meet your constraints. - -6. **Adopt cert-manager for TLS before writing certificate automation scripts** — cert-manager is a production-grade Kubernetes add-on. Use it to automate certificate lifecycle management rather than writing cron jobs that interact with the CA API directly. - -7. **Use Trivy or Grype for container scanning before building a scanner** — Mature open-source image scanners exist and integrate with CI pipelines in minutes. Adopt one before writing custom vulnerability scanning tooling. - -8. **Leverage cloud provider IAM before building an authorisation service** — AWS IAM, GCP IAM, and Azure RBAC cover most access control needs. Build a custom authorisation layer only when your requirements genuinely exceed what cloud IAM provides. - -9. **Use Renovate or Dependabot before writing dependency update scripts** — Automated dependency management tools exist and support dozens of ecosystems. Adopt them rather than writing bespoke update scripts that require ongoing maintenance. - -10. **Adopt HashiCorp Vault before building a secrets service** — Vault offers dynamic secrets, lease management, audit logging, and extensive integrations. Evaluate it before designing a secrets service from scratch. - -11. **Use Gateway API before writing a custom ingress controller** — Kubernetes Gateway API provides standardised, extensible ingress configuration. Adopt it before building a proprietary ingress abstraction that teams must learn in addition to Kubernetes. - -12. **Adopt SLSA before creating a bespoke supply-chain security framework** — SLSA (Supply-chain Levels for Software Artifacts) provides an industry-standard maturity model for software supply chain security. Follow it rather than inventing an internal equivalent. - -13. **Use OPA/Rego for policy-as-code before building a custom policy engine** — OPA is widely adopted, well-documented, and supported by most CI/CD tools. Adopt it before writing a bespoke rule engine that only your team understands. - -14. **Leverage cloud-native autoscaling before building custom scaling logic** — Kubernetes HPA, VPA, and KEDA cover the vast majority of scaling use cases. Adopt them before writing a custom scaling controller. - -15. **Use managed Postgres before building a database provisioning service** — Cloud-managed databases (RDS, Cloud SQL, Azure Database) handle patching, backups, and failover. Provision them via Terraform before building a database-as-a-service layer. - -16. **Adopt trunk-based development conventions before writing branch management automation** — Establish team conventions for branching and release models using documented standards before building tooling to enforce them. - -17. **Use cloud provider cost management tools before building a chargeback system** — AWS Cost Explorer, GCP Billing, and Azure Cost Management provide tagging-based cost allocation. Use them before building a custom chargeback system. - -18. **Adopt conventional commits before writing a custom changelog generator** — The conventional commits specification and tools like `release-please` or `semantic-release` handle changelog generation and versioning. Adopt them before writing bespoke tooling. - -19. **Use OWASP dependency-check before writing a security audit tool** — OWASP dependency-check and similar tools audit third-party dependencies against known vulnerability databases. Integrate them into CI before building custom audit pipelines. - -20. **Audit your bespoke tools annually against the market** — Every year, review each internally-built platform component against available open-source and commercial alternatives. Retire and replace when the market has caught up — maintaining bespoke tools is a cost that compounds. diff --git a/examples/08-deprecate-gracefully.md b/examples/08-deprecate-gracefully.md index fa58341..f0831f2 100644 --- a/examples/08-deprecate-gracefully.md +++ b/examples/08-deprecate-gracefully.md @@ -1,6 +1,47 @@ --- layout: principle title: "Principle 8: Deprecate gracefully" +examples: + - title: "Publish a written deprecation policy" + body: "Document the minimum notice period (e.g., 90 days), communication channels, and migration support commitment before you deprecate anything. Publish it where developers can find it." + - title: "Announce deprecations through multiple channels" + body: "Post a deprecation notice in Slack, the developer portal, the changelog, and directly in the tool or API being deprecated — all at the same time, not sequentially." + - title: "Embed a deprecation warning in the tool itself" + body: "Add a visible warning message to a deprecated CLI command, API response header, or pipeline step so that developers using it are reminded on every invocation, not just at announcement time." + - title: "Provide a migration guide alongside the deprecation notice" + body: "Never announce a deprecation without a step-by-step migration guide to the replacement. Developers should know exactly what to do before the end-of-life date." + - title: "Set a firm end-of-life date" + body: "Deprecations without a concrete end-of-life date drift indefinitely. Commit to a specific date and honour it to build the credibility needed for future deprecations to be taken seriously." + - title: "Offer migration tooling where possible" + body: "Provide a script, codemods, or automated PR generation to help teams migrate from the deprecated tool to its replacement, reducing the manual effort barrier to adoption." + - title: "Assign a migration owner" + body: "Name a specific platform team member responsible for tracking migration progress, answering questions, and chasing blockers for each deprecation. Ownerless deprecations stall." + - title: "Track migration progress in a shared dashboard" + body: "Show the number of services still using the deprecated component and update the count weekly. Visible progress motivates teams and allows leadership to prioritise migration work." + - title: "Run a deprecation retrospective" + body: "After every major deprecation cycle, hold a short retrospective to identify what worked and what caused friction. Use the learnings to improve the next deprecation." + - title: "Block new consumers from adopting deprecated components" + body: "Enforce a CI check or OPA policy that prevents new services from taking a dependency on a deprecated tool or API from the moment the deprecation is announced." + - title: "Maintain the deprecated tool at security-patch level only" + body: "Once a component is deprecated, commit to security patches only — no new features. Communicate this explicitly so teams understand that migrating is the only path to improvement." + - title: "Offer a supported transition period with dual-running paths" + body: "During the deprecation window, run both the old and new paths in parallel so teams can migrate at their own pace without being blocked on a single cutover weekend." + - title: "Extend deadlines only once, transparently" + body: "If migration is slower than expected, extend the end-of-life date once with a public explanation. Unlimited extensions erode trust and teach teams that deadlines are not real." + - title: "Recognise teams that migrate early" + body: "Acknowledge teams who complete migrations ahead of schedule in a company update or Slack channel. Social recognition motivates early adoption and reduces last-minute rushes." + - title: "Keep an end-of-life register" + body: "Maintain a public list of all platform components with their current lifecycle status (supported, deprecated, end-of-life). Review and update it each quarter." + - title: "Retire the deprecated component on the announced date" + body: "Actually turn it off on the published date. If deadlines slip repeatedly, teams learn to ignore deprecation notices and accumulate technical debt." + - title: "Document why the deprecation is happening" + body: "Explain the reason (security risk, maintenance burden, superseded by a better tool) in the deprecation notice. Teams who understand the \"why\" are more likely to prioritise migration." + - title: "Avoid deprecating multiple major components simultaneously" + body: "Stagger deprecations so teams are not asked to migrate from three components at once. Concurrent deprecations create burnout and resistance." + - title: "Capture deprecated APIs in API versioning strategy" + body: "Define in your API versioning policy how long each major version will be supported after a new version is released. This sets expectations before any specific deprecation is needed." + - title: "Archive, don't delete, retired repositories" + body: "When a platform component reaches end-of-life, archive its repository with a clear README pointing to the replacement. Deleted repositories break institutional knowledge and leave teams confused." --- # Principle 8: Deprecate gracefully @@ -8,43 +49,3 @@ title: "Principle 8: Deprecate gracefully" > Platforms that only add capabilities ossify. Every tool, API, and integration has a lifecycle — and retiring things gracefully is as important as introducing them. A published deprecation policy prevents the platform from becoming a museum. ## 20 Practical Examples - -1. **Publish a written deprecation policy** — Document the minimum notice period (e.g., 90 days), communication channels, and migration support commitment before you deprecate anything. Publish it where developers can find it. - -2. **Announce deprecations through multiple channels** — Post a deprecation notice in Slack, the developer portal, the changelog, and directly in the tool or API being deprecated — all at the same time, not sequentially. - -3. **Embed a deprecation warning in the tool itself** — Add a visible warning message to a deprecated CLI command, API response header, or pipeline step so that developers using it are reminded on every invocation, not just at announcement time. - -4. **Provide a migration guide alongside the deprecation notice** — Never announce a deprecation without a step-by-step migration guide to the replacement. Developers should know exactly what to do before the end-of-life date. - -5. **Set a firm end-of-life date** — Deprecations without a concrete end-of-life date drift indefinitely. Commit to a specific date and honour it to build the credibility needed for future deprecations to be taken seriously. - -6. **Offer migration tooling where possible** — Provide a script, codemods, or automated PR generation to help teams migrate from the deprecated tool to its replacement, reducing the manual effort barrier to adoption. - -7. **Assign a migration owner** — Name a specific platform team member responsible for tracking migration progress, answering questions, and chasing blockers for each deprecation. Ownerless deprecations stall. - -8. **Track migration progress in a shared dashboard** — Show the number of services still using the deprecated component and update the count weekly. Visible progress motivates teams and allows leadership to prioritise migration work. - -9. **Run a deprecation retrospective** — After every major deprecation cycle, hold a short retrospective to identify what worked and what caused friction. Use the learnings to improve the next deprecation. - -10. **Block new consumers from adopting deprecated components** — Enforce a CI check or OPA policy that prevents new services from taking a dependency on a deprecated tool or API from the moment the deprecation is announced. - -11. **Maintain the deprecated tool at security-patch level only** — Once a component is deprecated, commit to security patches only — no new features. Communicate this explicitly so teams understand that migrating is the only path to improvement. - -12. **Offer a supported transition period with dual-running paths** — During the deprecation window, run both the old and new paths in parallel so teams can migrate at their own pace without being blocked on a single cutover weekend. - -13. **Extend deadlines only once, transparently** — If migration is slower than expected, extend the end-of-life date once with a public explanation. Unlimited extensions erode trust and teach teams that deadlines are not real. - -14. **Recognise teams that migrate early** — Acknowledge teams who complete migrations ahead of schedule in a company update or Slack channel. Social recognition motivates early adoption and reduces last-minute rushes. - -15. **Keep an end-of-life register** — Maintain a public list of all platform components with their current lifecycle status (supported, deprecated, end-of-life). Review and update it each quarter. - -16. **Retire the deprecated component on the announced date** — Actually turn it off on the published date. If deadlines slip repeatedly, teams learn to ignore deprecation notices and accumulate technical debt. - -17. **Document why the deprecation is happening** — Explain the reason (security risk, maintenance burden, superseded by a better tool) in the deprecation notice. Teams who understand the "why" are more likely to prioritise migration. - -18. **Avoid deprecating multiple major components simultaneously** — Stagger deprecations so teams are not asked to migrate from three components at once. Concurrent deprecations create burnout and resistance. - -19. **Capture deprecated APIs in API versioning strategy** — Define in your API versioning policy how long each major version will be supported after a new version is released. This sets expectations before any specific deprecation is needed. - -20. **Archive, don't delete, retired repositories** — When a platform component reaches end-of-life, archive its repository with a clear README pointing to the replacement. Deleted repositories break institutional knowledge and leave teams confused. diff --git a/examples/09-build-the-foundation-before-the-portal.md b/examples/09-build-the-foundation-before-the-portal.md index 3143936..8c50340 100644 --- a/examples/09-build-the-foundation-before-the-portal.md +++ b/examples/09-build-the-foundation-before-the-portal.md @@ -1,6 +1,47 @@ --- layout: principle title: "Principle 9: Build the foundation before the portal" +examples: + - title: "Make deployments reliable before surfacing them in a UI" + body: "Ensure your deployment pipeline succeeds consistently (>99% success rate) before adding a portal button that triggers deployments. A portal that exposes flaky automation damages trust faster than no portal at all." + - title: "Define service SLOs before building a status dashboard" + body: "An uptime dashboard built on services without SLOs shows noise, not signal. Define SLOs per service first, then build the dashboard to track compliance against them." + - title: "Establish secret management before building a self-service credential UI" + body: "Set up Vault or a cloud-native secret store with proper policies and audit logging before exposing a self-service interface. The UI is only as safe as the backend it calls." + - title: "Complete the onboarding workflow manually before automating it" + body: "Walk three teams through onboarding using only runbooks and CLI commands before building a portal wizard. The manual process reveals missing steps that automated UIs silently skip." + - title: "Instrument all platform APIs before building dashboards" + body: "Add observability (request rates, error rates, latencies) to every internal API before surfacing metrics in a developer portal. Dashboards without reliable data mislead operators." + - title: "Implement access control policies before offering self-service provisioning" + body: "Define who can create which resources, in which environments, before building a portal that lets developers provision them. Self-service without guardrails is not self-service — it is a blast radius." + - title: "Test disaster recovery before showing availability metrics" + body: "Validate that backups restore, failover works, and recovery time meets your SLO before publishing an availability percentage. Displaying false confidence is worse than displaying no metric." + - title: "Resolve the top three developer pain points before adding portal features" + body: "Survey developers for their biggest friction points and fix the three most common platform-level issues before investing in new portal capabilities." + - title: "Standardise logging before building a log search UI" + body: "Define a structured log format and ensure all services emit it before building a search interface. A log UI over unstructured logs returns results teams cannot reliably query." + - title: "Establish a change management process before exposing change history in a portal" + body: "Implement deployment records (who deployed what, when, with what config) in your data layer before surfacing a change history view. The UI is only as trustworthy as the data it reads." + - title: "Validate the golden path end-to-end before adding portal shortcuts" + body: "Run a production deployment through the golden path manually and confirm every stage works correctly before adding one-click shortcuts. Shortcuts that hide broken steps are worse than no shortcuts." + - title: "Define the platform contract before building a service catalogue" + body: "Document what the platform provides and what it expects from services (resource labels, health endpoints, SLO definitions) before building a catalogue that displays compliance. The catalogue is a projection of the contract, not a substitute for it." + - title: "Make rollbacks reliable before building a portal deploy button" + body: "A deployment button that cannot roll back is a liability. Validate automated rollback end-to-end before surfacing deployment controls in a UI." + - title: "Fix flaky CI before building a CI status portal" + body: "A dashboard displaying flaky pipelines creates noise and trains teams to ignore it. Reduce the flaky test and infrastructure error rate to an acceptable level before investing in better pipeline visibility." + - title: "Complete network segmentation before building a service-to-service auth UI" + body: "Establish mTLS or a service mesh policy foundation before offering a portal for teams to configure service authorisation. The UI is meaningless if the underlying policy enforcement is absent." + - title: "Implement audit logging before adding self-service admin capabilities" + body: "Any portal feature that modifies production resources must have a corresponding audit trail. Build the audit log first; never add self-service controls without it." + - title: "Stabilise the underlying API before building multiple portal layers on top" + body: "If the internal API the portal calls is still changing frequently, adding portal layers multiplies the surface area of breaking changes. Stabilise the API contract before building additional consumers." + - title: "Establish cost allocation tags before building a cost portal" + body: "Tags applied consistently across all resources are the foundation of a cost dashboard. A cost portal built without reliable tagging shows inaccurate data and erodes confidence." + - title: "Write automated platform health checks before creating a health status page" + body: "Implement active checks that verify each platform component end-to-end before displaying a status page. A status page driven by human updates is out of date the moment it matters most." + - title: "Train the platform team on the new capability before opening it to users" + body: "Before launching a portal feature, ensure every platform team member can use it, explain it, and handle support requests for it. A portal that the team behind it does not understand will generate more questions than it answers." --- # Principle 9: Build the foundation before the portal @@ -8,43 +49,3 @@ title: "Principle 9: Build the foundation before the portal" > A portal built on a broken foundation will be abandoned. Invest in capabilities, reliability, and contracts first — the interface amplifies what is already there; it cannot substitute for what is not. ## 20 Practical Examples - -1. **Make deployments reliable before surfacing them in a UI** — Ensure your deployment pipeline succeeds consistently (>99% success rate) before adding a portal button that triggers deployments. A portal that exposes flaky automation damages trust faster than no portal at all. - -2. **Define service SLOs before building a status dashboard** — An uptime dashboard built on services without SLOs shows noise, not signal. Define SLOs per service first, then build the dashboard to track compliance against them. - -3. **Establish secret management before building a self-service credential UI** — Set up Vault or a cloud-native secret store with proper policies and audit logging before exposing a self-service interface. The UI is only as safe as the backend it calls. - -4. **Complete the onboarding workflow manually before automating it** — Walk three teams through onboarding using only runbooks and CLI commands before building a portal wizard. The manual process reveals missing steps that automated UIs silently skip. - -5. **Instrument all platform APIs before building dashboards** — Add observability (request rates, error rates, latencies) to every internal API before surfacing metrics in a developer portal. Dashboards without reliable data mislead operators. - -6. **Implement access control policies before offering self-service provisioning** — Define who can create which resources, in which environments, before building a portal that lets developers provision them. Self-service without guardrails is not self-service — it is a blast radius. - -7. **Test disaster recovery before showing availability metrics** — Validate that backups restore, failover works, and recovery time meets your SLO before publishing an availability percentage. Displaying false confidence is worse than displaying no metric. - -8. **Resolve the top three developer pain points before adding portal features** — Survey developers for their biggest friction points and fix the three most common platform-level issues before investing in new portal capabilities. - -9. **Standardise logging before building a log search UI** — Define a structured log format and ensure all services emit it before building a search interface. A log UI over unstructured logs returns results teams cannot reliably query. - -10. **Establish a change management process before exposing change history in a portal** — Implement deployment records (who deployed what, when, with what config) in your data layer before surfacing a change history view. The UI is only as trustworthy as the data it reads. - -11. **Validate the golden path end-to-end before adding portal shortcuts** — Run a production deployment through the golden path manually and confirm every stage works correctly before adding one-click shortcuts. Shortcuts that hide broken steps are worse than no shortcuts. - -12. **Define the platform contract before building a service catalogue** — Document what the platform provides and what it expects from services (resource labels, health endpoints, SLO definitions) before building a catalogue that displays compliance. The catalogue is a projection of the contract, not a substitute for it. - -13. **Make rollbacks reliable before building a portal deploy button** — A deployment button that cannot roll back is a liability. Validate automated rollback end-to-end before surfacing deployment controls in a UI. - -14. **Fix flaky CI before building a CI status portal** — A dashboard displaying flaky pipelines creates noise and trains teams to ignore it. Reduce the flaky test and infrastructure error rate to an acceptable level before investing in better pipeline visibility. - -15. **Complete network segmentation before building a service-to-service auth UI** — Establish mTLS or a service mesh policy foundation before offering a portal for teams to configure service authorisation. The UI is meaningless if the underlying policy enforcement is absent. - -16. **Implement audit logging before adding self-service admin capabilities** — Any portal feature that modifies production resources must have a corresponding audit trail. Build the audit log first; never add self-service controls without it. - -17. **Stabilise the underlying API before building multiple portal layers on top** — If the internal API the portal calls is still changing frequently, adding portal layers multiplies the surface area of breaking changes. Stabilise the API contract before building additional consumers. - -18. **Establish cost allocation tags before building a cost portal** — Tags applied consistently across all resources are the foundation of a cost dashboard. A cost portal built without reliable tagging shows inaccurate data and erodes confidence. - -19. **Write automated platform health checks before creating a health status page** — Implement active checks that verify each platform component end-to-end before displaying a status page. A status page driven by human updates is out of date the moment it matters most. - -20. **Train the platform team on the new capability before opening it to users** — Before launching a portal feature, ensure every platform team member can use it, explain it, and handle support requests for it. A portal that the team behind it does not understand will generate more questions than it answers. diff --git a/examples/10-treat-developer-experience-as-a-product.md b/examples/10-treat-developer-experience-as-a-product.md index bad62a6..676dcf7 100644 --- a/examples/10-treat-developer-experience-as-a-product.md +++ b/examples/10-treat-developer-experience-as-a-product.md @@ -1,6 +1,47 @@ --- layout: principle title: "Principle 10: Treat developer experience as a product in its own right" +examples: + - title: "Write error messages that include the fix" + body: "Every platform CLI and API error should tell the developer not just what went wrong, but what to do next. Replace `Error: invalid config` with `Error: 'replicas' must be ≥ 1. See: https://platform/docs/config#replicas`." + - title: "Create a \"getting started in 30 minutes\" guide" + body: "Write a guide that takes a new engineer from zero to a deployed service in under 30 minutes. Test it with actual new hires and fix every step that takes longer than described." + - title: "Keep all documentation in one place" + body: "Maintain a single, authoritative documentation site. Avoid scattering docs across Confluence, GitHub READMEs, Slack pins, and a developer portal simultaneously — fragmentation is its own form of broken experience." + - title: "Add working code examples to every documentation page" + body: "Every documentation page that describes a capability should include a copy-paste code sample that actually works. Broken examples destroy trust faster than missing ones." + - title: "Test the onboarding journey with a timer" + body: "Have a new team member attempt onboarding with only the public documentation and a timer. Identify every step where they need to ask for help and fix those steps before the next onboarding." + - title: "Provide a `platform doctor` command" + body: "Offer a CLI subcommand that checks the developer's local environment and reports missing tools, expired credentials, or misconfigurations with clear remediation steps." + - title: "Design CLI commands to be discoverable" + body: "Ensure every `platform help` and `platform --help` output is accurate, comprehensive, and includes examples. Developers explore capabilities through the CLI help system." + - title: "Send a contextual welcome email after first deployment" + body: "After a team successfully deploys for the first time, send a short automated email that links to the next logical step: setting up alerts, configuring autoscaling, or joining the platform community." + - title: "Name things consistently" + body: "Use consistent naming conventions across the CLI, portal, documentation, and Slack channels. Inconsistency (e.g., \"workload\" in docs vs. \"service\" in the CLI vs. \"app\" in the portal) creates unnecessary cognitive load." + - title: "Version and datestamp all documentation" + body: "Display when each documentation page was last updated and which platform version it applies to. Stale undated documentation causes developers to waste time on instructions that no longer apply." + - title: "Provide a sandbox environment" + body: "Give every team access to a cost-controlled sandbox where they can experiment with platform capabilities without risk of affecting production workloads or incurring unexpected costs." + - title: "Design forms that prevent invalid input" + body: "In web-based self-service forms, use dropdowns, validated fields, and inline guidance rather than free-text inputs that fail silently after submission." + - title: "Add \"why\" to every policy enforcement message" + body: "When a CI check or policy gate blocks a deployment, include a one-line explanation of why the policy exists alongside the fix instruction. Unexplained rejections create resentment." + - title: "Offer a dry-run mode for destructive commands" + body: "Every CLI command that modifies or deletes resources should support a `--dry-run` flag that shows the planned changes without executing them." + - title: "Conduct usability testing on new platform features" + body: "Before launching a new portal feature or CLI command, watch two or three developers use it with no assistance. Identify every point of confusion and address it before the general release." + - title: "Surface relevant documentation in context" + body: "When a developer is viewing a service in the portal, surface links to the relevant golden-path guide, runbook, and alert policy for that service — not a link to the documentation home page." + - title: "Build a changelog developers actually read" + body: "Format the platform changelog for a developer audience: short entries, grouped by user impact (new, improved, fixed, deprecated), with links to migration guides when relevant." + - title: "Respond to every support request within one business day" + body: "A fast, helpful response to the first question a developer asks sets the tone for their entire relationship with the platform. Slow or dismissive responses drive shadow IT." + - title: "Provide copy-paste Slack notification configs" + body: "Supply ready-to-use alert notification configurations (Slack webhooks, PagerDuty integrations) in documentation so teams can set up alerting in minutes, not hours." + - title: "Run an annual developer experience review" + body: "Once per year, audit the entire developer journey — from first repo creation to production deployment to incident response — and identify the three highest-friction points to prioritise in the next planning cycle." --- # Principle 10: Treat developer experience as a product in its own right @@ -8,43 +49,3 @@ title: "Principle 10: Treat developer experience as a product in its own right" > The platform's interface — documentation, onboarding, error messages, and CLI ergonomics — is as important as its capabilities. A powerful platform that is hard to discover or understand will be abandoned for something simpler. Developer experience is not polish; it is the product. ## 20 Practical Examples - -1. **Write error messages that include the fix** — Every platform CLI and API error should tell the developer not just what went wrong, but what to do next. Replace `Error: invalid config` with `Error: 'replicas' must be ≥ 1. See: https://platform/docs/config#replicas`. - -2. **Create a "getting started in 30 minutes" guide** — Write a guide that takes a new engineer from zero to a deployed service in under 30 minutes. Test it with actual new hires and fix every step that takes longer than described. - -3. **Keep all documentation in one place** — Maintain a single, authoritative documentation site. Avoid scattering docs across Confluence, GitHub READMEs, Slack pins, and a developer portal simultaneously — fragmentation is its own form of broken experience. - -4. **Add working code examples to every documentation page** — Every documentation page that describes a capability should include a copy-paste code sample that actually works. Broken examples destroy trust faster than missing ones. - -5. **Test the onboarding journey with a timer** — Have a new team member attempt onboarding with only the public documentation and a timer. Identify every step where they need to ask for help and fix those steps before the next onboarding. - -6. **Provide a `platform doctor` command** — Offer a CLI subcommand that checks the developer's local environment and reports missing tools, expired credentials, or misconfigurations with clear remediation steps. - -7. **Design CLI commands to be discoverable** — Ensure every `platform help` and `platform --help` output is accurate, comprehensive, and includes examples. Developers explore capabilities through the CLI help system. - -8. **Send a contextual welcome email after first deployment** — After a team successfully deploys for the first time, send a short automated email that links to the next logical step: setting up alerts, configuring autoscaling, or joining the platform community. - -9. **Name things consistently** — Use consistent naming conventions across the CLI, portal, documentation, and Slack channels. Inconsistency (e.g., "workload" in docs vs. "service" in the CLI vs. "app" in the portal) creates unnecessary cognitive load. - -10. **Version and datestamp all documentation** — Display when each documentation page was last updated and which platform version it applies to. Stale undated documentation causes developers to waste time on instructions that no longer apply. - -11. **Provide a sandbox environment** — Give every team access to a cost-controlled sandbox where they can experiment with platform capabilities without risk of affecting production workloads or incurring unexpected costs. - -12. **Design forms that prevent invalid input** — In web-based self-service forms, use dropdowns, validated fields, and inline guidance rather than free-text inputs that fail silently after submission. - -13. **Add "why" to every policy enforcement message** — When a CI check or policy gate blocks a deployment, include a one-line explanation of why the policy exists alongside the fix instruction. Unexplained rejections create resentment. - -14. **Offer a dry-run mode for destructive commands** — Every CLI command that modifies or deletes resources should support a `--dry-run` flag that shows the planned changes without executing them. - -15. **Conduct usability testing on new platform features** — Before launching a new portal feature or CLI command, watch two or three developers use it with no assistance. Identify every point of confusion and address it before the general release. - -16. **Surface relevant documentation in context** — When a developer is viewing a service in the portal, surface links to the relevant golden-path guide, runbook, and alert policy for that service — not a link to the documentation home page. - -17. **Build a changelog developers actually read** — Format the platform changelog for a developer audience: short entries, grouped by user impact (new, improved, fixed, deprecated), with links to migration guides when relevant. - -18. **Respond to every support request within one business day** — A fast, helpful response to the first question a developer asks sets the tone for their entire relationship with the platform. Slow or dismissive responses drive shadow IT. - -19. **Provide copy-paste Slack notification configs** — Supply ready-to-use alert notification configurations (Slack webhooks, PagerDuty integrations) in documentation so teams can set up alerting in minutes, not hours. - -20. **Run an annual developer experience review** — Once per year, audit the entire developer journey — from first repo creation to production deployment to incident response — and identify the three highest-friction points to prioritise in the next planning cycle. diff --git a/examples/11-invest-in-people-process-and-culture.md b/examples/11-invest-in-people-process-and-culture.md index 38a75af..059cb33 100644 --- a/examples/11-invest-in-people-process-and-culture.md +++ b/examples/11-invest-in-people-process-and-culture.md @@ -1,6 +1,47 @@ --- layout: principle title: "Principle 11: Invest in people, process, and culture — amplified by tools" +examples: + - title: "Define team topologies before selecting tools" + body: "Decide which teams own which capabilities (stream-aligned, platform, enabling, complicated-subsystem) before evaluating tooling. Buying a tool before designing team ownership creates orphaned capabilities." + - title: "Create a shared on-call rotation with product teams" + body: "Include at least one product engineer in platform incidents and include a platform engineer in product on-call rotations. Cross-pollination builds empathy and shared ownership." + - title: "Run a \"platform as product\" enablement session" + body: "Hold a half-day workshop for the platform team on product thinking, user research, and roadmap management. Invest in this before investing in new tooling." + - title: "Document the platform team's decision-making process" + body: "Publish how the platform team makes technology decisions (ADRs, RFCs, approval criteria) so product teams understand how to propose changes and why decisions were made." + - title: "Establish a platform community of practice" + body: "Create a regular (monthly) cross-team forum where engineers who use the platform share patterns, discuss pain points, and propose improvements. Tools do not build communities — people do." + - title: "Tie platform OKRs to developer outcomes, not platform outputs" + body: "Set objectives like \"reduce time-to-first-deployment by 40%\" not \"deliver three new platform features.\" Outcome-based OKRs align the platform team's incentives with user value." + - title: "Require platform engineers to spend time embedded in product teams" + body: "Rotate platform engineers into product teams for two weeks per quarter so they experience the developer journey firsthand and build relationships that improve feedback quality." + - title: "Write Architecture Decision Records (ADRs) for all major platform choices" + body: "Document the context, options considered, and rationale for every significant platform decision in version-controlled ADRs so institutional knowledge survives team changes." + - title: "Establish clear escalation paths" + body: "Document who owns what in the platform and how to escalate unresolved issues. Ambiguous ownership causes incidents to stall and erodes trust in the platform." + - title: "Hold regular platform team retrospectives" + body: "Run bi-weekly or monthly retrospectives within the platform team to surface and address internal process problems before they affect the developer experience." + - title: "Share the platform roadmap with engineering leadership quarterly" + body: "Present the platform roadmap and adoption metrics to engineering directors each quarter to maintain stakeholder buy-in and ensure platform investment is protected." + - title: "Create an \"inner source\" contribution model" + body: "Publish a `CONTRIBUTING.md` that allows product engineers to submit pull requests to platform components with a defined review and acceptance process. Tools cannot accept contributions — process does." + - title: "Pair platform engineers with new team onboarding" + body: "Assign a platform engineer as a \"platform buddy\" to each new product team for their first two weeks. Human support accelerates adoption faster than any amount of documentation." + - title: "Establish a blameless post-mortem culture" + body: "Run blameless post-mortems for all platform incidents and publish the reports internally. A culture of blame suppresses incident reporting, which means problems stay hidden until they cause major outages." + - title: "Set explicit working agreements for platform team interactions" + body: "Agree and publish SLAs for platform support responses, change review turnaround, and feature request triage. Clear agreements prevent frustration on both sides." + - title: "Map platform boundaries to team cognitive load" + body: "Evaluate whether the scope of the platform team's responsibilities is manageable for the team's size. An overloaded platform team produces poor DX regardless of the quality of their tools." + - title: "Celebrate cross-team contributions publicly" + body: "Acknowledge product team engineers who contribute fixes or improvements to platform components in company-wide communications. Recognition reinforces the culture of shared ownership." + - title: "Create an internal platform champions network" + body: "Identify and empower one engaged developer in each product team as a platform champion. Champions multiply the reach of the platform team without requiring the platform team to grow proportionally." + - title: "Document the \"you build it, you run it\" boundary explicitly" + body: "Clearly define what the platform team runs operationally versus what product teams own. Ambiguous operational boundaries cause incidents to be mis-routed and breed resentment." + - title: "Invest in psychological safety before process improvement" + body: "Before introducing any new process (incident reviews, code reviews, SLO reviews), ensure the team has the psychological safety to raise problems honestly. Processes imposed on fearful teams produce theatre, not improvement." --- # Principle 11: Invest in people, process, and culture — amplified by tools @@ -8,43 +49,3 @@ title: "Principle 11: Invest in people, process, and culture — amplified by to > Tools are multipliers, but only when the right culture and clear processes already exist. No tool compensates for a lack of product thinking or stakeholder trust. Your platform's boundaries will reflect your organisational structure whether you plan it or not — treat team topology and platform scope as a single design decision. ## 20 Practical Examples - -1. **Define team topologies before selecting tools** — Decide which teams own which capabilities (stream-aligned, platform, enabling, complicated-subsystem) before evaluating tooling. Buying a tool before designing team ownership creates orphaned capabilities. - -2. **Create a shared on-call rotation with product teams** — Include at least one product engineer in platform incidents and include a platform engineer in product on-call rotations. Cross-pollination builds empathy and shared ownership. - -3. **Run a "platform as product" enablement session** — Hold a half-day workshop for the platform team on product thinking, user research, and roadmap management. Invest in this before investing in new tooling. - -4. **Document the platform team's decision-making process** — Publish how the platform team makes technology decisions (ADRs, RFCs, approval criteria) so product teams understand how to propose changes and why decisions were made. - -5. **Establish a platform community of practice** — Create a regular (monthly) cross-team forum where engineers who use the platform share patterns, discuss pain points, and propose improvements. Tools do not build communities — people do. - -6. **Tie platform OKRs to developer outcomes, not platform outputs** — Set objectives like "reduce time-to-first-deployment by 40%" not "deliver three new platform features." Outcome-based OKRs align the platform team's incentives with user value. - -7. **Require platform engineers to spend time embedded in product teams** — Rotate platform engineers into product teams for two weeks per quarter so they experience the developer journey firsthand and build relationships that improve feedback quality. - -8. **Write Architecture Decision Records (ADRs) for all major platform choices** — Document the context, options considered, and rationale for every significant platform decision in version-controlled ADRs so institutional knowledge survives team changes. - -9. **Establish clear escalation paths** — Document who owns what in the platform and how to escalate unresolved issues. Ambiguous ownership causes incidents to stall and erodes trust in the platform. - -10. **Hold regular platform team retrospectives** — Run bi-weekly or monthly retrospectives within the platform team to surface and address internal process problems before they affect the developer experience. - -11. **Share the platform roadmap with engineering leadership quarterly** — Present the platform roadmap and adoption metrics to engineering directors each quarter to maintain stakeholder buy-in and ensure platform investment is protected. - -12. **Create an "inner source" contribution model** — Publish a `CONTRIBUTING.md` that allows product engineers to submit pull requests to platform components with a defined review and acceptance process. Tools cannot accept contributions — process does. - -13. **Pair platform engineers with new team onboarding** — Assign a platform engineer as a "platform buddy" to each new product team for their first two weeks. Human support accelerates adoption faster than any amount of documentation. - -14. **Establish a blameless post-mortem culture** — Run blameless post-mortems for all platform incidents and publish the reports internally. A culture of blame suppresses incident reporting, which means problems stay hidden until they cause major outages. - -15. **Set explicit working agreements for platform team interactions** — Agree and publish SLAs for platform support responses, change review turnaround, and feature request triage. Clear agreements prevent frustration on both sides. - -16. **Map platform boundaries to team cognitive load** — Evaluate whether the scope of the platform team's responsibilities is manageable for the team's size. An overloaded platform team produces poor DX regardless of the quality of their tools. - -17. **Celebrate cross-team contributions publicly** — Acknowledge product team engineers who contribute fixes or improvements to platform components in company-wide communications. Recognition reinforces the culture of shared ownership. - -18. **Create an internal platform champions network** — Identify and empower one engaged developer in each product team as a platform champion. Champions multiply the reach of the platform team without requiring the platform team to grow proportionally. - -19. **Document the "you build it, you run it" boundary explicitly** — Clearly define what the platform team runs operationally versus what product teams own. Ambiguous operational boundaries cause incidents to be mis-routed and breed resentment. - -20. **Invest in psychological safety before process improvement** — Before introducing any new process (incident reviews, code reviews, SLO reviews), ensure the team has the psychological safety to raise problems honestly. Processes imposed on fearful teams produce theatre, not improvement. diff --git a/examples/12-define-and-honour-the-platform-contract.md b/examples/12-define-and-honour-the-platform-contract.md index 51d0abb..577834b 100644 --- a/examples/12-define-and-honour-the-platform-contract.md +++ b/examples/12-define-and-honour-the-platform-contract.md @@ -1,6 +1,47 @@ --- layout: principle title: "Principle 12: Define and honour the platform contract" +examples: + - title: "Publish a one-page platform contract" + body: "Write a clear, human-readable document that states what the platform provides (e.g., \"99.9% pipeline availability during business hours\"), what it requires from services (e.g., standard labels, health endpoints), and what it explicitly does not cover." + - title: "Define SLOs for every platform component" + body: "Publish service level objectives (availability, latency, error rate) for CI/CD, artifact registry, secret management, and the developer portal. Post them in the developer documentation." + - title: "Publish an ownership boundary map" + body: "Create a diagram or table that shows which team owns what: which Kubernetes components the platform team manages versus what product teams are responsible for, down to the namespace level." + - title: "Document the supported resource request process" + body: "Publish a clear process for how teams request new platform capabilities: how to submit a request, what information is needed, who reviews it, and what the decision timeline is." + - title: "Version the platform contract" + body: "Treat the platform contract as a versioned document. When the contract changes, publish a diff, communicate the change, and give teams time to adapt — just as you would a versioned API." + - title: "Provide a written process for stepping off the golden path" + body: "Publish the steps for teams that need to diverge: what justification is required, who approves exceptions, how the exception is tracked, and when it is reviewed for removal." + - title: "Enforce the contract through automated checks" + body: "Back up the contract's requirements (labels, resource limits, health probes) with CI policy checks and admission controllers so compliance is verified continuously, not just during audits." + - title: "Define and communicate breaking change policy" + body: "Document what constitutes a breaking change to the platform contract, how far in advance teams will be notified, and what migration support will be provided. Never break the contract silently." + - title: "Publish an incident escalation matrix" + body: "Provide a clear matrix that tells product teams exactly who to contact for each type of platform incident, at what severity level, and through which channel — so they are never guessing during an outage." + - title: "Set explicit data retention policies" + body: "State in the contract how long logs, metrics, traces, and artifacts are retained. Teams designing compliance or audit capabilities need to know this boundary before building, not after." + - title: "Define \"best-effort\" vs. \"supported\" capabilities" + body: "Label each platform feature as \"supported\" (with SLO guarantees) or \"best-effort\" (available but without formal SLO). Teams need to know which tier a capability sits in before relying on it for production workloads." + - title: "Run a quarterly contract review" + body: "Hold a structured review each quarter where the platform team and product team representatives assess whether the contract is still accurate, whether new obligations have emerged, and whether any guarantees need updating." + - title: "Document what happens during platform maintenance windows" + body: "State in the contract when planned maintenance windows occur, how teams are notified, and what the expected behaviour of platform components is during those windows." + - title: "Establish an SLA for exception reviews" + body: "When a team submits a request to diverge from the golden path, commit to a decision within a stated timeframe (e.g., five business days). Unanswered exception requests lead teams to bypass the process entirely." + - title: "Publish a cost contract alongside the technical contract" + body: "Specify what platform infrastructure costs are covered centrally and what costs teams are responsible for. Unclear cost boundaries create billing surprises and strained relationships." + - title: "Provide contract compliance reports to teams" + body: "Generate and send periodic reports to each product team showing which contract requirements they meet and which they do not, with links to remediation guidance." + - title: "Create a contract FAQ" + body: "Maintain a living FAQ that answers the most common questions teams ask about the contract (\"Can I use a different database?\", \"Who handles TLS certificate renewal?\"). Update it after every recurring support question." + - title: "Link the platform contract in every service's README template" + body: "Include a link to the current platform contract in the golden-path service template README so every new service owner reads it during onboarding." + - title: "Hold the platform team accountable to the contract" + body: "Report SLO compliance for each platform component monthly, publicly within the engineering organisation. If the platform team misses SLOs, it should be as visible as a product team outage." + - title: "Retire obligations the platform no longer needs" + body: "When a platform responsibility is transferred to a team or automated away, formally remove it from the contract and communicate the change. A contract that grows but never shrinks becomes unmanageable." --- # Principle 12: Define and honour the platform contract @@ -8,43 +49,3 @@ title: "Principle 12: Define and honour the platform contract" > The platform's job is to absorb operational complexity so that developers need not manage it themselves. In return, developers who stay within the contract receive its guarantees. Make this deal explicit — through service level objectives, documented ownership boundaries, and a published process for stepping off the path. ## 20 Practical Examples - -1. **Publish a one-page platform contract** — Write a clear, human-readable document that states what the platform provides (e.g., "99.9% pipeline availability during business hours"), what it requires from services (e.g., standard labels, health endpoints), and what it explicitly does not cover. - -2. **Define SLOs for every platform component** — Publish service level objectives (availability, latency, error rate) for CI/CD, artifact registry, secret management, and the developer portal. Post them in the developer documentation. - -3. **Publish an ownership boundary map** — Create a diagram or table that shows which team owns what: which Kubernetes components the platform team manages versus what product teams are responsible for, down to the namespace level. - -4. **Document the supported resource request process** — Publish a clear process for how teams request new platform capabilities: how to submit a request, what information is needed, who reviews it, and what the decision timeline is. - -5. **Version the platform contract** — Treat the platform contract as a versioned document. When the contract changes, publish a diff, communicate the change, and give teams time to adapt — just as you would a versioned API. - -6. **Provide a written process for stepping off the golden path** — Publish the steps for teams that need to diverge: what justification is required, who approves exceptions, how the exception is tracked, and when it is reviewed for removal. - -7. **Enforce the contract through automated checks** — Back up the contract's requirements (labels, resource limits, health probes) with CI policy checks and admission controllers so compliance is verified continuously, not just during audits. - -8. **Define and communicate breaking change policy** — Document what constitutes a breaking change to the platform contract, how far in advance teams will be notified, and what migration support will be provided. Never break the contract silently. - -9. **Publish an incident escalation matrix** — Provide a clear matrix that tells product teams exactly who to contact for each type of platform incident, at what severity level, and through which channel — so they are never guessing during an outage. - -10. **Set explicit data retention policies** — State in the contract how long logs, metrics, traces, and artifacts are retained. Teams designing compliance or audit capabilities need to know this boundary before building, not after. - -11. **Define "best-effort" vs. "supported" capabilities** — Label each platform feature as "supported" (with SLO guarantees) or "best-effort" (available but without formal SLO). Teams need to know which tier a capability sits in before relying on it for production workloads. - -12. **Run a quarterly contract review** — Hold a structured review each quarter where the platform team and product team representatives assess whether the contract is still accurate, whether new obligations have emerged, and whether any guarantees need updating. - -13. **Document what happens during platform maintenance windows** — State in the contract when planned maintenance windows occur, how teams are notified, and what the expected behaviour of platform components is during those windows. - -14. **Establish an SLA for exception reviews** — When a team submits a request to diverge from the golden path, commit to a decision within a stated timeframe (e.g., five business days). Unanswered exception requests lead teams to bypass the process entirely. - -15. **Publish a cost contract alongside the technical contract** — Specify what platform infrastructure costs are covered centrally and what costs teams are responsible for. Unclear cost boundaries create billing surprises and strained relationships. - -16. **Provide contract compliance reports to teams** — Generate and send periodic reports to each product team showing which contract requirements they meet and which they do not, with links to remediation guidance. - -17. **Create a contract FAQ** — Maintain a living FAQ that answers the most common questions teams ask about the contract ("Can I use a different database?", "Who handles TLS certificate renewal?"). Update it after every recurring support question. - -18. **Link the platform contract in every service's README template** — Include a link to the current platform contract in the golden-path service template README so every new service owner reads it during onboarding. - -19. **Hold the platform team accountable to the contract** — Report SLO compliance for each platform component monthly, publicly within the engineering organisation. If the platform team misses SLOs, it should be as visible as a product team outage. - -20. **Retire obligations the platform no longer needs** — When a platform responsibility is transferred to a team or automated away, formally remove it from the contract and communicate the change. A contract that grows but never shrinks becomes unmanageable.