0. Engineering Roles
Purpose: Define the core roles across Engineering (and critical adjacent partners) with crisp missions, decision rights, operating loops, KPIs, and “definition of done.” The section ends with a RACI matrix covering the most common responsibilities.
Default assumptions: single paved road; ringed deploys; SLOs before GA; one owner per service; WIP ≤ 2 bets/squad.
1) CTO — Chief Technology Officer
North Star: Turn strategy into shipped, reliable outcomes.
Owns (A): Org topology; quality bars policy (flags/telemetry/rollback/SLOs); architecture principles; reliability posture; talent system; board/executive tech narrative.
Weekly loops: Outcome review; reliability huddle; platform council; hiring loop; 1:1s with leads.
KPIs: Lead time p50 < 24h · CFR < 15% · MTTR < 1h · SLO attainment ≥ 99% · Platform adoption ≥ 80%.
Definition of Done: 100% services owned with SLOs & runbooks; policy gates enforced in CI; two months of green DORA+SLO; cadence running (weekly/monthly/quarterly).
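The DORA-style KPIs above (lead time p50, CFR, MTTR) are simple to compute from a window of deploy records. A minimal sketch, assuming a hypothetical `Deploy` record with illustrative field names (not a real catalog schema):

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional

# Hypothetical deploy record; field names are illustrative assumptions.
@dataclass
class Deploy:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False                    # caused an incident or rollback
    restored_at: Optional[datetime] = None  # when service recovered, if failed

def dora_snapshot(deploys: list) -> dict:
    """Lead time p50 (hours), CFR (%), MTTR (hours) for one window.

    Assumes at least one deploy in the window.
    """
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "lead_time_p50_h": median(lt.total_seconds() for lt in lead_times) / 3600,
        "cfr_pct": 100 * len(failures) / len(deploys),
        "mttr_h": (sum(r.total_seconds() for r in restores) / 3600 / len(restores))
                  if restores else 0.0,
    }
```

Feeding this from the deploy pipeline gives the weekly scoreboard the CTO's outcome review reads from.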
2) HoE — Head of Engineering / VP Eng
North Star: Predictable, high‑quality delivery.
Owns (A): Delivery ops (WIP caps, change calendar), gate enforcement, service catalog hygiene, incident program, hiring/onboarding execution, dependency management.
Weekly loops: WIP & deps sweep · team outcome reviews · reliability huddle · platform check · hiring/people.
KPIs: Lead time < 24h · deploy ≥ daily · CFR < 15% · MTTR < 1h · wait time ≤ 2d · adoption ≥ 80%.
Definition of Done: Ops scoreboard live; gates enforced in CI; pages ≤2/team/wk; adoption trending to 80%.
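“Gates enforced in CI” means a pipeline step that fails the build when a service's catalog entry is missing a required quality bar. A minimal sketch; the schema and field names below are illustrative assumptions, not a standard:

```python
# Sketch of a CI policy gate: fail the build unless the service's catalog
# entry declares the required quality bars. Field names are assumptions.
REQUIRED_BARS = ("owner", "slo", "runbook_url", "rollback", "telemetry")

def missing_quality_bars(catalog_entry: dict) -> list:
    """Return the missing quality bars (empty list means the gate passes)."""
    return [bar for bar in REQUIRED_BARS if not catalog_entry.get(bar)]

def gate(catalog_entry: dict) -> None:
    """Exit non-zero in CI when any required bar is absent."""
    missing = missing_quality_bars(catalog_entry)
    if missing:
        raise SystemExit(f"policy gate failed, missing: {', '.join(missing)}")
```

Running this per service in CI is what turns the quality-bars policy from a document into an enforced default.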
3) HoP — Head of Platform
North Star: A paved road that makes the right way the easy way.
Owns (A): Templates/scaffolds; CI/CD; preview envs; observability defaults; platform SLOs; deprecations; supply chain (SBOM/scanning); internal customer success (SLA/docs/office hours).
Weekly loops: SLO dashboard review · backlog triage · office hours · migrations check.
KPIs: Build p50 < 10m (p95 < 20m) · Preview p95 < 5m · Uptime ≥ 99.9% · Flake < 2% · Adoption ≥ 80% · Dev NPS ≥ +40.
Definition of Done: Charter+SLOs published; SLOs green; ≥60% adoption and rising; exception memos tracked; deprecation wave completed with guides.
4) Stream TL — Tech Lead (value‑stream squad)
North Star: Ship outcomes quickly and safely for one stream.
Owns (A): End‑to‑end of stream services (SLOs, runbooks, on‑call); testing strategy; ringed deploys; within‑boundary architecture; security/privacy for changes.
Weekly loops: Outcome review (with PM) · WIP ≤ 2 & dependency pass · reliability huddle · post‑deploy verification · tech‑debt/perf slice.
KPIs: Lead time < 24h · deploy ≥ daily · CFR < 15% · MTTR < 1h · SLOs green · wait time ≤ 2d.
Definition of Done: One owner per service; contract tests for external interfaces; rollback < 5m; preview p95 < 5m; pager quiet (≤2/wk).
5) PM — Product Manager (stream‑aligned)
North Star: Move the NSM and KRs with the fewest, safest changes.
Owns (A): Problem briefs & success metrics; within‑stream portfolio (WIP ≤ 2); experiment plans (MDE/power/guardrails); readiness gates; release notes & enablement; outcome reviews.
Weekly loops: Outcome review · backlog/WIP check · customer time · evidence & decision notes · GTM sync.
KPIs: KR delta/week on target · activation/adoption thresholds · idea→decision ≤ 2w · CFR ≤ +2pp vs baseline post‑release.
Definition of Done: Problem briefs, experiment plans, decision notes, KR tree, outcome dashboard; one Pilot→Beta→GA completed with guardrails intact.
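An experiment plan's MDE/power section reduces to a sample-size estimate. A sketch using the standard normal-approximation formula for a two-sided test on a conversion rate (stdlib only; your experimentation platform may use a different correction):

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline: float, mde: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-arm sample size to detect an absolute lift `mde`
    over a control rate `baseline`, two-sided, via normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. ~1.96 for alpha=0.05
    z_power = NormalDist().inv_cdf(power)           # e.g. ~0.84 for power=0.8
    p_bar = baseline + mde / 2                      # average rate across arms
    variance = 2 * p_bar * (1 - p_bar)
    return math.ceil(variance * (z_alpha + z_power) ** 2 / mde ** 2)
```

Halving the MDE roughly quadruples the required sample, which is why the plan fixes the MDE before the test starts.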
6) SRE Lead / SRE
North Star: Keep systems available, fast, and safe—predictably.
Owns (A): SLOs/SLIs/alerts; error‑budget policy; incident response (on‑call, RCAs, drills); observability; resilience controls; DR/backups.
Weekly loops: SLO & alert review · incident huddle · change‑safety sync · on‑call health.
KPIs: SLO periods green ≥ 99% · MTTR < 1h · pages ≤ 2/team/wk · alert precision > 90% · backup restore 100%.
Definition of Done: SLOs live for tier‑1/2; burn alerts wired; policies gating rings; DR drill evidence; RCAs closed with actions merged.
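The “burn alerts wired” item rests on simple error-budget math. A sketch assuming a 30-day SLO window; the 14.4× fast-burn threshold follows the common multi-window pattern and should be tuned to your own policy:

```python
# Error-budget burn-rate math behind SLO alerting; assumes a 30-day window.
def error_budget(slo: float) -> float:
    """Allowed failure fraction, e.g. a 0.999 SLO leaves a 0.001 budget."""
    return 1 - slo

def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return observed_error_rate / error_budget(slo)

def should_page(observed_error_rate: float, slo: float,
                fast_threshold: float = 14.4) -> bool:
    """Page when the short-window burn would exhaust a 30-day budget in ~2 days."""
    return burn_rate(observed_error_rate, slo) >= fast_threshold
```

A burn rate of 1.0 means the budget lasts exactly the SLO window; the paging threshold catches incidents that would consume it far sooner.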
7) Data Lead / Analytics Engineer
North Star: Trusted, self‑serve data and decision‑quality experiments.
Owns (A): Metric definitions & semantic layer; event taxonomy & data contracts; data reliability (freshness SLAs, tests, lineage); self‑serve BI & governance; experiment standards.
Weekly loops: Freshness & tests review · schema/contract triage · experiment clinic · office hours.
KPIs: Metric spec coverage 100% · freshness ≥ 99% · tests ≥ 95% pass · data MTTR < 1h · semantic‑layer usage ≥ 80%.
Definition of Done: NSM/KR specs live; SLAs green; certified dashboards; experiment policy enforced (MDE/power/guardrails).
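The freshness ≥ 99% KPI implies a check over every table's latest load time. A sketch; the 6-hour SLA and table names are illustrative assumptions:

```python
from datetime import datetime, timedelta

# Sketch of a data-freshness SLA check; the 6h threshold is an assumption.
def stale_tables(last_loaded: dict, now: datetime,
                 max_age: timedelta = timedelta(hours=6)) -> list:
    """Return table names whose latest load is older than the SLA."""
    return sorted(t for t, ts in last_loaded.items() if now - ts > max_age)

def freshness_pct(last_loaded: dict, now: datetime,
                  max_age: timedelta = timedelta(hours=6)) -> float:
    """Share of tables meeting the SLA — the '>= 99%' KPI."""
    if not last_loaded:
        return 100.0
    fresh = len(last_loaded) - len(stale_tables(last_loaded, now, max_age))
    return 100 * fresh / len(last_loaded)
```

Wiring this into the weekly freshness review turns the SLA into a list of specific tables to fix rather than an aggregate number.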
8) Staff / Principal Engineer (senior IC)
North Star: Raise the technical bar and unlock multiple teams.
Owns (A): System design in a domain; one‑way door choices (ADR/RFC); reference implementations & golden paths; performance budgets; deprecations & migrations; design review program.
Weekly loops: Design reviews · pairing/implementation · reliability & perf triage · mentoring · short docs/ADRs.
KPIs: Wait time ≤ 2d · CFR < 15% · MTTR < 1h · hot‑path p95 on budget · golden‑path adoption ≥ 80%.
Definition of Done: Two cross‑team wins shipped; contracts in CI; perf budgets gated; deprecation plan executed; review SLA ≤ 48h.
9) Software Engineers — Stream & Platform
North Star: Ship small changes that move KRs, safely.
Core responsibilities (R):
- Build on the paved road; keep PRs small; add tests at the right layer.
- Operate what you ship: SLOs, alerts, runbooks, on‑call rotation.
- Instrument features (events, metrics) and annotate deploys.
- Use contracts and versioning for inter‑team interfaces; keep provider/consumer tests green.
- Write helpful notes (docs/checklists) and follow post‑deploy verification.
Weekly habits: 5–10 PRs merged; post‑deploy smoke; fix one paper‑cut; pair once; update one runbook/checklist.
Quality bars: No secrets in code; rollback < 5m; preview p95 < 5m; zero new flaky tests; changes observable by default.
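A consumer-driven contract test can be as small as the consumer pinning the fields and types it depends on, replayed against the provider's real response in CI. A minimal sketch; the field names are illustrative assumptions:

```python
# Minimal consumer-side contract: the fields and types this consumer
# depends on. The provider's CI replays this against its actual response.
CONSUMER_CONTRACT = {"id": int, "email": str, "active": bool}

def contract_violations(response: dict, contract: dict = CONSUMER_CONTRACT) -> list:
    """List violations: missing fields or wrong types (empty means green)."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            problems.append(f"{field}: expected {expected_type.__name__}")
    return problems
```

Keeping this green on both sides is what lets providers evolve their interface without coordinating every release.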
RACI — Core Responsibilities by Role
Roles: CTO · HoE · HoP · TL · PM · SRE · Data · Staff/Princ · Eng
| Responsibility | CTO | HoE | HoP | TL | PM | SRE | Data | Staff/Princ | Eng |
|---|---|---|---|---|---|---|---|---|---|
| 1) Org topology & team design | A | R | I | I | I | I | I | C | I |
| 2) Quality bars policy (flags/telemetry/rollback/SLOs) | C | A | R | R | C | R | I | C | R |
| 3) Service ownership (catalog, on‑call, SLO dashboards) | I | A | I | R | I | C | I | C | R |
| 4) Platform (templates, CI/CD, previews, obs) | I | C | A | C | I | C | I | C | R |
| 5) SLOs & incident management | I | C | C | R | I | A | I | C | R |
| 6) Stream portfolio & WIP ≤ 2 | C | C | I | R | A | I | I | I | R |
| 7) Release gates & ringed deploys | I | C | C | A | C | R | I | C | R |
| 8) Error‑budget governance | C | C | I | R | I | A | I | C | R |
| 9) Metrics & experimentation standards | C | I | I | R | R | I | A | C | R |
| 10) Security & privacy posture | A | R | R | R | I | R | C | C | R |
| 11) Change calendar & freeze rules | I | A | C | R | C | C | I | I | R |
| 12) Deprecations & migrations (platform/libs) | I | C | A | R | I | C | I | R | R |
| 13) Contracts & interfaces (provider/consumer tests) | C | C | I | R | I | C | I | A | R |
| 14) Cost & FinOps (build mins, env hours, queries) | A | R | R | I | C | I | C | C | I |
| 15) Hiring & onboarding (engineering) | C | A | C | R | I | I | I | C | R |
| 16) Service catalog hygiene | I | A | I | R | I | C | I | I | R |
| 17) Paved‑road adoption | I | A | R | R | I | I | I | C | R |
RACI key: A = Accountable (single owner); R = Responsible (executes); C = Consulted (two‑way input); I = Informed (kept in the loop).
How to use this
- Link this doc from your Service Catalog and Team APIs.
- For any responsibility with unclear A/R, resolve in 1:1s or at the Platform/Outcome council this week.
- Keep the RACI to one A per row; review quarterly or after topology changes.
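The one-A-per-row rule is easy to enforce mechanically if the RACI lives in a structured file. A sketch, assuming rows are modeled as dicts of role to letter:

```python
# Quick check that a RACI table keeps exactly one Accountable per row,
# as this guide requires. Rows are dicts of role -> letter.
def rows_with_bad_a(raci: dict) -> list:
    """Return responsibility names whose row has != 1 'A'."""
    return sorted(name for name, row in raci.items()
                  if list(row.values()).count("A") != 1)
```

Running this in CI against the catalog-hosted RACI catches ambiguous ownership before the quarterly review does.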
Glossary
Alphabetical; concise definitions tuned to the context of the guide.
A
- ADR — Architecture Decision Record: Short note that records a key technical decision, its context, and consequences.
- API — Application Programming Interface: A contract for how software components communicate. (Mentioned implicitly via “interfaces/contracts”.)
B
- BI — Business Intelligence: Tools/dashboards used for analysis and decision‑making.
C
- CD — Continuous Delivery/Deployment: Automating release to staging/production safely and frequently.
- CFR — Change Failure Rate: % of deployments that cause incidents/rollbacks.
- CI — Continuous Integration: Automated build/test on every change.
- CoD — Cost of Delay: Economic impact of delaying a feature/outcome.
- CTO — Chief Technology Officer: Exec owning technology strategy and outcomes.
D
- Data MTTR: Mean time to recover from data incidents. (Data reliability context.)
- Dev NPS — Developer Net Promoter Score: Satisfaction measure from internal developers.
- DORA — DevOps Research & Assessment (metrics): lead time, deployment frequency, change failure rate, MTTR.
- DR — Disaster Recovery: How systems are restored after major failure (see RTO/RPO).
- DX / DevEx — Developer Experience: How easy and pleasant it is for engineers to build/ship. (Implied by Platform/DevEx targets.)
F
- FinOps — Cloud Financial Operations: Practices to manage cloud/infrastructure spend.
G
- GA — General Availability: Broadly available, production‑ready release.
- GTM — Go‑to‑Market: Coordination with marketing/sales/support for launches.
H
- HoE — Head of Engineering (VP Eng): Runs delivery operations and engineering execution.
- HoP — Head of Platform: Owns the paved road (templates, CI/CD, previews, observability).
I
- IC — Individual Contributor: Non‑manager engineer role.
- IR — Incident Response: Process/roles to handle outages/security events.
K
- KPI — Key Performance Indicator: Metric tracking health/performance of an area.
- KR — Key Result: Outcome target aligned to the NSM.
M
- MDE — Minimum Detectable Effect: Smallest effect size an experiment aims to reliably detect.
- MTTR — Mean Time To Restore (or Recover): Average time to mitigate/resolve incidents.
N
- NPS — Net Promoter Score: Satisfaction metric (here, among internal developers for platform).
- NSM — North Star Metric: Single most important metric guiding the product/business.
O
- Ops — Operations: Practices and routines for running systems and teams.
P
- p50 / p95: 50th/95th percentile of a metric (e.g., build time, latency).
- PII — Personally Identifiable Information: Data that can identify an individual.
- PM — Product Manager: Owns problem framing, outcomes, portfolio/WIP inside a stream.
- PR — Pull Request: Proposed code change reviewed before merge.
R
- RACI — Responsible, Accountable, Consulted, Informed: Model clarifying who does what.
  - R — Responsible: Executes the work.
  - A — Accountable: Single owner answerable for the result/decision.
  - C — Consulted: Gives input before decisions.
  - I — Informed: Kept up to date after decisions.
- RFC — Request for Comments: Design document used to discuss/decide significant changes.
- RICE — Reach, Impact, Confidence, Effort: Framework to prioritize bets/initiatives.
- ROI — Return on Investment: Value delivered relative to cost.
- RPO — Recovery Point Objective: Maximum acceptable data loss window in DR.
- RTO — Recovery Time Objective: Target maximum downtime in DR.
S
- SaaS — Software as a Service: Cloud‑hosted software delivered via subscription.
- SAST/DAST: Static/Dynamic Application Security Testing. (Security scanning in CI/CD.)
- SBOM — Software Bill of Materials: Inventory of components/dependencies in a build.
- SLA — Service Level Agreement: Promised service targets (often contractual).
- SLI — Service Level Indicator: Measured signal of service health (e.g., latency).
- SLO — Service Level Objective: Target for an SLI (e.g., 99.9% availability).
- SLSA — Supply‑chain Levels for Software Artifacts: Framework to harden software supply chains. (Referenced via supply‑chain posture.)
- SRE — Site Reliability Engineering / Engineer: Discipline/role focused on reliability.
T
- TL — Tech Lead: Leads a stream‑aligned squad across design/build/test/release/operate.
V
- VP — Vice President. (As in VP Engineering.)
W
- WIP — Work in Progress: Concurrent bets/initiatives; capped to protect flow.
Y
- YAGNI — “You Aren’t Gonna Need It”: Principle to avoid premature abstraction/over‑engineering.
Notes
- Percentiles (p50/p95) and DORA are used as SLIs/KPIs to track flow, quality, and reliability.
- RACI rows in the guide intentionally have one “A” per responsibility to avoid decision ambiguity.