INTERNAL PLAYBOOK · v1 · MAY 2026

How We Run Apps

Six plays. One deliberate system. Built once for KeyContent. Cloned in an afternoon for every Coreshift HQ app after.

Owner: Abe (Operator)
Implementer: Claude Code
Stack: Supabase · Cloudflare · GitHub · Postmark
Scale: 10–20 users today
PLAY 1

Priority Framework

How we triage what gets fixed first.

Status: Active · v1 · 2026-05-11
Applies to: Every Coreshift HQ app · Currently: KeyContent
Owners: Operator (assigns priority) · Claude Code (consults for triage cues)


What this doc is

How Coreshift HQ triages and prioritizes issues. Built on the P0–P3 industry standard (so we play nicely with the rest of the world), elevated with two custom fields that force better thinking:

  1. User Impact Narrative — a one-line human story, not a technical description
  2. Blast Radius — explicit count of how many users are affected

The standard tells us what to call things. Our additions force us to think about who's affected before we act.


The Four Priority Tiers

| Tier | Name | When | Action | Default response time |
|------|------|------|--------|------------------------|
| P0 | Drop Everything | App is down for most/all users · Data loss · Active security breach | Fix now, today | Hours |
| P1 | This Week | A core feature is broken · OR a smaller feature broken for many users | Fix in 1–3 days | Same business day |
| P2 | This Milestone | A bug exists but users have a workaround · OR only a small subset is affected | Fix in current sprint | Within the week |
| P3 | When Convenient | Polish · Cosmetic · Wishlist · Edge case | Backlog — close if stale | No SLA |

The Decision Tree (use this in 10 seconds)

1. Is the app down for everyone?              → YES → P0
                                                NO ↓
2. Is a core feature broken for many users?   → YES → P1
                                                NO ↓
3. Is anything functionally broken?            → YES → P2
                                                NO ↓
4. Cosmetic / nice-to-have?                    → P3

"Core feature" for KeyContent = login · the main content workflow · saving work · billing/payments. Each app defines its own list in its in-repo PRIORITY.md.


The Two Required Fields (the elevation)

1. User Impact Narrative

Every triaged issue must include a one-line story of how this affects the user's day.

| ❌ Bad (technical) | ✅ Good (human) |
|---|---|
| "Save button throws TypeError" | "Users lose their drafts when clicking Save — they have to retype everything" |
| "API 500 on /jobs endpoint" | "Users can't see their own jobs list when they log in — looks like everything's gone" |
| "Modal animation janky on Safari" | "Safari users see a flicker when opening the report widget — minor visual annoyance" |

Why: Triage from the user's perspective, not the engineer's. The narrative tells you the severity better than the stack trace.

2. Blast Radius

Tag every issue with one of:

| Tag | Meaning |
|---|---|
| radius:single | One user affected (so far) |
| radius:some | A subset — e.g., users on Safari, free-tier users, users in a specific region |
| radius:many | Most users will hit this |
| radius:all | Everyone, every session |

Why: Lets us spot escalation. A P2 · radius:some issue that gets re-reported as radius:many jumps to P1 instantly.


How Type, Priority, and Radius Compose

These are three orthogonal axes — all assigned, but at different moments:

| Field | Set by | When |
|---|---|---|
| Type (Bug / Suggestion / Question) | The user | At report time (via the widget) |
| Priority (P0–P3) | The operator | At triage time |
| Blast Radius | The operator | At triage time |

A typical fully-triaged issue looks like:

Type: Bug
Priority: P1
Blast Radius: radius:many
Narrative: Users lose their drafts when clicking Save — they have to retype everything.

This is rich enough that Claude Code can read it cold and start fixing. That's the whole point.
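
For reference, the fully triaged record plus the escalation rule from Blast Radius above, as a TypeScript sketch (illustrative names, not a shipped schema):

```ts
type Priority = "P0" | "P1" | "P2" | "P3";
type Radius = "radius:single" | "radius:some" | "radius:many" | "radius:all";

interface TriagedIssue {
  type: "Bug" | "Suggestion" | "Question"; // set by the user at report time
  priority: Priority;                      // set by the operator at triage time
  radius: Radius;                          // set by the operator at triage time
  narrative: string;                       // one-line user impact story
}

// Escalation rule: a P2 whose radius widens to many/all jumps to P1.
function retriage(issue: TriagedIssue, newRadius: Radius): TriagedIssue {
  const widened = newRadius === "radius:many" || newRadius === "radius:all";
  const priority: Priority = issue.priority === "P2" && widened ? "P1" : issue.priority;
  return { ...issue, radius: newRadius, priority };
}
```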


Common Mistakes (don't do these)

  • Everything is P1. If everything's urgent, nothing is. Be ruthless.
  • The loudest user sets priority. Impact sets priority, not volume. A single angry user reporting a typo is still P3.
  • Bugs you found ranked above bugs users found. Reverse it — user-reported issues are real, by definition.
  • Priority never changes. Re-triage when you learn more. A P2 affecting radius:some that turns into radius:many becomes P1.
  • Skipping the narrative. Without it, you're prioritizing on vibes. Force the user-impact line every time.

Why we tweaked the standard

The industry's P0–P3 system is a vocabulary, not a thinking framework. Two issues can both be "P1" and have wildly different actual impact on users. By requiring narrative + radius, we force a small amount of structured thinking that:

  • Prevents priority inflation
  • Makes triage trustworthy (numbers are decisions, not guesses)
  • Generates great Claude Code briefs as a byproduct
  • Sets up future automation — radius:all could auto-page; radius:single + P3 could auto-close after 60 days

These are not exotic additions. They're cheap, repeatable habits that compound over hundreds of triaged issues.


How this lives in practice

  • The widget captures Type at report time
  • Triage (10 min/day) sets Priority + Blast Radius + Narrative
  • GitHub Issues stores all three as labels and fields (Phase 1)
  • Sentinel (future ops dashboard) filters and sorts by these
  • The pitch deck (Slide 7) is the public-facing summary of this doc

When this doc changes

  • Edit when we discover a new tier is needed (rare)
  • Edit when the "core feature" list changes per app
  • Bump the version at the top
  • Mention in the next CEO update

See also

  • ../briefs/BRIEF-report-issue-widget.md — how Type is captured
  • ../briefs/BRIEF-report-issue-widget-round2.md — Type selector implementation
  • ../deck/KeyContent-Maintenance-Ops.pptx — Slide 7 (Priority Framework)
  • ../ROADMAP.md — Phase 1 Week 5 (in-repo PRIORITY.md follows this doc)
PLAY 2

PR Reviews

How a non-coder verifies every change.

Status: Active · v2 · 2026-05-13 (Railway revision)
Applies to: Every PR from Claude Code or any contributor
Owners: Operator (reviewer) · Claude Code (author)

v2 note: Original v1 assumed Cloudflare Pages preview URLs per PR. The actual stack uses Railway, which doesn't generate per-PR previews on the free tier. The review model has shifted from "verify on a preview, then merge" to "review-by-description pre-merge, behavior-test on staging post-merge." Safety net is Sentry + Better Stack + Railway fast rollback.


What this doc is

A behavior-based, code-free PR review checklist. Designed for an operator who doesn't write code to confidently approve or push back on changes — at speed, every day.


The principle

Behavior > code.

Most engineering orgs review PRs by reading source code line by line. We can't — and we don't need to. We trust Claude Code's implementation patterns and verify what actually matters: that the change does what we asked for, and nothing else broke.

Our verification happens at the user-experience layer, not the source-code layer.

The trade: we don't personally catch every subtle bug. We rely on Sentry monitoring + Better Stack uptime + Railway fast rollback as the safety net for what slips through.


Two-Phase Review (Railway-adapted)

Railway free tier doesn't create a preview URL per PR — only the staging branch gets deployed. So the review splits into two phases:

  • Phase A (pre-merge): description-based review only. No behavior-testing yet.
  • Phase B (post-merge-to-staging): behavior verification on staging.keycontent.ai after Railway deploys.

Total time: ~3 min pre-merge + ~5 min post-merge.

Phase A — Pre-merge review (~3 min)

Step 1 · Read the PR description carefully

  • Does the change match what I briefed?
  • Are the "What's new" items a subset of what I asked for? (No scope creep.)
  • Does the test plan match what we should verify?

✅ Yes → continue. ❌ No → comment to clarify before merging.

Step 2 · Read the files-changed list

Don't read the code itself — just the list of files. Quick sanity check:

  • Does the file count match the scope? (A "fix subject line" PR touching 12 files is a red flag.)
  • Are any unexpected files touched? (e.g., a UI fix PR that changed migration files — ask why.)

If files look reasonable → proceed. If suspicious → ask Claude Code to explain.

Step 3 · Ask clarifying questions in PR comments

You can't behavior-test yet, so questions are the only pre-merge gate. Examples:

  • "You added a new env var — does it have a sensible default if missing?"
  • "You touched the auth flow — confirm sign-in/sign-out still work after this?"
  • "This change affects the report widget — does staging Postmark still get hit?"

Claude Code answering well = green light to merge. Claude Code stumbling = bounce back.

Phase A passes → merge to staging.

Phase B — Post-merge behavior testing on staging (~5 min)

After merging to staging, Railway redeploys in ~2-3 min. Open https://staging.keycontent.ai and:

Step 4 · Reproduce the original goal

  • For bugs: repeat the steps from the original issue. Confirm the bug is gone.
  • For features: use the feature exactly as a user would. Confirm it does what we asked.

Step 5 · Smoke test the critical flows

After verifying the targeted change, quickly probe 2–3 high-traffic paths to make sure nothing else broke.

KeyContent's critical flows:

  • Sign in / sign up
  • Dashboard loads cleanly
  • The primary content workflow (create / edit / save a job)
  • Sign out

Each click ~5–10 seconds. Whole smoke test ~1 minute.
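
If the smoke test is ever automated, it maps directly onto a browser test. A minimal Playwright sketch, assuming hypothetical selectors and test credentials (nothing here exists in the repo yet):

```ts
import { test, expect } from "@playwright/test";

// Hypothetical selectors and env vars — adjust to the real KeyContent UI.
test("staging smoke: sign in, dashboard, save a job, sign out", async ({ page }) => {
  await page.goto("https://staging.keycontent.ai/login");
  await page.fill('input[name="email"]', process.env.SMOKE_EMAIL!);
  await page.fill('input[name="password"]', process.env.SMOKE_PASSWORD!);
  await page.click('button[type="submit"]');

  await expect(page.getByText("Dashboard")).toBeVisible(); // dashboard loads cleanly

  await page.click("text=New Job");                        // primary content workflow
  await page.fill('input[name="title"]', "smoke-test job");
  await page.click("text=Save");
  await expect(page.getByText("smoke-test job")).toBeVisible();

  await page.click("text=Sign out");                       // sign out still works
});
```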

Step 6 · Mobile check

  • Resize your browser to ~400px wide (or pull it up on your phone)
  • Hit the changed page
  • Confirm nothing is broken or unreadable

Step 7 · If anything fails

  • Comment on the merged PR with the failing details
  • Brief Claude Code for the follow-up fix (new PR)
  • If the failure is severe (auth, payment, data loss): rollback staging via Railway → Deployments → redeploy previous

All checks pass → ready to promote staging → main.


Decision matrix

| Situation | Phase | Action |
|---|---|---|
| Description + files match brief, questions answered well | A | ✅ Merge to staging |
| PR description doesn't match my brief | A | ❌ "This doesn't match my brief — please redo X" before merging |
| Suspicious files-changed list | A | ❌ Ask Claude Code to explain before merging |
| Already merged, all behavior checks pass on staging | B | ✅ Ready to promote staging → main |
| Bug isn't actually fixed (on staging) | B | ❌ Comment with exact failing steps; new PR for fix |
| Fix works, but smoke test broke something else | B | ❌ Comment + screenshot; new PR for fix |
| Mobile broken on staging | B | ❌ Comment "Mobile broken — see screenshot"; new PR |
| Severe failure on staging (auth, data loss) | B | 🚨 Rollback staging via Railway, then fix via new PR |

Example comments to give Claude Code

Pre-merge clarifying question (Phase A):

"You're touching the auth middleware — confirm sign-in still works for existing sessions after this lands. Don't want to log everyone out."

Bug not fixed (Phase B, post-merge on staging):

"Tested on staging. Original issue still happens: when I click Save on the draft, the page reloads and the content is lost. New PR for follow-up fix?"

Smoke test broke something (Phase B):

"Targeted fix works on staging. But noticed: clicking 'New Job' button now crashes the page. Reproduced 3 times on staging. Screenshot attached. Need a follow-up PR."

Mobile broken (Phase B):

"On mobile width (~400px) on staging, the new ticket-type selector wraps weirdly and 'Send Report' is cut off. Screenshot."

Scope creep (Phase A):

"I see you also refactored the auth code — please pull that out into a separate PR so this one stays focused on the bug fix."


What I never do

  • Read code line by line
  • Approve a PR based on "looks good to me"
  • Skip the smoke test
  • Merge on a Friday afternoon
  • Trust the test plan without running it myself

Post-merge to PROD: the 1-hour watch

After promoting staging → main and Railway redeploys production:

  • New error spike in Sentry? → rollback via Railway (Deployments → redeploy previous, ~2 min)
  • Better Stack monitor flipped red? → same: rollback first, diagnose later
  • User reports a problem via the widget? → check if it correlates with the deploy; rollback if yes
  • All quiet for 1 hour? → done.

For hotfixes, watch for 2 hours instead of 1.


Why we elevated the standard

The industry assumes the PR reviewer is also a coder. Our system inverts that assumption: the operator doesn't read code, so the checklist is 100% behavior-focused.

This:

  1. Unblocks non-coding operators from running a real engineering ops process
  2. Catches the bugs that actually matter — user-visible breakage, not stylistic preferences
  3. Pairs with safety nets — Sentry + Better Stack + Railway fast rollback catch what visual testing misses
  4. Adapts to platform reality — when Cloudflare-Pages-style per-PR previews aren't available (Railway), the review moves to staging-after-merge with rollback as the safety net
  5. Scales across apps — same two-phase pattern works for KeyContent and every Coreshift HQ app after

The unfair advantage: we move faster than teams that require code-reading reviewers, because every PR has exactly one reviewer (the operator) and the review is mechanical.


When this doc changes

  • When new critical flows emerge in any app (add to Step 4 list)
  • When a recurring failure mode isn't being caught by the smoke test
  • After any major post-merge incident (add a check to prevent recurrence)
  • Bump the version at the top and note the change

Per-app critical flows

Each Coreshift HQ app declares its own critical flows for Step 4. KeyContent's are above. App #2 will add its own.


See also

  • HOW-WE-DO-PRIORITY.md — how we triage what gets fixed first
  • HOW-WE-DO-DEPLOYS.md — the deploy pipeline + rollback procedures (v2 Railway-adapted)
  • HOW-WE-DO-INCIDENTS.md — what to do when the post-merge watch fires
  • HOW-WE-DO-APP-AUDITS.md — the broader audit framework
  • Deck Slide 6 — Shipping a New Feature (the visual narrative — may need slight rewording for Railway in next deck revision)
  • Deck Slide 10 — Role Distribution (Operator: Triage, brief, verify)
PLAY 3

Incident Response

Pre-written calm beats in-the-moment heroics.

Status: Active · v1 · 2026-05-13
Applies to: Any unplanned event degrading user experience in production
Owners: Operator (incident driver) · Claude Code (fix implementer) · Tools (detection)


What this doc is

The pre-written playbook for when something is broken in production. Designed for a solo operator at 2 AM, when you don't have time to think — just follow the script.


The principle

Pre-written calm beats in-the-moment heroics.

The worst time to design a process is during a fire. So we wrote it now. When something breaks, you don't think — you execute the script. Decisions are made in advance; the runbook is the brain.


Severity Levels

| Level | Definition | Response window | Public comms |
|---|---|---|---|
| SEV1 | App is down for everyone · data loss · active security breach | Now | Status page + user email if > 30 min |
| SEV2 | Core feature broken · app down for many users | Within 30 min | Status page |
| SEV3 | Degraded but functional · slow · minor breakage | Same day | Internal only |

Non-urgent bugs that aren't actively breaking things → use the P0–P3 priority framework, not the incident process.


The 6-Step Response

1. ACKNOWLEDGE (0 min)

  • Alert arrived from Better Stack or Sentry
  • You're aware. The clock is running.
  • Internally: stop everything else. This is your only task.

2. ASSESS severity (≤ 60 seconds)

Three quick questions:

  • Can users still use the app at all? → No: SEV1 or SEV2. Yes: SEV3.
  • How many users affected? → All: SEV1. Many: SEV2. Few: SEV3.
  • Data being lost or corrupted? → Yes: always SEV1.

Pick a level. Don't agonize. You can adjust as you learn more.
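
The same three questions as a sketch, for reference (severity is still a human call; when the questions disagree, the code picks the more severe answer):

```ts
type Sev = "SEV1" | "SEV2" | "SEV3";

// Illustrative only — mirrors the three assessment questions above.
function assess(q: { usableAtAll: boolean; affected: "all" | "many" | "few"; dataLoss: boolean }): Sev {
  if (q.dataLoss) return "SEV1";                 // data loss is always SEV1
  if (q.usableAtAll) return "SEV3";              // degraded but functional
  return q.affected === "all" ? "SEV1" : "SEV2"; // down for all vs many
}
```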

3. COMMUNICATE (≤ 5 minutes from acknowledge)

  • SEV1 / SEV2: update the public status page immediately. Use the templates below.
  • SEV3: note in your incident log; update the status page only if it persists past 30 min.

4. DIAGNOSE — "what changed?"

Run through these dashboards in order. 9 out of 10 incidents trace to a recent deploy.

| Source | What to look for |
|---|---|
| Recent deploys | Anything shipped in the last 4 hours? |
| Sentry | New error spike? Click through for stack trace + breadcrumbs |
| Railway logs | 5xx surge? Crash loops? Request timeouts? |
| Supabase logs | DB errors? Connection issues? Migration problems? |
| Postmark | Are emails bouncing? Server outage? |
| Third-party status pages | Railway · Cloudflare · Supabase · Postmark all have public status pages |

5. MITIGATE — pick the safest path

Was there a recent deploy?
   YES → ROLLBACK FIRST. Diagnose later.
         (Railway → Deployments → redeploy previous; live in ~2 min.)
   NO  ↓

Is it a third-party outage?
   YES → Communicate, wait, monitor. Update users.
   NO  ↓

Is it a code bug requiring a new fix?
   → Brief Claude Code with full Sentry trace + context.
     Get a hotfix PR. Merge to staging, verify there, promote to main.
     (Use the standard PR review checklist — abbreviated for hotfix.)

6. RESOLVE

  • Test the affected flow yourself on production
  • Update status page to "Resolved"
  • Note in your incident log

The 1 AM Rule

When something fires at 1 AM, do the minimum to make it stop:

  1. Rollback the last deploy if there was a recent one
  2. Communicate to users via status page
  3. Sleep. Investigate properly in the morning.

Don't do at 1 AM:

  • Write or brief new code
  • Deploy speculative fixes
  • Read complex logs
  • Make architectural decisions

The 1 AM rule exists because sleep-deprived decisions are how minor incidents become major ones.


Claude Code's Role During Incidents

You don't write code. Even during incidents. Especially during incidents.

When a hotfix is needed:

  1. Capture the full Sentry trace + URL + user details + timestamp
  2. Open Claude Code (a fresh session for the incident, if helpful)
  3. Brief it like: "P0 hotfix. {feature} broken in production since {time}. Sentry error attached. Rollback wasn't possible because {reason}. Need a fix opened as a PR immediately."
  4. Merge to staging and verify the fix on staging.keycontent.ai
  5. Promote staging → main

Operator's role is triage, brief, verify — same as always, just faster.


Pre-Written Status Page Templates

Investigating:

We're investigating reports of {issue} affecting {feature}. We'll provide an update within 30 minutes.

Identified:

We've identified the issue with {feature}. We're working on a fix now.

Monitoring:

A fix has been deployed. We're monitoring to confirm everything is working as expected.

Resolved:

This incident is resolved. {Feature} is fully operational. Sorry for the disruption.


When to Email Users (beyond the status page)

Most incidents: status page is enough.

Email users only when ALL of these are true:

  • SEV1 lasted > 30 minutes, OR data was visibly affected
  • The user might reasonably think their account is broken
  • You can identify the affected users

Template:

Subject: We had a brief outage — your data is safe

Hi {name},

Earlier today between {start} and {end}, {feature} was {down/broken}.
We're back to normal now.

What you might have noticed: {brief description}
What we did: {1-2 sentences}

Thanks for your patience.

— The Coreshift HQ Team

Post-Mortem — Within 24 Hours of any SEV1 / SEV2

Save as incidents/YYYY-MM-DD-N.md in the repo. Lean template — 5 sections:

## Incident YYYY-MM-DD-N

**Severity:** SEV{1|2|3}
**Duration:** Detected HH:MM → Resolved HH:MM ({X} minutes)
**Affected:** {features, ~user count}

### What happened
{One paragraph.}

### Root cause
{One paragraph.}

### How we caught it
{Sentry / Better Stack / user report / other?}

### What we did
{Bullet list of actions taken, in order, with timestamps.}

### Action items
- [ ] Prevent recurrence: {specific change}
- [ ] Improve detection: {specific change, if applicable}
- [ ] Update this runbook: {specific addition, if applicable}

Don't:

  • Skip the post-mortem because "everyone knows what happened"
  • Make it about blame (the system failed, not the person)
  • Write a novel — 5 sections is enough

Do:

  • File the action items as P1 GitHub Issues
  • Re-read recent post-mortems quarterly — patterns emerge

Common Anti-Patterns (don't do these)

  • Skipping comms — users finding out from a friend is worse than a brief status update
  • Forcing a fix at midnight — rollback + sleep is almost always better
  • Investigating before mitigating — stop the bleeding first
  • No post-mortem because "it's fixed" — patterns repeat unless documented
  • Skipping the status page update because "it's small" — the rule is consistency, not severity

Why we elevated the standard

The industry has detailed incident playbooks designed for large engineering teams with rotating on-call schedules, incident commanders, and war rooms. Most of that doesn't apply when you're a solo operator with Claude Code as your implementer.

Our elevations:

  1. Solo-operator-aware — no incident commander, no war room, no Slack #incidents channel. Just you + the runbook.
  2. The 1 AM Rule — explicit permission to do the minimum at night. Sleep is operational equipment.
  3. Pre-written communications — copy/paste during the fire, edit later.
  4. Claude Code is the implementer — you brief, it codes, you verify. Same workflow as a normal day, just under pressure.
  5. Lean post-mortems — 5 sections, not 50. Trends matter; ritual doesn't.

These choices make incidents survivable for a small team without sacrificing the discipline that prevents recurrence.


When this doc changes

  • After any SEV1 — what new step or check would have helped?
  • When the team grows beyond 1 — add coordination details
  • When new tools enter the stack — add their dashboards to Step 4
  • After a recurring pattern emerges — codify it
  • Bump version, note the change

See also

  • HOW-WE-DO-PRIORITY.md — P0/P1 priorities map to SEV1/SEV2 severities
  • HOW-WE-DO-PR-REVIEWS.md — the standard review process; incident review is the abbreviated version
  • HOW-WE-DO-DEPLOYS.md — rollback details (coming next)
  • ../ROADMAP.md — Phase 1 Week 2 wires up Better Stack + status page (the detection layer)
  • Deck Slide 5 — Scenario 2: outage at 2 AM (the visual narrative)
PLAY 4

Deploy Pipeline

Three rules. No exceptions.

Status: Active · v2 · 2026-05-13 (Railway revision)
Applies to: Every code change reaching production · Every Coreshift HQ app
Owners: Operator (verifier) · Claude Code (implementer) · Railway + Supabase + Cloudflare (mechanism)

v2 note: This doc was originally written assuming Cloudflare Pages hosts the frontend with per-PR preview URLs and gradual traffic rollout. The actual stack uses Railway for frontend + Express server hosting, with Cloudflare in front for DNS, CDN, WAF, and SSL (not Pages). Sections below have been rewritten to match Railway's mechanics. Future apps in the Sentinel portfolio that use Cloudflare Pages will need a separate variant.


What this doc is

The pre-written rules for moving code from "Claude Code just wrote it" to "real users are using it." Designed for an operator who doesn't read code to ship safely, daily.


The principle

Every deploy goes through staging. Every deploy is rollback-able fast. Every deploy is observable.

Three rules. No exceptions. The cost of following them is ~10 minutes. The cost of skipping them is a user-visible incident — every time.


The Standard Deploy Pipeline

Claude Code writes fix → opens PR against staging branch
            ↓
[Known gap: Railway free tier has no per-PR preview URLs]
            ↓
Read PR description + ask Claude Code clarifying questions
   (per HOW-WE-DO-PR-REVIEWS.md — review-by-description model)
            ↓
Merge to staging branch
            ↓
Railway auto-deploys to staging.keycontent.ai  (~2-3 min)
            ↓
Smoke test on staging
            ↓
Promote staging → main via PR  (the promotion gate)
            ↓
Railway auto-deploys to keycontent.ai  (~2-3 min, atomic cutover)
            ↓
Sentry + Better Stack watch for 1 hour
            ↓
Done.

The Three Rules

Rule 1 · Every deploy goes through staging

  • No direct merges to main
  • No "trivial" exceptions — the rule has no carve-outs
  • Staging is a real Railway environment with separate Supabase project + separate data
  • Even a one-character typo fix follows the path

Rule 2 · Every deploy is rollback-able fast

  • Railway keeps every previous deployment in the service's Deployments tab
  • Rollback = click a previous deployment → Redeploy. Live in ~2 minutes.
  • Database migrations need extra thinking (see Type C below)
  • Before merging anything risky: mentally rehearse "if this breaks, which deployment do I redeploy?"

Rule 3 · Every deploy is observable

Railway free tier deploys atomically — there is no native gradual traffic rollout. The discipline shifts from pre-deploy gradual exposure to post-deploy fast detection + revert. Our safety net:

  • Sentry catches application errors automatically within seconds of the deploy
  • Better Stack catches downtime/availability issues with 3-minute checks
  • Railway rollback is the kill switch when either alarm fires
  • The 1-hour watch after every prod deploy is the operator's commitment to be reachable

If Railway gains gradual rollouts (or the team upgrades to a plan with preview environments and that capability), this rule reverts to "gradual." Until then: deploy → watch → revert if needed.


Deploy Types

Type A · Standard feature or fix

Follow the standard pipeline above. No extra steps.

Type B · Hotfix (incident response)

When an incident requires an emergency fix:

  1. Brief Claude Code with the full incident context (Sentry trace + steps to reproduce)
  2. PR opened against staging
  3. Abbreviated review: only check the targeted fix (skip the full checklist) IF the incident is SEV1
  4. Merge to staging → quick smoke test → promote to main
  5. There's no gradual ramp on Railway — the cutover is atomic, so move straight to the post-deploy watch
  6. Sentry watch for 2 hours after a hotfix (double the normal window)

Type C · Database migration

Migrations are the riskiest deploy type because rollback isn't clean.

  1. Claude Code writes the migration in a PR
  2. Apply to staging first (Supabase migration CLI or apply_migration MCP)
  3. Test the affected features end-to-end on staging
  4. Sample queries to confirm data didn't corrupt
  5. Snapshot production Supabase before applying to prod (backup point)
  6. Apply migration to production
  7. Verify in production
  8. If anything breaks: Postgres migrations are NOT easily reversible. Have Claude Code write a forward-fix migration. Restore from backup only as a last resort.

Migration rule: never apply a migration to production without first applying it to staging and verifying the affected features.

Type D · Secrets / environment changes

Adding or rotating a secret (e.g., POSTMARK_SERVER_TOKEN):

  1. Add to staging Supabase first
  2. Verify behavior on staging
  3. Add to production Supabase
  4. Deploy any code that depends on the new secret (if not already deployed)
  5. Verify in production

Never: add a secret only to prod without testing on staging first.


The Staging → Production Promotion Checklist

When ready to promote a feature from staging to production, all must be true:

  • Feature has been live on staging for at least 24 hours with no Sentry alerts
  • All Edge Function secrets exist in production Supabase
  • All required migrations have been applied to production
  • Storage buckets and RLS policies exist in production (if used)
  • CORS allowlist on Edge Functions includes the production origin
  • No outstanding Sentry errors related to the feature in staging
  • You have at least 2 hours to monitor after promotion (don't promote at end of day)
  • Rollback plan is mentally rehearsed

If any box is unchecked: pause, fix, then promote.


Rollback Procedures (memorize these)

Frontend + Express server (Railway)

  1. Railway dashboard → KeyContent project → the service that runs the app
  2. Deployments tab → find the last known-good deployment (sorted newest first; pick the one before the breaking deploy)
  3. Click the ⋯ menu → Redeploy
  4. Confirm. Live within ~2 minutes (Railway rebuilds + cuts over).

If you need to roll back simultaneously on staging + production, repeat the process per service. Each Railway environment is independent.

Edge Function (Supabase)

  1. Check out the previous git commit on the function file
  2. Re-deploy via Supabase CLI: supabase functions deploy {function-name} --project-ref {prod-ref}
  3. OR brief Claude Code to revert and open a new PR (slower but safer)

Database migration

  • Not easily rollback-able. Use a forward-fix migration.
  • Restore from production backup only as last resort (you lose data written between the snapshot and now, AND Supabase Storage objects are NOT in the daily backups — see HOW-WE-DO-APP-AUDITS.md).

Secret rotation gone wrong

  • If new secret is broken: revert the secret value in Supabase or Railway dashboard to the previous one
  • Or remove the secret entirely if the code can fall back gracefully

Cloudflare-level issues

  • DNS or proxy misconfiguration → Cloudflare → DNS → Records → revert the record
  • SSL/TLS mode change broke things → SSL/TLS → Overview → revert encryption mode
  • Cloudflare doesn't deploy app code, so rollbacks here are config-level only

What to Watch After Every Deploy

For 1 hour after a normal merge (2 hours for hotfix), glance at:

  • Sentry — new error spike on keycontent-frontend or keycontent-backend?
  • Railway — deploy succeeded green, no crash-loop, log output looks healthy?
  • Better Stack — both monitors still green at https://keycontent.betteruptime.com?
  • Your inbox / bug report widget — any user reports?
  • Cloudflare — Analytics dashboard for unusual traffic or error rates?

If anything looks off: rollback first, diagnose later.


Anti-Patterns (don't do these)

  • "It's just a tiny change" — every change goes through the pipeline
  • Promoting on Friday afternoon — there's no good reason
  • Skipping the staging smoke test — staging exists precisely for this
  • Adding a secret to prod first — always staging first
  • Manual SQL changes against prod — use the migration system; untracked changes break future migrations
  • Deploying without a rollback path mentally rehearsed — if you can't rollback, don't deploy

Why we elevated the standard

The industry has dozens of deploy frameworks: blue-green, canary, feature flags, ring-based rollouts, GitOps, etc. Most are designed for large engineering teams with dedicated SRE. We don't need that complexity.

Our elevations:

  1. Three rules, no exceptions — easy to remember, hard to violate by mistake
  2. Observable-first deploys — Sentry + Better Stack + Railway rollback compose into a fast-detect-and-revert safety net. Replaces the pre-deploy gradual ramp we don't have on Railway.
  3. Description-based review pre-merge, behavior-based verification post-merge — pairs with HOW-WE-DO-PR-REVIEWS.md
  4. Staging is non-optional — even for "trivial" changes (a culture choice, not a technical one)
  5. Rollback-first culture — explicit safety bias during incidents
  6. Friday rule — no production deploys after Thursday lunch unless it's a hotfix

The trade: deploys take ~10 extra minutes each. The gain: production stays stable, mornings stay calm, and you can promote with confidence.


Concrete Example — Promoting V0 (Report Issue Widget) to Production

This is the canonical promotion exercise. Use it as the reference for future promotions:

  1. Confirm V0 has been live on staging.keycontent.ai for 24+ hours with no Sentry alerts
  2. Add POSTMARK_SERVER_TOKEN and REPORT_TO_EMAIL to production Supabase Edge Function secrets — operator dashboard
  3. Apply the storage migration to production (creates bug-report-screenshots bucket + RLS policies) — Claude Code task
  4. Deploy report-issue Edge Function to production Supabase — Claude Code task
  5. Verify Edge Function CORS allows the production origin (https://keycontent.ai) — Claude Code task
  6. Confirm Postmark sender domain is verified for production — operator dashboard
  7. Merge staging → main via PR — triggers Railway production deploy automatically (~2-3 min)
  8. Smoke test on production: submit a Bug report with screenshot — operator
  9. Sentry + Better Stack watch for 2 hours — operator
  10. If alarms fire: Railway → KeyContent prod service → Deployments → redeploy previous (the rollback)
  11. Done.

After this lands, this section becomes the "we've done it once" reference for App #2.


When this doc changes

  • After any incident traced to a deploy — what gate failed?
  • When the stack gains a new component (e.g., when Sentry is wired up in Phase 1, add a release-tracking step)
  • When a new deploy type emerges (e.g., feature flags)
  • Bump version, note the change

See also

  • HOW-WE-DO-PRIORITY.md — what justifies a hotfix vs standard deploy
  • HOW-WE-DO-PR-REVIEWS.md — the verification gate before merge
  • HOW-WE-DO-INCIDENTS.md — what to do when a deploy goes wrong
  • ../ROADMAP.md — V0 production promotion is queued under "Phase 0 wrap-up"
  • Deck Slide 6 — Scenario 3: shipping new code safely (the visual narrative)
PLAY 5

Bug Reports

Make reporting effortless. Make context automatic.

Status: Active · v1 · 2026-05-13
Applies to: User-facing feedback channel · KeyContent (live on staging) · Future Coreshift HQ apps
Owners: Operator (triager) · Claude Code (implementer) · Tools (capture, store, deliver)


What this doc is

The end-to-end definition of how user feedback flows into the system — from the click of the floating widget to the issue landing in the triage queue. Codifies what's already shipped on staging.keycontent.ai.


The principle

Make reporting effortless. Make context automatic.

The user types one sentence. The system attaches everything else — who they are, where they were, what their browser is, what they were looking at. That's the difference between a useful report and a frustrating email thread.


The Pipeline

User clicks the floating "?" button on any page
            ↓
Modal opens — pre-filled with context awareness
            ↓
User picks a type:  🐛 Bug   💡 Suggestion   ❓ Question
            ↓
User types their message (placeholder adapts to type)
            ↓
User optionally drops in a screenshot
            ↓
On submit: screenshot uploads to Supabase Storage (if present)
            ↓
POST to Edge Function with: { type, message, page_url, user_agent, screenshot_path, app_id }
            ↓
Edge Function validates auth + payload, generates a signed URL for the screenshot
            ↓
Postmark sends email to operator with full context
            ↓
Operator triages in inbox  (P0–P3 + radius + narrative)
            ↓
Hand to Claude Code  →  fix lands  →  loop closed

The Three Ticket Types

| Icon | Type | When users pick it | Maps to (future GitHub label) |
|---|---|---|---|
| 🐛 | Bug | "Something is broken" | bug |
| 💡 | Suggestion | "I have an idea or want a feature" | enhancement |
| ❓ | Question | "I'm stuck, unsure, or need help" | question |

Default selection: Bug (most common in production apps).

The type selector drives two things:

  1. The textarea placeholder adapts (better coaching per type)
  2. The email subject prefix in the operator's inbox: [KeyContent · BUG] vs [KeyContent · SUGGESTION] vs [KeyContent · QUESTION] — both driven by one lookup table, sketched below
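
As a sketch, that mapping is one lookup table in the widget (the placeholder copy here is illustrative, not the shipped strings):

```ts
const TYPE_CONFIG = {
  bug: {
    placeholder: "What broke? What did you expect to happen?",
    subjectPrefix: "[KeyContent · BUG]",
    githubLabel: "bug",
  },
  suggestion: {
    placeholder: "What would make this better for you?",
    subjectPrefix: "[KeyContent · SUGGESTION]",
    githubLabel: "enhancement",
  },
  question: {
    placeholder: "What are you stuck on?",
    subjectPrefix: "[KeyContent · QUESTION]",
    githubLabel: "question",
  },
} as const;
```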

What Gets Captured Automatically

The user types one sentence. The system attaches all of this without asking:

| Field | Where it comes from |
|---|---|
| User email | Supabase auth session |
| User ID | Supabase auth session |
| Page URL | window.location.href at the moment of report |
| Browser / OS | navigator.userAgent |
| Server timestamp | Edge Function captures on receipt (UTC) |
| Screenshot URL | Signed Supabase Storage URL (30-day expiry), if attached |
| App ID | "keycontent" for now; future apps drop in seamlessly |

This is why operator triage takes 30 seconds instead of 3 emails: the report already contains everything needed to act.
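
On the client side, assembling that payload takes a few lines. A sketch of the submit handler's core, with placeholder config (the shipped widget may differ):

```ts
import { createClient } from "@supabase/supabase-js";

// Placeholder config — the real app injects these via environment variables.
const SUPABASE_URL = "https://<project-ref>.supabase.co";
const supabase = createClient(SUPABASE_URL, "<anon-key>");

async function submitReport(
  type: "bug" | "suggestion" | "question",
  message: string,
  screenshotPath?: string,
) {
  // Widget only renders for signed-in users (see Security Model below).
  const { data: { session } } = await supabase.auth.getSession();
  const res = await fetch(`${SUPABASE_URL}/functions/v1/report-issue`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${session!.access_token}`, // Supabase JWT on every request
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      type,
      message,
      page_url: window.location.href,   // captured at the moment of report
      user_agent: navigator.userAgent,  // browser / OS
      screenshot_path: screenshotPath,  // set only if the user attached one
      app_id: "keycontent",             // future apps change this one field
    }),
  });
  if (!res.ok) throw new Error(`report-issue failed: ${res.status}`);
}
```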


The Email Format (what lands in your inbox)

Subject: [KeyContent · BUG] from nzricky@gmail.com

Type: Bug
App: keycontent
User: nzricky@gmail.com  (id: 6b87712a-abce-48ee-b444-...)
Page: https://staging.keycontent.ai/jobs/d66a2195-...
Browser: Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/148.0.0.0 ...
Time: Mon, 11 May 2026 09:16:21 GMT
Screenshot: https://...supabase.co/storage/v1/.../screenshot.png?token=...

----- Their message -----
{the user's message}

The subject prefix is the key feature for triage: inbox rules can route by type automatically.


Architecture (the components)

| Layer | Component | Where it lives |
|---|---|---|
| Frontend widget | `<ReportIssueButton />` mounted in app root layout | KeyContent frontend repo |
| Storage | Private bucket bug-report-screenshots, RLS-scoped to user folder | Supabase Storage |
| Edge Function | report-issue — validates auth, generates signed URL, sends email | Supabase Edge Functions |
| Email delivery | Postmark API (via POSTMARK_SERVER_TOKEN) | Postmark |
| Auth | Supabase JWT on every request | Supabase Auth |
| Multi-app readiness | app_id field in payload | Forward-compatible for Sentinel |

Security Model

  • Authentication required — only signed-in users can submit reports (the Edge Function returns 401 otherwise)
  • Screenshots are private by default — bucket is private, RLS restricts uploads to the user's own folder
  • Signed URLs with 30-day expiry — operator's email has a time-bounded link, not a permanent public URL
  • Secrets never in code — POSTMARK_SERVER_TOKEN and REPORT_TO_EMAIL live in Supabase Edge Function secrets
  • CORS allowlist — Edge Function only accepts requests from the app's verified origin
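
Put together, the Edge Function has roughly this shape. A sketch, not the shipped report-issue source: payload validation and error handling are trimmed, and the sender address is an assumption.

```ts
import { createClient } from "npm:@supabase/supabase-js@2";

Deno.serve(async (req) => {
  // CORS allowlist: only verified app origins, never "*".
  const origin = req.headers.get("Origin") ?? "";
  const allowed = ["https://keycontent.ai", "https://staging.keycontent.ai"];
  if (!allowed.includes(origin)) return new Response("Forbidden", { status: 403 });

  // Auth required: anonymous submissions get 401.
  const jwt = req.headers.get("Authorization")?.replace("Bearer ", "") ?? "";
  const anon = createClient(Deno.env.get("SUPABASE_URL")!, Deno.env.get("SUPABASE_ANON_KEY")!);
  const { data: { user }, error } = await anon.auth.getUser(jwt);
  if (error || !user) return new Response("Unauthorized", { status: 401 });

  const { type, message, page_url, user_agent, screenshot_path, app_id } = await req.json();

  // Signed URL (30-day expiry) for the private screenshot; the service-role
  // client stays server-side only and never ships to the browser.
  const admin = createClient(Deno.env.get("SUPABASE_URL")!, Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")!);
  let screenshotUrl = "none";
  if (screenshot_path) {
    const { data } = await admin.storage
      .from("bug-report-screenshots")
      .createSignedUrl(screenshot_path, 60 * 60 * 24 * 30);
    screenshotUrl = data?.signedUrl ?? "none";
  }

  // Deliver via Postmark — the token lives in Edge Function secrets, never in code.
  await fetch("https://api.postmarkapp.com/email", {
    method: "POST",
    headers: {
      "X-Postmark-Server-Token": Deno.env.get("POSTMARK_SERVER_TOKEN")!,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      From: "reports@keycontent.ai", // assumed verified sender address
      To: Deno.env.get("REPORT_TO_EMAIL")!,
      Subject: `[KeyContent · ${String(type).toUpperCase()}] from ${user.email}`,
      TextBody:
        `Type: ${type}\nApp: ${app_id}\nUser: ${user.email}  (id: ${user.id})\n` +
        `Page: ${page_url}\nBrowser: ${user_agent}\nScreenshot: ${screenshotUrl}\n\n` +
        `----- Their message -----\n${message}`,
    }),
  });

  return new Response(JSON.stringify({ ok: true }), {
    status: 200,
    headers: { "Content-Type": "application/json", "Access-Control-Allow-Origin": origin },
  });
});
```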

Operator Triage Flow

When a report lands in your inbox:

  1. Read the subject — type tells you the lane (bug/suggestion/question)
  2. Read the message + look at screenshot — 30 seconds to understand
  3. Assign priority (per HOW-WE-DO-PRIORITY.md) — P0/P1/P2/P3 + blast radius + narrative
  4. For Bug + Suggestion: brief Claude Code, get a PR, verify on staging, merge
  5. For Question: reply to the user directly with the answer
  6. Optionally: archive the email or move to a "triaged" folder

Recommended inbox rules (Gmail/Outlook):

  • [KeyContent · BUG] → flag + star + same-day
  • [KeyContent · SUGGESTION] → archive to "Product Backlog" label, weekly review
  • [KeyContent · QUESTION] → flag for same-day reply

Phase 1 Evolution (when full ops kicks in)

The pipeline is designed to grow without rewriting the frontend. Three additive enhancements:

A. Save to a Supabase table

Edge Function also writes a row to a bug_reports table (alongside sending the email):

  • Columns: id, type, message, page_url, user_agent, user_id, user_email, screenshot_path, app_id, created_at, status, priority, blast_radius
  • Enables analytics, search, and trend analysis over time
  • app_id is already in the payload — no migration of historical data needed
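
A sketch of the additive write, extending the Edge Function sketch earlier in this play (column names from the list above; the status and priority defaults are assumptions):

```ts
// Inside the same Edge Function, after the email is sent
// (uses the server-side service-role client from the sketch above).
await admin.from("bug_reports").insert({
  type,
  message,
  page_url,
  user_agent,
  user_id: user.id,
  user_email: user.email,
  screenshot_path,
  app_id,
  status: "new",      // assumed default; triage fills in the rest
  priority: null,     // P0–P3, set at triage time
  blast_radius: null, // radius:*, set at triage time
});
```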

B. Auto-create a GitHub Issue

Alongside the email, the Edge Function also creates a GitHub Issue:

  • Repo: keycontent (or matched by app_id)
  • Label: bug / enhancement / question (mapped from type)
  • Title: {type}: {first 60 chars of message}
  • Body: the same context as the email + a link to the screenshot
  • Operator triages on GitHub, not in email
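
The GitHub side is one REST call against the Issues API. A sketch, again extending the Edge Function above; the repo path and GITHUB_TOKEN secret name are assumptions:

```ts
// Also inside the Edge Function, after the email + table write.
const labelFor = { bug: "bug", suggestion: "enhancement", question: "question" } as const;

await fetch("https://api.github.com/repos/coreshift-hq/keycontent/issues", { // owner/repo assumed
  method: "POST",
  headers: {
    Authorization: `Bearer ${Deno.env.get("GITHUB_TOKEN")!}`, // hypothetical secret
    Accept: "application/vnd.github+json",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    title: `${type}: ${message.slice(0, 60)}`,
    body:
      `**User:** ${user.email}\n**Page:** ${page_url}\n**Browser:** ${user_agent}\n` +
      `**Screenshot:** ${screenshotUrl}\n\n${message}`,
    labels: [labelFor[type as keyof typeof labelFor]],
  }),
});
```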

C. Auto-reply on resolution

When the GitHub Issue is closed (or labelled resolved), an automated email goes back to the user:

  • Subject: Update on your report — {type}
  • Body: thanks for reporting, here's what we changed
  • Closes the loop, builds trust — almost no other apps do this

These three additions take the system from "useful" to "remarkable" — without the user-facing widget changing at all.


Why we elevated the standard

Most apps have a "Contact Us" form or a help@ email address. Reports come in with no context, and the support thread that follows is mostly back-and-forth asking for screenshots, browser versions, and account details. Hours wasted per report.

Our elevations:

  1. One-click reporting from any page (vs hunting for a contact link)
  2. Auto-captured context (vs interrogation email threads)
  3. Type-aware UX (placeholder adapts → users describe better)
  4. Screenshots without screenshot tools (drag-drop in the modal)
  5. Inbox-routable subject prefixes (vs everything in one queue)
  6. Forward-compatible for multi-app (app_id field already plumbed)

The trade: building the widget took ~2 hours of Claude Code time. The gain: every report that lands is actionable in 30 seconds instead of 3 hours of email tennis.


Cloning to Another App

When Coreshift HQ launches App #2, the bug-reports system clones in an afternoon:

  1. Drop <ReportIssueButton /> into App #2's layout (set app_id: "app2-slug")
  2. Create the bug-report-screenshots bucket + RLS policies on App #2's Supabase
  3. Deploy the report-issue Edge Function to App #2's Supabase
  4. Add POSTMARK_SERVER_TOKEN + REPORT_TO_EMAIL to App #2's secrets
  5. Optionally: separate Postmark Message Stream for App #2's reports

Everything else — the email format, the type system, the triage flow — is identical across apps. One operator can triage all apps from one inbox, filtered by the subject prefix.

This is why the system is also called "Sentinel" in the long-term vision: one set of eyes, every Coreshift HQ app.


When this doc changes

  • When the widget adds new types or fields
  • When a new app onboards (note any deviations)
  • When Phase 1 adds the DB table / GitHub Issue auto-creation
  • After any incident affecting the report pipeline
  • Bump version, note the change

See also

  • ../briefs/BRIEF-report-issue-widget.md — V1 implementation brief (the spec Claude Code shipped from)
  • ../briefs/BRIEF-report-issue-widget-round2.md — polish + type selector
  • ../briefs/BRIEF-rewire-to-postmark.md — email vendor swap
  • HOW-WE-DO-PRIORITY.md — what happens after the report lands
  • ../ROADMAP.md — Phase 2 enhancements queued
  • Deck Slide 4 — Scenario 1: a user hits a bug (the visual narrative)
PLAY 6

App Audits

Defaults are dangerous. Audit is the gate.

Status: Active · v1 · 2026-05-13
Applies to: Every Coreshift HQ app — at onboarding (before joining Sentinel ops) and quarterly thereafter
Owners: Operator (runs the checklist) · Claude Code (writes fix briefs for failed items) · Tools (Supabase advisors, Cloudflare audit logs, vendor dashboards)


What this doc is

A single-pass checklist that verifies an app has its default-off settings turned on before it's considered operator-ready. Runs as a gate at onboarding (first time) and as a re-audit on a quarterly cadence thereafter.

This doc exists because of a specific lesson learned: when KeyContent was handed over for operator-driven maintenance, webhook_events had RLS disabled and Supabase Auth had HIBP password protection off — both default-off settings the original developer never toggled on, because no checklist forced the question. This checklist is the elevation.


The principle

Defaults are dangerous. Cloud platforms ship with security off-by-default because they don't know your context. An audit is the gate that catches what was never toggled on.

Not paranoia. Not enterprise compliance theatre. Just a 30-minute pass that asks: "For every cloud setting that defaults to less-safe, did we make a conscious choice?"


When this audit runs

| Trigger | What it produces |
|---|---|
| Onboarding — before an app joins Sentinel ops | Pass/fail gate. App can't go live with Sentinel ops until 100% pass or every failure is consciously deferred. |
| Quarterly — recurring re-audit | Delta report: what regressed, what's new, what new vendor checks should be added. |
| Post-incident — when an incident traces to a default-off setting | Targeted re-check of the affected area + add the check to this doc. |

The Audit Checklist

Each item is a yes/no check the operator can verify in ≤ 2 minutes. Items marked (N/A until X) are only relevant once the named vendor or feature is in the stack.

Supabase — Database + Auth + Storage + Edge Functions

  • get_advisors (security) returns zero critical and zero warn entries
  • get_advisors (performance) reviewed — issues triaged or deferred
  • Row Level Security enabled on every public schema table
    • Verify: SELECT relname FROM pg_class WHERE relkind='r' AND relnamespace='public'::regnamespace AND NOT relrowsecurity returns 0 rows
  • Every RLS-enabled table has at least one policy (else the authenticated role is silently locked out)
  • HIBP password protection enabled (Auth → Password security → "Check passwords against HaveIBeenPwned")
  • Email rate limits configured (Auth → Rate limits) — defaults are often too generous
  • All Storage buckets are private by default; public buckets are explicit and justified
  • Storage bucket policies scope writes to the user's own folder (path pattern {user_id}/...)
  • CORS allowlist on every Edge Function lists only known origins (no *)
  • All Edge Function secrets exist in both staging AND production projects
  • Service role key is not referenced in any client-side code (grep client/ for SERVICE_ROLE — must be 0 hits)
  • Database backup retention reviewed (Supabase default is 7 days; upgrade if data loss tolerance < 7d)
  • No raw SQL changes against production outside the migration system (every prod schema change has a supabase/migrations/ file)
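
The automatable Supabase checks can live in one small script. A Node/TypeScript sketch; the SUPABASE_DB_URL env var name is an assumption:

```ts
import pg from "pg";
import { execSync } from "node:child_process";

// Check 1: every public-schema table has RLS enabled (the exact query from the checklist).
const db = new pg.Client({ connectionString: process.env.SUPABASE_DB_URL }); // assumed env var
await db.connect();
const { rows } = await db.query(`
  SELECT relname FROM pg_class
  WHERE relkind = 'r'
    AND relnamespace = 'public'::regnamespace
    AND NOT relrowsecurity
`);
if (rows.length > 0) {
  console.error("FAIL — RLS disabled on:", rows.map((r) => r.relname).join(", "));
}
await db.end();

// Check 2: service role key never referenced client-side (0 hits expected).
const hits = execSync("grep -rn SERVICE_ROLE client/ || true").toString().trim();
if (hits) console.error("FAIL — service role referenced in client code:\n" + hits);
```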

Cloudflare — DNS + CDN + WAF

  • WAF rate limiting on auth endpoints (/login, /signup, password reset)
  • DNSSEC enabled on the production domain
  • Bot Fight Mode (free tier) enabled (verified 2026-05-13 on keycontent.ai with JS Detections on)
  • SSL/TLS encryption mode set to Full (Strict) — not Flexible (verified 2026-05-13)
  • Pages preview deploys are gated (auth required) OR known to be safe-by-default (N/A for Railway-hosted apps like KeyContent)
  • Environment variables exist in both staging and production environments
  • Page Rules / Configuration Rules reviewed for leftover staging-only rules

GitHub — Repo Hygiene

  • Branch protection on main: requires PR + status checks pass + linear history
  • Branch protection on staging: requires PR + status checks pass
  • Default branch is the production branch (main for KeyContent)
  • .gitignore excludes .env*, *.pem, *.key, credential files
  • Secret scanning enabled (free for public repos; opt-in for private)
  • Dependabot enabled for security updates
  • .github/ISSUE_TEMPLATE/ populated per HOW-WE-DO-BUG-REPORTS.md
  • No secrets in commit history (gitleaks or gh secret-scanning pass clean)
  • Repo visibility matches expectation (private for KeyContent; public exposure would be intentional)

Postmark — Transactional Email

  • Sending domain verified (DKIM + Return-Path + SPF all green)
  • Bounce + complaint webhooks configured to the app (or Sentinel ingestion)
  • Message Streams separated by purpose (e.g. outbound for transactional vs broadcast for marketing)
  • Suppression list reviewed quarterly
  • Sender reputation score acceptable (≥ 80)

Sentry — Error Monitoring (N/A until Phase 1 Week 1 ships)

  • Frontend + backend (server + edge functions) projects exist
  • DSNs configured per environment (staging DSN ≠ prod DSN)
  • Source map upload verified (stack traces show readable file names, not minified)
  • Release tracking wired to deploys (commit SHA tagging confirmed)
  • Alert rules: new-issue email, regression email, high-frequency email
  • User context attached after auth (id + email)
  • PII capture is off (sendDefaultPii: false)
  • Session replay privacy defaults (mask all text, block all media)
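
When Sentry lands in Phase 1, the checklist above translates to an init call roughly like this. A sketch for the frontend project, assuming Vite-style env access; the DSN and release wiring are placeholders:

```ts
import * as Sentry from "@sentry/react";

Sentry.init({
  dsn: import.meta.env.VITE_SENTRY_DSN,     // separate DSN per environment
  environment: import.meta.env.MODE,        // "staging" vs "production"
  release: import.meta.env.VITE_COMMIT_SHA, // release tracking wired to deploys
  sendDefaultPii: false,                    // PII capture off
  integrations: [
    Sentry.replayIntegration({
      maskAllText: true,                    // session replay privacy defaults
      blockAllMedia: true,
    }),
  ],
  replaysOnErrorSampleRate: 1.0,
});

// After auth: attach user context (id + email) per the checklist, e.g.
// Sentry.setUser({ id: user.id, email: user.email });
```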

Better Stack — Uptime Monitoring (N/A until Phase 1 Week 2 ships)

  • Uptime monitors cover both staging and production endpoints
  • Check interval ≤ 3 min
  • Alert channels configured (email at minimum; SMS for SEV1 ideally)
  • Public status page exists at status.{app-domain} with brand customisation

Operational Hygiene

  • Report Issue widget submits a Bug, Suggestion, and Question successfully (one of each)
  • Inbox rules / labels routing [{App} · BUG/SUGGESTION/QUESTION] correctly
  • Postmark sender domain matches the app's primary domain
  • Rollback procedure mentally rehearsed per HOW-WE-DO-DEPLOYS.md Rule 2
  • Triage rhythm documented and being followed (10-min ritual per ROADMAP.md Phase 1 Week 4)
  • ROADMAP.md and KANBAN.md checked for staleness — shipped items marked [x]

What to do when an audit item fails

Don't fix items mid-audit. Finish the full pass, then triage all failures together. This prevents tunnel-vision on one finding while missing others.

For each failure:
  ↓
1. Categorise by impact using HOW-WE-DO-PRIORITY (P0/P1/P2/P3)
  ↓
2. Determine fix type:
   ▸ Code fix (e.g. missing RLS policy) → write a brief, hand to Claude Code
   ▸ Dashboard toggle (e.g. HIBP)        → add a kanban card, operator handles
   ▸ Process gap (e.g. no inbox rule)    → add to triage rhythm, no PR needed
  ↓
3. File the action — every failure produces either a PR, a kanban card,
   or an explicit "consciously deferred" note (with reason + reconsider date)

Why we elevated the standard

The industry has formal compliance audits — SOC2, ISO 27001, HIPAA, etc. They're designed for enterprise teams with compliance officers, weeks of evidence-gathering, and external auditors. Most of that ceremony doesn't apply to a solo operator running 1–N small apps. But the underlying principle — that defaults are dangerous and audit is a gate — does apply.

Our elevations:

  1. Specific to our stack — checks Supabase, Cloudflare, Postmark, GitHub, Sentry, Better Stack directly. Not generic CIS benchmarks.
  2. Automatable where possible — get_advisors, the RLS query, grep checks. Operator runs them; Claude Code can re-run them on demand.
  3. Lightweight — 30-minute pass, not a 2-week formal audit.
  4. Dual-purpose — same checklist for first-time onboarding AND quarterly re-audit. No second doc to maintain.
  5. Failure flows into existing workflows — Claude Code briefs for code fixes, kanban cards for dashboard toggles. No special incident category.
  6. App-portable — every Coreshift HQ app passes the same audit before joining Sentinel ops. App #2 inherits the discipline.

The win at scale: when Sentinel covers 3+ apps, the audit is the contract. No app ever ships with RLS disabled on a webhook log or HIBP off in Auth again, because the checklist would've blocked it.


When this doc changes

  • After any incident that traces to a default-off setting — add the check that would've caught it
  • When a new vendor enters the stack — add their section (Stripe security, Zernio HMAC, etc.)
  • When Supabase / Cloudflare / etc. ship new advisor types — add the lints they surface
  • Quarterly, after running the audit — what was missing? What's noise?
  • When App #2 onboards — verify every check is portable; flag KeyContent-specific items

Bump the version at the top and note the change.


Outstanding items for KeyContent (initial audit, 2026-05-13)

The first time this checklist runs on KeyContent, it produces this delta. Surfaced during the Sentinel folder review session.

| Item | Status | Action |
|---|---|---|
| RLS on webhook_events | ✅ Fixed on staging 2026-05-12; awaiting prod promotion | See ../briefs/BRIEF-rls-fix-webhook-events.md |
| HIBP password protection | ✅ Enabled on staging + prod 2026-05-13 | — |
| get_advisors zero-warn | ✅ Staging clean; prod has 1 outstanding (RLS pending promotion) | Resolves when RLS fix promotes to prod |
| Sentry sections | N/A | Will run after Phase 1 Week 1 |
| Better Stack sections | N/A | Will run after Phase 1 Week 2 |
| Branch protection | ✅ Rules added on main + staging 2026-05-13 | Admin bypass left ON for 2-person team flexibility |
| Backup retention | ✅ Verified 2026-05-13 — 8 days of daily DB backups | ⚠️ Storage objects not in backups; follow-up queued for Phase 1 Week 5 |

Re-run the full audit once Sentry and Better Stack are integrated, and quarterly thereafter.


See also

  • HOW-WE-DO-PRIORITY.md — how audit failures get triaged
  • HOW-WE-DO-DEPLOYS.md — Type C migrations (the usual fix path for RLS / schema gaps)
  • HOW-WE-DO-INCIDENTS.md — when a failure is severe enough to be incident-mode
  • ../ROADMAP.md — Phase 4 (Sentinel) — passing this audit is the gate for any new app joining
  • Supabase Database Advisors docs
  • Cloudflare WAF docs