INTERNAL PLAYBOOK · v1 · MAY 2026

How We Run Apps

Six plays. One deliberate system. Built once for KeyContent. Cloned in an afternoon for every Coreshift HQ app after.

Owner: Abe (Operator)
Implementer: Claude Code
Stack: Supabase · Cloudflare · GitHub · Postmark
Scale: 10–20 users today
PLAY 1

Priority Framework

How we triage what gets fixed first.

Status: Active · v1 · 2026-05-11
Applies to: Every Coreshift HQ app · Currently: KeyContent
Owners: Operator (assigns priority) · Claude Code (consults for triage cues)


What this doc is

How Coreshift HQ triages and prioritizes issues. Built on the P0–P3 industry standard (so we play nicely with the rest of the world), elevated with two custom fields that force better thinking:

  1. User Impact Narrative — a one-line human story, not a technical description
  2. Blast Radius — explicit count of how many users are affected

The standard tells us what to call things. Our additions force us to think about who's affected before we act.


The Four Priority Tiers

| Tier | Name | When | Action | Default response time |
|------|------|------|--------|------------------------|
| P0 | Drop Everything | App is down for most/all users · Data loss · Active security breach | Fix now, today | Hours |
| P1 | This Week | A core feature is broken · OR a smaller feature broken for many users | Fix in 1–3 days | Same business day |
| P2 | This Milestone | A bug exists but users have a workaround · OR only a small subset is affected | Fix in current sprint | Within the week |
| P3 | When Convenient | Polish · Cosmetic · Wishlist · Edge case | Backlog — close if stale | No SLA |

The Decision Tree (use this in 10 seconds)

1. Is the app down for everyone?              → YES → P0
                                                NO ↓
2. Is a core feature broken for many users?   → YES → P1
                                                NO ↓
3. Is anything functionally broken?            → YES → P2
                                                NO ↓
4. Cosmetic / nice-to-have?                    → P3

"Core feature" for KeyContent = login · the main content workflow · saving work · billing/payments. Each app defines its own list in its in-repo PRIORITY.md.


The Two Required Fields (the elevation)

1. User Impact Narrative

Every triaged issue must include a one-line story of how this affects the user's day.

| ❌ Bad (technical) | ✅ Good (human) |
|---|---|
| "Save button throws TypeError" | "Users lose their drafts when clicking Save — they have to retype everything" |
| "API 500 on /jobs endpoint" | "Users can't see their own jobs list when they log in — looks like everything's gone" |
| "Modal animation janky on Safari" | "Safari users see a flicker when opening the report widget — minor visual annoyance" |

Why: Triage from the user's perspective, not the engineer's. The narrative tells you the severity better than the stack trace.

2. Blast Radius

Tag every issue with one of:

| Tag | Meaning |
|---|---|
| radius:single | One user affected (so far) |
| radius:some | A subset — e.g., users on Safari, free-tier users, users in a specific region |
| radius:many | Most users will hit this |
| radius:all | Everyone, every session |

Why: Lets us spot escalation. A P2 · radius:some issue that gets re-reported as radius:many jumps to P1 instantly.


How Type, Priority, and Radius Compose

These are three orthogonal axes — all assigned, but at different moments:

| Field | Set by | When |
|---|---|---|
| Type (Bug / Suggestion / Question) | The user | At report time (via the widget) |
| Priority (P0–P3) | The operator | At triage time |
| Blast Radius | The operator | At triage time |

A typical fully-triaged issue looks like:

Type: Bug
Priority: P1
Blast Radius: radius:many
Narrative: Users lose their drafts when clicking Save — they have to retype everything.

This is rich enough that Claude Code can read it cold and start fixing. That's the whole point.
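
For reference, the fully triaged record plus the escalation rule from Blast Radius above, as a TypeScript sketch (illustrative names, not a shipped schema):

```ts
type Priority = "P0" | "P1" | "P2" | "P3";
type Radius = "radius:single" | "radius:some" | "radius:many" | "radius:all";

interface TriagedIssue {
  type: "Bug" | "Suggestion" | "Question"; // set by the user at report time
  priority: Priority;                      // set by the operator at triage time
  radius: Radius;                          // set by the operator at triage time
  narrative: string;                       // one-line user impact story
}

// Escalation rule: a P2 whose radius widens to many/all jumps to P1.
function retriage(issue: TriagedIssue, newRadius: Radius): TriagedIssue {
  const widened = newRadius === "radius:many" || newRadius === "radius:all";
  const priority: Priority = issue.priority === "P2" && widened ? "P1" : issue.priority;
  return { ...issue, radius: newRadius, priority };
}
```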


Common Mistakes (don't do these)

  • Everything is P1. If everything's urgent, nothing is. Be ruthless.
  • The loudest user sets priority. Impact sets priority, not volume. A single angry user reporting a typo is still P3.
  • Bugs you found ranked above bugs users found. Reverse it — user-reported issues are real, by definition.
  • Priority never changes. Re-triage when you learn more. A P2 affecting radius:some that turns into radius:many becomes P1.
  • Skipping the narrative. Without it, you're prioritizing on vibes. Force the user-impact line every time.

Why we tweaked the standard

The industry's P0–P3 system is a vocabulary, not a thinking framework. Two issues can both be "P1" and have wildly different actual impact on users. By requiring narrative + radius, we force a small amount of structured thinking that:

  • Prevents priority inflation
  • Makes triage trustworthy (numbers are decisions, not guesses)
  • Generates great Claude Code briefs as a byproduct
  • Sets up future automation — radius:all could auto-page; radius:single + P3 could auto-close after 60 days

These are not exotic additions. They're cheap, repeatable habits that compound over hundreds of triaged issues.


How this lives in practice

  • The widget captures Type at report time
  • Triage (10 min/day) sets Priority + Blast Radius + Narrative
  • GitHub Issues stores all three as labels and fields (Phase 1)
  • Sentinel (future ops dashboard) filters and sorts by these
  • The pitch deck (Slide 7) is the public-facing summary of this doc

When this doc changes

  • Edit when we discover a new tier is needed (rare)
  • Edit when the "core feature" list changes per app
  • Bump the version at the top
  • Mention in the next CEO update

See also

  • ../briefs/BRIEF-report-issue-widget.md — how Type is captured
  • ../briefs/BRIEF-report-issue-widget-round2.md — Type selector implementation
  • ../deck/KeyContent-Maintenance-Ops.pptx — Slide 7 (Priority Framework)
  • ../ROADMAP.md — Phase 1 Week 5 (in-repo PRIORITY.md follows this doc)
PLAY 2

PR Reviews

How a non-coder verifies every change.

Status: Active · v2 · 2026-05-13 (Railway revision)
Applies to: Every PR from Claude Code or any contributor
Owners: Operator (reviewer) · Claude Code (author)

v2 note: Original v1 assumed Cloudflare Pages preview URLs per PR. The actual stack uses Railway, which doesn't generate per-PR previews on the free tier. The review model has shifted from "verify on a preview, then merge" to "review-by-description pre-merge, behavior-test on staging post-merge." Safety net is Sentry + Better Stack + Railway fast rollback.


What this doc is

A behavior-based, code-free PR review checklist. Designed for an operator who doesn't write code to confidently approve or push back on changes — at speed, every day.


The principle

Behavior > code.

Most engineering orgs review PRs by reading source code line by line. We can't — and we don't need to. We trust Claude Code's implementation patterns and verify what actually matters: that the change does what we asked for, and nothing else broke.

Our verification happens at the user-experience layer, not the source-code layer.

The trade: we don't personally catch every subtle bug. We rely on Sentry monitoring + Better Stack uptime + Railway fast rollback as the safety net for what slips through.


Two-Phase Review (Railway-adapted)

Railway free tier doesn't create a preview URL per PR — only the staging branch gets deployed. So the review splits into two phases:

  • Phase A (pre-merge): description-based review only. No behavior-testing yet.
  • Phase B (post-merge-to-staging): behavior verification on staging.keycontent.ai after Railway deploys.

Total time: ~3 min pre-merge + ~5 min post-merge.

Phase A — Pre-merge review (~3 min)

Step 1 · Read the PR description carefully

  • Does the change match what I briefed?
  • Are the "What's new" items a subset of what I asked for? (No scope creep.)
  • Does the test plan match what we should verify?

✅ Yes → continue. ❌ No → comment to clarify before merging.

Step 2 · Read the files-changed list

Don't read the code itself — just the list of files. Quick sanity check:

  • Does the file count match the scope? (A "fix subject line" PR touching 12 files is a red flag.)
  • Are any unexpected files touched? (e.g., a UI fix PR that changed migration files — ask why.)

If files look reasonable → proceed. If suspicious → ask Claude Code to explain.

Step 3 · Ask clarifying questions in PR comments

You can't behavior-test yet, so questions are the only pre-merge gate. Examples:

  • "You added a new env var — does it have a sensible default if missing?"
  • "You touched the auth flow — confirm sign-in/sign-out still work after this?"
  • "This change affects the report widget — does staging Postmark still get hit?"

Claude Code answering well = green light to merge. Claude Code stumbling = bounce back.

Phase A passes → merge to staging.

Phase B — Post-merge behavior testing on staging (~5 min)

After merging to staging, Railway redeploys in ~2-3 min. Open https://staging.keycontent.ai and:

Step 4 · Reproduce the original goal

  • For bugs: repeat the steps from the original issue. Confirm the bug is gone.
  • For features: use the feature exactly as a user would. Confirm it does what we asked.

Step 5 · Smoke test the critical flows

After verifying the targeted change, quickly probe 2–3 high-traffic paths to make sure nothing else broke.

KeyContent's critical flows:

  • Sign in / sign up
  • Dashboard loads cleanly
  • The primary content workflow (create / edit / save a job)
  • Sign out

Each click ~5–10 seconds. Whole smoke test ~1 minute.
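
If the smoke test is ever automated, it maps directly onto a browser test. A minimal Playwright sketch, assuming hypothetical selectors and test credentials (nothing here exists in the repo yet):

```ts
import { test, expect } from "@playwright/test";

// Hypothetical selectors and env vars — adjust to the real KeyContent UI.
test("staging smoke: sign in, dashboard, save a job, sign out", async ({ page }) => {
  await page.goto("https://staging.keycontent.ai/login");
  await page.fill('input[name="email"]', process.env.SMOKE_EMAIL!);
  await page.fill('input[name="password"]', process.env.SMOKE_PASSWORD!);
  await page.click('button[type="submit"]');

  await expect(page.getByText("Dashboard")).toBeVisible(); // dashboard loads cleanly

  await page.click("text=New Job");                        // primary content workflow
  await page.fill('input[name="title"]', "smoke-test job");
  await page.click("text=Save");
  await expect(page.getByText("smoke-test job")).toBeVisible();

  await page.click("text=Sign out");                       // sign out still works
});
```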

Step 6 · Mobile check

  • Resize your browser to ~400px wide (or pull it up on your phone)
  • Hit the changed page
  • Confirm nothing is broken or unreadable

Step 7 · If anything fails

  • Comment on the merged PR with the failing details
  • Brief Claude Code for the follow-up fix (new PR)
  • If the failure is severe (auth, payment, data loss): rollback staging via Railway → Deployments → redeploy previous

All checks pass → ready to promote staging → main.


Decision matrix

| Situation | Phase | Action |
|---|---|---|
| Description + files match brief, questions answered well | A | ✅ Merge to staging |
| PR description doesn't match my brief | A | ❌ "This doesn't match my brief — please redo X" before merging |
| Suspicious files-changed list | A | ❌ Ask Claude Code to explain before merging |
| Already merged, all behavior checks pass on staging | B | ✅ Ready to promote staging → main |
| Bug isn't actually fixed (on staging) | B | ❌ Comment with exact failing steps; new PR for fix |
| Fix works, but smoke test broke something else | B | ❌ Comment + screenshot; new PR for fix |
| Mobile broken on staging | B | ❌ Comment "Mobile broken — see screenshot"; new PR |
| Severe failure on staging (auth, data loss) | B | 🚨 Rollback staging via Railway, then fix via new PR |

Example comments to give Claude Code

Pre-merge clarifying question (Phase A):

"You're touching the auth middleware — confirm sign-in still works for existing sessions after this lands. Don't want to log everyone out."

Bug not fixed (Phase B, post-merge on staging):

"Tested on staging. Original issue still happens: when I click Save on the draft, the page reloads and the content is lost. New PR for follow-up fix?"

Smoke test broke something (Phase B):

"Targeted fix works on staging. But noticed: clicking 'New Job' button now crashes the page. Reproduced 3 times on staging. Screenshot attached. Need a follow-up PR."

Mobile broken (Phase B):

"On mobile width (~400px) on staging, the new ticket-type selector wraps weirdly and 'Send Report' is cut off. Screenshot."

Scope creep (Phase A):

"I see you also refactored the auth code — please pull that out into a separate PR so this one stays focused on the bug fix."


What I never do

  • Read code line by line
  • Approve a PR based on "looks good to me"
  • Skip the smoke test
  • Merge on a Friday afternoon
  • Trust the test plan without running it myself

Post-merge to PROD: the 1-hour watch

After promoting staging → main and Railway redeploys production:

  • New error spike in Sentry? → rollback via Railway (Deployments → redeploy previous, ~2 min)
  • Better Stack monitor flipped red? → same: rollback first, diagnose later
  • User reports a problem via the widget? → check if it correlates with the deploy; rollback if yes
  • All quiet for 1 hour? → done.

For hotfixes, watch for 2 hours instead of 1.


Why we elevated the standard

The industry assumes the PR reviewer is also a coder. Our system inverts that assumption: the operator doesn't read code, so the checklist is 100% behavior-focused.

This:

  1. Unblocks non-coding operators from running a real engineering ops process
  2. Catches the bugs that actually matter — user-visible breakage, not stylistic preferences
  3. Pairs with safety nets — Sentry + Better Stack + Railway fast rollback catch what visual testing misses
  4. Adapts to platform reality — when Cloudflare-Pages-style per-PR previews aren't available (Railway), the review moves to staging-after-merge with rollback as the safety net
  5. Scales across apps — same two-phase pattern works for KeyContent and every Coreshift HQ app after

The unfair advantage: we move faster than teams that require code-reading reviewers, because every PR has exactly one reviewer (the operator) and the review is mechanical.


When this doc changes

  • When new critical flows emerge in any app (add to Step 4 list)
  • When a recurring failure mode isn't being caught by the smoke test
  • After any major post-merge incident (add a check to prevent recurrence)
  • Bump the version at the top and note the change

Per-app critical flows

Each Coreshift HQ app declares its own critical flows for Step 4. KeyContent's are above. App #2 will add its own.


See also

  • HOW-WE-DO-PRIORITY.md — how we triage what gets fixed first
  • HOW-WE-DO-DEPLOYS.md — the deploy pipeline + rollback procedures (v2 Railway-adapted)
  • HOW-WE-DO-INCIDENTS.md — what to do when the post-merge watch fires
  • HOW-WE-DO-APP-AUDITS.md — the broader audit framework
  • Deck Slide 6 — Shipping a New Feature (the visual narrative — may need slight rewording for Railway in next deck revision)
  • Deck Slide 10 — Role Distribution (Operator: Triage, brief, verify)
PLAY 3

Incident Response

Pre-written calm beats in-the-moment heroics.

Status: Active · v1 · 2026-05-13
Applies to: Any unplanned event degrading user experience in production
Owners: Operator (incident driver) · Claude Code (fix implementer) · Tools (detection)


What this doc is

The pre-written playbook for when something is broken in production. Designed for a solo operator at 2 AM, when you don't have time to think — just follow the script.


The principle

Pre-written calm beats in-the-moment heroics.

The worst time to design a process is during a fire. So we wrote it now. When something breaks, you don't think — you execute the script. Decisions are made in advance; the runbook is the brain.


Severity Levels

| Level | Definition | Response window | Public comms |
|---|---|---|---|
| SEV1 | App is down for everyone · data loss · active security breach | Now | Status page + user email if > 30 min |
| SEV2 | Core feature broken · app down for many users | Within 30 min | Status page |
| SEV3 | Degraded but functional · slow · minor breakage | Same day | Internal only |

Non-urgent bugs that aren't actively breaking things → use the P0–P3 priority framework, not the incident process.


The 6-Step Response

1. ACKNOWLEDGE (0 min)

  • Alert arrived from Better Stack or Sentry
  • You're aware. The clock is running.
  • Internally: stop everything else. This is your only task.

2. ASSESS severity (≤ 60 seconds)

Three quick questions:

  • Can users still use the app at all? → No: SEV1 or SEV2. Yes: SEV3.
  • How many users affected? → All: SEV1. Many: SEV2. Few: SEV3.
  • Data being lost or corrupted? → Yes: always SEV1.

Pick a level. Don't agonize. You can adjust as you learn more.
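
The same three questions as a sketch, for reference (severity is still a human call; when the questions disagree, the code picks the more severe answer):

```ts
type Sev = "SEV1" | "SEV2" | "SEV3";

// Illustrative only — mirrors the three assessment questions above.
function assess(q: { usableAtAll: boolean; affected: "all" | "many" | "few"; dataLoss: boolean }): Sev {
  if (q.dataLoss) return "SEV1";                 // data loss is always SEV1
  if (q.usableAtAll) return "SEV3";              // degraded but functional
  return q.affected === "all" ? "SEV1" : "SEV2"; // down for all vs many
}
```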

3. COMMUNICATE (≤ 5 minutes from acknowledge)

  • SEV1 / SEV2: update the public status page immediately. Use the templates below.
  • SEV3: note in your incident log; update the status page only if it persists past 30 min.

4. DIAGNOSE — "what changed?"

Run through these dashboards in order. 9 out of 10 incidents trace to a recent deploy.

| Source | What to look for |
|---|---|
| Recent deploys | Anything shipped in the last 4 hours? |
| Sentry | New error spike? Click through for stack trace + breadcrumbs |
| Railway logs | 5xx surge? Crash loops? Request timeouts? |
| Supabase logs | DB errors? Connection issues? Migration problems? |
| Postmark | Are emails bouncing? Server outage? |
| Third-party status pages | Railway · Cloudflare · Supabase · Postmark all have public status pages |

5. MITIGATE — pick the safest path

Was there a recent deploy?
   YES → ROLLBACK FIRST. Diagnose later.
         (Railway → Deployments → redeploy previous; live in ~2 min.)
   NO  ↓

Is it a third-party outage?
   YES → Communicate, wait, monitor. Update users.
   NO  ↓

Is it a code bug requiring a new fix?
   → Brief Claude Code with full Sentry trace + context.
     Get a hotfix PR. Merge to staging, verify there, promote to main.
     (Use the standard PR review checklist — abbreviated for hotfix.)

6. RESOLVE

  • Test the affected flow yourself on production
  • Update status page to "Resolved"
  • Note in your incident log

The 1 AM Rule

When something fires at 1 AM, do the minimum to make it stop:

  1. Rollback the last deploy if there was a recent one
  2. Communicate to users via status page
  3. Sleep. Investigate properly in the morning.

Don't do at 1 AM:

  • Write or brief new code
  • Deploy speculative fixes
  • Read complex logs
  • Make architectural decisions

The 1 AM rule exists because sleep-deprived decisions are how minor incidents become major ones.


Claude Code's Role During Incidents

You don't write code. Even during incidents. Especially during incidents.

When a hotfix is needed:

  1. Capture the full Sentry trace + URL + user details + timestamp
  2. Open Claude Code (a fresh session for the incident, if helpful)
  3. Brief it like: "P0 hotfix. {feature} broken in production since {time}. Sentry error attached. Rollback wasn't possible because {reason}. Need a fix opened as a PR immediately."
  4. Merge to staging and verify the fix on staging.keycontent.ai
  5. Promote staging → main

Operator's role is triage, brief, verify — same as always, just faster.


Pre-Written Status Page Templates

Investigating:

We're investigating reports of {issue} affecting {feature}. We'll provide an update within 30 minutes.

Identified:

We've identified the issue with {feature}. We're working on a fix now.

Monitoring:

A fix has been deployed. We're monitoring to confirm everything is working as expected.

Resolved:

This incident is resolved. {Feature} is fully operational. Sorry for the disruption.


When to Email Users (beyond the status page)

Most incidents: status page is enough.

Email users only when ALL of these are true:

  • SEV1 lasted > 30 minutes, OR data was visibly affected
  • The user might reasonably think their account is broken
  • You can identify the affected users

Template:

Subject: We had a brief outage — your data is safe

Hi {name},

Earlier today between {start} and {end}, {feature} was {down/broken}.
We're back to normal now.

What you might have noticed: {brief description}
What we did: {1-2 sentences}

Thanks for your patience.

— The Coreshift HQ Team

Post-Mortem — Within 24 Hours of any SEV1 / SEV2

Save as incidents/YYYY-MM-DD-N.md in the repo. Lean template — 5 sections:

## Incident YYYY-MM-DD-N

**Severity:** SEV{1|2|3}
**Duration:** Detected HH:MM → Resolved HH:MM ({X} minutes)
**Affected:** {features, ~user count}

### What happened
{One paragraph.}

### Root cause
{One paragraph.}

### How we caught it
{Sentry / Better Stack / user report / other?}

### What we did
{Bullet list of actions taken, in order, with timestamps.}

### Action items
- [ ] Prevent recurrence: {specific change}
- [ ] Improve detection: {specific change, if applicable}
- [ ] Update this runbook: {specific addition, if applicable}

Don't:

  • Skip the post-mortem because "everyone knows what happened"
  • Make it about blame (the system failed, not the person)
  • Write a novel — 5 sections is enough

Do:

  • File the action items as P1 GitHub Issues
  • Re-read recent post-mortems quarterly — patterns emerge

Common Anti-Patterns (don't do these)

  • Skipping comms — users finding out from a friend is worse than a brief status update
  • Forcing a fix at midnight — rollback + sleep is almost always better
  • Investigating before mitigating — stop the bleeding first
  • No post-mortem because "it's fixed" — patterns repeat unless documented
  • Skipping the status page update because "it's small" — the rule is consistency, not severity

Why we elevated the standard

The industry has detailed incident playbooks designed for large engineering teams with rotating on-call schedules, incident commanders, and war rooms. Most of that doesn't apply when you're a solo operator with Claude Code as your implementer.

Our elevations:

  1. Solo-operator-aware — no incident commander, no war room, no Slack #incidents channel. Just you + the runbook.
  2. The 1 AM Rule — explicit permission to do the minimum at night. Sleep is operational equipment.
  3. Pre-written communications — copy/paste during the fire, edit later.
  4. Claude Code is the implementer — you brief, it codes, you verify. Same workflow as a normal day, just under pressure.
  5. Lean post-mortems — 5 sections, not 50. Trends matter; ritual doesn't.

These choices make incidents survivable for a small team without sacrificing the discipline that prevents recurrence.


When this doc changes

  • After any SEV1 — what new step or check would have helped?
  • When the team grows beyond 1 — add coordination details
  • When new tools enter the stack — add their dashboards to Step 4
  • After a recurring pattern emerges — codify it
  • Bump version, note the change

See also

  • HOW-WE-DO-PRIORITY.md — P0/P1 priorities map to SEV1/SEV2 severities
  • HOW-WE-DO-PR-REVIEWS.md — the standard review process; incident review is the abbreviated version
  • HOW-WE-DO-DEPLOYS.md — rollback details (coming next)
  • ../ROADMAP.md — Phase 1 Week 2 wires up Better Stack + status page (the detection layer)
  • Deck Slide 5 — Scenario 2: outage at 2 AM (the visual narrative)
PLAY 4

Deploy Pipeline

Three rules. No exceptions.

Status: Active · v2 · 2026-05-13 (Railway revision)
Applies to: Every code change reaching production · Every Coreshift HQ app
Owners: Operator (verifier) · Claude Code (implementer) · Railway + Supabase + Cloudflare (mechanism)

v2 note: This doc was originally written assuming Cloudflare Pages hosts the frontend with per-PR preview URLs and gradual traffic rollout. The actual stack uses Railway for frontend + Express server hosting, with Cloudflare in front for DNS, CDN, WAF, and SSL (not Pages). Sections below have been rewritten to match Railway's mechanics. Future apps in the Sentinel portfolio that use Cloudflare Pages will need a separate variant.


What this doc is

The pre-written rules for moving code from "Claude Code just wrote it" to "real users are using it." Designed for an operator who doesn't read code to ship safely, daily.


The principle

Every deploy goes through staging. Every deploy is rollback-able fast. Every deploy is observable.

Three rules. No exceptions. The cost of following them is ~10 minutes. The cost of skipping them is a user-visible incident — every time.


The Standard Deploy Pipeline

Claude Code writes fix → opens PR against staging branch
            ↓
[Known gap: Railway free tier has no per-PR preview URLs]
            ↓
Read PR description + ask Claude Code clarifying questions
   (per HOW-WE-DO-PR-REVIEWS.md — review-by-description model)
            ↓
Merge to staging branch
            ↓
Railway auto-deploys to staging.keycontent.ai  (~2-3 min)
            ↓
Smoke test on staging
            ↓
Promote staging → main via PR  (the promotion gate)
            ↓
Railway auto-deploys to keycontent.ai  (~2-3 min, atomic cutover)
            ↓
Sentry + Better Stack watch for 1 hour
            ↓
Done.

The Three Rules

Rule 1 · Every deploy goes through staging

  • No direct merges to main
  • No "trivial" exceptions — the rule has no carve-outs
  • Staging is a real Railway environment with separate Supabase project + separate data
  • Even a one-character typo fix follows the path

Rule 2 · Every deploy is rollback-able fast

  • Railway keeps every previous deployment in the service's Deployments tab
  • Rollback = click a previous deployment → Redeploy. Live in ~2 minutes.
  • Database migrations need extra thinking (see Type C below)
  • Before merging anything risky: mentally rehearse "if this breaks, which deployment do I redeploy?"

Rule 3 · Every deploy is observable

Railway free tier deploys atomically — there is no native gradual traffic rollout. The discipline shifts from pre-deploy gradual exposure to post-deploy fast detection + revert. Our safety net:

  • Sentry catches application errors automatically within seconds of the deploy
  • Better Stack catches downtime/availability issues with 3-minute checks
  • Railway rollback is the kill switch when either alarm fires
  • The 1-hour watch after every prod deploy is the operator's commitment to be reachable

If Railway gains gradual rollouts (or the team upgrades to a plan with preview environments and that capability), this rule reverts to "gradual." Until then: deploy → watch → revert if needed.


Deploy Types

Type A · Standard feature or fix

Follow the standard pipeline above. No extra steps.

Type B · Hotfix (incident response)

When an incident requires an emergency fix:

  1. Brief Claude Code with the full incident context (Sentry trace + steps to reproduce)
  2. PR opened against staging
  3. Abbreviated review: only check the targeted fix (skip the full checklist) IF the incident is SEV1
  4. Merge to staging → quick smoke test → promote to main
  5. There's no gradual ramp on Railway — the cutover is atomic, so move straight to the post-deploy watch
  6. Sentry watch for 2 hours after a hotfix (double the normal window)

Type C · Database migration

Migrations are the riskiest deploy type because rollback isn't clean.

  1. Claude Code writes the migration in a PR
  2. Apply to staging first (Supabase migration CLI or apply_migration MCP)
  3. Test the affected features end-to-end on staging
  4. Sample queries to confirm data didn't corrupt
  5. Snapshot production Supabase before applying to prod (backup point)
  6. Apply migration to production
  7. Verify in production
  8. If anything breaks: Postgres migrations are NOT easily reversible. Have Claude Code write a forward-fix migration. Restore from backup only as a last resort.

Migration rule: never apply a migration to production without first applying it to staging and verifying the affected features.

Type D · Secrets / environment changes

Adding or rotating a secret (e.g., POSTMARK_SERVER_TOKEN):

  1. Add to staging Supabase first
  2. Verify behavior on staging
  3. Add to production Supabase
  4. Deploy any code that depends on the new secret (if not already deployed)
  5. Verify in production

Never: add a secret only to prod without testing on staging first.


The Staging → Production Promotion Checklist

When ready to promote a feature from staging to production, all must be true:

  • Feature has been live on staging for at least 24 hours with no Sentry alerts
  • All Edge Function secrets exist in production Supabase
  • All required migrations have been applied to production
  • Storage buckets and RLS policies exist in production (if used)
  • CORS allowlist on Edge Functions includes the production origin
  • No outstanding Sentry errors related to the feature in staging
  • You have at least 2 hours to monitor after promotion (don't promote at end of day)
  • Rollback plan is mentally rehearsed

If any box is unchecked: pause, fix, then promote.


Rollback Procedures (memorize these)

Frontend + Express server (Railway)

  1. Railway dashboard → KeyContent project → the service that runs the app
  2. Deployments tab → find the last known-good deployment (sorted newest first; pick the one before the breaking deploy)
  3. Click the ⋯ menu → Redeploy
  4. Confirm. Live within ~2 minutes (Railway rebuilds + cuts over).

If you need to roll back simultaneously on staging + production, repeat the process per service. Each Railway environment is independent.

Edge Function (Supabase)

  1. Check out the previous git commit on the function file
  2. Re-deploy via Supabase CLI: supabase functions deploy {function-name} --project-ref {prod-ref}
  3. OR brief Claude Code to revert and open a new PR (slower but safer)

Database migration

  • Not easily rollback-able. Use a forward-fix migration.
  • Restore from production backup only as last resort (you lose data written between the snapshot and now, AND Supabase Storage objects are NOT in the daily backups — see HOW-WE-DO-APP-AUDITS.md).

Secret rotation gone wrong

  • If new secret is broken: revert the secret value in Supabase or Railway dashboard to the previous one
  • Or remove the secret entirely if the code can fall back gracefully

Cloudflare-level issues

  • DNS or proxy misconfiguration → Cloudflare → DNS → Records → revert the record
  • SSL/TLS mode change broke things → SSL/TLS → Overview → revert encryption mode
  • Cloudflare doesn't deploy app code, so rollbacks here are config-level only

What to Watch After Every Deploy

For 1 hour after a normal merge (2 hours for hotfix), glance at:

  • Sentry — new error spike on keycontent-frontend or keycontent-backend?
  • Railway — deploy succeeded green, no crash-loop, log output looks healthy?
  • Better Stack — both monitors still green at https://keycontent.betteruptime.com?
  • Your inbox / bug report widget — any user reports?
  • Cloudflare — Analytics dashboard for unusual traffic or error rates?

If anything looks off: rollback first, diagnose later.


Anti-Patterns (don't do these)

  • "It's just a tiny change" — every change goes through the pipeline
  • Promoting on Friday afternoon — there's no good reason
  • Skipping the staging smoke test — staging exists precisely for this
  • Adding a secret to prod first — always staging first
  • Manual SQL changes against prod — use the migration system; untracked changes break future migrations
  • Deploying without a rollback path mentally rehearsed — if you can't rollback, don't deploy

Why we elevated the standard

The industry has dozens of deploy frameworks: blue-green, canary, feature flags, ring-based rollouts, GitOps, etc. Most are designed for large engineering teams with dedicated SRE. We don't need that complexity.

Our elevations:

  1. Three rules, no exceptions — easy to remember, hard to violate by mistake
  2. Observable-first deploys — Sentry + Better Stack + Railway rollback compose into a fast-detect-and-revert safety net. Replaces the pre-deploy gradual ramp we don't have on Railway.
  3. Description-based review pre-merge, behavior-based verification post-merge — pairs with HOW-WE-DO-PR-REVIEWS.md
  4. Staging is non-optional — even for "trivial" changes (a culture choice, not a technical one)
  5. Rollback-first culture — explicit safety bias during incidents
  6. Friday rule — no production deploys after Thursday lunch unless it's a hotfix

The trade: deploys take ~10 extra minutes each. The gain: production stays stable, mornings stay calm, and you can promote with confidence.


Concrete Example — Promoting V0 (Report Issue Widget) to Production

This is the canonical promotion exercise. Use it as the reference for future promotions:

  1. Confirm V0 has been live on staging.keycontent.ai for 24+ hours with no Sentry alerts
  2. Add POSTMARK_SERVER_TOKEN and REPORT_TO_EMAIL to production Supabase Edge Function secrets — operator dashboard
  3. Apply the storage migration to production (creates bug-report-screenshots bucket + RLS policies) — Claude Code task
  4. Deploy report-issue Edge Function to production Supabase — Claude Code task
  5. Verify Edge Function CORS allows the production origin (https://keycontent.ai) — Claude Code task
  6. Confirm Postmark sender domain is verified for production — operator dashboard
  7. Merge staging → main via PR — triggers Railway production deploy automatically (~2-3 min)
  8. Smoke test on production: submit a Bug report with screenshot — operator
  9. Sentry + Better Stack watch for 2 hours — operator
  10. If alarms fire: Railway → KeyContent prod service → Deployments → redeploy previous (the rollback)
  11. Done.

After this lands, this section becomes the "we've done it once" reference for App #2.


When this doc changes

  • After any incident traced to a deploy — what gate failed?
  • When the stack gains a new component (e.g., when Sentry is wired up in Phase 1, add a release-tracking step)
  • When a new deploy type emerges (e.g., feature flags)
  • Bump version, note the change

See also

  • HOW-WE-DO-PRIORITY.md — what justifies a hotfix vs standard deploy
  • HOW-WE-DO-PR-REVIEWS.md — the verification gate before merge
  • HOW-WE-DO-INCIDENTS.md — what to do when a deploy goes wrong
  • ../ROADMAP.md — V0 production promotion is queued under "Phase 0 wrap-up"
  • Deck Slide 6 — Scenario 3: shipping new code safely (the visual narrative)
PLAY 5

Bug Reports

Make reporting effortless. Make context automatic.

Status: Active · v1 · 2026-05-13
Applies to: User-facing feedback channel · KeyContent (live on staging) · Future Coreshift HQ apps
Owners: Operator (triager) · Claude Code (implementer) · Tools (capture, store, deliver)


What this doc is

The end-to-end definition of how user feedback flows into the system — from the click of the floating widget to the issue landing in the triage queue. Codifies what's already shipped on staging.keycontent.ai.


The principle

Make reporting effortless. Make context automatic.

The user types one sentence. The system attaches everything else — who they are, where they were, what their browser is, what they were looking at. That's the difference between a useful report and a frustrating email thread.


The Pipeline

User clicks the floating "?" button on any page
            ↓
Modal opens — pre-filled with context awareness
            ↓
User picks a type:  🐛 Bug   💡 Suggestion   ❓ Question
            ↓
User types their message (placeholder adapts to type)
            ↓
User optionally drops in a screenshot
            ↓
On submit: screenshot uploads to Supabase Storage (if present)
            ↓
POST to Edge Function with: { type, message, page_url, user_agent, screenshot_path, app_id }
            ↓
Edge Function validates auth + payload, generates a signed URL for the screenshot
            ↓
Postmark sends email to operator with full context
            ↓
Operator triages in inbox  (P0–P3 + radius + narrative)
            ↓
Hand to Claude Code  →  fix lands  →  loop closed

The Three Ticket Types

| Icon | Type | When users pick it | Maps to (future GitHub label) |
|---|---|---|---|
| 🐛 | Bug | "Something is broken" | bug |
| 💡 | Suggestion | "I have an idea or want a feature" | enhancement |
| ❓ | Question | "I'm stuck, unsure, or need help" | question |

Default selection: Bug (most common in production apps).

The type selector drives two things:

  1. The textarea placeholder adapts (better coaching per type)
  2. The email subject prefix in the operator's inbox: [KeyContent · BUG] vs [KeyContent · SUGGESTION] vs [KeyContent · QUESTION] — both driven by one lookup table, sketched below
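
As a sketch, that mapping is one lookup table in the widget (the placeholder copy here is illustrative, not the shipped strings):

```ts
const TYPE_CONFIG = {
  bug: {
    placeholder: "What broke? What did you expect to happen?",
    subjectPrefix: "[KeyContent · BUG]",
    githubLabel: "bug",
  },
  suggestion: {
    placeholder: "What would make this better for you?",
    subjectPrefix: "[KeyContent · SUGGESTION]",
    githubLabel: "enhancement",
  },
  question: {
    placeholder: "What are you stuck on?",
    subjectPrefix: "[KeyContent · QUESTION]",
    githubLabel: "question",
  },
} as const;
```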

What Gets Captured Automatically

The user types one sentence. The system attaches all of this without asking:

| Field | Where it comes from |
|---|---|
| User email | Supabase auth session |
| User ID | Supabase auth session |
| Page URL | window.location.href at the moment of report |
| Browser / OS | navigator.userAgent |
| Server timestamp | Edge Function captures on receipt (UTC) |
| Screenshot URL | Signed Supabase Storage URL (30-day expiry), if attached |
| App ID | "keycontent" for now; future apps drop in seamlessly |

This is why operator triage takes 30 seconds instead of 3 emails: the report already contains everything needed to act.
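
On the client side, assembling that payload takes a few lines. A sketch of the submit handler's core, with placeholder config (the shipped widget may differ):

```ts
import { createClient } from "@supabase/supabase-js";

// Placeholder config — the real app injects these via environment variables.
const SUPABASE_URL = "https://<project-ref>.supabase.co";
const supabase = createClient(SUPABASE_URL, "<anon-key>");

async function submitReport(
  type: "bug" | "suggestion" | "question",
  message: string,
  screenshotPath?: string,
) {
  // Widget only renders for signed-in users (see Security Model below).
  const { data: { session } } = await supabase.auth.getSession();
  const res = await fetch(`${SUPABASE_URL}/functions/v1/report-issue`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${session!.access_token}`, // Supabase JWT on every request
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      type,
      message,
      page_url: window.location.href,   // captured at the moment of report
      user_agent: navigator.userAgent,  // browser / OS
      screenshot_path: screenshotPath,  // set only if the user attached one
      app_id: "keycontent",             // future apps change this one field
    }),
  });
  if (!res.ok) throw new Error(`report-issue failed: ${res.status}`);
}
```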


The Email Format (what lands in your inbox)

Subject: [KeyContent · BUG] from nzricky@gmail.com

Type: Bug
App: keycontent
User: nzricky@gmail.com  (id: 6b87712a-abce-48ee-b444-...)
Page: https://staging.keycontent.ai/jobs/d66a2195-...
Browser: Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/148.0.0.0 ...
Time: Mon, 11 May 2026 09:16:21 GMT
Screenshot: https://...supabase.co/storage/v1/.../screenshot.png?token=...

----- Their message -----
{the user's message}

The subject prefix is the key feature for triage: inbox rules can route by type automatically.


Architecture (the components)

| Layer | Component | Where it lives |
|---|---|---|
| Frontend widget | `<ReportIssueButton />` mounted in app root layout | KeyContent frontend repo |
| Storage | Private bucket bug-report-screenshots, RLS-scoped to user folder | Supabase Storage |
| Edge Function | report-issue — validates auth, generates signed URL, sends email | Supabase Edge Functions |
| Email delivery | Postmark API (via POSTMARK_SERVER_TOKEN) | Postmark |
| Auth | Supabase JWT on every request | Supabase Auth |
| Multi-app readiness | app_id field in payload | Forward-compatible for Sentinel |

Security Model

  • Authentication required — only signed-in users can submit reports (the Edge Function returns 401 otherwise)
  • Screenshots are private by default — bucket is private, RLS restricts uploads to the user's own folder
  • Signed URLs with 30-day expiry — operator's email has a time-bounded link, not a permanent public URL
  • Secrets never in code — POSTMARK_SERVER_TOKEN and REPORT_TO_EMAIL live in Supabase Edge Function secrets
  • CORS allowlist — Edge Function only accepts requests from the app's verified origin
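
Put together, the Edge Function has roughly this shape. A sketch, not the shipped report-issue source: payload validation and error handling are trimmed, and the sender address is an assumption.

```ts
import { createClient } from "npm:@supabase/supabase-js@2";

Deno.serve(async (req) => {
  // CORS allowlist: only verified app origins, never "*".
  const origin = req.headers.get("Origin") ?? "";
  const allowed = ["https://keycontent.ai", "https://staging.keycontent.ai"];
  if (!allowed.includes(origin)) return new Response("Forbidden", { status: 403 });

  // Auth required: anonymous submissions get 401.
  const jwt = req.headers.get("Authorization")?.replace("Bearer ", "") ?? "";
  const anon = createClient(Deno.env.get("SUPABASE_URL")!, Deno.env.get("SUPABASE_ANON_KEY")!);
  const { data: { user }, error } = await anon.auth.getUser(jwt);
  if (error || !user) return new Response("Unauthorized", { status: 401 });

  const { type, message, page_url, user_agent, screenshot_path, app_id } = await req.json();

  // Signed URL (30-day expiry) for the private screenshot; the service-role
  // client stays server-side only and never ships to the browser.
  const admin = createClient(Deno.env.get("SUPABASE_URL")!, Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")!);
  let screenshotUrl = "none";
  if (screenshot_path) {
    const { data } = await admin.storage
      .from("bug-report-screenshots")
      .createSignedUrl(screenshot_path, 60 * 60 * 24 * 30);
    screenshotUrl = data?.signedUrl ?? "none";
  }

  // Deliver via Postmark — the token lives in Edge Function secrets, never in code.
  await fetch("https://api.postmarkapp.com/email", {
    method: "POST",
    headers: {
      "X-Postmark-Server-Token": Deno.env.get("POSTMARK_SERVER_TOKEN")!,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      From: "reports@keycontent.ai", // assumed verified sender address
      To: Deno.env.get("REPORT_TO_EMAIL")!,
      Subject: `[KeyContent · ${String(type).toUpperCase()}] from ${user.email}`,
      TextBody:
        `Type: ${type}\nApp: ${app_id}\nUser: ${user.email}  (id: ${user.id})\n` +
        `Page: ${page_url}\nBrowser: ${user_agent}\nScreenshot: ${screenshotUrl}\n\n` +
        `----- Their message -----\n${message}`,
    }),
  });

  return new Response(JSON.stringify({ ok: true }), {
    status: 200,
    headers: { "Content-Type": "application/json", "Access-Control-Allow-Origin": origin },
  });
});
```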

Operator Triage Flow

When a report lands in your inbox:

  1. Read the subject — type tells you the lane (bug/suggestion/question)
  2. Read the message + look at screenshot — 30 seconds to understand
  3. Assign priority (per HOW-WE-DO-PRIORITY.md) — P0/P1/P2/P3 + blast radius + narrative
  4. For Bug + Suggestion: brief Claude Code, get a PR, verify on staging, merge
  5. For Question: reply to the user directly with the answer
  6. Optionally: archive the email or move to a "triaged" folder

Recommended inbox rules (Gmail/Outlook):

  • [KeyContent · BUG] → flag + star + same-day
  • [KeyContent · SUGGESTION] → archive to "Product Backlog" label, weekly review
  • [KeyContent · QUESTION] → flag for same-day reply

Phase 1 Evolution (when full ops kicks in)

The pipeline is designed to grow without rewriting the frontend. Three additive enhancements:

A. Save to a Supabase table

Edge Function also writes a row to a bug_reports table (alongside sending the email):

  • Columns: id, type, message, page_url, user_agent, user_id, user_email, screenshot_path, app_id, created_at, status, priority, blast_radius
  • Enables analytics, search, and trend analysis over time
  • app_id is already in the payload — no migration of historical data needed
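
A sketch of the additive write, extending the Edge Function sketch earlier in this play (column names from the list above; the status and priority defaults are assumptions):

```ts
// Inside the same Edge Function, after the email is sent
// (uses the server-side service-role client from the sketch above).
await admin.from("bug_reports").insert({
  type,
  message,
  page_url,
  user_agent,
  user_id: user.id,
  user_email: user.email,
  screenshot_path,
  app_id,
  status: "new",      // assumed default; triage fills in the rest
  priority: null,     // P0–P3, set at triage time
  blast_radius: null, // radius:*, set at triage time
});
```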

B. Auto-create a GitHub Issue

Alongside the email, the Edge Function also creates a GitHub Issue:

  • Repo: keycontent (or matched by app_id)
  • Label: bug / enhancement / question (mapped from type)
  • Title: {type}: {first 60 chars of message}
  • Body: the same context as the email + a link to the screenshot
  • Operator triages on GitHub, not in email
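
The GitHub side is one REST call against the Issues API. A sketch, again extending the Edge Function above; the repo path and GITHUB_TOKEN secret name are assumptions:

```ts
// Also inside the Edge Function, after the email + table write.
const labelFor = { bug: "bug", suggestion: "enhancement", question: "question" } as const;

await fetch("https://api.github.com/repos/coreshift-hq/keycontent/issues", { // owner/repo assumed
  method: "POST",
  headers: {
    Authorization: `Bearer ${Deno.env.get("GITHUB_TOKEN")!}`, // hypothetical secret
    Accept: "application/vnd.github+json",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    title: `${type}: ${message.slice(0, 60)}`,
    body:
      `**User:** ${user.email}\n**Page:** ${page_url}\n**Browser:** ${user_agent}\n` +
      `**Screenshot:** ${screenshotUrl}\n\n${message}`,
    labels: [labelFor[type as keyof typeof labelFor]],
  }),
});
```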

C. Auto-reply on resolution

When the GitHub Issue is closed (or labelled resolved), an automated email goes back to the user:

  • Subject: Update on your report — {type}
  • Body: thanks for reporting, here's what we changed
  • Closes the loop, builds trust — almost no other apps do this

These three additions take the system from "useful" to "remarkable" — without the user-facing widget changing at all.


Why we elevated the standard

Most apps have a "Contact Us" form or a help@ email address. Reports come in with no context, and the support thread that follows is mostly back-and-forth asking for screenshots, browser versions, and account details. Hours wasted per report.

Our elevations:

  1. One-click reporting from any page (vs hunting for a contact link)
  2. Auto-captured context (vs interrogation email threads)
  3. Type-aware UX (placeholder adapts → users describe better)
  4. Screenshots without screenshot tools (drag-drop in the modal)
  5. Inbox-routable subject prefixes (vs everything in one queue)
  6. Forward-compatible for multi-app (app_id field already plumbed)

The trade: building the widget took ~2 hours of Claude Code time. The gain: every report that lands is actionable in 30 seconds instead of 3 hours of email tennis.


Cloning to Another App

When Coreshift HQ launches App #2, the bug-reports system clones in an afternoon:

  1. Drop <ReportIssueButton /> into App #2's layout (set app_id: "app2-slug")
  2. Create the bug-report-screenshots bucket + RLS policies on App #2's Supabase
  3. Deploy the report-issue Edge Function to App #2's Supabase
  4. Add POSTMARK_SERVER_TOKEN + REPORT_TO_EMAIL to App #2's secrets
  5. Optionally: separate Postmark Message Stream for App #2's reports

Everything else — the email format, the type system, the triage flow — is identical across apps. One operator can triage all apps from one inbox, filtered by the subject prefix.

This is why the system is also called "Sentinel" in the long-term vision: one set of eyes, every Coreshift HQ app.


When this doc changes

  • When the widget adds new types or fields
  • When a new app onboards (note any deviations)
  • When Phase 1 adds the DB table / GitHub Issue auto-creation
  • After any incident affecting the report pipeline
  • Bump version, note the change

See also

  • ../briefs/BRIEF-report-issue-widget.md — V1 implementation brief (the spec Claude Code shipped from)
  • ../briefs/BRIEF-report-issue-widget-round2.md — polish + type selector
  • ../briefs/BRIEF-rewire-to-postmark.md — email vendor swap
  • HOW-WE-DO-PRIORITY.md — what happens after the report lands
  • ../ROADMAP.md — Phase 2 enhancements queued
  • Deck Slide 4 — Scenario 1: a user hits a bug (the visual narrative)
PLAY 6

App Audits

Defaults are dangerous. Audit is the gate.

Status: Active · v1 · 2026-05-13
Applies to: Every Coreshift HQ app — at onboarding (before joining Sentinel ops) and quarterly thereafter
Owners: Operator (runs the checklist) · Claude Code (writes fix briefs for failed items) · Tools (Supabase advisors, Cloudflare audit logs, vendor dashboards)


What this doc is

A single-pass checklist that verifies an app has its default-off settings turned on before it's considered operator-ready. Runs as a gate at onboarding (first time) and as a re-audit on a quarterly cadence thereafter.

This doc exists because of a specific lesson learned: when KeyContent was handed over for operator-driven maintenance, webhook_events had RLS disabled and Supabase Auth had HIBP password protection off — both default-off settings the original developer never toggled on, because no checklist forced the question. This checklist is the elevation.


The principle

Defaults are dangerous. Cloud platforms ship with security off-by-default because they don't know your context. An audit is the gate that catches what was never toggled on.

Not paranoia. Not enterprise compliance theatre. Just a 30-minute pass that asks: "For every cloud setting that defaults to less-safe, did we make a conscious choice?"


When this audit runs

| Trigger | What it produces |
|---|---|
| Onboarding — before an app joins Sentinel ops | Pass/fail gate. App can't go live with Sentinel ops until 100% pass or every failure is consciously deferred. |
| Quarterly — recurring re-audit | Delta report: what regressed, what's new, what new vendor checks should be added. |
| Post-incident — when an incident traces to a default-off setting | Targeted re-check of the affected area + add the check to this doc. |

The Audit Checklist

Each item is a yes/no check the operator can verify in ≤ 2 minutes. Items marked (N/A until X) are only relevant once the named vendor or feature is in the stack.

Supabase — Database + Auth + Storage + Edge Functions

  • get_advisors (security) returns zero critical and zero warn entries
  • get_advisors (performance) reviewed — issues triaged or deferred
  • Row Level Security enabled on every public schema table
    • Verify: SELECT relname FROM pg_class WHERE relkind='r' AND relnamespace='public'::regnamespace AND NOT relrowsecurity returns 0 rows
  • Every RLS-enabled table has at least one policy (else the authenticated role is silently locked out)
  • HIBP password protection enabled (Auth → Password security → "Check passwords against HaveIBeenPwned")
  • Email rate limits configured (Auth → Rate limits) — defaults are often too generous
  • All Storage buckets are private by default; public buckets are explicit and justified
  • Storage bucket policies scope writes to the user's own folder (path pattern {user_id}/...)
  • CORS allowlist on every Edge Function lists only known origins (no *)
  • All Edge Function secrets exist in both staging AND production projects
  • Service role key is not referenced in any client-side code (grep client/ for SERVICE_ROLE — must be 0 hits)
  • Database backup retention reviewed (Supabase default is 7 days; upgrade if data loss tolerance < 7d)
  • No raw SQL changes against production outside the migration system (every prod schema change has a supabase/migrations/ file)
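
The automatable Supabase checks can live in one small script. A Node/TypeScript sketch; the SUPABASE_DB_URL env var name is an assumption:

```ts
import pg from "pg";
import { execSync } from "node:child_process";

// Check 1: every public-schema table has RLS enabled (the exact query from the checklist).
const db = new pg.Client({ connectionString: process.env.SUPABASE_DB_URL }); // assumed env var
await db.connect();
const { rows } = await db.query(`
  SELECT relname FROM pg_class
  WHERE relkind = 'r'
    AND relnamespace = 'public'::regnamespace
    AND NOT relrowsecurity
`);
if (rows.length > 0) {
  console.error("FAIL — RLS disabled on:", rows.map((r) => r.relname).join(", "));
}
await db.end();

// Check 2: service role key never referenced client-side (0 hits expected).
const hits = execSync("grep -rn SERVICE_ROLE client/ || true").toString().trim();
if (hits) console.error("FAIL — service role referenced in client code:\n" + hits);
```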

Cloudflare — DNS + CDN + WAF

  • WAF rate limiting on auth endpoints (/login, /signup, password reset)
  • DNSSEC enabled on the production domain
  • Bot Fight Mode (free tier) enabled (verified 2026-05-13 on keycontent.ai with JS Detections on)
  • SSL/TLS encryption mode set to Full (Strict) — not Flexible (verified 2026-05-13)
  • Pages preview deploys are gated (auth required) OR known to be safe-by-default (N/A for Railway-hosted apps like KeyContent)
  • Environment variables exist in both staging and production environments
  • Page Rules / Configuration Rules reviewed for leftover staging-only rules

GitHub — Repo Hygiene

  • Branch protection on main: requires PR + status checks pass + linear history
  • Branch protection on staging: requires PR + status checks pass
  • Default branch is the production branch (main for KeyContent)
  • .gitignore excludes .env*, *.pem, *.key, credential files
  • Secret scanning enabled (free for public repos; opt-in for private)
  • Dependabot enabled for security updates
  • .github/ISSUE_TEMPLATE/ populated per HOW-WE-DO-BUG-REPORTS.md
  • No secrets in commit history (gitleaks or gh secret-scanning pass clean)
  • Repo visibility matches expectation (private for KeyContent; public exposure would be intentional)

Postmark — Transactional Email

  • Sending domain verified (DKIM + Return-Path + SPF all green)
  • Bounce + complaint webhooks configured to the app (or Sentinel ingestion)
  • Message Streams separated by purpose (e.g. outbound for transactional vs broadcast for marketing)
  • Suppression list reviewed quarterly
  • Sender reputation score acceptable (≥ 80)

Sentry — Error Monitoring (N/A until Phase 1 Week 1 ships)

  • Frontend + backend (server + edge functions) projects exist
  • DSNs configured per environment (staging DSN ≠ prod DSN)
  • Source map upload verified (stack traces show readable file names, not minified)
  • Release tracking wired to deploys (commit SHA tagging confirmed)
  • Alert rules: new-issue email, regression email, high-frequency email
  • User context attached after auth (id + email)
  • PII capture is off (sendDefaultPii: false)
  • Session replay privacy defaults (mask all text, block all media)
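
When Sentry lands in Phase 1, the checklist above translates to an init call roughly like this. A sketch for the frontend project, assuming Vite-style env access; the DSN and release wiring are placeholders:

```ts
import * as Sentry from "@sentry/react";

Sentry.init({
  dsn: import.meta.env.VITE_SENTRY_DSN,     // separate DSN per environment
  environment: import.meta.env.MODE,        // "staging" vs "production"
  release: import.meta.env.VITE_COMMIT_SHA, // release tracking wired to deploys
  sendDefaultPii: false,                    // PII capture off
  integrations: [
    Sentry.replayIntegration({
      maskAllText: true,                    // session replay privacy defaults
      blockAllMedia: true,
    }),
  ],
  replaysOnErrorSampleRate: 1.0,
});

// After auth: attach user context (id + email) per the checklist, e.g.
// Sentry.setUser({ id: user.id, email: user.email });
```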

Better Stack — Uptime Monitoring (N/A until Phase 1 Week 2 ships)

  • Uptime monitors cover both staging and production endpoints
  • Check interval ≤ 3 min
  • Alert channels configured (email at minimum; SMS for SEV1 ideally)
  • Public status page exists at status.{app-domain} with brand customisation

Operational Hygiene

  • Report Issue widget submits a Bug, Suggestion, and Question successfully (one of each)
  • Inbox rules / labels routing [{App} · BUG/SUGGESTION/QUESTION] correctly
  • Postmark sender domain matches the app's primary domain
  • Rollback procedure mentally rehearsed per HOW-WE-DO-DEPLOYS.md Rule 2
  • Triage rhythm documented and being followed (10-min ritual per ROADMAP.md Phase 1 Week 4)
  • ROADMAP.md and KANBAN.md checked for staleness — shipped items marked [x]

What to do when an audit item fails

Don't fix items mid-audit. Finish the full pass, then triage all failures together. This prevents tunnel-vision on one finding while missing others.

For each failure:
  ↓
1. Categorise by impact using HOW-WE-DO-PRIORITY (P0/P1/P2/P3)
  ↓
2. Determine fix type:
   ▸ Code fix (e.g. missing RLS policy) → write a brief, hand to Claude Code
   ▸ Dashboard toggle (e.g. HIBP)        → add a kanban card, operator handles
   ▸ Process gap (e.g. no inbox rule)    → add to triage rhythm, no PR needed
  ↓
3. File the action — every failure produces either a PR, a kanban card,
   or an explicit "consciously deferred" note (with reason + reconsider date)

Why we elevated the standard

The industry has formal compliance audits — SOC2, ISO 27001, HIPAA, etc. They're designed for enterprise teams with compliance officers, weeks of evidence-gathering, and external auditors. Most of that ceremony doesn't apply to a solo operator running 1–N small apps. But the underlying principle — that defaults are dangerous and audit is a gate — does apply.

Our elevations:

  1. Specific to our stack — checks Supabase, Cloudflare, Postmark, GitHub, Sentry, Better Stack directly. Not generic CIS benchmarks.
  2. Automatable where possible — get_advisors, the RLS query, grep checks. Operator runs them; Claude Code can re-run them on demand.
  3. Lightweight — 30-minute pass, not a 2-week formal audit.
  4. Dual-purpose — same checklist for first-time onboarding AND quarterly re-audit. No second doc to maintain.
  5. Failure flows into existing workflows — Claude Code briefs for code fixes, kanban cards for dashboard toggles. No special incident category.
  6. App-portable — every Coreshift HQ app passes the same audit before joining Sentinel ops. App #2 inherits the discipline.

The win at scale: when Sentinel covers 3+ apps, the audit is the contract. No app ever ships with RLS disabled on a webhook log or HIBP off in Auth again, because the checklist would've blocked it.


When this doc changes

  • After any incident that traces to a default-off setting — add the check that would've caught it
  • When a new vendor enters the stack — add their section (Stripe security, Zernio HMAC, etc.)
  • When Supabase / Cloudflare / etc. ship new advisor types — add the lints they surface
  • Quarterly, after running the audit — what was missing? What's noise?
  • When App #2 onboards — verify every check is portable; flag KeyContent-specific items

Bump the version at the top and note the change.


Outstanding items for KeyContent (initial audit, 2026-05-13)

The first time this checklist runs on KeyContent, it produces this delta. Surfaced during the Sentinel folder review session.

| Item | Status | Action |
|---|---|---|
| RLS on webhook_events | ✅ Fixed on staging 2026-05-12; awaiting prod promotion | See ../briefs/BRIEF-rls-fix-webhook-events.md |
| HIBP password protection | ✅ Enabled on staging + prod 2026-05-13 | — |
| get_advisors zero-warn | ✅ Staging clean; prod has 1 outstanding (RLS pending promotion) | Resolves when RLS fix promotes to prod |
| Sentry sections | N/A | Will run after Phase 1 Week 1 |
| Better Stack sections | N/A | Will run after Phase 1 Week 2 |
| Branch protection | ✅ Rules added on main + staging 2026-05-13 | Admin bypass left ON for 2-person team flexibility |
| Backup retention | ✅ Verified 2026-05-13 — 8 days of daily DB backups | ⚠️ Storage objects not in backups; follow-up queued for Phase 1 Week 5 |

Re-run the full audit once Sentry and Better Stack are integrated, and quarterly thereafter.


See also

  • HOW-WE-DO-PRIORITY.md — how audit failures get triaged
  • HOW-WE-DO-DEPLOYS.md — Type C migrations (the usual fix path for RLS / schema gaps)
  • HOW-WE-DO-INCIDENTS.md — when a failure is severe enough to be incident-mode
  • ../ROADMAP.md — Phase 4 (Sentinel) — passing this audit is the gate for any new app joining
  • Supabase Database Advisors docs
  • Cloudflare WAF docs