How We Run Apps
Five plays. One deliberate system. Built once for KeyContent. Cloned in an afternoon for every Coreshift HQ app after.
Priority Framework
How we triage what gets fixed first.
Status: Active · v1 · 2026-05-11 Applies to: Every Coreshift HQ app · Currently: KeyContent Owners: Operator (assigns priority) · Claude Code (consults for triage cues)
What this doc is
How Coreshift HQ triages and prioritizes issues. Built on the P0–P3 industry standard (so we play nicely with the rest of the world), elevated with two custom fields that force better thinking:
- User Impact Narrative — a one-line human story, not a technical description
- Blast Radius — explicit count of how many users are affected
The standard tells us what to call things. Our additions force us to think about who's affected before we act.
The Four Priority Tiers
| Tier | Name | When | Action | Default response time |
|---|---|---|---|---|
| P0 | Drop Everything | App is down for most/all users · Data loss · Active security breach | Fix now, today | Hours |
| P1 | This Week | A core feature is broken · OR a smaller feature broken for many users | Fix in 1–3 days | Same business day |
| P2 | This Milestone | A bug exists but users have a workaround · OR only a small subset is affected | Fix in current sprint | Within the week |
| P3 | When Convenient | Polish · Cosmetic · Wishlist · Edge case | Backlog — close if stale | No SLA |
The Decision Tree (use this in 10 seconds)
1. Is the app down for everyone? → YES → P0
NO ↓
2. Is a core feature broken for many users? → YES → P1
NO ↓
3. Is anything functionally broken? → YES → P2
NO ↓
4. Cosmetic / nice-to-have? → P3
"Core feature" for KeyContent = login · the main content workflow · saving work · billing/payments.
Each app defines its own list in its in-repo PRIORITY.md.
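The 10-second tree can be sketched as a single function. This is an illustrative sketch, not shipped code; the question flags and tier names simply mirror the tree above.

```typescript
type Priority = "P0" | "P1" | "P2" | "P3";

// One boolean per question in the decision tree, asked in order.
function triage(q: {
  appDownForEveryone: boolean;         // Q1
  coreFeatureBrokenForMany: boolean;   // Q2
  anythingFunctionallyBroken: boolean; // Q3
}): Priority {
  if (q.appDownForEveryone) return "P0";
  if (q.coreFeatureBrokenForMany) return "P1";
  if (q.anythingFunctionallyBroken) return "P2";
  return "P3"; // cosmetic / nice-to-have
}
```

A broken-but-workaroundable bug maps to P2: `triage({ appDownForEveryone: false, coreFeatureBrokenForMany: false, anythingFunctionallyBroken: true })` returns `"P2"`.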
The Two Required Fields (the elevation)
1. User Impact Narrative
Every triaged issue must include a one-line story of how this affects the user's day.
| ❌ Bad (technical) | ✅ Good (human) |
|---|---|
| "Save button throws TypeError" | "Users lose their drafts when clicking Save — they have to retype everything" |
| "API 500 on /jobs endpoint" | "Users can't see their own jobs list when they log in — looks like everything's gone" |
| "Modal animation janky on Safari" | "Safari users see a flicker when opening the report widget — minor visual annoyance" |
Why: Triage from the user's perspective, not the engineer's. The narrative tells you the severity better than the stack trace.
2. Blast Radius
Tag every issue with one of:
| Tag | Meaning |
|---|---|
| `radius:single` | One user affected (so far) |
| `radius:some` | A subset — e.g., users on Safari, free-tier users, users in a specific region |
| `radius:many` | Most users will hit this |
| `radius:all` | Everyone, every session |
Why: Lets us spot escalation. A P2 · radius:some issue that gets re-reported as radius:many jumps to P1 instantly.
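The escalation rule can be made mechanical. A hedged sketch: only the P2 jump is stated in this doc, so everything else passes through unchanged, and treating radius:all like radius:many here is our own assumption.

```typescript
type Radius = "radius:single" | "radius:some" | "radius:many" | "radius:all";
type Priority = "P0" | "P1" | "P2" | "P3";

// Re-triage helper: a P2 whose blast radius widens to many (or all,
// by assumption) jumps to P1. All other combinations are left alone.
function reTriage(current: Priority, newRadius: Radius): Priority {
  const widened = newRadius === "radius:many" || newRadius === "radius:all";
  return current === "P2" && widened ? "P1" : current;
}
```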
How Type, Priority, and Radius Compose
These are three orthogonal axes — all assigned, but at different moments:
| Field | Set by | When |
|---|---|---|
| Type (Bug / Suggestion / Question) | The user | At report time (via the widget) |
| Priority (P0–P3) | The operator | At triage time |
| Blast Radius | The operator | At triage time |
A typical fully-triaged issue looks like:
Type: Bug · Priority: P1 · Blast Radius: radius:many
Narrative: Users lose their drafts when clicking Save — they have to retype everything.
This is rich enough that Claude Code can read it cold and start fixing. That's the whole point.
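Concretely, a fully-triaged issue could be carried as a small record like this. Field names are illustrative, not a shipped schema, and the brief format is one plausible rendering.

```typescript
interface TriagedIssue {
  type: "Bug" | "Suggestion" | "Question"; // set by the user at report time
  priority: "P0" | "P1" | "P2" | "P3";     // set by the operator at triage
  radius: "radius:single" | "radius:some" | "radius:many" | "radius:all";
  narrative: string; // the one-line User Impact Narrative
}

// Render the issue as a one-line brief Claude Code can read cold.
function toBrief(i: TriagedIssue): string {
  return `[${i.priority} · ${i.type} · ${i.radius}] ${i.narrative}`;
}

const example: TriagedIssue = {
  type: "Bug",
  priority: "P1",
  radius: "radius:many",
  narrative: "Users lose their drafts when clicking Save — they have to retype everything.",
};
// toBrief(example) → "[P1 · Bug · radius:many] Users lose their drafts when clicking Save — they have to retype everything."
```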
Common Mistakes (don't do these)
- Everything is P1. If everything's urgent, nothing is. Be ruthless.
- The loudest user sets priority. Impact sets priority, not volume. A single angry user reporting a typo is still P3.
- You-found-it bugs ranked above users-found-it. Reverse it — user-reported issues are real, by definition.
- Priority never changes. Re-triage when you learn more. A P2 affecting radius:some that turns into radius:many becomes P1.
- Skipping the narrative. Without it, you're prioritizing on vibes. Force the user-impact line every time.
Why we tweaked the standard
The industry's P0–P3 system is a vocabulary, not a thinking framework. Two issues can both be "P1" and have wildly different actual impact on users. By requiring narrative + radius, we force a small amount of structured thinking that:
- Prevents priority inflation
- Makes triage trustworthy (numbers are decisions, not guesses)
- Generates great Claude Code briefs as a byproduct
- Sets up future automation — radius:all could auto-page; radius:single + P3 could auto-close after 60 days
These are not exotic additions. They're cheap, repeatable habits that compound over hundreds of triaged issues.
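The automation hinted at above could be as small as this. Entirely hypothetical: nothing pages or auto-closes today, and the 60-day threshold comes straight from the bullet above.

```typescript
type AutoAction = "auto-page" | "auto-close" | null;

// Hypothetical future rules: radius:all pages immediately;
// a stale radius:single P3 auto-closes after 60 days.
function autoAction(issue: { radius: string; priority: string; ageDays: number }): AutoAction {
  if (issue.radius === "radius:all") return "auto-page";
  if (issue.radius === "radius:single" && issue.priority === "P3" && issue.ageDays >= 60) {
    return "auto-close";
  }
  return null; // everything else: a human decides
}
```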
How this lives in practice
- The widget captures Type at report time
- Triage (10 min/day) sets Priority + Blast Radius + Narrative
- GitHub Issues stores all three as labels and fields (Phase 1)
- Sentinel (future ops dashboard) filters and sorts by these
- The pitch deck (Slide 7) is the public-facing summary of this doc
When this doc changes
- Edit when we discover a new tier is needed (rare)
- Edit when the "core feature" list changes per app
- Bump the version at the top
- Mention in the next CEO update
See also
- ../briefs/BRIEF-report-issue-widget.md — how Type is captured
- ../briefs/BRIEF-report-issue-widget-round2.md — Type selector implementation
- ../deck/KeyContent-Maintenance-Ops.pptx — Slide 7 (Priority Framework)
- ../ROADMAP.md — Phase 1 Week 5 (in-repo PRIORITY.md follows this doc)
PR Reviews
How a non-coder verifies every change.
Status: Active · v2 · 2026-05-13 (Railway revision) Applies to: Every PR from Claude Code or any contributor Owners: Operator (reviewer) · Claude Code (author)
v2 note: Original v1 assumed Cloudflare Pages preview URLs per PR. The actual stack uses Railway, which doesn't generate per-PR previews on the free tier. The review model has shifted from "verify on a preview, then merge" to "review-by-description pre-merge, behavior-test on staging post-merge." Safety net is Sentry + Better Stack + Railway fast rollback.
What this doc is
A behavior-based, code-free PR review checklist. Designed for an operator who doesn't write code to confidently approve or push back on changes — at speed, every day.
The principle
Behavior > code.
Most engineering orgs review PRs by reading source code line by line. We can't — and we don't need to. We trust Claude Code's implementation patterns and verify what actually matters: that the change does what we asked for, and nothing else broke.
Our verification happens at the user-experience layer, not the source-code layer.
The trade: we don't personally catch every subtle bug. We rely on Sentry monitoring + Better Stack uptime + Railway fast rollback as the safety net for what slips through.
Two-Phase Review (Railway-adapted)
Railway free tier doesn't create a preview URL per PR — only the staging branch gets deployed. So the review splits into two phases:
- Phase A (pre-merge): description-based review only. No behavior-testing yet.
- Phase B (post-merge-to-staging): behavior verification on staging.keycontent.ai after Railway deploys.
Total time: ~3 min pre-merge + ~5 min post-merge.
Phase A — Pre-merge review (~3 min)
Step 1 · Read the PR description carefully
- Does the change match what I briefed?
- Are the "What's new" items a subset of what I asked for? (No scope creep.)
- Does the test plan match what we should verify?
✅ Yes → continue. ❌ No → comment to clarify before merging.
Step 2 · Read the files-changed list
Don't read the code itself — just the list of files. Quick sanity check:
- Does the file count match the scope? (A "fix subject line" PR touching 12 files is a red flag.)
- Are any unexpected files touched? (e.g., a UI fix PR that changed migration files — ask why.)
If files look reasonable → proceed. If suspicious → ask Claude Code to explain.
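The Step 2 sanity check can be framed as a heuristic. This sketch uses illustrative thresholds and path patterns (5 files, `migrations/`, `.github/`) that are not stated policy; tune them per app.

```typescript
// Returns a list of red flags; an empty list means proceed.
function filesLookSuspicious(files: string[], scope: "small-fix" | "feature"): string[] {
  const flags: string[] = [];
  // A "fix subject line" PR touching a dozen files is a red flag.
  if (scope === "small-fix" && files.length > 5) {
    flags.push(`file count ${files.length} is high for a small fix`);
  }
  // Unexpected areas for a UI fix, e.g. database migrations or CI config.
  const unexpected = files.filter((f) => f.includes("migrations/") || f.startsWith(".github/"));
  if (unexpected.length > 0) {
    flags.push(`unexpected files touched: ${unexpected.join(", ")}`);
  }
  return flags;
}
```

Any non-empty result means one thing: ask Claude Code to explain before merging.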
Step 3 · Ask clarifying questions in PR comments
You can't behavior-test yet, so questions are the only pre-merge gate. Examples:
- "You added a new env var — does it have a sensible default if missing?"
- "You touched the auth flow — confirm sign-in/sign-out still work after this?"
- "This change affects the report widget — does staging Postmark still get hit?"
Claude Code answering well = green light to merge. Claude Code stumbling = bounce back.
Phase A passes → merge to staging.
Phase B — Post-merge behavior testing on staging (~5 min)
After merging to staging, Railway redeploys in ~2-3 min. Open https://staging.keycontent.ai and:
Step 4 · Reproduce the original goal
- For bugs: repeat the steps from the original issue. Confirm the bug is gone.
- For features: use the feature exactly as a user would. Confirm it does what we asked.
Step 5 · Smoke test the critical flows
After verifying the targeted change, quickly probe 2–3 high-traffic paths to make sure nothing else broke.
KeyContent's critical flows:
- Sign in / sign up
- Dashboard loads cleanly
- The primary content workflow (create / edit / save a job)
- Sign out
Each click ~5–10 seconds. Whole smoke test ~1 minute.
Step 6 · Mobile check
- Resize your browser to ~400px wide (or pull it up on your phone)
- Hit the changed page
- Confirm nothing is broken or unreadable
Step 7 · If anything fails
- Comment on the merged PR with the failing details
- Brief Claude Code for the follow-up fix (new PR)
- If the failure is severe (auth, payment, data loss): rollback staging via Railway → Deployments → redeploy previous
All checks pass → ready to promote staging → main.
Decision matrix
| Situation | Phase | Action |
|---|---|---|
| Description + files match brief, questions answered well | A | ✅ Merge to staging |
| PR description doesn't match my brief | A | ❌ "This doesn't match my brief — please redo X" before merging |
| Suspicious files-changed list | A | ❌ Ask Claude Code to explain before merging |
| Already merged, all behavior checks pass on staging | B | ✅ Ready to promote staging → main |
| Bug isn't actually fixed (on staging) | B | ❌ Comment with exact failing steps; new PR for fix |
| Fix works, but smoke test broke something else | B | ❌ Comment + screenshot; new PR for fix |
| Mobile broken on staging | B | ❌ Comment "Mobile broken — see screenshot"; new PR |
| Severe failure on staging (auth, data loss) | B | 🚨 Rollback staging via Railway, then fix via new PR |
Example comments to give Claude Code
Pre-merge clarifying question (Phase A):
"You're touching the auth middleware — confirm sign-in still works for existing sessions after this lands. Don't want to log everyone out."
Bug not fixed (Phase B, post-merge on staging):
"Tested on staging. Original issue still happens: when I click Save on the draft, the page reloads and the content is lost. New PR for follow-up fix?"
Smoke test broke something (Phase B):
"Targeted fix works on staging. But noticed: clicking 'New Job' button now crashes the page. Reproduced 3 times on staging. Screenshot attached. Need a follow-up PR."
Mobile broken (Phase B):
"On mobile width (~400px) on staging, the new ticket-type selector wraps weirdly and 'Send Report' is cut off. Screenshot."
Scope creep (Phase A):
"I see you also refactored the auth code — please pull that out into a separate PR so this one stays focused on the bug fix."
What I never do
- Read code line by line
- Approve a PR based on "looks good to me"
- Skip the smoke test
- Merge on a Friday afternoon
- Trust the test plan without running it myself
Post-merge to PROD: the 1-hour watch
After promoting staging → main and Railway redeploys production:
- New error spike in Sentry? → rollback via Railway (Deployments → redeploy previous, ~2 min)
- Better Stack monitor flipped red? → same: rollback first, diagnose later
- User reports a problem via the widget? → check if it correlates with the deploy; rollback if yes
- All quiet for 1 hour? → done.
For hotfixes, watch for 2 hours instead of 1.
Why we elevated the standard
The industry assumes the PR reviewer is also a coder. Our system inverts that assumption: the operator doesn't read code, so the checklist is 100% behavior-focused.
This:
- Unblocks non-coding operators from running a real engineering ops process
- Catches the bugs that actually matter — user-visible breakage, not stylistic preferences
- Pairs with safety nets — Sentry + Better Stack + Railway fast rollback catch what visual testing misses
- Adapts to platform reality — when Cloudflare-Pages-style per-PR previews aren't available (Railway), the review moves to staging-after-merge with rollback as the safety net
- Scales across apps — same two-phase pattern works for KeyContent and every Coreshift HQ app after
The unfair advantage: we move faster than teams that require code-reading reviewers, because every PR has exactly one reviewer (the operator) and the review is mechanical.
When this doc changes
- When new critical flows emerge in any app (add to the Step 5 list)
- When a recurring failure mode isn't being caught by the smoke test
- After any major post-merge incident (add a check to prevent recurrence)
- Bump the version at the top and note the change
Per-app critical flows
Each Coreshift HQ app declares its own critical flows for Step 5. KeyContent's are above. App #2 will add its own.
See also
- HOW-WE-DO-PRIORITY.md — how we triage what gets fixed first
- HOW-WE-DO-DEPLOYS.md — the deploy pipeline + rollback procedures (v2 Railway-adapted)
- HOW-WE-DO-INCIDENTS.md — what to do when the post-merge watch fires
- HOW-WE-DO-APP-AUDITS.md — the broader audit framework
- Deck Slide 6 — Shipping a New Feature (the visual narrative — may need slight rewording for Railway in next deck revision)
- Deck Slide 10 — Role Distribution (Operator: Triage, brief, verify)
Incident Response
Pre-written calm beats in-the-moment heroics.
Status: Active · v1 · 2026-05-13 Applies to: Any unplanned event degrading user experience in production Owners: Operator (incident driver) · Claude Code (fix implementer) · Tools (detection)
What this doc is
The pre-written playbook for when something is broken in production. Designed for a solo operator at 2 AM, when you don't have time to think — just follow the script.
The principle
Pre-written calm beats in-the-moment heroics.
The worst time to design a process is during a fire. So we wrote it now. When something breaks, you don't think — you execute the script. Decisions are made in advance; the runbook is the brain.
Severity Levels
| Level | Definition | Response window | Public comms |
|---|---|---|---|
| SEV1 | App is down for everyone · data loss · active security breach | Now | Status page + user email if > 30 min |
| SEV2 | Core feature broken · app down for many users | Within 30 min | Status page |
| SEV3 | Degraded but functional · slow · minor breakage | Same day | Internal only |
Non-urgent bugs that aren't actively breaking things → use the P0–P3 priority framework, not the incident process.
The 6-Step Response
1. ACKNOWLEDGE (0 min)
- Alert arrived from Better Stack or Sentry
- You're aware. The clock is running.
- Internally: stop everything else. This is your only task.
2. ASSESS severity (≤ 60 seconds)
Three quick questions:
- Can users still use the app at all? → No: SEV1 or SEV2. Yes: SEV3.
- How many users affected? → All: SEV1. Many: SEV2. Few: SEV3.
- Data being lost or corrupted? → Yes: always SEV1.
Pick a level. Don't agonize. You can adjust as you learn more.
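The three questions reduce to a tiny classifier. A sketch mirroring the assessment order above; the answer fields are illustrative.

```typescript
type Sev = "SEV1" | "SEV2" | "SEV3";

function assessSeverity(q: {
  dataLossOrCorruption: boolean;    // Q3 — always SEV1
  usersCanStillUseApp: boolean;     // Q1
  affected: "all" | "many" | "few"; // Q2
}): Sev {
  if (q.dataLossOrCorruption) return "SEV1";
  if (q.affected === "all") return "SEV1";
  if (!q.usersCanStillUseApp || q.affected === "many") return "SEV2";
  return "SEV3";
}
```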
3. COMMUNICATE (≤ 5 minutes from acknowledge)
- SEV1 / SEV2: update the public status page immediately. Use the templates below.
- SEV3: note in your incident log; update the status page only if it persists past 30 min.
4. DIAGNOSE — "what changed?"
Run through these dashboards in order. 9 out of 10 incidents trace to a recent deploy.
| Source | What to look for |
|---|---|
| Recent deploys | Anything shipped in the last 4 hours? |
| Sentry | New error spike? Click for stack trace + breadcrumbs |
| Railway logs | 5xx surge? Crash-looping service? Request timeouts? |
| Supabase logs | DB errors? Connection issues? Migration problems? |
| Postmark | Are emails bouncing? Server outage? |
| Third-party status pages | Cloudflare · Supabase · Postmark all have public status pages |
5. MITIGATE — pick the safest path
Was there a recent deploy?
YES → ROLLBACK FIRST. Diagnose later.
(Railway: Deployments → redeploy previous, live in ~2 minutes.)
NO ↓
Is it a third-party outage?
YES → Communicate, wait, monitor. Update users.
NO ↓
Is it a code bug requiring a new fix?
→ Brief Claude Code with full Sentry trace + context.
Get a hotfix PR. Merge to staging, verify there, then promote to main.
(Use the standard PR review checklist — abbreviated for hotfix.)
6. RESOLVE
- Test the affected flow yourself on production
- Update status page to "Resolved"
- Note in your incident log
The 1 AM Rule
When something fires at 1 AM, do the minimum to make it stop:
- Rollback the last deploy if there was one recent
- Communicate to users via status page
- Sleep. Investigate properly in the morning.
Don't do at 1 AM:
- Write or brief new code
- Deploy speculative fixes
- Read complex logs
- Make architectural decisions
The 1 AM rule exists because sleep-deprived decisions are how minor incidents become major ones.
Claude Code's Role During Incidents
You don't write code. Even during incidents. Especially during incidents.
When a hotfix is needed:
- Capture the full Sentry trace + URL + user details + timestamp
- Open Claude Code (a fresh session for the incident, if helpful)
- Brief it like: "P0 hotfix. {feature} broken in production since {time}. Sentry error attached. Rollback wasn't possible because {reason}. Need a fix opened as a PR immediately."
- Verify the fix on staging
- Merge
Operator's role is triage, brief, verify — same as always, just faster.
Pre-Written Status Page Templates
Investigating:
We're investigating reports of {issue} affecting {feature}. We'll provide an update within 30 minutes.
Identified:
We've identified the issue with {feature}. We're working on a fix now.
Monitoring:
A fix has been deployed. We're monitoring to confirm everything is working as expected.
Resolved:
This incident is resolved. {Feature} is fully operational. Sorry for the disruption.
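The templates copy/paste fine by hand; if you ever script them, a trivial placeholder filler is enough. Sketch only; the placeholder keys match the braces above.

```typescript
const statusTemplates = {
  investigating: "We're investigating reports of {issue} affecting {feature}. We'll provide an update within 30 minutes.",
  identified: "We've identified the issue with {feature}. We're working on a fix now.",
  monitoring: "A fix has been deployed. We're monitoring to confirm everything is working as expected.",
  resolved: "This incident is resolved. {Feature} is fully operational. Sorry for the disruption.",
};

// Replace each {placeholder} with its value; unknown keys are left visible
// so a half-filled template is obvious before posting.
function fillTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (match, key) => values[key] ?? match);
}
```

`fillTemplate(statusTemplates.identified, { feature: "Save" })` yields "We've identified the issue with Save. We're working on a fix now."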
When to Email Users (beyond the status page)
Most incidents: status page is enough.
Email users only when ALL of these are true:
- SEV1 lasted > 30 minutes, OR data was visibly affected
- The user might reasonably think their account is broken
- You can identify the affected users
Template:
Subject: We had a brief outage — your data is safe
Hi {name},
Earlier today between {start} and {end}, {feature} was {down/broken}.
We're back to normal now.
What you might have noticed: {brief description}
What we did: {1-2 sentences}
Thanks for your patience.
— The Coreshift HQ Team
Post-Mortem — Within 24 Hours of any SEV1 / SEV2
Save as incidents/YYYY-MM-DD-N.md in the repo. Lean template — 5 sections:
## Incident YYYY-MM-DD-N
**Severity:** SEV{1|2|3}
**Duration:** Detected HH:MM → Resolved HH:MM ({X} minutes)
**Affected:** {features, ~user count}
### What happened
{One paragraph.}
### Root cause
{One paragraph.}
### How we caught it
{Sentry / Better Stack / user report / other?}
### What we did
{Bullet list of actions taken, in order, with timestamps.}
### Action items
- [ ] Prevent recurrence: {specific change}
- [ ] Improve detection: {specific change, if applicable}
- [ ] Update this runbook: {specific addition, if applicable}
Don't:
- Skip the post-mortem because "everyone knows what happened"
- Make it about blame (the system failed, not the person)
- Write a novel — 5 sections is enough
Do:
- File the action items as P1 GitHub Issues
- Re-read recent post-mortems quarterly — patterns emerge
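The incidents/YYYY-MM-DD-N.md convention can be generated mechanically, which removes one decision at post-mortem time. A sketch; nothing generates these files today.

```typescript
// N is that day's incident index (1 for the first incident of the day).
function postMortemPath(date: Date, n: number): string {
  const y = date.getUTCFullYear();
  const m = String(date.getUTCMonth() + 1).padStart(2, "0");
  const d = String(date.getUTCDate()).padStart(2, "0");
  return `incidents/${y}-${m}-${d}-${n}.md`;
}
```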
Common Anti-Patterns (don't do these)
- Skipping comms — users finding out from a friend is worse than a brief status update
- Forcing a fix at midnight — rollback + sleep is almost always better
- Investigating before mitigating — stop the bleeding first
- No post-mortem because "it's fixed" — patterns repeat unless documented
- Skipping the status page update because "it's small" — the rule is consistency, not severity
Why we elevated the standard
The industry has detailed incident playbooks designed for large engineering teams with rotating on-call rotations, incident commanders, and war rooms. Most of that doesn't apply when you're a solo operator with Claude Code as your implementer.
Our elevations:
- Solo-operator-aware — no incident commander, no war room, no Slack #incidents channel. Just you + the runbook.
- The 1 AM Rule — explicit permission to do the minimum at night. Sleep is operational equipment.
- Pre-written communications — copy/paste during the fire, edit later.
- Claude Code is the implementer — you brief, it codes, you verify. Same workflow as a normal day, just under pressure.
- Lean post-mortems — 5 sections, not 50. Trends matter; ritual doesn't.
These choices make incidents survivable for a small team without sacrificing the discipline that prevents recurrence.
When this doc changes
- After any SEV1 — what new step or check would have helped?
- When the team grows beyond 1 — add coordination details
- When new tools enter the stack — add their dashboards to Step 4
- After a recurring pattern emerges — codify it
- Bump version, note the change
See also
- HOW-WE-DO-PRIORITY.md — P0/P1 priorities map to SEV1/SEV2 severities
- HOW-WE-DO-PR-REVIEWS.md — the standard review process; incident review is the abbreviated version
- HOW-WE-DO-DEPLOYS.md — rollback details (coming next)
- ../ROADMAP.md — Phase 1 Week 2 wires up Better Stack + status page (the detection layer)
- Deck Slide 5 — Scenario 2: outage at 2 AM (the visual narrative)
Deploy Pipeline
Three rules. No exceptions.
Status: Active · v2 · 2026-05-13 (Railway revision) Applies to: Every code change reaching production · Every Coreshift HQ app Owners: Operator (verifier) · Claude Code (implementer) · Railway + Supabase + Cloudflare (mechanism)
v2 note: This doc was originally written assuming Cloudflare Pages hosts the frontend with per-PR preview URLs and gradual traffic rollout. The actual stack uses Railway for frontend + Express server hosting, with Cloudflare in front for DNS, CDN, WAF, and SSL (not Pages). Sections below have been rewritten to match Railway's mechanics. Future apps in the Sentinel portfolio that use Cloudflare Pages will need a separate variant.
What this doc is
The pre-written rules for moving code from "Claude Code just wrote it" to "real users are using it." Designed for an operator who doesn't read code to ship safely, daily.
The principle
Every deploy goes through staging. Every deploy is rollback-able fast. Every deploy is observable.
Three rules. No exceptions. The cost of following them is ~10 minutes. The cost of skipping them is a user-visible incident — every time.
The Standard Deploy Pipeline
Claude Code writes fix → opens PR against staging branch
↓
[Known gap: Railway free tier has no per-PR preview URLs]
↓
Read PR description + ask Claude Code clarifying questions
(per HOW-WE-DO-PR-REVIEWS.md — review-by-description model)
↓
Merge to staging branch
↓
Railway auto-deploys to staging.keycontent.ai (~2-3 min)
↓
Smoke test on staging
↓
Promote staging → main via PR (the promotion gate)
↓
Railway auto-deploys to keycontent.ai (~2-3 min, atomic cutover)
↓
Sentry + Better Stack watch for 1 hour
↓
Done.
The Three Rules
Rule 1 · Every deploy goes through staging
- No direct merges to main
- No "trivial" exceptions — the rule has no carve-outs
- Staging is a real Railway environment with separate Supabase project + separate data
- Even a one-character typo fix follows the path
Rule 2 · Every deploy is rollback-able fast
- Railway keeps every previous deployment in the service's Deployments tab
- Rollback = click a previous deployment → Redeploy. Live in ~2 minutes.
- Database migrations need extra thinking (see Type C below)
- Before merging anything risky: mentally rehearse "if this breaks, which deployment do I redeploy?"
Rule 3 · Every deploy is observable
Railway free tier deploys atomically — there is no native gradual traffic rollout. The discipline shifts from pre-deploy gradual exposure to post-deploy fast detection + revert. Our safety net:
- Sentry catches application errors automatically within seconds of the deploy
- Better Stack catches downtime/availability issues with 3-minute checks
- Railway rollback is the kill switch when either alarm fires
- The 1-hour watch after every prod deploy is the operator commitment to be reachable
If Railway adds gradual rollouts (or the team upgrades to Railway preview environments or paid tier with that capability), this rule reverts to "gradual." Until then: deploy → watch → revert if needed.
Deploy Types
Type A · Standard feature or fix
Follow the standard pipeline above. No extra steps.
Type B · Hotfix (incident response)
When an incident requires an emergency fix:
- Brief Claude Code with the full incident context (Sentry trace + steps to reproduce)
- PR opened against staging
- Abbreviated review: only check the targeted fix (not the full two-phase checklist) IF the incident is SEV1
- Merge to staging → quick smoke test → promote to main
- Railway's cutover is atomic: there is no gradual ramp to manage, so rely on the post-deploy watch instead
- Sentry watch for 2 hours after a hotfix (double the normal window)
Type C · Database migration
Migrations are the riskiest deploy type because rollback isn't clean.
- Claude Code writes the migration in a PR
- Apply to staging first (Supabase migration CLI or apply_migration MCP)
- Test the affected features end-to-end on staging
- Sample queries to confirm data didn't corrupt
- Snapshot production Supabase before applying to prod (backup point)
- Apply migration to production
- Verify in production
- If anything breaks: Postgres migrations are NOT easily reversible. Have Claude Code write a forward-fix migration. Restore from backup only as a last resort.
Migration rule: never apply a migration to production without first applying it to staging and verifying the affected features.
Type D · Secrets / environment changes
Adding or rotating a secret (e.g., POSTMARK_SERVER_TOKEN):
- Add to staging Supabase first
- Verify behavior on staging
- Add to production Supabase
- Deploy any code that depends on the new secret (if not already deployed)
- Verify in production
Never: add a secret only to prod without testing on staging first.
The Staging → Production Promotion Checklist
When ready to promote a feature from staging to production, all must be true:
- Feature has been live on staging for at least 24 hours with no Sentry alerts
- All Edge Function secrets exist in production Supabase
- All required migrations have been applied to production
- Storage buckets and RLS policies exist in production (if used)
- CORS allowlist on Edge Functions includes the production origin
- No outstanding Sentry errors related to the feature in staging
- You have at least 2 hours to monitor after promotion (don't promote at end of day)
- Rollback plan is mentally rehearsed
If any box is unchecked: pause, fix, then promote.
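The checklist is deliberately all-or-nothing, which is easy to encode. Field names are illustrative, one boolean per box above.

```typescript
interface PromotionChecklist {
  quietOnStagingFor24h: boolean;
  prodEdgeFunctionSecretsPresent: boolean;
  migrationsAppliedToProd: boolean;
  storageBucketsAndRlsReady: boolean;
  corsAllowsProdOrigin: boolean;
  noOutstandingSentryErrors: boolean;
  twoHoursFreeToMonitor: boolean;
  rollbackPlanRehearsed: boolean;
}

// Any unchecked box blocks promotion: pause, fix, then promote.
function readyToPromote(c: PromotionChecklist): boolean {
  return Object.values(c).every(Boolean);
}
```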
Rollback Procedures (memorize these)
Frontend + Express server (Railway)
- Railway dashboard → KeyContent project → the service that runs the app
- Deployments tab → find the last known-good deployment (sorted newest first; pick the one before the breaking deploy)
- Click the ⋯ menu → Redeploy
- Confirm. Live within ~2 minutes (Railway rebuilds + cuts over).
If you need to roll back simultaneously on staging + production, repeat the process per service. Each Railway environment is independent.
Edge Function (Supabase)
- Check out the previous git commit on the function file
- Re-deploy via Supabase CLI: supabase functions deploy {function-name} --project-ref {prod-ref}
- OR brief Claude Code to revert and open a new PR (slower but safer)
Database migration
- Not easily rollback-able. Use a forward-fix migration.
- Restore from production backup only as last resort (data loss between snapshot and now, AND Supabase Storage objects are NOT in the daily backups — see HOW-WE-DO-APP-AUDITS.md)
Secret rotation gone wrong
- If new secret is broken: revert the secret value in Supabase or Railway dashboard to the previous one
- Or remove the secret entirely if the code can fall back gracefully
Cloudflare-level issues
- DNS or proxy misconfiguration → Cloudflare → DNS → Records → revert the record
- SSL/TLS mode change broke things → SSL/TLS → Overview → revert encryption mode
- Cloudflare doesn't deploy app code, so rollbacks here are config-level only
What to Watch After Every Deploy
For 1 hour after a normal merge (2 hours for hotfix), glance at:
- Sentry — new error spike on keycontent-frontend or keycontent-backend?
- Railway — deploy succeeded (green), no crash-loop, log output looks healthy?
- Better Stack — both monitors still green at https://keycontent.betteruptime.com?
- Your inbox / bug report widget — any user reports?
- Cloudflare — Analytics dashboard for unusual traffic or error rates?
If anything looks off: rollback first, diagnose later.
Anti-Patterns (don't do these)
- ❌ "It's just a tiny change" — every change goes through the pipeline
- ❌ Promoting on Friday afternoon — there's no good reason
- ❌ Skipping the staging smoke test — staging exists precisely for this
- ❌ Adding a secret to prod first — always staging first
- ❌ Manual SQL changes against prod — use the migration system; untracked changes break future migrations
- ❌ Deploying without a rollback path mentally rehearsed — if you can't rollback, don't deploy
Why we elevated the standard
The industry has dozens of deploy frameworks: blue-green, canary, feature flags, ring-based rollouts, GitOps, etc. Most are designed for large engineering teams with dedicated SRE. We don't need that complexity.
Our elevations:
- Three rules, no exceptions — easy to remember, hard to violate by mistake
- Observable-first deploys — Sentry + Better Stack + Railway rollback compose into a fast-detect-and-revert safety net. Replaces the pre-deploy gradual ramp we don't have on Railway.
- Description-based review pre-merge, behavior-based verification post-merge — pairs with HOW-WE-DO-PR-REVIEWS.md
- Staging is non-optional — even for "trivial" changes (a culture choice, not a technical one)
- Rollback-first culture — explicit safety bias during incidents
- Friday rule — no production deploys after Thursday lunch unless it's a hotfix
The trade: deploys take ~10 extra minutes each. The gain: production stays stable, mornings stay calm, and you can promote with confidence.
Concrete Example — Promoting V0 (Report Issue Widget) to Production
This is the canonical promotion exercise. Use it as the reference for future promotions:
- Confirm V0 has been live on staging.keycontent.ai for 24+ hours with no Sentry alerts
- Add POSTMARK_SERVER_TOKEN and REPORT_TO_EMAIL to production Supabase Edge Function secrets — operator dashboard
- Apply the storage migration to production (creates bug-report-screenshots bucket + RLS policies) — Claude Code task
- Deploy report-issue Edge Function to production Supabase — Claude Code task
- Verify Edge Function CORS allows the production origin (https://keycontent.ai) — Claude Code task
- Confirm Postmark sender domain is verified for production — operator dashboard
- Merge staging → main via PR — triggers Railway production deploy automatically (~2-3 min)
- Smoke test on production: submit a Bug report with screenshot — operator
- Sentry + Better Stack watch for 2 hours — operator
- If alarms fire: Railway → KeyContent prod service → Deployments → redeploy previous (the rollback)
- Done.
After this lands, this section becomes the "we've done it once" reference for App #2.
When this doc changes
- After any incident traced to a deploy — what gate failed?
- When the stack gains a new component (e.g., when Sentry is wired up in Phase 1, add a release-tracking step)
- When a new deploy type emerges (e.g., feature flags)
- Bump version, note the change
See also
- HOW-WE-DO-PRIORITY.md — what justifies a hotfix vs standard deploy
- HOW-WE-DO-PR-REVIEWS.md — the verification gate before merge
- HOW-WE-DO-INCIDENTS.md — what to do when a deploy goes wrong
- ../ROADMAP.md — V0 production promotion is queued under "Phase 0 wrap-up"
- Deck Slide 6 — Scenario 3: shipping new code safely (the visual narrative)
Bug Reports
Make reporting effortless. Make context automatic.
Status: Active · v1 · 2026-05-13 Applies to: User-facing feedback channel · KeyContent (live on staging) · Future Coreshift HQ apps Owners: Operator (triager) · Claude Code (implementer) · Tools (capture, store, deliver)
What this doc is
The end-to-end definition of how user feedback flows into the system — from the click of the floating widget to the issue landing in the triage queue. Codifies what's already shipped on staging.keycontent.ai.
The principle
Make reporting effortless. Make context automatic.
The user types one sentence. The system attaches everything else — who they are, where they were, what their browser is, what they were looking at. That's the difference between a useful report and a frustrating email thread.
The Pipeline
User clicks the floating "?" button on any page
↓
Modal opens — pre-filled with context awareness
↓
User picks a type: 🐛 Bug 💡 Suggestion ❓ Question
↓
User types their message (placeholder adapts to type)
↓
User optionally drops in a screenshot
↓
On submit: screenshot uploads to Supabase Storage (if present)
↓
POST to Edge Function with: { type, message, page_url, user_agent, screenshot_path, app_id }
↓
Edge Function validates auth + payload, generates a signed URL for the screenshot
↓
Postmark sends email to operator with full context
↓
Operator triages in inbox (P0–P3 + radius + narrative)
↓
Hand to Claude Code → fix lands → loop closed
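The validation step in the pipeline can be sketched as a small pure function. This is a minimal sketch, not the shipped `report-issue` code: the field names come from the POST payload above, but the helper name and error strings are illustrative.

```typescript
// Sketch of the Edge Function's payload check. Field names match the
// POST payload in the pipeline; everything else is illustrative.
type ReportPayload = {
  type: "bug" | "suggestion" | "question";
  message: string;
  page_url: string;
  user_agent: string;
  screenshot_path?: string; // only present when the user attached a screenshot
  app_id: string;
};

// Returns a list of problems; an empty list means the payload is acceptable.
function validatePayload(body: Partial<ReportPayload>): string[] {
  const errors: string[] = [];
  if (!["bug", "suggestion", "question"].includes(body.type ?? "")) {
    errors.push("type must be bug, suggestion, or question");
  }
  if (!body.message?.trim()) errors.push("message is required");
  if (!body.page_url) errors.push("page_url is required");
  if (!body.app_id) errors.push("app_id is required");
  return errors;
}
```

A payload that fails any check would be rejected before the screenshot URL is signed or the email is sent.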
The Three Ticket Types
| Icon | Type | When users pick it | Maps to (future GitHub label) |
|---|---|---|---|
| 🐛 | Bug | "Something is broken" | `bug` |
| 💡 | Suggestion | "I have an idea or want a feature" | `enhancement` |
| ❓ | Question | "I'm stuck, unsure, or need help" | `question` |
Default selection: Bug (most common in production apps).
The type selector drives two things:
- The textarea placeholder adapts (better coaching per type)
- The email subject prefix in the operator's inbox: `[KeyContent · BUG]` vs `[· SUGGESTION]` vs `[· QUESTION]`
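The subject line is mechanical enough to sketch. `formatSubject` is a hypothetical helper name; the output format follows the email example in this doc.

```typescript
// Hypothetical helper that builds the operator-inbox subject line.
type ReportType = "bug" | "suggestion" | "question";

function formatSubject(appName: string, type: ReportType, userEmail: string): string {
  return `[${appName} · ${type.toUpperCase()}] from ${userEmail}`;
}

// formatSubject("KeyContent", "bug", "nzricky@gmail.com")
// → "[KeyContent · BUG] from nzricky@gmail.com"
```

Because the prefix is fixed-format, inbox rules can match on it reliably across apps.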
What Gets Captured Automatically
The user types one sentence. The system attaches all of this without asking:
| Field | Where it comes from |
|---|---|
| User email | Supabase auth session |
| User ID | Supabase auth session |
| Page URL | `window.location.href` at the moment of report |
| Browser / OS | `navigator.userAgent` |
| Server timestamp | Edge Function captures on receipt (UTC) |
| Screenshot URL | Signed Supabase Storage URL (30-day expiry), if attached |
| App ID | "keycontent" for now; future apps drop in seamlessly |
This is why operator triage takes 30 seconds instead of 3 emails: the report already contains everything needed to act.
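Client-side, the capture amounts to assembling the POST payload from ambient context. In this sketch the browser globals are passed in as a plain object so it stands alone; in the widget they would come from `window.location.href` and `navigator.userAgent`, and `buildReport` is a hypothetical name.

```typescript
// Sketch of the client-side capture step: the user supplies only type and
// message; everything else is read from the environment.
type BrowserEnv = { href: string; userAgent: string };

function buildReport(
  type: "bug" | "suggestion" | "question",
  message: string,
  env: BrowserEnv,
  appId: string,
  screenshotPath?: string,
) {
  // User identity is deliberately absent from the payload: the Edge Function
  // derives it from the Supabase JWT on the request, so the client can't spoof it.
  return {
    type,
    message,
    page_url: env.href,
    user_agent: env.userAgent,
    screenshot_path: screenshotPath, // undefined when no screenshot was uploaded
    app_id: appId,
  };
}
```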
The Email Format (what lands in your inbox)
Subject: [KeyContent · BUG] from nzricky@gmail.com
Type: Bug
App: keycontent
User: nzricky@gmail.com (id: 6b87712a-abce-48ee-b444-...)
Page: https://staging.keycontent.ai/jobs/d66a2195-...
Browser: Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/148.0.0.0 ...
Time: Mon, 11 May 2026 09:16:21 GMT
Screenshot: https://...supabase.co/storage/v1/.../screenshot.png?token=...
----- Their message -----
{the user's message}
The subject prefix is the key feature for triage: inbox rules can route by type automatically.
Architecture (the components)
| Layer | Component | Where it lives |
|---|---|---|
| Frontend widget | `<ReportIssueButton />` mounted in app root layout | KeyContent frontend repo |
| Storage | Private bucket `bug-report-screenshots`, RLS-scoped to user folder | Supabase Storage |
| Edge Function | `report-issue` — validates auth, generates signed URL, sends email | Supabase Edge Functions |
| Email delivery | Postmark API (via `POSTMARK_SERVER_TOKEN`) | Postmark |
| Auth | Supabase JWT on every request | Supabase Auth |
| Multi-app readiness | `app_id` field in payload | Forward-compatible for Sentinel |
Security Model
- Authentication required — only signed-in users can submit reports (the Edge Function returns 401 otherwise)
- Screenshots are private by default — bucket is private, RLS restricts uploads to the user's own folder
- Signed URLs with 30-day expiry — operator's email has a time-bounded link, not a permanent public URL
- Secrets never in code — `POSTMARK_SERVER_TOKEN` and `REPORT_TO_EMAIL` live in Supabase Edge Function secrets
- CORS allowlist — Edge Function only accepts requests from the app's verified origin
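The CORS allowlist behaviour can be sketched as a small origin check. The two origins listed are the KeyContent staging and production domains that appear elsewhere in this doc; `corsOriginFor` is a hypothetical helper name, not the shipped code.

```typescript
// Sketch of the Edge Function's CORS gate: only known origins are echoed
// back in Access-Control-Allow-Origin; anything else is refused.
const ALLOWED_ORIGINS = ["https://staging.keycontent.ai", "https://keycontent.ai"];

// Returns the value to echo in Access-Control-Allow-Origin, or null to refuse.
function corsOriginFor(requestOrigin: string | null): string | null {
  return requestOrigin !== null && ALLOWED_ORIGINS.includes(requestOrigin)
    ? requestOrigin
    : null;
}
```

Echoing the matched origin (rather than `*`) keeps the allowlist explicit per app.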
Operator Triage Flow
When a report lands in your inbox:
- Read the subject — type tells you the lane (bug/suggestion/question)
- Read the message + look at screenshot — 30 seconds to understand
- Assign priority (per `HOW-WE-DO-PRIORITY.md`) — P0/P1/P2/P3 + blast radius + narrative
- For Bug + Suggestion: brief Claude Code, get a PR, verify on preview, merge
- For Question: reply to the user directly with the answer
- Optionally: archive the email or move to a "triaged" folder
Recommended inbox rules (Gmail/Outlook):
- `[KeyContent · BUG]` → flag + star + same-day
- `[KeyContent · SUGGESTION]` → archive to "Product Backlog" label, weekly review
- `[KeyContent · QUESTION]` → flag for same-day reply
Phase 1 Evolution (when full ops kicks in)
The pipeline is designed to grow without rewriting the frontend. Three additive enhancements:
A. Save to a Supabase table
The Edge Function also writes a row to a `bug_reports` table (alongside sending the email):
- Columns: `id, type, message, page_url, user_agent, user_id, user_email, screenshot_path, app_id, created_at, status, priority, blast_radius`
- Enables analytics, search, and trend analysis over time
- `app_id` is already in the payload — no migration of historical data needed
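The row shape can be sketched from the column list above. The `"new"` status default and the null triage fields are assumptions, as is leaving `id` and `created_at` to the database; `toRow` is a hypothetical name.

```typescript
// Sketch of mapping a validated report + authenticated user to a bug_reports
// row. Column names come from the list above; defaults are assumptions.
type BugReportRow = {
  type: string;
  message: string;
  page_url: string;
  user_agent: string;
  user_id: string;
  user_email: string;
  screenshot_path: string | null;
  app_id: string;
  status: string;              // assumed default: "new" until triaged
  priority: string | null;     // filled at triage (P0–P3)
  blast_radius: number | null; // filled at triage
};

function toRow(
  report: { type: string; message: string; page_url: string; user_agent: string; screenshot_path?: string; app_id: string },
  user: { id: string; email: string },
): BugReportRow {
  return {
    type: report.type,
    message: report.message,
    page_url: report.page_url,
    user_agent: report.user_agent,
    user_id: user.id,
    user_email: user.email,
    screenshot_path: report.screenshot_path ?? null,
    app_id: report.app_id,
    status: "new",
    priority: null,
    blast_radius: null,
  };
}
```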
B. Auto-create a GitHub Issue
After sending the email, the Edge Function also creates a GitHub Issue:
- Repo: `keycontent` (or matched by `app_id`)
- Label: `bug` / `enhancement` / `question` (mapped from type)
- Title: `{type}: {first 60 chars of message}`
- Body: the same context as the email + a link to the screenshot
- Operator triages on GitHub, not in email
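The title rule is simple enough to sketch. The 60-character truncation comes from the format above; appending an ellipsis on truncation is an assumption, and `issueTitle` is a hypothetical helper name.

```typescript
// Hypothetical helper for the auto-created GitHub Issue title:
// "{type}: {first 60 chars of message}".
function issueTitle(type: string, message: string): string {
  const head = message.length > 60 ? message.slice(0, 60) + "…" : message;
  return `${type}: ${head}`;
}

// issueTitle("bug", "save button does nothing")
// → "bug: save button does nothing"
```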
C. Auto-reply on resolution
When the GitHub Issue is closed (or labelled resolved), an automated email goes back to the user:
- Subject: `Update on your report — {type}`
- Body: thanks for reporting, here's what we changed
- Closes the loop and builds trust; almost no other apps do this
These three additions take the system from "useful" to "remarkable" — without the user-facing widget changing at all.
Why we elevated the standard
Most apps have a "Contact Us" form or a help@ email address. Reports come in with no context, and the support thread that follows is mostly back-and-forth asking for screenshots, browser versions, and account details. Hours wasted per report.
Our elevations:
- One-click reporting from any page (vs hunting for a contact link)
- Auto-captured context (vs interrogation email threads)
- Type-aware UX (placeholder adapts → users describe better)
- Screenshots without screenshot tools (drag-drop in the modal)
- Inbox-routable subject prefixes (vs everything in one queue)
- Forward-compatible for multi-app (`app_id` field already plumbed)
The trade: building the widget took ~2 hours of Claude Code time. The gain: every report that lands is actionable in 30 seconds instead of 3 hours of email tennis.
Cloning to Another App
When Coreshift HQ launches App #2, the bug-reports system clones in an afternoon:
- Drop `<ReportIssueButton />` into App #2's layout (set `app_id: "app2-slug"`)
- Create the `bug-report-screenshots` bucket + RLS policies on App #2's Supabase
- Deploy the `report-issue` Edge Function to App #2's Supabase
- Add `POSTMARK_SERVER_TOKEN` + `REPORT_TO_EMAIL` to App #2's secrets
- Optionally: separate Postmark Message Stream for App #2's reports
Everything else — the email format, the type system, the triage flow — is identical across apps. One operator can triage all apps from one inbox, filtered by the subject prefix.
This is why the system is also called "Sentinel" in the long-term vision: one set of eyes, every Coreshift HQ app.
When this doc changes
- When the widget adds new types or fields
- When a new app onboards (note any deviations)
- When Phase 1 adds the DB table / GitHub Issue auto-creation
- After any incident affecting the report pipeline
- Bump version, note the change
See also
- `../briefs/BRIEF-report-issue-widget.md` — V1 implementation brief (the spec Claude Code shipped from)
- `../briefs/BRIEF-report-issue-widget-round2.md` — polish + type selector
- `../briefs/BRIEF-rewire-to-postmark.md` — email vendor swap
- `HOW-WE-DO-PRIORITY.md` — what happens after the report lands
- `../ROADMAP.md` — Phase 2 enhancements queued
- Deck Slide 4 — Scenario 1: a user hits a bug (the visual narrative)
App Audits
Defaults are dangerous. Audit is the gate.
Status: Active · v1 · 2026-05-13 Applies to: Every Coreshift HQ app — at onboarding (before joining Sentinel ops) and quarterly thereafter Owners: Operator (runs the checklist) · Claude Code (writes fix briefs for failed items) · Tools (Supabase advisors, Cloudflare audit logs, vendor dashboards)
What this doc is
A single-pass checklist that verifies an app has its default-off settings turned on before it's considered operator-ready. Runs as a gate at onboarding (first time) and as a re-audit on a quarterly cadence thereafter.
This doc exists because of a specific lesson learned: when KeyContent was handed over for operator-driven maintenance, `webhook_events` had RLS disabled and Supabase Auth had HIBP password protection off — both default-off settings the original developer never toggled on, because no checklist forced the question. This checklist is the elevation.
The principle
Defaults are dangerous. Cloud platforms ship with security off-by-default because they don't know your context. An audit is the gate that catches what was never toggled on.
Not paranoia. Not enterprise compliance theatre. Just a 30-minute pass that asks: "For every cloud setting that defaults to less-safe, did we make a conscious choice?"
When this audit runs
| Trigger | What it produces |
|---|---|
| Onboarding — before an app joins Sentinel ops | Pass/fail gate. App can't go live with Sentinel ops until 100% pass or every failure is consciously deferred. |
| Quarterly — recurring re-audit | Delta report: what regressed, what's new, what new vendor checks should be added. |
| Post-incident — when an incident traces to a default-off setting | Targeted re-check of the affected area + add the check to this doc. |
The Audit Checklist
Each item is a yes/no check the operator can verify in ≤ 2 minutes. Items marked (N/A until X) are only relevant once the named vendor or feature is in the stack.
Supabase — Database + Auth + Storage + Edge Functions
- `get_advisors(security)` returns zero critical and zero warn entries
- `get_advisors(performance)` reviewed — issues triaged or deferred
- Row Level Security enabled on every `public` schema table
  - Verify: `SELECT relname FROM pg_class WHERE relkind='r' AND relnamespace='public'::regnamespace AND NOT relrowsecurity` returns 0 rows
- Every RLS-enabled table has at least one policy (else the authenticated role is silently locked out)
- HIBP password protection enabled (Auth → Password security → "Check passwords against HaveIBeenPwned")
- Email rate limits configured (Auth → Rate limits) — defaults are often too generous
- All Storage buckets are private by default; public buckets are explicit and justified
- Storage bucket policies scope writes to the user's own folder (path pattern `{user_id}/...`)
- CORS allowlist on every Edge Function lists only known origins (no `*`)
- All Edge Function secrets exist in both staging AND production projects
- Service role key is not referenced in any client-side code (grep `client/` for `SERVICE_ROLE` — must be 0 hits)
- Database backup retention reviewed (Supabase default is 7 days; upgrade if data loss tolerance < 7d)
- No raw SQL changes against production outside the migration system (every prod schema change has a `supabase/migrations/` file)
Cloudflare — Frontend + DNS + WAF
- WAF rate limiting on auth endpoints (`/login`, `/signup`, password reset)
- DNSSEC enabled on the production domain
- Bot Fight Mode (free tier) enabled (verified 2026-05-13 on keycontent.ai with JS Detections on)
- SSL/TLS encryption mode set to Full (Strict) — not Flexible (verified 2026-05-13)
- Pages preview deploys are gated (auth required) OR known to be safe-by-default
- Environment variables exist in both Preview and Production environments
- Page Rules / Configuration Rules reviewed for leftover staging-only rules
GitHub — Repo Hygiene
- Branch protection on `main`: requires PR + status checks pass + linear history
- Branch protection on `staging`: requires PR + status checks pass
- Default branch is the production branch (`main` for KeyContent)
- `.gitignore` excludes `.env*`, `*.pem`, `*.key`, credential files
- Secret scanning enabled (free for public repos; opt-in for private)
- Dependabot enabled for security updates
- `.github/ISSUE_TEMPLATE/` populated per `HOW-WE-DO-BUG-REPORTS.md`
- No secrets in commit history (gitleaks or `gh secret-scanning` pass clean)
- Repo visibility matches expectation (private for KeyContent; public exposure would be intentional)
Postmark — Transactional Email
- Sending domain verified (DKIM + Return-Path + SPF all green)
- Bounce + complaint webhooks configured to the app (or Sentinel ingestion)
- Message Streams separated by purpose (e.g. `outbound` for transactional vs `broadcast` for marketing)
- Suppression list reviewed quarterly
- Sender reputation score acceptable (≥ 80)
Sentry — Error Monitoring (N/A until Phase 1 Week 1 ships)
- Frontend + backend (server + edge functions) projects exist
- DSNs configured per environment (staging DSN ≠ prod DSN)
- Source map upload verified (stack traces show readable file names, not minified)
- Release tracking wired to deploys (commit SHA tagging confirmed)
- Alert rules: new-issue email, regression email, high-frequency email
- User context attached after auth (id + email)
- PII capture is off (`sendDefaultPii: false`)
- Session replay privacy defaults (mask all text, block all media)
Better Stack — Uptime Monitoring (N/A until Phase 1 Week 2 ships)
- Uptime monitors cover both staging and production endpoints
- Check interval ≤ 3 min
- Alert channels configured (email at minimum; SMS for SEV1 ideally)
- Public status page exists at `status.{app-domain}` with brand customisation
Operational Hygiene
- Report Issue widget submits a Bug, Suggestion, and Question successfully (one of each)
- Inbox rules / labels routing `[{App} · BUG/SUGGESTION/QUESTION]` correctly
- Postmark sender domain matches the app's primary domain
- Rollback procedure mentally rehearsed per `HOW-WE-DO-DEPLOYS.md` Rule 2
- Triage rhythm documented and being followed (10-min ritual per `ROADMAP.md` Phase 1 Week 4)
- `ROADMAP.md` and `KANBAN.md` checked for staleness — shipped items marked `[x]`
What to do when an audit item fails
Don't fix items mid-audit. Finish the full pass, then triage all failures together. This prevents tunnel-vision on one finding while missing others.
For each failure:
↓
1. Categorise by impact using HOW-WE-DO-PRIORITY (P0/P1/P2/P3)
↓
2. Determine fix type:
▸ Code fix (e.g. missing RLS policy) → write a brief, hand to Claude Code
▸ Dashboard toggle (e.g. HIBP) → add a kanban card, operator handles
▸ Process gap (e.g. no inbox rule) → add to triage rhythm, no PR needed
↓
3. File the action — every failure produces either a PR, a kanban card,
or an explicit "consciously deferred" note (with reason + reconsider date)
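The routing in step 2–3 can be sketched as a tiny mapping from fix type to the artifact it must produce. The labels are illustrative, not a shipped enum; the point is that every failure resolves to exactly one of these outputs.

```typescript
// Sketch of audit-failure routing: each fix type maps to exactly one artifact.
type FixType = "code" | "dashboard" | "process" | "deferred";

const ARTIFACT_FOR: Record<FixType, string> = {
  code:      "brief for Claude Code, resulting in a PR",
  dashboard: "kanban card; operator handles the toggle",
  process:   "add to triage rhythm (no PR needed)",
  deferred:  "consciously-deferred note (reason + reconsider date)",
};

function artifactFor(fix: FixType): string {
  return ARTIFACT_FOR[fix];
}
```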
Why we elevated the standard
The industry has formal compliance audits — SOC2, ISO 27001, HIPAA, etc. They're designed for enterprise teams with compliance officers, weeks of evidence-gathering, and external auditors. Most of that ceremony doesn't apply to a solo operator running 1–N small apps. But the underlying principle — that defaults are dangerous and audit is a gate — does apply.
Our elevations:
- Specific to our stack — checks Supabase, Cloudflare, Postmark, GitHub, Sentry, Better Stack directly. Not generic CIS benchmarks.
- Automatable where possible — `get_advisors`, RLS query, grep checks. Operator runs them; Claude Code can re-run them on demand.
- Lightweight — 30-minute pass, not a 2-week formal audit.
- Dual-purpose — same checklist for first-time onboarding AND quarterly re-audit. No second doc to maintain.
- Failure flows into existing workflows — Claude Code briefs for code fixes, kanban cards for dashboard toggles. No special incident category.
- App-portable — every Coreshift HQ app passes the same audit before joining Sentinel ops. App #2 inherits the discipline.
The win at scale: when Sentinel covers 3+ apps, the audit is the contract. No app ever ships with RLS disabled on a webhook log or HIBP off in Auth again, because the checklist would've blocked it.
When this doc changes
- After any incident that traces to a default-off setting — add the check that would've caught it
- When a new vendor enters the stack — add their section (Stripe security, Zernio HMAC, etc.)
- When Supabase / Cloudflare / etc. ship new advisor types — add the lints they surface
- Quarterly, after running the audit — what was missing? What's noise?
- When App #2 onboards — verify every check is portable; flag KeyContent-specific items
Bump the version at the top and note the change.
Outstanding items for KeyContent (initial audit, 2026-05-13)
The first time this checklist runs on KeyContent, it produces this delta. Surfaced during the Sentinel folder review session.
| Item | Status | Action |
|---|---|---|
| RLS on `webhook_events` | ✅ Fixed on staging 2026-05-12; awaiting prod promotion | See `../briefs/BRIEF-rls-fix-webhook-events.md` |
| HIBP password protection | ✅ Enabled on staging + prod 2026-05-13 | — |
| `get_advisors` zero-warn | ✅ Staging clean; prod has 1 outstanding (RLS pending promotion) | Resolves when RLS fix promotes to prod |
| Sentry sections | N/A | Will run after Phase 1 Week 1 |
| Better Stack sections | N/A | Will run after Phase 1 Week 2 |
| Branch protection | ✅ Rules added on `main` + `staging` 2026-05-13 | Admin bypass left ON for 2-person team flexibility |
| Backup retention | ✅ Verified 2026-05-13 — 8 days of daily DB backups | ⚠️ Storage objects not in backups; follow-up queued for Phase 1 Week 5 |
Re-run the full audit once Sentry and Better Stack are integrated, and quarterly thereafter.
See also
- `HOW-WE-DO-PRIORITY.md` — how audit failures get triaged
- `HOW-WE-DO-DEPLOYS.md` — Type C migrations (the usual fix path for RLS / schema gaps)
- `HOW-WE-DO-INCIDENTS.md` — when a failure is severe enough to be incident-mode
- `../ROADMAP.md` — Phase 4 (Sentinel) — passing this audit is the gate for any new app joining
- Supabase Database Advisors docs
- Cloudflare WAF docs