codebase-analysis/SKILL.md

---
name: codebase-analysis
description: >-
  Analyze open source repositories to extract conventions or patterns.
  Two modes: "conventions" (how a project works architecturally) and
  "patterns" (how to write idiomatic code in that language/ecosystem).
  Use when asked to "analyze a repo", "extract patterns from", "what
  conventions does X use", "how should I write X", "what's idiomatic",
  "add X to the analysis repos", or "how does X do Y architecturally".
  Do NOT use for: code review of specific PRs (use pr-review), security
  audits (use vuln-scout), or reading a single file for a quick answer.
---

# Codebase Analysis

Extract conventions or idiomatic patterns from open source repos.

## Mode

Set `MODE` when invoking (or infer from request):

| Mode | Question | Output | Repo suffix |
|------|----------|--------|-------------|
| `conventions` | "How does this project work?" | Architecture, governance, unique infra | `*-conventions` |
| `patterns` | "How should I write code like this?" | Prescriptive rules for users | `*-patterns` |

**Default:** `conventions` unless the request says "idiomatic",
"how to write", "style guide", or "patterns for users".

**Both modes share Phases 1-7.** They diverge at Phase 8 (synthesis).

## Configuration

Set these in your workspace context (TOOLS.md, AGENTS.md, or pass
explicitly when invoking the skill):

| Parameter | Description | Example |
|-----------|-------------|----------|
| `CLONE_DIR` | Directory to clone repos into | `~/src/analysis/` |
| `CLONE_HOST` | Machine with disk + git for cloning | `forge`, `localhost` |
| `GIT_REMOTE` | Where convention repos are pushed | `https://git.example.com` |
| `GIT_ORG` | Org/user for convention repos | `myorg`, `username` |
| `GIT_TOKEN_PATH` | Path to auth token for pushing | `~/.credentials/git-token` |

**Minimum required:** `CLONE_DIR` and `GIT_REMOTE`. If others are
omitted:
- `CLONE_HOST` defaults to localhost (current machine)
- `GIT_ORG` defaults to the authenticated user
- `GIT_TOKEN_PATH` uses default git credential helper

**Example in TOOLS.md:**
```markdown
## Codebase Analysis
- CLONE_DIR: ~/src/analysis/
- CLONE_HOST: my-dev-server (ssh user@host)
- GIT_REMOTE: https://git.example.com
- GIT_ORG: my-patterns
- GIT_TOKEN_PATH: ~/.credentials/git-token
```

If not explicitly provided, infer from workspace context (TOOLS.md,
shell environment, or git remote configuration).

## Naming

- `*-patterns` = prescriptive (how users should write code)
- `*-conventions` = descriptive (how a specific codebase works)

A language can have both: `go-patterns` (write Go like this) AND
`golang-conventions` (how the Go team builds Go itself).

## Thinking Framework

Before starting any analysis, ask:

1. **What is this project's essence?** A trading system is a state
   machine where the state is money. A workflow engine is a tree of
   state machines. Name the essence — the patterns follow from it.
2. **What forces shaped it?** Team size, age, performance constraints,
   backward compatibility obligations. These predict WHERE conventions
   will be strict vs relaxed.
3. **What would surprise me?** The interesting findings are never "they
   use interfaces" — it's "they have 566 dynamic config settings" or
   "zero TODOs in 3.8M of code." Surprise = insight.

## Prioritization: What to Dig Into

Not everything is interesting. Focus on patterns that:

- **Appear >50 times** — this is a conscious convention, not a one-off
- **Have a dedicated package** — someone thought it was important enough
  to abstract
- **Other projects solve differently** — reveals a real design tradeoff
- **Have a surprising name** — indicates the team had to invent
  vocabulary for a novel concept
- **Were introduced recently with many PR comments** — active design
  decisions with recorded rationale

Skip patterns that are:
- Standard library usage (unless the project wraps/extends it)
- Single-use internal helpers
- Generated code
- Exact copies of well-known open-source patterns without modification

## Phases

### Phase 1: Shape (5 min)

Clone to `CLONE_DIR/<name>` on `CLONE_HOST`. Full clone — never shallow.

Measure: size, files, commits, contributors, top-level dirs.

**What matters here:** The ratio of test files to production files.
The presence/absence of `internal/` vs flat structure. Whether there's
a single `pkg/` or many top-level packages. These reveal organizational
philosophy before you read a single line.

### Phase 2: What the Codebase Values (10 min)

Find the most-imported internal packages. The top 5 are the
project's definition of "foundational."

**Ask:** Why these? What do they share? Usually: logging, errors,
config, and one domain-specific abstraction that IS the project.
That domain-specific one is where the real conventions live.

See `references/commands.md` for grep patterns by language.

### Phase 3: Interface Contracts (10 min)

Find interfaces/behaviours/protocols — but don't list them all.

**Focus on:** Interfaces with >3 implementations (these are real
extension points). Interfaces in constructor signatures (these are
dependency injection boundaries). Interfaces that appear in BOTH
production and test code (these are the testability seams).

**Skip:** One-method interfaces (usually just for mocking). Interfaces
only used in one place (not yet conventions).

### Phase 4: Quality Fingerprint (5 min)

Measure: TODO count, FIXME count, HACK count, test count, mock count.

**What to notice:**
- TODO format reveals discipline: `TODO(owner):` = accountability,
  `TODO:` = aspirational, version-gated = systematic cleanup
- Zero TODOs in a large codebase means active cleanup culture
- High mock count relative to test count suggests heavy DI
- HACK count > 0 is honest; HACK count = 0 in a large project is
  suspicious (they probably use different words)

### Phase 5: Unique Patterns (15 min)

Look for infrastructure NOT in stdlib. Categories:

- **Concurrency:** goroutine handles, schedulers, shutdown primitives
- **Testing:** custom assertions, fake registries, golden file systems
- **Configuration:** dynamic config, feature flags, runtime toggles
- **Error handling:** custom error types, assertion systems, panic
  recovery patterns
- **Extension:** plugin registration, hook systems, middleware chains

**The test for uniqueness:** Would you be surprised to find this in
another project of similar size? If yes → convention worth documenting.
If no → standard practice, skip.

### Phase 6: Git Archaeology (20 min)

For each unique pattern found in Phase 5:

1. Find the commit that introduced it (`git log --diff-filter=A`)
2. Read the commit message — the "why" is usually there
3. Check if it replaced something (`git log -S "old_name"`)
4. Note the date and author — context for why shortcuts were taken

**The insight is always WHY, not WHAT.** A bare goroutine with a
TODO is uninteresting as a listing. A bare goroutine introduced during
a complex 20-file admission control feature, tagged by the author in
the same commit, that survived 3 years because nobody touched the
function — that's a lesson about how real codebases evolve.

See `references/commands.md` for git archaeology patterns.

**If the repo is on a forge without PR history** (self-hosted, mailing
list-based): Fall back to commit messages and CHANGELOG. The commit
body IS the PR description for these projects. Look for "Reviewed-by"
trailers and linked issues.

### Phase 7: PR Discussions (20 min)

Find PRs where key patterns were introduced. Read:
- The PR body (author's motivation)
- Review comments (the debate)
- The resolution

**What to extract from discussions:**
- What the author was defending (= where the real insight is)
- What reviewers pushed back on (= non-obvious tradeoffs)
- Whether it was "merge and iterate" vs "perfect before merge"
- Whether external validation was cited (benchmarks, user feedback)
- The migration strategy (big-bang vs gradual coexistence)

**The highest-value finding:** When a reviewer says "I wish we'd done
X instead" and the author explains why X doesn't work. That tradeoff
reasoning is pure expert knowledge.

### Phase 8: Synthesis

Produce output based on MODE. Push to `GIT_REMOTE`.

---

#### MODE: conventions

Output: `<project>-conventions` repo.

**`analysis.md`** — the full story:
1. Repo shape and organizational philosophy
2. Import hierarchy (what it values)
3. Key patterns with code examples + origin stories
4. PR discussion excerpts (attributed quotes)
5. Cross-ecosystem comparisons (prior art, independent invention)
6. Quality metrics in context (not bare numbers)

**`conventions.md`** — the reference:
For each unique pattern:
- Name and location in source
- Code example (real, not simplified)
- When to use / When NOT to use
- Origin (commit date, author, PR# if available)

**Tone:** Descriptive. "This project does X because Y."

---

#### MODE: patterns

Output: `<language>-patterns` or `<ecosystem>-patterns` repo.

**Synthesis question:** "What should a developer copy from this
codebase?" Filter everything through: "If I were writing new code
in this language/ecosystem, what rules does this source teach me?"

**This is iterative, not one-shot.** Keep extracting until you've
identified ALL patterns the source demonstrates. A first pass finds
the obvious ones. Second pass greps for variations and edge cases.
Third pass finds the patterns that break. You're done when scanning
the source no longer reveals new rules.

**Process:**
1. **Discovery pass** — scan the source by topic area, identify every
   distinct pattern (aim for 15-30+ per topic in a large codebase)
2. **Deepening pass** — for each pattern, grep for 5-10 real usages
   across the codebase. Note variations. Find the best example.
3. **Edge case pass** — find where each pattern DOESN'T apply.
   Grep for violations — are they bugs, or legitimate exceptions?
4. **Cross-reference pass** — which patterns interact? Which ones
   conflict? Document the decision framework for choosing between
   competing patterns.

Repeat until scanning the source yields no new patterns. A language
stdlib should produce 50-200+ patterns across all topics.

**Output structure — one file per topic:**

`patterns/<topic>.md` — topics include:
- Error handling
- Naming conventions
- Concurrency patterns
- Testing patterns
- Interface/protocol design
- Module organization
- Documentation conventions
- Performance idioms
- Configuration patterns
- Extension/plugin patterns

Each pattern entry:
- Name (short, linkable heading)
- Rule (one sentence: "Do X" or "Never Y")
- Example (real code from the source, not invented)
- Why (the force that makes this the right choice)
- When to use (the trigger condition — what situation calls for this)
- When NOT to use (where the pattern breaks down)
- Source (file path + line, or commit where it's demonstrated)

**`smells.md`** — anti-patterns found in the source:
- What it looks like
- Why it exists (technical debt? deliberate tradeoff?)
- What to do instead

**Tone:** Prescriptive. "Write it this way because X."

**Key difference from conventions mode:** Skip governance, team
structure, TODO culture, and project history unless they directly
inform HOW to write code. Focus on patterns a user should copy.

**Done criteria:** You've scanned every major directory in the source.
No new patterns emerge from further grep/read. Each topic file has
15+ patterns with real examples. Edge cases are documented.

---

End all output files with `<!-- PATTERN_COMPLETE -->` sentinel.

## Cross-Ecosystem Observations

Always note when a pattern exists in multiple repos. These
independent inventions reveal forces that transcend project context:
- Temporal goro.Handle (2021) ↔ CockroachDB stop.Handle (2025)
- Ecto zero TODOs (version-gated) ↔ Oban zero TODOs (2-week cleanup)
- Prometheus init() plugins ↔ Temporal init() plugins

## The 4 Categories of Pattern Breaks

When you find convention violations, classify:

1. **Ship behavior, fix plumbing later** — tagged with TODO same commit
2. **Better tooling exposed limitation** — observability, not correctness
3. **Removal cost > carrying cost** — zero-interest debt
4. **Context needs different pattern** — not actually a break

See `references/pattern-breaks.md` for real examples with git history.

## NEVER

- **NEVER analyze with a shallow clone** and assume full picture —
  archaeology requires full history
- **NEVER present patterns from one file as repo-wide conventions** —
  verify frequency across the codebase first
- **NEVER skip PR discussions** — code without context is just syntax;
  the discussion IS the insight
- **NEVER report bare numbers** ("738 TODOs") — always contextualize
  (per 1000 files, vs comparable projects, trending up/down)
- **NEVER confuse "the maintainer likes X" with "X is the right
  pattern"** — solo-maintained projects reflect one person's taste;
  team projects reflect negotiated conventions
- **NEVER present a pattern as "unique" without checking** if stdlib
  has it or if it's a well-known library pattern
- **NEVER list patterns without when-NOT-to-use** — that's where the
  expertise actually lives
- **NEVER quote PR discussions without attribution** — who said it
  matters (maintainer vs drive-by contributor)
- **NEVER analyze repos <1000 commits** — not enough history for
  meaningful archaeology
- **NEVER conflate language patterns with project conventions** — `go-
  patterns` is stdlib idiom; `temporal-conventions` is project choice

## Output Repos

Push to `GIT_REMOTE` under:
- **conventions mode:** `GIT_ORG/<project>-conventions`
- **patterns mode:** `GIT_ORG/<language>-patterns`

See `references/commands.md` for repo creation and push commands.

## Fallbacks

- **No PR discussions?** Use commit messages as primary source.
  Many projects (Linux, PostgreSQL) do all review in commit messages
  and mailing lists.
- **Repo too large to clone fully?** Clone shallow first, do Phase
  1-5, then `git fetch --unshallow` only if Phase 6-7 are needed.
- **Private repo / no forge API?** Skip Phase 7. Phase 6 (local git
  history) still works.
- **<3000 commits?** Reduce Phase 6-7 expectations. Younger projects
  have less archaeology to mine — focus on Phase 5 (unique patterns)
  and the project's README/docs for rationale.

## Execution Notes

- Clone on `CLONE_HOST` — needs disk space for full git history
- `gh api` or equivalent for forge PR lookups (requires authentication)
- One repo at a time for focused analysis
- Markdownlint all output before pushing