codebase-analysis/SKILL.md

---
name: codebase-analysis
description: >-
  Analyze open source repositories to extract conventions or patterns.
  Two modes: "conventions" (how a project works architecturally) and
  "patterns" (how to write idiomatic code in that language/ecosystem).
  Use when asked to "analyze a repo", "extract patterns from", "what
  conventions does X use", "how should I write X", "what's idiomatic",
  "add X to the analysis repos", or "how does X do Y architecturally".
  Do NOT use for: code review of specific PRs (use pr-review), security
  audits (use vuln-scout), or reading a single file for a quick answer.
---

# Codebase Analysis

Extract conventions or idiomatic patterns from open source repos.

## Mode

Set `MODE` when invoking (or infer from request):

| Mode | Question | Output | Repo suffix |
|------|----------|--------|-------------|
| `conventions` | "How does this project work?" | Architecture, governance, unique infra | `*-conventions` |
| `patterns` | "How should I write code like this?" | Prescriptive rules for users | `*-patterns` |

**Default:** `conventions` unless the request says "idiomatic",
"how to write", "style guide", or "patterns for users".

**Both modes share Phases 1-7.** They diverge at Phase 8 (synthesis).

## Configuration

Set these in your workspace context (TOOLS.md, AGENTS.md, or pass
explicitly when invoking the skill):

| Parameter | Description | Example |
|-----------|-------------|----------|
| `CLONE_DIR` | Directory to clone repos into | `~/src/analysis/` |
| `CLONE_HOST` | Machine with disk + git for cloning | `forge`, `localhost` |
| `GIT_REMOTE` | Where convention repos are pushed | `https://git.example.com` |
| `GIT_ORG` | Org/user for convention repos | `myorg`, `username` |
| `GIT_TOKEN_PATH` | Path to auth token for pushing | `~/.credentials/git-token` |

**Minimum required:** `CLONE_DIR` and `GIT_REMOTE`. If others are
omitted:
- `CLONE_HOST` defaults to localhost (current machine)
- `GIT_ORG` defaults to the authenticated user
- `GIT_TOKEN_PATH` uses default git credential helper

**Example in TOOLS.md:**
```markdown
## Codebase Analysis
- CLONE_DIR: ~/src/analysis/
- CLONE_HOST: my-dev-server (ssh user@host)
- GIT_REMOTE: https://git.example.com
- GIT_ORG: my-patterns
- GIT_TOKEN_PATH: ~/.credentials/git-token
```

If not explicitly provided, infer from workspace context (TOOLS.md,
shell environment, or git remote configuration).

## Naming

- `*-patterns` = prescriptive (how users should write code)
- `*-conventions` = descriptive (how a specific codebase works)

A language can have both: `go-patterns` (write Go like this) AND
`golang-conventions` (how the Go team builds Go itself).

## Thinking Framework

Before starting any analysis, ask:

1. **What is this project's essence?** A trading system is a state
   machine where the state is money. A workflow engine is a tree of
   state machines. Name the essence — the patterns follow from it.
2. **What forces shaped it?** Team size, age, performance constraints,
   backward compatibility obligations. These predict WHERE conventions
   will be strict vs relaxed.
3. **What would surprise me?** The interesting findings are never "they
   use interfaces" — it's "they have 566 dynamic config settings" or
   "zero TODOs in 3.8M of code." Surprise = insight.

## Prioritization: What to Dig Into

Not everything is interesting. Focus on patterns that:

- **Appear >50 times** — this is a conscious convention, not a one-off
- **Have a dedicated package** — someone thought it was important enough
  to abstract
- **Other projects solve differently** — reveals a real design tradeoff
- **Have a surprising name** — indicates the team had to invent
  vocabulary for a novel concept
- **Were introduced recently with many PR comments** — active design
  decisions with recorded rationale

Skip patterns that are:
- Standard library usage (unless the project wraps/extends it)
- Single-use internal helpers
- Generated code
- Exact copies of well-known open-source patterns without modification

## Phases

### Phase 1: Shape (5 min)

Clone to `CLONE_DIR/<name>` on `CLONE_HOST`. Full clone — never shallow.

Measure: size, files, commits, contributors, top-level dirs.

**What matters here:** The ratio of test files to production files.
The presence/absence of `internal/` vs flat structure. Whether there's
a single `pkg/` or many top-level packages. These reveal organizational
philosophy before you read a single line.

### Phase 2: What the Codebase Values (10 min)

Find the most-imported internal packages. The top 5 are the
project's definition of "foundational."

**Ask:** Why these? What do they share? Usually: logging, errors,
config, and one domain-specific abstraction that IS the project.
That domain-specific one is where the real conventions live.

See `references/commands.md` for grep patterns by language.

### Phase 3: Interface Contracts (10 min)

Find interfaces/behaviours/protocols — but don't list them all.

**Focus on:** Interfaces with >3 implementations (these are real
extension points). Interfaces in constructor signatures (these are
dependency injection boundaries). Interfaces that appear in BOTH
production and test code (these are the testability seams).

**Skip:** One-method interfaces (usually just for mocking). Interfaces
only used in one place (not yet conventions).

### Phase 4: Quality Fingerprint (5 min)

Measure: TODO count, FIXME count, HACK count, test count, mock count.

**What to notice:**
- TODO format reveals discipline: `TODO(owner):` = accountability,
  `TODO:` = aspirational, version-gated = systematic cleanup
- Zero TODOs in a large codebase means active cleanup culture
- High mock count relative to test count suggests heavy DI
- HACK count > 0 is honest; HACK count = 0 in a large project is
  suspicious (they probably use different words)

### Phase 5: Unique Patterns (15 min)

Look for infrastructure NOT in stdlib. Categories:

- **Concurrency:** goroutine handles, schedulers, shutdown primitives
- **Testing:** custom assertions, fake registries, golden file systems
- **Configuration:** dynamic config, feature flags, runtime toggles
- **Error handling:** custom error types, assertion systems, panic
  recovery patterns
- **Extension:** plugin registration, hook systems, middleware chains

**The test for uniqueness:** Would you be surprised to find this in
another project of similar size? If yes → convention worth documenting.
If no → standard practice, skip.

### Phase 6: Git Archaeology (20 min)

For each unique pattern found in Phase 5:

1. Find the commit that introduced it (`git log --diff-filter=A`)
2. Read the commit message — the "why" is usually there
3. Check if it replaced something (`git log -S "old_name"`)
4. Note the date and author — context for why shortcuts were taken

**The insight is always WHY, not WHAT.** A bare goroutine with a
TODO is uninteresting as a listing. A bare goroutine introduced during
a complex 20-file admission control feature, tagged by the author in
the same commit, that survived 3 years because nobody touched the
function — that's a lesson about how real codebases evolve.

See `references/commands.md` for git archaeology patterns.

**If the repo is on a forge without PR history** (self-hosted, mailing
list-based): Fall back to commit messages and CHANGELOG. The commit
body IS the PR description for these projects. Look for "Reviewed-by"
trailers and linked issues.

### Phase 7: PR Discussions (20 min)

Find PRs where key patterns were introduced. Read:
- The PR body (author's motivation)
- Review comments (the debate)
- The resolution

**What to extract from discussions:**
- What the author was defending (= where the real insight is)
- What reviewers pushed back on (= non-obvious tradeoffs)
- Whether it was "merge and iterate" vs "perfect before merge"
- Whether external validation was cited (benchmarks, user feedback)
- The migration strategy (big-bang vs gradual coexistence)

**The highest-value finding:** When a reviewer says "I wish we'd done
X instead" and the author explains why X doesn't work. That tradeoff
reasoning is pure expert knowledge.

### Phase 8: Synthesis

Produce output based on MODE. Push to `GIT_REMOTE`.

---

#### MODE: conventions

Output: `<project>-conventions` repo.

**`analysis.md`** — the full story:
1. Repo shape and organizational philosophy
2. Import hierarchy (what it values)
3. Key patterns with code examples + origin stories
4. PR discussion excerpts (attributed quotes)
5. Cross-ecosystem comparisons (prior art, independent invention)
6. Quality metrics in context (not bare numbers)

**`conventions.md`** — the reference:
For each unique pattern:
- Name and location in source
- Code example (real, not simplified)
- When to use / When NOT to use
- Origin (commit date, author, PR# if available)

**Tone:** Descriptive. "This project does X because Y."

---

#### MODE: patterns

Output: `<language>-patterns` or `<ecosystem>-patterns` repo.

**Synthesis question:** "What should a developer copy from this
codebase?" Filter everything through: "If I were writing new code
in this language/ecosystem, what rules does this source teach me?"

**`patterns/<topic>.md`** — one file per topic area:
- Error handling patterns
- Naming conventions
- Concurrency patterns
- Testing patterns
- Interface/protocol design
- Module organization

Each pattern entry:
- Name (short, linkable heading)
- Rule (one sentence: "Do X" or "Never Y")
- Example (real code from the source, not invented)
- Why (the force that makes this the right choice)
- When NOT to use (where the pattern breaks down)
- Source (file path + line, or commit where it's demonstrated)

**`smells.md`** — anti-patterns found in the source:
- What it looks like
- Why it exists (technical debt? deliberate tradeoff?)
- What to do instead

**Tone:** Prescriptive. "Write it this way because X."

**Key difference from conventions mode:** Skip governance, team
structure, TODO culture, and project history unless they directly
inform HOW to write code. Focus on patterns a user should copy.

---

End all output files with `<!-- PATTERN_COMPLETE -->` sentinel.

## Cross-Ecosystem Observations

Always note when a pattern exists in multiple repos. These
independent inventions reveal forces that transcend project context:
- Temporal goro.Handle (2021) ↔ CockroachDB stop.Handle (2025)
- Ecto zero TODOs (version-gated) ↔ Oban zero TODOs (2-week cleanup)
- Prometheus init() plugins ↔ Temporal init() plugins

## The 4 Categories of Pattern Breaks

When you find convention violations, classify:

1. **Ship behavior, fix plumbing later** — tagged with TODO same commit
2. **Better tooling exposed limitation** — observability, not correctness
3. **Removal cost > carrying cost** — zero-interest debt
4. **Context needs different pattern** — not actually a break

See `references/pattern-breaks.md` for real examples with git history.

## NEVER

- **NEVER analyze with a shallow clone** and assume full picture —
  archaeology requires full history
- **NEVER present patterns from one file as repo-wide conventions** —
  verify frequency across the codebase first
- **NEVER skip PR discussions** — code without context is just syntax;
  the discussion IS the insight
- **NEVER report bare numbers** ("738 TODOs") — always contextualize
  (per 1000 files, vs comparable projects, trending up/down)
- **NEVER confuse "the maintainer likes X" with "X is the right
  pattern"** — solo-maintained projects reflect one person's taste;
  team projects reflect negotiated conventions
- **NEVER present a pattern as "unique" without checking** if stdlib
  has it or if it's a well-known library pattern
- **NEVER list patterns without when-NOT-to-use** — that's where the
  expertise actually lives
- **NEVER quote PR discussions without attribution** — who said it
  matters (maintainer vs drive-by contributor)
- **NEVER analyze repos <1000 commits** — not enough history for
  meaningful archaeology
- **NEVER conflate language patterns with project conventions** — `go-
  patterns` is stdlib idiom; `temporal-conventions` is project choice

## Output Repos

Push to `GIT_REMOTE` under:
- **conventions mode:** `GIT_ORG/<project>-conventions`
- **patterns mode:** `GIT_ORG/<language>-patterns`

See `references/commands.md` for repo creation and push commands.

## Fallbacks

- **No PR discussions?** Use commit messages as primary source.
  Many projects (Linux, PostgreSQL) do all review in commit messages
  and mailing lists.
- **Repo too large to clone fully?** Clone shallow first, do Phase
  1-5, then `git fetch --unshallow` only if Phase 6-7 are needed.
- **Private repo / no forge API?** Skip Phase 7. Phase 6 (local git
  history) still works.
- **<3000 commits?** Reduce Phase 6-7 expectations. Younger projects
  have less archaeology to mine — focus on Phase 5 (unique patterns)
  and the project's README/docs for rationale.

## Execution Notes

- Clone on `CLONE_HOST` — needs disk space for full git history
- `gh api` or equivalent for forge PR lookups (requires authentication)
- One repo at a time for focused analysis
- Markdownlint all output before pushing