| name | description |
|---|---|
| codebase-analysis | Analyze open source repositories to extract conventions or patterns. Two modes: "conventions" (how a project works architecturally) and "patterns" (how to write idiomatic code in that language/ecosystem). Use when asked to "analyze a repo", "extract patterns from", "what conventions does X use", "how should I write X", "what's idiomatic", "add X to the analysis repos", or "how does X do Y architecturally". Do NOT use for: code review of specific PRs (use pr-review), security audits (use vuln-scout), or reading a single file for a quick answer. |
Codebase Analysis
Extract conventions or idiomatic patterns from open source repos.
Mode
Set MODE when invoking (or infer from request):
| Mode | Question | Output | Repo suffix |
|---|---|---|---|
| conventions | "How does this project work?" | Architecture, governance, unique infra | *-conventions |
| patterns | "How should I write code like this?" | Prescriptive rules for users | *-patterns |
Default: conventions unless the request says "idiomatic",
"how to write", "style guide", or "patterns for users".
Both modes share Phases 1-7. They diverge at Phase 8 (synthesis).
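The default rule above can be sketched as a small function. The trigger phrases are the ones quoted in this section; the matching strategy (case-insensitive substring search) and the name infer_mode are assumptions for illustration, not part of the skill.

```python
# Trigger phrases quoted from the Mode section. Substring matching is an
# assumed strategy; a real agent may infer the mode more loosely.
PATTERNS_TRIGGERS = ("idiomatic", "how to write", "style guide", "patterns for users")


def infer_mode(request: str) -> str:
    """Return "patterns" if the request contains a trigger phrase, else "conventions"."""
    text = request.lower()
    if any(trigger in text for trigger in PATTERNS_TRIGGERS):
        return "patterns"
    return "conventions"
```

An explicitly passed MODE should still take precedence over anything inferred this way.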
Configuration
Set these in your workspace context (TOOLS.md, AGENTS.md, or pass explicitly when invoking the skill):
| Parameter | Description | Example |
|---|---|---|
| CLONE_DIR | Directory to clone repos into | ~/src/analysis/ |
| CLONE_HOST | Machine with disk + git for cloning | forge, localhost |
| GIT_REMOTE | Where convention repos are pushed | https://git.example.com |
| GIT_ORG | Org/user for convention repos | myorg, username |
| GIT_TOKEN_PATH | Path to auth token for pushing | ~/.credentials/git-token |
Minimum required: CLONE_DIR and GIT_REMOTE. If others are omitted:
- CLONE_HOST defaults to localhost (current machine)
- GIT_ORG defaults to the authenticated user
- GIT_TOKEN_PATH uses the default git credential helper
Example in TOOLS.md:
## Codebase Analysis
- CLONE_DIR: ~/src/analysis/
- CLONE_HOST: my-dev-server (ssh user@host)
- GIT_REMOTE: https://git.example.com
- GIT_ORG: my-patterns
- GIT_TOKEN_PATH: ~/.credentials/git-token
If not explicitly provided, infer from workspace context (TOOLS.md, shell environment, or git remote configuration).
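One way to picture the required/optional split is a resolver that rejects missing required settings and fills in the stated defaults. This is a minimal sketch: resolve_config is a hypothetical name, and None is used as a stand-in for "defer to the authenticated user" and "defer to git's credential helper", since the skill does not define concrete values for those.

```python
REQUIRED = ("CLONE_DIR", "GIT_REMOTE")


def resolve_config(settings: dict) -> dict:
    """Validate required settings and apply the documented defaults."""
    missing = [key for key in REQUIRED if not settings.get(key)]
    if missing:
        raise ValueError(f"missing required setting(s): {', '.join(missing)}")

    resolved = dict(settings)  # copy so the caller's dict is not mutated
    resolved.setdefault("CLONE_HOST", "localhost")
    # None is a placeholder (assumption): the authenticated user and the
    # default credential helper are resolved by git/forge tooling later.
    resolved.setdefault("GIT_ORG", None)
    resolved.setdefault("GIT_TOKEN_PATH", None)
    return resolved
```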
Naming
- *-patterns = prescriptive (how users should write code)
- *-conventions = descriptive (how a specific codebase works)
A language can have both: go-patterns (write Go like this) AND
golang-conventions (how the Go team builds Go itself).
Thinking Framework
Before starting any analysis, ask:
- What is this project's essence? A trading system is a state machine where the state is money. A workflow engine is a tree of state machines. Name the essence — the patterns follow from it.
- What forces shaped it? Team size, age, performance constraints, backward compatibility obligations. These predict WHERE conventions will be strict vs relaxed.
- What would surprise me? The interesting finding is never "they use interfaces"; it is "they have 566 dynamic config settings" or "zero TODOs in 3.8M of code." Surprise = insight.
Prioritization: What to Dig Into
Not everything is interesting. Focus on patterns that:
- Appear >50 times — this is a conscious convention, not a one-off
- Have a dedicated package — someone thought it was important enough to abstract
- Other projects solve differently — reveals a real design tradeoff
- Have a surprising name — indicates the team had to invent vocabulary for a novel concept
- Were introduced recently with many PR comments — active design decisions with recorded rationale
Skip patterns that are:
- Standard library usage (unless the project wraps/extends it)
- Single-use internal helpers
- Generated code
- Exact copies of well-known open-source patterns without modification
Phases
Phase 1: Shape (5 min)
Clone to CLONE_DIR/<name> on CLONE_HOST. Full clone — never shallow.
Measure: size, files, commits, contributors, top-level dirs.
What matters here: The ratio of test files to production files.
The presence/absence of internal/ vs flat structure. Whether there's
a single pkg/ or many top-level packages. These reveal organizational
philosophy before you read a single line.
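The test-to-production ratio could be measured roughly as follows. The file-name heuristics and the list of source extensions are assumptions; each ecosystem marks test files differently, so adjust them per repo.

```python
from pathlib import PurePosixPath

# Hypothetical heuristics for spotting test files; not exhaustive.
TEST_SUFFIXES = ("_test.go", "_test.py", ".test.ts", ".test.js", "_test.exs")
TEST_PREFIXES = ("test_",)
SOURCE_EXTENSIONS = {".go", ".py", ".ts", ".js", ".ex", ".exs", ".rs", ".java"}


def is_test_file(path: str) -> bool:
    name = PurePosixPath(path).name
    return name.endswith(TEST_SUFFIXES) or name.startswith(TEST_PREFIXES)


def ratio_of_tests(paths):
    """Return (test_count, production_count, ratio), ignoring non-source files."""
    sources = [p for p in paths if PurePosixPath(p).suffix in SOURCE_EXTENSIONS]
    tests = sum(1 for p in sources if is_test_file(p))
    production = len(sources) - tests
    ratio = tests / production if production else float("inf")
    return tests, production, ratio
```

The path list can come from the output of git ls-files in the cloned repo, which skips untracked build output.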
Phase 2: What the Codebase Values (10 min)
Find the most-imported internal packages. The top 5 are the project's definition of "foundational."
Ask: Why these? What do they share? Usually: logging, errors, config, and one domain-specific abstraction that IS the project. That domain-specific one is where the real conventions live.
See references/commands.md for grep patterns by language.
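For a Go repository, the import ranking might be approximated like this. It is a sketch under assumptions: the module path is passed in by the caller, and a regex stands in for a real parser, so any string literal that starts with the module path is counted as an import.

```python
import re
from collections import Counter


def count_internal_imports(sources: dict, module_path: str) -> Counter:
    """Count how many files import each package under module_path.

    sources maps file name -> file contents.
    """
    pattern = re.compile(r'"(' + re.escape(module_path) + r'/[^"\s]+)"')
    counts = Counter()
    for text in sources.values():
        # Count each package once per file so a single file cannot
        # inflate the ranking.
        counts.update(set(pattern.findall(text)))
    return counts
```

Calling most_common(5) on the result gives the five candidates for "foundational" packages.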
Phase 3: Interface Contracts (10 min)
Find interfaces/behaviours/protocols — but don't list them all.
Focus on: Interfaces with >3 implementations (these are real extension points). Interfaces in constructor signatures (these are dependency injection boundaries). Interfaces that appear in BOTH production and test code (these are the testability seams).
Skip: One-method interfaces (usually just for mocking). Interfaces only used in one place (not yet conventions).
Phase 4: Quality Fingerprint (5 min)
Measure: TODO count, FIXME count, HACK count, test count, mock count.
What to notice:
- TODO format reveals discipline: TODO(owner): = accountability, TODO: = aspirational, version-gated = systematic cleanup
- Zero TODOs in a large codebase means active cleanup culture
- High mock count relative to test count suggests heavy DI
- HACK count > 0 is honest; HACK count = 0 in a large project is suspicious (they probably use different words)
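A rough way to collect these markers, separate owned TODOs from bare ones, and normalize per 1000 files so the numbers are comparable across repos. The regex and the function name fingerprint are illustrative assumptions; version-gated TODOs and project-specific synonyms for HACK would need their own patterns.

```python
import re
from collections import Counter

# Matches TODO / FIXME / HACK as whole words, with an optional "(owner)" part.
MARKER = re.compile(r"\b(TODO|FIXME|HACK)\b(\([^)]+\))?")


def fingerprint(sources: dict) -> dict:
    """sources maps file name -> file contents."""
    counts = Counter()
    for text in sources.values():
        for kind, owner in MARKER.findall(text):
            counts[kind] += 1
            if kind == "TODO" and owner:
                counts["TODO(owner)"] += 1  # subset of the TODO count

    files = len(sources)
    # Normalize by file count; bare totals say little without context.
    per_1000 = (
        {k: round(v * 1000 / files, 1) for k, v in counts.items()} if files else {}
    )
    return {"counts": dict(counts), "per_1000_files": per_1000}
```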
Phase 5: Unique Patterns (15 min)
Look for infrastructure NOT in stdlib. Categories:
- Concurrency: goroutine handles, schedulers, shutdown primitives
- Testing: custom assertions, fake registries, golden file systems
- Configuration: dynamic config, feature flags, runtime toggles
- Error handling: custom error types, assertion systems, panic recovery patterns
- Extension: plugin registration, hook systems, middleware chains
The test for uniqueness: Would you be surprised to find this in another project of similar size? If yes → convention worth documenting. If no → standard practice, skip.
Phase 6: Git Archaeology (20 min)
For each unique pattern found in Phase 5:
- Find the commit that introduced it (git log --diff-filter=A)
- Read the commit message; the "why" is usually there
- Check if it replaced something (git log -S "old_name")
- Note the date and author: context for why shortcuts were taken
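The two git commands in the steps above could be wrapped as follows. The --format and --date options are an assumed choice to surface hash, date, author, and subject in one line; the function names are hypothetical.

```python
import subprocess

LOG_FORMAT = "--format=%H %ad %an %s"  # hash, author date, author name, subject


def introduction_cmd(path: str) -> list:
    """git log restricted to commits that added `path` (--diff-filter=A)."""
    return ["git", "log", "--diff-filter=A", LOG_FORMAT, "--date=short", "--", path]


def replacement_cmd(old_name: str) -> list:
    """Pickaxe search: commits that changed how often `old_name` occurs."""
    return ["git", "log", "-S", old_name, LOG_FORMAT, "--date=short"]


def run(cmd: list, repo_dir: str) -> list:
    """Run a git command inside the cloned repo and return its output lines."""
    out = subprocess.run(cmd, cwd=repo_dir, check=True, capture_output=True, text=True)
    return out.stdout.splitlines()
```

The full commit body, where the "why" usually lives, can then be read with git show on the returned hash.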
The insight is always WHY, not WHAT. A bare goroutine with a TODO is uninteresting as a listing. A bare goroutine introduced during a complex 20-file admission control feature, tagged by the author in the same commit, that survived 3 years because nobody touched the function — that's a lesson about how real codebases evolve.
See references/commands.md for git archaeology patterns.
If the repo is on a forge without PR history (self-hosted, mailing list-based): Fall back to commit messages and CHANGELOG. The commit body IS the PR description for these projects. Look for "Reviewed-by" trailers and linked issues.
Phase 7: PR Discussions (20 min)
Find PRs where key patterns were introduced. Read:
- The PR body (author's motivation)
- Review comments (the debate)
- The resolution
What to extract from discussions:
- What the author was defending (= where the real insight is)
- What reviewers pushed back on (= non-obvious tradeoffs)
- Whether it was "merge and iterate" vs "perfect before merge"
- Whether external validation was cited (benchmarks, user feedback)
- The migration strategy (big-bang vs gradual coexistence)
The highest-value finding: When a reviewer says "I wish we'd done X instead" and the author explains why X doesn't work. That tradeoff reasoning is pure expert knowledge.
Phase 8: Synthesis
Produce output based on MODE. Push to GIT_REMOTE.
MODE: conventions
Output: <project>-conventions repo.
analysis.md — the full story:
- Repo shape and organizational philosophy
- Import hierarchy (what it values)
- Key patterns with code examples + origin stories
- PR discussion excerpts (attributed quotes)
- Cross-ecosystem comparisons (prior art, independent invention)
- Quality metrics in context (not bare numbers)
conventions.md — the reference:
For each unique pattern:
- Name and location in source
- Code example (real, not simplified)
- When to use / When NOT to use
- Origin (commit date, author, PR# if available)
Tone: Descriptive. "This project does X because Y."
MODE: patterns
Output: <language>-patterns or <ecosystem>-patterns repo.
Synthesis question: "What should a developer copy from this codebase?" Filter everything through: "If I were writing new code in this language/ecosystem, what rules does this source teach me?"
This is iterative, not one-shot. Keep extracting until you've identified ALL patterns the source demonstrates. A first pass finds the obvious ones. Second pass greps for variations and edge cases. Third pass finds the patterns that break. You're done when scanning the source no longer reveals new rules.
Process:
- Discovery pass — scan the source by topic area, identify every distinct pattern (aim for 15-30+ per topic in a large codebase)
- Deepening pass — for each pattern, grep for 5-10 real usages across the codebase. Note variations. Find the best example.
- Edge case pass — find where each pattern DOESN'T apply. Grep for violations — are they bugs, or legitimate exceptions?
- Cross-reference pass — which patterns interact? Which ones conflict? Document the decision framework for choosing between competing patterns.
Repeat until scanning the source yields no new patterns. A language stdlib should produce 50-200+ patterns across all topics.
Output structure — one file per topic:
patterns/<topic>.md — topics include:
- Error handling
- Naming conventions
- Concurrency patterns
- Testing patterns
- Interface/protocol design
- Module organization
- Documentation conventions
- Performance idioms
- Configuration patterns
- Extension/plugin patterns
Each pattern entry:
- Name (short, linkable heading)
- Rule (one sentence: "Do X" or "Never Y")
- Example (real code from the source, not invented)
- Why (the force that makes this the right choice)
- When to use (the trigger condition — what situation calls for this)
- When NOT to use (where the pattern breaks down)
- Source (hyperlinked to the commit or file on the forge, e.g. [src/io/io.go#L86](https://github.com/golang/go/blob/master/src/io/io.go#L86); use permalink format with commit SHA when possible for stability)
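A Source link pinned to a commit could be built like this. The sketch assumes GitHub's blob URL layout; other forges use different paths, and the helper name permalink is hypothetical.

```python
import re


def permalink(owner: str, repo: str, sha: str, path: str, line: int) -> str:
    """Return a markdown link to a file line, pinned to a full commit SHA."""
    # Require a full 40-character SHA: branch names such as "master" move,
    # so links built from them drift over time.
    if not re.fullmatch(r"[0-9a-f]{40}", sha):
        raise ValueError("expected a full 40-character commit SHA")
    url = f"https://github.com/{owner}/{repo}/blob/{sha}/{path}#L{line}"
    return f"[{path}#L{line}]({url})"
```

The SHA of the analyzed checkout is available from git rev-parse HEAD.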
smells.md — anti-patterns found in the source:
- What it looks like
- Why it exists (technical debt? deliberate tradeoff?)
- What to do instead
Tone: Prescriptive. "Write it this way because X."
Key difference from conventions mode: Skip governance, team structure, TODO culture, and project history unless they directly inform HOW to write code. Focus on patterns a user should copy.
Done criteria: You've scanned every major directory in the source. No new patterns emerge from further grep/read. Each topic file has 15+ patterns with real examples. Edge cases are documented.
End all output files with <!-- PATTERN_COMPLETE --> sentinel.
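Before pushing, a quick check along these lines can flag output files that lack the sentinel. Ignoring trailing whitespace after the sentinel is an assumption; the skill does not say whether a final newline is allowed.

```python
SENTINEL = "<!-- PATTERN_COMPLETE -->"


def missing_sentinel(files: dict) -> list:
    """Return names of files whose content does not end with the sentinel.

    files maps file name -> file contents.
    """
    return sorted(
        name for name, text in files.items()
        if not text.rstrip().endswith(SENTINEL)
    )
```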
Cross-Ecosystem Observations
Always note when a pattern exists in multiple repos. These independent inventions reveal forces that transcend project context:
- Temporal goro.Handle (2021) ↔ CockroachDB stop.Handle (2025)
- Ecto zero TODOs (version-gated) ↔ Oban zero TODOs (2-week cleanup)
- Prometheus init() plugins ↔ Temporal init() plugins
The 4 Categories of Pattern Breaks
When you find convention violations, classify:
- Ship behavior, fix plumbing later — tagged with TODO same commit
- Better tooling exposed limitation — observability, not correctness
- Removal cost > carrying cost — zero-interest debt
- Context needs different pattern — not actually a break
See references/pattern-breaks.md for real examples with git history.
NEVER
- NEVER analyze with a shallow clone and assume full picture — archaeology requires full history
- NEVER present patterns from one file as repo-wide conventions — verify frequency across the codebase first
- NEVER skip PR discussions — code without context is just syntax; the discussion IS the insight
- NEVER report bare numbers ("738 TODOs") — always contextualize (per 1000 files, vs comparable projects, trending up/down)
- NEVER confuse "the maintainer likes X" with "X is the right pattern" — solo-maintained projects reflect one person's taste; team projects reflect negotiated conventions
- NEVER present a pattern as "unique" without checking if stdlib has it or if it's a well-known library pattern
- NEVER list patterns without when-NOT-to-use — that's where the expertise actually lives
- NEVER quote PR discussions without attribution — who said it matters (maintainer vs drive-by contributor)
- NEVER analyze repos <1000 commits — not enough history for meaningful archaeology
- NEVER conflate language patterns with project conventions: go-patterns is stdlib idiom; temporal-conventions is project choice
Output Repos
Push to GIT_REMOTE under:
- conventions mode: GIT_ORG/<project>-conventions
- patterns mode: GIT_ORG/<language>-patterns
See references/commands.md for repo creation and push commands.
Fallbacks
- No PR discussions? Use commit messages as primary source. Many projects (Linux, PostgreSQL) do all review in commit messages and mailing lists.
- Repo too large to clone fully? Clone shallow first, do Phases 1-5, then git fetch --unshallow only if Phases 6-7 are needed.
- Private repo / no forge API? Skip Phase 7. Phase 6 (local git history) still works.
- <3000 commits? Reduce Phase 6-7 expectations. Younger projects have less archaeology to mine — focus on Phase 5 (unique patterns) and the project's README/docs for rationale.
Execution Notes
- Clone on CLONE_HOST: needs disk space for full git history
- gh api or equivalent for forge PR lookups (requires authentication)
- One repo at a time for focused analysis
- Markdownlint all output before pushing