---
name: codebase-analysis
description: >-
  Analyze open source repositories to extract conventions or patterns. Two
  modes: "conventions" (how a project works architecturally) and "patterns"
  (how to write idiomatic code in that language/ecosystem). Use when asked to
  "analyze a repo", "extract patterns from", "what conventions does X use",
  "how should I write X", "what's idiomatic", "add X to the analysis repos",
  or "how does X do Y architecturally". Do NOT use for: code review of
  specific PRs (use pr-review), security audits (use vuln-scout), or reading
  a single file for a quick answer.
---

# Codebase Analysis

Extract conventions or idiomatic patterns from open source repos.

## Mode

Set `MODE` when invoking (or infer from request):

| Mode | Question | Output | Repo suffix |
|------|----------|--------|-------------|
| `conventions` | "How does this project work?" | Architecture, governance, unique infra | `*-conventions` |
| `patterns` | "How should I write code like this?" | Prescriptive rules for users | `*-patterns` |

**Default:** `conventions` unless the request says "idiomatic", "how to write", "style guide", or "patterns for users".

**Both modes share Phases 1-7.** They diverge at Phase 8 (synthesis).

## Configuration

Set these in your workspace context (TOOLS.md, AGENTS.md, or pass explicitly when invoking the skill):

| Parameter | Description | Example |
|-----------|-------------|---------|
| `CLONE_DIR` | Directory to clone repos into | `~/src/analysis/` |
| `CLONE_HOST` | Machine with disk + git for cloning | `forge`, `localhost` |
| `GIT_REMOTE` | Where convention repos are pushed | `https://git.example.com` |
| `GIT_ORG` | Org/user for convention repos | `myorg`, `username` |
| `GIT_TOKEN_PATH` | Path to auth token for pushing | `~/.credentials/git-token` |

**Minimum required:** `CLONE_DIR` and `GIT_REMOTE`.
If others are omitted:

- `CLONE_HOST` defaults to localhost (current machine)
- `GIT_ORG` defaults to the authenticated user
- `GIT_TOKEN_PATH` uses default git credential helper

**Example in TOOLS.md:**

```markdown
## Codebase Analysis
- CLONE_DIR: ~/src/analysis/
- CLONE_HOST: my-dev-server (ssh user@host)
- GIT_REMOTE: https://git.example.com
- GIT_ORG: my-patterns
- GIT_TOKEN_PATH: ~/.credentials/git-token
```

If not explicitly provided, infer from workspace context (TOOLS.md, shell environment, or git remote configuration).

## Naming

- `*-patterns` = prescriptive (how users should write code)
- `*-conventions` = descriptive (how a specific codebase works)

A language can have both: `go-patterns` (write Go like this) AND `golang-conventions` (how the Go team builds Go itself).

## Thinking Framework

Before starting any analysis, ask:

1. **What is this project's essence?** A trading system is a state machine where the state is money. A workflow engine is a tree of state machines. Name the essence; the patterns follow from it.
2. **What forces shaped it?** Team size, age, performance constraints, backward compatibility obligations. These predict WHERE conventions will be strict vs relaxed.
3. **What would surprise me?** The interesting findings are never "they use interfaces"; they are "they have 566 dynamic config settings" or "zero TODOs in 3.8M of code." Surprise = insight.

## Prioritization: What to Dig Into

Not everything is interesting.
Focus on patterns that:

- **Appear >50 times**: this is a conscious convention, not a one-off
- **Have a dedicated package**: someone thought it was important enough to abstract
- **Other projects solve differently**: reveals a real design tradeoff
- **Have a surprising name**: indicates the team had to invent vocabulary for a novel concept
- **Were introduced recently with many PR comments**: active design decisions with recorded rationale

Skip patterns that are:

- Standard library usage (unless the project wraps/extends it)
- Single-use internal helpers
- Generated code
- Exact copies of well-known open-source patterns without modification

## Phases

### Phase 1: Shape (5 min)

Clone to `CLONE_DIR/` on `CLONE_HOST`. Full clone, never shallow. Measure: size, files, commits, contributors, top-level dirs.

**What matters here:** The ratio of test files to production files. The presence/absence of `internal/` vs flat structure. Whether there's a single `pkg/` or many top-level packages. These reveal organizational philosophy before you read a single line.

### Phase 2: What the Codebase Values (10 min)

Find the most-imported internal packages. The top 5 are the project's definition of "foundational."

**Ask:** Why these? What do they share? Usually: logging, errors, config, and one domain-specific abstraction that IS the project. That domain-specific one is where the real conventions live.

See `references/commands.md` for grep patterns by language.

### Phase 3: Interface Contracts (10 min)

Find interfaces/behaviours/protocols, but don't list them all.

**Focus on:** Interfaces with >3 implementations (these are real extension points). Interfaces in constructor signatures (these are dependency injection boundaries). Interfaces that appear in BOTH production and test code (these are the testability seams).

**Skip:** One-method interfaces (usually just for mocking). Interfaces only used in one place (not yet conventions).
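
As a rough illustration of the Phase 2 measurement above, the most-imported internal packages of a Go repository could be ranked with a sketch like the following. It assumes a Go codebase whose internal imports share a module prefix; the function name `top_internal_imports` and its arguments are hypothetical, and `references/commands.md` remains the source for per-language grep patterns.

```shell
#!/bin/sh
# Rank the most-imported internal packages in a Go source tree.
# Hypothetical helper, not part of this skill's reference commands.
#   $1 = repository root
#   $2 = module prefix that marks an import as internal (e.g. the go.mod module path)
#   $3 = how many packages to list (defaults to 5)
top_internal_imports() {
  root="$1"
  prefix="$2"
  limit="${3:-5}"

  # -h drops file names, -o prints only the matched import string.
  # The prefix is used as a regex, so "." matches any character (good enough for a sketch).
  grep -rhoE --include='*.go' "\"${prefix}/[^\"]+\"" "$root" \
    | tr -d '"' \
    | sort \
    | uniq -c \
    | sort -rn \
    | head -n "$limit"
}
```

Calling `top_internal_imports . github.com/OWNER/REPO` from the clone prints one `count package` line per package, most-imported first. The count is of import strings, so quoted paths inside comments are counted too; treat the ranking as a starting point for the "Why these?" question, not as a precise metric.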
### Phase 4: Quality Fingerprint (5 min)

Measure: TODO count, FIXME count, HACK count, test count, mock count.

**What to notice:**

- TODO format reveals discipline: `TODO(owner):` = accountability, `TODO:` = aspirational, version-gated = systematic cleanup
- Zero TODOs in a large codebase means active cleanup culture
- High mock count relative to test count suggests heavy DI
- HACK count > 0 is honest; HACK count = 0 in a large project is suspicious (they probably use different words)

### Phase 5: Unique Patterns (15 min)

Look for infrastructure NOT in stdlib. Categories:

- **Concurrency:** goroutine handles, schedulers, shutdown primitives
- **Testing:** custom assertions, fake registries, golden file systems
- **Configuration:** dynamic config, feature flags, runtime toggles
- **Error handling:** custom error types, assertion systems, panic recovery patterns
- **Extension:** plugin registration, hook systems, middleware chains

**The test for uniqueness:** Would you be surprised to find this in another project of similar size? If yes → convention worth documenting. If no → standard practice, skip.

### Phase 6: Git Archaeology (20 min)

For each unique pattern found in Phase 5:

1. Find the commit that introduced it (`git log --diff-filter=A`)
2. Read the commit message; the "why" is usually there
3. Check if it replaced something (`git log -S "old_name"`)
4. Note the date and author, as context for why shortcuts were taken

**The insight is always WHY, not WHAT.** A bare goroutine with a TODO is uninteresting as a listing. A bare goroutine introduced during a complex 20-file admission control feature, tagged by the author in the same commit, that survived 3 years because nobody touched the function: that's a lesson about how real codebases evolve.

See `references/commands.md` for git archaeology patterns.

**If the repo is on a forge without PR history** (self-hosted, mailing list-based): Fall back to commit messages and CHANGELOG.
The commit body IS the PR description for these projects. Look for "Reviewed-by" trailers and linked issues.

### Phase 7: PR Discussions (20 min)

Find PRs where key patterns were introduced. Read:

- The PR body (author's motivation)
- Review comments (the debate)
- The resolution

**What to extract from discussions:**

- What the author was defending (= where the real insight is)
- What reviewers pushed back on (= non-obvious tradeoffs)
- Whether it was "merge and iterate" vs "perfect before merge"
- Whether external validation was cited (benchmarks, user feedback)
- The migration strategy (big-bang vs gradual coexistence)

**The highest-value finding:** When a reviewer says "I wish we'd done X instead" and the author explains why X doesn't work. That tradeoff reasoning is pure expert knowledge.

### Phase 8: Synthesis

Produce output based on MODE. Push to `GIT_REMOTE`.

---

#### MODE: conventions

Output: `-conventions` repo.

**`analysis.md`** (the full story):

1. Repo shape and organizational philosophy
2. Import hierarchy (what it values)
3. Key patterns with code examples + origin stories
4. PR discussion excerpts (attributed quotes)
5. Cross-ecosystem comparisons (prior art, independent invention)
6. Quality metrics in context (not bare numbers)

**`conventions.md`** (the reference). For each unique pattern:

- Name and location in source
- Code example (real, not simplified)
- When to use / When NOT to use
- Origin (commit date, author, PR# if available)

**Tone:** Descriptive. "This project does X because Y."

---

#### MODE: patterns

Output: `-patterns` or `-patterns` repo.

**Synthesis question:** "What should a developer copy from this codebase?"

Filter everything through: "If I were writing new code in this language/ecosystem, what rules does this source teach me?"

**This is iterative, not one-shot.** The method produces quality through decomposition, not through asking one agent to "write a good file." Each step is bounded, mechanical, and verifiable.
### The Repeatable Method

**Step 1: Quantify** (5 min per topic)

For each topic area, run frequency grep commands to find patterns. The goal is COUNTS: how often does this pattern appear?

```
# Example: error handling in Go (the trailing comment on each line is the observed count)
grep -rn "^var Err" --include="*.go" | grep -v test | wc -l              # → 55
grep -rn "fmt.Errorf.*%w" --include="*.go" | grep -v test | wc -l        # → 115
grep -rn "errors\.Is\|errors\.As" --include="*.go" | wc -l               # → 212
```

Output: a numbered list of pattern names + counts. This IS the table of contents for that topic file.

**Step 2: Extract one** (5-10 min per pattern)

For EACH pattern from the list, in order:

1. Find the best example (grep → pick the clearest one)
2. Read 10 lines of surrounding context (understand WHY)
3. Write one pattern entry (40-80 lines, all required sections)
4. Move to the next pattern

The key constraint: **write one pattern entry completely before starting the next.** Never read all patterns then write all entries. This prevents context exhaustion and ensures each entry is complete.

**Step 3: Decision tree** (5 min per topic)

After all patterns are written, add a decision tree at the end. Format: "If X, use pattern A. If Y, use pattern B."

**Step 4: Cross-references** (2 min per topic)

Add `See also:` links to related topic files.
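
Step 1's output (a numbered list of pattern names and counts) could be produced mechanically with a sketch like the one below. The function name `quantify_patterns` and the tab-separated input format are hypothetical. It targets Go sources and excludes `*_test.go` files by name instead of piping through `grep -v test`, which assumes GNU grep's handling of combined `--include`/`--exclude` options.

```shell
#!/bin/sh
# Print a numbered "pattern name: count" list for one topic.
# Hypothetical helper; pattern names and regexes are supplied by the caller.
#   $1    = source root to scan
#   stdin = one "name<TAB>extended-regex" pair per line
quantify_patterns() {
  root="$1"
  n=0
  while IFS="$(printf '\t')" read -r name regex; do
    [ -n "$name" ] || continue
    n=$((n + 1))
    # Count matching lines in non-test Go files only.
    count=$(grep -rnE --include='*.go' --exclude='*_test.go' -e "$regex" "$root" \
      | wc -l | tr -d ' ')
    printf '%d. %s: %s\n' "$n" "$name" "$count"
  done
}
```

Feeding it a small file of `name<TAB>regex` lines per topic keeps the quantify step repeatable: the same input can be re-run after the source moves to confirm that no pattern count has changed enough to reorder the topic file.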
**Step 5: Hyperlinks** (mechanical, scriptable)

Convert all source references to clickable permalinks:

```bash
# Pin links to the current commit so they stay stable.
HEAD=$(git rev-parse HEAD)
BASE="https://github.com/OWNER/REPO/blob/${HEAD}"
# Rewrites `path/file.ext:LINE` references into markdown permalinks.
# Note: `sed -i` without a backup suffix is GNU sed syntax.
sed -i -E "s|\`(path/file\.ext):([0-9]+)\`|[\1#L\2](${BASE}/\1#L\2)|g" file.md
```

### Delegation Strategy

When using sub-agents:

- **DO:** One agent per pattern entry (bounded: read one, write one)
- **DO:** Give the agent the grep output as input (they don't discover, they deepen a known pattern)
- **DO:** Include one complete example entry in the prompt as the quality reference
- **DON'T:** Ask one agent to write an entire topic file
- **DON'T:** Ask agents to "discover patterns" (they'll find 5 obvious ones and miss 10 important ones)
- **DON'T:** Let agents choose their own structure (give them the template)

**Template for sub-agent task:**

```
Write pattern entry for: [PATTERN NAME]
Source repo: [REPO] at commit [SHA]
Access: [SSH command to get to the source]
Permalink base: [URL]
Grep that found this: [the grep command + sample output]
Reference quality: [paste ONE complete pattern entry as example]
Write to: [output path]
```

### Parallelism

- Step 1 (quantify): run for ALL topics in parallel (just grep)
- Step 2 (extract): run per-pattern entries in parallel (max 5)
- Steps 3-5: sequential (need all entries to exist first)

### Done Criteria

A topic file is done when:

- [ ] Every pattern from Step 1's list has an entry
- [ ] Each entry has ALL required sections (source, why, when to use with before/after, when NOT to use with over-application)
- [ ] Decision tree exists at the end
- [ ] All source refs are hyperlinked
- [ ] PATTERN_COMPLETE sentinel at EOF
- [ ] File is 500-1000 lines (if shorter, entries are too shallow)

A language is done when:

- [ ] 8-12 topic files exist
- [ ] Each topic has 10-15+ patterns
- [ ] Total is 5,000-10,000+ lines
- [ ] No grep scan reveals patterns not yet documented
- [ ] smells.md covers anti-patterns found in the source

**Output structure (one file per topic):**

`patterns/.md`: topics include (but aren't limited to):

- Error handling (sentinel errors, error types, wrapping, multi-error)
- Naming conventions (packages, types, functions, receivers)
- Concurrency patterns (goroutines, channels, mutexes, sync primitives)
- Testing patterns (table-driven, helpers, fixtures, benchmarks, examples)
- Interface/protocol design (size, composition, assertion, extension)
- Module/package organization (layout, internal/, visibility)
- Documentation conventions (godoc, deprecation, package-level)
- Performance idioms (pooling, preallocate, append, zero-alloc)
- Configuration patterns (functional options, config structs, defaults)
- Extension/plugin patterns (registration, middleware, hooks)
- Struct patterns (constructors, zero values, embedding, tags)
- API design (backwards compat, versioning, deprecation strategy)

**Start with 8–10 topics for a language stdlib; add more if the source shows distinct patterns in additional areas.** Each topic should map to a real problem domain that developers face.

**File naming:** Use lowercase, hyphenated names that describe the topic clearly: `error-handling.md`, `testing-advanced.md`, `api-conventions.md`, `concurrency.md`.

**Each pattern entry requires ALL of these sections:**

### `## N. Pattern Name`

Short, linkable heading (no generic names like "Pattern 1").

### `### Source:`

Hyperlinked to the exact file and line on the forge. Format: `[src/io/io.go#L86](https://github.com/golang/go/blob/COMMIT_SHA/src/io/io.go#L86)`

Use permalink format (commit SHA) for stability.

### Real source example

The actual code from the source, with file:line comments. Not simplified, not invented. This IS the evidence.

### `### Why`

The force that makes this the right choice. Not "because the stdlib does it"; explain the FORCE (testability, allocation cost, readability under diff, composability).
### `### When to Use`

**Triggers:** bullet list of specific situations that call for this.

**Example (before):** code showing the problem WITHOUT the pattern. This is critical. Readers must recognize their own bad code here.

**Example (after):** code showing the same problem WITH the pattern. The before/after pair is what makes patterns teachable.

### `### When NOT to Use`

**Don't use this when:** bullet list of boundary conditions.

**Over-application example:** code showing what happens when you use this pattern where it doesn't belong. This prevents cargo-culting.

**Better alternative:** what to do instead in those cases.

### `### Anti-pattern` (when relevant)

Explicit `DON'T:` block showing the wrong approach with a comment explaining why it's wrong, followed by `DO:` showing the fix.

---

**Each topic file ALSO needs:**

- **Summary/Decision Tree at the end:** "If X, use pattern A. If Y, use pattern B." Readers should be able to skip to the decision tree and find their situation.
- **Cross-references:** link to related patterns in other topic files, e.g., error-handling links to interfaces when discussing error types.

---

**Quality bar:** Each pattern entry should be 40–80 lines including code examples. A topic file with 10 patterns should be 500–900 lines. If entries are shorter than 40 lines, they're missing before/after examples or anti-patterns.

---

**`smells.md`** (anti-patterns found in the source):

- What it looks like (with real code)
- Why it exists (technical debt? deliberate tradeoff? historical?)
- What to do instead (with code showing the fix)
- How to detect it (grep pattern or linter rule)

**Tone:** Prescriptive. "Write it this way because X."

**Key difference from conventions mode:** Skip governance, team structure, TODO culture, and project history unless they directly inform HOW to write code. Focus on patterns a user should copy.

**Done criteria:** You've scanned every major directory in the source.
No new patterns emerge from further grep/read. Each topic file has 10–15+ patterns, each with before/after examples, anti-patterns, and decision guidance. Total output for a language stdlib should be 5,000–10,000+ lines across all topic files.

---

End all output files with `` sentinel.

## Cross-Ecosystem Observations

Always note when a pattern exists in multiple repos. These independent inventions reveal forces that transcend project context:

- Temporal goro.Handle (2021) ↔ CockroachDB stop.Handle (2025)
- Ecto zero TODOs (version-gated) ↔ Oban zero TODOs (2-week cleanup)
- Prometheus init() plugins ↔ Temporal init() plugins

## The 4 Categories of Pattern Breaks

When you find convention violations, classify:

1. **Ship behavior, fix plumbing later**: tagged with TODO same commit
2. **Better tooling exposed limitation**: observability, not correctness
3. **Removal cost > carrying cost**: zero-interest debt
4. **Context needs different pattern**: not actually a break

See `references/pattern-breaks.md` for real examples with git history.
## NEVER

- **NEVER analyze with a shallow clone** and assume you have the full picture: archaeology requires full history
- **NEVER present patterns from one file as repo-wide conventions**: verify frequency across the codebase first
- **NEVER skip PR discussions**: code without context is just syntax; the discussion IS the insight
- **NEVER report bare numbers** ("738 TODOs"): always contextualize (per 1000 files, vs comparable projects, trending up/down)
- **NEVER confuse "the maintainer likes X" with "X is the right pattern"**: solo-maintained projects reflect one person's taste; team projects reflect negotiated conventions
- **NEVER present a pattern as "unique" without checking** if stdlib has it or if it's a well-known library pattern
- **NEVER list patterns without when-NOT-to-use**: that's where the expertise actually lives
- **NEVER quote PR discussions without attribution**: who said it matters (maintainer vs drive-by contributor)
- **NEVER analyze repos <1000 commits**: not enough history for meaningful archaeology
- **NEVER conflate language patterns with project conventions**: `go-patterns` is stdlib idiom; `temporal-conventions` is project choice

## Output Repos

Push to `GIT_REMOTE` under:

- **conventions mode:** `GIT_ORG/-conventions`
- **patterns mode:** `GIT_ORG/-patterns`

See `references/commands.md` for repo creation and push commands.

## Fallbacks

- **No PR discussions?** Use commit messages as primary source. Many projects (Linux, PostgreSQL) do all review in commit messages and mailing lists.
- **Repo too large to clone fully?** Clone shallow first, do Phase 1-5, then `git fetch --unshallow` only if Phase 6-7 are needed.
- **Private repo / no forge API?** Skip Phase 7. Phase 6 (local git history) still works.
- **<3000 commits?** Reduce Phase 6-7 expectations. Younger projects have less archaeology to mine; focus on Phase 5 (unique patterns) and the project's README/docs for rationale.
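
The commit-count thresholds and the shallow-clone fallback above can be checked mechanically before starting Phase 6. The sketch below is one possible way to do that; the function names and the tier labels (`skip`, `reduced`, `full`) are hypothetical, while the thresholds come from this document. It assumes a git version that supports `git rev-parse --is-shallow-repository`.

```shell
#!/bin/sh
# Map a commit count to how much archaeology to attempt, using the
# thresholds stated in this document (<1000: do not analyze; <3000: reduce
# Phase 6-7 expectations). The tier names are hypothetical labels.
history_tier() {
  commits="$1"
  if [ "$commits" -lt 1000 ]; then
    echo "skip"
  elif [ "$commits" -lt 3000 ]; then
    echo "reduced"
  else
    echo "full"
  fi
}

# Succeeds when the clone in the current directory is shallow and still
# needs `git fetch --unshallow` before Phases 6-7.
needs_unshallow() {
  [ "$(git rev-parse --is-shallow-repository)" = "true" ]
}
```

Inside a clone, `history_tier "$(git rev-list --count HEAD)"` gives the tier. On a shallow clone the count only covers the fetched history, so run `needs_unshallow` first and treat the tier as meaningful only once full history is present.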
## Execution Notes

- Clone on `CLONE_HOST` (needs disk space for full git history)
- `gh api` or equivalent for forge PR lookups (requires authentication)
- One repo at a time for focused analysis
- Markdownlint all output before pushing
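
Alongside markdownlint, the mechanical parts of the Done Criteria for a topic file (line count and sentinel at EOF) could be checked before pushing with a sketch like this one. The function name `check_topic_file` is hypothetical, and because the exact sentinel markup is not spelled out here, the check only looks for the literal string `PATTERN_COMPLETE` on the last non-empty line.

```shell
#!/bin/sh
# Check the mechanical Done Criteria for one topic file:
# 500-1000 lines and a PATTERN_COMPLETE sentinel on the last non-empty line.
# Hypothetical helper; the sentinel format is an assumption.
check_topic_file() {
  file="$1"
  status=0

  lines=$(wc -l < "$file" | tr -d ' ')
  if [ "$lines" -lt 500 ] || [ "$lines" -gt 1000 ]; then
    echo "FAIL: $file has $lines lines (expected 500-1000)"
    status=1
  fi

  # Ignore trailing blank lines when looking for the sentinel.
  last=$(grep -v '^[[:space:]]*$' "$file" | tail -n 1)
  case "$last" in
    *PATTERN_COMPLETE*) ;;
    *)
      echo "FAIL: $file does not end with a PATTERN_COMPLETE sentinel"
      status=1
      ;;
  esac

  return "$status"
}
```

Running it over every file under `patterns/` and stopping on the first non-zero exit covers only two of the checklist items; the remaining criteria (required sections, decision tree, hyperlinked sources) still need a read-through.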