283 lines
11 KiB
Markdown
283 lines
11 KiB
Markdown
---
|
|
name: codebase-analysis
|
|
description: >-
|
|
Analyze open source repositories to extract architectural conventions,
|
|
patterns, and design decisions. Produces convention docs and pattern
|
|
files pushed to git repos. Use when asked to "analyze a repo",
|
|
"extract patterns from", "what conventions does X use", "add X to the
|
|
analysis repos", "how does X do Y architecturally", or when adding a
|
|
new project to the conventions collection. Triggers on: codebase
|
|
analysis, repo analysis, extract conventions, architectural analysis,
|
|
project conventions, "analyze this repo", "what patterns does X use",
|
|
"how is X structured". Do NOT use for: code review of specific PRs
|
|
(use pr-review), security audits (use vuln-scout), or reading a
|
|
single file for a quick answer.
|
|
---
|
|
|
|
# Codebase Analysis
|
|
|
|
Extract architectural conventions from open source repositories.
|
|
Output: convention repos (one per project analyzed).
|
|
|
|
## Configuration
|
|
|
|
Set these in your workspace context (TOOLS.md, AGENTS.md, or pass
|
|
explicitly when invoking the skill):
|
|
|
|
| Parameter | Description | Example |
|
|
|-----------|-------------|----------|
|
|
| `CLONE_DIR` | Directory to clone repos into | `~/src/analysis/` |
|
|
| `CLONE_HOST` | Machine with disk + git for cloning | `forge`, `localhost` |
|
|
| `GIT_REMOTE` | Where convention repos are pushed | `https://gitea.example.com` |
|
|
| `GIT_ORG` | Org/user for convention repos | `myorg`, `username` |
|
|
| `GIT_TOKEN_PATH` | Path to auth token for pushing | `~/.credentials/gitea-token` |
|
|
|
|
**Minimum required:** `CLONE_DIR` and `GIT_REMOTE`. If others are
|
|
omitted:
|
|
- `CLONE_HOST` defaults to localhost (current machine)
|
|
- `GIT_ORG` defaults to the authenticated user
|
|
- `GIT_TOKEN_PATH` uses default git credential helper
|
|
|
|
**Example in TOOLS.md:**
|
|
```markdown
|
|
## Codebase Analysis
|
|
- CLONE_DIR: ~/src/analysis/
|
|
- CLONE_HOST: my-dev-server (ssh user@host)
|
|
- GIT_REMOTE: https://github.com
|
|
- GIT_ORG: my-patterns
|
|
- GIT_TOKEN_PATH: ~/.credentials/github-token
|
|
```
|
|
|
|
If not explicitly provided, infer from workspace context (TOOLS.md,
|
|
shell environment, or git remote configuration).
|
|
|
|
## Naming
|
|
|
|
- `*-patterns` = language-level (how the language wants you to write)
|
|
- `*-conventions` = project-specific (how a codebase chose to do it)
|
|
|
|
## Thinking Framework
|
|
|
|
Before starting any analysis, ask:
|
|
|
|
1. **What is this project's essence?** A trading system is a state
|
|
machine where the state is money. A workflow engine is a tree of
|
|
state machines. Name the essence — the patterns follow from it.
|
|
2. **What forces shaped it?** Team size, age, performance constraints,
|
|
backward compatibility obligations. These predict WHERE conventions
|
|
will be strict vs relaxed.
|
|
3. **What would surprise me?** The interesting findings are never "they
|
|
use interfaces" — it's "they have 566 dynamic config settings" or
|
|
"zero TODOs in 3.8M of code." Surprise = insight.
|
|
|
|
## Prioritization: What to Dig Into
|
|
|
|
Not everything is interesting. Focus on patterns that:
|
|
|
|
- **Appear >50 times** — this is a conscious convention, not a one-off
|
|
- **Have a dedicated package** — someone thought it was important enough
|
|
to abstract
|
|
- **Other projects solve differently** — reveals a real design tradeoff
|
|
- **Have a surprising name** — indicates the team had to invent
|
|
vocabulary for a novel concept
|
|
- **Were introduced recently with many PR comments** — active design
|
|
decisions with recorded rationale
|
|
|
|
Skip patterns that are:
|
|
- Standard library usage (unless the project wraps/extends it)
|
|
- Single-use internal helpers
|
|
- Generated code
|
|
- Exact copies of well-known open-source patterns without modification
|
|
|
|
## Phases
|
|
|
|
### Phase 1: Shape (5 min)
|
|
|
|
Clone to `CLONE_DIR/<name>` on `CLONE_HOST`. Full clone — never shallow.
|
|
|
|
Measure: size, files, commits, contributors, top-level dirs.
|
|
|
|
**What matters here:** The ratio of test files to production files.
|
|
The presence/absence of `internal/` vs flat structure. Whether there's
|
|
a single `pkg/` or many top-level packages. These reveal organizational
|
|
philosophy before you read a single line.
|
|
|
|
### Phase 2: What the Codebase Values (10 min)
|
|
|
|
Find the most-imported internal packages. The top 5 are the
|
|
project's definition of "foundational."
|
|
|
|
**Ask:** Why these? What do they share? Usually: logging, errors,
|
|
config, and one domain-specific abstraction that IS the project.
|
|
That domain-specific one is where the real conventions live.
|
|
|
|
See `references/commands.md` for grep patterns by language.
|
|
|
|
### Phase 3: Interface Contracts (10 min)
|
|
|
|
Find interfaces/behaviours/protocols — but don't list them all.
|
|
|
|
**Focus on:** Interfaces with >3 implementations (these are real
|
|
extension points). Interfaces in constructor signatures (these are
|
|
dependency injection boundaries). Interfaces that appear in BOTH
|
|
production and test code (these are the testability seams).
|
|
|
|
**Skip:** One-method interfaces (usually just for mocking). Interfaces
|
|
only used in one place (not yet conventions).
|
|
|
|
### Phase 4: Quality Fingerprint (5 min)
|
|
|
|
Measure: TODO count, FIXME count, HACK count, test count, mock count.
|
|
|
|
**What to notice:**
|
|
- TODO format reveals discipline: `TODO(owner):` = accountability,
|
|
`TODO:` = aspirational, version-gated = systematic cleanup
|
|
- Zero TODOs in a large codebase means active cleanup culture
|
|
- High mock count relative to test count suggests heavy DI
|
|
- HACK count > 0 is honest; HACK count = 0 in a large project is
|
|
suspicious (they probably use different words)
|
|
|
|
### Phase 5: Unique Patterns (15 min)
|
|
|
|
Look for infrastructure NOT in stdlib. Categories:
|
|
|
|
- **Concurrency:** goroutine handles, schedulers, shutdown primitives
|
|
- **Testing:** custom assertions, fake registries, golden file systems
|
|
- **Configuration:** dynamic config, feature flags, runtime toggles
|
|
- **Error handling:** custom error types, assertion systems, panic
|
|
recovery patterns
|
|
- **Extension:** plugin registration, hook systems, middleware chains
|
|
|
|
**The test for uniqueness:** Would you be surprised to find this in
|
|
another project of similar size? If yes → convention worth documenting.
|
|
If no → standard practice, skip.
|
|
|
|
### Phase 6: Git Archaeology (20 min)
|
|
|
|
For each unique pattern found in Phase 5:
|
|
|
|
1. Find the commit that introduced it (`git log --diff-filter=A`)
|
|
2. Read the commit message — the "why" is usually there
|
|
3. Check if it replaced something (`git log -S "old_name"`)
|
|
4. Note the date and author — context for why shortcuts were taken
|
|
|
|
**The insight is always WHY, not WHAT.** A bare goroutine with a
|
|
TODO is uninteresting as a listing. A bare goroutine introduced during
|
|
a complex 20-file admission control feature, tagged by the author in
|
|
the same commit, that survived 3 years because nobody touched the
|
|
function — that's a lesson about how real codebases evolve.
|
|
|
|
See `references/commands.md` for git archaeology patterns.
|
|
|
|
**If the repo is on a forge without PR history** (self-hosted, mailing
|
|
list-based): Fall back to commit messages and CHANGELOG. The commit
|
|
body IS the PR description for these projects. Look for "Reviewed-by"
|
|
trailers and linked issues.
|
|
|
|
### Phase 7: PR Discussions (20 min)
|
|
|
|
Find PRs where key patterns were introduced. Read:
|
|
- The PR body (author's motivation)
|
|
- Review comments (the debate)
|
|
- The resolution
|
|
|
|
**What to extract from discussions:**
|
|
- What the author was defending (= where the real insight is)
|
|
- What reviewers pushed back on (= non-obvious tradeoffs)
|
|
- Whether it was "merge and iterate" vs "perfect before merge"
|
|
- Whether external validation was cited (benchmarks, user feedback)
|
|
- The migration strategy (big-bang vs gradual coexistence)
|
|
|
|
**The highest-value finding:** When a reviewer says "I wish we'd done
|
|
X instead" and the author explains why X doesn't work. That tradeoff
|
|
reasoning is pure expert knowledge.
|
|
|
|
### Phase 8: Synthesis
|
|
|
|
Produce two files. Push to Gitea.
|
|
|
|
**`analysis.md`** — the full story:
|
|
1. Repo shape and organizational philosophy
|
|
2. Import hierarchy (what it values)
|
|
3. Key patterns with code examples + origin stories
|
|
4. PR discussion excerpts (attributed quotes)
|
|
5. Cross-ecosystem comparisons (prior art, independent invention)
|
|
6. Quality metrics in context (not bare numbers)
|
|
|
|
**`conventions.md`** — the reference:
|
|
For each pattern:
|
|
- Name and location in source
|
|
- Code example (real, not simplified)
|
|
- When to use / When NOT to use
|
|
- Origin (commit date, author, PR# if available)
|
|
|
|
End both files with `<!-- PATTERN_COMPLETE -->` sentinel.
|
|
|
|
## Cross-Ecosystem Observations
|
|
|
|
Always note when a pattern exists in multiple repos. These
|
|
independent inventions reveal forces that transcend project context:
|
|
- Temporal goro.Handle (2021) ↔ CockroachDB stop.Handle (2025)
|
|
- Ecto zero TODOs (version-gated) ↔ Oban zero TODOs (2-week cleanup)
|
|
- Prometheus init() plugins ↔ Temporal init() plugins
|
|
|
|
## The 4 Categories of Pattern Breaks
|
|
|
|
When you find convention violations, classify:
|
|
|
|
1. **Ship behavior, fix plumbing later** — tagged with TODO same commit
|
|
2. **Better tooling exposed limitation** — observability, not correctness
|
|
3. **Removal cost > carrying cost** — zero-interest debt
|
|
4. **Context needs different pattern** — not actually a break
|
|
|
|
See `references/pattern-breaks.md` for real examples with git history.
|
|
|
|
## NEVER
|
|
|
|
- **NEVER analyze with a shallow clone** and assume full picture —
|
|
archaeology requires full history
|
|
- **NEVER present patterns from one file as repo-wide conventions** —
|
|
verify frequency across the codebase first
|
|
- **NEVER skip PR discussions** — code without context is just syntax;
|
|
the discussion IS the insight
|
|
- **NEVER report bare numbers** ("738 TODOs") — always contextualize
|
|
(per 1000 files, vs comparable projects, trending up/down)
|
|
- **NEVER confuse "the maintainer likes X" with "X is the right
|
|
pattern"** — solo-maintained projects reflect one person's taste;
|
|
team projects reflect negotiated conventions
|
|
- **NEVER present a pattern as "unique" without checking** if stdlib
|
|
has it or if it's a well-known library pattern
|
|
- **NEVER list patterns without when-NOT-to-use** — that's where the
|
|
expertise actually lives
|
|
- **NEVER quote PR discussions without attribution** — who said it
|
|
matters (maintainer vs drive-by contributor)
|
|
- **NEVER analyze repos <1000 commits** — not enough history for
|
|
meaningful archaeology
|
|
- **NEVER conflate language patterns with project conventions** — `go-
|
|
patterns` is stdlib idiom; `temporal-conventions` is project choice
|
|
|
|
## Output Repos
|
|
|
|
Push to `GIT_REMOTE` under `GIT_ORG/<project>-conventions`. See
|
|
`references/commands.md` for repo creation and push commands.
|
|
|
|
## Fallbacks
|
|
|
|
- **No PR discussions?** Use commit messages as primary source.
|
|
Many projects (Linux, PostgreSQL) do all review in commit messages
|
|
and mailing lists.
|
|
- **Repo too large to clone fully?** Clone shallow first, do Phase
|
|
1-5, then `git fetch --unshallow` only if Phase 6-7 are needed.
|
|
- **Private repo / no GitHub API?** Skip Phase 7. Phase 6 (local git
|
|
history) still works.
|
|
- **<3000 commits?** Reduce Phase 6-7 expectations. Younger projects
|
|
have less archaeology to mine — focus on Phase 5 (unique patterns)
|
|
and the project's README/docs for rationale.
|
|
|
|
## Execution Notes
|
|
|
|
- Clone on `CLONE_HOST` — needs disk space for full git history
|
|
- `gh api` for GitHub PR lookups (requires authenticated `gh` CLI)
|
|
- One repo at a time for focused analysis
|
|
- Markdownlint all output before pushing
|