rodin/codebase-analysis

Fork 0

Files

T

Rodin 04734c2175 fix: source references hyperlinked to forge (commit SHA + line)

2026-04-30 13:35:17 -07:00

14 KiB

Raw Blame History

name, description

name	description
codebase-analysis	Analyze open source repositories to extract conventions or patterns. Two modes: "conventions" (how a project works architecturally) and "patterns" (how to write idiomatic code in that language/ecosystem). Use when asked to "analyze a repo", "extract patterns from", "what conventions does X use", "how should I write X", "what's idiomatic", "add X to the analysis repos", or "how does X do Y architecturally". Do NOT use for: code review of specific PRs (use pr-review), security audits (use vuln-scout), or reading a single file for a quick answer.

name

description

codebase-analysis

Analyze open source repositories to extract conventions or patterns. Two modes: "conventions" (how a project works architecturally) and "patterns" (how to write idiomatic code in that language/ecosystem). Use when asked to "analyze a repo", "extract patterns from", "what conventions does X use", "how should I write X", "what's idiomatic", "add X to the analysis repos", or "how does X do Y architecturally". Do NOT use for: code review of specific PRs (use pr-review), security audits (use vuln-scout), or reading a single file for a quick answer.

Codebase Analysis

Extract conventions or idiomatic patterns from open source repos.

Mode

Set MODE when invoking (or infer from request):

Mode	Question	Output	Repo suffix
`conventions`	"How does this project work?"	Architecture, governance, unique infra	`*-conventions`
`patterns`	"How should I write code like this?"	Prescriptive rules for users	`*-patterns`

Default: conventions unless the request says "idiomatic", "how to write", "style guide", or "patterns for users".

Both modes share Phases 1-7. They diverge at Phase 8 (synthesis).

Configuration

Set these in your workspace context (TOOLS.md, AGENTS.md, or pass explicitly when invoking the skill):

Parameter	Description	Example
`CLONE_DIR`	Directory to clone repos into	`~/src/analysis/`
`CLONE_HOST`	Machine with disk + git for cloning	`forge`, `localhost`
`GIT_REMOTE`	Where convention repos are pushed	`https://git.example.com`
`GIT_ORG`	Org/user for convention repos	`myorg`, `username`
`GIT_TOKEN_PATH`	Path to auth token for pushing	`~/.credentials/git-token`

Minimum required: CLONE_DIR and GIT_REMOTE. If others are omitted:

CLONE_HOST defaults to localhost (current machine)
GIT_ORG defaults to the authenticated user
GIT_TOKEN_PATH uses default git credential helper

Example in TOOLS.md:

## Codebase Analysis
- CLONE_DIR: ~/src/analysis/
- CLONE_HOST: my-dev-server (ssh user@host)
- GIT_REMOTE: https://git.example.com
- GIT_ORG: my-patterns
- GIT_TOKEN_PATH: ~/.credentials/git-token

If not explicitly provided, infer from workspace context (TOOLS.md, shell environment, or git remote configuration).

Naming

*-patterns = prescriptive (how users should write code)
*-conventions = descriptive (how a specific codebase works)

A language can have both: go-patterns (write Go like this) AND golang-conventions (how the Go team builds Go itself).

Thinking Framework

Before starting any analysis, ask:

What is this project's essence? A trading system is a state machine where the state is money. A workflow engine is a tree of state machines. Name the essence — the patterns follow from it.
What forces shaped it? Team size, age, performance constraints, backward compatibility obligations. These predict WHERE conventions will be strict vs relaxed.
What would surprise me? The interesting findings are never "they use interfaces" — it's "they have 566 dynamic config settings" or "zero TODOs in 3.8M of code." Surprise = insight.

Prioritization: What to Dig Into

Not everything is interesting. Focus on patterns that:

Appear >50 times — this is a conscious convention, not a one-off
Have a dedicated package — someone thought it was important enough to abstract
Other projects solve differently — reveals a real design tradeoff
Have a surprising name — indicates the team had to invent vocabulary for a novel concept
Were introduced recently with many PR comments — active design decisions with recorded rationale

Skip patterns that are:

Standard library usage (unless the project wraps/extends it)
Single-use internal helpers
Generated code
Exact copies of well-known open-source patterns without modification

Phases

Phase 1: Shape (5 min)

Clone to CLONE_DIR/<name> on CLONE_HOST. Full clone — never shallow.

Measure: size, files, commits, contributors, top-level dirs.

What matters here: The ratio of test files to production files. The presence/absence of internal/ vs flat structure. Whether there's a single pkg/ or many top-level packages. These reveal organizational philosophy before you read a single line.

Phase 2: What the Codebase Values (10 min)

Find the most-imported internal packages. The top 5 are the project's definition of "foundational."

Ask: Why these? What do they share? Usually: logging, errors, config, and one domain-specific abstraction that IS the project. That domain-specific one is where the real conventions live.

See references/commands.md for grep patterns by language.

Phase 3: Interface Contracts (10 min)

Find interfaces/behaviours/protocols — but don't list them all.

Focus on: Interfaces with >3 implementations (these are real extension points). Interfaces in constructor signatures (these are dependency injection boundaries). Interfaces that appear in BOTH production and test code (these are the testability seams).

Skip: One-method interfaces (usually just for mocking). Interfaces only used in one place (not yet conventions).

Phase 4: Quality Fingerprint (5 min)

Measure: TODO count, FIXME count, HACK count, test count, mock count.

What to notice:

TODO format reveals discipline: TODO(owner): = accountability, TODO: = aspirational, version-gated = systematic cleanup
Zero TODOs in a large codebase means active cleanup culture
High mock count relative to test count suggests heavy DI
HACK count > 0 is honest; HACK count = 0 in a large project is suspicious (they probably use different words)

Phase 5: Unique Patterns (15 min)

Look for infrastructure NOT in stdlib. Categories:

Concurrency: goroutine handles, schedulers, shutdown primitives
Testing: custom assertions, fake registries, golden file systems
Configuration: dynamic config, feature flags, runtime toggles
Error handling: custom error types, assertion systems, panic recovery patterns
Extension: plugin registration, hook systems, middleware chains

The test for uniqueness: Would you be surprised to find this in another project of similar size? If yes → convention worth documenting. If no → standard practice, skip.

Phase 6: Git Archaeology (20 min)

For each unique pattern found in Phase 5:

Find the commit that introduced it (git log --diff-filter=A)
Read the commit message — the "why" is usually there
Check if it replaced something (git log -S "old_name")
Note the date and author — context for why shortcuts were taken

The insight is always WHY, not WHAT. A bare goroutine with a TODO is uninteresting as a listing. A bare goroutine introduced during a complex 20-file admission control feature, tagged by the author in the same commit, that survived 3 years because nobody touched the function — that's a lesson about how real codebases evolve.

See references/commands.md for git archaeology patterns.

If the repo is on a forge without PR history (self-hosted, mailing list-based): Fall back to commit messages and CHANGELOG. The commit body IS the PR description for these projects. Look for "Reviewed-by" trailers and linked issues.

Phase 7: PR Discussions (20 min)

Find PRs where key patterns were introduced. Read:

The PR body (author's motivation)
Review comments (the debate)
The resolution

What to extract from discussions:

What the author was defending (= where the real insight is)
What reviewers pushed back on (= non-obvious tradeoffs)
Whether it was "merge and iterate" vs "perfect before merge"
Whether external validation was cited (benchmarks, user feedback)
The migration strategy (big-bang vs gradual coexistence)

The highest-value finding: When a reviewer says "I wish we'd done X instead" and the author explains why X doesn't work. That tradeoff reasoning is pure expert knowledge.

Phase 8: Synthesis

Produce output based on MODE. Push to GIT_REMOTE.

MODE: conventions

Output: <project>-conventions repo.

analysis.md — the full story:

Repo shape and organizational philosophy
Import hierarchy (what it values)
Key patterns with code examples + origin stories
PR discussion excerpts (attributed quotes)
Cross-ecosystem comparisons (prior art, independent invention)
Quality metrics in context (not bare numbers)

conventions.md — the reference: For each unique pattern:

Name and location in source
Code example (real, not simplified)
When to use / When NOT to use
Origin (commit date, author, PR# if available)

Tone: Descriptive. "This project does X because Y."

MODE: patterns

Output: <language>-patterns or <ecosystem>-patterns repo.

Synthesis question: "What should a developer copy from this codebase?" Filter everything through: "If I were writing new code in this language/ecosystem, what rules does this source teach me?"

This is iterative, not one-shot. Keep extracting until you've identified ALL patterns the source demonstrates. A first pass finds the obvious ones. Second pass greps for variations and edge cases. Third pass finds the patterns that break. You're done when scanning the source no longer reveals new rules.

Process:

Discovery pass — scan the source by topic area, identify every distinct pattern (aim for 15-30+ per topic in a large codebase)
Deepening pass — for each pattern, grep for 5-10 real usages across the codebase. Note variations. Find the best example.
Edge case pass — find where each pattern DOESN'T apply. Grep for violations — are they bugs, or legitimate exceptions?
Cross-reference pass — which patterns interact? Which ones conflict? Document the decision framework for choosing between competing patterns.

Repeat until scanning the source yields no new patterns. A language stdlib should produce 50-200+ patterns across all topics.

Output structure — one file per topic:

patterns/<topic>.md — topics include:

Error handling
Naming conventions
Concurrency patterns
Testing patterns
Interface/protocol design
Module organization
Documentation conventions
Performance idioms
Configuration patterns
Extension/plugin patterns

Each pattern entry:

Name (short, linkable heading)
Rule (one sentence: "Do X" or "Never Y")
Example (real code from the source, not invented)
Why (the force that makes this the right choice)
When to use (the trigger condition — what situation calls for this)
When NOT to use (where the pattern breaks down)
Source (hyperlinked to the commit or file on the forge, e.g. [src/io/io.go#L86](https://github.com/golang/go/blob/master/src/io/io.go#L86). Use permalink format with commit SHA when possible for stability.)

smells.md — anti-patterns found in the source:

What it looks like
Why it exists (technical debt? deliberate tradeoff?)
What to do instead

Tone: Prescriptive. "Write it this way because X."

Key difference from conventions mode: Skip governance, team structure, TODO culture, and project history unless they directly inform HOW to write code. Focus on patterns a user should copy.

Done criteria: You've scanned every major directory in the source. No new patterns emerge from further grep/read. Each topic file has 15+ patterns with real examples. Edge cases are documented.

End all output files with  sentinel.

Cross-Ecosystem Observations

Always note when a pattern exists in multiple repos. These independent inventions reveal forces that transcend project context:

Temporal goro.Handle (2021) ↔ CockroachDB stop.Handle (2025)
Ecto zero TODOs (version-gated) ↔ Oban zero TODOs (2-week cleanup)
Prometheus init() plugins ↔ Temporal init() plugins

The 4 Categories of Pattern Breaks

When you find convention violations, classify:

Ship behavior, fix plumbing later — tagged with TODO same commit
Better tooling exposed limitation — observability, not correctness
Removal cost > carrying cost — zero-interest debt
Context needs different pattern — not actually a break

See references/pattern-breaks.md for real examples with git history.

NEVER

NEVER analyze with a shallow clone and assume full picture — archaeology requires full history
NEVER present patterns from one file as repo-wide conventions — verify frequency across the codebase first
NEVER skip PR discussions — code without context is just syntax; the discussion IS the insight
NEVER report bare numbers ("738 TODOs") — always contextualize (per 1000 files, vs comparable projects, trending up/down)
NEVER confuse "the maintainer likes X" with "X is the right pattern" — solo-maintained projects reflect one person's taste; team projects reflect negotiated conventions
NEVER present a pattern as "unique" without checking if stdlib has it or if it's a well-known library pattern
NEVER list patterns without when-NOT-to-use — that's where the expertise actually lives
NEVER quote PR discussions without attribution — who said it matters (maintainer vs drive-by contributor)
NEVER analyze repos <1000 commits — not enough history for meaningful archaeology
NEVER conflate language patterns with project conventions — go- patterns is stdlib idiom; temporal-conventions is project choice

Output Repos

Push to GIT_REMOTE under:

conventions mode: GIT_ORG/<project>-conventions
patterns mode: GIT_ORG/<language>-patterns

See references/commands.md for repo creation and push commands.

Fallbacks

No PR discussions? Use commit messages as primary source. Many projects (Linux, PostgreSQL) do all review in commit messages and mailing lists.
Repo too large to clone fully? Clone shallow first, do Phase 1-5, then git fetch --unshallow only if Phase 6-7 are needed.
Private repo / no forge API? Skip Phase 7. Phase 6 (local git history) still works.
<3000 commits? Reduce Phase 6-7 expectations. Younger projects have less archaeology to mine — focus on Phase 5 (unique patterns) and the project's README/docs for rationale.

Execution Notes

Clone on CLONE_HOST — needs disk space for full git history
gh api or equivalent for forge PR lookups (requires authentication)
One repo at a time for focused analysis
Markdownlint all output before pushing

14 KiB Raw Blame History