feat: codebase-analysis skill

8-phase methodology for extracting architectural conventions from
open source repositories:

1. Clone & Shape — measure repo dimensions
2. Import Hierarchy — find foundational packages
3. Interface Contracts — key abstractions
4. Error Handling & Quality Markers — TODOs, style
5. Unique Patterns — project-specific infrastructure
6. Git Archaeology — trace WHY decisions were made
7. PR Discussions — read the actual debates
8. Synthesis — produce analysis.md + conventions.md

Includes pattern-breaks reference (4 categories of why
conventions are violated, with real examples from CockroachDB,
Prometheus, Temporal, Ecto, Oban).

Output: Gitea repos named <project>-conventions.
This commit is contained in:
Rodin
2026-04-30 11:49:40 -07:00
parent 6bc78f7404
commit 38a9d66072
2 changed files with 290 additions and 0 deletions
+193
View File
@@ -0,0 +1,193 @@
---
name: codebase-analysis
description: >-
Analyze open source repositories to extract architectural conventions,
patterns, and design decisions. Produces convention docs and pattern
files suitable for Gitea repos. Use when asked to "analyze a repo",
"extract patterns from", "what conventions does X use", "add X to the
analysis repos", "how does X do Y architecturally", or when adding a
new project to the conventions collection. Triggers on: codebase
analysis, repo analysis, extract conventions, architectural analysis,
project conventions, "analyze this repo", "what patterns does X use",
"how is X structured". Do NOT use for: code review of specific PRs
(use pr-review), security audits (use vuln-scout), or reading a
single file for a quick answer.
---
# Codebase Analysis
Extract architectural conventions from open source repositories through
a structured multi-phase process. Output goes to Gitea convention repos.
## Naming Convention
- **Language patterns** (`go-patterns`, `elixir-patterns`): stdlib/language
idioms verified from source
- **Project conventions** (`<project>-conventions`): how a specific codebase
does things — e.g., `temporal-conventions`, `cockroachdb-conventions`
## Phases
Execute in order. Each phase builds on the previous.
### Phase 1: Clone & Shape
Clone the repo on forge (`~/src/analysis/<name>`). Measure:
```
- Total size, file count, commit count, contributor count
- Top-level directory structure
- Language breakdown (if polyglot)
```
### Phase 2: Import Hierarchy
Find the most-depended-upon packages. This reveals what the codebase
considers foundational:
```bash
# Go: grep imports, extract internal packages, count
grep -rh '"<module>/' --include="*.go" | sed ... | sort | uniq -c | sort -rn
# Elixir: grep aliases/imports
grep -rh "alias\|import\|use" --include="*.ex" | ...
```
The top 5-10 packages are where architectural decisions live.
### Phase 3: Interface Contracts
Find the key abstractions:
```bash
# Go
grep -rn "type.*interface {" --include="*.go" | grep -v test | grep -v mock
# Elixir
grep -rn "@callback\|@behaviour\|defprotocol" --include="*.ex"
```
### Phase 4: Error Handling & Quality Markers
```bash
# Error style
grep -rh "fmt.Errorf.*%w" | wc -l # wrapping
grep -rh "errors.New\|errors.Wrap" | wc -l # sentinel/pkg
# Quality markers
grep -rn "TODO\|FIXME\|HACK" --include="*.go" | grep -v test | wc -l
```
Note TODO format — does the project use `TODO(owner):`? Plain `TODO:`?
Version-gated TODOs?
### Phase 5: Unique Patterns
Look for project-specific infrastructure that's NOT in stdlib:
- Custom concurrency primitives (handles, schedulers, pools)
- Testing infrastructure (custom assertions, harnesses)
- Configuration systems (dynamic config, feature flags)
- Plugin/extension systems
### Phase 6: Git Archaeology (requires full clone)
This is where the real insight lives. For each interesting pattern:
```bash
# When was it introduced?
git log --all --oneline --diff-filter=A -- path/to/pattern
# Who wrote it and why?
git log --format="%an%n%ad%n%s%n%b" <commit> -1
# What did it replace?
git log -p -S "old_pattern_name" -- relevant/path
```
### Phase 7: PR Discussions
Find the PR where a key pattern was introduced:
```bash
gh api "search/issues?q=repo:<org>/<repo>+<search>+type:pr" \
--jq '.items[] | {number, title}'
```
Then read:
- PR body (the "why" from the author)
- Review comments (the debate)
- Issue comments (the resolution)
Key questions to answer:
1. What problem motivated this pattern?
2. What alternatives were considered?
3. What did reviewers push back on?
4. How was it migrated in (big-bang vs gradual)?
### Phase 8: Synthesis & Output
Produce two artifacts:
**1. `analysis.md`** — Full architectural analysis:
- Repository shape and import hierarchy
- Key patterns with code examples
- PR discussion summaries
- Cross-ecosystem comparisons
- Code quality metrics
**2. `conventions.md`** — Extracted patterns with:
- Pattern name
- Location in source
- Code example
- When to use / when NOT to use
- Origin story (from PR discussion if available)
## Output Repos
Push to Gitea under `rodin/<project>-conventions`:
```bash
GITEA_TOKEN=$(cat ~/.openclaw/credentials/gitea-rodin)
# Create repo if needed:
curl -s -X POST "https://gitea.weiker.me/api/v1/user/repos" \
-H "Authorization: token $GITEA_TOKEN" \
-H "Content-Type: application/json" \
-d '{"name": "<project>-conventions", "auto_init": true, "default_branch": "master", "license": "MIT"}'
```
## Quality Criteria
Before pushing, verify:
- Each pattern has a real code example from the repo
- "When to use / when NOT to use" sections exist
- PR discussion quotes are attributed
- Cross-ecosystem comparisons note prior art
- Files end with `<!-- PATTERN_COMPLETE -->` sentinel
## Cross-Ecosystem Observations
Always note when a pattern exists in multiple repos. Example: Temporal's
`goro.Handle` (2021) predates CockroachDB's `stop.Handle` (2025) —
same solution, invented independently. These connections are the most
valuable findings.
## The 4 Categories of Pattern Breaks
When you find impurity (pattern violations), classify them:
1. **Ship behavior, fix plumbing later** — time pressure, author knows it's
wrong (tagged with TODO)
2. **Better tooling exposed the limitation** — old pattern worked but was
invisible to profiling/tracing
3. **Removal cost > carrying cost** — technically debt but zero interest rate
4. **Context needs different pattern** — not actually a break
See `references/pattern-breaks.md` for detailed examples.
## Execution Notes
- Clone on **forge** (`host="node", node="forge"`) — has disk space
- Use full clones (not `--depth 1`) for archaeology
- `gh api` for GitHub PR lookups (authenticated on all nodes)
- One repo at a time for focused analysis
- Markdownlint all output before pushing
+97
View File
@@ -0,0 +1,97 @@
# Pattern Breaks: Why Conventions Are Violated
Based on git archaeology across CockroachDB, Prometheus, Temporal,
Ecto, and Oban. Each category with real examples.
---
## Category 1: Ship Behavior, Fix Plumbing Later
**Example:** CockroachDB bare goroutine in kvadmission
```go
// TODO(irfansharif): Use a stopper here instead.
go func() {
ticker := time.NewTicker(10 * time.Minute)
for { select { case <-ticker.C: ... } }
}()
```
**History:** Irfan Sharif, Aug 2022. Part of the elastic CPU limiter
(20+ file change). Complex new admission control system. The author
tagged it with TODO in the SAME commit. Ship the behavior, fix the
plumbing later.
**Why it survives:** Nobody touched the function for any other reason.
The goroutine leaks nowhere (process-lifetime). The TODO is correct
but not urgent.
**For review:** Worth noting, not worth blocking the PR.
---
## Category 2: Better Tooling Exposed the Limitation
**Example:** CockroachDB Handle replacing RunAsyncTask
**History:** Feb 2025, Tobias Grieger. RunAsyncTask was *correct* but
made every goroutine look identical in execution traces. The Handle
pattern was motivated by profiling/observability, not correctness.
**Key quote from PR:** "It becomes an ordeal to search for a specific
goroutine, since all goroutines started by the Stopper get assigned
the same searchable identity."
**For review:** Track it. This is how progress happens.
---
## Category 3: Removal Cost > Carrying Cost
**Example:** CockroachDB stale "Remove in 22.2" TODO
```go
// TODO(ajwerner): Remove in 22.2.
SystemConfigProvider config.SystemConfigProvider
```
**History:** Feb 2022. Bridge for mixed-version clusters during
gossip→span-config migration. A struct field that costs one pointer
of memory. Removing it means touching StoreConfig initialization
across dozens of call sites.
**Why it survives:** Cost of removal >> cost of carrying. The
interest rate on this debt is zero.
**For review:** Leave it alone.
---
## Category 4: Context Needs Different Pattern
**Example:** Prometheus global vars for scrape timestamp tolerance
```go
var ScrapeTimestampTolerance = 2 * time.Millisecond
```
**History:** Started as "experimental, hidden flag" (Oct 2020). Made
public after production showed 10ms tolerance saves real disk space.
**Why a global:** Set once at startup from a command-line flag. Read
thousands of times per second in the scrape loop. Threading a config
struct would add parameters to every hot-path function for a value
that never changes after init.
**For review:** Approve it. This IS the right pattern for this context.
---
## Summary Table
| Category | Response in Review | Example |
|----------|-------------------|---------|
| Time pressure | Note, don't block | kvadmission goroutine |
| Better tooling | Track migration | Handle API |
| Zero-interest debt | Leave alone | Stale version TODOs |
| Different context | Approve | Hot-path globals |