8-phase methodology for extracting architectural conventions from open source repositories: 1. Clone & Shape — measure repo dimensions 2. Import Hierarchy — find foundational packages 3. Interface Contracts — key abstractions 4. Error Handling & Quality Markers — TODOs, style 5. Unique Patterns — project-specific infrastructure 6. Git Archaeology — trace WHY decisions were made 7. PR Discussions — read the actual debates 8. Synthesis — produce analysis.md + conventions.md Includes pattern-breaks reference (4 categories of why conventions are violated, with real examples from CockroachDB, Prometheus, Temporal, Ecto, Oban). Output: Gitea repos named <project>-conventions.
5.8 KiB
name, description
| name | description |
|---|---|
| codebase-analysis | Analyze open source repositories to extract architectural conventions, patterns, and design decisions. Produces convention docs and pattern files suitable for Gitea repos. Use when asked to "analyze a repo", "extract patterns from", "what conventions does X use", "add X to the analysis repos", "how does X do Y architecturally", or when adding a new project to the conventions collection. Triggers on: codebase analysis, repo analysis, extract conventions, architectural analysis, project conventions, "analyze this repo", "what patterns does X use", "how is X structured". Do NOT use for: code review of specific PRs (use pr-review), security audits (use vuln-scout), or reading a single file for a quick answer. |
Codebase Analysis
Extract architectural conventions from open source repositories through a structured multi-phase process. Output goes to Gitea convention repos.
Naming Convention
- Language patterns (
go-patterns,elixir-patterns): stdlib/language idioms verified from source - Project conventions (
<project>-conventions): how a specific codebase does things — e.g.,temporal-conventions,cockroachdb-conventions
Phases
Execute in order. Each phase builds on the previous.
Phase 1: Clone & Shape
Clone the repo on forge (~/src/analysis/<name>). Measure:
- Total size, file count, commit count, contributor count
- Top-level directory structure
- Language breakdown (if polyglot)
Phase 2: Import Hierarchy
Find the most-depended-upon packages. This reveals what the codebase considers foundational:
# Go: grep imports, extract internal packages, count
grep -rh '"<module>/' --include="*.go" | sed ... | sort | uniq -c | sort -rn
# Elixir: grep aliases/imports
grep -rh "alias\|import\|use" --include="*.ex" | ...
The top 5-10 packages are where architectural decisions live.
Phase 3: Interface Contracts
Find the key abstractions:
# Go
grep -rn "type.*interface {" --include="*.go" | grep -v test | grep -v mock
# Elixir
grep -rn "@callback\|@behaviour\|defprotocol" --include="*.ex"
Phase 4: Error Handling & Quality Markers
# Error style
grep -rh "fmt.Errorf.*%w" | wc -l # wrapping
grep -rh "errors.New\|errors.Wrap" | wc -l # sentinel/pkg
# Quality markers
grep -rn "TODO\|FIXME\|HACK" --include="*.go" | grep -v test | wc -l
Note TODO format — does the project use TODO(owner):? Plain TODO:?
Version-gated TODOs?
Phase 5: Unique Patterns
Look for project-specific infrastructure that's NOT in stdlib:
- Custom concurrency primitives (handles, schedulers, pools)
- Testing infrastructure (custom assertions, harnesses)
- Configuration systems (dynamic config, feature flags)
- Plugin/extension systems
Phase 6: Git Archaeology (requires full clone)
This is where the real insight lives. For each interesting pattern:
# When was it introduced?
git log --all --oneline --diff-filter=A -- path/to/pattern
# Who wrote it and why?
git log --format="%an%n%ad%n%s%n%b" <commit> -1
# What did it replace?
git log -p -S "old_pattern_name" -- relevant/path
Phase 7: PR Discussions
Find the PR where a key pattern was introduced:
gh api "search/issues?q=repo:<org>/<repo>+<search>+type:pr" \
--jq '.items[] | {number, title}'
Then read:
- PR body (the "why" from the author)
- Review comments (the debate)
- Issue comments (the resolution)
Key questions to answer:
- What problem motivated this pattern?
- What alternatives were considered?
- What did reviewers push back on?
- How was it migrated in (big-bang vs gradual)?
Phase 8: Synthesis & Output
Produce two artifacts:
1. analysis.md — Full architectural analysis:
- Repository shape and import hierarchy
- Key patterns with code examples
- PR discussion summaries
- Cross-ecosystem comparisons
- Code quality metrics
2. conventions.md — Extracted patterns with:
- Pattern name
- Location in source
- Code example
- When to use / when NOT to use
- Origin story (from PR discussion if available)
Output Repos
Push to Gitea under rodin/<project>-conventions:
GITEA_TOKEN=$(cat ~/.openclaw/credentials/gitea-rodin)
# Create repo if needed:
curl -s -X POST "https://gitea.weiker.me/api/v1/user/repos" \
-H "Authorization: token $GITEA_TOKEN" \
-H "Content-Type: application/json" \
-d '{"name": "<project>-conventions", "auto_init": true, "default_branch": "master", "license": "MIT"}'
Quality Criteria
Before pushing, verify:
- Each pattern has a real code example from the repo
- "When to use / when NOT to use" sections exist
- PR discussion quotes are attributed
- Cross-ecosystem comparisons note prior art
- Files end with
<!-- PATTERN_COMPLETE -->sentinel
Cross-Ecosystem Observations
Always note when a pattern exists in multiple repos. Example: Temporal's
goro.Handle (2021) predates CockroachDB's stop.Handle (2025) —
same solution, invented independently. These connections are the most
valuable findings.
The 4 Categories of Pattern Breaks
When you find impurity (pattern violations), classify them:
- Ship behavior, fix plumbing later — time pressure, author knows it's wrong (tagged with TODO)
- Better tooling exposed the limitation — old pattern worked but was invisible to profiling/tracing
- Removal cost > carrying cost — technically debt but zero interest rate
- Context needs different pattern — not actually a break
See references/pattern-breaks.md for detailed examples.
Execution Notes
- Clone on forge (
host="node", node="forge") — has disk space - Use full clones (not
--depth 1) for archaeology gh apifor GitHub PR lookups (authenticated on all nodes)- One repo at a time for focused analysis
- Markdownlint all output before pushing