diff --git a/SKILL.md b/SKILL.md index bf7a49a..ece5ac6 100644 --- a/SKILL.md +++ b/SKILL.md @@ -16,178 +16,235 @@ description: >- # Codebase Analysis -Extract architectural conventions from open source repositories through -a structured multi-phase process. Output goes to Gitea convention repos. +Extract architectural conventions from open source repositories. +Output: Gitea repos under `rodin/-conventions`. -## Naming Convention +## Naming -- **Language patterns** (`go-patterns`, `elixir-patterns`): stdlib/language - idioms verified from source -- **Project conventions** (`-conventions`): how a specific codebase - does things — e.g., `temporal-conventions`, `cockroachdb-conventions` +- `*-patterns` = language-level (go-patterns, elixir-patterns) +- `*-conventions` = project-specific (temporal-conventions, etc.) + +## Thinking Framework + +Before starting any analysis, ask: + +1. **What is this project's essence?** A trading system is a state + machine where the state is money. A workflow engine is a tree of + state machines. Name the essence — the patterns follow from it. +2. **What forces shaped it?** Team size, age, performance constraints, + backward compatibility obligations. These predict WHERE conventions + will be strict vs relaxed. +3. **What would surprise me?** The interesting findings are never "they + use interfaces" — it's "they have 566 dynamic config settings" or + "zero TODOs in 3.8M of code." Surprise = insight. + +## Prioritization: What to Dig Into + +Not everything is interesting. Focus on patterns that: + +- **Appear >50 times** — this is a conscious convention, not a one-off +- **Have a dedicated package** — someone thought it was important enough + to abstract +- **Other projects solve differently** — reveals a real design tradeoff +- **Have a surprising name** — indicates the team had to invent + vocabulary for a novel concept +- **Were introduced recently with many PR comments** — active design + decisions with recorded rationale + +Skip patterns that are: +- Standard library usage (unless the project wraps/extends it) +- Single-use internal helpers +- Generated code +- Exact copies of well-known open-source patterns without modification ## Phases -Execute in order. Each phase builds on the previous. +### Phase 1: Shape (5 min) -### Phase 1: Clone & Shape +Clone on forge (`~/src/analysis/`). Full clone — never shallow. -Clone the repo on forge (`~/src/analysis/`). Measure: +Measure: size, files, commits, contributors, top-level dirs. -``` -- Total size, file count, commit count, contributor count -- Top-level directory structure -- Language breakdown (if polyglot) -``` +**What matters here:** The ratio of test files to production files. +The presence/absence of `internal/` vs flat structure. Whether there's +a single `pkg/` or many top-level packages. These reveal organizational +philosophy before you read a single line. -### Phase 2: Import Hierarchy +### Phase 2: What the Codebase Values (10 min) -Find the most-depended-upon packages. This reveals what the codebase -considers foundational: +Find the most-imported internal packages. The top 5 are the +project's definition of "foundational." -```bash -# Go: grep imports, extract internal packages, count -grep -rh '"/' --include="*.go" | sed ... | sort | uniq -c | sort -rn +**Ask:** Why these? What do they share? Usually: logging, errors, +config, and one domain-specific abstraction that IS the project. +That domain-specific one is where the real conventions live. -# Elixir: grep aliases/imports -grep -rh "alias\|import\|use" --include="*.ex" | ... -``` +See `references/commands.md` for grep patterns by language. -The top 5-10 packages are where architectural decisions live. +### Phase 3: Interface Contracts (10 min) -### Phase 3: Interface Contracts +Find interfaces/behaviours/protocols — but don't list them all. -Find the key abstractions: +**Focus on:** Interfaces with >3 implementations (these are real +extension points). Interfaces in constructor signatures (these are +dependency injection boundaries). Interfaces that appear in BOTH +production and test code (these are the testability seams). -```bash -# Go -grep -rn "type.*interface {" --include="*.go" | grep -v test | grep -v mock +**Skip:** One-method interfaces (usually just for mocking). Interfaces +only used in one place (not yet conventions). -# Elixir -grep -rn "@callback\|@behaviour\|defprotocol" --include="*.ex" -``` +### Phase 4: Quality Fingerprint (5 min) -### Phase 4: Error Handling & Quality Markers +Measure: TODO count, FIXME count, HACK count, test count, mock count. -```bash -# Error style -grep -rh "fmt.Errorf.*%w" | wc -l # wrapping -grep -rh "errors.New\|errors.Wrap" | wc -l # sentinel/pkg +**What to notice:** +- TODO format reveals discipline: `TODO(owner):` = accountability, + `TODO:` = aspirational, version-gated = systematic cleanup +- Zero TODOs in a large codebase means active cleanup culture +- High mock count relative to test count suggests heavy DI +- HACK count > 0 is honest; HACK count = 0 in a large project is + suspicious (they probably use different words) -# Quality markers -grep -rn "TODO\|FIXME\|HACK" --include="*.go" | grep -v test | wc -l -``` +### Phase 5: Unique Patterns (15 min) -Note TODO format — does the project use `TODO(owner):`? Plain `TODO:`? -Version-gated TODOs? +Look for infrastructure NOT in stdlib. Categories: -### Phase 5: Unique Patterns +- **Concurrency:** goroutine handles, schedulers, shutdown primitives +- **Testing:** custom assertions, fake registries, golden file systems +- **Configuration:** dynamic config, feature flags, runtime toggles +- **Error handling:** custom error types, assertion systems, panic + recovery patterns +- **Extension:** plugin registration, hook systems, middleware chains -Look for project-specific infrastructure that's NOT in stdlib: -- Custom concurrency primitives (handles, schedulers, pools) -- Testing infrastructure (custom assertions, harnesses) -- Configuration systems (dynamic config, feature flags) -- Plugin/extension systems +**The test for uniqueness:** Would you be surprised to find this in +another project of similar size? If yes → convention worth documenting. +If no → standard practice, skip. -### Phase 6: Git Archaeology (requires full clone) +### Phase 6: Git Archaeology (20 min) -This is where the real insight lives. For each interesting pattern: +For each unique pattern found in Phase 5: -```bash -# When was it introduced? -git log --all --oneline --diff-filter=A -- path/to/pattern +1. Find the commit that introduced it (`git log --diff-filter=A`) +2. Read the commit message — the "why" is usually there +3. Check if it replaced something (`git log -S "old_name"`) +4. Note the date and author — context for why shortcuts were taken -# Who wrote it and why? -git log --format="%an%n%ad%n%s%n%b" -1 +**The insight is always WHY, not WHAT.** A bare goroutine with a +TODO is uninteresting as a listing. A bare goroutine introduced during +a complex 20-file admission control feature, tagged by the author in +the same commit, that survived 3 years because nobody touched the +function — that's a lesson about how real codebases evolve. -# What did it replace? -git log -p -S "old_pattern_name" -- relevant/path -``` +See `references/commands.md` for git archaeology patterns. -### Phase 7: PR Discussions +**If the repo is on a forge without PR history** (self-hosted, mailing +list-based): Fall back to commit messages and CHANGELOG. The commit +body IS the PR description for these projects. Look for "Reviewed-by" +trailers and linked issues. -Find the PR where a key pattern was introduced: +### Phase 7: PR Discussions (20 min) -```bash -gh api "search/issues?q=repo:/++type:pr" \ - --jq '.items[] | {number, title}' -``` - -Then read: -- PR body (the "why" from the author) +Find PRs where key patterns were introduced. Read: +- The PR body (author's motivation) - Review comments (the debate) -- Issue comments (the resolution) +- The resolution -Key questions to answer: -1. What problem motivated this pattern? -2. What alternatives were considered? -3. What did reviewers push back on? -4. How was it migrated in (big-bang vs gradual)? +**What to extract from discussions:** +- What the author was defending (= where the real insight is) +- What reviewers pushed back on (= non-obvious tradeoffs) +- Whether it was "merge and iterate" vs "perfect before merge" +- Whether external validation was cited (benchmarks, user feedback) +- The migration strategy (big-bang vs gradual coexistence) -### Phase 8: Synthesis & Output +**The highest-value finding:** When a reviewer says "I wish we'd done +X instead" and the author explains why X doesn't work. That tradeoff +reasoning is pure expert knowledge. -Produce two artifacts: +### Phase 8: Synthesis -**1. `analysis.md`** — Full architectural analysis: -- Repository shape and import hierarchy -- Key patterns with code examples -- PR discussion summaries -- Cross-ecosystem comparisons -- Code quality metrics +Produce two files. Push to Gitea. -**2. `conventions.md`** — Extracted patterns with: -- Pattern name -- Location in source -- Code example -- When to use / when NOT to use -- Origin story (from PR discussion if available) +**`analysis.md`** — the full story: +1. Repo shape and organizational philosophy +2. Import hierarchy (what it values) +3. Key patterns with code examples + origin stories +4. PR discussion excerpts (attributed quotes) +5. Cross-ecosystem comparisons (prior art, independent invention) +6. Quality metrics in context (not bare numbers) -## Output Repos +**`conventions.md`** — the reference: +For each pattern: +- Name and location in source +- Code example (real, not simplified) +- When to use / When NOT to use +- Origin (commit date, author, PR# if available) -Push to Gitea under `rodin/-conventions`: - -```bash -GITEA_TOKEN=$(cat ~/.openclaw/credentials/gitea-rodin) -# Create repo if needed: -curl -s -X POST "https://gitea.weiker.me/api/v1/user/repos" \ - -H "Authorization: token $GITEA_TOKEN" \ - -H "Content-Type: application/json" \ - -d '{"name": "-conventions", "auto_init": true, "default_branch": "master", "license": "MIT"}' -``` - -## Quality Criteria - -Before pushing, verify: -- Each pattern has a real code example from the repo -- "When to use / when NOT to use" sections exist -- PR discussion quotes are attributed -- Cross-ecosystem comparisons note prior art -- Files end with `` sentinel +End both files with `` sentinel. ## Cross-Ecosystem Observations -Always note when a pattern exists in multiple repos. Example: Temporal's -`goro.Handle` (2021) predates CockroachDB's `stop.Handle` (2025) — -same solution, invented independently. These connections are the most -valuable findings. +Always note when a pattern exists in multiple repos. These +independent inventions reveal forces that transcend project context: +- Temporal goro.Handle (2021) ↔ CockroachDB stop.Handle (2025) +- Ecto zero TODOs (version-gated) ↔ Oban zero TODOs (2-week cleanup) +- Prometheus init() plugins ↔ Temporal init() plugins ## The 4 Categories of Pattern Breaks -When you find impurity (pattern violations), classify them: +When you find convention violations, classify: -1. **Ship behavior, fix plumbing later** — time pressure, author knows it's - wrong (tagged with TODO) -2. **Better tooling exposed the limitation** — old pattern worked but was - invisible to profiling/tracing -3. **Removal cost > carrying cost** — technically debt but zero interest rate +1. **Ship behavior, fix plumbing later** — tagged with TODO same commit +2. **Better tooling exposed limitation** — observability, not correctness +3. **Removal cost > carrying cost** — zero-interest debt 4. **Context needs different pattern** — not actually a break -See `references/pattern-breaks.md` for detailed examples. +See `references/pattern-breaks.md` for real examples with git history. + +## NEVER + +- **NEVER analyze with a shallow clone** and assume full picture — + archaeology requires full history +- **NEVER present patterns from one file as repo-wide conventions** — + verify frequency across the codebase first +- **NEVER skip PR discussions** — code without context is just syntax; + the discussion IS the insight +- **NEVER report bare numbers** ("738 TODOs") — always contextualize + (per 1000 files, vs comparable projects, trending up/down) +- **NEVER confuse "the maintainer likes X" with "X is the right + pattern"** — solo-maintained projects reflect one person's taste; + team projects reflect negotiated conventions +- **NEVER present a pattern as "unique" without checking** if stdlib + has it or if it's a well-known library pattern +- **NEVER list patterns without when-NOT-to-use** — that's where the + expertise actually lives +- **NEVER quote PR discussions without attribution** — who said it + matters (maintainer vs drive-by contributor) +- **NEVER analyze repos <1000 commits** — not enough history for + meaningful archaeology +- **NEVER conflate language patterns with project conventions** — `go- + patterns` is stdlib idiom; `temporal-conventions` is project choice + +## Output Repos + +Push to Gitea under `rodin/-conventions`. See +`references/commands.md` for repo creation and push commands. + +## Fallbacks + +- **No PR discussions?** Use commit messages as primary source. + Many projects (Linux, PostgreSQL) do all review in commit messages + and mailing lists. +- **Repo too large to clone fully?** Clone shallow first, do Phase + 1-5, then `git fetch --unshallow` only if Phase 6-7 are needed. +- **Private repo / no GitHub API?** Skip Phase 7. Phase 6 (local git + history) still works. +- **<3000 commits?** Reduce Phase 6-7 expectations. Younger projects + have less archaeology to mine — focus on Phase 5 (unique patterns) + and the project's README/docs for rationale. ## Execution Notes -- Clone on **forge** (`host="node", node="forge"`) — has disk space -- Use full clones (not `--depth 1`) for archaeology +- Clone on **forge** (`host="node", node="forge"`) - `gh api` for GitHub PR lookups (authenticated on all nodes) - One repo at a time for focused analysis - Markdownlint all output before pushing diff --git a/references/commands.md b/references/commands.md new file mode 100644 index 0000000..3e4327f --- /dev/null +++ b/references/commands.md @@ -0,0 +1,122 @@ +# Commands Reference + +Shell patterns for each phase. Load only when you need the exact +syntax — the SKILL.md tells you WHAT to look for; this tells you HOW. + +--- + +## Phase 1: Shape + +```bash +# Clone (always full — never --depth 1) +cd ~/src/analysis && git clone + +# Dimensions +du -sh +find . -name "*.go" | grep -v vendor | wc -l # Go files +git log --oneline | wc -l # commits +git log --format="%an" | sort -u | wc -l # contributors +ls -d */ # top-level dirs +``` + +## Phase 2: Import Hierarchy + +```bash +# Go: most-imported internal packages +grep -rh '"/' --include="*.go" \ + | sed 's/.*"\///' | sed 's/".*//' \ + | cut -d/ -f1 | sort | uniq -c | sort -rn | head -15 + +# Elixir: most-aliased modules +grep -rh "alias " --include="*.ex" \ + | sed 's/.*alias //' | cut -d. -f1-2 \ + | sort | uniq -c | sort -rn | head -15 +``` + +## Phase 3: Interfaces + +```bash +# Go interfaces (skip test/mock/generated) +grep -rn "type.*interface {" --include="*.go" \ + | grep -v test | grep -v mock | grep -v _gen.go + +# Elixir behaviours/protocols +grep -rn "@callback\|@behaviour\|defprotocol" --include="*.ex" +``` + +## Phase 4: Quality Markers + +```bash +# TODOs (non-test, note format) +grep -rn "TODO" --include="*.go" | grep -v test | wc -l +grep -rh "// TODO" --include="*.go" | head -5 # check format + +# Other markers +grep -rn "FIXME\|HACK" --include="*.go" | grep -v test | wc -l + +# Test ratio +find . -name "*_test.go" | wc -l # test files +find . -name "*mock*" | wc -l # mock files +``` + +## Phase 6: Git Archaeology + +```bash +# When was a file/pattern introduced? +git log --all --oneline --diff-filter=A -- path/to/file + +# Who wrote it and why? +git log --format="%an%n%ad%n%s%n%b" -1 + +# When was a string added/removed (pickaxe search)? +git log -p -S "pattern_string" -- relevant/path/ + +# TODOs added with context +git log --all -p -S "TODO" -- lib/ \ + | grep "^commit\|^+.*TODO\|^-.*TODO" | head -20 + +# Find if full clone or shallow +git rev-parse --is-shallow-repository +# Unshallow if needed +git fetch --unshallow +``` + +## Phase 7: PR Lookup + +```bash +# Find PR by title/content +gh api "search/issues?q=repo:/++type:pr" \ + --jq '.items[] | {number: .number, title: .title}' + +# Read PR metadata +gh api repos///pulls/ \ + --jq '{title, body: (.body | split("\n") | .[0:30] | join("\n")), comments, review_comments, created_at, merged_at}' + +# Read review comments (the debate) +gh api "repos///pulls//comments?per_page=30" \ + --jq '.[] | select(.body | length > 50) | {user: .user.login, body: (.body | split("\n") | .[0:5] | join("\n"))}' + +# Read issue comments +gh api repos///issues//comments \ + --jq '.[] | {user: .user.login, body: (.body | split("\n") | .[0:6] | join("\n"))}' +``` + +## Output: Repo Creation & Push + +```bash +GITEA_TOKEN=$(cat ~/.openclaw/credentials/gitea-rodin) + +# Create repo +curl -s -X POST "https://gitea.weiker.me/api/v1/user/repos" \ + -H "Authorization: token $GITEA_TOKEN" \ + -H "Content-Type: application/json" \ + -d '{"name": "-conventions", "auto_init": true, "default_branch": "master", "license": "MIT"}' + +# Clone, add content, push +git clone "https://rodin:${GITEA_TOKEN}@gitea.weiker.me/rodin/-conventions.git" +cd -conventions +git config user.name "Rodin" +git config user.email "rodin@forgedthought.ai" +# ... add files ... +git add -A && git commit -m "docs: initial conventions from /" && git push +```