feat: add YAML support for persona files #58
@@ -34,21 +34,36 @@ func parsePersona(data []byte, source string) (*Persona, error) {
|
||||
### YAML Parsing with Depth Protection
|
||||
|
||||
|
|
||||
```go
|
||||
func parseYAML(data []byte, source string) (*Persona, error) {
|
||||
var p Persona
|
||||
// gopkg.in/yaml.v3 does NOT have built-in depth limiting.
|
||||
// Use explicit depth check via yaml.Node API.
|
||||
if err := yaml.Unmarshal(data, &p); err != nil {
|
||||
return nil, fmt.Errorf("parse persona %s: %w", source, err)
|
||||
func unmarshalYAMLWithDepthLimit(data []byte, out any, maxDepth int) error {
|
||||
var node yaml.Node
|
||||
dec := yaml.NewDecoder(bytes.NewReader(data))
|
||||
if err := dec.Decode(&node); err != nil {
|
||||
return err
|
||||
}
|
||||
if err := validatePersona(&p, source); err != nil {
|
||||
return nil, err
|
||||
if err := checkYAMLDepth(&node, 0, maxDepth); err != nil {
|
||||
return err
|
||||
}
|
||||
return &p, nil
|
||||
return node.Decode(out)
|
||||
|
sonnet-review-bot
commented
[MINOR] The design document's **[MINOR]** The design document's `parseYAML` code snippet in the 'YAML Parsing with Depth Protection' section shows a naive `yaml.Unmarshal` without depth checking, but the actual implementation correctly uses the two-pass Node approach. The doc snippet is stale/inconsistent with the implementation. The paragraph below it explains the correct approach, but the code block above it is misleading. This is documentation-only and doesn't affect behavior, but could confuse future contributors.
|
||||
}
|
||||
|
sonnet-review-bot
commented
[NIT] The design doc's error cases table says 'Deeply nested YAML | Library rejects (v1.16.0+ fix)' but the actual implementation uses explicit depth checking via **[NIT]** The design doc's error cases table says 'Deeply nested YAML | Library rejects (v1.16.0+ fix)' but the actual implementation uses explicit depth checking via `checkYAMLDepth`, not a library-level fix. The implementation is correct (and better), but the design doc description is inaccurate.
|
||||
|
||||
func checkYAMLDepth(node *yaml.Node, depth, maxDepth int) error {
|
||||
if depth > maxDepth {
|
||||
return fmt.Errorf("YAML nesting depth exceeds maximum (%d)", maxDepth)
|
||||
}
|
||||
// Handle alias nodes by following the Alias pointer
|
||||
if node.Kind == yaml.AliasNode && node.Alias != nil {
|
||||
return checkYAMLDepth(node.Alias, depth, maxDepth)
|
||||
|
sonnet-review-bot
commented
[NIT] The design doc's Error Cases table says 'Deeply nested YAML | Library rejects (v1.16.0+ fix)' — but the actual implementation does not rely on library-level rejection; it implements its own depth check. This entry is misleading and should say 'Application-level depth check rejects' to accurately reflect the implementation. **[NIT]** The design doc's Error Cases table says 'Deeply nested YAML | Library rejects (v1.16.0+ fix)' — but the actual implementation does *not* rely on library-level rejection; it implements its own depth check. This entry is misleading and should say 'Application-level depth check rejects' to accurately reflect the implementation.
sonnet-review-bot
commented
[NIT] The design doc's Error Cases table says 'Deeply nested YAML | Library rejects (v1.16.0+ fix)' but the actual implementation handles this via the custom **[NIT]** The design doc's Error Cases table says 'Deeply nested YAML | Library rejects (v1.16.0+ fix)' but the actual implementation handles this via the custom `checkYAMLDepth` walker, not the library itself. The description is misleading and may cause confusion during future maintenance.
|
||||
}
|
||||
for _, child := range node.Content {
|
||||
if err := checkYAMLDepth(child, depth+1, maxDepth); err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
```
|
||||
|
||||
The `gopkg.in/yaml.v3` library does not have built-in depth protection, so we implement explicit depth checking by first decoding into a `yaml.Node`, walking the tree to verify depth, then decoding into the target struct.
|
||||
The `gopkg.in/yaml.v3` library does not have built-in depth protection, so we implement explicit depth checking by first decoding into a `yaml.Node`, walking the tree to verify depth (including alias resolution), then decoding into the target struct.
|
||||
|
||||
## State/Data Model
|
||||
|
||||
|
||||
@@ -52,6 +52,9 @@ func LoadPersona(path string) (*Persona, error) {
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("read persona file %s: %w", path, err)
|
||||
}
|
||||
if !info.Mode().IsRegular() {
|
||||
|
sonnet-review-bot
commented
[NIT] The fallback to JSON in **[NIT]** The fallback to JSON in `LoadBuiltinPersona` has a comment 'Fall back to JSON for backwards compatibility', but the `embeddedPersonas` embed directive now only includes `*.yaml` files (`//go:embed personas/*.yaml`). The JSON files have been deleted. This means the JSON fallback path in `LoadBuiltinPersona` is dead code — it can never succeed because no `.json` files are embedded. The fallback should either be removed or the embed directive should include `*.json` if JSON built-in personas need to be supported.
|
||||
return nil, fmt.Errorf("persona file %s is not a regular file", path)
|
||||
}
|
||||
|
gpt-review-bot
commented
[MINOR] Built-in persona loader falls back to JSON if YAML not found, but the embedded FS only includes **[MINOR]** Built-in persona loader falls back to JSON if YAML not found, but the embedded FS only includes `*.yaml`. Since the JSON files were removed and not embedded, this fallback will never succeed for built-ins and can be simplified or the embed pattern expanded if JSON fallback is intended.
|
||||
if info.Size() > MaxPersonaFileSize {
|
||||
return nil, fmt.Errorf("persona file %s exceeds maximum size (%d bytes)", path, MaxPersonaFileSize)
|
||||
|
gpt-review-bot
commented
[MINOR] LoadPersona rejects non-regular files (e.g., symlinks) by checking Mode().IsRegular(). This may be overly strict for CI or repo setups that use symlinks. Consider allowing symlinks by resolving them (os.Stat vs. Lstat) or relaxing the check. **[MINOR]** LoadPersona rejects non-regular files (e.g., symlinks) by checking Mode().IsRegular(). This may be overly strict for CI or repo setups that use symlinks. Consider allowing symlinks by resolving them (os.Stat vs. Lstat) or relaxing the check.
|
||||
}
|
||||
|
sonnet-review-bot
commented
[NIT] The comment on **[NIT]** The comment on `LoadBuiltinPersona` says 'Try YAML first (preferred format)' but then falls back to JSON. Since the embed directive is now `//go:embed personas/*.yaml` only (JSON files were deleted), the JSON fallback in `LoadBuiltinPersona` is dead code for built-in personas. It's harmless and provides forwards-compatibility for the unlikely case someone adds a `.json` builtin later, but could be simplified.
|
||||
@@ -59,6 +62,11 @@ func LoadPersona(path string) (*Persona, error) {
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("read persona file %s: %w", path, err)
|
||||
}
|
||||
|
sonnet-review-bot
commented
[NIT] In **[NIT]** In `LoadBuiltinPersona`, the JSON fallback path reassigns `data, err = embeddedPersonas.ReadFile(...)` but the original `err` from the YAML attempt is already discarded by the `if err == nil { return ... }` pattern. This is fine and correct, but worth noting that the error variable from the YAML attempt is silently dropped. Since the YAML error is expected (file doesn't exist), this is intentional and acceptable.
|
||||
// Re-check size after read to defend against TOCTOU races where file
|
||||
|
[MINOR] Missing regular file check: LoadPersona does not verify that the path refers to a regular file. Reading special files (FIFOs, devices) could block or cause unexpected behavior. Consider rejecting non-regular files (info.Mode().IsRegular()). **[MINOR]** Missing regular file check: LoadPersona does not verify that the path refers to a regular file. Reading special files (FIFOs, devices) could block or cause unexpected behavior. Consider rejecting non-regular files (info.Mode().IsRegular()).
|
||||
// grows between stat and read (e.g., appending process, replaced file).
|
||||
|
gpt-review-bot
commented
[NIT] ListBuiltinPersonas returns names in non-deterministic order (map iteration). Consider sorting the slice before returning for determinism, which can help with UX and potential downstream consumers. **[NIT]** ListBuiltinPersonas returns names in non-deterministic order (map iteration). Consider sorting the slice before returning for determinism, which can help with UX and potential downstream consumers.
|
||||
if len(data) > MaxPersonaFileSize {
|
||||
return nil, fmt.Errorf("persona file %s exceeds maximum size (%d bytes)", path, MaxPersonaFileSize)
|
||||
}
|
||||
return parsePersona(data, path)
|
||||
}
|
||||
|
[MINOR] Potential TOCTOU/resource exhaustion risk: file size is checked via os.Stat before reading, but not re-validated after os.ReadFile. An attacker with control over the path could swap the file between Stat and Read, bypassing the size cap and causing large in-memory reads. **[MINOR]** Potential TOCTOU/resource exhaustion risk: file size is checked via os.Stat before reading, but not re-validated after os.ReadFile. An attacker with control over the path could swap the file between Stat and Read, bypassing the size cap and causing large in-memory reads.
|
||||
|
||||
@@ -144,7 +152,10 @@ func parsePersona(data []byte, source string) (*Persona, error) {
|
||||
|
||||
// unmarshalYAMLWithDepthLimit unmarshals YAML data with explicit depth limiting.
|
||||
|
sonnet-review-bot
commented
[MINOR] The **[MINOR]** The `unmarshalYAMLWithDepthLimit` function uses `interface{}` as the parameter type for `out` instead of `any`. The project targets the latest stable Go release, and `any` is the idiomatic alias since Go 1.18. This is inconsistent with the rest of the codebase convention.
sonnet-review-bot
commented
[MINOR] The two-pass decode approach (first into yaml.Node for depth check, then strict decode from raw bytes) is necessary because KnownFields doesn't work on yaml.Node.Decode(), but this means the input bytes are parsed twice. The comment explains this correctly. A minor concern: if **[MINOR]** The two-pass decode approach (first into yaml.Node for depth check, then strict decode from raw bytes) is necessary because KnownFields doesn't work on yaml.Node.Decode(), but this means the input bytes are parsed twice. The comment explains this correctly. A minor concern: if `gopkg.in/yaml.v3` ever adds KnownFields support to node.Decode(), this could be simplified — but for now this is the correct workaround and is well-documented.
|
||||
// This protects against stack exhaustion from deeply nested structures.
|
||||
func unmarshalYAMLWithDepthLimit(data []byte, out interface{}, maxDepth int) error {
|
||||
// Note: Multi-document YAML files are accepted but only the first document is
|
||||
// parsed; additional documents are silently ignored. This is acceptable for
|
||||
// persona files where multi-document support is not a use case.
|
||||
func unmarshalYAMLWithDepthLimit(data []byte, out any, maxDepth int) error {
|
||||
var node yaml.Node
|
||||
|
[MAJOR] The recursive YAML depth checker does not detect alias/anchor cycles. If a YAML document contains cyclic aliases (e.g., an alias ultimately pointing back to an anchored node that references the alias), checkYAMLDepth can recurse indefinitely, leading to stack exhaustion or a hang (DoS). Add cycle detection (visited set of *yaml.Node) or a total node visitation cap to prevent infinite recursion. **[MAJOR]** The recursive YAML depth checker does not detect alias/anchor cycles. If a YAML document contains cyclic aliases (e.g., an alias ultimately pointing back to an anchored node that references the alias), checkYAMLDepth can recurse indefinitely, leading to stack exhaustion or a hang (DoS). Add cycle detection (visited set of *yaml.Node) or a total node visitation cap to prevent infinite recursion.
|
||||
dec := yaml.NewDecoder(bytes.NewReader(data))
|
||||
|
[MINOR] JSON parsing uses json.Unmarshal, which silently ignores unknown fields. For consistency with YAML's KnownFields(true) and to harden against typos or unintended keys, consider switching to json.Decoder with DisallowUnknownFields() to reject unknown JSON fields. **[MINOR]** JSON parsing uses json.Unmarshal, which silently ignores unknown fields. For consistency with YAML's KnownFields(true) and to harden against typos or unintended keys, consider switching to json.Decoder with DisallowUnknownFields() to reject unknown JSON fields.
|
||||
if err := dec.Decode(&node); err != nil {
|
||||
|
sonnet-review-bot
commented
[MINOR] The decoder created with **[MINOR]** The decoder created with `yaml.NewDecoder` is used to decode a single document but does not check for a second call to `Decode` returning `io.EOF` to confirm there's only one document. For untrusted user-supplied YAML files, silently ignoring additional documents is acceptable (they won't affect the parsed struct), but it's worth noting that multi-document YAML files will have their additional documents silently ignored rather than surfacing an error. This is a minor UX issue, not a correctness bug.
sonnet-review-bot
commented
[MINOR] The **[MINOR]** The `unmarshalYAMLWithDepthLimit` function silently ignores multi-document YAML files (only parses the first document). While the comment acknowledges this, it could be a footgun: a user who accidentally writes `---` in their persona file will get no error, just silent truncation. Consider calling `dec.Decode` a second time and returning an error if a second document is found, to give users a clearer signal.
|
||||
@@ -159,10 +170,16 @@ func unmarshalYAMLWithDepthLimit(data []byte, out interface{}, maxDepth int) err
|
||||
}
|
||||
|
||||
|
sonnet-review-bot
commented
[MINOR] The multi-document rejection check ( **[MINOR]** The multi-document rejection check (`dec.Decode(&extra)`) re-uses the same decoder after the first `dec.Decode(&node)` call. If the first document exhausted the decoder but there's trailing non-document content (e.g., a trailing `...` end-of-document marker), the behavior depends on go-yaml internals. This is likely fine in practice, but the comment could clarify that this relies on go-yaml's decoder advancing past the first document.
|
||||
// checkYAMLDepth recursively checks that YAML nodes don't exceed the depth limit.
|
||||
// Handles alias nodes by following the Alias pointer to check the target's depth.
|
||||
func checkYAMLDepth(node *yaml.Node, depth, maxDepth int) error {
|
||||
if depth > maxDepth {
|
||||
return fmt.Errorf("YAML nesting depth exceeds maximum (%d)", maxDepth)
|
||||
|
sonnet-review-bot
commented
[MINOR] The alias-following logic in **[MINOR]** The alias-following logic in `checkYAMLDepth` does not count the alias node's own depth level before recursing into the alias target. If an alias node appears near the depth limit, the target's content is checked starting from the same `depth` value as the alias itself (not `depth+1`). This means a deeply aliased structure could potentially bypass the limit by 1 level. While the impact is minimal given the depth=20 limit, incrementing depth before recursing into the alias target would be more consistent.
|
||||
}
|
||||
|
sonnet-review-bot
commented
[MINOR] The alias depth check uses **[MINOR]** The alias depth check uses `return checkYAMLDepth(node.Alias, depth, maxDepth)` — i.e., the alias itself doesn't add to the depth. This is intentional per the comment, but YAML aliases can be used to create large fan-out structures. An alias to a deeply nested structure effectively doubles the depth impact without incrementing depth for the alias node itself. For the persona use case with MaxYAMLDepth=20 this is acceptable, but worth noting the design tradeoff.
|
||||
// Handle alias nodes: follow the alias to its anchor target.
|
||||
// The alias itself doesn't add depth, but we must check the target.
|
||||
if node.Kind == yaml.AliasNode && node.Alias != nil {
|
||||
return checkYAMLDepth(node.Alias, depth, maxDepth)
|
||||
}
|
||||
for _, child := range node.Content {
|
||||
|
sonnet-review-bot
commented
[MINOR] The two-pass decode approach (first into yaml.Node for depth checking, then a fresh decoder with KnownFields) re-parses the raw bytes a second time. This means a malicious YAML that causes the first-pass decoder to consume significant CPU (e.g., extremely long scalar strings) gets two chances to do so. The depth/node count limits reduce but don't fully eliminate this risk. A minor concern since file size is bounded at 64KB, but worth noting. **[MINOR]** The two-pass decode approach (first into yaml.Node for depth checking, then a fresh decoder with KnownFields) re-parses the raw bytes a second time. This means a malicious YAML that causes the first-pass decoder to consume significant CPU (e.g., extremely long scalar strings) gets two chances to do so. The depth/node count limits reduce but don't fully eliminate this risk. A minor concern since file size is bounded at 64KB, but worth noting.
|
||||
if err := checkYAMLDepth(child, depth+1, maxDepth); err != nil {
|
||||
return err
|
||||
|
||||
[NIT] The design doc mentions updating the embed directive to include both formats, but the code embeds only *.yaml (and built-in JSON files were removed). Consider updating the doc to reflect the final decision to embed YAML only.