Edgecraft 10-Year Vision: Detailed Specs

How to do each proposal RIGHT — and what WRONG looks like.


Proposal 1: The Rosetta Engine — Cross-Domain Pattern Discovery

How to Do It Right

The core insight that must drive everything: Cross-domain parallels are only valuable when they're specific enough to change behavior. "Grip tension is bad in both shooting and pickleball" is obvious. "The fix for both is identical — lower the body instead of squeezing the hand — because the root cause is the same: the brain defaults to hand tension when the real problem is postural instability" is a Rosetta insight. The transfer mechanism (the WHY) is what makes it useful.

Step 1 — Build structural fingerprints (not semantic similarity)

The naive approach is to embed all 317 skill descriptions and compute cosine similarity. This produces garbage. "Grip pressure" in shooting and "grip pressure" in pickleball will match on surface semantics, but so will "shot selection" in pickleball and "shot placement" in hunting — a meaningless match.

Instead, fingerprint each skill on its structural properties:

  • Diagnostic shape: What kinds of root causes does it have? (motor-control, decision-making, emotional-regulation, information-processing, timing, spatial-awareness). Categorize every root cause into ~10 universal failure modes.
  • Progression arc: Does the skill progress via volume → precision → speed → adaptation? Or via unconscious-incompetence → conscious-incompetence → conscious-competence → unconscious-competence? Different progression arcs indicate different skill types.
  • Edge type distribution: Mostly "conventional wisdom is wrong"? Mostly "hidden causal lever"? This reveals whether the skill is counter-intuitive (conventional wisdom wrong) or under-explored (hidden levers).
  • DAG position: Is it a root skill (no prerequisites), a mid-chain skill, or a terminal skill? Root skills tend to be motor or perceptual fundamentals. Terminal skills tend to be integrative/strategic. These map across domains.
  • Fix modality: Are the fixes primarily about changing physical mechanics, changing mental models, changing decision processes, or changing environmental setup?

Two skills match when they share structural DNA — same failure modes, same progression arc, same fix modalities — even if the surface content is completely different.
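A minimal sketch of what a structural fingerprint and its comparison could look like. The field names, the equal weighting, and the use of Jaccard overlap are assumptions for illustration, not part of the spec; edge type distribution is left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    """Structural properties of one skill (field names are hypothetical)."""
    skill_id: str
    domain: str
    failure_modes: frozenset   # e.g. {"motor-control", "timing"}
    progression_arc: str       # e.g. "volume>precision>speed>adaptation"
    fix_modalities: frozenset  # e.g. {"physical-mechanics"}
    dag_position: str          # "root", "mid", or "terminal"


def jaccard(a: frozenset, b: frozenset) -> float:
    """Set overlap |a & b| / |a | b|; two empty sets score 0.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def structural_similarity(x: Fingerprint, y: Fingerprint) -> float:
    """Equal-weight average of four structural features (weights are an assumption)."""
    parts = [
        jaccard(x.failure_modes, y.failure_modes),
        jaccard(x.fix_modalities, y.fix_modalities),
        1.0 if x.progression_arc == y.progression_arc else 0.0,
        1.0 if x.dag_position == y.dag_position else 0.0,
    ]
    return sum(parts) / len(parts)
```

Note that nothing here looks at the skill's name or description, which is the point: two skills can score high with no shared vocabulary.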

Step 2 — Generate transfer insights with the "bridge mechanism"

For each high-structural-similarity pair, the transfer insight must include:

  1. The shared root cause — what's actually the same underneath the surface difference
  2. The bridge mechanism — WHY the same fix works in both domains (the causal explanation)
  3. The specific transfer — what a practitioner of domain A can literally do differently because of what they learned from domain B

Template:

In [domain A], [symptom A] is caused by [root cause].
In [domain B], [symptom B] is caused by the same root cause: [shared mechanism].
The fix transfers: [specific technique from A] works in [B] because [bridge mechanism].
A [domain B] practitioner can apply this by: [specific action].

Step 3 — Synthesize meta-patterns bottom-up, not top-down

Don't start with a list of meta-patterns and try to sort skills into them. Start with the pairwise matches and cluster bottom-up. Groups of 5+ skills from 3+ domains that share the same structural fingerprint = a meta-pattern. Name it after the shared mechanism, not the surface skill.

Good meta-pattern names: "Postural stability as proxy for extremity control," "Decision fatigue masquerades as skill regression," "Volume before precision in motor learning"

Bad meta-pattern names: "Grip skills," "Pressure skills," "Timing skills" (too vague to be actionable)
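One way the bottom-up clustering could be sketched: treat each high-similarity pair as an edge, take connected components, and keep only groups that meet the 5+ skills / 3+ domains bar. Connected components are a deliberate simplification (they can chain loosely related skills together), and `pairs` and `domain_of` are hypothetical inputs.

```python
from collections import defaultdict


def cluster_pairs(pairs):
    """Group skills into connected components of the pairwise-match graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for skill in list(parent):
        groups[find(skill)].add(skill)
    return list(groups.values())


def meta_patterns(pairs, domain_of, min_skills=5, min_domains=3):
    """Keep only clusters with enough skills drawn from enough domains."""
    keep = []
    for group in cluster_pairs(pairs):
        domains = {domain_of[skill] for skill in group}
        if len(group) >= min_skills and len(domains) >= min_domains:
            keep.append(group)
    return keep
```

Naming the surviving clusters after their shared mechanism remains an editorial step; the code only decides which groups are large and diverse enough to deserve a name.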

Step 4 — The "What else does this teach me?" UX

On each skill page, show 3-5 cross-domain matches. But don't show ALL matches — show the ones with the most surprising bridge mechanism. Rank by: (a) structural similarity score, (b) surface dissimilarity (more different domains = more surprising), (c) specificity of the transfer insight. The goal is "I would never have made this connection myself."
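A rough sketch of the ranking, assuming each candidate match already carries three scores in the range 0 to 1. The key names and the choice to multiply the scores (so a near-zero on any one criterion sinks the match) are assumptions, not a prescribed formula.

```python
def rank_matches(matches, limit=5):
    """Return the top `limit` matches by a combined surprise score."""
    def surprise(match):
        return (
            match["structural"]
            * match["surface_dissimilarity"]
            * match["specificity"]
        )

    return sorted(matches, key=surprise, reverse=True)[:limit]
```

With multiplication, a same-name match (low surface dissimilarity) and a forced match (low structural similarity) both fall to the bottom, which mirrors the two failure modes described under "What WRONG Looks Like."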

What WRONG Looks Like

Wrong: Surface-level semantic matching

"Grip pressure" (pickleball) matches "Grip pressure" (shooting)
Transfer: "Both require proper grip"

This is obvious. Nobody learns anything. The system becomes a thesaurus, not a discovery engine. You spent 50,000 comparisons to tell people things they already knew.

How to detect it: If >50% of matches are between skills with the same name or obvious synonyms, the matching is too shallow. If the transfer insight could be written without reading either skill file, it's vacuous.

Wrong: Forced connections

"Email marketing" (marketing) matches "Electronic caller use" (hunting)
Transfer: "Both use electronic tools to attract targets"

This is technically true and completely useless. The structural similarity is zero — they share no failure modes, no progression arcs, no fix modalities. The match was forced because the algorithm found surface-level word overlap ("electronic," "attract").

How to detect it: If the bridge mechanism requires a metaphor to explain (rather than a shared causal chain), the match is forced. If a domain expert would roll their eyes, it's wrong.

Wrong: Too many matches, no curation

Showing 20 cross-domain matches per skill with similarity scores but no editorial judgment. The user drowns in low-confidence matches and stops trusting any of them.

How to detect it: If the average user reads 0 transfer insights because there are too many to scan, curation failed. Max 5 per skill, ranked by surprise value.

Wrong: Meta-patterns that are just category labels

Meta-pattern: "Motor Skills"
Contains: grip pressure, trigger control, paddle angle, calling sequences, mouse clicking

This groups things by surface category, not by shared mechanism. A meta-pattern should be a discovery — something you didn't know these skills had in common.

How to detect it: If you could have named the meta-pattern before doing any analysis, it's a label, not a pattern.

Success Criteria

  • 70%+ of transfer insights pass the "would a domain expert find this genuinely interesting?" test
  • Each meta-pattern contains skills from 3+ domains
  • <10% of shown matches are obvious/trivial connections
  • The system discovers at least 5 connections that surprise even Charles (the person who built all 7 domains)

Proposal 2: The Deep Audit — Systematic Quality Scoring & Gap Detection

How to Do It Right

The core insight that must drive everything: A quality score is only useful if it's calibrated — if a score of 80 means the same thing across all domains and all skills. An uncalibrated score is worse than no score because it creates false confidence.

Step 1 — Build the rubric on exemplars, not theory

Before scoring anything, identify the 5 best skill files and the 5 worst skill files across the entire system (by reading them, not by metrics). Use these as calibration anchors:

  • The best files = what 95+ looks like
  • The worst files = what 20 looks like
  • Build the rubric so these exemplars land at those scores

For each of the 12 dimensions, define what 0, 25, 50, 75, and 100 look like with concrete examples from actual skill files. The rubric is not abstract — it's grounded in "skill X's diagnostic section is a 90 because it has 5 entries, all with specific fixes and coaching cues. Skill Y's is a 30 because it has 1 entry with a generic fix ('practice more')."

Step 2 — Score in passes, not all at once

Don't try to evaluate 12 dimensions simultaneously. Do 3 passes:

Pass 1: Structural scan (fast, objective, automatable)

  • Sections present/missing (parse the markdown headers)
  • Counts: number of diagnostics, edges, coaching cues, sources
  • Prerequisite count, level assignment
  • Word count per section
  • This pass can cover all 317 skills in one session
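The structural scan could start as small as this. It assumes skill files use `##` headers for sections and `### Symptom:` lines for diagnostics (the shape shown in the rubric-gaming example later in this proposal); the list of required section names is hypothetical.

```python
import re

# Hypothetical section names; substitute the real skill-file template.
REQUIRED_SECTIONS = ["Diagnostic Tree", "Progression", "Coaching Cues", "Sources"]


def structural_scan(markdown: str) -> dict:
    """Collect section presence, word counts, and diagnostic counts for one file."""
    sections = {}
    current = None
    for line in markdown.splitlines():
        header = re.match(r"^##\s+(.+?)\s*$", line)  # matches "## X", not "### X"
        if header:
            current = header.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)

    return {
        "missing": [s for s in REQUIRED_SECTIONS if s not in sections],
        "word_counts": {
            name: len(" ".join(body).split()) for name, body in sections.items()
        },
        "diagnostic_count": sum(
            1
            for line in sections.get("Diagnostic Tree", [])
            if line.startswith("### Symptom:")
        ),
    }
```

Everything this pass reports is a count or a presence check. It says nothing about whether the content is any good, which is what Pass 2 is for.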

Pass 2: Content quality (slow, requires reading)

  • Diagnostic specificity: "practice more" vs. "move your contact point forward 2 inches"
  • Progression distinctness: are 4 levels genuinely different or restatements?
  • Coaching cue memorability: would a coach actually say this in the moment?
  • Source quality: primary sources vs. secondary/unsourced claims
  • This pass should be done domain-by-domain, ~50 skills per session

Pass 3: Graph integrity (analytical, requires cross-referencing)

  • Level consistency with prerequisite depth
  • Content cross-references vs. graph edges (do skills mention each other without being connected?)
  • Redundancy detection (skills with >60% content overlap)
  • This pass is per-domain, examining the full graph structure
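A sketch of the level-consistency check, under one possible reading of it: a skill should not sit at a lower level than any of its own prerequisites. The `prereqs` and `levels` shapes are assumptions, and the graph is assumed to be acyclic with every prerequisite listed as a key.

```python
from functools import lru_cache


def prerequisite_depths(prereqs: dict) -> dict:
    """Depth = length of the longest prerequisite chain below a skill (roots are 0)."""
    @lru_cache(maxsize=None)
    def depth(skill):
        parents = prereqs.get(skill, ())
        return 0 if not parents else 1 + max(depth(p) for p in parents)

    return {skill: depth(skill) for skill in prereqs}


def level_inconsistencies(prereqs: dict, levels: dict) -> list:
    """Return (skill, prerequisite) pairs where the skill's level is below its prerequisite's."""
    return sorted(
        (skill, parent)
        for skill, parents in prereqs.items()
        for parent in parents
        if levels[skill] < levels[parent]
    )
```

Each flagged pair is a proposal for human review, not an automatic level change.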

Step 3 — The raw-to-skill matcher must account for diminishing returns

Not all raw files are equally valuable. A 30KB transcript from an elite practitioner is worth more than a 2KB motivational clip. The matcher should output not just "this file matches skill X" but "this file would likely add [N] new diagnostics and [M] new coaching cues to skill X, raising its score from 45 to ~65." The prioritization is by expected quality delta, not by topic match.

Step 4 — The health dashboard must show trajectories, not snapshots

A skill scoring 60 today that scored 40 last month is on a good trajectory. A skill scoring 75 that hasn't changed in 3 months is stuck. The dashboard should show:

  • Current score per dimension
  • Score change since last audit
  • Ingestion priority rank
  • Which raw files would improve it most

Step 5 — Auto-enrichment must be conservative

When cross-pollinating content from one skill to another, the bar is HIGH. Only transfer content when:

  • The source skill has a diagnostic that directly addresses a gap in the target skill
  • The insight is domain-appropriate (don't transfer pickleball mechanics to marketing)
  • The transfer is cited properly ("Adapted from [source skill] in [source domain]")

Never auto-generate new content. Only transplant existing content with proper attribution.

What WRONG Looks Like

Wrong: Rubric gaming

Skill file has all sections present (score: 90)
But "## Diagnostic Tree" contains only: "### Symptom: Not performing well"
With fix: "Practice more and focus on fundamentals"

The structural scan gives high marks because the section exists. But the content is worthless. This is the "teaching to the test" problem — if the rubric rewards section existence over section quality, people (or AI) will fill in headers with garbage to boost scores.

How to prevent it: Pass 2 (content quality) must be weighted more heavily than Pass 1 (structural). A skill with only 3 sections, all with brilliant content, should outscore a skill with every section present but filled with generic filler. The content quality pass should have specific anti-patterns to detect:

  • "Practice more" or "focus on fundamentals" in a fix → score 0 for that diagnostic
  • Progression levels that differ only by adjective ("good" → "very good" → "excellent") → score 0 for progression
  • Coaching cues longer than 15 words → penalize (real coaching cues are short)
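Two of these anti-patterns are mechanical enough to check with a few lines. This is a sketch only: the phrase list comes straight from the bullets above, while the function names and the idea of passing in a `base_score` are assumptions.

```python
import re

# Phrases that mark a fix as generic filler.
GENERIC_FIX = re.compile(r"practice more|focus on fundamentals", re.IGNORECASE)


def diagnostic_score(fix_text: str, base_score: int) -> int:
    """Zero out the score for a diagnostic whose fix is generic."""
    return 0 if GENERIC_FIX.search(fix_text) else base_score


def cue_is_too_long(cue: str, max_words: int = 15) -> bool:
    """Flag coaching cues longer than `max_words` words."""
    return len(cue.split()) > max_words
```

The adjective-only progression check is harder to automate reliably and probably belongs with the human read in Pass 2.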

Wrong: False precision in quality scores

Skill X: 73.4/100
Skill Y: 74.1/100

Nobody can distinguish 73.4 from 74.1 in content quality. False precision creates false confidence in rankings. Two skills within 5 points of each other are effectively tied.

How to prevent it: Report scores in bands: Red (0-40), Yellow (41-60), Light Green (61-80), Green (81-100). Use precise scores internally for ranking but present bands to the user.
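The banding itself is a few comparisons. A minimal sketch follows; how to treat fractional scores that fall between bands (for example 40.5) is not specified above, so this version assumes each band's upper bound is inclusive and anything above it moves to the next band.

```python
def band(score: float) -> str:
    """Map a 0-100 quality score to its display band."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score <= 40:
        return "Red"
    if score <= 60:
        return "Yellow"
    if score <= 80:
        return "Light Green"
    return "Green"
```

Under this mapping, the 73.4 and 74.1 from the example both display as Light Green, which is the intended effect.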

Wrong: Raw-to-skill matcher recommending irrelevant files

"Ingest 2025-03-18 - The $10M AI SaaS Playbook.md → it matches 'hooks' skill"
(Because the transcript mentions "hooks" in the context of video hooks, not marketing hooks as the skill defines them)

Surface keyword matching produces garbage recommendations. The matcher must understand the semantic scope of each skill, not just its keywords.

How to prevent it: The matcher should explain WHY a file matches: "This file contains 3 specific diagnostic chains about hook failures (audience doesn't stop scrolling → hook is statement not question → rewrite as open loop). These would add to the Diagnostic Tree section which currently has 0 entries." If the explanation doesn't hold up, the match is wrong.

Wrong: Graph validator proposing too many changes

"Restructure entire pickleball domain: 23 level changes, 14 new edges, 8 merges"

A validator that proposes wholesale restructuring is useless because no human can review 45 changes at once. The result: all changes get approved without scrutiny, or all get rejected as too risky.

How to prevent it: Cap proposals at 5 per domain per pass. Rank by confidence. Present the highest-confidence changes first. Only propose the next batch after the first is approved and applied.

Success Criteria

  • Every skill has a quality score, and a human reviewer agrees with it (within one band) for 90%+ of skills
  • The graph validator finds at least 3 genuine structural issues per domain (proving it's not just rubber-stamping)
  • The raw-to-skill matcher's top-5 recommendations per skill are relevant (verified by reading the first page of each file)
  • Quality scores are reproducible — running the audit twice produces the same results

Proposal 3: The Diagnostic Engine — Conversational Skill Doctor

How to Do It Right

The core insight that must drive everything: A diagnostic engine is only as good as its discrimination — its ability to narrow from "something is wrong" to "THIS specific thing is wrong." A system that returns 15 possible root causes is a search engine, not a doctor. A doctor asks discriminating questions that cut the possibility space in half with each answer.

Step 1 — Build the symptom taxonomy BEFORE the decision trees

The raw diagnostics use inconsistent language. One skill says "ball goes long." Another says "shots sailing past the baseline." Another says "hitting too deep." These are the same symptom described three different ways. Before building any decision logic, cluster all 1,149 symptom descriptions into canonical symptoms.

The taxonomy should be:

Level 0: Domain (pickleball)
Level 1: Observable category (ball flight, body mechanics, timing, strategy, mental)
Level 2: Specific symptom ("ball goes long on dinks")
Level 3: Diagnostic entries (3 different root causes for "ball goes long on dinks")

Step 2 — Decision trees must use OBSERVABLE discriminators

A discriminating question must be answerable by the practitioner through direct observation — not through expert analysis. The practitioner doesn't know their wrist angle; they DO know whether it happens on forehand or backhand.

Good discriminators: "Does it happen on forehand or backhand?" / "Is it worse when you're tired?" / "Does it happen on the first shot or only after rallies?" / "Are you aware of it while it's happening or only after?"

Bad discriminators: "Is your contact point anterior to your center of mass?" / "Are you pronating early?" / "Is your swing path inside-out?" These require expertise the practitioner doesn't have — if they knew these things, they wouldn't need the diagnostic engine.

Step 3 — The leverage point analysis must account for fix difficulty

A root cause that appears in 7 symptoms but requires 6 months of retraining is less actionable than a root cause appearing in 3 symptoms that can be fixed in one practice session. The leverage score should be:

leverage = (symptom_count * severity_weight) / fix_difficulty

Where fix_difficulty is estimated from the progression data (how long does it take to move from "doing it wrong" to "doing it right" at the relevant level?).
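To make the trade-off concrete, here is the formula as a function, applied to the two root causes described above. The unit for `fix_difficulty` (practice sessions) and the figure of 48 sessions for six months of retraining are assumptions chosen for illustration.

```python
def leverage(symptom_count: int, severity_weight: float, fix_difficulty: float) -> float:
    """leverage = (symptom_count * severity_weight) / fix_difficulty"""
    if fix_difficulty <= 0:
        raise ValueError("fix_difficulty must be positive")
    return symptom_count * severity_weight / fix_difficulty


# Assumed unit: practice sessions. Six months at ~2 sessions/week is roughly 48.
slow_hub = leverage(symptom_count=7, severity_weight=1.0, fix_difficulty=48)
quick_win = leverage(symptom_count=3, severity_weight=1.0, fix_difficulty=1)
```

With equal severity, the one-session fix scores 3.0 against roughly 0.15 for the six-month fix, so it ranks first despite touching fewer symptoms.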

Step 4 — The conversational interface must feel like a COACH, not a search engine

The interaction pattern:

  1. User describes problem in their words
  2. System maps to 2-3 most likely canonical symptoms (shows them for confirmation)
  3. User confirms or clarifies
  4. System asks 1-2 discriminating questions
  5. System presents the diagnosis: root cause, fix, coaching cue, source
  6. System asks: "Did this help? Is this the right problem?" If no, branch to next-most-likely root cause

The tone should be direct and coaching-like: "Sounds like your dinks are popping up. Quick question — is this happening more on your backhand side or forehand? ... Backhand. Got it. The most common cause is grip tension — you're squeezing instead of guiding. Try this: before your next dink, consciously loosen your bottom three fingers. The coaching cue to remember: 'Dead fish grip.'"

Step 5 — Build the diagnostic graph to reveal systemic patterns

The bipartite graph (symptoms → root causes) reveals clusters that aren't visible in individual skill files:

  • Hub root causes: Root causes connected to 5+ symptoms. These are systemic — fixing them cascades through multiple skills.
  • Orphan symptoms: Symptoms with only one possible root cause. These are easy wins — the diagnosis is unambiguous.
  • Symptom clusters: Groups of symptoms that always share the same root cause. If a practitioner has one symptom from the cluster, they likely have all of them.

Visualize this as a network graph where node size = frequency and edge thickness = co-occurrence.
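Hubs and orphans fall out of two lookups over the edge list. A minimal sketch, assuming the diagnostics have already been flattened into (symptom, root cause) pairs using canonical symptom names; symptom clusters need a further grouping step that is not shown.

```python
from collections import defaultdict


def analyze_graph(edges, hub_threshold=5):
    """Return (hub root causes, orphan symptoms) from (symptom, root_cause) pairs."""
    causes_of = defaultdict(set)    # symptom -> its possible root causes
    symptoms_of = defaultdict(set)  # root cause -> symptoms it explains
    for symptom, cause in edges:
        causes_of[symptom].add(cause)
        symptoms_of[cause].add(symptom)

    hubs = sorted(c for c, s in symptoms_of.items() if len(s) >= hub_threshold)
    orphans = sorted(s for s, c in causes_of.items() if len(c) == 1)
    return hubs, orphans
```

The 5+ threshold for hubs comes from the definition above; it is a parameter so it can be tuned per domain.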

What WRONG Looks Like

Wrong: A flat searchable list

Search: "shots going long"
Results: 23 diagnostics containing "long" or "past" or "deep"

This is just ctrl+F with extra steps. It doesn't diagnose anything — it dumps 23 possibilities on the user and says "figure it out yourself." The user is exactly where they started, except now they're overwhelmed by options.

How to detect it: If the system's output for any query is more than 5 entries without ranking or narrowing, it's a search engine, not a diagnostic engine. The whole point is NARROWING.

Wrong: Decision trees that are too deep

Q1: Forehand or backhand?
Q2: Fast or slow?
Q3: Net or baseline?
Q4: Early or late in the rally?
Q5: Against hard hitters or soft hitters?
Q6: When you're winning or losing?
Q7: ...

By question 4, the user has lost patience. They wanted a quick answer, not a 20-question game. Each question must meaningfully cut the possibility space. If a question doesn't eliminate at least 30% of remaining possibilities, it shouldn't be asked.

How to detect it: If the average path through a decision tree is >4 questions, it's too deep. Redesign the tree with higher-discrimination questions at the top.
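The 30% rule can be checked per question before it ever reaches a user. The sketch below measures the worst case (the answer that leaves the most candidates standing), which is one reasonable reading of "eliminate at least 30%"; the input shape is hypothetical.

```python
def worst_case_elimination(candidates_by_answer: dict, total: int) -> float:
    """Fraction of candidates removed by the least helpful answer."""
    largest_remaining = max(len(c) for c in candidates_by_answer.values())
    return 1 - largest_remaining / total


def worth_asking(candidates_by_answer: dict, total: int, min_elimination: float = 0.30) -> bool:
    """A question earns its place only if every answer prunes enough candidates."""
    return worst_case_elimination(candidates_by_answer, total) >= min_elimination
```

For example, with 10 remaining root causes, a forehand/backhand question that splits them 6 and 4 eliminates at least 40% and passes; a question that splits them 9 and 1 eliminates only 10% in the common case and should be dropped or moved lower in the tree.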

Wrong: Leverage points that are trivially obvious

"Top leverage point for pickleball: Practice more"
"Top leverage point for shooting: Dry fire daily"

These are correct but useless. Every coach says them. The leverage point analysis should reveal specific, non-obvious root causes, such as "foot placement in the ready position is the root cause of 40% of kitchen play errors, but it's addressed in 0% of coaching conversations because coaches focus on paddle angle."

How to detect it: If the leverage points could have been listed without the analysis (by any experienced practitioner), the analysis isn't revealing anything. The test: would a domain expert say "huh, I hadn't thought about it that way"?

Wrong: The conversational interface invents diagnoses

User: "My serve keeps going into the net"
System: "This might be because you're not tossing high enough. Try tossing 6 inches higher."
(No diagnostic entry exists for this — the system hallucinated advice)

The system must ONLY return diagnoses that exist in the knowledge base. If no matching diagnostic exists, the correct response is: "I don't have a specific diagnostic for that symptom yet. The closest matches are [X, Y]. Would either of these describe your situation?"

How to detect it: Every diagnosis shown to the user must be traceable to a specific diagnostic entry with a specific skillId and source. If the system outputs advice without a source citation, it's inventing.
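A guard of this kind could sit between the model and the user. This is a sketch with a hypothetical knowledge-base shape (entries keyed by ID, each with `skillId`, `source`, and `fix` fields); only `skillId` is named in the text above.

```python
NO_MATCH = "I don't have a specific diagnostic for that symptom yet."


def render_diagnosis(entry_id: str, knowledge_base: dict) -> str:
    """Return advice only when it traces to a stored entry with a skillId and source."""
    entry = knowledge_base.get(entry_id)
    if entry is None or not entry.get("skillId") or not entry.get("source"):
        return NO_MATCH
    return f"{entry['fix']} (skill: {entry['skillId']}, source: {entry['source']})"
```

Because the function renders from the stored entry rather than from generated text, advice with no backing entry cannot reach the user through this path.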

Success Criteria

  • Average path to diagnosis: 2-3 questions (not 1, not 7)
  • 80%+ of users who complete the diagnostic flow report the diagnosis was relevant
  • Every diagnosis links to source material (never hallucinated)
  • Leverage point analysis reveals at least 2 non-obvious root causes per domain that surprise domain practitioners
  • The system gracefully handles "no match" — it says so honestly instead of forcing a bad match

Proposal 4: The Portfolio Brain — Edgecraft as Intelligence Layer for All 6 Products

How to Do It Right

The core insight that must drive everything: The value of connecting Edgecraft to products is NOT in automating decisions — it's in enriching decision context. A product that makes decisions without human judgment is brittle. A product that surfaces relevant knowledge at decision time makes the human smarter. The knowledge packs should be "here's what the domain expert knows about this situation," not "here's what to do."

Step 1 — Map at the decision-point level, not the domain level

"GunDealAlerts uses the marketing domain" is too vague to be useful. The mapping must be at the specific decision-point level:

GunDealAlerts: "Is this deal actually good?"
├── Decision point: Price vs. historical average
│   └── Edgecraft: pricing-strategy → "price-too-low-prevents-purchase" edge
│       Context: A price SO far below normal triggers "what's wrong with it?" —
│       even a great deal can look suspicious if the discount is extreme
│
├── Decision point: Deal urgency assessment
│   └── Edgecraft: offer-design → "urgency vs scarcity" framework
│       Context: Time-limited deals (urgency) vs quantity-limited deals (scarcity)
│       work through different mechanisms — the alert should distinguish them
│
├── Decision point: Community engagement prediction
│   └── Edgecraft: hooks → "pattern interrupts" edge
│       Context: Deals with surprising or unusual elements get shared more —
│       flag these for higher promotion priority

Each mapping must answer: "At this specific decision point, what specific knowledge from Edgecraft would change the decision?"

Step 2 — Decision rules must be ADVISORY, not AUTOMATED

Extract rules as structured context, not as if/then automation:

Right:

{
  "decision_point": "deal_score_modifier",
  "knowledge": "Prices below the credibility floor can actually reduce conversions",
  "source_skill": "pricing-strategy",
  "source_edge": "price-too-low-prevents-purchase",
  "implication": "Deals >60% off historical average may need a 'legitimacy flag': surface evidence that the deal is real (retailer reputation, time-limited clearance reason)",
  "confidence": "high (supported by Sutherland's research on Nespresso pricing)"
}

Wrong:

{
  "rule": "if discount_pct > 60 then score -= 20",
  "source": "pricing-strategy"
}

The first enriches human judgment. The second replaces it with a brittle rule that will be wrong in many contexts (e.g., clearance sales on discontinued models genuinely are 60%+ off).

Step 3 — The feedback loop must capture SURPRISE, not just outcomes

The most valuable feedback is when the product's output surprises the user — positively or negatively. A CritterScout stand that scored poorly but produced a great hunt is MORE valuable as feedback than a high-scoring stand that performed as expected. The surprise is where the knowledge gaps live.

Design the feedback capture around:

  • "Was this better or worse than expected?" (not "was this good or bad?")
  • "What happened that the system didn't predict?"
  • These responses map directly to new diagnostic entries: symptom = "system predicted X," root cause = "didn't account for Y," fix = "add factor Y to the model."

Step 4 — Start with ONE product, prove the architecture, then expand

Don't try to connect all 6 products at once. Pick the product where the domain knowledge is deepest and the decision points are clearest. Recommendation: CritterScout + hunting domain (67 skills, 144 diagnostics, clear decision points around stand evaluation).

Build the full stack for one product:

  • Decision-point mapping (5-10 key decisions in CritterScout)
  • Knowledge pack generation (hunting domain → stand evaluation context)
  • Feedback capture UI ("how was this stand?")
  • One cycle of feedback → diagnostic entry → knowledge pack update

Then replicate the architecture for the other 5 products.

What WRONG Looks Like

Wrong: Over-automation — the product makes decisions humans should make

CritterScout auto-downgrades a stand from A to C because the hunting domain's
"wind diagnostic" says wind from the south is bad. But the user KNOWS the wind
will shift by afternoon.

The knowledge pack told the product to penalize southern wind. The product blindly applied the penalty. The user, who has context the system doesn't, gets a bad recommendation.

How to prevent it: Knowledge packs provide CONTEXT, not RULES. The product surfaces: "Note: wind from the south may compromise this stand (hunting domain: wind-setup diagnostic). Consider afternoon wind shift patterns." The human decides.

Wrong: Knowledge packs that are too abstract to be useful

GunDealAlerts knowledge pack:
"Consider the buyer's perceived value equation when scoring deals."

This is a reminder, not actionable knowledge. The product developer reading this gains nothing they couldn't get from a fortune cookie.

How to detect it: If a knowledge pack entry doesn't contain a specific scenario, specific knowledge, and a specific implication, it's too abstract. Every entry should pass the test: "A product developer who reads this will change one line of code or one scoring weight."

Wrong: Connecting products to domains where the knowledge doesn't actually help

GradeOptimizer ← sports-betting domain
"Betting on student performance using odds models"

This is a forced connection. The sports-betting domain's knowledge about market efficiency doesn't transfer to predicting student engagement. Not every product needs every domain. The mapping matrix should have explicit "NO — these don't connect" entries, not just "YES" entries.

How to prevent it: For each proposed connection, state the specific decision it improves and the specific knowledge that changes it. If you can't fill both in concretely, the connection doesn't exist.

Wrong: Feedback loops that capture noise, not signal

"How was this deal?" → 1-5 star rating
Result: 4.2 average across all deals. Useless.

A 1-5 rating captures satisfaction, not surprise. Satisfaction is driven by external factors (did the user actually need the item, was shipping fast). Surprise is driven by model accuracy (did the system correctly predict this was a good deal).

How to prevent it: Ask about expectation violation, not satisfaction. "Was this deal better, worse, or about as good as our score predicted?" The delta between prediction and outcome is the learning signal.

Success Criteria

  • One product (CritterScout) is fully connected with measurable decision improvement
  • Knowledge packs contain zero abstract/generic entries — every entry is specific and actionable
  • Feedback loop has captured at least 10 "surprise" entries that map to new diagnostics
  • The architecture document is clear enough that connecting the next product takes 1/3 the time

Proposal 5: The Mastery Compiler — Prerequisite-Gated Personalized Learning Paths

How to Do It Right

The core insight that must drive everything: A learning path is only valuable if the practitioner trusts it — if they believe the sequence is optimal and the benchmarks are real. Trust comes from transparency: showing WHY skill A must come before skill B, and HOW the benchmark was chosen. A black-box "do this next" is as untrustworthy as a random curriculum.

Step 1 — Benchmark extraction must distinguish assessable from aspirational

Reading all 317 skill files' progression levels will produce two kinds of benchmarks:

Assessable benchmarks (the practitioner can self-evaluate):

  • "Can hold a 10-ball dink rally without a forced error"
  • "Dry fire draw-to-first-shot under 1.2 seconds"
  • "Can explain the three components of the value equation from memory"

Aspirational benchmarks (require external observation or extended tracking):

  • "Demonstrates consistent shot quality under match pressure"
  • "Naturally incorporates advanced footwork without conscious thought"
  • "Recognized as a trusted advisor by 3+ clients"

The system must tag each benchmark as assessable or aspirational. Only assessable benchmarks drive mastery gating. Aspirational benchmarks are shown for context ("this is what Level 4 looks like") but don't lock/unlock skills.

Step 2 — The path must show PARALLEL opportunities, not just a linear sequence

Most prerequisite DAGs are not linear chains — they're partial orders with many valid sequences. If skill C requires both A and B, the practitioner can work on A and B simultaneously. The path compiler should output:

Week 1-2: Work on A and B in parallel (no dependencies between them)
Week 3-4: Start C (A and B are prerequisites — check benchmarks first)
Week 4-6: Work on C and D in parallel (D has different prerequisites)

Not:

Step 1: Master A
Step 2: Master B
Step 3: Master C
Step 4: Master D

The linear sequence takes longer than necessary because it serializes skills that could be practiced side by side, and it doesn't reflect the actual graph structure. This is where the topological sort with parallelization matters.
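A layered topological sort (Kahn's algorithm, grouped by round) is one way to produce those parallel groups. A minimal sketch, assuming every prerequisite also appears as a key in `prereqs`; a skill whose prerequisite is missing from the keys would be reported as a cycle.

```python
def parallel_groups(prereqs: dict) -> list:
    """Return layers of skills; each skill's prerequisites sit in earlier layers."""
    remaining = {skill: set(parents) for skill, parents in prereqs.items()}
    layers = []
    while remaining:
        # Skills with no unmet prerequisites can all be practiced in parallel.
        ready = sorted(s for s, parents in remaining.items() if not parents)
        if not ready:
            raise ValueError("cycle (or unknown prerequisite) in the graph")
        layers.append(ready)
        for skill in ready:
            del remaining[skill]
        for parents in remaining.values():
            parents.difference_update(ready)
    return layers
```

For `{"A": [], "B": [], "C": ["A", "B"], "D": ["A"]}` this yields two layers, A and B together followed by C and D together, instead of four sequential steps. Python's standard library also ships `graphlib.TopologicalSorter`, which supports the same ready-set pattern.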

Step 3 — Time estimates must be ranges with confidence levels, not point estimates

Right: "Level 1→2 for third-shot-drop: 2-4 weeks of focused practice (based on
       progression data describing 'regular drilling' at this level + common plateau
       'muscle memory of wrist flip takes 100+ reps to override')"

Wrong: "Level 1→2 for third-shot-drop: 3 weeks"

Point estimates create frustration when the practitioner takes longer. Ranges with the reasoning behind them create realistic expectations and show the system's work.

Step 4 — Spaced reinforcement must be LIGHTWEIGHT, not burdensome

The biggest risk with spaced repetition in a skill context (vs. flashcards) is that "reviewing" a physical skill takes 15-30 minutes of practice, not 30 seconds of recalling a fact. If the scheduler sends 5 review prompts per day, the practitioner spends their entire practice session on review with no forward progress.

Design constraints:

  • Max 1-2 review prompts per practice session
  • Reviews are specific coaching cues, not full skill re-practice: "Quick check: on your last 5 dinks, were your bottom three fingers loose? If yes, you're maintaining grip control."
  • Graduated intervals: 1 day → 3 days → 7 days → 14 days → 30 days → done (skill is consolidated)
  • Motor skills need shorter initial intervals (physical memory decays fast) than analytical skills (conceptual understanding persists longer)
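The interval ladder and the per-session cap could be sketched as follows. The ladder and the cap of 2 come from the constraints above; the function names, the "most overdue first" ordering, and the single shared ladder (rather than separate motor and analytical ladders) are assumptions.

```python
from datetime import date, timedelta

INTERVALS_DAYS = [1, 3, 7, 14, 30]  # graduated intervals from the constraints above
MAX_PROMPTS_PER_SESSION = 2


def next_review(last_review: date, completed_reviews: int):
    """Return the next review date, or None once the skill is consolidated."""
    if completed_reviews >= len(INTERVALS_DAYS):
        return None
    return last_review + timedelta(days=INTERVALS_DAYS[completed_reviews])


def prompts_for_session(due: dict, today: date, limit: int = MAX_PROMPTS_PER_SESSION) -> list:
    """Pick at most `limit` due skills, most overdue first; the rest wait."""
    overdue = sorted((due_date, skill) for skill, due_date in due.items() if due_date <= today)
    return [skill for _, skill in overdue[:limit]]
```

The cap is what keeps review from turning into homework: when four skills are due on the same day, two are prompted and the others roll to the next session.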

Step 5 — The adaptive path must explain WHY it's changing

When the system detects a plateau and adjusts the path, it must explain the reasoning:

Right: "You've been working on third-shot-drop for 3 weeks without hitting the
       Level 2 benchmark. The most common root cause is insufficient grip control
       (from the diagnostic engine). I'm adding 'grip-pressure' back into your
       active practice — it's a prerequisite you may need to revisit. Once your
       grip benchmark is solid, third-shot-drop typically unlocks within a week."

Wrong: "Path updated. New sequence: grip-pressure → third-shot-drop"

The explanation builds trust. The silent update feels arbitrary.

Step 6 — Multi-path comparison must have a clear recommendation with rationale

When multiple paths exist, don't just show them as equal options. Recommend one and explain why:

Path A: 8 skills, ~6 weeks, passes through 4 high-edge-density skills
        (you'll learn the most non-obvious insights along the way)
Path B: 6 skills, ~4 weeks, avoids the 2 skills with hardest plateaus
        (fastest route but misses some depth)

Recommended: Path A. The extra 2 weeks give you 4 additional edges that
compound into later skills. Path B is better if you're preparing for a
specific event in <5 weeks.
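
One way the recommendation rule could be sketched: prefer depth among the paths that fit the practitioner's deadline, and always return the rationale with the choice. The `CandidatePath` record, the `recommend` function, and Path B's high-edge count of 2 are hypothetical; the source only gives that figure for Path A.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CandidatePath:
    name: str
    skills: int
    weeks: int
    high_edge_skills: int  # count of high-edge-density skills on the path


def recommend(paths: List[CandidatePath],
              deadline_weeks: Optional[int] = None) -> Tuple[CandidatePath, str]:
    """Prefer depth (high-edge skills) among paths that fit the deadline."""
    fits = [p for p in paths
            if deadline_weeks is None or p.weeks <= deadline_weeks]
    if fits:
        # Most high-edge skills wins; shorter duration breaks ties.
        best = max(fits, key=lambda p: (p.high_edge_skills, -p.weeks))
        if deadline_weeks is None:
            why = f"passes through {best.high_edge_skills} high-edge-density skills"
        else:
            why = f"fits your {deadline_weeks}-week deadline"
    else:
        best = min(paths, key=lambda p: p.weeks)
        why = "no path fits the deadline, so this is the fastest available"
    text = (f"Recommended: {best.name} "
            f"({best.skills} skills, ~{best.weeks} weeks): {why}.")
    return best, text


paths = [
    CandidatePath("Path A", skills=8, weeks=6, high_edge_skills=4),
    CandidatePath("Path B", skills=6, weeks=4, high_edge_skills=2),  # assumed count
]
```

With no deadline this picks Path A; with a 4-week deadline it falls back to Path B and says why.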

What WRONG Looks Like

Wrong: Linear paths that ignore graph parallelism

Step 1: Master grip-pressure (2 weeks)
Step 2: Master paddle-angle (2 weeks)
Step 3: Master contact-point (2 weeks)
Step 4: Master third-shot-drop (3 weeks)
Total: 9 weeks

Correct: Work on grip-pressure + paddle-angle simultaneously (2 weeks)
         → contact-point (1 week, builds on both)
         → third-shot-drop (3 weeks)
Total: 6 weeks

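The linear path wastes 3 weeks, mostly because it serializes grip-pressure and paddle-angle, two skills that have no dependency between them; the parallel plan also budgets only 1 week for contact-point instead of 2. The path compiler MUST output parallel groups.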

How to detect it: If the graph contains skills with no dependency between them and the path length (in weeks) still equals the sum of all individual skill times, parallelization isn't happening. In that case the path should be shorter than the sum.
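
A sketch of the parallel grouping and the detection check, using the standard-library `graphlib` module. It uses the week estimates from the parallel plan above (contact-point at 1 week); the `prereqs` and `weeks` dictionaries and the function names are assumptions for illustration, not the path compiler's actual interface.

```python
from graphlib import TopologicalSorter

# Each skill maps to the skills it depends on.
prereqs = {
    "grip-pressure": [],
    "paddle-angle": [],
    "contact-point": ["grip-pressure", "paddle-angle"],
    "third-shot-drop": ["contact-point"],
}
weeks = {
    "grip-pressure": 2,
    "paddle-angle": 2,
    "contact-point": 1,
    "third-shot-drop": 3,
}


def parallel_groups(prereqs):
    """Return skills in waves; every skill in a wave can be practiced together."""
    sorter = TopologicalSorter(prereqs)
    sorter.prepare()
    groups = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        groups.append(ready)
        sorter.done(*ready)
    return groups


def path_weeks(prereqs, weeks):
    """Length of the longest dependency chain (the critical path), in weeks."""
    finish = {}
    for skill in TopologicalSorter(prereqs).static_order():
        start = max((finish[p] for p in prereqs.get(skill, ())), default=0)
        finish[skill] = start + weeks[skill]
    return max(finish.values())


def is_parallelized(prereqs, weeks):
    """Detection check: the path should be shorter than the serial sum."""
    return path_weeks(prereqs, weeks) < sum(weeks.values())
```

For this graph the first wave contains grip-pressure and paddle-angle together, and the critical path comes to 6 weeks against a serial sum of 8.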

Wrong: Benchmarks that are unmeasurable

"Level 2 benchmark: Demonstrates improved consistency"
Assessment: Can you do this? [Yes] [No] [Sometimes]

"Improved consistency" is not assessable. Compared to what? Measured how? The practitioner clicks "Sometimes" because they have no idea what they're measuring. The mastery score is noise.

How to prevent it: Every benchmark in the assessment framework must include a specific observable behavior and a quantity or frequency. If the source data doesn't provide one, the system should flag it as "needs enrichment" rather than using a vague benchmark.

Assessable: "Can execute 8 out of 10 dinks into the kitchen without a forced error"
Not assessable: "Demonstrates solid dink control"
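
A rough sketch of the flagging step. It only checks for a quantity or frequency using a regular expression; whether the behavior is specific and observable still needs human or model review. The pattern and the function names are assumptions, not the assessment framework's real rules.

```python
import re

# Heuristic: look for "8 out of 10", "8/10", "80%", "5 times", "3 in a row", etc.
QUANTITY = re.compile(
    r"\b\d+\s*(?:out of|/|of)\s*\d+\b"
    r"|\b\d+(?:\.\d+)?\s*%"
    r"|\b\d+\s+(?:times|reps|in a row|consecutive)\b",
    re.IGNORECASE,
)


def classify_benchmark(text):
    """Return 'assessable' if a quantity or frequency is present, else flag it."""
    return "assessable" if QUANTITY.search(text) else "needs enrichment"


def assessable_share(benchmarks):
    """Fraction of benchmarks that pass the heuristic."""
    passed = sum(1 for b in benchmarks if classify_benchmark(b) == "assessable")
    return passed / len(benchmarks)


benchmarks = [
    "Can execute 8 out of 10 dinks into the kitchen without a forced error",
    "Demonstrates solid dink control",
    "Demonstrates improved consistency",
]
```

A benchmark flagged "needs enrichment" is held back from the assessment rather than shown with a Yes/No/Sometimes prompt.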

Wrong: Spaced reinforcement that becomes homework

Day 1: Review grip-pressure (15 min drill)
Day 2: Review paddle-angle (15 min drill) + Review grip-pressure (quiz)
Day 3: Review contact-point (20 min drill) + Review paddle-angle (quiz)
Day 4: Review third-shot-drop (25 min drill) + Review contact-point + grip-pressure
...by Day 7, you're spending 90 minutes on review with no forward progress

Review load compounds and overwhelms forward progress. The practitioner stops using the system because it feels like busywork.

How to prevent it: Set a hard cap of 20% of practice time on review. If review load exceeds this, consolidate: "grip-pressure and paddle-angle are both reviewed during your normal warm-up dinking, so no separate review is needed." Group reviews by activity, not by individual skill.
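
A sketch of the cap plus activity grouping. It assumes each pending review is tagged with the practice activity that exercises it; those tags, the minute figures, and the `consolidate_reviews` name are hypothetical.

```python
MAX_REVIEW_SHARE = 0.20  # hard cap: at most 20% of practice time on review


def consolidate_reviews(session_minutes, reviews):
    """Group reviews by activity and schedule groups until the cap is reached.

    reviews: list of (skill, activity, minutes) tuples.
    Returns (scheduled, deferred, minutes_used).
    """
    budget = session_minutes * MAX_REVIEW_SHARE

    # Skills reviewed by the same activity share one block, sized to the
    # longest individual review in that block.
    groups = {}
    for skill, activity, minutes in reviews:
        skills, longest = groups.get(activity, ([], 0))
        groups[activity] = (skills + [skill], max(longest, minutes))

    scheduled, deferred, used = [], [], 0
    for activity, (skills, minutes) in groups.items():
        if used + minutes <= budget:
            scheduled.append((activity, skills, minutes))
            used += minutes
        else:
            deferred.extend(skills)  # stays due; picked up in a later session
    return scheduled, deferred, used


pending = [
    ("grip-pressure", "warm-up dinking", 10),
    ("paddle-angle", "warm-up dinking", 10),
    ("contact-point", "drop drill", 15),
]
```

For a 90-minute session the budget is 18 minutes: the two warm-up reviews collapse into one 10-minute block and contact-point is deferred instead of pushing review past the cap.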

Wrong: Path recalculation that feels like the system is punishing you

"You haven't mastered grip-pressure yet. Adding it back. Your path is now
3 weeks longer."

The user feels like the system is judging them. The path got longer. They're discouraged.

How to prevent it: Frame adjustments as discoveries, not setbacks: "Great news — the diagnostic engine identified grip tension as the specific thing holding back your dinks. Addressing it directly will likely unlock third-shot-drop faster than continuing to drill drops. Your revised path is actually more efficient because you're attacking the root cause."

Success Criteria

  • 80%+ of extracted benchmarks are assessable (specific, observable, quantifiable)
  • Paths use parallelization (path time < sum of individual skill times)
  • Spaced review load never exceeds 20% of recommended practice time
  • Path adjustments include explanations that users rate as helpful (not punitive)
  • At least one user completes a full path (target → mastered) using the system end-to-end
  • Time estimates are within 50% of actual time for 70%+ of skills (validated by user reports)
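
The time-estimate criterion can be checked mechanically once user reports exist. This sketch reads "within 50% of actual time" as an absolute error of at most half the actual duration, which is an interpretation rather than something the criteria spell out; the sample records are made up.

```python
def estimate_accuracy(records, tolerance=0.5):
    """Share of skills whose estimate is within `tolerance` of the actual time.

    records: list of (estimated_weeks, actual_weeks) pairs from user reports.
    """
    hits = sum(1 for est, act in records if abs(est - act) <= tolerance * act)
    return hits / len(records)


def meets_time_criterion(records, tolerance=0.5, required_share=0.70):
    return estimate_accuracy(records, tolerance) >= required_share


# Illustrative data only: (estimated, actual) in weeks.
reports = [(2, 2), (2, 3), (3, 7), (1, 1.5)]
```

Here three of the four estimates land within tolerance, so the share is 0.75 and the criterion is met.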

Cross-Cutting Quality Principles

These apply to ALL 5 proposals:

1. Transparency Over Polish

Every output the system produces should show its reasoning. A quality score without the rubric breakdown is a black box. A cross-domain match without the bridge mechanism is a fortune cookie. A learning path without the rationale is a todo list. Show the work.

2. Graceful Degradation

When the system doesn't have enough data to produce a confident output, it should say so honestly — not produce a low-confidence output and hide the uncertainty. "I don't have a diagnostic for this symptom yet" is better than a forced match at 30% confidence. "This quality score is uncertain because the rubric dimension X couldn't be evaluated" is better than a fake 50/100.
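
A small sketch of this principle applied to symptom matching: below a confidence floor the system reports that it has no answer instead of returning its best guess. The 0.6 threshold, the `Match` type, and the `diagnose` function are assumptions chosen for illustration.

```python
from typing import List, NamedTuple

MIN_CONFIDENCE = 0.6  # assumed floor; tune against real match quality


class Match(NamedTuple):
    diagnosis: str
    confidence: float


def diagnose(symptom: str, candidates: List[Match]) -> str:
    """Return the best match, or an honest 'no diagnostic' message."""
    best = max(candidates, key=lambda m: m.confidence, default=None)
    if best is None or best.confidence < MIN_CONFIDENCE:
        return f"I don't have a diagnostic for '{symptom}' yet."
    return f"{best.diagnosis} (confidence {best.confidence:.0%})"
```

A 30% match therefore produces the "no diagnostic yet" message, while an 85% match is returned together with its confidence.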

3. Falsifiability

Every claim the system makes should be testable. "This is the #1 leverage point" should come with "because it appears in 7 diagnostics across 4 skills, more than any other root cause — here are the 7." If someone disagrees, they can check the evidence. Systems that make unfalsifiable claims ("this connection is meaningful") are astrology.
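
One way to keep a leverage claim checkable is to return the evidence with it. In this sketch the diagnostics are flat `(diagnostic_id, skill, root_cause)` tuples, which is an assumed shape rather than the real index format, and the sample rows are invented.

```python
from collections import defaultdict


def top_leverage_point(diagnostics):
    """Return the most frequent root cause along with the rows that support it.

    diagnostics: list of (diagnostic_id, skill, root_cause) tuples.
    """
    evidence = defaultdict(list)
    for diag_id, skill, cause in diagnostics:
        evidence[cause].append((diag_id, skill))
    cause, hits = max(evidence.items(), key=lambda kv: len(kv[1]))
    return {
        "root_cause": cause,
        "diagnostic_count": len(hits),
        "skill_count": len({skill for _, skill in hits}),
        "evidence": hits,  # the list a skeptic can inspect
    }


sample = [
    ("d1", "dink", "grip tension"),
    ("d2", "third-shot-drop", "grip tension"),
    ("d3", "serve", "late contact"),
    ("d4", "volley", "grip tension"),
]
```

The returned dictionary carries the count and the supporting rows, so the "#1 leverage point" statement can be disputed by checking each entry.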

4. Conservation of Existing Content

None of these proposals should DELETE or OVERWRITE existing content. They should ADD new data structures (quality scores, cross-domain links, diagnostic indices, learning paths) alongside the existing skill files. The skill files remain the source of truth. Everything else is derived.

5. Incremental Delivery

Each proposal should produce useful artifacts at each step, not only at the end. The Deep Audit produces useful quality scores after Pass 1 (structural scan), even before Pass 2 (content quality). The Diagnostic Engine produces a useful flat index before the decision trees are built. The Mastery Compiler produces useful path visualizations before the spaced reinforcement scheduler exists. Ship incrementally, not all-or-nothing.