From: Javier Sagredo Date: Wed, 6 May 2026 22:10:09 +0000 (+0200) Subject: Add docs/INTERNALS.md (with neutral examples) X-Git-Url: https://git.sagredo.dev/?a=commitdiff_plain;h=336245abb83deb9b4f15721e8eaa6fdd054d6d24;p=classgraph.git Add docs/INTERNALS.md (with neutral examples) The internals walkthrough was sitting uncommitted from an earlier session. Adding it now, with the Cardano-flavoured examples (Ticked / Shelley / Byron / TxOut/LedgerState) swapped for neutral Wrap / Foo Int / Element shapes — the doc explains classgraph internals, not any particular downstream project's vocabulary. Co-Authored-By: Claude Opus 4.7 (1M context) --- diff --git a/docs/INTERNALS.md b/docs/INTERNALS.md new file mode 100644 index 0000000..4a5b747 --- /dev/null +++ b/docs/INTERNALS.md @@ -0,0 +1,378 @@ +# How classgraph gets its data + +This document explains where every piece of information in the viewer +comes from, how type-family resolution works, and what the user can and +cannot trust about the rendered chains. It's targeted at someone who +wants to know whether a given edge in the graph is "real" GHC information +or an inferred (possibly fragile) reconstruction. + +## Table of contents + +- [Pipeline at a glance](#pipeline-at-a-glance) +- [Where extraction runs](#where-extraction-runs) +- [Classes](#classes) +- [Class instances](#class-instances) +- [Type and data families](#type-and-data-families) +- [Type family instances](#type-family-instances) +- [Type-family resolution in the viewer](#type-family-resolution-in-the-viewer) +- [Are the resolution chains deterministic?](#are-the-resolution-chains-deterministic) +- [What the viewer is *not* doing](#what-the-viewer-is-not-doing) + +## Pipeline at a glance + +``` + ┌────────────────────┐ typecheck ┌─────────────────────┐ + │ user's source code │ ──────────────► │ TcGblEnv (in GHC) │ + └────────────────────┘ │ tcg_tcs │ + │ tcg_insts │ + │ tcg_fam_insts │ + └──────────┬──────────┘ + │ typeCheckResultAction + ┌──────────▼──────────┐ + │ Classgraph.Extract │ + │ (this repo's lib) │ + └──────────┬──────────┘ + │ JSON + ┌──────────▼──────────┐ + │ classgraph-view │ + │ merge + render │ + └─────────────────────┘ +``` + +All extraction happens **after typechecking** but **before desugaring to +Core**. By that point GHC has already: + +- Resolved all class superclass theta to fully simplified `PredType`s. +- Fully kind-elaborated every `Type`. +- Computed the closed normal form of class instance heads. +- Fully populated `tcg_insts` (the local instances) and + `tcg_fam_insts` (the local family instances). + +That's important: every fact we record is *what GHC itself sees and +uses* once typechecking is done. We don't run our own typechecker, our +own constraint solver, or our own type-family reducer over the source +code. We just *transcribe* GHC's `TcGblEnv` into JSON. + +The single exception — **class-instance resolution chains via type +families** — happens in the viewer (JavaScript), purely against the +already-extracted JSON. That logic *is* a reconstruction, and is +called out explicitly in the +["Type family resolution in the viewer"](#type-family-resolution-in-the-viewer) +section below. + +## Where extraction runs + +The plugin registers a single hook: + +```haskell +plugin = defaultPlugin + { typeCheckResultAction = \_args _summ tcg_env -> do + liftIO (writeJsonForModule tcg_env) + return tcg_env + } +``` + +`typeCheckResultAction :: [CommandLineOption] -> ModSummary -> TcGblEnv -> TcM TcGblEnv` +runs once per module compiled with the plugin, after the typechecker +has finished but *before* GHC desugars to Core. We never modify +`tcg_env`; the plugin is observation-only. + +This is the latest hook at which all the extracted data is reliably +available without us having to do interface-file fishing. In +particular, `tcg_insts` and `tcg_fam_insts` are already populated by +this point with every instance the user wrote in this module, including +deriving-clause instances and standalone-deriving instances. + +## Classes + +Source: `tcg_tcs :: [TyCon]` filtered by `tyConClass_maybe`. For each +`Class` the schema records: + +| Schema field | GHC API | +|-------------------|-----------------------------------------| +| `ciName` | `className cls` — the class's `Name` | +| `ciTyVars` | `classTyVars cls` | +| `ciSuperclasses` | `classSCTheta cls`, decomposed below | +| `ciAssocTypes` | `classATs cls` — the assoc family TyCons| +| `ciMethods` | `classMethods cls` (only the names) | +| `ciSrc` | `nameSrcSpan (className cls)` | + +Superclass theta needs a tiny bit of decomposition because GHC +represents some constraint shapes structurally: + +- We use `classifyPredType :: PredType -> Pred` to split every + superclass into a (`ClassPred`, `EqPred`, `IrredPred`, `ForAllPred`). +- `ClassPred (cTupleClass _) [a, b, ...]` is a constraint tuple — we + recurse into each `a`, `b` and emit one `SuperclassEdge` per element. + This is needed because GHC sometimes hands us a single tuple-shaped + pred where the user wrote `class (Foo a, Bar a) => …`. +- The boxed equality classes `(~)` and `(~~)` are technically + `ClassPred`s (they really are classes in GHC's type theory), but for + rendering we recognise them by their `OccName` and emit them as + equality predicates with two operands rather than as edges to a + synthetic `~` node. +- For all other class predicates we emit one edge. + +Args of class predicates run through a `visibleArgs` filter +(`tyConBinders` + `isInvisibleTyConBinder`), so kind / runtime-rep +arguments don't leak into the rendered superclass list. + +## Class instances + +Source: `tcg_insts :: [ClsInst]`. Each `ClsInst` becomes one +`InstanceInfo`: + +| Schema field | GHC API | +|----------------|------------------------------------------------------| +| `iiClass` | from `instanceSig inst` | +| `iiArgs` | from `instanceSig inst` (the head args) | +| `iiContext` | from `instanceSig inst` (the theta), decomposed | +| `iiTyVars` | from `instanceSig inst` | +| `iiOrphan` | `isOrphan (is_orphan inst)` | +| `iiOverlap` | pretty-print of `is_flag inst` | +| `iiSrc` | `nameSrcSpan (varName (is_dfun inst))` | + +`instanceSig :: ClsInst -> ([TyVar], [Type], Class, [Type])` returns the +fully resolved (`tvs`, `theta`, `cls`, `args`) for the instance's +dictionary function. That theta is whatever GHC stored on the dfun +after typechecking. So the context we display is exactly what the +solver would require to use the instance. + +Two non-obvious choices: + +- **`iiSrc` from the dfun, not the class.** The dfun's `Name` is tagged + with the source location of the `instance …` declaration GHC + synthesised it from. The class's `Name` only has a useful span when + the class was defined in a module compiled in the *current* package — + for classes loaded from another package's `.hi` file the span is + `UnhelpfulSpan` (interface files don't preserve source locations). + Using the dfun span means "Defined at" works uniformly. + +- **`iiContext` decomposition** uses the same `classifyPredType` + + CTuple / equality logic as superclass theta, so a user-written + `instance (Foo a, Bar a) => Baz a` produces two distinct context + predicates rather than one tuple-shaped one. + +## Type and data families + +Source: `tcg_tcs` filtered by `isFamilyTyCon`, plus the union with +every class's `classATs` (associated families belong to a class but +aren't always also in `tcg_tcs` directly). Per family: + +| Schema field | GHC API | +|-----------------|--------------------------------------------------| +| `tfName` | `tyConName tc` | +| `tfTyVars` | `tyConTyVars tc` | +| `tfFlavor` | from `famTyConFlav_maybe tc` and the class index | +| `tfResultKind` | `tyConResKind tc`, pretty-printed | +| `tfSrc` | `nameSrcSpan (tyConName tc)` | +| `tfEquations` | for closed families: branches of the `CoAxiom` | + +Flavor taxonomy: + +- `OpenSynFamilyTyCon` → `OpenFam` +- `ClosedSynFamilyTyCon (Maybe (CoAxiom Branched))` → `ClosedFam`, + with the branches extracted into `tfEquations` +- `BuiltInSynFamTyCon{}` → `ClosedFam` (treated as closed for our purposes) +- `DataFamilyTyCon{}` → `DataFam` +- Anything found under some class's `classATs` → `AssocFam parentClass` + +Closed-family equations don't live in `tcg_fam_insts` — they're +`CoAxBranch`es inside the family's `CoAxiom Branched`. We extract them +via `coAxBranches (co_ax_branches ax)` and treat each branch as a +`FamInstInfo` for rendering purposes. + +## Type family instances + +Source: `tcg_fam_insts :: [FamInst]`. Each becomes one `FamInstInfo`: + +| Schema field | GHC API | +|---------------|------------------------------------------------------------| +| `fiFamily` | `fi_fam fi` | +| `fiTyVars` | `fi_tvs fi` | +| `fiArgs` | `fi_tys fi`, filtered through `visibleArgs (famInstTyCon fi)` | +| `fiRhs` | `fi_rhs fi` | +| `fiSrc` | `coAxBranchSpan (coAxiomSingleBranch (fi_axiom fi))` | +| `fiIsData` | `fi_flavor fi == DataFamilyInst _` | + +A few extraction choices worth documenting: + +- **Visible-arg filter on `fi_tys`.** GHC stores the kind args of a + poly-kinded family alongside the value args. For + `data family Wrap (l :: K)` we'd otherwise get + `Wrap (((TYPE (BoxedRep Lifted)) -> …) l)` instead of `Wrap l`. + `tyConBinders (famInstTyCon fi)` tells us which positions are + invisible binders; we drop those. + +- **`fiSrc` from the axiom branch, not the family name.** Each + fam-instance has its own `CoAxiom Unbranched` representing the + equation; `coAxBranchSpan` of its single branch points at the + `data instance` / `type instance` declaration in the user's source. + The family's `Name` would point at the `data family` / `type family` + declaration, which is wrong for every instance after the first. + +- **Data-family RHS is intentionally hidden in the viewer.** GHC + synthesises a fresh data-constructor TyCon for every + `data instance`, with a name like `R:Foo[List]Intmk`. That name has + no source-level meaning, so we only record it in the schema (for + completeness) and the viewer drops it from labels when `fiIsData`. + +## Type-family resolution in the viewer + +So far everything has been *transcription*: we record what GHC told us +and stop. The one place we go beyond transcription is the **resolution +chains** in the instance view. This part is computed in JavaScript +against the already-extracted JSON, *not* by GHC. + +Setup. For an instance view focused on a class `C`, we draw — among +other things — every superclass requirement of `C`'s declaration with +the focused-instance's args substituted in. Concretely, for each +superclass edge `seSuperclass = S, seArgs` of `C`, and each instance +`inst` of `C`, the constraint is `S (subst seArgs inst.iiArgs)`. + +That `subst seArgs inst.iiArgs` may produce something with a +`FamilyApp` in it, e.g. `S (F a) (F b)` where `F` is some type family. +The viewer then asks "for which fam-instances of `F` does this pred +become solvable, and which class instance of `S` would satisfy it?". + +The algorithm: + +1. **Find the family applications.** `collectFamilyRefs` walks the + substituted args and collects every `FamilyApp` qualname. + +2. **For each fam-instance `fi` of that family:** + + 1. **Bidirectional unification** between `fi.fiArgs` and the + `FamilyApp`'s args. `biUnify` allows TyVarRefs on either side to + bind to anything on the other side. *If unification fails, this + `fi` is irrelevant — skip it entirely.* (This is what stops e.g. + a `Foo Int`-shaped predicate from being matched against a + `type instance F Bool = …` equation.) + + 2. **Substitute `fi.fiRhs` into the predicate.** The bidirectional + unification gave us substitutions for both fi's tyvars (used to + rewrite fi's RHS) and the predicate's tyvars (used to specialise + the rest of the predicate). The result is a fully-substituted + predicate with no occurrence of the originating family. + + 3. **Reduce the result.** `reduceTypeArg` walks the resulting + `TypeArg` and tries to apply *any* available fam-instance + (using one-sided unification — predicate vars stay free) to + collapse remaining family applications. This handles chained + reductions like `F (G x)` where `F` and `G` are both reducible. + + 4. **Find a matching class instance.** `findMatchingInstances` + scans `pdInstances` for an instance of the predicate's class + whose `iiArgs` shape-match the reduced args (TyVarRefs on + either side acting as wildcards). + + 5. **Draw the chain.** Edge from focused instance → family node → + `fi` node → matched class instance. + +If any step fails — relevance check, no fam-instance applies, no +matching class instance exists in the dump — the chain stops at the +last node we can justify. + +## Are the resolution chains deterministic? + +For the **transcribed data** (classes, instances, type families, +fam-instances, contexts, superclass thetas, source spans, flags) the +answer is unambiguous: it's exactly what GHC observed at the end of +typechecking. No reconstruction, no heuristic, no overlap resolution +on our side. If you re-run the plugin against the same source with the +same GHC version you get bit-identical JSON. + +For the **resolution chains in the instance view**, the picture is +more nuanced. Here's what can and cannot vary: + +- **Open type families with overlapping instances.** GHC forbids + overlapping open-family equations, so for any concrete argument list + there is at most one applicable `type instance`. But the viewer + walks fam-instances by *order in `pdFamInstances`*, which depends on + the order modules were compiled. If you have `type instance F Int = …` + in `pkg-A` and `type instance F Bool = …` in `pkg-B`, both are + presented as alternative resolutions; the user sees both chains + because both are *possible* satisfiers — without further input + (`a := Int` vs `a := Bool`) GHC itself can't pick one. The viewer + is honest about that ambiguity rather than hiding it. + +- **Closed type families with branch order.** GHC's evaluation rule is + "first matching branch wins". The viewer's `reduceTypeArg` mimics + this: it tries fam-instances in extraction order and stops at the + first that unifies. For closed families this is the same order as + the source-code branches, so the result matches GHC's. For open + families across multiple modules the order is build-graph-dependent + but, because open-family overlap is forbidden, the *set* of + applicable equations is unique even if the iteration order isn't. + +- **Ambiguous matches in `findMatchingInstances`.** Our class-instance + matcher treats TyVarRefs on either side as wildcards. That's + intentionally permissive: in the absence of a full constraint solver + we'd rather show the user "any of these instances might apply" than + hide instances behind a stricter check that would miss real + resolutions. The user sees every match as a separate chain edge and + decides which is meaningful in their context. + +- **Reductions stopping early.** If a fam-instance's RHS *itself* + mentions another type family (a common shape: `type Element (Wrap + c) = Element c` inside an associated family with the same name), + the viewer's reducer can't make further progress unless the + next-step fam-instance is also in the dumps. Chains terminate at + whichever fam-instance node the reducer last successfully unified + with — honest about the dependency boundary. + +- **Type-family functions defined by GHC itself** (e.g. + `Data.Type.Equality.==`, `If`, type-level arithmetic from + `GHC.TypeLits`) come through as `BuiltInSynFamTyCon` and are + transcribed but **not** evaluated by the viewer. We don't + reimplement GHC's built-in family rewrites. If a chain involves + `1 + 1` where `(+)` is `GHC.TypeNats.+`, the viewer will stop at + `+` rather than producing `2`. + +- **`UndecidableInstances` / cyclic reduction.** The reducer is + iterative, not fixed-point: each call visits the structure once and + then bails. We don't loop forever on a divergent chain. In practice + this means a circular family like `type instance F a = G a` / + `type instance G a = F a` shows up as two separate fam-instance + nodes pointing at each other via `fiRhs` references, but no edge in + the resolution-chain layer. + +So: *the data is deterministic and faithful, the chains are an honest +best-effort visualization*. When in doubt, the chain edges in the +instance view are hints; the underlying transcribed data +(`pdInstances`, `pdFamInstances`, `tfEquations`, etc.) is ground +truth. + +## What the viewer is *not* doing + +A few things we deliberately don't do, in case you wonder: + +- **No instance overlap resolution.** GHC's instance solver picks one + instance from a set of applicable ones using overlap pragmas + (`{-# OVERLAPPING #-}` etc.), specificity, and module-context rules. + We record `iiOverlap` as text on each instance but don't simulate + the solver. If two instances of the same class match a given + argument shape, the viewer shows both with the same chain edge. + +- **No constraint solving.** We don't try to compute, given a type, the + full set of class instances reachable through context constraints. + The instance view shows direct context predicates (`iiContext`) and + one hop of superclass requirements; transitive resolution is left + to the user clicking through. + +- **No quantified-constraint handling.** `forall a. Eq a => Eq (T a)` + appearing in a context is `ForAllPred` and we drop it from the + `iiContext` display. Quantified constraints are GHC-9-and-up but + rare; adding them would require rendering the implication shape. + +- **No analysis of `INCOHERENT` / `OVERLAPPABLE` / overlap warnings.** + We just show every instance. + +- **No re-typechecking when JSON is loaded.** The viewer trusts the + JSON. If you hand-edit `.classgraph/*.json`, the viewer will happily + render whatever's there — it has no way to detect inconsistency. + +If any of these matter for your use case and the answer is "we should +do it", the place to start is `Classgraph.Extract` (for fields we'd +need to add to the schema) or `viewer.js`'s `buildInstanceView` / +`reduceTypeArg` / `biUnify` (for the in-viewer reconstruction).