Vision Pipeline — Linkuistics

Python uv workspace at vision/ that decomposes VM screenshots into structured UI data. Each stage is a workspace member with its own pyproject.toml and test suite, composed top-down by the pipeline orchestrator.

Stage contract

Every stage exposes a class with a run() method. The input is a PIL.Image plus any upstream outputs; the output is a list of typed detections defined in testanyware_common.types. Stages must be pure functions of their inputs so the orchestrator can cache and parallelize.

class Stage:
    def run(self, image: Image, upstream: Outputs) -> Outputs: ...

Outputs is a stage-specific dataclass (e.g. WindowDetections, ElementDetections) imported from testanyware_common.types. No stage imports another stage's implementation — only its output type.

Workspace layout

Declared in vision/pyproject.toml:

vision/
├── common/                          # testanyware-common — shared types, utilities
├── pipeline/                        # orchestrator (currently a stub)
├── stages/
│   ├── window-detection/
│   │   ├── generator/               # synthetic-data generator
│   │   ├── training/                # detector training script + configs
│   │   └── analysis/                # runtime window-boundary detector
│   ├── drawing-primitives/          # low-level line/box/shape primitives
│   │                                # (absorbed from the Redraw project)
│   └── icon-classification/         # per-button icon classifier
│       ├── src/icon_classification/ # classifier + shape-heuristic fallback
│       ├── training/                # training workflow (Create ML)
│       └── data/                    # model artefacts (post-training)

Stages

`window-detection`

Three sub-packages that together own the "where are the windows" step.

generator — produces synthetic training images and ground-truth labels (window rectangles + chrome regions). Used to bootstrap the detector without needing thousands of real screenshots.
training — configs + scripts to train the detector from the generator's output or real labelled data.
analysis — the runtime detector. Input: a screenshot. Output: window bounding boxes and chrome regions that downstream stages consume for layout context.

`drawing-primitives`

Low-level geometric primitives (line, box, shape grouping) used by the element and chrome stages. Absorbed from the standalone Redraw project. Pure geometry — no model, no ML.

`icon-classification`

Per-button icon classification against a fixed 52-label vocabulary (gear, checkmark, close-x, chevrons, plus, minus, etc.). Given a screenshot and a list of button-like detections from an upstream stage, returns the best label (or "unknown") for each.

Status: model not yet trained. The eventual CoreML/ONNX model has not been produced, so classification currently falls back to the shape-analysis heuristic at src/icon_classification/shape_analysis.py. The heuristic handles ~8 obvious geometric icons (plus, minus, close-x, checkmark, four chevrons); everything else comes back as "unknown". Once a trained model lands at data/icon_classifier.onnx or data/icon_classifier.mlmodelc, the classifier will use it automatically. See vision/stages/icon-classification/training/README.md for the end-to-end training workflow (collect → label → Create ML → bundle).

Test organisation

Marker-driven (pytest markers declared in vision/pyproject.toml):

Marker	Meaning
`unit`	Pure logic, no models or VMs
`vision`	Detector accuracy against golden datasets
`integration`	End-to-end against live VMs
`slow`	Takes more than 10 seconds

Default invocation skips integration and slow tests:

cd vision && uv sync && uv run pytest

Required flag: pytest is invoked with --import-mode=importlib (baked into the uv workspace config). This is necessary because several workspace members share top-level package names and the default prepend import mode causes duplicate-module collisions.

How this composes

The pipeline orchestrator at vision/pipeline/ (currently a stub) will wire stages in topological order:

screenshot
  │
  ▼
window-detection/analysis ── chrome regions
  │
  ▼
(element detection stage — future)
  │
  ▼
icon-classification  ← uses button-like detections
  │
  ▼
drawing-primitives (geometric hints for all above)

Stages produce additive annotations on a shared Detections object; downstream stages read earlier detections via dataclass fields, not by re-running upstream work. This is the same composition model used on the host-CLI side: each stage runs alone from the command line and the pipeline is assembled at the outermost boundary.