ML Systems / Operations Tooling

Vision UI

A local-first desktop activity analysis tool for evidence-linked troubleshooting, operator replay, and future AI-assisted support workflows, built on screenshots, native context, frame evidence, and narrative summaries.

Animated screen recording of Vision UI analyzing captured desktop frames

Overview

Vision UI turns desktop screenshot streams into evidence-linked accounts of user activity so support and operations teams can replay what happened instead of relying only on memory or loosely written tickets. The system captures screenshots and native context, extracts OCR and interface-region proposals, serializes per-frame JSON, and produces deterministic session summaries plus qualitative narratives that remain tied to saved evidence.
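
The pipeline described above can be pictured with a minimal sketch. The record layout, field names, and the `FrameRecord`, `serialize_frame`, and `summarize_session` helpers are hypothetical and only illustrate the idea of per-frame JSON plus an order-independent summary; they are not the project's actual schema or code.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass, field


@dataclass
class FrameRecord:
    """One captured frame. All field names are illustrative assumptions."""
    frame_id: int
    timestamp: float          # seconds since session start
    screenshot_path: str      # saved evidence the summary points back to
    window_title: str         # native context for the frame
    ocr_text: list = field(default_factory=list)   # OCR lines
    regions: list = field(default_factory=list)    # proposed UI regions


def serialize_frame(record: FrameRecord) -> str:
    # sort_keys keeps the JSON stable for identical records.
    return json.dumps(asdict(record), sort_keys=True)


def summarize_session(frames: list) -> dict:
    # Sorting first makes the summary independent of input order.
    ordered = sorted(frames, key=lambda f: (f.timestamp, f.frame_id))
    windows = Counter(f.window_title for f in ordered)
    duration = ordered[-1].timestamp - ordered[0].timestamp if ordered else 0.0
    return {
        "frame_count": len(ordered),
        "duration_s": duration,
        # Most frequent window first; ties broken alphabetically.
        "windows": sorted(windows.items(), key=lambda kv: (-kv[1], kv[0])),
        # Every summary keeps links to the screenshots it was built from.
        "evidence": [f.screenshot_path for f in ordered],
    }
```

Because the summary is a pure function of the sorted frame records, rerunning it on the same saved evidence gives the same result, which is one plausible reading of "deterministic" here. The qualitative narrative would sit on top of this layer and cite the same `evidence` paths.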

Key Features

Evidence

The report frames the learned grounding stack as promising but still experimental. The strongest combined grounding run reached 0.4450 top-1 accuracy, with 0.4533 on held-out UI-Vision, 0.4434 on ScreenSpot-Pro, and 0.6906 selected-predictor recall across the combined evaluation.
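
For readers unfamiliar with grounding metrics, the sketch below shows one common way top-1 accuracy is scored for GUI grounding: a prediction counts as correct when the predicted point falls inside the target element's bounding box. This scoring rule is an assumption for illustration, since the report's exact criterion is not stated here, and the `point_in_box` and `top1_accuracy` helpers are hypothetical names.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
Point = Tuple[float, float]              # (x, y) in pixels


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box, edges included."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def top1_accuracy(predictions: Sequence[Point], targets: Sequence[Box]) -> float:
    """Fraction of examples whose single best prediction hits its target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, targets))
    return hits / len(targets)
```

Under this reading, a score of 0.4450 would mean that a little under half of the top-ranked predictions land on the intended UI target, which fits the report's framing of the stack as promising but still experimental.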

What I Learned

The project made the gap between a fluent summary and a trustworthy operational account very clear. For troubleshooting and review, the value is not just narration; it is narration grounded in inspectable screenshots, UI targets, native context, confidence markers, and saved evidence.