
Mira

2025 · Author · TypeScript

Overview

A vision-first AI reading assistant, a "Cursor for textbooks": it understands pages in real time and provides contextual explanations with minimal friction.

Architecture

Instead of traditional RAG pipelines, Mira uses a vision-first approach by sending page screenshots directly to a multimodal LLM. Each PDF page maintains its own isolated conversation context, effectively creating per-page memory threads. The system routes queries either to page-level reasoning (via screenshots) or document-level reasoning (by sending the full PDF) depending on user intent.
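The per-page memory threads and the page-versus-document routing described above can be sketched roughly as follows. This is a minimal illustration, not the actual codebase: the names `PageContextStore` and `routeQuery` are hypothetical, and the routing heuristic here is a simple keyword check standing in for whatever intent detection Mira actually uses.

```typescript
type ChatMessage = { role: "user" | "assistant"; content: string };

// Each PDF page keeps its own isolated conversation thread.
class PageContextStore {
  private threads = new Map<number, ChatMessage[]>();

  history(page: number): ChatMessage[] {
    return this.threads.get(page) ?? [];
  }

  append(page: number, msg: ChatMessage): void {
    const thread = this.threads.get(page) ?? [];
    thread.push(msg);
    this.threads.set(page, thread);
  }
}

// Route to page-level reasoning (screenshot) or document-level
// reasoning (full PDF) based on how the query is phrased.
type Route = "page" | "document";

function routeQuery(query: string): Route {
  // Illustrative heuristic: whole-document phrasing escalates
  // the query to document-level reasoning.
  const documentCues = /\b(whole|entire|overall|document|chapter)\b/i;
  return documentCues.test(query) ? "document" : "page";
}
```

Keeping one thread per page means a question about page 12 never drags in chat history from page 3, which keeps prompts small and answers grounded in the page the reader is looking at.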

Implementation

Built as a PDF reader with an embedded chat interface. On each query, the app captures the current page as an image and sends it along with prior page-specific chat history. Responses are post-processed into smaller message chunks to mimic natural conversation. Optional text-to-speech is integrated via Kokoro (hosted on Replicate), converting LLM outputs into audio. The system also optimizes token usage by avoiding text extraction and relying solely on image inputs.
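The post-processing step that turns one long LLM reply into smaller conversational bubbles could look something like the sketch below. `chunkReply` is an illustrative name and the sentence-based splitting is an assumption; the real implementation may group text differently.

```typescript
// Split a long LLM reply into short message chunks so responses
// read like a natural back-and-forth rather than one wall of text.
function chunkReply(reply: string, maxSentences = 2): string[] {
  // Break on sentence boundaries, then group a few sentences per bubble.
  const sentences = (reply.match(/[^.!?]+[.!?]+/g) ?? [reply]).map((s) =>
    s.trim()
  );
  const chunks: string[] = [];
  for (let i = 0; i < sentences.length; i += maxSentences) {
    chunks.push(sentences.slice(i, i + maxSentences).join(" "));
  }
  return chunks;
}
```

Each chunk can then be rendered as its own chat message (and, when TTS is enabled, sent to the audio pipeline individually), so long explanations arrive as a stream of short turns instead of a single block.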

Results

Eliminated the need for complex PDF parsing while improving contextual understanding of diagrams, equations, and layouts. Reduced token usage (~250 tokens per page image) compared to text+image pipelines. Delivered a significantly smoother UX by removing friction from context switching and transforming long LLM outputs into conversational responses. Demonstrated that vision-based approaches can simplify architecture while maintaining strong accuracy for educational use cases.

LLMs · TypeScript · ReactJS · NodeJS · MongoDB