
Mira

2025 · Author · TypeScript

Overview

A vision-first AI reading assistant, a "Cursor for textbooks": it understands pages in real time and provides contextual explanations with minimal friction.

Architecture

Instead of traditional RAG pipelines, Mira uses a vision-first approach by sending page screenshots directly to a multimodal LLM. Each PDF page maintains its own isolated conversation context, effectively creating per-page memory threads. The system routes queries either to page-level reasoning (via screenshots) or document-level reasoning (by sending the full PDF) depending on user intent.
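The per-page memory threads and the page-versus-document routing described above can be sketched roughly as follows. This is a minimal illustration, not the actual codebase: the names `PageContextStore` and `routeQuery` are hypothetical, and the routing heuristic here is a simple keyword check standing in for whatever intent detection Mira actually uses.

```typescript
type ChatMessage = { role: "user" | "assistant"; content: string };

// Each PDF page keeps its own isolated conversation thread.
class PageContextStore {
  private threads = new Map<number, ChatMessage[]>();

  history(page: number): ChatMessage[] {
    return this.threads.get(page) ?? [];
  }

  append(page: number, msg: ChatMessage): void {
    const thread = this.threads.get(page) ?? [];
    thread.push(msg);
    this.threads.set(page, thread);
  }
}

// Route to page-level reasoning (screenshot) or document-level
// reasoning (full PDF) based on how the query is phrased.
type Route = "page" | "document";

function routeQuery(query: string): Route {
  // Illustrative heuristic: whole-document phrasing escalates
  // the query to document-level reasoning.
  const documentCues = /\b(whole|entire|overall|document|chapter)\b/i;
  return documentCues.test(query) ? "document" : "page";
}
```

Keeping one thread per page means a question about page 12 never drags in chat history from page 3, which keeps prompts small and answers grounded in the page the reader is looking at.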

Implementation

Built as a PDF reader with an embedded chat interface. On each query, the app captures the current page as an image and sends it along with prior page-specific chat history. Responses are post-processed into smaller message chunks to mimic natural conversation. Optional text-to-speech is integrated via Kokoro (hosted on Replicate), converting LLM outputs into audio. The system also optimizes token usage by avoiding text extraction and relying solely on image inputs.
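The post-processing step that turns one long LLM reply into smaller conversational bubbles could look something like the sketch below. `chunkReply` is an illustrative name and the sentence-based splitting is an assumption; the real implementation may group text differently.

```typescript
// Split a long LLM reply into short message chunks so responses
// read like a natural back-and-forth rather than one wall of text.
function chunkReply(reply: string, maxSentences = 2): string[] {
  // Break on sentence boundaries, then group a few sentences per bubble.
  const sentences = (reply.match(/[^.!?]+[.!?]+/g) ?? [reply]).map((s) =>
    s.trim()
  );
  const chunks: string[] = [];
  for (let i = 0; i < sentences.length; i += maxSentences) {
    chunks.push(sentences.slice(i, i + maxSentences).join(" "));
  }
  return chunks;
}
```

Each chunk can then be rendered as its own chat message (and, when TTS is enabled, sent to the audio pipeline individually), so long explanations arrive as a stream of short turns instead of a single block.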

Results

Eliminated the need for complex PDF parsing while improving contextual understanding of diagrams, equations, and layouts. Reduced token usage (~250 tokens per page image) compared to text+image pipelines. Delivered a significantly smoother UX by removing friction from context switching and transforming long LLM outputs into conversational responses. Demonstrated that vision-based approaches can simplify architecture while maintaining strong accuracy for educational use cases.

LLMs · TypeScript · ReactJS · NodeJS · MongoDB