Technical Architecture
Layered System Modules
1. Accessibility Service Layer
Detects gestures, draws overlays. Uses Android's AccessibilityService to create system-wide overlays that capture user input without interfering with underlying apps.
Key Components:
- AccessibilityService implementation
- TYPE_ACCESSIBILITY_OVERLAY window management
- Gesture capture and event routing
2. Region Processor
Converts user input area into actionable bounds. Translates circle/touch gestures into screen coordinates and extracts the selected region.
Key Components:
- Gesture recognition (circle, tap, drag)
- Coordinate transformation
- Region boundary calculation
- Screenshot/content capture via MediaProjection API
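The boundary calculation above can be sketched as a pure function: reduce the touch points of a circle or drag gesture to an axis-aligned bounding box, clamped to the screen. A minimal sketch — the `TouchPoint` and `Region` names are illustrative, not the project's actual types:

```kotlin
// Illustrative types; the real Region Processor may use Android's PointF/Rect.
data class TouchPoint(val x: Float, val y: Float)

data class Region(val left: Int, val top: Int, val right: Int, val bottom: Int) {
    val width get() = right - left
    val height get() = bottom - top
}

fun boundsOf(points: List<TouchPoint>, screenWidth: Int, screenHeight: Int): Region {
    require(points.isNotEmpty()) { "gesture produced no touch points" }
    // Axis-aligned bounding box of the gesture path, clamped to the screen.
    val left = points.minOf { it.x }.toInt().coerceIn(0, screenWidth)
    val top = points.minOf { it.y }.toInt().coerceIn(0, screenHeight)
    val right = points.maxOf { it.x }.toInt().coerceIn(0, screenWidth)
    val bottom = points.maxOf { it.y }.toInt().coerceIn(0, screenHeight)
    return Region(left, top, right, bottom)
}
```

The resulting `Region` is what gets handed to MediaProjection-based capture and, downstream, to the Content-Type Detector.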
3. Content-Type Detector
Classifies input for routing (audio/image/text). Analyzes the captured region to determine what type of content it contains and how to process it.
Key Components:
- View hierarchy analysis
- OCR text detection
- Audio source identification
- Image/media classification
- File type detection
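The routing decision itself is a simple total mapping from detected content type to processing module. A hedged sketch, assuming hypothetical `Content` and `Route` types (not the app's real API):

```kotlin
// Each captured region is classified into one of these content kinds.
sealed class Content {
    data class Text(val value: String) : Content()
    data class Image(val bytes: ByteArray) : Content()
    data class Audio(val samples: ShortArray) : Content()
}

enum class Route { SPEECH, LLM, VISION }

// Exhaustive `when` guarantees every content kind has a destination module.
fun routeFor(content: Content): Route = when (content) {
    is Content.Audio -> Route.SPEECH  // transcribe first, then reason over the text
    is Content.Text  -> Route.LLM     // summarize/explain directly
    is Content.Image -> Route.VISION  // classify; OCR output may re-enter via LLM
}
```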
4. Local AI Engine
Speech Module
- Options: Vosk, DeepSpeech, PocketSphinx
- Purpose: Offline speech-to-text transcription
- Input: Audio streams, recorded audio
- Output: Transcribed text
LLM Module
- Options: MLC Chat, SmolChat, Edge Gallery (Llama 3, Phi-3, Gemma, Qwen in GGUF format)
- Purpose: Local reasoning, summarization, explanation
- Input: Text content, context
- Output: AI-generated responses
Vision Module
- Options: MLKit, TFLite, ONNX lightweight models
- Purpose: Image analysis, object detection, content classification
- Input: Images, screenshots
- Output: Classifications, descriptions, detected objects
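Since all three modules follow the same input → inference → output shape, a common abstraction keeps the backends (Vosk, MLC, TFLite, and so on) swappable. A minimal sketch under that assumption — the interface and stub names are illustrative:

```kotlin
// Common async contract for the three engine modules; real code might use
// coroutines instead of callbacks.
interface InferenceModule<I, O> {
    fun infer(input: I, onResult: (O) -> Unit)
}

// A stub standing in for an offline STT backend such as Vosk.
class StubSpeechModule : InferenceModule<ShortArray, String> {
    override fun infer(input: ShortArray, onResult: (String) -> Unit) {
        onResult(if (input.isEmpty()) "" else "transcribed ${input.size} samples")
    }
}
```

The Dialogue Agent can then hold `InferenceModule` references without knowing which concrete engine is installed, which is what makes each backend independently replaceable.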
5. Dialogue Agent
Maintains session state, keeps the dialogue open, and executes actions on command. Manages conversation flow and context persistence.
Key Components:
- State machine for dialogue flow
- Context/memory management
- Command parsing and routing
- Execution trigger handling
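The dialogue flow above can be sketched as a small state machine: session states plus the events that move between them. The specific states and events here are assumptions for illustration:

```kotlin
enum class DialogueState { IDLE, AWAITING_INPUT, PROCESSING, SHOWING_RESPONSE }

sealed class DialogueEvent {
    object RegionSelected : DialogueEvent()
    object UserPrompt : DialogueEvent()
    object ResponseReady : DialogueEvent()
    object Execute : DialogueEvent()
    object Dismiss : DialogueEvent()
}

// Pure transition function: easy to unit-test, no Android dependencies.
fun transition(state: DialogueState, event: DialogueEvent): DialogueState =
    when (state to event) {
        DialogueState.IDLE to DialogueEvent.RegionSelected -> DialogueState.AWAITING_INPUT
        DialogueState.AWAITING_INPUT to DialogueEvent.UserPrompt -> DialogueState.PROCESSING
        DialogueState.PROCESSING to DialogueEvent.ResponseReady -> DialogueState.SHOWING_RESPONSE
        DialogueState.SHOWING_RESPONSE to DialogueEvent.UserPrompt -> DialogueState.PROCESSING // continue dialogue
        DialogueState.SHOWING_RESPONSE to DialogueEvent.Execute -> DialogueState.IDLE          // execute, end session
        else -> if (event == DialogueEvent.Dismiss) DialogueState.IDLE else state              // ignore invalid events
    }
```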
6. UI/Feedback Layer
Interactive Jetpack Compose overlays for user input and feedback display.
Key Components:
- Jetpack Compose UI components
- Overlay windows
- Feedback animations
- Status indicators
- Action buttons
7. Privacy/Data Management
Local-only data handling with cache and session controls.
Key Components:
- Local storage management
- Session cache controls
- Privacy settings
- Data retention policies
- No network call enforcement
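Because data never leaves the device, retention is the only copy control needed: expire cached session files after a cutoff. A minimal sketch, assuming a flat cache directory and an illustrative `maxAgeMs` policy knob:

```kotlin
import java.io.File

// Delete cached session files older than maxAgeMs. Directory layout and the
// policy parameter are assumptions for illustration.
fun pruneSessionCache(cacheDir: File, maxAgeMs: Long, now: Long = System.currentTimeMillis()) {
    cacheDir.listFiles()?.forEach { f ->
        if (f.isFile && now - f.lastModified() > maxAgeMs) {
            f.delete() // local-only: deleting the file removes the only copy
        }
    }
}
```

A real implementation would pair this with the user-facing "clear cache" control so manual clearing and automatic expiry share one code path.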
Dataflow Diagram
```
User Gesture → Accessibility Service → Region Processor
                        ↓
              Content-Type Detector
                        ↓
     ┌──────────────────┼──────────────────┐
     ↓                  ↓                  ↓
Speech Module       LLM Module       Vision Module
     ↓                  ↓                  ↓
     └──────────────────┼──────────────────┘
                        ↓
                 Dialogue Agent
                        ↓
                  UI/Feedback
                        ↓
              User sees response
                        ↓
        Continue dialogue or Execute
```
Technology Stack
- Platform: Android (minSdk 27+, targetSdk 34)
- UI Framework: Jetpack Compose
- Overlay System: AccessibilityService (TYPE_ACCESSIBILITY_OVERLAY)
- Screen Capture: MediaProjection API
- Speech-to-Text: Vosk / DeepSpeech / PocketSphinx (offline)
- LLM: MLC Chat / SmolChat / Edge Gallery (on-device Llama 3, Phi-3, Gemma, Qwen)
- Vision: MLKit / TFLite / ONNX
- Voice Commands: Porcupine / Vosk / PocketSphinx (wake word detection)
- Language: Kotlin
- Build System: Gradle
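The stack above might map onto a module-level `build.gradle.kts` roughly as follows. This is a sketch only: the Maven coordinates are examples of the listed stacks, and names and versions should be verified before use.

```kotlin
dependencies {
    implementation("androidx.compose.ui:ui")                    // Jetpack Compose UI
    implementation("com.alphacephei:vosk-android:0.3.47")       // offline speech-to-text
    implementation("org.tensorflow:tensorflow-lite:2.14.0")     // on-device vision models
    implementation("com.google.mlkit:text-recognition:16.0.0")  // OCR for the content detector
    implementation("ai.picovoice:porcupine-android:3.0.1")      // wake-word detection
}
```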
Design Principles
- Privacy First: All processing happens on-device. Zero network calls for AI inference.
- Modular Architecture: Each component is independently testable and replaceable.
- Async Operations: All ML inference is non-blocking and asynchronous.
- Dependency Injection: Components are loosely coupled via DI.
- Composition Over Inheritance: Favor composable functions and interfaces.
- Local Data Only: All app data stays within the local app context.
Known Constraints and Considerations
- Model Size: LLMs and STT models can be large; ensure device compatibility
- Device Fragmentation: Overlay behavior varies across OEMs (MIUI, Samsung, etc.)
- Gesture Recognition: Need robust handling to minimize false positives
- Performance: On-device inference requires optimization for lower-end devices
- Battery Impact: Continuous overlay and ML inference need power optimization
Security and Privacy
- No Network Permissions: App does not request internet access for AI features
- Local Storage: All data stored in app-private directories
- No Cloud Services: Zero dependency on external APIs or cloud infrastructure
- User Control: Complete user control over data retention and cache clearing
- Transparency: Open-source codebase for full auditability