Technical Architecture

Layered System Modules

1. Accessibility Service Layer

Detects gestures and draws overlays. Uses Android's AccessibilityService to create system-wide overlays that capture user input without interfering with the underlying apps.

Key Components:

  • AccessibilityService implementation
  • TYPE_ACCESSIBILITY_OVERLAY window management
  • Gesture capture and event routing
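As a rough illustration of this layer, the sketch below shows how an AccessibilityService might attach a TYPE_ACCESSIBILITY_OVERLAY window once connected. The service and view names are hypothetical, not the project's actual classes; this is Android-only code and only a sketch of the window setup.

```kotlin
import android.accessibilityservice.AccessibilityService
import android.graphics.PixelFormat
import android.view.Gravity
import android.view.View
import android.view.WindowManager
import android.view.accessibility.AccessibilityEvent

// Hypothetical service name for illustration only.
class CircleOverlayService : AccessibilityService() {

    override fun onServiceConnected() {
        super.onServiceConnected()
        val windowManager = getSystemService(WindowManager::class.java)
        // TYPE_ACCESSIBILITY_OVERLAY draws above other apps without the
        // SYSTEM_ALERT_WINDOW permission, but only while this service runs.
        val params = WindowManager.LayoutParams(
            WindowManager.LayoutParams.MATCH_PARENT,
            WindowManager.LayoutParams.MATCH_PARENT,
            WindowManager.LayoutParams.TYPE_ACCESSIBILITY_OVERLAY,
            WindowManager.LayoutParams.FLAG_NOT_TOUCH_MODAL,
            PixelFormat.TRANSLUCENT
        ).apply { gravity = Gravity.TOP or Gravity.START }
        windowManager.addView(OverlayView(this), params)
    }

    override fun onAccessibilityEvent(event: AccessibilityEvent?) {
        // Route accessibility events to the gesture pipeline here.
    }

    override fun onInterrupt() {}
}

// Hypothetical custom view that captures touch input for the Region Processor.
class OverlayView(service: AccessibilityService) : View(service)
```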

2. Region Processor

Converts the user's input area into actionable bounds. Translates circle and touch gestures into screen coordinates and extracts the selected region.

Key Components:

  • Gesture recognition (circle, tap, drag)
  • Coordinate transformation
  • Region boundary calculation
  • Screenshot/content capture via MediaProjection API
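The circle-to-bounds step above can be sketched in plain Kotlin. `GesturePoint`, `Region`, and the padding value are illustrative assumptions, not the project's actual types:

```kotlin
// Screen-space point sampled from the gesture path (illustrative type).
data class GesturePoint(val x: Float, val y: Float)

// Axis-aligned region in integer pixel coordinates.
data class Region(val left: Int, val top: Int, val right: Int, val bottom: Int)

/**
 * Collapse a circle/drag gesture into the axis-aligned bounding box of its
 * path, expanded by [padding] pixels so loosely drawn circles still cover
 * the intended content. Returns null for an empty gesture.
 */
fun boundsOf(path: List<GesturePoint>, padding: Int = 16): Region? {
    if (path.isEmpty()) return null
    val left = path.minOf { it.x }.toInt() - padding
    val top = path.minOf { it.y }.toInt() - padding
    val right = path.maxOf { it.x }.toInt() + padding
    val bottom = path.maxOf { it.y }.toInt() + padding
    return Region(left.coerceAtLeast(0), top.coerceAtLeast(0), right, bottom)
}
```

The resulting region would then feed the MediaProjection-based capture step.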

3. Content-Type Detector

Classifies input for routing (audio/image/text). Analyzes the captured region to determine what type of content it contains and how to process it.

Key Components:

  • View hierarchy analysis
  • OCR text detection
  • Audio source identification
  • Image/media classification
  • File type detection
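The routing role of this detector can be sketched with a sealed hierarchy; the type and module names below are placeholders, not the project's actual API:

```kotlin
// Illustrative classification result; the real detector would combine view
// hierarchy analysis, OCR, and media inspection to produce one of these.
sealed interface DetectedContent {
    data class Text(val value: String) : DetectedContent
    data class Image(val bytes: ByteArray) : DetectedContent
    data class Audio(val sourceId: String) : DetectedContent
}

// Route each content type to the engine module that can process it.
// Exhaustive `when` over the sealed interface: adding a new content type
// forces this router to handle it at compile time.
fun routeTo(content: DetectedContent): String = when (content) {
    is DetectedContent.Text -> "llm"
    is DetectedContent.Image -> "vision"
    is DetectedContent.Audio -> "speech"
}
```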

4. Local AI Engine

Speech Module

  • Options: Vosk, DeepSpeech, PocketSphinx
  • Purpose: Offline speech-to-text transcription
  • Input: Audio streams, recorded audio
  • Output: Transcribed text

LLM Module

  • Options: MLC Chat, SmolChat, Edge Gallery (Llama 3, Phi-3, Gemma, Qwen in GGUF format)
  • Purpose: Local reasoning, summarization, explanation
  • Input: Text content, context
  • Output: AI-generated responses

Vision Module

  • Options: MLKit, TFLite, ONNX lightweight models
  • Purpose: Image analysis, object detection, content classification
  • Input: Images, screenshots
  • Output: Classifications, descriptions, detected objects
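Since the design principles call for replaceable components, each backend listed above can sit behind a small interface. The names below are illustrative assumptions, not the project's actual contracts:

```kotlin
// Illustrative module contracts; a Vosk- or MLC-backed implementation would
// live behind these so backends stay swappable.
interface SpeechToText {
    fun transcribe(audio: ByteArray): String
}

interface LocalLlm {
    fun complete(prompt: String, context: String = ""): String
}

// A trivial fake backend, useful for exercising the routing and dialogue
// layers in tests without loading any model.
class EchoLlm : LocalLlm {
    override fun complete(prompt: String, context: String): String =
        "echo: $prompt"
}
```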

5. Dialogue Agent

Maintains session state, keeps the dialogue open, and executes actions on command. Manages conversation flow and context persistence.

Key Components:

  • State machine for dialogue flow
  • Context/memory management
  • Command parsing and routing
  • Execution trigger handling
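The dialogue-flow state machine can be sketched as a pure transition function; the state and event names are assumptions for illustration, not the project's actual API:

```kotlin
// Illustrative dialogue states for the agent's session lifecycle.
enum class DialogueState { IDLE, PROCESSING, PRESENTING }

// Events that drive transitions (names are assumptions, not project API).
enum class DialogueEvent { REGION_SELECTED, RESPONSE_READY, USER_REPLIED, EXECUTE, DISMISS }

// Pure transition function: unknown combinations leave the state unchanged,
// and DISMISS always returns to IDLE.
fun transition(state: DialogueState, event: DialogueEvent): DialogueState =
    when (state to event) {
        DialogueState.IDLE to DialogueEvent.REGION_SELECTED -> DialogueState.PROCESSING
        DialogueState.PROCESSING to DialogueEvent.RESPONSE_READY -> DialogueState.PRESENTING
        DialogueState.PRESENTING to DialogueEvent.USER_REPLIED -> DialogueState.PROCESSING
        DialogueState.PRESENTING to DialogueEvent.EXECUTE -> DialogueState.IDLE
        else -> if (event == DialogueEvent.DISMISS) DialogueState.IDLE else state
    }
```

Keeping the transition function pure makes the dialogue flow independently testable, in line with the modularity principle.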

6. UI/Feedback Layer

Interactive Compose overlays for user interaction and feedback display.

Key Components:

  • Jetpack Compose UI components
  • Overlay windows
  • Feedback animations
  • Status indicators
  • Action buttons

7. Privacy/Data Management

Local-only data handling with cache and session controls.

Key Components:

  • Local storage management
  • Session cache controls
  • Privacy settings
  • Data retention policies
  • No network call enforcement
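The session-cache and retention controls can be sketched as a small in-memory store with an explicit clear; the TTL mechanism and explicit timestamps are illustrative assumptions:

```kotlin
// In-memory session cache illustrating the retention controls: entries
// expire after a TTL, and the user can wipe everything at any time.
class SessionCache(private val ttlMillis: Long) {
    private data class Entry(val value: String, val storedAt: Long)
    private val entries = mutableMapOf<String, Entry>()

    fun put(key: String, value: String, now: Long) {
        entries[key] = Entry(value, now)
    }

    // Expired entries are dropped on read so nothing outlives the TTL.
    fun get(key: String, now: Long): String? {
        val e = entries[key] ?: return null
        return if (now - e.storedAt > ttlMillis) {
            entries.remove(key)
            null
        } else e.value
    }

    // "Clear session" user action: wipe everything immediately.
    fun clear() = entries.clear()
}
```

Because the store is purely in-process, nothing here can touch the network, matching the no-network-call enforcement above.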

Dataflow Diagram

User Gesture → Accessibility Service → Region Processor
                                              ↓
                                     Content-Type Detector
                                              ↓
                    ┌─────────────────────────┼─────────────────────────┐
                    ↓                         ↓                         ↓
              Speech Module              LLM Module              Vision Module
                    ↓                         ↓                         ↓
                    └─────────────────────────┼─────────────────────────┘
                                              ↓
                                      Dialogue Agent
                                              ↓
                                        UI/Feedback
                                              ↓
                                    User sees response
                                              ↓
                              Continue dialogue or Execute

Technology Stack

  • Platform: Android (minSdk 27+, targetSdk 34)
  • UI Framework: Jetpack Compose
  • Overlay System: AccessibilityService (TYPE_ACCESSIBILITY_OVERLAY)
  • Screen Capture: MediaProjection API
  • Speech-to-Text: Vosk / DeepSpeech / PocketSphinx (offline)
  • LLM: MLC Chat / SmolChat / Edge Gallery (on-device Llama 3, Phi-3, Gemma, Qwen)
  • Vision: MLKit / TFLite / ONNX
  • Voice Commands: Porcupine / Vosk / PocketSphinx (wake word detection)
  • Language: Kotlin
  • Build System: Gradle

Design Principles

  1. Privacy First: All processing happens on-device. Zero network calls for AI inference.
  2. Modular Architecture: Each component is independently testable and replaceable.
  3. Async Operations: All ML inference is non-blocking and asynchronous.
  4. Dependency Injection: Components are loosely coupled via DI.
  5. Composition Over Inheritance: Favor composable functions and interfaces.
  6. Local Data Only: All app data stays within the local app context.

Known Constraints and Considerations

  • Model Size: LLMs and STT models can be large; ensure device compatibility
  • Device Fragmentation: Overlay behavior varies across OEMs (MIUI, Samsung, etc.)
  • Gesture Recognition: Need robust handling to minimize false positives
  • Performance: On-device inference requires optimization for lower-end devices
  • Battery Impact: Continuous overlay and ML inference need power optimization

Security and Privacy

  • No Network Permissions: App does not request internet access for AI features
  • Local Storage: All data stored in app-private directories
  • No Cloud Services: Zero dependency on external APIs or cloud infrastructure
  • User Control: Complete user control over data retention and cache clearing
  • Transparency: Open-source codebase for full auditability