# Technical Architecture
## Layered System Modules
### 1. Accessibility Service Layer
Detects gestures and draws overlays. Uses Android's `AccessibilityService` to create system-wide overlays that capture user input without interfering with the underlying apps.
**Key Components:**
- AccessibilityService implementation
- TYPE_ACCESSIBILITY_OVERLAY window management
- Gesture capture and event routing
### 2. Region Processor
Converts user input area into actionable bounds. Translates circle/touch gestures into screen coordinates and extracts the selected region.
**Key Components:**
- Gesture recognition (circle, tap, drag)
- Coordinate transformation
- Region boundary calculation
- Screenshot/content capture via MediaProjection API
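The boundary-calculation step can be sketched in pure Kotlin. This is a minimal illustration, not the project's actual implementation: the `Point`/`Region` types, the padding, and the minimum-size threshold are all hypothetical choices.

```kotlin
// Hypothetical sketch: deriving a capture region from the points of a circle gesture.
data class Point(val x: Int, val y: Int)

data class Region(val left: Int, val top: Int, val right: Int, val bottom: Int) {
    val width get() = right - left
    val height get() = bottom - top
}

/**
 * Returns the axis-aligned bounding box of the gesture path, padded slightly and
 * clamped to the screen, or null if the path is too small to be a usable region.
 */
fun regionFromGesture(
    path: List<Point>,
    screenWidth: Int,
    screenHeight: Int,
    padding: Int = 16,
    minSize: Int = 24,
): Region? {
    if (path.isEmpty()) return null
    val left = (path.minOf { it.x } - padding).coerceAtLeast(0)
    val top = (path.minOf { it.y } - padding).coerceAtLeast(0)
    val right = (path.maxOf { it.x } + padding).coerceAtMost(screenWidth)
    val bottom = (path.maxOf { it.y } + padding).coerceAtMost(screenHeight)
    val region = Region(left, top, right, bottom)
    return if (region.width >= minSize && region.height >= minSize) region else null
}
```

The resulting `Region` would then be handed to the MediaProjection-based capture step to crop the screenshot.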
### 3. Content-Type Detector
Classifies input for routing (audio/image/text). Analyzes the captured region to determine what type of content it contains and how to process it.
**Key Components:**
- View hierarchy analysis
- OCR text detection
- Audio source identification
- Image/media classification
- File type detection
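The routing decision can be modeled with a sealed hierarchy so the compiler enforces exhaustive handling. The type names and routing rules below are illustrative assumptions, not the project's actual API:

```kotlin
// Hypothetical sketch: classify a captured region and route it to an AI module.
sealed interface DetectedContent {
    data class Text(val value: String) : DetectedContent
    data class Audio(val sourcePackage: String) : DetectedContent
    data class Image(val width: Int, val height: Int) : DetectedContent
}

enum class Module { SPEECH, LLM, VISION }

/** Picks the module that should process the content first. */
fun route(content: DetectedContent): Module = when (content) {
    is DetectedContent.Audio -> Module.SPEECH // transcribe before reasoning
    is DetectedContent.Text -> Module.LLM     // reason over text directly
    is DetectedContent.Image -> Module.VISION // classify/describe first
}
```

Because `DetectedContent` is sealed, adding a new content type (e.g. for file detection) forces every `when` over it to be updated, which keeps the detector and the engine modules in sync.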
### 4. Local AI Engine
#### Speech Module
- **Options:** Vosk, DeepSpeech, PocketSphinx
- **Purpose:** Offline speech-to-text transcription
- **Input:** Audio streams, recorded audio
- **Output:** Transcribed text
#### LLM Module
- **Options:** MLC Chat, SmolChat, Edge Gallery (Llama 3, Phi-3, Gemma, Qwen in GGUF format)
- **Purpose:** Local reasoning, summarization, explanation
- **Input:** Text content, context
- **Output:** AI-generated responses
#### Vision Module
- **Options:** MLKit, TFLite, ONNX lightweight models
- **Purpose:** Image analysis, object detection, content classification
- **Input:** Images, screenshots
- **Output:** Classifications, descriptions, detected objects
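Since each module lists several backend options (Vosk vs. DeepSpeech, MLC vs. SmolChat, MLKit vs. TFLite), one way to keep them swappable is a thin interface per module behind a single facade. This is a hypothetical sketch of that seam; the interface names and synchronous signatures are assumptions (real implementations would run off the main thread, per the async-operations principle):

```kotlin
// Hypothetical abstraction: one interface per engine so concrete backends
// (Vosk, MLC Chat, TFLite, ...) stay replaceable behind the module boundary.
interface SpeechEngine { fun transcribe(audio: ByteArray): String }
interface LlmEngine { fun complete(prompt: String): String }
interface VisionEngine { fun describe(image: ByteArray): String }

/** Facade the Dialogue Agent talks to; concrete engines are injected via DI. */
class LocalAiEngine(
    private val speech: SpeechEngine,
    private val llm: LlmEngine,
    private val vision: VisionEngine,
) {
    fun summarize(text: String): String =
        llm.complete("Summarize: $text")

    // Pipeline example: speech output feeds the LLM, all on-device.
    fun transcribeAndSummarize(audio: ByteArray): String =
        llm.complete("Summarize: ${speech.transcribe(audio)}")

    fun describeImage(image: ByteArray): String =
        vision.describe(image)
}
```

Injecting fakes for these interfaces also makes the Dialogue Agent unit-testable without loading any model weights.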
### 5. Dialogue Agent
Maintains session state, keeps the dialogue open, and executes actions on command. Manages conversation flow and context persistence.
**Key Components:**
- State machine for dialogue flow
- Context/memory management
- Command parsing and routing
- Execution trigger handling
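The dialogue state machine could be expressed as a pure transition function, which keeps it independently testable per the modular-architecture principle. The states and events below are illustrative guesses at the flow described above, not the project's actual types:

```kotlin
// Hypothetical sketch of the dialogue-flow state machine.
sealed interface DialogueState {
    object Idle : DialogueState
    data class AwaitingResponse(val query: String) : DialogueState
    data class Responded(val answer: String) : DialogueState
    object Executing : DialogueState
}

sealed interface DialogueEvent {
    data class UserQuery(val text: String) : DialogueEvent
    data class ModelAnswer(val text: String) : DialogueEvent
    object ExecuteCommand : DialogueEvent
    object Dismiss : DialogueEvent
}

/** Pure transition function: no side effects, trivially unit-testable. */
fun transition(state: DialogueState, event: DialogueEvent): DialogueState = when {
    event is DialogueEvent.Dismiss -> DialogueState.Idle
    state is DialogueState.Idle && event is DialogueEvent.UserQuery ->
        DialogueState.AwaitingResponse(event.text)
    state is DialogueState.AwaitingResponse && event is DialogueEvent.ModelAnswer ->
        DialogueState.Responded(event.text)
    state is DialogueState.Responded && event is DialogueEvent.UserQuery ->
        DialogueState.AwaitingResponse(event.text) // continue the dialogue
    state is DialogueState.Responded && event is DialogueEvent.ExecuteCommand ->
        DialogueState.Executing
    else -> state // ignore events that don't apply in the current state
}
```

Command parsing and execution triggering would then sit at the edges, translating user input into `DialogueEvent`s and reacting to entry into `Executing`.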
### 6. UI/Feedback Layer
Interactive Compose overlays for user interaction and feedback display.
**Key Components:**
- Jetpack Compose UI components
- Overlay windows
- Feedback animations
- Status indicators
- Action buttons
### 7. Privacy/Data Management
Local-only data handling with cache and session controls.
**Key Components:**
- Local storage management
- Session cache controls
- Privacy settings
- Data retention policies
- No network call enforcement
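A retention policy over locally cached sessions might look like the following sketch. The `CachedSession` type and the age-based rule are assumptions for illustration; the real policy and storage layout are up to the implementation:

```kotlin
// Hypothetical sketch: enforcing a local data-retention window on cached sessions.
data class CachedSession(val id: String, val createdAtMillis: Long)

/**
 * Splits sessions into (kept, expired) against a maximum age; the caller
 * deletes the expired ones from app-private storage.
 */
fun applyRetention(
    sessions: List<CachedSession>,
    nowMillis: Long,
    maxAgeMillis: Long,
): Pair<List<CachedSession>, List<CachedSession>> =
    sessions.partition { nowMillis - it.createdAtMillis <= maxAgeMillis }
```

Running this on a schedule (or on app start) keeps the cache bounded without any data ever leaving the device.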
## Dataflow Diagram
```
User Gesture → Accessibility Service → Region Processor
                         ↓
                Content-Type Detector
     ┌───────────────────┼───────────────────┐
     ↓                   ↓                   ↓
Speech Module        LLM Module        Vision Module
     ↓                   ↓                   ↓
     └───────────────────┼───────────────────┘
                         ↓
                  Dialogue Agent
                         ↓
                   UI/Feedback
                         ↓
               User sees response
                         ↓
         Continue dialogue or Execute
```
## Technology Stack
- **Platform:** Android (minSdk 27+, targetSdk 34)
- **UI Framework:** Jetpack Compose
- **Overlay System:** AccessibilityService (TYPE_ACCESSIBILITY_OVERLAY)
- **Screen Capture:** MediaProjection API
- **Speech-to-Text:** Vosk / DeepSpeech / PocketSphinx (offline)
- **LLM:** MLC Chat / SmolChat / Edge Gallery (on-device Llama 3, Phi-3, Gemma, Qwen)
- **Vision:** MLKit / TFLite / ONNX
- **Voice Commands:** Porcupine / Vosk / PocketSphinx (wake word detection)
- **Language:** Kotlin
- **Build System:** Gradle
## Design Principles
1. **Privacy First:** All processing happens on-device. Zero network calls for AI inference.
2. **Modular Architecture:** Each component is independently testable and replaceable.
3. **Async Operations:** All ML inference is non-blocking and asynchronous.
4. **Dependency Injection:** Components are loosely coupled via DI.
5. **Composition Over Inheritance:** Favor composable functions and interfaces.
6. **Local Data Only:** All app data stays within the local app context.
## Known Constraints and Considerations
- **Model Size:** LLMs and STT models can be large; ensure device compatibility
- **Device Fragmentation:** Overlay behavior varies across OEMs (MIUI, Samsung, etc.)
- **Gesture Recognition:** Need robust handling to minimize false positives
- **Performance:** On-device inference requires optimization for lower-end devices
- **Battery Impact:** Continuous overlay and ML inference need power optimization
## Security and Privacy
- **No Network Permissions:** App does not request internet access for AI features
- **Local Storage:** All data stored in app-private directories
- **No Cloud Services:** Zero dependency on external APIs or cloud infrastructure
- **User Control:** Complete user control over data retention and cache clearing
- **Transparency:** Open-source codebase for full auditability