# Technical Architecture

## Layered System Modules

### 1. Accessibility Service Layer

Detects gestures, draws overlays. Uses Android's `AccessibilityService` to create system-wide overlays that capture user input without interfering with underlying apps.

**Key Components:**

- AccessibilityService implementation
- TYPE_ACCESSIBILITY_OVERLAY window management
- Gesture capture and event routing

### 2. Region Processor

Converts the user's input area into actionable bounds. Translates circle/touch gestures into screen coordinates and extracts the selected region.

**Key Components:**

- Gesture recognition (circle, tap, drag)
- Coordinate transformation
- Region boundary calculation
- Screenshot/content capture via MediaProjection API
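
The boundary calculation can be sketched in plain Kotlin. This is a hypothetical illustration, not the project's implementation: `gestureBounds`, the `paddingPx` margin, and the simplified `Rect` type are all assumptions; the real processor would work from `MotionEvent` coordinates and request capture via MediaProjection.

```kotlin
import kotlin.math.max
import kotlin.math.min

// Minimal stand-in for android.graphics.Rect, for illustration only.
data class Rect(val left: Int, val top: Int, val right: Int, val bottom: Int) {
    val width get() = right - left
    val height get() = bottom - top
}

// Derive a bounding box from the points of a circle/drag gesture,
// padded slightly and clamped to the screen so the capture stays valid.
fun gestureBounds(
    points: List<Pair<Float, Float>>, // captured gesture coordinates (x, y)
    screenWidth: Int,
    screenHeight: Int,
    paddingPx: Int = 16               // illustrative margin around the stroke
): Rect {
    require(points.isNotEmpty()) { "gesture produced no points" }
    val left = points.minOf { it.first } - paddingPx
    val top = points.minOf { it.second } - paddingPx
    val right = points.maxOf { it.first } + paddingPx
    val bottom = points.maxOf { it.second } + paddingPx
    return Rect(
        left = max(0, left.toInt()),
        top = max(0, top.toInt()),
        right = min(screenWidth, right.toInt()),
        bottom = min(screenHeight, bottom.toInt())
    )
}
```

The padding compensates for imprecise finger strokes; clamping keeps the region inside the visible display before any capture request is made.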
### 3. Content-Type Detector

Classifies input for routing (audio/image/text). Analyzes the captured region to determine what type of content it contains and how to process it.

**Key Components:**

- View hierarchy analysis
- OCR text detection
- Audio source identification
- Image/media classification
- File type detection
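
The routing decision can be sketched as a simple classifier. This is a hedged, hypothetical sketch: the types and heuristics (`hasPlayingAudio`, `ocrText`) are illustrative stand-ins for the real view-hierarchy analysis, OCR, and audio-source checks listed above.

```kotlin
// The three routes corresponding to the downstream AI modules.
enum class ContentType { AUDIO, IMAGE, TEXT }

// Illustrative summary of what the detector learned about a region.
data class CapturedRegion(
    val hasPlayingAudio: Boolean, // e.g. an active media session under the region
    val ocrText: String?          // text recovered by OCR, if any
)

// Pick the module that should process the region.
fun detectContentType(region: CapturedRegion): ContentType = when {
    region.hasPlayingAudio -> ContentType.AUDIO          // route to Speech Module
    !region.ocrText.isNullOrBlank() -> ContentType.TEXT  // route to LLM Module
    else -> ContentType.IMAGE                            // fall back to Vision Module
}
```

Ordering matters: audio takes precedence because a playing media session is unambiguous, while the image route is the safe fallback when neither audio nor text is found.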
### 4. Local AI Engine
#### Speech Module

- **Options:** Vosk, DeepSpeech, PocketSphinx
- **Purpose:** Offline speech-to-text transcription
- **Input:** Audio streams, recorded audio
- **Output:** Transcribed text

#### LLM Module

- **Options:** MLC Chat, SmolChat, Edge Gallery (Llama 3, Phi-3, Gemma, Qwen in GGUF format)
- **Purpose:** Local reasoning, summarization, explanation
- **Input:** Text content, context
- **Output:** AI-generated responses

#### Vision Module

- **Options:** MLKit, TFLite, ONNX lightweight models
- **Purpose:** Image analysis, object detection, content classification
- **Input:** Images, screenshots
- **Output:** Classifications, descriptions, detected objects
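
Because each module has interchangeable backends (Vosk vs. DeepSpeech, MLC Chat vs. SmolChat, MLKit vs. TFLite), the engine benefits from a thin interface seam. The sketch below is hypothetical: the interface names, method signatures, and `LocalAiEngine` wiring are assumptions illustrating the modular, DI-friendly design, not the actual backend APIs.

```kotlin
// One small interface per module; real backends would wrap Vosk, MLC, MLKit, etc.
interface SpeechToText { fun transcribe(audio: ByteArray): String }
interface LocalLlm { fun complete(prompt: String): String }
interface VisionModel { fun classify(imageBytes: ByteArray): List<String> }

// The engine composes the three modules; dependency injection supplies
// whichever backend the device can run. Swapping a backend never touches callers.
class LocalAiEngine(
    private val stt: SpeechToText,
    private val llm: LocalLlm,
    private val vision: VisionModel
) {
    // Example cross-module flow: transcribe audio, then summarize locally.
    fun summarizeAudio(audio: ByteArray): String =
        llm.complete("Summarize: " + stt.transcribe(audio))

    fun describeImage(imageBytes: ByteArray): List<String> =
        vision.classify(imageBytes)
}
```

This keeps each component independently testable (stub implementations in unit tests) and replaceable, matching the design principles below.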
### 5. Dialogue Agent

Maintains session state, keeps the dialogue open, and executes on command. Manages conversation flow and context persistence.

**Key Components:**

- State machine for dialogue flow
- Context/memory management
- Command parsing and routing
- Execution trigger handling
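
The dialogue state machine can be sketched with sealed types. This is an illustrative sketch under assumed names (`DialogueState`, `DialogueEvent`, `transition`): the agent stays in an open session accumulating context, and moves to execution only on an explicit command.

```kotlin
// States the agent can be in.
sealed class DialogueState {
    object Idle : DialogueState()
    data class InSession(val context: List<String>) : DialogueState()
    data class Executing(val command: String) : DialogueState()
}

// Events that drive transitions.
sealed class DialogueEvent {
    data class UserUtterance(val text: String) : DialogueEvent()
    data class Command(val name: String) : DialogueEvent()
    object SessionEnded : DialogueEvent()
}

// Pure transition function: easy to unit-test and reason about.
fun transition(state: DialogueState, event: DialogueEvent): DialogueState = when (event) {
    is DialogueEvent.UserUtterance -> {
        // Keep the dialogue open and remember the turn as context.
        val prior = (state as? DialogueState.InSession)?.context ?: emptyList()
        DialogueState.InSession(prior + event.text)
    }
    is DialogueEvent.Command -> DialogueState.Executing(event.name)
    DialogueEvent.SessionEnded -> DialogueState.Idle
}
```

Keeping the transition function pure separates dialogue logic from UI and inference, so context/memory handling can be tested without any Android dependency.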
### 6. UI/Feedback Layer

Interactive Compose overlays for user interaction and feedback display.

**Key Components:**

- Jetpack Compose UI components
- Overlay windows
- Feedback animations
- Status indicators
- Action buttons

### 7. Privacy/Data Management

Local-only data handling, cache and session controls.

**Key Components:**

- Local storage management
- Session cache controls
- Privacy settings
- Data retention policies
- No network call enforcement
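
A retention policy can be expressed as a small pure function. This sketch is hypothetical: the names (`RetentionPolicy`, `CachedItem`, `expiredItems`) and the 24-hour default are illustrative assumptions; the real implementation would scan app-private directories and honor the user's settings.

```kotlin
// How long session artifacts may live; 24h default is illustrative only.
data class RetentionPolicy(val maxAgeMillis: Long = 24 * 60 * 60 * 1000L)

// A cached artifact in app-private storage.
data class CachedItem(val path: String, val createdAtMillis: Long)

// Select items older than the policy window, i.e. eligible for deletion.
fun expiredItems(
    items: List<CachedItem>,
    policy: RetentionPolicy,
    nowMillis: Long
): List<CachedItem> =
    items.filter { nowMillis - it.createdAtMillis > policy.maxAgeMillis }
```

Taking `nowMillis` as a parameter (rather than reading the clock inside) keeps the policy deterministic and trivially testable.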
## Dataflow Diagram

```
User Gesture → Accessibility Service → Region Processor
                                              ↓
                                    Content-Type Detector
                                              ↓
                          ┌───────────────────┼───────────────────┐
                          ↓                   ↓                   ↓
                    Speech Module        LLM Module        Vision Module
                          ↓                   ↓                   ↓
                          └───────────────────┼───────────────────┘
                                              ↓
                                       Dialogue Agent
                                              ↓
                                         UI/Feedback
                                              ↓
                                      User sees response
                                              ↓
                                 Continue dialogue or Execute
```

## Technology Stack

- **Platform:** Android (minSdk 27+, targetSdk 34)
- **UI Framework:** Jetpack Compose
- **Overlay System:** AccessibilityService (TYPE_ACCESSIBILITY_OVERLAY)
- **Screen Capture:** MediaProjection API
- **Speech-to-Text:** Vosk / DeepSpeech / PocketSphinx (offline)
- **LLM:** MLC Chat / SmolChat / Edge Gallery (on-device Llama 3, Phi-3, Gemma, Qwen)
- **Vision:** MLKit / TFLite / ONNX
- **Voice Commands:** Porcupine / Vosk / PocketSphinx (wake word detection)
- **Language:** Kotlin
- **Build System:** Gradle

## Design Principles

1. **Privacy First:** All processing happens on-device. Zero network calls for AI inference.
2. **Modular Architecture:** Each component is independently testable and replaceable.
3. **Async Operations:** All ML inference is non-blocking and asynchronous.
4. **Dependency Injection:** Components are loosely coupled via DI.
5. **Composition Over Inheritance:** Favor composable functions and interfaces.
6. **Local Data Only:** All app data stays within the local app context.

## Known Constraints and Considerations

- **Model Size:** LLMs and STT models can be large; ensure device compatibility
- **Device Fragmentation:** Overlay behavior varies across OEMs (MIUI, Samsung, etc.)
- **Gesture Recognition:** Need robust handling to minimize false positives
- **Performance:** On-device inference requires optimization for lower-end devices
- **Battery Impact:** Continuous overlay and ML inference need power optimization

## Security and Privacy

- **No Network Permissions:** App does not request internet access for AI features
- **Local Storage:** All data stored in app-private directories
- **No Cloud Services:** Zero dependency on external APIs or cloud infrastructure
- **User Control:** Complete user control over data retention and cache clearing
- **Transparency:** Open-source codebase for full auditability