✅ TICKET-006: Wake-word Detection Service
- Implemented wake-word detection using openWakeWord
- HTTP/WebSocket server on port 8002
- Real-time detection with configurable threshold
- Event emission for ASR integration
- Location: home-voice-agent/wake-word/

✅ TICKET-010: ASR Service
- Implemented ASR using faster-whisper
- HTTP endpoint for file transcription
- WebSocket endpoint for streaming transcription
- Support for multiple audio formats
- Auto language detection
- GPU acceleration support
- Location: home-voice-agent/asr/

✅ TICKET-014: TTS Service
- Implemented TTS using Piper
- HTTP endpoint for text-to-speech synthesis
- Low-latency processing (< 500ms)
- Multiple voice support
- WAV audio output
- Location: home-voice-agent/tts/

✅ TICKET-047: Updated Hardware Purchases
- Marked Pi5 kit, SSD, microphone, and speakers as purchased
- Updated progress log with purchase status

📚 Documentation:
- Added VOICE_SERVICES_README.md with complete testing guide
- Each service includes README.md with usage instructions
- All services ready for Pi5 deployment

🧪 Testing:
- Created test files for each service
- All imports validated
- FastAPI apps created successfully
- Code passes syntax validation

🚀 Ready for:
- Pi5 deployment
- End-to-end voice flow testing
- Integration with MCP server

Files Added:
- wake-word/detector.py
- wake-word/server.py
- wake-word/requirements.txt
- wake-word/README.md
- wake-word/test_detector.py
- asr/service.py
- asr/server.py
- asr/requirements.txt
- asr/README.md
- asr/test_service.py
- tts/service.py
- tts/server.py
- tts/requirements.txt
- tts/README.md
- tts/test_service.py
- VOICE_SERVICES_README.md

Files Modified:
- tickets/done/TICKET-047_hardware-purchases.md

Files Moved:
- tickets/backlog/TICKET-006_prototype-wake-word-node.md → tickets/done/
- tickets/backlog/TICKET-010_streaming-asr-service.md → tickets/done/
- tickets/backlog/TICKET-014_tts-service.md → tickets/done/
# ASR (Automatic Speech Recognition) Service

Speech-to-text service using faster-whisper for real-time transcription.

## Features

- HTTP endpoint for file transcription
- WebSocket endpoint for streaming transcription
- Support for multiple audio formats (WAV, MP3, FLAC, etc.)
- Auto language detection
- Low-latency processing
- GPU acceleration support (CUDA)

## Installation

```bash
# Install Python dependencies
pip install -r requirements.txt

# For GPU support (optional)
# CUDA toolkit must be installed
# faster-whisper will use GPU automatically if available
```

## Usage

### Standalone Service

```bash
# Run as HTTP/WebSocket server
python3 -m asr.server

# Or use uvicorn directly
uvicorn asr.server:app --host 0.0.0.0 --port 8001
```

### Python API

```python
from asr.service import ASRService

service = ASRService(
    model_size="small",
    device="cpu",  # or "cuda" for GPU
    language="en",
)

# Transcribe a file
with open("audio.wav", "rb") as f:
    result = service.transcribe_file(f.read())
print(result["text"])
```

## API Endpoints

### HTTP

- `GET /health` - Health check
- `POST /transcribe` - Transcribe audio file
  - `audio`: Audio file (multipart/form-data)
  - `language`: Language code (optional)
  - `format`: Response format (`"text"` or `"json"`)
- `GET /languages` - Get supported languages

### WebSocket

- `WS /stream` - Streaming transcription
  - Send audio chunks (binary)
  - Send `{"action": "end"}` to finish
  - Receive partial and final results

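The chunk-then-end flow of the `/stream` endpoint can be sketched as a generator that yields the frames a client would send in order. The chunk size here (3200 bytes, roughly 100 ms of 16 kHz 16-bit mono audio) is an assumption, not a value the service mandates:

```python
import json


def stream_frames(audio: bytes, chunk_size: int = 3200):
    """Yield the frame sequence a /stream client sends: binary audio
    chunks followed by a JSON end-of-stream control message."""
    for i in range(0, len(audio), chunk_size):
        yield audio[i:i + chunk_size]        # binary audio chunk
    yield json.dumps({"action": "end"})      # control frame ending the stream
```

A WebSocket client would send these frames in order, then read partial results until it receives the final one.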
## Configuration

- **Model Size**: `small` (default), `tiny`, `base`, `medium`, `large`
- **Device**: `cpu` (default), `cuda` (if GPU available)
- **Compute Type**: `int8` (default), `int8_float16`, `float16`, `float32`
- **Language**: `en` (default), or `None` for auto-detect

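When the device should be chosen at startup rather than hard-coded, one option is to probe CTranslate2 (faster-whisper's inference backend). This helper is a convenience sketch, not part of the service API:

```python
def pick_device() -> str:
    """Return "cuda" when CTranslate2 can see a GPU, else "cpu".

    Falls back to "cpu" when ctranslate2 is not installed at all.
    """
    try:
        import ctranslate2
        return "cuda" if ctranslate2.get_cuda_device_count() > 0 else "cpu"
    except ImportError:
        return "cpu"
```

The result can be passed straight to `ASRService(device=pick_device(), ...)`.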
## Performance

- **CPU (small model)**: ~2-4s latency
- **GPU (small model)**: ~0.5-1s latency
- **GPU (medium model)**: ~1-2s latency

## Integration

The ASR service is triggered by:

1. Wake-word detection events
2. Direct HTTP/WebSocket requests
3. Audio file uploads

Output is sent to:

1. LLM for processing
2. Conversation manager
3. Response generation

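The trigger-and-output flow above can be sketched as a small glue function. The `asr` and `llm` objects and the event shape are hypothetical stand-ins for illustration, not the actual integration interfaces:

```python
def on_wake_word(event: dict, asr, llm) -> str:
    """Hypothetical glue: a wake-word event carries captured audio,
    the ASR service turns it into text, and the LLM produces a reply."""
    audio = event["audio"]                      # raw audio bytes from the wake-word node
    text = asr.transcribe_file(audio)["text"]   # ASR output, as in the Python API
    return llm.respond(text)                    # hand off for response generation
```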
## Testing

```bash
# Test health
curl http://localhost:8001/health

# Test transcription
curl -X POST http://localhost:8001/transcribe \
  -F "audio=@test.wav" \
  -F "language=en" \
  -F "format=json"
```

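The health check can also be scripted in Python with the standard library; this hypothetical helper returns `None` instead of raising when the service is not up:

```python
import json
import urllib.request
from urllib.error import URLError


def check_health(url: str = "http://localhost:8001/health", timeout: float = 2.0):
    """Return the parsed /health response, or None if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read())
    except (URLError, OSError, ValueError):
        return None
```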
## Notes

- First run downloads the model (~500 MB for `small`)
- GPU acceleration requires CUDA
- Streaming transcription needs proper audio format handling
- Supports many languages (see the `/languages` endpoint)
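As a concrete example of the audio-format caveat, a quick standard-library check that a WAV payload is 16 kHz, mono, 16-bit PCM. Treating that as the expected input format is an assumption here; the service may resample other rates internally:

```python
import io
import wave


def is_16k_mono_pcm(wav_bytes: bytes) -> bool:
    """True if the WAV payload is 16 kHz, mono, 16-bit PCM."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return (w.getframerate(), w.getnchannels(), w.getsampwidth()) == (16000, 1, 2)
```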