# ASR API Contract

API specification for the Automatic Speech Recognition (ASR) service.

## Overview

The ASR service converts audio input to text. It supports streaming audio for real-time transcription.

## Base URL

```
http://localhost:8001/api/asr
```

(Configurable port and host)

## Endpoints

### 1. Health Check

```
GET /health
```

**Response:**

```json
{
  "status": "healthy",
  "model": "faster-whisper",
  "model_size": "base",
  "language": "en"
}
```

### 2. Transcribe Audio (File Upload)

```
POST /transcribe
Content-Type: multipart/form-data
```

**Request:**

- `audio`: Audio file (WAV, MP3, FLAC, etc.)
- `language` (optional): Language code (default: `"en"`)
- `format` (optional): Response format (`"text"` or `"json"`, default: `"text"`)

**Response (text format):**

```
This is the transcribed text.
```

**Response (json format):**

```json
{
  "text": "This is the transcribed text.",
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "This is the transcribed text."
    }
  ],
  "language": "en",
  "duration": 2.5
}
```

### 3. Streaming Transcription (WebSocket)

```
WS /stream
```

**Client → Server:**

- Send audio chunks (binary)
- Send `{"action": "end"}` to finish

**Server → Client:**

```json
{
  "type": "partial",
  "text": "Partial transcription..."
}
```

```json
{
  "type": "final",
  "text": "Final transcription.",
  "segments": [...]
}
```

### 4. Get Supported Languages

```
GET /languages
```

**Response:**

```json
{
  "languages": [
    {"code": "en", "name": "English"},
    {"code": "es", "name": "Spanish"},
    ...
  ]
}
```

## Error Responses

```json
{
  "error": "Error message",
  "code": "ERROR_CODE"
}
```

**Error Codes:**

- `INVALID_AUDIO`: Audio file is invalid or unsupported
- `TRANSCRIPTION_FAILED`: Transcription process failed
- `LANGUAGE_NOT_SUPPORTED`: Requested language is not supported
- `SERVICE_UNAVAILABLE`: ASR service is unavailable

## Rate Limiting

- **File upload**: 10 requests/minute
- **Streaming**: 1 concurrent stream per client

## Audio Format Requirements

- **Format**: WAV, MP3, FLAC, OGG
- **Sample Rate**: 16 kHz recommended (other rates are auto-resampled)
- **Channels**: Mono or stereo (stereo is converted to mono)
- **Bit Depth**: 16-bit recommended

## Performance

- **Latency**: < 500 ms for short utterances (< 5 s)
- **Accuracy**: < 5% word error rate (WER) for clear speech
- **Model**: faster-whisper (base or small)

## Integration

### With Wake-Word Service

1. The wake-word service detects activation
2. It sends a "start" signal to the ASR service
3. ASR begins streaming transcription
4. The wake-word service sends a "stop" signal
5. ASR returns the final transcription

### With LLM

1. ASR returns the transcribed text
2. The text is sent to the LLM for processing
3. The LLM response is sent to TTS

## Example Usage

### Python Client

```python
import requests

# Transcribe a file
with open("audio.wav", "rb") as f:
    response = requests.post(
        "http://localhost:8001/api/asr/transcribe",
        files={"audio": f},
        data={"language": "en", "format": "json"},
    )

result = response.json()
print(result["text"])
```

### JavaScript Client

```javascript
// Streaming transcription
const ws = new WebSocket("ws://localhost:8001/api/asr/stream");

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === "final") {
    console.log("Transcription:", data.text);
  }
};

// Send audio chunks
const audioChunk = ...; // Audio data
ws.send(audioChunk);
```

## Future Enhancements

- Speaker diarization
- Punctuation and capitalization
- Custom vocabulary
- Per-word confidence scores
- Multiple language detection
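As a complement to the Python client example, here is a minimal sketch of how a caller might parse the documented JSON response and error payloads into typed objects. The `Segment` and `ASRError` names are illustrative assumptions, not part of this contract:

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One timed segment from a /transcribe JSON response."""
    start: float
    end: float
    text: str


class ASRError(Exception):
    """Raised when the service returns the documented error payload.

    (Hypothetical helper class; not defined by this API contract.)
    """
    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code


def parse_response(payload: dict) -> list[Segment]:
    """Convert a /transcribe JSON body into Segment objects.

    Raises ASRError if the body matches the documented error shape
    ({"error": ..., "code": ...}).
    """
    if "error" in payload:
        raise ASRError(payload.get("code", "UNKNOWN"), payload["error"])
    return [
        Segment(s["start"], s["end"], s["text"])
        for s in payload.get("segments", [])
    ]


# Example payload matching the json-format response shown above
ok = {
    "text": "This is the transcribed text.",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": "This is the transcribed text."}
    ],
    "language": "en",
    "duration": 2.5,
}
segments = parse_response(ok)
```

A typed wrapper like this keeps the distinction between the success and error shapes in one place, so callers don't have to re-check for an `"error"` key at every call site.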