All endpoints live under http://<host>:8780. JSON for data, multipart/form-data
for file uploads.
For complete env defaults, API override precedence, and ASR / AS-norm internal
defaults that are not public knobs yet, see
configuration.en.md.
With API_KEY set, every request except the ones below must carry
Authorization: Bearer <API_KEY> or X-API-Key: <API_KEY>:
| Path | Public | Match |
|---|---|---|
| `GET /` | ✅ bundled web UI | exact |
| `GET /healthz` | ✅ liveness probe | exact |
| `GET /docs` / `/redoc` / `/openapi.json` | ✅ FastAPI auto docs | exact |
| `GET /static/*` | ✅ static assets | `/static/` prefix |
| other `/api/*` | ❌ requires the key | n/a |
Missing or wrong key → 401 Unauthorized. Key comparison uses
hmac.compare_digest (constant-time). Since 0.2.0, /docs, /redoc,
/openapi.json are exact-match public paths — /docsXYZ now
returns 401.
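
The matching rules above can be sketched in a few lines of Python. This is a minimal illustration, not the service's actual middleware: `is_public`, `key_matches` and `is_authorized` are hypothetical names, and treating an unset `API_KEY` as "no check" is an assumption inferred from the wording "With API_KEY set".

```python
import hmac
from typing import Mapping, Optional

# Public GET paths from the table above; everything else needs the key.
EXACT_PUBLIC = {"/", "/healthz", "/docs", "/redoc", "/openapi.json"}
PREFIX_PUBLIC = ("/static/",)


def is_public(method: str, path: str) -> bool:
    """True when the request may skip the API key check."""
    if method != "GET":
        return False
    return path in EXACT_PUBLIC or path.startswith(PREFIX_PUBLIC)


def key_matches(presented: Optional[str], expected: str) -> bool:
    """Compare in constant time via hmac.compare_digest."""
    if presented is None:
        return False
    return hmac.compare_digest(presented.encode(), expected.encode())


def is_authorized(method: str, path: str,
                  headers: Mapping[str, str], api_key: str) -> bool:
    """Hypothetical helper: accept either header form described above."""
    if not api_key or is_public(method, path):
        return True  # assumption: no API_KEY configured means no check
    auth = headers.get("Authorization", "")
    bearer = auth[len("Bearer "):] if auth.startswith("Bearer ") else None
    return (key_matches(bearer, api_key)
            or key_matches(headers.get("X-API-Key"), api_key))
```

Because `/docs` is matched exactly, `is_public("GET", "/docsXYZ")` is false and the request falls through to the key check.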
```
POST /api/transcribe
        ↓
queued → converting → denoising (if effective denoise_model ≠ none) → transcribing → identifying → completed
                                                                                      ↘ failed
```
The BetterAINote worker polls /api/jobs/{id} every 5 seconds and stops
as soon as it sees completed or failed.
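
A client-side polling loop along these lines might look like the sketch below. It is not the BetterAINote worker's code: `poll_job`, the injected `fetch` callable and the `max_polls` safety limit are assumptions made so the example runs without a network.

```python
import time
from typing import Callable

TERMINAL = {"completed", "failed"}


def poll_job(fetch: Callable[[], dict],
             interval: float = 5.0,
             sleep: Callable[[float], None] = time.sleep,
             max_polls: int = 1000) -> dict:
    """Call fetch() until the job reports a terminal status."""
    for _ in range(max_polls):
        job = fetch()  # e.g. the parsed body of GET /api/jobs/{id}
        if job.get("status") in TERMINAL:
            return job
        sleep(interval)
    raise TimeoutError("job did not reach a terminal status")
```

In real use `fetch` would wrap an HTTP GET that sends the API key header; passing it in keeps the loop itself easy to test.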
```bash
curl http://localhost:8780/healthz
# {"ok":true}
```

`POST /api/transcribe` form fields:
| Field | Type | Description |
|---|---|---|
| `file` | file | Required. Audio (wav / mp3 / m4a / flac / ogg / webm) |
| `language` | string | Optional, ISO 639-1; omit to auto-detect (Mandarin audio outputs Simplified Chinese) |
| `min_speakers` | int | Optional, 0 = auto |
| `max_speakers` | int | Optional, 0 = auto |
| `denoise_model` | string | Optional. Noise reduction backend: `none`, `deepfilternet`, `noisereduce`. When omitted, the server uses `DENOISE_MODEL` (default `none`). Sending `none` explicitly disables denoising for this request only. |
| `snr_threshold` | float | Optional. DeepFilterNet SNR gate threshold (dB) for this request only. When `deepfilternet` is selected, audio at or above this level skips DeepFilterNet. Overrides `DENOISE_SNR_THRESHOLD` (default `10.0`); `noisereduce` does not use this gate. |
| `no_repeat_ngram_size` | int | Optional, default `0` (disabled). When ≥ 3, suppresses n-gram repetitions in the transcript (e.g. "like like like" → "like"). Values < 3 are treated as 0. Non-integer values return 422. |
Response (200):
```json
{ "id": "tr_example_id", "status": "queued" }
```

`POST /api/transcribe` has two dedup paths, both keyed by the upload SHA256:

- Completed-result dedup: if an identical file already has a completed transcription, the endpoint returns that existing job immediately without re-running Whisper:

  ```json
  { "id": "tr_existing_id", "status": "completed", "deduplicated": true }
  ```

- In-flight dedup: if an identical file is already being processed by another live request, the later caller is attached to the first job instead of starting a second worker. The response reuses the first job id and stays in `queued` until that job advances:

  ```json
  { "id": "tr_existing_inflight", "status": "queued", "deduplicated": true }
  ```

In both cases, `deduplicated: true` means this request did not create a new transcription
worker. Use the returned id normally: poll `/api/jobs/{id}` or export as usual.
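
The two dedup paths can be illustrated with a small sketch keyed by the upload's SHA256. The function names and the two plain dictionaries (`completed`, `inflight`) are hypothetical stand-ins for whatever state the server actually keeps.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file-like object chunk by chunk."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def submit(stream, completed: dict, inflight: dict, new_id: str) -> dict:
    """Return an existing job for identical content, else register a new one."""
    key = sha256_of_stream(stream)
    if key in completed:   # completed-result dedup
        return {"id": completed[key], "status": "completed", "deduplicated": True}
    if key in inflight:    # in-flight dedup
        return {"id": inflight[key], "status": "queued", "deduplicated": True}
    inflight[key] = new_id
    return {"id": new_id, "status": "queued"}
```

A second `submit` with the same bytes returns the first job id with `deduplicated: True` instead of registering `new_id`.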
Upload size: the server streams the upload in chunks and returns
413 the moment the total exceeds MAX_UPLOAD_BYTES (default 2 GiB):
```json
{ "detail": "Upload exceeds MAX_UPLOAD_BYTES (2147483648 bytes)" }
```

The partial file is deleted from `data/uploads/`. Lower the cap in
`.env` if your disk is small (the value is in bytes).
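
As a rough picture of chunked enforcement, the sketch below copies a stream and aborts as soon as the running total passes the limit. `copy_with_cap` and `UploadTooLarge` are hypothetical names; the real handler also deletes the partial file and maps the failure to HTTP 413, which is not modelled here.

```python
import io

MAX_UPLOAD_BYTES = 2 * 1024 ** 3  # documented default: 2 GiB = 2147483648 bytes


class UploadTooLarge(Exception):
    """Raised when the running total exceeds the cap."""


def copy_with_cap(src, dst, limit: int = MAX_UPLOAD_BYTES,
                  chunk_size: int = 1024 * 1024) -> int:
    """Copy src to dst in chunks, stopping once the total passes limit."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return total
        total += len(chunk)
        if total > limit:
            raise UploadTooLarge(f"Upload exceeds MAX_UPLOAD_BYTES ({limit} bytes)")
        dst.write(chunk)
```

Checking the total before writing each chunk means an oversized upload never puts more than `limit` bytes on disk.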
Filename: the multipart filename is reduced to
PurePosixPath(filename).name before use. A client-supplied
filename=../../etc/passwd.wav lands on disk as just
tr_<id>_passwd.wav.
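
The effect of `PurePosixPath(filename).name` is easy to check in isolation. In this sketch `stored_name` is a hypothetical helper, and the `<job id>_<name>` layout is taken from the example above.

```python
from pathlib import PurePosixPath


def stored_name(job_id: str, client_filename: str) -> str:
    """Keep only the final path component of the client-supplied name."""
    safe = PurePosixPath(client_filename).name
    return f"{job_id}_{safe}"


print(stored_name("tr_abc", "../../etc/passwd.wav"))  # tr_abc_passwd.wav
```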
503 cases: POST /api/transcribe can also fail before work starts:
- `503 Failed to persist job state`: disk error, retry later
- `503 Failed to start background transcription`: retry later
Example:
```bash
curl -X POST http://localhost:8780/api/transcribe \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@meeting.wav" \
  -F "language=en" \
  -F "max_speakers=4"
```

Noise reduction precedence is: explicit API field first, then server env. In
practice, omit denoise_model to inherit DENOISE_MODEL, send
denoise_model=none to disable denoising for one request, and send
snr_threshold only when this job needs a threshold different from
DENOISE_SNR_THRESHOLD. That threshold only affects deepfilternet;
noisereduce runs directly whenever selected.
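
The precedence can be written down as a small resolver. This is a sketch under stated assumptions: `effective_denoise` and `runs_denoiser` are hypothetical names, `env` stands in for the server environment, and the measured SNR is simply passed in rather than computed.

```python
from typing import Mapping, Optional


def effective_denoise(request_model: Optional[str],
                      request_snr: Optional[float],
                      env: Mapping[str, str]) -> dict:
    """Resolve per-request fields against the server env defaults."""
    model = (request_model if request_model is not None
             else env.get("DENOISE_MODEL", "none"))
    snr = (request_snr if request_snr is not None
           else float(env.get("DENOISE_SNR_THRESHOLD", "10.0")))
    return {"denoise_model": model, "snr_threshold": snr}


def runs_denoiser(settings: dict, measured_snr_db: float) -> bool:
    """Only deepfilternet is SNR-gated; noisereduce runs whenever selected."""
    if settings["denoise_model"] == "deepfilternet":
        return measured_snr_db < settings["snr_threshold"]
    return settings["denoise_model"] == "noisereduce"
```

With `DENOISE_MODEL=deepfilternet` on the server, sending `denoise_model=none` resolves to `none` for that request only, while omitting the field inherits `deepfilternet`.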
Note: `GET /api/jobs/{id}` checks the in-memory job dictionary first; on a cache miss it falls back to `data/transcriptions/<id>/status.json` on disk.

- If a completed job is still present in memory, `result` is served from the in-memory job cache.
- On a cache miss, completed jobs load `result.json` from disk.
- If status is in-progress at the time of the miss, it returns `status=failed, error="Process restarted while job was in progress"` (set by `recover_orphan_jobs()` at startup).
- Returns 404 only if `status.json` does not exist.

Service restarts no longer leave jobs in an indeterminate state: clients will always receive a definitive terminal status.
```json
{
  "id": "tr_...",
  "status": "queued | converting | denoising | transcribing | identifying | completed | failed",
  "filename": "meeting.wav",
  "error": "...",            // only when status = failed
  "result": {                // only when status = completed
    "id": "tr_...",
    "language": "en",
    "segments": [
      {
        "id": 0,
        "start": 0.0,
        "end": 4.32,
        "text": "This is the first segment.",
        "speaker_label": "SPEAKER_00",
        "speaker_id": "spk_...",
        "speaker_name": "Alice",
        "similarity": 0.8421,
        "words": [
          { "word": "This", "start": 0.05, "end": 0.18, "score": 0.98 },
          { "word": "is", "start": 0.18, "end": 0.29, "score": 0.96 }
        ]
      }
    ],
    "speaker_map": {
      "SPEAKER_00": {
        "matched_id": "spk_...",
        "matched_name": "Alice",
        "similarity": 0.8421,
        "embedding_key": "SPEAKER_00"
      }
    },
    "unique_speakers": ["Alice"],
    "params": {
      "language": "en",      // shows "auto" when no language was specified at submit time
      "denoise_model": "none",
      "snr_threshold": 10.0,
      "voiceprint_threshold": 0.75,
      "min_speakers": 0,
      "max_speakers": 0,
      "no_repeat_ngram_size": 0
    },
    "artifacts": {
      "manifest_version": "artifact_manifest.v1",
      "stable": [
        {
          "name": "result",
          "filename": "result.json",
          "role": "primary_result",
          "media_type": "application/json",
          "required_for_result": true
        },
        {
          "name": "speaker_embedding",
          "filename": "emb_SPEAKER_00.npy",
          "role": "speaker_embedding",
          "media_type": "application/octet-stream",
          "required_for_result": false,
          "speaker_label": "SPEAKER_00"
        }
      ],
      "optional": [],
      "experimental": []
    },
    "alignment": {
      "status": "succeeded",
      "language": "en",
      "model": null,
      "model_source": "whisperx_default",
      "cache_only": false
    }
  }
}
```

`speaker_label` is the raw pyannote label; it never changes even when
an existing voiceprint was matched. Use it as the key for any later
enrollment or rename call.
Result contract anchors: completed results report status="completed" in
the persisted transcription object. segments[].speaker_label is always the
raw diarization cluster label. segments[].words and top-level alignment are
optional metadata; top-level artifacts is optional as well. Clients must
tolerate these fields being absent.
`speaker_id` / `speaker_name`: matching uses an adaptive threshold, not a
fixed 0.75 cutoff. Actual logic:

- Base threshold is `VOICEPRINT_THRESHOLD` (default `0.75`).
- Each speaker's effective threshold is relaxed automatically based on the cosine spread of their enrolled samples: a one-sample speaker lands around `0.70`; higher spread can relax it further (up to `0.10`), with an absolute floor of `0.60`.
- Once AS-norm is active (`cohort >= 10`), matching switches to the normalised score and uses a sample-count-aware threshold around the `0.5` operating point: one-sample speakers are stricter (at least `0.60` by default), stable multi-sample speakers stay near the base, and candidates too close to the second-best AS-norm score are left unnamed for review.

If the best candidate clears the effective threshold, the service returns the
matched speaker_id / speaker_name; otherwise speaker_id is null and
speaker_name falls back to the raw label (for example SPEAKER_00).
If two diarization labels in the same result resolve to the same display name,
the service keeps both raw speaker_label values and disambiguates display
names in segment output, for example Alice and Alice (2). Voiceprint naming
does not collapse diarization clusters.
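
One way to picture the display-name disambiguation is the sketch below. The suffix format follows the `Alice` / `Alice (2)` example; numbering in label order is an assumption, and `disambiguate` is a hypothetical helper rather than the service's function.

```python
def disambiguate(label_to_name: dict) -> dict:
    """Suffix repeated display names while keeping every raw label distinct."""
    seen = {}
    display = {}
    for label, name in label_to_name.items():
        seen[name] = seen.get(name, 0) + 1
        # The first occurrence keeps the plain name; later ones get " (n)".
        display[label] = name if seen[name] == 1 else f"{name} ({seen[name]})"
    return display


print(disambiguate({"SPEAKER_00": "Alice", "SPEAKER_01": "Alice"}))
# {'SPEAKER_00': 'Alice', 'SPEAKER_01': 'Alice (2)'}
```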
`similarity`: speaker-match score.

- Raw cosine mode (`cohort < 10`, including fresh installs): range is `[-1, 1]` and usually `[0, 1]`, representing cosine similarity against the enrolled speaker average.
- AS-norm mode (`cohort >= 10`): this becomes a normalised z-score and is therefore unbounded (it can be greater than `1.0` or negative).
- The value is aggregated at the speaker level, not per individual segment.
- `speaker_id != null` means the score passed the effective threshold in the current mode.

See voiceprint-tuning.en.md for environment
variables, API parameters, AS-norm top_n / cohort / margin defaults, and
tuning guidance.
words[] is a new optional field added in 0.3.0 (WhisperX forced
alignment output). Each entry carries its own start/end/score.
Alignment can be skipped or fail for languages whose align model is unavailable
or disabled; when it does, the key is simply absent from the segment and the job
still finishes. Clients that don't recognize the field should just ignore it.
alignment records forced-alignment status when available. Common values:
status=succeeded, status=skipped with reason=language_disabled, or
status=failed with a sanitized error_type and actionable_hint. The
default Chinese alignment model is reported as
jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn; if an older custom
runtime is blocked by transformers' torch.load safety check, reason is
torch_version_blocked rather than not_found. This metadata intentionally
does not expose tokens, hostnames, or local filesystem paths.
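
Since `words`, `alignment` and `artifacts` may all be absent, a client can read them defensively. The helpers below are a hypothetical sketch; the `"unknown"` fallback is this example's own label, not a value the API returns.

```python
def segment_words(segment: dict) -> list:
    """Word timings for a segment, or an empty list when alignment is missing."""
    return segment.get("words") or []


def alignment_status(result: dict) -> str:
    """Top-level alignment status, or 'unknown' when the metadata is absent."""
    return (result.get("alignment") or {}).get("status", "unknown")


def stable_artifact_filenames(result: dict) -> list:
    """Filenames listed under artifacts.stable, tolerating older results."""
    manifest = result.get("artifacts") or {}
    return [entry["filename"] for entry in manifest.get("stable", [])]
```

Each helper returns a neutral value for older results, so the same client code works whether or not a given job was aligned.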
params records the effective settings used for this specific job,
including any per-request overrides. This makes each result self-contained, so
there is no need to cross-reference the original request. See
configuration.en.md for each setting's source and
default.
artifacts is an optional manifest describing stable, optional, and
experimental artifacts that live alongside this result. Current stable entries
include the primary result.json and one emb_<speaker_label>.npy speaker
embedding per cluster. The manifest exposes only filenames, roles, categories,
media types, and speaker_label; it does not expose local paths, hosts, tokens,
real job runtime paths, or debug data. Default clients do not need this field,
and older results without artifacts remain compatible.
Completed GET /api/jobs/{id} results and GET /api/transcriptions/{id} share the
same payload shape. That means speaker_map and unique_speakers are available in
the completed job result as well:
- If you need the latest persisted result after manual segment edits, prefer `GET /api/transcriptions/{id}`. `GET /api/jobs/{id}` may still be serving the worker's in-memory completed copy until that cache entry is evicted.
- `speaker_map` may be an empty object when the pipeline produced no usable speaker embeddings (for example, all diarized turns were too short to enroll).
- `unique_speakers` is derived from the resolved `segments[].speaker_name` values and therefore uses enrolled names when matched, otherwise the raw diarization labels.

```json
[
  { "id": "tr_...", "filename": "...", "created_at": "...",
    "segment_count": 42, "speaker_count": 3 }
]
```

Same shape as the completed `result` field inside `GET /api/jobs/{id}`, plus two
aggregation fields for UI / downstream consumers:
| Field | Type | Description |
|---|---|---|
| `speaker_map` | object | `speaker_label → {matched_id, matched_name, similarity, embedding_key}` mapping; reflects the diarization model's voiceprint match result and does not change when segments are manually corrected |
| `unique_speakers` | array[string] | Deduplicated list of speaker names, recalculated from the persisted `segments[].speaker_name` values to reflect the latest manual corrections |
| `artifacts` | object | Optional artifact manifest for stable / optional / experimental artifacts; clients must tolerate it being absent |

Query format=srt | txt | json. Returns the file as a download.
```
GET    /api/voiceprints
POST   /api/voiceprints/enroll
PUT    /api/voiceprints/{speaker_id}/name
DELETE /api/voiceprints/{speaker_id}
```
```json
[
  { "id": "spk_example_id", "name": "Alice",
    "sample_count": 3,
    "created_at": "2026-04-18T08:06:41.951819",
    "updated_at": "2026-04-18T09:17:02.113207" }
]
```

Note (enroll idempotency): `add_speaker` now deduplicates by `name`. Re-enrolling a speaker with the same name merges the new embedding into the existing record rather than creating a duplicate.

Pass `speaker_id` only when you intend to update that exact existing voiceprint. If the supplied `speaker_id` is well-formed but not found, the endpoint does not 404; it falls back to the create/name-dedup path.

Form fields:
| Field | Required | Description |
|---|---|---|
| `tr_id` | ✅ | Transcription id, matches `result.id` |
| `speaker_label` | ✅ | Must be the raw `SPEAKER_XX` label, not the display name |
| `speaker_name` | ✅ | Display name, e.g. "Alice" |
| `speaker_id` | ❌ | Explicit update target. If this id exists, the endpoint updates that voiceprint and returns `action: "updated"`. If omitted, or if the id is well-formed but not found, the endpoint takes the create path, which may still merge into an existing same-name record via `add_speaker()` dedup. Format must match `^spk_[A-Za-z0-9_-]{1,64}$` (e.g. `spk_example_id`); returns 422 if invalid. |

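
Clients can check the `speaker_id` format before sending to avoid a 422. The sketch below applies the documented pattern; `is_valid_speaker_id` is a hypothetical helper, and `fullmatch` is used because Python's `$` would otherwise also accept a trailing newline.

```python
import re

# Documented format: ^spk_[A-Za-z0-9_-]{1,64}$
SPEAKER_ID_RE = re.compile(r"spk_[A-Za-z0-9_-]{1,64}")


def is_valid_speaker_id(value: str) -> bool:
    """True when value matches the documented speaker_id format exactly."""
    return SPEAKER_ID_RE.fullmatch(value) is not None
```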
Response:
```json
{ "action": "created | updated", "speaker_id": "spk_..." }
```

Example:

```bash
curl -X POST http://localhost:8780/api/voiceprints/enroll \
  -H "Authorization: Bearer $API_KEY" \
  -F "tr_id=tr_example_id" \
  -F "speaker_label=SPEAKER_00" \
  -F "speaker_name=Alice"
```

`POST /api/voiceprints/rebuild-cohort` rebuilds the AS-norm impostor cohort matrix from all existing transcriptions. Manual rebuilds are still supported, but 0.7.1 also has automatic cohort loading and refresh.
Response:
```json
{ "cohort_size": 313, "skipped": 2, "saved_to": "/data/transcriptions/asnorm_cohort.npy" }
```

`skipped`: number of transcriptions whose embedding files could not be loaded (corrupt or missing `.npy`).
Cohort lifecycle and behaviour:
| Cohort size | Identification path | Effective threshold |
|---|---|---|
| 0 (fresh install / no transcriptions) | raw cosine | base 0.75 + adaptive relaxation, floor 0.60 |
| 1–9 (fewer than 10) | raw cosine (`score()` fallback) | same as above |
| ≥ 10 | AS-norm normalised score | ~0.5 (relative to impostor distribution; VOICEPRINT_THRESHOLD ignored) |
Startup behaviour:
- If `data/transcriptions/asnorm_cohort.npy` already exists, the service loads it directly on startup.
- Otherwise it scans persisted transcription results / `emb_*.npy` files and builds a fresh cohort, then saves it back to that path.

Refresh timing: each enroll / update bumps a generation counter. A background daemon
thread named cohort-rebuild wakes every 60 s and calls maybe_rebuild_cohort() once
the latest enrollment is at least 30 s old. The rebuild is lock-protected, so the
daemon and POST /api/voiceprints/rebuild-cohort cannot run the rebuild concurrently.
No manual action is needed — new embeddings usually enter the matching path
within about 30-90 s of enrollment; they enter full AS-norm scoring only when
the cohort has at least 10 embeddings, otherwise raw-cosine fallback remains in
effect. Automatic rebuilds protect a larger loaded or persisted
cohort: if the transcription source is empty, has only a few embeddings, or has fewer
embeddings than the current cohort, the daemon keeps the existing asnorm_cohort.npy
instead of shrinking it after transcription cleanup. POST /api/voiceprints/rebuild-cohort
remains available for an immediate forced rebuild and uses the currently available
embeddings as an explicit manual operation.
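
The refresh timing and the shrink guard can be modelled roughly as below. This is a simplified sketch, not the daemon's code: the class and method names are hypothetical, time is passed in explicitly, locking is omitted, and the "only a few embeddings" case is left out because the source does not give its cutoff.

```python
from typing import Optional


class CohortRefresher:
    """Tracks whether an automatic cohort rebuild is due."""

    QUIET_SECONDS = 30.0  # latest enrollment must be at least this old

    def __init__(self) -> None:
        self.generation = 0        # bumped on every enroll / update
        self.built_generation = 0  # generation covered by the last rebuild
        self.last_enroll_at: Optional[float] = None

    def note_enrollment(self, now: float) -> None:
        self.generation += 1
        self.last_enroll_at = now

    def should_rebuild(self, now: float) -> bool:
        """Called on each daemon wake-up (every 60 s in the text above)."""
        if self.generation == self.built_generation or self.last_enroll_at is None:
            return False
        return now - self.last_enroll_at >= self.QUIET_SECONDS

    def mark_rebuilt(self, generation: int) -> None:
        """Record the generation that was current when the rebuild started."""
        self.built_generation = generation


def keep_existing_cohort(candidate_size: int, current_size: int) -> bool:
    """Automatic rebuilds never replace a cohort with an empty or smaller one."""
    return candidate_size == 0 or candidate_size < current_size
```

With a 60 s wake-up interval and a 30 s quiet period, an enrollment is picked up between roughly 30 and 90 s later, which matches the window quoted above.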
`PUT /api/voiceprints/{speaker_id}/name`: form field `name=<new name>`. Renames only; the embedding is unchanged.

`DELETE /api/voiceprints/{speaker_id}`: removes the voiceprint permanently. Future recordings of that person will not auto-match.

Manually reassign a single segment to a different speaker.
Form fields:
| Field | Required | Description |
|---|---|---|
| `speaker_name` | ✅ | New speaker display name |
| `speaker_id` | ❌ | ID of a registered voiceprint (format: `^spk_[A-Za-z0-9_-]{1,64}$`); omitting this clears any previously assigned `speaker_id` on the segment |

Behavior:
- Only the targeted segment is updated; other segments are not affected.
- `speaker_map` is not modified. It records the diarization model's voiceprint match result and is not affected by manual corrections.
- `unique_speakers` is recalculated from all segments after each edit to reflect the latest corrections.
- When `speaker_id` is omitted, any stale `speaker_id` on the target segment is explicitly cleared to `null`.

Errors:
- `422`: `speaker_id` format invalid (does not match `^spk_[A-Za-z0-9_-]{1,64}$`)
- `404`: `speaker_id` not found in the voiceprint DB
- `404`: `tr_id` transcription not found
- `404`: `seg_id` not found in this transcription

| Code | Meaning |
|---|---|
| 400 | Missing or invalid request field; illegal job_id format (^tr_[A-Za-z0-9_-]{1,64}$) / invalid characters in speaker_label / path traversal detected |
| 422 | Field value fails type or value validation; speaker_id does not match ^spk_[A-Za-z0-9_-]{1,64}$; no_repeat_ngram_size is not an integer |
| 401 | Missing or wrong API key |
| 404 | Unknown tr_id / speaker_id / missing embedding |
| 413 | Upload exceeded MAX_UPLOAD_BYTES (default 2 GiB) — see /api/transcribe |
| 503 | Failed to persist initial queued status or failed to start the background transcription thread |
| 500 | Server-side exception (check docker logs voscript) |
| 504 | ffmpeg transcoding timed out (exceeded FFMPEG_TIMEOUT_SEC, default 1800 s) |
Body shape:
```json
{ "detail": "..." }
```

| BetterAINote code | Endpoint called |
|---|---|
| `submitVoiceTranscribeJob` | `POST /api/transcribe` |
| `pollVoiceTranscribeJob` | `GET /api/jobs/{id}` |
| `VoiceTranscribeClient.listVoiceprints` | `GET /api/voiceprints` |
| `VoiceTranscribeClient.enrollVoiceprint` | `POST /api/voiceprints/enroll` |
| `VoiceTranscribeClient.renameVoiceprint` | `PUT /api/voiceprints/{id}/name` |
| `VoiceTranscribeClient.deleteVoiceprint` | `DELETE /api/voiceprints/{id}` |

Source files live in the BetterAINote repo
under src/lib/transcription/providers/voice-transcribe-provider.ts and
src/lib/voice-transcribe/client.ts.