feat: implement TTS aligned transcripts #990
🦋 Changeset detected
Latest commit: 6f37f02
The changes in this PR will be included in the next version bump. This PR includes changesets to release 18 packages.
📝 Walkthrough
This PR adds a `TimedString` primitive and pipes TTS-aligned, word-level timing data throughout agents, updating TTS/STT plugins, voice session/agent APIs, generation and synchronization logic, and streaming paths to accept and forward `ReadableStream<string | TimedString>`.
Sequence Diagram(s)
sequenceDiagram
participant Client
participant Agent
participant TTS
participant Sync as TextSynchronizer
participant Transcriber
Client->>Agent: audio input / TTS request
Agent->>TTS: forward text stream (string|TimedString) with useTtsAlignedTranscript
TTS->>Agent: audio frames (AudioFrame) + timedTranscripts (TimedString[])
Agent->>Sync: attach timedTranscripts via USERDATA_TIMED_TRANSCRIPT
Sync->>Sync: compute SpeakingRateData / annotated rates
Sync->>Transcriber: emit synchronized text (string|TimedString)
Transcriber->>Client: synchronized transcript output
sequenceDiagram
participant PluginWS as TTS Plugin WS
participant Parser
participant Accumulator
participant FrameOut as Audio Frame Emitter
participant Consumer as Agent/Client
PluginWS->>Parser: receive websocket messages
Parser->>Accumulator: validate & extract word timestamps (hasWordTimestamps)
Accumulator->>FrameOut: batch into TimedString objects, attach to frames
FrameOut->>Consumer: emit AudioFrame + timedTranscripts
Consumer->>Sync: consume timedTranscripts for alignment
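The walkthrough and diagrams above pass `string | TimedString` values between components. A minimal sketch of what such a primitive could look like is shown here, assuming a symbol marker, a factory, and a type guard as the review later describes. The field names follow the `createTimedString` call quoted in the review; the symbol description and the `toText` helper are hypothetical and not taken from the PR.

```typescript
// Hypothetical marker; the real symbol in agents/src/voice/io.ts may be declared differently.
const TIMED_STRING_SYMBOL: unique symbol = Symbol('TimedString');

interface TimedStringInit {
  text: string;
  startTime?: number; // seconds
  endTime?: number; // seconds
  confidence?: number;
  startTimeOffset?: number;
}

interface TimedString extends TimedStringInit {
  readonly [TIMED_STRING_SYMBOL]: true;
}

// Factory: tags the object so it can be told apart from plain object literals.
function createTimedString(init: TimedStringInit): TimedString {
  return { ...init, [TIMED_STRING_SYMBOL]: true };
}

// Type guard: true only for objects produced by the factory above.
function isTimedString(value: unknown): value is TimedString {
  return typeof value === 'object' && value !== null && TIMED_STRING_SYMBOL in value;
}

// Hypothetical helper: extract plain text from a mixed stream chunk.
function toText(chunk: string | TimedString): string {
  return isTimedString(chunk) ? chunk.text : chunk;
}
```

A consumer that only needs text can call `toText` on every chunk, while a synchronizer can branch on `isTimedString` to read the timing fields.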
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
agents/src/voice/transcription/synchronizer.ts (1)
655-670: `flush()` changed to async but parent class defines it as synchronous - interface violation.
The `flush()` method in `SyncedTextOutput` is now `async` but the parent class `TextOutput` defines `abstract flush(): void;` (synchronous). This breaks the Liskov Substitution Principle and creates interface incompatibility.
Additionally, multiple callers do not await this method:
- agents/src/voice/generation.ts lines 693, 792
- agents/src/voice/room_io/room_io.ts line 316
- agents/src/voice/transcription/synchronizer.ts lines 318, 660 (even within the new async method)
Either make the parent `flush()` async or keep `SyncedTextOutput.flush()` synchronous and handle synchronization differently.
examples/src/basic_agent.ts (1)
4-15: Initialize logger before LLM usage.
Examples must call `initializeLogger({ pretty: true })` before any LLM functionality.
🛠️ Suggested fix
     import {
       type JobContext,
       type JobProcess,
       WorkerOptions,
       cli,
       defineAgent,
    +  initializeLogger,
       llm,
       metrics,
       voice,
     } from '@livekit/agents';
    @@
     import { fileURLToPath } from 'node:url';
     import { z } from 'zod';

    +initializeLogger({ pretty: true });
    +
     export default defineAgent({
As per coding guidelines, please initialize the logger in examples before using LLMs.
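The `flush()` concern raised above for synchronizer.ts can be illustrated with a small sketch. Only `TextOutput` and `flush()` are names from the review; `AsyncTextOutput` and `flushAll` are hypothetical stand-ins. TypeScript generally accepts an override that returns `Promise<void>` where the base method returns `void`, so the compiler is unlikely to flag callers that drop the promise.

```typescript
// Base contract as quoted in the review: flush is synchronous.
abstract class TextOutput {
  abstract flush(): void;
}

// Hypothetical subclass that makes flush asynchronous.
class AsyncTextOutput extends TextOutput {
  flushed = false;

  // This compiles because a Promise<void> return is assignable to void.
  async flush(): Promise<void> {
    await Promise.resolve(); // simulate asynchronous work
    this.flushed = true;
  }
}

// A caller written against the synchronous base-class contract.
function flushAll(outputs: TextOutput[]): void {
  for (const output of outputs) {
    output.flush(); // the returned promise is silently dropped
  }
}

const asyncOutput = new AsyncTextOutput();
flushAll([asyncOutput]);
// The flush has started but has not completed when flushAll returns.
const flushedWhenCallerReturned = asyncOutput.flushed;
```

Here `flushedWhenCallerReturned` is `false`, which is the ordering hazard the reviewer describes: any code that assumes the flush is complete after the call would run too early.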
🤖 Fix all issues with AI agents
In `@agents/src/tts/stream_adapter.ts`:
- Around line 106-124: The audio frame write assumes audio.frame.userdata is
always defined and replaces the timed-transcript array, which can throw if
userdata is undefined and clobber existing entries; before assigning, ensure
audio.frame.userdata exists (e.g., audio.frame.userdata = audio.frame.userdata
?? {}) and safely append the timedString instead of replacing: read existing
array at audio.frame.userdata[USERDATA_TIMED_TRANSCRIPT] (or default to []),
push timedString, then reassign that array back to
audio.frame.userdata[USERDATA_TIMED_TRANSCRIPT]; apply the same defensive
initialization and append logic to the identical pattern in the other occurrence
that writes USERDATA_TIMED_TRANSCRIPT (the block using createTimedString /
isFirstFrame and audio.frame.userdata).
In `@agents/src/voice/agent_activity.ts`:
- Around line 1419-1434: The current await on ttsGenData.timedTextsFut.await can
hang if the TTS task fails before resolving; update the block that checks
useTtsAlignedTranscript / tts?.capabilities.alignedTranscript / ttsGenData to
race the timedTextsFut.await with a failure/timeout signal from the TTS task (or
ensure the TTS path always resolves/rejects the future on error). Specifically,
when handling ttsGenData and calling timedTextsFut.await, use a Promise.race (or
equivalent) between timedTextsFut.await and a fallback that rejects or times out
if the TTS generation task fails (or if ttsGenData exposes a task/error
promise), then log and fall back to llmOutput if the race returns an
error/timeout so the transcriptionInput assignment never deadlocks.
In `@agents/src/voice/generation.ts`:
- Around line 579-614: The variable initialPushedDuration is unused because
pushedDuration is initialized to 0 per performTTSInference call; remove the
unused offset logic or mark it as intentional scaffolding: either delete
initialPushedDuration and the + initialPushedDuration adjustments in the
createTimedString call (and its comment), or replace the comment with a TODO
stating this is reserved for multi-inference offsets and keep
initialPushedDuration so future callers can pass a non-zero pushedDuration;
update references in performTTSInference, the createTimedString call that
adjusts startTime/endTime, and any related comment near timedTextsWriter
accordingly.
In `@examples/src/timed_transcript_agent.ts`:
- Around line 23-32: Add a call to initializeLogger({ pretty: true }) at the top
of the module before any LLM-related imports/usage (i.e., before references to
llm, defineAgent, stream, voice, etc.); specifically, place the initialization
right after module imports and before any code that calls or constructs llm or
defineAgent so the logger is configured prior to LLM tooling being used.
In `@plugins/cartesia/src/tts.ts`:
- Around line 405-411: The debug log message in the TTS chunk timeout handler
incorrectly says "STT chunk stream"; update the string passed to
this.#logger.debug inside the timeout callback (the block that sets timeout =
setTimeout(...)) to say "TTS chunk stream timeout after
${this.#opts.chunkTimeout}ms" so the log accurately reflects TTS, leaving the
rest of the timeout closure (including ws.close()) unchanged.
In `@plugins/cartesia/src/types.ts`:
- Around line 1-3: Update the SPDX copyright header at the top of the file by
changing the year in the SPDX-FileCopyrightText line from 2024 to 2025; locate
the SPDX-FileCopyrightText entry (the comment starting with
"SPDX-FileCopyrightText: 2024 LiveKit, Inc.") and replace 2024 with 2025 so the
header reads "SPDX-FileCopyrightText: 2025 LiveKit, Inc." while leaving the
SPDX-License-Identifier line unchanged.
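The agent_activity.ts item in the prompt above asks for a race between the timed-texts future and a failure or timeout signal. A minimal sketch of that pattern follows, under stated assumptions: `awaitOrFallback` and `TimedTextStream` are hypothetical names, and the real `Future` and `Task` utilities in the PR may expose different promise shapes.

```typescript
// Hypothetical stand-in for the stream that the timed-texts future resolves to.
interface TimedTextStream {
  readonly label: string;
}

// Resolves with the stream, or with null if the TTS task settles first
// or the timeout elapses. A rejection of `future` itself still propagates.
async function awaitOrFallback(
  future: Promise<TimedTextStream>,
  taskSettled: Promise<unknown>,
  timeoutMs: number,
): Promise<TimedTextStream | null> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<null>((resolve) => {
    timer = setTimeout(() => resolve(null), timeoutMs);
  });
  // Map both success and failure of the task to null so it never rejects the race.
  const taskFallback = taskSettled.then(
    () => null,
    () => null,
  );
  try {
    return await Promise.race([future, taskFallback, timeout]);
  } finally {
    clearTimeout(timer); // avoid keeping the event loop alive
  }
}
```

When the result is `null`, the caller can log the condition and fall back to the plain LLM output, so the transcription input is always assigned.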
🧹 Nitpick comments (6)
agents/src/voice/transcription/synchronizer.ts (2)
61-85: Potential negative `dt` in `addByAnnotation` when timestamps arrive out of order.
If `startTime` is less than `pushedDuration` (e.g., due to timing drift or reordered messages), `dt` becomes negative on line 68, resulting in a potentially negative or incorrect rate calculation on line 72. Consider guarding against this edge case.
🔧 Suggested defensive check
     addByAnnotation(text: string, startTime: number | undefined, endTime: number | undefined): void {
       if (startTime !== undefined) {
         // Calculate the integral of the speaking rate up to the start time
         const integral =
           this.speakIntegrals.length > 0
             ? this.speakIntegrals[this.speakIntegrals.length - 1]!
             : 0;
         const dt = startTime - this.pushedDuration;
    +    // Guard against negative dt (out-of-order timestamps)
    +    if (dt < 0) {
    +      this.textBuffer.push(text);
    +      if (endTime !== undefined) {
    +        this.addByAnnotation('', endTime, undefined);
    +      }
    +      return;
    +    }
         // Use the length of the text directly instead of hyphens
         const textLen = this.textBuffer.reduce((sum, t) => sum + t.length, 0);
90-124: Linear search mislabeled as binary search.
The comment mentions "Binary search" but the implementation is a linear scan (O(n)). For typical use cases with small arrays this is acceptable, but the comment is misleading.
📝 Fix comment or implement actual binary search
    -    // Binary search for the right position (equivalent to np.searchsorted with side="right")
    +    // Linear search for the right position (equivalent to np.searchsorted with side="right")
    +    // Note: For small arrays this is efficient enough; consider binary search for large datasets
         let idx = 0;
         for (let i = 0; i < this.timestamps.length; i++) {
plugins/cartesia/src/types.ts (1)
83-89: Type naming may cause confusion - `CartesiaServerMessage` includes error messages.
`CartesiaServerMessage` is inferred from `cartesiaMessageSchema` (which is the union including error messages), not from `cartesiaServerMessageSchema`. This means the type includes `CartesiaErrorMessage`, which may be intentional but the naming suggests otherwise.
Consider either:
- Renaming to `CartesiaMessage` to reflect that it includes errors, or
- Creating a separate type for the full union
📝 Clarify type naming
    -export type CartesiaServerMessage = z.infer<typeof cartesiaMessageSchema>;
    +/** Union of all Cartesia messages including errors */
    +export type CartesiaMessage = z.infer<typeof cartesiaMessageSchema>;
    +/** Server messages excluding error messages */
    +export type CartesiaServerMessage = z.infer<typeof cartesiaServerMessageSchema>;
plugins/cartesia/src/tts.ts (1)
372-386: Array bounds not validated before indexed access.
The word timestamps arrays (`words`, `start`, `end`) are assumed to have the same length, but this isn't validated. If arrays have mismatched lengths, undefined values could be accessed.
🔧 Add length validation
     if (this.#opts.wordTimestamps !== false && hasWordTimestamps(serverMsg)) {
       const wordTimestamps = serverMsg.word_timestamps;
    +  const minLength = Math.min(
    +    wordTimestamps.words.length,
    +    wordTimestamps.start.length,
    +    wordTimestamps.end.length,
    +  );
    -  for (let i = 0; i < wordTimestamps.words.length; i++) {
    +  for (let i = 0; i < minLength; i++) {
         const word = wordTimestamps.words[i];
         const startTime = wordTimestamps.start[i];
         const endTime = wordTimestamps.end[i];
plugins/elevenlabs/src/tts.ts (1)
1037-1059: Consider event-based signaling instead of polling.
The 10ms polling loop works but consumes CPU cycles. Consider using a condition variable or Promise-based signaling when new data arrives. However, for the current use case, this is acceptable.
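One way to read the "Promise-based signaling" suggestion is sketched below. `Signal` and `AsyncQueue` are hypothetical names for illustration and are not part of the ElevenLabs plugin; the idea is only that a consumer parks on a promise and the producer wakes it, instead of the consumer re-checking on a timer.

```typescript
// Hypothetical one-shot notifier: every pending wait() resolves on the next notify().
class Signal {
  private waiters: Array<() => void> = [];

  wait(): Promise<void> {
    return new Promise<void>((resolve) => {
      this.waiters.push(resolve);
    });
  }

  notify(): void {
    const pending = this.waiters;
    this.waiters = [];
    for (const wake of pending) wake();
  }
}

// Hypothetical queue whose consumer sleeps until data arrives (no polling interval).
class AsyncQueue<T> {
  private items: T[] = [];
  private signal = new Signal();

  push(item: T): void {
    this.items.push(item);
    this.signal.notify();
  }

  async next(): Promise<T> {
    // Re-check after each wake-up, since another consumer may have taken the item.
    while (this.items.length === 0) {
      await this.signal.wait();
    }
    return this.items.shift() as T;
  }
}
```

Compared with a 10ms loop, the consumer does no work while idle and is woken as soon as `push` is called.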
agents/src/voice/agent_activity.ts (1)
1219-1258: `say()` path doesn’t use aligned transcripts even when enabled.
`transcriptionNode` receives raw text only, so `useTtsAlignedTranscript` has no effect for `AgentActivity.say()`. Consider wiring `ttsGenData.timedTextsFut` into the transcription input (similar to the pipeline path) when the TTS supports aligned transcripts.
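The Cartesia nitpick above (validating the `words`, `start`, and `end` arrays before indexing) can be sketched as a standalone function. The `word_timestamps` shape follows the suggested diff; `zipWordTimestamps` and `TimedWord` are hypothetical names, and the real plugin would build `TimedString` objects instead.

```typescript
// Shape taken from the suggested diff: three parallel arrays.
interface WordTimestamps {
  words: string[];
  start: number[];
  end: number[];
}

// Hypothetical output type for this sketch.
interface TimedWord {
  text: string;
  startTime: number;
  endTime: number;
}

// Pair up entries, stopping at the shortest array so no index is out of bounds.
function zipWordTimestamps(ts: WordTimestamps): TimedWord[] {
  const count = Math.min(ts.words.length, ts.start.length, ts.end.length);
  const result: TimedWord[] = [];
  for (let i = 0; i < count; i++) {
    const text = ts.words[i];
    const startTime = ts.start[i];
    const endTime = ts.end[i];
    // Defensive check; also satisfies strict index-access compiler settings.
    if (text === undefined || startTime === undefined || endTime === undefined) continue;
    result.push({ text, startTime, endTime });
  }
  return result;
}
```

With mismatched inputs the extra entries are dropped rather than producing words with `undefined` timings.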
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`pnpm-lock.yaml` is excluded by `!**/pnpm-lock.yaml`
📒 Files selected for processing (24)
agents/src/index.ts, agents/src/inference/stt.ts, agents/src/llm/realtime.ts, agents/src/tts/stream_adapter.ts, agents/src/tts/tts.ts, agents/src/types.ts, agents/src/voice/agent.ts, agents/src/voice/agent_activity.ts, agents/src/voice/agent_session.ts, agents/src/voice/generation.ts, agents/src/voice/index.ts, agents/src/voice/io.ts, agents/src/voice/transcription/synchronizer.ts, examples/src/basic_agent.ts, examples/src/timed_transcript_agent.ts, plugins/cartesia/package.json, plugins/cartesia/src/tts.ts, plugins/cartesia/src/types.ts, plugins/deepgram/src/stt.ts, plugins/deepgram/src/stt_v2.ts, plugins/elevenlabs/src/tts.ts, plugins/openai/src/realtime/api_proto.ts, plugins/openai/src/realtime/realtime_model.ts, plugins/openai/src/realtime/realtime_model_beta.ts
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (.cursor/rules/agent-core.mdc)
Add SPDX-FileCopyrightText and SPDX-License-Identifier headers to all newly added files with '// SPDX-FileCopyrightText: 2025 LiveKit, Inc.' and '// SPDX-License-Identifier: Apache-2.0'
Files:
agents/src/tts/tts.ts, plugins/openai/src/realtime/api_proto.ts, agents/src/voice/agent_session.ts, agents/src/voice/io.ts, plugins/cartesia/src/types.ts, plugins/cartesia/src/tts.ts, agents/src/tts/stream_adapter.ts, plugins/deepgram/src/stt.ts, examples/src/timed_transcript_agent.ts, plugins/openai/src/realtime/realtime_model_beta.ts, agents/src/voice/generation.ts, agents/src/index.ts, agents/src/types.ts, agents/src/voice/agent.ts, agents/src/voice/agent_activity.ts, examples/src/basic_agent.ts, agents/src/inference/stt.ts, agents/src/voice/index.ts, plugins/deepgram/src/stt_v2.ts, agents/src/llm/realtime.ts, agents/src/voice/transcription/synchronizer.ts, plugins/elevenlabs/src/tts.ts, plugins/openai/src/realtime/realtime_model.ts
**/*.{ts,tsx}?(test|example|spec)
📄 CodeRabbit inference engine (.cursor/rules/agent-core.mdc)
When testing inference LLM, always use full model names from `agents/src/inference/models.ts` (e.g., 'openai/gpt-4o-mini' instead of 'gpt-4o-mini')
Files:
agents/src/tts/tts.ts, plugins/openai/src/realtime/api_proto.ts, agents/src/voice/agent_session.ts, agents/src/voice/io.ts, plugins/cartesia/src/types.ts, plugins/cartesia/src/tts.ts, agents/src/tts/stream_adapter.ts, plugins/deepgram/src/stt.ts, examples/src/timed_transcript_agent.ts, plugins/openai/src/realtime/realtime_model_beta.ts, agents/src/voice/generation.ts, agents/src/index.ts, agents/src/types.ts, agents/src/voice/agent.ts, agents/src/voice/agent_activity.ts, examples/src/basic_agent.ts, agents/src/inference/stt.ts, agents/src/voice/index.ts, plugins/deepgram/src/stt_v2.ts, agents/src/llm/realtime.ts, agents/src/voice/transcription/synchronizer.ts, plugins/elevenlabs/src/tts.ts, plugins/openai/src/realtime/realtime_model.ts
**/*.{ts,tsx}?(test|example)
📄 CodeRabbit inference engine (.cursor/rules/agent-core.mdc)
Initialize logger before using any LLM functionality with `initializeLogger({ pretty: true })` from '@livekit/agents'
Files:
agents/src/tts/tts.ts, plugins/openai/src/realtime/api_proto.ts, agents/src/voice/agent_session.ts, agents/src/voice/io.ts, plugins/cartesia/src/types.ts, plugins/cartesia/src/tts.ts, agents/src/tts/stream_adapter.ts, plugins/deepgram/src/stt.ts, examples/src/timed_transcript_agent.ts, plugins/openai/src/realtime/realtime_model_beta.ts, agents/src/voice/generation.ts, agents/src/index.ts, agents/src/types.ts, agents/src/voice/agent.ts, agents/src/voice/agent_activity.ts, examples/src/basic_agent.ts, agents/src/inference/stt.ts, agents/src/voice/index.ts, plugins/deepgram/src/stt_v2.ts, agents/src/llm/realtime.ts, agents/src/voice/transcription/synchronizer.ts, plugins/elevenlabs/src/tts.ts, plugins/openai/src/realtime/realtime_model.ts
🧠 Learnings (4)
📚 Learning: 2026-01-16T14:33:39.551Z
Learnt from: CR
Repo: livekit/agents-js PR: 0
File: .cursor/rules/agent-core.mdc:0-0
Timestamp: 2026-01-16T14:33:39.551Z
Learning: Applies to examples/src/test_*.ts : For plugin component debugging (STT, TTS, LLM), create test example files prefixed with `test_` under the examples directory and run with `pnpm build && node ./examples/src/test_my_plugin.ts`
Applied to files:
plugins/deepgram/src/stt.ts, examples/src/timed_transcript_agent.ts, agents/src/voice/generation.ts, agents/src/index.ts, examples/src/basic_agent.ts, agents/src/voice/index.ts
📚 Learning: 2026-01-16T14:33:39.551Z
Learnt from: CR
Repo: livekit/agents-js PR: 0
File: .cursor/rules/agent-core.mdc:0-0
Timestamp: 2026-01-16T14:33:39.551Z
Learning: Use `pnpm build && pnpm dlx tsx ./examples/src/my_agent.ts dev|download-files --log-level=debug|info(default)` to run example agents from the examples directory
Applied to files:
examples/src/timed_transcript_agent.ts
📚 Learning: 2026-01-16T14:33:39.551Z
Learnt from: CR
Repo: livekit/agents-js PR: 0
File: .cursor/rules/agent-core.mdc:0-0
Timestamp: 2026-01-16T14:33:39.551Z
Learning: Applies to **/*.{ts,tsx}?(test|example) : Initialize logger before using any LLM functionality with `initializeLogger({ pretty: true })` from 'livekit/agents'
Applied to files:
agents/src/voice/generation.ts, examples/src/basic_agent.ts
📚 Learning: 2026-01-16T14:33:39.551Z
Learnt from: CR
Repo: livekit/agents-js PR: 0
File: .cursor/rules/agent-core.mdc:0-0
Timestamp: 2026-01-16T14:33:39.551Z
Learning: Applies to **/*.{ts,tsx}?(test|example|spec) : When testing inference LLM, always use full model names from `agents/src/inference/models.ts` (e.g., 'openai/gpt-4o-mini' instead of 'gpt-4o-mini')
Applied to files:
plugins/openai/src/realtime/realtime_model.ts
🧬 Code graph analysis (14)
agents/src/tts/tts.ts (1)
- agents/src/voice/io.ts (1): TimedString (48-55)
agents/src/voice/io.ts (3)
- agents/src/voice/agent.ts (1): ModelSettings (58-61)
- agents/src/voice/index.ts (2): ModelSettings (4-4), TimedString (9-9)
- agents/src/index.ts (3): TimedString (37-37), createTimedString (37-37), isTimedString (37-37)
plugins/cartesia/src/tts.ts (2)
- agents/src/voice/index.ts (1): TimedString (9-9)
- plugins/cartesia/src/types.ts (6): CartesiaServerMessage (89-89), cartesiaMessageSchema (74-77), isErrorMessage (111-113), hasWordTimestamps (115-119), isChunkMessage (95-97), isDoneMessage (103-105)
agents/src/tts/stream_adapter.ts (3)
- agents/src/index.ts (1): createTimedString (37-37)
- agents/src/voice/io.ts (1): createTimedString (60-75)
- agents/src/types.ts (1): USERDATA_TIMED_TRANSCRIPT (9-9)
plugins/deepgram/src/stt.ts (2)
- agents/src/index.ts (1): createTimedString (37-37)
- agents/src/voice/io.ts (1): createTimedString (60-75)
examples/src/timed_transcript_agent.ts (4)
- agents/src/index.ts (6): TimedString (37-37), voice (40-40), stream (40-40), llm (40-40), tts (40-40), cli (40-40)
- agents/src/voice/io.ts (2): TimedString (48-55), stream (98-100)
- plugins/cartesia/src/tts.ts (1): stream (130-132)
- agents/src/voice/agent_activity.ts (2): llm (337-339), tts (341-343)
plugins/openai/src/realtime/realtime_model_beta.ts (1)
- agents/src/voice/io.ts (2): TimedString (48-55), createTimedString (60-75)
agents/src/voice/generation.ts (3)
- agents/src/utils.ts (4): Future (123-160), Task (420-532), done (141-143), done (525-527)
- agents/src/voice/io.ts (3): TimedString (48-55), createTimedString (60-75), isTimedString (80-87)
- agents/src/types.ts (1): USERDATA_TIMED_TRANSCRIPT (9-9)
agents/src/voice/agent.ts (4)
- agents/src/voice/agent_activity.ts (1): useTtsAlignedTranscript (367-370)
- agents/src/voice/agent_session.ts (1): useTtsAlignedTranscript (279-281)
- agents/src/voice/io.ts (1): TimedString (48-55)
- agents/src/types.ts (1): USERDATA_TIMED_TRANSCRIPT (9-9)
agents/src/voice/agent_activity.ts (3)
- agents/src/voice/generation.ts (2): performTTSInference (520-647), _TTSGenerationData (54-63)
- agents/src/voice/index.ts (1): TimedString (9-9)
- agents/src/voice/io.ts (1): TimedString (48-55)
plugins/deepgram/src/stt_v2.ts (2)
- agents/src/index.ts (1): createTimedString (37-37)
- agents/src/voice/io.ts (1): createTimedString (60-75)
agents/src/llm/realtime.ts (2)
- agents/src/index.ts (1): TimedString (37-37)
- agents/src/voice/io.ts (1): TimedString (48-55)
plugins/elevenlabs/src/tts.ts (2)
- agents/src/index.ts (2): createTimedString (37-37), TimedString (37-37)
- agents/src/voice/index.ts (1): TimedString (9-9)
plugins/openai/src/realtime/realtime_model.ts (6)
- agents/src/tts/stream_adapter.ts (1): stream (40-42)
- agents/src/index.ts (3): stream (40-40), TimedString (37-37), createTimedString (37-37)
- agents/src/stt/stream_adapter.ts (1): stream (36-38)
- agents/src/stream/stream_channel.ts (1): StreamChannel (7-12)
- agents/src/stream/index.ts (1): StreamChannel (7-7)
- agents/src/voice/index.ts (1): TimedString (9-9)
🔇 Additional comments (35)
plugins/cartesia/package.json (1)
52-52: The Zod version range is safe for this codebase. Both versions 3.25.76 and 4.1.8 exist on npm, and while Zod 4 has breaking changes (e.g., `z.record()` signature, error APIs, string format helpers), the schemas in `types.ts` only use compatible APIs (`z.object()`, `z.array()`, `z.string()`, `z.number()`, `z.literal()`, `z.boolean()`, `z.union()`, `z.discriminatedUnion()`, `z.infer<>`). None of these are affected by the breaking changes, so supporting both major versions is valid.
Likely an incorrect or invalid review comment.
plugins/deepgram/src/stt_v2.ts (1)
4-12: Consistent TimedString construction.
Aligns word entries with the TimedString factory for downstream alignment support.
Also applies to: 489-495
plugins/deepgram/src/stt.ts (1)
4-15: TimedString wrapping looks good.
This keeps word payloads consistent with the new timing-aware APIs.
Also applies to: 445-451
plugins/openai/src/realtime/api_proto.ts (1)
597-608: Optional start_time is a clean, compatible extension.
No issues with adding this field as optional.
agents/src/voice/io.ts (3)
27-35: Doc note on userdata support is helpful.
Clarifies future extraction path without affecting runtime behavior.
40-87: TimedString core utilities look solid.
The symbol marker + factory + type guard provide a clean, consistent API.
257-263: captureText widening is appropriate.
Allows timed segments to flow through the text pipeline as intended.
agents/src/index.ts (1)
37-37: Public re-export is appropriate.
Makes TimedString utilities discoverable from the package root.
agents/src/types.ts (1)
5-9: USERDATA_TIMED_TRANSCRIPT constant looks good.
Centralizing the key reduces drift across modules.
agents/src/voice/transcription/synchronizer.ts (1)
577-589: LGTM - Proper handling of timed transcripts without audio passthrough.
The flush logic correctly handles the case where timed transcripts are used: if text is pending but no audio was pushed to the synchronizer, it ends audio input to allow text processing rather than rotating the segment. This aligns with the PR objective of TTS-aligned transcripts where audio goes directly to the room.
plugins/cartesia/src/types.ts (1)
95-119: LGTM - Type guards provide clear message discrimination.
The type guards are well-implemented for runtime discrimination. The `hasWordTimestamps` helper provides good semantic clarity for the TTS alignment use case.
plugins/cartesia/src/tts.ts (2)
283-342: LGTM - Robust event channel pattern prevents message loss.
The refactored WebSocket handling with a buffered event channel and single listener registration is a solid improvement. This pattern correctly addresses the issue of message loss during listener re-registration that can occur with repeated `once()` calls.
412-432: LGTM - Proper coordination for stream termination.
The `sentenceStreamClosed` flag correctly coordinates the WebSocket lifecycle, ensuring the connection only closes when both: (1) Cartesia returns a done message, AND (2) all sentences have been sent. This prevents premature termination.
plugins/elevenlabs/src/tts.ts (2)
173-236: LGTM - Timestamp normalization correctly removes leading silence.
The `toTimedWords` function properly normalizes timestamps by subtracting `firstWordOffsetMs`, with `Math.max(0, ...)` guards preventing negative values. The documentation clearly explains why this is needed (ElevenLabs returns absolute timestamps that may include leading silence).
511-515: LGTM - First word offset capture logic is correct.
The offset is captured only once (when `firstWordOffsetMs === null`) and only from non-zero start times, correctly identifying the first actual word timing for normalization.
agents/src/voice/generation.ts (2)
536-555: LGTM - Clean text extraction stream implementation.
The IIFE pattern correctly transforms the mixed `string | TimedString` input stream into a text-only stream for the TTS node, with proper error handling and resource cleanup.
662-697: LGTM - Proper handling of TimedString in text forwarding.
The implementation correctly extracts the text for accumulation while passing the original `TimedString` (with timing metadata) to `textOutput.captureText()` for synchronization. This enables the synchronizer to use word-level timing information.
agents/src/inference/stt.ts (1)
491-499: LGTM - Correct use of `createTimedString` factory.
The change from object literals to using `createTimedString` ensures consistent `TimedString` objects with the proper `TIMED_STRING_SYMBOL` marker, aligning with the broader API surface updates across the codebase.
agents/src/llm/realtime.ts (1)
9-26: TimedString support in textStream looks good.
Clear type expansion and doc clarify downstream expectations.
examples/src/basic_agent.ts (2)
55-55: Switching to a concrete Cartesia TTS instance is solid.
Makes the example align with the new plugin usage.
62-66: Aligned transcript flag wiring looks good.
Explicitly enabling the option keeps the example clear.
agents/src/tts/tts.ts (2)
16-38: Timed transcripts on SynthesizedAudio are well integrated.
The optional field keeps compatibility while enabling timestamps.
42-55: Capability flag for aligned transcripts is clear and useful.
Nice, minimal API extension.
agents/src/voice/index.ts (1)
5-5: Exporting VoiceOptions is the right public surface update.
Keeps config types accessible to users.
agents/src/tts/stream_adapter.ts (1)
6-9: Imports for timed transcript support are appropriate.
No concerns here.
agents/src/voice/agent_session.ts (3)
76-82: VoiceOptions addition is well documented.
Matches the new aligned transcript flow.
95-95: Defaulting to aligned transcripts makes sense.
Please confirm providers that don’t support aligned transcripts still fall back cleanly to plain text.
275-281: Getter for useTtsAlignedTranscript is a clean addition.
Straightforward and consistent with other session getters.
plugins/openai/src/realtime/realtime_model_beta.ts (2)
58-61: TimedString propagation wiring looks good.
The widened channel typing and creation align with timed transcript streaming.
Also applies to: 1103-1107
1274-1286: Aligned transcript delta wrapping looks good.
plugins/openai/src/realtime/realtime_model.ts (2)
57-60: TimedString channel typing update looks consistent.
Also applies to: 1194-1199
1374-1385: No action required. The OpenAI Realtime API returns `start_time` in seconds, and `TimedString.startTime` is documented to accept seconds. The code correctly passes the value directly without conversion.
agents/src/voice/agent_activity.ts (1)
362-370: Agent-level override for aligned transcripts is clear.
agents/src/voice/agent.ts (2)
63-82: Agent-level `useTtsAlignedTranscript` option and transcription typing look good.
Also applies to: 165-192, 228-232
399-442: Timed transcript userdata attachment is solid.
This cleanly propagates aligned transcript data downstream.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
plugins/elevenlabs/src/tts.ts (1)
510-519: Avoid shifting timestamps when the first spoken character starts at 0.
With the current “first non-zero start” logic, a true 0ms start can get shifted by the next character’s start time. Consider keying the offset off the first non-whitespace character instead.
🛠️ Suggested fix
    -  if (ctx.firstWordOffsetMs === null && start > 0) {
    -    ctx.firstWordOffsetMs = start;
    -  }
    +  if (ctx.firstWordOffsetMs === null && char.trim().length > 0) {
    +    ctx.firstWordOffsetMs = start;
    +  }
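The difference between the two offset rules can be shown on a small input. This is a sketch under stated assumptions: `CharTiming`, `firstSpokenOffsetMs`, and `normalizeStartMs` are hypothetical names, and the real plugin tracks the offset on a context object (`ctx.firstWordOffsetMs`) while streaming.

```typescript
// Hypothetical per-character timing, in milliseconds.
interface CharTiming {
  char: string;
  startMs: number;
}

// Suggested rule: use the start of the first non-whitespace character.
function firstSpokenOffsetMs(chars: CharTiming[]): number | null {
  for (const c of chars) {
    if (c.char.trim().length > 0) return c.startMs;
  }
  return null;
}

// Previous rule, for comparison: use the first non-zero start time.
function firstNonZeroOffsetMs(chars: CharTiming[]): number | null {
  for (const c of chars) {
    if (c.startMs > 0) return c.startMs;
  }
  return null;
}

// Shift a timestamp by the offset, clamping at zero as the plugin's guards do.
function normalizeStartMs(startMs: number, offsetMs: number | null): number {
  return Math.max(0, startMs - (offsetMs ?? 0));
}

// Speech that genuinely begins at 0ms.
const sample: CharTiming[] = [
  { char: 'H', startMs: 0 },
  { char: 'i', startMs: 80 },
];
const suggestedOffset = firstSpokenOffsetMs(sample);
const previousOffset = firstNonZeroOffsetMs(sample);
```

For `sample`, the suggested rule yields an offset of 0 and leaves the timestamps alone, while the previous rule picks 80 and would shift every later character earlier by 80ms.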
🤖 Fix all issues with AI agents
In `@agents/src/voice/transcription/synchronizer.ts`:
- Around line 199-202: The hasPendingText getter currently checks only whether
any text was ever pushed (this.textData.pushedText.length > 0); change it to
return true only when there is unforwarded text by comparing pushed vs forwarded
counts, i.e. return this.textData.pushedText.length >
this.textData.forwardedText.length; update the hasPendingText accessor in
synchronizer.ts to use this comparison so segment rotation logic (which relies
on hasPendingText) behaves correctly.
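The requested `hasPendingText` change amounts to comparing two lengths. A minimal sketch follows, assuming `pushedText` and `forwardedText` are strings as the prompt implies; the `SegmentSketch` class and its `push`/`forward` methods are hypothetical and only exist to exercise the getter.

```typescript
// Hypothetical stand-in for the synchronizer's per-segment text state.
class SegmentSketch {
  readonly textData = { pushedText: '', forwardedText: '' };

  push(text: string): void {
    this.textData.pushedText += text;
  }

  forward(text: string): void {
    this.textData.forwardedText += text;
  }

  // True only while some pushed text has not been forwarded yet.
  get hasPendingText(): boolean {
    return this.textData.pushedText.length > this.textData.forwardedText.length;
  }
}
```

With the earlier check (`pushedText.length > 0`) the getter would stay `true` after everything had been forwarded, which is the behavior the comment flags for segment rotation.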
♻️ Duplicate comments (5)
agents/src/tts/stream_adapter.ts (1)
112-119: Guardaudio.frame.userdatabefore assignment.If
AudioFrame.userdatacan be undefined in@livekit/rtc-node, this write will throw and can also overwrite existing metadata. Please confirm initialization guarantees; otherwise initialize and append safely.🛠️ Safer assignment pattern
- audio.frame.userdata[USERDATA_TIMED_TRANSCRIPT] = [timedString]; + const userdata = (audio.frame.userdata ??= {}); + const existing = userdata[USERDATA_TIMED_TRANSCRIPT]; + userdata[USERDATA_TIMED_TRANSCRIPT] = Array.isArray(existing) + ? [...existing, timedString] + : [timedString];agents/src/voice/generation.ts (1)
586-624: TheinitialPushedDurationoffset has no effect within a single inference.The
initialPushedDurationis always 0 becausepushedDurationis initialized to 0 at line 567 andinitialPushedDurationis captured immediately before the loop. The offset adjustment serves no functional purpose currently.If this is scaffolding for future multi-inference duration accumulation, add a TODO comment explaining the intent. Otherwise, consider removing the unused offset logic.
♻️ Option 1: Add TODO explaining the scaffolding
// pushed_duration stays CONSTANT within one inference. It represents // the cumulative duration from PREVIOUS TTS inferences. We capture it here before // the loop to match Python's behavior. + // TODO: Currently always 0 since pushedDuration is local. If multi-inference + // duration accumulation is needed, pass the initial offset from the caller. const initialPushedDuration = pushedDuration;♻️ Option 2: Remove unused offset logic
- // pushed_duration stays CONSTANT within one inference. It represents - // the cumulative duration from PREVIOUS TTS inferences. We capture it here before - // the loop to match Python's behavior. - const initialPushedDuration = pushedDuration; - while (true) { // ... const adjustedTimedText = createTimedString({ text: timedText.text, - startTime: - timedText.startTime !== undefined - ? timedText.startTime + initialPushedDuration - : undefined, - endTime: - timedText.endTime !== undefined - ? timedText.endTime + initialPushedDuration - : undefined, + startTime: timedText.startTime, + endTime: timedText.endTime, confidence: timedText.confidence, startTimeOffset: timedText.startTimeOffset, });agents/src/voice/agent_activity.ts (1)
1417-1430: Guard against deadlock if TTS initialization fails before timed texts resolve.
ttsGenData.timedTextsFut.awaitcan hang indefinitely if the TTS task throws before resolving the future. Consider racing it with the TTS task result or adding a timeout to avoid a stuck pipeline.🛠️ Suggested fix
// Check if we should use TTS aligned transcripts // Conditions: useTtsAlignedTranscript enabled, TTS has alignedTranscript capability, and we have ttsGenData if (this.useTtsAlignedTranscript && this.tts?.capabilities.alignedTranscript && ttsGenData) { - // Wait for the timed texts stream to be resolved - const timedTextsStream = await ttsGenData.timedTextsFut.await; + // Avoid hanging if TTS fails before timedTextsFut resolves + const timedTextsStream = await Promise.race([ + ttsGenData.timedTextsFut.await, + ttsTask ? ttsTask.result.then(() => null).catch(() => null) : Promise.resolve(null), + ]); if (timedTextsStream) { this.logger.debug('Using TTS aligned transcripts for transcription node input'); transcriptionInput = timedTextsStream; } }plugins/cartesia/src/types.ts (1)
1-3: Copyright year should be 2025 per coding guidelines.
Based on coding guidelines, SPDX-FileCopyrightText should use 2025 for newly added files.
📝 Fix copyright year
-// SPDX-FileCopyrightText: 2024 LiveKit, Inc.
+// SPDX-FileCopyrightText: 2025 LiveKit, Inc.
 //
 // SPDX-License-Identifier: Apache-2.0
plugins/cartesia/src/tts.ts (1)
405-411: Typo in log message: "STT" should be "TTS".
The timeout debug log incorrectly refers to "STT chunk stream", but this is TTS code.
📝 Fix typo
  timeout = setTimeout(() => {
    // cartesia chunk timeout quite often, so we make it a debug log
    this.#logger.debug(
-     `Cartesia WebSocket STT chunk stream timeout after ${this.#opts.chunkTimeout}ms`,
+     `Cartesia WebSocket TTS chunk stream timeout after ${this.#opts.chunkTimeout}ms`,
    );
    ws.close();
  }, this.#opts.chunkTimeout);
🧹 Nitpick comments (2)
plugins/cartesia/src/types.ts (1)
64-76: Consider using `.passthrough()` for error schema to handle unknown fields.
The error message schema uses `z.string()` for the `type` field, which prevents it from being included in the discriminated union. However, if Cartesia adds new message types in the future, they would be parsed as errors. Consider whether this fallback behavior is intentional.
agents/src/voice/transcription/synchronizer.ts (1)
99-107: Linear scan instead of binary search.
The comment mentions "binary search" but the implementation is a linear scan. For small arrays this is fine, but consider using actual binary search for better performance with longer transcripts.
♻️ Binary search implementation
  // Binary search for the right position (equivalent to np.searchsorted with side="right")
- let idx = 0;
- for (let i = 0; i < this.timestamps.length; i++) {
-   if (this.timestamps[i]! <= timestamp) {
-     idx = i + 1;
-   } else {
-     break;
-   }
+ let lo = 0;
+ let hi = this.timestamps.length;
+ while (lo < hi) {
+   const mid = Math.floor((lo + hi) / 2);
+   if (this.timestamps[mid]! <= timestamp) {
+     lo = mid + 1;
+   } else {
+     hi = mid;
+   }
  }
+ const idx = lo;
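The suggested loop is an upper-bound binary search. As a minimal standalone sketch, assuming an ascending array of timestamps, it could be factored out as below; the helper name `searchSortedRight` is hypothetical and does not appear in the PR.

```typescript
// Hypothetical helper (not from the PR): returns the number of entries in an
// ascending array that are <= value, i.e. the insertion index on the right,
// like np.searchsorted(sorted, value, side="right").
function searchSortedRight(sorted: readonly number[], value: number): number {
  let lo = 0;
  let hi = sorted.length;
  while (lo < hi) {
    // Unsigned shift floors the midpoint without a Math.floor call.
    const mid = (lo + hi) >>> 1;
    if (sorted[mid]! <= value) {
      lo = mid + 1; // everything up to and including mid is <= value
    } else {
      hi = mid; // mid is strictly greater, so the answer is at or before mid
    }
  }
  return lo;
}
```

For an ascending input this returns the same index as the original linear scan, in O(log n) comparisons instead of O(n).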
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
- agents/src/inference/stt.ts
- agents/src/llm/realtime.ts
- agents/src/tts/stream_adapter.ts
- agents/src/tts/tts.ts
- agents/src/types.ts
- agents/src/voice/agent.ts
- agents/src/voice/agent_activity.ts
- agents/src/voice/agent_session.ts
- agents/src/voice/generation.ts
- agents/src/voice/index.ts
- agents/src/voice/io.ts
- agents/src/voice/transcription/synchronizer.ts
- plugins/cartesia/src/tts.ts
- plugins/cartesia/src/types.ts
- plugins/deepgram/src/stt.ts
- plugins/deepgram/src/stt_v2.ts
- plugins/elevenlabs/src/tts.ts
- plugins/openai/src/realtime/api_proto.ts
- plugins/openai/src/realtime/realtime_model.ts
- plugins/openai/src/realtime/realtime_model_beta.ts
🚧 Files skipped from review as they are similar to previous changes (8)
- plugins/openai/src/realtime/api_proto.ts
- agents/src/llm/realtime.ts
- plugins/deepgram/src/stt.ts
- agents/src/types.ts
- agents/src/voice/index.ts
- agents/src/tts/tts.ts
- plugins/deepgram/src/stt_v2.ts
- plugins/openai/src/realtime/realtime_model.ts
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (.cursor/rules/agent-core.mdc)
Add SPDX-FileCopyrightText and SPDX-License-Identifier headers to all newly added files with '// SPDX-FileCopyrightText: 2025 LiveKit, Inc.' and '// SPDX-License-Identifier: Apache-2.0'
Files:
- agents/src/voice/agent_session.ts
- agents/src/inference/stt.ts
- plugins/cartesia/src/tts.ts
- agents/src/voice/transcription/synchronizer.ts
- plugins/cartesia/src/types.ts
- agents/src/voice/agent_activity.ts
- agents/src/tts/stream_adapter.ts
- plugins/elevenlabs/src/tts.ts
- agents/src/voice/io.ts
- agents/src/voice/agent.ts
- plugins/openai/src/realtime/realtime_model_beta.ts
- agents/src/voice/generation.ts
**/*.{ts,tsx}?(test|example|spec)
📄 CodeRabbit inference engine (.cursor/rules/agent-core.mdc)
When testing inference LLM, always use full model names from `agents/src/inference/models.ts` (e.g., 'openai/gpt-4o-mini' instead of 'gpt-4o-mini')
Files:
- agents/src/voice/agent_session.ts
- agents/src/inference/stt.ts
- plugins/cartesia/src/tts.ts
- agents/src/voice/transcription/synchronizer.ts
- plugins/cartesia/src/types.ts
- agents/src/voice/agent_activity.ts
- agents/src/tts/stream_adapter.ts
- plugins/elevenlabs/src/tts.ts
- agents/src/voice/io.ts
- agents/src/voice/agent.ts
- plugins/openai/src/realtime/realtime_model_beta.ts
- agents/src/voice/generation.ts
**/*.{ts,tsx}?(test|example)
📄 CodeRabbit inference engine (.cursor/rules/agent-core.mdc)
Initialize logger before using any LLM functionality with `initializeLogger({ pretty: true })` from '@livekit/agents'
Files:
- agents/src/voice/agent_session.ts
- agents/src/inference/stt.ts
- plugins/cartesia/src/tts.ts
- agents/src/voice/transcription/synchronizer.ts
- plugins/cartesia/src/types.ts
- agents/src/voice/agent_activity.ts
- agents/src/tts/stream_adapter.ts
- plugins/elevenlabs/src/tts.ts
- agents/src/voice/io.ts
- agents/src/voice/agent.ts
- plugins/openai/src/realtime/realtime_model_beta.ts
- agents/src/voice/generation.ts
🧠 Learnings (3)
📚 Learning: 2026-01-16T14:33:39.551Z
Learnt from: CR
Repo: livekit/agents-js PR: 0
File: .cursor/rules/agent-core.mdc:0-0
Timestamp: 2026-01-16T14:33:39.551Z
Learning: Applies to examples/src/test_*.ts : For plugin component debugging (STT, TTS, LLM), create test example files prefixed with `test_` under the examples directory and run with `pnpm build && node ./examples/src/test_my_plugin.ts`
Applied to files:
- plugins/cartesia/src/tts.ts
- agents/src/voice/generation.ts
📚 Learning: 2026-01-16T14:33:39.551Z
Learnt from: CR
Repo: livekit/agents-js PR: 0
File: .cursor/rules/agent-core.mdc:0-0
Timestamp: 2026-01-16T14:33:39.551Z
Learning: Applies to **/*.{ts,tsx,js,jsx} : Add SPDX-FileCopyrightText and SPDX-License-Identifier headers to all newly added files with '// SPDX-FileCopyrightText: 2025 LiveKit, Inc.' and '// SPDX-License-Identifier: Apache-2.0'
Applied to files:
plugins/cartesia/src/types.ts
📚 Learning: 2026-01-16T14:33:39.551Z
Learnt from: CR
Repo: livekit/agents-js PR: 0
File: .cursor/rules/agent-core.mdc:0-0
Timestamp: 2026-01-16T14:33:39.551Z
Learning: Applies to **/*.{ts,tsx}?(test|example) : Initialize logger before using any LLM functionality with `initializeLogger({ pretty: true })` from '@livekit/agents'
Applied to files:
- plugins/openai/src/realtime/realtime_model_beta.ts
- agents/src/voice/generation.ts
🧬 Code graph analysis (8)
agents/src/inference/stt.ts (1)
agents/src/voice/io.ts (2)
TimedString (40-47), createTimedString (52-67)
plugins/cartesia/src/tts.ts (2)
agents/src/voice/index.ts (1)
TimedString (9-9)
plugins/cartesia/src/types.ts (6)
CartesiaServerMessage (88-88), cartesiaMessageSchema (73-76), isErrorMessage (110-112), hasWordTimestamps (114-116), isChunkMessage (94-96), isDoneMessage (102-104)
agents/src/voice/transcription/synchronizer.ts (1)
agents/src/voice/io.ts (2)
TimedString (40-47), isTimedString (72-79)
agents/src/voice/agent_activity.ts (4)
agents/src/voice/generation.ts (3)
performTTSInference (528-661), _TTSGenerationData (62-71), performAudioForwarding (810-828)
agents/src/voice/index.ts (1)
TimedString (9-9)
agents/src/voice/io.ts (1)
TimedString (40-47)
agents/src/index.ts (1)
TimedString (37-37)
agents/src/tts/stream_adapter.ts (2)
agents/src/voice/io.ts (1)
createTimedString (52-67)
agents/src/types.ts (1)
USERDATA_TIMED_TRANSCRIPT (8-8)
plugins/elevenlabs/src/tts.ts (1)
agents/src/voice/index.ts (1)
TimedString (9-9)
plugins/openai/src/realtime/realtime_model_beta.ts (1)
agents/src/voice/io.ts (2)
TimedString (40-47), createTimedString (52-67)
agents/src/voice/generation.ts (2)
agents/src/voice/io.ts (4)
TimedString (40-47), TTSNode (27-30), createTimedString (52-67), isTimedString (72-79)
agents/src/types.ts (1)
USERDATA_TIMED_TRANSCRIPT (8-8)
🔇 Additional comments (30)
plugins/elevenlabs/src/tts.ts (3)
11-12: Timed transcript normalization and TimedString construction look solid.
Using `createTimedString` and subtracting the first-word offset keeps timestamps in seconds and consistent with the new TimedString contract.
Also applies to: 120-121, 173-239, 299-311, 531-537, 557-563
1021-1077: Queued timed transcripts are properly attached to frames.
Draining the timed transcript queue and attaching to the next emitted frame (including final flush) should prevent alignment drops.
643-651: syncAlignment properly gates alignedTranscript capability.
Mapping the option into the base `TTS` capability makes feature gating explicit and consistent.
agents/src/voice/agent_session.ts (1)
76-81: Aligned transcript option is well surfaced at the session level.
Defaulting to true and exposing a getter keeps behavior centralized and discoverable.
Also applies to: 94-95, 274-279
agents/src/tts/stream_adapter.ts (1)
6-9: StreamAdapter timing accumulation is clear.
`cumulativeDuration` plus `createTimedString` gives deterministic token start times while keeping the adapter aligned-transcript aware.
Also applies to: 18-18, 57-58, 106-110
agents/src/inference/stt.ts (1)
19-19: TimedString factory use keeps markers consistent.
Centralizing construction through `createTimedString` ensures the symbol marker is always set.
Also applies to: 492-499
plugins/openai/src/realtime/realtime_model_beta.ts (2)
13-20: TimedString-capable text channels are wired correctly.
Typing the text channel as `string | TimedString` while keeping `audioTranscript` concatenation on raw deltas preserves downstream expectations.
Also applies to: 66-66, 1112-1112, 1291-1291
1280-1284: No action needed. The OpenAI Realtime API returns `start_time` in seconds, not milliseconds, so the current code is correct. The `event.start_time` value can be passed directly to `createTimedString` without conversion.
Likely an incorrect or invalid review comment.
agents/src/voice/io.ts (1)
32-79: TimedString utilities and `TextOutput` signature update look good.
Symbol marker + factory + guard centralize aligned transcript handling, and `captureText` now supports timed entries.
Also applies to: 249-249
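To make the "symbol marker + factory + guard" pattern concrete, here is a rough sketch. The field names (`text`, `startTime`, `endTime`, `confidence`, `startTimeOffset`) are taken from the diff quoted earlier in this review; the symbol name and every implementation detail are assumptions, and the real code in `agents/src/voice/io.ts` may differ.

```typescript
// Assumed marker name; the actual symbol in agents/src/voice/io.ts may differ.
const TIMED_STRING_SYMBOL = Symbol.for('example.TimedString');

interface TimedString {
  readonly [TIMED_STRING_SYMBOL]: true;
  text: string;
  startTime?: number; // seconds
  endTime?: number; // seconds
  confidence?: number;
  startTimeOffset?: number;
}

// Everything except the marker, which only the factory is meant to set.
type TimedStringInit = Omit<TimedString, typeof TIMED_STRING_SYMBOL>;

// Factory: the single place where the marker gets attached.
function createTimedString(init: TimedStringInit): TimedString {
  return { ...init, [TIMED_STRING_SYMBOL]: true };
}

// Guard: distinguishes TimedString from plain strings and look-alike objects.
function isTimedString(value: unknown): value is TimedString {
  return (
    typeof value === 'object' &&
    value !== null &&
    (value as Record<symbol, unknown>)[TIMED_STRING_SYMBOL] === true
  );
}
```

Because the guard checks the marker rather than the shape, an arbitrary `{ text: '...' }` object is not mistaken for a timed entry in a mixed `string | TimedString` stream.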
agents/src/voice/agent_activity.ts (4)
362-369: LGTM! Clean precedence logic for the new setting.
The getter correctly implements the agent-level override pattern, allowing per-agent configuration to take precedence over session-level defaults.
1247-1261: LGTM! Proper integration of TTS generation data.
The refactoring correctly uses `ttsGenData.audioStream` for audio forwarding while enabling the timed transcripts pipeline.
1453-1465: LGTM! Consistent use of ttsGenData for audio forwarding.
The null check with a clear error message ensures developers understand when the invariant is violated.
1855-1856: Type signature correctly updated to support TimedString.
The realtime generation task properly handles the mixed `string | TimedString` stream type for transcription input.
plugins/cartesia/src/types.ts (1)
94-115: Type guards are redundant after Zod parsing but useful for type narrowing.
Since `cartesiaMessageSchema.parse()` already validates the message structure, these guards primarily serve TypeScript type narrowing. The implementation is correct.
agents/src/voice/transcription/synchronizer.ts (3)
36-133: LGTM! SpeakingRateData provides timing interpolation for TTS alignment.
The class correctly tracks word timing annotations and computes accumulated speaking units for synchronization. The approach of storing timestamps, rates, and integrals allows for efficient interpolation.
579-591: LGTM! Timed texts bypass audio synchronizer correctly.
When using TTS-aligned transcripts, audio goes directly to the room while text timing comes from TTS annotations. The logic correctly handles the case where text is pending but no audio was pushed through the synchronizer.
653-660: Good addition of barrier await before accessing `_impl`.
This prevents race conditions where `flush` could access an outdated `_impl` during segment rotation.
agents/src/voice/agent.ts (3)
75-82: LGTM! Clean implementation of the useTtsAlignedTranscript option.
The option is well-documented, properly stored, and exposed via a getter. The comment referencing the Python implementation (agent.py line 50, 80) is helpful for cross-language parity.
Also applies to: 91-94, 118-119, 164-165, 185-190
430-433: LGTM! Timed transcripts correctly attached to audio frames.
The `USERDATA_TIMED_TRANSCRIPT` key is used consistently to propagate word-level timing from TTS through the audio pipeline.
447-452: LGTM! Default transcriptionNode passes through TimedString.
The passthrough behavior correctly preserves timing information when no custom transcription processing is needed.
plugins/cartesia/src/tts.ts (6)
56-74: LGTM! Well-documented wordTimestamps option with sensible default.
The option enables word-level timing data by default, aligning with the broader TTS-aligned transcript feature.
81-93: LGTM! Capabilities correctly reflect wordTimestamps setting.
The `alignedTranscript` capability is derived from the `wordTimestamps` option, ensuring downstream code can check TTS capabilities accurately.
280-340: Excellent refactoring to event channel pattern.
The switch from repeatedly attaching/detaching WebSocket listeners to a single event channel prevents message loss during processing. The pattern correctly buffers incoming messages.
370-386: LGTM! Word timestamps correctly converted to TimedString.
The code properly iterates through the word timestamps array and creates TimedString objects with timing data. Adding a space after each word ensures consistent tokenization downstream.
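The word-timestamp conversion described above might look roughly like the following. This is only a sketch: the payload shape (parallel `words`, `start`, and `end` arrays) is an assumption about the Cartesia message, and `WordTimestamps`, `TimedWord`, and `toTimedWords` are illustrative names rather than the plugin's actual code, which builds `TimedString` objects via `createTimedString`.

```typescript
// Assumed payload shape: three parallel arrays, times in seconds.
interface WordTimestamps {
  words: string[];
  start: number[];
  end: number[];
}

// Simplified stand-in for TimedString, kept local so the sketch is self-contained.
interface TimedWord {
  text: string;
  startTime: number;
  endTime: number;
}

function toTimedWords(ts: WordTimestamps): TimedWord[] {
  // Guard against ragged arrays by only using the common prefix.
  const count = Math.min(ts.words.length, ts.start.length, ts.end.length);
  const out: TimedWord[] = [];
  for (let i = 0; i < count; i++) {
    out.push({
      // A trailing space keeps words separable when they are concatenated later.
      text: `${ts.words[i]!} `,
      startTime: ts.start[i]!,
      endTime: ts.end[i]!,
    });
  }
  return out;
}
```

Clamping to the shortest array is a defensive choice made for this sketch; whether the plugin does the same is not stated in the review.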
412-432: Good coordination between sentence stream and done messages.
The logic correctly waits for `sentenceStreamClosed` before processing the final done message, ensuring all sentences are sent before closing the WebSocket.
476-522: LGTM! toCartesiaOptions correctly handles streaming parameter.
The function conditionally adds `add_timestamps: true` only for streaming mode when word timestamps are enabled, matching Cartesia's WebSocket API requirements.
agents/src/voice/generation.ts (4)
58-71: LGTM! Clean interface for TTS generation data.
The `_TTSGenerationData` interface properly encapsulates the audio stream and timed texts future, enabling the agent activity to coordinate transcription with TTS output.
542-562: LGTM! Text extraction handles mixed string/TimedString input.
The async IIFE correctly extracts plain text for the TTS node while allowing the original TimedString objects to flow elsewhere if needed.
578-582: Critical: Future resolved before loop enables concurrent reading.
This design allows `agent_activity` to start consuming the timed texts stream while `performTTSInference` is still writing, enabling real-time transcript synchronization.
676-711: LGTM! forwardText correctly handles mixed string/TimedString input.
The function extracts text for accumulation while passing the original value (including TimedString with timing) to `textOutput.captureText` for synchronization.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@agents/src/voice/transcription/synchronizer.test.ts`:
- Around line 1-3: Update the file header copyright year from 2024 to 2025 in
the SPDX comment block at the top of the file (the two-line block containing
"SPDX-FileCopyrightText" and "SPDX-License-Identifier"); replace "2024 LiveKit,
Inc." with "2025 LiveKit, Inc." so the header complies with the new file year
guideline.
♻️ Duplicate comments (3)
agents/src/voice/agent_activity.ts (1)
1414-1425: Guard against potential hang if TTS initialization fails before `timedTextsFut` resolves.
The `await ttsGenData.timedTextsFut.await` on line 1420 can hang indefinitely if TTS throws before resolving the future. Consider racing it with a failure signal from the TTS task.
🛠️ Suggested fix
  // Check if we should use TTS aligned transcripts
  if (this.useTtsAlignedTranscript && this.tts?.capabilities.alignedTranscript && ttsGenData) {
-   // Wait for the timed texts stream to be resolved
-   const timedTextsStream = await ttsGenData.timedTextsFut.await;
+   // Avoid hanging if TTS fails before timedTextsFut resolves
+   const timedTextsStream = await Promise.race([
+     ttsGenData.timedTextsFut.await,
+     ttsTask ? ttsTask.result.then(() => null).catch(() => null) : Promise.resolve(null),
+   ]);
    if (timedTextsStream) {
      this.logger.debug('Using TTS aligned transcripts for transcription node input');
      transcriptionInput = timedTextsStream;
    }
  }
agents/src/voice/transcription/synchronizer.ts (1)
199-201: Verify `hasPendingText` logic for both use cases.
The current implementation returns `true` if any text was ever pushed (`pushedText.length > 0`), rather than checking for unforwarded text (`pushedText.length > forwardedText.length`).
This may be intentional for the new TTS-aligned transcript flow where text is pushed without audio going through the synchronizer. However, for cases where text is being processed normally, this could return `true` even when all text has been forwarded.
Consider whether the logic should be:
return this.textData.pushedText.length > this.textData.forwardedText.length;
agents/src/voice/generation.ts (1)
569-619: LGTM with note about scaffolding.
The timed transcript extraction logic is correct:
- `timedTextsFut` is resolved before the loop (critical for streaming)
- Timed transcripts are extracted from `frame.userdata` and forwarded
- The `initialPushedDuration` offset is scaffolding for future multi-inference support (currently always 0)
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- .changeset/odd-moose-check.md
- agents/src/tts/tts.ts
- agents/src/voice/agent.ts
- agents/src/voice/agent_activity.ts
- agents/src/voice/agent_session.ts
- agents/src/voice/generation.ts
- agents/src/voice/transcription/synchronizer.test.ts
- agents/src/voice/transcription/synchronizer.ts
- examples/src/basic_agent.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- examples/src/basic_agent.ts
- agents/src/voice/agent_session.ts
- agents/src/tts/tts.ts
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (.cursor/rules/agent-core.mdc)
Add SPDX-FileCopyrightText and SPDX-License-Identifier headers to all newly added files with '// SPDX-FileCopyrightText: 2025 LiveKit, Inc.' and '// SPDX-License-Identifier: Apache-2.0'
Files:
- agents/src/voice/transcription/synchronizer.test.ts
- agents/src/voice/agent_activity.ts
- agents/src/voice/transcription/synchronizer.ts
- agents/src/voice/generation.ts
- agents/src/voice/agent.ts
**/*.{ts,tsx}?(test|example|spec)
📄 CodeRabbit inference engine (.cursor/rules/agent-core.mdc)
When testing inference LLM, always use full model names from `agents/src/inference/models.ts` (e.g., 'openai/gpt-4o-mini' instead of 'gpt-4o-mini')
Files:
- agents/src/voice/transcription/synchronizer.test.ts
- agents/src/voice/agent_activity.ts
- agents/src/voice/transcription/synchronizer.ts
- agents/src/voice/generation.ts
- agents/src/voice/agent.ts
**/*.{ts,tsx}?(test|example)
📄 CodeRabbit inference engine (.cursor/rules/agent-core.mdc)
Initialize logger before using any LLM functionality with `initializeLogger({ pretty: true })` from '@livekit/agents'
Files:
- agents/src/voice/transcription/synchronizer.test.ts
- agents/src/voice/agent_activity.ts
- agents/src/voice/transcription/synchronizer.ts
- agents/src/voice/generation.ts
- agents/src/voice/agent.ts
🧠 Learnings (3)
📚 Learning: 2026-01-16T14:33:39.551Z
Learnt from: CR
Repo: livekit/agents-js PR: 0
File: .cursor/rules/agent-core.mdc:0-0
Timestamp: 2026-01-16T14:33:39.551Z
Learning: Applies to **/{examples,test}/**/*.test.ts : Include both basic streaming and tool calling tests to verify full LLM functionality
Applied to files:
agents/src/voice/transcription/synchronizer.test.ts
📚 Learning: 2026-01-16T14:33:39.551Z
Learnt from: CR
Repo: livekit/agents-js PR: 0
File: .cursor/rules/agent-core.mdc:0-0
Timestamp: 2026-01-16T14:33:39.551Z
Learning: Applies to examples/src/test_*.ts : For plugin component debugging (STT, TTS, LLM), create test example files prefixed with `test_` under the examples directory and run with `pnpm build && node ./examples/src/test_my_plugin.ts`
Applied to files:
- agents/src/voice/transcription/synchronizer.test.ts
- agents/src/voice/generation.ts
📚 Learning: 2026-01-16T14:33:39.551Z
Learnt from: CR
Repo: livekit/agents-js PR: 0
File: .cursor/rules/agent-core.mdc:0-0
Timestamp: 2026-01-16T14:33:39.551Z
Learning: Applies to **/*.{ts,tsx}?(test|example) : Initialize logger before using any LLM functionality with `initializeLogger({ pretty: true })` from '@livekit/agents'
Applied to files:
agents/src/voice/generation.ts
🧬 Code graph analysis (2)
agents/src/voice/transcription/synchronizer.test.ts (1)
agents/src/voice/transcription/synchronizer.ts (1)
SpeakingRateData (40-134)
agents/src/voice/transcription/synchronizer.ts (1)
agents/src/voice/io.ts (2)
TimedString (40-47), isTimedString (72-79)
🔇 Additional comments (23)
.changeset/odd-moose-check.md (1)
1-9: LGTM!
The changeset is properly formatted with appropriate patch version bumps for the TTS-aligned transcripts feature across the affected packages.
agents/src/voice/agent.ts (5)
29-33: LGTM!
The imports for `USERDATA_TIMED_TRANSCRIPT` and `TimedString` are correctly added to support the TTS-aligned transcripts feature.
85-86: LGTM!
The private field declaration follows the existing pattern for optional agent configuration.
176-179: LGTM!
The getter correctly returns `boolean | undefined`, allowing the agent-level setting to be `undefined` when not explicitly set, enabling proper precedence logic in `AgentActivity`.
409-413: LGTM!
The timed transcripts are correctly attached to `frame.userdata` using the `USERDATA_TIMED_TRANSCRIPT` constant, with proper null/empty checks before assignment.
204-208: LGTM!
The `transcriptionNode` signature is correctly updated to accept and return `ReadableStream<string | TimedString>`, enabling the flow of timing information through the transcription pipeline.
agents/src/voice/transcription/synchronizer.test.ts (1)
7-206: LGTM!
Comprehensive test coverage for `SpeakingRateData` including:
- Constructor initialization
- `addByRate` with single/multiple entries and zero rate
- `addByAnnotation` with buffering, flushing, and recursive endTime handling
- `accumulateTo` with interpolation, extrapolation, and capping logic
- Integration scenarios for realistic TTS word-timing workflows
The mathematical assertions align correctly with the implementation.
agents/src/voice/agent_activity.ts (5)
63-75: LGTM!
The imports for `_TTSGenerationData`, `ToolExecutionOutput`, and `TimedString` are correctly added to support the TTS-aligned transcripts feature integration.
362-366: LGTM!
The getter correctly implements precedence logic where the agent-level setting takes priority over the session-level setting using nullish coalescing.
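The reason nullish coalescing matters here is that an explicit `false` at the agent level must still win over a session default of `true`, which a `||` fallback would get wrong. A minimal sketch, with a hypothetical free function standing in for the actual getter:

```typescript
// Hypothetical stand-in for the getter in agent_activity.ts: the agent-level
// value wins whenever it is set, including when it is explicitly false.
function resolveUseTtsAlignedTranscript(
  agentLevel: boolean | undefined,
  sessionLevel: boolean,
): boolean {
  // `??` only falls through on null/undefined, unlike `||`, which would
  // also discard an explicit `false`.
  return agentLevel ?? sessionLevel;
}
```

With `||` instead of `??`, `resolveUseTtsAlignedTranscript(false, true)` would yield `true`, silently ignoring an agent that opted out.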
1243-1257: LGTM!
The code correctly uses the new `_TTSGenerationData` structure, extracting `audioStream` for audio forwarding.
1850-1864: LGTM!
The realtime generation path correctly types `ttsTextInput` and `trTextInput` as `ReadableStream<string | TimedString>`, maintaining consistency with the pipeline reply task.
1883-1891: LGTM!
The realtime path correctly uses the new `_TTSGenerationData` structure, extracting `audioStream` from `ttsGenData`.
agents/src/voice/transcription/synchronizer.ts (6)
11-17: LGTM!
The imports are correctly updated to include `TimedString` and `isTimedString` for handling timing information in the synchronizer.
36-134: LGTM!
The `SpeakingRateData` class is well-implemented with clear documentation. The timing annotation accumulation logic correctly handles:
- Rate-based additions with integral calculation
- Annotation-based additions with text buffering and flushing
- Interpolation and extrapolation in `accumulateTo`
The class is appropriately exported for testing purposes.
232-262: LGTM!
The `pushText` method is correctly updated to handle both `string` and `TimedString` inputs. The lazy initialization of `annotatedRate` and proper extraction of timing information enables accurate synchronization with TTS timing annotations.
352-388: LGTM!
The timing synchronization logic correctly prioritizes TTS timing annotations when available (`annotatedRate`) and falls back to the estimated hyphen-per-second calculation otherwise. The comparison of target vs forwarded text lengths ensures accurate synchronization with actual TTS playback timing.
572-584: LGTM!
The updated flush logic correctly handles the TTS-aligned transcript flow where audio goes directly to the room while text still needs to be processed through the synchronizer. The distinction between "pending text" and "empty segment" ensures proper behavior for both flows.
623-661: LGTM!
The `SyncedTextOutput` changes correctly:
- Extract plain text from `TimedString` when the synchronizer is disabled (pass-through mode)
- Pass the full `TimedString` to `pushText` when enabled, preserving timing information
- Await the barrier in `flush` to ensure safe access to `_impl` after potential segment rotation
agents/src/voice/generation.ts (5)
27-39: LGTM!
The imports for `USERDATA_TIMED_TRANSCRIPT` and TimedString utilities are correctly added to support TTS-aligned transcript generation.
58-72: LGTM!
The `_TTSGenerationData` interface is well-designed, encapsulating the audio stream and timed transcripts future. Using a `Future` for `timedTextsFut` enables async resolution before the TTS loop completes.
519-553: LGTM!
The `performTTSInference` function correctly:
- Accepts `ReadableStream<string | TimedString>` for flexible input
- Creates a text-only stream for TTS consumption by extracting text from TimedString objects
- Handles both string and TimedString inputs uniformly
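One way to picture the text-only extraction is a small mapping step placed in front of the TTS node. The sketch below uses a `TransformStream`, which is an illustrative substitute: an earlier comment in this review says the PR itself uses an async IIFE. `TimedChunk`, `toPlainText`, and `textOnlyTransform` are hypothetical names.

```typescript
// Simplified stand-in for TimedString; only `text` matters for extraction.
interface TimedChunk {
  text: string;
  startTime?: number;
  endTime?: number;
}

type TextChunk = string | TimedChunk;

// Strings pass through unchanged; timed entries contribute only their text.
function toPlainText(chunk: TextChunk): string {
  return typeof chunk === 'string' ? chunk : chunk.text;
}

// Illustrative wrapper (not the PR's approach): maps a mixed stream to a
// string-only stream so a TTS node never sees timing metadata.
function textOnlyTransform(): TransformStream<TextChunk, string> {
  return new TransformStream<TextChunk, string>({
    transform(chunk, controller) {
      controller.enqueue(toPlainText(chunk));
    },
  });
}
```

A caller could write `mixedStream.pipeThrough(textOnlyTransform())` to obtain a `ReadableStream<string>`, while the original mixed stream (for example via `tee()`) continues to carry timing information to the transcription path.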
643-651: LGTM!
The return structure correctly provides the `_TTSGenerationData` object containing both the audio stream and timed texts future, enabling consumers to coordinate audio playback with synchronized transcription.
659-711: LGTM!
The text forwarding functions correctly handle `ReadableStream<string | TimedString>`:
- Extract text for accumulation in `out.text`
- Pass the original value (preserving TimedString) to `textOutput.captureText` for synchronized transcription
- Maintain consistent function signatures
Summary
Implements the TTS-aligned transcripts feature for Node.js Agents. The feature enables word-level timestamp synchronization between TTS audio and displayed transcripts.
Changes
Core Implementation
- `TimedString` interface for text with timing information (`startTime`, `endTime`)
- `createTimedString` factory function
- `TTSCapabilities.alignedTranscript` flag to indicate TTS provider support
- `performTTSInference` utility that extracts timed texts from TTS audio frames
- `TranscriptionSynchronizer` to use `SpeakingRateData` for interpolating word timing from annotations
- `USERDATA_TIMED_TRANSCRIPT` constant for attaching timed transcripts to audio frame userdata
ElevenLabs Plugin
- `syncAlignment` option (defaults to `true`)
- `toTimedWords` to parse ElevenLabs alignment data into `TimedString` objects
- `firstWordOffsetMs` to handle absolute timestamps
Cartesia Plugin
- `types.ts` with Zod schemas for Cartesia WebSocket API messages (`chunk`, `timestamps`, `done`, `flush_done`, `error`)
- `createStreamChannel` pattern, preventing message loss during listener re-registration
- `word_timestamps` messages
- `wordTimestamps` option (defaults to `true`)
Voice Options
- `useTtsAlignedTranscript` from session-level parameter into `VoiceOptions` interface
- `true`
Stream Adapter
- `StreamAdapter` to create `TimedString` with cumulative duration for non-streaming TTS providers
- `alignedTranscript: true` capability for adapted streams
Agent Activity
- `useTtsAlignedTranscript` is enabled and TTS supports it
Test plan
- `StreamAdapter` sync with audio
Summary by CodeRabbit
New Features
Tests