CosyVoice Bidirectional Streaming streamingCall(): Overall Frontend and Backend Design

📅 2026/7/1 4:08:06
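While keeping the existing LLM streaming messages (`type=content`) unchanged, this design upgrades TTS from "whole-segment `call()` + OSS URL" to "CosyVoice bidirectional WebSocket + audio frames pushed directly to the frontend", and guarantees that a voice failure never affects the text.

1. Current State vs. Target

| Dimension | Current | Target |
| --- | --- | --- |
| Text | Bailian `streamCall` → `type=content` | Unchanged |
| TTS API | `call(whole segment)`, blocking | `streamingCall(delta)` + `streamingComplete()` |
| Sentence splitting | Local `VoiceStreamingSegmentBuffer` | CosyVoice splits sentences on the server (the local buffer can be removed) |
| Push to frontend | `voiceChunk.voiceUrl` (OSS) | Audio frames as Base64 or WS Binary; optional short-URL fallback |
| Persistence | Segment-level OSS + merge → `ext.voiceUrl` | Accumulate frames in memory → merge and upload once at the end |
| `complete` | Waits for the TTS queue (`finishAndAwait`) | Sent as soon as the text ends; TTS finishes asynchronously |

2. Overall Architecture

Sequence diagram (flattened in the source). Participants: Mini Program/H5, ask-ai WebSocket, MobileImController, Bailian Agent `streamCall`, CosyVoice WSS `streamingCall`. Messages in order:

1. Mini Program/H5 sends the `prompt` over the ask-ai WebSocket.
2. The WebSocket layer calls `processWebSocketAskAi` on MobileImController.
3. MobileImController creates a `SpeechSynthesizer` + callback (CosyVoice WSS).
4. MobileImController calls `chatStreamForWebSocket` on the Bailian Agent.
5. Loop [LLM streaming]:
   - Bailian Agent emits a `delta`.
   - `type=content` is pushed to the frontend.
   - `streamingCall(delta)` is sent to CosyVoice.
   - CosyVoice fires `onEvent` with an `audioFrame`.
   - `type=voiceFrame` is pushed to the frontend.
6. The full text finishes.
7. `type=textComplete`, or `complete` when there is no voice, is pushed to the frontend.
8. `streamingComplete()` is sent to CosyVoice.
9. CosyVoice fires `onComplete`.
10. Frames are merged → WAV → OSS → `ext`.
11. `type=voiceComplete` is pushed to the frontend.

Principles

- Text and voice are decoupled: `content` is pushed first; TTS exceptions are only caught and logged and never interrupt the LLM.
- One CosyVoice connection per question: each Q&A gets its own `SpeechSynthesizer`; sharing a singleton is forbidden.
- Text is fed to TTS from a single thread: LLM deltas are enqueued and a single thread calls `streamingCall`, so one instance is never used from multiple threads.
- Outbound messages are serialized: `content` and `voiceFrame` for the same `WebSocketSession` go through a per-connection send queue.

3. Backend Design

3.1 Module Breakdown

| Module | Responsibility |
| --- | --- |
| `AliyunAgentServiceImpl` | Unchanged: pushes `content`, `onContentDelta` |
| `CosyVoiceStreamingSession` (new) | Manages one WSS: `streamingCall` / `streamingComplete` / callback / close |
| `VoiceSynthesisService` | Adds `openStreamingSession(voiceId, callback)`; configures the WSS endpoint |
| `ImStreamingVoicePushSession` (refactored) | Drops `call()` + segment-level OSS; instead forwards deltas, receives frames and pushes them to the frontend, and accumulates frames for the merge |
| `WebSocketOutboundQueue` (new) | Serial Text/Binary `send` on the same connection |
| `MobileImController` | Creates/destroys the session; after the LLM ends it sends `complete` first, then TTS finalizes asynchronously |

3.2 CosyVoice Session Lifecycle

```java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisAudioFormat;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;

// Q&A starts (doctorId + voiceId present, connection available)
SpeechSynthesisParam param = SpeechSynthesisParam.builder()
    .apiKey(...)
    .model("cosyvoice-v2") // must match the cloned voice
    .voice(voiceId)
    .format(SpeechSynthesisAudioFormat.PCM_22050HZ_MONO_16BIT) // or MP3; PCM/MP3 recommended for streaming
    .build();
SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, callback);

// For every LLM delta (inside onContentDelta, wrapped in try-catch)
textFeederQueue.offer(delta); // single-threaded consumer → synthesizer.streamingCall(delta)

// LLM finished
synthesizer.streamingComplete();
// wait for onComplete (e.g. via a latch), then merge and persist

// close
synthesizer.getDuplexApi().close(1000, "bye");
```

Configuration (new):

```yaml
aliyun:
  voice:
    api-key: sk-xxx
    model: cosyvoice-v2
    websocket-url: wss://{workspaceId}.cn-beijing.maas.aliyuncs.com/api-ws/v1/inference
```

SDK: version ≥ 2.22.0 is recommended (for `getOutput()` sentence events); the current 2.19.4 is enough for a first PoC.

3.3 Integration with the LLM

```
onContentDelta(delta):
  1. outboundQueue.send(content)        // already done in the Agent layer
  2. cosyVoiceSession.feedText(delta)   // non-blocking enqueue

After chatStreamForWebSocket returns:
  1. updateBotMessageAfterAiReply
  2. outboundQueue.send({ type: textComplete })  // or together with complete
  3. cosyVoiceSession.finishAsync()     // streamingComplete, does not block the main thread
  4. Easemob / alerting etc. do not depend on TTS
```

3.4 Frontend Push Protocol (Proposed)

Text (unchanged):

```json
{"type":"content","content":"delta","messageId":"...","timestamp":123}
```

Audio frame (new; JSON + Base64 is recommended because it is easy to handle in the mini program):

```json
{"type":"voiceFrame","messageId":"429476821505122304","seq":12,"format":"pcm","sampleRate":22050,"channels":1,"bitDepth":16,"sentenceIndex":0,"event":"sentence-synthesis","data":"base64...","timestamp":123}
```

Optional sentence boundary (from `result.getOutput()`):

```json
{"type":"voiceSentence","messageId":"...","sentenceIndex":0,"text":"...","event":"sentence-end"}
```

End:

```json
{"type":"voiceComplete","messageId":"...","hasVoice":true,"voiceUrl":"https://.../merged.wav"}
{"type":"complete","messageId":"...","timestamp":123}
```

| Message | When |
| --- | --- |
| `content` | LLM streaming |
| `voiceGenerating` | TTS connection established (optional) |
| `voiceFrame` | CosyVoice `onEvent` carries an `audioFrame` |
| `textComplete` | LLM finished (does not wait for TTS) |
| `voiceComplete` | After TTS `onComplete` + merge and persistence |
| `complete` | Sent together with `textComplete`, or voice is optional |

Fallback: if frame push fails or the mini program does not support streaming playback, keep short MP3 chunk URLs as `voiceChunk` for compatibility.

3.5 Persistence

```
onEvent:    audioFrames.add(frame)
onComplete:
  bytes     = concat(frames), or decode PCM → WAV
  mergedUrl = upload to OSS under voice/merged
  ext: { hasVoice, voiceUrl, voiceFormat, voiceId }
```

Segment-level OSS uploads can be dropped; only the single final merge is kept.

3.6 Failure Handling and Isolation

| Failure | Handling |
| --- | --- |
| No voice / no doctorId | Do not create TTS; send only `content` + `complete` |
| `streamingCall` / CosyVoice error | Log + `voiceComplete(hasVoice=false)`; already-pushed `content` is unaffected |
| A single frame fails to push | Log and continue with the following frames |
| merge/OSS failure | No `ext.voiceUrl`; real-time playback may still be complete |
| Timeout after 23 s without new text | Close TTS; if the LLM is still producing output, reconnect or rebuild the session for the whole question |
| Two questions sent back-to-back on one connection | Use an in-flight lock or reject the second question |

3.7 Concurrency

- Per user, per question: 1× `SpeechSynthesizer` + 1× WSS
- Global: a bounded TTS connection pool (e.g. 32 to 64); requests beyond the limit queue up
- Forbidden: a Spring singleton `SpeechSynthesizer`
- `WebSocketConnectionManager`: add `sendBinary(connectionId, bytes)` (sharing the outbound queue with Text)

4. Frontend Design

4.1 State Machine

```
IDLE → CONNECTED → AI_STREAMING → TEXT_DONE → VOICE_STREAMING → DONE
                   ↓ content                  ↓ voiceFrame
                   ↓                          ↓
                   append                     playback queue
```

- Text: concatenate `content` in order
- Voice: maintain the playback queue by `messageId` + `seq`

4.2 Playback (WeChat Mini Program)

| Option | Approach | Suitability |
| --- | --- | --- |
| A. Base64 → temp file | Accumulate each sentence / every N frames into a WAV → write to temp via `wx.getFileSystemManager` → `InnerAudioContext.src` | Small change; latency slightly higher than H5 |
| B. Sentence-level WAV | After receiving `sentence-end`, concatenate the frames, write a file, then play | Aligned with CosyVoice sentence splitting; recommended |
| C. Fallback URL | Keep receiving `voiceChunk.voiceUrl` | Compatible with old versions |

Not recommended: streaming raw PCM frame by frame in the mini program (there is no Web Audio, so the implementation cost is high).

4.3 Frontend Pseudocode

```javascript
const textBuf = {};
const audioQueue = []; // { messageId, seq, pcmChunks[] }
let playing = false;

function onMessage(msg) {
  switch (msg.type) {
    case 'content':
      appendText(msg.messageId, msg.content);
      break;
    case 'voiceFrame':
      enqueueFrame(msg);
      tryPlayNext();
      break;
    case 'textComplete':
      markTextDone(msg.messageId);
      break;
    case 'voiceComplete':
    case 'complete':
      finishSession(msg.messageId);
      break;
  }
}
```

4.4 Compatibility with the Old Protocol

- Detection: if the first packets contain `voiceFrame`, use the streaming path; if only `voiceChunk` arrives, use the URL queue
- Version: pass `?voiceProtocol=2` when connecting, or include `features: ["voiceStream"]` in `connected`

5. Phased Rollout

| Phase | Content | Risk |
| --- | --- | --- |
| P0 | SDK upgrade + `CosyVoiceStreamingSession` PoC (server receives frames and logs them) | Low |
| P1 | Push `voiceFrame` Base64 over WS + sentence-level temp-file playback in the mini program | Medium |
| P2 | Remove segment-level OSS + local `VoiceStreamingSegmentBuffer`; decouple `complete` from TTS | Medium |
| P3 | Outbound queue, bounded connection pool, in-flight lock | Medium |
| P4 | Optional Binary frames, H5 Web Audio low-latency path | Low |

6. Comparison with the Existing voiceChunk

| | Current: `voiceChunk` + OSS | Bidirectional `streamingCall` |
| --- | --- | --- |
| First-packet latency | Whole-sentence synthesis + upload | Lower (frame level) |
| Bandwidth | Client pulls from OSS | Pushed directly over WS (Base64 is larger) |
| Mini program | `InnerAudio` + URL, mature | Needs temp files or sentence-level WAV |
| Server | Simple | WSS long-lived connection + queues + protocol |
| History playback | Segment-level merge | A single merge is enough |

7. Recommendation

Recommended path: leave the LLM text protocol untouched + CosyVoice bidirectional streaming + `voiceFrame` (assemble WAV at sentence boundaries) + a single OSS merge at the end + send `complete` as soon as the text ends + degrade to no voice when voice fails.

- Backend core: `CosyVoiceStreamingSession` + single-threaded text feeder + serial outbound queue
- Frontend core: accumulate frames by `messageId`/`seq`, write a temp WAV per sentence and play it
- Compatibility: keep `voiceChunk` as a fallback switch

To land P0/P1, the refactor can start from `VoiceSynthesisService` and `ImStreamingVoicePushSession`.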
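
To illustrate the "PCM → WAV" merge step from section 3.5, here is a minimal sketch that concatenates raw PCM frames and prepends a standard 44-byte RIFF/WAVE header. The class name `PcmWavMerger` and its method are hypothetical and not part of the existing code base; the sketch assumes the frames are uncompressed little-endian PCM, as with `PCM_22050HZ_MONO_16BIT`.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: merges raw PCM frames into one WAV byte array.
class PcmWavMerger {

    /** Concatenates the frames in order and prepends a 44-byte RIFF/WAVE header. */
    static byte[] toWav(List<byte[]> frames, int sampleRate, int channels, int bitDepth) {
        ByteArrayOutputStream pcm = new ByteArrayOutputStream();
        for (byte[] frame : frames) {
            pcm.write(frame, 0, frame.length);
        }
        byte[] data = pcm.toByteArray();

        int blockAlign = channels * bitDepth / 8; // bytes per sample across all channels
        int byteRate = sampleRate * blockAlign;   // bytes per second

        ByteBuffer out = ByteBuffer.allocate(44 + data.length).order(ByteOrder.LITTLE_ENDIAN);
        out.put("RIFF".getBytes(StandardCharsets.US_ASCII));
        out.putInt(36 + data.length);             // RIFF chunk size = file size - 8
        out.put("WAVE".getBytes(StandardCharsets.US_ASCII));
        out.put("fmt ".getBytes(StandardCharsets.US_ASCII));
        out.putInt(16);                           // fmt chunk size for PCM
        out.putShort((short) 1);                  // audio format 1 = uncompressed PCM
        out.putShort((short) channels);
        out.putInt(sampleRate);
        out.putInt(byteRate);
        out.putShort((short) blockAlign);
        out.putShort((short) bitDepth);
        out.put("data".getBytes(StandardCharsets.US_ASCII));
        out.putInt(data.length);
        out.put(data);
        return out.array();
    }

    public static void main(String[] args) {
        // Two tiny dummy frames standing in for CosyVoice audio frames.
        List<byte[]> frames = Arrays.asList(new byte[] {1, 2}, new byte[] {3, 4});
        byte[] wav = toWav(frames, 22050, 1, 16);
        System.out.println("WAV size in bytes: " + wav.length);
    }
}
```

In `onComplete`, the accumulated `audioFrames` could be passed to such a helper before the single OSS upload. If the session is configured for MP3 instead of PCM, no header is needed and plain concatenation would apply instead.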
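
The "single-threaded text feeder" principle (sections 2, 3.2 and 3.3) can be sketched as a queue drained by one worker thread. This is only an illustration under stated assumptions: `TextFeeder` is a hypothetical class, the `Consumer<String>` stands in for `synthesizer.streamingCall(delta)`, and the `Runnable` stands in for `synthesizer.streamingComplete()`. Errors from the sink are caught and logged so that they never propagate back to the LLM path.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper: feeds LLM deltas to TTS from exactly one thread.
class TextFeeder {
    // Sentinel compared by identity, so user text can never collide with it.
    private static final String END = new String("__END__");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final CountDownLatch done = new CountDownLatch(1);

    TextFeeder(Consumer<String> sink, Runnable onFinish) {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    String delta = queue.take();
                    if (delta == END) {
                        break;
                    }
                    try {
                        sink.accept(delta);   // e.g. synthesizer.streamingCall(delta)
                    } catch (RuntimeException e) {
                        System.err.println("TTS feed failed: " + e.getMessage());
                    }
                }
                try {
                    onFinish.run();           // e.g. synthesizer.streamingComplete()
                } catch (RuntimeException e) {
                    System.err.println("TTS finish failed: " + e.getMessage());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                done.countDown();
            }
        }, "tts-text-feeder");
        worker.setDaemon(true);
        worker.start();
    }

    /** Non-blocking enqueue, safe to call from the LLM callback thread. */
    void feedText(String delta) {
        queue.offer(delta);
    }

    /** Signals end of text; returns immediately. */
    void finishAsync() {
        queue.offer(END);
    }

    /** Waits until the worker has drained the queue; false on timeout or interrupt. */
    boolean awaitDone(long timeoutMillis) {
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        TextFeeder feeder = new TextFeeder(
            delta -> System.out.println("streamingCall: " + delta),
            () -> System.out.println("streamingComplete"));
        feeder.feedText("Hello, ");
        feeder.feedText("world.");
        feeder.finishAsync();
        feeder.awaitDone(2000);
    }
}
```

Because `feedText` and `finishAsync` only enqueue, the LLM callback and the controller thread are never blocked by the TTS connection, which matches the intent of `cosyVoiceSession.feedText(delta)` and `finishAsync()` in section 3.3.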
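
As a sketch of the `voiceFrame` message proposed in section 3.4, the following shows how a raw audio frame might be Base64-encoded and wrapped in the JSON envelope. `VoiceFrameMessage` is a hypothetical class; the fixed `pcm` / 22050 / mono / 16-bit values mirror the example in the protocol, and the code assumes `messageId` contains no characters that need JSON escaping (a real implementation would more likely use the project's JSON library).

```java
import java.util.Base64;
import java.util.Locale;

// Hypothetical helper: builds the JSON text of one voiceFrame message.
class VoiceFrameMessage {

    static String toJson(String messageId, int seq, int sentenceIndex, byte[] pcm, long timestamp) {
        // Base64 keeps the frame transportable inside a WebSocket text message.
        String data = Base64.getEncoder().encodeToString(pcm);
        return String.format(Locale.ROOT,
            "{\"type\":\"voiceFrame\",\"messageId\":\"%s\",\"seq\":%d,"
                + "\"format\":\"pcm\",\"sampleRate\":22050,\"channels\":1,\"bitDepth\":16,"
                + "\"sentenceIndex\":%d,\"data\":\"%s\",\"timestamp\":%d}",
            messageId, seq, sentenceIndex, data, timestamp);
    }

    public static void main(String[] args) {
        byte[] frame = {1, 2, 3}; // dummy bytes standing in for one audio frame
        System.out.println(toJson("429476821505122304", 12, 0, frame, 123L));
    }
}
```

Base64 inflates the payload by roughly one third, which is the bandwidth cost noted in section 6; the optional Binary frame path in P4 would avoid it.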