Background
I still remember that right after graduation, my first projects were audio applications: speech recognition, speech synthesis, audio editing, and so on (see https://github.com/heartsuit/BaiduASRAndTTS); back then we mainly called Baidu's ASR and TTS APIs. Later, when a project needed speech synthesis, the frontend NPM package speak-tts was enough on its own (it delegates to the client operating system's speech libraries).
As for TTS (Text to Speech), anyone wanting to self-host a TTS service today is spoiled for choice: ChatTTS, VITS, MeloTTS, CoquiTTS, and more, not to mention the countless open-source models on HuggingFace and ModelScope.
The requirement here: synthesize speech from Chinese text in a CPU-only environment with no internet access, and finish within 5 seconds.
Given that scenario, this post focuses on two aspects of Chinese synthesis on pure CPU: quality (a natural-sounding voice) and efficiency (low latency). Below, I deploy three TTS services offline (eSpeak, ChatTTS, and CoquiTTS) and compare their trade-offs.
Results First
Model | Quality | Latency |
---|---|---|
eSpeak | robotic/glitchy, poor | milliseconds to a few seconds |
ChatTTS | lifelike and fluent, high quality | ~50 seconds |
CoquiTTS | normal human voice, average quality | ~5 seconds |
Note: the results above were measured in a CPU-only environment, with Chinese input of fewer than 100 characters.
eSpeak
eSpeak is an open-source text-to-speech (TTS) synthesizer that supports many languages, English included. It uses formant synthesis, which produces intelligible (if distinctly synthetic) speech, and its small footprint and broad language coverage have made it popular in a wide range of scenarios.
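To get a feel for eSpeak's flags before wiring it into a service, it helps to see the command line it expects. A minimal sketch of assembling that command (the defaults below mirror eSpeak's own defaults; `out.wav` is just an example path):

```javascript
// Sketch: assemble an espeak command line.
// -v voice/language, -s speed in words per minute, -p pitch (0-99),
// -a amplitude (0-200), -w path of the wav file to write.
function espeakCommand(text, { language = 'zh', speed = 175, pitch = 50, volume = 100, outFile = 'out.wav' } = {}) {
  return `espeak -v ${language} -s ${speed} -p ${pitch} -a ${volume} -w "${outFile}" "${text}"`;
}

console.log(espeakCommand('你好。'));
```

The service below passes exactly this kind of string to child_process.exec.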
For the eSpeak approach I had Cursor write all the code, without a single hand-written line, and it ran successfully once deployed to the server. The Chinese output, however, sounds robotic and glitchy; the quality is poor.
- Directory structure
D:.
│ docker-compose.yml
│ Dockerfile
│ package.json
│ README.md
│
└─src
    │  index.js
    │  tts.js
    │
    └─routes
            tts.routes.js
The code layout is simple: a standard Node.js backend built on the classic Express web framework, so it needs little explanation.
tts.js
const { exec } = require('child_process');
const util = require('util');
const path = require('path');
const fs = require('fs').promises;

const execPromise = util.promisify(exec);

class TextToSpeech {
  constructor() {
    this.outputDir = process.env.OUTPUT_DIR || './output';
  }

  async convertToSpeech(text, options = {}) {
    const {
      language = 'zh',
      speed = 175,
      pitch = 50,
      volume = 100
    } = options;

    const fileName = `${Date.now()}-${Math.random().toString(36).substring(7)}`;
    const wavFile = path.join(this.outputDir, `${fileName}.wav`);
    const mp3File = path.join(this.outputDir, `${fileName}.mp3`);

    try {
      // 1. Generate a wav file with espeak
      await execPromise(`espeak -v ${language} -s ${speed} -p ${pitch} -a ${volume} -w "${wavFile}" "${text}"`);
      // 2. Convert to mp3
      await execPromise(`sox "${wavFile}" "${mp3File}"`);
      // 3. Read the MP3 file
      const audioBuffer = await fs.readFile(mp3File);
      // 4. Clean up temporary files
      await Promise.all([fs.unlink(wavFile), fs.unlink(mp3File)]);
      return audioBuffer;
    } catch (error) {
      // Clean up any leftover temporary files
      try {
        await Promise.all([
          fs.unlink(wavFile).catch(() => {}),
          fs.unlink(mp3File).catch(() => {})
        ]);
      } catch (e) {
        // Ignore cleanup errors
      }
      throw new Error(`TTS conversion failed: ${error.message}`);
    }
  }
}

module.exports = TextToSpeech;
tts.routes.js
const express = require('express');
const router = express.Router();
const TextToSpeech = require('../tts');

const tts = new TextToSpeech();

router.post('/convert', async (req, res) => {
  try {
    const { text, options } = req.body;
    if (!text) {
      return res.status(400).json({ error: 'Text is required' });
    }
    const audioBuffer = await tts.convertToSpeech(text, options);
    // Set the response headers
    res.set({
      'Content-Type': 'audio/mpeg',
      'Content-Disposition': `attachment; filename="speech-${Date.now()}.mp3"`
    });
    // Send the audio data
    res.send(audioBuffer);
  } catch (error) {
    console.error('TTS Error:', error);
    res.status(500).json({ error: error.message });
  }
});

// Health check endpoint
router.get('/health', (req, res) => {
  res.json({ status: 'ok' });
});

module.exports = router;
index.js
const express = require('express');
const cors = require('cors');
const morgan = require('morgan');
const ttsRoutes = require('./routes/tts.routes');

const app = express();
const port = process.env.PORT || 4000;

// Middleware
app.use(cors());
app.use(morgan('dev'));
app.use(express.json());

// Routes
app.use('/api/tts', ttsRoutes);

// Error handling
app.use((err, req, res, next) => {
  console.error(err.stack);
  res.status(500).json({ error: 'Something went wrong!' });
});

app.listen(port, () => {
  console.log(`Server running on port ${port}`);
});
package.json
{
  "name": "tts-service",
  "version": "1.0.0",
  "description": "Offline text-to-speech service",
  "main": "src/index.js",
  "dependencies": {
    "express": "^4.18.2",
    "cors": "^2.8.5",
    "morgan": "^1.10.0",
    "uuid": "^9.0.0"
  }
}
Docker deployment files
- Dockerfile
FROM node:18-slim

# Set the time zone
ENV TZ=Asia/Shanghai
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

# Install the required packages
RUN apt-get update && apt-get install -y \
    espeak \
    espeak-ng \
    sox \
    libsox-fmt-mp3 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy package.json and package-lock.json
COPY package*.json ./

# Install dependencies
RUN npm install --registry=https://registry.npmmirror.com/

# Copy the source code
COPY . .

# Create the output directory
RUN mkdir -p output && chmod 777 output

EXPOSE 4000
CMD ["node", "src/index.js"]
- docker-compose.yml
version: '3'
services:
  tts-service:
    build: .
    ports:
      - "4000:4000"
    volumes:
      - ./output:/app/output
    environment:
      - PORT=4000
      - OUTPUT_DIR=/app/output
README.md
Usage:
1. Build and run the service:
docker-compose up --build
2. API usage example, with curl:
curl -X POST http://192.168.44.171:4000/api/tts/convert \
  -H "Content-Type: application/json" \
  -d '{"text": "这是一个测试文本","options": {"language": "zh+f2","speed": 175,"pitch": 50,"volume": 100}}' \
  --output speech.mp3
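The same endpoint can be called from Node as well; a sketch of building the request that the /convert route expects (the host and port are the ones from the curl example, and buildConvertRequest is just an illustrative helper name):

```javascript
// Sketch: build the POST request for the /api/tts/convert route.
// Sending it with fetch returns the MP3 bytes in the response body.
function buildConvertRequest(text, options = {}) {
  return {
    url: 'http://192.168.44.171:4000/api/tts/convert',
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text, options }),
    },
  };
}

const { url, init } = buildConvertRequest('这是一个测试文本', { language: 'zh+f2' });
console.log(url, init.body);
// e.g. const mp3 = Buffer.from(await (await fetch(url, init)).arrayBuffer());
```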
ChatTTS
ChatTTS is a text-to-speech (TTS) model designed specifically for conversational scenarios, supporting both Chinese and English. Trained on a large corpus, it generates natural, fluent, and expressive speech, surpassing most open-source models in prosody control. It offers fine-grained control over features such as laughter, pauses, and interjections, and can generate speech for many languages and scenarios. It is especially well suited to dialogue tasks for large language model (LLM) assistants and to applications such as conversational audio and video intros. Its output holds up even against commercial-grade systems like Microsoft's Azure TTS.
ChatTTS offers an online tool for quick experimentation: https://chattts.com/zh?__theme=dark#Demo. In my tests, though, an ordinary single sentence typically takes about 20 seconds, and that is with a GPU. Below I use ChatTTS-ui, an open-source UI project built on ChatTTS: it packages the official ChatTTS into a container and adds a web page for testing along with an open API.
Official online demo
Local deployment
# Download the ChatTTS-ui source code
cd ChatTTS-ui-main
# Run the CPU version
docker-compose -f docker-compose.cpu.yaml up -d
Results on pure CPU
Open http://192.168.44.171:9966 in a browser.
CoquiTTS
CoquiTTS is an open-source text-to-speech (TTS) system that aims to make speech synthesis more accessible to researchers, developers, and creators. It relies on transfer learning to carry knowledge across the training datasets of different languages, which markedly reduces the amount of data required. CoquiTTS supports multiple languages, including cross-language voice cloning (for example English to Chinese and Chinese to English), 16 languages in all.
Beyond that, CoquiTTS provides an advanced multilingual TTS library covering more than 1,100 languages and ships deep-learning models such as Tacotron2, VITS, and YourTTS. It is not just for generating high-quality speech: it also includes tools for training new models and fine-tuning existing ones, supports multi-speaker TTS, and offers dataset analysis utilities.
Its efficiency and versatility have earned CoquiTTS wide attention: with 36.4k stars on GitHub, it is a front-runner of the new generation of open-source speech technology.
Installation
Deploy with the official container image, using the Chinese model tts_models/zh-CN/baker/tacotron2-DDC-GST.
[root@tts ~]# docker run --rm -it -p 5002:5002 --entrypoint /bin/bash ghcr.io/coqui-ai/tts-cpu
Unable to find image 'ghcr.io/coqui-ai/tts-cpu:latest' locally
latest: Pulling from coqui-ai/tts-cpu
025c56f98b67: Pull complete
778656c04542: Pull complete
85485c9f43dd: Pull complete
23b3c91f0de2: Pull complete
fd19b936aab8: Pull complete
30b21c9aef2b: Pull complete
cc12d1e5322b: Pull complete
b91e9a336532: Pull complete
d679a5e35c77: Pull complete
0d84a5b8bca3: Pull complete
4f4fb700ef54: Pull complete
d170b2e70a00: Pull complete
c612db99f0b2: Pull complete
Digest: sha256:a2f6659245358c38efb1bb44b39f7b7b3459e03e9ed5687c447681cb82c35de3
Status: Downloaded newer image for ghcr.io/coqui-ai/tts-cpu:latest
root@b452b7513c7e:~# python3 TTS/server/server.py --list_models
Name format: type/language/dataset/model
1: tts_models/multilingual/multi-dataset/xtts_v2
2: tts_models/multilingual/multi-dataset/xtts_v1.1
3: tts_models/multilingual/multi-dataset/your_tts
4: tts_models/multilingual/multi-dataset/bark
5: tts_models/bg/cv/vits
6: tts_models/cs/cv/vits
7: tts_models/da/cv/vits
8: tts_models/et/cv/vits
9: tts_models/ga/cv/vits
10: tts_models/en/ek1/tacotron2
11: tts_models/en/ljspeech/tacotron2-DDC
12: tts_models/en/ljspeech/tacotron2-DDC_ph
13: tts_models/en/ljspeech/glow-tts
14: tts_models/en/ljspeech/speedy-speech
15: tts_models/en/ljspeech/tacotron2-DCA
16: tts_models/en/ljspeech/vits
17: tts_models/en/ljspeech/vits--neon
18: tts_models/en/ljspeech/fast_pitch
19: tts_models/en/ljspeech/overflow
20: tts_models/en/ljspeech/neural_hmm
21: tts_models/en/vctk/vits
22: tts_models/en/vctk/fast_pitch
23: tts_models/en/sam/tacotron-DDC
24: tts_models/en/blizzard2013/capacitron-t2-c50
25: tts_models/en/blizzard2013/capacitron-t2-c150_v2
26: tts_models/en/multi-dataset/tortoise-v2
27: tts_models/en/jenny/jenny
28: tts_models/es/mai/tacotron2-DDC
29: tts_models/es/css10/vits
30: tts_models/fr/mai/tacotron2-DDC
31: tts_models/fr/css10/vits
32: tts_models/uk/mai/glow-tts
33: tts_models/uk/mai/vits
34: tts_models/zh-CN/baker/tacotron2-DDC-GST
35: tts_models/nl/mai/tacotron2-DDC
36: tts_models/nl/css10/vits
37: tts_models/de/thorsten/tacotron2-DCA
38: tts_models/de/thorsten/vits
39: tts_models/de/thorsten/tacotron2-DDC
40: tts_models/de/css10/vits-neon
41: tts_models/ja/kokoro/tacotron2-DDC
42: tts_models/tr/common-voice/glow-tts
43: tts_models/it/mai_female/glow-tts
44: tts_models/it/mai_female/vits
45: tts_models/it/mai_male/glow-tts
46: tts_models/it/mai_male/vits
47: tts_models/ewe/openbible/vits
48: tts_models/hau/openbible/vits
49: tts_models/lin/openbible/vits
50: tts_models/tw_akuapem/openbible/vits
51: tts_models/tw_asante/openbible/vits
52: tts_models/yor/openbible/vits
53: tts_models/hu/css10/vits
54: tts_models/el/cv/vits
55: tts_models/fi/css10/vits
56: tts_models/hr/cv/vits
57: tts_models/lt/cv/vits
58: tts_models/lv/cv/vits
59: tts_models/mt/cv/vits
60: tts_models/pl/mai_female/vits
61: tts_models/pt/cv/vits
62: tts_models/ro/cv/vits
63: tts_models/sk/cv/vits
64: tts_models/sl/cv/vits
65: tts_models/sv/cv/vits
66: tts_models/ca/custom/vits
67: tts_models/fa/custom/glow-tts
68: tts_models/bn/custom/vits-male
69: tts_models/bn/custom/vits-female
70: tts_models/be/common-voice/glow-tts
Name format: type/language/dataset/model
1: vocoder_models/universal/libri-tts/wavegrad
2: vocoder_models/universal/libri-tts/fullband-melgan
3: vocoder_models/en/ek1/wavegrad
4: vocoder_models/en/ljspeech/multiband-melgan
5: vocoder_models/en/ljspeech/hifigan_v2
6: vocoder_models/en/ljspeech/univnet
7: vocoder_models/en/blizzard2013/hifigan_v2
8: vocoder_models/en/vctk/hifigan_v2
9: vocoder_models/en/sam/hifigan_v2
10: vocoder_models/nl/mai/parallel-wavegan
11: vocoder_models/de/thorsten/wavegrad
12: vocoder_models/de/thorsten/fullband-melgan
13: vocoder_models/de/thorsten/hifigan_v1
14: vocoder_models/ja/kokoro/hifigan_v1
15: vocoder_models/uk/mai/multiband-melgan
16: vocoder_models/tr/common-voice/hifigan
17: vocoder_models/be/common-voice/hifigan
Name format: type/language/dataset/model
1: voice_conversion_models/multilingual/vctk/freevc24
root@b452b7513c7e:~# python3 TTS/server/server.py --model_name tts_models/zh-CN/baker/tacotron2-DDC-GST
> tts_models/zh-CN/baker/tacotron2-DDC-GST is already downloaded.
> Using model: tacotron2
> Setting up Audio Processor...
| > sample_rate:22050
| > resample:False
| > num_mels:80
| > log_func:np.log10
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:0
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:50.0
| > mel_fmax:7600.0
| > pitch_fmin:0.0
| > pitch_fmax:640.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > do_amp_to_db_linear:True
| > do_amp_to_db_mel:True
| > do_rms_norm:False
| > db_level:None
| > stats_path:/root/.local/share/tts/tts_models--zh-CN--baker--tacotron2-DDC-GST/scale_stats.npy
| > base:10
| > hop_length:256
| > win_length:1024
> Model's reduction rate `r` is set to: 2
* Serving Flask app 'server'
* Debug mode: off
INFO:werkzeug:WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on all addresses (::)
* Running on http://[::1]:5002
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug:::ffff:192.168.26.12 - - [31/Dec/2024 02:29:57] "GET / HTTP/1.1" 200 -
INFO:werkzeug:::ffff:192.168.26.12 - - [31/Dec/2024 02:30:39] "GET /favicon.ico HTTP/1.1" 404 -
INFO:werkzeug:::ffff:192.168.26.12 - - [31/Dec/2024 02:30:39] "GET /static/coqui-log-green-TTS.png HTTP/1.1" 200 -
> Model input: 你好。
> Speaker Idx:
> Language Idx:
> Text splitted to sentences.
['你好。']
Building prefix dict from the default dictionary ...
DEBUG:jieba:Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
DEBUG:jieba:Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.612 seconds.
DEBUG:jieba:Loading model cost 0.612 seconds.
Prefix dict has been built successfully.
DEBUG:jieba:Prefix dict has been built successfully.
> Processing time: 1.3838930130004883
> Real-time factor: 1.3535681749760808
INFO:werkzeug:::ffff:192.168.26.12 - - [31/Dec/2024 02:30:51] "GET /api/tts?text=你好。&speaker_id=&style_wav={"0":%200.1}&language_id= HTTP/1.1" 200 -
> Model input: Coqui TTS 支持多种语言,包括跨语言克隆,例如英文到中文、中文到英文等,共计16种语言。
> Speaker Idx:
> Language Idx:
> Text splitted to sentences.
['Coqui TTS 支持多种语言,包括跨语言克隆,例如英文到中文、中文到英文等,共计16种语言。']
Coqui TTS dʒʏ1ʈʂʏ2 duo1dʒoŋ3y3iɛn2 , baʌ1kuo4 kua4 y3iɛn2 kø4loŋ2 , li4ʐu2 ɨŋ1wœn2 daʌ4 dʒoŋ1wœn2 dʒoŋ1wœn2 daʌ4 ɨŋ1wœn2 dɵŋ3 , goŋ4dʑi4 ʂʏ2lio4dʒoŋ3 y3iɛn2 。
 [!] Character 'C' not found in the vocabulary. Discarding it.
 [!] Character 'T' not found in the vocabulary. Discarding it.
 [!] Character 'S' not found in the vocabulary. Discarding it.
 [!] Character 'g' not found in the vocabulary. Discarding it.
> Processing time: 3.930570125579834
> Real-time factor: 0.4210506765887842
INFO:werkzeug:::ffff:192.168.26.12 - - [31/Dec/2024 02:37:24] "GET /api/tts?text=Coqui%20TTS%20支持多种语言,包括跨语言克隆,例如英文到中文、中文到英文等,共计16种语言。&speaker_id=&style_wav={"0":%200.1}&language_id= HTTP/1.1" 200
Results on pure CPU
Open http://192.168.44.171:5002 in a browser; in most cases the synthesized result comes back in about 5 seconds.
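The server also exposes the plain HTTP endpoint visible in the logs above (GET /api/tts), so synthesis can be scripted without the web page. A sketch of building that request URL (the empty speaker_id/style_wav/language_id parameters mirror what the demo page sends; the host is this deployment's address):

```javascript
// Sketch: build the GET /api/tts URL served by the Coqui demo server.
function coquiTtsUrl(text, base = 'http://192.168.44.171:5002') {
  const params = new URLSearchParams({
    text,
    speaker_id: '',
    style_wav: '',
    language_id: '',
  });
  return `${base}/api/tts?${params.toString()}`;
}

console.log(coquiTtsUrl('你好。'));
// Fetching this URL returns the synthesized audio (wav).
```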
Note:
- Remember to end the Chinese text with a Chinese full stop (。); otherwise the model appends a stretch of "ah" filler to pad the audio out to 12 seconds, which is presumably a bug.
- For longer text, say more than 50 characters, the final part is sometimes cut off, which also looks like a bug.
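Both quirks can be worked around on the client side: make sure the text ends with sentence-final punctuation, and split long input into sentence-aligned chunks that are synthesized one by one. A sketch (the 50-character limit reflects where I saw truncation, not any documented threshold):

```javascript
// Ensure the text ends with a sentence-final punctuation mark, so the
// model does not pad the audio with filler sounds.
function ensureFullStop(text) {
  const t = text.trim();
  return /[。!?]$/.test(t) ? t : t + '。';
}

// Split text into sentence-aligned chunks of at most maxLen characters,
// so long input is synthesized piece by piece instead of being truncated.
function splitForTts(text, maxLen = 50) {
  const sentences = ensureFullStop(text).split(/(?<=[。!?])/);
  const chunks = [];
  let current = '';
  for (const s of sentences) {
    if (current && current.length + s.length > maxLen) {
      chunks.push(current);
      current = '';
    }
    current += s;
  }
  if (current) chunks.push(current);
  return chunks;
}

console.log(splitForTts('第一句话。第二句话。', 6));
```

Each chunk can then be sent to the server as a separate request and the resulting audio concatenated.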
Troubleshooting a model download failure
- Error message: downloading the model file from GitHub timed out
root@b452b7513c7e:~# python3 TTS/server/server.py --model_name tts_models/zh-CN/baker/tacotron2-DDC-GST
> Downloading model to /root/.local/share/tts/tts_models--zh-CN--baker--tacotron2-DDC-GST
> Failed to download the model file to /root/.local/share/tts/tts_models--zh-CN--baker--tacotron2-DDC-GST
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 844, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='github.com', port=443): Max retries exceeded with url: /coqui-ai/TTS/releases/download/v0.6.1_models/tts_models--zh-CN--baker--tacotron2-DDC-GST.zip (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7faab643bb50>, 'Connection to github.com timed out. (connect timeout=None)'))
- Fix: download the model manually
The error message contains the model's download URL: https://github.com/coqui-ai/TTS/releases/download/v0.6.1_models/tts_models--zh-CN--baker--tacotron2-DDC-GST.zip; download it manually, unzip it, and copy it into the container's /root/.local/share/tts directory.
[root@tts opt]# unzip tts_models--zh-CN--baker--tacotron2-DDC-GST.zip
Archive:  tts_models--zh-CN--baker--tacotron2-DDC-GST.zip
   creating: tts_models--zh-CN--baker--tacotron2-DDC-GST/
 extracting: tts_models--zh-CN--baker--tacotron2-DDC-GST/model_file.pth
 extracting: tts_models--zh-CN--baker--tacotron2-DDC-GST/scale_stats.npy
 extracting: tts_models--zh-CN--baker--tacotron2-DDC-GST/config.json
[root@tts opt]# docker cp tts_models--zh-CN--baker--tacotron2-DDC-GST nostalgic_hawking:/root/.local/share/tts
Successfully copied 686MB to nostalgic_hawking:/root/.local/share/tts
Summary
Architecture is a matter of trade-offs.
Given the actual requirement (Chinese text-to-speech in a pure-CPU, offline environment, finishing within 5 seconds), here is how the three TTS options compare:
Option | Quality | Speed | Requirement met? |
---|---|---|---|
eSpeak | robotic/glitchy | milliseconds | fast enough, poor quality |
ChatTTS | best | ~50 seconds | great quality, too slow |
CoquiTTS | acceptable | ~5 seconds | basically meets the requirement |
In summary, CoquiTTS is the option that best fits the requirement: it finishes synthesis within 5 seconds on pure CPU, and the voice quality is acceptable.
Reference
- https://chattts.com/zh?__theme=dark#Demo
- https://github.com/jianchang512/ChatTTS-ui
- https://github.com/coqui-ai/TTS
- https://docs.coqui.ai/en/latest/tutorial_for_nervous_beginners.html
If you have any questions or any bugs are found, please feel free to contact me.
Your comments and suggestions are welcome!