Background
I still remember that right after graduation, my first projects were audio applications: speech recognition, speech synthesis, audio editing, and so on (see https://github.com/heartsuit/BaiduASRAndTTS); back then we mainly called Baidu's ASR and TTS APIs. Later, when a project needed speech synthesis, the frontend NPM package speak-tts was enough on its own (it delegates to the client operating system's speech libraries).
As for TTS (Text to Speech), anyone wanting to self-host a TTS service today is spoiled for choice: ChatTTS, VITS, MeloTTS, CoquiTTS, and more, not to mention the countless open-source models on HuggingFace and ModelScope.
The requirement here: synthesize speech from Chinese text in a CPU-only environment with no internet access, and finish within 5 seconds.
Given that scenario, this post focuses on two aspects of Chinese synthesis on pure CPU: quality (a natural-sounding voice) and efficiency (low latency). Below, I deploy three TTS services offline (eSpeak, ChatTTS, and CoquiTTS) and compare their trade-offs.
Results First
Model | Quality | Latency |
---|---|---|
eSpeak | robotic/glitchy, poor | milliseconds to a few seconds |
ChatTTS | lifelike and fluent, high quality | ~50 seconds |
CoquiTTS | normal human voice, average quality | ~5 seconds |
Note: the results above were measured in a CPU-only environment, with Chinese input of fewer than 100 characters.
eSpeak
eSpeak is an open-source text-to-speech (TTS) synthesizer that supports many languages, English included. It uses formant synthesis, which produces intelligible (if distinctly synthetic) speech, and its small footprint and broad language coverage have made it popular in a wide range of scenarios.
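To get a feel for eSpeak's flags before wiring it into a service, it helps to see the command line it expects. A minimal sketch of assembling that command (the defaults below mirror eSpeak's own defaults; `out.wav` is just an example path):

```javascript
// Sketch: assemble an espeak command line.
// -v voice/language, -s speed in words per minute, -p pitch (0-99),
// -a amplitude (0-200), -w path of the wav file to write.
function espeakCommand(text, { language = 'zh', speed = 175, pitch = 50, volume = 100, outFile = 'out.wav' } = {}) {
  return `espeak -v ${language} -s ${speed} -p ${pitch} -a ${volume} -w "${outFile}" "${text}"`;
}

console.log(espeakCommand('你好。'));
```

The service below passes exactly this kind of string to child_process.exec.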
For the eSpeak approach I had Cursor write all the code, without a single hand-written line, and it ran successfully once deployed to the server. The Chinese output, however, sounds robotic and glitchy; the quality is poor.
- Directory structure
D:.
│ docker-compose.yml
│ Dockerfile
│ package.json
│ README.md
│
└─src
    │  index.js
    │  tts.js
    │
    └─routes
            tts.routes.js
The code layout is simple: a standard Node.js backend built on the classic Express web framework, so it needs little explanation.
tts.js
const { exec } = require('child_process');
const util = require('util');
const path = require('path');
const fs = require('fs').promises;

const execPromise = util.promisify(exec);

class TextToSpeech {
  constructor() {
    this.outputDir = process.env.OUTPUT_DIR || './output';
  }

  async convertToSpeech(text, options = {}) {
    const {
      language = 'zh',
      speed = 175,
      pitch = 50,
      volume = 100
    } = options;

    const fileName = `${Date.now()}-${Math.random().toString(36).substring(7)}`;
    const wavFile = path.join(this.outputDir, `${fileName}.wav`);
    const mp3File = path.join(this.outputDir, `${fileName}.mp3`);

    try {
      // 1. Generate a wav file with espeak
      await execPromise(`espeak -v ${language} -s ${speed} -p ${pitch} -a ${volume} -w "${wavFile}" "${text}"`);
      // 2. Convert to mp3
      await execPromise(`sox "${wavFile}" "${mp3File}"`);
      // 3. Read the MP3 file
      const audioBuffer = await fs.readFile(mp3File);
      // 4. Clean up temporary files
      await Promise.all([fs.unlink(wavFile), fs.unlink(mp3File)]);
      return audioBuffer;
    } catch (error) {
      // Clean up any leftover temporary files
      try {
        await Promise.all([
          fs.unlink(wavFile).catch(() => {}),
          fs.unlink(mp3File).catch(() => {})
        ]);
      } catch (e) {
        // Ignore cleanup errors
      }
      throw new Error(`TTS conversion failed: ${error.message}`);
    }
  }
}

module.exports = TextToSpeech;
tts.routes.js
const express = require('express');
const router = express.Router();
const TextToSpeech = require('../tts');

const tts = new TextToSpeech();

router.post('/convert', async (req, res) => {
  try {
    const { text, options } = req.body;
    if (!text) {
      return res.status(400).json({ error: 'Text is required' });
    }
    const audioBuffer = await tts.convertToSpeech(text, options);
    // Set the response headers
    res.set({
      'Content-Type': 'audio/mpeg',
      'Content-Disposition': `attachment; filename="speech-${Date.now()}.mp3"`
    });
    // Send the audio data
    res.send(audioBuffer);
  } catch (error) {
    console.error('TTS Error:', error);
    res.status(500).json({ error: error.message });
  }
});

// Health check endpoint
router.get('/health', (req, res) => {
  res.json({ status: 'ok' });
});

module.exports = router;
index.js
const express = require('express');
const cors = require('cors');
const morgan = require('morgan');
const ttsRoutes = require('./routes/tts.routes');

const app = express();
const port = process.env.PORT || 4000;

// Middleware
app.use(cors());
app.use(morgan('dev'));
app.use(express.json());

// Routes
app.use('/api/tts', ttsRoutes);

// Error handling
app.use((err, req, res, next) => {
  console.error(err.stack);
  res.status(500).json({ error: 'Something went wrong!' });
});

app.listen(port, () => {
  console.log(`Server running on port ${port}`);
});
package.json
{
  "name": "tts-service",
  "version": "1.0.0",
  "description": "Offline text-to-speech service",
  "main": "src/index.js",
  "dependencies": {
    "express": "^4.18.2",
    "cors": "^2.8.5",
    "morgan": "^1.10.0",
    "uuid": "^9.0.0"
  }
}
Docker deployment files
- Dockerfile
FROM node:18-slim

# Set the time zone
ENV TZ=Asia/Shanghai
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

# Install the required packages
RUN apt-get update && apt-get install -y \
    espeak \
    espeak-ng \
    sox \
    libsox-fmt-mp3 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy package.json and package-lock.json
COPY package*.json ./

# Install dependencies
RUN npm install --registry=https://registry.npmmirror.com/

# Copy the source code
COPY . .

# Create the output directory
RUN mkdir -p output && chmod 777 output

EXPOSE 4000
CMD ["node", "src/index.js"]
- docker-compose.yml
version: '3'
services:
  tts-service:
    build: .
    ports:
      - "4000:4000"
    volumes:
      - ./output:/app/output
    environment:
      - PORT=4000
      - OUTPUT_DIR=/app/output
README.md
Usage:
1. Build and run the service:
docker-compose up --build
2. API usage example, with curl:
curl -X POST http://192.168.44.171:4000/api/tts/convert \
  -H "Content-Type: application/json" \
  -d '{"text": "这是一个测试文本","options": {"language": "zh+f2","speed": 175,"pitch": 50,"volume": 100}}' \
  --output speech.mp3
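The same endpoint can be called from Node as well; a sketch of building the request that the /convert route expects (the host and port are the ones from the curl example, and buildConvertRequest is just an illustrative helper name):

```javascript
// Sketch: build the POST request for the /api/tts/convert route.
// Sending it with fetch returns the MP3 bytes in the response body.
function buildConvertRequest(text, options = {}) {
  return {
    url: 'http://192.168.44.171:4000/api/tts/convert',
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text, options }),
    },
  };
}

const { url, init } = buildConvertRequest('这是一个测试文本', { language: 'zh+f2' });
console.log(url, init.body);
// e.g. const mp3 = Buffer.from(await (await fetch(url, init)).arrayBuffer());
```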
ChatTTS
ChatTTS is a text-to-speech (TTS) model designed specifically for conversational scenarios, supporting both Chinese and English. Trained on a large corpus, it generates natural, fluent, and expressive speech, surpassing most open-source models in prosody control. It offers fine-grained control over features such as laughter, pauses, and interjections, and can generate speech for many languages and scenarios. It is especially well suited to dialogue tasks for large language model (LLM) assistants and to applications such as conversational audio and video intros. Its output holds up even against commercial-grade systems like Microsoft's Azure TTS.
ChatTTS offers an online tool for quick experimentation: https://chattts.com/zh?__theme=dark#Demo. In my tests, though, an ordinary single sentence typically takes about 20 seconds, and that is with a GPU. Below I use ChatTTS-ui, an open-source UI project built on ChatTTS: it packages the official ChatTTS into a container and adds a web page for testing along with an open API.
Official online demo
Local deployment
# Download the ChatTTS-ui source code
cd ChatTTS-ui-main
# Run the CPU version
docker-compose -f docker-compose.cpu.yaml up -d
Results on pure CPU
Open http://192.168.44.171:9966 in a browser.
CoquiTTS
CoquiTTS is an open-source text-to-speech (TTS) system that aims to make speech synthesis more accessible to researchers, developers, and creators. It relies on transfer learning to carry knowledge across the training datasets of different languages, which markedly reduces the amount of data required. CoquiTTS supports multiple languages, including cross-language voice cloning (for example English to Chinese and Chinese to English), 16 languages in all.
Beyond that, CoquiTTS provides an advanced multilingual TTS library covering more than 1,100 languages and ships deep-learning models such as Tacotron2, VITS, and YourTTS. It is not just for generating high-quality speech: it also includes tools for training new models and fine-tuning existing ones, supports multi-speaker TTS, and offers dataset analysis utilities.
Its efficiency and versatility have earned CoquiTTS wide attention: with 36.4k stars on GitHub, it is a front-runner of the new generation of open-source speech technology.
Installation
Deploy with the official container image, using the Chinese model tts_models/zh-CN/baker/tacotron2-DDC-GST.
[root@tts ~]# docker run --rm -it -p 5002:5002 --entrypoint /bin/bash ghcr.io/coqui-ai/tts-cpu
Unable to find image 'ghcr.io/coqui-ai/tts-cpu:latest' locally
latest: Pulling from coqui-ai/tts-cpu
025c56f98b67: Pull complete
778656c04542: Pull complete
85485c9f43dd: Pull complete
23b3c91f0de2: Pull complete
fd19b936aab8: Pull complete
30b21c9aef2b: Pull complete
cc12d1e5322b: Pull complete
b91e9a336532: Pull complete
d679a5e35c77: Pull complete
0d84a5b8bca3: Pull complete
4f4fb700ef54: Pull complete
d170b2e70a00: Pull complete
c612db99f0b2: Pull complete
Digest: sha256:a2f6659245358c38efb1bb44b39f7b7b3459e03e9ed5687c447681cb82c35de3
Status: Downloaded newer image for ghcr.io/coqui-ai/tts-cpu:latest
root@b452b7513c7e:~# python3 TTS/server/server.py --list_models
Name format: type/language/dataset/model
1: tts_models/multilingual/multi-dataset/xtts_v2
2: tts_models/multilingual/multi-dataset/xtts_v1.1
3: tts_models/multilingual/multi-dataset/your_tts
4: tts_models/multilingual/multi-dataset/bark
5: tts_models/bg/cv/vits
6: tts_models/cs/cv/vits
7: tts_models/da/cv/vits
8: tts_models/et/cv/vits
9: tts_models/ga/cv/vits
10: tts_models/en/ek1/tacotron2
11: tts_models/en/ljspeech/tacotron2-DDC
12: tts_models/en/ljspeech/tacotron2-DDC_ph
13: tts_models/en/ljspeech/glow-tts
14: tts_models/en/ljspeech/speedy-speech
15: tts_models/en/ljspeech/tacotron2-DCA
16: tts_models/en/ljspeech/vits
17: tts_models/en/ljspeech/vits--neon
18: tts_models/en/ljspeech/fast_pitch
19: tts_models/en/ljspeech/overflow
20: tts_models/en/ljspeech/neural_hmm
21: tts_models/en/vctk/vits
22: tts_models/en/vctk/fast_pitch
23: tts_models/en/sam/tacotron-DDC
24: tts_models/en/blizzard2013/capacitron-t2-c50
25: tts_models/en/blizzard2013/capacitron-t2-c150_v2
26: tts_models/en/multi-dataset/tortoise-v2
27: tts_models/en/jenny/jenny
28: tts_models/es/mai/tacotron2-DDC
29: tts_models/es/css10/vits
30: tts_models/fr/mai/tacotron2-DDC
31: tts_models/fr/css10/vits
32: tts_models/uk/mai/glow-tts
33: tts_models/uk/mai/vits
34: tts_models/zh-CN/baker/tacotron2-DDC-GST
35: tts_models/nl/mai/tacotron2-DDC
36: tts_models/nl/css10/vits
37: tts_models/de/thorsten/tacotron2-DCA
38: tts_models/de/thorsten/vits
39: tts_models/de/thorsten/tacotron2-DDC
40: tts_models/de/css10/vits-neon
41: tts_models/ja/kokoro/tacotron2-DDC
42: tts_models/tr/common-voice/glow-tts
43: tts_models/it/mai_female/glow-tts
44: tts_models/it/mai_female/vits
45: tts_models/it/mai_male/glow-tts
46: tts_models/it/mai_male/vits
47: tts_models/ewe/openbible/vits
48: tts_models/hau/openbible/vits
49: tts_models/lin/openbible/vits
50: tts_models/tw_akuapem/openbible/vits
51: tts_models/tw_asante/openbible/vits
52: tts_models/yor/openbible/vits
53: tts_models/hu/css10/vits
54: tts_models/el/cv/vits
55: tts_models/fi/css10/vits
56: tts_models/hr/cv/vits
57: tts_models/lt/cv/vits
58: tts_models/lv/cv/vits
59: tts_models/mt/cv/vits
60: tts_models/pl/mai_female/vits
61: tts_models/pt/cv/vits
62: tts_models/ro/cv/vits
63: tts_models/sk/cv/vits
64: tts_models/sl/cv/vits
65: tts_models/sv/cv/vits
66: tts_models/ca/custom/vits
67: tts_models/fa/custom/glow-tts
68: tts_models/bn/custom/vits-male
69: tts_models/bn/custom/vits-female
70: tts_models/be/common-voice/glow-tts
Name format: type/language/dataset/model
1: vocoder_models/universal/libri-tts/wavegrad
2: vocoder_models/universal/libri-tts/fullband-melgan
3: vocoder_models/en/ek1/wavegrad
4: vocoder_models/en/ljspeech/multiband-melgan
5: vocoder_models/en/ljspeech/hifigan_v2
6: vocoder_models/en/ljspeech/univnet
7: vocoder_models/en/blizzard2013/hifigan_v2
8: vocoder_models/en/vctk/hifigan_v2
9: vocoder_models/en/sam/hifigan_v2
10: vocoder_models/nl/mai/parallel-wavegan
11: vocoder_models/de/thorsten/wavegrad
12: vocoder_models/de/thorsten/fullband-melgan
13: vocoder_models/de/thorsten/hifigan_v1
14: vocoder_models/ja/kokoro/hifigan_v1
15: vocoder_models/uk/mai/multiband-melgan
16: vocoder_models/tr/common-voice/hifigan
17: vocoder_models/be/common-voice/hifigan
Name format: type/language/dataset/model
1: voice_conversion_models/multilingual/vctk/freevc24
root@b452b7513c7e:~# python3 TTS/server/server.py --model_name tts_models/zh-CN/baker/tacotron2-DDC-GST
> tts_models/zh-CN/baker/tacotron2-DDC-GST is already downloaded.
> Using model: tacotron2
> Setting up Audio Processor...
| > sample_rate:22050
| > resample:False
| > num_mels:80
| > log_func:np.log10
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:0
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:50.0
| > mel_fmax:7600.0
| > pitch_fmin:0.0
| > pitch_fmax:640.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > do_amp_to_db_linear:True
| > do_amp_to_db_mel:True
| > do_rms_norm:False
| > db_level:None
| > stats_path:/root/.local/share/tts/tts_models--zh-CN--baker--tacotron2-DDC-GST/scale_stats.npy
| > base:10
| > hop_length:256
| > win_length:1024
> Model's reduction rate `r` is set to: 2
* Serving Flask app 'server'
* Debug mode: off
INFO:werkzeug:WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on all addresses (::)
* Running on http://[::1]:5002
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug:::ffff:192.168.26.12 - - [31/Dec/2024 02:29:57] "GET / HTTP/1.1" 200 -
INFO:werkzeug:::ffff:192.168.26.12 - - [31/Dec/2024 02:30:39] "GET /favicon.ico HTTP/1.1" 404 -
INFO:werkzeug:::ffff:192.168.26.12 - - [31/Dec/2024 02:30:39] "GET /static/coqui-log-green-TTS.png HTTP/1.1" 200 -
> Model input: 你好。
> Speaker Idx:
> Language Idx:
> Text splitted to sentences.
['你好。']
Building prefix dict from the default dictionary ...
DEBUG:jieba:Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
DEBUG:jieba:Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.612 seconds.
DEBUG:jieba:Loading model cost 0.612 seconds.
Prefix dict has been built successfully.
DEBUG:jieba:Prefix dict has been built successfully.
> Processing time: 1.3838930130004883
> Real-time factor: 1.3535681749760808
INFO:werkzeug:::ffff:192.168.26.12 - - [31/Dec/2024 02:30:51] "GET /api/tts?text=你好。&speaker_id=&style_wav={"0":%200.1}&language_id= HTTP/1.1" 200 -
> Model input: Coqui TTS 支持多种语言,包括跨语言克隆,例如英文到中文、中文到英文等,共计16种语言。
> Speaker Idx:
> Language Idx:
> Text splitted to sentences.
['Coqui TTS 支持多种语言,包括跨语言克隆,例如英文到中文、中文到英文等,共计16种语言。']
Coqui TTS dʒʏ1ʈʂʏ2 duo1dʒoŋ3y3iɛn2 , baʌ1kuo4 kua4 y3iɛn2 kø4loŋ2 , li4ʐu2 ɨŋ1wœn2 daʌ4 dʒoŋ1wœn2 dʒoŋ1wœn2 daʌ4 ɨŋ1wœn2 dɵŋ3 , goŋ4dʑi4 ʂʏ2lio4dʒoŋ3 y3iɛn2 。
 [!] Character 'C' not found in the vocabulary. Discarding it.
 [!] Character 'T' not found in the vocabulary. Discarding it.
 [!] Character 'S' not found in the vocabulary. Discarding it.
 [!] Character 'g' not found in the vocabulary. Discarding it.
> Processing time: 3.930570125579834
> Real-time factor: 0.4210506765887842
INFO:werkzeug:::ffff:192.168.26.12 - - [31/Dec/2024 02:37:24] "GET /api/tts?text=Coqui%20TTS%20支持多种语言,包括跨语言克隆,例如英文到中文、中文到英文等,共计16种语言。&speaker_id=&style_wav={"0":%200.1}&language_id= HTTP/1.1" 200
Results on pure CPU
Open http://192.168.44.171:5002 in a browser; in most cases the synthesized result comes back in about 5 seconds.
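The server also exposes the plain HTTP endpoint visible in the logs above (GET /api/tts), so synthesis can be scripted without the web page. A sketch of building that request URL (the empty speaker_id/style_wav/language_id parameters mirror what the demo page sends; the host is this deployment's address):

```javascript
// Sketch: build the GET /api/tts URL served by the Coqui demo server.
function coquiTtsUrl(text, base = 'http://192.168.44.171:5002') {
  const params = new URLSearchParams({
    text,
    speaker_id: '',
    style_wav: '',
    language_id: '',
  });
  return `${base}/api/tts?${params.toString()}`;
}

console.log(coquiTtsUrl('你好。'));
// Fetching this URL returns the synthesized audio (wav).
```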
Note:
- Remember to end the Chinese text with a Chinese full stop (。); otherwise the model appends a stretch of "ah" filler to pad the audio out to 12 seconds, which is presumably a bug.
- For longer text, say more than 50 characters, the final part is sometimes cut off, which also looks like a bug.
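Both quirks can be worked around on the client side: make sure the text ends with sentence-final punctuation, and split long input into sentence-aligned chunks that are synthesized one by one. A sketch (the 50-character limit reflects where I saw truncation, not any documented threshold):

```javascript
// Ensure the text ends with a sentence-final punctuation mark, so the
// model does not pad the audio with filler sounds.
function ensureFullStop(text) {
  const t = text.trim();
  return /[。!?]$/.test(t) ? t : t + '。';
}

// Split text into sentence-aligned chunks of at most maxLen characters,
// so long input is synthesized piece by piece instead of being truncated.
function splitForTts(text, maxLen = 50) {
  const sentences = ensureFullStop(text).split(/(?<=[。!?])/);
  const chunks = [];
  let current = '';
  for (const s of sentences) {
    if (current && current.length + s.length > maxLen) {
      chunks.push(current);
      current = '';
    }
    current += s;
  }
  if (current) chunks.push(current);
  return chunks;
}

console.log(splitForTts('第一句话。第二句话。', 6));
```

Each chunk can then be sent to the server as a separate request and the resulting audio concatenated.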
Troubleshooting a model download failure
- Error message: downloading the model file from GitHub timed out
root@b452b7513c7e:~# python3 TTS/server/server.py --model_name tts_models/zh-CN/baker/tacotron2-DDC-GST
> Downloading model to /root/.local/share/tts/tts_models--zh-CN--baker--tacotron2-DDC-GST
> Failed to download the model file to /root/.local/share/tts/tts_models--zh-CN--baker--tacotron2-DDC-GST
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 844, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='github.com', port=443): Max retries exceeded with url: /coqui-ai/TTS/releases/download/v0.6.1_models/tts_models--zh-CN--baker--tacotron2-DDC-GST.zip (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7faab643bb50>, 'Connection to github.com timed out. (connect timeout=None)'))
- Fix: download the model manually
The error message contains the model's download URL: https://github.com/coqui-ai/TTS/releases/download/v0.6.1_models/tts_models--zh-CN--baker--tacotron2-DDC-GST.zip; download it manually, unzip it, and copy it into the container's /root/.local/share/tts directory.
[root@tts opt]# unzip tts_models--zh-CN--baker--tacotron2-DDC-GST.zip
Archive:  tts_models--zh-CN--baker--tacotron2-DDC-GST.zip
   creating: tts_models--zh-CN--baker--tacotron2-DDC-GST/
 extracting: tts_models--zh-CN--baker--tacotron2-DDC-GST/model_file.pth
 extracting: tts_models--zh-CN--baker--tacotron2-DDC-GST/scale_stats.npy
 extracting: tts_models--zh-CN--baker--tacotron2-DDC-GST/config.json
[root@tts opt]# docker cp tts_models--zh-CN--baker--tacotron2-DDC-GST nostalgic_hawking:/root/.local/share/tts
Successfully copied 686MB to nostalgic_hawking:/root/.local/share/tts
Summary
Architecture is a matter of trade-offs.
Given the actual requirement (Chinese text-to-speech in a pure-CPU, offline environment, finishing within 5 seconds), here is how the three TTS options compare:
Option | Quality | Speed | Requirement met? |
---|---|---|---|
eSpeak | robotic/glitchy | milliseconds | fast enough, poor quality |
ChatTTS | best | ~50 seconds | great quality, too slow |
CoquiTTS | acceptable | ~5 seconds | basically meets the requirement |
In summary, CoquiTTS is the option that best fits the requirement: it finishes synthesis within 5 seconds on pure CPU, and the voice quality is acceptable.
Reference
- https://chattts.com/zh?__theme=dark#Demo
- https://github.com/jianchang512/ChatTTS-ui
- https://github.com/coqui-ai/TTS
- https://docs.coqui.ai/en/latest/tutorial_for_nervous_beginners.html
If you have any questions or any bugs are found, please feel free to contact me.
Your comments and suggestions are welcome!