当前位置: 首页 > news >正文

架构轻巧的kokoro 文本转语音模型

Kokoro是一个具有8200万个参数的开放权重TTS模型。尽管其架构轻巧,但它提供了与较大型号相当的质量,同时速度更快,更具成本效益。使用Apache许可的权重,Kokoro可以部署在从生产环境到个人项目的任何地方。

官网:hexgrad/kokoro: https://hf.co/hexgrad/Kokoro-82M

现在我们来实践下Kokoro

Linux下安装使用

安装库

pip install -q kokoro>=0.8.2 "misaki[zh]>=0.8.2" soundfile

一键执行安装使用 

为了简单,可以学习官网,直接在kaggle或colab的notebook里,输入下面语句,运行即可:

!pip install -q kokoro>=0.9.4 soundfile
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
import torch
pipeline = KPipeline(lang_code='a')
text = '''
[Kokoro](/kˈOkəɹO/) is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, [Kokoro](/kˈOkəɹO/) can be deployed anywhere from production environments to personal projects.
'''
generator = pipeline(text, voice='af_heart')
for i, (gs, ps, audio) in enumerate(generator):print(i, gs, ps)display(Audio(data=audio, rate=24000, autoplay=i==0))sf.write(f'{i}.wav', audio, 24000)

这些语句包括pip 安装kokoro 和soundfile这两个python包,使用apt 安装了espeak-ng这个软件包(在Ubuntu下) 。

将要转语音的文字赋值给text变量,然后就可以进行文本转换了。

英文效果还行,但是无法混用数字,如果里面有中文就不行。

多种语言

安装中文库

pip install misaki[zh]

在kaggle或colab的notebook里的例子: 

# 1️⃣ Install kokoro
!pip install -q kokoro>=0.9.4 soundfile
# 2️⃣ Install espeak, used for English OOD fallback and some non-English languages
!apt-get -qq -y install espeak-ng > /dev/null 2>&1# 3️⃣ Initalize a pipeline
from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
import torch
# 🇺🇸 'a' => American English, 🇬🇧 'b' => British English
# 🇪🇸 'e' => Spanish es
# 🇫🇷 'f' => French fr-fr
# 🇮🇳 'h' => Hindi hi
# 🇮🇹 'i' => Italian it
# 🇯🇵 'j' => Japanese: pip install misaki[ja]
# 🇧🇷 'p' => Brazilian Portuguese pt-br
# 🇨🇳 'z' => Mandarin Chinese: pip install misaki[zh]
pipeline = KPipeline(lang_code='a') # <= make sure lang_code matches voice, reference above.# This text is for demonstration purposes only, unseen during training
text = '''
The sky above the port was the color of television, tuned to a dead channel.
"It's not like I'm using," Case heard someone say, as he shouldered his way through the crowd around the door of the Chat. "It's like my body's developed this massive drug deficiency."
It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates; you could drink there for a week and never hear two words in Japanese.These were to have an enormous impact, not only because they were associated with Constantine, but also because, as in so many other areas, the decisions taken by Constantine (or in his name) were to have great significance for centuries to come. One of the main issues was the shape that Christian churches were to take, since there was not, apparently, a tradition of monumental church buildings when Constantine decided to help the Christian church build a series of truly spectacular structures. The main form that these churches took was that of the basilica, a multipurpose rectangular structure, based ultimately on the earlier Greek stoa, which could be found in most of the great cities of the empire. Christianity, unlike classical polytheism, needed a large interior space for the celebration of its religious services, and the basilica aptly filled that need. We naturally do not know the degree to which the emperor was involved in the design of new churches, but it is tempting to connect this with the secular basilica that Constantine completed in the Roman forum (the so-called Basilica of Maxentius) and the one he probably built in Trier, in connection with his residence in the city at a time when he was still caesar.[Kokoro](/kˈOkəɹO/) is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, [Kokoro](/kˈOkəɹO/) can be deployed anywhere from production environments to personal projects.
'''# 4️⃣ Generate, display, and save audio files in a loop.
generator = pipeline(text, voice='af_heart', # <= change voice herespeed=1, split_pattern=r'\n+'
)
# Alternatively, load voice tensor directly:
# voice_tensor = torch.load('path/to/voice.pt', weights_only=True)
# generator = pipeline(
#     text, voice=voice_tensor,
#     speed=1, split_pattern=r'\n+'
# )for i, (gs, ps, audio) in enumerate(generator):print(i)  # i => indexprint(gs) # gs => graphemes/textprint(ps) # ps => phonemesdisplay(Audio(data=audio, rate=24000, autoplay=i==0))sf.write(f'{i}.wav', audio, 24000) # save each audio file

windows下安装

windows下主要是需要安装espeak-ng ,去这里下载:

​https://github.com/espeak-ng/espeak-ng/releases​

下载espeak-ng安装软件,安装即可。

ONNX部署

安装依赖库

!pip install -U kokoro-onnx soundfile 'misaki[zh]'

下载模型文件

!wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.1/kokoro-v1.1-zh.onnx
!wget     https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.1/voices-v1.1-zh.bin
!wget     https://huggingface.co/hexgrad/Kokoro-82M-v1.1-zh/raw/main/config.json

文本转换

import soundfile as sf
from misaki import zhfrom kokoro_onnx import Kokoro# Misaki G2P with espeak-ng fallback
g2p = zh.ZHG2P(version="1.1")text = "千里之行,始于足下。欢迎使用Kokoro TTS,高效生成自然语音!"
voice = "zf_001"
kokoro = Kokoro("kokoro-v1.1-zh.onnx", "voices-v1.1-zh.bin", vocab_config="config.json")
phonemes, _ = g2p(text)
samples, sample_rate = kokoro.create(phonemes, voice=voice, speed=1.0, is_phonemes=True)
sf.write("audio.wav", samples, sample_rate)
print("Created audio.wav")
display(Audio(data="audio.wav", rate=24000, autoplay=i==0))

推理速度还是较快的。

但是中文里面如果有英文,它是不会读出来的。所以效果还是略差一点。

总结

中文效果较差,英文效果还凑活。

当然kokoro 只有82M大小,能有这个效果已经很不错了!

http://www.lqws.cn/news/538867.html

相关文章:

  • LeetCode 2302.统计得分小于K的子数组数目
  • Docker 入门教程(二):Docker 的基本原理
  • 大厂测开实习和小厂开发实习怎么选
  • python pandas数据清洗
  • NebulaGraph 图数据库介绍
  • 抖音图文带货和短视频带货有什么区别
  • Nginx配置文件介绍和基本使用
  • 面试150 文本左右对齐
  • 2-深度学习挖短线股-3-训练数据计算
  • mysql无法启动的数据库迁移
  • 【办公类-105-01】20250626 托小班报名表-条件格式-判断双胞胎EXCLE
  • Python 中 `for` 循环与 `while` 循环的实际应用区别:实例解析
  • 【NLP】使用 LangGraph 构建 RAG 的Research Multi-Agent
  • FFMpeg的AVFrame数据格式解析
  • C++(模板与容器)
  • 重定向攻击与防御
  • AI+时代已至|AI人才到底该如何培育?
  • AI编程工具深度对比:腾讯云代码助手CodeBuddy、Cursor与通义灵码
  • vscode ssh远程连接到Linux并实现免密码登录
  • 爬虫简单实操2——以贴吧为例爬取“某吧”前10页的网页代码
  • Spring Cloud Feign 整合 Sentinel 实现服务降级与熔断保护
  • [AI]从0到1通过神经网络训练模型
  • 每日算法刷题Day38 6.25:leetcode前缀和3道题,用时1h40min
  • 第七章:总结
  • 【RabbitMQ】多系统下的安装配置与编码使用(python)
  • Spring Task定时任务详解与实战应用
  • java中的anyMatch和allMatch方法
  • OSEK/VDX OS ISO17356-3,【1】规范概述
  • SpringBoot项目快速开发框架JeecgBoot——Web处理!
  • linux cp与mv那个更可靠