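I have recently been evaluating open-source speech recognition models, starting with Baidu's PaddlePaddle.
Before testing, it helps to understand the basics of audio parsing, hence this preliminary article.
The main audio parsing libraries I have come across are: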
- soundfile
- ffmpy
- librosa
1 librosa
Installation:
!pip install librosa -i https://mirror.baidu.com/pypi/simple
!pip install soundfile -i https://mirror.baidu.com/pypi/simple
Reference documentation: librosa
1.1 Loading audio
Documentation: https://librosa.org/doc/latest/core.html#audio-loading
signal, sr = librosa.load(path, sr=None)
The full signature of load is:
librosa.load(path, *, sr=22050, mono=True, offset=0.0, duration=None, dtype=<class 'numpy.float32'>, res_type='kaiser_best')
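With sr=None the file's original sample rate is kept. Any other value makes librosa resample the audio, which takes noticeably longer.
load can read both .wav and .mp3 files.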
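To get a feel for what resampling does to the data, note that the number of samples scales with the ratio of the two rates. The sketch below is plain arithmetic, not librosa's implementation; `resampled_length` is a hypothetical helper, and rounding up is an assumption about how the output length is chosen.

```python
def resampled_length(n_samples: int, orig_sr: int, target_sr: int) -> int:
    """Estimate how many samples a signal has after resampling (hypothetical helper)."""
    if orig_sr <= 0 or target_sr <= 0:
        raise ValueError("sample rates must be positive")
    # Assumption: the length scales by target_sr / orig_sr, rounded up.
    # Integer ceiling division avoids floating-point rounding surprises.
    return -((-n_samples * target_sr) // orig_sr)


# A 10-second clip recorded at 44100 Hz, resampled to 16000 Hz
n_in = 10 * 44100
n_out = resampled_length(n_in, orig_sr=44100, target_sr=16000)
print(n_in, n_out)
```

The duration stays the same (10 seconds); only the number of samples used to represent it changes.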
1.2 Writing audio
Several other posts online, “python音频采样率转换” and “python 音频文件采样率转换”, fail with an error when exporting the audio file. Their code is reproduced below.
Snippet 1:
def resample_rate(path,new_sample_rate = 16000):
    signal, sr = librosa.load(path, sr=None)
    wavfile = path.split('/')[-1]
    wavfile = wavfile.split('.')[0]
    file_name = wavfile + '_new.wav'
    new_signal = librosa.resample(signal, sr, new_sample_rate)
    librosa.output.write_wav(file_name, new_signal , new_sample_rate)
Snippet 2:
import librosa
import os
noise_name="/media/dfy/fc0b6513-c379-4548-b391-876575f1493f/home/dfy/PycharmProjects/noise_data/"
noise_name_list=os.listdir(noise_name)
for one_name in noise_name_list:
    data=librosa.load(noise_name+one_name,16000)
    librosa.output.write_wav(noise_name+one_name,data[0],16000,norm=False)
if __name__ == '__main__':
    pass
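Both snippets export with librosa.output, which recent versions of librosa have removed, so they fail with:
AttributeError: module librosa has no attribute output
Reference article: “No module named numba.decorators错误解决”
The output API was removed in version 0.8.0, so either downgrade librosa (for example to 0.7.2) or write the file another way.
The official librosa documentation recommends writing audio with the PySoundFile library instead.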
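1.3 Loading with librosa + writing with PySoundFile
If you see the error:
Input audio file has sample rate [44100], but decoder expects [16000]
the file's sample rate does not match what the decoder expects, and the audio needs to be resampled.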
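One way to catch this mismatch before handing a file to the decoder is to read the sample rate from the file header first. The following is a minimal sketch using only the standard-library wave module, which handles plain PCM WAV files; `wav_sample_rate` and `needs_resampling` are hypothetical helper names, and the demo file is generated on the fly so the example runs on its own.

```python
import os
import tempfile
import wave


def wav_sample_rate(path: str) -> int:
    """Read the sample rate from a PCM WAV header without decoding the audio."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate()


def needs_resampling(path: str, expected_sr: int = 16000) -> bool:
    """Return True when the file's rate differs from what the decoder expects."""
    return wav_sample_rate(path) != expected_sr


# Demo: write 0.1 s of 16-bit mono silence at 44100 Hz, then check it
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_44100.wav")
    with wave.open(demo_path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)        # 2 bytes per sample = 16-bit PCM
        wf.setframerate(44100)
        wf.writeframes(b"\x00\x00" * 4410)
    detected_sr = wav_sample_rate(demo_path)
    mismatch = needs_resampling(demo_path, expected_sr=16000)

print(detected_sr, mismatch)
```

Files for which the check returns True are the ones that need to go through a resampling step.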
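Combining the two libraries (librosa for loading, PySoundFile for writing), I lightly adapted the code from “python音频采样率转换” and “python 音频文件采样率转换” into the following function for changing an audio file's sample rate: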
import os

import librosa
import soundfile as sf


def resample_rate(path, new_sample_rate=16000):
    # sr=None keeps the file's original sample rate
    signal, sr = librosa.load(path, sr=None)
    # Build the output name from the input name, e.g. xxx.wav -> xxx_new.wav
    base_name = os.path.splitext(os.path.basename(path))[0]
    file_name = base_name + '_new.wav'
    # Recent librosa versions require orig_sr/target_sr as keyword arguments
    new_signal = librosa.resample(signal, orig_sr=sr, target_sr=new_sample_rate)
    # librosa.output.write_wav was removed, so write with soundfile instead
    sf.write(file_name, new_signal, new_sample_rate, subtype='PCM_24')
    print(f'{file_name} has been written.')


wav_file = 'video/xxx.wav'  # placeholder path; point this at a real file
resample_rate(wav_file, new_sample_rate=16000)
This writes a new audio file with a sample rate of 16000 Hz.
1.4 Loading into librosa from other libraries' objects
Reference: https://librosa.org/doc/latest/generated/librosa.load.html#librosa.load
Option 1:
# Load using an already open SoundFile object
import soundfile
sfo = soundfile.SoundFile(librosa.ex('brahms'))
y, sr = librosa.load(sfo)
Option 2:
# Load using an already open audioread object
import audioread.ffdec # Use ffmpeg decoder
aro = audioread.ffdec.FFmpegAudioFile(librosa.ex('brahms'))
y, sr = librosa.load(aro)
2 PySoundFile
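python-soundfile is an audio library based on libsndfile, CFFI, and NumPy.
Sound files can be read and written directly with the functions read() and write(). To read a sound file block by block, use blocks(). Alternatively, a sound file can be opened as a SoundFile object.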
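Block-wise reading keeps memory use bounded for long recordings. To illustrate the idea without depending on soundfile, the sketch below uses the standard-library wave module instead of blocks(); `iter_wav_blocks` is a hypothetical helper, it yields raw PCM bytes rather than NumPy arrays, and the demo file is generated on the fly.

```python
import os
import tempfile
import wave


def iter_wav_blocks(path, block_frames=1024):
    """Yield raw PCM byte blocks of at most block_frames frames each."""
    with wave.open(path, "rb") as wf:
        while True:
            block = wf.readframes(block_frames)
            if not block:
                break  # an empty read means the end of the file
            yield block


# Demo: 2500 frames of 16-bit mono silence, read back in blocks of 1024 frames
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.wav")
    with wave.open(demo_path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)        # 2 bytes per frame for 16-bit mono
        wf.setframerate(16000)
        wf.writeframes(b"\x00\x00" * 2500)
    # Convert each block's byte length back to a frame count
    block_sizes = [len(b) // 2 for b in iter_wav_blocks(demo_path, block_frames=1024)]

print(block_sizes)
```

The last block is shorter than the others because the file length is not a multiple of the block size; code that consumes blocks should not assume a fixed length.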
Official PySoundFile documentation: readthedocs
Installation:
!pip install soundfile -i https://mirror.baidu.com/pypi/simple
2.1 Reading audio
Read files from zip-compressed archives:
import zipfile as zf
import soundfile as sf
import io
with zf.ZipFile('test.zip') as myzip:
    with myzip.open('stereo_file.wav') as myfile:
        tmp = io.BytesIO(myfile.read())
        data, samplerate = sf.read(tmp)
Download and read from URL:
import soundfile as sf
import io
from urllib.request import urlopen  # Python 3 standard library; six is not needed
url = "https://raw.githubusercontent.com/librosa/librosa/master/tests/data/test1_44100.wav"
data, samplerate = sf.read(io.BytesIO(urlopen(url).read()))
2.2 Writing audio
Write audio to WAV, FLAC, and OGG:
import numpy as np
import soundfile as sf
samplerate = 44100
# 10 seconds of stereo noise in the range [-1, 1)
data = np.random.uniform(-1, 1, size=(samplerate * 10, 2))
# Write out audio as 24bit PCM WAV
sf.write('stereo_file.wav', data, samplerate, subtype='PCM_24')
# Write out audio as 24bit Flac
sf.write('stereo_file.flac', data, samplerate, format='flac', subtype='PCM_24')
# Write out audio as 16bit OGG
sf.write('stereo_file.ogg', data, samplerate, format='ogg', subtype='vorbis')
3 ffmpy
Reference article: “Python 批量转换视频音频采样率(附代码) | Python工具”
Installation:
pip install ffmpy -i https://pypi.douban.com/simple
See the original article for the full code; only one excerpt is shown here:
import os

from ffmpy import FFmpeg


def transfor(video_path: str, tmp_dir: str, result_dir: str):
    file_name = os.path.basename(video_path)
    base_name = file_name.split('.')[0]
    file_ext = file_name.split('.')[-1]
    ext = 'wav'
    # Step 1: extract the audio track as mono 16 kHz WAV
    audio_path = os.path.join(tmp_dir, '{}.{}'.format(base_name, ext))
    print('File: {}, extracting audio'.format(audio_path))
    ff = FFmpeg(
        inputs={video_path: None},
        outputs={audio_path: '-f {} -vn -ac 1 -ar 16000 -y'.format('wav')})
    print(ff.cmd)
    ff.run()
    if os.path.exists(audio_path) is False:
        return None
    # Step 2: write a copy of the video with its audio track removed
    video_tmp_path = os.path.join(
        tmp_dir, '{}_1.{}'.format(base_name, file_ext))
    ff_video = FFmpeg(inputs={video_path: None},
                      outputs={video_tmp_path: '-an'})
    print(ff_video.cmd)
    ff_video.run()
    # Step 3: merge the silent video with the resampled audio
    result_video_path = os.path.join(result_dir, file_name)
    ff_fuse = FFmpeg(
        inputs={video_tmp_path: None, audio_path: None},
        outputs={result_video_path: '-map 0:v -map 1:a -c:v copy -c:a aac -shortest'})
    print(ff_fuse.cmd)
    ff_fuse.run()
    return result_video_path
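The audio-extraction step boils down to a single ffmpeg command line. As a rough sketch of what that command looks like, the helper below builds the equivalent argument list by hand; `build_extract_audio_cmd` is a hypothetical name, the file names are placeholders, and nothing is executed here.

```python
import shlex


def build_extract_audio_cmd(video_path: str, audio_path: str,
                            sample_rate: int = 16000, channels: int = 1) -> list:
    """Build an ffmpeg argument list that extracts audio as WAV (hypothetical helper)."""
    # -f wav: WAV container, -vn: drop the video stream, -ac: channel count,
    # -ar: audio sample rate in Hz, -y: overwrite the output without asking
    output_options = f"-f wav -vn -ac {channels} -ar {sample_rate} -y"
    return ["ffmpeg", "-i", video_path, *shlex.split(output_options), audio_path]


cmd = build_extract_audio_cmd("input.mp4", "output.wav")
print(" ".join(shlex.quote(part) for part in cmd))
```

If ffmpeg is installed and on the PATH, a list like this could be passed to subprocess.run(cmd, check=True). Passing a list rather than a single string avoids shell quoting problems with file names that contain spaces.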
4 AudioSegment
Reference article:
“Python | 语音处理 | 用 librosa / AudioSegment / soundfile 读取音频文件的对比” (a comparison of reading audio files with librosa, AudioSegment, and soundfile)
import time

import numpy as np
from pydub import AudioSegment  # third-party library; install pydub before first use

audio_path = './data/example.mp3'
t = time.time()
song = AudioSegment.from_file(audio_path, format='mp3')
# print(len(song))          # duration in milliseconds
# print(song.frame_rate)    # sample rate in Hz
# print(song.sample_width)  # sample width in bytes
# print(song.channels)      # channel count; MP3s are commonly stereo, and more channels mean larger files
wav = np.array(song.get_array_of_samples())
sr = song.frame_rate
print(f"sr={sr}, len={len(wav)}, elapsed: {time.time()-t}")
print(f"(min, max, mean) = ({wav.min()}, {wav.max()}, {wav.mean()})")
wav
Output:
sr=16000, len=64320, elapsed: 0.04667925834655762
(min, max, mean) = (-872, 740, -0.6079446517412935)
array([ 1, -1, -2, ..., -1, 1, -2], dtype=int16)
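Note that the samples here are 16-bit integers, whereas librosa.load returns floating-point values. To compare the two, the integers can be scaled by the 16-bit full-scale value. The following is a minimal sketch assuming a sample width of 2 bytes; `int16_to_float` is a hypothetical helper and the sample values are made up for illustration.

```python
from array import array


def int16_to_float(samples):
    """Scale 16-bit integer samples into floats in the range [-1.0, 1.0)."""
    # Assumption: samples are signed 16-bit, so full scale is 2**15 = 32768
    return [s / 32768.0 for s in samples]


# Made-up 16-bit samples, including the extreme negative value
pcm = array("h", [1, -1, -2, 16384, -32768])
floats = int16_to_float(pcm)
print(floats)
```

With a NumPy array such as wav above, the same scaling would be wav.astype(np.float32) / 32768. For other sample widths the divisor changes accordingly.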
5 paddleaudio
Installation:
! pip install paddleaudio -i https://mirror.baidu.com/pypi/simple
This is an official Paddle wrapper library; its basic audio operations appear to be built on librosa.
For details, see:
https://paddleaudio-doc.readthedocs.io/en/latest/index.html
import paddleaudio
audio_file = 'XXX.wav'
paddleaudio.load(audio_file, sr=None, mono=True, normal=False)
This returns:
(array([-3.9100647e-04, -3.0159950e-05, 1.1110306e-04, ...,
1.4603138e-04, 2.5625229e-03, -7.6780319e-03], dtype=float32),
16000)
That is, the audio samples followed by the sample rate.