webRTC: How to apply webRTC's VAD on audio through samples obtained from WAV file

Question

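Currently, I am parsing WAV files and storing the samples in std::vector<int16_t> sample. Now I want to apply VAD (Voice Activity Detection) to this data to find the "regions" of voice, and more specifically the start and end of words.

The parsed WAV files are 16 kHz, 16-bit PCM, mono. My code is in C++.

I have searched a lot but could not find proper documentation for webRTC's VAD functions.

From what I have found, the function I need to use is WebRtcVad_Process(). Its prototype is: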

int WebRtcVad_Process(VadInst* handle, int fs, const int16_t* audio_frame,
                      size_t frame_length)

According to what I found here:

Each frame of audio that you send to the VAD must be 10, 20 or 30 milliseconds long. Here's an outline of an example that assumes audio_frame is 10 ms (320 bytes) of audio at 16000 Hz:

int is_voiced = WebRtcVad_Process(vad, 16000, audio_frame, 160);

That makes sense:

1 sample = 2B = 16 bits  
SampleRate = 16000 sample/sec = 16 samples/ms  
For 10 ms, no of samples    =   160
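
The same arithmetic can be written as a small helper, which makes it easy to try the 20 ms and 30 ms frame lengths as well. This is only a sketch: `samplesPerFrame` and `bytesPerFrame` are hypothetical names, not part of the WebRTC API, and the code assumes 16-bit mono PCM.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of samples in one VAD frame: sample rate (Hz) * duration (ms) / 1000.
// The quoted text says a frame must be 10, 20 or 30 ms long.
std::size_t samplesPerFrame(int sampleRateHz, int frameMs) {
    assert(sampleRateHz > 0);
    assert(frameMs == 10 || frameMs == 20 || frameMs == 30);
    return static_cast<std::size_t>(sampleRateHz) *
           static_cast<std::size_t>(frameMs) / 1000;
}

// Size of that frame in bytes, assuming 16-bit (2-byte) mono samples.
std::size_t bytesPerFrame(int sampleRateHz, int frameMs) {
    return samplesPerFrame(sampleRateHz, frameMs) * sizeof(int16_t);
}
```

For 16000 Hz and 10 ms this gives 160 samples and 320 bytes, matching the figures above.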

So, based on this, I implemented this:

const int16_t * temp = sample.data();
for(int i = 0, ms = 0; i < sample.size(); i += 160, ms++)
{
    int isActive = WebRtcVad_Process(vad, 16000, temp, 160); //10 ms window
    std::cout<<ms<<" ms : "<<isActive<<std::endl;
    temp = temp + 160; // processed 160 samples
}

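Now, I am not sure whether this is correct. I am also unsure whether it is giving me the right output.

So,

Is it possible to use the samples parsed directly from the WAV files, or do they need some processing?
Am I looking at the right function for the job?
How do I use the function to properly perform VAD on the audio stream?
Is it possible to distinguish between the spoken words?
What is the best way to check whether the output I am getting is correct?
If not, what is the best way to do this task?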

Answer 1

I'll start by saying that no, I don't think you will be able to segment an utterance into individual words using VAD. From the article on speech segmentation in Wikipedia:

One might expect that the inter-word spaces used by many written languages like English or Spanish would correspond to pauses in their spoken version, but that is true only in very slow speech, when the speaker deliberately inserts those pauses. In normal speech, one typically finds many consecutive words being said with no pauses between them, and often the final sounds of one word blend smoothly or fuse with the initial sounds of the next word.

That said, I'll try to answer your other questions.

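Before running the VAD, you need to decode the WAV file, which may be compressed, into raw PCM audio data. See e.g. Reading and processing WAV file data in C/C++. Alternatively, you could use something like sox to convert the WAV file to raw audio before running your code. This command converts a WAV file in any format to the 16 kHz, 16-bit, mono, signed PCM that the WebRTC VAD expects. Note that `-L` requests little-endian samples, which is what an `int16_t` buffer holds on x86 and most ARM hosts; use `-B` instead on a big-endian machine:
```
sox my_file.wav -r 16000 -b 16 -c 1 -e signed-integer -L my_file.raw
```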
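
Once you have a raw file, loading it into the `std::vector<int16_t>` from the question could look roughly like the sketch below. `readRawPcm16` is a hypothetical helper, not part of WebRTC or sox. It assumes a headerless file of 16-bit samples, and its `bigEndian` argument has to match the byte order sox was asked to write (`-B` for big-endian, `-L` for little-endian).

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Reads a headerless file of 16-bit PCM samples into a vector.
// Each sample is assembled byte by byte, so the result does not depend on
// the byte order of the machine running the code.
// Returns an empty vector if the file cannot be opened; a trailing odd byte
// is ignored.
std::vector<int16_t> readRawPcm16(const std::string& path, bool bigEndian) {
    assert(!path.empty());
    std::ifstream in(path, std::ios::binary);
    std::vector<int16_t> samples;
    unsigned char bytes[2];
    while (in.read(reinterpret_cast<char*>(bytes), 2)) {
        const unsigned hi = bigEndian ? bytes[0] : bytes[1];
        const unsigned lo = bigEndian ? bytes[1] : bytes[0];
        samples.push_back(
            static_cast<int16_t>(static_cast<uint16_t>((hi << 8) | lo)));
    }
    return samples;
}
```

Something like `readRawPcm16("my_file.raw", false)` would then return the samples of a little-endian raw file, ready to be passed to the VAD frame by frame.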

It looks like you're using the right function. To be more specific, you should do this:

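#include <iostream>
#include "webrtc/common_audio/vad/include/webrtc_vad.h"
// ...
VadInst *vad;
WebRtcVad_Create(&vad); // newer WebRTC revisions use: vad = WebRtcVad_Create();
WebRtcVad_Init(vad);
const size_t frameLength = 160; // 10 ms at 16000 Hz
const int16_t * temp = sample.data();
// Stop before a trailing partial frame so the VAD never reads past the end of the buffer.
for(size_t i = 0, ms = 0; i + frameLength <= sample.size(); i += frameLength, ms += 10)
{
  int isActive = WebRtcVad_Process(vad, 16000, temp, frameLength); // 1 = voiced, 0 = unvoiced, -1 = error
  std::cout << ms << " ms : " << isActive << std::endl;
  temp = temp + frameLength; // processed 160 samples (320 bytes)
}
WebRtcVad_Free(vad); // release the VAD instance when done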
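
If what you want in the end is "regions" rather than a per-frame printout, one option is to collect the per-frame return values and merge runs of voiced frames. The sketch below is my own illustration, not something the WebRTC VAD provides: `SpeechRegion` and `framesToRegions` are hypothetical names, and the resulting regions are stretches of voice activity, not word boundaries.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A stretch of consecutive voiced frames, in milliseconds.
struct SpeechRegion {
    int startMs;  // start of the first voiced frame
    int endMs;    // end of the last voiced frame
};

// Merges consecutive voiced frames (decision == 1) into regions.
// A decision of 0 (unvoiced) or -1 (error) closes the current region.
std::vector<SpeechRegion> framesToRegions(const std::vector<int>& decisions,
                                          int frameMs) {
    assert(frameMs > 0);
    std::vector<SpeechRegion> regions;
    bool inSpeech = false;
    int startMs = 0;
    for (std::size_t i = 0; i < decisions.size(); ++i) {
        const int nowMs = static_cast<int>(i) * frameMs;
        const bool voiced = (decisions[i] == 1);
        if (voiced && !inSpeech) {
            inSpeech = true;
            startMs = nowMs;
        } else if (!voiced && inSpeech) {
            inSpeech = false;
            regions.push_back({startMs, nowMs});
        }
    }
    if (inSpeech) {
        // The audio ended while still inside a voiced run.
        regions.push_back({startMs, static_cast<int>(decisions.size()) * frameMs});
    }
    return regions;
}
```

In practice you would probably also want to bridge very short gaps and drop very short regions, since single-frame decisions tend to flicker.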

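To see if it's working, you can run it on known files and check whether you get the results you expect. For example, you could start by processing silence and confirm that you never (or rarely; this algorithm is not perfect) see a voiced result come back from WebRtcVad_Process. Then you could try a file that is all silence except for one short utterance in the middle, and so on. If you want to compare against an existing test, the py-webrtcvad module has a unit test that does this; see the test_process_file function.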
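
A check along those lines could be sketched as follows. The classifier is passed in as a callback so the counting logic can be exercised without linking WebRTC; `FrameClassifier` and `countVoicedFrames` are hypothetical names introduced here, and the callback is assumed to return 1 for a voiced frame, as WebRtcVad_Process does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical callback type: returns 1 for voiced, 0 for unvoiced, -1 on error.
using FrameClassifier =
    std::function<int(const int16_t* frame, std::size_t frameLength)>;

// Runs the classifier over every complete frame and counts the voiced ones.
// A trailing partial frame is skipped.
std::size_t countVoicedFrames(const std::vector<int16_t>& samples,
                              std::size_t frameLength,
                              const FrameClassifier& classify) {
    assert(frameLength > 0);
    std::size_t voiced = 0;
    for (std::size_t i = 0; i + frameLength <= samples.size(); i += frameLength) {
        if (classify(samples.data() + i, frameLength) == 1) {
            ++voiced;
        }
    }
    return voiced;
}
```

With the real VAD, the callback could be a lambda such as `[vad](const int16_t* f, std::size_t n) { return WebRtcVad_Process(vad, 16000, f, n); }`. For a buffer of pure silence you would expect the count to be zero or close to it.
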
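To do word-level segmentation, you will probably need to find a speech recognition library that either does it or gives you access to the information you need to do it. For example, this thread on the Kaldi mailing list seems to discuss how to segment by words.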
