MS Cognitive custom voice: submitting sample data returns "Only the RIFF(WAV) format is accepted. Check the format of your audio files."

Question

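Just checking to make sure this should be supported. The page here says you should be able to use any PCM file with a sample rate of at least 16 kHz. I'm trying to use NAudio to split a longer wav file into utterances, and I can generate the files, but all of the training data I submit comes back with the processing error "Only the RIFF(WAV) format is accepted. Check the format of your audio files." The audio files are 16-bit PCM, mono, 44 kHz wav files, and all are under 60 s. Are there other restrictions on the file format that I might be missing? The wav files do have a valid RIFF header (I verified that the bytes are present).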
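Checking that the RIFF bytes exist only confirms the container; it says nothing about what the "fmt " chunk declares. The following is a minimal sketch, not part of NAudio or the Speech SDK, that prints the format tag, channel count, sample rate and bit depth from a wav header. The names WavHeaderProbe and Describe are hypothetical, and the code assumes a seekable stream and a standard chunk layout.

```csharp
using System;
using System.IO;
using System.Text;

public static class WavHeaderProbe
{
    // Returns a one-line summary of the "fmt " chunk, or throws if the
    // stream is not a RIFF/WAVE container. Assumes the stream is seekable.
    public static string Describe(Stream stream)
    {
        using var reader = new BinaryReader(stream, Encoding.ASCII, leaveOpen: true);
        string riff = new string(reader.ReadChars(4));
        reader.ReadUInt32();                          // overall RIFF size, not needed here
        string wave = new string(reader.ReadChars(4));
        if (riff != "RIFF" || wave != "WAVE")
            throw new InvalidDataException("Not a RIFF/WAVE container.");

        // Walk the chunks until "fmt " is found; other chunks may precede it.
        while (stream.Position + 8 <= stream.Length)
        {
            string id = new string(reader.ReadChars(4));
            uint size = reader.ReadUInt32();
            if (id == "fmt ")
            {
                ushort formatTag = reader.ReadUInt16();   // 1 means plain PCM
                ushort channels = reader.ReadUInt16();
                uint sampleRate = reader.ReadUInt32();
                reader.ReadUInt32();                      // average bytes per second
                reader.ReadUInt16();                      // block align
                ushort bitsPerSample = reader.ReadUInt16();
                return $"tag={formatTag} channels={channels} rate={sampleRate} bits={bitsPerSample}";
            }
            // Skip this chunk; chunk bodies are padded to an even length.
            stream.Seek(size + (size & 1), SeekOrigin.Current);
        }
        throw new InvalidDataException("No fmt chunk found.");
    }
}
```

Calling it as `WavHeaderProbe.Describe(File.OpenRead(somePath))` on one of the generated segments shows what the file actually declares, which can then be compared against the 16-bit PCM mono format described above.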

Answer 1

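I managed to work around this by explicitly re-encoding the audio received from the SpeechRecognizer. It's definitely not an efficient solution, but it's just a way to test. Reference code follows (place it in Recognizer.Recognized):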

string rawResult = ea.Result.ToString();  //can get access to raw value this way.
Regex r = new Regex(@".*Offset"":(\d*),.*");
UInt64 offset = Convert.ToUInt64(r?.Match(rawResult)?.Groups[1]?.Value);
r = new Regex(@".*Duration"":(\d*),.*");
UInt64 duration = Convert.ToUInt64(r?.Match(rawResult)?.Groups[1]?.Value);

//create segment files
File.AppendAllText($@"{path}\{fileName}\{fileName}.txt", $"{segmentNumber}\t{ea.Result.Text}\r\n");

//offset and duration are in 100ns units
WaveFileReader w = new WaveFileReader(v);
long totalDurationInMs = w.SampleCount * 1000 / w.WaveFormat.SampleRate;  //total length of the file; multiply first so the division doesn't truncate to whole seconds
ulong offsetInMs = offset / 10000;  //convert from 100ns intervals to ms
ulong durationInMs = duration / 10000;
//multiply before dividing so rates like 44.1kHz don't lose the fractional bytes per millisecond
int blockAlign = w.WaveFormat.BlockAlign;
long startByte = (long)offsetInMs * w.WaveFormat.AverageBytesPerSecond / 1000;
w.Position = startByte - startByte % blockAlign;  //start on a whole sample frame
long bytesToRead = (long)durationInMs * w.WaveFormat.AverageBytesPerSecond / 1000;
bytesToRead -= bytesToRead % blockAlign;  //read whole sample frames only
byte[] buffer = new byte[bytesToRead];
int bytesRead = w.Read(buffer, 0, (int)bytesToRead);
string wavFileName = $@"{path}\{fileName}\{segmentNumber}.wav";
string tempFileName = wavFileName + ".tmp";
using (WaveFileWriter wr = new WaveFileWriter(tempFileName, w.WaveFormat))
{
    wr.Write(buffer, 0, bytesRead);
}
w.Dispose();  //release the handle on the source file

//this is probably really inefficient, but it's also the simplest way to get things in the right format.  It's a prototype, deal with it...
//target format taken from another project: 16kHz, 16-bit, mono
var desiredOutputFormat = new WaveFormat(16000, 16, 1);
using (var r2 = new WaveFileReader(tempFileName))
using (var converter = new WaveFormatConversionStream(desiredOutputFormat, r2))
{
    WaveFileWriter.CreateWaveFile(wavFileName, converter);
}
File.Delete(tempFileName);  //the intermediate file is no longer needed

segmentNumber++;

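This splits the input file into separate per-turn files and appends each turn's transcript, keyed by segment number, to a text file named after the input file.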
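The conversion from 100ns units to a byte range is the step most prone to rounding problems, so here it is isolated as a small sketch. SegmentMath and ToByteRange are hypothetical names, not NAudio APIs; the inputs correspond to WaveFormat.AverageBytesPerSecond and WaveFormat.BlockAlign, and rounding down to whole sample frames is an assumption about what is wanted.

```csharp
using System;

public static class SegmentMath
{
    // Converts an offset/duration pair given in 100ns units into a
    // (start, length) byte range, rounded down to whole sample frames.
    public static (long Start, long Length) ToByteRange(
        ulong offsetTicks, ulong durationTicks, int averageBytesPerSecond, int blockAlign)
    {
        const long TicksPerSecond = 10_000_000;   // 100ns units in one second

        // Multiply before dividing so no fractional bytes are lost early.
        long start = (long)offsetTicks * averageBytesPerSecond / TicksPerSecond;
        long length = (long)durationTicks * averageBytesPerSecond / TicksPerSecond;

        // Drop any partial sample frame at either end.
        start -= start % blockAlign;
        length -= length % blockAlign;
        return (start, length);
    }
}
```

For 16-bit mono audio at 44.1 kHz (88200 bytes per second, block align 2), an offset of 1.5 s and a duration of 0.25 s give `ToByteRange(15_000_000, 2_500_000, 88200, 2)`, which returns a start of 132300 and a length of 22050 bytes.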

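The good news is that this produced a "valid" dataset from which I was able to create a voice. The bad news is that the audio produced by the voice font is almost completely unintelligible, which I attribute to using machine-transcribed samples combined with irregular turn breaks and possibly noisy audio. I may see whether I can improve the accuracy by manually editing some of the files, but I at least wanted to post an answer here in case anyone else runs into the same problem.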

Also, both 16 kHz and 44 kHz PCM appear to work for Custom Voice, so if you have higher-quality audio available, that would be a plus.

text-to-speech

microsoft-cognitive