Converting Raw PCM Data to RIFF WAV

Question

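I am trying to convert raw audio data from one format to another for use in speech recognition.

The audio is received from a Discord server in 20 ms chunks, in this format: 48 kHz, 16-bit stereo signed big-endian PCM.
I am using CMU Sphinx for speech recognition, which takes audio as RIFF (little-endian) WAVE audio, 16-bit, mono, 16,000 Hz.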
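For reference, the RIFF container around little-endian PCM is small. The sketch below builds the canonical 44-byte WAVE header for uncompressed PCM; the class and method names (`WavHeaderSketch`, `pcmHeader`) are hypothetical and are not part of Sphinx or any Discord library. Whether the recognizer actually needs the header in front of the samples is not something this example settles.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class WavHeaderSketch {
    /** Builds the canonical 44-byte RIFF/WAVE header for uncompressed PCM. */
    static byte[] pcmHeader(int sampleRate, int channels, int bitsPerSample, int dataLength) {
        int blockAlign = channels * bitsPerSample / 8;   // bytes per sample frame
        int byteRate = sampleRate * blockAlign;          // bytes per second
        ByteBuffer b = ByteBuffer.allocate(44).order(ByteOrder.LITTLE_ENDIAN);
        b.put("RIFF".getBytes(StandardCharsets.US_ASCII));
        b.putInt(36 + dataLength);                       // size of everything after this field
        b.put("WAVE".getBytes(StandardCharsets.US_ASCII));
        b.put("fmt ".getBytes(StandardCharsets.US_ASCII));
        b.putInt(16);                                    // fmt chunk size for plain PCM
        b.putShort((short) 1);                           // format tag 1 = PCM
        b.putShort((short) channels);
        b.putInt(sampleRate);
        b.putInt(byteRate);
        b.putShort((short) blockAlign);
        b.putShort((short) bitsPerSample);
        b.put("data".getBytes(StandardCharsets.US_ASCII));
        b.putInt(dataLength);                            // number of PCM bytes that follow
        return b.array();
    }
}
```

For three seconds of 16 kHz, 16-bit mono audio, `dataLength` would be 96,000, and the header would be written immediately before the little-endian samples.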

InputStream

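The audio data is received in a byte[] with a length of 3840. This byte[] contains 20 ms of audio in the first format described above. That means 1 second of this audio is 3840 * 50, which is 192,000, so that is 192,000 bytes per second. This makes sense: a 48 kHz sample rate, times 2 because each 16-bit sample takes two 8-bit bytes (96,000 bytes per channel), times 2 again for stereo. So 48,000 * 2 * 2 = 192,000.

So each time an audio packet is received, I first call this method: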
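The arithmetic can be checked with a few lines of Java. Note that 192,000 is a count of bytes: it corresponds to 48,000 stereo frames per second. The class name `PcmRates` is hypothetical, used only for illustration.

```java
final class PcmRates {
    /** Bytes per second of uncompressed PCM: frames per second * channels * bytes per sample. */
    static int bytesPerSecond(int sampleRate, int channels, int bitsPerSample) {
        return sampleRate * channels * (bitsPerSample / 8);
    }

    public static void main(String[] args) {
        int source = bytesPerSecond(48_000, 2, 16);   // Discord side: 48 kHz, stereo, 16-bit
        int target = bytesPerSecond(16_000, 1, 16);   // Sphinx side: 16 kHz, mono, 16-bit
        System.out.println(source);        // 192000 bytes per second
        System.out.println(source / 50);   // 3840 bytes per 20 ms packet
        System.out.println(source * 3);    // 576000 bytes for 3 seconds of input
        System.out.println(target * 3);    // 96000 bytes for 3 seconds of output
    }
}
```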

private void addToPacket(byte[] toAdd) {
    if(packet.length >= 576000 && !done) {
        System.out.println("Processing needs to occur...");
        getResult(convertAudio());
        packet = null; // reset the packet
        return;
    }

    byte[] newPacket = new byte[packet.length + 3840];
    // copy old packet onto new temp array
    System.arraycopy(packet, 0, newPacket, 0, packet.length);
    // copy toAdd packet onto new temp array
    System.arraycopy(toAdd, 0, newPacket, 3840, toAdd.length);
    // overwrite the old packet with the newly resized packet
    packet = newPacket;
}
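Reading the method as posted, two things look suspicious: the second `System.arraycopy` always writes at offset 3840 rather than at `packet.length`, so later packets overwrite earlier ones instead of being appended, and after `packet = null` the next call to `packet.length` would throw a `NullPointerException`. A minimal sketch of the same accumulation using `ByteArrayOutputStream` follows; the class name `PacketBuffer` and its threshold parameter are assumptions made for illustration, not code from the original bot.

```java
import java.io.ByteArrayOutputStream;

final class PacketBuffer {
    private final int threshold;
    private final ByteArrayOutputStream pending = new ByteArrayOutputStream();

    PacketBuffer(int threshold) {
        this.threshold = threshold;
    }

    /** Appends one packet; returns the buffered audio once the threshold is reached, else null. */
    byte[] add(byte[] packet) {
        pending.write(packet, 0, packet.length);
        if (pending.size() < threshold) {
            return null;
        }
        byte[] full = pending.toByteArray();
        pending.reset();   // start over with an empty buffer
        return full;
    }
}
```

With a threshold of 576,000, `add` would return a full 3-second buffer on every 150th packet of 3840 bytes, and no packet is dropped along the way.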

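This just appends each new packet onto one large byte[] until it holds 3 seconds of audio data (576,000 bytes, or 192,000 * 3). Three seconds of audio is enough to detect whether the user said a bot activation hot word such as "hey computer." Here is how I convert the sound data: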

    private byte[] convertAudio() {
        // STEP 1 - DROP EVERY OTHER PACKET TO REMOVE STEREO FROM THE AUDIO
        byte[] mono = new byte[96000];
        for(int i = 0, j = 0; i % 2 == 0 && i < packet.length; i++, j++) {
            mono[j] = packet[i];
        }

        // STEP 2 - DROP EVERY 3RD PACKET TO CONVERT TO 16K HZ Audio
        byte[] resampled = new byte[32000];
        for(int i = 0, j = 0; i % 3 == 0 && i < mono.length; i++, j++) {
            resampled[j] = mono[i];
        }

        // STEP 3 - CONVERT TO LITTLE ENDIAN
        ByteBuffer buffer = ByteBuffer.allocate(resampled.length);
        buffer.order(ByteOrder.BIG_ENDIAN);
        for(byte b : resampled) {
            buffer.put(b);
        }
        buffer.order(ByteOrder.LITTLE_ENDIAN);
        buffer.rewind();
        for(int i = 0; i < resampled.length; i++) {
            resampled[i] = buffer.get(i);
        }

        return resampled;
    }
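A few details of the conversion as posted are worth checking. The loop conditions `i % 2 == 0 && ...` and `i % 3 == 0 && ...` become false as soon as `i` reaches 1, so each loop copies a single byte. The loops also step through individual bytes, while each sample is two bytes wide. `ByteBuffer.order` only affects multi-byte accessors such as `getShort`, so putting and getting single bytes leaves the data unchanged. Finally, 3 seconds of 48 kHz mono audio would need 288,000 bytes rather than 96,000. The following is a rough sketch of a sample-based conversion, not a production resampler: it averages each group of three stereo frames into one mono sample, which acts only as a crude low-pass filter. The name `PcmDownmix` is hypothetical.

```java
final class PcmDownmix {
    /**
     * Converts 48 kHz, 16-bit, stereo, big-endian PCM to 16 kHz, 16-bit, mono, little-endian PCM.
     * Each output sample is the average of 3 consecutive stereo frames (6 input samples).
     */
    static byte[] toMono16kLittleEndian(byte[] in) {
        final int frameBytes = 4;                // 2 channels * 2 bytes per sample
        final int groupBytes = frameBytes * 3;   // 3 input frames per output sample
        int outSamples = in.length / groupBytes;
        byte[] out = new byte[outSamples * 2];
        for (int o = 0; o < outSamples; o++) {
            int sum = 0;
            for (int k = 0; k < 6; k++) {
                int p = o * groupBytes + k * 2;
                // big-endian input: high byte first (signed), low byte second (unsigned)
                sum += (short) ((in[p] << 8) | (in[p + 1] & 0xFF));
            }
            int avg = sum / 6;
            out[2 * o] = (byte) avg;             // little-endian output: low byte first
            out[2 * o + 1] = (byte) (avg >> 8);  // then high byte
        }
        return out;
    }
}
```

Under these assumptions a 576,000-byte input yields 96,000 bytes of output, which matches 3 seconds of 16 kHz, 16-bit mono audio.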

Finally, I try to recognize the speech:

private void getResult(byte[] toProcess) {
    InputStream stream = new ByteArrayInputStream(toProcess);
    recognizer.startRecognition(stream);
    SpeechResult result;
    while ((result = recognizer.getResult()) != null) {
        System.out.format("Hypothesis: %s\n", result.getHypothesis());
    }
    recognizer.stopRecognition();
}

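The problem I am having is that CMUSphinx does not crash or give any error message; it just produces an empty hypothesis every 3 seconds. I am not sure why, but my guess is that I am not converting the sound correctly. Any ideas? Any help would be greatly appreciated.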

Answer 1

So, there is actually a much better built-in solution for converting audio from a byte[].

Here is what I found works well:

        // Needs javax.sound.sampled.*, java.io.*, and JDA's AudioReceiveHandler
        // Specify the output format you want: 16 kHz, 16-bit, mono, signed, little-endian
        AudioFormat target = new AudioFormat(16000f, 16, 1, true, false);
        // Wrap the raw byte[] in a stream; the length argument is counted in sample frames, not bytes
        AudioFormat source = AudioReceiveHandler.OUTPUT_FORMAT;
        AudioInputStream in = new AudioInputStream(new ByteArrayInputStream(raw), source, raw.length / source.getFrameSize());
        AudioInputStream is = AudioSystem.getAudioInputStream(target, in);
        // Write a temporary WAV file; it can then be reopened as an InputStream for recognition
        try {
            AudioSystem.write(is, AudioFileFormat.Type.WAVE, new File("C:\\filename.wav"));
        } catch (IOException e) {
            e.printStackTrace(); // do not silently swallow write failures
        }
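If the temporary file is not wanted, the converted samples can also be read straight into memory. The sketch below is one possible variant and rests on several assumptions: the `SOURCE` format is written out by hand to match the 48 kHz, 16-bit, stereo, big-endian audio described in the question (assumed to be what `AudioReceiveHandler.OUTPUT_FORMAT` describes), the class name `InMemoryConvert` is hypothetical, and the installed Java Sound providers must support this conversion. The result is headerless little-endian PCM, not a RIFF file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

final class InMemoryConvert {
    // Assumption: same layout as the Discord audio described in the question
    static final AudioFormat SOURCE = new AudioFormat(48000f, 16, 2, true, true);
    static final AudioFormat TARGET = new AudioFormat(16000f, 16, 1, true, false);

    /** Returns the converted audio as raw PCM bytes (no RIFF header). */
    static byte[] convert(byte[] raw) {
        long frames = raw.length / SOURCE.getFrameSize();   // stream length is given in frames
        AudioInputStream source =
                new AudioInputStream(new ByteArrayInputStream(raw), SOURCE, frames);
        try (AudioInputStream converted = AudioSystem.getAudioInputStream(TARGET, source)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];   // a multiple of the 2-byte target frame size
            int n;
            while ((n = converted.read(buf)) > 0) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned array could then be wrapped in a `ByteArrayInputStream` and handed to the recognizer. The exact output length may differ slightly from one third of the input frame count, since the resampler is free to pad or trim at the edges.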

java

audio

binary

speech-recognition