Whisper.cpp: A Speech-to-Text Example with VS2022

Summary from the AI assistant
This article explains how to clone and build the project with Git, covering the build steps for the CPU and GPU versions along with the accompanying code.

Cloning the project:

Using Git, clone the project to your local machine:

git clone https://github.com/ggml-org/whisper.cpp.git

Building the project:

Build notes

There is a CPU build and a GPU build: one serves the model with CPU compute, the other with GPU compute.

If your PC has a GPU, the GPU build is strongly recommended; it speeds up transcription dramatically.

The author's environment:

  • IDE: VS2022
  • C++ standard: C++17
  • Build tool: CMake
  • GPU: RTX 3060 (6 GB VRAM)

CPU:

Build commands (build Release whenever possible):

Release:
cmake -B build -G "Visual Studio 17 2022" -A x64 -D WHISPER_SHARED_LIB=ON
cmake --build build --config Release
Debug:
cmake -B build-cpu-dbg -G "Visual Studio 17 2022" -A x64 -DWHISPER_SHARED_LIB=ON
cmake --build build-cpu-dbg --config Debug -j%NUMBER_OF_PROCESSORS%

GPU:

To use the GPU, make sure CUDA is installed (see the CUDA website). Once it is installed, run the following in a command prompt:

nvcc --version
nvidia-smi

If both commands print output like the screenshots below, the environment is fine.

[screenshots: nvcc --version and nvidia-smi output]

Build commands (prefer Release over Debug, and run them from the "x64 Native Tools Command Prompt for VS 2022", which you can find via the Start menu):

Release:

cmake -S . -B build-gpu -G "Ninja" -DGGML_CUDA=ON -DGGML_CUBLAS=ON -DGGML_CUDA_KERNELS=ON -DGGML_CUDA_ARCHITECTURES="86" -DCMAKE_CUDA_COMPILER="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v13.0/bin/nvcc.exe" -DCUDA_TOOLKIT_ROOT_DIR="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v13.0" -DCMAKE_BUILD_TYPE=Release
cmake --build build-gpu -j

Debug:

cmake -S . -B build-gpu-debug -G "Visual Studio 17 2022" -A x64 -DGGML_CUDA=ON -DGGML_CUBLAS=ON -DGGML_CUDA_KERNELS=ON -DCMAKE_CUDA_ARCHITECTURES="86" -DCUDAToolkit_ROOT="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v13.0"
cmake --build build-gpu-debug --config Debug --target ALL_BUILD -j

Build succeeded:

If the build succeeds, you will see the build-gpu directory, and no errors will have appeared during the build.


Importing the libraries:

After the build completes, go to the /build-gpu directory and locate the following:

[screenshot: build-gpu output directories]

Create a library folder, and inside it create the two subfolders shown in Figure 1.

  1. Copy all of the .dll files from the bin directory above (Figure 2) into the lib folder;
  2. Copy the .lib file from the src directory above (Figure 3) into the lib folder;
  3. Copy the .lib file from the ggml directory above (Figure 4) into the lib folder;

[Figure 1]

[Figure 2]

[Figure 3]

[Figure 4]

In the project root (the whisper.cpp directory, Figure 1 below):

  1. Copy the ggml directory in full into the include folder you created earlier;
  2. Copy the files in the include directory at the project root (Figure 2 below) in full into the include folder you created earlier.

[Figure 1]

[Figure 2]

Linking the libraries:

Create an empty VS project and place the library folder from the previous step into the project directory:

  1. Solution -> Properties -> Configuration Properties -> General -> C++ Language Standard -> select C++17 (Figure 1 below);
  2. In the project property pages: C/C++ -> General -> Additional Include Directories, add the include and include/ggml/include paths (see Figure 2);
  3. Properties -> Linker -> General -> Additional Library Directories, add the lib folder (see Figure 3);
  4. Properties -> Linker -> Input -> Additional Dependencies, add the dependencies below (see Figure 4):
whisper.lib
ggml.lib
ggml-base.lib

[Figure 1]

[Figure 2]

[Figure 3]

[Figure 4]
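If you prefer driving the same setup from CMake instead of the VS property pages, the four steps above map to a short script. This is a hypothetical fragment: the target and file names (whisper_demo, main.cpp) are placeholders, while the include, include/ggml/include, and lib folders are the ones created earlier.

```cmake
# Hypothetical CMakeLists.txt mirroring the VS property-page settings above
cmake_minimum_required(VERSION 3.20)
project(whisper_demo CXX)

set(CMAKE_CXX_STANDARD 17)              # step 1: C++17

add_executable(whisper_demo main.cpp QAppWhisper.cpp)

# step 2: additional include directories
target_include_directories(whisper_demo PRIVATE include include/ggml/include)

# step 3: additional library directories
target_link_directories(whisper_demo PRIVATE lib)

# step 4: additional dependencies
target_link_libraries(whisper_demo PRIVATE whisper ggml ggml-base)
```

Remember that the .dll files still need to sit next to the built executable at runtime, just as with the VS project.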

Implementation:

Header file (QAppWhisper.h):

/*
*
* ********  WhisperEngine            ********
* ********  By Ciallo                ********
* ********  2025/11/23               ********
* ********  V1.0                     ********
* ********  https://wang-sz.cn       ********
* * Github:https://github.com/WinterShadowy *
*
*/

#pragma once

#include <string>
#include <vector>
#include <functional>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <atomic>
#include <whisper.h>

class WhisperEngine {
public:
    // Singleton access
    static WhisperEngine& instance();

    // Non-copyable / non-movable
    WhisperEngine(const WhisperEngine&) = delete;
    WhisperEngine& operator=(const WhisperEngine&) = delete;
    WhisperEngine(WhisperEngine&&) = delete;
    WhisperEngine& operator=(WhisperEngine&&) = delete;

    // Asynchronous model initialization: returns immediately; loading runs on a background thread
    // model_path: path to the model file (e.g. "ggml-whisper-large.bin")
    void InitializeAsync(const std::string& model_path);

    // Status queries
    bool IsInitializing() const;
    bool IsInitialized() const;
    bool InitSucceeded() const;
    void SetNumThreads(unsigned int n);
    void SetGpuLayers(int n);
    void SetUseMmap(bool v);
    // Synchronous transcription (blocks the calling thread)
    // waitInitMs: if the model is not ready, wait up to waitInitMs milliseconds (0 = don't wait)
    // Returns: the recognized text, or an error message starting with "[Error]"
    std::string TranscribeFromWav(const std::string& wav_path, int waitInitMs = 0);

    // Asynchronous transcription: recognition runs on a background thread, and the
    // callback is invoked on that thread (forward inside the callback if you need the result on the UI thread)
    // callback signature: void(const std::string& result)
    void TranscribeAsync(const std::string& wav_path, std::function<void(std::string)> callback);

    // Stop / release the model (call InitializeAsync again to reload)
    void Shutdown();

    ~WhisperEngine();

private:
    WhisperEngine();

    // Background initialization thread entry points
    void initThreadFunc_v_1_1(const std::string model_path);
    void initThreadFunc_v_1_2(const std::string model_path);
    void initThreadFunc_v_1_3(const std::string model_path);

    void initThreadFunc_v_1_4(const std::string model_path);

    // Internal transcription worker; assumes the model is loaded and the caller has locked or serialized access
    std::string doTranscribe_v_1_1(const std::string& wav_path);
private:
    // whisper context
    whisper_context* m_ctx = nullptr;

    unsigned int m_n_threads = 0;    // 0 = use hardware_concurrency()
    int m_n_gpu_layers = -1;         // -1 = default / GPU not forced (or auto-select)
    bool m_use_mmap = false;
    // Initialization state
    std::atomic<bool> m_initializing{ false };
    std::atomic<bool> m_initialized{ false };
    std::atomic<bool> m_init_success{ false };

    // Mutex and condition variable for waiting on initialization
    std::mutex m_init_mutex;
    std::condition_variable m_init_cv;

    // Serializes calls to whisper_full
    std::mutex m_whisper_call_mutex;

    // Background init thread (kept as a member so it can be joined)
    std::thread m_init_thread;

    // If you use TranscribeAsync, a thread pool or a single worker thread could run the tasks; this example spawns one std::thread per task
};
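The init/wait handshake that TranscribeFromWav relies on (background thread flips flags under a mutex, waiter blocks on a condition variable with a deadline) can be seen in isolation. This is a self-contained sketch independent of whisper; the names InitGate and demo are ours.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Minimal model of the handshake in WhisperEngine: a background thread flips
// 'initialized' under the mutex and notifies; a waiter blocks for at most a
// deadline, exactly like TranscribeFromWav(wav_path, waitInitMs).
struct InitGate {
    std::mutex m;
    std::condition_variable cv;
    std::atomic<bool> initialized{false};

    void finish() {
        { std::lock_guard<std::mutex> lk(m); initialized.store(true); }
        cv.notify_all();
    }
    bool wait_ready(int ms) {
        std::unique_lock<std::mutex> lk(m);
        return cv.wait_for(lk, std::chrono::milliseconds(ms),
                           [this] { return initialized.load(); });
    }
};

bool demo() {
    InitGate g;
    std::thread t([&g] {
        std::this_thread::sleep_for(std::chrono::milliseconds(50)); // simulated model load
        g.finish();
    });
    bool ok = g.wait_ready(2000);   // the waiter sees the flag within the deadline
    t.join();
    return ok;
}
```

Setting the flag inside the lock before notifying is what makes the wait_for predicate race-free; WhisperEngine follows the same discipline with m_init_mutex and m_init_cv.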

Source file (QAppWhisper.cpp):

/*
*
* ********  WhisperEngine            ********
* ********  By Ciallo                ********
* ********  2025/11/23               ********
* ********  V1.0                     ********
* ********  https://wang-sz.cn       ********
* * Github:https://github.com/WinterShadowy *
*
*/

#define _CRT_SECURE_NO_WARNINGS   // must be defined before any CRT header is included
#pragma warning(disable:4996)

#include "QAppWhisper.h"

#include <whisper.h>
#include <algorithm>   // std::max / std::min in read_wav_v_1_2
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <QDebug>
#include <chrono>
#include <iostream>

#pragma comment(lib, "whisper.lib")

__declspec(deprecated("This function will be removed in a future version. Use read_wav_v_1_1() instead."))
bool read_wav_v_1_0(const char* fname, std::vector<float>& pcmf32, int expect_sample_rate = 16000);
bool read_wav_v_1_1(const char* fname, std::vector<float>& pcmf32, int expect_sample_rate = 16000);
bool read_wav_v_1_2(const char* fname, std::vector<float>& pcmf32, int expect_sample_rate = 16000);

// Singleton
WhisperEngine& WhisperEngine::instance()
{
    static WhisperEngine inst;
    return inst;
}

WhisperEngine::WhisperEngine()
    : m_ctx(nullptr)
{
}

WhisperEngine::~WhisperEngine()
{
    Shutdown();
}

void WhisperEngine::InitializeAsync(const std::string& model_path)
{
    // Ignore the call if initialization already completed; a real application
    // could choose to reload here instead
    if (m_initialized.load())
        return;

    // Atomically claim the "initializing" flag so concurrent callers bail out
    bool expected = false;
    if (!m_initializing.compare_exchange_strong(expected, true))
        return;

    m_init_success.store(false);
    m_initialized.store(false);

    // Join any previous (failed) init thread before reusing the member,
    // otherwise assigning to m_init_thread would call std::terminate
    if (m_init_thread.joinable())
        m_init_thread.join();

    // Launch the background initialization thread
    m_init_thread = std::thread([this, model_path]() {
        this->initThreadFunc_v_1_4(model_path);
        });

    // You could detach here instead, but joining in the destructor is safer
}

void WhisperEngine::initThreadFunc_v_1_1(const std::string model_path)
{
    // The model is actually loaded here, on the background thread
    // Note: whisper_init_from_file_with_params can take a while
    whisper_context_params cparams = whisper_context_default_params();
    // Adjust cparams here if needed

    whisper_context* ctx = whisper_init_from_file_with_params(model_path.c_str(), cparams);
    {
        std::lock_guard<std::mutex> lk(m_init_mutex);
        m_ctx = ctx;
        if (ctx) 
        {
            m_init_success.store(true);
            m_initialized.store(true);
        }
        else 
        {
            m_init_success.store(false);
            m_initialized.store(false);
        }
        m_initializing.store(false);
    }
    m_init_cv.notify_all();
}

void WhisperEngine::initThreadFunc_v_1_2(const std::string model_path)
{
    whisper_context_params cparams = whisper_context_default_params();
    cparams.use_gpu = true;
#ifdef WHISPER_HAS_USE_MMAP
    cparams.use_mmap = m_use_mmap;
#endif

    if (m_n_gpu_layers >= 0) {
#ifdef WHISPER_HAS_N_GPU_LAYERS
        cparams.n_gpu_layers = m_n_gpu_layers;
#else
        // If the library lacks this field, ignore it or log a hint
        // telling the user to rebuild whisper.cpp with GPU support
        qDebug() << "whisper lib: n_gpu_layers not available in whisper_context_params; ignoring";
#endif
    }

    qDebug() << "Initializing model:" << QString::fromStdString(model_path)
        << " gpu_layers=" << m_n_gpu_layers << " use_mmap=" << m_use_mmap;

    whisper_context* ctx = whisper_init_from_file_with_params(model_path.c_str(), cparams);

    {
        std::lock_guard<std::mutex> lk(m_init_mutex);
        m_ctx = ctx;
        m_init_success.store(ctx != nullptr);
        m_initialized.store(true);   // mark initialization finished (whether it succeeded or failed)
        m_initializing.store(false);
    }
    m_init_cv.notify_all();
}

void WhisperEngine::initThreadFunc_v_1_3(const std::string model_path)
{
    whisper_context_params cparams = whisper_context_default_params();
    cparams.use_gpu = true;
    cparams.gpu_device = 0;      // an RTX 3060 is usually device 0
    cparams.flash_attn = true;   // enable Flash Attention (takes effect if the build supports it)
#ifdef WHISPER_HAS_USE_MMAP
    cparams.use_mmap = m_use_mmap;
#endif

#ifdef WHISPER_HAS_N_GPU_LAYERS
    // With m_n_gpu_layers < 0, offload as many layers as possible to the GPU (the library clamps to the supported maximum)
    cparams.n_gpu_layers = (m_n_gpu_layers < 0) ? 999 : m_n_gpu_layers;
#else
    qDebug() << "whisper lib: n_gpu_layers not available in whisper_context_params; ignoring";
#endif

    qDebug() << "Initializing model:" << QString::fromStdString(model_path)
        << " gpu_layers=" << m_n_gpu_layers << " use_mmap=" << m_use_mmap
        << " flash_attn=" << cparams.flash_attn << " gpu_device=" << cparams.gpu_device;

    whisper_context* ctx = whisper_init_from_file_with_params(model_path.c_str(), cparams);

    {
        std::lock_guard<std::mutex> lk(m_init_mutex);
        m_ctx = ctx;
        m_init_success.store(ctx != nullptr);
        m_initialized.store(true);
        m_initializing.store(false);
    }

    if (ctx) {
        qDebug() << "Whisper backend:" << whisper_print_system_info();
    }
    m_init_cv.notify_all();
}

void WhisperEngine::initThreadFunc_v_1_4(const std::string model_path)
{
    // Set up the GPU parameters
    whisper_context_params cparams = whisper_context_default_params();
    cparams.use_gpu = true;     // enable CUDA
    cparams.gpu_device = 0;     // single RTX 3060
    cparams.flash_attn = true;  // fine to enable on Ampere; requires build support
    // The actual load (may take several seconds)
    whisper_context* ctx = whisper_init_from_file_with_params(model_path.c_str(), cparams);

    // Write the state back
    {
        std::lock_guard<std::mutex> lk(m_init_mutex);
        m_ctx = ctx;
        if (ctx)
        {
            m_init_success.store(true);
            m_initialized.store(true);
            std::fprintf(stdout, "[WhisperEngine] GPU model loaded successfully from %s\n", model_path.c_str());
        }
        else
        {
            m_init_success.store(false);
            m_initialized.store(false);
            std::fprintf(stderr, "[WhisperEngine] GPU model load FAILED from %s\n", model_path.c_str());
        }
        m_initializing.store(false);
    }
    m_init_cv.notify_all();
}

bool WhisperEngine::IsInitializing() const 
{ 
    return m_initializing.load(); 
}
bool WhisperEngine::IsInitialized() const
{
    return m_initialized.load();
}
bool WhisperEngine::InitSucceeded() const 
{ 
    return m_init_success.load(); 
}
void WhisperEngine::SetNumThreads(unsigned int n)
{
    m_n_threads = n;
    return;
}
void WhisperEngine::SetGpuLayers(int n)
{
    m_n_gpu_layers = n;
    return;
}
void WhisperEngine::SetUseMmap(bool v)
{
    m_use_mmap = v;
    return;
}

std::string WhisperEngine::doTranscribe_v_1_1(const std::string& wav_path)
{
    if (!m_ctx) 
        return std::string("[Error] model not loaded");

    // Read the wav into a float buffer (using the read_wav helper defined below)
    std::vector<float> pcmf32;
    if (!read_wav_v_1_1(wav_path.c_str(), pcmf32, 16000)) 
    {
        return std::string("[Error] read WAV failed");
    }

    // Call whisper_full
    whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    wparams.language = "zh";
    wparams.no_timestamps = true;
    wparams.n_threads = 4;          // CPU parallelism
    wparams.n_max_text_ctx = 16384; // text context size for ~30 s chunks
    // Serialize concurrent calls to whisper_full
    std::lock_guard<std::mutex> lk(m_whisper_call_mutex);

    int rc = whisper_full(m_ctx, wparams, pcmf32.data(), static_cast<int>(pcmf32.size()));
    if (rc != 0) 
    {
        return std::string("[Error] whisper_full failed");
    }

    // Collect the result
    std::string result;
    int n = whisper_full_n_segments(m_ctx);
    for (int i = 0; i < n; ++i) 
    {
        const char* seg = whisper_full_get_segment_text(m_ctx, i);
        if (seg) 
            result += seg;
    }
    return result;
}
std::string WhisperEngine::TranscribeFromWav(const std::string& wav_path, int waitInitMs)
{
    // If the model is not yet initialized and the caller allowed waiting, wait here
    if (!m_initialized.load()) 
    {
        if (waitInitMs > 0)
        {
            std::unique_lock<std::mutex> lk(m_init_mutex);
            m_init_cv.wait_for(lk, std::chrono::milliseconds(waitInitMs), [this]() {
                return this->m_initialized.load() || !this->m_initializing.load();
                });
        }
    }

    if (!m_initialized.load() || !m_init_success.load()) 
    {
        return std::string("[Error] model not ready");
    }

    return doTranscribe_v_1_1(wav_path);
}
void WhisperEngine::TranscribeAsync(const std::string& wav_path, std::function<void(std::string)> callback)
{
    // Run the recognition on a background thread
    std::thread task([this, wav_path, callback]() {
        // Wait for the model to become ready, up to 30 s (adjustable)
        {
            std::unique_lock<std::mutex> lk(m_init_mutex);
            if (!m_initialized.load()) 
            {
                m_init_cv.wait_for(lk, std::chrono::seconds(30), [this]() {
                    return this->m_initialized.load() || !this->m_initializing.load();
                    });
            }
        }
        if (!m_initialized.load() || !m_init_success.load()) 
        {
            if (callback) 
                callback("[Error] model not ready");
            return;
        }
        std::string res = doTranscribe_v_1_1(wav_path);
        if (callback) 
            callback(res);
        });
    task.detach(); // fire-and-forget; swap in a thread pool if you need managed task threads
}

void WhisperEngine::Shutdown()
{
    // If the init thread is still running, wait for it to finish
    if (m_init_thread.joinable()) 
    {
        // whisper's init cannot be interrupted; just wait for it
        m_init_thread.join();
    }

    {
        std::lock_guard<std::mutex> lk(m_init_mutex);
        if (m_ctx) 
        {
            whisper_free(m_ctx);
            m_ctx = nullptr;
        }
        m_initialized.store(false);
        m_init_success.store(false);
        m_initializing.store(false);
    }
    m_init_cv.notify_all();
}

// Supported input:
// 16 kHz / 32-bit float / mono WAV
// other formats are rejected
bool read_wav_v_1_0(const char* fname, std::vector<float>& pcmf32, int expect_sample_rate)
{
    FILE* fp = std::fopen(fname, "rb");
    if (!fp) 
    {                                    // file could not be opened
        std::fprintf(stderr, "[read_wav] fopen fail: %s\n", fname);
        return false;
    }

    // canonical 44-byte WAV header; only the first 36 bytes (RIFF + "fmt ")
    // are read here -- the data fields are filled in by the chunk scan below
    struct {
        char     riff[4] = {};
        uint32_t size = 0;
        char     wave[4] = {};
        char     fmt[4] = {};
        uint32_t fmt_sz = 0;
        uint16_t fmt_tag = 0;
        uint16_t ch = 0;
        uint32_t sr = 0;
        uint32_t br = 0;
        uint16_t ba = 0;
        uint16_t bps = 0;
        char     data[4] = {};
        uint32_t data_sz = 0;
    } h;

    bool ok = true;

    // read only the fixed 36-byte part (stopping before the data fields, so the
    // chunk scan below starts at the first chunk after "fmt ")
    if (ok && std::fread(&h, 36, 1, fp) != 1) 
        ok = false;
    if (ok && (std::memcmp(h.riff, "RIFF", 4) || std::memcmp(h.wave, "WAVE", 4))) 
    {
        std::fprintf(stderr, "[read_wav] not a RIFF/WAVE file\n");
        ok = false;
    }
    if (ok && h.fmt_tag != 1) 
    {                       // 1 = PCM
        std::fprintf(stderr, "[read_wav] not PCM (fmt_tag=%u)\n", h.fmt_tag);
        ok = false;
    }

    // skip any fmt extension bytes
    if (ok && h.fmt_sz > 16) 
        std::fseek(fp, h.fmt_sz - 16, SEEK_CUR);

    // scan the remaining chunks for the "data" chunk
    bool found = false;
    while (ok) 
    {
        char id[4];
        uint32_t sz = 0;
        if (std::fread(id, 4, 1, fp) != 1) 
            break;
        if (std::fread(&sz, 4, 1, fp) != 1) 
            break;
        if (memcmp(id, "data", 4) == 0) 
        {
            h.data_sz = sz;
            found = true;
            break;
        }
        // skip this chunk's payload (padded to an even byte count)
        std::fseek(fp, sz + (sz & 1), SEEK_CUR);
    }
    if (!found) 
    {
        fprintf(stderr, "[read_wav] no 'data' chunk\n");
        ok = false;
    }

    // format checks
    if (ok && h.sr != (uint32_t)expect_sample_rate) 
    {
        std::fprintf(stderr, "[read_wav] sample rate mismatch: file=%u, expected=%d\n", h.sr, expect_sample_rate);
        ok = false;
    }
    if (ok && h.ch != 1) 
    {
        std::fprintf(stderr, "[read_wav] not mono (channels=%u)\n", h.ch);
        ok = false;
    }
    if (ok && h.bps != 32) 
    {
        std::fprintf(stderr, "[read_wav] not 32-bit float (bits=%u)\n", h.bps);
        ok = false;
    }

    // read the samples
    if (ok) 
    {
        size_t n = h.data_sz / sizeof(float);
        pcmf32.resize(n);
        if (std::fread(pcmf32.data(), sizeof(float), n, fp) != n) 
        {
            std::fprintf(stderr, "[read_wav] fread samples failed\n");
            ok = false;
        }
    }
    std::fclose(fp);
    return ok && !pcmf32.empty();
}

// Supported input:
// sample_rate == expect_sample_rate (default 16000)
// format == IEEE float (fmt_tag == 3)
//           bits_per_sample == 32
//           channels == 1 (mono)
bool read_wav_v_1_1(const char* fname, std::vector<float>& pcmf32, int expect_sample_rate)
{
    FILE* fp = std::fopen(fname, "rb");
    if (!fp)
    {
        std::fprintf(stderr, "[read_wav] fopen fail: %s\n", fname);
        return false;
    }

    // read the RIFF header (12 bytes)
    char riff[4];
    uint32_t riff_sz = 0;
    char wave[4];
    if (std::fread(riff, 1, 4, fp) != 4 ||
        std::fread(&riff_sz, sizeof(riff_sz), 1, fp) != 1 ||
        std::fread(wave, 1, 4, fp) != 4)
    {
        std::fprintf(stderr, "[read_wav] read header failed\n");
        std::fclose(fp);
        return false;
    }
    if (std::memcmp(riff, "RIFF", 4) != 0 || std::memcmp(wave, "WAVE", 4) != 0)
    {
        std::fprintf(stderr, "[read_wav] not a RIFF/WAVE file\n");
        std::fclose(fp);
        return false;
    }

    // variables to hold fmt/data info
    bool got_fmt = false;
    bool got_data = false;
    uint16_t fmt_tag = 0;
    uint16_t channels = 0;
    uint32_t sample_rate = 0;
    uint16_t bits_per_sample = 0;
    uint32_t data_sz = 0;
    std::vector<uint8_t> data_buf;

    // walk the chunks looking for "fmt " and "data"
    while (!got_data)
    {
        char id[4];
        uint32_t chunk_sz = 0;
        if (std::fread(id, 1, 4, fp) != 4)
            break;
        if (std::fread(&chunk_sz, sizeof(chunk_sz), 1, fp) != 1)
            break;

        if (std::memcmp(id, "fmt ", 4) == 0)
        {
            // read the fmt chunk (chunk_sz >= 16 is typical)
            std::vector<uint8_t> fmt(chunk_sz);
            if (chunk_sz > 0)
            {
                if (std::fread(fmt.data(), 1, chunk_sz, fp) != chunk_sz)
                {
                    std::fprintf(stderr, "[read_wav] read fmt chunk failed\n");
                    std::fclose(fp);
                    return false;
                }
            }
            if (chunk_sz < 16)
            {
                std::fprintf(stderr, "[read_wav] fmt chunk too small\n");
                std::fclose(fp);
                return false;
            }
            // parse the common fields directly as little-endian (RIFF/WAV is little-endian)
            fmt_tag = *reinterpret_cast<const uint16_t*>(&fmt[0]);
            channels = *reinterpret_cast<const uint16_t*>(&fmt[2]);
            sample_rate = *reinterpret_cast<const uint32_t*>(&fmt[4]);
            // skip byte rate (4) block align (2)
            bits_per_sample = *reinterpret_cast<const uint16_t*>(&fmt[14]);

            got_fmt = true;
            // if the fmt chunk is larger than 16 bytes, the read above already consumed the extension
        }
        else if (std::memcmp(id, "data", 4) == 0)
        {
            // read the data chunk
            if (chunk_sz > 0)
            {
                data_buf.resize(chunk_sz);
                if (std::fread(data_buf.data(), 1, chunk_sz, fp) != chunk_sz)
                {
                    std::fprintf(stderr, "[read_wav] read data chunk failed\n");
                    std::fclose(fp);
                    return false;
                }
                data_sz = chunk_sz;
            }
            else
            {
                data_buf.clear();
                data_sz = 0;
            }
            got_data = true;
        }
        else
        {
            // skip unknown chunks (padded to an even byte count)
            if (chunk_sz > 0)
            {
                if (std::fseek(fp, chunk_sz, SEEK_CUR) != 0)
                {
                    std::fprintf(stderr, "[read_wav] fseek failed while skipping chunk\n");
                    std::fclose(fp);
                    return false;
                }
            }
        }

        // if the chunk size is odd, the file contains a pad byte
        if (chunk_sz & 1)
        {
            std::fseek(fp, 1, SEEK_CUR);
        }
    }

    if (!got_fmt)
    {
        std::fprintf(stderr, "[read_wav] no 'fmt ' chunk\n");
        std::fclose(fp);
        return false;
    }
    if (!got_data)
    {
        std::fprintf(stderr, "[read_wav] no 'data' chunk\n");
        std::fclose(fp);
        return false;
    }

    // format check: only IEEE float is accepted (fmt_tag == 3)
    if (fmt_tag != 3)
    {
        std::fprintf(stderr, "[read_wav] not IEEE float (fmt_tag=%u)\n", fmt_tag);
        std::fclose(fp);
        return false;
    }
    if (bits_per_sample != 32)
    {
        std::fprintf(stderr, "[read_wav] not 32-bit float (bits=%u)\n", bits_per_sample);
        std::fclose(fp);
        return false;
    }
    if (channels != 1)
    {
        std::fprintf(stderr, "[read_wav] not mono (channels=%u)\n", channels);
        std::fclose(fp);
        return false;
    }
    if (sample_rate != (uint32_t)expect_sample_rate)
    {
        std::fprintf(stderr, "[read_wav] sample rate mismatch: file=%u, expected=%d\n", sample_rate, expect_sample_rate);
        std::fclose(fp);
        return false;
    }

    // read the samples (4-byte floats)
    if (data_sz == 0)
    {
        std::fprintf(stderr, "[read_wav] data chunk is empty\n");
        std::fclose(fp);
        return false;
    }
    size_t n_samples = data_sz / sizeof(float);
    pcmf32.resize(n_samples);
    // straight memory copy (assumes the file holds little-endian IEEE floats)
    std::memcpy(pcmf32.data(), data_buf.data(), n_samples * sizeof(float));

    std::fclose(fp);
    return !pcmf32.empty();
}

bool read_wav_v_1_2(const char* fname, std::vector<float>& pcmf32, int expect_sample_rate)
{
    FILE* fp = std::fopen(fname, "rb");
    if (!fp)
    {
        std::fprintf(stderr, "[read_wav] fopen fail: %s\n", fname);
        return false;
    }

    // read RIFF header (12 bytes)
    char riff[4];
    uint32_t riff_sz;
    char wave[4];
    if (std::fread(riff, 1, 4, fp) != 4 ||
        std::fread(&riff_sz, sizeof(riff_sz), 1, fp) != 1 ||
        std::fread(wave, 1, 4, fp) != 4)
    {
        std::fprintf(stderr, "[read_wav] read header failed\n");
        std::fclose(fp);
        return false;
    }
    if (std::memcmp(riff, "RIFF", 4) != 0 || std::memcmp(wave, "WAVE", 4) != 0)
    {
        std::fprintf(stderr, "[read_wav] not a RIFF/WAVE file\n");
        std::fclose(fp);
        return false;
    }

    // variables to hold fmt and data info
    bool got_fmt = false;
    bool got_data = false;
    uint16_t audio_format = 0;
    uint16_t num_channels = 0;
    uint32_t sample_rate = 0;
    uint16_t bits_per_sample = 0;
    std::vector<uint8_t> data_buf;
    uint32_t data_size = 0;

    // iterate chunks
    while (!got_data) {
        char id[4];
        uint32_t chunk_sz = 0;
        if (std::fread(id, 1, 4, fp) != 4) break;
        if (std::fread(&chunk_sz, sizeof(chunk_sz), 1, fp) != 1) break;

        // handle chunk id
        if (std::memcmp(id, "fmt ", 4) == 0) 
        {
            // read fmt chunk
            std::vector<uint8_t> fmt(chunk_sz);
            if (chunk_sz > 0 && std::fread(fmt.data(), 1, chunk_sz, fp) != chunk_sz) {
                std::fprintf(stderr, "[read_wav] read fmt chunk failed\n");
                std::fclose(fp);
                return false;
            }
            // parse basic fields (first 16 bytes expected)
            if (chunk_sz < 16) {
                std::fprintf(stderr, "[read_wav] fmt chunk too small\n");
                std::fclose(fp);
                return false;
            }
            // little-endian parsing
            audio_format = *reinterpret_cast<const uint16_t*>(&fmt[0]);
            num_channels = *reinterpret_cast<const uint16_t*>(&fmt[2]);
            sample_rate = *reinterpret_cast<const uint32_t*>(&fmt[4]);
            // skip byte rate (4 bytes) and block align (2 bytes)
            bits_per_sample = *reinterpret_cast<const uint16_t*>(&fmt[14]);

            // if fmt chunk has extension (e.g., WAVE_FORMAT_EXTENSIBLE), we could inspect subformat,
            // but for typical files, audio_format == 1 (PCM) or 3 (IEEE float) is enough.
            got_fmt = true;
        }
        else if (std::memcmp(id, "data", 4) == 0)
        {
            // read data chunk into buffer
            if (chunk_sz > 0)
            {
                data_buf.resize(chunk_sz);
                if (std::fread(data_buf.data(), 1, chunk_sz, fp) != chunk_sz)
                {
                    std::fprintf(stderr, "[read_wav] read data chunk failed\n");
                    std::fclose(fp);
                    return false;
                }
                data_size = chunk_sz;
                got_data = true;
            }
            else
            {
                // empty data chunk
                data_buf.clear();
                data_size = 0;
                got_data = true;
            }
        }
        else
        {
            // skip unknown chunk (pad to even)
            // chunk_sz may be odd; skip chunk_sz bytes, plus pad byte if needed
            uint32_t skip = chunk_sz;
            if (skip > 0)
            {
                if (std::fseek(fp, skip, SEEK_CUR) != 0)
                {
                    std::fprintf(stderr, "[read_wav] fseek failed while skipping chunk\n");
                    std::fclose(fp);
                    return false;
                }
            }
        }

        // chunk sizes are word aligned; if chunk_sz is odd, there may be a pad byte
        if (chunk_sz & 1)
        {
            std::fseek(fp, 1, SEEK_CUR);
        }
    }

    if (!got_fmt)
    {
        std::fprintf(stderr, "[read_wav] no 'fmt ' chunk\n");
        std::fclose(fp);
        return false;
    }
    if (!got_data)
    {
        std::fprintf(stderr, "[read_wav] no 'data' chunk\n");
        std::fclose(fp);
        return false;
    }

    // sample rate check
    if (expect_sample_rate > 0 && sample_rate != (uint32_t)expect_sample_rate)
    {
        std::fprintf(stderr, "[read_wav] sample rate mismatch: file=%u, expected=%d\n", sample_rate, expect_sample_rate);
        // to accept other sample rates instead, drop the return below and just warn
        std::fclose(fp);
        return false;
    }

    // channels check (we will accept multi-channel but downmix to mono)
    if (num_channels < 1)
    {
        std::fprintf(stderr, "[read_wav] invalid channel count: %u\n", num_channels);
        std::fclose(fp);
        return false;
    }

    // convert data buffer into float vector
    pcmf32.clear();
    if (audio_format == 3)
    {
        // IEEE float
        if (bits_per_sample != 32)
        {
            std::fprintf(stderr, "[read_wav] unexpected float bits: %u\n", bits_per_sample);
            std::fclose(fp);
            return false;
        }
        size_t total_samples = data_size / sizeof(float);
        if (total_samples == 0)
        {
            std::fprintf(stderr, "[read_wav] no samples\n");
            std::fclose(fp);
            return false;
        }
        // if multi-channel, average channels to mono
        size_t frames = total_samples / num_channels;
        pcmf32.resize(frames);
        const float* src = reinterpret_cast<const float*>(data_buf.data());
        for (size_t i = 0; i < frames; ++i)
        {
            float acc = 0.0f;
            for (uint16_t c = 0; c < num_channels; ++c)
            {
                acc += src[i * num_channels + c];
            }
            pcmf32[i] = acc / static_cast<float>(num_channels);
        }
    }
    else if (audio_format == 1)
    {
        // PCM integer
        if (bits_per_sample == 16)
        {
            size_t total_samples = data_size / sizeof(int16_t);
            size_t frames = total_samples / num_channels;
            pcmf32.resize(frames);
            const int16_t* src = reinterpret_cast<const int16_t*>(data_buf.data());
            for (size_t i = 0; i < frames; ++i)
            {
                int64_t acc = 0;
                for (uint16_t c = 0; c < num_channels; ++c)
                {
                    acc += src[i * num_channels + c];
                }
                float v = static_cast<float>(acc) / (32768.0f * static_cast<float>(num_channels));
                pcmf32[i] = std::max(-1.0f, std::min(1.0f, v));
            }
        }
        else if (bits_per_sample == 24)
        {
            // 24-bit packed little-endian
            size_t bytes_per_sample = 3;
            size_t total_samples = data_size / bytes_per_sample;
            size_t frames = total_samples / num_channels;
            pcmf32.resize(frames);
            const uint8_t* b = data_buf.data();
            for (size_t i = 0; i < frames; ++i)
            {
                int64_t acc = 0;
                for (uint16_t c = 0; c < num_channels; ++c) {
                    size_t idx = (i * num_channels + c) * 3;
                    int32_t sample = (int32_t)((b[idx]) | (b[idx + 1] << 8) | (b[idx + 2] << 16));
                    // sign extension for 24-bit
                    if (sample & 0x800000) sample |= ~0xFFFFFF;
                    acc += sample;
                }
                float v = static_cast<float>(acc) / (8388608.0f * static_cast<float>(num_channels)); // 2^23
                pcmf32[i] = std::max(-1.0f, std::min(1.0f, v));
            }
        }
        else if (bits_per_sample == 32)
        {
            // 32-bit PCM integer
            size_t total_samples = data_size / sizeof(int32_t);
            size_t frames = total_samples / num_channels;
            pcmf32.resize(frames);
            const int32_t* src = reinterpret_cast<const int32_t*>(data_buf.data());
            for (size_t i = 0; i < frames; ++i)
            {
                int64_t acc = 0;
                for (uint16_t c = 0; c < num_channels; ++c)
                {
                    acc += src[i * num_channels + c];
                }
                float v = static_cast<float>(acc) / (2147483648.0f * static_cast<float>(num_channels));
                pcmf32[i] = std::max(-1.0f, std::min(1.0f, v));
            }
        }
        else
        {
            std::fprintf(stderr, "[read_wav] unsupported PCM bit depth: %u\n", bits_per_sample);
            std::fclose(fp);
            return false;
        }
    }
    else
    {
        std::fprintf(stderr, "[read_wav] unsupported audio format (fmt_tag=%u)\n", audio_format);
        std::fclose(fp);
        return false;
    }

    std::fclose(fp);
    return !pcmf32.empty();
}
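read_wav_v_1_2 above folds 24-bit PCM into floats; the sign-extension step is the part that is easy to get wrong, so here it is in isolation. A self-contained sketch; the helper names (s24le_to_i32, s24le_to_f32) are ours, not part of whisper.cpp.

```cpp
#include <cstdint>

// Sign-extend a packed 24-bit little-endian sample into a 32-bit signed
// integer, exactly as read_wav_v_1_2 does inside its 24-bit branch.
inline int32_t s24le_to_i32(uint8_t b0, uint8_t b1, uint8_t b2) {
    int32_t s = (int32_t)(b0 | (b1 << 8) | (b2 << 16));
    if (s & 0x800000)       // bit 23 is the sign bit of a 24-bit sample
        s |= ~0xFFFFFF;     // propagate it into the top byte
    return s;
}

// Normalize to [-1, 1) by dividing by 2^23 = 8388608.
inline float s24le_to_f32(uint8_t b0, uint8_t b1, uint8_t b2) {
    return (float)s24le_to_i32(b0, b1, b2) / 8388608.0f;
}
```

Without the sign-extension line, 0xFF 0xFF 0xFF would decode as 16777215 instead of -1, and every negative half-wave would be mangled.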

Usage:

 
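A minimal sketch of driving the engine from a console program, assuming a model file ggml-medium.bin and a 16 kHz mono float WAV test.wav sit next to the executable (both file names are ours, not from the article):

```cpp
#include "QAppWhisper.h"
#include <iostream>

int main() {
    auto& engine = WhisperEngine::instance();

    // Kick off model loading on a background thread (path is an assumption)
    engine.InitializeAsync("ggml-medium.bin");

    // Block until the model is ready (up to 30 s), then transcribe
    std::string text = engine.TranscribeFromWav("test.wav", 30000);
    std::cout << text << std::endl;

    // TranscribeAsync(wav, callback) is the non-blocking variant; remember that
    // the callback fires on a background thread, so forward the result to the
    // UI thread yourself, and keep the process alive until it has fired
    engine.Shutdown();
    return 0;
}
```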
