hierarchical-embedding-layer

发表于 2026-05-30 分类于 inference

HierarchicalEmbeddingLayer 是 NV Embedding Cache SDK 中最复杂、功能最完整的嵌入层实现，通过 hitmask 驱动的级联查找组合 GPU 缓存、CPU 哈希表与远程存储。

HierarchicalEmbeddingLayer 深度分析

概述

HierarchicalEmbeddingLayer 是 NV Embedding Cache SDK 中 最复杂、功能最完整 的嵌入层实现。它组合了多张表（tables）构成查找层级，按照”先快后慢”的顺序逐级查找：首先尝试 GPU 集合关联缓存，未命中则转发到 CPU 端哈希表，仍未命中则转发到远程存储（Redis / RocksDB）。每张表独立管理各自的存储后端，层负责协调表间的数据流动和缺失键转发。

该层对应 README 中的第三种配置场景：

一些嵌入缓存在 GPU 内存中，一些缓存在 Host 内存中，所有嵌入保留在远程参数服务器：使用带 Hierarchical 缓存类型的 NVEmbedding（Python）/ HierarchicalEmbeddingLayer（C++）。

文件位置：include/hierarchical_embedding_layer.hpp（声明）和 src/hierarchical_embedding_layer.cu（实现）。

核心概念与架构

多表级联（Cascade）

HierarchicalEmbeddingLayer 接收一个 std::vector<table_ptr_t> 作为构造参数，这个有序列表定义了查找的优先级顺序：

tables_[0] → GPU 集合关联缓存（GpuTable，最快）
    ↓ miss
tables_[1] → CPU 哈希表（HostTable，如 nvhm_map / abseil_flat_map / stl_umap）
    ↓ miss
tables_[2] → 远程存储（RedisClusterTable / RocksDBTable，最慢但容量最大）

设计约束：

至少需要一张表（NVE_CHECK_(tables_.size() > 0)）
GPU 表必须连续排列在前（不允许 CPU 表之后再有 GPU 表，因为无性能意义）
所有表的行大小必须一致（get_max_row_size() 相同）
可选提供默认嵌入向量（default_embedding），用于在所有表都 miss 时的回退值。此时最后一张表必须是 host table

Config 结构体

struct Config {
  std::string layer_name;
  std::shared_ptr<InsertHeuristic> insert_heuristic = nullptr;
  int64_t min_insert_freq_gpu = 0;
  int64_t min_insert_freq_host = 0;
  int64_t min_insert_size_gpu = 1 << 16;   // 65536
  int64_t min_insert_size_host = 0;
  std::vector<uint8_t> default_embedding = {};
};

与 LinearUVMEmbeddingLayer 相比，增加了：

min_insert_freq_host：Host 表的最小插入间隔频率
min_insert_size_host：Host 表的最小插入批量（默认 0，即每次有 miss 就立即插入）
default_embedding：所有表都 miss 时的默认值。默认空表示缺失键行为未定义

lookup() 查找流程的完整实现

HierarchicalEmbeddingLayer::lookup() 是整个 NVE 中最复杂的操作，其核心逻辑围绕每张表独立工作、逐级传递 hitmask 展开。

数据流全景

用户调用 layer->lookup(ctx, num_keys, keys, output, ...)
  │
  ├── ① ScopedDevice: 切换到 GPU
  │
  ├── ② 分配 hitmask 缓冲区，初始化为 0
  │     hitmask[i] = 0 → 键 i 尚未被任何表命中
  │
  ├── ③ 遍历每张表，执行查找:
  │     for i = 0; i < num_tables; i++:
  │       │
  │       ├── 重置该表的 lookup counter
  │       │
  │       ├── table->find_bw(ctx, num_keys, keys_bw, hitmask_bw,
  │       │                  output_stride, output_bw, value_sizes)
  │       │   └── 表在查找时跳过 hitmask 中已标记为 1 的键
  │       │   └── 命中后在 hitmask 中将对应位设为 1
  │       │   └── 将命中键的数据写入 output 的对应位置
  │       │
  │       └── table->get_lookup_counter() 获取该表的命中数
  │
  ├── ④ 同步与命中率计算:
  │     │ 至少有一张 GPU 表时需要 cudaStreamSynchronize
  │     │ 计算每张表的 hitrate = hits / left_keys
  │     │ left_keys = num_keys - 已累加的所有 hits
  │
  ├── ⑤ 默认嵌入填充:
  │     if (left_keys > 0 && default_embedding 非空)
  │       在 CPU 线程池中并行填充未命中位置的默认值
  │
  ├── ⑥ GPU scatter（关键步骤）:
  │     if (首表是 GPU && 末表是 Host)
  │       调用 EmbeddingForwardScatter kernel
  │       将 Host 端填充的数据 scatter 回 Device 输出缓冲区
  │
  ├── ⑦ 如果 output 不在最终位置 → cudaMemcpyAsync
  │
  ├── ⑧ 自动插入:
  │     for 每张表:
  │       auto_insert_handlers_[i]->auto_insert(...)
  │       收集该表的 miss 键 → 达到阈值后批量插入
  │
  └── ⑨ pooling → NVE_THROW_NOT_IMPLEMENTED_()

hitmask 驱动的级联查找

hitmask 是整个级联查找的核心数据结构。它是一个大位图，每个 bit 对应一个 key：

bit = 0：该键尚未被任何表命中，需要传递给当前表处理
bit = 1：该键已被前面的表命中，当前表应跳过

初始化：

// 分配 hitmask 缓冲区
const auto hitmask_elements = (num_keys + 63) / 64;
const auto hitmask_buffer_size = hitmask_elements * sizeof(max_bitmask_repr_t);

// 零初始化（所有 bit = 0，即全未命中）
if (hitmask_first_access == cudaMemoryTypeDevice) {
    cudaMemsetAsync(hitmask_buf, 0, hitmask_buffer_size, lookup_stream);
} else {
    std::memset(hitmask_buf, 0, hitmask_buffer_size);
}

每张表的 find_bw() 实现会做两件事：

只处理 hitmask 中 bit=0 的键（跳过已命中键）
对命中的键，将 hitmask 对应 bit 设为 1，并将数据写入 output

GpuTable 的 find_bw 在 UVM 回退查找（run_find_uvm）中同时处理 cache hit 和 UVM miss：cache tag 匹配则从 cache 数据区读取，否则从 UVM table 读取；两者都命中时标记 hitmask bit=1。

HostTable 的 find 实现（如 STLContainerTable::find）逐分区并行查找，仅在 hitmask bit=0 的条目上耗费查找资源。

hitmask 的数据传输路径

级联查找中，hitmask 可能需要在 GPU 和 CPU 之间传输：

场景 1: 首表 GPU + 次表 GPU → hitmask 全程在 Device
场景 2: 首表 GPU + 次表 Host → hitmask 需从 Device 回传到 Host
场景 3: 首表 Host + 次表 Host → hitmask 全程在 Host

BufferWrapper 的 access_buffer() 透明处理了这种传输：

auto* hitmask_host = hitmask_bw->access_buffer(
    cudaMemoryTypeHost, true /*copy_content*/, lookup_stream);
// hitmask 当前在 Device → 自动 cudaMemcpyAsync 到 Host

GpuTable 在层级中的角色

当 GpuTable 作为层级的首表时，其 find_bw 方法会：

执行 EmbedCacheSA 查找：GPU 集合关联缓存标签匹配
UVN 回退：如果 uvm_table 非空，cache miss 时直接从 UVM 内存读取（run_find_uvm）
更新 hitmask：所有在 UVM 中找到的键（cache hit + UVM hit）标记为已命中
计数 miss：仅记录 cache miss 的数量（传递给 auto_insert_handler 作为 GPU 插入决策依据）

当 GpuTable 作为非首表（即前面已经有其他表）时，其 find_bw 的 hitmask 会包含来自前面所有表的命中信息——它只需要处理前面表未能命中的键。

CPU 端 HostTable 的查找与 gather

对于 CPU 端的 host table，查找流程在 cpu_ops/cpu_gather.h 中的 cpu_kernel_gather 实现：

template<typename IndexT>
int cpu_kernel_gather(thread_pool_ptr_t thread_pool,
              uint64_t n,
              const IndexT* keys,
              max_bitmask_repr_t* hit_mask,
              size_t value_stride,
              void* values,
              int8_t* cpu_table_ptr,
              size_t row_size_in_bytes,
              uint64_t num_threads)
{
    const auto gather_task = [=] (const size_t idx) {
        for (uint64_t i = 0; i < keys_per_task; i++) {
            if ((hit_mask[...] & bit_mask) == 0) {
                IndexT key = keys[...];
                memcpy(dst_ptr, src_ptr, row_size_in_bytes);
            }
        }
    };
    thread_pool->execute_n(0, num_threads, gather_task);

    // 全部完成后统一设置所有 hitmask bit=1
    memset(hit_mask, 0xff, ...);
    return 0;
}

关键点：

线程池并行：通过 thread_pool->execute_n 将工作分派到多个 CPU 核心
memcpy 直接读取：从线性 CPU 内存中 memcpy 嵌入向量到 output
批量设置 hitmask：全部完成后统一设置所有 bit 为 1，表示这些键已被 host table 完全处理
对齐优化：每个任务的 keys 数对齐到 64（hitmask 的 bit 边界）

对于基于哈希表的 host table（如 STLContainerTable 或 AbseilFlatMapTable），查找流程是先通过哈希表查询键是否存在，如果存在则从对应的内存槽中拷贝数据。

EmbeddingForwardScatter — GPU scatter kernel

当层级结构为首表 GPU + 末表 Host 时，最终的查找数据可能散落在 Host 和 Device 两处：GPU cache 命中的数据在 Device 上，CPU host table 填充的数据在 Host 上。EmbeddingForwardScatter kernel 负责将 Host 数据 scatter 回 Device 输出缓冲区。

kernel 定义在 cuda_ops/scatter.cuh：

template<uint32_t SubwarpWidth, typename DataType>
__global__ void EmbedScatter(
    const int8_t* __restrict__ src,       // Host 端源数据
    int8_t* __restrict__ dst,             // Device 端目标输出
    const uint32_t embed_width_in_bytes,
    const uint64_t* __restrict__ hit_mask, // GPU 端 hitmask
    const int32_t num_indices)
{
    const int embed = blockIdx.x * blockDim.y + threadIdx.y;
    if (embed >= num_indices) return;

    // 检查 hitmask: 如果该键已被 GPU 命中 → 跳过
    const int mask_entry = embed / 64;
    const int mask_bit = embed % 64;
    if (hit_mask[mask_entry] & (1ULL << mask_bit)) {
      return;  // GPU 已命中，无需 scatter
    }

    // 只有 GPU 未命中的键 → 从 Host 源拷贝到 Device 目标
    memcpy_warp<SubwarpWidth, DataType>(dst, src, embed_width_in_bytes);
}

这个 kernel 的精妙之处在于：

只传输缺失数据：GPU 已命中的键不参与拷贝，节省带宽
Subwarp 并行：使用 memcpy_warp 以 subwarp 粒度高效拷贝
向量化加载：根据对齐情况选择 float4/float2/float 等加载粒度

默认嵌入向量填充

当所有表都 miss 且配置了 default_embedding 时，使用 CPU 线程池并行填充：

if ((left_keys > 0) && (config_.default_embedding.size() > 0)) {
    auto* hit_mask_buf = hitmask_bw->access_buffer(cudaMemoryTypeUnregistered, true, stream);
    auto* output_buf = output_bw->access_buffer(cudaMemoryTypeUnregistered, true, stream);

    const auto fill_default_task = [=](const int64_t idx) {
        for (int64_t k = start_key; k < end_key; k++) {
            if (hit_mask_buf[k / 64] 的第 k%64 位 == 0) {
                memcpy(output_bytes + k * output_stride, default_emb, row_size);
            }
        }
    };
    thread_pool->execute_n(0, num_tasks, fill_default_task);
}

重要约束：使用 default_embedding 时最后一张表必须是 host 表，因为默认值填充在 CPU 上执行。如果最后是 GPU 表，则无法保证所有键能被覆盖。

命中率计算与报告

left_keys = num_keys;
for (size_t i = 0; i < num_tables; i++) {
    if (tables_[i]->lookup_counter_hits() == false) {
        table_hits[i] = left_keys - table_hits[i];  // miss → hit
    }
    table_hitrates[i] = (float)table_hits[i] / left_keys;
    left_keys -= table_hits[i];
}

关键观察：

GpuTable 计数的是 miss（lookup_counter_hits() == false）
HostTable 计数的是 hit（lookup_counter_hits() == true）
hitrates[i] 表示第 i 张表在当前剩余未命中键中的占比
最终 hitrates 数组被归一化为相对于 num_keys 的比例

AutoInsertHandler 的多表协同

HierarchicalEmbeddingLayer 为每张表各维护一个 AutoInsertHandler：

std::vector<std::shared_ptr<AutoInsertHandler>> auto_insert_handlers_;

在构造函数中创建，每层的参数独立：

for (size_t i = 0; i < tables_.size(); i++) {
    bool gpu_table = (table->get_device_id() >= 0);
    auto_insert_handlers_.push_back(std::make_shared<AutoInsertHandler>(
        heuristic,
        table, i, allocator_,
        gpu_table ? config_.min_insert_freq_gpu : config_.min_insert_freq_host,
        gpu_table ? config_.min_insert_size_gpu : config_.min_insert_size_host,
        sizeof(KeyType), gpu_device_, &invalid_key
    ));
}

在 lookup() 末尾触发自动插入：

for (size_t i = 0; i < num_tables; i++) {
    auto_insert_handlers_[i]->auto_insert(
        layer_ctx, keys_bw, output_bw, table_hitrates[i],
        num_keys, output_stride,
        (left_keys > 0) ? hitmask_bw : nullptr
    );
}

默认 InsertHeuristic 的阈值设置

for (size_t i = 0; i < tables_.size(); i++) {
    if (i == last_host_idx) {
        thresholds.push_back(0.0f);  // 最后一张 host 表永不触发插入
    } else {
        thresholds.push_back(DEFAULT_THRESHOLD);  // 0.75
    }
}

这个设计很精妙：最后一张 host 表的阈值为 0.0，意味着系统不会试图向最后的持久化存储层（如 RocksDB 或 Redis）自动插入数据，因为该层已经拥有全量数据。

insert() 实现的区别

与 LinearUVMEmbeddingLayer（固定 table_id=0）不同，HierarchicalEmbeddingLayer::insert() 接受 table_id 参数指定目标表：

void insert(context_ptr_t& ctx, const int64_t num_keys, const void* keys,
            ..., const int64_t table_id) override {
    if (table_id < 0 || table_id >= tables_.size()) {
        NVE_LOG_INFO_("Insert called with invalid table_id - ignored call");
        return;
    }
    auto& table = tables_.at(table_id);
    table->insert_bw(table_ctx, num_keys, keys_bw, value_stride, value_size, values_bw);
}

这允许用户精确控制数据插入到哪一层：插入到 GPU cache（table_id=0）、插入到 CPU host table（table_id=1）或插入到远程存储（table_id=2）。

update() / accumulate() 的多表广播

update() 和 accumulate() 会广播到所有表：

void update(...) {
    for (size_t i = 0; i < tables_.size(); i++) {
        if (!auto_insert_handlers_.empty()) {
            auto_insert_handlers_[i]->lock_modify();
        }
        table->update_bw(table_ctx, num_keys, keys_bw, value_stride, value_size, values_bw);
        if (!auto_insert_handlers_.empty()) {
            auto_insert_handlers_[i]->unlock_modify();
        }
    }
}

设计取舍：

每张表收到完整的 keys 和 values，而不是只发送 miss 的键
每张表只更新自身已有条目
自动插入 handler 在 update 期间被锁定，防止插入与更新的并发冲突

执行上下文创建

template <typename KeyType>
context_ptr_t HierarchicalEmbeddingLayer<KeyType>::create_execution_context(...) {
    std::vector<context_ptr_t> table_contexts;
    for (auto& t : tables_) {
        auto ctx = t->create_execution_context(lookup_stream, modify_stream,
                                                thread_pool, actual_allocator);
        table_contexts.push_back(std::move(ctx));
    }
    return std::make_shared<LayerExecutionContext>(
        lookup_stream, modify_stream, thread_pool, actual_allocator, table_contexts);
}

每个子表各自创建自己的 ExecutionContext，统一包装在 LayerExecutionContext 中。在 lookup 时通过 layer_ctx->table_contexts_.at(i) 访问。

与 LinearUVMEmbeddingLayer 的架构对比

维度	LinearUVMEmbeddingLayer	HierarchicalEmbeddingLayer
层级数	2 层（GPU cache + UVM table）	N 层（可扩展）
第二级存储	UVM 线性内存（硬件管理）	CPU 哈希表 + 远程表（软件管理）
二级访问	GPU 直接读取（UVM page fault）	CPU 线程池 + memcpy 或网络
hitmask 用法	无（GpuTable 内部使用）	核心数据结构，层间传递
scatter	无（全程 GPU 端）	EmbeddingForwardScatter kernel
插入策略	只针对 GPU cache	每层独立 AutoInsertHandler
default_embedding	不支持	支持
pooling	支持（find_and_combine）	暂未实现（throw）
update/accumulate	单表更新	多表广播

典型配置示例

C++ API

// 1. GPU cache
GPUTableConfig gpu_cache_cfg;
gpu_cache_cfg.device_id = 0;
gpu_cache_cfg.cache_size = 1ULL << 31;       // 2GB GPU cache
gpu_cache_cfg.row_size_in_bytes = 128;
gpu_cache_cfg.uvm_table = uvm_table;
gpu_cache_cfg.count_misses = true;
auto gpu_table = std::make_shared<GpuTable<int64_t>>(gpu_cache_cfg);

// 2. CPU host table (RocksDB-backed)
auto json_config = R"({
  "table_factories": {
    "my_rocksdb": {
      "implementation": "rocksdb",
      "path": "/data/embeddings/",
      "read_only": true
    }
  },
  "tables": {
    "0": "my_rocksdb"
  }
})"_json;
auto host_db = build_host_database(json_config);
auto host_table = host_db.at(0);

// 3. Hierarchical embedding layer
std::vector<table_ptr_t> tables = {gpu_table, host_table};

HierarchicalEmbeddingLayer<int64_t>::Config cfg;
cfg.layer_name = "my_hier_layer";
cfg.min_insert_size_gpu = 65536;

HierarchicalEmbeddingLayer<int64_t> layer(cfg, tables);

Python API

import pynve

embedding = pynve.NVEmbedding(
    num_embeddings=1_000_000_000,
    embedding_dim=128,
    cache_type="Hierarchical",
    cache_size=2_000_000_000,        # 2GB GPU cache
    host_cache_size=50_000_000_000,  # 50GB host cache
)

性能特征

查找延迟分布

场景	延迟	说明
GPU cache hit	最低	纯 GPU kernel，几微秒
GPU cache miss → UVM hit	中	UVM page fault，几十微秒
GPU miss → CPU host table hit	较高	host 回传 + 线程池查找，百微秒级
所有表 miss + default	最高	线程池填充，取决于并行度

关键调优参数

参数	作用	调优建议
`tables_` 顺序	先快后慢	GPU cache → host table → remote
`min_insert_size_gpu`	GPU 缓存插入批量	增大则插入更少、batch 更大
`min_insert_size_host`	Host 表插入批量	默认 0（立即插入），大表可适当增大
`default_embedding`	所有表 miss 时的回退值	推荐系统可设为全 0 向量
`InsertHeuristic` 阈值	各层触发频率	默认 0.75，可统计调整

总结

HierarchicalEmbeddingLayer 是 NVE 三级存储层级（GPU → CPU → Remote）的最高层封装。其核心机制是 hitmask 驱动的级联查找——每张表只处理前面表未命中的键，避免了重复查找。

与 LinearUVMEmbeddingLayer 的 UVM 硬件透明回退不同，HierarchicalEmbeddingLayer 的层间数据流需要显式的 hitmask 传递和 scatter scatter 操作，代码复杂度更高，但也带来了更大的灵活性：用户可以在层级中插入任意类型的表（包括远程 Redis/RocksDB），甚至为每层配置独立的自动插入策略。对于远超显存和系统内存的超大规模嵌入表场景（数十亿到数万亿的 Embedding ID），HierarchicalEmbeddingLayer 是唯一的选择。