Fading Coder

One Final Commit for the Last Sprint


Understanding TritonServer Cache Manager Architecture and Implementation


Overview

TritonServer implements a caching layer to boost inference performance. The core mechanism stores inference results in a cache; when a subsequent identical request arrives, Triton can return the result directly from the cache instead of re-executing inference, significantly reducing latency and conserving computational resources.

Cache keys are generated by hashing the model name, version, input tensor names, and model inputs together. This creates a unique identifier for each inference request.
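The exact hash Triton computes is internal to the server, but the idea can be sketched as follows. The `hash_combine` helper and `MakeCacheKey` function below are illustrative stand-ins, not Triton's actual implementation:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Fold a string into an accumulated hash seed (boost-style combiner).
inline void hash_combine(std::size_t& seed, const std::string& v) {
  seed ^= std::hash<std::string>{}(v) + 0x9e3779b97f4a7c15ULL + (seed << 6) +
          (seed >> 2);
}

// Combine the request identity fields into one cache key, mirroring the
// inputs Triton hashes: model name, version, tensor names, and tensor data.
std::size_t MakeCacheKey(
    const std::string& model_name, const std::string& model_version,
    const std::vector<std::pair<std::string, std::string>>& inputs) {
  std::size_t seed = 0;
  hash_combine(seed, model_name);
  hash_combine(seed, model_version);
  for (const auto& [tensor_name, tensor_bytes] : inputs) {
    hash_combine(seed, tensor_name);   // input tensor name
    hash_combine(seed, tensor_bytes);  // raw input data
  }
  return seed;
}
```

Because every identity field feeds the hash, any change to the model version or input bytes yields a different key, which is why only exactly identical requests produce cache hits.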

Triton supports two cache backends: in-memory (local) and Redis (redis). Each is selected and configured via --cache-config command-line parameters at startup.

Cache Implementation Architecture

Triton employs a plugin-style architecture for cache implementations, defining four standard APIs that every cache backend must implement:

TRITONSERVER_Error*
TRITONCACHE_CacheInitialize(TRITONCACHE_Cache** cache, const char* cache_config)

TRITONSERVER_Error*
TRITONCACHE_CacheFinalize(TRITONCACHE_Cache* cache)

TRITONSERVER_Error*
TRITONCACHE_CacheLookup(
    TRITONCACHE_Cache* cache, const char* key, TRITONCACHE_CacheEntry* entry,
    TRITONCACHE_Allocator* allocator)

TRITONSERVER_Error*
TRITONCACHE_CacheInsert(
    TRITONCACHE_Cache* cache, const char* key, TRITONCACHE_CacheEntry* entry,
    TRITONCACHE_Allocator* allocator)

These four functions handle initialization, cleanup, cache lookup, and cache insertion respectively. This design decouples Triton from specific cache implementations—Triton only interacts with these standardized APIs, while each cache backend provides its own implementation.
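The contract can be illustrated with a toy backend. The `DemoCache` types below are stand-ins, not the real opaque `TRITONCACHE_Cache*` and `TRITONCACHE_CacheEntry*` handles, but they mirror the initialize / insert / lookup / finalize lifecycle:

```cpp
#include <string>
#include <unordered_map>

// Stand-in for an opaque cache handle; a real backend would hide this
// behind TRITONCACHE_Cache*.
struct DemoCache {
  std::unordered_map<std::string, std::string> store;
};

// Mirrors TRITONCACHE_CacheInitialize: construct a backend from its config.
DemoCache* DemoCacheInitialize(const std::string& /*config*/) {
  return new DemoCache{};
}

// Mirrors TRITONCACHE_CacheInsert: store a serialized entry under a key.
void DemoCacheInsert(
    DemoCache* c, const std::string& key, const std::string& value) {
  c->store[key] = value;
}

// Mirrors TRITONCACHE_CacheLookup: fetch an entry by key, reporting a miss.
bool DemoCacheLookup(
    const DemoCache* c, const std::string& key, std::string* out) {
  auto it = c->store.find(key);
  if (it == c->store.end()) {
    return false;  // cache miss
  }
  *out = it->second;
  return true;  // cache hit
}

// Mirrors TRITONCACHE_CacheFinalize: release the backend.
void DemoCacheFinalize(DemoCache* c) { delete c; }
```

Triton itself never sees the `DemoCache` internals; it only calls through the four entry points, which is exactly the decoupling the plugin architecture provides.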

Redis Cache Backend

Startup Command Options

Local (in-memory) cache:

tritonserver --model-repository=/models --log-verbose=1 --cache-config=local,size=1048576 --log-file=output.log

Redis cache:

tritonserver --model-repository=/models --log-verbose=1 --cache-config=redis,host=172.17.0.1 --cache-config=redis,port=6379 --cache-config=redis,password="secret"

TRITONCACHE_CacheInitialize Implementation

Located in redis_cache/src/cache_api.cc, this function creates a RedisCache instance based on configuration parameters:

TRITONSERVER_Error*
TRITONCACHE_CacheInitialize(TRITONCACHE_Cache** cache, const char* cache_config)
{
  if (cache == nullptr) {
    return TRITONSERVER_ErrorNew(
        TRITONSERVER_ERROR_INVALID_ARG, "cache pointer cannot be null");
  }
  if (cache_config == nullptr) {
    return TRITONSERVER_ErrorNew(
        TRITONSERVER_ERROR_INVALID_ARG, "cache configuration cannot be null");
  }

  std::unique_ptr<RedisCache> cacheInstance;
  RETURN_IF_ERROR(RedisCache::Create(cache_config, &cacheInstance));
  *cache = reinterpret_cast<TRITONCACHE_Cache*>(cacheInstance.release());
  return nullptr;
}

The static Create factory method handles object construction with a typical Triton pattern:

TRITONSERVER_Error*
RedisCache::Create(
    const std::string& cache_config, std::unique_ptr<RedisCache>* cache)
{
  rapidjson::Document document;
  document.Parse(cache_config.c_str());
  
  if (!document.HasMember("host") || !document.HasMember("port")) {
    return TRITONSERVER_ErrorNew(
        TRITONSERVER_ERROR_INVALID_ARG,
        "RedisCache initialization failed: missing required 'host' and 'port' parameters");
  }

  sw::redis::ConnectionOptions connOptions;
  sw::redis::ConnectionPoolOptions poolOptions;

  // Environment variable overrides
  setOptionFromEnv(USERNAME_ENV_VAR_NAME, connOptions.user);
  setOptionFromEnv(PASSWORD_ENV_VAR_NAME, connOptions.password);

  setOption("host", connOptions.host, document);
  setOption("port", connOptions.port, document);
  setOption("user", connOptions.user, document);
  setOption("password", connOptions.password, document);
  setOption("db", connOptions.db, document);
  setOption("connect_timeout", connOptions.connect_timeout, document);
  setOption("socket_timeout", connOptions.socket_timeout, document);
  setOption("pool_size", poolOptions.size, document);
  setOption("wait_timeout", poolOptions.wait_timeout, document);
  
  if (!document.HasMember("wait_timeout")) {
    poolOptions.wait_timeout = std::chrono::milliseconds(1000);
  }

  // TLS configuration
  if (document.HasMember("tls_enabled")) {
    connOptions.tls.enabled =
        strcmp(document["tls_enabled"].GetString(), "true") == 0;
    setOption("cert", connOptions.tls.cert, document);
    setOption("key", connOptions.tls.key, document);
    setOption("cacert", connOptions.tls.cacert, document);
    setOption("cacert_dir", connOptions.tls.cacertdir, document);
    setOption("sni", connOptions.tls.sni, document);
  }

  try {
    cache->reset(new RedisCache(connOptions, poolOptions));
  }
  catch (const std::exception& ex) {
    return TRITONSERVER_ErrorNew(
        TRITONSERVER_ERROR_INTERNAL,
        ("RedisCache initialization failed: " + std::string(ex.what())).c_str());
  }
  return nullptr;
}
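For reference, Triton hands the --cache-config key/value pairs to the cache implementation as a single JSON string. For the Redis startup command shown earlier, the parsed `cache_config` would look roughly like this (the exact formatting is Triton's; this is an assumption for illustration):

```json
{
  "host": "172.17.0.1",
  "port": "6379",
  "password": "secret"
}
```

This is why `Create` can validate required parameters simply by checking `HasMember("host")` and `HasMember("port")` on the parsed document.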

The RedisCache constructor delegates to an init_client helper, which creates the client via the redis++ library and verifies connectivity with a ping:

std::unique_ptr<sw::redis::Redis>
init_client(
    const sw::redis::ConnectionOptions& connectionOptions,
    sw::redis::ConnectionPoolOptions poolOptions)
{
  std::unique_ptr<sw::redis::Redis> redis =
      std::make_unique<sw::redis::Redis>(connectionOptions, poolOptions);
  const auto pingMessage = "Triton RedisCache client connected";
  if (redis->ping(pingMessage) != pingMessage) {
    throw std::runtime_error("Redis server ping failed");
  }

  LOG_VERBOSE(1) << "Redis connection established successfully";
  return redis;
}

TRITONCACHE_CacheFinalize Implementation

TRITONSERVER_Error*
TRITONCACHE_CacheFinalize(TRITONCACHE_Cache* cache)
{
  if (cache == nullptr) {
    return TRITONSERVER_ErrorNew(
        TRITONSERVER_ERROR_INVALID_ARG, "cache pointer cannot be null");
  }

  delete reinterpret_cast<RedisCache*>(cache);
  return nullptr;
}

TRITONCACHE_CacheInsert Implementation

This function serializes cache entries into Redis. The entry parameter is a CacheEntry object that encapsulates the cached inference results:

TRITONSERVER_Error*
TRITONCACHE_CacheInsert(
    TRITONCACHE_Cache* cache, const char* key, TRITONCACHE_CacheEntry* entry,
    TRITONCACHE_Allocator* allocator)
{
  RETURN_IF_ERROR(CheckArgs(cache, key, entry, allocator));
  auto redisBackend = reinterpret_cast<RedisCache*>(cache);
  CacheEntry serializedData;
  size_t bufferCount = 0;
  
  RETURN_IF_ERROR(TRITONCACHE_CacheEntryBufferCount(entry, &bufferCount));
  std::vector<std::shared_ptr<char[]>> managedBufs;
  
  for (size_t idx = 0; idx < bufferCount; idx++) {
    TRITONSERVER_BufferAttributes* attr = nullptr;
    RETURN_IF_ERROR(TRITONSERVER_BufferAttributesNew(&attr));
    std::shared_ptr<TRITONSERVER_BufferAttributes> guardedAttr(
        attr, TRITONSERVER_BufferAttributesDelete);
    
    void* dataPtr = nullptr;
    size_t dataSize = 0;
    int64_t memoryTypeId;
    TRITONSERVER_MemoryType memoryType;
    
    RETURN_IF_ERROR(TRITONCACHE_CacheEntryGetBuffer(entry, idx, &dataPtr, attr));
    RETURN_IF_ERROR(TRITONSERVER_BufferAttributesByteSize(attr, &dataSize));
    RETURN_IF_ERROR(TRITONSERVER_BufferAttributesMemoryType(attr, &memoryType));
    RETURN_IF_ERROR(TRITONSERVER_BufferAttributesMemoryTypeId(attr, &memoryTypeId));

    if (!dataSize) {
      return TRITONSERVER_ErrorNew(
          TRITONSERVER_ERROR_INTERNAL, "buffer size cannot be zero");
    }
    
    if (memoryType != TRITONSERVER_MEMORY_CPU &&
        memoryType != TRITONSERVER_MEMORY_CPU_PINNED) {
      return TRITONSERVER_ErrorNew(
          TRITONSERVER_ERROR_INVALID_ARG,
          "currently only CPU memory buffers are supported for caching");
    }

    std::shared_ptr<char[]> localBuffer(new char[dataSize]);
    RETURN_IF_ERROR(TRITONCACHE_CacheEntrySetBuffer(
        entry, idx, static_cast<void*>(localBuffer.get()), nullptr));

    managedBufs.push_back(localBuffer);
    serializedData.items.insert(std::make_pair(
        getFieldName(idx, fieldType::bufferSize), std::to_string(dataSize)));
    serializedData.items.insert(std::make_pair(
        getFieldName(idx, fieldType::memoryType), std::to_string(memoryType)));
    serializedData.items.insert(std::make_pair(
        getFieldName(idx, fieldType::memoryTypeId),
        std::to_string(memoryTypeId)));
  }

  // Ask Triton to copy the entry's original buffers into the local buffers
  // registered above before they are serialized to Redis.
  RETURN_IF_ERROR(TRITONCACHE_Copy(allocator, entry));
  
  for (size_t idx = 0; idx < bufferCount; idx++) {
    auto bytesToTransfer =
        std::stoi(serializedData.items.at(getFieldName(idx, fieldType::bufferSize)));
    serializedData.items.insert(std::make_pair(
        getFieldName(idx, fieldType::buffer),
        std::string(managedBufs.at(idx).get(), bytesToTransfer)));
  }

  if (serializedData.items.size() % FIELDS_PER_BUFFER != 0) {
    return TRITONSERVER_ErrorNew(
        TRITONSERVER_ERROR_INVALID_ARG,
        "incomplete cache entry detected");
  }

  RETURN_IF_ERROR(redisBackend->Insert(key, serializedData));
  return nullptr;
}

Within the Redis cache backend, CacheEntry is defined as:

struct CacheEntry {
  size_t numBuffers = 0;
  std::unordered_map<std::string, std::string> items;
};
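Each buffer therefore contributes a fixed set of fields to the hash stored under the request key. The field names below are hypothetical (the real ones come from getFieldName), but the resulting Redis hash looks roughly like:

```text
HGETALL <request-hash-key>      (field names hypothetical)
  0:size    "4096"          byte size of buffer 0
  0:type    "0"             TRITONSERVER_MemoryType of buffer 0 (CPU)
  0:typeid  "0"             memory type id of buffer 0
  0:data    <raw bytes>     serialized contents of buffer 0
  1:size    ...             fields repeat for each buffer index
```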

TRITONCACHE_CacheLookup Implementation

This function retrieves the cached fields from the Redis hash and copies the results back into Triton's cache entry structure; it is essentially CacheInsert run in reverse.
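The article does not reproduce this function, but its core is the inverse of the serialization above: for each buffer index, read the size and data fields back out of the retrieved CacheEntry and materialize a buffer. A simplified sketch using the CacheEntry struct from this backend (the FieldName helper is a stand-in for getFieldName, and the real lookup hands the buffers back to Triton through the allocator rather than returning them):

```cpp
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

struct CacheEntry {
  size_t numBuffers = 0;
  std::unordered_map<std::string, std::string> items;
};

// Stand-in for getFieldName: joins the buffer index with a field suffix.
static std::string FieldName(size_t idx, const std::string& suffix) {
  return std::to_string(idx) + ":" + suffix;
}

// Rebuild the raw buffers from a deserialized CacheEntry, validating that
// the stored size field matches the actual payload length.
std::vector<std::string> ExtractBuffers(const CacheEntry& entry) {
  std::vector<std::string> buffers;
  for (size_t idx = 0; idx < entry.numBuffers; idx++) {
    const size_t size = std::stoul(entry.items.at(FieldName(idx, "size")));
    const std::string& data = entry.items.at(FieldName(idx, "data"));
    if (data.size() != size) {
      throw std::runtime_error("corrupt cache entry: size mismatch");
    }
    buffers.push_back(data);
  }
  return buffers;
}
```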

Redis Cache Backend Implementation Details

The Redis cache implementation leverages Redis hash tables for efficient storage. The implementation consists of three primary operations: client initialization, insertion, and lookup.

Client Initialization

The init_client helper shown earlier establishes the connection and verifies it with a ping, throwing a std::runtime_error if the server is unreachable.

Insert Operation

TRITONSERVER_Error*
RedisCache::Insert(const std::string& key, CacheEntry& entry)
{
  try {
    _client->hmset(key, entry.items.begin(), entry.items.end());
  }
  catch (const sw::redis::TimeoutError& e) {
    return handleError("Timeout inserting key: ", key, e.what());
  }
  catch (const sw::redis::IoError& e) {
    return handleError("Failed to insert key: ", key, e.what());
  }
  catch (const std::exception& e) {
    return handleError("Failed to insert key: ", key, e.what());
  }
  catch (...) {
    return handleError("Failed to insert key: ", key, "Unknown error.");
  }

  return nullptr;
}

The hmset command stores all map entries as fields of a Redis hash keyed by the inference request hash. (HMSET has been deprecated since Redis 4.0 in favor of the multi-field form of HSET, but redis++ still exposes it and the behavior is identical.)

Lookup Operation

std::pair<TRITONSERVER_Error*, CacheEntry>
RedisCache::Lookup(const std::string& key)
{
  CacheEntry entry;

  try {
    this->_client->hgetall(
        key, std::inserter(entry.items, entry.items.begin()));

    entry.numBuffers = entry.items.size() / FIELDS_PER_BUFFER;
    return {nullptr, entry};
  }
  catch (const sw::redis::TimeoutError& e) {
    return {handleError("Timeout retrieving key: ", key, e.what()), {}};
  }
  catch (const sw::redis::IoError& e) {
    return {handleError("Failed to retrieve key: ", key, e.what()), {}};
  }
  catch (const std::exception& e) {
    return {handleError("Failed to retrieve key: ", key, e.what()), {}};
  }
  catch (...) {
    return {handleError("Failed to retrieve key: ", key, "Unknown error."), {}};
  }
}

The hgetall command retrieves all fields and values from the hash table into the entry structure.

Troubleshooting Common Cache Configuration Issues

When enabling caching, ensure both server-level and model-level configurations are properly set:

Server startup with cache enabled:

tritonserver --model-repository=/models --cache-config=redis,host=172.17.0.1 --cache-config=redis,port=6379 --cache-config=redis,password="secret"

Model configuration file with response cache enabled:

response_cache {
  enable: true
}

If cached results are not appearing in Redis, verify network connectivity between Triton and Redis, confirm that response_cache is enabled in each model's configuration, and remember that the cache key covers the model name, version, and input data, so only exactly identical requests produce cache hits.

Tags: TritonServer
