Architectural Fundamentals of Sparse Mixture-of-Experts and Key-Value Caching
Mixture-of-Experts Architecture Overview

The Mixture-of-Experts (MoE) paradigm fundamentally modifies the standard Transformer decoder layer by replacing the monolithic feed-forward network (FFN) with a dynamic routing mechanism and a collection of specialized sub-networks. This architecture consist...
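To make the structure concrete, the following is a minimal sketch of such a layer in PyTorch-style Python. It is illustrative only: the class name MoELayer, the parameters d_ff and top_k, and the per-expert loop are assumptions for clarity, not the source's implementation, and production systems typically use fused, capacity-limited dispatch instead of a Python loop over experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    """Sketch of a sparse MoE block standing in for a dense FFN:
    a learned router picks top_k experts per token and combines
    their outputs with renormalized routing weights."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router: one logit per expert for each token.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is an independent FFN of the usual Transformer shape.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> flatten to a token list.
        tokens = x.reshape(-1, x.size(-1))
        logits = self.router(tokens)                          # (num_tokens, num_experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                  # renormalize over selected experts

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Tokens that routed to expert e, and the top-k slot they used.
            token_ids, slot = (indices == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])
        return out.reshape_as(x)
```

Because only top_k of the num_experts FFNs run per token, the layer's parameter count grows with the number of experts while the per-token compute stays close to that of a single dense FFN, which is the central trade-off the sparse MoE design exploits.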