Building upon this formulation, the experts in both Qwen (Team, 2024) and Mixtral (Jiang et al., 2024) adopt the structure of LLaMA (Touvron et al., 2023). Specifically, the feed-forward network (FFN) within each expert consists of three linear layers that operate as in Eq. (2), where $\odot$ denotes element-wise multiplication, $W_{\text{up}}, W_{\text{gate}} \in \mathbb{R}^{d_h \times \ldots}$ ...
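As a minimal illustration of the three-layer expert described here, the following NumPy sketch implements a LLaMA-style gated FFN. Several details are assumptions rather than statements from the text: the activation is taken to be SiLU (as in the public LLaMA and Mixtral implementations), the third projection is called `w_down`, the intermediate width is called `d_m`, and inputs are treated as row vectors so that `w_up` and `w_gate` have shape `(d_h, d_m)`.

```python
import numpy as np


def silu(z):
    """SiLU (swish) activation: z * sigmoid(z). Assumed, not stated in the text."""
    return z / (1.0 + np.exp(-z))


def expert_ffn(x, w_gate, w_up, w_down):
    """Gated FFN of a single expert (illustrative sketch).

    x:      (..., d_h)  input hidden states
    w_gate: (d_h, d_m)  gate projection
    w_up:   (d_h, d_m)  up projection
    w_down: (d_m, d_h)  down projection (hypothetical name)
    """
    gate = silu(x @ w_gate)      # activated gate branch
    up = x @ w_up                # linear up branch
    return (gate * up) @ w_down  # element-wise product, then project back


# Usage with small, arbitrary sizes (d_h and d_m are illustrative values).
d_h, d_m = 8, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((4, d_h))
w_gate = rng.standard_normal((d_h, d_m))
w_up = rng.standard_normal((d_h, d_m))
w_down = rng.standard_normal((d_m, d_h))
y = expert_ffn(x, w_gate, w_up, w_down)  # one output row per input row
```

The element-wise product between the two branches is what makes the layer gated: each intermediate unit of the up branch is scaled by the corresponding activated gate value before the down projection maps the result back to width `d_h`.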