Noam Shazeer

Papers

2017
  Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

2020
  GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

2022
  Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
  ST-MoE: Designing Stable and Transferable Sparse Expert Models