Qwen 3 Next
Qwen3-Next represents the next-generation foundation models optimized for extreme context length and large-scale parameter efficiency. The series introduces architectural innovations including Hybrid Attention (Gated DeltaNet + Gated Attention), High-Sparsity MoE with 1:50 activation ratio, and Multi-Token Prediction for enhanced performance and inference acceleration.
This guide shows how to fine-tune it with Axolotl with multi-turn conversations and proper masking.
Getting started
Install Axolotl following the installation guide.
Install Cut Cross Entropy to reduce training VRAM usage.
Install FLA for improved performance
uv pip uninstall causal-conv1d && uv pip install flash-linear-attention==0.4.1- Run the finetuning example:
axolotl train examples/qwen3-next/qwen3-next-80b-a3b-qlora.yamlThis config uses about ~47 GiB (no target experts) and ~71GiB (target experts) VRAM.
Let us know how it goes. Happy finetuning! 🚀
TIPS
- For inference, you can experiment with
temperature: 0.7,top_p: 0.8,top_k: 20, andmin_p: 0. - You can run a full finetuning by removing the
adapter: qloraandload_in_4bit: truefrom the config. See Multi-GPU section below. - Read more on how to load your own dataset at docs.
- The dataset format follows the OpenAI Messages format as seen here.