Qwen represents Alibaba Cloud's foundational AI model initiative. Launched initially in April 2023, Qwen isn't a single model but a comprehensive suite of LLMs designed to tackle a diverse range of tasks, from natural language understanding and generation to code creation and even processing visual and auditory information.
Built upon the proven transformer architecture, the majority of Qwen models are released under the permissive Apache 2.0 license, readily available on platforms like Hugging Face. This open approach allows us to experiment, customize, and deploy these models locally or within our own infrastructure, complementing the API access provided via Alibaba Cloud.
Most of us are familiar with models like GPT, Claude, and Gemini, but Qwen is something of an underrated gem in the LLM landscape. So let's take a closer look at it.
## Qwen2.5 - the foundation

At the heart of the family lies Qwen2.5, a series of powerful and efficient text-based LLMs. It's a dense, decoder-only, transformer-based LLM with improved capabilities over Qwen2. These models serve as the foundation for many specialized variants and are excellent general-purpose tools.
Here are some key features of Qwen2.5:

- Qwen2.5 boasts strong capabilities in 29+ languages, including English, Chinese, French, Spanish, Arabic, and many others.
- Qwen2.5 can be used with agent frameworks, follow instructions, and generate structured outputs, particularly JSON (a minimal sketch follows below).
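To illustrate the structured-output point, here is a minimal sketch of prompting an open-weight Qwen2.5 instruct checkpoint for JSON via the Hugging Face transformers library. The model ID and prompt are just examples; any Qwen2.5-*-Instruct checkpoint should work the same way.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any Qwen2.5 instruct checkpoint should work; 7B is a reasonable default.
model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant. Always answer with valid JSON."},
    {"role": "user", "content": "Extract the city and date from: 'The conference takes place in Warsaw on May 12th.'"},
]

# Qwen2.5 ships a chat template, so we can format the conversation with it.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
response = tokenizer.decode(
    output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(response)  # e.g. {"city": "Warsaw", "date": "May 12th"}
```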
## Coder and Math - specialized experts

Building on the Qwen2.5 foundation, Alibaba has released specialized models fine-tuned for coding and mathematics. Below you can find key details about these models:

- Qwen2.5-Coder: a code-specialized variant covering code generation, completion, and repair across many programming languages.
- Qwen2.5-Math: a math-specialized variant supporting Tool-Integrated Reasoning (TIR). The model can decide to invoke external tools (like a calculator or symbolic solver) during its reasoning process, incorporating the results to improve accuracy; a sketch of such a loop follows below.

Both retain the general capabilities of the base model (Qwen2.5), but they are optimized for the two domains above.
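The exact TIR prompt format is model-specific, so the following is only a rough sketch of the control flow under assumed conventions (```python fenced blocks as tool calls, ```output blocks as results): generate, execute any emitted code, feed the result back, and let the model continue. Treat the tags and the round cap as illustrative assumptions, not the official Qwen2.5-Math protocol.

```python
import re
import subprocess
import sys

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Math-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

def generate(messages: list[dict]) -> str:
    """One plain generation step using the model's chat template."""
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=1024)
    return tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

messages = [
    {"role": "system", "content": "Solve the problem. You may emit Python code to compute intermediate results."},
    {"role": "user", "content": "What is the sum of all primes below 1000?"},
]

for _ in range(3):  # cap the number of tool rounds
    answer = generate(messages)
    messages.append({"role": "assistant", "content": answer})
    match = re.search(r"```python\n(.*?)```", answer, re.DOTALL)
    if match is None:
        break  # no tool call, treat this as the final answer
    # WARNING: executing model-generated code is unsafe outside a sandbox.
    result = subprocess.run(
        [sys.executable, "-c", match.group(1)],
        capture_output=True, text=True, timeout=10,
    )
    # Feed the tool output back so the model can incorporate it.
    messages.append({"role": "user", "content": f"```output\n{result.stdout}\n```"})

print(answer)
```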
## VL and Omni - multimodalities

Qwen extends beyond text, offering models that can understand and interact with visual and auditory information.
### Qwen2.5-VL: understanding images and videos

Qwen2.5-VL is a vision-language model series that extends Qwen2.5 with visual understanding, enabling image and video comprehension alongside text generation. This model excels at analyzing visual content and describing or reasoning about it in text form.
Key capabilities and use cases include:

- Visual grounding with structured outputs (e.g., bounding box coordinates as JSON) for objects.
- Visual Question Answering (VQA), Optical Character Recognition (OCR), document/chart analysis, video content analysis, multimedia chatbots.

Qwen2.5-VL is an open-weight model, but there are also two proprietary models available via the Alibaba API:

- qwen-vl-max: enhanced visual reasoning and instruction-following capabilities compared with qwen-vl-plus; best for complex tasks.
- qwen-vl-plus: enhanced detail and text recognition capabilities, supporting images with over one-million-pixel resolution and any aspect ratio; exceptional performance across various visual tasks.
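Here is a minimal sketch of image understanding with the open-weight model through transformers (assuming a recent transformers release with Qwen2.5-VL support and the `qwen-vl-utils` helper package; the image path is a placeholder):

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_name = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/invoice.png"},  # placeholder path
            {"type": "text", "text": "Extract the invoice number and total amount."},
        ],
    }
]

# Build the text prompt and collect the image/video tensors separately.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0])
```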
### Qwen2.5-Omni: real-time multimodal interaction

Qwen2.5-Omni is Qwen's most advanced end-to-end multimodal model, capable of perceiving and generating across text, vision, and audio modalities in real time. It introduces a novel Thinker-Talker architecture for simultaneous understanding and response generation, built on a Qwen2.5-7B backbone. You can check out this demo to see Qwen2.5-Omni in action.
## QVQ and QwQ - advanced reasoning models

Beyond standard multimodal capabilities, Qwen offers models specifically enhanced for complex reasoning, both visually and textually.
### QVQ (QVQ-Preview and QVQ-Max): deep visual reasoning

QVQ is a vision-language model series focused on Visual Question Answering and reasoning with visual evidence. It builds upon Qwen2.5-VL but emphasizes reasoning steps ("thinking") about images and videos. The initial release was QVQ-72B-Preview, demonstrating the concept of a model that can not only describe an image but also reason about it to solve complex tasks.
QVQ-Max is the successor to QVQ-Preview and is accessible via API only. Key points:

- It is built on Qwen2.5-VL. QVQ-Max employs optimizations like MoE for enhanced scalability and efficiency.
- Since QVQ-Max is proprietary, you can fall back to the open-weight Qwen2.5-VL if needed.

Here is another demo prepared by the Qwen team, showing the capabilities of QVQ-Max.
### QwQ: reinforced textual reasoning

QwQ (Qwen with Questions) is a specialized model in the Qwen family focused on improving reasoning via reinforcement learning. Based on the Qwen2.5-32B model, QwQ underwent intensive training (including multi-stage RL) to enhance its performance on challenging reasoning tasks across domains like math and coding. The result is a model that can tackle complex questions with deeper thinking and better accuracy than the base model.
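Usage looks the same as for any other Qwen instruct model; the practical difference is that QwQ emits a long chain of thought before the final answer. A minimal sketch follows (the `</think>` separator is how current QwQ checkpoints appear to delimit reasoning, but treat that detail as an assumption worth verifying against the model card):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "If 3 typists type 3 pages in 3 minutes, how many typists are needed to type 18 pages in 6 minutes?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Reasoning models need a generous token budget for the thinking phase.
output_ids = model.generate(**inputs, max_new_tokens=4096)
response = tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

# Assumption: reasoning and answer are separated by a closing </think> tag;
# if the tag is absent, rpartition just returns the whole response.
_, _, final_answer = response.rpartition("</think>")
print(final_answer.strip())
```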
## Qwen2.5-Max - the flagship

Qwen2.5-Max is the large-scale Mixture-of-Experts (MoE) version of Qwen, representing Alibaba's most advanced LLM in the Qwen2.5 generation. It scales the model capacity dramatically (hundreds of billions of parameters) while using experts to keep inference efficient. Qwen2.5-Max is positioned to compete with top-tier, GPT-4-class systems in capability.
## Qwen-Plus and Qwen-Turbo

On Alibaba Cloud Model Studio we can find two more flagship models: Qwen-Plus and Qwen-Turbo. There is not much information available about them, but they are positioned as lighter and faster variants of the Qwen2.5-Max model, and both are available via API only. You can find them on OpenRouter as well: Qwen-Plus, Qwen-Turbo.
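Since these models are API-only, access goes through Alibaba's OpenAI-compatible endpoint. A minimal sketch (the international DashScope base URL below is taken from Alibaba's docs, but double-check it and the exact model name for your region):

```python
import os

from openai import OpenAI  # pip install openai

# Alibaba Cloud Model Studio exposes an OpenAI-compatible API.
client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

response = client.chat.completions.create(
    model="qwen-plus",  # or "qwen-turbo" for the lighter variant
    messages=[{"role": "user", "content": "Summarize the Qwen model family in two sentences."}],
)
print(response.choices[0].message.content)
```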
Here is some key available information about them:
- Qwen-Plus: a balanced performance/cost option built on the Qwen2.5 base model.
- Qwen-Turbo: a faster, lower-cost option, also built on the Qwen2.5 base model.

This table summarizes the key characteristics of the main Qwen models discussed (excluding the Qwen-Plus and Qwen-Turbo models):
Model | Parameter sizes | Primary modality | Context window | Specialization | License | Multilingual |
---|---|---|---|---|---|---|
Qwen2.5 (Base) | 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B | Text | 128k (Input) / 8k (Output) | Foundational, General Purpose | Apache 2.0 | 29+ Languages |
Qwen2.5-Coder | 0.5B, 1.5B, 3B, 7B, 14B, 32B | Text (Code) | ~Tens of Thousands | Code Generation & Assistance | Apache 2.0 | Many Prog. Langs + Eng/Chi |
Qwen2.5-Math | 1.5B, 7B, 72B | Text (Math) | ~128k | Mathematical Reasoning, Tool Use (TIR) | Apache 2.0 | English & Chinese |
Qwen2.5-VL | 3B, 7B, 32B, 72B | Image/Video -> Text | Long Video + Large Text | Vision-Language Understanding | Apache 2.0 | Yes (Eng/Chi focused) |
Qwen2.5-Omni | 7B | Text/Image/Audio/Video -> Text/Speech | Streaming / Real-time | End-to-End Multimodal Interaction | Apache 2.0 | Yes (Speech Eng/Chi) |
QVQ-Max | ~72B | Image/Video -> Text (w/ Reasoning) | Extended Visual & Text | Deep Visual Reasoning (CoT) | Proprietary API | Yes (Eng/Chi focused) |
QwQ | 32B | Text (Reasoning) | ~32k | Reinforced Reasoning (Math/Logic/Code) | Apache 2.0 | Yes (Eng/Chi) |
Qwen2.5-Max | ~325B (MoE) | Text | 32k | Flagship Scale & Performance (MoE) | Proprietary API | Yes (Broad) |
(Note: Context window sizes can sometimes vary based on specific implementation or fine-tuning. The table provides typical or maximum advertised values.)
Qwen Chat is a chat UI for the Qwen family of models. It allows users to interact with the models in a conversational manner, making it easy to test and explore their capabilities for free.
If you want to try out the models locally, you can find them on Hugging Face and Ollama.
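For a quick local test, Ollama is probably the lowest-friction route, and it also exposes an OpenAI-compatible endpoint, so the same client code as above works. A sketch, assuming the `qwen2.5:7b` tag has already been pulled:

```python
from openai import OpenAI  # pip install openai

# Ollama serves an OpenAI-compatible API on localhost by default;
# the api_key is required by the client but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="qwen2.5:7b",  # run `ollama pull qwen2.5:7b` first
    messages=[{"role": "user", "content": "Give me three facts about the Qwen model family."}],
)
print(response.choices[0].message.content)
```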
The Qwen family from Alibaba Cloud presents a compelling and versatile suite of large language models. With a strong emphasis on open-source releases for many core and specialized models, we have access to genuinely powerful AI capabilities: from the foundational Qwen2.5 suitable for general tasks, to the specialized Coder and Math variants, the advanced multimodal Omni and VL models, the deep-reasoning QVQ and QwQ models, and the enterprise-scale Qwen2.5-Max. There is likely a Qwen model well-suited for your application needs 🤞