Deploying Qwen3 with vLLM
Qwen3 Overview
Multiple thinking modes
You can add /think or /no_think to a user prompt or system message to switch the model's thinking mode on a per-turn basis, as in the example after this list.
- Thinking mode: the model reasons step by step and gives its final answer after careful deliberation. This suits complex problems that call for deeper thought.
- Non-thinking mode: the model returns fast, near-instant responses, suited to simple questions where speed matters more than depth.
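For example, with an OpenAI-compatible vLLM endpoint serving a hybrid Qwen3 checkpoint (the host, port, and model name below are illustrative and match the Docker Compose deployment later in this document), the switch can simply be appended to the user message:

```bash
# Disable the thinking phase for this turn by appending /no_think
# (endpoint and model name are examples; adjust to your deployment)
curl http://localhost:8003/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-30B-A3B-FP8",
        "messages": [
          {"role": "user", "content": "Briefly introduce the Qwen3 model family. /no_think"}
        ]
      }'
```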
Multilingual
- Supports 119 languages and dialects
MCP support
Qwen3-30B-A3B
- A small MoE model with roughly 30B total parameters and 3B activated parameters
- Requires 24GB+ of VRAM
Qwen3-Embedding & Qwen3-Reranker
Model Type | Models | Size | Layers | Sequence Length | Embedding Dimension | MRL Support | Instruction Aware |
---|---|---|---|---|---|---|---|
Text Embedding | Qwen3-Embedding-0.6B | 0.6B | 28 | 32K | 1024 | Yes | Yes |
Text Embedding | Qwen3-Embedding-4B | 4B | 36 | 32K | 2560 | Yes | Yes |
Text Embedding | Qwen3-Embedding-8B | 8B | 36 | 32K | 4096 | Yes | Yes |
Text Reranking | Qwen3-Reranker-0.6B | 0.6B | 28 | 32K | - | - | Yes |
Text Reranking | Qwen3-Reranker-4B | 4B | 36 | 32K | - | - | Yes |
Text Reranking | Qwen3-Reranker-8B | 8B | 36 | 32K | - | - | Yes |
- Economical: Embedding-4B + Reranker-4B (total VRAM requirement < 30GB)
- High-performance: Embedding-8B + Reranker-8B (requires multiple GPUs; 40%+ higher throughput)
Comparison with BGE-M3: a generational advantage across the board
Metric | Qwen3-8B | BGE-M3 | Advantage |
---|---|---|---|
Overall score | 70.58 | 59.56 | ↑11.02 |
Context length | 32K | 8K | ↑4x |
Retrieval (MSMARCO) | 57.65 | 40.88 | ↑41% |
Open-domain QA (NQ) | 10.06 | -3.11 | turns a negative score positive |
Multilingual understanding | 28.66 | 20.10 | ↑42% |
vLLM Installation
uv (preferred)
```bash
uv venv vllm --python 3.12 --seed
source vllm/bin/activate
uv pip install vllm
```

conda (has licensing concerns)

```bash
conda env list                     ## list all conda environments
conda create -n vllm python=3.12   ## create an environment with a specific Python version
conda activate vllm                ## activate an environment
conda env remove -n vllm           ## remove an environment
pip install vllm
```
Model Download
```bash
## set up the model download environment
```
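A minimal download sketch, assuming ModelScope is used to pull weights into the /home/models directory referenced by the deployments below (the model ID and target path are illustrative):

```bash
# Install the download tooling (ModelScope is an assumption; huggingface-cli works similarly)
pip install modelscope

# Download the model repository into the local models directory
modelscope download --model Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 \
  --local_dir /home/models/Qwen3-30B-A3B-Instruct-2507-FP8
```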
Starting the vLLM Service
vllm serve <model_path>
```bash
vllm serve /home/models/Qwen3-30B-A3B-Thinking-2507-FP8
```
- --tensor-parallel-size: distributes the model across the host GPUs; set it to the number of GPUs to use.
- --gpu-memory-utilization: the fraction of accelerator memory used for model weights, activations, and the KV cache, expressed as a ratio from 0.0 to 1.0 (default 0.9). For example, setting it to 0.8 limits the inference server's GPU memory consumption to 80%. Use the largest value that still deploys stably to maximize throughput.
- --max-model-len: limits the model's maximum context length, in tokens. Set it if the model's default context length is too long and causes memory problems.
- --max-num-batched-tokens: the maximum number of batched tokens per iteration, i.e. the largest batch (in tokens) processed per step. Increasing it can raise throughput but may increase output token latency.
- --max-num-seqs: the maximum number of sequences per iteration; critical for throughput!
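Putting these together, a complete launch command might look like the following sketch (values are illustrative; tune them to your GPUs and workload):

```bash
vllm serve /home/models/Qwen3-30B-A3B-Thinking-2507-FP8 \
  --served-model-name Qwen3-30B-A3B-Thinking-2507-FP8 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.8 \
  --max-model-len 32768 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256
```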
Parameter Reference
```
usage: vllm serve [-h] [--model MODEL]
```
Docker Compose Deployment
Base Model Deployment
Qwen3-30B-A3B
```yaml
services:
  Qwen3-30B-A3B-Instruct-2507-FP8:
    image: vllm/vllm-openai:v0.10.1.1  # vLLM OpenAI-compatible server image
    container_name: Qwen3-30B-A3B-Instruct-2507-FP8
    restart: unless-stopped
    profiles: ["Instruct"]
    volumes:
      - /home/models/Qwen3-30B-A3B-Instruct-2507-FP8:/models/Qwen3-30B-A3B-Instruct-2507-FP8  # mount the local model directory to /models inside the container
    command: ["--model", "/models/Qwen3-30B-A3B-Instruct-2507-FP8", "--served-model-name", "Qwen3-30B-A3B-Instruct-2507-FP8", "--gpu-memory-utilization", "0.80", "--max-model-len", "32768", "--tensor-parallel-size", "1"]
    ports:
      - 8003:8000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
  Qwen3-30B-A3B-Thinking-2507-FP8:
    image: vllm/vllm-openai:v0.10.1.1
    container_name: Qwen3-30B-A3B-Thinking-2507-FP8
    restart: unless-stopped
    profiles: ["Thinking"]
    volumes:
      - /home/models/Qwen3-30B-A3B-Thinking-2507-FP8:/models/Qwen3-30B-A3B-Thinking-2507-FP8
    command: ["--model", "/models/Qwen3-30B-A3B-Thinking-2507-FP8", "--served-model-name", "Qwen3-30B-A3B-Thinking-2507-FP8", "--gpu-memory-utilization", "0.80", "--max-model-len", "32768", "--tensor-parallel-size", "1", "--reasoning-parser", "deepseek_r1"]
    ports:
      - 8003:8000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
  Qwen3-30B-A3B-FP8:
    image: vllm/vllm-openai:v0.10.1.1
    container_name: Qwen3-30B-A3B-FP8
    restart: unless-stopped
    profiles: ["Instruct&Thinking"]
    volumes:
      - /home/models/Qwen3-30B-A3B-FP8:/models/Qwen3-30B-A3B-FP8
    command: ["--model", "/models/Qwen3-30B-A3B-FP8", "--served-model-name", "Qwen3-30B-A3B-FP8", "--gpu-memory-utilization", "0.80", "--max-model-len", "32768", "--tensor-parallel-size", "1", "--reasoning-parser", "deepseek_r1"]
    ports:
      - 8003:8000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
```

Start
```bash
docker compose --profile Instruct up -d
```
Verify with curl:
```json
{
    "model": "Qwen3-30B-A3B-Instruct-2507-FP8",
    "messages": [
        {
            "role": "system",
            "content": "You are a travel consultant. I am planning a 10-day trip to Europe in the summer of 2024, with Paris, Milan, and Madrid as the main destinations. The budget is 10,000 RMB per person. I want to experience the local culture and food and prefer comfortable accommodation. The user specifically asks that the must-see sights in each city be included, along with some free time."
        },
        {
            "role": "user",
            "content": "I plan to depart in mid-July; please give me a travel itinerary."
        }
    ],
    "temperature": 0.3,
    "stream": true
}
```
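Assuming the payload above is saved as request.json (the filename is arbitrary), it can be posted to the OpenAI-compatible endpoint exposed on port 8003:

```bash
curl http://localhost:8003/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @request.json
```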
Other Deployments
Qwen3-Embedding-8B
```yaml
services:
  Qwen3-Embedding-8B:
    container_name: Qwen3-Embedding-8B
    restart: "no"
    image: vllm/vllm-openai:v0.10.1.1
    ipc: host
    volumes:
      - /home/models/Qwen3-Embedding-8B:/models/Qwen3-Embedding-8B
    command: ["--model", "/models/Qwen3-Embedding-8B", "--served-model-name", "Qwen3-Embedding-8B", "--gpu-memory-utilization", "0.90"]
    ports:
      - 8001:8000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
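Once the container is up, the embedding endpoint on port 8001 can be smoke-tested with a request like this (the input text is illustrative):

```bash
curl http://localhost:8001/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-Embedding-8B",
        "input": ["What is the capital of France?"]
      }'
```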
Qwen3-Reranker-8B

```yaml
services:
  Qwen3-Reranker-8B:
    container_name: Qwen3-Reranker-8B
    restart: "no"
    image: vllm/vllm-openai:v0.10.1.1
    ipc: host
    volumes:
      - /home/models/Qwen3-Reranker-8B:/models/Qwen3-Reranker-8B
    command: ['--model', '/models/Qwen3-Reranker-8B', '--served-model-name', 'Qwen3-Reranker-8B', '--gpu-memory-utilization', '0.90', '--hf_overrides', '{"architectures": ["Qwen3ForSequenceClassification"], "classifier_from_token": ["no", "yes"], "is_original_qwen3_reranker": true}']
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - 8002:8000
```
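The reranker can be checked similarly; this sketch assumes vLLM's rerank endpoint is exposed at /rerank on port 8002 (the query and documents are illustrative):

```bash
curl http://localhost:8002/rerank \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-Reranker-8B",
        "query": "What is the capital of France?",
        "documents": [
          "Paris is the capital of France.",
          "Berlin is the capital of Germany."
        ]
      }'
```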
Integrating the Qwen Model Service with FastAPI
- Dependency file (requirements.txt); the required Python libraries:
```
fastapi==0.116.1
uvicorn==0.35.0
openai==1.100.1
dashscope==1.24.1
python-dotenv==1.1.1
```

Install the dependencies:

```bash
pip install -r requirements.txt
```

Configure the DeepSeek and Qwen connection details as environment variables:

```
# DeepSeek configuration
DEEPSEEK_API_KEY = "sk-your-deepseek-key"
DEEPSEEK_BASE_URL = "https://api.deepseek.com"
# Qwen3 configuration (points at the base model service deployed above)
QWEN3_API_KEY = "sk-your-qwen-key"
QWEN3_API_BASE_URL = "http://192.168.103.224:8003/v1"
```
API code

```python
import os

import uvicorn
from dotenv import load_dotenv
from fastapi import FastAPI, HTTPException
from openai import OpenAI
from pydantic import BaseModel

# Load API keys and base URLs from the environment (or a .env file)
load_dotenv()

app = FastAPI()

# Backend configuration: both providers expose an OpenAI-compatible API
MODEL_CONFIG = {
    "deepseek": {
        "client": OpenAI(
            api_key=os.getenv("DEEPSEEK_API_KEY"),
            base_url=os.getenv("DEEPSEEK_BASE_URL")
        ),
        "model_name": "deepseek-chat"
    },
    "qwen3": {
        "client": OpenAI(
            api_key=os.getenv("QWEN3_API_KEY"),
            base_url=os.getenv("QWEN3_API_BASE_URL")
        ),
        "model_name": "Qwen3-30B-A3B-Instruct-2507-FP8"
    }
}

# Unified request format (OpenAI-compatible)
class ChatRequest(BaseModel):
    model: str  # "deepseek" or "qwen3"
    messages: list
    temperature: float = 0.7
    max_tokens: int = 1024

# Unified response format
class ChatResponse(BaseModel):
    model: str
    content: str

@app.post("/chat", response_model=ChatResponse)  # route path is an example; adjust as needed
async def chat_completion(request: ChatRequest):
    model_type = request.model.lower()
    if model_type not in MODEL_CONFIG:
        raise HTTPException(400, f"Unsupported model: {model_type}")
    try:
        config = MODEL_CONFIG[model_type]
        response = config["client"].chat.completions.create(
            model=config["model_name"],
            messages=request.messages,
            temperature=request.temperature,
            max_tokens=request.max_tokens
        )
        content = response.choices[0].message.content
        return ChatResponse(model=model_type, content=content)
    except Exception as e:
        raise HTTPException(500, f"API Error: {str(e)}")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
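Assuming the code above is saved as main.py (the filename and the /chat route are illustrative choices), the gateway can be started and tested like this:

```bash
# Start the FastAPI gateway
python main.py

# Route a request to the locally deployed Qwen3 base model
curl http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```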