vllm部署Qwen3方案

Qwen3概述

  • 多种思考模式

    可用户提示或系统消息中添加 /think 和 /no_think 来逐轮切换模型的思考模式

    • 思考模式:在这种模式下,模型会逐步推理,经过深思熟虑后给出最终答案。这种方法非常适合需要深入思考的复杂问题。
    • 非思考模式:在此模式中,模型提供快速、近乎即时的响应,适用于那些对速度要求高于深度的简单问题。
  • 多语言
    119 种语言和方言

  • MCP 支持

Qwen3-30B-A3B

  • 一个拥有约 300 亿总参数和 30 亿激活参数的小型 MoE 模型
  • 需24GB+显存

Qwen3-Embedding & Qwen3-Reranker

Model Type Models Size Layers Sequence Length Embedding Dimension MRL Support Instruction Aware
Text Embedding Qwen3-Embedding-0.6B 0.6B 28 32K 1024 Yes Yes
Text Embedding Qwen3-Embedding-4B 4B 36 32K 2560 Yes Yes
Text Embedding Qwen3-Embedding-8B 8B 36 32K 4096 Yes Yes
Text Reranking Qwen3-Reranker-0.6B 0.6B 28 32K - - Yes
Text Reranking Qwen3-Reranker-4B 4B 36 32K - - Yes
Text Reranking Qwen3-Reranker-8B 8B 36 32K - - Yes
  • 经济型:Embedding-4B + Reranker-4B(显存总需求<30GB)
  • 高性能型:Embedding-8B + Reranker-8B(需多GPU,吞吐量提升40%+)

对比BGE-M3:全方位代差优势

指标 Qwen3-8B BGE-M3 优势幅度
综合得分 70.58 59.56 ↑11.02
上下文长度 32K 8K ↑ 4倍
检索任务(MSMARCO) 57.65 40.88 ↑41%
开放问答(NQ) 10.06 -3.11 实现负分逆转
多语言理解 28.66 20.10 ↑42%

vllm 安装

  • uv(首选)

    1
    2
    3
    uv venv llvm --python 3.12 --seed
    source llvm/bin/activate
    uv pip install vllm
  • conda(有 license 问题)

    1
    2
    3
    4
    5
    conda env list ## 查看conda创建的所以虚拟环境
    conda create -n llvm python=3.12 ## 创建特定版本python
    conda activate llvm ## 进入某个虚拟环境
    conda env remove -n llvm ## 删除某个虚拟环境
    pip install vllm

模型下载

1
2
3
4
5
6
7
8
## 安装下载环境
pip install modelscope
## 下载模型
modelscope download --model Qwen/Qwen3-30B-A3B
## 下载指定目录下
modelscope download --model Qwen/Qwen3-30B-A3B --local_dir /home/models
modelscope download --model Qwen/Qwen3-Embedding-8B --local_dir /home/models
modelscope download --model Qwen/Qwen3-Reranker-8B --local_dir /home/models

vllm 服务启动

vllm serve <model_path>

1
2
3
4
5
6
7
vllm serve /home/models/Qwen3-30B-A3B \
--port 8000 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.95 \
--enable-reasoning \
--reasoning-parser deepseek_r1 \
--served-model-name Qwen3-30B-A3B

参数相关

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
usage: vllm serve [-h] [--model MODEL]
[--task {auto,generate,embedding,embed,classify,score,reward,transcription}]
[--tokenizer TOKENIZER] [--hf-config-path HF_CONFIG_PATH]
[--skip-tokenizer-init] [--revision REVISION]
[--code-revision CODE_REVISION]
[--tokenizer-revision TOKENIZER_REVISION]
[--tokenizer-mode {auto,slow,mistral,custom}]
[--trust-remote-code]
[--allowed-local-media-path ALLOWED_LOCAL_MEDIA_PATH]
[--download-dir DOWNLOAD_DIR]
[--load-format {auto,pt,safetensors,npcache,dummy,tensorizer,sharded_state,gguf,bitsandbytes,mistral,runai_streamer}]
[--config-format {auto,hf,mistral}]
[--dtype {auto,half,float16,bfloat16,float,float32}]
[--kv-cache-dtype {auto,fp8,fp8_e5m2,fp8_e4m3}]
[--max-model-len MAX_MODEL_LEN]
[--guided-decoding-backend GUIDED_DECODING_BACKEND]
[--logits-processor-pattern LOGITS_PROCESSOR_PATTERN]
[--model-impl {auto,vllm,transformers}]
[--distributed-executor-backend {ray,mp,uni,external_launcher}]
[--pipeline-parallel-size PIPELINE_PARALLEL_SIZE]
[--tensor-parallel-size TENSOR_PARALLEL_SIZE]
[--enable-expert-parallel]
[--max-parallel-loading-workers MAX_PARALLEL_LOADING_WORKERS]
[--ray-workers-use-nsight] [--block-size {8,16,32,64,128}]
[--enable-prefix-caching | --no-enable-prefix-caching]
[--disable-sliding-window] [--use-v2-block-manager]
[--num-lookahead-slots NUM_LOOKAHEAD_SLOTS] [--seed SEED]
[--swap-space SWAP_SPACE] [--cpu-offload-gb CPU_OFFLOAD_GB]
[--gpu-memory-utilization GPU_MEMORY_UTILIZATION]
[--num-gpu-blocks-override NUM_GPU_BLOCKS_OVERRIDE]
[--max-num-batched-tokens MAX_NUM_BATCHED_TOKENS]
[--max-num-partial-prefills MAX_NUM_PARTIAL_PREFILLS]
[--max-long-partial-prefills MAX_LONG_PARTIAL_PREFILLS]
[--long-prefill-token-threshold LONG_PREFILL_TOKEN_THRESHOLD]
[--max-num-seqs MAX_NUM_SEQS] [--max-logprobs MAX_LOGPROBS]
[--disable-log-stats]
[--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,ptpc_fp8,fbgemm_fp8,modelopt,nvfp4,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes,qqq,hqq,experts_int8,neuron_quant,ipex,quark,moe_wna16,None}]
[--rope-scaling ROPE_SCALING] [--rope-theta ROPE_THETA]
[--hf-overrides HF_OVERRIDES] [--enforce-eager]
[--max-seq-len-to-capture MAX_SEQ_LEN_TO_CAPTURE]
[--disable-custom-all-reduce]
[--tokenizer-pool-size TOKENIZER_POOL_SIZE]
[--tokenizer-pool-type TOKENIZER_POOL_TYPE]
[--tokenizer-pool-extra-config TOKENIZER_POOL_EXTRA_CONFIG]
[--limit-mm-per-prompt LIMIT_MM_PER_PROMPT]
[--mm-processor-kwargs MM_PROCESSOR_KWARGS]
[--disable-mm-preprocessor-cache] [--enable-lora]
[--enable-lora-bias] [--max-loras MAX_LORAS]
[--max-lora-rank MAX_LORA_RANK]
[--lora-extra-vocab-size LORA_EXTRA_VOCAB_SIZE]
[--lora-dtype {auto,float16,bfloat16}]
[--long-lora-scaling-factors LONG_LORA_SCALING_FACTORS]
[--max-cpu-loras MAX_CPU_LORAS] [--fully-sharded-loras]
[--enable-prompt-adapter]
[--max-prompt-adapters MAX_PROMPT_ADAPTERS]
[--max-prompt-adapter-token MAX_PROMPT_ADAPTER_TOKEN]
[--device {auto,cuda,neuron,cpu,openvino,tpu,xpu,hpu}]
[--num-scheduler-steps NUM_SCHEDULER_STEPS]
[--use-tqdm-on-load | --no-use-tqdm-on-load]
[--multi-step-stream-outputs [MULTI_STEP_STREAM_OUTPUTS]]
[--scheduler-delay-factor SCHEDULER_DELAY_FACTOR]
[--enable-chunked-prefill [ENABLE_CHUNKED_PREFILL]]
[--speculative-model SPECULATIVE_MODEL]
[--speculative-model-quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,ptpc_fp8,fbgemm_fp8,modelopt,nvfp4,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes,qqq,hqq,experts_int8,neuron_quant,ipex,quark,moe_wna16,None}]
[--num-speculative-tokens NUM_SPECULATIVE_TOKENS]
[--speculative-disable-mqa-scorer]
[--speculative-draft-tensor-parallel-size SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE]
[--speculative-max-model-len SPECULATIVE_MAX_MODEL_LEN]
[--speculative-disable-by-batch-size SPECULATIVE_DISABLE_BY_BATCH_SIZE]
[--ngram-prompt-lookup-max NGRAM_PROMPT_LOOKUP_MAX]
[--ngram-prompt-lookup-min NGRAM_PROMPT_LOOKUP_MIN]
[--spec-decoding-acceptance-method {rejection_sampler,typical_acceptance_sampler}]
[--typical-acceptance-sampler-posterior-threshold TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD]
[--typical-acceptance-sampler-posterior-alpha TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA]
[--disable-logprobs-during-spec-decoding [DISABLE_LOGPROBS_DURING_SPEC_DECODING]]
[--model-loader-extra-config MODEL_LOADER_EXTRA_CONFIG]
[--ignore-patterns IGNORE_PATTERNS]
[--preemption-mode PREEMPTION_MODE]
[--served-model-name SERVED_MODEL_NAME [SERVED_MODEL_NAME ...]]
[--qlora-adapter-name-or-path QLORA_ADAPTER_NAME_OR_PATH]
[--show-hidden-metrics-for-version SHOW_HIDDEN_METRICS_FOR_VERSION]
[--otlp-traces-endpoint OTLP_TRACES_ENDPOINT]
[--collect-detailed-traces COLLECT_DETAILED_TRACES]
[--disable-async-output-proc]
[--scheduling-policy {fcfs,priority}]
[--scheduler-cls SCHEDULER_CLS]
[--override-neuron-config OVERRIDE_NEURON_CONFIG]
[--override-pooler-config OVERRIDE_POOLER_CONFIG]
[--compilation-config COMPILATION_CONFIG]
[--kv-transfer-config KV_TRANSFER_CONFIG]
[--worker-cls WORKER_CLS]
[--worker-extension-cls WORKER_EXTENSION_CLS]
[--generation-config GENERATION_CONFIG]
[--override-generation-config OVERRIDE_GENERATION_CONFIG]
[--enable-sleep-mode] [--calculate-kv-scales]
[--additional-config ADDITIONAL_CONFIG] [--enable-reasoning]
[--reasoning-parser {deepseek_r1}]

docker compose 部署

  • Qwen3-30B-A3B

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    version: '3.8'

    services:
    qwen3:
    image: vllm/vllm-openai:latest # 使用最新的vLLM镜像
    container_name: qwen3-vllm
    restart: unless-stopped
    environment:
    - HF_ENDPOINT=https://hf-mirror.com
    - MODEL_PATH=/models/Qwen3-30B-A3B # 模型路径
    - SERVED_MODEL_NAME=Qwen3-30B-A3B # 对外服务时使用的模型名称
    - API_KEY=dakewe # 设置API密钥
    - MEMORY_UTILIZATION=0.95 # GPU内存利用率,接近但不超过显存限制
    volumes:
    - ./models:/models # 将本地的模型目录挂载到容器内的/models路径
    runtime: nvidia # 使用NVIDIA运行时以支持GPU
    deploy:
    resources:
    reservations:
    devices:
    - driver: nvidia
    count: all
    capabilities: [gpu]
    ports:
    - "8000:8000" # 将容器的8000端口映射到主机的8000端口
  • Qwen3-Embedding-8B

    临时,等vllm-openai 支持
    issue: https://github.com/vllm-project/vllm/issues/19229

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    services:
    Qwen3-Embedding-8B:
    container_name: Qwen3-Embedding-8B
    restart: no
    image: dengcao/vllm-openai:v0.9.2-dev #采用vllm最新的开发版制作的镜像,经测试正常,可放心使用
    ipc: host
    volumes:
    - ./models:/models
    command: ["--model", "/models/Qwen3-Embedding-8B", "--served-model-name", "Qwen3-Embedding-8B", "--gpu-memory-utilization", "0.90"]
    ports:
    - 8001:8000
    deploy:
    resources:
    reservations:
    devices:
    - driver: nvidia
    count: all
    capabilities: [gpu]
  • Qwen3-Reranker-8B

    临时,等vllm-openai 支持
    issue: https://github.com/vllm-project/vllm/issues/19229

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    services:  
    Qwen3-Reranker-8B:
    container_name: Qwen3-Reranker-8B
    restart: no
    image: dengcao/vllm-openai:v0.9.2-dev #采用vllm最新的开发版制作的镜像,经在NVIDIA RTX3060平台主机上测试正常,可放心使用。
    ipc: host
    volumes:
    - ./models:/models
    command: ['--model', '/models/Qwen3-Reranker-8B', '--served-model-name', 'Qwen3-Reranker-8B', '--gpu-memory-utilization', '0.90', '--hf_overrides','{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}']
    deploy:
    resources:
    reservations:
    devices:
    - driver: nvidia
    count: all
    capabilities: [gpu]
    ports:
    - 8002:8000