vllm部署Qwen3方案

发表于 2025-07-01 更新于 2025-10-15 阅读次数：阅读次数：

Qwen3概述

多种思考模式

可用户提示或系统消息中添加 /think 和 /no_think 来逐轮切换模型的思考模式
- 思考模式：在这种模式下，模型会逐步推理，经过深思熟虑后给出最终答案。这种方法非常适合需要深入思考的复杂问题。
- 非思考模式：在此模式中，模型提供快速、近乎即时的响应，适用于那些对速度要求高于深度的简单问题。
多语言
119 种语言和方言
MCP 支持

Qwen3-30B-A3B

一个拥有约 300 亿总参数和 30 亿激活参数的小型 MoE 模型
需24GB+显存

Qwen3-Embedding & Qwen3-Reranker

Model Type	Models	Size	Layers	Sequence Length	Embedding Dimension	MRL Support	Instruction Aware
Text Embedding	Qwen3-Embedding-0.6B	0.6B	28	32K	1024	Yes	Yes
Text Embedding	Qwen3-Embedding-4B	4B	36	32K	2560	Yes	Yes
Text Embedding	Qwen3-Embedding-8B	8B	36	32K	4096	Yes	Yes
Text Reranking	Qwen3-Reranker-0.6B	0.6B	28	32K	-	-	Yes
Text Reranking	Qwen3-Reranker-4B	4B	36	32K	-	-	Yes
Text Reranking	Qwen3-Reranker-8B	8B	36	32K	-	-	Yes

经济型：Embedding-4B + Reranker-4B（显存总需求<30GB）
高性能型：Embedding-8B + Reranker-8B（需多GPU，吞吐量提升40%+）

对比BGE-M3：全方位代差优势

指标	Qwen3-8B	BGE-M3	优势幅度
综合得分	70.58	59.56	↑11.02
上下文长度	32K	8K	↑ 4倍
检索任务(MSMARCO)	57.65	40.88	↑41%
开放问答(NQ)	10.06	-3.11	实现负分逆转
多语言理解	28.66	20.10	↑42%

vllm 安装

1
2
3

uv venv vllm --python 3.12 --seed
source vllm/bin/activate
uv pip install vllm

conda

conda env list ## 查看conda创建的所以虚拟环境
conda create -n vllm python=3.12 ## 创建特定版本python
conda activate vllm ## 进入某个虚拟环境
conda env remove -n vllm ##  删除某个虚拟环境
pip install vllm

模型下载

## 安装下载环境
pip install modelscope
## 下载模型
modelscope download --model Qwen/Qwen3-30B-A3B
## 下载指定目录下
modelscope download --model Qwen/Qwen3-30B-A3B-FP8 --local_dir /home/models/Qwen3-30B-A3B-FP8
modelscope download --model Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 --local_dir /home/models/Qwen3-30B-A3B-Instruct-2507-FP8
modelscope download --model Qwen/Qwen3-30B-A3B-Thinking-2507-FP8 --local_dir /home/models/Qwen3-30B-A3B-Thinking-2507-FP8

modelscope download --model Qwen/Qwen3-Embedding-8B --local_dir /home/models/Qwen3-Embedding-8B
modelscope download --model Qwen/Qwen3-Reranker-8B --local_dir /home/models/Qwen3-Reranker-8B
modelscope download --model Qwen/Qwen3-VL-8B-Instruct-FP8 --local_dir /home/models/Qwen3-VL-8B-Instruct-FP8

vllm 服务启动

vllm serve <model_path>

vllm serve /home/models/Qwen3-30B-A3B-Thinking-2507-FP8 \
  --port 8003 \
  --host 0.0.0.0 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 12288 \
  --max-num-seqs 256 \
  --max-num-batched-tokens 4096 \
  --tensor-parallel-size 1 \
  --reasoning-parser deepseek_r1 \
  --served-model-name Qwen3-30B-A3B-Thinking-2507-FP8

--tensor-parallel-size: 将您的模型分布到主机 GPU 中,GPU数量。
--GPU-memory-utilization:调整模型权重、激活和 KV 缓存的加速器内存使用率。作为 0.0 到 1.0 的比例测量，默认为 0.9.例如，您可以将此值设置为 0.8，将 AI Inference Server 的 GPU 内存消耗限制为 80%。使用部署稳定的最大值来最大化吞吐量。
--max-model-len: 限制模型的最大上下文长度，以令牌表示。如果模型的默认上下文长度太长，则将其设置为防止内存出现问题。
--max-num-batched-tokens: 每迭代的最大批处理令牌数量,将令牌的最大批处理大小限制为每个步骤处理（以令牌表示）。增加这可以提高吞吐量，但可能会影响输出令牌延迟。
--max-num-seqs: # 每迭代的最大序列数量,吞吐量关键！

参数相关

usage: vllm serve [-h] [--model MODEL]
                  [--task {auto,generate,embedding,embed,classify,score,reward,transcription}]
                  [--tokenizer TOKENIZER] [--hf-config-path HF_CONFIG_PATH]
                  [--skip-tokenizer-init] [--revision REVISION]
                  [--code-revision CODE_REVISION]
                  [--tokenizer-revision TOKENIZER_REVISION]
                  [--tokenizer-mode {auto,slow,mistral,custom}]
                  [--trust-remote-code]
                  [--allowed-local-media-path ALLOWED_LOCAL_MEDIA_PATH]
                  [--download-dir DOWNLOAD_DIR]
                  [--load-format {auto,pt,safetensors,npcache,dummy,tensorizer,sharded_state,gguf,bitsandbytes,mistral,runai_streamer}]
                  [--config-format {auto,hf,mistral}]
                  [--dtype {auto,half,float16,bfloat16,float,float32}]
                  [--kv-cache-dtype {auto,fp8,fp8_e5m2,fp8_e4m3}]
                  [--max-model-len MAX_MODEL_LEN]
                  [--guided-decoding-backend GUIDED_DECODING_BACKEND]
                  [--logits-processor-pattern LOGITS_PROCESSOR_PATTERN]
                  [--model-impl {auto,vllm,transformers}]
                  [--distributed-executor-backend {ray,mp,uni,external_launcher}]
                  [--pipeline-parallel-size PIPELINE_PARALLEL_SIZE]
                  [--tensor-parallel-size TENSOR_PARALLEL_SIZE]
                  [--enable-expert-parallel]
                  [--max-parallel-loading-workers MAX_PARALLEL_LOADING_WORKERS]
                  [--ray-workers-use-nsight] [--block-size {8,16,32,64,128}]
                  [--enable-prefix-caching | --no-enable-prefix-caching]
                  [--disable-sliding-window] [--use-v2-block-manager]
                  [--num-lookahead-slots NUM_LOOKAHEAD_SLOTS] [--seed SEED]
                  [--swap-space SWAP_SPACE] [--cpu-offload-gb CPU_OFFLOAD_GB]
                  [--gpu-memory-utilization GPU_MEMORY_UTILIZATION]
                  [--num-gpu-blocks-override NUM_GPU_BLOCKS_OVERRIDE]
                  [--max-num-batched-tokens MAX_NUM_BATCHED_TOKENS]
                  [--max-num-partial-prefills MAX_NUM_PARTIAL_PREFILLS]
                  [--max-long-partial-prefills MAX_LONG_PARTIAL_PREFILLS]
                  [--long-prefill-token-threshold LONG_PREFILL_TOKEN_THRESHOLD]
                  [--max-num-seqs MAX_NUM_SEQS] [--max-logprobs MAX_LOGPROBS]
                  [--disable-log-stats]
                  [--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,ptpc_fp8,fbgemm_fp8,modelopt,nvfp4,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes,qqq,hqq,experts_int8,neuron_quant,ipex,quark,moe_wna16,None}]
                  [--rope-scaling ROPE_SCALING] [--rope-theta ROPE_THETA]
                  [--hf-overrides HF_OVERRIDES] [--enforce-eager]
                  [--max-seq-len-to-capture MAX_SEQ_LEN_TO_CAPTURE]
                  [--disable-custom-all-reduce]
                  [--tokenizer-pool-size TOKENIZER_POOL_SIZE]
                  [--tokenizer-pool-type TOKENIZER_POOL_TYPE]
                  [--tokenizer-pool-extra-config TOKENIZER_POOL_EXTRA_CONFIG]
                  [--limit-mm-per-prompt LIMIT_MM_PER_PROMPT]
                  [--mm-processor-kwargs MM_PROCESSOR_KWARGS]
                  [--disable-mm-preprocessor-cache] [--enable-lora]
                  [--enable-lora-bias] [--max-loras MAX_LORAS]
                  [--max-lora-rank MAX_LORA_RANK]
                  [--lora-extra-vocab-size LORA_EXTRA_VOCAB_SIZE]
                  [--lora-dtype {auto,float16,bfloat16}]
                  [--long-lora-scaling-factors LONG_LORA_SCALING_FACTORS]
                  [--max-cpu-loras MAX_CPU_LORAS] [--fully-sharded-loras]
                  [--enable-prompt-adapter]
                  [--max-prompt-adapters MAX_PROMPT_ADAPTERS]
                  [--max-prompt-adapter-token MAX_PROMPT_ADAPTER_TOKEN]
                  [--device {auto,cuda,neuron,cpu,openvino,tpu,xpu,hpu}]
                  [--num-scheduler-steps NUM_SCHEDULER_STEPS]
                  [--use-tqdm-on-load | --no-use-tqdm-on-load]
                  [--multi-step-stream-outputs [MULTI_STEP_STREAM_OUTPUTS]]
                  [--scheduler-delay-factor SCHEDULER_DELAY_FACTOR]
                  [--enable-chunked-prefill [ENABLE_CHUNKED_PREFILL]]
                  [--speculative-model SPECULATIVE_MODEL]
                  [--speculative-model-quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,ptpc_fp8,fbgemm_fp8,modelopt,nvfp4,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes,qqq,hqq,experts_int8,neuron_quant,ipex,quark,moe_wna16,None}]
                  [--num-speculative-tokens NUM_SPECULATIVE_TOKENS]
                  [--speculative-disable-mqa-scorer]
                  [--speculative-draft-tensor-parallel-size SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE]
                  [--speculative-max-model-len SPECULATIVE_MAX_MODEL_LEN]
                  [--speculative-disable-by-batch-size SPECULATIVE_DISABLE_BY_BATCH_SIZE]
                  [--ngram-prompt-lookup-max NGRAM_PROMPT_LOOKUP_MAX]
                  [--ngram-prompt-lookup-min NGRAM_PROMPT_LOOKUP_MIN]
                  [--spec-decoding-acceptance-method {rejection_sampler,typical_acceptance_sampler}]
                  [--typical-acceptance-sampler-posterior-threshold TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD]
                  [--typical-acceptance-sampler-posterior-alpha TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA]
                  [--disable-logprobs-during-spec-decoding [DISABLE_LOGPROBS_DURING_SPEC_DECODING]]
                  [--model-loader-extra-config MODEL_LOADER_EXTRA_CONFIG]
                  [--ignore-patterns IGNORE_PATTERNS]
                  [--preemption-mode PREEMPTION_MODE]
                  [--served-model-name SERVED_MODEL_NAME [SERVED_MODEL_NAME ...]]
                  [--qlora-adapter-name-or-path QLORA_ADAPTER_NAME_OR_PATH]
                  [--show-hidden-metrics-for-version SHOW_HIDDEN_METRICS_FOR_VERSION]
                  [--otlp-traces-endpoint OTLP_TRACES_ENDPOINT]
                  [--collect-detailed-traces COLLECT_DETAILED_TRACES]
                  [--disable-async-output-proc]
                  [--scheduling-policy {fcfs,priority}]
                  [--scheduler-cls SCHEDULER_CLS]
                  [--override-neuron-config OVERRIDE_NEURON_CONFIG]
                  [--override-pooler-config OVERRIDE_POOLER_CONFIG]
                  [--compilation-config COMPILATION_CONFIG]
                  [--kv-transfer-config KV_TRANSFER_CONFIG]
                  [--worker-cls WORKER_CLS]
                  [--worker-extension-cls WORKER_EXTENSION_CLS]
                  [--generation-config GENERATION_CONFIG]
                  [--override-generation-config OVERRIDE_GENERATION_CONFIG]
                  [--enable-sleep-mode] [--calculate-kv-scales]
                  [--additional-config ADDITIONAL_CONFIG] [--enable-reasoning]
                  [--reasoning-parser {deepseek_r1}]

Qwen3 底座模型部署部署

docker-compose.yaml

services:
  Qwen3-30B-A3B-Instruct-2507-FP8:
    image: vllm/vllm-openai:v0.10.1.1 
    container_name: Qwen3-30B-A3B-Instruct-2507-FP8
    restart: unless-stopped
    profiles: ["Instruct"]
    volumes:
      - /home/models/Qwen3-30B-A3B-Instruct-2507-FP8:/models/Qwen3-30B-A3B-Instruct-2507-FP8
    command: ["--model", "/models/Qwen3-30B-A3B-Instruct-2507-FP8","--served-model-name", "Qwen3-30B-A3B-Instruct-2507-FP8","--gpu-memory-utilization", "0.80","--max-model-len", "32768","--tensor-parallel-size", "1"]
    ports:
      - 8003:8000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]

  Qwen3-30B-A3B-Thinking-2507-FP8:
    image: vllm/vllm-openai:v0.10.1.1 
    container_name: Qwen3-30B-A3B-Thinking-2507-FP8
    restart: unless-stopped
    profiles: ["Thinking"]
    volumes:
      - /home/models/Qwen3-30B-A3B-Thinking-2507-FP8:/models/Qwen3-30B-A3B-Thinking-2507-FP8
    command: ["--model", "/models/Qwen3-30B-A3B-Thinking-2507-FP8","--served-model-name", "Qwen3-30B-A3B-Thinking-2507-FP8","--gpu-memory-utilization", "0.80","--max-model-len", "32768","--tensor-parallel-size", "1","--reasoning-parser", "deepseek_r1"]
    ports:
      - 8003:8000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]

  Qwen3-30B-A3B-FP8:
    image: vllm/vllm-openai:v0.10.1.1 
    container_name: Qwen3-30B-A3B-FP8
    restart: unless-stopped
    profiles: ["Instruct&Thinking"]
    volumes:
      - /home/models/Qwen3-30B-A3B-FP8:/models/Qwen3-30B-A3B-FP8
    command: ["--model", "/models/Qwen3-30B-A3B-FP8","--served-model-name", "Qwen3-30B-A3B-FP8","--gpu-memory-utilization", "0.40","--max-model-len", "32768","--tensor-parallel-size", "1","--reasoning-parser", "deepseek_r1"]
    ports:
      - 8003:8000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]

  Qwen3-Embedding-8B:
    container_name: Qwen3-Embedding-8B
    restart: no
    image: vllm/vllm-openai:v0.10.1.1
    profiles: ["Embedding"]
    volumes:
      - /home/models/Qwen3-Embedding-8B:/models/Qwen3-Embedding-8B
    command: ["--model", "/models/Qwen3-Embedding-8B",  "--served-model-name", "Qwen3-Embedding-8B",  "--gpu-memory-utilization", "0.85"]
    ports:
      - 8001:8000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]

  Qwen3-Reranker-8B:
    container_name: Qwen3-Reranker-8B
    restart: no
    image: vllm/vllm-openai:v0.10.1.1
    profiles: ["Reranker"]
    volumes:
      - /home/models/Qwen3-Reranker-8B:/models/Qwen3-Reranker-8B
    command: ['--model', '/models/Qwen3-Reranker-8B',  '--served-model-name', 'Qwen3-Reranker-8B',  '--gpu-memory-utilization', '0.45', '--hf_overrides','{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}']
    ports:
      - 8002:8000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]

启动

docker compose --profile Instruct&Thinking up -d
docker compose --profile Instruct up -d
docker compose --profile Thinking up -d
docker compose --profile Embedding up -d
docker compose --profile Reranker up -d

curl验证一下

{
    "model": "Qwen3-30B-A3B-Instruct-2507-FP8",
    "messages": [
        {
            "role": "system",
            "content": "你是一名旅游顾问，我计划在2024年夏季进行一次为期10天的欧洲之旅，主要目的地包括巴黎、米兰和马德里。预算为每人10000元人民币，希望体验当地的文化和美食，同时偏好舒适的住宿条件。用户特别强调需要包含到访每个城市的必游景点，并希望有自由活动的时间。"
        },
        {
            "role": "user",
            "content": "我计划7月中旬出发，请给我一份旅游计划"
        }
    ],
    "temperature": 0.3,
    "stream": true
}

new api统一网关管理大模型

使用 FastAPI 集成 Qwen 大模型服务

依赖文件（requirements.txt）：

fastapi==0.116.1
uvicorn==0.35.0
openai==1.100.1
dashscope==1.24.1
python-dotenv==1.1.1

安装依赖的 Python 库：

1	pip install -r requirements.txt

将 DeepSeek 和 Qwen 的相关信息配置到系统环境变量中。

# DeepSeek 配置
DEEPSEEK_API_KEY = "sk-your-deepseek-key"
DEEPSEEK_BASE_URL = "https://api.deepseek.com"

# Qwen3 配置(对接底座大模型)
QWEN3_API_KEY = "sk-your-qwen-key"
QWEN3_API_BASE_URL = "http://192.168.103.224:8003/v1"

接口代码

import os
import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from dotenv import load_dotenv
from openai import OpenAI as DeepSeekClient
from openai import OpenAI as Qwen3Client

# 配置模型参数
MODEL_CONFIG = {
    "deepseek": {
        "client": DeepSeekClient(
            api_key=os.getenv("DEEPSEEK_API_KEY"),
            base_url=os.getenv("DEEPSEEK_BASE_URL")
        ),
        "model_name": "deepseek-chat"
    },
    "qwen3": {
        "client": Qwen3Client(
            api_key=os.getenv("QWEN3_API_KEY"),
            base_url=os.getenv("QWEN3_API_BASE_URL")
        ),
        "model_name": "Qwen3-30B-A3B-Instruct-2507-FP8"
    }
}


# 统一请求格式（兼容OpenAI）
class ChatRequest(BaseModel):
    model: str  # deepseek 或 qwen
    messages: list
    temperature: float = 0.7
    max_tokens: int = 1024


# 统一响应格式
class ChatResponse(BaseModel):
    model: str
    content: str


@app.post("/v1/chat")
asyncdef chat_completion(request: ChatRequest):
    model_type = request.model.lower()
    if model_type notin MODEL_CONFIG:
        raise HTTPException(400, f"Unsupported model: {model_type}")

    try:
        if model_type == "deepseek":
            response = MODEL_CONFIG["deepseek"]["client"].chat.completions.create(
                model=MODEL_CONFIG["deepseek"]["model_name"],
                messages=request.messages,
                temperature=request.temperature,
                max_tokens=request.max_tokens
            )
            content = response.choices[0].message.content

        elif model_type == "qwen3":
            response = MODEL_CONFIG["deepseek"]["client"].chat.completions.create(
                model=MODEL_CONFIG["deepseek"]["model_name"],
                messages=request.messages,
                temperature=request.temperature,
                max_tokens=request.max_tokens
            )
            content = response.choices[0].message.content

        return ChatResponse(model=model_type, content=content)

    except Exception as e:
        raise HTTPException(500, f"API Error: {str(e)}")

Qwen3概述

Qwen3-30B-A3B

Qwen3-Embedding & Qwen3-Reranker

vllm 安装

模型下载

vllm 服务启动

参数相关

Qwen3 底座模型部署 部署

new api统一网关管理大模型

使用 FastAPI 集成 Qwen 大模型服务

Qwen3 底座模型部署部署