Deploying Qwen 3 as a Private API on Alibaba Cloud Function Compute

After integrating Qwen 3.6 into Cursor through Model Studio, I started wondering: what if I needed a private, self-hosted Qwen endpoint? Not everyone wants to route inference through a shared API. Some workloads require data sovereignty, custom fine-tuned models, or simply a predictable cost model without per-token pricing. That’s where Alibaba Cloud Function Compute (FC) comes in.

Function Compute is Alibaba Cloud’s serverless platform, and I’ve been using it since 2018 when I built serverless apps with the old fun CLI. Back then, it was mostly for lightweight HTTP handlers and event processing. Today, FC 3.0 supports GPU instances (Tesla, Ampere, Ada, Hopper, and even Blackwell series), container images up to 30 GB, and long-running tasks up to 24 hours. That makes it a viable platform for hosting LLM inference without managing any servers.

Why Serverless for LLM Inference?

The traditional approach to self-hosting an LLM involves spinning up a GPU VM, installing CUDA drivers, setting up vLLM or TGI, configuring a reverse proxy, and then paying for that GPU 24/7 whether you’re running inference or not (and then the world runs out of available GPUs).

Function Compute flips this model. You package your model into a container image, deploy it as a function, and FC handles scaling. When there are no requests, it can scale to zero. When traffic spikes, it scales out automatically. You pay for the compute time you actually use.

For a private API that handles bursty internal traffic (think a development team running code reviews, or a backend service doing occasional text classification), this is significantly more cost-effective than a dedicated GPU instance sitting idle between requests.

The Architecture

The setup is straightforward:

graph TD
    A[vLLM + Qwen] -->|push| B[ACR]
    B -->|pull| C[FC GPU]
    C --> D[HTTP Trigger]
    D --> E[API Gateway]
    D --> F[Cursor]
    E --> F

Step 1: Prepare the Container Image

We need a Docker image that runs vLLM serving a Qwen model. I’m using qwen3.6-35b-a3b because the MoE architecture activates only about 3B parameters per token, which keeps per-token compute low. Memory is a different matter: all 35B parameters still have to be resident on the GPU, which is roughly 70 GB of weights at 16-bit precision. In Frankfurt (eu-central-1), FC offers Ada series GPUs with 48 GB of memory and Hopper series with 96 GB, so check that the checkpoint you deploy (quantized if necessary) fits the card you choose.
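A rough way to check the fit, sketched under stated assumptions: multiply the total parameter count by the bytes per parameter and compare the result against the card’s memory, leaving some headroom for the KV cache and activations. Total parameters (35B) are used because every expert has to be loaded, even though only about 3B are active per token. The function names and the 0.85 headroom factor are illustrative assumptions, not FC or vLLM settings.

```python
GIB = 1024 ** 3


def weight_memory_gib(total_params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return total_params_billion * 1e9 * bytes_per_param / GIB


def fits(total_params_billion: float, bytes_per_param: float,
         gpu_memory_gib: float, headroom: float = 0.85) -> bool:
    """True if the weights fit in `headroom` of the card (assumed value);
    the remainder is left for KV cache and activations."""
    needed = weight_memory_gib(total_params_billion, bytes_per_param)
    return needed <= gpu_memory_gib * headroom


# Compare a 35B-parameter checkpoint at several precisions against both cards.
for label, bytes_per_param in [("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    needed = weight_memory_gib(35, bytes_per_param)
    print(f"{label}: {needed:.1f} GiB | "
          f"48 GB Ada: {fits(35, bytes_per_param, 48)} | "
          f"96 GB Hopper: {fits(35, bytes_per_param, 96)}")
```

This only estimates weight storage; the real footprint also depends on context length, batch size, and the serving engine’s own overhead.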

Create a Dockerfile:

FROM registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.1.0-py310-torch2.3.0-tf2.16.1-1.16.0

# Quote the requirement: an unquoted ">" is treated by the shell as output redirection
RUN pip install "vllm>=0.8"

ENV MODEL_ID=Qwen/Qwen3.6-35B-A3B
ENV VLLM_PORT=9000

RUN python -c "from modelscope import snapshot_download; snapshot_download('${MODEL_ID}')"

EXPOSE 9000

CMD ["python", "-m", "vllm.entrypoints.openai.api_server", \
     "--model", "/root/.cache/modelscope/hub/Qwen/Qwen3.6-35B-A3B", \
     "--served-model-name", "Qwen/Qwen3.6-35B-A3B", \
     "--port", "9000", \
     "--trust-remote-code", \
     "--max-model-len", "8192"]

Build and push to ACR:

docker build -t registry.eu-central-1.aliyuncs.com/your-namespace/qwen-fc:latest .
docker push registry.eu-central-1.aliyuncs.com/your-namespace/qwen-fc:latest

I use the Frankfurt region (eu-central-1) since that’s closest to my infrastructure in Europe. Note that GPU availability varies by region: Frankfurt supports Ada and Hopper GPUs, while Tesla series GPUs are only available in regions like Singapore, Tokyo, and Virginia. If you need to push images to multiple regions, I covered cross-region container replication in a previous post about ACR Enterprise Edition.

Step 2: Create the Function

FC 3.0 dropped the old “service” concept (thank god, now you just create functions directly). Here is the call using the Alibaba Cloud CLI (aliyun) with the FC 3.0 API (/2023-03-30):

aliyun fc CreateFunction --region eu-central-1 \
  --body '{
    "functionName": "qwen-api",
    "runtime": "custom-container",
    "handler": "index.handler",
    "timeout": 300,
    "cpu": 8,
    "memorySize": 65536,
    "diskSize": 10240,
    "gpuConfig": {
      "gpuType": "fc.gpu.ada.1",
      "gpuMemorySize": 49152
    },
    "customContainerConfig": {
      "image": "registry.eu-central-1.aliyuncs.com/your-namespace/qwen-fc:latest",
      "port": 9000,
      "entrypoint": ["python"],
      "command": ["-m", "vllm.entrypoints.openai.api_server", "--model", "/root/.cache/modelscope/hub/Qwen/Qwen3.6-35B-A3B", "--served-model-name", "Qwen/Qwen3.6-35B-A3B", "--port", "9000", "--trust-remote-code", "--max-model-len", "8192"]
    }
  }'

The key settings here:

gpuType: "fc.gpu.ada.1" - an Ada series GPU with 48 GB VRAM, available in Frankfurt. FC allocates the full GPU card to a single container.
cpu: 8 and memorySize: 65536 - 8 vCPUs and 64 GB RAM, matching the Ada.1 card specs. FC 3.0 requires cpu (minimum 0.05). The general rule that the CPU-to-memory ratio must be between 1:1 and 1:4 cannot be what applies here, since 8 vCPUs with 64 GB is 1:8; for GPU functions the card type appears to dictate the allowed CPU and memory values.
diskSize: 10240 - 10 GB ephemeral disk (valid values: 512 MB or 10240 MB)
timeout: 300 - 5 minutes per request, generous enough for long generations
port: 9000 - the port vLLM listens on inside the container
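Hand-writing that JSON inside shell quotes is easy to get wrong, so here is a minimal sketch that builds the same request body in Python and prints it. The field names and values are copied from the call above; the helper name build_function_body, its parameters, and the disk-size check are my own illustration, not part of any Alibaba Cloud SDK. The optional served_model_name argument only appends vLLM’s --served-model-name flag when you pass it.

```python
import json

# Disk sizes listed above as valid for FC 3.0, in MB.
VALID_DISK_SIZES_MB = (512, 10240)


def build_function_body(function_name, image, model_path,
                        served_model_name=None, disk_size_mb=10240,
                        port=9000, max_model_len=8192):
    """Return a CreateFunction body as a dict (hypothetical helper)."""
    if disk_size_mb not in VALID_DISK_SIZES_MB:
        raise ValueError(f"diskSize must be one of {VALID_DISK_SIZES_MB}")

    # Arguments passed to `python`, mirroring the container command.
    command = [
        "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_path,
        "--port", str(port),
        "--trust-remote-code",
        "--max-model-len", str(max_model_len),
    ]
    if served_model_name:
        command += ["--served-model-name", served_model_name]

    return {
        "functionName": function_name,
        "runtime": "custom-container",
        "handler": "index.handler",
        "timeout": 300,
        "cpu": 8,
        "memorySize": 65536,
        "diskSize": disk_size_mb,
        "gpuConfig": {"gpuType": "fc.gpu.ada.1", "gpuMemorySize": 49152},
        "customContainerConfig": {
            "image": image,
            "port": port,
            "entrypoint": ["python"],
            "command": command,
        },
    }


body = build_function_body(
    "qwen-api",
    "registry.eu-central-1.aliyuncs.com/your-namespace/qwen-fc:latest",
    "/root/.cache/modelscope/hub/Qwen/Qwen3.6-35B-A3B",
)
print(json.dumps(body, indent=2))
```

The printed JSON can be saved to a file or pasted into the --body argument, which keeps quoting mistakes out of the CLI call.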

Step 3: Add an HTTP Trigger

aliyun fc CreateTrigger --region eu-central-1 \
  --functionName qwen-api \
  --body '{
    "triggerName": "http-trigger",
    "triggerType": "http",
    "triggerConfig": "{\"authType\":\"anonymous\",\"methods\":[\"POST\",\"GET\"]}"
  }'

Note that triggerConfig is a JSON string, not an object (a quirk of the FC 3.0 API). For production use, you’d want authType: "function" to require signed requests, or put API Gateway in front with proper authentication. For testing, anonymous access gets you moving faster.
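The string-inside-JSON quirk is easiest to get right by encoding twice rather than escaping quotes by hand. A small sketch, where build_trigger_body is a hypothetical helper name and the fields are the ones from the call above:

```python
import json


def build_trigger_body(trigger_name="http-trigger", auth_type="anonymous",
                       methods=("POST", "GET")):
    """Return the CreateTrigger body as a JSON string (hypothetical helper)."""
    trigger_config = {"authType": auth_type, "methods": list(methods)}
    outer = {
        "triggerName": trigger_name,
        "triggerType": "http",
        # Inner json.dumps turns the config object into a string, as the API expects.
        "triggerConfig": json.dumps(trigger_config),
    }
    return json.dumps(outer)


print(build_trigger_body())
print(build_trigger_body(auth_type="function"))
```

Switching to signed requests is then a one-argument change, with no escaped quotes to rewrite.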

Step 4: Test the Endpoint

Once deployed, FC gives you an endpoint URL. Since vLLM exposes an OpenAI-compatible API, you can call it exactly like you would OpenAI:

curl -X POST https://your-function-id.eu-central-1.fc.aliyuncs.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Explain the MoE architecture in one paragraph"}],
    "max_tokens": 256
  }'

The first request will be slow (cold start while the model loads into GPU memory), but subsequent requests within the keep-alive window will respond in seconds.
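For calling the endpoint from code, here is a minimal standard-library sketch of the same request. The function names are hypothetical, the base URL is the placeholder from the curl example, and the 300-second client timeout is an assumption chosen to match the function timeout so a cold start does not cut the call short.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=256):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt, max_tokens=256, timeout_s=300):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (needs a deployed function, so it is left commented out):
# print(chat("https://your-function-id.eu-central-1.fc.aliyuncs.com",
#            "Qwen/Qwen3.6-35B-A3B",
#            "Explain the MoE architecture in one paragraph"))
```

If the server answers that the model does not exist, compare the "model" value with what GET /v1/models reports, since vLLM matches requests against the name it is serving under.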

The Cold Start Problem (and How to Mitigate It)

The elephant in the room with serverless LLM inference is STILL cold starts. Loading a multi-gigabyte model into GPU memory takes time, surprise surprise. On FC, you can mitigate this with:

Provisioned instances: Purchase a resident resource pool and bind provisioned instances to your function. In Frankfurt, a single Ada series GPU card costs ~$1,576/month as a resident resource. This trades some of the cost benefits of serverless for latency predictability. You can configure provisioned instances through the FC console under Elastic Management > Resident Resource Pools, or set a minimum instance count for your function so at least one is always warm.

Image acceleration: ACR supports image acceleration, which significantly reduces container pull times by using on-demand loading instead of downloading the entire image upfront.

Smaller models: If cold start matters more than capability, use qwen3.5-flash instead. Smaller model = faster load (duh).
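If you stay fully on-demand, clients can also absorb a cold start by waiting for the instance before sending real work. The sketch below polls with exponential backoff; it assumes vLLM’s /health route is reachable through the HTTP trigger, and the helper names, attempt count, and delays are illustrative rather than tuned values.

```python
import time
import urllib.request


def wait_until_ready(probe, max_attempts=8, initial_delay_s=2.0,
                     factor=2.0, sleep=time.sleep):
    """Call `probe` until it returns True, backing off between tries.

    Returns the attempt number that succeeded; raises TimeoutError otherwise.
    """
    delay = initial_delay_s
    for attempt in range(1, max_attempts + 1):
        if probe():
            return attempt
        if attempt < max_attempts:
            sleep(delay)
            delay *= factor
    raise TimeoutError(f"endpoint not ready after {max_attempts} attempts")


def http_health_probe(base_url, timeout_s=30.0):
    """Return a probe that checks GET <base_url>/health (assumed route)."""
    url = base_url.rstrip("/") + "/health"

    def probe():
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as response:
                return response.status == 200
        except OSError:  # covers URLError, HTTPError, and socket timeouts
            return False

    return probe


# Example (needs a deployed function, so it is left commented out):
# wait_until_ready(http_health_probe(
#     "https://your-function-id.eu-central-1.fc.aliyuncs.com"))
```

The probe and the sleep function are injectable so the backoff logic can be exercised without a live endpoint.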

Connecting It to Cursor

Here’s where it comes full circle. Since this endpoint is OpenAI-compatible, you can point Cursor at it the same way I described in my Qwen 3.6 + Cursor post. Just replace the Model Studio base URL with your Function Compute endpoint:

https://your-function-id.eu-central-1.fc.aliyuncs.com/v1

Now you have a private, self-hosted coding assistant that runs on your own infrastructure, with your own data staying within your own Alibaba Cloud account. If you configure the function with VPC access, no tokens leave your VPC.

Cost Comparison

Let’s do the math for a small development team making ~500 inference requests per day:

Approach	Monthly Cost (approx.)
GPU ECS instance (24/7)	~$900/month
Function Compute (GPU, 500 req/day, 10s avg)	~$150/month
Model Studio API (pay-per-token, 500 req/day)	~$30/month

Function Compute sits in the sweet spot between full control (self-hosted on ECS) and zero management (Model Studio API). You get data sovereignty and custom model support without paying for idle GPU time.
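To see what the Function Compute row implies, here is the arithmetic as a short sketch. It assumes a 30-day month and that only active request time is billed, ignoring cold-start load time and any idle keep-alive charges; the dollar figures are the approximate ones from the table, not a published price list.

```python
DAYS_PER_MONTH = 30  # assumption for the estimate


def monthly_gpu_seconds(requests_per_day, avg_seconds_per_request,
                        days=DAYS_PER_MONTH):
    """Total billed GPU time per month under the assumptions above."""
    return requests_per_day * avg_seconds_per_request * days


def implied_rate_per_second(monthly_cost_usd, gpu_seconds):
    """Per-second price implied by a monthly bill."""
    return monthly_cost_usd / gpu_seconds


def break_even_requests_per_day(fixed_monthly_usd, rate_per_second,
                                avg_seconds_per_request, days=DAYS_PER_MONTH):
    """Daily request volume at which on-demand cost equals a fixed monthly cost."""
    return fixed_monthly_usd / (rate_per_second * avg_seconds_per_request * days)


gpu_seconds = monthly_gpu_seconds(500, 10)
rate = implied_rate_per_second(150, gpu_seconds)
print(f"GPU time: {gpu_seconds} s/month ({gpu_seconds / 3600:.1f} h)")
print(f"Implied rate: ${rate:.4f} per GPU-second")
print(f"Break-even vs $900 ECS: {break_even_requests_per_day(900, rate, 10):.0f} req/day")
print(f"Break-even vs $1,576 resident Ada: {break_even_requests_per_day(1576, rate, 10):.0f} req/day")
```

Under these assumptions the table’s numbers work out to about $0.001 per GPU-second, and on-demand stays cheaper than the 24/7 ECS instance until roughly 3,000 ten-second requests per day.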

When This Makes Sense

This approach makes sense when data privacy, custom or fine-tuned models, predictable costs, or compliance requirements rule out a shared API.

If you just need a quick Qwen integration for development, Model Studio is still the simpler and cheaper option. Use the Cursor integration I covered before for that (and remember, you can use the same API key for both).