In this episode, we'll discuss how much the speed can differ when we choose different underlying architectures and inference tools, even with the same parameter scale or the same AI model.

I touched on this topic in my Mac mini unboxing video, where I gave a simple demonstration. But back then I was just starting out with AI models, and I wasn't clear about the differences between models, or about how the parameter scale, architecture, and quantization format of the same model change its behavior.

Of course, I can't say I fully understand it yet; I've just spent a little more time learning about it because I want to deploy OpenClaw.

Then I discovered that, for the same AI model, adjusting these factors could double or even quadruple the generation speed. So I'd like to share this with you, as a supplement to the AI model experience I shared in last year's unboxing video.

Okay, let's begin.

I. Demonstration Equipment and Models

First, let me introduce the devices and AI models I've used.

I'm using a Mac mini M4 with 32GB of RAM.

The AI model used is: Qwen3.5-35B-A3B

Don't think I'm crazy when you see 35B; this model is very representative.

In last year's unboxing video, I tested the number of tokens output per second by DeepSeek-R1 and concluded that since the 14B model can output about 15 tokens per second, while the 32B model can only output 4 to 5 tokens, I recommend choosing the 14B model.

However, this conclusion is too simplistic.

So, let's talk about the second question: how to determine if an AI model is suitable for you by looking at a simple description of its parameters.

Model parameter size is important, but many other factors affect the user experience: the underlying architecture of the model (Dense/MoE), the quantization scheme (Q4_K_M/S/XS), the inference framework used (Ollama/LM Studio/oMLX), the type of capability (Vision/Tool Use/Reasoning), and whether a "thinking" (reasoning) step is enabled. All of these in turn affect the time to first token and the token output speed.

The mixture-of-experts (MoE) model Qwen3.5-35B-A3B illustrates this point perfectly:

"Model size" is not the same as "actual operational burden".

II. AI Model Parameter Analysis

If you are familiar with AI models, you can skip to Part 3.

Next, let's talk about the model and inference engine we'll be using.

Model name: Qwen3.5-35B-A3B
Architecture type: MoE (Mixture of Experts)
Model size: 35B total / 3B active (35 billion total / 3 billion activated)
Quantization scheme: 4-bit (INT4 / Q4_K_M)
Deployment framework: MLX (Apple Silicon native) / GGUF (llama.cpp)

Because we need to compare different deployment frameworks, and to make better use of the concurrency potential of MoE models like Qwen3.5-35B-A3B, we'll use LM Studio as one inference engine, and I'd also like to recommend a newer tool: oMLX. For the technical details, you can ask any AI tool to summarize it. In short, it is designed specifically for Apple's M-series chips and aims for high speed, and its SSD KV caching moves part of the cache out of memory, which suits long multi-turn conversations.

Next, let's take the Qwen3.5-35B-A3B again and talk about how to choose the right model for your device based on the model's key parameters.

2.1 Model Name and Parameter Scale

Let's first look at Qwen3.5-35B. Qwen is the model family, 3.5 is the version, and 35B is the parameter scale of the model.

Family names such as Qwen, DeepSeek, GLM, MiniMax, Gemma, and Llama, together with their version numbers, tell us roughly what to expect from a model, so we can choose one that fits our own use case.

Parameter sizes like 9B, 14B, and 32B are directly linked to video memory (VRAM). On Apple's M-series chips, which use a unified memory architecture, that means they are directly linked to system memory. Here's a simple conversion formula:

Memory usage (model weights) = Parameter size × Bit depth of each parameter ÷ 8

For example, for a 32B model with 4-bit quantization, the required memory is:

32 × 4 ÷ 8 = 16 GB

That means at least 16GB of memory is required. Of course, running the model requires memory not only from the model itself, but also from the inference engine and the graphical interface. Ultimately, the breakdown would look like this:

Model size	4-bit actual VRAM usage	Fit on a 32GB machine
9B	~6 GB	Easy
14B	~10 GB	Good experience
32B	~20 GB	At the limit

Of course, please note: in actual operation, the key-value (KV) cache, which stores the conversation context, grows as the dialogue gets longer and often consumes several gigabytes of additional memory. That is why the 32B model sits at a "critical point" with 32GB of memory.
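As a rough sketch of the arithmetic in this section, the two helpers below estimate the weight size and the KV cache size. The weight formula is the one given above. The KV cache formula is the generic one for standard attention, and the layer count, head count, and head dimension are hypothetical values chosen only to show the order of magnitude; they are not the real configuration of any model discussed here. The table's figures are higher than the raw weight sizes because they also include runtime overhead.

```python
def weights_gb(params_billion: float, bits_per_param: float) -> float:
    """Weight memory in GB: parameters x bits per parameter / 8."""
    return params_billion * bits_per_param / 8


def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """KV cache in GB for standard attention: one K and one V vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9


for size in (9, 14, 32):
    print(f"{size}B at 4-bit: {weights_gb(size, 4):.1f} GB of weights")

# Hypothetical architecture numbers, used only to illustrate the scale.
cache = kv_cache_gb(32_000, layers=48, kv_heads=8, head_dim=128)
print(f"KV cache at 32,000 tokens: {cache:.1f} GB")
```

With these assumed numbers, a long conversation adds several gigabytes on top of the weights, which matches the "critical point" described above.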

2.2 Token/s and Memory Bandwidth

The model's parameter size, bit depth, and computer memory size determine whether the model can run. However, a crucial factor determining how fast the model runs is memory bandwidth. For example, my Mac mini uses a standard M4 chip, with a memory bandwidth of 120GB/s.

Here is a simple calculation formula:

Inference speed (Tokens/s) = Memory bandwidth ÷ Actual size of the model running

Let's look at the table again:

Model size	4-bit actual VRAM usage	Inference speed (Tokens/s)
9B	~6 GB	20
14B	~10 GB	12
32B	~20 GB	6

First, note that this is only a rough estimate. In practice, inference speed is affected by many factors, including the number of compute units, the KV cache access pattern, the batch size, the number of concurrent requests, framework optimization, and the cache hit rate.

This can be roughly understood as follows: the larger the model, the more memory it consumes, the greater the pressure on memory bandwidth, and the slower the inference speed usually becomes.
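The table's speed column can be reproduced with a few lines. This is only the bandwidth-bound estimate from the formula above, using the 120GB/s figure for the base M4; the function name is my own, and real speeds will differ for the reasons just listed.

```python
M4_BANDWIDTH_GB_S = 120  # base M4 memory bandwidth quoted above


def est_tokens_per_s(model_gb: float, bandwidth_gb_s: float = M4_BANDWIDTH_GB_S) -> float:
    """Bandwidth-bound estimate: each generated token reads the whole model once."""
    return bandwidth_gb_s / model_gb


for name, size_gb in (("9B", 6), ("14B", 10), ("32B", 20)):
    print(f"{name}: about {est_tokens_per_s(size_gb):.0f} tokens/s")
```

Passing a different bandwidth value gives the same estimate for another chip.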

Therefore, when we want to deploy a local AI model on a Mac, we need to consider two factors: the version of the M chip and the amount of memory.

2.3 Operational Logic and Density

Let's look at the model Qwen3.5-35B-A3B. 35B is the physical scale of the model, so what does A3B represent?

Here, A stands for "activate".

In other words, although the total size of this model is 35B, only about 3B of its parameters are activated for each token it generates. Essentially, you get the knowledge capacity of a 35B model, while only the 3B most relevant parameters are actually computed at each step.

You can think of it as an 'expert pool' with 35 billion pieces of knowledge, but when you ask a specific question, it will only dispatch the 3 billion most specialized 'experts'. It retains the brain of a large model while possessing the speaking speed of a small model.

So, this is the parameter that confused me when I first came into contact with models: model density.

In other words, models are divided into dense models, which use all of their parameters for every token, and mixture-of-experts (MoE) models, which involve only a small subset of relevant parameters in each inference step.
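To make the "expert pool" idea concrete, here is a toy sketch of top-k routing. It is a simplified illustration, not how Qwen3.5 is actually implemented: the four "experts" are plain multiplier functions, and the router scores are made-up numbers, whereas a real model learns the router and routes every token in every MoE layer.

```python
import math


def top_k_route(router_scores, k):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    chosen = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)[:k]
    peak = max(router_scores[i] for i in chosen)
    exps = {i: math.exp(router_scores[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_layer(x, experts, router_scores, k=2):
    """Run only the selected experts and mix their outputs by routing weight."""
    weights = top_k_route(router_scores, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Hypothetical experts and scores, for illustration only.
experts = [lambda x, m=m: m * x for m in (1.0, 2.0, 3.0, 4.0)]
scores = [0.1, 2.0, -1.0, 2.0]

print("experts used:", sorted(top_k_route(scores, k=2)))
print("layer output:", moe_layer(10.0, experts, scores))
```

Only two of the four experts run for this input; the other two cost nothing at compute time but still have to sit in memory.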

Neither type of model is inherently better or worse. Dense models are generally better in terms of stability and consistency, while MoE models have advantages in inference efficiency and scalability.

However, for home computers and everyday needs like ours, an MoE model is usually the more suitable choice.

So, what is the theoretical inference speed of the 35B-A3B on the standard M4 chip?

Model size	4-bit actual VRAM usage	Inference speed (Tokens/s)
9B	~6 GB	20
14B	~10 GB	12
32B	~20 GB	6
35B-A3B	~20-24GB	80

To be safe, let me add a caveat: although I use the 3B active size to calculate speed, activating 3B parameters in an MoE model does not mean it runs as fast as a dense 3B model. Speed is also affected by routing overhead, memory access patterns, and the KV cache.
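The 80 tokens/s in the table comes from combining the two formulas from 2.1 and 2.2, as sketched below. The key assumption, as the caveat above says, is that only the active 3B parameters are read per token, so treat the result as a bandwidth-only ceiling rather than a prediction.

```python
BANDWIDTH_GB_S = 120  # base M4 memory bandwidth quoted earlier in this article


def weights_gb(params_billion: float, bits_per_param: float) -> float:
    """Weight size in GB: parameters x bits per parameter / 8."""
    return params_billion * bits_per_param / 8


total_gb = weights_gb(35, 4)   # all experts must be resident in memory
active_gb = weights_gb(3, 4)   # roughly what is read for each generated token
ceiling_tps = BANDWIDTH_GB_S / active_gb

print(f"weights to load: {total_gb:.1f} GB")
print(f"weights read per token: {active_gb:.1f} GB")
print(f"bandwidth-only ceiling: {ceiling_tps:.0f} tokens/s")
```

In short, the total size decides whether the model fits, while the active size drives the speed estimate.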

Keep this 80 tokens/s in mind. You'll see in the oMLX benchmark later that a single request only reaches 47, while 8 concurrent requests with continuous batching reach as high as 93.

The gap between estimate and measurement comes from two main factors: first, an MoE model like Qwen3.5-35B-A3B has more headroom for concurrent inference; second, oMLX's SSD KV caching. Factors such as the M4 chip's L2 cache were not considered either, which could also contribute to the gap.

I think beginners can start by establishing a simple conversion between computer configuration and model parameters. If needed, they can then spend more time exploring the details.

2.4 Deployment Framework and Inference Engine

Previously, we chose the right computer and the right model. Similarly, choosing the right deployment framework and inference tool (engine) for the model is also very important.

For the Qwen3.5-35B-A3B model, I used two deployment frameworks to give you a clear sense of how they affect inference speed.

The first is llama.cpp with the general-purpose GGUF format. I downloaded the model with Ollama, the most commonly used tool for this, and loaded it in Anything LLM so the relevant data is easy to display.

The second type is the MLX framework, which is specifically optimized for Apple's M chip. I will demonstrate it using LM Studio and oMLX respectively.

It's important to note that although both GGUF and MLX use the 4-bit Qwen3.5-35B-A3B model, their quantization precision differs. Regarding quantization precision:

A typical GGUF quantization is Q4_K_M, which uses 6-bit quantization for critical parts and keeps 4-bit for the rest. Because of this mixed precision, the GPU has to perform frequent "non-standard bit width" conversions during computation. In frameworks that do not support this natively, the decompression overhead slows inference significantly.

The MLX build uses INT4 (uniform 4-bit) quantization, which lets Apple's M-series chips read the model parameters directly, without the extra "lookup" and "translation" steps. The result is memory access and scheduling that are better matched to the M-series chips when running models on a Mac.

This is one of the reasons why MLX models are preferred on Macs.
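To give a feel for what "4-bit" means, here is a minimal sketch of uniform 4-bit quantization for one small group of weights. It is a generic textbook scheme with one scale and one offset per group; it is not the actual storage layout of MLX or GGUF, and the sample values are made up.

```python
def quantize_4bit(values):
    """Map floats to integer codes 0..15 using one scale and offset for the group."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 if every value is identical
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_4bit(codes, scale, lo):
    """Recover approximate floats from the 4-bit codes."""
    return [lo + c * scale for c in codes]


group = [-0.62, -0.10, 0.0, 0.27, 0.88]  # hypothetical weights
codes, scale, lo = quantize_4bit(group)
restored = dequantize_4bit(codes, scale, lo)

print("codes:", codes)
print("worst error:", max(abs(a - b) for a, b in zip(group, restored)))
```

Each weight is stored as one of 16 levels, so the worst-case error is about half a step. A mixed-precision scheme spends more bits on the tensors where that error hurts most, at the cost of extra decoding work.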

III. Model Deployment Comparison Test

In this comparative test, I used four inference tools: Ollama, Anything LLM, LM Studio, and oMLX.

I downloaded two builds of the model: the GGUF and the MLX versions of Qwen3.5-35B-A3B, both 4-bit.

The tests fall into three main categories: a generation speed test, a time-to-first-token test, and a multi-round heavy-load test.

Finally, I'll add one more test. One of my reasons for deploying a model locally is to use OpenClaw. So, let's compare Qwen3.5-35B and Qwen3-Coder-30B. If you're like me and want to use OpenClaw to develop web pages or applications, a model specialized for programming may be the better choice.

3.1 Generation Speed Test (Tokens/s)

Test method: Send each tool the same complex prompt (e.g., "Please write a complete Snake game in Python with detailed comments") and observe the generation speed reported in the background output.

Ollama: 15.42 t/s
LM Studio: 35.06 t/s
oMLX: 35.70 t/s

3.2 Time to First Token / Prompt Processing (TTFT / Prefill)

Test method: Send each tool a long document of about 5000 words and ask for a summary. Measure the number of seconds from pressing Enter to the first word appearing. In theory, MLX should have the advantage in this round; you can see for yourself.

LM Studio: slightly
oMLX: (omitted)
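If you prefer a script to a stopwatch, the two measurements in 3.1 and 3.2 can be sketched as follows. The `measure_stream` helper and the `fake_stream` generator are my own illustrative names; in practice you would replace `fake_stream` with whatever iterator your inference tool provides for streamed tokens.

```python
import time


def measure_stream(token_stream, clock=time.perf_counter):
    """Return (seconds to first token, tokens per second after the first token)."""
    start = clock()
    first = last = None
    count = 0
    for _ in token_stream:
        last = clock()
        if first is None:
            first = last
        count += 1
    if first is None:
        raise ValueError("the stream produced no tokens")
    decode_seconds = last - first
    tokens_per_s = (count - 1) / decode_seconds if decode_seconds > 0 else float("nan")
    return first - start, tokens_per_s


def fake_stream(n_tokens=5, delay_s=0.01):
    """Hypothetical stand-in for a model's streaming output."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i}"


ttft, tps = measure_stream(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, speed: {tps:.1f} tokens/s")
```

Separating the two numbers matters because prefill (reading the prompt) and decode (generating tokens) stress the hardware differently.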

3.3 Agent Multi-round Reload Test (Reprefill / Memory Test)

Test method: Use the "Standard 10-Round High-Pressure Test Script" in the appendix.

It is a scenario that simulates OpenClaw continuously writing code.
In a brand-new chat window, send the following 10 questions in sequence.
For the first 9 rounds, you don't need to pay attention to the answers; just wait patiently for each one to finish generating (these rounds quickly consume about 100,000 tokens of context).
⚠️ The key point is round 10! The instant you send the 10th prompt, start your stopwatch and stop it when the first word appears on the screen. Record this time difference (TTFT).

In this round of testing, besides observing the generation speed across the 10 questions, we also looked at the number of cached tokens and how efficiently they were used. After just 10 questions, there were already 140,000 tokens, of which 110,000 were cached. This is equivalent to using disk space instead of memory, saving 1-3 GB of RAM. The larger the model's parameter count and the higher the quantization bit depth (that is, the higher the precision), the larger both the space required to load the model and the dynamic cache generated when processing the same number of tokens.

It's important to understand that solid-state drives (SSDs) are slower than RAM. While SSDs save valuable RAM space, they sacrifice a slight amount of inference speed. However, in the long run of multiple rounds of questioning, the resulting more stable system operation is clearly more worthwhile.

3.4 oMLX Continuous Batch Processing Benchmark Test

Test method: In the oMLX benchmark, measure the concurrent inference speed of the Qwen3.5-35B-A3B model.

In this round of concurrent testing, we need to look not only at the token generation speed, but also at the time to first token.

On my base M4 with 32GB, 2x concurrency gives a good TPS of 72.1 tok/s with an average TTFT of 4933.2ms. At 4x, the average TTFT rises to 9664.7ms, which is somewhat counterproductive.
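One way to read these numbers is to divide the total throughput by the number of concurrent requests. The sketch below does that with the figures quoted in this article, under the assumption that the reported TPS is the aggregate across all requests and is shared evenly among them.

```python
# Concurrency -> aggregate tokens/s, using the figures quoted in this article.
aggregate_tps = {1: 47.0, 2: 72.1, 8: 93.0}


def per_request_tps(total_tps: float, concurrency: int) -> float:
    """Speed seen by each request if the aggregate is shared evenly (an assumption)."""
    return total_tps / concurrency


for n, total in aggregate_tps.items():
    print(f"{n}x: {total:.1f} tok/s total, {per_request_tps(total, n):.1f} tok/s per request")
```

Under that assumption, higher concurrency raises total throughput while each individual request gets slower, which is the same trade-off the rising TTFT shows.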

3.5 Are inference models really that good?

Test method: Test the speed of the first round of questions using Qwen3-Coder-30B.

Finally, one more thing: although I ran the speed tests with the general-purpose 35B model, if you're like me and want to run OpenClaw locally to automate code writing, I strongly recommend switching to Qwen3-Coder-30B-A3B (MLX version). The general model writes well, but it occasionally produces malformed JSON, which crashes the agent; the Coder model, by contrast, is an emotionless code machine, and it will never crash in OpenClaw.
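To show why malformed JSON is a problem for an agent, here is a small sketch of defensive parsing. The `tool` and `args` fields are a hypothetical message shape that I made up for illustration; this is not OpenClaw's actual format, and I am not describing how OpenClaw handles errors internally.

```python
import json


def parse_tool_call(raw: str):
    """Return the parsed tool call as a dict, or None if the output is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or "tool" not in data:
        return None
    return data


# Hypothetical model outputs: the second one is missing its closing brace.
good = '{"tool": "write_file", "args": {"path": "index.html"}}'
bad = '{"tool": "write_file", "args": {"path": "index.html"}'

for raw in (good, bad):
    call = parse_tool_call(raw)
    print("ok:" if call else "rejected:", raw)
```

A single missing brace is enough to make the whole message unparseable, so a model that reliably emits valid JSON matters more for agent work than one that writes nicer prose.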

IV. Summary

Alright, that's all for this video.

I actually made this video twice, and revised the script several times. You can probably tell from the tests. Originally, I just wanted to make a simple comparison: which model and tool is faster, so as to choose the most suitable one for use in OpenClaw.

But I later discovered that my understanding of AI models was often superficial and incomplete.

A couple of days ago, I saw a comment under last year's Mac mini unboxing video, saying that the video helped him.

This made me feel guilty, and replying to this friend strengthened my resolve to redo the video.

And that's exactly why I discovered—

👉 AI is not something that can be solved by "selecting the right parameters"; it is more like a complete system engineering project.

The model, quantization, inference engine, hardware architecture, and practical needs—every choice will affect the final result.

Logically, I should provide a "standard answer": for example, what model should be chosen for what scenario, what inference tool should be used, and what computer configuration should be used for what purpose, such as daily chatting, analysis reports, or software development.

But after actually finishing writing this copy, I felt that stubbornly searching for a fixed answer is a kind of "obsession".

There are no fixed laws, and no fixed laws are not laws at all. Cultivating the mind is worse than cultivating worldly laws.

In the context of AI, this is actually quite easy to understand.

The best model or framework today may be replaced in a few months.

The solution that is currently best suited for you could be completely different if you change the machine, the scenario, or the model.

Therefore, understanding is more important than memorizing "which one to use":

👉 Why it's more suitable here.

As for the so-called cultivation of the "mind," my understanding is:

If we treat AI as a way of thinking, then we should try to understand and break down problems using "AI methods".

If you treat AI as a tool, then use it to its fullest potential to discover and solve problems.

The former represents an upgrade in cognition;

The latter is an amplification of efficiency.

In today's information-saturated and rapidly evolving world, parameters may become outdated and models may be obsolete, but your understanding will not.

Hopefully this video can help you.

If you find this helpful, please subscribe to my channel, or like, comment, and share!

That's it, bye-bye~

By Loogn sir

An ordinary person who likes to use fun to resist mediocrity; often writes about his own interests; so you will see technology, digital, entertainment, credit cards, Internet... Refuse to be high-sounding and don't be a pseudo-expert; make professional life-like and biochemistry interesting; well, that's it~

