In this episode, we'll discuss how much the speed can differ when we choose different underlying architectures and inference tools, even with the same parameter scale or the same AI model.
I gave a simple demonstration of this topic in my Mac mini unboxing video. But back then, I was just starting out with AI models, and I wasn't clear about how models differ from one another, or how the same model differs across parameters such as scale, architecture, and quantization format.
Of course, I can't say I fully understand it yet; I've just spent a little more time learning about it because I want to deploy OpenClaw.
Then I discovered that, for the same AI model, adjusting these factors could double or even quadruple the generation speed. So I'd like to share this with you, as a supplement to the AI model experience I shared in last year's unboxing video.
Okay, let's begin.
I. Demonstration Equipment and Models
First, let me introduce the devices and AI models I've used.
I'm using a Mac mini M4 with 32GB of RAM.
The AI model used is: Qwen3.5-35B-A3B
Don't think I'm crazy when you see 35B; this model is very representative.
In last year's unboxing video, I tested the number of tokens output per second by DeepSeek-R1 and concluded that since the 14B model can output about 15 tokens per second, while the 32B model can only output 4 to 5 tokens, I recommend choosing the 14B model.
However, this conclusion is too simplistic.
So, let's talk about the second question: how to determine if an AI model is suitable for you by looking at a simple description of its parameters.
Model parameter size is important, but many other factors affect the user experience: the underlying architecture of the model (Dense/MoE), the quantization scheme (Q4_K_M/S/XS), the inference framework (Ollama/LM Studio/oMLX), the type of capability (Vision/Tool Use/Reasoning), and whether the model goes through a "thinking" (reasoning) step before it answers. These in turn affect the time to first token and the token output speed.
The Mixture-of-Experts (MoE) model Qwen3.5-35B-A3B illustrates this point perfectly:
"Model size" is not the same as "actual operational burden".
II. AI Model Parameter Analysis
If you are familiar with AI models, you can skip to Part 3.
Next, let's talk about the model and inference engine we'll be using.
- Model name: Qwen3.5-35B-A3B
- Architecture type: MoE (Mixture of Experts)
- Model size: 35B total / 3B active (35 billion total parameters / 3 billion activated)
- Quantization scheme: 4-bit (INT4 / Q4_K_M)
- Deployment framework: MLX (Apple Silicon native) / GGUF (llama.cpp)
Because we need to compare different deployment frameworks, and to better leverage the multi-task concurrency of MoE models like Qwen3.5-35B-A3B, we'll use LM Studio as an inference engine, and I also recommend a new tool: oMLX. For its technical details, you can ask any AI tool for a summary. In short, it's designed specifically for Apple's M-series chips and aims for maximum speed, and its SSD KV caching frees up memory, which makes it well suited to long multi-turn dialogue tasks.
Next, let's take the Qwen3.5-35B-A3B again and talk about how to choose the right model for your device based on the model's key parameters.
2.1 Model Name and Parameter Scale
Let's first look at Qwen3.5-35B. Qwen is the model family, 3.5 is the version, and 35B is the parameter scale of the model.
From the family name (Qwen, DeepSeek, GLM, MiniMax, Gemma, Llama) and the version number, we can quickly look up a model's characteristics and then choose the appropriate model for our own use case.
Parameter scales like 9B, 14B, and 32B are directly linked to video memory. Since the Apple M-series chip uses a unified memory architecture, on a Mac this means they are directly linked to system memory. Here's a simple conversion formula:
Memory usage (model weights) = Parameter size × Bit depth of each parameter ÷ 8
For example, for a 32B model quantized to 4-bit, the required memory is:
32 × 4 ÷ 8 = 16 GB
That means at least 16GB of memory is required. Of course, running the model takes memory not only for the model itself, but also for the inference engine and the graphical interface. Ultimately, the breakdown would look like this:
| Model size | Actual memory usage at 4-bit | Fit in 32GB of RAM |
|---|---|---|
| 9B | ~6 GB | easy |
| 14B | ~10 GB | Good experience |
| 32B | ~20 GB | At the limit |
Please note: in actual operation, the key-value cache (context cache) grows with the dialogue and often consumes several gigabytes of additional memory, which is why the 32B model sits at a "critical point" with 32GB of memory.
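The conversion formula above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `weight_memory_gb` is my own, and it counts the model weights alone, so the results come out lower than the "actual usage" column in the table, which also has to cover the inference engine and other overhead.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GB.

    (params_billion * 1e9 parameters) * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    simplifies to params_billion * bits / 8.
    """
    return params_billion * bits_per_param / 8


# Weights-only estimate for the three sizes discussed above, at 4-bit.
for size in (9, 14, 32):
    print(f"{size}B at 4-bit: ~{weight_memory_gb(size, 4):.1f} GB of weights")
```

Changing the second argument to 8 or 16 shows how quickly the same model outgrows a 32GB machine at higher precision.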
2.2 Token/s and Memory Bandwidth
The model's parameter size, bit depth, and computer memory size determine whether the model can run. However, a crucial factor determining how fast the model runs is memory bandwidth. For example, my Mac mini uses a standard M4 chip, with a memory bandwidth of 120GB/s.
Here is a simple calculation formula:
Inference speed (Tokens/s) = Memory bandwidth ÷ Actual size of the model running
Let's look at the table again:
| Model size | 4-bit actual VRAM usage | Inference speed (Tokens/s) |
|---|---|---|
| 9B | ~6 GB | 20 |
| 14B | ~10 GB | 12 |
| 32B | ~20 GB | 6 |
Note that this is only a rough estimate. In actual operation, inference speed is affected by many factors, including the number of computing units, the key-value cache access pattern, the batch size, the number of concurrent requests, framework optimization, and the cache hit rate.
This can be roughly understood as follows: the larger the model, the more memory it consumes, the greater the pressure on memory bandwidth, and the slower the inference speed usually becomes.
Therefore, when we want to deploy a local AI model on a Mac, we need to consider two factors: the version of the M chip and the amount of memory.
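As a rough sketch, the bandwidth formula from this section looks like this in Python. The function name is hypothetical, the 120 GB/s figure is the standard M4 bandwidth mentioned above, and the result is a ceiling-style estimate rather than a prediction of measured speed.

```python
def estimated_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Rough estimate: memory bandwidth divided by the memory the model occupies."""
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    return bandwidth_gb_s / model_gb


M4_BANDWIDTH_GB_S = 120  # standard M4, as stated in the text

# Same rows as the table: model size and its approximate 4-bit memory usage.
for label, model_gb in (("9B", 6), ("14B", 10), ("32B", 20)):
    tps = estimated_tokens_per_second(M4_BANDWIDTH_GB_S, model_gb)
    print(f"{label}: ~{tps:.0f} tokens/s")
```

Swapping in the bandwidth of a different chip shows why the chip version matters as much as the amount of memory.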
2.3 Operational Logic and Density
Let's look at the model Qwen3.5-35B-A3B. 35B is the physical scale of the model, so what does A3B represent?
Here, A stands for "activate".
In other words, although the total size of this model is 35B, only about 3B of its parameters are activated for each generated token. Essentially, the model you're using has the knowledge of a 35B model, but only the 3B most relevant parameters do the work at any moment.
You can think of it as an 'expert pool' with 35 billion pieces of knowledge, but when you ask a specific question, it will only dispatch the 3 billion most specialized 'experts'. It retains the brain of a large model while possessing the speaking speed of a small model.
So, this is the parameter that confused me when I first came into contact with models: model density.
In other words, models are divided into dense models, which use all of their parameters for inference, and Mixture-of-Experts (MoE) models, which involve only a small number of relevant parameters in inference.
Neither type of model is inherently better or worse. Dense models are generally better in terms of stability and consistency, while MoE models have advantages in inference efficiency and scalability.
However, for home computers and everyday needs like ours, an MoE model is usually the more suitable choice.
So, what is the theoretical inference speed of the 35B-A3B on the standard M4 chip?
| Model size | 4-bit actual VRAM usage | Inference speed (Tokens/s) |
|---|---|---|
| 9B | ~6 GB | 20 |
| 14B | ~10 GB | 12 |
| 32B | ~20 GB | 6 |
| 35B-A3B | ~20-24GB | 80 |
To be safe, I'd like to add one caveat: I'm using the 3B active size to calculate speed, but for an MoE model, activating 3B parameters does not mean it runs as fast as a 3B dense model. Speed is also affected by routing overhead, memory access patterns, and key-value caching.
Keep this 80 tokens/s in mind. You'll see in the oMLX benchmark later that a single task only reaches 47, while continuous batching with 8 concurrent tasks reaches as high as 93.
The discrepancies come from two main factors: first, the MoE model Qwen3.5-35B-A3B has more headroom for multi-task inference; second, oMLX's SSD KV caching. I also did not account for factors such as the M4 chip's L2 cache, which could contribute to the discrepancies as well.
I think beginners can start by establishing a simple conversion between computer configuration and model parameters. If needed, they can then spend more time exploring the details.
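To make the 80 tokens/s figure reproducible, here is a minimal sketch that combines the two formulas from this part. It assumes, as the caveat above says, that only the active parameters are read per token; the function name `moe_estimate` is my own, and routing overhead and KV cache traffic are ignored entirely.

```python
def moe_estimate(total_b: float, active_b: float, bits: float, bandwidth_gb_s: float):
    """Return (total weight GB, active weight GB, estimated tokens/s) for an MoE model.

    All weights must fit in memory, but the speed estimate only uses the
    active parameters, which is an optimistic simplification.
    """
    total_gb = total_b * bits / 8      # what has to fit in memory
    active_gb = active_b * bits / 8    # what is assumed to be read per token
    return total_gb, active_gb, bandwidth_gb_s / active_gb


total_gb, active_gb, tps = moe_estimate(total_b=35, active_b=3, bits=4, bandwidth_gb_s=120)
print(f"weights to load: ~{total_gb} GB, read per token: ~{active_gb} GB, estimate: ~{tps:.0f} tokens/s")
```

The two numbers answer different questions: the total size decides whether the model fits, and the active size drives the speed estimate.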
2.4 Deployment Framework and Inference Engine
Previously, we chose the right computer and the right model. Similarly, choosing the right deployment framework and inference tool (engine) for the model is also very important.
For the Qwen3.5-35B-A3B model, I used two deployment frameworks to give you a clear sense of how they affect inference speed.
The first is llama.cpp with the general-purpose GGUF format. I downloaded the model with Ollama, the most commonly used tool, and loaded it in AnythingLLM so that the relevant data is easy to display.
The second is the MLX framework, which is specifically optimized for Apple's M-series chips. I will demonstrate it using LM Studio and oMLX respectively.
It's important to note that although both GGUF and MLX use the 4-bit Qwen3.5-35B-A3B model, their quantization precision differs. Regarding quantization precision:
A typical example of the GGUF format is Q4_K_M, which uses 6-bit quantization for critical parts and keeps 4-bit for the rest. Because of this mixed precision, the GPU frequently has to convert "non-standard bit widths" during computation. In frameworks that do not natively support this, the decompression overhead slows things down significantly.
The MLX build uses INT4 (4-bit throughout), which lets Apple's M-series chip read model parameters directly, without the extra "lookup" and "translation" steps. The result is memory access and scheduling that fit the M chip better when running models on a Mac.
This is one of the reasons why MLX models are preferred on Macs.
III. Model Deployment Comparison Test
In this comparative test, I used four tools: Ollama, AnythingLLM, LM Studio, and oMLX.
I downloaded two builds of Qwen3.5-35B-A3B at 4-bit: GGUF and MLX.
The tests fall into three categories: generation speed, time to first token, and multi-round heavy load.
Finally, I'll add one more test. One of my reasons for deploying the model locally is to use OpenClaw. So, let's compare Qwen3.5-35B and Qwen3-Coder-30B. If you're like me and want to use OpenClaw to develop web pages or applications, a model specialized for programming may be the better choice.
3.1 Generation Speed Test (Tokens/s)
Test method: Send each tool the same complex prompt (e.g., "Please write a complete Snake game in Python with detailed comments") and observe the generation speed printed in the background logs.
- Ollama: 15.42 t/s
- LM Studio: 35.06 t/s
- oMLX: 35.70 t/s
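To put these three results side by side, here is a small sketch that expresses each one as a multiple of the slowest. The numbers are the ones measured above; the helper name `speedups` is my own. Keep in mind that, in my setup, the Ollama run used the GGUF build while the other two used the MLX build.

```python
# Generation speeds measured above, in tokens per second.
results = {"Ollama": 15.42, "LM Studio": 35.06, "oMLX": 35.70}


def speedups(measured: dict, baseline: str) -> dict:
    """Express every result as a multiple of the chosen baseline."""
    base = measured[baseline]
    return {name: tps / base for name, tps in measured.items()}


for name, factor in speedups(results, "Ollama").items():
    print(f"{name}: {factor:.2f}x")
```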
3.2 Time to First Token / Prompt Processing (TTFT / Prefill)
Test method: Send each tool a long document of about 5000 words and ask for a summary. Measure the number of seconds from pressing Enter to the first word appearing. Theoretically, MLX should have the advantage in this round of testing; you can see for yourself.
- LM Studio: slightly
- oMLX: (omitted)
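If you would rather not use a stopwatch, the same measurement can be sketched in code. This is a minimal example under stated assumptions: `measure_stream` and `fake_stream` are hypothetical names, the fake stream merely sleeps to stand in for a real model, and a streamed chunk only approximates a token. To time a real tool, you would pass in whatever streaming iterator its client library gives you.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time a stream of text chunks: time to first chunk, then chunk rate."""
    start = clock()                      # the moment we "press Enter"
    first: Optional[float] = None
    count = 0
    for _ in chunks:
        if first is None:
            first = clock()              # the first chunk has arrived
        count += 1
    end = clock()

    if first is None:                    # nothing was generated
        return {"ttft_s": None, "chunks": 0, "chunks_per_s": None}

    decode_s = end - first
    # Rate of the chunks that followed the first one.
    rate = (count - 1) / decode_s if count > 1 and decode_s > 0 else None
    return {"ttft_s": first - start, "chunks": count, "chunks_per_s": rate}


def fake_stream(prefill_s: float = 0.05, n: int = 5, gap_s: float = 0.01):
    """Hypothetical stand-in for a model: waits, then yields n chunks."""
    time.sleep(prefill_s)                # simulated prompt processing
    for i in range(n):
        yield f"chunk{i} "
        time.sleep(gap_s)                # simulated per-chunk decode time


print(measure_stream(fake_stream()))
```

Separating the time before the first chunk from the rate afterwards mirrors the split between prefill and generation that this test is about.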
3.3 Agent Multi-round Heavy-Load Test (Re-prefill / Memory Test)
Test method: Use the "Standard 10-Round High-Pressure Test Script" in the appendix.
- It simulates a scenario in which OpenClaw writes code continuously.
- In a brand-new chat window, send the following 10 questions in sequence.
- For the first 9 rounds, you don't need to pay attention to the answers; just wait patiently for generation to finish (these rounds quickly consume about 100,000 tokens of context).
- ⚠️ The key point is round 10! The instant you send the 10th prompt, start your stopwatch and stop it when the first word appears on the screen. Record this time difference (TTFT).
In this round of testing, besides observing the generation speed across the 10 questions, I also looked at the number of cached tokens and how efficiently they were cached. After just 10 questions, there were already 140,000 tokens, of which 110,000 were cached. This is equivalent to using disk space instead of memory, saving 1-3 GB of RAM. The larger the parameter count and the higher the quantization bit depth (the higher the precision), the larger both the space required to load the model and the dynamic cache generated when processing the same number of tokens.
It's important to understand that solid-state drives (SSDs) are slower than RAM. While SSD caching saves valuable RAM, it sacrifices a little inference speed. However, over many rounds of questioning, the more stable system operation is clearly worth it.
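To get a feel for why the cache keeps growing with the conversation, here is a rough sketch of the usual KV cache size calculation for a standard attention transformer. Please treat every number in it as a placeholder: the layer count, KV head count, and head dimension below are hypothetical values chosen for illustration, not the real configuration of any model mentioned in this video, and real engines may compress or quantize the cache.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int, tokens: int) -> float:
    """Approximate KV cache size in GB for a standard attention transformer.

    Each token stores one key and one value vector (hence the factor 2)
    per KV head in every layer.
    """
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return bytes_per_token * tokens / 1e9


# Hypothetical architecture values, NOT taken from any real model card.
for tokens in (10_000, 50_000, 100_000):
    size = kv_cache_gb(layers=40, kv_heads=4, head_dim=128,
                       bytes_per_value=2, tokens=tokens)
    print(f"{tokens:>7} tokens: ~{size:.2f} GB")
```

The point is the shape of the curve rather than the exact values: the cache grows linearly with the number of tokens in context, which is why long multi-round sessions put pressure on memory.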
3.4 oMLX Continuous Batch Processing Benchmark Test
Test method: Use the oMLX benchmark to test the concurrent inference speed of the Qwen3.5-35B-A3B model.
In this round of concurrent testing, we need to look not only at the token generation speed, but also at the time to first token.
On my standard M4 with 32GB, 2x concurrency gives an ideal TPS of 72.1 tok/s with an average TTFT of 4933.2 ms. At 4x, the average TTFT rises to 9664.7 ms, which is somewhat counterproductive.
3.5 Are reasoning models really that good?
Test method: Test the speed of the first round of questions using Qwen3-Coder-30B.
Finally, one more thing: although I ran the speed tests with the general-purpose 35B model, if you're like me and want to run OpenClaw locally to automate code writing, I strongly recommend switching to Qwen3-Coder-30B-A3B (MLX version). The general model writes well, but it occasionally produces malformed JSON, which crashes the agent; the Coder model is an emotionless code machine, and it never crashes in OpenClaw.
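The malformed JSON problem is easy to picture with a small sketch. This is not how OpenClaw handles model output; it is only a generic illustration of checking a reply before acting on it. The function name `parse_tool_call` and the `tool` and `path` fields are hypothetical.

```python
import json
from typing import Optional


def parse_tool_call(raw: str) -> Optional[dict]:
    """Return the parsed JSON object, or None if the text is not a valid JSON object."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return None                      # malformed JSON: signal failure instead of crashing
    return value if isinstance(value, dict) else None


good = '{"tool": "write_file", "path": "app.py"}'
bad = '{"tool": "write_file", "path": "app.py"'   # closing brace is missing

print(parse_tool_call(good))
print(parse_tool_call(bad))
```

A caller that gets `None` back can retry the request or ask the model to fix its output, which is the kind of recovery a model with unreliable formatting would need.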
IV. Summary
Alright, that's all for this video.
I actually made this video twice, and revised the script several times. You can probably tell from the tests. Originally, I just wanted to make a simple comparison: which model and tool is faster, so as to choose the most suitable one for use in OpenClaw.
But I later discovered that my understanding of AI models was often superficial and incomplete.
A couple of days ago, I saw a comment under last year's Mac mini unboxing video, saying that the video helped him.
This made me feel guilty, and replying to this friend strengthened my resolve to redo the video.
And that's exactly why I discovered—
👉 AI is not something that can be solved by "selecting the right parameters"; it is more like a complete system engineering project.
The model, quantization, inference engine, hardware architecture, and practical needs—every choice will affect the final result.
Logically, I should provide a "standard answer": for example, what model should be chosen for what scenario, what inference tool should be used, and what computer configuration should be used for what purpose, such as daily chatting, analysis reports, or software development.
But after actually finishing this script, I felt that stubbornly searching for a fixed answer is a kind of "obsession".
There are no fixed laws, and no fixed laws are not laws at all. Cultivating the mind is worse than cultivating worldly laws.
In the context of AI, this is actually quite easy to understand.
The best model or framework today may be replaced in a few months.
The solution that is currently best suited for you could be completely different if you change the machine, the scenario, or the model.
Therefore, understanding is more important than memorizing "which one to use":
👉 Why it's more suitable here.
As for the so-called cultivation of the "mind," my understanding is:
If we treat AI as a way of thinking, then we should try to understand and break down problems using "AI methods".
If you treat AI as a tool, then use it to its fullest potential to discover and solve problems.
The former represents an upgrade in cognition;
The latter is an amplification of efficiency.
In today's information-saturated and rapidly evolving world, parameters may become outdated and models may be obsolete, but your understanding will not.
Hopefully this video can help you.
If you find this helpful, please subscribe to my channel, or like, comment, and share!
That's it, bye-bye~