In this episode, we'll discuss how much the speed can differ when we choose different underlying architectures and inference tools, even with the same parameter scale or the same AI model.
I touched on this topic in my Mac mini unboxing video, where I gave a simple demonstration. But back then I was just starting out with AI models, and I wasn't clear about how different models differ from one another, or how the scale, architecture, and quantization format change the behavior of the same model.
Of course, I can't say I fully understand it yet; I've just spent a little more time learning about it because I want to deploy OpenClaw.
Along the way, I discovered that for the same AI model, adjusting these factors could double or even quadruple the generation speed. So I'd like to share this with you, as a supplement to the AI model experience I shared in last year's unboxing video.
Okay, let's begin.
I. Demonstration Equipment and Models
First, let me introduce the devices and AI models I've used.
I'm using a Mac mini M4 with 32GB of RAM.
The AI model used is: Qwen3.5-35B-A3B
Don't think I'm crazy when you see 35B; this model is very representative.
In last year's unboxing video, I measured how many tokens per second DeepSeek-R1 could output. Since the 14B model managed about 15 tokens per second while the 32B model managed only 4 to 5, I recommended choosing the 14B model.
However, this conclusion is too simplistic.
So, let's talk about the second question: how to tell whether an AI model is suitable for you from a brief description of its parameters.
Parameter count is important, but many other factors affect the user experience: the underlying architecture (Dense/MoE), the quantization scheme (Q4_K_M/S/XS), the inference framework (Ollama/LM Studio/oMLX), the type of capability (Vision/Tool Use/Reasoning), and whether the model runs a "thinking" pass before answering. All of these affect the time to first token and the token output speed.
The mixture-of-experts (MoE) model Qwen3.5-35B-A3B illustrates this point perfectly:
"Model size" is not the same as "actual operational burden".
II. AI Model Parameter Analysis
If you are familiar with AI models, you can skip to Part 3.
Next, let's talk about the model and inference engine we'll be using.
- Model name: Qwen3.5-35B-A3B
- Architecture type: MoE (Mixture of Experts)
- Model size: 35B total / 3B active (35 billion total, 3 billion activated)
- Quantization scheme: 4-bit (INT4 / Q4_K_M)
- Deployment framework: MLX (Apple Silicon native) / GGUF (llama.cpp)
Because we need to compare different deployment frameworks, and to make better use of the concurrency that MoE models like Qwen3.5-35B-A3B handle well, we'll use LM Studio as an inference engine, and I'd also like to recommend a newer tool: oMLX. If you want its technical details, you can ask any AI assistant to summarize them. In short, it's designed specifically for Apple's M-series chips and aims for very high speed, and its SSD-backed KV cache frees up memory, which makes it well suited to long multi-turn conversations.
Next, let's take Qwen3.5-35B-A3B again and talk about how to choose the right model for your device based on the model's key parameters.
2.1 Model Name and Parameter Scale
Let's first look at Qwen3.5-35B. Qwen is the model family, 3.5 is the version, and 35B is the parameter scale of the model.
Family names such as Qwen, DeepSeek, GLM, MiniMax, Gemma, and Llama, together with their version numbers, tell us the general characteristics of a model, so we can choose one that fits our own use case.
Parameter sizes such as 9B, 14B, and 32B map directly to video memory (VRAM) requirements. Since Apple's M-series chips use a unified memory architecture, on a Mac this means system memory. Here's a simple conversion formula:
Memory usage (model weights, in GB) = Parameter count (in billions) × Bits per parameter ÷ 8
For example, for a 32B model quantized to 4 bits, the required memory is:
32 × 4 ÷ 8 = 16 GB
That means at least 16 GB of memory is required. Of course, running the model takes memory not only for the weights, but also for the inference engine and the graphical interface. In practice, the breakdown looks like this:
| Model size | Actual memory usage at 4-bit | Fit on 32 GB of RAM |
|---|---|---|
| 9B | ~6 GB | Easy |
| 14B | ~10 GB | Good experience |
| 32B | ~20 GB | At the limit |
Of course, please note: in actual use, the key-value (KV) cache, i.e. the context cache, grows with the conversation and often consumes several gigabytes of additional memory, which is why the 32B model sits at a "critical point" on 32 GB of memory.
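If you'd rather not do the conversion by hand, the formula above can be sketched in a few lines of Python. The function name is my own, and it only covers the weights; the runtime overhead and the KV cache are not included, which is why the table's figures are higher.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GB.

    Billions of parameters x bits per parameter / 8 gives gigabytes,
    because 1 billion parameters at 8 bits each is 1 GB.
    """
    return params_billion * bits_per_param / 8


for size in (9, 14, 32):
    print(f"{size}B at 4-bit: ~{weight_memory_gb(size, 4):.1f} GB of weights")
```

Comparing these weight-only numbers with the table gives a rough feel for how much extra room the inference engine and interface need on top.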
2.2 Token/s and Memory Bandwidth
The model's parameter size, bit depth, and computer memory size determine whether the model can run. However, a crucial factor determining how fast the model runs is memory bandwidth. For example, my Mac mini uses a standard M4 chip, with a memory bandwidth of 120GB/s.
Here is a simple calculation formula:
Inference speed (Tokens/s) = Memory bandwidth ÷ Actual size of the model running
Let's look at the table again:
| Model size | 4-bit actual VRAM usage | Inference speed (Tokens/s) |
|---|---|---|
| 9B | ~6 GB | 20 |
| 14B | ~10 GB | 12 |
| 32B | ~20 GB | 6 |
Note that this is only a rough estimate. In actual operation, inference speed is affected by many factors, including the number of compute units, the KV cache access pattern, the batch size, the number of concurrent requests, framework optimizations, and the cache hit rate.
This can be roughly understood as follows: the larger the model, the more memory it consumes, the greater the pressure on memory bandwidth, and the slower the inference speed usually becomes.
Therefore, when we want to deploy a local AI model on a Mac, we need to consider two factors: the version of the M chip and the amount of memory.
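The bandwidth rule of thumb from this section can be sketched as a small calculation. This is only the simplified estimate described above (bandwidth divided by the size of the running model), with a function name I made up; it ignores every other factor that affects real speed.

```python
def estimated_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Rough ceiling on tokens/s: each token reads the whole model from memory once."""
    return bandwidth_gb_s / model_gb


M4_BANDWIDTH_GB_S = 120  # standard M4, as quoted above

for label, model_gb in (("9B", 6), ("14B", 10), ("32B", 20)):
    tps = estimated_tokens_per_second(M4_BANDWIDTH_GB_S, model_gb)
    print(f"{label} (~{model_gb} GB): ~{tps:.0f} tokens/s")
```

Swapping in the memory bandwidth of a different M-series chip shows why the chip version matters as much as the amount of memory.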
2.3 Operational Logic and Density
Let's look at the model Qwen3.5-35B-A3B. 35B is the physical scale of the model, so what does A3B represent?
Here, A stands for "activated".
In other words, although the total size of this model is 35B, only about 3B of its parameters are used to compute each generated token. Essentially, you get a model with the knowledge of 35B parameters, but only the roughly 3B most relevant parameters are actually processed at each step.
You can think of it as an 'expert pool' with 35 billion pieces of knowledge, but when you ask a specific question, it will only dispatch the 3 billion most specialized 'experts'. It retains the brain of a large model while possessing the speaking speed of a small model.
So, this is the parameter that confused me when I first came into contact with models: model density.
In other words, models are divided into dense models, which use all of their parameters for every inference step, and mixture-of-experts (MoE) models, which involve only a small number of relevant parameters in each step.
Neither type of model is inherently better or worse. Dense models are generally better in terms of stability and consistency, while MoE models have advantages in inference efficiency and scalability.
However, for home computers and everyday needs like ours, an MoE model would be more suitable.
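To make the "expert pool" idea a bit more concrete, here is a toy sketch of top-k routing, the general technique MoE models use to pick experts. It is only an illustration under my own assumptions: the eight experts, their scores, and the function name are made up, and it is not how Qwen3.5 is actually implemented.

```python
import math


def route_top_k(router_scores: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    top = sorted(range(len(router_scores)),
                 key=lambda i: router_scores[i], reverse=True)[:k]
    exps = {i: math.exp(router_scores[i]) for i in top}
    total = sum(exps.values())
    # Only these k experts run for this token; the rest stay idle.
    return {i: e / total for i, e in exps.items()}


# Made-up router scores for 8 toy experts (hypothetical values).
scores = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.9]
weights = route_top_k(scores, k=2)
print(weights)
print(f"experts used for this token: {len(weights)} of {len(scores)}")
```

All eight experts still have to sit in memory, but only two do any work for this token, which is the gap between "model size" and "actual operational burden".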
So, what is the theoretical inference speed of the 35B-A3B on the standard M4 chip?
| Model size | 4-bit actual VRAM usage | Inference speed (Tokens/s) |
|---|---|---|
| 9B | ~6 GB | 20 |
| 14B | ~10 GB | 12 |
| 32B | ~20 GB | 6 |
| 35B-A3B | ~20-24GB | 80 |
To be safe, let me add a caveat: I'm using the A3B active size to calculate speed, but for an MoE model, activating 3B parameters does not mean it runs as fast as a 3B dense model. Speed is also affected by factors such as routing overhead, memory access patterns, and the KV cache.
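For reference, the 80 in the table comes from applying the same bandwidth rule to the active parameters only. A minimal sketch, with a function name of my own and all the caveats above still applying:

```python
def moe_estimated_tps(bandwidth_gb_s: float,
                      active_params_billion: float,
                      bits_per_param: float) -> float:
    """Optimistic tokens/s estimate that counts only the active parameters."""
    active_gb = active_params_billion * bits_per_param / 8  # weights read per token
    return bandwidth_gb_s / active_gb


# Standard M4 (120 GB/s), 3B active parameters, 4-bit quantization.
print(moe_estimated_tps(120, 3, 4))
```

Treat the result as a ceiling, not a prediction: routing overhead and cache traffic are not modeled here.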
You can note down this 80 tokens/s. You'll see in the oMLX benchmark later that a single request only gets 47, while continuous batching with 8 concurrent requests gets as high as 93.
The discrepancies in the data come from two main factors: first, the Qwen3.5-35B-A3B MoE model offers greater potential for concurrent inference; and second, oMLX's SSD KV caching. Of course, factors such as the L2 cache of the M4 chip were not considered, which could also contribute to the discrepancies.
I think beginners can start by establishing a simple conversion between computer configuration and model parameters. If needed, they can then spend more time exploring the details.
2.4 Deployment Framework and Inference Engine
Previously, we chose the right computer and the right model. Similarly, choosing the right deployment framework and inference tool (engine) for the model is also very important.
For the Qwen3.5-35B-A3B model, I used two deployment frameworks to give you a clear sense of how they affect inference speed.
The first is llama.cpp with the general-purpose GGUF format. I downloaded the model with Ollama, the most commonly used tool, and loaded it in Anything LLM so that the relevant statistics are easy to display.
The second is the MLX framework, which is specifically optimized for Apple's M-series chips. I will demonstrate it using LM Studio and oMLX respectively.
It's important to note that although both GGUF and MLX use the 4-bit Qwen3.5-35B-A3B model, their quantization precision differs. Regarding quantization precision:
A typical example of the GGUF format is Q4_K_M, which uses 6-bit quantization for critical parts and 4-bit for the rest. Because of this mixed precision, the GPU frequently has to convert between non-standard bit widths during computation. In frameworks that do not natively support this, the decompression overhead slows things down significantly.
The MLX build uses INT4 (4-bit throughout), which lets Apple's M-series chips read the model parameters directly, without the extra "lookup" and "translation" steps. The result is more efficient memory access and scheduling that is better aligned with the M-series hardware when running models on a Mac.
This is one of the reasons MLX models are preferred on Macs.
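If "4-bit" still feels abstract, here is a toy sketch of what quantizing a handful of weights to 16 levels looks like. It is a generic textbook-style scheme I wrote for illustration, not the actual Q4_K_M or MLX implementation, and the weight values are made up.

```python
def quantize_4bit(values: list[float]) -> tuple[list[int], float, float]:
    """Map floats onto 16 evenly spaced levels (codes 0..15) between min and max."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 if all values are equal
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_4bit(codes: list[int], scale: float, lo: float) -> list[float]:
    """Recover approximate floats from the 4-bit codes."""
    return [lo + c * scale for c in codes]


weights = [-0.8, -0.1, 0.0, 0.3, 0.7]  # hypothetical weights
codes, scale, lo = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale, lo)
print("codes:   ", codes)
print("restored:", [round(r, 3) for r in restored])
```

Each weight now needs only 4 bits plus a shared scale and offset; the price is a small rounding error, which is the precision trade-off behind every quantization scheme mentioned here.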
III. Model Deployment Comparison Test
In this comparative test, I used four inference tools: Ollama, Anything LLM, LM Studio, and oMLX.
I downloaded two builds of the model: the GGUF and the MLX versions of Qwen3.5-35B-A3B, both 4-bit.
The tests fall into three categories: a generation speed test, a time-to-first-token test, and a multi-round heavy-load test.
Finally, I'll add one more test. One of my reasons for deploying the model locally is to use OpenClaw. So, let's compare Qwen3.5-35B and Qwen3-Coder-30B. If you're like me and want to use OpenClaw to develop web pages or applications, a model specialized for programming may be the better choice.
3.1 Generation Speed Test (Tokens/s)
Test method: Send each tool the same complex prompt (e.g., "Please write a complete Snake game in Python with detailed comments") and observe the generation speed printed in the background logs.
- Ollama: 15.42 t/s
- LM Studio: 35.06 t/s
- oMLX: 35.70 t/s
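To put the three results side by side, here is a quick calculation of the speedup relative to Ollama. The numbers are the ones listed above; the dictionary and function names are just my own.

```python
results_tps = {"Ollama": 15.42, "LM Studio": 35.06, "oMLX": 35.70}


def speedup(tps: float, baseline_tps: float) -> float:
    """How many times faster a result is than the baseline."""
    return tps / baseline_tps


baseline = results_tps["Ollama"]
for tool, tps in results_tps.items():
    print(f"{tool}: {tps:.2f} t/s ({speedup(tps, baseline):.2f}x vs Ollama)")
```

Keep in mind that Ollama ran the GGUF build while LM Studio and oMLX ran the MLX build, so the gap reflects the framework and the format together.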
3.2 Time to First Token / Prompt Processing (TTFT / Prefill)
Test method: Send each tool a long document of about 5000 words and ask for a summary. Measure the number of seconds from pressing Enter to the first word appearing. Theoretically, MLX should have the advantage in this round of testing; you can see for yourself.
- LM Studio: slightly
- oMLX: (omitted)
3.3 Agent Multi-round Reload Test (Reprefill / Memory Test)
Test method: Use the "Standard 10-Round High-Pressure Test Script" in the appendix.
- It simulates OpenClaw continuously writing code.
- In a brand-new chat window, send the following 10 questions in sequence.
- For the first 9 rounds, you don't need to pay attention to the answers; just wait patiently for each one to finish generating (these rounds quickly consume about 100,000 tokens of context).
- ⚠️ The key point is round 10! Start your stopwatch the instant the 10th prompt is sent, and stop it when the first word appears on the screen. Record this time difference (TTFT).
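If you'd rather not rely on a hand-held stopwatch, the same measurement can be sketched in code. This is only an illustration: `fake_stream` is a stand-in I invented so the example runs by itself, and in practice you would replace it with whatever streaming call your inference tool provides.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_ttft(stream_fn: Callable[[str], Iterable[str]],
                 prompt: str) -> Tuple[float, str]:
    """Return (seconds until the first non-empty chunk, that chunk)."""
    start = time.perf_counter()
    for chunk in stream_fn(prompt):
        if chunk:
            return time.perf_counter() - start, chunk
    raise RuntimeError("stream ended without producing any text")


def fake_stream(prompt: str):
    """Hypothetical stand-in for a real streaming response."""
    time.sleep(0.05)  # pretend the model is still processing the prompt
    yield "Hello"
    yield " world"


ttft, first_chunk = measure_ttft(fake_stream, "round 10 prompt")
print(f"TTFT: {ttft * 1000:.0f} ms, first chunk: {first_chunk!r}")
```

The timer starts before the request is sent and stops at the first piece of text, which mirrors the "press Enter, wait for the first word" procedure in the list above.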
In this round of testing, besides observing the generation speed across the 10 questions, I also looked at the number of cached tokens and the cache efficiency. After just 10 questions, there were already 140,000 tokens, of which 110,000 were cached. This is equivalent to using disk space instead of memory, saving 1-3 GB of RAM. The larger the parameter count and the higher the quantization bit depth (that is, the higher the precision), the larger both the space required to load the model and the dynamic cache generated when processing the same number of tokens.
Keep in mind that solid-state drives (SSDs) are slower than RAM. Offloading the cache to SSD saves valuable RAM at the cost of a little inference speed. Over many rounds of questioning, though, the more stable system behavior is clearly worth it.
3.4 oMLX Continuous Batch Processing Benchmark Test
Test method: Using the oMLX benchmark, I tested the inference speed of the Qwen3.5-35B-A3B model under concurrent tasks.
In this round of concurrent testing, we need to look not only at the token generation speed, but also at the time it takes to generate the first token.
On my standard 32 GB M4, 2x concurrency gives an ideal TPS of 72.1 tok/s with an average TTFT of 4933.2 ms. At 4x concurrency, the average TTFT rises to 9664.7 ms, which is somewhat counterproductive.
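To see why pushing concurrency higher is a mixed blessing, here is a small calculation on the figures above. I'm assuming the 72.1 tok/s is the total across both concurrent requests, which the benchmark output I quoted does not spell out; the function name is my own.

```python
def per_request_tps(aggregate_tps: float, concurrency: int) -> float:
    """Average tokens/s each request sees if the total is shared evenly."""
    return aggregate_tps / concurrency


# Assumption: 72.1 tok/s is the combined throughput of 2 concurrent requests.
print(f"per request at 2x: {per_request_tps(72.1, 2):.2f} tok/s")

ttft_2x_ms, ttft_4x_ms = 4933.2, 9664.7
ttft_ratio = ttft_4x_ms / ttft_2x_ms
print(f"TTFT at 4x is {ttft_ratio:.2f}x the TTFT at 2x")
```

Doubling the concurrency roughly doubles the wait for the first token, so the extra throughput is paid for in responsiveness.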
3.5 Are inference models really that good?
Test method: Test the speed of the first round of questions using Qwen3-Coder-30B.
Finally, one more thing: although I ran the speed tests on the general-purpose 35B model, if you're like me and want to run OpenClaw locally to automate code writing, I strongly recommend switching to Qwen3-Coder-30B-A3B (MLX version). The general model writes well, but it occasionally produces malformed JSON, which crashes the agent; the Coder model, by contrast, is an emotionless code machine, and it will never crash in OpenClaw.
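As a side note, the "incorrect JSON" problem is the kind of thing a defensive check can catch before it takes down an agent. The sketch below is a generic illustration using Python's standard `json` module; `parse_tool_call` is a hypothetical helper of mine and has nothing to do with OpenClaw's actual internals.

```python
import json
from typing import Optional


def parse_tool_call(raw: str) -> Optional[dict]:
    """Return the parsed JSON object, or None if the model's output is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # the caller can retry or ask the model to fix its output
    return data if isinstance(data, dict) else None


good = '{"tool": "write_file", "path": "index.html"}'
bad = '{"tool": "write_file", "path": "index.html",}'  # trailing comma is invalid JSON

print(parse_tool_call(good))
print(parse_tool_call(bad))
```

A check like this doesn't make a general model better at JSON, but it turns a crash into a retry, which is a useful fallback if you can't switch models.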
IV. Summary
Alright, that's all for this video.
I actually made this video twice, and revised the script several times. You can probably tell from the tests. Originally, I just wanted to make a simple comparison: which model and tool is faster, so as to choose the most suitable one for use in OpenClaw.
But I later discovered that my understanding of AI models was often superficial and incomplete.
A couple of days ago, I saw a comment under last year's Mac mini unboxing video, saying that the video helped him.
This made me feel guilty, and replying to this friend strengthened my resolve to redo the video.
And that's exactly why I discovered—
👉 AI is not something that can be solved by "selecting the right parameters"; it is more like a complete system engineering project.
The model, quantization, inference engine, hardware architecture, and practical needs—every choice will affect the final result.
Logically, I should provide a "standard answer": for example, what model should be chosen for what scenario, what inference tool should be used, and what computer configuration should be used for what purpose, such as daily chatting, analysis reports, or software development.
But after actually finishing this script, I felt that stubbornly searching for a fixed answer is a kind of "obsession".
There is no fixed law, and even "no fixed law" is not itself a law. Cultivating worldly methods is not as good as cultivating the mind.
In the context of AI, this is actually quite easy to understand.
The best model or framework today may be replaced in a few months.
The solution that is currently best suited for you could be completely different if you change the machine, the scenario, or the model.
Therefore, understanding is more important than memorizing "which one to use":
👉 Why it's more suitable here.
As for the so-called cultivation of the "mind," my understanding is:
If we treat AI as a way of thinking, then we should try to understand and break down problems using "AI methods".
If you treat AI as a tool, then use it to its fullest potential to discover and solve problems.
The former represents an upgrade in cognition;
The latter is an amplification of efficiency.
In today's information-saturated and rapidly evolving world, parameters may become outdated and models may be obsolete, but your understanding will not.
Hopefully this video can help you.
If you find this helpful, please subscribe to my channel, or like, comment, and share!
That's it, bye-bye~