Small, Fierce, Present
Published on: October 2, 2025
October 2025. It’s no longer a hypothesis: small models are not a fallback, they’re a weapon. They’re the light seeping under the door while the AI engine room hums in the dark.
I’ve learned (sometimes the hard way) that raw power isn’t enough: what matters is closeness. You need a model that breathes next to you, one that doesn’t ask permission from a data center. You need an intelligence that can live in the palm of your hand and, when necessary, stretch toward the cloud without betraying the secret you entrusted it with. That’s where compact, optimized models have taken the stage.
The horizon of “small” in 2025
We’re not talking about toys. Over the past year, the landscape has thickened:
- Llama 3.1 brought 8B and 70B variants with extended context windows and real-world adoption; not shards, but workhorses. (Meta AI)
- Mistral doubled down on the “small” line, updating Mistral Small for agentic and fast tasks: small, yes, but trained to bite. (Mistral AI Documentation)
- Microsoft formalized the grammar of SLMs with Phi-3, proving that “less” can mean “more often, better, and at a human-scale cost.” (Microsoft Azure)
- From Asia came the steady pressure of Qwen 2.5: omni and vision-language variants from 3B/7B up to more ambitious sizes, with a clear trajectory toward edge and laptop deployment. (AlibabaCloud)
- Falcon 7B/40B, released by the Technology Innovation Institute, remains a symbol of real open weights: trained on transparent datasets, easily optimized with LoRA/QLoRA, and runnable on consumer GPUs. A milestone showing quality can emerge outside big tech.
- And above this ground grew the GPT4All ecosystem: not a model but a launcher that made it trivial to download, quantize, and run dozens of models locally. It turned “small” into a daily practice, a bridge that democratized local AI (a minimal sketch of that practice follows this list).
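To make that daily practice tangible, here is a minimal sketch using the GPT4All Python bindings. The model filename is only an example of the kind of GGUF file its catalog offers; whatever you pick is downloaded once and then runs entirely on the local machine.

```python
# Minimal local inference with the GPT4All Python bindings.
# The model filename below is illustrative; any quantized GGUF model from
# the GPT4All catalog is downloaded once and then runs fully offline.
from gpt4all import GPT4All

model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")  # quantized, local CPU/GPU

with model.chat_session():
    reply = model.generate("Summarize why on-device inference matters.", max_tokens=200)
    print(reply)
```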
These models don’t just talk: they see, listen, summarize. The air of 2025 smells of compressed multimodality, of slashed latency, of lightweight adapters that turn a generalist into a domain artisan without rebuilding the world from scratch.
Body/machine, light/shadow
The server hall is blinding light and white noise; the wrist, instead, is warm shadow. On-device is no longer a slogan. Apple Intelligence formalized a pact: what’s intimate stays on device, what’s heavy rises to Private Cloud Compute—but on Apple silicon, with a declared chain of trust. A sober, almost surgical orchestration. (Apple)
On Windows, the threshold has become numeric and explicit: 40+ TOPS of NPU to be called a Copilot+ PC. Translation: much of the AI experience runs locally, and the cloud is back to being a choice, not a crutch. (I’ve seen 7B–13B models flow on laptops that struggled just a year ago.) (The Official Microsoft Blog)
The silicon bent to “small”: Snapdragon X Elite (45 TOPS) and Ryzen AI 300 (up to 50 TOPS) have made trivial what was acrobatics a year ago. Not rhetoric: architecture. (Qualcomm)
Optimization as ethics (and as craft)
Quantization, pruning, LoRA/QLoRA, distillation: the old tools became orchestration. But 2025 also brought new baseline speeds. FlashAttention-3 finally let the H100 earn its keep: 1.5–2.0× over FlashAttention-2, with FP8 no longer sounding like a compromise. The difference? The user feels it firsthand: less waiting, more fluidity, more headroom to run locally. (tridao.me)
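To make the craft side concrete, here is a hedged sketch of the QLoRA pattern with Hugging Face transformers, bitsandbytes, and peft: the base weights quantized to 4-bit NF4, a small set of LoRA adapters left trainable. The model id and hyperparameters are illustrative, not a recommendation.

```python
# Sketch: 4-bit NF4 quantization plus LoRA adapters (the QLoRA pattern).
# Model id and hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
)

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",            # illustrative 7B-class model
    quantization_config=bnb_config,
    device_map="auto",
)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)          # only the small adapters stay trainable
model.print_trainable_parameters()
```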
On serving, vLLM and its PagedAttention normalized throughput that once demanded custom systems. For me it proved that “scaling” is not just a cloud verb—it’s a mental habit. (vLLM Blog)
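In practice, serving a small model through vLLM’s offline API takes a handful of lines; the model id and sampling parameters below are placeholders, and PagedAttention plus continuous batching do their work underneath without being asked.

```python
# Sketch: batched offline inference with vLLM. PagedAttention manages the
# KV cache in pages so many requests share the GPU without fragmentation.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3")   # placeholder model id
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

prompts = [
    "Explain PagedAttention in two sentences.",
    "Why do small models matter at the edge?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```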
And then speculative decoding: a small model drafts, the large model verifies. A duet: one dances light, the other checks the step. The result is text that arrives before you expect it, without losing its voice. (arXiv)
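Hugging Face transformers exposes this duet as assisted generation. A sketch with illustrative model ids, where the 8B drafts and the 70B verifies (the two must share a tokenizer):

```python
# Sketch: speculative / assisted decoding. The draft model proposes a few
# tokens per step; the target model verifies them in one forward pass and
# keeps only the tokens it agrees with.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct", device_map="auto"
)
draft = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", device_map="auto"
)

inputs = tokenizer("Small models in 2025 are", return_tensors="pt").to(target.device)
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```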
Edge as politics, not just technique
There’s a reason MLCommons keeps releasing MLPerf Tiny: edge isn’t folklore, it’s infrastructure. The latest Tiny v1.3 benchmarks tell of micro-accelerators chewing through classification and keyword spotting at milliwatts. Not poetry: it’s bills, privacy, resilience. (MLCommons)
And while phones learn to write on their own (see Gemini Nano and Android’s GenAI APIs), the boundary between “device” and “platform” evaporates. Who needs to ask a server for permission to jot down an idea? (Android Developers Blog)
Beneath the waterline: hardware and hunger
The inconvenient truth? Compute hunger is not satiated. NVIDIA Blackwell promises new seas; AMD MI325X became real in the machine rooms of those diversifying; Gaudi 3 fights for space where price/performance balance matters more than logos. We’re not talking posters: we’re talking orders, availability, clusters spinning up. (NVIDIA Newsroom)
And yet the supply chain stays taut: demand outpaces supply, export controls shift the wind, entire ecosystems reorganize within a quarter. This is why small models are not a fad: they’re a survival strategy. (Reuters)
Regulation as frame (and as handbrake)
There’s another shadow outlining the light: the EU AI Act. Since February 2, 2025, bans have taken effect; since August 2025, obligations for GPAI began; the rest will phase in. Not bureaucracy: it’s the playing field. And for “small,” it’s often a competitive advantage—easier to be transparent, simpler to explain. (European Parliament)
Authenticity/illusion
There’s a risk: mistaking efficiency for truth. A 7B model replying in 80 ms is seductive. But speed doesn’t absolve error. Sometimes I’d rather wait a second longer if I can justify it. Maturity lies here: choosing when to stay local and when to raise your hand to the cloud. Not minimalism, but care.
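That choice can even be written down. A purely hypothetical sketch of the escalation policy, where `local_generate`, `cloud_generate`, and the confidence threshold are all invented names for illustration:

```python
# Hypothetical escalation policy: answer locally when the small model is
# confident, raise your hand to the cloud when it is not.
# All names and the threshold are illustrative.
from dataclasses import dataclass

@dataclass
class Draft:
    text: str
    confidence: float  # e.g. mean token probability mapped to [0, 1]

def answer(prompt: str, local_generate, cloud_generate, threshold: float = 0.75) -> str:
    draft: Draft = local_generate(prompt)   # runs on-device; the data never leaves
    if draft.confidence >= threshold:
        return draft.text                   # stay local: fast and private
    return cloud_generate(prompt)           # escalate: slower, but justified
```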
Burning questions (that remain)
What do we lose when we optimize?
How much of us should live on device, and how much can we afford to hand to a server?
Do we want machines that understand us faster, or better?
I don’t have final answers—I mistrust those who do. But I know small models force us to decide: performance here and now, privacy not as manifesto but as practice, sustainability not as a badge on a site but as an energy balance measured and reduced. And I know we have the means: from kernels accelerating attention to NPUs silencing latency. (tridao.me)
Declaration (personal)
I stand with those who build sober tools. With those who accept imperfection just to stay close to the person who writes, draws, heals, produces. With those who’d rather have a 3–7B model, well-trained, patiently distilled, tied to a clean retrieval system, than a colossus breathing far away and frightening. Not out of ideology: out of responsibility, accessibility, equity.
Because in the end the point is simple and fierce: an AI that resembles us doesn’t shout—it whispers at eye’s distance. And “small,” in 2025, is exactly that: a whisper that does its job, doesn’t pretend to be everything, knows when to grow and when to stay silent.