Market Landscape
1. AI Everywhere: More than 200 billion CPUs are already deployed, making them the only viable hardware base for creating an AI-enabled world.
2. ROI-Positive AI Solutions: Most current AI agents and systems fail to meet ROI requirements because of high OpEx and CapEx costs.
3. Sustainability & Environmental Concerns: The current power infrastructure is not adequate for the GenAI market's needs.
4. Vertical Models & Agents: Enterprises are looking for SLMs with SOTA performance on their downstream tasks. The market for enterprise vertical agents powered by vertical models is estimated at $300 billion.
Technology Landscape
1. Larger Parameter Counts Do Not Equal Better Accuracy: Model sizes are shrinking while accuracy keeps improving. With scaling laws plateauing, the only way forward is smaller, more efficient, more performant models.
2. More Efficient Model Architectures: Newer architectures such as liquid neural networks, state-space models (SSMs), BitNet, Hymba, and Jamba are making models more compute-, memory-, power- and bandwidth-efficient, pointing to an AI future on CPUs.
3. Inference-Time Optimizations for Accurate SLM Agents: Inference-time optimisations such as Mixture-of-Agents (MoA), model swarms, and TextGrad can drastically improve the accuracy and performance of SLM-based agents (a minimal MoA sketch follows this list).
4. Hybrid LLM Architecture: Augmenting client-side SLMs with cloud-based LLMs can drastically reduce GenAI agent cost by federating inference across cloud, client, and edge (see the routing sketch after this list).
5. Collaborative Inference Methods: Conversely, augmenting LLMs with CPU-based SLMs can drastically reduce the cost of LLM inference itself (see the draft-and-verify sketch after this list).
6. No More Expensive Finetuning or Pretraining: Model-merging methods have evolved to be as effective as finetuning for certain applications, and they run easily on a CPU. Model merging combined with SOTA RAG methods allows easy model personalisation on a consumer-grade CPU (a merge sketch follows this list).
7. Goodput Instead of Throughput: The industry is moving towards use-case-based SLO metrics rather than generic ones (see the goodput example after this list).
8. Batch Processing or Non-Realtime: Most GenAI applications are not chatbots that require real-time processing. While Xeons can serve even chatbot workloads, they deliver exponentially better TCO and ROI for non-realtime or batched requests.
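To make item 3 concrete, here is a minimal Mixture-of-Agents sketch in Python. The `proposers` and `aggregator` callables stand in for any SLM inference binding, and the synthesis prompt is an illustrative assumption, not Bud's actual implementation.

```python
# Minimal Mixture-of-Agents (MoA) sketch: several small models draft
# answers independently, then one model synthesizes the drafts.
# `proposers` and `aggregator` are hypothetical text-in/text-out callables.
from typing import Callable, List

def moa_answer(prompt: str,
               proposers: List[Callable[[str], str]],
               aggregator: Callable[[str], str]) -> str:
    # Layer 1: each SLM produces an independent candidate answer.
    drafts = [propose(prompt) for propose in proposers]
    # Layer 2: the aggregator sees all candidates and writes one
    # higher-quality response.
    numbered = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(drafts))
    return aggregator(
        "Several candidate answers to the same question follow.\n"
        f"Question: {prompt}\nCandidates:\n{numbered}\n"
        "Synthesize the single best answer."
    )
```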
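For item 4, a hedged sketch of cloud-client federation via confidence-based routing, assuming the local SLM exposes a mean token log-probability; the threshold value is a made-up example, not a published Bud policy.

```python
# Route each request to the cheap local SLM first and escalate to a
# cloud LLM only when the SLM looks unsure. The confidence signal
# (mean per-token log-probability) and the -0.3 floor are illustrative.
from dataclasses import dataclass

@dataclass
class SLMResult:
    text: str
    mean_logprob: float  # average log-probability of generated tokens

def hybrid_answer(prompt, slm, cloud_llm, logprob_floor=-0.3):
    local = slm(prompt)                  # cheap CPU-side inference
    if local.mean_logprob >= logprob_floor:
        return local.text                # confident: stay on-device
    return cloud_llm(prompt)             # uncertain: pay for the LLM call
```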
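For item 5, one well-known collaborative pattern is draft-and-verify (speculative decoding): a CPU SLM drafts tokens and the LLM merely verifies them. The sketch below shows the greedy variant with hypothetical `draft_next`/`target_next` next-token functions; a real implementation verifies all k draft tokens in a single batched forward pass and may use probabilistic acceptance.

```python
# Greedy draft-and-verify: the output matches greedy decoding with the
# target model alone, but most tokens are proposed by the cheap SLM.
def speculative_step(ctx, draft_next, target_next, k=4):
    drafted, d_ctx = [], list(ctx)
    for _ in range(k):                        # SLM drafts k tokens cheaply
        tok = draft_next(d_ctx)
        drafted.append(tok)
        d_ctx.append(tok)
    accepted = []
    for tok in drafted:                       # target checks each draft token
        expected = target_next(ctx + accepted)
        if expected != tok:                   # first disagreement: keep the
            accepted.append(expected)         # target's token and stop
            break
        accepted.append(tok)
    return ctx + accepted                     # always extends ctx by >= 1 token
```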
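For item 6, a minimal CPU-only merge sketch, assuming two PyTorch checkpoints that share an architecture. Linear interpolation ("model soup" style) is the simplest case; methods like TIES or DARE are more selective, but all reduce to cheap tensor arithmetic with no gradient steps and hence no GPU.

```python
# Merge two checkpoints by linear interpolation of their weights.
# File names in the usage note are hypothetical.
import torch

def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    assert sd_a.keys() == sd_b.keys(), "checkpoints must share an architecture"
    return {k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k] for k in sd_a}

# Usage:
# merged = merge_state_dicts(torch.load("base.pt", map_location="cpu"),
#                            torch.load("domain.pt", map_location="cpu"))
# torch.save(merged, "merged.pt")
```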
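And for item 7, the throughput/goodput distinction in a few lines: goodput only credits requests that met the use case's SLO. The 2 s time-to-first-token and 50 ms per-output-token budgets are made-up examples.

```python
# Goodput = fraction of requests that met the SLO, not raw tokens/sec.
def goodput(requests, ttft_slo=2.0, tpot_slo=0.05):
    ok = [r for r in requests
          if r["ttft_s"] <= ttft_slo and r["tpot_s"] <= tpot_slo]
    return len(ok) / len(requests)

reqs = [{"ttft_s": 1.2, "tpot_s": 0.04},
        {"ttft_s": 3.1, "tpot_s": 0.04},   # misses the TTFT budget
        {"ttft_s": 0.9, "tpot_s": 0.07}]   # misses the per-token budget
print(goodput(reqs))  # 0.33...: high throughput can still be low goodput
```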
Bottom Line: The market is seeking ROI-positive, power-, memory- and compute-efficient GenAI solutions and agents that work on existing infrastructure. Current progress in data preparation, model training, model architecture, inference-time optimizations, and model-merging methodologies enables GenAI agents that run on consumer-grade CPUs at scale.
Why x86/CPUs/Non-Accelerators Are Preferred for Inference
1. Less CapEx: Data centres currently need enormous changes to accommodate the networking, cooling, and power requirements of GPUs and accelerators.
2. Less OpEx: GPUs come with exponentially higher power, cooling, and maintenance requirements, which means higher OpEx.
3. Better TCO and ROI: CPUs reduce CapEx and OpEx, improving ROI and TCO exponentially at scale (approximate calculations are given in the references).
4. Scalability: There is an industry-wide shortage of accelerators and GPUs, and it is only getting worse as demand rises. A company can build a GenAI application as a POC or beta for 1,000 users, but scaling it to 100,000 users is extremely difficult, and for a startup or individual developer it is almost impossible.
5. Easy to Adopt & Maintain: Building and maintaining a GPU-based application requires specialised hardware and software expertise, and such talent is scarce and extremely expensive. By contrast, almost all software, network, hardware, and data-centre engineers are already well versed in CPU-based systems at scale.
6. The Barrier to Entry: GenAI is currently inaccessible to SMEs, startups, and traditional enterprises: an Nvidia H100 or AMD MI300 can cost up to 85,000 USD/month with CSPs, while an SPR (Sapphire Rapids Xeon) device costs only 500-2,000 USD/month. This drastically lowers the barrier to entry while promoting adoption, experimentation, and evolution of the technology.
Bottom Line: Democratise GenAI by commoditising it.
Bud Ecosystem (Technology and Models)
Bud Ecosystem develops a universal runtime, an inference stack, and SOTA GPU-free SLMs, MLLMs, and diffusion models that provide production-ready inference with SOTA accuracy on consumer-grade CPUs.
We are building a GenAI deployment agent (an agent for creating GenAI agents) that finds the best models, selects the best deployment configs, optimises the prompts, and sets up and manages end-to-end GenAI observability through a simple chat interface, allowing anyone to build SOTA GenAI infrastructure just by providing a few examples of their downstream tasks. We believe this easy-to-use agent, combined with cost-effective SPR/CPU devices, can make GenAI applications as simple to create, deploy, and manage, and as cost-effective, as a typical web application.
Bud has already worked with multiple large organizations and government agencies to move their GPU-based GenAI agents and solutions to CPUs using its proprietary runtime and GPU-free models.
Example Models and TCO
| Model | Context Precision | Faithfulness | Answer Relevancy | Context Recall | Total |
|---|---|---|---|---|---|
| GPT-4o | 0.900 | 0.9158 | 0.9016 | 0.8952 | 3.6126 |
| Bud RAG Jr 1.7B | 0.9545 | 0.8530 | 0.8274 | 0.9394 | 3.5743 |
TCO - Xeon vs Nvidia A100 (Bud)
| Metric | Intel Xeon-based (Bud) | Nvidia A100 |
|---|---|---|
| Power consumption | 210 W | ~810 W |
| Throughput | 615 tokens/sec | 1462 tokens/sec |
| Energy per 1M tokens | 0.094 kWh | 0.15 kWh |
| Cost per 1M tokens | 0.181 USD | 0.203 USD |
TCO - Xeons (Bud) + Bud Jr Model vs Nvidia A100 vs OpenAI
| Metric | RAG on Xeon with Bud Runtime & Bud Jr (1.7B) | RAG on Nvidia A100 with SOTA open-source RAG model (Llama-3.1 70B) | GPT-4o RAG |
|---|---|---|---|
| Power consumption | 210 W | ~2910 W | - |
| Throughput | 1269 tokens/sec | 333 tokens/sec | - |
| Energy per 1M tokens | 0.045 kWh | 2.42 kWh | - |
| Cost per 1M tokens | 0.087 USD | 4.53 USD | 15-20 USD |
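The energy rows in both tables follow directly from power draw and measured throughput; a quick check (figures from the tables above, 1 kWh = 3.6e6 J):

```python
# Energy per 1M tokens = (watts / tokens-per-sec) joules per token,
# times 1e6 tokens, converted to kWh.
def kwh_per_million_tokens(power_w: float, tokens_per_sec: float) -> float:
    return (power_w / tokens_per_sec) * 1e6 / 3.6e6

print(kwh_per_million_tokens(210, 615))    # ~0.095 kWh (Xeon, first table)
print(kwh_per_million_tokens(810, 1462))   # ~0.154 kWh (A100, first table)
print(kwh_per_million_tokens(210, 1269))   # ~0.046 kWh (Xeon + Bud Jr, second table)
print(kwh_per_million_tokens(2910, 333))   # ~2.43 kWh (A100 + 70B, second table)
```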