Introducing Bud Latent: A High-Performance Inference Engine for Embedding Models

Feb 12, 2025

Abstract: While working on projects that involve inference with embedding models, we found inference error rates to be a major headache. We tested a few inference engines: Hugging Face's Text Embeddings Inference (TEI) had a 94% error rate and the Infinity inference engine a 37% error rate, both at higher context lengths (8,000 tokens). Moreover, we observed that TEI crashes when the input context length reaches 16,000 tokens. With numbers like these, these tools weren't ready for production deployment. So we decided to build a new inference engine specifically to solve this problem. We called it Bud Latent, and it works: we reduced the error rate to under 1%.

Bud Latent is designed to optimize inference for embedding models in enterprise-scale deployments. Our benchmark results show that Bud Latent outperforms TEI by up to 90% and Infinity by up to 85%, while also delivering a 1.4x performance boost on CPUs. With this performance and an error rate below 1%, Bud Latent is currently the most production-ready inference engine for embedding models.



We are introducing Bud Latent, a new inference engine in the Bud Runtime stack, designed to optimize inference for embedding models in enterprise-scale deployments. Initial benchmarks indicate that Bud Latent offers up to a 90% improvement in inference performance compared to Hugging Face's Text Embeddings Inference (TEI) and up to an 85% improvement compared to the Infinity inference engine (torch backend), along with a 1.4x performance boost on CPUs.

Bud Latent is a production-ready inference engine with an error rate of less than 1%, compared to 94% for TEI and 37% for Infinity at higher context lengths (8,000 tokens). Moreover, we observed that TEI crashes when the input context length reaches 16,000 tokens. The Bud Latent runtime supports a range of hardware and device architectures, including Intel and AMD CPUs, AMD ROCm and NVIDIA CUDA devices, AWS Inferentia for cloud inference, and Apple Silicon and Intel Core Ultra for client-side inference. The platform also supports horizontal scaling, routing, and OpenAI-compatible APIs.
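
Because the runtime exposes OpenAI-compatible APIs, existing OpenAI client code can typically be pointed at a Bud Latent deployment by changing only the base URL. The sketch below assumes a local deployment at http://localhost:8000/v1 serving the benchmark model under the name gte-large-en-v1.5; both the URL and the model identifier are placeholders, not values fixed by the runtime.

```python
# Minimal sketch of calling an OpenAI-compatible embeddings endpoint.
# The base_url, api_key, and model name are illustrative placeholders;
# substitute the values of your own deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local endpoint
    api_key="not-needed-for-local",       # placeholder credential
)

response = client.embeddings.create(
    model="gte-large-en-v1.5",  # same model as in the benchmarks below
    input=["Bud Latent keeps error rates under 1% at long context lengths."],
)

print(len(response.data[0].embedding))  # dimensionality of the returned vector
```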

Graph 1: Latency vs. number of requests for Bud Latent, TEI, and Infinity. Benchmark experiments conducted with the gte-large-en-v1.5 model on an Intel Xeon Platinum 8592V processor with 32 cores and 40GB of memory.

Graph 2: Failed requests vs. number of input tokens for Bud Latent, TEI, and Infinity. Benchmark experiments conducted with the gte-large-en-v1.5 model on an Intel Xeon Platinum 8592V processor with 32 cores and 40GB of memory.

The Growing Need for High-Performance Embeddings Inference

As enterprises increasingly rely on embedding models to power critical applications such as search relevance, content personalization, fraud detection, and knowledge management, the demand for robust, high-performance inference solutions has never been greater. Traditional inference engines often struggle with high error rates, high latency and cost inefficiencies, hindering real-time applications and large-scale deployments.

Bud Latent is engineered to overcome these challenges by significantly reducing error rates and latency, enhancing throughput, and minimizing infrastructure costs, all while maintaining high accuracy. The key features of the inference engine are as follows:

  • State-of-the-Art Performance: Supports FlashAttention, PagedAttention, and custom-optimised device-specific kernels, delivering up to a 90% performance improvement over TEI and 85% over Infinity.
  • Model Compatibility: Use models straight from Hugging Face, ModelScope, or local disk.
  • Multiplatform Support: Bud Latent offers broad compatibility across hardware platforms, including NVIDIA CUDA, AMD ROCm, CPU, AWS INF2, and Apple MPS accelerators. This flexibility ensures that enterprises can seamlessly deploy embedding models across a range of environments, optimizing performance on each platform. It also supports heterogeneous hardware deployments, allowing you to run your embedding workload across different types of hardware.
  • Production-Ready : With robust reliability—featuring an error rate of less than 1%—and scalability, the new inference engine is production-ready. It supports multi-cloud, multi-hardware horizontal & auto scaling, allowing businesses to deploy AI-powered applications at scale without compromising speed, accuracy, or cost-efficiency. Additionally, it includes support for custom performance metrics and distributed tracing with OpenTelemetry and Prometheus metrics.
  • Multi-Utility: Bud Latent not only helps you create embeddings, but can also be used for reranking models, text curation models, prompt routing models, multi-modal and cross-modal models, text classification, and more.
  • Zero Configuration: Integrated with the Bud Simulator, the system automatically identifies the optimal configuration for your production deployment, ensuring it meets your Service Level Objectives (SLOs) at the lowest possible cost.
  • Automated Hardware sizing & Finder: Automatically identify the right hardware across different clouds and determine the optimal hardware size to achieve the best Total Cost of Ownership while meeting your SLOs.
  • Dynamic Batching and Tokenization: The engine uses dynamic batching and tokenization, performed by dedicated worker threads. This allows for efficient resource management and maximizes throughput by processing multiple requests simultaneously, making it ideal for high-traffic applications (see the sketch after this list).
  • Enhanced CPU Performance with AVX and AMX: For enhanced performance on CPU-based systems, the inference engine supports AVX (Advanced Vector Extensions) and AMX (Advanced Matrix Extensions). These optimizations, combined with highly optimised custom kernels, ensure that CPU resources are used to their full potential, speeding up inference while maintaining accuracy. The Bud Latent runtime can also leverage multiple NUMA nodes on a system for improved resource utilisation and optimal performance.
  • Cloud, On-Prem, BYOC and Client Deployment: The inference engine is designed to work seamlessly in both cloud and client environments, offering the flexibility to scale and deploy based on your needs. Currently, the Bud Latent runtime supports 16 different cloud options and allows production deployment with Kubernetes and OpenShift.
  • A Solution to the Hardware Supply Problem: The Bud Latent runtime enables heterogeneous cluster deployments with automated hardware finding and provisioning across 16 different cloud platforms. This ensures you can scale up or down at any time, optimizing costs while maintaining full hardware availability.
  • Horizontal Scaling with Workers: Bud Runtime supports horizontal scaling via worker threads, allowing businesses to handle an increased number of requests while improving load handling. This architecture ensures that the system can scale efficiently as your application grows, maintaining optimal performance under heavy workloads.
  • Int8 and FP8 Support: The inference engine supports int8 (on CPU, ROCm (AMD), Synapse (Intel Gaudi), and CUDA) and fp8 (on H100/MI300) precision for faster computation and reduced memory usage, making it possible to run large models with greater efficiency while maintaining the precision needed for high-quality outputs.
  • Launch Multiple Models Simultaneously : With the ability to launch and run multiple models at once, the inference engine provides flexibility in handling a variety of use cases, from search relevance and content personalization to multimodal embeddings.
  • Multimodal Support: The engine supports a variety of embeddings, including text, image, and audio embeddings (CLIP, CLAP, ColPali), as well as reranking models. This multimodal capability enables enterprises to create more comprehensive AI applications that can process and understand a wide array of data types, unlocking new possibilities for personalized and dynamic user experiences.
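
To make the dynamic batching idea concrete, the following is a minimal, illustrative sketch of a dedicated worker thread that drains a shared request queue into batches, bounded by a maximum batch size and a short collection window. It is not Bud Latent's implementation; the queue-and-window mechanics illustrate the general technique, and embed_batch is a placeholder for the real tokenize-and-forward-pass step.

```python
# Illustrative sketch of dynamic batching with a dedicated worker thread.
# This is NOT Bud Latent's implementation; it only demonstrates grouping
# concurrent requests into batches before running the model.
import queue
import threading
import time

MAX_BATCH_SIZE = 32      # cap on how many requests are fused into one forward pass
MAX_WAIT_SECONDS = 0.01  # how long the worker waits to fill a batch

request_queue: "queue.Queue" = queue.Queue()

def embed_batch(texts):
    # Placeholder for the real tokenize-and-forward-pass step.
    return [[float(len(t))] for t in texts]

def batching_worker():
    while True:
        batch = [request_queue.get()]            # block until one request arrives
        deadline = time.monotonic() + MAX_WAIT_SECONDS
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        texts = [text for text, _ in batch]
        for (_, reply), embedding in zip(batch, embed_batch(texts)):
            reply.put(embedding)                 # hand each caller its own result

threading.Thread(target=batching_worker, daemon=True).start()

def embed(text):
    reply = queue.Queue(maxsize=1)
    request_queue.put((text, reply))
    return reply.get()

if __name__ == "__main__":
    print(embed("hello world"))
```

In the production engine, this batching is combined with dedicated tokenization workers and the device-specific kernels described above; the sketch only captures the queue-and-window mechanics.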

The Bud Latent inference engine can be applied to a variety of use cases across multiple industries:

  • AI Agents – Build high-performance AI agents with an error rate of less than 1% for production-ready reliability. Create embedding-based AI agents ideal for automating workflows across sectors such as customer service, operations management, and technical support.
  • Enterprise Search & Knowledge Management – Accelerates search and retrieval of information across vast document and database repositories (see the example after this list).
  • E-commerce & Personalization – Enhances recommendation engines for dynamic, user-specific content delivery.
  • Financial Services & Fraud Detection – Strengthens anomaly detection and risk analysis through real-time embedding comparisons.
  • Healthcare & Life Sciences – Improves medical research and diagnostics by enabling fast similarity searches across biomedical datasets.
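
As a concrete illustration of the search use case, documents can be ranked against a query by cosine similarity over their embedding vectors. The embed_texts callable below is a hypothetical stand-in for a client of the OpenAI-compatible embeddings endpoint shown earlier; a toy embedder is used so the snippet runs on its own.

```python
# Illustrative semantic search over a handful of documents using embeddings.
# embed_texts is a hypothetical helper that returns one vector per input text,
# e.g. by calling the OpenAI-compatible embeddings endpoint shown earlier.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def search(query: str, documents: list[str], embed_texts) -> list[tuple[float, str]]:
    doc_vectors = np.array(embed_texts(documents))
    query_vector = np.array(embed_texts([query]))
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    return sorted(zip(scores.tolist(), documents), reverse=True)

# Toy embedder standing in for the real endpoint, so the example is runnable.
toy_embed = lambda texts: [[len(t), t.count("error")] for t in texts]
print(search("error rates in production",
             ["deployment guide", "error budget policy"], toy_embed))
```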

Bud Latent is now available for enterprises aiming to scale their embeddings-based AI applications with enhanced efficiency and accuracy. To learn more or request a demo, contact us at contact@bud.studio.


Appendix: Benchmark results

Bud Latent: Up to 90% improvement in inference performance compared to TEI and up to an 85% improvement compared to the Infinity inference engine (torch backend), with an error rate of less than 1%.

| users | batch_size | request_rate | num_tokens | total_requests | successful_requests | failed_requests | avg_response_time | median_response_time | min_response_time | max_response_time | dtype |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 1 | inf | 100 | 10 | 10 | 0 | 0.4052023692 | 0.4955194052 | 0.1907812823 | 0.500028111 | float32 |
| 50 | 1 | inf | 100 | 50 | 50 | 0 | 1.143477629 | 1.163029227 | 0.2575879991 | 1.600166343 | float32 |
| 100 | 1 | inf | 100 | 100 | 100 | 0 | 2.272522269 | 2.262270691 | 0.3166659623 | 3.241668141 | float32 |
| 200 | 1 | inf | 100 | 200 | 200 | 0 | 4.217472302 | 3.351719621 | 0.3354266062 | 6.649045318 | float32 |
| 500 | 1 | inf | 100 | 500 | 500 | 0 | 8.000135476 | 7.772920148 | 0.1859967448 | 15.03466847 | float32 |
| 1000 | 1 | inf | 100 | 1000 | 1000 | 0 | 17.42841425 | 17.00285405 | 0.1875152327 | 33.67127076 | float32 |
| 100 | 1 | inf | 500 | 100 | 100 | 0 | 8.668577559 | 9.482354475 | 0.6105506159 | 14.31473305 | float32 |
| 100 | 1 | inf | 2000 | 100 | 100 | 0 | 49.83315539 | 43.65044803 | 0.880711833 | 73.80447029 | float32 |
| 200 | 1 | inf | 2000 | 200 | 200 | 0 | 89.25694118 | 73.52233746 | 0.6558725275 | 142.813371 | float32 |
| 100 | 1 | inf | 8000 | 100 | 100 | 0 | 270.9934692 | 380.4362849 | 24.45675381 | 408.0344722 | float32 |
| 100 | 1 | inf | 16000 | 100 | 100 | 0 | 265.2299411 | 264.0894367 | 134.3735515 | 397.1633052 | float32 |
| 10 | 1 | inf | 100 | 10 | 10 | 0 | 0.1803239632 | 0.213504605 | 0.09858523495 | 0.217314668 | bfloat16 |
| 50 | 1 | inf | 100 | 50 | 50 | 0 | 0.4386955613 | 0.4266068572 | 0.2049441207 | 0.6459736768 | bfloat16 |
| 100 | 1 | inf | 100 | 100 | 100 | 0 | 0.8199990342 | 0.661839799 | 0.6452382393 | 1.216599256 | bfloat16 |
| 200 | 1 | inf | 100 | 200 | 200 | 0 | 1.448243593 | 1.385621122 | 0.3847504165 | 2.411869619 | bfloat16 |
| 500 | 1 | inf | 100 | 500 | 500 | 0 | 3.387109188 | 3.521169587 | 0.1180846877 | 6.130071493 | bfloat16 |
| 1000 | 1 | inf | 100 | 1000 | 1000 | 0 | 6.401807476 | 6.295999131 | 0.1999545451 | 12.51010011 | bfloat16 |
| 100 | 1 | inf | 500 | 100 | 100 | 0 | 2.812243529 | 2.443220575 | 0.360729998 | 4.807883019 | bfloat16 |
| 100 | 1 | inf | 2000 | 100 | 100 | 0 | 19.08222142 | 26.19665648 | 1.030380877 | 27.09224063 | bfloat16 |
| 200 | 1 | inf | 2000 | 200 | 200 | 0 | 28.08099131 | 25.4279731 | 0.4083833154 | 49.77493088 | bfloat16 |
| 100 | 1 | inf | 8000 | 100 | 100 | 0 | 149.277586 | 170.5842005 | 1.764253156 | 174.0163538 | bfloat16 |
| 100 | 1 | inf | 16000 | 100 | 100 | 0 | 119.0317349 | 100.3649746 | 8.976574231 | 174.0420267 | bfloat16 |

Table 1: Benchmark results for Bud Latent. Experiments conducted with the gte-large-en-v1.5 model on an Intel Xeon Platinum 8592V processor with 40GB of memory and 32 cores, using float32 and bfloat16 data types, an unbounded (inf) request rate, and a batch size of 1.

Infinity: Error rate can reach up to 37% (37 of 100 requests failed at 8,000 input tokens).

| users | batch_size | request_rate | num_tokens | total_requests | successful_requests | failed_requests | avg_response_time | median_response_time | min_response_time | max_response_time | dtype |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 1 | inf | 100 | 10 | 10 | 0 | 0.8080683077 | 0.8082979098 | 0.8058504481 | 0.809895765 | float32 |
| 50 | 1 | inf | 100 | 50 | 50 | 0 | 2.487429961 | 2.518104792 | 0.7012535557 | 3.380646594 | float32 |
| 100 | 1 | inf | 100 | 100 | 100 | 0 | 4.972873705 | 6.474263798 | 0.4868423305 | 6.986398246 | float32 |
| 200 | 1 | inf | 100 | 200 | 200 | 0 | 7.288230219 | 6.566767693 | 2.79964105 | 13.27286144 | float32 |
| 500 | 1 | inf | 100 | 500 | 500 | 0 | 20.08286735 | 18.92353356 | 0.5804256294 | 37.41144929 | float32 |
| 1000 | 1 | inf | 100 | 1000 | 1000 | 0 | 37.6805948 | 36.48344388 | 0.6685071178 | 72.30005198 | float32 |
| 100 | 1 | inf | 500 | 100 | 100 | 0 | 18.78074493 | 15.38263782 | 2.700841447 | 28.76471366 | float32 |
| 100 | 1 | inf | 2000 | 100 | 100 | 0 | 89.98908401 | 85.98557319 | 1.259994561 | 141.3938221 | float32 |
| 200 | 1 | inf | 2000 | 200 | 200 | 0 | 170.5375092 | 147.135284 | 1.66456474 | 281.7094974 | float32 |
| 100 | 1 | inf | 8000 | 100 | 100 | 37 | 373.602216 | 428.4562441 | 18.72330613 | 428.4682249 | float32 |
| 100 | 1 | inf | 16000 | 100 | 100 | 21 | 407.0072823 | 536.6985094 | 19.24124624 | 536.7168855 | float32 |
| 10 | 1 | inf | 100 | 10 | 10 | 0 | 0.7706522834 | 0.9086181708 | 0.4456509054 | 0.9102523811 | bfloat16 |
| 50 | 1 | inf | 100 | 50 | 50 | 0 | 2.242069024 | 2.942038647 | 0.6606787723 | 3.504472384 | bfloat16 |
| 100 | 1 | inf | 100 | 100 | 100 | 0 | 2.714340597 | 2.112246864 | 0.7078306898 | 4.373712376 | bfloat16 |
| 200 | 1 | inf | 100 | 200 | 200 | 0 | 5.280264955 | 5.290602886 | 1.071872769 | 9.401459731 | bfloat16 |
| 500 | 1 | inf | 100 | 500 | 500 | 0 | 12.40294331 | 12.52492908 | 0.5171276089 | 23.49155435 | bfloat16 |
| 1000 | 1 | inf | 100 | 1000 | 1000 | 0 | 24.32148973 | 24.19110726 | 0.5186605677 | 47.98807011 | bfloat16 |
| 100 | 1 | inf | 500 | 100 | 100 | 0 | 12.71362284 | 14.86846113 | 0.6553703807 | 16.63025535 | bfloat16 |
| 100 | 1 | inf | 2000 | 100 | 100 | 0 | 62.20905734 | 80.3743374 | 0.9942744263 | 80.38623325 | bfloat16 |
| 200 | 1 | inf | 2000 | 200 | 200 | 0 | 89.99039782 | 81.15589893 | 0.9599503074 | 157.4812552 | bfloat16 |
| 100 | 1 | inf | 8000 | 100 | 100 | 0 | 238.2009776 | 327.6528775 | 8.387714051 | 337.3884446 | bfloat16 |
| 100 | 1 | inf | 16000 | 100 | 100 | 0 | 246.364753 | 200.9632803 | 59.96540344 | 374.2751351 | bfloat16 |

Table 2: Benchmark results for Infinity (torch backend). Experiments conducted with the gte-large-en-v1.5 model on an Intel Xeon Platinum 8592V processor with 40GB of memory and 32 cores, using float32 and bfloat16 data types, an unbounded (inf) request rate, and a batch size of 1.

TEI: Error rate can reach up to 94% (at 8,000 input tokens, only 6 of 100 requests succeeded).

| users | batch_size | request_rate | num_tokens | total_requests | successful_requests | failed_requests | avg_response_time | median_response_time | min_response_time | max_response_time | dtype |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 1 | inf | 100 | 10 | 10 | 0 | 2.757103388 | 2.178155268 | 0.4257001467 | 4.423851764 | float32 |
| 50 | 1 | inf | 100 | 50 | 50 | 0 | 9.863978257 | 10.63153264 | 0.3161862381 | 17.74905183 | float32 |
| 100 | 1 | inf | 100 | 100 | 100 | 0 | 21.74912045 | 21.93435366 | 0.4171652589 | 40.68083454 | float32 |
| 200 | 1 | inf | 100 | 200 | 200 | 0 | 37.0029104 | 34.88931386 | 0.3744490817 | 75.03766196 | float32 |
| 500 | 1 | inf | 100 | 500 | 500 | 0 | 88.49137289 | 89.25706704 | 0.4369468018 | 170.181461 | float32 |
| 1000 | 1 | inf | 100 | 1000 | 1000 | 0 | 168.9215812 | 168.7399757 | 0.4053446911 | 338.8540228 | float32 |
| 100 | 1 | inf | 500 | 100 | 100 | 0 | 149.4461583 | 148.2498126 | 1.432162989 | 287.6974653 | float32 |
| 100 | 1 | inf | 2000 | 100 | 38 | 62 | 315.9089398 | 334.5163867 | 11.35889501 | 574.5568133 | float32 |
| 200 | 1 | inf | 2000 | 200 | 38 | 162 | 301.1382666 | 311.5782777 | 10.11749762 | 561.5724616 | float32 |
| 100 | 1 | inf | 8000 | 100 | 6 | 94 | 440.917633 | 593.9873211 | 91.4583224 | 593.9925882 | float32 |
| 100 | 1 | inf | 16000 | 100 | 0 | 100 | 0 | 0 | 0 | 0 | float32 |

Table 3: Benchmark results for TEI. Experiments conducted with the gte-large-en-v1.5 model on an Intel Xeon Platinum 8592V processor with 40GB of memory and 32 cores, using the float32 data type, an unbounded (inf) request rate, and a batch size of 1. At 16,000 input tokens, TEI crashed and no requests completed.
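
The response-time and failure columns in the tables above can be produced with a simple concurrent load-test client along the following lines. This is a hedged sketch, not the exact harness used for these benchmarks; the endpoint URL, model name, and token-to-text conversion are assumptions, and the payload shape follows the OpenAI-compatible embeddings API described earlier.

```python
# Illustrative concurrent load test against an OpenAI-compatible embeddings
# endpoint. Not the harness used for the published numbers; ENDPOINT, MODEL,
# and the "token " stand-in for input length are assumptions for illustration.
import json
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

ENDPOINT = "http://localhost:8000/v1/embeddings"  # assumed deployment URL
MODEL = "gte-large-en-v1.5"                       # benchmark model name
USERS = 100                                       # concurrent users / total requests
NUM_TOKENS = 2000                                 # approximate input length

def one_request(text: str) -> float:
    payload = json.dumps({"model": MODEL, "input": text}).encode()
    req = urllib.request.Request(
        ENDPOINT, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.monotonic()
    with urllib.request.urlopen(req, timeout=600) as resp:
        resp.read()
    return time.monotonic() - start

def run_benchmark() -> None:
    text = "token " * NUM_TOKENS          # crude stand-in for a fixed-length input
    latencies, failed = [], 0
    with ThreadPoolExecutor(max_workers=USERS) as pool:
        for future in [pool.submit(one_request, text) for _ in range(USERS)]:
            try:
                latencies.append(future.result())
            except Exception:
                failed += 1               # count errors and timeouts as failed requests
    if latencies:
        print(f"successful={len(latencies)} failed={failed} "
              f"avg={statistics.mean(latencies):.3f}s "
              f"median={statistics.median(latencies):.3f}s "
              f"min={min(latencies):.3f}s max={max(latencies):.3f}s")
    else:
        print(f"successful=0 failed={failed}")

if __name__ == "__main__":
    run_benchmark()
```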