🚀 The wait is over: LinkedIn's engineering blog post on speculative decoding for Hiring Assistant is out!

Thrilled to share our deep dive into one of the most impactful optimizations we’ve brought to LinkedIn’s AI stack: speculative decoding. Large language models are powerful, but speed matters. For real-time AI agents like Hiring Assistant, latency isn’t just a metric; it’s the difference between a great experience and a frustrating one.

In this post, we unpack:
✅ Why speculative decoding is a game-changer for LLM inference
✅ How we applied n‑gram speculation to Hiring Assistant
✅ The results: 4× throughput gains and 66% lower P90 latency, without sacrificing quality

This work represents months of collaboration and lateral thinking across AI, Infra, and Product teams to make large-scale GenAI practical, fast, and cost-efficient.

👉 Read the full blog here: https://lnkd.in/ez4f5kYQ

Huge thanks to my co-authors, and to everyone on the legal and communications teams who made this possible. Shoutout to our leaders for their prompt guidance and encouragement. Grateful for the opportunity to make this level of impact so soon after rejoining LinkedIn; it’s a testament to the incredible teams and culture that make bold ideas possible.

The future of inference is here, and it’s all about speed, scale, and innovation. 💡

#AIInfrastructure #LLMInference #SpeculativeDecoding #LinkedInTech #GenAI #HiringAssistant
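For readers curious what "n‑gram speculation" means in practice, here is a minimal, self-contained Python sketch of the general technique (sometimes called prompt-lookup decoding): the trailing n-gram of the generated text is matched against earlier context to propose a cheap draft, and the target model verifies it, accepting the longest agreeing prefix. The function names and the toy greedy `target_next_token` callable are illustrative assumptions, not LinkedIn's actual implementation.

```python
def propose_ngram_draft(tokens, n=3, max_draft=5):
    """Find the most recent earlier occurrence of the trailing n-gram
    and propose the tokens that followed it as a speculative draft."""
    if len(tokens) < n:
        return []
    suffix = tokens[-n:]
    # Search backwards for a previous match of the trailing n-gram.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == suffix:
            return tokens[i + n:i + n + max_draft]
    return []  # no match: fall back to ordinary decoding

def speculative_step(tokens, draft, target_next_token):
    """Verify a draft against the target model: accept the longest prefix
    the target agrees with, then append one token from the target itself,
    so each step makes progress even when the whole draft is rejected."""
    accepted = []
    ctx = list(tokens)
    for tok in draft:
        if target_next_token(ctx) != tok:
            break
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next_token(ctx))
    return accepted

# Toy greedy "target model" that cycles a fixed pattern.
pattern = ["the", "cat", "sat"]
target = lambda ctx: pattern[len(ctx) % 3]

tokens = ["the", "cat", "sat", "the", "cat", "sat"]
draft = propose_ngram_draft(tokens, n=3)        # ["the", "cat", "sat"]
step = speculative_step(tokens, draft, target)  # 4 tokens for the price of
print(step)                                     # one sequential pass
```

Because verification only accepts tokens the target model would have produced anyway, output quality is unchanged; the speedup comes from repeated phrases (names, job titles, boilerplate) that an assistant like this generates often.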
Earlier presentation post (Re: "Wait is over"): https://www.linkedin.com/posts/dhyey-mavani_aiinfrastructure-llminference-speculativedecoding-activity-7389369182905331712-rfyM
Impressive work on lowering latency without sacrificing quality! Speculative decoding’s impact is clear. I’m curious: what were some of the biggest challenges your team faced during implementation, especially around maintaining result accuracy? I’d also love to hear your thoughts on scaling this approach to other real-time LinkedIn AI products. 🚀 #LLMInference
love this breakdown on optimizing LLM inference. how'd you measure quality impact downstream?
Great work Dhyey and Team 👏
Blog link: https://www.linkedin.com/blog/engineering/ai/accelerating-llm-inference-with-speculative-decoding-lessons-from-linkedins-hiring-assistant/
Product-side blog: https://www.linkedin.com/blog/engineering/hiring/hiring-assistant-shaped-by-customers-powered-by-ai-innovation
Agentic architecture blog: https://www.linkedin.com/blog/engineering/ai/how-we-engineered-linkedins-hiring-assistant
Hope this helps!