RouteLLM is a flexible framework for serving and evaluating LLM routers, designed to maximize performance while minimizing cost.
Key features:
- Seamless integration: Acts as a drop-in replacement for the OpenAI client or runs as an OpenAI-compatible server, intelligently routing simpler queries to cheaper models.
- Pre-trained routers out of the box: Shown to cut costs by as much as 85% while preserving 95% of GPT-4 performance on widely used benchmarks like MT-Bench.
- Cost-effective excellence: Matches the performance of leading commercial offerings while being over 40% cheaper.
- Extensible and customizable: Easily add new routers, fine-tune thresholds, and compare performance across multiple benchmarks.
In this tutorial, we'll walk through how to:
- Load and use a pre-trained router.
- Calibrate it for your own use case.
- Test routing behavior on different types of prompts.
Installing the dependencies
!pip install "routellm[serve,eval]"
Loading OpenAI API Key
To get an OpenAI API key, visit https://platform.openai.com/settings/organization/api-keys and generate a new key. If you're a new user, you may need to add billing details and make a minimum payment of $5 to activate API access.
RouteLLM leverages LiteLLM to support chat completions from a wide range of both open-source and closed-source models. You can check the list of providers at https://litellm.vercel.app/docs/providers if you want to use another model.
import os
from getpass import getpass
os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')
Downloading Config File
RouteLLM uses a configuration file to locate pretrained router checkpoints and the datasets they were trained on. This file tells the system where to find the models that decide whether to send a query to the strong or weak model.
Do I need to edit it?
For most users, no. The default config already points to well-trained routers (mf, bert, causal_llm) that work out of the box. You only need to change it if you plan to:
- Train your own router on a custom dataset.
- Replace the routing algorithm entirely with a new one.
For this tutorial, we'll keep the config as is and simply:
- Set our strong and weak model names in code.
- Add our API keys for the chosen providers.
- Use a calibrated threshold to balance cost and quality.
!wget https://raw.githubusercontent.com/lm-sys/RouteLLM/main/config.example.yaml
Initializing the RouteLLM Controller
In this code block, we import the necessary libraries and initialize the RouteLLM Controller, which manages how prompts are routed between models. We specify routers=["mf"] to use the Matrix Factorization router, a pretrained decision model that predicts whether a query should be sent to the strong or weak model.
The strong_model parameter is set to "gpt-5", a high-quality but more expensive model, while the weak_model parameter is set to "o4-mini", a faster and cheaper alternative. For each incoming prompt, the router evaluates its complexity against a threshold and automatically chooses the most cost-effective option, so that simple tasks are handled by the cheaper model while harder ones get the stronger model's capabilities.
This configuration lets you balance cost efficiency and response quality without manual intervention.
import os
import pandas as pd
from routellm.controller import Controller

# The Controller routes each prompt to either the strong or the weak model
client = Controller(
    routers=["mf"],  # Matrix Factorization router
    strong_model="gpt-5",
    weak_model="o4-mini"
)
!python -m routellm.calibrate_threshold --routers mf --strong-model-pct 0.1 --config config.example.yaml
This command runs RouteLLM's threshold calibration process for the Matrix Factorization (mf) router. The --strong-model-pct 0.1 argument tells the system to find the threshold value that routes roughly 10% of queries to the strong model (and the rest to the weak model).
Using the --config config.example.yaml file for model and router settings, the calibration determined:
For 10% strong model calls with mf, the optimal threshold is 0.24034.
This means that any query with a router-assigned complexity score above 0.24034 will be sent to the strong model, while those below it will go to the weak model, in line with your desired cost-quality trade-off.
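One way to picture the calibration step: if the router's scores on a calibration set are ranked from highest to lowest, the threshold that sends about 10% of queries to the strong model is the score sitting at the 10% mark. The sketch below illustrates that idea only. The function names and the synthetic scores are hypothetical, and this is not RouteLLM's internal implementation.

```python
import random


def calibrate_threshold(scores, strong_model_pct):
    """Pick a threshold so that about `strong_model_pct` of scores are >= it."""
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0 < strong_model_pct <= 1:
        raise ValueError("strong_model_pct must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    # Number of queries we are willing to send to the strong model
    n_strong = max(1, round(len(ranked) * strong_model_pct))
    return ranked[n_strong - 1]


def strong_fraction(scores, threshold):
    """Fraction of scores that would be routed to the strong model."""
    return sum(score >= threshold for score in scores) / len(scores)


# Synthetic stand-in for router scores on a calibration set (an assumption)
random.seed(0)
sample_scores = [random.random() for _ in range(1000)]

demo_threshold = calibrate_threshold(sample_scores, strong_model_pct=0.1)
print(f"threshold: {demo_threshold:.5f}")
print(f"share routed to strong model: {strong_fraction(sample_scores, demo_threshold):.1%}")
```

Because the threshold depends on the score distribution of the calibration data, it is worth recalibrating when your own traffic looks very different from that data.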
Defining the threshold & prompts variables
Here, we define a diverse set of test prompts designed to cover a range of complexity levels. They include simple factual questions (likely to be routed to the weak model), medium reasoning tasks (borderline threshold cases), and high-complexity or creative requests (better suited to the strong model), along with code generation tasks to test technical capabilities.
threshold = 0.24034
prompts = [
# Easy factual (likely weak model)
"Who wrote the novel 'Pride and Prejudice'?",
"What is the largest planet in our solar system?",
# Medium reasoning (borderline cases)
"If a train leaves at 3 PM and travels 60 km/h, how far will it travel by 6:30 PM?",
"Explain why the sky appears blue during the day and red/orange during sunset.",
# High complexity / creative (likely strong model)
"Write a 6-line rap verse about climate change using internal rhyme.",
"Summarize the differences between supervised, unsupervised, and reinforcement learning with examples.",
# Code generation
"Write a Python function to check if a given string is a palindrome, ignoring punctuation and spaces.",
"Generate SQL to find the top 3 highest-paying customers from a 'sales' table."
]
Evaluating Win Rate
The following code calculates the win rate for each test prompt using the mf router, showing the likelihood that the strong model will outperform the weak model.
Based on the calibrated threshold of 0.24034, two prompts exceed the threshold and would be routed to the strong model:
- "If a train leaves at 3 PM and travels 60 km/h, how far will it travel by 6:30 PM?" (0.303087)
- "Write a Python function to check if a given string is a palindrome, ignoring punctuation and spaces." (0.272534)
All other prompts stay below the threshold, meaning they would be served by the weaker, cheaper model.
win_rates = client.batch_calculate_win_rate(prompts=pd.Series(prompts), router="mf")
# Store results in a DataFrame
_df = pd.DataFrame({
    "Prompt": prompts,
    "Win_Rate": win_rates
})
# Show full text without truncation
pd.set_option('display.max_colwidth', None)
These results also help in fine-tuning the routing strategy: by analyzing the win rate distribution, we can adjust the threshold to better balance cost savings and performance.
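To see how moving the threshold changes routing, a small sweep can help. The sketch below uses only the two win rates reported above. The helper `strong_share` is hypothetical, and the candidate thresholds other than 0.24034 are illustrative values, not calibrated ones.

```python
def strong_share(win_rates, threshold):
    """Fraction of prompts whose win rate meets or exceeds the threshold."""
    if not win_rates:
        raise ValueError("win_rates must not be empty")
    return sum(rate >= threshold for rate in win_rates) / len(win_rates)


# The two win rates reported above (train question, palindrome function)
reported_win_rates = [0.303087, 0.272534]

# 0.24034 is the calibrated value; 0.28 and 0.31 are illustrative only
for candidate in (0.24034, 0.28, 0.31):
    share = strong_share(reported_win_rates, candidate)
    print(f"threshold={candidate:.5f} -> {share:.0%} of these prompts go to the strong model")
```

In a notebook, the same helper could be applied to the full Win_Rate column to estimate the strong-model share before sending any requests.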
Routing Prompts Through the Calibrated Matrix Factorization (MF) Router
This code iterates over the list of test prompts and sends each one to the RouteLLM controller using the calibrated mf router with the specified threshold (router-mf-{threshold}).
For each prompt, the router decides whether to use the strong or weak model based on the calculated win rate.
The response includes both the generated output and the actual model that was selected by the router.
These details (the prompt, the model used, and the generated output) are stored in the results list for later analysis.
results = []
for prompt in prompts:
    # The model name encodes the router and the calibrated threshold
    response = client.chat.completions.create(
        model=f"router-mf-{threshold}",
        messages=[{"role": "user", "content": prompt}]
    )
    message = response.choices[0].message["content"]
    model_used = response.model  # RouteLLM returns the model actually used
    results.append({
        "Prompt": prompt,
        "Model Used": model_used,
        "Output": message
    })
df = pd.DataFrame(results)
In the results, the prompts at zero-based indices 2 and 6 (the train question and the palindrome function) exceeded the threshold win rate and were therefore routed to the gpt-5 strong model, while the rest were handled by the weaker model.
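A quick tally of which model served each prompt makes the split easier to check. The sketch below is a minimal example with a hypothetical helper, `model_usage`. Its placeholder rows mirror the outcome described above, and the exact model-name strings are assumptions, since the API may return versioned names.

```python
from collections import Counter


def model_usage(results):
    """Return {model: (count, share)} for a list of result rows."""
    counts = Counter(row["Model Used"] for row in results)
    total = sum(counts.values())
    return {model: (n, n / total) for model, n in counts.items()}


# Placeholder rows: indices 2 and 6 went to the strong model, the rest to the weak one
demo_results = [
    {"Model Used": "gpt-5" if i in (2, 6) else "o4-mini"} for i in range(8)
]

for model, (n, share) in model_usage(demo_results).items():
    print(f"{model}: {n} prompts ({share:.0%})")
```

With the real results list from the loop above, the same helper can be called as model_usage(results).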


