Speculative Decoding
Required Python SDK version: 1.2.0
Speculative decoding is a technique that can substantially increase the generation speed of large language models (LLMs) without reducing response quality. See Speculative Decoding for more info.
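At a high level, a small draft model cheaply proposes a run of tokens and the main model verifies them, accepting the longest prefix it agrees with; the main model always supplies the next token itself, so output quality is unchanged. The toy sketch below illustrates this accept/reject loop; it is not the lmstudio API, and `main_next`/`draft_next` are hypothetical stand-ins for the two models' next-token functions.

```python
# Toy illustration of one speculative decoding step (not the lmstudio API).
def speculative_step(main_next, draft_next, context, k=4):
    """main_next / draft_next map a token list to the next token.

    Returns the tokens emitted this step: the accepted draft prefix
    plus one token from the main model.
    """
    # 1. The draft model cheaply proposes k tokens.
    proposed = []
    ctx = list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The main model verifies the proposals in order and keeps
    #    only the prefix it would have generated itself.
    accepted = []
    ctx = list(context)
    for tok in proposed:
        if main_next(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break

    # 3. The main model always contributes the token after the accepted
    #    prefix, so the output matches the main model alone.
    accepted.append(main_next(ctx))
    return accepted
```

When the draft model agrees often, each main-model verification pass emits several tokens instead of one, which is where the speedup comes from.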
To use speculative decoding in `lmstudio-python`, provide a `draftModel` entry in the prediction `config`. You do not need to load the draft model separately.
```python
import lmstudio as lms

main_model_key = "qwen2.5-7b-instruct"
draft_model_key = "qwen2.5-0.5b-instruct"

model = lms.llm(main_model_key)
result = model.respond(
    "What are the prime numbers between 0 and 100?",
    config={
        "draftModel": draft_model_key,
    },
)

print(result)

stats = result.stats
print(f"Accepted {stats.accepted_draft_tokens_count}/{stats.predicted_tokens_count} tokens")
```
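The ratio of accepted draft tokens to predicted tokens is a useful signal for choosing a draft model: a higher acceptance rate means the draft model tracks the main model more closely and the speedup is larger. A minimal helper for computing it from the two counts above (the function name is ours, not part of the SDK):

```python
# Hypothetical helper (not part of the lmstudio SDK): fraction of emitted
# tokens that came from accepted draft proposals.
def draft_acceptance_rate(accepted_draft_tokens: int, predicted_tokens: int) -> float:
    if predicted_tokens == 0:
        return 0.0
    return accepted_draft_tokens / predicted_tokens
```

For example, 30 accepted draft tokens out of 40 predicted tokens gives a rate of 0.75; if the rate is consistently low, try a draft model from the same family as the main model.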