Configuring the Model
You can customize both inference-time and load-time parameters for your model. Inference parameters can be set on a per-request basis, while load parameters are set when loading the model.
Inference Parameters
Set inference-time parameters such as temperature, maxTokens, topP, and more.
import lmstudio as lms

model = lms.llm("qwen2.5-7b-instruct")
chat = "Tell me a short story."
result = model.respond(chat, config={
    "temperature": 0.6,
    "maxTokens": 50,
})
Note that while structured can be set to a JSON schema definition as an inference-time configuration parameter, the preferred approach is to instead set the dedicated response_format parameter, which allows you to more rigorously enforce the structure of the output using a JSON or class-based schema definition.
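For example, here is a minimal sketch of the response_format approach using a class-based schema; the Book class and prompt are illustrative, not part of the SDK:

import lmstudio as lms
from pydantic import BaseModel

# Illustrative schema; a JSON schema dict can be used instead.
class Book(BaseModel):
    title: str
    author: str
    year: int

model = lms.llm("qwen2.5-7b-instruct")
result = model.respond("Tell me about The Hobbit.", response_format=Book)
print(result.parsed)  # structured data matching the Book schema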
Load Parameters
Set load-time parameters such as contextLength, gpuOffload, and more.
.model()
The .model() method retrieves a handle to a model that has already been loaded, or loads a new one on demand (JIT loading).
Note: if the model is already loaded, the configuration will be ignored.
import lmstudio as lms

# Returns an existing handle, or JIT-loads the model with this config.
model = lms.llm("qwen2.5-7b-instruct", config={
    "contextLength": 8192,
    "gpuOffload": 0.5,
})
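To make the note above concrete: a second .model() call for an already-loaded model returns the existing instance, and any config passed at that point has no effect. A hedged sketch:

import lmstudio as lms

# First call JIT-loads the model with the given config.
model = lms.llm("qwen2.5-7b-instruct", config={"contextLength": 8192})

# Second call returns the same loaded instance; this config is ignored.
same_model = lms.llm("qwen2.5-7b-instruct", config={"contextLength": 4096})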
.load_new_instance()
The .load_new_instance() method creates a new model instance and loads it with the specified configuration.
import lmstudio as lms

client = lms.get_default_client()
# Always loads a fresh instance with the specified config,
# even if the model is already loaded elsewhere.
model = client.llm.load_new_instance("qwen2.5-7b-instruct", config={
    "contextLength": 8192,
    "gpuOffload": 0.5,
})
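Because each call creates an independent instance, you can run differently configured copies of the same model side by side. A minimal sketch, assuming model.unload() frees the instance when you are done (the prompt is illustrative):

import lmstudio as lms

client = lms.get_default_client()

# A second, independently configured instance of the same model.
small_ctx = client.llm.load_new_instance("qwen2.5-7b-instruct", config={
    "contextLength": 2048,
})

result = small_ctx.respond("Summarize JIT model loading in one sentence.")
print(result)

small_ctx.unload()  # free the instance's memory when finished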