Configuring the Model
You can customize both inference-time and load-time parameters for your model. Inference parameters can be set on a per-request basis, while load parameters are set when loading the model.
Inference Parameters
Set inference-time parameters such as temperature, maxTokens, topP, and more.
const prediction = model.respond(chat, {
temperature: 0.6,
maxTokens: 50,
});
See LLMPredictionConfigInput for all configurable fields.
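The snippets on this page assume a client and a model handle already exist. As a minimal, illustrative sketch of the full flow (the prompt text is a placeholder; LMStudioClient is the client class exported by @lmstudio/sdk):
import { LMStudioClient } from "@lmstudio/sdk";

const client = new LMStudioClient();
const model = await client.llm.model("qwen2.5-7b-instruct");

// Inference-time parameters are passed per request.
const prediction = model.respond("Summarize the plot of Hamlet in one sentence.", {
  temperature: 0.6,
  maxTokens: 50,
});

// Awaiting the prediction yields the final result; the content field is assumed here.
const result = await prediction;
console.log(result.content);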
Another useful inference-time configuration parameter is structured, which allows you to rigorously enforce the structure of the output using a JSON or zod schema.
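As a rough sketch of that, assuming the structured field accepts a zod schema directly (check LLMPredictionConfigInput for the exact accepted shapes):
import { z } from "zod";

// A zod schema describing the shape the model's output must follow.
const bookSchema = z.object({
  title: z.string(),
  author: z.string(),
  year: z.number().int(),
});

const prediction = model.respond("Tell me about The Hobbit.", {
  structured: bookSchema, // assumed: a zod schema can be passed directly
  maxTokens: 100,
});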
Load Parameters
Set load-time parameters such as contextLength, gpuOffload, and more.
.model()
The .model() method retrieves a handle to a model that has already been loaded, or loads a new one on demand (JIT loading).
Note: if the model is already loaded, the configuration will be ignored.
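// The load config below takes effect only if this call actually loads the model.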
const model = await client.llm.model("qwen2.5-7b-instruct", {
config: {
contextLength: 8192,
gpu: {
ratio: 0.5,
},
},
});
See LLMLoadModelConfig for all configurable fields.
.load()
The .load() method creates a new model instance and loads it with the specified configuration.
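// Unlike .model(), this always loads a fresh instance with the given config.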
const model = await client.llm.load("qwen2.5-7b-instruct", {
config: {
contextLength: 8192,
gpu: {
ratio: 0.5,
},
},
});
See LLMLoadModelConfig for all configurable fields.
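Because .load() always creates a fresh instance, you may want to release it once you are done with it. A minimal sketch, assuming the handle exposes an unload() method:
// Unload the instance when it is no longer needed to free its resources.
await model.unload();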