LM Studio 0.3.4 ships with Apple MLX
2024-10-08
LM Studio 0.3.4 ships with an MLX engine for running on-device LLMs super efficiently on Apple Silicon Macs.
Download LM Studio for Apple Silicon from here. Read on to learn more about MLX in LM Studio.
Llama 3.2 1B on M3 Max runs at ~250 tokens per second
👾 Interested in designing and building systems? We are hiring. See open positions here.
LM Studio now supports both llama.cpp and MLX models! The rest of this blog post is a deep dive into the technical details of MLX in LM Studio.
MLX is a new Open Source AI/ML software stack from Apple, optimized specifically for Apple Silicon. It utilizes the powerful acceleration hardware in Apple's M chips.
Developed by engineers at Apple and supported by a growing community of developers, MLX is positioned to be an extremely competitive choice for running on-device AI on Macs.
The MLX core library is written in C++, complemented by community-supported Python and Swift frontends.
We're excited to unveil MLX support in LM Studio. This blog post will cover some technical details about MLX in general, and LM Studio's MLX engine in particular.
LM Studio's MLX engine is a Python module built using a combination of the following packages:
- mlx_lm, for running text LLMs with MLX
- Outlines, for structured (e.g. json) generation
- mlx_vlm, for running vision LLMs
mlx-engine is open source under the MIT license. Repo: https://github.com/lmstudio-ai/mlx-engine.
Our journey to integrate MLX into LM Studio started with Swift. While this approach worked perfectly fine, ultimately the following design goals made Python a better choice.
Design Goal 1: We want to iterate on the MLX engine with the community
Design Goal 2: We want to be able to support the latest models and techniques as soon as they are released
Adding mlx-lm support to LM Studio required the ability to deploy and run Python components in a portable, cross-platform fashion. Ideally, we also want to be able to fully integrate those components with the existing C/C++ components already used in the main LM Studio application (which ended up ruling out some potential candidate solutions, such as conda environments).
LM Studio's initial Python runtime support is built atop the python-build-standalone project and Python virtual environments, using a soon-to-be-published utility that supports the creation of an integrated set of independently downloadable Python application environments that share common runtime and framework layers (after all, nobody wants to download and install multiple copies of PyTorch or CUDA if they can reasonably avoid it).
This "stacked virtual environments" utility uses the CPython interpreter's "site customization" feature, together with some pre-publication and post-installation adjustments to the virtual environment contents, to allow these virtual environments to be reliably transferred between machines and the included application launch modules invoked with CPython's -m
command line switch.
Look for more detailed technical announcements on that front later in October.
mlx-engine's features

A critical piece of the Python MLX ecosystem is mlx_lm. This project provides an easy way to run large language models with a CLI tool, or a few lines of Python, for example:
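For instance, a minimal example along these lines (the model name is just an illustration; any MLX-format model from the Hugging Face Hub works the same way):

```python
from mlx_lm import load, generate

# Download (if needed) and load an MLX-format model from the Hugging Face Hub
model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

# Run text generation with a handful of lines of Python
text = generate(model, tokenizer, prompt="Why is Apple Silicon fast?", max_tokens=200)
print(text)
```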
Let's pop the hood of generate_step, so that we have a better understanding of what's going on.
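Here is a simplified, illustrative sketch of a generate_step-style loop (not the exact mlx_lm source, which also handles temperature and other sampling settings, repetition penalties, and KV-cache management):

```python
import mlx.core as mx

def generate_step_sketch(prompt: mx.array, model):
    """Simplified sketch of an mlx_lm-style token generation loop."""

    def sample(logits: mx.array) -> mx.array:
        # Greedy sampling for simplicity; the real loop supports more strategies
        return mx.argmax(logits, axis=-1)

    tokens = prompt
    while True:
        # Evaluate the model on the tokens so far; the last position's logits
        # assign a score to every item in the model's vocabulary
        logits = model(tokens[None])[:, -1, :]
        # Sample the next token from the distribution the logits define
        next_token = sample(logits)
        tokens = mx.concatenate([tokens, next_token])
        yield next_token
```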
We can see the important operations happening here: the model is evaluated against the tokens via its __call__ method, which returns an array of logits, where each element corresponds to an item in the model's vocabulary. The logits define a probability distribution over the items in the vocabulary, and the next token is sampled from that distribution.

Let's see how we can add features to this generation loop that our users would love.
Let's add a feature to the generator: the user can request that the generator output valid json. We can use Outlines from .txt for this.
Outlines enables structured generation from LLMs (e.g. creating json output). This package comes with support for the mlx_lm runtime, which we will leverage. Outlines does its work by converting a user-provided json schema into regex. Take a look at this title schema.
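The exact schema from the original example isn't reproduced here, but a schema of roughly this shape (an object with a single required, non-empty "title" string) matches the regex described below:

```json
{
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "minLength": 1
    }
  },
  "required": ["title"]
}
```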
Outlines converts that schema into a regex string. Here is a more human readable (but less precise) version of that regex string: \{"title": ".{1,}"\}
Using this regex string, Outlines constrains the generation loop: at each step it masks the model's logits so that only tokens which keep the output matching the regex can be sampled, and it advances its regex state machine with each sampled token.
mlx_lm's generate_step lets us define logits processors, so let's define a processor to mask the logits so the output matches the regex:
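A sketch of such a processor is below. The masking logic is the important part; the `guide` object standing in for Outlines' compiled regex state machine, its attribute names, and the exact processor signature are assumptions for illustration, not the exact Outlines or mlx-engine APIs:

```python
import numpy as np
import mlx.core as mx

class RegexLogitsProcessor:
    """Masks logits so only tokens allowed by the regex's state machine survive.

    `guide` is a hypothetical object compiled from the Outlines regex that
    tracks a finite-state-machine state and reports which token ids are legal
    from that state (the real Outlines API differs between versions).
    """

    def __init__(self, guide):
        self.guide = guide
        self.state = guide.initial_state  # hypothetical attribute

    def __call__(self, generated_tokens: mx.array, logits: mx.array) -> mx.array:
        # Advance the state machine with the most recently sampled token
        if len(generated_tokens) > 0:
            self.state = self.guide.next_state(self.state, generated_tokens[-1].item())

        # Build an additive mask: 0 for allowed token ids, -inf for everything else
        allowed = self.guide.allowed_token_ids(self.state)  # hypothetical helper
        mask = np.full(logits.shape[-1], -np.inf, dtype=np.float32)
        mask[list(allowed)] = 0.0

        # Disallowed tokens now have probability ~0 after the softmax
        return logits + mx.array(mask)
```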
And we can invoke the mlx generation step with an instantiation of this object:
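A sketch of that call, assuming generate_step accepts a list of logits-processor callables (the exact parameter name has varied across mlx_lm versions):

```python
from itertools import islice
from mlx_lm.utils import generate_step

processor = RegexLogitsProcessor(guide)  # guide built from the Outlines regex

generator = generate_step(
    prompt=prompt_tokens,           # mx.array of the tokenized prompt
    model=model,
    logits_processors=[processor],  # assumption: parameter name varies by mlx_lm version
)
for token, _logprobs in islice(generator, max_tokens):
    ...  # detokenize / stream the json-constrained output
```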
And there we have it! Now we can generate json whenever a json schema is provided.
Another piece of the MLX Python ecosystem is mlx_vlm, which is a package for running vision LLMs. Here is the generate_step method in mlx_vlm, edited for conciseness:
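Rather than reproduce the mlx_vlm source here, the sketch below captures its overall shape (names and signatures are simplified assumptions; KV-cache plumbing and stop conditions are elided):

```python
import mlx.core as mx

def vlm_generate_step_sketch(model, input_ids: mx.array, pixel_values: mx.array):
    """Illustrative sketch of an mlx_vlm-style generation loop."""

    def sample(logits: mx.array) -> mx.array:
        # mlx_vlm has fewer sampling options available than mlx_lm
        return mx.argmax(logits, axis=-1)

    # First evaluation: the full model's __call__ processes the pixel data
    # together with the prompt tokens
    logits = model(input_ids, pixel_values)[:, -1, :]
    token = sample(logits)
    yield token

    while True:
        # Subsequent evaluations go through the underlying language model only
        logits = model.language_model(token[None])[:, -1, :]
        token = sample(logits)
        yield token
```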
Let's compare and contrast the mlx_vlm implementation with the mlx_lm implementation:

- mlx_vlm evaluation uses the model.__call__ method. The very first evaluation processes the pixel data, and the subsequent evaluations use the underlying language model.
- The sample function in mlx_vlm has fewer sampling methods available compared to mlx_lm.
- Logits processors are not available in mlx_vlm.

It would be beneficial to use the logits processing and sampling from mlx_lm, while also using vision models from mlx_vlm. Let's implement that!
We'll write a class that will evaluate the pixel data on first call and use the language model on subsequent calls:
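A sketch of such a wrapper (the class name, constructor, and attribute names are illustrative, not the exact mlx-engine source):

```python
import mlx.core as mx

class VisionModelWrapperSketch:
    """Illustrative wrapper that lets mlx_lm's generate_step drive an mlx_vlm model."""

    def __init__(self, vision_model, pixel_values: mx.array):
        self.vision_model = vision_model
        self.pixel_values = pixel_values
        self.first_call = True

    def __call__(self, tokens: mx.array, cache=None) -> mx.array:
        if self.first_call:
            # First evaluation: process the pixel data together with the prompt
            self.first_call = False
            return self.vision_model(tokens, self.pixel_values, cache=cache)
        # Subsequent evaluations: only the underlying language model is needed
        return self.vision_model.language_model(tokens, cache=cache)
```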
And now, we can pass it into mlx_lm.generate_step:
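Roughly (again with the illustrative names from above):

```python
from itertools import islice
from mlx_lm.utils import generate_step

wrapped_model = VisionModelWrapperSketch(vision_model, pixel_values)

generator = generate_step(
    prompt=prompt_tokens,                             # tokenized prompt (mx.array)
    model=wrapped_model,                              # the wrapper defined above
    logits_processors=[RegexLogitsProcessor(guide)],  # reuse the json constraint from earlier
)
for token, _logprobs in islice(generator, max_tokens):
    ...  # decode the structured title for the image
```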
And now we can prompt an LLM with an image, and have it make a title for us!
Captioning an image using a VLM and structured output
KV (key-value) caching across prompts is an optimization technique that enables LLM engines to reuse computations from previous interactions. This can greatly improve model response time, or "Time to First Token".
KV caching is especially valuable in a chat scenario, where a large portion of the prompt (the chat history) is often the same across generation requests to the model.
Timestep 1 (T1) - User sends prompt "Summarize this long article: <long article here...>"
Timestep 2 (T2) - The LLM engine performs inference on the input, computing large matrix multiplications between the model weights and the input token embeddings to yield output tokens: "This article discusses the impact of..."
Timestep 3 (T3) - User sends prompt "Are there any people mentioned in the article?". The entire chat history is sent to the LLM to give it proper context to continue the conversation.
Timestep 4 (T4) - The LLM engine performs inference on the input (all tokens from T1, T2, and T3), computing large matrix multiplications between the model weights and input token embeddings to yield output tokens: "Yes, the article mentions several key figures, including..."
KV caching takes advantage of the fact that by the time we're at T3, asking the LLM about "people mentioned in the article", we've already performed matrix computations in T1 and T2 that are the same as those that need to be computed in T3.
So if we save the results of the computations in T1 and T2 to a KV cache, and give the engine access to the KV cache at T3, then the engine only has to perform computations on the new part of the prompt, "Are there any people mentioned in the article?".
This can greatly improve response time in T4. In our testing, with a ~3000 token article and Meta-Llama-3.1-8B-Instruct-4bit, T4 response time dropped from ~10 seconds without any KV caching, to just 0.11 seconds with it.
At the time of implementation, mlx-lm exposed a cache_history parameter to its generate_step function:
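The signature looked roughly like this (other parameters elided; the exact type of cache_history is an assumption):

```python
from typing import Generator, List, Optional, Tuple

import mlx.core as mx
import mlx.nn as nn

def generate_step(
    prompt: mx.array,
    model: nn.Module,
    # ... sampling-related parameters elided ...
    cache_history: Optional[List[Tuple[mx.array, mx.array]]] = None,  # per-layer (keys, values)
) -> Generator[Tuple[mx.array, mx.array], None, None]:
    ...
```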
By passing the proper cache_history (akin to the KV cache above), we were able to implement an initial version of KV caching in our MLX engine.
We did this through an adaptation of mlx-lm's PR Add the ability to load the KV cache from a file, in which we pre-process the prompt through the model inside of a cache wrapper:
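A sketch of the call site, with names and return values inferred from the description that follows (not the exact mlx-engine code):

```python
# Hypothetical constructor and argument order
cache_wrapper = CacheWrapper(model)

# Pre-process the prompt through the model: the resulting KV cache is saved into
# generate_args["cache_history"], and the return value is the remaining portion of
# the prompt that generate_step still needs to process itself.
generate_step_input = cache_wrapper.update_cache(prompt_tokens, generate_args)
```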
cache_wrapper.update_cache, seen above, draws from cache_prompt.py to fill the cache chunk by chunk:
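Conceptually, the chunked prompt processing looks like this (an illustrative sketch loosely following cache_prompt.py; the cache construction and eval details are assumptions):

```python
import mlx.core as mx

def fill_cache_sketch(model, cache, prompt_tokens: mx.array, chunk_size: int = 512):
    """Feed the prompt through the model in chunks so its keys/values land in `cache`."""
    processed = 0
    while processed < prompt_tokens.shape[0]:
        chunk = prompt_tokens[processed : processed + chunk_size]
        # Evaluating the model with the shared cache stores this chunk's keys/values
        model(chunk[None], cache=cache)
        # Force the lazy computation to actually run, chunk by chunk
        mx.eval([c.state for c in cache])  # assumes each cache entry exposes a `state`
        processed += chunk.shape[0]
    return cache
```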
Now that the cache has been created and saved to generate_args["cache_history"], we can simply pass generate_args and generate_step_input to mlx_lm.utils.generate_step:
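A sketch of that call (generate_args here is a plain dict that includes the cache_history built above, plus any sampling settings):

```python
from mlx_lm.utils import generate_step

generator = generate_step(
    prompt=generate_step_input,  # only the not-yet-cached part of the prompt
    model=model,
    **generate_args,             # includes cache_history from the cache wrapper
)
```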
This results in the generate_step function being able to utilize prior computations, stored in cache_history, to greatly reduce response time when compared to performing raw processing of the entire prompt.
We can then store this cache_history object across prompt processing calls, building upon it to keep chat scenarios responsive, even during very long conversations. However, it is critical to ensure that the tokens processed into cache_history still correspond to the beginning tokens in the prompt when doing so. For more information on this, view the cache resetting behavior within the update_cache function.
Keyboard shortcuts: Cmd+Shift+M to search for models, Cmd+Shift+R to manage LM Runtimes