Integration with Hugging Face
This document describes how vLLM integrates with Hugging Face libraries. We will explain step by step what happens under the hood when we run `vllm serve`.

Let's say we want to serve the popular Qwen model by running `vllm serve Qwen/Qwen2-7B`.
- The `model` argument is `Qwen/Qwen2-7B`. vLLM determines whether this model exists by checking for the corresponding config file `config.json`. See this code snippet for the implementation. Within this process:
  - If the `model` argument corresponds to an existing local path, vLLM will load the config file directly from this path.
  - If the `model` argument is a Hugging Face model ID consisting of a username and model name, vLLM will first try to use the config file from the Hugging Face local cache, using the `model` argument as the model name and the `--revision` argument as the revision. See their website for more information on how the Hugging Face cache works.
  - If the `model` argument is a Hugging Face model ID but it is not found in the cache, vLLM will download the config file from the Hugging Face model hub. Refer to this function for the implementation. The input arguments include the `model` argument as the model name, the `--revision` argument as the revision, and the environment variable `HF_TOKEN` as the token to access the model hub. In our case, vLLM will download the `config.json` file.
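  The resolution order above can be sketched with the `huggingface_hub` client. This is a minimal illustration, not vLLM's actual code; the helper name `resolve_config_path` is made up here.

  ```python
  import os
  from huggingface_hub import hf_hub_download

  def resolve_config_path(model: str, revision: str = "main") -> str:
      """Find config.json for a local directory or a Hugging Face model ID."""
      if os.path.isdir(model):
          # Local path: read the config file directly from the directory.
          return os.path.join(model, "config.json")
      # Hub model ID: reuse the local Hugging Face cache and download config.json
      # only when needed. HF_TOKEN, if set, is picked up automatically for
      # gated or private repositories.
      return hf_hub_download(repo_id=model, filename="config.json", revision=revision)

  config_path = resolve_config_path("Qwen/Qwen2-7B")
  ```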
- After confirming the existence of the model, vLLM loads its config file and converts it into a dictionary. See this code snippet for the implementation.
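  Since `config.json` is plain JSON, turning it into a dictionary is straightforward. A short sketch (the two fields printed here are the ones used in the following steps):

  ```python
  import json
  from huggingface_hub import hf_hub_download

  config_path = hf_hub_download("Qwen/Qwen2-7B", "config.json")
  with open(config_path) as f:
      config_dict = json.load(f)

  print(config_dict["model_type"])     # "qwen2"
  print(config_dict["architectures"])  # ["Qwen2ForCausalLM"]
  ```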
- Next, vLLM inspects the `model_type` field in the config dictionary to generate the config object to use. There are some `model_type` values that vLLM directly supports; see here for the list. If the `model_type` is not in the list, vLLM will use `AutoConfig.from_pretrained` to load the config class, with `model`, `--revision`, and `--trust_remote_code` as the arguments. Please note that:
  - Hugging Face also has its own logic to determine the config class to use. It will again use the `model_type` field to search for the class name in the transformers library; see here for the list of supported models. If the `model_type` is not found, Hugging Face will use the `auto_map` field from the config JSON file to determine the class name. Specifically, it is the `AutoConfig` field under `auto_map`. See DeepSeek for an example.
  - The `AutoConfig` field under `auto_map` points to a module path in the model's repository. To create the config class, Hugging Face will import the module and use the `from_pretrained` method to load the config class. This can generally cause arbitrary code execution, so it is only executed when `--trust_remote_code` is enabled.
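  The fallback path boils down to a plain `AutoConfig.from_pretrained` call, roughly as follows. The second repository ID is a made-up placeholder for a model that ships its own config class.

  ```python
  from transformers import AutoConfig

  # For architectures that transformers supports natively, the config class is
  # resolved from the model_type field in config.json.
  config = AutoConfig.from_pretrained("Qwen/Qwen2-7B", revision="main")

  # For models that point to their own config class via auto_map, code from the
  # model repository has to be imported, which only happens when remote code is
  # explicitly trusted.
  # config = AutoConfig.from_pretrained("some-org/custom-model", trust_remote_code=True)
  ```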
- Subsequently, vLLM applies some historical patches to the config object. These are mostly related to RoPE configuration; see here for the implementation.
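  As a purely hypothetical illustration of what such a patch can look like (the key names below are examples, not vLLM's actual patch list):

  ```python
  def patch_rope_config(config_dict: dict) -> dict:
      # Hypothetical: normalize an older rope_scaling layout so downstream code
      # can rely on a single key name.
      rope_scaling = config_dict.get("rope_scaling")
      if isinstance(rope_scaling, dict) and "type" in rope_scaling and "rope_type" not in rope_scaling:
          rope_scaling["rope_type"] = rope_scaling.pop("type")
      return config_dict
  ```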
- Finally, vLLM can reach the model class we want to initialize. vLLM uses the `architectures` field in the config object to determine the model class to initialize, as it maintains the mapping from architecture name to model class in its registry. If the architecture name is not found in the registry, it means this model architecture is not supported by vLLM. For `Qwen/Qwen2-7B`, the `architectures` field is `["Qwen2ForCausalLM"]`, which corresponds to the `Qwen2ForCausalLM` class in vLLM's code. This class will initialize itself depending on various configs.
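  A simplified sketch of the registry idea (not vLLM's actual registry API): map architecture names from the config to model implementations and fail clearly when an architecture is unknown.

  ```python
  # The values are placeholders standing in for the actual model classes.
  MODEL_REGISTRY = {
      "Qwen2ForCausalLM": "vllm.model_executor.models.qwen2:Qwen2ForCausalLM",
      # ... one entry per supported architecture ...
  }

  def resolve_model_class(architectures: list) -> str:
      for arch in architectures:
          if arch in MODEL_REGISTRY:
              return MODEL_REGISTRY[arch]
      raise ValueError(f"Model architectures {architectures} are not supported")

  model_cls = resolve_model_class(["Qwen2ForCausalLM"])
  ```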
Beyond that, there are two more things vLLM depends on Hugging Face for.
- Tokenizer: vLLM uses the tokenizer from Hugging Face to tokenize the input text. The tokenizer is loaded using `AutoTokenizer.from_pretrained` with the `model` argument as the model name and the `--revision` argument as the revision. It is also possible to use a tokenizer from another model by specifying the `--tokenizer` argument in the `vllm serve` command. Other relevant arguments are `--tokenizer-revision` and `--tokenizer-mode`. Please check Hugging Face's documentation for the meaning of these arguments. This part of the logic can be found in the get_tokenizer function. After obtaining the tokenizer, notably, vLLM will cache some expensive attributes of the tokenizer in get_cached_tokenizer.
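  Loading the tokenizer boils down to a call like the one below; the caching step exists because some tokenizer properties can be recomputed on every access. This is a minimal sketch, not the get_cached_tokenizer implementation.

  ```python
  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B", revision="main")
  token_ids = tokenizer.encode("Hello, world!")

  # Attributes like these are cheap to read once but can add up when accessed
  # repeatedly in a hot loop, which is why caching them pays off.
  special_tokens = tokenizer.all_special_tokens
  max_len = tokenizer.model_max_length
  ```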
- Model weight: vLLM downloads the model weight from the Hugging Face model hub using the `model` argument as the model name and the `--revision` argument as the revision. vLLM provides the argument `--load-format` to control what files to download from the model hub. By default, it will try to load the weights in the safetensors format and fall back to the PyTorch bin format if the safetensors format is not available. We can also pass `--load-format dummy` to skip downloading the weights.
  - It is recommended to use the safetensors format, as it is efficient for loading in distributed inference and also safe from arbitrary code execution. See the documentation for more information on the safetensors format. This part of the logic can be found here.
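  A simplified sketch of this default preference order (prefer safetensors, fall back to PyTorch bin files); vLLM's actual loader is more involved.

  ```python
  import glob
  import os
  from huggingface_hub import snapshot_download

  def download_weights(model: str, revision: str = "main") -> list:
      # Try safetensors shards first; fall back to .bin shards only if none exist.
      local_dir = snapshot_download(model, revision=revision, allow_patterns=["*.safetensors"])
      files = glob.glob(os.path.join(local_dir, "*.safetensors"))
      if not files:
          local_dir = snapshot_download(model, revision=revision, allow_patterns=["*.bin"])
          files = glob.glob(os.path.join(local_dir, "*.bin"))
      return files

  weight_files = download_weights("Qwen/Qwen2-7B")
  ```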
This completes the integration between vLLM and Hugging Face.
In summary, vLLM reads the config file `config.json`, the tokenizer, and the model weight from the Hugging Face model hub or a local directory. It uses the config class from either vLLM or Hugging Face transformers, or loads the config class from the model's repository when `--trust_remote_code` is enabled.