Foundry Local REST API Reference
The Foundry Local REST API provides endpoints for managing AI models, performing inference, and controlling the local inference service. All endpoints are compatible with the OpenAI Chat Completions API format.

Note: This API is under active development and may include breaking changes without notice. Monitor the changelog before building production applications.
Base URL

All endpoints below are relative to the local service's base URL. The host and port are assigned when the Foundry Local service starts, so check your running service for the exact address.
Authentication
For local usage, no authentication is required. The API uses a default placeholder API key.

Chat Completions
POST /v1/chat/completions
Process chat completion requests with local AI models. Fully compatible with the OpenAI Chat Completions API.

Request body:
- `model` (string, required): The specific model to use for completion (e.g., `qwen2.5-0.5b-instruct-generic-cpu`)
- `messages` (array, required): The conversation history as a list of message objects. Each message requires:
  - `role` (string): Message sender's role - `system`, `user`, or `assistant`
  - `content` (string): The actual message text
- `temperature` (number, optional): Controls randomness (0 to 2). Higher values (0.8) create varied outputs; lower values (0.2) are focused
- `top_p` (number, optional): Controls token selection diversity (0 to 1). A value of 0.1 considers only the top 10% probability tokens
- `max_tokens` (integer, optional): Maximum tokens to generate in the completion
- `stream` (boolean, optional): When true, sends partial message responses as server-sent events
- `presence_penalty` (number, optional): Value between -2.0 and 2.0. Positive values encourage new topics
- `frequency_penalty` (number, optional): Value between -2.0 and 2.0. Positive values discourage repetition
Response body:
- `id` (string): Unique identifier for the chat completion
- `choices` (array): List of completion choices generated. Each choice contains:
  - `index` (integer): Position of this choice
  - `message` (object): Generated message with role and content
  - `finish_reason` (string): Why generation stopped (`stop`, `length`, `function_call`)
- `usage` (object): Token usage statistics:
  - `prompt_tokens`: Tokens in the prompt
  - `completion_tokens`: Tokens in the completion
  - `total_tokens`: Total tokens used
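The request and response shapes above can be exercised without a running server. The snippet below is a minimal sketch: it builds a request body and extracts the reply text and token counts from a response of the documented shape. The model name comes from this reference; the conversation and the sample response values are illustrative, not real output.

```python
import json

def build_chat_request(model, messages, **options):
    """Build a /v1/chat/completions request body.

    `options` may include temperature, top_p, max_tokens, stream,
    presence_penalty, and frequency_penalty, as documented above.
    """
    body = {"model": model, "messages": messages}
    body.update(options)
    return body

# Request body for a single-turn conversation.
request_body = build_chat_request(
    "qwen2.5-0.5b-instruct-generic-cpu",
    [{"role": "user", "content": "What is the golden ratio?"}],
    temperature=0.2,
    max_tokens=128,
)

# Parse a response of the documented shape (values are illustrative).
sample_response = json.loads("""{
  "id": "chatcmpl-123",
  "choices": [{"index": 0,
               "message": {"role": "assistant", "content": "About 1.618."},
               "finish_reason": "stop"}],
  "usage": {"prompt_tokens": 12, "completion_tokens": 6, "total_tokens": 18}
}""")

reply = sample_response["choices"][0]["message"]["content"]
total = sample_response["usage"]["total_tokens"]
```

Because the format matches the OpenAI Chat Completions API, the same body works with any OpenAI-compatible client pointed at the local base URL.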
Model Management
GET /foundry/list
Get a list of available Foundry Local models in the catalog.

Response: Array of model objects with:
- `name`: Model identifier
- `displayName`: Human-readable name
- `version`: Model version
- `modelType`: Format (e.g., ONNX)
- `task`: Primary task (e.g., chat-completion)
- `fileSizeMb`: Size in megabytes
- `supportsToolCalling`: Tool calling support
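A client will typically filter this response, for example to find chat models that support tool calling. This is a sketch over sample data shaped like the fields above; the catalog entries themselves are illustrative, not real models.

```python
# Sample catalog entries shaped like the documented response.
catalog = [
    {"name": "qwen2.5-0.5b-instruct-generic-cpu", "displayName": "Qwen 2.5 0.5B",
     "version": "1", "modelType": "ONNX", "task": "chat-completion",
     "fileSizeMb": 520, "supportsToolCalling": False},
    {"name": "example-tool-model", "displayName": "Example Tool Model",
     "version": "1", "modelType": "ONNX", "task": "chat-completion",
     "fileSizeMb": 2400, "supportsToolCalling": True},
]

# Keep chat-completion models that can call tools.
tool_models = [m["name"] for m in catalog
               if m["task"] == "chat-completion" and m["supportsToolCalling"]]
```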
GET /openai/models
List cached models, including local and registered external models.

POST /openai/download
Download a model from the catalog to local storage.

Request body - Model specification:
- `Uri` (string): Model URI to download
- `Name` (string): Model name
- `ProviderType` (string): Provider (e.g., `AzureFoundryLocal`, `HuggingFace`)
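A download request body is then a JSON object carrying these three fields. The sketch below shows one way to serialize it; the URI and name are placeholders (substitute values from the catalog), and the exact JSON nesting expected by the service is an assumption not confirmed by this reference.

```python
import json

# Placeholder model specification; take Uri and Name from the catalog response.
download_body = {
    "Uri": "azureml://registries/example/models/example-model",  # placeholder
    "Name": "example-model",                                     # placeholder
    "ProviderType": "AzureFoundryLocal",
}
payload = json.dumps(download_body)
```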
GET /openai/load/{name}

Load a model into memory for faster inference.

Path parameter:
- `name`: The model name to load

Query parameters:
- `ttl` (integer, optional): Time to live in seconds. Overrides automatic unload settings
- Execution provider (optional): one of `dml`, `cuda`, `qnn`, `cpu`, `webgpu`

GET /openai/unload/{name}
Unload a model from memory.

Path parameter:
- `name`: The model name to unload

Query parameter:
- `force` (boolean, optional): If true, ignores TTL settings and unloads immediately
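Since load and unload are GET endpoints, calling them reduces to building URLs. The sketch below uses only Python's standard library; the port and the `ttl`/`force` query parameter names are placeholders for illustration, so verify them against your running service.

```python
from urllib.parse import quote, urlencode

BASE = "http://localhost:5273/openai"  # placeholder port; use your service's port

def load_url(name, ttl=None):
    # GET /openai/load/{name}?ttl=... - ttl overrides automatic unload settings.
    url = f"{BASE}/load/{quote(name)}"
    if ttl is not None:
        url += "?" + urlencode({"ttl": ttl})
    return url

def unload_url(name, force=False):
    # GET /openai/unload/{name}?force=true ignores TTL and unloads immediately.
    url = f"{BASE}/unload/{quote(name)}"
    if force:
        url += "?" + urlencode({"force": "true"})
    return url
```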
GET /openai/loadedmodels
Get the list of currently loaded models.

Service Status
GET /openai/status
Get server status information.

Response fields:
- HTTP server binding endpoints
- Directory where local models are stored
- Current NamedPipe server name
Token Counting
POST /v1/chat/completions/tokenizer/encode/count
Count tokens for a chat completion request without performing inference.

Request body:
- `model` (string): Model to use for tokenization
- `messages` (array): Array of message objects with role and content
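The tokenizer endpoint accepts the same `model`/`messages` shape as a chat completion, so a count call is just that payload POSTed to a different path. A sketch of the request side only, since the response shape is not documented above:

```python
import json

# Same shape as a chat completion request body; conversation is illustrative.
count_request = {
    "model": "qwen2.5-0.5b-instruct-generic-cpu",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the golden ratio in one line."},
    ],
}
path = "/v1/chat/completions/tokenizer/encode/count"
body = json.dumps(count_request).encode()
```

This is useful for checking a prompt against a model's context window before spending time on inference.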
GPU Management
GET /openai/getgpudevice
Get the current GPU device ID.

Response: Integer representing the GPU device ID

GET /openai/setgpudevice/{deviceId}

Set the active GPU device.

Path parameter:
- `deviceId`: The GPU device ID to use
Error Handling
All API errors return standard HTTP status codes:

- 200 - Success
- 400 - Bad Request (invalid parameters)
- 404 - Not Found (model or resource doesn't exist)
- 500 - Internal Server Error
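Clients can map these codes to exceptions with a small helper. This is a client-side sketch, not part of the API surface; the exception class and message strings are illustrative.

```python
class FoundryLocalError(Exception):
    """Raised for non-success responses from the local service."""
    def __init__(self, status, message):
        super().__init__(f"HTTP {status}: {message}")
        self.status = status

# Error codes documented by the API.
_ERRORS = {
    400: "Bad Request (invalid parameters)",
    404: "Not Found (model or resource doesn't exist)",
    500: "Internal Server Error",
}

def check_status(status):
    """Pass 200 through; raise FoundryLocalError for anything else."""
    if status == 200:
        return
    raise FoundryLocalError(status, _ERRORS.get(status, "Unexpected status"))
```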
Rate Limits
No rate limits are enforced for local usage. Performance is limited by hardware capabilities.

Related Resources
- SDK Reference: Use the Python SDK for easier integration
- JavaScript SDK: Node.js and browser integration