Foundry Local REST API Reference
The Foundry Local REST API provides endpoints for managing AI models, performing inference, and controlling the local inference service. All endpoints are compatible with the OpenAI Chat Completions API format.

Note: This API is under active development and may include breaking changes without notice. Monitor the changelog before building production applications.
Base URL

All endpoints below are relative to the local service's base URL. The host and port are assigned when the Foundry Local service starts, so check your running service for the exact address.
Authentication
For local usage, no authentication is required. The API uses a default placeholder API key.

Chat Completions
POST /v1/chat/completions
Process chat completion requests with local AI models. Fully compatible with the OpenAI Chat Completions API.

Request body:
- `model` (string, required): The specific model to use for completion (e.g., `qwen2.5-0.5b-instruct-generic-cpu`)
- `messages` (array, required): The conversation history as a list of message objects. Each message requires:
  - `role` (string): Message sender's role - `system`, `user`, or `assistant`
  - `content` (string): The actual message text
- `temperature` (number, optional): Controls randomness (0 to 2). Higher values (0.8) create varied outputs; lower values (0.2) are focused
- `top_p` (number, optional): Controls token selection diversity (0 to 1). A value of 0.1 considers only the top 10% probability tokens
- `max_tokens` (integer, optional): Maximum tokens to generate in the completion
- `stream` (boolean, optional): When true, sends partial message responses as server-sent events
- `presence_penalty` (number, optional): Value between -2.0 and 2.0. Positive values encourage new topics
- `frequency_penalty` (number, optional): Value between -2.0 and 2.0. Positive values discourage repetition
Response body:
- `id` (string): Unique identifier for the chat completion
- `choices` (array): List of completion choices generated. Each choice contains:
  - `index` (integer): Position of this choice
  - `message` (object): Generated message with role and content
  - `finish_reason` (string): Why generation stopped (`stop`, `length`, `function_call`)
- `usage` (object): Token usage statistics:
  - `prompt_tokens`: Tokens in the prompt
  - `completion_tokens`: Tokens in the completion
  - `total_tokens`: Total tokens used
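The request and response shapes above can be exercised without a running server. The snippet below is a minimal sketch: it builds a request body and extracts the reply text and token counts from a response of the documented shape. The model name comes from this reference; the conversation and the sample response values are illustrative, not real output.

```python
import json

def build_chat_request(model, messages, **options):
    """Build a /v1/chat/completions request body.

    `options` may include temperature, top_p, max_tokens, stream,
    presence_penalty, and frequency_penalty, as documented above.
    """
    body = {"model": model, "messages": messages}
    body.update(options)
    return body

# Request body for a single-turn conversation.
request_body = build_chat_request(
    "qwen2.5-0.5b-instruct-generic-cpu",
    [{"role": "user", "content": "What is the golden ratio?"}],
    temperature=0.2,
    max_tokens=128,
)

# Parse a response of the documented shape (values are illustrative).
sample_response = json.loads("""{
  "id": "chatcmpl-123",
  "choices": [{"index": 0,
               "message": {"role": "assistant", "content": "About 1.618."},
               "finish_reason": "stop"}],
  "usage": {"prompt_tokens": 12, "completion_tokens": 6, "total_tokens": 18}
}""")

reply = sample_response["choices"][0]["message"]["content"]
total = sample_response["usage"]["total_tokens"]
```

Because the format matches the OpenAI Chat Completions API, the same body works with any OpenAI-compatible client pointed at the local base URL.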
Model Management
GET /foundry/list
Get a list of available Foundry Local models in the catalog.

Response: Array of model objects with:
- `name`: Model identifier
- `displayName`: Human-readable name
- `version`: Model version
- `modelType`: Format (e.g., ONNX)
- `task`: Primary task (e.g., chat-completion)
- `fileSizeMb`: Size in megabytes
- `supportsToolCalling`: Tool calling support
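A client will typically filter this response, for example to find chat models that support tool calling. This is a sketch over sample data shaped like the fields above; the catalog entries themselves are illustrative, not real models.

```python
# Sample catalog entries shaped like the documented response.
catalog = [
    {"name": "qwen2.5-0.5b-instruct-generic-cpu", "displayName": "Qwen 2.5 0.5B",
     "version": "1", "modelType": "ONNX", "task": "chat-completion",
     "fileSizeMb": 520, "supportsToolCalling": False},
    {"name": "example-tool-model", "displayName": "Example Tool Model",
     "version": "1", "modelType": "ONNX", "task": "chat-completion",
     "fileSizeMb": 2400, "supportsToolCalling": True},
]

# Keep chat-completion models that can call tools.
tool_models = [m["name"] for m in catalog
               if m["task"] == "chat-completion" and m["supportsToolCalling"]]
```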
GET /openai/models
List cached models, including local and registered external models.

POST /openai/download
Download a model from the catalog to local storage.

Request body - Model specification:
- `Uri` (string): Model URI to download
- `Name` (string): Model name
- `ProviderType` (string): Provider (e.g., `AzureFoundryLocal`, `HuggingFace`)
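A download request body is then a JSON object carrying these three fields. The sketch below shows one way to serialize it; the URI and name are placeholders (substitute values from the catalog), and the exact JSON nesting expected by the service is an assumption not confirmed by this reference.

```python
import json

# Placeholder model specification; take Uri and Name from the catalog response.
download_body = {
    "Uri": "azureml://registries/example/models/example-model",  # placeholder
    "Name": "example-model",                                     # placeholder
    "ProviderType": "AzureFoundryLocal",
}
payload = json.dumps(download_body)
```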
GET /openai/load/{name}

Load a model into memory for faster inference.

Path parameter:
- `name`: The model name to load

Query parameters:
- `ttl` (integer, optional): Time to live in seconds. Overrides automatic unload settings
- Execution provider (optional): one of `dml`, `cuda`, `qnn`, `cpu`, `webgpu`

GET /openai/unload/{name}
Unload a model from memory.

Path parameter:
- `name`: The model name to unload

Query parameter:
- `force` (boolean, optional): If true, ignores TTL settings and unloads immediately
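Since load and unload are GET endpoints, calling them reduces to building URLs. The sketch below uses only Python's standard library; the port and the `ttl`/`force` query parameter names are placeholders for illustration, so verify them against your running service.

```python
from urllib.parse import quote, urlencode

BASE = "http://localhost:5273/openai"  # placeholder port; use your service's port

def load_url(name, ttl=None):
    # GET /openai/load/{name}?ttl=... - ttl overrides automatic unload settings.
    url = f"{BASE}/load/{quote(name)}"
    if ttl is not None:
        url += "?" + urlencode({"ttl": ttl})
    return url

def unload_url(name, force=False):
    # GET /openai/unload/{name}?force=true ignores TTL and unloads immediately.
    url = f"{BASE}/unload/{quote(name)}"
    if force:
        url += "?" + urlencode({"force": "true"})
    return url
```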
GET /openai/loadedmodels
Get the list of currently loaded models.

Service Status
GET /openai/status
Get server status information.

Response fields:
- HTTP server binding endpoints
- Directory where local models are stored
- Current NamedPipe server name
Token Counting
POST /v1/chat/completions/tokenizer/encode/count
Count tokens for a chat completion request without performing inference.

Request body:
- `model` (string): Model to use for tokenization
- `messages` (array): Array of message objects with role and content
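The tokenizer endpoint accepts the same `model`/`messages` shape as a chat completion, so a count call is just that payload POSTed to a different path. A sketch of the request side only, since the response shape is not documented above:

```python
import json

# Same shape as a chat completion request body; conversation is illustrative.
count_request = {
    "model": "qwen2.5-0.5b-instruct-generic-cpu",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the golden ratio in one line."},
    ],
}
path = "/v1/chat/completions/tokenizer/encode/count"
body = json.dumps(count_request).encode()
```

This is useful for checking a prompt against a model's context window before spending time on inference.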
GPU Management
GET /openai/getgpudevice
Get the current GPU device ID.

Response: Integer representing the GPU device ID

GET /openai/setgpudevice/{deviceId}

Set the active GPU device.

Path parameter:
- `deviceId`: The GPU device ID to use
Error Handling
All API errors return standard HTTP status codes:

- 200 - Success
- 400 - Bad Request (invalid parameters)
- 404 - Not Found (model or resource doesn't exist)
- 500 - Internal Server Error
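Clients can map these codes to exceptions with a small helper. This is a client-side sketch, not part of the API surface; the exception class and message strings are illustrative.

```python
class FoundryLocalError(Exception):
    """Raised for non-success responses from the local service."""
    def __init__(self, status, message):
        super().__init__(f"HTTP {status}: {message}")
        self.status = status

# Error codes documented by the API.
_ERRORS = {
    400: "Bad Request (invalid parameters)",
    404: "Not Found (model or resource doesn't exist)",
    500: "Internal Server Error",
}

def check_status(status):
    """Pass 200 through; raise FoundryLocalError for anything else."""
    if status == 200:
        return
    raise FoundryLocalError(status, _ERRORS.get(status, "Unexpected status"))
```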
Rate Limits
No rate limits are enforced for local usage. Performance is limited by hardware capabilities.

Related Resources
- SDK Reference: Use the Python SDK for easier integration
- JavaScript SDK: Node.js and browser integration