REST API
This guide explains how to use the AI Search REST API to query your AI Search instance.
You need an API token with AI Search Run permissions to use the REST API. To create a new token:
- Log in to the Cloudflare dashboard, and go to the Account API tokens page for your profile.
- Select Create Token.
- Select Create Custom Token.
- Enter a name for your token.
- Under Permissions, select AI Search and Run.
- Under Account Resources, select the account you want to use.
- Select Continue to summary, then select Create Token.
- Copy and save your API token for future use.
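Once created, the token is sent as a Bearer credential on every request. The sketch below shows one way to attach it in Python; the `AI_SEARCH_API_TOKEN` environment variable name is an assumption, not a Cloudflare convention, and the `{ACCOUNT_ID}` and `{AI_SEARCH_NAME}` placeholders must be replaced with your own values before sending:

```python
import os
import urllib.request

# Read the token from an environment variable rather than hard-coding it.
# AI_SEARCH_API_TOKEN is an assumed name; use whatever your deployment defines.
token = os.environ.get("AI_SEARCH_API_TOKEN", "example-token")

# Every AI Search REST call carries the token as a Bearer credential.
request = urllib.request.Request(
    "https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}"
    "/ai-search/instances/{AI_SEARCH_NAME}/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    },
    method="POST",
)

print(request.get_header("Authorization"))
```

Keeping the token out of source code also keeps it out of version control; rotate it from the same dashboard page if it is ever exposed.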
This endpoint searches for relevant results from your data source and generates a response using the model and the retrieved context:
```sh
curl https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai-search/instances/{AI_SEARCH_NAME}/chat/completions \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer {API_TOKEN}" \
  -d '{
    "messages": [
      {
        "content": "How do I train a llama to deliver coffee?",
        "role": "user"
      }
    ]
  }'
```

`messages` array required

An array of message objects. Each message has:

- `content` string - The message content.
- `role` string - The role: `user`, `system`, or `assistant`.
`stream` boolean optional

Set to `true` to return a stream of results as they are generated. Defaults to `false`.
`ai_search_options` object optional

Per-request overrides for retrieval and model behavior. Supports the following nested options:

- `retrieval.filters` object - Narrow down search results based on metadata. Refer to Metadata filtering for syntax and examples.
- `retrieval.max_num_results` number - Maximum number of chunks to return. Defaults to `10`, maximum `50`.
- `retrieval.retrieval_type` string - One of `vector`, `keyword`, or `hybrid`.
- `retrieval.match_threshold` number - Minimum similarity score (0-1). Defaults to `0.4`.
- `cache.enabled` boolean - Override the instance-level cache setting for this request.
- `reranking.enabled` boolean - Override the instance-level reranking setting for this request.
For the full list of optional parameters, refer to the Chat Completions API reference.
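The same request body can also be assembled programmatically. The sketch below builds a payload combining `messages` with per-request `ai_search_options` overrides; the specific option values (hybrid retrieval, a `0.6` threshold, and so on) are illustrative choices, not recommendations:

```python
import json

# Sketch of a chat completions request body; field names mirror the
# parameters documented above, values are purely illustrative.
payload = {
    "messages": [
        {"role": "user", "content": "How do I train a llama to deliver coffee?"}
    ],
    "stream": False,
    "ai_search_options": {
        "retrieval": {
            "retrieval_type": "hybrid",   # vector, keyword, or hybrid
            "max_num_results": 5,         # cap returned chunks (default 10, max 50)
            "match_threshold": 0.6,       # drop low-similarity chunks (default 0.4)
        },
        "cache": {"enabled": False},      # bypass the instance cache for this call
        "reranking": {"enabled": True},   # force reranking for this call
    },
}

# Serialize to the JSON string sent as the request body.
body = json.dumps(payload)
```

Serializing with `json.dumps` rather than string templating avoids quoting and escaping bugs when the user's question itself contains quotes.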
When stream is set to false (default), the response is returned as a single JSON object:
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1771886959,
  "model": "@cf/meta/llama-3.3-70b-instruct-fp8-fast",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "To train a llama to deliver coffee, start by building trust...",
        "refusal": null
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 6507,
    "completion_tokens": 137,
    "total_tokens": 6644
  },
  "chunks": [
    {
      "id": "chunk001",
      "type": "text",
      "score": 0.85,
      "text": "Llamas can carry up to 3 drinks.",
      "item": {
        "key": "llama-logistics.md",
        "timestamp": 1735689600
      },
      "scoring_details": {
        "vector_score": 0.85
      }
    }
  ]
}
```

When `stream` is set to `true`, the response is returned as server-sent events (SSE). The retrieved chunks are sent first as a single `chunks` event, followed by multiple `data` events containing the generated response in incremental pieces:
```txt
event: chunks
data: [{"id":"chunk001","type":"text","score":0.85,"text":"...","item":{...},"scoring_details":{...}}]

data: {"id":"id-123","created":1771887723,"model":"@cf/meta/llama-3.3-70b-instruct-fp8-fast","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"To"}}]}

data: {"id":"id-123","created":1771887723,"model":"@cf/meta/llama-3.3-70b-instruct-fp8-fast","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" train a llama"}}]}

data: [DONE]
```

This allows you to display the source chunks immediately while streaming the generated response to the user.
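A minimal sketch of consuming such a stream, assuming the SSE lines have already been read off the HTTP response; a production client should use a dedicated SSE library and handle partial lines, multi-line `data:` fields, and reconnects:

```python
import json

def parse_sse(lines):
    """Assemble the streamed answer from AI Search SSE lines.

    Returns (chunks, text): the retrieved source chunks and the
    concatenated generated response.
    """
    chunks, parts = [], []
    expecting_chunks = False
    for line in lines:
        if line.startswith("event: chunks"):
            expecting_chunks = True          # next data line holds the sources
            continue
        if not line.startswith("data: "):
            continue                         # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break                            # end-of-stream sentinel
        if expecting_chunks:
            chunks = json.loads(payload)     # single event with all chunks
            expecting_chunks = False
            continue
        event = json.loads(payload)
        delta = event["choices"][0]["delta"]
        parts.append(delta.get("content", ""))
    return chunks, "".join(parts)

# Sample lines shaped like the stream above.
sample = [
    "event: chunks",
    'data: [{"id":"chunk001","type":"text","score":0.85,'
    '"text":"Llamas can carry up to 3 drinks."}]',
    'data: {"id":"id-123","object":"chat.completion.chunk",'
    '"choices":[{"index":0,"delta":{"content":"To"}}]}',
    'data: {"id":"id-123","object":"chat.completion.chunk",'
    '"choices":[{"index":0,"delta":{"content":" train a llama"}}]}',
    "data: [DONE]",
]
chunks, text = parse_sse(sample)
print(text)  # -> To train a llama
```

Because the `chunks` event arrives before any `delta` content, a UI can render the source list first and then append the answer token by token.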
This endpoint searches for results from your data source and returns the relevant chunks without generating a response:
```sh
curl https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai-search/instances/{AI_SEARCH_NAME}/search \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer {API_TOKEN}" \
  -d '{
    "messages": [
      {
        "content": "How do I train a llama to deliver coffee?",
        "role": "user"
      }
    ]
  }'
```

`messages` array required

An array of message objects. Each message has:

- `content` string - The search query content.
- `role` string - The role: `user`, `system`, or `assistant`.
`ai_search_options` object optional

Per-request overrides for retrieval and model behavior. Supports the following nested options:

- `retrieval.filters` object - Narrow down search results based on metadata. Refer to Metadata filtering for syntax and examples.
- `retrieval.max_num_results` number - Maximum number of chunks to return. Defaults to `10`, maximum `50`.
- `retrieval.retrieval_type` string - One of `vector`, `keyword`, or `hybrid`.
- `retrieval.match_threshold` number - Minimum similarity score (0-1). Defaults to `0.4`.
- `cache.enabled` boolean - Override the instance-level cache setting for this request.
- `reranking.enabled` boolean - Override the instance-level reranking setting for this request.
For the full list of optional parameters, refer to the Search API reference.
```json
{
  "success": true,
  "result": {
    "search_query": "How do I train a llama to deliver coffee?",
    "chunks": [
      {
        "id": "chunk001",
        "type": "text",
        "score": 0.85,
        "text": "Llamas can carry up to 3 drinks.",
        "item": {
          "key": "llama-logistics.md",
          "timestamp": 1735689600
        },
        "scoring_details": {
          "vector_score": 0.85
        }
      }
    ]
  }
}
```
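As a sketch, the chunks in this response can be consumed directly, for example to list the source documents ranked by retrieval score; the field names below follow the sample response above:

```python
import json

# Assumed to be the decoded body of a /search response like the sample above.
response = json.loads("""{
  "success": true,
  "result": {
    "search_query": "How do I train a llama to deliver coffee?",
    "chunks": [
      {"id": "chunk001", "type": "text", "score": 0.85,
       "text": "Llamas can carry up to 3 drinks.",
       "item": {"key": "llama-logistics.md", "timestamp": 1735689600}}
    ]
  }
}""")

# Sort retrieved chunks from most to least relevant.
sources = sorted(
    response["result"]["chunks"], key=lambda c: c["score"], reverse=True
)
for chunk in sources:
    print(f'{chunk["item"]["key"]} (score {chunk["score"]}): {chunk["text"]}')
```

Because this endpoint skips generation, it suits use cases like custom RAG pipelines or search UIs where you supply your own model or rendering.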