R2

You can use Cloudflare R2 to store data for indexing. To get started, configure an R2 bucket containing your data.

AI Search will automatically scan and process supported files stored in that bucket. Files that are unsupported or exceed the size limit will be skipped during indexing and logged as errors.

Path filtering

You can control which files get indexed by defining include and exclude rules for object paths. Use this to limit indexing to specific folders or to exclude files you do not want searchable.

For example, to index only documentation while excluding drafts:

Include: /docs/**
Exclude: /docs/drafts/**

Refer to Path filtering for pattern syntax, filtering behavior, and more examples.
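
To build intuition for how the include and exclude rules above combine, here is a minimal sketch of a path matcher. It is not AI Search's actual implementation. The helper names `globToRegExp` and `shouldIndex` are hypothetical, and the sketch assumes that `**` matches any characters including `/`, that `*` stays within one path segment, and that exclude rules win over include rules. The Path filtering reference remains the authority on the real syntax and precedence.

```javascript
// Hypothetical helper: converts a simplified glob pattern into a RegExp.
// Assumption: "**" matches across "/" and "*" matches within one segment.
function globToRegExp(pattern) {
  let source = "";
  for (let i = 0; i < pattern.length; i++) {
    const ch = pattern[i];
    if (ch === "*") {
      if (pattern[i + 1] === "*") {
        source += ".*";
        i++; // skip the second "*"
      } else {
        source += "[^/]*";
      }
    } else {
      // Escape characters that have special meaning in a regular expression.
      source += ch.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
    }
  }
  return new RegExp(`^${source}$`);
}

// Hypothetical helper: returns true when a path passes the rules.
// Assumption: an empty include list means "include everything",
// and an exclude match always overrides an include match.
function shouldIndex(path, include, exclude) {
  const matchesAny = (patterns) =>
    patterns.some((pattern) => globToRegExp(pattern).test(path));
  const included = include.length === 0 || matchesAny(include);
  return included && !matchesAny(exclude);
}

const include = ["/docs/**"];
const exclude = ["/docs/drafts/**"];

console.log(shouldIndex("/docs/guide.md", include, exclude));
console.log(shouldIndex("/docs/drafts/wip.md", include, exclude));
```

With the rules from the example, a path under `/docs/` passes unless it also sits under `/docs/drafts/`.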

File limits

AI Search indexes files up to 4 MB in size.

Files that exceed this limit are not indexed and appear in the error logs.
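
If you want to catch oversized files before they reach the bucket, a pre-upload check along these lines may help. This is a sketch under one assumption: it treats 4 MB as 4 × 1024 × 1024 bytes, which the text above does not specify. The names `MAX_FILE_BYTES` and `withinSizeLimit` are illustrative only.

```javascript
// Assumption: "4 MB" is interpreted as 4 * 1024 * 1024 bytes.
const MAX_FILE_BYTES = 4 * 1024 * 1024;

// Hypothetical helper: accepts a string, ArrayBuffer, or typed array.
function withinSizeLimit(content) {
  // Strings are measured by UTF-8 encoded length, not character count.
  const byteLength =
    typeof content === "string"
      ? new TextEncoder().encode(content).byteLength
      : content.byteLength;
  return byteLength <= MAX_FILE_BYTES;
}

console.log(withinSizeLimit("a short document"));
console.log(withinSizeLimit(new Uint8Array(MAX_FILE_BYTES + 1)));
```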

File types

AI Search can ingest a variety of file types to power your RAG. The following plain text and rich format file types are supported.

Plain text file types

AI Search supports the following plain text file types:

Format	File extensions	MIME types
Text	`.txt`, `.rst`	`text/plain`
Log	`.log`	`text/plain`
Config	`.ini`, `.conf`, `.env`, `.properties`, `.gitignore`, `.editorconfig`, `.toml`	`text/plain`, `text/toml`
Markdown	`.markdown`, `.md`, `.mdx`	`text/markdown`
LaTeX	`.tex`, `.latex`	`application/x-tex`, `application/x-latex`
Script	`.sh`, `.bat`, `.ps1`	`application/x-sh`, `application/x-msdos-batch`, `text/x-powershell`
SGML	`.sgml`	`text/sgml`
JSON	`.json`	`application/json`
YAML	`.yaml`, `.yml`	`application/x-yaml`
CSS	`.css`	`text/css`
JavaScript	`.js`	`application/javascript`
PHP	`.php`	`application/x-httpd-php`
Python	`.py`	`text/x-python`
Ruby	`.rb`	`text/x-ruby`
Java	`.java`	`text/x-java-source`
C	`.c`	`text/x-c`
C++	`.cpp`, `.cxx`	`text/x-c++`
C Header	`.h`, `.hpp`	`text/x-c-header`
Go	`.go`	`text/x-go`
Rust	`.rs`	`text/rust`
Swift	`.swift`	`text/swift`
Dart	`.dart`	`text/dart`
Emacs Lisp	`.el`	`application/x-elisp`, `text/x-elisp`, `text/x-emacs-lisp`

Rich format file types

AI Search uses Markdown Conversion to convert rich format files to Markdown. The following table lists the supported formats:

Format	File extensions	MIME types
PDF Documents	`.pdf`	`application/pdf`
Images ¹	`.jpeg`, `.jpg`, `.png`, `.webp`, `.svg`	`image/jpeg`, `image/png`, `image/webp`, `image/svg+xml`
HTML Documents	`.html`, `.htm`	`text/html`
XML Documents	`.xml`	`application/xml`
Microsoft Office Documents	`.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.et`, `.docx`	`application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`, `application/vnd.ms-excel.sheet.macroenabled.12`, `application/vnd.ms-excel.sheet.binary.macroenabled.12`, `application/vnd.ms-excel`, `application/vnd.openxmlformats-officedocument.wordprocessingml.document`
Open Document Format	`.ods`, `.odt`	`application/vnd.oasis.opendocument.spreadsheet`, `application/vnd.oasis.opendocument.text`
CSV	`.csv`	`text/csv`
Apple Documents	`.numbers`	`application/vnd.apple.numbers`

¹ Image conversion uses two Workers AI models for object detection and summarization. See Workers AI pricing for more details.
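
The two tables above can be turned into a quick pre-flight check that predicts whether an object is likely to be indexed. The sketch below classifies an object key by its file extension only. The function name `classifyKey` is hypothetical, and matching extensions case-insensitively is an assumption; AI Search may also take the object's MIME type into account.

```javascript
// Extensions copied from the plain text table above.
const PLAIN_TEXT_EXTENSIONS = new Set([
  ".txt", ".rst", ".log", ".ini", ".conf", ".env", ".properties",
  ".gitignore", ".editorconfig", ".toml", ".markdown", ".md", ".mdx",
  ".tex", ".latex", ".sh", ".bat", ".ps1", ".sgml", ".json", ".yaml",
  ".yml", ".css", ".js", ".php", ".py", ".rb", ".java", ".c", ".cpp",
  ".cxx", ".h", ".hpp", ".go", ".rs", ".swift", ".dart", ".el",
]);

// Extensions copied from the rich format table above.
const RICH_FORMAT_EXTENSIONS = new Set([
  ".pdf", ".jpeg", ".jpg", ".png", ".webp", ".svg", ".html", ".htm",
  ".xml", ".xlsx", ".xlsm", ".xlsb", ".xls", ".et", ".docx", ".ods",
  ".odt", ".csv", ".numbers",
]);

// Hypothetical helper: returns "plain-text", "rich-format", or "unsupported".
function classifyKey(key) {
  // Look only at the final path segment, lowercased (an assumption).
  const name = key.slice(key.lastIndexOf("/") + 1).toLowerCase();
  const dot = name.lastIndexOf(".");
  if (dot === -1) return "unsupported";
  const extension = name.slice(dot);
  if (PLAIN_TEXT_EXTENSIONS.has(extension)) return "plain-text";
  if (RICH_FORMAT_EXTENSIONS.has(extension)) return "rich-format";
  return "unsupported";
}

console.log(classifyKey("docs/guide.md"));
console.log(classifyKey("reports/q3.xlsx"));
console.log(classifyKey("media/intro.mp4"));
```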

Custom metadata

You can attach custom metadata to R2 objects for filtering search results. AI Search reads metadata from S3-compatible custom headers (x-amz-meta-*).

Before metadata can be extracted, you must define a schema in your AI Search configuration.

Set metadata when uploading

Use the customMetadata option when uploading objects with the R2 Workers binding:

await env.MY_BUCKET.put("docs/document.pdf", fileContent, {
  customMetadata: {
    category: "documentation",
    version: "2.5",
    is_public: "true",
  },
});

Use the Metadata option with the AWS SDK for JavaScript:

import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const client = new S3Client({
  region: "auto",
  endpoint: `https://${accountId}.r2.cloudflarestorage.com`,
  credentials: {
    accessKeyId: R2_ACCESS_KEY_ID,
    secretAccessKey: R2_SECRET_ACCESS_KEY,
  },
});

await client.send(
  new PutObjectCommand({
    Bucket: "your-bucket",
    Key: "docs/document.pdf",
    Body: fileContent,
    Metadata: {
      category: "documentation",
      version: "2.5",
      is_public: "true",
    },
  }),
);

Use the --header flag with Wrangler to set x-amz-meta-* headers:

wrangler r2 object put your-bucket/docs/document.pdf \
  --file=./document.pdf \
  --header="x-amz-meta-category:documentation" \
  --header="x-amz-meta-version:2.5" \
  --header="x-amz-meta-is_public:true"

How metadata extraction works

When a file is fetched from R2 during indexing:

All x-amz-meta-* headers are read from the object.
The x-amz-meta- prefix is stripped (for example, x-amz-meta-category becomes category).
Field names are matched against your schema (case-insensitive).
Values are cast to the configured data type.
Invalid values (for example, a non-numeric string for a number type) are silently ignored.
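
The steps above can be sketched as a small function. This is an illustration of the described behavior, not AI Search's code. The schema shape (a plain object mapping field names to `"text"`, `"number"`, or `"boolean"`) and the names `extractMetadata` and `castValue` are assumptions made for the example; the exact casting rules, such as which strings count as booleans, are also assumed.

```javascript
const META_PREFIX = "x-amz-meta-";

// Hypothetical helper: casts a raw header string to the schema type.
// Returns undefined when the value is invalid for that type.
function castValue(raw, type) {
  switch (type) {
    case "number": {
      const parsed = Number(raw);
      return raw.trim() !== "" && Number.isFinite(parsed) ? parsed : undefined;
    }
    case "boolean":
      if (raw === "true") return true;
      if (raw === "false") return false;
      return undefined;
    case "text":
      return raw;
    default:
      return undefined;
  }
}

// Hypothetical helper: applies the extraction steps to a headers object.
function extractMetadata(headers, schema) {
  // Index schema fields by lowercase name for case-insensitive matching.
  const fields = new Map(
    Object.entries(schema).map(([name, type]) => [name.toLowerCase(), { name, type }]),
  );
  const result = {};
  for (const [header, raw] of Object.entries(headers)) {
    const lower = header.toLowerCase();
    if (!lower.startsWith(META_PREFIX)) continue; // not a metadata header
    const field = fields.get(lower.slice(META_PREFIX.length)); // strip the prefix
    if (!field) continue; // field is not defined in the schema
    const value = castValue(raw, field.type);
    if (value !== undefined) result[field.name] = value; // invalid values are dropped
  }
  return result;
}

const metadata = extractMetadata(
  {
    "x-amz-meta-category": "documentation",
    "x-amz-meta-version": "2.5",
    "x-amz-meta-is_public": "true",
    "x-amz-meta-priority": "high", // not numeric, so it is ignored
  },
  { category: "text", version: "number", is_public: "boolean", priority: "number" },
);
console.log(metadata);
```

Because invalid values are dropped without an error, a field that never shows up in search filters is worth checking against the schema type first.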

Unicode support

Metadata values support Unicode characters through MIME-Word encoding (RFC 2047). Most S3-compatible tools handle this encoding automatically.
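
If your tooling does not encode values for you, the following sketch shows what an RFC 2047 encoded word looks like using the Base64 ("B") variant with UTF-8. The helper names `encodeMimeWord` and `decodeMimeWord` are hypothetical. The sketch ignores the 75-character limit that RFC 2047 places on a single encoded word and does not handle the quoted-printable ("Q") variant, so treat it as an illustration of the format only.

```javascript
// Hypothetical helper: wraps a string as =?UTF-8?B?<base64>?=
function encodeMimeWord(value) {
  const bytes = new TextEncoder().encode(value);
  let binary = "";
  for (const byte of bytes) binary += String.fromCharCode(byte);
  return `=?UTF-8?B?${btoa(binary)}?=`;
}

// Hypothetical helper: reverses encodeMimeWord.
// Input that is not a UTF-8 "B" encoded word is returned unchanged.
function decodeMimeWord(word) {
  const match = /^=\?UTF-8\?B\?([A-Za-z0-9+/=]*)\?=$/i.exec(word);
  if (!match) return word;
  const bytes = Uint8Array.from(atob(match[1]), (c) => c.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

const encoded = encodeMimeWord("caf\u00e9");
console.log(encoded);
console.log(decodeMimeWord(encoded) === "caf\u00e9");
```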