feat: BROS-193: Storages #8007

Merged
merged 76 commits into from
Jul 28, 2025
Changes from 63 commits
378beab
updates to store connector experience
Jul 18, 2025
f9537ee
storage removing code
Jul 18, 2025
15f9fc8
updating buttons
Jul 18, 2025
75aeed0
Components restructure
nick-skriabin Jul 21, 2025
22f82c9
Form setup
nick-skriabin Jul 21, 2025
05d0d4f
Merge branch 'develop' into fb-bros-193
nick-skriabin Jul 21, 2025
47a0ee2
Merge branch 'develop' into fb-bros-193
nick-skriabin Jul 22, 2025
28b4923
Use proper tailwind classes and tokens, reset form state on modal close
nick-skriabin Jul 22, 2025
ed161a1
S3 look and test connection check
nick-skriabin Jul 22, 2025
b3a89d9
S3 preview files
nick-skriabin Jul 23, 2025
1616974
Providers
nick-skriabin Jul 24, 2025
bea4375
Storage creation and editing
nick-skriabin Jul 24, 2025
ce77ba7
Validation and stepper
nick-skriabin Jul 24, 2025
e0896d7
Cleanup
nick-skriabin Jul 24, 2025
fdf9d3a
Use query mutation
nick-skriabin Jul 24, 2025
70e5d03
Fix default values display
nick-skriabin Jul 24, 2025
cdbb6dc
Remove individual provider forms
nick-skriabin Jul 24, 2025
f9e4384
Cleanup form
nick-skriabin Jul 24, 2025
49358c2
Setup all providers
nick-skriabin Jul 24, 2025
23dfac1
Add cursor rules
nick-skriabin Jul 24, 2025
4cc05dc
Move storage provider form to shared area
nick-skriabin Jul 24, 2025
623ebd3
Feature flag
nick-skriabin Jul 24, 2025
dbe6963
Refactor backend for import list. Add support for all storages
makseq Jul 25, 2025
01def57
Move providers definition outside
nick-skriabin Jul 25, 2025
f9c8441
Fix form layout
nick-skriabin Jul 25, 2025
96de4bd
Refactor into smaller components, move things around
nick-skriabin Jul 25, 2025
38617c1
Icons and UI
nick-skriabin Jul 25, 2025
ecf8320
Fixes on backend with file list
makseq Jul 25, 2025
dbf3bef
Merge branch 'fb-bros-193' of github.com:heartexlabs/label-studio int…
makseq Jul 25, 2025
95315cc
Callout support and styles adjustments
nick-skriabin Jul 25, 2025
a53c0d3
Remove callout
nick-skriabin Jul 25, 2025
e1c6b7f
Resolve circular deps
nick-skriabin Jul 25, 2025
141fe83
Structure fixes
nick-skriabin Jul 25, 2025
d343e3b
Validation
nick-skriabin Jul 25, 2025
81edf93
Remove old files
nick-skriabin Jul 25, 2025
cf79cf8
Remove obsolete files
nick-skriabin Jul 25, 2025
5fa3aff
Restore changed files -- unnecessary changes
nick-skriabin Jul 25, 2025
cf6e1f0
Add None if more than 100 files
makseq Jul 25, 2025
133c168
Merge branch 'fb-bros-193' of github.com:heartexlabs/label-studio int…
makseq Jul 25, 2025
86220b7
Revert "Restore changed files -- unnecessary changes"
nick-skriabin Jul 25, 2025
9118384
Remove unnecessary changes
nick-skriabin Jul 25, 2025
8c47fcf
Fix backend serializer. Add helper for Treat each JSON as task
makseq Jul 26, 2025
2af455f
Rename "Treat every" to "Import method" everywhere including docs
makseq Jul 26, 2025
ecf9b95
Merge branch 'fb-bros-193' of github.com:heartexlabs/label-studio int…
makseq Jul 26, 2025
244e7fb
Rename value JSON to Tasks
makseq Jul 26, 2025
4588edc
Add scan timeout 30 sec
makseq Jul 26, 2025
ce77bc7
Add pagination for GCS iter objects
makseq Jul 26, 2025
e6a79cd
Use old view for target storages
nick-skriabin Jul 26, 2025
5e18c4e
Fix localfiles
nick-skriabin Jul 26, 2025
ee77644
Fix validation, move path to files to Preview step
nick-skriabin Jul 26, 2025
f6a6428
Move Redis's path to preview step
nick-skriabin Jul 26, 2025
781ad79
Connection revaliadation
nick-skriabin Jul 26, 2025
cd8c5ce
Better looking connection check
nick-skriabin Jul 26, 2025
d03a25b
Don't lock the "next" button on the preview step
nick-skriabin Jul 26, 2025
61a2a8d
Reset form state when the dialog is closed
nick-skriabin Jul 26, 2025
9ad05d6
Fix label spacing
nick-skriabin Jul 26, 2025
1525890
Nicer error display
nick-skriabin Jul 26, 2025
30197d9
Fix permissions, page_size in gcs, use presinged urls text
makseq Jul 27, 2025
bd6cd46
Add customized placeholder for regex filter
makseq Jul 27, 2025
51fa4ca
Linters
makseq Jul 27, 2025
66cbd0b
Replace data.heartex.net to s3 bucket file
makseq Jul 27, 2025
9b8adaf
Merge branch 'develop' of github.com:heartexlabs/label-studio into fb…
makseq Jul 27, 2025
ad10d2d
Fix docs and all urls json
makseq Jul 28, 2025
a4f45ff
Fix api docs in pytest
makseq Jul 28, 2025
209dfe6
Update instructions
makseq Jul 28, 2025
ca54324
Revert test
makseq Jul 28, 2025
c3bd2e4
Remove dependency
nick-skriabin Jul 28, 2025
93101d8
Merge branch 'develop' into fb-bros-193
nick-skriabin Jul 28, 2025
add9c93
Update yarn.lock
nick-skriabin Jul 28, 2025
c5664ae
Add pytest for list api
makseq Jul 28, 2025
e6b1202
Merge branch 'fb-bros-193' of github.com:heartexlabs/label-studio int…
makseq Jul 28, 2025
bc55949
Linter for python
makseq Jul 28, 2025
05cd1f0
Rename "JSON - Treat each JSON" to "Tasks - Treach each JSON" to matc…
makseq Jul 28, 2025
3e2504f
Fix linter issues
nick-skriabin Jul 28, 2025
fcfe51c
Fix linter
nick-skriabin Jul 28, 2025
39b8b27
Remove flag override
nick-skriabin Jul 28, 2025
77 changes: 77 additions & 0 deletions .cursor/rules/storage-provider.mdc
@@ -0,0 +1,77 @@
---
description: Set of rules to maintain and extend cloud storage providers like S3, Azure, etc.
globs:
alwaysApply: false
---

# Cursor Rule: Implementing New Storage Providers in Label Studio

## Overview
This rule describes the process and best practices for adding a new storage provider to Label Studio using the declarative provider schema system.

## Steps to Add a New Storage Provider

1. **Create a Provider Config File**
   - Add a new file under `web/lib/app-common/src/blocks/StorageProviderForm/providers/` named after your provider (e.g., `myProvider.ts`).

2. **Define Fields**
   - Use the `FieldDefinition` type for each field.
   - Each field should specify:
     - `name`: Unique string identifier
     - `type`: One of `text`, `password`, `select`, `toggle`, `counter`, etc.
     - `label`: User-facing label
     - `required`: Boolean (if applicable)
     - `placeholder`: Example value (if applicable)
     - `description`: (optional) Help text for the user
     - `autoComplete`: (optional) For password fields
     - `accessKey`: Boolean for credential fields (enables edit mode handling)
     - `options`: For select fields
     - `min`, `max`, `step`: For counter fields
     - `schema`: Zod schema for validation, with `.default()` for default values

3. **Assemble the Layout**
   - Use the `layout` array to group fields into rows.
   - Each row is an object with a `fields` array listing the field names in order.
   - Omit fields like `title`, `regex_filter`, and `use_blob_urls` from the provider schema; these are handled globally or in the preview step.

4. **Validation**
   - Use Zod for all field validation.
   - Use `.default()` for default values where appropriate.
   - For optional fields, use `.optional().default("")` or similar.

5. **Credential Fields**
   - Mark credential fields (e.g., API keys, secrets) with `accessKey: true`.
   - Use `type: "password"` and set `autoComplete` as needed.
   - Provide a realistic placeholder.

6. **Placeholders and Descriptions**
   - Always provide a meaningful placeholder for each field.
   - Add a description if the field may be confusing or has special requirements.

7. **Export the Provider**
   - Export your provider config as the default export from the file.

8. **Register the Provider**
   - Add your provider to the central registry in `providers/index.ts`.
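
The steps above can be sketched roughly as follows. This is a minimal, dependency-free illustration rather than the actual implementation: the `FieldDefinition` and `ProviderConfig` shapes are simplified stand-ins assumed for this example (the Zod `schema` property is omitted so the snippet runs on its own), and `myProvider`, its field names, and the `unknownLayoutFields` helper are hypothetical.

```ts
// Simplified stand-ins for the real types; assumed for illustration only.
type FieldType = "text" | "password" | "select" | "toggle" | "counter";

interface FieldDefinition {
  name: string;
  type: FieldType;
  label: string;
  required?: boolean;
  placeholder?: string;
  description?: string;
  accessKey?: boolean; // true for credential fields
}

interface ProviderConfig {
  name: string;
  fields: FieldDefinition[];
  layout: { fields: string[] }[]; // each row lists field names in order
}

// Hypothetical provider; names and placeholders are illustrative.
const myProvider: ProviderConfig = {
  name: "my_provider",
  fields: [
    { name: "bucket", type: "text", label: "Bucket Name", required: true, placeholder: "my-bucket" },
    { name: "prefix", type: "text", label: "Bucket Prefix", placeholder: "path/to/files" },
    { name: "api_key", type: "password", label: "API Key", required: true, accessKey: true, placeholder: "sk-..." },
  ],
  layout: [{ fields: ["bucket", "prefix"] }, { fields: ["api_key"] }],
};

// Returns every layout entry that does not match a declared field name.
function unknownLayoutFields(provider: ProviderConfig): string[] {
  const unknown: string[] = [];
  for (const row of provider.layout) {
    for (const fieldName of row.fields) {
      const declared = provider.fields.some((f) => f.name === fieldName);
      if (!declared) unknown.push(fieldName);
    }
  }
  return unknown;
}

// In a real provider file, step 7 would end with: export default myProvider;
```

A check like `unknownLayoutFields(myProvider)` could be run in a unit test to catch a layout row that names a field which was renamed or removed.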

## Example Field Definition
```ts
{
  name: "api_key",
  type: "password",
  label: "API Key",
  required: true,
  accessKey: true,
  placeholder: "sk-...",
  autoComplete: "off",
  schema: z.string().min(1, "API Key is required"),
}
```

## Best Practices
- Do **not** include global fields like `title`, `regex_filter`, or `use_blob_urls` in provider configs.
- Use `.default()` in Zod schemas for all fields that should have a default value.
- Use `accessKey: true` for any field that is a credential or secret.
- Keep field and layout definitions minimal and focused on provider-specific configuration.
- Test your provider in both create and edit modes to ensure correct behavior.
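
Two of these practices lend themselves to an automated check. The sketch below is one possible way to do it, not part of this PR: `FieldLike`, `GLOBAL_FIELDS`, and `lintProviderFields` are hypothetical names, and treating every `password` field as a credential is an assumption made for the example.

```ts
// Simplified field shape; assumed for illustration only.
interface FieldLike {
  name: string;
  type: string;
  accessKey?: boolean;
}

// Fields the rule says are handled globally or in the preview step.
const GLOBAL_FIELDS = ["title", "regex_filter", "use_blob_urls"];

// Hypothetical lint helper: returns one message per violated practice.
function lintProviderFields(fields: FieldLike[]): string[] {
  const problems: string[] = [];
  for (const field of fields) {
    if (GLOBAL_FIELDS.indexOf(field.name) !== -1) {
      problems.push(`"${field.name}" is a global field and should be omitted`);
    }
    // Assumption: any password field is a credential and needs accessKey.
    if (field.type === "password" && field.accessKey !== true) {
      problems.push(`"${field.name}" is a password field without accessKey: true`);
    }
  }
  return problems;
}

const problems = lintProviderFields([
  { name: "bucket", type: "text" },
  { name: "title", type: "text" },
  { name: "secret_key", type: "password" },
]);
console.log(problems);
```

For the sample input, the helper reports two problems: the global `title` field and the `secret_key` field that lacks `accessKey: true`.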

4 changes: 2 additions & 2 deletions docs/source/guide/security.md
@@ -111,9 +111,9 @@ Below, both are explained from a security perspective.

After connecting a storage to a project, you have several options to load tasks into the project. Depending on the option, you need to provide specific permissions:

* **Sync media files** (**LIST** permission required): Storage Sync automatically creates Label Studio tasks based on the file list in your storage when **Treat every bucket object as a source file** is enabled. Label Studio does not read the file content; it simply references the files (e.g., `{"image": "s3://bucket/1.jpg"}`).
* **Sync media files** (**LIST** permission required): Storage Sync automatically creates Label Studio tasks based on the file list in your storage when the **Files** import method is selected. Label Studio does not read the file content; it simply references the files (e.g., `{"image": "s3://bucket/1.jpg"}`).

* **Sync JSON task files** (**LIST** and **GET** permissions required): Storage Sync reads Label Studio tasks from JSON files in your bucket and loads the entire JSON content into the Label Studio database when "Treat every bucket object as a source file" is enabled.
* **Sync JSON task files** (**LIST** and **GET** permissions required): Storage Sync reads Label Studio tasks from JSON files in your bucket and loads the entire JSON content into the Label Studio database when the **Tasks** import method is selected.

* **No sync** (**none** permissions required): You can manually import JSON files containing Label Studio tasks and reference storage URIs (e.g., `{"image": "s3://bucket/1.jpg"}`) inside tasks.
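
The three options above can be summarized as a lookup from sync option to the storage permissions the backend needs. This is only an illustrative sketch; `SyncOption`, `StoragePermission`, and `requiredPermissions` are hypothetical names, not Label Studio APIs.

```ts
// The three sync options described above; identifiers are hypothetical.
type SyncOption = "mediaFiles" | "jsonTaskFiles" | "noSync";
type StoragePermission = "LIST" | "GET";

const REQUIRED_PERMISSIONS: Record<SyncOption, StoragePermission[]> = {
  mediaFiles: ["LIST"], // only the file list is read, never the content
  jsonTaskFiles: ["LIST", "GET"], // JSON content is read and stored
  noSync: [], // tasks are imported manually
};

function requiredPermissions(option: SyncOption): StoragePermission[] {
  return REQUIRED_PERMISSIONS[option];
}

console.log(requiredPermissions("jsonTaskFiles"));
```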

35 changes: 20 additions & 15 deletions docs/source/guide/storage.md
@@ -25,8 +25,8 @@ Set up the following cloud and other storage systems with Label Studio:
When working with an external cloud storage connection, keep the following in mind:

* For Source storage:
* When "Treat every bucket object as a source file" is checked, Label Studio doesn’t import the data stored in the bucket, but instead creates *references* to the objects. Therefore, you have full access control on the data to be synced and shown on the labeling screen.
* When "Treat every bucket object as a source file" is unchecked, bucket files are assumed to be immutable; the only way to push an updated file's state to Label Studio is to upload it with a new filename or delete all tasks that are associated with that file and resync.
* When the **Files** import method is selected, Label Studio doesn’t import the data stored in the bucket, but instead creates *references* to the objects. Therefore, you have full access control over the data to be synced and shown on the labeling screen.
* When the **Tasks** import method is selected, bucket files are assumed to be immutable; the only way to push an updated file's state to Label Studio is to upload it to storage with a new filename, or to delete all tasks that are associated with that file and resync.
* Sync operations with external buckets only go one way. They either create tasks from objects in the bucket (Source storage) or push annotations to the output bucket (Target storage). Changing something on the bucket side doesn't guarantee consistency in results.
* We recommend using a separate bucket folder for each Label Studio project.
* Storage Regions: To minimize latency and improve efficiency, store data in cloud storage buckets that are geographically closer to your team rather than near the Label Studio server.
@@ -57,7 +57,7 @@ Task data synced from cloud storage is not stored in Label Studio. Instead, the

* If you set the import method to "Files", Label Studio backend will only need LIST permissions and won't download any data from your buckets.

* If you set the import method to "JSON", Label Studio backend will require GET permissions to read JSON files and convert them to Label Studio tasks.
* If you set the import method to "Tasks", Label Studio backend will require GET permissions to read JSON files and convert them to Label Studio tasks.

When your users access labeling, the backend will attempt to resolve URI (e.g., s3://) to URL (https://) links. URLs will be returned to the frontend and loaded by the user's browser. To load these URLs, the browser will require HEAD and GET permissions from your Cloud Storage. The HEAD request is made at the beginning and allows the browser to determine the size of the audio, video, or other files. The browser then makes a GET request to retrieve the file body.

@@ -73,11 +73,14 @@ Source storage functionality can be divided into two parts:

#### Import method

!!! info
The "Treat every bucket object as a source file" option was renamed and reintroduced as the "Import method" dropdown.

Label Studio Source Storages feature an "Import method" dropdown. This setting enables two different methods of loading tasks into Label Studio.

###### JSON
###### Tasks

When set to "JSON", tasks in JSON or JSONL/NDJSON format can be loaded directly from storage buckets into Label Studio. This approach is particularly helpful when dealing with complex tasks that involve multiple media sources.
When set to "Tasks", tasks in JSON, JSONL/NDJSON or Parquet format can be loaded directly from storage buckets into Label Studio. This approach is particularly helpful when dealing with complex tasks that involve multiple media sources.

<img src="/images/source-storages-treat-off.png" class="make-intense-zoom">
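
To illustrate how a single storage object can yield one or more tasks in "Tasks" mode, here is a rough sketch of parsing JSON and JSONL/NDJSON content. It is not the Label Studio implementation: `Task` and `parseTaskFile` are hypothetical names, the idea that a JSON file holds either one task object or an array of tasks is an assumption for this example, and Parquet is left out because it is a binary format that needs a dedicated reader.

```ts
// A task is treated here as an arbitrary JSON object; assumed for illustration.
type Task = Record<string, unknown>;

// Hypothetical helper: parse the text of one storage object into tasks.
function parseTaskFile(fileName: string, content: string): Task[] {
  if (/\.(jsonl|ndjson)$/i.test(fileName)) {
    // JSONL/NDJSON: one JSON object per non-empty line.
    return content
      .split("\n")
      .map((line) => line.trim())
      .filter((line) => line.length > 0)
      .map((line) => JSON.parse(line) as Task);
  }
  // JSON: either a single task object or an array of tasks.
  const parsed: unknown = JSON.parse(content);
  return Array.isArray(parsed) ? (parsed as Task[]) : [parsed as Task];
}

const fromJsonl = parseTaskFile(
  "tasks.jsonl",
  '{"image": "s3://bucket/1.jpg"}\n{"image": "s3://bucket/2.jpg"}\n',
);
console.log(fromJsonl.length);
```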

@@ -392,7 +395,7 @@ After you [configure access to your S3 bucket](#Configure-access-to-your-S3-buck
- In the **Session Token** field, specify a session token of the temporary security credentials for an AWS account with access to your S3 bucket.
- In the **Import method** dropdown, choose how to import your data:
- **Files** - Automatically creates a task for each storage object (e.g. JPG, MP3, TXT). Use this if your bucket contains BLOB storage files such as JPG, MP3, or similar file types.
- **JSON** - Treat each JSON or JSONL file as a task definition (one or more tasks per file). Use this if you have multiple JSON files in the bucket with one task per JSON file.
- **Tasks** - Treat each JSON, JSONL, or Parquet file as a task definition (one or more tasks per file). Use this if you have multiple JSON files in the bucket with one task per JSON file.
- (Optional) Enable **Scan all sub-folders** to include files from all nested folders within your S3 bucket prefix.
- In the **Use pre-signed URLs (On) / Proxy through Label Studio (Off)** toggle, choose how media is loaded:
- **ON** (Pre-signed URLs) - All data bypasses the platform and user browsers directly read data from storage.
@@ -559,7 +562,7 @@ In the Label Studio UI, do the following to set up the connection:
- In the **External ID** field, specify the external ID that identifies Label Studio to your AWS account. You can find the external ID on your **Organization** page.
- In the **Import method** dropdown, choose how to import your data:
- **Files** - Automatically creates a task for each storage object (e.g. JPG, MP3, TXT). Use this if your bucket contains BLOB storage files such as JPG, MP3, or similar file types.
- **JSON** - Treat each JSON or JSONL file as a task definition (one or more tasks per file). Use this if you have multiple JSON files in the bucket with one task per JSON file.
- **Tasks** - Treat each JSON, JSONL, or Parquet file as a task definition (one or more tasks per file). Use this if you have multiple JSON files in the bucket with one task per JSON file.
- Enable **Scan all sub-folders** to include files from all nested folders within your S3 bucket prefix.
- In the **Use pre-signed URLs (On) / Proxy through Label Studio (Off)** toggle, choose how media is loaded:
- **ON** (Pre-signed URLs) - All data bypasses the platform and user browsers directly read data from storage.
@@ -703,7 +706,7 @@ In the Label Studio UI, do the following to set up the connection:
- In the **File Filter Regex** field, specify a regular expression to filter bucket objects. Use `.*` to collect all objects.
- In the **Import method** dropdown, choose how to import your data:
- **Files** - Automatically creates a task for each storage object (e.g. JPG, MP3, TXT). Use this if your bucket contains BLOB storage files such as JPG, MP3, or similar file types.
- **JSON** - Treat each JSON or JSONL file as a task definition (one or more tasks per file). Use this if you have multiple JSON files in the bucket with one task per JSON file.
- **Tasks** - Treat each JSON, JSONL, or Parquet file as a task definition (one or more tasks per file). Use this if you have multiple JSON files in the bucket with one task per JSON file.
- In the **Use pre-signed URLs (On) / Proxy through Label Studio (Off)** toggle, choose how media is loaded:
- **ON** (Pre-signed URLs) - All data bypasses the platform and user browsers directly read data from storage.
- **OFF** (Proxy) - The platform proxies media using its own backend.
@@ -1034,7 +1037,7 @@ Select the **GCS (WIF auth)** storage type and then complete the following field
| Bucket Name | Enter the name of the Google Cloud bucket. |
| Bucket Prefix | Optionally, enter the folder name within the bucket that you would like to use. For example, `data-set-1` or `data-set-1/subfolder-2`. |
| File Name Filter | Optionally, specify a regular expression to filter bucket objects. |
| [Treat every bucket object as a source file](#Treat-every-bucket-object-as-a-source-file) | Enable this option if your bucket contains BLOB storage files such as JPG, MP3, or similar file types. This setting creates a URL for each bucket object to use for labeling, such as `gs://my-gcs-bucket/image.jpg`. Leave this option disabled if you have are specifying your tasks in JSON files. |
| Import method | Choose how to interpret your data:<br/>**Files** - Automatically creates a task for each storage object (e.g. JPG, MP3, TXT). Use this if your bucket contains BLOB storage files such as JPG, MP3, or similar file types.<br/>**Tasks** - Treat each JSON, JSONL, or Parquet file as a task definition (one or more tasks per file). Use this if you have multiple JSON files in the bucket with one task per JSON file. |
| [Use pre-signed URLs](#Pre-signed-URLs-vs-storage-proxies) | **ON** - Label Studio generates a pre-signed URL to load media. <br /> **OFF** - The platform proxies media using its own backend. |
| Pre-signed URL counter | Adjust the counter for how many minutes the pre-signed URLs are valid. |
| Workload Identity Pool ID | This is the ID you specified when creating the Workload Identity Pool. You can find this in Google Cloud Console under **IAM & Admin > Workload Identity Pools**. |
@@ -1159,7 +1162,7 @@ In the Label Studio UI, do the following to set up the connection:
- In the **File Filter Regex** field, specify a regular expression to filter bucket objects. Use `.*` to collect all objects.
- In the **Account Name** field, specify the account name for the Azure storage. You can also set this field as an environment variable, `AZURE_BLOB_ACCOUNT_NAME`.
- In the **Account Key** field, specify the secret key to access the storage account. You can also set this field as an environment variable, `AZURE_BLOB_ACCOUNT_KEY`.
- Enable **Treat every bucket object as a source file** if your bucket contains BLOB storage files such as JPG, MP3, or similar file types. This setting creates a URL for each bucket object to use for labeling, for example `azure-blob://container-name/image.jpg`. Leave this option disabled if you have multiple JSON files in the bucket with one task per JSON file.
- Set **Import method** to **"Files"** if your bucket contains BLOB storage files such as JPG, MP3, or similar file types. This setting creates a URL for each bucket object to use for labeling, for example `azure-blob://container-name/image.jpg`. Set this option to **"Tasks"** if you have multiple JSON/JSONL/Parquet files in the bucket with tasks.
- Choose whether to disable [**Use pre-signed URLs**](#Pre-signed-URLs-vs-storage-proxies), or [shared access signatures](https://docs.microsoft.com/en-us/rest/api/storageservices/delegate-access-with-shared-access-signature).
- **ON** - Label Studio generates a pre-signed URL to load media.
- **OFF** - The platform proxies media using its own backend.
@@ -1218,7 +1221,9 @@ In the Label Studio UI, do the following to set up the connection:
- In the **Host** field, specify the IP of the server hosting the database, or `localhost`.
- In the **Port** field, specify the port that you can use to access the database.
- In the **File Filter Regex** field, specify a regular expression to filter database objects. Use `.*` to collect all objects.
- Enable **Treat every bucket object as a source file** if your database contains files such as JPG, MP3, or similar file types. This setting creates a URL for each database object to use for labeling. Leave this option disabled if you have multiple JSON files in the database, with one task per JSON file.
- In the **Import method** dropdown, choose how to import your data:
- **Files** - Automatically creates a task for each storage object (e.g. JPG, MP3, TXT). Use this if your database contains BLOB storage files such as JPG, MP3, or similar file types.
- **Tasks** - Treat each JSON, JSONL, or Parquet file as a task definition (one or more tasks per file). Use this if you have multiple JSON files in the database with one task per JSON file.
8. Click **Add Storage**.
9. Repeat these steps for **Target Storage** to sync completed data annotations to a database.

@@ -1268,9 +1273,9 @@ In the Label Studio UI, do the following to set up the connection:
If you are using Windows, ensure that you use backslashes when entering your **Absolute local path**.

1. (Optional) In the **File Filter Regex** field, specify a regular expression to filter bucket objects. Use `.*` to collect all objects.
2. (Optional) Toggle **Treat every bucket object as a source file**.
- Enable this option if you want to create Label Studio tasks from media files automatically, such as JPG, MP3, or similar file types. Use this option for labeling configurations with one source tag.
- Disable this option if you want to import tasks in Label Studio JSON format directly from your storage. Use this option for complex labeling configurations with HyperText or multiple source tags.
2. (Optional) In the **Import method** dropdown, choose how to import your data:
- **Files** - Automatically creates a task for each storage object (e.g. JPG, MP3, TXT). Use this if you want to create Label Studio tasks from media files automatically. Use this option for labeling configurations with one source tag.
- **Tasks** - Treat each JSON, JSONL, or Parquet file as a task definition (one or more tasks per file). Use this if you want to import tasks in Label Studio JSON format directly from your storage. Use this option for complex labeling configurations with HyperText or multiple source tags.
3. Click **Add Storage**.
4. Repeat these steps for **Add Target Storage** to use a local file directory for exporting.

@@ -1283,7 +1288,7 @@ In those cases, you have to repeat all stages above to create local storage, but

Differences from the instructions above:
- **7. File Filter Regex** - leave empty (because you will specify it inside tasks)
- **8. Treat every bucket object as a source file** - switch off (because you will specify it inside tasks)
- **8. Import method** - select **"Tasks"** (because you will specify file references inside your JSON task definitions)

Your window will look like this:
<img src="/images/local-storage-settings2.png" alt="Screenshot of the local storage settings for user task." class="gif-border">