This hands-on tutorial shows how to use the Unstructured self-hosted plugin framework to create a sample plugin. This sample plugin uses a VertexAI model from Google to perform sentiment analysis on the text that Unstructured extracts from documents. For example, given the following custom prompt:

Given a piece of text, classify the following types of information:
- toxic or non-toxic
- emotion if it conveys, such as happiness, or anger
- intent such as "finding information", "making a reservation", or "placing an order"

text:
    {text}

Output the results as JSON and nothing else. For example:
```json
{{
    "toxicity": "non-toxic",
    "emotion": "neutral",
    "intent": "making a reservation"
}}

And the following text:

Hi, can you please book a table for two at Juan for May 1?

The model returns a sentiment analysis in this format:

{
  "toxicity": "non-toxic",
  "emotion": "neutral",
  "intent": "making a reservation"
}

Requirements

  • A self-hosted deployment of the Unstructured UI and Unstructured API into infrastructure that you maintain in your Amazon Web Services (AWS), Azure, or Google Cloud Platform (GCP) account. If you do not have a self-hosted deployment, stop and contact your Unstructured sales representative, email Unstructured Sales at sales@unstructured.io, or fill out the contact form on the Unstructured website first.

  • A local development machine with Docker Desktop and the Python package and project manager uv installed.

  • For sending requests to the plugin through Docker locally, the curl utility installed on the development machine.

  • For deploying the plugin to your self-hosted Unstructured UI, you must have access to a container registry that is compliant with the Open Container Initiative (OCI) and that is also reachable from your AWS, Azure, or GCP account, for example Amazon Elastic Container Registry (ECR), Azure Container Registry, or Google Artifact Registry (GAR).

    You must also have the related command-line interface (for example, the AWS CLI, the Azure CLI, or the Google Cloud CLI) installed and configured on the development machine.

  • To call the VertexAI portions of this tutorial:

    • A Google Cloud account.

    • The Vertex AI API enabled in the Google Cloud account. Learn how.

    • Within the Google Cloud account, a Google Cloud service account and its related credentials.json key file or its contents in JSON format. Create a service account. Create credentials for a service account.

    • A single-line string that contains the contents of the downloaded credentials.json key file for the service account (not the key file itself). To print this single-line string without line breaks, suitable for copying, run one of the following commands from your terminal (macOS or Linux) or PowerShell (Windows). In these commands, replace <path-to-downloaded-key-file> with the path to the credentials.json key file that you downloaded by following the preceding instructions.

      • For macOS or Linux:

        tr -d '\n' < <path-to-downloaded-key-file>
        
      • For Windows (PowerShell):

        (Get-Content -Path "<path-to-downloaded-key-file>" -Raw).Replace("`r`n", "").Replace("`n", "")
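
      • As a cross-platform alternative (an assumption for this tutorial: Python 3 is installed locally), the following small sketch prints the same single-line string. Save it as a script, replace the path placeholder, and run it with python:

        # strip_newlines.py (hypothetical helper, not part of the plugin project)
        # Prints the service account key file's contents as a single line, without line breaks.
        with open("<path-to-downloaded-key-file>") as f:
            print(f.read().replace("\r\n", "").replace("\n", ""))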
        

Getting started

In this section, you set up the local development environment for this tutorial’s plugin. This includes creating a directory for overall plugin development, creating a virtual environment to isolate and version Python and various code dependencies, installing the Unstructured plugin development tools and their dependencies, and creating and initializing the code project for this tutorial’s plugin.

1

Identify a directory for overall plugin development

We recommend creating or using a centralized directory on your local development machine to use for developing this and other plugins. If you create a new directory, be sure to switch to it after you create it. This tutorial uses a directory named plugins within the current working directory. For example:

mkdir plugins
cd plugins
2

Create a virtual environment within the directory

Use uv to create a virtual environment within the directory that you want to use for overall plugin development. After you create the virtual environment, activate it.

This tutorial uses a virtual environment named plugins_3_12_9. This virtual environment uses Python 3.12.9. If this Python version is not installed on the system, uv installs it first. For example:

uv venv --python 3.12.9 --prompt "plugins_3_12_9"
source .venv/bin/activate
3

Install the Unstructured plugin development tools and their dependencies

Use uv to install the Unstructured plugin development tools and their dependencies into this virtual environment. These tools and their dependencies will be the same for all plugins that you develop that use this virtual environment.

uv pip install utic-dev-tools cookiecutter

The cookiecutter dependency is a command-line utility that uses interactive prompts (wizards) and Python project templates to initialize new projects based on user input.

4

Create and initialize this tutorial's code project for the plugin

  1. Use the unstructured-plugins new command to create the starter code for this tutorial’s plugin development project. This command starts a wizard that is used to create a new directory for developing this plugin and then creates the plugin’s starter files and subdirectories within that directory:

    unstructured-plugins new
    
  2. When prompted, enter a display name for the plugin, and then press Enter. This tutorial uses Sentiment Analysis as the plugin’s display name:

    [1/3] name (My First Plugin): Sentiment Analysis
    
  3. Next, enter the plugin’s type, and then press Enter. This tutorial uses sentiment as the plugin’s type:

    [2/3] type (): sentiment
    
  4. Next, enter the plugin’s subtype, and then press Enter. This tutorial uses analysis as the plugin’s subtype:

    [3/3] subtype (): analysis
    
  5. A project folder is created within the centralized plugins directory. The project folder is named plugin- followed by the plugin’s type, another dash, and the plugin’s subtype. For this tutorial, the project folder is named plugin-sentiment-analysis.

    Switch to the plugin’s project folder and then use uv to install and update this project’s specific code dependencies:

    cd plugin-sentiment-analysis
    uv sync
    

Write the plugin

In this section, you write the plugin’s runtime logic. This tutorial’s logic is primarily within the project’s src/plugin_sentiment_analysis/__init__.py file.

1

Add user interface settings

In this step, you add the user interface (UI) settings for the plugin. The UI settings are the fields that users see when they add the plugin as a node to a workflow’s visual DAG designer in the UI. The UI settings are defined in the __init__.py file of the plugin project’s src/plugin_<type>_<subtype> subfolder. These settings are specified in the __init__.py file’s PluginSettings class, which is a subclass of the Pydantic BaseModel class. The BaseModel class provides a Pydantic implementation of various type validation, data parsing, and serialization functionality.

  1. In the project’s src directory, under the plugin_sentiment_analysis subdirectory, open the __init__.py file.

  2. In the __init__.py file, add the necessary imports to capture VertexAI settings that the user sets in the UI. To do this, add the following from...import statements to the top of the file:

    from typing import Literal
    from pydantic import SecretStr
    

    Literal is a type hint in Python that restricts a field to specific literal values (such as strings, numbers, or booleans). It enforces that the input must match one of the specified options.

    SecretStr is a specialized string type in Pydantic for sensitive data (such as passwords and API keys). It masks the field’s value by displaying ***** instead of the actual text.

  3. In the __init__.py file’s PluginSettings class, replace the sample string_field setting definition with settings for the location, credentials, and model fields. The class definition should now look as follows:

    class PluginSettings(BaseModel):
        """
        Settings used to configure running instances of the plugin.
    
        These are what can be configured by the user and what will be
        available in the UI.
        """
    
        location: Literal[
            "us-east5", "us-south1", "us-central1", "us-east1", "us-east4", "us-west1"
        ] = Field(title="API Location")
        credentials: SecretStr = Field(title="Credentials JSON")
        model: Literal["gemini-1.5-flash"] = Field("gemini-1.5-flash", title="Model")
    
    • The location field specifies the location of the VertexAI API. In the UI’s pane for the plugin node, this field displays the title API Location.
    • The credentials field specifies the JSON credentials for the VertexAI API. In the UI, this field displays the title Credentials JSON. The SecretStr type causes the field’s value to be displayed as asterisks.
    • The model field specifies the model for VertexAI to use. In the UI, this field displays the title Model. The default value for this field is gemini-1.5-flash.
    • At run time, the PluginSettings class reads these fields’ values from the UI and writes them as a JSON dictionary into a settings.json file in the project’s root for the plugin to read from later.
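
    For illustration, the following minimal sketch (not part of the plugin’s code; the field values shown are hypothetical) shows how the PluginSettings class defined above would validate such a JSON dictionary, and how SecretStr masks the credentials value:

    import json

    # Hypothetical settings payload, shaped like the JSON dictionary that the UI
    # writes to settings.json. Assumes PluginSettings (defined above) is in scope.
    settings_json = json.dumps({
        "location": "us-east1",
        "credentials": "<single-line-credentials-json>",
        "model": "gemini-1.5-flash",
    })

    settings = PluginSettings.model_validate_json(settings_json)
    print(settings.location)     # us-east1
    print(settings.model)        # gemini-1.5-flash
    print(settings.credentials)  # ********** (masked by SecretStr)

    # The actual credentials value is only revealed on request:
    raw_credentials = settings.credentials.get_secret_value()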
2

Integrate with VertexAI

  1. Add the necessary VertexAI dependencies:

    uv pip install google-cloud-aiplatform google-auth
    
  2. At the top of the __init__.py file, add the necessary import statements for calling the VertexAI API and for standard Python logging and JSON parsing:

    from google.cloud import aiplatform
    from google.oauth2 import service_account
    from vertexai.generative_models import GenerativeModel
    import logging, json
    
  3. In the __init__.py file’s Plugin class, replace the __post_init__ function body with the following definition:

    def __post_init__(self):
        try:
            with open(self.env_settings.job_settings_file) as f:
                self.plugin_settings = PluginSettings.model_validate_json(f.read())
    
            credentials_json = json.loads(
                self.plugin_settings.credentials.get_secret_value()
            )
            credentials = service_account.Credentials.from_service_account_info(
                credentials_json
            )
    
            aiplatform.init(
                location=self.plugin_settings.location,
                project=credentials_json["project_id"],
                credentials=credentials,
            )
    
            self.model = GenerativeModel(self.plugin_settings.model)
        except Exception as e:
            print(f"Plugin initialization failed: {e}")
            raise 
    
    • The __post_init__ function is called after the Plugin class is initialized. The function reads in the UI field values from the settings.json file that the PluginSettings class wrote to earlier.
    • The function then prepares the authorization credentials that were provided in the UI to be used by VertexAI.
    • The aiplatform.init function initializes the VertexAI API with the specified location, project ID, and authorization credentials.
    • The GenerativeModel class gets the model to be used that was specified in the UI.
  4. In the __init__.py file’s Plugin class, just before the run function, add the prompt text to be sent to VertexAI. At run time, this prompt, along with a piece of text that Unstructured extracts from the document, is sent to VertexAI for sentiment analysis:

    PROMPT = """
    Given a piece of text, classify the following types of information:
    - toxic or non-toxic
    - emotion if it conveys, such as happiness, or anger
    - intent such as "finding information", "making a reservation", or "placing an order"
    
    text:
        {text}
    
    Output the results as JSON and nothing else. For example:
    ```json
    {{
        "toxicity": "non-toxic",
        "emotion": "neutral",
        "intent": "making a reservation"
    }}
    

    """

  5. In the __init__.py file’s Plugin class, replace the run function body with the following definition:

    def run(self, element_dicts: list[dict]) -> Response:
        """
        This method is called once for every file that is processed.
    
        element_dicts is a list of elements:
    
        See https://docs.unstructured.io/open-source/concepts/document-elements
        """
        for element in element_dicts:
            element: ElementDict
            prompt_text = self.PROMPT.format(text=element["text"])
            response_text = self.model.generate_content(prompt_text).text
            try:
                data = json.loads(response_text.strip().strip("```").lstrip("json"))
            except json.JSONDecodeError:
                logging.basicConfig(level=logging.INFO)
                logging.getLogger().error(f"Failed to parse response: {response_text}")
                data = {}
            element["metadata"].update(data)
    
        return Response(element_dicts=element_dicts)
    
    • The run function is called once for every file that is processed. The function takes a list of the elements that Unstructured generated from the file as input.
    • Each element in the list of elements is a dictionary that contains the text extracted from the document and its related metadata.
    • The function sends the prompt and the element’s text to the model.
    • The function then adds the sentiment analysis output to the element’s metadata field.
    • After the last element’s sentiment analysis is written into that element’s metadata field, the entire updated list is passed as input to the next node in the workflow’s DAG.
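
    As an end-to-end illustration, here is a minimal sketch (hypothetical element values; it assumes an already-initialized Plugin instance named plugin, such as the one built by the test fixture in the next section) of the data going into and coming out of run:

    # One element, shaped like the dictionaries that Unstructured produces for a file.
    element = {
        "type": "NarrativeText",
        "element_id": "1453c80530ef11712374570a086dbd64",
        "text": "Hi, can you please book a table for two at Juan for May 1?",
        "metadata": {"languages": ["eng"], "filetype": "text/plain"},
    }

    response = plugin.run([element])

    # On success, the element's metadata also carries the model's sentiment analysis,
    # for example: {"toxicity": "non-toxic", "emotion": "neutral", "intent": "making a reservation"}
    print(response.element_dicts[0]["metadata"])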

Run plugin tests locally with pytest

In this section, you manually run the plugin’s tests locally using pytest to make sure that the plugin’s logic is working as expected before further testing in Docker and eventual deployment for use in the UI.

In practice, you would typically use a continuous integration and continuous deployment (CI/CD) pipeline to automate running these tests. If any of the tests fail, the pipeline should stop and notify you of the failure. If all of the tests pass, the pipeline should then continue by running the plugin in Docker as a further test.

1

  1. Add the necessary pytest dependencies. Also add a dependency on the dotenv package, which is used to read environment variables from a local .env file:

    uv pip install pytest dotenv
    
  2. In the project’s test directory, at the top of the test_plugin.py file, add the following import statements to enable reading local environment variables. Also, call the load_dotenv function to load the environment variables from the .env file:

    import os
    from dotenv import load_dotenv
    
    load_dotenv()
    
  3. In the test_plugin.py file, update the following from...import statement to find the specified classes that are defined in the src/plugin_sentiment_analysis folder:

    from src.plugin_sentiment_analysis import Plugin, PluginSettings
    
  4. In the root of the project’s test directory, add a blank __init__.py file. This file is required so that the preceding from...import statement can resolve the src directory from within the test directory.

2

  1. In the test_plugin.py file, replace the plugin function body with the following definition:

    @pytest.fixture
    def plugin(tmp_path):
        credentials = os.getenv("VERTEXAI_CREDENTIALS")
        if not credentials:
            raise ValueError("VERTEXAI_CREDENTIALS env var must be set to run test")
    
        settings_filepath: Path = tmp_path / "settings.json"
        settings = {"location": "us-east1", "credentials": credentials}
        settings_filepath.write_text(json.dumps(settings))
    
        yield Plugin(
            env_settings=EnvSettings(
                shared_filepath=tmp_path,
                job_settings_file=str(settings_filepath),
            )
        )
    
    • The plugin function is a fixture that sets up the plugin’s infrastructure for the test_plugin test function that follows.
    • The function reads the VERTEXAI_CREDENTIALS environment variable from the .env file that you will create next.
    • Instead of using the settings.json file that would normally be used by the PluginSettings class, the function creates a temporary settings.json file just for these tests. This temporary file contains sample values for the API Location and Credentials JSON fields that users would have otherwise specified when using the plugin in the UI.
  2. In the project’s root, create a file named .env. In this file, add an environment variable named VERTEXAI_CREDENTIALS, and set it to the single-line representation of the credentials.json file that you generated in this tutorial’s requirements:

    VERTEXAI_CREDENTIALS="<single-line-credentials-json>"
    

    If you plan to publish this plugin’s source code to an external repository such as GitHub, do not include the .env file in the repository, as it can expose sensitive information publicly, such as your credentials for the VertexAI API.

    To help prevent this file from accidentally being included in the repository, add a .env entry to a .gitignore file in the root of the project.

  3. In the test_plugin.py file, replace the test_plugin function body with the following definition:

    def test_plugin(plugin: Plugin, elements: list[dict]):
    
        elements[0]["text"] = "Hi, can you please book a table for two at Juan for May 1?"
    
        output = plugin.run(elements)
        output_elements = output.element_dicts
    
        assert len(output_elements) == 1
        metadata = output_elements[0]["metadata"]
    
        assert metadata["toxicity"] == "non-toxic"
        assert metadata["emotion"] == "neutral"
        assert metadata["intent"] == "making a reservation"
    
    • The test_plugin function is a test case that uses the plugin fixture to run the plugin’s logic.
    • The function takes a list of Unstructured-formatted elements as input. The first element in the list contains the text that is used to test the plugin.
    • The function then runs the plugin’s logic and checks that the output is as expected.
    • The function checks that the output contains the expected values for the toxicity, emotion, and intent fields that are returned. If the expected values match, the test passes. Otherwise, the test fails.
3

Run the test

To run the test, use the following command to run pytest through the test target in the file named Makefile in the root of the project:

make test

If the test passes, you should see something similar to the following:

tests/test_plugin.py .

1 passed

Run the plugin in Docker locally

In this section, you proceed with local testing by manually running the plugin in Docker locally. This allows you to more fully test the plugin’s logic in an isolated environment before you deploy it into your self-hosted UI.

In practice, you would typically use a CI/CD pipeline to automate running the plugin in Docker and testing the output against an expected result. If the plugin’s output does not match the expected result, the pipeline should stop and notify you of the failure. If the plugin’s output matches the expected result, the pipeline should then continue by deploying the plugin to the staging version of your self-hosted Unstructured UI.

1

In your local machine’s home directory, create a hidden file named .vertex-plugin-settings.json. This file contains information that your local installation of Docker passes into the running container. In this file, add the following JSON content:

{
    "location": "<location>", 
    "credentials": "<single-line-credentials-json>"
}

In the preceding JSON:

  • Replace <location> with the location of the VertexAI API that you want to use, for example, us-east1.
  • Replace <single-line-credentials-json> with the single-line representation of the credentials.json file that you generated in this tutorial’s requirements.

This .vertex-plugin-settings.json file contains sensitive information and is intended for local Docker testing only. Do not check in this file with your plugin’s source code.

2

  1. In the file named Makefile in the root of the project, replace the .PHONY: run-docker definition with the following definition:

    .PHONY: run-docker
    run-docker: docker-build-local
      docker run -it --rm \
        -v $(PWD):/shared \
        -v $(HOME)/.vertex-plugin-settings.json:/settings.json \
        -e JOB_SETTINGS_FILE=/settings.json \
        -p 8000:8000 \
        "${IMAGE_REPOSITORY}:${VERSION}"
    

    The run-docker target builds the Docker image locally and then runs it as a container representing the plugin.

  2. Start Docker Desktop on your local machine, if it is not already running.

  3. Run the following command to call the run-docker target, which builds the Docker image and then runs the resulting container, representing the plugin:

    make run-docker
    

    You must leave this terminal window open and running while you are testing the plugin locally within the running Docker container. If you interrupt the running process here or close this terminal window, the Docker container stops running, and the plugin stops working.

3

Send a request to the listening plugin

  1. In a new terminal window, use the following curl command to send a request to the plugin that is running in the Docker container. The request contains some sample text that you want VertexAI to perform sentiment analysis on, along with some sample metadata in the format that Unstructured typically generates during processing.

    curl --location 'localhost:8000/invoke' \
    --header 'Content-Type: application/json' \
    --data '{
        "element_dicts": [
            {
                "type": "NarrativeText",
                "element_id": "1453c80530ef11712374570a086dbd64",
                "text": "Hi, can you please book a table for two at Juan for May 1?",
                "metadata": {
                    "languages": [
                        "eng"
                    ],
                    "filetype": "text/plain",
                    "data_source": {
                        "record_locator": {
                            "path": "/path/to/file.txt"
                        },
                        "permissions_data": [
                            {
                                "mode": 33188
                            }
                        ]
                    }
                }
            }
        ]
    }'
    
  2. If successful, the output should look similar to the following. Notice that the toxicity, emotion, and intent fields were added to the element’s metadata field (JSON formatting has been applied here for better readability):

    {
        "usage": [],
        "status_code": 200,
        "filedata_meta": {
            "terminate_current": false,
            "new_records": []
        },
        "status_code_text": null,
        "output": {
            "element_dicts": [
                {
                    "type": "NarrativeText",
                    "element_id": "1453c80530ef11712374570a086dbd64",
                    "text": "Hi, can you please book a table for two at Juan for May 1?",
                    "metadata": {
                        "languages": [
                            "eng"
                        ],
                        "filetype": "text/plain",
                        "data_source": {
                            "record_locator": {
                                "path": "/path/to/file.txt"
                            },
                            "permissions_data": [
                                {
                                    "mode": 33188
                                }
                            ]
                        },
                        "toxicity": "non-toxic",
                        "emotion": "neutral",
                        "intent": "making a reservation"
                    }
                }
            ]
        },
        "message_channels": {
            "infos": [],
            "warnings": []
        }
    }
    
  3. When you are done testing, you can stop the plugin by interrupting or closing the terminal window where the Docker container is running.

Deploy the plugin to your self-hosted UI

In this section, you manually deploy the successfully tested plugin for your users to add to their workflows’ DAGs within your self-hosted Unstructured UI. This section describes how to deploy the plugin from your local development machine directly into your existing container registry.

In practice, you would typically use a CI/CD pipeline to automate deploying the plugin.

1

Specify the name of your container registry

In the file named Makefile in the root of the project, set the IMAGE_REGISTRY variable, replacing REGISTRY_NAME_REPLACE_ME with the name of your container registry.

IMAGE_REGISTRY=REGISTRY_NAME_REPLACE_ME

If you do not already know the name of your container registry, run the command that is appropriate for your registry to look it up.

The container registry name typically takes the following format:

  • For AWS ECR, <aws_account_id>.dkr.ecr.<region>.amazonaws.com
  • For Azure Container Registry, <acr-name>.azurecr.io
  • For GAR, <location>-docker.pkg.dev/<project-id>/<repository-name>
2

Specify the username and password for access to your container registry

Set the following environment variables to the appropriate username and password for access to your container registry:

  • PLUGIN_REGISTRY_USERNAME
  • PLUGIN_REGISTRY_PASSWORD

For example:

# For macOS and Linux:
export PLUGIN_REGISTRY_USERNAME="<username>"
<container-registry-login-command>
export PLUGIN_REGISTRY_PASSWORD="<password>"

# For Windows:
set PLUGIN_REGISTRY_USERNAME="<username>"
<container-registry-login-command>
set PLUGIN_REGISTRY_PASSWORD="<password>"

In the preceding commands, for <container-registry-login-command>, run the command that is appropriate for your container registry. For example:

  • For AWS ECR, you do not run a separate login command here.
  • For Azure Container Registry, run the Azure CLI command az acr login with the appropriate command-line options.
  • For GAR, run the Google Cloud CLI command gcloud auth configure-docker with the appropriate command-line options.

In the preceding commands, to get the value for <password>, run the command that is appropriate for your container registry.

3

Build and deploy the plugin's container

Run the following commands, one command at a time, to build the plugin’s container, deploy it to your container registry, and make the plugin available for use in the staging version of your self-hosted Unstructured UI:

make docker-build
make docker-push
make publish-plugin
make promote-plugin-to-staging

Test the plugin in your UI

1

Test the plugin in staging

  1. Sign in to the staging version of your self-hosted Unstructured UI.
  2. Create a new workflow or open an existing workflow.
  3. In the workflow’s visual DAG designer, click the + icon anywhere between a Chunker node and a Destination node, and select Plugins > Sentiment Analysis.
  4. Click the Sentiment Analysis node to open its settings pane.
  5. In the settings pane, enter the required settings for the plugin. For example, enter the location of the VertexAI API, the single-line version of the credentials.json file’s contents for accessing the VertexAI API, and the model for VertexAI to use.
  6. Run the workflow.
  7. When the workflow is finished, go to the destination location, and look for the toxicity, emotion, and intent values that the plugin adds to the metadata field for each element that Unstructured generated based on the source files’ contents.
2

Make any changes to the plugin

If you need to make any changes to the plugin, you can do so by returning to the previous section titled Write the plugin.

Make the necessary code changes and then:

  1. Run plugin tests locally with pytest.
  2. Run the plugin in Docker locally.
  3. Increment the plugin’s version number. To do this, in the project’s src/plugin_sentiment_analysis/__init__.py file, update the value of version in the PLUGIN_MANIFEST variable, for example from 0.0.1 to 0.0.2. Then save this file.
  4. Deploy the plugin again to the staging version of your self-hosted Unstructured UI.
  5. Test the updated plugin again in staging.

Keep repeating this loop until you are satisfied with the plugin’s performance in staging.

3

Promote the plugin to production

After you have tested the plugin in your staging UI and are satisfied with its performance, you can promote it from staging to production. To do this, run the following command:

make promote-plugin-to-production

Of course, you should immediately sign in to the production version of your self-hosted Unstructured UI and test the plugin there before you start advertising its availability to your users.

Congratulations! You have successfully created, tested, and deployed your first custom plugin into your self-hosted Unstructured UI that your users can now add to their workflow DAGs to unlock new capabilities and insights for their files and data!