Using the Data Curation API

Using the API

There are two ways to call the Data Curation API: run the Node.js upload script, or send the HTTP requests directly.

Prerequisites

Regardless of which method you choose, you will need the following information to call the API.

  1. Token endpoint for your OAuth instance
  2. Client Id and Client Secret for authentication
  3. API endpoint for your Data Curation API instance
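One way to keep these values out of source code is to read them from environment variables. The sketch below is only an illustration: the variable names (DC_TOKEN_ENDPOINT, DC_CLIENT_ID, DC_CLIENT_SECRET, DC_API_ENDPOINT) and the loadConfig helper are hypothetical and are not defined by the Data Curation API.

```javascript
// Hypothetical helper: collect the four prerequisite values from environment
// variables and fail early if any of them is missing.
// The variable names below are assumptions made for this example.
function loadConfig(env = process.env) {
  const required = {
    tokenEndpoint: "DC_TOKEN_ENDPOINT",
    clientId: "DC_CLIENT_ID",
    clientSecret: "DC_CLIENT_SECRET",
    apiEndpoint: "DC_API_ENDPOINT",
  };

  const config = {};
  const missing = [];
  for (const [key, envName] of Object.entries(required)) {
    if (env[envName]) {
      config[key] = env[envName];
    } else {
      missing.push(envName);
    }
  }

  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return config;
}
```

Calling loadConfig() with no arguments reads from process.env; passing a plain object instead makes the helper easy to exercise in isolation.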

Using the Upload Script

This section explains how to configure and use upload.js to make calls to the Data Curation API. You will need Node.js (https://nodejs.org/en/download) installed on your machine to run the script.

See here for copyable samples and more information.

Using HTTP Request Tool

This section explains how to configure and use a tool like Bruno (https://www.usebruno.com/) to make calls to the Data Curation API. Calling the API and retrieving the results takes five sequential HTTP requests, one of which is optional.

1. Getting an Access Token

Set up a POST request to the token endpoint. Set the body type to form URL encoded (application/x-www-form-urlencoded) and add the following values.

Key             Value
grant_type      urn:hyland:params:oauth:grant-type:api-credentials
scope           environment_authorization
client_id       Insert your client_id value here.
client_secret   Insert your client_secret value here.

Send the request. The response should look similar to:

{
"access_token": "jwt.access.token",
"expires_in": 900,
"token_type": "Bearer",
"scope": "environment_authorization"
}

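Save the value of the access_token field for use in the following requests. As is standard for OAuth 2.0 token responses, expires_in is given in seconds, so the token in this example is valid for 15 minutes.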
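Outside of an HTTP tool, the same token request can be sketched in Node.js. This is a minimal example rather than the contents of upload.js; it assumes Node.js 18 or later (for the built-in fetch), and the helper names buildTokenBody and getAccessToken are made up for illustration.

```javascript
// Build the form URL encoded body from the key/value pairs listed above.
function buildTokenBody(clientId, clientSecret) {
  return new URLSearchParams({
    grant_type: "urn:hyland:params:oauth:grant-type:api-credentials",
    scope: "environment_authorization",
    client_id: clientId,
    client_secret: clientSecret,
  });
}

// Hypothetical helper: POST the body to the token endpoint and return the
// access_token string from the JSON response.
async function getAccessToken(tokenEndpoint, clientId, clientSecret) {
  const response = await fetch(tokenEndpoint, {
    method: "POST",
    // A URLSearchParams body makes fetch send
    // Content-Type: application/x-www-form-urlencoded automatically.
    body: buildTokenBody(clientId, clientSecret),
  });
  if (!response.ok) {
    throw new Error(`Token request failed with HTTP ${response.status}`);
  }
  const json = await response.json();
  return json.access_token;
}
```

A call such as await getAccessToken(tokenEndpoint, clientId, clientSecret) would then yield the token to pass as a bearer token in later requests.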

2. Calling the Presign endpoint

Create a POST request to the /presign endpoint of the Data Curation API. Pass the access_token as a bearer token in the Authorization header.

The URL will look similar to this:

https://abcdefghij.execute-api.us-east-1.amazonaws.com/api/presign

You can optionally provide a set of options in the body of the request. If no options are provided, the default values shown in the example body below are used.

{
"normalization": {
"quotations": true
},
"chunking": true,
"embedding": true
}

After sending the request you should get a response similar to:

{
"job_id": "6e1bb8a0-2bc3-43a2-b3a6-e87975799c8d",
"put_url": "https://data-curation-api-dev-step-1-drop.s3.amazonaws.com/ABCXYZ",
"get_url": "https://data-curation-api-dev-step-3-results.s3.amazonaws.com/ABCXYZ"
}

Save all of these values for the following requests.

For more information about Presigned URLs, please see the Amazon Documentation.
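The presign call could be scripted along the following lines. This is a sketch under a few assumptions: Node.js 18 or later, an apiEndpoint value that already ends in /api as in the example URL, and a JSON content type for the options body (the source does not state the content type). The helper names are hypothetical.

```javascript
// Default options, copied from the example request body in this section.
const DEFAULT_OPTIONS = {
  normalization: { quotations: true },
  chunking: true,
  embedding: true,
};

// Assemble the URL and fetch options for the /presign call.
function buildPresignRequest(apiEndpoint, accessToken, options = DEFAULT_OPTIONS) {
  return {
    // Strip trailing slashes so the path is joined with exactly one "/".
    url: `${apiEndpoint.replace(/\/+$/, "")}/presign`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${accessToken}`,
        // Assumption: the options are sent as JSON.
        "Content-Type": "application/json",
      },
      body: JSON.stringify(options),
    },
  };
}

// Hypothetical helper: send the request and return the parsed response,
// which should contain job_id, put_url, and get_url.
async function presign(apiEndpoint, accessToken, options) {
  const { url, init } = buildPresignRequest(apiEndpoint, accessToken, options);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`Presign request failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

Keeping the request construction separate from the network call makes it possible to inspect the URL, headers, and body before anything is sent.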

3. Upload the file to AWS S3

Create a PUT request to the put_url returned by the /presign request.

The Content-Type header must be application/octet-stream. If you are using Bruno, note that it does not support sending files directly; attach the file through a script such as the one below.

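const fs = require("fs");

// Read the whole file into memory as a Buffer. Replace the path with your own file.
const attachmentFilename = "C:\\path\\to\\file.pdf";
const attachment = fs.readFileSync(attachmentFilename);
const attachmentLength = attachment.length; // size in bytes

// The upload must be sent as raw bytes.
req.setHeader("Content-Type", "application/octet-stream");
req.setHeader("Content-Length", attachmentLength);

// Use the file contents as the body of the PUT request.
req.setBody(attachment);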
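For comparison, the same upload can be sketched as a standalone Node.js function instead of a Bruno script. It assumes Node.js 18 or later; buildUploadRequest and uploadFile are illustrative names, not part of upload.js.

```javascript
const fs = require("fs");

// Build the fetch options for the upload: a PUT with the raw file bytes.
function buildUploadRequest(fileBuffer) {
  return {
    method: "PUT",
    headers: { "Content-Type": "application/octet-stream" },
    // fetch derives Content-Length from the Buffer, so it is not set here.
    body: fileBuffer,
  };
}

// Hypothetical helper: read the file from disk and PUT it to the presigned URL.
async function uploadFile(putUrl, filePath) {
  const fileBuffer = await fs.promises.readFile(filePath);
  const response = await fetch(putUrl, buildUploadRequest(fileBuffer));
  if (!response.ok) {
    throw new Error(`Upload failed with HTTP ${response.status}`);
  }
}
```

No Authorization header is added here because a presigned URL carries its own authorization in the query string.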

4. (Optional) Call the Status endpoint

The /status endpoint can be polled to get the current status of the file in the pipeline. To call it, create a GET request to the /status endpoint with the access_token as a bearer token in the Authorization header and the job_id appended to the URL path.

The request should look similar to this:

https://abcdefghij.execute-api.us-east-1.amazonaws.com/api/status/6e1bb8a0-2bc3-43a2-b3a6-e87975799c8d

The initial status for a file is Wait For Upload.

{
"jobId": "6e1bb8a0-2bc3-43a2-b3a6-e87975799c8d",
"status": "Wait For Upload"
}

Finished results will have a status of Done.

{
"jobId": "6e1bb8a0-2bc3-43a2-b3a6-e87975799c8d",
"status": "Done"
}
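A polling loop over the /status endpoint might look like the sketch below. It assumes Node.js 18 or later and an apiEndpoint that ends in /api as in the example URL. The helper names, the 5 second interval, and the 60 attempt limit are arbitrary choices for illustration; only the Wait For Upload and Done statuses are documented here, so any other status is simply treated as "not finished yet".

```javascript
// Build the status URL: the job id is appended to the /status path.
function buildStatusUrl(apiEndpoint, jobId) {
  return `${apiEndpoint.replace(/\/+$/, "")}/status/${encodeURIComponent(jobId)}`;
}

// A job is finished when its status is "Done".
function isDone(statusResponse) {
  return statusResponse.status === "Done";
}

// Hypothetical helper: poll until the job is Done or the attempts run out.
// With the defaults this polls for at most 5 minutes, which stays inside the
// 900 second token lifetime shown in the token response example.
async function waitUntilDone(apiEndpoint, accessToken, jobId, intervalMs = 5000, maxAttempts = 60) {
  const url = buildStatusUrl(apiEndpoint, jobId);
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await fetch(url, {
      headers: { Authorization: `Bearer ${accessToken}` },
    });
    if (!response.ok) {
      throw new Error(`Status request failed with HTTP ${response.status}`);
    }
    const status = await response.json();
    if (isDone(status)) {
      return status;
    }
    // Wait before asking again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${jobId} did not reach Done after ${maxAttempts} attempts`);
}
```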

5. Download the results from AWS S3

The final request is the simplest: create a GET request to the get_url. If the results are available, the response is JSON containing the text extracted from the file, resembling the example below. Note: For readability, the example has extra line breaks that the actual JSON response does not.

{
"markdown": {
"output": "Document Text",
"chunks_with_embeddings": [
{
"chunk": "Chunk Text",
"embeddings": [
-0.042955305427312851, 0.077558189630508423, 0.0026660626754164696
]
}
]
},
"json": {
"output": {
"type": "root",
"children": [
{
"type": "paragraph",
"children": [
{
"type": "text",
"value": "Document Text"
}
]
}
]
}
}
}
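Once downloaded, the response can be handled like any other JSON object. The sketch below assumes Node.js 18 or later and the response shape shown in the example above; downloadResults and summarizeResults are hypothetical helper names.

```javascript
// Hypothetical helper: GET the presigned get_url and parse the JSON body.
async function downloadResults(getUrl) {
  const response = await fetch(getUrl);
  if (!response.ok) {
    throw new Error(`Download failed with HTTP ${response.status}`);
  }
  return response.json();
}

// Pull out the document text and, for each chunk, its text and the length of
// its embedding vector. The chunk list is treated as optional in case
// chunking or embedding was turned off in the presign options.
function summarizeResults(results) {
  const chunks = results.markdown.chunks_with_embeddings ?? [];
  return {
    text: results.markdown.output,
    chunks: chunks.map((entry) => ({
      chunk: entry.chunk,
      dimensions: entry.embeddings.length,
    })),
  };
}
```

For the example response above, summarizeResults would report the text "Document Text" and a single chunk with a three-element embedding.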