Using the Data Curation API
Using the API
There are two ways to call the Data Curation API: run a Node.js upload script, or send the HTTP requests directly.
Prerequisites
Regardless of which method you choose, you will need the following information to call the API.
- Token endpoint for your OAuth instance
- Client Id and Client Secret for authentication
- API endpoint for your Data Curation API instance
Using the Upload Script
This section explains how to configure and use upload.js to make calls to the Data Curation API. You will need Node.js (https://nodejs.org/en/download) installed on your machine to run the script.
See here for copyable samples and more information.
Using HTTP Request Tool
This section explains how to configure and use a tool like Bruno (https://www.usebruno.com/) to make calls to the Data Curation API. The full flow consists of five sequential HTTP requests, one of which (the status check) is optional.
1. Getting an Access Token
Set up a POST request to the token endpoint.
Set the body type to form URL encoded and add the following values.
Key | Value |
---|---|
grant_type | urn:hyland:params:oauth:grant-type:api-credentials |
scope | environment_authorization |
client_id | Insert your client_id value here. |
client_secret | Insert your client_secret value here. |
Sending the request should return a response similar to:
{
"access_token": "jwt.access.token",
"expires_in": 900,
"token_type": "Bearer",
"scope": "environment_authorization"
}
Save the access_token field for use in the next request.
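For readers who prefer code to an HTTP tool, the token request above can be sketched in Node.js as follows. This is a minimal sketch, not the official upload script: it assumes Node.js 18 or later (where fetch is available globally), and the function names buildTokenRequest and getAccessToken are hypothetical.

```javascript
// Sketch only: assumes Node.js 18+ (global fetch). Function names are hypothetical.

// Build the form-encoded token request from the values in the table above.
function buildTokenRequest(tokenEndpoint, clientId, clientSecret) {
  const form = new URLSearchParams({
    grant_type: "urn:hyland:params:oauth:grant-type:api-credentials",
    scope: "environment_authorization",
    client_id: clientId,
    client_secret: clientSecret,
  });
  return {
    url: tokenEndpoint,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/x-www-form-urlencoded" },
      body: form.toString(),
    },
  };
}

// Send the request and return the access_token field of the JSON response.
async function getAccessToken(tokenEndpoint, clientId, clientSecret) {
  const { url, init } = buildTokenRequest(tokenEndpoint, clientId, clientSecret);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`Token request failed with HTTP ${response.status}`);
  }
  const json = await response.json();
  return json.access_token;
}
```

Because the example response reports expires_in as 900 seconds, a long-running script would likely need to request a fresh token from time to time.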
2. Calling the Presign endpoint
Create a POST request to the /presign endpoint of the Data Curation API.
Pass the access_token as a bearer token in the Authorization header.
The URL will look similar to this:
https://abcdefghij.execute-api.us-east-1.amazonaws.com/api/presign
You can optionally provide a set of options in the body of the request. If no options are provided, the default values (shown in the example body below) are used.
{
"normalization": {
"quotations": true
},
"chunking": true,
"embedding": true
}
After sending the request you should get a response similar to:
{
"job_id": "6e1bb8a0-2bc3-43a2-b3a6-e87975799c8d",
"put_url": "https://data-curation-api-dev-step-1-drop.s3.amazonaws.com/ABCXYZ",
"get_url": "https://data-curation-api-dev-step-3-results.s3.amazonaws.com/ABCXYZ"
}
Save all of these values for the following requests.
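The presign call can be sketched in Node.js along the same lines. This is an illustrative sketch under a few assumptions: Node.js 18 or later for the global fetch, an apiEndpoint value that is the base URL ending in /api as in the example URL, and an options body sent as JSON with a Content-Type of application/json. The function names are hypothetical.

```javascript
// Sketch only: assumes Node.js 18+ (global fetch). Function names are hypothetical.

// Build the POST /presign request. `apiEndpoint` is assumed to be the base URL
// of the Data Curation API instance (the part before "/presign").
function buildPresignRequest(apiEndpoint, accessToken, options) {
  const headers = { Authorization: `Bearer ${accessToken}` };
  const init = { method: "POST", headers };
  if (options !== undefined) {
    // Assumption: the optional settings are sent as a JSON body.
    headers["Content-Type"] = "application/json";
    init.body = JSON.stringify(options);
  }
  // Strip trailing slashes so the path is joined with exactly one "/".
  return { url: `${apiEndpoint.replace(/\/+$/, "")}/presign`, init };
}

// Send the request and return the parsed response: { job_id, put_url, get_url }.
async function presign(apiEndpoint, accessToken, options) {
  const { url, init } = buildPresignRequest(apiEndpoint, accessToken, options);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`Presign request failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

Calling presign(apiEndpoint, token) without a third argument sends no body, which relies on the service applying its defaults; passing an object such as { chunking: true, embedding: false } overrides only what you need.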
For more information about Presigned URLs, please see the Amazon Documentation.
3. Upload the file to AWS S3
Create a PUT request to the put_url returned by the /presign request.
The Content-Type header must be application/octet-stream.
If you are using Bruno, note that it does not support sending files; you will have to attach the file through a script, such as:
// Bruno script: read the file from disk and send it as the raw request body.
const fs = require("fs");
const attachmentFilename = "C:\\path\\to\\file.pdf"; // path to the file to upload
const attachment = fs.readFileSync(attachmentFilename); // Buffer holding the file bytes
const attachmentLength = attachment.length;
req.setHeader("Content-Type", "application/octet-stream");
req.setHeader("Content-Length", attachmentLength);
req.setBody(attachment);
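Outside of Bruno, the same upload can be sketched as a plain Node.js function. This is a minimal sketch assuming Node.js 18 or later for the global fetch; the names buildUploadRequest and uploadFile are hypothetical. No bearer token is added because the steps above only require the Content-Type header for this request.

```javascript
// Sketch only: assumes Node.js 18+ (global fetch). Function names are hypothetical.
const fs = require("fs");

// Build the PUT request that sends the raw file bytes to the presigned URL.
function buildUploadRequest(putUrl, fileBytes) {
  return {
    url: putUrl,
    init: {
      method: "PUT",
      headers: { "Content-Type": "application/octet-stream" },
      body: fileBytes, // fetch derives Content-Length from the buffer size
    },
  };
}

// Read the file from disk and upload it; throws if the response is not 2xx.
async function uploadFile(putUrl, filePath) {
  const fileBytes = await fs.promises.readFile(filePath);
  const { url, init } = buildUploadRequest(putUrl, fileBytes);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`Upload failed with HTTP ${response.status}`);
  }
}
```

A call would look like uploadFile(presignResponse.put_url, "C:\\path\\to\\file.pdf"), where presignResponse is the object returned by the /presign request.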
4. (Optional) Call the Status endpoint
The /status endpoint can be polled to get the current status of the file in the pipeline.
To call it, create a GET request to the /status endpoint of the API.
This endpoint requires the access_token as a bearer token in the Authorization header and takes the job_id as the last segment of the URL path.
The request URL should look similar to this:
https://abcdefghij.execute-api.us-east-1.amazonaws.com/api/status/6e1bb8a0-2bc3-43a2-b3a6-e87975799c8d
The initial status for a file is Wait For Upload.
{
"jobId": "6e1bb8a0-2bc3-43a2-b3a6-e87975799c8d",
"status": "Wait For Upload"
}
Finished results will have a status of Done.
{
"jobId": "6e1bb8a0-2bc3-43a2-b3a6-e87975799c8d",
"status": "Done"
}
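Polling the status endpoint can be sketched as a small loop. This is an illustrative sketch assuming Node.js 18 or later for the global fetch; the function names, the 5-second interval, and the limit of 60 attempts are arbitrary assumptions. Only the two statuses documented above are known here, so the loop simply waits for Done and gives up after the attempt limit.

```javascript
// Sketch only: assumes Node.js 18+ (global fetch). Names and timings are hypothetical.

// Build the GET /status/{job_id} request with the bearer token.
function buildStatusRequest(apiEndpoint, accessToken, jobId) {
  const base = apiEndpoint.replace(/\/+$/, "");
  return {
    url: `${base}/status/${encodeURIComponent(jobId)}`,
    init: {
      method: "GET",
      headers: { Authorization: `Bearer ${accessToken}` },
    },
  };
}

// True once the status response reports the documented final status.
function isDone(statusBody) {
  return statusBody.status === "Done";
}

// Poll until the job is Done, waiting `intervalMs` between checks.
async function waitUntilDone(apiEndpoint, accessToken, jobId, intervalMs = 5000, maxAttempts = 60) {
  const { url, init } = buildStatusRequest(apiEndpoint, accessToken, jobId);
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await fetch(url, init);
    if (!response.ok) {
      throw new Error(`Status request failed with HTTP ${response.status}`);
    }
    const body = await response.json();
    if (isDone(body)) {
      return body;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${jobId} was not Done after ${maxAttempts} checks`);
}
```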
5. Download the results from AWS S3
The final request is the simplest: create a GET request to the get_url returned by the /presign request.
If the results are available, you should get a JSON response containing the text from the file, resembling the example below. Note: the example includes extra line breaks for readability; the actual JSON response does not.
{
"markdown": {
"output": "Document Text",
"chunks_with_embeddings": [
{
"chunk": "Chunk Text",
"embeddings": [
-0.042955305427312851, 0.077558189630508423, 0.0026660626754164696
]
}
]
},
"json": {
"output": {
"type": "root",
"children": [
{
"type": "paragraph",
"children": [
{
"type": "text",
"value": "Document Text"
}
]
}
]
}
}
}
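Downloading and reading a result of this shape can be sketched as follows. This is a minimal sketch assuming Node.js 18 or later for the global fetch; downloadResult and summarizeResult are hypothetical names. It also assumes that markdown.chunks_with_embeddings or the embeddings arrays may be absent when chunking or embedding is turned off, which the example above does not confirm.

```javascript
// Sketch only: assumes Node.js 18+ (global fetch). Function names are hypothetical.

// Fetch the result JSON from the presigned get_url.
async function downloadResult(getUrl) {
  const response = await fetch(getUrl);
  if (!response.ok) {
    throw new Error(`Download failed with HTTP ${response.status}`);
  }
  return response.json();
}

// Pull out the document text and, for each chunk, its text and vector length.
function summarizeResult(result) {
  // Assumption: these fields may be missing when chunking/embedding is disabled.
  const chunks = result.markdown?.chunks_with_embeddings ?? [];
  return {
    text: result.markdown?.output ?? "",
    chunks: chunks.map((entry) => ({
      chunk: entry.chunk,
      dimensions: (entry.embeddings ?? []).length,
    })),
  };
}
```

For instance, summarizeResult(await downloadResult(presignResponse.get_url)) would return the extracted text alongside one entry per chunk, where presignResponse is the object returned by the /presign request.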