Extractor interface

The Extractor interface allows you to extract the content of a document and/or enumerate its sub-documents, such as email attachments and ZIP archives.

To obtain this interface, call the DocumentFilters.GetExtractor method. The Extractor interface contains the following methods and properties.

Extractor::Close method

The Close method releases the document resources referenced by this Extractor object.

Extractor::Compare method

The Compare method compares two documents and returns the differences.

Extractor::CopyTo method

The CopyTo method extracts the binary content of the sub-document to a file.

Extractor::EOF property

The EOF property is only valid for documents where the SupportsText property is TRUE. It is set to TRUE when no more text can be extracted from the document with calls to GetText. To re-read the document, call Close and then Open.
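The read-until-EOF pattern described above can be sketched as follows. This is a minimal illustration, not the Document Filters binding itself: FakeExtractor is a hypothetical stand-in written for this example, and the GetText(max_length) signature and chunk size are assumptions.

```python
class FakeExtractor:
    """Hypothetical stand-in that mimics only Open, Close, GetText and EOF."""

    def __init__(self, text):
        self._source = text
        self._pos = 0
        self.EOF = False

    def Open(self):
        # Start reading from the beginning of the document.
        self._pos = 0
        self.EOF = len(self._source) == 0

    def Close(self):
        # Release the (simulated) document resources.
        self._pos = 0
        self.EOF = False

    def GetText(self, max_length):
        # Return the next portion of text and update EOF (assumed signature).
        chunk = self._source[self._pos:self._pos + max_length]
        self._pos += len(chunk)
        self.EOF = self._pos >= len(self._source)
        return chunk


def read_all_text(extractor, chunk_size=4):
    """Call GetText repeatedly until the extractor reports EOF."""
    parts = []
    while not extractor.EOF:
        parts.append(extractor.GetText(chunk_size))
    return "".join(parts)


doc = FakeExtractor("hello world")
doc.Open()
first_pass = read_all_text(doc)

# Re-reading requires Close followed by Open, as noted above.
doc.Close()
doc.Open()
second_pass = read_all_text(doc)
```

With the real interface, the same loop shape would apply only when SupportsText is TRUE.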

Extractor::FileType property

The FileType property is the document format code, as listed in the Document Format Codes chart. The property is overloaded so that it can also return the format name as a string.

Extractor::GetFirstImage method

The GetFirstImage method obtains a SubFile object representing the first embedded image of the current document when converting using classic HTML.

Extractor::GetFirstPage method

The GetFirstPage method returns the first page object of an opened document. The document must be opened in image mode (IGR_FORMAT_IMAGE).

Extractor::GetFirstSubFile method

The GetFirstSubFile method obtains a SubFile object representing the first sub-document of the current document.

Extractor::GetHashMD5 method

The GetHashMD5 method obtains a string representing the calculated MD5 hash of the current document, for unique identification.

Extractor::GetHashSHA1 method

The GetHashSHA1 method obtains a string representing the calculated SHA-1 hash of the current document, for unique identification.
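To show what a hash string used for unique identification looks like, the sketch below computes MD5 and SHA-1 digests with Python's standard hashlib module. It is an assumption that the hashes are taken over the raw document bytes and returned as hexadecimal strings; the source does not state how Document Filters computes or formats them, and document_digests is a hypothetical helper.

```python
import hashlib


def document_digests(data: bytes) -> dict:
    """Return hex MD5 and SHA-1 digests of the given document bytes."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }


original = document_digests(b"hello")
duplicate = document_digests(b"hello")
different = document_digests(b"hello!")

# Identical content yields identical digests; any change yields new ones.
same_content = original == duplicate
changed_content = original != different
```

Digests like these are typically used as lookup keys to detect documents that have already been processed.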

Extractor::GetNextImage method

The GetNextImage method obtains a SubFile object representing the next embedded image of the current document when converting using classic HTML.

Extractor::GetNextPage method

The GetNextPage method returns the next page object of an opened document. The document must be opened in image mode (IGR_FORMAT_IMAGE).

Extractor::GetNextSubFile method

The GetNextSubFile method obtains a SubFile object representing the next sub-document of the current document.
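The first/next pairing of GetFirstSubFile and GetNextSubFile suggests a simple cursor loop, sketched below. FakeArchive is a hypothetical stand-in, plain strings stand in for SubFile objects, and returning None once the sub-documents are exhausted is an assumption rather than documented behavior.

```python
class FakeArchive:
    """Hypothetical stand-in that mimics GetFirstSubFile and GetNextSubFile."""

    def __init__(self, names):
        self._names = list(names)
        self._index = 0

    def GetFirstSubFile(self):
        # Rewind the cursor, then behave like GetNextSubFile.
        self._index = 0
        return self.GetNextSubFile()

    def GetNextSubFile(self):
        # Assumed convention: None signals that no sub-documents remain.
        if self._index >= len(self._names):
            return None
        name = self._names[self._index]
        self._index += 1
        return name


def list_sub_files(extractor):
    """Walk every sub-document using the first/next pattern."""
    found = []
    sub_file = extractor.GetFirstSubFile()
    while sub_file is not None:
        found.append(sub_file)
        sub_file = extractor.GetNextSubFile()
    return found


names = list_sub_files(FakeArchive(["readme.txt", "logo.png"]))
empty = list_sub_files(FakeArchive([]))
```

The same loop shape would apply to GetFirstImage/GetNextImage and GetFirstPage/GetNextPage, under the same assumption about the end-of-sequence value.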

Extractor::GetPage method

The GetPage method returns the page at the given index, where the page index is 0-based. An exception is raised if the index is invalid.

Extractor::GetPageCount method

Returns the number of pages in the current document. The document must be opened in image mode for the page count to be populated.
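Because GetPage uses a 0-based index, the last valid index is GetPageCount() minus one. The sketch below illustrates this with a hypothetical FakePagedDocument; the use of IndexError is an assumption, since the source says only that an exception is raised for an invalid index.

```python
class FakePagedDocument:
    """Hypothetical stand-in that mimics GetPage and GetPageCount."""

    def __init__(self, pages):
        self._pages = list(pages)

    def GetPageCount(self):
        return len(self._pages)

    def GetPage(self, index):
        # Page indexes are 0-based; anything outside the range is invalid.
        if index < 0 or index >= len(self._pages):
            raise IndexError(f"invalid page index: {index}")
        return self._pages[index]


doc = FakePagedDocument(["cover", "contents", "chapter 1"])
count = doc.GetPageCount()

first_page = doc.GetPage(0)
last_page = doc.GetPage(count - 1)

# Using the page count itself as an index is one past the end.
try:
    doc.GetPage(count)
    out_of_range_raised = False
except IndexError:
    out_of_range_raised = True
```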

Extractor::GetRootBookmark method

The GetRootBookmark method returns a Bookmark node representing the top-most node of the bookmark hierarchy. The root bookmark has only Children data; it has no title or destination properties.

Extractor::GetSubFile method

The GetSubFile method obtains a SubFile object representing the nominated sub-file of the current document.

Extractor::GetText method

The GetText method extracts the next portion of text content from the document.

Extractor::Images property

The Images property provides an enumerable collection of SubFile objects representing the embedded images of the current document when converting using classic HTML.

Extractor::Localize property

Utility property that allows metadata to be localized without providing a callback. Any localization options must be set before calling Open.

Extractor::MimeType property

Returns the MIME type of the file.

Extractor::Open method

The Open method opens a document for processing.

Extractor::PageCount property

Returns the number of pages in the current document. The document must be opened in image mode for the page count to be populated.

Extractor::Pages property

The Pages property provides an enumerable collection of pages for an opened document. The document must be opened in image mode (IGR_FORMAT_IMAGE).

Extractor::SaveTo method

The SaveTo method extracts the entire text content of the document in a single call. The text may be saved to a file with the given name or via an instance of an IStream (COM) object.

Extractor::SubFiles property

Returns an enumerable set of SubFiles.

Extractor::getFileType method

The getFileType method returns extended information about the file type.

Extractor::getSupportsHTML method

The getSupportsHTML method returns TRUE if the document can be converted to classic HTML.

Extractor::getSupportsSubFiles property

The getSupportsSubFiles property is TRUE if the document is a compound or archive document, potentially with sub-documents.

Extractor::getSupportsText method

The getSupportsText method returns TRUE if text content can be extracted from the document. It must return TRUE before the Extractor::SaveTo and Extractor::GetText methods can be called.
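The three capability checks can be used together to decide how to process a document before calling any extraction method. The sketch below is illustrative only: FakeCapabilities and plan_processing are hypothetical names, and the real members may be exposed as properties rather than methods depending on the language binding.

```python
class FakeCapabilities:
    """Hypothetical stand-in exposing only the three capability checks."""

    def __init__(self, text, html, sub_files):
        self._text = text
        self._html = html
        self._sub_files = sub_files

    def getSupportsText(self):
        return self._text

    def getSupportsHTML(self):
        return self._html

    def getSupportsSubFiles(self):
        return self._sub_files


def plan_processing(extractor):
    """List the operations that the capability checks permit."""
    steps = []
    if extractor.getSupportsText():
        # Only now is it valid to call GetText or SaveTo.
        steps.append("extract text")
    if extractor.getSupportsHTML():
        steps.append("convert to classic HTML")
    if extractor.getSupportsSubFiles():
        steps.append("enumerate sub-documents")
    return steps


archive_plan = plan_processing(FakeCapabilities(False, False, True))
email_plan = plan_processing(FakeCapabilities(True, True, True))
```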