Extractor interface

The Extractor interface allows you to extract the content of a document and/or enumerate its sub-documents, such as email attachments or the entries of a ZIP archive.

To obtain this interface, call the DocumentFilters.GetExtractor method. The Extractor interface contains the following methods and properties.

Extractor::ApproveExternalResourceCallback property	Set a callback hook called when an embedded image is to be downloaded.
Extractor::Close method	The Close method releases the document resources referenced by this Extractor object.
Extractor::Compare method	The Compare method allows you to compare two documents returning the differences.
Extractor::CopyTo method	The CopyTo method extracts the binary content of the sub-document to a file.
Extractor::EOF property	The EOF property is only valid for documents where the SupportsText property is TRUE. The EOF property will be set to TRUE when no more text can be extracted from the document with calls to GetText. If the document needs to be re-read, call Close and Open first.
Extractor::FileType property	The FileType property is the document format code, as listed in the Document Format Codes chart. The function is overloaded so that it can also return the format name as a string.
Extractor::GetFirstImage method	The GetFirstImage method obtains a SubFile object representing the first embedded image of the current document when converting using classic HTML.
Extractor::GetFirstPage method	The GetFirstPage method returns the first page object of an opened document. The document must be opened in image mode (IGR_FORMAT_IMAGE).
Extractor::GetFirstSubFile method	The GetFirstSubFile method obtains a SubFile object representing the first sub-document of the current document.
Extractor::GetHashMD5 method	The GetHashMD5 method obtains a string representing the calculated MD5 hash of the current document for unique identification.
Extractor::GetHashSHA1 method	The GetHashSHA1 method obtains a string representing the calculated SHA-1 hash of the current document for unique identification.
Extractor::GetNextImage method	The GetNextImage method obtains a SubFile object representing the next embedded image of the current document when converting using classic HTML.
Extractor::GetNextPage method	The GetNextPage method returns the next page object of an opened document. The document must be opened in image mode (IGR_FORMAT_IMAGE).
Extractor::GetNextSubFile method	The GetNextSubFile method obtains a SubFile object representing the next sub-document of the current document.
Extractor::GetPage method	The GetPage method returns the page at the given index, where the page index is 0-based. An exception is raised if the index is invalid.
Extractor::GetPageCount method	Returns the number of pages in the current document. The document must be opened in image mode for the page count to be populated.
Extractor::GetResourceStreamCallback property	Set a callback hook called when rendering an embedded image.
Extractor::GetRootBookmark method	The GetRootBookmark method returns a Bookmark node representing the top-most node of the bookmark hierarchy. The root bookmark only has Children data, it has no title or destination properties.
Extractor::GetSubFile method	The GetSubFile method obtains a SubFile object representing the nominated sub-file of the current document.
Extractor::GetText method	The GetText method extracts the next portion of text content from the document.
Extractor::HeartbeatCallback property	Set a callback hook called when a heartbeat is sent.
Extractor::Images property	The Images property provides an enumerable collection of SubFile objects representing the embedded images of the current document when converting using classic HTML.
Extractor::Localize property	Utility function that allows for localization of metadata without providing a callback. Any localization options must be set before an `.Open` call.
Extractor::LocalizeCallback property	Set a callback hook called when a string is to be localized.
Extractor::LogLevelCallback property	Set a callback hook called when a log level is requested for a module.
Extractor::LogMessageCallback property	Set a callback hook called when a log message is sent.
Extractor::MimeType property	Returns the MIME type of the file.
Extractor::OcrImageCallback property	Set a callback hook called when OCRing an image.
Extractor::Open method	The Open method opens a document for processing.
Extractor::PageCount property	Returns the number of pages in the current document. The document must be opened in image mode for the page count to be populated.
Extractor::Pages property	The Pages property provides an enumerable collection of pages for an opened document. The document must be opened in image mode (IGR_FORMAT_IMAGE).
Extractor::PasswordCallback property	Set a callback hook called when a password is required.
Extractor::SaveTo method	The SaveTo method extracts the entire text content of the document in a single call. The text may be saved to a file with the given name or via an instance of an IStream (COM) object.
Extractor::SubFiles property	Returns an enumerable set of SubFiles.
Extractor::getFileType method	The getFileType method allows extended information about the file type to be returned.
Extractor::getSupportsHTML method	The getSupportsHTML method returns TRUE if the document can be converted to classic HTML.
Extractor::getSupportsSubFiles property	The getSupportsSubFiles property is TRUE if the document is a compound or archive document, potentially with sub-documents.
Extractor::getSupportsText method	The getSupportsText method returns TRUE if text content can be extracted from the document. It must return TRUE before the Extractor::SaveTo and Extractor::GetText methods can be called.
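
The two iteration patterns implied by the table above (reading text until EOF is TRUE, and walking sub-documents with GetFirstSubFile and GetNextSubFile) can be sketched as follows. This is only an illustration of the calling pattern, not the real binding: `FakeExtractor` and `FakeSubFile` are hypothetical in-memory stand-ins, and the parameter to `GetText`, the `name` attribute, and the use of `None` to signal the end of the sub-file list are all assumptions made for this example.

```python
class FakeSubFile:
    """Hypothetical stand-in for a SubFile; it only carries a name."""

    def __init__(self, name):
        self.name = name


class FakeExtractor:
    """Hypothetical in-memory stand-in, NOT the real Extractor binding."""

    def __init__(self, text, subfile_names=()):
        self._text = text
        self._subfiles = [FakeSubFile(n) for n in subfile_names]
        self._pos = 0
        self._sub_index = 0
        self._is_open = False

    def Open(self):
        # Opening resets both read positions, so the document can be re-read.
        self._pos = 0
        self._sub_index = 0
        self._is_open = True

    def Close(self):
        self._is_open = False

    def getSupportsText(self):
        return True

    @property
    def EOF(self):
        # True once no more text can be returned by GetText.
        return self._pos >= len(self._text)

    def GetText(self, max_length):
        # Assumed signature: return at most max_length characters.
        chunk = self._text[self._pos:self._pos + max_length]
        self._pos += len(chunk)
        return chunk

    def GetFirstSubFile(self):
        self._sub_index = 0
        return self.GetNextSubFile()

    def GetNextSubFile(self):
        # Assumption: None marks the end of the sub-file list.
        if self._sub_index >= len(self._subfiles):
            return None
        sub = self._subfiles[self._sub_index]
        self._sub_index += 1
        return sub


def read_all_text(extractor, chunk_size=4096):
    """Collect text in chunks until the extractor reports EOF."""
    if not extractor.getSupportsText():
        return ""
    parts = []
    while not extractor.EOF:
        parts.append(extractor.GetText(chunk_size))
    return "".join(parts)


def list_subfile_names(extractor):
    """Walk sub-documents with the first/next pattern."""
    names = []
    sub = extractor.GetFirstSubFile()
    while sub is not None:
        names.append(sub.name)
        sub = extractor.GetNextSubFile()
    return names
```

With the real interface, the same loops would presumably run against the object returned by DocumentFilters.GetExtractor after a call to Open; the exact argument lists should be taken from the individual method pages rather than from this sketch.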