
Extracting text from any file is harder than it looks. Extracting formatting is even harder.

Ben Truscott
Document Filters Principal Engineer
Corey Kidd
(Frm) Product Owner

Backdrop

This post was originally hosted on the Stack Overflow Blog.

We take document processing for granted at an individual scale: double-click a file (or run a simple command) and its contents are displayed. It gets more complicated at scale. Imagine you’re a recruiter searching resumes for keywords or a paralegal looking for names in thousands of pages of discovery documents. The formats, versions, and platforms that generated those files could be wildly different. The challenge is even greater when the work is time-sensitive, for example if you have to scan all outgoing emails for leaks of personally identifiable information (PII), or give patients a single file that contains all of their disclosure agreements, scanned documents, and MRI, X-ray, and test reports, regardless of the original file formats.