Text extraction from PDFs can be a complex affair because the document format is designed for page layout. In addition, a single word may be made up of several text elements if different fonts are mixed.
I am trying to get a better understanding of how a PDF stores text. Generally speaking, when a PDF is produced by an application like MS Word (or in my case SQL Server Reporting Services), how is the text stored within the PDF? I would hope that the resulting file isn’t OCR’ed in this case, the way it would be if the original PDF had been created from an image.
My initial understanding of PDF was that it stored (PostScript) instructions on how to draw the “image” of the document on a page or a printer, and that no actual text was contained in the file itself. Consequently, I assumed that a text extractor would have to reverse-engineer those instructions to produce the text that the PDF would normally render.
A PDF contains a number of different kinds of objects, not just vector or raster drawing instructions. Text in particular is represented by text elements. Each of these holds a string of characters that should be drawn at a specific position using a specific font.
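As a rough illustration of what such text elements look like, here is a minimal sketch in Python. The content stream fragment is hand-written for this example and uncompressed (real streams are usually compressed and far messier), and the names `CONTENT_STREAM`, `TOKEN` and `text_elements` are made up here. `BT`/`ET`, `Tf`, `Td` and `Tj` are the standard PDF operators for beginning/ending a text object, selecting a font, moving the text position and showing a string.

```python
import re

# Hand-written fragment of an uncompressed page content stream (illustrative only).
CONTENT_STREAM = """
BT
/F1 12 Tf
72 720 Td
(Hel) Tj
/F2 12 Tf
(lo) Tj
0 -14 Td
(world) Tj
ET
"""

# Matches either a font selection ("/F1 12 Tf") or a simple show-text
# operation ("(Hel) Tj"). Escaped parentheses, hex strings and TJ arrays
# are deliberately not handled in this sketch.
TOKEN = re.compile(
    r"/(?P<font>\S+)\s+(?P<size>[\d.]+)\s+Tf"
    r"|\((?P<text>[^()]*)\)\s*Tj"
)

def text_elements(stream):
    """Return (font_name, string) pairs in the order they are drawn."""
    font = None
    elements = []
    for match in TOKEN.finditer(stream):
        if match.group("font"):
            font = match.group("font")  # remember the font now in effect
        else:
            elements.append((font, match.group("text")))
    return elements

elements = text_elements(CONTENT_STREAM)
joined = "".join(text for _, text in elements)
print(elements)  # [('F1', 'Hel'), ('F2', 'lo'), ('F2', 'world')]
print(joined)    # Helloworld
```

Note that "Hello" comes out as two elements because the font changes mid-word, and that naively joining the strings loses the line break implied by the `0 -14 Td` move. A real extractor has to infer spaces and line breaks from the positions.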
If you are lucky enough to deal with Tagged PDF files, such as those conforming to PDF/UA or to the accessible ("a") conformance levels of PDF/A, text extraction can be much easier because text spans are identified as such and a mapping to Unicode characters is defined.
Wikipedia doesn’t have the complete specification but does act as an intro:
Is it safe to say that, since a text element merely tells the rendering engine what to draw and where, this is the reason there is no context when you extract text from a PDF in C#? http://www.iditect.com/tutorial/search-text/
It can get worse than that: you may have a PDF with minimized font information, in which case you cannot even tell which Unicode or ANSI character corresponds to a particular PDF character code. It can also get better: you may have a tagged PDF, which may include paragraph, title and line information. In a general-purpose app, however, you cannot assume anything.
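To make the "worse" case concrete, here is a small sketch in Python. It assumes a font whose strings use two-byte character codes (simple fonts use one-byte codes instead), and both the codes and the `TO_UNICODE` table are invented for illustration. The table stands in for the kind of code-to-Unicode mapping a font's ToUnicode CMap can provide when it is present.

```python
# Hypothetical two-byte character codes, as they might appear in a hex
# string operand like <00010002000300030004> Tj for a subsetted font.
codes = bytes.fromhex("00010002000300030004")

# Hypothetical code-to-Unicode table of the kind a ToUnicode CMap supplies.
TO_UNICODE = {1: "H", 2: "e", 3: "l", 4: "o"}

def decode(raw, to_unicode=None):
    """Map two-byte character codes to text, or return None if no mapping exists."""
    if to_unicode is None:
        # Without a mapping the codes only identify glyphs inside the font;
        # the characters they stand for cannot be recovered from the string.
        return None
    values = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    # Unknown codes become U+FFFD (replacement character).
    return "".join(to_unicode.get(value, "\ufffd") for value in values)

with_map = decode(codes, TO_UNICODE)
without_map = decode(codes)
print(with_map)     # Hello
print(without_map)  # None
```

The page renders identically in both cases because the viewer only needs the glyphs. It is the extractor that is stuck when the mapping has been stripped.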
It may also be worth looking at the Text section of the PDF Reference if you really want to dig into how text works and how it is stored.