This class is used to convert one or more JPEG files into a single PDF document. This is a lossless process: JPEG image data is written directly into the output without being re-encoded. I found that there is a JpegDecoder in the Atalasoft software. To convert the images, you need a function similar to the PDF converter. Atalasoft DotImage Document Imaging is an SDK that offers high-speed document and image conversion, viewing and annotation on any device.


This article is in the Product Showcase section for our sponsors at CodeProject. These articles are intended to provide you with information on products and services that we consider useful and of value to developers.

From health records, tax forms, and insurance claims to old memos, magazines, and books, businesses are digitizing paper every day. With the advent of better search technology, having searchable text for all these documents is an obvious win.

A common way to do this is to use OCR (Optical Character Recognition) to translate the images to a document format that indexers already know. The drawback is that we often lose the layout, images and color of the original. In addition, since no OCR is perfect, we need the original image to be able to fix mistakes. What we want is a document format that looks like the original images when humans look at it, but that looks like plain text when the indexer looks at it.

And, when we copy from the image, we want text put on the clipboard.

Converting Scanned Document Images to Searchable PDFs with OCR – CodeProject

This is the promise of the searchable PDF. In a searchable PDF, the original scanned image is retained so any human can read the document. The textual content that is extracted via OCR is put behind the image so search indexers can see it and Acrobat Reader will let us select it as text. The ubiquity of desktop and enterprise search, ever-increasing OCR accuracy, and mass adoption of PDF are a powerful combination that makes searchable PDFs the ideal format to store digitized paper.
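To make the "text behind the image" idea concrete, here is a minimal Python sketch of a PDF page content stream in which the OCR text is drawn invisibly and the scanned image is painted over it. It relies on standard PDF operators (text rendering mode 3 means the text is neither filled nor stroked). It is only an illustration of the general technique, not Atalasoft's implementation: the function names, the resource names `Im0` and `F1`, and the word tuple layout are assumptions made for this example, and a real PDF also needs the surrounding objects (page, font, image XObject, cross-reference table).

```python
def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def page_content_stream(page_w, page_h, words, image_name="Im0", font_name="F1"):
    """Build the content stream for one page of a searchable PDF.

    words: list of (text, x, y, font_size) tuples, already in PDF points.
    image_name and font_name are hypothetical resource names that the page's
    resource dictionary would have to define.
    """
    ops = ["BT", "3 Tr"]  # begin text object; rendering mode 3 = invisible text
    for text, x, y, size in words:
        ops.append(f"/{font_name} {size:.2f} Tf")      # select font and size
        ops.append(f"1 0 0 1 {x:.2f} {y:.2f} Tm")      # place the word at (x, y)
        ops.append(f"({escape_pdf_string(text)}) Tj")  # emit the word
    ops.append("ET")
    # Paint the scanned image after the text, scaled to cover the whole page.
    ops += ["q", f"{page_w:.2f} 0 0 {page_h:.2f} 0 0 cm", f"/{image_name} Do", "Q"]
    return "\n".join(ops)


if __name__ == "__main__":
    print(page_content_stream(612, 792, [("Hello (world)", 72, 700, 12)]))
```

Because the text is invisible, a reader sees only the scan, while an indexer or a text selection tool sees the words. The sketch assumes single-byte text; other scripts would need a suitable font and encoding.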

This article will demonstrate just how simple it is to develop an application that generates these searchable PDFs from scanned documents. The resulting files can be indexed by Google, SharePoint, Microsoft desktop search, and other applications that index PDF documents.

To help build this application, Atalasoft publishes an OCR framework that simplifies working with industry-leading OCR engines and our own highly accurate engine, GlyphReader.

Atalasoft’s OCR framework includes a flexible Translator interface for producing output from the recognition process. For example, TextTranslator is available out of the box and generates a text stream, and a PDF translator generates a searchable PDF. Both outputs are “searchable”, but the latter includes the original image and is what we are going to use.

Shown here are lower-resolution images of the original scanned TIFF, a recent white paper from Atalasoft that was printed and then scanned in color.

Let’s start with a method that simply extracts the text into a file. First, we must create an ImageSource object to efficiently handle multi-page image files.


The resulting text file obviously does not look at all like the original document, but it does contain the text. It also isn’t stored in the same file as the image. We can do better.

To do this, we use the translator that produces a searchable PDF in place of the plain text output. The result is a high-quality searchable PDF! When the PDF is opened in Acrobat Reader (see screenshot below), all text in the document can be selected as real text, even though the visible part of this PDF is the actual color rasterized image.

Simply having this file on your filesystem will cause Google Desktop Search or Windows Desktop Search to index the document properly, while the document still looks exactly like the original. Adding searchable PDF generation to your own applications requires products from Atalasoft.

Be sure to request Evaluation Licenses for the required products. Attached to this article is the resulting PDF and C 2.

Bill Bither, 14 Dec

Using our framework, these steps are handled for you: decompress the image; pre-process the image to make OCR more accurate, including cleaning it or deskewing it; OCR the image to extract the text; and construct a PDF with the image and the extracted text, with each word accurately positioned behind the appropriate place in the image.
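Positioning each word behind its place in the image comes down to a coordinate conversion: OCR engines typically report word boxes in pixels with the origin at the top-left, while PDF pages are measured in points (1/72 inch) with the origin at the bottom-left. The following Python sketch shows that conversion under those assumptions. The function name and the `(left, top, width, height)` box layout are made up for illustration and are not part of the Atalasoft API.

```python
def pixel_box_to_pdf(box, image_height_px, dpi_x, dpi_y):
    """Convert an OCR word box from image pixels to PDF points.

    box: (left, top, width, height) in pixels, origin at the top-left corner.
    Returns (x, y, width, height) in points, origin at the bottom-left corner,
    where (x, y) is the lower-left corner of the word box.
    """
    left, top, width, height = box
    x = left * 72.0 / dpi_x
    # Flip the vertical axis: measure up from the bottom edge of the image.
    y = (image_height_px - (top + height)) * 72.0 / dpi_y
    return (x, y, width * 72.0 / dpi_x, height * 72.0 / dpi_y)


if __name__ == "__main__":
    # A word found 300 px from the left and 150 px from the top of a
    # letter-size page scanned at 300 dpi (2550 x 3300 pixels).
    print(pixel_box_to_pdf((300, 150, 600, 60), 3300, 300, 300))
```

Using the horizontal and vertical resolution separately matters for scans whose two resolutions differ, such as some fax images.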


Unable to write to an output file

I gave the infile path on my D drive, where the PDF file is present, and an outfile path with a folder on the D drive.

Hi, I am testing the Atalasoft component to convert TIFF to searchable PDF. I downloaded the DotImage 6.

The Adobe Reader version is 8. What is happening?

Does it have to be a scanned document? Will this work to perform OCR on images which are not documents, but contain text?

Also, can you define a region to “search” for text by giving x and y coordinates?

Bill Bither
Atalasoft, Inc.

Locate text in images?

Hi, I have a requirement to locate the boxes which contain text in images.

I have thousands of scanned magazine pages all as JPEG images. I just want to locate the position of all the text, the boxes which contain all the text on the page. Then I want to remove the text, so all I have left is the images that were on the pages.


We are trying to build a library of magazine layouts but don’t want the text, so we want to replace the text with colored boxes.

How could this be done?

Philip, please contact Atalasoft Sales or Support about this question, as we might be able to help you.

Could you please recommend a commercial software? I have no developers in my company and I need to solve this.

There are commercial products out there that will do this, mostly from other OCR vendors. We have a product currently in beta that we’ll be announcing soon, which is based on our toolkit advertised here. Might as well get the word out and post a link here, as it does exactly as you wish. The technology is proven, but the UI is not!

DotImage serves to convert a document image (PDF, JPEG, TIFF) to a format such as CSV or TXT. My question is: does your solution really serve this purpose?

We offer 3 different engines that you can use. See a recent post in this thread for more information.

It sounds like you are looking to perform forms processing, where each type of document that you scan has a standard template. You can set up a template programmatically using DotImage by defining a rectangular area for the Name field, for example, with some tolerance.
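As a rough illustration of the template idea (and not DotImage code), the Python sketch below defines a rectangular field with a pixel tolerance, collects the OCR words whose boxes fall inside it, and formats the result as CSV. The `Field` class, the `(text, (left, top, width, height))` word layout, and the helper names are all assumptions invented for this example; a real OCR engine would supply the word boxes.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Field:
    """A hypothetical template field: a named rectangle in image pixels."""
    name: str
    left: int
    top: int
    width: int
    height: int
    tolerance: int = 0  # extra pixels allowed on every side

    def contains(self, box):
        """True if the word box lies inside the rectangle grown by tolerance."""
        l, t, w, h = box
        return (l >= self.left - self.tolerance
                and t >= self.top - self.tolerance
                and l + w <= self.left + self.width + self.tolerance
                and t + h <= self.top + self.height + self.tolerance)


def extract_fields(fields, words):
    """Map each field name to the text of the words found inside it.

    words: list of (text, (left, top, width, height)) pairs from an OCR pass.
    Words are ordered by top, then left, which is a simplification that
    assumes words on the same line share the same top coordinate.
    """
    row = {}
    for field in fields:
        hits = [w for w in words if field.contains(w[1])]
        hits.sort(key=lambda w: (w[1][1], w[1][0]))
        row[field.name] = " ".join(text for text, _ in hits)
    return row


def to_csv(rows, field_names):
    """Format extracted rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=field_names)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    template = [Field("Name", 100, 200, 400, 50, tolerance=5)]
    ocr_words = [("Smith", (180, 205, 80, 30)),
                 ("Jane", (98, 205, 70, 30)),
                 ("Invoice", (100, 50, 120, 30))]
    print(to_csv([extract_fields(template, ocr_words)], ["Name"]))
```

The tolerance absorbs small shifts between scans of the same form; larger shifts or skew would call for registering the page against the template first.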

Saving this data to a CSV file is then a matter of formatting your data and saving a text file. For technical questions like these, it’s best that you talk with our support department. You can call them or submit a support request.

Does it support Chinese-like character sets?

More information about our OCR offerings is on our website. DotImage Document Imaging is a document imaging framework for .NET [...] application, in which case you’ll also need to purchase a production server license. More information on Atalasoft’s OCR is available on our website at http:

An idea

Hamed Mosavi, Dec

Some years ago I was wondering about this, and I don’t know if it exists in advanced countries like the US. Let’s mix some technologies: we have a book; a scanner, like a mouse, is moved over the book pages and scans the data; an OCR engine detects the words and converts them to text format, then gives the text to a speech machine capable of converting text to speech.

Then we can use a tool, perhaps a bit similar to an iPod, to read our books for us. We can listen to our books on the way home from the office, and you know how much quicker that will be.