OCR PDFs and Other Files to Save Recognized Text

Automatic OCR processing and PDF text recognition are now a necessity in many situations. With built-in Optical Character Recognition (OCR) technology, DocuFreezer lets you recognize text in various documents, which makes it a useful OCR converter. It is also a reliable offline batch file converter for Windows 10 and older Windows systems.

Simple OCR converter for Windows

DocuFreezer can convert scanned images into editable text documents. The key benefit of this feature is that text extracted from images (or image-based documents) can be copied and reused elsewhere.

How to OCR PDFs or image files

  1. Open DocuFreezer
  2. Add files or an entire folder to the List of files
  3. Select Output file type: PDF or TEXT
  4. Go to Settings
  5. Place a checkmark next to Make PDF searchable (OCR) or OCR (Optical Character Recognition)
  6. Select the language of your input documents (avoid selecting many languages at once)
  7. Select Multipage and other options, if necessary
  8. Click Start
  9. Get the resulting files in the Output folder


Input and output file formats

DocuFreezer lets you convert PDF files, scanned images (TIFF, PNG, JPEG), CAD drawings, Excel files, and other documents into editable text. The output files can be plain text files (TXT) or searchable PDF files.

Languages

So far, DocuFreezer supports the following OCR languages:

  • English
  • German
  • Hebrew
  • Polish
  • Japanese
  • Russian
  • Spanish
  • Portuguese

More languages will be added in future versions of DocuFreezer. If you would like a language added, please contact our support team.

Note: the fewer OCR languages you select, the more accurate text recognition will be.

Convert scanned image to text

When you scan a document, it becomes an image. Afterward, you might need to get the text out of it so that you can edit it in a word processor, a spreadsheet, or another editing program. Use DocuFreezer for this task: just add images and let the software OCR your files. Once the OCR is done, text in searchable PDF documents can be selected, copied, and marked up.

OCR Conversion: Convert Bitmapped PDF to Searchable PDF

You can also make your PDF searchable. Using the built-in OCR technology, DocuFreezer can create a PDF with selectable, searchable text from an image-only PDF or another file type.


Recognize text from AutoCAD DWG and DXF

DocuFreezer supports DWG and DXF drawings as input formats, so you can get the text out of your CAD drawings as a searchable PDF or TXT file. Simply add the files to the list, select PDF or TXT as the Output file type, go to Settings, and check the option Make PDF searchable (OCR) or OCR (Optical Character Recognition).


OCR files from the command line

If you prefer command line tools to desktop apps, try 2PDF. It is a PDF converter with a built-in OCR module. If you have raster (bitmap) images or document scans such as TIFF, PNG, or JPEG files, you can create searchable PDF files from them with a single command. As a result, you'll get PDF files with text that can be indexed and copied.

Why is my OCR so poor? 7 steps to improve OCR accuracy

Text may be incorrect or corrupted after conversion with OCR. The short advice is to make sure that the input files are of high quality: large format and high resolution. Understanding the limitations of the OCR process helps you assist the OCR engine in producing more accurate results. OCR results are considered good if the recognized text is 98-99% accurate (that is, 1-2% of the text is recognized incorrectly).
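
The source does not say how the accuracy percentage is measured. One common approach, shown here as a minimal Python sketch, is to compare the recognized text with a manually verified reference and count character-level edits (Levenshtein distance). The function names `levenshtein` and `character_accuracy` are illustrative and are not part of DocuFreezer.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, recognized: str) -> float:
    """Share of the reference text recognized correctly, from 0.0 to 1.0."""
    if not reference:
        return 1.0 if not recognized else 0.0
    errors = levenshtein(reference, recognized)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    reference = "Optical character recognition"
    recognized = "Optical charaoter recogniti0n"  # two misread characters
    print(f"{character_accuracy(reference, recognized):.1%}")
```

By this measure, a result of 0.98 or higher corresponds to the 98-99% range mentioned above.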

Below are some tips which will help you achieve better OCR results.

#1 Improve the quality of the source images

One of the most significant factors is DPI (dots per inch). Scan documents at 300 DPI or higher. Preferably, scan at 600 DPI to capture as much image information as possible. With a high image resolution, the OCR engine should be able to recognize high-contrast areas, character borders, pixel noise, and aligned characters.
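
To check whether an existing scan meets the 300 DPI recommendation, you can estimate its effective resolution from the pixel dimensions and the physical page size. The Python sketch below is only an illustration; the function name `scan_dpi` and the A4 example values are assumptions, not DocuFreezer features.

```python
MM_PER_INCH = 25.4


def scan_dpi(width_px: int, height_px: int,
             page_width_mm: float, page_height_mm: float) -> float:
    """Estimate the effective DPI of a scan of a page with known physical size."""
    horizontal = width_px / (page_width_mm / MM_PER_INCH)
    vertical = height_px / (page_height_mm / MM_PER_INCH)
    # Use the lower of the two values as a conservative estimate.
    return min(horizontal, vertical)


if __name__ == "__main__":
    # Example: an A4 page (210 x 297 mm) stored as a 2480 x 3508 pixel image.
    dpi = scan_dpi(2480, 3508, 210, 297)
    # Round before comparing, because pixel counts are whole numbers.
    verdict = "meets" if round(dpi) >= 300 else "is below"
    print(f"about {round(dpi)} DPI, which {verdict} the 300 DPI recommendation")
```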

#2 Select a lossless output format when scanning

To let OCR software extract text more precisely, choose a lossless file format when scanning the source file, such as TIFF or high-quality PDF. If you scan to a TIFF without compression, no image information (roughly speaking, pixels) will be lost.

#3 Enhance the contrast of images

Contrast and density are vital factors to consider before running OCR on an image. When using a scanner (or an image editor if there is no way to scan the document again), you can adjust gamma and contrast to get clearer output. Increase the contrast until the characters stand out clearly from the background.
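
If rescanning is not an option, one simple software technique is linear contrast stretching: the darkest pixel is mapped to black, the brightest to white, and everything in between is scaled proportionally. The Python sketch below works on a plain list of grayscale values (0 to 255) to show the arithmetic. The function name is hypothetical, and image editors offer more refined tools for the same job.

```python
def stretch_contrast(pixels: list[int]) -> list[int]:
    """Linearly rescale grayscale values so they span the full 0-255 range."""
    darkest, brightest = min(pixels), max(pixels)
    if darkest == brightest:
        # A completely flat image has no contrast to stretch.
        return list(pixels)
    span = brightest - darkest
    # Integer arithmetic keeps the result exact and reproducible.
    return [(value - darkest) * 255 // span for value in pixels]


if __name__ == "__main__":
    # A washed-out scan: "black" text at 100, "white" paper at 200.
    print(stretch_contrast([100, 150, 200]))  # [0, 127, 255]
```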

#4 Increase the text size of the source images

The recommended text size in the scanned documents is 10 points or higher. For the best results, try to make sure the text height is at least 20 pixels.

There is a minimum text size for reasonable accuracy. Consider the resolution as well as the point size: at 300 DPI, OCR accuracy drops off below 10pt and drops rapidly below 8pt. At 10pt and 300 DPI, x-heights are typically about 20 pixels. Below an x-height of 10 pixels, you have very little chance of accurate results, and below 8 pixels letters will be removed as noise.

A quick check is to count the pixels in the x-height of your characters (the x-height is the height of a lowercase letter such as "x"). You can do this with a screenshot tool (e.g., Lightshot) or an image editor such as Photoshop.
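
The relationship between point size, DPI, and x-height can also be estimated with a little arithmetic. The Python sketch below assumes that the x-height is roughly 48% of the font size. This ratio is an assumption that varies from typeface to typeface, chosen here because it reproduces the "about 20 pixels at 10pt and 300 DPI" figure above. The function names are illustrative.

```python
POINTS_PER_INCH = 72            # one typographic point is 1/72 of an inch
ASSUMED_X_HEIGHT_RATIO = 0.48   # assumption: x-height as a share of font size


def estimated_x_height_px(point_size: float, dpi: float,
                          ratio: float = ASSUMED_X_HEIGHT_RATIO) -> float:
    """Estimate the x-height in pixels for text of a given size and scan DPI."""
    font_height_px = point_size / POINTS_PER_INCH * dpi
    return font_height_px * ratio


def rate_x_height(x_height_px: float) -> str:
    """Classify an x-height using the pixel thresholds given in this section."""
    if x_height_px >= 20:
        return "recommended"
    if x_height_px >= 10:
        return "reduced accuracy"
    if x_height_px >= 8:
        return "very little chance of accurate results"
    return "likely removed as noise"


if __name__ == "__main__":
    for point_size, dpi in [(10, 300), (8, 300), (10, 150), (6, 150)]:
        x_height = estimated_x_height_px(point_size, dpi)
        print(f"{point_size}pt at {dpi} DPI: about {x_height:.0f} px, "
              f"{rate_x_height(x_height)}")
```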

#5 Select only those languages that are contained in your documents

If the OCR software you're using has an option to select between languages (like DocuFreezer), select only those which are in your source documents. The fewer languages selected, the better. This will help to avoid misinterpretation of characters.

#6 Avoid text rotation or skew and make text lines horizontal

When a page is not straight during scanning, the text in the image ends up rotated. If the text is too skewed or rotated, OCR quality suffers severely. To solve this issue, scan the document again so that the text lines are horizontal. Alternatively, rotate the digital image slightly using an image editor.
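
To decide how far to rotate the image, you can estimate the skew angle from two points picked along the baseline of a single text line, for example by reading their pixel coordinates in an image editor. The Python sketch below is a simple manual approach with a hypothetical function name, not an automatic deskew algorithm.

```python
import math


def skew_angle_degrees(x1: float, y1: float, x2: float, y2: float) -> float:
    """Angle between a text baseline and the horizontal axis, in degrees.

    The points are (x, y) pixel coordinates with the left point given first.
    In image coordinates y grows downward, so a positive result means the
    line slopes down toward the right.
    """
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


if __name__ == "__main__":
    # The baseline starts at (100, 400) and ends at (1900, 463).
    angle = skew_angle_degrees(100, 400, 1900, 463)
    print(f"Text is skewed by about {angle:.1f} degrees")
```

Rotating the image by the same amount in the opposite direction should bring the text lines back to horizontal.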

#7 Remove dark borders and other objects near characters

Scanned pages may have dark edges around them. These can be processed as extra characters, especially if they vary in shape and gradation. If there is too much noise or too many stray objects, you can enhance the image using GIMP: enlarge the image 2.5 times, select the background near the letters using the Fuzzy Select (magic wand) tool and delete it, then sharpen the image using the Unsharp Mask filter.

It is often impossible to comply with all these conditions, and proofreading may be required. You can use a grammar/spellchecker, such as Grammarly. Always proofread and correct any errors before sharing OCR-produced text.

What is Optical Character Recognition

Optical character recognition (OCR) is a method of converting a scanned image into text. When a page is scanned, it is usually stored as a bitmap image in a format such as JPEG or TIFF. When the image is on the screen, we can read it. But to the computer, it is just a series of black and white dots. The computer does not recognize any “words” or actual characters in the image. DocuFreezer can help you turn a flat image into letters and characters. Try the OCR feature in the free version of DocuFreezer: download the program using the button below.

Download Free Version