claraocr.org

What is OCR?

OCR stands for Optical Character Recognition, a technology that automatically recognises text in images.

Uses for OCR

Suitable OCR software can extract the text contained in image files. The extracted text can then be edited in a word processor or searched electronically.

OCR Software

A wide range of commercial and open source OCR software is currently available for all the main operating systems (Linux, Mac, Windows).

What is OCR?

OCR (Optical Character Recognition) is the automatic recognition of text by software. It works by comparing the patterns of individual words or characters with known patterns. The scanned page is first structured and broken down into its individual elements until only the letters of each word are left. The result can then be improved and made more precise through various procedures. Optical character recognition is normally carried out on images from a scanner or digital camera and follows three phases.

In the first phase, the image file is divided into sections of relevant and non-relevant information. Text and picture captions are considered relevant, whereas pictures, white space and lines are not.

Breaking down the file like this results in a useable pattern, which can be worked on further according to certain criteria.
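The idea behind the first phase can be illustrated with a minimal Python sketch. It assumes the page has already been reduced to a grid of 0 (white) and 1 (dark) pixels, and it only separates rows that contain ink from blank rows. The function name `find_text_rows` and the sample page are invented for this illustration; real OCR software also has to tell text apart from pictures and lines, which this sketch does not attempt.

```python
def find_text_rows(image):
    """Return (start, end) row ranges that contain at least one dark pixel.

    `image` is a list of rows; each row is a list of 0 (white) or 1 (dark).
    `end` is exclusive, so (1, 3) means rows 1 and 2.
    """
    bands = []
    start = None
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a band of ink begins here
        elif not has_ink and start is not None:
            bands.append((start, y))       # the band ended on the previous row
            start = None
    if start is not None:                  # the band runs to the bottom edge
        bands.append((start, len(image)))
    return bands


# Hypothetical 6x5 page: two bands of ink separated by white space.
page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0],
]
print(find_text_rows(page))  # [(1, 3), (5, 6)]
```

Applying the same scan to the columns inside each band would split a line further into candidate characters.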

In the second phase, this pattern is compared with existing data and corrected if necessary. Mistakes found in the words and characters are corrected with the help of the database. This step reduces the number of errors in the recognised text, and any that remain can be corrected manually if needed.
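The second phase can be sketched in Python under some strong simplifications. The tiny 3x3 `TEMPLATES` table, the helper names `match_glyph` and `correct_word`, and the sample dictionary are all assumptions made for this example; actual programs use far larger pattern databases and more elaborate comparison methods. The dictionary step uses `difflib.get_close_matches` from the Python standard library.

```python
import difflib

# Hypothetical 3x3 templates; a real pattern database is much larger.
TEMPLATES = {
    "I": ("010", "010", "010"),
    "L": ("100", "100", "111"),
    "T": ("111", "010", "010"),
}


def match_glyph(glyph):
    """Return the template character that shares the most pixels with `glyph`."""
    def score(template):
        return sum(
            g == t
            for g_row, t_row in zip(glyph, template)
            for g, t in zip(g_row, t_row)
        )
    return max(TEMPLATES, key=lambda ch: score(TEMPLATES[ch]))


def correct_word(word, dictionary):
    """Replace `word` with its closest dictionary entry, if one is similar enough."""
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=0.6)
    return matches[0] if matches else word


noisy_t = ("111", "010", "011")  # a "T" with one stray pixel
print(match_glyph(noisy_t))                                        # T
print(correct_word("scamner", ["scanner", "camera", "pattern"]))   # scanner
```

A word with no sufficiently similar dictionary entry is returned unchanged, which leaves it for manual correction.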

The third and final phase involves encoding the text for the publishing format. Depending on the program, this could be one of many formats such as HTML, XML or PDF.

The quality of the resulting text depends on various factors. The most decisive is the quality of the original document, especially its colour, contrast, layout and font. The more distinct these are, the easier it is for the software to recognise the pattern. Another point to consider is how the text is scanned or photographed: the resolution and picture quality are very important for the later stages of the process.

OCR programs also differ greatly in the quality of their pattern databases and dictionaries, which in turn leads to different results when mistakes are corrected. Pattern recognition on its own can reach an accuracy rate of up to 80%, whilst good programs with powerful algorithms can reach accuracy levels of up to 99%. The latter have the advantage of being able to recognise letters as three-dimensional curves with characteristic features.
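The text above does not say how an accuracy rate is measured. One common way, assumed here purely for illustration, is to count how many single-character edits separate the recognised text from a correct reference text. The function names and the sample strings in this Python sketch are invented for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: the fewest insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def character_accuracy(recognised, reference):
    """Share of reference characters recognised correctly, between 0.0 and 1.0."""
    if not reference:
        return 1.0 if not recognised else 0.0
    errors = edit_distance(recognised, reference)
    return max(0.0, 1 - errors / len(reference))


reference = "Optical Character Recognition"
recognised = "0ptical Character Recogniti0n"  # two letters misread as zeros
print(f"{character_accuracy(recognised, reference):.1%}")  # 93.1%
```

By this measure, an accuracy of 99% still means roughly one wrong character in every hundred, which on a full page of text amounts to a good number of corrections.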

 

Development of OCR software

Originally, only one typeface was used for text recognition in order to avoid recognition errors. This typeface was designed so that OCR readers could recognise it quickly and reliably. The OCR-A typeface was developed in 1968 and is distinguished by letters that are very easy to tell apart. The OCR-B typeface, which arrived at the beginning of the 1970s, is a non-proportional design that keeps a clear difference between the characters. The latest version is OCR-H, which covers handwritten capital letters and numbers.

Thanks to constant improvements in computing and in algorithms, OCR software can now recognise ordinary printed typefaces as well as handwriting. Modern text recognition systems can also do a lot more than just read individual characters. The results produced by OCR software can now be corrected and improved with the help of Intelligent Character Recognition (ICR). For example, an “8” can be changed into a “B” when the character is found in a context where a letter is expected.
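The kind of context rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular ICR product works: the `DIGIT_TO_LETTER` table and the function `fix_digits_in_words` are hypothetical, and the rule is deliberately conservative, changing a digit only when every other character in the token is a letter.

```python
# Hypothetical table of digits that are easily confused with letters.
DIGIT_TO_LETTER = {"8": "B", "0": "O", "1": "I", "5": "S"}


def fix_digits_in_words(token):
    """Swap look-alike digits for letters when all other characters are letters."""
    others = [ch for ch in token if ch not in DIGIT_TO_LETTER]
    if not others or not all(ch.isalpha() for ch in others):
        return token  # purely numeric or mixed token: leave it unchanged
    return "".join(DIGIT_TO_LETTER.get(ch, ch) for ch in token)


print(fix_digits_in_words("8OOK"))  # BOOK
print(fix_digits_in_words("1980"))  # 1980
```

A token such as "1980" is left alone because its remaining character, "9", is not a letter, so the context suggests a genuine number.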
