Blog archive

Multilingual OCR: How to run multi-language OCR on a scanned PDF document with multiple languages

Wednesday, October 30, 2019

Good searchable PDF software should support multilingual OCR. OCRvision's text recognition feature supports a wide variety of languages and can recognize several languages within a single scanned document. You can see the complete list of supported OCR languages here. This means you can create a searchable PDF from a scanned PDF or scanned image that contains more than one language.

To run multilingual OCR with OCRvision, follow these steps:

  • Open the OCRvision user interface
  • Click the Languages tab on the left-hand side (see the screenshot below)

 


      Fig 1: OCRvision language configuration interface  

Tick the check box next to each language you want to include in your multi-language OCR process. Once a language is enabled, its name turns blue. To disable a language, untick its check box. A search box on the right-hand side of the interface lets you look up a particular language.

As soon as you tick a language, the setting is updated automatically and that language is enabled for multilingual OCR.

Done. Now all you have to do is copy your scanned PDF document with multiple languages into the pre-configured "magic folder".

              Fig 2: OCRvision magic folder configuration interface 

After a couple of seconds, a searchable PDF with multiple languages is generated in the magic folder automatically. The file currently being converted to a searchable PDF is shown in the notification strip at the bottom left, as in the screenshot above. You don't have to click any button: OCRvision is searchable PDF OCR automation software, so it creates the searchable PDF automatically.
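If you have a whole batch of scans, the copy step itself can be scripted. The following is only a minimal sketch in Python: the function name `drop_into_magic_folder` and both folder paths are made up for illustration, and it assumes the magic folder has already been configured in OCRvision. It does nothing beyond copying files; the OCR itself is still done by OCRvision watching the folder.

```python
import shutil
from pathlib import Path


def drop_into_magic_folder(source_dir: str, magic_folder: str) -> list[Path]:
    """Copy every .pdf file in source_dir into magic_folder; return the new paths.

    Both arguments are hypothetical paths chosen by the caller. The magic
    folder must already exist, since it is configured inside OCRvision.
    """
    target = Path(magic_folder)
    if not target.is_dir():
        raise NotADirectoryError(f"magic folder not found: {target}")

    copied = []
    # Note: this pattern only matches a lowercase ".pdf" extension on
    # case-sensitive file systems.
    for pdf in sorted(Path(source_dir).glob("*.pdf")):
        # copy2 also preserves timestamps, which helps when auditing batches.
        copied.append(Path(shutil.copy2(pdf, target / pdf.name)))
    return copied
```

You would call it with your own locations, for example `drop_into_magic_folder("scans", "magic")`, where both folder names are placeholders.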

 

Benefits of a Searchable PDF

Monday, October 28, 2019

Using good searchable PDF software, you can convert any scanned document into a searchable digital file. The underlying technology is called optical character recognition (OCR). If you want more details, you can read this blog post on what is a searchable PDF. Once you OCR a PDF, you can search that document for any key phrase, word, or even special symbol.

The advantages of a Searchable PDF

Easier to search and copy

The biggest advantage of all is that a searchable PDF can save you a lot of time. Suppose you have a 20-page scanned document and you want to find a specific word, such as a name or an address. Instead of reading through all 20 pages of a black-and-white scan, you can use tools like Windows file search or the command line to search for it. Your information is at your fingertips in a matter of seconds. The savings add up quickly if you deal with lots of scanned documents in your daily work.

You can share this searchable PDF file with your colleagues, and they can open it and copy data from it just as they would from a Word or Excel file.
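To see why the text layer makes this so fast, here is a small Python sketch. It assumes the text of each page has already been pulled out of the searchable PDF by a PDF library of your choice; the function `find_phrase` and the sample pages are invented for illustration only.

```python
def find_phrase(page_texts: list[str], phrase: str) -> list[int]:
    """Return the 1-based page numbers whose text contains phrase.

    The comparison is case-insensitive.
    """
    needle = phrase.casefold()
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in text.casefold()
    ]


# Made-up text for a three-page document, standing in for an extracted text layer.
pages = [
    "Invoice 1042 issued to Jane Doe",
    "Shipping address: 12 Example Street",
    "Terms and conditions",
]
print(find_phrase(pages, "address"))  # [2]
```

With an image-only scan there would be no page text to pass in at all, which is exactly why such a search comes back empty.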

Enhance the value of your documents

If your company is in the paperless office or digital archiving business, your customers can save on cost and enhance the value of their office documents by making them searchable. Searchable PDFs can therefore help you move an organization toward a paperless office culture.
Searchable PDFs can also be easier to find online and in search results, which can help improve the customer experience.

Increase accessibility

A scanned document is just an image of a text document. It is inaccessible to people with disabilities who rely on assistive technology, because the text is a graphical representation of the letters rather than searchable text. As a result, assistive technology software can't extract the words or read the document aloud. To solve this problem, you can OCR the scanned document into a searchable PDF. The searchable PDF contains a text layer that software such as Windows Narrator can access.

Conclusion

Searchable PDFs are becoming a standard for scanned documents. You can use good searchable PDF software like OCRvision to convert scanned PDFs to searchable PDFs automatically.
By leveraging searchable PDFs, you can save time, increase productivity, improve your end-user experience, and grow your business over time.

 

What does "Searchable PDF" mean?

Wednesday, October 2, 2019

What is a "Searchable PDF"?

Using OCRvision, you can convert a PDF to a searchable PDF. Searchable PDF conversion, that is, converting a scanned or image-based PDF into readable text, is the main functionality of any OCR software: it makes the PDF document searchable. PDF, or Portable Document Format, is a file format introduced by Adobe to represent documents in a manner independent of hardware, software, and operating system. Each PDF document therefore encapsulates the text, fonts, graphics, images, and other information needed to display it.

You can broadly classify PDF documents into three types:

  • Text-Based PDF
  • Image-Based PDF
  • Searchable PDF

Text-Based PDF

These are digitally created PDFs, which we can call “true PDFs”. They are normally created with software such as Adobe Acrobat, Microsoft® Word, or Excel®. You can even “print” a document to a PDF file. These documents are searchable. Just as with your Word documents, you can edit, search, and delete text in them.

Image-Based PDF

Image-only, or scanned, PDFs fall into the second category. They are created using scanners or digital cameras. Such a file is basically an image embedded in a PDF document. Just like a JPEG or PNG file, it has no text layer. That means you can only view or print it; you won’t be able to search for text or copy text from it. If your organisation handles lots of scanned documents, the data locked inside them can become a real headache.

Searchable PDF

Searchable PDFs are created from image-based PDFs. As discussed above, the problem with an image-based document is that there is no text layer to search. To solve this problem, we use optical character recognition software like OCRvision. OCR software analyses the data in the image-based PDF, “recognises” the text, and adds a text layer to the document. This text layer is normally invisible, sitting underneath the image. It can be searched or indexed by Windows search. So, when you search for a keyword in this document, you are searching in this invisible text layer.
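The three categories above boil down to two questions about a page: is it a scanned image, and does it carry a text layer? The Python sketch below encodes that rule of thumb. The names `PdfKind`, `classify_page`, and `needs_ocr` are hypothetical, and the two flags are assumed to come from whatever PDF inspection tool you use. Real files can be messier (for example, a text-based PDF with a few scanned pages), so treat this as an illustration rather than a detector.

```python
from enum import Enum


class PdfKind(Enum):
    TEXT_BASED = "text-based"
    IMAGE_BASED = "image-based"
    SEARCHABLE = "searchable"


def classify_page(has_text_layer: bool, is_scanned_image: bool) -> PdfKind:
    """Map the two observations about a page onto the three categories."""
    if is_scanned_image:
        # A scan plus a text layer is what OCR produces.
        return PdfKind.SEARCHABLE if has_text_layer else PdfKind.IMAGE_BASED
    # Simplification: anything that is not a scan is treated as digitally created.
    return PdfKind.TEXT_BASED


def needs_ocr(has_text_layer: bool, is_scanned_image: bool) -> bool:
    """Only image-based pages gain anything from being run through OCR."""
    return classify_page(has_text_layer, is_scanned_image) is PdfKind.IMAGE_BASED


print(classify_page(has_text_layer=False, is_scanned_image=True).value)  # image-based
print(needs_ocr(has_text_layer=True, is_scanned_image=True))  # False
```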