Tuesday, 26 June 2012

Converting a scanned PDF to HTML


While looking for reference material, I came across a PDF file whose content was stored as images rather than text. As a result I could not copy the text (lol... bad habit). One way to turn a scanned PDF into HTML is to use Google: if a PDF file shows up in the search results, Google offers an option to view it as HTML. This method only works if the PDF has been indexed by Google, and it only displays the first 20 pages.

Another way is to convert the PDF file into a set of images (.png, .jpg), run OCR on those images, and output an HTML or text file. Since I use GNU/Linux (Ubuntu), the method below applies to GNU/Linux only (for the Windows version, ask someone else!).

Let's go ahead...
First, install xpdf, imagemagick, and ocropus:
       sudo apt-get install xpdf imagemagick ocropus
Use this script to do the conversion. Save it under the name "pdf2txt" (without the quotes) and make it executable (chmod +x pdf2txt).
Click here to view the code.
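
For reference, here is a minimal sketch of what such a pdf2txt script might look like. It is not necessarily the linked script; it only follows the three steps described earlier (PDF to images, OCR on each image, HTML on standard output). The function names, the OCR_CMD and DPI variables, and the page-NNNN.png naming are my own choices for illustration. The default OCR command, "ocroscript recognize", is an assumption based on older OCRopus releases, so check which command your installed version actually provides and set OCR_CMD accordingly.

```shell
#!/bin/sh
# pdf2txt (sketch): scanned PDF -> page images -> OCR -> HTML on stdout.

# Assumption: OCR_CMD takes one image path and prints HTML to stdout.
# "ocroscript recognize" may differ on your OCRopus version.
OCR_CMD="${OCR_CMD:-ocroscript recognize}"
# Render resolution; 300 dpi is a common choice for OCR.
DPI="${DPI:-300}"

# Render every page of the PDF as page-0000.png, page-0001.png, ...
# (ImageMagick needs Ghostscript installed to read PDF files.)
pdf_to_images() {
    convert -density "$DPI" "$1" "$2/page-%04d.png"
}

# Run the OCR command on each page image, in page order.
ocr_pages() {
    for img in "$1"/page-*.png; do
        [ -e "$img" ] || continue   # skip if no pages were produced
        $OCR_CMD "$img"             # unquoted on purpose: command plus arguments
    done
}

main() {
    workdir="$(mktemp -d)" || return 1
    pdf_to_images "$1" "$workdir" && ocr_pages "$workdir"
    status=$?
    rm -rf "$workdir"               # always clean up the temporary images
    return "$status"
}

# Only run when a PDF was given on the command line.
if [ "$#" -gt 0 ]; then main "$@"; fi
```

Because each page is printed separately, the output is a sequence of per-page HTML fragments rather than one merged document. To try a different OCR tool, set OCR_CMD before calling the script.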

How to use:
Suppose you want to convert "makalah.pdf" into an HTML file. Use the command:
       ./pdf2txt makalah.pdf > makalah.html
The above command converts "makalah.pdf" and writes the result to "makalah.html".
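
If you have several scanned PDFs, the same command can be wrapped in a loop. This is only a small sketch: the names html_name, convert_all, and PDF2TXT are made up for illustration, and it assumes the script was saved as ./pdf2txt in the current directory.

```shell
#!/bin/sh
# Path to the conversion script (assumed to be in the current directory).
PDF2TXT="${PDF2TXT:-./pdf2txt}"

# Derive the output name: makalah.pdf -> makalah.html
html_name() {
    printf '%s\n' "${1%.pdf}.html"
}

# Convert every PDF passed as an argument, one HTML file per PDF.
convert_all() {
    for pdf in "$@"; do
        "$PDF2TXT" "$pdf" > "$(html_name "$pdf")"
    done
}

# Only run when at least one PDF was given on the command line.
if [ "$#" -gt 0 ]; then convert_all "$@"; fi
```

Saved as, say, "convert-all" (a made-up name), it could be called as: ./convert-all *.pdf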
