Converting a scanned PDF to HTML

Tuesday, 26 June 2012

Posted by iLmu gak akan ada Habisnya at 01:40. Label: TULISAN BAHASA INGGRIS BISNIS 2

While looking for reference material, I came across a PDF file whose content is stored as images rather than text. As a result, I could not copy the text (lol, bad habit). One way to turn a scanned PDF into HTML is to use Google: if a PDF file shows up in the search results, Google offers an option to view it as HTML. This method only works if Google has indexed the whole PDF file, and it only displays the first 20 pages.

Another way is to convert the PDF file into a set of images (.png, .jpg), run OCR on each image, and then output an HTML or text file. Since I (the author) use Ubuntu GNU/Linux, the method below applies only to GNU/Linux (for the Windows version, ask someone else!).

Let's get started.
First, install xpdf, imagemagick, and ocropus:
sudo apt-get install xpdf imagemagick ocropus
Then use this script to do the conversion. Save it under the name "pdf2txt" (without the quotes) and make it executable with chmod +x pdf2txt.
Click here to view the code.
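
In case the link does not work for you, here is a rough sketch of what such a script could look like. This is not the original script, just my guess at the three steps described above: render the pages, convert them to PNG, and OCR each one. It assumes that pdftoppm is on your PATH (on some systems it ships in poppler-utils rather than xpdf) and that your ocropus version provides the command "ocroscript recognize", which may not be true for every release. The names OCR_CMD and DPI are made up for this sketch.

```shell
#!/bin/sh
# pdf2txt: a sketch only, NOT the script linked from this post.
# Usage: ./pdf2txt input.pdf > output.html

# Assumption: the installed ocropus offers "ocroscript recognize IMAGE"
# and prints its result as HTML. Override OCR_CMD if yours differs.
OCR_CMD=${OCR_CMD:-ocroscript recognize}
# Assumption: 300 dpi is a reasonable rendering resolution for OCR.
DPI=${DPI:-300}

pdf2txt() {
    pdf=$1
    if [ ! -f "$pdf" ]; then
        echo "pdf2txt: no such file: $pdf" >&2
        return 1
    fi

    # Work in a scratch directory so page images do not clutter the current one.
    tmpdir=$(mktemp -d) || return 1

    # Step 1: render every page to a PPM image named like page-000001.ppm
    # (the exact zero-padding depends on the pdftoppm build).
    if ! pdftoppm -r "$DPI" "$pdf" "$tmpdir/page"; then
        rm -rf "$tmpdir"
        return 1
    fi

    # Step 2: convert each page to PNG with ImageMagick.
    # Step 3: run OCR on the PNG and print the result to stdout.
    # Page numbers are zero-padded, so the glob visits them in page order.
    status=0
    for ppm in "$tmpdir"/page-*.ppm; do
        [ -f "$ppm" ] || continue
        png=${ppm%.ppm}.png
        # OCR_CMD is unquoted on purpose so it splits into command + arguments.
        if ! convert "$ppm" "$png" || ! $OCR_CMD "$png"; then
            status=1
            break
        fi
    done

    rm -rf "$tmpdir"
    return "$status"
}

# Run only when a PDF is given, so the file can also be sourced.
if [ "$#" -ge 1 ]; then
    pdf2txt "$1"
fi
```

Note that this sketch simply prints the OCR result of each page one after another, so the output may contain one HTML document per page. Browsers usually cope with that, but a cleaner script would merge the pages properly.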

How to use:
Suppose you want to convert "makalah.pdf" into an HTML file. Use the command:
./pdf2txt makalah.pdf > makalah.html
The command above converts "makalah.pdf" and writes the result to "makalah.html".
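
If you have several PDF files, the same command can be run in a loop. Here is a small sketch, assuming the script was saved as "pdf2txt" in the current directory. The helper names html_name and convert_all, and the PDF2TXT variable, are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: derive the output name, e.g. makalah.pdf -> makalah.html.
html_name() {
    printf '%s.html\n' "${1%.pdf}"
}

# Convert each PDF given as an argument, one output file per input file.
# PDF2TXT is an assumed override for where the pdf2txt script lives.
convert_all() {
    for pdf in "$@"; do
        "${PDF2TXT:-./pdf2txt}" "$pdf" > "$(html_name "$pdf")" || return 1
    done
}

# Run only when file names are given, so the file can also be sourced.
if [ "$#" -ge 1 ]; then
    convert_all "$@"
fi
```

For example, if you save this as "convert-all" (any name will do), running sh convert-all *.pdf writes one .html file next to each PDF.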
