Posted by: maki | Date: 2020-06-23 19:02 | IP: IP Logged
What can I do when text copied from a PDF file (copy/paste) comes out with broken Polish or Russian characters (or any other non-English characters)?
I want to paste the text into PSPad or any other editor.
Example of copied text (some Polish letters are corrupted):
yczliwe, oparte na Biblii propozycje i rady mog si
znacznie przyczyni ´
c do twych postp ´
ow, nawet gdyby ´
uczszczał do tej szkoły od wielu lat (Prz. 1:5).
Czy chciałby ´
s robi ´
c szybsze postpy? Bdzie to mo ˙
zliwe, je ´
Example of a wrong character:
Central European (Windows): 0xB4
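To see which character that byte stands for, here is a minimal Python sketch. It assumes that "Central European (Windows)" means the Windows-1250 code page (Python codec name `cp1250`); the helper name `describe_byte` is made up for illustration.

```python
import unicodedata


def describe_byte(value: int, codec: str = "cp1250") -> str:
    """Decode one byte in the given code page and return its Unicode name."""
    ch = bytes([value]).decode(codec)
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Assumption: the editor's "Central European (Windows)" is Windows-1250.
print(describe_byte(0xB4))  # U+00B4 ACUTE ACCENT
```

So the stray character is a standalone (spacing) acute accent, not part of a letter such as ć or ś.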
Edited 1 time(s). Last edit at 2020-06-23 19:04 by maki.
Posted by: pspad | Date: 2020-06-23 19:10 | IP: IP Logged
This is not a PSPad-related question. The problem is in your PDF. PDF is a final export format intended for printing, not for further editing. Some characters can be "painted", etc.
The solution is to find another PDF or to use an OCR tool such as ABBYY FineReader.
Posted by: maki | Date: 2020-06-24 07:12 | IP: IP Logged
I have never in my life managed to extract any text from any pdf file with any "OCR" program.
Always but always the text is damaged. And I tested it on thousands of various pdf files.
I used dozens of programs, also professional, with unicode support, and always the result - deplorable!
This is unbelievable! Why does this happen?!
Posted by: pspad | Date: 2020-06-24 08:12 | IP: IP Logged
Because PDF is a final format for printing: it looks the same on all platforms.
PDF isn't meant for further work with the text.
Characters are not necessarily stored as characters; they may be stored as glyphs, or as plain characters with a separately painted accent, or...
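When the accent was painted separately, the copied text contains a standalone accent followed by the base letter, as in the sample earlier in this thread. A rough Python sketch of a cleanup heuristic follows. It is only an assumption about what may help, not a general fix: the function name `merge_detached_accents` and the two-entry accent table are made up for illustration, it cannot restore letters that were dropped entirely (such as ą or ę in the sample), and it will join the accent to the wrong letter if the base letter itself was lost.

```python
import re
import unicodedata

# Spacing accents seen in the sample, mapped to their combining equivalents.
SPACING_TO_COMBINING = {
    "\u00b4": "\u0301",  # ACUTE ACCENT -> COMBINING ACUTE ACCENT
    "\u02d9": "\u0307",  # DOT ABOVE    -> COMBINING DOT ABOVE
}

# Optional whitespace, a spacing accent, optional whitespace or a line
# break, then the ASCII base letter the accent belongs to.
_DETACHED = re.compile(r"\s*([\u00b4\u02d9])\s*([A-Za-z])")


def merge_detached_accents(text: str) -> str:
    """Attach a detached accent to the following letter and compose with NFC."""
    def _join(match: re.Match) -> str:
        accent, letter = match.groups()
        return unicodedata.normalize("NFC", letter + SPACING_TO_COMBINING[accent])

    return _DETACHED.sub(_join, text)


sample = "znacznie przyczyni \u00b4\nc do twych"
print(merge_detached_accents(sample))  # znacznie przyczynić do twych
```

The whitespace before the accent is dropped on purpose, because in the sample the accent is split off from the word it belongs to.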
Posted by: maki | Date: 2020-06-24 11:37 | IP: IP Logged
So much for the praise heaped on well-known "professional" paid software, which in fact is doomed to give bad results on scanned documents, books, etc. (And I have a high-quality scanner.)
Let's be serious: this is not only about the PDF format, but also EPUB and other popular formats.
I wrote to the company and gave examples, but I received no answer and no help, only the offer of a refund if I had made a purchase.
It proves that OCR does not work. This is simple propaganda.
The OCR'd scan looks as if a tornado of every special Unicode character the world has invented had gone through it.
Even the best translator of hieroglyphs would not be able to understand the scanned document. :D
Edited 4 time(s). Last edit at 2020-06-24 11:43 by maki.
Posted by: pspad | Date: 2020-06-24 12:08 | IP: IP Logged
OCR works; I use it personally.
Do as you wish.
I am marking this thread as closed. You got an explanation and you were offered a solution.
This Thread has been closed