It would be great if Acrobat were multi-threaded. I routinely have to OCR 4,000-page documents, which almost always fails and takes hours
Acrobat needs to be multithreaded. I have a top-of-the-line machine with 32 GB of RAM, an i7-6700K processor, and an M.2 SSD, yet I am unable to successfully run OCR on a 4,000-page document without splitting it up into 250-500 page chunks, which is cumbersome. Even then, Acrobat will only use 12% of my resources.
-
James Keeline commented
Ben, I think we can all agree that 4,000 pages is a lot to OCR at once. It would be a particular issue if the PDFs are very large, leaving little RAM for the page processing.
I have noticed that when Acrobat "Pro" is doing certain actions on my MacBook Pro (2020), it grabs disk space and doesn't return it until the application is restarted.
I wonder if you have tried my suggestion of a parallel OCR solution called OCRmyPDF. It is a Python script that uses the Tesseract engine for the work. On my machine (6-core i7, 16 GB RAM), you can tell when it is running because the fans spin up. It is a bit faster than Acrobat "Pro" and seems more reliable. But I have not tried to do more than 1,000 pages at a time.
If you care to supply a sample file, I can give it a try and let you know the results if you don't care to do it yourself.
But Acrobat "Pro" should be multithreaded and not just change the UI to be different. I finally turned off the new UI and am happier being able to find certain functions which were well hidden in the new one.
-
Ben Boucher-Giles commented
Sadly, it is seven years on from my original post and Acrobat still has the same amateur approach to multithreading. Nowadays every CPU produced is inundated with cores, yet Adobe appears to ignore this and the software remains in the dark ages. I would also add that OCR is no more reliable now than it was back then. I still routinely have to OCR large documents in page ranges of 500 pages, saving in between because I cannot trust the standalone 2020 "Pro" version I use to get through the three hours of processing (using the one core it knows about!) without crashing and losing progress.
-
Jake Cooper commented
Hi, I'm a cyber security expert and seasoned developer. I'd like to shed some light on this topic from a business standpoint and then provide a solution from a developer standpoint.
First, in the software development industry it is well known that performance issues are the developers' responsibility. The end users cannot see the back end and-- especially with most of your users being office workers and not computer geeks-- the 100 or so people who voted on this are likely 100% of the users who actually understand what multithreading is.
If anyone else reads this, there is no way they will upvote it, because 9/10 viewers have likely never heard of multithreading in their life.
A major problem with Adobe is its performance. Most users just brush it off, as they aren't aware of the cause or of an immediate solution.
For a major software organization, implementing any parallelization would yield massive performance boosts. You could even do it little by little, and with every update millions of people would talk about "how smooth the new Adobe is."
Finally... the nerd stuff (yes!). While OCR is the most noticeable tool in need of parallelization, any and all tasks can benefit from it. Here are 2 ways you can implement it very simply:
1. Monitor the OS resources. When they reach a certain point, distribute tasks evenly among threads.
2. Parallelize everything, and make a setting for the user to set a max thread count.
Next, the algorithm. This is one of the simplest things in programming. (I'm sorry if this offends, but this is for all readers, not just developers.)
Let's say you have a 12 page document.
Standard single-core OCR looks like this:
(Instruction, time to complete in seconds)
OCR page 1, 5s
Then OCR page 2, 5s
Then page 3, 5s
Then page 4, 5s
Etc...
At 5s per page, 12 pages takes an entire minute.
Each page is done sequentially. It is a very simple algorithm as it is entirely linear.
Here's how a multiprocessed OCR would look with a 6 core processor:
First you make a function or job that does this:
If the # of pages we've OCR'ed so far isn't equal to the number of pages we need to OCR, then add one to the number of pages we've done and OCR the page of that same number.
Otherwise, don't do anything.
Then, you simply assign that job to every core.
What happens then?
When the first core takes its job, it won't OCR any other pages for 5 seconds (until it finishes its page), but it still increments the pages done so far to 1. This means the second core gets page 2, the third core gets page 3, and so on up to the sixth core, which sets the pages we've done to six.
After 5 seconds, core 1 checks how many pages we've done and sees we haven't got all 12. It adds one to the number we've done, making it seven, and starts on page 7. The other cores finish their first task and follow suit.
After this second round, we have OCR'ed 12 pages in total, meaning core 1 won't do anything else. Neither will the other cores.
Each core OCR'ed 2 pages. What took the current Adobe algorithm 1 whole minute takes the industry standard only 10 seconds to complete.
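The job described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Acrobat or any real OCR engine works: `ocr_page` is a hypothetical stand-in that just sleeps, the names `run_workers` and `job` are made up for this example, and threads are used because sleeping does not need a real core. A CPU-bound engine would more likely use separate processes. The arithmetic is the same as in the walk-through: 12 pages x 5s = 60s on one worker, versus 12 / 6 = 2 rounds x 5s = 10s on six.

```python
import threading
import time


def ocr_page(page_number, seconds_per_page):
    # Hypothetical stand-in for real OCR work: just wait, then return fake text.
    time.sleep(seconds_per_page)
    return f"text of page {page_number}"


def run_workers(total_pages, worker_count, seconds_per_page=0.05):
    pages_claimed = 0                 # "the number of pages we've done so far"
    lock = threading.Lock()           # makes check-and-increment one atomic step
    results = {}
    pages_per_worker = [0] * worker_count

    def job(worker_id):
        nonlocal pages_claimed
        while True:
            with lock:
                if pages_claimed == total_pages:
                    return            # "otherwise, don't do anything"
                pages_claimed += 1    # claim the next page number
                page = pages_claimed
            text = ocr_page(page, seconds_per_page)
            with lock:
                results[page] = text
            pages_per_worker[worker_id] += 1

    threads = [threading.Thread(target=job, args=(i,)) for i in range(worker_count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return results, pages_per_worker, elapsed


if __name__ == "__main__":
    # Scaled down: 0.05s per page instead of 5s.
    _, _, one_worker = run_workers(12, 1)
    _, per_worker, six_workers = run_workers(12, 6)
    print(f"1 worker: {one_worker:.2f}s, 6 workers: {six_workers:.2f}s")
    print("pages handled by each worker:", per_worker)
```

The lock is the one detail the plain-English version leaves out: without it, two cores could read the same counter value and OCR the same page twice.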
Final words:
If Adobe devs plan to wait for their users to vote en masse for multithreading to be implemented, then we will never ever see multithreading, and people will absolutely leave Adobe as it gets left behind by the current generation. Adobe has the manpower, financial power, and technical prowess to implement this--yet they choose to wait for HR reps (general Adobe users) to tell them how to do their job.
Unfortunately, the industry is changing and those who choose not to change with it will be left behind.
-
Abdallh Nsr commented
Restore all documents and data for the mobile phone as of 10/1/2023
-
DocsDel commented
Yet another reason to finally make Acrobat 64-bit and leave the Windows 3.1 days behind.
-
James D. Keeline commented
4,000 pages is a lot. There is merit in doing it in pieces and combining them afterward.
When I don't care to wait for Adobe Acrobat Amateur (it does not deserve the "Pro" label) to OCR a document, I use a Python script called OCRmyPDF on my MacBook Pro 2020 16" (32 GB RAM, 2 TB SSD). As I recall from the installation, there were a few dependencies, but ultimately I got it going with "brew." Now I can OCR from the command line (terminal). When it is running, the fans and processor meters show that all of the resources are being employed. The time required depends on the resources, but with an 8-core processor it is about 1/3 that of Acrobat.
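For anyone who wants to script this, here is a minimal sketch of calling the OCRmyPDF command line from Python. It assumes the `ocrmypdf` command is installed and on the PATH. The helper names `build_ocr_command` and `run_ocr` and the file names are hypothetical, made up for this example. The `--jobs` option limits how many CPU cores OCRmyPDF uses, and by default it uses all of them.

```python
import os
import shutil
import subprocess


def build_ocr_command(input_pdf, output_pdf, jobs=None):
    """Build the ocrmypdf command line; --jobs caps the number of cores used."""
    if jobs is None:
        jobs = os.cpu_count() or 1
    return ["ocrmypdf", "--jobs", str(jobs), input_pdf, output_pdf]


def run_ocr(input_pdf, output_pdf, jobs=None):
    """Run OCRmyPDF on one file, raising if the tool is missing or fails."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on the PATH")
    subprocess.run(build_ocr_command(input_pdf, output_pdf, jobs), check=True)


# Example with hypothetical file names:
# run_ocr("scanned.pdf", "scanned_ocr.pdf", jobs=8)
```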
It is a tad ironic that a free, open source program can out-perform the expensive Adobe product which has not improved in this area in the past 10-20 years.
-
Anonymous commented
How is it 2021 and this still uses a single thread?
Why does it need to take an hour to OCR 1000 pages?
-
James commented
Hear, hear! Just spent a very frustrating afternoon (read: 4 hours) trying to Bates stamp docs in Adobe. Tried multiple versions of Adobe and kept hitting crashes. Even tried multiple computers, same results. Installed a trial of Foxit and while it wasn't blazing fast, it got the job done without complaining.
Safe to say we're reevaluating using Acrobat in our office and are now looking at the alternatives. That's 4 hours of my life I can't get back, not to mention 4 billable hours wasted. I can't bill the client for that time.
-
Anonymous commented
STILL WAITING ON THIS ADOBE
-
Mark commented
Just use PDF XChange Editor. It's cheaper (about $50) and works better for 99% of everything you need to do with a PDF. It uses multiple threads and cores. The only thing I could not do with PDF XChange Editor is remove duplicates, which in Acrobat is done by a plugin from a third-party provider. It's just not worth paying for DC imho.
-
James D. Keeline commented
I have seen this sentiment raised for about a decade. Considering the millions of dollars (or equivalent currency) that people pay for Adobe Acrobat Pro, it is well past the point when it should behave like a fully professional program. This means rock-solid stability and reliability and speed. If we invest in hardware with ample RAM, multi-core processors, and good CPU speed, the software should take advantage of the available resources. Doing otherwise is cheating us from the productivity we deserve.
When I use software like HandBrake, it is immediately obvious because the fans spin up and any metering software will show that the resources are being fully utilized. It is harder to do video in a multicore, multithreaded environment, but they manage. Why can't Adobe do the same for the processes we use every time we open Acrobat Pro? This includes OCR ("Text Recognition") and building PDFs from images.
This should be faster than it is with a brand new MacBook Pro 2020 with 16G RAM and 2.6 GHz 6-Core Intel Core i7. For the OCR process, it is not really different than my old computer, a mid-2014 MacBook Pro with only 8G RAM.
-
Tom Bilan commented
I second this. I have an 8 core computer and this process is a perfect candidate for parallelization. I think all that Adobe would need to do is divide the document up into X number of smaller documents and then process them in separate streams then combine the results. X = # of cores. It doesn't seem like too hard of a computer science problem and would be a big win for anyone who's bought a computer in the last decade since everything is multi-core now.
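A rough sketch of that divide, process, and combine idea in Python is shown here. It rests on assumptions: `ocr_chunk` is a hypothetical stand-in for the real OCR step, and `split_pages` and `ocr_document` are names made up for this example. A thread pool keeps the sketch simple. For real CPU-bound OCR, `ProcessPoolExecutor` from the same module offers the same `map` interface.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_pages(total_pages, parts):
    """Split pages 1..total_pages into up to `parts` contiguous ranges."""
    parts = max(1, min(parts, total_pages))
    base, extra = divmod(total_pages, parts)
    ranges = []
    start = 1
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first chunks
        ranges.append(range(start, start + size))
        start += size
    return ranges


def ocr_chunk(pages):
    # Hypothetical stand-in for running OCR on one smaller document.
    return [f"text of page {p}" for p in pages]


def ocr_document(total_pages, workers=None):
    """OCR each chunk in its own worker, then combine results in page order."""
    workers = workers or os.cpu_count() or 1   # X = number of cores
    chunks = split_pages(total_pages, workers)
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        # map() yields results in the same order as the chunks were submitted.
        chunk_results = pool.map(ocr_chunk, chunks)
        return [text for chunk in chunk_results for text in chunk]


if __name__ == "__main__":
    print([len(r) for r in split_pages(4000, 8)])
    print(len(ocr_document(4000, workers=8)))
```

One trade-off compared with handing out pages one at a time: if some chunks contain slower pages, the other workers sit idle until the slowest chunk finishes.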
-
Michel Phillips commented
I agree. I have a fast, new-ish CPU and plenty of RAM, but I can't OCR more than about 108 pages without Acrobat DC crashing. For a 1500 page document (which I routinely deal with), this means I have to run OCR manually 14 or 15 times. On Acrobat Pro XI I would start the OCR process when I was leaving for the day, and when I came back in the morning my entire document would be OCR'd.
New products are supposed to be BETTER. Not WORSE.
-
Anonymous commented
I have the same problems and get very frustrated with a minute-long load time on documents over 3000 MB.
-
John commented
It's been over a year! Has this been implemented yet?
-
Wrathofgod220 commented
Please also improve the Embed Index tool to use multithreading and better CPU utilization. OCR and index embedding go hand-in-hand.
-
Adminrishusha (Admin, Adobe) commented
Hi Mike,
We have raised a feature request for this and shall update you about the same.
-
mohamed emad commented
hi