Note: I wrote this post nearly 2 years ago and recently discovered it in my drafts, so some of the info is outdated.
Recently, I was tasked with running OCR on a huge set of images (3.4 million). I’m going to post some brief details on how we processed these images in about a week.
Initially, we uploaded all of the images to S3 from a colocated server we have locally using s3sync. This took a long while (~1.5 TB of data).
Once the images were all stored in S3, I retrieved all of the metadata and stored it in a MySQL database, which was running on a small EC2 instance. This host became the queue manager.
Since some images were in non-English languages, I went through and specified the language (if it wasn’t English) in the database.
I wrote a simple Perl script which would:
- retrieve the next image to be processed from the queue manager
- retrieve the corresponding image from S3
- run OCR on the image (with language option if it wasn’t in English)
- store the OCR output in the database on the queue manager and mark the image as processed
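The steps above can be sketched roughly as follows. This is a minimal illustration, not the original Perl script: it is written in Python, uses an in-process SQLite table in place of the MySQL queue manager, and takes the S3 download as a caller-supplied `fetch` function. The table layout, the column names, and the `status` values are all assumptions made up for the example. The OCR step shells out to the `tesseract` command line with `stdout` as the output base, which assumes a Tesseract build that supports writing to standard output.

```python
import sqlite3
import subprocess

# Hypothetical queue table; the real MySQL schema is not shown in this post.
SCHEMA = """
CREATE TABLE images (
    id       INTEGER PRIMARY KEY,
    s3_key   TEXT NOT NULL,
    lang     TEXT,                          -- NULL means English
    status   TEXT NOT NULL DEFAULT 'pending',
    ocr_text TEXT
)
"""


def claim_next(db):
    """Claim the next pending image and return (id, s3_key, lang), or None."""
    while True:
        row = db.execute(
            "SELECT id, s3_key, lang FROM images "
            "WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # queue is empty
        # Only claim the row if nobody else got to it first.
        cur = db.execute(
            "UPDATE images SET status = 'working' "
            "WHERE id = ? AND status = 'pending'",
            (row[0],),
        )
        db.commit()
        if cur.rowcount == 1:
            return row


def tesseract_command(image_path, lang=None):
    """Build the tesseract command line, adding -l only for non-English."""
    cmd = ["tesseract", image_path, "stdout"]
    if lang:
        cmd += ["-l", lang]
    return cmd


def run_ocr(image_path, lang=None):
    """Run tesseract on a local file and return the recognized text."""
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def process_one(db, fetch, ocr=run_ocr):
    """Process a single image. Returns False when the queue is empty."""
    job = claim_next(db)
    if job is None:
        return False
    image_id, s3_key, lang = job
    local_path = fetch(s3_key)      # caller-supplied download from S3
    text = ocr(local_path, lang)
    db.execute(
        "UPDATE images SET ocr_text = ?, status = 'done' WHERE id = ?",
        (text, image_id),
    )
    db.commit()
    return True
```

A worker would call `process_one` in a loop until it returns `False`. Passing `fetch` and `ocr` in as parameters keeps the queue logic testable without S3 or Tesseract installed.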
For cost reasons (and because the OCR output was adequate) we used Tesseract to process the images. It did a good job (depending on the image quality) and handled foreign languages very well.
To ensure we were getting the most bang for our buck, I whipped up a hack-of-a-script to keep at least 8 processes running on each server. The OCR processing instances were High-CPU x-large servers.
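For illustration, here is one way such a process-keeper could look. It is a sketch in Python, not the script I actually used. It assumes a fixed number of jobs rather than workers that run until the queue drains, and the names `keep_running`, `make_cmd`, and `slots` are invented for the example.

```python
import subprocess
import sys
import time


def keep_running(make_cmd, total_jobs, slots=8, poll=0.05):
    """Run total_jobs commands, keeping up to `slots` of them alive at once.

    make_cmd(i) must return the argument list for job number i.
    Returns (finished_count, peak_concurrency).
    """
    running = []
    started = 0
    finished = 0
    peak = 0
    while finished < total_jobs:
        # Drop children that have exited and count them as finished.
        alive = [p for p in running if p.poll() is None]
        finished += len(running) - len(alive)
        running = alive
        # Top the pool back up to the slot limit.
        while len(running) < slots and started < total_jobs:
            running.append(subprocess.Popen(make_cmd(started)))
            started += 1
        peak = max(peak, len(running))
        if running:
            time.sleep(poll)  # avoid spinning while children work
    return finished, peak


if __name__ == "__main__":
    # Stand-in workload: each job is a Python child that exits immediately.
    done, peak = keep_running(
        lambda i: [sys.executable, "-c", "pass"], total_jobs=20, slots=8
    )
    print(done, peak)
```

Eight slots matches the number of worker processes per server described above. In a real deployment `make_cmd` would launch the OCR worker script instead of a no-op.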
From there, I dumped the contents of the database and handed them off to our indexing expert. Some of the content is currently posted on www.worldvitalrecords.com and the rest is in the works.
Lessons learned (things I would have done differently)
- Find a way to send the hard drives which contained the images to Amazon, instead of uploading the entire 1.5TB from our datacenter. I’ve heard rumors of them doing this for large datasets, but have not verified.
- Create a better script to manage running jobs (I’d probably use a multi-threaded Perl script)
- Start processing images as soon as they were successfully uploaded. For simplicity’s sake, I uploaded all of the images, then processed them all at the same time.
- Get increased allocations for resources ahead of time. I started out with a 100-instance limit on EC2 and quickly saturated it. Around the middle of the week, I was finally able to get that limit increased to 200 instances.
- I’d consider using Amazon’s SimpleDB and Simple Queue Service in lieu of the MySQL database I used for the queue manager and for storing the OCR output.