getting text from image-based subtitles

I’ll try to expand on this post later, but just a quick note: today I watched an excellent movie (The Abacus and the Sword) and wanted to take some of the subtitles to use as SRS cards. But the subs were in the image-based .sub format (VobSub, where each subtitle is stored as a picture rather than as text), so Aegisub couldn’t handle them. The solution is to use Subtitle Edit, which can run OCR on the images. It doesn’t have Japanese language data built in, so you need to get the Japanese data file from the Tesseract project. Unpack that file to the \Tesseract\tessdata folder under Subtitle Edit’s install folder. Then when you open such a .sub file, it will ask you if you want to import it, then ask you which OCR system to use (Tesseract) and which language (Japanese). Then you wait, some magic happens, and you have text. It marks the lines it isn’t sure of in a different colour so you can correct them manually, but it gets most of them pretty much right.
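If you would rather script the "unpack to tessdata" step than drag files around, here is a minimal Python sketch. It only copies a language file into the \Tesseract\tessdata folder described above. The function name `install_traineddata` and the file name `jpn.traineddata` are my own assumptions for illustration (the download from the Tesseract project may be an archive you have to unpack first, and the exact file names depend on the Tesseract version), and nothing here is part of Subtitle Edit itself.

```python
import shutil
import tempfile
from pathlib import Path


def install_traineddata(traineddata: Path, subtitle_edit_dir: Path) -> Path:
    """Copy a Tesseract language file into Subtitle Edit's tessdata folder.

    Hypothetical helper: both arguments are whatever paths apply on your
    machine; nothing is looked up or downloaded automatically.
    """
    if not traineddata.is_file():
        raise FileNotFoundError(f"language file not found: {traineddata}")

    # Folder layout as described in the post: <install folder>\Tesseract\tessdata
    tessdata = subtitle_edit_dir / "Tesseract" / "tessdata"
    tessdata.mkdir(parents=True, exist_ok=True)

    target = tessdata / traineddata.name
    shutil.copy2(traineddata, target)
    return target


if __name__ == "__main__":
    # Demo on throwaway folders so nothing real is touched.
    # "jpn.traineddata" is an assumed file name; use whatever you downloaded.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        fake_language_file = root / "jpn.traineddata"
        fake_language_file.write_bytes(b"placeholder")
        print(install_traineddata(fake_language_file, root / "Subtitle Edit"))
```

For real use you would call `install_traineddata` with the path to the unpacked language file and the path to your own Subtitle Edit install folder, then restart Subtitle Edit so that Japanese shows up in the language list.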

