Using make for automatic processing

March 20, 2014


Recently, I needed to extract text from a PDF file that had been produced from scanned pages. To get the page images back out of the PDF, I used the command

pdfimages file.pdf IMG

This stored the original scans in PBM-formatted image files. The files were named in the IMG-XXX.pbm pattern, with XXX starting at 000 and incrementing.

In the next step, I converted the raw .pbm files to the TIFF format, since the Tesseract OCR software prefers this format. I did the conversion with the ImageMagick command convert, which also let me process the images to remove spots and other blemishes. The complete format conversion and image processing command was

convert -verbose -colorspace gray -median 3x3 -resize 300% -blur 10 input.pbm output.tiff
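For a whole directory of extracted pages, the command above could be wrapped in a plain shell loop. The following is only a sketch, not necessarily how the pages were processed originally; the function name convert_all and the CONVERT override are assumptions made for illustration. It runs one conversion after another, so it serves as the serial baseline.

```shell
#!/bin/sh
# Sketch: convert every extracted page one after another (serial, one core).
# CONVERT may be overridden, e.g. CONVERT=echo prints the commands instead.
CONVERT="${CONVERT:-convert}"
IMFLAGS="-verbose -colorspace gray -median 3x3 -resize 300% -blur 10"

# convert_all is a hypothetical helper name used only in this sketch.
convert_all() {
    for src in IMG-*.pbm; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -e "$src" ] || continue
        # IMFLAGS is left unquoted on purpose so it splits into options.
        $CONVERT $IMFLAGS "$src" "${src%.pbm}.tiff"
    done
}

convert_all
```

Running it with CONVERT=echo in the directory that holds the IMG-XXX.pbm files shows the commands without executing them.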

As I repeatedly tried different image processing chains on multiple images at once, the execution time exceeded 1 minute. Looking at the CPU load, it occurred to me that only one core was being used at a time. Surely, running several processes in parallel on a multi-core CPU would speed up every OCR run. At first, I was unsure whether there was a simple process manager that limits the number of concurrently running processes. Then I remembered an old acquaintance: make, or, more precisely, GNU make. The tool also resolves dependencies automatically, which additionally helps with scheduling the processes.

The only problem left to solve was the lack of rules for translation (in make parlance, pattern rules) from the source image to the OCRed text. Luckily, years ago I had used make to automatically produce both printable and audible music scores from descriptions written in ABC notation, so I only had to read up on the make documentation again. This is the result:

CONVERT = convert
IMFLAGS = -verbose -colorspace gray -median 3x3 -resize 300% -blur 10

TESSERACT = tesseract

RM = rm

SRCS = $(wildcard *.pbm)
INTERMEDIATES = $(SRCS:.pbm=.tiff)
RESULTS = $(SRCS:.pbm=.txt)

.PHONY: all clean
all: $(RESULTS)
.PRECIOUS: $(INTERMEDIATES)

clean:
	$(RM) -f $(INTERMEDIATES) $(RESULTS)

%.tiff: %.pbm
	$(CONVERT) $(IMFLAGS) $< $@

%.txt: %.tiff
	$(TESSERACT) $< $(basename $@) -l deu

The commands convert, tesseract and rm are stored in the variables CONVERT, TESSERACT and RM, so that unusual installation locations can be specified. The variables SRCS, INTERMEDIATES and RESULTS contain file names. The .PHONY declaration instructs make to execute the targets all and clean even when there are actual files named all or clean. The .PRECIOUS declaration instructs make to preserve the intermediate files, which it would otherwise delete automatically once they are no longer needed. Line groups starting in the form %to: %from are pattern rules (rules for translation). $< and $@ stand for the source file (the first prerequisite) and the target file, respectively.
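To make the file name handling more concrete, the mapping can be imitated in plain shell. This is a sketch for illustration only, not something make itself runs; the function name show_chain is made up. For each source file it prints the name that would end up in INTERMEDIATES and the name that would end up in RESULTS, using the same stem that % matches in the pattern rules.

```shell
#!/bin/sh
# Sketch: print source -> intermediate -> result for each .pbm file name,
# imitating the .pbm=.tiff and .pbm=.txt substitutions of the Makefile.
# show_chain is a hypothetical helper name used only in this sketch.
show_chain() {
    for src in "$@"; do
        stem="${src%.pbm}"    # the part that % matches in "%.tiff: %.pbm"
        printf '%s -> %s -> %s\n' "$src" "$stem.tiff" "$stem.txt"
    done
}

show_chain IMG-000.pbm IMG-001.pbm
```

In the first rule, $< would be the left name and $@ the middle one; in the second rule, $< would be the middle name and $@ the right one.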

Run it with make -j number, where number is either the number of cores or the number of hardware threads of the CPU. On an AMD CPU with 4 cores, I used make -j 4.
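Instead of hard-coding the number, it can be queried from the system. The sketch below assumes that nproc (GNU coreutils) or getconf with _NPROCESSORS_ONLN is available and falls back to 1 otherwise; the function name job_count is made up for this example. Both tools typically count hardware threads rather than physical cores.

```shell
#!/bin/sh
# Sketch: determine a job count for "make -j" from the available CPUs.
# job_count is a hypothetical helper name used only in this sketch.
job_count() {
    nproc 2>/dev/null \
        || getconf _NPROCESSORS_ONLN 2>/dev/null \
        || echo 1
}

# Only print the command here; run it yourself next to the Makefile.
echo "make -j $(job_count)"
```

In the directory containing the Makefile, the actual call would then be make -j "$(job_count)".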

Using make, which is available for any off-the-shelf OS, you can speed up lengthy processing on multi-core systems. In addition, make ensures seamless translations between the different stages of a processing chain.

Which leads to the question: where do you use make, aside from everyday source code compilation?


