Introduction
Optical Character Recognition (OCR) technology extracts text from scanned documents or images, making them searchable and editable. This article guides you through deploying a serverless OCR solution on Azure using Azure Functions and a custom Docker container. The solution uses ocrmypdf, an open-source Python library, to perform OCR on uploaded PDF documents.
By utilizing Azure Functions, you can create a scalable and cost-effective OCR solution without managing infrastructure complexities. Additionally, containerizing your function app with Docker ensures portability and isolation across different environments.
Prerequisites
Solution Code
Let’s dissect the core components of this solution.
1. Dockerfile
# Use the Azure Functions Python 3.11 base image
FROM mcr.microsoft.com/azure-functions/python:4-python3.11
# Set the function app script root and enable console logging
ENV AzureWebJobsScriptRoot=/home/site/wwwroot \
    AzureFunctionsJobHost__Logging__Console__IsEnabled=true
# Install system dependencies, build tools, and ocrmypdf
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    build-essential curl libjpeg-dev libtiff5-dev libpng-dev \
    libfontconfig1-dev libicu-dev libfreetype6-dev libpcre3-dev \
    libopenjp2-7-dev ocrmypdf && \
    rm -rf /var/lib/apt/lists/*
# Download Ghostscript 9.56.1 and build it from source
RUN curl -L https://github.com/ArtifexSoftware/ghostpdl-downloads/releases/download/gs9561/ghostscript-9.56.1.tar.gz -o ghostscript.tar.gz
RUN tar -xzf ghostscript.tar.gz && \
    cd ghostscript-9.56.1 && \
    ./configure && make && make install && \
    cd .. && \
    rm -rf ghostscript-9.56.1 ghostscript.tar.gz
# Install Python dependencies
COPY requirements.txt /
RUN pip install -r /requirements.txt
# Copy application files
COPY . /home/site/wwwroot
The provided Dockerfile deviates from the standard Azure Function Python image in a few key aspects to address specific requirements for this OCR application:
1. Installing Additional Dependencies:
The Dockerfile installs ocrmypdf for OCR processing, along with other libraries your function may need depending on its complexity.
It uses apt-get to manage package installation on the underlying Linux system within the container. This allows for installing system-level dependencies not readily available through pip, the Python package manager.
2. Including Ghostscript:
The Dockerfile downloads the Ghostscript 9.56.1 source archive and builds it inside the image, so the container runs a specific, known version of Ghostscript.
3. Customizing Environment Variables:
The Dockerfile sets the AzureWebJobsScriptRoot environment variable, which specifies the working directory for the Azure Function within the container. This helps Azure Functions locate the application code and execute it correctly.
Enabling console logging (AzureFunctionsJobHost__Logging__Console__IsEnabled=true) allows you to view logs generated by your function during development and troubleshooting.
In summary, customizing the Dockerfile allows you to tailor the container environment to the specific requirements of your OCR application. This includes installing necessary libraries, managing dependencies not available through the default image, and ensuring compatibility with specific tools like Ghostscript by including a desired version.
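Because the image depends on several system-level tools, it can help to confirm that they are actually on the PATH inside the container before the function handles its first request. The following is a minimal sketch, not part of the original solution. The helper name find_missing_tools is hypothetical, and the executable names (ocrmypdf, tesseract, gs) are assumptions about how the packages above expose their binaries.

```python
import shutil

# Assumed executable names; adjust them if your image installs the tools differently.
REQUIRED_TOOLS = ("ocrmypdf", "tesseract", "gs")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of the tools that cannot be found on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [name for name in tools if which(name) is None]
```

Calling find_missing_tools() inside the running container and logging the result gives a quick signal when an image rebuild has dropped a dependency.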
2. Python Function App (app.py)
import azure.functions as func
import logging
import os
import ocrmypdf
import tempfile
from pdfminer.high_level import extract_text

app = func.FunctionApp()

@app.route(route="ScanPdf", auth_level=func.AuthLevel.ANONYMOUS)
def ScanPdf(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')
    # Temporary file creation
    temp_input_pdf = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    temp_output_pdf = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        # Process uploaded PDF
        file = req.files['file']
        temp_input_pdf.write(file.read())
        # Close both handles so the upload is flushed to disk before OCR reads it
        temp_input_pdf.close()
        temp_output_pdf.close()
        ocrmypdf.ocr(temp_input_pdf.name, temp_output_pdf.name, deskew=True)
        # Extract text from OCR-processed PDF
        extracted_text = extract_text(temp_output_pdf.name)
        return func.HttpResponse(extracted_text, status_code=200)
    except Exception as e:
        logging.error(f"Error processing PDF: {str(e)}")
        return func.HttpResponse("Failed to process the PDF file.", status_code=500)
    finally:
        # Cleanup: with delete=False the files must be removed explicitly
        temp_input_pdf.close()
        temp_output_pdf.close()
        os.unlink(temp_input_pdf.name)
        os.unlink(temp_output_pdf.name)
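The function reads req.files['file'] and hands the bytes straight to OCR, so any upload reaches ocrmypdf. A cheap pre-check can reject obviously bad payloads before paying for OCR. The sketch below is one possible approach, not part of the original solution: validate_upload and MAX_UPLOAD_BYTES are hypothetical names, and the 50 MiB limit is an arbitrary assumption. The check only looks at the "%PDF-" header marker, so it does not prove the file is a valid PDF.

```python
from typing import Optional

PDF_MAGIC = b"%PDF-"
# Hypothetical size limit (an assumption); tune it to your plan's memory.
MAX_UPLOAD_BYTES = 50 * 1024 * 1024


def validate_upload(data: bytes, max_bytes: int = MAX_UPLOAD_BYTES) -> Optional[str]:
    """Return an error message for a rejected upload, or None if it looks acceptable."""
    if not data:
        return "The uploaded file is empty."
    if len(data) > max_bytes:
        return "The uploaded file is too large."
    if not data.startswith(PDF_MAGIC):
        return "The uploaded file does not look like a PDF."
    return None
```

In the handler, a non-None result could be returned to the caller as a 400 response instead of running OCR.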
Key Elements Explained
HTTP trigger (@app.route): Defines an HTTP endpoint to receive PDF uploads.
ocrmypdf.ocr: Executes the OCR process on the uploaded file.
pdfminer.high_level.extract_text: Extracts text from the OCR output.
3. Deployment
Build the Docker image:
docker build -t ocr-function-app .
Send a request to the /ScanPdf endpoint with a PDF file to test functionality.
Keep in mind that processing larger files with ocrmypdf requires more compute power. For my work, I find that 2 cores and 4 GiB of memory are enough.
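To test the /ScanPdf endpoint, the PDF has to be sent as a multipart/form-data upload under the field name file, because that is the key the function reads from req.files. The sketch below uses only the Python standard library. The helper names build_multipart and scan_pdf are hypothetical, and the URL in the usage note is an assumption that depends on how you run or deploy the container.

```python
import urllib.request
import uuid


def build_multipart(field_name: str, filename: str, data: bytes):
    """Encode one file as a multipart/form-data body.

    Returns a (body, content_type) tuple ready to use in an HTTP request.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def scan_pdf(url: str, pdf_path: str, timeout: float = 300.0) -> str:
    """POST a PDF to the ScanPdf endpoint and return the extracted text."""
    with open(pdf_path, "rb") as fh:
        body, content_type = build_multipart("file", "upload.pdf", fh.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, scan_pdf("http://localhost:8080/api/ScanPdf", "sample.pdf") would work if the container is published on local port 8080 and the default api route prefix is unchanged; both are assumptions. The generous timeout reflects that OCR on larger files can take a while.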