Deploying an OCR Function App on Azure with Docker: A Step-by-Step Guide

Technical Article: Deploying ocrmypdf on Azure

Introduction

Optical Character Recognition (OCR) technology extracts text from scanned documents or images, making them searchable and editable. This article guides you through deploying a serverless OCR solution on Azure Functions using a custom Docker container. The solution uses ocrmypdf, an open-source Python library, to run OCR on uploaded PDF documents.

By utilizing Azure Functions, you can create a scalable and cost-effective OCR solution without managing infrastructure complexities. Additionally, containerizing your function app with Docker ensures portability and isolation across different environments.

Prerequisites

Basic understanding of Azure Functions and Docker.
An Azure Subscription.
Familiarity with Python.

Solution Code

Let’s dissect the core components of this solution.

1. Dockerfile

# Use the Azure Functions Python 3.11 base image
FROM mcr.microsoft.com/azure-functions/python:4-python3.11 

# Tell the Functions host where the app code lives and enable console logging
ENV AzureWebJobsScriptRoot=/home/site/wwwroot \
    AzureFunctionsJobHost__Logging__Console__IsEnabled=true

# Install build tools, image libraries, and the ocrmypdf system package
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        build-essential curl libjpeg-dev libtiff5-dev libpng-dev \
        libfontconfig1-dev libicu-dev libfreetype6-dev libpcre3-dev \
        libopenjp2-7-dev ocrmypdf && \
    rm -rf /var/lib/apt/lists/*

# Download, build, and install Ghostscript 9.56.1 in a single layer
# so the tarball and build tree are not kept in the image
RUN curl -L https://github.com/ArtifexSoftware/ghostpdl-downloads/releases/download/gs9561/ghostscript-9.56.1.tar.gz -o ghostscript.tar.gz && \
    tar -xzf ghostscript.tar.gz && \
    cd ghostscript-9.56.1 && \
    ./configure && make && make install && \
    cd .. && \
    rm -rf ghostscript-9.56.1 ghostscript.tar.gz

# Install Python dependencies
COPY requirements.txt /
RUN pip install -r /requirements.txt

# Copy application files
COPY . /home/site/wwwroot

Why Customize the Dockerfile?

The provided Dockerfile deviates from the standard Azure Function Python image in a few key aspects to address specific requirements for this OCR application:

1. Installing Additional Dependencies:

The default Azure Functions Python image is built for general-purpose Python functions and does not include the OCR tooling this application needs. Here we install ocrmypdf along with the system libraries used for image and font handling.
The Dockerfile uses apt-get to install packages on the Linux system inside the container. This covers system-level dependencies that pip, the Python package manager, cannot provide.

2. Including Ghostscript:

ocrmypdf relies on Ghostscript to rasterize pages and write its output, so the Ghostscript version in the container directly affects OCR behavior.
The Dockerfile downloads and compiles Ghostscript 9.56.1 from source, which pins the version instead of relying on whatever the distribution's package repository provides. Pinning keeps builds reproducible and avoids surprises when the repository version changes.

3. Customizing Environment Variables:

The Dockerfile sets the AzureWebJobsScriptRoot environment variable, which tells the Azure Functions host which directory inside the container holds the function code. It must match the directory the application files are copied to (/home/site/wwwroot).
Enabling console logging (AzureFunctionsJobHost__Logging__Console__IsEnabled=true) sends the function's logs to the container's console output, where you can read them during development and troubleshooting.

In summary, customizing the Dockerfile tailors the container to the OCR application: it installs the system libraries the default image lacks, pins a known Ghostscript version, and configures the Functions host to find the code and log to the console.
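Because the OCR pipeline depends on executables installed by the Dockerfile, it can help to confirm they are reachable before processing requests. The following is a minimal sketch, not part of the original solution: the helper name find_missing_tools and the tool list are assumptions for illustration (gs is the Ghostscript executable and tesseract is the OCR engine that ocrmypdf invokes).

```python
import shutil

# Executables the OCR pipeline shells out to. This list is an assumption
# for illustration; adjust it to the features you actually use.
REQUIRED_TOOLS = ("gs", "tesseract")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing executables:", ", ".join(missing))
    else:
        print("All required executables found.")
```

Running this script inside the built container is a quick way to catch a broken image before deploying it.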

2. Python Function App (app.py)

import logging
import os
import tempfile

import azure.functions as func
import ocrmypdf
from pdfminer.high_level import extract_text

app = func.FunctionApp()

# ANONYMOUS leaves the endpoint open to any caller; use a stricter auth level outside of testing
@app.route(route="ScanPdf", auth_level=func.AuthLevel.ANONYMOUS)
def ScanPdf(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')

    # Reject requests that do not carry a file under the "file" form field
    file = req.files.get('file')
    if file is None:
        return func.HttpResponse("Upload a PDF in the 'file' form field.", status_code=400)

    try:
        # The temporary directory and its contents are removed when the block exits
        with tempfile.TemporaryDirectory() as workdir:
            input_path = os.path.join(workdir, 'input.pdf')
            output_path = os.path.join(workdir, 'output.pdf')

            # Write the upload to disk and close it so ocrmypdf reads the complete file
            with open(input_path, 'wb') as input_pdf:
                input_pdf.write(file.read())

            # Run OCR on the uploaded PDF
            ocrmypdf.ocr(input_path, output_path, deskew=True)

            # Extract text from the OCR-processed PDF
            extracted_text = extract_text(output_path)

        return func.HttpResponse(extracted_text, status_code=200)

    except Exception:
        # logging.exception records the traceback along with the message
        logging.exception("Error processing PDF")
        return func.HttpResponse("Failed to process the PDF file.", status_code=500)

Key Elements Explained

Dockerfile: Sets up the environment with necessary dependencies for OCR processing.
Azure Function (app.py):
- HTTP Trigger (@app.route): Defines an HTTP endpoint to receive PDF uploads.
- Temporary Files: Handles temporary storage for input and OCR-processed PDFs.
- ocrmypdf.ocr: Executes the OCR process on the uploaded file.
- pdfminer.high_level.extract_text: Extracts text from the OCR output.
- Error Handling: Includes logging and appropriate response codes.
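One way to make the response codes more informative is to check that the upload looks like a PDF before starting OCR, so that malformed uploads can be answered with a client error rather than a generic failure. The sketch below is an illustration only; looks_like_pdf is a hypothetical helper, and it checks nothing beyond the "%PDF-" marker that PDF files start with.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if `data` begins with the PDF header marker."""
    # PDF files start with "%PDF-" followed by a version number, e.g. "%PDF-1.7".
    return data.startswith(b"%PDF-")


if __name__ == "__main__":
    print(looks_like_pdf(b"%PDF-1.7\n..."))   # a PDF-style header
    print(looks_like_pdf(b"\x89PNG\r\n..."))  # a PNG header, not a PDF
```

In the function, such a check would run on the uploaded bytes before they are written to disk, returning status code 400 when it fails.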

3. Deployment

Build Docker Image:

docker build -t ocr-function-app .

Push Image to Registry: Push the built image to Azure Container Registry (ACR) or another supported registry.
Create Function App: Use the Azure portal, CLI, or Resource Manager templates to create an Azure Function App, specifying the Docker container as the source.
Configure Function App: Set the Docker image source and any required configurations.
Test the Function: Send a POST request to the /ScanPdf endpoint with a PDF attached as the multipart form field named file. The response body contains the extracted text.
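The test request can be sent from Python using only the standard library. This is a sketch under stated assumptions: build_multipart and scan_pdf are hypothetical helper names, the form field is named file to match app.py, and the URL you pass in is a placeholder for your own Function App address.

```python
import urllib.request
import uuid


def build_multipart(field: str, filename: str, payload: bytes):
    """Encode one file as multipart/form-data; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def scan_pdf(url: str, pdf_path: str, timeout: float = 300.0) -> str:
    """POST the PDF at `pdf_path` to `url` and return the extracted text."""
    with open(pdf_path, "rb") as fh:
        body, content_type = build_multipart("file", "upload.pdf", fh.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    # OCR can take a while on large files, hence the generous timeout.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the default Azure Functions route prefix, the URL typically ends in /api/ScanPdf, so a call might look like scan_pdf("https://<your-app>.azurewebsites.net/api/ScanPdf", "scan.pdf").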

Keep in mind that running ocrmypdf on larger files requires more compute power. For my workloads, 2 cores and 4 GiB of memory have been enough.
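On a small plan it may also help to cap how many worker processes ocrmypdf starts, which it accepts through the jobs argument of ocrmypdf.ocr. The sketch below is an assumption rather than a tuned recommendation: pick_ocr_jobs is a hypothetical helper, and the default cap of 2 simply mirrors the 2-core sizing mentioned above.

```python
import os


def pick_ocr_jobs(max_jobs: int = 2, cpu_count=None) -> int:
    """Choose a worker count no larger than the cores available."""
    # os.cpu_count() can return None, so fall back to a single worker.
    available = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, min(max_jobs, available))


if __name__ == "__main__":
    print("OCR workers:", pick_ocr_jobs())
```

In app.py the result would be passed along with the existing arguments, for example ocrmypdf.ocr(input_file, output_file, deskew=True, jobs=pick_ocr_jobs()).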