Part 1: Optical Character Recognition in Python

Oct 27, 2024

PaddleOCR is one of the most capable open-source libraries for optical character recognition (OCR). If you need the output in a structured format, though, an existing multimodal LLM can do the job. In this article I will show how to do OCR with both PaddleOCR and Claude Sonnet.

1. PaddleOCR

First, install the libraries:

pip install paddleocr
pip install paddlepaddle

Then you can run OCR with the following code:

from paddleocr import PaddleOCR, draw_ocr
import cv2

# Initialize the OCR model with the desired language
ocr = PaddleOCR(use_angle_cls=True, lang='en')  # Set 'lang' based on the language you want

# Run inference on an image
img_path = 'path_to_your_image.jpg'
result = ocr.ocr(img_path, cls=True)

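# result holds one entry per image; that entry is None when no text is found
lines = result[0] or []

# Display the result: each line is [box, (text, confidence)]
for line in lines:
    print(line)

# Optional: draw the detected text on the image
image = cv2.cvtColor(cv2.imread(img_path), cv2.COLOR_BGR2RGB)  # OpenCV loads BGR; convert to RGB
boxes = [line[0] for line in lines]
texts = [line[1][0] for line in lines]
scores = [line[1][1] for line in lines]

# draw_ocr needs a TrueType font; replace this placeholder with a font file on your system
font_path = 'path_to_your_font.ttf'
image_with_boxes = draw_ocr(image, boxes, texts, scores, font_path=font_path)

# Convert back to BGR so OpenCV displays the colours correctly
cv2.imshow("OCR Result", cv2.cvtColor(image_with_boxes, cv2.COLOR_RGB2BGR))
cv2.waitKey(0)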
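
The result returned by ocr.ocr is a nested list, so it is often handy to flatten it into plain text. Here is a minimal sketch of that step. It assumes the result layout used above (one entry per image, each line shaped as [box, (text, confidence)]), which I believe matches the PaddleOCR 2.x releases. The function name extract_text and the min_score threshold are my own illustrative choices, not part of PaddleOCR.

```python
def extract_text(result, min_score=0.5):
    """Join recognised lines into one string, skipping low-confidence ones.

    Assumes result[0] is a list of [box, (text, confidence)] entries,
    or None when no text was found in the image.
    """
    lines = result[0] if result and result[0] else []
    kept = [text for _box, (text, score) in lines if score >= min_score]
    return "\n".join(kept)


# Hand-written sample in the same shape as a real result (not actual OCR output)
sample = [[
    [[[0, 0], [10, 0], [10, 5], [0, 5]], ("Invoice 42", 0.98)],
    [[[0, 6], [10, 6], [10, 11], [0, 11]], ("smudge", 0.21)],
]]
print(extract_text(sample))
```

With a real image you would call extract_text(result) right after ocr.ocr(...). Lower min_score if too many lines get dropped.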

2. Claude Sonnet

First, install the library:

pip install anthropic

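Then obtain an API key from the Anthropic Console and initialise the Anthropic client:

import os
import anthropic

# Read the API key from an environment variable instead of hard-coding it
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])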

Then you can run OCR with the following code:

import os
import base64
import anthropic

# Read the API key from an environment variable instead of hard-coding it
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

img_path = 'path_to_your_image.jpg'

base64_image = encode_image(img_path)
message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
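                        # The API expects "image/jpeg", so map a "jpg" extension to "jpeg"
                        "media_type": "image/" + img_path.rsplit('.', 1)[-1].lower().replace("jpg", "jpeg"),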
                        "data": base64_image,
                    },
                },
                {
                    "type": "text",
                    "text": "Return text in this image"
                }
            ],
        }
    ],
)
print(message.content[0].text)
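
The media_type field has to be a real MIME type such as image/jpeg, and a bare file extension does not always form one (a .jpg file is image/jpeg, not image/jpg). Here is a small sketch of a helper that looks the type up with Python's standard mimetypes module. The name guess_media_type is hypothetical, and the set of supported types is my assumption based on the formats Anthropic documents for base64 images at the time of writing.

```python
import mimetypes

# Assumed list of image types the API accepts for base64 image blocks
SUPPORTED_MEDIA_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}


def guess_media_type(image_path):
    """Return the MIME type for image_path, or raise if it is not supported."""
    media_type, _encoding = mimetypes.guess_type(image_path)
    if media_type not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"Unsupported image type for {image_path!r}: {media_type}")
    return media_type


print(guess_media_type("path_to_your_image.jpg"))
```

You could pass guess_media_type(img_path) as the media_type value, so an unsupported file fails early with a clear error instead of being rejected by the API.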

If you only need to extract plain text, PaddleOCR is the better choice. For tables or other structured data, Claude Sonnet or an equivalent LLM will do better.
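
To make the structured-data case concrete, here is a rough sketch of one way to ask for JSON and parse the reply. The prompt wording, the JSON shape and the parse_json_reply helper are all my own assumptions for illustration. The model is not guaranteed to follow the format, so the parser tolerates extra text around the JSON object.

```python
import json

# Illustrative prompt; the requested JSON shape is an assumption, adjust it to your data
PROMPT = (
    "Extract every table in this image. Reply with JSON only, in the form "
    '{"tables": [{"headers": [...], "rows": [[...], ...]}]}'
)


def parse_json_reply(reply_text):
    """Parse the model's reply, tolerating stray text around the JSON object."""
    start = reply_text.find("{")
    end = reply_text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in the reply")
    return json.loads(reply_text[start:end + 1])


# Made-up reply for illustration; in practice pass message.content[0].text
reply = 'Here is the table:\n{"tables": [{"headers": ["Item", "Qty"], "rows": [["Pen", "2"]]}]}'
data = parse_json_reply(reply)
print(data["tables"][0]["rows"])
```

To try it, use PROMPT in place of "Return text in this image" in the request above and feed the reply text to parse_json_reply.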

Written by Rohit Raj
