UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0x89 in position 270: invalid start byte – WHY?

I’m writing an ultra-simple web page scraper using Python and BeautifulSoup.
One key piece of information on the page is rendered as a PNG image, so I’ve had to reach for PIL and pytesseract.

The code is extremely simple, and it works when executed as my own user. The image does load, as the print call shows, but image_to_string appears to raise the error.

    import base64
    from io import BytesIO

    import pytesseract
    from PIL import Image

    encoded_img = 'iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8AAAAASUVORK5CYII='

    def extract_number(encoded_img):
        # Decode the base64 payload and open it as an image
        img_data = base64.b64decode(encoded_img)
        img_bytes = BytesIO(img_data)
        img = Image.open(img_bytes)
        print(img.format, img.size, img.mode)

        # Use pytesseract to extract the number: treat the image as a single
        # line of text and only allow digits and separators
        custom_config = r'--psm 7 -c tessedit_char_whitelist=0123456789.,'

        return pytesseract.image_to_string(img, config=custom_config).strip()

However, when running from a cron job (after sorting out the venv and dependencies), I get the seemingly impossible message from the title:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 270: invalid start byte

Setting the LANG or LC_* environment variables did not help.

I’m using Python 3 on macOS Sonoma, in case that matters.

Any ideas?

After dumping my entire user environment and loading it inside my script, I got the script to run successfully.

Eliminating the other variables one by one, I was left with TMPDIR, which defaulted to /tmp, and tesseract was apparently unable to write there.
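
For anyone who wants to repeat the elimination, here is a minimal sketch of the dump-and-compare step. The function names and the JSON file format are my own choices for illustration, not anything cron or pytesseract provides: run dump_env once from an interactive shell and once from the cron job, then compare the two files.

```python
import json
import os


def dump_env(path):
    """Write the current process environment to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(dict(os.environ), fh, indent=2, sort_keys=True)


def load_env(path):
    """Read back an environment previously written by dump_env."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def diff_env(reference, other):
    """Return variables from `reference` that are missing or different in `other`.

    Each entry maps the variable name to (reference value, other value);
    the second element is None when the variable is absent from `other`.
    """
    return {
        name: (value, other.get(name))
        for name, value in reference.items()
        if other.get(name) != value
    }
```

Calling diff_env(load_env(shell_dump), load_env(cron_dump)) gives a short list of candidates, which can then be set one at a time in the cron job until the error disappears.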

Ironically, when I pointed TMPDIR at a known directory, the script left that directory empty. I’m not sure whether the files were cleaned up before the script quit, but I’m pretty confused, and I suspect a bug in tesseract or somewhere nearby.

Finally, with TMPDIR set to a known, existing path (not /tmp, obviously), I’m up and running.
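
The same fix can also be applied from inside the script instead of the crontab. The sketch below assumes that pytesseract creates its scratch files through Python's tempfile module and that the tesseract subprocess inherits the script's environment; the helper name use_tmpdir and whatever directory you pass to it are placeholders of mine.

```python
import os
import tempfile


def use_tmpdir(path):
    """Point this process and its child processes at `path` for temp files.

    Returns the directory tempfile actually settled on, so the caller can
    verify that `path` was accepted (tempfile falls back to other
    locations if `path` is not writable).
    """
    os.makedirs(path, exist_ok=True)
    os.environ["TMPDIR"] = path   # inherited by subprocesses started later
    tempfile.tempdir = None       # drop the cached value so TMPDIR is re-read
    return tempfile.gettempdir()
```

Call it once at the top of the script, before the first pytesseract call, and compare the return value with the path you passed in: a mismatch means that directory is not writable from the cron job either.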
