I’m writing a very simple web page scraper using Python and Beautiful Soup.
One key piece of information is displayed as a PNG image, so I’ve had to reach for PIL and pytesseract.
The code is extremely simple and works when I run it as my own user. The image loads, as the print output shows, but image_to_string appears to raise the error.
import base64
from io import BytesIO

from PIL import Image
import pytesseract

def extract_number(encoded_img):  # function name is illustrative
    # Decode the base64 payload and open it as an image
    img_data = base64.b64decode(encoded_img)
    img = Image.open(BytesIO(img_data))
    print(img.format, img.size, img.mode)
    # Use pytesseract to extract the number: single text line, digits and separators only
    custom_config = r'--psm 7 -c tessedit_char_whitelist=0123456789.,'
    return pytesseract.image_to_string(img, config=custom_config).strip()

encoded_img = 'iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8AAAAASUVORK5CYII='
However, when I run it from a cron job (after sorting out the venv and dependencies), I get the impossible message from the title:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 270: invalid start byte
Trying to set the LANG or LC_* env variables did not help.
I’m using Python 3 on macOS Sonoma, not sure if that matters.
Any ideas?
After dumping my entire user environment and loading it within my script, I got the script to run successfully.
Eliminating the variables one by one, I narrowed it down to TMPDIR, which defaulted to /tmp under cron; tesseract was apparently unable to write there.
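For anyone repeating this elimination, here is a minimal sketch of the comparison step. It assumes you saved the output of `env` from your login shell to a file and want to see which variables differ from what the cron-launched script sees. The helper names parse_env_dump and env_diff are made up for illustration, and values containing newlines are not handled.

```python
import os

def parse_env_dump(text):
    """Parse `env`-style output (NAME=value per line) into a dict."""
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # skip lines without '=' (e.g. continuation of a multi-line value)
            env[name] = value
    return env

def env_diff(reference, current):
    """Return {name: (reference_value, current_value)} for variables that differ."""
    names = set(reference) | set(current)
    return {
        name: (reference.get(name), current.get(name))
        for name in sorted(names)
        if reference.get(name) != current.get(name)
    }
```

Inside the cron-run script, something like env_diff(parse_env_dump(open(dump_path).read()), dict(os.environ)) would list the candidates to test one at a time, where dump_path is wherever you saved the dump.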
Ironically, when I pointed TMPDIR at a known directory, the script left it empty. I’m not sure whether the files were cleaned up before exit, but I’m pretty confused and suspect a bug in tesseract or somewhere nearby.
In the end, setting TMPDIR to a known, existing path (not /tmp, obviously) got me up and running.
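A minimal sketch of that workaround done from inside the script rather than in the crontab. It assumes that pytesseract creates its intermediate files through Python's tempfile module (which honours TMPDIR) and that the tesseract child process inherits the environment; the helper name use_private_tmpdir is made up for illustration.

```python
import os
import tempfile

def use_private_tmpdir(path):
    """Create `path` if needed and make it the temp directory for this process."""
    path = os.path.abspath(os.path.expanduser(path))
    os.makedirs(path, exist_ok=True)
    os.environ["TMPDIR"] = path  # inherited by child processes such as tesseract
    tempfile.tempdir = None      # drop Python's cached choice so TMPDIR is looked up again
    return path
```

Call it once near the top of the script, before the first pytesseract call, for example use_private_tmpdir('~/ocr-tmp') (a hypothetical path). Setting TMPDIR on a line of its own at the top of the crontab should have the same effect without touching the code.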