ModuleNotFoundError for py-zerox Module #47

Open
shawn8888 opened this issue Oct 6, 2024 · 11 comments

@shawn8888

I'm encountering a ModuleNotFoundError when trying to import the py-zerox module in my Python project, despite having installed it successfully.

Environment

Python Version: 3.12
Installed Packages:

py-zerox                  0.0.3

Steps to Reproduce:

Install the py-zerox package using:

pip install py-zerox

Attempt to import the module in a Python script:

from pyzerox import zerox

Receive the following error:

ModuleNotFoundError: No module named 'pyzerox'

Could you please assist me in resolving this issue? Any guidance on ensuring that the py-zerox module is recognized would be greatly appreciated.

Thank you!
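
The distribution name (py-zerox) and the import name (pyzerox) are different things, so a package can appear in `pip list` while the module is still not importable. A minimal sketch for checking both from the same interpreter that runs the script; `diagnose` is a hypothetical helper name, not part of py-zerox:

```python
import importlib.util
from importlib import metadata


def diagnose(dist_name: str, module_name: str) -> dict:
    """Report whether a distribution is installed and whether a module is importable."""
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None  # pip does not know this distribution in this environment
    # find_spec returns None for a top-level module that cannot be found
    spec = importlib.util.find_spec(module_name)
    return {"installed_version": version, "module_importable": spec is not None}


if __name__ == "__main__":
    print(diagnose("py-zerox", "pyzerox"))
```

If a version is reported but the module is not importable, the installed build most likely does not ship a `pyzerox` package, which would be consistent with reinstalling from the Git repository fixing the import.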

@shawn8888
Author

shawn8888 commented Oct 7, 2024

Found a solution here:
#41

pip uninstall py-zerox
pip install git+https://github.com/getomni-ai/zerox.git

Then I created a .py file:

import os
from pyzerox import zerox
import asyncio

async def main():
    # Set your OpenAI API key
    os.environ["OPENAI_API_KEY"] = "mykey"

    # Path to the PDF file you want to process
    file_path = "PasnewB.PDF"

    # Call the zerox function
    result = await zerox(file_path=file_path, model="gpt-4o-mini", output_dir="./output")

    # Print the Markdown result
    print(result)

# Run the main function
asyncio.run(main())

The ModuleNotFoundError is fixed. However, I now get a different error:

C:\Backup\Projects\python>python hello_zerox.py
Traceback (most recent call last):
  File "C:\Backup\Projects\python\hello_zerox.py", line 19, in <module>
    asyncio.run(main())
  File "C:\Python312\Lib\asyncio\runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\asyncio\base_events.py", line 685, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\Backup\Projects\python\hello_zerox.py", line 13, in main
    result = await zerox(file_path=file_path, model="gpt-4o-mini", output_dir="./output")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\site-packages\pyzerox\core\zerox.py", line 91, in zerox
    select_pages = sorted(select_pages)
                   ^^^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not iterable

Please help! Thanks!
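
The traceback shows `sorted(select_pages)` being called while `select_pages` is still `None`, its value when the caller does not pass it. A minimal sketch of the kind of guard that avoids this; it is an illustration only, not the actual change in PR #40, and `normalize_select_pages` is a hypothetical name:

```python
from typing import Iterable, List, Optional, Union


def normalize_select_pages(
    select_pages: Optional[Union[int, Iterable[int]]],
) -> Optional[List[int]]:
    """Return a sorted page list, or None when no selection was given."""
    if select_pages is None:
        return None  # sorted(None) would raise the TypeError shown above
    if isinstance(select_pages, int):
        return [select_pages]  # a single page number becomes a one-item list
    return sorted(select_pages)
```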

@pradhyumna85
Contributor

@shawn8888, the fix for the second problem has already been submitted as PR #40, which is not merged yet. You can use it in the meantime by uninstalling your py-zerox package and reinstalling from the PR branch:

pip install git+https://github.com/pradhyumna85/zerox.git@formatting-control

@tylermaran, @annapo23, could you please review and merge PR #40?

@shawn8888
Author

shawn8888 commented Oct 7, 2024

@pradhyumna85
Thank you for your reply! I have uninstalled 0.0.5 and reinstalled 0.0.6.
However, I got another error.
I use the OpenAI API, and the key looks fine to me.

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm.set_verbose=True'.

ERROR:root:Failed to process image Error:
    Error in Completion Response. Error: litellm.BadRequestError: OpenAIException - Error code: 400 - {'error': {'message': 'Unrecognized request argument supplied: output_dir', 'type': 'invalid_request_error', 'param': None, 'code': None}}
    Please check the status of your model provider API status.

ZeroxOutput(completion_time=2388.78, file_name='cs101', input_tokens=0, output_tokens=0, pages=[Page(content='', content_length=0, page=1)])

@pradhyumna85
Contributor

@shawn8888, the parameter output_dir has been replaced by output_file_path, which is the path of the output Markdown file rather than a directory.
See: https://github.com/pradhyumna85/zerox/tree/formatting-control?tab=readme-ov-file#usage-1
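
For anyone migrating a script that used `output_dir`, a minimal sketch of deriving a Markdown file path from the old directory value. `default_output_file_path` is a hypothetical helper and the `<stem>.md` naming is an assumption, not something the library requires:

```python
from pathlib import Path


def default_output_file_path(file_path: str, output_dir: str = "./output") -> str:
    """Derive '<output_dir>/<input file stem>.md' for a given input file."""
    return str(Path(output_dir) / f"{Path(file_path).stem}.md")


# Intended use with the formatting-control branch (requires pyzerox and an API key):
# result = await zerox(
#     file_path=file_path,
#     model="gpt-4o-mini",
#     output_file_path=default_output_file_path(file_path),
# )
```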

@shawn8888
Author

@pradhyumna85

C:\Backup\Projects\python>python hello_zerox2.py

Traceback (most recent call last):
  File "C:\Backup\Projects\python\hello_zerox2.py", line 48, in <module>
    result = asyncio.run(main())
             ^^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\asyncio\runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\asyncio\base_events.py", line 685, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\Backup\Projects\python\hello_zerox2.py", line 40, in main
    result = await zerox(file_path = file_path, model = model, output_file_path = output_file_path,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\site-packages\pyzerox\core\zerox.py", line 180, in zerox
    await f.write(page_content)
  File "C:\Python312\Lib\site-packages\aiofiles\threadpool\utils.py", line 43, in method
    return await self._loop.run_in_executor(self._executor, cb)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'gbk' codec can't encode character '\xb2' in position 1751: illegal multibyte sequence

This "gbk codec" error looks like a locale problem: my Windows is set to use Chinese for non-Unicode programs. Any solutions?
Thanks!
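
When a text file is opened without an explicit encoding, Python falls back to the locale's preferred encoding, which is GBK on a Chinese-locale Windows system. The sketch below reproduces the failure with the character from the traceback and shows that an explicit `encoding="utf-8"` avoids it. It assumes, without having checked the pyzerox source, that the output file is opened without an encoding argument:

```python
import tempfile
from pathlib import Path

text = "Area: 25 m\xb2"  # contains U+00B2, the character named in the traceback

# Encoding with the GBK codec fails, matching the reported error.
try:
    text.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

# Writing with an explicit UTF-8 encoding does not depend on the system locale.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "output.md"
    out.write_text(text, encoding="utf-8")
    round_trip = out.read_text(encoding="utf-8")

print(gbk_ok, round_trip == text)
```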

@shawn8888
Author

pip install git+https://github.com/pradhyumna85/zerox.git@formatting-control

I was going to test the installation above on a different PC and got this error:


     Error during installation: Please install Poppler manually from https://poppler.freedesktop.org/
      Pre-install script failed: Command '['C:\\Python312\\python.exe', '-m', 'py_zerox.scripts.pre_install']' returned non-zero exit status 1.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for py-zerox
Failed to build py-zerox
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (py-zerox)

@pradhyumna85
Contributor

pradhyumna85 commented Oct 8, 2024

@shawn8888, since you are on Windows, install the Poppler utilities manually from the prebuilt binaries and then run pip install again.
Steps to install the prebuilt Poppler binaries on Windows:

  1. Download the latest prebuilt binary zip from https://github.com/oschwartz10612/poppler-windows/releases
  2. Unzip it to a directory and add the extracted Library/bin folder to the PATH variable (see https://stackoverflow.com/questions/44272416/how-to-add-a-folder-to-path-environment-variable-in-windows-10-with-screensho).
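
After editing PATH, open a new terminal and confirm that a Poppler executable is actually visible before retrying pip. A minimal sketch using `pdftoppm`, one of the Poppler command-line tools; whether the pre-install script looks for this particular executable is an assumption, and `find_tool` is a hypothetical helper:

```python
import shutil
from typing import Optional


def find_tool(name: str = "pdftoppm") -> Optional[str]:
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    location = find_tool()
    print(location if location else "pdftoppm not found on PATH")
```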

@shawn8888
Author

It works!
Could you please also check the "gbk codec" error above? Maybe change the output.md file encoding to be UTF-8?
Thanks!

@pradhyumna85
Contributor

Set an environment variable (outside Python, not from within the script) named PYTHONIOENCODING with the value utf-8 and see if that solves the issue.

@shawn8888
Author

@pradhyumna85 You are the best! After setting PYTHONIOENCODING=utf-8 in CMD, the program works!

I have a couple of questions:

  1. How can I make this a default setting so I don't have to type it every time I run the script?
  2. When the PDF file contains Chinese characters, I encounter an error, even though I’ve tested that gpt-4o-mini does support OCR for Chinese. The error message is:
    UnicodeEncodeError: 'gbk' codec can't encode character '\U0001f914' in position 2: illegal multibyte sequence.

Here is the pdf I tested:
PasnewB.PDF

@pradhyumna85
Contributor

pradhyumna85 commented Oct 14, 2024

For 1, you can set it at the OS level, e.g. on Windows as a user or system environment variable.

For 2, I am not sure either; if you find a solution, please share it here. Edit: set an environment variable (outside Python) named PYTHONUTF8 with the value 1 and see if that solves the issue.

Also, I would suggest working on Linux, which would make your life much easier. If you are on Windows, I recommend using WSL 2.
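
The two variables do different things: PYTHONIOENCODING only overrides the encoding of stdin, stdout and stderr, while PYTHONUTF8=1 turns on Python's UTF-8 mode, which also changes the default encoding used when files are opened without an explicit `encoding`. A minimal sketch for inspecting what an interpreter will actually use, and for confirming that a child process picks up PYTHONUTF8; the helper names are hypothetical:

```python
import locale
import os
import subprocess
import sys


def report_encodings() -> dict:
    """Collect the encodings the current interpreter uses by default."""
    return {
        "stdout": getattr(sys.stdout, "encoding", None),
        "preferred": locale.getpreferredencoding(False),
        "utf8_mode": sys.flags.utf8_mode,
    }


def child_utf8_mode(value: str) -> int:
    """Start a child interpreter with PYTHONUTF8=value and return its utf8_mode flag."""
    env = dict(os.environ, PYTHONUTF8=value)
    out = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())


if __name__ == "__main__":
    print(report_encodings())
    print(child_utf8_mode("1"))
```

To make the setting persistent on Windows without the settings dialog, running `setx PYTHONUTF8 1` once in Command Prompt should store it as a user environment variable; it takes effect in terminals opened afterwards, not in the current one.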
