Support more file types #199
-
In the original project, there is a pull request (the-paperless-project/paperless#600) by @Tooa that was never merged. I'd like to see more file type support in paperless. Here are a couple of notes.
Regarding the Python part:
I've also looked into using some native Python libraries to support individual file types such as
-
Hi, I saw the PR, but didn't look into the code yet (not sure how applicable it would be for the I didn't see anywhere how to store the "native" document. I always get a parse error when I try to upload a I totally agree with the optional part. I actually wouldn't mind if I could, e.g., pre-parse any other documents and upload a native+PDF version to paperless. That should work for most of my use cases...
-
First part is ready: Tika. Todo: Gotenberg.
-
Gotenberg is working too now.
-
Current version: https://github.com/jovandeginste/paperless_tika
-
For some reason, it only picks up a limited number of file types. When, e.g., I add a
Any idea what I'm missing? 😕
-
Not sure. Need to debug later. Btw, we can totally add this to the main repo. The Python dependencies aren't that big of a deal. It's just that having both web servers available should be optional and configurable. If they are unavailable or not configured, we can simply not announce the parser to paperless. Provide some documentation on how to adjust the docker-compose file to enable this feature, and we're done. So if you want to prepare a pull request into a new feature branch, let's say
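To make the "optional and configurable" idea concrete, here is a minimal sketch of how a parser could stay unannounced when the two servers are not configured. The environment variable names (`TIKA_ENDPOINT`, `GOTENBERG_ENDPOINT`), the function names, and the shape of the returned dictionary are all assumptions for illustration, not paperless's actual settings or plugin API.

```python
import os


def tika_parser_declaration(env=os.environ):
    """Describe the Tika/Gotenberg parser, or return None if it is not configured.

    TIKA_ENDPOINT and GOTENBERG_ENDPOINT are hypothetical variable names.
    """
    tika = env.get("TIKA_ENDPOINT")
    gotenberg = env.get("GOTENBERG_ENDPOINT")
    if not tika or not gotenberg:
        # One or both servers are missing: do not announce the parser at all.
        return None
    return {
        "tika_endpoint": tika,
        "gotenberg_endpoint": gotenberg,
        # MIME types this parser would claim, mapped to a default extension.
        "mime_types": {
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx",
            "application/vnd.oasis.opendocument.text": ".odt",
        },
    }


def announced_parsers(declarations):
    """Keep only the parsers that actually declared themselves."""
    return [d for d in declarations if d is not None]
```

With neither variable set, `announced_parsers([tika_parser_declaration()])` is empty, so uploads of those file types would be rejected exactly as they are today.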
-
Any other file types on your wish list?
-
I'm not a Python dev, but I was working on a similar (personal) project in Golang. My project was standalone (no central web server).
I would really like to see more file type support in paperless. I spun up a separate Tika server for generic document content and metadata retrieval.
Since this is a web app, the browser cannot show many file types natively. To get around this, we could convert anything "foreign" to a PDF using a separate process. I found the gotenberg project interesting for this.
So we could have a hook on incoming "foreign" files that sends each one to Tika for content and metadata; then, once we know the file's type, we can send it to gotenberg to get a PDF version for web display. Of course, when the user retrieves the file, they can have it in its original form (e.g. docx).
What do you think?
I can help with the Tika and Gotenberg part, I think, but again, I'm not a Python dev...
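The flow described above (Tika for text and metadata, then Gotenberg for a PDF rendition) could be sketched roughly like this in Python. The server addresses, the function names, and the injectable `send` transport are assumptions for illustration. The `/tika` and `/meta` routes follow the Tika Server REST API; the `/forms/libreoffice/convert` route matches Gotenberg 7, and other releases may use a different route.

```python
import json
import uuid
import urllib.request

TIKA_URL = "http://localhost:9998"       # assumed Tika server address
GOTENBERG_URL = "http://localhost:3000"  # assumed Gotenberg address


def http_send(method, url, body, headers):
    """Default transport: perform the HTTP request and return the response bytes."""
    req = urllib.request.Request(url, data=body, headers=headers, method=method)
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def multipart_body(filename, data):
    """Encode one file as multipart/form-data under the field name 'files'."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def process_foreign_file(filename, data, send=http_send):
    """Return (text, metadata, pdf_bytes) for a non-PDF upload."""
    # Step 1: Tika extracts the plain-text content and the metadata.
    text = send("PUT", f"{TIKA_URL}/tika", data, {"Accept": "text/plain"}).decode("utf-8")
    meta = json.loads(send("PUT", f"{TIKA_URL}/meta", data, {"Accept": "application/json"}))
    # Step 2: Gotenberg renders a PDF version for display in the browser.
    body, content_type = multipart_body(filename, data)
    pdf = send("POST", f"{GOTENBERG_URL}/forms/libreoffice/convert", body,
               {"Content-Type": content_type})
    return text, meta, pdf
```

The caller would store the original bytes next to the returned PDF, so the user can still download the file in its original form.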