AWS Textract Markdown/HTML output #16

Open
ThomasDelteil opened this issue Feb 24, 2025 · 0 comments


ThomasDelteil commented Feb 24, 2025

AWS Textract supports Markdown/HTML output through the Textractor Python library; see https://aws-samples.github.io/amazon-textract-textractor/notebooks/document_linearization_to_markdown_or_html.html
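For reference, a minimal sketch of what that could look like, assuming the `amazon-textract-textractor` package is installed and AWS credentials are configured. The helper names `load_document` and `linearize`, the default region, and the file path in the usage comment are placeholders of mine, not part of the benchmark or the library; the linked notebook remains the authoritative example.

```python
def load_document(path: str, region_name: str = "us-east-1"):
    """Run Textract layout + table analysis on one file (requires AWS credentials).

    Hypothetical helper: the region default is an assumption, adjust as needed.
    """
    # Imported lazily so linearize() below can be used without the library installed.
    from textractor import Textractor
    from textractor.data.constants import TextractFeatures

    extractor = Textractor(region_name=region_name)
    # Synchronous call; multi-page PDFs would likely need the asynchronous
    # start_document_analysis API with an S3 location instead.
    return extractor.analyze_document(
        file_source=path,
        features=[TextractFeatures.LAYOUT, TextractFeatures.TABLES],
    )


def linearize(document, fmt: str = "markdown") -> str:
    """Return the document as Markdown or HTML text."""
    if fmt == "markdown":
        return document.to_markdown()
    if fmt == "html":
        return document.to_html()
    raise ValueError(f"unsupported format: {fmt!r}")


# Example usage (placeholder path, not run here because it calls AWS):
# document = load_document("sample_page.png")
# print(linearize(document, "markdown"))
```

Requesting both `LAYOUT` and `TABLES` is what ties into the pricing point: enabling the table feature changes the per-page cost.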

Also, given the table-heavy nature of the dataset, it would make sense to use the AWS Textract Tables feature ($15 per 1,000 pages). Note that this will change the price comparison.

Amazon Bedrock Data Automation also offers a PDF-to-Markdown service ($10 per 1,000 pages).
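To make the effect on the price comparison concrete, here is a rough sketch that scales the two per-1,000-page prices quoted above to a given page count. The 5,000-page figure is a made-up example, not the size of the actual dataset, and the dictionary keys are labels I chose for illustration.

```python
# Prices in USD per 1,000 pages, as quoted in this issue.
PRICE_PER_1000_PAGES_USD = {
    "textract_tables": 15.0,
    "bedrock_data_automation": 10.0,
}


def estimated_cost_usd(service: str, pages: int) -> float:
    """Estimate the cost of processing `pages` pages with the given service."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return PRICE_PER_1000_PAGES_USD[service] * pages / 1000


# Hypothetical page count, for illustration only.
for service in PRICE_PER_1000_PAGES_USD:
    print(f"{service}: ${estimated_cost_usd(service, 5000):.2f}")
```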
