Updated prompt to extract text and format it in Markdown, including a… #200

dzemeuksis · 2024-12-22T17:13:45Z

Updated prompt to extract text and format it in Markdown, including additional visual details, instead of only describing the image.

The markitdown module is designed to extract text from various documents and save it in Markdown format, as stated in its purpose. This change aligns the default behavior of image processing with the overall goal of the module.

Previously, providing an image resulted in a plain text description, which likely did not meet user expectations. Users are more likely to expect extracted text and formatting when supplying an image, making this change a better fit for the module's intended functionality.

…dditional visual details, instead of only describing the image.

dzemeuksis · 2024-12-22T17:15:16Z

@microsoft-github-policy-service agree

PetrAPConsulting · 2025-01-06T08:48:41Z

Hi @dzemeuksis,
I have similar suggestion. I am using this prompt (varified wit GPT 4o) which cover the most pictures embedded in the documents. I do not want to open separate pull request and yours is the most suitable to propose it.

"""Convert this image into a structured markdown representation that preserves its data and relationships. Follow these conversion guidelines based on content type:

For Tables:
Create a proper markdown table with headers and data rows. For example:

Column1	Column2
Data1	Data2

For Mathematical Formulas:
Use LaTeX notation within markdown delimiters. For example:
$$ y = mx + b $$

For Charts and Graphs:

Extract the actual data points and represent them in a markdown table
Include axis labels, units, and scale information
Describe the relationship pattern (linear, exponential, etc.) as a markdown header

For Flowcharts and Diagrams:
Convert to mermaid markdown syntax when possible:

graph LR
    A-->B
    B-->C

For Process Flows:
Create a numbered list with clear step progression and any branching conditions.

For Technical Diagrams:

Create a hierarchical structure using markdown headers
List components and their relationships
Preserve any measurements or specifications in tables

Additional Guidelines:

Maintain numerical precision exactly as shown
Preserve all labels and annotations as markdown text
Include metadata as key-value pairs at the top
Use markdown quotes for any explanatory text
Structure the output to prioritize machine readability
Preserve relationships between data elements using markdown hierarchy"""

Let me know if you see it as reasonable to push it together.

Petr

dzemeuksis · 2025-01-07T15:48:17Z

@PetrAPConsulting , that sounds great! What do you think about combining our proposals into something like this:

Analyze the image and extract all visible text in the original language. 
Reproduce the extracted text in a structured Markdown format, preserving any formatting such as headings, bullet points, and highlights. 
Ensure the output accurately reflects the structure and style of the original document.

Follow these additional guidelines based on the content type:

- **Tables:** Create a proper markdown table with headers and data rows.
- **Mathematical Formulas:** Use LaTeX notation within markdown delimiters, e.g., `$$ y = mx + b $$`.
- **Charts and Graphs:**
  - Extract data points into a markdown table.
  - Include axis labels, units, and scale information.
  - Describe patterns (e.g., linear, exponential) under markdown headers.
- **Flowcharts and Diagrams:**
  - Use mermaid markdown syntax where possible.
  - For process flows, create a numbered list with clear step progression.
  - For technical diagrams, list components and their relationships in a structured way, preserving measurements/specifications in tables.

For any visual elements that cannot be represented directly in Markdown, describe them in plain text under a section titled "Visual Notes."

Maintain numerical precision exactly as shown, preserve all labels and annotations as markdown text, and structure the output for both human and machine readability. Output only the converted Markdown text without any additional commentary or explanations.

PetrAPConsulting · 2025-01-07T17:31:47Z

@dzemeuksis
Hi,

I suppose you proposed prompt is fine but if you would agree I would extend it in some content types and keept example of mermaid. But again, it's up to you.

Analyze the image and extract all visible text in the original language. Reproduce the extracted text in a structured Markdown format, preserving any formatting such as headings, bullet points, and highlights.
Ensure the output accurately reflects the structure and style of the original document.

Follow these additional guidelines based on the content type:

Tables:

Create exact markdown representation of the table using markdown syntax (|column1|column2|)
Create a separator row (|---|---|) after the header
Transcribe all values exactly as they appear in the table

Mathematical Formulas:

Use LaTeX notation within markdown delimiters, e.g., $$ y = mx + b $$

Charts and Graphs:

Identify the graph type (bar, line, pie, etc.)
Extract data points into a markdown table
Include axis labels, units, and scale information
Describe patterns (e.g., linear, exponential) under markdown headers
Record maximums, minimums, and important values

Flowcharts and Diagrams:

Use mermaid markdown syntax where possible:

  graph LR
      A-->B
      B-->C

For process flows, create a numbered list with clear step progression and any branching conditions
For technical diagrams, list components and their relationships in a structured way, preserving measurements/specifications in tables

For any visual elements that cannot be represented directly in Markdown, describe them in plain text under a section titled "Visual Notes."

Maintain numerical precision exactly as shown, preserve all labels and annotations as markdown text, and structure the output for both human and machine readability. Output only the converted Markdown text without any additional commentary or explanations.

PetrAPConsulting · 2025-01-07T17:36:23Z

only one comment, to keep price of conversion reasonable pictures should not be bigger than ~1000x1000 px

dzemeuksis · 2025-01-09T14:39:06Z

@PetrAPConsulting
Do you suggest that this should be implemented in the module, or should we just mention it in the documentation?

PetrAPConsulting · 2025-01-09T15:34:09Z

@dzemeuksis
good question. I suppose it would make sense to put it directly to _markitdown.py. Actually I have it in my environment like this:
def _get_llm_description(self, local_path, extension, client, model, prompt=None):
if prompt is None or prompt.strip() == "":
prompt = """Analyze the image content and convert this image into a structured markdown representation with focus on preserving data relationships and machine readability. Follow these conversion guidelines based on content type:

Content Type:
- Identify whether it's a table, graph, chart, formula, flowchart, diagram, process flow, technical diagram or combination
  ....................continue

but if you make a pull request and Gagb or Afourney refuse to put it in the code directly they probably will suggest another options. I suppose it could save time to a lot of people who are planning to use LLM. Why should each of them optimize prompt even it isn't rocket science.

dzemeuksis · 2025-01-09T15:50:54Z

@PetrAPConsulting
Sure, but I asked about limiting or autoresize of pictures. Should this feature be somehow supported by the module? Do you mean expanding this pull request in that direction right now, or is it more of a suggestion to consider for the next steps?

PetrAPConsulting · 2025-01-09T17:10:04Z

It was suggestion for anybody when using it. Create script in react which resizes pictures it's 2 minutes work with Claude or ChatGPT.

PetrAPConsulting

From my experience suggested prompt is very useful compare to generic that was there originally.

Updated prompt to extract text and format it in Markdown, including a…

3b8ecac

…dditional visual details, instead of only describing the image.

dzemeuksis and others added 2 commits January 17, 2025 14:29

I changed the prompt as suggested in the PR comments.

ca5a251

Merge branch 'main' into feature/llm-description-in-markdown

1c9a938

PetrAPConsulting reviewed Jan 17, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Updated prompt to extract text and format it in Markdown, including a… #200

Updated prompt to extract text and format it in Markdown, including a… #200

dzemeuksis commented Dec 22, 2024

dzemeuksis commented Dec 22, 2024

PetrAPConsulting commented Jan 6, 2025 •

edited

Loading

dzemeuksis commented Jan 7, 2025

PetrAPConsulting commented Jan 7, 2025 •

edited

Loading

PetrAPConsulting commented Jan 7, 2025

dzemeuksis commented Jan 9, 2025

PetrAPConsulting commented Jan 9, 2025

dzemeuksis commented Jan 9, 2025 •

edited

Loading

PetrAPConsulting commented Jan 9, 2025

PetrAPConsulting left a comment

Updated prompt to extract text and format it in Markdown, including a… #200

Are you sure you want to change the base?

Updated prompt to extract text and format it in Markdown, including a… #200

Conversation

dzemeuksis commented Dec 22, 2024

dzemeuksis commented Dec 22, 2024

PetrAPConsulting commented Jan 6, 2025 • edited Loading

dzemeuksis commented Jan 7, 2025

PetrAPConsulting commented Jan 7, 2025 • edited Loading

PetrAPConsulting commented Jan 7, 2025

dzemeuksis commented Jan 9, 2025

PetrAPConsulting commented Jan 9, 2025

dzemeuksis commented Jan 9, 2025 • edited Loading

PetrAPConsulting commented Jan 9, 2025

PetrAPConsulting left a comment

Choose a reason for hiding this comment

PetrAPConsulting commented Jan 6, 2025 •

edited

Loading

PetrAPConsulting commented Jan 7, 2025 •

edited

Loading

dzemeuksis commented Jan 9, 2025 •

edited

Loading