Is there a way to ignore hidden text when extracting words? #1233

EdmundsEcho · 2024-12-05T18:44:05Z

I have a function that calls page.extract_words(). For some PDF files, the extracted text contains extra characters that are not displayed in the document.

For instance:

Display	Extracted
Date	D0ate
Description	DUescription

def extract_words(page, page_number, object_id_start) -> list[PreElement]:
    
    words = []
    object_id = object_id_start

    for word in page.extract_words(x_tolerance=3):
        words.append({
            "ObjectId": object_id,
            "Page": page_number,
            "Text": word["text"],
            "Bounds": make_bounds(word, page.height)
        })
        object_id += 1

    return words

Answered by jsvine

Dec 9, 2024

jsvine · 2024-12-09T04:04:51Z

Maintainer

Hi @EdmundsEcho, this will depend on the particular PDF in question. Can you share it here? If not, I'd suggest this general approach:

1. Examine the objects in page.chars. Do the invisible chars have different attributes than the visible ones?
2. If so, use page.filter(...).extract_words(...); if you search for "filter" in the discussions here, you can find some examples of this being used for similar situations/solutions.
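
The first step, checking whether the invisible chars differ from the visible ones, could be sketched roughly as follows. The helper name `attribute_profile`, the choice of attributes, and the sample values are assumptions for illustration; the sample dicts only stand in for entries of `page.chars`.

```python
from collections import Counter


def attribute_profile(chars, keys=("fontname", "size", "ncs", "non_stroking_color")):
    """Count how many chars share each combination of the given attributes."""
    profile = Counter()
    for char in chars:
        # Colors may be lists or tuples; normalize them so they are hashable.
        values = tuple(
            tuple(v) if isinstance(v, (list, tuple)) else v
            for v in (char.get(k) for k in keys)
        )
        profile[values] += 1
    return profile


# Stand-in data shaped like entries of page.chars (hypothetical values).
sample_chars = [
    {"text": "D", "fontname": "Helvetica", "size": 10,
     "ncs": "DeviceGray", "non_stroking_color": (0,)},
    {"text": "0", "fontname": "Helvetica", "size": 10,
     "ncs": "DeviceRGB", "non_stroking_color": (1, 1, 1)},
    {"text": "a", "fontname": "Helvetica", "size": 10,
     "ncs": "DeviceGray", "non_stroking_color": (0,)},
]

# Print each attribute combination with its frequency, most common first.
for combo, count in attribute_profile(sample_chars).most_common():
    print(count, combo)
```

With a real document you would pass `page.chars` instead of `sample_chars`. A small group whose values differ from the bulk of the text is a candidate for the hidden characters, and a predicate built on that difference can then be handed to `page.filter(...)`.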

3 replies

EdmundsEcho Dec 10, 2024
Author

Thank you for your response. The PDF includes a client's personal information, so I cannot share it. I will try your approach and report back.

EdmundsEcho Dec 12, 2024
Author

I'm finding different ncs values (DeviceGray, DeviceRGB) and different non_stroking_color values. The hidden characters seem to be the ones with color [1, 1, 1] (white in DeviceRGB). I'll be building a filter accordingly. Thank you.
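
For a PDF like this one, a minimal filter might look like the sketch below. It assumes that the hidden characters are exactly the ones drawn in white, which will not hold for every file; `is_white` and `keep_visible` are hypothetical names, and the sample objects are made up.

```python
def is_white(color):
    """True if the color is white in DeviceGray (1 component) or DeviceRGB (3)."""
    if isinstance(color, (int, float)):
        color = (color,)  # tolerate a bare scalar gray value
    if not isinstance(color, (list, tuple)):
        return False  # None or an unexpected type: treat as not white
    # Four components would be CMYK, where all ones is not white.
    return len(color) in (1, 3) and all(c == 1 for c in color)


def keep_visible(obj):
    """Predicate for page.filter: drop white chars, keep every other object."""
    if obj.get("object_type") != "char":
        return True
    return not is_white(obj.get("non_stroking_color"))


# Stand-in objects shaped like pdfplumber chars (hypothetical values).
objs = [
    {"object_type": "char", "text": "D", "non_stroking_color": (0,)},
    {"object_type": "char", "text": "0", "non_stroking_color": (1, 1, 1)},
    {"object_type": "char", "text": "a", "non_stroking_color": (0,)},
]
visible_text = "".join(o["text"] for o in objs if keep_visible(o))
print(visible_text)
```

On a real page this would be used as `page.filter(keep_visible).extract_words(x_tolerance=3)`. White text on a dark background would also be dropped, so check the result against the rendered page.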

EdmundsEcho Dec 12, 2024
Author

Here is something that may be of use to anyone who wants a configurable filter; it should work with any of the char properties.

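from typing import Any, Callable

# Assumption: Char is not defined in this post; pdfplumber chars are plain dicts.
Char = dict[str, Any]


class CharFilter: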
    def __init__(self, config: dict):
        self.config = config

    def make(self) -> Callable[[Char], bool]:
        exclude_criteria = self.config.get("filter_chars", {}).get("exclude", [])
        include_criteria = self.config.get("filter_chars", {}).get("include", [])

        if not exclude_criteria and not include_criteria:
            return lambda _: True

        def __filter__(char: Char) -> bool:
            # Check exclusion criteria
            for criterion in exclude_criteria:
                for key, criteria in criterion.items():

                    # List of string values
                    if isinstance(criteria, list) and all(isinstance(v, str) for v in criteria):
                        if char.get(key) in criteria:
                            return False

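                    # List of integer values (compare as tuple, so a list-valued
                    # attribute matches the same way as in the inclusion check)
                    elif isinstance(criteria, list) and all(isinstance(v, int) for v in criteria):
                        char_key = char.get(key)
                        if isinstance(char_key, (list, tuple)) and tuple(char_key) == tuple(criteria):
                            return False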

                    # Single string value
                    elif isinstance(criteria, str):
                        if char.get(key) == criteria:
                            return False

                    # Single integer value
                    elif isinstance(criteria, int):
                        if char.get(key) == criteria:
                            return False

            # Check inclusion criteria
            for criterion in include_criteria:
                for key, criteria in criterion.items():
                    if isinstance(criteria, list) and all(isinstance(v, str) for v in criteria):
                        # List of string values
                        if char.get(key) not in criteria:
                            return False

                    elif isinstance(criteria, list) and all(isinstance(v, int) for v in criteria):
                        # List of integer values (compare as tuple)
                        char_key = char.get(key)
                        if char_key and tuple(char_key) != tuple(criteria):
                            return False
                        if not char_key:
                            return False

                    elif isinstance(criteria, str):
                        # Single string value
                        if char.get(key) != criteria:
                            return False

                    elif isinstance(criteria, int):
                        # Single integer value
                        if char.get(key) != criteria:
                            return False

            return True  # Pass inclusion criteria

        return __filter__

And the corresponding TOML snippet:

[filter_chars]
exclude = [{ fontname = ["Helvetica", "Courier"] }]
include = [{ non_stroking_color = [0] }]
