
Add sentence labeler #3570

Open · wants to merge 4 commits into master from add-sentence-labeler

Conversation

MattGPT-ai (Contributor):

This is a simplified version of the chunking utility provided in #3520

MattGPT-ai force-pushed the add-sentence-labeler branch 2 times, most recently from fdae4ef to a711cb6 (November 23, 2024 11:46)
flair/training_utils.py (outdated review thread, resolved)
MattGPT-ai force-pushed the add-sentence-labeler branch from c5a7277 to bc7fa10 (January 21, 2025 18:00)
MattGPT-ai (Contributor, Author):

I addressed your change request. Is there anything else you need me to change, or can this be merged?

alanakbik (Collaborator) left a comment:

Thanks a lot for adding this and sorry for taking so long to review! See the comments for suggested changes.

Generally, I think this is quite a useful helper function for all Flair dataset classes whose annotations are given as character offsets.

Regarding the chunking/truncation, it would be nice in the future to have such functionality attached to the Corpus class, similar to the filter_long_sentences method but with truncation or chunking rather than filtering. That way, it could be used with any corpus.
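As an illustration of that idea, here is a minimal sketch of a per-sentence truncation helper that could later be mapped over the train/dev/test splits of a Corpus. The name truncate_sentence and the label-copying details are assumptions for illustration, not code from this PR:

from flair.data import Sentence

def truncate_sentence(sentence: Sentence, token_limit: int, label_type: str = "ner") -> Sentence:
    # Build a new Sentence from the first token_limit token strings.
    truncated = Sentence([token.text for token in sentence.tokens[:token_limit]])
    # Copy over only the span labels that fall entirely inside the limit.
    for span in sentence.get_spans(label_type):
        last_idx = span.tokens[-1].idx  # Token.idx is 1-based
        if last_idx <= token_limit:
            new_span = truncated[span.tokens[0].idx - 1 : last_idx]
            new_span.add_label(label_type, span.get_label(label_type).value)
    return truncated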



def create_labeled_sentence_from_tokens(
    tokens: Union[list[Token]], token_entities: list[TokenEntity], type_name: str = "ner"
alanakbik (Collaborator):

The tokens could also be passed as a list of str, since you convert them into strings anyway in line 456. This would simplify the code a bit. Also, the Union in the signature is not necessary, so you could have tokens: list[str] instead of tokens: Union[list[Token]].

MattGPT-ai (Contributor, Author):

Does it make sense to allow either, check the type, and convert when it's a Token?
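Something like the following would do that; a minimal sketch, assuming a hypothetical normalize_tokens helper rather than anything in this PR:

from typing import Union
from flair.data import Token

def normalize_tokens(tokens: Union[list[str], list[Token]]) -> list[str]:
    # Accept either raw strings or flair Token objects and return plain strings.
    return [t.text if isinstance(t, Token) else t for t in tokens]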

return sentence


def create_labeled_sentence(
alanakbik (Collaborator):

The name of the function is a bit underspecified. How about something along the lines of create_labeled_sentence_from_entity_offsets?

MattGPT-ai (Contributor, Author):

I'll change it to that

token_limit: numerical value that determines the maximum size of a chunk. use inf to not perform chunking

Returns:
A list of labeled Sentence objects representing the chunks of the original text
alanakbik (Collaborator):

There is a mismatch between the wording in the comment (list of labeled Sentence objects, chunking) and what actually happens (truncation, with only the first part of the text being returned). Intuitively, I'd say it makes more sense to have the function perform a chunking and so return a list of Sentence objects. Alternatively, you could leave it as it is, but refer to what the function does as truncation and clarify that it returns a single Sentence.
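For contrast, a chunking version would return one Sentence per window. A minimal sketch over pre-tokenized strings (the chunk_tokens name is hypothetical, and re-mapping the entity offsets into each chunk, the harder part, is omitted):

import math
from flair.data import Sentence

def chunk_tokens(tokens: list[str], token_limit: float = math.inf) -> list[Sentence]:
    # token_limit=inf disables chunking and yields a single Sentence.
    if math.isinf(token_limit):
        return [Sentence(tokens)]
    limit = int(token_limit)
    # Non-overlapping windows of at most `limit` tokens each.
    return [Sentence(tokens[i : i + limit]) for i in range(0, len(tokens), limit)]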

MattGPT-ai (Contributor, Author):

I actually do have code that does this in another PR, but a previous review of it got a bit more complicated. I think once this merges, I can rebase that code on top of this simpler base case and maybe refactor it in a better way to accomplish both functions. I'll rewrite the docstring; I forgot to change some of it when I copied from the chunking function.
