
Add sentence labeler #3570

Open · wants to merge 4 commits into master from add-sentence-labeler

Conversation

MattGPT-ai (Contributor):

This is a simplified version of the chunking utility provided in #3520

MattGPT-ai force-pushed the add-sentence-labeler branch 2 times, most recently from fdae4ef to a711cb6 (November 23, 2024 11:46)
flair/training_utils.py (outdated review thread, resolved)
MattGPT-ai force-pushed the add-sentence-labeler branch from c5a7277 to bc7fa10 (January 21, 2025 18:00)
MattGPT-ai (Contributor, Author):

I addressed your change request. Is there anything else you need me to change, or can this be merged?

alanakbik (Collaborator) left a comment:

Thanks a lot for adding this and sorry for taking so long to review! See the comments for suggested changes.

Generally, I think this is quite a useful helper function for all Flair dataset classes whose annotations are given as character offsets.

Regarding the chunking/truncation, it would be nice in the future to have such functionality attached to the Corpus class, similar to the filter_long_sentences method but with truncation or chunking rather than filtering. That way, it could be used with any corpus.
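As an illustration of that idea, here is a minimal sketch of a per-sentence truncation helper that could later be mapped over the train/dev/test splits of a Corpus. The name truncate_sentence and the label-copying details are assumptions for illustration, not code from this PR:

from flair.data import Sentence

def truncate_sentence(sentence: Sentence, token_limit: int, label_type: str = "ner") -> Sentence:
    # Build a new Sentence from the first token_limit token strings.
    truncated = Sentence([token.text for token in sentence.tokens[:token_limit]])
    # Copy over only the span labels that fall entirely inside the limit.
    for span in sentence.get_spans(label_type):
        last_idx = span.tokens[-1].idx  # Token.idx is 1-based
        if last_idx <= token_limit:
            new_span = truncated[span.tokens[0].idx - 1 : last_idx]
            new_span.add_label(label_type, span.get_label(label_type).value)
    return truncated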



def create_labeled_sentence_from_tokens(
    tokens: Union[list[Token]], token_entities: list[TokenEntity], type_name: str = "ner"
alanakbik (Collaborator):

The tokens could also be passed as a list of str, since you convert them into strings anyway in line 456. This would simplify the code a bit. Also, the Union in the signature is not necessary, so you could have tokens: list[str] instead of tokens: Union[list[Token]].

MattGPT-ai (Contributor, Author):

Does it make sense to allow either, check the type, and convert when it's a Token?
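Something like the following would do that; a minimal sketch, assuming a hypothetical normalize_tokens helper rather than anything in this PR:

from typing import Union
from flair.data import Token

def normalize_tokens(tokens: Union[list[str], list[Token]]) -> list[str]:
    # Accept either raw strings or flair Token objects and return plain strings.
    return [t.text if isinstance(t, Token) else t for t in tokens]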

return sentence


def create_labeled_sentence(
alanakbik (Collaborator):

The name of the function is a bit underspecified. How about something along the lines of create_labeled_sentence_from_entity_offsets?

MattGPT-ai (Contributor, Author):

I'll change it to that

token_limit: numerical value that determines the maximum size of a chunk. use inf to not perform chunking

Returns:
A list of labeled Sentence objects representing the chunks of the original text
alanakbik (Collaborator):

There is a mismatch between the wording in the comment (list of labeled Sentence objects, chunking) and what actually happens (truncation, with only the first part of the text being returned). Intuitively, I'd say it makes more sense to have the function perform a chunking and so return a list of Sentence objects. Alternatively, you could leave it as it is, but refer to what the function does as truncation and clarify that it returns a single Sentence.
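For contrast, a chunking version would return one Sentence per window. A minimal sketch over pre-tokenized strings (the chunk_tokens name is hypothetical, and re-mapping the entity offsets into each chunk, the harder part, is omitted):

import math
from flair.data import Sentence

def chunk_tokens(tokens: list[str], token_limit: float = math.inf) -> list[Sentence]:
    # token_limit=inf disables chunking and yields a single Sentence.
    if math.isinf(token_limit):
        return [Sentence(tokens)]
    limit = int(token_limit)
    # Non-overlapping windows of at most `limit` tokens each.
    return [Sentence(tokens[i : i + limit]) for i in range(0, len(tokens), limit)]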

MattGPT-ai (Contributor, Author):

I actually do have code that does this in another PR, but a previous review of it got a bit more complicated. I think once this merges, I can rebase that code on top of this simpler base case and maybe refactor it in a better way to accomplish both functions. I'll rewrite the docstring; I forgot to change some of it when I copied from the chunking function.
