Add sentence labeler #3570
Conversation
I addressed your change request. Is there anything else you need me to change, or can this be merged?
Thanks a lot for adding this and sorry for taking so long to review! See the comments for suggested changes.
Generally, I think this is quite a useful helper function for all Flair dataset classes that have annotations labeled as character offsets.
Regarding the chunking/truncation, it would be nice in the future to have such functionality attached to the Corpus class, similar to the filter_long_sentences method but with truncation or chunking rather than filtering. This way, it could be used for any corpus.
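For illustration, such a helper might look roughly like the standalone sketch below; the name truncate_long_sentences and the exact behavior are assumptions for the sketch, not existing Flair API, and attaching it to Corpus would be a further step.

```python
from flair.data import Sentence


def truncate_long_sentences(sentences: list[Sentence], token_limit: int) -> list[Sentence]:
    """Rebuild any sentence longer than token_limit from its first token_limit tokens.

    Sketch only: labels attached to dropped tokens are lost, and span
    labels are not carried over here.
    """
    truncated = []
    for sentence in sentences:
        if len(sentence) <= token_limit:
            truncated.append(sentence)
        else:
            # Build a fresh Sentence from the surviving token texts.
            truncated.append(Sentence([token.text for token in sentence.tokens[:token_limit]]))
    return truncated
```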
```python
def create_labeled_sentence_from_tokens(
    tokens: Union[list[Token]], token_entities: list[TokenEntity], type_name: str = "ner"
```
The tokens could also be passed as a list of str, since you convert them into strings anyway in line 456. This would simplify the code a bit. Also, the Union in the signature is not necessary, so you could have tokens: list[str] instead of tokens: Union[list[Token]].
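For illustration, a minimal sketch of the simplified signature. The TokenEntity fields shown here (start/end token indices plus a label string) are assumptions for the sketch, not necessarily what the PR defines, and it assumes that slicing a Sentence yields a Span, as in recent Flair versions:

```python
from dataclasses import dataclass

from flair.data import Sentence


@dataclass
class TokenEntity:
    """Assumed shape for this sketch; the PR's actual class may differ."""

    start_token_idx: int
    end_token_idx: int
    label: str


def create_labeled_sentence_from_tokens(
    tokens: list[str], token_entities: list[TokenEntity], type_name: str = "ner"
) -> Sentence:
    sentence = Sentence(tokens)
    for entity in token_entities:
        # Slicing a Sentence is assumed to yield a Span over those tokens.
        span = sentence[entity.start_token_idx : entity.end_token_idx]
        span.add_label(type_name, entity.label)
    return sentence
```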
Does it make sense to allow either, and to check the type and convert when it's a Token type?
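One way to support both would be a small isinstance check that normalizes the input to plain strings up front; _tokens_as_texts is a hypothetical helper name, shown only to illustrate the type check:

```python
from typing import Union

from flair.data import Token


def _tokens_as_texts(tokens: Union[list[Token], list[str]]) -> list[str]:
    """Normalize either representation to plain strings."""
    return [token.text if isinstance(token, Token) else token for token in tokens]
```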
```python
    return sentence


def create_labeled_sentence(
```
The name of the function is a bit underspecified. How about something along the lines of create_labeled_sentence_from_entity_offsets or so.
I'll change it to that.
```python
        token_limit: numerical value that determines the maximum size of a chunk. use inf to not perform chunking

    Returns:
        A list of labeled Sentence objects representing the chunks of the original text
```
There is a mismatch between the wording in the comment (list of labeled Sentence objects, chunking) and what actually happens (truncation, with only the first part of the text being returned). Intuitively, I'd say it makes more sense to have the function perform chunking and thus return a list of Sentence objects. Alternatively, you could leave it as it is, but refer to what the function does as truncation and clarify that it returns a single Sentence.
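To make the distinction concrete, a chunking variant could split the token list into consecutive windows and return one Sentence per window. This is a rough sketch: chunk_tokens is a hypothetical name, and remapping the entity offsets into each chunk is omitted here.

```python
import math

from flair.data import Sentence


def chunk_tokens(tokens: list[str], token_limit: float) -> list[Sentence]:
    """Split tokens into consecutive chunks of at most token_limit tokens."""
    # inf means "no chunking": return the whole text as one Sentence.
    if math.isinf(token_limit) or len(tokens) <= token_limit:
        return [Sentence(tokens)]
    step = int(token_limit)
    return [Sentence(tokens[i : i + step]) for i in range(0, len(tokens), step)]
```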
I actually do have code that does this in another PR, but a previous review of it was a bit more complicated. I think once this merges, I can rebase that code on top of this simpler base case and maybe refactor it in a better way to accomplish both functions. I'll rewrite the docstring; I forgot to change some of it when I copied from the chunking function.
This is a simplified version of the chunking utility provided in #3520.