Add support for 4D custom attention masks in GPT-2 #35517
Problem Statement
Currently, GPT-2's attention mechanism only supports 2D attention masks, limiting its flexibility for advanced use cases like packed sequence processing. When users attempt to use 4D attention masks (shape [batch_size, num_heads, seq_length, seq_length]), the model raises dimension mismatch errors.
Issue #35290 demonstrates this limitation when trying to process packed sequences with custom attention patterns.
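For reference, here is a minimal sketch (not part of this PR) of the kind of 4D mask a packed-sequence workload produces; the `segment_ids` layout and the block-diagonal construction are illustrative assumptions.

```python
# Hypothetical illustration of the use case from #35290: two short sequences
# packed into one row, with a 4D mask that blocks cross-sequence attention.
import torch

batch_size, seq_length = 1, 6
# Tokens 0-2 belong to sequence A, tokens 3-5 to sequence B.
segment_ids = torch.tensor([[0, 0, 0, 1, 1, 1]])

# Causal mask within the row.
causal = torch.tril(torch.ones(seq_length, seq_length, dtype=torch.bool))
# Only allow attention inside the same packed segment.
same_segment = segment_ids[:, :, None] == segment_ids[:, None, :]
# Final boolean mask, broadcastable over heads: [batch, 1, seq, seq].
mask_4d = (causal & same_segment).unsqueeze(1)

print(mask_4d.shape)  # torch.Size([1, 1, 6, 6])
```

A mask of this shape is exactly what the current 2D-only code path rejects with a dimension mismatch.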
Proposed Solution
Extend GPT-2's attention mechanism to properly handle both 2D and 4D attention masks while maintaining backward compatibility. This enables advanced use cases such as packed sequence processing with custom attention patterns, while existing 2D-mask code paths keep working unchanged; a sketch of the intended usage follows.
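The snippet below is a hedged sketch of the intended call pattern, not the PR's own example; the `"gpt2"` checkpoint and the additive-float mask convention (0 where attention is allowed, a large negative value where it is blocked) are assumptions about what the updated code expects.

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

inputs = tokenizer("Packed sequences need custom masks", return_tensors="pt")
seq_len = inputs["input_ids"].shape[1]

# Existing 2D padding mask: behavior unchanged.
out_2d = model(**inputs)

# Custom 4D additive mask, broadcastable over heads: 0 where attention is
# allowed, a large negative value where it is blocked (assumed convention).
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
mask_4d = torch.zeros(1, 1, seq_len, seq_len)
mask_4d = mask_4d.masked_fill(~causal, torch.finfo(torch.float32).min)
out_4d = model(inputs["input_ids"], attention_mask=mask_4d)
```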
Implementation Details
The changes focus on the GPT2Attention class, specifically on how the attention_mask is applied in its forward pass; a sketch of the kind of branching this implies is shown below.
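This is a minimal sketch of the idea rather than the actual diff; the helper name `_apply_attention_mask` is hypothetical and the real change lives inside the existing attention forward code.

```python
import torch

# Sketch: a mask that is already 4D is applied directly, while 2D masks keep
# going through the existing expansion path.
def _apply_attention_mask(attn_weights, attention_mask):
    if attention_mask is None:
        return attn_weights
    if attention_mask.dim() == 4:
        # [batch, num_heads or 1, query_len, key_len]: add as-is, relying on
        # broadcasting over the head dimension when it is 1.
        return attn_weights + attention_mask
    # 2D [batch, key_len] padding mask: expand to 4D as before.
    expanded = attention_mask[:, None, None, :].to(attn_weights.dtype)
    expanded = (1.0 - expanded) * torch.finfo(attn_weights.dtype).min
    return attn_weights + expanded
```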
Testing Strategy
Added a test suite (test_modeling_4D_attention_mask.py) covering the new 4D mask support; a sketch of the kind of check such a suite might include is shown below.
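This is an illustrative sketch of one equivalence check, not the actual test file; the token ids and tolerances are arbitrary.

```python
import torch
from transformers import GPT2Model

def test_2d_and_4d_masks_match():
    model = GPT2Model.from_pretrained("gpt2").eval()
    input_ids = torch.tensor([[464, 3290, 318, 845, 13779]])
    mask_2d = torch.ones_like(input_ids)

    # Equivalent 4D additive mask: causal, nothing padded.
    seq_len = input_ids.shape[1]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    mask_4d = torch.zeros(1, 1, seq_len, seq_len)
    mask_4d = mask_4d.masked_fill(~causal, torch.finfo(torch.float32).min)

    with torch.no_grad():
        out_2d = model(input_ids, attention_mask=mask_2d).last_hidden_state
        out_4d = model(input_ids, attention_mask=mask_4d).last_hidden_state

    torch.testing.assert_close(out_2d, out_4d, atol=1e-5, rtol=1e-5)
```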
Test Results
All tests passed successfully.
Impact and Benefits
This enhancement broadens GPT-2's usefulness for packed-sequence and custom-masking workloads while preserving backward compatibility with existing 2D masks.
Validation
Related Issues
Closes #35290 - Support for 4D attention masks in GPT-2
Additional Notes
Requested reviewer: @ArthurZucker