JIT: Extend escape analysis to account for arrays with non-gcref elements #104906

Open: wants to merge 97 commits into base: main

Contributor

@hez2010 hez2010 commented Jul 15, 2024

Positive case:

var chs = new char[42];
chs[1] = 'a';
Console.WriteLine((int)chs[1] + chs.Length);

Codegen:

; Assembly listing for method ArrayAllocator.Program:Main() (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;* V00 loc0         [V00    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact <short[]>
;  V01 OutArgs      [V01    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  struct (104) zero-ref    do-not-enreg[SF] "stack allocated array temp"
;* V03 tmp2         [V03    ] (  0,  0   )    long  ->  zero-ref    single-def "V02.[000..008)"
;* V04 tmp3         [V04    ] (  0,  0   )     int  ->  zero-ref    single-def "V02.[008..012)"
;* V05 tmp4         [V05    ] (  0,  0   )   short  ->  zero-ref    "V02.[018..020)"
;
; Lcl frame size = 40

G_M25548_IG01:  ;; offset=0x0000
       sub      rsp, 40
                                                ;; size=4 bbWeight=1 PerfScore 0.25
G_M25548_IG02:  ;; offset=0x0004
       mov      ecx, 139
       call     [System.Console:WriteLine(int)]
       nop
                                                ;; size=12 bbWeight=1 PerfScore 3.50
G_M25548_IG03:  ;; offset=0x0010
       add      rsp, 40
       ret
                                                ;; size=5 bbWeight=1 PerfScore 1.25

; Total bytes of code 21, prolog size 4, PerfScore 5.00, instruction count 6, allocated bytes for code 21 (MethodHash=5b0b9c33) for method ArrayAllocator.Program:Main() (FullOpts)

Negative case:

var chs = new char[42];
chs[1] = 'a';
Console.WriteLine((int)chs[42] + chs.Length);

Codegen:

; Assembly listing for method ArrayAllocator.Program:Main() (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;* V00 loc0         [V00    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact <short[]>
;  V01 OutArgs      [V01    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  struct (104) zero-ref    do-not-enreg[SF] "stack allocated array temp"
;  V03 tmp2         [V03,T00] (  1,  0   )   byref  ->  rbx         must-init "dummy temp of must thrown exception"
;* V04 tmp3         [V04    ] (  0,  0   )    long  ->  zero-ref    single-def "V02.[000..008)"
;* V05 tmp4         [V05    ] (  0,  0   )     int  ->  zero-ref    single-def "V02.[008..012)"
;* V06 tmp5         [V06    ] (  0,  0   )   short  ->  zero-ref    single-def "V02.[018..020)"
;
; Lcl frame size = 32

G_M25548_IG01:  ;; offset=0x0000
       push     rbx
       sub      rsp, 32
       xor      ebx, ebx
                                                ;; size=7 bbWeight=0 PerfScore 0.00
G_M25548_IG02:  ;; offset=0x0007
       call     CORINFO_HELP_RNGCHKFAIL
       movsx    rcx, word  ptr [rbx]
       call     [System.Console:WriteLine(int)]
       int3
                                                ;; size=16 bbWeight=0 PerfScore 0.00

; Total bytes of code 23, prolog size 5, PerfScore 0.00, instruction count 7, allocated bytes for code 23 (MethodHash=5b0b9c33) for method ArrayAllocator.Program:Main() (FullOpts)
; ============================================================

Benchmark on Mandelbrot:

| Method | Job | Mean | Error | StdDev | Code Size | Allocated |
|---|---|---|---|---|---|---|
| MandelBrot | NoStackAllocationArray | 199.7 us | 1.30 us | 1.22 us | 1,996 B | 2.49 KB |
| MandelBrot | StackAllocationArray | 195.8 us | 1.16 us | 1.08 us | 2,414 B | 1.14 KB |

Diff: https://www.diffchecker.com/bNP4qHdF/

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 15, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jul 15, 2024
Member

@AndyAyersMS AndyAyersMS left a comment

For arrays (and also perhaps boxes and ref classes) we ought to have some kind of size limit... possibly similar to the one we use for stackallocs.

We need to be careful we don't allocate a lot of stack for an object that might not be heavily used, as we'll pay per-call prolog zeroing costs.

@AndyAyersMS
Member

Merged in the changes from #111284.

@AndyAyersMS
Member

@dotnet/jit-contrib PTAL
cc @jakobbotsch @hez2010

SPMI will be fairly accurate here... kicked off another collection which should have even fewer misses. Code size increases, but mostly in the CLR tests, where small fixed-size arrays are unusually frequent.

  • We currently limit array size to 528 bytes (enough for a 512-byte array or a 128-element int array). This can be altered by config.
  • I have disabled this opt for R2R because it inhibits some cross-module inlines. If/when we can prove that the vtable is never looked at (or perhaps rely on a runtime helper call to initialize it), we can revisit.
  • There is no "hotness" check so we will stack allocate arrays that are created in cold blocks. We should reconsider this since these arrays will frequently require prolog zeroing.
  • There is some special handling in VN to always rely on liberal VNs, but it is still simplistic; e.g., for a local array a, the array load below will not be recognized as a constant:
 a[0] = 3;
 f();
     = a[0];
  • Arrays that are only stored to are not optimized away.
  • Other opts mostly work as they do now, as these arrays don't really look any different than heap arrays to the jit.
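
The 528-byte limit in the first bullet can be sanity-checked with a little arithmetic. The sketch below is not the JIT's code: the helper names are hypothetical, and it assumes a 64-bit layout with a 16-byte array header (method table pointer, length, padding) and 8-byte rounding, which is consistent with the `struct (104)` temp reported for `char[42]` in the listings above. Overflow of `length * elemSize` is ignored for brevity.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: 64-bit array layout with a 16-byte header
// (8-byte method table pointer, 4-byte length, 4 bytes of padding).
constexpr std::size_t kArrayHeaderSize = 16;

// The limit quoted in the comment above: 16-byte header + 512 bytes of data.
constexpr std::size_t kMaxStackArraySize = 528;

// Hypothetical helper: bytes a stack-allocated array would occupy,
// rounded up to 8-byte alignment. Does not guard against overflow.
constexpr std::size_t StackArraySize(std::size_t length, std::size_t elemSize)
{
    const std::size_t raw = kArrayHeaderSize + length * elemSize;
    return (raw + 7) & ~static_cast<std::size_t>(7);
}

// Hypothetical check: true if the array fits under the size limit.
constexpr bool FitsStackLimit(std::size_t length, std::size_t elemSize)
{
    return StackArraySize(length, elemSize) <= kMaxStackArraySize;
}
```

Under these assumptions a `char[42]` occupies 104 bytes, a 128-element int array lands exactly on the limit, and a 129-element int array exceeds it.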

@AndyAyersMS
Member

Seeing AVs in osx crossgen2, oddly in both base and diff jits:

[17:42:02] Invoking: C:\h\w\AAC9097D\p\superpmi.exe -a -v ewi -f C:\h\w\AAC9097D\t\tmpwd8c5ge0\libraries.crossgen2.osx.arm64.checked.mch_fail.mcl -details C:\h\w\AAC9097D\t\tmpwd8c5ge0\libraries.crossgen2.osx.arm64.checked.mch_details.csv -target arm64 -jitoption force JitEnableNoWayAssert=1 -jitoption force JitNoForceFallback=1 -jitoption force JitAlignLoops=0 -jit2option force JitEnableNoWayAssert=1 -jit2option force JitNoForceFallback=1 -jit2option force JitAlignLoops=0 -p -failureLimit 100 C:\h\w\AAC9097D\p\base\checked\clrjit_universal_arm64_x64.dll C:\h\w\AAC9097D\p\diff\checked\clrjit_universal_arm64_x64.dll C:\h\w\AAC9097D\w\9D38092F\e\artifacts\spmi\mch\cc0e7adf-e397-40b6-9d14-a7149815c991.osx.arm64\libraries.crossgen2.osx.arm64.checked.mch

[17:45:49] ERROR: Unexpected exception c0000005 was thrown.

[17:45:49] ERROR: Method 86217 of size 23 failed to load and compile correctly by JIT1 (C:\h\w\AAC9097D\p\base\checked\clrjit_universal_arm64_x64.dll).

[17:45:49] ERROR: Unexpected exception c0000005 was thrown.

[17:45:49] ERROR: Method 86217 of size 23 failed to load and compile correctly by JIT2 (C:\h\w\AAC9097D\p\diff\checked\clrjit_universal_arm64_x64.dll).

[17:45:49] ERROR: Unexpected exception c0000005 was thrown.

Will try to look at this locally.

@AndyAyersMS
Member

AndyAyersMS commented Jan 17, 2025

Failing here: InitClass map is null

CorInfoInitClassResult MethodContext::repInitClass(CORINFO_FIELD_HANDLE   field,
                                                   CORINFO_METHOD_HANDLE  method,
                                                   CORINFO_CONTEXT_HANDLE context)
{
    Agnostic_InitClass key;
    ZeroMemory(&key, sizeof(key)); // Zero key including any struct padding
    key.field   = CastHandle(field);
    key.method  = CastHandle(method);
    key.context = CastHandle(context);

    DWORD value = InitClass->Get(key);

Looks like this map should always be present since morph always calls InitClass. But at any rate we can stop SPMI from AVing.

Tolerating this in #111555.
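
One way to keep a replay tool from crashing on a missing map is to treat "map never recorded" the same as "key not found". The sketch below only illustrates that idea and is not necessarily what #111555 does; `RecordedMap` and `LookupRecorded` are hypothetical stand-ins, not SuperPMI types.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>

// Hypothetical stand-in for a recorded map; the real SPMI map type differs.
using RecordedMap = std::map<std::uint64_t, std::uint32_t>;

// Returns the recorded value, or std::nullopt when the map was never
// recorded (null) or has no entry for the key. The caller can then report
// a replay miss instead of dereferencing a null pointer.
std::optional<std::uint32_t> LookupRecorded(const RecordedMap* map, std::uint64_t key)
{
    if (map == nullptr)
    {
        return std::nullopt;
    }

    const auto it = map->find(key);
    if (it == map->end())
    {
        return std::nullopt;
    }

    return it->second;
}
```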

@hez2010
Contributor Author

hez2010 commented Jan 18, 2025

Merged the main branch to include #111555.

@hez2010
Contributor Author

hez2010 commented Jan 18, 2025

Try a late dead store removal
@MihuBot

@hez2010
Contributor Author

hez2010 commented Jan 18, 2025

@AndyAyersMS Just experimented a bit with late dead store removal by unexposing locals in liveness and repeating until convergence: commit 3450580

Before adding late dead store removal: diffs
After adding late dead store removal: diffs

Relative diffs:

linux arm64: -26,588 bytes
linux x64: -27,113 bytes
osx arm64: -14,360 bytes
windows arm64: -25,528 bytes
windows x64: -26,628 bytes

Diffs look interesting, but the TP impact is too high, so I'm reverting the experiment commit.
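
To show why "repeating to convergence" can be costly, here is a toy fixpoint loop. It is not the JIT's liveness pass: the `Store` IR and `RemoveDeadStores` are hypothetical, and the analysis is flow-insensitive. Each pass removes stores to locals that nothing reads, which can expose further dead stores, so a chain of N dependent stores takes about N passes.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Toy IR: each store writes one local and reads zero or more locals.
struct Store
{
    int              dst;
    std::vector<int> uses;
};

// Removes stores to locals that nothing reads, repeating until a pass
// removes nothing. 'observed' lists locals whose values are used elsewhere
// (for example, returned or passed to a call).
// Returns the number of passes that were run.
int RemoveDeadStores(std::vector<Store>& stores, const std::set<int>& observed)
{
    int  passes  = 0;
    bool changed = true;

    while (changed)
    {
        passes++;

        // Collect every local that is read by a remaining store or observed.
        std::set<int> read = observed;
        for (const Store& s : stores)
        {
            read.insert(s.uses.begin(), s.uses.end());
        }

        // Drop stores whose destination is never read.
        const auto oldSize = stores.size();
        stores.erase(std::remove_if(stores.begin(), stores.end(),
                                    [&](const Store& s) { return read.count(s.dst) == 0; }),
                     stores.end());

        changed = stores.size() != oldSize;
    }

    return passes;
}
```

With stores to locals 1, 2 (reads 1), 3 (reads 2) and 4, and only local 4 observed, the loop needs four passes to strip the three-store chain, one store per pass plus a final pass that confirms nothing changed.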

@hez2010 hez2010 force-pushed the value-array-stack-alloc branch from 3450580 to 1915450 on January 18, 2025 16:28

@AndyAyersMS
Member

@jakobbotsch can you look this one over again?

#ifdef FEATURE_READYTORUN
if (comp->opts.IsReadyToRun() && data->IsHelperCall(comp, CORINFO_HELP_READYTORUN_NEWARR_1))
{
len = data->AsCall()->gtArgs.GetArgByIndex(0)->GetNode();

Member

Dead code?

Member

I removed R2R support more completely

Comment on lines 2849 to 2854
GenTree* const mtStore = gtNewStoreLclFldNode(lclNum, TYP_I_IMPL, 0, mt);
Statement* const mtStmt = gtNewStmt(mtStore);

fgInsertStmtBefore(block, newStmt, mtStmt);
gtSetStmtInfo(mtStmt);
fgSetStmtSeq(mtStmt);

Member

Suggested change
- GenTree* const mtStore = gtNewStoreLclFldNode(lclNum, TYP_I_IMPL, 0, mt);
- Statement* const mtStmt = gtNewStmt(mtStore);
- fgInsertStmtBefore(block, newStmt, mtStmt);
- gtSetStmtInfo(mtStmt);
- fgSetStmtSeq(mtStmt);
+ GenTree* const mtStore = gtNewStoreLclFldNode(lclNum, TYP_I_IMPL, 0, mt);
+ Statement* const mtStmt = fgNewStmtFromTree(mtStore);
+ fgInsertStmtBefore(block, newStmt, mtStmt);

Member

Ditto below, more things can switch to use fgNewStmtFromTree

@@ -181,6 +254,11 @@ inline bool ObjectAllocator::CanAllocateLclVarOnStack(unsigned int lclNu
return false;
}

Member

It seems like we should have a limit on the aggregate stack allocated size. That's somewhat preexisting, but probably more important now.

Member

Agree but will do that in a future PR
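
For illustration only, an aggregate limit could be tracked as a running per-method total alongside the per-object limit. The sketch below is a guess at the shape of such a check, not the follow-up PR's design; `StackAllocBudget` and the budget value passed to it are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-method budget tracker; the name and any budget value
// passed to it are illustrative, not taken from the JIT.
class StackAllocBudget
{
public:
    explicit StackAllocBudget(std::size_t limit)
        : m_limit(limit)
        , m_used(0)
    {
    }

    // Reserves 'size' bytes if the running total stays within the limit.
    // Returns false (and reserves nothing) otherwise, in which case the
    // caller would leave the object on the heap.
    bool TryReserve(std::size_t size)
    {
        // m_used never exceeds m_limit, so the subtraction cannot wrap.
        if (size > m_limit - m_used)
        {
            return false;
        }

        m_used += size;
        return true;
    }

    std::size_t Used() const
    {
        return m_used;
    }

private:
    std::size_t m_limit;
    std::size_t m_used;
};
```

With a 1024-byte budget, one 528-byte array fits, a second is rejected, and a smaller 104-byte array can still be placed afterwards.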

Comment on lines +6249 to +6260
// If this is a local array, there are no asynchronous modifications, so we can set the
// conservative VN to the liberal VN.
//
VNFuncApp arrFn;
if (vnStore->IsVNNewLocalArr(arrVN, &arrFn))
{
loadTree->gtVNPair.SetConservative(loadValueVN);
}
else
{
loadTree->gtVNPair.SetConservative(vnStore->VNForExpr(compCurBB, loadType));
}

Member

Does this actually show up as benefits?

Member

In some limited cases, yes... e.g., we can const propagate through a[2] = 1; y = a[2];
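
The effect can be pictured with a toy model of store-to-load forwarding on a non-escaping array. This is not the JIT's value numbering: `LocalArrayFacts` is hypothetical, and having a call forget every known element is a deliberate simplification that mirrors the limitation noted earlier in this thread (a load after `f()` is not recognized as a constant).

```cpp
#include <cassert>
#include <map>
#include <optional>

// Toy model: constant element values known for one non-escaping local array.
class LocalArrayFacts
{
public:
    // Models "a[index] = value" where value is a constant.
    void Store(int index, int value)
    {
        m_known[index] = value;
    }

    // Models an opaque call. This toy conservatively forgets everything,
    // matching the simplistic behavior described in the thread.
    void Call()
    {
        m_known.clear();
    }

    // Models "= a[index]"; returns the constant if one is still known.
    std::optional<int> Load(int index) const
    {
        const auto it = m_known.find(index);
        if (it == m_known.end())
        {
            return std::nullopt;
        }
        return it->second;
    }

private:
    std::map<int, int> m_known;
};
```

In this model `a[2] = 1; y = a[2];` yields the constant 1, while `a[0] = 3; f(); = a[0];` yields no known value.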

@AndyAyersMS
Member

@jakobbotsch think I've addressed most of the key points. Overall size limit will come in a future PR.

@jakobbotsch
Member

@AndyAyersMS Did you push those changes?

@AndyAyersMS
Member

Ah, I pushed to my fork, but ... this PR is not from my fork.

@AndyAyersMS
Member

@jakobbotsch changes are there now
