sprezzatech blog #0007

Dirty South Supercomputers and Waffles
UNIX, HPC, cyberwarfare, and chasing perf in Atlanta.

dude, where're my registers?
Wed May 16 07:35:40 EDT 2012

Where do all the free registers go? When I took up programming—a wee lad of six in that merry summer of 1987—my ATARI 400's MOSTech 6502 offered 6 user-accessible registers (there were 22 “registers” with which to drive sound, player-missile graphics and I/O via CTIA, ANTIC and POKEY, but you couldn't perform computation with them). You'd generally compose 2–3 values within A's 8 bits and forget load/store (beyond whatever large-scale phasing your main loop might exhibit), instead operating directly on the first 256 bytes of memory. Among the (at least) ten addressing modes of the 6502 was indexed zero-page addressing, requiring only a single byte to specify the operand and thus saving the secondary instruction fetch for the remainder of a full 16-bit address. Back in the 1.79MHz days, your CPU clock wasn't some egregious multiple of your DRAM/MMU clock, and thus this wasn't as bad as it sounds—in a way, the zero page was a user-managed cache. Unfortunately, your OS (or ATARI BASIC) crapped all over a good chunk of page zero, and thus you had to write assembly (and then chain-load—BASIC and assembly programs sounded different, warbled differently, when played outside the ATARI 410 Program (cassette) Recorder) to meaningfully use it. Of course, if you were writing BASIC, none of this was visible to you anyway—unless you timed access to a 5 DIM A$(1) 10 A$="A" “array” vs access to a regular ol' 10 A = 65 (no, I did not perform this experiment as a six year old). Anyway: very few registers. That's the upshot.

[photo: The ATARI 410 Program Recorder. It enjoyed not fully reading your programs, which you discovered upon careening off the end.]

Fast forward to the age of CUDA, and we're doing a bit better, register-wise:

Resource                                  1.0/1.1 SM  1.2/1.3 SM  2.x SM  3.0 SMX
Max warp size                                     32          32      32       32
Max warps                                         24          32      48       64
Max blocks                                         8           8       8       16
Max threads (derived)                            768        1024    1536     2048
32-bit registers                                8192       16384   32768    65536
Max registers/thread at
full occupancy (derived)                          10          16      21       32
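
The two derived rows are just arithmetic: warps × 32 gives max threads, and registers ÷ threads (rounded down) gives the per-thread budget at full occupancy. A throwaway check (plain C; only the table's data, nothing more):

#include <stdio.h>

int main(void){
	/* Architectural per-SM limits by compute capability: max resident
	 * warps and 32-bit registers (warp size is 32 throughout). The
	 * other two rows of the table above are derived from these. */
	static const struct { const char *sm; unsigned warps, regs; } cc[] = {
		{ "1.0/1.1", 24, 8192 },
		{ "1.2/1.3", 32, 16384 },
		{ "2.x", 48, 32768 },
		{ "3.0", 64, 65536 },
	};
	unsigned i;

	for(i = 0 ; i < sizeof(cc) / sizeof(*cc) ; ++i){
		unsigned threads = cc[i].warps * 32;
		/* Integer division floors: 32768 / 1536 == 21, per the table. */
		printf("%s: %u threads, %u regs/thread\n", cc[i].sm,
			threads, cc[i].regs / threads);
	}
	return 0;
}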

The X86 world is significantly more complicated (as X86 tends to be). The scandalously low number of general-purpose registers is partially offset by the past decade's complicated, dataflow-driven cores: register renaming facilitates out-of-order execution across a much wider physical register file, and thus the small number of architectural registers doesn't restrict ILP. This doesn't help the programmer hold more live values, though; the advanced caches standard on X86 cores at least absorb much of the latency due to register spills.
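
To make the renaming point concrete, consider a loop like this one (plain C, my own example): the compiled x86 may funnel every iteration through the same few architectural registers, but a renaming core maps each iteration's writes onto fresh physical registers, so independent iterations still overlap in flight.

#include <stddef.h>
#include <stdint.h>

void scale(uint32_t *dst,const uint32_t *src,size_t n){
	size_t i;

	for(i = 0 ; i < n ; ++i){
		/* Iterations are independent. Even if the compiler assigns the
		 * temporary to a single architectural register (say, eax),
		 * renaming gives each iteration its own physical copy, so the
		 * architectural name never serializes the loop. */
		dst[i] = src[i] * 3u + 1u;
	}
}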

So dude, where are my registers?

Let's take the following simple CUDA program:
#include <cuda.h>
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

static __global__
void medadic(void){
}

static __global__
void monadic(uint32_t a0){
}

static __global__
void dyadic(uint32_t a0,uint32_t a1){
}

int main(void){
	assert(cuInit(0) == CUDA_SUCCESS);
	medadic<<<1,1>>>();
	monadic<<<1,1>>>(1);
	dyadic<<<1,1>>>(1,2);
	assert(cudaThreadSynchronize() == cudaSuccess);
	return EXIT_SUCCESS;
}
and compile it thusly:
nvcc c.cu -Xptxas -O3 -arch compute_20 -code sm_21 -lcuda -lcudart
We need the -code argument to nvcc in order to generate machine-specific SASS rather than just PTX, because PTX doesn't tell us anything about register usage (or very much else altogether, except how to emit some kinda half-assed pseudo-SSA). I used sm_21 because I've got a GTX 460 in this machine; adjust to taste. We pass -O3 to ptxas to ensure we're not getting non-production crap left in there.
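(Should you want to eyeball said pseudo-SSA anyway, nvcc will emit it on request; something like the following drops a c.ptx in the working directory:)
nvcc -ptx c.cu -arch compute_20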

Take a look at this with cuobjdump:
[skynet](0) $ cuobjdump -sass a.out 

Fatbin elf code:
================
arch = sm_21
code version = [1,4]
producer = cuda
host = linux
compile_size = 64bit
identifier = c.cu

code for sm_21
	Function : _Z7medadicv
/*0000*/     /*0x00005de428004404*/ 	MOV R1, c [0x1] [0x100];
/*0008*/     /*0x00001de780000000*/ 	EXIT;
	............................


	Function : _Z6dyadicjj
/*0000*/     /*0x00005de428004404*/ 	MOV R1, c [0x1] [0x100];
/*0008*/     /*0x00001de780000000*/ 	EXIT;
	............................


	Function : _Z7monadicj
/*0000*/     /*0x00005de428004404*/ 	MOV R1, c [0x1] [0x100];
/*0008*/     /*0x00001de780000000*/ 	EXIT;
	............................


[skynet](0) $ 
Well, it can't be a parameter: it shows up in parameterless medadic, and there's only one of it in two-parameter dyadic. Nope, this is part of the PTX ABI, a rather unfortunate affair all told which we won't explore further here. Know that with -abi=no passed to ptxas, this MOV—and the register it consumes (R1, the ABI's stack pointer)—are eliminated, and you thus lose your stack. ptxas is smart enough to eliminate most dead code (code which has no effect on the world outside the function), so in order to get some data we'll need a more complicated program:
static __global__ void monadic(uint32_t *a0){ *a0 = 0; }

static __global__ void dyadic(const uint32_t *a0,uint32_t *a1){ *a1 = *a0; }

int main(void){
	uint32_t *out;

	assert(cuInit(0) == CUDA_SUCCESS);
	assert(cudaMalloc(&out,sizeof(uint32_t) * 3) == cudaSuccess);
	medadic<<<1,1>>>();
	monadic<<<1,1>>>(out);
	dyadic<<<1,1>>>(out + 1,out + 2);
	assert(cudaThreadSynchronize() == cudaSuccess);
	return EXIT_SUCCESS;
}
and get the rather more interesting results:
	code for sm_21
		Function : _Z7medadicv
	/*0000*/     /*0x00001de780000000*/ 	EXIT;
		............................


		Function : _Z6dyadicPjS_
	/*0000*/     /*0x80001de428004000*/ 	MOV R0, c [0x0] [0x20];
	/*0008*/     /*0x90005de428004000*/ 	MOV R1, c [0x0] [0x24];
	/*0010*/     /*0xa0009de428004000*/ 	MOV R2, c [0x0] [0x28];
	/*0018*/     /*0x00001c858c000000*/ 	LDU.E R0, [R0];
	/*0020*/     /*0xb000dde428004000*/ 	MOV R3, c [0x0] [0x2c];
	/*0028*/     /*0x00201c8594000000*/ 	ST.E [R2], R0;
	/*0030*/     /*0x00001de780000000*/ 	EXIT;
		..............................


		Function : _Z7monadicPj
	/*0000*/     /*0x80001de428004000*/ 	MOV R0, c [0x0] [0x20];
	/*0008*/     /*0x90005de428004000*/ 	MOV R1, c [0x0] [0x24];
	/*0010*/     /*0x000fdc8594000000*/ 	ST.E [R0], RZ;
	/*0018*/     /*0x00001de780000000*/ 	EXIT;
		.............................

We've learned that nvcc doesn't appear to perform whole-program analysis: even with -O3 supplied to it (not shown here), it doesn't optimize the constant function monadic() away into a cudaMemset() of size equal to the product of the block and grid geometries (here, a trivially constant 1). The .param state space has been mapped into the first constant bank, starting at 0x20. ptxas will happily blow away dead registers used by parameters, as seen in dyadic. Why are we using two registers per parameter? Because our machine model is 64-bit, of course: each pointer occupies a register pair. Adding -m32 to our nvcc command line sheds a register per parameter:
ptxas info    : Compiling entry function '_Z7monadicPj' for 'sm_21'
ptxas info    : Used 1 registers, 36 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6dyadicPjS_' for 'sm_21'
ptxas info    : Used 2 registers, 40 bytes cmem[0]
ptxas info    : Compiling entry function '_Z7medadicv' for 'sm_21'
ptxas info    : Used 0 registers, 32 bytes cmem[0]
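For reference, the command lines behind those numbers would be something like the following (-Xptxas takes comma-separated suboptions, and the -v is what produces the “ptxas info” lines; the second line is the -abi=no experiment from earlier):
nvcc -m32 c.cu -Xptxas -O3,-v -arch compute_20 -code sm_21 -lcuda -lcudart
nvcc c.cu -Xptxas -O3,-v,-abi=no -arch compute_20 -code sm_21 -lcuda -lcudart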
Note that trying to pass, say, -maxrregcount=2 to ptxas elicits whatever a computer does when it's being silently contemptuous, plus the following:
ptxas info    :  For profile sm_21 adjusting per thread register
	              count of 2 to lower bound of 16
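(That invocation being, roughly:)
nvcc c.cu -Xptxas -O3,-v,-maxrregcount=2 -arch compute_20 -code sm_21 -lcuda -lcudart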
-maxrregcount=0 disables the register ceiling, which just goes to show, once again, that nothing we learned from War Games is true (I had assumed ptxas would cycle through all possible permutations of instruction schedules, faster and faster, until my Dell IPS U3011 blew up and the GTX 460 spoke aloud, “A strange program. The only optimal choice is: not to compile. How about a nice game of Nethack?” Alas!)

Let's step up the game:
static __global__ void monadic(uint32_t *a0){ *a0 = *a0 + 1u; }

static __global__ void dyadic(const uint32_t *a0,uint32_t *a1){
	uint32_t c;

	for(c = 0 ; c < *a0 ; ++c){ *a1 += *a0; }
}
monadic() is pretty much what we'd expect. dyadic(), though, is suffering from LIVENESS INDIGESTION:
		Function : _Z6dyadicPjS_
	/*0000*/     /*0x80019de428004000*/ 	MOV R6, c [0x0] [0x20];
	/*0008*/     /*0x9001dde428004000*/ 	MOV R7, c [0x0] [0x24];
	/*0010*/     /*0xa0011de428004000*/ 	MOV R4, c [0x0] [0x28];
	/*0018*/     /*0x00601c858c000000*/ 	LDU.E R0, [R6];
	/*0020*/     /*0xb0015de428004000*/ 	MOV R5, c [0x0] [0x2c];
	/*0028*/     /*0xfc01dc23190e0000*/ 	ISETP.EQ.AND P0, pt, R0, RZ, pt;
	/*0030*/     /*0x000001e780000000*/ 	@P0 EXIT;
	/*0038*/     /*0x00405c858c000000*/ 	LDU.E R1, [R4];
	/*0040*/     /*0xfc009de428000000*/ 	MOV R2, RZ;
	/*0048*/     /*0x00001de440000000*/ 	NOP CC.T;
	/*0050*/     /*0x00105c0348000000*/ 	IADD R1, R1, R0;
	/*0058*/     /*0x04209c034800c000*/ 	IADD R2, R2, 0x1;
	/*0060*/     /*0x00405c8594000000*/ 	ST.E [R4], R1;
	/*0068*/     /*0x00601c8584000000*/ 	LD.E R0, [R6];
	/*0070*/     /*0x0021dc03188e0000*/ 	ISETP.LT.U32.AND P0, pt, R2, R0, pt;
	/*0078*/     /*0x400001e74003ffff*/ 	@P0 BRA 0x50;
	/*0080*/     /*0x00001de780000000*/ 	EXIT;
		..............................

		Function : _Z7monadicPj
	/*0000*/     /*0x80009de428004000*/ 	MOV R2, c [0x0] [0x20];
	/*0008*/     /*0x9000dde428004000*/ 	MOV R3, c [0x0] [0x24];
	/*0010*/     /*0x00201c858c000000*/ 	LDU.E R0, [R2];
	/*0018*/     /*0x04001c034800c000*/ 	IADD R0, R0, 0x1;
	/*0020*/     /*0x00201c8594000000*/ 	ST.E [R2], R0;
	/*0028*/     /*0x00001de780000000*/ 	EXIT;
		............................. 
Note that ptxas inserted an optimization for the case where *a0 is 0, exiting early. This makes sense since all loads are from the same place (hence the use of LDU, Load Data Uniform). If threads were loading from different locations, this optimization would very rarely be a win due to divergence (you'd only win if every thread in a warp loaded 0). What is all this garbage, though? 8 registers? Count 'em: only seven (R0–R2, R4–R7) ever appear. Most likely, “64-bit” quantities have some kind of R(2n) pair-alignment requirement, hence R3 is burned as padding (look at monadic, where the pointer lands in the aligned R2:R3 pair). Placing these values in R0..R3 would have eliminated the need for this padding, but perhaps with so few registers in use, ptxas doesn't care. Maybe my hypothesis is just wrong, or even a canard; collecting evidence for or against it is left as an exercise for the reader. Why aren't R6 and R7 reusable?

I can't say for sure; my guess is that, so long as a1 might alias a0, the store through [R4] could clobber a0's target, forcing the reload of [R6] on every trip through the loop and keeping that pointer pair live. Observe what happens when we use the CUDA __restrict__ extension keyword (why this was necessary is beyond my comprehension, as restrict has been present since ANSI/ISO C99) to amend dyadic()'s contract, specifying that the two parameters cannot overlap.
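For concreteness, the amended kernel (a sketch; the body is unchanged from above):
static __global__ void dyadic(const uint32_t * __restrict__ a0,
		uint32_t * __restrict__ a1){
	uint32_t c;

	for(c = 0 ; c < *a0 ; ++c){ *a1 += *a0; }
}
The disassembly becomes: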
		Function : _Z6dyadicPjS_
	/*0000*/     /*0x80001de428004000*/ 	MOV R0, c [0x0] [0x20];
	/*0008*/     /*0x90005de428004000*/ 	MOV R1, c [0x0] [0x24];
	/*0010*/     /*0xa0011de428004000*/ 	MOV R4, c [0x0] [0x28];
	/*0018*/     /*0x00009c858c000000*/ 	LDU.E R2, [R0];
	/*0020*/     /*0xb0015de428004000*/ 	MOV R5, c [0x0] [0x2c];
	/*0028*/     /*0xfc21dc23190e0000*/ 	ISETP.EQ.AND P0, pt, R2, RZ, pt;
	/*0030*/     /*0x000001e780000000*/ 	@P0 EXIT;
	/*0038*/     /*0x00401c858c000000*/ 	LDU.E R0, [R4];
	/*0040*/     /*0xfc005de428000000*/ 	MOV R1, RZ;
	/*0048*/     /*0x00001de440000000*/ 	NOP CC.T;
	/*0050*/     /*0x04105c034800c000*/ 	IADD R1, R1, 0x1;
	/*0058*/     /*0x08001c0348000000*/ 	IADD R0, R0, R2;
	/*0060*/     /*0x0811dc03188e0000*/ 	ISETP.LT.U32.AND P0, pt, R1, R2, pt;
	/*0068*/     /*0x800001e74003ffff*/ 	@P0 BRA 0x50;
	/*0070*/     /*0x00401c8594000000*/ 	ST.E [R4], R0;
	/*0078*/     /*0x00001de780000000*/ 	EXIT;
		..............................
We drop to 6 total registers: *a0 is now loaded exactly once (into R2), a0's pointer pair is freely blown away afterwards, and the store sinks below the loop. Interesting, especially as the CUDA Programming Guide claims that use of __restrict__ “may increase register pressure” (true, since the freedom to reorder and cache loads can keep more values live at once, but it seems just as likely to reduce register pressure, as here).
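
For what it's worth, the Guide's warning isn't crazy. Here's a contrived sketch (my example; I haven't disassembled it) of one way it can come true: with the annotation, the compiler is free to hoist all the loads above the stores, holding three values live at once, where the unannotated version must conservatively interleave.
static __global__ void shuffle3(const uint32_t * __restrict__ in,
		uint32_t * __restrict__ out){
	/* With __restrict__, ptxas may batch the three loads before any
	 * store (three live registers); without it, each store could alias
	 * in[], serializing load/store pairs but keeping fewer values live. */
	out[0] = in[0];
	out[1] = in[1];
	out[2] = in[2];
}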

It's also interesting to note that, despite the -O3 to ptxas, partial loop unrolling has not been performed, nor has the entire loop been optimized away into a single multiplication followed by an addition modulo 2^32. Adding a bare #pragma unroll changes nothing, but #pragma unroll 2 (or indeed any other integer argument to unroll) shakes things loose.
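Placement, for reference (a sketch; I'm assuming we're still working with the __restrict__-qualified dyadic(), and the count seems not to matter so long as it exceeds 1):
static __global__ void dyadic(const uint32_t * __restrict__ a0,
		uint32_t * __restrict__ a1){
	uint32_t c;

#pragma unroll 2
	for(c = 0 ; c < *a0 ; ++c){ *a1 += *a0; }
}
This results in: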
		Function : _Z6dyadicPjS_
	/*0000*/     /*0x80001de428004000*/ 	MOV R0, c [0x0] [0x20];
	/*0008*/     /*0x90005de428004000*/ 	MOV R1, c [0x0] [0x24];
	/*0010*/     /*0xa0011de428004000*/ 	MOV R4, c [0x0] [0x28];
	/*0018*/     /*0x00001c858c000000*/ 	LDU.E R0, [R0];
	/*0020*/     /*0xb0015de428004000*/ 	MOV R5, c [0x0] [0x2c];
	/*0028*/     /*0xfc01dc23190e0000*/ 	ISETP.EQ.AND P0, pt, R0, RZ, pt;
	/*0030*/     /*0xa00081e740000000*/ 	@P0 BRA.U 0x60;
	/*0038*/     /*0x004060858c000000*/ 	@!P0 LDU.E R1, [R4];
	/*0040*/     /*0x0400a003081ec000*/ 	@!P0 IMNMX.U32 R2, R0, 0x1, !pt;
	/*0048*/     /*0x080060a320020000*/ 	@!P0 IMAD R1, R0, R2, R1;
	/*0050*/     /*0x0040608594000000*/ 	@!P0 ST.E [R4], R1;
	/*0058*/     /*0x00001de780000000*/ 	EXIT;
	/*0060*/     /*0x00001de780000000*/ 	EXIT;
		..............................
Ahhh, here's our IMAD...but it seems odd that we needed to specify a value to unroll, though it didn't matter what that value was so long as it was greater than 1! Explicit sequences such as *a1 += *a0 + *a0 and *a1 = *a0 + *a0 + *a0 are converted into sensible ISCADD 0x1 and IMAD 0x3 instructions, respectively. But what about this tart medicine?
static __global__ void triadic(uint32_t * __restrict__ a0,
		uint32_t * __restrict__ a1,uint32_t * __restrict__ a2){
	*a1 = *a0 + *a1 + *a2;
}
...drumroll, please...
		Function : _Z7triadicPjS_S_
	/*0000*/     /*0x80019de428004000*/ 	MOV R6, c [0x0] [0x20];
	/*0008*/     /*0x9001dde428004000*/ 	MOV R7, c [0x0] [0x24];
	/*0010*/     /*0xa0011de428004000*/ 	MOV R4, c [0x0] [0x28];
	/*0018*/     /*0xb0015de428004000*/ 	MOV R5, c [0x0] [0x2c];
	/*0020*/     /*0xc0021de428004000*/ 	MOV R8, c [0x0] [0x30];
	/*0028*/     /*0xd0025de428004000*/ 	MOV R9, c [0x0] [0x34];
	/*0030*/     /*0x00605c858c000000*/ 	LDU.E R1, [R6];
	/*0038*/     /*0x00409c858c000000*/ 	LDU.E R2, [R4];
	/*0040*/     /*0x00801c858c000000*/ 	LDU.E R0, [R8];
	/*0048*/     /*0x04205c0348000000*/ 	IADD R1, R2, R1;
	/*0050*/     /*0x00101c0348000000*/ 	IADD R0, R1, R0;
	/*0058*/     /*0x00401c8594000000*/ 	ST.E [R4], R0;
	/*0060*/     /*0x00001de780000000*/ 	EXIT;
		.................................
Holy snapping duck fucks, what is this hippie horseshit? Why no condensation of the IADDs into a single VADD, eliminating the need to blow away R1? Well, it's not as if R1's actually alive following 0x0048, and there's no point in saving someone who's already dead. If we throw a loop around this, we get two IADDs into R0. It'd be interesting to analyze the data dependence/throughput characteristics of VADD vs two IADDs—perhaps one requires less parallelism to hide the dependency delays than the other—but it's 0733 EDT, and I have other stuff to do. Hack on, cool kids!

P.S.: We can generate VADD.U32.U32.ACC via inline PTX assembly:
asm ( "vadd.u32.u32.u32.add %0,%0,%1,%2;"
      : "+r"(*a1) : "r"(*a0), "r"(*a2) );