[Paper Review] The Load Slice Core Microarchitecture

Motivation

Microprocessor cores have evolved from in-order pipelines to superscalar out-of-order pipelines in order to raise instruction-level parallelism (ILP), and as a side effect memory-level parallelism (MLP) has increased as well.

Note: Memory-level parallelism (MLP) is a computer architecture term for the ability to have multiple memory operations pending at the same time.

Today, because of the off-chip memory wall and complex cache hierarchies, a memory access is a very expensive operation. On top of that, single-thread performance has remained a primary concern for energy-constrained multi-core processors.

To compare performance and memory hierarchy parallelism (MHP), the paper considers the following six core configurations.

in-order: a conventional in-order core.
out-of-order: a core capable of out-of-order execution.
out-of-order loads: an in-order core extended so that a load can execute as soon as all of its operands are ready, which reduces the stall time caused by load instructions.
ooo loads+AGI: additionally allows address-generating instructions (AGIs) to execute out of order so that they can run ahead of time. (This configuration assumes perfect knowledge of which instructions the address generation depends on.)
ooo ld+AGI (no-spec.): executes loads and AGIs out of order like the configuration above (ooo loads+AGI), but cannot proceed past unresolved branches.
loads+AGI (in-order): because ooo loads+AGI has complexity similar to a full out-of-order core, this configuration instead uses two in-order queues (a bypass queue and a main queue).

Goal

Building on the motivation above, the paper proposes the Load Slice Core (LSC), a microarchitecture that issues parallel accesses to the memory hierarchy while also maximizing energy efficiency.

LSC adds a second in-order pipeline next to the main pipeline. According to the paper, this pipeline executes memory accesses and address-generating instructions so that they can bypass stalled instructions.

The backward program slices, which contain the address-generating instructions used by load or store instructions, are reportedly extracted automatically without software support.

Implementation

LSC is built on a conventional superscalar in-order core with a stall-on-use policy. It has two pipelines: 1) the primary pipeline, which handles the regular instruction stream, and 2) the secondary pipeline, which handles loads and address-generating instructions.

An instruction slice ends with a load or store instruction and contains the address-generating instructions that feed it.
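To make the notion of a backward slice concrete, here is a minimal offline sketch in Python. It is not the hardware mechanism from the paper; it simply walks a straight-line instruction list backwards from a memory instruction and collects the producers of its address registers. The `Instr` type, the register names, and the five-instruction `PROGRAM` are all hypothetical and exist only for illustration.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    pc: int                  # hypothetical instruction address
    text: str                # human-readable form, illustration only
    dests: Tuple[str, ...]   # registers written
    srcs: Tuple[str, ...]    # registers read (address registers for a memory op)


def backward_slice(program, mem_index):
    """Return the pcs of the instructions that feed the address of program[mem_index].

    Assumes straight-line code (no branches), so a single backward walk suffices.
    """
    needed = set(program[mem_index].srcs)   # registers whose producers are still wanted
    slice_pcs = []
    for instr in reversed(program[:mem_index]):
        if needed & set(instr.dests):
            slice_pcs.append(instr.pc)
            needed -= set(instr.dests)       # this producer is now accounted for
            needed |= set(instr.srcs)        # but its own inputs must be traced too
    return list(reversed(slice_pcs))


# Hypothetical program: pc 1 is unrelated floating-point work, pc 4 is the load.
PROGRAM = [
    Instr(0, "mov  r1 <- r2",      ("r1",), ("r2",)),
    Instr(1, "add  f0 <- f0 + f1", ("f0",), ("f0", "f1")),
    Instr(2, "shl  r1 <- r1 << 3", ("r1",), ("r1",)),
    Instr(3, "add  r1 <- r1 + r4", ("r1",), ("r1", "r4")),
    Instr(4, "load f2 <- [r1]",    ("f2",), ("r1",)),
]

if __name__ == "__main__":
    print(backward_slice(PROGRAM, 4))
```

In this sketch the slice of the load at pc 4 consists of pcs 0, 2, and 3, while the floating-point add at pc 1 stays outside it. Those slice members are the kind of instructions LSC steers to its secondary pipeline.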

Iterative backward dependency analysis

This is a low-cost, hardware-based technique that selects the backward instruction slices of load / store instructions for early execution. The backward instruction slices are identified using the technique from the paper below.

C. B. Zilles and G. S. Sohi, “Understanding the backward slices of performance degrading instructions”

The technique iteratively identifies the address-generating instructions used by memory accesses and marks them, relying on the loops that occur naturally in software.
Rather than tracing back through instructions that have already executed or committed, it marks each producer one loop iteration later.

The paper introduces two new hardware structures.

Instruction Slice Table (IST): holds the addresses of the instructions that belong to a backward slice. At instruction dispatch, the IST decides whether an instruction should bypass the main queue: instructions recorded in the IST go to the bypass queue, and all others go to the main queue.
Register Dependency Table (RDT): maps each physical register to the instruction pointer of the instruction that last wrote it. The RDT is used to look up the instructions that wrote the registers needed for an address calculation, and the result is used to update the IST.

Example

(1) mov (r9+rax*8), xmm0
(2) mov esi, rax
(3) add xmm0, xmm0
(4) mul r8, rax
(5) add rdx, rax
(6) mul (r9+rax*8), xmm1

In the code above, (1) and (6) are memory load instructions. They go to the bypass queue, and the rest are dispatched to the main queue. As in a conventional in-order core, (1) and (2) then execute, but once (3) reaches the head of the queue it blocks because it depends on (1).

In the first iteration, (5) is detected as an address generator and is added to the IST. In the second iteration, (5) is found in the IST and is placed in the bypass queue instead of the main queue. Because (4) is the producer of (5), (4) is then added to the IST as well.

In the third iteration, (4), (5), and (6) are all issued to the bypass queue. In this way the architecture identifies critical instruction slices dynamically.
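The walk-through above can be reproduced with a small Python model of the IST and RDT. This is a rough sketch under several assumptions that are mine, not the paper's: the IST is an unbounded set, the RDT is a plain dictionary keyed by architectural register name, the last operand of each instruction is its destination, and for a load only the address registers are traced. The names `LOOP` and `run_ibda` are made up for this example.

```python
# Each entry: (instruction number, is_memory_op, registers traced, destination).
# Assumption: operands are written source-first, so the last operand is the destination.
LOOP = [
    (1, True,  ("r9", "rax"),  "xmm0"),  # mov (r9+rax*8), xmm0
    (2, False, ("esi",),       "rax"),   # mov esi, rax
    (3, False, ("xmm0",),      "xmm0"),  # add xmm0, xmm0
    (4, False, ("r8", "rax"),  "rax"),   # mul r8, rax
    (5, False, ("rdx", "rax"), "rax"),   # add rdx, rax
    (6, True,  ("r9", "rax"),  "xmm1"),  # mul (r9+rax*8), xmm1
]


def run_ibda(loop, iterations):
    """Return (bypass-queue contents per iteration, final IST)."""
    ist = set()   # instruction numbers known to be in a backward slice
    rdt = {}      # register -> instruction number that last wrote it
    history = []
    for _ in range(iterations):
        bypass = []
        for pc, is_mem, srcs, dest in loop:
            if is_mem or pc in ist:
                bypass.append(pc)
                # Mark the last writer of every traced source register.
                for reg in srcs:
                    producer = rdt.get(reg)
                    if producer is not None:
                        ist.add(producer)
            rdt[dest] = pc   # this instruction is now the last writer of dest
        history.append(bypass)
    return history, ist


if __name__ == "__main__":
    history, ist = run_ibda(LOOP, 3)
    for i, bypass in enumerate(history, start=1):
        print(f"iteration {i}: bypass queue gets {bypass}")
    print("IST after 3 iterations:", sorted(ist))
```

Running three iterations sends [1, 6], then [1, 5, 6], then [1, 4, 5, 6] to the bypass queue, which matches the description above. Under this simplified rule the model also marks (2) during the third pass, because (2) writes the rax that (4) reads. Whether the real design handles (2) the same way is not something this review states.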

LSC

Scheduling decisions are made in the front-end pipeline (using iterative backward dependency analysis), so instructions can be processed without any wake-up / selection step.

Loads and address-generating instructions (AGIs) can execute out of order with respect to the main instruction stream, but they do not need to execute out of order with respect to each other.

LSC is therefore implemented with two in-order queues, and no wake-up / selection logic is needed.
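A toy cycle model helps show why two in-order queues are enough to overlap memory accesses. The Python sketch below is only an illustration under my own assumptions: registers are already renamed so that each has a single writer, each queue issues at most one instruction per cycle and only from its head, a load takes 10 cycles and everything else takes 1, and stores, branches, and issue width are ignored. The instruction names, latencies, and the `simulate` function are hypothetical.

```python
from collections import deque

# Each entry: (name, source registers, destination register, latency in cycles).
L1 = ("L1", ("p0",),      "p1", 10)  # load  p1 <- [p0]
A1 = ("A1", ("p1",),      "p2", 1)   # add   p2 <- p1 + p1   (consumer of L1)
G1 = ("G1", ("p0",),      "p3", 1)   # add   p3 <- p0 + 8    (address generator)
L2 = ("L2", ("p3",),      "p4", 10)  # load  p4 <- [p3]
A2 = ("A2", ("p4", "p2"), "p5", 1)   # add   p5 <- p4 + p2


def simulate(main, bypass, max_cycles=1000):
    """Return {name: issue cycle}; only the head of each queue may issue."""
    queues = [deque(bypass), deque(main)]
    # Destinations of queued instructions are not ready until their producer issues.
    ready_at = {dest: float("inf") for q in queues for (_, _, dest, _) in q}
    issue_cycle = {}
    cycle = 0
    while any(queues) and cycle < max_cycles:
        for q in queues:
            if not q:
                continue
            name, srcs, dest, latency = q[0]
            # Registers with no queued producer (live-ins) default to ready at cycle 0.
            if all(ready_at.get(r, 0) <= cycle for r in srcs):
                q.popleft()
                issue_cycle[name] = cycle
                ready_at[dest] = cycle + latency
        cycle += 1
    return issue_cycle


if __name__ == "__main__":
    single = simulate(main=[L1, A1, G1, L2, A2], bypass=[])
    split = simulate(main=[A1, A2], bypass=[L1, G1, L2])
    print("single in-order queue:", single)
    print("main + bypass queues: ", split)
```

With a single queue, the second load cannot issue until the consumer of the first load has been unblocked, so the two misses are serialized and A2 issues at cycle 22. With the split queues, L2 issues at cycle 2 while A1 is still waiting at the head of the main queue, and A2 issues at cycle 12. Neither case scans past a queue head, which is the sense in which no wake-up / selection logic is required.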

Evaluation

The paper reports that LSC improves performance by 53% on average over the baseline, at a cost of only 15% area overhead and 22% power overhead.

In particular, in 99.9% of cases about 7 iterations were enough to find all address-generating instructions.

Reference

Carlson, Trevor E., et al. “The load slice core microarchitecture.” 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA). IEEE, 2015.