Emilio G. Cota
Paolo Bonzini
Alex Bennée
Luca P. Carloni
Columbia University
Red Hat, Inc.
Linaro, Ltd.
Columbia University
CGO 2017
Austin, TX
Increasing core counts for emulation guests (typically high-performance SoCs)
Scalable Cross-ISA Emulation
[14] J. H. Ding et al. PQEMU: A parallel system emulator based on QEMU. ICPADS, pages 276–283, 2011
[24] D. Lustig et al. ArMOR: defending against memory consistency model mismatches in heterogeneous architectures. ISCA, pages 388–400, 2015
[33] Z. Wang et al. COREMU: A scalable and portable parallel full-system emulator. PPoPP, pages 213–222, 2011
Challenges:
(1) Scalability of shared data structures; key data structure: the translation code cache
(2) Correct cross-ISA emulation of (2.1) memory consistency and (2.2) atomics, i.e. compare-and-swap vs. load-locked/store-conditional
PQEMU [14] and COREMU [33] do not address (2)
ArMOR [24] solves (2.1)
Open source: http://qemu-project.org
Widely used in both industry and academia
Supports many ISAs through TCG, its intermediate representation (IR)
The techniques presented here are applicable to Dynamic Binary Translators at large
[7] F. Bellard. QEMU, a fast and portable dynamic translator. USENIX ATC, pages 41–46, 2005
To scale, we need concurrent code execution
[12] D. Bruening, V. Kiriansky, T. Garnett, and S. Banerji. Thread-shared software code caches. CGO, pages 28–38, 2006
Problems in the TB Hash Table:
Long hash chains: slow lookups
Fixed number of buckets
hash=h(phys_addr) leads to uneven chain lengths
No support for concurrent lookups
hash=h(phys_addr, phys_PC, cpu_flags): uniform chain distribution
e.g. longest chain down from 550 to 40 TBs when booting ARM Linux
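The improved hash can be sketched as follows. QEMU's real function (tb_hash_func) is xxhash-based; this stand-in only illustrates folding the three inputs into a single well-distributed 32-bit value, using generic mixing constants (a golden-ratio combine plus a MurmurHash3-style finalizer) that are not taken from QEMU:

#include <stdint.h>

/* Illustrative stand-in for the improved TB hash (not QEMU's code):
 * fold phys_addr, phys_PC and cpu_flags into one 32-bit value so that
 * translation blocks spread evenly across buckets. */
static inline uint32_t tb_hash_sketch(uint64_t phys_addr, uint64_t phys_pc,
                                      uint32_t cpu_flags)
{
    uint64_t h = phys_addr;

    /* combine the inputs (hash_combine-style, golden-ratio constant) */
    h ^= phys_pc + 0x9e3779b97f4a7c15ull + (h << 6) + (h >> 2);
    h ^= cpu_flags + 0x9e3779b97f4a7c15ull + (h << 6) + (h >> 2);

    /* MurmurHash3-style finalizer to spread the bits */
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdull;
    h ^= h >> 33;
    return (uint32_t)h;
}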
QHT: A resizable, scalable Hash Table
Fast, concurrent lookups
Low update rate: at most 6% when booting Linux
Candidate #1: ck_hs [1] (similar to [12])
Candidate #2: CLHT [13]
Candidate #3: our proposal, QHT
[1] http://concurrencykit.org
[12] D. Bruening, V. Kiriansky, T. Garnett, and S. Banerji. Thread-shared software code caches. CGO, pages 28–38, 2006
[13] T. David, R. Guerraoui, and V. Trigonakis. Asynchronized concurrency: The secret to scaling concurrent search data structures. ASPLOS, pages 631–644, 2015
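As a rough illustration of QHT's design (field names and sizes are approximations, not the exact QEMU definitions; QEMU's QemuSpin and QemuSeqLock types are assumed), each bucket is cache-line sized, holds a handful of hash/pointer pairs, and pairs a per-bucket seqlock for lock-free reads with a spinlock that serializes writers:

/* Approximate sketch of a QHT bucket (illustrative only). */
struct qht_bucket {
    QemuSpin lock;            /* serializes writers to this bucket */
    QemuSeqLock sequence;     /* odd while a writer is updating */
    uint32_t hashes[4];       /* full hashes, checked before the pointer */
    void *pointers[4];        /* NULL marks an empty slot */
    struct qht_bucket *next;  /* overflow chain, kept short by resizing */
} __attribute__((aligned(64)));   /* one bucket per cache line */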
Challenge: How to correctly emulate atomics in a parallel environment, without hurting scalability?
Two families of atomic instructions:
Compare-and-Swap (CAS):
/* runs as a single atomic instruction */
bool CAS(type *ptr, type old, type new) {
    if (*ptr != old) {
        return false;
    }
    *ptr = new;
    return true;
}

Load-Locked/Store-Conditional (LL/SC):
/*
 * store_exclusive() returns 1 if addr has
 * been written to since load_exclusive()
 */
do {
    val = load_exclusive(addr);
    val += 1; /* do something */
} while (store_exclusive(addr, val));
Alpha: ldl_l/stl_c
POWER: lwarx/stwcx
ARM: ldrex/strex
aarch64: ldaxr/stlxr
MIPS: ll/sc
RISC-V: lr/sc
x86/IA-64: cmpxchg
Cannot safely leverage the host's LL/SC: operations allowed between LL and SC pairs are limited
LL/SC is stronger than CAS: ABA problem
Init: *addr = A; (time flows downward in both examples)

CAS:
cpu0                                       | cpu1
-------------------------------------------|---------------------
do {                                       |
    val = atomic_read(addr); /* reads A */ |
    ...                                    | atomic_set(addr, B);
    ...                                    | atomic_set(addr, A);
} while (!CAS(addr, val, newval));         |

LL/SC:
cpu0                                            | cpu1
------------------------------------------------|---------------------
do {                                            |
    val = load_exclusive(addr); /* reads A */   |
    ...                                         | atomic_set(addr, B);
    ...                                         | atomic_set(addr, A);
} while (store_exclusive(addr, newval));        |

SC fails, regardless of the contents of *addr
CAS succeeds where SC failed!
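The example above corresponds to the straightforward software mapping: emulate the guest's store-exclusive with a host CAS against the value returned by the guest's load-exclusive. A hedged sketch using C11 atomics (the function names are made up for illustration; this is not QEMU's code):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-vCPU "exclusive monitor" state for the sketch. */
static _Atomic uint32_t *exclusive_addr;
static uint32_t exclusive_val;

static uint32_t emulate_load_exclusive(_Atomic uint32_t *addr)
{
    exclusive_addr = addr;
    exclusive_val = atomic_load(addr);
    return exclusive_val;
}

/* Returns true if the emulated SC "succeeds".  As the table above shows,
 * this is unsound: if the location went A -> B -> A in between, the CAS
 * still succeeds even though a real SC would have failed. */
static bool emulate_store_exclusive(_Atomic uint32_t *addr, uint32_t newval)
{
    uint32_t expected = exclusive_val;

    return addr == exclusive_addr &&
           atomic_compare_exchange_strong(addr, &expected, newval);
}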
3 proposed options:
Correct & scalable
No need to instrument regular stores
But requires hardware support
[9] C. Blundell, E. C. Lewis, and M. M. Martin. Subtleties of transactional memory atomicity semantics. Computer Architecture Letters, 5(2), 2006.
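The option that "requires hardware support" is presumably hardware transactional memory. A minimal sketch of that idea, assuming Intel RTM intrinsics (_xbegin/_xend from immintrin.h, built with -mrtm); this is an illustration of the approach, not the paper's implementation:

#include <immintrin.h>
#include <stdbool.h>

/* Run the guest's LL...SC sequence inside a hardware transaction.
 * Any concurrent store to the accessed cache line aborts the
 * transaction, so atomicity holds without instrumenting regular stores. */
static bool emulated_llsc_add(int *addr, int inc)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        *addr += inc;      /* guest work between LL and SC */
        _xend();
        return true;       /* report SC success to the guest */
    }
    return false;          /* aborted: report SC failure, guest retries */
}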
Pico-user, single thread, aarch64-on-x86
Pico-user atomic_add, multi-threaded, aarch64-on-POWER
struct count {
    u64 val;
} __aligned(64); /* avoid false sharing */

struct count *counts;

while (!test_stop) {
    int index = rand() % n_elements;
    atomic_increment(&counts[index].val);
}
Scalable DBT design with a shared code cache
Scalable, correct cross-ISA emulation of atomics
QEMU v2.7 includes our improved hashing + QHT
QEMU v2.8 includes:
Atomic instruction emulation
Support for parallel user-space emulation
Under review for v2.9: parallel full-system emulation
x86-on-POWER
We applied ArMOR's [24] FSMs, and also leveraged …
[24] D. Lustig et al. ArMOR: defending against memory consistency model mismatches in heterogeneous architectures. ISCA, pages 388–400, 2015
RCU is a way of waiting for things to finish, without tracking every one of them
Credit: Paul McKenney
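To make the quote concrete, here is the canonical publish/wait/free pattern, written with Linux/liburcu-style primitive names (QEMU ships its own RCU implementation with a similar interface); the struct and function names are made up for the example, and reader-thread registration and flavor selection are omitted:

#include <stdlib.h>
#include <urcu.h>   /* liburcu; QEMU's include/qemu/rcu.h is similar */

struct config { int value; };

struct config *global_cfg;    /* pointer published to readers */

/* Reader: cheap, never blocks the writer, never tracked individually. */
int read_value(void)
{
    int val;

    rcu_read_lock();
    val = rcu_dereference(global_cfg)->value;
    rcu_read_unlock();
    return val;
}

/* Writer: publish the new version, then wait for all pre-existing
 * readers to finish ("waiting for things to finish") before freeing. */
void publish(struct config *new_cfg)
{
    struct config *old = global_cfg;

    rcu_assign_pointer(global_cfg, new_cfg);
    synchronize_rcu();
    free(old);
}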
/*
 * Lock-free lookup: retry if a writer touched this bucket while we
 * were traversing it (the bucket's seqlock sequence number changed).
 */
void *qht_lookup__slowpath(struct qht_bucket *b, qht_lookup_func_t func,
                           const void *userp, uint32_t hash)
{
    unsigned int version;
    void *ret;

    do {
        version = seqlock_read_begin(&b->sequence);
        ret = qht_do_lookup(b, func, userp, hash);
    } while (seqlock_read_retry(&b->sequence, version));
    return ret;
}
Seqlock reader/writer timeline: the writer increments the sequence number before and after each update (seq=3 to seq=4 in the example), so it is odd while an update is in flight and concurrent readers retry.
Reader: the sequence number must be even, and must remain unaltered across the read. Otherwise, retry.
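For completeness, a sketch of the corresponding writer side, assuming QEMU's seqlock_write_begin()/seqlock_write_end() and qemu_spin_lock() helpers and the approximate bucket layout sketched earlier; this mirrors what QHT's update paths do, but is not quoted from the source:

/* Writer-side sketch: writers are serialized by a per-bucket spinlock,
 * and bump the bucket's seqlock around the update so that concurrent
 * lock-free readers in qht_lookup__slowpath() notice the change and retry. */
static void qht_bucket_update(struct qht_bucket *b, int i,
                              uint32_t hash, void *p)
{
    qemu_spin_lock(&b->lock);
    seqlock_write_begin(&b->sequence);   /* sequence becomes odd */
    b->hashes[i] = hash;                 /* assumed bucket fields */
    b->pointers[i] = p;
    seqlock_write_end(&b->sequence);     /* sequence becomes even again */
    qemu_spin_unlock(&b->lock);
}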
val_t val = atomic_read(&bucket->val[i]);
smp_rmb();
if (atomic_read(&bucket->key[i]) == key &&
    atomic_read(&bucket->val[i]) == val) {
    /* found */
}
The memory allocator of the values must guarantee that the same address cannot appear twice during the lifespan of an operation; otherwise a value freed and reallocated mid-lookup could make the consistency check above pass for a stale key/value pairing.
[13] T. David, R. Guerraoui, and V. Trigonakis. Asynchronized concurrency: The secret to scaling concurrent search data structures. ASPLOS, pages 631–644, 2015
IRIW (independent reads of independent writes) litmus test. Init: x = y = 0.

cpu0 | cpu1 | cpu2  | cpu3
-----|------|-------|------
x=1  | y=1  | r1=x  | r3=y
     |      | r2=y  | r4=x

The outcome in question is r1=1, r2=0, r3=1, r4=0: cpu2 and cpu3 observe the two independent writes in opposite orders.

[10] H.-J. Boehm and S. V. Adve. Foundations of the C++ concurrency memory model. ACM SIGPLAN Notices, volume 43, pages 68–78, 2008
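For reference, an illustrative version of the test with C11 atomics and pthreads (not from the talk): with memory_order_relaxed nothing forbids the outcome above, and non-multi-copy-atomic hosts such as POWER can produce it at the hardware level, whereas making every access memory_order_seq_cst forbids it, which is the guarantee the model in [10] gives for seq_cst atomics. Observing the reordering in practice requires many iterations on a suitably weak host.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x, y;           /* init: x = y = 0 */
static int r1, r2, r3, r4;

static void *t0(void *arg) { atomic_store_explicit(&x, 1, memory_order_relaxed); return NULL; }
static void *t1(void *arg) { atomic_store_explicit(&y, 1, memory_order_relaxed); return NULL; }
static void *t2(void *arg)
{
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    r2 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}
static void *t3(void *arg)
{
    r3 = atomic_load_explicit(&y, memory_order_relaxed);
    r4 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void)
{
    pthread_t th[4];
    void *(*fn[4])(void *) = { t0, t1, t2, t3 };

    for (int i = 0; i < 4; i++) {
        pthread_create(&th[i], NULL, fn[i], NULL);
    }
    for (int i = 0; i < 4; i++) {
        pthread_join(th[i], NULL);
    }
    printf("r1=%d r2=%d r3=%d r4=%d\n", r1, r2, r3, r4);
    return 0;
}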