Emilio G. Cota
Luca P. Carloni
VEE'19
April 14, 2019
Providence, RI
Columbia University
Dynamic Binary Translation (DBT) is widely used, e.g.
Tool | Speed | Cross-ISA | Full-system |
---|---|---|---|
DynamoRIO | ✔ Fast | ✘ | ✘ |
Pin | ✔ Fast | ✘ | ✘ |
QEMU (& derivatives) | ✘ Slow | ✔ | ✔ |
Open source: https://www.qemu.org
Widely used in both industry and academia
Supports many ISAs through DBT via TCG, its Intermediate Representation (IR)
Our contributions are not QEMU-specific
They are applicable to cross-ISA DBT tools at large
[*] Bellard. "QEMU, a fast and portable dynamic translator", ATC, 2005
[*] Cota, Bonzini, Bennée, Carloni. "Cross-ISA Machine Emulation for Multicores", CGO, 2017
1. Correct cross-ISA FP emulation using the host FPU
2. Integration of two state-of-the-art optimizations:
   - indirect branch handling
   - dynamic sizing of the software TLB
3. Make the DBT engine scale under heavy code translation
   - Not just during execution
4. Fast, ISA-agnostic instrumentation layer for QEMU
baseline (incorrect): always uses the host FPU and never reads excp. flags
How common?
of FP instructions in SPECfp06
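float64 float64_mul(float64 a, float64 b, fp_status *st)
{
    float64_input_flush2(&a, &b, st);
    /*
     * Fast path: use the host FPU only when both inputs are zero or normal,
     * the inexact flag is already set (so it need not be recomputed), and
     * the guest rounding mode is round-to-nearest-even.
     */
    if (likely(float64_is_zero_or_normal(a) &&
               float64_is_zero_or_normal(b) &&
               (st->exception_flags & FP_INEXACT) &&
               st->round_mode == FP_ROUND_NEAREST_EVEN)) {
        if (float64_is_zero(a) || float64_is_zero(b)) {
            /* Exact result: a zero whose sign is the XOR of the input signs. */
            bool neg = float64_is_neg(a) ^ float64_is_neg(b);
            return float64_set_sign(float64_zero, neg);
        } else {
            double ha = float64_to_double(a);
            double hb = float64_to_double(b);
            double hr = ha * hb;
            if (unlikely(isinf(hr))) {
                /* Finite inputs produced infinity: record the overflow. */
                st->exception_flags |= FP_OVERFLOW;
            } else if (unlikely(fabs(hr) <= DBL_MIN)) {
                /* Possible underflow: let soft-float compute the flags. */
                goto soft_fp;
            }
            return double_to_float64(hr);
        }
    }
soft_fp:
    return soft_float64_mul(a, b, st);
}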
... and similarly for 32/64-bit \(+\), \(-\), \(\times\), \(\div\), \(\surd\), and comparison (==)
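The fast path above depends on cheap classification of the guest operands. As a minimal sketch, assuming `float64` is a raw 64-bit IEEE 754 binary64 bit pattern (an assumption made for illustration, not necessarily how QEMU defines or implements these helpers), the checks and the guest/host conversions could look like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t float64;   /* assumption: raw IEEE 754 binary64 bits */

static_assert(sizeof(double) == sizeof(float64), "host double must be 64-bit");

#define F64_SIGN_MASK 0x8000000000000000ULL
#define F64_EXP_MASK  0x7ff0000000000000ULL

/* True for +0.0 and -0.0. */
static bool float64_is_zero(float64 a)
{
    return (a & ~F64_SIGN_MASK) == 0;
}

/* True for zeros and normal numbers; false for denormals, inf and NaN. */
static bool float64_is_zero_or_normal(float64 a)
{
    uint64_t exp = a & F64_EXP_MASK;

    return float64_is_zero(a) || (exp != 0 && exp != F64_EXP_MASK);
}

/* Reinterpret guest bits as a host double without breaking aliasing rules. */
static double float64_to_double(float64 a)
{
    double d;

    memcpy(&d, &a, sizeof(d));
    return d;
}

/* Reinterpret a host double as guest bits. */
static float64 double_to_float64(double d)
{
    float64 a;

    memcpy(&a, &d, sizeof(a));
    return a;
}
```

Denormals, infinities and NaNs fail the check and therefore always take the soft-float path, which keeps the host-FPU path free of the cases where flag behaviour is hardest to reproduce.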
derived from state-of-the-art DBT engines
[A] Hong, Hsu, Chou, Hsu, Liu, Wu. "Optimizing Control Transfer and Memory Virtualization in Full System Emulators", ACM TACO, 2015
[B] Tong, Koju, Kawahito, Moshovos. "Optimizing memory translation emulation in full system emulators", ACM TACO, 2015
user-mode x86_64-on-x86_64. Baseline: QEMU v3.1.0
full-system x86_64-on-x86_64. Baseline: QEMU v3.1.0
with a shared translation block (TB) cache
Parallel TB execution (green blocks)
Serialized TB generation (red blocks) with a global lock
Parallel TB execution
Parallel TB generation (one region per vCPU)
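A minimal sketch of the "one region per vCPU" idea, with hypothetical names (`tb_region`, `regions_init`, `region_alloc`) that are not taken from QEMU: the code buffer is split into disjoint regions, and each vCPU bump-allocates space for new TBs from its own region, so generation needs no global lock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout: this is not QEMU's actual code. */
struct tb_region {
    uint8_t *start;   /* first byte owned by this vCPU */
    uint8_t *next;    /* bump pointer for the next translated block */
    uint8_t *end;     /* one past the last byte of the region */
};

/* Split one code buffer into n equal regions, one per vCPU. */
static void regions_init(struct tb_region *r, size_t n,
                         uint8_t *buf, size_t size)
{
    assert(n > 0 && size / n > 0);
    size_t per = size / n;

    for (size_t i = 0; i < n; i++) {
        r[i].start = r[i].next = buf + i * per;
        r[i].end = r[i].start + per;
    }
}

/*
 * Reserve len bytes for a new TB in the calling vCPU's region.
 * Only the owning vCPU touches r->next, so no lock is needed.
 * Returns NULL when the region is full; a real engine would then
 * have to obtain more space or flush the cache.
 */
static uint8_t *region_alloc(struct tb_region *r, size_t len)
{
    if ((size_t)(r->end - r->next) < len) {
        return NULL;
    }
    uint8_t *p = r->next;

    r->next += len;
    return p;
}
```

The sketch fixes the region-to-vCPU mapping at start-up for simplicity; how regions are handed out once a vCPU exhausts its own is a policy choice left out here.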
Guest VM performing parallel compilation of Linux kernel modules, x86_64-on-x86_64
[*] Cota, Bonzini, Bennée, Carloni. "Cross-ISA Machine Emulation for Multicores", CGO, 2017
x86_64-on-x86_64 (lower is better). Baseline: KVM
Qelt faster than the state-of-the-art, even for heavy instrumentation (cachesim)
x86_64-on-x86_64 (lower is better). Baseline: native
We hope our work will enable further adoption of QEMU to perform cross-ISA emulation and instrumentation
user-mode x86-on-x86
/* Pin: call MemCB before every instruction that reads memory */
VOID Instruction(INS ins)
{
    if (INS_IsMemoryRead(ins))
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)MemCB, ...);
}

/* Pin: visit every instruction of every basic block in the trace */
VOID Trace(TRACE trace, VOID *v)
{
    for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl))
        for (INS ins = BBL_InsHead(bbl); INS_Valid(ins); ins = INS_Next(ins))
            Instruction(ins);
}
/* QEMU plugin: register a memory-read callback on every instruction of a TB */
static void vcpu_tb_trans(qemu_plugin_id_t id, unsigned int cpu_index,
                          struct qemu_plugin_tb *tb)
{
    size_t n = qemu_plugin_tb_n_insns(tb);
    size_t i;

    for (i = 0; i < n; i++) {
        struct qemu_plugin_insn *insn = qemu_plugin_tb_get_insn(tb, i);

        qemu_plugin_register_vcpu_mem_cb(insn, vcpu_mem, QEMU_PLUGIN_CB_NO_REGS,
                                         QEMU_PLUGIN_MEM_R);
    }
}
user-mode, x86_64-on-x86_64
user-mode x86_64-on-x86_64. Baseline: QEMU v3.1.0
CactusADM: TLB resizing doesn't kick in often enough (we only do it on TLB flushes)
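To make the "only on TLB flushes" policy concrete, here is a minimal sketch of a resize-on-flush heuristic. The structure, the function name, the size limits and the 70%/30% thresholds are all assumptions chosen for illustration; the slides do not state the actual values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limits and thresholds, chosen only for illustration. */
#define TLB_MIN_ENTRIES ((size_t)64)
#define TLB_MAX_ENTRIES ((size_t)1 << 16)

struct soft_tlb {
    size_t n_entries;   /* current capacity, a power of two */
    size_t n_used;      /* entries filled since the last flush */
};

/*
 * Called on every TLB flush: the contents are discarded anyway,
 * so this is a convenient point to change the table size.
 * Returns the capacity to use until the next flush.
 */
static size_t tlb_flush_and_resize(struct soft_tlb *tlb)
{
    assert(tlb->n_entries > 0);
    size_t use_pct = tlb->n_used * 100 / tlb->n_entries;

    if (use_pct > 70 && tlb->n_entries < TLB_MAX_ENTRIES) {
        tlb->n_entries *= 2;          /* heavily used: grow */
    } else if (use_pct < 30 && tlb->n_entries > TLB_MIN_ENTRIES) {
        tlb->n_entries /= 2;          /* mostly empty: shrink */
    }
    tlb->n_used = 0;                  /* the flush empties the table */
    return tlb->n_entries;
}
```

Under a scheme like this, a workload that rarely flushes the TLB gets few chances to grow it, which is consistent with the CactusADM observation above.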
lower is better
[^] Faravelon, Gruber, Pétrot. "Optimizing memory access performance using hardware assisted virtualization in retargetable dynamic binary translation", Euromicro Conference on Digital System Design (DSD), 2017
[*] Belay, Bittau, Mashtizadeh, Terei, Mazières, Kozyrakis. "Dune: Safe user-level access to privileged CPU features", OSDI, 2012
Before: softMMU requires many instructions
After: only 2 instructions thanks to shadow page tables
Advantages:
Disadvantages:
x86-on-ppc64, make -j N inside a VM
aarch64-on-aarch64, Nbench FP
aarch64-on-x86, SPECfp06
ind. branches, aarch64-on-x86
ind. branches, x86-on-aarch64
bench | before | after1 | after2 | after3 | final speedup |
---|---|---|---|---|---|
aes | 1.12s | 1.12s | 1.10s | 1.00s | 1.12 |
bigint | 0.78s | 0.78s | 0.78s | 0.78s | 1.00 |
dhryst | 0.96s | 0.97s | 0.49s | 0.49s | 1.96 |
miniz | 1.94s | 1.94s | 1.88s | 1.86s | 1.04 |
norx | 0.51s | 0.51s | 0.49s | 0.48s | 1.06 |
primes | 0.85s | 0.85s | 0.84s | 0.84s | 1.01 |
qsort | 4.87s | 4.88s | 1.86s | 1.86s | 2.62 |
sha512 | 0.76s | 0.77s | 0.64s | 0.64s | 1.19 |
bench | before | after1 | after2 | after3 | final speedup |
---|---|---|---|---|---|
aes | 2.68s | 2.54s | 2.60s | 2.34s | 1.15 |
bigint | 1.61s | 1.56s | 1.55s | 1.64s | 0.98 |
dhryst | 1.78s | 1.67s | 1.25s | 1.24s | 1.44 |
miniz | 3.53s | 3.35s | 3.28s | 3.35s | 1.05 |
norx | 1.13s | 1.09s | 1.07s | 1.06s | 1.07 |
primes | 15.37s | 15.41s | 15.20s | 15.37s | 1.00 |
qsort | 7.20s | 6.71s | 3.85s | 3.96s | 1.82 |
sha512 | 1.07s | 1.04s | 0.90s | 0.90s | 1.19 |
Ind. branches, RISC-V on x86, user-mode
Ind. branches, RISC-V on x86, full-system