Mctrain's Blog

What I learned in IT, as well as thought about life

KVM: Memory Management Using Hardware Assisted Paging

This post introduces the memory virtualization mechanism in the current KVM code, mostly for Intel IA-32e mode (long-mode paging). More specifically, it covers the initialization and overall mechanism of KVM's two-dimensional paging, the page fault handling process, and a little about the TLB and paging-structure caches and how they are invalidated in a virtualized environment.

This post draws on two blog posts: EPT in kvm and kvm: hardware assisted paging. The information about the TLB and paging-structure caches comes mainly from the Intel manual; the rest comes from my notes on reading the KVM code.

Before continuing, I should say that this post assumes readers know what shadow paging and EPT (Extended Page Table) are, as well as why Intel introduced hardware-assisted paging after the shadow paging mechanism. If you do not, please refer to this blog and these materials.

MMU Initialization

When a user creates a virtual machine with qemu or a libvirt-like tool, qemu issues an ioctl call to KVM to create a vcpu. The whole process starts from the code in virt/kvm/kvm_main.c:

virt/kvm/kvm_main.c
static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
{
  ...
  vcpu = kvm_arch_vcpu_create(kvm, id);
  ...
  r = kvm_arch_vcpu_setup(vcpu);
  ...
}
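
For context, the userspace side of this call can be sketched as follows. This is a minimal sketch, not what qemu actually does: the function name create_first_vcpu is hypothetical, error handling is reduced to a single return code, and it assumes a Linux host with the kernel headers installed. Only the /dev/kvm device and the KVM_CREATE_VM and KVM_CREATE_VCPU ioctls come from the KVM API.

```c
#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Create a VM and a single vcpu through the KVM ioctl interface.
 * Returns 0 on success, -1 if /dev/kvm cannot be opened or an ioctl fails.
 * (Hypothetical helper for illustration only.)
 */
int create_first_vcpu(void)
{
    int kvm_fd, vm_fd, vcpu_fd;

    kvm_fd = open("/dev/kvm", O_RDWR);
    if (kvm_fd < 0)
        return -1;

    /* KVM_CREATE_VM returns a new file descriptor that represents the VM. */
    vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0UL);
    if (vm_fd < 0) {
        close(kvm_fd);
        return -1;
    }

    /* Issued on the VM fd; the argument is the vcpu id (0 here). */
    vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0UL);
    if (vcpu_fd < 0) {
        close(vm_fd);
        close(kvm_fd);
        return -1;
    }

    close(vcpu_fd);
    close(vm_fd);
    close(kvm_fd);
    return 0;
}
```

The KVM_CREATE_VCPU ioctl on the VM file descriptor is the request that kvm_vm_ioctl_create_vcpu services on the kernel side.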

For `kvm_arch_vcpu_create`:
arch/x86/kvm/x86.c
struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm,
                                      unsigned int id)
{
  ...
  return kvm_x86_ops->vcpu_create(kvm, id);
}

On Intel hardware, kvm_x86_ops points to vmx_x86_ops, which is defined in vmx.c and contains some very important function pointers, as shown below:

arch/x86/kvm/vmx.c
static struct kvm_x86_ops vmx_x86_ops = {
  ...
  .vcpu_create = vmx_create_vcpu,
  .set_cr3 = vmx_set_cr3,
  .run = vmx_vcpu_run,
  .handle_exit = vmx_handle_exit,
  .set_tdp_cr3 = vmx_set_cr3,
  ...
};
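
The dispatch pattern used here, where generic x86 code calls through a table of function pointers filled in by the vendor-specific module, can be modelled in a few lines. This is a toy sketch under stated assumptions: every toy_ name is hypothetical, and the structures are drastically reduced stand-ins for struct kvm_vcpu and struct kvm_x86_ops.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for struct kvm_vcpu. */
struct toy_vcpu {
    unsigned int id;
    const char *backend;
};

/* Hypothetical stand-in for struct kvm_x86_ops: one hook only. */
struct toy_x86_ops {
    int (*vcpu_create)(struct toy_vcpu *vcpu, unsigned int id);
};

/* Vendor-specific implementation, playing the role of vmx_create_vcpu. */
static int toy_vmx_create_vcpu(struct toy_vcpu *vcpu, unsigned int id)
{
    vcpu->id = id;
    vcpu->backend = "vmx";
    return 0;
}

/* The vendor module fills in the table with designated initializers. */
static const struct toy_x86_ops toy_vmx_ops = {
    .vcpu_create = toy_vmx_create_vcpu,
};

/* Generic code only ever sees this pointer, never the vmx functions. */
static const struct toy_x86_ops *toy_ops = &toy_vmx_ops;

/* Plays the role of kvm_arch_vcpu_create: dispatch through the table. */
int toy_arch_vcpu_create(struct toy_vcpu *vcpu, unsigned int id)
{
    assert(toy_ops != NULL && toy_ops->vcpu_create != NULL);
    return toy_ops->vcpu_create(vcpu, id);
}
```

The point of the indirection is that the generic caller stays identical no matter which backend installed the table.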

So kvm_x86_ops->vcpu_create will invoke:

arch/x86/kvm/vmx.c
static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
{
  ...
  allocate_vpid(vmx); /* this will be discussed later */

  err = kvm_vcpu_init(&vmx->vcpu, kvm, id);
  ...
  if (enable_ept) {
    if (!kvm->arch.ept_identity_map_addr)
      kvm->arch.ept_identity_map_addr =
        VMX_EPT_IDENTITY_PAGETABLE_ADDR;
    ...
  }
  ...
}

kvm_vcpu_init then calls the architecture-specific kvm_arch_vcpu_init:

virt/kvm/kvm_main.c
int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
{
  ...
  r = kvm_arch_vcpu_init(vcpu);
  ...
}

kvm_arch_vcpu_init in turn creates the vcpu's MMU:

arch/x86/kvm/x86.c
int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
{
  ...
  r = kvm_mmu_create(vcpu);
  ...
}

kvm_mmu_create then does some simple initialization:

arch/x86/kvm/mmu.c
int kvm_mmu_create(struct kvm_vcpu *vcpu)
{
  ...
  vcpu->arch.walk_mmu = &vcpu->arch.mmu;
  vcpu->arch.mmu.root_hpa = INVALID_PAGE;
  vcpu->arch.mmu.translate_gpa = translate_gpa;
  vcpu->arch.nested_mmu.translate_gpa = translate_nested_gpa;

  return alloc_mmu_pages(vcpu);
}

For kvm_arch_vcpu_setup:

arch/x86/kvm/x86.c
int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
{
  ...
  r = kvm_mmu_setup(vcpu);
  ...
}

kvm_mmu_setup simply invokes init_kvm_mmu, which invokes init_kvm_tdp_mmu since we have enabled two-dimensional paging (TDP):

arch/x86/kvm/mmu.
static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
{
  struct kvm_mmu *context = vcpu->arch.walk_mmu;

  ...
  context->page_fault = tdp_page_fault;
  ...
  context->root_hpa = INVALID_PAGE;
  context->direct_map = true;
  context->set_cr3 = kvm_x86_ops->set_tdp_cr3;
  context->get_cr3 = get_cr3;
  context->get_pdptr = kvm_pdptr_read;

  if (is_long_mode(vcpu)) {
    context->nx = is_nx(vcpu);
    context->root_level = PT64_ROOT_LEVEL;
    reset_rsvds_bits_mask(vcpu, context);
    context->gva_to_gpa = paging64_gva_to_gpa;
  }

  return 0;
}
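
To make root_level more concrete: in long mode the paging structure has four levels, and each level consumes 9 bits of the address above the 12-bit page offset. The following is a minimal sketch of that index arithmetic. It assumes PT64_ROOT_LEVEL is 4 and 4 KiB pages; the TOY_ constants and toy_pt_index are hypothetical names, not the KVM macros.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT  12  /* 4 KiB pages: low 12 bits are the offset */
#define TOY_LEVEL_BITS  9   /* 512 entries per paging structure */
#define TOY_ROOT_LEVEL  4   /* assumed value of the 64-bit root level */

/*
 * Index into the paging structure at `level` for address `addr`.
 * Level 4 is the root (PML4); level 1 is the last-level page table.
 */
unsigned int toy_pt_index(uint64_t addr, int level)
{
    assert(level >= 1 && level <= TOY_ROOT_LEVEL);

    unsigned int shift = TOY_PAGE_SHIFT + (level - 1) * TOY_LEVEL_BITS;

    return (unsigned int)((addr >> shift) & ((1u << TOY_LEVEL_BITS) - 1));
}
```

With two-dimensional paging the same arithmetic applies twice: once to the guest virtual address in the guest's own tables, and once to each guest physical address in the EPT.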

After initialization, and before entering the guest VM, KVM invokes kvm_mmu_load, which sets both the EPT pointer and the guest's cr3:

arch/x86/kvm/mmu.c
int kvm_mmu_load(struct kvm_vcpu *vcpu)
{
  ...
  r = mmu_alloc_roots(vcpu);
  spin_lock(&vcpu->kvm->mmu_lock);
  mmu_sync_roots(vcpu);
  spin_unlock(&vcpu->kvm->mmu_lock);
  ...
  vcpu->arch.mmu.set_cr3(vcpu, vcpu->arch.mmu.root_hpa);
  ...
}
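
The role of INVALID_PAGE in this flow is that of a sentinel: root_hpa starts out invalid at creation and setup time, and the root is only allocated and handed to the hardware when the MMU is loaded. Here is a toy model of that idea. All toy_ names are hypothetical, the "allocator" is just a caller-supplied address, and the "load only if the root is still invalid" guard in toy_mmu_reload is an assumption of this sketch rather than something shown in the code above.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INVALID_PAGE (~(uint64_t)0)

/* Hypothetical, reduced stand-in for the per-vcpu MMU context. */
struct toy_mmu {
    uint64_t root_hpa;     /* host-physical address of the root table */
    uint64_t loaded_root;  /* last value passed to the set_cr3 step */
    int loads;             /* how many times the root was loaded */
};

/* Mirrors the creation step: no root exists yet. */
void toy_mmu_create(struct toy_mmu *mmu)
{
    mmu->root_hpa = TOY_INVALID_PAGE;
    mmu->loaded_root = TOY_INVALID_PAGE;
    mmu->loads = 0;
}

/* Mirrors the load step: allocate a root, then install it. */
int toy_mmu_load(struct toy_mmu *mmu, uint64_t free_page)
{
    assert((free_page & 0xfff) == 0);  /* roots are page aligned */
    mmu->root_hpa = free_page;         /* "mmu_alloc_roots" */
    mmu->loaded_root = mmu->root_hpa;  /* "set_cr3(root_hpa)" */
    mmu->loads++;
    return 0;
}

/* Assumed guard: do the expensive load only when no root is present. */
int toy_mmu_reload(struct toy_mmu *mmu, uint64_t free_page)
{
    if (mmu->root_hpa != TOY_INVALID_PAGE)
        return 0;
    return toy_mmu_load(mmu, free_page);
}
```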

vcpu->arch.mmu.set_cr3 invokes vmx_set_cr3 in vmx.c:

arch/x86/kvm/vmx.c
static u64 construct_eptp(unsigned long root_hpa)
{
  u64 eptp;

  /* TODO write the value reading from MSR */
  eptp = VMX_EPT_DEFAULT_MT |
    VMX_EPT_DEFAULT_GAW << VMX_EPT_GAW_EPTP_SHIFT;
  if (enable_ept_ad_bits)
    eptp |= VMX_EPT_AD_ENABLE_BIT;
  eptp |= (root_hpa & PAGE_MASK);

  return eptp;
}

static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
{
  unsigned long guest_cr3;
  u64 eptp;

  guest_cr3 = cr3;
  if (enable_ept) {
    eptp = construct_eptp(cr3);
    vmcs_write64(EPT_POINTER, eptp);
    guest_cr3 = is_paging(vcpu) ? kvm_read_cr3(vcpu) :
      vcpu->kvm->arch.ept_identity_map_addr;
    ept_load_pdptrs(vcpu);
  }

  vmx_flush_tlb(vcpu);
  vmcs_writel(GUEST_CR3, guest_cr3);
}
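
A worked example may help with what construct_eptp produces. The sketch below reimplements the same bit arithmetic outside the kernel. The constant values are assumptions taken from the EPTP layout in the Intel manual (write-back memory type 6 in bits 2:0, page-walk length minus one in bits 5:3, accessed/dirty enable in bit 6); the TOY_ names and toy_construct_eptp are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, following the EPTP layout in the Intel manual. */
#define TOY_EPT_MT_WB      0x6ull        /* bits 2:0, write-back */
#define TOY_EPT_GAW        0x3ull        /* 4-level walk, encoded as 4 - 1 */
#define TOY_EPT_GAW_SHIFT  3             /* bits 5:3 */
#define TOY_EPT_AD_ENABLE  (1ull << 6)   /* accessed/dirty flags */
#define TOY_PAGE_MASK      (~0xfffull)   /* clears the low 12 bits */

/* Build an EPT pointer from the host-physical address of the EPT root. */
uint64_t toy_construct_eptp(uint64_t root_hpa, int enable_ad)
{
    uint64_t eptp;

    assert((root_hpa >> 52) == 0);  /* physical addresses are below bit 52 */

    eptp = TOY_EPT_MT_WB | (TOY_EPT_GAW << TOY_EPT_GAW_SHIFT);
    if (enable_ad)
        eptp |= TOY_EPT_AD_ENABLE;
    eptp |= root_hpa & TOY_PAGE_MASK;

    return eptp;
}
```

Under these assumptions a root at 0x12345000 yields 0x1234501e without A/D bits (0x6 | 0x18 = 0x1e in the low byte) and 0x1234505e with them.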

At the very beginning of OS boot, paging is not enabled, so the first value of guest_cr3 is vcpu->kvm->arch.ept_identity_map_addr, which was set earlier in vmx_create_vcpu. After paging is enabled, cr3 is set by the guest OS; this causes a VM exit, handle_cr is called, and it in turn invokes kvm_set_cr3:

int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
{
  if (cr3 == kvm_read_cr3(vcpu) && !pdptrs_changed(vcpu)) {
    kvm_mmu_sync_roots(vcpu);
    kvm_mmu_flush_tlb(vcpu);
    return 0;
  }

  if (unlikely(!gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT)))
    return 1;
  vcpu->arch.cr3 = cr3;
  __set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
  vcpu->arch.mmu.new_cr3(vcpu);
  return 0;
}

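This records the value chosen by the guest in vcpu->arch.cr3, which vmx_set_cr3 then writes as the real GUEST_CR3 once guest paging is enabled.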
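
The two decisions described in this section, which value ends up in GUEST_CR3 and when a cr3 write actually switches roots, can be summarized in a toy model. This is a simplified sketch: the toy_ names are hypothetical, and the real kvm_set_cr3 additionally checks the PDPTRs and validates the new value against the memory slots, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced view of the vcpu state involved. */
struct toy_vcpu_regs {
    bool paging;   /* has the guest enabled paging? */
    uint64_t cr3;  /* last cr3 value written by the guest */
};

/*
 * Models the decision in kvm_set_cr3. Returns true when the write
 * switched to a new guest root, false when the value was unchanged
 * (where the real code only syncs roots and flushes the TLB).
 */
bool toy_set_cr3(struct toy_vcpu_regs *regs, uint64_t new_cr3)
{
    if (new_cr3 == regs->cr3)
        return false;
    regs->cr3 = new_cr3;
    return true;
}

/*
 * Models the selection in vmx_set_cr3 with EPT enabled: before the guest
 * turns paging on, GUEST_CR3 points at the identity-map page table.
 */
uint64_t toy_guest_cr3_for_vmcs(const struct toy_vcpu_regs *regs,
                                uint64_t identity_map_addr)
{
    assert((identity_map_addr & 0xfff) == 0);  /* page aligned */
    return regs->paging ? regs->cr3 : identity_map_addr;
}
```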

Paging in Runtime


Keep on updating…
