This page describes the scheduling policy of processes across the different CPUs available on a SEAPATH hypervisor.

SEAPATH default CPU isolation

SEAPATH aims to host virtual machines with real-time needs. To achieve that, process scheduling must be tuned in order to offer the best performance to the VMs.

On SEAPATH, at configuration time with Ansible, a list of isolated CPUs is defined. On these CPUs, the kernel will avoid running:

  • system processes

  • user processes

  • IRQs

These CPUs will then be entirely dedicated to running real-time processes like virtual machines.

Info
In the Ansible inventory of the hypervisors, these CPUs are defined by the `isolcpus` variable.
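
As an illustration only (the exact format is described in the inventories README of the project; the host name and CPU range below are hypothetical), the variable could look like this for one hypervisor:
Code Block
languageyaml
# Hypothetical extract of an Ansible inventory (illustrative values only)
hypervisor1:
  isolcpus: "2-5"   # CPUs 2 to 5 are isolated from the general scheduler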

Tuned

The Debian version of SEAPATH uses tuned (https://github.com/redhat-performance/tuned).

This software configures different variables on the system in order to tune it for real-time virtualization performance. Among other things, it:

  • Adds kernel command-line arguments to isolate CPUs (as described above)

  • Configures RT priorities for kernel threads (ktimers, ksoftirqd, ...)

  • Configures IRQ affinity so that interrupts avoid the isolated cores (with irqbalance)

  • Configures the kernel workqueue affinity (via sysfs) so that workqueues avoid the isolated cores

A list of all the scheduling modifications done by tuned can be found at the end of this page.

On Yocto, tuned is not used. Instead, all these configurations are applied at compile time.

Scheduling virtual machines

SEAPATH virtual machines are managed by Qemu.

On a hypervisor, when launching a VM, Qemu creates different threads. They can be divided into two categories:

  • Virtual CPU threads: these threads emulate the CPUs of the virtual machine. Each CPU inside the virtual machine is handled by one vCPU thread on the hypervisor.

  • Management threads: they are responsible for creating the VM, checking whether it has crashed, managing the I/O, etc.

By default, all these threads are managed by the Linux scheduler and thus run on the non-isolated cores. But they can also be pinned to specific CPUs, which forces them to run there.
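
For illustration, the Qemu threads of a running VM, together with their scheduling class, real-time priority and current CPU, can be listed with ps (the process name qemu-system-x86_64 is an assumption; adapt it to your architecture):
Code Block
languagebash
# List every thread of the Qemu process with its scheduling class (class),
# real-time priority (rtprio) and the CPU it last ran on (psr)
ps -Lo pid,lwp,class,rtprio,psr,comm -p "$(pidof qemu-system-x86_64)"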

Standard virtual machines

For a VM without any performance or real-time needs, there is no need to handle the Qemu threads in any particular way:

  • All threads will inherit a default priority and scheduling type (TS 19)

  • All threads will be handled by the Linux scheduler on the non-isolated cores

Real time virtual machines

For a VM where performance and determinism are needed, here are our recommendations:

The vCPU threads are where the work of the VM runs, so they must get as much CPU time as possible. To achieve that, they should be:

  • Scheduled with a real-time scheduler (FIFO) and a real-time priority (42)

  • Put alone on an isolated core

Note
Each vCPU thread must be put alone on an isolated CPU. It is counterproductive to put two vCPU threads on the same core. This means you must have at least as many isolated cores as RT VM vCPUs.

The vCPU scheduling policy has to be FIFO (FF). A real-time priority of one is enough.
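
As a sketch only (the CPU numbers are illustrative and must be isolated cores), this recommendation can be expressed in the `<cputune>` section of the libvirt domain XML:
Code Block
languagexml
<!-- Illustrative example: 2 vCPUs, each pinned alone on an isolated core,
     scheduled FIFO with a real-time priority of 1 -->
<vcpu placement='static'>2</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='2'/>
  <vcpupin vcpu='1' cpuset='3'/>
  <vcpusched vcpus='0-1' scheduler='fifo' priority='1'/>
</cputune>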

For more information, read the page Virtual machines on SEAPATH.

Finer control with cgroup (optional)

Implementation in SEAPATH

The Linux kernel uses cgroups in order to isolate processes. These cgroups work as a hierarchy where each layer restricts the resources a process can access. Systemd also uses this mechanism by grouping its processes into slices.

By default, service and scope units are placed in the system slice, virtual machines and containers are found in the machine slice and user sessions in the user slice (more details here).

These slices can run on any cores of the hypervisor, but SEAPATH offers a way to restrict the CPUs where each of these slices can execute.

Info
This is configured using the cpusystem, cpumachines and cpuuser variables in Ansible.
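
Under the hood, restricting a slice to a set of CPUs corresponds to systemd's AllowedCPUs= setting (this assumes cgroup v2; the exact configuration written by the Ansible roles may differ). A minimal sketch with illustrative CPU numbers:
Code Block
languagebash
# Illustrative only: restrict the machine slice to CPUs 2-5
systemctl set-property machine.slice AllowedCPUs=2-5
# Verify the restriction
systemctl show -p AllowedCPUs machine.slice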

The project also defines three other slices to separate the functionalities. These slices are optional:

  • machine-nort: a subgroup of the machine slice to run all virtual machines with the default scheduler.

  • machine-rt: a subgroup of the machine slice to run all virtual machines with the real-time scheduler.

  • ovs: a group to run OpenVSwitch processes.

Info
These are configured by cpumachinesnort, cpumachinesrt and cpuovs in Ansible.
Note
Because of the hierarchy, all CPUs used in machine-rt and machine-nort slices must also be used in the machine slice.


Info
Different slices can share the same CPUs. In a machine with few cores, for example, it can be useful to put the system, user and ovs slices on the same CPUs.

TODO: put the link to the inventories README once written

Utility of slices CPU isolation

Using these slices is useful to get a preset of CPU isolation for virtual machines. When placing a VM in either the machine-rt or machine-nort slice, it will automatically be scheduled on the CPUs of that slice.
This is particularly useful when deploying many VMs at once.

One really important thing to keep in mind when using these slices is that the Qemu management threads of the virtual machines will be part of the machine slice (resp. machine-rt and machine-nort), and not the system slice. This means that these threads will not be executed on the CPUs associated with the Linux system, but on the CPUs chosen for the machine slice.
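
To verify where the Qemu threads actually end up, the cgroup hierarchy can be inspected (the process name below is an assumption, and a single running VM is assumed):
Code Block
languagebash
# Show the cgroup tree of the machine slice: the Qemu processes of the VMs,
# including their management threads, should appear under it
systemd-cgls -u machine.slice
# Print the cgroup membership of the running Qemu process
cat /proc/"$(pidof qemu-system-x86_64)"/cgroup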

Using these slices allows an additional layer of CPU isolation:

Info
This new isolation layer protects against very advanced attacks. Because it has drawbacks (see below), whether or not you should activate this feature remains an open question.

Drawbacks

By activating CPU isolation on the machine slice, the management threads of the VM will be scheduled on the allowed CPU list of the slice. This mechanism implies two things:

  • You must have one more CPU in the machine-rt slice. Because every vCPU thread needs to be scheduled on its own CPU, one more CPU is needed to schedule the management threads.

  • You must carefully place the management threads. By default, they will be scheduled on the first allowed CPU of the slice. If this CPU also runs an RT vCPU thread, that thread will prevent the management threads from running and the VM will never boot.

Info
The management thread scheduling is handled by the `emulatorpin` field in the libvirt XML.
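
A minimal sketch of that field (the CPU number is illustrative and must not be a core already dedicated to an RT vCPU thread):
Code Block
languagexml
<!-- Pin the Qemu management (emulator) threads on CPU 6 -->
<cputune>
  <emulatorpin cpuset='6'/>
</cputune>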

For more information, read the page Virtual machines on SEAPATH.

Specific configurations

NUMA

NUMA (Non-Uniform Memory Access) refers to machines that can contain several CPU sockets. Each of these sockets has its own local memory and cache, which means that accessing memory attached to another socket is much slower than accessing its own.

To know whether a SEAPATH hypervisor has a NUMA architecture, use the command `virsh nodeinfo`. It gives, among other things, the number of NUMA cells in the system.
To know the architecture of the CPUs in the system, launch the command `virsh capabilities`. It provides a “topology” section with a description of the sockets (here called cells) and the CPUs each one contains. More information about these commands
here.

If your system contains more than one NUMA cell, you must be careful to pin all the vCPU threads of one VM on the same NUMA cell. Otherwise, the data transfers between the two cells will significantly slow down the VM.
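
Complementary to `virsh capabilities`, the NUMA node of each CPU can also be listed directly on the hypervisor:
Code Block
languagebash
# List each CPU with the NUMA node, socket and physical core it belongs to
lscpu -e=CPU,NODE,SOCKET,CORE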

Hyper-threading

Most modern CPUs support hyper-threading. This option can be enabled in the BIOS and doubles the number of CPUs available on the system. However, the newly created CPUs are not as fast and independent as classic ones.

Hyper-threading uses the concept of logical cores. A logical core is only the pipeline part of a CPU. It shares an arithmetic unit (ALU) and memory with another logical core to form a physical core.
When hyper-threading is disabled, every physical core runs only one logical core. It then has full access to the memory and the ALU. When hyper-threading is enabled, both logical cores are active. The obvious drawback is that each logical core will influence its sibling.

When running real-time virtual machines, it is highly recommended to disable hyper-threading. However, on test systems or systems with fewer cores, it can be an interesting feature. In that case, the RT VM vCPU must be pinned to a logical core whose sibling doesn't run any process. Otherwise, that sibling will influence the vCPU thread, which will lose determinism.

To know the exact architecture of your CPUs, use the command `virsh capabilities` and watch the “topology” section. The siblings field describes which logical CPUs are grouped together.

Info
On most systems, logical CPUs are grouped in numerical order (0 with 1, 2 with 3 …) but this is not always the case. Always refer to `virsh capabilities` to check the exact architecture.
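
The sibling relationship can also be read directly from sysfs, for example for CPU 2:
Code Block
languagebash
# Logical CPUs sharing the same physical core as CPU 2
cat /sys/devices/system/cpu/cpu2/topology/thread_siblings_list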

Annex: list of tuned modifications

Below is a list of all the scheduling modifications done by tuned.

TODO: explain all?

  • /sys/module/kvm/parameters/halt_poll_ns = 0
    /sys/kernel/ktimer_lockless_check = 1
    /sys/kernel/mm/ksm/run = 2

  • Kernel parameters:
    isolcpus=managed_irq,domain,{isolated_cores}
    intel_pstate=disable
    nosoftlockup
    tsc=reliable
    nohz=on
    nohz_full={isolated_cores}
    rcu_nocbs={isolated_cores}
    irqaffinity={non_isolated_cores}
    processor.max_cstate=1
    intel_idle.max_cstate=1
    cpufreq.default_governor=performance
    rcu_nocb_poll

  • kernel thread priorities:
    group.ksoftirqd=0:f:2:*:^\[ksoftirqd
    group.ktimers=0:f:2:*:^\[ktimers
    group.rcuc=0:f:4:*:^\[rcuc
    group.rcub=0:f:4:*:^\[rcub
    group.ktimersoftd=0:f:3:*:^\[ktimersoftd

  • configures irqbalance with isolated_cores list

  • configures workqueue with isolated_cores list

  • kernel.hung_task_timeout_secs = 600
    kernel.nmi_watchdog = 0
    kernel.sched_rt_runtime_us = -1
    vm.stat_interval = 10
    kernel.timer_migration = 0

Interrupt Requests

The irqmask variable defines the environment variable IRQBALANCE_BANNED_CPUS. It makes irqbalance ignore some CPUs and never assign interrupts to them (more details here, in the irqbalance manual).

The workqueuemask variable is its negation. There is no function to compute it because the project wants to avoid maintaining such an algorithm. It is used to configure the kernel workqueues.

These variables are hexadecimal masks without the "0x" prefix, where the first CPU is the least significant bit.
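
As a hedged example of how such masks are built (the CPU numbers and file locations are assumptions, not the exact SEAPATH implementation): on an 8-CPU machine with CPUs 2-5 isolated, the banned mask covers bits 2 to 5 and the workqueue mask is its complement over the 8 existing CPUs.
Code Block
languagebash
# CPUs 2-5 isolated on an 8-CPU machine:
#   binary 0011 1100 -> irqmask       = 3c  (CPUs banned from IRQs)
#   binary 1100 0011 -> workqueuemask = c3  (CPUs allowed to run workqueues)

# Ban the isolated CPUs in irqbalance (e.g. via /etc/default/irqbalance)
IRQBALANCE_BANNED_CPUS=3c

# Restrict the default kernel workqueues to the remaining CPUs
echo c3 > /sys/devices/virtual/workqueue/cpumask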

Kernel configuration

The cluster is installed with a real-time kernel on each node. It must execute its tasks without delay, so some CPUs should be isolated from the kernel scheduler with the isolcpus kernel parameter.

  • To specify individual CPUs:
Code Block
languagebash
isolcpus=<cpu number>,...,<cpu number>
  • To specify a range of CPUs:
Code Block
languagebash
isolcpus=<cpu number>-<cpu number>
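
Once the system has booted, the resulting isolation can be checked directly on the hypervisor:
Code Block
languagebash
# CPUs currently isolated by the isolcpus kernel parameter
cat /sys/devices/system/cpu/isolated
# Kernel command line actually used at boot
cat /proc/cmdline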
