Diffstat (limited to '2003/linux-kernel-smp-bangalore2003')
-rw-r--r-- | 2003/linux-kernel-smp-bangalore2003/kernel-smp-bangalore2003.mgp | 315 |
1 files changed, 315 insertions, 0 deletions
diff --git a/2003/linux-kernel-smp-bangalore2003/kernel-smp-bangalore2003.mgp b/2003/linux-kernel-smp-bangalore2003/kernel-smp-bangalore2003.mgp
new file mode 100644
index 0000000..09e67a5
--- /dev/null
+++ b/2003/linux-kernel-smp-bangalore2003/kernel-smp-bangalore2003.mgp
@@ -0,0 +1,315 @@
+%include "default.mgp"
+%default 1 bgrad
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%page
+%nodefault
+%back "blue"
+
+%center
+%size 7
+
+
+Linux Kernel Architecture
+%size 5
+SMP issues, locking primitives
+
+
+%center
+%size 4
+by
+
+Harald Welte <laforge@gnumonks.org>
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%page
+Linux Kernel Architecture
+Prerequisites
+
+Due to the technical nature of this presentation, the audience should be familiar with the following subjects:
+
+	experience in programming on a Linux/*NIX system
+		C language preferred
+	general knowledge about computer hardware
+		interrupts / IO / DMA
+	general knowledge about modern CPU architecture
+		address space / MMU
+		'protected mode' / supervisor mode / ...
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%page
+Linux Kernel Architecture
+Kernel / Userspace
+
+	OS kernel provides
+		hardware abstraction (file I/O, network I/O, ...)
+		resource allocation / limiting
+		address space separation
+		privilege separation
+		IPC
+
+	the traditional process model in *NIX operating systems
+		processes reside in separate virtual address spaces
+		at boot, the kernel starts only one process (init)
+		all other processes descend from init
+		processes are scheduled and preempted by the kernel
+		processes invoke system functions via syscalls
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%page
+Linux Kernel Architecture
+System calls
+
+Definition
+
+	a userspace process enters the kernel
+	mechanism is CPU architecture dependent
+		can be a software interrupt (int 0x80 on i386)
+		can be a dedicated instruction (sysenter on i386, syscall on x86-64)
+	syscall number and arguments are passed in CPU registers (i386: number in eax, arguments in ebx, ecx, edx, esi, edi, ebp)
+	common examples
+		open/close/read/write
+		exit/fork/execve/kill
+		socketcall, a single entry point multiplexing socket/bind/connect/listen/...
+	about 270 system calls in 2.6.x kernels
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%page
+Linux Kernel Architecture
+Invocation of a system call
+
+chronological order of events in case of a system call
+
+	userspace process calls a library function
+	library function is executed within the process' address space
+	library loads the syscall number and arguments into CPU registers
+	library issues the syscall (int 0x80 / sysenter / ...)
+	execution switches to syscall context in kernel mode
+	kernel looks up the syscall number in the system call table and dispatches to the respective function
+	syscall function in the kernel handles the syscall
+	all data between kernel and userspace has to be copied between address spaces (copy_from_user() / copy_to_user())
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%page
+Linux Kernel Architecture
+Execution contexts
+
+apart from scheduling between different userspace processes, the kernel has other jobs, such as reacting to external events
+
+	hardirq
+		hardware interrupt line was triggered
+	softirq
+		the workhorse behind a hardirq
+	userspace
+		executing within a userspace process
+	syscall
+		invoked by a system call from userspace
+	vsyscall
+		virtual system calls, executed in userspace context
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%page
+Linux Kernel Architecture
+hardirq context
+
+	interrupt generated by hardware is received and handled
+	can be interrupted by other hardirqs
+	does only the minimal job and returns
+	examples
+		packet has arrived on network board
+		character was received on serial port
+		DMA read/write to disk drive has completed
+		timer interrupt went off
+
+	in most cases, a hardirq is followed by a softirq or tasklet.
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%page
+Linux Kernel Architecture
+softirq context
+
+	softirqs run after the hardirq handler has returned
+	do the real work associated with a hardirq
+	multithreaded (the same softirq can run simultaneously on multiple CPUs)
+	examples
+		network receive softirq
+		timer softirq
+
+	prior to softirqs, Linux had so-called 'bottom halves' (BHs)
+		softirqs introduced in 2.4.x (net rx/tx softirq)
+		bottom halves removed in 2.6.x
+		difference: only one BH can run at a time, system-wide
+		BHs have to be converted to tasklets in 2.6.x
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%page
+Linux Kernel Architecture
+tasklets
+
+	tasklets sit somewhere in between softirqs and bottom halves
+		one particular tasklet cannot run on multiple CPUs simultaneously
+		different tasklets can run on different CPUs simultaneously
+
+	otherwise, same as softirq context
+		tasklets are implemented inside the 'tasklet softirq'
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%page
+Linux Kernel Architecture
+syscall / userspace context
+
+	userspace context
+		in userspace, executing a process
+
+	syscall context
+		inside the kernel, after a userspace process has issued a system call
+
+	vsyscalls (virtual syscalls)
+		first introduced with the x86-64 (AMD Opteron) arch
+		fast read-only access to kernel data structures
+		can do things like gettimeofday() without switching into kernel mode
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%page
+Linux Kernel Architecture
+synchronization
+
+Due to reentrancy and SMP, synchronization issues arise:
+
+	simple case: UP (uniprocessor) system
+		softirq can be interrupted by hardirq
+		thus, shared structures (queues, ...) need to be protected
+	complex case: SMP system
+		the same softirq can run at the same time on multiple CPUs
+		as softirqs are multithreaded, their concurrent instances have to be synchronized with each other
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%page
+Linux Kernel Architecture
+synchronization primitives
+
+busy-waiting locks
+
+	spinlocks
+		if lock was not taken, take it and continue
+		if lock was taken, busy-loop until it is free
+	rwlocks
+		special case of spinlocks
+		useful when the structure protected by the lock is often read but rarely updated/written to
+		allows either
+			multiple readers simultaneously, or
+			only one writer [and no readers]
+	brlocks
+		super-fast read/write locks, with write-side penalty
+		avoid cache ping-pong in the multi-reader case
+		only in kernel 2.4.x
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%page
+Linux Kernel Architecture
+synchronization primitives (cont'd)
+
+sleeper locks
+
+	semaphores
+		if semaphore can be acquired, continue
+		if semaphore cannot be acquired, put current process to sleep
+		once the semaphore is released, wake up the sleeping process
+
+	WARNING: can only be used in userspace/syscall (process) context, never in hardirq/softirq context, since the caller may sleep
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%page
+Linux Kernel Architecture
+new locking primitives in 2.6.x
+
+	seqlocks
+		introduced with vsyscalls in 2.5/2.6
+		reader/writer mechanism that does not starve writers
+		readers never block, but have to retry if a write happened in the meantime
+
+	read copy update
+		new mechanism in kernel 2.5/2.6, lockless for readers
+		updaters publish a new copy; freeing the old copy is deferred until all CPUs have scheduled, so nobody can still hold a reference to it
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%page
+Linux Kernel Architecture
+example: incoming network packet
+
+hardirq context
+	NIC raises its interrupt line after a packet was received
+	kernel enters (arch/i386/kernel/entry.S:common_interrupt)
+	core interrupt handler (arch/i386/kernel/irq.c:do_IRQ)
+	hardirq handler of network driver (drivers/net/tulip/interrupt.c:tulip_interrupt)
+	net/core/dev.c:netif_rx(): append skb to backlog queue
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%page
+Linux Kernel Architecture
+example: incoming network packet
+
+softirq context
+	net/core/dev.c:net_rx_action()
+	net/core/dev.c:process_backlog()
+	net/core/dev.c:netif_receive_skb()
+	net/core/dev.c:deliver_skb()
+	net/ipv4/ip_input.c:ip_rcv()
+		netfilter PRE_ROUTING hook
+	net/ipv4/ip_input.c:ip_rcv_finish()
+		call routing code
+	net/ipv4/ip_input.c:ip_local_deliver()
+		netfilter LOCAL_IN hook
+	net/ipv4/ip_input.c:ip_local_deliver_finish()
+		call layer 4 protocol handler
+	net/ipv4/udp.c:udp_rcv()
+		look up socket, if any
+	include/net/sock.h:sock_queue_rcv_skb()
+		enqueue into socket receive queue
+	net/core/sock.c:sock_def_readable()
+		wake_up_interruptible() on socket waitqueue
+	return from recv() via socketcall
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%page
+Linux Kernel Architecture
+Cache Effects
+
+	SMP systems have multiple CPUs
+	Every CPU has its own cache(s) / cache hierarchies
+	Most modern CPU archs are cache coherent in hardware
+	This means a given cache line can only be held writable in one CPU's cache at a time
+	Frequently updated data structures will ping-pong between CPU caches
+	Data structures have to be organized to avoid cache issues
+		Cacheline alignment
+			easy for slab caches: pass SLAB_HWCACHE_ALIGN to kmem_cache_create()
+		per-cpu data structures
+			e.g. packet counters: have one for every CPU
+		structure layout
+			put all writable/frequently updated members together
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%page
+Linux Kernel Architecture
+Thanks
+	The slides and an accompanying paper for this presentation are available at http://www.gnumonks.org/
+
+	Thanks to
+		the BBS people, Z-Netz, FIDO, ...
+			for heavily increasing my computer usage in 1992
+		KNF
+			for bringing me in touch with the internet as early as 1994
+			for providing a playground for technical people
+			for telling me about the existence of Linux!
+		Alan Cox, Alexey Kuznetsov, David Miller, Andi Kleen
+			for implementing (one of?) the world's best TCP/IP stacks
+		Paul 'Rusty' Russell
+			for starting the netfilter/iptables project
+			for trusting me to maintain it today
+		Astaro AG
+			for sponsoring parts of my netfilter work
+		linux-bangalore
+			for sponsoring my trip to this conference
+
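The "System calls" and "Invocation of a system call" slides describe how a library function ends up trapping into the kernel. As a small userspace illustration (assuming Linux with glibc; the helper names `pid_via_libc` and `pid_via_raw_syscall` are made up for this sketch), the same kernel service can be reached through the ordinary library wrapper or through the generic `syscall(2)` entry point:

```c
#define _GNU_SOURCE          /* make syscall() visible in <unistd.h> */
#include <assert.h>
#include <sys/syscall.h>     /* SYS_getpid */
#include <sys/types.h>       /* pid_t */
#include <unistd.h>          /* getpid(), syscall() */

/* Path 1: the C library wrapper issues the system call on our behalf. */
static pid_t pid_via_libc(void)
{
    return getpid();
}

/* Path 2: the generic syscall() entry point. glibc places the syscall
 * number and arguments in the registers the kernel expects and then
 * executes the architecture's trap instruction (int 0x80, sysenter,
 * syscall, ... depending on architecture and libc). */
static pid_t pid_via_raw_syscall(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

Both paths are dispatched through the kernel's system call table to the same handler, so they return the same PID.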
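The hardirq/softirq slides and the "incoming network packet" example split the work into a short interrupt-time part (`netif_rx()` appending to the backlog) and a deferred part (`net_rx_action()` draining it). The following is only a single-threaded toy model of that split; `fake_skb`, `fake_netif_rx()` and `fake_net_rx_action()` are hypothetical names, and unlike the real kernel the sketch does not protect the queue against a concurrent interrupt (the kernel disables local interrupts around the backlog for that).

```c
#include <assert.h>
#include <stdbool.h>

#define BACKLOG_MAX 8                  /* hypothetical queue length */

struct fake_skb { int len; };          /* hypothetical stand-in for struct sk_buff */

static struct fake_skb backlog[BACKLOG_MAX];
static int backlog_head;               /* index of the oldest queued packet */
static int backlog_count;              /* number of queued packets */
static bool rx_softirq_pending;        /* set by the "hardirq", cleared by the "softirq" */
static long bytes_delivered;           /* what the protocol layers have consumed */

/* Top half: called from the (simulated) interrupt handler. It only queues
 * the packet and marks the deferred work as pending, then returns. */
static bool fake_netif_rx(struct fake_skb skb)
{
    if (backlog_count == BACKLOG_MAX)
        return false;                  /* backlog full: drop the packet */
    backlog[(backlog_head + backlog_count) % BACKLOG_MAX] = skb;
    backlog_count++;
    rx_softirq_pending = true;         /* comparable to raising NET_RX_SOFTIRQ */
    return true;
}

/* Bottom half: runs later, outside the interrupt handler, and does the
 * expensive per-packet work for everything queued so far. */
static void fake_net_rx_action(void)
{
    if (!rx_softirq_pending)
        return;
    rx_softirq_pending = false;
    while (backlog_count > 0) {
        struct fake_skb skb = backlog[backlog_head];
        backlog_head = (backlog_head + 1) % BACKLOG_MAX;
        backlog_count--;
        bytes_delivered += skb.len;    /* stand-in for ip_rcv() ... udp_rcv() */
    }
}
```

Two calls to `fake_netif_rx()` leave `bytes_delivered` at 0; only the later `fake_net_rx_action()` processes both packets in one batch.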
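The "busy-waiting locks" slide describes a spinlock as "take it if free, otherwise busy-loop". A minimal userspace model of exactly that, using C11 atomics and pthreads, could look like this; `demo_spinlock` and its helpers are hypothetical, and the kernel's real `spinlock_t` additionally disables preemption and uses architecture-specific code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define MAX_DEMO_THREADS 16

typedef struct {
    atomic_flag taken;        /* set while some thread holds the lock */
} demo_spinlock;

#define DEMO_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

static void demo_spin_lock(demo_spinlock *l)
{
    /* test_and_set returns the previous value: if the flag was already
     * set, another thread owns the lock, so keep busy-looping. */
    while (atomic_flag_test_and_set_explicit(&l->taken, memory_order_acquire))
        ;                     /* spin: the CPU stays busy instead of sleeping */
}

static void demo_spin_unlock(demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->taken, memory_order_release);
}

/* Usage: several threads increment a shared counter under the lock. */
static demo_spinlock counter_lock = DEMO_SPINLOCK_INIT;
static long shared_counter;

static void *spin_worker(void *arg)
{
    long iterations = *(const long *)arg;
    for (long i = 0; i < iterations; i++) {
        demo_spin_lock(&counter_lock);
        shared_counter++;     /* critical section */
        demo_spin_unlock(&counter_lock);
    }
    return NULL;
}

static long run_spin_demo(int nthreads, long iterations)
{
    pthread_t tids[MAX_DEMO_THREADS];
    assert(nthreads > 0 && nthreads <= MAX_DEMO_THREADS);
    shared_counter = 0;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, spin_worker, &iterations);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);
    return shared_counter;
}
```

Without the lock, concurrent `shared_counter++` operations could lose updates; with it, `run_spin_demo(4, 20000)` returns 80000.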
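The "sleeper locks" slide contrasts semaphores with spinlocks: a contending caller is put to sleep instead of busy-looping. A rough userspace analogue, assuming Linux and POSIX unnamed semaphores as a stand-in for the kernel's `down()`/`up()`, checks that a semaphore initialized to 1 never admits more than one holder. The counters and helper names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>

#define MAX_DEMO_THREADS 8

static sem_t resource_sem;    /* count 1: behaves like a sleeping mutex */
static int users_inside;      /* threads currently holding the semaphore */
static int max_users_seen;    /* highest value users_inside ever reached */

static void *sem_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        sem_wait(&resource_sem);  /* like down(): sleep while unavailable */
        users_inside++;
        if (users_inside > max_users_seen)
            max_users_seen = users_inside;
        users_inside--;
        sem_post(&resource_sem);  /* like up(): wake a sleeper, if any */
    }
    return NULL;
}

/* Returns the highest number of simultaneous holders that was observed. */
static int run_sem_demo(int nthreads)
{
    pthread_t tids[MAX_DEMO_THREADS];
    assert(nthreads > 0 && nthreads <= MAX_DEMO_THREADS);
    sem_init(&resource_sem, 0, 1);  /* 0: shared between threads of this process */
    users_inside = 0;
    max_users_seen = 0;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, sem_worker, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);
    sem_destroy(&resource_sem);
    return max_users_seen;
}
```

Because a waiter may sleep inside `sem_wait()`, the same pattern is only legal where sleeping is allowed, which is why the slide restricts semaphores to process context.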
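The seqlock bullet ("readers never block, but have to retry") can be modeled with a sequence counter in C11 atomics: the writer makes the counter odd while it updates, and a reader retries whenever the counter was odd or changed during its read. This sketch assumes a single, already serialized writer (a real `seqlock_t` pairs the counter with a spinlock for that); `demo_seqclock` and the helpers are hypothetical names, and the data fields are atomics so the sketch stays free of data races in C11 terms.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

struct demo_seqclock {
    atomic_uint seq;          /* even: stable, odd: write in progress */
    atomic_long sec;
    atomic_long nsec;
};

static void demo_write_seqclock(struct demo_seqclock *c, long sec, long nsec)
{
    unsigned s = atomic_load_explicit(&c->seq, memory_order_relaxed);
    assert((s & 1u) == 0);    /* only one writer at a time */
    atomic_store_explicit(&c->seq, s + 1, memory_order_relaxed);  /* odd */
    atomic_thread_fence(memory_order_release);  /* odd value ordered before data stores */
    atomic_store_explicit(&c->sec, sec, memory_order_relaxed);
    atomic_store_explicit(&c->nsec, nsec, memory_order_relaxed);
    atomic_store_explicit(&c->seq, s + 2, memory_order_release);  /* even again */
}

static void demo_read_seqclock(struct demo_seqclock *c, long *sec, long *nsec)
{
    for (;;) {
        unsigned s1 = atomic_load_explicit(&c->seq, memory_order_acquire);
        if (s1 & 1u)
            continue;         /* writer active: do not block, just retry */
        long a = atomic_load_explicit(&c->sec, memory_order_relaxed);
        long b = atomic_load_explicit(&c->nsec, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);  /* data loads before re-check */
        if (atomic_load_explicit(&c->seq, memory_order_relaxed) == s1) {
            *sec = a;         /* counter unchanged: snapshot is consistent */
            *nsec = b;
            return;
        }
    }
}

static struct demo_seqclock demo_clock;   /* static storage: starts at zero */
static atomic_int writer_done;

static void *clock_writer(void *arg)
{
    (void)arg;
    for (long i = 1; i <= 200000; i++)
        demo_write_seqclock(&demo_clock, i, 2 * i);   /* invariant: nsec == 2 * sec */
    atomic_store(&writer_done, 1);
    return NULL;
}

/* Reads while the writer runs; returns how many inconsistent snapshots were seen. */
static long run_seqlock_demo(void)
{
    pthread_t w;
    long torn = 0;
    pthread_create(&w, NULL, clock_writer, NULL);
    while (!atomic_load(&writer_done)) {
        long s, ns;
        demo_read_seqclock(&demo_clock, &s, &ns);
        if (ns != 2 * s)
            torn++;
    }
    pthread_join(w, NULL);
    return torn;
}
```

The writer never waits for readers, which is the "does not starve writers" property; readers pay for it with occasional retries.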
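The "Cache Effects" slide recommends per-CPU packet counters and cacheline alignment. A userspace sketch of the idea, assuming 64-byte cache lines and using one thread per slot as a stand-in for one CPU per slot, gives every counter its own cache line so writers never share a line, and only sums the slots when a total is needed. `padded_counter`, `NR_SLOTS` and the helpers are hypothetical; the kernel itself uses its per-CPU variable machinery instead.

```c
#include <assert.h>
#include <pthread.h>

#define NR_SLOTS  4           /* hypothetical: one slot per "CPU" (here: per thread) */
#define CACHELINE 64          /* assumption: 64-byte cache lines, common on x86 */

/* _Alignas pads each counter to a full cache line, so counters updated by
 * different threads never share a line (no cache-line ping-pong). */
struct padded_counter {
    _Alignas(CACHELINE) unsigned long packets;
};
static_assert(sizeof(struct padded_counter) == CACHELINE, "one counter per cache line");

static struct padded_counter rx_packets[NR_SLOTS];

static void *count_packets(void *arg)
{
    struct padded_counter *mine = arg;   /* this thread's private slot */
    for (int i = 0; i < 1000000; i++)
        mine->packets++;                 /* no lock and no atomic op needed */
    return NULL;
}

/* Called after all writers are joined, so the sum is exact. */
static unsigned long total_packets(void)
{
    unsigned long sum = 0;
    for (int i = 0; i < NR_SLOTS; i++)
        sum += rx_packets[i].packets;
    return sum;
}

static unsigned long run_percpu_demo(void)
{
    pthread_t tids[NR_SLOTS];
    for (int i = 0; i < NR_SLOTS; i++)
        pthread_create(&tids[i], NULL, count_packets, &rx_packets[i]);
    for (int i = 0; i < NR_SLOTS; i++)
        pthread_join(tids[i], NULL);
    return total_packets();
}
```

The design trades a slightly more expensive read (summing all slots) for contention-free updates, which suits counters that are incremented per packet but read rarely.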