%include "default.mgp"
%default 1 bgrad
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
%nodefault
%back "blue"
%center
%size 7
Architecture of the Linux kernel
%size 5
or: The world beyond the syscall barrier
%center
%size 4
by
Harald Welte <laforge@gnumonks.org>
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
Prerequisites
Due to the technical nature of this presentation, the audience should be familiar with the following subjects:
experience in programming on a Linux/*NIX system
C language preferred
general knowledge about computer hardware
interrupts / IO / DMA
general knowledge about modern CPU architecture
address space / MMU
'protected mode' / supervisor mode / ...
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
Kernel / Userspace
OS kernel provides
hardware abstraction (file I/O, network I/O, ...)
resource allocation / limiting
address separation
privilege separation
IPC
the traditional process model in *NIX operating systems
processes reside in separate virtual address spaces
kernel only executes one process (init) at bootup
all other processes descend from init
processes are scheduled and preempted by the kernel
processes invoke system functions via syscalls.
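The address separation described above can be observed from userspace. The following is a minimal sketch, assuming a POSIX system: a child created with fork() changes its own copy of a variable, and the parent's copy stays untouched because both processes live in separate virtual address spaces. The helper name child_write_is_private() is hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a write made by a forked child is
 * invisible to the parent, 0 if it leaked, -1 on error. */
int child_write_is_private(void)
{
    volatile int value = 1;
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;                  /* fork failed */

    if (pid == 0) {
        value = 2;                  /* changes only the child's copy */
        _exit(value == 2 ? 0 : 1);
    }

    /* parent: wait for the child, then inspect our own copy */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    assert(WIFEXITED(status));
    return value == 1;
}
```

Calling child_write_is_private() from main() is expected to return 1 on any system with per-process address spaces.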
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
System calls
Definition
a userspace process enters the kernel
mechanism is CPU architecture dependent
can be software interrupt (int 0x80)
can be special asm instruction (sysenter)
arguments are passed in CPU registers (on i386: ebx, ecx, edx, ...)
common examples
open/close/read/write
exit/fork/execve/kill
socketcall (multiplexes socket/bind/connect/listen/...)
about 270 system calls in 2.6.x kernels
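To illustrate what "entering the kernel" looks like from C, here is a minimal sketch, assuming Linux with glibc: syscall(2) issues a system call by number, bypassing the usual library wrapper. SYS_getpid comes from <sys/syscall.h>; the helper name raw_getpid() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel for our PID by system call number
 * instead of going through the getpid() library wrapper. */
long raw_getpid(void)
{
    long pid = syscall(SYS_getpid);

    assert(pid > 0);    /* getpid cannot fail */
    return pid;
}
```

The value returned by raw_getpid() should match what the getpid() wrapper reports for the same process.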
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
Invocation of system call
chronological order of events in case of a system call
userspace process calls library function
library function is executed within the process' address space
library will eventually prepare a system call, loading the arguments into registers
library will then trigger the syscall (int 0x80 / sysenter / ...)
execution will switch to syscall context in kernel mode
kernel will look up the system call table and dispatch to the respective function
syscall function in the kernel will handle the syscall
all data between kernel/userspace needs to be copied between address spaces
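The sequence above, including the copy between address spaces, can be sketched with a pipe, assuming Linux with glibc: the write copies the caller's buffer into a kernel buffer, and the read copies it back out into a different userspace buffer. The helper name roundtrip_through_kernel() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: sends msg through a pipe and reads it back.
 * Returns the number of bytes copied into out, or -1 on error. */
long roundtrip_through_kernel(const char *msg, char *out, size_t outlen)
{
    int fds[2];
    size_t len;
    long n;

    assert(msg != NULL && out != NULL);
    len = strlen(msg);
    if (len == 0 || len >= outlen || pipe(fds) != 0)
        return -1;

    /* userspace -> kernel: the raw system call behind the write() wrapper */
    n = syscall(SYS_write, fds[1], msg, len);
    if (n == (long)len) {
        /* kernel -> userspace: library wrapper, same mechanism underneath */
        n = read(fds[0], out, outlen - 1);
        if (n >= 0)
            out[n] = '\0';
    } else {
        n = -1;
    }
    close(fds[0]);
    close(fds[1]);
    return n;
}
```

The two buffers never share memory; the only path between them is the kernel's pipe buffer.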
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
Execution contexts
apart from scheduling between userspace processes, the kernel has other jobs, such as reacting to external events
hardirq
hardware interrupt line was triggered
softirq
the workhorse behind a hardirq
userspace
executing within userspace process
syscall
invoked by a system call from userspace
vsyscall
virtual system calls, executed in userspace context
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
hardirq context
interrupt generated by hardware is received and handled
can be interrupted by other hardirqs
does only a minimal job and returns
examples
packet has arrived on network board
character was received on serial port
DMA read/write to disk drive has completed
timer interrupt went off
in most cases, a hardirq is followed by softirq or tasklet.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
softirq context
softirqs are run after hardirq
do the real work associated with a hardirq
multithreaded (can run simultaneously on multiple CPUs)
examples
network receive softirq
timer softirq
prior to softirqs, Linux had so-called 'bottom halves' (BHs)
softirqs introduced in 2.4.x (net rx/tx softirq)
bottom halves removed in 2.6.x
difference: only one BH can be run at a time
BHs have to be converted to tasklets in 2.6.x
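The division of labour between hardirq and softirq can be modelled in plain C. This is a single-threaded userspace model, not kernel code: struct rx_model, model_hardirq() and model_softirq() are hypothetical names, and real handlers additionally need the locking discussed later in this talk.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_LEN 8

/* Hypothetical state shared between the two halves */
struct rx_model {
    int queue[QUEUE_LEN];   /* packet lengths handed over by the hard half */
    size_t head, tail;      /* head: next to process, tail: next free slot */
    int softirq_pending;    /* set by the hard half, cleared by the soft half */
    long bytes_processed;   /* result of the deferred "real work" */
};

/* "hardirq": do the minimum (queue the packet, flag the work) and return.
 * Returns 0 on success, -1 if the queue is full and the packet is dropped. */
int model_hardirq(struct rx_model *m, int packet_len)
{
    assert(m != NULL);
    if (m->tail - m->head == QUEUE_LEN)
        return -1;
    m->queue[m->tail % QUEUE_LEN] = packet_len;
    m->tail++;
    m->softirq_pending = 1;
    return 0;
}

/* "softirq": runs later, drains the queue and does the real work.
 * Returns the number of packets processed. */
int model_softirq(struct rx_model *m)
{
    int done = 0;

    assert(m != NULL);
    if (!m->softirq_pending)
        return 0;
    m->softirq_pending = 0;
    while (m->head != m->tail) {
        m->bytes_processed += m->queue[m->head % QUEUE_LEN];
        m->head++;
        done++;
    }
    return done;
}
```

After two model_hardirq() calls nothing has been processed yet; the work only happens once model_softirq() runs.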
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
tasklets
tasklets sit somewhere between softirqs and bottom halves
one particular tasklet cannot run on multiple CPUs simultaneously
different tasklets can run on different CPUs simultaneously
otherwise, same as softirq context
tasklets are implemented inside the 'tasklet softirq'
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
syscall / userspace context
userspace context
in userspace, executing a process
syscall context
inside the kernel, when a userspace process issues a syscall
vsyscalls (virtual syscalls)
first introduced with the x86-64 (AMD Opteron) arch
fast read-only access to kernel data structures
can serve calls like gettimeofday() without a switch into kernel mode
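From the caller's side a virtual syscall looks like any other library call. A minimal sketch, assuming a POSIX system; whether the C library answers from userspace or enters the kernel depends on the platform, and this code cannot tell the difference. The helper name now_usec() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: microseconds since the epoch via gettimeofday().
 * On platforms with vsyscall support the library may serve this without
 * entering the kernel; that is an assumption about the platform. */
long long now_usec(void)
{
    struct timeval tv;
    int rc = gettimeofday(&tv, NULL);

    assert(rc == 0);
    (void)rc;   /* keep the compiler quiet when asserts are disabled */
    return (long long)tv.tv_sec * 1000000LL + tv.tv_usec;
}
```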
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
synchronization
Due to reentrancy and SMP, synchronization issues arise:
simple case: UP system
softirq can be interrupted by hardirq
thus, shared structures (queues, ...) need to be protected
complex case: SMP system
softirq can run at the same time on multiple CPUs
as softirqs are multithreaded, synchronization between threads has to be implemented
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
synchronization primitives
busy-waiting locks
spinlocks
if lock is not taken, take it and continue
if lock is taken, busy-loop until it is free
rwlocks
special case of spinlocks
useful when structure protected by lock is often read but rarely updated/written to
allows either
multiple readers simultaneously, or
only one writer [and no readers]
brlocks
super-fast read/write locks, with write-side penalty
avoid cache ping-pong in multi reader case
only in kernel 2.4.x
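The spinlock behaviour described above (take the lock if it is free, otherwise busy-loop) can be sketched in userspace with C11 atomics and POSIX threads. This is an analogue for illustration, not the kernel's spinlock_t; all demo_* names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace spinlock built on a C11 atomic flag */
typedef struct { atomic_flag locked; } demo_spinlock;

static demo_spinlock demo_lock = { ATOMIC_FLAG_INIT };
static long demo_counter;   /* shared data protected by demo_lock */

void demo_spin_lock(demo_spinlock *l)
{
    assert(l != NULL);
    /* test-and-set returns the previous value: false means we took the
     * lock, true means somebody else holds it, so keep spinning */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;
}

void demo_spin_unlock(demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

static void *demo_worker(void *arg)
{
    long n = *(long *)arg;

    for (long i = 0; i < n; i++) {
        demo_spin_lock(&demo_lock);
        demo_counter++;             /* critical section */
        demo_spin_unlock(&demo_lock);
    }
    return NULL;
}

/* Runs two threads that each add n to the shared counter.
 * Returns the final counter value, or -1 if a thread could not start. */
long demo_spin_count(long n)
{
    pthread_t t[2];
    int started = 0;

    demo_counter = 0;
    for (int i = 0; i < 2; i++) {
        if (pthread_create(&t[i], NULL, demo_worker, &n) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(t[i], NULL);
    return started == 2 ? demo_counter : -1;
}
```

With the lock in place, demo_spin_count(n) should return exactly 2 * n; without it, increments from the two threads could be lost.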
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
synchronization primitives (cont'd)
sleeper locks
semaphores
if semaphore can be acquired, continue
if semaphore cannot be acquired, put current process to sleep
once semaphore is available again, wakeup process
WARNING: can only be used in userspace/syscall context, where the caller is allowed to sleep
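The sleep/wakeup behaviour of a semaphore can be sketched with POSIX semaphores and threads. This is a userspace analogue, not the kernel's semaphore API; the demo_* names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

static sem_t demo_sem;
static int demo_data;   /* written before sem_post(), read after sem_wait() */

static void *demo_sleeper(void *arg)
{
    int *seen = arg;

    /* sleeps until the semaphore becomes available; retry if a signal
     * interrupts the wait */
    while (sem_wait(&demo_sem) == -1 && errno == EINTR)
        ;
    *seen = demo_data;
    return NULL;
}

/* Hands value to a sleeping thread and returns what that thread saw,
 * or -1 on error. value must be non-negative. */
int demo_semaphore_handoff(int value)
{
    pthread_t t;
    int seen = -1;

    assert(value >= 0);             /* -1 is reserved for errors */
    if (sem_init(&demo_sem, 0, 0) != 0)
        return -1;
    if (pthread_create(&t, NULL, demo_sleeper, &seen) != 0) {
        sem_destroy(&demo_sem);
        return -1;
    }
    demo_data = value;
    sem_post(&demo_sem);            /* wake up the sleeper */
    pthread_join(t, NULL);
    sem_destroy(&demo_sem);
    return seen;
}
```

Unlike the spinning waiter of a busy-waiting lock, the sleeping thread consumes no CPU time while it waits.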
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
new locking primitives in 2.6.x
seqlocks
introduced with vsyscalls in 2.5/2.6
reader/writer consistency mechanism that does not starve writers
readers never block but may have to retry if a write is in progress
read-copy-update (RCU)
new lockless mechanism in kernel 2.5/2.6
defers update of data structure until all CPUs have scheduled and thus nobody holds a reference any more
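The retry logic of a seqlock reader can be sketched in userspace with C11 atomics. This is a simplified analogue that assumes a single writer and uses sequentially consistent atomics throughout (simple, not the fastest); it is not the kernel's seqlock_t, and all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical seqlock protecting a pair of values.
 * The sequence number is odd while a write is in progress. */
struct demo_seqlock {
    atomic_uint seq;
    atomic_long a, b;   /* payload; atomics keep concurrent access defined */
};

void demo_seq_init(struct demo_seqlock *s)
{
    assert(s != NULL);
    atomic_init(&s->seq, 0);
    atomic_init(&s->a, 0);
    atomic_init(&s->b, 0);
}

/* Writer side (single writer assumed): never waits for readers */
void demo_seq_write(struct demo_seqlock *s, long a, long b)
{
    atomic_fetch_add(&s->seq, 1);   /* now odd: write in progress */
    atomic_store(&s->a, a);
    atomic_store(&s->b, b);
    atomic_fetch_add(&s->seq, 1);   /* even again: write complete */
}

/* One read attempt. Returns 1 and fills *a and *b if the snapshot is
 * consistent, 0 if a write interfered and the caller has to retry. */
int demo_seq_try_read(struct demo_seqlock *s, long *a, long *b)
{
    unsigned start = atomic_load(&s->seq);

    if (start & 1u)
        return 0;                   /* writer is in the middle of an update */
    *a = atomic_load(&s->a);
    *b = atomic_load(&s->b);
    return atomic_load(&s->seq) == start;
}

/* Reader side: never blocks the writer, retries until consistent */
void demo_seq_read(struct demo_seqlock *s, long *a, long *b)
{
    while (!demo_seq_try_read(s, a, b))
        ;
}
```

A reader that observes an odd sequence number, or a number that changed during its read, simply discards the snapshot and tries again.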
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
example: incoming network packet
hardirq context
NIC asserts its interrupt line after a packet has been received
kernel enters (arch/i386/kernel/entry.S:common_interrupt)
core interrupt handler (arch/i386/kernel/irq.c:do_IRQ)
hardirq handler of network driver (drivers/net/tulip/interrupt.c:tulip_interrupt)
net/core/dev.c:netif_rx(): append skb to backlog queue
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
example: incoming network packet
softirq context
net/core/dev.c:net_rx_action()
net/core/dev.c:process_backlog()
net/core/dev.c:netif_receive_skb()
net/core/dev.c:deliver_skb()
net/ipv4/ip_input.c:ip_rcv()
netfilter prerouting hook
net/ipv4/ip_input.c:ip_rcv_finish()
call routing code
net/ipv4/ip_input.c:ip_local_deliver()
netfilter localin hook
net/ipv4/ip_input.c:ip_local_deliver_finish()
call L4 protocol handler
net/ipv4/udp.c:udp_rcv()
look up socket, if any
include/net/sock.h:sock_queue_rcv_skb()
enqueue into socket receive queue
net/core/sock.c:sock_def_readable()
wake_up_interruptible() on socket waitqueue
return from recv() via socketcall
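The userspace end of this receive path can be exercised with a UDP socket on the loopback interface. A minimal sketch, assuming a Linux/POSIX system with a working loopback device; the helper name udp_loopback() is hypothetical, and the two-second receive timeout is only there so the sketch cannot hang.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: sends msg to a UDP socket bound on 127.0.0.1 and
 * receives it again. Returns the number of bytes received, -1 on error. */
long udp_loopback(const char *msg, char *out, size_t outlen)
{
    struct sockaddr_in addr;
    struct timeval tv = { 2, 0 };           /* receive timeout */
    socklen_t alen = sizeof(addr);
    long n = -1;
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);

    assert(msg != NULL && out != NULL);
    if (rx < 0 || tx < 0)
        goto done;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                      /* let the kernel pick a port */
    if (bind(rx, (struct sockaddr *)&addr, sizeof(addr)) != 0)
        goto done;
    if (getsockname(rx, (struct sockaddr *)&addr, &alen) != 0)
        goto done;
    if (setsockopt(rx, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) != 0)
        goto done;

    if (sendto(tx, msg, strlen(msg), 0,
               (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto done;

    /* returns once the datagram sits in the socket receive queue;
     * sleeps on the socket waitqueue if it is not there yet */
    n = recv(rx, out, outlen, 0);
done:
    if (rx >= 0)
        close(rx);
    if (tx >= 0)
        close(tx);
    return n;
}
```

On the loopback device no NIC interrupt is involved, so this only exercises the softirq and socket part of the path listed above.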
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
example: reading of a file
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
Thanks
The slides and an accompanying paper of this presentation are available at http://www.gnumonks.org/
Thanks to
the BBS people, Z-Netz, FIDO, ...
for heavily increasing my computer usage in 1992
KNF
for bringing me in touch with the internet as early as 1994
for providing a playground for technical people
for telling me about the existence of Linux!
Alan Cox, Alexey Kuznetsov, David Miller, Andi Kleen
for implementing (one of?) the world's best TCP/IP stacks
Paul 'Rusty' Russell
for starting the netfilter/iptables project
for trusting me to maintain it today
Astaro AG
for sponsoring parts of my netfilter work