%include "default.mgp"
%default 1 bgrad
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
%nodefault
%back "blue"

%center
%size 7


Architecture of the Linux kernel
%size 5
or: The world beyond the syscall barrier


%center
%size 4
by

Harald Welte <laforge@gnumonks.org>


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
Prerequisites

Due to the technical nature of this presentation, the audience should be familiar with the following subjects:

	experience in programming on a Linux/*NIX system
		C language preferred
	general knowledge about computer hardware
		interrupts / IO / DMA 
	general knowledge about modern CPU architecture
		address space / MMU
		'protected mode' / supervisor mode / ... 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
Kernel / Userspace

	OS kernel provides
		hardware abstraction (file I/O, network I/O, ...)
		resource allocation / limiting
		address space separation
		privilege separation
		IPC

	the traditional process model in *NIX operating systems
		processes reside in separate virtual address spaces
		kernel only executes one process (init) at bootup
		all other processes descend from init
		processes are scheduled and preempted by the kernel
		processes invoke system functions via syscalls
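
The process model above can be tried from userspace. A minimal sketch, assuming a POSIX system; the helper name run_child is made up for this illustration. The parent creates a child with fork(), the child terminates, and the parent collects its exit status:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with `code`,
 * then wait for it and return the exit status seen by the parent. */
int run_child(int code)
{
    pid_t pid = fork();     /* child gets its own address space */
    int status;

    if (pid < 0)
        return -1;          /* fork failed */
    if (pid == 0)
        _exit(code);        /* child: terminate immediately */

    if (waitpid(pid, &status, 0) < 0)   /* parent: reap the child */
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

fork(), _exit() and waitpid() are themselves thin wrappers around system calls, so even this small helper crosses into the kernel several times.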

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
System calls

Definition	

	a userspace process enters the kernel
	mechanism is CPU architecture dependent
		can be software interrupt (int 0x80)
		can be special asm instruction (sysenter)
	arguments are passed in registers (i386: syscall number in eax, arguments in ebx, ecx, edx, ...)
	common examples
		open/close/read/write
		exit/fork/execve/kill
		socketcall (multiplexes socket/bind/connect/listen/...)
	about 270 system calls in 2.6.x kernels
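
To see a system call without its usual library wrapper, the following sketch assumes Linux with glibc, where syscall(2) is the generic entry point and SYS_getpid is the call number from <sys/syscall.h>. The function name raw_getpid is invented for this example:

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Enter the kernel directly instead of going through the libc
 * getpid() wrapper. Which instruction is used underneath
 * (int 0x80, sysenter, ...) is chosen by libc and the kernel. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

The value returned should match what getpid() reports for the same process.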

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
Invocation of system call

chronological order of events in case of a system call

	userspace process calls library function 
		library function is executed within the process' address space
		library will eventually prepare a system call, placing the arguments in registers
		library will issue syscall (int 0x80 / sysenter / ...)
	execution will switch to syscall context in kernel mode
		kernel will look up the system call table and dispatch to the respective function
		syscall function in the kernel will handle the syscall
		all data passed between kernel and userspace needs to be copied between address spaces (copy_from_user / copy_to_user)
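
The table lookup and dispatch step can be pictured with a toy model in plain C. This is only an illustration of the idea, not kernel code: every name and call number below is hypothetical, and the error value merely mimics -ENOSYS.

```c
/* Toy model of a system call table: an array of function pointers
 * indexed by call number. All names and numbers are hypothetical. */
typedef long (*toy_syscall_fn)(long a, long b);

static long toy_add(long a, long b) { return a + b; }
static long toy_mul(long a, long b) { return a * b; }

static const toy_syscall_fn toy_table[] = {
    toy_add,    /* call number 0 */
    toy_mul,    /* call number 1 */
};

#define TOY_ENOSYS (-38L)   /* stand-in for -ENOSYS on unknown numbers */

/* Validate the call number, then dispatch through the table. */
long toy_dispatch(unsigned long nr, long a, long b)
{
    if (nr >= sizeof(toy_table) / sizeof(toy_table[0]))
        return TOY_ENOSYS;
    return toy_table[nr](a, b);
}
```

The bounds check matters: the call number comes from userspace and cannot be trusted.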

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
Execution contexts

Apart from scheduling between userspace processes, the kernel has other jobs, such as reacting to external events. Code therefore runs in one of several execution contexts:

	hardirq
		hardware interrupt line was triggered
	softirq
		the workhorse behind a hardirq
	userspace
		executing within userspace process
	syscall
		invoked by a system call from userspace
	vsyscall
		virtual system calls, executed in userspace context

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
hardirq context

	interrupt generated by hardware is received + handled
	can be interrupted by other hardirqs
	does only minimal job and returns
	examples
		packet has arrived on network board
		character was received on serial port
		DMA read/write to disk drive has completed
		timer interrupt went off

	in most cases, a hardirq is followed by a softirq or tasklet
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
softirq context

	softirqs are run after hardirq
	do the real work associated with a hardirq
	multithreaded (can run simultaneously on multiple cpus)
	examples
		network receive softirq
		timer softirq

	prior to softirqs, Linux had so-called 'bottom halves' (BHs)
		softirqs introduced in 2.4.x (net rx/tx softirq)
		bottom halves removed in 2.6.x
		difference: only one BH can run at a time, system-wide
		BHs have to be converted to tasklets in 2.6.x

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
tasklets

	tasklets sit in between softirqs and bottom halves
		one particular tasklet cannot run on multiple CPUs simultaneously
		different tasklets can run on different CPUs simultaneously

	otherwise, same as softirq context
		tasklets are implemented inside the 'tasklet softirq'


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
syscall / userspace context

	userspace context
		in userspace, executing a process

	syscall context
		inside kernel, when userspace process issues syscall()

	vsyscalls (virtual syscalls)
		first introduced with the x86-64 (AMD Opteron) arch
		fast read-only access to kernel data structures
		can do things like gettimeofday() without switching into kernel mode
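
From the caller's side, a vsyscall looks like any other library call. A small sketch, assuming a POSIX system; now_usec is a name made up for this example, and whether the call really avoids entering the kernel depends on architecture, kernel version and libc:

```c
#include <stddef.h>
#include <sys/time.h>

/* Return the current time in microseconds since the epoch, or -1.
 * The caller cannot tell whether this was served from a
 * vsyscall page or by a real system call. */
long long now_usec(void)
{
    struct timeval tv;

    if (gettimeofday(&tv, NULL) != 0)
        return -1;
    return (long long)tv.tv_sec * 1000000LL + tv.tv_usec;
}
```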

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
synchronization

Due to reentrancy and SMP, synchronization issues arise:

	simple case: UP system
		softirq can be interrupted by hardirq
			thus, shared structures (queues, ...) need to be protected
	complex case: SMP system
		softirq can run at the same time on multiple CPUs
			thus, as softirqs are multithreaded, synchronization between CPUs has to be implemented

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
synchronization primitives

busy-waiting locks

	spinlocks
		if lock was not taken, take it and continue
		if lock was taken, busy-loop until it is free
	rwlocks
		special case of spinlocks
		useful when structure protected by lock is often read but rarely updated/written to
		allows either
			multiple readers simultaneously, or
			only one writer [and no readers]
	brlocks
		super-fast read/write locks, with write-side penalty
		avoid cache ping-pong in multi reader case
		only in kernel 2.4.x
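
The spinlock behaviour described above can be sketched in userspace with C11 atomics. This is not the kernel's spinlock_t; the toy_ names are invented for the illustration, and a real kernel spinlock additionally deals with interrupts and preemption.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace sketch of a spinlock built on an atomic flag. */
typedef struct {
    atomic_flag taken;
} toy_spinlock;

#define TOY_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Try once: returns true if we took the lock. */
bool toy_spin_trylock(toy_spinlock *l)
{
    /* test_and_set returns the previous value: false means it was free */
    return !atomic_flag_test_and_set_explicit(&l->taken,
                                              memory_order_acquire);
}

/* If the lock is free, take it; otherwise busy-loop until it is. */
void toy_spin_lock(toy_spinlock *l)
{
    while (!toy_spin_trylock(l))
        ;   /* spin */
}

void toy_spin_unlock(toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->taken, memory_order_release);
}
```

Because a waiter burns CPU cycles instead of sleeping, this style of lock only pays off when the protected section is short.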

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
synchronization primitives (cont'd)

sleeper locks

	semaphores
		if semaphore can be acquired, continue
		if semaphore cannot be acquired, put current process to sleep
			once semaphore is available again, wakeup process

		WARNING: can only be used in userspace/syscall (process) context, because hardirq and softirq context must not sleep

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
new locking primitives in 2.6.x

	seqlocks
		introduced with vsyscalls in 2.5/2.6
		reader/writer consistent mechanism without starving writers
		readers never block but may have to retry if write in progress
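
The retry idea behind seqlocks can be sketched in userspace with C11 atomics. This is a simplified single-writer model, not the kernel's seqlock_t; all toy_ names and the b == 2 * a invariant are assumptions made for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Single-writer seqlock sketch protecting a pair of values. */
typedef struct {
    atomic_uint seq;    /* odd while a write is in progress */
    atomic_long a, b;   /* protected data: invariant is b == 2 * a */
} toy_seq_pair;

void toy_write(toy_seq_pair *p, long v)
{
    atomic_fetch_add(&p->seq, 1);   /* seq becomes odd */
    atomic_store(&p->a, v);
    atomic_store(&p->b, 2 * v);
    atomic_fetch_add(&p->seq, 1);   /* seq becomes even again */
}

unsigned toy_read_begin(toy_seq_pair *p)
{
    return atomic_load(&p->seq);
}

/* True if a snapshot taken since `start` may be inconsistent. */
bool toy_read_retry(toy_seq_pair *p, unsigned start)
{
    return (start & 1u) || atomic_load(&p->seq) != start;
}

/* Reader: never blocks, but loops until it sees a consistent pair. */
void toy_read(toy_seq_pair *p, long *a, long *b)
{
    unsigned start;

    do {
        start = toy_read_begin(p);
        *a = atomic_load(&p->a);
        *b = atomic_load(&p->b);
    } while (toy_read_retry(p, start));
}
```

The writer never waits for readers, which is why writers cannot be starved; the cost is that a reader may have to repeat its work.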

	read copy update
		new lockless mechanism in kernel 2.5/2.6
		defers freeing the old version of a data structure until all CPUs have scheduled, so that nobody holds a reference to it anymore



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
example: incoming network packet 

hardirq context
	NIC raises its interrupt line after a packet was received
	kernel enters (arch/i386/kernel/entry.S:common_interrupt)
	core interrupt handler (arch/i386/kernel/irq.c:do_IRQ) 
	hardirq handler of network driver (drivers/net/tulip/interrupt.c:tulip_interrupt)
	net/core/dev.c:netif_rx(): append skb to backlog queue

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
example: incoming network packet 

softirq context
		net/core/dev.c:net_rx_action()
		net/core/dev.c:process_backlog()
		net/core/dev.c:netif_receive_skb()
		net/core/dev.c:deliver_skb()
		net/ipv4/ip_input.c:ip_rcv()
			netfilter prerouting hook
		net/ipv4/ip_input.c:ip_rcv_finish()
			call routing code 
		net/ipv4/ip_input.c:ip_local_deliver()
			netfilter localin hook
		net/ipv4/ip_input.c:ip_local_deliver_finish()
			call l4 protocol
		net/ipv4/udp.c:udp_rcv()
			lookup socket, if any
		include/net/sock.h:sock_queue_rcv_skb()
			enqueue into socket receiver queue
		net/core/sock.c:sock_def_readable()
			wake_up_interruptible() on socket waitqueue
		woken process (now in syscall context) returns from recv() via socketcall
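
The userspace end of this path is the recv() call that sleeps on the socket until a packet is queued. A minimal sketch, assuming a POSIX system with a working loopback interface; udp_loopback is a name invented for this example. Loopback traffic involves no NIC hardirq, so this only exercises the part of the path from the socket receive queue back to the process:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send `msg` to ourselves over loopback UDP and receive it again.
 * Returns the number of bytes received into buf, or -1 on error. */
ssize_t udp_loopback(const char *msg, char *buf, size_t buflen)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    ssize_t n = -1;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;              /* let the kernel pick a free port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
        getsockname(fd, (struct sockaddr *)&addr, &alen) == 0 &&
        sendto(fd, msg, strlen(msg), 0,
               (struct sockaddr *)&addr, sizeof(addr)) >= 0)
        n = recv(fd, buf, buflen, 0);   /* returns once data is queued */

    close(fd);
    return n;
}
```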

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
example: reading of a file

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
Architecture of the Linux kernel
Thanks
	The slides and an accompanying paper of this presentation are available at http://www.gnumonks.org/

	Thanks to
		the BBS people, Z-Netz, FIDO, ...
			for heavily increasing my computer usage in 1992
		KNF
			for bringing me in touch with the internet as early as 1994
			for providing a playground for technical people
			for telling me about the existence of Linux!
		Alan Cox, Alexey Kuznetsov, David Miller, Andi Kleen
			for implementing (one of?) the world's best TCP/IP stacks
		Paul 'Rusty' Russell
			for starting the netfilter/iptables project
			for trusting me to maintain it today
		Astaro AG
			for sponsoring parts of my netfilter work
