| 经验 |
针对好多Linux 爱好者对内核很有兴趣却无从下口,本文旨在介绍一种解读linux内核源码的入门方法,而不是解说linux复杂的内核机制;
一.核心源程序的文件组织:
1.Linux核心源程序通常都安装在/usr/src/linux下,而且它有一个非常简单的编号约定:任何偶数的核心(例如2.0.30)都是一个稳定地发行的核心,而任何奇数的核心(例如2.1.42)都是一个开发中的核心。
本文基于稳定的2.2.5源代码,第二部分的实现平台为 Redhat Linux 6.0。
2.核心源程序的文件按树形结构进行组织,在源程序树的最上层你会看到这样一些目录:
●Arch :arch子目录包括了所有和体系结构相关的核心代码。它的每一个子目录都代表一种支持的体系结构,例如i386就是关于intel cpu及与之相兼容体系结构的子目录。PC机一般都基于此目录;
●Include: include子目录包括编译核心所需要的大部分头文件。与平台无关的头文件在 include/linux子目录下,与 intel cpu相关的头文件在include/asm-i386子目录下,而include/scsi目录则是有关scsi设备的头文件目录;
●Init: 这个目录包含核心的初始化代码(注:不是系统的引导代码),包含两个文件main.c和Version.c,这是研究核心如何工作的一个非常好的起点。
●Mm :这个目录包括所有独立于 cpu 体系结构的内存管理代码,如页式存储管理内存的分配和释放等;而和体系结构相关的内存管理代码则位于arch/*/mm/,例如arch/i386/mm/Fault.c
●Kernel:主要的核心代码,此目录下的文件实现了大多数linux系统的内核函数,其中最重要的文件当属sched.c;同样,和体系结构相关的代码在arch/*/kernel中;
●Drivers: 放置系统所有的设备驱动程序;每种驱动程序又各占用一个子目录:如,/block 下为块设备驱动程序,比如ide(ide.c)。如果你希望查看所有可能包含文件系统的设备是如何初始化的,你可以看drivers/block/genhd.c中的device_setup()。它不仅初始化硬盘,也初始化网络,因为安装nfs文件系统的时候需要网络其他: 如, Lib放置核心的库代码; Net,核心与网络相关的代码; Ipc,这个目录包含核心的进程间通讯的代码; Fs ,所有的文件系统代码和各种类型的文件操作代码,它的每一个子目录支持一个文件系统,例如fat和ext2;
●Scripts, 此目录包含用于配置核心的脚本文件等。
一般,在每个目录下,都有一个 .depend 文件和一个 Makefile 文件,这两个文件都是编译时使用的辅助文件,仔细阅读这两个文件对弄清各个文件这间的联系和依托关系很有帮助;而且,在有的目录下还有Readme 文件,它是对该目录下的文件的一些说明,同样有利于我们对内核源码的理解;
二.解读实战:为你的内核增加一个系统调用
虽然,Linux 的内核源码用树形结构组织得非常合理、科学,把功能相关联的文件都放在同一个子目录下,这样使得程序更具可读性。然而,Linux 的内核源码实在是太大而且非常复杂,即便采用了很合理的文件组织方法,在不同目录下的文件之间还是有很多的关联,分析核心的一部分代码通常会要查看其它的几个相关的文件,而且可能这些文件还不在同一个子目录下。
体系的庞大复杂和文件之间关联的错综复杂,可能就是很多人对其望而生畏的主要原因。当然,这种令人生畏的劳动所带来的回报也是非常令人着迷的:你不仅可以从中学到很多的计算机的底层的知识(如下面将讲到的系统的引导),体会到整个操作系统体系结构的精妙和在解决某个具体细节问题时,算法的巧妙;而且更重要的是:在源码的分析过程中,你就会被一点一点地、潜移默化地专业化;甚至,只要分析十分之一的代码后,你就会深刻地体会到,什么样的代码才是一个专业的程序员写的,什么样的代码是一个业余爱好者写的。
为了使读者能更好的体会到这一特点,下面举了一个具体的内核分析实例,希望能通过这个实例,使读者对 Linux的内核的组织有些具体的认识,从中读者也可以学到一些对内核的分析方法。
以下即为分析实例:
【一】操作平台:
硬件:cpu intel Pentium II ;
软件:Redhat Linux 6.0; 内核版本2.2.5【二】相关内核源代码分析:
1.系统的引导和初始化:Linux 系统的引导有好几种方式:常见的有 Lilo, Loadin引导和Linux的自举引导
(bootsect-loader),而后者所对应源程序为arch/i386/boot/bootsect.S,它为实模式的汇编程序,限于篇幅在此不做分析;无论是哪种引导方式,最后都要跳转到 arch/i386/Kernel/setup.S, setup.S主要是进行时模式下的初始化,为系统进入保护模式做准备;此后,系统执行 arch/i386/kernel/head.S (对经压缩后存放的内核要先执行 arch/i386/boot/compressed/head.S); head.S 中定义的一段汇编程序setup_idt ,它负责建立一张256项的 idt 表(Interrupt Descriptor Table),此表保存着所有自陷和中断的入口地址;其中包括系统调用总控程序 system_call 的入口地址;当然,除此之外,head.S还要做一些其他的初始化工作;
2.系统初始化后运行的第一个内核程序asmlinkage void __init start_kernel(void) 定义在/usr/src/linux/init/main.c中,它通过调用usr/src/linux/arch/i386/kernel/traps.c 中的一个函数
void __init trap_init(void) 把各自陷和中断服务程序的入口地址设置到 idt 表中,其中系统调用总控程序system_cal就是中断服务程序之一;void __init trap_init(void) 函数则通过调用一个宏
set_system_gate(SYSCALL_VECTOR,&system_call); 把系统调用总控程序的入口挂在中断0x80上;
其中SYSCALL_VECTOR是定义在 /usr/src/linux/arch/i386/kernel/irq.h中的一个常量0x80; 而 system_call 即为中断总控程序的入口地址;中断总控程序用汇编语言定义在/usr/src/linux/arch/i386/kernel/entry.S中;
3.中断总控程序主要负责保存处理机执行系统调用前的状态,检验当前调用是否合法, 并根据系统调用向量,使处理机跳转到保存在 sys_call_table 表中的相应系统服务例程的入口; 从系统服务例程返回后恢复处理机状态退回用户程序;
而系统调用向量则定义在/usr/src/linux/include/asm-386/unistd.h 中;sys_call_table 表定义在/usr/src/linux/arch/i386/kernel/entry.S 中; 同时在 /usr/src/linux/include/asm-386/unistd.h 中也定义了系统调用的用户编程接口;
4.由此可见 , linux 的系统调用也象 dos 系统的 int 21h 中断服务, 它把0x80 中断作为总的入口, 然后转到保存在 sys_call_table 表中的各种中断服务例程的入口地址 , 形成各种不同的中断服务;
由以上源代码分析可知, 要增加一个系统调用就必须在 sys_call_table 表中增加一项 , 并在其中保存好自己的系统服务例程的入口地址,然后重新编译内核,当然,系统服务例程是必不可少的。
由此可知在此版linux内核源程序中,与系统调用相关的源程序文件就包括以下这些:
1.arch/i386/boot/bootsect.S
2.arch/i386/Kernel/setup.S
3.arch/i386/boot/compressed/head.S
4.arch/i386/kernel/head.S
5.init/main.c
6.arch/i386/kernel/traps.c
7.arch/i386/kernel/entry.S
8.arch/i386/kernel/irq.h
9.include/asm-386/unistd.h
当然,这只是涉及到的几个主要文件。而事实上,增加系统调用真正要修改文件只有include/asm-386/unistd.h和arch/i386/kernel/entry.S两个;
【三】 对内核源码的修改:
1.在kernel/sys.c中增加系统服务例程如下:
asmlinkage int sys_addtotal(int numdata)
{
int i=0,enddata=0;
while(i<=numdata)
enddata+=i++;
return enddata;
}
该函数有一个 int 型入口参数 numdata , 并返回从 0 到 numdata 的累加值; 当然也可以把系统服务例程放在一个自己定义的文件或其他文件中,只是要在相应文件中作必要的说明;
2.把 asmlinkage int sys_addtotal( int) 的入口地址加到sys_call_table表中:
arch/i386/kernel/entry.S 中的最后几行源代码修改前为:
... ...
.long SYMBOL_NAME(sys_sendfile)
.long SYMBOL_NAME(sys_ni_syscall) /* streams1 */
.long SYMBOL_NAME(sys_ni_syscall) /* streams2 */
.long SYMBOL_NAME(sys_vfork) /* 190 */
.rept NR_syscalls-190
.long SYMBOL_NAME(sys_ni_syscall)
.endr
修改后为:
... ...
.long SYMBOL_NAME(sys_sendfile)
.long SYMBOL_NAME(sys_ni_syscall) /* streams1 */
.long SYMBOL_NAME(sys_ni_syscall) /* streams2 */
.long SYMBOL_NAME(sys_vfork) /* 190 */
/* add by I */
.long SYMBOL_NAME(sys_addtotal)
.rept NR_syscalls-191
.long SYMBOL_NAME(sys_ni_syscall)
.endr
3. 把增加的 sys_call_table 表项所对应的向量,在include/asm-386/unistd.h 中进行必要申明,以供用户进程和其他系统进程查询或调用:
增加后的部分 /usr/src/linux/include/asm-386/unistd.h 文件如下:
... ...
#define __NR_sendfile 187
#define __NR_getpmsg 188
#define __NR_putpmsg 189
#define __NR_vfork 190
/* add by I */
#define __NR_addtotal 191
4.测试程序(test.c)如下:
#include
#include
_syscall1(int,addtotal,int, num)
main()
{
int i,j;
do
printf("Please input a number\n");
while(scanf("%d",&i)==EOF);
if((j=addtotal(i))==-1)
printf("Error occurred in syscall-addtotal();\n");
printf("Total from 0 to %d is %d \n",i,j);
}
对修改后的新的内核进行编译,并引导它作为新的操作系统,运行几个程序后可以发现一切正常;在新的系统下对测试程序进行编译(*注:由于原内核并未提供此系统调用,所以只有在编译后的新内核下,此测试程序才能可能被编译通过),运行情况如下:
$gcc -o test test.c
$./test
Please input a number
36
Total from 0 to 36 is 666
可见,修改成功;
而且,对相关源码的进一步分析可知,在此版本的内核中,从/usr/src/linux/arch/i386/kernel/entry.S
文件中对 sys_call_table 表的设置可以看出,有好几个系统调用的服务例程都是定义在/usr/src/linux/kernel/sys.c 中的同一个函数:
asmlinkage int sys_ni_syscall(void)
{
return -ENOSYS;
}
例如第188项和第189项就是如此:
... ...
.long SYMBOL_NAME(sys_sendfile)
.long SYMBOL_NAME(sys_ni_syscall) /* streams1 */
.long SYMBOL_NAME(sys_ni_syscall) /* streams2 */
.long SYMBOL_NAME(sys_vfork) /* 190 */
... ...
而这两项在文件 /usr/src/linux/include/asm-386/unistd.h 中却申明如下:
... ...
#define __NR_sendfile 187
#define __NR_getpmsg 188 /* some people actually want streams */
#define __NR_putpmsg 189 /* some people actually want streams */
#define __NR_vfork 190
由此可见,在此版本的内核源代码中,由于asmlinkage int sys_ni_syscall(void) 函数并不进行任何操作,所以包括 getpmsg, putpmsg 在内的好几个系统调用都是不进行任何操作的,即有待扩充的空调用; 但它们却仍然占用着sys_call_table表项,估计这是设计者们为了方便扩充系统调用而安排的; 所以只需增加相应服务例程(如增加服务例程getmsg或putpmsg),就可以达到增加系统调用的作用。
Andries Brouwer, aeb@cwi.nl 2001-01-01
A program
---------------------------------------------------------------------------------------------------
#include <unistd.h>
#include <fcntl.h>
int main(){
int fd;
char buf[512];
fd = open("/dev/hda", O_RDONLY);
if (fd >= 0)
read(fd, buf, sizeof(buf));
return 0;
}
---------------------------------------------------------------------------------------------------
This little program opens the block special device referring to the first IDE disk, and if the open succeeded reads the first sector. What happens in the kernel? Let us read 2.4.0 source.
---------------------------------------------------------------------------------------------------
int sys_open(const char *filename, int flags, int mode) {
char *tmp = getname(filename);
int fd = get_unused_fd();
struct file *f = filp_open(tmp, flags, mode);
fd_install(fd, f);
putname(tmp);
return fd;
}
---------------------------------------------------------------------------------------------------
The routine getname() is found in fs/namei.c. It copies the file name from user space to kernel space:
---------------------------------------------------------------------------------------------------
#define __getname() kmem_cache_alloc(names_cachep, SLAB_KERNEL)
#define putname(name) kmem_cache_free(names_cachep, (void *)(name))
char *getname(const char *filename) {
char *tmp = __getname(); /* allocate some memory */
strncpy_from_user(tmp, filename, PATH_MAX + 1);
return tmp;
}
---------------------------------------------------------------------------------------------------
The routine get_unused_fd() is found in fs/open.c again. It returns the first unused filedescriptor:
---------------------------------------------------------------------------------------------------
int get_unused_fd(void) {
struct files_struct *files = current->files;
int fd = find_next_zero_bit(files->open_fds,
files->max_fdset, files->next_fd);
FD_SET(fd, files->open_fds); /* in use now */
files->next_fd = fd + 1;
return fd;
}
---------------------------------------------------------------------------------------------------
Here current is the pointer to the user task struct for the currently executing task.
The routine fd_install() is found in include/linux/file.h. It just stores the information returned by filp_open()
---------------------------------------------------------------------------------------------------
void fd_install(unsigned int fd, struct file *file) {
struct files_struct *files = current->files;
files->fd[fd] = file;
}
---------------------------------------------------------------------------------------------------
So all the interesting work of sys_open() is done in filp_open(). This routine is found in fs/open.c:
---------------------------------------------------------------------------------------------------
struct file *filp_open(const char *filename, int flags, int mode) {
struct nameidata nd;
open_namei(filename, flags, mode, &nd);
return dentry_open(nd.dentry, nd.mnt, flags);
}
---------------------------------------------------------------------------------------------------
The struct nameidata is defined in include/linux/fs.h. It is used during lookups.
---------------------------------------------------------------------------------------------------
struct nameidata {
struct dentry *dentry;
struct vfsmount *mnt;
struct qstr last;
};
---------------------------------------------------------------------------------------------------
The routine open_namei() is found in fs/namei.c:
---------------------------------------------------------------------------------------------------
open_namei(const char *pathname, int flag, int mode, struct nameidata *nd) {
if (!(flag & O_CREAT)) {
/* The simplest case - just a plain lookup. */
if (*pathname == '/') {
nd->mnt = mntget(current->fs->rootmnt);
nd->dentry = dget(current->fs->root);
} else {
nd->mnt = mntget(current->fs->pwdmnt);
nd->dentry = dget(current->fs->pwd);
}
path_walk(pathname, nd);
/* Check permissions etc. */
...
return 0;
}
...
}
---------------------------------------------------------------------------------------------------
An inode (index node) describes a file. A file can have several names (or no name at all), but it has a unique inode. A dentry (directory entry)describes a name of a file: the inode plus the pathname used to find it. Avfsmount describes the filesystem we are in.
So, essentially, the lookup part op open_namei() is found in path_walk():
---------------------------------------------------------------------------------------------------
path_walk(const char *name, struct nameidata *nd) {
struct dentry *dentry;
for(;;) {
struct qstr this;
this.name = next_part_of(name);
this.len = length_of(this.name);
this.hash = hash_fn(this.name);
/* if . or .. then special, otherwise: */
dentry = cached_lookup(nd->dentry, &this);
if (!dentry)
dentry = real_lookup(nd->dentry, &this);
nd->dentry = dentry;
if (this_was_the_final_part)
return;
}
}
---------------------------------------------------------------------------------------------------
Here the cached_lookup() tries to find the given dentry in a cache of recently used dentries. If it is not found, the real_lookup() goes to the filesystem, which probably goes to disk, and actually finds the thing.After path_walk() is done, the nd argument contains the required dentry,which in turn has the inode information on the file. Finally we do dentry_open() that initializes a file struct:
---------------------------------------------------------------------------------------------------
struct file *
dentry_open(struct dentry *dentry, struct vfsmount *mnt, int flags) {
struct file *f = get_empty_filp();
f->f_dentry = dentry;
f->f_vfsmnt = mnt;
f->f_pos = 0;
f->f_op = dentry->d_inode->i_fop;
...
return f;
}
---------------------------------------------------------------------------------------------------
So far the open. In short: walk the tree, for each component hope the information is in cache, and if not ask the file system. How does this work? Each file system type provides structs super_operations,file_operations, inode_operations, address_space_operations that contain the addresses of the routines that can do stuff. And thus
---------------------------------------------------------------------------------------------------
struct dentry *real_lookup(struct dentry *parent, struct qstr *name, int flags) {
struct dentry *dentry = d_alloc(parent, name);
parent->d_inode->i_op->lookup(dir, dentry);
return dentry;
}
---------------------------------------------------------------------------------------------------
calls on the lookup routine for the specific fiilesystem, as found in the struct inode_operations in the inode of the dentry for the directory in which we do the lookup.
And this file system specific routine must read the disk data and search the directory for the file we are looking for. Good examples of file systems are minix and romfs because they are simple and small. For example,in fs/romfs/inode.c:
---------------------------------------------------------------------------------------------------
romfs_lookup(struct inode *dir, struct dentry *dentry) {
const char *name = dentry->d_name.name;
int len = dentry->d_name.len;
char fsname[ROMFS_MAXFN];
struct romfs_inode ri;
unsigned long offset = dir->i_ino & ROMFH_MASK;
for (;;) {
romfs_copyfrom(dir, &ri, offset, ROMFH_SIZE);
romfs_copyfrom(dir, fsname, offset+ROMFH_SIZE, len+1);
if (strncmp (name, fsname, len) == 0)
break;
/* next entry */
offset = ntohl(ri.next) & ROMFH_MASK;
}
inode = iget(dir->i_sb, offset);
d_add (dentry, inode);
return 0;
}
romfs_copyfrom(struct inode *i, void *dest,
unsigned long offset, unsigned long count) {
struct buffer_head *bh;
bh = bread(i->i_dev, offset>>ROMBSBITS, ROMBSIZE);
memcpy(dest, ((char *)bh->b_data) + (offset & ROMBMASK), count);
brelse(bh);
}
(All complications, all locking, and all error handling deleted.)
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
ssize_t sys_read(unsigned int fd, char *buf, size_t count) {
struct file *file = fget(fd);
return file->f_op->read(file, buf, count, &file->f_pos);
}
---------------------------------------------------------------------------------------------------
That is, the read system call asks the file system to do the reading,starting at the current file position. The f_op field was filled in the dentry_open() routine above with the i_fop field of an inode.
For romfs the struct file_operations is assigned in romfs_read_inode(). For a regular file (case 2) it assigns generic_ro_fops. For a block special file (case 4) it calls init_special_inode() (see devices.c) which assigns
def_blk_fops.
How come romfs_read_inode() was ever called? When the filesystem was mounted, the routine romfs_read_super() was called, and it assigned romfs_ops to the s_op field of the superblock struct.
---------------------------------------------------------------------------------------------------
struct super_operations romfs_ops = {
read_inode: romfs_read_inode,
statfs: romfs_statfs,
};
---------------------------------------------------------------------------------------------------
And the iget() that was skipped over in the discussion above (in romfs_lookup()) finds the inode with given number ino in a cache, and if it cannot be found there creates a new inode struct by calling get_new_inode()(see fs/inode.c):
---------------------------------------------------------------------------------------------------
struct inode * iget(struct super_block *sb, unsigned long ino) {
struct list_head * head = inode_hashtable + hash(sb,ino);
struct inode *inode = find_inode(sb, ino, head);
if (inode) {
wait_on_inode(inode);
return inode;
}
return get_new_inode(sb, ino, head);
}
struct inode *
get_new_inode(struct super_block *sb, unsigned long ino,
struct list_head *head) {
struct inode *inode = alloc_inode();
inode->i_sb = sb;
inode->i_dev = sb->s_dev;
inode->i_ino = ino;
...
sb->s_op->read_inode(inode);
}
---------------------------------------------------------------------------------------------------
So that is how the inode was filled, and we find that in our case (/dev/hda is a block special file) the routine that is called by sys_read is def_blk_fops.read, and inspection of block_dev.c shows that that is the routine block_read():
---------------------------------------------------------------------------------------------------
ssize_t block_read(struct file *filp, char *buf, size_t count, loff_t *ppos) {
struct inode *inode = filp->f_dentry->d_inode;
kdev_t dev = inode->i_rdev;
ssize_t blocksize = blksize_size[MAJOR(dev)][MINOR(dev)];
loff_t offset = *ppos;
ssize_t read = 0;
size_t left, block, blocks;
struct buffer_head *bhreq[NBUF];
struct buffer_head *buflist[NBUF];
struct buffer_head **bh;
left = count; /* bytes to read */
block = offset / blocksize; /* first block */
offset &= (blocksize-1); /* starting offset in block */
blocks = (left + offset + blocksize - 1) / blocksize;
bh = buflist;
do {
while (blocks) {
--blocks;
*bh = getblk(dev, block++, blocksize);
if (*bh && !buffer_uptodate(*bh))
bhreq[bhrequest++] = *bh;
}
if (bhrequest)
ll_rw_block(READ, bhrequest, bhreq);
/* wait for I/O to complete,
copy result to user space,
increment read and *ppos, decrement left */
} while (left > 0);
return read;
}
---------------------------------------------------------------------------------------------------
So the building blocks here are getblk(), ll_rw_block(), and wait_on_buffer().
The first of these lives in fs/buffer.c. It finds the buffer that already contains the required data if we are lucky, and otherwise a buffer that is going to be used.
---------------------------------------------------------------------------------------------------
struct buffer_head * getblk(kdev_t dev, int block, int size) {
struct buffer_head *bh;
int isize;
try_again:
bh = __get_hash_table(dev, block, size);
if (bh)
return bh;
isize = BUFSIZE_INDEX(size);
bh = free_list[isize].list;
if (bh) {
__remove_from_free_list(bh);
init_buffer(bh);
bh->b_dev = dev;
bh->b_blocknr = block;
...
return bh;
}
refill_freelist(size);
goto try_again;
}
---------------------------------------------------------------------------------------------------
The real I/O is started by ll_rw_block(). It lives in drivers/block/ll_rw_blk.c.
---------------------------------------------------------------------------------------------------
ll_rw_block(int rw, int nr, struct buffer_head * bhs[]) {
int i;
for (i = 0; i < nr; i++) {
struct buffer_head *bh = bhs[i];
bh->b_end_io = end_buffer_io_sync;
submit_bh(rw, bh);
}
}
---------------------------------------------------------------------------------------------------
Here bh->b_end_io specifies what to do when I/O is finished. In this case:
---------------------------------------------------------------------------------------------------
end_buffer_io_sync(struct buffer_head *bh, int uptodate) {
mark_buffer_uptodate(bh, uptodate);
unlock_buffer(bh);
}
---------------------------------------------------------------------------------------------------
So, ll_rw_block() just feeds the requests it gets one by one to submit_bh():
---------------------------------------------------------------------------------------------------
submit_bh(int rw, struct buffer_head *bh) {
bh->b_rdev = bh->b_dev;
bh->b_rsector = bh->b_blocknr * (bh->b_size >> 9);
generic_make_request(rw, bh);
}
---------------------------------------------------------------------------------------------------
So, submit_bh() just passes things along to generic_make_request(), the routine to send I/O requests to block devices:
---------------------------------------------------------------------------------------------------
generic_make_request (int rw, struct buffer_head *bh) {
request_queue_t *q;
q = blk_get_queue(bh->b_rdev);
q->make_request_fn(q, rw, bh);
}
---------------------------------------------------------------------------------------------------
Thus, it finds the right queue and calls the request function for that queue.
---------------------------------------------------------------------------------------------------
struct blk_dev_struct {
request_queue_t request_queue;
queue_proc *queue;
void *data;
} blk_dev[MAX_BLKDEV];
request_queue_t *blk_get_queue(kdev_t dev)
{
return blk_dev[MAJOR(dev)].queue(dev);
}
---------------------------------------------------------------------------------------------------
In our case (/dev/hda), the blk_dev struct was filled by hwif_init (from drivers/ide/ide-probe.c):
and this ide_get_queue() is found in drivers/ide/ide.c:
---------------------------------------------------------------------------------------------------
blk_dev[hwif->major].data = hwif;
blk_dev[hwif->major].queue = ide_get_queue;
#define DEVICE_NR(dev) (MINOR(dev) >> PARTN_BITS)
request_queue_t *ide_get_queue (kdev_t dev) {
ide_hwif_t *hwif = (ide_hwif_t *) blk_dev[MAJOR(dev)].data;
return &hwif->drives[DEVICE_NR(dev) & 1].queue;
}
---------------------------------------------------------------------------------------------------
This .queue field was filled by ide_init_queue():
And blk_init_queue() (from ll_rw_blk.c again):
---------------------------------------------------------------------------------------------------
ide_init_queue(ide_drive_t *drive) {
request_queue_t *q = &drive->queue;
q->queuedata = HWGROUP(drive);
blk_init_queue(q, do_ide_request);
}
blk_init_queue(request_queue_t *q, request_fn_proc *rfn) {
...
q->request_fn = rfn;
q->make_request_fn = __make_request;
q->merge_requests_fn = ll_merge_requests_fn;
...
}
---------------------------------------------------------------------------------------------------
Aha, so we found the q->make_request_fn. Here it is:
---------------------------------------------------------------------------------------------------
__make_request(request_queue_t *q, int rw, struct buffer_head *bh) {
/* try to merge request with adjacent ones */
...
/* get a struct request and fill it with device, start,length, ... */
...
add_request(q, req, insert_here);
if (!q->plugged)
q->request_fn(q);
}
add_request(request_queue_t *q, struct request *req,
struct list_head *insert_here) {
list_add(&req->queue, insert_here);
}
---------------------------------------------------------------------------------------------------
When the request has been queued, q->request_fn is called. What is that? We can see it above - it is do_ide_request() and lives in ide.c.
---------------------------------------------------------------------------------------------------
do_ide_request(request_queue_t *q) {
ide_do_request(q->queuedata, 0);
}
ide_do_request(ide_hwgroup_t *hwgroup, int masked_irq) {
ide_startstop_t startstop;
while (!hwgroup->busy) {
hwgroup->busy = 1;
drive = choose_drive(hwgroup);
startstop = start_request(drive);
if (startstop == ide_stopped)
hwgroup->busy = 0;
}
}
ide_startstop_t
start_request (ide_drive_t *drive) {
unsigned long block, blockend;
struct request *rq;
rq = blkdev_entry_next_request(&drive->queue.queue_head);
block = rq->sector;
block += drive->part[minor & PARTN_MASK].start_sect;
SELECT_DRIVE(hwif, drive);
return (DRIVER(drive)->do_request(drive, rq, block));
}
---------------------------------------------------------------------------------------------------
So, in the case of a partitioned disk it is only at this very low level that we add in the starting sector of the partition in order to get an absolute sector.
The first actual port access happened already:
---------------------------------------------------------------------------------------------------
#define SELECT_DRIVE(hwif,drive) \
OUT_BYTE((drive)->select.all,
hwif->io_ports[IDE_SELECT_OFFSET]);
---------------------------------------------------------------------------------------------------
but this do_request function must do the rest. For a disk it is defined in ide-disk.c, in the ide_driver_t idedisk_driver, and the function turns out to be do_rw_disk().
---------------------------------------------------------------------------------------------------
ide_startstop_t
do_rw_disk (ide_drive_t *drive, struct request *rq, unsigned long
block) {
if (IDE_CONTROL_REG)
OUT_BYTE(drive->ctl,IDE_CONTROL_REG);
OUT_BYTE(rq->nr_sectors,IDE_NSECTOR_REG);
if (drive->select.b.lba) {
OUT_BYTE(block,IDE_SECTOR_REG);
OUT_BYTE(block>>=8,IDE_LCYL_REG);
OUT_BYTE(block>>=8,IDE_HCYL_REG);
OUT_BYTE(((block>>8)&0x0f)|drive->select.all,IDE_SELECT_REG);
} else {
unsigned int sect,head,cyl,track;
track = block / drive->sect;
sect = block % drive->sect + 1;
OUT_BYTE(sect,IDE_SECTOR_REG);
head = track % drive->head;
cyl = track / drive->head;
OUT_BYTE(cyl,IDE_LCYL_REG);
OUT_BYTE(cyl>>8,IDE_HCYL_REG);
OUT_BYTE(head|drive->select.all,IDE_SELECT_REG);
}
if (rq->cmd == READ) {
ide_set_handler(drive, &read_intr, WAIT_CMD, NULL);
OUT_BYTE(WIN_READ, IDE_COMMAND_REG);
return ide_started;
}
...
}
---------------------------------------------------------------------------------------------------
This fills the remaining control registers of the interface and starts the actual I/O. Now ide_set_handler() sets up read_intr() to be called when we get an interrupt. This calls ide_end_request() when a request is done, which calls
end_that_request_first() (which calls bh->b_end_io() as promised earlier) and end_that_request_last() which calls
blkdev_release_request() which wakes up whoever waited for the block.
编辑者: hyl (07/12/02 13:56)