The physical address space is shown as follows:
+------------------+ <- 0xFFFFFFFF (4GB)
| 32-bit |
| memory mapped |
| devices |
| |
/\/\/\/\/\/\/\/\/\/\
/\/\/\/\/\/\/\/\/\/\
| |
| Unused |
| |
+------------------+ <- depends on amount of RAM
| |
| |
| Extended Memory |
| |
| |
+------------------+ <- 0x00100000 (1MB)
| BIOS ROM |
+------------------+ <- 0x000F0000 (960KB)
| 16-bit devices, |
| expansion ROMs |
+------------------+ <- 0x000C0000 (768KB)
| VGA Display |
+------------------+ <- 0x000A0000 (640KB)
| |
| Low Memory |
| |
+------------------+ <- 0x00000000
The exact boot procedure is architecture dependent. For the MIT 6.828 lab, the boot steps are:
- BIOS: The IBM PC starts executing at physical address `0x000ffff0` (in the BIOS ROM) with CS = `0xf000` and IP = `0xfff0`, just 16 bytes before the end of the BIOS ROM. The first instruction is `[f000:fff0] 0xffff0: ljmp $0xf000,$0xe05b`, which jumps to the segmented address CS = `0xf000`, IP = `0xe05b`. The processor is running in real mode, where the physical address is CS*16 + IP. The BIOS sets up an interrupt descriptor table and initializes devices such as the VGA display. After initializing the PCI bus and all the important devices it knows about, the BIOS searches for a bootable device, reads the boot loader from the bootable disk, and transfers control to it.
- bootloader: The disk is divided into 512-byte regions called sectors. The first sector of a disk is the boot sector. The BIOS loads the boot sector into physical addresses `0x7c00` through `0x7dff`, then jumps to `0000:7c00` to pass control to the boot loader. The boot loader does the following (see `boot/boot.S` and `boot/main.c`):
  - Switch the processor from real mode to 32-bit protected mode. Software can then access all the memory above 1MB in the processor's physical address space. Details:
    1. Set up the important data segment registers (DS, ES, SS).
    2. Enable A20. For backwards compatibility with the earliest PCs, addresses higher than 1MB wrap around to zero by default; this must be undone.
    3. Switch from real to protected mode, using a bootstrap GDT and segment translation that makes virtual addresses identical to their physical addresses, so that the effective memory map does not change during the switch. The exact instructions are `lgdt gdtdesc`, followed by setting the protected mode enable flag (`CR0_PE_ON`) in the cr0 register.
    4. Jump to the next instruction, but in a 32-bit code segment. This switches the processor into 32-bit mode.
    5. In the 32-bit code segment, set up the protected-mode data segment registers (DS, ES, FS, GS, SS). Then set up the stack pointer and call the `bootmain` function in `boot/main.c`.
  - Read the kernel from the hard disk by directly accessing the IDE disk registers via the x86's special I/O instructions. Details:
    - Use `readseg` to read the first page off the disk into physical address `0x10000` (64KB, in low physical memory). This page contains the ELF header.
    - Load each program segment into physical memory. The ELF header records the number of program headers (`ELFHDR->e_phnum`) and the offset of the program header table (`ELFHDR->e_phoff`). Each program header contains the load address (`ph->p_pa`), the memory size (`ph->p_memsz`), and the segment's location on disk (`ph->p_offset`). The `.bss` part is zeroed (its size is `ph->p_memsz - ph->p_filesz`). Generally, the kernel is loaded at physical addresses starting from `0x100000` (1MB), just above the BIOS ROM.
    - Call the entry point from the ELF header (`ELFHDR->e_entry`) with `call *0x10018`. This enters the kernel code (`kern/entry.S`). The assembly listing shows the entry label at the link address `0xf010000c`; since paging is not yet enabled, the first kernel instruction actually executes at the corresponding load address `0x0010000c`.
- kernel (`kern/entry.S`): When we first enter `entry`, virtual memory has not been set up yet, so we are running from the physical address where the boot loader loaded the kernel: 1MB (plus a few bytes). However, the C code is linked to run at KERNBASE+1MB. Hence, we set up a trivial page directory that translates virtual addresses `[KERNBASE, KERNBASE+4MB)` to physical addresses `[0, 4MB)`. This 4MB region is sufficient until we set up the real page table in `mem_init`.
  - Load the physical address of `entry_pgdir` into cr3. `entry_pgdir` is a page directory that maps VAs `[0, 4MB)` and `[KERNBASE, KERNBASE+4MB)` to PAs `[0, 4MB)`.
  - Turn on paging: set the `CR0_PE|CR0_PG|CR0_WP` bits of the cr0 register (PE means protected mode enable, PG means paging, WP means write protect). If paging were not turned on, the first code affected would be the jump above KERNBASE (`mov $relocated, %eax; jmp *%eax`).
  - Jump above KERNBASE.
  - Clear the frame pointer register (EBP) and set the stack pointer (`movl $(bootstacktop),%esp`). This is where the stack gets initialized.
  - Run C code (`call i386_init`, in `kern/init.c`).
  - Inside `i386_init()`, perform a list of initializations:
    - `cons_init`: initialize the console, after which we can call `cprintf`
    - `mem_init`: initialize memory management (lab2)
    - `env_init`: initialize user environments (lab3)
    - `trap_init`: initialize traps (lab3)
    - `mp_init`: initialize multiprocessor support (lab4)
    - `lapic_init`: initialize the local APIC (Advanced Programmable Interrupt Controller) (lab4)
    - `pic_init`: initialize the 8259A interrupt controllers, as part of multitasking setup (lab4)
    - `time_init`: initialize time ticks (lab6)
    - `pci_init`: initialize PCI (Peripheral Component Interconnect) for the network driver (lab6)
    - `boot_aps`: start the non-boot CPUs (lab4)
    - `ENV_CREATE(fs_fs, ENV_TYPE_FS)`: start the file system (lab5)
    - `ENV_CREATE(net_ns, ENV_TYPE_NS)`: start the network server environment (lab6)
    - `ENV_CREATE(user_icode, ENV_TYPE_USER)`: start a shell (lab5)
    - `sched_yield`: schedule and run the first user environment (lab4)
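The real-mode address arithmetic from the BIOS step (physical = CS*16 + IP) and the A20 wrap-around can be checked with a few lines of C. This is only a sketch: the helper names `real_mode_phys` and `a20_disabled` are made up for illustration and do not appear in the lab code.

```c
#include <stdint.h>

/* Real-mode translation: physical = segment * 16 + offset.
 * The sum can reach 0x10FFEF, i.e. slightly above 1MB. */
static uint32_t real_mode_phys(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}

/* With the A20 line disabled, address bit 20 is forced to zero,
 * so addresses at or above 1MB wrap around to low memory. */
static uint32_t a20_disabled(uint32_t phys)
{
    return phys & ~(UINT32_C(1) << 20);
}
```

For example, `real_mode_phys(0xf000, 0xfff0)` gives the reset address `0xffff0`, and `ffff:0010` reaches exactly 1MB, which wraps to 0 while A20 is disabled.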
Notes:
The 4 prerequisites for entering protected mode:
- Disable interrupts
- Enable the A20 line
- Load the Global Descriptor Table
- Set the PE (protected mode enable) bit in CR0
The instruction `ljmp $PROT_MODE_CSEG, $protcseg` makes the processor start executing 32-bit code.
The VMA (link address) and LMA (load address) of the kernel image are not the same (`f0100000` and `00100000`). The kernel is linked at a very high virtual address in order to leave the lower part of the processor's virtual address space for user programs. The virtual address `0xf0100000` is mapped to physical address `0x00100000`.
If paging were not turned on, the first code affected would be the jump above KERNBASE (`mov $relocated, %eax; jmp *%eax`).
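The relationship between the link address and the load address is a constant offset of KERNBASE. A minimal sketch, assuming `KERNBASE` is `0xF0000000` as in the memory layout; the helper names `kva_to_pa` and `pa_to_kva` are hypothetical and only mirror the idea behind the kernel's address-conversion macros:

```c
#include <stdint.h>

#define KERNBASE UINT32_C(0xF0000000)   /* assumed, as in the memory layout */

/* Kernel virtual address -> physical address under the KERNBASE mapping. */
static uint32_t kva_to_pa(uint32_t kva) { return kva - KERNBASE; }

/* Physical address -> kernel virtual address under the same mapping. */
static uint32_t pa_to_kva(uint32_t pa) { return pa + KERNBASE; }
```

With the entry page directory only the first 4MB are covered, so the conversion is meaningful for physical addresses below `0x400000` until `mem_init` installs the full mapping.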
Typical question: How is `cprintf` implemented?
- Initialize the console using `cons_init`, which does the following:
  - `cga_init`: CGA (Color Graphics Adapter) initialization
  - `kbd_init`: keyboard initialization
  - `serial_init`: serial port initialization
- The call of `cprintf(const char *fmt, ...)`:
  - `fmt` points to the format string, and `va_list ap` points to the first variadic argument after it. The call `vcprintf(fmt, ap)` is enclosed by `va_start(ap, fmt)` and `va_end(ap)`. Inside, each call to `va_arg(ap, type)` returns the value of the given type and advances `ap` according to the type size.
  - `vcprintf(const char *fmt, va_list ap)` calls `vprintfmt((void*)putch, &cnt, fmt, ap)`. `cprintf` and `vcprintf` are located in `kern/printf.c`, whereas `vprintfmt` is located in `lib/printfmt.c`.
  - `vprintfmt` mainly processes the `fmt` string and the `ap` argument list to decide what to print, and calls `putch` to print each character on the console.
  - `putch` is in `kern/printf.c`. It calls `cputchar(int ch)` (`kern/console.c`), which in turn calls `cons_putc(int c)` (`kern/console.c`).
  - `cons_putc` calls `serial_putc(c)`, `lpt_putc(c)`, and `cga_putc(c)` to print the character on the console.
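The call chain above can be imitated in user space. The following is a simplified sketch, not the JOS source: the `mini_*` functions are hypothetical stand-ins, the formatter handles only `%d`, `%s`, `%c` and `%%`, and `putch` writes into a buffer instead of the console.

```c
#include <stdarg.h>
#include <stddef.h>

/* Output sink standing in for the console: characters land in a buffer. */
static char out_buf[128];
static size_t out_len;

static void putch(int ch, int *cnt)
{
    if (out_len + 1 < sizeof(out_buf)) {
        out_buf[out_len++] = (char)ch;
        out_buf[out_len] = '\0';
    }
    (*cnt)++;
}

/* Print a signed decimal integer one character at a time. */
static void print_int(void (*put)(int, int *), int *cnt, int v)
{
    char digits[12];
    int n = 0;
    unsigned int u = (v < 0) ? 0u - (unsigned int)v : (unsigned int)v;

    if (v < 0)
        put('-', cnt);
    do {
        digits[n++] = (char)('0' + u % 10);
        u /= 10;
    } while (u != 0);
    while (n > 0)
        put(digits[--n], cnt);
}

/* Tiny stand-in for vprintfmt: handles only %d, %s, %c and %%. */
static void mini_vprintfmt(void (*put)(int, int *), int *cnt,
                           const char *fmt, va_list ap)
{
    for (; *fmt != '\0'; fmt++) {
        if (*fmt != '%') {
            put(*fmt, cnt);
            continue;
        }
        fmt++;
        switch (*fmt) {
        case 'd':
            print_int(put, cnt, va_arg(ap, int));
            break;
        case 's':
            for (const char *s = va_arg(ap, const char *); *s != '\0'; s++)
                put(*s, cnt);
            break;
        case 'c':
            put(va_arg(ap, int), cnt);
            break;
        case '%':
            put('%', cnt);
            break;
        case '\0':
            return;          /* lone '%' at end of string: stop */
        default:
            put('%', cnt);   /* unknown conversion: echo it literally */
            put(*fmt, cnt);
            break;
        }
    }
}

static int mini_vcprintf(const char *fmt, va_list ap)
{
    int cnt = 0;
    mini_vprintfmt(putch, &cnt, fmt, ap);
    return cnt;
}

static int mini_cprintf(const char *fmt, ...)
{
    va_list ap;
    int cnt;

    va_start(ap, fmt);
    cnt = mini_vcprintf(fmt, ap);
    va_end(ap);
    return cnt;
}
```

Passing the output function as a callback is what lets the same formatter serve both the kernel console and other sinks; here it simply fills `out_buf`.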
The start address of the stack:
This can be inferred from the code in `kern/entry.S`:
.data
###################################################################
# boot stack
###################################################################
.p2align PGSHIFT # force page alignment
.globl bootstack
bootstack:
.space KSTKSIZE
.globl bootstacktop
bootstacktop:
According to `objdump -h kernel`, `.data` is loaded at `0xf011b000` (this may differ between builds). Since `KSTKSIZE` bytes are reserved for the stack, `bootstacktop` should be `0xf0123000`, and `obj/kern/kernel.asm` confirms this.
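The arithmetic behind that address can be written out explicitly. This sketch assumes `KSTKSIZE` is 8 pages of 4KB, which matches the difference between the two addresses quoted above; `stack_top` is a hypothetical helper.

```c
#include <stdint.h>

#define PGSIZE   4096u
#define KSTKSIZE (8 * PGSIZE)   /* assumed kernel stack size: 8 pages */

/* bootstacktop is simply bootstack plus the reserved stack space. */
static uint32_t stack_top(uint32_t bootstack)
{
    return bootstack + KSTKSIZE;
}
```

Because the x86 stack grows downward, `%esp` is initialized to this top address and the stack occupies the `KSTKSIZE` bytes below it.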
This part goes through the first half of `mem_init` in `kern/pmap.c`.
- `i386_detect_memory`: Finds out how much memory the machine has (the number of pages of base memory and the total number of pages).
- Create the initial page directory `kern_pgdir` using `boot_alloc` and initialize it to zero. `boot_alloc(uint32_t n)` is only used while JOS is setting up its virtual memory system; `page_alloc` is the real allocator. If `n>0`, it allocates enough pages of contiguous physical memory to hold `n` bytes. If `n==0`, it returns the address of the next free page without allocating anything. Its implementation is as follows:
  - Initialize the static pointer `nextfree` on the first call. `nextfree` is initialized to `end` rounded up to `PGSIZE`. `end` is a symbol that points to the end of the kernel's bss segment: it is the first virtual address that the linker did not assign to any kernel code or global variables (so the space after it is used much like a heap).
  - Allocate a chunk large enough to hold `n` bytes, then update `nextfree`.
- Recursively insert the page directory into itself as a page table, to form a virtual page table at virtual address UVPT: `kern_pgdir[PDX(UVPT)] = PADDR(kern_pgdir)|PTE_U|PTE_P;`. For an explanation, see the clever mapping trick. This maps the page tables into the virtual address space, so we can read page table entries directly through these virtual addresses. Otherwise we would have to use `(pte_t *)KADDR(PTE_ADDR(pgdir[PDX(va)])) + PTX(va)` to get the address of a page table entry (note that each `pgdir` entry stores the physical address of a page table). With this mapping, we can access the page table entry for a virtual address `va` using `uvpt[PGNUM(va)]` (for `uvpt[]`, see `inc/memlayout.h`).
- Use `boot_alloc` to allocate an array of `npages` `struct PageInfo`s (`npages` is the number of physical pages detected on the machine) and store it in `pages`. These are all the physical pages we can allocate. The kernel uses this array to keep track of physical pages: for each physical page, there is a corresponding `struct PageInfo` in this array.
- Use `boot_alloc` to allocate the kernel data structure for environments, an array of `NENV` `struct Env`s, and make `envs` point to it (lab3).
- Now that we have allocated the initial kernel data structures, we set up the list of free physical pages by calling `page_init`, which initializes the page structures and the memory free list. Once this function is done, `boot_alloc` is never used again; only the page allocator functions are used. The goal of `page_init` is to mark the physical pages already in use and build `page_free_list` from the rest. `struct PageInfo` has a reference count member `pp_ref` and a link member `pp_link`. `pp_ref` records the usage count (a physical page may be mapped at multiple virtual addresses) and `pp_link` links free pages together. `page_free_list` points to the head of the list of free pages. The procedure for `page_init` is:
  - Mark physical page 0 as in use. This way we can preserve the real-mode IDT and BIOS structures in case we ever need them. (???)
  - Mark the rest of base memory `[PGSIZE, npages_basemem * PGSIZE)` as free, except the physical page at `MPENTRY_PADDR` (lab4 multiprocessor). Recall the physical address space shown at the beginning: base memory spans `[0, 640KB)`, i.e. the address range `[0, 0xA0000)`, which is 160 pages. `MPENTRY_PADDR` is `0x7000`, which is page 7.
  - Mark the IO hole `[640KB, 1024KB)`, i.e. `[0xA0000, 0x100000)`, as used.
  - Mark the physical pages holding the kernel's `.text`, `.data`, `.bss` and the kernel data structures as used. The upper bound can be obtained from `boot_alloc(0)`.
  - Mark the rest as free and add them to the free list.
- From now on, all further memory management goes through the `page_*` functions.
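The two mechanisms described above, the bump allocation in `boot_alloc` and the free/used classification in `page_init`, can be sketched in isolation. This is an illustration under assumptions, not the lab solution: `fake_end` stands in for the linker symbol `end` with an arbitrary value, `bump_alloc` and `page_initially_free` are hypothetical names, and the out-of-memory check is omitted.

```c
#include <stddef.h>
#include <stdint.h>

#define PGSIZE 4096u
#define ROUNDUP(a, n) ((((a) + (n) - 1) / (n)) * (n))

/* Hypothetical stand-in for the linker symbol `end`. */
static uintptr_t fake_end = 0xf0119b40u;
static uintptr_t nextfree;   /* 0 means "not initialized yet" */

/* Bump allocator: hand out page-aligned chunks, never free them. */
static uintptr_t bump_alloc(uint32_t n)
{
    uintptr_t result;

    if (nextfree == 0)
        nextfree = ROUNDUP(fake_end, PGSIZE);
    result = nextfree;
    nextfree = ROUNDUP(nextfree + n, PGSIZE);   /* n == 0 leaves it unchanged */
    return result;
}

/* Page-number classification following the page_init steps.
 * first_free_pn is the page number just past the kernel and its data
 * structures, i.e. what boot_alloc(0) reports, converted to a page number. */
static int page_initially_free(size_t pn, size_t npages_basemem,
                               size_t first_free_pn, size_t mpentry_pn)
{
    if (pn == 0 || pn == mpentry_pn)
        return 0;                       /* page 0 and the MP entry page */
    if (pn < npages_basemem)
        return 1;                       /* rest of base memory */
    return pn >= first_free_pn;         /* IO hole and kernel are in use */
}
```

Calling `bump_alloc(0)` after all boot-time allocations yields the first address not used by the kernel, which is exactly the boundary `page_init` needs.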
Implementation of the `page_*` functions:
- `page_alloc(int alloc_flags)`: Allocates a physical page. If `(alloc_flags & ALLOC_ZERO)`, the entire returned physical page is filled with `'\0'` bytes. The reference count of the page should not be incremented, as the caller must do this if necessary (either explicitly or via `page_insert`). The implementation is simple: grab a node from `page_free_list` and let `page_free_list` point to `page_free_list->pp_link`. When using `memset`, remember to convert the `PageInfo` to a kernel virtual address using `page2kva`.
- `page_free(struct PageInfo *pp)`: Trivial; link the page back into `page_free_list`.
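The free-list handling can be tried out in user space with a small model. This is a simplified sketch: the `memory` array stands in for physical RAM, `page2mem` is a hypothetical replacement for `page2kva`, and `free_list_init` is a made-up helper that puts every page on the list.

```c
#include <stddef.h>
#include <string.h>

#define PGSIZE 4096
#define NPAGES 4
#define ALLOC_ZERO 1

struct PageInfo {
    struct PageInfo *pp_link;   /* next page on the free list */
    unsigned short pp_ref;      /* reference count */
};

static struct PageInfo pages[NPAGES];
static unsigned char memory[NPAGES][PGSIZE];   /* stand-in for physical RAM */
static struct PageInfo *page_free_list;

/* Stand-in for page2kva: the memory backing a PageInfo. */
static void *page2mem(struct PageInfo *pp) { return memory[pp - pages]; }

static void free_list_init(void)
{
    page_free_list = NULL;
    for (size_t i = 0; i < NPAGES; i++) {
        pages[i].pp_ref = 0;
        pages[i].pp_link = page_free_list;
        page_free_list = &pages[i];
    }
}

static struct PageInfo *page_alloc(int alloc_flags)
{
    struct PageInfo *pp = page_free_list;

    if (pp == NULL)
        return NULL;                    /* out of memory */
    page_free_list = pp->pp_link;
    pp->pp_link = NULL;
    if (alloc_flags & ALLOC_ZERO)
        memset(page2mem(pp), 0, PGSIZE);
    return pp;                          /* pp_ref is left at 0 for the caller */
}

static void page_free(struct PageInfo *pp)
{
    pp->pp_link = page_free_list;
    page_free_list = pp;
}
```

Both operations touch only the head of the list, so allocation and freeing are constant time; the most recently freed page is the next one handed out.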
As for segmentation and page translation:
Selector +--------------+ +-----------+
---------->| | | |
| Segmentation | | Paging |
Software | |-------->| |----------> RAM
Offset | Mechanism | | Mechanism |
---------->| | | |
+--------------+ +-----------+
Virtual Linear Physical
In x86 terminology, a virtual address consists of a segment selector and an offset within the segment. A linear address is what you get after segment translation but before page translation. A physical address is what you finally get after both segment and page translation and what ultimately goes out on the hardware bus to your RAM.
A C pointer is the 'offset' component of the virtual address. In `boot/boot.S` the Global Descriptor Table (GDT) is installed to effectively disable segment translation by setting all segment base addresses to 0 and limits to `0xffffffff`. Therefore the 'selector' has no effect and the linear address always equals the offset of the virtual address.
Also, a simple page table has already been installed so that the kernel can run at its link address of `0xf0100000`. It maps only 4MB. Here we are going to map the first 256MB of physical memory at virtual address `0xf0000000` and map a number of other regions of the virtual address space.
Once we're in protected mode, all memory references are interpreted as virtual addresses and translated by the MMU, which means all pointers in C are virtual addresses.
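With two-level paging and 4KB pages, a 32-bit linear address splits into a 10-bit page directory index, a 10-bit page table index, and a 12-bit page offset. A minimal sketch of that split; the lowercase function names are illustrative stand-ins for macros such as `PDX` and `PTX`:

```c
#include <stdint.h>

#define PDXSHIFT 22   /* page directory index: bits 31..22 */
#define PTXSHIFT 12   /* page table index:     bits 21..12 */

static uint32_t pdx(uint32_t la)   { return (la >> PDXSHIFT) & 0x3FF; }
static uint32_t ptx(uint32_t la)   { return (la >> PTXSHIFT) & 0x3FF; }
static uint32_t pgoff(uint32_t la) { return la & 0xFFF; }
```

For the kernel link address `0xf0100000` this gives directory index 960, table index 256 and offset 0, which is why KERNBASE corresponds to PDE 960.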
Page Table Management
- `pgdir_walk(pde_t *pgdir, const void *va, int create)`: Returns a pointer to the page table entry for virtual address `va` in `pgdir`. This requires walking the two-level page table structure. The rationale is to get the page directory entry using `pgdir[PDX(va)]` and check whether a mapping exists using the `PTE_P` flag. If it does not and `create` is true, allocate a new page for the page table. Then get the page table entry address with `(pte_t *)KADDR(PTE_ADDR(*pde)) + PTX(va)`.
- `boot_map_region(pde_t *pgdir, uintptr_t va, size_t size, physaddr_t pa, int perm)`: Map the virtual address range `[va, va+size)` to the physical address range `[pa, pa+size)`. Iterate through all the pages inside this region and use `pgdir_walk` to find each page's page table entry. Set the page table entry to the corresponding physical address and permission bits. Also update the permission bits of the page directory entry.
- `page_lookup(pde_t *pgdir, void *va, pte_t **pte_store)`: Return the page mapped at virtual address `va` (as a `struct PageInfo *`). Implementation: first use `pgdir_walk` to find the corresponding page table entry, then use `pa2page(PTE_ADDR(*pte))` to get the `struct PageInfo *` of the physical page that the entry points to.
- `page_remove(pde_t *pgdir, void *va)`: Unmap the physical page at virtual address `va`. Implementation: first use `page_lookup` to find its `struct PageInfo *`, then invalidate the translation lookaside buffer entry and decrement the reference count (if it reaches zero, the page is freed using `page_free`).
- `page_insert(pde_t *pgdir, struct PageInfo *pp, void *va, int perm)`: Map the physical page `pp` at virtual address `va`. If there is already a page mapped at `va`, it should be `page_remove`d. If necessary, a page table should be allocated on demand and inserted into `pgdir`. `pp->pp_ref` should be incremented if the insertion succeeds. The TLB must be invalidated if a page was formerly present at `va`. Corner case: the same `pp` is re-inserted at the same virtual address in the same `pgdir`. My first attempt tried to distinguish this case explicitly and looked like the following:
physaddr_t pa = page2pa(pp); // physical addr of page
pte_t *p = pgdir_walk(pgdir, va, 0);
// a mapping already exists and its physical address differs
if (p != NULL && (*p & PTE_P) && (PTE_ADDR(*p) != pa)) {
    page_remove(pgdir, va);
    tlb_invalidate(pgdir, va);
}
p = pgdir_walk(pgdir, va, 1);
if (p == NULL) // allocation fails
    return -E_NO_MEM;
if (PTE_ADDR(*p) != pa)
    pp->pp_ref += 1;
*p = pa | perm | PTE_P;
pgdir[PDX(va)] = pgdir[PDX(va)] | perm | PTE_P;
return 0;
Although this passes lab2, it causes errors in later labs, and it is really hard to discover that the problem lies here. The correct solution is elegant:
physaddr_t pa = page2pa(pp); // physical addr of page
pte_t *p = pgdir_walk(pgdir, va, 1);
if (p == NULL)
    return -E_NO_MEM;
pp->pp_ref++;
if ((*p) & PTE_P)
    page_remove(pgdir, va);
*p = pa | perm | PTE_P;
pgdir[PDX(va)] = pgdir[PDX(va)] | perm | PTE_P;
return 0;
The idea is to increment the reference count first; if a mapping already exists, `page_remove` takes care of decrementing the reference count and invalidating the translation lookaside buffer.
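Why the order matters can be seen in a toy model with a single mapping slot. This is not JOS code and does not reproduce the first attempt shown earlier; `struct Page`, `mapped`, and the three functions are hypothetical, and the model only compares "increment, then remove" with "remove, then increment" when the same page is re-inserted at the same address.

```c
#include <stddef.h>

struct Page { int ref; int on_free_list; };

/* One-slot "page table": the page currently mapped at our single va. */
static struct Page *mapped;

static void toy_remove(void)
{
    if (mapped == NULL)
        return;
    if (--mapped->ref == 0)
        mapped->on_free_list = 1;   /* freed when the count reaches zero */
    mapped = NULL;
}

/* Order used by the correct solution: increment first, then remove. */
static void insert_inc_first(struct Page *pp)
{
    pp->ref++;
    if (mapped != NULL)
        toy_remove();
    mapped = pp;
}

/* Naive order: remove the old mapping first, then increment. */
static void insert_remove_first(struct Page *pp)
{
    if (mapped != NULL)
        toy_remove();
    pp->ref++;
    mapped = pp;
}
```

Inserting the same page twice with `insert_inc_first` leaves it with a count of 1 and off the free list. With `insert_remove_first` the count briefly drops to zero, so the page ends up both mapped and on the free list.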
Now back to the second half of `mem_init`: we need to use these `page_*` utilities to set up the kernel's virtual address space. The virtual memory layout is shown below (it should be memorized together with the physical address space):
/*
* Virtual memory map: Permissions
* kernel/user
*
* 4 Gig --------> +------------------------------+ --+ PDE 1023
* | | RW/-- |
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
* : . : |
* : . : |
* : . : 256 MB
* |~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~| RW/-- |
* | | RW/-- |
* | Remapped Physical Memory | RW/-- |
* | | RW/-- |
* KERNBASE, ----> +------------------------------+ 0xf0000000 --+ PDE 960
* KSTACKTOP | CPU0's Kernel Stack | RW/-- KSTKSIZE |
* | - - - - - - - - - - - - - - -| |
* | Invalid Memory (*) | --/-- KSTKGAP |
* +------------------------------+ |
* | CPU1's Kernel Stack | RW/-- KSTKSIZE |
* | - - - - - - - - - - - - - - -| PTSIZE 4MB
* | Invalid Memory (*) | --/-- KSTKGAP |
* +------------------------------+ |
* : . : |
* : . : |
* MMIOLIM ------> +------------------------------+ 0xefc00000 --+ PDE 959
* | Memory-mapped I/O | RW/-- PTSIZE 4MB
* ULIM, MMIOBASE --> +------------------------------+ 0xef800000 PDE 958
* | Cur. Page Table (User R-) | R-/R- PTSIZE 4MB
* UVPT ----> +------------------------------+ 0xef400000 PDE 957
* | RO PAGES | R-/R- PTSIZE 4MB
* UPAGES ----> +------------------------------+ 0xef000000 PDE 956
* | RO ENVS | R-/R- PTSIZE 4MB
* UTOP,UENVS ------> +------------------------------+ 0xeec00000 PDE 955
* UXSTACKTOP -/ | User Exception Stack | RW/RW PGSIZE 4KB
* +------------------------------+ 0xeebff000
* | Empty Memory (*) | --/-- PGSIZE 4KB
* USTACKTOP ---> +------------------------------+ 0xeebfe000
* | Normal User Stack | RW/RW PGSIZE 4KB
* +------------------------------+ 0xeebfd000
* | |
* | |
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* . .
* . .
* . .
* |~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
* | Program Data & Heap |
* UTEXT --------> +------------------------------+ 0x00800000
* PFTEMP -------> | Empty Memory (*) | PTSIZE 4MB
* | |
* UTEMP --------> +------------------------------+ 0x00400000 --+ PDE 1
* | Empty Memory (*) | |
* | - - - - - - - - - - - - - - -| |
* | User STAB Data (optional) | PTSIZE 4MB
* USTABDATA ----> +------------------------------+ 0x00200000 |
* | Empty Memory (*) | |
* 0 ------------> +------------------------------+ --+ PDE 0
*
* (*) Note: The kernel ensures that "Invalid Memory" is *never* mapped.
* "Empty Memory" is normally unmapped, but user programs may map pages
* there if desired. JOS user programs map pages temporarily at UTEMP.
*/
The rest of `mem_init`:
- Map `pages` read-only by the user at linear address `UPAGES`:
boot_map_region(kern_pgdir, UPAGES, pages_size, PADDR(pages), PTE_U | PTE_P);
- Map the `envs` array read-only by the user at linear address `UENVS` (lab3):
boot_map_region(kern_pgdir, UENVS, envs_size, PADDR(envs), PTE_U | PTE_P);
- Use the physical memory that `bootstack` refers to as the kernel stack:
boot_map_region(kern_pgdir, KSTACKTOP-KSTKSIZE, KSTKSIZE, PADDR(bootstack), PTE_W);
- Map all of physical memory at `KERNBASE`, i.e. `[KERNBASE, 2^32)` -> `[0, 2^32-KERNBASE)`:
size_t sz = (1LL<<32) - KERNBASE;
boot_map_region(kern_pgdir, KERNBASE, sz, 0, PTE_W);
- Initialize the SMP-related parts of the memory map with `mem_init_mp` (lab4). It maps the per-CPU stacks in the region `[KSTACKTOP-PTSIZE, KSTACKTOP)`:
// Map per-CPU stacks starting at KSTACKTOP, for up to 'NCPU' CPUs.
//
// For CPU i, use the physical memory that 'percpu_kstacks[i]' refers
// to as its kernel stack. CPU i's kernel stack grows down from virtual
// address kstacktop_i = KSTACKTOP - i * (KSTKSIZE + KSTKGAP), and is
// divided into two pieces, just like the single stack you set up in
// mem_init:
// * [kstacktop_i - KSTKSIZE, kstacktop_i)
// -- backed by physical memory
// * [kstacktop_i - (KSTKSIZE + KSTKGAP), kstacktop_i - KSTKSIZE)
// -- not backed; so if the kernel overflows its stack,
// it will fault rather than overwrite another CPU's stack.
// Known as a "guard page".
// Permissions: kernel RW, user NONE
//
// LAB 4: Your code here:
size_t i;
uintptr_t va, kstacktop_i;
physaddr_t pa;
for (i = 0; i < NCPU; i++) {
    kstacktop_i = KSTACKTOP - i * (KSTKSIZE + KSTKGAP);
    va = kstacktop_i - KSTKSIZE;
    pa = PADDR(&percpu_kstacks[i]);
    boot_map_region(kern_pgdir, va, KSTKSIZE, pa, PTE_W);
}
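The per-CPU stack addresses follow directly from the formula in the comment. The sketch below assumes `KSTACKTOP` is `0xf0000000` (as in the memory map) and that `KSTKSIZE` and `KSTKGAP` are both 8 pages; the three helper functions are hypothetical.

```c
#include <stdint.h>

#define PGSIZE    4096u
#define KSTACKTOP 0xf0000000u
#define KSTKSIZE  (8 * PGSIZE)   /* assumed */
#define KSTKGAP   (8 * PGSIZE)   /* assumed */

/* Top of CPU i's kernel stack (the stack grows down from here). */
static uint32_t kstack_top(unsigned i)
{
    return KSTACKTOP - i * (KSTKSIZE + KSTKGAP);
}

/* Lowest mapped address of CPU i's stack. */
static uint32_t kstack_base(unsigned i)
{
    return kstack_top(i) - KSTKSIZE;
}

/* Lowest address of the unmapped guard region below CPU i's stack. */
static uint32_t kstack_guard_base(unsigned i)
{
    return kstack_base(i) - KSTKGAP;
}
```

Under these assumptions CPU 0's stack is `[0xefff8000, 0xf0000000)`, its guard region is `[0xefff0000, 0xefff8000)`, and CPU 1's stack top coincides with the bottom of CPU 0's guard region.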
- Switch from the minimal entry page directory to the full `kern_pgdir` page table we just created:
lcr3(PADDR(kern_pgdir));
- Configure more flags in cr0
cr0 = rcr0();
// Alignment Mask Numeric Error Monitor co-processor
cr0 |= CR0_PE|CR0_PG|CR0_AM|CR0_WP|CR0_NE|CR0_MP;
cr0 &= ~(CR0_TS|CR0_EM);
lcr0(cr0);
A JOS environment is similar to a process in a modern operating system. The structure describing each environment is as follows:
struct Env {
    struct Trapframe env_tf;    // Saved registers
    struct Env *env_link;       // Next free Env
    envid_t env_id;             // Unique environment identifier
    envid_t env_parent_id;      // env_id of this env's parent
    enum EnvType env_type;      // Indicates special system environments
    unsigned env_status;        // Status of the environment
    uint32_t env_runs;          // Number of times environment has run
    int env_cpunum;             // The CPU that the env is running on
    // Address space
    pde_t *env_pgdir;           // Kernel virtual address of page dir
    // Exception handling
    void *env_pgfault_upcall;   // Page fault upcall entry point
    // Lab 4 IPC
    bool env_ipc_recving;       // Env is blocked receiving
    void *env_ipc_dstva;        // VA at which to map received page
    uint32_t env_ipc_value;     // Data value sent to us
    envid_t env_ipc_from;       // envid of the sender
    int env_ipc_perm;           // Perm of page mapping received
};
Note: JOS's `struct Env` is analogous to `struct proc` in xv6. Both structures hold the environment's user-mode register state in a `Trapframe` structure. In JOS, individual environments do not have their own kernel stacks as processes do in xv6. There can be only one JOS environment active in the kernel at a time, so JOS needs only a single kernel stack. (Open questions: if multiple processes could truly run simultaneously in the kernel on different cores, we would need a kernel stack for each process. When would that stack be needed? Besides, JOS supports multiple processors, so can it still run only one environment in the kernel at a time?)
Creating and Running Environments:
- `env_init`: Mark all environments in `envs` as free, set their `env_id`s to 0, and insert them into the `env_free_list` (purpose: form the `env_free_list`).
- `env_setup_vm(struct Env *e)`: Initialize the kernel virtual memory layout for environment `e`. Allocate a page directory, set `e->env_pgdir` accordingly, and initialize the kernel portion of the new environment's address space. This does not map anything into the user portion. Steps:
  - Allocate a page for the page directory using `page_alloc`.
  - Set `e->env_pgdir`: `e->env_pgdir = (pde_t *)page2kva(p)`.
  - Initialize the page directory, using `kern_pgdir` as a template. The virtual address space of all envs is identical above `UTOP`, except at `UVPT`. `UVPT` should map the env's own page table: `e->env_pgdir[PDX(UVPT)] = PADDR(e->env_pgdir) | PTE_P | PTE_U;`
- `env_alloc(struct Env **newenv_store, envid_t parent_id)`: Allocates and initializes a new environment. Steps:
  1. Grab a `struct Env *` from `env_free_list`: `e=env_free_list`. The allocation is only committed at the end of the function: `env_free_list=e->env_link`.
  2. Allocate and set up the page directory for this environment using `env_setup_vm` (this initializes the kernel virtual memory layout for environment `e`).
  3. Generate an `env_id` for this environment. An environment ID `envid_t` has three parts:
     // +1+---------------21-----------------+--------10--------+
     // |0|          Uniqueifier             |   Environment    |
     // | |                                  |      Index       |
     // +------------------------------------+------------------+
     //                                       \--- ENVX(eid) --/
     `ENVX(eid)` is the environment's index in the `envs[]` array. The uniqueifier distinguishes environments that were created at different times but share the same environment index. `envid_t == 0` stands for the current environment.
  4. Set the basic status variables: `env_parent_id`, `env_type`, `env_status`, `env_runs`.
  5. Clear out all the saved register state, to prevent the register values of a prior environment inhabiting this `Env` structure from "leaking" into this new environment: `memset(&e->env_tf, 0, sizeof(e->env_tf));`
  6. Set up appropriate initial values for the segment registers. `GD_UD` is the user data segment selector in the GDT and `GD_UT` is the user text segment selector. The low 2 bits of each segment register contain the Requestor Privilege Level (RPL); 3 means user mode. When switching privilege levels, the hardware does various checks involving the RPL and the Descriptor Privilege Level (DPL) stored in the descriptors themselves. The Current Privilege Level (CPL) is stored in the lowest 2 bits of the code segment selector (CS). Access to a segment is allowed only if CPL <= DPL and RPL <= DPL. `e->env_tf.tf_ds`, `tf_es`, and `tf_ss` are initialized to `GD_UD|3`, `e->env_tf.tf_cs` to `GD_UT|3`, and `e->env_tf.tf_esp` to `USTACKTOP`. `tf_eip` is initialized later, after loading the binary image.
  7. (lab4) Enable interrupts while in user mode: `tf_eflags|=FL_IF`.
  8. Clear the page fault handler until the user installs one: `env_pgfault_upcall=0`.
  9. Clear the IPC receiving flag: `env_ipc_recving=0`.
  10. Commit the allocation: `env_free_list=e->env_link; *newenv_store=e;`
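The environment ID layout can be made concrete with a small sketch. It assumes a generation step of `1 << 12` (the constant name `ENVGENSHIFT` and its value are assumptions; it only needs to be at least `LOG2NENV`), and `next_env_id` is a hypothetical helper rather than the lab's code.

```c
#include <stdint.h>

typedef int32_t envid_t;

#define LOG2NENV    10
#define NENV        (1 << LOG2NENV)
#define ENVGENSHIFT 12               /* assumed: must be >= LOG2NENV */
#define ENVX(envid) ((envid) & (NENV - 1))

/* Build a fresh id for slot `index`, given the id previously stored there.
 * The upper bits act as the uniqueifier, the low 10 bits as the index. */
static envid_t next_env_id(envid_t old_id, int index)
{
    uint32_t gen = ((uint32_t)old_id + (UINT32_C(1) << ENVGENSHIFT))
                   & ~(uint32_t)(NENV - 1);

    /* Keep the id positive and non-zero: 0 means "current environment"
     * and the top bit of an envid_t must stay clear. */
    if (gen == 0 || (gen >> 31) != 0)
        gen = UINT32_C(1) << ENVGENSHIFT;
    return (envid_t)(gen | (uint32_t)index);
}
```

Reusing slot 5 twice yields `0x1005` and then `0x2005`: the index is unchanged, but a stale ID held by another environment no longer matches.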
Notes:
- RPL (Requested Privilege Level): low 2 bits of each segment register.
- CPL (Current Privilege Level): low 2 bits of the code segment selector.
- DPL (Descriptor Privilege Level): stored in the descriptor.
Accessing a segment requires CPL <= DPL and RPL <= DPL.
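The access rule can be expressed as a one-line predicate. This is a sketch of the check for data segments only; the function names are hypothetical, and `0x20` is used in the usage note as an assumed value for the user data selector.

```c
/* Data-segment access check: both CPL and RPL must be at least as
 * privileged (numerically <=) as the descriptor's DPL. */
static int segment_access_ok(unsigned cpl, unsigned rpl, unsigned dpl)
{
    return cpl <= dpl && rpl <= dpl;
}

/* Selector layout: index in bits 15..3, table indicator in bit 2,
 * RPL in bits 1..0. */
static unsigned selector_rpl(unsigned sel)
{
    return sel & 3u;
}
```

A selector such as `0x20 | 3` carries RPL 3. Even code running at CPL 0 is refused access to a DPL 0 segment through it, since `segment_access_ok(0, 3, 0)` is false.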
https://stackoverflow.com/questions/36617718/difference-between-dpl-and-rpl-in-x86
An application would ordinarily not be able to access the memory in segment X (because CPL > DPL). But depending on how the system call was implemented, an application might be able to invoke the system call with a parameter of an address within segment X. Then, because the system call is privileged, it would be able to write to segment X on behalf of the application. This could introduce a privilege escalation vulnerability into the operating system.
To mitigate this, the official recommendation is that when a privileged routine accepts a segment selector provided by unprivileged code, it should first set the RPL of the segment selector to match that of the unprivileged code. This way, the operating system would not be able to make any accesses to that segment that the unprivileged caller would not already be able to make. This helps enforce the boundary between the operating system and applications.
Currently:
Segment protection was introduced with the 286, before paging existed in the x86 family of processors. Back then, segmentation was the only way to restrict access to kernel memory from a user-mode context. RPL provided a convenient way to enforce this restriction when passing pointers across different privilege levels.
Modern operating systems use paging to restrict access to memory, which removes the need for segmentation. Since we don't need segmentation, we can use a flat memory model, which means segment registers CS, DS, SS, and ES all have a base of zero and extend through the entire address space. In fact, in 64-bit "long mode", a flat memory model is enforced, regardless of the contents of those four segment registers. Segments are still used sometimes (for example, Windows uses FS and GS to point to the Thread Information Block and 0x23 and 0x33 to switch between 32- and 64-bit code, and Linux is similar), but you just don't go passing segments around anymore. So RPL is mostly an unused leftover from older times.
- `region_alloc(struct Env *e, void *va, size_t len)`: Allocate `len` bytes of physical memory for environment `e`, and map it at virtual address `va` in the environment's address space. Steps:
  - Calculate the page-aligned start and end virtual addresses.
  - Loop through the pages of the virtual address range, using `page_alloc` to allocate a physical page and `page_insert` to insert it at the corresponding virtual address.
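The page-alignment step decides how many pages get allocated, and an unaligned `va` can add one more page than `len / PGSIZE` suggests. A minimal sketch with hypothetical helper names:

```c
#include <stddef.h>
#include <stdint.h>

#define PGSIZE 4096u

static uintptr_t round_down(uintptr_t a) { return a - a % PGSIZE; }
static uintptr_t round_up(uintptr_t a)   { return round_down(a + PGSIZE - 1); }

/* Number of pages needed to cover [va, va + len). */
static size_t pages_spanned(uintptr_t va, size_t len)
{
    uintptr_t start = round_down(va);
    uintptr_t end = round_up(va + len);
    return (end - start) / PGSIZE;
}
```

An 8-byte region starting at `0x00800ffc` straddles a page boundary and therefore needs two pages, while a full page starting at an aligned address needs exactly one.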
- `load_icode(struct Env *e, uint8_t *binary)`: Set up the initial program binary, stack, and processor flags for a user process. This function is only called during kernel initialization, before running the first user-mode environment. It loads all loadable segments from the ELF binary image into the environment's user memory, starting at the virtual addresses indicated in the ELF program headers. At the same time it clears the `.bss` section. This is very similar to what the boot loader does, except that the boot loader also needs to read the code from disk. Finally, the function maps one page for the program's initial stack. Steps:
  - Switch to the environment's page directory so that virtual address translation for the user address space takes effect. (First save `cr3` so that it can be restored at the end of the function, because we still need to run in the kernel address space after this function returns.)
  - The beginning of `binary` is the ELF header. Get the number of program headers from the ELF header and loop through them. When a program header is of type `ELF_PROG_LOAD`, allocate physical pages for the virtual address region it specifies, copy the program section to that virtual address, and clear the `.bss` if it exists.
  - Set `env_tf.tf_eip` to the ELF header's `e_entry`.
  - Now we have loaded the `.text`, `.data`, and `.bss` segments of the program into memory, at physically backed virtual addresses (in modern Linux, by contrast, only the virtual addresses are allocated and the virtual-to-physical mapping is created on page fault (????)). For the user portion of the address space, we still need to set up the stack area.
  - Map one page for the program's initial stack at virtual address `USTACKTOP-PGSIZE`:
region_alloc(e, (void *)(USTACKTOP-PGSIZE), PGSIZE);
  - Restore `cr3`.
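The segment-loading loop shared by the boot loader and `load_icode` (copy `p_filesz` bytes, then zero the remaining `p_memsz - p_filesz` bytes) can be exercised on plain byte arrays. This is a sketch under assumptions: `struct ToyProghdr` keeps only the fields the loop needs, destinations are indices into a `mem` buffer rather than real addresses, and page allocation is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reduced program header: just the fields the loader loop uses. */
struct ToyProghdr {
    uint32_t p_type;     /* 1 means loadable */
    uint32_t p_offset;   /* where the segment starts inside the image */
    uint32_t p_va;       /* destination address (an index into `mem` here) */
    uint32_t p_filesz;   /* bytes present in the image */
    uint32_t p_memsz;    /* bytes occupied in memory, >= p_filesz */
};

#define TOY_PROG_LOAD 1u

/* Copy every loadable segment into `mem` and zero its .bss tail.
 * Returns 0 on success, -1 if a header is inconsistent or out of range. */
static int load_segments(uint8_t *mem, size_t mem_size,
                         const uint8_t *image, size_t image_size,
                         const struct ToyProghdr *ph, size_t phnum)
{
    for (size_t i = 0; i < phnum; i++) {
        if (ph[i].p_type != TOY_PROG_LOAD)
            continue;
        if (ph[i].p_filesz > ph[i].p_memsz)
            return -1;
        if (ph[i].p_offset > image_size ||
            ph[i].p_filesz > image_size - ph[i].p_offset)
            return -1;
        if (ph[i].p_va > mem_size ||
            ph[i].p_memsz > mem_size - ph[i].p_va)
            return -1;
        memcpy(mem + ph[i].p_va, image + ph[i].p_offset, ph[i].p_filesz);
        memset(mem + ph[i].p_va + ph[i].p_filesz, 0,
               ph[i].p_memsz - ph[i].p_filesz);
    }
    return 0;
}
```

The bounds checks are an addition for safety in this sketch; they guard against headers that point outside the image or the destination buffer.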
- `env_create(uint8_t *binary, enum EnvType type)`: Allocate a new env with `env_alloc`, load the named ELF binary into it with `load_icode`, and set its `env_type`. This function is only called during kernel initialization, before running the first user-mode environment. The new env's parent ID is set to 0. Steps:
  - Allocate a new env using `env_alloc` (this initializes the kernel virtual address space, sets up its registers, etc.).
  - Call `load_icode` to load the binary into the environment's address space (this also sets up the `eip` register for the environment).
  - Set its `EnvType`.
  - (lab5 fs) If this is the file server (`type == ENV_TYPE_FS`), give it IO privileges.
env_free(struct Env *e): Frees env e and all memory it uses. Steps:
- If freeing the current environment, switch to kern_pgdir using lcr3(PADDR(kern_pgdir)) before freeing the page directory.
- Flush all mapped pages in the user portion of the address space.
  - Loop through all the page directory entries below UTOP.
  - If an entry is mapped (PTE_P is set in the page directory entry):
    - find the physical and virtual address of the page table.
    - unmap all the page table entries in this page table using page_remove.
    - free the page table itself (page_decref).
- Free the page directory (page_decref).
- Return the environment to the free list.
env_destroy(struct Env *e): Frees environment e. If e was the current env, then runs a new environment (and does not return to the caller).
- If e is currently running on another CPU, change its state to ENV_DYING and return. A zombie environment will be freed the next time it traps to the kernel.
- Call env_free to free the env and all memory it uses.
- If the env is the current environment, call sched_yield (lab4; it chooses a user environment to run and runs it).
env_run(struct Env *e): Context switch from curenv to env e. If this is the first call to env_run, curenv is NULL.
- If this is a context switch (a new environment is running):
  - Set the current environment (if any) back to ENV_RUNNABLE if it is ENV_RUNNING.
  - Set curenv to the new environment.
  - Set its status to ENV_RUNNING.
  - Increment its env_runs counter.
  - Use lcr3 to switch to its address space.
- Use env_pop_tf to restore the environment's state from e->env_tf.
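The bookkeeping part of env_run can be sketched in user space as below. This is only an illustration under stated assumptions: struct env, cur, and switch_to are simplified stand-ins for struct Env, curenv, and env_run, and the lcr3 and env_pop_tf steps are left out because they need real hardware state.

```c
#include <stddef.h>

/* Simplified stand-ins for the JOS environment states and struct Env. */
enum env_status { ENV_FREE, ENV_DYING, ENV_RUNNABLE, ENV_RUNNING, ENV_NOT_RUNNABLE };

struct env {
    enum env_status status;
    unsigned runs;          /* plays the role of env_runs */
};

struct env *cur = NULL;     /* plays the role of curenv */

/* Hypothetical helper: the state transitions env_run performs on a switch. */
void
switch_to(struct env *e)
{
    if (cur != e) {
        /* Demote the previous environment only if it was actually running. */
        if (cur != NULL && cur->status == ENV_RUNNING)
            cur->status = ENV_RUNNABLE;
        cur = e;
        e->status = ENV_RUNNING;
        e->runs++;
    }
    /* The real env_run would now load e's page directory and pop e->env_tf. */
}
```

Re-running the environment that is already current leaves its status and run counter untouched, which matches the "if this is a context switch" condition in the steps above.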
env_pop_tf(struct Trapframe *tf): Restores the register values in the Trapframe with the iret instruction. This exits the kernel and starts executing some environment's code. This function does not return.
Interrupt and Exception Handling:
trap_init: called inside function i386_init in kern/init.c, after env_init. It initializes the IDT with the addresses of the corresponding interrupt handlers. Each of the handlers should build a struct Trapframe on the stack and call trap with a pointer to the Trapframe. trap then handles the exception/interrupt or dispatches to a specific handler function.
- Use the SETGATE macro to set up interrupt/trap gate descriptors. Some examples are:
SETGATE(idt[T_DIVIDE], 0, GD_KT, (uint32_t) (&t_divide), 0);
SETGATE(idt[T_DEBUG], 0, GD_KT, (uint32_t) (&t_debug), 0);
SETGATE(idt[T_NMI], 0, GD_KT, (uint32_t) (&t_nmi), 0);
T_* are the trap numbers defined in inc/trap.h, and t_* are functions declared in this scope and defined in kern/trapentry.S, for example:
TRAPHANDLER_NOEC(t_divide, T_DIVIDE)    /* 0: divide error */
TRAPHANDLER_NOEC(t_debug, T_DEBUG)      /* 1: debug exception */
TRAPHANDLER_NOEC(t_nmi, T_NMI)          /* 2: non-maskable interrupt */
//...
TRAPHANDLER(t_dblflt, T_DBLFLT)         /* 8: double fault */
TRAPHANDLER defines a globally-visible function for handling a trap. It pushes a trap number onto the stack, then jumps to _alltraps. It is used for traps where the CPU automatically pushes an error code. TRAPHANDLER_NOEC is used for traps where the CPU doesn't push an error code. It pushes a 0 in place of the error code, so the trap frame has the same format in either case. The definition of _alltraps is:
_alltraps:
    pushl %ds
    pushl %es
    pusha
    movw $GD_KD, %ax
    movw %ax, %ds
    movw $GD_KD, %ax
    movw %ax, %es
    pushl %esp
    call trap
- Initiate per-CPU setup using trap_init_percpu. This initializes and loads the per-CPU TSS (Task State Segment) and IDT.
  - Set up a TSS so that we get the right stack when we trap to the kernel.
size_t i = thiscpu->cpu_id;
thiscpu->cpu_ts.ts_esp0 = KSTACKTOP - i*(KSTKSIZE+KSTKGAP);
thiscpu->cpu_ts.ts_ss0 = GD_KD;
thiscpu->cpu_ts.ts_iomb = sizeof(struct Taskstate); // I/O map base address
  - Initialize the TSS slot of the GDT.
gdt[(GD_TSS0 >> 3)+i] = SEG16(STS_T32A, (uint32_t) (&thiscpu->cpu_ts), sizeof(struct Taskstate) - 1, 0);
gdt[(GD_TSS0 >> 3)+i].sd_s = 0; // 0 for system, 1 for application
  - Load the TSS selector (the bottom 3 bits are left 0).
ltr(GD_TSS0+(i<<3)); // ltr means load task register
  - Load the IDT.
lidt(&idt_pd);
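The per-CPU arithmetic in trap_init_percpu can be checked with a small stand-alone sketch. The constant values below are assumptions based on my recollection of the JOS headers (inc/memlayout.h), not something stated in these notes, and the two helper functions are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Assumed values; check them against your own JOS tree. */
#define PGSIZE     4096u
#define KSTKSIZE   (8 * PGSIZE)   /* size of one kernel stack */
#define KSTKGAP    (8 * PGSIZE)   /* unmapped guard gap below each stack */
#define KSTACKTOP  0xF0000000u
#define GD_TSS0    0x28u          /* selector of CPU 0's TSS slot */

/* Top of the kernel stack for a given CPU (the value stored in ts_esp0). */
uint32_t
percpu_kstacktop(uint32_t cpu_id)
{
    return KSTACKTOP - cpu_id * (KSTKSIZE + KSTKGAP);
}

/* TSS selector for a given CPU: GDT entries are 8 bytes apart, so shift by 3. */
uint16_t
percpu_tss_selector(uint32_t cpu_id)
{
    return (uint16_t)(GD_TSS0 + (cpu_id << 3));
}
```

With these assumed values each CPU's stack top sits 0x10000 bytes below the previous one, and dividing a selector by 8 gives the GDT index used in gdt[(GD_TSS0 >> 3)+i].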
The purpose of having an individual handler function for each exception/interrupt: each handler pushes its own trap number (and a 0 in place of the error code when the CPU does not push one) onto the stack. The code that handles the trap further, such as trap_dispatch, uses this to distinguish the interrupts.
If a user program executes int $14 directly, which corresponds to the kernel's page fault handler, the processor produces interrupt vector 13 (general protection fault) instead, because the gate was installed with descriptor privilege level 0. This provides permission control and isolation: for each interrupt handler we can define whether it may be triggered by a user program, which ensures that user programs cannot interfere with the kernel.
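The privilege check applied to a software interrupt can be sketched as a pure function. This is a simplified model, not processor or JOS code: deliver_soft_int is a hypothetical name, gate_dpl is the last argument given to SETGATE, and cpl is the privilege level of the code executing the int instruction. The check applies only to the int instruction, not to hardware interrupts or processor-generated exceptions.

```c
enum { T_GPFLT = 13 };  /* general protection fault vector */

/*
 * Hypothetical model: which vector is actually delivered when code running
 * at privilege level cpl executes "int $vector" through a gate whose
 * descriptor privilege level is gate_dpl.
 */
int
deliver_soft_int(int vector, int gate_dpl, int cpl)
{
    if (cpl > gate_dpl)
        return T_GPFLT;   /* caller is less privileged than the gate allows */
    return vector;
}
```

For example, a gate installed with DPL 3 (as is typically done for the system call vector) can be reached from user mode, while a DPL 0 gate such as the page fault gate cannot.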
Notes: Exceptions and interrupts are both protected control transfers, which cause the processor to switch from user to kernel mode (CPL=0) without giving the user-mode code any opportunity to interfere with the functioning of the kernel or other environments. In Intel's terminology, an interrupt is a protected control transfer that is caused by an asynchronous event usually external to the processor, such as notification of external device I/O activity. An exception, in contrast, is a protected control transfer caused synchronously by the currently running code, for example due to a divide by zero or an invalid memory access.
In order to ensure that these protected control transfers are actually protected, the processor's interrupt/exception mechanism is designed so that the code currently running when the interrupt or exception occurs does not get to choose arbitrarily where the kernel is entered or how. Instead, the processor ensures that the kernel can be entered only under carefully controlled conditions. On the x86, two mechanisms work together to provide this protection:
- The Interrupt Descriptor Table. The processor ensures that interrupts and exceptions can only cause the kernel to be entered at a few specific, well-defined entry-points determined by the kernel itself, and not by the code running when the interrupt or exception is taken. The x86 allows up to 256 different interrupt or exception entry points into the kernel, each with a different interrupt vector. A vector is a number between 0 and 255. An interrupt's vector is determined by the source of the interrupt: different devices, error conditions, and application requests to the kernel generate interrupts with different vectors. The CPU uses the vector as an index into the processor's interrupt descriptor table (IDT), which the kernel sets up in kernel-private memory, much like the GDT. From the appropriate entry in this table the processor loads EIP and CS (which includes in bits 0-1 the privilege level at which the exception handler is to run).
- The Task State Segment. The processor needs a place to save the old processor state before the interrupt or exception occurred, such as the original values of EIP and CS before the processor invoked the exception handler, so that the exception handler can later restore that old state and resume the interrupted code from where it left off. But this save area for the old processor state must in turn be protected from unprivileged user-mode code; otherwise buggy or malicious user code could compromise the kernel. For this reason, when an x86 processor takes an interrupt or trap that causes a privilege level change from user to kernel mode, it also switches to a stack in the kernel's memory. A structure called the task state segment (TSS) specifies the segment selector and address where this stack lives. The processor pushes (on this new stack) SS, ESP, EFLAGS, CS, EIP, and an optional error code. Then it loads the CS and EIP from the interrupt descriptor, and sets the ESP and SS to refer to the new stack. Although the TSS is large and can potentially serve a variety of purposes, jos only uses it to define the kernel stack that the processor should switch to when it transfers from user to kernel mode. Since 'kernel mode' in jos is privilege level 0 on the x86, the processor uses the ESP0 and SS0 fields of the TSS to define the kernel stack when entering kernel mode.
Types of Exceptions and Interrupts:
All of the synchronous exceptions that the x86 processor can generate internally use interrupt vectors between 0 and 31, and therefore map to IDT entries 0-31. Interrupt vectors greater than 31 are only used by software interrupts, which can be generated by the int instruction, or asynchronous hardware interrupts, caused by external devices when they need attention.
Example:
The control flow when a divide-by-zero exception happens in user mode:
- The processor switches to the stack defined by the SS0 and ESP0 fields of the TSS, which in jos hold the values GD_KD and KSTACKTOP, respectively.
- The processor pushes the exception parameters on the kernel stack, starting at address KSTACKTOP:
+--------------------+ KSTACKTOP
| 0x00000 | old SS   |     " - 4
|      old ESP       |     " - 8
|     old EFLAGS     |     " - 12
| 0x00000 | old CS   |     " - 16
|      old EIP       |     " - 20 <---- ESP
+--------------------+
- Divide error is interrupt vector 0 on the x86, so the processor reads IDT entry 0 and sets CS:EIP to point to the handler function described by the entry.
- The handler function takes control and handles the exception, for example by terminating the user environment.
For certain types of x86 exceptions, in addition to the 'standard' five words above, the processor pushes onto the stack another word containing an error code. The page fault exception, number 14, is an important example.
+--------------------+ KSTACKTOP
| 0x00000 | old SS   |     " - 4
|      old ESP       |     " - 8
|     old EFLAGS     |     " - 12
| 0x00000 | old CS   |     " - 16
|      old EIP       |     " - 20
|     error code     |     " - 24 <---- ESP
+--------------------+
The Trapframe structure:
struct Trapframe {
    struct PushRegs tf_regs;
    uint16_t tf_es;
    uint16_t tf_padding1;
    uint16_t tf_ds;
    uint16_t tf_padding2;
    uint32_t tf_trapno;
    /* below here defined by x86 hardware */
    uint32_t tf_err;
    uintptr_t tf_eip;
    uint16_t tf_cs;
    uint16_t tf_padding3;
    uint32_t tf_eflags;
    /* below here only when crossing rings, such as from user to kernel */
    uintptr_t tf_esp;
    uint16_t tf_ss;
    uint16_t tf_padding4;
} __attribute__((packed));
Nested Exceptions and Interrupts
The processor can take exceptions and interrupts both from kernel and user mode. It is only when entering the kernel from user mode, however, that the x86 processor automatically switches stacks before pushing its old register state onto the stack and invoking the appropriate exception handler through the IDT. If the processor is already in kernel mode when the interrupt or exception occurs (the low 2 bits of the CS register are already zero), then the CPU just pushes more values on the same kernel stack. In this way, the kernel can gracefully handle nested exceptions caused by code within the kernel itself. If the processor is already in kernel mode and takes a nested exception, since it does not need to switch stacks, it does not save the old SS or ESP registers. For exception types that do not push an error code, the kernel stack therefore looks like the following on entry to the exception handler:
+--------------------+ <---- old ESP
|     old EFLAGS     |     " - 4
| 0x00000 | old CS   |     " - 8
|      old EIP       |     " - 12
+--------------------+
There is one important caveat to the processor's nested exception capability. If the processor takes an exception while already in kernel mode, and cannot push its old state onto the kernel stack for any reason such as lack of stack space, then there is nothing the processor can do to recover, so it simply resets itself. Needless to say, the kernel should be designed so that this can't happen.
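The three stack layouts shown above differ only in how many 32-bit words the processor pushes. A small sketch of that arithmetic, where esp_after_trap is a hypothetical helper name and stack_top is the stack pointer the pushes start from (ESP0 from the TSS when coming from user mode, the current ESP when already in the kernel):

```c
#include <stdint.h>

/*
 * Hypothetical helper: value of ESP on entry to the handler.
 *   from_user      - nonzero if the trap crossed from user to kernel mode
 *   has_error_code - nonzero if the exception pushes an error code
 */
uint32_t
esp_after_trap(uint32_t stack_top, int from_user, int has_error_code)
{
    uint32_t words = 3;         /* EFLAGS, CS, EIP are always pushed */

    if (from_user)
        words += 2;             /* old SS and old ESP */
    if (has_error_code)
        words += 1;             /* e.g. page fault */
    return stack_top - 4 * words;
}
```

This reproduces the offsets in the diagrams: 20 bytes for a divide error from user mode, 24 bytes for a page fault from user mode, and 12 bytes for a nested exception without an error code.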
Control flow for IDT:
      IDT                   trapentry.S         trap.c

+----------------+
|   &handler1    |---------> handler1:          trap (struct Trapframe *tf)
|                |             // do stuff      {
|                |             call trap          // handle the exception/interrupt
|                |             // ...           }
+----------------+
|   &handler2    |--------> handler2:
|                |            // do stuff
|                |            call trap
|                |            // ...
+----------------+
       .
       .
       .
+----------------+
|   &handlerX    |--------> handlerX:
|                |             // do stuff
|                |             call trap
|                |             // ...
+----------------+
The assembly code in trapentry.S will call trap in kern/trap.c. Steps are:
- Clear DF (direction flag)
- Halt the CPU if some other CPU has called panic()
- Re-acquire the big kernel lock if we were halted in sched_yield
- Check that interrupts are disabled
- If trapped from user mode
  - Acquire the big kernel lock
  - Garbage collect if current environment is a zombie
  - Copy trap frame (which is currently on the stack) into curenv->env_tf, so that running the environment will restart at the trap point; the trapframe on the stack should be ignored from here on
- Dispatch based on what type of trap occurred using trap_dispatch
- If we made it to this point, then no other environment was scheduled, so we should return to the current environment if doing so makes sense.
if (curenv && curenv->env_status == ENV_RUNNING)
    env_run(curenv);
else
    sched_yield();
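The "trapped from user mode" test relies on the fact that the low 2 bits of the saved CS hold the privilege level of the interrupted code. A minimal sketch, assuming the selector values GD_KT = 0x08 and GD_UT = 0x18 (my recollection of the JOS headers, so treat them as assumptions) and using a hypothetical helper name:

```c
#include <stdint.h>

#define GD_KT 0x08   /* kernel text selector (assumed value) */
#define GD_UT 0x18   /* user text selector (assumed value) */

/* Hypothetical helper: nonzero if the saved CS belongs to ring 3 code. */
int
trapped_from_user(uint16_t saved_cs)
{
    return (saved_cs & 3) == 3;
}
```

User environments run with CS set to GD_UT | 3, so their saved CS has both low bits set, while kernel code runs with CS equal to GD_KT, whose low bits are zero.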
The steps for trap_dispatch:
- Decide what to do based on the trap number from tf->tf_trapno:
  - If T_PGFLT (page fault), call page_fault_handler.
  - If T_BRKPT (breakpoint exception), call monitor to enter the jos kernel monitor.
  - If T_SYSCALL (system call), pass six registers from tf_regs as arguments to syscall (the system call number in eax plus five arguments in edx, ecx, ebx, edi, esi); the return value is stored in tf->tf_regs.reg_eax.
  - If IRQ_OFFSET+IRQ_SPURIOUS, handle the spurious interrupt (lab4).
  - If IRQ_OFFSET+IRQ_TIMER, handle clock interrupts. Only call time_tick on one CPU. Then use lapic_eoi to acknowledge the interrupt and call sched_yield. (lab4 preemptive multi-tasking)
  - If IRQ_OFFSET+IRQ_KBD, call kbd_intr to handle keyboard interrupts (lab5 shell).
  - If IRQ_OFFSET+IRQ_SERIAL, call serial_intr to handle serial interrupts (lab5 shell).
- If none of the above matches, this is an unexpected trap: the user process or the kernel has a bug. Print the trapframe, then panic if the trap came from kernel code, or destroy the current environment otherwise.
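The dispatch decision above is a switch on the trap number. The sketch below models it as a pure classification function so it can run outside the kernel. The numeric constants are assumed to match JOS's inc/trap.h, and enum trap_action and classify_trap are hypothetical names introduced only for this illustration.

```c
#include <stdint.h>

/* Assumed trap and IRQ numbers; verify against inc/trap.h. */
enum {
    T_BRKPT = 3, T_PGFLT = 14, T_SYSCALL = 48,
    IRQ_OFFSET = 32,
    IRQ_TIMER = 0, IRQ_KBD = 1, IRQ_SERIAL = 4, IRQ_SPURIOUS = 7
};

/* Hypothetical labels for the action trap_dispatch would take. */
enum trap_action {
    ACT_PAGE_FAULT, ACT_MONITOR, ACT_SYSCALL, ACT_SPURIOUS,
    ACT_TIMER, ACT_KEYBOARD, ACT_SERIAL, ACT_UNEXPECTED
};

enum trap_action
classify_trap(uint32_t trapno)
{
    switch (trapno) {
    case T_PGFLT:                   return ACT_PAGE_FAULT;
    case T_BRKPT:                   return ACT_MONITOR;
    case T_SYSCALL:                 return ACT_SYSCALL;
    case IRQ_OFFSET + IRQ_SPURIOUS: return ACT_SPURIOUS;
    case IRQ_OFFSET + IRQ_TIMER:    return ACT_TIMER;
    case IRQ_OFFSET + IRQ_KBD:      return ACT_KEYBOARD;
    case IRQ_OFFSET + IRQ_SERIAL:   return ACT_SERIAL;
    default:                        return ACT_UNEXPECTED;
    }
}
```

In the kernel each case would call the corresponding handler directly instead of returning a label; the default case is the "unexpected trap" path.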
For page fault handler page_fault_handler:
- Read processor's CR2 register to find the faulting address:
fault_va = rcr2();
- Handle kernel-mode page faults: should panic:
panic("page fault in kernel mode\n")
- Reaching this point means it is a user-mode page fault.
  - Call the environment's page fault upcall, if it exists. Details:
    - Set up a page fault stack frame on the user exception stack (below UXSTACKTOP), then branch to curenv->env_pgfault_upcall. struct UTrapframe looks like:
struct UTrapframe {
    /* information about the fault */
    uint32_t utf_fault_va;  /* va for T_PGFLT, 0 otherwise */
    uint32_t utf_err;
    /* trap-time return state */
    struct PushRegs utf_regs;
    uintptr_t utf_eip;
    uint32_t utf_eflags;
    /* the trap-time stack to return to */
    uintptr_t utf_esp;
} __attribute__((packed));
      1. If this is the first page fault (as opposed to a nested page fault), just save the old esp and set the new esp to the user exception stack. If the page fault occurred while handling another page fault (i.e. esp already points into the user exception stack), then an extra word must be left between the current top of the exception stack and the new stack frame, because the exception stack is the trap-time stack.
      2. Save fault_va, tf_regs, tf_eflags, tf_eip, tf_err to the user trap frame.
      3. Set tf->tf_eip to curenv->env_pgfault_upcall. Rerun the current environment; it will then start running from the page fault handler registered earlier.
struct UTrapframe *p;
if (tf->tf_esp >= UXSTACKTOP-PGSIZE && tf->tf_esp < UXSTACKTOP) {
    // esp is already on the user exception stack
    user_mem_assert(curenv, (void *)(tf->tf_esp-4-sizeof(struct UTrapframe)), sizeof(struct UTrapframe)+4, PTE_W);
    tf->tf_esp -= 4; // push 4 bytes
    *((int32_t *)tf->tf_esp) = 0;
    tf->tf_esp -= sizeof(struct UTrapframe);
    p = (struct UTrapframe *)tf->tf_esp;
    p->utf_esp = tf->tf_esp+sizeof(struct UTrapframe)+4;
} else {
    p = (struct UTrapframe *)(UXSTACKTOP-sizeof(struct UTrapframe));
    user_mem_assert(curenv, (void *)p, sizeof(struct UTrapframe), PTE_W);
    p->utf_esp = tf->tf_esp;
    tf->tf_esp = (uintptr_t)p;
}
p->utf_fault_va = fault_va;
p->utf_regs = tf->tf_regs;
p->utf_eflags = tf->tf_eflags;
p->utf_eip = tf->tf_eip;
p->utf_err = tf->tf_err;
tf->tf_eip = (uintptr_t)(curenv->env_pgfault_upcall);
env_run(curenv);
  - If the upcall doesn't exist, destroy the environment.
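Where the new UTrapframe lands depends only on the trap-time esp. The sketch below isolates that calculation. UXSTACKTOP = 0xEEC00000 and the 52-byte frame size (two words, eight pushed registers, then eip, eflags, esp) are assumptions based on the usual JOS layout, and utf_address is a hypothetical helper name.

```c
#include <stdint.h>

/* Assumed values; verify against inc/memlayout.h and inc/trap.h. */
#define PGSIZE      4096u
#define UXSTACKTOP  0xEEC00000u
#define UTF_SIZE    52u          /* assumed sizeof(struct UTrapframe) */

/* Hypothetical helper: address at which the kernel builds the UTrapframe. */
uint32_t
utf_address(uint32_t trap_esp)
{
    if (trap_esp >= UXSTACKTOP - PGSIZE && trap_esp < UXSTACKTOP) {
        /* Nested fault: leave one scratch word, then the frame below it. */
        return trap_esp - 4 - UTF_SIZE;
    }
    /* First fault: the frame sits at the very top of the exception stack. */
    return UXSTACKTOP - UTF_SIZE;
}
```

For a first fault the result is independent of the trap-time esp, while for a nested fault each additional frame moves down by the frame size plus the 4-byte scratch word.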
For system calls T_SYSCALL:
case T_SYSCALL:
    saved_syscall_num = tf->tf_regs.reg_eax;
    // cprintf("before syscall num: %d\n", saved_syscall_num);
    r = syscall(tf->tf_regs.reg_eax,
                tf->tf_regs.reg_edx,
                tf->tf_regs.reg_ecx,
                tf->tf_regs.reg_ebx,
                tf->tf_regs.reg_edi,
                tf->tf_regs.reg_esi);
    // cprintf("syscall num: %d, errno: %d\n", saved_syscall_num, r);
    // if (r < 0)
    //     panic("syscall: %e\n", r);
    // curenv->env_tf.tf_regs.reg_eax = r;
    tf->tf_regs.reg_eax = r;
    return;
Inside the syscall function in kern/syscall.c, the kernel dispatches the system call according to its system call number:
int32_t
syscall(uint32_t syscallno, uint32_t a1, uint32_t a2, uint32_t a3, uint32_t a4, uint32_t a5)
{
    // Call the function corresponding to the 'syscallno' parameter.
    // Return any appropriate return value.
    // LAB 3: Your code here.
    // panic("syscall not implemented");
    switch (syscallno) {
    case SYS_cputs:
    //...
The user-side interface for system calls is in lib/syscall.c:
void
sys_cputs(const char *s, size_t len)
{
    syscall(SYS_cputs, 0, (uint32_t)s, len, 0, 0, 0);
}
// ...
static inline int32_t
syscall(int num, int check, uint32_t a1, uint32_t a2, uint32_t a3, uint32_t a4, uint32_t a5)
{
    int32_t ret;
    // Generic system call: pass system call number in AX,
    // up to five parameters in DX, CX, BX, DI, SI.
    // Interrupt kernel with T_SYSCALL.
    //
    // The "volatile" tells the assembler not to optimize
    // this instruction away just because we don't use the
    // return value.
    //
    // The last clause tells the assembler that this can
    // potentially change the condition codes and arbitrary
    // memory locations.
    asm volatile("int %1\n"
                 : "=a" (ret)
                 : "i" (T_SYSCALL),
                   "a" (num),
                   "d" (a1),
                   "c" (a2),
                   "b" (a3),
                   "D" (a4),
                   "S" (a5)
                 : "cc", "memory");
    if(check && ret > 0)
        panic("syscall %d returned %d (> 0)", num, ret);
    return ret;
}
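To tie the two sides together, here is a user-space sketch of how the kernel-side switch could map a system call number to a handler and reject unknown numbers. It is an illustration under stated assumptions rather than the lab solution: the stub_* functions and the bytes_printed counter are hypothetical stand-ins for the real sys_* handlers, only two system calls are modeled, and E_INVAL = 3 is assumed to match the JOS error header.

```c
#include <stdint.h>

/* Subset of the system call numbers; the ordering is assumed. */
enum { SYS_cputs = 0, SYS_cgetc, SYS_getenvid };
enum { E_INVAL = 3 };            /* assumed error code value */

uint32_t bytes_printed;          /* stands in for console output */

/* Hypothetical stub: the real handler would check and print the string. */
void
stub_cputs(uint32_t s, uint32_t len)
{
    (void)s;
    bytes_printed += len;
}

/* Hypothetical stub: the real handler returns curenv->env_id. */
int32_t
stub_getenvid(void)
{
    return 0x1000;
}

int32_t
syscall_dispatch(uint32_t syscallno, uint32_t a1, uint32_t a2,
                 uint32_t a3, uint32_t a4, uint32_t a5)
{
    (void)a3; (void)a4; (void)a5;   /* unused by the calls modeled here */

    switch (syscallno) {
    case SYS_cputs:
        stub_cputs(a1, a2);
        return 0;
    case SYS_getenvid:
        return stub_getenvid();
    default:
        return -E_INVAL;            /* unknown system call number */
    }
}
```

The value returned here is what trap_dispatch stores in tf->tf_regs.reg_eax, and therefore what the user-side syscall wrapper receives in ret after the int instruction returns.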