386 Protected Mode (part 1)
Here's how protected mode changes everything you know about the x86.
By Jack Ganssle
In the few years since Intel released the 386 processor, it has gone from a tremendously overpriced compute engine to the minimum processor for anyone considering purchasing a PC. Proliferating versions (like the 386SX and AMD's variants) have driven the chip cost down while maintaining software compatibility with the rest of the line.
It might seem that those of us in the embedded world could ignore this technology, since so many designs revolve around low performance controllers. Now, however, more and more embedded systems use the 386 series of components. Examples include high speed data communications devices (though in cheap modems the Z80 still reigns supreme), graphics equipment, and ultra-high-speed data acquisition gear. Even the cockpit displays of some modern jetliners use 386s as controllers.
Why? What's so great about the 386 that compels a designer to include a $325 processor in his embedded system? The 386 offers two important features: raw compute horsepower, and the potential for a huge address space.
I recently had the opportunity to design a rather complex embedded system using a 386, and found the experience to be both frustrating and rewarding. Frustrating, because Intel's documentation assumes the reader is completely knowledgeable about protected mode. Rewarding, because the processor's power and complexity are awesome. I ended the project with a great deal of respect for those who mastered this complexity to design the chip way back in the mid-80s.
386 Benefits
Most of us computing with a 386-based PC run the processor in its slowest and least functional mode. Yet, even then we get staggering performance improvements over that for which we lusted a decade ago. Most PC applications run in "real mode", using 8088-like 20 bit addresses and 16 bit registers.
The 386 can and does often act just like a very fast 8088. Its most obvious virtue is its raw speed. With no wait states, machine cycles take only two clocks. At 33 MHz, this is a blazing 61 nsec per cycle. Short instructions (e.g., a register to register move) complete in two cycles, or about 122 nsec. This baby is no slouch at moving data!
There is a sort of hidden price to running so fast, though. How many memory systems can present data so quickly? Inject a single wait state, and the machine's performance declines by a third. Any high performance embedded system will likely need costly cache to properly match memory speeds to the processor's bandwidth.
The 386 has a richer instruction set than its 80x88 cousins. 32 bit multiply/divides, barrel shifters that shift up to 32 bits in 7 cycles, and bit manipulations are all included. All registers are 32 bits, so handling decent sized data is a breeze.
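For flavor, here is a quick, hedged sample of a few of the newer instructions (the register choices are arbitrary):

bts  ebx,5       ; test bit 5 of EBX, then set it
bsf  ecx,edx     ; find the lowest set bit in EDX
shld eax,ebx,12  ; shift EAX left 12 bits, filling from the top of EBX
imul eax,ecx,10  ; 32 bit multiply by an immediate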
Embedded people might be disappointed with its lack of peripherals. 64180/Z180, 8051, 80196, and other embedded parts include timers, serial ports, and the like, all designed to reduce the cost and size of a system. Not so the 386, which is targeted only at high performance, high cost applications. I hope Intel or AMD does eventually come up with versions specifically for embedded markets, including serial and parallel ports. It would seem a sensible use of the vendors' ability to cram ever more functionality onto a piece of silicon. After all, even the RISC folks are now targeting processors specifically towards the embedded marketplace.
Protected vs. Real Modes
If you've worked with the 80x88 family, you are intimately familiar with what 386 documentation calls "Real Mode". Real Mode addresses are limited to 20 bits, and are generated by adding a 16 bit segment register, shifted left four bits, to a 16 bit offset. This much maligned segmentation causes no end of grief for programmers trying to access large data structures. Since an offset cannot exceed 16 bits, you just can't increment beyond 64k; you'll have to watch for a 64k boundary and then play games with the segment register.
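As a refresher, the real mode arithmetic looks like this (the numbers are made up):

; physical address = (segment * 16) + offset
; e.g., segment 1234h, offset 0010h:
;   1234h * 10h  = 12340h
;   12340h + 10h = 12350h, a 20 bit physical address
mov ax,1234h
mov ds,ax
mov al,ds:[0010h]   ; reads physical address 12350h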
The 386's Protected Mode changes everything you ever learned about 80x88 segmentation. Protected mode offers direct access to 32 bit addresses. Though segment registers still play a part in every address calculation, their role is no longer one of directly specifying an address. In protected mode segment registers are pointers to data structures that define segmentation limits and addresses. More on this later.
On a 386 operating in real mode you have access to practically every feature the 386 has to offer - with the exception of 32 bit addressing. Just about all of the new instructions are available. All operands can be 8, 16, or even 32 bits. That's right - real mode programs can easily handle double word long data, using 32 bit registers. On the 386, in real or protected modes, you access operands as follows:
mov al,[1000]    ; load 8 bits
mov ax,[1000]    ; load a word
mov eax,[1000]   ; load a double word
Manipulate data the same way:
add al,cl        ; add two bytes
add eax,ecx      ; add two 32 bit numbers
You can use the 32 bit registers to address memory, but in real mode the effective address may not exceed 20 bits. The 386 will generate an exception if the address is too large.
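A hedged illustration (the values are arbitrary):

mov ebx,0F000h      ; a 32 bit register used as a real mode pointer
mov al,[ebx]        ; fine - the address stays in range
mov ebx,0FFFF0000h
mov al,[ebx]        ; too big - the 386 raises the exception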
Take advantage of the 386's extended instructions (even in real mode), to greatly speed processing:
mul edx          ; 32 x 32 multiply of EAX by EDX
                 ; 64 bit result goes to EDX:EAX
The processor includes extra segment registers. Where an 80x88 CPU only provides ES, DS, SS, and CS, the 386 adds FS and GS, which you can use in real or protected mode.
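In real mode the new registers behave like any other segment register. A hedged sketch (0B800h, the PC's color text buffer, is just a convenient example):

mov ax,0B800h            ; example segment: PC color text memory
mov fs,ax
mov byte ptr fs:[0],'A'  ; write through the FS override
mov gs,ax
mov bl,gs:[2]            ; GS works the same way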
Protected Mode Addressing
Segment registers are called "selectors" when operating in protected mode, to distinguish their operation from that of real mode. For these registers do indeed perform a selection process. In protected mode, segment registers simply point to data structures that contain the information needed to access a location.
Every protected mode program must include a table of "descriptors", which are 8 byte data structures that define the start and end of a segment. Depending on the type of segment, a descriptor may have other information such as access rights and the like. A typical descriptor contains the following information, packed into an 8 byte record:
- Segment start: absolute 32 bit address
- Segment limit: the maximum offset this segment can reference
- Segment status: privilege level, segment present, segment available, segment type, etc.
Thus, the descriptor tells the 386 everything it needs to know about accessing data or code in a segment. Accesses to memory are qualified by the descriptor selected by the current segment register. The upper bits of the selector form a 13 bit index indicating which entry to use in the descriptor table; if the index is 0, the first descriptor is taken, an index of 1 takes the second, etc. The 386 multiplies the index by 8 (8 bytes per entry), and adds this to the base address of the table of descriptors (contained in an internal 386 register loaded by the programmer before switching to protected mode).
For example, a code fetch always uses the current CS. A protected mode fetch starts by multiplying CS's index field by 8 and then adding the descriptor table's base address. The 386 then reads an entire 8 byte record from the descriptor table. The entry describes the start of the segment; the processor adds the current instruction pointer to this start address to form the address of the next instruction.
A data access behaves the same way. A load from location DS:1000 makes the processor read a descriptor by multiplying the index field of DS by 8, adding the table's base address (stored in the 386's on-board descriptor table register), and reading the 8 byte descriptor at this address. The descriptor contains the segment's start address, which is added to the offset in the instruction (in this case 1000). Offsets, and segment start addresses, are 32 bit numbers - it's really easy to reference any location in memory.
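To make the format concrete, here is a hedged sketch of one descriptor built by hand as data; the base and limit are invented, and in practice the linker builds these tables for you (more on that below):

; 8 byte descriptor: data segment, base = 00100000h, limit = 0FFFFh
a_desc  dw 0FFFFh   ; limit, bits 15..0
        dw 0000h    ; base, bits 15..0
        db 10h      ; base, bits 23..16
        db 92h      ; status byte: present, DPL 0, writable data
        db 40h      ; 32 bit (big) segment, byte granular, limit bits 19..16
        db 00h      ; base, bits 31..24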
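A hedged, worked example with invented numbers may help:

; Suppose the descriptor table starts at 00001000h, the index field of DS
; is 2, and descriptor 2 holds a segment base of 00200000h. Then:
;   descriptor address = 00001000h + (2 * 8)      = 00001010h
;   final address      = 00200000h + offset 1000h = 00201000h
mov eax,ds:[1000h]   ; the 386 does all of the above automatically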
Every memory access works through these 8 byte descriptors. If they were stored only in user RAM the 386's throughput would be pathetic, since each memory reference needs the information. Can you imagine waiting for an 8 byte read before every memory access? Actually, the processor caches a descriptor for each selector (one for CS, one for DS, etc.) on-chip, so the segment translation requires no overhead. However, every load of a selector (like MOV DS,AX or POP ES) will make the 386 stop and read all 8 bytes to its internal cache, slowing things down just a bit.
Figure 1 shows how addressing works. The figure ignores Paging, yet another 386 feature that permits extending the address space far beyond 4 Gb.
It's all a little mind boggling. The CPU manipulates these 8 byte data structures automatically, reading, parsing, caching, and working with them as needed, with no programmer intervention (once they are set up).
Not only does the CPU translate addresses as described; in parallel it checks every memory reference to ensure it behaves properly. Remember the "limit" field in the descriptor? If the effective address (base plus offset) is greater than this limit, the 386 aborts the instruction and generates a protection violation exception. It won't let you do something stupid. You can even specify that a segment is read-only; a write will create the same exception.
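The checks cost nothing extra at run time; they're driven by the status byte in each descriptor. A hedged sample of the values involved (the names are invented):

RO_DATA   equ 90h   ; present, DPL 0, read-only data - any write faults
RW_DATA   equ 92h   ; present, DPL 0, read/write data
EXEC_ONLY equ 98h   ; present, DPL 0, execute-only code - any read faults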
But wait a minute! Everyone seems to think that segments aren't used in protected mode! In fact, segmentation is practically essential, and is far more useful than you might think.
On an 80x88 processor you'll frequently write programs divided into more than one named code segment. The linker combines like-named segments together, and then groups the segments into one hunk. In the embedded world, using a Locator (like ones sold by Systems and Software and Paradigm), you can separate named segments into specific RAM or ROM addresses to match the nuances of your particular hardware environment. The 386 takes this one step further.
A 386 linker groups like-named segments together. Then, if you wish, you can assign any group to any descriptor. The selector devotes 13 bits to picking a descriptor, and another bit selects which of two descriptor tables to read from (the Local or Global tables), giving up to 8192 separate segments in each table.
This is a lot of power; most DOS users ignore it. It is ideal for embedded applications. Suppose you have memory mapped I/O: group it into a named segment and assign read/write attributes to it. Even better, separate read and write ports into different segments to ensure your code never accidentally accesses one incorrectly. Make your code fetch-only, so illegal accesses create protection violation errors - debugging will be a lot easier with this enabled.
Some embedded systems include a ROMed version of DOS. DOS runs in real mode only, so use the 386's segmentation to define real and protected segments. The real ones will (sigh) not have the great protection mechanisms. Restrict them to low addresses (under 20 bits), and put the protected mode code up high. The real mode code will not physically be able to generate a high address that might affect the protected mode code.
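As a hedged sketch of the idea (the segment names and contents are invented; the linker command file described below assigns each one its own descriptor, base address, and access rights):

rd_ports  segment use32   ; read-only status ports get their own descriptor
uart_stat db ?
rd_ports  ends

wr_ports  segment use32   ; write-only command ports live in another
uart_cmd  db ?
wr_ports  ends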
Linkers
If we had to define the selectors and descriptors ourselves, protected mode would be just too hard to use. The descriptors are arranged in a nasty, hard to assemble format. Fortunately, Intel and others supply linkers that do all of the hard work for you.
It is a little tedious to actually switch from real to protected mode, but Intel application notes do a pretty good job of describing the procedure. There seems to be surprisingly little written about actually building an application. It turns out that the linker does most of the work of building descriptors.
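For reference, the switch itself boils down to only a few instructions. This is a hedged outline, not a complete recipe; the labels and selector value are placeholders, and Intel's notes cover the details (interrupts, the IDT, and so on):

gdt_ptr dw gdt_limit       ; GDT limit (placeholder)
        dd gdt_base        ; GDT linear base address (placeholder)

        cli                     ; no interrupts during the switch
        lgdt fword ptr gdt_ptr  ; load the descriptor table register
        mov eax,cr0
        or  eax,1               ; set PE, the protection enable bit
        mov cr0,eax
        db 0EAh                 ; hand-coded far jump: flushes the prefetch
        dw offset pm_start      ; queue and loads CS with a protected mode
        dw code_sel             ; selector (placeholder)
pm_start: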
I've been using System & Software's (Irvine, CA) Link & Locate 386 lately, and find that writing protected mode code with it is a breeze. It's really no different from writing real mode code. Break your code into named segments, separating data and code, and segment them further if you wish to restrict access in some fashion. Assemble the code with any decent assembler: Microsoft's MASM and Borland's TASM do just fine. Then, use a linker with a carefully scripted command file to assign descriptors as you wish. Figure 2 shows a script file for Link & Locate 386 for a typical application.
This program consists of just 4 segments. Real_code is real mode code executed occasionally by the program. Cgroup is the bulk of the program. Dgroup is a data area. Flat_seg is a special segment defined so the program can reference any linear address in memory.
Notice how the segments, in many cases, have absolute addresses assigned, defining their start. The Linker puts in ending limits automatically.
Flat_seg is a special case; we've set it to start at 0 and end at the end of memory. This more or less bypasses protection checking, as the segment's definition precludes getting an addressing error. In embedded systems we sometimes need to reach arbitrary addresses to get at specific hardware.
A program operating with this structure will have its code all in segment cgroup, and all data in dgroup. The program will start with code that looks something like:
dgroup  segment use32       ; data segment
data1   dd ?
data2   dd ?
dgroup  ends

cgroup  segment
        assume cs:cgroup, ds:dgroup
        mov ax,dgroup
        mov ds,ax           ; set selector DS to dgroup
        mov eax,data1       ; using DS, reference data1
This looks just like 80x88 code. Now, suppose we want an absolute reference anywhere in memory (say, we have some weird hardware device to read from). Do this:
mov ax,flat_seg
mov es,ax              ; set selector ES to flat_seg
mov esi,<some address>
mov al,es:[esi]        ; read from an absolute address
Since selector ES points to a descriptor that is a flat, 32 bit address space, any number in ESI is a 32 bit offset added to flat_seg's start address of 0.
Avoid writing code that runs in one 32 bit flat segment. Sure, it is the easiest way to generate a big program, but you'll lose the benefits of the 386's protection checking. This is especially deadly with ROMed code - how will you know that the code is not sometimes accidentally writing over the ROM? A ROM write is not in itself a problem, but usually indicates some software flaw that may go undetected.
The code sets up selectors just like real mode 80x88 code sets segment registers. There really is no difference. The linker replaces segment references with pointers to the descriptor table. In the linker command file, we've defined "gdt" (the Global Descriptor Table), and specific entries for each segment. GDT entries 1 to 8 are reserved in this case, but 9 corresponds to dgroup, 10 to cgroup, etc. The linker will build the GDT and insert it into the program.
Conclusion
This discussion continues in part 2.
*********************************************************

segment *segments ( dpl = 0 ),
        real_code( dpl = 0, base = 08000h, usereal ),
        dgroup  ( dpl = 0 ),
        cgroup  ( dpl = 0, base = 200000h ),
        flat_seg( dpl = 0, base = 0, limit = 0ffffffffh );

table   gdt ( location = gdt_start,
              reserve  = (1..8),
              entry    = ( 9: dgroup, 10: cgroup, 11: flat_seg ) );
end;

Figure 2: Typical Linker Command File

*********************************************************