Assembly Language

The next step up above MachineCode. The folks at IntelCorporation (for example) decide what InstructionSet the computer will understand. This instruction set is usually of the form <register>, <register>, <operation>, <register> or variations using <memory address>. AssemblyLanguage is a shorthand way to represent these machine instructions, for example

  MOV ES, 13h

to move hex 13 to register ES. RealProgrammers program in AssemblyLanguage (well, in the 1960's they did).

They still do - SteveGibson is one well-known example.

It is also necessary to understand AssemblyLanguage and to do some programming in it when doing operating system and compiler work.

It is also frequently necessary to look at the AssemblyLanguage code generated by your compiler to figure out what it is really doing!

Assembly Language: an interpreted language for executing microcode. Real programmers program microcode.

Wrong on two counts: Assembly is seldom if ever "interpreted" (in the computer science sense, it's almost always translated); and many architectures don't use microcode. Did you mean binary opcodes?

Others claim that since you use a program to convert a text file (assembly) to a binary file (MachineCode) then it's really a compiled language. Some processors don't have microcode.

One can argue this is true or false depending on how much intelligence you expect something to have to be a "compiler". In some cases though, an assembler can qualify as a compiler by most any definition. For example, GeOs was developed by GeoworksCorporation? (now belongs to BreadboxComputerCompany) in an ObjectOrientedAssembler of their own making. That assembler was once available for purchase as part of a developer package for GeOs, and might still be available from Breadbox for all I know.

Any machine that uses more than one machine cycle per instruction or has variable size instructions runs microcode. If the machine has an instruction width that equals its data width, it must be running microcode (How else could you direct a 32 bit value to a particular register without using more than 32 bits?). Inside of every processor is a smaller processor that fetches a machine instruction (a read operation), decodes and executes the operation (an execution operation), and determines the address of the next machine instruction (a calculation operation). The process repeats the above cycle over and over; that is microcode.

Actually, you can do multi-cycle instructions without microcode; the Power 4 (IIRC) handles certain instructions, which are too complex to do in one cycle, by splitting them into two single-cycle instructions, thus letting the core be a pure single-cycle machine, and so simplifying its design. Since this splitting is simple and mechanical, and doesn't involve any actual microcode per se. Or something.

And how does it fetch opcodes from memory except if it has microcode maintaining a program counter, dumping the address out on the address bus, reading the instruction back in from the data bus, incrementing the program counter, and possibly reloading the program counter? A JSR (Jump to Subroutine) is actually a complex operation and it is a program written in microcode. Microcode is the core of every currently available microprocessor that I am aware of.

A lot of what you are talking about can be done with state machines. Microcode is only necessary for really complicated instructions, such as FSIN on the x86.

Comments from an old-timer to the youngsters commenting in this section. Try again - you are all wrong on at least one count each - even when you are mostly right. My point is that there are no absolutes. The S/360 series had models that were really NOTHING like the 360 architecture and had microcode options to emulate (as opposed to simulate) other machines. Most notably the 360/30 could "be" a 1401 and the /40 a 1410. At the other end of the line, there was NO microcode in the 360/95, and possibly none in the 65 (sorry, my memory is not perfect). One thing I know for CERTAIN - there was NO microcode of any kind in the Amdahl 470 (the first PCM - Plug-Compatible Mainframe). How do I know that? Because I still have the internal architecture diagrams from when I worked on them. The 470 was a "pure" machine - NO other machine running S/370 architecture was. I could go into a long dissertation as to how a hardwired machine could do things like move data longer than it's bus width, but the reader would have to understand basic digital electronics. If you know how to build an adder and comparator out of NAND gates, you know I'm right. If you don't, you're speaking from a position of ignorance. Same argument applies to the "more than one cycle per instruction" contention - especially when you consider pipelined execution where it becomes obvious that a "single-cycle" instruction (there really is no such thing) takes several discrete steps to execute.

Regarding assemblers vs. compilers, a useful distinction is an assembler (excluding macros) creates ONE machine instruction per source statement and a compiler [usually] creates more than one.

Well, regarding x86, where ES is a segment register, you can't MOV immediate values into segment registers directly, so you'd need to PUSH it and POP ES or use another register. And more likely it would be (e.g.) MOV EAX,13h (Intel style, Windows/DOS assemblers like MASM, TASM, etc.) or mov $0x13,%eax (AT&T style, e.g. GNU as).

Oh yeah, and Intel isn't the only player... far from it. You forget Motorola, ARM, Texas Instruments, SUN, Transmeta, Zilog, etc. etc. And an instruction is more like <opcode> <arguments> (usually 1-3 arguments); opcodes are short instruction names like MOV and PUSH and RST and CALL and RET. An opcode and its arguments (usually) corresponds to a single machine language instruction (which may be multiple bytes).

You are aware that the "bit-fielding" of certain opcodes does in fact produce a machine-code-level syntax where things like <reg>[<reg>]<operation><operand> are the actual result, even though the mnemonics say MOV AX,[BX] or that sort of thing?

More and more people are using FPGAs to implement their own custom InstructionSet

32 bit ASM is easy to program

Check out: http://www.menuetos.org/ - a GUI OS written entirely by hand in assembly.

A statement of particular interest (and the reason I make mention of this project here) in the FAQ is "32bit asm is generally almost as easy to program as C or Pascal."

Anyone care to say how this could possibly be true?

It is true. Just recently, I started writing a program in AssemblyLanguage that is intended to run before the main OS boot loader. I found that writing code in assembly actually led to less mistakes than writing it in C++ (though to be fair, when you make a mistake in assembler it's much more likely to crash the system). And there are several code constructs you can use in assembly that you can't in C++... perhaps some people would call some of them bad coding practises, but it makes things much simpler and it's still easily understandable. Besides which, you can run into trouble with C++ library calls (this routine makes a DOS call, unbeknownst to you. DOS? DOS isn't even loaded yet!) -- GavinLambert

I too would like to say that coding in flat-model, 32bit, x86 assembler is just about as easy as writing C or Pascal. First, much of the mystery and complexity of assembler on the x86 lineage comes from its segmented memory model. I'll never understand entirely why they did this, considering its direct competition (at the time), the MC680xx family from Motorola, used a flat, linear memory model. I think you'll find that this was because they started with an 8-bit processor with 16-bit extras (the 8080, 8085 and Z80) and then grafted on some 32-bit registers. To get enough memory addressing they then used two 16-bit registers and a small amount silicon to build a 28-bit [actually 20-bit] address. Of course, I could be wrong, but I suspect it all stems from trying to modify the original processors and use the previous silicon as much as possible. They were doing the SimplestThing that could possibly work.

At the time ('76, '77, there abouts), it was clear that larger address spaces were needed but you were limited in what you could do: you could attempt to keep both source and object (binary) compatibility, so you can still use existing development systems and programs; keep just source compatibility (so you just have to recompile existing programs); or abandon both and go for a clean slate.

Motorola decided to do the clean slate and ended up with the 68000 series (which internally was a 32-bit architecture; externally it depended upon which CPU was in use). Intel, on the other hand, decided to stick with source compatibility (to a degree - you can mechanically translate 8080 source code to 8086 code) but destroyed binary compatibility. To address more memory, the engineers at Intel went to a segmented architecture that wasn't that bad a method - it's just that programs grew faster than I think the engineers at Intel expected (or they didn't expect that particular architecture to become as popular as it did). The initial address space was 1M (20 bits) - a physical address consisted of a segment register (16 bits), multiplied by 16 (shifted left four bits) then the offset (another 16 bits) added to that. You had four segment registers, once for code, one for data, one for the stack, and an extra one (that in all my programs at least, was set to the data segment).

The assembly language I think is the nicest (if not the largest and one of the most CISCish in nature) is the VAX assembly. Sixteen general purpose registers (of which the program counter is one of them), and a completely orthogonal instruction set makes it nice to program in. It may not be the fastest, but it is nice. -- SeanConner

You clearly haven't programmed in ARM or StrongARM code. 16 general purpose registers (one is PC), completely orthogonal instruction set, every instruction can be made conditional, and whether or not the flags are set is optional on every instruction. Most instructions execute in one clock cycle. Extremely powerful.

However, for our benefit, Intel finally realized that memory segmentation is horrible and provided us with a flat memory model, thus greatly simplifying assembler on these microprocessors. (This happened in the 80386 and in all chips to follow it in the family).

The 80386 (and later) does not have a flat memory model. It has a segmented memory model in the original sense of "segmentation", in which the segments are variable size. Few operating systems use this facility, they just set all the segment registers to point to a flat 4GB virtual space for each process.

It's true that x86 variable sized segments were not all that heavily used, historically (although that may have changed more recently) but it's not true that most OSes do what you said. I think that may have been true of Windows, and Windows only, but not Windows NT, and pre-Windows XP at that. For one thing, "real" operating systems usually wouldn't make it as large as 4GB, it would be only as large as needed, and for another thing, VM paging means that use of multiple segments to tile the user address space are desirable. The typical setup of having stack space grow downward from the opposite end of the address space than does other data also tends to force the use of at least two segments, each growing toward each other either by adding more segments or by actually stretching existing segments.

It might sound counterintuitive, but writing Win32 programs in assembler isn't that complicated, and as Gavin points out above, writing programs in assembler can sometimes be faster and more direct than high level languages. -- JeffPanici

My program was in 16-bit real-mode code. It had to be, it was running before the OS loaded and I didn't want to bother with writing the instructions to get the processor in and out of protected mode. I used register-passing exclusively (and it doesn't make them non-reusable, you just have to be careful to document what the input and output registers are, and whether it preserves or destroys the other registers). And optimisation wasn't an issue. After looking at the output from my C compiler, I can make better ASM code than it can (it's not its fault, it's just that I know what I'm trying to do, whereas the compiler is just mindlessly following its ruleset). But even so, the major time factor in my program was waiting for a vertical sync to flip the backbuffer into video memory. I could have easily upped the framerate if I didn't have to actually display it :P -- GavinLambert

segment registers

One final thing - the segmented addressing modes use 20 bit addresses, not 28 bit. The flat address of a segment:offset pair is (segment shl 4) + offset. The only real advantage that I've encountered when using it is that you can store only a 16-bit pointer (just the segment) if you ensure that the offset of the start of the data is zero. And in 16-bit code it is very slightly cheaper to load the segment from memory and clear the offset register than to load both from memory. Flat-model would have been nicer, though :) -- GavinLambert

If you don't truncate to 20 bits after doing (segment shl 4) + offset, you notice it is actually possible to set bit 20 to binary 1. That way, segment:offset can actually address slightly more than 2^20 (1 megabyte). This is the A20-gate ("Addressline 20-gate") trick exploited by HIMEM.SYS on 286's and up: the hardware can address more than 20 bits, and by enabling this A20-gate software can address slightly more (just under 64 kilobyte extra) without leaving real mode. -- PeterVanDijk

But note that this doesn't work in v86 mode (http://www.codecomments.com/A86_Assembler/message237872.html).

Above link is broken. Regardless, the extra 65520 bytes at linear address 100000h are addressable in either real-address mode or V86 mode (both of which translate segmented addresses to linear addresses in exactly the same way). What is effectively accessed depends on paging and A20 masking though. Note that paging is typically "disabled" on x86 unless protection is enabled, thus (lacking modern vendor-specific virtualisation extensions) it is only applicable to V86 mode.

Paging is done by the MMU, A20 masking generally was a hack in the chipset's memory controller (now that the memory controller has been moved onto the CPU, it may have moved there as well - if it is still supported at all). Disabling the actual A20 line (with the original A20 masking implementation) causes each odd MiB (at linear address 100000h, 300000h, 500000h, etc) to alias the even MiB right below it (0, 200000h, etc) - a typical V86 mode monitor (and software written ca. 1985 or later in general) doesn't want that.

However a V86 mode monitor trivially can emulate the legacy A20 masking behaviour by intercepting attempts to reprogram the chipset's A20 masking facility, then appropriately modifying the V86 task's page tables to toggle a paging wrap effect (aliasing the 16 pages in the extra range to the first 16 pages iff to wrap). To legacy code running in the V86 task it is indistinguishable whether the wrap is caused by physical A20 masking or by paging. Of course a V86 mode monitor may lack such A20 emulation if it isn't deemed necessary.

C++ better at matrix inversion; ASM better at CRCs and BigNums?

Hmm. I'd agree that for some processors and some problems, asm is as easy as C++ - it might even be easier. But it's a lot less portable. Win32 code is a good example because:

it's event driven, so the chunks of code are small
there are many system calls
you don't care about performance
you probably don't intend to port it

Hand-optimizing assembler code is very difficult; compilers are usually much better at it than humans (exceptions include things like DSP code). If you don't believe me, you should attempt to write an NxN matrix inversion routine (in assember) that beats the performance of the MatrixTemplateLibrary (actually, that's pretty-much impossible in C, too).

Some processors are easier than others. If you use a typical purist RISC processor, you'll find that the pipeline is exposed to the assember programmer. Once you start worrying about things like delay slots and manual resource scheduling, you'll find that hand-coded assembler isn't such a good idea.

-- DaveWhipp

Writing in assembler makes some things very much easier because you have access to the processor flags. Computing CRCs one bit at a time requires shifting a bit into the bottom of a register, and then doing an XOR with a magic word if the shift generated a carry. You can't do that in C.

If you really want to hand tune assembly, then you can pass values in registers instead of pushing them onto the stack. This eliminates memory access times and greatly reduces the cost of subroutine calls. The downside is that your subroutines become very tightly coupled and almost non-reuseable. I have used this optimization in the past, but would probably avoid it now and really insist on a faster processor. -- WayneMack

Or a compiler that can put procedure arguments in registers, and that can do interprocedural register targeting. E.g. the Larcency Scheme compiler does this, and even better, you can read all about it at http://www.ccs.neu.edu/home/will/Twobit/ultimate.html. -- StephanHouben

x86 assembly language is better at math than C. In the x86, addition and subtraction update the carry flag (the carry flag is used for borrows in subtraction), and there are "add with carry" and "subtract with borrow" operations which you can use to achieve arbitrary precision. You can also multiply two 32-bit numbers to produce a 64-bit result, and you can divide a 64-bit number by a 32-bit number to get a 32-bit result (producing a divide overflow if the result doesn't fit in 32 bits) and a 32-bit remainder. These are the building blocks you need to produce arbitrary precision math. In C, if you multiply two 32-bit numbers, you get only the 32 least significant bits of the result; the upper half is truncated. If you want a 64-bit result, you have to cast the factors to 64-bit, and then if your compiler is stupid about such things, it will generate code which paranoidly checks the the upper 32-bits of both 64-bit factors, even though in this specific case they should be zero. It is also useful that in assembly language the quotient and the remainder are generated at the same time; in C and C++ you have to write two separate operations, and a stupid enough compiler might actually emit the divide opcode twice.

In floating point, x86 assembly allows you to set the rounding modes. This is a feature that the IeeeSevenFiftyFour floating-point standard requires CPUs to make available, but most languages don't make it available. (Sometimes it is available as a library function.) It makes it possible to calculate upper and lower bounds for a value, and detect whether round-off error is significant in a specific calculation.

But beware, the OS you're using may or may not save and restore your floating point mode on context switches (due to the expense of doing so). If it does not, then your settings may screw up other processes, and their changes to it may screw up your process.

asm is as easy as C++

Assembly language has its place, mainly to do things that higher level languages cannot do. Obviously, everything a higher level language can do, assembly can do. This does not mean that it is easier in assembly. For example:

Pushing parameters onto a stack and pulling them off for function calls. In assembly, one must do this manually for each function call. In C++, this is hidden by the function declaration.
Declaring local variables. In assembly, one must manually allocate stack space for local variables and ensure the size fits the need. In C++, this is hidden by the variable declaration and variables in different scopes are overlaid for efficiency.
Writing complex evaluations. In assembly, each value in an expression must be evaluated separately and a branch decision made on each one. In C++, an evaluation consisting of multiple ANDs and ORs can be written in a single IF statement.
Writing polymorphic methods. Jump tables came from assembly, but managing multiple tables and keeping the offsets within the tables synchronized is tedious and error prone. In C++, this is handled by header declarations.
Writing an exception walkback. In assembly, one can write code to walk up the stack and identify an error handling routine at the appropriate level. In C++, this is handled with a simple Throw/Catch syntax.

Anyone got any short sample programs?

short (useful) sample program appended below at SET9600.COM

http://web.archive.org/web/20020203150017/http://www.webgurru.com/tutorials/assembly/chap6.htm#2

A short sample: RET

Example of reboot: jmp 0xFFFF:0x0000

See also ForthAssistedHandAssembly.

Assembly vs Assembler

My understanding was that the Assembler assembles binary code from AssemblyLanguage. I think the two phrases get interchanged a lot now that few people use AssemblyLanguage. -- BrianMcCallister

The two terms have always been interchanged a lot, not just in recent years, nor is it completely incorrect; it's just a case of metonymy, which is extremely common in natural language.

You are quite correct. An assembler is the "compiler" for AssemblyLanguage. -- GarryHamilton

[When Assembly is compiled, does the Assembler perform any notable optimizations, or is the process just something similar to macro expansion into MachineCode?]

Traditionally, no, but vendors often try to spice up their assembler with extensions to take some of the burden off of the programmer. Borland's Turbo Assembler, for instance, included extensions that automatically handle C-style subroutine declarations, local variables, and certain kinds of loops. These things are not all that commonly used.

There are two types of assemblers: Those that are primarily intended to serve as the final stage of a compiler (assuming the compiler doesn't emit opcodes directly); and those intended for human code-writing. The former has far fewer features than the latter; often containing only that necessary to assemble the compiler's output.

Optimizing assemblers have been implemented from time to time, but this is relatively unusual, since the usual point of writing in assembly is to get 100% control over what is going on, and if a compilation from a higher level language is involved, the optimizer is usually a completely different pass, rather than being combined with the assembler. When RISC processors were new, it became more common for assemblers to do certain minor kinds of optimizations, such as automatically filling branch delay slots, but usually not full fledged peephole optimization.

The first step beyond machine code, where human readable (relatively speaking) symbols are used to generate programs.

Relatively speaking, as in an assembly program is as easy to read as an old BASIC program, even when decompiled from machine code. You just need to know how the processor behaves when it encounters a particular opcode, and handle it.

Hand-coded assembly is easy to write and maintain if you're good about naming the call and jump flags, and put in decent comments. The operations might be lower-level and constrained to certain integer maths, but that's no reason to write something unmaintainable or unreadable.

AssemblyLanguage is one of the few (readable) levels at which the actual processor behaviors are exposed. Consequently, the most complete control of the "currently defined" processor behaviors is available in AssemblyLanguage and those few others that grant direct access to the CPU.

Most high-level languages will not represent the low-level behaviors of the processor - ForthLanguage and derivatives being exceptions.

The "currently defined" behaviors of the processor are the subject of MicroCode, which establishes the behaviors of the processor. Some exotic processors have writable MicroCode, but for the vast majority of cases this will only be a curiosity: for all intents and purposes, the MicroCode can be considered part of the processor.

Programming at any level requires reasoning about the semantic structures at hand. High level languages claim to offer simpler or more appropriate structures, but often fall short of this goal when needs change or one is forced to reason about lower levels anyway. A regular dose of assembly programming will remind us what we should demand of our higher level languages.

Most of the discussion here seems to concern PC Assembly Language(s) - I am wondering if we should have a separate page for MainframeAssemblerLanguage? (esp. IBM S/360/370/390/z90 Assembler Language). I think there is quite a lot to be said about this - in particular it has been going strong for almost 40 years, which has got to be some kind of record for a computer language...

No new page required. The title says AssemblyLanguage; that's generic. Most of the discussion is about x86 assembly because that's what most people are familiar with. If you want to talk about 360 assembly, go right ahead. If the discussion becomes huge, then it would be interesting to calve a new page. Now that pure x86 is pretty much a thing of the past as of 2004, is there going to be a discussion about AMD64 assembly?

So...how 'bout them base registers? Can't beat 'em.

Hey, I think ARM assembly language is pretty cool.

humorous assembly language

fictional opcodes:

BBWBranch Both Ways DWIMDo What I Mean FLIFlash Lights Impressively HCFHalt and Catch Fire HTSHalt and Throw Sparks CMFRM Come From

http://rdrop.com/users/jimka/assembly.html

Hey, I remember doing assembly where BBW was available (threads implemented in microcode, yum!).

BetterAssembly?

PreferredOrderOfSrcDstArguments and ForthAssistedHandAssembly and http://terse.com/ have some interesting ideas on how to make a "better" assembly language (without going all the way to C).

DevelopmentTools

ObjectOrientedAssembler mentions "assembler doesn't give any support for non-procedural methods". Would it be crazy to write a "better" assembly language that provides some support for object-oriented methods?

Anyone got any short sample programs?

Here's a DEBUG script to create simple baud rate setup program.

  C:\>DEBUG
  :  a 100            ; begin assembly code entry
  :  mov al,0         ; zero, for "off"
  :  mov dx,03F9      ; Com1 address + 1 = int enable register
  :  out b al,dx      ; interrupts off
  :  mov al,13        ; DTR+RTS+LBK (loop back mode)
  :  mov dl,FC        ; Com1 address + 4 = modem ctrl register
  :  out b al,dx      ; set loopback mode for Com1
  :  mov al,83        ; 80 = DLAB, 03 = 8-bit mode mask
  :  mov dl,FB        ; Com1 address + 3 = line ctrl register
  :  out b al,dx      ; turn on divisor latch access
  :  mov al,00        ; high-order byte of 000C
  :  mov dl,F9        ; "ie" reg now high-order divisor byte
  :  out b al,dx      ; place first half of divisor value
  :  mov al,0C        ; low-order byte (that's 12)
  :  mov dl,F8        ; "data" reg now low-order divisor byte
  :  out b al,dx      ; place second half of divisor value
  :  mov al,03        ; 8-bit mode mask without DLAB
  :  mov dl,FB        ; Com1 address + 3 = line ctrl register
  :  out b al,dx      ; turn off divisor latch access
  :  nop              ; waste a cycle
  :  mov dl,F8        ; Com1 base address = data register
  :  in al,dx         ; clear the (first) RX data register
  :  nop              ; just killing time
  :  in al,dx         ; clear the (other) RX data register
  :  mov al,03        ; DTR+RTS (without loop back bit)
  :  mov dl,FC        ; Com1 address + 4 = modem ctrl register
  :  out b al,dx      ; clear loopback mode for Com1
  :  mov ax,4C00      ; terminate program, errorlevel=0
  :  int 21           ; call DOS to end ; just press <Enter> next
  :  n set9600.com    ; where to write the program
  :  r cx             ; alter contents of CX register
  :  003A             ; that's 58 for you civilians
  :  w                ; create file and save program
  :  q                ; leave DEBUG
  dir set9600.com

You now have SET9600.com which ... does exactly that.

PreferredOrderOfSrcDstArguments DigitalSignalProcessing ObjectOrientedAssembler WriteAssembler