Intel Itanium

The next platform from Intel. Features a VLIW [VeryLargeInstructionWord?] instruction set design. Can process up to 8 instructions at once (if I remember correctly). Programmer controls processor parallelism rather than chip.

For someone who gets excited about Assembly optimization, this chip will be super awesome. For everyone else, the question is cost, how does it perform now, and finally, how well will first generation Itanium programs execute on later generations of Itaniums (apparently that is the major historical flaw in VLIW design).

See: http://developer.intel.com/design/ia-64/devinfo.htm for Itanium Software Developer information.

And http://developer.intel.com/software/opensource/ for an OpenSource IA64 Assembler.

SourceForge has some Itanium boxes on the compile farm for OpenSource programmers to use to port Linux software to Itanium.

On a related note, you may know that while SGI is not going to abandon their IRIX customers anytime soon, they have announced that Linux is going to be their future operating system and Irix will be slowly phased out over the coming years. They have also announced that MIPS would slowly be phased out also, in favor of Intel over the coming years.

So, in taking steps in that direction, SGI has booted Linux/MIPS (a long time port of Linux to SGIs, which till recently was limited to Indys, Indigo2s, and a few Chalenges) on one a 32 processor ccNuma Origin 2000. They've stated that while Linux/MIPS will never be supported, they plan to work on the performance problems now with Linux/MIPS so that they will be able to ship Linux/Itanium in massive ccNuma configurations as soon as the Itaniums are ready. I hear that they already have a few in house versions of the Origin 3k line running off of Itaniums instead of Mips chips.

Some facts about IntelItanium:

Operating Environments
- IA-32 System Environment
- IA-32 Instruction Set - IA-21 PM. RM and VM86 application and operating system. Compatible with Pentium, etc
- IA-64 Instruction Set - Not supported
- IA-64 System Environment
- IA-32 Protected Mode - IA32 Protected Mode apps in IA64 system env
- IA-32 Real Mode - IA-32 Real Mode applications in the IA64 system
- IA-32 Virtual Mode - IA32 Virtual 86 Mode Apps in IA64
- IA-64 Instruction Set - IA64 Apps on IA64 Operating Systems
Explicit Parallelism
- Mechanism for synergy b/w the compiler and processor
- Massive resources to take advantage of instruction level parallelism
- 128 integer and floating point registers, 64 1-bit predicate registers, 8 branch registers
- Support for many execution units and memory ports
Features that enhance instruction level parallelism
- Speculation (which minimizes memory latency impact)
- Prediction (which removes branches)
- Software pipelining of loops with low overhead
- Branch prediction to minimize the cost of branches
Focused enhancements for improved software performance
- Special support for software modularity
- High-performance floating point architecture
- Specific multimedia instructions
Pages are 8KB instead of 4KB on x86
The IA64 expects memory accesses to be aligned on "natural boundaries"
If a data access isn't aligned, an alignment fault is generated
Instructions are 41 bits long
3 instructions are collected into a 16 byte "bundle"
Bundles are always 16 bit aligned
All instructions in a bundle may operate in parallel, but debuggers can still single step
There are no explicit call or return instructions
Many instructions have multiple parts
Top 96 registers in rotating stack

Note that Linux itself apparently already runs on the Itanium. (Haven't tested it myself though, however I will if any of you friendly folks sends me an Itanium box to do the tests on.) Porting software should be easy if it can handle the fact that pointers are now 64 bit. That's the theory, at least.

By the way, larger pages might cause trouble for GarbageCollection techniques that implement WriteBarrier?s using the MMU. I am also wondering whether ByteCode VirtualMachines can take full advantage of the VLIW design; there's a NEXT jump after every bytecode's implementation, and simple bytecodes might not fill the full instruction word.

I don't fully understand the remark "Top 96 registers in rotating stack". Aparently we now have 96 registers (hurray! take that, SPARC),

The Sparc supports a variable number of registers, ranging from 40 to 520 (depending on whether the implementation is lower cost or higher performance), accessible via register windows that are pushed/popped on call/return from functions. There is in general across architectures a difference between externally visible registers (e.g. 8 registers on x86) versus internal registers (recent Pentiums have...I forget...128 internal registers or something). The x86's 8 external registers is known to be definitely fewer than desirable, but ceteris paribus, too many external registers is also a bad thing, since it burns up instruction encoding bits better used for other purposes, and/or makes the word length longer. Therefore high performance processors have long used many, many internal registers, regardless of their number of external registers, and the external are mapped to the internal in any number of clever ways, such as hardware register renaming, register windows, etc.
There's a limit to the useful number of internal registers, too, since the mapping logic gets slower as the number increases, and each doubling in their number approximately means one more level in the address decode/mux/demux, slowing each access to each register, so that it makes sense at some point to limit the number of internal registers and instead start using (or growing) level one and two cache. -- DougMerritt

...but whaddabout the "rotating stack"? Is this some kind of hardware stack a la what dedicated ForthLanguage machines have? (But why rotating? Stacks don't rotate, do they?)

-- StephanHouben

This probably means that the top-of-stack pointer rotates through the 96 registers. When you push an item, the 96-deep stack item (if any) goes into the BitBucket. This is how ChuckMoore's Forth chips work.

Close but not quite. It automatically dumps the overflow into memory. So it amounts to yet another stack caching scheme, somewhat similar to ChuckMoore's, but also somewhat similar to the Sparc's and to others (and also somewhat different in some details)...there have been many stack caching schemes through history.

See for instance:

-- DougMerritt

CategoryHardware