Duffs Device Discussion

Refactoring of this page is in progress to remove heat and retain light shed on the subject, expected to take multiple days. In the interim, please add new comments at the end of page to make it easier to see them and incorporate them in the final page.

I began the refactoring; especially strong disagreements about the contents or refactoring perhaps should sometimes be directed to MichaelSparks as a somewhat more neutral (but involved) party if mediation is needed. -- Doug

DuffsDevice seems to be controversial and not easily understood.

Please note that Duff explicitly said he didn't mean it to be an alternative to memcpy; DuffsDevice is about device I/O under certain rare circumstances, and about something surprising about C: a construct that most would think at first is illegal, but is not.

After various arguments amongst many, this high level summary was offered:

Experience says:

First rule of optimization applies.
Second rule of optimization applies
You all know this
Long arguments degenerate into lots of arguing about points that were and weren't made and lose the plot completely.

Advice is:

Don't use DuffsDevice.
(For experts only) Don't use DuffsDevice unless you can prove that it's necessary.
- Costin summarizes that there is a tradeoff between obfuscated but portable C versus clear but non-portable assembler.
Take a deep breath, and
- agree that the argument above is going and will go nowhere,
- assume that any insult and slights were unintentional and responses to real or imagined first attacks
- agree that the vast majority of people in this debate know more than the average programmer
- agree that
  - DuffsDevice might be used in specific, special circumstances,
  - Duff himself may have found such
  - this discussion is taking time from your lives that could be better spent elsewhere.

Agreed. -- MichaelSparks

Agreed. -- DougMerritt

[not agreed CostinCozianu]

[ Agreed with meta-discussion puzzlement ] JamesMcCartney

Some clarification:

The summary says it's a waste of time to split too many hairs on this, and that our various positions can be summarized in a way that is acceptable to all of us, and that to say something unobjectionable to everyone, the summary must be fairly abstract, and not get into the fine details where there is more disagreement, but also more heat than light.

Another attempted summary that is more detailed, and hence more controversial (and less complete), than the summary above:

Some advise against using this device because
- The case labels constrain an optimizing compiler's ability to reorder instructions within the loop, so you lose out on a lot of potential performance gain.
- Modern OptimizingCompiler?s (especially for HighPerformanceFortran?) can automatically reorder, fuse and unroll loops (with linear or even affine bounds) and even consider cache and register-file sizes in doing so. Using such special constructs prevent these algorithms to be (fully) used.
- it is even less readable than the corresponding assembly code (see also KillYourDarlings).
- Assembly should be used if such a specialized (cycle-counting) optimization is necessary.
- In today's code, IO would be the bottleneck anyway.
Some argue that that critique is (in general) not valid because Duffs device is special:
- The code can't be reordered
- "*to" is not incremented and it's volatile.
- Separating the block copies from the remainder copy will be slower
- Duff had reasons to count cycles (and said so).
- Duffs device is portable.
- This is not a typical situation.
- DuffsDevice must be seen as a historical piece of ingenious hacking.

Further summaries and references from the discussion below:

In fact, browsing the web for Duff's device in source code, I find that all instances of it I can find in actual source are memory to memory (or register to memory).

Pre-summary discussion, edited somewhat:

For uses similar to memcpy, I would advise against using this device. The case labels constrain an optimizing compiler's ability to reorder instructions within the loop, so you lose out on a lot of potential performance gain. I think it is always better to write an unrolled loop plus separate code to deal with the remainder. -- JamesMcCartney.

I second that. [...] -- .gz

I third it, but I want to point out that it was invented around 1984, long before garden variety compilers were able to perform the optimizations you mention. You would be crazy to type out a DuffsDevice today, but it served its purpose well at the time. -- MichaelSparks

Duff already pretty much said as much, actually.

Some things to consider (that are not arguing that it should be used instead of memcpy): The code can't be reordered, and "*to" is not incremented - it's volatile. And even so minor a thing as separating the block copies from the remainder copy will be slower, and it is quite clear he had reasons to count cycles. This is not a typical situation.

Also, as Duff indicated, if the algorithm is as fast as possible, and the code isn't fast enough, you don't give up, you do whatever is necessary. DuffsDevice is a (strange) example of that.'

Worth noting (without intending to contradict the usual rules of optimization): compilers are better at low level optimizations than the average programmer, they are not better than human experts at optimizing low-level code; an expert can write better assembly code than a compiler can, contrary to modern myth, there's just usually no reason to go to so much trouble.

If it turns out that the performance isn't really up to what your application requires, then I would suggest going to assembler for the tight loops. There shouldn't be a practical portability issue - since you have hard performance constraints, you must already know the exact hardware that is in use. Doing clever things like DuffsDevice is bad, even if it does improve performance. A maintenance programmer is much more likely to understand a quick loop in assembly than DuffsDevice.

Perhaps, but going to assembler is somewhat extreme, too.

BTW it wasn't declared "volatile" because that keyword didn't exist until later.

Very few folks need to poke data into a PIO register.

True. And the others perhaps should never consider DuffsDevice.

Most programmers who use Duff's device use it inappropriately, for memory-to-memory operations where something like memcpy would be more appropriate.

DuffsDevice illustrates a point about the C language which surprises many people, so it is an interesting curiosity for that purpose. It also has been proven to be the fastest solution in C for IO in some cases; Duff himself explicitly said it wasn't about memory-to-memory, but too few readers of DuffsDevice notice.

The memcpy that comes with any given platform is almost certain to be much faster than any handmade solution, because they are usually written in assembler by experts on the cpu in question. I've done so myself (in compiler work), and it is not a trivial exercise to get maximum speed; the platform's memcpy is almost invariably faster than any hand-rolled solution.

It needs to be made explicit the context in which this is useful. Switch cases inside a loop will not speed up all loops that have to deal with a remainder. I didn't see that pointed out explicitly anywhere on this page. The definition given above "a special case of unrolling a tight loop, by exploiting the FallThru? of the C case statement, interlacing it with a C do...while statement" implies that it is a useful means to unroll any tight loop. It is not. It can slow down a loop, and I have witnessed code where this mistake was made. -- JamesMcCartney

True.

Duff's solution to a specific problem resulting from a specific condition in a specific environment is certainly not an attempt to suggest that "since I climbed across the chasm using chopped sapplings bound with vines therefore all bridges should be sappling & vine constructs."

I have a small collection of bizarre-looking code fragments that have solved ... specific problems in specific environments. The "switch statement that really isn't" is a device to permit a sequence of events all of which must succeed without the usual horrific nested-if side-on pyramid.

I have assembly chunks that "know" where to jump into other assembly chunks to save a few TeeStates. The likelihood that anyone else will ever need to save those particular TeeStates is very low. But it was the right answer at the time.

I have some hairy-looking union-and-structure combos that serve exactly one purpose in one environment and which have defied every attempt to define them differently. Ugly, but right for the circumstance.

Sometimes you have to shave with a hunting knife or pick your teeth with a twig. And sometimes that's the only way you can get the job done. -- GarryHamilton

In fact, browsing the web for Duff's device in source code, I find that all instances of it I can find in actual source are memory to memory (or register to memory).

So I think there is a need for caveats here. -- JamesMcCartney

Well, as Doug mentioned, you can always find plenty of people willing to misinterpret, misunderstand, and misapply a given technique. For those not-so-rare individuals, the only caveat is likely to be "HeresYourSign" - some way of warning the people who must work with them that they are prone to do this. Some will be because of FadAddiction?, others will be FailureToReadTheDocs?, and some are just (dare I say it) stupid on the subject. I forgot to mention HeroWorship? (or is that IconWorship?). -- gh

Strongly agree with all of Garry's comments.

Additionally, in regard to "It needs to be made explicit the context in which this is useful. Switch cases inside a loop will not speed up all loops that have to deal with a remainder" - I find this highly annoying - how many times do I have to say that Duff did address this above? And yet still you don't bother to go back and read a bit more carefully? Why not?? Here are two key things Duff said above:

People who have posted saying `just use memcpy' have missed the point, as have those who have criticized it using various machine-dependent memcpy implementations as support. In fact, the example in the message is not implementable as memcpy, nor is any computer likely to have an memcpy-like idiom that implements it.
- (I.e. because it does a block move to a single location. -- Doug)
Don't try to be smarter than an over-clever C compiler that recognizes loops that implement block move or block clear and compiles them into machine idioms.

Sure looks like clear caveats to me!

The reasons for not using assembler unless you absolutely have to are so extremely well known that I should not have to argue it. The fact that you dislike a C construct is insufficient reason to drop into assembler. If nothing else, you didn't address my argument about portability, yet that easily trumps the question of your aesthetic tastes in C constructs.

I have not argued much about the two-loops versus one issue, but just to clarify: at the time, two loops would not have fitted in the instruction cache on the cpu Duff was using, so the single loop was a clear speed win. Instruction caches are bigger today, but if this sort of thing comes up, the size of the switch statement could potentially be arbitrarily large, again overflowing your I cache, no matter how big.

I tell you why *I* haven't needed to use Duff's device over the years: simply because I've been lucky enough to use hardware that has nice smart DMA controllers or BitBlt? hardware or whatever, so that I didn't have to code a solution at all, I just told the hardware what to do. This is also why you see so few actual uses of Duff's device on the web. It's hardly ever needed (and when it is used, it will tend to be in a device driver, 99.9999% of which are proprietary, and not to be found via web searching). -- Doug

And, why must you refuse to accept that many people can and have missed the reasons why this device should not be used in most code? I can find plenty of examples on the net of this device being used in code where it should not be. I can't find a single example on the net other than copies of the original historic email of an appropriate usage of this device. These pages are educational to many people so I think that pointing out the limitations is important. And I don't think that the original text spells out clearly enough the potential problems of using this in places other than the original context. I think my original comment stands and is not "completely wrong". I've no care about the aesthetics one way or another, and I think it is pretty nifty myself. But a lot of folks may not normally think about what case labels do to optimizers so I thought that should be pointed out. I think we're getting nowhere here, though, so I'm giving up. -- JamesMcCartney

For those whose native tongue is English yet whose command of English is inadequate to understand Duff's cautions about Duff's own code and who, in spite of their inability to comprehend the niceties of their trade, insist on immortalizing quaint and unique if inappropriate code fragments in their work for whatever reason, one can only shake one's head and mutter "HeresYourSign". -- gh

Once again, I agree with Garry.

It's not true that we're getting nowhere. I feel like I have accomplished getting the current audience to understand that most of the original comments were already covered by Duff or were simply offtopic, regardless of whether you want to put up triply redundant red warning flags.

The remaining "issue" is not an issue: I agree that "this device should not be used in most code" - and no-one ever said anything differently. In fact, the reason why you find so few uses of it on the net, and that most of them are inappropriate, is that most people are reasonable enough to understand that it should almost never be used, so they don't, so you should be encouraged by the fact that you don't see examples of them using it. :-) That sounds weird, but seriously, you can't google on people not doing something; obviously you're only going to find the cases where they did do something. -- Doug

I will make one concession, about portability. I thought the argument about portability was a red herring, but you have convinced me otherwise. My first thought was that since there are hard performance constraints, the exact hardware setup must already be known. On second thought, there could be a range of processors, and only the slowest one needs special tricks to perform up to spec. In that case, C code may be a good compromise. However, since I am uncompromising, I would probably implement the routine in assembly and create a special build for that processor. But I can understand the value in having just one implementation.

Cool. I think it's very reasonable to begin with one portable implementation, and then add additional non-portable optimizations if and when it makes sense.

I also understand that Duff was correct to make the conclusion at that time. If you read back to my first post on this page, that is the message I tried to convey. -- MichaelSparks

Perhaps I did not correctly understand your original post. -- Doug

So, can we get some examples of when such esoteric constructs were necessary? I presume that a modern C compiler would see a similar For loop and just unroll the damn thing as much as possible. Perhaps wanting to replace only a single column of interleaved data would justify this kind of approach?

Read DuffsDeviceInDuffsOwnWords for exactly that.