Not Needing Binary Search

NeedingBinarySearch asks for examples where [a complex algorithm like] BinarySearch is needed. Here are examples of places where something like binary search might have seemed like a candidate solution, but in fact something simpler worked. This is not to say that binary search is evil. The purpose is just to provoke people. Er, no, to provoke people to think about simple solutions.

Do you really think binary search is a complex algorithm? I happen to think the opposite. Now if you compare it to linear search, well, almost anything is more complex than that. But I agree this discussion motivates people to look for simpler solutions whenever possible.

You have a large collection of records, sorted on your search key. The collection is so large that it is spread over many disk pages.

Sure, this is a great example but only because you have to be loading the records. Now in a similar case, when you do not need to load the records I would not even dare to do that. An example is a pool-base allocation library (for embedded systems). If you have to see if the address is part of one of many pools, I would definite use binary search to get to the pool that _may_ contain the address and then binary search again to validate that the address is indeed the start of a buffer in the pool.

An easy solution to quickly finding the record of a given key, if it exists, is to binary search the file, buffering in new pages as needed. The search will only read a few pages, then stay within one page and quickly find the desired record if it's there.

Another easy solution is to keep an internal table of the first key on each page. Search the table linearly in memory to determine the only page that could possibly contain the key you need. Read that one page, search it linearly.

This search will out-perform the disk-based binary search.

Ummm, you just suggested replacing a binary search with a BeeTree. OK, it's a special case of a B-Tree (it never has to deal with a lot of B-Tree details like page splits), but it's still more complex than a binary search. B-Tree generally kicks binary search butt in performance contests for all but a few special cases (where they are equal). Was that intentional?

Something has been bothering me and I just now figured out what it is. The assumption that has been made by many people is that the binary search O(logN) is actually faster than the built-in O(N) search. This may not in fact be the case. It has been my personal experience that if you stick to the XP practices (all of them) performance is not a problem. In one case I was writing code similar to something that was already written and took 8 hours to run. My code was going to do twice as much work so the estimate was 16 hours. Now this existing code was optimized up, down, left, and right, but it was built JangIt style. I stuck to XP as I coded the problem creating a MethodObject to simplify the complex algorithm. I ran my code in 2 hours. 8 times faster. I did absolutely no optimization. Another case was when we replaced a parsing algorithm that was hand crafted by one of our programmers to be spectacularly optimized to read in a serialized database. We threw that away and instead serialized out as a simple Topaz (GemStone) script that could then be filed in using the built-in GemStone parser. It was about 24 times as fast.

I guess the point I am trying to make is that if you find yourself searching an array with thousands and thousands of neatly sorted numbers maybe there is a simplification you could make elsewhere so that you don't need to reach to the bottom of the bag of tricks. --DonWells

If you did something simple that was 24 times faster than a supposedly optimized algorithm, then the point I would take away from that is that either the other guy had constraints that you didn't, or he wasn't very good at optimization -- and that's all.

Well, would it help if we called it a *system*, not *algorithm*? After all, the point about this is that over-aggressive optimization confuses the structure of systems and reduces their overall performance: a step that is locally productive becomes globally counter-productive when combined with too many other such steps.

As far as I know, Smalltalk doesn't have a binary search. It does, however, have a number of hashed collections, which are of course O(1). And they're used rather commonly. That tells me that faster-than-linear access is useful.

(Keep in mind that the worst case for hash tables (all keys hashing to the same value) ends up equivalent to a linear search, or worse. True, it would take an extraordinarily poor combination of data and hash algorithm, and is extremely improbable in practice, but it is possible. - JayOsako)
Not quite; maybe you meant industrial strength hash tables? Most simple hash algorithms frequently behave this way on certain kinds of data that may in fact be surprisingly common, not rare, and all simple hash tables behave that way when they are overly full. Industrial strength implementations go out of their way to attempt to avoid this behavior, but it's nontrivial to try to get O(1) average insert and lookup given unknown data of unknown total count.
[Then it's a good thing that Smalltalk deals with known data of known size; all object pointers all the time. And if it's so risky to write your own hash function, then that just encourages any right-thinking programmer to use the hashed collections as given. Hence the adage, talk small and carry a big class library.]
I agree. But I pondered that sort of thing and decided that he wasn't commenting just on the industrial strength hash tables that ship with good Smalltalk implementations, he seemed to be addressing the CS theory -- so I responded on that basis.
I changed my phrasing to "unknown total count", because I didn't mean the byte width of individual elements. If you know a hash table will never store more than 64 elements, that is important information for tuning its implementation. But if you need it to run fast whether it's used for 4 elements, 64 elements, 200k elements, or 200M elements, that is way harder to implement. Not that users care -- except that it's impossible to implement such things perfectly, so users should care enough to beware of such implementation difficulties, and not assume the impossible. Then again, extreme speed rarely matters. But if it does, this is one area to be suspicious of ahead of time.
[Collections in Smalltalk grow at need. They change their capacity to match their usage, roughly doubling when they're full, the hashed ones always keeping a prime or pseudo-prime capacity. There's no reason why you can't start with a collection with 200M capacity, but it'd be painful to have it grow to that size.]

To Don's point, however, I suspect it's likely that many of the uses of binary search or hashing could be eliminated altogether by other means. (See for example <http://citeseer.ist.psu.edu/cai95using.html>.) When a program is too slow due to doing a search, the best thing to do is to eliminate the search. This saves 100% of the time, and will usually result in a simpler system.

Other than as an example, which I believe is Alistair's case, actually needing to implement a binary search seems unlikely. Alistair's point, of course, was about commenting complex code, not about whether one woulda/coulda/shoulda implemented that particular complex code. --RonJeffries