Cap Array

An array that grows in chunks instead of fragmenting the heap with minuscule memory adjustments.

It is said to have a capacity and a growBy property or parameter.

Growth is automatic and dynamic, and the programmer doesn't allocate memory himself (when the pattern is encapsulated in a module, record, struct, object, etc.).

  caparray.growBy = 100;   // chunk growth size
  caparray.initialCapacity = 400;
  caparray.add('item');
  caparray.add('another item');
  writeln(caparray.items[0]);
  writeln(caparray.items[1]);
  writeln(caparray.length); // prints 2
  writeln(caparray.actualLength); // prints 400
In the above demonstration, the array is first given a capacity of 400. This means it will not allocate memory again until it reaches a count of 401, when it needs to make room for more items. When the 401st item is added, it first allocates memory for 400 + 100 (500) items, and then the item is placed in the array. It is then ready for more items until the 501st item is added, and so on.

We can find out the actual (allocated) length by using actualLength or a similar method/procedure, as shown in the demo above. However, the actual length of the array is generally not something the programmer should even care about; it is an implementation detail. More importantly, all the programmer should care about is that the array acts just like a regular array, with a length() ability (count of items) and indexed access.
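
To make the pattern concrete, here is a rough sketch of one way it might be implemented in Free Pascal. This is a hedged illustration only: the class name TCapArray, the method names, and the class-based layout are assumptions for the example, not the record/object implementation mentioned below.

 program CapArrayDemo;
 {$mode objfpc}{$H+}

 type
   TCapArray = class
   private
     FItems: array of string;  // backing store; Length(FItems) is the capacity
     FCount: Integer;          // number of items actually in use
     FGrowBy: Integer;         // chunk added whenever the capacity is exhausted
   public
     constructor Create(AInitialCapacity, AGrowBy: Integer);
     procedure Add(const S: string);
     function Get(Index: Integer): string;
     function Capacity: Integer;
     property Count: Integer read FCount;
   end;

 constructor TCapArray.Create(AInitialCapacity, AGrowBy: Integer);
 begin
   SetLength(FItems, AInitialCapacity);  // one allocation up front
   FCount := 0;
   FGrowBy := AGrowBy;
 end;

 procedure TCapArray.Add(const S: string);
 begin
   if FCount = Length(FItems) then
     SetLength(FItems, Length(FItems) + FGrowBy);  // grow by a chunk, not per item
   FItems[FCount] := S;
   Inc(FCount);
 end;

 function TCapArray.Get(Index: Integer): string;
 begin
   Result := FItems[Index];
 end;

 function TCapArray.Capacity: Integer;
 begin
   Result := Length(FItems);
 end;

 var
   A: TCapArray;
 begin
   A := TCapArray.Create(400, 100);
   A.Add('item');
   A.Add('another item');
   writeln(A.Get(0));
   writeln(A.Get(1));
   writeln(A.Count);     // 2
   writeln(A.Capacity);  // 400
   A.Free;
 end.

The only allocation calls live in the constructor and inside Add; callers just add items and index them.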

Developed by someone who got sick of continually calling setLength() on dynamic arrays (or memAlloc).

It can be implemented as an object, record, or struct as above - but the language could build it in natively too. I implemented it as a record and an object in modern Pascal and pondered implementing it in the language directly, so that such arrays could be accessed without all the verbosity that object/dot notation entails.

See also CapString and Java's java.util.Vector class (http://java.sun.com/j2se/1.4.2/docs/api/java/util/Vector.html)

You will find that this is how strings and arrays behave in many of the modern scripting languages (e.g. PerlLanguage, RubyLanguage, and others). (They often have a "shrinkage" strategy as well.) Also, you will find that you get better performance for large data if you grow the array/string by simply doubling the size each time you overflow it -- it will cause far fewer copies of your data. See ShlemielThePainter.

That's exactly the reason I don't use modern scripting languages often - because they do this behind my back, and no one ever knows about it, because the users of the language are generally too - how do I put it - too dumbfounded to even care what algorithms are really driving their programs. By encapsulating this pattern into a module (or object) and clearly documenting the specifications (instead of just assuming everyone knows about this pattern), I can use this algorithm when appropriate. When other algorithms are appropriate, I use them (such as a regular array without capacity, or a hashlist, which is much different).

Doubling the memory each time - I don't feel this is necessary, and it doesn't scale with large data and small memory (cramped servers and such). If you have a large volume of data (1GB) it is not wise to double to 2GB, then to 4GB, as the system may only have 200MB of memory available (busy web server). As always, it depends on the conditions - which is why I don't use modern scripting languages a lot, since this doubling you speak of may be done behind your back - instead of in a module that I can control and study myself. It may sound like I'm a control freak and should be using Cee or Assembly - but that's not the case, because this algorithm can be placed in a reusable module or object - and once it is done, it is done! Encapsulated away for next time.

As for other languages having this before I discovered it - it is their fault for not naming this pattern. Many people reinvent the same patterns over and over again, but they forget to clearly label them and put them in a repository of some sort - like a book, a PDF file, a wiki, etc. The first person who invented the wheel was probably never credited (I'm guessing it was a round rock of some sort) since he didn't document the pattern of the wheel. Someone else got to it first.

If you want optimal performance for appending large amounts of data while keeping rapid access to the inside elements by index, use a deque or rope. Only if you need flat memory (e.g. for DMI network-interface message formatting) should you use an array or vector. In that case you may either resize it by hand (so it isn't "behind your back") or let it grow naturally - your choice (C++ vector does exactly this, with '.reserve' and '.resize'). I can't help but wish to say: "You'll learn, young grasshopper." You're getting caught up in PrematureOptimization - even worse, micro-optimizations performed without attention to their context; proper optimization requires as much information as possible - context is everything. Your reinvention of the vector may be 'clever' in the context of Pascal, but it isn't very optimal in the broader context.

Focus on overall algorithmic performance: if you multiply the size of a vector by some constant k, such that k > 1, each time you require space, you can guarantee that the total cost for inserting all the elements into the vector is O(N), AND you can guarantee that space-efficiency is strictly greater than 1/k, where 'k' is the constant of choice. E.g. if 'k' is 1.3, you can guarantee space-efficiency of 1/1.3, or 77%. Time efficiency will be lower for a lower k, of course. If you want 100% space-efficiency, simply 'trim' the final array at the end: the cost of one more pass doesn't influence the O(N) temporal cost, but gets you straight to 100% space-efficiency - trading a little extra time for space. Compare this to the solution of adding a constant amount k to the size each time: temporal cost O(N^2), space efficiency N/(N+k).
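
To make the contrast concrete, here is a minimal, hypothetical Free Pascal sketch of the multiplicative strategy with a final trim, as described above (k = 2; the names TStringArray and EnsureCapacity are illustrative, not any particular library's API):

 program GeometricGrowthDemo;
 {$mode objfpc}{$H+}

 type
   TStringArray = array of string;

 // Make room for at least Needed items, multiplying the capacity by k = 2
 // whenever space runs out; the total copy cost over N appends stays O(N).
 procedure EnsureCapacity(var Items: TStringArray; Needed: Integer);
 var
   NewCap: Integer;
 begin
   NewCap := Length(Items);
   if NewCap = 0 then
     NewCap := 1;
   while NewCap < Needed do
     NewCap := NewCap * 2;
   if NewCap <> Length(Items) then
     SetLength(Items, NewCap);
 end;

 var
   Items: TStringArray;
   Count, I: Integer;
 begin
   Count := 0;
   for I := 1 to 1000 do
   begin
     EnsureCapacity(Items, Count + 1);
     Items[Count] := 'item';
     Inc(Count);
   end;
   SetLength(Items, Count);  // final trim: one extra pass buys 100% space efficiency
   writeln(Count, ' items stored, capacity ', Length(Items));
 end.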

And believe me, you are FAR from the first to have 'discovered' something like this. It was one of the first things I built when learning C++ at the young age of sixteen, and it has been a problem on introductory-programming material that I both used and later assisted in teaching at university. And C has had 'realloc' for longer than you or I have been alive.

[Indeed. Algorithms of this sort were discussed when I studied computer science in university about 25 years ago.]

I informed the programming communities about these algorithms and they simply ignored them and went back to reinventing it each time themselves with memAlloc (and, in Pascal, with setLength). When I proposed this sort of algorithm (and others) be considered, the response was typical: don't need it, we'll just use memAlloc every time, no need, blah blah. Putting this algorithm into a module or class (just as Java has done) is better than every man using memAlloc (which is bug-prone and subject to human error) each time. This algorithm also reduces the verbosity that memAlloc requires. If this algorithm is too slow for the job at hand, then someone can simply take this module out of his program and swap it with another module - just as sometimes I use hashlists, sometimes collections, sometimes lists, etc. There was a case when I had to concatenate a bunch of strings and it took several hours using the standard reference-counted algorithm available in FPC for arrays and ansistrings. After switching to this algorithm, my data mining and storage of thousands of items took only a few seconds.

In my case, I was not prematurely optimizing - I wrote this algorithm only when I needed it. However, it is silly to just abandon the algorithm and not categorize it and put it into a reusable module. It is such a simple, stupid, dumb algorithm that just works - a lot better than built-in reference counters, and a lot better than overengineered hashlists when you don't need them. When you need a hash list, you use one. This is about choosing algorithms and data structures - which, unfortunately, the current kids of today don't understand; everything is just a dynamic type in today's languages. Associative array? What is that powered by? Internally, sometimes a hashlist, which 99 percent of script kiddies don't have a clue about. Fundamental algorithm knowledge is lacking.

This is beginning to smell like elitism or perhaps an unconscious attempt at MentalMasturbation justification. (MM is fine by itself, but when it comes to insulting others for not accepting it, then things change.) --top

This entire wiki is about masturbating thoughts. GetOverIt. Specifically, quite often you masturbate your tables when you don't need to. On almost every single page you inject your unhelpful, arrogant comments about MentalMasturbation, and about how EverythingIsaTable?. TableMasturbation.

I learned about the guts of data-structures and look-up techniques in school (even got A's in the courses), but I've forgotten most of it out of disuse. For one, generally corporate/biz languages offer just two arrays: positional (index multiplication-based look-up) and associative arrays (implementation depends on the language). Making a custom array type is usually out of the question. --top

Children's scripting languages may not allow one to define or choose one's own arrays or structs/records (PHP comes to mind). They may force you to use a hash or the built-in array. Most people don't even know what algorithms and data structures are any more. If it is out of the question to make a custom array type, or a struct/record, in the language, I wouldn't use the language (BrainFuck or PHP aren't suitable for my needs. Nor will I use an associative array to emulate a struct/record - which many PHP programmers do. I find this obnoxious and messy in code, and completely unneeded in real languages).

Further, if it's large enough that these two types of arrays are not sufficient, then there's usually a database available. If size or complexity is an issue, then you use the database that probably already exists for other purposes instead of RAM-only structures. (I suppose one could find reasons to complain about the RDBMS's B-tree-based lookup also.) --top

You seem to be complaining about CapArray. Hashes, btrees, and linked lists are not something I complain about. They are simply not the correct algorithm for the job (an array[i] is what we are asking for - not a table with a primary key, not a hash list, and not a btree - those are overkill). That would be reinventing the CapArray - taking a hash you don't need and trying to bend it into a CapArray. Likewise, abusing an associative array as a struct/record is bending something that wasn't designed to be a struct/record, and hence reinventing the struct/record.

If something *still* takes too long even among these 3 choices, then it probably needs a redesign anyhow. I don't think we need to reinvent the wheel in the vast majority of cases. --top

Irony: those who use memAlloc and reAlloc are actually reinventing the CapArray each time - and those people (not I) need to restructure and redesign their code to be more like a CapArray (even better, simply download the CapArray, or the correct algorithm that some other genius has already written, in a module which doesn't have bugs since it's been reused in other apps already). If the application suits a hash list, then they download that module. If they are storing relational data for later, they use a database. Picking the right algorithm for the situation is what is important - not bending everything into a table just for the sake of TableMasturbation.

By the way - as you have claimed I am reinventing the wheel - please provide me with URLs and links to an array and string type for Freepascal or Delphi which stores characters (or an array) without requiring manual management using memAlloc, setLength, or reAlloc - and it must perform well by default without any extra gotchas (such as heap fragmentation, no setLength, no allocating on each insert - and it must not be verbose, and it must not reinvent the CapArray).

Is the Java Vector just mental masturbation too? And the C++ vector? And is Python just MentalMasturbation, since Python reinvents PHP and Cee and Ruby in a different form?

I called it MM because I believe it to be wasted time and a potential maintenance headache down the road. The 3 choices mentioned are usually sufficient and readily available for most apps that I work on. I'm not just being a "naive script kiddie", but making smart choices. --top


I remember a DrDobbs? article from almost a decade ago that described this basic concept in something called "Hash Array Tables", or HATs. (Don't bother Googling for this now, though; apparently, there are thirty gajillion other structures also named HATs). However, if my DRAM doesn't need refreshing, the characteristic of these HATs is that they took time into consideration when growing the structure. E.g., if you tended to extend the array quickly, they would resort to doubling the array size. If you did it only infrequently, they would grow the array only a tiny bit (say, 10 elements). If it were somewhere in between, then they'd grow the array by a mid-sized amount (e.g., by 100 elements of whatever you were putting in). It was actually pretty neat. The idea was to amortize the time spent managing memory in proportion to the measured growth or shrinkage patterns of the array.

--SamuelFalvo?

I think any such structure is subject to one of the fundamental problems of AdaptiveCollections: the cost of runtime profiling is not insignificant and often overruns the cost of a 'dumber' algorithm with a fast heuristic. Such things should be considered on a case-by-case basis, of course - e.g. a smarter heuristic becomes more efficient as the relative cost of resizing the collection grows.

My own preference is for something like a 'rope', where appending and modifying in the center is very cheap (because it is performed logically), but with an additional option of a 'simplify' operation that can be called after one is done rapidly adding new elements, in order to reduce the time to start reading them. In the best cases, such simplification can be amortized across the reads and writes themselves, though the logic for making that work neatly is often painfully difficult to comprehend, and requires a little extra thinking (e.g. with partial lazy evaluations) to do properly in a purely functional manner.

Acquiring the current timestamp in microseconds and subtracting from the previous timestamp acquired is hardly a slow operation.

 import time

 class Hat(object):
   def __init__(self):
     self.lastAccessTime = time.time()

   def appendElement(self, elt):
     if self._isOutOfRoom():
       # Grow in proportion to how rapidly elements are arriving:
       # rapid bursts double the array; infrequent additions grow it only a little.
       dT = time.time() - self.lastAccessTime
       if dT <= SMALL_THRESHOLD:
         self._doubleInSize()
       elif SMALL_THRESHOLD < dT <= BIG_THRESHOLD:
         self._expandBy_(100)
       else:
         self._expandBy_(10)
     self._store_andIncrementSizeByOne(elt)

   def _store_andIncrementSizeByOne(self, elt):
     self.data[self.currentSize] = elt
     self.currentSize = self.currentSize + 1
     self.lastAccessTime = time.time()

...

It's not that hard, and positively not that time consuming, especially if you have a very fast system-call mechanism with which you can acquire the current timestamp. Even assuming it is a slow syscall, you can adjust your growth rates to better amortize the optimization across future additions. --SamuelFalvo?

Acquiring a timestamp is generally a call of moderate speed at best, but can be considered cheap if the size of the data-items is larger. The problem is that the cost for the call, acquiring the time, and the extra assignment operation are multiplied strictly by the number of inserts, which will at least double the per-insert cost for smaller data-items. And the proposed benefit is of questionable value, being non-algorithmic anyway. Demonstrating that this provides any real optimization will be quite non-trivial. And then, of course, there are the RulesOfOptimization - when you do get around to hand-optimizing something, you had darn well better be able to prove it worked because the loss of simplicity and the extra coupling (in this case to the time system, which means you lose determinacy for debugging) is often higher than anticipated.

A note about naming: 'vector', to me, isn't so clear - it is described at http://www.google.ca/search?q=define%3Avector. CapArray is much more specific in naming this algorithm, methinks.


The timestamp algorithm simply should be placed in another module.. Samuel extends the caparray even using inheritance or by calling the functions in the caparray algorithm and tuning it. [... rest moved below without any apparent consideration of how it affects the context ...]

I believe that creating unnecessary, trivial choices is a BadThing. It encourages quibbling over small stuff. While power to implement either of these is important, keeping them both around simultaneously is of negative net value compared to just keeping one. This philosophy is embraced by Python: there should be one, and preferably only one, obvious way to do it (with the corollary: and that obvious way should be a correct way to do it).

Sounds a lot like standard pascal where there was only one correct system library to use. Sounds a lot like bull shit actually. I've never had a problem with the cap array algorithm - I've had more problems with the built in algorithms that use simple reference counting with no capacity at all. I've also heard people have problems with hash lists allocating way too much memory on loaded servers. You'll note that I'm not a script kiddie, and that I use real languages. HaHaOnlySerious.

Computer programming languages are HumanComputerInterface?s. They should be subject to good HCI design principles, such as DontModeMeIn and the absence or minimization of trivial decisions. I'm not as well versed in HCI principles as I could be, but see also TheHumaneInterface and some of RK's work on ObjectBrowsers and KillerUserInterface. What you're defending is the ability to quibble over pointless details, like whether to use an inefficient CapArray or a maybe-sometimes-slightly-more-efficient temporal version. Is your best defense of this behavior to call "bull shit" and to appeal to hearsay anecdotes about what you've heard other people doing?

There is value in constraint. A blank page is unconstrained, but it also has no features of which you can take advantage or from which you can acquire value. EverythingIsNothing?. When decisions are offered to a programmer or any other human, they ought to be meaningful in a very real way.

Such as one choosing to use a simple fixed array for when he really needs that, or an enumeration, or a caparray, or a list, or a hash, or a binary search tree. Or an integer instead of a string. This is what type systems are all about.

There is nothing trivial about choosing a fixed array when you "really need it", such as when dumping data to the network interface or writing a device driver for the graphics card. And if you need a sort-ordered data representation or the ability to do 'persistent' updates (e.g. with versioned trees and backtracking) then a binary search tree gives you considerably better algorithmic performance than a hash. But choosing between two algorithms that do the same thing - one of which has cost '20N+3' and the other of which has cost '18N+100' - is entirely trivial. Not only that, but you're paying for it with clutter, bloat, and redundancy, and with reduced readability and maintainability (if ThereIsMoreThanOneWayToDoIt, then the maintenance programmer has to learn every effin' one of them). Having two modules that do the same job would be a perfect target for OnceAndOnlyOnce refactoring.

And TypeSystems are not about offering trivial choices (or choices at all) - they're about constraining choices, ensuring that everything runs safely and smoothly.

In my case the built-in algorithm (reference counting) did not work fast at all, and was causing my strings to be concatenated very slowly in a loop. The hashlist offered in the available modules was way too complicated for what I needed to do (I had to free and create the hashlist objects, with risk of dangling pointers, and casts necessary for each type, pointers, etc.). Combined with my IncludeFileParametricPolymorphism, I can make a CapArray generic and not rely on the crummy built-in reference-counting algorithm, and not use any pointers or ugly hacks. Furthermore, one can take this CapArray and derive a better timestamp-based algorithm from it, using the CapArray code as a base, since the CapArray can be changed dynamically (its capacity can be modified at run time based on certain conditions, since it is heap based, not stack based).

If you're doing big-string ops, 'rope' is the best way to go. Seriously. Anyhow, CapArray's performance is very questionable compared to that of simply resizing a vector by some multiplicative constant 'k', which gets you O(N) performance with a temporal cost of (1/(k-1) + 1)X, where X is the cost of picking the exact right size at the beginning. And it can easily support trimming to get 100% space efficiency, increasing the temporal cost to (1/(k-1) + 2)X. A CapArray will not beat this in terms of efficiency for time or for space.


(Moving 'hypocrisy' conversation to ExtensibleProgrammingLanguage)


The timestamp algorithm simply should be placed in another module. Samuel extends the caparray, either using inheritance or by calling the functions in the caparray algorithm and tuning it. Then users have the choice: use the timestamp one, or the simpler one. This is what modular programming is all about, IMO - giving people swappable options. (PrematureOptimization note: in FPC, to get a timestamp, one may have to pull in the Sysutils unit, which adds 60K to the executable. Without the timestamp, no 60K. But just give people the option: put it in another module, derive it, reuse the code, etc.)
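
As a hedged sketch of what such a separate, swappable module might look like - assuming a base class whose growth policy is a virtual method that a timestamp-based descendant overrides; the unit name, class names, and thresholds below are all hypothetical:

 unit TimedCapArray;
 {$mode objfpc}{$H+}

 interface

 uses
   SysUtils;  // only this module pays for SysUtils (GetTickCount64)

 type
   TCapArrayBase = class
   protected
     FItems: array of string;
     FCount: Integer;
     function GrowAmount: Integer; virtual;  // plain CapArray chunk policy
   public
     procedure Add(const S: string);
     property Count: Integer read FCount;
   end;

   // Descendant that tunes the chunk size by how rapidly items arrive.
   TTimedCapArray = class(TCapArrayBase)
   protected
     FLastAddMs: QWord;
     function GrowAmount: Integer; override;
   end;

 implementation

 function TCapArrayBase.GrowAmount: Integer;
 begin
   Result := 100;  // the fixed growBy chunk
 end;

 procedure TCapArrayBase.Add(const S: string);
 begin
   if FCount = Length(FItems) then
     SetLength(FItems, Length(FItems) + GrowAmount);
   FItems[FCount] := S;
   Inc(FCount);
 end;

 function TTimedCapArray.GrowAmount: Integer;
 var
   NowMs: QWord;
 begin
   NowMs := GetTickCount64;
   if NowMs - FLastAddMs < 10 then
     Result := 1000   // rapid additions: grab a bigger chunk
   else
     Result := 100;   // otherwise, the ordinary chunk
   FLastAddMs := NowMs;
 end;

 end.

In a real layout the base class would live in its own unit; it is shown inline here only to keep the sketch self-contained, so that only programs wanting the timed policy would pay for the SysUtils dependency.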


crap excised from above; delete when you think everyone's seen what they want to see

I'm really confused by your absolute fixation on what you think are buzz phrases. Stop it. It's no more a buzz phrase than BuzzPhrase itself is.

A buzz phrase is saying DoubleSpeak and stuff that means nothing. For example:

   1. What do we desire?
   2. What do we desire going forward? 
   3. What do we desire going sideways? 
   4. What do we desire going backward? 
Adding the "going forward" doesn't help anyone much. HaHaOnlySerious.

{Just treat "going forward" as ambiguously (a) a conversational segue or transition, or (b) an indicator or foreshadowing that the speaker intends to inform you of a plan, or (c) another way to say 'in the future'. All uses are perfectly reasonable and common. None of these uses is a BuzzPhrase. So asking, each time, some equivalent of "what is that? a BuzzPhrase" is quite irritating and usually wrong.}

See BuzzwordCompliant (a page I did not make).. and you will have a nice surprise.

{Arrggh. You may be right. But see NoKeening for a nice lesson.}


AprilZeroEight

