Scaling Oop Discussion

(moved from ArgumentsAgainstOop)

On a small scale, OOP is overkill for BBR [Binding Behavior to References]. Procedural techniques usually work fine (at least with decent dynamic features). On a very large scale, databases are superior in my opinion, largely because they provide ready-made features to manage big collections of semi-similar stuff. There's a middle ground where OOP may have a slight edge, but this creates two problems: 1. How do we scale up to the DB when the complexity or size grows to exceed this middle area? If we start out with a DB for the middle ground, scaling is far easier than switching paradigms. 2. The confusion of mixing and integrating multiple paradigms. - t

I'll need to disagree with you here. On the very large scale - by which I mean DistributedSystems - objects are primitives; they are a basis for distribution, for security (ObjectCapabilityModel, access control), for LiveProgramming and persistence, and so on. On smaller scales, TotalFunctionalProgramming and LogicProgramming (including relations/databases and queries, DataLog style) work very well, offering strong opportunities for optimization (due to strong properties like guaranteed termination, independence of evaluation order, etc.). In the middle-layer, DataflowProgramming and FunctionalReactiveProgramming act as 'plumbing' between objects and services (i.e. MultiCaster is included here; related AlanKayOnMessaging). It is infeasible to create a single-Database-to-rule-all-services. It is a natural consequence of politics and technology that, on very large scales, long-lived data (anything longer lived than a message) is managed in hundreds of different databases - ranging from small databases in sensor devices to large databases in corporate clouds.

Putting the Database in the "middle layer" is a mistake, on the very large scale, because it requires pumping data to a common database, which means systems at the edge must know about a common database - i.e. carry a reference to the appropriate DatabaseManagementSystem?. Not only does this introduce management and modularity challenges; it also introduces security challenges, since you can't easily grant access to just a subset of data sources without introducing expensive filters. More practical is to pipe in the other direction: construct a 'virtual' Database by 'subscribing' (DataflowProgramming / FunctionalReactiveProgramming) to many smaller databases, along with performing any processing or transforms. An object might provide a reference for mutating a database, but the data itself can be accessed via SideEffect-free FunctionalReactiveProgramming, which allows many optimizations for very-large-scale multi-cast networks (up to exponential performance, bandwidth, and space savings). This is more secure, more modular (by which I mean it has better distributed management and partial-service sharing without violating security), and makes databases/LogicProgramming a suitable subordinate to both OOP and its lower-plumbing-layer FunctionalReactiveProgramming.
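The 'virtual database built by subscriptions' idea above can be sketched in a few lines. This is a hedged illustration, not code from the discussion: the class and field names (EdgeDatabase, VirtualDatabase, and so on) are invented, and a real FunctionalReactiveProgramming system would add the change-propagation and multi-cast optimizations this toy omits.

```python
# Sketch: edge databases publish updates, and a read-only 'virtual' view is
# composed by subscribing to them - rather than every edge system pushing
# its data into one common database it must hold a reference to.

class EdgeDatabase:
    """A small data source that notifies subscribers when rows change."""
    def __init__(self, name):
        self.name = name
        self.rows = []
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def insert(self, row):
        self.rows.append(row)
        for cb in self.subscribers:
            cb(self.name, row)

class VirtualDatabase:
    """A derived view built by subscribing to many edge databases."""
    def __init__(self, sources):
        self.view = []
        for src in sources:
            src.subscribe(self.on_update)

    def on_update(self, source_name, row):
        # Tag each row with its origin; a real system could also transform,
        # filter, or aggregate here, side-effect free.
        self.view.append(dict(row, _source=source_name))

sensor_a = EdgeDatabase("sensor-a")
sensor_b = EdgeDatabase("sensor-b")
virtual = VirtualDatabase([sensor_a, sensor_b])

sensor_a.insert({"temp": 21})
sensor_b.insert({"temp": 19})
# virtual.view now holds both rows, each tagged by source
```

Note the direction of the references: the virtual view knows about its sources, but the edge databases never need a reference to a common master.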

DistributedSystems programming is my passion. I have looked, but I do not see a practical alternative to OOP on the very-large scale. I've done much work on DistributedTransactions and distributed DataflowProgramming / FunctionalReactiveProgramming to make programming this very-large-scale ever more feasible. Of course, distributed OOP itself may serve as a substrate for even-higher-level programming (i.e. MetaProgramming - describing distributed applications via LogicProgramming or FunctionalProgramming - thus completing the cycle).

I'm not an OOP fanatic. I am, however, an OOP expert: I know the good, the bad, and the ugly when it comes to OOP and its most popular implementations; I've worked with OOP for over ten years; I keep up-to-date with OO design patterns (though I believe DesignPatternsAreMissingLanguageFeatures), and I contributed much to ArgumentsAgainstOop. I've known a few OO fanatics, so I know what you're fighting against, but you seem to assume anyone who defends OO is an OO fanatic. I think it ridiculous to assume use of OOP means rejecting other techniques, yet your own arguments tend to all have that as a basis: assume relational or functional or whatever is mutually exclusive with OOP, then argue that OOP is bad because some feature might be better performed by alternative paradigm <pick one>. I also feel you often blindly ignore the larger scale; e.g. procedural access to a named database (remote reference) as opposed to a 'global' database is fundamentally an OOP technique, and processes are objects at the level of the operating system, and so on.

Your statements are too general for me to inspect. Can you provide a specific example or scenario that doesn't require too much domain background education for the reader? Otherwise, they appear to be either WalledGardens or plugs for pet technologies. I agree there may perhaps be domains that existing RDBMS cannot handle, but they are not documented here.

No. (a) YouCantLearnSomethingUntilYouAlreadyAlmostKnowIt, (b) therefore, I can't know what you're missing until I already almost know it, (c) therefore any example I might provide is as likely to fly well above your head as it is to be too trivial for you to grok the distinction you need. If you want clarification, you'll need to ask the right questions to milk it from me, and you'll need to detail scenarios that give me some clue of what you're failing to understand.

And there is not even ONE domain RDBMS can handle on the 'very large scale'. To say otherwise strongly implies we could put everything (literally everything - across all organizations) associated with a common domain into just one RDBMS. This conflicts with the communications, technology, and political requirements of every domain I can imagine. ReductioAdAbsurdum: it must be the case that RDBMS does not handle even one domain (at least not any I can imagine) on the 'very large scale'. Why should I document a list of every domain I can imagine? On the very large scale, we use multiple RDBMS's - i.e. different RDBMS's for different data management and security domains. The moment you have two or more RDBMS's in the world, you need something on the larger scale (above RDBMS) to distinguish, reference, and access the different RDBMS's. Use of references tied to different RDBMS's suggests OOP. QED.

You are the only one I know of that claims that OOP "scales better" than RDBMS. It's not a claim I wish to dig into right now due to its low popularity/commonality and due to the difficulty of getting specifics out of you. Maybe another day. By the way, semi-distributed RDBMS are fairly common. For example, store branches usually have their own independent RDBMS that feeds info to a "master" or "HQ" DB during the night or asynchronously. Sometimes the HQ system copies everything, sometimes only a subset. Unique keys are often the store ID plus a local counter. This approach has drawbacks, but generally works in practice. I invite you to show how OOP can improve on this model. -t
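The branch-to-HQ pattern described above (unique keys formed from a store ID plus a local counter, with an asynchronous feed to a master) can be sketched with in-memory sqlite3 databases standing in for the branch and HQ servers. The table name, columns, and helper functions are illustrative assumptions, not drawn from any real store system.

```python
# Sketch of semi-distributed RDBMS: each branch keeps an independent DB,
# keys are (store_id, local counter), and rows are fed to an HQ master.
import sqlite3

def make_db():
    """Branch and HQ share the same illustrative schema."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (store_id TEXT, local_id INTEGER,"
               " amount REAL, PRIMARY KEY (store_id, local_id))")
    return db

def record_sale(db, store_id, amount):
    """Unique key is the store ID plus a per-store counter, as described."""
    cur = db.execute("SELECT COALESCE(MAX(local_id), 0) + 1 FROM sales"
                     " WHERE store_id = ?", (store_id,))
    next_id = cur.fetchone()[0]
    db.execute("INSERT INTO sales VALUES (?, ?, ?)",
               (store_id, next_id, amount))
    return (store_id, next_id)

def feed_to_hq(branch_db, hq_db):
    """Nightly/asynchronous feed: copy branch rows to the HQ master.
    'OR IGNORE' makes the feed safe to re-run; keys never collide
    across stores because the store ID is part of the key."""
    for row in branch_db.execute("SELECT * FROM sales"):
        hq_db.execute("INSERT OR IGNORE INTO sales VALUES (?, ?, ?)", row)

branch_a, branch_b, hq = make_db(), make_db(), make_db()

record_sale(branch_a, "A01", 19.99)
record_sale(branch_a, "A01", 5.00)
record_sale(branch_b, "B02", 12.50)

feed_to_hq(branch_a, hq)
feed_to_hq(branch_b, hq)
# HQ now holds all three rows
```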

RDBMS typically distributes via mirroring and caching, which is especially useful for read-only interactions (LDAP). Any object can distribute in this manner (EeLanguage calls it the 'unum pattern'; OpenCroquet is based on it). The mechanism you name for RDBMS distribution isn't anything special, and it doesn't cover the cases where: (a) different objects are on different machines, or (b) the data itself is distributed in the sense that there is no "master" server/HDD. I have worked with 'truly' distributed RDBMS data before (i.e. where five computers each hold one third of the data, overlapping for double-failure redundancy). RDBMS performance scales poorly for joins and complex queries in such cases: if you need to distribute the data, you're already in serious trouble because you won't have enough space to perform a join! This is why almost nobody does it; it's much easier to buy a bigger RAID drive, and perform regular backups and mirroring.

As far as suggesting I'm the "only one you know" who thinks OOP scales better than RDBMS on the very large scale: Ponder the relationships between SOA and OOP. Peruse the motivations behind ErlangLanguage, and concern yourself with why MnesiaDatabase is subordinate to the process objects. I've never found anyone who thinks RDBMS scales better than OOP on the 'very large scale'. RDBMS scales well, certainly better than certain OOP implementations. But on the very large scale, it's all services and processes communicating via messages across named references, and always has been. On the broad scale, a given database in a given RDBMS is one object instance.

So, how can OOP improve on a 'distributed global RDBMS'? I named how earlier, in the discussion that was 'too general' for you - security, distributed management, and the various other features associated with having distinct objects. A specific case of improvement: a sensor device keeps its own RDBMS and doesn't need to feed or query the global RDBMS. Suppose for a moment that the contrary was true: you cannot name RDBMS objects, there is only one global RDBMS in the entire world, and all code in the world references this global RDBMS. Each domain gets its own set of tables, and all sensor devices store data to these tables: every camera in the entire world stores photos to the global 'photo' table, for example. I invite you to consider how well this scales in terms of a few common SoftwareEngineering requirements: performance, security and secrecy, and safety (including disruption tolerance).

The code may not "know" whether there's only one or not. And it largely depends on how the RDBMS manages name-spaces. If we need more indirection, then add more indirection to the naming system. It's not an inherent, in-born fault of the relational paradigm. And what do you mean by "each domain"? And you have not stated why the cameras "must" use one central DB. It's still unclear exactly what you are trying to achieve. I cannot "build" a system without a requirements document.

Logically, there are exactly two possibilities: (1) there is a global, implicit DB. (2) there are many DB's distinguished by some feature - which will essentially be a 'name' or 'reference'. In case (2), you have already admitted a higher scaling factor than the DB - in particular, you are now using names to distinguish 'DB objects' and scale upwards. Thus, you may not logically use case (2) to argue that RDBMS scales better than OOP; attempting to do so is utterly counter-productive and logically inconsistent. Therefore, you must (to be logical) use case (1): there is a global, implicit DB. It doesn't need to be a "centralized" DB. (I never said "centralized" RDBMS above, did I? It could be a distributed RDBMS.) But it does need to be a common DB to every bit of running software in the whole damn world. Anything else is logically equivalent to admitting objects scale above RDBMS.
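Case (2) can be made concrete with a tiny sketch: the moment there are two or more databases, client code must hold named references to them, and each database behaves as one object instance. All names here are invented for illustration.

```python
# Sketch: a registry of databases distinguished by name. The registry is
# the 'larger scale above RDBMS' - client code dispatches through a
# reference, exactly as it would with any other object.

class Database:
    """On this scale, each database is one object instance."""
    def __init__(self, name):
        self.name = name
        self.tables = {}

    def query(self, table):
        # Return the rows of a table, or an empty result if absent.
        return self.tables.get(table, [])

# Name-to-object mapping: the feature that distinguishes the many DB's.
registry = {
    "sensor-17": Database("sensor-17"),
    "hq-cloud": Database("hq-cloud"),
}

db = registry["sensor-17"]   # obtain a reference by name
rows = db.query("readings")  # then message the object
```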

By "each domain" I mean each and every given domain you can possibly name, plus the imaginary ones, that might involve software: photography, oceanography, physical security, cryptozoology, software engineering, etc. Some domains may overlap, of course. E.g. oceanography could also use the photography tables for some things, due to domain overlap (oceanography includes photography). But any attempts to start playing with "the naming system" as you suggest is likely to just reinvent OOP inside the global DB via 'table objects', which is similarly counter-productive to demonstrating RDBMS scalability (in the absence of OOP). So I said there is some near-constant set of tables per domain. As to exactly how these tables are chosen - perhaps a panel of domain experts chooses and maintains the tables for everyone and all software in the world... but how the tables are chosen isn't too relevant to the challenge I set before you. Assume the tables for each domain are very well chosen, and work from there.

Again, without specifics, I cannot tell if it's really necessary to "invent OO in the DB". But if such is required, it does not mean that "relational fails", since the set of design possibilities for relational and for OO are not necessarily mutually-exclusive. Sometimes the solutions will converge into the same or very similar design. The issue here is whether "scaling" forces DB's to be more OO.

Since you need specifics, consider this problem: you have some constant number - say 50000 tables - to play with to store, access, and share all data for every bit of software from pacemakers to usenet. You can have namespaces, but you can't get more tables by using them. Now, clearly 50000 tables is not enough to avoid sharing... i.e. there are more than 50000 pacemakers in this world, and certainly more than 50000 cars, more than 50000 websites, more than 50000 news groups, and more than 50000 small businesses. So you'll be sharing these tables: all pacemakers use the same set of pacemaker tables. All shipping businesses use the same set of shipping-business tables. All e-mail clients and servers use the same, global set of e-mail tables. And this isn't simply the 'same schema'; by 'same set of tables' I literally mean that you can find data for every pacemaker in the entire world by performing a query unless you add a security filter (which someone will need to be trusted to maintain). You can assume the schema will be well-chosen, of course, since there'll be a lot of really smart people thinking hard and standardizing it. Are you grokking so far? Because that is (in essence) what it means to say "relational can achieve the very large scale" without introducing 'DB objects'. (Allowing each small business, news-group, or pacemaker to create its own set of tables is just cheating - it isn't fundamentally different than giving each its own dedicated database-object.) Introducing 'DB objects' allows for different pacemakers to each have their own database with their own tables, but also means admitting a relational database - on the very-large scale - is just another plain-old-object. That doesn't mean relational "fails", but it does mean that objects/OO - above relational - were the key to scalability.

In theory, how the tables are partitioned, replicated, or whatever could be pushed down almost entirely to the implementor's or configurator's viewpoint/concern. The SQL query designer may not have to care, and it may be changed under the hood without the designer ever knowing the difference. (In practice there may be some performance and timing trade-off changes that affect design decisions.) If you are implying that OOP better fits the physical realm of the implementor (systems software programmer) dealing with server boxes and cables, I would not necessarily disagree. I've already agreed that SystemsSoftware is an area that OO seems to better fit than domain development. OO was invented for physical modeling, and as long as the physical parts don't hit high quantities or have to interface with something that involves high quantities, it may be just dandy there. -t

You say "in theory", but which one? Will I also find perfect caches, GodRamIllusion, and a SufficientlySmartCompiler in this anonymous theory of yours? In this theory, how do you ensure the "implementor's or configurator's viewpoint/concern" will be secure, especially after you scale to multiple users and organizations? ... Also, what does "very large scale" mean to you, TopMind? To me 'very large scale' will encompass millions of machines across different users and organizations and generations. The Internet is 'very large scale'.

Again, my answer is "it depends on the specific requirements". I cannot give specific answers to generalities; I can only give specifics about specifics. Questions about whether you partition row-wise, column-wise, or both-wise (copy) have roughly analogous issues even for an OO-only solution. When dealing with physical separation of data in situations where you want to "hide" this separation from users such that it looks like one big data-set (when wanted), there's a ton of trade-offs to consider with regard to pitting time issues against bandwidth issues against integrity issues, etc. No paradigm removes the need to make these trade-offs; it only gives us tools to manage them. I generally start backwards: what does the user ideally want to see? If we cannot fully deliver it due to limited disk costs and/or the speed of light (for example), then which trade-off combinations best fit their need profiles? I cannot tell you that the users would rate recency (up-to-date info) over integrity issues, for example. Only the user or knowledge of the user's needs can tell us where to set the trade-off dials. - t

You aim to suggest that design is a ZeroSumGame of trade-offs. But that conclusion is easily debunked: it doesn't take much effort to take a high-quality product and (intentionally) design something that is 'worse' by every quality and metric you're using; and, starting from that worse design, the original is an improvement by every metric, so it must also be possible in the general case to take a design and improve it by every metric and quality. The possibility exists that solution S2 is better in every way you measure compared to solution S1. In those cases, it doesn't matter how you 'weight' features. Given this is true, you cannot assume that there must be a context where a particular paradigm-set - Relational+Procedural, for example - will be stronger than some other paradigm-set. Instead, you actually need to do some work and find those contexts in order to have a valid point.

Further, on the 'very large scale', you need to ensure that all relevant forms of 'scalability' have a suitably high 'weight' when it comes to these trade-offs. That constraint further restricts your ability to assume there is some context where Relational will provide superior scalability.

Earlier you stated, "On a very large scale, databases are superior in my opinion, largely because they provide ready-made features to manage big collections of semi-similar stuff." Some questions for you: (A) What did you mean by 'very large scale'? (Are you just scaling the amount of data? Or are you also scaling the number of developers, number of users, number of CPUs, geographic scaling - distribution, temporal scaling - how long programs live, etc.?) (B) Can you provide contexts and convincing arguments where Relational will 'generally' scale better than use of Objects?

I cannot give a simple metric because it depends on lots of details and interrelations, such as how many other tables/lists it joins with and how often. And I cannot provide objective evidence because I believe that relational's benefits are largely psychological; that is, "mind fit" (remember the story behind my handle). Further, some individuals may have "OOP heads", which I don't dispute. The computer doesn't really care what paradigm it's running. Human management of complexity is the primary issue. See SoftwareEngineeringAsManagementOfSoftware. (I'm sure you believe that computer-assisted validation via type-checking is a big factor, but I don't want to rekindle that debate here.)

Well, I think this argument is done. I personally think you're HandWaving above and have been for a while now, but I'll LetTheReaderDecide.

I personally think you are HandWaving on the requirements. More precisely, making non-confirmable excuses for why you cannot provide them. Why can't you just give a scenario where "here's OO doing X and here's relational trying to do X but failing right here at line 123 because one cannot rename the table while it's being used" or whatnot? You only have to provide one scenario of failure to demonstrate the existence of a weakness or flaw. You don't need generalities to achieve that. I didn't ask for a theory lesson or a list of your pet topics; I only asked for a semi-realistic specific example of where relational allegedly fails over OOP with regard to scaling. -t

I've described what I mean by scaling. You refuse to provide even that much. And if you were intelligent, perhaps you'd already be in the know as to why 'small examples' don't do much for showing 'scaling failures'. Let me know when you figure that one out.

I thought you were trying to describe the scenario in slightly more detail (and not doing so well), not define "scaling".

You're looking at the wrong text. Look to the top, where I said: "the very large scale - by which I mean DistributedSystems". I even provided an example: "The Internet is 'very large scale'."

The "scaling" issue I originally referred to is not so much about computer systems breaking or crashing more. It's more about managing all the instances/records/data from a human standpoint. If there is a problem, how does one go about trouble-shooting, for example? If I can query my 100,000 "units" using a decent query tool, it's easier to hunt for problems or clues. Say some of the cameras you mentioned are sending corrupted data and we want to see if we can find a pattern to the problem cameras and/or their attributes to give us clues. We may create a quick report/query by location, by corruption time (picture/vid time-stamp), by installation date, by model number, by installer employee ID, by combos of these, etc. This kind of troubleshooting is quite common in large systems.
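The ad-hoc troubleshooting query described above is easy to illustrate with sqlite3. The schema and the sample data are invented for the camera scenario; the point is only how cheap such a throwaway report is once the attributes live in queryable tables.

```python
# Sketch: hunt for a pattern among 'problem' cameras with a throwaway query.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cameras (id INTEGER, model TEXT, installer TEXT,"
           " install_date TEXT, corrupted INTEGER)")
db.executemany("INSERT INTO cameras VALUES (?, ?, ?, ?, ?)", [
    (1, "X100", "emp7", "2012-03-01", 1),
    (2, "X100", "emp7", "2012-03-02", 1),
    (3, "Z300", "emp9", "2011-11-15", 0),
    (4, "X100", "emp2", "2012-04-20", 1),
    (5, "Z300", "emp7", "2012-01-05", 0),
])

# Quick report: corruption count grouped by model. Swapping 'model' for
# installer, install_date, or combos of these is a one-line change.
report = db.execute(
    "SELECT model, COUNT(*) AS total, SUM(corrupted) AS bad"
    " FROM cameras GROUP BY model ORDER BY model").fetchall()
# report -> [('X100', 3, 3), ('Z300', 2, 0)]: the X100 model looks suspect
```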

OOP does not provide that out of the box, and "encapsulation" tends to make it difficult because it says we have to explicitly add collection-oriented "services" to each object or object "kind" we want to analyze.

That really isn't much of a problem. One of those "kinds" could be "relational database", which could then be used repeatedly. But OOP does have issues for data-management when taken by itself. But just like you're presumably using something like 'procedural' on the lower-level scale for relational, you should assume that OOP gets its pick of lower-level scale components. I earlier named which ones I'd choose: functional reactive and event-flow programming for 'plumbing', and functional + logic for transformations and pattern-matching. Relational gets to use these too, if it wants them; the question is overall scalability at the top-level.

An RDBMS that partitions data across multiple locations is more likely to automatically handle indexing, joining, and combining results that may come from physically split-up tables or DB's. Those features are far less likely to come out-of-the-box in an OOP system. You don't just build that kind of stuff into an OOP app because you "might need it someday". But it would be natural to have partition management in an RDBMS. Nobody would ever go, "hmmmm, I wonder why they added all that into this RDBMS". But it would look out of place in an OOP app.

Regarding examples being too small: It's usually possible, with enough thinking, to create a simpler demo that isolates the problem and only the problem. Now I agree that sometimes if you strip out all the side stuff, one may wonder why it was done that way. After a few question and response cycles, one may either start to agree with the reason for "doing it that way", or suggest alternatives that may have avoided the demo'd problem.

Example:

A: "We can't use the file system to store and access all the NY mug-shot photos. It cannot handle it, so we have to buy something expensive instead."

B: "What do you mean? What's happening?"

A: "If you put all 100,000 mug-shots in a single folder, the folder takes more than 15 minutes to open in Windows Explorer, and it's also rather slow to access from a program."

B: "Have you thought of using a hash to divide them into multiple folders?"

A: "How?"

B: "You use the booking number digits. Take the last 2 digits of the booking number, since they appear to be either sequential or random, and put each file into the corresponding folder. You'd have 100 folders, each named after all combinations of the last 2 digits. Thus, no one folder gets overly bulky. Here's a little script to save and retrieve a given photo..."

A: "Oh, okay. I didn't think of that. You are right, we can use the file system effectively."
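B's "little script" might look something like the following sketch. The base directory, the .jpg extension, and the helper names are invented for illustration; only the bucketing rule (last two digits of the booking number) comes from the dialog.

```python
# Sketch of B's script: derive a folder from the last two digits of the
# booking number so no one folder gets overly bulky (100 buckets total).
import os

def photo_path(base_dir, booking_number):
    """Map a booking number to base_dir/<last-2-digits>/<booking>.jpg."""
    bucket = str(booking_number)[-2:].zfill(2)  # pad short numbers, e.g. '05'
    return os.path.join(base_dir, bucket, "%s.jpg" % booking_number)

def save_photo(base_dir, booking_number, data):
    """Write a photo into its bucket folder, creating the folder on demand."""
    path = photo_path(base_dir, booking_number)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return path
```

Retrieval uses the same photo_path function, so no index or lookup table is needed: the booking number alone determines where the file lives.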

That's not a 'demo', nor does it exhibit a fundamental scaling problem (bad scaling for folders of certain FileSystems is an accident of implementation, not a problem in B-tree based filesystems). I suspect you'll have a much harder time coming up with a 'demo' for scaling problems that are more fundamental, and any attempt to explain why it is fundamental will necessarily move into an argument based on known truths and logic. I'm relatively strong with logic, so I'd rather jump straight to the meat. But I'll try to remember that artificial dialog and logically ineffectual demonstrations help you grok things.

When there are communication problems, it's generally wise to try different approaches rather than simply repeat the same technique louder.


Is the WWW Object Oriented?

This question is interesting enough to deserve its own page. Content moved to ObjectOrientedInternet.


Pooh with small voice says: "To have a scalable OO Internet, you need a UnifiedDataModel."


EditText of this page (last edited May 28, 2013) or FindPage with title or text search