Hundreds of billions are spent on software development. Software runs the world. But why are there not more realistic tests being done by research institutions and universities? This is the age of science, not the age of anecdotes. Should so much be decided on such scanty evidence? That's not rational. There's no excuse for such mass avoidance of testing.
I suspect GreekSyndrome? is at work here (see EvidenceEras). Such testing is "boring" from an academic standpoint: it's not exploring some slick, catchy equation or BigIdea, but messy, tedious coding and counting.
For example, how about developing a representative sample application in both OOP and non-OOP styles, say a college class and grade tracking system, and then throwing several realistic change-scenarios at both code bases to see which one is easiest to change in terms of the modules, functions, and lines of code that need changing. (I suggest using one heavily-typed language and one lightly-typed language.) Theory and anecdotes only go so far. Time to science up and RaceTheDamnedCar.
(Even if it doesn't settle anything, it will at least provide a framework for discussing and comparing code change metrics.)
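To make the proposal concrete, here is a minimal, hypothetical sketch (Python, chosen only for brevity; the class names and requirement are invented for illustration) of the same tiny grade-tracking requirement written once in an OOP style and once in a procedural style. The study would then apply each change scenario to both versions and count the modules, functions, and lines touched.

 # Hypothetical illustration only: the same requirement ("record grades,
 # report a student's average") in two styles; all names are invented.

 # --- OOP version ---
 class Student:
     def __init__(self, name):
         self.name = name
         self.grades = []          # list of (course, score) pairs

     def add_grade(self, course, score):
         self.grades.append((course, score))

     def average(self):
         return sum(score for _, score in self.grades) / len(self.grades)

 # --- Procedural (non-OOP) version ---
 # Grades held in a plain dict keyed by student name.
 def add_grade(gradebook, name, course, score):
     gradebook.setdefault(name, []).append((course, score))

 def average(gradebook, name):
     scores = [score for _, score in gradebook[name]]
     return sum(scores) / len(scores)

 # Example change scenario: "also record the semester of each grade."
 # The study would count how many modules, functions, and lines each
 # version must touch to satisfy the new requirement.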
That sounds like innumerable final-year undergraduate "information systems" dissertations. Fun stuff for students, but fraught with difficulties for any sort of serious or genuinely meaningful research, and the outcomes are of little interest (i.e., unlikely to be funded). It's too much like doing "serious" research to determine which is "better", a hammer or a screwdriver.
- I'm not sure what you mean by "of little interest". Do you mean of little interest to academics, to industry, or to both? A good many managers want empirical data about tools. Are you making a statement of value, or an observation/estimate about interest?
- Of little interest to entities that fund research... Funding is typically based on the desire to obtain valuable intellectual property, which can be used to create innovative and/or competitive products or technology. Comparing product <x> to product <y>, especially when such comparisons are inevitably contentious, is unlikely to be profitable. Inventing a new product, method, or algorithm is potentially profitable.
- I generally agree with that about the United States, but why aren't other less corporate-polluted countries interested in empirical comparisons? They spend gobs on software also.
- What countries do you have in mind? Also, the problem with empirical product/paradigm comparisons, from a "serious" research point of view, is that they are inevitably subject to a trivial category of criticism: Someone can always find another "Ah, but your comparison didn't take into account <x>!" Kinda like the debates that go on here...
- Of course there will be "but if"s. A perfect empirical study is impossible. But by making the details public, at least one can see what's going on and adjust accordingly. Further, it's a learning process: the testing lessons from each round can be carried into the next round, tool, or paradigm. If our ability to make tools greatly exceeds our ability to test them, then we as a society could be wasting trillions. Is it really more rational to fly blind just because testing is kind of hard? The industry's ability to push fads and gimmicks to sell more boxes is growing ever more powerful. We need ways to separate the cream from the crap.
- It's not that testing is hard. In fact, it's easy to set up trivial experiments comparing, say, language <x> with language <y>: "Here we have the same program written in <x> and <y>. It doesn't have functionality <p>. Change it so it does <p>, then answer this survey." The problem is getting funding to do it, and standing up to the inevitable barrage of "yeah, but"s. Yeah, but the programmers didn't have the same prior experience of <x> and <y>. Yeah, but the survey didn't ask <q>. Yeah, but functionality <p> is innately biased toward/against <x>. Yeah, but <x> is known to be crap; you should have used <z>. Yeah, but <y> is known to be crap; you should have used <z>. Continue ad infinitum... Students often do studies like these for final-year projects, gaining marks for each "yeah, but" they've anticipated and addressed, or at least discussed. But for "serious" research, it just weakens confidence in the results having any meaning. So, researchers tend to target their efforts in much more fruitful and satisfying directions, and the cream is separated from the crap by subjective experience. E.g., "I tried <x> and I like it so I use it."
- {Don't forget: "Yeah, but you didn't actually start with the 'same program'." Only one property can be held the same between languages, and that is requirements. Thus, you must account for variation within each language, which is a difficult problem.}
- See, Top? It's begun, and I haven't even started the experiment.
- Sorry, I'm not understanding the complaint.
- If you write the "same program" in two different languages, then it's not really the "same program", is it? It's two programs, which both implement the same requirements. Therefore, how can you be sure that they are equivalent?
- If a bug is discovered, you fix it. Hopefully it won't have a significant effect on the study results. If it does, then release Study 1.10. Every research project has the risk of human error. Assistant Bob accidentally farts near a bacteria test-tube. Be careful, but bleep happens.
- {Bugs are a separate issue. The issue raised above is, in layman's terms, that you might be comparing 'good' code in language <x> to 'bad' code in language <y>, or vice versa. There is a lot of variation within each language. For the experiment to be valid, you must (somehow) methodically ensure that the initial program samples are representative. Alternatively, you can scale the experiment up by a few orders of magnitude: ask 100 developers from each language to write programs meeting the requirements, then for each test select a program randomly from this pool... and run two orders of magnitude more tests.}
- As addressed below, a best effort would be made to obtain "experts" in each technique, language, and/or paradigm. It won't be random developers using random styles.
- {That's naive. Even if you restricted the experiment to "experts", the issue described here will apply.}
- Nothing will make everybody 100% happy. That's an unrealistic goal. A "best attempt" is made within the given resource allowance. Yes, science can be hard. And fans of a given technique can go ahead and run the same tests on their own source code if they are motivated. True, they'll have advance knowledge of the changes, unlike the original code sets, but if they are gaming the changes, then new changes can hopefully be added to plug the cheat patterns. With enough change scenarios, the risk of fudging grows smaller. In this respect, it's similar to speed benchmarks: the more numerous and varied the benchmarks, the harder it is to game (cheat) your product. Past a certain quantity and variety of tests, tuning for the benchmarks will result in equivalent real-world improvements anyhow. There's a pattern here: the more effort you spend on the research project, the better the results.
- {s/effort/money.}
- The world probably spends something on the order of $100 billion annually on software maintenance. A $500-million study is a drop in the bucket in comparison. Is it better to make billion-dollar decisions on your boss's gut feel? Is that the rational path? Would Spock approve? Are headlights too expensive, such that we should drive by braille instead?
- Who would you approach for the $500-million?
- Fruitful for whom? It sounds like the research system is broken, at least for software design research. Doing nothing because you fear yeah-but's is institutional cowardice. Collect all the yeah-but's, rank them, and address the top vote-getters in the next round.
- Fruitful for the researcher, of course, or for his or her boss. It ultimately comes down to funding. If you (or anyone else) are willing to fund ongoing research into the "best" programming languages/paradigms/tools, I'd be happy to conduct that research.
- You are probably in a better position to propose and "sell" such an idea than I am. Who knows, it may spark an entire industry in empirical testing. The low-hanging theoretical research fruit is probably already picked anyhow, making the often-ignored empirical field more inviting. -t
In 2000 I worried that computers in 2010 would be so easy to use, nobody would need programmers any more.
Fat chance, huh!? --PhlIp
I was told in 1985 that programming was a dead end, and that hardware repair was the place to be --PeteHardie
I was told in 1987 that nobody can predict the future. That's the one that was right. As for programming in general, the 1985 people may have had a point. -t
I wasn't warned that I'd have to leave programming to advance - I was warned that "soon" all programs would be written by machines
I'd like to see how such machines deal with users who don't know what the h*ll they really want. I suppose if it can spend all its cycles redoing everything in an endless trial-and-error loop, it could pull it off.
Had a bad day, Top?
Just the usual: flaky users who gum things up for the sheer power-trip of being able to gum things up. Drama-addicts, we sometimes call them.
I bet they like provoking nerds.
If so, it works.
If you wish to sell your ideas, TopMind, perhaps you should write some papers or attend some conferences on these subjects. EvaluationAndUsabilityOfProgrammingLanguagesAndTools? (PLATEAU) is getting together 2010 October 18. http://ecs.victoria.ac.nz/Events/PLATEAU/WebHome
Reno? Like they will get any actual work done, he he. I guess SoftwareDevelopmentIsGambling ;-)
One town is very like another when your head's down over your parentheses, brother.
{Conversely, you can find booze, gambling and hookers everywhere.}
Addressing Specific Issues
- "Yeah, but functionality <p> is innately biased toward/against <x>."
- Those who make the requirements perhaps should have no knowledge of the languages/paradigms used. This is similar to the "double-blind" techniques of medical research.
- "Yeah, but <x> is known to be crap; you should have used <z>."
- Of course everyone will inject their pet technique into the result. That's not necessarily a bad thing: studies that raise new questions and get people thinking are good things even if they don't produce a final answer. They can use the study to show how their technique would fare better under a given change. Anyhow, a group of experienced practitioners known to produce successful products in tool <x> should ideally be selected for the design and coding of a given test app. That's a reasonable way to handle it. In other words, <x> has to be shown to work at least some of the time. And it would be a common technique. It's a poor allocation of resources to focus on rare techniques on the first go.
- Any study of paradigm/technique A against paradigm/technique B is going to require picking a representative of A and B. That's a silly complaint because it would mean that one cannot or should not compare A to B, ever, meaning we keep living with anecdotes and HolyWars. Comparisons between, say, gas and hybrid or electric vehicles use actual models rather than all possible theoretical designs.
- To reduce the chance of such complaints, perhaps at least 3 independent groups produce code for each paradigm/technique being measured.
- What makes you think that will reduce such complaints?
- Some change scenarios will be considered far more or less common than others in practice, or will vary per domain.
- The study would publish the weightings used for the final scores, but the reader can plug in their own weights to better fit their environment or observations. Thus, they can downplay change scenarios they deem unlikely; a rough sketch of such re-weighting follows below. (Related: PerceptionOfChange)
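Here is a small, hypothetical sketch of that re-weighting (Python; the scenario names, impact scores, and weights are invented, not study data): each change scenario gets a raw change-impact score, the study publishes default weights, and a reader can substitute weights that match their own domain.

 # Hypothetical illustration of reader-adjustable weightings.
 def weighted_score(impact, weights):
     """Combine per-scenario change-impact scores using the given weights."""
     total_weight = sum(weights.values())
     return sum(impact[s] * w for s, w in weights.items()) / total_weight

 # Lines touched per change scenario for one candidate code base (made-up numbers):
 impact_oop = {"add_field": 12, "new_report": 30, "swap_database": 55}

 # Weights published with the study...
 study_weights = {"add_field": 3, "new_report": 2, "swap_database": 1}
 # ...and a reader who never expects to swap databases re-weights accordingly:
 reader_weights = {"add_field": 3, "new_report": 2, "swap_database": 0}

 print(weighted_score(impact_oop, study_weights))   # the study's published figure
 print(weighted_score(impact_oop, reader_weights))  # the reader's adjusted figure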
I see it as even more fundamental and less scientific than one would realize. Let us say the requirements are fair (none of the languages in question have large portions of the problem domain in their standard libraries) and are generated by a double-blind rig so that the requirements writers cannot choose to favor a particular language, and furthermore the original samples (the program before the new feature) are written with the best techniques available and the best design possible. In that case, the best you can hope for is a battle of champions, so whatever language has the best champion will win, regardless of the merits of the language in question.
I was once preparing to tackle this problem in a Master's-level CS class. The problem is intractable due to various difficulties, including the test rig and champion ability.
I don't view it so much as a contest, but more as a way to provide a framework to compare, isolate the reasons for differences, and suggest further studies. If there is a question of whether it is a paradigm advantage or a language advantage, then further tests can be done. Also, ideally, each paradigm is tested under multiple languages such that one can see whether any "winner" shines more per language or per paradigm. Since Lisp is a multi-paradigm language, perhaps it can be used as a kind of control language.
It's a starting point, and a step above anecdotes and holy-wars, which is the only technique used now. You seem to be saying that one shouldn't embark on a journey unless one is sure to find the destination.
See also: CodeChangeImpactAnalysis
CategoryRant, CategoryMetrics