(Based on material in ValueExistenceProofTwo)
(C/I = "Compiler or Interpreter")
I/O Profile is a working term (concept) I have been using for an equivalency metric, or test of equivalency, of two programs or compilers/interpreters in terms of their externally observable behavior, which generally excludes implementation specifics. I created this topic to factor the description and related questions into a single place. In theory, it checks that every instance of output matches byte-for-byte against the corresponding instance from every possible input source. In practice we use an approximation of "all", which is described later. For compiler/interpreter (C/I) testing, the source code and any system resources available to the "language", such as OS clocks, are usually also considered "input".
 // pseudo-code example IP-1A for interpreters or compilers
 for d = each possible data input file/stream
   for s = each possible source code input file
     x1 = outputFromRun(modelFoo(d, s))        // the target test model or system
     x2 = outputFromRun(actualLanguage(d, s))  // the standard or reference model or system
     if (x1 != x2) then
       write("Model Foo failed to match I/O Profile of actual language.")
       stop
     end if
   end for s
 end for d
 write("The I/O Profile of model Foo matches actual language.")
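To make that loop concrete, here is a rough Python sketch of the same comparison over finite corpora. The command names "modelfoo" and "reflang", the corpus directories, and the output_from_run() helper are illustrative assumptions, not any particular vendor's tools:

 # Sketch of IP-1A over finite corpora: compare a candidate C/I against a
 # reference C/I for every (source, data) pair and report the first mismatch.
 # Command names and directory layout are illustrative assumptions.
 import subprocess
 from pathlib import Path

 def output_from_run(interpreter_cmd, source_file, data_file):
     """Run one (source, data) pair and capture its externally observable output."""
     with open(data_file, "rb") as stdin:
         result = subprocess.run(
             [interpreter_cmd, str(source_file)],
             stdin=stdin,
             capture_output=True,
         )
     # Treat stdout bytes plus an errored-or-not flag as the observable output.
     return result.stdout, (result.returncode != 0)

 def matches_io_profile(sources, data_inputs):
     for d in data_inputs:
         for s in sources:
             x1 = output_from_run("modelfoo", s, d)   # the target test model or system
             x2 = output_from_run("reflang", s, d)    # the standard or reference
             if x1 != x2:
                 print(f"Model Foo failed to match I/O Profile: {s} with {d}")
                 return False
     print("The I/O Profile of model Foo matches the reference (for this corpus).")
     return True

 sources = sorted(Path("corpus/src").glob("*.src"))
 data_inputs = sorted(Path("corpus/data").glob("*.dat"))
 matches_io_profile(sources, data_inputs)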
In practice we cannot test every possible combination, so spot-checking (sampling) is done instead. This reflects the practical side of software QA testing, and of science in general: we cannot practically test every combination of input, any more than we can make every possible observation of every atomic particle that ever existed. We can approach the ideal asymptote of "full" testing, depending on the level of resources we wish to devote to testing (an economic decision), but we can never actually reach 100% testing. Thus, in practice it's not a "perfect" test, but so far a perfect yet practical alternative remains elusive. All non-trivial interpreters/compilers probably have bugs or "unexpected oddities" because of similar testing-economics patterns inherent in nature.
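One hedged way to do the spot-checking is to randomly sample (source, data) pairs up to some budget instead of enumerating every combination. A minimal sketch, assuming a caller-supplied pair-comparison function such as one built from the sketch above:

 # Spot-check sketch: sample a budgeted number of (source, data) pairs.
 # run_pair_matches(s, d) is any caller-supplied predicate that runs both
 # C/Is on the pair and reports whether their outputs matched.
 import itertools
 import random

 def spot_check(sources, data_inputs, run_pair_matches, budget=500, seed=42):
     random.seed(seed)                              # reproducible sample
     pairs = list(itertools.product(sources, data_inputs))
     sample = random.sample(pairs, min(budget, len(pairs)))
     failures = [(s, d) for s, d in sample if not run_pair_matches(s, d)]
     return failures                                # empty list means "no mismatch found"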
Also, it's not a test of machine speed or resource usage, although it could be modified for that, for example by putting upper limits on execution time or resource consumption for what counts as "equivalent" per target usage.
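If a shop did want to fold run time into "equivalent", one option is a per-run time limit; the 30-second figure below is an arbitrary, shop-specific assumption:

 # Optional resource check: treat a run that exceeds a time budget as
 # non-equivalent for our purposes, even if its bytes would eventually match.
 import subprocess

 def output_from_run_with_limit(interpreter_cmd, source_file, data_file, limit_sec=30):
     with open(data_file, "rb") as stdin:
         try:
             result = subprocess.run(
                 [interpreter_cmd, str(source_file)],
                 stdin=stdin,
                 capture_output=True,
                 timeout=limit_sec,
             )
         except subprocess.TimeoutExpired:
             return None   # caller treats None as "not equivalent per target usage"
     return result.stdout, (result.returncode != 0)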
Further, pass/fail doesn't have to be Boolean. It could be modified to produce an "equivalency score" based on, say, the number of failed I/O instances and/or the number of differences in the output streams. But what's considered GoodEnough depends on intended use. For example, two compilers for the "same" language may produce slightly different floating-point results for some computations, but a language user may not consider that sufficient grounds to say they are not the "same" language.
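As a rough illustration of such a score (the formula is an arbitrary choice, not a recommendation):

 # Hedged sketch of an "equivalency score": the fraction of tested (source,
 # data) pairs whose outputs matched. A shop could refine this to also weight
 # by how large each output difference was.
 def equivalency_score(results):
     """results: list of (x1, x2) output pairs from corresponding runs."""
     if not results:
         return 1.0
     matched = sum(1 for x1, x2 in results if x1 == x2)
     return matched / len(results)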
Error messages often differ per brand or major version, which can complicate testing. To simplify testing, generally the mere existence of an error message can be considered equivalent regardless of the content of the messages. Thus, if vendor A's interpreter triggers an error for input set X and vendor B's interpreter also produces an error message for input set X (and matches any output before the "crash"), then we would consider that test instance "equivalent". (Most of the common languages have unfriendly error descriptions anyhow. The quality of error text could be considered a "soft criterion".)
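A minimal sketch of that relaxation, assuming each run is represented as (output bytes before any crash, errored-or-not flag) as in the earlier sketch:

 # Sketch: two runs are "equivalent" if the output produced before any crash
 # matches and both runs either errored or both succeeded, ignoring the
 # wording of the error messages themselves.
 def runs_equivalent(run_a, run_b):
     output_a, errored_a = run_a
     output_b, errored_b = run_b
     return output_a == output_b and errored_a == errored_b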
--top (AKA TopMind)
Diagram of Model Testing Parts for Interpreters/Compilers
 [1. Data (bytes), optional]
             |
             |
             v
       [2. Model Foo] --------> [3. Output (bytes)]
             ^
             |
             |
 [4. Source code (bytes)]
Discussion
{Note that d and s are infinite sets, which precludes using the above for any practical purpose.}
The "spot-checking" paragraph covers that. It is indeed practical and indeed is the very way production software is often QA'd. Spot checking is a large part of what professional testers do, for a practical living. Perhaps what you meant is that "testing every case is not practical". While true, we don't have to test every case to get a useful result from such a comparison tool. In practice we often live with an approximation. The alternatives have not proven themselves in practice for non-trivial tools.
Ponder this. You are a professional IBM compiler tester and IBM is releasing their version of C-sharp. What's the best way to verify the IBM version of C-sharp "runs properly" in terms of running existing source code written for MS's C-sharp compiler?
{Your claim is that your I/O Profile can test language equivalence. It can't, not as you've written it, because your description defines d and s as infinite sets, such that the supposed "I/O Profile" cannot exist. In practice, verifying a compiler is done using standard ValidationSuites -- e.g., http://www.plumhall.com/ZVS031_mp.pdf -- and by running all your existing code on it and re-running all the UnitTests if you've got them. That still doesn't prove, or even demonstrate, equivalency between compilers. At best, it merely demonstrates that your new compiler probably works on some of your existing code. (In practice, you chase obscure bugs for years; I speak from grim experience.)}
In practice we use an approximation of the ideal "I/O Profile", perhaps using unit tests and existing production applications from various "old" C/I users. I'll try to rework the wording a bit [in progress]. And while unit tests are helpful and useful, if you are trying to match another existing compiler/interpreter, such tests should also be run against the original C/I, creating the very kind of code samples and test setup that an I/O Profile typically uses.
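For example, here is a hedged sketch of capturing the original C/I's outputs as reference ("golden") files for an existing corpus, so a replacement C/I can later be compared against them. The run_reference and run_candidate callables, directory names, and file naming are assumptions:

 # Sketch: record the original C/I's outputs for a test corpus, then compare
 # a candidate C/I against those recorded outputs. run_reference(s, d) and
 # run_candidate(s, d) are caller-supplied functions returning output bytes.
 from pathlib import Path

 def golden_path(golden_dir, source_file, data_file):
     return Path(golden_dir) / f"{Path(source_file).stem}__{Path(data_file).stem}.out"

 def record_golden_outputs(run_reference, sources, data_inputs, golden_dir="golden"):
     Path(golden_dir).mkdir(exist_ok=True)
     for s in sources:
         for d in data_inputs:
             golden_path(golden_dir, s, d).write_bytes(run_reference(s, d))

 def check_against_golden(run_candidate, sources, data_inputs, golden_dir="golden"):
     mismatches = []
     for s in sources:
         for d in data_inputs:
             if run_candidate(s, d) != golden_path(golden_dir, s, d).read_bytes():
                 mismatches.append((s, d))
     return mismatches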
{What, in practice, do we use "an approximation of the ideal 'I/O Profile'" for?}
That was already explained. If it's not clear to you, I have insufficient information about where your reading diverged from my intent, so I can't correct any specific misunderstanding. I've reviewed it and am adding a few more specifics. I hope that helps.
{As for running unit tests "against the original C/I [compiler / interpreter?]", I don't know what that would demonstrate other than what we already know: All the tests pass. The "I/O Profile", therefore, would consist only of "true, true, true, true, true, true, true, true, true, true, true, true ..."}
How do you know the existing behavior of another C/I until you test? Please don't tell me you trust your personal interpretation of the manual enough that you skip empirical testing. That would be silly.
{So far, your "I/O Profile" seems to be nothing more than some source code recognised by a compiler or interpreter, along with some input and output of the resulting programs. I don't know what we're intended to do with that, other than it sounds like a rather obfuscated way of saying, "Here are a bunch of programming examples for language <x>." Examples are obviously useful, but I don't know that calling them an "I/O Profile" helps anyone.}
No, it's not about libraries of examples. I don't know where you got that idea from. It's comparing two or more compilers/interpreters to ensure they work the same (produce the same results given the same input and source instances). If you can recommend a different or better approach to compare two or more compilers/interpreters to ensure they "work the same", please enlighten us.
{Apparently, you pass the "I/O Profile" through two or more compilers/interpreters as per your "Diagram of Model Testing Parts for Interpreters/Compilers", and if they exhibit identical behaviour, you consider them equivalent. Is that not the case? If so, the "I/O Profile" itself is obviously a collection of examples. It's a language test suite, like that at http://www.plumhall.com/ZVS031_mp.pdf }
Collections of samples don't test themselves. It's a process, not merely a library of files. And the code "samples" may be actual application code bases, not just testing snippets. Input instances/samples are indeed part of the "system", but it's NOT just a static library of files. Thus, "a bunch of programming examples for language <x>" is misleading. Stop writing misleading stuff.
{How is it misleading? That's precisely what it appears to be -- a library of files.}
What is I/O?
There can be disputes or fuzzy areas in terms of what is considered input and output. Generally, file I/O, such as that used by READ and WRITE statements, is not a source of dispute, and is often the easiest to test. However, binary transfers and debugging I/O may be subject to controversy or be important only in certain situations. For example, vendor A's version of an interpreter may internally store numeric bytes high-to-low (left-to-right), while vendor B's interpreter does the reverse. If, say, an ODBC driver grabbed the bytes as-is from RAM, then vendor B's interpreter will have a different I/O profile than A's if we assume that kind of direct access is part of I/O. (And the ODBC drivers won't work as expected in one of them.) -t
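A small Python illustration of that byte-order point (the value and formats are arbitrary examples):

 # Illustration: the same 32-bit integer stored high-to-low (big-endian) vs.
 # low-to-high (little-endian) yields different raw bytes, so a tool that reads
 # memory or disk bytes directly sees a difference even when READ/WRITE-level
 # output is identical.
 import struct

 value = 1234567890
 high_to_low = struct.pack(">i", value)   # "vendor A" internal layout
 low_to_high = struct.pack("<i", value)   # "vendor B" internal layout
 print(high_to_low.hex())   # 499602d2
 print(low_to_high.hex())   # d2029649

 # Both layouts decode back to the same number with the matching format:
 assert struct.unpack(">i", high_to_low)[0] == struct.unpack("<i", low_to_high)[0] == value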
Generally, in comparing tools, the intended use and environment of the tool matter. For example, you are in a hardware store and see two hammer brands, A and B. They are the same weight and price. But you notice this on one of the tags: "Not designed for use on Mars". (Suppose the store is near a NASA building, such that those things matter to some.) Since you never expect to go to Mars, that difference is not a difference to you. For your intended use, hammers A and B are "equivalent".
Byte-for-Byte and Literalness
(Based on material from TypeDefinitionsSmellBadly)
[In the IoProfile page, you state that the comparisons are byte-for-byte. Is it really typical for the developers you know to do byte-for-byte comparison of compiler output? Is it really typical for the developers you know to use anything that could reasonably be called an "approximation to all" inputs when trying out a new compiler?]
"Byte-for-byte" is the ideal, but [typical shops] are using an approximation of IoProfile. They may merely eyeball screens to make sure they "look right" as an approximation of "byte for byte". Generally, if it doesn't match byte-for-byte, problems that matter are discovered, with the possible exception of slight floating-point-related differences for non-monetary apps. "Does X match Y within shop-acceptable tolerances" may be another way to state it. The algorithm I gave is an illustration simplified for easy reader digestion, but not necessarily intended to be taken 100% literally. It can be "refined" per shop standards/expectations. For example:
Before:
  if (x1 != x2) then   // per original algorithm at top of page
After:
  if (! similarEnoughForOurShop(x1, x2)) then
In practice, a human will probably be making the final judgement call. Automated tests may perhaps identify differences to be investigated, but human judgement will analyze each one to see if it matters or not to that shop. For example, if report X under compiler A displays a ratio as 1.38372634, but report X under compiler B displays 1.38372635 (last digit different), then a given shop may say, "that difference doesn't matter to us". (The displayed value should probably have been rounded to 4 or so decimal places, but that's an application design issue.)
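A hedged sketch of what a shop-specific similarEnoughForOurShop() might look like; the relative tolerance and the line-by-line numeric handling are assumptions a real shop would tune for itself:

 # Sketch: a shop-tunable comparison that accepts exact matches, or outputs
 # whose only differences are numeric values agreeing within a chosen
 # relative tolerance. Non-numeric text must still match exactly.
 import math
 import re

 NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

 def similar_enough_for_our_shop(x1: bytes, x2: bytes, rel_tol=1e-7):
     if x1 == x2:
         return True
     lines1 = x1.decode(errors="replace").splitlines()
     lines2 = x2.decode(errors="replace").splitlines()
     if len(lines1) != len(lines2):
         return False
     for a, b in zip(lines1, lines2):
         nums_a, nums_b = NUMBER.findall(a), NUMBER.findall(b)
         # The non-numeric parts of each line must match exactly.
         if NUMBER.sub("#", a) != NUMBER.sub("#", b) or len(nums_a) != len(nums_b):
             return False
         for na, nb in zip(nums_a, nums_b):
             if not math.isclose(float(na), float(nb), rel_tol=rel_tol):
                 return False
     return True

Under this illustrative tolerance, the 1.38372634 vs. 1.38372635 example above would be reported as similar enough, while any non-numeric text difference would not.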
I've done something similar for interpreter "rewrites" from vendors, to check that they still ran our shop apps, and this included checking the Web, trade magazines, and bulletin boards (some of this was before the Web) for related complaints and opinions from others. Thus, one is not just using shop apps to judge the matching accuracy. For example, dBASE rewrote their interpreter foundation between version III+ and IV from assembler to C, and ColdFusion rewrote their interpreter foundation from C++ to Java around 2000. Based on in-shop tests and outside opinions, early versions of the rewrites were buggy, but they eventually got pretty close to matching the originals. Minor app-code tweaks had to be made for some apps, but nobody would consider such tweaks sufficient grounds for calling them "different languages" in a general sense. (In one case, a binary-level plug-in didn't work on the newer version, but a replacement was found. See "What is I/O" above for a note about binary-level integration compatibility.)
And typically shops do a smaller-scale version of the same thing for minor interpreter/compiler version upgrades: test your apps against the new version by comparing output against expectations (per gray-matter memory and/or as documented). -t
Your IoProfile, as described above, requires "each possible data input file/stream" and "each possible source code input file". These are infinite. Thus, running a validation suite -- which is necessarily finite -- on your interpreters or compilers is, by your definition, not an approximation of an IoProfile. Any finite approximation of infinity (which is itself a contradiction -- more accurate would be to describe "any finite approximation of infinity" as "any finite quantity") is infinitely less than infinity. Anyway, use of validation suites is nothing new; why describe it as an "I/O Profile"?
When I use IoProfile, I generally mean a "practical approximation of IoProfile". As far as finding a better name, let's propose some and evaluate the merit of each. I personally think IoProfile is a decent and nicely-compact description of the concept, but WetWare varies. I'd like feedback from about 3 other WikiZens before changing it. "Validation suite" is too open-ended, and I'm not necessarily referring to a "suite". For the rougher approximations, it could be somebody familiar with an app eyeballing screens, so it's not about a particular software product or even a general category of software products. It's a concept, not software.
In practice, these are often used together to verify that an IoProfile approximation fits expectations:
 * validation suites
 * unit tests, run against both the original and the new C/I
 * existing production application code bases and their typical inputs
 * eyeballing of screens/reports by people familiar with the apps
 * outside reports of problems (trade press, forums, bulletin boards)
See also: ScienceAndTools