I'm completely stuck. Aside from TestPatterns, how do you seriously RegressionTest a graphical library? Often, different algorithms yield similar but distinct outputs, all of which are acceptable. Consider Bresenham's line algorithm: if you set the initial error to 0, you get a different line than if you start with an error of 1/2. While the latter is "more correct", it really just doesn't matter, and it's faster to set the error to 0. So human inspection seems the only real way, but human inspection is slow, tedious, and error-prone.
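To make the ambiguity concrete, here is a small floating-point sketch of a first-octant line stepper with the initial error exposed as a parameter. It is illustration only, not anybody's production rasterizer; line_pixels and init_error are made-up names.

 def line_pixels(x0, y0, x1, y1, init_error=0.5):
     """Naive Bresenham-style stepper for slopes in [0, 1].

     init_error=0.5 gives the "more correct" line (steps halfway between
     pixel centres); init_error=0.0 trips the y step one pixel later on
     some lines.  Both results look like the same line on screen."""
     dx, dy = x1 - x0, y1 - y0
     assert dx > 0 and 0 <= dy <= dx          # sketch: first octant only
     pixels, y, err = [], y0, init_error
     for x in range(x0, x1 + 1):
         pixels.append((x, y))
         err += dy / dx
         if err >= 1.0:
             y += 1
             err -= 1.0
     return pixels

 print(line_pixels(0, 0, 4, 2, init_error=0.5))   # [(0,0), (1,1), (2,1), (3,2), (4,2)]
 print(line_pixels(0, 0, 4, 2, init_error=0.0))   # [(0,0), (1,0), (2,1), (3,1), (4,2)]

Both are perfectly acceptable lines from (0,0) to (4,2); an exact-match regression test would reject one of them.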
One solution is to not use 800-pixel wide displays. Say, maybe 40x40 displays so errors are screechingly obvious. However, for widgets like toolbars this isn't practical. I usually resort to squinting and using bright red orthogonal lines for comparison. -- SunirShah
What about having tests that tell us whenever there is a visual difference of any significance, and leaving it up to the tester to decide if the new one is errant?
What does "of any significance" exactly mean? What about off-by-one errors on the endpoints? I think I have an answer. Use fuzzy logic. Make some pixels "absolutely required" and "absolutely not required". Make others "mediumly required" and "mediumly not required" If the absolutely required pixels are wrong, the whole test fails. Otherwise, the test's pass rating is equivalent to the percentage of mediumly (not) required pixels correct. -- SunirShah
How many people (aside from me ;) notice off-by-one pixel errors on an 800-pixel-wide display?
A past boss, a civil engineer working with water, noticed that all the wonderful plots and graphs we had, showing the water levels of certain wells, were off by a cumulative error of 0.06844626967830253251% ... seems someone forgot to take into account that every four years there's an extra day. I have no idea how he spotted that, even on the 10-foot plots. I had to fix the Fortran, which engineers seem to love (bless 'em).
I believe that even in the case of drawing a line, it is ultimately a good idea to specify exactly what you want, even if you think a small difference wouldn't matter. DonKnuth makes this point in Chapter 4 of TAOCP, where he talks about implementing floating point and argues that it is important to get it "right", even though FP arithmetic is explicitly about accepting some inexactness. -- StephanHouben
It's totally bogus to specify exactly what you want because the reality is that we want something small and fast that looks generally like a line. Whether the error gradient trips one pixel earlier than it should is mostly irrelevant. Very unlike accounting, graphics is a mostly analog domain. -- SunirShah
The only reason even one bit would ever change is if you change the algorithm. Then, one examination of the graphical differences, by hand, would tell you if the new results are "OK" or not, and you can go on. If the output changes more often, then consider a more sophisticated comparison algorithm.
Imagine this scenario: each test renders into a buffer and compares it, bit for bit, against a stored reference buffer. When they differ, the framework shows the tester both images (and a diff), and the tester either accepts the new output as the reference or flags a regression.
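In Python-flavoured pseudocode the loop might look something like this; the golden/ directory, check_against_reference, and the pickled-buffer format are all placeholders for whatever the real framework would use:

 import os, pickle

 def check_against_reference(name, buffer, interactive=True):
     """Compare a rendered buffer against its stored reference; on a
     mismatch, let the tester bless the new output or report a failure."""
     os.makedirs("golden", exist_ok=True)
     ref_path = os.path.join("golden", name + ".ref")
     if not os.path.exists(ref_path):
         with open(ref_path, "wb") as f:           # first run: record the output
             pickle.dump(buffer, f)
         return True
     with open(ref_path, "rb") as f:
         reference = pickle.load(f)
     if buffer == reference:
         return True
     # ...here the real framework would save both buffers as images,
     # pop up a viewer, and show a highlighted diff...
     if interactive and input("Output of %r changed; accept it? [y/N] " % name) == "y":
         with open(ref_path, "wb") as f:
             pickle.dump(buffer, f)
         return True
     return False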
What do you think?
You're still losing. The output buffer is likely to change if the algorithm changes, especially for non-trivial shapes like Beziers. The choice then isn't one of replacing the old buffer or not, but of stepping through the algorithm by hand or not. In other words, the tests don't get you very far (and are thus useless compared to other techniques). However, I'm interested in this fuzzy evaluation of the buffers. That seems promising. Over time, you can change the rating of the pixels in the buffer as you gain more knowledge. This means you get the same accumulation of knowledge in the graphical tests as you do with regular tests. I'll keep y'all posted. -- SunirShah
It seems that we agree to TestEverythingThatCouldPossiblyBreak. Where we disagree is in our answer to the question: can such code PossiblyBreak?
No, it can't: code that does nothing more than draw a line is too simple to break.
Much later than when I originally formulated the question: I'm finally working on the ScreenScraper right now. A line algorithm is not trivial, trust me. There are at least 16 "styles" of lines when you fully combine all the options we support; probably more that I've missed counting. Plus, we have to clip the line correctly in all cases. Plus, it has to be (reasonably) accurate in all cases. A line is the least complicated of these, but it's an adequate example. -- SunirShah
Writing a better ScreenScraper.
After looking at the existing framework, which dumps small buffers as text, I've decided that there's a reason we don't edit pictures as text files. I figure it's very handy to be able to load and save the scrapings as image files. Something simple like BMPs or PCXs should do. Bonus: if we have a good small image loader, we can use it later for production instead of the current bloated PNG library (~50kb).
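For what it's worth, an uncompressed 24-bit BMP dump really is only a screenful of code. A sketch (save_bmp and the flat, top-down RGB byte buffer are conventions made up for this example, not the framework's actual interface):

 import struct

 def save_bmp(path, width, height, rgb):
     """Write a 24-bit uncompressed BMP from a flat bytes object of
     R,G,B triples, top row first."""
     row_pad = (-width * 3) % 4                     # BMP rows are padded to 4 bytes
     image_size = (width * 3 + row_pad) * height
     with open(path, "wb") as f:
         f.write(b"BM")
         f.write(struct.pack("<IHHI", 14 + 40 + image_size, 0, 0, 14 + 40))
         f.write(struct.pack("<IiiHHIIiiII", 40, width, height, 1, 24,
                             0, image_size, 2835, 2835, 0, 0))    # BITMAPINFOHEADER
         for y in range(height - 1, -1, -1):        # BMP stores rows bottom-up
             row = rgb[y * width * 3:(y + 1) * width * 3]
             f.write(b"".join(row[i:i + 3][::-1]    # RGB -> BGR
                              for i in range(0, len(row), 3)))
             f.write(b"\x00" * row_pad)

 # a 2x2 scraping: red, green on the top row; blue, white on the bottom
 save_bmp("scrape.bmp", 2, 2,
          bytes([255, 0, 0,  0, 255, 0,  0, 0, 255,  255, 255, 255]))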
Given a nice image format, one can use one's favourite handy-dandy image editor to inspect and edit the scrapings.
Semi-automatic idea.
I develop 3d software, mainly for games and such, and I have always wanted to get into automatic unit testing. But I have always found too many problems with testing graphics systems at a higher level; that is, testing that a model is displayed correctly and so on.
For me, it seems almost impossible to make good, solid, automatic unit tests for this. The discussion above also recognizes the difficulty. However, a semi-automatic approach could be arranged. By this, I mean that when executing unit tests, some tests need manual feedback. This should be supported by the unit testing framework: when it executes a test that needs manual feedback, it should prompt the user for it.
For example:
I would like to test that my 3d engine can display a basic textured model.
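(In a Python/unittest-flavoured sketch, such a test could read roughly like this; render_textured_model, show_image, and ask_tester are stand-ins for whatever the engine and framework actually provide, stubbed out here so the example runs.)

 import unittest

 # Stubs so the sketch is self-contained; a real engine would render and
 # display an actual frame here.
 def render_textured_model(mesh, texture):
     return "<frame buffer for %s + %s>" % (mesh, texture)

 def show_image(frame):
     print("displaying:", frame)

 def ask_tester(question):
     """Crudest possible manual-feedback hook: ask on the console."""
     return input(question + " [y/n] ").strip().lower().startswith("y")

 class TexturedModelTest(unittest.TestCase):
     def test_basic_textured_model(self):
         frame = render_textured_model("crate.obj", "crate.png")
         show_image(frame)
         # The test blocks here; the human verdict is recorded alongside
         # the automatic results.
         self.assertTrue(ask_tester("Does the crate look right?"),
                         "tester rejected the rendered model")

 if __name__ == "__main__":
     unittest.main()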
This technique would require a fairly advanced unit testing framework with graphical capabilities and also some effort from the test runner. However, it would save some time, at least for me.
What do you think of such an approach?
-- hObbE
Another semi-automatic idea.
What about doing some sort of reference rendering (either in your application itself, under very close manual scrutiny, or in some sort of well-proven third-party package like 3ds max), displaying that image on the right side of the screen, and then calling your application to render what should be the exact same thing on the left? Or, for a bit more Turing-test flavor, have the program randomly switch which side is which to keep the operator on their toes. It shouldn't be a difficult test for any person to do, as comparison is one of the things the brain does best, and it should still give fairly obvious indications of, say, when you accidentally broke texture loading and started rendering everything upside-down.
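A possible sketch of that, assuming the Pillow imaging library is available; side_by_side_check and the console prompt are invented for illustration:

 import random
 from PIL import Image                      # Pillow, assumed available

 def side_by_side_check(reference_path, test_path):
     """Show the reference and the fresh rendering side by side in a
     random order, so the operator can't just favour "the old one"."""
     ref, new = Image.open(reference_path), Image.open(test_path)
     left, right = (ref, new) if random.random() < 0.5 else (new, ref)
     (lw, lh), (rw, rh) = left.size, right.size
     board = Image.new("RGB", (lw + rw, max(lh, rh)))
     board.paste(left, (0, 0))
     board.paste(right, (lw, 0))
     board.show()                           # opens the default image viewer
     return input("Do the two halves look the same? [y/n] ").lower().startswith("y")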
I find it ironic in a fun sort of way that we're considering using FuzzyLogic to check graphical output since "fuzzy" is a graphical metaphor. It seems to me, though, that it could quite literally work. Let's say we compare a test output image to a reference image, literally blurring the images to some extent before comparison, then assigning a maximum error threshold for comparison of each pixel. Pixels near a boundary tend toward some average between the values on either side, so comparisons even between what is a blue pixel from one image and a red pixel from the other will be close together in value if the pixels are both next to the blue/red boundary.
To test the tester, try different degrees of blur and threshold with manually tweaked graphics to see what values pretty much pass things that look close enough, and fail things that don't.
This is similar to the idea of testing at a reduced resolution, but not exactly the same. A hybrid idea would be to do the resolution reduction and average the values of the original pixels to produce the low-res pixels. This would also reduce the total processing time and memory requirements slightly versus blurring without resolution reduction.
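Putting the last few paragraphs together, a hedged sketch using the Pillow imaging library; fuzzy_match and the default radius/shrink/threshold values are guesses, not known-good numbers:

 from PIL import Image, ImageFilter         # Pillow, assumed available

 def fuzzy_match(reference_path, test_path, blur_radius=2, shrink=4, threshold=12):
     """Downsample (box filter = plain pixel averaging) and blur both
     images, then require every channel of every surviving pixel to be
     within 'threshold' of the reference."""
     def soften(path):
         img = Image.open(path).convert("RGB")
         if shrink > 1:
             img = img.resize((max(1, img.size[0] // shrink),
                               max(1, img.size[1] // shrink)), Image.BOX)
         return img.filter(ImageFilter.GaussianBlur(blur_radius))
     ref, test = soften(reference_path), soften(test_path)
     if ref.size != test.size:
         return False
     worst = max(abs(a - b)
                 for pa, pb in zip(ref.getdata(), test.getdata())
                 for a, b in zip(pa, pb))
     return worst <= threshold

Tuning it is exactly the "test the tester" step above: run it over a pile of known-good and known-bad renderings and pick the values that separate them.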
OK, here is some real-world data on the RegressionTests for the OpenSource cairo graphics library (http://cairographics.org/).
As you can see from their GitVersionControl repository (http://gitweb.freedesktop.org/?p=cairo;a=tree;f=test), they create reference PNG images for each of their regression tests. Running the regression suite generates test PNG files. If a test's output doesn't match its reference pixel for pixel, the test is marked FAIL and a diff PNG is created to accentuate the difference.
This style of testing does require pixel-perfect accuracy, which is problematic for some of the cairo back-ends, which themselves don't guarantee perfect accuracy. The test suite currently allows you to specify a maximum pixel difference, but that is very primitive. The plan is to incorporate the Perceptual Diff (pdiff) library, used by PixarCompany, to obtain a difference metric that takes human visual acuity into account.
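Cairo's harness itself is C, but the shape of such a test is roughly this (a Pillow sketch, not their code; max_pixel_diff corresponds to the "maximum pixel difference" knob mentioned above):

 from PIL import Image, ImageChops          # Pillow, assumed available

 def compare_to_reference(test_png, reference_png, diff_png, max_pixel_diff=0):
     """Pass if every channel of every pixel is within max_pixel_diff of
     the reference; otherwise write an exaggerated diff image and fail."""
     out = Image.open(test_png).convert("RGB")
     ref = Image.open(reference_png).convert("RGB")
     if out.size != ref.size:
         return False
     diff = ImageChops.difference(out, ref)
     worst = max(high for _low, high in diff.getextrema())    # largest channel difference
     if worst <= max_pixel_diff:
         return True
     diff.point(lambda v: min(255, v * 8)).save(diff_png)     # accentuate the differences
     return False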