Testing Graphics Systems

I'm completely stuck. Aside from TestPattern?s, how do you seriously RegressionTest a graphical library? Often, different algorithms yield similar but distinct outputs, all of which are acceptable. Consider Bresenham's line algorithm: if you set the initial error to 0, you get a different line than if you start with an error of 1/2. While the latter is "more correct", it really just doesn't matter, and it's faster to set the error to 0. So human inspection seems the only real way, but human inspection has its own problems:

How many people (aside from me ;) notice off-by-one pixel errors on an 800-pixel-wide display?

One solution is not to use 800-pixel-wide displays - use, say, 40x40 displays so errors are screechingly obvious. However, for widgets like toolbars this isn't practical. I usually resort to squinting and using bright red orthogonal lines for comparison. -- SunirShah
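
To make the Bresenham point above concrete, here is a minimal illustrative sketch in Python (my own toy code, not from any library discussed here; it handles only the first octant, and the error term is kept scaled by 2*dx so the arithmetic stays integral):

  def bresenham(x0, y0, x1, y1, half_start=False):
      # Integer Bresenham for the first octant (0 <= slope <= 1).
      # With the error scaled by 2*dx, an initial error of 1/2 becomes dx,
      # while the "fast" variant simply starts at 0.
      dx, dy = x1 - x0, y1 - y0
      err = dx if half_start else 0
      pixels, y = [], y0
      for x in range(x0, x1 + 1):
          pixels.append((x, y))
          err += 2 * dy
          if err >= 2 * dx:
              y += 1
              err -= 2 * dx
      return pixels

  # Both results are plausible renderings of the same ideal line, yet a strict
  # pixel-for-pixel regression test would flag one of them as wrong.
  fast = bresenham(0, 0, 10, 4, half_start=False)
  exact = bresenham(0, 0, 10, 4, half_start=True)
  print(sum(p != q for p, q in zip(fast, exact)), "of", len(fast), "pixels differ")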

What about having tests that tell us whenever there is a visual difference of any significance, and leaving it up to the tester to decide if the new one is errant?

What does "of any significance" exactly mean? What about off-by-one errors on the endpoints? I think I have an answer. Use fuzzy logic. Make some pixels "absolutely required" and "absolutely not required". Make others "mediumly required" and "mediumly not required" If the absolutely required pixels are wrong, the whole test fails. Otherwise, the test's pass rating is equivalent to the percentage of mediumly (not) required pixels correct. -- SunirShah


How many people (aside from me ;) notice off-by-one pixel errors on an 800-pixel-wide display?

A past boss, a civil engineer working with water, noticed that all the wonderful plots and graphs we had, showing the water levels of certain wells, were off by a cumulative error of 0.06844626967830253251% ... seems someone forgot to take into account that every four years there's an extra day. I have no idea how he spotted that, even on the 10-foot plots. I had to fix the Fortran, which engineers seem to love (bless 'em).


I believe that even in the case of drawing a line, it is ultimately a good idea to specify exactly what you want, even if you think a small difference wouldn't matter. DonKnuth makes this point in Chapter 4 of TAOCP, where he talks about implementing floating point and argues that it is important to get it "right", even though FP arithmetic is explicitly about accepting some inexactness. -- StephanHouben

It's totally bogus to specify exactly what you want because the reality is that we want something small and fast that looks generally like a line. Whether the error gradient trips one pixel earlier than it should is mostly irrelevant. Very unlike accounting, graphics is a mostly analog domain. -- SunirShah

The only reason even one bit would ever change is if you change the algorithm. Then, one examination of the graphical differences, by hand, would tell you if the new results are "OK" or not, and you can go on. If the output changes more often, then consider a more sophisticated comparison algorithm.

Imagine this scenario:

  1. A developer writes an anti-aliased line-drawing function by eyeballing the end-result.

  2. This developer then saves the eyeballed bitmap into a UnitTest that compares it with a dynamically drawn one.

  3. The developer checks this code in.

  4. Weeks later, a second developer accidentally breaks this code while working on something else.

  5. The tests signal failure, causing the developer to quickly realize the mistake.

  6. They back out the damaging changes and try again.

  7. Weeks later, a third developer writes some code that slightly changes the line-drawing code, but not in a bad way. The test signals failure, even though it is a FalseNegative?.

  8. This developer hard-codes the new bitmap, and the process repeats.

With a test setup like the above, you give the developers a choice about keeping the changes they've made. Without the feedback that automated tests bring, it could take a long, long time to figure out what changed.
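
A rough sketch of the golden-image test from steps 1-2, assuming Python's unittest and a hypothetical render_line routine (both the routine and the inlined "approved" bitmap are stand-ins, not code from any real project):

  import unittest

  # The bitmap the first developer eyeballed and approved, checked in with the
  # test. Here it is inlined as 0/1 rows; in practice it might be a BMP on disk.
  APPROVED = ["10000000",
              "01000000",
              "00100000",
              "00010000",
              "00001000",
              "00000100",
              "00000010",
              "00000001"]

  def render_line(width, height):
      # Hypothetical stand-in for the line-drawing routine under test:
      # draws the main diagonal.
      return [[1 if x == y else 0 for x in range(width)] for y in range(height)]

  class LineRenderingTest(unittest.TestCase):
      def test_matches_approved_bitmap(self):
          expected = [[int(c) for c in row] for row in APPROVED]
          # Any change to the algorithm shows up here; a human then decides
          # whether to back the change out or bless a new reference bitmap.
          self.assertEqual(expected, render_line(8, 8))

  if __name__ == "__main__":
      unittest.main()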

What do you think?

You're still losing. The output buffer is likely to change if the algorithm changes, especially for non-trivial shapes like Beziers. The choice then isn't one of replacing the old buffer or not, but of stepping through the algorithm by hand or not. In other words, the tests don't get you very far (and are thus useless compared to other techniques). However, I'm interested in this fuzzy evaluation of the buffers. That seems promising. Over time, you can change the rating of the pixels in the buffer as you gain more knowledge. This means you get the same accumulation of knowledge in the graphical tests as you do with regular tests. I'll keep y'all posted. -- SunirShah

It seems that we agree to TestEverythingThatCouldPossiblyBreak. Where we disagree is in our answer to the question: can such code PossiblyBreak?

No, it can't:

The algorithm is currently being toyed with - constantly. The UnitTests just get abandoned because they're pointless. They don't tell you anything other than, "Yup, you sure have changed the code." They fail to communicate, since communication is the transfer of information the recipient doesn't already have. -- SunirShah

Yes, it can: maybe we should talk about something less trivial than a line-drawing function. Let's talk about something trickier, like a GUI layout manager ("Put these three buttons in a row, here, here, and here."). Such a thing would reuse plenty of code that could, possibly, break. Right? -- anon.

Much later than when I originally formulated the question: I'm finally working on the ScreenScraper right now. A line algorithm is not trivial, trust me. There are at least 16 "styles" of lines when you fully combine all the options we support - probably more that I've miscounted. Plus, we have to clip the line correctly in all cases. Plus, it has to be (reasonably) accurate in all cases. A line is the least complicated case, but it's an adequate example. -- SunirShah


Writing a better ScreenScraper.

After looking at the existing framework, which dumps small buffers as text, I've decided that there's a reason we don't edit pictures as text files. I figure it would be very handy to be able to load and save the scrapings as image files; something simple like BMPs or PCXs should do. Bonus: if we have a good, small image loader, we can use it later in production instead of the current bloated PNG library (~50 KB).

Given a nice image, one can use one's favourite handy-dandy image editor to inspect and mark up the expected output.

Later. After finding time in between meetings (grumble), I've implemented a better screen scraper. I use some magic colours to indicate pixels that absolutely must be on or off: red for off, green for stroke on, and blue for fill on. Other pixels just increment a failure count. The test fails if either the number of failures (as a percentage) exceeds some threshold, or any of the absolutely-must-be-on/off pixels is incorrect. On failure, I dump the buffers as Windows BMPs, thus providing a decent diff.

Now I have the great task of grandfathering in a thousand test slides that are ten years old and in a dead graphics format. -- SunirShah
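
A rough sketch of that comparison rule in Python (the colour constants, the 10% threshold, the rows-of-RGB-tuples layout, and the "any non-black pixel counts as on" simplification are my assumptions for illustration, not the actual ScreenScraper code):

  RED, GREEN, BLUE = (255, 0, 0), (0, 255, 0), (0, 0, 255)   # magic colours
  BLACK = (0, 0, 0)

  def scraping_passes(reference, actual, max_failure_ratio=0.10):
      # reference and actual are same-sized lists of rows of RGB tuples.
      # Magic colours in the reference encode hard requirements:
      #   RED   -> pixel absolutely must be off
      #   GREEN -> pixel absolutely must be on (stroke)
      #   BLUE  -> pixel absolutely must be on (fill)
      # Every other reference pixel is a soft expectation that merely
      # contributes to a failure percentage when it mismatches.
      soft_failures = soft_total = 0
      for ref_row, act_row in zip(reference, actual):
          for ref, act in zip(ref_row, act_row):
              lit = act != BLACK
              if ref == RED and lit:
                  return False                      # hard requirement violated
              if ref in (GREEN, BLUE) and not lit:
                  return False
              if ref not in (RED, GREEN, BLUE):
                  soft_total += 1
                  soft_failures += (act != ref)
      return soft_total == 0 or soft_failures / soft_total <= max_failure_ratio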


Semi-automatic idea.

I develop 3D software, mainly for games and such, and I have always wanted to get into automatic unit testing. But I have always found too many problems with testing graphics systems at a higher level - that is, testing that a model is displayed correctly, and so on.

For me, it seems almost impossible to make good, solid, automatic unit tests for this; the discussion above also recognizes the difficulty. However, a semi-automatic approach could be arranged. By this, I mean that when executing unit tests, some tests need manual feedback. This should also be supported by the unit testing framework, so that when it executes a test that needs manual feedback, it prompts the user for it.

For example:

I would like to test that my 3D engine can display a basic textured model.

I note that I already do this manually when working on my code, but it is completely manual and no other tests are run before it.

This technique would require a fairly advanced unit-testing framework with graphical capabilities, and also some effort from the person running the tests. However, it would save some time, at least for me.

What do you think of such an approach?

-- hObbE
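
A sketch of how that manual-feedback hook might look if the framework were plain Python unittest (render_textured_model and show_on_screen are hypothetical placeholders for the 3D engine and display calls, not real APIs):

  import unittest

  def render_textured_model():
      # Hypothetical: ask the 3D engine to draw the basic textured model.
      pass

  def show_on_screen():
      # Hypothetical: present the rendered frame to the person running the tests.
      pass

  class TexturedModelTest(unittest.TestCase):
      def test_basic_textured_model_looks_right(self):
          render_textured_model()
          show_on_screen()
          # The test run pauses here and asks the tester for a verdict.
          answer = input("Does the textured model look correct? [y/n] ")
          self.assertTrue(answer.strip().lower().startswith("y"),
                          "Tester rejected the rendered model")

  if __name__ == "__main__":
      unittest.main()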


Another semi-automatic idea.

What about doing some sort of reference rendering (either in your application itself, under very close manual scrutiny, or in some well-proven third-party package like 3ds Max), displaying that image on the right side of the screen, then calling your application to render what should be the exact same thing on the left? Or, for a bit more Turing-test flavor, have the program randomly switch which side is which to keep the operator on their toes. It shouldn't be a difficult test for anyone to perform, as comparison is one of the things the brain does best, and it should still give you fairly obvious indications of, say, when you accidentally broke texture loading and started rendering everything upside-down.
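
A small sketch of the bookkeeping for that randomized presentation (display_side_by_side is a hypothetical display call; only the shuffling and the operator's verdict are shown):

  import random

  def display_side_by_side(left_image, right_image):
      # Hypothetical: put the two renderings next to each other on screen.
      pass

  def blind_comparison(reference_image, test_image):
      # Randomly decide which rendering goes on which side, show both, and ask
      # the operator to point out the application's output. If they can't
      # reliably tell the two apart, the renderer matches the reference.
      test_on_left = random.random() < 0.5
      if test_on_left:
          display_side_by_side(test_image, reference_image)
      else:
          display_side_by_side(reference_image, test_image)
      verdict = input("Which side looks wrong, if either? [l/r/none] ").strip().lower()
      if verdict == "none":
          return True                                      # no visible difference
      return verdict != ("l" if test_on_left else "r")     # wrong side blamed: pass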


I find it ironic, in a fun sort of way, that we're considering using FuzzyLogic to check graphical output, since "fuzzy" is a graphical metaphor. It seems to me, though, that it could quite literally work. Let's say we compare a test output image to a reference image, literally blurring both images to some extent before comparison and assigning a maximum error threshold for each per-pixel comparison. Pixels near a boundary tend toward an average of the values on either side, so even a comparison between a blue pixel from one image and a red pixel from the other will come out close in value if both pixels sit next to the blue/red boundary.

To test the tester, try different degrees of blur and threshold with manually tweaked graphics to see what values pretty much pass things that look close enough, and fail things that don't.

This is similar to the idea of testing at a reduced resolution, but not exactly the same. A hybrid idea would be to do resolution reduction, averaging the values of the original pixels to produce the low-res pixels. This would also slightly reduce the total processing time and memory requirements versus blurring without resolution reduction.
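
A sketch of both variants on greyscale buffers (the 3x3 blur kernel, the downsample factor, and the per-pixel threshold of 16 are arbitrary illustrative choices):

  def box_blur(img):
      # 3x3 box blur of a greyscale image given as a list of rows of ints.
      h, w = len(img), len(img[0])
      out = [[0] * w for _ in range(h)]
      for y in range(h):
          for x in range(w):
              vals = [img[yy][xx]
                      for yy in range(max(0, y - 1), min(h, y + 2))
                      for xx in range(max(0, x - 1), min(w, x + 2))]
              out[y][x] = sum(vals) / len(vals)
      return out

  def downsample(img, factor=2):
      # The hybrid idea: average factor-by-factor blocks into low-res pixels.
      h, w = len(img) // factor, len(img[0]) // factor
      return [[sum(img[y * factor + dy][x * factor + dx]
                   for dy in range(factor) for dx in range(factor)) / factor ** 2
               for x in range(w)] for y in range(h)]

  def close_enough(test_img, reference_img, threshold=16):
      # Blur both images, then demand every pixel pair stays within the threshold.
      a, b = box_blur(test_img), box_blur(reference_img)
      return all(abs(p - q) <= threshold
                 for row_a, row_b in zip(a, b)
                 for p, q in zip(row_a, row_b))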


OK, here is some real world data on the RegressionTests for the OpenSource cairo graphics library (http://cairographics.org/).

As you can see from their GitVersionControl repository (http://gitweb.freedesktop.org/?p=cairo;a=tree;f=test), they create reference PNG images for each of their regression tests. Running the regressions generates test PNG files; if they don't compare pixel-perfect, the test is marked FAIL and a diff image PNG is created to accentuate the difference.

This style of testing requires pixel-perfect accuracy, which is problematic for some of the cairo back-ends that don't themselves guarantee perfect accuracy. The test suite currently allows you to specify a maximum pixel difference, but that is very primitive. The plan is to incorporate the perceptual diff (pdiff) library, used by PixarCompany, to obtain a difference metric that takes human visual acuity into account.
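
For flavour, here is a rough sketch of the core comparison step in that workflow (this is not cairo's actual harness; images are simplified to rows of RGB tuples and the highlight colour is arbitrary):

  HIGHLIGHT = (255, 0, 0)   # paint every differing pixel bright red in the diff

  def compare_and_diff(reference, output):
      # Pixel-perfect comparison. Returns (passed, diff_image); the diff image
      # dims matching pixels and accentuates every mismatch.
      passed, diff = True, []
      for ref_row, out_row in zip(reference, output):
          diff_row = []
          for ref, out in zip(ref_row, out_row):
              if ref == out:
                  diff_row.append(tuple(c // 4 for c in ref))   # dim matching pixel
              else:
                  passed = False
                  diff_row.append(HIGHLIGHT)
          diff.append(diff_row)
      return passed, diff

  # A harness would mark the test FAIL when passed is False and write the diff
  # image out (e.g. as a PNG) next to the reference and output images.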


CategoryTesting

