Recently we were bit by a problem where we let individual applications/developers make a decision and it turned out the decision needed to be a system level decision. I think it's important when creating an api to create apis at different levels. The low level interface should provide full flexibility. Higher level apis should be made where system level information, not accessible to parts of the system, can insert itself and make more informed decisions.
In this particular case it was the timeout for requests. The decision is harder because the developer needs to retry in case of failure. The API allows a timeout to be set. Developers are supposed to pick a timeout value that makes sense for their application.
It turns out different developers had widely different ideas of what the timeout should be. Some people thought a request should be satisifed in 100msecs no matter how loaded the system was. Others thought 5 secs. Others didn't understand how a request can fail so they would use a short timeout in a retry loop which essentially turned into a hard infinite loop.
Now there are over 400 uses of this api in the system and it's very hard to change as every developer has a different style in using the api.
Yes, better education would help. Better developers would help but it's hard when management let's anyone anywhere in the country do development. But i think the problem is deeper in that the original design shouldn't have let developers select these values. It takes system level knowledge, that is what nodes are up, load, etc to select correctly.
How do you differentiate "system" and "application"? What does each term mean to you? --RandyStafford
Now there are over 400 uses of this api in the system and it's very hard to change as every developer has a different style in using the api.
Maybe the problem is not the lack of BigDesignUpFront. Maybe it's just a failure to follow OnceAndOnlyOnce.
If at the first sign of duplicated logic, the timeout values and their meanings where encapsulated within some other objects, there would only be one place to change, not 400.
If the developers cannot learn to do this, then yes, you're forced to find a committee of people who can and have them tell the others what to do. (Though I'm entirely in doubt that this is the case. Was MercilessRefactoring a goal?)
Duplicate logic depends on how you think its used. As does MercilessRefactoring. If you think applications have different needs then it can't be centralized and there's nothing to refactor. A change of this size on a large project has to approved, scheduled, resources assigned, etc, which sucks.
No place to put shared code? Yikes!
How do you differentiate "system" and "application
IMHO the system is the context in which applications run. Of course the system is itself composed of applications so the distincting is somewhat arbitrary. System at some level has a horizontal perspective where a application has vertical concerns.
"Others didn't understand how a request can fail so they would use a short timeout in a retry loop" - it sounds like that would always be the wrong thing to do, for any value of timeout. Could you have discouraged it, eg by throwing an exception instead of returning an error value?
How the failure is indicated does not change that developers must deal with the failure.
So you must have libraries full of code like strcpy(a, b) just because some developers happen to choose the same names? Uses of strcpy are not common code that can be extracted out. Strcpy is what was pulled out into a library. The uses of strcpy are general not libaryized.
In code strcpy users will have code like:
void f() {
strcpy(a,b); ... ... strcpy(b,c); ...}
Function f may be in a library. strcpy is in a library. But the calls to strcpy aren't libraryizable, they are the actual calls using the parameters required at that instance. You wouldn't pull out each of these strcpy calls because the arguments to strcpy (a,b,c) are for that subsytem. Now if subsystems did use common values they are libraryizable, but it in general that's not the case. This was the same thought for the database api which allowed parameters to be specified for retry, timeout, etc.
To continue the strcpy() analogy, maybe what we are dealing with here looks more like this:
void f() { strcpy(a, "Fred"); strcpy(b, "Homer"); } void g() { strcpy(c, "Fred"); strcpy(d, "Homer"); }...versus:
void fRefactored() { strcpy(a, getMainFlintstonesCharacter()); strcpy(b, getMainSimpsonsCharacter()); } void gRefactored() { strcpy(c, getMainFlintstonesCharacter()); strcpy(d, getMainSimpsonsCharacter()); }In other words, express your meaning rather than your implementation, and do so OnceAndOnlyOnce.
''Not knowing your application better, I would think there would be a better way to express latency expectations other than having MagicNumbers. As soon as two people write the code: "Connection(Timeout(Milliseconds(100))", then that should become: "Connection::makeConnectionWithLowLatency()". This way, when it turns out that 100ms is too little, you can up the value in only one place.
But that's just the general argument for refactoring/clear-intent. In this specific case, it doesn't sound to me like there's any single good choice for the timeout, even if you refactor that decision out. Why not implement the old "Double the timeout and try again" mechanism inside the Connect?
At any rate, I'm still unconvinced that there is such a thing as a case where there is a distinction between System and Application-Level Decisions, and there isn't a violation of OnceAndOnlyOnce. I can see how it becomes harder to keep code refactored across larger systems, but that is a different subject.