Just ran across this problem, I'm wondering if any of you math and algorithm types have seen it addressed anywhere... There's a directed acyclic graph. Each node has two values: a duration (suppose integers) and a profit (which can be negative). The profit value changes over time. Is there a polynomial or near-linear algorithm for finding the most profitable path? (p.s. simpler version: suppose the profit always starts negative and increases linearly)
My situation is project scheduling: the graph is the project PERT chart, the duration is the number of days or weeks to complete a feature, profit = (gained revenue - development cost) for any particular feature, and the gained revenue is time-varying because later generally means less profit; missing some key date can be catastrophic; and sometimes there is no cash advantage to delivering early.
The book SoftwareByNumbers describes a heuristic, but it seemed to me that maybe there could be an algorithm based on working backwards from the end date, using induction to add one earlier node at a time and always keeping score of the cost profile from each new starting node. If this worked, then the cost of adding a new node to the front would be O(projectLength) or O(edges) and the whole thing could be at worst O(projectLength*edges*nodes) or similar instead of factorial or exponential. Does this graph problem sound familiar to anyone?
Let me see if I understand this. You have (V,E), a directed acyclic graph. We have functions d(V,t) and p(V,t) which for a given value of t give a "duration" and "profit" for each vertex. Given a path v_0, v_1, v_2, v_3 ... v_n we compute t_0, t_1, t_2 ... t_n by setting t_0 = 0 and t_i to be the sum of the d(v_j, t_j) for j < i. Think of t_i as the time of arrival at vertex v_i. Then we have p_i = the sum of the p(v_j, t_j) for j <= i, and you want a path that maximizes p_n.
Am I right in assuming that not every node has to be visited?
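For concreteness, a minimal Python sketch of that formalization, plus the backward "profile" induction Alistair floated above. All names here are hypothetical; it assumes integer times, a bounded horizon T, and the strict path reading (a single path through the DAG, not a placement of every node - which, per the question just above, may not be the intended problem).

 # Sketch only. d(v, t) and p(v, t) are the duration/profit functions above;
 # succ(v) yields v's successors in the DAG; T is an integer time horizon.
 def path_profit(path, d, p):
     t, total = 0, 0
     for v in path:           # arrive at v at time t (= t_i)
         total += p(v, t)     # accumulate p(v_i, t_i)
         t += d(v, t)         # t_{i+1} = t_i + d(v_i, t_i)
     return total

 def best_profiles(vertices_in_reverse_topo_order, succ, d, p, T):
     # best[v][t] = best profit obtainable from v onward, arriving at time t.
     best = {}
     for v in vertices_in_reverse_topo_order:
         best[v] = {}
         for t in range(T + 1):
             arrive = t + d(v, t)
             tails = [best[w][arrive] for w in succ(v) if arrive <= T]
             best[v][t] = p(v, t) + max(tails + [0])   # the 0 = end the path at v
     return best

Under the path reading this is O(E*T) - pseudo-polynomial, not near-linear - and it says nothing about the harder reading in which every feature must be scheduled, which is what most of the discussion below wrestles with.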
Consider the two opposing forces at work here: (1) a near-linear algorithm must be approximately local in its behavior (Thanks for that insight - that's new and useful news to me - AC) ; it can't go gallivanting around all over the entire DAG on each step.
Yet (2) in the absence of information about special patterns in the data, you have to consider that non-local paths may be the optimal ones. For instance, take getting from A to X via path A-B-C-X versus path A-D-E-X, where the segment costs are B = 1, C = 100, D = 10, E = 1. When you only look ahead one segment, B wins over D, yet B-C is 101 in total cost whereas D-E is only 11.
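(A tiny sketch of that arithmetic, treating the letters as hypothetical node costs:)

 # One-step lookahead picks B (cost 1) over D (cost 10), yet the full
 # paths cost A-B-C-X = 1 + 100 = 101 versus A-D-E-X = 10 + 1 = 11.
 cost = {'B': 1, 'C': 100, 'D': 10, 'E': 1}
 assert min('BD', key=cost.get) == 'B'                  # the greedy choice...
 assert cost['D'] + cost['E'] < cost['B'] + cost['C']   # ...loses to D-E overall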
This is precisely what makes the hard problems hard: one must examine non-local information to find even good solutions, let alone optimal solutions. And it gets all the worse when some segment costs can be negative relative to others, because then it's not even vaguely monotonic.
This also applies to every kind of statistical, learning, and search algorithm: the solution(s) with the desired degree of optimality must signal themselves non-locally by having a semi-monotonic gradient leading towards them, or else no algorithm can do better than exhaustive search/lucky guessing.
This gives fairly universal guidance on the probable cost of optimal or near-optimal algorithms.
Of course, your original hope is right once in a while: sometimes there is a clever algorithm that exploits a pattern in the problem that differentiates it from the inherently hard problems. So sure, never hurts to ask.
Also of course, I'm not the world's biggest expert in these things. Knuth might have a different take on it. :-) And then I'd learn something important. But on this one, I'd be astonished. -- DougMerritt
Thanks, Doug ... I spent an hour or so with it last night, and saw why near-linear algorithms won't work. The sample case that did it for me was a spike value in the middle of the graph, representing a demo for a conference, which basically has no value anywhere except at one point. Working with this example made it clear to me that there's no way to discover that spike by working backwards or, in fact, by anything other than brute discovery. This matches what you just wrote, above.
Can something be done if the problem is simplified, so revenue grows linearly after delivery (still using a negative number representing cost of development)? Even that would be useful. If that makes sense, the next idea might be to build some software to run the calculation dynamically on a PERT chart where the human locks one or more features into fixed places, announcing, if you will, that those are the non-linear, non-movable nodes for the calculator to deal with.
Does the simplified problem look any better? -- AlistairCockburn
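One way to pin the simplified version down (hypothetical symbols): if feature f costs c_f to build, earns r_f per week after delivery, and the planning horizon ends at week H, then delivering f at week t_f is worth -c_f + r_f*(H - t_f). Summed over all delivered features, the c_f and r_f*H terms are constants, so maximizing profit is the same as minimizing sum(r_f * t_f), i.e. weighted completion time. If I have the scheduling literature right, the precedence-constrained version of that (1|prec|sum w_j C_j) is NP-hard, so the DAG seems to keep even the simplified problem hard - but without the precedence constraints a classic greedy (Smith's rule: sort by r/duration, descending) is exactly optimal. A sketch of the unconstrained case:

 # Sketch, assuming the linear model above and NO precedence constraints.
 # Maximizing sum(-c + r*(H - t_f)) reduces to minimizing sum(r * t_f),
 # which Smith's rule (sort by r/duration, largest ratio first) solves exactly.
 def schedule(features):
     """features: list of (name, duration, revenue_rate) tuples."""
     order = sorted(features, key=lambda f: f[2] / f[1], reverse=True)
     t, finish = 0, {}
     for name, duration, rate in order:
         t += duration
         finish[name] = t          # completion week t_f
     return [name for name, _, _ in order], finish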
The original problem statement is for "optimal", so if I understand correctly, the answer to the original problem is a clear "no". If you were interested in something more like "near-optimal" then maybe, or better yet, "best that can be found in near-linear time", then definitely yes.
If I understand correctly, the combinatorial explosion can be avoided by using a trellis-based Viterbi algorithm, in which you select a window of at most N simultaneous path segments times at most M path-segment lengths, with N*M chosen to be on the order of as much space/time as you can afford. The window is then slid over all possible routes, keeping only the best N current segments and throwing away sub-paths that aren't as good as those N. This is very easy to implement and works very well on a large number of kinds of problems. You can choose N*M to yield whatever you think "near-linear" means in your context; statistically, the quality of the approximation to optimality tends to diminish as the window shrinks, unless you get lucky.
It is of course a heuristic, in that for some problem spaces the best global path is not composed of the locally best path segments, so depending on your data, any of thousands of other kinds of traversal heuristics might be better, such as (I forget the name of this one) pick a single route at random, and then do genetic optimization and/or simulated annealing on its subpaths to improve it. Those approaches are conceptually easy but require more coding (unless you've got GA/SA libraries lying around). -- DougMerritt
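For concreteness, a minimal beam-search sketch of that windowed idea (the names beam_schedule, duration, and profit are hypothetical; DAG precedence checks are omitted for brevity):

 # Grow partial orderings one feature at a time, keeping only the best N.
 # profit(f, t) is the (time-dependent) profit of finishing feature f at time t.
 def beam_schedule(features, duration, profit, N):
     beam = [((), 0, 0.0)]                   # (ordering, elapsed time, profit)
     for _ in features:
         candidates = []
         for order, t, value in beam:
             for f in features:
                 if f in order:
                     continue
                 t2 = t + duration[f]
                 candidates.append((order + (f,), t2, value + profit(f, t2)))
         candidates.sort(key=lambda c: c[2], reverse=True)
         beam = candidates[:N]               # discard all but the best N
     return max(beam, key=lambda b: b[2])    # best complete ordering found

With N = 1 this degenerates to pure greedy; with N unbounded it becomes exhaustive search, so N is exactly the space/time knob described above.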
Bummer. Thanks.
The description of the problem doesn't quite match my mental model of a project network. Here's my model: the project network represents project activities, with a specific set of ordering constraints - some work has to be completed before other work can start, etc. The longest path through the network is termed 'the critical path', because any delay on this path will delay the end date of the project. (There can be multiple branches on the critical path.) (In reality, a project network rarely captures the actual constraints, because project constraints aren't really a graph, and can be conditional - for example: either of these two activities can happen at this point, as long as either one specific activity is at least half-way complete or a different specific activity has completed, but the two activities can't happen simultaneously.)
It sounds to me like your description means: select the best of many possible project networks, by cost analysis of the paths, where cost is a function of activity completion date. The trouble with this kind of question is that, for some subset of activities, what you are really saying is "test all possible orders", which is a factorial problem. (The heuristic approach would be: either do the highest-value work first, or the work with the earliest target date.)
Does this explanation make sense, or did I completely misunderstand the problem definition? -- DirckBlaskey
This isn't a standard use of a project network, and no, I'm not interested in the critical path (Doug demonstrated convincingly that there isn't an "easy" solution, above). Thanks -- AlistairCockburn
I still don't understand what the graph represents, but... Assume that the graph represents some ordering constraints, but that paths that would be 'simultaneous' based on those constraints are actually 'mutually exclusive' based on implied resource usage. This would mean the duration of the project is the sum of the durations of all activities, which gives you the project end date.
Walking backward from this date, at any point in the graph you have a list of activities 'eligible' to be scheduled (meaning activities for which all the activities that depend on them have already been placed). At that point, you select the activity whose placement would mean the smallest loss. If there are multiple activities with the same minimal loss, you could select from those based on lowest value or, maybe better, longest duration.
This isn't guaranteed to produce the best answer, but it might be a good start. -- DirckBlaskey
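A rough sketch of that backward walk, with hypothetical helpers: succ(f) gives f's successors in the DAG, and loss(f, t) is the profit forgone by having feature f complete at time t.

 # Fill the schedule from the project end backwards; 'eligible' features
 # are those all of whose successors have already been placed.
 def backward_schedule(features, duration, succ, loss):
     end = sum(duration[f] for f in features)    # project end date
     t, placed, order = end, set(), []
     while len(placed) < len(features):
         eligible = [f for f in features
                     if f not in placed and all(s in placed for s in succ(f))]
         # smallest loss first; break ties by longest duration
         f = min(eligible, key=lambda g: (loss(g, t), -duration[g]))
         order.append(f)
         placed.add(f)
         t -= duration[f]                        # f now occupies [t, old t)
     order.reverse()                             # built back-to-front
     return order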
The problem is feature scheduling. Assume your application is going to be delivered weekly or monthly over a long time frame, feature by feature. Assume that developing each feature takes an integral number of weeks or months, costs an amount to develop (negative profit), and after deployment produces a positive value (eventually profit for any feature goes positive if it gets delivered early enough). The question is, in which order to develop the features?
A few questions come to mind, and a few quick scribbles that might hopefully lend some insight:
Duration can be eliminated entirely from the problem. Whenever you have a node of duration n, you replace it with a series of n nodes, each of duration 1. The profit for all nodes except the last should be 0 (assuming you never get any profit until the feature is complete), and the last node in the chain gets the profit of the original. So if you had a node of (duration = 4, profit = 30 + 9t), it becomes the following chain
(duration = 1, profit = 0) -> (1, 0) -> (1, 0) -> (1, 30 + 9t)
The time value for profit calculations is then the number of nodes that have already been traversed in the graph.
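(A little sketch of that splitting step, with hypothetical node records as (duration, profit-function) pairs:)

 # Split a node of duration n into n unit-duration nodes; only the last
 # carries the (time-dependent) profit, the rest profit nothing.
 def split(duration, profit_fn):
     chain = [(1, lambda t: 0)] * (duration - 1)
     chain.append((1, profit_fn))
     return chain

 # e.g. the (duration = 4, profit = 30 + 9t) node above becomes a 4-node chain
 chain = split(4, lambda t: 30 + 9 * t)
 assert len(chain) == 4 and chain[-1][1](2) == 48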
We're close to being able to apply DijkstrasAlgorithm here, since you could just assign a cost to each node that's the negative of the profit (or alternatively, take the max profit instead of the min cost). But there's a problem, because the profit depends on where the node is placed, so I think the "relaxation" process that DijkstrasAlgorithm uses would need to back up all the way to the beginning and compute costs for everything. This leads to the combinatorial explosion Doug was talking about. I think any similar method would have the same problem: any intermediate results are potentially invalidated by the changing profit, so you can't use DynamicProgramming techniques or break the problem up into subproblems.
I'll give this a bit more thought overnight. My gut intuition is that it can't be done in polynomial time. But it's summer/winter vacation, I've got nothing else to do, and it's a neat challenge. Plus, this problem is isomorphic to playing Civilization 3 so that you get key scientific advances as fast as possible, a far more applicable problem for me. ;) -- JonathanTang
Yeah, now you're talking! That's an important problem. Please post your results ASAP! ;-)
Being JustaStudent may be an advantage here; math is still fresh in your mind, you have fewer preconceptions, your memory is better, you have the speed/reflexes of youth...Perhaps you'll discover that P = NP. Could happen. Never know.
Now the Civilization 3 tech tree would be a MUCH harder problem. Should you spend time increasing city size (more taxes to science) or increasing production (more units = more cities = more science, also more libraries/universities = more science, also building science wonders)? Should you go directly for a technology or is it faster to research universities and research labs and science wonders to help you there? The mind boggles.
Although, the way you phrased it was static...just at a point in time. Surely a single matrix inversion (or pseudo-inversion) will solve that. :-)
How many features/nodes would be on the graph? i.e. is the number of nodes big enough to make brute force too slow? You can presumably cache the contribution from the early features if you are just looking at swapping the order of late features; swapping the earliest feature clearly potentially changes all the others, as you have stated. You might also consider what it means if the profit changes in complex ways with order. You already have one example, the demo, which has a step-function shape. You may be able to find other sets of profit functions; then some sets of functions could be considered before others to give a solution (although you need a way to decide which to look at first). -- JamesKeogh
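One way to push that caching idea, sketched under the assumption that features are built strictly one at a time: the finish time of a set of features then depends only on which features are in it, not on their order, so the best profit for each subset can be cached. Still exponential, but 2^n subsets (about a million at n = 20) beats n! orderings (about 2.4*10^18 at n = 20). Names are hypothetical, and DAG precedence checks are again omitted:

 # Brute force with caching over subsets (frozensets) of features. Works
 # because the completion time of the last feature of a set S is simply
 # sum(duration[f] for f in S), regardless of the order within S.
 from functools import lru_cache

 def best_profit(features, duration, profit):
     features = tuple(features)

     @lru_cache(maxsize=None)
     def best(done):          # done: frozenset of already-finished features
         if len(done) == len(features):
             return 0.0
         t = sum(duration[f] for f in done)
         return max(profit(f, t + duration[f]) + best(done | {f})
                    for f in features if f not in done)

     return best(frozenset())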
I'd like to make the problem a bit harder: in reality we don't know the durations/costs for sure, but have to either make conservative estimates or else work with probability distributions, which is my proposal. Note: summing probability distributions is done by convolution, and it can be simplified to a few additions if one restricts oneself to simple cases like Gaussian or Poisson distributions. -- GunnarZarncke
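A sketch of that point, assuming independent activity durations given as discrete distributions (names hypothetical): the duration of two activities in sequence is the convolution of their duration distributions, and for Gaussians it really does collapse to a few additions, since means and variances simply add.

 # Durations as discrete distributions {value: probability}; sequencing two
 # independent activities convolves their duration distributions.
 def convolve(a, b):
     out = {}
     for x, px in a.items():
         for y, py in b.items():
             out[x + y] = out.get(x + y, 0.0) + px * py
     return out

 # For Gaussians the convolution is again Gaussian: means and variances add.
 def convolve_gauss(mean_a, var_a, mean_b, var_b):
     return mean_a + mean_b, var_a + var_b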