Separate Io From Calculation

A special case of SeparationOfConcerns.

Problem: Methods are not reusable because they have too much responsibility. In particular, they do IO and they also calculate, so whenever you want to do one of those things, you end up doing both.

Solution: Separate IO and calculation as much as possible. Methods which do either IO or do a calculation are considered reusable. Methods which do both are not considered reusable, because whenever you want to do one of them, you end up doing both..

Can we have someone finish up the pattern template?

Related patterns: ModelViewController, ModelDelegate

SeparateIoFromCalculation is a rule you can't follow all the time. Somehow, you must glue together I/O and calculation. For better reuse though, having functions that do I/O and functions that do calculation is better.

Example: You have a method that calculates sqrt and prints sqrt at the same time. It is called from several thousand places in your program. Then you need to calculate sqrt but you don't need to print it. Or you need sqrt of the same number several times, but you don't want to calculate the expensive operation several times.

It is better to separate this method into calculateSqrt and printSqrt. printSqrt is so similar to print that probably it is better to just use print. Then calculateSqrt() can be simply called sqrt().

Some of the SeparateIoFromCalculation definitions on this page are naive approaches that only work with data sets that always fit in available memory. They fall apart when you can't read and write all the data at once.

See ResultSetSizeIssues. "Fit in memory" is somewhat of a system-specific issue because in many modern operating systems and/or languages, they cache RAM to disk automatically when actual RAM gets tight. However, it still may be a significant issue in some domains that deal with more imposing hardware constraints. Also, in interactive applications, it may be prudent to prompt the user with a warning about potentially large result sets before continuing. For example, "Do you really want to see a screen-report with 10,000 rows?"
All systems have memory and only some of that is available to any given application, so this isn't a "system-specific issue". Consuming multiple gigabytes of virtual memory will get a "doesn't play well with others" comment on your app's report card.
[He beat me to it...I agree, and in more detail:] It's not even a little "archaic". Most machines have 32 bit address spaces. Most nontrivial databases are considerably larger than 2^32 bytes.
You appear to have in mind "demand paged virtual memory", invented circa 1962 by Burroughs and popularized circa 1964 by IBM - not really "modern", it just seems that way to people who cut their teeth on recent but stone age systems like DOS and Windows 3.1 on 8086s and 80286s.
Even on systems with a more comfortable 64 bit address size, they never actually have 2^64 bytes of RAM, and paging is far from transparent; it causes what's known as "thrashing" to constantly cause things to be paged to and from disk.
So, from the beginning of computing history up through the current day, it's been extremely common to have data sets that won't fit in working memory, and that shows no sign of changing at all.
The idea behind caching is GracefulDegradation. Things just get slow if many apps cache too much. If it keeps happening, either rewrite the apps or add more RAM.
- [There's nothing graceful about thrashing a disk. You can't keep adding RAM if the solution has no memory bounds, and it's rare that you actually have time to rewrite the app after memory usage becomes a problem for the users.]
- If you simplify your software by using SeparateIoFromCalculation, you will have more time to deal with such issues. In practice, I don't find it a common problem if you understand the trade-offs.
- [I don't hear anyone arguing in favor of mixing IO and calculations. My problem is with the idea that they should be segregated in time, so that there is a single input phase, followed by a single calculation phase, followed by a single output phase. Writing code like that (that assumes all of the data can be read in a single input phase) doesn't give you more time to figure out a bounded memory solution. It just delays it.]
  - That rings a strong bell... last year, I tried to clean up and improve CohesionAndCoupling - would you check and see if anything there is in alignment with your point? (Please note, I don't mean that to start an EditWar on the page in question, I'm hoping to find points of agreement, that's all).
  - [I suppose my point is that CouplingAndCohesion are measures of how software is logically organized. That logical organization is (or should be) independent of how execution flows. It's good to reduce coupling between IO and calculation. That doesn't mean it's good to do all of the IO at once.]
I think we're all aware of what to do. The point is that I'm disagreeing that it no longer matters whether something fits in memory. It always has mattered, and as long as we have secondary store larger than primary store, always will.
Perhaps "matters" is not Boolean.
- And come to think of it, that was an odd response. Ok, I've got a 100 terabyte database. I do a query which needs to step through every row. The virtual result set does not fit in memory. There's nothing wrong with the app, so I won't change it, and I can't add 100 terabytes to my machine. Sometimes things just don't fit - that is quite different from saying it doesn't matter whether they fit. It makes a huge difference.
- Separating IO is a rule of thumb, not a mandate. If a specific UseCase cannot use it, then so be it. Don't use it.
- In this context, I don't think there's a problem or a nuance with the word "matters", I think that you need to refine the claim you're making some more. Sometimes an app must be aware of IO issues - but I certainly agree that there's nonetheless a SeparationOfConcerns issue. CouplingAndCohesion still says that little bits of IO code shouldn't be randomly intermixed throughout the app, even in this kind of situation.

Given the most recent responses above ("then you will have more time to deal with such issues", "rule of thumb"), we may have finally reached violent agreement on the matter - if that truly is the summary.

The above seems to arguing on different LevelOfAbstraction. By focusing on a single implementation, reading to memory then calculating, one may miss the point is to separate by code module and not by time. The use of ContinuationsAndCoroutines, streams or other forms of LazyEvaluation could work around the memory issues here. -- TylerMac

In the OO world, a similar separation is applied on methods: Some are mutators and others are inspectors. Creating a mixed method is considered bad practice, since then it becomes harder to use and reuse.

Consider, for example, a class used for creating XML documents. You can add new nodes (addNode) and you can obtain the resulting XML (getXml). If getXml() performs the document construction, it is underperforming and probably it won't maintain correctly its invariant, unpleasantly surprising their users.

This is probably the most familiar abstraction in computer science, right before DataStructures (AbstractDataTypes).

How do you figure? Oops, I mean, why do you say that?
- The first abstraction maybe was punched cards. Then assembly language. Then Fortran, which had data types, but they were not abstract yet. By around the same time, Fortran instructions were for IO or for calculation, not both at the same time.
By this definition, it then isn't even possible to do both at the same time, so there would then be no need to make an effort to separate them, so this seems to confuse your intended point rather than clarify.
- It is not possible if you follow this rule 100% of the time. TheyreJustRules. Somehow, you need to connect the two (IO and Calculation). Make as few of those methods as you can.

The idea is that I/O must not be performed by the same methods/functions that perform calculations.

Why? It's not sufficient to describe absolute laws of programming, it's necessary to describe the problem, and how the law addresses the problem.
- Problem: Methods are not reusable because they have too much responsibility. In particular, they do IO and they also calculate, so whenever you want to do one of those things, you end up doing both.
- Solution: Separate IO and calculation as much as possible. Methods which do either IO or do a calculation are considered reusable. Methods which do both are not considered reusable, because whenever you want to o one of them, you end up doing both..
Hint: does it have anything to do with cohesion? Anything to do with command line or GUI? Any other issues? Buehler? Buehler? Buehler?
- Probably it has to do more with cohesion and coupling.

The most basic example of this is the original BASIC keywords LET, PRINT and INPUT. Then when you wrote your first structured program in Pascal to calculate the square root of a number, you could functionally decompose it in: Input number, calculate root of that number, print number.

Is this a DesignPattern?

There are many ways to decompose the program above:

Write all in one big method.
One method loops, another inputs the number, and another calculates and outputs.
One method loops, another inputs the number and calculates, and finally another outputs.
One method loops, another inputs the number, another calculates, and finally another outputs.

Only number 4 is as correct as possible according to SeparateIoFromCalculation.

And only numbers 1..3 are correct according to CombineIoAndCalculation?. :-)

Also notice that in number 4, the method that loops does I/O and the calculation. It is considered better only because it has less methods which break this rule.

The best you can do is to only partially break this absolute rule? Something's wrong with this picture.
- Yes, this rule can't be followed a 100% of the time, but it is better if you follow it.
  - As Above: Why???
    - The first reason is SeparationOfConcerns. The second is AvoidCollateralEffects (also known as avoid SideEffects). The third is EaseOfReuse?.

This page promotes a process decomposition. The example given is also known as a ReadEvalPrintLoop. A variation on this is known as the MasterControlProgram, or MCP, which is cast as the central evil by AlanKayIsTron because over-application of this pattern leads to heavily moded software. See TheEnd. An alternative decomposition is as objects that can read and print themselves, or the spreadsheet that decomposes calculations as a sea of cells that are directly viewed and manipulated.

Objects that can print themselves is not the same as the evil ReadEvalPrintLoop. For example, what about an object that whenever asked to calculate something, it prints itself? Or that whenever asked to print itself, it calculates something? I would call that an AntiPattern, the opposite of OneResponsibilityRule. Methods should do only one thing: Either I/O or a calculation. -- GuillermoSchwarz

See NakedObjects.

The problem is that sometimes it is more code to keep them separated.

One shot:

  get x
  for each i in x
    alter i to j
    print j
  end for

Separated:

  get x
  for each i in x
    alter i to j
    store j in y
  end for
  .....
  get y
  for each j in y
    print j
  end for

Can you please change this example so that actually it makes any sense?

The code above is the wrong way to separate the code because you transformed 1 loop into 2 loops. At first glance it may look correct, but it is performing slower and using additional storage. Besides the first function does IO and calculation, while the second does the same, it is only separated by a blank line. That is not what I was trying to imply. SeparateIoFromCalculation is about a function or method doing either IO or doing a calculation.

What is your proposed alternative? The "blank line" was only to indicate different "code units". Whether those units are functions or something else is immaterial, perhaps a debate left to LongFunctions.
- I can't show an alternative to code that apparently does nothing. Can you show an example we can understand?

More code doesn't means worse. In the above code, where IO and Calculation is done in one method, it is less clear about the purpose of the method. When you see "get x" in a line of code, it's name doesn't tell you any thing. You could change the name from "get x" to "getAndPrint x". But what would you do if you only want to print all the value in x? You can't call "get x" because that would alter the content of x. So What would you do? The solution you use in separate method solution is not the correct way. The purpose of the original method is to "modify and print" so you should have

modify x for each i in x alter i to j end for

print x for each i in x print i end for

I assume the modified i is stored back to corresponding place in x. If not, it means you probably do that calculation just to do the output formatting; that way, the calculation may be ok. Or you could have done

 print x
 for each i in x
       print format x
 end for

 format x
 return altered x

What if we wanted to save the altered thingies, but also print them?

  get x
  for each i in x
    alter i to j
    put j into x replacing i
    print j
  end for

I'm quite unconvinced that 'saving' a value is fundamentally different from I/O. It is just output to memory rather than to the UI. However, saving a temporary, where that temporary is used and discarded as a purely computational side-effect (a side-effect of calculation), would qualify as different. If, however, you're provided an address or reference to someplace to save data, then requesting or sending information from or to that address (whether it be to a printer or elsewhere in memory) is definitely I/O.

SeparateIoFromCalculation, taken fully, means ReferentialTransparency for calculations... since the calculations neither receive input (except to initiate their computation) nor provide output (except to report their result).

The profit of SeparateIoFromCalculation is that the calculation logic is independent of the IO logic or the data source. For example, there is sin(double) function that is used to calculate sin value of a double. What if its signature is changed to sin(File) which can calculate the sin value of the first 4 bytes double from the file?

Separate IO from Calculation makes your code more reusable. usually the IO/GUI code changes when you are polishing your software (improve program looks, support new format). As long as your calculation code is separate from IO code, you can be sure the calculation is always correct, whether or not you change the IO part.

Surely not all calculation can be separate from IO. For example, how would you write FileWriter? or ExcelFileFormatReader? if you shouldn't be interacting with the file? The point is to keep the IO code to the low-level class. Only the class then really needs to know its source of input.

Can someone make sense of that last sentence?
- I think he is trying to make the following point: IO should be low level, so that the upper levels of the programs only deal with abstractions and do not care where the data is taken from or put to.
  - Thank you for help summarizing it for me... English is not my native language. So sorry I couldn't write clearer than that.
    - Mine either, so probably nobody else understood. ;-)

The other alternative would be to write a lot of calculation functions without IO. Then when IO is needed, upper level functions would link them together.

It has been my experience that the first choice (low level objects print themselves) gives you better ObjectOriented systems, in which you really do not care about the IO. All IO is resolved in lower level classes: UI, database, etc. The program is just a bunch of rules. I would say construction rules. For example, to get a doctor you need a person that has a medical "speciality". So you can either go get a "speciality", then the list of doctors appears (persons), the user selects in the UI which doctor he/she will prefer. Or perhaps you are inputting a new doctor in which case you select its "speciality" and then add the person related information.

The second choice (upper levels in the system connect the model and the UI) is exactly like StructuredProgramming. This leads to systems that are very hard to change. -- GuillermoSchwarz

I know that by now this page is really a DeadHorse?, but I think I'll beat on it a little more.

I was astonished that there was even any debate on separating I/O from calculation, but then it occurred to me that most of us have never programmed in AssemblyLanguage.

You completely misinterpret. Most of the early criticism was from me, and my complaint was that there was, originally, not one single little bit of explanation for the why of separation, merely a proclamation that it should be done. The page overall still suffers badly from this fault. Way up above you can see a hint I tried to drop, that one way to explain the "why" has to do with cohesion. No follow up. This is also pretty low level advice, that potentially is subsumed by other principles.
- The separation is very obvious in early programming languages and it remains so today in most modern programming languages. But when writing your own DataStructures, you may end up mixing IO and calculation, because of your lack of experience. This page is just an advice for novice programmers.

To someone coding in assembly, the question, "should I mix the I/O with the calculations" would never be asked: it's absurd on its face.

It depends. You can create subroutines in assembly as in any language. Subroutines should not do IO and calculation.

Perhaps the most general thing that is immediately apparent is that, while calculations are essentially abstract concepts, I/O is (in lower level languages) quite married to the platform of the moment.

That was not the point I was trying to make. Simply that to achieve reuse you need to separate those.

Consider:

  LOAD POINTER REGISTER WITH LOCATION OF BUFFER
  LOAD ACCUMULATOR REGISTER WITH "FETCH" API CODE
  CALL OS API LOCATION  ; OS returns number of bytes in COUNTER REGISTER
  JUMP TO BOTTOM OF LOOP IF COUNTER REGISTER ZERO
  TOP OF LOOP
  LOAD BYTE REGISTER WITH BYTE AT CURRENT POINTER REGISTER
  (do something useful [calculation] with the byte)
  INCREMENT THE POINTER REGISTER
  DECREMENT THE COUNTER REGISTER
  JUMP TO TOP OF LOOP IF COUNTER REGISTER NOT ZERO
  BOTTOM OF LOOP

so, where did the byte come from?

What's this? COBOL for Pentium X? Cute, thanks. No, just verbose ASM mnemonics for non-ASM programmers.

The OS owns the I/O for this routine, and we don't want to have to know how it does that, we just want the bytes to arrive in our buffer.

In those instances where we *must* write an I/O driver to fetch or put data, we certainly don't want the code that does the meaningful calculations tangled up in it!

Now, in a higher level language, since "Boy, bring me another byte" handles I/O as a fairly abstract concept, we may see no real need for separation but, trust me, you want the nuts and bolts of I/O done away from your pure-bred algorithms.

Now, don't get me wrong, I have nothing against I/O routines as a whole - heck, some of my finest code is I/O routines - but I wouldn't want one to marry my data.

-- GarryHamilton

I/O can't be separated from calculations. I/O is just one form of calculation.

Not in common parlance. Feel free to read the title as SeparateIoFromOtherCalculation?, though, if you insist on pedantics.
How is IO a form of calculation? IO is about getting data in and out. Calculation is about transforming data. For example in a Chess game you wouldn't like the UI be tangled with the AI.

I/O is necessary for computation, not calculation. Calculation is transformation of values, and is performed via computation.

A better way to describe this may be to move from this:

 loop {
    x = highRiskRead()
    process(x)
 }

To this:

 loop {
   x = highRiskRead()
   insertIntoStructure(myStruct, x)
 }

 loop {
   x = getNextItem(myStruct, ...)
   process(x)
 }

This can simplify error-handling because other than running out of disk space (assume data structure is cached), we should not have to do heavy error-handling in the second loop because we can just conclude that something is too messed up if there is a problem and suddenly stop without worrying about cleanup. However, the first approach may need complex error-handling to undo half-done processing because we know the IO is full of risk. I find error-handling is just simpler when high-risk processes are separated from low-risk processes. Ideally, all errors should be handled gracefully, but I find that complex error-handling for rare systems-related problems is not worth the code volume and tough to test, so displaying an error message and closing down is often sufficient. For example, I don't put disk-full error-handling on every structure "insert" command because if the disk is full there are probably far worse things to worry about. It is usually not an application's job to monitor disk or cache space, except maybe in life-support systems or bulk-load batches.

Ideally, a language would have a central handling routine for system errors so that we can give a message and perhaps write to a log if there are problems, but otherwise don't worry about graceful recovery. If task-specific graceful recovery is needed, apply specific handling code to only those critical sections.

It might be true that in resource-sensitive domains that separation may be too costly because it tends to require an intermediate buffer structure. It may be a trade-off between developer productivity and machine productivity.

-- top

The 2nd example above is a good example of an "unbounded memory" algorithm. There's no upper limit to the memory it will consume. It might not exhaust virtual memory, but you can't reliably predict how much memory it will use. That means you can't predict how many instances can be hosted on the same machine or what its impact will be on other processes. This kind of algorithm doesn't scale gracefully and should be avoided.

The preferred alternative is to use a bounded memory algorithm that solves the same problem. The 1st example above might be one of those.

{Related issues taken up in ResultSetSizeIssues, and earlier above (perhaps refactoring is in order). A buffer does not necessarily have to be un-bounded. We can put quotas and/or warning messages on it.}
A lot depends on the transaction semantics of the I/O and calculations involved (if any). If the calculation part is "pure" in that it has no effect on (observing) system state other than returning a result - in other words, halting the calculation mid-stream does not leave the system in an inconsistent state - then intermingling the two is a good idea. If, OTOH, the processing must be done wholly or not at all (requires transaction semantics), Top's approach may be better. (Of course, it could be argued that anything requiring transaction semantics should be itself considered "I/O" rather than "calculation"; I agree completely).
- I don't understand this comment. Transaction scope is a separate issue from memory requirements. The transaction can encompass the entire loop or a single iteration depending on the nature of the processing. The system should never be left in an inconsistent state, but I don't see how a bounded memory solution relates to that.
- An argument for top's solution (where a buffer is interspersed between the I/O and calculation phases) is that in the latter solution, any I/O errors can be caught before a (potentially destructive if not completed) calculation operation commences. Otherwise (when I/O and calculation are interleaved) an I/O error halfway through the file can leave the system in an inconsistent state. That's what I was referring to. (I'm assuming that process(), if destructive, is not operating on transactional memory of any sort; so rollback-on-error isn't possible; if it is then the interleaved solution might be better).
- Every I/O has to be within the scope of a transaction. Either like this:

 startTransaction
 loop {
    x = highRiskRead()
    process(x)
 }
 endTransaction

Or like this:

loop { startTransaction x = highRiskRead() process(x) endTransaction } Even if you do all of the reads before processing and all of the writes after processing, you still need to do them inside a transaction. It's never OK to leave the system in an inconsistent state, no matter when you do your I/O. If you interleave them, you can (depending on the requirements of the processing and other parts of the system) use smaller transactions that cover single iterations instead of the entire loop.

This is the standard model for steps in a workflow, or requests in a server. Start a transaction, wait for a message, read the message, process the message, send zero or more messages, commit the transaction. This allows you to pipeline operations and distribute I/O & CPU bandwidth between multiple processes.

At any rate, SeparationOfConcerns is frequently (and unfortunately) orthogonal to the scheduling of operations (sequencing, bundling of operations into transactions) and resources (memory, data); a fact that makes organizing programs a difficult problem.

The goal is not to outright eliminate transactions and I/O. It is to simplify them and/or "bulkify" them in order to reduce the need for complex escape conditions and error handling. It simplifies the portion of the code that does the I/O. Of course, there are always exceptions to rules of thumb.

Why do we need to actually read the whole file in? What's wrong with lazy reading? We could define a class for that...but I'll just use some PerlSix builtins.

    sub generic_source_loop(&risky, :&onErr:($)) {
        loop {
            try {
                risky();
                CATCH {
                    onErr($!);
                }
            }
        }
    }
    sub do_calculations(*@input) {
        map {prosess $^x} @input;
    }
    do_calculations generic_source_loop { highRiskRead() };

Tadaa!

In ObjectOrientation, this is about separating MutatorMethods from InspectorMethods, CommandQuerySeparation.

I've encountered a situation where SeparateIoFromCalculation and data-driven-programming (perhaps a variation of TableOrientedProgramming) seemed to reduce the need for HOFs (HigherOrderFunctions). I had this domain-specific report presentation engine. It provided formatting and did optional row and column totals. However, each report cell needs custom formatting for different usages (instances) of the report engine. Either I copy-and-paste the report engine for each instance to customize it, or I find a nice way to apply situation-specific formatting rules to the cells. A bunch of domain factors could be involved in how a cell is formatted. One approach I started with went something like this:

  function reportEngine(dataStruct, cellDisplayHOF, ...) {
     ...
     while (R = rowLoop(...)) {
        while (C = columnLoop(...)) {
           ...
           cellDisplayHOF(dataStruct[R,C].value, factorA, factorB, factorC, etc...)
           ...
        }  // next C
     }  // next R
     ...
  }

The problem is that the report engine had to track and pass all the situational factors that could affect cell formatting, such as cell colors, alignment, value formatting with things like "$123,456.00" for some dollar amounts, etc. The situational logic would go something like, "If this is in region X and is one of the special product categories specified by person Y after date Z, then make the cell red".

Some factors that affectted formatting came from parameters passed to "reportEngine", and some were attached to the data structure because they were more cell-specific. Different situations would use different factors such that for consistency I had to assume the widest number of factors. I was shuffling around too much info through the report engine. It was a middle-man that shouldn't have to care about all this situation-specific info.

I decided that a better solution would be a data-driven one that separated the format calucation from display. A structure/table like this was more useful:

  struture: reportCells
  ---------------
  rowID
  columnID
  rawValue  // needed for totals calcs
  cellFormatString  // TD attributes
  displayValue  // may have formatting, such commas in big numbers.

One routine generates the reportCells table/structure, and the other simply displays it. The display routine no longer has to care about domain factors that may affect the display. Here is a rough idea of how the display portion works:

  function displayReport(reportCells, ...) {
    ...
    while (R = rowLoop(...)) {
      print("{tr}");  // HTML row start
      while (C = columnLoop(...)) {
         print("{td %1}", reportCells[R,C].cellFormattingString");
         print("%1{/td}", reportCells[R,C].displayValue);
      }  // next C
      print("{/tr}");  // HTML row end
    } // next R
    ...
  }
  // (Braces used in HTML tags instead of angle-brackets to not confuse wiki.)

The "computation" of the cell formatting no longer has to pass through the display portion. It is already done in a prior step. (This is an oversimplification, but gives an idea of what is going on. For one, the data structure may have had other columns used in other calculations and stuff, but it didn't matter because the display routine ignored extra columns.)

The downside is that some of the looping structure is duplicated in the format calculator and the display portion for many usages (but not all because the order of calculation depends on usage-specific issues). But it is worth it.

In this case, use of data-driven programming and SeparateIoFromCalculation avoided the need for HOF's, and simplified the app.

CategoryInfoPackaging