Here we explore the information loss that happens because humans cannot remember, reproduce, or transfer information accurately.
This is not related to InformationTheory, but it is related to OnceAndOnlyOnce, CopyAndPaste, and DuplicatedCode.
When a programmer makes a good design decision that violates commonly established rules and says it comes from experience, that decision can often be explained in terms of information loss.
When a user says that an application's user interface is inconvenient, the inconvenience can often be explained in terms of information loss.
Many established rules and patterns can be viewed as guides for avoiding information loss in specific cases.
As a result of information loss, a program has to dig the necessary information out of inconvenient structures, or the programmer has to recall it from human memory.
InformationLoss of PrimaryInformation is a BadThing, and it is often caused by DuplicatedCode.
InformationLoss of SecondaryInformation may be good or bad; at worst it causes poor runtime performance.
The obvious programming rule: a programmer should avoid information loss of PrimaryInformation.
Copy and paste and duplicated code are information loss about similarity
OOP CodeReuse (inheritance and OnceAndOnlyOnce) helps to avoid certain problems that come with duplicated code. Those problems do not appear immediately after CopyAndPaste happens; they appear only when the programmer modifies one code snippet and forgets to modify the other one. The key word here is forget. He forgets because he lost the information about the fact that he did the copy and paste. So the problem happened because of information loss about the copy and paste.
Now, if you really need to copy and paste because of some language inconvenience, you can create a program that automatically copies (duplicates) the information from one place to another. Then you edit only one place and can forget about the second one, which is guaranteed to stay in sync, as in the sketch below.
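Here is a minimal sketch of such a synchronizing program (the file names and the BEGIN/END markers are hypothetical, and both files are assumed to contain the markers); run as a build step, it re-copies the snippet from its single editable source into the other file, so the duplicate can never silently drift:

  # Minimal sketch: re-copy a marked snippet from its one editable source
  # into a file that needs a duplicate of it.
  import re

  SOURCE = "validation.py"        # the one place you edit (hypothetical name)
  TARGET = "legacy_module.py"     # the place that needs the duplicate (hypothetical name)
  BEGIN, END = "# BEGIN COPY", "# END COPY"

  def sync():
      with open(SOURCE) as f:
          snippet = re.search(f"{BEGIN}\\n(.*?){END}", f.read(), re.S).group(1)
      with open(TARGET) as f:
          target = f.read()
      # Overwrite whatever currently sits between the markers in the target.
      updated = re.sub(f"{BEGIN}\\n.*?{END}",
                       lambda m: f"{BEGIN}\n{snippet}{END}",
                       target, flags=re.S)
      with open(TARGET, "w") as f:
          f.write(updated)

  if __name__ == "__main__":
      sync()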
Refactoring brings lost information back into the code or data.
During refactoring the programmer removes duplicates by recording the discovered information about the duplicates/similarities in such a way that the resulting functionality stays the same. That information is saved in the form of a parametrized function or a code generator.
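For example (an invented illustration, not from the original page): two snippets that were copy-and-paste twins can be replaced by one parametrized function, so the "these are the same except for the rate" information now lives in the code instead of in the programmer's memory.

  # Before the refactoring there were two copies that differed only in the rate:
  #
  #     def price_for_members(items):
  #         return sum(i.price for i in items) * (1 - 0.10)
  #
  #     def price_for_students(items):
  #         return sum(i.price for i in items) * (1 - 0.25)

  # After: the similarity is recorded once, as a parameter.
  def discounted_price(items, discount):
      return sum(i.price for i in items) * (1 - discount)

  def price_for_members(items):
      return discounted_price(items, 0.10)

  def price_for_students(items):
      return discounted_price(items, 0.25)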
Encapsulation or information hiding prevents information loss about independence
A programmer tries to hide information and decouple code snippets so that he can use the information about their independence when he needs to modify something. If code snippets become coupled because of a violation of InformationHiding, the information about their independence is lost: either the snippets really are dependent now, or the programmer forgets whether they are independent, because he knows he did not always follow information hiding.
When encapsulating, what you're doing is separating certain information which is meaningful from other information which is meaningless. So someone who violates information hiding is not just losing but destroying the logical architecture of the system, blurring the significance of APIs with the insignificance of their implementation details.
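A small sketch of this point (the class and its callers are invented for illustration): the public methods carry the meaningful information, while the storage format stays a hidden detail that can change without the callers ever knowing.

  class PhoneBook:
      """Callers may rely on add() and lookup(); the storage is a hidden detail."""

      def __init__(self):
          self._entries = {}          # implementation detail: a dict keyed by name

      def add(self, name, number):
          self._entries[name] = number

      def lookup(self, name):
          return self._entries.get(name)

  # Code that uses only the API stays independent of the representation;
  # reaching into _entries directly would couple it and lose that independence.
  book = PhoneBook()
  book.add("Alice", "555-0100")
  assert book.lookup("Alice") == "555-0100"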
Code generation (only active code generation) does not cause information loss
ActiveCodeGeneration produces duplicated code, but the CodeGenerator contains the full information about how to create those duplicates, so no information loss happens. PassiveCodeGeneration causes information loss and consequently should not be used.
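A hedged sketch of ActiveCodeGeneration (the field list, accessor shape, and output file name are assumptions made for the example): the repetitive accessors are regenerated from one specification on every build, so the generator, not human memory, holds the information about how the duplicates relate.

  FIELDS = ["name", "email", "phone"]     # the single source of truth

  def accessor_code(field):
      # The one place that knows the shape of every generated accessor pair.
      return (f"    def get_{field}(self):\n"
              f"        return self._{field}\n\n"
              f"    def set_{field}(self, value):\n"
              f"        self._{field} = value\n")

  def generate():
      lines = ["# GENERATED FILE - do not edit; change FIELDS and regenerate.",
               "class Contact:"]
      lines += [accessor_code(f) for f in FIELDS]
      return "\n".join(lines)

  if __name__ == "__main__":
      with open("contact_generated.py", "w") as out:
          out.write(generate())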
Database normalization
DatabaseNormalization rules help to remove duplicates and unwanted functional dependencies, which prevents information loss about similarities and independence, and they help to keep the necessary information and relations that keep the data clean, which is also prevention of information loss.
Sometimes normalization is broken for performance reasons, but even then information is not lost, because in a good design the data duplication is handled in one place (a trigger, for example), and that one place contains the information about the similarity.
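A toy sketch of that idea outside SQL (the schema is invented): the denormalized total duplicates what is already in the line items, but the duplication is maintained in exactly one place, playing the role a trigger plays in a database.

  orders = {1: {"lines": [], "total": 0.0}}   # "total" is the denormalized copy

  def add_line(order_id, amount):
      # The one place that knows about the duplication and keeps it in sync.
      order = orders[order_id]
      order["lines"].append(amount)
      order["total"] = sum(order["lines"])

  add_line(1, 10.0)
  add_line(1, 2.5)
  assert orders[1]["total"] == 12.5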
User interface context
Many user interfaces are inconvenient, and the source of that inconvenience is often the loss of context information.
Earlier versions of MS Windows did not support hibernation, which was inconvenient: users were forced to discard the context of their work every time they turned off the computer.
Use CopyAndPaste if you want to delete
Many employers take steps to protect their source code from being deleted by bad programmers by restricting deletion permissions, but they take no steps to prevent code duplication. So every programmer who has permission to add code can copy and paste it many times in many places, which has much the same effect as deleting it.
I think what Aleksey is saying is that all human-produced information derives some of its meaning from context, since we never use context-free grammars. So to preserve all of the information in some code, you have to maintain its context. If you copy it then you are decontextualizing it, whereas if you refer to it or transclude it then you add new context to the code while preserving the context it was originally in. Of course, you're not preserving the time sequence of contexts, but that's generally not considered important.
Additionally, when a programmer duplicates code, their intent is to convey "this does the same thing as that other code". The problem is that this information about different sections being duplicates ends up in a form that's difficult to regenerate, and the difficulty is proportional to the size, or perhaps even to the square of the size, of the program.
Even when you can use an automated search tool, you need to manually trigger this search tool. This is expensive enough that you need to first believe that there are duplicates, and this you generally do if you've encountered a duplicate in the last X days. So excluding local effects, the probability of your looking for duplicates that do exist is inversely proportional to the size of the code base. And as a result of false positives, it's likely that the probability of success is inversely proportional to the size of the code base even with automated searching.
In contrast to all these problems of manual duplication, automatic forms of duplication, such as code generation, transclusion, and extraction of shared code, successfully preserve this valuable meta-information that different sections are duplicates of each other.
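For a sense of what the automated search mentioned above might look like, here is a deliberately naive sketch (the window size and file pattern are arbitrary assumptions, and real CloneDetection tools are far more sophisticated): it hashes fixed-size windows of normalized lines and reports any window that appears in more than one place.

  import hashlib
  import pathlib
  from collections import defaultdict

  WINDOW = 5    # assumed minimum clone length, in non-blank lines

  def find_duplicates(root="."):
      seen = defaultdict(list)
      for path in pathlib.Path(root).rglob("*.py"):
          lines = [ln.strip() for ln in path.read_text().splitlines() if ln.strip()]
          for i in range(len(lines) - WINDOW + 1):
              digest = hashlib.sha1("\n".join(lines[i:i + WINDOW]).encode()).hexdigest()
              seen[digest].append((str(path), i))
      return {d: locs for d, locs in seen.items() if len(locs) > 1}

  if __name__ == "__main__":
      for locations in find_duplicates().values():
          print("possible clone at:", locations)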
Most of what's on InformationLoss is usually expressed, or at least can be, in terms of ImpedanceMismatch between human and machine. The way that Aleksey is connecting once and only once to other forms of conveying meaning is novel and I think insightful. -- RK
HierarchyOfInformationLoss moved there.
Related ideas
"CloneAndModifyProgramming is InformationLoss about similarity".
Copy and paste just preserves the similarity. Modification is where we lose it.
-- PhlIp
No, if we made a copy of a source and didn't create a program that will synchronize this copy with the source, then we lost information right there, even if we modify it a year later. But if we created a program to synchronize them, then we saved the information about the similarity inside this program, so it's not lost. If we didn't create this program, then the information is left in our human memory, which cannot keep it accurately, so it's lost.
...examples where information loss is a good thing?
Sure, there are plenty. Science for one, engineering, art, any field of human endeavour really, any system at all where resources are limited. In any field of human endeavour, obsolete ideas need to be forgotten, and forgotten completely, in order to free up valuable resources (time, effort and energy) and to avoid conflicting with modern knowledge. As well, a blank slate is a necessary precondition for creativity, as studies show. So information loss is generally a necessity, and it turns out that it's incomplete information loss, whether deliberate or accidental, that's the real problem.
Concrete example: mathematicians fervently believe that mathematical thought is continuous, with a several-thousand-year history stretching back in time to the Babylonians. This is false; our understanding of mathematics has completely changed in that time. Our understanding of what mathematics can do (KurtGoedel and GregoryChaitin), how to do mathematics (BertrandRussell's SetOfAllSets and modern experimentalist mathematicians), what to base it on (ZermeloFraenkel SetTheory) and even what mathematics is (WillardVanOrmanQuine's mathematical logics, plural) has completely and radically changed. Yet mathematicians persist in minimizing this information loss, trying to preserve obsolete ideas at the expense of a modern understanding of their own field. The same problem exists for quantum mechanics. There's even a page here explaining the necessity of information loss: EmptyYourCup.
This is completely general by the way and not due to the particular limitations of human minds. It's true of any system where resources are limited. So for example, internet routers drop packets they can't route or transmit for whatever reason. When there's a traffic jam, it's absolutely horrible because shedding packets only causes their retransmission which increases traffic and makes the jam worse, yet the system works this way and it could only work this way. No possible rigorous standard would have met with such widespread adoption.
There's something you're not getting about copy & paste and encapsulation and their applicability to information loss. What you say about copy & paste causing information loss and encapsulation preventing it is true. But you're doing a poor job of explaining it because you don't distinguish low-level information from system-level meta-information. The whole point of encapsulation is to prevent meta-information loss. And the fact that you don't make this distinction is a real problem, far bigger than just being unable to explain things, because encapsulation doesn't just prevent meta-information loss, it encourages low-level information loss. Encapsulation and information hiding help programmers get rid of information, help them make the system completely forget, in every detail, some implementation that's become obsolete.
The relationship between low-level information and system level meta-information is general too. Science is a social meta-system (community) whose purpose is to generate intellectual systems (theories). It would be helpful to the health and coherence of the metasystem (science) if certain obsolete theories and paradigms were systematically purged from the education and library systems, dumping all of this obsolete material into archival storage where it will be possible for historians to reinterpret it at leisure. This would have to be done in at least two stages. First you purge all pre-quantum physics, and then after a new generation of scientists has come to dominate, you get rid of all pre-70s quantum physics.
The real insight you're circling around is that meta-level information is much more valuable than low-level information, so the former must be preserved at the expense of the latter and not vice versa. It's not about information preservation at all; it's about making a different information tradeoff. Note that Smalltalk doesn't merely put high-level information on an equal basis with low-level information, it raises the former above the latter. So in Smalltalk it's possible to specify the existence of a method, and even use that method, without ever specifying how it does anything. And such a way of programming is considered a Best Practice.
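A rough Python analogue of that Smalltalk practice (the class and method names are invented): the existence of save() is specified and used without ever saying how it works, so only the meta-level information is stated.

  from abc import ABC, abstractmethod

  class Repository(ABC):
      @abstractmethod
      def save(self, record):
          """Persist a record somewhere; how is deliberately not specified."""

  def archive_all(records, repo: Repository):
      # This code uses save() knowing only that it exists, not what it does.
      for record in records:
          repo.save(record)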
Well, that's one insight, I'm sure there are more. EricVonHippel deals with related insights such as that information is sticky. I wonder how that insight plays into the information tradeoff insight. -- RK