Ocaml Type Safety Problem

The combination of ObjectiveCaml's type system and some of its libraries is unsafe in the sense that legal programs can result in UndefinedBehavior - i.e. it is not StronglyTypedWithoutLoopholes.

The problem occurs only as a result of using functions, implemented as part of the standard library or using the ForeignFunctionInterface, that have a type variable mentioned in the return type but not in an argument type.

An example of such a function is Marshal.from_string : string -> int -> 'a (see http://caml.inria.fr/ocaml/htmlman/libref/Marshal.html):

 # let x = Marshal.to_string [1;2;3] [];;
 val x : string = "����\000\000\000\007\000\000\000\003\000\000\000\t\000\000\000\t A B C@"

# let y = Marshal.from_string x 0;; val y : 'a = <poly>

# let z = (y : int list);; val z : int list = [1; 2; 3]

# let z1 = (y : in_channel);; - : in_channel = <abstr>

# let z2 = input_line z1;;
... Memory access error.

At first sight, it may seem as though the problem is with the expression (y : in_channel). After all, the program appeared to work as long as y was treated as having a type compatible with int list. Consider the following CeeLanguage code, which looks as though it is doing something very similar:

 typedef uint8_t *string;
 typedef void (*in_channel)();

string to_string(void * x); void * from_string(string s);

{ int a[3] = {1,2,3}; string x = to_string(a); void * y = from_string(x); int * z = (int *) y; in_channel z1 = (in_channel) y; z1(); }
The C code has UndefinedBehavior because (in_channel) y is an unsafe cast.

However, 'a in OCaml is not analogous to void *, and casts of the form (expr : type) are normally allowed only when they are guaranteed to be statically typesafe. So what is going on that causes the OCaml code to be unsafe, only for values returned by Marshal.from_string and similar functions?

To understand this, we need to distinguish between type expressions and types. Type variables like 'a have a different meaning in each case. In a type expression, a type variable is just a placeholder to be filled in by the type inferencer. In a type, it is universally quantified. Universally quantified types are explained in LucaCardelli's paper "OnUnderstandingTypes, Data Abstraction and Polymorphism". Below, I'll write the universal quantifier explicitly, e.g. \forall 'a ('a), to hopefully make this clearer. I've also attempted to put expressions in italics and types in boldface.

For instance:

 # ((42, "hello") : 'a*'b);;
 - : int * string = (42, "hello")

'a*'b is a type expression, not a type. The type inferencer has worked out that the whole expression can be typed by assigning int to the placeholder 'a, and string to the placeholder 'b. If any type variables remained in the *output* of the inferencer, they would implicitly be universally quantified. However, despite appearances, (42, "hello") is not of type \forall a,b ('a*'b), and (int, string) is not a subtype of \forall a,b ('a*'b).

Universal types can be viewed as intersections over a type variable. Consider, for example, the identity function, of type \forall 'a ('a -> 'a) [which is printed just as 'a -> 'a]. This type assignment means that if we instantiate the type variable 'a to any type T, the identity function can be viewed as having type T -> T. If a function is both of type A -> A and of type B -> B, then it must also be of type (A -> A) intersection (B -> B). Hence 'a -> 'a is equivalent to the intersection of (T -> T) over all T.

If a function had the sum/union type (A -> A) union (B -> B), on the other hand, we would know only that it either maps A to A, or maps B to B. Suppose A is the type of integers; then the function lambda x . x+1 has this sum type, but it does not have the corresponding intersection type (in general).

Based on how 'a is being used in the examples above, there is some reason to think that OCaml might be treating it as denoting a sum or union type. However, that interpretation would not explain why the expression (y : in_channel) is allowed.

In fact, \forall 'a ('a) is equivalent to the BottomType in the type lattice - the intersection of all types - and (y : in_channel) is allowed because \forall 'a ('a) is a subtype of in_channel. The compiler is not treating 'a as a special case; BottomType is necessarily a subtype of all other types, and the subtyping rules for polymorphic types in OCaml are the same as in many other languages.

If we view types as sets of values, then \forall 'a ('a) corresponds to the empty set. This is the source of the problem: there should be no values of this type (it should be "uninhabited"), but in the example, Marshal.from_string returns a value y of type 'a. So the type signature Marshal.from_string : string -> int -> 'a is wrong. The inferencer cannot tell that it is wrong, because the Marshal module is not implemented in OCaml. When the expression (y : in_channel) is reached, the typechecker concludes that it cannot fail, because it was presented with an incorrect assumption about the type of y.

Note that the reason why this code is unsafe is not the same as why the apparently similar C code is unsafe. In C void * is essentially the TopType for pointers. in_channel is not a subtype of void *, and so the C typechecker doesn't conclude that the cast is safe, it just allows it regardless.

Although the immediate cause of the problem is the type signatures of modules like Marshal, Parsing, etc., the design of the type system also plays a role, by not being expressive enough to allow dynamic type checking. In many statically typed languages, there is a TopType (for example GENERAL in Eiffel) that can be used to refer to any value, and which can be "downcast" to a more specific type with a run-time check. (Java is similar except that primitive values cannot be typed as java.lang.Object.) It is the lack of support for such a type and for run-time type checks that resulted in Marshal being designed with an inaccurate type signature.

It is quite feasible and well-understood how to add dynamic typing to a language without any loss of efficiency for statically typable programs. This is described in http://citeseer.ist.psu.edu/abadi89dynamic.html : add a type "dynamic", whose values are opaque except when used in an expression "(value : T)", which checks that the value is of type T. This does not require non-dynamic values to carry any type information.

Alternatively, without changing the language, it would be possible to define a fixed type that encodes some universal data model (e.g. the data model of XML or EssExpressions), and write a marshalling/unmarshalling module for that type. The result would have to be picked apart using PatternMatching. This wouldn't be nearly as convenient as either the current design or one using a dynamic type, but it might be useful to people who need to do unmarshalling of potentially untrusted input in OCaml as it stands.

See also: IssuesForLanguageDesigners, SafetyAndExpressivenessDiscussion.


J�rgen Hermanrud Fjeld writes (on the list <http://www.coyotos.org/mailman/listinfo/bitc-dev>):


There's an interesting article on LambdaTheUltimate (http://lambda-the-ultimate.org/node/view/163) about an experimental language called "acute" that addresses many of the concerns raised here.

How considerate of them to design acute and post to LtU just in time for the discussion here. :-)

AliceLanguage also supports typesafe [un]marshalling.


ThreadMode discussion below


The Marshal module is intended to produce streams that can "be written on a file or sent over a pipe or network connection" (http://caml.inria.fr/ocaml/htmlman/libref/Marshal.html). So not making unmarshalling typesafe is arguably a poor design decision for two reasons:

Note that the Marshal module fails to be typesafe because it is written in an unsafe language; it couldn't be written in OCaml.

An example of the problem is Marshal.from_channel : in_channel -> 'a, and other functions that return an unknown type. The programmer then has an unsafe cast exactly like C/C++ countervariant in order to be able to use the value, and this may generate SIGSEGV or type errors. Mind you, user code is in this case 100% pure OCAML. This kind of safety issue is moot in systems that preserve run time type information (Java, Scheme, Lisp, etc). The design decision was to put a runtime for which all data is opaque bytes rather than typed values. This makes it faster but less safe.

 # let x= Marshal.to_string [1;2;3] [];;

val x : string = "����\000\000\000\007\000\000\000\003\000\000\000\t\000\000\000\t A B C@"

# let y= Marshal.from_string x 0;;

val y : 'a = <poly>
The type system has been broken at this point, by the implementation of the Marshal module returning a value inconsistent with its declared type. In fact, no value can legitimately have type 'a in OCaml's type system. This is because \forall 'a ('a) is equivalent to the intersection of all possible types, which is the bottom element in the type lattice. (If types are viewed as sets of values, this corresponds to the empty set.) There is no way to produce a value of type 'a except by use of the Marshal module (or the ForeignFunctionInterface). Hence, it is the Marshal module that must be "blamed" for creating a value of a type that should be uninhabited (have no values). IOW this is a library design problem, not a type system design problem. UndefinedBehavior occurs immediately - although a SIGSEGV obviously need not occur immediately, or at all.

 # let z= (y : int list);;

val z : int list = [1; 2; 3]

# let z1 = (y : in_channel);; (* type error, unsafe language expression not related to a particular module *)

- : in_channel = <abstr>

# let z2 = input_line z1;;
... Memory access error.

Look at the following piece of C code

  {
    char *x = "abcdefghijklmnop";
    char ****y;
    void *z;
    int l;
    z = x;
    y = (char ****) z;  // corrected
    l = strlen (***y);
  }
No warning whatsoever (hope so, I didn't check, but anyway perfectly legal), and SIGSEGV is all but sure thing. Can you attribute SIGSEGV to the unsafe downcast y = (char ****) z ? If you do then you should also attribute the problem in Ocaml to let z1 = (y: in_channel);

In this program, "y = (char ****) z" is the first statement that causes an inconsistent type assignment. In the OCaml program, "let y = Marshal.from_string x 0" is the first statement that causes an inconsistent type assignment.


Prematurely trying to reach a conclusion follows:

Well, in the end, after all these interesting discussions, I can only conclude that our differences are not sensibly on the substance of it. You advocate that Marshal.unmarshal should return a type "dynamic" along the lines described by Cardelli, and therefore you attribute this unsafety to the typing design of Marshal. I can still in all honesty say that it's absolutely a language design issue, because were you to write a better Marshal you absolutely wouldn't be able to do it without a modified Ocaml language where the type dynamic would be absolutely special, provided by the language design in collaboration with the compiler and possibly runtime implementation. The type "dynamic" cannot be expressed in the current type system of Ocaml. I would also note that Marshal is not the only module in the standard library that can be used to create expressions typed 'a.

I think we are close to agreement (you've changed my mind to some degree on the relevance of the type system design to the problem), but we should first resolve the sum vs intersection issue in order to get a coherent refactoring.

As to whether the type 'a is a direct sum or an intersection, you have a good argument by analogy with all the other quantified types, but I can only observe that the existence of identifiers typed as 'a is again a language design issue, and the type 'a is a one time special with its own different rules for the compiler, it is most definitely not treated by the compiler in the same way that say 'a->'a or 'a*'a or any other quantified type is treated by the compiler. A proof of that is the other way that you can break safety by populating arrays with it, or feeding it to module or functors, for example try to push it onto a stack or use it with any other collection types, it will help you break the type safety.

The subtyping rules that apply to 'a in OCaml are perfectly standard and ordinary, the same as the rules that apply in many other languages. However, to understand OCaml's type system it's necessary to distinguish between type expressions and types. Type variables have a different meaning in each case. In a type expression, a type variable is just a placeholder to be filled in by the type inferencer. In a type, it is universally quantified.

For instance:

 # ((42, "hello") : 'a*'b);;
 - : int * string = (42, "hello")
'a*'b is a type expression, not a type. The type inferencer has worked out that the expression can be typed by assigning int to the placeholder 'a, and string to the placeholder 'b. If any type variables remained in the output of the inferencer, they would implicitly be universally quantified. However, despite appearances, (42, "hello") is not of type \forall a,b ('a*'b), and (int, string) is not a subtype of \forall a,b ('a*'b). (In fact the latter type is also uninhabitable, like 'a.)

So 'a is not being treated differently from other uses of type variables. Given pure OCaml code, the type inferencer will never infer any uninhabitable types. The problem is that the type inferencer cannot be run on modules implemented in unsafe code. Instead it just takes the declared type as gospel. But the declared types for Marshal.from_string etc. are wrong, in the sense that they are not consistent with treating type variables as universally quantified. The OCaml compiler and code generator blindly use subtyping rules that assume that previously inferred types are correct, and fall over when this assumption proves to be false.

You claim that 'a should be the bottom type and therefore uninhabited, but if your claim would be consistent (and I do not hold you for its inconsistency because the whole type system is made inconsistent by this one time special put in place to allow for modules like Marshal, Parsing, and others to do their job) then 'a identifiers should be disallowed by the language, because ML by its design does not allow for the existence of uninitialized anything, not top value or local identifiers, not fields in array, nor in structures, and also ML does not have null. So it follows that if a type like 'a is indeed uninhabited than the compiler should forbid it, altogether. Again this is not the case because in my opinion 'a is a one time special.

This is actually a really fascinating topic for TypeSystemWeenie?s :-) I quote from "A Modern Perspective on Type Theory", ISBN 1-4020-2334-0 , p60. It proves a lemma about the RamifiedTheoryOfTypes?, a very early mathematical type system used in PrincipiaMathematica?, to the effect that all RTT types are inhabited. Then it says:

From a modern point of view, this is a remarkable lemma. Many modern type systems are based on the principle of propositions-as-types [...]. In such systems types represent propositions, and terms inhabiting such a type represent proofs of that proposition. In a propositions-as-types based system in which all types are inhabited, all propositions are provable. Such a system would be (logically) inconsistent. RTT is not based on propositions-as-types, and there is nothing paradoxical or inconsistent in the fact that all RTT-types are inhabited.

[...]

The rules of RTT [...] have a bottom-up character: one can only introduce a variable of a certain type in a context Gamma, if there is a [propositional function] that has that type in Gamma. In lambda_->Church, one can introduce variables of any type without wondering whether such a type is inhabited or not.

It is possible (although not the only possibility) to view OCaml's type system as a propositions-as-types system [CurryHowardIsomorphism]. The proposition that 'a represents is "for all propositions A, A is true". So a term inhabiting 'a (i.e. an OCaml expression returning a value of type 'a) represents a proof of all propositions, making the type system inconsistent. As the book points out, this reasoning need not hold for all type systems, but it holds for most type systems derived originally from the simply typed LambdaCalculus - including OCaml's.

Being able to declare an identifier bound to a variable of type 'a is like being able to think about the statement "for all propositions A, A is true" without asserting it. This is less useful in the type system context than it is in the formal system context, but it is not useless. It turns out that the bottom type can be used to express the "return type" of a function that does not return - i.e. of a FirstClassContinuation. For instance a continuation that takes a single integer argument can be given the type "int -> bottom" (sometimes written "int -> 0"). This allows the usual subtyping rules for function types to work for continuations. Also, some type systems are BooleanLattices - i.e. types are closed under intersection, union and negation - in which case the bottom type can be expressed e.g. as "not top", and it is pointless to prohibit that.

(Of course OCaml does not have FirstClassContinuations and its type system is not a BooleanLattice, so maybe this is a little beside the point for OCaml. Nevertheless it's applicable to the design of new languages.)

As to how to fix it, Cardelli's proposal has its merits. But associating a type identifier with all the values does have its cost, but I would note that this costs seems to be in increasing memory usage, no extra CPU needs to be spent because of this thing. Having all the values carrying their types at run-time may have benefits other than serialization/deserialization, for example would allow for a reflection module and having types as first class values. This features have proved their usefulness through languages from Lisp to Java in many industrial applications. With memory becoming cheap these days and 64 bit processors for the masses becoming a reality, I wouldn't worry about that type tag too much. -- CostinCozianu

I think you've slightly misunderstood the intended implementation of the proposal. Most values do not carry their types at run-time; only "dynamic" values do. There is an implicit (always successful) conversion from any other statically known type to dynamic, which adds the statically known tag. (Correction: no there isn't; it's an explicit conversion.) There is also an explicit conversion from dynamic values to static ones, which checks the tags and then removes them where possible (or results in them being ignored).

So, there is no additional memory (or CPU) usage at all for statically typable programs; only for parts of programs that actually use dynamic types. SomeLunchesAreVeryCheap?.


BTW, thanks for this discussion - it's been helpful for me since I'm designing a capability system in which OCaml is intended to be one of the supported languages. -- Daira Hopwood (That was many years, two name changes and a gender transition ago, and I'm still designing it :-)


CategoryLanguageTyping


EditText of this page (last edited March 26, 2013) or FindPage with title or text search