An embedded document is when one document (often a structured text file, or a binary, or anything else) is embedded within another. (This discussion assumes that the result is a linear sequence of bytes/characters--use of more advanced filing systems is beyond the scope of this discussion). A common instance of this found in programming occurs when mixing programming languages--embedding a Perl script or sed script inside a shell script; embedding HTML inside a C program; etc. For purposes of this discussion, we limit ourselves to the case where the enclosing document is a text-based programming language or DomainSpecificLanguage which will be hand-edited; binary embedding is a different topic altogether.
The general problems are as follows:
- Determining the span of the embedded text. For machine-generated files, the obvious solution is to store the length of the embedded text; this is what programs such as the Unix archive utility (ar) do. For human-edited files; this is impractical. Other solutions must be derived.
- The embedded text may not obey the formatting requirements of the host language; the obvious example is embedding C or Java code in HTML or XML. Both CeeLanguage and JavaLanguage are replate with the characters < and &, which are special to both HTML (HyperTextMarkupLanguage) and XML (ExtensibleMarkupLanguage) or any other SGML (StandardGeneralizedMarkupLanguage) derivative. Either processing of < and & must be disabled (which is so in an XML CDATA section), or the special characters or character sequences must be quoted--in this case one uses < and &. (Which, if you're editing the enclosed file, looks real UgLy).
- Often, it may be desirable to "unquote" portions of the document (and embed snippets of the host language within the quoted document).
Several ways to do this are:
- Length encoding, mentioned above. Not appropriate for human edited documents, for obvious reasons.
- HereDocuments. Commonly used, seem to work well in practice. Occasionally, you run into a "gotcha" where the enclosed document contains an unquoted terminating sequence and all heck breaks loose.
- Use of separate files -- XML (ExtensibleMarkupLanguage) requires this for embedding binary files (anything that might contain null characters or other XML-verboten stuff). The enclosing document specifies a filename or UniversalResourceIndicator?, the tool processing the file is expected to go find it and read in in if needed. Can be a major problem if the pointed-to document wasn't copied along with the enclosing document, or some search path isn't configured correctly, or if you don't have a filesystem available, etc.
- Use of a CommonSubstrate?. Documents in one XML application have no problems embedding other XML documents (even in a different XML application); as XML is designed to mix and match formats in this way. This doesn't help, of course, if you are trying to embed a legacy format that doesn't use the common substrate. EssExpressions are another popular CommonSubstrate?. Use of a CommonSubstrate? isn't always appropriate when the substrate defies the textual conventions of a particular application area.
See also LiterateProgramming
CategoryDocumentation