Processing Markup Languages

(Moved from StructuralRegularExpressions)

I don't know if its related, but parsing XML and markup languages brings up an interesting question: what does the result look like? As a table-head, I'm tempted to draft the parsed result as something like:

table: tag ---------- tagID sequence // may be superfluous with ID, depending on environment tagName parentRef // parent tag (0 if no parent) isClosed // "1" if self-closing such as <foo x=1\> // Alternative closing tracking to better handle bad or non-trees: closeType // B=beginning tag, E=ending tag, S=self-closing

table: attribs -------------- tagRef // f.k. to tag table attribName attribValue

A dummy tag, such as "intertext", can perhaps be created for text between tags. What would non-table-heads choose as the output format? Iterators?

{A formal Grammar (as per the ChomskyHierarchy) can be more general than any regular expression - i.e. capable of matching anything a regular expression can match, plus plus. The basic output of matching for any grammar is very simple: a boolean indicating whether the structure is a match - is 'legal' within the grammar. The same would be said for structural regular expressions - simply returning a boolean as to whether the structure matched. Of course, it's also nice to know -why- a match occurs. To do that, you essentially add variables to the regular-expression or grammar, and you also need to handle choices (where a component is X or Y or Z, you need to know which 'choice' was made). The 'return' from a match in this case is the choice and its associated variables. Very simple. Things get a tad more complicated if you wish to receive recursive or list variables, but only a tad more.}

{As far as parse-results in general, either a single hierarchical value-structure, or a stream of such structures (depending on the medium), is most natural for the result of a parse. Database tables are best used for data - literally, propositions held to be true - not for partial values that individually say nothing of their world or context. Also, you need to be careful with tables: ordering is relevant in the parsing of XML and SGML.}

I believe your response reflects the BigIron RDBMS bias. Tools like a NimbleDatabase and DynamicRelational can transcend the BigIron stiffness of the current flavor of RDBMS. And, relational is perfectly capable of storing and using ordering-related info. --top

{Relational can store ordering if you include some sort of ordering identifier - a double or integer, for example. That is not reflected in your above schema.}

It's called "sequence" above and has been there from the beginning.
{Ah. That's what "sequence" meant, then. It is unclear from the above spec.}
I'm curious, how did you interpret it?
{I thought it a boolean indicating that the tag starts a sequence.}
Interesting interpretation. I guess that's why examples are nice because it is impossible to guess all the different ways one can interpret it differently than intended by the author. By the way, I'd use "isSequence" for a Boolean.

{And NimbleDatabases are still databases. They ought to store meaningful data. Structural information on the tags in the file is not particularly meaningful. Admittedly, at least, you intend this 'table' to be extremely transitory and subject to further processing. But I can't see it as a great deal more useful than storing structural information on anything else... e.g. on text files:}

  table: my-text-file
  ------------
  char  // unicode codepoint
  pos   // position in file (integer)

{That you CAN extract and represent structure in a set of tables by introducing artificial identifiers and references doesn't make it particularly useful to do so. You maintain the exact same information, and at least the same capability of processing it, simply by using a plain data-structure - a string for text-files, and a hierarchical structure for XML. If you were to extract semantic information (e.g. relating books to their authors) from the XML file into a table, I'd have far fewer objections. Unfortunately, even XML schema never specify semantics.}

I am not following your argument. If one wants to process information from an XML/HTML document, using it in raw form can be daunting. SeparateIoFromCalculation applies here. I don't want to mix parsing with processing if I don't have to.

{I'm not saying you should mix parsing with processing (at least for XML; streaming text would be a different matter). I'm saying that a table doesn't help. It provides no benefit whatsoever over a hierarchical datatype (since you NEED the structure in order to make sense of an XML document), but they're a bit harder to work with (since you'll be rebuilding the structure in order to make sense of the XML document). Tables can be made to work, but they aren't the right tool for the job. Everything you break down into that table will need to be built up again to make sense of it. It's a little bit like the tables are artificially injected just to look neat... like GoldPlating for the parsing problem.}

It is only gold-plating under the BigIron view of tables. Again, I honed my nimble table skills in ExBase, which made creating, filling, and inspecting tables a snap, and far more visual/inspectable than arrays or linked lists. That's NOT gold-plating, that's using powerful tools for programmer productivity and structure transperency/x-ray-ability. I don't think you "get" the power of tables. The same project in say Java would take roughly 5 times more code and 5 times the amount of time. In a scriptish language, the diff would be smaller, but still there. (The only signif downside in Xbase is obtaining arbitrary-lengthed columns without relying on "memo" fields, which is a tool-specific problem. DynamicRelational is possible.) --top
{Whether a programming environment makes it easy to inspect arrays, linked lists, tables, and other data-structures is truly based on the environment, not so much on the data structure. If you're asserting that tables are somehow fundamentally more transparent than arrays and linked lists, you're certainly wrong (trivially, one could view any array as a table). And your claim that 'YouJustDontGetIt' is also wrong; I'm quite familiar with the 'power' of tables. What I "don't get" is your silly and entirely arbitrary insistence that any value more complex than a string is some sort of sin against the IT profession.}

Plus, if its in a structure, one can go back and forth and up and down etc., while during-parse iterators (getNextTag(...)) generally cannot.

{I did not propose use of iterators. I'd much prefer to take a hierarchical data structure of the XML, and write a FoldFunction over it (using Haskell or OCaml-style pattern-matching) that extracts semantic information, and transform it in one step to represent the semantic information. XML is pure structure, and its semantics are largely embedded in that structure; you'll always need such a step to capture semantics. I'd be quite willing to drop into a nimble database the semantic information extracted from the XML.}

Comparison to a raw text file is not applicable because there is no pre-digestion into tokens of some sort in a character-per-character representation. With the tag tables, one is dealing with XML statements and their related attributes, not individual characters. It goes up the abstraction ladder from mere characters. It's at the token and statement level, not at the character level. Maybe I'm not understanding what you perceive as the alternatives.

{Comparison to raw text file is applicable, though perhaps a little extreme. It actually does some pre-digestion, btw - to get unicode codepoints, you'll generally need to "pre-digest" UTF-7, UTF-8, or UTF-16. But if use of 'tokens' are your primary objection, it wouldn't hurt my example to separate the text-file into tokens. Either way, having just the structure of the text is quite pointless - it isn't useful data. Neither the text format nor the XML format you propose offer real 'data' about anything except representation of 'data' in the file.}

  table: my-text-file-2
  -------------------
  token   text-word or other token (punctuation, formatting)
  pos     position in file (integer)

{It might be hard to see where I'm coming from without a couple examples... e.g. an XML file for configuring a GUI, and/or an XML update format for multi-library purchases, checkins, and checkouts.}

{As a note, when I'm feeling lazy about it and am uncaring about efficiency, my approach looks something like:}

  XML => Node (value-object) with sequence of sub-nodes (including text-nodes) and map of attributes
  Node with sequence of sub-nodes and map of attributes => function that extracts all semantic information
  Semantic information => manipulation of application (change settings, update DB, etc.)

{Iterators might be used over the sequences and maps for languages that lack efficient pattern-matching or fold operations (e.g. C++), but would be performed over the structure, rather than directly out of the XML. More relevantly, the function that extracts information is generally somewhat recursive and/or 'deep'... e.g. to construct recursive components. For a GUI-description, for example, it needs to design components that set up windows with certain widgets and display hooks (gauges, lists, etc.) and operational hooks (on click) and whatnot. I've done such when writing an OpenGL-only GUI for an in-video-game menu system. Relevantly, if I had put all this into table, I could have made that work too... but I'd need to do all those joins just to acquire the structure back in order to build the GUI.}

{Here's a question I have for you: (Other than the fact that you simply love tables and want to inject them into every operation you perform ;-) Why would you break down a hierarchical structure when you'll just need to do re-work to build it again? Especially for something as transitory and (by itself) meaningless as the XML parse result?}

I don't understand the question. And, XML is not necessarily hierarchical. How about you present a specific scenario. Otherwise, we'll probably keep talking past each other via generalizations.

{With XML you're legally limited to exactly one tag if you wish to entirely avoid hierarchy. More relevantly, the cases where the hierarchy is shallow are rare; there's not much point in arguing what is 'necessarily' true when actual truth is readily available. As far as talking past one another... that isn't really possible without you doing a bit more talking. I will use a scenario if you present one. The above question I have for you is from the 'then what?' aspect of breaking the XML hierarchy down so you can fit it into a table. So you broke it down... then what? My conjecture, from both experience and analysis, is that the 'then what' is (except for some possible rare cases) to rebuild the hierarchy so you can extract meaningful data out of the XML. Since the hierarchy provides context that gives the tags their meaning, doing so would often be necessary. And doing so strikes me as rather pointless rework.}

Ebay Example

Well, okay, let's say we want to mine Ebay for certain products and related info. Suppose there are a bunch of TD's in the HTML that contain description and value pairs, such as the label "price" and the value of price. Example:

<td color="brown" align="right">Price:</td><td>$19.95</td>

tagID...seq...tagName...closeType ---------------------------------- 30......50....TD...........B 31......51....INTERTEXT....S 32......52....TD...........E 33......53....TD...........B 34......54....INTERTEXT....S etc...

tagRef..attribName..attribValue ------------------------------- 30......COLOR.......brown 30......ALIGN.......right 31......VALUE.......Price: 34......VALUE.......$19.95 etc...

(Dots to prevent TabMunging. Also note that the "intertext" convention and the "closeType" convention option is used in the example.)

If we wanted to extract that, then we can search for the text contents of "price:" between TD and end-TD tags. When we find it, then just increment the sequence number by 3 to get to the adjacent TD's contents tag, and we have price (and a sanity check on tag name). We don't have to traverse trees in this case, and even if we did, its perfectly possible to iterate through trees when in tables. (And some query languages have built-in tree traversal operations.) I find that data often naturally is either not tree-shaped to begin with, or can easily be "flattened" for the purpose at hand such that dealing with viney trees in raw form is not that common of a need. Trees are over-used and over-assumed in IT in my opinion. (I may use more intermediate tables closer to the domain/project besides just the parsed markup tables shown here. Some language/tools make ad-hoc or nimble tables easy, some don't.)

Note that I wasn't clear on how to deal with ending tags, but the process is the same regardless of which approach is taken.

{If you choose, as you've done, to relegate the problem to the pattern-matching algorithm, you've simply shifted the problem... not solved it. You'll end up relying upon a consistent pattern in the source text. This particular problem wouldn't have been any more difficult to extract all the 'Price:' components from even if you chose to use a regular expression over raw text, which indicates that extracting first to the table is still GoldPlating. And, as with regular expressions and other naive pattern-matching algorithms, your approach would ultimately fail on the exceptions. E.g. it can fail if they choose to make a price-value bold (<td>$19.99!</td>) in order to indicate that it is a 'Buy it Now!' opportunity or that the auction is about to expire, and it can fail if they have some inconsistent spacing (<td>Price: </td><td>19.99</td>), etc.}

{Of course, HTML in general is a horrible medium from which to extract any sort of semantic information. (Raw English text would be better!) Here is a realistic sample from the eBay table (for ONE item):

:<div class="showcase navigation"><span class="gallery" onmouseover= "onGalleryPlus(event, '310012933651', '1', , '0', , 'http://cgi.ebay.com/Fire-Starter-Pine-Pitch-Loaded-Wood-Trade-Wood_W0QQitemZ310012933651QQihZ021QQcategoryZ20598QQssPageNameZWDVWQQrdZ1QQcmdZViewItem');"><img class="link" src= "" align="absmiddle" height="16" width="16">&nbsp;<a href="#Enlarge">Enlarge</a></span></div></td><td class="ebcTtl"><img title="New" alt="New" src= "" border="0"> <h3 class="ens fontnormal"><a href= "http://cgi.ebay.com/Fire-Starter-Pine-Pitch-Loaded-Wood-Trade-Wood_W0QQitemZ310012933651QQihZ021QQcategoryZ20598QQssPageNameZWDVWQQrdZ1QQcmdZViewItem">Fire Starter, Pine Pitch Loaded Wood, Trade Wood,</a></h3><span class="icons"><img src="" id="sr_giftIcon_7" border="0">&nbsp;</span><div class="navigation">Low Shipping For Additional Bundles&nbsp;<br></div></td><td class="ebcPpl">&nbsp;<img src="" alt="This seller accepts PayPal" title="This seller accepts PayPal" border="0" height="16" width="16">&nbsp;</td><td class="ebcBid"><img title="Buy It Now" alt="Buy It Now" src="" class="binImg" align="middle" border="0"></td><td class="ebcPr"><span>$1.25</span><br></td><td class="ebcShpNew"><span class="shpTxt">$4.60&nbsp;</span></td><td class="ebcTim">Jan-06&nbsp;23:16</td>

Your sample perhaps can use some wiki-escaping.

And the same kind of processing can apply. We can fancy it up to ignore tags between TD and "intertext" (dummy tag) without significant re-work (using original HTML example).

{That might work until you see: "<td>Price:</td><td>  19.99</td>", at which point the spaces between the tags become 'intertext' (though perhaps not in HTML - at least in XML you can't generically dismiss spacing as insignificant). Patchwork solutions to pattern-matching problems usually give patchwork results, be they at the token level OR the raw text level. My own observation (after working with languages) is that, beyond the Chomsky Hierarchy, it is not a meaningfully 'easier' problem to match at the character level vs. the token or structure level. (I.e. any lex problem can be solved just as readily by yacc.)}

I don't think it is realistic to try to work around every possible future change. If they stick nbsp's or whatnot in there, then it breaks and we adjust it. (We may want to validate the items we extract to better catch any changes.) They might even go away from HTML tables, and bust it that way. It's one of those issues of how far to go with the YagNi rule of thumb. I generally fall into the middle of the road with regard to YagNi.
- Note: A general-purpose xml/html markup parser could perhaps have an option to convert character espaces such as nbsp's into ascii equivalents or a substition character of our chosing, such as "?" if there is no translation.
- {I'm curious as to your reasoning... a general-purpose xml markup parser should handle unicode easily enough. I wouldn't think it an issue. Any reasonable database could translate/store in utf-8 if necessary. And I'd also expect that ' ' (and family) and '&#<codepoint>;' and '&#x<hex-codepoint>;' would all be translated properly into the associated characters. I guess I consider it a non-issue. Spaces are part of the 'intertext', which is all that is relevant in the discussion above.}
{I don't think it is realistic to pretend that you won't have various 'inconsistencies' even in existing pages, with a lot of variables going into how line-items are displayed. And I'd hate to see the monster of a query you're planning to construct to solve this problem. So long as you're dealing with simplistic patterns, it will be much easier to simply write a simple RegExp to extract it. If you start dealing with complex patterns, it will also be simpler (and far more maintainable) to write a complex grammar (yacc-style) to extract it. I've written enough parsers by hand to know this.}
Why do you keep claiming it would be a lot of coding compared to your alternative? Repeating that vacuous claim over and over and over and over does NOT make it true. I suppose I am going to have to show an actual example to put this to rest. But not today...
{I keep claiming it because it IS true, not to "make it true". I expand on the 'why' it is true further down on the page (PageAnchor: Why Grammars?).}

In fact, in table form that is easier to do than with regular expressions. Starting out with regular expressions creates a DiscontinuitySpike because reg-ex is either not sophisticated enough, or more work than using tables when dealing with tag-per-tag context. Reg-ex deals at the character levels, not atomized tags and attributes. That is because in tablized form, it is already a higher abstraction than what reg-ex would be looking at.

{You seem to be arguing that regular expressions are more work because they are more work in a rather circular manner. Are you ready to establish that pattern-matching problems over raw text are meaningfully more difficult than pattern-matching problems over tokenized text that just happens to be stored in a table?}

Tokenizing stuff into a structure that is representationally equivalent to the original (within reason) is generally more usable than parsing raw text. Always easier? I won't commit to that.

We get tag info, attribute info, and position info readily on hand. We can search, sort, and calculate on any of these with the ease of a query language (and basic app-language loops and IF's). Reg-ex makes a lack-luster query language. It is not powerful/concise enough on the token level. (Related: GrepVsDatabase)

{The conciseness of RegExp depends on how you choose to construct and represent them. They can be pretty darn concise if you use a constructive approach with sub-RegExps?, so you aren't, for example, constantly rewriting the 'extract-number' stuff.}

Reg-ex as commonly used is very non-mnemonic. Non-cryptic replacements are going to be more verbose than using a query language. The more complex stuff you do with it, the more cryptic one's code is. And one is still dealing at the character level of abstraction instead of the token level. I'm isolating concerns.
{I disagree with your assertion; if you build reg-ex's up like small grammars, they can be quite readable and memorable. Writing one big-arse reg-ex won't be any more cryptic than writing one big-arse procedure to do all your pattern-matching. To the contrary, grammars in general are much easier to read, write, and maintain than queries to extract patterns among tokens. And I'm saying that, for pattern-matching at least, dealing at the character-level of abstraction is not fundamentally more difficult than the token-level. You aren't saving much by working with the tokens.}

{That aside, what you desire to query when ProcessingMarkupLanguages is the data IN the file, not data ABOUT the file. This difference is fundamental. Unless your goal is to index or annotate the source, you really need to extract meaning from the file before you have anything worth querying. Arguing that a reg-ex is a lack-luster query language is pointless before establishing the need for one. Creating an artificial need for queries by forcing the representation of the file into a table doesn't count - doing so is a fine example of GoldPlating.}

The net cost of labor is not more {I'd agree this is true for getting stuff into the table. Getting useful stuff back out, though... have you verified this claim?}
I am dealing on the token level instead of the character level {Which doesn't help as much as you seem to think it should.}
I don't run into as many DiscontinuitySpikes if I need more complex processing {DemandForEvidence.}
Not dealing with the cryptic nature of reg-ex that looks like a toddler got his hands all over the top row of the keyboard. {That's only one representation of reg-ex's.}
I have ready random-access to any part of the structure, not forward-only iterators {Which doesn't help as much as you seem to think it would. In XML, for example, you don't know the meaning of the inner structure without first visiting the encapsulating structure.}
It's already saved, or easily saved, so I don't have to worry about persistent issues. {This wasn't an issue in the first place. The markup document, in its raw form, is already saved or easily saved.}
I have use of off-the-shelf data/query browsers to study, filter, sort, and print the structure. {To what end?}
Etc. on the features listed in DatabaseDefinition.
Maybe it is GoldPlating, because I can use off-the-shelf gold plates. I don't have to do the gold-plating myself. (Joseph Smith wishes he had it so easy.) Keep in mind that I tend to use DB's for apps anyhow, so any overhead of using a DB engine is already paid for.
{Let me put it this way: even if my language-of-choice had nimble tables and orthogonal persistence et al, I would still consider them the wrong tool for the job. They might be usable as a hack if my language of choice lacks support for pattern-matching grammars, but they are clearly a hack. Writing convoluted queries to extract patterns that represent useful information is neither more natural nor more maintainable than writing convoluted procedural code to do the same. If your query language could handle complex patterns across table-rows, I might think differently.}
Where's your Convolut-A-Meter? I want to see the needle reading.
{Give me that stupid Simplicity-Meter you use when you're announcing your grand plans to bring KISS to the masses, and I'll turn it upside down and give it back to you.}
Sounds like another null-versus-null evidence debate.
{Perhaps so. My own knowledge is from experience and intuition and considerable post-graduate education on exactly the subjects of processing languages and pattern-recognition and analysis - it isn't something I can present readily for your scrutiny. But even you can't contest that automated grammars are designed for exactly the purpose of processing patterns out of languages. There is no denying that they're a right tool for the job even if you wish to believe that tables can do them one better.}

I still don't know why you are calling it GoldPlating. I don't know what you are comparing it to in your head. If you want a code volume challenge, I'm all for it. The scenario is that we start out using tag quantity offsets, but later decide to filter out stuff such as B and SPAN tags and thus need to change such code.

{You should start with XML intended to carry semantic information in the first place. HTML is a horrible medium for semantic information - you, quite literally, are probably better off simply eliminating ALL the HTML markup and processing it as though it were plain-text.}

One often cannot control such. In my scenario, it is not realistic to expect Ebay Co. listen to you. (If I had a choice, I wouldn't choose XML anyhow most likely unless dictated by the customer requirements. See RelationalAlternativeToXml.)
{Eh? What I'm effin' saying, top, is that the HTML example is a horrible example. You can (and should) consider filtering out all the markup and simply looking for text-patterns that correspond to 'Price: ($?)<Number>'. But there are plenty of cases where XML is available to start with, carrying semantic information in its structure, as often used for communications and for data interchange. (RelationalAlternativeToXml for those cases is not a viable choice, since what we're discussing is representation of data; you'd need to consider CsvAlternativeToXml?.)}
If we have data pre-packaged in XML that is meant for data transfer, then the difference will be small either way. Thus, that is not the scenario I am focusing on.
{Processing HTML the way you are, you might as well process raw English text. The difference is too small either way - all HTML ever does is help 'display' the text. You chose a bad scenario. XML will provide plenty of challenge for you, and would be far more meaningful in the context of 'ProcessingMarkupLanguages'.}
If one's task is to strip info from auction pages, they don't have a fricken choice. You have to search for your cellphone where you dropped it, not where the light is better.
{Actually, there probably are better choices when the task is stripping info from auction pages. One possibility would be to make a deal with eBay to get remote read-access to their DataBase (possibly via some API). Regardless, you chose a horrible example problem for the task of 'ProcessingMarkupLanguages'. The reason is simple: you aren't really processing the markup language, top. What you're processing is the text being marked up. If I had said you need to find and index all words in HTML documents as an example of 'ProcessingMarkupLanguages', it would be a shitty example. Your eBay example is shitty in the exact same manner.}

{But if you wish a code-volume challenge, I'll put a grammar vs. your approach after you've provided your approach. I'll never even touch tag offsets - that's just a horrible idea to start with when dealing with any sort of extensible language.}

I wouldn't fault anybody for it in the name of YagNi. If they change the HTML that much, one would probably want to know (via a crash) so that they could manually check it out for other surprises anyhow.
{You seem to be under the impression that eBay or HTML sources of data in general will be extremely consistent even within a single page, or from page to page. This has some truth, due to their construction by CGI, but it isn't something that can be counted upon. There can be a lot of variables that determine how things are displayed.}
No, I am only saying that there is a point where trying to handle every future contingency has diminishing returns, and an impossible goal since we couldn't enumerate all the potential changes. It's a matter of weighing investments based on estimates of future change probabilities.
{I'll agree that 'future contingency' has diminishing returns. I'm simply saying it's irrelevant; my argument is NOT about 'future contingency'. You'll find the inconsistencies in the here and now. HTML is a display language, not one for capturing semantic information, and line-items will be influenced by plenty of variables reflecting special deals, column alignment, special mouse-over operations, etc.}
See above about HTML vs. XML.

{I've explained before that I'm comparing it to a hierarchical, language-dependent datatype that creates a single value/object that is a direct reflection of the original XML structure... e.g. in Haskell:

  type XMLTagname   = String
  type XMLAttrName  = String
  type XMLAttrValue = String
  datatype XMLData = (Tag XMLTagname [(XMLAttrName,XMLAttrValue)] [XMLData])
                   | Intertext String

{With such a structure, all I've done is represented the XML file directly within the processing language. There is nothing special about it at all. If I want to, I could write a general function collapse this to your pair of tables with a single fold function... but I'd get nothing out of doing so. More usefully, I could write a generic function that can check generic XMLData items against generic XMLSchema. I can also write a single big function to extract semantics that are quite sensitive to such things as whether 'tagx' is found under a 'tagy' vs. a 'tagz'.}

Again, you are pissing all over YagNi. We just want to extract data, not clean up Ebay's work for them. We're only mining Ebay here, not trying to purify the world for the Second Coming. We are a user of the document, not the creator.
{We're talking about generic approaches to ProcessingMarkupLanguages. There are many ways to process them, top. One I've had to work with is translate from one XML format to another. If you want to limit this discussion to ProcessingHTMLEbayExample or some such, we can do so.}

{With decent language support, or a good library, I can write a parser that operates on this structure via pattern-matches (akin to parsers that operate on strings, but starting instead with a generic XML parse tree). But if your language only supports parsers over strings, that would be the way to go. Ultimately, one extracts semantics via pattern-matching with a grammar, and the application of some function. A RegExp extraction is sufficient for the simplistic eBay example you initially solved with the offset, but higher level grammars work as one adds complexity. Either way, the parse tables are superfluous.}

I wouldn't fault somebody for using RegExp for a few or simple items if they can pull it off easily, in the spirit of YagNi. For trivial stuff, the effort between the table-based parser and RegExp is not going to be different enough to argue about, beyond personal taste. But if the document or element relationships grow more complex, the table approach would come out ahead in my opinion.
{Your opinion is based more on your love of tables than on anything resembling logic. As document or element relationships grow more complex, I am confident that pattern-matching grammars (including the more advanced RegExps) will come out well ahead of any table of tokens - hell, even if you have a table of tokens, you'll need to essentially write a parser for it. How many parsers have you written, top? Creating streams of tokens simplifies some things a bit, but not in any considerable fashion.}
Well, your opinion is based on hate of tables. Tables are not even the whole issue. What I'm doing is separating character-level parsing from structural analysis and dealing with relationships between tokens instead of the character level relationships. I'm shifting up the abstraction level and putting it in an easy-to-navigate data structure/system. I'm setting it free. It's now random-access, query-able, and above character-level. That is a good thing.
{I have no hate for tables. They simply aren't the right tool for the job; tables are NOT designed for pattern analysis in the absence of fixed structure. Structural analysis can be done far more easily with a grammar, which is all about patterns in a source. You have almost zero experience implementing languages; you haven't yet learned that 'random-access, query-able' tokens are nearly as shitty a tool as 'random-access, queryable' characters. You've gained nothing of any real value. You take fragile and naive ("I'm setting it free!" ugh) approaches to extracting useful information. But I doubt you'll be convinced before you actually attempt to make the next steps in several different scenarios and actually compare it to more straightforward approaches. Get more experience with processing languages in general, top.}
I don't have to get more experience if I leverage my existing knowledge of tables and queries. Until you show it objectively/clearly worse, I'll choose my personal preference over your personal preference. This is not really a language-centric issue anyhow. It is about extracting domain-specific info from mark-up languages.
{I think it quite obvious that extracting domain-specific info from a mark-up language IS a language-centric issue. This is doubly true when dealing with domain-specific forms of markup languages - you'd do well to remember that XML itself is just a format for representing domain-specific languages, which are themselves described by the schema.}
{And so long as you're dealing with very simple problems, like looking for 'Price:' next to a number in a very fixed format, you'll probably do well enough with tables, just as a person could very easily do well enough if they "leverage existing knowledge of procedural". (PageAnchor: Why Grammars?) Where you'll encounter problems, top, is when the language itself, and the patterns involved (which contain the domain-specific info), get to be more complex. The complexity of queries or procedural code grows differently than that of a grammar, mostly because there are more and more omissions, ambiguities, and variations that you need to handle explicitly in procedural code and queries that are handled via implicit backtracking and the various 'or' options (x | y | z, '?', '*') with grammars and regexps and such (and that's ignoring the more complex grammars, which may need guard statements and context-sensitivity). Demonstrating this growth of complexity with didactic examples is an effin' pain (partly because didactic examples are supposed to be simple). Fortunately, I don't need to 'show' this for it to be true - I know it to be true from experience, and I'll quite happily allow you to run smack into the problem on your own time.}
I gave you a chance to pick an example first. You told me to go instead, so I picked something that reflected a mark-up need I've actually seen: scraping e-commerce listings from different sites. If you don't want to take the effort to produce a sufficient example, that is NOT MY FAULT. You are chewing me out because you are lazy? Actually, I suspect if somebody had to scrap enough e-commerce screens, they'd eventually want to form a library of common scraping needs. Example:

  getValue(tag1, attrib1, tag2, attrib2, distance, value, compareOp, options)

This would get the N'th tag2 after tag1 (or before if negative) where attrib1 contained the given value. Tags outside of those listed are not counted in the "distance" metric. The compare-op is the comparison used (equals, contains, starts-with, etc.) Perhaps regex's could be used instead of compare-op. Or, just pass an expression to be eval'd (HOF-like in scriptland). I'd chose tabling to implement such functions, of course.

{Your example isn't any more a "mark up need" than would be scraping pages for dates or other text. It, very simply, isn't a meaningful example of 'ProcessingMarkupLanguages'. The fact that the markup is there is pretty much completely coincidental to what you are doing.}

Why do you say that? Without a markup parser, it would be more difficult to say ignore B and Span tags.
{As I've said, I really don't believe you - in particular, the "more difficult" part.}
We'd be back to dealing at the character level instead of the token level.
{For parsing problems (and pattern-matching problems in general) specifying lower-level patterns in addition to higher-level patterns is pretty much at the same level of difficulty. Even if your complaint is about the 'repetition' of common patterns from one problem to another (numbers, XML structure, etc.), it really isn't difficult to already have the generic lower-level patterns already available for everything from numbers to markup tags. It wouldn't be difficult to add them by hand for something as simple as the basic XML structure, but (as you had below for your "I'm gonna skip all the stuff for getting XML into the table" comment) one doesn't need to do so. In addition, if you use pattern-matching or parsers for both low-level and high-level patterns, then there is no massive DiscontinuitySpike between the two. I.e. ignoring the 'B' and 'span' tags isn't more difficult than ignoring anything else - you simply describe it as an optional extra in the pattern you're looking for, or as something that can be reduced (&ignored - like a comment) within any pattern.}

{I imagined you were competent enough to choose a real example of ProcessingMarkupLanguages, seeing as this page is your baby. That you failed to do so is hardly my fault. I had even offered suggestions for examples prior to you providing your eBay example: library database interchange - letting others know what has been purchased, lost, checked in, checked out; and configuration file for a GUI (windows and widgets and results of pressing certain buttons encoded into XML). Both are fairly common exemplars of two distinct uses of XML (configuration & data interchange). }

As already described, data interchange is rather trivial because it's data already prepared to be fairly easy to grab. It doesn't have all the aesthetic decorations that otherwise make the road bumpy and harsh. Effort for data interchange programming is mostly about converting data conventions from one system to that of another, such row-to-columns, columns-to-rows, single record to multiple records, etc. This is not a particularly "grammary" task either. You are looking for grammars in the clouds when they are merely clouds. (I suppose there is an EverythingIsa way to view everything as a grammar, but this is merely the multi-paths to TuringComplete showing its face.)
{Data Interchange isn't so 'trivial' when the databases on both sides of the interchange aren't exactly identical in structure, semantics, or schema. (E.g. consider data interchange between different libraries, each having their own database with non-standard schema.) And it is still a pain-in-the-arse to code and a fine example of ProcessingMarkupLanguages. And it is a pattern-matching problem - one can write a grammar and semantics to extract the relevant information because grammars are equivalent to any other pattern-matching language. Compiled grammars are the right tool for the job of pattern-matching because they simplify the main problems of variations, repetitions, omissions/options, and ambiguities. (Admittedly if your source has NONE of those things, then it is relatively easy to process with plain-ol'-procedures or tables or whatnot, at your preference. OTOH, it also makes for a very, very, VERY simple grammar.)}
- But the conversion between DATA FORMATS is not a grammar-centric issue. It is mostly a data conversion issue, plain and simple.
- {No, not 'plain and simple'. 'FORMAT' is just another word for 'REPRESENTATION' and 'STRUCTURE', which is what grammars are all about. Even grabbing Data represented in a file as 'Price:' <formatting crap> <Price-Value> is a representation issue. The data is represented, and you need to extract it into a database.}
- It is not significantly different than typical data conversion tasks with OR without XML once I use the XML-to-table gizmo. The kinds of transformations that are needed would not be significantly different than converting say a GIS database from one vendor to that of another's. See "Typical tasks for data conversion" below for more. You need to show that grammaring is signif better than tabling for the listed items. If the effort is the same, then I go with my personal preference. (I didn't mean to imply that data conversion is "trivial". It is often highly involved and labor-intensive; just not in a "grammar" sense.)
- {It's fine if you wish to use your personal preference. It will work well enough for data that is pre-structured in a simple manner that is already very close to what you need in the end (which seems to be "typical"). It certainly helps that you are already assuming free support for getting data into the tables from a vast array of data formats (a not-unreasonable assumption, but almost certainly not done with 'tables').
 - I am not sure what you are fussing about here regarding "free support". Your Grammar-A-Tron is an extra tool also.
 - {Is that so? Does that make your 'table' system an extra tool, too? Both grammars and tables have plenty of ready support in existing languages. Anyhow, the "fuss" as you call it is that you seem to make your conclusions based on the few magic-buttons provided by your Database tool. Consider a more generic problem, like attempting to extract the XML-structure itself from table of (index,character) representing the file, and you'll get to experience what it is like attempting to actually derive higher-level patterns from lower-level patterns, and XML itself is extremely simple to parse compared to many languages. The same basic approach gets you to yet higher-level patterns as lower-level patterns. But you seem to be ignoring that, pretending that a magic-button will always exist that gets you to within 10' of where you need to be, and the worst you'll experience from that point is parsing "xxx, yyy" into two different columns. Hah. I won't begrudge you the magic buttons that do exist, but I do think you're generalizing from a bad sample-set.}
 - If that is supposed to be a practical and common example, then please clarify it.
- {But all that won't help the moment you start dealing with or looking for higher-level data-patterns that weren't already nearly "solved" for you by the underlying Database-conversion utility. Whether you'll need higher-level patterns depends on the difference between database semantics, and once you start dealing with patterns again you'll benefit from a grammar... e.g. where you can simply declare in a pattern-matching language that having a particular attribute or structure or tag-ordering means it best matches one pattern and not the other, instead of needing to explicitly code (and maintain) a convoluted array explicit pattern-tests and checks and whatever procedural, functional, or (your choice) query languages you're choosing to use.}
- Again again again again, you can call it "convoluted" a million times, but words don't make it so. Show a somewhat realistic example of it being more convoluted than your magic stuff. Assume equal or unknown until shown otherwise.
- {If I gave an example of a 'convoluted' query to solve a problem, you could argue that it was intentionally convoluted (i.e. sabotaged). I'm perfectly content to let you be the one to prove it convoluted. But, of course, you seem to be unwilling to give it your best shot... even for finding stuff like prices off the eBay web-page when bold, span, images, and various formatting options are thrown in (where what is displayed depends on what is in the eBay database - e.g. not every product has an image, not every product is buy-it-now, etc.).}
- That would not be a problem. I would filter for only the tags I'm interested in, such as "WHERE TAGNAME IN ('TD','TEXTAREA')".
- {Not a problem, eh? Well, I'm waiting, then. I must ask: which 'INTERTEXT' blocks will you grab?}
 - One could combine it with a function similar to the "getValue" function already shown. Perhaps one could pass a logic clause (for use in WHERE).
 - {Magic functions? over which inputs? Perhaps you could just call a generic 'extract data' procedure, too, and be done with it.}
- As far as things not existing, we'd have to look at the actual pattern of non-existence. We cannot anticipate up-front all possible ways that non-existence might be handled, only make a guess.
- {That depends on pattern source and the clarity with which you can specify the pattern's you're looking for. Sometimes you can anticipate up-front all possible ways. But even if you can't, it's hardly a rational argument against grammars. Part of the reason grammars are good is that they're easier to fix, since, when you failed to anticipate something, you just make one small tweak and it propagates all the way back to the procedures that search for it. Not so with 'explicit' pattern-matching, where you might need to change stuff}
 - You need to demonstrate that. I don't take your word for it.
 - {I don't need to demonstrate it to make it true, top. And I don't have a goal of proving it to you; I fully expect that, should you ever bother to do real work in language processing, you'll learn it on your own.. just like everyone else has. I can only vaguely recall my own experiences with creating minor modifications to a language (even simple addition of options) that created ambiguities such that I had to entirely restructure the hand-written parser for the language to handle the ambiguity correctly. That was before I really learned about grammars, and learned they handle ambiguities very well.}
 - If you have no textual demonstration for the given example, your anecdotal claim is noted, and we can agree to a cease-fire and move on. {PageAnchor: Final}
- If the price value is 4 cells away from the cell containing "price:" (sometimes "filler" table cells are used for spacing), but on a given run the 4th cell does not exist or doesn't contain numerics, what are you gonna do? We can record a blank/missing price and log the problem and/or halt, but beyond that we cannot do much without making risky assumptions. Are you suggesting the computer start guessing how the HTML was changed, such as looking in the 3rd and 5th cell? That's utterly ridiculous. We don't want to turn a small business into an AI lab.
- {What's your objection to AI? I'd bet money that in 50 years, your small business IT stuff will essentially be AI microfactories (at least relative to what we call 'AI' today... the word tends to be a moving goalpost, such that stuff we figure out how to do well is suddenly insufficiently advanced to decorate with the term 'AI'). Anyhow, the computer shouldn't start guessing how the HTML was changed. But there is no reason that you should force it to be concerned about the exact number of columns difference. Line-items in HTML tables can easily have different numbers of columns, for example, based on whether the item fulfills various conditions.}
- I don't think the software should guess when the columns change from where expected. Many layouts are not strictly grid-based anyhow. I suppose if it is strictly grid-based, we can check the column titles as a feature.
- {It seems you're unwilling to even give it a serious try. Further, if I wrote a query, I'd need to go to great effort to show it correct, since it would be so damn convoluted to do so easily... I'm too lazy for that - I'd much rather use a tool designed for the purpose and that is easier to show correct, like grammars. There is a reason, top, that people use grammars and pattern-matching languages rather than directly writing special queries or procedures or functions that handle every ambiguity and special case - the moment the language has ambiguities, options, omissions, and variations, it's pretty much necessary for complexity management.}
- If I did write such a sample, what guarantees do I have that you will write a counter example? A battle of wits with an unarmed man is boring.
- {You don't have a guarantee that I'd write a counter-example. But I do agree that a battle of wits with an unarmed man is rather boring. I'll even attest to it.}
- And you seem obsessed with having some kind of mathematical proof behind your solution. It's possible that the HTML generated has flaws in it, I would note. For example, it may be missing end tags, be in non-nested order, etc. (xxx) If we over-tie to assumed regularity, typos in the HTML may cause problems in something that tries to use mathematical purity.
- {And you can also end up with typos and such in the word 'Price' and whatnot. You can try to handle them or not at your discretion, top. The typos you mention will cause you just as many problems in your table form (or even getting them into table form) as they would any generic XML parser. But I should note that grammars don't necessarily imply this "mathematical purity" you're mentioning. Regular Expressions are simple grammars and are very easily applied without attempting to find some structure for the whole file.}
- Grammars tend to derail easier when they expect proper ordering, nesting, closing, etc. Authors may also "write to bugs" in rendering engines (Internet Explorer is the most common target). To "prove" it "right", it may have to reinvent how IE renders internally, which is probably a hack-it-up job from some 3rd-world country because they were the lowest bidder, not a math thesis. IE is the main test-bed, not math, and it is not known for its clean logic and consistency. You cannot model a specific swamp with simple math: a swamp is a swamp. Your search for purity is likely MentalMasturbation.
- {Getting it right has nothing to do with figuring out how IE renders the HTML or whatnot. All that matters is extracting all the prices (such that it works) and nothing more than the prices (such that it distinguishes).}
 - The example had more than price. We may also look for description, product number, expiration date, etc. if available.
- Further, databases are a relatively well-known tool in my domain. Your grammatron, on the other hand, would take some experience, practice, and exposure to get one's head around. It does not leverage existing knowledge.
- {That is true. I imagine that grammars and types and various other things simply aren't worth the effort for you to learn to use correctly since you don't see much need for them in the widespread but relatively monotonous domain of IT in which you work. But if you can't get your head around it, you shouldn't be doing all that generalizing I see in the paragraph above.}
- You talk like you are God's gift to God. Show, don't talk.
- {Heh. That is your objection to my objection to your failure to show? ("Grammars tend to derail easier..." than what? oh, and you follow that up with "your grammaratron [...] does not leverage existing knowledge" - actually it does, just not knowledge YOU have - but you're sitting there making generalizations about it without having ever "gotten one's head around" it. Stop being a hypocrite.}
{It isn't true to say that EverythingIsa grammar (for example, values aren't grammars even though it is possible to make a grammar that expands to or accepts only a representation of exactly that value... and neither are communications, cats, dogs, spark-plugs, etc.). But it is true that any describable structure or pattern of structures can be described with one (given a grammar-language supporting the correct level or lower in the ChomskyHierarchy, and the right dimensionality and primitives (e.g. 2D color images (primitive = color pixel aligned in two dimensions with other pixels) are difficult to pattern-match with a 1D text grammar (primitive = character aligned in one dimension with other characters))). Any (pure) pattern-matching language you can come up with will ultimately be equivalent (mathematically isomorphic) to some grammar or other. {And I do expect that people will come up with a library to help find patterns they're looking for in text and capture variables out of them. In full maturity, after considerable refactoring, I'd quite expect such a library would look distinctly like support for writing a grammar, possessing generic pattern-matching and analysis. And perhaps you could use functions that compile a grammar into query statements over a table... just like they are commonly compiled into a massive set of procedures with procedural code. Ultimately, you'll have moved out a layer of abstraction, from explicit queries or procedures to describing grammars and semantics.}

. . . .

{Once the semantics are extracted, then what one does with it can vary... if it semantically is data, one might put it into a database. If it semantically is code, one might execute it. If it semantically is display information, one might display it.}

(Note that I am assuming that the markup-to-table parsing code is already written. [snip...] Plus, dumping the output of an OOP-based parser to tables may not be that difficult.)

{No problem.}

Typical Tasks for Data Conversion

Data structure "shape" remapping: Converting rows to columns (and visa versa), trees to tables, maps to lists, etc.
Map between named slots. (ex: source "LNAME" to destination "Last_Name")
Convert formats, such as yyyy-mm-dd into mm/dd/yyyy.
Split or combine compound elements: "xxxx, yyyy" into "xxxx" and "yyyy". (This is probably the most parse-intensive of these. Most the others are inter-cell instead of intra-cell.)
Change positional-based referencing into named-based references. (Ex: Move the 5th item in list X into "customer_ID".)
Filtering - Removing extra information that is not relavant to target's needs.

RE: "I find that data often naturally is either not tree-shaped to begin with, or can easily be "flattened" for the purpose at hand" -- top

{Data often is 'flat' in a sense, but when you open an XML document, and even after you parse out the tags, you aren't working with "the data" yet. You're working with the representation and transport medium of the data. And, for XML at least, that representation IS 'tree-shaped' to begin with. CSV would be better to avoid tree-shape to begin with. Anyhow, once you get it down to {Item:X,Price:Y} tuples, THEN you'll have the "flat" data you desire. And getting there will generally require BOTH pattern-matching (syntax) and processing (to imbue semantics).}

I think you missed my point, but it is not material enough yet to rephrase.

RE: "Trees are over-used and over-assumed in IT in my opinion." -- top

{I agree that trees for data (in the 'propositions held to be true' sense) aren't a very good thing for business applications because they force navigational access to useful information, and navigational-access code creates inertia that ties one to the structure or representation of the data.}

{But trees for values are a different matter. I cannot agree with your tendency to use two or three tables (a whole micro-database worth) to represent a single value simply because you think that anything more complex than flat strings, numbers, and dates is some sort of sin against the IT profession.}

Let's compare with a scenario, per above.

{I'm not sure what we'd be comparing. My opinion and your opinion?}

JanuaryZeroEight

CategorySyntax, CategoryXml