Parsing Html

[under construction]

Parsing HTML can be sticky because one often is looking for something specific rather than a general parsing, and because a lot of the HTML in the field is crappy, with bad nesting, missing end tags, etc. I propose that parsing results be put into a table-set or structure similar to this:

 table: tags
 ------------
 tagSequence
 tagName   // Tag name. End tags include the forward-slash
 docRef    // Foreign key to document
 enderType // "pair" or "self". "self" means it ended with a forward-slash at the end
 pairRef   // Reference to start or end pair (if found)  
 parentRef // Reference to next outer tag. For example, a TD should have a TR parent.
 content   // Content "between" start and end pairs (or at least up to the next tag)
 warnings  // Any problems found by the parser

table: attributes ---------------- attribSequence tagRef // Foreign key to "tags" table attribName attribValue valueCount // 0 if attribute lacks a value, 1 if it has it (legitimate for some HTML) quoting // quote used to wrap value, if any

If end-tags, parent tags, etc. cannot be found, then the references are simply left empty. Other parsing tools tends to try to stuff everything into a tree, but if the nesting is messed up or the parser gets off track, then tree-ness will be damaged. The above instead includes any tree-ness if found, but still maintains info about individual tags if big-picture parsing goes wrongs. The above is more limp-friendly (LimpVersusDie) because it starts with a kind of "flat-ness" and adds tag relationship info only if properly found.

--top


EditText of this page (last edited December 28, 2012) or FindPage with title or text search