[under construction]
Parsing HTML can be sticky because one is often looking for something specific rather than a general parse, and because a lot of the HTML in the field is crappy, with bad nesting, missing end tags, etc. I propose that parsing results be put into a table-set or structure similar to this:
 table: tags
 ------------
 tagSequence
 tagName      // Tag name. End tags include the forward-slash
 docRef       // Foreign key to document
 enderType    // "pair" or "self". "self" means it ended with a forward-slash at the end
 pairRef      // Reference to start or end pair (if found)
 parentRef    // Reference to next outer tag. For example, a TD should have a TR parent.
 content      // Content "between" start and end pairs (or at least up to the next tag)
 warnings     // Any problems found by the parser

 table: attributes
 ----------------
 attribSequence
 tagRef       // Foreign key to "tags" table
 attribName
 attribValue
 valueCount   // 0 if attribute lacks a value (legitimate for some HTML), 1 if it has one
 quoting      // Quote used to wrap value, if any

If end tags, parent tags, etc. cannot be found, then the references are simply left empty. Other parsing tools tend to try to stuff everything into a tree, but if the nesting is messed up or the parser gets off track, then the tree-ness is damaged. The above instead records any tree-ness that is found, but still keeps info about individual tags if big-picture parsing goes wrong. It is more limp-friendly (LimpVersusDie) because it starts with a kind of "flat-ness" and adds tag relationship info only if properly found.
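
A rough sketch of how the two tables might be filled, using Python's standard html.parser module. This is only one possible reading of the proposal, not a finished design. The class name FlatTagParser, the VOID_TAGS set, the dictionary row layout, and the warning strings are all made up for illustration. Two simplifications are assumed: the "quoting" column is left empty because html.parser strips quotes before handing attributes over, and "content" holds only the text up to the next tag.

```python
from html.parser import HTMLParser

# Elements that never take an end tag (assumption: a partial list is enough here).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class FlatTagParser(HTMLParser):
    """Hypothetical parser that fills flat 'tags' and 'attributes' row lists."""

    def __init__(self, doc_ref):
        super().__init__()
        self.doc_ref = doc_ref
        self.tags = []        # rows of the "tags" table
        self.attributes = []  # rows of the "attributes" table
        self._open = []       # tagSequence of start tags not yet paired

    def _add_tag(self, name, ender_type, attrs=()):
        row = {
            "tagSequence": len(self.tags) + 1,
            "tagName": name,
            "docRef": self.doc_ref,
            "enderType": ender_type,
            "pairRef": None,  # stays empty unless a partner is found
            "parentRef": self._open[-1] if self._open else None,
            "content": "",
            "warnings": [],
        }
        self.tags.append(row)
        for attrib_name, value in attrs:
            self.attributes.append({
                "attribSequence": len(self.attributes) + 1,
                "tagRef": row["tagSequence"],
                "attribName": attrib_name,
                "attribValue": value,
                "valueCount": 0 if value is None else 1,
                "quoting": None,  # not recoverable from html.parser's attrs
            })
        return row

    def handle_starttag(self, tag, attrs):
        row = self._add_tag(tag, "pair", attrs)
        if tag not in VOID_TAGS:
            self._open.append(row["tagSequence"])

    def handle_startendtag(self, tag, attrs):
        self._add_tag(tag, "self", attrs)

    def handle_endtag(self, tag):
        row = self._add_tag("/" + tag, "pair")
        # Look from the innermost open tag outward for a matching start tag.
        for i in range(len(self._open) - 1, -1, -1):
            start = self.tags[self._open[i] - 1]
            if start["tagName"] == tag:
                # Anything opened inside it was never closed: warn, keep going.
                for seq in self._open[i + 1:]:
                    self.tags[seq - 1]["warnings"].append("no end tag found")
                del self._open[i:]
                start["pairRef"] = row["tagSequence"]
                row["pairRef"] = start["tagSequence"]
                row["parentRef"] = start["parentRef"]
                return
        row["warnings"].append("no matching start tag")

    def handle_data(self, data):
        if self.tags:  # text up to the next tag goes to the latest tag row
            self.tags[-1]["content"] += data


# Sloppy input: neither TD is closed, and one attribute has no value.
parser = FlatTagParser(doc_ref=1)
parser.feed('<table><tr><td class="a" nowrap>one<td>two</tr></table>')
parser.close()
for row in parser.tags:
    print(row["tagSequence"], row["tagName"], row["pairRef"], row["warnings"])
```

Note that the unclosed TD tags still get their own rows, with an empty pairRef and a warning, while TABLE and TR are paired normally. The sketch does not know that a new TD implicitly ends the previous one, so the second TD lists the first as its parent. A fuller version would need per-tag rules for that.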
--top