Link Grammar Parser

: The Link Grammar Parser is a syntactic parser of English, based on link grammar, an original theory of English syntax. Given a sentence, the system assigns to it a syntactic structure, which consists of a set of labeled links connecting pairs of words.

I've used this and it's stunning at finding grammatical errors in English text. It produces correct parses for grammatically correct sentences, including multiple versions for ambiguous sentences, and is reasonably robust in the face of unrecognised words.

You give it a grammatically correct or near correct sentence and it will tell you what part of speech each word is. - http://www.link.cs.cmu.edu/link/submit-sentence-4.html lets you use it to analyse single sentences, http://www.link.cs.cmu.edu/link/ gives a more complete outline. The output is a little cryptic, but a full explanation of all the terms is given on http://www.link.cs.cmu.edu/link/dict/index.html

Worthwhile for people to know about, but just for the record, it is beyond the state of the art to accurately automatically analyze the parts of speech of words in unconstrained text. E.g. 95% accurate would be very nice, but can't currently be done. On the other hand, some applications are worthwhile even with e.g. 60% accuracy.
- ''See further down the page for this author's examples of "unconstrained text".

I believe this simply to be false, unless by "unconstrained" you mean "context-free and in ungrammatical streams." The LinkGrammarParser achieves far better than 95% with grammatical text, and does very well with ungrammatical but reasonable text.

Hmmm:

 (S (VP (S Worthwhile
          (PP (PP for
                   (NP (NP people)
                       (SBAR (WHNP to)
                             (VP know
                                 (PP about)))))
               , but just
               (PP for
                   (NP the record))))
        ,)
    (S (NP it)
       (VP is
           (PP beyond
               (NP (NP the state)
                   (PP of
                       (NP the art))))
           (S (VP to
                  (VP (ADVP accurately)
                      (ADVP automatically)
                      analyze
                      (NP (NP the parts)
                          (PP of
                              (NP (NP speech)
                                  (PP of
                                      (NP words)))))
                      (PP in
                          (NP unconstrained text)))))))
    .)

...looks pretty damn accurate to me!

Citation/URL? (I see that they added a "morpho-guessing" feature some years back that they said improved things a lot, but I don't see accuracy figures.)

Based purely on having used it on over 1000 sentences and agreeing with its analysis on the 300 I checked by hand.

It's nice that it worked well for you, but that's a different question, really.

No matter which AI problem we talk about, it is well known that they are all quite difficult, and it is quite important to know how accurate one can expect an algorithm to be. In testing OCR systems, the algorithms are often 100% accurate on samples "in the lab", but when tested on samples from the wild (e.g. seventh-generation photocopies with never-before-seen fonts), accuracy can drop to 60%. That's not hypothetical, I'm talking about tests I've run myself.

Similar things happen in every AI domain from machine vision to linguistics. Accurate tagging of parts of speech is well known to be one of those problems. Yes, various algorithms can approach 100% accuracy on some kinds of sentences, but no, they don't on a large corpus e.g. drawn from random magazines and newspapers.

Just as a by the way, 95% sounds good, and is good enough for some purposes, but for many other applications something closer to 99.9999% would be needed (e.g. when a technology is competing with minimum wage human operators).

-- DougMerritt

What are you saying? That if an AI problem is ever essentially cracked, it ceases to be an AI problem? The links provided open the algorithm to a theoretically infinite corpus... so let the failures be noted and the reasons discussed (that might fail, let's take a look...)

 (S so
    (S (VP let
           (NP (NP (NP the failures)
                   be
                   (VP noted))
               and
               (NP (NP the reasons)
                   (VP discussed))))))

...wrong, I think. Compare the expanded sentence "let the failures be noted and let the reasons be discussed"

 (S (S (VP let
           (NP the failures)
           (VP be
               (VP noted))))
    and
    (S (VP let
           (NP the reasons)
           (VP be
               (VP discussed)))))

...failure under ellipsis with conjunction

No, no, I'm saying that your results are not following rigorous standards for statistical analysis, so they are statistically meaningless, whereas for most purposes, when evaluating AI algorithms, people need rigorous statistically meaningful measurements. This is an area I've done a lot of professional work in, and I've seen enormous differences between anecdotal evidence from casual testing versus careful methodology using large test sets painstakingly gathered from a large number of real world sources.

Well they're not my results, but I agree that they are anecdotal rather than statistical evidence. When testing such things, it is human nature to try to break them, so I would expect the anecdotal evidence to overstate rather than understate the error rate, but who knows?

Only if you're evil minded like I am. Try it on Shakespeare, poetry, street jargon, surrealist essays, and real horror cases like literary/theatre/art criticism. :-)

Since I am predisposed to it and hardly likely to be fair, and since you are, by your own admission, evil-minded, I would suggest that you try it out and tell us your results. Further, I did say "grammatically correct", into which category Shakespeare, poetry, street jargon, surrealist essays, and real horror cases like literary/theatre/art criticism frequently fail to fall.

It does appear that we are talking at cross-purposes. You seem to be using the term "unconstrained" to mean "random mish-mash of words that nonetheless manages to communicate something to someone."

CategoryNaturalLanguage