Probability Based Parsing

I've been kicking around the idea of a simple "probability-based parsing" tool that uses simple patterns along with a ranking. This would mostly be for one-off, ad-hoc, or occasional parsing and/or converting. Traditional parsing requires putting together rather exacting rules that take a long time to prepare, tune, and read (if one doesn't do it often enough to make it second-nature).

Probability-based parsing is more "organic". Further, it may be useful in cases where you are only interested in specific features such that you don't want to have to specify rules for everything in order to isolate something specific. One may have to still inspect the results to make sure you "got them all", but it may do the vast majority of the work for you. It's a tool for the niche in-between the cases of "do it all by hand" and "build a perfect parser".

First you define basic atoms of the source language, such as "named-token". Existing reg-ex may be good enough for this purpose. Then you define rules that look something like this:

 if-start | if * then | 50
 if-end | end if | 60
 if-end | endif | 70
 func-call-start | <named-token>( | 30
The pattern is:
 pattern name | pattern | probability

Each token is then labelled with the highest probability match, perhaps with a "log" showing the non-winning matching rules to help tune or verify the parsing.

(Asterisk means zero or more tokens. Probabilities are weighted relative to each other.)

Has anybody seen anything like this before?

--top

Lots of work has been done in this area, e.g. with regards to machine learning and natural language processing. What you speak of is actually a weighted logic (cf. http://www.dyna.org/) as opposed to a true probability-based approach. That's probably a good thing; I'd hate to have my code parse differently every time I run it.


I'm not clear about you use-case. It seems that you do want to specify languages albeit as easy as possible. But what kind of languages? Languages for planned functionaly? Like describing business functionality in a specific DSL? Or languages matching an already existing corpus of sentences? The latter is where the machine learning above could do it's work better. It would be nice to have a tool which could extract a simple language from given text file. Then processing of the text file would become much easier.

Example: Log-Files. How often do you have to read log files of the most different formats (ad-hoc and what else)? An enormous amout typically much of it unneeded. Granted. Grep can usually trim it, but what I have in mind is more like this:

input some GB log file with lines like this:

  2012-03-23 13:31:52.681 (18.682) PID:31105 Info   import_data.c           _import_file_or_directory(). 145: Import of "/tmp/server_data_1332509496_31105.zip" took 16 seconds and succeeded. Got 74596 events in 58810 productions.
  2012-03-23 13:32:35.235 (61.236) PID:31105 Info   mainloop.c                   _mainloop_log_status(). 547: STATUS1: storages T/M/C: 1 / 1 / 1 with 74596 events / 58810 productions use 357766 KB. Requests: 0 total, 0 (0.0%) success, meta: 0 (0.0%), optim.: 0 (0.0%), longest: 0 0 0 ms, 0 threads

My imagined tool should extract a grammar like this from it:

  <grammar> ::= (<line> "\n")*
  <line> ::= <date-iso> " " <time-iso-ms> " (" <num> ") PID:" <num> ["Info"|"Warn"|"Debug"|"Error"] " "+ <file-name> " "+ <method-name> ". " <num> ": " <text>
(using some predefined well-known patterns and settings to generate a "single-rule" grammar).

The <text> at the end onviously contains different typical groups of log messages, which would require multiple grammar rules for each one, so the grammar might go on:''

  <text> ::= "Import of " <string> <text1>
  <text> ::= "STATUS1: storages T/M/C: " <digit> " / " <digit> " / " <digit> " with " <num> " events / " <num> ...

Using this grammar it would be much easier to extract statistics from the log-file, e.g. with:

  cat log-file | guess-ast | ast2csv grammar.line.date-iso grammar.line.text[1].num[1,2]

Which would presumably generate output like this:

  2012-03-23 74596 58810

Wouldn't this be great?

Yeah, it would be. As a side note, the log file routine authors probably should have stuck with CSV to make life easier on everybody.


The normal arguments for fuzzy and weighted grammars are:

Weighted grammars are also valuable for generative purposes, i.e. mad-libbing, or creating a world from a world-grammar.


EditText of this page (last edited May 11, 2012) or FindPage with title or text search