February 19, 2010

Why LR parsers for NLP with TAGs ? : a logical answer

It is known that context free grammars are often used to parse natural language strings. Indeed most of the contructs in natural languages can be expressed using context free grammars. And among the most efficient parsing techniques are the set of LR parsing techniques based on initial work by Don. Now the common shortcoming of LR parsers for them to be used for natural languages widely is that when conflicts occur during the parsing, the context provided by the associated context free grammar to resolve the conflict is very less. Hence conflicts occur commonly when LR parsing techniques are used for natural languages and it is not easy to remove the ambiguity involved.
Now consider the case of tree adjoining grammars, they fall into the category of mildly context sensitive formalisms and are more powerful than context free grammars and also in simple terms allow more than one level of derivation in each production. Given all these factors, applying a little logical thought would tell us that conflicts in LR parsing using a TAG for natural languages can be resolved with much more ease because of the extra context present. Given this I am presently working on writing a LR parser for TAGs, though implementations like XTAG already exist, I am doing this solely to understand the theoretical concepts better.

February 12, 2010

embedded push down automata and parsing MCS languages

I was working on coming out with a solution for parsing the following language using an embedded push down automata, because of the presence of three distinct terminals, it is pretty obvious that a simple push down automata cannot be used.
a^nb^nc^n
The main feature where an embedded PDA varies from a simple PDA is that, it has a stack where multiple stacks can be pushed. New stacks can be pushed above or below the current stack, while symbols can be pushed and popped from the current stack, symbols can only be pushed into stacks other than the current one. This effectively means that in a simple PDA when we perform a POP operation, the information is deleted from the PDA stack, where as in an embedded PDA, when we perform a POP information, the information can be either completely removed from the embedded PDA or be pushed into another stack above or below the current stack. This allows an embedded PDA to parse languages such as the above. The solution for parsing the above language is simple and can be visualized as a simple PDA for the a^nb^n part, but instead of popping out the a's when a b is encountered in the input string, we push the symbol to another stack, the symbols are finally popped out from the system when a corresponding c is encountered.

February 11, 2010

Embedded push down automaton (testing latex with blogger)


Here is the formal definition of the language which a string belongs to if it has been successfully parsed using an embedded push down automata :
C\left(M\right)=\left{q,{\gamma}_m ... {\gamma}_1,x_1,x_2\right} \epsilon Q \mathbb{X} (\Gamma^+)* \mathbb{X} \Sigma^* \mathbb{X} \Sigma^*
More equations and notes from my research on mildly context sensitive grammars and their describing automaton will make their way to the blog soon.

February 07, 2010

using getopt : is it worth the effort ?

I was recently going through the GNU getopt library for parsing command line options. On a *nix system with online man pages installed, you should be able to read the manual pages for the getopt C library functions by doing something like
man 3 getopt
Now what I am left wondering about is, whether the complexity of using the library functions when the number of arguments your program requires is one or two, worth it ? Whenever I have had to implement similar functionality into my program all I have done is use a while loop over argv until I found all the required arguments. Something like this :
.
.
.
i = 0;
num_args_found = 0;
do{
 if (argv[i] == '-'){
           /*check for all valid options */
           /*if invalid character found then skip and continue */
 }
 i++;
 num_args_found++;
}while (i < argc);
if (num_args_found < requried_num_of_args){
 printf ("The requried arguments were not found\n");
 exit (0);
}
/*rest of the program here */.
.
.
though I am sure this is the best of the approaches, and even I have sometimes used a better one or a modified version of the above code, like to show exactly which arguments were missing. But the point of discussion here is that, when the expected number of arguments is less than three or four, the complexity of using a library function is much more than that for writing some custom code as above.