Archive for March, 2010

The C typedef problem

Posted in Compilers on March 5, 2010 by decepticonpunk

EDIT: The ideas described in this post can only be used for a limited set of typedef declarations. Stay tuned for a better technique :-)

Introduction
If you have ever written any program doing analyses on C code, then this isn’t news to you. The famous typedef problem has been the subject of many discussions among compiler developers and many solutions have been proposed. Personally speaking, I wasn’t aware of how serious it was until I had to cope with it. So, you guessed it, I’m currently trying to solve the famous typedef problem that the parser generator of OpenSAT faces. Before moving on to describing any possible solutions, let’s try to define the actual source of evil.

Definition
I am sure you already have the K&R book (I mean, come on, everyone has it), so, jump to page 234 (A13) and have a look at the C grammar. One of the most notable rules is the following:

<type-specifier> ::= "void" | "char" | "short" | "int" | "long" | "float" | "double" | "signed" | "unsigned" | <struct-or-union-specifier> | <enum-specifier> | <typedef-name>
<typedef-name> ::= <identifier>

Looking at the rules above, one can see that a <type-specifier> can be equal to a <typedef-name> and a <typedef-name> equal to an <identifier>. Even if you are not familiar with BNF grammars, it’s pretty easy to understand that this set of rules refer to identifiers declared as types via typedef.

typedef unsigned long our_int_t;

When a C compiler comes across such a declaration, it knows that our_int_t should be treated as an alias for unsigned long and that our_int_t can be used as a type specifier (i.e as a standard C keyword used to define a basic C type). Notice that type names defined via typedef should always obey the rules for variable names, so, without some extra context information, the C lexer is unable to understand if a given string is an identifier or a type name! You might think that this is not a problem at all, but let’s have a look at a completely valid C snippet that any descent compiler will happily accept as correct.

typedef unsigned long our_int_t;

int main(void) {
  our_int_t v; /* [1] */
  int our_int_t = 2; /* [2] */
  return 0;
}

Funny right? At [1], a new variable named v of type our_int_t is declared. The C compiler has already parsed the typedef declaration, so, it knows that our_int_t is actually an alias for unsigned long and thus accepts the declaration. So far so good. The real problem starts at [2] where our fictional programmer, interestingly enough, declares a new integer variable named our_int_t which is given a default value of 2. Although our_int_t was previously declared as a type name, it is used at [2] as an identifier. On the contrary to what you might think, this line is syntactically and semantically correct. If you need further convincing, just compile the previous snippet with your favorite compiler.

The previous paragraph describes only one side of the coin. Unfortunately, one more problem arises from this little inelegance in C’s grammar, but it’s not possible to explain it here in detail. The problem lies in the way LALR states are generated. By replacing <typedef-name> with <identifier> (since <typedef-name> ::= <identifier> is also true), a lot of conflicts pop up in the resulting parsing tables. If you want to have a look, grab the yacc C grammar from here, replace TYPE_NAME with IDENTIFIER and run yacc on it.

Solutions
The past few days I’ve been trying to solve this problem in an ellegant and effective way. During all that time I googled a lot and I came across some very interesting sources which are worth studying. Here are some links:

1. The typedef problem discussed in comp.compilers here, here, here and here.
2. Same at a very cool blog called The little calculist.

There’s more than one way to solve the typedef problem. It can either be done in the parser or in the lexer. In the former case, the C grammar is modified, while in the latter, lookahead is introduced in the lexer. It’s up to the developer to decide what’s best for him. Personally, I consider messing with the grammar a dangerous practice, so I decided to implement the second solution.

Solving the problem in the lexer was not a big trouble. First <typedef-name> and <identifier> are declared as terminals e.g TYPE_NAME and IDENTIFIER as in the yacc grammar shown above. The next step is to modify the lexer in order to make it able to distinguish if a given string is an identifier or a type name and return the appropriate token type. The following pseudocode is what I actually implemented in C for OpenSAT.

/* Upon reading a "typedef" token, set a flag that
 * indicates we are currently lexing a typedef
 * declaration.
 */
if token.name == "typedef" then
  in_typedef = 1;
fi
...
...
if token.name == ";" then
  ...
  in_typedef = 0;  /* Typedef ends at ";". */
fi
...
...
if token.type == IDENTIFIER then
  /* If this token is already in the type_table hash table
   * and if lookahead is not one of the characters shown
   * below then this token is used as a type specifier.
   */
  if token in type_table then
    if lookahead not one of ['=' ',' '{' ';'] then
      token.type = TYPE_NAME
    fi
  /* If we are currently lexing a typedef declaration and
   * the token is not in type_table then this is a new type
   * name. Insert it in the type_table.
   */
  else if in_typedef == 1 then
    token.type = TYPE_NAME
    insert_in_type_table(token)
  fi
fi

So far this solution seems to work fine. You can have a look at my test program here and the output produced by OpenSAT’s lexer here (notice that the tokens are correctly identified). I still haven’t finished dealing with the typedef problem, so, maybe my solution is not 100% correct. If you think you got a better idea drop me a mail or add a comment! :-)

– dp