Alexander Morou
Greetings,
Just curious if anyone here is interested in the silly research I've been doing lately.
For the past year or so, I've been struggling on and off with creating a very basic lexical-analysis state-machine generator. A while back I came up with something I thought was close: a radix tree to represent non-flexible state systems (such as keywords and operators, which don't repeat and have no optional elements).
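As a rough sketch of what I mean (not the generator's actual code), the fixed-keyword case can be handled with a prefix tree, where each character is one state transition and a node marks where a keyword ends:

```python
# Minimal prefix-tree sketch for the "non-flexible" case: a fixed set of
# keywords, no repetition, no optional elements.

class RadixNode:
    def __init__(self):
        self.children = {}    # char -> RadixNode (one outgoing edge per char)
        self.terminal = None  # keyword ending at this node, if any

def build_radix(keywords):
    root = RadixNode()
    for kw in keywords:
        node = root
        for ch in kw:
            node = node.children.setdefault(ch, RadixNode())
        node.terminal = kw
    return root

def match(root, text):
    """Longest keyword that is a prefix of `text`, or None."""
    node, best = root, None
    for ch in text:
        node = node.children.get(ch)
        if node is None:
            break
        if node.terminal:
            best = node.terminal
    return best
```

Walking the tree character by character is exactly a deterministic state machine for the keyword set; it just can't express Kleene operators or optional parts.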
The radix tree approach is far simpler than the final result. State-to-state transitions, by themselves, aren't very difficult; the issue I kept running into was properly expressing the problem, and I think the reason I failed so many times was that I didn't fully understand it. What I came up with is fairly straightforward. I start with each expression group and unify the expressions with one another, at first ignoring the overlaps; on each expression I concatenate each element, bearing in mind the Kleene operators and flagging or clearing the edges as necessary based upon the next element. Here's where I usually went wrong: I kept assuming it would just work, but at this point I have multiple expressions that potentially overlap. I realized this and tried, multiple times, to write a unification algorithm, but I kept making the mistake of trying to do it all in the same state.
The issue lay in the fact that to do it properly and efficiently, I needed a clean slate: a completely new instance of the state with no transitions to foul things up. It was only later that I found the working solution, and later still that I realized it's the standard concept of NFA-to-DFA conversion.
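That "clean slate" is exactly what subset construction gives you: each DFA state is a brand-new object, a set of NFA states, with no pre-existing transitions. A minimal sketch, assuming an NFA given as nested dicts (and no epsilon moves, for brevity):

```python
# Subset construction: each DFA state is a frozenset of NFA states,
# created fresh, so overlapping expressions can never foul each other up.

from collections import deque

def nfa_to_dfa(start, transitions, accepting):
    """transitions: dict[state][symbol] -> set of NFA states."""
    start_set = frozenset([start])
    dfa = {}                 # frozenset -> {symbol: frozenset}
    dfa_accepting = set()
    work = deque([start_set])
    while work:
        current = work.popleft()
        if current in dfa:
            continue
        if current & accepting:
            dfa_accepting.add(current)
        moves = {}
        for state in current:
            for sym, targets in transitions.get(state, {}).items():
                moves.setdefault(sym, set()).update(targets)
        dfa[current] = {sym: frozenset(t) for sym, t in moves.items()}
        work.extend(dfa[current].values())
    return start_set, dfa, dfa_accepting
```

For instance, an NFA for `a | ab` sends `a` to two NFA states at once; the DFA simply makes that pair one new state and carries on deterministically.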
So then I thought, after finally getting it, that I would take it further. I noticed a lot of redundancy in the generated code, so I knew pretty well how to handle the simplification process: any two states with equal sets of transitions are equivalent to one another. Therefore, for every point of duplication, remove the duplicate, keep the 'master' (first) occurrence, and update the state-to-state transitions to point at the master state.
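A sketch of that merge rule (assumed DFA shape, not the generator's own data structures): give each state a signature made of its accept status and its sorted transition table, collapse states with equal signatures into the first one seen, redirect incoming edges, and repeat until nothing changes (since a merge can make two predecessors newly identical):

```python
# Duplicate-state merge: states with identical transition tables (and
# accept status) collapse into the first "master" occurrence; incoming
# edges are redirected. Assumed DFA shape: dict[state] -> {symbol: state}.

def merge_duplicates(dfa, accepting):
    changed = True
    while changed:                      # repeat to a fixed point
        changed = False
        seen = {}                       # signature -> master state
        remap = {}                      # duplicate -> master
        for state in sorted(dfa):
            sig = (state in accepting, tuple(sorted(dfa[state].items())))
            if sig in seen:
                remap[state] = seen[sig]
            else:
                seen[sig] = state
        if remap:
            changed = True
            for state in remap:
                del dfa[state]
            for moves in dfa.values():
                for sym, tgt in moves.items():
                    moves[sym] = remap.get(tgt, tgt)
            accepting -= set(remap)
    return dfa, accepting
```

The fixed-point loop is what makes shared suffixes fold up: once two ending states merge, the states feeding them become identical and merge on the next pass, and so on back up the chain.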
I thought further and realized a second rule, using an example to simplify: C (B | AB).
For every terminal edge, match those which have equal incoming transitions, and replace them as necessary using the above method. This basically means that the two strings in the set above, CB and CAB, end in the same state.
The second rule of reduction cut complex state sets from roughly 460 states to 230, half the original working set. The only issue is that I now need a sub-state value representing the point of origin to maintain the determinism of the result (i.e., if you're parsing a keyword, which keyword was evaluated, because 'ascending' and 'descending' end on the same state and share the 8 states in 'scending').
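A deliberately tiny illustration of that sub-state idea (hypothetical, not the generator's tables): once the 8 states for "scending" are shared, the final state alone can't name the keyword, so the walker records the origin where the two paths diverge and carries it into the shared chain:

```python
# 'ascending' and 'descending' funnel into one shared suffix chain, so a
# token id captured at the branch point disambiguates the accept state.

def walk(text):
    if text[:1] == "a":
        origin, i = "ascending", 1    # sub-state value: point of origin
    elif text[:2] == "de":
        origin, i = "descending", 2
    else:
        return None
    # one shared state per character of "scending", reused by both paths
    return origin if text[i:] == "scending" else None
```

In table form this would be one extra register set on the diverging transitions, rather than duplicated suffix states per keyword.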
So the question: does anyone here have any idea how to reduce further?
The thing already reduces the machines further than I could possibly do (efficiently) by hand.
If anyone's interested in the code that it produces, PM me or post a reply. This project will be used to bootstrap the next version of OILexer (Objectified Intermediate Language, Lexical Analysis Generator).