Compiler Compiler - Top-Down Recursive Descent LL(k)

Alexander Morou · Mar 14, 2009

Well,

I'm finally to the syntactical analysis phase, and let me tell you, it's no joke.

On top of requiring a good understanding of the overall language's interconnected structure, you have to know other things, like where/how the data-dependencies lie. Right as it is now, I'm focusing on the token look ahead exploration for each of the rules, as well as the individual rule data exports, in two forms. The data-exports are basically the information the rules provide to whomever uses the parser, I've decided on two formats: Verbatim and Declared. Declared is a bare-bones version of the original, containing information on a need-to-know basis, unnamed elements are discarded. Verbatim is as it sounds: a data set in the order in which the rule was defined, and in which the data was encountered. This way if you're interested in stepping the data-structure you can do so in the order in which it was parsed. Depending on whether I have any issues with the Verbatim format, determines whether I drop it or not.

First thing's first: Look-ahead exploration. So far I've been able to do the simplest exploration, and know from the start rule what's available in each individual rule's starting context. The next step involves using what I have and comparing it to what's next, based upon what's next will determine how far the look-ahead goes. On top of all this, I need to build a relational map token->token. This map will tell me at a moment's glance what overlaps where, so if keyword and identifier are in the same context, it knows they're equivalent to one another, or if two keyword-like tokens are encountered together, where they overlap, so only cases where the overlap is pertinent are explored.

I'm thinking each parse method will have a 'parseState' parameter, since a lot of the initial look-ahead analysis will overlap, it'd be pointless to regularly re-check the look-ahead to make the same decisions over and over. The parseState will relate to what level of look-ahead has been performed, this will determine where into the code it jumps.

This doesn't even begin to cover ambiguity cases, I'll have to figure that out when I hit them, I'm guessing it'll be here shortly. If anyone here has any experience in this, please, don't hold back.

Alexander Morou · Mar 31, 2009

Well that was fun.

After writing a lot of code and finding that most of it was pretty much useless. The initial method employed was a large syntactical graph which branched from the start rule. Which would've been great, were I only interested in verifying the syntax. It also didn't properly support terminal closure to the calling rule of each rule offshoot. Don't worry if that doesn't make any sense, obviously, since it didn't work, I didn't understand it either.

I've rewritten a large portion of the code related to look-ahead exploration, and I believe I'm at a phase where I'm able to begin the actual logical path evaluation. I'm likely to use a state-machine approach for each rule so when I encounter recursive ambiguities I can handle them by invoking a state machine merger which is given the task of taking the call hierarchy and creating a fusion between all the active rules and where they are to go 'next' after their current state.

The main reason for such an irritating thing is what recursion does to the parse process. In order to remove the need to back-track to any degree you'll need a method that will allow the nested rule parses to only 'eat' as many tokens as they need. This essentially means that a given rule may well consume a series of tokens, but it also might consume too many. To prevent this, I'll need to emulate a version of the rule that represents the call hierarchy, only when the entire call hierarchy's requirements are satisfied can the look-ahead advance beyond that terminal edge, should there be more to parse beyond that point.

After the rule's look-ahead is known, the actual data mining process can begin.

For those interested to know more about the state machine resolving recursive ambiguities, here's a small graphical representation of a simple parse:

As you can tell from the above, it's a silly language; however it very simply demonstrates how a rule can over-eat its fill. Were you to not use a method to avoid backtracking in this method, you'd end up with a partial parse, and a syntax error, to which the parser would say 'Wait, perhaps it's just a bad guess', it'd backtrack until it found a parse that worked, if none worked, it'd throw an error.

Naturally there's probably something I'm missing from all of this, but time will tell.

Edit:
Well after reading some more (since I want to be sure of what I'll be doing before doing anything), I've found that the concepts I'm wanting to try are a stray path away from some common parsing concepts. For one, I don't like the idea of 'Maximal munch'. If a series of characters represents two tokens at once, I'll use both tokens and explore both individually. It should take very little effort to discern the proper path, since, even if two tokens overlap on their range entirely. Granted this means the analysis phase will become, probably, exponentially complex, but I think it's worth it. I also remind myself that I can take this liberty since:
1. I'm doing this for fun.
2. This is a research project.

I'll post more when more's available.

Alexander Morou · Apr 15, 2009

Well I think I've got the basic concept down. Now I just need to implement it. Granted the above model displays a method that's very different from your standard LL(k) parser, I'm thinking it's similar to ANTLR's LL(*) method, but also similar to a Chart Parser due to its aim to eliminate back-tracking and the worries associated to combinatorial explosion. Whether this works will be pretty quick to find out.

The Combinatorial explosion issue will be handled by issuing new states into the state system only as necessary, which means the state machine for two rules (ie. Rule A uses Rule B) is only built when a condition arises that the states for such a union is necessary (ie. when Rule B is actually needed and next in the tree). The entire system will utilize memoization for state system combinations, this ensures that two parses of the same kind use the same unified state machine, and if needed the information can be persisted for use later (since I'm guessing table building on the fly will be slow on the first go, so being able to save this data will be helpful).

If I'm right, the actual information for the states will be very straight forward, it should only require a few bytes per state to define the transitions necessary to go into the next state/rule. Using C# as an example, I'd guess 40 bytes per state is about accurate, there's about 1200 states for the entire language when you build a tree for just the start rule->every other rule, excluding left recursive situations. That's about a 48000 byte state system, which won't ever occur since that doesn't define the state system at its deepest level only but all possible permutations of the state system. Even then, assuming a worst-case scenario of 46.875KB per a single state system (probably closer to 10KB, realistically), I'm guessing there's only so many cases in a language that will have absurdly large state systems, and once encountered, they'll start to overlap and use the memoization cache.

The cache itself will be tree based, where the nodes in the tree are the current call hierarchy, which makes retrieval simple if each rule pushes its current state onto a stack. In all this information is data relative to what's captured, this data will be a part of the rule blueprint (the data used to build the state systems), this point of origin will likely have to be tracked in the overall state machine built from the blueprints. So when the rule accepts the blueprint can properly assemble the rule using a series of construction aids that will be defined using delegate acceptors when defining the edges of the rule.

All of the above is very complex, and if I'm even slightly wrong on it, the entire thing crumbles. Here's hoping I know what I'm doing this time.

Alexander Morou · May 9, 2009

Starting the tree build code now. Here's hoping it goes as planned. Took so long to fully whiteboard the idea. This is a lot more complex than I thought it was. I'll post in a few days the results of the experiment. Granted there's probably zero people interested in the results of this, but hey, that can change.

The biggest issue I've had with this idea is deciding how to properly handle the internal structure. There's a lot of key factors, for one, the look-ahead exploration is dynamic the first go around, second there's tree closure to be considered, which involves keeping track of rule sub-trees and when their exit states are entered, it needs to know where to route the next points to, which is tricky, to say the least. There's issues with left-recursion that I'll have to figure out as I go. There's also the fact that each 'path' followed while parsing needs to be tracked by the system, in the end the entire concept of 'recursion' is thrown out the door, essentially making this similar to a top-down left-most derivation push down automata which is both deterministic and nondeterministic where determinism fails (maximal munch, as mentioned before, isn't observed).

The good news is this entire process has given me ideas for the third language development system I'll make (after version 2). Because Maximal Munch isn't observed, this means that in certain cases you have to pad your rule descriptions with useless mentions to ignored white-space or other elements, such as comments. Instead of making the experimental Câ™¯ derivation, Câ‹†, I'll instead make a toy language called T*y++. I'll use it to mix standard general-purpose languages with language construction principles. Essentially killing two birds with one stone: instead of having to make a solid library to use in Câ™¯, when I go about making the experimental language, I'll just mix it in, with the other concepts I want to try out, as a language feature. I've been doing some tests with the actual syntax as I'd like it to be, and it seems more natural to me as a coder to use this system rather than a custom DSL specifically structured towards languages only. It'll be easier to mix-in language elements for match code since the actual code structure itself is based off of a general purpose language.

Here's a dummy script I've been toying with:

Code:

using System;

using System.Collections.Generic;

using System.Linq;

/*----------------------------------------\

| Copyright Â© 2009 Allen Copeland Jr. Â  Â  |

|-----------------------------------------|

| The Abstraction Project's code is prov- |

| -ided under a contract-release basis. Â  |

| DO NOT DISTRIBUTE and do not use beyond |

| the contract terms. Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  |

\--------------------------------------- */

Â 

namespace AllenCopeland.Abstraction.SLF.Languages

{

Â  Â  public language part ToyCC

Â  Â  {

Â  Â  Â  Â  public grouped rule LanguageTLElementDeclaration =

Â  Â  Â  Â  Â  Â  LanguageDeclaration |

Â  Â  Â  Â  Â  Â  LangaugeTLEnumDeclaration |

Â  Â  Â  Â  Â  Â  LanguageRuleDeclaration |

Â  Â  Â  Â  Â  Â  LanguageTokenDeclaration;

Â 

Â  Â  Â  Â  public rule LanguageDeclaration =

Â  Â  Â  Â  Â  Â  (

Â  Â  Â  Â  Â  Â  Â  Â  "new" | 

Â  Â  Â  Â  Â  Â  Â  Â  "public" | 

Â  Â  Â  Â  Â  Â  Â  Â  "protected" | 

Â  Â  Â  Â  Â  Â  Â  Â  "internal" | 

Â  Â  Â  Â  Â  Â  Â  Â  "private" | 

Â  Â  Â  Â  Â  Â  Â  Â  "abstract" | 

Â  Â  Â  Â  Â  Â  Â  Â  "sealed"

Â  Â  Â  Â  Â  Â  ):Modifiers;max-each:1;* 

Â  Â  Â  Â  Â  Â  "language" "part"!:IsPartial;? Identifier:Name; 

Â  Â  Â  Â  Â  Â  Â  Â  "starts" "with" ':' TypeName

Â  Â  Â  Â  Â  Â  '{'

Â  Â  Â  Â  Â  Â  Â  Â  LanguageElementDeclaration:Elements;*

Â  Â  Â  Â  Â  Â  '}';

Â 

Â  Â  Â  Â  public grouped rule LanguageElementDeclaration =

Â  Â  Â  Â  Â  Â  LanguageEnumDeclaration |

Â  Â  Â  Â  Â  Â  LanguageRuleDeclaration |

Â  Â  Â  Â  Â  Â  LanguageTemplateDeclaration |

Â  Â  Â  Â  Â  Â  LanguageTokenDeclaration;

Â 

Â  Â  Â  Â  public rule LanguageTLEnumDeclaration =

Â  Â  Â  Â  Â  Â  (

Â  Â  Â  Â  Â  Â  Â  Â  "public" |

Â  Â  Â  Â  Â  Â  Â  Â  "new" |

Â  Â  Â  Â  Â  Â  Â  Â  "protected" |

Â  Â  Â  Â  Â  Â  Â  Â  "internal" |

Â  Â  Â  Â  Â  Â  Â  Â  "private"

Â  Â  Â  Â  Â  Â  ):Modifiers;max-each:1;*

Â  Â  Â  Â  Â  Â  "language" import(LanguageEnumBody);

Â 

Â  Â  Â  Â  public rule LanguageEnumBody =

Â  Â  Â  Â  Â  Â  "enum" Identifier:Name;

Â  Â  Â  Â  Â  Â  '{'

Â  Â  Â  Â  Â  Â  Â  Â  DelimitedList<LanguageEnumItem:Items;, ','> ','?

Â  Â  Â  Â  Â  Â  '}';

Â 

Â  Â  Â  Â  public rule LanguageEnumItem =

Â  Â  Â  Â  Â  Â  '@'!:IsCaseInsensitive;? Identifier:Value; (':' Identifier:Name)?;

Â 

Â  Â  Â  Â  public rule LanguageEnumDeclaration =

Â  Â  Â  Â  Â  Â  (

Â  Â  Â  Â  Â  Â  Â  Â  "public" |

Â  Â  Â  Â  Â  Â  Â  Â  "new" |

Â  Â  Â  Â  Â  Â  Â  Â  "protected" |

Â  Â  Â  Â  Â  Â  Â  Â  "internal" |

Â  Â  Â  Â  Â  Â  Â  Â  "private"

Â  Â  Â  Â  Â  Â  ):Modifiers;max-each:1;* 

Â  Â  Â  Â  Â  Â  import(LanguageEnumBody);

Â  Â  }

}

The thing about this concept is there's certain limiters I plan to add to the language to eliminate the need to do certain checks yourself. You'll notice the max-each keyword, it's effectively used in cases where there's multiple tokens, called in an alternation, that's repeated. This would be a group only option which will instruct it to keep tabs on which elements have been already included in the set of tokens. Given the DFA structure I'll be implementing in this version, and the fact that it'll keep tabs on every token that's been encountered, this shouldn't be too difficult once I get there. I'm also planning on other features which I just didn't add in this version of the langauge description: simple things like 'import'ing other structures into the description of a rule. Wherever you use 'import(rule)' it'll import the description of the rule on that spot. Creating a concatenation of the two rules, but without the reference to the rule you import, instead concatenating the full definition.

Alexander Morou · Jun 16, 2009

Another aspect of the Compiler Compiler, the compiler back-end for languages to use to do the actual compile, is going well lately.

The most recent effort on my part has been the description of the expression side of things. Language Integrated Query, Lambda Expressions, symbol expressions, symbolic fusions, symbolic fusions to an expression series and type-series (for inference towards call site information on either methods, properties, indexers, and so on, since properties can also return function pointers/delegates), and so on.

To simplify the infrastructure's use, I've put a double effort into play for creating Language Integrated Queries in the system.

You can either make a LinqExpression via direct instantiation:

Code:

[pre][font=courier new][color=#FF00FF]ILinqExpression[/color] [color=#008080]queryTestA[/color] [color=#808000]=[/color] [color=#0000FF]new[/color] [b][color=#800080]LinqExpression[/color][/b][color=#808080]([/color][color=#808080])[/color][color=#808080];[/color]

[color=#008080]queryTestA[/color][color=#808080].[/color][color=#008080]From[/color] [color=#808000]=[/color] [color=#0000FF]new[/color] [color=#800080][b]LinqFromClause[/b][/color][color=#808080]([/color][color=#008000]/* rangeVariableName: */[/color][color=#008000]"person"[/color][color=#808080],[/color] [color=#008000]/* rangeSource: */[/color][color=#008000]"people"[/color][color=#808080].[/color][color=#008080]GetSymbolExpression[/color][color=#808080]([/color][color=#808080])[/color][color=#808080])[/color][color=#808080];[/color]

[color=#0000FF]var[/color] [color=#008080]queryTestABody[/color] [color=#808000]=[/color] [color=#0000FF]new[/color] [color=#800080][b]LinqSelectBody[/b][/color][color=#808080]([/color][color=#808080])[/color][color=#808080];[/color]

[color=#008080]queryTestA[/color][color=#808080].[/color][color=#008080]Body[/color] [color=#808000]=[/color] [color=#008080]queryTestABody[/color][color=#808080];[/color]

[color=#008080]queryTestABody[/color][color=#808080].[/color][color=#008080]Clauses[/color][color=#808080].[/color][color=#008080]Join[/color][color=#808080]([/color][color=#008000]/* rangeVariableName: */[/color][color=#008000]"pet"[/color][color=#808080],[/color] Â  Â [color=#008000]/* rangeSelector: */[/color][color=#008000]"pets"[/color][color=#808080].[/color][color=#008080]GetSymbolExpression[/color][color=#808080]([/color][color=#808080])[/color][color=#808080],[/color] [color=#008000]/* leftSelector: */[/color][color=#008000]"person"[/color][color=#808080].[/color][color=#008080]GetSymbolExpression[/color][color=#808080]([/color][color=#808080])[/color][color=#808080],[/color] [color=#008000]/* rightSelector: */[/color][color=#008000]"pet"[/color][color=#808080].[/color][color=#008080]Fuse[/color][color=#808080]([/color][color=#008000]"Owner"[/color][color=#808080])[/color][color=#808080],[/color] [color=#008000]/* intoRangeName: */[/color][color=#008000]"gj"[/color][color=#808080])[/color][color=#808080];[/color]

[color=#008080]queryTestABody[/color][color=#808080].[/color][color=#008080]Clauses[/color][color=#808080].[/color][color=#008080]From[/color][color=#808080]([/color][color=#008000]/* rangeVariableName: */[/color][color=#008000]"subPet"[/color][color=#808080],[/color] [color=#008000]/* rangeSelector: */[/color][color=#008000]"gj"[/color][color=#808080].[/color][color=#008080]Fuse[/color][color=#808080]([/color][color=#008000]"DefaultIfEmpty"[/color][color=#808080])[/color][color=#808080].[/color][color=#008080]Fuse[/color][color=#808080]([/color][color=#0000FF]new[/color] [color=#FF00FF]IExpression[/color][color=#808080][[/color][color=#FF0000]0[/color][color=#808080]][/color][color=#808080])[/color][color=#808080])[/color][color=#808080];[/color]

[color=#008080]queryTestABody[/color][color=#808080].[/color][color=#008080]Selection[/color] [color=#808000]=[/color] [color=#008000]"string"[/color][color=#808080].[/color][color=#008080]Fuse[/color][color=#808080]([/color][color=#008000]"Format"[/color][color=#808080])[/color][color=#808080].[/color][color=#008080]Fuse[/color][color=#808080]([/color][color=#008000]"{0,-15}{1}"[/color][color=#808080].[/color][color=#008080]ToPrimitive[/color][color=#808080]([/color][color=#808080])[/color][color=#808080].[/color][color=#008080]AsNamedParameter[/color][color=#808080]([/color][color=#008000]"format"[/color][color=#808080])[/color][color=#808080],[/color] [color=#008000]"person"[/color][color=#808080].[/color][color=#008080]Fuse[/color][color=#808080]([/color][color=#008000]"FirstName"[/color][color=#808080])[/color][color=#808080].[/color][color=#008080]AsNamedParameter[/color][color=#808080]([/color][color=#008000]"arg0"[/color][color=#808080])[/color][color=#808080],[/color] [color=#808080]([/color][color=#808080]([/color][color=#800080][b]Symbol[/b][/color][color=#808080])[/color][color=#008000]"subPet"[/color][color=#808080])[/color][color=#808080].[/color][color=#008080]EqualTo[/color][color=#808080]([/color][color=#800080][b]IntermediateGateway[/b][/color][color=#808080].[/color][color=#008080]NullValue[/color][color=#808080])[/color][color=#808080].[/color][color=#008080]IIf[/color][color=#808080]([/color][color=#008000]"string"[/color][color=#808080].[/color][color=#008080]Fuse[/color][color=#808080]([/color][color=#008000]"Empty"[/color][color=#808080])[/color][color=#808080],[/color] [color=#008000]"subPet"[/color][color=#808080].[/color][color=#008080]Fuse[/color][color=#808080]([/color][color=#008000]"Name"[/color][color=#808080])[/color][color=#808080])[/color][color=#808080].[/color][color=#008080]AsNamedParameter[/color][color=#808080]([/color][color=#008000]"arg1"[/color][color=#808080])[/color][color=#808080])[/color][color=#808080];[/color][/font][/pre]

Or via a LinqBuilder through the LinqHelper class:

Code:

[font=courier new][color=#FF00FF]ILinqExpression[/color] [color=#008080]queryTestB[/color] [color=#808000]=[/color]

Â  Â  [color=#800080][b]LinqHelper[/b][/color]

Â  Â  Â  Â  [color=#808080].[/color][color=#008080]From[/color][color=#808080]([/color][color=#008000]/* Â  Â  rangeVariable: */[/color] [color=#008000]"person"[/color][color=#808080],[/color] [color=#008000]/* rangeSource: */[/color][color=#808080]([/color][color=#800080][b]Symbol[/b][/color][color=#808080])[/color][color=#008000]"people"[/color][color=#808080])[/color]

Â  Â  Â  Â  [color=#808080].[/color][color=#008080]Join[/color][color=#808080]([/color][color=#008000]/* rangeVariableName: */[/color] Â  Â [color=#008000]"pet"[/color][color=#808080],[/color] [color=#008000]/* rangeSource: */[/color][color=#808080]([/color][color=#800080][b]Symbol[/b][/color][color=#808080])[/color][color=#008000]"pets"[/color][color=#808080],[/color] [color=#008000]/* conditionLeft: */[/color][color=#808080]([/color][color=#800080][b]Symbol[/b][/color][color=#808080])[/color][color=#008000]"person"[/color][color=#808080],[/color] [color=#008000]/* conditionRight: */[/color][color=#008000]"pet"[/color][color=#808080].[/color][color=#008080]Fuse[/color][color=#808080]([/color][color=#008000]"Owner"[/color][color=#808080])[/color][color=#808080],[/color] [color=#008000]/* intoRangeName: */[/color][color=#008000]"gj"[/color][color=#808080])[/color]

Â  Â  Â  Â  [color=#808080].[/color][color=#008080]From[/color][color=#808080]([/color][color=#008000]/* rangeVariableName: */[/color] [color=#008000]"subPet"[/color][color=#808080],[/color] [color=#008000]/* rangeSource: */[/color][color=#008000]"gj"[/color][color=#808080].[/color][color=#008080]Fuse[/color][color=#808080]([/color][color=#008000]"DefaultIfEmpty"[/color][color=#808080])[/color][color=#808080].[/color][color=#008080]Fuse[/color][color=#808080]([/color][color=#0000FF]new[/color] [color=#FF00FF]IExpression[/color][color=#808080][[/color][color=#FF0000]0[/color][color=#808080]][/color][color=#808080])[/color][color=#808080])[/color]

Â  Â  Â  Â  [color=#808080].[/color][color=#008080]Select[/color][color=#808080]([/color][color=#008000]/* Â  Â  Â  selection: */[/color] [color=#008000]"string"[/color][color=#808080].[/color][color=#008080]Fuse[/color][color=#808080]([/color][color=#008000]"Format"[/color][color=#808080])[/color][color=#808080].[/color][color=#008080]Fuse[/color][color=#808080]([/color][color=#008000]"{0,-15}{1}"[/color][color=#808080].[/color][color=#008080]ToPrimitive[/color][color=#808080]([/color][color=#808080])[/color][color=#808080].[/color][color=#008080]AsNamedParameter[/color][color=#808080]([/color][color=#008000]"format"[/color][color=#808080])[/color][color=#808080],[/color] [color=#008000]"person"[/color][color=#808080].[/color][color=#008080]Fuse[/color][color=#808080]([/color][color=#008000]"FirstName"[/color][color=#808080])[/color][color=#808080].[/color][color=#008080]AsNamedParameter[/color][color=#808080]([/color][color=#008000]"arg0"[/color][color=#808080])[/color][color=#808080],[/color] [color=#808080]([/color][color=#808080]([/color][color=#800080][b]Symbol[/b][/color][color=#808080])[/color][color=#008000]"subPet"[/color][color=#808080])[/color][color=#808080].[/color][color=#008080]EqualTo[/color][color=#808080]([/color][color=#800080][b]IntermediateGateway[/b][/color][color=#808080].[/color][color=#008080]NullValue[/color][color=#808080])[/color][color=#808080].[/color][color=#008080]IIf[/color][color=#808080]([/color][color=#008000]"string"[/color][color=#808080].[/color][color=#008080]Fuse[/color][color=#808080]([/color][color=#008000]"Empty"[/color][color=#808080])[/color][color=#808080],[/color] [color=#008000]"subPet"[/color][color=#808080].[/color][color=#008080]Fuse[/color][color=#808080]([/color][color=#008000]"Name"[/color][color=#808080])[/color][color=#808080])[/color][color=#808080].[/color][color=#008080]AsNamedParameter[/color][color=#808080]([/color][color=#008000]"arg1"[/color][color=#808080])[/color][color=#808080])[/color][color=#808080])[/color][color=#808080].[/color][color=#008080]Build[/color][color=#808080]([/color][color=#808080])[/color][color=#808080];[/color][/font]

Personally I'm partial towards the builder method.

You'll probably also notice you have two ways to get a symbol expression, "text".GetSymbolExpression() or (Symbol)"text". The reason for multiple methods of doing things is preference. The initialization via creating the objects yourself is a requirement by having an object system, the builder version is for simplicity, to simplify the process, and ensure it's always right.

MouraVision · Jun 30, 2009

Interesting thing you got there, personally I haven't read through all of it because I am lacking time, but I will and will give you my thoughts about this when I do finish reading it.

Keep working on it, and keep giving back reports. Good Luck

Alexander Morou · Jun 30, 2009

Well,



    The simple description of the compiler framework is, it's a framework for two things: 

1. Generating large amounts of code (sometimes hundreds of megabytes worth)

2. Analysis of existing code translated into the object model (typically done while parsing the code)

    2.a. Analysis for compilation

    2.b. Analysis for language->language translation; where one feature, in the source language, isn't present in the destination language.



    The simple description of the parser compiler framework is: it's a framework for generating the code necessary to

parse a certain kind of language file and give you a concrete syntax tree of the results of the parse.  It builds the CST,

lexical analysis (scanner/tokenizer) parser, and the syntactical parser.  This project isn't as far along as the other, as

there are still technical aspects I'm figuring out (tokenizer = finished, syntactical = drawing board, but close).

Alexander Morou · Jul 12, 2009

Yay finally making headway on the OILexer project.

Above is a screen shot from visual studio in which I'm going through the data structures it's generating. Into those, you'll notice that within a class' body within a C# file there are seven possible paths to the term 'static'. This essentially provides me with the 'first set' information. It's basically a set of data which says at any given point, what terms are available, and in this case, why they're available.

The first set data path information is quite a bit of information by itself, but all it's really stating are the terms that branch from that given point. The actual individual states, which are transitioned into, are another matter. The above small snippet shows that it only creates a determinized version of the state once it is called for.

The next step is to take the first data and wrap it around another state system, the follow set state. Once this is done I can finally start getting somewhere, provided I don't find there's a critical flaw somewhere. The follow state will basically be responsible for watching the tail end of things, when a terminal state is encountered on a given production, it needs to look at the paths which reached that point, and mix in the appropriate state information and determinize it once again.

The entire thing hinges on being lazily evaluated, which is why I'm assuming there will be little chance for error. Though I could be wrong.

Alexander Morou · Aug 2, 2009

Well,

After finding out the first attempt, at a first set and follow set state, was a complete failure, I rewrote the associated code and received something like the following:

Code:

[font=courier new][color=#008080]CSharpUnionState[/color] [color=#008080]csus[/color] [color=#808000]=[/color] [color=#800080]CSharpUnionState[/color][color=#808080].[/color][color=#008080]ObtainRule[/color][color=#808080]([/color][color=#800080]RuleIdentifiers[/color][color=#808080].[/color][color=#800080]Rules1[/color][color=#808080].[/color][color=#008080]CSharpFile[/color][color=#808080])[/color][color=#808080];[/color]

[color=#008080]var[/color] [color=#008080]midPoint[/color] [color=#808000]=[/color] [color=#008080]csus[/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Keywords2[/color][color=#808080].[/color][color=#008080]Namespace[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Captures[/color][color=#808080].[/color][color=#008080]Identifier[/color][color=#808080]][/color]

    [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]OpenBlock[/color][color=#808080]][/color]

        [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Keywords1[/color][color=#808080].[/color][color=#008080]Class[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Captures[/color][color=#808080].[/color][color=#008080]Whitespace[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Captures[/color][color=#808080].[/color][color=#008080]Identifier[/color][color=#808080]][/color]

        [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]OpenBlock[/color][color=#808080]][/color]

            [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Keywords1[/color][color=#808080].[/color][color=#008080]Class[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Captures[/color][color=#808080].[/color][color=#008080]Whitespace[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Captures[/color][color=#808080].[/color][color=#008080]Identifier[/color][color=#808080]][/color]

            [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]OpenBlock[/color][color=#808080]][/color]

                [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Keywords2[/color][color=#808080].[/color][color=#008080]Struct[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Captures[/color][color=#808080].[/color][color=#008080]Whitespace[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Captures[/color][color=#808080].[/color][color=#008080]Identifier[/color][color=#808080]][/color]

                [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]OpenBlock[/color][color=#808080]][/color]

                    [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Keywords1[/color][color=#808080].[/color][color=#008080]Class[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Captures[/color][color=#808080].[/color][color=#008080]Whitespace[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Captures[/color][color=#808080].[/color][color=#008080]Identifier[/color][color=#808080]][/color]

                    [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]OpenBlock[/color][color=#808080]][/color]

                        [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Keywords1[/color][color=#808080].[/color][color=#008080]Delegate[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Captures[/color][color=#808080].[/color][color=#008080]Whitespace[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Captures[/color][color=#808080].[/color][color=#008080]Identifier[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators1[/color][color=#808080].[/color][color=#008080]Multiplication[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Captures[/color][color=#808080].[/color][color=#008080]Identifier[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]LeftParenthesis[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]RightParenthesis[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]EntryTerminal[/color][color=#808080]][/color]

                    [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]CloseBlock[/color][color=#808080]][/color]

                    [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]EntryTerminal[/color][color=#808080]][/color]

                [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]CloseBlock[/color][color=#808080]][/color]

            [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]CloseBlock[/color][color=#808080]][/color]

            [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]EntryTerminal[/color][color=#808080]][/color]

        [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]CloseBlock[/color][color=#808080]][/color]

        [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]EntryTerminal[/color][color=#808080]][/color]

        [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Keywords1[/color][color=#808080].[/color][color=#008080]Class[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Captures[/color][color=#808080].[/color][color=#008080]Whitespace[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Captures[/color][color=#808080].[/color][color=#008080]Identifier[/color][color=#808080]][/color]

        [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]OpenBlock[/color][color=#808080]][/color]

            [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Keywords1[/color][color=#808080].[/color][color=#008080]Interface[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Captures[/color][color=#808080].[/color][color=#008080]Whitespace[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Captures[/color][color=#808080].[/color][color=#008080]Identifier[/color][color=#808080]][/color]

            [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]OpenBlock[/color][color=#808080]][/color]

            [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]CloseBlock[/color][color=#808080]][/color]

            [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Keywords1[/color][color=#808080].[/color][color=#008080]Delegate[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Captures[/color][color=#808080].[/color][color=#008080]Whitespace[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Captures[/color][color=#808080].[/color][color=#008080]Identifier[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Captures[/color][color=#808080].[/color][color=#008080]Identifier[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]LeftParenthesis[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]RightParenthesis[/color][color=#808080]][/color][color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]EntryTerminal[/color][color=#808080]][/color]

        [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]CloseBlock[/color][color=#808080]][/color]

        [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]EntryTerminal[/color][color=#808080]][/color]

    [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]CloseBlock[/color][color=#808080]][/color]

    [color=#808080].[/color][color=#008080]Transitions[/color][color=#808080][[/color][color=#800080]RuleTransitions[/color][color=#808080].[/color][color=#800080]Operators2[/color][color=#808080].[/color][color=#008080]EntryTerminal[/color][color=#808080]][/color][color=#808080];[/color][/font]

Which would effectively look like the following file:

Code:

[color=#0000FF]namespace[/color] [color=#008080]Identifier[/color]

[color=#808080]{[/color]

    [color=#0000FF]class[/color][font=tahoma]▒[/font][color=#008080]Identifier[/color]

    [color=#808080]{[/color]

        [color=#0000FF]struct[/color][font=tahoma]▒[/font][color=#008080]Identifier[/color]

        [color=#808080]{[/color]

            [color=#0000FF]class[/color][font=tahoma]▒[/font][color=#008080]Identifier[/color]

            [color=#808080]{[/color]

                [color=#0000FF]delegate[/color][font=tahoma]▒[/font][color=#008080]Identifier[/color][color=#808000]*[/color] [color=#008080]Identifier[/color][color=#808080]([/color][color=#808080])[/color][color=#808080];[/color]

            [color=#808080]}[/color][color=#808080];[/color]

        [color=#808080]}[/color]

    [color=#808080]}[/color][color=#808080];[/color]

    [color=#0000FF]class[/color][font=tahoma]▒[/font][color=#008080]Identifier[/color]

    [color=#808080]{[/color]

        [color=#0000FF]interface[/color][font=tahoma]▒[/font][color=#008080]Identifier[/color]

        [color=#808080]{[/color]

        [color=#808080]}[/color]

        [color=#0000FF]delegate[/color][font=tahoma]▒[/font][color=#008080]Identifier[/color] [color=#008080]Identifier[/color][color=#808080]([/color][color=#808080])[/color][color=#808080];[/color]

    [color=#808080]}[/color][color=#808080];[/color]

[color=#808080]}[/color][color=#808080];[/color]

(The funky "▒" character represents points where the white space is required.)
The most difficult aspect was closing up the tails of the rules. Finding the proper data structure to represent the problem. I guess the first few times I tried, and failed, I didn't understand it well enough to do so. The next phase is to make this information useful. I'm a bit worried about the time factor of the equation because the lexer, so far, seems to be blazing fast compared to the unified state model.

As a comparison, in just using keywords, I received roughly 40MB/s validation/failure response. To build the above tree for the first time took 21 ms (after eliminating the 400ms JIT). The second access was about 0.000048 seconds, or 48μs (microseconds), which is probably acceptable?

Suggestions, questions, comments?

Edit
When I first posted this, I hadn't known that there was an issue in the handling of left-recursion.
Presently it appears to handle direct left recursion well. I won't be able to say much about the indirect left recursion and hidden left recursion until I bake the logic of the unified state model into the builder.

On a side note I think I managed to triple the speed of the tree building, I'll do more tests and post back when I've confirmed this.

Alexander Morou · Aug 10, 2009

Further into the thick of it.

Code:

[color=#0000FF]namespace[/color] [color=#008080]Identifier[/color]

[color=#808080]{[/color]

 [color=#0000FF]class[/color]▒[color=#008080]Identifier[/color]

 [color=#808080]{[/color]

  [color=#0000FF]partial[/color]▒[color=#0000FF]class[/color]▒[color=#008080]Identifier[/color]

  [color=#808080]{[/color]

   [color=#0000FF]struct[/color]▒[color=#008080]Identifier[/color]

   [color=#808080]{[/color]

    [color=#0000FF]class[/color]▒[color=#008080]Identifier[/color]

    [color=#808080]{[/color]

     [color=#0000FF]delegate[/color]▒[color=#008080]Identifier[/color][color=#808000]*[/color] [color=#008080]Identifier[/color][color=#808080]([/color][color=#808080])[/color][color=#808080];[/color]

     [color=#0000FF]bool[/color] [color=#008080]Identifier[/color][color=#808080]([/color][color=#008080]Identifier[/color] [color=#008080]Identifier[/color][color=#808080])[/color]

     [color=#808080]{[/color]

      [color=#0000FF]if[/color] [color=#808080]([/color][color=#008080]Identifier[/color][color=#808000]<[/color][color=#008080]Identifier[/color][color=#808000]>[/color][color=#808080].[/color][color=#008080]Identifier[/color][color=#808080][[/color][color=#008080]Identifier[/color][color=#808080]][/color] [color=#909090]?[/color][color=#909090]?[/color] [color=#008080]Identifier[/color] [color=#808000]+[/color] [color=#008080]Identifier[/color] [color=#808000]<<[/color] [color=#008080]Identifier[/color] [color=#808000]%[/color] [color=#808000]-[/color][color=#808080]([/color][color=#008080]Identifier[/color] [color=#808000]>>[/color] [color=#008080]Identifier[/color][color=#808080])[/color] [color=#808000]!=[/color] [color=#008080]Identifier[/color] [color=#808000]>=[/color] [color=#008080]Identifier[/color] [color=#808000]&&[/color] [color=#008080]Identifier[/color] [color=#808000]==[/color] [color=#008080]Identifier[/color] [color=#808000]>[/color] [color=#008080]Identifier[/color] [color=#808000]<[/color] [color=#008080]Identifier[/color] [color=#808000]*[/color] [color=#008080]Identifier[/color] [color=#808000]<=[/color] [color=#008080]Identifier[/color] [color=#808000]||[/color] [color=#008080]Identifier[/color] [color=#808000]^[/color] [color=#008080]Identifier[/color] [color=#808000]*[/color] [color=#008080]Identifier[/color] [color=#808000]&&[/color] [color=#008080]Identifier[/color][color=#808080])[/color]

      [color=#808080]{[/color]

       [color=#0000FF]foreach[/color] [color=#808080]([/color][color=#008080]Identifier[/color] [color=#008080]Identifier[/color] [color=#0000FF]in[/color] [color=#008080]Identifier[/color][color=#808080].[/color][color=#008080]Identifier[/color][color=#808080][[/color][color=#008080]Identifier[/color][color=#808080]][/color][color=#808080])[/color]

       [color=#808080]{[/color]

       [color=#808080]}[/color]

      [color=#808080]}[/color]

     [color=#808080]}[/color]

    [color=#808080]}[/color][color=#808080];[/color]

   [color=#808080]}[/color]

  [color=#808080]}[/color][color=#808080];[/color]

 [color=#808080]}[/color][color=#808080];[/color]

 [color=#0000FF]class[/color]▒[color=#008080]Identifier[/color]

 [color=#808080]{[/color]

  [color=#0000FF]struct[/color]▒[color=#008080]Identifier[/color]

  [color=#808080]{[/color]

   [color=#0000FF]class[/color]▒[color=#008080]Identifier[/color]

   [color=#808080]{[/color]

    [color=#0000FF]interface[/color]▒[color=#008080]Identifier[/color]

    [color=#808080]{[/color]

    [color=#808080]}[/color]

    [color=#0000FF]delegate[/color]▒[color=#008080]Identifier[/color] [color=#008080]Identifier[/color][color=#808080]([/color][color=#808080])[/color][color=#808080];[/color]

   [color=#808080]}[/color][color=#808080];[/color]

  [color=#808080]}[/color]

  [color=#0000FF]delegate[/color]▒[color=#008080]Identifier[/color] [color=#008080]Identifier[/color][color=#808080]([/color][color=#808080])[/color][color=#808080];[/color]

 [color=#808080]}[/color][color=#808080];[/color]

 [color=#0000FF]class[/color]▒[color=#008080]Identifier[/color]

 [color=#808080]{[/color]

  [color=#0000FF]interface[/color]▒[color=#008080]Identifier[/color]

  [color=#808080]{[/color]

  [color=#808080]}[/color]

  [color=#0000FF]delegate[/color]▒[color=#008080]Identifier[/color] [color=#008080]Identifier[/color][color=#808080]([/color][color=#808080])[/color][color=#808080];[/color]

 [color=#808080]}[/color][color=#808080];[/color]

 [color=#0000FF]class[/color]▒[color=#008080]Identifier[/color]

 [color=#808080]{[/color]

  [color=#0000FF]bool[/color] [color=#008080]Identifier[/color][color=#808080]([/color][color=#008080]Identifier[/color] [color=#008080]Identifier[/color][color=#808080])[/color]

  [color=#808080]{[/color]

   [color=#0000FF]if[/color] [color=#808080]([/color][color=#008080]Identifier[/color][color=#808000]<[/color][color=#008080]Identifier[/color][color=#808000]>[/color][color=#808080].[/color][color=#008080]Identifier[/color][color=#808080][[/color][color=#008080]Identifier[/color][color=#808080]][/color] [color=#909090]?[/color][color=#909090]?[/color] [color=#008080]Identifier[/color] [color=#808000]+[/color] [color=#008080]Identifier[/color] [color=#808000]<<[/color] [color=#008080]Identifier[/color] [color=#808000]%[/color] [color=#808000]-[/color][color=#808080]([/color][color=#008080]Identifier[/color] [color=#808000]>>[/color] [color=#008080]Identifier[/color][color=#808080])[/color] [color=#808000]!=[/color] [color=#008080]Identifier[/color] [color=#808000]>=[/color] [color=#008080]Identifier[/color] [color=#808000]&&[/color] [color=#008080]Identifier[/color] [color=#808000]==[/color] [color=#008080]Identifier[/color] [color=#808000]>[/color] [color=#008080]Identifier[/color] [color=#808000]<[/color] [color=#008080]Identifier[/color] [color=#808000]*[/color] [color=#008080]Identifier[/color] [color=#808000]<=[/color] [color=#008080]Identifier[/color] [color=#808000]||[/color] [color=#008080]Identifier[/color] [color=#808000]^[/color] [color=#008080]Identifier[/color] [color=#808000]*[/color] [color=#008080]Identifier[/color] [color=#808000]&&[/color] [color=#008080]Identifier[/color][color=#808080])[/color]

   [color=#808080]{[/color]

    [color=#0000FF]foreach[/color] [color=#808080]([/color][color=#008080]Identifier[/color] [color=#008080]Identifier[/color] [color=#0000FF]in[/color] [color=#008080]Identifier[/color][color=#808080].[/color][color=#008080]Identifier[/color][color=#808080][[/color][color=#0000FF]bool[/color][color=#808080].[/color][color=#008080]Identifier[/color] [color=#808000]%[/color] [color=#008080]Identifier[/color][color=#808080]][/color][color=#808080])[/color]

    [color=#808080]{[/color]

    [color=#808080]}[/color]

   [color=#808080]}[/color]

  [color=#808080]}[/color]

 [color=#808080]}[/color][color=#808080];[/color]

[color=#808080]}[/color]

Stepping through that stream state the first time takes about 100ms, I won't post the code used to emulate the token set for such a thing, because it's incredibly long. A second step through the same set takes 135 microseconds (0.000135 seconds). I'm finally ready to start seeing if I can pull usable data from this and build the parse graph. Only took me two years to get this far! Naturally I didn't work on it solid for those two years, but meh.

I think ambiguity resolution will be the biggest pitfall. Yay for redundant parses.

Alexander Morou · Aug 24, 2009

Nearly finished with the lexical analysis phase. Earlier I had reported that it was finished; however, I checked into it further and realized I had ripped out the lexical analyzer's primary method: NextToken.

The primary reason for the removal is the original was written ages ago before I had a properly functioning state machine generator (which means it was a half-assed generator).

On a side note, I've also been taking a look at Visual Studio's class diagram portion of its program and I decided to make a quick breakdown of the project in visual form (super large image warning, 24-bit (3-byte) 8109px × 4696px in size, 2+MB png or 108MB-144MB to load in memory, depending on 3/4 bytes per pixel -> (8109*4696*3)/(2^20)):

(Click to enlarge)

Does anyone here develop, or at least start things out with, a visual diagram of how their program should look, object-wise? I've never seen the point myself.

As a side note, on the far-right near the middle of the image, GDParser has a method named 'ParseContinuous' which was for solving direct left recursion on rules for binary operators for the conditional directives defined in templates. Interestingly the methods used, in the parser I wrote by hand versus the parser compiler's resultant parsers, are completely different. ParseContinuous repeats a parse wrapping the resulted element into a 'token' it places upon the next stack, which is a hack-ish method to fake a token on the stack of that kind. The resulted parsers don't need such a method, because they aren't recursive descent in nature. A good classification for the parsers generated is an LL(infinite) parser, because the length of the look-ahead is unbound to any specific number (it looks ahead until there's nothing more, regardless of what the intent of the language is).

This method makes it ideal for determining the look ahead at any given state of the language, but it's going to undoubtedly complicate the parse graph generation phase, which I'm brainstorming on next.

For an example of how the lexical analyzer's new next token looks, click here. I had to upload the file due to message length constraints on the forum software.

Update:
This project now has a home on CodePlex. If you're interested in future updates, please go there.

Woocash · Aug 25, 2009

It sound very wisely : )

Alexander Morou · Aug 26, 2009

What?

Alexander Morou · Feb 20, 2010

Been a while, but I'm starting to get back into this.

It'll play an integral role in the Abstraction: Game Maker project later on. In September I decided that it would be a good idea to rewrite a good portion of the project's Build code (meaning all of it), I wrote more generalized deterministic automation and extended the language token syntax to enable Unicode support, which required me to have to rewrite the bit-set handlers to keep up with the amount of data.

The general capture-styled tokens required a bit more finesse to allow for Unicode category inclusion, primarily due to the unicode categories being sparsely collected in many, many groupings. The results of which lead to single state-machines with 3.3MB of code (with a library build size of 2MB!) After including unicode awareness into the state-machine builder (and thus the resulted state-machines), the size dropped to 200KB, furthering this concept and shifting the redundant unicode logic into a single area of the procedure and using GoTo logic to forward to these redundant areas, lead to a further reduction in size to about 78KB for the same machine that once took 3+MB to express. That same library is now 44KB for the lexical state machines alone.

Here's an example of the unicode-hungry state machine used in C#, the identifier, which allows any language's characters, so long as they're in a given set of categories. There are 182 states and five different Unicode logic graphs. If two graphs are identical, but one originates within the state that they transition into, they are two distinct graphs to avoid setting the state when it's not necessary. Edit: The state machine would usually have been very small; however, this identifier had the Keywords automation subtracted from it, so that the end states from the keyword are not within the identifier. This allows the keyword to always win over the Identifier.

Sometime later the output code will include comments, but ... not just yet. :]

Alexander Morou · Aug 3, 2010

News,

Worked on integrating the language used to describe grammars for other languages into Visual Studio.

I did this to get an idea of what's involved in Visual Studio Integration. In doing so I learned a few things: My grammar parser is not geared towards the kind of parser you'd expect for Visual Studio Integration. Its lexer is super quick, but the parser and identity resolution mechanisms are slow by comparison. 150KB of text is quick to lex, but not nearly as quick to parse. Later on I'll rewrite the parser using the program and see about instituting delta-based lexer/parser logic. If I can do this it should greatly expedite parsing/lexing and identity resolution.

There's one key advantage to a multiple file setup: you can circumvent many of the pitfalls I encountered with my system by distributing the identities across multiple files, the main file would parse, and one-time parse the other files associated. This decreases the overhead on how much needs re-parsed each key press and increases the speed of identity resolution, since there's less items within the file that its resolving.

Envision, Create, Share

Compiler Compiler - Top-Down Recursive Descent LL(k)

Alexander Morou

Alexander Morou

Alexander Morou

Alexander Morou

Alexander Morou

MouraVision

Alexander Morou

Alexander Morou

Alexander Morou

Alexander Morou

Alexander Morou

Woocash

Alexander Morou

Alexander Morou

Alexander Morou

Share this page

Thank you for viewing

Discord

RPG Maker

Content