Architecture

Stage 1, the templater

The templater takes raw files and optionally fills in any configurable sections before passing onto the linter. In particular most templating systems use curly brackets {} to indicate templatable sections and these are not currently set up in any of the lexers, and so will fail at the next step if not dealt with here.

SQLFluff supports 2 templating engines:

Jinja is the default templater used by SQLFluff. Under the covers, dbt also uses Jinja but in SQLFluff it is a separate templater which relies directly on the dbt compiler.

For more details on how to configure the templater see Jinja Templating Configuration.

Stage 2, the lexer

The lexer takes raw text and splits it into tokens. Nothing is removed but whitespace and code are separated. In principle all identifiers should be separated at this stage, but they should not be imparted any meaning at this stage. Any files which cannot be lexed, should raise a SQLLexError.

While the Lexer is passed a series of raw segments, it will return a single segment, usually a FileSegment, which will be used to initiate parsing.

Stage 3, the parser

We recursively parse each of the elements, using their in built grammars.

  1. Initially we start with a segment, which contains only raw segments. This is normally a FileSegment, but it could in theory be any kind of segment, and as we recurse it will be lower and lower sublevels.

  2. We then parse this element by calling .parse().

    1. This first calls uses the parse_grammar if that is present, on which we call .match(). The match method of this grammar will return a potentially refined structure of the segments within this segment in greater detail than what was initially there. In the example of a FileSegment, it first divides up the query into statements and then finishes.

    2. The .match() method of any grammar is naturally recursive, and is comprised of segments and grammars. The match step is still not inert however and child elements of a match grammar may return more specific versions of segments during this phase even though we are not calling .parse()

      • Segments must implement a match_grammar to be used in this way. When .match() is called on a segment, this is the grammar which is used to decide whether there is a match.

      • Grammars are objects which combine segments or other grammars together in a pre-defined way. For example the OneOf grammar will match if any one of its child elements match.

    3. Regardless of whether the parse_grammar was used, the next step is to recursively call the .parse() method of each of the child segments of the grammar. This operation is wrapped in a method called .expand(). In the FileSegment, the first step will have transformed a series of raw tokens into StatementSegment segments, and the expand step will let each of those segments refine the content within them.

    4. Eventually in that recursive operation we reach segments which have no children (raw elements containing a single token), and so the recursion naturally finishes.

  3. If no match is found for the current segment, the contents will be wrapped in an UnparsableSegment which can be picked up as a parsing error later.

  4. In analysing the eventual tree, any UnparsableSegment objects should raise a SQLParseError which can then be formatted and raised to the user.

Principles within the parser

The parser is arguably the most complicated element of SQLFluff, and is relied on by all the other elements of the tool to do most of the heavy lifting. When working on the parser there are a couple of design principles to keep in mind.

  • The core of the grammar is stored in a dialect, the root dialect being the ansi dialect. The intent here is that for other SQL dialects that they will inherit most of the logic from a parent dialect and then just reimplement the sections that they need to. One reason for the Ref grammar is that it allows name resolution of grammar elements at runtime and so a patched grammar with some elements overriden can still rely on lower level elements which haven’t been overwritten.

  • Segments and Grammars operate roughly interchangeably from a dialect. A grammar is something which can be matched against and return a result. These will be used as instantiated classes and usually contain a set of elements (stored in the _elements attribute) which are the sub elements which the grammar will use for matching. These may in turn be a mixture of segments and grammars. A segment can be used as either a class or an instance of a class. While matching, the class itself is used, but if it does match, rather than just returning the original segments it was passed unchanged, it will optionally return mutated versions of those segments which may have been given a more specific meaning by the matcher. For example a RawSegment containing 123.4 may actually be mutated to an instance of the NumericLiteralSegment class when matched against the NumericLiteralSegment class.

  • All grammars and segments attempt to match as much as they can and will return partial matches where possible. It is up to the calling grammar or segment to decide whether a partial or complete match is required based on the context it is matching in.

Stage 4, the linter

Given the complete parse tree, we now walk that tree to assess the tree for linting errors. A linter is an object which is able to traverse the tree itself, allowing it to choose which objects it examines and which it alerts as a problem.

Some linters, may optionally be able to fix the problems they find. If this is the case, they will optionally return a mutated tree as one of their return values, which can be passed to the next linter. In normal operation this will not be what is returned, because it becomes confusing with line references for a user fixing issues manually.