Internals

This section is best read alongside the codebase. dialect_ansi.py in particular helps in understanding the recursive structure of segments and grammars. More detail is also available on our Wiki, including a Contributing Dialect Changes guide.

Architecture

At a high level, the behaviour of SQLFluff is divided into a few key stages. Whether calling sqlfluff lint, sqlfluff fix or sqlfluff parse, the internal flow is largely the same.

Stage 1, the templater

This stage only applies to templated SQL, most commonly Jinja and dbt. Vanilla SQL is sent straight to stage 2, the lexer.

In order to lint templated SQL, SQLFluff must first convert the ‘raw’ or pre-templated code into valid SQL, which can then be parsed. The templater returns both the raw and post-templated SQL so that any rule violations which occur in templated sections can be ignored and the rest mapped to their original line location for user feedback.

SQLFluff supports two templating engines: Jinja and dbt.

Under the hood, dbt also uses Jinja, but SQLFluff handles it with a separate mechanism which interfaces directly with the dbt Python package.

For more details on how to configure the templater see Jinja Templating Configuration.
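To illustrate why the templater returns both versions of the file, here is a minimal sketch of a toy templater. It is not SQLFluff's implementation: toy_render, SliceMap and rendered_to_raw are hypothetical names, and only simple {{ name }} placeholders are handled. The point is the slice map, which lets a position in the rendered SQL be traced back to the raw file, or be recognised as lying inside templated output.

```python
import re
from typing import Dict, List, NamedTuple, Optional, Tuple


class SliceMap(NamedTuple):
    """Links a span of the raw file to a span of the rendered SQL."""

    kind: str        # "literal" or "templated"
    raw: slice       # span in the pre-templated source
    rendered: slice  # span in the post-templated SQL


def toy_render(raw: str, context: Dict[str, str]) -> Tuple[str, List[SliceMap]]:
    """Substitute {{ name }} placeholders and record where each piece came from."""
    pieces: List[str] = []
    slices: List[SliceMap] = []
    raw_pos = 0
    out_pos = 0

    def emit(kind: str, raw_span: slice, text: str) -> None:
        nonlocal out_pos
        pieces.append(text)
        slices.append(SliceMap(kind, raw_span, slice(out_pos, out_pos + len(text))))
        out_pos += len(text)

    for m in re.finditer(r"\{\{\s*(\w+)\s*\}\}", raw):
        if m.start() > raw_pos:
            emit("literal", slice(raw_pos, m.start()), raw[raw_pos:m.start()])
        emit("templated", slice(m.start(), m.end()), context[m.group(1)])
        raw_pos = m.end()
    if raw_pos < len(raw):
        emit("literal", slice(raw_pos, len(raw)), raw[raw_pos:])
    return "".join(pieces), slices


def rendered_to_raw(pos: int, slices: List[SliceMap]) -> Optional[int]:
    """Map a rendered position to the raw file; None if it is templated output."""
    for s in slices:
        if pos in range(s.rendered.start, s.rendered.stop):
            if s.kind == "templated":
                return None
            return s.raw.start + (pos - s.rendered.start)
    return None


raw_sql = "SELECT a FROM {{ table }} WHERE b = 1"
rendered_sql, slice_map = toy_render(raw_sql, {"table": "my_schema.my_table"})
```

A violation found at the rendered position of WHERE maps back to the position of WHERE in the raw file, while a position inside my_schema.my_table maps to None and could be ignored.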

Stage 2, the lexer

The lexer takes SQL and separates it into segments of whitespace and code. No meaning is imparted; that is the job of the parser.
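As a rough illustration of this stage, the sketch below splits a string into whitespace, newline and code tokens with a single regular expression. The patterns and the toy_lex name are assumptions made for the example; SQLFluff's real lexer is dialect-driven and considerably richer.

```python
import re
from typing import List, Tuple

# Order matters: earlier patterns win at a given position.
TOKEN_PATTERNS = [
    ("whitespace", r"[ \t]+"),
    ("newline", r"\r?\n"),
    # Words, integers, single-quoted strings, or any single symbol.
    ("code", r"[A-Za-z_][A-Za-z0-9_]*|\d+|'[^']*'|[^\sA-Za-z0-9_]"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))


def toy_lex(sql: str) -> List[Tuple[str, str]]:
    """Split SQL into (kind, raw) pairs without assigning any meaning."""
    tokens: List[Tuple[str, str]] = []
    pos = 0
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if m is None:
            raise ValueError(f"Cannot lex character at position {pos}: {sql[pos]!r}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


tokens = toy_lex("SELECT a,\n  1")
```

Note that nothing is lost: joining the raw values of the tokens reproduces the input exactly, which is what later allows fixes to be written back to the file.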

Stage 3, the parser

The parser is arguably the most complicated element of SQLFluff, and is relied on by all the other elements of the tool to do most of the heavy lifting.

  1. The lexed segments are parsed using the specified dialect’s grammars. In SQLFluff, grammars describe the shape of SQL statements (or their components). The parser attempts to apply each potential grammar to the lexed segments until all the segments have been matched.

  2. In SQLFluff, segments form a tree-like structure. The top-level segment is a FileSegment, which contains zero or more StatementSegments, and so on. Before the segments have been parsed and named according to their type, they are ‘raw’, meaning they have no classification other than their literal value.

  3. The three key components to the parser are segments, match_grammars and parse_grammars. A segment can be a leaf in the parse tree, such as a NumericLiteralSegment, which is simply a number, or can contain many other segments, such as a SelectStatementSegment. Each segment can specify a parse_grammar, and a match_grammar. If both a match_grammar and parse_grammar are defined in a segment, match_grammar is used to quickly prune the tree for branches which do not match segments being parsed, and the parse_grammar is then used to refine the branch identified as correct. If only a match_grammar is defined, then it serves the purpose of both pruning and refining.

  4. A segment’s .parse() method uses the parse_grammar, on which .match() is called. The match method of this grammar returns a potentially more detailed structure for the segments within the segment. In the example of a FileSegment, it first divides up the query into statements and then finishes.

    • Segments must implement a match_grammar. When .match() is called on a segment, this is the grammar which is used to decide whether there is a match.

    • Grammars combine segments or other grammars together in a pre-defined way. For example the OneOf grammar will match if any one of its child elements match.

  5. Regardless of whether the parse_grammar was used, the next step is to recursively call the .parse() method of each of the child segments of the grammar. This operation is wrapped in a method called .expand(). In the FileSegment, the first step will have transformed a series of raw tokens into StatementSegment segments, and the expand step will let each of those segments refine the content within them.

  6. During the recursion, the parser eventually reaches segments which have no children (raw segments containing a single token), and so the recursion naturally finishes.

  7. If no match is found for a segment, the contents will be wrapped in an UnparsableSegment which is picked up as a parsing error later.
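The steps above can be sketched with a toy parser. Everything here is hypothetical and heavily simplified (tokens are plain strings, whitespace is assumed to be already removed, and the class names only echo SQLFluff's OneOf and Sequence grammars), but it shows the same shape: a file is first divided into statements, each statement is refined by a grammar, and anything that cannot be fully matched is wrapped as unparsable.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    kind: str
    raw: str = ""
    children: List["Node"] = field(default_factory=list)


# A match result is (matched nodes, unmatched tokens), or None when nothing matched.
MatchResult = Optional[Tuple[List[Node], List[str]]]


class Keyword:
    def __init__(self, word: str) -> None:
        self.word = word

    def match(self, tokens: List[str]) -> MatchResult:
        if tokens and tokens[0].upper() == self.word:
            return [Node("keyword", tokens[0])], tokens[1:]
        return None


class NumericLiteral:
    def match(self, tokens: List[str]) -> MatchResult:
        if tokens and tokens[0].isdigit():
            return [Node("numeric_literal", tokens[0])], tokens[1:]
        return None


class Identifier:
    def match(self, tokens: List[str]) -> MatchResult:
        if tokens and tokens[0].isidentifier():
            return [Node("identifier", tokens[0])], tokens[1:]
        return None


class OneOf:
    """Matches if any one of its options matches (first match wins)."""

    def __init__(self, *options) -> None:
        self.options = options

    def match(self, tokens: List[str]) -> MatchResult:
        for option in self.options:
            result = option.match(tokens)
            if result is not None:
                return result
        return None


class Sequence:
    """Matches each element in order, passing on the unmatched remainder."""

    def __init__(self, *elements) -> None:
        self.elements = elements

    def match(self, tokens: List[str]) -> MatchResult:
        nodes: List[Node] = []
        rest = tokens
        for element in self.elements:
            result = element.match(rest)
            if result is None:
                return None
            matched, rest = result
            nodes.extend(matched)
        return nodes, rest


SELECT_GRAMMAR = Sequence(Keyword("SELECT"), OneOf(NumericLiteral(), Identifier()))


def parse_statement(tokens: List[str]) -> Node:
    result = SELECT_GRAMMAR.match(tokens)
    # The caller decides that a partial match is not good enough here.
    if result is None or result[1]:
        return Node("unparsable", children=[Node("raw", t) for t in tokens])
    return Node("select_statement", children=result[0])


def parse_file(tokens: List[str]) -> Node:
    """First divide the tokens into statements, then refine each one."""
    statements: List[Node] = []
    current: List[str] = []
    for token in tokens + [";"]:
        if token == ";":
            if current:
                statements.append(parse_statement(current))
            current = []
        else:
            current.append(token)
    return Node("file", children=statements)


tree = parse_file(["SELECT", "1", ";", "select", "a", ";", "DROP", "x"])
```

In this sketch the third statement has no matching grammar, so its raw tokens end up under an unparsable node rather than raising an error during parsing.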

When working on the parser there are a couple of design principles to keep in mind.

  • Grammars are contained in dialects, the root dialect being the ansi dialect. The ansi dialect is used to host logic common to all dialects, and so does not necessarily adhere to the formal ANSI specification. Other SQL dialects inherit from the ansi dialect, replacing or patching any segments they need to. One reason for the Ref grammar is that it allows name resolution of grammar elements at runtime, so a patched grammar with some elements overridden can still rely on lower-level elements which haven’t been redeclared within the dialect.

  • All grammars and segments attempt to match as much as they can and will return partial matches where possible. It is up to the calling grammar or segment to decide whether a partial or complete match is required based on the context it is matching in.
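The runtime name resolution described in the first principle can be sketched as follows. This is a toy model rather than SQLFluff's dialect machinery: ToyDialect and the single-token matchers are assumptions made for the example, and only the names copy_as, replace, Ref and OneOf echo the real API. What it demonstrates is that a grammar declared once in the parent dialect picks up an element overridden in a child dialect, because Ref looks the name up at match time.

```python
from typing import Callable, Dict

# A toy grammar element: given the active dialect and one token, does it match?
Matcher = Callable[["ToyDialect", str], bool]


class ToyDialect:
    def __init__(self, name: str, library: Dict[str, Matcher]) -> None:
        self.name = name
        self.library = dict(library)

    def copy_as(self, name: str) -> "ToyDialect":
        """Create a child dialect which starts with the same elements."""
        return ToyDialect(name, self.library)

    def replace(self, **elements: Matcher) -> None:
        """Override named elements in this dialect only."""
        self.library.update(elements)

    def matches(self, element: str, token: str) -> bool:
        return self.library[element](self, token)


def Ref(name: str) -> Matcher:
    """Resolve the element by name when matching, not when declaring."""
    def match(dialect: ToyDialect, token: str) -> bool:
        return dialect.library[name](dialect, token)
    return match


def OneOf(*options: Matcher) -> Matcher:
    def match(dialect: ToyDialect, token: str) -> bool:
        return any(option(dialect, token) for option in options)
    return match


def single_quoted(dialect: ToyDialect, token: str) -> bool:
    return len(token) >= 2 and token[0] == "'" and token[-1] == "'"


def double_quoted(dialect: ToyDialect, token: str) -> bool:
    return len(token) >= 2 and token[0] == '"' and token[-1] == '"'


ansi = ToyDialect(
    "ansi",
    {
        "NumericLiteral": lambda dialect, token: token.isdigit(),
        "QuotedLiteral": single_quoted,
        # Declared once, in terms of names rather than concrete elements.
        "Literal": OneOf(Ref("NumericLiteral"), Ref("QuotedLiteral")),
    },
)

# A hypothetical child dialect which only changes how strings are quoted.
toy = ansi.copy_as("toy")
toy.replace(QuotedLiteral=double_quoted)
```

Here the child dialect never redeclares Literal or NumericLiteral, yet toy.matches("Literal", '"x"') succeeds and toy.matches("Literal", "42") still works through the inherited element.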

Stage 4, the linter

Given the complete parse tree, rule classes check for linting errors by traversing the tree, looking for segments and patterns of concern. If the rule discovers a violation, it returns a LintResult pointing to the segment which caused the violation.

Some rules are able to fix the problems they find. If this is the case, the rule will return a list of fixes, which describe changes to be made to the tree. This can include edits, inserts, or deletions. Once the fixes have been applied, the updated tree is written to the original file.
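A minimal sketch of this flow is shown below. The Segment, Fix and Result classes are hypothetical stand-ins for SQLFluff's BaseSegment, LintFix and LintResult, and the rule itself is invented for the example. It traverses the tree, returns a result anchored on the offending segment together with a fix, and the fixes are then applied to produce an updated tree.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Segment:
    kind: str
    raw: str = ""
    children: List["Segment"] = field(default_factory=list)

    def walk(self) -> Iterator["Segment"]:
        yield self
        for child in self.children:
            yield from child.walk()

    def render(self) -> str:
        if not self.children:
            return self.raw
        return "".join(child.render() for child in self.children)


@dataclass
class Fix:
    edit_type: str  # "replace" or "delete" in this sketch
    anchor: Segment
    replacement: Optional[Segment] = None


@dataclass
class Result:
    anchor: Segment
    description: str
    fixes: List[Fix] = field(default_factory=list)


def rule_uppercase_keywords(tree: Segment) -> List[Result]:
    """Toy rule: every keyword must be upper case."""
    results = []
    for seg in tree.walk():
        if seg.kind == "keyword" and seg.raw != seg.raw.upper():
            fix = Fix("replace", seg, Segment("keyword", seg.raw.upper()))
            results.append(Result(seg, "Keywords must be upper case.", [fix]))
    return results


def apply_fixes(tree: Segment, fixes: List[Fix]) -> Optional[Segment]:
    """Return a new tree with the fixes applied; anchors are found by identity."""
    for fix in fixes:
        if fix.anchor is tree:
            return fix.replacement if fix.edit_type == "replace" else None
    new_children = []
    for child in tree.children:
        new_child = apply_fixes(child, fixes)
        if new_child is not None:
            new_children.append(new_child)
    return Segment(tree.kind, tree.raw, new_children)


tree = Segment(
    "file",
    children=[
        Segment("keyword", "select"),
        Segment("whitespace", " "),
        Segment("numeric_literal", "1"),
    ],
)
results = rule_uppercase_keywords(tree)
fixed = apply_fixes(tree, [fix for result in results for fix in result.fixes])
```

Rendering the fixed tree gives the text which would be written back to the file, while the original tree is left untouched.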

Reflow Internals

Many rules supported by SQLFluff involve the spacing and layout of different elements, either to enforce a particular layout or just to add or remove code elements in a way sensitive to the existing layout configuration. The way this is achieved is through some centralised utilities in the sqlfluff.utils.reflow module.

This module aims to achieve several things:

  • Less code duplication by implementing reflow logic in only one place.

  • Provide a streamlined interface for rules to easily utilise reflow logic.

    • Given this requirement, it’s important that reflow utilities work within the existing framework for applying fixes to potentially templated code. We achieve this by returning LintFix objects which can then be returned by each rule wanting to use this logic.

  • Provide a consistent way of configuring layout requirements. For more details on configuration see Configuring Layout.

To support this, the module provides a ReflowSequence class which allows access to all of the relevant operations which can be used to reformat sections of code, or even a whole file. Unless there is a very good reason, all rules should use this same approach to ensure consistent treatment of layout.

class ReflowSequence(elements: List[ReflowBlock | ReflowPoint], root_segment: BaseSegment, reflow_config: ReflowConfig, depth_map: DepthMap, lint_results: List[LintResult] | None = None)

Class for keeping track of elements in a reflow operation.

This acts as the primary route into using the reflow routines. It acts in a way that plays nicely within a rule context in that it accepts segments and configuration, while allowing access to modified segments and a series of LintFix objects, which can be returned by the calling rule.

Sequences are made up of alternating ReflowBlock and ReflowPoint objects (even if some points have no segments). This is validated on construction.
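The alternating structure can be pictured with the following sketch. Block, Point, build_elements and validate are hypothetical simplifications which operate on plain strings rather than segments, but they show why a point exists between every pair of blocks even when there is no whitespace there, and what the validation on construction amounts to.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Block:
    raw: str  # a single code (or templated) element


@dataclass(frozen=True)
class Point:
    segments: Tuple[str, ...] = ()  # whitespace and newlines; may be empty


Element = Union[Block, Point]


def build_elements(tokens: List[str]) -> List[Element]:
    """Group whitespace tokens into points so that blocks and points alternate."""
    elements: List[Element] = []
    gap: List[str] = []
    for token in tokens:
        if token.strip() == "":
            gap.append(token)
            continue
        if elements or gap:
            # Always emit a point before a block, even if it holds nothing.
            elements.append(Point(tuple(gap)))
        elements.append(Block(token))
        gap = []
    if gap:
        elements.append(Point(tuple(gap)))
    return elements


def validate(elements: List[Element]) -> None:
    """Raise if two neighbouring elements are of the same kind."""
    for prev, nxt in zip(elements, elements[1:]):
        if type(prev) is type(nxt):
            raise ValueError(f"Elements do not alternate: {prev!r}, {nxt!r}")


elements = build_elements(["SELECT", " ", "a", ",", "b", "\n"])
validate(elements)
```

In this example the point between a and the comma is empty, which is exactly the kind of place where a respacing operation may later need to insert or keep zero whitespace.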

Most operations also return ReflowSequence objects such that operations can be chained, and then the resultant fixes accessed at the last stage, for example:

fixes = (
    ReflowSequence.from_around_target(
        context.segment,
        root_segment=context.parent_stack[0],
        config=context.config,
    )
    .rebreak()
    .get_fixes()
)

break_long_lines()

Rebreak any remaining long lines in a sequence.

This assumes that reindent() has already been applied.

classmethod from_around_target(target_segment: BaseSegment, root_segment: BaseSegment, config: FluffConfig, sides: str = 'both') → ReflowSequence

Generate a sequence around a target.

Parameters:
  • target_segment (RawSegment) – The segment to center around when considering the sequence to construct.

  • root_segment (BaseSegment) – The relevant root segment (usually the base FileSegment).

  • config (FluffConfig) – A config object from which to load the spacing behaviours of different segments.

  • sides (str) – Limit the reflow sequence to just one side of the target. Default is two sided (“both”), but set to “before” or “after” to limit to either side.

NOTE: We don’t just expand to the first block around the target but to the first code element, which means we may swallow several comment blocks in the process.

To evaluate reflow around a specific target, we need to generate a sequence which goes from the preceding raw to the following raw, i.e. at least: block - point - block - point - block (where the central block is the target).
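The window selection described in these notes might look roughly like the sketch below. It is an illustration under simplifying assumptions, not the real implementation: elements are (kind, raw) tuples, window_around is a hypothetical helper, and comments are modelled as their own kind so that the walk outwards can skip over them until it reaches code.

```python
from typing import List, Tuple

Element = Tuple[str, str]  # (kind, raw); kind is "code", "comment" or "point"


def window_around(
    elements: List[Element], target_idx: int, sides: str = "both"
) -> List[Element]:
    """Expand from the target to the nearest code element on each chosen side."""
    if sides not in ("both", "before", "after"):
        raise ValueError(f"Unexpected value for sides: {sides!r}")
    start = end = target_idx
    if sides in ("both", "before"):
        start = max(target_idx - 1, 0)
        while start > 0 and elements[start][0] != "code":
            start -= 1
    if sides in ("both", "after"):
        end = min(target_idx + 1, len(elements) - 1)
        while end < len(elements) - 1 and elements[end][0] != "code":
            end += 1
    return elements[start : end + 1]


elements = [
    ("code", "SELECT"), ("point", " "),
    ("code", "a"), ("point", " "),
    ("comment", "-- c"), ("point", "\n"),
    ("code", "+"), ("point", " "),
    ("code", "b"), ("point", " "),
    ("code", "FROM"),
]
target = 6  # the "+" operator
both = window_around(elements, target)
before = window_around(elements, target, sides="before")
after = window_around(elements, target, sides="after")
```

With the default of both sides, the window runs from a to b and swallows the comment in between; limiting sides trims the window to one side of the operator.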

classmethod from_raw_segments(segments: Sequence[RawSegment], root_segment: BaseSegment, config: FluffConfig, depth_map: DepthMap | None = None) → ReflowSequence

Construct a ReflowSequence from a sequence of raw segments.

This is intended as a base constructor, which others can use. In particular, if no depth_map argument is provided, this method will generate one in a potentially inefficient way. If the calling method has access to a better way of inferring a depth map (for example because it has access to a common root segment for all the content), it should do that instead and pass it in.

classmethod from_root(root_segment: BaseSegment, config: FluffConfig) → ReflowSequence

Generate a sequence from a root segment.

Parameters:
  • root_segment (BaseSegment) – The relevant root segment (usually the base FileSegment).

  • config (FluffConfig) – A config object from which to load the spacing behaviours of different segments.

get_fixes() → List[LintFix]

Get the current fix buffer.

We’re hydrating them here directly from the LintResult objects, so for more accurate results, consider using .get_results(). This method is particularly useful when consolidating multiple results into one.

get_raw() → str

Get the current raw representation.

get_results() → List[LintResult]

Return the current result buffer.

insert(insertion: RawSegment, target: RawSegment, pos: str = 'before') → ReflowSequence

Returns a new ReflowSequence with the new element inserted.

Insertion is always relative to an existing element, either before or after it as specified by pos. This generates appropriate creation LintFix objects to direct the linter to insert those elements.

rebreak() → ReflowSequence

Returns a new ReflowSequence with corrected line breaks.

This intentionally does not handle indentation, as the existing indents are assumed to be correct.

Note

Currently this only moves existing segments around line breaks (e.g. for operators and commas), but eventually this method will also handle line length considerations.

reindent()

Reindent lines within a sequence.

replace(target: BaseSegment, edit: Sequence[BaseSegment]) → ReflowSequence

Returns a new ReflowSequence with edit elements replaced.

This generates appropriate replacement LintFix objects to direct the linter to modify those elements.

respace(strip_newlines: bool = False, filter: str = 'all') → ReflowSequence

Returns a new ReflowSequence with points respaced.

Parameters:
  • strip_newlines (bool) – Optionally strip newlines before respacing. This is primarily used on focused sequences to coerce objects onto a single line. This does not apply any prioritisation to which line breaks to remove and so is not a substitute for the full reindent or reflow methods.

  • filter (str) – Optionally filter which reflow points to respace. Default configuration is all. Other options are line_break which only respaces points containing a newline or followed by an end_of_file marker, or inline which is the inverse of line_break. This is most useful for filtering between trailing whitespace and fixes between content on a line.

NOTE: This method relies on the embodied results being correct so that we can build on them.
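To make the filter options concrete, here is a small sketch of how points might be selected for respacing. It only models the selection logic described for the filter parameter: the (kind, raw) tuples and the points_to_respace helper are assumptions for illustration, and the parameter is named filter simply to mirror the documented signature.

```python
from typing import List, Tuple

Element = Tuple[str, str]  # (kind, raw); kind is "block", "point" or "end_of_file"


def points_to_respace(elements: List[Element], filter: str = "all") -> List[int]:
    """Return the indices of the points selected by the given filter."""
    if filter not in ("all", "line_break", "inline"):
        raise ValueError(f"Unexpected value for filter: {filter!r}")
    selected = []
    for idx, (kind, raw) in enumerate(elements):
        if kind != "point":
            continue
        followed_by_eof = (
            idx + 1 < len(elements) and elements[idx + 1][0] == "end_of_file"
        )
        is_line_break = "\n" in raw or followed_by_eof
        # "inline" is the inverse of "line_break".
        if filter == "all" or (filter == "line_break") == is_line_break:
            selected.append(idx)
    return selected


elements = [
    ("block", "SELECT"), ("point", " "),
    ("block", "a"), ("point", "  \n"),
    ("block", "FROM"), ("point", " "),
    ("block", "t"), ("point", " "),
    ("end_of_file", ""),
]
```

In this example the line_break filter picks out the two places where trailing whitespace can occur (before the newline and before the end of the file), while inline picks out the spacing between content on a line.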

without(target: RawSegment) → ReflowSequence

Returns a new ReflowSequence without the specified segment.

This generates appropriate deletion LintFix objects to direct the linter to remove those elements.

class ReflowPoint(segments: Tuple[RawSegment, ...])

Class for keeping track of editable elements in reflow.

This class, and its sibling ReflowBlock, should not normally be manipulated directly by rules, but instead should be manipulated using ReflowSequence.

It holds segments which can be changed during a reflow operation such as whitespace and newlines. It may also contain Indent and Dedent elements.

It holds no configuration and is influenced by the blocks on either side, so that any operations on it usually have that configuration passed in as required.

property class_types: Set[str]

Get the set of contained class types.

Parallel to BaseSegment.class_types

get_indent() → str | None

Get the current indent (if there).

get_indent_impulse(allow_implicit_indents: bool = False, following_class_types: Set[str] = {}) → IndentStats

Get the change in intended indent balance from this point.

NOTE: The reason we check following_class_types is because bracketed expressions behave a little differently and are an exception to the normal implicit indent rules. For implicit indents which precede bracketed expressions, the implicit indent is treated as a normal indent.

Returns:

The first value is the raw impulse. The second is the deepest trough in the indent through the values to allow wiping of buffers.

Return type:

tuple of int
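The two returned values can be illustrated with a short calculation. This sketch assumes that each Indent contributes +1 and each Dedent contributes -1 to a running balance, and that the trough is the lowest value that balance reaches (never above zero); ToyIndentStats and indent_impulse are hypothetical names, and implicit indents are ignored entirely.

```python
from typing import Iterable, NamedTuple


class ToyIndentStats(NamedTuple):
    impulse: int  # net change in indent balance across the point
    trough: int   # lowest running balance reached along the way


def indent_impulse(indent_vals: Iterable[int]) -> ToyIndentStats:
    """Sum the indent values while tracking the deepest trough."""
    running = 0
    trough = 0
    for val in indent_vals:
        running += val
        trough = min(trough, running)
    return ToyIndentStats(running, trough)
```

For example, two dedents followed by an indent give an impulse of -1 but a trough of -2, which tells the caller that indents opened up to two levels earlier were closed in passing, even though the net change is only one level.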

indent_to(desired_indent: str, after: BaseSegment | None = None, before: BaseSegment | None = None, description: str | None = None, source: str | None = None) → Tuple[List[LintResult], ReflowPoint]

Coerce a point to have a particular indent.

If the point currently contains no newlines, one will be introduced and any trailing whitespace will be effectively removed.

More specifically, the newline is inserted before the existing whitespace, with the new indent being a replacement for that same whitespace.

For placeholder newlines or indents we generate appropriate source fixes.
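The effect on the contents of a point can be sketched on plain strings. This is a simplified model, not the real method: toy_indent_to is a hypothetical helper, a point is a tuple of raw strings in which a newline is exactly "\n", and no fixes or placeholder handling are generated.

```python
from typing import Tuple


def toy_indent_to(segments: Tuple[str, ...], desired_indent: str) -> Tuple[str, ...]:
    """Return the contents of a point after coercing it to the given indent."""
    newline_positions = [i for i, seg in enumerate(segments) if seg == "\n"]
    if newline_positions:
        # Keep everything up to the last newline; what follows is the old indent.
        head = segments[: newline_positions[-1] + 1]
    else:
        # No newline yet: introduce one, and drop the existing whitespace so
        # that it does not become trailing whitespace on the previous line.
        head = ("\n",)
    # An empty indent is represented by having no whitespace element at all.
    return head + ((desired_indent,) if desired_indent else ())
```

So a point holding a single space becomes a newline followed by the new indent, and a point which already ends in a newline and an old indent simply has that indent swapped out.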

num_newlines() → int

Return the number of newlines in this element.

These newlines are either newline segments or contained within consumed sections of whitespace. This counts both.

property pos_marker: PositionMarker | None

Get the first position marker of the element.

property raw: str

Get the current raw representation.

respace_point(prev_block: ReflowBlock | None, next_block: ReflowBlock | None, root_segment: BaseSegment, lint_results: List[LintResult], strip_newlines: bool = False, anchor_on: str = 'before') → Tuple[List[LintResult], ReflowPoint]

Respace a point based on given constraints.

NB: This effectively includes trailing whitespace fixes.

Deletion and edit fixes are generated immediately, but creations are deferred to the end and done in bulk so as not to generate conflicts.

Note that the strip_newlines functionality exists here as a slight exception to pure respacing, but as a very simple case of positioning line breaks. The default operation of respace does not enable it; it exists as a convenience for rules which wish to use it.

class ReflowBlock(segments: Tuple[RawSegment, ...], spacing_before: str, spacing_after: str, line_position: str | None, depth_info: DepthInfo, stack_spacing_configs: Dict[int, str], line_position_configs: Dict[int, str])

Class for keeping track of elements to reflow.

This class, and its sibling ReflowPoint, should not normally be manipulated directly by rules, but instead should be manipulated using ReflowSequence.

It holds segments to reflow and also exposes configuration regarding how they are expected to reflow around others. Typically it holds only a single element, which is usually code or a templated element. Because reflow operations control spacing, it would be very unusual for this object to be modified; as such it exposes relatively few methods.

The attributes exposed are designed to be “post configuration” i.e. they should reflect configuration appropriately.

property class_types: Set[str]

Get the set of contained class types.

Parallel to BaseSegment.class_types

depth_info: DepthInfo

Metadata on the depth of this segment within the parse tree which is used in inferring how and where line breaks should exist.

classmethod from_config(segments, config: ReflowConfig, depth_info: DepthInfo) → ReflowBlock

Construct a ReflowBlock while extracting relevant configuration.

This is the primary route to construct a ReflowBlock, as it allows all of the inference of the spacing and position configuration from the segments it contains and the appropriate config objects.

line_position: str | None

Desired line position for this block. See Configuring layout and spacing.

line_position_configs: Dict[int, str]

Desired line position configurations for parent segments of the segment in this block. See Configuring layout and spacing.

num_newlines() → int

Return the number of newlines in this element.

These newlines are either newline segments or contained within consumed sections of whitespace. This counts both.

property pos_marker: PositionMarker | None

Get the first position marker of the element.

property raw: str

Get the current raw representation.

spacing_after: str

Desired spacing after this block. See Configuring layout and spacing.

spacing_before: str

Desired spacing before this block. See Configuring layout and spacing.

stack_spacing_configs: Dict[int, str]

Desired spacing configurations for parent segments of the segment in this block. See Configuring layout and spacing.