Internals¶
It is recommended that the following is read in conjunction with exploring the codebase. dialect_ansi.py in particular is helpful to understand the recursive structure of segments and grammars. Some more detail is also given on our Wiki including a Contributing Dialect Changes guide.
Architecture¶
At a high level, the behaviour of SQLFluff is divided into a few key stages. Whether calling sqlfluff lint, sqlfluff fix or sqlfluff parse, the internal flow is largely the same.
Stage 1, the templater¶
This stage only applies to templated SQL. Vanilla SQL is sent straight to stage 2, the lexer.
In order to lint templated SQL, SQLFluff must first convert the ‘raw’ or pre-templated code into valid SQL, which can then be parsed. The templater returns both the raw and post-templated SQL so that any rule violations which occur in templated sections can be ignored and the rest mapped to their original line location for user feedback.
SQLFluff supports multiple templating engines:
- Jinja
- SQL placeholders (e.g. SQLAlchemy parameters)
- Python format strings
- dbt (via plugin)
Under the hood dbt also uses Jinja, but within SQLFluff the dbt templater uses a separate mechanism which interfaces directly with the dbt Python package.
For more details on how to configure the templater see Templating Configuration.
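To illustrate why the templater keeps both the raw and the rendered SQL, the following is a minimal sketch and not SQLFluff's implementation. It renders `{name}` style placeholders and records which spans of the output came from which spans of the source, so that a position in the rendered SQL can be mapped back to the original file. The names `render_with_slices`, `to_source_pos` and the slice tuple layout are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# (kind, source_start, source_end, templated_start, templated_end)
Slice = Tuple[str, int, int, int, int]


def render_with_slices(raw: str, context: Dict[str, str]) -> Tuple[str, List[Slice]]:
    """Render ``{name}`` placeholders and record a source-to-output mapping."""
    out: List[str] = []
    slices: List[Slice] = []
    src_pos = 0
    out_pos = 0
    for m in re.finditer(r"\{(\w+)\}", raw):
        # Literal text before the placeholder maps one-to-one.
        if m.start() > src_pos:
            literal = raw[src_pos:m.start()]
            slices.append(("literal", src_pos, m.start(), out_pos, out_pos + len(literal)))
            out.append(literal)
            out_pos += len(literal)
        # The placeholder itself expands to a value of a different length.
        value = context[m.group(1)]
        slices.append(("templated", m.start(), m.end(), out_pos, out_pos + len(value)))
        out.append(value)
        out_pos += len(value)
        src_pos = m.end()
    # Any literal text after the last placeholder.
    if src_pos < len(raw):
        literal = raw[src_pos:]
        slices.append(("literal", src_pos, len(raw), out_pos, out_pos + len(literal)))
        out.append(literal)
    return "".join(out), slices


def to_source_pos(templated_pos: int, slices: List[Slice]) -> Tuple[int, bool]:
    """Map a rendered offset to a source offset, flagging templated regions."""
    for kind, s0, _s1, t0, t1 in slices:
        if t0 <= templated_pos < t1:
            if kind == "literal":
                return s0 + (templated_pos - t0), False
            # Inside rendered template output: point at the placeholder start.
            return s0, True
    raise ValueError("position out of range")


raw_sql = "SELECT a FROM {table} WHERE b = 1"
templated_sql, slice_map = render_with_slices(raw_sql, {"table": "my_schema.tbl"})
```

A violation found at the rendered position of `WHERE` maps back to the position of `WHERE` in the raw file, while a position inside the rendered table name is flagged as templated and could be ignored.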
Stage 2, the lexer¶
The lexer takes SQL and separates it into segments of whitespace and code. Where we can impart some high level meaning to segments, we do, but the result of this operation is still a flat sequence of typed segments (all subclasses of RawSegment).
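As a rough illustration of what "a flat sequence of typed segments" means, here is a toy regex-based lexer. It is a sketch only: the `Token` class, the `PATTERNS` table and the token type names are hypothetical and far simpler than the lexer matchers defined in a real dialect.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for a raw segment: a type plus its literal text."""

    type: str
    raw: str


# Ordered (type, pattern) pairs; earlier alternatives take priority.
PATTERNS = [
    ("whitespace", r"[ \t]+"),
    ("newline", r"\r?\n"),
    ("comment", r"--[^\n]*"),
    ("numeric_literal", r"\d+(?:\.\d+)?"),
    ("word", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("symbol", r"[(),;.*=<>+\-/]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in PATTERNS))


def lex(sql: str) -> List[Token]:
    """Split SQL into a flat list of typed tokens, losing no characters."""
    tokens: List[Token] = []
    pos = 0
    while pos < len(sql):
        m = MASTER.match(sql, pos)
        if m is None:
            # Keep unknown characters rather than dropping them.
            tokens.append(Token("unlexable", sql[pos]))
            pos += 1
            continue
        tokens.append(Token(m.lastgroup, m.group()))
        pos = m.end()
    return tokens


tokens = lex("SELECT a, 1\nFROM tbl")
```

Note that the output has no nesting at all, and that joining the `raw` values reproduces the input exactly. Structure is only added in the next stage.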
Stage 3, the parser¶
The parser is arguably the most complicated element of SQLFluff, and is relied on by all the other elements of the tool to do most of the heavy lifting.
The lexed segments are parsed using the specified dialect’s grammars. In SQLFluff, grammars describe the shape of SQL statements (or their components). The parser attempts to apply each potential grammar to the lexed segments until all the segments have been matched.
In SQLFluff, segments form a tree-like structure. The top-level segment is a FileSegment, which contains zero or more StatementSegments, and so on. Before the segments have been parsed and named according to their type, they are ‘raw’, meaning they have no classification other than their literal value.
A segment’s .match() method uses the match_grammar, on which .match() is called. SQLFluff parses in a single pass through the file, so segments will recursively match the file based on their respective grammars. In the example of a FileSegment, it first divides up the query into statements, and then the .match() method of those segments works out the structure within them.
- Segments must implement a match_grammar. When .match() is called on a segment, this is the grammar which is used to decide whether there is a match.
- Grammars combine segments or other grammars together in a pre-defined way. For example the OneOf grammar will match if any one of its child elements match.
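The relationship between segments and grammars described above can be sketched with a toy recursive matcher. This is a simplified model under stated assumptions, not the real parser: tokens are plain strings, whitespace is already removed, and the classes `Keyword`, `Identifier`, `Sequence`, `OneOf` and `Segment` only borrow the names of SQLFluff concepts.

```python
from typing import List


class Keyword:
    """Grammar matching one specific keyword (case-insensitive)."""

    def __init__(self, word: str) -> None:
        self.word = word.upper()

    def match(self, tokens: List[str], idx: int):
        if idx < len(tokens) and tokens[idx].upper() == self.word:
            return ("keyword", tokens[idx]), idx + 1
        return None


class Identifier:
    """Grammar matching any identifier-like token."""

    def match(self, tokens: List[str], idx: int):
        if idx < len(tokens) and tokens[idx].isidentifier():
            return ("identifier", tokens[idx]), idx + 1
        return None


class Sequence:
    """Grammar matching each element in order."""

    def __init__(self, *elements) -> None:
        self.elements = elements

    def match(self, tokens: List[str], idx: int):
        children = []
        for element in self.elements:
            result = element.match(tokens, idx)
            if result is None:
                return None
            node, idx = result
            children.append(node)
        return children, idx


class OneOf:
    """Grammar matching the first option which matches."""

    def __init__(self, *options) -> None:
        self.options = options

    def match(self, tokens: List[str], idx: int):
        for option in self.options:
            result = option.match(tokens, idx)
            if result is not None:
                return result
        return None


class Segment:
    """A segment delegates matching to its match_grammar and names the result."""

    type = "base"
    match_grammar = None

    @classmethod
    def match(cls, tokens: List[str], idx: int):
        result = cls.match_grammar.match(tokens, idx)
        if result is None:
            return None
        node, idx = result
        children = node if isinstance(node, list) else [node]
        return (cls.type, children), idx


class SelectStatementSegment(Segment):
    type = "select_statement"
    match_grammar = Sequence(Keyword("SELECT"), Identifier(), Keyword("FROM"), Identifier())


class DropStatementSegment(Segment):
    type = "drop_statement"
    match_grammar = Sequence(Keyword("DROP"), Keyword("TABLE"), Identifier())


class StatementSegment(Segment):
    type = "statement"
    match_grammar = OneOf(SelectStatementSegment, DropStatementSegment)


tree, next_idx = StatementSegment.match("SELECT a FROM tbl".split(), 0)
```

Each matcher returns the index it stopped at, so the caller can decide whether a partial match is acceptable in its context.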
During the recursion, the parser eventually reaches segments which have no children (raw segments containing a single token), and so the recursion naturally finishes.
If no match is found for a segment, the contents will be wrapped in an UnparsableSegment which is picked up as a parsing error later. This is usually facilitated by the ParseMode on some grammars which can be set to GREEDY, allowing the grammar to capture additional segments as unparsable. As an example, bracketed sections are often configured to capture anything unexpected as unparsable rather than simply failing to match if there is more than expected (which would be the default, STRICT, behaviour).
The result of the .match() method is a MatchResult which contains the instructions on how to turn the flat sequence of raw segments into a nested tree of segments. Calling .apply() on this result at the end of the matching process is what finally creates the nested structure.
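The two ideas above, deferring tree construction to an apply step and capturing unexpected content as unparsable in a greedy mode, can be sketched as follows. This is a toy model: `ToyMatchResult` and `match_bracketed_number` are hypothetical, and the real MatchResult carries considerably more information.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ToyMatchResult:
    """Instructions for nesting: a slice of the raws, a type, and child matches."""

    start: int
    stop: int
    segment_type: str
    children: List["ToyMatchResult"] = field(default_factory=list)

    def apply(self, raws: Tuple[str, ...]):
        """Build a nested (type, children) tree from the flat raws."""
        nodes: list = []
        pos = self.start
        for child in self.children:
            # Raws not claimed by a child stay as plain leaves.
            nodes.extend(raws[pos:child.start])
            nodes.append(child.apply(raws))
            pos = child.stop
        nodes.extend(raws[pos:self.stop])
        return (self.segment_type, nodes)


def match_bracketed_number(
    raws: Tuple[str, ...], mode: str = "STRICT"
) -> Optional[ToyMatchResult]:
    """Match '(' <digits> ')', optionally claiming extra content as unparsable."""
    if len(raws) < 3 or raws[0] != "(" or not raws[1].isdigit() or ")" not in raws:
        return None
    close = raws.index(")")
    if close == 2:
        return ToyMatchResult(0, 3, "bracketed")
    if mode == "GREEDY":
        # Claim the unexpected content instead of failing the whole match.
        unparsable = ToyMatchResult(2, close, "unparsable")
        return ToyMatchResult(0, close + 1, "bracketed", [unparsable])
    return None


raws = ("(", "1", "oops", "!", ")")
strict_result = match_bracketed_number(raws, mode="STRICT")
greedy_tree = match_bracketed_number(raws, mode="GREEDY").apply(raws)
```

In the strict case the match simply fails, whereas the greedy case still produces a bracketed node with the unexpected content isolated, which lets a later stage report a precise parsing error.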
When working on the parser there are a couple of design principles to keep in mind.
Grammars are contained in dialects, the root dialect being the ansi dialect. The ansi dialect is used to host logic common to all dialects, and so does not necessarily adhere to the formal ANSI specification. Other SQL dialects inherit from the ansi dialect, replacing or patching any segments they need to. One reason for the Ref grammar is that it allows name resolution of grammar elements at runtime, so a patched grammar with some elements overridden can still rely on lower-level elements which haven’t been redeclared within the dialect.
All grammars and segments attempt to match as much as they can and will return partial matches where possible. It is up to the calling grammar or segment to decide whether a partial or complete match is required based on the context it is matching in.
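The benefit of resolving references by name at match time can be shown with a small sketch. It assumes a toy dialect which is just a named dictionary of grammars; `ToyDialect`, `Ref`, `AnyWord`, `OneOfWords` and `Pair` are hypothetical classes written for this example and only loosely mirror the real ones.

```python
from typing import Dict, List, Optional


class ToyDialect:
    """A named library of grammar elements which child dialects can patch."""

    def __init__(self, name: str, library: Optional[Dict[str, object]] = None) -> None:
        self.name = name
        self._library: Dict[str, object] = dict(library or {})

    def copy_as(self, name: str) -> "ToyDialect":
        # The child starts with a copy of everything the parent defines.
        return ToyDialect(name, self._library)

    def add(self, **grammars: object) -> None:
        self._library.update(grammars)

    def replace(self, **grammars: object) -> None:
        missing = [key for key in grammars if key not in self._library]
        if missing:
            raise KeyError(f"Cannot replace undefined elements: {missing}")
        self._library.update(grammars)

    def ref(self, name: str):
        return self._library[name]


class Ref:
    """Look up another grammar by name at match time, not at definition time."""

    def __init__(self, name: str) -> None:
        self.name = name

    def match(self, tokens: List[str], dialect: ToyDialect) -> bool:
        return dialect.ref(self.name).match(tokens, dialect)


class AnyWord:
    def match(self, tokens: List[str], dialect: ToyDialect) -> bool:
        return len(tokens) == 1 and tokens[0].isidentifier()


class OneOfWords:
    def __init__(self, *words: str) -> None:
        self.words = {word.upper() for word in words}

    def match(self, tokens: List[str], dialect: ToyDialect) -> bool:
        return len(tokens) == 1 and tokens[0].upper() in self.words


class Pair:
    """Match exactly two tokens, one per element."""

    def __init__(self, first, second) -> None:
        self.first = first
        self.second = second

    def match(self, tokens: List[str], dialect: ToyDialect) -> bool:
        return bool(
            len(tokens) == 2
            and self.first.match(tokens[:1], dialect)
            and self.second.match(tokens[1:], dialect)
        )


toy_ansi = ToyDialect("toy_ansi")
toy_ansi.add(
    Identifier=AnyWord(),
    DataType=OneOfWords("INT", "TEXT"),
    ColumnDef=Pair(Ref("Identifier"), Ref("DataType")),
)

# The child dialect only patches DataType; ColumnDef is not redeclared.
toy_child = toy_ansi.copy_as("toy_child")
toy_child.replace(DataType=OneOfWords("INT", "TEXT", "JSONB"))

column = ["payload", "jsonb"]
matches_in_ansi = toy_ansi.ref("ColumnDef").match(column, toy_ansi)
matches_in_child = toy_child.ref("ColumnDef").match(column, toy_child)
```

Because `ColumnDef` holds a name rather than a direct object reference, the inherited definition picks up the child's patched `DataType` without being redefined.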
Stage 4, the linter¶
Given the complete parse tree, rule classes check for linting errors by traversing the tree, looking for segments and patterns of concern. If the rule discovers a violation, it returns a LintResult pointing to the segment which caused the violation.
Some rules are able to fix the problems they find. If this is the case, the rule will return a list of fixes, which describe changes to be made to the tree. This can include edits, inserts, or deletions. Once the fixes have been applied, the updated tree is written to the original file.
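A toy version of this flow, in which a rule walks a tree, reports violations anchored on nodes, and attaches fixes which are applied afterwards, might look like the sketch below. The `Node`, `ToyFix` and `ToyLintResult` classes and the capitalisation rule are hypothetical simplifications, and only the edit case is modelled.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Hypothetical parse tree node: leaves carry raw text, parents carry children."""

    type: str
    raw: str = ""
    children: List["Node"] = field(default_factory=list)

    def walk(self) -> Iterator["Node"]:
        yield self
        for child in self.children:
            yield from child.walk()

    def render(self) -> str:
        if self.children:
            return "".join(child.render() for child in self.children)
        return self.raw


@dataclass
class ToyFix:
    edit_type: str  # only "replace" is modelled in this sketch
    anchor: Node
    new_raw: str


@dataclass
class ToyLintResult:
    anchor: Node
    description: str
    fixes: List[ToyFix]


def rule_keywords_upper(tree: Node) -> List[ToyLintResult]:
    """Report every keyword which is not upper case, with a fix attached."""
    results = []
    for node in tree.walk():
        if node.type == "keyword" and node.raw != node.raw.upper():
            fix = ToyFix("replace", node, node.raw.upper())
            results.append(ToyLintResult(node, "Keywords must be upper case.", [fix]))
    return results


def apply_fixes(results: List[ToyLintResult]) -> None:
    """Apply the collected fixes to the tree in place."""
    for result in results:
        for fix in result.fixes:
            if fix.edit_type == "replace":
                fix.anchor.raw = fix.new_raw


tree = Node("file", children=[
    Node("statement", children=[
        Node("keyword", "select"), Node("whitespace", " "),
        Node("identifier", "a"), Node("whitespace", " "),
        Node("keyword", "FROM"), Node("whitespace", " "),
        Node("identifier", "tbl"),
    ]),
])

results = rule_keywords_upper(tree)
apply_fixes(results)
fixed_sql = tree.render()
```

Separating detection from application means the same rule serves both linting, where only the results are reported, and fixing, where the fixes are applied and the tree is rendered back to text.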
Reflow Internals¶
Many rules supported by SQLFluff involve the spacing and layout of different elements, either to enforce a particular layout or just to add or remove code elements in a way sensitive to the existing layout configuration. The way this is achieved is through some centralised utilities in the sqlfluff.utils.reflow module.
This module aims to achieve several things:
Less code duplication by implementing reflow logic in only one place.
Provide a streamlined interface for rules to easily utilise reflow logic.
Given this requirement, it’s important that reflow utilities work within the existing framework for applying fixes to potentially templated code. We achieve this by returning LintFix objects which can then be returned by each rule wanting to use this logic.
Provide a consistent way of configuring layout requirements. For more details on configuration see Configuring Layout.
To support this, the module provides a ReflowSequence class which allows access to all of the relevant operations which can be used to reformat sections of code, or even a whole file. Unless there is a very good reason, all rules should use this same approach to ensure consistent treatment of layout.
- class ReflowSequence(elements: List[ReflowBlock | ReflowPoint], root_segment: BaseSegment, reflow_config: ReflowConfig, depth_map: DepthMap, lint_results: List[LintResult] | None = None)¶
Class for keeping track of elements in a reflow operation.
This acts as the primary route into using the reflow routines. It acts in a way that plays nicely within a rule context in that it accepts segments and configuration, while allowing access to modified segments and a series of LintFix objects, which can be returned by the calling rule.
Sequences are made up of alternating ReflowBlock and ReflowPoint objects (even if some points have no segments). This is validated on construction.
Most operations also return ReflowSequence objects such that operations can be chained, and then the resultant fixes accessed at the last stage, for example:

    fixes = (
        ReflowSequence.from_around_target(
            context.segment,
            root_segment=context.parent_stack[0],
            config=context.config,
        )
        .rebreak()
        .get_fixes()
    )
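The alternating structure of blocks and points described above can be sketched with plain strings. This is a toy model, not the real constructor: `ToyBlock`, `ToyPoint`, `build_elements` and `validate_alternating` are hypothetical, and whitespace detection stands in for the real distinction between code and layout segments.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple, Union


@dataclass
class ToyBlock:
    """Holds content which reflow should not change."""

    raws: Tuple[str, ...]


@dataclass
class ToyPoint:
    """Holds the editable gap between blocks; may be empty."""

    raws: Tuple[str, ...]


Element = Union[ToyBlock, ToyPoint]


def build_elements(raws: Sequence[str]) -> List[Element]:
    """Group raws so that blocks and points strictly alternate."""
    elements: List[Element] = []
    pending: List[str] = []  # whitespace seen since the last block
    for raw in raws:
        if raw.isspace():
            pending.append(raw)
            continue
        # Always place a point between two blocks, even if it is empty.
        if elements or pending:
            elements.append(ToyPoint(tuple(pending)))
            pending = []
        elements.append(ToyBlock((raw,)))
    if pending:
        elements.append(ToyPoint(tuple(pending)))
    return elements


def validate_alternating(elements: Sequence[Element]) -> None:
    """Raise if two neighbouring elements have the same kind."""
    for left, right in zip(elements, elements[1:]):
        if type(left) is type(right):
            raise ValueError("Blocks and points must alternate.")


elements = build_elements(["a", "+", " ", "b"])
validate_alternating(elements)
```

Keeping an empty point between `a` and `+` gives every gap a place to hold whitespace, so a later respacing step can insert a space there without changing the shape of the sequence.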
- break_long_lines()¶
Rebreak any remaining long lines in a sequence.
This assumes that reindent() has already been applied.
- classmethod from_around_target(target_segment: BaseSegment, root_segment: BaseSegment, config: FluffConfig, sides: str = 'both') ReflowSequence ¶
Generate a sequence around a target.
- Parameters:
target_segment (RawSegment) – The segment to center around when considering the sequence to construct.
root_segment (BaseSegment) – The relevant root segment (usually the base FileSegment).
config (FluffConfig) – A config object from which to load the spacing behaviours of different segments.
sides (str) – Limit the reflow sequence to just one side of the target. Default is two sided (“both”), but set to “before” or “after” to limit to either side.
NOTE: We don’t just expand to the first block around the target but to the first code element, which means we may swallow several comment blocks in the process.
To evaluate reflow around a specific target, we need to generate a sequence which goes from the preceding raw to the following raw, i.e. at least: block - point - block - point - block (where the central block is the target).
- classmethod from_raw_segments(segments: Sequence[RawSegment], root_segment: BaseSegment, config: FluffConfig, depth_map: DepthMap | None = None) ReflowSequence ¶
Construct a ReflowSequence from a sequence of raw segments.
This is intended as a base constructor, which others can use. In particular, if no depth_map argument is provided, this method will generate one in a potentially inefficient way. If the calling method has access to a better way of inferring a depth map (for example because it has access to a common root segment for all the content), it should do that instead and pass it in.
- classmethod from_root(root_segment: BaseSegment, config: FluffConfig) ReflowSequence ¶
Generate a sequence from a root segment.
- Parameters:
root_segment (BaseSegment) – The relevant root segment (usually the base FileSegment).
config (FluffConfig) – A config object from which to load the spacing behaviours of different segments.
- get_fixes() List[LintFix] ¶
Get the current fix buffer.
We’re hydrating them here directly from the LintResult objects, so for more accurate results, consider using .get_results(). This method is particularly useful when consolidating multiple results into one.
- get_raw() str ¶
Get the current raw representation.
- get_results() List[LintResult] ¶
Return the current result buffer.
- insert(insertion: RawSegment, target: RawSegment, pos: str = 'before') ReflowSequence ¶
Returns a new ReflowSequence with the new element inserted.
Insertion is always relative to an existing element, either before or after it as specified by pos. This generates appropriate creation LintFix objects to direct the linter to insert those elements.
- rebreak() ReflowSequence ¶
Returns a new ReflowSequence with corrected line breaks.
This intentionally does not handle indentation, as the existing indents are assumed to be correct.
Note
Currently this only moves existing segments around line breaks (e.g. for operators and commas), but eventually this method will also handle line length considerations too.
- reindent() ReflowSequence ¶
Reindent lines within a sequence.
- replace(target: BaseSegment, edit: Sequence[BaseSegment]) ReflowSequence ¶
Returns a new ReflowSequence with edit elements replaced.
This generates appropriate replacement LintFix objects to direct the linter to modify those elements.
- respace(strip_newlines: bool = False, filter: str = 'all') ReflowSequence ¶
Returns a new ReflowSequence with points respaced.
- Parameters:
strip_newlines (bool) – Optionally strip newlines before respacing. This is primarily used on focused sequences to coerce objects onto a single line. This does not apply any prioritisation to which line breaks to remove and so is not a substitute for the full reindent or reflow methods.
filter (str) – Optionally filter which reflow points to respace. Default configuration is all. Other options are line_break which only respaces points containing a newline or followed by an end_of_file marker, or inline which is the inverse of line_break. This is most useful for filtering between trailing whitespace and fixes between content on a line.
NOTE this method relies on the embodied results being correct so that we can build on them.
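To make the effect of the two parameters concrete, here is a heavily simplified sketch which operates on the whitespace strings of points only. It is an assumption-laden toy, not the real method: `toy_respace` is hypothetical, it ignores per-segment spacing configuration and the end_of_file case, and it assumes every non-empty inline gap should become a single space.

```python
import re
from typing import List


def toy_respace(points: List[str], strip_newlines: bool = False, filter: str = "all") -> List[str]:
    """Respace whitespace-only points; the parameter names mirror the documented ones."""
    if filter not in ("all", "line_break", "inline"):
        raise ValueError(f"Unexpected filter: {filter!r}")
    result = []
    for point in points:
        if strip_newlines:
            # Coerce the neighbouring content onto a single line.
            point = point.replace("\n", "")
        is_line_break = "\n" in point
        if is_line_break and filter in ("all", "line_break"):
            # Remove trailing whitespace before a newline, keep the indent after it.
            point = re.sub(r"[ \t]+\n", "\n", point)
        elif not is_line_break and filter in ("all", "inline"):
            # Collapse any run of inline whitespace to one space.
            point = " " if point else ""
        result.append(point)
    return result


points = ["  ", " \n  ", ""]
respaced_all = toy_respace(points)
respaced_inline = toy_respace(points, filter="inline")
respaced_line_break = toy_respace(points, filter="line_break")
respaced_single_line = toy_respace(points, strip_newlines=True)
```

The `line_break` filter therefore acts on trailing whitespace only, while `inline` acts on gaps between content on the same line, which is the split described in the parameter documentation above.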
- without(target: RawSegment) ReflowSequence ¶
Returns a new ReflowSequence without the specified segment.
This generates appropriate deletion LintFix objects to direct the linter to remove those elements.
- class ReflowPoint(segments: Tuple[RawSegment, ...])¶
Class for keeping track of editable elements in reflow.
This class, and its sibling ReflowBlock, should not normally be manipulated directly by rules, but instead should be manipulated using ReflowSequence.
It holds segments which can be changed during a reflow operation such as whitespace and newlines. It may also contain Indent and Dedent elements.
It holds no configuration and is influenced by the blocks on either side, so that any operations on it usually have that configuration passed in as required.
- property class_types: Set[str]¶
Get the set of contained class types.
Parallel to BaseSegment.class_types
- get_indent() str | None ¶
Get the current indent (if there).
- get_indent_impulse() IndentStats ¶
Get the change in intended indent balance from this point.
- indent_to(desired_indent: str, after: BaseSegment | None = None, before: BaseSegment | None = None, description: str | None = None, source: str | None = None) Tuple[List[LintResult], ReflowPoint] ¶
Coerce a point to have a particular indent.
If the point currently contains no newlines, one will be introduced and any trailing whitespace will be effectively removed.
More specifically, the newline is inserted before the existing whitespace, with the new indent being a replacement for that same whitespace.
For placeholder newlines or indents we generate appropriate source fixes.
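The newline handling described above can be sketched on the raw whitespace of a point. This is a minimal illustration only: `toy_indent_to` is a hypothetical helper which works on strings, whereas the real method works on segments, returns lint results, and also handles placeholders.

```python
def toy_indent_to(point: str, desired_indent: str) -> str:
    """Return the point's whitespace coerced to end in a newline plus the indent."""
    if "\n" in point:
        # Keep everything up to the last newline and replace the indent after it.
        head, _, _ = point.rpartition("\n")
        return head + "\n" + desired_indent
    # No newline yet: introduce one, with the new indent replacing the old whitespace.
    return "\n" + desired_indent


inline_point = toy_indent_to(" ", "    ")
existing_break = toy_indent_to("\n  ", "    ")
```

In the first case the single inline space is replaced by a newline and a four space indent, so no trailing whitespace is left on the previous line.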
- num_newlines() int ¶
Return the number of newlines in this element.
These newlines are either newline segments or contained within consumed sections of whitespace. This counts both.
- property pos_marker: PositionMarker | None¶
Get the first position marker of the element.
- property raw: str¶
Get the current raw representation.
- respace_point(prev_block: ReflowBlock | None, next_block: ReflowBlock | None, root_segment: BaseSegment, lint_results: List[LintResult], strip_newlines: bool = False, anchor_on: str = 'before') Tuple[List[LintResult], ReflowPoint] ¶
Respace a point based on given constraints.
NB: This effectively includes trailing whitespace fixes.
Deletion and edit fixes are generated immediately, but creations are paused to the end and done in bulk so as not to generate conflicts.
Note that the strip_newlines functionality exists here as a slight exception to pure respacing, but as a very simple case of positioning line breaks. The default operation of respace does not enable it, however it exists as a convenience for rules which wish to use it.
- class ReflowBlock(segments: Tuple[RawSegment, ...], spacing_before: str, spacing_after: str, line_position: str | None, depth_info: DepthInfo, stack_spacing_configs: Dict[int, str], line_position_configs: Dict[int, str])¶
Class for keeping track of elements to reflow.
This class, and its sibling ReflowPoint, should not normally be manipulated directly by rules, but instead should be manipulated using ReflowSequence.
It holds segments to reflow and also exposes configuration regarding how they are expected to reflow around others. Typically it holds only a single element, which is usually code or a templated element. Because reflow operations control spacing, it would be very unusual for this object to be modified; as such it exposes relatively few methods.
The attributes exposed are designed to be “post configuration” i.e. they should reflect configuration appropriately.
- property class_types: Set[str]¶
Get the set of contained class types.
Parallel to BaseSegment.class_types
- depth_info: DepthInfo¶
Metadata on the depth of this segment within the parse tree which is used in inferring how and where line breaks should exist.
- classmethod from_config(segments, config: ReflowConfig, depth_info: DepthInfo) ReflowBlock ¶
Construct a ReflowBlock while extracting relevant configuration.
This is the primary route to construct a ReflowBlock, as it allows all of the inference of the spacing and position configuration from the segments it contains and the appropriate config objects.
- line_position: str | None¶
Desired line position for this block. See Configuring layout and spacing
- line_position_configs: Dict[int, str]¶
Desired line position configurations for parent segments of the segment in this block. See Configuring layout and spacing
- num_newlines() int ¶
Return the number of newlines in this element.
These newlines are either newline segments or contained within consumed sections of whitespace. This counts both.
- property pos_marker: PositionMarker | None¶
Get the first position marker of the element.
- property raw: str¶
Get the current raw representation.
- spacing_after: str¶
Desired spacing after this block. See Configuring layout and spacing
- spacing_before: str¶
Desired spacing before this block. See Configuring layout and spacing
- stack_spacing_configs: Dict[int, str]¶
Desired spacing configurations for parent segments of the segment in this block. See Configuring layout and spacing