Text highlighting definition file syntax

Since 3.19

The gb.highlight component lets you register new text syntax highlighters based on definition files.

A definition file contains a list of highlighting states.

Each state is associated with:

  • A style name, which tells how to draw a piece of text according to a highlighting theme. The style name must be made of lowercase letters, dot or underscore characters.

  • A list of commands, which tells how to recognize the piece of text that must be drawn with that style.

The highlighting process takes a piece of text (normally a whole line), an initial state, and returns an array that associates one state to each character of the text, and a final state.

Taking an initial state and returning the final state allows the highlighting process to be run incrementally (i.e. line by line) by a text editor.
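The contract described above (a line and an initial state go in, one state per character and a final state come out) can be sketched in Python. This is only an illustration, not the gb.highlight implementation: the function names `highlight_line` and `highlight_text` are hypothetical, and the toy highlighter knows a single `comment` state delimited by `<!--` and `-->`.

```python
from typing import List, Tuple

NORMAL, COMMENT = "normal", "comment"

def highlight_line(line: str, state: str = NORMAL) -> Tuple[List[str], str]:
    """Return one state per character of `line`, plus the state at end of line."""
    states: List[str] = []
    pos = 0
    while pos < len(line):
        if state == NORMAL and line.startswith("<!--", pos):
            # The start delimiter itself belongs to the comment state.
            state = COMMENT
            states.extend([COMMENT] * 4)
            pos += 4
        elif state == COMMENT and line.startswith("-->", pos):
            # The end delimiter is still part of the comment.
            states.extend([COMMENT] * 3)
            state = NORMAL
            pos += 3
        else:
            states.append(state)
            pos += 1
    return states, state

def highlight_text(lines: List[str]) -> List[List[str]]:
    """Highlight line by line, feeding each final state into the next line."""
    state = NORMAL
    result = []
    for line in lines:
        states, state = highlight_line(line, state)
        result.append(states)
    return result
```

Because the final state of one line is the initial state of the next, an editor only has to re-run the function from the first modified line until the final state stops changing.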

Let's take part of the HTML highlighting definition file as an example (it highlights the contents of HTML files):

$(IDENT)=[a-zA-Z0-9-:]+
doctype{Doctype=Preprocessor}:
  from <!DOCTYPE to >
comment:
  from <!-- to -->
entity{Entity=Function}:
  match /&[A-Za-z]+;/
  match /&#[0-9]+;/
markup{Markup=Keyword}:
  from /<$(IDENT)/ to //?>/
  attribute{Attribute=Datatype}:
    match /$(IDENT)/
  equal{Normal}:
    symbol =
  value{Value=String}:
    from " to "
    from ' to '
    string.entity{Entity}:
      match /&[A-Za-z]+;/
      match /&#[0-9]+;/
  value.unquoted{Value}:
    match /[^"'`=<>\s]+/
markup.close{Markup}:
  match /</$(IDENT)\s*>/

State definition

Each line that ends with a colon : character introduces a new state.

The syntax is the following:

state name [ { style name [ = default style name ] } ] :

  • state name is the name of the state.

  • style name is the name of the associated style.

  • default style name is the name of a style that will be used as a default if style name is not defined in the highlighting theme used when actually rendering the text.
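As an illustration of the syntax above, here is a small Python sketch that splits a state line into its three parts. It is not how gb.highlight parses the file; the names `STATE_HEADER`, `StateHeader` and `parse_state_header` are hypothetical, and the exact set of characters allowed in names is an assumption.

```python
import re
from typing import NamedTuple, Optional

STATE_HEADER = re.compile(
    r"""
    ^(?P<indent>[ ]*)                       # leading spaces
    (?P<state>[^\s{}:=]+)                   # state name
    (?:\{\s*(?P<style>[^\s{}=]+)            # optional style name in braces
       (?:\s*=\s*(?P<default>[^\s{}=]+))?   # optional default style name
       \s*\})?
    \s*:\s*$                                # the line must end with a colon
    """,
    re.VERBOSE,
)

class StateHeader(NamedTuple):
    indent: int
    state: str
    style: Optional[str]
    default: Optional[str]

def parse_state_header(line: str) -> Optional[StateHeader]:
    """Return the parts of a state line, or None if the line is not one."""
    m = STATE_HEADER.match(line)
    if m is None:
        return None
    return StateHeader(len(m["indent"]), m["state"], m["style"], m["default"])
```

For instance, `parse_state_header("doctype{Doctype=Preprocessor}:")` yields the state `doctype`, the style `Doctype` and the default style `Preprocessor`, while a command line such as `  match /&#[0-9]+;/` yields `None`.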

There is a list of hard-coded style names that are defined in all highlighting themes, and that you can use without problem in any definition file.

These common style names are: Normal, Added, Removed, Error, Comment, Documentation, Keyword, Function, Operator, Symbol, Number, String, Datatype, Preprocessor, Escape, and Constant.

If you want to introduce a new style name, it is a good idea to give it a default style name taken from that list.

Example

doctype{Doctype=Preprocessor}:

introduces a state named doctype, associated with a style whose name is Doctype. As Doctype is a new style name, we tell the highlighter to use the Preprocessor style if the Doctype style is not explicitly defined in the highlighting theme.

If no style name is defined, the state will be associated with a style having the same name.

Example

comment:

introduces the comment state which will be associated with the Comment style (case is not important).
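The style lookup rules described so far can be summarized by the following Python sketch. It is an illustration under assumptions, not the actual gb.highlight code: `resolve_style` is a hypothetical name, a theme is modeled as a plain set of style names, and falling back to Normal when nothing matches is a guess, since the text does not say what happens in that case.

```python
from typing import Optional, Set

# The common style names listed in this document.
COMMON_STYLES = {
    "Normal", "Added", "Removed", "Error", "Comment", "Documentation",
    "Keyword", "Function", "Operator", "Symbol", "Number", "String",
    "Datatype", "Preprocessor", "Escape", "Constant",
}

def resolve_style(state: str, style: Optional[str], default: Optional[str],
                  theme_styles: Set[str]) -> str:
    """Pick the style used to draw `state` with a given theme."""
    # Case-insensitive lookup table of every style the theme knows about.
    known = {name.lower(): name for name in theme_styles | COMMON_STYLES}
    # Without an explicit style name, the state name is used as the style name.
    wanted = style if style is not None else state
    for candidate in (wanted, default):
        if candidate is not None and candidate.lower() in known:
            return known[candidate.lower()]
    return "Normal"  # assumption: last-resort fallback, not stated in the text
```

With a theme that does not define Doctype, `resolve_style("doctype", "Doctype", "Preprocessor", set())` returns `Preprocessor`, and `resolve_style("comment", None, None, set())` returns `Comment`.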

Each state is checked independently, but as the definition file is read from top to bottom, states defined earlier have higher priority than states defined later.

Moreover, if no state matches the current character, the normal state applies. In that case, space, tab and newline characters are automatically ignored, i.e. highlighted with the normal state.
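The priority rule can be pictured with a short Python sketch: at each position the rules are tried in definition order, the first one that matches wins, and a character matched by no rule gets the normal state. The `keyword` and `identifier` states below are hypothetical examples, Python's `re` module stands in for gb.pcre, and the real matching loop may differ.

```python
import re
from typing import List, Tuple

Rule = Tuple[str, re.Pattern]

def scan(text: str, rules: List[Rule]) -> List[str]:
    """Assign a state to each character, trying rules in definition order."""
    states: List[str] = []
    pos = 0
    while pos < len(text):
        for state, regex in rules:
            m = regex.match(text, pos)
            if m and m.end() > pos:
                states.extend([state] * (m.end() - pos))
                pos = m.end()
                break
        else:
            # No rule matched here: the character is in the normal state.
            states.append("normal")
            pos += 1
    return states

KEYWORD = ("keyword", re.compile(r"if\b"))
IDENTIFIER = ("identifier", re.compile(r"[A-Za-z_]+"))
```

`scan("if x", [KEYWORD, IDENTIFIER])` marks `if` as a keyword, whereas swapping the two rules marks it as an identifier, because both rules match at the same position and the earlier one takes precedence.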

State commands

The definition of a state is followed by one or several commands that define which text must be associated with that state.

These commands must be indented with at least one space.

Here are the possible commands:

  • match pattern
    Apply the state to the matching pattern.

  • word word #1 word #2 ... word #n
    Apply the state to any of the following words. Matching is case sensitive.

  • keyword word #1 word #2 ... word #n
    Apply the state to any of the following words, and add these words to a list of keywords associated with the highlighter. Matching is case sensitive.

  • symbol symbol #1 symbol #2 ... symbol #n
    Apply the state to any of the following symbols.

  • from start pattern to end pattern
    Apply the state to the text between the start pattern and the end pattern, the patterns included.

  • from start pattern
    Apply the state to the text between the start pattern and the end of the line, the start pattern included.

  • from here to end pattern
    Apply the state from the current character up to the end pattern, the end pattern included.

  • from here
    Apply the state from the current character up to the end of the line.

  • between start pattern and end pattern
    Apply the state to the text between the start pattern and the end pattern, the patterns excluded.

  • between start pattern
    Apply the state to the text between the start pattern and the end of the line, the start pattern excluded.

  • between here and end pattern
    Apply the state from the current character up to the end pattern, the end pattern excluded.
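The only difference between the from and between commands is whether the delimiters belong to the highlighted text. The Python sketch below shows that difference on a single line, with plain-string delimiters only; `from_to` and `between_and` are hypothetical helper names, and regular expression patterns, the here keyword and spans that continue on the next line are deliberately left out.

```python
from typing import Optional

def from_to(line: str, start: str, end: Optional[str] = None) -> Optional[str]:
    """Text covered by `from start [to end]`: delimiters included."""
    begin = line.find(start)
    if begin < 0:
        return None
    stop = -1 if end is None else line.find(end, begin + len(start))
    # Without an end pattern (or if it is not found), go to the end of the line.
    return line[begin:] if stop < 0 else line[begin:stop + len(end)]

def between_and(line: str, start: str, end: Optional[str] = None) -> Optional[str]:
    """Text covered by `between start [and end]`: delimiters excluded."""
    begin = line.find(start)
    if begin < 0:
        return None
    begin += len(start)
    stop = -1 if end is None else line.find(end, begin)
    return line[begin:] if stop < 0 else line[begin:stop]
```

On the line `x = "abc" + y` with `"` as both delimiters, `from_to` covers `"abc"` while `between_and` covers only `abc`.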

A pattern can be:
  • A plain string without spaces, optionally enclosed in quotes.

    Using quotes allows escaped control characters to be used: "\n" for a newline, "\t" for a tab, "\\" for a backslash...

  • Or a Perl-compatible regular expression between slashes /, handled by the gb.pcre component.

    For more information about regular expressions, see the PCRE pattern syntax page.
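The three pattern forms can be told apart from their first and last characters. The following Python sketch converts a pattern token into a compiled regular expression; it is an approximation for illustration only. `compile_pattern` is a hypothetical name, Python's `re` module is not PCRE and differs from gb.pcre in places, and only the three escapes named above are handled (any other escape is left untouched, which is an assumption).

```python
import re

# Escapes named in the text; anything else is left as written (assumption).
ESCAPES = {"n": "\n", "t": "\t", "\\": "\\"}

def compile_pattern(token: str) -> re.Pattern:
    """Turn a definition-file pattern into a compiled regular expression."""
    if len(token) >= 2 and token[0] == "/" and token[-1] == "/":
        # Regular expression between slashes: use the body as is.
        return re.compile(token[1:-1])
    if len(token) >= 2 and token[0] == '"' and token[-1] == '"':
        # Quoted string: decode escapes, then match the result literally.
        literal = re.sub(r"\\(.)", lambda m: ESCAPES.get(m[1], m[0]), token[1:-1])
        return re.compile(re.escape(literal))
    # Plain string: match it literally.
    return re.compile(re.escape(token))
```

For example, `compile_pattern("<!--")` matches the four literal characters, `compile_pattern('"\\t"')` matches a tab character, and `compile_pattern("/&#[0-9]+;/")` matches numeric entities such as `&#160;`.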

Example

comment:
  from <!-- to -->

tells that the comment state will be applied to all text between the <!-- and --> strings, both strings included.

entity{Entity=Function}:
  match /&[A-Za-z]+;/
  match /&#[0-9]+;/

tells that the entity state will be applied to any text matching either the &[A-Za-z]+; or the &#[0-9]+; regular expression.

Recursive states

It is possible to nest states. The effect of nested states depends on the command.

  • For the from and between commands, nested states are applied for the part of text inside the start and end patterns specified in the command arguments.

  • For the match, word, keyword or symbol commands, the nested states are applied to the text following the matched text.

Example

markup{Markup=Keyword}:
  from /<$(IDENT)/ to //?>/
  attribute{Attribute=Datatype}:
    match /$(IDENT)/
  equal{Normal}:
    symbol =
  value{Value=String}:
    from " to "
    from ' to '
    string.entity{Entity}:
      match /&[A-Za-z]+;/
      match /&#[0-9]+;/
  value.unquoted{Value}:
    match /[^"'`=<>\s]+/

The markup state is applied from the text matching the <$(IDENT) regular expression up to the text matching the /?> regular expression.

Note: $(IDENT) is not actually a regular expression pattern, but a preprocessor variable defined at the top of the definition file. See below.

attribute, equal, value and value.unquoted are states that will apply only inside the markup state, i.e. between the start and end patterns defined by the from command.

In other words, all these nested states allow you to define a specific highlighting process that occurs only inside HTML markups.
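To make the nesting concrete, here is a Python sketch that rebuilds the tree of states from the markup example. It assumes that nesting is driven purely by indentation, which is what the examples suggest but is not stated explicitly; the `State` class and `parse_tree` function are hypothetical and do not reflect the actual gb.highlight parser.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class State:
    name: str
    indent: int
    commands: List[str] = field(default_factory=list)
    children: List["State"] = field(default_factory=list)

def parse_tree(source: str) -> State:
    """Build a tree of states, using indentation to find each parent."""
    root = State("normal", -1)  # implicit top-level container
    stack = [root]
    for raw in source.splitlines():
        text = raw.strip()
        if not text:
            continue
        indent = len(raw) - len(raw.lstrip(" "))
        # Leave every state that is not less indented than this line.
        while stack[-1].indent >= indent:
            stack.pop()
        if text.endswith(":"):
            node = State(text[:-1].split("{", 1)[0], indent)
            stack[-1].children.append(node)
            stack.append(node)
        else:
            stack[-1].commands.append(text)
    return root

SOURCE = r"""
markup{Markup=Keyword}:
  from /<$(IDENT)/ to //?>/
  attribute{Attribute=Datatype}:
    match /$(IDENT)/
  equal{Normal}:
    symbol =
  value{Value=String}:
    from " to "
    from ' to '
    string.entity{Entity}:
      match /&[A-Za-z]+;/
      match /&#[0-9]+;/
  value.unquoted{Value}:
    match /[^"'`=<>\s]+/
"""
```

Parsing `SOURCE` gives a single top-level `markup` state with four children (`attribute`, `equal`, `value`, `value.unquoted`), and `string.entity` appears as a child of `value`.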

Special commands

A special command matches no text, but defines some properties associated with the current state.

There is only one special command, at the moment:

  • limit
    Matching that state sets the "limit" flag, which indicates the start of a new section of the text.

For example, the Gambas text editor uses that flag for delimiting collapsible sections in the edited text.

Variables

A variable is a name surrounded by $( and ). It has a value, usually defined at the beginning of the definition file.

The syntax of a variable definition is the following:

$(variable name) = value

Example:
$(IDENT)=[a-zA-Z0-9-:]+

Every occurrence of the variable is replaced by its value.

So, in the example, every occurrence of $(IDENT) in the definition file will be replaced by [a-zA-Z0-9-:]+.

This is a good way to centralize the definition of your regular expression patterns.
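The substitution amounts to a simple textual replacement, as the following Python sketch shows. It is an illustration rather than the actual preprocessor: `expand_variables` is a hypothetical name, variable names are assumed to consist of letters, digits and underscores, and an undefined variable is left untouched (also an assumption).

```python
import re

VAR_DEF = re.compile(r"^\$\((\w+)\)\s*=\s*(.*)$")   # $(NAME) = value
VAR_USE = re.compile(r"\$\((\w+)\)")                # $(NAME)

def expand_variables(source: str) -> str:
    """Drop variable definitions and replace every use by its value."""
    variables = {}
    output = []
    for line in source.splitlines():
        definition = VAR_DEF.match(line)
        if definition:
            variables[definition[1]] = definition[2]
            continue
        # Unknown variables are kept as written.
        output.append(VAR_USE.sub(lambda use: variables.get(use[1], use[0]), line))
    return "\n".join(output)
```

Applied to the HTML example, the line `  from /<$(IDENT)/ to //?>/` becomes `  from /<[a-zA-Z0-9-:]+/ to //?>/`.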

Preprocessor commands

Definition files support some rudimentary preprocessing commands. These preprocessing commands are lines beginning with the @ character.

  • @include file name
    Include another definition file inside the current one. The included file must be located in the same directory as the current file, so only the file name is specified.

  • @define name
    Define a preprocessor flag named name.

  • @if name ... @endif
    Process the part of the definition file between @if name and @endif only if name has been defined with the @define command.

  • @word regular expression
    Define which regular expression the word and keyword commands will use to match a word. The regular expression must be specified between / characters.

By default, a word is matched by the /[A-Za-z_][A-Za-z0-9_]*/ regular expression.
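The effect of these commands can be sketched as a preprocessing pass written in Python. This is a simplified illustration, not the gb.highlight preprocessor: `preprocess` is a hypothetical name, @include is omitted because it needs file access, and nested @if blocks are not handled since the text does not say whether they are allowed.

```python
import re
from typing import List, Tuple

DEFAULT_WORD = r"[A-Za-z_][A-Za-z0-9_]*"

def preprocess(lines: List[str]) -> Tuple[List[str], str]:
    """Apply @define, @if/@endif and @word; return kept lines and the word regex."""
    flags = set()
    word_regex = DEFAULT_WORD
    kept: List[str] = []
    skipping = False
    for line in lines:
        stripped = line.strip()
        if stripped == "@endif":
            skipping = False
        elif skipping:
            continue  # inside an @if block whose flag is not defined
        elif stripped.startswith("@if "):
            skipping = stripped[4:].strip() not in flags
        elif stripped.startswith("@define "):
            flags.add(stripped[8:].strip())
        elif stripped.startswith("@word "):
            # The regular expression is written between / characters.
            word_regex = stripped[6:].strip()[1:-1]
        else:
            kept.append(line)
    return kept, word_regex
```

With the default regular expression, `my_var1` counts as a word while `1abc` does not, because a word cannot start with a digit.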