Text highlighting definition file syntax
Since 3.19
The gb.highlight component lets you register new text syntax highlighters based on definition files.
A definition file contains a list of highlighting states.
Each state is associated with:
- A style name, which tells how to draw a piece of text according to a highlighting theme. The style name must be made of lowercase letters, dot or underscore characters.
- A list of commands, which tell how to recognize the piece of text that must be drawn with that style.
The highlighting process takes a piece of text (normally a whole line) and an initial state, and returns an array that associates one state with each character of the text, together with a final state.
Taking an initial state and returning the final state allows a text editor to run the highlighting process incrementally (i.e. line by line).
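The incremental process described above can be illustrated with a toy model. This is only a sketch and not the gb.highlight implementation: the function names and the two hard-coded states are assumptions made for the illustration.

```python
# Toy model of incremental highlighting; NOT the gb.highlight implementation.
# Only two states are known here: "normal" and an HTML-like "comment".
NORMAL, COMMENT = "normal", "comment"


def highlight_line(text, state):
    """Return (list with one state per character, final state) for one line."""
    states = []
    i = 0
    while i < len(text):
        if state == NORMAL and text.startswith("<!--", i):
            state = COMMENT
            states.extend([COMMENT] * 4)  # the start pattern is part of the comment
            i += 4
        elif state == COMMENT and text.startswith("-->", i):
            states.extend([COMMENT] * 3)  # the end pattern is part of the comment
            state = NORMAL
            i += 3
        else:
            states.append(state)
            i += 1
    return states, state


def highlight_document(lines):
    """Highlight line by line, feeding each final state into the next line."""
    state = NORMAL
    result = []
    for line in lines:
        states, state = highlight_line(line, state)
        result.append(states)
    return result
```

Because each call only needs the final state of the previous line, an editor could re-highlight from a modified line onwards instead of reprocessing the whole document.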
Let's take part of the html highlighting definition file as an example (it highlights the contents of HTML files):
$(IDENT)=[a-zA-Z0-9-:]+
doctype{Doctype=Preprocessor}:
   from <!DOCTYPE to >
comment:
   from <!-- to -->
entity{Entity=Function}:
   match /&[A-Za-z]+;/
   match /&#[0-9]+;/
markup{Markup=Keyword}:
   from /<$(IDENT)/ to //?>/
   attribute{Attribute=Datatype}:
      match /$(IDENT)/
   equal{Normal}:
      symbol =
   value{Value=String}:
      from " to "
      from ' to '
      string.entity{Entity}:
         match /&[A-Za-z]+;/
         match /&#[0-9]+;/
   value.unquoted{Value}:
      match /[^"'`=<>\s]+/
markup.close{Markup}:
   match /</$(IDENT)\s*>/
State definition
Each line that ends with a colon character (:) introduces a new state.
The syntax is the following:
state name [ { style name [ = default style name ] } ] :
- state name is the name of the state.
- style name is the name of the associated style.
- default style name is the name of a style that will be used as a default if style name is not defined in the highlighting theme used when actually rendering the text.
There is a list of hard-coded style names that are defined in all highlighting themes, and that you can safely use in any definition file.
These common style names are: Normal, Added, Removed, Error, Comment, Documentation, Keyword, Function, Operator, Symbol, Number, String, Datatype, Preprocessor, Escape, and Constant.
If you want to introduce a new style name, it is a good idea to give it a default style name taken from that list.
Example
doctype{Doctype=Preprocessor}:
introduces a state named doctype, associated with a style whose name is Doctype. As Doctype is a new style name, we tell the highlighter to use the Preprocessor style if the Doctype style is not explicitly defined in the highlighting theme.
If no style name is defined, the state will be associated with a style having the same name.
Example
comment:
introduces the comment state, which will be associated with the Comment style (case is not important).
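The state definition syntax and the style fallback rule can be sketched as follows. This is a hedged illustration only: `parse_state_header`, `resolve_style` and the representation of a theme as a set of lowercase style names are all assumptions, not part of gb.highlight.

```python
import re

# Hypothetical parser for: state name [ { style name [ = default style name ] } ] :
HEADER = re.compile(
    r"^\s*([^\s{}:=]+)\s*"                                # state name
    r"(?:\{\s*([^\s{}=]+)\s*(?:=\s*([^\s{}=]+)\s*)?\})?"  # {style[=default]}
    r"\s*:\s*$"                                           # trailing colon
)


def parse_state_header(line):
    """Return (state, style, default style) or None if the line is not a header."""
    m = HEADER.match(line)
    if m is None:
        return None
    name, style, default = m.groups()
    if style is None:
        style = name  # no style given: the style has the same name as the state
    return name, style, default


def resolve_style(style, default, theme):
    """Pick the style to draw with; `theme` is an assumed set of lowercase names."""
    if style.lower() in theme:  # case is not important
        return style
    return default  # may be None; the source does not say what happens then
```

For instance, `parse_state_header("doctype{Doctype=Preprocessor}:")` yields the state `doctype` with style `Doctype` and default `Preprocessor`, and a theme that only defines `preprocessor` makes `resolve_style` fall back to `Preprocessor`.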
Each state is checked independently, but as the definition file is read from top to bottom, states defined earlier take priority over states defined later.
Moreover, if no state matches the current character, the normal state applies.
In that case, space, tab and newline characters are automatically ignored, i.e. highlighted with the normal state.
State commands
The definition of a state is followed by one or several commands that define which text must be associated with that state.
These commands must be indented with at least one space.
Here are the possible commands:
| Command | Description |
|---|---|
| match pattern | Apply the state to the matching pattern. |
| word word #1 word #2 ... word #n | Apply the state to any of the following words. Matching is case sensitive. |
| keyword word #1 word #2 ... word #n | Apply the state to any of the following words, and add these words to a list of keywords associated with the highlighter. Matching is case sensitive. |
| symbol symbol #1 symbol #2 ... symbol #n | Apply the state to any of the following symbols. |
| from start pattern to end pattern | Apply the state to the text between the start pattern and the end pattern, the patterns included. |
| from start pattern | Apply the state to the text between the start pattern and the end of the line, the start pattern included. |
| from here to end pattern | Apply the state from the current character up to the end pattern, the end pattern included. |
| from here | Apply the state from the current character up to the end of the line. |
| between start pattern and end pattern | Apply the state to the text between the start pattern and the end pattern, the patterns excluded. |
| between start pattern | Apply the state to the text between the start pattern and the end of the line, the start pattern excluded. |
| between here and end pattern | Apply the state from the current character up to the end pattern, the end pattern excluded. |
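The difference between the from and between commands is only whether the patterns themselves receive the state. The sketch below illustrates this for plain-string patterns on a single line; `from_to` and `between_and` are hypothetical helpers, and treating a missing end pattern as "up to the end of the line" is a simplification made for the example.

```python
def from_to(text, start, end):
    """Text covered by 'from start to end': both patterns are included."""
    i = text.find(start)
    if i < 0:
        return None  # start pattern not found: the state does not apply
    j = text.find(end, i + len(start))
    if j < 0:
        return text[i:]  # simplification: no end pattern on this line
    return text[i:j + len(end)]


def between_and(text, start, end):
    """Text covered by 'between start and end': both patterns are excluded."""
    i = text.find(start)
    if i < 0:
        return None
    begin = i + len(start)
    j = text.find(end, begin)
    if j < 0:
        return text[begin:]
    return text[begin:j]
```

On the line `a <!-- b --> c`, `from_to` with `<!--` and `-->` covers `<!-- b -->`, while `between_and` covers only ` b `.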
A pattern can be:
- A plain string without spaces, enclosed in quotes or not. Using quotes allows the use of escaped control characters: "\n" for a newline, "\t" for a tab, "\\" for a backslash...
- Or a Perl-compatible regular expression between slashes (/), handled by the gb.pcre component. For more information about regular expressions, see the PCRE pattern syntax page.
Example
comment:
   from <!-- to -->
tells the highlighter that the comment state applies to all text between the <!-- and --> strings, both strings included.
entity{Entity=Function}:
   match /&[A-Za-z]+;/
   match /&#[0-9]+;/
tells the highlighter that the entity state applies to any text matching the &[A-Za-z]+; or the &#[0-9]+; regular expression.
Recursive states
It is possible to nest states. The effect of nested states depends on the command.
- For the from and between commands, nested states are applied to the part of the text inside the start and end patterns specified in the command arguments.
- For the match, word, keyword or symbol commands, the nested states are applied to the text following the matched text.
Example
markup{Markup=Keyword}:
   from /<$(IDENT)/ to //?>/
   attribute{Attribute=Datatype}:
      match /$(IDENT)/
   equal{Normal}:
      symbol =
   value{Value=String}:
      from " to "
      from ' to '
      string.entity{Entity}:
         match /&[A-Za-z]+;/
         match /&#[0-9]+;/
   value.unquoted{Value}:
      match /[^"'`=<>\s]+/
The markup state is applied from the text matching the <$(IDENT) regular expression up to the text matching the /?> regular expression.
Note: $(IDENT) is not actually a regular expression pattern, but a preprocessor variable defined at the top of the definition file. See the Variables section below.
attribute, equal, value and value.unquoted are states that apply only inside the markup state, i.e. between the start and end patterns defined by the from command.
In other words, these nested states define a specific highlighting process that occurs only inside HTML markups.
Special commands
A special command matches no text, but defines some properties associated with the current state.
There is only one special command at the moment:
| Command | Description |
|---|---|
| limit | Matching that state sets the "limit" flag, indicating a new section of the text. For example, the Gambas text editor uses that flag to delimit collapsible sections in the edited text. |
Variables
Variables are text surrounded by $( and ). They have a value, usually defined at the beginning of the definition file.
The syntax of a variable definition is the following:
$(variable name) = value
Every occurrence of the variable is replaced by its value.
So, in the example, every occurrence of $(IDENT) in the definition file will be replaced by [a-zA-Z0-9-:]+.
This is a convenient way to centralize the definition of your regular expression patterns.
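The substitution described above amounts to a plain textual replacement. Here is a minimal sketch of it, assuming that variable names are made of word characters and that definitions appear before their uses; `expand_variables` is a hypothetical helper, not a gb.highlight function.

```python
import re

# Assumption: a definition line looks like "$(NAME) = value" or "$(NAME)=value".
VAR_DEF = re.compile(r"^\$\((\w+)\)\s*=\s*(.*)$")


def expand_variables(lines):
    """Drop variable definitions and replace each $(NAME) by its value."""
    variables = {}
    output = []
    for line in lines:
        m = VAR_DEF.match(line)
        if m:
            variables[m.group(1)] = m.group(2)  # remember the value, emit nothing
            continue
        for name, value in variables.items():
            line = line.replace("$(" + name + ")", value)
        output.append(line)
    return output
```

Applied to the html example, the line `from /<$(IDENT)/ to //?>/` becomes `from /<[a-zA-Z0-9-:]+/ to //?>/` before the pattern is interpreted as a regular expression.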
Preprocessor commands
Definition files support some rudimentary preprocessing commands. These preprocessing commands are lines beginning with the @ character.
| Command | Description |
|---|---|
| @include file name | Include another definition file inside the current one. The included file must be located in the same directory as the current file, so only the file name is specified. |
| @define name | Define a preprocessor flag named name. |
| @if name ... @endif | Process the part of the definition file between @if name and @endif only if the name has been defined with the @define command. |
| @word regular expression | Define which regular expression the word and keyword commands will use to match a word. The regular expression must be specified between / characters. By default, a word is matched by the /[A-Za-z_][A-Za-z0-9_]*/ regular expression. |
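To see why the @word command can be useful, the sketch below applies the default word regular expression with Python's re module (not gb.pcre, although this pattern only uses syntax common to both). The hyphen-aware alternative is a hypothetical example of what a definition file might pass to @word, not a pattern taken from an existing file.

```python
import re

# Default word pattern quoted in the table above.
DEFAULT_WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Hypothetical alternative that also accepts hyphens inside a word.
HYPHEN_WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_-]*")


def words(text, pattern=DEFAULT_WORD):
    """Return every substring of `text` that the pattern treats as a word."""
    return pattern.findall(text)
```

With the default pattern, `font-size` is split into the two words `font` and `size`, so a keyword containing a hyphen could never match as a whole; with the hyphen-aware pattern it is seen as the single word `font-size`.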