Tokenize

Tokens = Tokenize ( String [ , Identifiers , Strings , Operators , KeepSpace ] )

Since 3.21

Splits a string into tokens and returns them.

Arguments

  • String : the string to split.

  • Identifiers : a string of extra characters allowed in identifier tokens.

  • Strings : an array of strings, each one describing the delimiters of a string token.

  • Operators : an array of strings, each string representing an operator token.

  • KeepSpace : tells whether space tokens are returned (a sketch of a full call follows this list).
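
For orientation, a call that uses every argument might look like the following sketch. The input string, the extra identifier character, and the string and operator lists are purely illustrative, and the output shown is what the rules described below predict rather than a verified run:

Print Tokenize("Total_Size <= \"10 MiB\"", "_", ["\""], ["<="], True).Join("|")
Total_Size| |<=| |"10 MiB"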

Return value

The tokens are returned as a string array.

Description

This function is a simple lexical parser that splits a string into tokens and returns them as a string array made of the following kinds of tokens:

  • Space tokens

    A space token is made of successive space or tab characters.

  • Newline tokens

    A newline token is made of one newline character.

  • Number tokens

    A number token is made of successive digit characters.

  • Identifier tokens

    An identifier token starts with a letter and is made of any succession of letters, digits, or extra characters specified in the Identifiers argument.

    If Identifiers is not specified, only letters and digits are allowed.

  • String tokens

    Each string of the Strings array describes the delimiters of a string token.

    • If the description is made of one character, then that character is both the initial and final delimiter. If two successive delimiter characters are encountered inside the string, they are kept as a single literal character and do not end the string.

    • If the description is made of two characters, then the first one is the initial delimiter, and the second one the final delimiter. The final delimiter cannot be escaped.

    • If the description is made of three characters, then the first one is the initial delimiter, and the second one the final delimiter. The final delimiter can be escaped by using the third character.

    If Strings is not specified, then no string token is parsed.

    For example, ["\"", "''\\", "[]"] will parse as string tokens everything enclosed in double quotes, single quotes, or square brackets. Strings enclosed in double quotes allow a doubled " to stand for a literal quote character, those enclosed in single quotes allow the ' character to be escaped with a backslash, whereas those enclosed in square brackets allow no escaping (see the sketch after this list).

  • Operator tokens

    The Operators argument is an array of strings, each of which will be parsed as a single token.

    As all characters that are not parsed as a space, newline, number, identifier or string token are returned as single-character tokens, Operators should usually contain only operators made of several characters, such as <=, >=, &&, and so on.
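
As a minimal sketch of the Strings and Operators arguments (the output lines are what the rules above predict, not transcripts of an actual run):

Print Tokenize("move [some file] now",, ["[]"]).Join(" _ ")
move _ [some file] _ now

Print Tokenize("If A <= 10 And B >= 2",,, ["<=", ">="]).Join(" _ ")
If _ A _ <= _ 10 _ And _ B _ >= _ 2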

The tokens are parsed in the order listed above.

So if a token is parsed as an identifier, it cannot be parsed as an operator. In other words, if you specify something like "X->" in the Operators argument, it will never match, because "X" will be parsed as an identifier first.
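
For instance, under the parsing order above (the output shown is an expectation based on that order, not a verified run), an operator that begins with a letter never matches, whereas a purely symbolic one does:

Print Tokenize("X->Y",,, ["X->"]).Join(" _ ")
X _ - _ > _ Y

Print Tokenize("X->Y",,, ["->"]).Join(" _ ")
X _ -> _ Y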

As all tokens are returned as strings, you cannot directly tell what type each token is, but in practice this should not matter.

Examples

Print Tokenize("Return Subst((\"&1 MiB\"), FormatNumber(Size / 1048576))").Join(" _ ")
Return _ Subst _ ( _ ( _ " _ & _ 1 _ MiB _ " _ ) _ , _ FormatNumber _ ( _ Size _ / _ 1048576 _ ) _ )

Print Tokenize("Return Subst((\"&1 MiB\"), FormatNumber(Size / 1048576))",, ["\""]).Join(" _ ")
Return _ Subst _ ( _ ( _ "&1 MiB" _ ) _ , _ FormatNumber _ ( _ Size _ / _ 1048576 _ ) _ )
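
As an additional sketch that is not part of the original examples (the output is what the rules above predict), passing True for KeepSpace also returns the space tokens:

Print Tokenize("Size / 1048576",,,, True).Join("|")
Size| |/| |1048576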

See also