The base class for all scanners.
Public Member Functions | |
__construct ($src=null) | |
constructor | |
add_filter ($arg1, $arg2, $arg3=null) | |
Add an individual token filter. | |
add_identifier_mapping ($name, $matches) | |
Adds an identifier mapping which is later analysed by map_identifier_filter. | |
add_stream_filter ($arg1, $arg2=null) | |
Adds a stream filter. | |
highlight ($src) | |
Public convenience function for setting the string and highlighting it. | |
init () | |
Set up the scanner immediately prior to tokenization. | |
main () | |
the method responsible for tokenization | |
map_identifier_filter ($token) | |
Identifier mapping filter. | |
nestable_token ($token_name, $open, $close) | |
Handles tokens that may nest inside themselves. | |
pop () | |
Pops the top element of the stack, and returns it. | |
push ($state) | |
Pushes some data onto the stack. | |
record ($string, $type, $pre_escaped=false) | |
Records a string as a given token type. | |
record_range ($from, $to, $type) | |
Helper function to record a range of the string. | |
remove_filter ($name) | |
Removes the individual filter(s) with the given name. | |
remove_stream_filter ($name) | |
Removes the stream filter(s) with the given name. | |
skip_whitespace () | |
Skips whitespace, and records it as a null token. | |
start () | |
Flushes the token stream. | |
state () | |
Gets the top element on $state_ or null if it is empty. | |
tagged () | |
Returns the XML representation of the token stream. | |
token_array () | |
Gets the token array. | |
Public Member Functions inherited from Scanner | |
add_pattern ($name, $pattern) | |
Allows the caller to add a predefined named pattern. | |
bol () | |
Beginning of line? | |
check ($pattern) | |
Non-consuming lookahead. | |
eol () | |
End of line? | |
eos () | |
End of string? | |
get ($n=1) | |
Consume a given number of bytes. | |
get_next ($patterns) | |
Look for the next occurrence of a set of patterns. | |
get_next_named ($patterns) | |
Find the index of the next occurrence of a named pattern. | |
get_next_strpos ($patterns) | |
Look for the next occurrence of a set of substrings. | |
index ($pattern) | |
Find the index of the next occurrence of a pattern. | |
match () | |
Get the result of the most recent match operation. | |
match_group ($g=0) | |
Get a group from the most recent match operation. | |
match_groups () | |
Get the match groups of the most recent match operation. | |
match_pos () | |
Get the position (offset) of the most recent match. | |
next_match ($consume_and_log=true) | |
Automation function: returns the next occurrence of any known patterns. | |
peek ($n=1) | |
Lookahead into the string a given number of bytes. | |
pos ($new_pos=null) | |
Getter and setter for the current position (string pointer). | |
pos_shift ($offset) | |
Moves the string pointer by a given offset. | |
remove_pattern ($name) | |
Allows the caller to remove a named pattern. | |
reset () | |
Reset the scanner. | |
rest () | |
Gets the remaining string. | |
scan ($pattern) | |
Scans at the current pointer. | |
scan_until ($pattern) | |
Scans until the start of a pattern. | |
string ($s=null) | |
Getter and setter for the source string. | |
terminate () | |
Ends scanning of a string. | |
unscan () | |
Revert the most recent scanning operation. |
Static Public Member Functions | |
static | guess_language ($src, $info) |
Language guessing. |
Public Attributes | |
$version = 'master' | |
scanner version. |
Protected Member Functions | |
rule_mapper_filter ($tokens) | |
Rule re-mapper filter. | |
user_def_filter ($token) | |
Filter to highlight identifiers whose definitions are in the source. |
Protected Attributes | |
$case_sensitive = true | |
Whether or not the language is case sensitive. | |
$filters = array() | |
Individual token filters. | |
$ident_map = array() | |
A map of identifiers and their corresponding token names. | |
$rule_tag_map = array() | |
Rule remappings. | |
$state_ = array() | |
State stack. | |
$stream_filters = array() | |
Token stream filters. | |
$tokens = array() | |
The token stream. | |
$user_defs | |
Identifier remappings based on definitions identified in the source code. |
the base class for all scanners
LuminousScanner is the base class for all language scanners. Here we provide a set of methods comprising a highlighting layer. This includes recording a token stream, and ultimately being responsible for producing some XML representing the token stream.
We also define here some filters which rely on state information expected to be recorded into the instance variables.
Highlighting a string at this level is a four-stage process:
1. string() - set the string
2. init() - set up the scanner
3. main() - perform tokenization
4. tagged() - build the XML
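The four stages above can be sketched with a tiny stand-in class. MiniScanner, its single number rule and its two-element token layout are hypothetical and exist only for illustration; this is not the Luminous implementation, it only mirrors the call order.

```php
<?php
// Hypothetical stand-in, NOT the Luminous implementation. It only mirrors
// the four-stage call order: string(), init(), main(), tagged().
class MiniScanner
{
    private $src = '';
    private $pattern = null;
    private $tokens = array();

    // Stage 1: getter and setter for the source string.
    public function string($s = null)
    {
        if ($s !== null) {
            $this->src = $s;
        }
        return $this->src;
    }

    // Stage 2: last-minute set-up, here finalizing the only rule pattern.
    public function init()
    {
        $this->pattern = '/(\d+)|(.)/s';
        $this->tokens = array();
    }

    // Stage 3: consume the whole string and populate the token array.
    public function main()
    {
        preg_match_all($this->pattern, $this->src, $matches, PREG_SET_ORDER);
        foreach ($matches as $m) {
            // Group 1 is non-empty only when a run of digits matched.
            $type = (isset($m[1]) && $m[1] !== '') ? 'NUMERIC' : null;
            $this->tokens[] = array($type, $m[0]);
        }
    }

    // Stage 4: build an XML string from the token stream.
    public function tagged()
    {
        $out = '';
        foreach ($this->tokens as $t) {
            list($type, $text) = $t;
            $text = htmlspecialchars($text, ENT_NOQUOTES);
            $out .= ($type === null) ? $text : "<$type>$text</$type>";
        }
        return $out;
    }
}

$s = new MiniScanner();
$s->string('x = 42;');
$s->init();
$s->main();
echo $s->tagged(), "\n";
```

A real scanner would inherit these stages from LuminousScanner and normally override only init() and main().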
A note on tokens: Tokens are stored as an array with the following indices:
LuminousScanner::add_filter ($arg1, $arg2, $arg3=null)
Add an individual token filter.
Adds an individual token filter. The filter is bound to the given token_name. The filter is a callback which should take a token and return a token.
The arguments are: [name], token_name, filter
Name is an optional argument.
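As a rough illustration of how filters bound to a token name might be applied, here is a minimal sketch. The helper apply_token_filters() and the callback upper_keyword() are hypothetical, and the token layout (name at index 0, string at index 1) is an assumption; only the (name, token_name, callback) filter shape comes from this page.

```php
<?php
// Hypothetical helper, not part of Luminous.
// ASSUMPTIONS: a token is an array with its name at index 0 and its string
// at index 1; a filter entry is array(name, token_name, callback).
function apply_token_filters(array $tokens, array $filters)
{
    foreach (array_keys($tokens) as $i) {
        foreach ($filters as $filter) {
            list($name, $token_name, $callback) = $filter;
            // Re-read the token name each time: an earlier filter may
            // have changed it.
            if ($tokens[$i][0] === $token_name) {
                $tokens[$i] = call_user_func($callback, $tokens[$i]);
            }
        }
    }
    return $tokens;
}

// A filter takes a token and returns a token.
function upper_keyword($token)
{
    $token[1] = strtoupper($token[1]);
    return $token;
}

$filters = array(array('upper-kw', 'KEYWORD', 'upper_keyword'));
$tokens = array(
    array('KEYWORD', 'select', false),
    array('IDENT', 'name', false),
);
$result = apply_token_filters($tokens, $filters);
```

Only the KEYWORD token is passed to the callback; the IDENT token is returned untouched.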
LuminousScanner::add_identifier_mapping ($name, $matches)
Adds an identifier mapping which is later analysed by map_identifier_filter.
$name | The token name |
$matches | An array of identifiers which correspond to this token name, e.g. add_identifier_mapping('KEYWORD', array('if', 'else', ...)); |
This method observes LuminousScanner::$case_sensitive
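The interaction between the identifier map and case sensitivity could look roughly like the sketch below. Both functions are hypothetical stand-ins, and lower-casing identifiers for case-insensitive languages is an assumption about how the flag is honoured, not something this page confirms.

```php
<?php
// Hypothetical stand-ins, not the Luminous methods.
// ASSUMPTIONS: the map has the form identifier_string => TOKEN_NAME, and
// case-insensitive languages lower-case identifiers on insert and lookup.
function add_mapping(array &$ident_map, $name, array $matches, $case_sensitive)
{
    foreach ($matches as $m) {
        if (!$case_sensitive) {
            $m = strtolower($m);
        }
        $ident_map[$m] = $name;
    }
}

// Maps an 'IDENT' token to the token name stored in the map, if any.
function map_identifier(array $token, array $ident_map, $case_sensitive)
{
    if ($token[0] !== 'IDENT') {
        return $token;
    }
    $key = $case_sensitive ? $token[1] : strtolower($token[1]);
    if (isset($ident_map[$key])) {
        $token[0] = $ident_map[$key];
    }
    return $token;
}

$map = array();
add_mapping($map, 'KEYWORD', array('if', 'else'), false);
$mapped = map_identifier(array('IDENT', 'ELSE', false), $map, false);
$unmapped = map_identifier(array('IDENT', 'foo', false), $map, false);
```

The token string keeps its original spelling; only the token name changes.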
LuminousScanner::add_stream_filter ($arg1, $arg2=null)
Adds a stream filter.
A stream filter receives the entire token stream and should return it.
The parameters are: ([name], filter). Name is an optional argument.
static LuminousScanner::guess_language ($src, $info)
Language guessing.
Each real language scanner should override this method and implement a simple guessing function to estimate how likely the input source code is to be the language which they recognise.
$src | the input source code |
LuminousScanner::highlight ($src)
LuminousScanner::init ()
Set up the scanner immediately prior to tokenization.
The init method is always called prior to main(). At this stage, all configuration variables are assumed to have been set, and the scanner can now perform any last-minute set-up. This may include finalizing its rule patterns. Scanners that are in no way dynamic may not need to override this method.
LuminousScanner::main ()
The method responsible for tokenization.
The main method is fully responsible for tokenizing the string stored in string() at the time of its call. By the time main returns, it should have consumed the whole of the string and populated the token array.
Reimplemented in LuminousStatefulScanner and LuminousSimpleScanner.
LuminousScanner::map_identifier_filter ($token)
Identifier mapping filter.
Tries to map any 'IDENT' token to a TOKEN_NAME in LuminousScanner::$ident_map. This is implemented as the filter 'map-ident'.
LuminousScanner::nestable_token ($token_name, $open, $close)
Handles tokens that may nest inside themselves.
Convenience function. Many languages allow constructs such as nestable comments. Handling these is easy but long-winded, so this function takes an opening and a closing delimiter and consumes the token until it is fully closed, or until the end of the string if it is unterminated.
When the function returns, the token will have been consumed and appended to the token stream.
$token_name | the name of the token |
$open | the opening delimiter pattern (regex), e.g. '% /\* %x' |
$close | the closing delimiter pattern (regex), e.g. '% \* /%x' |
Exception | if called at a non-matching point (i.e. $this->scan($open) does not match) |
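The nesting logic described above can be sketched as a standalone depth counter. next_match() and consume_nested() are hypothetical names, and the function returns an end offset instead of recording a token, so this illustrates the idea and is not the Luminous method itself.

```php
<?php
// Hypothetical standalone sketch, not the Luminous method.
// Returns array(offset, length) of the next match at or after $offset,
// or null if the pattern does not match.
function next_match($pattern, $src, $offset)
{
    if (preg_match($pattern, $src, $m, PREG_OFFSET_CAPTURE, $offset)) {
        return array($m[0][1], strlen($m[0][0]));
    }
    return null;
}

// Returns the offset just past the fully closed token that starts at $pos,
// or the length of $src if the token is unterminated.
function consume_nested($src, $pos, $open, $close)
{
    $first = next_match($open, $src, $pos);
    if ($first === null || $first[0] !== $pos) {
        throw new Exception('not at an opening delimiter');
    }
    $depth = 1;
    $cursor = $pos + $first[1];
    while ($depth > 0) {
        $o = next_match($open, $src, $cursor);
        $c = next_match($close, $src, $cursor);
        if ($c === null) {
            return strlen($src); // unterminated: consume to end of string
        }
        if ($o !== null && $o[0] < $c[0]) {
            $depth++;            // a nested opener comes first
            $cursor = $o[0] + $o[1];
        } else {
            $depth--;            // a closer comes first
            $cursor = $c[0] + $c[1];
        }
    }
    return $cursor;
}

$src = 'a /* x /* y */ z */ b';
$end = consume_nested($src, 2, '% /\* %x', '% \* /%x');
echo substr($src, 2, $end - 2), "\n";
```

Tracking a depth counter rather than recursing keeps the loop flat and makes the unterminated case a single early return.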
LuminousScanner::pop ()
Pops the top element of the stack, and returns it.
Exception | if the state stack is empty |
LuminousScanner::record ($string, $type, $pre_escaped=false)
Records a string as a given token type.
$string | The string to record |
$type | The name of the token the string represents |
$pre_escaped | Luminous ultimately emits XML, so $string has to be escaped at some point. If you have already escaped it (or if you got it from another scanner), set this to TRUE |
Exception | if $string is NULL |
Reimplemented in LuminousStatefulScanner.
LuminousScanner::record_range ($from, $to, $type)
Helper function to record a range of the string.
$from | the start index |
$to | the end index |
$type | the type of the token |
This is shorthand for $this->record(substr($this->string(), $from, $to-$from), $type).
RangeException | if the range is invalid (i.e. $to < $from) |
An empty range (i.e. $to === $from) is allowed, but it is essentially a no-op.
Reimplemented in LuminousStatefulScanner.
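The range rules above (invalid when $to < $from, empty when $to === $from) can be checked with a small standalone helper. slice_range() is a hypothetical name; it only extracts the substring and leaves the recording step out.

```php
<?php
// Hypothetical helper, not part of Luminous: extracts the substring that
// record_range() is described as recording.
function slice_range($string, $from, $to)
{
    if ($to < $from) {
        throw new RangeException("invalid range: $from to $to");
    }
    // An empty range ($to === $from) yields an empty string.
    return substr($string, $from, $to - $from);
}

$word = slice_range('hello world', 6, 11);
$empty = slice_range('hello world', 3, 3);
```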
protected LuminousScanner::rule_mapper_filter ($tokens)
Rule re-mapper filter.
Re-maps token rules according to the LuminousScanner::$rule_tag_map map. This is called as the filter 'rule-map'.
LuminousScanner::skip_whitespace ()
Skips whitespace, and records it as a null token.
Convenience function.
LuminousScanner::tagged ()
Returns the XML representation of the token stream.
This function triggers the generation of the XML output.
Reimplemented in LuminousStatefulScanner.
LuminousScanner::token_array ()
Gets the token array.
protected LuminousScanner::user_def_filter ($token)
Filter to highlight identifiers whose definitions are in the source.
Maps anything recorded in LuminousScanner::$user_defs to the recorded type. This is called as the filter 'user-defs'.
protected LuminousScanner::$case_sensitive = true
Whether or not the language is case sensitive.
Whether or not the scanner is dealing with a case sensitive language. This currently affects map_identifier_filter.
protected LuminousScanner::$filters = array()
Individual token filters.
A list of lists, each filter is an array: (name, token_name, callback)
protected LuminousScanner::$ident_map = array()
A map of identifiers and their corresponding token names.
A map of recognised identifiers, in the form identifier_string => TOKEN_NAME
This is currently used by map_identifier_filter
protected LuminousScanner::$rule_tag_map = array()
Rule remappings.
A map to handle re-mapping of rules, in the form: OLD_TOKEN_NAME => NEW_TOKEN_NAME
This is used by rule_mapper_filter()
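A remapping pass over the whole token stream might look like the following sketch. remap_rules() is a hypothetical stand-in for the filter, and storing the token name at index 0 is an assumption.

```php
<?php
// Hypothetical stand-in for a rule re-mapping pass, not the Luminous code.
// ASSUMPTION: the token name is stored at index 0 of each token.
function remap_rules(array $tokens, array $rule_tag_map)
{
    foreach ($tokens as &$t) {
        // Null-named tokens (e.g. whitespace) are left alone.
        if ($t[0] !== null && isset($rule_tag_map[$t[0]])) {
            $t[0] = $rule_tag_map[$t[0]];
        }
    }
    unset($t); // break the reference left over from the loop
    return $tokens;
}

$rule_tag_map = array('DOCCOMMENT' => 'COMMENT');
$tokens = array(
    array('DOCCOMMENT', '/** doc */', false),
    array(null, ' ', false),
    array('IDENT', 'x', false),
);
$remapped = remap_rules($tokens, $rule_tag_map);
```

Because it receives and returns the entire token array, a pass like this has the shape of a stream filter, not of an individual token filter.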
protected LuminousScanner::$state_ = array()
State stack.
A stack of the scanner's state, should the scanner wish to use a stack based state mechanism.
The top element can be retrieved (but not popped) with state().
TODO More useful functions for manipulating the stack
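The documented behaviour of push(), pop() and state() can be summarised in a small stand-in class. StateStack is a hypothetical name used only for this sketch; in Luminous these methods live on the scanner itself.

```php
<?php
// Hypothetical stand-in class that mirrors the documented behaviour of
// push(), pop() and state(); it is not the Luminous implementation.
class StateStack
{
    private $state_ = array();

    // Pushes some data onto the stack.
    public function push($state)
    {
        $this->state_[] = $state;
    }

    // Pops and returns the top element; throws if the stack is empty.
    public function pop()
    {
        if (empty($this->state_)) {
            throw new Exception('state stack is empty');
        }
        return array_pop($this->state_);
    }

    // Returns the top element without popping it, or null if empty.
    public function state()
    {
        if (empty($this->state_)) {
            return null;
        }
        return $this->state_[count($this->state_) - 1];
    }
}

$stack = new StateStack();
$stack->push('comment');
$stack->push('string');
$top = $stack->state();   // peeks, leaves the stack unchanged
$popped = $stack->pop();  // removes the top element
```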
protected LuminousScanner::$stream_filters = array()
Token stream filters.
A list of lists, each filter is an array: (name, callback)
protected LuminousScanner::$tokens = array()
The token stream.
The token stream is recorded as a flat array of tokens. A token is made up of 3 parts, and stored as an array:
protected LuminousScanner::$user_defs
Identifier remappings based on definitions identified in the source code.
A map of remappings of user-defined types/functions. This is a map of identifier_string => TOKEN_NAME
This is used by user_def_filter()