Module Sedlexing

Runtime support for lexers generated by sedlex.

This module is roughly equivalent to the module Lexing from the OCaml standard library, except that its lexbuffers handle Unicode code points (OCaml type: Uchar.t in the range 0..0x10ffff) instead of bytes (OCaml type: char).

It is possible to have sedlex-generated lexers work on a custom implementation of lexer buffers. To do this, define a module L which implements the start, next, mark and backtrack functions (see the Internal Interface section below for a specification). They need not operate on a type named lexbuf; you can use whatever type name you want. Then, in your sedlex-processed source, bind this module to the name Sedlexing (for instance, with a local module definition: let module Sedlexing = L in ...).
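As a sketch, a custom lexbuf over a pre-decoded array might look like the following. The module name L and the record representation are illustrative assumptions; only the four function names and the behavior specified in the Internal Interface section matter to the generated code.

```ocaml
(* Hypothetical sketch: a minimal custom lexbuf backed by a pre-decoded
   array of code points. Only start/next/mark/backtrack are required;
   the record representation is an illustrative assumption. *)
module L = struct
  type lexbuf = {
    src : Uchar.t array;     (* the whole input, already decoded *)
    mutable pos : int;       (* current position *)
    mutable marked : int;    (* backtrack position *)
    mutable slot : int;      (* internal integer slot *)
  }

  let start b = b.marked <- b.pos; b.slot <- -1

  let next b =
    if b.pos >= Array.length b.src then None
    else begin
      let u = b.src.(b.pos) in
      b.pos <- b.pos + 1;
      Some u
    end

  let mark b i = b.slot <- i; b.marked <- b.pos

  let backtrack b = b.pos <- b.marked; b.slot
end

(* In a sedlex-processed source, shadow the default runtime locally:
   let module Sedlexing = L in ... *)
```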

Of course, you'll probably want to define functions like lexeme to be used in the lexers' semantic actions.

type lexbuf

The type of lexer buffers. A lexer buffer is the argument passed to the scanning functions defined by the generated lexers. The lexer buffer holds the internal information for the scanners, including the code points of the token currently scanned, its position from the beginning of the input stream, and the current position of the lexer.

exception InvalidCodepoint of int

Raised by some functions to signal that some code point is not compatible with a specified encoding.

exception MalFormed

Raised by functions in the Utf8 and Utf16 modules to report strings that do not comply with the encoding.

Creating generic lexbufs
val create : ?bytes_per_char:(Stdlib.Uchar.t -> int) -> (Stdlib.Uchar.t array -> int -> int -> int) -> lexbuf

Create a generic lexer buffer. When the lexer needs more characters, it calls the given function with an array of Uchar.t values a, a position pos and a code point count n. The function should put at most n code points in a, starting at position pos, and return the number of code points provided. A return value of 0 means end of input. The bytes_per_char argument is optional; if unspecified, byte positions are the same as code point positions.
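A minimal sketch of such a refill function, feeding a lexbuf from an in-memory list of code points (the helper name make_lexbuf is an assumption):

```ocaml
(* Sketch: refill a generic lexbuf from a list of code points.
   [feed a pos n] writes at most [n] code points into [a] starting at
   [pos] and returns how many it wrote; 0 signals end of input. *)
let make_lexbuf (input : Uchar.t list) : Sedlexing.lexbuf =
  let remaining = ref input in
  let feed a pos n =
    let rec fill i =
      if i >= n then i
      else
        match !remaining with
        | [] -> i
        | u :: rest ->
          a.(pos + i) <- u;
          remaining := rest;
          fill (i + 1)
    in
    fill 0
  in
  Sedlexing.create feed
```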

val set_position : ?bytes_position:Stdlib.Lexing.position -> lexbuf -> Stdlib.Lexing.position -> unit

Set the initial tracked input position, in code points, for lexbuf. If bytes_position is unspecified, the byte position is set to the same value as the code point position.

val set_filename : lexbuf -> string -> unit

set_filename lexbuf file sets the filename to file in lexbuf. It also sets the Lexing.pos_fname field in returned Lexing.position records.

val from_gen : ?bytes_per_char:(Stdlib.Uchar.t -> int) -> Stdlib.Uchar.t Gen.t -> lexbuf

Create a lexbuf from a stream of Unicode code points. bytes_per_char is optional. If unspecified, byte positions are the same as code point positions.

val from_int_array : ?bytes_per_char:(Stdlib.Uchar.t -> int) -> int array -> lexbuf

Create a lexbuf from an array of Unicode code points. bytes_per_char is optional. If unspecified, byte positions are the same as code point positions.

val from_uchar_array : ?bytes_per_char:(Stdlib.Uchar.t -> int) -> Stdlib.Uchar.t array -> lexbuf

Create a lexbuf from an array of Unicode code points. bytes_per_char is optional. If unspecified, byte positions are the same as code point positions.
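For example, two equivalent ways to build a lexbuf over the text "hi" followed by U+1F600 (the bindings lb1 and lb2 are illustrative):

```ocaml
(* Build a lexbuf from raw code point values, or from Uchar.t values. *)
let lb1 = Sedlexing.from_int_array [| 0x68; 0x69; 0x1F600 |]
let lb2 =
  Sedlexing.from_uchar_array
    [| Uchar.of_int 0x68; Uchar.of_int 0x69; Uchar.of_int 0x1F600 |]
```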

Interface for lexers' semantic actions

The following functions can be called from the semantic actions of lexer definitions. They give access to the character string matched by the regular expression associated with the semantic action.

val lexeme_start : lexbuf -> int

Sedlexing.lexeme_start lexbuf returns the offset in the input stream of the first code point of the matched string. The first code point of the stream has offset 0.

val lexeme_bytes_start : lexbuf -> int

Sedlexing.lexeme_bytes_start lexbuf returns the offset in the input stream of the first byte of the matched string. The first byte of the stream has offset 0.

val lexeme_end : lexbuf -> int

Sedlexing.lexeme_end lexbuf returns the offset in the input stream of the code point following the last code point of the matched string. The first code point of the stream has offset 0.

val lexeme_bytes_end : lexbuf -> int

Sedlexing.lexeme_bytes_end lexbuf returns the offset in the input stream of the byte following the last byte of the matched string. The first byte of the stream has offset 0.

val loc : lexbuf -> int * int

Sedlexing.loc lexbuf returns the pair (Sedlexing.lexeme_start lexbuf,Sedlexing.lexeme_end lexbuf).

val bytes_loc : lexbuf -> int * int

Sedlexing.bytes_loc lexbuf returns the pair (Sedlexing.lexeme_bytes_start lexbuf,Sedlexing.lexeme_bytes_end lexbuf).

val lexeme_length : lexbuf -> int

Sedlexing.lexeme_length lexbuf returns the difference (Sedlexing.lexeme_end lexbuf) - (Sedlexing.lexeme_start lexbuf), that is, the length (in code points) of the matched string.
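The offset functions above are typically combined inside a semantic action to report where a token was found. A sketch (the function name describe is an assumption):

```ocaml
(* Sketch: describe the current token's code-point span from within a
   semantic action. *)
let describe lexbuf =
  let start, stop = Sedlexing.loc lexbuf in
  Printf.sprintf "token at [%d, %d), length %d"
    start stop (Sedlexing.lexeme_length lexbuf)
```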

val lexeme_bytes_length : lexbuf -> int

Sedlexing.lexeme_bytes_length lexbuf returns the difference (Sedlexing.lexeme_bytes_end lexbuf) - (Sedlexing.lexeme_bytes_start lexbuf), that is, the length (in bytes) of the matched string.

val lexing_positions : lexbuf -> Stdlib.Lexing.position * Stdlib.Lexing.position

Sedlexing.lexing_positions lexbuf returns the start and end positions, in code points, of the current token, using a record of type Lexing.position. This is intended for consumption by parsers like those generated by Menhir.

val lexing_position_start : lexbuf -> Stdlib.Lexing.position

Sedlexing.lexing_position_start lexbuf returns the start position, in code points, of the current token.

val lexing_position_curr : lexbuf -> Stdlib.Lexing.position

Sedlexing.lexing_position_curr lexbuf returns the end position, in code points, of the current token.

val lexing_bytes_positions : lexbuf -> Stdlib.Lexing.position * Stdlib.Lexing.position

Sedlexing.lexing_bytes_positions lexbuf returns the start and end positions, in bytes, of the current token, using a record of type Lexing.position. This is intended for consumption by parsers like those generated by Menhir.

val lexing_bytes_position_start : lexbuf -> Stdlib.Lexing.position

Sedlexing.lexing_bytes_position_start lexbuf returns the start position, in bytes, of the current token.

val lexing_bytes_position_curr : lexbuf -> Stdlib.Lexing.position

Sedlexing.lexing_bytes_position_curr lexbuf returns the end position, in bytes, of the current token.

val new_line : lexbuf -> unit

Sedlexing.new_line lexbuf increments the line count and sets the beginning of line to the current position, as though a newline character had been encountered in the input.

val lexeme : lexbuf -> Stdlib.Uchar.t array

Sedlexing.lexeme lexbuf returns the string matched by the regular expression as an array of Unicode code points.
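Since lexeme returns raw code points, converting the match to an OCaml string takes an explicit encoding step. A sketch using only the standard library (the helper name is an assumption; see also the Utf8 module below):

```ocaml
(* Sketch: UTF-8 encode the matched code points with Stdlib.Buffer. *)
let lexeme_to_utf8 lexbuf =
  let a = Sedlexing.lexeme lexbuf in
  let b = Buffer.create (Array.length a * 4) in
  Array.iter (Buffer.add_utf_8_uchar b) a;
  Buffer.contents b
```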

val lexeme_char : lexbuf -> int -> Stdlib.Uchar.t

Sedlexing.lexeme_char lexbuf pos returns code point number pos in the matched string.

val sub_lexeme : lexbuf -> int -> int -> Stdlib.Uchar.t array

Sedlexing.sub_lexeme lexbuf pos len returns the sub-array of len code points, starting at code point pos, of the string matched by the regular expression.

type submatch = {
  lexbuf : lexbuf;
  pos : int;
  len : int;
}

A submatch captures a sub-pattern matched by an as binding. It carries the lexbuf and the position/length of the submatch (in code points, relative to the start of the current token). Use the extraction functions below to obtain the matched content in the desired encoding.
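A sketch of extracting a submatch as a UTF-8 string (the helper name is an assumption):

```ocaml
(* Sketch: turn a submatch into a UTF-8 encoded string. *)
let submatch_to_utf8 (s : Sedlexing.submatch) =
  let a = Sedlexing.lexeme_of_submatch s in
  let b = Buffer.create (Array.length a * 4) in
  Array.iter (Buffer.add_utf_8_uchar b) a;
  Buffer.contents b
```

Because pos and len are token-relative code point offsets, the same content could also be read with Sedlexing.sub_lexeme s.lexbuf s.pos s.len.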

val lexeme_of_submatch : submatch -> Stdlib.Uchar.t array

Sedlexing.lexeme_of_submatch s returns the submatch as an array of Unicode code points.

val rollback : lexbuf -> unit

Sedlexing.rollback lexbuf puts lexbuf back in its configuration before the last lexeme was matched. It is then possible to use another lexer to parse the same characters again. The other functions above in this section should not be used in the semantic action after a call to Sedlexing.rollback.
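A typical use is dispatching between two lexers over the same input. In this sketch, strict, fallback and the Reject exception are all hypothetical names:

```ocaml
(* Sketch: try a strict lexer first; if it signals failure, roll back
   and re-lex the same characters with a fallback lexer. *)
let lex lexbuf =
  match strict lexbuf with
  | tok -> tok
  | exception Reject ->
    Sedlexing.rollback lexbuf;
    fallback lexbuf
```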

val with_tokenizer : (lexbuf -> 'token) -> lexbuf -> unit -> 'token * Stdlib.Lexing.position * Stdlib.Lexing.position

with_tokenizer tokenizer lexbuf, given a lexer and a lexbuf, returns a generator of tokens annotated with positions. This generator can be used with the Menhir parser generator's incremental API.
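A sketch of the usual Menhir hookup, assuming a Menhir-generated parser with entry point Parser.main and a sedlex-generated lexer token (both names are hypothetical):

```ocaml
(* Sketch: drive a Menhir parser from a sedlex lexer via the
   revised/incremental interface of MenhirLib. *)
let parse lexbuf =
  let lexer = Sedlexing.with_tokenizer token lexbuf in
  let parser =
    MenhirLib.Convert.Simplified.traditional2revised Parser.main
  in
  parser lexer
```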

Internal interface

These functions are used internally by the lexers. They could be used to write lexers by hand, or with a lexer generator different from sedlex. The lexer buffers have a unique internal slot that can store an integer. They also store a "backtrack" position.

val start : lexbuf -> unit

start lexbuf informs the lexer buffer that any code points until the current position can be discarded. The current position becomes the "start" position as returned by Sedlexing.lexeme_start. Moreover, the internal slot is set to -1 and the backtrack position is set to the current position.

val next : lexbuf -> Stdlib.Uchar.t option

next lexbuf extracts the next code point from the lexer buffer and increments the current position. If the input stream is exhausted, the function returns None. If a '\n' is encountered, the tracked line number is incremented.

val mark : lexbuf -> int -> unit

mark lexbuf i stores the integer i in the internal slot. The backtrack position is set to the current position. If the lexbuf has tagged DFA memory cells (from as bindings), the current cell values are snapshotted so they can be restored by backtrack.

val backtrack : lexbuf -> int

backtrack lexbuf returns the value stored in the internal slot of the buffer, and performs backtracking (the current position is set to the value of the backtrack position). If the lexbuf has tagged DFA memory cells, they are restored to the values saved by the last mark call, so that sub-match positions reflect the last accepting state.

val __private__next_int : lexbuf -> int

__private__next_int lexbuf extracts the next code point from the lexer buffer and increments the current position. If the input stream is exhausted, the function returns -1. If a '\n' is encountered, the tracked line number is incremented.

This is a private API, it should not be used by code using this module's API and can be removed at any time.

Tagged DFA memory cells for as bindings.

The following functions manage an internal array of memory cells used to record sub-match positions during DFA execution. Cells store either positions (>= 0) or encoded integer values (<= -2). The sentinel -1 means "unset". Positions are automatically adjusted when the internal buffer is compacted, and converted to token-relative offsets on read by __private__mem_pos.

This is a private API used by generated code and may change at any time.

val __private__init_mem : lexbuf -> int -> unit

__private__init_mem lexbuf n ensures at least n memory cells are available, resetting all cells to -1 (unset). Called once at the start of each match%sedlex block that uses as bindings.

val __private__set_mem_pos : lexbuf -> int -> unit

__private__set_mem_pos lexbuf i records the current position in cell i, for later retrieval by __private__mem_pos. Used by Set_position tag operations on DFA transitions.

val __private__set_mem_value : lexbuf -> int -> int -> unit

__private__set_mem_value lexbuf i v stores integer v in cell i, encoded as -(v + 2) so it is disjoint from positions and the unset sentinel. Used by Set_value tag operations for or-pattern discriminators.

val __private__mem_pos : lexbuf -> int -> int

__private__mem_pos lexbuf i returns the position stored in cell i, as an offset relative to the start of the current token.

val __private__mem_value : lexbuf -> int -> int

__private__mem_value lexbuf i decodes and returns the integer value stored in cell i (reverses the -(v + 2) encoding).

val __private__num_mem_cells : lexbuf -> int

Returns the current number of allocated memory cells.

Support for common encodings
module Latin1 : sig ... end
module Utf8 : sig ... end
module Utf16 : sig ... end