HFST - Helsinki Finite-State Transducer Technology - Python API  version 3.9.0
HfstTokenizer Class Reference

A tokenizer for creating transducers from UTF-8 strings. More...

Public Member Functions

def __init__
 Create a tokenizer that recognizes utf-8 symbols. More...
 
def add_multichar_symbol
 Add a multicharacter symbol symbol to this tokenizer. More...
 
def add_skip_symbol
 Add a symbol to be skipped to this tokenizer. More...
 
def check_utf8_correctness
 If input_string is not valid utf-8, throw an IncorrectUtf8CodingException. More...
 
def tokenize
 Tokenize the string input_string. More...
 
def tokenize
 Tokenize the string pair input_string : output_string. More...
 
def tokenize_one_level
 Tokenize the string input_string. More...
 
def tokenize_space_separated
 Tokenize str and skip all spaces. More...
 

Detailed Description

A tokenizer for creating transducers from UTF-8 strings.

Strings are tokenized from left to right using longest match tokenization. For example, if the tokenizer contains a multicharacter symbol 'foo' and a skip symbol 'fo', the string "foo" is tokenized as 'foo:foo'. If the tokenizer contains a multicharacter symbol 'fo' and a skip symbol 'foo', the string "foo" is tokenized as an empty string.

An example:
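The following pure-Python sketch mimics the longest-match behaviour described above. It is an illustration only: the real HfstTokenizer is implemented in C++, and the function and parameter names here are invented for this sketch.

```python
def tokenize(text, multichar_symbols=(), skip_symbols=()):
    """Tokenize `text` left to right, always taking the longest match
    among multicharacter symbols, skip symbols, and single characters.
    Matched skip symbols are consumed but produce no token."""
    # Try longer symbols first to get longest-match behaviour.
    symbols = sorted(set(multichar_symbols) | set(skip_symbols),
                     key=len, reverse=True)
    skips = set(skip_symbols)
    tokens, i = [], 0
    while i < len(text):
        for sym in symbols:
            if text.startswith(sym, i):
                if sym not in skips:
                    tokens.append(sym)
                i += len(sym)
                break
        else:
            # No multichar or skip symbol matched: emit one character.
            tokens.append(text[i])
            i += 1
    return tokens

# Multichar 'foo', skip 'fo': "foo" -> ['foo']
print(tokenize("foo", multichar_symbols=["foo"], skip_symbols=["fo"]))
# Multichar 'fo', skip 'foo': "foo" -> [] (tokenized as an empty string)
print(tokenize("foo", multichar_symbols=["fo"], skip_symbols=["foo"]))
```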

 
Note
The tokenizer only tokenizes utf-8 strings. Special symbols are not included in the tokenizer unless explicitly added to it.

Constructor & Destructor Documentation

def __init__ (   self)

Create a tokenizer that recognizes utf-8 symbols.

Member Function Documentation

def add_multichar_symbol (   self,
  symbol 
)

Add a multicharacter symbol symbol to this tokenizer.

If a multicharacter symbol contains a skip symbol, it is not matched as a multicharacter symbol. For example, if we have a multicharacter symbol 'foo' and a skip symbol 'bar', the string "fobaro" is tokenized as 'f' 'o' 'o', not as 'foo'.

def add_skip_symbol (   self,
  symbol 
)

Add a symbol to be skipped to this tokenizer.

After a skip symbol is consumed, tokenization starts again from scratch. For example, if we have a multicharacter symbol 'foo' and a skip symbol 'bar', the string "fobaro" is tokenized as 'f' 'o' 'o', not as 'foo'.

def check_utf8_correctness (   input_string)

If input_string is not valid utf-8, throw an IncorrectUtf8CodingException.

A string is invalid if:

  • It contains one of the unsigned bytes 192, 193, 245, 246 or 247.
  • It is not made up of sequences of one initial byte (0xxxxxxx, 110xxxxx, 1110xxxx or 11110xxx) followed by the appropriate number of continuation bytes (10xxxxxx).
    1. Initial bytes 0xxxxxxx represent ASCII chars and may not be followed by a continuation byte.
    2. Initial bytes 110xxxxx are followed by exactly one continuation byte.
    3. Initial bytes 1110xxxx are followed by exactly two continuation bytes.
    4. Initial bytes 11110xxx are followed by exactly three continuation bytes. (For reference: http://en.wikipedia.org/wiki/UTF-8)

This function is static.
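The rules above can be sketched in pure Python as follows. This is an illustration of the listed conditions, not the HFST implementation (which lives in the C++ core and throws IncorrectUtf8CodingException rather than ValueError):

```python
def check_utf8_correctness(data: bytes):
    """Raise ValueError if `data` is not valid utf-8 by the rules above."""
    i = 0
    while i < len(data):
        b = data[i]
        # Rule 1: a handful of bytes never occur in valid utf-8.
        if b in (192, 193, 245, 246, 247):
            raise ValueError("forbidden byte %d at offset %d" % (b, i))
        # Rule 2: classify the initial byte to find how many
        # continuation bytes (10xxxxxx) must follow it.
        if b < 0x80:              # 0xxxxxxx: ASCII, no continuation bytes
            n = 0
        elif b >> 5 == 0b110:     # 110xxxxx: one continuation byte
            n = 1
        elif b >> 4 == 0b1110:    # 1110xxxx: two continuation bytes
            n = 2
        elif b >> 3 == 0b11110:   # 11110xxx: three continuation bytes
            n = 3
        else:                     # stray continuation byte or >= 248
            raise ValueError("invalid initial byte at offset %d" % i)
        for j in range(i + 1, i + 1 + n):
            if j >= len(data) or data[j] >> 6 != 0b10:
                raise ValueError("missing continuation byte at offset %d" % j)
        i += n + 1

check_utf8_correctness("ä".encode("utf-8"))   # valid: no exception
```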

def tokenize (   self,
  input_string 
)

Tokenize the string input_string.

Returns
A tuple of string pairs.

def tokenize (   self,
  input_string,
  output_string 
)

Tokenize the string pair input_string : output_string.

If one string has more tokens than the other, epsilons are inserted at the end of the tokenized string with fewer tokens so that both tokenized strings have the same number of tokens.

Returns
A tuple of string pairs.
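The epsilon padding described above can be sketched as follows. This is an illustration; the function name is invented, and "@_EPSILON_SYMBOL_@" as the epsilon notation is an assumption of this sketch:

```python
# Assumption: HFST conventionally writes epsilon as "@_EPSILON_SYMBOL_@".
EPSILON = "@_EPSILON_SYMBOL_@"

def pair_tokenize(input_tokens, output_tokens):
    """Pad the shorter token list with epsilons, then zip the two
    lists into a tuple of (input, output) string pairs."""
    n = max(len(input_tokens), len(output_tokens))
    ins = list(input_tokens) + [EPSILON] * (n - len(input_tokens))
    outs = list(output_tokens) + [EPSILON] * (n - len(output_tokens))
    return tuple(zip(ins, outs))

# The extra input token 's' is paired with an epsilon on the output side.
print(pair_tokenize(["c", "a", "t", "s"], ["c", "a", "t"]))
```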

def tokenize_one_level (   self,
  input_string 
)

Tokenize the string input_string.

Returns
A tuple of strings.

def tokenize_space_separated (   self,
  str 
)

Tokenize str and skip all spaces.

Returns
A tuple of strings.
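A plain-Python approximation of this behaviour, assuming the space character is the only separator and runs of spaces are skipped entirely (an illustration, not the HFST implementation):

```python
def tokenize_space_separated(text):
    """Split `text` on spaces, dropping the spaces themselves,
    and return the remaining tokens as a tuple."""
    return tuple(tok for tok in text.split(" ") if tok)

print(tokenize_space_separated("foo bar  baz"))  # ('foo', 'bar', 'baz')
```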

The documentation for this class was generated from the following file: