HFST - Helsinki Finite-State Transducer Technology - Python API
version 3.9.0
A tokenizer for creating transducers from UTF-8 strings.
Public Member Functions

__init__(self)
    Create a tokenizer that recognizes utf-8 symbols.
add_multichar_symbol(self, symbol)
    Add a multicharacter symbol symbol to this tokenizer.
add_skip_symbol(self, symbol)
    Add a symbol to be skipped to this tokenizer.
check_utf8_correctness(input_string)
    If input_string is not valid utf-8, throw an IncorrectUtf8CodingException.
tokenize(self, input_string)
    Tokenize the string input_string.
tokenize(self, input_string, output_string)
    Tokenize the string pair input_string : output_string.
tokenize_one_level(self, input_string)
    Tokenize the string input_string.
tokenize_space_separated(self, str)
    Tokenize str and skip all spaces.
A tokenizer for creating transducers from UTF-8 strings.
Strings are tokenized from left to right using longest match tokenization. For example, if the tokenizer contains a multicharacter symbol 'foo' and a skip symbol 'fo', the string "foo" is tokenized as 'foo:foo'. If the tokenizer contains a multicharacter symbol 'fo' and a skip symbol 'foo', the string "foo" is tokenized as an empty string.
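The original example was lost in extraction and cannot be recovered from this page. As a substitute, here is a minimal pure-Python sketch of the longest-match behavior described above; the class `ToyTokenizer` is a toy model written for illustration, not the HFST API:

```python
class ToyTokenizer:
    """Toy model of longest-match tokenization with multichar and skip symbols."""

    def __init__(self):
        self.multichar = set()  # symbols emitted as single tokens
        self.skip = set()       # symbols silently dropped from the output

    def add_multichar_symbol(self, symbol):
        self.multichar.add(symbol)

    def add_skip_symbol(self, symbol):
        self.skip.add(symbol)

    def tokenize(self, text):
        tokens, i = [], 0
        while i < len(text):
            # The longest match at position i wins, whether it is a
            # multicharacter symbol or a skip symbol.
            match = max((s for s in self.multichar | self.skip
                         if text.startswith(s, i)),
                        key=len, default=None)
            if match is None:
                tokens.append(text[i])  # fall back to a single character
                i += 1
            elif match in self.skip:
                i += len(match)         # skip symbol: drop it, restart matching
            else:
                tokens.append(match)    # emit the multichar symbol as one token
                i += len(match)
        return tokens
```

This model reproduces the cases above: with multicharacter symbol 'foo' and skip symbol 'fo', "foo" tokenizes to ['foo']; with multicharacter symbol 'fo' and skip symbol 'foo', "foo" tokenizes to the empty list; and with multicharacter symbol 'foo' and skip symbol 'bar', "fobaro" tokenizes to ['f', 'o', 'o'].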
__init__(self)
Create a tokenizer that recognizes utf-8 symbols.
add_multichar_symbol(self, symbol)
Add a multicharacter symbol symbol to this tokenizer.
If a multicharacter symbol has a skip symbol inside it, it is not treated as a multicharacter symbol. For example, if we have a multicharacter symbol 'foo' and a skip symbol 'bar', the string "fobaro" will be tokenized as 'f' 'o' 'o', not as 'foo'.
add_skip_symbol(self, symbol)
Add a symbol to be skipped to this tokenizer.
After skipping a symbol, tokenization always starts again. For example, if we have a multicharacter symbol 'foo' and a skip symbol 'bar', the string "fobaro" will be tokenized as 'f' 'o' 'o', not as 'foo'.
check_utf8_correctness(input_string)
If input_string is not valid utf-8, throw an IncorrectUtf8CodingException.
A string is considered invalid if it cannot be decoded as well-formed UTF-8.
This is a static function.
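HFST's exact validity checks are not recoverable from this page, but the behavior can be modeled with Python's own strict UTF-8 decoder. The exception class and helper below are stand-ins written for illustration, not the HFST implementations:

```python
class IncorrectUtf8CodingException(ValueError):
    """Stand-in for the exception type named in the documentation above."""

def check_utf8_correctness(data: bytes) -> None:
    """Raise IncorrectUtf8CodingException if data is not valid UTF-8."""
    try:
        # Python's strict decoder rejects malformed byte sequences,
        # e.g. invalid start bytes and truncated continuation sequences.
        data.decode('utf-8', errors='strict')
    except UnicodeDecodeError as exc:
        raise IncorrectUtf8CodingException(str(exc)) from exc
```

For example, `check_utf8_correctness('käsi'.encode('utf-8'))` passes silently, while `check_utf8_correctness(b'\xc0\xaf')` raises, since 0xC0 is not a valid UTF-8 start byte.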
tokenize(self, input_string)
Tokenize the string input_string.
tokenize(self, input_string, output_string)
Tokenize the string pair input_string : output_string.
If one string has more tokens than the other, epsilons are appended to the end of the tokenized string with fewer tokens, so that both tokenized strings have the same number of tokens.
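The padding rule can be sketched in plain Python. The function below is a toy model, not the HFST API, and the epsilon marker is an assumption borrowed from HFST's internal notation:

```python
EPSILON = '@_EPSILON_SYMBOL_@'  # assumed epsilon marker, per HFST's internal notation

def pad_with_epsilons(input_tokens, output_tokens):
    """Pad the shorter token list with epsilons so both lists have equal length."""
    diff = len(input_tokens) - len(output_tokens)
    if diff > 0:
        output_tokens = output_tokens + [EPSILON] * diff
    elif diff < 0:
        input_tokens = input_tokens + [EPSILON] * (-diff)
    # Pair up the tokens, one input token against one output token.
    return list(zip(input_tokens, output_tokens))
```

For example, pairing the token lists ['f', 'o', 'o'] and ['b', 'a'] yields [('f', 'b'), ('o', 'a'), ('o', EPSILON)].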
tokenize_one_level(self, input_string)
Tokenize the string input_string.
tokenize_space_separated(self, str)
Tokenize str and skip all spaces.
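The space-skipping behavior can be modeled with a one-line toy function (not the HFST API), since splitting on whitespace keeps only the non-space chunks as tokens:

```python
def tokenize_space_separated(text: str):
    # str.split() with no arguments drops all runs of whitespace,
    # so only the space-separated chunks survive as tokens.
    return text.split()
```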