Datatypes for Symbols and Symbol Alphabets | |
typedef Symbol | Symbol |
A handle for a symbol name, i.e. a string. | |
typedef SymbolSet | SymbolSet |
A set of symbols aka an alphabet of symbols. | |
typedef SymbolIterator | SymbolIterator |
Iterator over the symbols in a SymbolSet. | |
typedef SymbolPair | SymbolPair |
A pair of symbols representing a transition in a transducer. | |
typedef SymbolPairSet | SymbolPairSet |
A set of symbol pairs aka an alphabet of symbol pairs. | |
typedef SymbolPairIterator | SymbolPairIterator |
Iterator over the set of symbol pairs in a SymbolPairSet. | |
typedef KeyTable | KeyTable |
A table for storing Key-to-Symbol associations. | |
Defining and Using Symbols | |
Symbol | define_symbol (const char *s) |
Define a symbol with name s. | |
bool | is_symbol (const char *s) |
Whether the string s indicates a name for a symbol. | |
Symbol | get_symbol (const char *s) |
Find the symbol for the symbol name s. | |
const char * | get_symbol_name (Symbol s) |
Find the symbol name for the symbol s. | |
bool | is_equal (Symbol s1, Symbol s2) |
Whether the symbol s1 is identical to symbol s2. | |
Defining and Using Alphabets of Symbols | |
SymbolSet * | create_empty_symbol_set () |
Define an empty set of symbols. | |
SymbolSet * | insert_symbol (Symbol s, SymbolSet *Si) |
Insert s into the set of symbols Si and return the updated set. | |
bool | has_symbol (Symbol s, SymbolSet *Si) |
Whether symbol s is a member of the set of symbols Si. | |
Iterators over Symbols | |
SymbolIterator | begin_sigma_symbol (SymbolSet *Si) |
Beginning of the iterator for the symbol set Si. | |
SymbolIterator | end_sigma_symbol (SymbolSet *Si) |
End of the iterator for the symbol set Si. | |
size_t | size_sigma_symbol (SymbolSet *Si) |
Size of the iterator for the symbol set Si. | |
Symbol | get_sigma_symbol (SymbolIterator Si) |
Get the symbol pointed to by the symbol iterator Si. | |
Defining and Using Symbol Pairs | |
SymbolPair * | define_symbolpair (Symbol s1, Symbol s2) |
Define a symbol pair with input symbol s1 and output symbol s2. | |
Symbol | get_input_symbol (SymbolPair *s) |
Get the input symbol of SymbolPair s. | |
Symbol | get_output_symbol (SymbolPair *s) |
Get the output symbol of SymbolPair s. | |
Defining and Using Alphabets of Symbol Pairs | |
SymbolPairSet * | create_empty_symbolpair_set () |
Define an empty set of symbol pairs. | |
SymbolPairSet * | insert_symbolpair (SymbolPair *p, SymbolPairSet *Pi) |
Insert p into the set of symbol pairs Pi and return the updated set. | |
bool | has_symbolpair (SymbolPair *p, SymbolPairSet *Pi) |
Whether symbol pair p is a member of the set of symbol pairs Pi. | |
Iterators over Symbol Pairs | |
SymbolPairIterator | begin_pi_symbol (SymbolPairSet *Pi) |
Beginning of the iterator for the symbol pair set Pi. | |
SymbolPairIterator | end_pi_symbol (SymbolPairSet *Pi) |
End of the iterator for the symbol pair set Pi. | |
size_t | size_pi_symbol (SymbolPairSet *Pi) |
Size of the iterator for the symbol pair set Pi. | |
SymbolPair * | get_pi_symbolpair (SymbolPairIterator pi) |
Get the symbol pair pointed to by the symbol pair iterator pi. | |
Defining the Connection between Symbols and Transducer Keys | |
The relation 1:N between keys and symbols is useful for dealing with equivalence classes of symbols. | |
KeyTable * | create_key_table () |
Create an empty key table. | |
bool | is_key (Key i, KeyTable *T) |
Whether i indicates an existing key in key table T. | |
bool | is_symbol (Symbol s, KeyTable *T) |
Whether s indicates an existing symbol in key table T. | |
void | associate_key (Key i, KeyTable *T, Symbol s) |
Associate the key i in the key table T with the symbol s. | |
Key | get_key (Symbol s, KeyTable *T) |
Find the key for the symbol s in key table T. | |
Key | get_unused_key (KeyTable *T) |
Return a Key which hasn't been associated to any symbol in key table T. | |
Symbol | get_key_symbol (Key i, KeyTable *T) |
Find a symbol for the key i in key table T. | |
KeySet * | get_key_set (KeyTable *T) |
A set of keys in key table T. | |
SymbolSet * | get_symbol_set (KeyTable *T) |
A set of symbols in key table T. | |
KeyTable * | read_symbol_table (istream &is, bool binary=false) |
Read a symbol table from istream is and transform it to a key table. binary defines whether the symbol table is in binary or text format. | |
void | write_symbol_table (KeyTable *T, ostream &os, bool binary=false) |
Transform the key table T to a symbol table and write it to ostream os. binary defines whether the symbol table is written in binary or text format. | |
KeyTable * | gather_flag_diacritic_table (KeyTable *kt) |
Return a new key table only including those key/symbol pairs which correspond to flag-diacritic symbol names. | |
Reading Symbol Strings and Transducers | |
Read transducers (1) in text format from pair strings and input streams and
(2) in binary format from files and input streams so that the keys used in the transducer are harmonized according to a key table. | |
TransducerHandle | longest_match_tokenizer (KeySet *ks, KeyTable *kt) |
Create a left to right longest match tokenizer for symbols in key set ks. | |
TransducerHandle | longest_match_tokenizer2 (KeyTable *kt) |
Create a left to right longest match tokenizer for the symbols in key table kt. | |
KeyTable * | recode_key_table (KeyTable *kt, const char *epsilon_replacement) |
Replace the epsilon in kt with epsilon_replacement. | |
KeyPairVector * | tokenize_string_pair (TransducerHandle tokeniser, const char *upper, const char *lower, KeyTable *inputKeys) |
Change two strings into a transducer aligned character by character according to tokenisation by tokeniser. The path(s) resulting from the composition of the strings' UTF-8 representations against tokeniser are paired up from beginning to end. Empty positions at the end are filled with ε’s. | |
KeyVector * | tokenize_string (TransducerHandle tokeniser, const char *string, KeyTable *inputKeys) |
Change the string string into an identity pair transducer as tokenised by tokeniser. | |
KeyVector * | longest_match_tokenize (TransducerHandle tokenizer, const char *string, KeyTable *inputKeys) |
Use tokenizer to tokenize string. | |
KeyPairVector * | longest_match_tokenize_pair (TransducerHandle tokenizer, const char *string1, const char *string2, KeyTable *inputKeys) |
Use tokenizer to tokenize string1 and string2 and align the tokenized strings to a key pair vector. | |
KeyPairVector * | tokenize_pair_string (TransducerHandle tokeniser, char *pairs, KeyTable *inputKeys) |
Tokenise with tokeniser the string pairs, which consists of individual characters and colon-separated pairs, into a transducer. | |
TransducerHandle | pairstring_to_transducer (const char *str, KeyTable *T) |
Create a one-path transducer as defined in pairstring form in str using the symbols defined in key table T. | |
TransducerHandle | read_transducer_text (istream &is, KeyTable *T, bool sfst=false) |
Make a transducer as defined in text form in istream is using the key-to-printname relations defined in key table T. The parameter sfst defines whether SFST text format is used, otherwise AT&T format is used. | |
bool | has_symbol_table (istream &is) |
Whether the transducer coming from istream is has a symbol table stored with it. | |
TransducerHandle | read_transducer (istream &is, KeyTable *T) |
Read a transducer in binary form from input stream is and harmonize it according to the key table T. | |
TransducerHandle | harmonize_transducer (TransducerHandle t, KeyTable *T_old, KeyTable *T_new) |
Harmonize transducer t that uses key table T_old according to key table T_new. | |
Writing Symbol Strings and Transducers | |
Write transducers (1) in text format into pair strings and output streams and
(2) in binary format to output streams so that the print names associated to keys are stored with the transducer. | |
char * | transducer_to_pairstring (TransducerHandle t, KeyTable *T, bool spaces=true, bool print_epsilons=true) |
A pairstring representation of one-path transducer t using the symbols defined in key table T. spaces defines whether pairs are separated by spaces. | |
void | print_transducer (TransducerHandle t, KeyTable *T, bool print_weights=false, ostream &ostr=std::cout, bool old=false) |
Print transducer t in text format using the symbols defined in key table T. The parameter print_weights indicates whether weights are included, the output stream ostr indicates where printing is directed. Parameter old indicates whether transducer t should be printed in old SFST text format instead of AT&T format. | |
void | write_transducer (TransducerHandle t, KeyTable *T, ostream &os=std::cout, bool backwards_compatibility=false) |
Write t in binary form to output stream os. Key table T is stored with the transducer. | |
void | write_runtime_transducer (TransducerHandle t, KeyTable *kt, FILE *output_file) |
Write a transducer t with key table kt into file output_file. Write its symbols into the file with name symbol_file_name. |
A table for storing Key-to-Symbol associations.
A key can be associated to several symbols but a symbol is associated to only one key.
Definition at line 57 of file symbol-layer.h.
A handle for a symbol name, i.e. a string.
Symbol is the type of a handle for a symbol that can occur in a cell of an input or output tape, or as an input or output label of a transition in a transducer. It also covers special-use symbols that do not occur on tapes but only as input or output transition labels with a special interpretation, e.g. any, default, failure, etc., which is indicated by an attribute of the transducer.
There is a global, session-specific table of Symbol-to-string relations, called the global symbol cache. In the symbol cache, one Symbol is associated with one string and for one string there is one Symbol representing it, i.e. the relation between strings and Symbols is one-to-one.
Definition at line 34 of file symbol-layer.h.
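The one-to-one relation between strings and Symbols can be illustrated with a small stand-in. This is only a sketch: SketchSymbol and SketchSymbolCache are hypothetical names, and the code models the described behaviour with standard containers rather than showing how the global symbol cache is actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the global symbol cache; not the library's code.
typedef std::size_t SketchSymbol;

class SketchSymbolCache {
public:
    // Return the handle for name, creating one on first use (cf. define_symbol).
    SketchSymbol define(const std::string &name) {
        std::map<std::string, SketchSymbol>::const_iterator it = by_name_.find(name);
        if (it != by_name_.end()) return it->second;  // same string, same handle
        SketchSymbol handle = names_.size();
        names_.push_back(name);
        by_name_[name] = handle;
        return handle;
    }

    // Whether name already has a handle (cf. is_symbol).
    bool contains(const std::string &name) const {
        return by_name_.count(name) != 0;
    }

    // Name for a handle previously returned by define (cf. get_symbol_name).
    const std::string &name_of(SketchSymbol handle) const {
        assert(handle < names_.size());
        return names_[handle];
    }

private:
    std::vector<std::string> names_;               // handle -> string
    std::map<std::string, SketchSymbol> by_name_;  // string -> handle
};
```

Defining the same name twice returns the same handle, so each string has exactly one Symbol and each Symbol exactly one string.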
typedef SymbolIterator SymbolIterator |
typedef SymbolPair SymbolPair |
A pair of symbols representing a transition in a transducer.
Definition at line 43 of file symbol-layer.h.
typedef SymbolPairIterator SymbolPairIterator |
Iterator over the set of symbol pairs in a SymbolPairSet.
Definition at line 49 of file symbol-layer.h.
typedef SymbolPairSet SymbolPairSet |
A set of symbol pairs aka an alphabet of symbol pairs.
Definition at line 46 of file symbol-layer.h.
Associate the key i in the key table T with the symbol s.
The symbol that is first associated with a key becomes the primary symbol for that key. If key i has already been associated with one or more symbols not equal to s, the symbol s becomes a parallel symbol for the key i.
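The primary/parallel distinction can be modelled as follows. This is a minimal sketch under stated assumptions: SketchKeyTable, SketchKey and SketchSym are hypothetical names, keys and symbols are plain unsigned integers, and "unused key" is taken to mean the smallest key with no symbol, which is one possible policy rather than the documented one.

```cpp
#include <cassert>
#include <map>
#include <vector>

typedef unsigned SketchKey;
typedef unsigned SketchSym;

// Hypothetical model of the 1:N key-to-symbol relation; not the library's KeyTable.
class SketchKeyTable {
public:
    // The first symbol given to a key becomes its primary symbol;
    // symbols added later become parallel symbols of that key.
    void associate(SketchKey key, SketchSym symbol) {
        std::map<SketchSym, SketchKey>::const_iterator it = key_of_.find(symbol);
        if (it != key_of_.end()) {
            assert(it->second == key);  // a symbol belongs to only one key
            return;                     // already associated, nothing to do
        }
        symbols_of_[key].push_back(symbol);
        key_of_[symbol] = key;
    }

    bool has_key(SketchKey key) const { return symbols_of_.count(key) != 0; }
    bool has_symbol(SketchSym symbol) const { return key_of_.count(symbol) != 0; }

    // Primary symbol: the one first associated with key.
    SketchSym primary_symbol(SketchKey key) const {
        assert(has_key(key));
        return symbols_of_.find(key)->second.front();
    }

    SketchKey key_of(SketchSym symbol) const {
        assert(has_symbol(symbol));
        return key_of_.find(symbol)->second;
    }

    // Assumed policy: the smallest key that has no symbol yet.
    SketchKey unused_key() const {
        SketchKey k = 0;
        while (has_key(k)) ++k;
        return k;
    }

private:
    std::map<SketchKey, std::vector<SketchSym> > symbols_of_;
    std::map<SketchSym, SketchKey> key_of_;
};
```

Storing the symbols of a key in insertion order keeps the primary symbol at the front, which is what a lookup by key returns.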
SymbolPairIterator begin_pi_symbol | ( | SymbolPairSet * | Pi | ) |
Beginning of the iterator for the symbol pair set Pi.
SymbolIterator begin_sigma_symbol | ( | SymbolSet * | Si | ) |
Beginning of the iterator for the symbol set Si.
SymbolSet* create_empty_symbol_set | ( | ) |
Define an empty set of symbols.
SymbolPairSet* create_empty_symbolpair_set | ( | ) |
Define an empty set of symbol pairs.
KeyTable* create_key_table | ( | ) |
Create an empty key table.
The result has no associations defined between symbols and keys.
Symbol define_symbol | ( | const char * | s | ) |
Define a symbol with name s.
SymbolPair* define_symbolpair | ( | Symbol | s1, | |
Symbol | s2 | |||
) |
Define a symbol pair with input symbol s1 and output symbol s2.
SymbolPairIterator end_pi_symbol | ( | SymbolPairSet * | Pi | ) |
End of the iterator for the symbol pair set Pi.
SymbolIterator end_sigma_symbol | ( | SymbolSet * | Si | ) |
End of the iterator for the symbol set Si.
Return a new key table only including those key/symbol pairs which correspond to flag-diacritic symbol names.
Flag-diacritic symbol names begin and end with an '@'.
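A sketch of that naming test, assuming a name needs at least two characters so that a lone "@" does not qualify. Both function names are hypothetical, and the table is modelled as a plain key-to-name map instead of a KeyTable.

```cpp
#include <map>
#include <string>

// Assumption: a flag-diacritic name has at least two characters and
// both its first and last character are '@'.
bool looks_like_flag_diacritic(const std::string &name) {
    return name.size() >= 2 && name[0] == '@' && name[name.size() - 1] == '@';
}

// Keep only the key/name entries whose name looks like a flag diacritic.
std::map<unsigned, std::string> gather_flag_entries(
        const std::map<unsigned, std::string> &table) {
    std::map<unsigned, std::string> result;
    std::map<unsigned, std::string>::const_iterator it;
    for (it = table.begin(); it != table.end(); ++it) {
        if (looks_like_flag_diacritic(it->second)) result.insert(*it);
    }
    return result;
}
```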
Symbol get_input_symbol | ( | SymbolPair * | s | ) |
Get the input symbol of SymbolPair s.
Find a symbol for the key i in key table T.
If there are several symbols associated with the key, the primary symbol (the symbol that was first associated with the key) is returned.
Symbol get_output_symbol | ( | SymbolPair * | s | ) |
Get the output symbol of SymbolPair s.
SymbolPair* get_pi_symbolpair | ( | SymbolPairIterator | pi | ) |
Get the symbol pair pointed to by the symbol pair iterator pi.
Symbol get_sigma_symbol | ( | SymbolIterator | Si | ) |
Get the symbol pointed to by the symbol iterator Si.
Symbol get_symbol | ( | const char * | s | ) |
Find the symbol for the symbol name s.
const char* get_symbol_name | ( | Symbol | s | ) |
Find the symbol name for the symbol s.
Return a Key which hasn't been associated to any symbol in key table T.
TransducerHandle harmonize_transducer | ( | TransducerHandle | t, | |
KeyTable * | T_old, | |||
KeyTable * | T_new | |||
) |
Harmonize transducer t that uses key table T_old according to key table T_new.
bool has_symbol_table | ( | istream & | is | ) |
Whether the transducer coming from istream is has a symbol table stored with it.
bool has_symbolpair | ( | SymbolPair * | p, | |
SymbolPairSet * | Pi | |||
) |
Whether symbol pair p is a member of the set of symbol pairs Pi.
Insert s into the set of symbols Si and return the updated set.
SymbolPairSet* insert_symbolpair | ( | SymbolPair * | p, | |
SymbolPairSet * | Pi | |||
) |
Insert p into the set of symbol pairs Pi and return the updated set.
bool is_symbol | ( | const char * | s | ) |
Whether the string s indicates a name for a symbol.
KeyVector* longest_match_tokenize | ( | TransducerHandle | tokenizer, | |
const char * | string, | |||
KeyTable * | inputKeys | |||
) |
Use tokenizer to tokenize string.
The transducer tokenizer should be created using the function longest_match_tokenizer2. The key table inputKeys should contain all characters in string and be compatible with tokenizer.
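The idea of left-to-right longest matching can be shown without a transducer. The sketch below is an assumption-laden illustration, not the library's method: longest_match_keys is a hypothetical function, the key table is modelled as a map from symbol names to keys, and matching is done on bytes, which works for UTF-8 as long as every symbol name is a complete UTF-8 sequence.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Greedy left-to-right longest match over a name-to-key map.
// Returns false if some position of text matches no symbol name.
bool longest_match_keys(const std::string &text,
                        const std::map<std::string, unsigned> &keys,
                        std::vector<unsigned> &out) {
    out.clear();
    // Length of the longest symbol name bounds how far we need to look ahead.
    std::size_t max_len = 0;
    std::map<std::string, unsigned>::const_iterator it;
    for (it = keys.begin(); it != keys.end(); ++it) {
        max_len = std::max(max_len, it->first.size());
    }
    std::size_t pos = 0;
    while (pos < text.size()) {
        bool matched = false;
        std::size_t limit = std::min(max_len, text.size() - pos);
        // Try the longest candidate first, then shorter ones.
        for (std::size_t len = limit; len > 0; --len) {
            it = keys.find(text.substr(pos, len));
            if (it != keys.end()) {
                out.push_back(it->second);
                pos += len;
                matched = true;
                break;
            }
        }
        if (!matched) return false;
    }
    return true;
}
```

With symbols "c", "a", "t", "+" and "+pl", the text cat+pl yields the keys of c, a, t and +pl, because "+pl" is preferred over the shorter "+".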
KeyPairVector* longest_match_tokenize_pair | ( | TransducerHandle | tokenizer, | |
const char * | string1, | |||
const char * | string2, | |||
KeyTable * | inputKeys | |||
) |
Use tokenizer to tokenize string1 and string2 and align the tokenized strings to a key pair vector.
The transducer tokenizer should be created using the function longest_match_tokenizer2. The key table inputKeys should contain all characters in string1 and string2 and be compatible with tokenizer. The tokenized strings will be aligned into a key pair vector. The shorter one of the tokenized strings will be padded with zeroes at the end.
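The alignment step described above can be sketched as follows, assuming the two strings have already been tokenized into key vectors. AKey, AKeyPair and align_with_padding are hypothetical names used only for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

typedef unsigned AKey;
typedef std::pair<AKey, AKey> AKeyPair;

// Pair up two tokenized strings position by position.
// The shorter vector is padded with zeroes at the end.
std::vector<AKeyPair> align_with_padding(const std::vector<AKey> &upper,
                                         const std::vector<AKey> &lower) {
    std::size_t n = std::max(upper.size(), lower.size());
    std::vector<AKeyPair> pairs;
    pairs.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        AKey u = i < upper.size() ? upper[i] : 0;
        AKey l = i < lower.size() ? lower[i] : 0;
        pairs.push_back(AKeyPair(u, l));
    }
    return pairs;
}
```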
TransducerHandle longest_match_tokenizer | ( | KeySet * | ks, | |
KeyTable * | kt | |||
) |
Create a left to right longest match tokenizer for symbols in key set ks.
The keytable kt should contain the letters which make up the symbols for keys in ks. The keyset ks should not contain the key epsilon! The resulting transducer can be composed with other transducers to accomplish tokenization.
TransducerHandle longest_match_tokenizer2 | ( | KeyTable * | kt | ) |
Create a left to right longest match tokenizer for the symbols in key table kt.
The keytable kt should contain the letters which make up its multicharacter symbols. Tokenization can be accomplished using functions longest_match_tokenize and longest_match_tokenize_pair.
TransducerHandle pairstring_to_transducer | ( | const char * | str, | |
KeyTable * | T | |||
) |
Create a one-path transducer as defined in pairstring form in str using the symbols defined in key table T.
The transitions must be written one after another separated by a space. (For automatic tokenization of symbols, see tokenize_pair_string.) If the input and output symbols are not equal, they are separated by a colon. If the backslash '\' and colon ':' are part of a symbol name, they must be escaped as "\\" and "\:".
For example the string "a:\: cd:e"
represents a transducer with consecutive transitions mapping "a" to ":" and "cd" to "e".
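The pairstring syntax can be read with a short hand-written parser. The sketch below only splits the string into pairs of symbol names. It does not build a transducer or consult a key table, and NamePair and parse_pairstring are hypothetical names. It assumes that an unescaped space always separates transitions.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, std::string> NamePair;  // (input name, output name)

// Split a pairstring into symbol name pairs, honouring "\\" and "\:" escapes.
std::vector<NamePair> parse_pairstring(const std::string &text) {
    std::vector<NamePair> pairs;
    std::string input, output;
    bool in_output = false;   // true after an unescaped ':' in the current pair
    bool have_token = false;  // true once the current pair has any content
    for (std::size_t i = 0; i <= text.size(); ++i) {
        bool at_end = (i == text.size());
        char c = at_end ? ' ' : text[i];  // treat the end as a final separator
        if (!at_end && c == '\\' && i + 1 < text.size()) {
            // Escaped character: take the next character literally.
            (in_output ? output : input) += text[++i];
            have_token = true;
        } else if (!at_end && c == ':') {
            in_output = true;
            have_token = true;
        } else if (c == ' ') {
            // Without a colon the pair is an identity pair.
            if (have_token) pairs.push_back(NamePair(input, in_output ? output : input));
            input.clear();
            output.clear();
            in_output = false;
            have_token = false;
        } else {
            (in_output ? output : input) += c;
            have_token = true;
        }
    }
    return pairs;
}
```

Applied to the example string above, the parser returns the pairs ("a", ":") and ("cd", "e").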
void print_transducer | ( | TransducerHandle | t, | |
KeyTable * | T, | |||
bool | print_weights = false , |
|||
ostream & | ostr = std::cout , |
|||
bool | old = false | |||
) |
Print transducer t in text format using the symbols defined in key table T. The parameter print_weights indicates whether weights are included, the output stream ostr indicates where printing is directed. Parameter old indicates whether transducer t should be printed in old SFST text format instead of AT&T format.
In HFST the print_weights parameter is ignored.
In AT&T and SFST format, the newline, horizontal tab, carriage return, vertical tab, formfeed, bell character, backspace, backslash and space are printed as "\n", "\t", "\r", "\v", "\f", "\a", "\b", "\\" and "\0x20". In SFST format, the colon and angle brackets are printed as "\:", "\<" and "\>".
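The escaping rules for print names can be written down as a small function. This is a sketch that follows the list above literally. escape_print_name is a hypothetical helper, not part of the documented API.

```cpp
#include <cstddef>
#include <string>

// Escape a symbol print name for text output.
// sfst selects the additional SFST escapes for ':', '<' and '>'.
std::string escape_print_name(const std::string &name, bool sfst) {
    std::string out;
    for (std::size_t i = 0; i < name.size(); ++i) {
        switch (name[i]) {
            case '\n': out += "\\n"; break;
            case '\t': out += "\\t"; break;
            case '\r': out += "\\r"; break;
            case '\v': out += "\\v"; break;
            case '\f': out += "\\f"; break;
            case '\a': out += "\\a"; break;
            case '\b': out += "\\b"; break;
            case '\\': out += "\\\\"; break;
            case ' ':  out += "\\0x20"; break;
            case ':':
            case '<':
            case '>':
                if (sfst) out += '\\';  // escaped only in SFST format
                out += name[i];
                break;
            default:
                out += name[i];
        }
    }
    return out;
}
```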
KeyTable* read_symbol_table | ( | istream & | is, | |
bool | binary = false | |||
) |
Read a symbol table from istream is and transform it to a key table. binary defines whether the symbol table is in binary or text format.
Key table and symbol table are two ways of representing key-to-string mappings. Key tables are used during a session and symbol tables when moving or storing information between sessions.
During a session, a key table associates keys to symbol handles and the global symbol cache associates symbol handles to strings.
Between sessions, a symbol table associates keys directly to strings, as there is no symbol cache.
A symbol table in OpenFst text format lists each symbol name and its associated key on one line. The symbol name and the associated key are separated by a tabulator. If several symbol names are associated to the same key, the one listed first is considered the primary print name for that key.
An example:
KeyTable (Key | Symbol):
0 | 0, 1
1 | 2
2 | 4
3 | 5
Global symbol cache (Symbol | string):
0 | "<>"
1 | "<eps>"
2 | "a"
3 | "A"
4 | "b"
5 | "c"
6 | "d"
Symbol table (Key | string):
0 | "<>", "<eps>"
1 | "a"
2 | "b"
3 | "c"
Symbol table in text format:
<> TAB 0
<eps> TAB 0
a TAB 1
b TAB 2
c TAB 3
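Reading the text format can be sketched in a few lines. The code below is an illustration only: SketchSymbolTable and parse_symbol_table_text are hypothetical names, the input is taken as a string rather than an istream, and lines without a tabulator are silently skipped, which is an assumption about error handling.

```cpp
#include <cstdlib>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// key -> print names in the order listed; the first name is the primary one.
typedef std::map<unsigned long, std::vector<std::string> > SketchSymbolTable;

SketchSymbolTable parse_symbol_table_text(const std::string &text) {
    SketchSymbolTable table;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        // The key follows the last tabulator on the line.
        std::string::size_type tab = line.rfind('\t');
        if (tab == std::string::npos) continue;  // assumed: skip malformed lines
        std::string name = line.substr(0, tab);
        unsigned long key = std::strtoul(line.c_str() + tab + 1, 0, 10);
        table[key].push_back(name);
    }
    return table;
}
```

For the text format shown in the example, key 0 ends up with the names "<>" and "<eps>", with "<>" as the primary print name.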
TransducerHandle read_transducer | ( | istream & | is, | |
KeyTable * | T | |||
) |
Read a transducer in binary form from input stream is and harmonize it according to the key table T.
The following notation is used: Ts = the transducer read from istream is and S = the symbol table of transducer Ts.
Harmonization is done in the following way:
If T is empty (made with create_key_table), S is copied to T as such and all keys used in Ts remain the same i.e. no harmonization is done.
If T is not empty, the harmonization goes as follows. For each input and output key in a transition in Ts, the corresponding primary print name is looked up in S. The key value for this print name is then looked up in T and the original input or output key is replaced with this key. Epsilon keys are copied as such (the primary name of epsilon is thus defined solely by T). If a primary print name used in Ts is not found in T, it is added to T and to the global symbol cache at the next free position.
Some special cases: (1) If a key used in Ts is not found in S, it is replaced by next free key in T, but it is not added to T as it has no print name (the side effect is that the key after next free key in T is associated with a dummy Symbol, so it is recommended that all keys used in Ts are in S.) (2) Keys defined in S that are not used in Ts are not copied to T.
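The main harmonization rule can be sketched on plain data. This is a simplified model, not the library's algorithm: all names below are hypothetical, S and T are reduced to maps from keys to primary print names, epsilon is assumed to be key 0, "next free position" is read as the smallest unused key above 0, and the special case of keys missing from S is excluded by an assertion.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

typedef unsigned HKey;
typedef std::map<HKey, std::string> PrimaryNames;  // key -> primary print name
struct HTransition { HKey input; HKey output; };

// Assumed reading of "next free position": smallest unused key above epsilon.
HKey next_free_key(const PrimaryNames &names) {
    HKey k = 1;
    while (names.count(k) != 0) ++k;
    return k;
}

// Map one key used in Ts to the key that T uses for the same print name.
HKey harmonize_key(HKey key, const PrimaryNames &S, PrimaryNames &T) {
    if (key == 0) return 0;  // epsilon keys are copied as such
    PrimaryNames::const_iterator s = S.find(key);
    assert(s != S.end());    // sketch only: every used key must be named in S
    for (PrimaryNames::const_iterator t = T.begin(); t != T.end(); ++t) {
        if (t->second == s->second) return t->first;
    }
    HKey fresh = next_free_key(T);
    T[fresh] = s->second;    // unknown print name: add it to T
    return fresh;
}

void harmonize(std::vector<HTransition> &transitions,
               const PrimaryNames &S, PrimaryNames &T) {
    if (T.empty()) {  // empty T: copy S, keys remain the same
        T = S;
        return;
    }
    for (std::size_t i = 0; i < transitions.size(); ++i) {
        transitions[i].input = harmonize_key(transitions[i].input, S, T);
        transitions[i].output = harmonize_key(transitions[i].output, S, T);
    }
}
```

The linear search over T keeps the sketch short. A real key table would presumably offer a direct lookup from symbol to key.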
TransducerHandle read_transducer_text | ( | istream & | is, | |
KeyTable * | T, | |||
bool | sfst = false | |||
) |
Make a transducer as defined in text form in istream is using the key-to-printname relations defined in key table T. The parameter sfst defines whether SFST text format is used, otherwise AT&T format is used.
In AT&T and SFST format, the newline, horizontal tab, carriage return, vertical tab, formfeed, bell character, backspace, backslash and space must be escaped as "\n", "\t", "\r", "\v", "\f", "\a", "\b", "\\" and "\0x20". In SFST format, the colon and angle brackets must be escaped as "\:", "\<" and "\>".
An example of a transducer file:
AT&T | AT&T UNWEIGHTED | SFST
0 0 | 0 | final 0
0 1 a aa 0.3 | 0 1 a aa | 0 a:aa 1
0 2 b b 0 | 0 2 b b | 0 b 2
1 0 c C 0.5 | 1 0 c C | 1 c:C 0
2 1 \n c 0 | 2 1 \n c | 2 \n:c 1
2 0 a A 1.2 | 2 0 a A | 2 a:A 0
2 2 d D 1.65 | 2 2 d D | 2 d:D 2
2 0.5 | 2 | final 2
The syntax of the lines in the text format is one of the following in the AT&T format:
and one of the following in sfst format:
When AT&T format is used in HFST, weights are ignored. When SFST or AT&T unweighted format is used in HWFST, weights are set to zero.
Replace the epsilon in kt with epsilon_replacement.
When tokenizing input-strings, the strings should never contain a substring matching the symbol name of the epsilon key in the KeyTable used in tokenization. Therefore the epsilons in the tokenizer should be replaced by an internal epsilon-symbol, which is unlikely to occur in real input-strings.
recode_key_table returns a KeyTable, which is the same as kt, except the key 0 corresponds to the internal epsilon symbol name epsilon_replacement and the original epsilon symbol name corresponds to the first unused key in kt.
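The recoding can be sketched on a simplified table. NameTable and recode_epsilon are hypothetical names, the key table is reduced to a map from keys to one print name each, and "first unused key" is assumed to mean the smallest key above 0 that is absent from kt.

```cpp
#include <map>
#include <string>

typedef std::map<unsigned, std::string> NameTable;  // key -> print name

// Return a copy of kt in which key 0 carries epsilon_replacement and the
// original epsilon name is moved to the first unused key of kt.
NameTable recode_epsilon(const NameTable &kt, const std::string &epsilon_replacement) {
    NameTable result = kt;
    NameTable::const_iterator eps = kt.find(0);
    if (eps != kt.end()) {
        unsigned unused = 1;
        while (kt.count(unused) != 0) ++unused;
        result[unused] = eps->second;  // original epsilon name gets a new key
    }
    result[0] = epsilon_replacement;
    return result;
}
```

The input table is left untouched, matching the description that a new table is returned.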
size_t size_pi_symbol | ( | SymbolPairSet * | Pi | ) |
Size of the iterator for the symbol pair set Pi.
size_t size_sigma_symbol | ( | SymbolSet * | Si | ) |
Size of the iterator for the symbol set Si.
KeyPairVector* tokenize_pair_string | ( | TransducerHandle | tokeniser, | |
char * | pairs, | |||
KeyTable * | inputKeys | |||
) |
Tokenise with tokeniser the string pairs, which consists of individual characters and colon-separated pairs, into a transducer.
E.g. a string cat+pl:s will be made to c a t +pl:s given that tokeniser creates such tokens.
tokeniser | A transducer that, upon composing leftwards against transducer made of UTF-8 characters of string, results in acyclic tokenisation(s) of original path. | |
pairs | UTF-8 encoded string for transducer | |
inputKeys | KeyTable that matches mapping of UTF-8 characters on input side of tokeniser. |
KeyVector* tokenize_string | ( | TransducerHandle | tokeniser, | |
const char * | string, | |||
KeyTable * | inputKeys | |||
) |
Change the string string into an identity pair transducer as tokenised by tokeniser.
E.g. a string cat will be tokenised as transducer c a t, given that tokeniser creates tokens for c, a, and t.
tokeniser | A transducer that, upon composing leftwards against transducer made of UTF-8 characters of string, results in acyclic tokenisation(s) of original path. | |
string | UTF-8 encoded string for transducer pairs. | |
inputKeys | KeyTable that matches mapping of UTF-8 characters on input side of tokeniser. |
KeyPairVector* tokenize_string_pair | ( | TransducerHandle | tokeniser, | |
const char * | upper, | |||
const char * | lower, | |||
KeyTable * | inputKeys | |||
) |
Change two strings into a transducer aligned character by character according to tokenisation by tokeniser. The path(s) resulting from the composition of the strings' UTF-8 representations against tokeniser are paired up from beginning to end. Empty positions at the end are filled with ε’s.
E.g. strings cat dog are aligned as c:d a:o t:g. Strings ääliö ääliöitä are aligned as ä ä l i ö ε:i ε:t ε:ä. And talo+NOUN+SINGULAR+NOMINATIVE talo as t a l o +NOUN:ε +SINGULAR:ε +NOMINATIVE:ε, given that the tokeniser and key table contain those symbols.
If specific alignment is required, it is possible to specify ε’s manually using the string for ε that is defined in inputKeys.
The tokeniser may be built manually or with functions such as longest_match_tokenizer.
(...)
tokeniser | A transducer that, upon composing leftwards against transducer made of UTF-8 characters of string, results in acyclic tokenisation(s) of original path. | |
upper | UTF-8 encoded string for input side of transducer. | |
lower | UTF-8 encoded string for output side of transducer. | |
inputKeys | KeyTable that matches mapping of UTF-8 characters on input side of tokeniser. |
char* transducer_to_pairstring | ( | TransducerHandle | t, | |
KeyTable * | T, | |||
bool | spaces = true , |
|||
bool | print_epsilons = true | |||
) |
A pairstring representation of one-path transducer t using the symbols defined in key table T. spaces defines whether pairs are separated by spaces.
The transitions are printed one after another, separated by spaces if so requested. If the input and output symbols are not equal, they are separated by a colon. If the backslash '\' and colon ':' are part of a symbol print name, they are escaped as "\\" and "\:".
The empty transducer is represented by "\empty_transducer" and the epsilon transducer as "EPS" where EPS is the symbol name for epsilon (pairstring_to_transducer recognizes "" as the epsilon transducer, but "EPS" is a more user-friendly notation). If the symbol name for epsilon is not defined, "\epsilon" is returned.
void write_runtime_transducer | ( | TransducerHandle | t, | |
KeyTable * | kt, | |||
FILE * | output_file | |||
) |
Write a transducer t with key table kt into file output_file. Write its symbols into the file with name symbol_file_name.
void write_symbol_table | ( | KeyTable * | T, | |
ostream & | os, | |||
bool | binary = false | |||
) |
Transform the key table T to a symbol table and write it to ostream os. binary defines whether the symbol table is written in binary or text format.
void write_transducer | ( | TransducerHandle | t, | |
KeyTable * | T, | |||
ostream & | os = std::cout , |
|||
bool | backwards_compatibility = false | |||
) |
Write t in binary form to output stream os. Key table T is stored with the transducer.
t | Transducer to be written | |
T | Key table that is stored with the transducer | |
os | Where transducer is written | |
backwards_compatibility | Whether the transducer is written in SFST/OpenFst compatible format. |