onyx.lexicon.lexicon – A module for maintaining and using word and pronunciation collections.

A LexiconBuilder can be constructed in empty form (the default), from a string, or from a FrozenLexicon object. See help(LexiconBuilder) for more details.

>>> lb0 = LexiconBuilder(_dict0)

A FrozenLexicon is a form of FrozenCfg and can be used anywhere a FrozenCfg can be used. In addition, FrozenLexicon has a few other properties.

>>> lex0 = FrozenLexicon(lb0)
>>> lex0.num_orthos
19
>>> lex0.num_prons
21
>>> print(lex0)
Lexicon with 19 orthographies and 21 prons
>>> lb1 = LexiconBuilder(lex0)
>>> lb1.add_from_strings(("this_word th i S w e r d", "that_word th ae t w e r d"))
>>> lex1 = FrozenLexicon(lb1)
>>> print(lex1)
Lexicon with 21 orthographies and 23 prons
class onyx.lexicon.lexicon.FrozenLexicon(builder)

Bases: onyx.dataflow.simplecfg.FrozenCfg

A class for using word and pronunciation collections.

A FrozenLexicon is a form of FrozenCfg and can be used anywhere a FrozenCfg can be used. In addition, FrozenLexicon has a few other properties.

Make FrozenLexicons from LexiconBuilders:

>>> lb0 = LexiconBuilder(_dict0)
>>> lex0 = FrozenLexicon(lb0)
>>> lex0.num_orthos
19
>>> lex0.num_prons
21
>>> lex0.num_phones
22
>>> print(lex0)
Lexicon with 19 orthographies and 21 prons
>>> lex0.size
89
>>> lex0.num_productions
21
>>> len(lex0.terminals)
22
>>> len(lex0.non_terminals)
19
>>> sorted(lex0.terminals)[0]
(onyx.util.singleton.Singleton('onyx.lexicon.PHONE'), '@')
>>> sorted(lex0.non_terminals)[0]
(onyx.util.singleton.Singleton('onyx.lexicon.WORD'), '</s>')
>>> # for lhs, rhs in lex0: print('%s ====> %s)' % (lhs, rhs))
make_left_factored_cfg()

Return a FrozenCfg that is equivalent to self but in which the productions for each non-terminal are left factored. This means that no two productions for a given non-terminal share a prefix. Equivalently, there is only one production for each direct left corner of each non-terminal.

>>> b = CfgBuilder()
>>> b.add_production('A', ('x', 'y', 'z'))
>>> b.add_production('A', ('x', 'y', 'q'))
>>> b.add_production('A', ('l', 'n', 'z'))
>>> b.add_production('A', ('l', 'n', 'q'))
>>> b.add_production('A', ('l',))
>>> b.add_production('A', ('x', 'z', 'z'))
>>> b.add_production('A', ('x', 'q', 'q'))
>>> b.add_production('B', ())
>>> b.add_production('B', ('b',))
>>> b.add_production('B', ('C', 'b'))
>>> b.add_production('C', ('A',))
>>> b.add_production('C', ('D',))
>>> b.add_production('C', ('E', 'e'))
>>> b.add_production('D', ('B',))
>>> b.add_production('E', ('B', 'f'))
>>> b.add_production('S', ('A', 'C', 'x'))
>>> b.add_production('S', ('B', 'C', 'x'))
>>> cfg = FrozenCfg(b, 'S')
>>> for lhs, rhs in cfg: print repr(lhs), ':', ' '.join(repr(rhs_token) for rhs_token in rhs)
'A' : 'l'
'A' : 'l' 'n' 'q'
'A' : 'l' 'n' 'z'
'A' : 'x' 'q' 'q'
'A' : 'x' 'y' 'q'
'A' : 'x' 'y' 'z'
'A' : 'x' 'z' 'z'
'B' : 
'B' : 'C' 'b'
'B' : 'b'
'C' : 'A'
'C' : 'D'
'C' : 'E' 'e'
'D' : 'B'
'E' : 'B' 'f'
'S' : 'A' 'C' 'x'
'S' : 'B' 'C' 'x'
>>> cfg1 = cfg.make_left_factored_cfg()
>>> for lhs, rhs in cfg1: print repr(lhs), ':', ' '.join(repr(rhs_token) for rhs_token in rhs)
'A' : 'l' 'A_lf3'
'A' : 'x' 'A_lf2'
'A_lf0' : 'q'
'A_lf0' : 'z'
'A_lf1' : 'q'
'A_lf1' : 'z'
'A_lf2' : 'q' 'q'
'A_lf2' : 'y' 'A_lf0'
'A_lf2' : 'z' 'z'
'A_lf3' : 
'A_lf3' : 'n' 'A_lf1'
'B' : 
'B' : 'C' 'b'
'B' : 'b'
'C' : 'A'
'C' : 'D'
'C' : 'E' 'e'
'D' : 'B'
'E' : 'B' 'f'
'S' : 'A' 'C' 'x'
'S' : 'B' 'C' 'x'
>>> cfg2 = cfg1.make_no_epsilon_cfg()
>>> for lhs, rhs in cfg2: print str(lhs), ': ', '  '.join(str(rhs_token) for rhs_token in rhs)
A :  l
A :  l  A_lf3_er2
A :  x  A_lf2
A_lf0 :  q
A_lf0 :  z
A_lf1 :  q
A_lf1 :  z
A_lf2 :  q  q
A_lf2 :  y  A_lf0
A_lf2 :  z  z
A_lf3_er2 :  n  A_lf1
B_er1 :  C_er0  b
B_er1 :  b
C_er0 :  A
C_er0 :  D_er3
C_er0 :  E  e
D_er3 :  B_er1
E :  B_er1  f
E :  f
S :  A  C_er0  x
S :  A  x
S :  B_er1  C_er0  x
S :  B_er1  x
S :  C_er0  x
S :  x
>>> cfg3 = cfg2.make_left_factored_cfg()
>>> for lhs, rhs in cfg3: print str(lhs), ': ', '  '.join(str(rhs_token) for rhs_token in rhs)
A :  l  A_lf0_lf0
A :  x  A_lf2
A_lf0 :  q
A_lf0 :  z
A_lf0_lf0 :  
A_lf0_lf0 :  A_lf3_er2
A_lf1 :  q
A_lf1 :  z
A_lf2 :  q  q
A_lf2 :  y  A_lf0
A_lf2 :  z  z
A_lf3_er2 :  n  A_lf1
B_er1 :  C_er0  b
B_er1 :  b
C_er0 :  A
C_er0 :  D_er3
C_er0 :  E  e
D_er3 :  B_er1
E :  B_er1  f
E :  f
S :  A  S_lf0
S :  B_er1  S_lf1
S :  C_er0  x
S :  x
S_lf0 :  C_er0  x
S_lf0 :  x
S_lf1 :  C_er0  x
S_lf1 :  x
>>> cfg4 = cfg3.make_no_epsilon_cfg()
>>> for lhs, rhs in cfg4: print str(lhs), ': ', '  '.join(str(rhs_token) for rhs_token in rhs)
A :  l
A :  l  A_lf0_lf0_er0
A :  x  A_lf2
A_lf0 :  q
A_lf0 :  z
A_lf0_lf0_er0 :  A_lf3_er2
A_lf1 :  q
A_lf1 :  z
A_lf2 :  q  q
A_lf2 :  y  A_lf0
A_lf2 :  z  z
A_lf3_er2 :  n  A_lf1
B_er1 :  C_er0  b
B_er1 :  b
C_er0 :  A
C_er0 :  D_er3
C_er0 :  E  e
D_er3 :  B_er1
E :  B_er1  f
E :  f
S :  A  S_lf0
S :  B_er1  S_lf1
S :  C_er0  x
S :  x
S_lf0 :  C_er0  x
S_lf0 :  x
S_lf1 :  C_er0  x
S_lf1 :  x
>>> cfg5 = cfg4.make_left_factored_cfg()
>>> for lhs, rhs in cfg5: print str(lhs), ': ', '  '.join(str(rhs_token) for rhs_token in rhs)
A :  l  A_lf0_lf0
A :  x  A_lf2
A_lf0 :  q
A_lf0 :  z
A_lf0_lf0 :  
A_lf0_lf0 :  A_lf0_lf0_er0
A_lf0_lf0_er0 :  A_lf3_er2
A_lf1 :  q
A_lf1 :  z
A_lf2 :  q  q
A_lf2 :  y  A_lf0
A_lf2 :  z  z
A_lf3_er2 :  n  A_lf1
B_er1 :  C_er0  b
B_er1 :  b
C_er0 :  A
C_er0 :  D_er3
C_er0 :  E  e
D_er3 :  B_er1
E :  B_er1  f
E :  f
S :  A  S_lf0
S :  B_er1  S_lf1
S :  C_er0  x
S :  x
S_lf0 :  C_er0  x
S_lf0 :  x
S_lf1 :  C_er0  x
S_lf1 :  x
>>> cfg.size, cfg1.size, cfg2.size, cfg3.size, cfg4.size, cfg5.size
(42, 44, 51, 55, 55, 57)
>>> cfg.num_productions, cfg1.num_productions, cfg2.num_productions, cfg3.num_productions, cfg4.num_productions, cfg5.num_productions
(17, 21, 25, 28, 28, 29)
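
The left-factored property described for make_left_factored_cfg() can be checked independently of the library. The following is a minimal sketch in plain Python, not the Onyx implementation: the helper name shared_left_corners and the dict-of-sets grammar representation are assumptions made for illustration. It reports, for each non-terminal, the first symbols that start more than one production.

```python
from collections import Counter


def shared_left_corners(productions):
    """Map each non-terminal to the first symbols shared by two or more of its productions.

    `productions` is a hypothetical representation: {lhs: set of rhs tuples}.
    """
    clashes = {}
    for lhs, rhs_set in productions.items():
        # Count the first symbol of every non-empty right-hand side.
        firsts = Counter(rhs[0] for rhs in rhs_set if rhs)
        repeated = sorted(sym for sym, count in firsts.items() if count > 1)
        if repeated:
            clashes[lhs] = repeated
    return clashes


# Productions for 'A' before left factoring, copied from the example above.
before = {
    'A': {('x', 'y', 'z'), ('x', 'y', 'q'), ('l', 'n', 'z'), ('l', 'n', 'q'),
          ('l',), ('x', 'z', 'z'), ('x', 'q', 'q')},
}

# Productions for 'A' and its helper non-terminals as printed for cfg1.
after = {
    'A': {('l', 'A_lf3'), ('x', 'A_lf2')},
    'A_lf0': {('q',), ('z',)},
    'A_lf1': {('q',), ('z',)},
    'A_lf2': {('q', 'q'), ('y', 'A_lf0'), ('z', 'z')},
    'A_lf3': {(), ('n', 'A_lf1')},
}

print(shared_left_corners(before))  # {'A': ['l', 'x']}
print(shared_left_corners(after))   # {}
```

An empty result means that every production of every non-terminal starts with a distinct symbol, which is the condition stated in the description.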
make_nlrg_cfg()

Return a FrozenCfg that is equivalent to self but for which the non-left-recursive productions for each non-terminal are grouped into a new non-terminal. This follows Robert Moore’s non-left-recursion-grouping (NLRG) algorithm.

>>> b = CfgBuilder()
>>> b.add_production('A', ('x', 'y', 'z'))
>>> b.add_production('A', ('x', 'y', 'q'))
>>> b.add_production('A', ('l', 'n', 'z'))
>>> b.add_production('A', ('l', 'n', 'q'))
>>> b.add_production('A', ('l',))
>>> b.add_production('A', ('x', 'z', 'z'))
>>> b.add_production('A', ('x', 'q', 'q'))
>>> b.add_production('B', ())
>>> b.add_production('B', ('b',))
>>> b.add_production('B', ('C', 'b'))
>>> b.add_production('C', ('A',))
>>> b.add_production('C', ('D',))
>>> b.add_production('C', ('E', 'e'))
>>> b.add_production('D', ('B',))
>>> b.add_production('E', ('B', 'f'))
>>> b.add_production('S', ('A', 'C', 'x'))
>>> b.add_production('S', ('B', 'C', 'x'))
>>> cfg = FrozenCfg(b, 'S')
>>> for lhs, rhs in cfg: print repr(lhs), ':', ' '.join(repr(rhs_token) for rhs_token in rhs)
'A' : 'l'
'A' : 'l' 'n' 'q'
'A' : 'l' 'n' 'z'
'A' : 'x' 'q' 'q'
'A' : 'x' 'y' 'q'
'A' : 'x' 'y' 'z'
'A' : 'x' 'z' 'z'
'B' : 
'B' : 'C' 'b'
'B' : 'b'
'C' : 'A'
'C' : 'D'
'C' : 'E' 'e'
'D' : 'B'
'E' : 'B' 'f'
'S' : 'A' 'C' 'x'
'S' : 'B' 'C' 'x'
>>> cfg1 = cfg.make_nlrg_cfg()
>>> for lhs, rhs in cfg1: print repr(lhs), ':', ' '.join(repr(rhs_token) for rhs_token in rhs)
'A' : 'l'
'A' : 'l' 'n' 'q'
'A' : 'l' 'n' 'z'
'A' : 'x' 'q' 'q'
'A' : 'x' 'y' 'q'
'A' : 'x' 'y' 'z'
'A' : 'x' 'z' 'z'
'B' : 
'B' : 'C' 'b'
'B' : 'b'
'C' : 'C_nlg'
'C' : 'D'
'C_nlg' : 'A'
'C_nlg' : 'E' 'e'
'D' : 'B'
'E' : 'B' 'f'
'S' : 'A' 'C' 'x'
'S' : 'B' 'C' 'x'
make_no_epsilon_cfg()

Return a FrozenCfg that is equivalent to self but which contains no epsilon productions.

>>> b = CfgBuilder()
>>> b.add_production('A', ('x', 'y', 'z'))
>>> b.add_production('B', ())
>>> b.add_production('B', ('b',))
>>> b.add_production('B', ('B', 'b'))
>>> b.add_production('C', ('A',))
>>> b.add_production('C', ('B',))
>>> b.add_production('D', ('(', 'C', ')'))
>>> b.add_production('E', ('B', '(', 'C', ')'))
>>> b.add_production('F', ('B', '(', 'C', ')', 'C'))
>>> b.add_production('F', ('B', '(', 'C', ')', 'C', 'C'))
>>> b.add_production('G', ('B',))
>>> b.add_production('G', ('B', 'B', 'B'))
>>> b.add_production('S', ('A',))
>>> b.add_production('S', ('A', 'F'))
>>> b.add_production('S', ('A', 'C'))
>>> cfg = FrozenCfg(b)
>>> cfg = FrozenCfg(b, 'S')
>>> for lhs, rhs in cfg: print repr(lhs), ':', ' '.join(repr(rhs_token) for rhs_token in rhs)
'A' : 'x' 'y' 'z'
'B' : 
'B' : 'B' 'b'
'B' : 'b'
'C' : 'A'
'C' : 'B'
'D' : '(' 'C' ')'
'E' : 'B' '(' 'C' ')'
'F' : 'B' '(' 'C' ')' 'C'
'F' : 'B' '(' 'C' ')' 'C' 'C'
'G' : 'B'
'G' : 'B' 'B' 'B'
'S' : 'A'
'S' : 'A' 'C'
'S' : 'A' 'F'
>>> cfg1 = cfg.make_no_epsilon_cfg()
>>> for lhs, rhs in cfg1: print repr(lhs), ':', ' '.join(repr(rhs_token) for rhs_token in rhs)
'A' : 'x' 'y' 'z'
'B_er1' : 'B_er1' 'b'
'B_er1' : 'b'
'C_er0' : 'A'
'C_er0' : 'B_er1'
'D' : '(' ')'
'D' : '(' 'C_er0' ')'
'E' : '(' ')'
'E' : '(' 'C_er0' ')'
'E' : 'B_er1' '(' ')'
'E' : 'B_er1' '(' 'C_er0' ')'
'F' : '(' ')'
'F' : '(' ')' 'C_er0'
'F' : '(' ')' 'C_er0' 'C_er0'
'F' : '(' 'C_er0' ')' 'C_er0'
'F' : '(' 'C_er0' ')' 'C_er0' 'C_er0'
'F' : 'B_er1' '(' ')'
'F' : 'B_er1' '(' ')' 'C_er0'
'F' : 'B_er1' '(' ')' 'C_er0' 'C_er0'
'F' : 'B_er1' '(' 'C_er0' ')'
'F' : 'B_er1' '(' 'C_er0' ')' 'C_er0'
'F' : 'B_er1' '(' 'C_er0' ')' 'C_er0' 'C_er0'
'G_er2' : 'B_er1'
'G_er2' : 'B_er1' 'B_er1'
'G_er2' : 'B_er1' 'B_er1' 'B_er1'
'S' : 'A'
'S' : 'A' 'C_er0'
'S' : 'A' 'F'
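
Removing epsilon productions usually starts by finding the nullable non-terminals, that is, those that can derive the empty sequence. The following is a minimal sketch of that step in plain Python. It is not taken from the Onyx source: the function name nullable_non_terminals and the dict representation of the grammar are assumptions for illustration only.

```python
def nullable_non_terminals(productions):
    """Return the set of non-terminals that can derive the empty sequence.

    `productions` is a hypothetical representation: {lhs: list of rhs tuples}.
    """
    nullable = set()
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for lhs, rhs_list in productions.items():
            if lhs in nullable:
                continue
            # all() over an empty rhs is True, so an epsilon production
            # marks its left-hand side as nullable immediately.
            if any(all(sym in nullable for sym in rhs) for rhs in rhs_list):
                nullable.add(lhs)
                changed = True
    return nullable


# The grammar from the example above.
grammar = {
    'A': [('x', 'y', 'z')],
    'B': [(), ('b',), ('B', 'b')],
    'C': [('A',), ('B',)],
    'D': [('(', 'C', ')')],
    'E': [('B', '(', 'C', ')')],
    'F': [('B', '(', 'C', ')', 'C'), ('B', '(', 'C', ')', 'C', 'C')],
    'G': [('B',), ('B', 'B', 'B')],
    'S': [('A',), ('A', 'F'), ('A', 'C')],
}

print(sorted(nullable_non_terminals(grammar)))  # ['B', 'C', 'G']
```

In the output printed above, B, C and G are the non-terminals that appear with an _er suffix (B_er1, C_er0, G_er2), which suggests the suffix marks symbols whose epsilon derivations were removed. That reading is an inference from the example, not a documented guarantee.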
make_no_left_recursion_cfg()

Return a FrozenCfg that is equivalent to self but which contains no left-recursive production chains.

>>> b = CfgBuilder()
>>> b.add_production('A', ('x', 'y', 'z'))
>>> b.add_production('B', ('b',))
>>> b.add_production('B', ('C', 'b'))
>>> b.add_production('C', ('A',))
>>> b.add_production('C', ('D',))
>>> b.add_production('C', ('E', 'e'))
>>> b.add_production('D', ('B',))
>>> b.add_production('E', ('B', 'f'))
>>> b.add_production('S', ('A', 'C'))
>>> b.add_production('S', ('B', 'C'))
>>> cfg = FrozenCfg(b, 'S')
>>> for lhs, rhs in cfg: print repr(lhs), ':', ' '.join(repr(rhs_token) for rhs_token in rhs)
'A' : 'x' 'y' 'z'
'B' : 'C' 'b'
'B' : 'b'
'C' : 'A'
'C' : 'D'
'C' : 'E' 'e'
'D' : 'B'
'E' : 'B' 'f'
'S' : 'A' 'C'
'S' : 'B' 'C'
>>> cfg1 = cfg.make_no_left_recursion_cfg()
>>> for lhs, rhs in cfg1: print repr(lhs), ':', ' '.join(repr(rhs_token) for rhs_token in rhs)
'S' : 
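
To see which non-terminals in the example grammar take part in left-recursive production chains, one can compute the transitive closure of the direct-left-corner relation. The sketch below is plain Python and not the Onyx implementation. The function name and grammar representation are assumptions, and it assumes the grammar has no epsilon productions, which holds for this example.

```python
def left_recursive_non_terminals(productions):
    """Return the non-terminals that can derive a sequence starting with themselves.

    `productions` is a hypothetical representation: {lhs: list of rhs tuples}.
    Assumes there are no epsilon productions, so the left corner of a
    production is simply its first symbol.
    """
    # Direct left corners of each non-terminal.
    reach = {lhs: {rhs[0] for rhs in rhs_list if rhs}
             for lhs, rhs_list in productions.items()}
    changed = True
    while changed:  # transitive closure by fixed-point iteration
        changed = False
        for lhs in reach:
            extra = set()
            for sym in reach[lhs]:
                extra |= reach.get(sym, set())  # terminals contribute nothing
            if not extra <= reach[lhs]:
                reach[lhs] |= extra
                changed = True
    return {lhs for lhs, corners in reach.items() if lhs in corners}


# The grammar from the example above.
grammar = {
    'A': [('x', 'y', 'z')],
    'B': [('b',), ('C', 'b')],
    'C': [('A',), ('D',), ('E', 'e')],
    'D': [('B',)],
    'E': [('B', 'f')],
    'S': [('A', 'C'), ('B', 'C')],
}

print(sorted(left_recursive_non_terminals(grammar)))  # ['B', 'C', 'D', 'E']
```

For instance, B derives C b, C derives D, and D derives B, so B, C and D form a left-recursive chain. E joins it through C : E e and E : B f.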
non_terminals

A frozenset of the non-terminals in the grammar.

num_orthos
num_phones
num_productions

The number of productions in the grammar.

num_prons
size

The size of the grammar.

We follow Robert Moore in calculating the size of the grammar: the size is the number of non-terminal symbols plus the sum of the lengths of the right-hand-side sequences over all the productions in the grammar. By counting each non-terminal just once, instead of once for each of its productions, this size statistic more closely tracks storage requirements of actual implementations of grammar structures. An empty right-hand-side sequence (epsilon) is counted as having length one.
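
The size calculation can be written out directly. This is a minimal sketch in plain Python under the definition just given, not the library's own code. The function name grammar_size and the dict-of-sets representation are assumptions for illustration, and it assumes every non-terminal appears as the left-hand side of at least one production.

```python
def grammar_size(productions):
    """Size of a grammar: number of non-terminals plus total right-hand-side length.

    `productions` is a hypothetical representation: {lhs: set of rhs tuples}.
    An empty right-hand side (epsilon) counts as length one.
    """
    rhs_total = sum(max(len(rhs), 1)
                    for rhs_set in productions.values()
                    for rhs in rhs_set)
    return len(productions) + rhs_total


# Two non-terminals; right-hand sides of length 3 and (1 + 1 + 2).
grammar = {
    'A': {('x', 'y', 'zoo')},
    'B': {(), ('b',), ('B', 'b')},
}
print(grammar_size(grammar))  # 9

# One more non-terminal with two unit productions adds 1 + 2.
grammar['Cows'] = {('A',), ('B',)}
print(grammar_size(grammar))  # 12
```

These are the same grammar and the same values (9 and 12) as in the update_production() example in this module's documentation.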

terminals

A frozenset of the terminals in the grammar.

verify()
class onyx.lexicon.lexicon.LexiconBuilder(source=None)

Bases: onyx.dataflow.simplecfg.CfgBuilder

A class for building word and pronunciation collections.

A LexiconBuilder can be constructed in empty form (the default), from a string, or from a FrozenLexicon object. In the string case, the string should be a collection of lines, with each line consisting of space-separated tokens. Each line represents one word/pron combination; the first token of the line is the word, the remaining tokens collectively are the pron.

>>> lb0 = LexiconBuilder(_dict0)
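
The string format described above can be illustrated without the library. The sketch below is plain Python, not the LexiconBuilder implementation. The function name parse_lexicon_text is hypothetical, skipping blank lines is an assumption, and the third line of the sample text is an invented alternate pron added only to show a word with two pronunciations.

```python
def parse_lexicon_text(text):
    """Parse lines of 'word phone phone ...' into {word: set of pron tuples}."""
    lexicon = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # assumption: blank lines are ignored
        word, pron = tokens[0], tuple(tokens[1:])
        # Prons for a word are kept as a set, so duplicates collapse.
        lexicon.setdefault(word, set()).add(pron)
    return lexicon


sample = """\
this_word th i S w e r d
that_word th ae t w e r d
that_word th ae t w er d
"""

lex = parse_lexicon_text(sample)
num_orthos = len(lex)                                  # distinct words
num_prons = sum(len(prons) for prons in lex.values())  # word/pron pairs
print(num_orthos, num_prons)  # 2 3
```

A word with more than one pronunciation contributes one orthography but several prons, which is why a lexicon can report more prons than orthographies.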
add_from_strings(iterable)

Add word/prons to a lexicon from a string source.

iterable should yield strings of tokens separated by spaces. Each string represents one word/pron combination; the first token of the string is the word, the remaining tokens collectively are the pron.

add_production(*dummy)
add_word_with_pron(word, phones)

Add word/prons to a lexicon.

word should be a string and phones an iterable of strings.

size

The size of the grammar. We follow Robert Moore in calculating the size of the grammar: the size is the number of non-terminal symbols plus the sum of the lengths of the right-hand-side sequences over all the productions in the grammar. By counting each non-terminal just once, instead of once for each of its productions, this size statistic more closely tracks storage requirements of actual implementations of grammar structures. An empty right-hand-side sequence (epsilon) is counted as having length one.

update_production(lhs, rhs_set)

Add a set of productions to the CFG. The lhs argument is an immutable object that is the left-hand-side (non-terminal) for each of the productions. The rhs_set argument is a possibly-empty iterable of rhs sequences. Each rhs sequence is an iterable of immutable objects (symbols) that are the sequence of non-terminals and terminals that make up the right-hand-side of the given production. An empty rhs is used to add an epsilon production. The productions for a given non-terminal are treated as a set; this means that duplicate right-hand sides are ignored. See also add_production().

>>> builder = CfgBuilder()
>>> builder.add_production('A', ('x', 'y', 'zoo'))
>>> builder.update_production('B', ((), ('b',), ('B', 'b')))
>>> builder.size
9
>>> builder.update_production('Cows', (('A',), ('B',)))
>>> builder.size
12
>>> builder.add_production('Cows', ('A',))
>>> builder.size
12
>>> builder.update_production('Cows', (('B',), ('A',), ))
>>> builder.size
12
>>> 'A' in builder, 'zoo' in builder, 'Moo' in builder
(True, True, False)