Onyx logo

Previous topic

Evaluation Modules

Next topic

onyx.util.rocutils – Utilities for generating ROC and DET curves

This Page

onyx.util.alignment – Minimum edit distance alignment of lattices and sequences

Here are some alignment examples. We begin by showing simple string alignment. The nbest_align call returns an iterator which will give us alignments of two sequences in NBest order:

>>> iter = nbest_align('ABCD', 'ABEDF')

Use the function next() to get an alignment:

>>> iter.next()
(2, (('A', 'A', 0), ('B', 'B', 0), ('C', 'E', 1), ('D', 'D', 0), (None, 'F', 1)))

The alignment is a pair of (cost, alignment); here cost = 2. The alignment is a sequence of triples which represent the edits. Each triple consists of a token from string1 (or None for a deletion), a token from string2 (or None for an insertion), and the cost of that edit. See help(nbest_align) for more details.

Mixed sequences may be used:

>>> iter = nbest_align((1, 2, 'foo', True), (3, 2, 'bar', True))
>>> iter.next()
(2, ((1, 3, 1), (2, 2, 0), ('foo', 'bar', 1), (True, True, 0)))

A few more simple examples:

>>> iter = nbest_align('abc', 'adc')
>>> iter.next()
(1, (('a', 'a', 0), ('b', 'd', 1), ('c', 'c', 0)))
>>> iter.next()
(2, (('a', 'a', 0), (None, 'd', 1), ('b', None, 1), ('c', 'c', 0)))
>>> iter.next()
(2, (('a', 'a', 0), ('b', None, 1), (None, 'd', 1), ('c', 'c', 0)))
>>> iter.next()
(3, (('a', None, 1), (None, 'a', 1), ('b', 'd', 1), ('c', 'c', 0)))
>>> iter = nbest_align('abcy', 'xadc')
>>> iter.next()
(3, ((None, 'x', 1), ('a', 'a', 0), ('b', 'd', 1), ('c', 'c', 0), ('y', None, 1)))

Here we use the graphtools interface to construct a double-diamond lattice (see below). The first arg is the sequence of node labels, which are not used so they’re set to None. The remaining three args specify the arcs as a sequence of start nodes, a sequence of end nodes, and a sequence of arc labels. See help(GraphTables) for more information. The source is the node in the upper left corner, the sink is in the lower right.

*---A---*---E---*
|       |       |
B       D       F
|       |       |
*---C---*---G---*
>>> gt3 = GraphTables(((None,)*6, (0,0,1,1,2,3,4), (1,2,3,4,4,5,5), ('A','B','E','D','C','F','G')))
>>> l3 = FrozenGraph(gt3)
>>> l3.is_lattice
True

Here’s another lattice:

*---A---*
|       |
B       C
|       |
*---D---*
>>> gt4 = GraphTables(((None,)*4, (0,0,1,2), (1,2,3,3), ('A','B','C','D')))
>>> l4 = FrozenGraph(gt4)
>>> l4.is_lattice
True

Here we align the two lattices. A lattice alignment is a set of operations which will transform some path in one lattice into some path in another lattice.

Here’s a 1-Best alignment:

>>> align(l3, l4)
(2, (('A', 'A', 0), ('E', 'C', 1), ('F', None, 1)))

Here’s an N-Best alignment of the same two lattices:

>>> iter = nbest_align(l3, l4)

Several alignments have cost 2, which is the minimum. Ties will be broken in a deterministic but undefined order.

>>> iter.next()
(2, (('A', 'A', 0), ('E', None, 1), ('F', 'C', 1)))
>>> iter.next()
(2, (('A', 'A', 0), ('E', 'C', 1), ('F', None, 1)))
>>> iter.next()
(2, (('A', 'A', 0), ('D', None, 1), ('G', 'C', 1)))

Look Ma, no arcs!

>>> gt5 = GraphTables(((None,), (), (), ()))
>>> l5 = FrozenGraph(gt5)
>>> l5.is_lattice
True
>>> align(l5, l3)
(3, ((None, 'A', 1), (None, 'E', 1), (None, 'F', 1)))
>>> iter = nbest_align(l5, l3)
>>> iter.next()
(3, ((None, 'A', 1), (None, 'E', 1), (None, 'F', 1)))
>>> iter = nbest_align(l3, l5)
>>> iter.next()
(3, (('A', None, 1), ('E', None, 1), ('F', None, 1)))

Lattices may also be aligned with sequences:

>>> iter = nbest_align(l3, 'ABEDF')
>>> iter.next()
(2, (('A', 'A', 0), (None, 'B', 1), ('E', 'E', 0), (None, 'D', 1), ('F', 'F', 0)))
>>> iter = nbest_align(l3, 'ABCD')
>>> iter.next()
(2, ((None, 'A', 1), ('B', 'B', 0), ('C', 'C', 0), ('G', 'D', 1)))
>>> gt = GraphTables(((None,)*5, (0,0,1,2,3), (1,2,3,3,4), ('A','B','C','D','F')))
>>> g = FrozenGraph(gt)
>>> g
FrozenGraph(GraphTables(((None, None, None, None, None), (0, 0, 1, 2, 3), (1, 2, 3, 3, 4), ('A', 'B', 'C', 'D', 'F'))))
>>> g.is_lattice
True
>>> iter = nbest_align(l3, g)
>>> iter.next()
(1, (('A', 'A', 0), ('E', 'C', 1), ('F', 'F', 0)))
>>> iter.next()
(2, (('A', 'A', 0), (None, 'C', 1), ('E', None, 1), ('F', 'F', 0)))
>>> iter.next()
(2, (('A', 'A', 0), ('E', None, 1), (None, 'C', 1), ('F', 'F', 0)))
>>> iter.next()
(2, (('A', 'A', 0), ('D', 'C', 1), ('G', 'F', 1)))
>>> iter = nbest_align('ABCD', l3)
>>> iter.next()
(2, (('A', None, 1), ('B', 'B', 0), ('C', 'C', 0), ('D', 'G', 1)))

Specifying substring_cost=True means there is no penalty for deletions from the second sequence (or lattice) that occur before any characters of the first sequence are used or after they have all been used. This provides a way to search flexibly for a small string or lattice within a larger string or lattice. Note, e.g., that the 4 final deletions all have cost 0 below.

>>> align('abc', 'abcdefg', substring_cost=True)
(0, (('a', 'a', 0), ('b', 'b', 0), ('c', 'c', 0), (None, 'd', 0), (None, 'e', 0), (None, 'f', 0), (None, 'g', 0)))
>>> iter = nbest_align('abc', 'abcdefg', substring_cost=True)
>>> iter.next()
(0, (('a', 'a', 0), ('b', 'b', 0), ('c', 'c', 0), (None, 'd', 0), (None, 'e', 0), (None, 'f', 0), (None, 'g', 0)))
>>> iter = nbest_align('xadc', 'abcyxabc', substring_cost = True)
>>> iter.next()
(1, ((None, 'a', 0), (None, 'b', 0), (None, 'c', 0), (None, 'y', 0), ('x', 'x', 0), ('a', 'a', 0), ('d', 'b', 1), ('c', 'c', 0)))
>>> iter = nbest_align('ABCD', l3, substring_cost = True)
>>> iter.next()
(2, (('A', None, 1), ('B', 'B', 0), ('C', 'C', 0), ('D', None, 1), (None, 'G', 0)))
>>> iter.next()
(2, (('A', None, 1), ('B', 'B', 0), ('C', 'C', 0), ('D', 'G', 1)))
>>> iter.next()
(2, (('A', 'A', 0), ('B', None, 1), ('C', None, 1), ('D', 'D', 0), (None, 'G', 0)))

Clients may provide their own cost function; see help(nbest_align) for details.

>>> def subst_ok(x,y) : return 1 if x is None or y is None else 0.5 if x != y else 0
>>> align('abcy', 'xadc', subst_ok)
(2.0, (('a', 'x', 0.5), ('b', 'a', 0.5), ('c', 'd', 0.5), ('y', 'c', 0.5)))
>>> iter = nbest_align('abcy', 'xadc', subst_ok)
>>> iter.next()
(2.0, (('a', 'x', 0.5), ('b', 'a', 0.5), ('c', 'd', 0.5), ('y', 'c', 0.5)))
>>> iter.next()
(2.5, ((None, 'x', 1), ('a', 'a', 0), ('b', 'd', 0.5), ('c', 'c', 0), ('y', None, 1)))

The remainder of this test is a caution to clients about NBest alignments. Although you might think that matching ‘xabcp’ to the ‘abc’ right after ‘234’ is the ‘obvious’ second choice, this algorithm has other ideas. You may wish to consider using some scheme whereby subsequent matches are filtered based on overlapping with some earlier match. Alternatively, you might think about recursively finding matches of a short string within the pieces of the longer string to the left and right of the first match.

>>> iter = nbest_align('xabcp', '234abcyxadcpqrs', substring_cost = True)
>>> iter.next()
(1, ((None, '2', 0), (None, '3', 0), (None, '4', 0), (None, 'a', 0), (None, 'b', 0), (None, 'c', 0), (None, 'y', 0), ('x', 'x', 0), ('a', 'a', 0), ('b', 'd', 1), ('c', 'c', 0), ('p', 'p', 0), (None, 'q', 0), (None, 'r', 0), (None, 's', 0)))
>>> iter.next()
(2, ((None, '2', 0), (None, '3', 0), (None, '4', 0), (None, 'a', 0), (None, 'b', 0), (None, 'c', 0), (None, 'y', 0), (None, 'x', 0), ('x', None, 1), ('a', 'a', 0), ('b', 'd', 1), ('c', 'c', 0), ('p', 'p', 0), (None, 'q', 0), (None, 'r', 0), (None, 's', 0)))
>>> iter.next()
(2, ((None, '2', 0), (None, '3', 0), (None, '4', 0), (None, 'a', 0), (None, 'b', 0), (None, 'c', 0), (None, 'y', 0), ('x', 'x', 0), ('a', 'a', 0), (None, 'd', 1), ('b', None, 1), ('c', 'c', 0), ('p', 'p', 0), (None, 'q', 0), (None, 'r', 0), (None, 's', 0)))
>>> iter.next()
(2, ((None, '2', 0), (None, '3', 0), (None, '4', 0), (None, 'a', 0), (None, 'b', 0), (None, 'c', 0), (None, 'y', 0), ('x', 'x', 0), ('a', 'a', 0), ('b', None, 1), (None, 'd', 1), ('c', 'c', 0), ('p', 'p', 0), (None, 'q', 0), (None, 'r', 0), (None, 's', 0)))
onyx.util.alignment.align(a1, a2, cost_fn=None, substring_cost=False)

Find the best alignment path through a pair of iterables or lattices, a1 and a2. a1 and a2 may either be iterables or lattices represented as FrozenGraph objects. Mixing of the two argument types is allowed. The alignment is done using either the values returned by the iterables or the labels on the arcs of the FrozenGraph. FrozenGraph arguments must return True from is_lattice().

The alignment path is a sequence of pairs <t1, t2> where the elements of each pair are either tokens from a1 and a2, respectively, or None. Every pair will have at least one real token; <None,None> pairs will never occur.

The path returned will have the following properties:

  1. Taking only the first or only the second element of each pair, and removing the Nones, each path is a simple path from the start of the corresponding lattice to its end.
  2. Considering each pair as an edit, where a (token,token) pair is either an exact match or a substitution, a (token, None) pair is an insertion, and a (None, token) pair is a deletion, and given a cost function on such pairs, the path returned has minimal cost.

The alignment path returned has the form: (total_cost, (edit1, edit2, ... editN)) where each edit is a triple (t1, t2, edit_cost).

cost_fn may be used to provide costs for substitutions, insertions, and deletions, and must be a callable taking two arguments which are either labels from a1 and a2, respectively, or None, and return a cost C >= 0. If the first argument is None, the cost should be that of an insertion. If the second argument is None, the cost should be that of a deletion. In no case will both arguments be None. If cost_fn is not given, the standard function using 0 for an exact match and 1 for all other costs will be used.

If substring_cost is True, align one lattice (a1) within the other (a2). This works like ordinary alignment except that any number of elements from a1 may be freely deleted before the first substitution or insertion and after the last substitution or insertion. Note that insertions from a2 are never free.

Please see the help for this module for a collection of examples.

onyx.util.alignment.nbest_align(a1, a2, cost_fn=None, substring_cost=False)

Find alignment paths through a pair of iterables or lattices, a1 and a2. The arguments a1 and a2 may either be iterables or lattices represented as FrozenGraph objects. Mixing of the two argument types is allowed. The alignment is done using either the values returned by the iterables or the labels on the arcs of the FrozenGraph. FrozenGraph arguments must have is_lattice True.

An alignment path is a sequence of pairs <t1, t2> where the elements of each pair are either tokens from a1 and a2, respectively, or None. Every pair will have at least one real token; <None,None> pairs will never occur.

The paths found by the algorithm have the following properties:

  1. Taking only the first or only the second element of each pair, and removing the Nones, each path is a simple path from the start of the corresponding lattice to the end.
  2. Considering each pair as an edit, where a (token,token) pair is either an exact match or a substitution, a (token, None) pair is an insertion, and a (None, token) pair is a deletion, and given a cost function on such pairs, the paths are returned in order of cost, with minimal cost paths returned first.

The iterator returned by this function returns alignment paths in the form: (total_cost, (edit1, edit2, ... editN)) where each edit is a triple (t1, t2, edit_cost). The paths are returned in non-decreasing order of total cost.

cost_fn may be used to provide costs for substitutions, insertions, and deletions, and must take two arguments which are either labels from a1 and a2, respectively, or None, and return a cost C >= 0. If the first argument is None, the cost should be that of an insertion. If the second argument is None, the cost should be that of a deletion. In no case will both arguments be None. If cost_fn is not given, the standard function using 0 for an exact match and 1 for all other costs will be used.

If substring_cost is True, align one lattice (a1) within the other (a2). This works like ordinary alignment except that any number of elements from 11 may be freely deleted before the first substitution or insertion and after the last substitution or insertion. Note that insertions from a2 are never free.

Please see the help for this module for a collection of examples.