Bases: object
A tokenizing reader for raw text files (or any other source of text lines). Functions as an iterator producing tuples of whitespace-separated tokens from the stream of lines provided. Blank lines of text are always skipped. There is no support for skipping comments. For a slower, more fully featured reader, see RawTextReader.
>>> import cStringIO
>>> file_lines = '''
...
... here is some real data
... more text w/ extra whitespace and more
... more and more
... more and more and more and more, followed by whitespace
...
... '''
>>> rtr0 = FastRawTextReader(cStringIO.StringIO(file_lines))
>>> for line in rtr0:
...     print rtr0.current_line_number, line
3 ('here', 'is', 'some', 'real', 'data')
4 ('more', 'text', 'w/', 'extra', 'whitespace', 'and', 'more')
5 ('more', 'and', 'more')
6 ('more', 'and', 'more', 'and', 'more', 'and', 'more,', 'followed', 'by', 'whitespace')
stream should be a source of text lines. name should be the name of the source, e.g. a filename, to be used in error reporting and the like.
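A hedged construction sketch (not from the module's doctests); whether name is accepted as a keyword argument is an assumption here:

    reader = FastRawTextReader(open('measurements.txt'), name='measurements.txt')
    for tokens in reader:
        pass  # each tokens value is a tuple of whitespace-separated strings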
Get the next line in the iterable as a tuple of tokens.
Bases: object
A tokenizing reader for raw text files (or any other source of text lines). Functions as an iterator producing tuples of white-space separated tokens from the stream of lines provided.
Blank lines of text are always skipped. Optional comment_symbol specifies the string (or character) that marks a line of text as a comment to be discarded. Optional comment_must_be_token, if True, requires the first token on a line to be exactly equal to comment_symbol for the line to be discarded as a comment; otherwise (the default), the line is discarded as a comment if its first token merely starts with comment_symbol.
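As a rough sketch of the comment test just described (an illustration of the rule, not the class's actual implementation):

    def is_comment(first_token, comment_symbol, comment_must_be_token):
        # comment_must_be_token=True: the first token must equal the symbol exactly
        if comment_must_be_token:
            return first_token == comment_symbol
        # default: the line is a comment if its first token merely starts with the symbol
        return first_token.startswith(comment_symbol)

The doctest below exercises the default behavior: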
>>> import cStringIO
>>> file_lines = '''
... # some comments, preceded and followed by blanks
...
... here is some real data
... # more comments, regardless of comment_must_be_token
... more text w/ extra whitespace and more
... more and more
... more and more and more and more, followed by whitespace
...
... '''
>>> rtr0 = RawTextReader(cStringIO.StringIO(file_lines), comment_symbol='#')
>>> for line in rtr0:
...     print rtr0.current_line_number, line
4 ('here', 'is', 'some', 'real', 'data')
6 ('more', 'text', 'w/', 'extra', 'whitespace', 'and', 'more')
7 ('more', 'and', 'more')
8 ('more', 'and', 'more', 'and', 'more', 'and', 'more,', 'followed', 'by', 'whitespace')
Get the next line in the iterable as a tuple of tokens.
Bases: object
Mix this into your token-based parser to get consistent and easy-to-use error messaging.
For best operation, stream should be an iterable source with the attributes current_line_number, current_filename, and current_line_contents. See, e.g., RawTextReader and YamldataReader.
This mixin provides some useful functions for parsing, checking, and generating errors.
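If your line source is not one of the readers above, a minimal adapter exposing those three attributes might look like the following sketch (an assumption about what the mixin uses, not part of the module; RawTextReader, used in the doctest below, already provides them, and blank-line and comment skipping are omitted for brevity):

    class ListLineSource(object):
        def __init__(self, lines, filename='<memory>'):
            self.current_filename = filename
            self.current_line_number = 0
            self.current_line_contents = ''
            self._iter = iter(lines)
        def __iter__(self):
            return self
        def next(self):  # Python 2 iterator protocol
            line = self._iter.next()
            self.current_line_number += 1
            self.current_line_contents = line
            return tuple(line.split())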
>>> lines = '''3.25 4.25 5.25bad_token
... p q r
... x y z'''
>>> import cStringIO
>>> stream = cStringIO.StringIO(lines)
>>> rtr0 = RawTextReader(stream)
>>> SimpleParser = type('SimpleParser', (TextParserErrorMixin,), dict())
Initialize your parser with a stream; your class’s __init__() should pass the stream to the mixin’s __init__().
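For example, a concrete parser built on the mixin might look like this sketch (the assumption, supported by the doctest below, is that the mixin's __init__() takes the stream as its only argument):

    class HeaderParser(TextParserErrorMixin):
        def __init__(self, stream):
            # hand the stream to the mixin so it can report line numbers in errors
            TextParserErrorMixin.__init__(self, stream)
        def read_count(self):
            # expects a line of the form: count 12
            tokens = self.next_line()
            self.verify_token_count(2, tokens, 'count line')
            self.verify_thing('count', tokens[0], 'count keyword')
            return self.convert_or_raise(tokens[1], int)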
>>> sp0 = SimpleParser(rtr0)
Get tokens from lines of text; the RawTextReader gives back tuples of strings. next_line() will pull in the next line from the stream.
>>> tokens = sp0.next_line()
>>> tokens
('3.25', '4.25', '5.25bad_token')
Convert read values to other types with useful error messages when conversion fails:
>>> sp0.convert_or_raise(tokens[0], float)
3.25
>>> sp0.convert_or_raise(tokens[2], float)
Traceback (most recent call last):
...
TextdataParseFailure: unexpected token '5.25bad_token'; failed in conversion to <type 'float'> on or near line 1 in file <unknown>
Complete line: [[3.25 4.25 5.25bad_token
]]
>>> tokens = sp0.next_line()
>>> tokens
('p', 'q', 'r')
Verify that you read the specific things you are looking for:
>>> sp0.verify_thing('p', tokens[0], 'first line')
>>> sp0.verify_thing('s', tokens[2], 'last item in first line')
Traceback (most recent call last):
...
TextdataParseFailure: Expected to read last item in first line s, but read r on or near line 2 in file <unknown>
Complete line: [[ p q r
]]
Verify that you get exactly or at least as many items as you expect:
>>> sp0.verify_token_count(3, tokens, 'first line')
>>> sp0.verify_token_count(4, tokens, 'first line')
Traceback (most recent call last):
...
TextdataParseFailure: Expected first line with 4 tokens, but read 3 tokens on or near line 2 in file <unknown>
Complete line: [[ p q r
]]
>>> sp0.verify_token_count_min(2, tokens, 'first line')
>>> sp0.verify_token_count_min(4, tokens, 'first line')
Traceback (most recent call last):
...
TextdataParseFailure: Expected first line with at least 4 tokens, but read 3 tokens on or near line 2 in file <unknown>
Complete line: [[ p q r
]]
>>> tokens = sp0.next_line()
>>> tokens
('x', 'y', 'z')
Unless you say it's OK, next_line() fails with a useful error at EOF:
>>> tokens = sp0.next_line(eof_ok=True)
>>> tokens is None
True
>>> tokens = sp0.next_line()
Traceback (most recent call last):
...
TextdataParseFailure: Unexpected end of stream on or near line 3 in file <unknown>
Complete line: [[ x y z]]
Get the next line from the stream. Raises an error at EOF unless eof_ok is True, in which case it returns None.
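A typical read loop using eof_ok might look like this sketch, where parser is any object using the mixin and handle_tokens is a hypothetical per-line handler:

    while True:
        tokens = parser.next_line(eof_ok=True)
        if tokens is None:
            break  # clean end of stream
        handle_tokens(tokens)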
Clients and subclasses may call this function to raise a consistent error. If the stream has any of the attributes current_line_number, current_filename, or current_line_contents, additional information will be added to the error string.
Bases: exceptions.StandardError
Base exception for all Textdata exceptions
Bases: onyx.textdata.tdutil.TextdataException
Raised when text structure or preconditions lead to a parsing failure
Bases: onyx.textdata.tdutil.TextdataException
Raised when number of tokens is outside a specified range
atlimit is the limit of the legal zero-based index
Return a chained iterator over the sequence of files, or the default.
Return a tuple of iterators over the sequence of files, or the default.
Return a chained iterator over the sequence of files, or the default.
Return an izipped iterator over the sequence of files, or the default.
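The names suggest these helpers wrap the standard itertools behaviors; as a rough analogy only (not the helpers' actual signatures), chaining and izipping a pair of open files looks like:

    import itertools
    files = (open('a.txt'), open('b.txt'))
    chained = itertools.chain(*files)  # all lines of a.txt, then all lines of b.txt
    zipped = itertools.izip(*files)    # tuples pairing line i of a.txt with line i of b.txt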
Return a generator that normalizes each line from stringiterable into an instance of tdtokens. In addition to being a list of tokens, each tdtokens has three attributes: name, the value of the optional name argument to tdnormalize if given, else '<unknown>'; line_number, the line number of the line of text from which the list of tokens was constructed; and text, the unnormalized line of text from which the list of tokens was constructed.
Blank lines of text are always skipped. Optional comment_symbol specifies the string (or character) that marks a line of text as a comment to be discarded. Optional comment_must_be_token, if True, requires the first token on a line to be exactly equal to comment_symbol for the line to be discarded as a comment; otherwise (the default), the line is discarded as a comment if its first token merely starts with comment_symbol.
>>> import cStringIO
>>> file_lines = '''
... # some comments, preceded and followed by blanks
...
... here's some real data
... # more comments, regardless of comment_must_be_token
... more text w/ extra whitespace and more
... more and more
... #not discarded if comment_must_be_token=True
... more and more and more and more, followed by whitespace
...
... '''
>>> for parts in tdnormalize(cStringIO.StringIO(file_lines), name='file_lines', comment_symbol='#',):
...     print parts.name, parts.line_number, repr(parts.text.strip()), parts
file_lines 4 "here's some real data" ["here's", 'some', 'real', 'data']
file_lines 6 'more text w/ extra whitespace and more' ['more', 'text', 'w/', 'extra', 'whitespace', 'and', 'more']
file_lines 7 'more and more' ['more', 'and', 'more']
file_lines 9 'more and more and more and more, followed by whitespace' ['more', 'and', 'more', 'and', 'more', 'and', 'more,', 'followed', 'by', 'whitespace']
With comment_must_be_token=True, we get an extra set of tokens:
>>> for parts in tdnormalize(cStringIO.StringIO(file_lines), name='file_lines_comment_must_be_token', comment_symbol='#', comment_must_be_token=True):
...     print parts.name, parts.line_number, repr(parts.text.strip()), parts
file_lines_comment_must_be_token 4 "here's some real data" ["here's", 'some', 'real', 'data']
file_lines_comment_must_be_token 6 'more text w/ extra whitespace and more' ['more', 'text', 'w/', 'extra', 'whitespace', 'and', 'more']
file_lines_comment_must_be_token 7 'more and more' ['more', 'and', 'more']
file_lines_comment_must_be_token 8 '#not discarded if comment_must_be_token=True' ['#not', 'discarded', 'if', 'comment_must_be_token=True']
file_lines_comment_must_be_token 9 'more and more and more and more, followed by whitespace' ['more', 'and', 'more', 'and', 'more', 'and', 'more,', 'followed', 'by', 'whitespace']
Bases: list
A specialization of list that has some attributes to track the provenance of its contents. The required args name, line_number, and text are made available as attributes of the same names. The optional *args is an iterable from which the list is constructed.
>>> x = tdtokens('foo', 23, 'yowza', xrange(10))
>>> x
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> x.name
'foo'
>>> x.line_number
23
>>> x.text
'yowza'
L.append(object) – append object to end
L.count(value) -> integer – return number of occurrences of value
L.extend(iterable) – extend list by appending elements from the iterable
L.index(value, [start, [stop]]) -> integer – return first index of value. Raises ValueError if the value is not present.
L.insert(index, object) – insert object before index
L.pop([index]) -> item – remove and return item at index (default last). Raises IndexError if list is empty or index is out of range.
L.remove(value) – remove first occurrence of value. Raises ValueError if the value is not present.
L.reverse() – reverse IN PLACE
L.sort(cmp=None, key=None, reverse=False) – stable sort IN PLACE; cmp(x, y) -> -1, 0, 1
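Since tdtokens is a list subclass, the inherited methods above operate on the token list itself while the provenance attributes are preserved; a small hedged illustration:

    toks = tdtokens('data.txt', 7, 'a b c', ['a', 'b', 'c'])
    toks.append('d')               # ordinary list mutation
    assert toks == ['a', 'b', 'c', 'd']
    assert toks.line_number == 7   # provenance attributes still intact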