onyx.textdata.tdutil – Some low-level utilities for working with textual data files.

XXX Needs real testing.
>>> True
True
class onyx.textdata.tdutil.FastRawTextReader(stream, name='<unknown>')

Bases: object

A tokenizing reader for raw text files (or any other source of text lines). Functions as an iterator producing tuples of white-space separated tokens from the stream of lines provided. Blank lines of text are always skipped. There is no support for skipping comments. For a slower, more fully-functional reader, see RawTextReader.

>>> import cStringIO
>>> file_lines = '''
...
...   here is some real data
...      more   text w/ extra whitespace     and      more
...   more and more
...   more and more and more and more, followed by whitespace
...   
...   '''
>>> rtr0 = FastRawTextReader(cStringIO.StringIO(file_lines))
>>> for line in rtr0:
...    print rtr0.current_line_number, line
3 ('here', 'is', 'some', 'real', 'data')
4 ('more', 'text', 'w/', 'extra', 'whitespace', 'and', 'more')
5 ('more', 'and', 'more')
6 ('more', 'and', 'more', 'and', 'more', 'and', 'more,', 'followed', 'by', 'whitespace')

stream should be a source of text lines. name should be the name of the source, e.g. a filename, to be used in error reporting and the like.

current_filename
current_line_contents
current_line_number
next()

Get the next line in the iterable as a tuple of tokens.
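
A brief illustrative sketch of next() and the name argument, not part of the module's own doctests; it assumes that current_filename simply echoes the name argument given to the constructor:

>>> rtr1 = FastRawTextReader(cStringIO.StringIO('one two three\n'), name='data.txt')
>>> rtr1.next()
('one', 'two', 'three')
>>> rtr1.current_line_number
1
>>> rtr1.current_filename   # assumed to echo the name argument
'data.txt'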

class onyx.textdata.tdutil.RawTextReader(stringiterable, name='<unknown>', comment_symbol=None, comment_must_be_token=False)

Bases: object

A tokenizing reader for raw text files (or any other source of text lines). Functions as an iterator producing tuples of white-space separated tokens from the stream of lines provided.

Blank lines of text are always skipped. The optional comment_symbol specifies the string (or character) that marks a line of text as a comment to be discarded. If the optional comment_must_be_token is True, a line is discarded as a comment only when its first token is exactly equal to comment_symbol; otherwise (the default), a line is discarded as a comment whenever its first token merely starts with comment_symbol.

>>> import cStringIO
>>> file_lines = '''
...   # some comments, preceded and followed by blanks
...
...   here is some real data
...     # more comments, regardless of comment_must_be_token
...      more   text w/ extra whitespace     and      more
...   more and more
...   more and more and more and more, followed by whitespace
...   
...   '''
>>> rtr0 = RawTextReader(cStringIO.StringIO(file_lines), comment_symbol='#')
>>> for line in rtr0:
...    print rtr0.current_line_number, line
4 ('here', 'is', 'some', 'real', 'data')
6 ('more', 'text', 'w/', 'extra', 'whitespace', 'and', 'more')
7 ('more', 'and', 'more')
8 ('more', 'and', 'more', 'and', 'more', 'and', 'more,', 'followed', 'by', 'whitespace')
next()

Get the next line in the iterable as a tuple of tokens.
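
An additional illustrative sketch (not part of the module's own doctests) showing the effect of comment_must_be_token=True: a line whose first token merely starts with the comment symbol is kept rather than discarded.

>>> comment_lines = '''
...   # a stand-alone comment token
...   #glued text is kept when comment_must_be_token=True
...   real data here
...   '''
>>> rtr1 = RawTextReader(cStringIO.StringIO(comment_lines), comment_symbol='#', comment_must_be_token=True)
>>> for line in rtr1:
...    print rtr1.current_line_number, line
3 ('#glued', 'text', 'is', 'kept', 'when', 'comment_must_be_token=True')
4 ('real', 'data', 'here')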

class onyx.textdata.tdutil.TextParserErrorMixin(stream)

Bases: object

Mix this into your token-based parser to get consistent and easy-to-use error messaging.

For best operation, stream should be an iterable source with the attributes: current_line_number, current_filename, and current_line_contents. See, e.g. RawTextReader and YamldataReader.

This mixin provides some useful functions for parsing, checking, and generating errors.

>>> lines = '''3.25 4.25 5.25bad_token
...            p q r
...            x y z'''
>>> import cStringIO
>>> stream = cStringIO.StringIO(lines)
>>> rtr0 = RawTextReader(stream)
>>> SimpleParser = type('SimpleParser', (TextParserErrorMixin,), dict())

Initialize your parser with a stream; your class’s __init__() should pass the stream to the mixin’s __init__().

>>> sp0 = SimpleParser(rtr0)

Get tokens from lines of text; the RawTextReader gives back tuples of strings. next_line() pulls in the next line from the stream.

>>> tokens = sp0.next_line()
>>> tokens
('3.25', '4.25', '5.25bad_token')

Convert read values to other types with useful error messages when conversion fails:

>>> sp0.convert_or_raise(tokens[0], float)
3.25
>>> sp0.convert_or_raise(tokens[2], float)
Traceback (most recent call last):
...
TextdataParseFailure: unexpected token '5.25bad_token'; failed in conversion to <type 'float'> on or near line 1 in file <unknown>
 Complete line: [[3.25 4.25 5.25bad_token
]]
>>> tokens = sp0.next_line()
>>> tokens
('p', 'q', 'r')

More generally, verify that you read the specific things you are looking for:

>>> sp0.verify_thing('p', tokens[0], 'first line')
>>> sp0.verify_thing('s', tokens[2], 'last item in first line')
Traceback (most recent call last):
...
TextdataParseFailure: Expected to read last item in first line s, but read r on or near line 2 in file <unknown>
 Complete line: [[           p q r
]]

Verify that you get exactly or at least as many items as you expect:

>>> sp0.verify_token_count(3, tokens, 'first line')
>>> sp0.verify_token_count(4, tokens, 'first line')
Traceback (most recent call last):
...
TextdataParseFailure: Expected first line with 4 tokens, but read 3 tokens on or near line 2 in file <unknown>
 Complete line: [[           p q r
]]
>>> sp0.verify_token_count_min(2, tokens, 'first line')
>>> sp0.verify_token_count_min(4, tokens, 'first line')
Traceback (most recent call last):
...
TextdataParseFailure: Expected first line with at least 4 tokens, but read 3 tokens on or near line 2 in file <unknown>
 Complete line: [[           p q r
]]
>>> tokens = sp0.next_line()
>>> tokens
('x', 'y', 'z')

Unless you say it’s OK, next_line() fails with a useful error at EOF:

>>> tokens = sp0.next_line(eof_ok=True)
>>> tokens is None
True
>>> tokens = sp0.next_line()
Traceback (most recent call last):
...
TextdataParseFailure: Unexpected end of stream on or near line 3 in file <unknown>
 Complete line: [[           x y z]]
convert_or_raise(value, new_type)
next_line(eof_ok=False)

Get the next line from the stream. Errors on EOF unless eof_ok is True, in which case returns None.

raise_parsing_error(err_string)

Clients and subclasses may call this function to raise a consistent error. If the stream has any of the attributes: current_line_number, current_filename, or current_line_contents, additional information will be added to the error string.
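
Continuing the SimpleParser example from the class docstring, a hedged sketch; the exact message formatting is assumed to match the errors shown earlier:

>>> rtr1 = RawTextReader(cStringIO.StringIO('alpha beta\n'))
>>> sp1 = SimpleParser(rtr1)
>>> sp1.next_line()
('alpha', 'beta')
>>> sp1.raise_parsing_error('unsupported record')   # assumed message formatting
Traceback (most recent call last):
...
TextdataParseFailure: unsupported record on or near line 1 in file <unknown>
 Complete line: [[alpha beta
]]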

verify_thing(expected, found, what)
verify_token_count(expected, tokens, what)
verify_token_count_min(expected, tokens, what)
exception onyx.textdata.tdutil.TextdataException

Bases: exceptions.StandardError

Base exception for all Textdata exceptions

args
message
exception onyx.textdata.tdutil.TextdataParseFailure

Bases: onyx.textdata.tdutil.TextdataException

Raised when text structure or preconditions lead to a parsing failure

args
message
exception onyx.textdata.tdutil.TextdataSizeError

Bases: onyx.textdata.tdutil.TextdataException

Raised when number of tokens is outside a specified range

args
message
onyx.textdata.tdutil.tdcheckatleast(atleast, tokensiterable)
onyx.textdata.tdutil.tdcheckatmost(atmost, tokensiterable)
onyx.textdata.tdutil.tdchecksizelimit(atlimit, atspec, tokensiterable)

atlimit is the limit of the legal zero-based index.
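
These check functions ship without doctests; the sketch below is an assumption based on their names and on the TextdataSizeError docstring (namely, that the checks raise TextdataSizeError when the token count is outside the specified range), not a verified transcript:

>>> tokens = ('a', 'b', 'c')
>>> try:
...     tdcheckatleast(5, tokens)   # assumed to raise TextdataSizeError on too few tokens
... except TextdataSizeError:
...     print 'fewer than 5 tokens'
fewer than 5 tokens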

onyx.textdata.tdutil.tddict(keycolumn, tokensiterable)
onyx.textdata.tdutil.tdfilesiter(filenames, default=sys.stdin)

Return a chained iterator for the sequence of files, or the default.

onyx.textdata.tdutil.tdfilesiters(filenames, default=sys.stdin)

Return a tuple of iterators for the sequence of files, or the default.

onyx.textdata.tdutil.tdfilesiterschain(filenames, default=sys.stdin)

Return a chained iterator for the sequence of files, or the default.

onyx.textdata.tdutil.tdfilesitersizip(filenames, default=sys.stdin)

Return an izip-ed iterator for the sequence of files, or the default.

onyx.textdata.tdutil.tdnormalize(stringiterable, name='<unknown>', comment_symbol=None, comment_must_be_token=False)

Return a generator that normalizes each line from stringiterable into an instance of tdtokens. In addition to being a list of tokens, each tdtokens has three attributes: name, the value of the optional name argument to tdnormalize, if given, else ‘<unknown>’; line_number, the number of the line of text from which the list of tokens was constructed; and text, the unnormalized line of text from which the list of tokens was constructed.

Blank lines of text are always skipped. The optional comment_symbol specifies the string (or character) that marks a line of text as a comment to be discarded. If the optional comment_must_be_token is True, a line is discarded as a comment only when its first token is exactly equal to comment_symbol; otherwise (the default), a line is discarded as a comment whenever its first token merely starts with comment_symbol.

>>> import cStringIO
>>> file_lines = '''
...   # some comments, preceded and followed by blanks
...
...   here's some real data
...     # more comments, regardless of comment_must_be_token
...      more   text w/ extra whitespace     and      more
...   more and more
...   #not discarded if comment_must_be_token=True
...   more and more and more and more, followed by whitespace
...   
...   '''
>>> for parts in tdnormalize(cStringIO.StringIO(file_lines), name='file_lines', comment_symbol='#',):
...   print parts.name, parts.line_number, repr(parts.text.strip()), parts
file_lines 4 "here's some real data" ["here's", 'some', 'real', 'data']
file_lines 6 'more   text w/ extra whitespace     and      more' ['more', 'text', 'w/', 'extra', 'whitespace', 'and', 'more']
file_lines 7 'more and more' ['more', 'and', 'more']
file_lines 9 'more and more and more and more, followed by whitespace' ['more', 'and', 'more', 'and', 'more', 'and', 'more,', 'followed', 'by', 'whitespace']

With comment_must_be_token=True, we get an extra set of tokens:

>>> for parts in tdnormalize(cStringIO.StringIO(file_lines), name='file_lines_comment_must_be_token', comment_symbol='#', comment_must_be_token=True):
...   print parts.name, parts.line_number, repr(parts.text.strip()), parts
file_lines_comment_must_be_token 4 "here's some real data" ["here's", 'some', 'real', 'data']
file_lines_comment_must_be_token 6 'more   text w/ extra whitespace     and      more' ['more', 'text', 'w/', 'extra', 'whitespace', 'and', 'more']
file_lines_comment_must_be_token 7 'more and more' ['more', 'and', 'more']
file_lines_comment_must_be_token 8 '#not discarded if comment_must_be_token=True' ['#not', 'discarded', 'if', 'comment_must_be_token=True']
file_lines_comment_must_be_token 9 'more and more and more and more, followed by whitespace' ['more', 'and', 'more', 'and', 'more', 'and', 'more,', 'followed', 'by', 'whitespace']
class onyx.textdata.tdutil.tdtokens(name, line_number, text, *args)

Bases: list

A specialization of list that has some attributes to track the provenance of its contents. The required arguments name, line_number, and text are made available as attributes of those same names. The optional *args is an iterable from which the list is constructed.

>>> x = tdtokens('foo', 23, 'yowza', xrange(10))
>>> x
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> x.name
'foo'
>>> x.line_number
23
>>> x.text
'yowza'
append

L.append(object) – append object to end

count

L.count(value) -> integer – return number of occurrences of value

extend

L.extend(iterable) – extend list by appending elements from the iterable

index

L.index(value, [start, [stop]]) -> integer – return first index of value. Raises ValueError if the value is not present.

insert

L.insert(index, object) – insert object before index

line_number
name
pop

L.pop([index]) -> item – remove and return item at index (default last). Raises IndexError if list is empty or index is out of range.

remove

L.remove(value) – remove first occurrence of value. Raises ValueError if the value is not present.

reverse

L.reverse() – reverse IN PLACE

sort

L.sort(cmp=None, key=None, reverse=False) – stable sort IN PLACE; cmp(x, y) -> -1, 0, 1

text