trac.util.text – Text manipulation

The Jinja2 template engine

As Jinja2 is mainly a text template engine, the low-level helper functions dealing with this package are placed here.

trac.util.text.jinja2env(**kwargs)

Creates a Jinja2 Environment configured with Trac conventions.

All default parameters can optionally be overridden. The loader parameter is not set by default, so unless the caller sets it, only inline templates can be created from the environment.

Return type:jinja2.Environment
trac.util.text.jinja2template(template, text=False)

Creates a Jinja2 Template from inlined source.

Parameters:
  • template – the template content
  • text – if set to False, the result of the variable expansion will be XML/HTML escaped

The Unicode toolbox

Trac internals deal almost exclusively with Unicode text, represented by unicode objects. The main advantage of using unicode over UTF-8 encoded str (as was the case before version 0.10) is that the text transformation functions in the present module operate safely on individual characters and won’t risk cutting a multi-byte sequence in the middle. Similar issues with Python string handling routines are avoided as well. For example, did you know that “Priorità” is encoded as 'Priorit\xc3\xa0' in UTF-8? Calling strip() on this value in some locales can cut away the trailing \xa0, and the result is no longer valid UTF-8.
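
The byte-level hazard described above can be reproduced in a few lines. This is a minimal sketch written for Python 3, where bytes plays the role of the Python 2 str type and str plays the role of unicode. The explicit rstrip(b'\xa0') call is an assumption that stands in for a locale-dependent strip().

```python
# Python 3 sketch: bytes ~ Python 2 str, str ~ Python 2 unicode.
text = 'Priorit\u00e0'            # "Priorità"
encoded = text.encode('utf-8')    # the last character becomes two bytes

# Stand-in for a locale-aware strip() that treats byte 0xA0 as whitespace.
damaged = encoded.rstrip(b'\xa0')

try:
    damaged.decode('utf-8')
    still_valid = True
except UnicodeDecodeError:
    # The lead byte 0xC3 is left without its continuation byte.
    still_valid = False

# Stripping at the character level leaves the text intact.
unchanged = (text.strip() == text)
```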

The drawback is that most of the outside world, while eventually “Unicode”, is definitely not unicode. This is why we need to convert back and forth between str and unicode at the boundaries of the system. And more often than not we even have to guess which encoding is used in the incoming str strings.

Encoding unicode to str is usually performed directly by calling encode() on the unicode instance, while decoding is preferably left to the to_unicode helper function, which converts str to unicode in a robust way that is guaranteed to succeed.

trac.util.text.to_unicode(text, charset=None)

Convert input to a unicode object.

For a str object, we’ll first try to decode the bytes using the given charset encoding (or UTF-8 if none is specified), then we fall back to the latin1 encoding which might be correct or not, but at least preserves the original byte sequence by mapping each byte to the corresponding unicode code point in the range U+0000 to U+00FF.

For anything else, a simple unicode() conversion is attempted, with special care taken with Exception objects.
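
The decoding strategy described above can be sketched as follows. This is not Trac's implementation: decode_robustly is a hypothetical name, the code targets Python 3 (bytes in place of Python 2 str), and the special handling of Exception objects is left out.

```python
def decode_robustly(value, charset=None):
    """Hypothetical stand-in for the fallback chain described above."""
    if isinstance(value, bytes):
        try:
            # First choice: the given charset, or UTF-8 if none is specified.
            return value.decode(charset or 'utf-8')
        except UnicodeDecodeError:
            # latin1 maps every byte to U+0000..U+00FF, so it cannot fail.
            return value.decode('latin1')
    # Anything else: plain text conversion.
    return str(value)
```

Because the latin1 fallback is lossless at the byte level, the original bytes can always be recovered with encode('latin1') if the guess turns out to be wrong.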

trac.util.text.exception_to_unicode(e, traceback=False)

Convert an Exception to a unicode object.

In addition to what to_unicode provides, this representation of the exception also contains the class name and, optionally, the traceback.

Web utilities

trac.util.text.unicode_quote(value, safe='/')

A unicode aware version of urllib.quote.

Parameters:
  • value – anything that converts to a str. If unicode input is given, it will be UTF-8 encoded.
  • safe – as in quote, the characters that would otherwise be quoted but shouldn’t here (defaults to ‘/’)
trac.util.text.unicode_quote_plus(value, safe='')

A unicode aware version of urllib.quote_plus.

Parameters:
  • value – anything that converts to a str. If unicode input is given, it will be UTF-8 encoded.
  • safe – as in quote_plus, the characters that would otherwise be quoted but shouldn’t here (defaults to ‘’)
trac.util.text.unicode_unquote(value)

A unicode aware version of urllib.unquote.

Parameters:value – UTF-8 encoded str value (for example, as obtained by unicode_quote).
Return type:unicode
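
The quoting helpers above can be approximated with the Python 3 standard library, where urllib.parse replaces the Python 2 urllib functions. This is a sketch under that assumption; quote_utf8, quote_plus_utf8 and unquote_utf8 are hypothetical names, not part of Trac.

```python
from urllib.parse import quote, quote_plus, unquote


def _as_utf8(value):
    # Text is UTF-8 encoded; other non-bytes values are converted to text first.
    if isinstance(value, bytes):
        return value
    return str(value).encode('utf-8')


def quote_utf8(value, safe='/'):
    return quote(_as_utf8(value), safe)


def quote_plus_utf8(value, safe=''):
    return quote_plus(_as_utf8(value), safe)


def unquote_utf8(value):
    # Percent-decoded bytes are interpreted as UTF-8.
    return unquote(value, encoding='utf-8')
```

Note the different defaults: with safe='/' the path separator survives quoting, whereas the quote_plus variant escapes it and turns spaces into plus signs.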
trac.util.text.unicode_urlencode(params, safe='')

A unicode aware version of urllib.urlencode.

Values set to empty are converted to the key alone, without the equal sign.
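
A minimal sketch of the "key alone" behaviour, assuming that "empty" refers to the marker object documented under Miscellaneous. The names EMPTY and urlencode_bare_keys are hypothetical, and the code uses Python 3's urllib.parse rather than Trac's own helpers.

```python
from urllib.parse import quote_plus


class _Empty(str):
    """Marker type: compares equal to '' but is a distinct object."""


EMPTY = _Empty()  # hypothetical stand-in for the marker


def urlencode_bare_keys(params, safe=''):
    if isinstance(params, dict):
        params = sorted(params.items())
    parts = []
    for key, value in params:
        quoted_key = quote_plus(str(key), safe)
        if value is EMPTY:
            # Marker value: emit the key without an equal sign.
            parts.append(quoted_key)
        else:
            parts.append(quoted_key + '=' + quote_plus(str(value), safe))
    return '&'.join(parts)
```

The identity test (is) matters here: an ordinary empty string is still rendered as "key=", and only the marker suppresses the equal sign.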

trac.util.text.quote_query_string(text)

Quote strings for use in a query string.

trac.util.text.javascript_quote(text)

Quote strings for inclusion in single or double quote delimited JavaScript strings.

trac.util.text.to_js_string(text)

Embed the given string in a double quote delimited JavaScript string (conforming to the JSON spec).
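
As a rough illustration of what a JSON-conformant string literal looks like, the standard json module can be used. This is only an approximation: js_string is a hypothetical name, and Trac's function may escape additional characters that json.dumps leaves alone.

```python
import json


def js_string(text):
    # json.dumps wraps the text in double quotes and escapes quotes,
    # backslashes and control characters as required by JSON.
    return json.dumps(text)


literal = js_string('He said "hi"\n')
```

The resulting literal can be parsed back with json.loads, which is a convenient way to check that nothing was lost.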

Console and file system

trac.util.text.getpreferredencoding()

Return the encoding, which is retrieved ahead of time, according to user preference.

We should use this instead of locale.getpreferredencoding(), which is not thread-safe.

trac.util.text.path_to_unicode(path)

Convert a filesystem path to unicode, using the filesystem encoding.

trac.util.text.stream_encoding(stream)

Return the appropriate encoding for the given stream.

trac.util.text.console_print(out, *args, **kwargs)

Output the given arguments to the console, encoding the output as appropriate.

Parameters:kwargs – newline controls whether a newline will be appended (defaults to True)
trac.util.text.printout(*args, **kwargs)

Do a console_print on sys.stdout.

trac.util.text.printerr(*args, **kwargs)

Do a console_print on sys.stderr.

trac.util.text.raw_input(prompt)

Reads one line from the console and converts it to unicode as appropriate.

Miscellaneous

trac.util.text.empty

A special tag object evaluating to the empty string, used as a marker for a missing value (as opposed to a present but empty value).

class trac.util.text.unicode_passwd

Bases: unicode

Conceal the actual content of the string when repr is called.
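
The idea can be sketched with a small string subclass. ConcealedStr is a hypothetical name and the placeholder text is an assumption; in Python 3 the base class is str rather than unicode.

```python
class ConcealedStr(str):
    """String whose repr() hides the actual content."""

    def __repr__(self):
        return '*******'  # placeholder text chosen for this sketch


password = ConcealedStr('s3cret')
logged = 'password=%r' % (password,)
```

The value still behaves like an ordinary string everywhere else, so it can be compared, encoded or passed on unchanged; only debugging output and log messages that rely on repr() are affected.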

trac.util.text.cleandoc(message)

Removes uniform indentation and leading/trailing whitespace.

trac.util.text.levenshtein_distance(lhs, rhs)

Return the Levenshtein distance between two strings.
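
For reference, the Levenshtein distance is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other. The following is a standard dynamic-programming sketch, not necessarily how Trac computes it; edit_distance is a hypothetical name.

```python
def edit_distance(lhs, rhs):
    # Keep the shorter string in the inner loop to use less memory.
    if len(lhs) < len(rhs):
        lhs, rhs = rhs, lhs
    previous = list(range(len(rhs) + 1))
    for i, a in enumerate(lhs, 1):
        current = [i]
        for j, b in enumerate(rhs, 1):
            current.append(min(
                previous[j] + 1,             # delete a
                current[j - 1] + 1,          # insert b
                previous[j - 1] + (a != b),  # substitute a with b
            ))
        previous = current
    return previous[-1]
```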

trac.util.text.sub_vars(text, args)

Substitute $XYZ-style variables in a string with provided values.

Parameters:
  • text – string containing variables to substitute.
  • args – dictionary with keys matching the variables to be substituted. The keys should not be prefixed with the $ character.
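
A minimal sketch of this kind of substitution using a regular expression. The exact set of characters allowed in a variable name and the treatment of unknown variables (left untouched here) are assumptions; substitute_vars is a hypothetical name.

```python
import re

# Assumed variable syntax: '$' followed by an identifier-like name.
_VARIABLE = re.compile(r'\$([A-Za-z_][A-Za-z0-9_]*)')


def substitute_vars(text, args):
    def replace(match):
        name = match.group(1)
        # Keys in args carry no '$' prefix; unknown names stay as they are.
        return str(args[name]) if name in args else match.group(0)
    return _VARIABLE.sub(replace, text)


result = substitute_vars('Hello $USER, see $URL and $MISSING',
                         {'USER': 'joe', 'URL': '/wiki'})
```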

Text formatting

trac.util.text.pretty_size(size, format='%.1f')

Pretty print content size information with appropriate unit.

Parameters:
  • size – number of bytes
  • format – can be used to adjust the precision shown
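
The following sketch shows one way such a formatter can work. The 1024-based steps, the unit names and the wording for small sizes are assumptions made for illustration; format_size is a hypothetical name.

```python
def format_size(size, format='%.1f'):
    # Assumption: binary steps of 1024 and these unit labels.
    if size < 1024:
        return '%d bytes' % size
    units = ['KB', 'MB', 'GB', 'TB']
    value = float(size)
    index = -1
    while value >= 1024 and index < len(units) - 1:
        value /= 1024
        index += 1
    # The format parameter only controls how the number itself is shown.
    return (format + ' %s') % (value, units[index])
```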
trac.util.text.breakable_path(path)

Make a path breakable after path separators, and conversely, avoid breaking at spaces.

trac.util.text.normalize_whitespace(text, to_space=u'\xa0', remove=u'\u200b')

Normalize whitespace in a string, by replacing special spaces by normal spaces and removing zero-width spaces.
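
With the default arguments, non-breaking spaces (U+00A0) become ordinary spaces and zero-width spaces (U+200B) are dropped. A small sketch of that behaviour using str.translate, assuming each character of to_space and remove is handled individually; normalize_ws is a hypothetical name.

```python
def normalize_ws(text, to_space='\xa0', remove='\u200b'):
    # Map each "special space" to U+0020 and each removable character to None.
    table = {ord(ch): ' ' for ch in to_space}
    table.update({ord(ch): None for ch in remove})
    return text.translate(table)


cleaned = normalize_ws('a\xa0b\u200bc')
```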

trac.util.text.unquote_label(txt)

Remove (one level of) enclosing single or double quotes.

New in version 1.0.
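
"One level" means that only the outermost pair of matching quotes is removed. A minimal sketch of that rule, with strip_one_quote_level as a hypothetical name:

```python
def strip_one_quote_level(txt):
    # Only strip when the first and last characters are the same quote.
    if len(txt) >= 2 and txt[0] in '\'"' and txt[0] == txt[-1]:
        return txt[1:-1]
    return txt
```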

trac.util.text.fix_eol(text, eol)

Fix end-of-lines in a text.

trac.util.text.expandtabs(s, tabstop=8, ignoring=None)

Expand tab characters '\t' into spaces.

Parameters:
  • tabstop – number of space characters per tab (defaults to the canonical 8)
  • ignoring – if not None, the expansion will be “smart” and go from one tabstop to the next. In addition, this parameter lists characters which can be ignored when computing the indent.
trac.util.text.is_obfuscated(word)

Returns True if the word looks like an obfuscated e-mail address.

Since:1.2
trac.util.text.obfuscate_email_address(address)

Replace anything looking like an e-mail address ('@something') with a trailing ellipsis ('@…')

trac.util.text.text_width(text, ambiwidth=1)

Determine the column width of text in Unicode characters.

Characters in the East Asian Fullwidth (F) or East Asian Wide (W) categories have a column width of 2. Other characters, such as those in the East Asian Halfwidth (H) or East Asian Narrow (Na) categories, have a column width of 1.

The ambiwidth parameter sets the column width of East Asian Ambiguous (A) characters. If 1, they have the same width as US-ASCII characters, which is what most users expect. If 2, they are twice as wide as US-ASCII characters, which is what CJK users expect.

cf. http://www.unicode.org/reports/tr11/.
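
The width rules above map directly onto unicodedata.east_asian_width from the standard library. The following is a sketch of the computation, not Trac's code; column_width is a hypothetical name.

```python
import unicodedata


def column_width(text, ambiwidth=1):
    # Categories counted as two columns; Ambiguous (A) only if ambiwidth is 2.
    wide = 'FWA' if ambiwidth == 2 else 'FW'
    return sum(2 if unicodedata.east_asian_width(ch) in wide else 1
               for ch in text)
```

For example, the Greek letter alpha (U+03B1) is classified as Ambiguous, so its width depends on ambiwidth, while CJK ideographs are always two columns wide.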

trac.util.text.print_table(data, headers=None, sep=' ', out=None, ambiwidth=None)

Print data according to a tabular layout.

Parameters:
  • data – a sequence of rows; assume all rows are of equal length.
  • headers – an optional row containing column headers; must be of the same length as each row in data.
  • sep – column separator
  • out – output file descriptor (None means use sys.stdout)
  • ambiwidth – column width of the East Asian Ambiguous (A). If None, ambiwidth is detected from the locale settings. Otherwise, the value is passed to the ambiwidth parameter of text_width.
trac.util.text.shorten_line(text, maxlen=75)

Truncates text to a length less than or equal to maxlen characters.

This tries to be (a bit) clever and attempts to find a proper word boundary for doing so.
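
One way to implement word-boundary truncation is to look for the last space or newline before the cut-off point. The sketch below makes two assumptions that are not stated above: a ' ...' suffix marks the truncation, and the text is cut hard when no boundary is found. truncate_line is a hypothetical name.

```python
def truncate_line(text, maxlen=75):
    if len(text) <= maxlen:
        return text
    suffix = ' ...'  # assumed truncation marker
    limit = maxlen - len(suffix)
    # Prefer the last space or newline before the limit as the cut point.
    cut = max(text.rfind(' ', 0, limit), text.rfind('\n', 0, limit))
    if cut < 0:
        cut = limit  # no boundary found: cut in the middle of the word
    return text[:cut] + suffix


short = truncate_line('The quick brown fox jumps over the lazy dog', maxlen=20)
```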

trac.util.text.stripws(text, leading=True, trailing=True)

Strips unicode white-spaces and ZWSPs from text.

Parameters:
  • leading – strips leading spaces from text unless leading is False.
  • trailing – strips trailing spaces from text unless trailing is False.
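
Python's built-in strip() already handles Unicode white-space on text objects, but it does not treat the zero-width space (U+200B) as white-space. A sketch of a variant that does, using Python 3's re module; strip_ws is a hypothetical name.

```python
import re

_LEADING = re.compile(r'\A[\s\u200b]+')
_TRAILING = re.compile(r'[\s\u200b]+\Z')


def strip_ws(text, leading=True, trailing=True):
    # \s matches Unicode white-space for text patterns; ZWSP is added explicitly.
    if leading:
        text = _LEADING.sub('', text)
    if trailing:
        text = _TRAILING.sub('', text)
    return text
```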
trac.util.text.strip_line_ws(text, leading=True, trailing=True)

Strips unicode white-spaces and ZWSPs from each line of text.

Parameters:
  • leading – strips leading spaces from text unless leading is False.
  • trailing – strips trailing spaces from text unless trailing is False.
trac.util.text.wrap(t, cols=75, initial_indent='', subsequent_indent='', linesep='\n', ambiwidth=1)

Wraps the single paragraph in t, which contains unicode characters. Every line is at most cols characters long.

The ambiwidth parameter sets the column width of East Asian Ambiguous (A) characters. If 1, they have the same width as US-ASCII characters, which is what most users expect. If 2, they are twice as wide as US-ASCII characters, which is what CJK users expect.

Conversion utilities

trac.util.text.unicode_to_base64(text, strip_newlines=True)

Safe conversion of text to base64 representation using utf-8 bytes.

Strips newlines from output unless strip_newlines is False.

trac.util.text.unicode_from_base64(text)

Safe conversion of text to unicode based on utf-8 bytes.
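
The two conversions form a round trip through UTF-8 bytes. A sketch using the standard base64 module under Python 3; text_to_base64 and text_from_base64 are hypothetical names, and the use of base64.encodebytes (which inserts newlines) is an assumption made to show why stripping newlines can be useful.

```python
import base64


def text_to_base64(text, strip_newlines=True):
    # encodebytes() adds a newline after every 76 output characters.
    encoded = base64.encodebytes(text.encode('utf-8')).decode('ascii')
    if strip_newlines:
        encoded = encoded.replace('\n', '')
    return encoded


def text_from_base64(text):
    return base64.b64decode(text).decode('utf-8')
```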

trac.util.text.to_utf8(text, charset='latin1')

Convert input to a UTF-8 str object.

If the input is not a unicode object, we assume the encoding is already UTF-8, ISO Latin-1, or as specified by the optional charset parameter.
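
The assumption chain above can be sketched as follows for Python 3, with bytes in place of the Python 2 str type. The order of the attempts (UTF-8 first, then the given charset, then Latin-1) is an assumption based on the description; to_utf8_bytes is a hypothetical name.

```python
def to_utf8_bytes(value, charset='latin1'):
    if isinstance(value, str):
        return value.encode('utf-8')
    try:
        value.decode('utf-8')
        return value  # already valid UTF-8, nothing to do
    except UnicodeDecodeError:
        pass
    try:
        text = value.decode(charset)
    except UnicodeDecodeError:
        text = value.decode('latin1')  # cannot fail, every byte is mapped
    return text.encode('utf-8')
```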