Trivial UTF-8

Trivial UTF-8 is a small library for doing UTF-8-based input and output on a Lisp implementation that already supports Unicode, meaning that char-code and code-char deal with Unicode character codes.

The rationale for this library is that while Unicode-enabled implementations usually provide some kind of interface for dealing with character encodings, these interfaces are typically not flexible or efficient enough. Specifically, SBCL's sb-ext:octets-to-string and sb-ext:string-to-octets are 10 times slower than the equivalents in this library (and not easily optimized because of the way they are defined, the use-value restart in particular), and they do not provide a way to read UTF-8 directly from a stream or write it directly to one.

Download and installation

The latest release of trivial-utf-8 can be downloaded from http://common-lisp.net/project/trivial-utf-8/trivial-utf-8.tgz, or installed with asdf-install.

A darcs repository with the most recent changes can be checked out with:

> darcs get http://common-lisp.net/project/trivial-utf-8/darcs/trivial-utf-8

Or look at it online.

Support and mailing lists

The trivial-utf-8-devel mailing list can be used for questions, discussion, bug reports, patches, or anything else relating to this library.

Reference

function string-to-utf-8-bytes (string) => array of (unsigned-byte 8)

Convert a string into an array of unsigned bytes containing its UTF-8 representation.

function utf-8-bytes-to-string (bytes) => string

Convert a byte array containing UTF-8 encoded characters into the string it encodes.
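
The two conversion functions are inverses of each other, which a short round trip can illustrate. This is a minimal sketch that assumes the library is already loaded and that its symbols are exported from a package named trivial-utf-8; round-trip-demo is a hypothetical name used only for this example.

```lisp
;; Assumes the trivial-utf-8 system is already loaded and exports its
;; symbols from a package named TRIVIAL-UTF-8 (an assumption).
;; ROUND-TRIP-DEMO is a hypothetical helper, not part of the library.
(defun round-trip-demo ()
  (let* ((text (format nil "caf~C" (code-char #xE9))) ; "caf" + U+00E9
         (bytes (trivial-utf-8:string-to-utf-8-bytes text))
         (back (trivial-utf-8:utf-8-bytes-to-string bytes)))
    ;; U+00E9 needs two bytes in UTF-8, so 4 characters become 5 bytes.
    (list (length text) (length bytes) (string= text back))))

(round-trip-demo) ; should evaluate to (4 5 T)
```

The character is built with code-char rather than typed literally so the example does not depend on the encoding of the source file.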

function write-utf-8-bytes (string output &key null-terminate)

Write a string to the byte stream output, encoding it as UTF-8.

function read-utf-8-string (input &key null-terminated stop-at-eof char-length byte-length)

Read UTF-8 encoded data from a byte stream and construct a string from the characters found. When null-terminated is true, reading stops at a null character; stop-at-eof makes it stop at the end of file without signaling an error; and char-length and byte-length set the maximum number of characters or bytes to read.
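
Reading from a file opened as a byte stream could look roughly like the sketch below. It assumes the library is loaded and exports its symbols from a package named trivial-utf-8; file-round-trip is a hypothetical helper, and the path is whatever the caller supplies.

```lisp
;; Assumes trivial-utf-8 is loaded and exports its symbols from a
;; package named TRIVIAL-UTF-8. FILE-ROUND-TRIP is a hypothetical helper.
(defun file-round-trip (path text)
  "Write TEXT to PATH as UTF-8 bytes, then read it back as a string."
  (with-open-file (out path :direction :output
                            :element-type '(unsigned-byte 8)
                            :if-exists :supersede)
    ;; The encoded bytes are written with the standard WRITE-SEQUENCE.
    (write-sequence (trivial-utf-8:string-to-utf-8-bytes text) out))
  (with-open-file (in path :element-type '(unsigned-byte 8))
    ;; :STOP-AT-EOF asks the reader to return what it has at end of file
    ;; instead of signaling an error.
    (trivial-utf-8:read-utf-8-string in :stop-at-eof t)))
```

Both streams are opened with an element type of (unsigned-byte 8), since the function expects a byte stream rather than a character stream.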

function utf-8-byte-length (string) => integer

Calculate the number of bytes needed to encode a string as UTF-8.
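
The length follows from the code point of each character, as the sketch below shows. It uses only standard Common Lisp and is not the library's implementation; sketch-utf-8-byte-length is a hypothetical name, and the code assumes char-code returns Unicode code points.

```lisp
;; Hypothetical helper, not the library's implementation. Assumes
;; CHAR-CODE returns Unicode code points.
(defun sketch-utf-8-byte-length (string)
  (loop for char across string
        for code = (char-code char)
        sum (cond ((< code #x80) 1)      ; U+0000..U+007F
                  ((< code #x800) 2)     ; U+0080..U+07FF
                  ((< code #x10000) 3)   ; U+0800..U+FFFF
                  (t 4))))               ; U+10000..U+10FFFF

(sketch-utf-8-byte-length "abc")                      ; => 3
(sketch-utf-8-byte-length (string (code-char #xE9)))  ; => 2
```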

function utf-8-group-size (byte) => integer

Determine the number of bytes in the character that starts with a given byte.
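
In UTF-8 the high bits of the first byte encode the length of the whole sequence. The sketch below illustrates that rule in standard Common Lisp; sketch-group-size is a hypothetical name, and this is not the library's own code.

```lisp
;; Hypothetical helper, not the library's code: reads the sequence
;; length from the high bits of the first byte.
(defun sketch-group-size (byte)
  (cond ((< byte #x80) 1)                 ; 0xxxxxxx, plain ASCII
        ((= (logand byte #xE0) #xC0) 2)   ; 110xxxxx
        ((= (logand byte #xF0) #xE0) 3)   ; 1110xxxx
        ((= (logand byte #xF8) #xF0) 4)   ; 11110xxx
        (t (error "Byte #x~2,'0X cannot start a UTF-8 character." byte))))

(sketch-group-size #x41) ; => 1
(sketch-group-size #xE2) ; => 3
```

Bytes of the form 10xxxxxx are continuation bytes and can never start a character, which is why the sketch treats them as an error.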

condition utf-8-decoding-error

A condition of this type is signaled whenever an incorrectly encoded character is encountered.
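
Callers that handle untrusted input may want to catch this condition. A minimal sketch, assuming the library is loaded and exports both the function and the condition from a package named trivial-utf-8; decode-or-nil is a hypothetical helper.

```lisp
;; Assumes trivial-utf-8 is loaded and exports its symbols from a
;; package named TRIVIAL-UTF-8. DECODE-OR-NIL is a hypothetical helper.
(defun decode-or-nil (bytes)
  "Return the decoded string, or NIL if BYTES is not valid UTF-8."
  (handler-case (trivial-utf-8:utf-8-bytes-to-string bytes)
    (trivial-utf-8:utf-8-decoding-error () nil)))

;; The byte #xFF never occurs in well-formed UTF-8, so decoding it
;; should signal the condition and the helper should return NIL.
(decode-or-nil (make-array 1 :element-type '(unsigned-byte 8)
                             :initial-element #xFF))
```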

