Trivial UTF-8 is a small library for UTF-8-based input and output on a Lisp implementation that already supports Unicode, meaning that char-code and code-char deal with Unicode character codes.
The rationale for this library is that while Unicode-enabled implementations usually provide some kind of interface for dealing with character encodings, these interfaces are typically not flexible or efficient enough. Specifically, SBCL's sb-ext:octets-to-string and string-to-octets are 10 times slower than the equivalents in this library, and they are not easily optimized because of the way they are defined (the use-value restart in particular). They also do not provide a way to read or write UTF-8 directly from or to a stream.
The latest release of trivial-utf-8 can be downloaded from http://common-lisp.net/project/trivial-utf-8/trivial-utf-8.tgz, or installed with asdf-install.
A darcs repository with the most recent changes can be checked out with:
> darcs get http://common-lisp.net/project/trivial-utf-8/darcs/trivial-utf-8
Or look at it online.
The trivial-utf-8-devel mailing list can be used for questions, discussion, bug reports, patches, or anything else relating to this library.
function string-to-utf-8-bytes (string) => array of (unsigned-byte 8)
Convert a string into an array of unsigned bytes containing its utf-8 representation.
function utf-8-bytes-to-string (bytes) => string
Convert a byte array containing utf-8 encoded characters into the string it encodes.
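As an illustration of the two conversion functions above, here is a minimal round-trip sketch. It assumes the library is already loaded and that its symbols are exported from a package named trivial-utf-8; the variable names are made up for the example.

```lisp
;; Assumes the library is already loaded (for example with
;; (asdf:load-system :trivial-utf-8)) and that its symbols live in a
;; package named TRIVIAL-UTF-8. Both are assumptions of this sketch.

;; Build the string h-e-acute-l-l-o without relying on the encoding
;; of this source file.
(defparameter *text* (format nil "h~Cllo" (code-char #xE9)))

;; Encode: U+00E9 takes two bytes in UTF-8, each ASCII letter one.
(defparameter *bytes* (trivial-utf-8:string-to-utf-8-bytes *text*))

(length *text*)   ; 5 characters
(length *bytes*)  ; 6 bytes

;; Decode the bytes again; the result should equal the original string.
(string= *text* (trivial-utf-8:utf-8-bytes-to-string *bytes*))
```

The difference between the two lengths is the reason byte counts and character counts have to be tracked separately when working with UTF-8.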
function write-utf-8-bytes (string output)
Write a string to a byte-stream, encoding it as utf-8.
function read-utf-8-string (input &key null-terminated stop-at-eof char-length byte-length)
Read utf-8 encoded data from a byte stream and construct a string from the characters found. When null-terminated is given, reading stops at a null character. stop-at-eof makes it stop at the end of the file without raising an error. The char-length and byte-length parameters specify the maximum number of characters or bytes to read.
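The stream functions might be used together roughly as follows. This is a sketch under several assumptions: the library is loaded, its package is named trivial-utf-8, write-utf-8-bytes takes the output stream as its second argument, and the file path and helper names (save-utf-8, load-utf-8) are hypothetical.

```lisp
;; Assumptions: package TRIVIAL-UTF-8 is loaded, WRITE-UTF-8-BYTES
;; accepts the byte stream as its second argument, and the path below
;; is purely illustrative.
(defparameter *demo-path* #p"/tmp/trivial-utf-8-demo.bin")

(defun save-utf-8 (string path)
  "Write STRING to PATH as UTF-8 through a byte stream."
  (with-open-file (out path :direction :output
                            :element-type '(unsigned-byte 8)
                            :if-exists :supersede)
    (trivial-utf-8:write-utf-8-bytes string out)))

(defun load-utf-8 (path)
  "Read all of PATH as UTF-8; :stop-at-eof avoids an end-of-file error."
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (trivial-utf-8:read-utf-8-string in :stop-at-eof t)))

;; Write a string containing U+00EF, then read it back.
(save-utf-8 (format nil "na~Cve" (code-char #xEF)) *demo-path*)
(load-utf-8 *demo-path*)
```

Both streams are opened with an element type of (unsigned-byte 8), since the functions operate on byte streams rather than character streams.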
function utf-8-byte-length (string) => integer
Calculate the number of bytes needed to encode a string.
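To show where the byte count comes from, the following sketch computes it directly from the character codes using the standard UTF-8 ranges. It is not the library's implementation, and sketch-utf-8-byte-length is a hypothetical name used only here.

```lisp
;; A stand-alone sketch, not the library's code. It relies only on
;; CHAR-CODE returning Unicode code points, as the library itself does.
(defun sketch-utf-8-byte-length (string)
  "Sum the UTF-8 width of every character in STRING."
  (loop for char across string
        for code = (char-code char)
        sum (cond ((< code #x80) 1)       ; U+0000 to U+007F
                  ((< code #x800) 2)      ; U+0080 to U+07FF
                  ((< code #x10000) 3)    ; U+0800 to U+FFFF
                  (t 4))))                ; U+10000 and above

(sketch-utf-8-byte-length "abc")                        ; => 3
(sketch-utf-8-byte-length (string (code-char #x20AC)))  ; euro sign => 3
```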
function utf-8-group-size (byte) => integer
Determine the number of bytes that are part of the character starting with a given byte.
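In UTF-8 the high bits of the first byte encode how many bytes the character occupies. The sketch below classifies a lead byte that way; it is an illustration rather than the library's code, and the name sketch-group-size and the choice to return NIL for an invalid lead byte are assumptions of the example.

```lisp
;; A stand-alone sketch of lead-byte classification. Returning NIL for
;; bytes that cannot start a character is a choice made for this
;; example only.
(defun sketch-group-size (byte)
  "Return the length in bytes of the character starting with BYTE, or NIL."
  (cond ((zerop (logand byte #b10000000)) 1)            ; 0xxxxxxx
        ((= (logand byte #b11100000) #b11000000) 2)     ; 110xxxxx
        ((= (logand byte #b11110000) #b11100000) 3)     ; 1110xxxx
        ((= (logand byte #b11111000) #b11110000) 4)     ; 11110xxx
        (t nil)))                                       ; continuation or invalid

(sketch-group-size #x41)  ; => 1
(sketch-group-size #xC3)  ; => 2
(sketch-group-size #xE2)  ; => 3
(sketch-group-size #xF0)  ; => 4
(sketch-group-size #x80)  ; => NIL
```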
condition utf-8-decoding-error
A condition of this type is signaled whenever an incorrectly encoded character is encountered.
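A caller that wants to tolerate malformed input could handle the condition along these lines. The sketch assumes the library is loaded, that the condition is exported from a package named trivial-utf-8, and that the byte #xFF (which is never valid in UTF-8) is rejected by the decoder; decode-or-nil is a hypothetical helper.

```lisp
;; Assumptions: package TRIVIAL-UTF-8 is loaded and exports
;; UTF-8-DECODING-ERROR. DECODE-OR-NIL is a name invented for this sketch.
(defun decode-or-nil (bytes)
  "Return the decoded string, or NIL when BYTES is not valid UTF-8."
  (handler-case (trivial-utf-8:utf-8-bytes-to-string bytes)
    (trivial-utf-8:utf-8-decoding-error () nil)))

;; #xFF cannot appear anywhere in well-formed UTF-8.
(defparameter *bad-bytes*
  (make-array 1 :element-type '(unsigned-byte 8) :initial-element #xFF))

(decode-or-nil *bad-bytes*)
```

Returning NIL is only one possible policy; a handler could equally log the problem or substitute a replacement string.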