Handling Unicode
Unicode support has been available since version 4.0.0.
Sharing code between the Unicode and non-Unicode builds took quite a lot of work: char became wchar_t, and every character-handling function had to be switched to its wide-character version.
As a result, the code has become somewhat more complex, because #ifdef/#else/#endif blocks have to cope with Unicode being used or not.
Unicode is enabled by defining SXMLC_UNICODE in the preprocessor. To activate it, pass -DSXMLC_UNICODE to the compiler (this works with most compilers).
Defining it changes the SXML_CHAR type from char to wchar_t.
It also adds three members to the XMLDoc struct to deal with the Byte Order Mark (BOM):
- bom_type is the BOM that was read from the file. It is BOM_NONE when no BOM was detected, or one of the other BOM_* enum values.
- bom is the BOM byte content.
- sz_bom is the size of the BOM in bytes (useful when writing the file).
The function freadBOM has been added to detect the BOM and skip it, so that the rest of the file can be read directly.
It recognizes the following BOMs:
- UTF-8 (0xef 0xbb 0xbf)
- UTF-16 little-endian (0xff 0xfe)
- UTF-16 big-endian (0xfe 0xff)
- UTF-32 little-endian (0xff 0xfe 0x00 0x00)
- UTF-32 big-endian (0x00 0x00 0xfe 0xff)
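The byte signatures listed above can be matched with a small lookup table. The following is only a sketch of the idea, not the actual freadBOM code: the sketch_bom enum and sketch_detect_bom function are hypothetical names, and it works on a memory buffer rather than on a FILE*. It assumes the four-byte signatures must be tested first, since the UTF-32 little-endian BOM starts with the same two bytes as the UTF-16 little-endian one.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names for this sketch; SXMLC's own BOM_* enum may differ. */
typedef enum {
    SKETCH_BOM_NONE,
    SKETCH_BOM_UTF8,
    SKETCH_BOM_UTF16LE,
    SKETCH_BOM_UTF16BE,
    SKETCH_BOM_UTF32LE,
    SKETCH_BOM_UTF32BE
} sketch_bom;

/*
 * Inspect the first `len` bytes of `buf` and report which BOM, if any,
 * it starts with. On return, *bom_size holds the BOM length in bytes
 * (0 when no BOM was found).
 */
static sketch_bom sketch_detect_bom(const unsigned char *buf, size_t len,
                                    size_t *bom_size)
{
    /* Longest signatures first: UTF-32LE begins with the UTF-16LE bytes. */
    static const struct {
        sketch_bom type;
        size_t size;
        unsigned char bytes[4];
    } table[] = {
        { SKETCH_BOM_UTF32LE, 4, { 0xff, 0xfe, 0x00, 0x00 } },
        { SKETCH_BOM_UTF32BE, 4, { 0x00, 0x00, 0xfe, 0xff } },
        { SKETCH_BOM_UTF8,    3, { 0xef, 0xbb, 0xbf, 0x00 } },
        { SKETCH_BOM_UTF16LE, 2, { 0xff, 0xfe, 0x00, 0x00 } },
        { SKETCH_BOM_UTF16BE, 2, { 0xfe, 0xff, 0x00, 0x00 } }
    };
    size_t i;

    for (i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (len >= table[i].size &&
            memcmp(buf, table[i].bytes, table[i].size) == 0) {
            *bom_size = table[i].size;
            return table[i].type;
        }
    }
    *bom_size = 0;
    return SKETCH_BOM_NONE;
}
```

A caller would read the first four bytes of the file, pass them to sketch_detect_bom, and then seek to offset *bom_size before parsing.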
/!\ Warning!
Although SXMLC can recognize (and skip) a UTF-32 BOM, it can handle UTF-32 only to the extent that wchar_t allows. This means that on Microsoft operating systems, where wchar_t is 16 bits wide, Unicode handling stops at UTF-16.
Also, UTF-8 is handled only on a one-byte-per-character basis because, internally, SXMLC opens the file in text mode when it detects a UTF-8 BOM. If you know of fancier portable fopen/fgetc/fprintf functions to process UTF-8, please tell me! :-)
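To see which side of the wchar_t limitation a given platform falls on, the range of wchar_t can be checked directly. This is a minimal sketch using only the standard WCHAR_MAX macro; sketch_wchar_holds_utf32 is a hypothetical helper and is not part of SXMLC.

```c
#include <stdint.h>   /* WCHAR_MAX */
#include <wchar.h>

/*
 * Returns 1 if a single wchar_t can hold any Unicode code point
 * (up to U+10FFFF), 0 if it is too narrow (e.g. a 16-bit wchar_t).
 */
static int sketch_wchar_holds_utf32(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

When this returns 0, UTF-32 content cannot be stored one code point per SXML_CHAR, which is the situation the warning above describes.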
To make it easier to write Unicode-portable code, several macros are defined for opening, reading and writing streams. All of them start with sx_ and should be used instead of the "regular" functions. E.g. use sx_fopen instead of fopen, or sx_strcpy instead of strcpy.
A special macro, C2SX(), adds the L prefix to constant strings and characters when SXMLC_UNICODE is defined. This lets the same string constants be used with or without Unicode.
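As an illustration of how such a switch can work, here is a minimal sketch of the pattern. The names SKETCH_UNICODE, sketch_char, SKETCH_TEXT and sketch_strlen are hypothetical stand-ins for SXMLC_UNICODE, SXML_CHAR, C2SX() and an sx_* macro; the real definitions in SXMLC may well differ.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-ins; one preprocessor symbol selects the character type. */
#ifdef SKETCH_UNICODE
typedef wchar_t sketch_char;
#define SKETCH_TEXT(s) L##s        /* "abc" becomes L"abc", 'a' becomes L'a' */
#define sketch_strlen  wcslen      /* wide-character version */
#else
typedef char sketch_char;
#define SKETCH_TEXT(s) s           /* literal is left untouched */
#define sketch_strlen  strlen      /* narrow version */
#endif

/* The same source compiles in both modes and gives the same length. */
static size_t sketch_tag_length(void)
{
    const sketch_char *tag = SKETCH_TEXT("element");
    return sketch_strlen(tag);
}
```

Compiling this with or without -DSKETCH_UNICODE changes the types involved but not the calling code, which is the point of the portable style.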
Of course, when writing your application, if you know for sure whether you will be using Unicode, you don't have to use these macros and
can use the direct function calls instead. The following three examples are equivalent:
No Unicode, SXMLC_UNICODE is undefined:
    char tag[128];
    XMLNode node;
    XMLNode_init(&node);
    strcpy(tag, "element");
    XMLNode_set_tag(&node, tag);
    XMLNode_add_attribute(&node, "name", "toto");
Pure Unicode, SXMLC_UNICODE is defined:
    wchar_t tag[128];
    XMLNode node;
    XMLNode_init(&node);
    wcscpy(tag, L"element");
    XMLNode_set_tag(&node, tag);
    XMLNode_add_attribute(&node, L"name", L"toto");
Portable code, works whether SXMLC_UNICODE is defined or not:
    SXML_CHAR tag[128];
    XMLNode node;
    XMLNode_init(&node);
    sx_strcpy(tag, C2SX("element"));
    XMLNode_set_tag(&node, tag);
    XMLNode_add_attribute(&node, C2SX("name"), C2SX("toto"));
The full list of sx_* functions is available in utils.h.
/!\ Be careful when writing files with XMLDoc_print! The FILE* object has to be opened in binary mode when dealing with UTF-16 encoding (either little- or big-endian).
Files in other encodings, such as ASCII or "regular" UTF-8, have to be opened in text mode, as they use one-byte characters.
Note that you HAVE TO define SXMLC_UNICODE if you plan to read or write Unicode files.
Usually, you can open the FILE* in binary mode when there is a BOM other than the UTF-8 one to write in the document (doc.sz_bom > 0).
The following code writes a document to a file, choosing the open mode according to whether the XML document is Unicode:
    int write_doc(XMLDoc* doc, SXML_CHAR* filename)
    {
        SXML_CHAR* mode;
        FILE* f;
        int ret;

        /* Binary mode for a non-UTF-8 BOM; text mode for UTF-8 or no BOM */
        if (doc->sz_bom > 0 && doc->bom_type != BOM_UTF_8)
            mode = C2SX("w+b");
        else
            mode = C2SX("w+t");

        f = sx_fopen(filename, mode);
        if (f == NULL) /* The file could not be opened */
            return false;

        ret = XMLDoc_print(doc, f, NULL, NULL, false, 0, 0);
        fclose(f);
        return ret;
    }