Handling Unicode


Unicode support has been available since version 4.0.0.

It required quite a lot of work to share code between the Unicode and non-Unicode functions, as char became wchar_t and every function related to character handling had to be switched to its wide version.
As a result, the code has become somewhat more complex because of the #ifdef/#else/#endif blocks needed to cope with Unicode being enabled or not.

Using Unicode

Unicode is enabled by defining SXMLC_UNICODE for the preprocessor. To activate it, pass -DSXMLC_UNICODE to the compiler (most compilers accept this flag).
Defining it changes the SXML_CHAR type from char to wchar_t.
It also adds three members to the XMLDoc struct to deal with the Byte Order Mark (BOM):

  • bom_type represents the BOM that has been read from the file. It is BOM_NONE when no BOM has been detected, or one of the other BOM_* enum values.
  • bom is the BOM byte content.
  • sz_bom is the size of the BOM in bytes (useful when writing the file).

The function freadBOM has been added to determine the BOM and skip it, so that the rest of the file can be read directly.
It distinguishes the following cases:

  • No BOM
  • UTF-8 (file starts with sequence 0xef 0xbb 0xbf)
  • UTF-16LE (Little Endian, file starts with sequence 0xff 0xfe)
  • UTF-16BE (Big Endian, file starts with sequence 0xfe 0xff)
  • UTF-32LE (Little Endian, file starts with sequence 0xff 0xfe 0x00 0x00)
  • UTF-32BE (Big Endian, file starts with sequence 0x00 0x00 0xfe 0xff)
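
The byte sequences above can be matched with a few comparisons. The following is only a minimal sketch of that idea, not the actual freadBOM code: the sketch_* names are hypothetical and it works on a memory buffer rather than on a FILE*. The one subtlety is ordering, since the UTF-32LE BOM starts with the same two bytes as the UTF-16LE one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for this sketch; SXMLC's own BOM_* enum may differ. */
typedef enum {
    SKETCH_BOM_NONE,
    SKETCH_BOM_UTF_8,
    SKETCH_BOM_UTF_16LE,
    SKETCH_BOM_UTF_16BE,
    SKETCH_BOM_UTF_32LE,
    SKETCH_BOM_UTF_32BE
} sketch_bom_t;

/* Identify the BOM at the start of buf (len bytes are available).
 * The BOM size in bytes is stored in *sz_bom (0 when there is no BOM). */
sketch_bom_t sketch_detect_bom(const unsigned char* buf, size_t len, size_t* sz_bom)
{
    assert(buf != NULL && sz_bom != NULL);

    /* UTF-32LE must be tested before UTF-16LE: both start with 0xff 0xfe. */
    if (len >= 4 && buf[0] == 0xff && buf[1] == 0xfe && buf[2] == 0x00 && buf[3] == 0x00) {
        *sz_bom = 4;
        return SKETCH_BOM_UTF_32LE;
    }
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00 && buf[2] == 0xfe && buf[3] == 0xff) {
        *sz_bom = 4;
        return SKETCH_BOM_UTF_32BE;
    }
    if (len >= 3 && buf[0] == 0xef && buf[1] == 0xbb && buf[2] == 0xbf) {
        *sz_bom = 3;
        return SKETCH_BOM_UTF_8;
    }
    if (len >= 2 && buf[0] == 0xff && buf[1] == 0xfe) {
        *sz_bom = 2;
        return SKETCH_BOM_UTF_16LE;
    }
    if (len >= 2 && buf[0] == 0xfe && buf[1] == 0xff) {
        *sz_bom = 2;
        return SKETCH_BOM_UTF_16BE;
    }

    *sz_bom = 0;
    return SKETCH_BOM_NONE;
}
```

A caller would read the first four bytes of the file, call the function, then seek to offset sz_bom before parsing.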

/!\ Warning!
Although freadBOM can recognize (and skip) a UTF-32 BOM, SXMLC can handle the content only to the extent of wchar_t. On Microsoft operating systems, where wchar_t is 16 bits wide, Unicode handling therefore stops at UTF-16.
Also, UTF-8 is handled only on a one-byte-per-character basis because, internally, SXMLC opens the file in text mode when it detects a UTF-8 BOM. If you know of fancier portable fopen/fgetc/fprintf functions to process UTF-8, please tell me! :-)
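
To illustrate what "one byte per character" leaves out: in UTF-8 a single character takes between one and four bytes, and the first byte of the sequence gives its length. The sketch below is not part of SXMLC; the sketch_* functions are hypothetical helpers that count characters in a UTF-8 string. It is not a full validator either (it does not reject overlong encodings, for instance).

```c
#include <assert.h>
#include <stddef.h>

/* Number of bytes in the UTF-8 sequence introduced by lead byte c,
 * or 0 if c cannot start a sequence (continuation or invalid byte). */
size_t sketch_utf8_seq_len(unsigned char c)
{
    if (c < 0x80) return 1;            /* 0xxxxxxx: plain ASCII */
    if ((c & 0xe0) == 0xc0) return 2;  /* 110xxxxx */
    if ((c & 0xf0) == 0xe0) return 3;  /* 1110xxxx */
    if ((c & 0xf8) == 0xf0) return 4;  /* 11110xxx */
    return 0;
}

/* Count the characters (code points) in a NUL-terminated UTF-8 string.
 * Returns (size_t)-1 on a malformed or truncated sequence. */
size_t sketch_utf8_count(const char* s)
{
    size_t count = 0;

    assert(s != NULL);
    while (*s != '\0') {
        size_t i;
        size_t n = sketch_utf8_seq_len((unsigned char)*s);

        if (n == 0)
            return (size_t)-1;
        /* Every following byte of the sequence must look like 10xxxxxx.
         * A NUL terminator fails this test, so we never read past the end. */
        for (i = 1; i < n; i++) {
            if (((unsigned char)s[i] & 0xc0) != 0x80)
                return (size_t)-1;
        }
        s += n;
        count++;
    }
    return count;
}
```

For example, the three-character word "été" is stored as five bytes in UTF-8, so a byte-based length and a character-based length disagree as soon as the text leaves plain ASCII.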

Coding Unicode

To make it easier to write Unicode-portable code, several macros are defined for opening, reading and writing streams and for handling strings. All of them start with sx_ and should be used instead of the "regular" functions, e.g. sx_fopen instead of fopen, or sx_strcpy instead of strcpy.
A special macro, C2SX(), adds the L prefix in front of constant strings and characters when SXMLC_UNICODE is defined. This lets the same string constants be used with or without Unicode.
Of course, if you know for sure whether your application will be using Unicode, you don't have to use these macros and can call the functions directly instead. The following three examples are equivalent:

No Unicode, SXMLC_UNICODE is undefined

char tag[128];
XMLNode node;

XMLNode_init(&node);
strcpy(tag, "element");
XMLNode_set_tag(&node, tag);
XMLNode_add_attribute(&node, "name", "toto");

Pure Unicode, SXMLC_UNICODE is defined

wchar_t tag[128];
XMLNode node;

XMLNode_init(&node);
wcscpy(tag, L"element");
XMLNode_set_tag(&node, tag);
XMLNode_add_attribute(&node, L"name", L"toto");

Portable code, works if SXMLC_UNICODE is defined or not

SXML_CHAR tag[128];
XMLNode node;

XMLNode_init(&node);
sx_strcpy(tag, C2SX("element"));
XMLNode_set_tag(&node, tag);
XMLNode_add_attribute(&node, C2SX("name"), C2SX("toto"));

The full list of sx_* functions is available in utils.h.
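
For readers curious about the mechanism, macros of this kind can be built with a single #ifdef. The sketch below only shows the general technique under assumed names: MY_UNICODE, MY_CHAR, MY_TEXT and the my_* macros are hypothetical and stand in for SXMLC_UNICODE, SXML_CHAR, C2SX and the sx_* macros, whose real definitions in utils.h may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical names for this sketch; the real definitions live in
 * SXMLC's utils.h and may be written differently. */
#ifdef MY_UNICODE
    typedef wchar_t MY_CHAR;
    #define MY_TEXT(s)  L##s     /* paste the L prefix onto the literal */
    #define my_strcpy   wcscpy
    #define my_strlen   wcslen
    #define my_strcmp   wcscmp
#else
    typedef char MY_CHAR;
    #define MY_TEXT(s)  s        /* leave the literal untouched */
    #define my_strcpy   strcpy
    #define my_strlen   strlen
    #define my_strcmp   strcmp
#endif

/* Copy src into dst (a buffer of dst_len characters).
 * Returns the length of src in characters, or 0 if it does not fit. */
size_t my_set_tag(MY_CHAR* dst, size_t dst_len, const MY_CHAR* src)
{
    size_t n;

    assert(dst != NULL && src != NULL);
    n = my_strlen(src);
    if (n >= dst_len)
        return 0;
    my_strcpy(dst, src);
    return n;
}
```

The same source then compiles in both modes: my_set_tag(tag, 128, MY_TEXT("element")) works on char when MY_UNICODE is undefined and on wchar_t when it is defined.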

Writing Unicode XML

/!\ Be careful when writing files with XMLDoc_print! The FILE* object has to be opened in binary mode when dealing with UTF-16 encoding (either Little or Big Endian)!
Files in other encodings, such as ASCII or "regular" UTF-8, have to be opened in text mode, as they are handled as one-byte characters.
Note that you HAVE TO define SXMLC_UNICODE if you plan to read or write Unicode files.
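
The reason binary mode matters can be shown with a small sketch. On Windows, a stream opened in text mode expands every 0x0a byte into 0x0d 0x0a, and a UTF-16 code unit such as U+010A contains a 0x0a byte, so the output would be shifted and corrupted. The function below is a hypothetical helper, not part of SXMLC, that writes UTF-16LE bytes to a stream assumed to be opened with "wb".

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Write n UTF-16 code units to f as little-endian bytes, preceded by the
 * UTF-16LE BOM (0xff 0xfe). f is assumed to be opened in binary mode.
 * Returns the number of bytes written, or 0 on error. */
size_t sketch_write_utf16le(FILE* f, const unsigned short* units, size_t n)
{
    static const unsigned char bom[2] = { 0xff, 0xfe };
    size_t i;

    assert(f != NULL && units != NULL);

    if (fwrite(bom, 1, 2, f) != 2)
        return 0;

    for (i = 0; i < n; i++) {
        unsigned char b[2];

        b[0] = (unsigned char)(units[i] & 0xff);         /* low byte first */
        b[1] = (unsigned char)((units[i] >> 8) & 0xff);  /* then high byte */
        if (fwrite(b, 1, 2, f) != 2)
            return 0;
    }

    return 2 + 2 * n;
}
```

Writing the two units 0x003c and 0x010a this way produces exactly six bytes (ff fe 3c 00 0a 01) in binary mode; a text-mode stream on Windows would insert an extra 0x0d before the 0x0a.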

As a rule of thumb, open the FILE* in binary mode when the document has a BOM to write (doc.sz_bom > 0), except for a UTF-8 BOM, which still calls for text mode.
The following code writes a document to a file, choosing the mode according to the document's BOM:

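int write_doc(XMLDoc* doc, SXML_CHAR* filename)
{
	SXML_CHAR* mode;
	FILE* f;
	int ret;
	
	/* Binary mode for UTF-16/UTF-32 BOMs, text mode for UTF-8 or no BOM */
	if (doc->sz_bom > 0 && doc->bom_type != BOM_UTF_8)
		mode = C2SX("w+b");
	else
		mode = C2SX("w+t");
	
	f = sx_fopen(filename, mode);
	if (f == NULL) /* The file could not be opened */
		return false;
	
	ret = XMLDoc_print(doc, f, NULL, NULL, false, 0, 0);
	fclose(f); /* Do not leak the stream, whatever the outcome */
	
	return ret;
}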
