CORBA interoperability means different things at different levels. At the network level, it means that processes are able to communicate over different transport media. At the messaging layer, interoperability means that the participating ORBs use a common protocol, such as GIOP. At the presentation layer, interoperability means that the information contained in data is interpreted and presented consistently. This CORBA News Brief discusses interoperability at the presentation layer through codeset translation.
One of the challenges of distributed computing is dealing with applications that use different numeric values to represent text characters. There are many different character sets in use throughout the world, some of which use different values to refer to the same character. A collection of values that map to specific text characters is referred to as a codeset. To successfully communicate text data between systems using different codesets, the text must be translated from the sender's codeset to that of the receiver.

CORBA exists to ease the burden of distributed computing by internally handling many issues that application developers would otherwise have to deal with. Character code value translation is no exception. The CORBA specification, going back to CORBA 2.2, defines the means for applications to declare a Native Code Set for character-based text (NCS-C). This is the codeset that the application uses to read and write files, or to interact with a user. An application may have a second Native Code Set specifically for wide characters (NCS-W). Applications may use translators to convert from their Native Code Set to various conversion codesets (CCS-C, CCS-W), for both 8-bit and wide characters. The conversion codesets are those that the application is able to translate to and from the native codeset.
CORBA makes use of two kinds of codesets: one for byte-oriented text, the other for wide, or non-byte-oriented, text. The term codepoint is used to describe a numeric value for a character that is more than 8 bits long. Wide characters may be 2, 3, or 4 bytes long. Since the length is fixed, some of the bytes in the codepoint may be 0. For this reason, wide character text manipulation requires separate calls that examine more than one byte at a time when computing string length or comparing strings. Byte-oriented codesets may also contain codepoints that are longer than one byte; however, none of those codepoints will contain a byte with a value of 0. This allows strings of multibyte characters to be manipulated with the traditional byte-oriented calls.
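The difference between the two kinds of text can be seen in a small sketch. It is not taken from ACE or TAO, and the helper names are hypothetical: it hands the raw bytes of a wide string to a byte-oriented length function, which stops at the first zero byte, and contrasts that with a UTF-8 multibyte character, which contains no zero bytes.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical helper: the length a byte-oriented call (strlen) reports
// when it is handed the raw bytes of a wide string. It stops at the first
// zero byte, which for small codepoints lies inside the first character.
std::size_t byte_length_of_wide(const wchar_t* text) {
  return std::strlen(reinterpret_cast<const char*>(text));
}

// Hypothetical helper: the number of bytes the wide string really occupies,
// not counting its terminator. Requires the wide-character call wcslen.
std::size_t storage_bytes_of_wide(const wchar_t* text) {
  return std::wcslen(text) * sizeof(wchar_t);
}

// The letter e-acute encoded in UTF-8 is a multibyte character in a
// byte-oriented codeset: two bytes (0xC3 0xA9), neither of which is zero,
// so strlen walks over both of them.
std::size_t utf8_e_acute_length() {
  return std::strlen("\xC3\xA9");
}
```

Whatever the platform's wchar_t size or byte order, the byte-oriented call under-reports the wide string, which is why wide text needs its own set of functions.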
CORBA uses the Open Software Foundation's (OSF) Character and Code Set Registry when communicating codeset identity information between applications. The registry is freely available and contains details such as a description of the codeset, a unique numeric identifier, the maximum number of bytes needed to hold one (possibly escaped) character, and a list of the character sets contained within the codeset. A codeset may contain several character sets that are distinct from each other.
Using the OSF codeset registry values as identifiers allows CORBA servers to encode their native and conversion codesets into IOR profiles. CORBA clients use these codeset identifiers to select an appropriate Transmission Code Set for byte-oriented characters (TCS-C) or for wide characters (TCS-W). To do so, the client applies the codeset negotiation algorithm defined in the CORBA specification, comparing its own native and conversion codesets with those advertised by the server.
Codeset negotiation happens with the first client request. Once set, the TCS remains fixed for the remainder of the connection.
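As a rough sketch of how a client might choose a transmission codeset, the following follows the selection order described in the code set negotiation section of the CORBA specification: matching native codesets first, then one side converting to the other's native codeset, then a shared conversion codeset, then a fallback (UTF-8 for char data, UTF-16 for wchar data). It is not TAO's implementation. The struct loosely mirrors CONV_FRAME::CodeSetComponent, while select_tcs, supports, and the compatible flag (standing in for the specification's compatibility check) are hypothetical names.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using CodeSetId = std::uint32_t;  // OSF registry value

// Loosely mirrors CONV_FRAME::CodeSetComponent: one native codeset plus
// the conversion codesets the process can translate to and from.
struct CodeSetComponent {
  CodeSetId native_code_set;
  std::vector<CodeSetId> conversion_code_sets;
};

// Hypothetical helper: true when id appears in ids.
static bool supports(const std::vector<CodeSetId>& ids, CodeSetId id) {
  return std::find(ids.begin(), ids.end(), id) != ids.end();
}

// Hypothetical sketch of client-side TCS selection. Returns false when no
// transmission codeset can be agreed on (the ORB would then raise
// CODESET_INCOMPATIBLE). `compatible` is assumed to be computed elsewhere.
bool select_tcs(const CodeSetComponent& client,
                const CodeSetComponent& server,
                CodeSetId fallback,
                bool compatible,
                CodeSetId& tcs) {
  if (client.native_code_set == server.native_code_set) {
    tcs = client.native_code_set;  // no translation needed
    return true;
  }
  if (supports(server.conversion_code_sets, client.native_code_set)) {
    tcs = client.native_code_set;  // server translates
    return true;
  }
  if (supports(client.conversion_code_sets, server.native_code_set)) {
    tcs = server.native_code_set;  // client translates
    return true;
  }
  for (CodeSetId id : client.conversion_code_sets) {
    if (supports(server.conversion_code_sets, id)) {
      tcs = id;  // both sides translate to a shared conversion codeset
      return true;
    }
  }
  if (compatible) {
    tcs = fallback;  // e.g. UTF-8 (0x05010001) for byte-oriented text
    return true;
  }
  return false;
}
```

The function would be run once for byte-oriented text and once for wide text, each with its own pair of components and its own fallback.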
Codeset translation in TAO is pluggable. This gives the greatest level of flexibility, allowing developers to add translators only as needed. To provide a new translator to TAO, four components must be supplied.
class ACE_Char_Codeset_Translator
{
public:
  // Read TCS-encoded text from the CDR stream and return it in the NCS.
  virtual ACE_CDR::Boolean read_char (ACE_InputCDR&,
                                      ACE_CDR::Char&) = 0;
  virtual ACE_CDR::Boolean read_string (ACE_InputCDR&,
                                        ACE_CDR::Char *&) = 0;
  virtual ACE_CDR::Boolean read_char_array (ACE_InputCDR&,
                                            ACE_CDR::Char*,
                                            ACE_CDR::ULong) = 0;
  // Take NCS text and write it to the CDR stream in the TCS.
  virtual ACE_CDR::Boolean write_char (ACE_OutputCDR&,
                                       ACE_CDR::Char) = 0;
  virtual ACE_CDR::Boolean write_string (ACE_OutputCDR&,
                                         ACE_CDR::ULong,
                                         const ACE_CDR::Char*) = 0;
  virtual ACE_CDR::Boolean write_char_array (ACE_OutputCDR&,
                                             const ACE_CDR::Char*,
                                             ACE_CDR::ULong) = 0;
  // Registry values identifying the native and transmission codesets.
  static ACE_CDR::ULong ncs () {return 0;}
  static ACE_CDR::ULong tcs () {return 0;}
};
The read methods take text from the CDR stream encoded using the TCS, and return to the application the text in the NCS. The write methods perform the inverse operation, taking NCS text from the application, and writing it to the stream using the TCS. Of course, the example shown is used to translate byte-oriented codesets. The companion class, ACE_WChar_Codeset_Translator, has methods for reading and writing wchar text.
See the full class definition in the header file ace/CDR_Stream.h for more information on the translator methods. An example translator implementation may be found in ace/Codeset_IBM1047.*.
The last two static methods are used to identify the Native Code Set and the Transmission Code Set the translator uses. When creating a translator, the appropriate OSF Character and Codeset registry values should be returned by these methods.
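To make the read and write directions concrete, here is a toy sketch that does not use ACE at all. ByteStream is a hypothetical stand-in for a CDR stream, and ToyIbm1047Latin1Translator is a hypothetical class that handles only uppercase letters, digits, and the space character. A real translator would derive from ACE_Char_Codeset_Translator and cover the whole codeset. The ISO 8859-1 registry value 0x00010001 is taken from the OSF registry.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a CDR stream: a plain byte buffer.
using ByteStream = std::vector<std::uint8_t>;

// Toy translator: native codeset IBM-1047, transmission codeset ISO 8859-1.
// Only A-Z, 0-9 and space are mapped; anything else becomes '?'.
struct ToyIbm1047Latin1Translator {
  static std::uint32_t ncs() { return 0x10020417; }  // IBM-1047
  static std::uint32_t tcs() { return 0x00010001; }  // ISO 8859-1

  // IBM-1047 byte -> ISO 8859-1 byte.
  static std::uint8_t to_tcs(std::uint8_t e) {
    if (e >= 0xC1 && e <= 0xC9) return static_cast<std::uint8_t>('A' + (e - 0xC1));
    if (e >= 0xD1 && e <= 0xD9) return static_cast<std::uint8_t>('J' + (e - 0xD1));
    if (e >= 0xE2 && e <= 0xE9) return static_cast<std::uint8_t>('S' + (e - 0xE2));
    if (e >= 0xF0 && e <= 0xF9) return static_cast<std::uint8_t>('0' + (e - 0xF0));
    if (e == 0x40) return 0x20;  // space
    return 0x3F;                 // '?' in ISO 8859-1
  }

  // ISO 8859-1 byte -> IBM-1047 byte.
  static std::uint8_t to_ncs(std::uint8_t l) {
    if (l >= 'A' && l <= 'I') return static_cast<std::uint8_t>(0xC1 + (l - 'A'));
    if (l >= 'J' && l <= 'R') return static_cast<std::uint8_t>(0xD1 + (l - 'J'));
    if (l >= 'S' && l <= 'Z') return static_cast<std::uint8_t>(0xE2 + (l - 'S'));
    if (l >= '0' && l <= '9') return static_cast<std::uint8_t>(0xF0 + (l - '0'));
    if (l == 0x20) return 0x40;  // space
    return 0x6F;                 // '?' in IBM-1047
  }

  // Write direction: native (NCS) text goes onto the stream in the TCS.
  static void write_string(ByteStream& out, const std::string& native) {
    for (unsigned char c : native) out.push_back(to_tcs(c));
  }

  // Read direction: TCS bytes from the stream come back as native text.
  static std::string read_string(const ByteStream& in) {
    std::string native;
    for (std::uint8_t b : in) native.push_back(static_cast<char>(to_ncs(b)));
    return native;
  }
};
```

Writing the IBM-1047 bytes for HELLO puts the ISO 8859-1 bytes on the stream, and reading them back restores the original native bytes.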
template <...>
class TAO_Export TAO_Codeset_Translator_Factory_T

The template arguments are:
dynamic Char_IBM1047_ISO8859_Factory Service_Object * TAO_CodeSet:_make_TAO_Char_IBM1047_ISO8859_Factory ()
static Resource_Factory "-ORBNativeCharCodeset EBCDIC -ORBNativeWCharCodeset 0x10026352 -ORBCharCodesetTranslator Char_IBM1047_ISO8859_Factory"

The first directive loads a service object called Char_IBM1047_ISO8859_Factory, which is in the TAO_CodeSet library. The second directive uses the resource factory to set the NCS-C to the local codeset named "EBCDIC". A native codeset declaration may use either a name that corresponds to an entry in the codeset registry or a numeric registry value. The second argument in the resource factory directive sets the native wide character codeset using a numeric ID, which happens to correspond to "IBM-850 (CCSID 25426); Multilingual IBM PC Display-MLP". The final argument tells the resource factory to add the previously loaded translator to the list of available byte-oriented translators.
An application using these configuration options would be able to communicate with other applications that use, for character-oriented text, either IBM-1047, in which case no translation is needed, or ISO-8859-1, using the provided translator. For wide characters, it would be able to communicate with applications that use IBM-850.
If no codeset information is configured, the ORB assumes that ISO-8859-1 is used as the byte-oriented codeset. There is no default for non-byte-oriented codesets. If any interface includes WChar or WString data types, then at least -ORBNativeWCharCodeset must be specified.
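The idea that a native codeset may be given either as a local name or as a number can be sketched as follows. This is not how TAO parses its options: resolve_codeset and the local-name table are hypothetical, and the table stands in for the loc_name entries of the codeset registry. The default constant is the OSF registry value for ISO 8859-1.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Registry value for ISO 8859-1, the byte-oriented default described above.
const std::uint32_t kDefaultCharCodeset = 0x00010001;

// Hypothetical helper: turn an option argument such as "EBCDIC" or
// "0x10026352" into a registry value. Arguments that start with a digit
// are parsed as numbers (base detected from the prefix); anything else is
// looked up in a table of local names. Returns false if it cannot resolve.
bool resolve_codeset(const std::string& arg,
                     const std::map<std::string, std::uint32_t>& local_names,
                     std::uint32_t& id) {
  if (arg.empty()) return false;
  if (std::isdigit(static_cast<unsigned char>(arg[0]))) {
    try {
      std::size_t used = 0;
      unsigned long value = std::stoul(arg, &used, 0);
      if (used != arg.size() || value > 0xFFFFFFFFul) return false;
      id = static_cast<std::uint32_t>(value);
      return true;
    } catch (const std::exception&) {
      return false;  // not a valid number
    }
  }
  std::map<std::string, std::uint32_t>::const_iterator it = local_names.find(arg);
  if (it == local_names.end()) return false;
  id = it->second;
  return true;
}
```

A caller that finds no native char codeset option at all would fall back to kDefaultCharCodeset; for wide characters there is no such fallback, so an unresolved option would have to be treated as an error.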
One is not required to populate the Codeset Registry with all possible codesets. It is quite reasonable to build a registry with only the codesets you will actually support, as a subset of the entire registry available from the OSF. Simply construct a file containing a subset of the OSF's full codeset registry, add your own system-specific local names and run mkcsregdb. Having run mkcsregdb, you will have to rebuild ACE to link in the new registry details.
Here is an example of a single entry from the OSF's codeset registry, version 1.2g:
start
description IBM-1047 (CCSID 01047); Latin-1 Open System
loc_name NONE
rgy_value 0x10020417
char_values 0x0011
max_bytes 1
end
Note that the local name (loc_name) is assigned the string "NONE" because local names are not defined by the OSF. For the configuration shown above to work, set loc_name to the string "EBCDIC", regenerate the codeset database by running mkcsregdb, and then rebuild ACE. Many operating system vendors define their own local names, which may be related to locales. In many cases, it is possible to obtain a localized codeset registry from a system vendor. Otherwise, it is not difficult to produce your own.
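For readers who want to inspect or filter registry entries before running mkcsregdb, a minimal parser for a single entry might look like the sketch below. It assumes the one-field-per-line layout of the registry text file; RegistryEntry and parse_entry are hypothetical names, and this is not the code mkcsregdb itself uses.

```cpp
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical in-memory form of one registry entry.
struct RegistryEntry {
  std::string description;
  std::string loc_name;
  std::uint32_t rgy_value = 0;
  std::string char_values;  // kept as raw text in this sketch
  unsigned max_bytes = 0;
};

// Parses one "start ... end" entry with one field per line. Returns false
// if the entry is not terminated by "end" or a numeric field is malformed.
bool parse_entry(const std::string& text, RegistryEntry& entry) {
  std::istringstream lines(text);
  std::string line;
  bool started = false;
  try {
    while (std::getline(lines, line)) {
      std::istringstream fields(line);
      std::string key;
      if (!(fields >> key)) continue;  // skip blank lines
      if (key == "start") { started = true; continue; }
      if (!started) continue;          // ignore text before "start"
      if (key == "end") return true;
      std::string value;
      std::getline(fields >> std::ws, value);  // rest of the line
      if (key == "description") entry.description = value;
      else if (key == "loc_name") entry.loc_name = value;
      else if (key == "rgy_value")
        entry.rgy_value = static_cast<std::uint32_t>(std::stoul(value, nullptr, 0));
      else if (key == "char_values") entry.char_values = value;
      else if (key == "max_bytes")
        entry.max_bytes = static_cast<unsigned>(std::stoul(value, nullptr, 0));
    }
  } catch (const std::exception&) {
    return false;  // malformed number
  }
  return false;  // reached end of input without seeing "end"
}
```

With entries in this form, assigning a local name is a matter of replacing the loc_name field of the chosen entry before writing the subset file out again.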
The OCI CORBA News Brief is intended to promote CORBA and object technology in the development of distributed computing applications. Each issue of the CORBA News Brief will feature news and technical information about OCI's supported open-source ORBs (TAO and JacORB), case studies, and examples using CORBA, as well as information about OCI's educational offerings.
The OCI CORBA News Brief is published on a monthly basis. Send ideas for articles of interest to corba@ociweb.com.