19.4 Example 2: Defining a Multibyte Character Code Conversion (JIS <-> Unicode)

Let us consider the example of a state-dependent code conversion. As mentioned previously, this type of conversion would occur between JIS, which is a state-dependent multibyte encoding for Japanese characters, and Unicode, which is a wide-character encoding. As usual, we assume that the external device uses multibyte encoding, and the internal processing uses wide-character encoding.

Here is what you have to do to implement and use a state-dependent code conversion facet:

Define a new conversion state type if necessary.
Define a new character traits type if necessary, or instantiate the character traits template with the new state type.
Define the code conversion facet.
Instantiate new stream types using the new character traits type.
Imbue a file stream's buffer with a locale that carries the new code conversion facet.

These steps are explained in detail in the following sections.

19.4.1 Define a New Conversion State Type

While parsing or creating a multibyte sequence in a state-dependent encoding, the code conversion facet has to maintain a conversion state. By default, this state is of type mbstate_t, the implementation-dependent state type defined by the C library. If mbstate_t does not suffice to keep track of the conversion state, you have to provide your own conversion state type. We will see how this is done in the code below, but please note first that the new state type must have the following member functions:

A constructor that accepts the argument 0, which has the special meaning of creating a conversion state object that represents the initial conversion state;
Copy constructor and assignment;
Comparison for equality and inequality.

Now here is the sketch of a new conversion state type:

class JISstate_t {
public:
    JISstate_t(int state = 0)
    : state_(state) { }

    JISstate_t(const JISstate_t& state)
    : state_(state.state_) { }

    JISstate_t& operator=(const JISstate_t& state)
    {
        if (&state != this)
            state_ = state.state_;
        return *this;
    }

    JISstate_t& operator=(const int state)
    {
        state_ = state;
        return *this;
    }

    bool operator==(const JISstate_t& state) const
    {
        return (state_ == state.state_);
    }

    bool operator!=(const JISstate_t& state) const
    {
        return !(state_ == state.state_);
    }

private:
    int state_;
};
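The three requirements can be checked with a few lines of code. The following is a minimal sketch, not part of the library. It repeats a condensed JISstate_t so that it compiles on its own. The helper names is_initial_state and check_state_requirements, and the value 1 used for a non-initial state, are assumptions made only for illustration.

```cpp
#include <cassert>

// Condensed copy of the JISstate_t sketch. For a single int member, the
// compiler-generated copy constructor and copy assignment are sufficient.
class JISstate_t {
public:
    JISstate_t(int state = 0) : state_(state) { }
    JISstate_t& operator=(int state) { state_ = state; return *this; }
    bool operator==(const JISstate_t& rhs) const { return state_ == rhs.state_; }
    bool operator!=(const JISstate_t& rhs) const { return !(*this == rhs); }
private:
    int state_;
};

// Hypothetical helper: true if s represents the initial conversion state.
inline bool is_initial_state(const JISstate_t& s)
{
    return s == JISstate_t(0);
}

// Exercises construction from 0, copying, assignment, and comparison.
inline void check_state_requirements()
{
    JISstate_t initial;          // same as JISstate_t(0)
    JISstate_t shifted(1);       // 1 is an arbitrary non-initial value
    JISstate_t copy(shifted);    // copy construction

    assert(is_initial_state(initial));
    assert(copy == shifted && copy != initial);

    copy = 0;                    // assigning 0 resets to the initial state
    assert(is_initial_state(copy));
}
```

Calling check_state_requirements() runs through all three requirements; the assertions hold only if construction from 0, copying, and comparison behave as described above.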

19.4.2 Define a New Character Traits Type

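The conversion state type is part of the character traits. Hence, with a new conversion state type, you need a new character traits type.

As a nonstandard extension, this library lets you instantiate the character traits template with the new state type, as in char_traits<wchar_t,JISstate_t>. If you do not want to rely on this nonstandard and thus non-portable feature of the library, you have to define a new character traits type and redefine the necessary types: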

struct JIS_char_traits: public char_traits<wchar_t> 
{
        typedef JISstate_t                state_type;
        typedef fpos<state_type>          pos_type;
        typedef streamoff                 off_type;
};
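Because pos_type is built from the new state type, a stream position can carry a conversion state together with the file offset. The following minimal sketch illustrates this. It is self-contained, so it repeats a condensed JISstate_t and the traits type with std:: qualification. The helper names make_position and check_position_round_trip, the offset 42, and the state value 1 are assumptions made only for illustration.

```cpp
#include <cassert>
#include <ios>       // std::fpos, std::streamoff
#include <string>    // std::char_traits

// Condensed copy of the conversion state type from the previous section.
class JISstate_t {
public:
    JISstate_t(int state = 0) : state_(state) { }
    bool operator==(const JISstate_t& rhs) const { return state_ == rhs.state_; }
    bool operator!=(const JISstate_t& rhs) const { return !(*this == rhs); }
private:
    int state_;
};

struct JIS_char_traits : public std::char_traits<wchar_t>
{
    typedef JISstate_t            state_type;
    typedef std::fpos<state_type> pos_type;
    typedef std::streamoff        off_type;
};

// Hypothetical helper: a position at offset off that remembers the
// conversion state that was in effect there.
inline JIS_char_traits::pos_type
make_position(JIS_char_traits::off_type off, const JISstate_t& state)
{
    JIS_char_traits::pos_type pos(off);
    pos.state(state);            // fpos stores the state next to the offset
    return pos;
}

// Checks that both the offset and the state survive the round trip.
inline void check_position_round_trip()
{
    JIS_char_traits::pos_type pos = make_position(42, JISstate_t(1));
    assert(JIS_char_traits::off_type(pos) == 42);
    assert(pos.state() == JISstate_t(1));
}
```

A stream that saves such a position and later seeks back to it has the information needed to resume conversion in the correct state.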

19.4.3 Define the Code Conversion Facet

Just as in the first example, you have to define the actual code conversion facet. The steps are basically the same as before: define a new class template for the new code conversion type, then specialize it. The code would look like this:

template <class internT, class externT, class stateT>
class UnicodeJISConversion
: public codecvt<internT, externT, stateT>
{
};

template <>
class UnicodeJISConversion<wchar_t, char, JISstate_t>
: public codecvt<wchar_t, char, JISstate_t>
{
protected:

 // External (JIS) to internal (Unicode) conversion.
 result do_in(JISstate_t& state,
              const char*  from,
              const char*  from_end,
              const char*& from_next,
              wchar_t*     to,
              wchar_t*     to_limit,
              wchar_t*&    to_next) const;

 // Internal (Unicode) to external (JIS) conversion.
 result do_out(JISstate_t& state,
               const wchar_t*  from,
               const wchar_t*  from_end,
               const wchar_t*& from_next,
               char*           to,
               char*           to_limit,
               char*&          to_next) const;

 // A conversion is always necessary.
 bool do_always_noconv() const throw()
 { return false; }

 // -1 identifies the encoding as state-dependent.
 int do_encoding() const throw()
 { return -1; }

};

In this case, the function do_encoding() has to return -1, which identifies the code conversion as state-dependent. Again, the functions in() and out() have to conform to the error indication policy explained under class codecvt in the Class Reference.

The distinguishing characteristic of a state-dependent conversion is that the conversion state argument to in() and out() is used for communication between the file stream buffer and the code conversion facet. The file stream buffer is responsible for creating, maintaining, and deleting the conversion state. At the beginning, the file stream buffer creates a conversion state object that represents the initial conversion state and hands it over to the code conversion facet. The facet modifies it according to the conversion it performs. The file stream buffer receives the modified state and stores it between two subsequent code conversions.
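To make this hand-over of the conversion state concrete, here is a minimal sketch of one conversion step, written as a free function with the same parameter layout as do_in(). It is not the facet's actual implementation. The names jis_in_sketch, JisMode, and demo_split_input are hypothetical, and a plain enum stands in for the conversion state type. The sketch assumes ISO-2022-JP style input and recognizes only two escape sequences: ESC $ B selects two-byte JIS X 0208 characters, and ESC ( B selects ASCII. Instead of looking up the Unicode value, which would require a mapping table, it stores the raw two-byte JIS code as a placeholder.

```cpp
#include <cassert>
#include <cstddef>
#include <locale>   // std::codecvt_base::result

// Hypothetical stand-in for the conversion state: the current shift mode.
enum JisMode { ASCII_MODE = 0, KANJI_MODE = 1 };

// Converts bytes from [from, from_end) into [to, to_limit) and updates mode.
// The caller owns mode and must pass the same object to the next call.
inline std::codecvt_base::result
jis_in_sketch(JisMode& mode,
              const char* from, const char* from_end, const char*& from_next,
              wchar_t* to, wchar_t* to_limit, wchar_t*& to_next)
{
    from_next = from;
    to_next = to;
    while (from_next != from_end) {
        const std::ptrdiff_t avail = from_end - from_next;
        if (*from_next == '\x1b') {                 // three-byte escape sequence
            if (avail < 3)
                return std::codecvt_base::partial;  // wait for the rest
            if (from_next[1] == '$' && from_next[2] == 'B')
                mode = KANJI_MODE;
            else if (from_next[1] == '(' && from_next[2] == 'B')
                mode = ASCII_MODE;
            else
                return std::codecvt_base::error;    // unsupported in this sketch
            from_next += 3;
            continue;
        }
        const std::ptrdiff_t need = (mode == KANJI_MODE) ? 2 : 1;
        if (avail < need || to_next == to_limit)
            return std::codecvt_base::partial;      // input or output exhausted
        const unsigned char b1 = static_cast<unsigned char>(from_next[0]);
        if (mode == KANJI_MODE) {
            const unsigned char b2 = static_cast<unsigned char>(from_next[1]);
            // Placeholder: raw JIS code; a real facet maps it to Unicode here.
            *to_next++ = static_cast<wchar_t>((b1 << 8) | b2);
        } else {
            *to_next++ = static_cast<wchar_t>(b1);
        }
        from_next += need;
    }
    return std::codecvt_base::ok;
}

// Feeds one JIS sequence in two chunks, the way a stream buffer might.
inline void demo_split_input()
{
    const char first[]  = "A\x1b$B0";      // ends in the middle of a character
    const char second[] = "0!\x1b(BZ";     // unconsumed byte plus new input
    wchar_t out[8];
    const char* from_next = 0;
    wchar_t* to_next = 0;
    JisMode mode = ASCII_MODE;             // initial conversion state

    std::codecvt_base::result r = jis_in_sketch(
        mode, first, first + 5, from_next, out, out + 8, to_next);
    assert(r == std::codecvt_base::partial);
    assert(mode == KANJI_MODE && from_next == first + 4 && to_next == out + 1);

    wchar_t* const resume = to_next;       // continue after the first result
    r = jis_in_sketch(mode, second, second + 6, from_next, resume, out + 8, to_next);
    assert(r == std::codecvt_base::ok && mode == ASCII_MODE);
    assert(out[0] == L'A' && out[1] == wchar_t(0x3021) && out[2] == L'Z');
}
```

In demo_split_input(), the first call returns partial with mode already set to KANJI_MODE. Because the caller keeps mode and hands it back in, the second call interprets the bytes 0x30 0x21 as one two-byte character rather than as the two ASCII characters 0 and !.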

19.4.4 Use the New Code Conversion Facet

Here is an example of how the new code conversion facet can be used:

typedef basic_fstream<wchar_t,JIS_char_traits> JIS_fstream;   //1
JIS_fstream inout("/tmp/fil");
// The locale takes ownership of the facet, so allocate it with new.
locale cvtloc(locale(), new UnicodeJISConversion<wchar_t,char,JISstate_t>);
inout.rdbuf()->pubimbue(cvtloc);                              //2
wcout << inout.rdbuf();                                       //3

//1	Our Unicode-JIS code conversion needs a conversion state type different from the default type `mbstate_t`. Since the conversion state type is contained in the character traits, we have to create a new file stream type. Instead of `JIS_char_traits`, we could have taken advantage of the nonstandard extension to the character traits template and used `char_traits<wchar_t,JISstate_t>`.
//2	Here the stream buffer's locale is replaced by a copy of the global locale that has a Unicode-JIS code conversion facet.
//3	The content of the JIS encoded file `"/tmp/fil"` is read, automatically converted to Unicode, and written to `wcout`.