Unicode, UTF-8, and hours of fun on IRC

This is the text of an email I sent to the local LUG list about how to generate funky characters when using X.

I shall format it up properly later.

Warning: this has lots of weird characters in it. Make sure your 
mailer understands UTF-8 or it will look very silly.

(Goodness knows how it will look in the archives.) 

On Mon, May 24, 2004 at 10:38:27PM +0100 or thereabouts, Dafydd
Harries wrote:
> Ar 23/05/2004 am 21:12, ysgrifennodd Chris M. Jackson:
> > Ar Sul, 2004-05-23 19:39 +0100, ysgrifennodd Neil Jones:
> > > Hi folks,
> > > 
> > > I need to type some accents and characters and cannot work
> > > out how to enter them on my normal English keyboard. I am
> > > still using the standard English version of my system
> > 
> > Shift+AltGr is the default X Compose key for most
> > installations.  Press (not hold) the Compose key(s) and the
> > two characters to compose.  Examples:
> > 
> > Ô = <Compose> O ^ 
> > © = <Compose> c o
> >  
> > Happy composing 8)
> 
> Just to make this crystal clear, you have to press Shift and
> AltGr together, then let them go, then press the two characters
> you wish to compose one after another. E.g.:
> 
> Press Shift 
> Press AltGr 
> Let go 
> Press ^ 
> Let go 
> Press a 
> Let go
> --> â
> 
> Quite the keyboard acrobatics.
> 
> Other combinations:
> 
> ' a --> á 
> ` a --> à 
> " a --> ä
> 
> You can probably guess others. Perhaps somebody can point to a
> list?

XFree86 (Xorg, now) lists them all in a big file. Do not expect 
to be able to read all the contents of the file unless you have a 
very large set of fonts. In CVS it's in xc/nls/Compose/ but on
my (not-yet-upgraded-to-X.org) system the file is 
/usr/X11R6/lib/X11/locale/en_US.UTF-8/Compose.

It is full of stuff like this:

<Multi_key> <o> <c> 			: "©" copyright
<Multi_key> <question> <question> 	: "¿" questiondown
<Multi_key> <o> <e> 			: "œ" oe

<Multi_key> is presumably the [Shift][AltGr] combination.

and then (after a pile of Korean), you get to a series of 

<Multi_key> <grave> <A> : "À" U00C0 # LATIN CAPITAL LETTER A WITH GRAVE

There are three different ways to produce that character. One
uses this "<Multi_key>" thing. One uses a "<combining_grave>".
One uses a "<dead_grave>". I have no idea what the other two
are. I presume they're for other keyboards.

So if the character is listed in that file, you should be 
able to generate it whilst in X. And X should be able to 
understand it. X may not be able to find a font which has
the glyph in it in order to display it for you, but that's
different.
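
To check what a Compose file offers for a given character, the
lines can be picked apart with a short script. This is only a
sketch: it assumes every interesting line has the shape shown
above (key names in angle brackets, a colon, then the result in
double quotes), and the sample lines, including the
"<dead_grave>" one, are modelled on the excerpts rather than
copied from a real file.

```python
import re

# Matches lines such as:  <Multi_key> <o> <c> : "©" copyright
COMPOSE_LINE = re.compile(r'^\s*((?:<[^>]+>\s*)+):\s*"((?:[^"\\]|\\.)*)"')


def parse_compose(lines):
    """Map each produced string to the list of key sequences that yield it."""
    table = {}
    for line in lines:
        match = COMPOSE_LINE.match(line)
        if not match:
            continue  # comments, blank lines, anything unrecognised
        keys = tuple(re.findall(r'<([^>]+)>', match.group(1)))
        table.setdefault(match.group(2), []).append(keys)
    return table


# Illustrative sample, not a real Compose file.
sample = [
    '<Multi_key> <o> <c> \t\t\t: "©" copyright',
    '<Multi_key> <question> <question> \t: "¿" questiondown',
    '<Multi_key> <grave> <A> : "À" U00C0 # LATIN CAPITAL LETTER A WITH GRAVE',
    '<dead_grave> <A> : "À" U00C0 # LATIN CAPITAL LETTER A WITH GRAVE',
]

table = parse_compose(sample)
for keys in table["À"]:
    print(" ".join(keys))
```

For a real file, pass an open file object (opened with
encoding="utf-8") to parse_compose instead of the sample list,
using whatever path the Compose file has on your system.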


There are some other ways which won't work in all of X. I
do not know the Qt/KDE ways. But Gtk (and hence Gnome, grip,
the Gimp and a bunch of others) also lets you generate such
characters by typing the Unicode "code point" of the character
whilst you have [shift] and [ctrl] held down. This, of course,
means knowing the hexadecimal number associated with a 
particular character. 

Code points are.. well. I would call them "the number for
a character" but I suspect that's not the whole story. 
They start with a capital U and a plus sign. You can
try to find the character you want in gucharmap (Gnome
Unicode character map). Or you can just keep a little list
yourself as you come across them. Which is what I do.

So there is a series of musical characters at U+2669 and on.
The trademark symbol is U+2122. The card suits start at U+2660.
And to generate a musical note, hold down control and shift.
Keep holding them down (note this is different from X's normal
way of putting fancy characters in) and type '2 6 6 a'. As you 
type these, you'll get those characters themselves, underlined. 
Until you type the final digit and lift your fingers from 
control and shift. Then you will get a ♪. 
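
If you have Python to hand, it can do the lookup in the other
direction: give it the hexadecimal number and it will tell you
which character that is. The sketch below uses only the standard
library; the describe() helper is a name made up for this
example. It also prints the UTF-8 bytes, to show that the code
point is the number and UTF-8 is just one way of writing that
number down as bytes.

```python
import unicodedata


def describe(code_point_hex):
    """Return (character, Unicode name, UTF-8 bytes as hex) for a hex code point."""
    char = chr(int(code_point_hex, 16))
    return char, unicodedata.name(char), char.encode("utf-8").hex()


for cp in ("266A", "2122", "2665"):
    char, name, utf8 = describe(cp)
    print(f"U+{cp} {char} {name} (UTF-8 bytes: {utf8})")
```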

I ♥ Unicode :) There was a time when lots of software simply
didn't support it, but Unicode awareness is increasing to
the extent that http://utf-8.org/ documents not the software
which supports it but instead the software which doesn't.


> Other things have the notion of composing characters, including
> the Linux console and Vim. In Vim, you can do

I still do not know how to do any of this at the console. Well,
other than starting Vim. Anyone know the console way to do it?

> Control-K ' a -> á
> 
> You can even use it for Greek, Cyrillic, Arabic, Hebrew and
> other symbols. The command ":digraphs" will give you a list and
> ":help digraphs" the documentation. Of course, you have to make
> sure that the encodings and fonts are set up correctly for you
> to be able to view the characters you've typed in. :)

I remember discovering that for this to work in vim, you also
needed to have your locale set correctly. On Red Hat, for
example, I had an account somewhere where the default locale
was en_US or something. I had to change it to en_US.UTF-8 or
en_GB.UTF-8 to get vim to do the control-K stuff. 
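
A quick way to see which locale an account will end up with is
to look at the environment. The sketch below assumes the usual
POSIX order for the character-type setting (LC_ALL first, then
LC_CTYPE, then LANG); both function names are made up for the
example, and it only inspects the variables, it does not ask the
C library what it actually loaded.

```python
import os


def effective_locale(env):
    """Return the locale name the usual POSIX lookup order would select."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value
    return "C"  # the default when nothing is set


def is_utf8_locale(name):
    """True if the locale name carries a UTF-8 codeset, e.g. en_GB.UTF-8."""
    codeset = name.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"


name = effective_locale(os.environ)
print(name, "->", "UTF-8" if is_utf8_locale(name) else "not UTF-8")
```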

We should write this stuff up.

Telsa
  

IRC and Unicode

Contrary to popular belief (i.e., mine, until recently), IRC will handle non-ASCII characters. I have seen, read and generated characters from Japanese to those in the email above in IRC. There are a few things to know.

Both you and the person (people) you are talking to need to have clients (and, if they are text clients, terminals) which know about UTF-8. There are lots of encodings which handle non-ASCII characters but are not UTF-8: iso-8859-1, iso-8859-2 and so on up to iso-8859-15 (some of which are known as Latin-1, Latin-2 and friends), KOI8-R and the pestilential Windows-1252 which infests the web.

People using 8859-something-not-1 tend to assume they have proper working accents. When you tell them that they don't (and they don't, because the bytes they send are generally not valid UTF-8 and arrive as garbage in a UTF-8 client), they get cross. This may be a small thing, but it's a pest. In addition, if all their friends use the same character set, they are not going to change just to please you, because then they are going to have trouble reading what their friends say.
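
The mismatch is easy to reproduce away from IRC. The Python sketch below is only an illustration (the Czech sample word and the is_valid_utf8 helper are chosen for the example): it encodes the same word as iso-8859-2 and as UTF-8, checks which byte string a UTF-8 client could decode, and then shows what a single UTF-8 character looks like to someone reading it as iso-8859-1.

```python
def is_valid_utf8(raw):
    """True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


word = "žluťoučký"  # every letter exists in iso-8859-2
latin2_bytes = word.encode("iso-8859-2")
utf8_bytes = word.encode("utf-8")

print(is_valid_utf8(latin2_bytes))  # False: a UTF-8 client cannot decode this
print(is_valid_utf8(utf8_bytes))    # True

# The reverse problem: UTF-8 bytes read by an iso-8859-1 client.
print("é".encode("utf-8").decode("iso-8859-1"))  # Ã©
```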

I have no intention of ending up maintaining a page on how to get UTF-8 in all clients, but here, courtesy of Linuxchix (hello, gang) is a list of the clients I do know about.