Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

Re: proper UTF-8 support under OSX

To make it as complicated as it seemingly is:

On Fri, 23 Oct 2009, François Revol wrote:

> Charset = character set = the list of available glyphs, no matter how
> they are coded.

Yes, a rather abstract definition of what you 'want to use/show',
or of what you have available (like 'ASCII' being 'smaller' than
'ISO-8859-*', which in turn is 'smaller' than 'Unicode').

> Encoding = how they are represented.
> Like, UTF-8 one of the many encodings or the Unicode charset.

Yes, how they are encoded in 'strings the computer works with'.
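
To make the distinction concrete, here is a minimal Python sketch (my own
illustration, not something from the thread). It takes one character from
the Unicode charset and shows its bytes under three encodings; the variable
names are made up for the example.

```python
# One character from the Unicode charset, identified by its code point ...
ch = "ß"
code_point = ord(ch)                      # 0xDF

# ... and the bytes it becomes under different encodings.
latin1_bytes = ch.encode("iso-8859-1")    # one byte
utf8_bytes = ch.encode("utf-8")           # two bytes
utf16_bytes = ch.encode("utf-16-be")      # two bytes, different ones

# The 'smaller' ASCII charset simply has no 'ß' at all.
try:
    ch.encode("ascii")
    in_ascii = True
except UnicodeEncodeError:
    in_ascii = False

print(hex(code_point), latin1_bytes, utf8_bytes, utf16_bytes, in_ascii)
```

Same character, same code point, three different byte strings, and no
representation at all in ASCII.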

If you type something on a typical UNIX keyboard into a terminal program
on an X11 display and ssh to some other machine, your 'keystrokes' cross
lots of 'tables of codes'.

- Keyboard (scancodes, I believe, and loadable tables in the kernel)
  maps 'your kind of hardware' to some 'standard keyboard', e.g.
  German in my case.
- This maps into X11 display symbolic keys given to the program (e.g.
  xterm), so now X11 INPUT knows about the German 'ß' as a key pressed.
- The X11 program then 'types' them again through the 'pts' devices
  (pseudo keyboard on pseudo tty), where it must know which ENCODING
  will be used to map my 'ß' key to 'bytes on the (pseudo)wire'.
  Here we have the first 'Latin1 (ISO-8859-1)' versus UTF-8 case.
- Now those (hopefully UTF-8) bytes are grabbed by ssh and transferred
  to the remote, and there into another 'pts'. So far we send 'just bytes'
  but must make sure 'everything gets through'(!)
  If one of the 'pseudo-ttys' accepts 7-bit chars only, or eats up some
  bytes ... trouble ...
- On the remote host now sits your zsh, which MUST know which ENCODING
  the sent string has in order to understand my 'ß' (now two bytes in
  UTF-8). Same with 'mutt' or all the other programs on the remote
  'pseudo-tty'.
- BUT now that remote shell/program/whatever sends an answer ... ... ...
  back again, hopefully Unicode, sent as multiple bytes per character up
  to the local pseudo-tty, where the local program (e.g. xterm) grabs
  them and needs to send them to the display.
- And HERE is the next mapping: fonts and font encoding.
  The DISPLAY now must know how to *show* the character on-screen.
  It takes another table and looks up the little rectangle of bits.
  BUT there too 'UTF-8' versus whatever else can fail.
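
Why the program at the far end 'MUST know the ENCODING' can be shown with
a small Python sketch (an illustration of mine, with made-up variable
names). The same seven bytes arrive on the pts; how many characters the
receiver believes it got depends only on the table it assumes.

```python
# "Straße" as it arrives on the remote pts when the sender used UTF-8.
received = b"Stra\xc3\x9fe"

# A receiver that assumes UTF-8 sees six characters, 'ß' among them.
as_utf8 = received.decode("utf-8")

# A receiver that assumes Latin-1 sees seven characters: the two bytes
# of 'ß' turn into 'Ã' plus an invisible control character (U+009F).
as_latin1 = received.decode("iso-8859-1")

print(len(received), len(as_utf8), len(as_latin1))
print(repr(as_utf8), repr(as_latin1))
```

A line editor that counts characters to place the cursor would be off by
one column in the second case, which is the typical symptom people report.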

So to make a long story short:

keyboard--X11_input--pts--ssh--sshd--pts--remote_prog--(and back)

And on nearly every "--" you can mess up your 'multibyte' versus
'8-bit' versus '7-bit' per character (glyph) situation by assuming
the wrong 'table of codes'.
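
A rough Python sketch of that chain, purely as an illustration: the
function names (round_trip, clean, strip_high_bit) are hypothetical and
model nothing more than 'bytes go in, bytes come out' on each link, with
an assumed encoding at either end.

```python
def clean(data: bytes) -> bytes:
    """An 8-bit clean link: every byte gets through unchanged."""
    return data


def strip_high_bit(data: bytes) -> bytes:
    """A link that only passes 7 bits: the top bit of each byte is lost."""
    return bytes(b & 0x7F for b in data)


def round_trip(text, sender_encoding, links, receiver_encoding):
    """Encode at the sender, pass through each link, decode at the receiver."""
    data = text.encode(sender_encoding)
    for link in links:
        data = link(data)
    # Undecodable bytes become U+FFFD instead of raising an error.
    return data.decode(receiver_encoding, errors="replace")


# Everybody agrees on UTF-8 and every link is 8-bit clean.
ok = round_trip("ß", "utf-8", [clean, clean, clean], "utf-8")

# The terminal sends Latin-1, the remote program assumes UTF-8.
mismatch = round_trip("ß", "iso-8859-1", [clean, clean], "utf-8")

# Both ends use UTF-8, but one link in the middle is 7-bit only.
seven_bit = round_trip("ß", "utf-8", [clean, strip_high_bit, clean], "utf-8")

print(repr(ok), repr(mismatch), repr(seven_bit))
```

Only the first case delivers 'ß'. The second yields a replacement
character, the third a 'C' followed by a control character.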

I hope I got this right, as I have to explain it time and time again
to our students messing up their connections from home to 'here',
but normally in German :-))


Christoph von Stuckrad      * * |nickname |Mail <stucki@xxxxxxxxxxxxxxx> \
Freie Universitaet Berlin   |/_*|'stucki' |Tel(Mo.,Mi.):+49 30 838-75 459|
Mathematik & Informatik EDV |\ *|if online|  (Di,Do,Fr):+49 30 77 39 6600|
Takustr. 9 / 14195 Berlin   * * |on IRCnet|Fax(home):   +49 30 77 39 6601/
