<revremark>First rendition released by Maninder Bali.</revremark>
</revision>
</revhistory>
<abstract>
<para>
This is a detailed guide on how to install and use Indic scripts (Devanagari etc.) using UTF-8 encoding under GNU/Linux. This HOWTO is a work in progress. More sections regarding fonts and other related things shall be added to this HOWTO in due course. Special thanks to Dan Scott for conversion from HTML to DocBook v4.1.2 (XML). Any feedback, suggestions, pointers, gifts, CDs, BMWs will be gladly accepted. All flames will be redirected to <filename>/mnt/praises_for_thee/</filename> for future reference. Be afraid.
</para>
</abstract>
</articleinfo>
<sect1 id="intro">
<title>Introduction</title>
<para>
This HOWTO has been written to help you set up your Linux box to use UTF-8 encoding for various Indic scripts. You will have to install the IndiX system developed by NCST, Mumbai on your machine in order to use these scripts. I have tested the IndiX system on Exodus GNU/Linux, RedHat Linux, and Mandrake Linux. If you have tested this system on a machine running Debian, please let me know and I will include that in this HOWTO.
I want to thank Mr. Keyur Shroff from NCST, Mumbai for allowing me to modify and redistribute his Devanagri-HOWTO.
</para>
<para>
Please note that Exodus GNU/Linux, developed by the good guys at Centurion Linux, India will ship with the IndiX system installed, thanks to the Transfer of Technology deal signed by NCST, Mumbai and Centurion Linux Pvt. Ltd.
</para>
<para>
Almost all of the leading GNU/Linux distributions available today have been localized in various international languages like French, German, Spanish, Chinese, Arabic, etc. This HOWTO aims at documenting the steps involved in localizing your GNU/Linux distribution to the Indic scripts of your choice. To begin with, you must be aware of the complexity involved in localizing any of the Indian languages. Text input in an Indian language differs from text input in English. Perhaps the most significant difference is that in English, each keystroke maps directly onto a letter, and each letter has a unique code. In Indian languages, on the other hand, the basic unit of writing is the 'syllable', which is composed of one or more characters entered through the keyboard.
</para>
<para>
The syllable is composed of vowels, consonants, modifiers and other special graphics signs. These are encoded, just as roman letters are. The user types in a sequence of vowels, consonants, modifiers and the graphics signs. The machine then composes these syllables at run time based on language dependent rules. Every syllable is thus represented in the machine as a unique sequence of vowels, consonants and modifiers. In a text sequence, these characters are stored in logical (phonetic) order.
</para>
<para>
Indic characters can combine or change shape depending on their context. A character's appearance is affected by its ordering with respect to other characters, the font used to render the character, and the application or system environment. These variables can cause the appearance of Devanagari characters to be different from their nominal glyphs (used in the code charts). Additionally, some characters cause a change in the order of the displayed glyphs. This reordering is not commonly seen in non-Indic scripts and occurs independently of any bi-directional character reordering that might be required.
</para>
<para>
Each syllable has a unique visual representation. However, there are too many syllables to design glyphs for each one individually. So a font normally contains certain component glyphs from which a syllable is composed at run time. The onscreen representation of a syllable is then a composition of glyphs from the Indian language font. There is no direct mapping of glyph codes to the consonant, vowel or modifier codes. However, for every syllable (a sequence of consonants, vowels and modifiers) there is a corresponding sequence of glyphs. This constitutes a many-to-many mapping from keystrokes to glyphs as opposed to a simplistic one-to-one mapping in roman scripts.
</para>
<para>
Please read the <ulink url="http://www.linuxdoc.org/HOWTO/Unicode-HOWTO.html">Unicode-HOWTO</ulink> and visit <ulink url="http://www.unicode.org/">http://www.unicode.org/</ulink> for more information on the UTF-8 encoding.
</para>
<para>
The IndiX system developed by NCST, Mumbai enables most applications in the X Window System (irrespective of the toolkit used) to render Indic characters according to the Unicode standard. IndiX provides support for OpenType fonts and Unicode encoding at the X Window System level. This enables most of the existing applications to handle Indic scripts without any modification or recompilation.
</para>
<para>
Once you have installed the IndiX system, following all the steps mentioned in this HOWTO, you will be able to fly across seven seas and slap that annoying sailor who keeps goin' hic' hic'... Okay, on a more serious note, you will be able to enjoy your Linux experience in Devanagari and other Indic scripts of your choice.
</para>
</sect1>
<sect1 id="install">
<title>Installing the IndiX system</title>
<para>
You can obtain the IndiX system from the NCST, Mumbai site <ulink url="http://rohini.ncst.ernet.in/indix/">http://rohini.ncst.ernet.in/indix/</ulink>. The system is available in source as well as binary form. This HOWTO covers the installation of the IndiX system using the binary files available for download. At a later stage, I plan to cover the source installation of IndiX, too. You need to download the following files in order to install IndiX successfully onto your machine:
NCST has written Simpm (Simple Package Manager), which takes care of the entire installation process on your system. Simpm carries out the following steps for a binary distribution of the IndiX system:
<orderedlist>
<listitem><para>It reads the names of the files within the distribution by essentially running the command <command>tar -tzpPf package.tgz > .package.list</command></para></listitem>
<listitem><para>It saves all these files and the file containing the list using the command <command>tar -czpPf .old.package.tgz .package.list `cat .package.list`</command></para></listitem>
<listitem><para>Simpm then extracts the files from the package and installs them using <command>tar -xzpPf package.tgz</command></para></listitem>
</orderedlist>
Running <command>simpm -i</command> on a package does all the above steps, 1 through 3. The 'i' flag indicates install. Successful installation will create <filename>savdir/.old.package.tgz</filename>. If it finds an existing <filename>.old.package.tgz</filename>, simpm will not proceed, as this means that the IndiX system has already been installed. However, you can force an IndiX install by renaming the package to a new name. Alternatively, you can uninstall the package and install it again.
Simpm can also uninstall the package. Note, however, that uninstallation will work only if simpm finds a readable <filename>.old.package.tgz</filename>. When it uninstalls the package, simpm restores the original files that were overwritten by the package. The <filename>.old.package.tgz</filename> is deleted after the uninstallation so that all traces of the previous installation are removed. Simpm maintains a log of all installs and uninstalls in the <filename>savdir/simpm.log</filename> file.
</para>
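<para>
The three steps above can be sketched in shell as follows. This is only an illustration of the backup-then-extract idea, not Simpm's actual code. It assumes a throwaway temporary directory and relative paths (so the <option>-P</option> flag is dropped), and the package and file names (<filename>package.tgz</filename>, <filename>etc/demo.conf</filename>) are made up for the example.
</para>

```shell
#!/bin/sh
# Sketch of the three Simpm steps on a toy package in a temp directory.
# All names below are hypothetical; the real Simpm works on absolute paths.

work=$(mktemp -d)
mkdir -p "$work/pkg/etc" "$work/root/etc"

# Build a toy package containing one file.
echo "new version" > "$work/pkg/etc/demo.conf"
tar -czpf "$work/package.tgz" -C "$work/pkg" etc/demo.conf

# Pretend the target system already has an older copy of that file.
echo "old version" > "$work/root/etc/demo.conf"

cd "$work/root"

# Step 1: record the names of the files in the package.
tar -tzpf ../package.tgz > .package.list

# Step 2: save the existing files, plus the list itself, as a backup.
tar -czpf .old.package.tgz .package.list $(cat .package.list)

# Step 3: extract the package over the existing files.
tar -xzpf ../package.tgz

# The installed file now holds the packaged content.
cat etc/demo.conf
```

<para>
Extracting <filename>.old.package.tgz</filename> again in the same directory would put the old file back, which is roughly what the uninstall described above relies on.
</para>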
<para>
To install the IndiX system, all you have to do is (pray and do your favourite tribal dance) type in the following commands:
</para>
<para>
<programlisting> # simpm -i /path/to/gtk.tar.gz
# simpm -i /path/to/indix.tar.gz</programlisting>
and all the necessary files will be backed up, and the IndiX system installed on your machine. Hurrah.
</para>
<para>
Congratulations, o' most precious one, on having installed the IndiX system on your machine. The remainder of this HOWTO will focus on setting up your Linux environment to support Indic fonts and scripts in X.
</para>
</sect2>
</sect1>
<sect1 id="iosetup">
<title>Devanagari Input and Output setup</title>
<sect2>
<title>Linux console</title>
<para>
Devanagari characters do not display properly in a Linux console. However, NCST has developed ncst-term (a terminal emulator program for the X Window System), which supports converting keystrokes to UTF-8 before sending them to the application running in the ncst-term, and displaying Unicode characters that the application outputs as a UTF-8 byte sequence.
</para>
</sect2>
<sect2 id="xwindows">
<title>X Window System</title>
<para>
You need to make some changes in your <filename>XF86Config-4</filename> file (which usually resides in the <filename class="directory">/etc/X11/</filename> directory). A sample config file, <filename>XF86Config-4.indix</filename>, is installed along with the IndiX system. This file can be found in the <filename class="directory">/etc/X11/</filename> directory.
</para>
<sect3>
<title>Devanagari Font</title>
<para>
OpenType is the most suitable font format to render any Indic script properly. The IndiX system ships with one OpenType font called "raghu" for Hindi. Anyone can use and distribute this font free of cost. You can find this font in the <filename class="directory">/usr/X11R6/lib/X11/fonts/TrueType/</filename> directory.
</para>
<para>
Installing the Indic Fonts:
</para>
<para>
In order to install the Indic fonts, you must log in as root. The X Font Server (xfs) is known to have some problems with the IndiX system, so remove it from the FontPath of the X server. This can be achieved by modifying your <filename>XF86Config-4</filename> file (usually in <filename class="directory">/etc/X11/</filename>): comment out the xfs line in the Files section and add <filename class="directory">/usr/X11R6/lib/X11/fonts/TrueType/</filename> to the current FontPath.
</para>
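<para>
The edit described above can be tried out on a scratch copy first. The following is only a sketch: it assumes that the xfs entry in the Files section looks like <literal>FontPath "unix/:7100"</literal> (a common default, but check your own file), and it writes to temporary files rather than to the real <filename>XF86Config-4</filename>.
</para>

```shell
#!/bin/sh
# Sketch: comment out the xfs FontPath entry and add the TrueType directory.
# Works on a scratch file; "unix/:7100" is an assumed, typical xfs entry.

conf=$(mktemp)
printf '%s\n' \
  'Section "Files"' \
  '    FontPath "unix/:7100"' \
  'EndSection' > "$conf"

# For every FontPath line pointing at the xfs socket: keep it as a comment
# and emit a FontPath line for the TrueType directory in its place.
awk '
  /FontPath.*"unix\// {
    print "#" $0
    print "    FontPath \"/usr/X11R6/lib/X11/fonts/TrueType/\""
    next
  }
  { print }
' "$conf" > "$conf.new"

cat "$conf.new"
```

<para>
Once the result looks right, the same change can be made by hand (as root) in the real configuration file, followed by a restart of X.
</para>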
<para>
After that, the FontPath should look something similar to this:
Next, in order to make use of the OpenType font you have, load the "freetype" module at startup. You can achieve this by adding the following line in the Module section of <filename>XF86Config-4</filename> file.
<programlisting> Load "freetype"</programlisting>
Make sure you specify the modules search path in the Files section, too.
Any new Indic fonts you want to install should be placed in the <filename class="directory">/usr/X11R6/lib/X11/fonts/TrueType/</filename> directory. Now, change to this directory and run the following commands:
<programlisting> $ mkfontdir
$ xset fp rehash</programlisting>
If you want to place your new Indic fonts in some other directory, you must use <command>xset</command> to add that directory to the FontPath. Please see the <command>xset</command> man page for further assistance. You can check the newly installed fonts by running the <command>xlsfonts</command> command. If you don't see any Indic fonts using this command, you may need to restart X.
</para>
</sect3>
<sect3>
<title>Devanagari Keyboard Layout</title>
<para>
The IndiX system comes with a keyboard map file for xmodmap. You can use the utility <command>xmodmap</command> to map a Devanagari keyboard. For most distributions, when you start X, the X server will look for an <filename>Xmodmap</filename> file in the <filename class="directory">/etc/X11/</filename> directory. If that file does not exist, the server will look for a <filename>.Xmodmap</filename> in your $HOME. Just putting the <filename>.Xmodmap</filename> in your $HOME will be okay. When you start the X server, it will load this file. You can also load <filename>.Xmodmap</filename> from the command line:
</para>
<note>
<para>
If you are using XFree86 version 4.0 or later, you need to add the XkbDisable option to the InputDevice section of the <filename>XF86Config-4</filename> file. You can configure the keyboard section like the following sample.
<programlisting> Section "InputDevice"
Identifier "Keyboard0"
Driver "keyboard"
Option "XkbDisable"
EndSection</programlisting>
</para>
</note>
</sect3>
</sect2>
</sect1>
<sect1 id="locale">
<title>Locale Setup</title>
<sect2>
<title>Files and the kernel</title>
<para>
You can now use any Unicode characters in file names. No kernel or file utilities need modifications. This is because file names in the kernel can be anything not containing a null byte, and '/' is used to delimit subdirectories. When encoded using UTF-8, non-ASCII characters will never be encoded using null bytes or slashes. All that happens is that file and directory names occupy more bytes than they contain characters. For example, a filename consisting of five Greek characters will appear to the kernel as a 10-byte filename. The kernel does not know (and does not need to know) that these bytes are displayed as Greek.
</para>
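<para>
The byte count in the example above is easy to check from a shell. The sketch below spells the five Greek letters alpha to epsilon as UTF-8 octal escapes, so that it does not depend on the encoding of your terminal or editor; the variable names are arbitrary.
</para>

```shell
#!/bin/sh
# Five Greek letters (alpha..epsilon), written as UTF-8 octal escapes.
name=$(printf '\316\261\316\262\316\263\316\264\316\265')

# Count the bytes the kernel would see for this name.
bytes=$(printf '%s' "$name" | wc -c | tr -d ' ')
echo "$bytes"

# None of those bytes is a slash, so the name is a single path component.
case "$name" in
  */*) has_slash=yes ;;
  *)   has_slash=no ;;
esac
echo "$has_slash"
```

<para>
Each of these letters takes two bytes in UTF-8, which is where the 10-byte filename comes from.
</para>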
<para>
This is the general theory, so long as your files reside on Linux. On filesystems which are used from other operating systems, you have mount options to control conversion of filenames to or from UTF-8:
<itemizedlist>
<listitem>
<para>
The "vfat" filesystem has a mount option "utf8". See file <filename>/usr/src/linux/Documentation/filesystems/vfat.txt</filename>. When you give an "iocharset" mount option different from the default (which is "iso8859-1"), the results with and without "utf8" are not consistent. Therefore, I do not recommend using the "iocharset" mount option.
</para>
</listitem>
<listitem>
<para>
The "msdos" and "umsdos" filesystems have the same mount option, but it appears to have no effect.
</para>
</listitem>
<listitem>
<para>
The "iso9660" filesystem has a mount option "utf8". See file <filename>/usr/src/linux/Documentation/filesystems/isofs.txt</filename>.
</para>
</listitem>
<listitem>
<para>
Since Linux 2.2.x kernels, the "ntfs" filesystem has a mount option "utf8". See file <filename>/usr/src/linux/Documentation/filesystems/ntfs.txt</filename>.
</para>
</listitem>
</itemizedlist>
</para>
<para>
The other filesystems (nfs, smbfs, ncpfs, hpfs, etc.) don't convert filenames; therefore they support Unicode file names in UTF-8 encoding only if the other operating system supports them. Please note that to enable a mount option for all future remounts, you add it to the fourth column of the corresponding <filename>/etc/fstab</filename> line.
</para>
</sect2>
<sect2>
<title>Locale environment variables</title>
<para>
You should have the following environment variables set, containing locale names:
<variablelist>
<varlistentry>
<term>LANGUAGE</term>
<listitem>
<para>override for LC_MESSAGES</para>
</listitem>
</varlistentry>
<varlistentry>
<term>LC_ALL</term>
<listitem>
<para>override for all other LC_* variables</para>
<para>The individual LC_* variables cover character types and encoding, natural language messages, sorting rules, number formatting, money amount formatting, and date and time display.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>LANG</term>
<listitem>
<para>default value for all LC_* variables. (See `man 7 locale' for a detailed description.)</para>
</listitem>
</varlistentry>
</variablelist>
</para>
<para>
In order to tell your system and all applications that you are using UTF-8, you need to add a codeset suffix of UTF-8 to your locale names. For example, if you want to run an application in a UTF-8 Hindi locale, then with the bash shell you can specify the environment variable to be passed to that application.
After that, you need not set the LANG environment variable each time you run a specific application.
</para>
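<para>
As a rough illustration of the two ways of setting the variable, the sketch below first passes LANG to a single command only, and then exports it for everything started afterwards. The locale name <literal>hi_IN.UTF-8</literal> is an assumption (it is the usual glibc name for Hindi with UTF-8, and it must be installed on your system to have any real effect), and <command>sh -c</command> merely stands in for the application you want to run.
</para>

```shell
#!/bin/sh
# One-off: LANG is set only in the environment of this single command.
one_off=$(LANG=hi_IN.UTF-8 sh -c 'printf %s "$LANG"')
echo "$one_off"

# Persistent: export once, and every later command inherits the value.
export LANG=hi_IN.UTF-8
persistent=$(sh -c 'printf %s "$LANG"')
echo "$persistent"
```

<para>
Putting the <command>export</command> line into a shell startup file makes the setting apply to every new session.
</para>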
</sect2>
</sect1>
<sect1 id="apps">
<title>Applications with Devanagari</title>
<sect2>
<title>Browsers</title>
<sect3>
<title>Netscape Navigator</title>
<para>
Netscape 6.01 or later can display HTML documents in UTF-8 encoding. All a document needs is the following line between the &lt;head&gt; and &lt;/head&gt; tags:
</para>
<para>
<command>yudit</command> by Gáspár Sinai (<ulink url="http://czyborra.com/yudit/">http://czyborra.com/yudit/</ulink>) is an excellent Unicode text editor for the X Window System. It supports simultaneous processing of many languages, input methods, conversions for local character standards etc. It has facilities for entering text in all languages with only an English keyboard, using keyboard configuration maps. Customization is very easy. Typically you will first want to customize your font. From the font menu, choose "Unicode". Next, you should customize your input method. The input methods "Straight", "Unicode" and "SGML" are most remarkable. For details about the other built-in input methods, look in <filename class="directory">/usr/local/share/yudit/data/</filename>. To change the default for the next session, edit your <filename>$HOME/.yuditrc</filename> file. The general editor functionality is limited to editing, cut and paste, and search and replace. There is no provision for an undo. <command>yudit</command> can display text using a TrueType font, but it doesn't seem to support combining characters.
</para>
</sect3>
<sect3>
<title>Vim</title>
<para>
Vim (as of version 6.0) has good support for UTF-8. When started in a UTF-8 locale, it assumes UTF-8 encoding for the console and the text files being edited. It supports double-wide (CJK) characters as well as combining characters and therefore fits perfectly into a UTF-8 enabled ncst-term.
</para>
</sect3>
<sect3>
<title>gedit</title>
<para>
gedit is an editor developed using the GtkText widget. gedit-0.9.0 does not support FontSet. This means that you can't edit both English and Hindi text simultaneously. But if you choose a proper font, you will be able to use any one language at a time.
</para>
</sect3>
<sect3>
<title>xedit</title>
<para>
With XFree86-4.0.1, <command>xedit</command> is capable of editing UTF-8 files if your locale is set appropriately. Add the line
to your <filename>$HOME/.Xdefaults</filename> file.
</para>
</sect3>
</sect2>
<sect2>
<title>Mailers</title>
<para>
Mail clients released after January 1, 1999, should be capable of sending and displaying UTF-8 encoded mails; otherwise they are considered deficient. But these mails have to carry the MIME labels:
Simply piping a UTF-8 file into "mail" without caring about the MIME labels will not work. Mail client implementors should take a look at <ulink url="http://www.imc.org/imc-intl/">http://www.imc.org/imc-intl/</ulink> and <ulink url="http://www.imc.org/mail-i18n.html">http://www.imc.org/mail-i18n.html</ulink>.
</para>
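<para>
To give an idea of what a labelled message looks like, the sketch below writes a complete mail to a temporary file without sending it. The header values shown are the standard MIME way of declaring an 8-bit UTF-8 text body and are an assumption here, not necessarily the exact labels this HOWTO has in mind; the address and subject are placeholders.
</para>

```shell
#!/bin/sh
# Sketch: write a UTF-8 mail with explicit MIME labels to a file.
# The address is a placeholder; nothing is actually sent.
msg=$(mktemp)

{
  printf 'To: someone@example.org\n'
  printf 'Subject: UTF-8 test\n'
  printf 'MIME-Version: 1.0\n'
  printf 'Content-Type: text/plain; charset=UTF-8\n'
  printf 'Content-Transfer-Encoding: 8bit\n'
  printf '\n'
  # Body: the Devanagari word "namaste" as UTF-8 octal escapes (18 bytes).
  printf '\340\244\250\340\244\256\340\244\270\340\245\215\340\244\244\340\245\207\n'
} > "$msg"

cat "$msg"
```

<para>
A file built this way could then be handed to a sendmail-compatible program on its standard input, for example one started as <command>sendmail -t</command>, which takes the recipient from the headers.
</para>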
<para>
Now about some of the individual mail clients (or "mail user agents"):
</para>
<sect3>
<title>kmail</title>
<para>
kmail (as of KDE 1.0) does not support UTF-8 mails at all.
</para>
</sect3>
<sect3>
<title>Netscape Mail</title>
<para>
Netscape Mail can send and display mails in UTF-8 encoding, but it needs a little bit of manual user intervention. To send a UTF-8 encoded mail:
<orderedlist>
<listitem><para>After opening the "Mail" window, but before starting to compose the message, select from the menu "View -> Character Coding -> Unicode (UTF-8)".</para></listitem>
<listitem><para>Then compose the message and send it.</para></listitem>
</orderedlist>
</para>
<para>
When you receive a UTF-8 encoded mail, Netscape does not display it in UTF-8 right away, and does not even give a visual clue that the mail was encoded in UTF-8. You have to manually select from the menu
For displaying UTF-8 mails, Netscape uses different fonts. You can adjust your font settings in the <guimenu>Edit</guimenu> -> <guisubmenu>Preferences</guisubmenu> -> <guimenuitem>Fonts</guimenuitem>
dialog by selecting the "Unicode" font category.
</para>
</sect3>
<sect3>
<title>exmh</title>
<para>
exmh 2.1.2 with Tk 8.4a1 can recognize and correctly display UTF-8 mails if you add the following lines to your <filename>$HOME/.Xdefaults</filename> file.
<para>The good guys at Centurion Linux have finished work on <productname>Exodus GNU/Linux</productname>, a 100% Free Software distribution featuring full Hindi language support for GNOME and KDE. The much awaited <productname>Exodus GNU/Linux</productname> (code named BitterCoffee) is expected to be released in the Indian market shortly.</para>
I can't start the X windows system. It gives an error "Could not open default Indic font 'xyz'".
</para>
</question>
<answer>
<para>
Please make sure that the font 'xyz' is correctly installed and is in the current FontPath. The Indic fonts usually reside in the <filename class="directory">/usr/X11R6/lib/X11/fonts/TrueType/</filename> directory. Your FontPath is defined in the <filename>/etc/X11/XF86Config-4</filename> file. To learn more about how to specify your FontPath, read the section on the X Window System (3.2) in this HOWTO.
</para>
</answer>
</qandaentry>
<qandaentry>
<question>
<para>
Can I use any other font as the default system font instead of the raghu font shipped with the IndiX system?
</para>
</question>
<answer>
<para>
You can load an Indic script font by giving a command-line server option when starting the X Window System, e.g.
Here, "my_devanagari_font" and "my_tamil_font" should be replaced by the names of the fonts that you want to load. You can specify either the alias name or the full XLFD name for a font. However, an alias name must be present in the <filename>fonts.alias</filename> file, and an XLFD name in the <filename>fonts.dir</filename> file.
</para>
</answer>
</qandaentry>
<qandaentry>
<question>
<para>
I have installed IndiX system but it doesn't show Hindi characters. Why?
</para>
</question>
<answer>
<para>
This could possibly be due to the fact that your Hindi locale has not been set up correctly. To change or set the locale, you should set the LANG environment variable. Append the line
to your <filename>~/.bashrc</filename> and <filename>~/.bash_profile</filename> files. Restart your terminal emulator program and run the application. After this, the application should display Hindi characters.
</para>
</answer>
</qandaentry>
<qandaentry>
<question>
<para>
Why are some of the pixels in Hindi characters distorted?
</para>
</question>
<answer>
<para>
This is probably because the X Font Server (xfs) is running and is still in the current FontPath. You can either shut down the X Font Server or remove it from the current FontPath. To shut down xfs, issue the following command after becoming root:
To remove xfs from the current FontPath, read the section <xref linkend="xwindows"/> in this HOWTO.
</para>
</answer>
</qandaentry>
<qandaentry>
<question>
<para>
All Hindi characters are displayed, but why are they not rendered properly?
</para>
</question>
<answer>
<para>
The IndiX system uses an OpenType font to render Indic script characters, as it is the most suitable font format for Indic scripts. If you use some other kind of font, for example a TrueType font or a bitmap font, the font does not contain enough information to render Indic script text properly. So it is recommended to use only OpenType fonts for Indic scripts. Also, in case you are already using an OpenType font, please update your glibc.
</para>
</answer>
</qandaentry>
<qandaentry>
<question>
<para>
Why can't I download <acronym>ISO</acronym> images of <productname>Exodus GNU/Linux</productname>, yet?
</para>
</question>
<answer>
<para>
The good guys at Centurion Linux are looking for sponsors who can take care of their hosting needs. If you are interested in helping Centurion Linux out, please contact me at <email>bali@centurionlinux.com</email>.
</para>
</answer>
</qandaentry>
</qandaset>
</sect1>
<sect1 id="notices">
<title>Acknowledgements and Copyright</title>
<para>
Parts of this HOWTO have been taken from <ulink url="http://www.linuxdoc.org/HOWTO/Unicode-HOWTO.html">The Unicode HOWTO</ulink> by Bruno Haible and The Devanagri HOWTO by Keyur Shroff.
</para>
<para>
I would also like to take this opportunity to thank my papa, mummy and my brothers Manvinder and Kulvinder for their unconditional love and support, without whom I could never have achieved anything in life. Forever, I love you. Loshaca :)
</para>
<para>
To Girija, my girlfriend: :) Thanks for everything.
</para>
<para>
I am very grateful to Keyur Shroff for allowing me to modify and redistribute his Devanagri HOWTO. Special thanks go out to him for his guidance, help, and support.
</para>
<para>
Thanks to Rohan D'Sa and Manvinder Bali of Centurion Linux for having helped me with various UTF-8 and Indic scripts issues. Also, thanks for representing Centurion Linux at the Business Technology meet organised by Ministry of Information Technology, New Delhi.
</para>
<para>
Once again, special thanks to Dan Scott for converting the HOWTO to DocBook XML format. Thanks Dan :)