April 13, 2022 06:59 pm GMT

Devanagari Transliteration Pipeline for LaTeX

Write in Devanagari to render as IAST, Harvard-Kyoto, Velthuis, SLP1, WX etc.

Devanagari is the fourth most widely adopted writing system in the world and is used primarily in the Indian subcontinent. The script is used for more than 120 languages, notable among them Sanskrit, Hindi, Marathi, Pali and Nepali, along with several varieties of these languages.

Devanagari text can be transliterated into various standard schemes. Several input systems are based on these schemes, so that users can type the text easily. More often than not, a user prefers one particular scheme for typing the input. At times, however, the text needs to be rendered in a different scheme in the PDF document.

In my case, I prefer using ibus-m17n to type text in Devanagari. While writing articles that contain Devanagari text, I also needed to render the text as IAST in the final PDF.
One could always learn to type in another input scheme, but that can get tedious. Transliterating each word with an online tool such as Aksharamukha is equally tedious. So, I was looking for a way to type in Devanagari and have the text rendered in IAST after PDF compilation. As a solution, I came up with a system consisting of a small set of LaTeX commands that add custom syntax to LaTeX, and a Python transliteration script (based on the indic-transliteration package) that serves as a middle layer: it processes the LaTeX file and creates a new LaTeX file with the proper transliteration.

LaTeX Compilation System with Transliteration Support

There are two primary components to the system:

  1. LaTeX Syntax
  2. Transliteration Script

LaTeX Syntax

XeTeX (xelatex) and LuaTeX (lualatex) have good Unicode support and can be used to write Devanagari text. In the current example, I describe the setup with XeTeX.

We first add the required packages in the preamble of the LaTeX (.tex) file.

% This assumes your files are encoded as UTF8
\usepackage[utf8]{inputenc}

% Devanagari Related Packages
\usepackage{fontspec, xunicode, xltxtra}

Using fontspec, we can define environments for font families, to write text in specific scripts. To write Devanagari text, one needs to have a Devanagari font available. (It is assumed here that one may need to write both in Devanagari as well as other transliteration schemes.)

For more on Devanagari fonts, you may check the fonts section of this document. In this section, it is assumed that Sanskrit 2003 font is installed in the system.

To define the environments as mentioned earlier, we add the following lines in the preamble.

% Define Fonts
\newfontfamily\textskt[Script=Devanagari]{Sanskrit 2003}
\newfontfamily\textiast[Script=Latin]{Sanskrit 2003}

% Commands for Devanagari Transliterations
\newcommand{\skt}[1]{{\textskt{#1}}}
\newcommand{\iast}[1]{{\textiast{#1}}}
\newcommand{\Iast}[1]{{\textiast{#1}}}
\newcommand{\IAST}[1]{{\textiast{#1}}}

This provides us with four commands. \skt{} can be used to render Devanagari text. \iast{}, \Iast{} and \IAST{} can be used to render Devanagari text in IAST format in lower case, title case and upper case respectively. Note that from the perspective of the LaTeX engine, the commands \iast{}, \Iast{} and \IAST{} are identical. They differ only syntactically, so that the Python script can perform the transliteration and apply the appropriate modifications.
Note further that we can define new font families and new commands for any of the valid schemes as required, which can give us additional commands such as \velthuis{}, \hk{} and so on.

Minimal Example

Equipped with these commands, and some Devanagari text, we have a minimal example as follows, stored in the file minimal.tex,

\documentclass[10pt]{article}

% This assumes your files are encoded as UTF8
\usepackage[utf8]{inputenc}

% Devanagari Related Packages
\usepackage{fontspec, xunicode, xltxtra}

% Define Fonts
\newfontfamily\textskt[Script=Devanagari]{Sanskrit 2003}
\newfontfamily\textiast[Script=Latin]{Sanskrit 2003}

% Commands for Devanagari Transliterations
\newcommand{\skt}[1]{{\textskt{#1}}}
\newcommand{\iast}[1]{{\textiast{#1}}}
\newcommand{\Iast}[1]{{\textiast{#1}}}
\newcommand{\IAST}[1]{{\textiast{#1}}}

\title{Transliteration of Devanagari Text}
\author{Hrishikesh Terdalkar}

\begin{document}
\maketitle

\skt{ }
\iast{ }
\Iast{ }
\IAST{ }

\end{document}

Transliteration Script

The Python script performs the transliteration and some clean-up on the LaTeX source.

python3 finalize.py minimal.tex final.tex

This results in the content being transformed in the following way:

% ...
\skt{को न्वस्मिन् साम्प्रतं लोके गुणवान् कश्च वीर्यवान्।}
\iast{ko nvasmin sāmprataṃ loke guṇavān kaśca vīryavān|}
\Iast{Ko Nvasmin Sāmprataṃ Loke Guṇavān Kaśca Vīryavān|}
\IAST{KO NVASMIN SĀMPRATAṂ LOKE GUṆAVĀN KAŚCA VĪRYAVĀN|}
% ...

We can now proceed to compile the final.tex file.

xelatex final

This produces the output PDF.

Anatomy of the Transliteration Script

At the core of the transliteration script, there is a function transliterate_between.

import re
from typing import Callable

from indic_transliteration.sanscript import transliterate


def transliterate_between(
    text: str,
    from_scheme: str,
    to_scheme: str,
    start_pattern: str,
    end_pattern: str,
    post_hook: Callable[[str], str] = lambda x: x,
) -> str:
    """Transliterate the text appearing between two patterns

    Only the text appearing between patterns `start_pattern` and `end_pattern`
    is transliterated.
    `start_pattern` and `end_pattern` can appear multiple times in the full
    text, and for every occurrence, the text between them is transliterated.

    `from_scheme` and `to_scheme` should be compatible with scheme names from
    `indic-transliteration`

    Parameters
    ----------
    text : str
        Full text
    from_scheme : str
        Input transliteration scheme
    to_scheme : str
        Output transliteration scheme
    start_pattern : regexp
        Pattern describing the start tag
    end_pattern : regexp
        Pattern describing the end tag
    post_hook : Callable[[str], str], optional
        Function to be applied on the text within tags after transliteration
        The default is `lambda x: x`.

    Returns
    -------
    str
        Text after replacements
    """
    if from_scheme == to_scheme:
        return text

    def transliterate_match(matchobj):
        # group(1) is the text captured between the start and end patterns
        target = matchobj.group(1)
        replacement = transliterate(target, from_scheme, to_scheme)
        replacement = post_hook(replacement)
        return f"{start_pattern}{replacement}{end_pattern}"

    # Both patterns are escaped, so they are matched as literal strings
    pattern = "%s(.*?)%s" % (re.escape(start_pattern), re.escape(end_pattern))
    return re.sub(pattern, transliterate_match, text, flags=re.DOTALL)

We can provide the start and end patterns as \iast{ and } respectively, to transliterate the text enclosed in these tags.
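The matching behaviour can be explored without installing indic-transliteration. The sketch below is a simplified stand-in and not the script's own code: `replace_between` is a hypothetical name, and `str.upper` takes the place of the real transliteration call. It shows how the escaped start and end patterns, together with the non-greedy group, select each tagged span.

```python
import re
from typing import Callable


def replace_between(
    text: str,
    start_pattern: str,
    end_pattern: str,
    transform: Callable[[str], str],
) -> str:
    """Apply `transform` to every span between the two literal patterns."""
    # re.escape makes the backslash and braces match literally;
    # (.*?) is non-greedy, so each span ends at the first end pattern.
    pattern = re.escape(start_pattern) + "(.*?)" + re.escape(end_pattern)

    def replace_match(matchobj) -> str:
        return start_pattern + transform(matchobj.group(1)) + end_pattern

    return re.sub(pattern, replace_match, text, flags=re.DOTALL)


sample = r"\iast{abc} stays \iast{def}"
result = replace_between(sample, r"\iast{", "}", str.upper)
print(result)  # \iast{ABC} stays \iast{DEF}
```

Because the match stops at the first closing brace, a tagged span that itself contains braces (for example a nested command) would be cut short under this approach, so it seems safest to keep tagged text free of nested commands.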

Using this function, we can write a generic function to work with any transliteration scheme.

def latex_transliteration(
    input_text: str,
    from_scheme: str,
    to_scheme: str
) -> str:
    """Transliterate parts of the LaTeX input enclosed in scheme tags

    A scheme tag is of the form `\\to_scheme_lowercase{}` and is used
    when the desired output is in `to_scheme`.

    i.e.,
    - Text for the IAST scheme is enclosed in \\iast{} tags
    - Text for the VH scheme is enclosed in \\vh{} tags
    - ...

    Parameters
    ----------
    input_text : str
        Input text
    from_scheme : str
        Transliteration scheme of the text written within the input tags
    to_scheme : str
        Transliteration scheme to which the text within tags should be
        transliterated

    Returns
    -------
    str
        Text after replacement of text within the scheme tags
    """
    # The doubled brace adds a literal "{", so only `\scheme{...}` is matched
    start_tag_pattern = f"\\{to_scheme.lower()}{{"
    end_tag_pattern = "}"
    return transliterate_between(
        input_text,
        from_scheme=from_scheme,
        to_scheme=to_scheme,
        start_pattern=start_tag_pattern,
        end_pattern=end_tag_pattern
    )

Note: The names of schemes (and therefore the corresponding LaTeX commands) have to conform to the names of schemes used
by the indic-transliteration package.

IAST is a case-insensitive transliteration scheme, and as such, we might be interested in specific capitalization of certain words (e.g. proper nouns). The post_hook argument lets us supply a function for this purpose. Using it, we can create a function to handle the three variants of IAST mentioned previously, namely, \iast{} (lower), \Iast{} (title) and \IAST{} (upper).

from indic_transliteration import sanscript


def devanagari_to_iast(input_text: str) -> str:
    """Transliterate parts of the input enclosed in
    \\iast{}, \\Iast{} or \\IAST{} tags from Devanagari to IAST

    Text in \\Iast{} tags also undergoes a `.title()` post-hook.
    Text in \\IAST{} tags also undergoes a `.upper()` post-hook.

    Parameters
    ----------
    input_text : str
        Input text

    Returns
    -------
    str
        Text after replacement of text within the IAST tags
    """
    # Lower case: no post-hook
    intermediate_text = transliterate_between(
        input_text,
        from_scheme=sanscript.DEVANAGARI,
        to_scheme=sanscript.IAST,
        start_pattern="\\iast{",
        end_pattern="}"
    )
    # Title case
    intermediate_text = transliterate_between(
        intermediate_text,
        from_scheme=sanscript.DEVANAGARI,
        to_scheme=sanscript.IAST,
        start_pattern="\\Iast{",
        end_pattern="}",
        post_hook=lambda x: x.title()
    )
    # Upper case
    final_text = transliterate_between(
        intermediate_text,
        from_scheme=sanscript.DEVANAGARI,
        to_scheme=sanscript.IAST,
        start_pattern="\\IAST{",
        end_pattern="}",
        post_hook=lambda x: x.upper()
    )
    return final_text
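The two post-hooks are ordinary Python string methods, so their effect can be checked in isolation. The snippet below is a small illustration only: the sample string was typed by hand rather than produced by the script, and it assumes the transliterated text uses precomposed (NFC) characters.

```python
import unicodedata

# Hand-typed IAST sample, normalised to NFC so that each letter with a
# diacritic is a single code point (an assumption about the script output).
iast_text = unicodedata.normalize("NFC", "guṇavān kaśca vīryavān")

title_case = iast_text.title()  # hook used for \Iast{...}
upper_case = iast_text.upper()  # hook used for \IAST{...}

print(title_case)
print(upper_case)
```

Note that `.title()` capitalises every word in the tagged span. If only a proper noun should be capitalised, one option would be to tag that word on its own.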

Finally, there are other utility functions to remove comments and clean up excessive whitespace.
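Those utilities are not shown here. As a rough idea of what such helpers might look like, the sketch below uses hypothetical names (`remove_comments`, `clean_whitespace`) and deliberately simple rules; the functions in the actual script may well differ.

```python
import re


def remove_comments(text: str) -> str:
    """Drop LaTeX comments, keeping escaped percent signs."""
    # (?<!\\)% matches a percent sign not preceded by a backslash;
    # everything from there to the end of the line is removed.
    # Simplification: verbatim environments and a forced line break
    # directly followed by a comment are not handled.
    return re.sub(r"(?<!\\)%.*", "", text)


def clean_whitespace(text: str) -> str:
    """Strip trailing spaces and collapse runs of blank lines."""
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    return re.sub(r"\n{3,}", "\n\n", text)


source = "Growth of 5\\% % a comment\n\n\n\n\\skt{text}   \n"
cleaned = clean_whitespace(remove_comments(source))
print(repr(cleaned))
```

Removing comments before transliteration also means that commented-out tags are never processed.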

Extras

Additionally, we may want some more structure to our setup, such as,

  • Separation of content into multiple files
\input{sections/section_devanagari.tex}
\input{sections/section_iast_lower.tex}
\input{sections/section_iast_title.tex}
\input{sections/section_iast_upper.tex}
  • Bibliography
\bibliographystyle{acm}
\bibliography{papers}

Final LaTeX Preparation

We may have used the scheme tags across multiple sections. One option is to apply the transliteration script on every section file, to create a new set of section files and use those to compile the final LaTeX file.
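This first option could be scripted along the following lines. It is only a sketch: `process_sections` and the directory names are hypothetical, and `transform` stands for whatever function performs the transliteration (for example, `devanagari_to_iast` from the script).

```python
from pathlib import Path
from typing import Callable


def process_sections(
    source_dir: Path,
    target_dir: Path,
    transform: Callable[[str], str],
) -> list:
    """Write a transformed copy of every .tex file in `source_dir`."""
    target_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for source_file in sorted(source_dir.glob("*.tex")):
        text = source_file.read_text(encoding="utf-8")
        target_file = target_dir / source_file.name
        target_file.write_text(transform(text), encoding="utf-8")
        written.append(target_file)
    return written
```

Called as `process_sections(Path("sections"), Path("sections_final"), devanagari_to_iast)`, this would leave the original files untouched. The \input{} commands in main.tex would then have to point at the new directory, which is extra bookkeeping.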

A simpler solution is available in the form of latexpand which resolves the \input{} commands to actually include the content and create a single consolidated LaTeX file.

latexpand main.tex > single.tex

Now, we can run the Python script on this consolidated file to resolve the transliteration tags.

python3 finalize.py single.tex final.tex

Compilation

When working with BibTeX, we often need to compile multiple times to get the correct rendering of references in the PDF. Usually, this requires:

xelatex final
bibtex final
xelatex final
xelatex final

Alternatively, we can use latexmk which takes care of the tedious compilation routines and reduces our job to a single command,

latexmk -pdflatex='xelatex %O %S' -pdf -ps- -dvi- final.tex

Another benefit of latexmk is that the numerous files generated by the LaTeX engine can be cleaned up with a one-liner as well,

latexmk -c

Makefile

Finally, we can place all of the console commands together in a Makefile.

all: .all

.all: main.tex sections/*.tex papers.bib
	latexpand main.tex > single.tex
	python3 finalize.py single.tex final.tex
	latexmk -pdflatex='xelatex %O %S' -pdf -ps- -dvi- final.tex

clear:
	latexmk -C
	rm single.tex
	rm final.tex

clean:
	latexmk -c

Now we can focus on writing content in the .tex files and, once we are done, simply use the command,

make

Requirements

We have made use of a number of external tools, which need to be set up before the described solution can be used.

Minimal Requirements

The minimal example mentioned earlier requires only three things,

Extra Requirements

The extras have some more dependencies.

  • BibTeX (optional) (bibliography support)
  • latexpand (optional) (resolve \input{})
  • latexmk (optional) (simpler TeX compilation)

Devanagari Fonts

Nowadays, there are several good Devanagari fonts available. Google Fonts also provides a wide variety of Devanagari fonts.

Two of my personal favourites are,

Code

The source code for the entire setup is available at hrishikeshrt/devanagari-transliteration-latex.


Original Link: https://dev.to/hrishikeshrt/devanagari-transliteration-pipeline-for-latex-1fid
