Devanagari Transliteration Pipeline for LaTeX
hrishikeshrt / devanagari-transliteration-latex
Devanagari Transliteration in LaTeX -- Write in Devanagari to render as IAST, Harvard-Kyoto, Velthuis, SLP1, WX etc.
Devanagari is the fourth most widely adopted writing system in the world, used primarily in the Indian subcontinent. The script is used for more than 120 languages, notably Sanskrit, Hindi, Marathi, Pali and Nepali, along with several variations of these languages.
Devanagari text can be transliterated into various standard schemes. Several input systems are based on these transliteration schemes and let users type the text easily. More often than not, a user has a preferred scheme for typing the input. At times, however, the text needs to be rendered in a different scheme in the PDF document.
In my case, I prefer using ibus-m17n to type text in Devanagari. While writing articles that contain Devanagari text, I also needed to render the text as IAST in the final PDF.
One could always learn to type in another input scheme, but that can get tedious. Transliterating each word using online systems such as Aksharamukha is similarly tedious. So, I was looking for a way to type in Devanagari and have the text rendered in IAST after PDF compilation. As a solution, I came up with a system consisting of a small set of LaTeX commands that add custom syntax to LaTeX, and a Python transliteration script (based on the indic-transliteration package) that serves as a middle layer: it processes the LaTeX file and creates a new LaTeX file with the proper transliteration.
LaTeX Compilation System with Transliteration Support
There are two primary components to the system,
- LaTeX Syntax
- Transliteration Script
LaTeX Syntax
XeTeX (xelatex) and LuaTeX (lualatex) have good Unicode support and can be used to write Devanagari text. In the current example, I describe the setup with XeTeX.
We first add the required packages in the preamble of the LaTeX (.tex) file.
% This assumes your files are encoded as UTF8
\usepackage[utf8]{inputenc}
% Devanagari Related Packages
\usepackage{fontspec, xunicode, xltxtra}
Using fontspec, we can define environments for font families, to write text in specific scripts. To write Devanagari text, one needs to have a Devanagari font available. (It is assumed here that one may need to write both in Devanagari and in other transliteration schemes.)
For more on Devanagari fonts, you may check the fonts section of this document. In this section, it is assumed that the Sanskrit 2003 font is installed on the system.
To define the environments mentioned earlier, we add the following lines in the preamble.
% Define Fonts
\newfontfamily\textskt[Script=Devanagari]{Sanskrit 2003}
\newfontfamily\textiast[Script=Latin]{Sanskrit 2003}
% Commands for Devanagari Transliterations
\newcommand{\skt}[1]{{\textskt{#1}}}
\newcommand{\iast}[1]{{\textiast{#1}}}
\newcommand{\Iast}[1]{{\textiast{#1}}}
\newcommand{\IAST}[1]{{\textiast{#1}}}
This provides us with four commands. \skt{} can be used to render Devanagari text. \iast{}, \Iast{} and \IAST{} can be used to render Devanagari text in IAST format in lower case, title case and upper case respectively. Note that from the perspective of the LaTeX engine, the commands \iast{}, \Iast{} and \IAST{} are identical. They differ only syntactically, so that the Python script can perform the transliteration and apply the appropriate case modification.
Note further that we can define new font families and new commands for any of the valid schemes as required, which can give us additional commands such as \velthuis{}, \hk{} and so on.
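As a rough sketch of how such extra commands could be generated, the hypothetical helper `scheme_preamble` below (not part of the original setup) builds the two preamble lines for a given scheme name. It assumes the Sanskrit 2003 font and the `\text<scheme>` naming used above. It also rejects names that contain anything other than ASCII letters, because a LaTeX control word consists of letters only, so a scheme such as `slp1` would need a differently named command.

```python
def scheme_preamble(scheme: str, font: str = "Sanskrit 2003") -> str:
    """Return LaTeX preamble lines defining a font family and a command for `scheme`.

    Hypothetical helper for illustration; the naming mirrors \\textiast and \\iast.
    """
    name = scheme.lower()
    # LaTeX control words may contain letters only, so e.g. "slp1" is rejected.
    if not (name.isascii() and name.isalpha()):
        raise ValueError(f"'{scheme}' is not a valid LaTeX command name")
    return "\n".join([
        f"\\newfontfamily\\text{name}[Script=Latin]{{{font}}}",
        f"\\newcommand{{\\{name}}}[1]{{{{\\text{name}{{#1}}}}}}",
    ])


for scheme in ("velthuis", "hk"):
    print(scheme_preamble(scheme))
```

The printed lines can be pasted into the preamble next to the existing \iast{} definitions.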
Minimal Example
Equipped with these commands and some Devanagari text, we have the following minimal example, stored in the file minimal.tex,
\documentclass[10pt]{article}
% This assumes your files are encoded as UTF8
\usepackage[utf8]{inputenc}
% Devanagari Related Packages
\usepackage{fontspec, xunicode, xltxtra}
% Define Fonts
\newfontfamily\textskt[Script=Devanagari]{Sanskrit 2003}
\newfontfamily\textiast[Script=Latin]{Sanskrit 2003}
% Commands for Devanagari Transliterations
\newcommand{\skt}[1]{{\textskt{#1}}}
\newcommand{\iast}[1]{{\textiast{#1}}}
\newcommand{\Iast}[1]{{\textiast{#1}}}
\newcommand{\IAST}[1]{{\textiast{#1}}}
\title{Transliteration of Devanagari Text}
\author{Hrishikesh Terdalkar}
\begin{document}
\maketitle
\skt{को न्वस्मिन् साम्प्रतं लोके गुणवान् कश्च वीर्यवान्।}
\iast{को न्वस्मिन् साम्प्रतं लोके गुणवान् कश्च वीर्यवान्।}
\Iast{को न्वस्मिन् साम्प्रतं लोके गुणवान् कश्च वीर्यवान्।}
\IAST{को न्वस्मिन् साम्प्रतं लोके गुणवान् कश्च वीर्यवान्।}
\end{document}
Transliteration Script
The Python script performs the transliteration and some clean-up on the LaTeX file.
python3 finalize.py minimal.tex final.tex
This results in the content being transformed in the following way,
% ...
\skt{को न्वस्मिन् साम्प्रतं लोके गुणवान् कश्च वीर्यवान्।}
\iast{ko nvasmin sāmprataṃ loke guṇavān kaśca vīryavān|}
\Iast{Ko Nvasmin Sāmprataṃ Loke Guṇavān Kaśca Vīryavān|}
\IAST{KO NVASMIN SĀMPRATAṂ LOKE GUṆAVĀN KAŚCA VĪRYAVĀN|}
% ...
We can now proceed to compile the final.tex file.
xelatex final
This results in the following output,
Anatomy of the Transliteration Script
At the core of the transliteration script is the function transliterate_between.
import re
from typing import Callable

from indic_transliteration.sanscript import transliterate


def transliterate_between(
    text: str,
    from_scheme: str,
    to_scheme: str,
    start_pattern: str,
    end_pattern: str,
    post_hook: Callable[[str], str] = lambda x: x,
) -> str:
    """Transliterate the text appearing between two patterns

    Only the text appearing between patterns `start_pattern` and `end_pattern`
    is transliterated.
    `start_pattern` and `end_pattern` can appear multiple times in the full
    text, and for every occurrence, the text between them is transliterated.

    `from_scheme` and `to_scheme` should be compatible with scheme names from
    `indic-transliteration`

    Parameters
    ----------
    text : str
        Full text
    from_scheme : str
        Input transliteration scheme
    to_scheme : str
        Output transliteration scheme
    start_pattern : str
        Literal text of the start tag (escaped before matching)
    end_pattern : str
        Literal text of the end tag (escaped before matching)
    post_hook : Callable[[str], str], optional
        Function to be applied on the text within tags after transliteration
        The default is `lambda x: x`.

    Returns
    -------
    str
        Text after replacements
    """
    if from_scheme == to_scheme:
        return text

    def transliterate_match(matchobj):
        # Transliterate only the captured text; the tags are re-emitted as-is
        target = matchobj.group(1)
        replacement = transliterate(target, from_scheme, to_scheme)
        replacement = post_hook(replacement)
        return f"{start_pattern}{replacement}{end_pattern}"

    # Non-greedy match, so each start tag pairs with the nearest end tag
    pattern = "%s(.*?)%s" % (re.escape(start_pattern), re.escape(end_pattern))
    return re.sub(pattern, transliterate_match, text, flags=re.DOTALL)
We can provide the start and end patterns as \iast{ and } respectively, to transliterate the text enclosed in these tags.
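To see the tag matching in isolation, here is a minimal sketch that uses only the standard library. The helper name `replace_between` is hypothetical, and `str.upper` stands in for the real transliteration so that the example runs without indic-transliteration installed.

```python
import re
from typing import Callable


def replace_between(
    text: str, start: str, end: str, transform: Callable[[str], str]
) -> str:
    """Apply `transform` to every shortest span found between `start` and `end`."""
    # re.escape makes the tags literal; (.*?) is non-greedy, so a match
    # ends at the first `end` that follows a `start`.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    return re.sub(
        pattern,
        lambda m: f"{start}{transform(m.group(1))}{end}",
        text,
        flags=re.DOTALL,
    )


sample = r"\iast{abc} stays \skt{as is} \iast{def}"
print(replace_between(sample, r"\iast{", "}", str.upper))
# prints: \iast{ABC} stays \skt{as is} \iast{DEF}
```

Because the match stops at the first closing brace, text with nested braces inside a tag (for example another LaTeX command with an argument) is only partially processed, so with this approach the tagged text is best kept free of nested commands.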
Using this function, we can write a generic function to work with any transliteration scheme.
def latex_transliteration(
    input_text: str, from_scheme: str, to_scheme: str
) -> str:
    """Transliterate parts of the LaTeX input enclosed in scheme tags

    A scheme tag is of the form `\\to_scheme_lowercase{}` and is used
    when the desired output is in `to_scheme`.

    i.e.,
    - Text for the IAST scheme is enclosed in \\iast{} tags
    - Text for the HK scheme is enclosed in \\hk{} tags
    - ...

    Parameters
    ----------
    input_text : str
        Input text
    from_scheme : str
        Transliteration scheme of the text written within the input tags
    to_scheme : str
        Transliteration scheme to which the text within tags should be
        transliterated

    Returns
    -------
    str
        Text after replacement of text within the scheme tags
    """
    # "{{" emits a literal "{" so that the start tag reads e.g. "\iast{"
    start_tag_pattern = f"\\{to_scheme.lower()}{{"
    end_tag_pattern = "}"
    return transliterate_between(
        input_text,
        from_scheme=from_scheme,
        to_scheme=to_scheme,
        start_pattern=start_tag_pattern,
        end_pattern=end_tag_pattern
    )
Note: The names of schemes (and therefore the corresponding LaTeX commands) have to conform to the names of schemes used by the indic-transliteration package.
IAST is a case-insensitive transliteration scheme, so we may want specific capitalization for certain words (e.g. proper nouns). We can use the post_hook argument to provide a function for this. Using it, we can create a function to handle the three variants of IAST mentioned previously, namely, \iast{} (lower), \Iast{} (title) and \IAST{} (upper).
from indic_transliteration import sanscript


def devanagari_to_iast(input_text: str) -> str:
    """Transliterate parts of the input enclosed in
    \\iast{}, \\Iast{} or \\IAST{} tags from Devanagari to IAST

    Text in \\Iast{} tags also undergoes a `.title()` post-hook.
    Text in \\IAST{} tags also undergoes a `.upper()` post-hook.

    Parameters
    ----------
    input_text : str
        Input text

    Returns
    -------
    str
        Text after replacement of text within the IAST tags
    """
    # Lower case: plain transliteration
    intermediate_text = transliterate_between(
        input_text,
        from_scheme=sanscript.DEVANAGARI,
        to_scheme=sanscript.IAST,
        start_pattern="\\iast{",
        end_pattern="}"
    )
    # Title case
    intermediate_text = transliterate_between(
        intermediate_text,
        from_scheme=sanscript.DEVANAGARI,
        to_scheme=sanscript.IAST,
        start_pattern="\\Iast{",
        end_pattern="}",
        post_hook=lambda x: x.title()
    )
    # Upper case
    final_text = transliterate_between(
        intermediate_text,
        from_scheme=sanscript.DEVANAGARI,
        to_scheme=sanscript.IAST,
        start_pattern="\\IAST{",
        end_pattern="}",
        post_hook=lambda x: x.upper()
    )
    return final_text
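A custom post_hook can also guard the title-casing step. The sketch below is a defensive variant, not something the original script does: the hypothetical hook `title_case_iast` composes each letter with its diacritic (Unicode NFC) before calling `.title()`, since `.title()` can capitalize the letter that follows a separate combining mark. Whether the transliterated text ever contains decomposed characters is an assumption made here for illustration.

```python
import unicodedata


def title_case_iast(text: str) -> str:
    """Title-case IAST text after composing letters with their diacritics."""
    # NFC turns "a" + combining macron into the single character "ā",
    # so str.title() sees each word as one unbroken run of letters.
    return unicodedata.normalize("NFC", text).title()


# Decomposed input: every diacritic is a separate combining character
decomposed = unicodedata.normalize("NFD", "kaśca vīryavān")
print(title_case_iast(decomposed))
# prints: Kaśca Vīryavān
```

Such a function can be passed as `post_hook=title_case_iast` in place of `lambda x: x.title()`.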
Finally, there are other utility functions to remove comments and clean up excess whitespace.
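The utility functions themselves are not shown here, so the following is only a guess at what such helpers might look like. The names `remove_comments` and `clean_whitespace` are hypothetical, and the comment handling is deliberately simple: it keeps an escaped \% but does not account for verbatim environments or for a % that directly follows a \\ line break.

```python
import re


def remove_comments(text: str) -> str:
    """Drop LaTeX comments: an unescaped % and everything after it on the line."""
    # The lookbehind (?<!\\) leaves escaped percent signs (\%) untouched.
    return re.sub(r"(?<!\\)%.*", "", text)


def clean_whitespace(text: str) -> str:
    """Strip trailing spaces and collapse runs of blank lines into one."""
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    return re.sub(r"\n{3,}", "\n\n", text)


sample = "a 50\\% gain % note\n% full line\n\n\n\nb  \n"
print(clean_whitespace(remove_comments(sample)))
```

Removing a comment leaves the line break in place, which matters in LaTeX when a trailing % was used to suppress the space a line break would otherwise produce.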
Extras
Additionally, we may want some more structure in our setup, such as,
- Separation of content into multiple files
\input{sections/section_devanagari.tex}
\input{sections/section_iast_lower.tex}
\input{sections/section_iast_title.tex}
\input{sections/section_iast_upper.tex}
- Bibliography
\bibliographystyle{acm}
\bibliography{papers}
Final LaTeX Preparation
We may have used the scheme tags across multiple sections. One option is to apply the transliteration script on every section file, to create a new set of section files and use those to compile the final LaTeX file.
A simpler solution is available in the form of latexpand, which resolves the \input{} commands by including the referenced content and creates a single consolidated LaTeX file.
latexpand main.tex > single.tex
Now, we can run the Python script on this file to resolve the transliteration tags.
python3 finalize.py single.tex final.tex
Compilation
When working with BibTeX, we often need to compile multiple times to get the references rendered correctly in the PDF. Usually, this requires
xelatex final
bibtex final
xelatex final
xelatex final
Alternatively, we can use latexmk, which takes care of the tedious compilation routine and reduces our job to a single command,
latexmk -pdflatex='xelatex %O %S' -pdf -ps- -dvi- final.tex
Another benefit of using latexmk is that we can clean up the numerous files generated by the LaTeX engine with a one-liner as well,
latexmk -c
Makefile
Finally, we can place all of the console commands together in a Makefile.
all: .all

.all: main.tex sections/*.tex papers.bib
	latexpand main.tex > single.tex
	python3 finalize.py single.tex final.tex
	latexmk -pdflatex='xelatex %O %S' -pdf -ps- -dvi- final.tex

clear:
	latexmk -C
	rm single.tex
	rm final.tex

clean:
	latexmk -c
Thus, we can now focus on writing content in the .tex files and, once we are done, simply use the command,
make
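Where make is not available, the same three steps can be driven from Python. This is only a sketch: the function names `pipeline_commands` and `run_pipeline` are hypothetical, the file names are the ones used above, and latexpand, finalize.py and latexmk are assumed to be reachable from the working directory.

```python
import subprocess
from typing import List


def pipeline_commands(main: str, single: str, final: str) -> List[List[str]]:
    """Return the argument lists for the expand, transliterate and compile steps."""
    return [
        ["latexpand", main],  # output is redirected to `single` by the caller
        ["python3", "finalize.py", single, final],
        ["latexmk", "-pdflatex=xelatex %O %S", "-pdf", "-ps-", "-dvi-", final],
    ]


def run_pipeline(
    main: str = "main.tex", single: str = "single.tex", final: str = "final.tex"
) -> None:
    """Run the three steps in order, stopping at the first failure."""
    expand, finalize, compile_pdf = pipeline_commands(main, single, final)
    # Equivalent of `latexpand main.tex > single.tex`
    with open(single, "w", encoding="utf-8") as out:
        subprocess.run(expand, stdout=out, check=True)
    subprocess.run(finalize, check=True)
    subprocess.run(compile_pdf, check=True)
```

Calling `run_pipeline()` from the project directory mirrors the default make target, without the dependency tracking that make provides.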
Requirements
We have made use of a number of external tools, and these need to be set up before using the solution described here.
Minimal Requirements
The minimal example mentioned earlier requires only three things,
- XeLaTeX (Unicode support) (included in TeX Live)
- Python3
- indic-transliteration
Extra Requirements
The extras have some more dependencies.
- BibTeX (optional) (bibliography support)
- latexpand (optional) (resolve \input{})
- latexmk (optional) (simpler TeX compilation)
Devanagari Fonts
Nowadays, there are several good Devanagari fonts available. Google Fonts also provides a wide variety of Devanagari fonts.
Two of my personal favourites are,
Code
The source code for the entire setup is available at hrishikeshrt/devanagari-transliteration-latex.
Original Link: https://dev.to/hrishikeshrt/devanagari-transliteration-pipeline-for-latex-1fid