A Generic Character Aligned Machine Transliteration System for Indic Languages

Author: Nikhil Londhe
Publisher:
ISBN:
Category :
Languages : en
Pages : 32

Book Description
A common problem in machine translation is the handling of Out of Vocabulary (OOV) terms. These are usually names of places, people or technical terms that cannot be easily translated from one language to another, or that become obfuscated when translated. Such terms end up being transliterated, i.e., converted syllable by syllable (or syllable group by syllable group) from one language to another while trying to preserve the pronunciation. Although a large number of transliteration systems have been built over the years, they suffer from several problems. Firstly, any machine learning system is only as good as the dataset used to train it. For resource-poor languages, therefore, such systems either do not exist or perform extremely poorly. Secondly, most transliteration systems are overfitted to the source language. However, with the proliferation of the Internet and social media, language mixing is fairly common, and most such systems fail when words derived from other languages are introduced. In this research, we aim to build transliteration systems that better model the language under consideration and incorporate additional features that can offset the overfitting problem described above. We also explore how inherent language similarities can be used to bootstrap transliteration systems for resource-poor languages, and how classical techniques in machine translation and information retrieval can be adapted to the problem at hand to build better and more robust systems.

Machine Translation and Transliteration involving Related, Low-resource Languages

Author: Anoop Kunchukuttan
Publisher: CRC Press
ISBN: 100042166X
Category : Computers
Languages : en
Pages : 220

Book Description
Machine Translation and Transliteration involving Related, Low-resource Languages discusses an important aspect of natural language processing that has received less attention: translation and transliteration involving related languages in a low-resource setting. This is a very relevant real-world scenario for people living in neighbouring states/provinces/countries who speak similar languages and need to communicate with each other, but for whom the training data needed to build supporting MT systems is limited. The book discusses different characteristics of related languages with rich examples and draws connections between two problems: translation for related languages and transliteration. It shows how linguistic similarities can be utilized to learn MT systems for related languages with limited data. It comprehensively discusses the use of subword-level models and multilinguality to utilize these linguistic similarities. The second part of the book explores methods for machine transliteration involving related languages based on multilingual and unsupervised approaches. Through extensive experiments over a wide variety of languages, the efficacy of these methods is established.
Features: novel methods for machine translation and transliteration between related languages, supported with experiments on a wide variety of languages; an overview of past literature on machine translation for related languages; and a case study about machine translation for related languages between 10 major languages from India, which is one of the most linguistically diverse countries in the world.
The book presents important concepts and methods for machine translation involving related languages. In general, it serves as a good reference to NLP for related languages. It is intended for students, researchers and professionals interested in Machine Translation, Translation Studies, Multilingual Computing Machine and Natural Language Processing. It can be used as reference reading for courses in NLP and machine translation.
Anoop Kunchukuttan is a Senior Applied Researcher at Microsoft India. His research spans various areas in multilingual and low-resource NLP. Pushpak Bhattacharyya is a Professor at the Department of Computer Science, IIT Bombay. His research areas are Natural Language Processing, Machine Learning and AI (NLP-ML-AI). Prof. Bhattacharyya has published more than 350 research papers in various areas of NLP.

Machine Translation and Transliteration Involving Related and Low-resource Languages

Author: Anoop Kunchukuttan
Publisher: Chapman & Hall/CRC
ISBN: 9781003096771
Category : Computers
Languages : en
Pages : 0

Book Description

Designing a General Framework for Text Alignment

Author: Niraj Aswani
Publisher:
ISBN:
Category :
Languages : en
Pages :

Book Description
Building machine translation systems for many South Asian languages (such as Hindi, Gujarati, etc.) using statistical methods is problematic. The primary reason is that there is insufficient parallel data to learn accurate word alignment. Additionally, these languages are morphologically rich and have free word order. When it is difficult to rely purely on statistical methods due to insufficient data, research shows that better performance can be obtained by building hybrid systems that rely on language-specific resources, such as morphological analysers or dictionaries, as well as statistical methods. However, it is difficult to find such language-specific resources for many South Asian languages. Since languages such as Hindi, Gujarati, Urdu, Bengali, Punjabi and Marathi are all very similar in structure, and the main differences lie in the script and vocabulary used, we hypothesise that it is possible to develop resources for one of these languages and generalize the approach to allow rapid bootstrapping of similar resources for the other closely related languages, with minimal effort and similar accuracy. To verify this, we develop a few resources for the Hindi language, including a sentence alignment algorithm, a morphological analyser and a transliteration similarity component, and generalize the approach to allow rapid bootstrapping of similar resources for the Gujarati language. We show that the approach works on both Hindi and Gujarati and achieves results that are comparable to similar state-of-the-art (SOA) resources available for these languages. We also hypothesise that it is possible to develop a high-performance hybrid word alignment algorithm that relies on such language-specific resources. To verify this, we design, implement and evaluate a novel English-Hindi hybrid word alignment system that uses the Hindi-specific resources developed by us. We show not only that our word alignment system outperforms other SOA English-Hindi word alignment systems, but also how simple it is to adapt it to the English-Gujarati language pair.

Information Systems for Indian Languages

Author: Chandan Singh
Publisher: Springer Science & Business Media
ISBN: 3642194028
Category : Computers
Languages : en
Pages : 331

Book Description
This book constitutes the refereed proceedings of the International Conference on Information Systems for Indian Languages, ICISIL 2011, held in Patiala, India, in March 2011. The 63 revised papers presented were carefully reviewed and selected from 126 paper submissions (full papers as well as poster papers) and 25 demo submissions. The papers address all current aspects of localization, e-governance, Web content accessibility, search engine and information retrieval systems, online and offline OCR, handwriting recognition, machine translation and transliteration, and text-to-speech and speech recognition, all with a particular focus on Indic scripts and languages.

International Journal of Translation

Author:
Publisher:
ISBN:
Category : Translating and interpreting
Languages : en
Pages : 300

Book Description


Machine Translation Systems

Author: Jonathan Slocum
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description


Information Processing and Management

Author: Vinu V Das
Publisher: Springer
ISBN: 3642122140
Category : Computers
Languages : en
Pages : 693

Book Description
It is my pleasure to write the preface for Information Processing and Management. This book aims to bring together innovative results and new research trends in information processing, computer science and management engineering. If an information processing system is able to perform useful actions for an objective in a given domain, it is because the system knows something about that domain. The more knowledge it has, the more useful it can be to its users. Without that knowledge, the system itself is useless. In the information systems field, there is conceptual modeling for the activity that elicits and describes the general knowledge a particular information system needs to know. The main objective of conceptual modeling is to obtain that description, which is called a conceptual schema. Conceptual schemas are written in languages called conceptual modeling languages. Conceptual modeling is an important part of requirements engineering, the first and most important phase in the development of an information system.

Praci Bhasavijnan

Author:
Publisher:
ISBN:
Category : India
Languages : en
Pages : 96

Book Description


Neural Machine Translation

Author: Philipp Koehn
Publisher: Cambridge University Press
ISBN: 1108497322
Category : Computers
Languages : en
Pages : 409

Book Description
Learn how to build machine translation systems with deep learning from the ground up, from basic concepts to cutting-edge research.