Data Filtering Using Cross-Lingual Word Embeddings

Author: Christian Herold
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description

Cross-Lingual Word Embeddings

Author: Anders Søgaard
Publisher: Springer Nature
ISBN: 3031021711
Category : Computers
Languages : en
Pages : 120

Book Description
The majority of natural language processing (NLP) is English language processing, and while there is good language technology support for (standard varieties of) English, support for Albanian, Burmese, or Cebuano--and most other languages--remains limited. Being able to bridge this digital divide is important for scientific and democratic reasons but also represents an enormous growth potential. A key challenge for this to happen is learning to align basic meaning-bearing units of different languages. In this book, the authors survey and discuss recent and historical work on supervised and unsupervised learning of such alignments. Specifically, the book focuses on so-called cross-lingual word embeddings. The survey is intended to be systematic, using consistent notation and putting the available methods in a comparable form, making it easy to compare wildly different approaches. In so doing, the authors establish previously unreported relations between these methods and are able to present a fast-growing literature in a very compact way. Furthermore, the authors discuss how best to evaluate cross-lingual word embedding methods and survey the resources available for students and researchers interested in this topic.

Cross-Lingual Word Embeddings with Universal Concepts and Their Applications

Author: Pezhman Sheinidashtegol
Publisher:
ISBN:
Category : Electronic dissertations
Languages : en
Pages :

Book Description
Enormous amounts of data are generated in many languages every day due to our increasing global connectivity. This increases the demand for the ability to read and classify data regardless of language. Word embedding is a popular natural language processing (NLP) strategy that uses language modeling and feature learning to map words to vectors of real numbers. However, these models need a significant amount of annotated data for training. While the availability of labeled data is gradually increasing, most of it is only available in high-resource languages, such as English. Researchers proficient in different sets of languages seek to address new problems with multilingual NLP applications. In this dissertation, I present multiple approaches to generating cross-lingual word embeddings (CWE) using universal concepts (UC) among languages to address the limitations of existing methods. My work consists of three approaches to building multilingual/bilingual word embeddings. The first approach includes two steps: pre-processing and processing. In the pre-processing step, we build a bilingual corpus containing both languages' knowledge in the form of sentences for the most frequent words in English and their translated pairs in the target language. In this step, knowledge of the source language is shared with the target language, and vice versa, by swapping one word per sentence with its corresponding translation. In the second step, we use a monolingual embedding estimator to generate the CWE. The second approach generates multilingual word embeddings using UCs and consists of three parts. In part I, we introduce and build UCs from bilingual dictionaries using graph theory, defining words as nodes and translation pairs as edges. In part II, we explain the configuration used for word2vec to generate encoded word embeddings. Finally, part III decodes the generated embeddings using the UCs.
The final approach utilizes the supervised method of the MUSE project, but with the model trained on our UCs. Finally, we applied our last two proposed methods to several practical NLP applications: document classification, cross-lingual sentiment analysis, and code-switching sentiment analysis. Our proposed methods outperform the state-of-the-art MUSE method on the majority of applications.
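The graph-theoretic construction of universal concepts described above can be sketched in a few lines. This is an illustrative reconstruction, not the dissertation's code: it assumes a bilingual dictionary given as a list of translation pairs, and treats each connected component of the resulting word graph as one universal concept (the function name and the toy data are hypothetical).

```python
from collections import defaultdict

def universal_concepts(translation_pairs):
    """Group words into "universal concepts": the connected components of
    the graph whose nodes are words and whose edges are translation pairs
    taken from bilingual dictionaries."""
    adjacency = defaultdict(set)
    for source, target in translation_pairs:
        adjacency[source].add(target)
        adjacency[target].add(source)

    seen, concepts = set(), []
    for word in adjacency:
        if word in seen:
            continue
        # Depth-first search collects one connected component.
        component, stack = set(), [word]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        concepts.append(component)
    return concepts

# Toy English-Spanish and Spanish-French dictionary entries:
pairs = [("dog", "perro"), ("perro", "chien"), ("cat", "gato")]
concepts = universal_concepts(pairs)  # {dog, perro, chien} and {cat, gato}
```

Each component could then stand in for a single shared token when training word2vec, with the per-language embeddings recovered ("decoded") from it afterwards, as in part III above.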

Supervised Machine Learning for Text Analysis in R

Author: Emil Hvitfeldt
Publisher: CRC Press
ISBN: 1000461971
Category : Computers
Languages : en
Pages : 402

Book Description
Text data is important for many domains, from healthcare to marketing to the digital humanities, but specialized approaches are necessary to create features for machine learning from language. Supervised Machine Learning for Text Analysis in R explains how to preprocess text data for modeling, train models, and evaluate model performance using tools from the tidyverse and tidymodels ecosystem. Models like these can be used to make predictions for new observations, to understand what natural language features or characteristics contribute to differences in the output, and more. If you are already familiar with the basics of predictive modeling, use the comprehensive, detailed examples in this book to extend your skills to the domain of natural language processing. This book provides practical guidance and directly applicable knowledge for data scientists and analysts who want to integrate unstructured text data into their modeling pipelines. Learn how to use text data for both regression and classification tasks, and how to apply more straightforward algorithms like regularized regression or support vector machines as well as deep learning approaches. Natural language must be dramatically transformed to be ready for computation, so we explore typical text preprocessing and feature engineering steps like tokenization and word embeddings from the ground up. These steps influence model results in ways we can measure, both in terms of model metrics and other tangible consequences such as how fair or appropriate model results are.

Cross-lingual Word Embeddings for Low-resource and Morphologically-rich Languages

Author: Ali Hakimi Parizi
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
Despite recent advances in natural language processing, there is still a gap in state-of-the-art methods to address problems related to low-resource and morphologically-rich languages. These methods are data-hungry, and due to the scarcity of training data for low-resource and morphologically-rich languages, developing NLP tools for them is a challenging task. Approaches for forming cross-lingual embeddings and transferring knowledge from a rich- to a low-resource language have emerged to overcome the lack of training data. Although in recent years we have seen major improvements in cross-lingual methods, these methods still have some limitations that have not been addressed properly. An important problem is the out-of-vocabulary word (OOV) problem, i.e., words that occur in a document being processed, but that the model did not observe during training. The OOV problem is more significant in the case of low-resource languages, since there is relatively little training data available for them, and also in the case of morphologically-rich languages, since it is very likely that we do not observe a considerable number of their word forms in the training data. Approaches to learning sub-word embeddings have been proposed to address the OOV problem in monolingual models, but most prior work has not considered sub-word embeddings in cross-lingual models. The hypothesis of this thesis is that it is possible to leverage sub-word information to overcome the OOV problem in low-resource and morphologically-rich languages. This thesis presents a novel bilingual lexicon induction task to demonstrate the effectiveness of sub-word information in the cross-lingual space and how it can be employed to overcome the OOV problem. Moreover, this thesis presents a novel cross-lingual word representation method that incorporates sub-word information during the training process to learn a better cross-lingual shared space and also better represent OOVs in the shared space. 
This method is particularly suitable for low-resource scenarios and this claim is proven through a series of experiments on bilingual lexicon induction, monolingual word similarity, and a downstream task, document classification. More specifically, it is shown that this method is suitable for low-resource languages by conducting bilingual lexicon induction on twelve low-resource and morphologically-rich languages.
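As a rough illustration of how sub-word information can represent OOV words, the fastText-style sketch below composes an unseen word's vector from those of its character n-grams that were observed during training. This is a generic sketch of the idea, not the thesis's actual method; `subword_vectors`, the n-gram ranges, and the dimensionality are assumptions.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of a word, with boundary markers (as in fastText)."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

def embed_oov(word, subword_vectors, dim=300):
    """Approximate an out-of-vocabulary word's embedding by averaging the
    vectors of those of its character n-grams that were seen in training;
    fall back to a zero vector if none were."""
    known = [subword_vectors[g] for g in char_ngrams(word) if g in subword_vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)
```

Because the n-gram table is shared with in-vocabulary words, an OOV form of a morphologically-rich language can land near related word forms in the (cross-lingual) space even though it was never observed whole.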

Computational Linguistics

Author: Le-Minh Nguyen
Publisher: Springer Nature
ISBN: 9811561680
Category : Computers
Languages : en
Pages : 525

Book Description
This book constitutes the refereed proceedings of the 16th International Conference of the Pacific Association for Computational Linguistics, PACLING 2019, held in Hanoi, Vietnam, in October 2019. The 28 full papers and 14 short papers presented were carefully reviewed and selected from 70 submissions. The papers are organized in topical sections on text summarization; relation and word embedding; machine translation; text classification; web analysis; question answering and dialog analysis; speech and emotion analysis; parsing and segmentation; information extraction; and grammar error and plagiarism detection.

EMBEDDIA

Author: Marko Robnik Šikonja
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description


Computational and Corpus-Based Phraseology

Author: Gloria Corpas Pastor
Publisher: Springer Nature
ISBN: 3030301354
Category : Computers
Languages : en
Pages : 445

Book Description
This book constitutes the refereed proceedings of the Third International Conference on Computational and Corpus-Based Phraseology, Europhras 2019, held in Malaga, Spain, in September 2019. The 31 full papers presented in this book were carefully reviewed and selected from 116 submissions. The papers in this volume cover a number of topics including general corpus-based approaches to phraseology, phraseology in translation and cross-linguistic studies, phraseology in language teaching and learning, phraseology in specialized languages, phraseology in lexicography, cognitive approaches to phraseology, the computational treatment of multiword expressions, and the development, annotation, and exploitation of corpora for phraseological studies.

Explorations in Word Embeddings

Author: Zheng Zhang
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
Word embeddings are a standard component of modern natural language processing architectures. Every time there is a breakthrough in word embedding learning, the vast majority of natural language processing tasks, such as POS-tagging, named entity recognition (NER), question answering, and natural language inference, can benefit from it. This work addresses the question of how to improve the quality of monolingual word embeddings learned by prediction-based models and how to map contextual word embeddings generated by pretrained language representation models like ELMo or BERT across different languages. For monolingual word embedding learning, I take into account global, corpus-level information and generate a different noise distribution for negative sampling in word2vec. For this purpose I pre-compute word co-occurrence statistics with corpus2graph, an open-source NLP-application-oriented Python package that I developed: it efficiently generates a word co-occurrence network from a large corpus, and applies to it network algorithms such as random walks. For cross-lingual contextual word embedding mapping, I link contextual word embeddings to word sense embeddings. The improved anchor generation algorithm that I propose also expands the scope of word embedding mapping algorithms from context-independent to contextual word embeddings.
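The corpus-level noise distribution idea can be illustrated with a small sketch: instead of word2vec's raw unigram counts, each word's negative-sampling weight is derived from its total degree in a word co-occurrence network built with a sliding window. This is a simplified stand-in for the corpus2graph pipeline described above, not its actual implementation; the function and its parameters are hypothetical, with only the 0.75 smoothing exponent taken from standard word2vec practice.

```python
from collections import Counter

import numpy as np

def cooccurrence_noise_distribution(corpus, window=2, power=0.75):
    """Derive a negative-sampling noise distribution from corpus-level
    co-occurrence statistics rather than raw unigram frequencies: each
    word is weighted by its total degree in the word co-occurrence
    network, smoothed with the usual word2vec exponent."""
    degree = Counter()
    for sentence in corpus:
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            degree[word] += (hi - lo) - 1  # neighbours within the window
    vocab = sorted(degree)
    weights = np.array([degree[w] for w in vocab], dtype=float) ** power
    return vocab, weights / weights.sum()

vocab, noise_probs = cooccurrence_noise_distribution([["a", "b", "c"], ["a", "b"]])
```

The resulting probabilities would replace word2vec's default unigram^0.75 table when drawing negative samples; words that are well connected in the co-occurrence network are sampled more often.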

Artificial Intelligence in Data and Big Data Processing

Author: Ngoc Hoang Thanh Dang
Publisher: Springer Nature
ISBN: 3030976106
Category : Computers
Languages : en
Pages : 738

Book Description
The book presents studies related to artificial intelligence (AI) and its applications for processing and analyzing data and big data to create machines or software that can better understand business behavior, industry activities, and human health. The studies were presented at “The 2021 International Conference on Artificial Intelligence and Big Data in Digital Era” (ICABDE 2021), held in Ho Chi Minh City, Vietnam, during December 18-19, 2021. The studies point toward the famous technology slogan “Make everything smarter”: creating machines that can understand and communicate with humans, and that act like humans in aspects such as vision, communication, thinking, feeling, and acting. “A computer would deserve to be called intelligent if it could deceive a human into believing that it was human” —Alan Turing