-
Retrieval-Augmented Generation with Knowledge Graphs for Customer Service Question Answering
Authors:
Zhentao Xu,
Mark Jerome Cruz,
Matthew Guevara,
Tie Wang,
Manasi Deshpande,
Xiaofeng Wang,
Zheng Li
Abstract:
In customer service technical support, swiftly and accurately retrieving relevant past issues is critical for efficiently resolving customer inquiries. The conventional retrieval methods in retrieval-augmented generation (RAG) for large language models (LLMs) treat a large corpus of past issue tracking tickets as plain text, ignoring the crucial intra-issue structure and inter-issue relations, which limits performance. We introduce a novel customer service question-answering method that amalgamates RAG with a knowledge graph (KG). Our method constructs a KG from historical issues for use in retrieval, retaining the intra-issue structure and inter-issue relations. During the question-answering phase, our method parses consumer queries and retrieves related sub-graphs from the KG to generate answers. This integration of a KG not only improves retrieval accuracy by preserving customer service structure information but also enhances answering quality by mitigating the effects of text segmentation. Empirical assessments on our benchmark datasets, utilizing key retrieval (MRR, Recall@K, NDCG@K) and text generation (BLEU, ROUGE, METEOR) metrics, reveal that our method outperforms the baseline by 77.6% in MRR and by 0.32 in BLEU. Our method has been deployed within LinkedIn's customer service team for approximately six months and has reduced the median per-issue resolution time by 28.6%.
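The retrieval step this abstract describes can be sketched in a few lines, under loud assumptions: the ticket fields, link structure, and term-matching logic below are illustrative stand-ins, not LinkedIn's actual schema or query parser. The point is that each issue keeps its intra-issue structure and its inter-issue links, so a query pulls back a connected sub-graph rather than flat text chunks.

```python
# Hypothetical issue KG: each ticket keeps structured fields plus links
# to related tickets (all names and contents invented for illustration).
ISSUE_KG = {
    "TICKET-1": {"summary": "login fails with SSO", "component": "auth",
                 "resolution": "rotate the SAML certificate", "related": ["TICKET-2"]},
    "TICKET-2": {"summary": "SSO redirect loop", "component": "auth",
                 "resolution": "clear stale session cookies", "related": ["TICKET-1"]},
    "TICKET-3": {"summary": "export job times out", "component": "reporting",
                 "resolution": "raise the job timeout", "related": []},
}

def retrieve_subgraph(query: str, kg: dict, hops: int = 1) -> dict:
    """Match query terms against ticket summaries, then expand along
    inter-issue links to pull in the surrounding sub-graph."""
    terms = set(query.lower().split())
    seeds = [tid for tid, t in kg.items()
             if terms & set(t["summary"].lower().split())]
    frontier, seen = list(seeds), set(seeds)
    for _ in range(hops):
        frontier = [r for tid in frontier for r in kg[tid]["related"]
                    if r not in seen]
        seen.update(frontier)
    return {tid: kg[tid] for tid in seen}

sub = retrieve_subgraph("SSO login broken", ISSUE_KG)
```

An answer-generation step would then feed the retrieved sub-graph, structure intact, to the LLM instead of independently chunked ticket text.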
Submitted 6 May, 2024; v1 submitted 26 April, 2024;
originally announced April 2024.
-
Soil respiration signals in response to sustainable soil management practices enhance soil organic carbon stocks
Authors:
Mario Guevara
Abstract:
We developed a spatial-temporal, data-driven model of soil respiration at the global scale based on soil temperature, yearly soil moisture, and soil organic carbon (C) estimates. The model predicts soil respiration on an annual basis (1991-2018) with relatively high accuracy (NSE 0.69, CCC 0.82). We find lower soil respiration trends, higher soil respiration magnitudes, and higher soil organic C stocks across areas under sustainable soil management practices.
Submitted 19 June, 2024; v1 submitted 28 March, 2024;
originally announced April 2024.
-
Improving Clinical NLP Performance through Language Model-Generated Synthetic Clinical Data
Authors:
Shan Chen,
Jack Gallifant,
Marco Guevara,
Yanjun Gao,
Majid Afshar,
Timothy Miller,
Dmitriy Dligach,
Danielle S. Bitterman
Abstract:
Generative models have shown potential for producing data en masse. This study explores enhancing clinical natural language processing performance by using synthetic data generated by advanced language models. Promising results show feasible applications in such a high-stakes domain.
Submitted 28 March, 2024;
originally announced March 2024.
-
Segmentation-Based Parametric Painting
Authors:
Manuel Ladron de Guevara,
Matthew Fisher,
Aaron Hertzmann
Abstract:
We introduce a novel image-to-painting method that facilitates the creation of large-scale, high-fidelity paintings with human-like quality and stylistic variation. To process large images and gain control over the painting process, we introduce a segmentation-based painting process and a dynamic attention map approach inspired by human painting strategies, allowing optimization of brush strokes to proceed in batches over different image regions, thereby capturing both large-scale structure and fine details, while also allowing stylistic control over detail. Our optimized batch processing and patch-based loss framework enable efficient handling of large canvases, ensuring our painted outputs are both aesthetically compelling and functionally superior as compared to previous methods, as confirmed by rigorous evaluations. Code available at: https://github.com/manuelladron/semantic_based_painting.git
Submitted 23 November, 2023;
originally announced November 2023.
-
The impact of responding to patient messages with large language model assistance
Authors:
Shan Chen,
Marco Guevara,
Shalini Moningi,
Frank Hoebers,
Hesham Elhalawani,
Benjamin H. Kann,
Fallon E. Chipidza,
Jonathan Leeman,
Hugo J. W. L. Aerts,
Timothy Miller,
Guergana K. Savova,
Raymond H. Mak,
Maryam Lustberg,
Majid Afshar,
Danielle S. Bitterman
Abstract:
Documentation burden is a major contributor to clinician burnout, which is rising nationally and is an urgent threat to our ability to care for patients. Artificial intelligence (AI) chatbots, such as ChatGPT, could reduce clinician burden by assisting with documentation. Although many hospitals are actively integrating such systems into electronic medical record systems, the utility of AI chatbots and their impact on clinical decision-making have not been studied for this intended use. We are the first to examine the utility of large language models in assisting clinicians to draft responses to patient questions. In our two-stage cross-sectional study, 6 oncologists responded to 100 realistic synthetic cancer patient scenarios and portal messages developed to reflect common medical situations, first manually, then with AI assistance.
We find that AI-assisted responses were longer and less readable, but provided acceptable drafts without edits 58% of the time. AI assistance improved efficiency 77% of the time, with low risk of harm (82% of responses safe). However, 7.7% of unedited AI responses could cause severe harm. In 31% of cases, physicians thought AI drafts were human-written. AI assistance led to more patient education recommendations and fewer clinical actions than manual responses. These results show promise for AI to improve clinician efficiency and patient care by assisting with documentation, if used judiciously. Monitoring model outputs and human-AI interaction remains crucial for safe implementation.
Submitted 29 November, 2023; v1 submitted 26 October, 2023;
originally announced October 2023.
-
Large Language Models to Identify Social Determinants of Health in Electronic Health Records
Authors:
Marco Guevara,
Shan Chen,
Spencer Thomas,
Tafadzwa L. Chaunzwa,
Idalid Franco,
Benjamin Kann,
Shalini Moningi,
Jack Qian,
Madeleine Goldstein,
Susan Harper,
Hugo JWL Aerts,
Guergana K. Savova,
Raymond H. Mak,
Danielle S. Bitterman
Abstract:
Social determinants of health (SDoH) have an important impact on patient outcomes but are incompletely collected from electronic health records (EHRs). This study investigated the ability of large language models to extract SDoH from free text in EHRs, where they are most commonly documented, and explored the role of synthetic clinical text in improving the extraction of these scarcely documented, yet extremely valuable, clinical data. A total of 800 patient notes were annotated for SDoH categories, and several transformer-based models were evaluated. The study also experimented with synthetic data generation and assessed for algorithmic bias. Our best-performing models were fine-tuned Flan-T5 XL (macro-F1 0.71) for any SDoH, and Flan-T5 XXL (macro-F1 0.70). The benefit of augmenting fine-tuning with synthetic data varied across model architecture and size, with smaller Flan-T5 models (base and large) showing the greatest improvements in performance (delta F1 +0.12 to +0.23). Model performance was similar on the in-hospital system dataset but worse on the MIMIC-III dataset. Our best-performing fine-tuned models outperformed zero- and few-shot performance of ChatGPT-family models for both tasks. These fine-tuned models were less likely than ChatGPT to change their prediction when race/ethnicity and gender descriptors were added to the text, suggesting less algorithmic bias (p<0.05). At the patient level, our models identified 93.8% of patients with adverse SDoH, while ICD-10 codes captured 2.0%. Our methods effectively extracted SDoH information from clinic notes, outperforming GPT zero- and few-shot settings. These models could enhance real-world evidence on SDoH and aid in identifying patients needing social support.
Submitted 5 March, 2024; v1 submitted 11 August, 2023;
originally announced August 2023.
-
Natural language processing to automatically extract the presence and severity of esophagitis in notes of patients undergoing radiotherapy
Authors:
Shan Chen,
Marco Guevara,
Nicolas Ramirez,
Arpi Murray,
Jeremy L. Warner,
Hugo JWL Aerts,
Timothy A. Miller,
Guergana K. Savova,
Raymond H. Mak,
Danielle S. Bitterman
Abstract:
Radiotherapy (RT) toxicities can impair survival and quality-of-life, yet remain under-studied. Real-world evidence holds potential to improve our understanding of toxicities, but toxicity information is often only in clinical notes. We developed natural language processing (NLP) models to identify the presence and severity of esophagitis from notes of patients treated with thoracic RT. We fine-tuned statistical and pre-trained BERT-based models for three esophagitis classification tasks: Task 1) presence of esophagitis, Task 2) severe esophagitis or not, and Task 3) no esophagitis vs. grade 1 vs. grade 2-3. Transferability was tested on 345 notes from patients with esophageal cancer undergoing RT.
Fine-tuning PubmedBERT yielded the best performance. The best macro-F1 was 0.92, 0.82, and 0.74 for Task 1, 2, and 3, respectively. Selecting the most informative note sections during fine-tuning improved macro-F1 by over 2% for all tasks. Silver-labeled data improved the macro-F1 by over 3% across all tasks. For the esophageal cancer notes, the best macro-F1 was 0.73, 0.74, and 0.65 for Task 1, 2, and 3, respectively, without additional fine-tuning.
To our knowledge, this is the first effort to automatically extract esophagitis toxicity severity according to CTCAE guidelines from clinic notes. The promising performance provides proof-of-concept for NLP-based automated detailed toxicity monitoring in expanded domains.
Submitted 23 March, 2023;
originally announced March 2023.
-
A General Purpose Transpiler for Fully Homomorphic Encryption
Authors:
Shruthi Gorantala,
Rob Springer,
Sean Purser-Haskell,
William Lam,
Royce Wilson,
Asra Ali,
Eric P. Astor,
Itai Zukerman,
Sam Ruth,
Christoph Dibak,
Phillipp Schoppmann,
Sasha Kulankhina,
Alain Forget,
David Marn,
Cameron Tew,
Rafael Misoczki,
Bernat Guillen,
Xinyu Ye,
Dennis Kraft,
Damien Desfontaines,
Aishe Krishnamurthy,
Miguel Guevara,
Irippuge Milinda Perera,
Yurii Sushko,
Bryant Gipson
Abstract:
Fully homomorphic encryption (FHE) is an encryption scheme which enables computation on encrypted data without revealing the underlying data. While there have been many advances in the field of FHE, developing programs using FHE still requires expertise in cryptography. In this white paper, we present a fully homomorphic encryption transpiler that allows developers to convert high-level code (e.g., C++) that works on unencrypted data into high-level code that operates on encrypted data. Thus, our transpiler makes transformations possible on encrypted data.
Our transpiler builds on Google's open-source XLS SDK (https://github.com/google/xls) and uses an off-the-shelf FHE library, TFHE (https://tfhe.github.io/tfhe/), to perform low-level FHE operations. The transpiler design is modular, which means the underlying FHE library as well as the high-level input and output languages can vary. This modularity will help accelerate FHE research by providing an easy way to compare arbitrary programs in different FHE schemes side-by-side. We hope this lays the groundwork for eventual easy adoption of FHE by software developers. As a proof-of-concept, we are releasing an experimental transpiler (https://github.com/google/fully-homomorphic-encryption/tree/main/transpiler) as open-source software.
Submitted 15 June, 2021;
originally announced June 2021.
-
MixerGAN: An MLP-Based Architecture for Unpaired Image-to-Image Translation
Authors:
George Cazenavette,
Manuel Ladron De Guevara
Abstract:
While attention-based transformer networks achieve unparalleled success in nearly all language tasks, the large number of tokens (pixels) found in images coupled with the quadratic activation memory usage makes them prohibitive for problems in computer vision. As such, while language-to-language translation has been revolutionized by the transformer model, convolutional networks remain the de facto solution for image-to-image translation. The recently proposed MLP-Mixer architecture alleviates some of the computational issues associated with attention-based networks while still retaining the long-range connections that make transformer models desirable. Leveraging this memory-efficient alternative to self-attention, we propose a new exploratory model in unpaired image-to-image translation called MixerGAN: a simpler MLP-based architecture that considers long-distance relationships between pixels without the need for expensive attention mechanisms. Quantitative and qualitative analysis shows that MixerGAN achieves competitive results when compared to prior convolutional-based methods.
Submitted 19 August, 2021; v1 submitted 28 May, 2021;
originally announced May 2021.
-
Improving Maritime Traffic Emission Estimations on Missing Data with CRBMs
Authors:
Alberto Gutierrez-Torre,
Josep Ll. Berral,
David Buchaca,
Marc Guevara,
Albert Soret,
David Carrera
Abstract:
Maritime traffic emissions are a major concern to governments as they heavily impact the air quality in coastal cities. Ships use the Automatic Identification System (AIS) to continuously report position and speed among other features, so these data are suitable for estimating emissions when combined with engine data. However, important ship features are often inaccurate or missing. State-of-the-art complex systems, like CALIOPE at the Barcelona Supercomputing Center, are used to model air quality. These systems can benefit from AIS-based emission models as they are very precise in positioning the pollution. Unfortunately, these models are sensitive to missing or corrupted data, and therefore they need data curation techniques to significantly improve the estimation accuracy. In this work, we propose a methodology for treating ship data using Conditional Restricted Boltzmann Machines (CRBMs) plus machine learning methods to improve the quality of data passed to emission models. Results show that we can improve on the default methods proposed to cover missing data. We observed that with our method the models boosted their accuracy to detect otherwise undetectable emissions. In particular, we used a real dataset of AIS data, provided by the Spanish Port Authority, to estimate that, thanks to our method, the model was able to detect 45% of additional emissions, representing 152 tonnes of pollutants per week in Barcelona. We also propose new features that may enhance emission modeling.
Submitted 10 September, 2020; v1 submitted 7 September, 2020;
originally announced September 2020.
-
Multimodal Word Sense Disambiguation in Creative Practice
Authors:
Manuel Ladron de Guevara,
Christopher George,
Akshat Gupta,
Daragh Byrne,
Ramesh Krishnamurti
Abstract:
Language is ambiguous; many terms and expressions can convey the same idea. This is especially true in creative practice, where ideas and design intents are highly subjective. We present a dataset, Ambiguous Descriptions of Art Images (ADARI), of contemporary creative works, which aims to provide a foundational resource for subjective image description and multimodal word disambiguation in the context of creative practice. The dataset contains a total of 240k images labeled with 260k descriptive sentences. It is additionally organized into sub-domains of architecture, art, design, fashion, furniture, product design and technology. In subjective image description, labels are not deterministic: for example, the ambiguous label dynamic might correspond to hundreds of different images. To understand this complexity, we analyze the ambiguity and relevance of text with respect to images using the state-of-the-art pre-trained BERT model for sentence classification. We provide a baseline for multi-label classification tasks and demonstrate the potential of multimodal approaches for understanding ambiguity in design intentions. We hope that the ADARI dataset and baselines constitute a first step toward subjective label classification.
Submitted 17 January, 2021; v1 submitted 15 July, 2020;
originally announced July 2020.
-
Artistic Style in Robotic Painting: A Machine Learning Approach to Learning Brushstroke from Human Artists
Authors:
Ardavan Bidgoli,
Manuel Ladron De Guevara,
Cinnie Hsiung,
Jean Oh,
Eunsu Kang
Abstract:
Robotic painting has been a subject of interest among both artists and roboticists since the 1970s. Researchers and interdisciplinary artists have employed various painting techniques and human-robot collaboration models to create visual media on canvas. One of the challenges of robotic painting is to apply a desired artistic style to the painting. Style transfer techniques with machine learning models have helped us address this challenge for the visual style of a specific painting. However, other manual elements of style, i.e., the painting techniques and brushstrokes of an artist, have not been fully addressed. We propose a method to integrate an artistic style into the brushstrokes and the painting process through collaboration with a human artist. In this paper, we describe our approach to 1) collect brushstrokes and hand-brush motion samples from an artist, 2) train a generative model to generate brushstrokes that pertain to the artist's style, and 3) fine-tune a stroke-based rendering model to work with our robotic painting setup. We will report on the integration of these three steps in a separate publication. In a preliminary study, 71% of human evaluators found that our reconstructed brushstrokes pertain to the characteristics of the artist's style. Moreover, 58% of participants could not distinguish a painting made by our method from a visually similar painting created by a human artist.
Submitted 28 July, 2020; v1 submitted 7 July, 2020;
originally announced July 2020.
-
Google COVID-19 Community Mobility Reports: Anonymization Process Description (version 1.1)
Authors:
Ahmet Aktay,
Shailesh Bavadekar,
Gwen Cossoul,
John Davis,
Damien Desfontaines,
Alex Fabrikant,
Evgeniy Gabrilovich,
Krishna Gadepalli,
Bryant Gipson,
Miguel Guevara,
Chaitanya Kamath,
Mansi Kansal,
Ali Lange,
Chinmoy Mandayam,
Andrew Oplinger,
Christopher Pluntke,
Thomas Roessler,
Arran Schlosberg,
Tomer Shekel,
Swapnil Vispute,
Mia Vu,
Gregory Wellenius,
Brian Williams,
Royce J Wilson
Abstract:
This document describes the aggregation and anonymization process applied to the initial version of Google COVID-19 Community Mobility Reports (published at http://google.com/covid19/mobility on April 2, 2020), a publicly available resource intended to help public health authorities understand what has changed in response to work-from-home, shelter-in-place, and other recommended policies aimed at flattening the curve of the COVID-19 pandemic. Our anonymization process is designed to ensure that no personal data, including an individual's location, movement, or contacts, can be derived from the resulting metrics.
The high-level description of the procedure is as follows: we first generate a set of anonymized metrics from the data of Google users who opted in to Location History. Then, we compute percentage changes of these metrics from a baseline based on the historical part of the anonymized metrics. We then discard a subset which does not meet our bar for statistical reliability, and release the rest publicly in a format that compares the result to the private baseline.
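The three-stage procedure above (noisy metrics, percentage change against a baseline, reliability filter) can be illustrated with a toy sketch. Everything here is an assumption for illustration: the Laplace noise, the parameter names, and the reliability threshold are stand-ins, not Google's published mechanism or values.

```python
import random
from math import log

def anonymized_percent_changes(daily_counts, baseline, epsilon=1.0,
                               sensitivity=1.0, min_baseline=100):
    """Illustrative pipeline: add Laplace noise to each raw visit count,
    express it as a percent change from the (also noisy) baseline, and
    discard metrics whose baseline is too small to be reliable."""
    random.seed(0)                     # deterministic for the example
    scale = sensitivity / epsilon

    def noisy(x):
        # Sample Laplace(0, scale) via the inverse CDF and add it to x.
        u = random.random() - 0.5
        sign = 1 if u >= 0 else -1
        return x - scale * sign * log(1 - 2 * abs(u))

    released = {}
    for place, count in daily_counts.items():
        base = noisy(baseline[place])
        if base < min_baseline:        # fails the reliability bar: discard
            continue
        released[place] = round(100 * (noisy(count) - base) / base, 1)
    return released

changes = anonymized_percent_changes(
    {"parks": 40, "transit": 300, "tiny_cafe": 3},
    {"parks": 200, "transit": 500, "tiny_cafe": 5},
)
```

Only the percentage changes leave the pipeline; the raw and baseline counts stay private, and the small-baseline category is suppressed entirely.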
Submitted 3 November, 2020; v1 submitted 8 April, 2020;
originally announced April 2020.
-
A bibliometric analysis of research based on the Roy Adaptation Model: a contribution to Nursing
Authors:
Paulina Hurtado-Arenas,
Miguel R. Guevara
Abstract:
Objective. To perform a modern bibliometric analysis of the research based on the Roy Adaptation Model, a founding nursing model proposed by Sister Callista Roy in the 1970s. Method. A descriptive and longitudinal study. We used information from the two dominant scientific databases, Web of Science and SCOPUS. We obtained 137 publications from the Core Collection of WoS and 338 publications from SCOPUS. We conducted our analysis using the software Bibliometrix, an R package specialized in bibliometric analyses from the perspective of descriptive statistics and network analysis, including co-citation, co-keyword occurrence, and collaboration networks. Results. Our quantitative results show the main actors in research based on the model and the founding literature on which this research was built. We analyze the main keywords and how they are linked. Furthermore, we present the most prolific authors, both by number of publications and by centrality in the co-author network. We present the most central institutions in the global collaboration network. Conclusions. We highlight the relevance of this theoretical model in nursing and detail its evolution. The United States is the dominant country in production of documents on the topic, and the University of Massachusetts Boston and Boston College are the most influential institutions. The collaboration network also reveals clusters in Mexico, Turkey, and Spain. Our findings are useful for acquiring a general vision of the field.
Submitted 29 March, 2020;
originally announced March 2020.
-
SOMOSPIE: A modular SOil MOisture SPatial Inference Engine based on data driven decisions
Authors:
Danny Rorabaugh,
Mario Guevara,
Ricardo Llamas,
Joy Kitson,
Rodrigo Vargas,
Michela Taufer
Abstract:
The current availability of soil moisture data over large areas comes from satellite remote sensing technologies (i.e., radar-based systems), but these data have coarse resolution and often exhibit large spatial information gaps. Where data are too coarse or sparse for a given need (e.g., precision agriculture), one can leverage machine-learning techniques coupled with other sources of environmental information (e.g., topography) to generate gap-free information and at a finer spatial resolution (i.e., increased granularity). To this end, we develop a spatial inference engine consisting of modular stages for processing spatial environmental data, generating predictions with machine-learning techniques, and analyzing these predictions. We demonstrate the functionality of this approach and the effects of data processing choices via multiple prediction maps over a United States ecological region with a highly diverse soil moisture profile (i.e., the Middle Atlantic Coastal Plains). The relevance of our work derives from a pressing need to improve the spatial representation of soil moisture for applications in environmental sciences (e.g., ecological niche modeling, carbon monitoring systems, and other Earth system models) and precision agriculture (e.g., optimizing irrigation practices and other land management decisions).
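The gap-filling idea, predicting soil moisture from environmental covariates where satellite pixels are missing or too coarse, can be shown with a minimal k-nearest-neighbours sketch. SOMOSPIE itself supports more elaborate learners, and the covariates and values below are invented for illustration.

```python
# Sparse "satellite" samples: (elevation m, slope deg) -> volumetric moisture.
# Low, flat terrain holds more moisture in this toy data; the model fills a
# fine-grid cell that has no satellite value by averaging its k nearest
# training samples in covariate space.

def knn_predict(train, query, k=2):
    """train: list of ((elevation, slope), moisture); query: (elevation, slope)."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda s: dist(s[0], query))[:k]
    return sum(m for _, m in nearest) / k

samples = [((10.0, 1.0), 0.32), ((12.0, 2.0), 0.30),
           ((200.0, 15.0), 0.12), ((210.0, 18.0), 0.10)]

# Predict moisture for an ungauged fine-grid cell on low, flat terrain.
pred = knn_predict(samples, (11.0, 1.5))
```

Running the same predictor over every cell of a fine terrain grid yields a gap-free, higher-granularity moisture map, which is the role the inference engine's prediction stage plays.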
Submitted 20 May, 2019; v1 submitted 16 April, 2019;
originally announced April 2019.
-
The Research Space: using the career paths of scholars to predict the evolution of the research output of individuals, institutions, and nations
Authors:
Miguel R. Guevara,
Dominik Hartmann,
Manuel Aristarán,
Marcelo Mendoza,
César A. Hidalgo
Abstract:
In recent years scholars have built maps of science by connecting the academic fields that cite each other, are cited together, or that cite a similar literature. But since scholars cannot always publish in the fields they cite, or that cite them, these science maps are only rough proxies for the potential of a scholar, organization, or country to enter a new academic field. Here we use a large dataset of scholarly publications disambiguated at the individual level to create a map of science (or research space) where links connect pairs of fields based on the probability that an individual has published in both of them. We find that the research space is a significantly more accurate predictor of the fields that individuals and organizations will enter in the future than citation-based science maps. At the country level, however, the research space and citation-based science maps are equally accurate. These findings show that data on career trajectories (the set of fields in which individuals have previously published) provide more accurate predictors of future research output for more focalized units, such as individuals or organizations, than citation-based science maps.
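The link definition, the probability that an individual has published in a pair of fields, can be sketched on toy career data. The minimum-of-conditional-probabilities form below is one plausible normalization (it keeps large fields from dominating the links), not necessarily the exact function used in the paper; the careers are invented.

```python
from itertools import combinations

# Toy disambiguated career paths: author -> set of fields published in.
careers = {
    "alice": {"ml", "statistics"},
    "bob":   {"ml", "statistics", "optimization"},
    "carol": {"ml", "optimization"},
    "dan":   {"history"},
}

def link_weight(f1, f2, careers):
    """min(P(f2 | f1), P(f1 | f2)) over authors, estimated by counting."""
    n1 = sum(f1 in fs for fs in careers.values())
    n2 = sum(f2 in fs for fs in careers.values())
    both = sum(f1 in fs and f2 in fs for fs in careers.values())
    return min(both / n1, both / n2) if n1 and n2 else 0.0

fields = {f for fs in careers.values() for f in fs}
space = {tuple(sorted(p)): link_weight(*p, careers)
         for p in combinations(fields, 2)}
```

Fields that share many careers end up strongly linked, while isolated fields (here, history) get zero-weight links; prediction then amounts to asking which strongly linked neighbours of a unit's current fields it is likely to enter next.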
Submitted 14 April, 2016; v1 submitted 26 February, 2016;
originally announced February 2016.
-
Revealing Comparative Advantages in the Backbone of Science
Authors:
Miguel Guevara,
Marcelo Mendoza
Abstract:
Mapping science across countries is a challenging task in the field of scientometrics. A number of efforts to cope with this task have been discussed in the state of the art, addressing the challenge by processing collections of scientific digital libraries and visualizing author-based measures (for instance, the h-index) or document-based measures (for instance, the average number of citations per document). A major drawback of these approaches is the presence of bias: the bigger the country, the higher the measure value. We explore the use of an econometric index to tackle this limitation, known as the Revealed Comparative Advantage (RCA) measure. Using RCA, the diversity and ubiquity of each field of knowledge is mapped across countries. Then, an RCA-based proximity function is explored to visualize citation and h-index ubiquity. Science maps relating 27 knowledge areas and 237 countries are introduced using data crawled from Scimago ranging from 1996 to 2011. Our results show that the proposal is feasible and can be extended to elaborate a global characterization of scientific production.
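The RCA index mentioned above has a standard form: a country's within-country share of a field, divided by that field's share of world output. RCA > 1 means the country publishes relatively more in the field than the world average, which removes the size bias the abstract discusses. A minimal sketch on an invented country-by-field publication matrix:

```python
# Toy publication counts: country -> field -> number of documents.
pubs = {
    "CL": {"astronomy": 80,  "medicine": 20},
    "US": {"astronomy": 400, "medicine": 600},
}

def rca(country, field, pubs):
    """RCA = (x_cf / sum_f x_cf) / (sum_c x_cf / sum_cf x_cf)."""
    country_total = sum(pubs[country].values())
    field_total = sum(p.get(field, 0) for p in pubs.values())
    world_total = sum(sum(p.values()) for p in pubs.values())
    share_in_country = pubs[country].get(field, 0) / country_total
    share_in_world = field_total / world_total
    return share_in_country / share_in_world

# The small country has a revealed advantage in astronomy (80% of its
# output vs. the worldwide astronomy share), despite far fewer papers.
cl_astro = rca("CL", "astronomy", pubs)
us_astro = rca("US", "astronomy", pubs)
```

Because both numerator and denominator are shares, a country's total size cancels out, so a small, specialized producer can still show an advantage in its strong fields.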
Submitted 5 September, 2014;
originally announced September 2014.