-
Rethinking the Outlier Distribution in Large Language Models: An In-depth Study
Authors:
Rahul Raman,
Khushi Sharma,
Sai Qian Zhang
Abstract:
Investigating outliers in large language models (LLMs) is crucial due to their significant impact on various aspects of LLM performance, including quantization and compression. Outliers often cause considerable quantization errors, leading to degraded model performance. Identifying and addressing these outliers can enhance the accuracy and efficiency of the quantization process, enabling smoother deployment on edge devices or specialized hardware. Recent studies have identified two common types of outliers in LLMs: massive activations and channel-wise outliers. While numerous quantization algorithms have been proposed to mitigate their effects and maintain satisfactory accuracy, few have thoroughly explored their root causes. In this paper, we conduct a comprehensive investigation into the formation mechanisms of these outliers and propose potential strategies to mitigate their occurrence. Ultimately, we introduce efficient approaches that eliminate most massive activations and channel-wise outliers with minimal impact on accuracy.
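The two outlier types the abstract names can be illustrated with a small detector. This is a hypothetical sketch: the magnitude-ratio thresholds and the tokens-by-channels layout are illustrative assumptions, not the paper's method.

```python
# Toy activation matrix: rows = tokens, columns = channels.
# "Massive activations" = individual entries far above the median magnitude;
# "channel-wise outliers" = whole channels whose mean magnitude is extreme.

def massive_activations(acts, ratio=100.0):
    """Return (row, col) positions whose magnitude exceeds
    `ratio` times the median absolute activation."""
    flat = sorted(abs(v) for row in acts for v in row)
    median = flat[len(flat) // 2]
    return [(i, j) for i, row in enumerate(acts)
            for j, v in enumerate(row) if abs(v) > ratio * median]

def outlier_channels(acts, ratio=10.0):
    """Return channel indices whose mean magnitude exceeds
    `ratio` times the median of the per-channel means."""
    n = len(acts)
    col_means = [sum(abs(row[j]) for row in acts) / n
                 for j in range(len(acts[0]))]
    med = sorted(col_means)[len(col_means) // 2]
    return [j for j, m in enumerate(col_means) if m > ratio * med]
```

Entries flagged this way are exactly the values that force a wide quantization range and hence large rounding error for the remaining, much smaller activations.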
Submitted 27 May, 2025;
originally announced May 2025.
-
DREAM: Drafting with Refined Target Features and Entropy-Adaptive Cross-Attention Fusion for Multimodal Speculative Decoding
Authors:
Yunhai Hu,
Tianhua Xia,
Zining Liu,
Rahul Raman,
Xingyu Liu,
Bo Bao,
Eric Sather,
Vithursan Thangarasa,
Sai Qian Zhang
Abstract:
Speculative decoding (SD) has emerged as a powerful method for accelerating autoregressive generation in large language models (LLMs), yet its integration into vision-language models (VLMs) remains underexplored. We introduce DREAM, a novel speculative decoding framework tailored for VLMs that combines three key innovations: (1) a cross-attention-based mechanism to inject intermediate features from the target model into the draft model for improved alignment, (2) adaptive intermediate feature selection based on attention entropy to guide efficient draft model training, and (3) visual token compression to reduce draft model latency. DREAM enables efficient, accurate, and parallel multimodal decoding with significant throughput improvement. Experiments across a diverse set of recent popular VLMs, including LLaVA, Pixtral, SmolVLM and Gemma3, demonstrate up to a 3.6x speedup over conventional decoding, and DREAM significantly outperforms prior SD baselines in both inference throughput and speculative draft acceptance length across a broad range of multimodal benchmarks. The code is publicly available at: https://github.com/SAI-Lab-NYU/DREAM.git
Submitted 29 May, 2025; v1 submitted 25 May, 2025;
originally announced May 2025.
-
Supports for Outerplanar and Bounded Treewidth Graphs
Authors:
Rajiv Raman,
Karamjeet Singh
Abstract:
We study the existence and construction of sparse supports for hypergraphs derived from subgraphs of a graph $G$. For a hypergraph $(X,\mathcal{H})$, a support $Q$ is a graph on $X$ such that $Q[H]$, the graph induced on the vertices in $H$, is connected for every $H\in\mathcal{H}$.
We consider \emph{primal}, \emph{dual}, and \emph{intersection} hypergraphs defined by subgraphs of a graph $G$ that are \emph{non-piercing} (i.e., each subgraph is connected and the pairwise differences of subgraphs remain connected).
If $G$ is outerplanar, we show that the primal, dual and intersection hypergraphs admit supports that are outerplanar. For a bounded treewidth graph $G$, we show that if the subgraphs are non-piercing, then there exist supports for the primal and dual hypergraphs of treewidth $O(2^{tw(G)})$ and $O(2^{4tw(G)})$ respectively, and a support of treewidth $2^{O(2^{tw(G)})}$ for the intersection hypergraph. We also show that for the primal and dual hypergraphs, the exponential blow-up of treewidth is sometimes essential.
All our results are algorithmic and yield polynomial-time algorithms (when the treewidth is bounded). The existence and construction of sparse supports is a crucial step in the design and analysis of PTASs and/or sub-exponential time algorithms for several packing and covering problems.
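The central definition above (a support must induce a connected subgraph on every hyperedge) can be verified directly. A small pure-Python checker, with illustrative names and input conventions of my choosing:

```python
# edges: a set of frozensets {u, v} giving the edges of candidate support Q.
# hyperedges: an iterable of vertex sets. Q is a support iff Q restricted
# to each hyperedge is connected (checked here by DFS within the hyperedge).

def is_support(edges, hyperedges):
    def connected(subset):
        subset = set(subset)
        if not subset:
            return True
        seen, stack = set(), [next(iter(subset))]
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            stack.extend(v for v in subset
                         if v not in seen and frozenset((u, v)) in edges)
        return seen == subset
    return all(connected(H) for H in hyperedges)
```

For example, the path 1-2-3 supports the hyperedges {1,2}, {2,3}, and {1,2,3}, but not {1,3}, since vertices 1 and 3 are not adjacent in the path.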
Submitted 8 June, 2025; v1 submitted 7 April, 2025;
originally announced April 2025.
-
On Supports for graphs of bounded genus
Authors:
Rajiv Raman,
Karamjeet Singh
Abstract:
Let $(X,\mathcal{E})$ be a hypergraph. A support is a graph $Q$ on $X$ such that for each $E\in\mathcal{E}$, the subgraph of $Q$ induced on the elements in $E$ is connected. We consider the problem of constructing a support for hypergraphs defined by connected subgraphs of a host graph. For a graph $G=(V,E)$, let $\mathcal{H}$ be a set of connected subgraphs of $G$. Let the vertices of $G$ be partitioned into two sets: the \emph{terminals} $\mathbf{b}(V)$ and the \emph{non-terminals} $\mathbf{r}(V)$. We define a hypergraph on $\mathbf{b}(V)$, where each $H\in\mathcal{H}$ defines a hyperedge consisting of the vertices of $\mathbf{b}(V)$ in $H$.
We also consider the problem of constructing a support for the \emph{dual hypergraph} - a hypergraph on $\mathcal{H}$ where each $v\in \mathbf{b}(V)$ defines a hyperedge consisting of the subgraphs in $\mathcal{H}$ containing $v$. In fact, we construct supports for a common generalization of the primal and dual settings called the \emph{intersection hypergraph}.
As our main result, we show that if the host graph $G$ has bounded genus and the subgraphs in $\mathcal{H}$ satisfy a condition of being \emph{cross-free}, then there exists a support that also has bounded genus. Our results are a generalization of the results of Raman and Ray (Rajiv Raman, Saurabh Ray: Constructing Planar Support for Non-Piercing Regions. Discret. Comput. Geom. 64(3): 1098-1122 (2020)).
Our techniques imply a unified analysis for packing and covering problems for hypergraphs defined on surfaces of bounded genus. We also describe applications of our results for hypergraph colorings.
Submitted 30 April, 2025; v1 submitted 27 March, 2025;
originally announced March 2025.
-
Adaptive Class Learning to Screen Diabetic Disorders in Fundus Images of Eye
Authors:
Shramana Dey,
Pallabi Dutta,
Riddhasree Bhattacharyya,
Surochita Pal,
Sushmita Mitra,
Rajiv Raman
Abstract:
The prevalence of ocular illnesses is growing globally, presenting a substantial public health challenge. Early detection and timely intervention are crucial for averting visual impairment and enhancing patient prognosis. This research introduces a new framework called Class Extension with Limited Data (CELD) to train a classifier to categorize retinal fundus images. The classifier is initially trained to identify relevant features concerning Healthy and Diabetic Retinopathy (DR) classes and later fine-tuned to adapt to the task of classifying the input images into three classes: Healthy, DR, and Glaucoma. This strategy allows the model to gradually enhance its classification capabilities, which is beneficial in situations where only a limited number of labeled datasets are available. Perturbation methods are also used to identify the input image characteristics responsible for influencing the model's decision-making process. We achieve an overall accuracy of 91% on publicly available datasets.
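One common perturbation method is occlusion sensitivity: slide a masking patch over the image and record how much the model's score drops. The abstract does not specify which perturbation scheme is used, so the following is a generic sketch; the `score` callable, patch size, and fill value are assumptions.

```python
# Occlusion-sensitivity heatmap on a 2-D image given as a list of lists.
# Regions whose masking causes a large score drop are deemed important
# to the model's decision.

def occlusion_map(image, score, patch=2, fill=0.0):
    h, w = len(image), len(image[0])
    base = score(image)
    heat = [[0.0] * w for _ in range(h)]
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            masked = [row[:] for row in image]
            for di in range(i, min(i + patch, h)):
                for dj in range(j, min(j + patch, w)):
                    masked[di][dj] = fill
            drop = base - score(masked)      # big drop = important region
            for di in range(i, min(i + patch, h)):
                for dj in range(j, min(j + patch, w)):
                    heat[di][dj] = drop
    return heat
```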
Submitted 21 January, 2025;
originally announced January 2025.
-
A fast algorithm for computing a planar support for non-piercing rectangles
Authors:
Ambar Pal,
Rajiv Raman,
Saurabh Ray,
Karamjeet Singh
Abstract:
For a hypergraph $\mathcal{H}=(X,\mathcal{E})$, a \emph{support} is a graph $G$ on $X$ such that for each $E\in\mathcal{E}$, the induced subgraph of $G$ on the elements in $E$ is connected. If $G$ is planar, we call it a planar support. A set of axis-parallel rectangles $\mathcal{R}$ forms a non-piercing family if for any $R_1, R_2 \in \mathcal{R}$, $R_1 \setminus R_2$ is connected. Given a set $P$ of $n$ points in $\mathbb{R}^2$ and a set $\mathcal{R}$ of $m$ \emph{non-piercing} axis-aligned rectangles, we give an algorithm for computing a planar support for the hypergraph $(P,\mathcal{R})$ in $O(n\log^2 n + (n+m)\log m)$ time, where each $R\in\mathcal{R}$ defines a hyperedge consisting of all points of $P$ contained in~$R$. We use this result to show that if for a family of axis-parallel rectangles, any point in the plane is contained in at most $k$ pairwise \emph{crossing} rectangles (a pair of intersecting rectangles such that neither contains a corner of the other is called a crossing pair of rectangles), then we can obtain a support as the union of $k$ planar graphs.
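The "crossing" condition in the abstract has a simple coordinate characterization for axis-aligned rectangles: two intersecting rectangles cross iff neither contains a corner of the other, i.e. one passes fully through the other like a plus sign. A small illustrative predicate, assuming rectangles are given as `(x1, y1, x2, y2)` with `x1 < x2` and `y1 < y2`:

```python
# Two axis-aligned rectangles cross iff one spans the other completely in
# one axis while lying strictly inside it in the other axis. Note that a
# crossing pair is in particular piercing: the pierced rectangle minus the
# other one falls into two disconnected pieces.

def crosses(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    if ax1 >= bx2 or bx1 >= ax2 or ay1 >= by2 or by1 >= ay2:
        return False                      # interiors do not intersect
    a_through_b = ax1 < bx1 and bx2 < ax2 and by1 < ay1 and ay2 < by2
    b_through_a = bx1 < ax1 and ax2 < bx2 and ay1 < by1 and by2 < ay2
    return a_through_b or b_through_a
```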
Submitted 3 October, 2024;
originally announced October 2024.
-
The Persistent Robot Charging Problem for Long-Duration Autonomy
Authors:
Nitesh Kumar,
Jaekyung Jackie Lee,
Sivakumar Rathinam,
Swaroop Darbha,
P. B. Sujit,
Rajiv Raman
Abstract:
This paper introduces a novel formulation aimed at determining the optimal schedule for recharging a fleet of $n$ heterogeneous robots, with the primary objective of minimizing resource utilization. This study provides a foundational framework applicable to Multi-Robot Mission Planning, particularly in scenarios demanding Long-Duration Autonomy (LDA) or other contexts that necessitate periodic recharging of multiple robots. A novel Integer Linear Programming (ILP) model is proposed to calculate the optimal initial conditions (partial charge) for individual robots, leading to the minimal utilization of charging stations. This formulation was further generalized to maximize the servicing time for robots given adequate charging stations. The efficacy of the proposed formulation is evaluated through a comparative analysis, measuring its performance against the thrift price scheduling algorithm documented in the existing literature. The findings not only validate the effectiveness of the proposed approach but also underscore its potential as a valuable tool in optimizing resource allocation for a range of robotic and engineering applications.
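The flavor of such a formulation can be sketched as a small ILP. This is a hedged sketch in the spirit described above, not the paper's actual model: the variables $x_{i,t,s}$ (robot $i$ charges at station $s$ in slot $t$), $y_s$ (station $s$ is used), charge levels $c_{i,t}$, discharge/recharge rates $d_i, r_i$, and the charge floor $c^{\min}_i$ are all assumptions.

```latex
\begin{align*}
  \min \quad & \sum_{s=1}^{S} y_s
      && \text{(charging stations used)}\\
  \text{s.t.} \quad
  & \sum_{s} x_{i,t,s} \le 1
      && \forall i,t \quad \text{(a robot charges at one station at a time)}\\
  & \sum_{i} x_{i,t,s} \le y_s
      && \forall t,s \quad \text{(a station serves one robot per slot)}\\
  & c_{i,t+1} = c_{i,t} - d_i + r_i \textstyle\sum_{s} x_{i,t,s}
      && \forall i,t \quad \text{(battery dynamics)}\\
  & c_{i,t} \ge c^{\min}_i
      && \forall i,t \quad \text{(never deplete below the floor)}\\
  & x_{i,t,s} \in \{0,1\},\; y_s \in \{0,1\}
\end{align*}
```

The paper's generalization to maximizing servicing time would swap the objective while keeping feasibility constraints of this kind.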
Submitted 31 August, 2024;
originally announced September 2024.
-
Speckle Noise Analysis for Synthetic Aperture Radar (SAR) Space Data
Authors:
Sanjjushri Varshini R,
Rohith Mahadevan,
Bagiya Lakshmi S,
Mathivanan Periasamy,
Raja CSP Raman,
Lokesh M
Abstract:
This research tackles the challenge of speckle noise in Synthetic Aperture Radar (SAR) space data, a prevalent issue that hampers the clarity and utility of SAR images. The study presents a comparative analysis of six distinct speckle noise reduction techniques: Lee Filtering, Frost Filtering, Kuan Filtering, Gaussian Filtering, Median Filtering, and Bilateral Filtering. These methods, selected for their unique approaches to noise reduction and image preservation, were applied to SAR datasets sourced from the Alaska Satellite Facility (ASF). The performance of each technique was evaluated using a comprehensive set of metrics, including Peak Signal-to-Noise Ratio (PSNR), Mean Squared Error (MSE), Structural Similarity Index (SSIM), Equivalent Number of Looks (ENL), and Speckle Suppression Index (SSI). The study concludes that both the Lee and Kuan Filters are effective, with the choice of filter depending on the specific application requirements for image quality and noise suppression. This work provides valuable insights into optimizing SAR image processing, with significant implications for remote sensing, environmental monitoring, and geological surveying.
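Of the six filters compared, the Lee filter admits a particularly compact statement: each pixel is pulled toward its local mean by a gain that vanishes in flat (pure-speckle) regions and approaches one at edges. A minimal pure-Python sketch; the window size and the fixed noise-variance estimate are illustrative assumptions.

```python
# Lee filter on a 2-D list of floats:
#   out = mean + k * (pixel - mean),  k = var / (var + noise_var)
# where mean/var are computed over a sliding window (clipped at borders).

def lee_filter(img, size=3, noise_var=0.05):
    h, w, r = len(img), len(img[0]), size // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            win = [img[a][b]
                   for a in range(max(0, i - r), min(h, i + r + 1))
                   for b in range(max(0, j - r), min(w, j + r + 1))]
            mean = sum(win) / len(win)
            var = sum((v - mean) ** 2 for v in win) / len(win)
            k = var / (var + noise_var)   # ~0 in flat areas, ~1 at edges
            out[i][j] = mean + k * (img[i][j] - mean)
    return out
```

In homogeneous regions the local variance is near zero, so the output collapses to the window mean (maximal smoothing), while high-variance regions are left nearly untouched, which is why Lee filtering preserves edges better than plain mean filtering.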
Submitted 16 August, 2024;
originally announced August 2024.
-
High-dimensional multiple imputation (HDMI) for partially observed confounders including natural language processing-derived auxiliary covariates
Authors:
Janick Weberpals,
Pamela A. Shaw,
Kueiyu Joshua Lin,
Richard Wyss,
Joseph M Plasek,
Li Zhou,
Kerry Ngan,
Thomas DeRamus,
Sudha R. Raman,
Bradley G. Hammill,
Hana Lee,
Sengwee Toh,
John G. Connolly,
Kimberly J. Dandreo,
Fang Tian,
Wei Liu,
Jie Li,
José J. Hernández-Muñoz,
Sebastian Schneeweiss,
Rishi J. Desai
Abstract:
Multiple imputation (MI) models can be improved by including auxiliary covariates (AC), but their performance in high-dimensional data is not well understood. We aimed to develop and compare high-dimensional MI (HDMI) approaches using structured and natural language processing (NLP)-derived AC in studies with partially observed confounders. We conducted a plasmode simulation study using data from opioid vs. non-steroidal anti-inflammatory drug (NSAID) initiators (X) with observed serum creatinine labs (Z2) and time-to-acute kidney injury as outcome. We simulated 100 cohorts with a null treatment effect, including X, Z2, atrial fibrillation (U), and 13 other investigator-derived confounders (Z1) in the outcome generation. We then imposed missingness (MZ2) on 50% of Z2 measurements as a function of Z2 and U and created different HDMI candidate AC using structured and NLP-derived features. We mimicked scenarios where U was unobserved by omitting it from all AC candidate sets. Using LASSO, we data-adaptively selected HDMI covariates associated with Z2 and MZ2 for MI, and with U to include in propensity score models. The treatment effect was estimated following propensity score matching in MI datasets, and we benchmarked HDMI approaches against a baseline imputation and complete case analysis with Z1 only. HDMI using claims data showed the lowest bias (0.072). Combining claims and sentence embeddings improved efficiency, displaying the lowest root-mean-squared error (0.173) and a coverage of 94%. NLP-derived AC alone did not perform better than baseline MI. HDMI approaches may decrease bias in studies with partially observed confounders where missingness depends on unobserved factors.
Submitted 17 May, 2024;
originally announced May 2024.
-
Automating REST API Postman Test Cases Using LLM
Authors:
S Deepika Sri,
Mohammed Aadil S,
Sanjjushri Varshini R,
Raja CSP Raman,
Gopinath Rajagopal,
S Taranath Chan
Abstract:
In the contemporary landscape of technological advancements, the automation of manual processes is crucial, compelling the demand for huge datasets to effectively train and test machines. This research paper is dedicated to the exploration and implementation of an automated approach to generate test cases specifically using Large Language Models. The methodology integrates the use of OpenAI to enhance the efficiency and effectiveness of test case generation for training and evaluating Large Language Models. This formalized approach with LLMs simplifies the testing process, making it more efficient and comprehensive. Leveraging natural language understanding, LLMs can intelligently formulate test cases that cover a broad range of REST API properties, ensuring comprehensive testing. The model that is developed during the research is trained using manually collected Postman test cases or instances for various REST APIs. LLMs enhance the creation of Postman test cases by automating the generation of varied and intricate test scenarios. Postman test cases offer streamlined automation, collaboration, and dynamic data handling, providing a user-friendly and efficient approach to API testing compared to traditional test cases. Thus, the model developed not only conforms to current technological standards but also holds the promise of evolving into an idea of substantial importance in future technological advancements.
Submitted 16 April, 2024;
originally announced April 2024.
-
Finding fake reviews in e-commerce platforms by using hybrid algorithms
Authors:
Mathivanan Periasamy,
Rohith Mahadevan,
Bagiya Lakshmi S,
Raja CSP Raman,
Hasan Kumar S,
Jasper Jessiman
Abstract:
Sentiment analysis, a vital component in natural language processing, plays a crucial role in understanding the underlying emotions and opinions expressed in textual data. In this paper, we propose an innovative ensemble approach for sentiment analysis for finding fake reviews that amalgamates the predictive capabilities of Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Decision Tree classifiers. Our ensemble architecture strategically combines these diverse models to capitalize on their strengths while mitigating inherent weaknesses, thereby achieving superior accuracy and robustness in fake review prediction. By combining all of our classifiers, the predictive performance is boosted, and the ensemble also fosters adaptability to the varied linguistic patterns and nuances present in real-world datasets. The metrics evaluated on fake-review data demonstrate the efficacy and competitiveness of the proposed ensemble method against traditional single-model approaches. Our findings underscore the potential of ensemble techniques in advancing the state of the art in finding fake reviews using hybrid algorithms, with implications for various applications across social media and e-platforms to surface the best reviews and neglect the fake ones, eliminating puffery and bluffs.
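The ensemble idea above, reduced to its core, is majority voting over independent classifiers. In this sketch, plain-Python rules stand in for the actual SVM, KNN, and Decision Tree models (an assumption for illustration); the feature names are hypothetical.

```python
# Hard-voting ensemble: each classifier votes a label; the most common
# label wins. With an odd number of voters there is always a majority.

def majority_vote(classifiers, sample):
    votes = [clf(sample) for clf in classifiers]
    return max(set(votes), key=votes.count)

# Toy stand-in "classifiers" mapping review features to a label.
clf_a = lambda s: "fake" if s["exclamations"] > 3 else "genuine"
clf_b = lambda s: "fake" if s["rating"] == 5 and s["length"] < 20 else "genuine"
clf_c = lambda s: "fake" if s["duplicate_score"] > 0.8 else "genuine"
```

Even when one model errs on a given review, the vote can still be correct, which is the robustness the abstract attributes to combining models with complementary weaknesses.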
Submitted 9 April, 2024;
originally announced April 2024.
-
Sweeping Arrangements of Non-Piercing Curves in Plane
Authors:
Suryendu Dalal,
Rahul Gangopadhyay,
Rajiv Raman,
Saurabh Ray
Abstract:
Let $Γ$ be a finite set of Jordan curves in the plane. For any curve $γ\in Γ$, we denote the bounded region enclosed by $γ$ as $\tilde{γ}$. We say that $Γ$ is a non-piercing family if for any two curves $α, β\in Γ$, $\tilde{α} \setminus \tilde{β}$ is a connected region. A non-piercing family of curves generalizes a family of $2$-intersecting curves, in which each pair of curves intersects in at most two points. Snoeyink and Hershberger (``Sweeping Arrangements of Curves'', SoCG '89) proved that if we are given a family $\mathcal{C}$ of $2$-intersecting curves and a fixed curve $C\in\mathcal{C}$, then the arrangement can be \emph{swept} by $C$, i.e., $C$ can be continuously shrunk to any point $p \in \tilde{C}$ in such a way that we have a family of $2$-intersecting curves throughout the process. In this paper, we generalize the result of Snoeyink and Hershberger to the setting of non-piercing curves. We show that given an arrangement of non-piercing curves $Γ$ and a fixed curve $γ\in Γ$, the arrangement can be swept by $γ$ so that it remains non-piercing throughout the process. We also give a shorter and simpler proof of the result of Snoeyink and Hershberger, and we describe applications of their result where our result leads to a generalization.
Submitted 2 April, 2024; v1 submitted 25 March, 2024;
originally announced March 2024.
-
Comparative Study and Framework for Automated Summariser Evaluation: LangChain and Hybrid Algorithms
Authors:
Bagiya Lakshmi S,
Sanjjushri Varshini R,
Rohith Mahadevan,
Raja CSP Raman
Abstract:
Automated Essay Scoring (AES) is proven to be one of the cutting-edge technologies. Scoring techniques are used for various purposes. Reliable scores are calculated based on influential variables. Such variables can be computed by different methods based on the domain. The research is concentrated on the user's understanding of a given topic. The analysis is based on a scoring index computed using Large Language Models. The user can then compare and contrast their understanding of a topic that they recently learned. The results then contribute to learning analytics, and progress is made toward enhancing the learning ability. In this research, the focus is on summarizing a PDF document and gauging a user's understanding of its content. The process involves utilizing the LangChain tool to summarize the PDF and extract the essential information. By employing this technique, the research aims to determine how well the user comprehends the summarized content.
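One simple baseline for such a scoring index is to compare the generated summary with the user's own restatement of the topic. The sketch below uses bag-of-words cosine similarity as a hypothetical stand-in for the LLM-based scoring described above; it is not the paper's method.

```python
# Score in [0, 1]: 1.0 when the user's restatement uses exactly the same
# word distribution as the summary, 0.0 when they share no words.

import math
from collections import Counter

def understanding_score(summary, user_text):
    a = Counter(summary.lower().split())
    b = Counter(user_text.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

An LLM-based index would replace the word-count vectors with semantic judgments, but the comparison structure (summary vs. user response, normalized to a score) is the same.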
Submitted 4 October, 2023;
originally announced October 2023.
-
Exploring Linguistic Similarity and Zero-Shot Learning for Multilingual Translation of Dravidian Languages
Authors:
Danish Ebadulla,
Rahul Raman,
S. Natarajan,
Hridhay Kiran Shetty,
Ashish Harish Shenoy
Abstract:
Current research in zero-shot translation is plagued by several issues such as high compute requirements, increased training time and off-target translations. Proposed remedies often come at the cost of additional data or compute requirements. Pivot-based neural machine translation is preferred over a single-encoder model for most settings despite the increased training and evaluation time. In this work, we overcome the shortcomings of zero-shot translation by taking advantage of transliteration and linguistic similarity. We build a single encoder-decoder neural machine translation system for Dravidian-Dravidian multilingual translation and perform zero-shot translation. We compare the data vs. zero-shot accuracy tradeoff and evaluate the performance of our vanilla method against the current state-of-the-art pivot-based method. We also test the theory that morphologically rich languages require large vocabularies by restricting the vocabulary using an optimal transport based technique. Our model achieves scores within 3 BLEU of large-scale pivot-based models when it is trained on 50% of the language directions.
Submitted 10 August, 2023;
originally announced August 2023.
-
CERTainty: Detecting DNS Manipulation at Scale using TLS Certificates
Authors:
Elisa Tsai,
Deepak Kumar,
Ram Sundara Raman,
Gavin Li,
Yael Eiger,
Roya Ensafi
Abstract:
DNS manipulation is an increasingly common technique used by censors and other network adversaries to prevent users from accessing restricted Internet resources and hijack their connections. Prior work in detecting DNS manipulation relies largely on comparing DNS resolutions with trusted control results to identify inconsistencies. However, the emergence of CDNs and other cloud providers practicing content localization and load balancing leads to these heuristics being inaccurate, paving the need for more verifiable signals of DNS manipulation. In this paper, we develop a new technique, CERTainty, that utilizes the widely established TLS certificate ecosystem to accurately detect DNS manipulation, and obtain more information about the adversaries performing such manipulation. We find that untrusted certificates, mismatching hostnames, and blockpages are powerful proxies for detecting DNS manipulation. Our results show that previous work using consistency-based heuristics is inaccurate, allowing for 72.45% false positives in the cases detected as DNS manipulation. Further, we identify 17 commercial DNS filtering products in 52 countries, including products such as SafeDNS, SkyDNS, and Fortinet, and identify the presence of 55 ASes in 26 countries that perform ISP-level DNS manipulation. We also identify 226 new blockpage clusters that are not covered by previous research. We are integrating techniques used by CERTainty into active measurement platforms to continuously and accurately monitor DNS manipulation.
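One of the certificate signals CERTainty relies on is hostname mismatch: does the certificate presented for a resolved IP actually cover the queried domain? The sketch below implements simplified wildcard matching in the style of RFC 6125 (wildcard in the left-most label only, matching exactly one label); real certificate validation does far more (chain of trust, expiry, revocation), and this function is only an illustration.

```python
# Return True if `hostname` is covered by any name in `cert_names`
# (e.g. the certificate's subject alternative names).

def hostname_matches(hostname, cert_names):
    host = hostname.lower().split(".")
    for name in cert_names:
        pat = name.lower().split(".")
        if len(pat) != len(host):
            continue          # a wildcard matches exactly one label
        if all((p == "*" and i == 0) or p == h
               for i, (p, h) in enumerate(zip(pat, host))):
            return True
    return False
```

A resolution whose certificate fails this check (and is not explained by CDN localization) is a far more verifiable sign of manipulation than a bare IP inconsistency.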
Submitted 14 May, 2023;
originally announced May 2023.
-
On Hypergraph Supports
Authors:
Rajiv Raman,
Karamjeet Singh
Abstract:
Let $\mathcal{H}=(X,\mathcal{E})$ be a hypergraph. A support is a graph $Q$ on $X$ such that for each $E\in\mathcal{E}$, the subgraph of $Q$ induced on the elements in $E$ is connected. In this paper, we consider hypergraphs defined on a host graph. Given a graph $G=(V,E)$, with $c:V\to\{\mathbf{r},\mathbf{b}\}$, and a collection of connected subgraphs $\mathcal{H}$ of $G$, a primal support is a graph $Q$ on $\mathbf{b}(V)$ such that for each $H\in \mathcal{H}$, the induced subgraph $Q[\mathbf{b}(H)]$ on vertices $\mathbf{b}(H)=H\cap c^{-1}(\mathbf{b})$ is connected. A \emph{dual support} is a graph $Q^*$ on $\mathcal{H}$ such that for each $v\in X$, the induced subgraph $Q^*[\mathcal{H}_v]$ is connected, where $\mathcal{H}_v=\{H\in\mathcal{H}: v\in H\}$. We present sufficient conditions on the host graph and hyperedges so that the resulting support comes from a restricted family.
We primarily study two classes of graphs: $(1)$ If the host graph has genus $g$ and the hypergraphs satisfy a topological condition of being \emph{cross-free}, then there is a primal and a dual support of genus at most $g$. $(2)$ If the host graph has treewidth $t$ and the hyperedges satisfy a combinatorial condition of being \emph{non-piercing}, then there exist primal and dual supports of treewidth $O(2^t)$. We show that this exponential blow-up is sometimes necessary. As an intermediate case, we also study the case when the host graph is outerplanar. Finally, we show applications of our results to packing and covering, and coloring problems on geometric hypergraphs.
Submitted 2 February, 2024; v1 submitted 29 March, 2023;
originally announced March 2023.
-
Fashion-model pose recommendation and generation using Machine Learning
Authors:
Vijitha Kannumuru,
Santhosh Kannan S P,
Krithiga Shankar,
Joy Larnyoh,
Rohith Mahadevan,
Raja CSP Raman
Abstract:
Fashion-model pose is an important attribute in the fashion industry. Creative directors, modeling production houses, and top photographers always look for professional models who are able to pose. Without the skill to pose correctly, a model's chances of landing professional modeling employment are regrettably slim. There are occasions when models and photographers are unsure of the best pose to strike while taking photographs. This research concentrates on suggesting to fashion personnel a series of similar images based on an input image. The image is segmented into different parts, and similar images are suggested to the user. This was achieved by calculating the color histogram of the input image, applying the same to all the images in the dataset, and comparing the histograms. Synthetic images have become popular to avoid privacy concerns and to overcome the high cost of photoshoots. Hence, this paper also extends the recommendation engine to generate synthetic images using StyleGAN.
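The retrieval step described above (histogram the input image, histogram every dataset image, compare) can be sketched briefly. The bin count, the flat list of `(R, G, B)` pixel tuples, and the choice of histogram intersection as the comparison metric are assumptions for illustration.

```python
# Quantize colors into a coarse 3-D histogram, then rank candidates by
# histogram intersection (1.0 for identical color distributions).

def color_histogram(pixels, bins=4):
    step = 256 // bins
    hist = [0] * (bins ** 3)
    for r, g, b in pixels:
        hist[(r // step) * bins * bins + (g // step) * bins + (b // step)] += 1
    total = len(pixels)
    return [c / total for c in hist]

def similarity(h1, h2):
    return sum(min(a, b) for a, b in zip(h1, h2))
```

Ranking the dataset by `similarity` against the query histogram yields the "series of similar images" the system suggests.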
Submitted 19 February, 2023;
originally announced March 2023.
-
Payday loans -- blessing or growth suppressor? Machine Learning Analysis
Authors:
Rohith Mahadevan,
Sam Richard,
Kishore Harshan Kumar,
Jeevitha Murugan,
Santhosh Kannan,
Saaisri,
Tarun,
Raja CSP Raman
Abstract:
The upsurge of real estate is influenced by a variety of factors across many domains. One under-recognized sector that affects the economy, and for which regulatory proposals are being drafted, is payday loans. This research paper examines the impact of payday loans on the real estate market. It draws on first-hand experience of obtaining an index for the concentration of real estate in a reference area by virtue of payday loans, in Toronto, Ontario in particular, and sets out to create, evaluate, and demonstrate the scenario through research analysis. The rationale for indexing via payday loans is the basic debt-to-income ratio: as the income of a person obliged to pay payday-loan interest increases, their debt declines marginally, which suggests that the person invests in fixed assets such as real estate, driving up its growth.
Submitted 30 May, 2022;
originally announced May 2022.
-
Turtle Score -- Similarity Based Developer Analyzer
Authors:
Sanjjushri Varshini,
Ponshriharini V,
Santhosh Kannan,
Snekha Suresh,
Harshavardhan Ramesh,
Rohith Mahadevan,
Raja CSP Raman
Abstract:
In day-to-day life, a highly demanding task for IT companies is to find the right candidates who fit the company's culture. This research aims to comprehend, analyze, and automatically produce convincing outcomes for finding a candidate who fits the company perfectly. Data is examined and collected for each employee working in the IT domain, focusing on performance measures across various categories, which brings versatility and a wide focus. Learner analysis is then performed on this data using machine learning algorithms to obtain learner similarity and developer similarity, in order to recruit people with identical working patterns. It has been shown that the efficiency and capability of a particular worker increase when working with a person of a similar personality. This will therefore serve as a useful tool for recruiters who aim to recruit people with high productivity. The model designed renders the best possible outcome with high accuracy and an immaculate recommendation score.
Submitted 10 May, 2022;
originally announced May 2022.
-
On the Parameterized Complexity of the Maximum Exposure Problem
Authors:
Remi Raman,
Shahin John J S,
R Subashini,
Subhasree Methirumangalath
Abstract:
We investigate the parameterized complexity of the Maximum Exposure Problem (MEP). Given a range space (R, P), where R is a set of ranges containing a set P of points, and an integer k, MEP asks for k ranges whose removal results in the maximum number of exposed points. A point p is said to be exposed when p is not contained in any of the ranges in R. The problem is known to be NP-hard. In this letter, we give fixed-parameter tractable results for MEP with respect to different parameterizations.
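To make the objective concrete, a brute-force Python sketch (exponential in k, so it only states the problem, in contrast to the FPT algorithms the letter develops; encoding ranges as sets of point indices is an assumption of this sketch):

```python
from itertools import combinations

def max_exposure(ranges, points, k):
    """Brute-force MEP: `ranges` is a list of sets of point indices.
    Remove exactly k ranges so as to maximize the number of points
    contained in no remaining range (the "exposed" points)."""
    n = len(ranges)
    best = 0
    for removed in combinations(range(n), k):
        keep = [ranges[i] for i in range(n) if i not in removed]
        exposed = sum(1 for p in points if all(p not in r for r in keep))
        best = max(best, exposed)
    return best
```

For example, with ranges {0,1}, {1,2}, {3} over points 0..3, removing any single range exposes one point, while removing the first two exposes points 0, 1, and 2.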
Submitted 22 March, 2022; v1 submitted 21 March, 2022;
originally announced March 2022.
-
Stack Index Prediction Using Time-Series Analysis
Authors:
Raja CSP Raman,
Rohith Mahadevan,
Divya Perumal,
Vedha Sankar,
Talha Abdur Rahman
Abstract:
The prevalence of community support and engagement across different domains in the tech industry has changed and evolved over the years. In this study, we aim to understand, analyze, and predict technology trends in a scientific manner, having collected data on numerous topics and their growth over the past decade. We apply machine learning models to the collected data to understand, analyze, and forecast trends in the advancement of different fields. We show that certain technical concepts, such as Python, machine learning, and Keras, have an undisputed uptrend, and conclude that the Stackindex model forecasts with high accuracy and can be a viable tool for forecasting different tech domains.
Submitted 18 August, 2021;
originally announced August 2021.
-
Knowledge Value Stream Framework For Complex Product Design Decisions
Authors:
Ramakrishnan Raman,
Meenakshi D'Souza
Abstract:
The product development value stream includes all activities, value-added and non-value-added, involved in designing and developing a product. It is characterized by a flow of knowledge that drives decisions and deals with how the product is conceptualized, architected, and designed by the design teams. This intangible flow is determined by the knowledge value stream, which shapes how raw concepts and ideas mature into knowledge, how that knowledge is socialized and internalized, and how it impels decisions in the product development value stream. For complex products, design teams encounter tough challenges, such as uncertainty and variability, while making design decisions. This paper proposes a knowledge value stream framework for complex product design decisions. The framework has knowledge cadence and learning cycles as its core elements and incorporates rudiments of complex product design such as uncertainty, variability, and perceptions. It helps manage uncertainty and variability and evolve the knowledge required to drive optimal decisions during the design of complex products. It advocates a phased model for framework deployment, establishing and progressively maturing the knowledge value stream.
Submitted 16 May, 2021;
originally announced May 2021.
-
Weighted Ancestors in Suffix Trees Revisited
Authors:
Djamal Belazzougui,
Dmitry Kosolobov,
Simon J. Puglisi,
Rajeev Raman
Abstract:
The weighted ancestor problem is a well-known generalization of the predecessor problem to trees. It is known to require $Ω(\log\log n)$ time for queries provided $O(n\mathop{\mathrm{polylog}} n)$ space is available and weights are from $[0..n]$, where $n$ is the number of tree nodes. However, when applied to suffix trees, the problem, surprisingly, admits an $O(n)$-space solution with constant query time, as was shown by Gawrychowski, Lewenstein, and Nicholson (Proc. ESA 2014). This variant of the problem can be reformulated as follows: given the suffix tree of a string $s$, we need a data structure that can locate in the tree any substring $s[p..q]$ of $s$ in $O(1)$ time (as if one descended from the root reading $s[p..q]$ along the way). Unfortunately, the data structure of Gawrychowski et al. has no efficient construction algorithm, limiting its wider usage as an algorithmic tool. In this paper we resolve this issue, describing a data structure for weighted ancestors in suffix trees with constant query time and a linear construction algorithm. Our solution is based on a novel approach using so-called irreducible LCP values.
Submitted 11 April, 2021; v1 submitted 28 February, 2021;
originally announced March 2021.
-
Underspecification Presents Challenges for Credibility in Modern Machine Learning
Authors:
Alexander D'Amour,
Katherine Heller,
Dan Moldovan,
Ben Adlam,
Babak Alipanahi,
Alex Beutel,
Christina Chen,
Jonathan Deaton,
Jacob Eisenstein,
Matthew D. Hoffman,
Farhad Hormozdiari,
Neil Houlsby,
Shaobo Hou,
Ghassen Jerfel,
Alan Karthikesalingam,
Mario Lucic,
Yian Ma,
Cory McLean,
Diana Mincu,
Akinori Mitani,
Andrea Montanari,
Zachary Nado,
Vivek Natarajan,
Christopher Nielson,
Thomas F. Osborne
, et al. (15 additional authors not shown)
Abstract:
ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.
Submitted 24 November, 2020; v1 submitted 6 November, 2020;
originally announced November 2020.
-
Quantitative and Qualitative Evaluation of Explainable Deep Learning Methods for Ophthalmic Diagnosis
Authors:
Amitojdeep Singh,
J. Jothi Balaji,
Mohammed Abdul Rasheed,
Varadharajan Jayakumar,
Rajiv Raman,
Vasudevan Lakshminarayanan
Abstract:
Background: The lack of explanations for the decisions made by algorithms such as deep learning has hampered their acceptance by the clinical community despite highly accurate results on multiple problems. Recently, attribution methods have emerged for explaining deep learning models, and they have been tested on medical imaging problems. The performance of attribution methods is compared on standard machine learning datasets and not on medical images. In this study, we perform a comparative analysis to determine the most suitable explainability method for retinal OCT diagnosis.
Methods: A commonly used deep learning model known as Inception v3 was trained to diagnose 3 retinal diseases - choroidal neovascularization (CNV), diabetic macular edema (DME), and drusen. The explanations from 13 different attribution methods were rated by a panel of 14 clinicians for clinical significance. Feedback was obtained from the clinicians regarding the current and future scope of such methods.
Results: An attribution method based on a Taylor series expansion, called Deep Taylor was rated the highest by clinicians with a median rating of 3.85/5. It was followed by two other attribution methods, Guided backpropagation and SHAP (SHapley Additive exPlanations).
Conclusion: Explanations of deep learning models can make them more transparent for clinical diagnosis. This study compared different explanation methods in the context of retinal OCT diagnosis and found that the best-performing method may not be the one considered best for other deep learning tasks. Overall, there was a high degree of acceptance from the clinicians surveyed in the study.
Keywords: explainable AI, deep learning, machine learning, image processing, optical coherence tomography, retina, diabetic macular edema, choroidal neovascularization, drusen
Submitted 24 March, 2021; v1 submitted 26 September, 2020;
originally announced September 2020.
-
Generating a Gray code for prefix normal words in amortized polylogarithmic time per word
Authors:
Péter Burcsi,
Gabriele Fici,
Zsuzsanna Lipták,
Rajeev Raman,
Joe Sawada
Abstract:
A prefix normal word is a binary word with the property that no substring has more $1$s than the prefix of the same length. By proving that the set of prefix normal words is a bubble language, we can exhaustively list all prefix normal words of length $n$ as a combinatorial Gray code, where successive strings differ by at most two swaps or bit flips. This Gray code can be generated in $O(\log^2 n)$ amortized time per word, while the best generation algorithm hitherto has $O(n)$ running time per word. We also present a membership tester for prefix normal words, as well as a novel characterization of bubble languages.
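The defining property is easy to state as code. A naive O(n^2) membership test in Python (this merely transcribes the definition; it is not the paper's tester):

```python
def is_prefix_normal(w):
    """True iff, for every window length l, the maximum number of 1s in
    any length-l substring of w does not exceed the number of 1s in the
    length-l prefix."""
    n = len(w)
    ones = [0] * (n + 1)                 # ones[i] = count of 1s in w[:i]
    for i, c in enumerate(w):
        ones[i + 1] = ones[i] + (c == '1')
    for l in range(1, n + 1):
        max_window = max(ones[i + l] - ones[i] for i in range(n - l + 1))
        if max_window > ones[l]:
            return False
    return True
```

For instance, "1101" is prefix normal, while "0101" is not (the substring "1" has more 1s than the prefix "0").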
Submitted 7 August, 2020; v1 submitted 5 March, 2020;
originally announced March 2020.
-
Finite-Sample Analysis of Image Registration
Authors:
Ravi Kiran Raman,
Lav R. Varshney
Abstract:
We study the problem of image registration in the finite-resolution regime and characterize the error probability of algorithms as a function of properties of the transformation and the image capture noise. Specifically, we define a channel-aware Feinstein decoder to obtain upper bounds on the minimum achievable error probability under finite resolution. We focus on the higher-order terms and use Berry-Esseen-type CLTs to obtain a stronger characterization of the achievability condition for the problem. We then derive a strong type-counting result to characterize the performance of the MMI decoder in terms of the maximum likelihood decoder, in a simplified setting of the problem. Finally, we describe how this analysis, when related to the results from the channel-aware context, provides a stronger characterization of the finite-sample performance of universal image registration.
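As a toy illustration of max-mutual-information (MMI) registration, restricted here to cyclic shifts of small integer-valued images (the estimator and all names are assumptions of this sketch, not the paper's decoders):

```python
import numpy as np

def mutual_information(a, b):
    """Empirical mutual information (nats) between two equally sized
    integer-valued images, from their joint intensity histogram."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for x, y in zip(a.ravel(), b.ravel()):
        joint[x, y] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal of a
    py = joint.sum(axis=0, keepdims=True)   # marginal of b
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def register_shift(ref, moving, max_shift=3):
    """Return the (dy, dx) cyclic shift of `moving` that maximizes
    mutual information with `ref`."""
    best, best_shift = -np.inf, (0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            mi = mutual_information(ref, np.roll(moving, (dy, dx), axis=(0, 1)))
            if mi > best:
                best, best_shift = mi, (dy, dx)
    return best_shift
```

At the correct alignment the joint histogram is concentrated on the diagonal and the mutual information equals the image entropy, so the true shift wins.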
Submitted 12 January, 2020;
originally announced January 2020.
-
Decision Making in Star Networks with Incorrect Beliefs
Authors:
Daewon Seo,
Ravi Kiran Raman,
Lav R. Varshney
Abstract:
Consider a Bayesian binary decision-making problem in star networks, where local agents make selfish decisions independently, and a fusion agent makes a final decision based on aggregated decisions and its own private signal. In particular, we assume all agents have private beliefs for the true prior probability, based on which they perform Bayesian decision making. We focus on the Bayes risk of the fusion agent and counterintuitively find that incorrect beliefs could achieve a smaller risk than that when agents know the true prior. It is of independent interest for sociotechnical system design that the optimal beliefs of local agents resemble human probability reweighting models from cumulative prospect theory.
We also consider asymptotic characterization of the optimal beliefs and fusion agent's risk in the number of local agents. We find that the optimal risk of the fusion agent converges to zero exponentially fast as the number of local agents grows. Furthermore, having an identical constant belief is asymptotically optimal in the sense of the risk exponent. For additive Gaussian noise, the optimal belief turns out to be a simple function of only error costs and the risk exponent can be explicitly characterized.
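For intuition about belief mismatch, here is a single-agent Gaussian sketch in Python (symbols and defaults are illustrative choices of this sketch). A lone agent's risk is minimized by the true prior; the paper's counterintuitive finding is that this can fail once a fusion agent aggregates selfish local decisions.

```python
from math import erf, log, sqrt

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def bayes_risk(p, q, mu=1.0, sigma=1.0, c10=1.0, c01=1.0):
    """Risk of one agent testing H0: N(0, sigma^2) vs H1: N(mu, sigma^2).

    The agent believes the prior of H0 is q (possibly != the true p) and
    sets its likelihood-ratio threshold accordingly; the risk is then
    evaluated under the true prior p. c10/c01 are false-alarm/miss costs.
    """
    # Gaussian shift-in-mean LRT: decide H1 when the observation x > t.
    t = mu / 2 + (sigma ** 2 / mu) * log(q * c10 / ((1 - q) * c01))
    p_fa = 1 - Phi(t / sigma)          # false alarm, under H0
    p_md = Phi((t - mu) / sigma)       # missed detection, under H1
    return p * c10 * p_fa + (1 - p) * c01 * p_md
```

With p = 0.5 and symmetric costs, any belief q deviating from 0.5 strictly increases this single agent's risk, symmetrically in either direction.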
Submitted 26 October, 2021; v1 submitted 27 December, 2019;
originally announced December 2019.
-
Fixed-Parameter Tractable Algorithms for Corridor Guarding Problems
Authors:
Remi Raman,
R Subashini,
Subhasree Methirumangalath
Abstract:
Given an orthogonal connected arrangement of line segments, the Minimum Corridor Guarding (MCG) problem asks for an optimal tree/closed walk such that, if a guard moves through the tree/closed walk, all the line segments are visited by the guard. This problem is referred to as Corridor-MST/Corridor-TSP (CMST/CTSP) when the guarding walk is a tree/closed walk, respectively. The corresponding decision version of MCG is known to be NP-complete [1]. The parameterized version of CMST/CTSP, referred to as k-CMST/k-CTSP, asks for an optimal tree/closed walk on at most k vertices that visits all the line segments; here, vertices correspond to the endpoints and intersection points of the input line segments. We show that k-CMST/k-CTSP is fixed-parameter tractable (FPT) with respect to the parameter k. Next, we propose a variant of CTSP referred to as Minimum Link CTSP (MLC), in which the link-distance of the closed walk has to be minimized; link-distance refers to the number of links, i.e., connected line segments with the same orientation in the walk. We prove that the decision version of MLC is NP-complete, and show that the parameterized version, b-MLC, is FPT with respect to the parameter b, where b corresponds to the link-distance. We also consider a related problem, Minimum Corridor Connection (MCC). Given a rectilinear polygon partitioned into rectilinear components or rooms, MCC asks for a minimum-length tree along the edges of the partitions such that every room is incident to at least one vertex of the tree. The decision version of MCC is known to be NP-complete [2]. We prove the fixed-parameter tractability of the parameterized version, k-MCC, with respect to the parameter k, where k is the number of rooms.
Submitted 29 March, 2022; v1 submitted 11 February, 2019;
originally announced February 2019.
-
Beliefs in Decision-Making Cascades
Authors:
Daewon Seo,
Ravi Kiran Raman,
Joong Bum Rhim,
Vivek K Goyal,
Lav R Varshney
Abstract:
This work explores a social learning problem with agents having nonidentical noise variances and mismatched beliefs. We consider an $N$-agent binary hypothesis test in which each agent sequentially makes a decision based not only on a private observation, but also on preceding agents' decisions. In addition, the agents have their own beliefs instead of the true prior, and have nonidentical noise variances in the private signal. We focus on the Bayes risk of the last agent, where preceding agents are selfish.
We first derive the optimal decision rule by recursive belief update and conclude, counterintuitively, that beliefs deviating from the true prior could be optimal in this setting. The effect of nonidentical noise levels in the two-agent case is also considered and analytical properties of the optimal belief curves are given. Next, we consider a predecessor selection problem wherein the subsequent agent of a certain belief chooses a predecessor from a set of candidates with varying beliefs. We characterize the decision region for choosing such a predecessor and argue that a subsequent agent with beliefs varying from the true prior often ends up selecting a suboptimal predecessor, indicating the need for a social planner. Lastly, we discuss an augmented intelligence design problem that uses a model of human behavior from cumulative prospect theory and investigate its near-optimality and suboptimality.
Submitted 5 August, 2019; v1 submitted 23 November, 2018;
originally announced December 2018.
-
Promoting Distributed Trust in Machine Learning and Computational Simulation via a Blockchain Network
Authors:
Nelson Kibichii Bore,
Ravi Kiran Raman,
Isaac M. Markus,
Sekou L. Remy,
Oliver Bent,
Michael Hind,
Eleftheria K. Pissadaki,
Biplav Srivastava,
Roman Vaculin,
Kush R. Varshney,
Komminist Weldemariam
Abstract:
Policy decisions are increasingly dependent on the outcomes of simulations and/or machine learning models. The ability to share and interact with these outcomes is relevant across multiple fields and is especially critical in the disease modeling community where models are often only accessible and workable to the researchers that generate them. This work presents a blockchain-enabled system that establishes a decentralized trust between parties involved in a modeling process. Utilizing the OpenMalaria framework, we demonstrate the ability to store, share and maintain auditable logs and records of each step in the simulation process, showing how to validate results generated by computing workers. We also show how the system monitors worker outputs to rank and identify faulty workers via comparison to nearest neighbors or historical reward spaces as a means of ensuring model quality.
Submitted 25 October, 2018;
originally announced October 2018.
-
Deep Learning vs. Human Graders for Classifying Severity Levels of Diabetic Retinopathy in a Real-World Nationwide Screening Program
Authors:
Paisan Raumviboonsuk,
Jonathan Krause,
Peranut Chotcomwongse,
Rory Sayres,
Rajiv Raman,
Kasumi Widner,
Bilson J L Campana,
Sonia Phene,
Kornwipa Hemarat,
Mongkol Tadarati,
Sukhum Silpa-Acha,
Jirawut Limwattanayingyong,
Chetan Rao,
Oscar Kuruvilla,
Jesse Jung,
Jeffrey Tan,
Surapong Orprayoon,
Chawawat Kangwanwongpaisan,
Ramase Sukulmalpaiboon,
Chainarong Luengchaichawang,
Jitumporn Fuangkaew,
Pipat Kongsap,
Lamyong Chualinpha,
Sarawuth Saree,
Srirat Kawinpanitan
, et al. (7 additional authors not shown)
Abstract:
Deep learning algorithms have been used to detect diabetic retinopathy (DR) with specialist-level accuracy. This study aims to validate one such algorithm on a large-scale clinical population, and compare the algorithm performance with that of human graders. 25,326 gradable retinal images of patients with diabetes from the community-based, nation-wide screening program of DR in Thailand were analyzed for DR severity and referable diabetic macular edema (DME). Grades adjudicated by a panel of international retinal specialists served as the reference standard. Across different severity levels of DR for determining referable disease, deep learning significantly reduced the false negative rate (by 23%) at the cost of slightly higher false positive rates (2%). Deep learning algorithms may serve as a valuable tool for DR screening.
Submitted 18 October, 2018;
originally announced October 2018.
-
Trusted Multi-Party Computation and Verifiable Simulations: A Scalable Blockchain Approach
Authors:
Ravi Kiran Raman,
Roman Vaculin,
Michael Hind,
Sekou L. Remy,
Eleftheria K. Pissadaki,
Nelson Kibichii Bore,
Roozbeh Daneshvar,
Biplav Srivastava,
Kush R. Varshney
Abstract:
Large-scale computational experiments, often running over weeks and over large datasets, are used extensively in fields such as epidemiology, meteorology, computational biology, and healthcare to understand phenomena, and design high-stakes policies affecting everyday health and economy. For instance, the OpenMalaria framework is a computationally-intensive simulation used by various non-governmental and governmental agencies to understand malarial disease spread and effectiveness of intervention strategies, and subsequently design healthcare policies. Given that such shared results form the basis of inferences drawn, technological solutions designed, and day-to-day policies drafted, it is essential that the computations are validated and trusted. In particular, in a multi-agent environment involving several independent computing agents, a notion of trust in results generated by peers is critical in facilitating transparency, accountability, and collaboration. Using a novel combination of distributed validation of atomic computation blocks and a blockchain-based immutable audit mechanism, this work proposes a universal framework for distributed trust in computations. In particular, we address the scalability problem by reducing the storage and communication costs using a lossy compression scheme. This framework guarantees not only verifiability of final results, but also the validity of local computations, and its cost-benefit tradeoffs are studied using a synthetic example of training a neural network.
Submitted 22 September, 2018;
originally announced September 2018.
-
Dynamic Distributed Storage for Scaling Blockchains
Authors:
Ravi Kiran Raman,
Lav R. Varshney
Abstract:
Blockchain uses the idea of storing transaction data in the form of a distributed ledger wherein each node in the network stores a current copy of the sequence of transactions in the form of a hash chain. This requirement of storing the entire ledger incurs a high storage cost that grows undesirably large for high transaction rates and large networks. In this work we use the ideas of secret key sharing, private key encryption, and distributed storage to design a coding scheme such that each node stores only a part of the entire transaction thereby reducing the storage cost to a fraction of its original cost. When further using dynamic zone allocation, we show the coding scheme can also improve the integrity of the transaction data in the network over current schemes. Further, block validation (bitcoin mining) consumes a significant amount of energy as it is necessary to determine a hash value satisfying a specific set of constraints; we show that using dynamic distributed storage reduces these energy costs.
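A minimal illustration of the storage-reduction idea in Python: RAID-style XOR parity, where each of n nodes keeps roughly 1/(n-1) of a block and any one lost shard is recoverable. This is a deliberately weak stand-in for the paper's coding scheme, which additionally uses secret key sharing, encryption, and dynamic zone allocation.

```python
def shard_block(block, n):
    """Split `block` (bytes) into n-1 equal data shards plus one XOR
    parity shard, so each of n nodes stores only a fraction of the block."""
    k = n - 1
    size = -(-len(block) // k)                    # ceil(len / k)
    padded = block.ljust(k * size, b"\0")         # zero-pad to a multiple
    shards = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = shards[0]
    for s in shards[1:]:
        parity = bytes(a ^ b for a, b in zip(parity, s))
    return shards + [parity]

def recover_shard(shards):
    """Rebuild the single missing shard (marked None): XOR of all n
    shards is zero, so the missing one is the XOR of the rest."""
    missing = shards.index(None)
    size = len(next(s for s in shards if s is not None))
    acc = bytes(size)
    for s in shards:
        if s is not None:
            acc = bytes(a ^ b for a, b in zip(acc, s))
    out = list(shards)
    out[missing] = acc
    return out
```

With n = 5, each node stores about a quarter of the block instead of the full ledger entry, at the cost of tolerating only one loss; stronger erasure codes trade these parameters more flexibly.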
Submitted 7 January, 2018; v1 submitted 20 November, 2017;
originally announced November 2017.
-
m-Bonsai: a Practical Compact Dynamic Trie
Authors:
Andreas Poyias,
Simon J. Puglisi,
Rajeev Raman
Abstract:
We consider the problem of implementing a space-efficient dynamic trie, with an emphasis on good practical performance. For a trie with $n$ nodes with an alphabet of size $σ$, the information-theoretic lower bound is $n \log σ + O(n)$ bits. The Bonsai data structure is a compact trie proposed by Darragh et al. (Softw., Pract. Exper. 23(3), 1993, p. 277-291). Its disadvantages include the user having to specify an upper bound $M$ on the trie size in advance (which cannot be changed easily after initialization), a space usage of $M \log σ + O(M \log \log M)$ (which is asymptotically non-optimal for smaller $σ$ or if $n \ll M$) and a lack of support for deletions. It supports traversal and update operations in $O(1/ε)$ expected time (based on assumptions about the behaviour of hash functions), where $ε = (M-n)/M$, and has excellent speed performance in practice. We propose an alternative, m-Bonsai, that addresses the above problems, obtaining a trie that uses $(1+β) n (\log σ + O(1))$ bits in expectation, and supports traversal and update operations in $O(1/β)$ expected time and $O(1/β^2)$ amortized expected time, for any user-specified parameter $β > 0$ (again based on assumptions about the behaviour of hash functions). We give an implementation of m-Bonsai which uses considerably less memory and is slightly faster than the original Bonsai.
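The Bonsai idea, stripped to its core, stores trie nodes only as entries of a hash map keyed by (parent id, symbol), so no child pointers are kept and a child lookup is one expected-constant-time probe. A Python sketch (a dict stands in for the compact open-addressing table that Bonsai and m-Bonsai actually use; class and method names are this sketch's own):

```python
class HashTrie:
    """Toy Bonsai-style trie: a node exists only as an entry
    (parent_id, symbol) -> node_id in a hash map."""

    def __init__(self):
        self.map = {}        # (parent_id, symbol) -> node_id
        self.next_id = 1     # node id 0 is the implicit root

    def child(self, node, symbol):
        """Child of `node` labeled `symbol`, or None; one hash probe."""
        return self.map.get((node, symbol))

    def add_child(self, node, symbol):
        """Create (or return the existing) child; O(1) expected."""
        key = (node, symbol)
        if key not in self.map:
            self.map[key] = self.next_id
            self.next_id += 1
        return self.map[key]

    def insert(self, word):
        node = 0
        for ch in word:
            node = self.add_child(node, ch)
        return node

    def contains_prefix(self, word):
        node = 0
        for ch in word:
            node = self.child(node, ch)
            if node is None:
                return False
        return True
```

Inserting "cat" and "car" creates four nodes (for "c", "ca", "cat", "car"); the shared prefix is stored once, and the map holds exactly one entry per non-root node.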
Submitted 19 April, 2017;
originally announced April 2017.
-
Universal Joint Image Clustering and Registration using Partition Information
Authors:
Ravi Kiran Raman,
Lav R. Varshney
Abstract:
We consider the problem of universal joint clustering and registration of images and define algorithms using multivariate information functionals. We first study registering two images using maximum mutual information and prove its asymptotic optimality. We then show the shortcomings of pairwise registration in multi-image registration, and design an asymptotically optimal algorithm based on multiinformation. Further, we define a novel multivariate information functional to perform joint clustering and registration of images, and prove consistency of the algorithm. Finally, we consider registration and clustering of numerous limited-resolution images, defining algorithms that are order-optimal in the scaling of the number of pixels per image with the number of images.
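The pairwise step can be sketched as an exhaustive search for the shift that maximises the empirical mutual information between overlapping pixels. This is a simplified stand-in for the paper's universal setting; the function names and the small search window are illustrative:

```python
from collections import Counter
from math import log2
import random

def mutual_information(xs, ys):
    """Empirical MI (in bits) between two equal-length symbol sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def overlap(a, b, dy, dx):
    """Pixel pairs (a[y][x], b[y-dy][x-dx]) over the valid overlap region."""
    h, w = len(a), len(a[0])
    xs, ys = [], []
    for y in range(max(0, dy), min(h, h + dy)):
        for x in range(max(0, dx), min(w, w + dx)):
            xs.append(a[y][x]); ys.append(b[y - dy][x - dx])
    return xs, ys

def register(a, b, max_shift=1):
    """Return the shift (dy, dx) maximising empirical mutual information."""
    shifts = [(dy, dx) for dy in range(-max_shift, max_shift + 1)
                       for dx in range(-max_shift, max_shift + 1)]
    return max(shifts, key=lambda s: mutual_information(*overlap(a, b, *s)))
```

At the correct alignment the pairing is a deterministic correspondence, so the empirical MI approaches the image entropy, dominating misaligned shifts.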
Submitted 30 November, 2017; v1 submitted 10 January, 2017;
originally announced January 2017.
-
Universal Clustering via Crowdsourcing
Authors:
Ravi Kiran Raman,
Lav Varshney
Abstract:
Consider unsupervised clustering of objects drawn from a discrete set, through the use of human intelligence available in crowdsourcing platforms. This paper defines and studies the problem of universal clustering using responses of crowd workers, without knowledge of worker reliability or task difficulty. We model stochastic worker response distributions by incorporating traits of memory for similar objects and traits of distance among differing objects. We are particularly interested in two limiting worker types---temporary workers who retain no memory of responses and long-term workers with memory. We first define clustering algorithms for these limiting cases and then integrate them into an algorithm for the unified worker model. We prove asymptotic consistency of the algorithms and establish sufficient conditions on the sample complexity of the algorithm. Converse arguments establish necessary conditions on sample complexity, proving that the defined algorithms are asymptotically order-optimal in cost.
Submitted 5 October, 2016;
originally announced October 2016.
-
Tree Compression with Top Trees Revisited
Authors:
Lorenz Hübschle-Schneider,
Rajeev Raman
Abstract:
We revisit tree compression with top trees (Bille et al, ICALP'13) and present several improvements to the compressor and its analysis. By significantly reducing the amount of information stored and guiding the compression step using a RePair-inspired heuristic, we obtain a fast compressor achieving good compression ratios, addressing an open problem posed by Bille et al. We show how, with relatively small overhead, the compressed file can be converted into an in-memory representation that supports basic navigation operations in worst-case logarithmic time without decompression. We also show a much improved worst-case bound on the size of the output of top-tree compression (answering an open question posed in a talk on this algorithm by Weimann in 2012).
Submitted 15 June, 2015;
originally announced June 2015.
-
SEPIA: Search for Proofs Using Inferred Automata
Authors:
Thomas Gransden,
Neil Walkinshaw,
Rajeev Raman
Abstract:
This paper describes SEPIA, a tool for automated proof generation in Coq. SEPIA combines model inference with interactive theorem proving. Existing proof corpora are modelled using state-based models inferred from tactic sequences. These can then be traversed automatically to identify proofs. The SEPIA system is described and its performance evaluated on three Coq datasets. Our results show that SEPIA provides a useful complement to existing automated tactics in Coq.
Submitted 29 May, 2015;
originally announced May 2015.
-
Detection of texts in natural images
Authors:
Gowtham Rangarajan Raman
Abstract:
A framework that uses connected components and a supervised support vector machine (SVM) to recognise text is proposed. The image is preprocessed and an edge graph is calculated using a probabilistic framework to compensate for photometric noise. Connected components over the resultant image are computed, bounded, and then pruned using geometric constraints. Finally, a Gabor-feature-based SVM is used to classify the presence of text in the candidate regions. The proposed method was tested on the ICDAR 10 dataset and a few other images available on the internet. It achieved recall and precision of 0.72 and 0.88, comfortably better than the benchmark algorithm of Epshtein et al. The proposed method recorded 0.70 and 0.74 on natural images, significantly better than current methods. It also scales almost linearly for high-resolution, cluttered images.
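The connected-component and geometric-pruning stages can be sketched as follows (the Gabor-feature SVM stage is omitted; thresholds and names here are illustrative, not from the paper):

```python
def connected_components(img):
    """4-connected flood fill over a binary image (list of lists of 0/1)."""
    h, w = len(img), len(img[0])
    seen, comps = set(), []
    for sy in range(h):
        for sx in range(w):
            if img[sy][sx] and (sy, sx) not in seen:
                stack, comp = [(sy, sx)], []
                seen.add((sy, sx))
                while stack:
                    y, x = stack.pop()
                    comp.append((y, x))
                    for ny, nx in ((y+1, x), (y-1, x), (y, x+1), (y, x-1)):
                        if 0 <= ny < h and 0 <= nx < w and img[ny][nx] \
                                and (ny, nx) not in seen:
                            seen.add((ny, nx)); stack.append((ny, nx))
                comps.append(comp)
    return comps

def prune(comps, min_area=2, max_aspect=5.0):
    """Geometric constraints: drop specks and extreme aspect-ratio boxes."""
    kept = []
    for comp in comps:
        ys = [y for y, _ in comp]; xs = [x for _, x in comp]
        hh, ww = max(ys) - min(ys) + 1, max(xs) - min(xs) + 1
        if len(comp) >= min_area and max(hh, ww) / min(hh, ww) <= max_aspect:
            kept.append(comp)
    return kept
```

The surviving components would then be scored by the text/non-text classifier.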
Submitted 1 November, 2014;
originally announced November 2014.
-
On Succinct Representations of Binary Trees
Authors:
Pooya Davoodi,
Rajeev Raman,
Srinivasa Rao Satti
Abstract:
We observe that a standard transformation between \emph{ordinal} trees (arbitrary rooted trees with ordered children) and binary trees leads to interesting succinct binary tree representations. There are four symmetric versions of these transformations. Via these transformations we get four succinct representations of $n$-node binary trees that use $2n + n/(\log n)^{O(1)}$ bits and support (among other operations) navigation, inorder numbering, one of pre- or post-order numbering, subtree size and lowest common ancestor (LCA) queries. The ability to support inorder numbering is crucial for the well-known range-minimum query (RMQ) problem on an array $A$ of $n$ ordered values. While this functionality, and more, is also supported in $O(1)$ time using $2n + o(n)$ bits by Davoodi et al.'s (\emph{Phil. Trans. Royal Soc. A} \textbf{372} (2014)) extension of a representation by Farzan and Munro (\emph{Algorithmica} \textbf{6} (2014)), their \emph{redundancy}, or the $o(n)$ term, is much larger, and their approach may not be suitable for practical implementations.
One of these transformations is related to Zaks' sequence (S.~Zaks, \emph{Theor. Comput. Sci.} \textbf{10} (1980)) for encoding binary trees, and we thus provide the first succinct binary tree representation based on Zaks' sequence. Another of these transformations is equivalent to Fischer and Heun's (\emph{SIAM J. Comput.} \textbf{40} (2011)) 2d-Min-Heap structure for this problem. Yet another variant allows an encoding of the Cartesian tree of $A$ to be constructed from $A$ using only $O(\sqrt{n} \log n)$ bits of working space.
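One such transformation is the classical first-child/next-sibling bijection between ordered forests and binary trees, sketched below (the paper's contribution is the succinct encodings built on these transformations, not the bijection itself):

```python
def fcns(forest):
    """Ordered forest -> binary tree: left = first child, right = next sibling.
    A forest is a list of (label, children) pairs; a binary node is
    (label, left, right) and the empty tree is None."""
    if not forest:
        return None
    label, children = forest[0]
    return (label, fcns(children), fcns(forest[1:]))

def fcns_inverse(bt):
    """Recover the ordered forest from its first-child/next-sibling encoding."""
    if bt is None:
        return []
    label, left, right = bt
    return [(label, fcns_inverse(left))] + fcns_inverse(right)
```

The four symmetric variants mentioned above arise from swapping the roles of first/last child and previous/next sibling.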
Submitted 18 October, 2014;
originally announced October 2014.
-
Mining State-Based Models from Proof Corpora
Authors:
Thomas Gransden,
Neil Walkinshaw,
Rajeev Raman
Abstract:
Interactive theorem provers have been used extensively to reason about various software/hardware systems and mathematical theorems. The key challenge when using an interactive prover is that finding a suitable sequence of proof steps that will lead to a successful proof requires a significant amount of human intervention. This paper presents an automated technique that takes as input examples of successful proofs and infers an Extended Finite State Machine as output. This can in turn be used to generate proofs of new conjectures. Our preliminary experiments show that the inferred models are generally accurate (contain few false-positive sequences) and that representing existing proofs in such a way can be very useful when guiding new ones.
Submitted 14 May, 2014;
originally announced May 2014.
-
QPTAS for Geometric Set-Cover Problems via Optimal Separators
Authors:
Nabil H. Mustafa,
Rajiv Raman,
Saurabh Ray
Abstract:
Weighted geometric set-cover problems arise naturally in several geometric and non-geometric settings (e.g. the breakthrough of Bansal-Pruhs (FOCS 2010) reduces a wide class of machine scheduling problems to weighted geometric set-cover). More than two decades of research has succeeded in settling the $(1+ε)$-approximability status for most geometric set-cover problems, except for four basic scenarios which are still lacking. One is that of weighted disks in the plane for which, after a series of papers, Varadarajan (STOC 2010) presented a clever \emph{quasi-sampling} technique, which together with improvements by Chan et al. (SODA 2012), yielded a $O(1)$-approximation algorithm. Even for the unweighted case, a PTAS for a fundamental class of objects called pseudodisks (which includes disks, unit-height rectangles, translates of convex sets etc.) is currently unknown. Another fundamental case is weighted halfspaces in $\Re^3$, for which a PTAS is currently lacking. In this paper, we present a QPTAS for all of these remaining problems. Our results are based on the separator framework of Adamaszek-Wiese (FOCS 2013, SODA 2014), who recently obtained a QPTAS for weighted independent set of polygonal regions. This rules out the possibility that these problems are APX-hard, assuming $\textbf{NP} \nsubseteq \textbf{DTIME}(2^{polylog(n)})$. Together with the recent work of Chan-Grant (CGTA 2014), this settles the APX-hardness status for all natural geometric set-cover problems.
Submitted 5 April, 2014; v1 submitted 4 March, 2014;
originally announced March 2014.
-
Encoding Range Minimum Queries
Authors:
Pooya Davoodi,
Gonzalo Navarro,
Rajeev Raman,
S. Srinivasa Rao
Abstract:
We consider the problem of encoding range minimum queries (RMQs): given an array A[1..n] of distinct totally ordered values, to pre-process A and create a data structure that can answer the query RMQ(i,j), which returns the index containing the smallest element in A[i..j], without access to the array A at query time. We give a data structure whose space usage is 2n + o(n) bits, which is asymptotically optimal for worst-case data, and answers RMQs in O(1) worst-case time. This matches the previous result of Fischer and Heun [SICOMP, 2011], but is obtained in a more natural way. Furthermore, our result can encode the RMQs of a random array A in 1.919n + o(n) bits in expectation, which is not known to hold for Fischer and Heun's result. We then generalize our result to the encoding range top-2 query (RT2Q) problem, which is like the encoding RMQ problem except that the query RT2Q(i,j) returns the indices of both the smallest and second-smallest elements of A[i..j]. We introduce a data structure using 3.272n+o(n) bits that answers RT2Qs in constant time, and also give lower bounds on the effective entropy of RT2Q.
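The structural fact behind such encodings is that every RMQ answer depends only on the topology of the array's Cartesian tree, which takes 2n + o(n) bits in principle, so A itself is not needed at query time. A naive sketch of this fact, without the constant-time machinery:

```python
def cartesian_tree_parents(A):
    """Classic stack-based Cartesian tree construction; returns parent[]
    (root has parent -1). The tree shape alone determines every RMQ answer."""
    n = len(A)
    parent = [-1] * n
    stack = []
    for i in range(n):
        last = -1
        while stack and A[stack[-1]] > A[i]:
            last = stack.pop()
        if last != -1:
            parent[last] = i        # popped run becomes i's left subtree
        if stack:
            parent[i] = stack[-1]   # i becomes the right child of the stack top
        stack.append(i)
    return parent

def rmq(parent, i, j):
    """Position of the minimum of A[i..j] = LCA of i and j in the Cartesian
    tree (inorder = array order); shown here as a naive ancestor walk."""
    anc, x = set(), i
    while x != -1:
        anc.add(x); x = parent[x]
    x = j
    while x not in anc:
        x = parent[x]
    return x
```

Replacing the naive LCA walk by a succinct constant-time LCA structure over the balanced-parentheses encoding of this tree yields the 2n + o(n)-bit solutions discussed above.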
Submitted 18 November, 2013;
originally announced November 2013.
-
Succinct Indices for Range Queries with applications to Orthogonal Range Maxima
Authors:
Arash Farzan,
J. Ian Munro,
Rajeev Raman
Abstract:
We consider the problem of preprocessing $N$ points in 2D, each endowed with a priority, to answer the following queries: given an axis-parallel rectangle, determine the point with the largest priority in the rectangle. Using the ideas of the \emph{effective entropy} of range maxima queries and \emph{succinct indices} for range maxima queries, we obtain a structure that uses O(N) words and answers the above query in $O(\log N \log \log N)$ time. This is a direct improvement of Chazelle's result from FOCS 1985 for this problem -- Chazelle required $O(N/ε)$ words to answer queries in $O((\log N)^{1+ε})$ time for any constant $ε> 0$.
Submitted 21 April, 2012;
originally announced April 2012.
-
Encoding 2-D Range Maximum Queries
Authors:
Mordecai J. Golin,
John Iacono,
Danny Krizanc,
Rajeev Raman,
S. Srinivasa Rao,
Sunil Shende
Abstract:
We consider the \emph{two-dimensional range maximum query (2D-RMQ)} problem: given an array $A$ of ordered values, to pre-process it so that we can find the position of the maximum element in the sub-matrix defined by a (user-specified) range of rows and range of columns. We focus on determining the \emph{effective} entropy of 2D-RMQ, i.e., how many bits are needed to encode $A$ so that 2D-RMQ queries can be answered \emph{without} access to $A$. We give tight upper and lower bounds on the expected effective entropy for the case when $A$ contains independent identically-distributed random values, and new upper and lower bounds for arbitrary $A$, for the case when $A$ contains few rows. The latter results improve upon previous upper and lower bounds by Brodal et al. (ESA 2010). In some cases we also give data structures whose space usage is close to the effective entropy and answer 2D-RMQ queries rapidly.
Submitted 25 April, 2012; v1 submitted 13 September, 2011;
originally announced September 2011.
-
Optimal Indexes for Sparse Bit Vectors
Authors:
Alexander Golynski,
Alessio Orlandi,
Rajeev Raman,
S. Srinivasa Rao
Abstract:
We consider the problem of supporting Rank() and Select() operations on a bit vector of length m containing n 1s. The problem is considered in the succinct index model, where the bit vector is stored in "read-only" memory and an additional data structure, called the index, is created during pre-processing to help answer the above queries. We give asymptotically optimal density-sensitive trade-offs, involving both m and n, that relate the size of the index to the number of accesses to the bit vector (and processing time) needed to answer the above queries. The results are particularly interesting for the case where n = o(m).
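In the succinct index model the bit vector itself is left untouched and only the auxiliary index may be engineered. A minimal sketch (per-block 1-counts only, ignoring the paper's density-sensitive trade-offs) looks like:

```python
import bisect

class RankSelectIndex:
    """Toy succinct index: one cumulative 1-count per block of size b is the
    whole index; the bit vector stays read-only and each query probes at
    most b of its bits."""
    def __init__(self, bits, b=8):
        self.bits, self.b = bits, b
        self.counts = [0]                  # counts[q] = number of 1s before block q
        for i in range(0, len(bits), b):
            self.counts.append(self.counts[-1] + sum(bits[i:i + b]))

    def rank1(self, p):
        """Number of 1s in bits[0:p]: block count plus a short scan."""
        q, r = divmod(p, self.b)
        return self.counts[q] + sum(self.bits[q * self.b : q * self.b + r])

    def select1(self, j):
        """Position of the j-th 1 (1-indexed): find the block, then scan."""
        q = bisect.bisect_left(self.counts, j) - 1
        count, i = self.counts[q], q * self.b
        while count < j:
            count += self.bits[i]; i += 1
        return i - 1
```

Choosing b around log^2 m makes the index o(m) bits while each query reads O(log^2 m) bits of the vector; the paper characterises the optimal version of exactly this kind of trade-off.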
Submitted 10 August, 2011;
originally announced August 2011.
-
Succinct Representations of Permutations and Functions
Authors:
J. Ian Munro,
Rajeev Raman,
Venkatesh Raman,
S. Srinivasa Rao
Abstract:
We investigate the problem of succinctly representing an arbitrary permutation, π, on {0,...,n-1} so that π^k(i) can be computed quickly for any i and any (positive or negative) integer power k. A representation taking (1+ε) n lg n + O(1) bits suffices to compute arbitrary powers in constant time, for any positive constant ε <= 1. A representation taking the optimal $\lceil \lg n! \rceil + o(n)$ bits can be used to compute arbitrary powers in O(lg n / lg lg n) time.
We then consider the more general problem of succinctly representing an arbitrary function, f: [n] \rightarrow [n] so that f^k(i) can be computed quickly for any i and any integer power k. We give a representation that takes (1+ε) n lg n + O(1) bits, for any positive constant ε<= 1, and computes arbitrary positive powers in constant time. It can also be used to compute f^k(i), for any negative integer k, in optimal O(1+|f^k(i)|) time.
We place emphasis on the redundancy, or the space beyond the information-theoretic lower bound that the data structure uses in order to support operations efficiently. A number of lower bounds have recently been shown on the redundancy of data structures. These lower bounds confirm the space-time optimality of some of our solutions. Furthermore, the redundancy of one of our structures "surpasses" a recent lower bound by Golynski [Golynski, SODA 2009], thus demonstrating the limitations of this lower bound.
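The starting point for these representations is the cycle decomposition of π: within a cycle of length L, applying π^k is just a rotation by k mod L. A naive sketch computes powers this way directly (the succinct structures add shortcut pointers on long cycles so the same idea works in compressed space):

```python
def power(pi, k):
    """Compute the permutation pi^k pointwise via cycle decomposition.
    Works for negative k as well, since rotation by k mod L is well defined."""
    n = len(pi)
    out = [0] * n
    seen = [False] * n
    for s in range(n):
        if seen[s]:
            continue
        cyc, x = [], s
        while not seen[x]:          # collect the cycle through s
            seen[x] = True
            cyc.append(x)
            x = pi[x]
        L = len(cyc)
        for idx, y in enumerate(cyc):
            out[y] = cyc[(idx + k) % L]   # rotate within the cycle
    return out
```

Note that k = -1 yields the inverse permutation, which is why the same machinery covers negative powers.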
Submitted 9 August, 2011;
originally announced August 2011.
-
Optimal Trade-Off for Succinct String Indexes
Authors:
Roberto Grossi,
Alessio Orlandi,
Rajeev Raman
Abstract:
Let s be a string whose symbols are solely available through access(i), a read-only operation that probes s and returns the symbol at position i in s. Many compressed data structures for strings, trees, and graphs, require two kinds of queries on s: select(c, j), returning the position in s containing the jth occurrence of c, and rank(c, p), counting how many occurrences of c are found in the first p positions of s. We give matching upper and lower bounds for this problem, improving the lower bounds given by Golynski [Theor. Comput. Sci. 387 (2007)] [PhD thesis] and the upper bounds of Barbay et al. [SODA 2007]. We also present new results in another model, improving on Barbay et al. [SODA 2007] and matching a lower bound of Golynski [SODA 2009]. The main contribution of this paper is to introduce a general technique for proving lower bounds on succinct data structures, based on the access patterns of the supported operations, abstracting from the particular operations at hand. As such, it may find application to other interesting problems on succinct data structures.
Submitted 28 June, 2010;
originally announced June 2010.
-
Random Access to Grammar Compressed Strings
Authors:
Philip Bille,
Gad M. Landau,
Rajeev Raman,
Kunihiko Sadakane,
Srinivasa Rao Satti,
Oren Weimann
Abstract:
Grammar based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures many popular compression schemes. In this paper, we present a novel grammar representation that allows efficient random access to any character or substring without decompressing the string.
Let $S$ be a string of length $N$ compressed into a context-free grammar $\mathcal{S}$ of size $n$. We present two representations of $\mathcal{S}$ achieving $O(\log N)$ random access time, and either $O(n\cdot α_k(n))$ construction time and space on the pointer machine model, or $O(n)$ construction time and space on the RAM. Here, $α_k(n)$ is the inverse of the $k^{th}$ row of Ackermann's function. Our representations also efficiently support decompression of any substring in $S$: we can decompress any substring of length $m$ in the same complexity as a single random access query and additional $O(m)$ time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern $P$ with at most $k$ errors in time $O(n(\min\{|P|k, k^4 + |P|\} + \log N) + occ)$, where $occ$ is the number of occurrences of $P$ in $S$. Finally, we generalize our results to navigation and other operations on grammar-compressed ordered trees.
All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy paths in grammars.
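The basic descent that these representations refine can be sketched for a straight-line program by storing each nonterminal's expansion length; this alone gives access in time proportional to the grammar's height, and the paper's O(log N) bound comes from layering the heavy-path and biased-ancestor machinery on top:

```python
def expansion_lengths(rules):
    """rules maps a nonterminal to either a terminal string or a (Y, Z) pair.
    Returns the expansion length of every nonterminal (memoised)."""
    memo = {}
    def length(X):
        if X not in memo:
            r = rules[X]
            memo[X] = len(r) if isinstance(r, str) else length(r[0]) + length(r[1])
        return memo[X]
    for X in rules:
        length(X)
    return memo

def access(rules, lengths, X, i):
    """Return character i (0-based) of the string derived from X, without
    decompressing: at each rule, go left or right by comparing i against
    the left child's expansion length."""
    while True:
        r = rules[X]
        if isinstance(r, str):
            return r[i]
        Y, Z = r
        if i < lengths[Y]:
            X = Y
        else:
            i -= lengths[Y]
            X = Z
```

Substring decompression follows the same descent to position i and then expands the next m characters, matching the O(access + m) bound stated above.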
Submitted 29 October, 2013; v1 submitted 11 January, 2010;
originally announced January 2010.