Search and Retrieval in Massive Data Collections

Pedro Omar Contreras Albornoz

(2010)

Pedro Omar Contreras Albornoz (2010) Search and Retrieval in Massive Data Collections.


Full text access: Open


Abstract

The main goal of this research is to produce a novel and efficient searching application by means of best match and proximity searching, with particular application to very large numeric and textual data stores. In today's world a huge amount of information is produced. Almost every part of our society is touched by systems that collect, store and analyse data. As an example I mention the case of scientific instrumentation: new sensors capture massive amounts of information (e.g. new telescopes acquiring data from different regions of the spectrum). Descriptions of biological and chemical interactions also produce large and complex amounts of data. This presents a big challenge for current analysis algorithms: many of the traditional methods for data analysis scale well neither to massive data sets nor to very high dimensional spaces. In this work I introduce a novel (ultrametric) distance called Baire, based on the longest common prefix, and show how it can be used to produce clusters by grouping data in 'bins' in linear, or O(n), computational time. Furthermore, it follows that this distance can be strictly fitted to a hierarchy tree. This is a property that proves very useful for classifying, storing, accessing and retrieving information. I go further to apply this methodology to data from different scientific areas, such as astronomy and chemistry, to create groups or clusters. Additionally, I apply this method to document sets for clustering and retrieval. In particular, I look into the new area of enterprise search to propose a new method to support scalable search and clustering.
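
The idea of a longest-common-prefix distance and linear-time binning can be illustrated with a minimal Python sketch. This is not code from the thesis: the function names, the representation of each value as a digit string of equal length, and the choice of distance base^(-k) for a shared prefix of length k (with base 2 as default) are all assumptions made for illustration only.

```python
from collections import defaultdict


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two digit strings."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def baire_distance(a: str, b: str, base: float = 2.0) -> float:
    """Assumed formulation: base**(-k), k = shared prefix length.

    Identical strings are at distance 0; strings that differ in
    their first digit are at the maximum distance 1.
    """
    if a == b:
        return 0.0
    return base ** -common_prefix_len(a, b)


def bin_by_prefix(values, precision: int) -> dict:
    """Group values sharing their first `precision` digits.

    A single pass over the data, so the cost is O(n) for a fixed
    precision. Increasing the precision splits each bin into
    sub-bins, which gives the levels of a hierarchy.
    """
    bins = defaultdict(list)
    for v in values:
        bins[v[:precision]].append(v)
    return dict(bins)


# Hypothetical sample: fractional digits of numbers in [0, 1).
sample = ["425", "427", "431", "111"]
coarse = bin_by_prefix(sample, 1)  # bins keyed by first digit
fine = bin_by_prefix(sample, 2)    # bins keyed by first two digits
```

In this sketch every pair of values within a bin of precision k is at distance at most base^(-k), which is why a single pass that reads prefixes is enough to form the clusters without computing pairwise distances.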

Information about this Version

This is an Accepted version
This version's date is: 05/2010
This item is not peer reviewed

Link to this Version

https://repository.royalholloway.ac.uk/items/963baeb9-030e-4055-ba3a-b63bcb9bf06e/1/

Item Type: Thesis (Doctoral)
Title: Search and Retrieval in Massive Data Collections
Authors: Albornoz, Pedro
Uncontrolled Keywords: search; searching; retrieval; data collections
Departments: Faculty of Science\Computer Science

Deposited by Leanne Workman (UXYL007) on 05-May-2015 in Royal Holloway Research Online. Last modified on 05-Feb-2017.

Notes

©2010 Pedro Omar Contreras Albornoz. Short sections of text, not to exceed two paragraphs, may be quoted without explicit permission provided that full credit, including © notice, is given to the source.

