Fourth South American Workshop on String Processing (WSP 1997)

Ricardo Baeza-Yates (Ed.)
Copyright Date: 1997
Pages: 206
    Book Description:

    We use string processing to denote any use of computers to process and manage strings or sequences of symbols. This includes text retrieval, compression, computational biology, natural language processing, word theory, etc. Strings can also be extended to other dimensions, including images and complex objects, such as trees or graphs. These areas are important for many applications, including text, image or genetic databases. Nowadays, the most important motivation for research is searching and managing the World Wide Web. The Web contains terabytes of data and searching for information is becoming as difficult as finding a needle in a haystack. Future versions of this workshop will focus on generic information retrieval, query languages, user interfaces and visualization tools.

    eISBN: 978-0-7735-9140-0
    Subjects: General Science

Table of Contents

  1. Front Matter
    (pp. i-iv)
  2. Table of Contents
    (pp. v-vi)
    (pp. vii-viii)
    Ricardo Baeza-Yates
  4. Generalized Pattern Matching: the Case of Swaps (abstract of invited talk)
    (pp. 1-1)
    Amihood Amir
  5. Large Text Searching Allowing Errors
    (pp. 2-20)
    Márcio Drumond Araújo, Gonzalo Navarro and Nivio Ziviani

    The full text model in information retrieval (IR) is gaining popularity. In this model, documents are represented by their complete full texts. The user expresses his information needs by providing strings to be matched and the information system retrieves those documents containing the user specified strings. When the text collection is large it demands specialized index techniques for efficient text retrieval. A simple and popular indexing technique is the inverted list. It is especially adequate when the pattern to be searched for is formed by simple words. This is a common type of query, for instance when searching the World...
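
    The inverted list idea mentioned in this abstract can be sketched in a few lines of Python. This is only a minimal illustration of word-based indexing, not the index structure of the paper; the names `build_inverted_index` and `search_all_words` and the sample documents are hypothetical.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased word to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search_all_words(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


# Hypothetical sample collection, for illustration only.
docs = ["the web is large", "searching the web", "string processing"]
index = build_inverted_index(docs)
print(search_all_words(index, "the web"))
```

    A query made of simple words is answered by intersecting the lists of its words, which is why this kind of index suits word queries well.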

  6. Proximity Queries In Metric Spaces
    (pp. 21-36)
    Edgar Chávez and José Luis Marroquín

    In this work we will analyze the problem of satisfying proximity queries in general metric spaces. Seminal papers leading to, in a sense, optimal algorithms have been written since the very formulation of the problem. However, those algorithms use large amounts of memory (see below), and as new computer applications are developed, more competitive algorithms are demanded. Among the difficulties arising are the very large number of elements in the data set, the high dimensionality of the data and the absence of coordinates for indexing the data set. Our goal is to show that there is a simple, yet powerful,...
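
    One standard way to answer a range query when only a distance function is available is to prune candidates with the triangle inequality against a precomputed pivot. The Python sketch below illustrates that general idea only; it is an assumption for illustration and not the method of the paper, and the names `hamming`, `build_pivot_table` and `range_query` are hypothetical.

```python
def hamming(a, b):
    """Hamming distance between equal-length strings (a metric)."""
    return sum(x != y for x, y in zip(a, b))


def build_pivot_table(points, pivot, dist):
    """Precompute the distance from the pivot to every point."""
    return [dist(pivot, x) for x in points]


def range_query(points, pivot, pivot_dists, dist, query, radius):
    """Return points within `radius` of `query` and the number of distance evaluations."""
    dq = dist(query, pivot)
    evaluations = 1
    result = []
    for x, dx in zip(points, pivot_dists):
        # Triangle inequality: d(query, x) >= |d(query, pivot) - d(pivot, x)|,
        # so x can be discarded without computing d(query, x).
        if abs(dq - dx) > radius:
            continue
        evaluations += 1
        if dist(query, x) <= radius:
            result.append(x)
    return result, evaluations


# Hypothetical data set, for illustration only.
words = ["cat", "cot", "dog", "dig", "cab"]
table = build_pivot_table(words, "cat", hamming)
print(range_query(words, "cat", table, hamming, "cut", 1))
```

    The pivot table costs one stored distance per element, which hints at the memory trade-off the abstract refers to.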

  7. Suffix Tree Constructions: New Techniques and Optimal Algorithms (abstract of invited talk)
    (pp. 37-37)
    Martin Farach
  8. A General Technique to Improve Filter Algorithms for Approximate String Matching
    (pp. 38-52)
    Robert Giegerich, Frank Hischke, Stefan Kurtz and Enno Ohlebusch

    The problem of approximate string matching is stated as follows: given a database string T, a query string P, a threshold k, find all approximate matches, i.e., all subwords v of T whose edit distance to P is at most k. The dynamic programming approach [10] provides the general solution, and is still unbeaten w.r.t. its versatility. Its running time, however, is O(mn), where m = |P| and n = |T|. This is impractical for large-scale applications like biosequence analysis, where the size of the gene and protein databases (i.e. T) grows exponentially due to the advances in sequencing technology....
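
    The O(mn) dynamic programming approach referred to above can be sketched as follows. This is a minimal column-by-column version that reports the end positions of approximate matches, assuming unit costs for insertion, deletion and replacement; the function name `approximate_match_ends` is hypothetical, and the filter technique of the paper is not shown.

```python
def approximate_match_ends(pattern, text, k):
    """Return every index j such that some subword of text ending at j
    has edit distance at most k to pattern. Runs in O(mn) time."""
    m = len(pattern)
    # column[i] = min edit distance between pattern[:i] and any subword
    # of the text ending at the current position (empty text so far).
    column = list(range(m + 1))
    ends = []
    for j, c in enumerate(text):
        prev_diag = column[0]  # column[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            old = column[i]
            cost = 0 if pattern[i - 1] == c else 1
            column[i] = min(prev_diag + cost,   # match or replacement
                            old + 1,            # extra text character
                            column[i - 1] + 1)  # missing pattern character
            prev_diag = old
        if column[m] <= k:
            ends.append(j)
    return ends


print(approximate_match_ends("abc", "xxabcxx", 1))
```

    Only one column of the table is kept, so the space is O(m) while the time remains O(mn), the cost that motivates filtering.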

  9. Distributed Generation of Suffix Arrays: a Quicksort-Based Approach
    (pp. 53-69)
    João Paulo Kitajima, Gonzalo Navarro, Berthier A. Ribeiro-Neto and Nivio Ziviani

    We present a new algorithm for distributed parallel generation of large suffix arrays in the context of a high bandwidth network of processors. The motivation is three-fold. First, the high cost of the best known sequential algorithm for suffix array generation leads naturally to the exploration of parallel algorithms for solving the problem. Second, the use of a set of processors (for example, connected by a fast switch like ATM) as a parallel machine is an attractive alternative nowadays [1]. Third, the final index can be left distributed to reduce the query time overhead. The distributed algorithm we propose is...
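
    For readers unfamiliar with suffix arrays, the Python sketch below builds one sequentially by plain sorting and searches it by binary search. It is a naive illustration of the data structure only, not the distributed quicksort-based algorithm of the paper; `build_suffix_array` and `find_occurrences` are hypothetical names.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes in lexicographic order.
    Naive construction: sorting compares whole suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions of pattern, found by two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])


text = "banana"
sa = build_suffix_array(text)
print(sa, find_occurrences(text, sa, "ana"))
```

    All occurrences of a pattern occupy a contiguous range of the array, which is what makes a query cheap once the expensive construction is done.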

  10. Transposition distance between a permutation and its reverse
    (pp. 70-79)
    João Meidanis, Maria Emilia M. T. Walter and Zanoni Dias

    The huge amount of data resulting from genome sequencing in Molecular Biology is giving rise to an increasing interest in the development of algorithms for comparing genomes of related species. Particularly these data prompted research on mutational events acting on large portions of the chromosomes. Such events can be used to compare genomes for which the traditional methods of comparing DNA sequences are not conclusive. The field originated by the study of large mutations on chromosomes is known as genome rearrangements.

    There are several mutational events affecting large fragments of genomes of organisms, including duplication, reversal, transposition (acting on a...

  11. Practical Use of The Warm-up Algorithm on Length-Restricted Coding
    (pp. 80-94)
    Ruy Luiz Milidiú, Artur Alves Pessoa and Eduardo Sany Laber

    An important problem in the field of Coding and Information Theory is the Binary Prefix Code Problem. Given an alphabet $\Sigma = \{a_1, \ldots, a_n\}$ and a corresponding set of positive weights $\{w_1, \ldots, w_n\}$, the problem is to find a prefix code for $\Sigma$ that minimizes the weighted length of a code string, defined to be $\sum_{i=1}^{n} w_i l_i$, where $l_i$ is the length of the codeword assigned to $a_i$. This problem is equivalent to finding a full binary tree T where each leaf corresponds to a symbol $a_i$ from $\Sigma$ and where the weighted path length $\sum_{i=1}^{n} w_i l_i$ is minimal. In this case, $l_i$ is the level of the...
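
    The weighted length defined above can be made concrete with a short Python sketch. It uses the classical Huffman construction for the unrestricted problem, swapped in here purely for illustration; it is not the Warm-up algorithm and imposes no length restriction. The names `huffman_code_lengths` and `weighted_length` are hypothetical.

```python
import heapq


def huffman_code_lengths(weights):
    """Return codeword lengths l_i of an optimal (unrestricted) binary prefix code."""
    n = len(weights)
    if n == 0:
        return []
    if n == 1:
        return [1]
    lengths = [0] * n
    # Heap entries: (subtree weight, unique tie-breaker, symbols in the subtree).
    heap = [(w, i, [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    counter = n
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for symbol in s1 + s2:
            lengths[symbol] += 1  # merged symbols move one level deeper
        heapq.heappush(heap, (w1 + w2, counter, s1 + s2))
        counter += 1
    return lengths


def weighted_length(weights, lengths):
    """Compute the sum of w_i * l_i over all symbols."""
    return sum(w * l for w, l in zip(weights, lengths))


weights = [1, 1, 2, 4]
lengths = huffman_code_lengths(weights)
print(lengths, weighted_length(weights, lengths))
```

    In the length-restricted variant of the problem, an additional bound on every $l_i$ would have to hold, which plain Huffman coding does not guarantee.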

  12. Indexing Compressed Text
    (pp. 95-111)
    Edleno S. de Moura, Gonzalo Navarro and Nivio Ziviani

    The amount of textual information available worldwide has experienced impressive growth in recent years. The widespread use of digital libraries, office automation systems and document databases are some examples of the kind of requirements that are becoming commonplace. Phenomena like the World Wide Web and indexing mechanisms over the Internet definitely feed this explosion of textual information electronically available. Therefore, compression always appears as an attractive choice, if not a mandatory one. However, the combination of text compression and the retrieval requirements of textual databases does not always succeed. Because of this, many textual database schemes do not compress the...

  13. A Partial Deterministic Automaton for Approximate String Matching
    (pp. 112-124)
    Gonzalo Navarro

    Approximate string matching is one of the main problems in classical string algorithms, with applications to text searching, computational biology, pattern recognition, etc.

    The problem is defined as follows: given a text of length n and a pattern of length m (both sequences over an alphabet Σ of size σ), and given a maximal number of errors allowed, k, we want to find all text positions where the pattern matches the text up to k errors. Errors can be replacing, deleting or inserting a character. We call α = k/m the error ratio. We are interested in the on-line problem,...

  14. Multiple Approximate String Matching by Counting
    (pp. 125-139)
    Gonzalo Navarro

    A number of important problems related to string processing lead to algorithms for approximate string matching: text searching, pattern recognition, computational biology, audio processing, etc.

    The edit distance between two strings a and b, ed(a,b), is defined as the minimum number of edit operations that must be carried out to make them equal. The allowed operations are insertion, deletion and replacement of characters in a or b. The problem of approximate string matching is defined as follows: given a text of length n, and a pattern of length m, both being sequences over an alphabet Σ of size σ, find...

  15. Asymptotic estimation of the average number of terminal states in DAWGs
    (pp. 140-148)
    Mathieu Raffinot

    Suffix directed acyclic word graphs (DAWGs) are very useful for textual pattern matching and lead to very fast algorithms, like BDM for one pattern, or MultiBDM for several patterns (see [3]). Studies have been undertaken to calculate their sizes in terms of number of nodes and edges, so as to predict the maximal or average space needed by the algorithms that use them (see [1]) and to demonstrate some of their properties. This paper takes place in this context. We give an asymptotic estimation of the number of terminal states of a DAWG under a Bernoulli model. This estimation is...

  16. On the multi backward dawg matching algorithm (MultiBDM)
    (pp. 149-165)
    Mathieu Raffinot

    We consider the multi pattern matching problem: find all occurrences of a set $P = \{P_1, P_2, \ldots, P_K\}$ of patterns, which are strings defined over a fixed alphabet Σ, in a text T of length n, defined on the same alphabet. This classical problem has many applications, for example in data mining or in bibliographic search to find selected patterns, in security applications to detect suspicious keywords, in genomics to analyze DNA chains.

    Many algorithms exist for solving this problem, of which Watson and Zwaan propose a taxonomy [12]. The most famous, and the first having a linear behavior, has been presented by Aho...
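
    To make the multi pattern matching problem concrete, the Python sketch below finds all occurrences of a set of patterns with an Aho-Corasick style automaton. That technique is chosen here only as a well-known linear-time illustration; it is not the MultiBDM algorithm of the paper. The names `build_automaton` and `find_all` are hypothetical, and non-empty patterns are assumed.

```python
from collections import deque


def build_automaton(patterns):
    """Build a trie with failure links and per-state output lists."""
    goto, fail, output = [{}], [0], [[]]
    for p in patterns:
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(p)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure state.
            output[nxt] = output[nxt] + output[fail[nxt]]
            queue.append(nxt)
    return goto, fail, output


def find_all(patterns, text):
    """Return sorted (start position, pattern) pairs for every occurrence."""
    goto, fail, output = build_automaton(patterns)
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for p in output[state]:
            matches.append((i - len(p) + 1, p))
    return sorted(matches)


print(find_all(["he", "she", "his", "hers"], "ushers"))
```

    The text is scanned once from left to right, so the search time is linear in the text length plus the number of reported occurrences.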

  17. Approaching the dictionary in the implementation of a natural language processing system: toward a distributed structure
    (pp. 166-178)
    Vera Lúcia Strube de Lima, Paulo Ricardo Carneiro Abrahão and Ivandré Paraboni

    The need for tools in natural language processing (NLP) is increasing while computers are getting cheaper and more powerful. A structure that appears at the core of those tools is the dictionary. There are many different relationships between a lexicon and corpus: a dictionary can be regarded as a text in itself, while it can be viewed as a major tool in the text analysis process. In a broader sense, as the notion proposed in [Wilks 96], a dictionary can combine readability (by a computing system) with suitability (for NLP tasks).

    The multiple forms of linguistic knowledge have to be...

  18. String Databases and Finite Multitape Automata
    (pp. 179-179)
    Esko Ukkonen
  19. An algorithm for graph pattern-matching
    (pp. 180-197)
    Gabriel Valiente and Conrado Martinez

    This paper deals with graph pattern-matching, the problem of finding a homomorphic (or isomorphic) image of a given graph, called the pattern, in another graph, called the target, and it is also known as the subgraph homomorphism (or subgraph isomorphism) problem. As a generalization of string matching and two-dimensional pattern-matching, it offers a natural framework for the study of matching problems upon multi-dimensional structures.

    A main drawback of graph pattern-matching, however, lies in its inherent computational complexity. The subgraph isomorphism problem is known to be NP-complete [6] and, as a matter of fact, a naive graph pattern-matching algorithm, which generates...
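
    A brute-force search shows why the problem is expensive in practice. The Python sketch below enumerates every injective mapping of pattern nodes to target nodes and keeps those that preserve all pattern edges. It assumes undirected graphs given as node and edge lists, is not the algorithm of the paper, and uses the hypothetical name `find_subgraph_embeddings`.

```python
from itertools import permutations


def find_subgraph_embeddings(pattern_nodes, pattern_edges, target_nodes, target_edges):
    """Return every injective node mapping under which each pattern edge
    is mapped onto a target edge (undirected graphs assumed)."""
    target_edge_set = {frozenset(e) for e in target_edges}
    embeddings = []
    # Exponential enumeration: one candidate per ordered choice of target nodes.
    for image in permutations(target_nodes, len(pattern_nodes)):
        mapping = dict(zip(pattern_nodes, image))
        if all(frozenset((mapping[u], mapping[v])) in target_edge_set
               for u, v in pattern_edges):
            embeddings.append(mapping)
    return embeddings


# Hypothetical example: a triangle pattern in a square with one diagonal.
triangle_nodes = [0, 1, 2]
triangle_edges = [(0, 1), (1, 2), (0, 2)]
square_nodes = ["a", "b", "c", "d"]
square_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")]
found = find_subgraph_embeddings(triangle_nodes, triangle_edges, square_nodes, square_edges)
print(len(found))
```

    The number of candidate mappings grows factorially with the size of the pattern, so any practical algorithm has to prune this search space.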

  20. Back Matter
    (pp. 198-200)