In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases. In a full-text search, a search engine examines all of the words in every stored document as it tries to match the search criteria. Full-text-searching techniques became common in online bibliographic databases in the 1990s. Many websites and application programs provide full-text-search capabilities; some web search engines, such as AltaVista, employ full-text-search techniques, while others index only a portion of the web pages examined by their indexing systems. When dealing with a small number of documents, it is possible for the full-text-search engine to directly scan the contents of the documents with each query, a strategy called "serial scanning". However, when the number of documents to search is large, or the quantity of search queries to perform is substantial, the problem of full-text search is divided into two tasks: indexing and searching.
The indexing stage builds a list of search terms. In the search stage, when performing a specific query, only the index is referenced, rather than the text of the original documents. The indexer makes an entry in the index for each term or word found in a document and notes its relative position within the document. The indexer will typically ignore stop words, words that are both common and insufficiently meaningful to be useful in searching; some indexers also employ language-specific stemming on the words being indexed. For example, the words "drives", "drove", and "driven" will be recorded in the index under the single concept word "drive". Recall measures the quantity of relevant results returned by a search, while precision measures the quality of the results returned. Recall is the ratio of relevant results returned to all relevant results. Precision is the ratio of relevant results returned to the total number of results returned. The diagram at right represents a low-recall search. In the diagram, the red and green dots represent the total population of potential search results for a given search.
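The indexing stage described above can be sketched in a few lines of Python. This is a minimal illustration, not a production indexer: the corpus, the stop-word list, and the stemming table (which hard-codes the irregular forms from the example; a real indexer would use a proper stemming algorithm) are all invented for demonstration.

```python
# Toy inverted index: for each term, record the documents and word
# positions where it occurs, skipping stop words and folding inflected
# forms into a single concept word.
STOP_WORDS = {"the", "a", "an", "of", "in", "was", "and"}
STEMS = {"drives": "drive", "drove": "drive", "driven": "drive"}

def build_index(docs):
    """docs: {doc_id: text} -> {term: [(doc_id, position), ...]}"""
    index = {}
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            if word in STOP_WORDS:
                continue  # too common to be useful in searching
            term = STEMS.get(word, word)
            index.setdefault(term, []).append((doc_id, pos))
    return index

docs = {1: "she drives a truck", 2: "the truck was driven home"}
index = build_index(docs)
print(index["drive"])  # → [(1, 1), (2, 3)]
print(index["truck"])  # → [(1, 3), (2, 1)]
```

At query time only this index is consulted: both "drives" and "driven" are recorded under the concept word "drive", so a query for either form can be answered from the index without rescanning the documents.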
Red dots represent irrelevant results, and green dots represent relevant results. Relevancy is indicated by the proximity of search results to the center of the inner circle. Of all possible results shown, those that were returned by the search are shown on a light-blue background. In the example, only 1 relevant result of 3 possible relevant results was returned, so the recall is a low ratio of 1/3, or 33%; the precision for the example is a low 1/4, or 25%, since only 1 of the 4 results returned was relevant. Due to the ambiguities of natural language, full-text-search systems typically include options such as stop words to increase precision and stemming to increase recall. Controlled-vocabulary searching helps alleviate low-precision issues by tagging documents in such a way that ambiguities are eliminated. The trade-off between precision and recall is simple: an increase in precision can lower overall recall, while an increase in recall can lower precision. Full-text searching is likely to retrieve many documents that are not relevant to the intended search question.
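Worked numerically, the recall and precision figures in this example reduce to simple set arithmetic (the result identifiers below are invented for illustration):

```python
# Reproducing the example's figures: 3 relevant results exist in the
# collection, the search returned 4 results, and only 1 of them is relevant.
relevant = {"r1", "r2", "r3"}        # all relevant results that exist
returned = {"r1", "x1", "x2", "x3"}  # results actually returned

hits = relevant & returned            # relevant results that were returned
recall = len(hits) / len(relevant)    # 1/3
precision = len(hits) / len(returned) # 1/4
print(f"recall = {recall:.0%}, precision = {precision:.0%}")
# → recall = 33%, precision = 25%
```

Three of the four returned results are thus irrelevant to the query.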
Such documents are called false positives. The retrieval of irrelevant documents is caused by the inherent ambiguity of natural language. In the sample diagram at right, false positives are represented by the irrelevant results that were returned by the search. Clustering techniques based on Bayesian algorithms can help reduce false positives. For a search term of "bank", clustering can be used to categorize the document/data universe into "financial institution", "place to sit", "place to store", etc. Depending on the occurrences of words relevant to the categories, search terms or a search result can be placed in one or more of the categories; this technique is extensively deployed in the e-discovery domain. The deficiencies of free-text searching have been addressed in two ways: by providing users with tools that enable them to express their search questions more precisely, and by developing new search algorithms that improve retrieval precision. Keywords: document creators are asked to supply a list of words that describe the subject of the text, including synonyms of words that describe this subject.
Keywords improve recall if the keyword list includes a search word that does not appear in the document text. Field-restricted search: some search engines enable users to limit free-text searches to a particular field within a stored data record, such as "Title" or "Author". Boolean queries: searches that use Boolean operators can increase the precision of a free-text search. The AND operator says, in effect, "Do not retrieve any document unless it contains both of these terms." The NOT operator says, in effect, "Do not retrieve any document that contains this word." If the retrieval list returns too few documents, the OR operator can be used to increase recall; for example, a search for documents about online encyclopedias can add "Internet" with OR so as to also retrieve documents that use the term "Internet" instead of "online". Increasing precision in these ways is commonly counter-productive, since it comes at the cost of recall.
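Assuming a small hypothetical inverted index that maps each term to the set of documents containing it, the Boolean operators above correspond directly to set operations (the document numbers here are invented):

```python
# Hypothetical posting sets for a few terms.
index = {
    "encyclopedia": {1, 2, 3},
    "online": {1, 4},
    "internet": {2, 4},
    "wikipedia": {3},
}

def docs_for(term):
    return index.get(term, set())

# AND narrows the result set, increasing precision:
print(docs_for("encyclopedia") & docs_for("online"))    # → {1}
# NOT excludes documents containing a term:
print(docs_for("encyclopedia") - docs_for("wikipedia"))  # → {1, 2}
# OR widens the net, increasing recall: "online" OR "internet".
print(docs_for("encyclopedia") & (docs_for("online") | docs_for("internet")))
# → {1, 2}
```

The OR query retrieves document 2, which uses "internet" rather than "online", at the cost of admitting any document that matches either term.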
A concordance is an alphabetical list of the principal words used in a book or body of work, listing every instance of each word with its immediate context. Because of the time and expense involved in creating a concordance in the pre-computer era, only works of special importance had concordances prepared for them, such as the Vedas, the Qur'an, or the works of Shakespeare, James Joyce, or classical Latin and Greek authors. A concordance is more than an index. In the pre-computing era, when search technology was unavailable, a concordance offered readers of long works such as the Bible something comparable to search results for every word that they might have wished to search for. Today, the ability to combine the results of queries concerning multiple terms has reduced interest in concordance publishing. In addition, mathematical techniques such as latent semantic indexing have been proposed as a means of automatically identifying linguistic information based on word context. A bilingual concordance is a concordance based on aligned parallel text.
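A minimal concordancer of the kind described above can be sketched in a few lines: it lists every instance of a word together with its immediate context, in the keyword-in-context style. The sample text and context window are illustrative.

```python
def concordance(text, word, window=2):
    """List every occurrence of `word` with `window` words of context."""
    words = text.lower().split()
    lines = []
    for i, w in enumerate(words):
        if w == word:
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            lines.append(f"{left} [{w}] {right}")
    return lines

text = "to be or not to be that is the question"
for line in concordance(text, "be"):
    print(line)
# → to [be] or not
# → not to [be] that is
```

A printed concordance precomputes exactly this listing for every principal word in a work, which is what made one so expensive to compile by hand.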
A topical concordance is a list of subjects that a book covers, with the immediate context of the coverage of those subjects. Unlike in a traditional concordance, the indexed word does not have to appear in the verse; the best-known topical concordance is Nave's Topical Bible. The first Bible concordance was compiled for the Vulgate Bible by Hugh of St Cher, who employed 500 monks to assist him. In 1448, Rabbi Mordecai Nathan completed a concordance to the Hebrew Bible; it took him ten years. A concordance to the Greek New Testament was published in 1599 by Henry Stephens, and one to the Septuagint was completed a few years later, by Conrad Kircher in 1602. The first concordance to the English Bible was published in 1550 by Mr Marbeck. According to Cruden, it did not employ the verse numbers devised by Robert Stephens in 1545, but "the pretty large concordance" of Mr Cotton did. These were followed by Cruden's Concordance and Strong's Concordance. Concordances are used in linguistics when studying a text, for example: comparing different usages of the same word; analysing keywords; analysing word frequencies; finding and analysing phrases and idioms; finding translations of subsentential elements, e.g. terminology, in bitexts and translation memories; and creating indexes and word lists. Concordancing techniques are used in national text corpora such as the American National Corpus, the British National Corpus, and the Corpus of Contemporary American English, available online.
Stand-alone applications that employ concordancing techniques are known as concordancers or, when more advanced, corpus managers. Some of them have integrated part-of-speech taggers and enable users to create their own POS-annotated corpora to conduct the various types of searches adopted in corpus linguistics. The reconstruction of the text of some of the Dead Sea Scrolls involved a concordance. Access to some of the scrolls was governed by a "secrecy rule" that allowed only the original International Team or their designates to view the original materials. After the death of Roland de Vaux in 1971, his successors refused to allow the publication of photographs to other scholars. This restriction was circumvented in 1991 by Martin Abegg, who used a computer to "invert" a concordance of the missing documents, made in the 1950s, which had come into the hands of scholars outside the International Team, and thereby obtained an approximate reconstruction of the original text of 17 of the documents. This was soon followed by the release of the original text of the scrolls.
See also: Index, A Vedic Word Concordance, Bible concordance, Cross-reference, Key Word in Context, Text mining.
Shakespeare concordance - a concordance of Shakespeare's complete works.
Online Concordance to the Complete Works of Hryhorii Skovoroda - a concordance to Hryhorii Skovoroda's complete works.
Alex Catalogue of Electronic Texts - a collection of public-domain electronic texts from American and English literature as well as Western philosophy. Each of the 14,000 items in the Catalogue is available as full text, complete with a concordance, so that readers can count the number of times a particular word is used in a text or list the most common words.
Hyper-Concordance, Mitsu Matsuoka, Nagoya University - a program written in C++ that scans and displays lines based on a command entered by the user; includes Victorian, British & Irish, and American literatures.
Concord - page includes a link to Concord, an on-the-fly KWIC concordance generator that works with at least some non-Latin scripts and offers multiple choices for sorting results.
ConcorDance - a concordance interface to the World Wide Web; it uses Google's or Yahoo's search engine to find concordances and can be used directly from the browser.
Chinese Text Project Concordance Tool - concordance lookup and a discussion of the continued importance of printed concordances in Sinology.
KH Coder - free software for KWIC concordance and collocation-statistics generation; statistical analysis functions include co-occurrence networks, multidimensional scaling, hierarchical cluster analysis, and correspondence analysis of words.
Dr Andrea Crestadoro was a bibliographer who became Chief Librarian of Manchester Free Library, serving from 1864 to 1879. He is credited with being the first person to propose that books could be catalogued by using keywords that did not occur in the title of the book. His other ideas included a metallic balloon, reform of the tax system, and improvements to a railway locomotive, the Impulsoria, powered by four horses on a treadmill. Andrea Crestadoro was born in Genoa in 1808 and was educated there before studying for his doctorate in philosophy at the University of Turin. He came to notice in 1849, when he left his position as Professor of Philosophy at the University of Turin and came to England to further his interest in mechanical devices. In England he took out a number of patents, including improvements to the Impulsoria, an unusual device: a mobile treadmill-powered locomotive invented by Clemente Masserano, from Pignerol in Italy. Following his improvements, Crestadoro exhibited the Impulsoria at The Great Exhibition held in the Crystal Palace in 1851.
The power was transferred to the wheels using a gearbox that allowed it to climb. It could be used with two or four horses. Another suggestion from Crestadoro was to replace the paddle wheels or propellors on steamships with a smooth cylinder; he argued that the paddles or propellor blades were unnecessary, suggesting that a smooth cylinder would gain traction simply by being immersed in the water. Crestadoro was given the task of creating a catalogue for the Manchester Library; he is credited with being the first person to propose that books could be catalogued by using keywords that did not occur in the title of the book. The system was called "keyword in titles", first proposed for Manchester libraries in 1864. This system was developed many years later, as Key Word in Context, by Hans Peter Luhn, and was used in early computer-based indexing. Crestadoro was an acquaintance of Anthony Panizzi, Principal Librarian of the British Museum, and was employed as a reader there. Exasperated by the delays in the publication by the British Museum of a Catalogue of Printed Books, Crestadoro wrote The Art of Making Catalogues of Libraries: Or A Method To Obtain In A Short Time A Most Perfect, And Satisfactory Printed Catalogue Of The British Museum Library, published anonymously in 1856.
The catalogue was to include 800,000 books, but it had been in progress for over 20 years and had consumed generous grants that by 1853 had far exceeded £100,000. Crestadoro published books on a number of subjects; his 1868 book proposed a method of dispensing with both gas and ballast by using a metallic balloon for flight. This too was exhibited at the Crystal Palace, in 1868. At the end of his life he was publishing ideas for the fairer allocation of taxation. After Crestadoro died in 1879, it was discovered that a glider he had built was stored in one of the Manchester libraries. His publications include: The Art Of Making Catalogues Of Libraries: Or A Method To Obtain In A Short Time A Most Perfect, And Satisfactory Printed Catalogue Of The British Museum Library (1856); Catalogue of the books in the Manchester free library: Reference department; Air locomotion dispensing with gas and ballast (1868); On the best and fairest mode of raising the public revenue (1876); and Taxation Reform Or the Best and Fairest Means of Raising the Public Revenue, a paper given at the Congress of the National Association for the Promotion of Social Science, Section Economy and Trade, Cheltenham (1878).
David Lorge Parnas is a Canadian early pioneer of software engineering who developed the concept of information hiding in modular programming, an important element of object-oriented programming today. He is also noted for his advocacy of precise documentation. Parnas earned his Ph.D. in electrical engineering at Carnegie Mellon University, where he worked as a professor for many years. He earned a professional engineering license in Canada and was one of the first to apply traditional engineering principles to software design. He taught at the University of North Carolina at Chapel Hill, the Technische Universität Darmstadt, the University of Victoria, Queen's University in Kingston, Ontario, McMaster University in Hamilton, and the University of Limerick. David Parnas has received a number of awards and honors: ACM "Best Paper" Award, 1979; Norbert Wiener Award for Social and Professional Responsibility, 1987; two "Most Influential Paper" awards, International Conference on Software Engineering, 1991 and 1995; Doctor honoris causa of the Computer Science Department, ETH Zurich, Switzerland, 1986; Fellow of the Royal Society of Canada, 1992; Fellow of the Association for Computing Machinery, 1994; Doctor honoris causa of the Faculté des Sciences Appliquées, Université catholique de Louvain, Belgium, 1996; ACM SIGSOFT "Outstanding Research" award, 1998; IEEE Computer Society 60th Anniversary Award, 2007; Doctor honoris causa of the Faculty of Informatics, University of Lugano, Switzerland, 2008; Fellow of the Gesellschaft für Informatik, 2008; Fellow of the Institute of Electrical and Electronics Engineers, 2009; and Doctor honoris causa of the Vienna University of Technology, Vienna, Austria, 2011. In modular design, his double dictum of high cohesion within modules and loose coupling between modules is fundamental to modular design in software.
However, in Parnas's seminal 1972 paper "On the Criteria to Be Used in Decomposing Systems into Modules", this dictum is expressed in terms of information hiding; the terms cohesion and coupling are not used, and he never used them. Parnas took a public stand against the US Strategic Defense Initiative in the mid-1980s, arguing that it would be impossible to write an application of sufficient quality that it could be trusted to prevent a nuclear attack. He has been at the forefront of those urging the professionalization of "software engineering" and is a strong promoter of ethics in the field. Parnas has also joined the group of scientists who criticize the number-of-publications-based approach to ranking academic production. In his November 2007 paper "Stop the Numbers Game", he elaborates several reasons why the number-based academic evaluation system used in many fields by universities all over the world is flawed and, instead of generating more scientific advancement, leads to knowledge stagnation.
Parnas, D. L. (1972). "On the Criteria To Be Used in Decomposing Systems into Modules". Communications of the ACM 15 (12): 1053–58. doi:10.1145/361598.361623.
David Lorge Parnas at the Mathematics Genealogy Project