WEB full-text information retrieval technology
Li Can
(South China University of Technology Library 510641)
Abstract: This article explores how Internet technology can be used to realize full-text retrieval on the Internet. It discusses the process from the preprocessing of online information, such as indexing and classification, to the organization of information retrieval, and describes the development of intelligent retrieval technology.
Keywords: information retrieval, Internet, full-text retrieval
1. Introduction
The Internet is currently the largest and most influential information network in the world. Local area networks (LANs) in governments, schools, libraries, businesses, research institutions, and other organizations are joined into a single, massive communications network that spans the globe. More and more people use this network to communicate with others around the world. Knowing how to use the Internet to obtain valuable information has become a basic skill for scientific researchers.
The Internet is an open and enormous information resource, with tens of millions of hosts and over 100 million users. Because of the richness of the information it contains, its intuitive and vivid organization and presentation of information, and the convenience and diversity of its information services, it attracts more and more information seekers. In recent years, the number of Internet users has grown exponentially. Internet retrieval has thus become the most popular and most widely discussed area of information retrieval.
2. Overview
Information on the Internet is large in quantity, varied in form, broad in content, and often lacking in professional quality, which brings new problems and challenges to information collection, classification, retrieval, and related work. How to make full use of the information resources on the Internet is becoming a hot topic for information science researchers. Full-text information retrieval is a retrieval method developed in response to the characteristics of Internet information. It is mainly concerned with the representation, storage, organization, and access of entire documents, that is, with retrieving relevant material from an information database according to the user's query.
The central tasks of full-text retrieval are representing document content, capturing the information query, and matching related information. A good full-text information retrieval system should not only rank its output by relevance, but also adjust its matching mechanism adaptively and intelligently to the user's intentions, interests, and characteristics, so that the retrieval output satisfies the user.
To implement full-text retrieval, WEB information must first be preprocessed.
3. Preprocessing of WEB information
The main function of information preprocessing is to filter the information in the file system and produce a satisfactory index that represents it. Its basic purpose is to obtain the best possible index records so that users can easily retrieve the information they need.
(1) Format filtering: Information preprocessing should be able to filter documents in different formats, as well as pictures, sounds, videos and other information. This allows search engines to retrieve not just the text, but all the information in the original format of the file.
(2) Word segmentation: Words are the smallest units of information expression. Unlike Western languages, Chinese has no separators between the words in a sentence, so word segmentation is required. Commonly used word segmentation methods include dictionary-based forward maximum matching, reverse maximum matching, the best matching method, the association-backtracking method, and fully automatic dictionary-based segmentation. In recent years, segmentation methods based on neural networks and expert systems, as well as methods based on statistics and frequency analysis, have also emerged.
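The dictionary-based matching methods mentioned above can be illustrated with a minimal sketch. The function names, the toy dictionary, the sample sentence, and the `max_len` limit are all assumptions made for illustration; a real system would use a large dictionary and extra rules for unknown words.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, taking the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    tokens = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                j -= size
                break
    tokens.reverse()  # tokens were collected from the end of the sentence
    return tokens


# Hypothetical toy dictionary, not taken from any real lexicon.
TOY_DICTIONARY = {"网球", "网球拍", "拍卖", "卖", "完了"}

print(forward_max_match("网球拍卖完了", TOY_DICTIONARY))
print(backward_max_match("网球拍卖完了", TOY_DICTIONARY))
```

On this sample the two scanning directions produce different segmentations, which is one simple way to detect that a sentence is ambiguous and needs further analysis.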
(3) Lexical analysis: Chinese word segmentation is often ambiguous. For example, the same character string can be segmented either as "tennis ball / auction / finished" (the tennis-ball auction is over) or as "tennis racket / sold / out" (the tennis rackets are sold out). Contextual knowledge of various kinds must therefore be used to resolve such ambiguities. In addition, lexical analysis must be performed on the words to identify the stem of each word, so that the information index can be built on stems. For English words, stop words (common function words such as "a", "the", and "it") and suffixes (such as "ing", "ed", and "ly") must first be removed before indexing.
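For the English case, the removal of stop words and suffixes might be sketched as follows. This is a deliberately naive illustration: the stop-word set, the suffix list, and the `min_stem` threshold are assumptions, and a production system would use a proper stemming algorithm instead of plain suffix stripping.

```python
# Assumed, very small stop-word and suffix lists for illustration only.
STOP_WORDS = {"a", "the", "it", "is", "of", "and"}
SUFFIXES = ("ing", "ed", "ly")


def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix if at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


def index_terms(text):
    """Lowercase, drop punctuation and stop words, then strip suffixes."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    return [strip_suffix(w) for w in words if w and w not in STOP_WORDS]


print(index_terms("The engine quickly indexed the retrieved documents"))
```

Note that naive stripping yields truncated stems such as "retriev"; this is acceptable for indexing as long as queries are processed the same way.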
(4) Part-of-speech tagging and phrase recognition: On the basis of segmentation, rule-based and statistical methods are used for part-of-speech tagging. On this basis, various grammatical rules must also be used to identify important phrase structures.
(5) Automatic indexing: Extract from the web document a set of key information that summarizes its content as fully as possible and can serve as a search entry for users, and use this set of information to index the document. The index allows users to enter key information and retrieve a brief record of the document, such as its title, abstract, time, author, and URL, and then click through to the document itself.
(6) Automatic classification: Establish and maintain a complete classification directory system. Based on the information characteristics of the document, calculate the most relevant category or categories and assign the document to them, so that users can find the document directly by browsing the classification system.
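A minimal sketch of automatic indexing, assuming that the "key information" is simply the most frequent non-stop words of each document. The selection rule, the function names, the sample documents, and the example.org URLs are all hypothetical.

```python
from collections import Counter, defaultdict

STOP_WORDS = frozenset({"the", "a", "an", "of", "and", "is", "in", "to"})


def key_terms(text, top_n=3):
    """Pick the top_n most frequent non-stop words as index terms."""
    words = [w.strip(".,;:").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOP_WORDS)
    # Sort by descending count, then alphabetically, so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


def build_index(documents):
    """Map each key term to the set of document ids it indexes."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for term in key_terms(doc["text"]):
            index[term].add(doc_id)
    return index


def lookup(index, documents, term):
    """Return brief records (title, URL) for documents indexed under term."""
    doc_ids = sorted(index.get(term.lower(), ()))
    return [(documents[d]["title"], documents[d]["url"]) for d in doc_ids]


DOCS = {
    1: {"title": "Retrieval basics", "url": "http://example.org/1",
        "text": "Full-text retrieval builds an index. The index supports retrieval."},
    2: {"title": "Indexing notes", "url": "http://example.org/2",
        "text": "An index maps terms to documents. Terms come from documents."},
}

INDEX = build_index(DOCS)
print(lookup(INDEX, DOCS, "retrieval"))
```

The lookup returns only the brief record; the user would follow the URL to reach the full document.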
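Automatic classification can be sketched in its simplest form as keyword overlap between a document and each category. The category names, the keyword sets, and the overlap score are assumptions for illustration; practical classifiers use weighted or statistical measures of relevance.

```python
# Hypothetical classification directory: category -> characteristic keywords.
CATEGORY_KEYWORDS = {
    "computing": {"index", "retrieval", "network", "software"},
    "sports": {"tennis", "match", "racket", "score"},
}


def classify(terms, category_keywords, max_categories=1):
    """Assign a document to the categories whose keywords it overlaps most."""
    term_set = set(terms)
    scores = {cat: len(term_set & kws) for cat, kws in category_keywords.items()}
    # Highest overlap first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [cat for cat, score in ranked[:max_categories] if score > 0]


print(classify(["tennis", "racket", "index"], CATEGORY_KEYWORDS))
print(classify(["tennis", "racket", "index"], CATEGORY_KEYWORDS, max_categories=2))
```

A document with no overlapping keywords is left unclassified here, which a real directory system would have to handle separately.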
4. Retrieval
Retrieval comprises the representation of document information, the representation of query information, and the estimation of the relevance between the two.
(1) Information expression: There are many ways to express information, such as Boolean expression, vector space expression, and natural language expression. Each method is chosen according to the purpose and needs of the application system as a whole, and corresponds to a particular storage scheme and retrieval algorithm. The efficiency of information query and organization, that is, speed and storage space, largely determines the performance of the retrieval service system.
(2) Query analysis: The user's query must first be analyzed and processed to extract query term indexes, logical expressions, or other descriptions of query features. The difference from the file information index is that the query index is built at the moment the query is submitted, whereas the file index is generated in advance: the search engine fetches remote data according to a certain strategy and builds a local index from it. The query index and the file index use the same expression method, so a similarity estimation algorithm can be used to retrieve related files.
(3) Query expansion: In recent years, in order to improve retrieval performance, application domain knowledge has been combined with indexing, relevance estimation, and query expression to achieve query expansion, in which the query index also includes terms that do not appear in the user's original query. A typical knowledge-base query expansion application is shown in Figure 1. The knowledge stored in the knowledge base is used to add relevant words to the original query, thereby expanding it.
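Knowledge-base query expansion can be sketched as a simple dictionary lookup. The knowledge base contents and the function name are hypothetical; a real knowledge base would hold richer relations than a flat list of related words.

```python
# Hypothetical knowledge base: term -> related words from the application domain.
KNOWLEDGE_BASE = {
    "car": ["automobile", "vehicle"],
    "retrieval": ["search"],
}


def expand_query(query_terms, knowledge_base):
    """Append related words from the knowledge base to the original query."""
    expanded = list(query_terms)
    for term in query_terms:
        for related in knowledge_base.get(term, []):
            if related not in expanded:  # avoid duplicate query terms
                expanded.append(related)
    return expanded


print(expand_query(["car", "retrieval"], KNOWLEDGE_BASE))
```

The original terms are kept at the front of the list so that a later ranking step could still weight them more heavily than the added words.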
(4) Query word selection strategy:
·Dependent words: Dependent words are words that are strongly correlated with the query word. This requires that the correlation between all words in the document collection be calculated beforehand.
·Feedback words: Based on the files that users mark in their feedback, important words are identified from the frequency and distribution of words in relevant and non-relevant files, and these words are added to the user's query.
·Interactive selection: The user determines the final query term from the candidate terms obtained through the above strategy.
Feedback belongs to the category of human-computer interaction and aims to improve the performance and focus of queries. Different users provide different feedback according to their actual situation, and different information retrieval service systems have different feedback structures and interaction methods depending on their functions and retrieval methods, so the query results differ as well.
(5) Information retrieval model: The core of the information retrieval system is the search engine, which must filter the information that meets the user's needs out of a large amount of complex information. According to how the search engine finds relevant information, retrieval models can be divided into the Boolean logic model, the fuzzy logic model, the vector space model, the probabilistic model, and others.
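The feedback-word strategy can be sketched as follows, assuming a simple score: the fraction of relevant files that contain a word minus the fraction of non-relevant files that contain it. The scoring rule, the function name, and the sample data are illustrative assumptions.

```python
from collections import Counter


def feedback_terms(relevant_docs, non_relevant_docs, query_terms, top_n=2):
    """Pick words frequent in relevant files and rare in non-relevant ones."""
    rel = Counter(t for doc in relevant_docs for t in set(doc))
    non = Counter(t for doc in non_relevant_docs for t in set(doc))
    scores = {}
    for term in rel:
        if term in query_terms:
            continue  # already part of the query
        rel_share = rel[term] / len(relevant_docs)
        non_share = non[term] / len(non_relevant_docs) if non_relevant_docs else 0.0
        if rel_share > non_share:
            scores[term] = rel_share - non_share
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


relevant = [["jaguar", "car", "engine"], ["jaguar", "car", "speed"]]
non_relevant = [["jaguar", "cat", "jungle"]]
candidates = feedback_terms(relevant, non_relevant, ["jaguar"])
print(candidates)
```

Under interactive selection, the candidate list would be shown to the user, who decides which of these words actually join the query.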
Boolean logic model: The Boolean logic model is the simplest retrieval model and the basis for the other retrieval models. The standard Boolean model uses binary logic: a document is described by a series of binary variables that correspond to document characteristics. These variables include search terms extracted from the document text, and sometimes more complex features such as data, phrases, private signatures, and manually added descriptors. The Boolean model thus expresses an exact set of document characteristics. Users submit queries as Boolean logical relationships between search terms, and the matching function is determined by the basic laws of Boolean logic.
A retrieved document either matches the query or does not, so query results are generally not sorted by relevance.
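Boolean matching can be sketched with set operations over an inverted index. The index contents, the document ids, and the helper names are assumptions for illustration.

```python
# Hypothetical inverted index: term -> ids of documents containing the term.
INDEX = {
    "web": {1, 2, 3},
    "retrieval": {1, 3},
    "fuzzy": {2},
}
ALL_DOCS = {1, 2, 3, 4}


def postings(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return INDEX.get(term, set())


def boolean_and(a, b):
    return a & b


def boolean_or(a, b):
    return a | b


def boolean_not(a):
    return ALL_DOCS - a


# web AND retrieval
print(boolean_and(postings("web"), postings("retrieval")))
# web AND NOT retrieval
print(boolean_and(postings("web"), boolean_not(postings("retrieval"))))
```

The result is an unordered set: each document is either in it or not, which is exactly why Boolean results carry no relevance ranking.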
Fuzzy logic model: The fuzzy logic model was introduced to deal with the tension between accuracy and complexity. It is based on fuzzy logic, whose truth values lie in the interval [0, 1], and uses the concept of a membership function to describe the intermediate transitions between distinct states. Fuzzy logic operations are introduced into the processing of query results: the retrieved file information is compared with the user's query requirements using fuzzy logic, and the query results are ranked by relevance. In Boolean retrieval, the fuzzy logic model can thus be used to overcome the lack of ordering in the results of Boolean logic queries.
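A minimal sketch of fuzzy matching, assuming the commonly used min/max operators for AND and OR and a table of membership degrees per document and term. The operators, the degrees, and the names are assumptions, since the text does not fix a particular choice.

```python
# Hypothetical membership degrees in [0, 1]: how strongly each term describes a document.
MEMBERSHIP = {
    "d1": {"web": 0.9, "retrieval": 0.4},
    "d2": {"web": 0.6, "retrieval": 0.7},
    "d3": {"web": 0.2},
}


def fuzzy_and(a, b):
    return min(a, b)


def fuzzy_or(a, b):
    return max(a, b)


def fuzzy_not(a):
    return 1.0 - a


def rank_and(terms, membership):
    """Rank documents by the fuzzy AND of their membership in every query term."""
    scores = {
        doc: min(weights.get(term, 0.0) for term in terms)
        for doc, weights in membership.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(doc, score) for doc, score in ranked if score > 0]


print(rank_and(["web", "retrieval"], MEMBERSHIP))
```

Unlike the strict Boolean case, the documents that satisfy the query come back ordered by their degree of match.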
Vector space model: The vector space model differs from the Boolean retrieval model. In the vector space model, both queries and files are mapped to vectors in the same n-dimensional space. Singular value decomposition (SVD) is used to analyze the internal structural relationship between query terms and files, similarity is compared using Euclidean distance and the cosine measure, and the query results are ranked by their similarity in the vector space. The vector space model can not only produce effective query results easily, but also classify them, helping users locate the information they need precisely.
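The cosine comparison can be sketched with raw term-frequency vectors. This sketch omits the SVD step and any term weighting; the function names and sample documents are assumptions for illustration.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent a text as a sparse term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Order documents by decreasing cosine similarity to the query."""
    q = to_vector(query)
    scored = [(name, cosine(q, to_vector(text))) for name, text in docs.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


DOCS = {"d1": "web retrieval web", "d2": "fuzzy logic", "d3": "web retrieval"}
for name, score in rank("web retrieval", DOCS):
    print(name, round(score, 3))
```

Because the cosine depends only on direction, a document that repeats one query term more often (d1) can score lower than one whose term proportions match the query exactly (d3).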
Probabilistic model: Information retrieval involves uncertainty. The query itself cannot uniquely represent the information need, and it is equally uncertain whether a given query result is correct. This is also true of Boolean retrieval, since the query itself is submitted in an inexact form. The probabilistic retrieval model was introduced to address the uncertainty problem in the Boolean retrieval model. The model is based on the probability ranking principle: maximum retrieval performance is achieved when documents are arranged in order of decreasing probability of relevance.
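Ranking by decreasing probability of relevance can be sketched under a binary independence assumption, where each query term present in a document contributes a log-odds weight. The weighting formula is one common choice and not necessarily the one the model description has in mind; the probabilities and names are invented toy values.

```python
import math


def relevance_score(doc_terms, query_terms, p, q):
    """Sum log-odds weights of the query terms that occur in the document.

    p[t]: assumed probability that term t occurs in a relevant document.
    q[t]: assumed probability that term t occurs in a non-relevant document.
    """
    score = 0.0
    for t in query_terms:
        if t in doc_terms:
            score += math.log(p[t] * (1 - q[t]) / (q[t] * (1 - p[t])))
    return score


def rank_by_probability(docs, query_terms, p, q):
    """Order documents by decreasing estimated relevance."""
    scored = [(name, relevance_score(terms, query_terms, p, q))
              for name, terms in docs.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Toy estimates; in practice they would come from relevance feedback or statistics.
P = {"web": 0.8, "retrieval": 0.6}
Q = {"web": 0.4, "retrieval": 0.1}
DOCS = {"d1": {"web"}, "d2": {"web", "retrieval"}, "d3": {"fuzzy"}}

for name, score in rank_by_probability(DOCS, ["web", "retrieval"], P, Q):
    print(name, round(score, 2))
```

The score is a monotone stand-in for the probability of relevance, so sorting by it gives the same order as sorting by the probability itself.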
5. Development of full-text information retrieval technology
Current full-text retrieval technology still produces some unsatisfactory results, mainly because of the low performance of ordinary information retrieval systems, which use isolated words and vocabulary terms as query descriptors and therefore capture the similarity of document content poorly. Intelligent information retrieval is the product of combining artificial intelligence with information retrieval. It enables the information retrieval system to "understand" the user's information needs and the information content of documents, and it realizes intelligent retrieval based on content analysis and understanding, content expression, knowledge learning, reasoning mechanisms, decision-making, and so on.
The current combination of artificial intelligence and information retrieval mainly involves three aspects: (1) Information retrieval and expert systems: The main research direction is to develop an expert intermediary system that assists with query formation, search strategy selection, and the prediction of retrieved documents. (2) Information retrieval and natural language processing: Natural language is in essence a symbol system that uses characters or words as symbols. At present, the application of natural language processing to information retrieval remains at the level of simple language processing, such as identifying word stems and phrases. (3) Information retrieval and knowledge representation: Research in this field is mainly concerned with understanding the information content of documents and queries through the application of domain knowledge.
At present, some information retrieval service systems on the WWW adopt methods such as intelligent user agents, which monitor information sources on the network in real time according to retrieval requirements defined in advance by the user (for example, updates to specified Web pages, online news, or e-mail) and proactively deliver the information users need by e-mail or similar means, reducing the time users spend on retrieval. However, commercial information retrieval systems are still mainly based on Boolean and fuzzy logic, supplemented by some natural language processing. The development of intelligent information retrieval technology, especially the application of knowledge learning, knowledge bases, and human-computer interaction, will greatly improve the accuracy and relevance of information retrieval service systems. With the development of intelligent technology, full-text information retrieval technology will be used more widely in the field of online information retrieval.