To study computer-based duplicate checking and plagiarism identification of the kind performed by CNKI, digital documents must first be analyzed and processed. Digital documents fall into two categories: natural language texts and formal language texts. A formal language text is typically the source code of a computer program. Although plagiarism cases involving source code are numerous, they are comparatively easy to analyze because of the standardized grammar and sentence structure of programming languages, and research on this kind of plagiarism identification began earlier. Copy detection for natural language texts (such as papers), by contrast, emerged roughly 20 years after program copy detection.
(2) In 1993, Manber of the University of Arizona put forward the concept of the "approximate fingerprint" and, based on it, proposed the sif tool, which measures the similarity between files through string matching. Brin and colleagues at Stanford University first proposed the COPS system and its corresponding algorithm, and later proposed the SCAM prototype as an improvement. SCAM borrows the vector space model from information retrieval and measures text similarity with a method based on word-frequency statistics. Si and Leong of Hong Kong Polytechnic University used statistical keywords to measure document similarity and built the CHECK prototype, which introduced the structural information of documents into similarity measurement for the first time. In 2000, Monostori et al. used a suffix tree to find the longest common substrings between strings and built the MDR prototype. By then, educators across the United States had learned to combine in-class writing samples, Internet search tools, and anti-plagiarism technology to curb cheating at its source.
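To make the word-frequency approach concrete, here is a minimal sketch of the vector-space similarity that systems like SCAM build on. It is an illustration only, not SCAM's actual algorithm: real systems add weighting schemes such as tf-idf and asymmetric overlap measures, and the whitespace tokenizer here is a naive stand-in.

```python
from collections import Counter
import math

def similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity over raw word-frequency vectors.

    A minimal illustration of the vector space model; real copy
    detection systems use richer weighting and overlap measures.
    """
    freq_a = Counter(doc_a.lower().split())
    freq_b = Counter(doc_b.lower().split())
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(similarity("the quick brown fox", "the quick red fox"))  # prints 0.75
```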
(3) Identifying plagiarism in Chinese papers is more difficult. Unlike English, Chinese takes the character as its basic writing unit, with no explicit delimiter between words, so Chinese word segmentation is the foundation of Chinese document processing. A Chinese plagiarism identification system therefore needs word segmentation as its most basic module, and the quality of automatic word segmentation affects the accuracy of plagiarism identification to a certain extent. At the same time, computers lack real understanding of natural language, and plagiarism is not limited to verbatim copying, so identifying it accurately is hard. Foreign techniques therefore cannot simply be copied wholesale to solve plagiarism identification for Chinese papers.

Domestic research has taken several directions. Zhang Huanjiong of Beijing University of Posts and Telecommunications uses the Hamming distance formula from coding theory to calculate text similarity. Researchers at the Chinese Academy of Sciences, building on attribute theory, calculate the matching distance between vectors to obtain text similarity. Cheng Yuzhu and others, building on the mathematical expression theory of Chinese characters, transformed text similarity calculation into computing the cosine of the angle between vectors in a spatial coordinate system. Song Shuaibao and colleagues at Xi'an Jiaotong University developed the CDSDG system, which uses an overlap measurement algorithm based on word-frequency statistics to calculate overall semantic overlap and structural overlap at different granularities; the algorithm detects not only verbatim copying of digital text but also partial copying behaviors such as subset copying and displaced local copying. Jin's similarity calculation algorithm based on context frames considers the semantic relationships between objects and derives similarity relationships between texts from a semantic point of view. Jin Bo and Teng of Dalian University of Technology analyze the text structure of academic papers according to their distinctive organization, then calculate the similarity between papers using digital fingerprints and word-frequency statistics. Zhang Minghui proposed a new paragraph-based approximate-mirror algorithm for detecting near-duplicate web pages. Bao and others built a grid-based text copy detection system and put forward the copy detection principle of the semantic sequence kernel method. Jin and Teng gave the framework of a plagiarism detection system based on semantic understanding, whose core is word similarity calculation based on HowNet, and extended its scope of application to paragraphs. Nie and colleagues built an ontology-based paper duplicate-checking system that uses semantic web ontology technology to construct a paper ontology and calculate paper similarity.
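As an illustration of the fingerprint-plus-Hamming-distance idea attributed above to Zhang Huanjiong, here is a minimal sketch. The bit-setting scheme and 64-bit width are assumptions made for the example, not the published coding scheme, and the input tokens are assumed to be already word-segmented, since segmentation is the hard prerequisite discussed above.

```python
import hashlib

NUM_BITS = 64  # assumed fingerprint width for this sketch

def fingerprint(tokens: list[str]) -> int:
    """Map a token list to a 64-bit fingerprint: each token sets one bit.

    Illustrative only; the published coding scheme is not reproduced
    here, only the Hamming-distance comparison it relies on.
    """
    fp = 0
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        fp |= 1 << (h % NUM_BITS)
    return fp

def hamming_similarity(fp_a: int, fp_b: int) -> float:
    """Similarity = 1 - (Hamming distance / number of bits)."""
    distance = bin(fp_a ^ fp_b).count("1")
    return 1.0 - distance / NUM_BITS

# Tokens are assumed pre-segmented Chinese words.
a = fingerprint(["文本", "相似", "度", "计算"])
b = fingerprint(["文本", "相似", "度", "比较"])
print(hamming_similarity(a, b))
```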