Structured data is row data stored in a database that can be expressed logically as a two-dimensional table. By contrast, data that cannot conveniently be expressed as a two-dimensional logical table in a database is called unstructured data. It includes office documents in all formats, text, pictures, XML, HTML, various reports, images, and audio/video information.
Data whose fields can be expanded as needed, that is, data with a variable number of fields, can be called semi-structured data; data stored in Exchange is one example.
Unstructured database
In the information society, information can be divided into two major categories. One type can be represented by data or a unified structure, such as numbers and symbols; we call this structured data. The other type cannot be represented by numbers or a unified structure, such as text, images, sounds, and web pages; we call this unstructured data. On this view, structured data belongs to unstructured data and is a special case of it.
As the name suggests, data cleaning means "washing away" the "dirty" data. The data in a data warehouse is a collection of data oriented to a certain subject. It is extracted from multiple business systems and contains historical data, so it is inevitable that some of it is wrong and some of it conflicts with other data. Such erroneous or conflicting data is obviously unwanted and is called "dirty data". Washing out the dirty data according to certain rules is data cleaning. The task of data cleaning is to filter out data that does not meet the requirements and hand the filtered results to the business department, which confirms whether each item should be discarded or corrected by the business unit and then extracted again. Data that does not meet the requirements falls mainly into three categories: incomplete data, erroneous data, and duplicate data.
(1) Incomplete data
This type of data is missing information that should be present, for example a missing supplier name, branch name, or customer region, or main-table and detail-table records in the business system that cannot be matched. Such data is filtered out, the missing content is written to separate Excel files according to what is missing and submitted to the customer, and the customer is required to complete it within a specified time. Once completed, the data is written to the data warehouse.
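The filtering step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the field names in REQUIRED_FIELDS are hypothetical, values are assumed to be strings or None, and a CSV stream stands in for the Excel file mentioned in the text.

```python
import csv
import io

# Hypothetical required fields; a real warehouse would define its own list.
REQUIRED_FIELDS = ["supplier_name", "branch_name", "customer_region"]


def split_incomplete(rows, required=REQUIRED_FIELDS):
    """Separate complete rows from rows with a missing or blank required field."""
    complete, incomplete = [], []
    for row in rows:
        # Treat None, empty strings, and whitespace-only strings as missing.
        missing = [f for f in required if not (row.get(f) or "").strip()]
        if missing:
            # Record which fields are missing so the business unit knows what to fill in.
            incomplete.append({**row, "missing_fields": ";".join(missing)})
        else:
            complete.append(row)
    return complete, incomplete


def write_report(incomplete, out):
    """Write the rejected rows to a CSV stream (standing in for the Excel file)."""
    if not incomplete:
        return
    writer = csv.DictWriter(out, fieldnames=list(incomplete[0].keys()))
    writer.writeheader()
    writer.writerows(incomplete)


rows = [
    {"id": "1", "supplier_name": "Acme", "branch_name": "North", "customer_region": "East"},
    {"id": "2", "supplier_name": "", "branch_name": "South", "customer_region": None},
]
complete, incomplete = split_incomplete(rows)
report = io.StringIO()
write_report(incomplete, report)
```

Only the rows in complete would be loaded; the rows in incomplete go back to the customer and are reprocessed once they have been filled in.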
(2) Erroneous data
This type of error arises because the business system is not robust enough: it writes input directly to the back-end database without validating it. Examples include numeric data entered as full-width digit characters, a carriage return at the end of string data, an incorrect date format, and a date that is out of bounds. This type of data must also be classified. Problems such as full-width characters and invisible characters before or after the data can only be found by writing SQL statements; the customer is then asked to correct them in the business system before the data is extracted again. Errors such as an incorrect date format or an out-of-bounds date cause the ETL run to fail. They need to be picked out of the business system database with SQL and handed to the business department for correction within a time limit, after which the data is extracted again.
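The text describes finding these errors with SQL in the business system database. As a rough illustration of the same checks, here is a Python sketch rather than SQL. The date format, the date bounds, and the function name find_problems are all assumptions made for the example.

```python
import unicodedata
from datetime import date, datetime

# Assumed format and bounds; real limits depend on the business system.
DATE_FORMAT = "%Y-%m-%d"
MIN_DATE = date(1990, 1, 1)
MAX_DATE = date(2100, 12, 31)


def find_problems(value_text, date_text):
    """Return a list of problem labels for one numeric string and one date string."""
    problems = []
    # Full-width digits change under NFKC normalization; ordinary ASCII digits do not.
    if unicodedata.normalize("NFKC", value_text) != value_text:
        problems.append("full_width_characters")
    # Leading or trailing whitespace, including carriage returns and line feeds.
    if value_text != value_text.strip():
        problems.append("invisible_characters")
    try:
        parsed = datetime.strptime(date_text, DATE_FORMAT).date()
    except ValueError:
        problems.append("bad_date_format")
    else:
        if not (MIN_DATE <= parsed <= MAX_DATE):
            problems.append("date_out_of_bounds")
    return problems
```

A row with an empty result passes these checks. Any other row would be classified by its labels and handed to the business department for correction.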
(3) Duplicate data
For this type of data, which occurs especially in dimension tables, export all fields of the duplicate records so that the customer can confirm and sort them out.
Data cleaning is an iterative process that cannot be completed in a few days; problems can only be discovered and solved continuously. The customer is generally required to confirm whether data should be filtered out or corrected. Filtered data should be written to an Excel file or to a data table. In the early stages of ETL development, the filtered data can be emailed to the business units every day to prompt them to correct errors as soon as possible; it also serves as a basis for verifying the data later. When cleaning data, take care not to filter out useful data: verify each filtering rule carefully and ask users to confirm it.
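A minimal Python sketch of this export step follows. It assumes the dimension rows are available as dictionaries and that duplicates are defined by a chosen set of key fields. The table contents and the customer_id key are hypothetical.

```python
from collections import defaultdict


def find_duplicates(rows, key_fields):
    """Group rows by the key fields and return only the groups with more than one row."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        # Keep the whole row so every field is available for the customer's review.
        groups[key].append(row)
    return {key: members for key, members in groups.items() if len(members) > 1}


dimension_rows = [
    {"customer_id": "C1", "name": "First Customer", "region": "East"},
    {"customer_id": "C2", "name": "Second Customer", "region": "West"},
    {"customer_id": "C1", "name": "First Customer", "region": "North"},
]
duplicates = find_duplicates(dimension_rows, ["customer_id"])
```

Returning the full rows rather than just the keys lets the customer see where the duplicates differ (here, the region) and decide which record to keep.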
With the development of network technology, especially the rapid development of Internet and Intranet technology, the amount of unstructured data is increasing day by day. As a result, the limitations of relational databases, which are primarily used to manage structured data, have become increasingly apparent.
Database technology has accordingly entered the "post-relational database era" and developed into an era of unstructured databases based on network applications. An unstructured database is one in which the variable-length records are composed of several repeatable and non-repeatable fields, and each field can in turn be composed of several repeatable and non-repeatable subfields. Simply put, an unstructured database is a database with variable fields.
In China, unstructured databases are represented by the iBase database of Beijing Guoxin Base (iBase) Software Co., Ltd. The iBase database is an unstructured database for end users. It is at an internationally advanced level in processing unstructured information, full-text information, multimedia information, and massive information, as well as in Internet/Intranet applications, and it represents a breakthrough in the management and full-text retrieval of unstructured data. Its main advantages are as follows:
(1) Internet applications involve a large number of complex data types. iBase can manage various document and multimedia information through its external file data type, and it provides powerful full-text retrieval for document resources worth searching, such as HTML, DOC, RTF, and TXT.
(2) It uses subfields, multi-valued fields, and variable-length fields to allow the creation of many different types of unstructured or arbitrarily formatted fields, breaking through the strict table structure of relational databases and allowing unstructured data to be stored and managed.
(3) iBase defines both unstructured and structured data as resources, so that the basic element of an unstructured database is the resource itself, and a resource in the database can contain both structured and unstructured information. An unstructured database can therefore store and manage a variety of unstructured data, shifting the database system from data management to content management.
(4) iBase is built on an object-oriented foundation that closely integrates enterprise business data with business logic, and it is particularly suitable for expressing complex data objects and multimedia objects.
(5) iBase is a database created to meet the needs of Internet development. Based on the idea that the Web is a massive wide-area-network database, it provides an online resource management system, iBase Web, which integrates the Web server and the database server directly into a whole, making the database system and database technology an important and integral part of the Web. This breaks through the limitation of the database acting only as the back end of a Web system, achieves a seamless combination of database and Web, and opens up a broader field for information management and even e-commerce applications on the Internet/Intranet.
(6) iBase is fully compatible with large, medium, and small databases of various kinds, and provides import and link support for traditional relational databases such as Oracle, Sybase, SQL Server, DB2, and Informix.
From the above analysis, we can predict that, with the rapid development of network technology and network application technology, unstructured databases based entirely on Internet applications will become another key focus and hot technology after hierarchical databases, network databases, and relational databases.