How do big data, cloud computing and the Internet realize value?
1. The rise of big data indicates that the "information age" has entered a new stage.

(1) Look at big data from a historical perspective.

Compared with the agricultural and industrial eras, the information age will be a relatively long period. Different eras differ markedly in their factors of production and in the driving forces of social development. The iconic technological inventions of the information age are the digital computer, the integrated circuit, optical fiber communication and the Internet (World Wide Web). Although the phrase "big data era" appears frequently in the media, new technologies such as big data and cloud computing have not yet achieved technological breakthroughs comparable to the epoch-making inventions listed above, and they can hardly usher in a new era beyond the information age. The information age can, however, be divided into several stages, and the application of new technologies such as big data indicates that the information society is entering a new stage.

A survey of the past hundred-odd years shows many similarities between the development patterns of the information age and those of the industrial age. The way productivity improved in the era of electrification is strikingly similar to the information age: only after 20 to 30 years of diffusion and accumulation did the gains become evident, with the dividing lines at 1915 and 1995 respectively. The author conjectures that, after several decades of diffusion of information technology, the first 30 years of the 21st century may be the golden age in which information technology raises productivity.

(2) Understanding "big data" from the perspective of a "new stage of the information age"

China has entered the information age, but many people's thinking remains in the industrial age. Many problems in economic and scientific work are rooted in a lack of understanding of the times. China fell behind and was beaten in the 18th and 19th centuries because the Qing government did not realize that the times had changed; we must not repeat that historical mistake.

After the central government declared that China's economy has entered a "new normal", there has been much discussion in the media, but most of it explains the slowdown in economic growth; few articles discuss the "new normal" from the perspective of a change of era. The author believes that the economic new normal means China has entered a new stage in which information technology drives new industrialization, urbanization and agricultural modernization. This is a leap in economic and social management, not an expedient measure, still less a step backward.

The "third platform" of IT architecture, which is composed of next-generation information technologies such as big data, mobile Internet, social network, cloud computing and Internet of Things, is a sign that the information society has entered a new stage and has a leading and driving role in the transformation of the whole economy. The Internet, Maker, "Second Machine Revolution" and "Industry 4.0" that often appear in the media are all related to big data and cloud computing. Big data and cloud computing are new levers to improve productivity under the new normal. The so-called innovation-driven development mainly relies on information technology to improve productivity.

(3) Big data may be the breakthrough by which China's information industry moves from following to leading.

China's big data enterprises already have quite a good foundation. Of the world's top ten Internet service companies, China holds four seats (Alibaba, Tencent, Baidu and JD.com); the other six are all American companies, and no European or Japanese Internet company has entered the Top 10. This shows that Chinese enterprises have taken a leading position in Internet services based on big data. In the development of big data technology, China may change the situation of the past 30 years in which core technology was controlled by others, and may play a leading role in the worldwide application of big data.

However, the fact that some enterprises are at the forefront of the world does not mean that China leads in big data technology. In fact, none of the mainstream big data technologies popular in the world today originated in China. Open source communities and crowdsourcing are important ways to develop big data technology and industry, yet our contribution to open source communities is very small: among the nearly 10,000 core community volunteers worldwide, there may be fewer than 200 from China. We must learn from the past lesson that basic research supplied enterprises with too few core technologies, strengthen basic research and forward-looking technology research on big data, and strive to master its core and key technologies.

2. Understanding big data requires rising to the level of culture and epistemology.

(1) Data culture is an advanced culture.

The essence of data culture is the spirit of respecting the objective world and seeking truth from facts, and data are facts. Paying attention to data means emphasizing the scientific spirit of speaking with facts and thinking rationally. The traditional habit of people in China is qualitative rather than quantitative thinking. At present many cities are opening up government data, only to find that most people are not interested in the data the government wants to open. To put big data on a track of healthy development, we must first vigorously promote data culture. The data culture discussed here is not merely the big data used by cultural industries such as literature, art and publishing; above all it is the data consciousness of the whole population. The whole society should realize that the core of informatization is data; only when the government and the public attach importance to data can we truly understand the essence of informatization. Data is a new factor of production, and the use of big data can change the weight of traditional factors such as capital and land in the economy.

Some people describe "dancing with both God and data" as one of the characteristics of American culture, meaning that Americans combine sincere faith in God with the rationality of seeking truth through data. The United States completed the shift toward a data culture between the Gilded Age and the Progressive Era: after the Civil War, census methods were applied to many fields, forming a mindset of prediction and analysis based on data. Over the past century, the modernization of the United States and other Western countries has been closely linked to the spread and penetration of data culture; China, too, must emphasize data culture if it is to achieve modernization.

The key to raising data awareness is to understand the strategic significance of big data. Data is a strategic resource as important as matter and energy. Data collection and analysis touch every industry; they constitute a global, strategic technology. The shift from hard technology to soft technology is a worldwide trend, and technology that extracts value from data is the most dynamic soft technology. Falling behind in data technology and the data industry would mean missing an entire era, just as missing the opportunity of the industrial revolution once did.

(2) Understanding big data requires correct epistemology.

Historically, scientific research began with logical deduction: all the theorems of Euclidean geometry can be deduced from a few axioms. Since Galileo and Newton, scientific research has placed more emphasis on natural and experimental observation, refining scientific theories by induction on the basis of observation. "Science begins with observation" became the mainstream view in scientific research and epistemology. Both empiricism and rationalism have contributed greatly to the development of science, but both have also exposed obvious problems and even gone to extremes: rationalism taken to the extreme became the dogmatism criticized by Kant, while empiricism taken to the extreme became skepticism and agnosticism.

In the 1930s the Austrian-born philosopher Karl Popper put forward the epistemological position later called "falsificationism". He argued that scientific theories cannot be proved by induction but can only be falsified by counterexamples found in experiment, so he denied that science begins with observation and advanced the famous view that science begins with problems [3]. Falsificationism has its limitations: if the rule of falsification were strictly applied, important theories such as the law of universal gravitation and atomism might have been killed off by early so-called counterexamples. Nevertheless, the view that "science begins with problems" has guiding significance for the development of big data technology.

The rise of big data has given rise to a new mode of scientific research: "science begins with data". Epistemologically, big data analysis is close to the empiricist view that "science begins with observation", but we should keep the lessons of history in mind and avoid sliding into the empiricist mire of denying the role of theory. When emphasizing "correlation", do not doubt the existence of "causality"; when proclaiming the objectivity and neutrality of big data, do not forget that data, however large, is always subject to its own limitations and to human bias. Do not believe the prophecy that "with big data mining, you need not ask any questions of the data; the data will automatically generate knowledge". Faced with a vast sea of data, the biggest puzzle for scientists and engineers engaged in data mining is: what is the "needle" we are fishing for, and is there a needle in this sea at all? In other words, we need to know what the problem is. In this sense, "science begins with data" and "science begins with problems" should be organically combined.

The pursuit of "career" is the eternal power of scientific development. However, the reasons are endless, and it is impossible for human beings to find the "ultimate truth" in a limited time. On the road of scientific exploration, people often use "this is an objective law" to explain the world, and do not immediately ask why there is such an objective law. In other words, traditional science not only pursues causality, but also concludes with objective laws. The results of big data research are mostly some new knowledge or new models, which can also be used to predict the future and can be considered as a local objective law. In the history of science, there are many examples of discovering universal laws through small data models, such as Kepler's law of celestial motion; Most big data models discover some special laws. Laws in physics are generally inevitable, but big data models are not necessarily inevitable or deductive. The research object of big data is often human psychology and society, and it is at a higher level on the knowledge ladder. Its natural boundary is fuzzy, but it has more practical features. Big data researchers pay more attention to the integration of knowledge and practice and believe in practice. Big data epistemology has many characteristics different from traditional epistemology, so we can't deny the scientific nature of big data method just because of its different characteristics. The study of big data challenges the traditional epistemology's preference for causality, supplements the single causality with data laws, realizes the data unification of rationalism and empiricism, and a brand-new big data epistemology is taking shape.

3. Correctly understanding the value and benefits of big data.

(1) The value of big data is mainly reflected in its driving effect.

People always hope to dig unexpected "great value" out of big data. In fact, the value of big data lies mainly in its driving effect: it drives related scientific research and industrial development and improves the ability of every sector to solve problems and add value through data analysis. Big data's contribution to the economy is not fully reflected in the direct revenue of big data companies; it also includes its contribution to the efficiency and quality of other industries. Big data is a typical general-purpose technology, and general-purpose technologies should be understood through the "bee model": the main benefit of bees is not the honey they brew but the contribution their pollination makes to agriculture.

Von Neumann, one of the founders of the electronic computer, once pointed out: "In every science, when we develop methods that can be continually extended by studying problems that are quite simple compared with the ultimate goal, that discipline makes great progress." We need not expect miracles every day; we should do more of these "quite simple" things, for real progress lies in solid effort. The media like to promote astonishing big data success stories, and we should keep a clear head about such cases. According to a report by Wu Gansha, chief engineer of the Intel China Research Institute, the so-called classic data mining case of "beer and diapers" was actually a "story" made up by a Teradata manager and never actually happened [4]. Even if the case were real, it would not show that big data analysis itself is magical. In big data, two seemingly unrelated things may appear together or in succession anywhere; the key is human analysis and reasoning to find out why they co-occur. Finding the right reason is new knowledge or a newly discovered law; the correlation by itself is of little value.
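
To make the point about correlation concrete, here is a minimal sketch (in Python, with entirely made-up transactions rather than any real retail data) that computes the support and lift of co-occurring item pairs. The statistics flag which pairs appear together more often than chance, but they say nothing about why.

```python
from itertools import combinations
from collections import Counter

# Hypothetical shopping baskets, for illustration only.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"beer", "diapers", "milk"},
]

n = len(transactions)
item_count = Counter(item for t in transactions for item in t)
pair_count = Counter(frozenset(p) for t in transactions for p in combinations(t, 2))

for pair, cnt in pair_count.most_common(3):
    a, b = tuple(pair)
    support = cnt / n                                     # how often the pair co-occurs
    expected = (item_count[a] / n) * (item_count[b] / n)  # co-occurrence expected if independent
    print(f"{a} & {b}: support={support:.2f}, lift={support / expected:.2f}")
```

A lift well above 1 only signals a statistical association; explaining that association, and deciding whether it can be acted on, still requires the human reasoning described above.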

There is a well-known fable that illustrates the value of big data from one angle. Before he died, an old farmer told his three sons that he had buried a bucket of gold in the family field, but he did not say where. His sons dug over every plot of the family's land and found no gold, but because the soil had been dug so deeply, the crops grew especially well from then on. Likewise, as the ability to collect and analyze data improves, even if no universal laws or completely unexpected new knowledge are discovered, the value of big data gradually emerges.

(2) The power of big data comes from "great wisdom"

Each data source has its limitations and one-sidedness; only by fusing and integrating raw data from all sides can we reflect the full picture of things. The essence and laws of things are hidden in the associations among the various raw data. Different data may describe the same entity from different angles, and for the same problem different data can provide complementary information, allowing a deeper understanding. Therefore, collecting data from as many sources as possible is the key in big data analysis.

Data science is a science that integrates mathematics (statistics, algebra, topology, etc.), computer science, the basic sciences and various applied sciences, similar to the "great wisdom" proposed by Mr. Qian Xuesen [5], who pointed out: "Only by gathering all achievements together can one gain wisdom." The key to big data wisdom lies in the fusion and integration of multiple data sources. Recently, the IEEE Computer Society released a forecast of computer technology trends for 2014 with "seamless intelligence" as its focus, and the goal of developing big data is precisely this "seamless intelligence" of collaborative fusion. Relying on a single data source, however large, may be as one-sided as blind men feeling an elephant; the opening and sharing of data is therefore not icing on the cake but a necessary precondition that determines the success or failure of big data.

Research and application of big data must change the traditional mindset in which departments and disciplines each develop on their own. The emphasis should be not on supporting the development of individual technologies and methods but on collaboration among different departments and disciplines. Data science is not a vertical "chimney" but a horizontal, integrative science like environmental science or energy science.

(3) Big data has a bright future, but we can't expect too much in the near future.

When alternating current first appeared it was mainly used for lighting, and its ubiquitous applications today could not have been imagined. The same is true of big data technology, which will produce many unexpected applications in the future. We need not worry about big data's long-term prospects, but we must work very pragmatically in the near term. People often overestimate near-term development and underestimate long-term development. Gartner predicts that big data technology will take 5 to 10 years to become mainstream, so we should be patient in developing it.

Like other information technologies, big data follows a law of exponential growth for a period of time. The characteristic of exponential growth is that, measured over a historical period (at least 30 years), early development is relatively slow; after a long accumulation (perhaps more than 20 years) an inflection point arrives, followed by explosive growth. But no technology maintains "exponential" growth forever. In general, high technology follows the technology maturity (hype) curve described by Gartner and eventually either enters a stable state of healthy development or dies out.
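
One simple way to picture this trajectory (an illustrative textbook model, not one given by Gartner or the author) is the logistic curve, which grows almost exponentially early on, passes an inflection point, and then saturates:

```latex
% Illustrative S-curve model of technology adoption (assumed, not from the source):
% K is the saturation level, r the growth rate, t_0 the inflection point.
\[
  N(t) = \frac{K}{1 + e^{-r\,(t - t_0)}}, \qquad
  \frac{dN}{dt} = r\,N(t)\left(1 - \frac{N(t)}{K}\right).
\]
% For t << t_0, N(t) \approx K\,e^{r(t - t_0)} grows almost exponentially;
% near t_0 growth is fastest; for t >> t_0 the curve levels off into maturity.
```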

The problems big data technology needs to address are often extremely complex, involving social computing, the life sciences, brain science and so on. Such problems will not be solved by the efforts of just a few generations. The universe evolved for billions of years before living things and humans appeared; its complexity and subtlety are unmatched, and we should not expect to lift its veil completely within our own generation. Viewed against a future of millions of years or longer, big data technology is only one wave in the long river of scientific and technological development, and we should hold no unrealistic illusions about what 10 to 20 years of big data research can achieve scientifically.

4. Looking at the challenges of big data research and application from the perspective of complexity.

Big data technology is closely related to humanity's efforts to grapple with complexity. In the 1970s the rise of the "three new theories" (dissipative structure theory, synergetics and catastrophe theory) challenged the reductionism that had dominated science and technology for several hundred years. In 1984 three Nobel laureates, including Murray Gell-Mann, founded the Santa Fe Institute, devoted mainly to the study of complexity; they raised the banner of transcending reductionism and set off a complexity science movement in scientific and technological circles. Despite the loud thunder, the movement did not achieve the expected results over the following 30 years, and one reason may be that the technical means for handling complexity were not yet available.

The development of integrated circuits, computers and communication technology has greatly enhanced humanity's ability to study and handle complex problems. Big data technology will carry forward the new ideas of complexity science and may finally bring them down to earth. Complexity science is the scientific foundation of big data technology, and big data methods can be seen as a technical realization of complexity science. The big data approach offers a technical path toward the dialectical unity of reductionism and holism. Big data research should draw nourishment from complexity research; scholars engaged in data science should understand not only the "three new theories" of the 20th century but also hypercycles, chaos, fractals and cellular automata, so as to broaden their horizons and deepen their understanding of the mechanisms of big data.

Big data technology is still immature. Faced with massive, heterogeneous and dynamic data, traditional data processing and analysis techniques struggle to cope, and existing data processing systems are inefficient, costly, energy-hungry and hard to scale. Most of these challenges stem from the complexity of the data itself, the complexity of computation and the complexity of information systems.

(1) Challenges brought by data complexity

Because big data involves complex types, complex structures and complex patterns, the data itself is highly complex, and analysis tasks such as image and text retrieval, topic discovery, semantic analysis and sentiment analysis become very difficult. At present we do not understand the physical meaning behind big data, the laws of association among data, or the intrinsic relationship between the complexity of big data and computational complexity, and the lack of domain knowledge limits the discovery of big data models and the design of efficient computational methods. Describing the essential characteristics and measures of big data complexity formally or quantitatively requires deep study of the internal mechanisms of data complexity. The complexity of the human brain lies mainly in the connections among trillions of dendrites and axons, and the complexity of big data likewise lies mainly in the associations among data; understanding the mystery of these associations may be the breakthrough for revealing the laws of "emergence" from the micro to the macro level. Research on the laws of big data complexity helps us understand the essential features and formation mechanisms of complex patterns in big data, thereby simplifying the representation of big data and obtaining better knowledge abstractions. It is therefore necessary to establish theories and models of data distribution under multimodal association and to clarify the intrinsic relationship between data complexity and computational complexity, laying a theoretical foundation for big data computing.

(2) Challenges brought by computational complexity

Big data computing cannot, as with small sample data sets, perform statistical analysis and iterative computation over the global data. When analyzing big data we must re-examine its computability, computational complexity and solution algorithms. Big data samples are enormous, their internal associations are dense and complex, and the distribution of value density is highly uneven, all of which challenges the establishment of a computing paradigm for big data. For petabyte-scale data, even linear-complexity computation is hard to carry out, and the sparsity of the data distribution may lead to a great deal of wasted computation.
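
A rough back-of-the-envelope calculation (with assumed, illustrative numbers) shows why even a single linear pass is expensive at this scale:

```latex
% One sequential scan of 1 PB at an assumed aggregate I/O bandwidth of 10 GB/s:
\[
  \frac{10^{15}\ \text{bytes}}{10^{10}\ \text{bytes/s}} = 10^{5}\ \text{s} \approx 28\ \text{hours}.
\]
% Even an O(n) algorithm therefore needs massive parallelism, or prior data
% reduction, before it can meet interactive or streaming time budgets.
```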

Traditional computational complexity concerns the functional relationship between the time and space needed to solve a problem and the size of the problem; a polynomial-complexity algorithm is one whose time and space grow at a tolerable rate as the problem size increases. Traditional scientific computing focuses on how to "compute fast" for a problem of given size. In big data applications, especially stream computing, the time and space available for processing and analysis are often strictly limited; for example, a network service whose response time exceeds a few seconds, or even a few milliseconds, will lose many users. Big data applications are essentially about how to "compute more" within given time and space constraints. Moving from "computing fast" to "computing more" greatly changes the logic of reasoning about computational complexity: "computing more" does not mean the more data the better, and we need to explore methods of on-demand data reduction, from big-enough data to just-good-enough data to genuinely valuable data.
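
As a minimal sketch of "computing more" within a budget (made-up data, not a production stream processor), a uniform random sample can answer an aggregate query approximately long before a full scan would finish:

```python
import random
import statistics

# Hypothetical data set standing in for data far too large to scan in time.
random.seed(42)
full_data = [random.gauss(100.0, 15.0) for _ in range(1_000_000)]

def approx_mean(data, sample_size=10_000):
    """Estimate the mean from a uniform random sample instead of a full scan."""
    sample = random.sample(data, sample_size)
    estimate = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / sample_size ** 0.5  # rough accuracy bound
    return estimate, stderr

est, se = approx_mean(full_data)
print(f"approximate mean = {est:.2f} +/- {1.96 * se:.2f} (95% CI)")
print(f"exact mean       = {statistics.fmean(full_data):.2f}")
```

The trade-off is explicit: a bounded amount of work buys an answer with a quantified error, which is often "just good enough" in the sense described above.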

One way to solve problems with big data is to abandon general solutions and, exploiting particular constraints, find solutions to specific problems. Human cognitive problems are generally NP-hard, but with enough data very satisfactory solutions can often be found under restricted conditions; the great progress of self-driving cars in recent years is a good example. To reduce the amount of computation, we need to study local computation and approximation methods based on bootstrapping and sampling, propose new algorithmic theories that do not depend on the full data set, and study nondeterministic algorithms suited to big data.
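
The bootstrap idea mentioned here can be sketched as follows (a toy example with synthetic data): resampling a modest sample with replacement quantifies the uncertainty of an estimate without ever touching the full data set.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.fmean, n_boot=2000, alpha=0.05):
    """Percentile bootstrap confidence interval for a statistic of `sample`."""
    n = len(sample)
    boot_stats = []
    for _ in range(n_boot):
        resample = [random.choice(sample) for _ in range(n)]  # sample with replacement
        boot_stats.append(stat(resample))
    boot_stats.sort()
    lo = boot_stats[int(n_boot * alpha / 2)]
    hi = boot_stats[int(n_boot * (1 - alpha / 2)) - 1]
    return stat(sample), (lo, hi)

# Hypothetical small sample drawn from a much larger data set.
random.seed(0)
sample = [random.expovariate(1 / 50.0) for _ in range(500)]
point, (low, high) = bootstrap_ci(sample)
print(f"estimated mean = {point:.1f}, 95% bootstrap CI = ({low:.1f}, {high:.1f})")
```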

(3) Challenges brought by system complexity

Big data imposes stringent requirements on the operating efficiency and energy consumption of computer systems, and evaluating and optimizing the efficiency of big data processing systems is challenging. We need not only to clarify the relationship between the computational complexity of big data and system efficiency and energy consumption, but also to measure system efficiency comprehensively, including throughput, parallel processing capability, job accuracy and energy consumption per job. Given the sparsity and weak locality of access in big data, distributed storage and processing architectures for big data must be studied.

Big data applications span almost every field. The strength of big data lies in finding sparse but precious value in long-tail applications, yet a computer architecture optimized for one purpose is hard to adapt to such diverse needs, and this fragmentation of applications greatly increases the complexity of information systems. How can big data and Internet of Things applications, as fragmented as the five million or so species of insects, form a market as huge as that for mobile phones? This is the so-called "Insecta paradox" [6]. To cope with the complexity of computer systems, heterogeneous computing systems and plastic computing technologies need to be studied.

In big data applications the workload of computer systems has changed substantially, and computer architecture needs revolutionary reconstruction. Information systems need to shift from data revolving around the processor to processing capability revolving around the data, with the focus not on data processing but on data movement. The starting point of architectural design should shift from the completion time of a single task to the throughput and parallel processing capability of the system, with the scale of concurrent execution raised to more than a billion. The basic idea of building data-centric computing systems is to eliminate unnecessary data movement at the root, and to change the necessary data movement from "elephants hauling logs" to "ants moving grains of rice".

5. Misunderstandings that should be avoided in the development of big data

(1) Don't blindly pursue "big data scale"

The main difficulty of big data lies not in the sheer volume of data but in the diversity of data types, the demand for timely response, and the difficulty of telling true from false in the raw data. Existing database software cannot handle unstructured data, so attention must be paid to data fusion, standardization of data formats and data interoperability. Low quality of collected data is one of the characteristics of big data, but improving the quality of raw data as much as possible still deserves attention: one of the biggest problems in brain science research is the poor reliability of collected data, and it is hard to obtain valuable results from unreliable data.

Blind pursuit of ever larger data will not only cause waste but may not even be effective. Fusing small data from multiple sources may yield great value that big data from a single source cannot, so more attention should be paid to data fusion technology and to the openness and sharing of data. What counts as "large-scale" data is closely tied to the application field: in some fields a few petabytes may not be large, while in others tens of terabytes are already large.

The development of big data cannot endlessly pursue "bigger, more, faster"; it must follow a benign path of low cost, low energy consumption, benefit to people's livelihood, and fairness under the rule of law. Just as with today's control of environmental pollution, we should pay attention as early as possible to the "pollution" and privacy violations that big data may bring.

(2) Don't "technology-driven", but "application first"

New information technologies emerge one after another, and new concepts and buzzwords keep appearing in the information field. It is expected that after "big data", new technologies such as "cognitive computing", "wearable devices" and "robots" will in turn reach the peak of the hype cycle. We are used to following foreign waves of enthusiasm, often drifting with technology trends unawares, and it is all too easy to take the "technology-driven" road. In fact, the purpose of developing information technology is to serve people, and the only criterion for testing any technology is its application. The development of China's big data industry must adhere to an "application first" strategy and a technical route of application pull. Technology is limited, but applications are unlimited. To develop cloud computing and big data, localities must use policies and other measures to mobilize the enthusiasm of application departments and innovative enterprises, explore new applications through cross-domain combinatorial innovation, and find the way forward in applications.

(3) The "small data" method cannot be abandoned.

The popular definition of "big data" is a data set that current mainstream software tools cannot collect, store and process within a tolerable time. This defines the problem in terms of what current technology cannot do, and it can be misleading: by this definition, people would attend only to problems that cannot be solved at present, like a walker forever trying to step on the shadow in front of him. In fact, most of the data processing that various industries encounter is still a "small data" problem. Whether the data is big or small, we should focus on real problems.

Statisticians have spent more than 200 years cataloguing the traps in drawing conclusions from data, and these traps do not fill themselves in automatically as the volume of data grows. Big data contains many small-data problems, and big data collection produces the same statistical biases as small data collection. Google's flu predictions went wrong in the past two years because of human factors such as search recommendations.
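
A toy simulation (with entirely invented numbers) illustrates the statistical point: when the collection process itself is biased, gathering more data makes the wrong answer more precise, not more correct.

```python
import random

random.seed(1)

TRUE_RATE = 0.10   # assumed true prevalence of a trait in the population
P_INC_POS = 0.8    # carriers of the trait end up in the data set with this probability
P_INC_NEG = 0.4    # non-carriers are collected with this lower probability

def biased_estimate(n_population):
    """Fraction of collected records carrying the trait, under selection bias."""
    collected = positives = 0
    for _ in range(n_population):
        has_trait = random.random() < TRUE_RATE
        if random.random() < (P_INC_POS if has_trait else P_INC_NEG):
            collected += 1
            positives += has_trait
    return positives / collected

for n in (10_000, 1_000_000):
    print(f"population={n:>9}: estimated rate = {biased_estimate(n):.3f} (true rate = {TRUE_RATE})")
# The estimate converges to about 0.08 / (0.08 + 0.36) ≈ 0.18, not 0.10:
# a larger collection tightens the estimate around the biased value.
```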

A popular view in the big data community holds that big data analysis needs no causality, no sampling and no exact data. This view should not be made absolute. In practice we should combine logical deduction with induction, white-box with black-box research, and big data methods with small data methods.

(4) Pay close attention to the construction cost of big data platforms.

At present big data centers are being built all over the country; even the Lüliang mountain area has set up a data processing center with a capacity of more than 2 PB, and the public security departments of many cities require high-definition surveillance video to be kept for more than three months. Such systems are very expensive. The value mined from data comes at a cost, so we cannot blindly build big data systems regardless of cost. What data should be kept, and for how long, must be weighed against the possible value and the cost required. Big data system technology is still being researched: the US exascale supercomputing program requires energy consumption to be cut by a factor of 1,000 and is not expected to deliver a system until 2024, since a giant machine built with current technology would consume enormous amounts of energy.

We should not compete over the scale of big data systems but over actual application results, and over accomplishing the same task with fewer resources and less energy. We should first tackle the big data applications that ordinary people need most and develop big data in line with local conditions. The development of big data, like the informatization strategy as a whole, should have ambitious goals, a precise start and rapid development.