Going forward, the mainstream computing market (desktops, servers, laptops) needs architectures with a limited number of cores that emphasize per-core single-threaded performance, while many-core architectures (dozens or even hundreds of cores) will be applied to specialized environments such as stream computing, HPC, and SoCs. This is also a dividing line for future Intel processors, which will come in so-called "big core" and "small core" variants. The former is based on the current Core architecture and pursues better single-threaded performance. The latter is based on the Atom core, whose design emphasizes higher parallelism and lower power consumption.
In terms of instruction execution, the "big core" uses out-of-order execution and the "small core" uses in-order execution. Out-of-order execution, in contrast to in-order (sequential) execution, is a technique in which the CPU allows multiple instructions to be issued out of the order specified by the program and sent to the appropriate circuit units for processing.
Comparison
Compared with in-order execution, out-of-order execution improves IPC more effectively, that is, it increases the number of instructions executed per clock cycle. Generally speaking, at the same clock frequency an out-of-order core executes more instructions in a given time than an in-order core, so a single out-of-order core has greater computing power. However, an out-of-order processor is more complicated in circuit design and its core consumes more power, which makes it hard to meet the requirements of mobile phones and embedded applications that need very low power consumption. For this reason the Atom processor uses in-order execution.
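The IPC difference described above can be illustrated with a toy model. The following is a minimal sketch, not a model of any real core: the `Instr` type, the `simulate` function, the latencies, and the four-instruction program are all assumptions made for illustration. Each register is written only once, so the false dependencies that register renaming would remove do not arise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    dest: str        # register written (each register is written only once)
    srcs: tuple      # registers read
    latency: int     # cycles until the result is available (>= 1)


def simulate(program, out_of_order, width=1):
    """Return the cycle in which the last result becomes available."""
    # Results of instructions that have not issued yet are never ready.
    ready_at = {ins.dest: float("inf") for ins in program}
    pending = list(range(len(program)))
    cycle = 0
    finish = 0
    while pending:
        issued = 0
        for idx in list(pending):
            if issued == width:
                break
            ins = program[idx]
            # Registers with no producer in the program hold initial values.
            if all(ready_at.get(s, 0) <= cycle for s in ins.srcs):
                ready_at[ins.dest] = cycle + ins.latency
                finish = max(finish, ready_at[ins.dest])
                pending.remove(idx)
                issued += 1
            elif not out_of_order:
                break  # in-order: a stalled instruction blocks younger ones
        cycle += 1
    return finish


program = [
    Instr("r1", (), 3),        # load
    Instr("r2", ("r1",), 1),   # depends on the first load
    Instr("r3", (), 3),        # independent load
    Instr("r4", ("r3",), 1),   # depends on the second load
]
in_order_cycles = simulate(program, out_of_order=False)
ooo_cycles = simulate(program, out_of_order=True)
print(len(program) / in_order_cycles, len(program) / ooo_cycles)  # IPC of each
```

In this toy program the in-order core idles while each load completes, whereas the out-of-order core starts the second load during the first load's latency, so the same four instructions finish in fewer cycles.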
In the future, many-core processors and limited multi-core processors will develop in parallel to serve increasingly differentiated and complex computing environments. The criteria for judging a processor will become more complicated: they may no longer rest on clock frequency or even IPC, but on how well the processor suits its application.
Out-of-order Execution Technology and Godson 2F Chip
Applications
The Godson (Loongson) processor has been widely used in industrial control, PCs, notebooks, the military and other fields. In a sense, domestically developed chips have already entered the mainstream market. According to Mr. Wang Chengjiang, many government and military organizations have used the Loongson platform for a long time.
contrast
The Dawn Gigabit Firewall is built on the Godson 2F chip, a 64-bit general-purpose RISC processor manufactured in a 90 nm CMOS process and fully compatible with the MIPS64 standard. Godson 2F, successfully developed in 2007, is an improved version of the Godson 2E processor. It integrates a high-performance Godson 2 CPU core with a four-issue dynamic superscalar structure and a 9- to 10-stage pipeline, and supports out-of-order execution techniques such as register renaming, dynamic scheduling, and branch prediction. Compared with Godson 2E, Godson 2F improves I/O performance and memory bandwidth and integrates a memory controller, which raises data throughput and provides a better platform for network security products.
Out-of-order execution technology and Intel E8400 processor
Brief introduction
The 45 nm Intel Core 2 Duo processor E8400 is offered with extended lifecycle support of up to 7 years for embedded applications. The processor also supports Intel Trusted Execution Technology to help customers deploy secure embedded solutions.
Enhanced multimedia performance
This 45 nm processor introduces the Super Shuffle Engine, which enhances the Intel Streaming SIMD Extensions (SSE) algorithms optimized for graphics and multimedia processing. The Super Shuffle Engine reduces latency, speeds up existing SSE instructions, and significantly improves the performance of the newer SSE4 instruction set. Developers can use the SSE4 multimedia instructions to improve the video editing and encoding functions of embedded applications (such as interactive clients or digital signage).
Intel Trusted Execution Technology
Intel Trusted Execution Technology is a hardware extension in the Intel Core 2 Duo processor E8400 that brings hardware-based data security to the embedded market, making this dual-core processor well suited to defense, government, mid-range network security appliances and retail applications. The technology aims to protect data in virtualized computing environments from software attacks, virus intrusions and other types of threats. [6]
Out-of-order execution technology and Intel's Nehalem architecture chip
Build
Nehalem is essentially built on the framework of the Core microarchitecture, adding SMT, a three-level cache hierarchy, a two-level TLB and branch predictor, an integrated memory controller (IMC), QPI, DDR3 support and the SSE4.2 instructions. Compared with the sweeping change from the Pentium 4's NetBurst architecture to the Core microarchitecture, the step from Core to Nehalem changes the basic core only slightly: Nehalem still decodes, renames and retires four instructions wide.
Cause
Nehalem's out-of-order engine has been expanded significantly, not only for performance but also to support SMT, since SMT requires resources that both threads can share.
Like Core 2, Nehalem's register alias table (RAT) points each architectural register either to an entry in the reorder buffer (ROB) or to the retirement register file (RRF), and so holds the most recent speculative state. The RRF, by contrast, holds the most recent non-speculative state. The RAT can rename four micro-operations per cycle, giving each micro-operation a destination register in the ROB. Renamed instructions read their source operands and are sent to a unified reservation station (RS) that is shared by all instruction types.
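The renaming step can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Intel's implementation: the `Renamer` class, its method names and the tuple encoding of tags are hypothetical, and free-entry tracking in the ROB is omitted.

```python
class Renamer:
    """Toy register alias table: each architectural register maps either to
    the retirement register file ("RRF") or to a reorder buffer entry ("ROB")."""

    RENAME_WIDTH = 4  # micro-operations renamed per cycle, as stated in the text

    def __init__(self, arch_regs, rob_size=128):
        self.rat = {r: ("RRF", r) for r in arch_regs}
        self.rob_size = rob_size
        self.next_rob = 0  # assumption: entries are handed out round-robin

    def rename(self, uops):
        """uops: list of (dest, srcs) for one cycle. Returns (dest_tag, src_tags)."""
        assert len(uops) <= self.RENAME_WIDTH
        renamed = []
        for dest, srcs in uops:
            # Sources are looked up first, so a uop that reads and writes the
            # same register sees the previous producer.
            src_tags = tuple(self.rat[s] for s in srcs)
            tag = ("ROB", self.next_rob)
            self.next_rob = (self.next_rob + 1) % self.rob_size
            self.rat[dest] = tag
            renamed.append((tag, src_tags))
        return renamed

    def retire(self, rob_index, dest):
        """The value moves to the RRF; the RAT is redirected only if no
        younger uop has renamed the same register in the meantime."""
        if self.rat[dest] == ("ROB", rob_index):
            self.rat[dest] = ("RRF", dest)


r = Renamer(["rax", "rbx"])
group = r.rename([
    ("rax", ("rax", "rbx")),  # rax = rax + rbx
    ("rbx", ("rax",)),        # rbx = f(rax): reads ROB entry 0
    ("rax", ("rbx",)),        # rax = g(rbx): a new name for rax
])
```

After this group, the RAT points `rax` at ROB entry 2, so retiring entry 0 leaves the mapping untouched, while retiring entry 2 returns `rax` to the RRF.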
Nehalem's ROB (reorder buffer) grew from 96 to 128 entries and its RS (reservation station) from 32 to 36 entries. Both are shared by the two threads, but with different policies. The ROB is statically partitioned between the two threads, so that both can speculate equally far ahead in the instruction stream. The RS is shared competitively, according to each thread's needs. This is because a thread often stalls while waiting for operands from memory and uses few RS entries, so it is better to let the other, more active thread use as many RS entries as possible. When all operands of an instruction in the RS are ready, the instruction is dispatched to an execution unit.
Compared with Core 2, Nehalem's execution units are essentially unchanged and are unaffected by SMT, apart from being more highly utilized.
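The two sharing policies can be contrasted with a small bookkeeping sketch. The entry counts come from the text; the `SmtResources` class and its methods are hypothetical names, and the model only counts entries rather than tracking individual micro-operations.

```python
class SmtResources:
    """Toy model: ROB statically split per thread, RS shared on demand."""

    def __init__(self, rob_entries=128, rs_entries=36, threads=2):
        self.rob_quota = rob_entries // threads  # fixed share per thread
        self.rob_used = [0] * threads
        self.rs_free = rs_entries                # one pool for all threads
        self.rs_used = [0] * threads

    def allocate(self, thread):
        """Try to allocate one ROB entry and one RS entry for a micro-op."""
        if self.rob_used[thread] >= self.rob_quota:
            return False  # the thread has used up its own ROB partition
        if self.rs_free == 0:
            return False  # the shared RS is full
        self.rob_used[thread] += 1
        self.rs_used[thread] += 1
        self.rs_free -= 1
        return True

    def dispatch(self, thread):
        """A micro-op leaves the RS for an execution unit."""
        if self.rs_used[thread] == 0:
            return False
        self.rs_used[thread] -= 1
        self.rs_free += 1
        return True

    def retire(self, thread):
        """A micro-op retires and frees its ROB entry."""
        if self.rob_used[thread] == 0:
            return False
        self.rob_used[thread] -= 1
        return True


# Thread 1 is stalled holding 4 RS entries; thread 0 takes all the rest.
res = SmtResources()
for _ in range(4):
    res.allocate(1)
taken_by_thread0 = 0
while res.allocate(0):
    taken_by_thread0 += 1

# Even with the RS always free, a thread never exceeds its ROB partition.
res2 = SmtResources()
in_flight = 0
while res2.allocate(0):
    res2.dispatch(0)
    in_flight += 1
```

In the first scenario the active thread ends up with 32 of the 36 RS entries. In the second, allocation stops at 64 micro-operations in flight, half of the ROB, even though the RS is empty.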
Out-of-order execution technology and the VIA Nano processor
Brief introduction
The VIA Nano processor is the first 64-bit superscalar out-of-order processor in VIA's x86 platform series. It aims to revitalize the traditional desktop and notebook market and to deliver high performance for computing, entertainment and network connectivity applications.
The VIA C7 series processors use market-leading energy-saving technology, and the VIA Nano series delivers up to four times the performance within the same power range, further extending VIA's lead in performance per watt. Pin compatibility with the C7 series lets OEMs and motherboard manufacturers move between the two more smoothly and makes it easier to upgrade existing systems and motherboards.
VIA Nano Processor Series
Processor name  Model  Base frequency  VIA V4 front-side bus  Package   Process  Idle power
VIA Nano        L2100  1.8 GHz         800 MHz                NanoBGA2  65 nm    500 mW
VIA Nano        L2200  1.6 GHz         800 MHz                NanoBGA2  65 nm    100 mW
VIA Nano        U2300  1.3+ GHz        800 MHz                NanoBGA2  65 nm    100 mW
VIA Nano        U2500  1.2 GHz         800 MHz                NanoBGA2  65 nm    100 mW
VIA Nano        U2400  1.0 GHz         800 MHz                NanoBGA2  65 nm    100 mW
Key architecture performance
Measure
The VIA Nano processor is built on Fujitsu's advanced 65 nm process technology, which combines high performance with low power consumption. It further consolidates VIA's lead in processor miniaturization and, through its ultra-dense design, enables a new generation of small form factor x86 designs and applications.
Package size: VIA NanoBGA2 package (21 mm x 21 mm)
Die size: 7.650 mm x 8.275 mm (63.3 square mm)
Micro-architecture of 64-bit superscalar out-of-order execution
The VIA Nano processor supports the complete 64-bit instruction set and provides macro-fusion, micro-fusion and sophisticated branch prediction, which further reduce the processor's power consumption and improve its efficiency.
High performance computing and media processing
The VIA Nano processor supports the high-speed, low-power VIA V4 front-side bus starting at 800 MHz, supports new SSE instructions, and has two 64 KB L1 caches and a 1 MB exclusive L2 cache with 16-way associativity, giving a large boost in multimedia performance.
In particular, the VIA Nano processor significantly improves high-performance floating-point operation. It uses a completely new floating-point addition algorithm that gives it the lowest floating-point add latency among x86 processors. Its floating-point multiplier likewise has the lowest floating-point multiply latency.
In practice, this means the VIA Nano processor delivers the performance needed for smooth playback of Blu-ray discs and other high-definition video formats, and can decode media streams at up to 40 Mbps. In addition, its dual-clock floating-point unit (FPU) and 128-bit data paths provide an excellent gaming experience and smooth 3D rendering.
The following figure shows that the VIA Nano processor outperforms the popular VIA C7 processor in computation:
Advanced power and cooling management
The processor offers powerful dynamic power management, including support for the new "C6" power state, PowerSaver technology, new circuit design techniques and mechanisms for managing die temperature, which together reduce power consumption and improve thermal management.
With these innovations and its superscalar structure, the VIA Nano processor achieves a significant performance improvement while staying within the same power envelope as the previous VIA C7 series processors.
The first VIA Nano ULV processor, at 1.0 GHz, has a maximum thermal design power (TDP) of only 5 watts (idle power of only 100 mW), while the VIA Nano processor at 1.8 GHz has a TDP of only 25.5 watts (idle power of 500 mW).
The VIA Nano processor's computing performance has improved while its power consumption remains unchanged, which further raises its performance per watt and strengthens its position in the industry.
Overall performance score in the 2007 test
TDP (thermal design power) of the 1.6 GHz Celeron M = 31 W; TDP of the 1.6 GHz VIA Nano = 17 W.
Operating system = Windows Vista Enterprise Edition
Upgrade path from the VIA C7 processor: the VIA Nano processor is pin-compatible with the VIA C7 processor series, which lets OEMs and motherboard manufacturers move to the new architecture smoothly and address different market segments with a single motherboard or system design.
Green technology: the processor fully complies with the RoHS and WEEE directives, and the products are halogen-free and lead-free, which benefits environmental protection and sustainable computing.
Enhanced VIA PadLock security engine
The VIA Nano processor inherits the hardware encryption accelerators and security features of the VIA processor family, including dual random number generators (RNG), an AES encryption engine, the NX bit and a secure hash engine for SHA-1/SHA-256 calculations.
Processor: AMD Dragon | Intel Core 2 | Intel Atom | VIA C7 | VIA Nano
Secure hash: No | No | No | Full SHA-1 and SHA-256 | Full SHA-1 and SHA-256
Buffer overflow protection: NX bit | NX bit | NX bit | NX bit | NX bit
On-chip encryption: No | No | No | Full AES encryption/decryption acceleration, RSA acceleration, CBC, CFB-MAC and CTR modes at a peak of 25 Gb/s | Full AES encryption/decryption acceleration, RSA acceleration, CBC, CFB-MAC and CTR modes at a peak of 25 Gb/s
Random number generation (RNG): No | No | No | 2 enhanced hardware RNGs with an output feed rate of 12 Mb/s | 2 enhanced hardware RNGs with an output feed rate of 12 Mb/s [7]
Analysis of Barcelona's new features: stack operation and out-of-order execution
Origin
Intel's Pentium M was the first processor to introduce a feature called the "dedicated stack manager", which, as the name implies, is responsible for all x86 stack operations (such as push, pop, call and return). It handles these operations centrally without involving the other execution units, which in particular lightens the load on the CPU's integer execution units and speeds them up.
Technology
AMD introduced a similar technology in Barcelona, which it calls the sideband stack optimizer. With the sideband stack optimizer, stack instructions no longer need to pass through the three-way decoder, nor do they need to be processed by the integer execution units, which speeds up both stack handling and the integer execution units.
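One way such a unit is commonly described is that the decoder keeps a running offset to the stack pointer, so that push and pop need no separate add or subtract in the integer units. The sketch below illustrates that idea only. It is an assumption about the general mechanism, not a description of Intel's or AMD's actual circuits, and the `decode` function, the micro-operation tuples and the register name `sp` are hypothetical.

```python
def decode(instrs, word=8):
    """Translate instructions into micro-ops while tracking the stack
    pointer offset in the decoder instead of in the integer units."""
    delta = 0   # pending, not yet applied, adjustment to sp
    uops = []
    for op, *args in instrs:
        if op == "push":
            delta -= word
            uops.append(("store", f"[sp{delta:+d}]", args[0]))
        elif op == "pop":
            uops.append(("load", args[0], f"[sp{delta:+d}]"))
            delta += word
        else:
            # Any other instruction that names sp needs its real value, so
            # one synchronizing add is inserted before it.
            if "sp" in args and delta != 0:
                uops.append(("add", "sp", delta))
                delta = 0
            uops.append((op, *args))
    return uops, delta


uops, pending = decode([
    ("push", "a"),
    ("push", "b"),
    ("pop", "c"),
    ("mov", "d", "sp"),
])
```

Here three stack instructions produce three memory micro-operations plus a single synchronizing add, where a naive translation would issue one stack pointer update per push or pop.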
An important improvement in Intel's Core microarchitecture is out-of-order execution of loads: when a load at the head of the queue is waiting, the processor can go ahead and execute the loads queued behind it instead of waiting for the blockage to clear. On average, about 30% of instructions are blocked for some time in this way, so this form of out-of-order execution clearly improves the performance of the new architecture. AMD's K8 architecture does not support out-of-order execution of loads, so even though K8 has an excellent built-in memory controller, it still loses to its rival's Core architecture. To close this gap, AMD added the capability to Barcelona, the first chip of the K8L architecture, which is expected to bring a considerable performance improvement.
Barcelona will be able to execute loads out of order: it can use an idle unit to load and process the next instruction before the previous one has finished, provided the two instructions access different memory addresses. Barcelona has three address generation units and can complete three register instructions per cycle, while the Core architecture can execute only one per cycle, so in this respect the K8L architecture is three times as fast as the Core architecture.
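The constraint on reordering loads can be shown with a short sketch: a load may safely move ahead of earlier memory operations as long as it does not pass a store to the same address. The function name `earliest_slot` and the `(kind, address)` encoding are hypothetical, and the sketch assumes all addresses are already known, whereas real hardware has to predict or check this later.

```python
def earliest_slot(ops, i):
    """Return the earliest position to which the load at ops[i] may move
    without passing an older store to the same address.

    ops is a list of (kind, address) tuples in program order.
    """
    kind, addr = ops[i]
    assert kind == "load"
    pos = i
    while pos > 0:
        prev_kind, prev_addr = ops[pos - 1]
        if prev_kind == "store" and prev_addr == addr:
            break  # the load needs this store's data, so it must stay behind it
        pos -= 1
    return pos


ops = [
    ("store", 0x10),
    ("store", 0x20),
    ("load", 0x30),   # independent of both stores
    ("load", 0x10),   # reads what the first store wrote
]
free_load = earliest_slot(ops, 2)
dependent_load = earliest_slot(ops, 3)
```

The independent load can move all the way to the front, while the dependent load can pass the unrelated store but must remain behind the store to address 0x10.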
The K8L architecture adds the SSE4a instruction extensions: the EXTRQ/INSERTQ instructions and the MOVNTSD/MOVNTSS instructions. The former extract and insert bit fields in a single instruction, work that would otherwise take several instructions, and the latter are streaming (non-temporal) scalar store instructions. Intel will add its own SSE4 instructions to the Penryn processor to be released later.