Why did Apple make the bold move and switch to ARM?
A detailed look into the differences between ARM and x86
Have you ever wondered how Apple’s M-chip MacBooks manage to be so blazingly fast yet incredibly energy efficient? The answer lies in a fundamental shift in processor architecture.
Today, we’re diving deep into the world of ARM and x86 processors, and why Apple made the bold move to ditch Intel.
Some information in this article is simplified for explanation purposes, but the fundamentals are still valid.
RISC or CISC architecture
There are two fundamentally different philosophies for designing computer processors that are often compared to each other: RISC and CISC.
RISC, or Reduced Instruction Set Computing, is all about simplicity. It uses a smaller set of simple instructions, which results in less power consumption and reduced heat generation. ARM, which stands for Advanced RISC Machine, is ideal for devices where energy efficiency is a priority, such as mobile devices.
CISC, or Complex Instruction Set Computing, processors are the opposite. They have a more extensive set of instructions, each of which can do more work in a single step. This complexity can provide higher performance for demanding tasks such as video gaming, content creation, and various desktop applications.
Example: Load, Add, Store
But what does it really mean for an instruction to be simple or complex? Let’s look at a real-world example: adding two numbers together.
Say we want to add a constant number c to a number stored in your computer’s memory at location x. Seems simple enough, right? But under the hood it plays out very differently depending on the architecture.
RISC needs 3 instructions: Load, Add, Store
RISC processors, like ARM and the ones found in Apple’s M chips, are very methodical. They can achieve this task via three simple steps:
Load: Grab the value from memory location x.
Add: Add the constant c to it.
Store: Put the result back in memory location x.
LOAD R1, [x]
ADD R1, R1, c
STORE R1, [x]
CISC needs just a single instruction: Add
Okay, so how would we do this in CISC architecture? CISC makes this easy because a single ADD instruction can do all of this work for us. We can write ADD [x], c which loads a value at memory location x, adds constant c to it, and writes the result back to the memory location x.
In Intel’s x86 architecture, the source operand can be a constant value (called an immediate), a register, or a memory location. The destination operand can be either a register or a memory location. The only constraint is that at most one operand can be a memory location.
ADD <destination>, <source>
We usually don’t concern ourselves with low-level details when writing code on a daily basis, but there’s a lot going on under the hood to achieve all of this in a single instruction. However, this flexibility comes with tradeoffs.
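To make the load/add/store contrast concrete, here is a toy Python model (purely illustrative, not a real ISA): it performs the three RISC-style steps and the single CISC-style memory add against the same simulated memory and reaches the same result.

```python
# Toy model of the two approaches (illustrative only, not a real ISA).
def risc_add(memory, x, c):
    """Three explicit steps: load, add, store."""
    r1 = memory[x]        # LOAD  R1, [x]
    r1 = r1 + c           # ADD   R1, R1, c
    memory[x] = r1        # STORE R1, [x]

def cisc_add(memory, x, c):
    """One instruction does all the work: ADD [x], c."""
    memory[x] += c        # load, add, and store fused into one step

mem_a = {0x1000: 40}
mem_b = {0x1000: 40}
risc_add(mem_a, 0x1000, 2)
cisc_add(mem_b, 0x1000, 2)
assert mem_a == mem_b == {0x1000: 42}
```

Both paths produce identical memory contents; the difference is purely in how the work is broken down.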
Instruction Encoding and Decoding
So, we’ve seen how RISC and CISC processors approach tasks differently, but how do they actually understand what to do? It all comes down to how computer programs are represented in memory. When you write a program in your favourite language, the compiler takes the source code and eventually produces a binary file that can be executed.
Every program consists of data and instructions which are encoded as a bunch of bits. For example, your program may look like this:
$ hexdump target/release/examples/techwithnikola-bin | head -n10
0000000 facf feed 000c 0100 0000 0000 0002 0000
0000010 0013 0000 0770 0000 0085 00a0 0000 0000
0000020 0019 0000 0048 0000 5f5f 4150 4547 455a
0000030 4f52 0000 0000 0000 0000 0000 0000 0000
0000040 0000 0000 0001 0000 0000 0000 0000 0000
0000050 0000 0000 0000 0000 0000 0000 0000 0000
0000060 0000 0000 0000 0000 0019 0000 0228 0000
0000070 5f5f 4554 5458 0000 0000 0000 0000 0000
0000080 0000 0000 0001 0000 0000 0028 0000 0000
0000090 0000 0000 0000 0000 0000 0028 0000 0000
Hexadecimal encoding is a common way to represent binary data.
These bytes correspond to instructions and data that the CPU accesses after the operating system loads the binary into memory. So, how can the CPU understand instructions from the binary? Instructions are carefully encoded per the manufacturer’s specification, and the CPU knows how to decode them.
Technically, instructions are decoded by a circuit in the CPU called the decoder, which takes the binary code and decides what the CPU should do. Ultimately, the decoder is a complex arrangement of transistors that form logic gates. These gates work together to analyze the binary patterns and translate them into control signals, which then direct the processor’s actions.
Example: ARM instructions
Let’s look at a few examples of ARM instruction encoding:
ADD R4, R1, #66 tells the processor to add the immediate value 66 to register R1 and store the result in R4. It is encoded as E2814042. Every field of bits within the instruction has a predefined meaning.
ADD R4, R1, R3 tells the processor to add the values in registers R1 and R3 and store the result in R4. It is encoded as E0814003.
LDR R2, [R0] instruction takes the memory location stored in register R0 and loads its value into register R2. It is encoded as E5902000.
And so on.
The beauty of ARM instructions lies in their uniform length. On the Armv7 processor, for instance, every instruction is exactly 32 bits long. This consistency makes the decoder easy to implement.
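As a sketch of how fixed-width decoding works, the following Python snippet pulls the fields out of the encoding E2814042 shown above, using the Armv7 data-processing (immediate) layout. It is a simplified decoder for this one instruction format only, not a general ARM decoder.

```python
# Decode the ARM data-processing (immediate) instruction 0xE2814042,
# i.e. ADD R4, R1, #66. Each bit range has a fixed, predefined meaning.
word = 0xE2814042

cond    = (word >> 28) & 0xF   # condition code (0xE = always execute)
imm_bit = (word >> 25) & 0x1   # 1 = second operand is an immediate
opcode  = (word >> 21) & 0xF   # 0b0100 = ADD
rn      = (word >> 16) & 0xF   # first source register
rd      = (word >> 12) & 0xF   # destination register
rotate  = (word >> 8)  & 0xF   # rotation applied to imm8 (steps of 2 bits)
imm8    = word & 0xFF          # 8-bit immediate value

assert opcode == 0b0100 and rn == 1 and rd == 4 and imm8 == 66
print(f"ADD R{rd}, R{rn}, #{imm8}")   # ADD R4, R1, #66
```

Because every instruction is the same width and the fields sit at fixed bit positions, the hardware version of this logic is just a handful of wire taps and comparators.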
But how to add large immediate values?
However, this approach has some downsides. Some of you may have noticed that ARM cannot add large immediate values in a single ADD instruction: a fixed-width 32-bit instruction simply doesn’t have enough bits to encode both the operation and an arbitrary large immediate.
To work around this problem, compilers generate multiple ADD instructions, or load the immediate value into a register first and then add the two registers. It’s up to the compiler to choose the most efficient sequence.
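For the curious: on Armv7, a data-processing immediate must be expressible as an 8-bit value rotated right by an even amount. A small Python check (a sketch of the rule, not production code) shows why 66 fits in one instruction while a value like 257 does not:

```python
def is_arm_immediate(value):
    """Return True if `value` fits Armv7's rotated-immediate scheme:
    an 8-bit value rotated right by an even amount (0, 2, ..., 30 bits)."""
    value &= 0xFFFFFFFF
    for rot in range(0, 32, 2):
        # Undo a right-rotation by `rot` by rotating left by `rot`.
        candidate = ((value << rot) | (value >> (32 - rot))) & 0xFFFFFFFF
        if candidate < 256:
            return True
    return False

assert is_arm_immediate(66)           # fits directly in 8 bits
assert is_arm_immediate(0xFF000000)   # 0xFF rotated into the top byte
assert not is_arm_immediate(257)      # set bits too far apart to fit
```

Values that fail this check are the ones the compiler must synthesize with multiple instructions or a load from a register.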
Not too many instruction types
Another advantage of ARM processors is that they don’t come with many instructions. Of course, it depends on the specific CPU and on what exactly counts as an instruction, but without being too far off we can say that ARM supports a few hundred instructions.
In contrast, an Intel x86 processor may have thousands of instructions, which adds to the complexity of the CPU architecture.
Example: x86 instructions
Now let’s look at a few examples of Intel’s x86 instruction encoding:
ADD EAX, 9 is encoded as 05 09 00 00 00. 05 is the opcode, which says this instruction adds an immediate source operand to the EAX register; it is followed by four bytes representing 9 in little-endian byte order.
ADD ESI, 15 is encoded as 83 C6 0F. This one has a different length because the opcode 83 is used for instructions that involve an 8-bit immediate value. C6 is the ModR/M byte: it tells the processor that the destination is the ESI register and, combined with the opcode, that this is an ADD. Finally, 0F represents the number 15, encoded as a single byte. The x86 processor figures all of this out from the opcode and the ModR/M byte. However, if we needed to add a larger number, the instruction would require more bytes and a different encoding scheme.
There are many more opcodes and ways to represent an instruction. Notice that instructions have variable length, so the x86 processor must handle that complexity. If we look at the circuitry for decoding an x86 instruction, it contains many logic gates, multiplexers, and latches; it’s quite complex, hence the name Complex Instruction Set Computing.
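The variable length is visible even in this tiny subset. Below is a Python sketch of a decoder that understands just the two encodings discussed above (opcode 05 with a 32-bit immediate, and opcode 83 with a ModR/M byte and an 8-bit immediate). A real x86 decoder handles hundreds of opcodes plus prefixes and addressing modes; this is only a toy.

```python
REG32 = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"]

def decode(code):
    """Decode one instruction from `code`; return (text, length in bytes)."""
    op = code[0]
    if op == 0x05:                        # ADD EAX, imm32
        imm = int.from_bytes(code[1:5], "little")
        return f"ADD EAX, {imm}", 5
    if op == 0x83:                        # group: r/m32, imm8
        modrm = code[1]
        mod, reg, rm = modrm >> 6, (modrm >> 3) & 0b111, modrm & 0b111
        if mod == 0b11 and reg == 0b000:  # register-direct, /0 selects ADD
            return f"ADD {REG32[rm]}, {code[2]}", 3
    raise ValueError("opcode not in this toy subset")

assert decode(bytes([0x05, 0x09, 0x00, 0x00, 0x00])) == ("ADD EAX, 9", 5)
assert decode(bytes([0x83, 0xC6, 0x0F])) == ("ADD ESI, 15", 3)
```

Note that the decoder cannot even know how long an instruction is until it has inspected the first byte (and sometimes more), which is exactly the complexity a fixed-width ISA avoids.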
Decoding Summary
So, to summarize, when we say instructions are simple, we mean:
Fewer Instruction Formats: Having only a handful of formats makes decoding and execution easier.
Fixed Instruction Length: Instructions are usually of a fixed size, e.g. 32 bits, which simplifies instruction fetching and decoding.
Load/Store Architecture: Most operations are performed on registers, not directly on memory; separate instructions are used for loading and storing to memory.
Limited Instruction Set: There are only a few hundred instructions.
When we say instructions are complex, we mean:
Variable Instruction Length: There are many instruction formats with lengths from 1 to 15 bytes, which complicates fetching and decoding.
Complex Addressing Modes: Instructions can operate directly on data in memory or registers.
Large Instruction Set: x86 has an order of magnitude more instructions than ARM.
Multi-stage Execution: Many x86 instructions undergo multiple stages of decoding, which can impact execution speed but provides high functionality per instruction.
Pipeline Utilization
You may wonder: how can ARM processors be more efficient if they require multiple instructions for work that Intel can do in a single instruction? This brings us to the next important topic: pipeline utilization.
Instruction pipelining is a technique that breaks an instruction into smaller parts that can be done in parallel. Every CPU instruction typically goes through a series of stages, collectively known as the instruction cycle or fetch-decode-execute cycle. The specific stages and their details can vary slightly depending on the processor’s architecture.
To improve performance in modern processors, multiple instructions can be in different stages of the cycle simultaneously. While one instruction is being executed, another can be decoded, and yet another can be fetched. This overlapping of stages significantly increases the processor’s throughput, even though each individual instruction still takes the same amount of time to complete the full cycle.
Why does this matter?
ARM processors are particularly well-suited for pipelining. Their simpler instructions, with fewer steps and more predictable execution times, fit neatly into the pipeline stages. This reduces the likelihood of stalls, which occur when one instruction depends on the result of a previous instruction that hasn’t finished yet.
Additionally, ARM’s load/store architecture further enhances pipelining efficiency. By separating memory operations from other instructions, ARM minimizes conflicts among instructions and allows the pipeline to keep humming along, which results in more instructions being processed in parallel for longer.
In contrast, x86 processors, with their complex and variable-length instructions, can be more prone to pipeline stalls. Their instructions might take varying numbers of cycles to complete, making it harder to predict when results will be ready for subsequent instructions.
While this is generally true, it’s worth noting that modern x86 processors also have sophisticated pipelining capabilities. The main difference lies in the complexity of the instructions and how well they fit into the pipeline stages.
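The throughput win from overlapping can be sketched with simple arithmetic. Assuming an idealized k-stage pipeline with no stalls (a textbook model, not a real CPU), n instructions finish in k + (n - 1) cycles instead of k * n:

```python
def cycles_unpipelined(n_instructions, stages=5):
    """Each instruction runs all stages before the next one starts."""
    return n_instructions * stages

def cycles_pipelined(n_instructions, stages=5):
    """A new instruction enters the pipeline every cycle (no stalls)."""
    return stages + (n_instructions - 1)

# 1000 instructions on an idealized 5-stage pipeline:
assert cycles_unpipelined(1000) == 5000
assert cycles_pipelined(1000) == 1004
```

Each instruction still takes all five cycles end to end, but throughput approaches one instruction per cycle; every stall caused by a complex or unpredictable instruction eats directly into that gain.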
Simpler transistors
There are more benefits to simple instructions. Processors usually don’t work directly with instructions as we write them in assembly. Instead, the decoder converts them into so-called micro-ops before executing them. In CISC architectures, a single instruction often expands into multiple micro-ops. RISC instructions, on the other hand, tend to map almost one-to-one onto micro-ops.
As a result, the decoder in ARM processors is much simpler and requires fewer transistors, which in turn means less power spent per instruction.
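As a purely hypothetical illustration (real micro-op expansion tables are proprietary and far more involved), the asymmetry looks like this: one CISC-style memory add expands into several micro-ops, while each RISC instruction maps to roughly one.

```python
# Hypothetical micro-op expansions (illustrative; not a real CPU's tables).
CISC_UOPS = {
    "ADD [x], c": ["uop.load tmp, [x]", "uop.add tmp, tmp, c", "uop.store [x], tmp"],
}
RISC_UOPS = {
    "LOAD R1, [x]":  ["uop.load R1, [x]"],
    "ADD R1, R1, c": ["uop.add R1, R1, c"],
    "STORE R1, [x]": ["uop.store [x], R1"],
}

# One complex instruction expands to three micro-ops...
assert len(CISC_UOPS["ADD [x], c"]) == 3
# ...while each simple instruction maps one-to-one.
assert all(len(uops) == 1 for uops in RISC_UOPS.values())
```

The total work is similar either way; what differs is how much translation machinery the decoder must contain.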
What’s the impact?
Now we understand the benefits of the RISC architecture, but what is the overall impact in practice?
There are a few factors to consider:
Power Consumption: ARM processors are more efficient and use less power, which makes them a very good option for mobile devices.
Heat: ARM processors generate less heat than x86 processors, so they require smaller cooling solutions, which is important for compact devices.
Performance: Historically, ARM processors were seen as less powerful than Intel processors. However, modern ARM processors have closed this gap significantly, especially with designs from companies like Apple with its M-series chips. ARM can compete in high-performance environments.
Does this mean that ARM is just a better processor and everyone should forget about Intel?
Not necessarily. For example, Intel still performs better in high-performance computing environments such as gaming, video editing, and other demanding applications due to its high clock speed. It’s also widely used in enterprise and server solutions because it’s optimized for I/O and virtualization.
I’m really looking forward to seeing if ARM will win this battle in the future.
Why did Apple switch to ARM?
This brings us to the main topic of this article. Why did Apple switch to ARM if Intel can still be more performant? Did they trade performance for battery life?
It turns out that the new M-series processors, also known as Apple Silicon, are a lot more efficient than the older Intel processors in Apple’s Mac devices. In part, this is due to general advances in fabrication technology; the latest M-series chips, for example, are built on a 3nm process. But ARM processors can also achieve better throughput for the same amount of power thanks to efficient pipelining, instructions that require fewer cycles, and simpler decoding circuitry.
But there is one more important factor …
An important difference between ARM and Intel is that ARM only designs processors and licenses those designs; it doesn’t manufacture chips. In contrast, Intel both designs and manufactures its x86 processors.
Why is this important to Apple?
With M-series Apple has full control over both hardware and software, and can adjust the design of ARM processors to provide better synergy between them. For example:
Custom Architecture: By designing their own chips, Apple can tailor the processor architecture specifically to the performance characteristics of their operating systems and applications. This can lead to better optimization of the software on this custom hardware, potentially improving efficiency and speed.
Unified Memory Architecture: Apple Silicon uses a unified memory architecture that allows the CPU and GPU to access the same memory bank. This reduces latency and increases performance, as there’s no need to copy data between multiple memory pools.
Custom Features: Apple can integrate unique hardware features that are specifically designed to run its software more effectively, such as custom machine learning accelerators for tasks that benefit from AI, enhancing applications like voice recognition, image processing, and more.
There are many more advantages, but you get the point.
This is why with new MacBook Pro laptops you get both: very good performance and long battery life.
Compatibility
With that said, programs compiled for Intel processors cannot simply run on M-chips.
Instead of requiring every piece of software to be re-released specifically for M-chips, Apple has mitigated this temporarily by introducing an emulator.
Rosetta 2 emulator
Rosetta 2 essentially translates x86 instructions (used by Intel processors) into ARM instructions (used by Apple Silicon), largely ahead of time when an app is installed and on the fly for dynamically generated code. This allows users to run applications that were originally designed for Intel Macs on their newer ARM-based Macs without the need for developers to recompile their apps.
All of this happens transparently to the user, which is amazing. While Rosetta 2 is impressive, it does introduce some overhead: Intel-based apps might run slightly slower on Apple Silicon Macs compared to native ARM apps. That said, Apple has optimized Rosetta 2 for performance, and the difference is often negligible for everyday tasks.
Rosetta 2 is just a temporary solution
Rosetta 2 is primarily a bridge to facilitate the transition from Intel to Apple Silicon. Developers are encouraged to update their apps to run natively on ARM, as this offers the best performance and efficiency.
Final words
Thanks for joining me on this deep dive into Apple’s ARM revolution. If you’re as excited about this technology as I am, I encourage you to explore it further.
Also, check out my YouTube channel where I talk about computer science and software engineering.