这是indexloc提供的服务,不要输入任何密码
SlideShare a Scribd company logo
Linux Kernel Memory Model
SeongJae Park <sj38.park@gmail.com>
These slides were presented during
4th Korea Linux Kernel Community Meetup
(https://kernel-dev-ko.github.io/4th/)
This work by SeongJae Park is licensed under the
Creative Commons Attribution-ShareAlike 3.0 Unported
License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.
I, SeongJae Park
● SeongJae Park <sj38.park@gmail.com>
● PhD candidate researcher at DCSLAB, SNU
● Part time linux kernel programmer at KOSSLAB
● Interested in memory management and parallel programming for OS kernels
It’s Multi-processor Era, Now
● Processor vendors changed their mind to increase number of cores instead of
clock speed more than a decade ago
○ Now, multi-core systems are prevalent
○ Even octa-core system in your pocket, maybe?
● As a result, the free lunch is over;
parallel programming is essential for high performance and scalability
http://www.gotw.ca/images/CPU.png
GIVE UP :’(
Writing Correct Parallel Program is Hard
● Nature of parallelism is counter-intuitive to human
○ Time is relative, order is ambiguous, and only observations exist
● Compilers and processors do reorderings for performance
● C language developed with Uni-Processor assumption
○ Standard for the kernel (C99) is oblivious of multi-processor systems
CPU 0 CPU 1
X = 1;
Y = 1;
X = 2;
Y = 2;
if (Y == 2 && X == 1)
OOPS();
CPU 1 can call OOPS() in Linux Kernel
Hardware Reorderings
Why and How?
Instruction Reordering Helps Performance
● By reordering dependent instructions far away, total time consumptions can
be reduced
● If the reordering is guaranteed to not change the result of the instruction
sequence, it would be helpful for better performance
● Such reordering is legal and recommended
fetch decode execute
fetch decode execute
fetch decode execute
Instruction 1
Instruction 3
Instruction 2
instruction 2 depends on result of instruction 1
(e.g., first instruction modifies opcode of next instruction)
By reordering instruction 2 and 3, total execution time can be shorten
6 cycles for 3 instructions: 2 cycles per instruction
Time and Order is Relative
● Each CPU concurrently generates their events and
observes effects of events from others
● It is impossible to define absolute order of two concurrent events;
Only relative observation order is possible
CPU 1 CPU 2 CPU 3 CPU 4
Generated
event 1
Generated
event 2
Observed
event 1
followed by
event 2
I observed
event 2
followed by
event 1
Event Bus
Cache Coherency is Per-CacheLine
● It is well known that cache coherency protocol helps system memory
consistency
● In actual, it guarantees only per-cacheline sequential consistency
● Every CPU will eventually agree about global order of each cacheline,
but some CPU can aware the change faster than others,
order of changes between different variables is not guaranteed
CPU 1 CPU 2 CPU 3 CPU 4
A = 1; (1)
B = 1; (2)
A = 2; (3)
B = 2; (4)
The order is:
(1), (2), (3), (4)
Coherent cache protocol
The order is:
(1), (3), (2), (4)
The order is:
(2), (1), (4), (3)
Every CPU is honest;
The system is legally
cache-coherent
An Example with A Mythological Architecture
● Many systems equip hierarchical memory for better performance and space
● Propagation speed of an event to a given core can be influenced by specific
sub-layer of memory
If CPU 0 Message Queue is busy, CPU 2 can observe an event from
CPU 0 (event A) after an event of CPU 1 (event B)
though CPU 1 observed event A before generating event B
CPU 0 CPU 1
Cache
CPU 0
Message
Queue
CPU 1
Message
Queue
Memory
CPU 2 CPU 3
Cache
CPU 2
Message
Queue
CPU 3
Message
Queue
Bus
An Example with A Mythological Architecture
● Many systems equip hierarchical memory for better performance and space
● Propagation speed of an event to a given core can be influenced by specific
sub-layer of memory
If CPU 0 Message Queue is busy, CPU 2 can observe an event from
CPU 0 (event A) after an event of CPU 1 (event B)
though CPU 1 observed event A before generating event B
CPU 0 CPU 1
Cache
CPU 0
Message
Queue
CPU 1
Message
Queue
Memory
CPU 2 CPU 3
Cache
CPU 2
Message
Queue
CPU 3
Message
Queue
Bus
Generate
Event A;
Event A
An Example with A Mythological Architecture
● Many systems equip hierarchical memory for better performance and space
● Propagation speed of an event to a given core can be influenced by specific
sub-layer of memory
If CPU 0 Message Queue is busy, CPU 2 can observe an event from
CPU 0 (event A) after an event of CPU 1 (event B)
though CPU 1 observed event A before generating event B
CPU 0 CPU 1
Cache
CPU 0
Message
Queue
CPU 1
Message
Queue
Memory
CPU 2 CPU 3
Cache
CPU 2
Message
Queue
CPU 3
Message
Queue
Bus
Generate
Event A;
Seen Event A;
Generate
Event B;
Event BEvent A
An Example with A Mythological Architecture
● Many systems equip hierarchical memory for better performance and space
● Propagation speed of an event to a given core can be influenced by specific
sub-layer of memory
If CPU 0 Message Queue is busy, CPU 2 can observe an event from
CPU 0 (event A) after an event of CPU 1 (event B)
though CPU 1 observed event A before generating event B
CPU 0 CPU 1
Cache
CPU 0
Message
Queue
CPU 1
Message
Queue
Memory
CPU 2 CPU 3
Cache
CPU 2
Message
Queue
CPU 3
Message
Queue
Bus
Generate
Event A;
Seen Event A;
Generate
Event B;
Seen Event B;
Event A
Event B
Busy… ;;;
An Example with A Mythological Architecture
● Many systems equip hierarchical memory for better performance and space
● Propagation speed of an event to a given core can be influenced by specific
sub-layer of memory
If CPU 0 Message Queue is busy, CPU 2 can observe an event from
CPU 0 (event A) after an event of CPU 1 (event B)
though CPU 1 observed event A before generating event B
CPU 0 CPU 1
Cache
CPU 0
Message
Queue
CPU 1
Message
Queue
Memory
CPU 2 CPU 3
Cache
CPU 2
Message
Queue
CPU 3
Message
Queue
Bus
Generate
Event A;
Seen Event A;
Generate
Event B;
Seen Event B;
Seen Event A;
Event B
Event A
Yet Another Mythological Architecture
● Store Buffer and Invalidation Queue can reorder observation of stores and
loads
CPU 0
Cache
Store
Buffer
Invalidation
Queue
Memory
CPU 1
Cache
Store
Buffer
Invalidation
Queue
Bus
Compiler Reorderings
C-language Doesn’t Know Multi-Processor
● By the time of initial C-language development, multi-processor was rare
● As a result, C-language has only few guarantees about memory operations
on multi-processor
● Undefined behavior is allowed for undefined case
https://upload.wikimedia.org/wikipedia/commons/thumb/9/95/The_C_Programming
_Language,_First_Edition_Cover_(2).svg/2000px-The_C_Programming_Languag
e,_First_Edition_Cover_(2).svg.png
Compiler Optimizes (Reorders) Code
● Clever compilers try hard (really hard) to optimize code for high IPC
(not for human perspective goals)
○ Converts small, private functions to inline code
○ Reorder memory access code to minimize dependency
○ Simplify unnecessarily complex loops, ...
● Optimization uses term `Undefined behavior` as they want
○ It’s legal, but sometimes do insane things in programmer’s perspective
● Compilers reorder memory accesses based on the C-standard, which is
oblivious of multi-processor systems; this can result in unintended programs
● C11 standard aware the multi-processor systems, though
Synchronization Primitives
Synchronization Primitives
● Because reordering and asynchronous effect propagation is legal,
synchronization primitives are necessary to write human intuitive program
● Most environments including the Linux kernel provide sufficiently enough
synchronization primitives for human intuitive programs
https://s-media-cache-ak0.pinimg.com/236x/42/bc/55/42bc55a6d7e5affe2d0dbe9c872a3df9.jpg
Atomic Operations
● Atomic operations are configured with multiple sub operations
○ E.g., compare1)
-and-exchange2)
, fetch1)
-and-add2)
, and test1)
-and-set2)
● Atomic operations have mutual exclusiveness
○ Middle state of atomic operation execution cannot be seen by others
○ It can be thought of as small critical section that protected by a global lock
● Linux also provides plenty of atomic operations
http://www.scienceclarified.com/photos/atomic-mass-3024.jpg
Memory Barriers
● In general, memory barriers guarantee
effects of memory operations issued before it
to be propagated to other components in the system (e.g., processor)
before memory operations issued after the barrier
● Linux provides various types of memory barriers including compiler memory
barriers, hardware memory barriers, load barriers, store barriers
CPU 1 CPU 2 CPU 3
READ A;
WRITE B;
<BARRIER>;
READ C;
READ A,
WRITE B,
then READ C
occurred
WRITE B,
READ A,
then READ C
occurred
READ A and WRITE B can be reordered but READ C is guaranteed to
be ordered after {READ A, WRITE B}
Partial Ordering Primitives
● Partial ordering primitives include read / write memory barriers, acquire /
release semantics and data / address / control dependencies
● Semantics of partial ordering primitives are somewhat confusing
● Cost of partial ordering can be much cheaper than fully ordering primitives
● Linux also provides partial ordering primitives
https://www.ternvets.co.uk/wp-content/uploads/2017/02/cat-with-cone.jpg
Cost of Synchronization Primitives
● Generally, synchronization primitives are expensive, unscalable
● Each primitive costs differently; the gap is significant
● Correct selection and efficient use of synchronization primitives are important
for the performance and scalability
Performance of various synchronization primitives
LKMM: Linux Kernel
Memory Model
Each Environment Provides Own Memory Model
● Memory model, a.k.a Memory consistency model
● Defines what values can be obtained by load instructions of given code
● Programming environments including Instruction Set Architectures,
Programming languages, Operating systems, etc define their memory model
○ Modern language memory models (e.g., Golang, Rust, Java, C11, …) aware multi-processor,
while C99, which is for the Linux kernel, doesn’t
http://www.sciencemag.org/sites/default/files/styles/article_main_large/public/Memory.jpg?itok=4FmHo7M5
Each ISA Provides Specific Memory Model
● Some architectures have stricter ordering enforcement rule than others
● PA-RISC CPUs are strictest, Alpha is weakest
● Because Linux kernel supports multiple architectures, it defines its memory
model based on weakest one, the Alpha
https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.2015.01.31a.pdf
Linux Kernel Memory Model (LKMM)
● The memory model for Linux kernel programming environment
○ Defines what values can be obtained, given a piece of Linux kernel code, for specific load
instructions in the code
● Linux kernel original memory model is necessary because
○ It uses C99 but C99 memory model is oblivious of multi-processor environment;
C11 memory model aware of multi-processor, but Torvalds doesn’t want it
○ It supports multiple architectures with different ISA-level memory model
https://github.com/sjp38/perfbook
LKMM Providing Ordering Primitives
● Designed for weakest memory model architecture, Alpha
○ Almost every combination of reordering is possible, doesn’t provide address dependency
● Atomic instructions
○ atomix_xchg(), atomic_inc_return(), atomic_dec_return(), …
○ Most of those just use mapping instruction, but provide general guarantees at Kernel level
● Memory barriers
○ Compiler barriers: WRITE_ONCE(), READ_ONCE(), barrier(), ...
○ CPU barriers: mb(), wmb(), rmb(), smp_mb(), smp_wmb(), smp_rmb(), …
○ Semantic barriers: ACQUIRE operations, RELEASE operations, dependencies, …
○ For detail, refer to https://www.kernel.org/doc/Documentation/memory-barriers.txt
● Because different memory ordering primitive has different cost, only
necessary ordering primitives should be used in necessary case for high
performance and scalability
Informal LKMM
● Originally, LKMM was just an informal text, ‘memory-barriers.txt’
● It explains about the Linux Kernel Memory Model in English
(There is Korean translation, too: https://www.kernel.org/doc/Documentation/translations/ko_KR/memory-barriers.txt)
● To use the LKMM to prove your code, you should use Feynman Algorithm
○ Write down your code
○ Think real hard with the ‘memory-barriers.txt’
○ Write down your provement
○ Hard and unreliable, of course!
https://www.kernel.org/doc/Documentation/memory-barriers.txt
Formal and Executable LKMM: Help Arrives at Last
● Jade Alglave, Paul E. McKenney, and some guys made formal LKMM
● It is formal, executable memory model
○ It receives C-like simple code as input
○ The code containing parallel code snippets and a question: can this result happen?
● Provides model for each ordering primitive of LKMM based on concepts
including orders and cycles
● Can be executed with herd7 and klitmus7
○ LKMM extends herd7 and klitmus7 to support LKMM ordering primitives in code
○ herd7 based execution does proof in user mode
○ klitmus7 based execution generates test running kernel module and
wrapper scripts for execution of the module
Steps for LKMM Execution
● Installation
○ LKMM is merged in Linux source tree at tools/memory-model;
Just get the linux source code
○ Install herdtools7 (https://github.com/herd/herdtools7)
● Usage
○ Using herd7 based user mode proof
$ herd7 -conf linux-kernel.cfg <your-litmus-test file>
○ Using klitmus7 based real kernel mode execution
$ mkdir mymodules
$ klitmus7 -o mymodules <your-litmus-test file>
$ cd mymodules ; make
$ sudo sh run.sh
● That’s it! Now you can prove your parallel code for all Linux environments!
Summary
● Nature of parallel programming is counter-intuitive
○ Cannot define order of events without interaction
○ Ordering rule is different for different environment
○ Memory model defines their ordering rule
● For human-intuitive and correct program, interaction is necessary
○ Almost every environment provides memory ordering primitives including atomic instructions
and memory barriers, which is expensive in common
○ Memory model defines what result can occur and cannot with given code snippet
● Formal Linux kernel memory model is available
○ Linux kernel provides its memory model based on weakest memory ordering rule architecture
it supports, the Alpha, C99, and its original ordering primitives including RCU
○ Formal and executable LKMM is in the kernel tree; now you can prove your parallel code!
Thanks
Demonstration
`herdtools` Install
● LKMM (in Linux v4.19) depends on herdtools version 7.49
○ Depending version may change
● `herdtools` depends on OPAM, the package manager for OCaml
● Ubuntu package manager supports OPAM
$ sudo apt install opam
$ opam init && sudo opam update && sudo opam upgrade
$ git clone https://github.com/herd/herdtools7 && cd herdtools7
$ git checkout 7.49
$ make all && make install
$ herd7 -version
7.49, Rev: 93dcbdd89086d5f3e981b280d437309fdeb8b427
Herd7 Based Userspace Litmus Tests Execution
$ herd7 -conf linux-kernel.cfg 
litmus-tests/SB+fencembonceonces.litmus
Test SB+fencembonceonces Allowed
States 3
0:r0=0; 1:r0=1;
0:r0=1; 1:r0=0;
0:r0=1; 1:r0=1;
No
Witnesses
Positive: 0 Negative: 3
Condition exists (0:r0=0 / 1:r0=0)
Observation SB+fencembonceonces Never 0 3
Time SB+fencembonceonces 0.01
Hash=d66d99523e2cac6b06e66f4c995ebb48
Klitmus7 Based Litmus Tests Execution
$ mkdir my; klitmus7 -o my/ 
litmus-tests/SB+fencembonceonces.litmus
$ cd klitmus_test; make; sudo sh ./run.sh
...
Test SB+fencembonceonces Allowed
Histogram (3 states)
16580117:>0:r0=1; 1:r0=0;
16402936:>0:r0=0; 1:r0=1;
3016947 :>0:r0=1; 1:r0=1;
...
Positive: 0, Negative: 36000000
Condition exists (0:r0=0 / 1:r0=0) is NOT validated
Hash=d66d99523e2cac6b06e66f4c995ebb48
Observation SB+fencembonceonces Never 0 36000000
...

More Related Content

What's hot (20)

PPTX
Linux Network Stack
Adrien Mahieux
 
PDF
Kernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven Rostedt
Anne Nicolas
 
PDF
Vmlinux: anatomy of bzimage and how x86 64 processor is booted
Adrian Huang
 
PDF
Decompressed vmlinux: linux kernel initialization from page table configurati...
Adrian Huang
 
PDF
Scheduling in Android
Opersys inc.
 
PPTX
Slab Allocator in Linux Kernel
Adrian Huang
 
PPTX
Understanding DPDK
Denys Haryachyy
 
PDF
The basics of fluentd
Treasure Data, Inc.
 
PPSX
FD.io Vector Packet Processing (VPP)
Kirill Tsym
 
PDF
Ceph and RocksDB
Sage Weil
 
PPTX
Linux Initialization Process (1)
shimosawa
 
PDF
Understanding of linux kernel memory model
SeongJae Park
 
PPTX
Linux Memory Management
Ni Zo-Ma
 
PPT
Linux kernel memory allocators
Hao-Ran Liu
 
PDF
Memory management in Linux kernel
Vadim Nikitin
 
PDF
Qemu device prototyping
Yan Vugenfirer
 
PDF
Kernel_Crash_Dump_Analysis
Buland Singh
 
PDF
Kdump and the kernel crash dump analysis
Buland Singh
 
PDF
HKG15-505: Power Management interactions with OP-TEE and Trusted Firmware
Linaro
 
PDF
OpenWrt From Top to Bottom
Kernel TLV
 
Linux Network Stack
Adrien Mahieux
 
Kernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven Rostedt
Anne Nicolas
 
Vmlinux: anatomy of bzimage and how x86 64 processor is booted
Adrian Huang
 
Decompressed vmlinux: linux kernel initialization from page table configurati...
Adrian Huang
 
Scheduling in Android
Opersys inc.
 
Slab Allocator in Linux Kernel
Adrian Huang
 
Understanding DPDK
Denys Haryachyy
 
The basics of fluentd
Treasure Data, Inc.
 
FD.io Vector Packet Processing (VPP)
Kirill Tsym
 
Ceph and RocksDB
Sage Weil
 
Linux Initialization Process (1)
shimosawa
 
Understanding of linux kernel memory model
SeongJae Park
 
Linux Memory Management
Ni Zo-Ma
 
Linux kernel memory allocators
Hao-Ran Liu
 
Memory management in Linux kernel
Vadim Nikitin
 
Qemu device prototyping
Yan Vugenfirer
 
Kernel_Crash_Dump_Analysis
Buland Singh
 
Kdump and the kernel crash dump analysis
Buland Singh
 
HKG15-505: Power Management interactions with OP-TEE and Trusted Firmware
Linaro
 
OpenWrt From Top to Bottom
Kernel TLV
 

Similar to Linux Kernel Memory Model (20)

PDF
An Introduction to the Formalised Memory Model for Linux Kernel
SeongJae Park
 
PDF
Memory Barriers in the Linux Kernel
Davidlohr Bueso
 
PPTX
Memory model
Yi-Hsiu Hsu
 
PPTX
Memory model
MingdongLiao
 
PPT
Executing Multiple Thread on Modern Processor
NurHadisukmana3
 
PPTX
Architecture of cash memory for engineering ch 3.pptx
magedsrhan773
 
ODP
C11/C++11 Memory model. What is it, and why?
Mikael Rosbacke
 
PPTX
Io sy.stemppt
muthumani mahesh
 
PPT
Distributed shared memory in distributed systems.ppt
lasmonkapota201
 
PDF
Prerequisite knowledge for shared memory concurrency
Viller Hsiao
 
PPTX
Multithreading Fundamentals
PostSharp Technologies
 
PDF
Design of Parallel and HPC, Lecture: Memory Models
ZandruYamanay
 
PPT
Multicore Processors
Smruti Sarangi
 
PDF
KA 5 - Lecture 1 - Parallel Processing.pdf
Carlos701746
 
PPTX
operating system
shreeuva
 
PPSX
linux kernel overview 2013
Rohit Pratap Singh
 
ODP
Multicore
Mark Veltzer
 
PPT
Contiki OS preparation usage with kit CC256
ryanng90
 
PPTX
Operating Systems & Applications
Maulen Bale
 
An Introduction to the Formalised Memory Model for Linux Kernel
SeongJae Park
 
Memory Barriers in the Linux Kernel
Davidlohr Bueso
 
Memory model
Yi-Hsiu Hsu
 
Memory model
MingdongLiao
 
Executing Multiple Thread on Modern Processor
NurHadisukmana3
 
Architecture of cash memory for engineering ch 3.pptx
magedsrhan773
 
C11/C++11 Memory model. What is it, and why?
Mikael Rosbacke
 
Io sy.stemppt
muthumani mahesh
 
Distributed shared memory in distributed systems.ppt
lasmonkapota201
 
Prerequisite knowledge for shared memory concurrency
Viller Hsiao
 
Multithreading Fundamentals
PostSharp Technologies
 
Design of Parallel and HPC, Lecture: Memory Models
ZandruYamanay
 
Multicore Processors
Smruti Sarangi
 
KA 5 - Lecture 1 - Parallel Processing.pdf
Carlos701746
 
operating system
shreeuva
 
linux kernel overview 2013
Rohit Pratap Singh
 
Multicore
Mark Veltzer
 
Contiki OS preparation usage with kit CC256
ryanng90
 
Operating Systems & Applications
Maulen Bale
 
Ad

More from SeongJae Park (20)

PDF
Biscuit: an operating system written in go
SeongJae Park
 
PDF
GCMA: Guaranteed Contiguous Memory Allocator
SeongJae Park
 
PDF
Design choices of golang for high scalability
SeongJae Park
 
PDF
Brief introduction to kselftest
SeongJae Park
 
PDF
Let the contribution begin (EST futures)
SeongJae Park
 
PDF
Porting golang development environment developed with golang
SeongJae Park
 
PDF
gcma: guaranteed contiguous memory allocator
SeongJae Park
 
PDF
An introduction to_golang.avi
SeongJae Park
 
PDF
Develop Android/iOS app using golang
SeongJae Park
 
PDF
Develop Android app using Golang
SeongJae Park
 
PDF
Sw install with_without_docker
SeongJae Park
 
PDF
Git inter-snapshot public
SeongJae Park
 
PDF
(Live) build and run golang web server on android.avi
SeongJae Park
 
PDF
Deep dark-side of git: How git works internally
SeongJae Park
 
PDF
Deep dark side of git - prologue
SeongJae Park
 
PDF
DO YOU WANT TO USE A VCS
SeongJae Park
 
PDF
Experimental android hacking using reflection
SeongJae Park
 
PDF
Hacktime for adk
SeongJae Park
 
PDF
Let the contribution begin
SeongJae Park
 
Biscuit: an operating system written in go
SeongJae Park
 
GCMA: Guaranteed Contiguous Memory Allocator
SeongJae Park
 
Design choices of golang for high scalability
SeongJae Park
 
Brief introduction to kselftest
SeongJae Park
 
Let the contribution begin (EST futures)
SeongJae Park
 
Porting golang development environment developed with golang
SeongJae Park
 
gcma: guaranteed contiguous memory allocator
SeongJae Park
 
An introduction to_golang.avi
SeongJae Park
 
Develop Android/iOS app using golang
SeongJae Park
 
Develop Android app using Golang
SeongJae Park
 
Sw install with_without_docker
SeongJae Park
 
Git inter-snapshot public
SeongJae Park
 
(Live) build and run golang web server on android.avi
SeongJae Park
 
Deep dark-side of git: How git works internally
SeongJae Park
 
Deep dark side of git - prologue
SeongJae Park
 
DO YOU WANT TO USE A VCS
SeongJae Park
 
Experimental android hacking using reflection
SeongJae Park
 
Hacktime for adk
SeongJae Park
 
Let the contribution begin
SeongJae Park
 
Ad

Recently uploaded (20)

PDF
How a Code Plagiarism Checker Protects Originality in Programming
Code Quiry
 
PDF
Lecture A - AI Workflows for Banking.pdf
Dr. LAM Yat-fai (林日辉)
 
PDF
The Past, Present & Future of Kenya's Digital Transformation
Moses Kemibaro
 
PDF
TrustArc Webinar - Data Privacy Trends 2025: Mid-Year Insights & Program Stra...
TrustArc
 
PDF
Productivity Management Software | Workstatus
Lovely Baghel
 
PPTX
Lecture 5 - Agentic AI and model context protocol.pptx
Dr. LAM Yat-fai (林日辉)
 
PPTX
Top Managed Service Providers in Los Angeles
Captain IT
 
PDF
Upskill to Agentic Automation 2025 - Kickoff Meeting
DianaGray10
 
PDF
Women in Automation Presents: Reinventing Yourself — Bold Career Pivots That ...
DianaGray10
 
PDF
CIFDAQ'S Token Spotlight for 16th July 2025 - ALGORAND
CIFDAQ
 
PPTX
The Yotta x CloudStack Advantage: Scalable, India-First Cloud
ShapeBlue
 
PDF
Integrating IIoT with SCADA in Oil & Gas A Technical Perspective.pdf
Rejig Digital
 
PDF
CloudStack GPU Integration - Rohit Yadav
ShapeBlue
 
PPTX
Building and Operating a Private Cloud with CloudStack and LINBIT CloudStack ...
ShapeBlue
 
PDF
Empowering Cloud Providers with Apache CloudStack and Stackbill
ShapeBlue
 
PDF
2025-07-15 EMEA Volledig Inzicht Dutch Webinar
ThousandEyes
 
PDF
OpenInfra ID 2025 - Are Containers Dying? Rethinking Isolation with MicroVMs.pdf
Muhammad Yuga Nugraha
 
PDF
NewMind AI Weekly Chronicles – July’25, Week III
NewMind AI
 
PDF
NewMind AI Journal - Weekly Chronicles - July'25 Week II
NewMind AI
 
PDF
Alpha Altcoin Setup : TIA - 19th July 2025
CIFDAQ
 
How a Code Plagiarism Checker Protects Originality in Programming
Code Quiry
 
Lecture A - AI Workflows for Banking.pdf
Dr. LAM Yat-fai (林日辉)
 
The Past, Present & Future of Kenya's Digital Transformation
Moses Kemibaro
 
TrustArc Webinar - Data Privacy Trends 2025: Mid-Year Insights & Program Stra...
TrustArc
 
Productivity Management Software | Workstatus
Lovely Baghel
 
Lecture 5 - Agentic AI and model context protocol.pptx
Dr. LAM Yat-fai (林日辉)
 
Top Managed Service Providers in Los Angeles
Captain IT
 
Upskill to Agentic Automation 2025 - Kickoff Meeting
DianaGray10
 
Women in Automation Presents: Reinventing Yourself — Bold Career Pivots That ...
DianaGray10
 
CIFDAQ'S Token Spotlight for 16th July 2025 - ALGORAND
CIFDAQ
 
The Yotta x CloudStack Advantage: Scalable, India-First Cloud
ShapeBlue
 
Integrating IIoT with SCADA in Oil & Gas A Technical Perspective.pdf
Rejig Digital
 
CloudStack GPU Integration - Rohit Yadav
ShapeBlue
 
Building and Operating a Private Cloud with CloudStack and LINBIT CloudStack ...
ShapeBlue
 
Empowering Cloud Providers with Apache CloudStack and Stackbill
ShapeBlue
 
2025-07-15 EMEA Volledig Inzicht Dutch Webinar
ThousandEyes
 
OpenInfra ID 2025 - Are Containers Dying? Rethinking Isolation with MicroVMs.pdf
Muhammad Yuga Nugraha
 
NewMind AI Weekly Chronicles – July’25, Week III
NewMind AI
 
NewMind AI Journal - Weekly Chronicles - July'25 Week II
NewMind AI
 
Alpha Altcoin Setup : TIA - 19th July 2025
CIFDAQ
 

Linux Kernel Memory Model

  • 1. Linux Kernel Memory Model SeongJae Park <sj38.park@gmail.com>
  • 2. These slides were presented during 4th Korea Linux Kernel Community Meetup (https://kernel-dev-ko.github.io/4th/)
  • 3. This work by SeongJae Park is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.
  • 4. I, SeongJae Park ● SeongJae Park <sj38.park@gmail.com> ● PhD candidate researcher at DCSLAB, SNU ● Part time linux kernel programmer at KOSSLAB ● Interested in memory management and parallel programming for OS kernels
  • 5. It’s Multi-processor Era, Now ● Processor vendors changed their mind to increase number of cores instead of clock speed more than a decade ago ○ Now, multi-core systems are prevalent ○ Even octa-core system in your pocket, maybe? ● As a result, the free lunch is over; parallel programming is essential for high performance and scalability http://www.gotw.ca/images/CPU.png GIVE UP :’(
  • 6. Writing Correct Parallel Program is Hard ● Nature of parallelism is counter-intuitive to human ○ Time is relative, order is ambiguous, and only observations exist ● Compilers and processors do reorderings for performance ● C language developed with Uni-Processor assumption ○ Standard for the kernel (C99) is oblivious of multi-processor systems CPU 0 CPU 1 X = 1; Y = 1; X = 2; Y = 2; if (Y == 2 && X == 1) OOPS(); CPU 1 can call OOPS() in Linux Kernel
  • 8. Instruction Reordering Helps Performance ● By reordering dependent instructions far away, total time consumptions can be reduced ● If the reordering is guaranteed to not change the result of the instruction sequence, it would be helpful for better performance ● Such reordering is legal and recommended fetch decode execute fetch decode execute fetch decode execute Instruction 1 Instruction 3 Instruction 2 instruction 2 depends on result of instruction 1 (e.g., first instruction modifies opcode of next instruction) By reordering instruction 2 and 3, total execution time can be shorten 6 cycles for 3 instructions: 2 cycles per instruction
  • 9. Time and Order is Relative ● Each CPU concurrently generates their events and observes effects of events from others ● It is impossible to define absolute order of two concurrent events; Only relative observation order is possible CPU 1 CPU 2 CPU 3 CPU 4 Generated event 1 Generated event 2 Observed event 1 followed by event 2 I observed event 2 followed by event 1 Event Bus
  • 10. Cache Coherency is Per-CacheLine ● It is well known that cache coherency protocol helps system memory consistency ● In actual, it guarantees only per-cacheline sequential consistency ● Every CPU will eventually agree about global order of each cacheline, but some CPU can aware the change faster than others, order of changes between different variables is not guaranteed CPU 1 CPU 2 CPU 3 CPU 4 A = 1; (1) B = 1; (2) A = 2; (3) B = 2; (4) The order is: (1), (2), (3), (4) Coherent cache protocol The order is: (1), (3), (2), (4) The order is: (2), (1), (4), (3) Every CPU is honest; The system is legally cache-coherent
  • 11. An Example with A Mythological Architecture ● Many systems equip hierarchical memory for better performance and space ● Propagation speed of an event to a given core can be influenced by specific sub-layer of memory If CPU 0 Message Queue is busy, CPU 2 can observe an event from CPU 0 (event A) after an event of CPU 1 (event B) though CPU 1 observed event A before generating event B CPU 0 CPU 1 Cache CPU 0 Message Queue CPU 1 Message Queue Memory CPU 2 CPU 3 Cache CPU 2 Message Queue CPU 3 Message Queue Bus
  • 12. An Example with A Mythological Architecture ● Many systems equip hierarchical memory for better performance and space ● Propagation speed of an event to a given core can be influenced by specific sub-layer of memory If CPU 0 Message Queue is busy, CPU 2 can observe an event from CPU 0 (event A) after an event of CPU 1 (event B) though CPU 1 observed event A before generating event B CPU 0 CPU 1 Cache CPU 0 Message Queue CPU 1 Message Queue Memory CPU 2 CPU 3 Cache CPU 2 Message Queue CPU 3 Message Queue Bus Generate Event A; Event A
  • 13. An Example with A Mythological Architecture ● Many systems equip hierarchical memory for better performance and space ● Propagation speed of an event to a given core can be influenced by specific sub-layer of memory If CPU 0 Message Queue is busy, CPU 2 can observe an event from CPU 0 (event A) after an event of CPU 1 (event B) though CPU 1 observed event A before generating event B CPU 0 CPU 1 Cache CPU 0 Message Queue CPU 1 Message Queue Memory CPU 2 CPU 3 Cache CPU 2 Message Queue CPU 3 Message Queue Bus Generate Event A; Seen Event A; Generate Event B; Event BEvent A
  • 14. An Example with A Mythological Architecture ● Many systems equip hierarchical memory for better performance and space ● Propagation speed of an event to a given core can be influenced by specific sub-layer of memory If CPU 0 Message Queue is busy, CPU 2 can observe an event from CPU 0 (event A) after an event of CPU 1 (event B) though CPU 1 observed event A before generating event B CPU 0 CPU 1 Cache CPU 0 Message Queue CPU 1 Message Queue Memory CPU 2 CPU 3 Cache CPU 2 Message Queue CPU 3 Message Queue Bus Generate Event A; Seen Event A; Generate Event B; Seen Event B; Event A Event B Busy… ;;;
  • 15. An Example with A Mythological Architecture ● Many systems equip hierarchical memory for better performance and space ● Propagation speed of an event to a given core can be influenced by specific sub-layer of memory If CPU 0 Message Queue is busy, CPU 2 can observe an event from CPU 0 (event A) after an event of CPU 1 (event B) though CPU 1 observed event A before generating event B CPU 0 CPU 1 Cache CPU 0 Message Queue CPU 1 Message Queue Memory CPU 2 CPU 3 Cache CPU 2 Message Queue CPU 3 Message Queue Bus Generate Event A; Seen Event A; Generate Event B; Seen Event B; Seen Event A; Event B Event A
  • 16. Yet Another Mythological Architecture ● Store Buffer and Invalidation Queue can reorder observation of stores and loads CPU 0 Cache Store Buffer Invalidation Queue Memory CPU 1 Cache Store Buffer Invalidation Queue Bus
  • 18. C-language Doesn’t Know Multi-Processor ● By the time of initial C-language development, multi-processor was rare ● As a result, C-language has only few guarantees about memory operations on multi-processor ● Undefined behavior is allowed for undefined case https://upload.wikimedia.org/wikipedia/commons/thumb/9/95/The_C_Programming _Language,_First_Edition_Cover_(2).svg/2000px-The_C_Programming_Languag e,_First_Edition_Cover_(2).svg.png
  • 19. Compiler Optimizes (Reorders) Code ● Clever compilers try hard (really hard) to optimize code for high IPC (not for human perspective goals) ○ Converts small, private functions to inline code ○ Reorder memory access code to minimize dependency ○ Simplify unnecessarily complex loops, ... ● Optimization uses term `Undefined behavior` as they want ○ It’s legal, but sometimes do insane things in programmer’s perspective ● Compilers reorder memory accesses based on the C-standard, which is oblivious of multi-processor systems; this can result in unintended programs ● C11 standard aware the multi-processor systems, though
  • 21. Synchronization Primitives ● Because reordering and asynchronous effect propagation is legal, synchronization primitives are necessary to write human intuitive program ● Most environments including the Linux kernel provide sufficiently enough synchronization primitives for human intuitive programs https://s-media-cache-ak0.pinimg.com/236x/42/bc/55/42bc55a6d7e5affe2d0dbe9c872a3df9.jpg
  • 22. Atomic Operations ● Atomic operations are configured with multiple sub operations ○ E.g., compare1) -and-exchange2) , fetch1) -and-add2) , and test1) -and-set2) ● Atomic operations have mutual exclusiveness ○ Middle state of atomic operation execution cannot be seen by others ○ It can be thought of as small critical section that protected by a global lock ● Linux also provides plenty of atomic operations http://www.scienceclarified.com/photos/atomic-mass-3024.jpg
  • 23. Memory Barriers ● In general, memory barriers guarantee effects of memory operations issued before it to be propagated to other components in the system (e.g., processor) before memory operations issued after the barrier ● Linux provides various types of memory barriers including compiler memory barriers, hardware memory barriers, load barriers, store barriers CPU 1 CPU 2 CPU 3 READ A; WRITE B; <BARRIER>; READ C; READ A, WRITE B, then READ C occurred WRITE B, READ A, then READ C occurred READ A and WRITE B can be reordered but READ C is guaranteed to be ordered after {READ A, WRITE B}
  • 24. Partial Ordering Primitives ● Partial ordering primitives include read / write memory barriers, acquire / release semantics and data / address / control dependencies ● Semantics of partial ordering primitives are somewhat confusing ● Cost of partial ordering can be much cheaper than fully ordering primitives ● Linux also provides partial ordering primitives https://www.ternvets.co.uk/wp-content/uploads/2017/02/cat-with-cone.jpg
  • 25. Cost of Synchronization Primitives ● Generally, synchronization primitives are expensive, unscalable ● Each primitive costs differently; the gap is significant ● Correct selection and efficient use of synchronization primitives are important for the performance and scalability Performance of various synchronization primitives
  • 27. Each Environment Provides Own Memory Model ● Memory model, a.k.a Memory consistency model ● Defines what values can be obtained by load instructions of given code ● Programming environments including Instruction Set Architectures, Programming languages, Operating systems, etc define their memory model ○ Modern language memory models (e.g., Golang, Rust, Java, C11, …) aware multi-processor, while C99, which is for the Linux kernel, doesn’t http://www.sciencemag.org/sites/default/files/styles/article_main_large/public/Memory.jpg?itok=4FmHo7M5
  • 28. Each ISA Provides Specific Memory Model ● Some architectures have stricter ordering enforcement rule than others ● PA-RISC CPUs are strictest, Alpha is weakest ● Because Linux kernel supports multiple architectures, it defines its memory model based on weakest one, the Alpha https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.2015.01.31a.pdf
  • 29. Linux Kernel Memory Model (LKMM) ● The memory model for Linux kernel programming environment ○ Defines what values can be obtained, given a piece of Linux kernel code, for specific load instructions in the code ● Linux kernel original memory model is necessary because ○ It uses C99 but C99 memory model is oblivious of multi-processor environment; C11 memory model aware of multi-processor, but Torvalds doesn’t want it ○ It supports multiple architectures with different ISA-level memory model https://github.com/sjp38/perfbook
  • 30. LKMM Providing Ordering Primitives ● Designed for weakest memory model architecture, Alpha ○ Almost every combination of reordering is possible, doesn’t provide address dependency ● Atomic instructions ○ atomix_xchg(), atomic_inc_return(), atomic_dec_return(), … ○ Most of those just use mapping instruction, but provide general guarantees at Kernel level ● Memory barriers ○ Compiler barriers: WRITE_ONCE(), READ_ONCE(), barrier(), ... ○ CPU barriers: mb(), wmb(), rmb(), smp_mb(), smp_wmb(), smp_rmb(), … ○ Semantic barriers: ACQUIRE operations, RELEASE operations, dependencies, … ○ For detail, refer to https://www.kernel.org/doc/Documentation/memory-barriers.txt ● Because different memory ordering primitive has different cost, only necessary ordering primitives should be used in necessary case for high performance and scalability
  • 31. Informal LKMM ● Originally, LKMM was just an informal text, ‘memory-barriers.txt’ ● It explains about the Linux Kernel Memory Model in English (There is Korean translation, too: https://www.kernel.org/doc/Documentation/translations/ko_KR/memory-barriers.txt) ● To use the LKMM to prove your code, you should use Feynman Algorithm ○ Write down your code ○ Think real hard with the ‘memory-barriers.txt’ ○ Write down your provement ○ Hard and unreliable, of course! https://www.kernel.org/doc/Documentation/memory-barriers.txt
  • 32. Formal and Executable LKMM: Help Arrives at Last ● Jade Alglave, Paul E. McKenney, and some guys made formal LKMM ● It is formal, executable memory model ○ It receives C-like simple code as input ○ The code containing parallel code snippets and a question: can this result happen? ● Provides model for each ordering primitive of LKMM based on concepts including orders and cycles ● Can be executed with herd7 and klitmus7 ○ LKMM extends herd7 and klitmus7 to support LKMM ordering primitives in code ○ herd7 based execution does proof in user mode ○ klitmus7 based execution generates test running kernel module and wrapper scripts for execution of the module
  • 33. Steps for LKMM Execution ● Installation ○ LKMM is merged in Linux source tree at tools/memory-model; Just get the linux source code ○ Install herdtools7 (https://github.com/herd/herdtools7) ● Usage ○ Using herd7 based user mode proof $ herd7 -conf linux-kernel.cfg <your-litmus-test file> ○ Using klitmus7 based real kernel mode execution $ mkdir mymodules $ klitmus7 -o mymodules <your-litmus-test file> $ cd mymodules ; make $ sudo sh run.sh ● That’s it! Now you can prove your parallel code for all Linux environments!
  • 34. Summary ● Nature of parallel programming is counter-intuitive ○ Cannot define order of events without interaction ○ Ordering rule is different for different environment ○ Memory model defines their ordering rule ● For human-intuitive and correct program, interaction is necessary ○ Almost every environment provides memory ordering primitives including atomic instructions and memory barriers, which is expensive in common ○ Memory model defines what result can occur and cannot with given code snippet ● Formal Linux kernel memory model is available ○ Linux kernel provides its memory model based on weakest memory ordering rule architecture it supports, the Alpha, C99, and its original ordering primitives including RCU ○ Formal and executable LKMM is in the kernel tree; now you can prove your parallel code!
  • 37. `herdtools` Install ● LKMM (in Linux v4.19) depends on herdtools version 7.49 ○ Depending version may change ● `herdtools` depends on OPAM, the package manager for OCaml ● Ubuntu package manager supports OPAM $ sudo apt install opam $ opam init && sudo opam update && sudo opam upgrade $ git clone https://github.com/herd/herdtools7 && cd herdtools7 $ git checkout 7.49 $ make all && make install $ herd7 -version 7.49, Rev: 93dcbdd89086d5f3e981b280d437309fdeb8b427
  • 38. Herd7 Based Userspace Litmus Tests Execution $ herd7 -conf linux-kernel.cfg litmus-tests/SB+fencembonceonces.litmus Test SB+fencembonceonces Allowed States 3 0:r0=0; 1:r0=1; 0:r0=1; 1:r0=0; 0:r0=1; 1:r0=1; No Witnesses Positive: 0 Negative: 3 Condition exists (0:r0=0 / 1:r0=0) Observation SB+fencembonceonces Never 0 3 Time SB+fencembonceonces 0.01 Hash=d66d99523e2cac6b06e66f4c995ebb48
  • 39. Klitmus7 Based Litmus Tests Execution $ mkdir my; klitmus7 -o my/ litmus-tests/SB+fencembonceonces.litmus $ cd klitmus_test; make; sudo sh ./run.sh ... Test SB+fencembonceonces Allowed Histogram (3 states) 16580117:>0:r0=1; 1:r0=0; 16402936:>0:r0=0; 1:r0=1; 3016947 :>0:r0=1; 1:r0=1; ... Positive: 0, Negative: 36000000 Condition exists (0:r0=0 / 1:r0=0) is NOT validated Hash=d66d99523e2cac6b06e66f4c995ebb48 Observation SB+fencembonceonces Never 0 36000000 ...