False Sharing

Introduction

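Caches store data in units called cache lines. A cache line is a power-of-two number of contiguous bytes, typically 32 to 256 bytes. The most common cache line size is 64 bytes.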

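False sharing occurs when multiple threads modify independent variables that happen to share a cache line: the threads unintentionally degrade each other's performance.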

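Write contention on cache lines is the single most limiting factor on the scalability of parallel threads running in an SMP system. False sharing has been described as a silent performance killer because it is hard to tell from the code whether it will occur. To scale linearly with the number of threads, you must ensure that no two threads write to the same variable or the same cache line. Two threads writing to the same variable can be spotted in the code. Determining whether independent variables share a cache line requires knowing the memory layout.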
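As a rough illustration of what "sharing a cache line" means, the sketch below maps byte addresses to cache line indices. It assumes a 64-byte line; `cacheLineBytes`, `cacheLineOf`, and `sharesCacheLine` are names made up for this example, and the addresses are hypothetical non-negative values, not something the JVM exposes directly.

```kotlin
// Assumed cache line size. 64 bytes is the most common value, but it is hardware-specific.
val cacheLineBytes = 64L

// Index of the cache line that contains the given (non-negative) byte address.
fun cacheLineOf(address: Long): Long = address / cacheLineBytes

// True when two byte addresses fall into the same cache line.
fun sharesCacheLine(a: Long, b: Long): Boolean = cacheLineOf(a) == cacheLineOf(b)
```

Under this assumption, two 8-byte longs at addresses 0 and 56 share line 0, while longs at 56 and 64 sit in different lines.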


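A thread running on core 1 wants to update variable X while a thread on core 2 wants to update variable Y.

Unfortunately, the two variables sit in the same cache line. Each thread must compete for ownership of that cache line before it can update its variable.

If core 1 gains ownership, the cache subsystem invalidates the corresponding cache line in core 2. When core 2 then gains ownership and performs its update, core 1 has to invalidate its own copy of the line. The line bounces back and forth through the L3 cache, which hurts performance badly. If the competing cores are on different sockets, the traffic must also cross the socket interconnect, and the problem can get even worse.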
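The ownership ping-pong described above can be sketched with a toy model. This is not a real cache coherence protocol: it only counts how often a line changes owner for a given sequence of writes. `countOwnershipTransfers` and its input shape are assumptions made for this example.

```kotlin
// Counts how often exclusive ownership of a cache line moves from one core to another.
// writes: (coreId, lineIndex) pairs in the order the writes happen.
fun countOwnershipTransfers(writes: List<Pair<Int, Long>>): Int {
    val owner = mutableMapOf<Long, Int>()   // lineIndex -> core that currently owns the line
    var transfers = 0
    for ((core, line) in writes) {
        val previous = owner.put(line, core)
        // A transfer happens when a different core owned the line before this write.
        if (previous != null && previous != core) transfers++
    }
    return transfers
}
```

In this model, two cores alternating writes to the same line cause a transfer on almost every write, while the same writes spread over two lines cause none.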


Java Memory Layout

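On the 32-bit HotSpot JVM, every object has a two-word header (32 bits + 32 bits):

  • The first word is the mark word, made up of a 24-bit hash code and 8 flag bits (such as the lock state); it can also be swapped for a lock object.
  • The second word is a reference to the object's class. (Array objects need one additional word to store the array length.)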

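Every object is aligned to an 8-byte boundary for performance. For efficient packing, the JVM therefore reorders the fields of an object from their declaration order into the following order, based on size in bytes:

  • doubles (8) and longs (8)
  • ints (4) and floats (4)
  • shorts (2) and chars (2)
  • booleans (1) and bytes (1)
  • references (4/8)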
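The ordering rule in the list above can be sketched as a sort. This is a simplified model, not HotSpot's actual layout algorithm: the real JVM may also fill alignment gaps with smaller fields, and the details vary between versions. `Field` and `layoutOrder` are hypothetical names used only here.

```kotlin
// Hypothetical field descriptor used only for this illustration.
data class Field(val name: String, val sizeBytes: Int, val isReference: Boolean = false)

// Orders fields as described above: larger primitives first, references last.
fun layoutOrder(fields: List<Field>): List<Field> =
    fields.sortedWith(
        compareBy<Field> { it.isReference }      // false (primitives) sorts before true
            .thenByDescending { it.sizeBytes }   // 8, 4, 2, 1
    )
```

For example, fields declared as a byte, a reference, an int, a long, and a char would be ordered long, int, char, byte, reference under this model.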

Performance Testing

import kotlin.concurrent.thread

class VolatileLong(
    @Volatile
    var value: Long = 0
) {
    // Padding fields. Comment them out to provoke false sharing.
    var p1: Long = 0
    var p2: Long = 0
    var p3: Long = 0
    var p4: Long = 0
    var p5: Long = 0
}

class FalseSharing(
    private val threadCount: Int = 4,
    private val loopTimes: Long = 500 * 1000L * 1000L
) {
    // One VolatileLong per thread; each thread writes only to its own element.
    private val vlongs: Array<VolatileLong> = Array(threadCount) { VolatileLong() }
    private val threads = arrayListOf<Thread>()

    fun start() {
        for (index in 0 until threadCount) {
            threads.add(thread {
                // println("Thread ${Thread.currentThread().id} start")
                var end = loopTimes + index
                while (end != 0L) {
                    vlongs[index].value = end
                    end -= 1
                }
                // println("Thread ${Thread.currentThread().id} stop")
            })
        }

        // Wait for all writer threads to finish.
        for (thread in threads) {
            thread.join()
        }
    }
}

fun main() {
    // 500 million iterations per thread
    val times = 500 * 1000L * 1000L
    val begin = System.currentTimeMillis()
    FalseSharing(loopTimes = times).start()
    val end = System.currentTimeMillis()
    println("consume time ($times): ${end - begin}")
    // Measured results (elapsed time in milliseconds):
    // no false sharing: consume time (500000000): 3278
    // false sharing:    consume time (500000000): 44364
}
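Padding fields are one way to keep per-thread data apart. Another common approach, sketched here as an assumption rather than something taken from the test above, is to space per-thread slots inside one array by a fixed stride. `StridedCounters` is a hypothetical helper; the stride of 8 longs assumes 64-byte cache lines.

```kotlin
// Hypothetical helper: one counter per thread, with slots spaced 8 longs (64 bytes) apart.
// Any two slots are at least 64 bytes apart, so they cannot share a 64-byte cache line.
class StridedCounters(slots: Int, private val strideLongs: Int = 8) {
    // One extra stride at the front keeps slot 0 away from the array header and length.
    private val cells = java.util.concurrent.atomic.AtomicLongArray((slots + 1) * strideLongs)

    private fun indexOf(slot: Int): Int = (slot + 1) * strideLongs

    fun set(slot: Int, value: Long) = cells.set(indexOf(slot), value)

    fun get(slot: Int): Long = cells.get(indexOf(slot))
}
```

Each thread would call `set(index, value)` with its own slot index. The trade-off is wasted memory: seven of every eight longs in the array are never used.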

Code Tools: jol

JOL (Java Object Layout) is the tiny toolbox to analyze object layout schemes in JVMs.

Command Line Tool

java -jar jol-cli-0.9-full.jar internals java.lang.Long

# Running 64-bit HotSpot VM.
# Using compressed oop with 3-bit shift.
# Using compressed klass with 3-bit shift.
# WARNING | Compressed references base/shifts are guessed by the experiment!
# WARNING | Therefore, computed addresses are just guesses, and ARE NOT RELIABLE.
# WARNING | Make sure to attach Serviceability Agent to get the reliable addresses.
# Objects are 8 bytes aligned.
# Field sizes by type: 4, 1, 1, 2, 2, 4, 4, 8, 8 [bytes]
# Array element sizes: 4, 1, 1, 2, 2, 4, 4, 8, 8 [bytes]

Instantiated the sample instance via public java.lang.Long(long)

java.lang.Long object internals:
 OFFSET  SIZE   TYPE DESCRIPTION               VALUE
      0     4        (object header)           01 00 00 00 (00000001 00000000 00000000 00000000) (1)
      4     4        (object header)           00 00 00 00 (00000000 00000000 00000000 00000000) (0)
      8     4        (object header)           f4 22 00 f8 (11110100 00100010 00000000 11111000) (-134208780)
     12     4        (alignment/padding gap)
     16     8   long Long.value                0
Instance size: 24 bytes
Space losses: 4 bytes internal + 0 bytes external = 4 bytes total

Code

<dependency>
    <groupId>org.openjdk.jol</groupId>
    <artifactId>jol-core</artifactId>
    <version>0.9</version>
</dependency>
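import org.openjdk.jol.info.ClassLayout
import org.openjdk.jol.vm.VM

class VolatileLong(
    @Volatile
    var value: Long = 0
) {
    // Padding is disabled here so that the unpadded layout is printed.
    /*
    var p1: Long = 0
    var p2: Long = 0
    var p3: Long = 0
    var p4: Long = 0
    var p5: Long = 0
    var p6: Long = 0
    */
}

fun main() {
    // Print VM details (pointer compression, alignment, field sizes).
    println(VM.current().details())
    // Print the field offsets and total instance size of VolatileLong.
    println(ClassLayout.parseClass(VolatileLong::class.java).toPrintable())
}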
# WARNING: Unable to attach Serviceability Agent. You can try again with escalated privileges. Two options: a) use -Djol.tryWithSudo=true to try with sudo; b) echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
# Running 64-bit HotSpot VM.
# Using compressed oop with 3-bit shift.
# Using compressed klass with 3-bit shift.
# WARNING | Compressed references base/shifts are guessed by the experiment!
# WARNING | Therefore, computed addresses are just guesses, and ARE NOT RELIABLE.
# WARNING | Make sure to attach Serviceability Agent to get the reliable addresses.
# Objects are 8 bytes aligned.
# Field sizes by type: 4, 1, 1, 2, 2, 4, 4, 8, 8 [bytes]
# Array element sizes: 4, 1, 1, 2, 2, 4, 4, 8, 8 [bytes]

com.demo.falsesharing.VolatileLong object internals:
 OFFSET  SIZE   TYPE DESCRIPTION               VALUE
      0    12        (object header)           N/A
     12     4        (alignment/padding gap)
     16     8   long VolatileLong.value        N/A
Instance size: 24 bytes
Space losses: 4 bytes internal + 0 bytes external = 4 bytes total

To make each VolatileLong instance occupy a full 64-byte cache line, change the class to:

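import org.openjdk.jol.info.ClassLayout
import org.openjdk.jol.vm.VM

class VolatileLong(
    @Volatile
    var value: Long = 0
) {
    // Five padding longs bring the instance size up to 64 bytes.
    var p1: Long = 0
    var p2: Long = 0
    var p3: Long = 0
    var p4: Long = 0
    var p5: Long = 0
}

fun main() {
    println(VM.current().details())
    println(ClassLayout.parseClass(VolatileLong::class.java).toPrintable())
}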
com.demo.falsesharing.VolatileLong object internals:
 OFFSET  SIZE   TYPE DESCRIPTION               VALUE
      0    12        (object header)           N/A
     12     4        (alignment/padding gap)
     16     8   long VolatileLong.p1           N/A
     24     8   long VolatileLong.p2           N/A
     32     8   long VolatileLong.p3           N/A
     40     8   long VolatileLong.p4           N/A
     48     8   long VolatileLong.p5           N/A
     56     8   long VolatileLong.value        N/A
Instance size: 64 bytes
Space losses: 4 bytes internal + 0 bytes external = 4 bytes total
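The instance sizes reported by JOL can be reproduced with a little arithmetic. The sketch below assumes a 12-byte header, 8-byte object alignment, and fields that are all longs, matching the output above. `alignUp` and `instanceSizeWithLongs` are hypothetical helpers, not part of JOL.

```kotlin
// Rounds size up to the next multiple of alignment (alignment must be a power of two).
fun alignUp(size: Int, alignment: Int): Int =
    (size + alignment - 1) and (alignment - 1).inv()

// Instance size of an object whose only fields are `longFields` longs,
// assuming a 12-byte header and 8-byte object alignment.
fun instanceSizeWithLongs(longFields: Int, headerBytes: Int = 12, alignment: Int = 8): Int {
    val firstFieldOffset = alignUp(headerBytes, 8)   // longs start on an 8-byte boundary
    return alignUp(firstFieldOffset + longFields * 8, alignment)
}
```

With one long field this gives 16 + 8 = 24 bytes, and with six long fields (value plus p1 to p5) it gives 16 + 48 = 64 bytes, the same sizes JOL prints.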

Reference

https://mechanical-sympathy.blogspot.com/2011/07/false-sharing.html
http://openjdk.java.net/projects/code-tools/jol/
http://ifeve.com/falsesharing/
http://hg.openjdk.java.net/code-tools/jol/file/tip/jol-samples/src/main/java/org/openjdk/jol/samples/
[The CPU Cache Flushing Fallacy] http://ifeve.com/cpu-cache-flushing-fallacy-cn/