From the shell to main(): a walk through application startup and execution

Posted on 2024/04/09 by qing

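This post walks, stage by stage, through what happens between typing a command in the shell and the program's main() function being called.

0 Motivation

I used ftrace to capture the page faults of a small program, in order to see where a simple program triggers them (strace gives a rough picture through the system calls, but it cannot pinpoint which instruction faulted). The script looks roughly like this: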

cd /sys/kernel/debug/tracing
echo 1 > events/exceptions/enable   # enable exception (page-fault) events
echo 1 > events/syscalls/enable     # also trace system calls, for context
echo nop > current_tracer           # no function tracing
echo 8 > tracing_cpumask            # hex mask 0x8: trace only the isolated CPU 3
echo  > trace                       # clear the buffer before recording
echo 1 > tracing_on                 # start recording
sudo taskset -c 3 /path/to/a.out    # run on CPU 3
cat trace

The source of the test program:

#include <unistd.h>

int main() {
  write(STDOUT_FILENO, "hello\n", 6);
  return 0;
}

The trace output:

# tracer: nop
#
# entries-in-buffer/entries-written: 34/34   #P:8
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| /     delay
           a.out-12405   [003] ...1 89386.976089: sys_sched_setaffinity -> 0x0
           a.out-12405   [003] ...1 89386.976091: sys_execve(filename: 7fffffffe7ef, argv: 7fffffffe5a0, envp: 7fffffffe5b0)
           a.out-12405   [003] d... 89386.976143: page_fault_kernel: address=0x555555558010 ip=__clear_user error_code=0x2
           a.out-12405   [003] d... 89386.976149: page_fault_kernel: address=0x7ffff7ffdff8 ip=__clear_user error_code=0x2
           a.out-12405   [003] ...1 89386.976154: sys_execve -> 0x0
           a.out-12405   [003] d... 89386.991695: page_fault_user: address=0x7ffff7fd0100 ip=0x7ffff7fd0100 error_code=0x14
           a.out-12405   [003] d... 89386.991697: page_fault_user: address=0x7ffff7ffc5e0 ip=0x7ffff7fd0e18 error_code=0x6
           a.out-12405   [003] d... 89386.991698: page_fault_user: address=0x7ffff7fcfb00 ip=0x7ffff7fd1128 error_code=0x4
           a.out-12405   [003] d... 89386.991699: page_fault_user: address=0x7ffff7feb700 ip=0x7ffff7feb700 error_code=0x14
           a.out-12405   [003] d... 89386.991701: page_fault_user: address=0x7ffff7ffe138 ip=0x7ffff7feb737 error_code=0x6
           a.out-12405   [003] d... 89386.991702: page_fault_user: address=0x7ffff7ff38b4 ip=0x7ffff7feb81a error_code=0x4
           a.out-12405   [003] ...1 89386.991705: sys_brk(brk: 0)
           a.out-12405   [003] ...1 89386.991705: sys_brk -> 0x555555559000
           a.out-12405   [003] ...1 89386.991707: sys_arch_prctl(option: 3001, arg2: 7fffffffe4c0)
           a.out-12405   [003] ...1 89386.991707: sys_arch_prctl -> 0xffffffffffffffea
           a.out-12405   [003] d... 89386.991707: page_fault_user: address=0x7ffff7ff1e20 ip=0x7ffff7ff1e20 error_code=0x14
           a.out-12405   [003] ...1 89386.991711: sys_access(filename: 7ffff7ff79e0, mode: 4)
           a.out-12405   [003] ...1 89386.991712: sys_access -> 0xfffffffffffffffe
           a.out-12405   [003] d... 89386.991712: page_fault_user: address=0x7fffffffdd80 ip=0x7ffff7fde54f error_code=0x6
...... (remaining trace records omitted)

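Note: address is the memory address whose access faulted, ip is the address of the faulting instruction, and error_code encodes the kind of fault (page not present or permission violation, read or write, and so on). The ip values are runtime addresses, so gdb is needed to see which instruction lives at each one.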
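The error_code values in the trace can be decoded bit by bit. The sketch below follows the x86 page-fault error code layout; the X86_PF_* names mirror the ones used in the Linux kernel, while pf_describe is a helper name made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* x86 page-fault error code bits. */
enum {
    X86_PF_PROT  = 1u << 0,  /* 0: page not present, 1: protection violation */
    X86_PF_WRITE = 1u << 1,  /* 0: read access,      1: write access */
    X86_PF_USER  = 1u << 2,  /* 0: kernel mode,      1: user mode */
    X86_PF_RSVD  = 1u << 3,  /* reserved bit set in a page-table entry */
    X86_PF_INSTR = 1u << 4   /* fault caused by an instruction fetch */
};

/* Hypothetical helper: write a readable description of error_code into buf. */
const char *pf_describe(unsigned code, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    snprintf(buf, len, "%s, %s, %s",
             (code & X86_PF_USER) ? "user" : "kernel",
             (code & X86_PF_INSTR) ? "instruction fetch"
                                   : (code & X86_PF_WRITE) ? "write" : "read",
             (code & X86_PF_PROT) ? "protection violation" : "page not present");
    return buf;
}
```

Read this way, 0x14 in the trace is a user-mode instruction fetch from a page that is not present, 0x6 is a user-mode write to a non-present page, 0x4 is a user-mode read of a non-present page, and 0x2 is a kernel-mode write to a non-present page.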

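From the moment the shell runs the command until main is running, there are three phases: 1) the shell phase; 2) the dynamic linker phase; 3) the user program phase.

The shell phase forks a new process, which then becomes the user program with the dynamic linker mapped alongside it;

The dynamic linker phase loads the shared objects the executable depends on and performs symbol relocation;

The user program phase includes the steps that end up calling the user's main.

Note: both the dynamic linker phase and the user program phase begin at a symbol named _start, and some articles confuse the two.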

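1 Shell phase

The shell parses the command line, for example resolving the command to a path (ls -> /usr/bin/ls).

It then creates a child process with fork(). The child calls the sys_execve system call with the program from the command line as its argument, which replaces the child's image with that program.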
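What the shell does here can be sketched in a few lines of C. This is a minimal illustration rather than how any particular shell is written; spawn_and_wait is a hypothetical name, and execvp (which searches PATH and then issues execve) stands in for the shell's own path lookup.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] (looked up in PATH) in a child process and return its exit
 * status, or -1 if the child could not be created or did not exit normally. */
int spawn_and_wait(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();            /* duplicate the calling process */
    if (pid < 0)
        return -1;
    if (pid == 0) {                /* child: replace its image with the program */
        execvp(argv[0], argv);     /* ends up in the execve system call */
        _exit(127);                /* only reached if exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)   /* parent: wait for the child */
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Everything described in the rest of this post happens inside the child, between the execvp call and the first instruction of main.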

First, look at the first five records of the trace:

           a.out-12405   [003] ...1 89386.976089: sys_sched_setaffinity -> 0x0
           a.out-12405   [003] ...1 89386.976091: sys_execve(filename: 7fffffffe7ef, argv: 7fffffffe5a0, envp: 7fffffffe5b0)
           a.out-12405   [003] d... 89386.976143: page_fault_kernel: address=0x555555558010 ip=__clear_user error_code=0x2
           a.out-12405   [003] d... 89386.976149: page_fault_kernel: address=0x7ffff7ffdff8 ip=__clear_user error_code=0x2
           a.out-12405   [003] ...1 89386.976154: sys_execve -> 0x0

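The execve-related functions live in fs/exec.c. The main call chain from do_execve() / do_execveat() is as follows (based on the 5.4 kernel):

do_execveat_common(fd, filename, argv, envp, flags);
fs/exec.c, __do_execve_file(): retval = exec_binprm(bprm);
fs/exec.c, exec_binprm(): ret = search_binary_handler(bprm);
fs/exec.c, search_binary_handler(): fmt->load_binary(bprm);
fs/binfmt_elf.c, load_elf_binary(): parses the ELF header, sets up the address space, maps the program segments, maps the dynamic linker named by PT_INTERP, initializes the program state, and sets the entry point to jump to. The pages themselves are faulted in on first access; loading shared libraries and relocation are left to the dynamic linker in user space.

See the kernel source for the details.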
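The two pieces of information the kernel needs most from the ELF file, the entry point and the path of the dynamic linker (the PT_INTERP segment), can also be read from user space. The sketch below is only an illustration of the file layout, not the kernel's code; it assumes a 64-bit ELF file, and elf_entry_and_interp is a hypothetical name.

```c
#include <assert.h>
#include <elf.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Read the entry point and the PT_INTERP path (the dynamic linker) of a
 * 64-bit ELF file. interp is left as "" when there is no PT_INTERP segment.
 * Returns 0 on success, -1 on error. */
int elf_entry_and_interp(const char *path, uint64_t *entry,
                         char *interp, size_t len)
{
    assert(path && entry && interp && len > 0);
    interp[0] = '\0';

    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    Elf64_Ehdr eh;
    int ok = fread(&eh, sizeof eh, 1, f) == 1 &&
             memcmp(eh.e_ident, ELFMAG, SELFMAG) == 0 &&
             eh.e_ident[EI_CLASS] == ELFCLASS64;
    if (ok)
        *entry = eh.e_entry;       /* address of _start in the file's view */

    /* Walk the program headers looking for PT_INTERP. */
    for (unsigned i = 0; ok && i < eh.e_phnum; i++) {
        Elf64_Phdr ph;
        long off = (long)(eh.e_phoff + (uint64_t)i * eh.e_phentsize);
        if (fseek(f, off, SEEK_SET) != 0 || fread(&ph, sizeof ph, 1, f) != 1) {
            ok = 0;
        } else if (ph.p_type == PT_INTERP) {
            size_t n = ph.p_filesz < len - 1 ? (size_t)ph.p_filesz : len - 1;
            if (fseek(f, (long)ph.p_offset, SEEK_SET) != 0 ||
                fread(interp, 1, n, f) != n)
                ok = 0;
            interp[n] = '\0';
            break;
        }
    }
    fclose(f);
    return ok ? 0 : -1;
}
```

For a dynamically linked program the interp string names the dynamic linker, and that is the code that runs first when execve returns to user mode.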

2.1 libc-start/_start

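First, look at the third page-fault record in the trace, which is the first page fault raised in user mode:

a.out-12405   [003] d... 89386.991695: page_fault_user: address=0x7ffff7fd0100 ip=0x7ffff7fd0100 error_code=0x14

The faulting instruction address equals the faulting memory address, so this is an instruction-fetch page fault. Use gdb to see which instruction sits at that address (0x7ffff7fd0100):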

(gdb) x/5i 0x7ffff7fd0100
   0x7ffff7fd0100 <_start>:	mov    %rsp,%rdi
   0x7ffff7fd0103 <_start+3>:	callq  0x7ffff7fd0df0 <_dl_start>
   0x7ffff7fd0108 <_dl_start_user>:	mov    %rax,%r12
   0x7ffff7fd010b <_dl_start_user+3>:	mov    0x2c4e7(%rip),%eax        # 0x7ffff7ffc5f8 <_dl_skip_args>
   0x7ffff7fd0111 <_dl_start_user+9>:	pop    %rdx

This is where the dynamic linker starts executing once execve() returns to user mode. The source is in glibc's sysdeps/x86_64/dl-machine.h:

#define RTLD_START asm ("\n\
.text\n\
	.align 16\n\
.globl _start\n\
.globl _dl_start_user\n\
_start:\n\
	movq %rsp, %rdi\n\
	call _dl_start\n\
_dl_start_user:\n\
	# Save the user entry point address in %r12.\n\
	movq %rax, %r12\n\

The dynamic linker's _start here must be kept apart from the user program's _start.
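One way to see the two apart from inside a running program is the auxiliary vector that the kernel passes along at execve time. The sketch below assumes glibc 2.16 or later, which provides getauxval(); the function names are made up for this illustration. AT_ENTRY is the user program's entry point, and AT_BASE is the address at which the dynamic linker was mapped.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/auxv.h>

/* Entry point of the user program: the address the dynamic linker jumps to. */
unsigned long program_entry(void)
{
    return getauxval(AT_ENTRY);
}

/* Load address of the dynamic linker (0 for a statically linked program). */
unsigned long interpreter_base(void)
{
    return getauxval(AT_BASE);
}

void print_start_addresses(void)
{
    unsigned long entry = program_entry();
    assert(entry != 0);
    printf("user program _start (AT_ENTRY): %#lx\n", entry);
    printf("dynamic linker base (AT_BASE) : %#lx\n", interpreter_base());
}
```

The dynamic linker's own _start lies inside the mapping that begins at AT_BASE, while AT_ENTRY lies inside the executable's mapping.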

2.2 _dl_start

Now the next page fault:

a.out-12405   [003] d... 89386.991697: page_fault_user: address=0x7ffff7ffc5e0 ip=0x7ffff7fd0e18 error_code=0x6

Check what the instruction at ip does:

(gdb) x/5i 0x7ffff7fd0e18
   0x7ffff7fd0e18 <_dl_start+40>:	mov    %rax,0x2b7c1(%rip)        # 0x7ffff7ffc5e0 <start_time>
   0x7ffff7fd0e1f <_dl_start+47>:	mov    0x2c042(%rip),%rax        # 0x7ffff7ffce68
   0x7ffff7fd0e26 <_dl_start+54>:	mov    %rdx,%r12
   0x7ffff7fd0e29 <_dl_start+57>:	sub    0x2c1d0(%rip),%r12        # 0x7ffff7ffd000
   0x7ffff7fd0e30 <_dl_start+64>:	mov    %rdx,0x2cbc1(%rip)        # 0x7ffff7ffd9f8 <_rtld_global+2456>

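This fault occurs because _start, the dynamic linker's entry point, calls _dl_start (in elf/rtld.c), and _dl_start writes to the address of start_time (the statement is rtld_timer_start (&start_time);), which triggers the page fault.

The remaining page faults in the trace (and the instructions and functions at their addresses, such as _dl_sysdep_start) are not traced in detail here.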

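3.1 User program _start

Picking up where we left off, let's see what the dynamic linker's _start jumps to at the end (this is not visible in a trace of page faults and system calls). Set a breakpoint at the address of _start (0x7ffff7fd0100):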

(gdb) x/100i 0x7ffff7fd0100
=> 0x7ffff7fd0100 <_start>:	mov    %rsp,%rdi
   0x7ffff7fd0103 <_start+3>:	callq  0x7ffff7fd0df0 <_dl_start>
   0x7ffff7fd0108 <_dl_start_user>:	mov    %rax,%r12
   0x7ffff7fd010b <_dl_start_user+3>:	mov    0x2c4e7(%rip),%eax        # 0x7ffff7ffc5f8 <_dl_skip_args>
   0x7ffff7fd0111 <_dl_start_user+9>:	pop    %rdx
   0x7ffff7fd0112 <_dl_start_user+10>:	lea    (%rsp,%rax,8),%rsp
   0x7ffff7fd0116 <_dl_start_user+14>:	sub    %eax,%edx
   0x7ffff7fd0118 <_dl_start_user+16>:	push   %rdx
   0x7ffff7fd0119 <_dl_start_user+17>:	mov    %rdx,%rsi
   0x7ffff7fd011c <_dl_start_user+20>:	mov    %rsp,%r13
   0x7ffff7fd011f <_dl_start_user+23>:	and    $0xfffffffffffffff0,%rsp
   0x7ffff7fd0123 <_dl_start_user+27>:	mov    0x2cf36(%rip),%rdi        # 0x7ffff7ffd060 <_rtld_global>
   0x7ffff7fd012a <_dl_start_user+34>:	lea    0x10(%r13,%rdx,8),%rcx
   0x7ffff7fd012f <_dl_start_user+39>:	lea    0x8(%r13),%rdx
   0x7ffff7fd0133 <_dl_start_user+43>:	xor    %ebp,%ebp
   0x7ffff7fd0135 <_dl_start_user+45>:	callq  0x7ffff7fe0c20 <_dl_init>
   0x7ffff7fd013a <_dl_start_user+50>:	lea    0x10c1f(%rip),%rdx        # 0x7ffff7fe0d60 <_dl_fini>
   0x7ffff7fd0141 <_dl_start_user+57>:	mov    %r13,%rsp
   0x7ffff7fd0144 <_dl_start_user+60>:	jmpq   *%r12
   ......

The code finally jumps to the address held in %r12 (which is in fact the user program's _start). Let's see what is there:

(gdb) info registers 
......
r12            0x555555555060      93824992235616
......
(gdb) x/20i 0x555555555060
0x555555555060 <_start>: endbr64
0x555555555064 <_start+4>: xor %ebp,%ebp
0x555555555066 <_start+6>: mov %rdx,%r9
0x555555555069 <_start+9>: pop %rsi
0x55555555506a <_start+10>: mov %rsp,%rdx
0x55555555506d <_start+13>: and $0xfffffffffffffff0,%rsp
0x555555555071 <_start+17>: push %rax
0x555555555072 <_start+18>: push %rsp
0x555555555073 <_start+19>: lea 0x166(%rip),%r8 # 0x5555555551e0 <__libc_csu_fini>
0x55555555507a <_start+26>: lea 0xef(%rip),%rcx # 0x555555555170 <__libc_csu_init>
0x555555555081 <_start+33>: lea 0xc1(%rip),%rdi # 0x555555555149
0x555555555088 <_start+40>: callq *0x2f52(%rip) # 0x555555557fe0
0x55555555508e <_start+46>: hlt

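This _start belongs to the user program; its code is clearly different from the dynamic linker's _start. The user program's _start comes from the C runtime startup object (crt1.o/Scrt1.o, shipped with glibc) that GCC links into the executable. It sits at the beginning of the .text section and runs first. objdump shows its assembly:

#objdump -d a.out
(other sections omitted)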
Disassembly of section .text:

0000000000001060 <_start>:
    1060:	f3 0f 1e fa          	endbr64 
    1064:	31 ed                	xor    %ebp,%ebp
    1066:	49 89 d1             	mov    %rdx,%r9
    1069:	5e                   	pop    %rsi
    106a:	48 89 e2             	mov    %rsp,%rdx
    106d:	48 83 e4 f0          	and    $0xfffffffffffffff0,%rsp
    1071:	50                   	push   %rax
    1072:	54                   	push   %rsp
    1073:	4c 8d 05 66 01 00 00 	lea    0x166(%rip),%r8        # 11e0 <__libc_csu_fini>
    107a:	48 8d 0d ef 00 00 00 	lea    0xef(%rip),%rcx        # 1170 <__libc_csu_init>
    1081:	48 8d 3d c1 00 00 00 	lea    0xc1(%rip),%rdi        # 1149 <main>
    1088:	ff 15 52 2f 00 00    	callq  *0x2f52(%rip)        # 3fe0 <__libc_start_main@GLIBC_2.2.5>
    108e:	f4                   	hlt    
    108f:	90                   	nop
(other sections omitted)

One key job of _start is to call __libc_start_main.

3.2 __libc_start_main

__libc_start_main is also defined in the C library (glibc). In glibc 2.31, the version I use, it takes 7 parameters:

# define LIBC_START_MAIN __libc_start_main
STATIC int LIBC_START_MAIN ( int (*main) (int, char **, char ** MAIN_AUXVEC_DECL), 
                             int argc, 
                             char **argv,  
                             __typeof (main) init, 
                             void (*fini) (void), 
                             void (*rtld_fini) (void), 
                             void *stack_end) {

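The main job of the user program's _start shown above is to set up the arguments for the call to __libc_start_main. On x86-64 the first 6 arguments are passed in registers (the 7th, stack_end, goes on the stack via push %rsp). Let's look at them: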

(gdb) disassemble _start
Dump of assembler code for function _start:
   0x0000555555555060 <+0>:	endbr64 
   0x0000555555555064 <+4>:	xor    %ebp,%ebp
   0x0000555555555066 <+6>:	mov    %rdx,%r9
   0x0000555555555069 <+9>:	pop    %rsi
   0x000055555555506a <+10>:	mov    %rsp,%rdx
   0x000055555555506d <+13>:	and    $0xfffffffffffffff0,%rsp
   0x0000555555555071 <+17>:	push   %rax
   0x0000555555555072 <+18>:	push   %rsp
   0x0000555555555073 <+19>:	lea    0x166(%rip),%r8        # 0x5555555551e0 <__libc_csu_fini>
   0x000055555555507a <+26>:	lea    0xef(%rip),%rcx        # 0x555555555170 <__libc_csu_init>
   0x0000555555555081 <+33>:	lea    0xc1(%rip),%rdi        # 0x555555555149 <main>
=> 0x0000555555555088 <+40>:	callq  *0x2f52(%rip)        # 0x555555557fe0
   0x000055555555508e <+46>:	hlt    
End of assembler dump.
(gdb) inf reg
rdi            0x555555555149      93824992235849            # main
rsi            0x1                 1                         # argc
rdx            0x7fffffffe598      140737488348568           # argv
rcx            0x555555555170      93824992235888            # <__libc_csu_init>
r8             0x5555555551e0      93824992236000            # <__libc_csu_fini>
r9             0x7ffff7fe0d60      140737354009952           # <_dl_fini>
// other registers omitted

The register values show the arguments being passed. rdx points to argv, which holds the command-line arguments:

(gdb) x/gx 0x7fffffffe598
0x7fffffffe598: 0x00007fffffffe7eb
(gdb) x/20c 0x00007fffffffe7eb
0x7fffffffe7eb:	47 '/'	104 'h'	111 'o'	109 'm'	101 'e'	47 '/'	102 'f'	111 'o'
0x7fffffffe7f3:	111 'o'	108 'l'	47 '/'	116 't'	109 'm'	112 'p'	47 '/'	97 'a'
0x7fffffffe7fb:	46 '.'	111 'o'	117 'u'	116 't'
(gdb) x/s *(char **) (0x7fffffffe598)
0x7fffffffe7eb:	"/home/fool/tmp/a.out"

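The main steps of __libc_start_main are:

Register __libc_csu_fini (and the dynamic linker's _dl_fini) to run at exit, and call __libc_csu_init to run the initializers
Call the application's main
Call exit with main's return value to terminate the program.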
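The ordering of these steps can be observed from an ordinary program. The sketch below assumes GCC or Clang (for the constructor attribute) and uses made-up names; a constructor in the executable is run by the start-up code before main, and a function registered with atexit is run from exit after main returns.

```c
#include <assert.h>
#include <stdlib.h>

/* Records the order in which start-up and shutdown hooks run. */
static char order_log[8];
static int  order_len;

static void record(char tag)
{
    assert(order_len < (int)sizeof order_log);
    order_log[order_len++] = tag;
}

/* Runs before main(), as part of the initialization step. */
__attribute__((constructor)) static void before_main(void)
{
    record('I');
}

/* Runs after main() returns, from exit(). */
static void after_main(void)
{
    record('F');
}

/* Call at the top of main(): logs 'M', registers after_main with atexit,
 * and returns the log so far. */
const char *startup_order(void)
{
    record('M');
    atexit(after_main);
    return order_log;
}
```

Called first thing in main, startup_order() returns a log in which the constructor's entry already precedes the entry for main.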

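That is the end of the walk.

Both the dynamic linker's _start and __libc_start_main are quite involved; they cannot be covered in one sentence, or even in one article, which is a pity.

References: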

https://www.gnu.org/software/hurd/glibc/startup.html

https://stackoverflow.com/questions/62709030/what-is-libc-start-main-and-start

Linux x86 Program Start Up (dbp-consulting.com)

https://tldp.org/LDP/LG/issue84/hawk.html

https://stackoverflow.com/questions/9885545/how-to-find-the-main-functions-entry-point-of-elf-executable-file-without-any-s

https://www.cnblogs.com/jiqingwu/p/linux_binary_load_and_run.html

CPU frequency, superscalar execution, and IPC

Posted on 2023/11/03 by qing

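x86 processors can report the CPU base frequency through the CPUID instruction, but embedded CPUs often have no such instruction. A simple and effective alternative is to time the execution of N instructions with a high-resolution clock and derive the current CPU frequency from that. Linux uses the same idea during boot, since not every CPU supports CPUID or something similar.

A simple, intuitive example

The following example measures the CPU frequency by incrementing a register 1000 times (one increment takes the CPU one clock cycle).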

#include <stdio.h>
#include <time.h>

#define INC(cnt) "inc %[cnt] \n"
#define INC10(cnt) INC(cnt) INC(cnt) INC(cnt) INC(cnt) \
INC(cnt) INC(cnt) INC(cnt) INC(cnt) INC(cnt) INC(cnt)
#define INC100(cnt) INC10(cnt) INC10(cnt) INC10(cnt) INC10(cnt) \
INC10(cnt) INC10(cnt) INC10(cnt) INC10(cnt) INC10(cnt) INC10(cnt)
#define INC1K(cnt) INC100(cnt) INC100(cnt) INC100(cnt) INC100(cnt) \
INC100(cnt) INC100(cnt) INC100(cnt) INC100(cnt) INC100(cnt) INC100(cnt)

/**
 * Compile this code without optimization: gcc -O0
 */
void measure0(void)
{
    struct timespec tStart, tEnd;
    int temp = 0;

    // clock_gettime() stands in for the high-resolution timer here
    clock_gettime(CLOCK_MONOTONIC, &tStart);   // timing starts
    __asm__ volatile( INC1K(temp) : [cnt] "+r"(temp));
    clock_gettime(CLOCK_MONOTONIC, &tEnd);     // timing ends

    double seconds = (tEnd.tv_sec - tStart.tv_sec)
                   + (tEnd.tv_nsec - tStart.tv_nsec) * 1e-9;
    printf("CPU frequency : %.0f Hz\n", 1000 / seconds);  // cycles / elapsed time
}

The measured code compiles to the following (eax holds the value of temp from the stack and is incremented 1000 times):

   0x000000000000202c <+47>:	mov    -0x24(%rbp),%eax
   0x000000000000202f <+50>:	inc    %eax
   0x0000000000002031 <+52>:	inc    %eax
   0x0000000000002033 <+54>:	inc    %eax
   ... ...
   0x00000000000xxxxx <+58>:	inc    %eax
   0x00000000000xxxxx <+250>:	mov    %eax,-0x24(%rbp)

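Notes: 1) the choice of high-resolution clock has a large effect on the result; 2) the more clock cycles the measurement spans, the more accurate the computed frequency; 3) compiler optimization must not be enabled.

The main difficulty of this method is writing code that runs for exactly the expected number of CPU cycles.

Looking at it from the other side, here are examples that run for more or fewer cycles than expected:

Mistake 1: accessing something other than a register stalls the CPU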

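#include <stdio.h>
#include <time.h>

#define INC(dd) __asm("inc %[counter] \n" : [counter] "+r"(dd));
#define INC10(dd) INC(dd) INC(dd) INC(dd) INC(dd) INC(dd) \
INC(dd) INC(dd) INC(dd) INC(dd) INC(dd)
#define INC100(dd) INC10(dd) INC10(dd) INC10(dd) INC10(dd) \
INC10(dd) INC10(dd) INC10(dd) INC10(dd) INC10(dd) INC10(dd)
#define INC1K(dd) INC100(dd) INC100(dd) INC100(dd) INC100(dd) \
INC100(dd) INC100(dd) INC100(dd) INC100(dd) INC100(dd) INC100(dd)

void measure1(void)
{
    struct timespec tStart, tEnd;
    int temp = 0;

    // clock_gettime() stands in for the high-resolution timer here
    clock_gettime(CLOCK_MONOTONIC, &tStart);   // timing starts
    INC1K(temp);                               // 1000 separate asm statements
    clock_gettime(CLOCK_MONOTONIC, &tEnd);     // timing ends

    double seconds = (tEnd.tv_sec - tStart.tv_sec)
                   + (tEnd.tv_nsec - tStart.tv_nsec) * 1e-9;
    printf("CPU frequency : %.0f Hz\n", 1000 / seconds);  // cycles / elapsed time
}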

Analysis: part of the generated assembly looks like this.

   0x000000000000202c <+47>:	mov    -0x24(%rbp),%eax
   0x000000000000202f <+50>:	inc    %eax
   0x0000000000002031 <+52>:	mov    %eax,-0x24(%rbp)
   0x0000000000002034 <+55>:	mov    -0x24(%rbp),%eax
   0x0000000000002037 <+58>:	inc    %eax

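Each inc is its own asm statement and the code is compiled without optimization, so temp is loaded from the stack before every inc and stored back after it. The CPU has to go through the cache/memory each time and cannot execute one inc per clock cycle.

Mistake 2: instruction-level parallelism in the measured code makes it run in fewer CPU cycles than expected

If the measured code updates more than one variable, as below: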

    __asm volatile(
    "cyclemeasure2:\n"
    "    dec %[counter] \n"
    "    dec %[counter] \n"
    "    dec %[counter] \n"
    "    dec %[counter] \n"
    "    dec %[counter2] \n"
    "    dec %[counter2] \n"
    "    dec %[counter2] \n"
    "    dec %[counter2] \n"
    "    jnz cyclemeasure2 \n"
    : /* read/write reg */ [counter] "+r"(cycles[0]), [counter2] "+r"(cycles[1])
    );

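and counter and counter2 have no data dependency on each other, then their instructions can execute in the same CPU cycle (superscalar execution). The total is slightly more than the initial value of counter2 in cycles, but far less than twice that. The assembly shows it: the dec %eax and dec %edx chains can run with instruction-level parallelism.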

   0x00000000000013c0 <+69>:	mov    -0x10(%rbp),%edx
   0x00000000000013c3 <+72>:	mov    -0xc(%rbp),%eax
   0x00000000000013c6 <+75>:	dec    %edx
   0x00000000000013c8 <+77>:	dec    %edx
   0x00000000000013ca <+79>:	dec    %edx
   0x00000000000013cc <+81>:	dec    %edx
   0x00000000000013ce <+83>:	dec    %eax
   0x00000000000013d0 <+85>:	dec    %eax
   0x00000000000013d2 <+87>:	dec    %eax
   0x00000000000013d4 <+89>:	dec    %eax
   0x00000000000013d6 <+91>:	jne    0x13c6 <measure2p+75>

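Using a loop to shorten the measured code

Very long measured code can cause I-cache misses or page faults and distort the result. The example above already used a loop to avoid this; a single-counter version looks like this: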

int cycles = 65536;

rdtsc(start);
__asm volatile(
"cyclemeasure3:\n"
"    dec %[counter] \n"
"    dec %[counter] \n"
"    dec %[counter] \n"
"    dec %[counter] \n"
"    jnz cyclemeasure3 \n"
: /* read/write reg */ [counter] "+r"(cycles)
);  
rdtsc(end);

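One thing to note: each group of dec instructions is followed by a conditional jump, jnz, but thanks to macro-op fusion in modern CPUs (the flag-setting instruction right before the jump and the jump itself are fused into a single operation), jnz does not take a clock cycle of its own, so the total number of cycles matches the initial value of cycles.

rdtsc reads the CPU's cycle counter (the time-stamp counter) with the x86 RDTSC instruction: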

#define rdtsc(u64) {                                    \
    uint32_t hi, lo;                                    \
    __asm__ __volatile__ ("RDTSC\n\t" : "=a" (lo), "=d" (hi)); \
    u64 = ((uint64_t )hi << 32) | lo;                        \
}

The measured CPU cycle count (end-start) comes out very close to the value of cycles.
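For reference, the loop and the counter read can be combined into one self-contained function. This is a sketch under a few assumptions: an x86-64 CPU, GCC or Clang, and the __rdtsc() intrinsic from <x86intrin.h> in place of the rdtsc macro; tsc_ticks_for_dec_chain is a hypothetical name. On CPUs with a constant-rate TSC the counter ticks at a fixed reference frequency, so the tick count matches the core cycle count only when the core is running close to that frequency.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

/* Run a chain of n_decs dependent dec instructions and return the number of
 * TSC ticks it took. n_decs must be a positive multiple of 4, because the
 * loop body holds four decs and exits when the counter reaches zero. */
uint64_t tsc_ticks_for_dec_chain(uint32_t n_decs)
{
    assert(n_decs > 0 && n_decs % 4 == 0);

    uint32_t counter = n_decs;
    uint64_t start = __rdtsc();
    __asm__ volatile(
        "1:\n"
        "    dec %[counter]\n"
        "    dec %[counter]\n"
        "    dec %[counter]\n"
        "    dec %[counter]\n"
        "    jnz 1b\n"
        : [counter] "+r"(counter)
        :
        : "cc");
    uint64_t end = __rdtsc();
    return end - start;
}
```

Calling it several times with 65536 and keeping the smallest result gives a figure that can be compared with the expected cycle count.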

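Estimating the number of instructions issued in parallel from the maximum IPC (Instructions Per Cycle)

A single core can execute N instructions per clock cycle. N can be approximated by measuring the maximum IPC of a program and rounding up. The basic idea is to run M (> N) mutually independent instructions together and measure the resulting IPC.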

    int cycles[8] = {NUM, NUM, NUM, NUM, NUM, NUM, NUM, NUM};
    
    rdtsc(start);
    __asm volatile(
    "cyclemeasure8:\n"
    "    dec %[counter] \n"
    "    dec %[counter2] \n"
    "    dec %[counter3] \n"
    "    dec %[counter4] \n"
    "    dec %[counter5] \n"
    "    dec %[counter6] \n"
    "    dec %[counter7] \n"
    "    dec %[counter8] \n"
    "    jnz cyclemeasure8 \n"
    : /* read/write reg */ [counter] "+r"(cycles[0]), 
    [counter2] "+r"(cycles[1]),
    [counter3] "+r"(cycles[2]),
    [counter4] "+r"(cycles[3]),
    [counter5] "+r"(cycles[4]),
    [counter6] "+r"(cycles[5]),
    [counter7] "+r"(cycles[6]),
    [counter8] "+r"(cycles[7])
    );  
    rdtsc(end);

    printf("IPC             : %lf\n", (8.0*NUM)/(end-start));

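Note that N is generally no larger than the number of general-purpose registers.

Two open questions:

1 I tried replacing the instructions in the loop with nop, but the result was worse than with computation (inc/dec on data without dependencies);

2 Arithmetic instructions and memory (load/store) instructions should use different execution units. Does running them in parallel make the theoretical IPC slightly larger than N?

Scheduling and interrupts

Scheduling and interrupts are the two largest runtime influences on the result: 1) scheduling can be handled by running the process under the SCHED_FIFO policy; 2) on x86, interrupts can be disabled and enabled with the cli/sti instructions (this requires root and I/O privilege, i.e. iopl(3)).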
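Switching to SCHED_FIFO takes a single POSIX call. The sketch below is a minimal illustration with a made-up function name; it returns the errno value instead of aborting, since the call normally fails with EPERM for an unprivileged process.

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>

/* Switch the calling process to the SCHED_FIFO real-time policy.
 * Returns 0 on success, or the errno value on failure. */
int use_fifo_scheduling(int priority)
{
    assert(priority >= sched_get_priority_min(SCHED_FIFO) &&
           priority <= sched_get_priority_max(SCHED_FIFO));

    struct sched_param sp = { .sched_priority = priority };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)   /* 0 = this process */
        return errno;
    return 0;
}
```

A measurement program would call this once before the timed section and fall back to the default policy, perhaps with a warning, when it fails.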

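Another option is a software approach: repeat the measurement many times and discard the runs that are obviously skewed by scheduling or interrupts. This is more portable.

References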
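One simple way to discard skewed runs is to take the median of the samples, since a run interrupted by the scheduler or an interrupt shows up as a large outlier. The sketch below is one possible filter among several (the minimum is another common choice); median_cycles is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sort the samples in place and return the median, so that a few runs
 * inflated by preemption or interrupts do not affect the result. */
uint64_t median_cycles(uint64_t *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    qsort(samples, n, sizeof samples[0], cmp_u64);
    return samples[n / 2];
}
```

With five samples of which one was inflated by an interrupt, the inflated value ends up at the end of the sorted array and has no influence on the median.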

https://lemire.me/blog/2019/05/19/measuring-the-system-clock-frequency-using-loops-intel-and-arm/

https://en.wikipedia.org/wiki/Superscalar_processor

Some Linux Commands for Backups and Restores

Posted on 2022/08/19 by qing

The machine that hosts this site (http://blog.foool.net) crashed some time ago, and not for the first time. I reinstalled it and restored the data, including the database and some ordinary files. The following commands helped me back up and restore the system.

Disk Info.

df -h

Lists information about all mounted file systems (-h prints sizes in human-readable format).

lsblk

Lists information about all block devices.

File info.

du -a / | sort -nr | head -10

Lists the 10 largest directories and files.

-a (du) counts all files, not just directories.

-n (sort) compares by numerical value.

-r (sort) reverses the order, largest first.

-10 (head) prints only the first 10 lines.

du -sh

Shows the total disk usage of the current directory.

-s displays only the total.

find . -type f -printf "%s %p\n" | sort -nr | head -10

Lists the 10 largest files under the current directory.

Backup & Restore

ssh user@remote "dd if=/dev/sda | gzip -1 -" | dd of=image.gz

Back up the remote disk /dev/sda to a local compressed image file.

The trailing - in "gzip -1 -" tells gzip to read its input from standard input.

ssh user@ip 'dd if=/home/user/sdb.img.gz' | gunzip -c - | dd of=/dev/sdb

Restore the local disk /dev/sdb from a compressed image file stored on a remote machine.

sudo rsync -aAXv / --delete --exclude={/dev/*,/proc/*,/lost+found} user@ip:path

Back up all files, except the excluded ones, to a remote machine.

-aAXv: (a) archive mode, (A) preserve ACLs, (X) preserve extended attributes, (v) verbose.

Measuring latency in Linux

Posted on 2021/11/28 by qing

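Original article: http://btorpey.github.io/blog/2014/02/18/clock-sources-in-linux/

There is much to learn from others' work. The article explains in detail what to watch for when using the TSC clock on Linux to precisely time a piece of code (interrupt handlers, functions and so on). It reads well alongside Intel's white paper "How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures". The translation follows.

To measure latency on a modern system, we need to be able to measure intervals at least in microseconds, and preferably in nanoseconds. The good news is that with reasonably modern hardware and software, intervals as short as a few nanoseconds can be measured accurately.

But to be sure your results are accurate, it is important to understand what you are measuring and what the limits of the measurement are in different situations.

Summary

For best results when measuring latency in Linux, you should use:

Linux kernel 2.6.18 or later: this is the first version that includes the hrtimers (high-resolution timers) package. Better still is 2.6.32 or later, which supports most of the different clock sources.
A CPU with a constant, invariant TSC (time-stamp counter). This means the TSC runs at a constant rate across all sockets and cores, regardless of the frequency changes that power management makes to the CPU. It is even better if the CPU supports the RDTSCP instruction (RDTSCP makes the readings more accurate and stable).
The TSC should be configured as the clock source for the Linux kernel.
You should measure the interval between two events that happen on the same machine (intra-machine timing).
For intra-machine timing, your best bet is to read the TSC directly using assembly language. On my test machine it takes about 100ns to read the TSC from software, which is the limit of this method's accuracy (measuring intervals shorter than 100ns is not practical). The cost of reading the TSC differs between machines, which is why I provide source code that you can use to measure your own machine.
- Note that the 100ns mentioned above is largely because my Linux box does not support the RDTSCP instruction, so to get reasonably accurate timings a CPUID instruction is issued before RDTSC to serialize execution. On another machine that does support RDTSCP (a recent MacBook Air), the overhead is down to around 14ns.

What follows discusses how clocks work on Linux, how to access the various clocks from software, and how to measure the overhead of accessing them.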

Using autossh and an intermediate host to keep a stable ssh connection to a host behind NAT

Posted on 2021/09/14 by qing

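Hosts on an internal network often have no public IP address and sit behind NAT, so users cannot open an ssh connection to them directly.

In that case an intermediate machine (with a public IP) is needed as a jump host, and the internal host opens a reverse connection to it. To log in, the user first connects to the intermediate machine and from there connects back to the internal host.

The steps are:

On the internal host, run ssh -R 7777:localhost:22 qing@middleman
On the intermediate host, run ssh -p 7777 user@localhost

Note: the user in step 2 is a user on the internal host.

In the -R argument, 7777 is the port opened on the remote side; connecting to it sets up a connection to port 22 of the internal host. Here is what man ssh says about the -R option: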

-R [bind_address:]port:host:hostport

-R [bind_address:]port:local_socket

-R remote_socket:host:hostport

-R remote_socket:local_socket

-R [bind_address:]port

Specifies that connections to the given TCP port or Unix socket on the remote (server) host are to be forwarded to the local side.

This works by allocating a socket to listen to either a TCP port or to a Unix socket on the remote side. Whenever a connection is made to this port or Unix socket, the connection is forwarded over the secure channel, and a connection is made from the local machine to either an explicit destination specified by host port hostport, or local_socket, or, if no explicit destination was specified, ssh will act as a SOCKS 4/5 proxy and forward connections to the destinations requested by the remote SOCKS client. Port forwardings can also be specified in the configuration file. Privileged ports can be forwarded only when logging in as root on the remote machine. IPv6 addresses can be specified by enclosing the address in square brackets.

By default, TCP listening sockets on the server will be bound to the loopback interface only. This may be overridden by specifying a bind_address. An empty bind_address, or the address ‘*’, indicates that the remote socket should listen on all interfaces. Specifying a remote bind_address will only succeed if the server's GatewayPorts option is enabled (see sshd_config(5)).

If the port argument is ‘0’, the listen port will be dynamically allocated on the server and reported to the client at run time. When used together with -O forward the allocated port will be printed to the standard output.

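This setup has two problems: 1) the ssh connection is dropped after a certain period; 2) connecting to the intermediate machine requires typing a password every time.

The first problem is solved with autossh

autossh wraps the ssh command in a loop and starts it again whenever it exits. This way the reverse connection to the intermediate host is re-established automatically, even after a period in which the internal host was unreachable. The autossh command format is:

autossh [autossh options] [ssh options]

That is, apart from its own options, autossh simply takes the usual ssh options.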
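The restart loop at the heart of this idea is small. The sketch below is not autossh's code, only an illustration of the restart-on-exit part under made-up names; it stops after max_starts launches so that it terminates, whereas a real supervisor would keep going.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Keep a command alive: start it, wait for it to exit, start it again.
 * Returns the number of times the command was started. */
int keep_alive(char *const argv[], int max_starts, unsigned retry_delay_s)
{
    assert(argv != NULL && argv[0] != NULL && max_starts > 0);

    int starts = 0;
    while (starts < max_starts) {
        pid_t pid = fork();
        if (pid < 0)
            break;                      /* cannot create a child, give up */
        if (pid == 0) {
            execvp(argv[0], argv);      /* e.g. ssh -R 7777:localhost:22 ... */
            _exit(127);                 /* exec failed */
        }
        starts++;

        int status;
        waitpid(pid, &status, 0);       /* blocks until the command exits */
        if (starts < max_starts)
            sleep(retry_delay_s);       /* back off before reconnecting */
    }
    return starts;
}
```

The delay between restarts keeps the loop from spinning when the intermediate host is unreachable for a while.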

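The second problem is solved with public-key (password-less) login: 1) run ssh-keygen on the internal host; 2) run ssh-copy-id -i ~/.ssh/id_rsa.pub user@middleman_machine

Combining autossh with password-less login gives the following command:

autossh -o "PasswordAuthentication=no" -o "PubkeyAuthentication=yes" -i ~/.ssh/id_rsa -R 7777:localhost:22 user@middleman

Add this command to the system's startup mechanism so that it runs at boot.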

A brief look at futex

Posted on 2021/04/06 by qing

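Overview: futex stands for Fast User-space Mutex, a locking primitive introduced in the Linux 2.5.7 kernel. Compared with traditional inter-process communication (IPC) primitives such as semaphores and signals, futex is lighter and faster. Application developers rarely use it directly, but locks (such as the one behind pthread_mutex_lock), read-write locks, barriers and signalling mechanisms can all be built on top of it.

Background

In early versions of Linux (before kernel 2.5.7), inter-process communication (IPC) followed traditional Unix and System V IPC, such as semaphores and sockets, all of which are based on system calls. The drawback is that even when contention is low, every operation makes a system call, which adds considerable overhead.

Principle and approach

With those mechanisms, every IPC operation is a system call, and the program switches between user mode and kernel mode. The basic idea of futex is that contention is rare: only the contended case needs to enter the kernel, and everything else can be done in user mode. Its two goals are: 1) avoid system calls as far as possible; 2) avoid unnecessary context switches (and the TLB invalidations they cause).

Concretely, to acquire a futex a task performs an atomic (lock-prefixed) decrement and checks whether the result is 0, which means the lock was taken. On success the task simply continues; on failure (the lock is already held) the task blocks in the kernel. Tasks blocked on the same futex variable are placed on the same wait queue. To unlock, a task either just increments the variable back in user space (when it was the only locker and the wait queue is empty) or enters the kernel to wake a task from the wait queue.

Note: in the Linux kernel, futex is implemented as a system call (SYS_futex). A user program that calls it directly always enters the kernel, so it has to be combined with other operations (such as atomic instructions). Beginners who have not understood how futex and concurrency control work make mistakes very easily, which is why using it directly is not recommended.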
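To make the combination of atomic operations and the system call concrete, here is a small mutex sketch. It does not use the decrement-based protocol described above; it follows the well-known three-state, compare-and-swap design instead (0 = unlocked, 1 = locked, 2 = locked with possible waiters). The type and function names are made up for this illustration, and it assumes Linux with C11 atomics.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Lock word: 0 = unlocked, 1 = locked, 2 = locked and possibly contended. */
typedef struct { atomic_int state; } futex_mutex;

static long futex_op(atomic_int *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

void futex_mutex_lock(futex_mutex *m)
{
    assert(m != NULL);
    int c = 0;
    /* Fast path: uncontended 0 -> 1 transition, no system call. */
    if (atomic_compare_exchange_strong(&m->state, &c, 1))
        return;
    /* Slow path: mark the lock contended and sleep in the kernel. */
    if (c != 2)
        c = atomic_exchange(&m->state, 2);
    while (c != 0) {
        futex_op(&m->state, FUTEX_WAIT, 2);   /* sleeps only if state is 2 */
        c = atomic_exchange(&m->state, 2);
    }
}

void futex_mutex_unlock(futex_mutex *m)
{
    assert(m != NULL);
    /* Fast path: 1 -> 0 means nobody was waiting, no system call. */
    if (atomic_fetch_sub(&m->state, 1) != 1) {
        atomic_store(&m->state, 0);
        futex_op(&m->state, FUTEX_WAKE, 1);   /* wake one waiter */
    }
}
```

In the uncontended case both functions finish with a single atomic instruction and never enter the kernel, which is the point of the design; the system call is made only when another task holds the lock or is waiting for it.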