2023 Year in Review
From my own perspective, 2023 was a fairly full year. Back in the summer, with much on my mind, I wrote a journal entry, "2023 Spring Semester in Review." Now, at year's end, I am writing a new one, though my mood is considerably calmer than when I went home for the July heat break.
Scattered Thoughts on Work
To be filled in.
A Consistency-Oriented, Model-Based Fault Injection Tool for Distributed Systems
This work is still in progress, so I will skip it here.
2023 Goals and Outcomes
Read 100 papers (not achieved)
Like in 2022, this goal went unmet. Here is a rundown of the papers I read this year.
Journal Club (8 papers)
Journal Club is an activity my research group holds every four weeks. Students are divided into several teams, which take turns serving as the organizing team. The organizing team picks a paper, every team reads it independently and prepares slides, and at the group meeting one team is chosen by lot to present the paper.
Journal Club offers a chance to read into research directions not closely related to my own. In 2023 the group held 8 Journal Club sessions, covering the following papers:
- B. H. Kim, T. Kim, and D. Lie, “Modulo: Finding Convergence Failure Bugs in Distributed Systems with Divergence Resync Models,” in Proceedings of USENIX Annual Technical Conference (ATC), 2022, pp. 383–398.
- Y. Sun, C. M. Poskitt, J. Sun, Y. Chen, and Z. Yang, “LawBreaker: An Approach for Specifying Traffic Laws and Fuzzing Autonomous Vehicles,” in Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2022, pp. 1–12.
- E. Androulaki et al., “Hyperledger Fabric: A Distributed Operating System for Permissioned Blockchains,” in Proceedings of the 13th European Conference on Computer Systems (EuroSys), 2018, pp. 1–15.
- X. Sun et al., “Automatic Reliability Testing For Cluster Management Controllers,” in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2022, pp. 143–159.
- S. Fu and S. Ratnasamy, “dSpace: Composable Abstractions for Smart Spaces,” in Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (SOSP), 2021, pp. 295–310.
- C. Lemieux, J. P. Inala, S. K. Lahiri, and S. Sen, “CodaMosa: Escaping Coverage Plateaus in Test Generation with Pre-Trained Large Language Models,” in Proceedings of the 45th International Conference on Software Engineering (ICSE), 2023, pp. 919–931.
- S. Kang, J. Yoon, and S. Yoo, “Large Language Models are Few-Shot Testers: Exploring LLM-Based General Bug Reproduction,” in Proceedings of the 45th International Conference on Software Engineering (ICSE), 2023, pp. 2312–2323.
- D. Gu et al., “ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning,” in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023, pp. 266–280.
Reading Reports (29 papers)
My advisor asks me to read one paper per week and write a reading report; I usually pick the papers myself.
Excluding papers that overlap with the other categories here, I wrote 29 reading reports in 2023. Unlike in 2022, when the papers came from the citation list of a senior student's dissertation in the same direction, the 2023 papers mostly came from 1) recent results of teams working in the same direction, 2) paper series in sub-areas unfamiliar to me (system verification, distributed consensus), and 3) papers on new topics within my research direction (LLM-driven testing, distributed transaction reliability).
Some of the reports are posted in my Zhihu column, reading report. The papers are listed below:
- C. Hawblitzel et al., “IronFleet: Proving Practical Distributed Systems Correct,” in Proceedings of the 25th Symposium on Operating Systems Principles (SOSP), 2015, pp. 1–17.
- X. Yuan and J. Yang, “Effective Concurrency Testing for Distributed Systems,” in Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020, pp. 1141–1156.
- M. Babaei and J. Dingel, “Efficient Replay-based Regression Testing for Distributed Reactive Systems in the Context of Model-driven Development,” in Proceedings of the ACM/IEEE 24th International Conference on Model Driven Engineering Languages and Systems (MODELS), 2021, pp. 89–100.
- R. Natella and V.-T. Pham, “ProFuzzBench: A Benchmark for Stateful Protocol Fuzzing,” in Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA Demo), 2021, pp. 662–665.
- C. Lou, Y. Jing, and P. Huang, “Demystifying and Checking Silent Semantic Violations in Large Distributed Systems,” in Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2022, pp. 91–107.
- J. Lu, H. Li, C. Liu, L. Li, and K. Cheng, “Detecting Missing-Permission-Check Vulnerabilities in Distributed Cloud Systems,” in Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security (CCS), 2022, pp. 2145–2158.
- Y. Deng, C. S. Xia, H. Peng, C. Yang, and L. Zhang, “Large Language Models Are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models,” in Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 2023, pp. 423–435.
- C. A. Stuardo et al., “ScaleCheck: A Single-Machine Approach for Discovering Scalability Bugs in Large Distributed Systems,” in Proceedings of the 17th USENIX Conference on File and Storage Technologies (FAST), 2019, pp. 359–373.
- Y. Zhang, S. Makarov, X. Ren, D. Lion, and D. Yuan, “Pensieve: Non-Intrusive Failure Reproduction for Distributed Systems using the Event Chaining Approach,” in Proceedings of the 26th Symposium on Operating Systems Principles (SOSP), 2017, pp. 19–33.
- L. Tang et al., “Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems,” in Proceedings of the 18th European Conference on Computer Systems (EuroSys), 2023, pp. 433–451.
- H. Ng, S. Haridi, and P. Carbone, “Omni-Paxos: Breaking the Barriers of Partial Connectivity,” in Proceedings of the 18th European Conference on Computer Systems (EuroSys), 2023, pp. 314–330.
- D. Ongaro and J. Ousterhout, “In Search of an Understandable Consensus Algorithm,” in Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference (ATC), 2014, pp. 305–320.
- T. Hance, A. Lattuada, C. Hawblitzel, J. Howell, R. Johnson, and B. Parno, “Storage Systems Are Distributed Systems (So Verify Them That Way!),” in Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation (OSDI), 2020, pp. 99–115.
- H. Dai, Y. Wang, K. B. Kent, L. Zeng, and C. Xu, “The State of the Art of Metadata Managements in Large-Scale Distributed File Systems — Scalability, Performance and Availability,” IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 33, no. 12, pp. 3850–3869, 2022.
- T. Eldeeb, X. Xie, P. A. Bernstein, A. Cidon, and J. Yang, “Chardonnay: Fast and General Datacenter Transactions for On-Disk Databases,” in Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2023, pp. 343–360.
- Y. Shan, K. Chen, and Y. Wu, “Explore Data Placement Algorithm for Balanced Recovery Load Distribution,” in Proceedings of the USENIX Annual Technical Conference (ATC), 2023, pp. 233–240.
- H. Ma et al., “Sift: Using Refinement-guided Automation to Verify Complex Distributed Systems,” in Proceedings of the 2022 USENIX Annual Technical Conference (ATC), 2022, pp. 151–166.
- J. R. Wilcox et al., “Verdi: A Framework for Implementing and Formally Verifying Distributed Systems,” in Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2015, pp. 357–368.
- J. Idziorek et al., “Distributed Transactions at Scale in Amazon DynamoDB,” in Proceedings of the USENIX Annual Technical Conference (ATC), 2023, pp. 705–717.
- M. Lesani, C. J. Bell, and A. Chlipala, “Chapar: Certified Causally Consistent Distributed Key-Value Stores,” in Proceedings of the 43rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), 2016, pp. 357–370.
- L. Huang et al., “Metastable Failures in the Wild,” in Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2022, pp. 73–90.
- G. Zakhour, P. Weisenburger, and G. Salvaneschi, “Type-Checking CRDT Convergence,” in Proceedings of the 44th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2023, pp. 1365–1388.
- Y. Chen, F. Ma, Y. Zhou, M. Gu, Q. Liao, and Y. Jiang, “Chronos: Finding Timeout Bugs in Practical Distributed Systems by Deep-Priority Fuzzing with Transient Delay,” in Proceedings of the IEEE Symposium on Security and Privacy (SP), 2024.
- R. Meng, G. Pîrlea, A. Roychoudhury, and I. Sergey, “Greybox Fuzzing of Distributed Systems,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), 2023, pp. 1615–1629.
- Y. Zhang et al., “Understanding and Detecting Software Upgrade Failures in Distributed Systems,” in Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (SOSP), 2021, pp. 116–131.
- X. Gu, W. Cao, Y. Zhu, X. Song, Y. Huang, and X. Ma, “Compositional Model Checking of Consensus Protocols via Interaction-Preserving Abstraction,” in Proceedings of the 41st International Symposium on Reliable Distributed Systems (SRDS), 2022, pp. 82–93.
- A. J. J. Davis, M. Hirschhorn, and J. Schvimer, “eXtreme Modelling in Practice,” in Proceedings of the International Conference on Very Large Data Bases (VLDB), 2020, pp. 1346–1358.
- S. Ghosh, M. Shetty, C. Bansal, and S. Nath, “How to Fight Production Incidents? An Empirical Study on a Large-Scale Cloud Service,” in Proceedings of the 13th Symposium on Cloud Computing (SoCC), 2022, pp. 126–141.
- M. Balakrishnan et al., “Virtual Consensus in Delos,” in Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2020, pp. 617–632.
Laboratory Papers (4 papers)
During my research I reread, more carefully, some papers previously published by senior students in my group.
After a round of reading in the distributed-system testing area, it struck me that there seems to be a template: "xxx-guided (most likely) (fault injection) for cloud/distributed systems." Fault injection is our group's signature technique, and I could not pass it up.
The papers I reread are listed below:
- D. Wang, W. Dou, Y. Gao, C. Wu, J. Wei, and T. Huang, “Model Checking Guided Testing for Distributed Systems,” in Proceedings of the 18th European Conference on Computer Systems (EuroSys), 2023, pp. 127–143.
- Y. Gao et al., “Coverage Guided Fault Injection for Cloud Systems,” in Proceedings of the 45th International Conference on Software Engineering (ICSE), 2023, pp. 2211–2223.
- Y. Gao, D. Wang, Q. Dai, W. Dou, and J. Wei, “Common Data Guided Crash Injection for Cloud Systems,” in Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings (ICSE Demo), 2022, pp. 36–40.
- H. Chen, W. Dou, D. Wang, and F. Qin, “CoFI: Consistency-Guided Fault Injection for Cloud Systems,” in Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2020, pp. 536–547.
FaultFuzz Demo Survey (6 papers)
While writing the FaultFuzz demo paper, I read several other demo papers to learn how such papers are written.
Roughly summarizing how to write a demo paper on top of an existing regular paper: 1) Write the abstract and introduction as carefully as for a regular paper. 2) The approach section may need to omit details for space, but the idea and content must still come through clearly. 3) Many demos run no new experiments and only cite the regular paper's results, though some reviewers will ask for new experiments. 4) Describe in detail how to use the tool; ideally, build a visual interface for it and include screenshots in the paper.
- X. Tang, S. Wu, D. Zhang, Z. Wang, G. Yuan, and G. Chen, “A Demonstration of DLBD: Database Logic Bug Detection System,” Proceedings of the VLDB Endowment, vol. 16, no. 12, pp. 3914–3917, Aug. 2023.
- J. Bell and G. Kaiser, “Dynamic Taint Tracking for Java with Phosphor (Demo),” in Proceedings of the International Symposium on Software Testing and Analysis (ISSTA Demo), 2015, pp. 409–413.
- A. Hernandez, M. Nassif, and M. P. Robillard, “DScribe: Co-Generating Unit Tests and Documentation,” in Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings (ICSE Demo), 2022, pp. 56–60.
- D. Ishimwe, T. Nguyen, and K. Nguyen, “Dynaplex: Inferring Asymptotic Runtime Complexity of Recursive Programs,” in Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings (ICSE Demo), 2022, pp. 61–64.
- Y. Tang, A. Spektor, R. Khatchadourian, and M. Bagherzadeh, “A Tool for Rejuvenating Feature Logging Levels via Git Histories and Degree of Interest,” in Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings (ICSE Demo), 2022, pp. 21–25.
- A. A. Ahmad, M. Anwar, H. Sharif, A. Gehani, and F. Zaffar, “Trimmer: Context-Specific Code Reduction,” in Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering (ASE Demo), 2022, pp. 1–5.
Paper-Reading Summary
Someone on Zhihu once said a PhD student should be able to read 4 papers in 1.5 hours and then discuss them with others. I think that is achievable only when two conditions hold at once: the problem already has a large body of literature and the contributions of those 4 new papers over prior work are easy to grasp, and the student has already read extensively on that problem. Stressing only one of the two would be one-sided.
But I feel the claim can be inverted into a beginner's goal for an ordinary PhD student: "finish 1.5 papers within 4 hours." My current rhythm is roughly: start reading at 9:30 a.m. and get through most of the paper by 11:30; resume at 2:30 p.m. and finish a full pass by about 4:00, then start the reading report and have a rough skeleton by 5:30; come back around 7:30 p.m. and write until about 9:00. In total, "reading + writing the report" takes about 6 hours; I hope to gradually compress it to 5.
Publish a paper at CCF-B level or above (roughly achieved)
In 2023 I published an ICSE Demo paper. The demo paper is a special format at software engineering conferences, typically 4 pages long, which focuses on: 1) for brand-new work, the overall idea and preliminary experimental results; 2) for work that improves on an existing paper, how easy the tool is to use.
Besides the Technical Track that most people care about, software engineering conferences usually run many other tracks, such as the frequently seen Workshop Track (generally regarded as complete research that does not meet the main track's bar), the Doctoral Symposium (where PhD students share their research progress and ideas), and the New Ideas and Emerging Results (NIER) Track (new ideas whose research is not yet complete). Papers accepted in these tracks generally do not count toward graduation requirements at Chinese universities, so they attract little discussion.
My Demo Track paper is no exception and cannot be counted toward my graduation requirements. Still, by our lab's tradition, authors of demo papers are also funded to attend the conference in person, and if I end up truly short of results at graduation, the paper could serve as one of the three sub-topics of my dissertation. (Though since it improves on existing work, I probably will not be allowed to count it as one of the three.) More importantly, a small usability-focused project like this is very well suited for a research newcomer to get familiar with the whole research pipeline.
The Work: A Coverage-Guided Fault Injection Tool for Distributed Systems
Fault injection is an important technique for testing distributed systems. Put plainly, while the system runs, we deliberately subject some of its nodes to simulated faults such as crashes, restarts, and network partitions, and then check whether the system can keep running as expected. A key question for any fault injection method is: when should we inject a fault? In other words, at what moments is an injected fault more likely to trigger a bug?
This question naturally invites intuitive guesses, for example: if a node has just received a message but has not yet written it to local storage, crashing the node at that instant may be more likely to expose a problem. But putting such guesses into practice raises many implementation questions, for example: how does the test program know when the system is about to write the content of a received network message to local storage?
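One common way to answer that question is to instrument the system's I/O layer so the test harness gets a callback right before each persistence point. The sketch below is purely illustrative and assumes nothing about the actual tool: `FaultController`, `instrumented_write`, and the point ids are all hypothetical names I made up for this example.

```python
class FaultController:
    """Hypothetical test-side controller: decides at each intercepted
    I/O point whether to inject a fault (illustrative, not any real
    tool's API)."""
    def __init__(self, crash_at):
        self.crash_at = crash_at    # I/O point ids where we crash the node
        self.injected = []          # record of faults actually injected

    def on_io_point(self, point_id):
        if point_id in self.crash_at:
            self.injected.append(point_id)
            return "crash"          # a real harness would kill the node here
        return "continue"

def instrumented_write(controller, point_id, path, data):
    """Wrapper around a local write: notifies the controller *before*
    persisting, exposing the 'received but not yet written' window."""
    if controller.on_io_point(point_id) == "crash":
        # Simulate the node dying before the data reaches disk.
        raise RuntimeError(f"injected crash before write at {point_id}")
    with open(path, "w") as f:
        f.write(data)
```

In a real system the wrapper would be woven in via bytecode instrumentation or an interposition library rather than hand-edited source, but the control flow is the same: the harness, not the system, decides whether the write survives.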
Hence, for distributed-system testing, many fault-injection papers follow this shape: "The authors propose a new timing/pattern that triggers bugs more easily. Applying this pattern to real systems requires solving the following N problems. For each problem, our solution is XXX. We ran experiments, found XXX bugs, and improved efficiency by XXX over other methods."
A senior student in my group previously built a tool along these lines: take the moments when the system performs I/O operations as candidate fault-injection points, and use fuzzing to pick the I/O points where injected faults are more likely to trigger bugs. Concretely, when injecting a fault at some I/O point raises code coverage, the test program then considers injecting a further fault at another I/O point on top of it, so as to test deeper scenarios. I will not go into the detailed strategies and implementation here.
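The coverage-guided loop just described can be sketched roughly as follows. This is my own toy illustration, not the actual tool: `run_test` is a stand-in that fakes coverage so the sketch is runnable, and all names are hypothetical.

```python
import random

def run_test(fault_plan):
    """Stand-in for the real test harness: it would start the target
    distributed system, inject each (io_point, fault) in fault_plan,
    run a workload, and return the set of covered code branches.
    Here we fake coverage to keep the sketch self-contained."""
    return {point for point, _ in fault_plan}

def coverage_guided_fuzz(io_points, rounds=100):
    """Keep fault plans that discover new coverage, and extend them
    with one more fault to probe deeper scenarios."""
    global_coverage = set()
    queue = [[]]                     # seed: a run with no faults injected
    for _ in range(rounds):
        plan = random.choice(queue)
        fault = (random.choice(io_points),
                 random.choice(["crash", "disconnect"]))
        new_plan = plan + [fault]    # mutate: add one more fault
        covered = run_test(new_plan)
        if not covered <= global_coverage:   # found new coverage
            global_coverage |= covered
            queue.append(new_plan)   # explore deeper from this plan
    return global_coverage
```

The key design point mirrors classic greybox fuzzing: a fault plan only stays in the queue if it contributed new coverage, so faults stack up along executions that keep reaching new code.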
On top of her work, I made the following improvements:
- Added more fault types: the tool previously injected only crashes and restarts, and now also injects network disconnection and reconnection.
- Provided finer-grained control over the target system.
- Built a front-end interface and exposed more convenient API endpoints for some of the features.
The improved tool was published as a tool paper (a 4-page short paper) in the ICSE Demo Track (2024).
Save Money (amount TBD)
Although listed as "TBD," I did set a target for myself at the start of the year. That savings goal was not met.
This year my income and expenses changed in several ways:
- After formally entering the third year of my direct-PhD program (equivalent to the first year of a regular PhD), my monthly stipend increased.
- Eating every day with my labmates pushed up my daily spending.
- Over the past year, perhaps due to stress, I went to the hospital fairly often, adding a monthly medical expense.
- Once research got busy, I stopped joining programs like GSoC and OSPP (开源之夏) that bring in extra income.
I now somewhat understand the tension between a "fixed salary" and "the pursuit of higher income." Fortunately, my need for financial assets is not yet particularly high, so I can keep working at a relaxed pace without much anxiety on this front.
Reduce My Weight to 85 kg
Not achieved. My weight now fluctuates between 90 and 95 kg, staying close to 90 kg most of the time.
When life is stressful, it is hard to spare attention for weight control.
Or rather, my overall self-discipline is still lacking.
Read 20 Books
《活着》 (To Live), 《秋》 (Autumn), 《耕作革命》, 《深入理解分布式系统》
By coincidence, just like last year, I finished only 4 books this year. I need to keep working on this next year.
Life Goals
This is where I really should have written at length about how I met Qiu (邱同学). But doing it justice would run too long, and it was never part of my original 2023 plan anyway.
So I have decided to write about it separately when I have time.
Summary of 2023 Goals and Outcomes
It is a bit sad to say, but the only 2023 items with fairly satisfying progress were paper reading and research output, and even on those I feel I should have done better.
All I can say is: keep going in 2024. My goals are much the same every year; if I improve on one or two of them each year, perhaps one day I will find I have achieved them all.