53 commits
4ed338f
Treat FIFOs properly when generating HBG and DFG
charmoniumQ Aug 1, 2025
65007a8
Create edge without cycle
charmoniumQ Aug 12, 2025
33be9ff
Wip
charmoniumQ Aug 13, 2025
bba986a
Fix types
charmoniumQ Aug 13, 2025
580e0ae
Wip
charmoniumQ Aug 15, 2025
9f7ef26
Fix stuff
charmoniumQ Aug 18, 2025
d8896b0
Fix stuff
charmoniumQ Aug 19, 2025
dc46de6
Revert unnecessaries
charmoniumQ Aug 19, 2025
bcc4cf0
Alejandra and merge
charmoniumQ Aug 20, 2025
d4430b8
Merge branch 'main' into pipe-correctness
charmoniumQ Aug 20, 2025
8eb3b8b
Fix types
charmoniumQ Aug 20, 2025
e6c91e4
Fix stubs
charmoniumQ Aug 20, 2025
a40df1f
Add explanatory comment
charmoniumQ Aug 20, 2025
8fbba2a
Merge with main
charmoniumQ Aug 20, 2025
2a13315
Merge branch 'main' into pipe-correctness
charmoniumQ Aug 20, 2025
21f0090
Merge branch 'remove-mypy-stubs' into pipe-correctness
charmoniumQ Aug 20, 2025
da077cb
Fix tests
charmoniumQ Aug 21, 2025
5e4dc29
Fix wait
charmoniumQ Aug 21, 2025
25b9d6b
Work in progress
charmoniumQ Aug 22, 2025
3fd763a
Wip
charmoniumQ Aug 27, 2025
3de6fd2
Still broken
charmoniumQ Aug 27, 2025
69a6940
Fix stuff
charmoniumQ Aug 28, 2025
c1f2539
Wip
charmoniumQ Aug 29, 2025
d966f17
Merge branch 'main' into remove-mypy-stubs
charmoniumQ Sep 4, 2025
a2ead1f
Merge branch 'remove-mypy-stubs' into pipe-correctness
charmoniumQ Sep 4, 2025
019cf9c
wip
charmoniumQ Sep 17, 2025
a48367b
WIP
charmoniumQ Sep 29, 2025
e8697b6
Wip
charmoniumQ Sep 29, 2025
093140b
Wip
charmoniumQ Oct 1, 2025
844236b
Update libprobe from main
charmoniumQ Oct 8, 2025
5b23a2e
Merge with main
charmoniumQ Oct 11, 2025
7b654c5
Fix cmdline and presentation
charmoniumQ Oct 20, 2025
70dc4ac
Add docs
charmoniumQ Oct 24, 2025
44ba6fd
Merge branch 'main' into pipe-correctness
charmoniumQ Oct 27, 2025
67b3467
Fix stuff
charmoniumQ Oct 31, 2025
b969422
It works now!
charmoniumQ Nov 5, 2025
aee3598
Fix stuff
charmoniumQ Nov 5, 2025
5477f73
Merge with main
charmoniumQ Nov 5, 2025
a33b5d7
Merge branch 'main' into pipe-correctness
charmoniumQ Nov 5, 2025
296aacb
Delete dead code
charmoniumQ Nov 6, 2025
f348356
Remvoe unnecessary code
charmoniumQ Nov 6, 2025
d5b6445
Update presentation
charmoniumQ Nov 9, 2025
3b1dbb8
Update workflow pres and code
charmoniumQ Nov 11, 2025
33708fd
Fix code and pres
charmoniumQ Nov 11, 2025
561985d
Fix presentation and dfg
charmoniumQ Nov 11, 2025
4807c5e
Make dfg compression work
charmoniumQ Nov 14, 2025
7e092ef
Fixups
charmoniumQ Nov 18, 2025
8c24a6e
Fix stuff
charmoniumQ Nov 18, 2025
a21b026
Remove unnused import
charmoniumQ Nov 18, 2025
7f926f5
Merge branch 'main' into pipe-correctness
charmoniumQ Nov 19, 2025
993a453
Delete hb_graph_accesses.py
charmoniumQ Nov 19, 2025
b4b2bce
Add debug
charmoniumQ Nov 19, 2025
6f0a437
Simple dataflow graph
charmoniumQ Dec 3, 2025
7 changes: 7 additions & 0 deletions docs/benchmarks.md
@@ -0,0 +1,7 @@
# Examples of negative overheads in LMbench

Tbl. 2 of [Pasquier et al. 2018](https://doi.org/10.1145/3243734.3243776)

Tbl. 1 of [Bates et al. 2015](https://www.usenix.org/system/files/conference/usenixsecurity15/sec15-paper-bates.pdf)

Tbl. 2 of [Pohly et al. 2012](https://doi.org/10.1145/2976749.2978378)
173 changes: 173 additions & 0 deletions docs/dataflow_algorithms/README.md
@@ -0,0 +1,173 @@
# Overview

We have a happens-before graph, where A->B means A happens-before B; it is the union of program order, forks, and joins.

How to construct a dataflow graph?

Certainly program-order and fork edges carry dataflow: they carry an entire copy of the process's memory. Joins are also dataflow, since they carry a return integer. While the integer may not seem like much, its success or failure (non-zero-ness) is often important for the parent's control flow.

How to version repeated accesses to the same inode?

# Sample-schedule

Pick a "sample" schedule that is a topological order of the hb-graph.
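
For concreteness, a minimal sketch of picking one such schedule, assuming the HB graph is a `networkx.DiGraph` as in the algorithm files below (names here are illustrative, not the probe_py API):

```python
import networkx as nx


def sample_schedule(hb_graph: nx.DiGraph) -> list:
    # Any topological order of the happens-before graph is a valid "sample" schedule:
    # every op appears after every op that happens-before it.
    return list(nx.topological_sort(hb_graph))
```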

[at_open_algo.py](./at_open_algo.py)

[at_opens_and_closes.py](./at_opens_and_closes.py)

[at_opens_and_closes_with_separate_access.py](./at_opens_and_closes_with_separate_access.py)

Problem with the event-based approach: no individual schedule constructs a sufficiently conservative DFG. Consider the following HB graph:

```
digraph {
    Read of A -> Write of B
    Read of B -> Write of A
}
```

Any single valid schedule will contain only one of the inode dataflow edges. But data could flow from the second op of the second process to the first op of the first process AND from the second op of the first process to the first op of the second, just not in the same schedule.

The sample-schedule approach also doesn't handle `sh -c "a | b"` elegantly; that program has the following HB graph:

```
digraph {
    open pipe for reading -> open pipe for writing -> fork writer -> fork reader -> wait writer -> wait reader -> close pipe for reading -> close pipe for writing;
    fork writer -> close pipe for reading -> dup pipe for writing to stdout -> close pipe for writing -> exec -> wait writer;
    fork reader -> close pipe for writing -> dup pipe for reading to stdin -> close pipe for reading -> exec -> wait reader;
}
```

A schedule that puts `write` at the close and `read` at the open will not see dataflow from the writing process to the reading one.

# Interval approach

Instead, we take the more complex "interval" approach. Unlike an interval on the real numbers, an interval in a partial order can have multiple lower limits and upper limits. Only nodes which are above some lower limit and below some upper limit are contained in the interval.

Define the binary predicate "all-happens-before": the first interval all-happens-before the second when all of the lower bounds of the first interval happen-before some upper bound of the second interval.
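
A sketch of one reading of this predicate, assuming each interval is represented by its sets of lower-bound and upper-bound nodes in a `networkx` HB graph (illustrative names, not the probe_py API):

```python
import networkx as nx


def all_happens_before(hb_graph: nx.DiGraph, first_lowers: set, second_uppers: set) -> bool:
    # Every lower bound of the first interval must happen-before (reach)
    # some upper bound of the second interval.
    return all(
        any(nx.has_path(hb_graph, lower, upper) for upper in second_uppers)
        for lower in first_lowers
    )
```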

[interval_algo.py](./interval_algo.py)

[interval_redux.py](./interval_redux.py)

The logic in `interval_redux.py` for dealing with concurrent segments works if the segments are disjoint.

We start with the HB graph, because all of the edges (program order, fork, join, exec) _can_ have dataflow. We get rid of extraneous nodes in the process.

Then, for each referenced inode (sketched below):

1. Find its open-close intervals.
2. Form a partial order on the open-close intervals with all-happens-before.
3. Find the transitive reduction.
4. For each node, the node may read any of the versions of any of its predecessors.
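
A sketch of those steps for one inode, assuming each interval is a hashable object and `all_happens_before` compares two intervals (hypothetical names; `networkx.transitive_reduction` requires the relation to be acyclic, which is one reason disjointness matters):

```python
import typing

import networkx as nx


def interval_order(
    intervals: list[typing.Hashable],
    all_happens_before: typing.Callable[[typing.Hashable, typing.Hashable], bool],
) -> nx.DiGraph:
    # Partial order on one inode's open-close intervals, transitively reduced.
    order = nx.DiGraph()
    order.add_nodes_from(intervals)
    for a in intervals:
        for b in intervals:
            if a is not b and all_happens_before(a, b):
                order.add_edge(a, b)
    # An interval may read any version written by its predecessors in this order.
    return nx.transitive_reduction(order)
```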


# Doubts

- Matching closes with opens under thread-level parallelism
- Global fd table
  - How to handle `clone` with `CLONE_FILES`?
  - The kernel already implements a readers/writers lock around the fd table: https://docs.kernel.org/filesystems/files.html
- TOCTOU errors with stat->open or stat->close?
  - Change stat->open to open->stat
  - But I still need to check the inode table, to see whether I need to copy a file
  - But only if the file exists and the open mode is truncate
- Chdir under thread-level parallelism
  - Always stat dirfd OR only do stat on fds OR have a global fd table

1. Gate this feature with an env var, so its performance can be evaluated.
   - See copy-files.
   - Since PROBE has both modes, analysis in `probe_py` should support both modes.
2. Have a process-global FD table that maps FDs to open-numbers; the open-number increments on every open and gets logged with the open (see the sketch after this list).
   - Use a multi-level table.
   - Consider a 10-bit x 10-bit = 20-bit table at first.
   - If a returned FD is too large, error out loudly.
   - Use a readers/writers lock or atomics.
   - The open-number should go in the Path struct.
3. Uses of the FD (as a dirfd, or in close) also get logged.
   - Accesses with `AT_FDCWD` log the open-number of the cwd.
4. When we reconstruct close/dup/fcntl -> open, we use the open-number if present, otherwise inode-matching (the current strategy).
   - Kill the existing stat in close, dup, fcntl, and other functions that operate on FDs.
   - Perhaps `create_path_lazy` should not do a stat when the open-number is present.
5. When reconstructing the process-global cwd, we use the open-number.
6. Fail loudly on clone if `CLONE_FILES` is set/not-set (whichever poses a problem for us).
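
To make step 2 concrete, a Python sketch of the two-level FD table (the real table would live in libprobe, in C, guarded by a readers/writers lock or atomics; all names here are illustrative):

```python
TOP_BITS = BOTTOM_BITS = 10  # 10-bit x 10-bit = 20-bit table
TOP_SIZE, BOTTOM_SIZE = 1 << TOP_BITS, 1 << BOTTOM_BITS


class FdTable:
    """Maps an FD to the open-number of the open() that produced it."""

    def __init__(self) -> None:
        self.open_counter = 0
        self.top: list[list[int | None] | None] = [None] * TOP_SIZE

    def record_open(self, fd: int) -> int:
        if fd >= TOP_SIZE * BOTTOM_SIZE:
            raise RuntimeError(f"fd {fd} too large for the table")  # error out loudly
        self.open_counter += 1  # increments on every open
        hi, lo = fd >> BOTTOM_BITS, fd & (BOTTOM_SIZE - 1)
        bottom = self.top[hi]
        if bottom is None:
            bottom = self.top[hi] = [None] * BOTTOM_SIZE  # allocate second level lazily
        bottom[lo] = self.open_counter
        return self.open_counter  # gets logged with the open

    def open_number(self, fd: int) -> int | None:
        hi, lo = fd >> BOTTOM_BITS, fd & (BOTTOM_SIZE - 1)
        bottom = self.top[hi]
        return None if bottom is None else bottom[lo]
```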

Next PR:

7. Two Bloom filters for all open-numbers read-from and written-to in this process (see the sketch after this list).
   - A Bloom filter query returns either "possibly in set" or "definitely not in set".
   - For each read/write, update the corresponding filter.
   - The open-number is more precise than the inode, since the same inode can be opened for multiple periods within the process.
   - Read-from / written-to status gets logged on close.
8. Dataflow can ignore write segments from threads that don't do any writing, and read segments from threads that don't do any reading.
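
A minimal sketch of the two filters (illustrative only; the real ones would live in libprobe and would pick sizes and hash counts deliberately):

```python
import hashlib


class BloomFilter:
    # A query answers "possibly in set" (True) or "definitely not in set" (False).
    def __init__(self, n_bits: int = 1 << 16, n_hashes: int = 4) -> None:
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _indices(self, open_number: int):
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(f"{seed}:{open_number}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, open_number: int) -> None:
        for i in self._indices(open_number):
            self.bits[i // 8] |= 1 << (i % 8)

    def __contains__(self, open_number: int) -> bool:
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indices(open_number))


# One per-process filter for open-numbers read from, one for open-numbers written to;
# their status gets logged on close.
read_from, written_to = BloomFilter(), BloomFilter()
```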

# Shell ambiguity

- Dataflow graph of `sh -c 'a | b'` appears to have more edges than it needs, but these edges are necessary: for each edge, there exists a program with the same HB graph that really has that dataflow.
- The following program is, for our purposes, equivalent to `sh -c '/bin/echo hi | cat'`. Note that uncommenting the pair Write 1 + Read 1 creates a dataflow edge from parent to first_child, and uncommenting Write 2 + Read 2 creates one from first_child to parent. In practice, we know that there are no such extraneous reads/writes in Bash, but for an arbitrary black-box program, we wouldn't know. Indeed, if we used `echo` instead of `/bin/echo`, which invokes a shell builtin rather than a subprocess, then the first child would have Write 2 uncommented (with the dup2 and execl removed). Really, we should be tracking the reads and writes to resolve the ambiguity more precisely.
```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int pipe_fds[2];
    pipe(pipe_fds);
    pid_t first_fork = fork();
    if (first_fork == 0) {
        // first child (the writer)
        //read(pipe_fds[0], ...);  /* Read 1 */
        //write(pipe_fds[1], ...); /* Write 2 */
        close(pipe_fds[0]);
        dup2(pipe_fds[1], STDOUT_FILENO);
        close(pipe_fds[1]);
        // exec never returns; it switches to the main method of echo.
        execl("/bin/echo", "echo", "hi", (char *) NULL);
    } else {
        // parent (the shell)
        //write(pipe_fds[1], ...); /* Write 1 */
        //read(pipe_fds[0], ...);  /* Read 2 */
        close(pipe_fds[1]);
        pid_t second_fork = fork();
        if (second_fork == 0) {
            // second child (the reader)
            dup2(pipe_fds[0], STDIN_FILENO);
            close(pipe_fds[0]);
            execl("/bin/cat", "cat", (char *) NULL);
        } else {
            // parent
            close(pipe_fds[0]);
            waitpid(first_fork, NULL, 0);
            waitpid(second_fork, NULL, 0);
        }
    }
    return 0;
}
```

# Alternatives

ReproZip is incorrect in the following cases:

``` bash
cd ../../benchmark

nix build .#reprozip-all && export PATH="$PWD/result/bin:$PATH" && reprounzip usage_report --disable

# Graph should show a dataflow path from cat to head.
touch src && reprozip trace --overwrite bash -c 'cat src | head' && rm dataflow-graph.dot ; reprounzip graph --dir .reprozip-trace dataflow-graph.dot && xdot dataflow-graph.dot

# Test with a symlink (hardlink is also not supported)
setup="echo hello > src && echo hello > src && rm dst ; ln --symbolic src dst"
cmd="cat src src > tmp && cat tmp > dst"
test="cat dst"
bash -c "$setup"
bash -c "$cmd"
bash -c "$test"
bash -c "$setup" && reprozip trace --overwrite bash -c "$cmd"
# Get trace
rm test-trace.rpz ; reprozip pack test-trace.rpz
# Show files should have src as an input and an output
reprounzip showfiles test-trace.rpz
# Show dataflow
rm dataflow-graph.dot ; reprounzip graph --dir .reprozip-trace dataflow-graph.dot && xdot dataflow-graph.dot

# Same procedure, but with a file that modifies itself
echo hi > src && reprozip trace --overwrite bash -c "cat src src > tmp ; head tmp > src"
rm test-trace.rpz ; reprozip pack test-trace.rpz
# src should be an input file; tmp should not be an input file
reprounzip showfiles test-trace.rpz
rm dataflow-graph.dot ; reprounzip graph --dir .reprozip-trace dataflow-graph.dot && xdot dataflow-graph.dot
# src should have one hello not two.
rm --recursive --force tmpdir && reprounzip directory setup test-trace.rpz tmpdir
```
153 changes: 153 additions & 0 deletions docs/dataflow_algorithms/at_open_algo.py
@@ -0,0 +1,153 @@
# From 2a63c07eaa53a0e017eb0ad557ebbfffcaf66fc8
# Following #138

###
# New

import dataclasses
import collections
import pathlib
import sys
import typing
import os

import networkx
import numpy
import rich.console

# Names such as InodeVersion, Inode, Host, Device, OpenOp, CloneOp, WaitOp, ExecOp,
# TaskType, and get_op come from the surrounding probe_py code and are not
# reproduced in this snapshot.


#####
# analysis.py:23

@dataclasses.dataclass(frozen=True)
class ProcessNode:
pid: int
cmd: tuple[str,...]


@dataclasses.dataclass(frozen=True)
class FileAccess:
inode_version: InodeVersion
path: pathlib.Path

@property
def label(self) -> str:
return f"{self.path!s} inode {self.inode_version.inode}"

#####
# New

Pid = int
ExecNo = int
Tid = int

class OpNode(typing.NamedTuple):
    pid: Pid
    exec_no: ExecNo
    tid: Tid
    op_index: int
    def op_quad(self) -> tuple[Pid, ExecNo, Tid, int]:
        return (self.pid, self.exec_no, self.tid, self.op_index)

HbGraph = networkx.DiGraph[OpNode]
DfGraph = networkx.DiGraph[FileAccess | ProcessNode]
class ProbeLog:
    pass

#####
# analysis.py:47


def traverse_hb_for_dfgraph(probe_log: ProbeLog, starting_node: OpNode, traversed: set[int] , dataflow_graph: DfGraph, cmd_map: dict[int, list[str]], inode_version_map: dict[int, set[InodeVersion]], hb_graph: HbGraph) -> None:
starting_pid = starting_node.pid

starting_op = get_op(probe_log, *starting_node.op_quad())

name_map = collections.defaultdict[Inode, list[pathlib.Path]](list)

target_nodes = collections.defaultdict[int, list[OpNode]](list)
console = rich.console.Console(file=sys.stderr)

print("starting at", starting_node, starting_op)

for edge in networkx.bfs_edges(hb_graph, starting_node):

pid, exec_epoch_no, tid, op_index = edge[0].op_quad()

# check if the process is already visited when waitOp occurred
if pid in traversed or tid in traversed:
continue

op = get_op(probe_log, pid, exec_epoch_no, tid, op_index).data
next_op = get_op(probe_log, *edge[1].op_quad()).data
if isinstance(op, OpenOp):
access_mode = op.flags & os.O_ACCMODE
processNode = ProcessNode(pid=pid, cmd=tuple(cmd_map[pid]))
dataflow_graph.add_node(processNode, label=processNode.cmd)
inode = Inode(Host.localhost(), Device(op.path.device_major, op.path.device_minor), op.path.inode)
path_str = op.path.path.decode("utf-8")
curr_version = InodeVersion(inode, numpy.datetime64(op.path.mtime.sec * int(1e9) + op.path.mtime.nsec, "ns"), op.path.size)
inode_version_map.setdefault(op.path.inode, set())
inode_version_map[op.path.inode].add(curr_version)
fileNode = FileAccess(curr_version, pathlib.Path(path_str))
dataflow_graph.add_node(fileNode)
path = pathlib.Path(op.path.path.decode("utf-8"))
if path not in name_map[inode]:
name_map[inode].append(path)
if access_mode == os.O_RDONLY:
dataflow_graph.add_edge(fileNode, processNode)
elif access_mode == os.O_WRONLY:
dataflow_graph.add_edge(processNode, fileNode)
            elif access_mode == os.O_RDWR:
console.print(f"Found file {path_str} with access mode O_RDWR", style="red")
else:
raise Exception("unknown access mode")
elif isinstance(op, CloneOp):
if op.task_type == TaskType.TASK_PID:
if edge[0].pid != edge[1].pid:
target_nodes[op.task_id].append(edge[1])
continue
elif op.task_type == TaskType.TASK_PTHREAD:
if edge[0].tid != edge[1].tid:
target_nodes[op.task_id].append(edge[1])
continue
if op.task_type != TaskType.TASK_PTHREAD and op.task_type != TaskType.TASK_ISO_C_THREAD:

processNode1 = ProcessNode(pid = pid, cmd=tuple(cmd_map[pid]))
processNode2 = ProcessNode(pid = op.task_id, cmd=tuple(cmd_map[op.task_id]))
dataflow_graph.add_node(processNode1, label = " ".join(arg for arg in processNode1.cmd))
dataflow_graph.add_node(processNode2, label = " ".join(arg for arg in processNode2.cmd))
dataflow_graph.add_edge(processNode1, processNode2)
target_nodes[op.task_id] = list()
elif isinstance(op, WaitOp) and op.options == 0:
for node in target_nodes[op.task_id]:
traverse_hb_for_dfgraph(probe_log, node, traversed, dataflow_graph, cmd_map, inode_version_map, hb_graph)
traversed.add(node.tid)
# return back to the WaitOp of the parent process
if isinstance(next_op, WaitOp):
if next_op.task_id == starting_pid or next_op.task_id == starting_op.pthread_id:
return


def probe_log_to_dataflow_graph(probe_log: ProbeLog, hb_graph: HbGraph) -> DfGraph:
dataflow_graph = DfGraph()
root_node = [n for n in hb_graph.nodes() if hb_graph.out_degree(n) > 0 and hb_graph.in_degree(n) == 0][0]
traversed: set[int] = set()
cmd_map = collections.defaultdict[int, list[str]](list)
for edge in list(hb_graph.edges())[::-1]:
pid, exec_epoch_no, tid, op_index = edge[0].op_quad()
op = get_op(probe_log, pid, exec_epoch_no, tid, op_index).data
if isinstance(op, ExecOp):
            if pid == tid and exec_epoch_no == 0:  # main thread: tid == pid
                cmd_map[tid] = [arg.decode(errors="surrogateescape") for arg in op.argv]

inode_version_map: dict[int, set[InodeVersion]] = {}
traverse_hb_for_dfgraph(probe_log, root_node, traversed, dataflow_graph, cmd_map, inode_version_map, hb_graph)

file_version: dict[str, int] = {}
for inode, versions in inode_version_map.items():
sorted_versions = sorted(
versions,
key=lambda version: typing.cast(int, version.mtime),
)
for idx, version in enumerate(sorted_versions):
str_id = f"{inode}_{version.mtime}"
file_version[str_id] = idx

for idx, node in enumerate(dataflow_graph.nodes()):
if isinstance(node, FileAccess):
str_id = f"{inode}_{version.mtime}"
label = f"{node.path} inode {node.inode_version.inode.number} fv {file_version[str_id]} "
networkx.set_node_attributes(dataflow_graph, {node: label}, "label") # type: ignore

return dataflow_graph