Improve parsing speed of walkfs plugins #749

JSCU-CNI · 2024-07-16T15:47:42Z

Inspired by fox-it/dissect.ntfs#35, we profiled and improved the walkfs plugin. By removing the unnecessary FilesystemEntry -> TargetPath -> FilesystemEntry conversion, we see a roughly 2x speed improvement across various underlying container and filesystem formats.

We also introduced fsutil.walk_ng, since we were missing a function that returns a plain iterator of FilesystemEntry objects instead of tuples of separated files, folders and parent paths.
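
As a rough illustration of the difference in shape between the two kinds of walker, here is a minimal sketch that uses the standard library's `os.scandir` as a stand-in for `FilesystemEntry.scandir()`. The name `walk_entries` is hypothetical and this is not the dissect implementation; it only shows a generator that yields every entry directly instead of `(path, dirs, files)` tuples.

```python
import os
from collections.abc import Iterator


def walk_entries(path: str) -> Iterator[os.DirEntry]:
    """Yield every entry below ``path`` as one flat stream (hypothetical helper)."""
    with os.scandir(path) as entries:
        for entry in entries:
            # Hand the entry object straight to the caller, no path round-trip.
            yield entry

            # Recurse into real directories only; symlinked directories are skipped.
            if entry.is_dir(follow_symlinks=False):
                yield from walk_entries(entry.path)
```

A caller can then filter the stream itself, for example `[e.path for e in walk_entries(".") if e.is_file()]`.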

While refactoring the walkfs plugin, we also had to refactor the capabilities plugin to match its new record structure.

We have not yet added tests that demonstrate the speed improvement we observed. However, we noticed PR #747, which seems to use some form of pytest-benchmark. That certainly looks promising for the dissect project.

Please let us know if walk_ng should be renamed to something else; consider it a placeholder for a better function name :)

pyrco

Very nice find of the unnecessary double conversion!

pyrco · 2024-07-19T13:54:43Z

dissect/target/helpers/fsutil.py

+    for child_entry in path_entry.scandir():
+        yield child_entry
+
+        if child_entry.is_dir() and not child_entry.is_symlink():
+            yield from walk_ng(child_entry)


Can't this just call walk_ext(), like this:

for _, dirs, files in walk_ext(path_entry, topdown=topdown, onerror=onerror, followlinks=followlinks):
    yield from itertools.chain(dirs, files)

This would reduce the number of mechanisms we use to walk, and it would allow us to use the same topdown, onerror and followlinks arguments in this function and in the functions in filesystem.py that call it.
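
To make the suggestion concrete, here is a minimal sketch of the same idea using the standard library's `os.walk` as a stand-in for `walk_ext`. It is an assumption-laden analogy, not dissect code: `os.walk` yields names rather than entry objects, so the sketch joins them back into paths, and the helper name `walk_flat` is hypothetical.

```python
import itertools
import os
from collections.abc import Iterator


def walk_flat(path: str, topdown: bool = True, onerror=None, followlinks: bool = False) -> Iterator[str]:
    """Flatten the (dirpath, dirs, files) tuples of os.walk into one stream of paths."""
    for dirpath, dirs, files in os.walk(path, topdown=topdown, onerror=onerror, followlinks=followlinks):
        # Directories and files of this level are emitted together, directories first.
        for name in itertools.chain(dirs, files):
            yield os.path.join(dirpath, name)
```

The wrapper inherits the `topdown`, `onerror` and `followlinks` behaviour for free, at the cost of building the intermediate `dirs` and `files` lists for every directory.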

That would indeed be simpler. It would, however, introduce a performance hit compared to using the walk_ng function. I think this boils down to a choice between simplifying the code and aiming for performance / efficiency.

Is the difference large enough to justify having separate code? Do you have numbers on the difference in performance, by any chance?

A quick test on a vmdk image with an ext4 volume shows a walkfs plugin runtime of about 2 minutes using your proposed walk_ext method, versus 1 minute 20 seconds with the recurse method.

I think another advantage of not using walk_ext is that recurse would be fully independent, so it does not need to be taken into account when walk_ext changes.

Method	Runtime
current	19m43.133s
recurse	1m20.519s
proposed	2m3.715s
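
For reference, the ratios implied by the three reported timings (current, recurse, proposed) can be worked out as follows. This is only arithmetic on the numbers given in this comment, which come from a single run on one image; the helper name `to_seconds` is made up for the example.

```python
import re


def to_seconds(value: str) -> float:
    """Convert a `time`-style duration such as '1m20.519s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", value)
    if match is None:
        raise ValueError(f"Unrecognised duration: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Timings as reported in the comment above.
timings = {"current": "19m43.133s", "recurse": "1m20.519s", "proposed": "2m3.715s"}
seconds = {name: to_seconds(value) for name, value in timings.items()}

# Express every runtime relative to the fastest variant (recurse).
baseline = seconds["recurse"]
for name, secs in seconds.items():
    print(f"{name:<9}{secs:>10.3f}s  {secs / baseline:5.2f}x the recurse runtime")
```

On these numbers, recurse is roughly 14.7 times faster than the current implementation and about 1.5 times faster than the walk_ext-based proposal.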

pyrco · 2024-07-26T08:27:21Z

dissect/target/helpers/fsutil.py

+    for child_entry in path_entry.scandir():
+        yield child_entry
+
+        if child_entry.is_dir() and not child_entry.is_symlink():
+            yield from walk_ng(child_entry)


Is the difference large enough to justify having separate code? Do you have numbers on the difference in performance, by any chance?

JSCU-CNI · 2024-08-19T11:10:30Z

What's the status of the review on this PR?

Schamper

I have some additional suggestions for the capability changes too:

diff --git a/dissect/target/plugins/filesystem/unix/capability.py b/dissect/target/plugins/filesystem/unix/capability.py
index 4485994..97095dc 100644
--- a/dissect/target/plugins/filesystem/unix/capability.py
+++ b/dissect/target/plugins/filesystem/unix/capability.py
@@ -1,4 +1,3 @@
-import struct
 from enum import IntEnum
 from io import BytesIO
 from typing import Iterator
@@ -86,8 +85,8 @@ class CapabilityPlugin(Plugin):
         if not self.target.has_function("walkfs"):
             raise UnsupportedPluginError("Need walkfs plugin")
 
-        if self.target.os == "windows":
-            raise UnsupportedPluginError("Walkfs not supported on Windows")
+        if not any(fs.__type__ in ("extfs", "xfs") for fs in self.target.filesystems):
+            raise UnsupportedPluginError("Capability plugin only works on EXT and XFS filesystems")
 
     @export(record=CapabilityRecord)
     def capability_binaries(self) -> Iterator[CapabilityRecord]:
@@ -111,7 +110,7 @@ class CapabilityPlugin(Plugin):
 
             for attr in attrs:
                 try:
-                    parsed_attr = parse_attr(attr, BytesIO(attr.value))
+                    permitted, inheritable, effective, root_id = parse_attr(attr.value)
                 except ValueError as e:
                     self.target.log.warning("Could not parse attributes for entry %s: %s", entry, str(e.value))
                     self.target.log.debug("", exc_info=e)
@@ -119,19 +118,25 @@ class CapabilityPlugin(Plugin):
                 yield CapabilityRecord(
                     ts_mtime=entry.lstat().st_mtime,
                     path=entry.path,
-                    **parsed_attr,
+                    permitted=permitted,
+                    inheritable=inheritable,
+                    effective=effective,
+                    root_id=root_id,
                     _target=self.target,
                 )
 
 
-def parse_attr(attr: object, buf: BytesIO) -> dict:
+def parse_attr(attr: bytes) -> tuple[list[str], list[str], bool, int]:
     """Efficiently parse a Linux xattr capability struct.
 
-    Returns: dictionary of parsed capabilities for the given entry.
+    Returns:
+        A tuple of permitted capability names, inheritable capability names, effective flag and ``root_id``.
     """
+    buf = BytesIO(attr)
 
-    # The struct is small enough we can just use struct
-    magic_etc = struct.unpack("<I", buf.read(4))[0]
+    # The struct is small enough we can just use int.from_bytes
+    magic_etc = int.from_bytes(buf.read(4), "little")
+    effective = magic_etc & VFS_CAP_FLAGS_EFFECTIVE != 0
     cap_revision = magic_etc & VFS_CAP_REVISION_MASK
 
     permitted_caps = []
@@ -153,16 +158,15 @@ def parse_attr(attr: object, buf: BytesIO) -> dict:
     else:
         raise ValueError("Unexpected capability revision '%s'" % cap_revision)
 
-    if data_len != (actual_len := len(attr.value)):
+    if data_len != (actual_len := len(attr)):
         raise ValueError("Unexpected capability length (%s vs %s)", data_len, actual_len)
 
     for _ in range(num_caps):
-        permitted_val, inheritable_val = struct.unpack("<2I", buf.read(8))
-        permitted_caps.append(permitted_val)
-        inheritable_caps.append(inheritable_val)
+        permitted_caps.append(int.from_bytes(buf.read(4), "little"))
+        inheritable_caps.append(int.from_bytes(buf.read(4), "little"))
 
     if cap_revision == VFS_CAP_REVISION_3:
-        root_id = struct.unpack("<I", buf.read(4))[0]
+        root_id = int.from_bytes(buf.read(4), "little")
 
     permitted = []
     inheritable = []
@@ -178,9 +182,4 @@ def parse_attr(attr: object, buf: BytesIO) -> dict:
             if caps[cap_index] & (1 << (capability.value & 31)) != 0:
                 results.append(capability.name)
 
-    return {
-        "root_id": root_id,
-        "permitted": permitted,
-        "inheritable": inheritable,
-        "effective": magic_etc & VFS_CAP_FLAGS_EFFECTIVE != 0,
-    }
+    return permitted, inheritable, effective, root_id
diff --git a/tests/plugins/filesystem/unix/test_capability.py b/tests/plugins/filesystem/unix/test_capability.py
index cf1b8b8..5a4edc2 100644
--- a/tests/plugins/filesystem/unix/test_capability.py
+++ b/tests/plugins/filesystem/unix/test_capability.py
@@ -1,10 +1,11 @@
 from unittest.mock import Mock
 
-from dissect.target.filesystem import VirtualFile
+from dissect.target.filesystem import VirtualFile, VirtualFilesystem
 from dissect.target.plugins.filesystem.unix.capability import CapabilityPlugin
+from dissect.target.target import Target
 
 
-def test_capability_plugin(target_unix, fs_unix):
+def test_capability_plugin(target_unix: Target, fs_unix: VirtualFilesystem) -> None:
     # Some fictional capability values
     xattr1 = Mock()
     xattr1.name = "security.capability"
@@ -33,7 +34,7 @@ def test_capability_plugin(target_unix, fs_unix):
     vfile3.lattr.return_value = [xattr3]
     fs_unix.map_file_entry("/path/to/xattr3/file", vfile3)
 
-    target_unix.add_plugin(CapabilityPlugin)
+    target_unix.add_plugin(CapabilityPlugin, check_compatible=False)
 
     results = list(target_unix.capability_binaries())
     assert len(results) == 3

#826 should make this neater in the future too.
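
For readers unfamiliar with the layout being parsed, the following is a simplified, standalone sketch of decoding a `security.capability` xattr value with `int.from_bytes`. The constants follow the Linux `vfs_cap_data` definitions in `linux/capability.h`; the `Capability` enum lists only a small subset of capabilities, and `parse_caps` and `read_u32` are hypothetical names, not the plugin's actual functions.

```python
from enum import IntEnum
from io import BytesIO
from typing import Optional

VFS_CAP_REVISION_MASK = 0xFF000000
VFS_CAP_FLAGS_EFFECTIVE = 0x000001
VFS_CAP_REVISION_1 = 0x01000000
VFS_CAP_REVISION_2 = 0x02000000
VFS_CAP_REVISION_3 = 0x03000000

# revision -> (number of 32-bit capability words, expected xattr length in bytes)
LAYOUTS = {
    VFS_CAP_REVISION_1: (1, 12),
    VFS_CAP_REVISION_2: (2, 20),
    VFS_CAP_REVISION_3: (2, 24),
}


class Capability(IntEnum):
    """Illustrative subset of Linux capability numbers."""

    CAP_CHOWN = 0
    CAP_NET_BIND_SERVICE = 10
    CAP_NET_RAW = 13
    CAP_SYS_ADMIN = 21


def read_u32(buf: BytesIO) -> int:
    """Read one little-endian unsigned 32-bit integer."""
    return int.from_bytes(buf.read(4), "little")


def parse_caps(attr: bytes) -> tuple[list[str], list[str], bool, Optional[int]]:
    """Return (permitted, inheritable, effective, root_id) for a capability xattr."""
    buf = BytesIO(attr)
    magic_etc = read_u32(buf)
    effective = (magic_etc & VFS_CAP_FLAGS_EFFECTIVE) != 0
    revision = magic_etc & VFS_CAP_REVISION_MASK

    if revision not in LAYOUTS:
        raise ValueError(f"Unexpected capability revision {revision:#x}")
    num_words, expected_len = LAYOUTS[revision]
    if len(attr) != expected_len:
        raise ValueError(f"Unexpected capability length ({expected_len} vs {len(attr)})")

    # Each capability word is stored as a (permitted, inheritable) pair.
    permitted_words, inheritable_words = [], []
    for _ in range(num_words):
        permitted_words.append(read_u32(buf))
        inheritable_words.append(read_u32(buf))

    # Only revision 3 carries a trailing root user id.
    root_id = read_u32(buf) if revision == VFS_CAP_REVISION_3 else None

    def names(words: list[int]) -> list[str]:
        # Capability N lives in word N // 32, at bit N % 32.
        return [
            cap.name
            for cap in Capability
            if (cap.value >> 5) < len(words) and words[cap.value >> 5] & (1 << (cap.value & 31))
        ]

    return names(permitted_words), names(inheritable_words), effective, root_id
```

Returning a plain tuple, as in the suggestion above, lets the caller unpack the four fields directly and pass them to the record by keyword.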

JSCU-CNI · 2024-08-19T12:30:07Z

Thanks, that looks better indeed. Added in b97e35b

dissect/target/plugins/filesystem/unix/capability.py

Co-authored-by: Erik Schamper <[email protected]>

codecov · 2024-08-19T17:34:56Z

Codecov Report

Attention: Patch coverage is 74.71264% with 22 lines in your changes missing coverage. Please review.

Project coverage is 75.51%. Comparing base (6894023) to head (0ea6469).
Report is 1 commit behind head on main.

Files	Patch %	Lines
...ssect/target/plugins/filesystem/unix/capability.py	77.19%	13 Missing ⚠️
dissect/target/plugins/filesystem/walkfs.py	55.55%	8 Missing ⚠️
dissect/target/helpers/fsutil.py	87.50%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #749      +/-   ##
==========================================
- Coverage   75.54%   75.51%   -0.03%     
==========================================
  Files         305      305              
  Lines       26334    26343       +9     
==========================================
- Hits        19894    19893       -1     
- Misses       6440     6450      +10

Flag	Coverage Δ
unittests	`75.51% <74.71%> (-0.03%)`	⬇️


dissect/target/filesystem.py

Co-authored-by: Erik Schamper <[email protected]>

Yes

JSCU-CNI changed the title ~~improve walkfs plugins~~ Improve parsing speed of walkfs plugins Jul 17, 2024

EinatFox linked an issue Jul 18, 2024 that may be closed by this pull request

Improve parsing speed of walkfs plugins #753

Closed

pyrco requested changes Jul 19, 2024

pyrco previously requested changes Jul 26, 2024

JSCU-CNI added 2 commits August 8, 2024 13:22

improve walkfs plugins

7eafb6d

implement review comments

ebf3eb9

JSCU-CNI force-pushed the improvement/walkfs branch from 029000e to ebf3eb9 Compare August 8, 2024 11:38

JSCU-CNI requested review from pyrco and Schamper August 8, 2024 11:52

JSCU-CNI added 2 commits August 14, 2024 14:55

Merge branch 'main' into improvement/walkfs

5672fe6

Merge branch 'main' into improvement/walkfs

9aa701c

Schamper requested changes Aug 19, 2024

dissect/target/helpers/fsutil.py (outdated, resolved)

JSCU-CNI and others added 2 commits August 19, 2024 14:24

Merge branch 'main' into improvement/walkfs

5623b8b

add review suggestions

b97e35b

Schamper requested changes Aug 19, 2024

dissect/target/plugins/filesystem/unix/capability.py (3 outdated, resolved threads)

Apply suggestions from code review

3d0f2d3

Co-authored-by: Erik Schamper <[email protected]>

Merge branch 'main' into improvement/walkfs

b6d6845

JSCU-CNI requested a review from Schamper August 20, 2024 12:37

Schamper requested changes Aug 21, 2024

dissect/target/filesystem.py (2 outdated, resolved threads)

Apply suggestions from code review

0ea6469

Co-authored-by: Erik Schamper <[email protected]>

Schamper approved these changes Aug 21, 2024

Schamper merged commit d5c7315 into fox-it:main Aug 22, 2024
16 of 18 checks passed

JSCU-CNI deleted the improvement/walkfs branch August 26, 2024 13:40

JSCU-CNI commented Jul 16, 2024

pyrco left a comment

pyrco Jul 19, 2024

JSCU-CNI Jul 25, 2024

pyrco Jul 26, 2024

JSCU-CNI Aug 8, 2024 (edited)

pyrco Jul 26, 2024

JSCU-CNI commented Aug 19, 2024

Schamper left a comment

JSCU-CNI commented Aug 19, 2024

codecov bot commented Aug 19, 2024 (edited)