Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve parsing speed of walkfs plugins #749

Merged
merged 9 commits into from
Aug 22, 2024

Conversation

JSCU-CNI
Copy link
Contributor

Inspired by fox-it/dissect.ntfs#35 we profiled and improved the walkfs plugin. By removing the unnecessary FilesystemEntry -> TargetPath -> FilesystemEntry conversion we see ~ 2x speed improvement on varying underlying container and filesystem formats.

We also introduced fsutil.walk_ng since we missed a function that returns a plain iterator for FilesystemEntrys instead of tuple separated files, folders and parent paths.

When refactoring the walkfs plugin we also had to refactor the capabilities plugin to match it's new record structure.

We currently have not added any extra tests that show the speed improvement we have experienced, however we noticed the #747 PR, which seems to be using some form of pytest-benchmark. That certainly looks promising for the dissect project.

Please let us know if walkfs_ng should be renamed to something else, consider it a placeholder for a better function name :)

@JSCU-CNI JSCU-CNI changed the title improve walkfs plugins Improve parsing speed of walkfs plugins Jul 17, 2024
@EinatFox EinatFox linked an issue Jul 18, 2024 that may be closed by this pull request
Copy link
Contributor

@pyrco pyrco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice find of the unnecessary double conversion!

Comment on lines 298 to 302
for child_entry in path_entry.scandir():
yield child_entry

if child_entry.is_dir() and not child_entry.is_symlink():
yield from walk_ng(child_entry)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this just call walk_ext() like this:

    for _, dirs, files in walk_ext(path_entry, topdown=topdown, onerror=onerror, followlinks=followlinks)
        yield from itertools.chain(dirs, files)

This would reduce the number of mechanisms we use to walk and it would allow to use the same topdown, onerror and followlinks arguments in this function and the functions that call this function in filesystem.py.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would indeed be simpler, it would however introduce a performance hit compared to using the walk_ng function. I think this boils down to either simplifying the code or aiming for performance / efficiency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the difference is large enough that it would be worth having different code? Do you have numbers on the difference in performance by any chance?

Copy link
Contributor Author

@JSCU-CNI JSCU-CNI Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick test on a vmdk image with ext4 volume shows a walkfs plugin runtime of 2 minutes using your proposed walk_ext method versus 1 minute 20 seconds with the recurse method.

I think another advantage of not using walk_ext is that the recurse function does not need to be taken into account when changing walk_ext as recurse would be fully independent.

current      19m43.133s
recurse      1m20.519s
proposed     2m3.715s

pyrco
pyrco previously requested changes Jul 26, 2024
Comment on lines 298 to 302
for child_entry in path_entry.scandir():
yield child_entry

if child_entry.is_dir() and not child_entry.is_symlink():
yield from walk_ng(child_entry)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the difference is large enough that it would be worth having different code? Do you have numbers on the difference in performance by any chance?

@JSCU-CNI JSCU-CNI force-pushed the improvement/walkfs branch from 029000e to ebf3eb9 Compare August 8, 2024 11:38
@JSCU-CNI JSCU-CNI requested review from pyrco and Schamper August 8, 2024 11:52
@JSCU-CNI
Copy link
Contributor Author

What's the status of the review on this PR?

Copy link
Member

@Schamper Schamper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some additional suggestions for the capability changes too:

diff --git a/dissect/target/plugins/filesystem/unix/capability.py b/dissect/target/plugins/filesystem/unix/capability.py
index 4485994..97095dc 100644
--- a/dissect/target/plugins/filesystem/unix/capability.py
+++ b/dissect/target/plugins/filesystem/unix/capability.py
@@ -1,4 +1,3 @@
-import struct
 from enum import IntEnum
 from io import BytesIO
 from typing import Iterator
@@ -86,8 +85,8 @@ class CapabilityPlugin(Plugin):
         if not self.target.has_function("walkfs"):
             raise UnsupportedPluginError("Need walkfs plugin")
 
-        if self.target.os == "windows":
-            raise UnsupportedPluginError("Walkfs not supported on Windows")
+        if not any(fs.__type__ in ("extfs", "xfs") for fs in self.target.filesystems):
+            raise UnsupportedPluginError("Capability plugin only works on EXT and XFS filesystems")
 
     @export(record=CapabilityRecord)
     def capability_binaries(self) -> Iterator[CapabilityRecord]:
@@ -111,7 +110,7 @@ class CapabilityPlugin(Plugin):
 
             for attr in attrs:
                 try:
-                    parsed_attr = parse_attr(attr, BytesIO(attr.value))
+                    permitted, inheritable, effective, root_id = parse_attr(attr.value)
                 except ValueError as e:
                     self.target.log.warning("Could not parse attributes for entry %s: %s", entry, str(e.value))
                     self.target.log.debug("", exc_info=e)
@@ -119,19 +118,25 @@ class CapabilityPlugin(Plugin):
                 yield CapabilityRecord(
                     ts_mtime=entry.lstat().st_mtime,
                     path=entry.path,
-                    **parsed_attr,
+                    permitted=permitted,
+                    inheritable=inheritable,
+                    effective=effective,
+                    root_id=root_id,
                     _target=self.target,
                 )
 
 
-def parse_attr(attr: object, buf: BytesIO) -> dict:
+def parse_attr(attr: bytes) -> tuple[int, list[str], list[str], bool]:
     """Efficiently parse a Linux xattr capability struct.
 
-    Returns: dictionary of parsed capabilities for the given entry.
+    Returns:
+        A tuple of permitted capability names, inheritable capability names, effective flag and ``root_id``.
     """
+    buf = BytesIO(attr)
 
-    # The struct is small enough we can just use struct
-    magic_etc = struct.unpack("<I", buf.read(4))[0]
+    # The struct is small enough we can just use int.from_bytes
+    magic_etc = int.from_bytes(buf.read(4), "little")
+    effective = magic_etc & VFS_CAP_FLAGS_EFFECTIVE != 0
     cap_revision = magic_etc & VFS_CAP_REVISION_MASK
 
     permitted_caps = []
@@ -153,16 +158,15 @@ def parse_attr(attr: object, buf: BytesIO) -> dict:
     else:
         raise ValueError("Unexpected capability revision '%s'" % cap_revision)
 
-    if data_len != (actual_len := len(attr.value)):
+    if data_len != (actual_len := len(attr)):
         raise ValueError("Unexpected capability length (%s vs %s)", data_len, actual_len)
 
     for _ in range(num_caps):
-        permitted_val, inheritable_val = struct.unpack("<2I", buf.read(8))
-        permitted_caps.append(permitted_val)
-        inheritable_caps.append(inheritable_val)
+        permitted_caps.append(int.from_bytes(buf.read(4), "little"))
+        inheritable_caps.append(int.from_bytes(buf.read(4), "little"))
 
     if cap_revision == VFS_CAP_REVISION_3:
-        root_id = struct.unpack("<I", buf.read(4))[0]
+        root_id = int.from_bytes(buf.read(4), "little")
 
     permitted = []
     inheritable = []
@@ -178,9 +182,4 @@ def parse_attr(attr: object, buf: BytesIO) -> dict:
             if caps[cap_index] & (1 << (capability.value & 31)) != 0:
                 results.append(capability.name)
 
-    return {
-        "root_id": root_id,
-        "permitted": permitted,
-        "inheritable": inheritable,
-        "effective": magic_etc & VFS_CAP_FLAGS_EFFECTIVE != 0,
-    }
+    return permitted, inheritable, effective, root_id
diff --git a/tests/plugins/filesystem/unix/test_capability.py b/tests/plugins/filesystem/unix/test_capability.py
index cf1b8b8..5a4edc2 100644
--- a/tests/plugins/filesystem/unix/test_capability.py
+++ b/tests/plugins/filesystem/unix/test_capability.py
@@ -1,10 +1,11 @@
 from unittest.mock import Mock
 
-from dissect.target.filesystem import VirtualFile
+from dissect.target.filesystem import VirtualFile, VirtualFilesystem
 from dissect.target.plugins.filesystem.unix.capability import CapabilityPlugin
+from dissect.target.target import Target
 
 
-def test_capability_plugin(target_unix, fs_unix):
+def test_capability_plugin(target_unix: Target, fs_unix: VirtualFilesystem) -> None:
     # Some fictional capability values
     xattr1 = Mock()
     xattr1.name = "security.capability"
@@ -33,7 +34,7 @@ def test_capability_plugin(target_unix, fs_unix):
     vfile3.lattr.return_value = [xattr3]
     fs_unix.map_file_entry("/path/to/xattr3/file", vfile3)
 
-    target_unix.add_plugin(CapabilityPlugin)
+    target_unix.add_plugin(CapabilityPlugin, check_compatible=False)
 
     results = list(target_unix.capability_binaries())
     assert len(results) == 3

#826 should make this neater in the future too.

@JSCU-CNI
Copy link
Contributor Author

Thanks, that looks better indeed. Added in b97e35b

Copy link

codecov bot commented Aug 19, 2024

Codecov Report

Attention: Patch coverage is 74.71264% with 22 lines in your changes missing coverage. Please review.

Project coverage is 75.51%. Comparing base (6894023) to head (0ea6469).
Report is 1 commits behind head on main.

Files Patch % Lines
...ssect/target/plugins/filesystem/unix/capability.py 77.19% 13 Missing ⚠️
dissect/target/plugins/filesystem/walkfs.py 55.55% 8 Missing ⚠️
dissect/target/helpers/fsutil.py 87.50% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #749      +/-   ##
==========================================
- Coverage   75.54%   75.51%   -0.03%     
==========================================
  Files         305      305              
  Lines       26334    26343       +9     
==========================================
- Hits        19894    19893       -1     
- Misses       6440     6450      +10     
Flag Coverage Δ
unittests 75.51% <74.71%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@JSCU-CNI JSCU-CNI requested a review from Schamper August 20, 2024 12:37
@Schamper Schamper merged commit d5c7315 into fox-it:main Aug 22, 2024
16 of 18 checks passed
@JSCU-CNI JSCU-CNI deleted the improvement/walkfs branch August 26, 2024 13:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve parsing speed of walkfs plugins
3 participants