Improve parsing speed of walkfs plugins #749
Very nice find of the unnecessary double conversion!
dissect/target/helpers/fsutil.py
Outdated
for child_entry in path_entry.scandir():
    yield child_entry

    if child_entry.is_dir() and not child_entry.is_symlink():
        yield from walk_ng(child_entry)
Can't this just call walk_ext(), like this:
for _, dirs, files in walk_ext(path_entry, topdown=topdown, onerror=onerror, followlinks=followlinks):
    yield from itertools.chain(dirs, files)
This would reduce the number of mechanisms we use to walk, and it would allow the same topdown, onerror and followlinks arguments to be used in this function and in the functions in filesystem.py that call it.
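The two shapes under discussion can be sketched with the standard library alone. This is only an illustration under stated assumptions: os.scandir and os.walk stand in for path_entry.scandir() and walk_ext, and the names walk_entries, walk_via_os_walk and demo are hypothetical, not part of dissect.

```python
import itertools
import os
import tempfile
from typing import Iterator, List, Tuple


def walk_entries(path: str) -> Iterator[os.DirEntry]:
    """Recursive generator: yield every entry, descending into real directories."""
    with os.scandir(path) as it:
        for entry in it:
            yield entry
            # follow_symlinks=False makes this False for symlinks to directories,
            # which mirrors `is_dir() and not is_symlink()`.
            if entry.is_dir(follow_symlinks=False):
                yield from walk_entries(entry.path)


def walk_via_os_walk(path: str) -> Iterator[str]:
    """Flatten os.walk's (root, dirs, files) tuples into one stream of paths."""
    for root, dirs, files in os.walk(path):
        for name in itertools.chain(dirs, files):
            yield os.path.join(root, name)


def demo() -> Tuple[List[str], List[str]]:
    """Build a small throwaway tree and walk it both ways."""
    with tempfile.TemporaryDirectory() as tmp:
        os.makedirs(os.path.join(tmp, "a", "b"))
        for rel in ("top.txt", os.path.join("a", "mid.txt"), os.path.join("a", "b", "leaf.txt")):
            open(os.path.join(tmp, rel), "w").close()
        recursive = sorted(os.path.relpath(e.path, tmp) for e in walk_entries(tmp))
        flattened = sorted(os.path.relpath(p, tmp) for p in walk_via_os_walk(tmp))
        return recursive, flattened
```

Both walkers visit the same set of paths. The difference is that the tuple-based walker has to collect each directory's children into lists before the caller can flatten them again, while the recursive generator yields each entry as soon as it is seen.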
That would indeed be simpler. It would, however, introduce a performance hit compared to using the walk_ng function. I think this boils down to a choice between simplifying the code and aiming for performance and efficiency.
Is the difference large enough that it would be worth having different code? Do you have numbers on the difference in performance by any chance?
A quick test on a vmdk image with an ext4 volume shows a walkfs plugin runtime of about 2 minutes using your proposed walk_ext method versus 1 minute 20 seconds with the recurse method.
I think another advantage of not using walk_ext is that the recurse function does not need to be taken into account when changing walk_ext, as recurse would be fully independent.
current   19m43.133s
recurse   1m20.519s
proposed  2m3.715s
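To put the three timings on one scale, the arithmetic can be sketched as follows. The helper to_seconds is hypothetical and exists only for this calculation. The durations are the ones quoted above.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time`-style stamp such as '19m43.133s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised duration: {stamp!r}")
    minutes, secs = match.groups()
    return int(minutes) * 60 + float(secs)


# Timings as reported in the comment above.
timings = {"current": "19m43.133s", "recurse": "1m20.519s", "proposed": "2m3.715s"}
seconds = {name: to_seconds(value) for name, value in timings.items()}

# Express every runtime relative to the fastest variant (recurse).
baseline = seconds["recurse"]
for name, value in seconds.items():
    print(f"{name:>8}: {value:9.3f}s  ({value / baseline:.2f}x the recurse runtime)")
```

On these numbers the current implementation takes roughly 14.7 times as long as the recurse variant, and the walk_ext-based proposal roughly 1.5 times as long. This is a single measurement on one image, so the ratios are indicative only.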
029000e to ebf3eb9
What's the status of the review on this PR?
I have some additional suggestions for the capability changes too:
diff --git a/dissect/target/plugins/filesystem/unix/capability.py b/dissect/target/plugins/filesystem/unix/capability.py
index 4485994..97095dc 100644
--- a/dissect/target/plugins/filesystem/unix/capability.py
+++ b/dissect/target/plugins/filesystem/unix/capability.py
@@ -1,4 +1,3 @@
-import struct
from enum import IntEnum
from io import BytesIO
from typing import Iterator
@@ -86,8 +85,8 @@ class CapabilityPlugin(Plugin):
if not self.target.has_function("walkfs"):
raise UnsupportedPluginError("Need walkfs plugin")
- if self.target.os == "windows":
- raise UnsupportedPluginError("Walkfs not supported on Windows")
+ if not any(fs.__type__ in ("extfs", "xfs") for fs in self.target.filesystems):
+ raise UnsupportedPluginError("Capability plugin only works on EXT and XFS filesystems")
@export(record=CapabilityRecord)
def capability_binaries(self) -> Iterator[CapabilityRecord]:
@@ -111,7 +110,7 @@ class CapabilityPlugin(Plugin):
for attr in attrs:
try:
- parsed_attr = parse_attr(attr, BytesIO(attr.value))
+ permitted, inheritable, effective, root_id = parse_attr(attr.value)
except ValueError as e:
self.target.log.warning("Could not parse attributes for entry %s: %s", entry, str(e.value))
self.target.log.debug("", exc_info=e)
@@ -119,19 +118,25 @@ class CapabilityPlugin(Plugin):
yield CapabilityRecord(
ts_mtime=entry.lstat().st_mtime,
path=entry.path,
- **parsed_attr,
+ permitted=permitted,
+ inheritable=inheritable,
+ effective=effective,
+ root_id=root_id,
_target=self.target,
)
-def parse_attr(attr: object, buf: BytesIO) -> dict:
+def parse_attr(attr: bytes) -> tuple[list[str], list[str], bool, int]:
"""Efficiently parse a Linux xattr capability struct.
- Returns: dictionary of parsed capabilities for the given entry.
+ Returns:
+ A tuple of permitted capability names, inheritable capability names, effective flag and ``root_id``.
"""
+ buf = BytesIO(attr)
- # The struct is small enough we can just use struct
- magic_etc = struct.unpack("<I", buf.read(4))[0]
+ # The struct is small enough we can just use int.from_bytes
+ magic_etc = int.from_bytes(buf.read(4), "little")
+ effective = magic_etc & VFS_CAP_FLAGS_EFFECTIVE != 0
cap_revision = magic_etc & VFS_CAP_REVISION_MASK
permitted_caps = []
@@ -153,16 +158,15 @@ def parse_attr(attr: object, buf: BytesIO) -> dict:
else:
raise ValueError("Unexpected capability revision '%s'" % cap_revision)
- if data_len != (actual_len := len(attr.value)):
+ if data_len != (actual_len := len(attr)):
raise ValueError("Unexpected capability length (%s vs %s)", data_len, actual_len)
for _ in range(num_caps):
- permitted_val, inheritable_val = struct.unpack("<2I", buf.read(8))
- permitted_caps.append(permitted_val)
- inheritable_caps.append(inheritable_val)
+ permitted_caps.append(int.from_bytes(buf.read(4), "little"))
+ inheritable_caps.append(int.from_bytes(buf.read(4), "little"))
if cap_revision == VFS_CAP_REVISION_3:
- root_id = struct.unpack("<I", buf.read(4))[0]
+ root_id = int.from_bytes(buf.read(4), "little")
permitted = []
inheritable = []
@@ -178,9 +182,4 @@ def parse_attr(attr: object, buf: BytesIO) -> dict:
if caps[cap_index] & (1 << (capability.value & 31)) != 0:
results.append(capability.name)
- return {
- "root_id": root_id,
- "permitted": permitted,
- "inheritable": inheritable,
- "effective": magic_etc & VFS_CAP_FLAGS_EFFECTIVE != 0,
- }
+ return permitted, inheritable, effective, root_id
diff --git a/tests/plugins/filesystem/unix/test_capability.py b/tests/plugins/filesystem/unix/test_capability.py
index cf1b8b8..5a4edc2 100644
--- a/tests/plugins/filesystem/unix/test_capability.py
+++ b/tests/plugins/filesystem/unix/test_capability.py
@@ -1,10 +1,11 @@
from unittest.mock import Mock
-from dissect.target.filesystem import VirtualFile
+from dissect.target.filesystem import VirtualFile, VirtualFilesystem
from dissect.target.plugins.filesystem.unix.capability import CapabilityPlugin
+from dissect.target.target import Target
-def test_capability_plugin(target_unix, fs_unix):
+def test_capability_plugin(target_unix: Target, fs_unix: VirtualFilesystem) -> None:
# Some fictional capability values
xattr1 = Mock()
xattr1.name = "security.capability"
@@ -33,7 +34,7 @@ def test_capability_plugin(target_unix, fs_unix):
vfile3.lattr.return_value = [xattr3]
fs_unix.map_file_entry("/path/to/xattr3/file", vfile3)
- target_unix.add_plugin(CapabilityPlugin)
+ target_unix.add_plugin(CapabilityPlugin, check_compatible=False)
results = list(target_unix.capability_binaries())
assert len(results) == 3
#826 should make this neater in the future too.
Thanks, that looks better indeed. Added in b97e35b
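The int.from_bytes approach from the suggestion can be tried in isolation with a minimal, self-contained sketch. It is not the plugin's code: parse_caps, read_u32 and the three-entry CAP_NAMES table are hypothetical stand-ins, and root_id is returned as None when the revision carries no root id, which is an assumption of this sketch. The VFS_CAP_* constants and capability numbers are those of Linux's include/uapi/linux/capability.h.

```python
from __future__ import annotations

import struct
from io import BytesIO

# Constants from Linux's include/uapi/linux/capability.h.
VFS_CAP_REVISION_MASK = 0xFF000000
VFS_CAP_FLAGS_EFFECTIVE = 0x000001
VFS_CAP_REVISION_1 = 0x01000000
VFS_CAP_REVISION_2 = 0x02000000
VFS_CAP_REVISION_3 = 0x03000000

# Deliberately tiny subset of capability numbers, for illustration only.
CAP_NAMES = {0: "CAP_CHOWN", 10: "CAP_NET_BIND_SERVICE", 13: "CAP_NET_RAW"}


def read_u32(buf: BytesIO) -> int:
    """Read one little-endian 32-bit unsigned integer without using struct."""
    return int.from_bytes(buf.read(4), "little")


def parse_caps(raw: bytes) -> tuple[list[str], list[str], bool, int | None]:
    """Return (permitted names, inheritable names, effective flag, root_id)."""
    buf = BytesIO(raw)
    magic_etc = read_u32(buf)
    effective = (magic_etc & VFS_CAP_FLAGS_EFFECTIVE) != 0
    revision = magic_etc & VFS_CAP_REVISION_MASK

    # Each revision fixes the number of 32-bit word pairs and the total size.
    if revision == VFS_CAP_REVISION_1:
        num_words, expected_len = 1, 12
    elif revision == VFS_CAP_REVISION_2:
        num_words, expected_len = 2, 20
    elif revision == VFS_CAP_REVISION_3:
        num_words, expected_len = 2, 24
    else:
        raise ValueError(f"Unexpected capability revision {revision:#x}")
    if len(raw) != expected_len:
        raise ValueError(f"Unexpected capability length ({expected_len} vs {len(raw)})")

    permitted_words, inheritable_words = [], []
    for _ in range(num_words):
        permitted_words.append(read_u32(buf))
        inheritable_words.append(read_u32(buf))
    root_id = read_u32(buf) if revision == VFS_CAP_REVISION_3 else None

    def names(words: list[int]) -> list[str]:
        # Capability N lives in word N // 32, bit N % 32.
        return [name for number, name in CAP_NAMES.items() if words[number >> 5] & (1 << (number & 31))]

    return names(permitted_words), names(inheritable_words), effective, root_id


# struct is used here only to build a fictional revision 2 test value.
raw = struct.pack("<5I", VFS_CAP_REVISION_2 | VFS_CAP_FLAGS_EFFECTIVE, 1 << 10, 0, 0, 0)
print(parse_caps(raw))
```

Returning a plain tuple, as in the suggested diff, lets the caller unpack the four values by name when building the record instead of splatting a dictionary.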
Co-authored-by: Erik Schamper <[email protected]>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #749      +/-   ##
==========================================
- Coverage   75.54%   75.51%   -0.03%
==========================================
  Files         305      305
  Lines       26334    26343       +9
==========================================
- Hits        19894    19893       -1
- Misses       6440     6450      +10
Flags with carried forward coverage won't be shown.
Inspired by fox-it/dissect.ntfs#35 we profiled and improved the walkfs plugin. By removing the unnecessary FilesystemEntry -> TargetPath -> FilesystemEntry conversion we see a ~2x speed improvement across varying underlying container and filesystem formats.

We also introduced fsutil.walk_ng, since we were missing a function that returns a plain iterator of FilesystemEntry objects instead of tuples of separated files, folders and parent paths.

When refactoring the walkfs plugin we also had to refactor the capabilities plugin to match its new record structure.

We have not yet added any tests that demonstrate the speed improvement we have experienced. However, we noticed PR #747, which seems to use some form of pytest-benchmark. That certainly looks promising for the dissect project.

Please let us know if walk_ng should be renamed to something else. Consider it a placeholder for a better function name :)
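Why an entry -> path -> entry round trip costs time can be illustrated with a toy model. This is not how dissect implements FilesystemEntry or TargetPath. ToyFilesystem, ToyEntry, walk_direct and walk_round_trip are hypothetical names, and the only assumption is that turning a path back into an entry requires a fresh lookup.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class ToyFilesystem:
    """In-memory stand-in for a filesystem that counts path lookups."""

    tree: dict
    lookups: int = 0

    def get(self, path: str) -> ToyEntry:
        # Resolving a path walks the tree from the root, which is the costly part.
        self.lookups += 1
        node = self.tree
        for part in filter(None, path.split("/")):
            node = node[part]
        return ToyEntry(self, path, node)


@dataclass
class ToyEntry:
    fs: ToyFilesystem
    path: str
    node: Optional[dict]  # dict for a directory, None for a file

    def is_dir(self) -> bool:
        return isinstance(self.node, dict)

    def scandir(self) -> Iterator[ToyEntry]:
        for name, child in self.node.items():
            yield ToyEntry(self.fs, f"{self.path.rstrip('/')}/{name}", child)


def walk_direct(entry: ToyEntry) -> Iterator[ToyEntry]:
    """Reuse the entries that scandir already produced."""
    for child in entry.scandir():
        yield child
        if child.is_dir():
            yield from walk_direct(child)


def walk_round_trip(entry: ToyEntry) -> Iterator[ToyEntry]:
    """Convert each entry to a path and resolve it again before using it."""
    for child in entry.scandir():
        resolved = entry.fs.get(child.path)  # entry -> path -> entry
        yield resolved
        if resolved.is_dir():
            yield from walk_round_trip(resolved)


TREE = {"etc": {"passwd": None, "ssh": {"sshd_config": None}}, "bin": {"ls": None}}
```

In this model, walking the six entries of TREE directly needs a single lookup for the root, while the round-trip variant performs one extra lookup per entry. Real filesystems differ in how expensive a lookup is, so the model says nothing about the size of the speed-up.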