Metadata file #16
We have a root File, and in case of processing a directory, we have a list of root Files.
Chunk
Questions:
Almost all of the information described above is now part of the reporting feature of unblob. The information that is missing right now: […]
I don't think item 1 has a lot of added value right now. Regarding item 2, we already have the structure in place to collect that information. What remains is making sure the extraction phase preserves that information so that we can simply report it. I would take care of item 2 in two steps:
On top of that, I would like to add a specific feature to our meta-data collection effort: saving header information. The idea is to have a `metadata` dict on `ValidChunk` that handlers can fill in (see the diff below). I submitted a PR to […]. The idea behind this is to expose metadata to further analysis steps through the unblob report (e.g. a binary analysis toolkit would read the load address and architecture from a uImage chunk to analyze the file extracted from that chunk with the right settings). All of these changes are quite simple to implement since reporting is already there:

```diff
diff --git a/unblob/handlers/archive/sevenzip.py b/unblob/handlers/archive/sevenzip.py
index 040b409..de171c5 100644
--- a/unblob/handlers/archive/sevenzip.py
+++ b/unblob/handlers/archive/sevenzip.py
@@ -70,4 +70,8 @@ class SevenZipHandler(StructHandler):
         # We read the signature header here to get the offset to the header database
         first_db_header = start_offset + len(header) + header.next_header_offset
         end_offset = first_db_header + header.next_header_size
-        return ValidChunk(start_offset=start_offset, end_offset=end_offset)
+        return ValidChunk(
+            start_offset=start_offset,
+            end_offset=end_offset,
+            metadata=dict(header),
+        )
diff --git a/unblob/models.py b/unblob/models.py
index 2b8431f..d101a08 100644
--- a/unblob/models.py
+++ b/unblob/models.py
@@ -88,6 +88,7 @@ class ValidChunk(Chunk):
     handler: "Handler" = attr.ib(init=False, eq=False)
     is_encrypted: bool = attr.ib(default=False)
+    metadata: dict = attr.ib(factory=dict)

     def extract(self, inpath: Path, outdir: Path):
         if self.is_encrypted:
@@ -108,6 +109,7 @@ class ValidChunk(Chunk):
             size=self.size,
             handler_name=self.handler.NAME,
             is_encrypted=self.is_encrypted,
+            metadata=self.metadata,
             extraction_reports=extraction_reports,
         )
@@ -188,7 +190,7 @@ class _JSONEncoder(json.JSONEncoder):
         if isinstance(obj, bytes):
             try:
-                return obj.decode()
+                return obj.decode("utf-8", errors="surrogateescape")
             except UnicodeDecodeError:
                 return str(obj)
diff --git a/unblob/report.py b/unblob/report.py
index 1b5bed1..acdabaf 100644
--- a/unblob/report.py
+++ b/unblob/report.py
@@ -4,7 +4,7 @@ import stat
 import traceback
 from enum import Enum
 from pathlib import Path
-from typing import List, Optional, Union, final
+from typing import Dict, List, Optional, Union, final

 import attr
@@ -116,6 +116,12 @@ class MaliciousSymlinkRemoved(ErrorReport):
 class StatReport(Report):
     path: Path
     size: int
+    ctime: int
+    mtime: int
+    atime: int
+    uid: int
+    gid: int
+    mode: int
     is_dir: bool
     is_file: bool
     is_link: bool
@@ -133,6 +139,12 @@ class StatReport(Report):
         return cls(
             path=path,
             size=st.st_size,
+            ctime=st.st_ctime_ns,
+            mtime=st.st_mtime_ns,
+            atime=st.st_atime_ns,
+            uid=st.st_uid,
+            gid=st.st_gid,
+            mode=st.st_mode,
             is_dir=stat.S_ISDIR(mode),
             is_file=stat.S_ISREG(mode),
             is_link=stat.S_ISLNK(mode),
@@ -181,6 +193,7 @@ class ChunkReport(Report):
     end_offset: int
     size: int
     is_encrypted: bool
+    metadata: Dict
     extraction_reports: List[Report]
```

Please let me know what you think about this approach.
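With the `metadata` field in place on `ChunkReport`, a downstream tool could read it straight out of the JSON report. Below is a minimal sketch of the uImage scenario mentioned above; the report file layout and the `load_address`/`architecture` key names are assumptions for illustration, not unblob's confirmed schema:

```python
# Hypothetical consumer of the proposed ChunkReport.metadata field.
# Assumes the JSON report is a list of task results, each carrying a
# "reports" list; key names are illustrative only.
import json

with open("report.json") as f:
    tasks = json.load(f)

for task in tasks:
    for report in task.get("reports", []):
        if report.get("handler_name") == "uimage" and report.get("metadata"):
            meta = report["metadata"]
            print("load address:", meta.get("load_address"))
            print("architecture:", meta.get("architecture"))
```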
An issue could be that in […] the permissions of the extracted files are already modified after extraction. So, by the time we run StatReport.from_path to check, the permission has already changed. Also, if the extraction is not running as root, the uid/gid will be inaccurate as well. It could also be problematic in case the ownership in the format is stored using names and those names are not present on the system.
The meta-data part looks OK, though I am not sure we want to store the whole header; rather, we should try to standardize the stored meta information. We can also store the raw header, though in some cases there are multiple headers, etc.
We had some discussions with @orosam around unblob better preserving/logging/reporting file metadata. Our idea is to create a FUSE layer for the extraction directory, where we could capture metadata like ownership information, character and block device details, and so on.
I like the approach, but can you be a bit more specific? Do you have examples or specific ideas in mind?
I don't understand why this would help. If the format can reproduce this metadata, it is contained in the format itself, which can be parsed and extracted without looking at the extracted files. What am I missing?
Probing question: do we want to eventually replace all extractors with our hand-rolled ones? If so, then this totally makes sense. If we are to outsource extraction to external implementations, I don't want us to familiarize ourselves with each format so intimately that we'd be able to parse out these details. Some extractors have listing commands, but their output needs to be parsed as well, and may not contain all the details we want to gather.
My idea is to have a very thin FUSE driver, executed either outside of unblob or inside it as a thread, that would forward all operations to the underlying filesystem and record metadata from interesting ones, like […]. The complexity of this approach is that we are not using the details stored in the archive/fs image, but the intent of the extractor tools; if they are incomplete or just doing their own thing, diverging from the data format, we miss those details. OTOH, it would be trivial to wire up any format which has a well-behaving extractor.
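To make the pass-through idea concrete: the snippet below is a minimal sketch using the third-party fusepy package (an assumption; pyfuse3 or a C driver would do equally well). It records metadata-changing operations while forwarding everything to a backing directory. Only a handful of forwarding methods are shown; a real driver would have to implement the full operation set.

```python
# Minimal pass-through FUSE sketch, assuming the fusepy package.
# Metadata-changing calls are recorded before being forwarded.
import os
import json
from fuse import FUSE, Operations

class MetadataRecorder(Operations):
    def __init__(self, root):
        self.root = root
        self.events = []  # captured metadata operations

    def _full(self, path):
        return os.path.join(self.root, path.lstrip("/"))

    # Interesting operations: record, then forward.
    def chown(self, path, uid, gid):
        self.events.append({"op": "chown", "path": path, "uid": uid, "gid": gid})
        os.chown(self._full(path), uid, gid)

    def chmod(self, path, mode):
        self.events.append({"op": "chmod", "path": path, "mode": mode})
        os.chmod(self._full(path), mode)

    def mknod(self, path, mode, dev):
        self.events.append({"op": "mknod", "path": path, "mode": mode, "dev": dev})
        os.mknod(self._full(path), mode, dev)

    def symlink(self, target, source):
        self.events.append({"op": "symlink", "path": target, "points_to": source})
        os.symlink(source, self._full(target))

    # Boring operations: just forward (only a few shown for brevity).
    def getattr(self, path, fh=None):
        st = os.lstat(self._full(path))
        return {key: getattr(st, key) for key in (
            "st_mode", "st_nlink", "st_uid", "st_gid",
            "st_size", "st_atime", "st_mtime", "st_ctime")}

    def readdir(self, path, fh):
        return [".", ".."] + os.listdir(self._full(path))

    def mkdir(self, path, mode):
        os.mkdir(self._full(path), mode)

    def create(self, path, mode, fi=None):
        return os.open(self._full(path), os.O_WRONLY | os.O_CREAT, mode)

    def write(self, path, data, offset, fh):
        os.lseek(fh, offset, os.SEEK_SET)
        return os.write(fh, data)

    def release(self, path, fh):
        os.close(fh)

if __name__ == "__main__":
    # Extractors write to the mountpoint; data lands in the backing
    # directory and fs.events keeps the metadata trail (paths are examples).
    fs = MetadataRecorder("/tmp/extract-backing")
    FUSE(fs, "/tmp/extract-mnt", foreground=True)
    print(json.dumps(fs.events, indent=2))
```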
So if I understand correctly, the FUSE layer would allow any kind of operation, like a […]. Correct?
Would a FUSE layer interpose itself between the extraction directory and external tools launched as subprocesses, like […]?
That would be the idea. Unfortunately, it is a pain to make it work inside Docker, because it requires access to a kernel facility on the host. Otherwise, it would work for normal users. An alternative approach we have discussed in the past is to LD_PRELOAD/DYLD_INSERT_LIBRARIES a shim, or use some other introspection method to trace IO calls. Unfortunately, that has its own can of worms, as it may not work for e.g. commands which are statically linked to libc or that make syscalls directly (e.g. Go on Linux).
Store the metadata file in `extract_root` in one JSON file. We don't want to pollute the extracted folder with lots of small files.
It's nice if this is easy to read, so a JSON is easy to look at.
For example:
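The original example is not preserved here; as a hedged illustration only, such a single file could be produced like this (the `metadata.json` name and all keys below are hypothetical, not an actual unblob schema):

```python
# Illustrative only: write one metadata.json into extract_root.
# File name and keys are hypothetical, not unblob's actual schema.
import json
from pathlib import Path

extract_root = Path("firmware.bin_extract")  # example path
metadata = {
    "root_file": "firmware.bin",
    "chunks": [
        {
            "handler": "sevenzip",
            "start_offset": 0,
            "end_offset": 4096,
            "is_encrypted": False,
        },
    ],
}

extract_root.mkdir(exist_ok=True)
(extract_root / "metadata.json").write_text(json.dumps(metadata, indent=2))
```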