Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Parquet output Issue #3682 #3685

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft

Conversation

shani-gold
Copy link

@shani-gold shani-gold commented Nov 13, 2023

1. Explain what the PR does

Adding parquet as a format output
##3682

2. Explain how to test it

sudo ./dist/tracee --output parquet:test.parquet --events proc,syscalls,sched_switch,sched_process_fork,network_events

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Shani seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@rafaeldtinoco
Copy link
Contributor

Why can't we adapt the marshalling on the existing event type (and internal types) ?

type Event struct {
    // Existing fields with added Parquet tags
    Timestamp             int          `json:"timestamp" parquet:"name=timestamp, type=INT64"`
    // ... other int fields with similar tags
    CgroupID              uint         `json:"cgroupId" parquet:"name=cgroupId, type=INT64"`
    // ... other fields
    ProcessName           string       `json:"processName" parquet:"name=processName, type=BYTE_ARRAY, convertedtype=UTF8"`
    // ... other string fields with similar tags
    Container             Container    `json:"container,omitempty" parquet:"name=container"`
    Kubernetes            Kubernetes   `json:"kubernetes,omitempty" parquet:"name=kubernetes"`
    // ... other struct fields with similar tags
    MatchedPoliciesKernel uint64       `json:"-" parquet:"name=matchedPoliciesKernel, type=INT64"`
    MatchedPoliciesUser   uint64       `json:"-" parquet:"name=matchedPoliciesUser, type=INT64"`
    MatchedPolicies       []string     `json:"matchedPolicies,omitempty" parquet:"name=matchedPolicies, type=MAP, convertedtype=LIST, valuetype=BYTE_ARRAY, valueconvertedtype=UTF8"`
    StackAddresses        []uint64     `json:"stackAddresses" parquet:"name=stackAddresses, type=MAP, convertedtype=LIST, valuetype=INT64"`
    // ... other fields
    Metadata              *Metadata    `json:"metadata,omitempty" parquet:"name=metadata"`
}
type Container struct {
    ID          string `json:"id,omitempty" parquet:"name=id, type=BYTE_ARRAY, convertedtype=UTF8"`
    // ... other fields with similar tags
}
type Kubernetes struct {
    PodName      string `json:"podName,omitempty" parquet:"name=podName, type=BYTE_ARRAY, convertedtype=UTF8"`
    // ... other fields with similar tags
}
type Metadata struct {
    Version     string `parquet:"name=version, type=BYTE_ARRAY, convertedtype=UTF8"`
    Description string `parquet:"name=description, type=BYTE_ARRAY, convertedtype=UTF8"`
    Tags        []string `parquet:"name=tags, type=MAP, convertedtype=LIST, valuetype=BYTE_ARRAY, valueconvertedtype=UTF8"`
    Properties  map[string]interface{} // Requires custom serialization, e.g., as a JSON string
}

Instead of having new ones ? Which type made you try to duplicate the event type ? Why ? Were you worried about a particular type serialization ?

Another "problem" is that we're changing the tracee event to a new structure (using protobuf). Check pkg/server/grpc/tracee.go->convertTraceeEventToProto. Basically you can expect the tracee event format to be what protobuf is defining at: ./api/v1beta1/event.pb.

I believe that, if we use the same existing types, and rely on custom marshaling/unmarshalling logic only, at least for now, then the eventual migration to the protobuf types will be smoother on our side (and merging this change would be easier).

Copy link
Contributor

@rafaeldtinoco rafaeldtinoco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please sign the contributor agreement when possible.


Event Conversion to ParquetEvent is required
*/
type ParquetEvent struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To convert the event type might be problematic (not that we plan to do it often, but, still, not to have a single view of the major type). Can't we just marshal the trace.Event with a custom marshalling logic ? Taking in consideration what you probably already knows (more than I do):

  • uint, uint16, uint32, uint64 would all become INT64 (uint64 overflows could be problematic)
  • structs (Container, Kubernetes, ContextFlags, File, Metadata can be represented as nested structures.
  • The interface{} (Properties, Metadata) would need serialization (embedded json ? like you did to Args ?)


/*
The following structs are not necessary: ParquetContainer, ParquetKubernetes, ParquetContextFlags, ParquetFile.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, check my comment for the PR. I think we should try to have only the "really needed types", serialize with a custom marshaller whatever is needed and can't be automatically type-converted, and go from there (at least until we have protobufs fully done, having the event structure way more stable than it is now (soon).

Copy link
Author

@shani-gold shani-gold Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't add the parquet tags to existing types (all the inner structs), since they are on a different package:
github.com/aquasecurity/tracee/types v0.0.0-20231013014739-b32a168ee6a8

Since the Event struct contains types which are not supported by parquet, I can't use it.
For example I can add the following tag to Event.CgroupID:
CgroupID uint json:"cgroupId" parquet:"name=cgroupId, type=INT64, convertedtype=UINT_64"
But I get error reflect: call of reflect.Value.Int on uint32 Value

And I couldn't find a way to implement a custom marshal function.
The marshal function doesn't support it:
https://github.com/xitongsys/parquet-go/blob/8ca067b2bd324788a77bf61d4e1ef9a5f8b4b1d2/marshal/marshal.go#L293C16-L293C16

It calls directly to the ParquetStruct Marshal function

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok so here is the thing, I played around a bit with parquet, because it was new to me and I wanted to understand better how it worked.

It ended up being good and bad. Good that I played and bad related to how this logical types of parquet cast unsigned values (and also due to the issue #3690 just discovered).

I was going to suggest you using json schema to parquet (https://gist.github.com/rafaeldtinoco/fbe58a020fd0d73ea37ce6980520f6ea) so we didnt have to maintain type duplicates. You would complain back to me saying tracee event type -> json -> parquet could add overhead, I would request you to check if that was too bad.

During this process, I discovered #3690. It basically means that many other fields from the tracee event struct should be unsigned integers (I would say from 4 to 8 fields should have their types changed). Not sure how we missed that, but during the decode phase (from eBPF to our internal types) we converted them to Ints and never checked for overflows.

Good thing that parquet does not support unsigned values (actually it's a terrible thing :P) so I detected the overflow when generating the parquet file, I believe.

Ok, with all that said, and thinking about the near future, we're moving away from these internal types to the new protobuf types you will find in github.com/aquasec/tracee/api (specially in event.pb.

The protobuf types are declared correctly (uint32s, uints64) but it might impose a very hard time for you IF you rely in a parallel Parquet type to each type automatically generated from the protobuf (and actually I don't think we want parallel types just for marshalling logic).

That explains why I was going after the JSON Schema (to give you an alternative). Why ?

  1. I believe it might be easier for you to convert types from json Integers (than to worry about golang types coming from internal structures) to parquet types.

  2. Its likely that tracee will eventually move into a grpc model where the engine will be a grpc server providing streams. Those streams will come serialized with protobuf types but we could also have an already marshalled version. You can see a very simple example at https://github.com/josedonizetti/tracee-grpc-client/blob/main/examples/stream_events/main.go for now.

  3. Of course you can work through parquet marshalling and our protobuf types, I just dont want to commit to accept your printer now (when we fix the wrong field types), having duplicate types, and then discover your printer wont work with our near future protobuf work (and either remove your printer or request you to change and be blocked until you do).

Does it make sense for you ?

Suggestion of course of action:

  • We fix the event struct types
  • you rely in json schema for the printer (similar to my gist)
  • you investigate the protobuf types we have for further actions
  • by relying in the json schema you will probably remove headaches from protobuf (and we could have a stream of json marshaled types when grpc streams are finished).

With all that, Im trying to give you a "roadmap" if you stick to tracee as your "engine" (for the project of yours).

I hope it helps. No hard commitments on our side, but for sure a best effort trying to help.

@rafaeldtinoco
Copy link
Contributor

@shani-gold Left you some comments about possible outcomes. Feel free to suggest something else if you have in mind and, please, sign the contributor agreement if you are willing to have this merged. Thank you.

@rafaeldtinoco
Copy link
Contributor

@shani-gold ping ?

@rafaeldtinoco rafaeldtinoco marked this pull request as draft December 14, 2023 03:53
@geyslan
Copy link
Member

geyslan commented Feb 21, 2024

We have this update #3875

Please rebase your PR against main to make use of the new workflow setup.

@yanivagman yanivagman linked an issue May 9, 2024 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Parquet output
4 participants