Missing documentation for attach_perf_event and open_perf_event #1756

Open
cippaciong opened this issue May 13, 2018 · 36 comments

Comments

@cippaciong
Contributor

cippaciong commented May 13, 2018

Hello, over the past few days I've been playing around with perf events for a project I'm working on. I managed to get things working, mainly by looking at code examples like tools/profile.py, but it occurred to me that there is very little documentation about attach_perf_event and open_perf_event.
The attach_perf_event function seems to be completely undocumented: I can't find any mention of it in the reference guide. open_perf_event is mentioned just once, in a more general example about BPF_PERF_ARRAY.

It would be nice to add a few lines to explain how to use those helpers, maybe providing some examples showing different ways to interact with them, e.g.:

b["cpu_cycles"].open_perf_event(b["cpu_cycles"].HW_CPU_CYCLES)
b["cpu_cycles"].open_perf_event(PerfType.HARDWARE, PerfHWConfig.CPU_CYCLES)
b["cpu_cycles"].open_perf_event(4, int("73003c",16))
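
The raw value in the last call packs several fields into one number. As a rough sketch, assuming the Intel core PMU layout that perf publishes under /sys/bus/event_source/devices/cpu/format (event in bits 0-7, umask in bits 8-15, any in bit 21, cmask in bits 24-31), it can be pulled apart like this. The helper name decode_raw_config is made up for illustration and is not part of bcc; other PMUs use different layouts.

```python
def decode_raw_config(config):
    """Split a raw perf config into event-select fields.

    Assumes the Intel core PMU bit layout; other PMUs differ.
    """
    return {
        "event": config & 0xFF,          # bits 0-7: event select
        "umask": (config >> 8) & 0xFF,   # bits 8-15: unit mask
        "edge": (config >> 18) & 0x1,    # bit 18: edge detect
        "any": (config >> 21) & 0x1,     # bit 21: any thread
        "inv": (config >> 23) & 0x1,     # bit 23: invert counter mask
        "cmask": (config >> 24) & 0xFF,  # bits 24-31: counter mask
    }


fields = decode_raw_config(int("73003c", 16))
print({name: hex(value) for name, value in fields.items()})
```

The remaining bits of the value are not decoded in this sketch.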
@yonghong-song
Collaborator

Since you are looking at this, could you help with a pull request to add this documentation? The file is reference_guide.md; the two functions can go under bcc Python and Events.

@cippaciong
Contributor Author

I'm leaving in a couple of days and I will be AFK for 2/3 weeks. I don't think I will be able to work on it before I leave, but I can do it once I come back if still needed.

@palmtenor
Member

Thanks @cippaciong for pointing out the missing documentation. I think it is an important part of usability for BCC. I added a lot of the perf event related logic and was just a bit lazy about writing the reference guide. I will try to work on adding it as well.

@changbindu

Hello, attach_perf_event() is still missing from reference_guide.md. Will someone add it? Thanks!

@yonghong-song
Collaborator

Not sure whether @palmtenor got cycles to do this or not. But since you are looking at this, maybe you can submit a pull request to add the missing info. to the reference_guide?

@changbindu

sure. I can do it, but I can't promise it.

@palmtenor
Member

Sorry for the delay :( I will try to spend some time working on the documentation if possible. Meanwhile, if you can write it up, I would be happy to proofread it :)

@gustavahlman

Hi, I'm interested in using perf events but also found that there was no documentation. An update would be greatly appreciated. :)

@hsane2001

hsane2001 commented Feb 26, 2020

Still looking for documentation on the usage of bpf_attach_perf_event and bpf_attach_perf_event_raw. I want to be able to use raw perf events besides the 9 in perf_event.h. It would be great if we could filter PEBS data records in a bcc function. Is there any example of using precise events within bcc?

@yonghong-song
Collaborator

bpf_attach_perf_event_raw essentially lets you pass a raw perf_event_attr instead of the limited choices offered by bpf_attach_perf_event. You can check out the examples that use bpf_attach_perf_event; bpf_attach_perf_event_raw can be used in all of those places.

@hsane2001

Thanks @yonghong-song. Is there an example of using the raw event? I wish to use the "mem_load_retired.l3_miss:pp" event from perf. I am interested in the virtual data addresses that it dumps as a precise event (using the -d flag). How do I capture this through the "perf_event_attr" you mentioned?

@yonghong-song
Collaborator

@palmtenor Could you help? Maybe some documentation to translate metrics to attr numbers? I guess you could look at kernel perf code tools/perf/pmu-events/arch/x86/..., might find some information.

@hsane2001

Here is an example I am trying to get working with the "attach_perf_event_raw" API.
The following works for the raw CYCLES event with "attach_perf_event":

    b.attach_perf_event(
        ev_type=4, ev_config=int("73003c", 16),
        fn_name="cycles", sample_freq=args.sample_freq)

So I tried to use attach_perf_event_raw in the same manner by defining the attribute:

    ev_type = 4
    ev_config = int("73003c", 16)
    attr = Perf.perf_event_attr()
    attr.type = ev_type
    attr.config = ev_config
    attr.sample_freq = args.sample_freq

    try:
        b.attach_perf_event_raw(
            perf_event_attr=attr,
            fn_name="on_cycles")

I get the following error:
Failed to attach to a hardware event. Is this a virtual machine?

Questions:
-- What am I missing in packaging this correctly?
-- I plan to use the extra_flags next to get more information.. is this supported?

Thanks in advance.

@yonghong-song
Collaborator

Current bcc does not have attach_perf_event_raw. Did you have your own implementation? To debug this, you can add some debug output right before perf_event_open syscall in libbpf.c. This way you can spot the difference between attach_perf_event and attach_perf_event_raw and fix the problem.

@hsane2001

hsane2001 commented Mar 8, 2020

I tried instrumenting, but execution does not reach the perf_event_open syscall since no such interface is present, as you correctly pointed out. It would be great to have a Python interface to "attach_perf_event_raw", since the C interface "bpf_attach_perf_event()" in libbpf.c calls "bpf_attach_perf_event_raw()" anyway after populating the perf_event_attr structure. Without a direct interface to the raw function, we are limited to the attributes exposed by "attach_perf_event".
I would love to add it, but I am not a core developer (it may take long). Would one start by defining it in __init__.py and declaring it in libbcc.py, just like "attach_perf_event"?
Also, what is "StatusTuple BPF::attach_perf_event_raw" in bcc/src/cc/api/BPF.cc?

@yonghong-song
Collaborator

> Also what is "StatusTuple BPF::attach_perf_event_raw" in bcc/src/cc/api/BPF.cc?

This is the C++ API.

Maybe you can take a look at

commit 0d72237946afebc5d300676af641319fb3d020be
Author: Yonghong Song
Date:   Fri Apr 27 04:56:08 2018 -0700

    introduce {attach|detach}_raw_tracepoint API
...

This should give you an example of how to add a bpf_attach_perf_event_raw Python API.

@hsane2001

hsane2001 commented Mar 11, 2020

Thanks @yonghong-song. I was able to create a Python interface for "attach_perf_event_raw" which calls the C API "bpf_attach_perf_event_raw", and tested it with raw values for CPU_CYCLES.
I can also program PEBS events through raw values, but the question is: how do I read the data from the ring buffer in the eBPF code and send it back to user space? My hope is that with the right extra_flags enabled in the perf_event_attr, I can get more data than simple counts, e.g. sample_type = sample_address for PEBS events. I just don't know how to capture it in eBPF code.

@yonghong-song
Collaborator

Currently, the uapi

struct bpf_perf_event_data {
        bpf_user_pt_regs_t regs;
        __u64 sample_period;
        __u64 addr;
};

is not enough for you to read other perf event values.

There is a trick for this though.
For bpf program, the ctx is actually a pointer to

struct bpf_perf_event_data_kern {
        bpf_user_pt_regs_t *regs;
        struct perf_sample_data *data;
        struct perf_event *event;
};

You could use bpf_probe_read to access perf_sample_data and read data.

@hsane2001

Thanks @yonghong-song. Do you have any examples of this trick? I am getting errors compiling BPF code that accesses any data structure within bpf_perf_event_data_kern, and I found nothing on the web using this.

@yonghong-song
Collaborator

Something like below (uncompiled, untested)

#include <linux/perf_event.h>
int bpf_prog(struct bpf_perf_event_data *ctx)
{
   struct bpf_perf_event_data_kern *kern_ctx;
   struct perf_sample_data *data;
   struct perf_raw_record  *raw;

   kern_ctx = (struct bpf_perf_event_data_kern *)ctx;
   bpf_probe_read(&data, sizeof(data), &kern_ctx->data);
   bpf_probe_read(&raw, sizeof(raw), &data->raw);
   ...
}

@hsane2001

hsane2001 commented Mar 25, 2020

Thanks Yonghong-song. I am reading the "perf_sample_data" structure since it has the addr fields I need (basically enabling perf in eBPF beyond just counting). My code seems to extract the fields, but many come back as static values that make no sense. Maybe my method of reading the ring buffer is wrong? Here is my code snippet:

"""
...
BPF_PERF_OUTPUT(events);
int getdata(struct bpf_perf_event_data *ctx) {
    struct bpf_perf_event_data_kern *kctx;
    struct perf_sample_data *data;

    kctx = (struct bpf_perf_event_data_kern *)ctx;
    bpf_probe_read(&data, sizeof(data), &kctx->data);
    events.perf_submit(ctx, &data, sizeof(data));
    return 0;
}
"""

--> Fill raw perf_event_attr and attach perf
attr = Perf.perf_event_attr()
attr.type = 4  # PerfType.RAW
attr.config = int("4320d1", 16)  # mem_load_retired.l3_miss:pp -> precise event with data address
attr.sample_period = 100
attr.sample_type = int("7df", 16)  # Enable PERF_SAMPLE_ADDR
b.attach_perf_event_raw(perf_event_attr=attr, fn_name="getdata")

--> Based on PERF_SAMPLE_DATA structure in perf_event.h
class PerfId(ct.Structure):
    _fields_ = [
        ("pid", ct.c_uint),
        ("tid", ct.c_uint)
    ]

class PerfEvent(ct.Structure):
    _fields_ = [
        ("time", ct.c_ulonglong),
        ("ip", ct.c_ulonglong),
        ("addr", ct.c_ulonglong),
        ("id", PerfId),
    ]

def print_event(cpu, data, size):
    event = ct.cast(data, ct.POINTER(PerfEvent)).contents
    p_time = event.time
    p_ip = event.ip
    p_addr = event.addr
    p_id = event.id
    p_pid = p_id.pid
    p_tid = p_id.tid
    print(cpu, p_time, hex(p_ip), hex(p_addr), p_pid, p_tid)

b["events"].open_perf_buffer(print_event)
while 1:
    try:
        b.perf_buffer_poll(100)
    except KeyboardInterrupt:
        exit()

The output comes out as:
CPU Time IP ADDR PID TID
30 18446741874692602816 0x900000000 0xc00180001 6306752 4294966784
30 18446741874692602816 0x900000000 0xc00180001 6306752 4294966784
30 18446741874692602816 0x900000000 0xc00180001 6306752 4294966784
???
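
For reference, the sample_type value 0x7df used above can be decoded bit by bit. This is only a sketch: the bit names follow enum perf_event_sample_format in the kernel's uapi perf_event.h as far as I know (only the first eleven bits are listed), and decode_sample_type is a hypothetical helper, not a bcc function.

```python
# Low bits of enum perf_event_sample_format (uapi linux/perf_event.h),
# listed in bit order starting at bit 0. Higher bits are omitted here.
PERF_SAMPLE_NAMES = [
    "IP", "TID", "TIME", "ADDR", "READ", "CALLCHAIN",
    "ID", "CPU", "PERIOD", "STREAM_ID", "RAW",
]


def decode_sample_type(sample_type):
    """Return the names of the PERF_SAMPLE_* bits set in sample_type."""
    return [
        "PERF_SAMPLE_" + name
        for bit, name in enumerate(PERF_SAMPLE_NAMES)
        if (sample_type >> bit) & 1
    ]


enabled = decode_sample_type(int("7df", 16))
print(enabled)
```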

@yonghong-song
Collaborator

The below

    struct perf_sample_data *data;

    kctx = (struct bpf_perf_event_data_kern *)ctx;
    bpf_probe_read(&data, sizeof(data), &kctx->data);
    events.perf_submit(ctx, &data, sizeof(data));

is not correct.
The sizeof(data) is 8 and you are copying a pointer value.
You can go into perf_sample_data to copy the value you want.
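
The same pitfall can be reproduced in user space with ctypes. In this sketch, Sample is a made-up three-field structure standing in for perf_sample_data (it does not match the kernel layout). It shows that copying sizeof(pointer) bytes yields only the address, while copying from the pointed-to object yields the fields.

```python
import ctypes as ct


class Sample(ct.Structure):
    # Hypothetical stand-in; NOT the real perf_sample_data layout.
    _fields_ = [
        ("addr", ct.c_ulonglong),
        ("period", ct.c_ulonglong),
        ("ip", ct.c_ulonglong),
    ]


sample = Sample(addr=0x7F00DEADBEEF, period=100, ip=0x401000)
data = ct.pointer(sample)  # plays the role of `struct perf_sample_data *data`

# Wrong: copies the pointer itself, i.e. sizeof(data) bytes holding an address.
pointer_copy = ct.string_at(ct.addressof(data), ct.sizeof(data))

# Right: copies the object that the pointer refers to.
struct_copy = ct.string_at(ct.addressof(data.contents), ct.sizeof(Sample))

decoded = Sample.from_buffer_copy(struct_copy)
print(len(pointer_copy), len(struct_copy), hex(decoded.addr))
```

In the BPF program the equivalent of the second copy is a bpf_probe_read from the field of interest inside the structure, rather than from the pointer variable.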

@hsane2001

hsane2001 commented May 13, 2020

Thanks, this has been working. It essentially enables perf-record-like behavior through the attach_perf_event_raw interface. It would need additions, such as mapping raw events to the perf list, before it is usable in general.

@drandynisbet

@hsane2001 Hi, do you have an example of your code that you would be willing to share? I am about to begin debugging some code from one of our MSc students. We are attempting to access the unc_cbo_cache_lookup.read_i and unc_arb_trk_requests.writes events using open_perf_event from Python; the call fails with an invalid argument, and attempts to read the counter value return an error number. The counters work with perf stat.

@hsane2001

hsane2001 commented Jun 15, 2020

@drandynisbet - I am still hitting issues related to sampling, where I get much more data from the equivalent perf record command than from "perf through eBPF". Can you share your code and where it seems to be failing? Also, it does not seem like you are using bcc, but rather programming perf_event_open directly?

@hsane2001

hsane2001 commented Jun 25, 2020

@yonghong-song - Although I am able to read the perf buffer through bcc now, the number of samples gets limited to ~10K. I believe this is because bcc gives no way to set MMAP_PAGE_CNT for the mapped buffer, which may have a small default, after perf_event_open is called. After attaching the perf event, how can I set the size of the mmap buffer? I also ran into limits on BPF_HASH, which I raised by providing a larger 'size' value than the default.

@yonghong-song
Collaborator

See tools/memleak.py for an example of a bigger-than-default hash table size.
To change the number of pages for the perf event, you can specify the page count (page_cnt) through the open_perf_buffer API; see tools/tcpstates.py for an example.
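
As far as I can tell, bcc's Python open_perf_buffer takes the size as a page count and expects a power of two. A small sketch for turning a desired buffer size into such a count; pages_for is a hypothetical helper, and the 4096-byte page size is an assumption about the machine.

```python
def pages_for(buffer_bytes, page_size=4096):
    """Return the smallest power-of-two page count covering buffer_bytes."""
    pages = max(1, -(-buffer_bytes // page_size))  # ceiling division
    return 1 << (pages - 1).bit_length()


page_cnt = pages_for(1_000_000)
print(page_cnt)
# With bcc this would be passed along the lines of:
#   b["events"].open_perf_buffer(print_event, page_cnt=page_cnt)
```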

@drandynisbet

drandynisbet commented Jun 29, 2020

@hsane2001
My code uses open_perf_event from bcc/python as follows (cut and pasted to indicate usage):

text="""
BPF_PERF_ARRAY(cnt1, NUM_CPUS); // perf array ...

int get_memory_read(void *ctx) {
    u32 cpu = bpf_get_smp_processor_id();
    u64 val = cnt1.perf_read(CUR_CPU_IDENTIFIER);
    bpf_trace_printk("MEM READ %llu \\n", val);
    return 0;
}
"""
b = bcc.BPF(text,...)

b.attach_uretprobe(name="/home/drandynisbet/MSCS/Jvm_performance/hello", sym="simple_program", fn_name="get_memory_read",pid=pid) # attach probe to function simple_program in hello executable
cnt1 = b["cnt1"]
cnt1.open_perf_event(4,0x20003c) # cpu_clk_unhalted.thread_any WORKS
#cnt1.open_perf_event(0xd,0x2081) # FAILS ... unc_arb_trk_requests.writes

Comment/uncomment to try different counters ...
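
A note on the first argument: values outside the fixed PERF_TYPE_* constants, such as the 0xd above, are PMU type numbers that the kernel assigns dynamically and exports through sysfs. A minimal sketch of looking one up; pmu_type is a hypothetical helper, and the PMU name uncore_arb is only an example that may not exist on a given machine.

```python
import os


def pmu_type(pmu_name, sysfs_root="/sys/bus/event_source/devices"):
    """Read the perf type number that the kernel assigned to a PMU."""
    with open(os.path.join(sysfs_root, pmu_name, "type")) as type_file:
        return int(type_file.read().strip())


if __name__ == "__main__":
    for name in ("cpu", "uncore_arb"):  # example names; adjust for your machine
        try:
            print(name, pmu_type(name))
        except OSError:
            print(name, "not present on this machine")
```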

@drandynisbet

sudo strace perf stat -vv -e unc_cbo_cache_lookup.read_i /bin/ls
indicates that sys_perf_event_open fails with EINVAL, then tries switching off PERF_FLAG_FD_CLOEXEC and fails with EINVAL again, then tries switching off the exclude_guest and exclude_host flags and succeeds ...

@hsane2001

hsane2001 commented Jun 29, 2020

@drandynisbet In the case of "cnt1.open_perf_event(0xd,0x2081) # FAILS": why is the first argument given as 0xd? Since it's a raw perf event, it should still be 0x4 (PERF_TYPE_RAW). I am able to program other counters in raw mode this way.
I am instead using "bpf_attach_perf_event_raw" from libbpf.c to program the perf_event_attr directly before calling sys_perf_event_open.
Additionally, the counter you are trying to program is an uncore event, which I don't think can be collected at the "per cpu" or "per process" level; only core counters can.

@drjantz

drjantz commented Mar 5, 2021

We have also been trying to follow this example for our own tool. We would like to use precise events to record LLC misses to virtual address regions of a configurable size. We are able to enable perf collection of the appropriate events using attach_perf_event_raw with the C++ API, but we are still having trouble reading the events from the BPF side as described in this thread.

Here is our BPF code:

BPF_PERCPU_HASH(region_map, u64, u64, MAX_REGIONS);

int on_llc_miss(struct bpf_perf_event_data *ctx) {
    u64 key;
    u64 *val;  /* the lookup result is dereferenced below, so it must be a pointer */

    if (!ctx) {
        return 0;
    }

    key = (ctx->addr >> REGION_SHIFT);
    val = region_map_lookup_or_try_init(&key);
    if (val) {
        (*val) += 1;
    }
    return 0;
}

On the user side, we initialize perf as follows:

    pfm_initialize();
    if (get_perf_event_attr(DDR_MISS_EVENT_STR, &pe)) {
        std::cerr << "bad event: " << DDR_MISS_EVENT_STR << std::endl;
        return 1;
    }

    res = bpf.attach_perf_event_raw(&pe, "on_llc_miss");
    if (res.code() != 0) {
      std::cerr << res.msg() << std::endl;
      return 1;
    }

Our code for initializing the perf_event_attr struct is below:

    memset(pe, 0, sizeof(struct perf_event_attr));
    memset(&pfm, 0, sizeof(pfm_perf_encode_arg_t));

    /* perf_event_open */
    pe->size = sizeof(struct perf_event_attr);
    pe->exclude_kernel = 1;
    pe->exclude_hv     = 1;
    pe->precise_ip     = 2;
    pe->task           = 1;
    pe->use_clockid    = 1;
    pe->clockid        = CLOCK_MONOTONIC_RAW;
    pe->sample_period  = config.profile_overflow_thresh;
    pe->disabled       = 1;

    pe->sample_type  = PERF_SAMPLE_TID;
    pe->sample_type |= PERF_SAMPLE_TIME;
    pe->sample_type |= PERF_SAMPLE_ADDR;
    pe->sample_type |= PERF_SAMPLE_PHYS_ADDR;

    pfm.size = sizeof(pfm_perf_encode_arg_t);
    pfm.attr = pe;

    err = pfm_get_os_event_encoding(event_str, PFM_PLM2 | PFM_PLM3,
          PFM_OS_PERF_EVENT, &pfm);

The trouble we are having is that ctx->addr actually appears to be the correct virtual address. However, if we use the technique described in this post, we are not able to collect any of the other information associated with the sample. Actually, if we cast the ctx pointer to a (bpf_perf_event_data_kern*), the data we read with bpf_probe_read seems to be garbage. Could you provide any additional guidance? @hsane2001 would you mind posting the BPF code you used for your tool? Thanks in advance.

@drjantz

drjantz commented Mar 12, 2021

Just FYI for anyone interested -- I am now able to read the bpf_perf_event_data_kern structure using the following code:

int on_llc_miss(struct bpf_perf_event_data *ctx) {
    u64 key;
    u64 addr;
    u64 *val;
    struct bpf_perf_event_data_kern *kctx;
    struct perf_sample_data *data;

    if (!ctx) {
        return 0;
    }

    /* The ctx handed to the program is really the kernel-side structure. */
    kctx = (struct bpf_perf_event_data_kern *)ctx;
    bpf_probe_read(&data, sizeof(struct perf_sample_data*), &(kctx->data));
    if (data) {
        /* Read the sampled address out of perf_sample_data itself. */
        bpf_probe_read(&addr, sizeof(u64), &(data->addr));
        key = (addr >> REGION_SHIFT);
        val = region_map_lookup_or_try_init(map, &key);
        if (val) {
            (*val) += 1;
        }
    }
    return 0;
}

The virtual address this code reads matches exactly with the virtual address in ctx->addr. I thought I would be able to read other parts of the perf_sample_data with this code. However, it looks like most of the perf_sample_data structure is not ready by the time this code is reached. (specifically, in Linux kernel v. 5.7.2, it looks like perf_sample_data_init() has been called, but perf_prepare_sample() is not reached before the BPF callback is reached).

@hsane2001

Hi @drjantz,
Your method above, using the "bpf_perf_event_data_kern" structure, is what I used to extract the data addresses. You should be able to read other parts of perf_sample_data in the same way using "bpf_probe_read", but I am on kernel 5.3. What other data exactly are you trying to extract, and what do you plan to do with it?

@drjantz

drjantz commented Mar 14, 2021

For now, we just want to use BPF to profile the LLC misses to different virtual regions. I think we can get the PID from bpf_get_current_pid_tgid() (and we've tested that call -- the PIDs do look reasonable), so the virtual address is really all we need right now. I just wanted to be able to read the other fields in perf_sample_data in case we found a use for any of that info later. However, it really looks like reading those other fields just gives us garbage right now. For instance, the code below just prints garbage for the PID:

    u32 pid;
    struct bpf_perf_event_data_kern *kctx;
    struct perf_sample_data *data;

    kctx = (struct bpf_perf_event_data_kern *)ctx;
    bpf_probe_read(&data, sizeof(struct perf_sample_data*), &(kctx->data));
    if (data) {
        bpf_probe_read(&pid, sizeof(u32), &(data->tid_entry.pid));
        printk("pid: %u\n", pid);
    }
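
For the PID route mentioned above: bpf_get_current_pid_tgid() returns a single 64-bit value with the tgid (what user space calls the PID) in the upper 32 bits and the kernel pid (the thread ID) in the lower 32 bits. If that value is passed to user space unchanged, it can be split as sketched below; split_pid_tgid is a hypothetical helper name.

```python
def split_pid_tgid(pid_tgid):
    """Split the u64 from bpf_get_current_pid_tgid() into (tgid, tid)."""
    tgid = pid_tgid >> 32         # process ID as seen from user space
    tid = pid_tgid & 0xFFFFFFFF   # thread ID (the kernel's pid)
    return tgid, tid


example = (1234 << 32) | 5678  # synthetic value for illustration
print(split_pid_tgid(example))
```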

@mstange

mstange commented Aug 9, 2021

Looks like this thread got rather side-tracked, and there still is no documentation for attach_perf_event.

In the meantime, I've found this comment for the C function bpf_attach_perf_event:

bcc/src/cc/libbpf.h

Lines 115 to 119 in 101304b

// attach a prog expressed by progfd to run on a specific perf event, with
// certain sample period or sample frequency
int bpf_attach_perf_event(int progfd, uint32_t ev_type, uint32_t ev_config,
                          uint64_t sample_period, uint64_t sample_freq,
                          pid_t pid, int cpu, int group_fd);

Specifically, I was interested in the meaning of the sample_period and sample_freq arguments and what happens when they conflict. The implementation answers that question: Only one of them is allowed to be non-zero.

bcc/src/cc/libbpf.c

Lines 1459 to 1464 in 949a4e5

if (!((sample_period > 0) ^ (sample_freq > 0))) {
  fprintf(
    stderr, "Exactly one of sample_period / sample_freq should be set\n"
  );
  return -1;
}
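
The same rule is easy to mirror on the Python side before calling attach_perf_event, so that the mistake surfaces as an exception rather than as a message on stderr. A sketch, with check_sampling_args as a hypothetical helper that is not part of bcc:

```python
def check_sampling_args(sample_period=0, sample_freq=0):
    """Raise unless exactly one of sample_period / sample_freq is positive."""
    if not ((sample_period > 0) ^ (sample_freq > 0)):
        raise ValueError(
            "Exactly one of sample_period / sample_freq should be set")


check_sampling_args(sample_freq=49)        # accepted
check_sampling_args(sample_period=100000)  # accepted
try:
    check_sampling_args(sample_period=100, sample_freq=49)
except ValueError as err:
    print("rejected:", err)
```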

@mstange

mstange commented Aug 9, 2021

And then the exact meaning of the various arguments is documented in the man page of the perf_event_open syscall. https://man7.org/linux/man-pages/man2/perf_event_open.2.html
