Support PPE profiling for FileCache / MultiprocessFileCache #350

Merged 8 commits on Oct 17, 2024

KantaTamura

This PR resolves the FileCache / MultiprocessFileCache task in #258.

Note

preload() and preserve() are excluded from profiling: they are pre-processing and post-processing steps and are unrelated to the I/O being executed.

Sample program to check tracing

import json
import pytorch_pfn_extras as ppe

from pfio.cache import FileCache, MultiprocessFileCache


def filecache():
    tracer = ppe.profiler.get_tracer()
    tracer.clear()

    cache = FileCache(1, dir="./tmp", do_pickle=True, trace=True)
    cache.put(0, b"foo")
    assert b"foo" == cache.get(0)

    # Collect the names of the recorded events from the tracer state.
    state = ppe.profiler.get_tracer().state_dict()
    keys = [event["name"] for event in json.loads(state["_event_list"])]

    print(keys)

    w = ppe.writing.SimpleWriter(out_dir="")
    tracer.initialize_writer("trace_filecache.json", w)
    tracer.flush("trace_filecache.json", w)


def multiprocess_filecache():
    tracer = ppe.profiler.get_tracer()
    tracer.clear()

    cache = MultiprocessFileCache(1, dir="./tmp", do_pickle=True, trace=True)
    cache.put(0, b"foo")
    assert b"foo" == cache.get(0)

    # Collect the names of the recorded events from the tracer state.
    state = ppe.profiler.get_tracer().state_dict()
    keys = [event["name"] for event in json.loads(state["_event_list"])]

    print(keys)

    w = ppe.writing.SimpleWriter(out_dir="")
    tracer.initialize_writer("trace_multiprocess_filecache.json", w)
    tracer.flush("trace_multiprocess_filecache.json", w)


filecache()
multiprocess_filecache()

Output

['pfio.cache.file:put:lock', 'pfio.cache.file:put', 'pfio.cache.file:get:lock', 'pfio.cache.file:get']
['pfio.cache.multiprocessfile:put:lock-99', 'pfio.cache.multiprocessfile:put', 'pfio.cache.multiprocessfile:get:lock-99', 'pfio.cache.multiprocessfile:get']
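
The record names above appear to follow a `module:operation[:lock[-fd]]` pattern, so lock-wait records can be told apart from whole-operation records by name alone. A minimal sketch, assuming only the names printed above; `split_record_name` is a hypothetical helper for illustration and is not part of pfio or PPE:

```python
def split_record_name(name):
    """Split a record name such as 'pfio.cache.file:put:lock' into parts.

    Returns (module, operation, is_lock). The naming pattern is inferred
    from the sample output only; it is not a documented pfio contract.
    """
    parts = name.split(":")
    module, operation = parts[0], parts[1]
    # Lock records carry a third field: 'lock' or 'lock-<fd>'.
    is_lock = len(parts) > 2 and parts[2].startswith("lock")
    return module, operation, is_lock


names = [
    "pfio.cache.file:put:lock",
    "pfio.cache.file:put",
    "pfio.cache.multiprocessfile:get:lock-99",
    "pfio.cache.multiprocessfile:get",
]
for n in names:
    print(split_record_name(n))
```

Filtering on `is_lock` would give a quick way to compare lock-wait records against whole-operation records when reading the event list.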

The output JSON files rendered in chrome://tracing are shown below.

  • trace_filecache.json
    image

  • trace_multiprocess_filecache.json
    image

@kuenishi kuenishi requested a review from k5342 October 9, 2024 08:38
@kuenishi kuenishi added this to the 2.9.0 milestone Oct 9, 2024

@k5342 k5342 left a comment


The test code LGTM, but I have some more suggestions for the records.


     def _get(self, i):
         if i < 0 or self.length <= i:
             raise IndexError("index {} out of range ([0, {}])"
                              .format(i, self.length - 1))

         offset = self.buflen * i
-        with self.lock.rdlock():
+        with record("pfio.cache.file:get:lock", trace=self.trace), \
+                self.lock.rdlock():
             buf = os.pread(self.cachefp.fileno(), self.buflen, offset)

I think we should add more records here to track the actual I/O operations, because the code under get:lock issues one rdlock and two pread operations.
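
A minimal sketch of that suggestion, not the code that was merged: the lock acquisition and each of the two reads get their own record. Everything here is a stand-in chosen for illustration: `record` is a simple timer rather than the PPE one, `threading.Lock` replaces the reader lock, the record names `get:read_index` and `get:read_data` are hypothetical, and the 8-byte length prefix is an assumed layout, not pfio's on-disk format. It relies on `os.pread`, so it is Unix-only.

```python
import contextlib
import os
import tempfile
import threading
import time

events = []  # (name, seconds) pairs collected by the stand-in recorder


@contextlib.contextmanager
def record(name):
    """Stand-in for the PPE `record` context manager: times the body."""
    start = time.perf_counter()
    try:
        yield
    finally:
        events.append((name, time.perf_counter() - start))


def traced_get(fd, lock, index_len, offset):
    """Read an index entry, then the data it points at, one record per step."""
    with record("get:lock"):
        lock.acquire()  # only the wait for the lock is timed here
    try:
        with record("get:read_index"):
            index = os.pread(fd, index_len, offset)
        data_len = int.from_bytes(index, "little")
        with record("get:read_data"):
            return os.pread(fd, data_len, offset + index_len)
    finally:
        lock.release()


with tempfile.TemporaryFile() as f:
    payload = b"foo"
    # Assumed layout for this sketch: 8-byte little-endian length, then data.
    f.write(len(payload).to_bytes(8, "little") + payload)
    f.flush()
    result = traced_get(f.fileno(), threading.Lock(), 8, 0)

print(result, [name for name, _ in events])
```

With this split, a long `get:lock` span and a long `get:read_data` span show up as different entries in the trace instead of one combined span.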

@@ -270,7 +276,8 @@ def _put(self, i, data):
             return False

         offset = self.buflen * i
-        with self.lock.wrlock():
+        with record("pfio.cache.file:put:lock", trace=self.trace), \
+                self.lock.wrlock():
             buf = os.pread(self.cachefp.fileno(), self.buflen, offset)

Same as above.

if l < 0 or o < 0:
with record(f"pfio.cache.multiprocessfile:get:lock-{self.cache_fd}",
trace=self.trace):
index_entry = os.pread(self.cache_fd, self.buflen, offset)

Ah, we need to track this read call as well.

                 warnings.warn(ose.strerror, RuntimeWarning)
                 return False
             else:
                 raise ose

     def _put(self, i, data):

As with file_cache, we'd like to track I/O and lock operations separately. Separating them makes it possible to diagnose whether the I/O itself or just the lock wait is taking long.
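
One way this separation might look for a file-descriptor lock, as a hedged sketch rather than the merged implementation. It assumes an advisory `fcntl.flock` on the cache file descriptor (Unix-only), uses a stand-in `record` timer instead of the PPE one, and the record names `put:lock` and `put:write` are hypothetical.

```python
import contextlib
import fcntl
import os
import tempfile
import time

durations = {}  # record name -> accumulated seconds


@contextlib.contextmanager
def record(name):
    """Stand-in timer; the real code would use the PPE `record` instead."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        durations[name] = durations.get(name, 0.0) + elapsed


def traced_put(fd, data, offset):
    """Write `data` at `offset` under an exclusive flock, timing each step."""
    with record("put:lock"):
        # May block while another process holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
    try:
        with record("put:write"):
            return os.pwrite(fd, data, offset)
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)


with tempfile.TemporaryFile() as f:
    written = traced_put(f.fileno(), b"foo", 0)

# The larger total hints at whether lock wait or I/O dominates.
slowest = max(durations, key=durations.get)
print(written, sorted(durations), "slowest:", slowest)
```

Comparing the accumulated `put:lock` and `put:write` totals is the diagnosis described above: contention shows up in the first, slow storage in the second.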

@KantaTamura KantaTamura requested a review from k5342 October 17, 2024 07:11

@k5342 k5342 left a comment


LGTM

@k5342 k5342 merged commit 09e15ed into pfnet:master Oct 17, 2024
7 checks passed
@kuenishi kuenishi mentioned this pull request Oct 17, 2024
7 tasks
3 participants