WIP glob feature #936
Open
yariseidenbenz wants to merge 11 commits into darshan-hpc:main from yariseidenbenz:WIP-glob_feature.
Changes shown below are from 10 of the 11 commits.

Commits:
- dce2c0a Generate HTML report summarizing file usage from .darshan log: DataFr…
- 8277396 updated glob_feature.py which creates dataframe of glob_filename and …
- 397780a WIP: this script creates a condensed dataframe of glob_filename and …
- edc12c3 Refactored glob_feature.py script and improved data frame creation fo…
- 787c8ed Merge branch 'darshan-hpc:main' into WIP-glob_feature
- 9b757d5 Rearranged glob_feature.py and added test_glob_feature.py to the test…
- 33f7292 Merge branch 'WIP-glob_feature' of github.com:yariseidenbenz/darshan …
- a5df394 Remove glob_feature.py from wrong location
- 452568b Fixed styling of glob_feature.py and added [.*] grouping feature. Add…
- 26c2572 Instead of using difflib to group files together, glob_feature.py now…
- cd9d522 The glob_feature.py now groups files based on agglomerative hierarchi…

All commits are by yariseidenbenz.
The first new file adds a module docstring (5 additions; the file's path is not shown in this capture):

```python
"""
Creates a DataFrame with two columns ("glob_filename" and "glob_count")
based on the files read by a .darshan file.
"""
```
darshan-util/pydarshan/darshan/glob_feature/glob_feature.py (136 additions, 0 deletions):
```python
# Creates a DataFrame with two columns ("glob_filename" and "glob_count")
# based on the files read by a .darshan file.
# The script uses agglomerative hierarchical clustering to group similar
# file paths together, then builds a dataframe in which one pattern
# represents each group, with [.*] marking where paths within a group differ.
# The result is an HTML report summarizing the grouped paths and their counts.
# Command to run: python glob_feature.py -p path/to/log/file.darshan -o path/to/output_file

import argparse
import os

import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

import darshan


def main(log_path, output_path):
    report = darshan.DarshanReport(log_path)
    df = pd.DataFrame.from_dict(report.name_records, orient="index", columns=["filename_glob"])
    # Keep only path-like records (drops pseudo-records such as "<STDIN>").
    df = df[df["filename_glob"].str.contains(r"/.*")]

    num_files = len(df)
    if num_files == 1:
        print("Only one file detected.")
        # A single file forms its own group; no clustering needed.
        grouped_paths = {0: [df["filename_glob"].iloc[0]]}
        new_paths = [(path, 1) for _, paths in grouped_paths.items() for path in paths]
    else:
        # Convert the path strings to TF-IDF feature vectors.
        vectorizer = TfidfVectorizer()
        X = vectorizer.fit_transform(df["filename_glob"])

        # Determine the candidate cluster counts dynamically; keep at least
        # one candidate (k=2) so the silhouette search below is never empty.
        # (silhouette_score itself still requires at least 3 input files.)
        max_clusters = max(2, int(np.sqrt(num_files)))

        silhouette_scores = []
        for k in range(2, max_clusters + 1):
            clustering = AgglomerativeClustering(n_clusters=k)
            clusters = clustering.fit_predict(X.toarray())
            # The silhouette score measures how well-separated the clusters are.
            silhouette_scores.append(silhouette_score(X, clusters))

        # Pick the k with the best silhouette score (the range starts at 2).
        optimal_k = np.argmax(silhouette_scores) + 2
        print("Optimal number of clusters:", optimal_k)

        # Re-cluster with the optimal number of clusters.
        clustering = AgglomerativeClustering(n_clusters=optimal_k)
        clusters = clustering.fit_predict(X.toarray())

        grouped_paths = {}
        for i, cluster_label in enumerate(clusters):
            grouped_paths.setdefault(cluster_label, []).append(df["filename_glob"].iloc[i])

        new_paths = []
        for _, group in grouped_paths.items():
            if len(group) > 1:
                # Build a merged pattern: keep characters shared by every path
                # and collapse each run of differing characters into "[.*]".
                merged_path = ""
                max_length = max(len(path) for path in group)
                # Start True so a group that differs at the first character
                # still receives a wildcard.
                emit_wildcard = True
                for i in range(max_length):
                    chars = set(path[i] if len(path) > i else "" for path in group)
                    if len(chars) == 1:
                        merged_path += chars.pop()
                        emit_wildcard = True
                    elif emit_wildcard:
                        merged_path += "[.*]"
                        emit_wildcard = False

                # If all paths share one file extension, append it so the
                # merged pattern ends in something recognizable.
                extensions = [os.path.splitext(path)[1] for path in group]
                common_extension = extensions[0] if len(set(extensions)) == 1 else None
                if common_extension and not merged_path.endswith(common_extension):
                    merged_path += common_extension

                new_paths.append((merged_path, len(group)))
            else:
                new_paths.append((group[0], 1))

    # Save the results to an output file.
    df = pd.DataFrame(new_paths, columns=["filename_glob", "glob_count"])
    df = df.sort_values(by="glob_count", ascending=False)

    style = df.style.background_gradient(axis=0, cmap="viridis", gmap=df["glob_count"])
    style = style.set_properties(subset=["glob_count"], **{"text-align": "right"})
    style.hide(axis="index")
    style.set_table_styles([
        {"selector": "", "props": [("border", "1px solid grey")]},
        {"selector": "tbody td", "props": [("border", "1px solid grey")]},
        {"selector": "th", "props": [("border", "1px solid grey")]},
    ])

    html = style.to_html()
    with open(output_path, "w") as html_file:
        html_file.write(html)

    total_count = df["glob_count"].sum()
    print("Total glob_count:", total_count)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-p', '--log-path', type=str, help="Path to the log file")
    parser.add_argument('-o', '--output-path', type=str, help="Path to the output HTML file")
    args = parser.parse_args()
    main(log_path=args.log_path, output_path=args.output_path)
```
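For intuition, here is the merging step run standalone on a toy group of paths (a re-implementation of the loop above for illustration only; the paths are made up):

```python
# Standalone copy of the merging loop above, run on made-up paths.
group = ["/data/run1.out", "/data/run2.out", "/data/run3.out"]
merged, emit_wildcard = "", True
for i in range(max(len(p) for p in group)):
    chars = {p[i] if len(p) > i else "" for p in group}
    if len(chars) == 1:
        merged += chars.pop()
        emit_wildcard = True
    elif emit_wildcard:
        merged += "[.*]"
        emit_wildcard = False
print(merged)  # prints "/data/run[.*].out", one pattern for three files
```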
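Since the reviewer below mentions a module-based invocation route, here is a minimal hedged sketch of calling the script's main() from Python; the log filename is a hypothetical placeholder, not a file from this PR:

```python
# Minimal sketch: invoking main() from Python rather than the command line.
# "example.darshan" is a hypothetical placeholder log, not part of the PR.
from darshan.glob_feature import glob_feature

glob_feature.main(
    log_path="example.darshan",      # any .darshan log on disk
    output_path="glob_report.html",  # destination for the HTML report
)
```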
Review comment:

If I fix both of the above issues and then try to run your code via the Python command line route, I still get an error:

python glob_feature.py -p ~/github_projects/darshan-logs/darshan_logs/e3sm_io_heatmaps_and_dxt/e3sm_io_heatmap_only.darshan

Some of the testing I mentioned a few weeks ago about handling the various output_path modalities seems to be missing? You'll want tests for the command line and module-based incantations to make sure they work as you iterate on your code.
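A hedged sketch of the kind of tests being asked for, covering both invocation routes; the placeholder log name and the use of pytest's tmp_path fixture are assumptions, not code from this PR:

```python
# Sketch of tests for both invocation routes. "example.darshan" is a
# placeholder; a real test would use a fixture log from darshan-logs.
import subprocess
import sys

from darshan.glob_feature import glob_feature


def test_module_invocation(tmp_path):
    out = tmp_path / "report.html"
    glob_feature.main(log_path="example.darshan", output_path=str(out))
    assert out.exists()


def test_cli_invocation(tmp_path):
    out = tmp_path / "report.html"
    # Run the script as a module so the installed package is exercised.
    subprocess.run(
        [sys.executable, "-m", "darshan.glob_feature.glob_feature",
         "-p", "example.darshan", "-o", str(out)],
        check=True,
    )
    assert out.exists()
```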