Create Edge-Type Attribute File #99
Comments
Per discussion with @LucaCappelletti94 - create two separate
Will let you know as soon as this is ready @LucaCappelletti94!
One more thing: if you have, generally speaking, node features and edge features, whether categorical or metric (even things like the BED coordinates, if some are genomic regions), they can be useful when running GNN and GCN models on the graph.
Great suggestion @LucaCappelletti94! One thought that immediately comes to mind is gene expression values (for specific tissues); we can definitely add that. I'll think through what else might make a good edge type. Sometimes the distinction between what should be used as a weight and what should be used as a type gets blurred. However, if we first mark everything that may be interesting or useful as an edge type (i.e., all categorical and metric-based attributes), then we would also give users the ability to select from those what they want to use as an edge type based on their use case. I like this idea a lot!
Hi, if you don't mind me sharing my two cents. In the data Also, in DisGeNet, for example, the edges between DIS-GENE can have multiple labels, which completely changes the meaning of their interaction. Would it be possible to add the edge subtypes? https://www.disgenet.org/dbinfo paragraph In PheKnowLator's data sources description, it doesn't seem that any particular type of edge was filtered/selected: https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources#disgenet For Protein-Protein from STRING, there are multiple scores. Some of them, for example, are lab-based, others come from the literature, and others from predictions. I would imagine that some stakeholders would feel better being able to keep only the scores coming from lab-based experiments. So maybe all the provided scores could be made available for filtering/enhancement? The same logic applies to DisGeNet, where the number of papers seems to be a good score metric too.
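To make the filtering idea above concrete, here is a minimal sketch, assuming each edge carried a dictionary of per-channel scores. The edge records, identifiers, score values, and the `filter_by_score` helper are all hypothetical and only illustrate the request; they are not part of PheKnowLator or of any STRING tooling.

```python
from typing import Dict, List

# Hypothetical in-memory edges. The channel names resemble STRING's
# per-evidence scores, but every identifier and value here is made up.
EDGES: List[Dict] = [
    {"source": "P1", "target": "P2", "scores": {"experimental": 900, "textmining": 150}},
    {"source": "P1", "target": "P3", "scores": {"experimental": 0, "textmining": 800}},
    {"source": "P2", "target": "P3", "scores": {"experimental": 450, "textmining": 0}},
]


def filter_by_score(edges: List[Dict], channel: str, minimum: int) -> List[Dict]:
    """Keep only edges whose score for `channel` is at least `minimum`.

    Edges that lack the channel entirely are treated as scoring 0.
    """
    return [edge for edge in edges if edge["scores"].get(channel, 0) >= minimum]


# Keep only edges with reasonably strong lab-based support.
lab_supported = filter_by_score(EDGES, "experimental", 400)
print([(edge["source"], edge["target"]) for edge in lab_supported])
```

The same helper would work for a paper-count score on disease-gene edges, provided that count is stored alongside the other scores.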
Hi @fmellomascarenhas - I always appreciate your feedback! You are right that I have not yet included specific edge typing from the resources that we import, like DisGeNet and STRING. I agree it's time that we do! I will spend some time working on and thinking through this tomorrow and will create a spec for what we can add from each source we import. I will also include a brief plan/overview of how I might approach integrating them (there will always be some solutions that are easier or better than others 😄). I can post both here so you guys can take a look. I will set aside time next week to make the changes as part of a new major release. How does that sound?
Sounds great! :) From a Machine Learning point of view, I see four main uses for that:
Excellent points and even more motivation for me to make these changes! 👍
Just an update -- I was not able to get to this last week, but plan on coming back to it next week. Sorry for the delay!
Sorry for the delay, I think we are close to being able to make the updates we have been discussing in this thread. I have been reviewing the different resources that we bring in and thinking through some of the challenges with @bill-baumgartner, who has been involved with me from the beginning in building I think we came up with the best possible solution in terms of being able to incorporate the greatest amount of
Current Approach and Output
Currently, we are only producing metadata output for nodes and relations, not for triples or edges.
Node Metadata
Filename:
Proposed Representation
All data for nodes and edges will be output to a JSON Lines file (
{"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]}
{"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
{"name": "May", "wins": []}
{"name": "Deloise", "wins": [["three of a kind", "5♣"]]}
Node Metadata
Output Filename:
The node metadata file will be keyed
Additional types of metadata at the node level will be added and the general format will be:
Edge Metadata
Output Filename:
The edge metadata file will be keyed by a triple or edge identifier created as the MD5 hash of each identifier in the edge (i.e.,
Additional types of metadata at the edge level will be added and the general format will be:
As a result of including this file, I will also update the two flat-file outputs (
Feedback/Questions
@bill-baumgartner - does that seem correct and cover everything we talked about?
@LucaCappelletti94 - I realize that the proposed output would not readily work as input to
@fmellomascarenhas and @sanyabt - Please let me know if you have any comments/feedback or if you have any issues with this approach. I think it will be the best overall and hopefully, be flexible enough to be useful for most use cases.
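The proposal above (one JSON object per line, with edge records keyed by an MD5-based identifier) could look roughly like the sketch below. The exact hashing scheme is cut off in the description, so concatenating the subject, predicate, and object identifiers before hashing is an assumption. The URIs, metadata field names, and helper functions are likewise hypothetical and not taken from the PheKnowLator code base.

```python
import hashlib
import io
import json
from typing import Dict, Iterable, List, TextIO


def edge_key(subject: str, predicate: str, obj: str) -> str:
    """Return an MD5 hex digest for an edge.

    Assumption: the three identifiers are concatenated and hashed once.
    The actual scheme used for the build may differ.
    """
    joined = "".join([subject, predicate, obj])
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def write_jsonl(records: Iterable[Dict], handle: TextIO) -> None:
    """Write one JSON object per line (the JSON Lines convention)."""
    for record in records:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(handle: TextIO) -> List[Dict]:
    """Read a JSON Lines stream back into a list, skipping blank lines."""
    return [json.loads(line) for line in handle if line.strip()]


# Hypothetical identifiers and metadata fields, for illustration only.
s = "http://example.org/gene/1"
p = "http://example.org/relation/causes"
o = "http://example.org/disease/2"
record = {edge_key(s, p, o): {"source": "hypothetical-source", "evidence": ["placeholder"]}}

# An in-memory buffer stands in for the output file.
buffer = io.StringIO()
write_jsonl([record], buffer)
buffer.seek(0)
loaded = read_jsonl(buffer)
```

Because each line is an independent JSON object, users could stream the file and keep only the metadata fields they need, or append their own fields (such as timestamps) without rewriting the whole file.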
Hi @callahantiff, I am on vacation this week, but I will get back to you soon! Thanks :)
That sounds great! Have a great vacation! 😄
Hi @callahantiff, just caught up with the discussion here and I agree that this would be a great solution! I can envision adding timestamps to the edge metadata and other metrics (e.g., node centrality) to the node metadata if needed. Thank you for figuring out a solution so quickly 😄
Absolutely, that's what I was envisioning too: we would provide a baseline amount of metadata that users can choose from and/or extend (with things like timestamps) as needed. I will likely make the updates the week after next and will let you know when it's ready. Thanks again for your feedback!
Hi @callahantiff, I had time today to read everything. I think it sounds good! I don't think I am in a position to propose a better way of organizing the files, but if it helps, I thought of some additional ideas for features/metadata. Maybe this can help with the brainstorming process :)
Edge related:
4. Edge features: Examples
5. Source information: If a paper ID is available, possibly add it. This can help with generating a timestamp. Also, I remember once checking an edge that had 3 sources, but when I checked the paper IDs, two were different versions of the same manuscript and the third was a paper from the same group citing themselves. So there weren't 3 sources, just one. This information can help stakeholders validate why the edge exists.
Node related:
7. Parent/Children: Some biological entities can be described as a tree structure. Diseases, for example, branch into multiple disease subtypes. This information can be very useful to:
One thing I haven't had the bandwidth to think about is edge properties that are true only when other conditions are also true. For example, the gene expression in a cell type is X1 when disease D is present; otherwise the expression is X2. Or features that differ by gender/race/age. But this is probably way too complex for this stage. Thanks for all of your great work!
@fmellomascarenhas this is fantastic feedback, thank you very much! I also really enjoy the examples. I am not sure we can accommodate everything in the first pass, but this format will allow easy integration of the types of metadata you suggest (and likely things neither of us has thought of yet [I think 🤔 and hope 😄 ])! OK, will keep you posted as I begin working on this over the next few weeks. Thanks so much for the feedback and suggestions!
Task
Add an output file to accompany the node metadata created for each build that provides users with an easy way to identify what type each triple is.
Design
The file should at a minimum contain the following information:
n-triple
files we can create this as a named graph).
N:1, 1:N, 1:1
Weight1
Additional information related to the edge could also be added, but it's unclear at this time what would be useful.
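As one way to think about the cardinality values listed in the design (N:1, 1:N, 1:1), the sketch below derives a label for a relation from a list of triples. The reading of the labels (subject side first, object side second), the extra `N:N` label, the `relation_cardinality` helper, and the example triples are all assumptions for illustration, not the project's implementation.

```python
from collections import defaultdict
from typing import Iterable, Optional, Tuple

Triple = Tuple[str, str, str]


def relation_cardinality(triples: Iterable[Triple], relation: str) -> Optional[str]:
    """Classify a relation as 1:1, 1:N, N:1, or N:N.

    Assumed convention: "1:N" means one subject links to several objects,
    "N:1" means several subjects link to the same object. "N:N" is not in
    the design list and is added here only so every case has a label.
    Returns None when the relation does not occur in `triples`.
    """
    objects_per_subject = defaultdict(set)
    subjects_per_object = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == relation:
            objects_per_subject[subject].add(obj)
            subjects_per_object[obj].add(subject)

    if not objects_per_subject:
        return None

    many_objects = any(len(objs) > 1 for objs in objects_per_subject.values())
    many_subjects = any(len(subs) > 1 for subs in subjects_per_object.values())

    if many_objects and many_subjects:
        return "N:N"
    if many_objects:
        return "1:N"
    if many_subjects:
        return "N:1"
    return "1:1"


# Hypothetical triples for illustration.
TRIPLES = [
    ("g1", "causes", "d1"),
    ("g2", "causes", "d1"),
    ("g1", "has_label", "a"),
    ("g2", "has_label", "b"),
    ("p1", "interacts", "p2"),
    ("p1", "interacts", "p3"),
]

for name in ("causes", "has_label", "interacts"):
    print(name, relation_cardinality(TRIPLES, name))
```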