Homework: Practice with Network Statistics
DUE: Friday 9/13/2017, 11:59pm
Posted: 9/1/2017
Last Update: 9/9/2017
Data
Obtain data for the year genetic interaction network at http://thecellmap.org/costanzo2016/. Download
data in matrix format from http://thecellmap.org/costanzo2016/data_files/Raw%20genetic%20interaction%20datasets:%20Matrix%20format.zip,
and after unzipping use file Data File S2. Raw genetic interaction datasets: Matrix format/SGA_NxN_clustered.cdt
This is a tab-separated value
file
with the following format:
- The first five rows contain metadata. Rows six to the end contain interaction data
- The first six columns also contain metadata. Columns seven to the end contain interaction datasets
- Each column corresponds to an array containing all double knockdowns for a specific gene, with each row corresponds to the second gene knocked-down (the query gene).
For example, column seven corresponds to the array with knockdowns of gene
PDR1
, and the entry in row 6 in column 7 corresponds to knocking down geneRSC1
in that array.
For simplicity, let’s make a few transformations to turn this into the adjacency matrix of an undirected graph.
Convert the numeric interaction score given in the matrix into edge indicators (0/1). Here is one way of doing it:
- Set entries with absolute value greater than 0.2 as 1 (there are more principled ways to do this)
- Ensure the diagonal of the matrix is 0
- Set missing entries to 0
Use only genes with entries in both rows and columns (that is, they were used both as queries and arrays**. Use ORF ids to match rows and columns.
Make the adjacency matrix symmetric.
There is R code at the end of this page to perform all of these steps. Feel free to use this as you wish as part of your submission.
Note: You can use libraries like igraph
to answer all questions below except in the case you are taking this course as a “core requirement” course in Computer Science.
In that case, you have to implement solutions as described below (look for label Core).
Exercises
Compute and report the following quantities for the graph:
a) number of vertices,
b) number of edges,
c) average degree,
d) densityCompute the degree of each node and make the following two plots (these are similar to panels b. and c. of Image 2.4 of the Barabasi textbook):
a) a histogram of degrees,
b) a log-log plot of the degree distributionCompute the distance (length of shortest paths) between every pair of nodes. Core: You have to implement an algorithm to calculate all pair-wise distances. You have a few options here:
- For each vertex in the network, use breadth-first search to compute distance to every other node
- Use the Floyd-Warshall algorithm https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm
- Section 10.3 of the Newman textbook discusses more options
Report:
a) the average distance
b) the network diameter
c) a plot of the distance distribution (simiar to 2.18 a in Barabasi textbook)Is the network connected? How many components are there? Core: You have to implement one of these two solutions:
- Use BFS (Box 2.6 of Barabasi)
- Use the eigenvalues of the graph Laplacian (Section 16.3.2 of Newman textbook)
Compute the clustering coefficient for each vertex in the network. Make a log-log plot of average clustering coefficient as a function of vertex degree (Image 2.18 b of Barabasi)
What to turn in
Turn in a single pdf
containing plots and answers to each exercise. Make sure to comment on how you processed data to get an adjacency matrix.
If you used the method described above state so, if you did something else, please describe what you did. Include all code used to answer this in your pdf.
If using R, I recommend you use Rmarkdown to do this project. If using Python, I recommend you use a Jupyter notebook.
You can work in self-organizing groups of at most 3 of your classmates. On ELMS, please have every member of the group submit the same pdf.
R code to process data
library(readr)
library(tidyverse)
# read data file
mat <- read_tsv("data/Data File S2. Raw genetic interaction datasets: Matrix format/SGA_NxN_clustered.cdt")
amat <- mat %>%
# extract data rows and columns
slice(-(1:5)) %>%
select(starts_with("dma")) %>%
# convert to numeric matrix
type_convert() %>%
as.matrix()
# turn into unweighted, undirected adjacency matrix
amat <- 1 * (abs(amat) > 0.2)
# extract data about arrays (columns)
coldata <- mat %>%
slice(1:5) %>%
select(GID, starts_with("dma")) %>%
slice(2) %>% select(-1) %>%
gather(dma, orf, starts_with("dma"))
# extract ORF ids for queries (rows)
row_orf <- mat %>%
slice(-(1:5)) %>%
pull(ORF)
# match row and column ORF ids
m <- match(row_orf, coldata$orf)
rows_to_use <- !is.na(m)
cols_to_use <- m[rows_to_use]
# subset matrix into ORFs found in both rows and columns
amat <- amat[rows_to_use, cols_to_use]
# set diagonal and missing entries in matrix to 0
diag(amat) <- 0
amat[is.na(amat)] <- 0
# make the adjacency matrix symmetric
amat <- ceiling(0.5 * (amat + t(amat)))