Preprint reviews by NHGRI Journal Club

Functional and non-functional classes of peptides produced by long non-coding RNAs

Jorge Ruiz-Orera, Pol Verdaguer-Grau, José Luis Villanueva-Cañas, Xavier Messeguer, M Mar Albà

Review posted on 3rd January 2017

We reviewed this paper in our December preprint journal club. Overall, we found the paper to be well written and the conclusions to be convincing. We had only a few minor comments and suggestions:

· Please be clearer about what the coding score in Figures 3B and 4C means. It is difficult to move from the results to the methods to interpret the CS_hexamer equation, so it would help your readers if you gave a more intuitive interpretation of this value right in the results. Also, how did you determine that 0.049 is the cutoff for a high coding score?

· It would have been nice to see two clearly distinct tissues compared in Figure 4B, given that one might expect “brain” and hippocampus to be fairly similar. If that expectation is wrong, please spell out why; otherwise, one or more confirmatory figures should be included in the supplement. Also, how did you choose 60% as the cutoff? Just by eye?
· Please add coding genes to Figure 4C.
· Figure 2B could be improved by adding density plots in the margins with asterisks indicating significance (such as those provided in Figure 4E).
· We were interested to see the effect of the algorithm for predicting coding potential. Do things change significantly if you use e.g. CPAT rather than CIPHER?
· In the discussion, you focus on lncRNAs as a potential intermediate step leading to de novo protein coding genes. Isn’t it equally likely that lncRNAs (especially those that are highly conserved) were at some point functional and are degenerate in mouse? If so, please consider this in the discussion; otherwise, add a short explanation of why this can’t be so.

Sofia de Pereira Barreira
Steve Bond
John Didion
Tony Kirilusha
Luli Zou


Total RNA Sequencing reveals microbial communities in human blood and disease specific effects

Serghei Mangul, Loes M Olde Loohuis, Anil Ori, Guillaume Jospin, David Koslicki, Harry Taegyun Yang, Timothy Wu, Marco P Boks, Catherine Lomen-Hoerth, Martina Wiedau-Pazos, Rita Cantor, Willem M de Vos, Rene S Kahn, Eleazar Eskin, Roel A. Ophoff

Review posted on 3rd August 2016

We reviewed this paper in our July preprint journal club. Obviously, the potential for contamination to influence the results is the first question that all reviewers will ask. Although we were impressed with the steps the authors took to mitigate such concerns, we felt there are a couple of simple experiments that still need to be done:

1) The entire pipeline should be run in parallel with both a blood sample and a saline or ultra-pure water control sample. If the latter is free of contamination, and if the sequencing pipeline introduces no contamination, then the control sequencing library is expected to generate no data; thus, any data that are generated provide a background that can be used to normalize the data from the blood sample.

2) You state that you expect the microbes found in blood originate in the gut or oral cavity. But isn’t it possible that you’re simply picking up microbe-derived RNA that has crossed into the blood, rather than RNA of microbes that are in the blood? If such a diversity of bacteria truly does exist in the blood, then shouldn’t it be possible to observe it by microscopy or by culture? Some of the species you report will certainly grow in culture. Alternatively, you should be able to perform flow sorting or some other single-cell approach to isolate non-human cells from the blood. Another experiment you could do to test your hypothesis (which is probably beyond the scope of this paper but would be interesting for a future study) is to compare the blood profiles of mice raised in germ-free versus dirty environments.
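
To make the negative-control suggestion in point 1 concrete, here is a minimal sketch of one simple way the control data could be used as a background. It assumes per-taxon read counts are available for both the blood sample and the water/saline control, and it simply subtracts the control counts; the function name and the taxon labels are hypothetical, and a real analysis would probably also account for differences in library size.

```python
def subtract_background(sample_counts, control_counts):
    """Subtract per-taxon read counts seen in a negative control from a sample.

    Both arguments map a taxon label to a read count. Taxa whose count does
    not exceed the control background are dropped from the result.
    """
    corrected = {}
    for taxon, count in sample_counts.items():
        residual = count - control_counts.get(taxon, 0)
        if residual > 0:
            corrected[taxon] = residual
    return corrected


# Hypothetical counts: taxon_B is fully explained by the control background.
blood = {"taxon_A": 10, "taxon_B": 3}
water = {"taxon_B": 5, "taxon_C": 2}
print(subtract_background(blood, water))
```

Plain subtraction is the crudest option; it is shown only to illustrate how a control run turns "possible contamination" into a measurable quantity.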

Some additional concerns/suggestions:

1) It is not stated whether or not you used index (i.e. barcoded) adapters. As you probably know, Illumina instruments have some degree of carryover between runs (http://core-genomics.blogspot..... So if a metagenomic sample was run previously on the same instrument (or in a different lane during the same run), a fraction of the reads (consistent with what shows up as non-human in your samples) could be explained by carryover.
2) You state that you observed no eukaryotic species, but, to our knowledge, the Phylosift reference database does not include any eukaryotic proteins by default. Were you specifically not looking for eukaryotes? If there were contamination from the skin during blood draw, we would expect to see some evidence of yeast species.
3) You state that you drew two vials of blood from each individual and randomly selected one for sequencing. Yes, this will randomly distribute errors, but it would still be informative to show a comparison between the microbial communities detected in first-draw vials versus second-draw vials.
4) You find the SCZ group to be different from the other three, but this group is also quite different in terms of age and/or sex ratio. Are you concerned about these potentially confounding factors? What happens if you restrict your analyses to only age- and sex-matched subsets of each cohort?

Steve Bond
Sofia de Pereira Barreira
John Didion
Nick Giangreco
Anthony Kirilusha
Sergey Koren
Luli Zou


Different Evolutionary Paths to Complexity for Small and Large Populations of Digital Organisms

Thomas LaBar, Christoph Adami

Review posted on 14th June 2016

We reviewed this paper in our May preprint journal club.

This is a clever use of Avida to look at the dynamics of genome evolution.

We debated the choice of 15 essential instructions as the starting genome. On the one hand, it seems appropriate to assume modern genomes arose from smaller genomes, but on the other hand, you are starting from a genome that can only expand because all deletions are necessarily fatal (at least until the genome has acquired insertions). More interesting, perhaps, would be to start with a larger genome that is already capable of performing 3-4 logical operations (albeit inefficiently), and then observe the selective pressures imposed by population size. This way, deletions (deleterious or not) can run to fixation in the starting populations, which should be a more accurate representation of what is likely to occur in real-world populations. It will also be interesting to see the effect of high deletion bias enforcement on these larger starting genomes (as per Fig. S6). We would anticipate that the genome size of small populations would drop quickly due to drift, likely losing traits; however, the largest populations would undergo an initial drop in genome size as trait efficiency improved, followed by the gradual increase reported in the manuscript.

Another point of discussion was the way trait count was used as a proxy for complexity. While it makes sense to use a multiplicative fitness function that limits merit to a single instance of any given trait (otherwise genomes would expand uncontrollably with repeats of simple operations), we believe you are doing yourselves a disservice by ignoring the emergence of redundant traits and by counting each trait equally once acquired. We would have preferred to see which traits emerged, and how many of each, plotted over time (we appreciate that this is a difficult data visualization challenge, but we believe the trends you are reporting would be much more compelling if shown this way).


Generally, we recommend more informative plots. We observed considerable fluctuation in genome size and merit/fitness through time when we replicated several of the simulations, especially for small populations, so this variation should be better illustrated and discussed. For example, a line graph of mean genome size as a function of time, with the 95% CI drawn as a shaded area about the mean, would make the spread across replicates visible.
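
As an illustration of the summary we have in mind, the sketch below computes the mean and an approximate 95% CI at each time point across replicate runs. It assumes every replicate reports genome size at the same time points and uses a normal approximation (z = 1.96), which is our simplification; the function name is hypothetical. The resulting lower and upper bounds could then be drawn as a shaded band, for instance with matplotlib's `fill_between`.

```python
import math
import statistics


def mean_ci_per_timepoint(replicates, z=1.96):
    """Summarize replicate time series as (mean, lower, upper) per time point.

    replicates: list of equal-length lists, one list of genome sizes per run.
    The interval is mean +/- z * standard error (normal approximation).
    """
    if len(replicates) < 2:
        raise ValueError("need at least two replicates to estimate a CI")
    summary = []
    for values in zip(*replicates):  # one tuple of values per time point
        mean = statistics.mean(values)
        sem = statistics.stdev(values) / math.sqrt(len(values))
        summary.append((mean, mean - z * sem, mean + z * sem))
    return summary


# Three hypothetical replicates, two time points each.
runs = [[10, 20], [12, 20], [14, 20]]
for mean, lower, upper in mean_ci_per_timepoint(runs):
    print(f"mean={mean:.2f} CI=({lower:.2f}, {upper:.2f})")
```

With few replicates a t-based interval or a bootstrap would be more appropriate than z = 1.96.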

It also seems that some of the individual figures should be merged into multi-panel figures, and it is unnecessary to relegate so many figures to the supplement. For example, Figures 2, 3, and S2 could be made into a single multi-panel figure.

Steve Bond
Daniel Bar
Anthony Kirilusha
David McGaughey
Sofia Barreira
John Didion


Many long intergenic non-coding RNAs distally regulate mRNA gene expression levels

Ian McDowell, Athma Pai, Cong Guo, Christopher M Vockley, Christopher D Brown, Timothy E Reddy, Barbara E Engelhardt

Review posted on 15th April 2016

We reviewed this paper in our April preprint journal club at NHGRI. Overall, we enjoyed the paper and it fostered good discussion. An interesting point you could bring up in the introduction or discussion (completely understandable if there’s no room or it’s beyond scope) would be the hypothesized biological origins of lncRNAs and how and why they might have acquired function, especially since one of the main points of the paper is that pc-mRNA and lncRNA are regulated by the same transcriptional machinery, and also in light of the recent preprint from Young et al. (

One item we found questionable was the selection of tissue types to study. Three of four tissues (adipose, artery, lung) have clear links to obesity, so a skeptical reviewer might be suspicious that you were at the outset setting yourself up to discover obesity associations. It might help to allay those suspicions by discussing the criteria you used to select the tissues you studied. Would picking any four GTEx tissues out of a hat give you similar results?

It is great to see experimental validation, and we agree with Arjun that the MR approach is cool. We’re interested in using the software, but the GitHub URL does not yet resolve.

The last paragraph in section 2.6 seems a bit out of place – it might read better if it were integrated in the discussion.

Figure 1:

• You never define eQTLBMA or SNPTEST in the text.
• It’s not clear how associations are placed on the x-axis. For associations between the TSS and TES, are you just normalizing the position of the association to the length of the gene (i.e. position / (TES-TSS))? A clarifying note in the legend would be helpful.
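
To spell out the normalization we are guessing at for the x-axis, here is a minimal sketch. It assumes the position is measured relative to the TSS and divided by the gene length, so that 0 is the TSS and 1 is the TES; this is our reading, not something the manuscript states, and the function name is hypothetical.

```python
def relative_gene_position(position, tss, tes):
    """Map a genomic coordinate to a fraction of gene length.

    Returns 0.0 at the TSS and 1.0 at the TES. Because numerator and
    denominator change sign together, the same formula works for genes
    on the minus strand, where the TSS coordinate exceeds the TES.
    """
    if tss == tes:
        raise ValueError("TSS and TES must differ")
    return (position - tss) / (tes - tss)


# Plus-strand gene from 1000 to 2000, and its minus-strand mirror image.
print(relative_gene_position(1250, 1000, 2000))
print(relative_gene_position(1750, 2000, 1000))
```

Values below 0 or above 1 would correspond to associations upstream of the TSS or downstream of the TES, which presumably need a separate scaling on the plot.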

Figure 2:
• Put panels A and B in the same orientation (currently pc-eQTLs are on top in A but on the bottom in B).
• Would be helpful if the supplementary figures were included so we could see the skin overlaps.

Figure 3: It seems a bit misleading to say “Both cis-linc-eQTLs and cis-pc-eQTLs were enriched for linkage to TASs” when the odds ratio for best linc-eQTLs is the only one <1. Some discussion of why OR<1 for best linc-eQTLs but >1 for all linc-eQTLs would be welcome.

Figure 4: Not sure what we’re supposed to take away from this plot, other than “hey there’s a lincRNA next to 10 adipose TASs.” The mean expression values are hard to interpret – you’ve taken an estimate (RPKM), log2-transformed it, and then put it in grayscale in a tiny box. Is it highly expressed in adipose? Significantly more than in other tissues? It would be cool if you layered this on top of the chromatin map to see how different the adipose regulatory environment is from the other tissues.

Figure 6: This is the classical MR schematic, but we felt it would be much more informative to see a toy example of a positive MR result in the current context – i.e. replace Z with “SNP”, X with “cis-RNA”, etc. It’s in the legend, but readers will appreciate not having to jump back and forth between the figure and the text to figure out what’s going on.

Figure 7:
• Confused – you talk about the naïve approach first in the text and refer to figure 7A as the naïve results, but the legend says that 7A is the MR results and 7B is the naïve results.
• Might be helpful (and the MR results will also look even more impressive) to put both plots on the same y-axes.

David McGaughey
Steve Bond
John Didion
Tony Kirilusha


Methylation Analysis Reveals Fundamental Differences Between Ethnicity and Genetic Ancestry

Joshua M Galanter, Christopher R Gignoux, Sam S Oh, Dara Torgerson, Maria Pino-Yanes, Neeta Thakur, Celeste Eng, Donglei Hu, Scott Huntsmann, Harold J Farber, Pedro Avila, Emerita Brigino-Buenaventura, Michael LeNoir, Kelly Meade, Denise Serebrisky, William Rodriguez-Cintron, Raj Kumar, Jose R Rodriguez-Santana, Max Seibold, Luisa Borrell, Esteban G Burchard, Noah Zaitlen

Review posted on 10th March 2016

We reviewed this paper in our February 2016 preprint journal club. First, we found the research question interesting and important – if a substantial fraction of ethnicity is explained by non-genetic effects, then this is clinically relevant information and should be taken into account during treatment, drug development and testing, etc. Our main concern was that the study design makes it difficult to believe that any associations with Puerto Rican ancestry are not due to environmental effects, since nearly 90% of the self-identified Puerto Ricans and none of the self-identified Mexicans were recruited in Puerto Rico. The authors seem to realize this problem because at several points they either test for association with recruitment site, or correct for recruitment site in tests of association between ethnicity and methylation. However, we suspect that, if instead of using recruitment site as a multi-value or continuous covariate the authors use “recruitment site == Puerto Rico” as a binary covariate, some of the significant associations between methylation and ethnicity might go away. If we were reviewers on this paper, we would ask for that additional analysis. Similarly, trying to identify methylation effects of Puerto Rican ethnicity that are independent of environmental differences that are particular to Puerto Rico (perhaps there’s a different smoking rate or level of air pollution there than in the other recruitment sites?) is problematic given this study’s data set.

Another analysis that we think is important when comparing results to previously reported findings is testing whether the effect sizes and directions are consistent. For example, in the “Ethnic differences in environmentally-associated methylation sites” section, do the 19 nominally significant loci that were previously identified in a study of Norwegian newborns have the same direction of methylation change between the two studies? This would require you to know the smoking rate among your sample populations, but you could use the population smoking rates at the recruitment sites as a reasonable proxy.
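
The direction-of-effect comparison could be as simple as the sketch below. It assumes each study provides a signed effect estimate per locus (for example, a methylation difference) and reports the fraction of shared loci whose signs agree; the function name and the locus identifiers are hypothetical.

```python
def direction_concordance(effects_a, effects_b):
    """Fraction of shared loci whose effects have the same sign in both studies.

    Each argument maps a locus identifier to a signed effect estimate.
    Loci missing from either study, or with an effect of exactly zero,
    are not counted.
    """
    informative = [
        locus
        for locus in effects_a
        if locus in effects_b and effects_a[locus] != 0 and effects_b[locus] != 0
    ]
    if not informative:
        raise ValueError("no shared loci with non-zero effects")
    agree = sum(
        1 for locus in informative
        if (effects_a[locus] > 0) == (effects_b[locus] > 0)
    )
    return agree / len(informative)


# Hypothetical effect estimates for a handful of loci.
this_study = {"locus_1": 0.10, "locus_2": -0.20, "locus_3": 0.30, "locus_4": 0.05}
prior_study = {"locus_1": 0.20, "locus_2": 0.10, "locus_3": 0.40, "locus_5": 1.00}
print(direction_concordance(this_study, prior_study))
```

A sign test against 50% concordance would then indicate whether the agreement exceeds what chance alone would give.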

Some minor comments:
• The cis-meQTL analysis is certainly important, but it would be nice to know whether you tested for trans effects, and whether any loci came up significant.
• We found it a bit odd that Bonferroni correction was used rather than the now more common FDR control. Does the number of significant associations change when using FDR <= 0.05 rather than a p-value threshold?
• For Figures 1-3, the A panels are genome-wide analyses while the remainder of the panels focus on a specific locus. The A panels should either be split into separate figures, or each panel should be very clearly labeled with a title indicating what is being shown.
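
To illustrate the Bonferroni versus FDR question raised above, here is a minimal sketch comparing the two on a list of p-values. We use the Benjamini-Hochberg step-up procedure as the FDR method, which is our assumption (the comment above does not name one); the function names and the p-values are hypothetical.

```python
def bonferroni_reject(pvalues, alpha=0.05):
    """Reject hypotheses whose p-value is at most alpha / number of tests."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg_reject(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at alpha.

    Finds the largest rank k with p_(k) <= k * alpha / m and rejects the
    k smallest p-values. Results are returned in the input order.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            max_rank = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            reject[idx] = True
    return reject


# Hypothetical p-values from five association tests.
pvals = [0.001, 0.012, 0.025, 0.039, 0.2]
print(sum(bonferroni_reject(pvals)), "significant under Bonferroni")
print(sum(benjamini_hochberg_reject(pvals)), "significant under BH FDR")
```

Reporting both counts side by side would show how much the choice of correction changes the number of significant associations.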
