Testing for association with rare variants in the coding and non-coding genome: RAVA-FIRST, a new approach based on CADD deleteriousness score.

Bibliographic Details
Title:	Testing for association with rare variants in the coding and non-coding genome: RAVA-FIRST, a new approach based on CADD deleteriousness score.
Authors:	Ozvan Bocher, Thomas E Ludwig, Marie-Sophie Oglobinsky, Gaëlle Marenne, Jean-François Deleuze, Suryakant Suryakant, Jacob Odeberg, Pierre-Emmanuel Morange, David-Alexandre Trégouët, Hervé Perdry, Emmanuelle Génin
Source:	PLoS Genetics, Vol 18, Iss 9, p e1009923 (2022)
Publisher Information:	Public Library of Science (PLoS), 2022.
Publication Year:	2022
Collection:	LCC:Genetics
Subject Terms:	Genetics, QH426-470
More Details:	Rare variant association tests (RVAT) have been developed to study the contribution of rare variants, which are now widely accessible through high-throughput sequencing technologies. RVAT require aggregating rare variants into testing units and filtering variants to retain only the most likely causal ones. In the exome, genes are natural testing units and variants are usually filtered based on their functional consequences. However, when dealing with whole-genome sequence (WGS) data, both steps are challenging. No natural biological unit is available for aggregating rare variants. Sliding-window procedures have been proposed to circumvent this difficulty; however, they are blind to biological information and result in a large number of tests. We propose a new strategy to perform RVAT on WGS data: "RAVA-FIRST" (RAre Variant Association using Functionally-InfoRmed STeps), comprising three steps. (1) New testing units, referred to as "CADD regions", are defined genome-wide based on functionally-adjusted Combined Annotation Dependent Depletion (CADD) scores of variants observed in the gnomAD populations. (2) A region-dependent filtering of rare variants is applied in each CADD region. (3) A functionally-informed burden test is performed with sub-scores computed for each genomic category within each CADD region. Both on simulations and real data, RAVA-FIRST was found to outperform other WGS-based RVAT. Applying RAVA-FIRST to a WGS dataset of venous thromboembolism patients, we identified an intergenic region on chromosome 18 enriched for rare variants in early-onset patients. This region, which was missed by standard sliding-window procedures, is included in a TAD region that contains a strong candidate gene. RAVA-FIRST enables new investigations of rare non-coding variants in complex diseases, facilitated by its implementation in the R package Ravages.
Document Type:	article
File Description:	electronic resource
Language:	English
ISSN:	1553-7390 1553-7404
Relation:	https://doaj.org/toc/1553-7390; https://doaj.org/toc/1553-7404
DOI:	10.1371/journal.pgen.1009923
Access URL:	https://doaj.org/article/47dcd592d87c4bea91b6dd59452c4a97
Accession Number:	edsdoj.47dcd592d87c4bea91b6dd59452c4a97
Database:	Directory of Open Access Journals

<title id="AN0159163981-1">Testing for association with rare variants in the coding and non-coding genome: RAVA-FIRST, a new approach based on CADD deleteriousness score </title> <sbt id="AN0159163981-2">Introduction</sbt> <p>Rare variant association tests (RVAT) have been developed to study the contribution of rare variants, which are now widely accessible through high-throughput sequencing technologies. RVAT require aggregating rare variants into testing units and filtering variants to retain only the most likely causal ones. In the exome, genes are natural testing units and variants are usually filtered based on their functional consequences. However, when dealing with whole-genome sequence (WGS) data, both steps are challenging. No natural biological unit is available for aggregating rare variants. Sliding-window procedures have been proposed to circumvent this difficulty; however, they are blind to biological information and result in a large number of tests. We propose a new strategy to perform RVAT on WGS data: "RAVA-FIRST" (RAre Variant Association using Functionally-InfoRmed STeps), comprising three steps. (1) New testing units, referred to as "CADD regions", are defined genome-wide based on functionally-adjusted Combined Annotation Dependent Depletion (CADD) scores of variants observed in the gnomAD populations. 
(2) A region-dependent filtering of rare variants is applied in each CADD region. (3) A functionally-informed burden test is performed with sub-scores computed for each genomic category within each CADD region. Both on simulations and real data, RAVA-FIRST was found to outperform other WGS-based RVAT. Applying RAVA-FIRST to a WGS dataset of venous thromboembolism patients, we identified an intergenic region on chromosome 18 enriched for rare variants in early-onset patients. This region, which was missed by standard sliding-window procedures, is included in a TAD region that contains a strong candidate gene. RAVA-FIRST enables new investigations of rare non-coding variants in complex diseases, facilitated by its implementation in the R package Ravages.</p> <p>Author summary: Technological progress has made whole-genome sequencing possible at an unprecedented scale, opening up the possibility of exploring the role of low-frequency genetic variants in common diseases. The challenge is now methodological and requires the development of novel methods and strategies to analyse sequencing data that are not limited to assessing the role of coding variants. With RAVA-FIRST, we propose a novel strategy to investigate the role of rare variants in the whole genome that takes advantage of biological information. In particular, RAVA-FIRST relies on testing units that go beyond genes to gather rare variants in the association tests. In this work, we show that this new strategy presents several advantages compared to existing methods. RAVA-FIRST offers an easy and straightforward analysis of genome-wide rare variants, especially the intergenic ones, which are frequently left behind, making it a promising tool to get a better understanding of the biology of complex diseases.</p> <p>With advances in sequencing technologies, it is now possible to explore the role of rare genetic variants in complex diseases. 
Rare variant association tests (RVAT) have been developed that gather rare variants into testing units and compare their rare variant content between cases and controls [[<reflink idref="bib1" id="ref4">1</reflink>]–[<reflink idref="bib3" id="ref5">3</reflink>]]. While the impact of rare variants has already been shown in several complex diseases [[<reflink idref="bib4" id="ref6">4</reflink>]–[<reflink idref="bib6" id="ref7">6</reflink>]], RVAT face two key challenges: (i) the definition of the testing units and (ii) the selection of the qualifying rare variants to include in these units. Because the proportion of causal variants in the testing units is a major driver of power, especially for burden tests, it is important to ensure that qualifying variants are enriched in variants likely to have some functional impact [[<reflink idref="bib3" id="ref8">3</reflink>], [<reflink idref="bib7" id="ref9">7</reflink>]]. When exome analyses are undertaken, rare variants are most often grouped by gene and included in the analysis depending on their impact on the corresponding protein [[<reflink idref="bib8" id="ref10">8</reflink>]]. Nevertheless, the gene definition is not always optimal, as differences in rare variant burden between cases and controls can sometimes be found only in a sub-region of a gene. This is, for example, the case in the <emph>RNF213</emph> gene, where an enrichment in rare variants located in the C-terminal region was found in Moyamoya cases [[<reflink idref="bib10" id="ref11">10</reflink>]]. Defining testing units and qualifying variants is much more challenging in the non-coding genome due to the lack of defined genomic elements and the greater difficulty of predicting the functional impact of non-coding variants [[<reflink idref="bib11" id="ref12">11</reflink>]]. 
It is nevertheless a question of interest, as several studies have shown the importance of rare non-coding variants in the development of complex diseases [[<reflink idref="bib12" id="ref13">12</reflink>]–[<reflink idref="bib14" id="ref14">14</reflink>]]. Functional elements such as enhancers or promoters can be used as testing units [[<reflink idref="bib5" id="ref15">5</reflink>], [<reflink idref="bib15" id="ref16">15</reflink>]]. However, these elements only cover a small portion of the non-coding genome and they are often too small to gather a sufficient number of rare variants. On the other hand, sliding-window procedures such as SCAN-G [[<reflink idref="bib17" id="ref17">17</reflink>]] or WGSCAN [[<reflink idref="bib18" id="ref18">18</reflink>]] can be used to test for association over the whole genome. Nevertheless, they present several limitations, including a window definition that is arbitrary and blind to biological information, a high number of tests and the associated computation time. With overlapping windows, there is also a strong correlation between the different testing units that requires permutation procedures to account for multiple testing. Finally, to filter rare variants in the testing units, pathogenicity scores are often used but without guidelines on which score to use and which threshold to apply.</p> <p>In this paper, we propose RAVA-FIRST (RAre Variant Association using Functionally InfoRmed STeps), a new strategy for analysing rare variants in the coding and the non-coding genome that addresses the previous issues. First, we provide pre-defined testing units in the whole genome called "CADD regions" based on the Combined Annotation Dependent Depletion (CADD) scores of deleteriousness of variants observed in the gnomAD general population. Second, we propose a filtering approach based on CADD scores with region-dependent thresholds to represent the genetic context of each CADD region and avoid the use of a fixed threshold along the genome. 
Finally, we integrate functional information into the burden test to detect an accumulation of rare variants in specific genomic categories within CADD regions. Through a statistical description of these CADD regions, we show that they preserve the integrity of the majority of functional elements in the genome. We also show that the RAVA-FIRST filtering strategy enables a better discrimination between functional and non-functional variants. We applied RAVA-FIRST to real whole-genome sequencing data from individuals with venous thromboembolism (VTE) and detected an intergenic association signal that would have been missed with sliding windows and a classical filtering of rare variants. RAVA-FIRST is implemented in the R package Ravages, available on CRAN and maintained on GitHub [[<reflink idref="bib19" id="ref19">19</reflink>]].</p> <hd id="AN0159163981-3">Description of the method</hd> <p></p> <hd id="AN0159163981-4">Ethics statement</hd> <p>The MARTHA study was approved by its institutional ethics committee and informed written consent was obtained in accordance with the Declaration of Helsinki. Ethics approval was obtained from the "Departement santé de la direction générale de la recherche et de l'innovation du ministère" (Projects DC: 2008–880 and 09.576).</p> <p>RAVA-FIRST is developed to test for association with rare variants in the whole genome. It deals with all steps from the definition of testing units and the filtering of rare variants to the association test accounting for functional information. The main steps are described here and represented in Fig 1, and further details are provided in S1 File and S1 Fig.</p> <p>Graph: Fig 1 Steps performed in RAVA-FIRST: definition of ACS, CADD regions, region-specific thresholds and functionally-informed burden tests.</p> <hd id="AN0159163981-5">Testing units in RAVA-FIRST: The CADD regions</hd> <p>To define testing units for association tests, we took inspiration from the work of Havrilla et al. 
(2019) [[<reflink idref="bib21" id="ref20">21</reflink>]]. They defined "constrained coding regions" (CCR) as exonic regions where no important functional variation (defined as being at least missense) was found in the general population of gnomAD [[<reflink idref="bib22" id="ref21">22</reflink>]]. Those regions could be of interest in RVAT, as we can expect that an accumulation of rare variants within them would lead to an increased risk of developing a disease. However, in our experience, two limitations prevent the direct use of CCR as testing units in the whole genome: they are too small to gather a sufficient number of rare variants (224 bp being the maximum length of a CCR) and their definition relies on the consequence of the variants on the translated protein, which is not available in the non-coding genome. To define regions in the non-coding genome using the same underlying hypothesis, we therefore decided to expand the proposed approach by estimating the functionality of variants through CADD scores [[<reflink idref="bib23" id="ref22">23</reflink>]]. CADD scores were chosen because of their availability for every substitution in the genome and because they rank well in the comparison test of functional annotation tools [[<reflink idref="bib24" id="ref23">24</reflink>]]. The goal here is to split the genome into regions according to the distribution of functional variation observed in gnomAD, and not to detect the most constrained regions, which was the aim of Havrilla et al. (2019) [[<reflink idref="bib21" id="ref24">21</reflink>]].</p> <p>Coding variants tend to present higher CADD values than non-coding variants [[<reflink idref="bib23" id="ref25">23</reflink>]]. A selection based on a CADD threshold would therefore result in a majority of coding variants selected. 
In order to avoid this pattern, we adjusted the raw CADD scores of all possible SNVs and of a set of 48,000,000 Indels on a PHRED scale within each of three genomic categories ("coding", "regulatory" and "intergenic" regions) to obtain an "adjusted CADD score", also called "ACS". Coding regions correspond to CCDS [[<reflink idref="bib25" id="ref26">25</reflink>]] and represent 1.2% of the genome. Regulatory regions represent 44.3% of the genome and gather introns, 5' and 3' UTR, promoters and enhancers, all being involved in gene regulation [[<reflink idref="bib26" id="ref27">26</reflink>]]. Enhancers and promoters have been obtained with the SCREEN tool from ENCODE, which enables the definition of a large number of regulatory elements in diverse cell types [[<reflink idref="bib27" id="ref28">27</reflink>]]. Finally, intergenic regions correspond to all other regions and represent 54.5% of the genome. More details are given in the S1 File.</p> <p>ACS were used to select the variants that bound the "CADD regions", based on criteria that were fine-tuned to ensure that CADD regions had sizes compatible with RVAT; that is, large enough to contain enough rare causal variants in cases, but not so large that the signal is diluted by too many rare neutral variants. First, we selected the variants with an ACS greater than 20, which corresponds to the top 1% of variants with the highest predicted functional impact within each of the three genomic categories. Then, among those variants, only the ones observed at least two times in gnomAD r2.0.1 genomes were used as boundaries of CADD regions. The choice of excluding gnomAD singletons was made to avoid splitting CADD regions because of sequencing errors. Contiguous small regions of less than 10 kb were grouped together. 
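</p> <p>The boundary selection described above can be sketched as follows. This is only an illustration under simplifying assumptions, not the authors' implementation: variants are given as (position, ACS, gnomAD allele count) tuples on a single chromosome, each boundary variant is taken as the last position of its region, and every run of consecutive regions shorter than 10 kb is merged into one. The function name <emph>cadd_region_bounds</emph>, the tuple layout and these two conventions are hypothetical; the exact procedure and parameters are given in the S1 File.</p>

```python
def cadd_region_bounds(variants, chrom_end, acs_min=20.0, min_count=2, small=10_000):
    """Split [1, chrom_end] into regions bounded by high-ACS, non-singleton variants.

    variants: iterable of (position, acs, gnomad_allele_count) tuples.
    Returns a list of (start, end) tuples, 1-based and inclusive.
    """
    # Boundary variants: ACS above the cut-off and observed at least `min_count` times.
    cuts = sorted({pos for pos, acs, count in variants
                   if acs > acs_min and count >= min_count and pos < chrom_end})

    # Raw regions: each one ends at a boundary variant (assumed convention);
    # the last region ends at the chromosome end.
    raw, start = [], 1
    for end in cuts + [chrom_end]:
        raw.append((start, end))
        start = end + 1

    # Assumed grouping rule: merge every run of contiguous small regions.
    merged, prev_small = [], False
    for s, e in raw:
        is_small = (e - s + 1) < small
        if is_small and prev_small:
            merged[-1] = (merged[-1][0], e)  # extend the current run
        else:
            merged.append((s, e))
        prev_small = is_small
    return merged
```

<p>With this grouping rule, a short region that sits between two long ones is kept as is, which is consistent with the very short regions reported in the summary statistics of region lengths.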
All genomic regions where CADD scores are not available (such as centromeres and telomeres, among others) were excluded, as well as regions that are not sequenced or are poorly covered in gnomAD but contain genomic sites where the predicted ACS exceeds 20 for at least one of the possible alleles. This creates gaps within CADD regions that are sometimes of only one base pair, but avoids keeping artificially long CADD regions due to a lack of observed variants in gnomAD. More details about the steps and parameters used for the definition of CADD regions are presented in the S1 File.</p> <hd id="AN0159163981-6">The RAVA-FIRST filtering strategy</hd> <p>In addition to the definition of new testing units in the whole genome, we propose a new filtering strategy in RAVA-FIRST to select qualifying variants based on thresholds that are specific to each CADD region. The idea is similar to the gene-specific CADD thresholds proposed by Itan et al (<reflink idref="bib27" id="ref29">27</reflink>) to improve variant deleteriousness prediction. To define region-specific thresholds, we computed the median ACS of all the variants (SNPs and InDels) observed at least two times in gnomAD in each CADD region. This value is expected to represent the median score level that is tolerated in the general population within each CADD region. Qualifying variants are then defined as rare variants with an ACS above the threshold specific to their region. We chose to include InDels in this median so that they can be analysed using the RAVA-FIRST strategy, as they represent an important source of genetic variation.</p> <hd id="AN0159163981-7">Burden test in RAVA-FIRST: Taking into account functional information</hd> <p>Several of the CADD regions overlap different genomic categories (coding, regulatory or intergenic, Figs 1 and S2). 
As the effects of variants belonging to these different genomic categories may not be the same, we extended the burden test by integrating a sub-score for each genomic category into the regression model, similarly to the analysis of rare and frequent variants proposed by Li and Leal (2008) [[<reflink idref="bib7" id="ref30">7</reflink>]]:</p> <p> <ephtml> \[ \ln \frac{P(Y_j = 1)}{P(Y_j = 0)} = \beta_0 + \beta_{Cov} X_{Cov} + \sum_{G \in \{cod;\, reg;\, inter\}} \beta_G X_G \] </ephtml> </p> <p> <emph>Y<subs>j</subs></emph> is the phenotype of individual <emph>j</emph> among the n individuals: 0 for controls and 1 for cases. <emph>β</emph><subs>0</subs> represents the intercept of the model and <emph>X<subs>Cov</subs></emph> the matrix of covariates (if any) with their associated effects, <emph>β<subs>Cov</subs></emph>. <emph>β<subs>G</subs></emph> corresponds to the estimated effect of the burden <emph>X<subs>G</subs></emph> of each genomic category within the tested CADD region. The burden can be computed, for example, using WSS [[<reflink idref="bib1" id="ref31">1</reflink>]], which corresponds to a weighted sum of rare alleles based on their frequency, the rarest alleles having the highest weights.</p> <p>Sub-scores <emph>X<subs>G</subs></emph> are thus constructed for each genomic category within a CADD region, with at most three sub-scores (coding, regulatory or intergenic). The p-value can be determined using a likelihood ratio test comparing this model to the null model where the sub-scores are not included. This sub-score analysis, referred to as the RAVA-FIRST burden test, is also available for continuous and for categorical phenotypes using the extension of burden tests developed in Bocher et al. 
(2019) [[<reflink idref="bib19" id="ref32">19</reflink>]]. The RAVA-FIRST burden test, coupled with the region-specific filtering on the ACS, makes it possible to perform a single test per CADD region while keeping the most important functional variants within each genomic category and accounting for those categories in the association test.</p> <hd id="AN0159163981-8">Verification and comparison</hd> <p></p> <hd id="AN0159163981-9">Statistics on CADD regions and comparison with genomic elements</hd> <p>A total of 135,224 CADD regions were defined, covering 93.2% of the genome (in build GRCh37), of which 106,251 CADD regions are larger than 1 kb (covering 93% of the genome). Overall, 42.1% of CADD regions span only one type of genomic category, 47.5% span two of the three types of genomic categories, and 10.4% overlap the three genomic categories (S2 Fig). Some CADD regions are extremely large, mainly in the regions close to the centromeres (Table 1). Care should be taken when interpreting results obtained in these regions. Indeed, only a few high-quality genomes covering these genomic regions are currently available and CADD scores may not be as reliable as in other parts of the genome. 
About 70% of CADD regions have a size between 1 and 50 kb with a mean of 20 kb, making them comparable in size to the genes commonly used as testing units in RVAT.</p> <p>Table 1 Summary statistics of the lengths of CADD regions.</p> <p> <ephtml> &lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align="center" rowspan="2" /&gt;&lt;th align="center" colspan="5"&gt;Quantiles&lt;/th&gt;&lt;th align="center" rowspan="2"&gt;Mean&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;th align="center"&gt;0%&lt;/th&gt;&lt;th align="center"&gt;25%&lt;/th&gt;&lt;th align="center"&gt;50%&lt;/th&gt;&lt;th align="center"&gt;75%&lt;/th&gt;&lt;th align="center"&gt;100%&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align="center"&gt;Length (kb)&lt;/td&gt;&lt;td align="center"&gt;0.002&lt;/td&gt;&lt;td align="center"&gt;2.576&lt;/td&gt;&lt;td align="center"&gt;13.006&lt;/td&gt;&lt;td align="center"&gt;24.323&lt;/td&gt;&lt;td align="center"&gt;1,731.228&lt;/td&gt;&lt;td align="center"&gt;19.852&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>We then compared the position of genomic elements relative to the defined CADD regions (see S1 Table for the definition of genomic elements). A large majority of genomic elements are entirely included in a single CADD region and thus their integrity is preserved (see S2 Table). This is expected, as all these genomic elements are substantially smaller than the CADD regions. For larger elements such as introns or lncRNA, the percentage decreases but remains high (more than 80% of lncRNA are overlapped by at most two CADD regions). The genomic elements spanning more than one CADD region are on average longer than the ones entirely included in a single CADD region. 
However, when comparing CCR and CADD regions, it is interesting to note that the CCRs entirely encompassed within a single CADD region are the longest ones, which also represent the most constrained regions.</p> <hd id="AN0159163981-10">Performance of RAVA-FIRST filtering based on ACS</hd> <p>To assess the performance of the ACS and the RAVA-FIRST filtering, we evaluated their capacity to discriminate rare pathogenic SNVs defined in the ClinVar database [[<reflink idref="bib28" id="ref33">28</reflink>]] from rare SNV polymorphisms observed in the 1000 Genomes project [[<reflink idref="bib29" id="ref34">29</reflink>]]. We computed the true positive rate (TPR), true negative rate (TNR) and precision for the RAVA-FIRST filtering and compared the results to the ones obtained by applying a fixed CADD threshold of 10, 15 or 20 on variants annotated with CADD scores v1.4. A total of 82,811 variants (44,566 benign and 38,245 pathogenic), both coding and non-coding, were included in this analysis (see S1 and S2 Files for more details on the selection of variants).</p> <p>For coding variants, all filtering strategies based on CADD scores (fixed threshold or ACS) show a very high TPR (Fig 2A), meaning that the majority of pathogenic variants would be selected as qualifying variants for RVAT. The TNR increases with the CADD score threshold, which is expected as fewer variants, and therefore fewer benign variants, are included in the analysis. The RAVA-FIRST filtering shows the highest TNR and the highest precision. While the TPR value is extremely important to select the most probable causal variants in RVAT, it is also important to have a high TNR value; otherwise, the signal will be diluted by a high proportion of non-causal variants. The precision value summarises the TPR and TNR parameters and is representative, to a certain extent, of the percentage of causal variants among selected variants. 
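</p> <p>As a minimal sketch of how these quantities relate to the region-specific filter, the snippet below derives one threshold per region as the median ACS of the gnomAD variants in that region, selects variants whose ACS exceeds their region's threshold, and computes TPR, TNR and precision from labelled variants. The function names, the data layout and all toy values are hypothetical and are not taken from the study.</p>

```python
from statistics import median


def region_thresholds(gnomad_acs_by_region):
    """Median ACS of the gnomAD variants (observed at least twice) per region."""
    return {region: median(scores) for region, scores in gnomad_acs_by_region.items()}


def filter_metrics(variants, keep):
    """Return (TPR, TNR, precision) of a filter.

    variants: list of (acs, region, is_pathogenic) tuples.
    keep: function (acs, region) -> True if the variant is selected.
    """
    tp = fp = tn = fn = 0
    for acs, region, is_pathogenic in variants:
        selected = keep(acs, region)
        if is_pathogenic:
            tp, fn = tp + selected, fn + (not selected)
        else:
            fp, tn = fp + selected, tn + (not selected)

    def ratio(a, b):
        return a / b if b else float("nan")

    return ratio(tp, tp + fn), ratio(tn, tn + fp), ratio(tp, tp + fp)


# Toy data (invented): two regions with different tolerated score levels.
thresholds = region_thresholds({"R1": [2, 4, 9], "R2": [10, 12, 20]})
labelled = [(8, "R1", True), (3, "R1", False),
            (15, "R2", True), (13, "R2", False), (11, "R2", False)]
tpr, tnr, precision = filter_metrics(labelled, lambda acs, r: acs > thresholds[r])
```

<p>On this toy input, both pathogenic variants are selected and one of the three benign variants passes the filter, so the TPR is 1 and both the TNR and the precision are 2/3. A fixed threshold can be evaluated with the same function by passing a <emph>keep</emph> function that ignores the region.</p> <p>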
Therefore, we show that the RAVA-FIRST filtering strategy is the most accurate to select qualifying rare variants for RVAT. Focusing on the coding genome, we also compared the performance of the RAVA-FIRST filtering approach with two other procedures classically applied on genes as testing units: (1) filter for variants with a functional impact expected to change the protein ("missense_variant", "missense_variant&amp;splice_region_variant", "splice_acceptor_variant", "splice_donor_variant", "start_lost", "start_lost&amp;splice_region_variant", "stop_gained", "stop_gained&amp;splice_region_variant", "stop_lost", "stop_lost&amp;splice_region_variant" and "stop_retained_variant"), and (2) filter on the MSC value, a gene-specific CADD threshold [[<reflink idref="bib30" id="ref37">30</reflink>]]. These two filtering approaches resulted in a slightly higher TPR than our proposed strategy but lower TNR and lower precision (Fig 2A). Therefore, even in an exome analysis, the RAVA-FIRST filtering would outperform classical filtering strategies to select qualifying rare variants for RVAT.</p> <p>Graph: Fig 2 TPR, TNR and precision of different filtering strategies on ClinVar coding or non-coding pathogenic variants compared to rare 1000 Genomes polymorphisms.</p> <p>For non-coding variants, performances are lower than for coding variants. This is true when using both fixed CADD thresholds and the ACS median (Fig 2B), but the TPR is much lower when a fixed CADD threshold is used. This is explained by the fact that CADD values are lower in the non-coding genome. Among fixed thresholds, the best CADD threshold is indeed 10 in the non-coding genome, whereas it is 20 in the coding genome. It is thus difficult to use a single fixed CADD value to select rare variants in testing units in the whole genome and the proposed ACS thresholds may therefore be preferred. 
Note, however, that because of a bias towards coding variants in ClinVar pathogenic variants, the number of non-coding variants included in this analysis is rather low (2,980) compared to coding variants (79,831) and results should therefore be interpreted with caution.</p> <p>In summary, compared to classical filtering strategies, the RAVA-FIRST approach based on ACS is expected to improve rare variant selection for RVAT in both the coding and the non-coding parts of the genome.</p> <hd id="AN0159163981-11">RAVA-FIRST burden test–Simulations</hd> <p>To validate the RAVA-FIRST burden test, we performed simulations under the null hypothesis and under different scenarios of association using data from the 1000 Genomes European populations [[<reflink idref="bib29" id="ref42">29</reflink>]] in the <emph>LCT</emph> gene. We simulated 1,000 controls and 1,000 cases using the simulations based on haplotypes implemented in the R package Ravages [[<reflink idref="bib19" id="ref43">19</reflink>]]. A total of 201 variants were considered in the <emph>LCT</emph> gene. These variants were polymorphic in the European populations with a MAF lower than 1%. Two CADD regions overlap the <emph>LCT</emph> gene, R019233 and R019234, containing 75 and 126 variants, respectively; both regions overlap coding and regulatory categories.</p> <hd id="AN0159163981-12">Type I error</hd> <p>We first simulated data under the null hypothesis to verify that the RAVA-FIRST burden test maintains appropriate type I errors. We simulated two groups of 1,000 individuals in the R019234 CADD region without any genetic effect and we applied the classical WSS and the RAVA-FIRST WSS. 
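</p> <p>To make the difference between the two scores concrete, the sketch below contrasts a single weighted burden score per individual with one sub-score per genomic category. It is a simplified illustration, not the Ravages implementation: allele frequencies are estimated on the whole sample with a small smoothing term, weights are taken as 1/sqrt(q(1-q)) so that the rarest alleles receive the highest weights, and the function names and toy genotypes are hypothetical. In the RAVA-FIRST burden test the sub-scores would then enter the regression model as separate terms.</p>

```python
from math import sqrt


def wss_weights(genotypes):
    """One weight per variant; rarer variants get larger weights.

    genotypes: list of individuals, each a list of minor-allele counts (0, 1 or 2).
    Assumption: frequencies are estimated on the whole sample, with smoothing.
    """
    n = len(genotypes)
    weights = []
    for v in range(len(genotypes[0])):
        minor = sum(individual[v] for individual in genotypes)
        q = (minor + 1) / (2 * n + 2)  # smoothed frequency, never exactly 0 or 1
        weights.append(1 / sqrt(q * (1 - q)))
    return weights


def burden_scores(genotypes, categories):
    """Return (classical, sub_scores).

    classical: one weighted score per individual over all variants.
    sub_scores: one dict {category: weighted score} per individual.
    categories: genomic category of each variant, in variant order.
    """
    w = wss_weights(genotypes)
    classical, sub_scores = [], []
    for individual in genotypes:
        per_category = {}
        for g, weight, category in zip(individual, w, categories):
            per_category[category] = per_category.get(category, 0.0) + g * weight
        sub_scores.append(per_category)
        classical.append(sum(per_category.values()))
    return classical, sub_scores


# Toy genotypes (invented): 4 individuals, 3 variants.
genotypes = [[1, 0, 0], [0, 0, 1], [0, 0, 0], [0, 1, 1]]
categories = ["coding", "regulatory", "regulatory"]
classical, sub_scores = burden_scores(genotypes, categories)
```

<p>By construction, the classical score of an individual is the sum of that individual's sub-scores; the sub-score model simply lets each category have its own effect.</p> <p>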
Type I errors were computed using 5∙10<sups>6</sups> simulations at three significance levels: 5∙10<sups>−2</sups>, 10<sups>−3</sups> and 2.5∙10<sups>−6</sups> (the usual threshold for whole exome rare variant association tests). The RAVA-FIRST WSS maintains good type I error levels at these different significance thresholds, similar to the ones obtained with the classical WSS (S3 Table).</p> <hd id="AN0159163981-13">Power analysis</hd> <p>We then performed a power study with causal variants located either in the R019234 CADD region only or in the entire <emph>LCT</emph> gene in any of the two CADD regions. We simulated 50% of causal variants randomly spread in the whole unit (scenarios S1 and S3), in the coding regions (scenarios S2A and S4A) or in the regulatory regions (scenarios S2B and S4B). All the scenarios are summarised in Table 2. We compared the classical WSS to the RAVA-FIRST WSS using the gene or the two CADD regions as testing units. When CADD regions were used as testing units, analyses were performed for each of the two CADD regions and the minimum p-value was taken and multiplied by two to correct for multiple testing. 
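</p> <p>The correction applied when the two CADD regions are tested separately amounts to a Bonferroni adjustment of the smallest p-value. A minimal sketch, with a hypothetical function name and invented p-values; capping the result at 1 is a standard convention added here and is not stated in the text:</p>

```python
def min_p_corrected(p_values):
    """Smallest p-value multiplied by the number of tests, capped at 1."""
    return min(1.0, min(p_values) * len(p_values))


# Invented p-values for the two CADD regions overlapping the gene.
corrected = min_p_corrected([1e-7, 0.03])
is_significant = corrected < 2.5e-6
```

<p>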
A total of 1,000 replicates were simulated for each scenario and power was assessed at a genome-wide significance threshold of 2.5∙10<sups>−6</sups>.</p> <p>Graph</p> <p>Table 2 Scenarios of association simulated to assess the performance of the RAVA-FIRST burden test.</p> <p> <ephtml> &lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align="center" rowspan="3" style="border-right:thick;border-bottom:thick" /&gt;&lt;th align="center" colspan="4"&gt;&lt;italic&gt;LCT&lt;/italic&gt; gene&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;th align="center" colspan="2"&gt;R019233&lt;/th&gt;&lt;th align="center" colspan="2"&gt;R019234&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;th align="center" style="border-bottom:thick"&gt;Coding&lt;/th&gt;&lt;th align="center" style="border-bottom:thick"&gt;Regulatory&lt;/th&gt;&lt;th align="center" style="border-bottom:thick"&gt;Coding&lt;/th&gt;&lt;th align="center" style="border-bottom:thick"&gt;Regulatory&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align="center" style="border-right:thick"&gt;S1&lt;/td&gt;&lt;td align="center" colspan="2" style="background-color:#D0CECE" /&gt;&lt;td align="center" colspan="2"&gt;50%&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" style="border-right:thick"&gt;S2A&lt;/td&gt;&lt;td align="center" colspan="2" style="background-color:#D0CECE" /&gt;&lt;td align="center"&gt;50%&lt;/td&gt;&lt;td align="center"&gt;0%&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" style="border-right:thick;border-bottom:thick"&gt;S2B&lt;/td&gt;&lt;td align="center" colspan="2" style="background-color:#D0CECE" /&gt;&lt;td align="center"&gt;0%&lt;/td&gt;&lt;td align="center"&gt;50%&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" style="border-right:thick"&gt;S3&lt;/td&gt;&lt;td align="center" colspan="4"&gt;50%&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" style="border-right:thick"&gt;S4A&lt;/td&gt;&lt;td align="center"&gt;50%&lt;/td&gt;&lt;td align="center"&gt;0%&lt;/td&gt;&lt;td 
align="center"&gt;50%&lt;/td&gt;&lt;td align="center"&gt;0%&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" style="border-right:thick"&gt;S4B&lt;/td&gt;&lt;td align="center"&gt;0%&lt;/td&gt;&lt;td align="center"&gt;50%&lt;/td&gt;&lt;td align="center"&gt;0%&lt;/td&gt;&lt;td align="center"&gt;50%&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>Table 3 presents the power results obtained from this simulation study for both the classical WSS and the RAVA-FIRST WSS. Similar trends were observed for the two analyses, regardless of whether the simulations were performed at the scale of CADD regions or at the scale of the gene. When the causal variants were randomly sampled across the entire region (scenarios S1 and S3), the classical WSS with only one score for the entire region slightly outperformed the RAVA-FIRST method with sub-scores. Nevertheless, the loss of power for the latter was modest (less than 10%). By contrast, when causal variants were present only in the coding categories (scenarios S2A and S4A), which represent a small proportion of the entire region (approximately 15%), the RAVA-FIRST strategy was much more powerful than the classical WSS (approximately 50% gain in power). When causal variants were present in the regulatory categories only (scenarios S2B and S4B), both strategies showed similar power. Together, these results highlight the gain in power obtained with the RAVA-FIRST WSS when a cluster of causal variants is present within a functional category of the CADD region, while good power levels are maintained when causal variants are spread across the entire region. When comparing the simulations with causal variants sampled at the gene level or at the CADD region level, burden tests gathering variants within the corresponding testing units show, as expected, the highest levels of power. 
Nevertheless, the power lost by using CADD regions instead of the entire gene as testing units when causal variants are sampled across the entire gene (scenario S3) is smaller than the power gained by using CADD regions when causal variants are sampled within a specific CADD region (scenario S1). This is particularly true for the RAVA-FIRST WSS.</p> <p>Graph</p> <p>Table 3 Power at the genome-wide significance level of 2.5∙10<sups>−6</sups> under the different simulation scenarios using either the classical WSS or the RAVA-FIRST WSS at the scale of either the entire gene or CADD regions.</p> <p> <ephtml> &lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align="center" rowspan="2" style="border-right:thick;border-bottom:thick" /&gt;&lt;th align="center" colspan="2" style="border-right:thick"&gt;By gene&lt;/th&gt;&lt;th align="center" colspan="2"&gt;By CADD regions&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;th align="center" style="border-bottom:thick"&gt;Classical WSS&lt;/th&gt;&lt;th align="center" style="border-right:thick;border-bottom:thick"&gt;RAVA-FIRST WSS&lt;/th&gt;&lt;th align="center" style="border-bottom:thick"&gt;Classical WSS&lt;/th&gt;&lt;th align="center" style="border-bottom:thick"&gt;RAVA-FIRST WSS&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align="center" style="border-right:thick"&gt;S1&lt;/td&gt;&lt;td align="center"&gt;0.409&lt;/td&gt;&lt;td align="center" style="border-right:thick"&gt;0.370&lt;/td&gt;&lt;td align="center"&gt;0.782&lt;/td&gt;&lt;td align="center"&gt;0.701&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" style="border-right:thick"&gt;S2A&lt;/td&gt;&lt;td align="center"&gt;0&lt;/td&gt;&lt;td align="center" style="border-right:thick"&gt;0.431&lt;/td&gt;&lt;td align="center"&gt;0.002&lt;/td&gt;&lt;td align="center"&gt;0.602&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" style="border-right:thick;border-bottom:thick"&gt;S2B&lt;/td&gt;&lt;td align="center" style="border-bottom:thick"&gt;0.408&lt;/td&gt;&lt;td align="center" 
style="border-right:thick;border-bottom:thick"&gt;0.404&lt;/td&gt;&lt;td align="center" style="border-bottom:thick"&gt;0.689&lt;/td&gt;&lt;td align="center" style="border-bottom:thick"&gt;0.706&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" style="border-right:thick"&gt;S3&lt;/td&gt;&lt;td align="center"&gt;0.751&lt;/td&gt;&lt;td align="center" style="border-right:thick"&gt;0.678&lt;/td&gt;&lt;td align="center"&gt;0.512&lt;/td&gt;&lt;td align="center"&gt;0.433&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" style="border-right:thick"&gt;S4A&lt;/td&gt;&lt;td align="center"&gt;0.004&lt;/td&gt;&lt;td align="center" style="border-right:thick"&gt;0.564&lt;/td&gt;&lt;td align="center"&gt;0.012&lt;/td&gt;&lt;td align="center"&gt;0.474&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" style="border-right:thick"&gt;S4B&lt;/td&gt;&lt;td align="center"&gt;0.657&lt;/td&gt;&lt;td align="center" style="border-right:thick"&gt;0.64&lt;/td&gt;&lt;td align="center"&gt;0.39&lt;/td&gt;&lt;td align="center"&gt;0.391&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <hd id="AN0159163981-14">Applications</hd> <p></p> <hd id="AN0159163981-15">RAVA-FIRST analysis</hd> <p>RAVA-FIRST was applied to whole-genome sequence (WGS) data from patients affected by venous thromboembolism (VTE). VTE is a multifactorial disease with a strong genetic component [[<reflink idref="bib31" id="ref44">31</reflink>]]. The age at first VTE event is highly heterogeneous between patients. To study the role of rare variants in VTE age of onset, we used WGS data from 200 individuals of the MARTHA cohort [[<reflink idref="bib32" id="ref45">32</reflink>]]. These individuals were selected among patients with an unprovoked VTE event who had previously been genotyped for a genome-wide association study [[<reflink idref="bib33" id="ref46">33</reflink>]] and who presented no known genetic predisposing factor. 
Individuals were dichotomized according to whether their first VTE event occurred before 50 years of age (early-onset) or at 50 years or later (late-onset). The threshold of 50 years was chosen based on the results of recent studies [[<reflink idref="bib34" id="ref47">34</reflink>]] that hint at genetic heterogeneity between these two groups. Quality control (QC) of the sequencing data was performed using the RAVAQ pipeline [[<reflink idref="bib35" id="ref48">35</reflink>]] (https://gitlab.com/gmarenne/ravaq). After QC, 184 individuals were included in the analysis, 127 presenting an early-onset VTE and 57 a late-onset VTE. Only variants passing all QC steps and with a MAF lower than 1% in the sample were considered in the association tests comparing the early-onset and late-onset groups. For these comparisons, rare variants were gathered either by CADD regions or by using the sliding windows procedure implemented in WGScan [[<reflink idref="bib18" id="ref49">18</reflink>]]. Qualifying variants were selected based on CADD scores using two filtering strategies: (1) a fixed CADD threshold of 15 or (2) the RAVA-FIRST CADD region-specific filtering (applied to ACS). Association was tested using the WSS burden test. When the RAVA-FIRST filtering was used, the corresponding WSS test with sub-scores was applied. Table 4 shows the number of testing units and variants kept under each strategy. For all tests with CADD regions, only testing units containing at least 5 rare variants were kept. WGScan was used with default parameters, i.e. 
with testing units of 5, 10, 15, 25 or 50 kb.</p> <p>Graph</p> <p>Table 4 Number of testing units and variants kept under the three strategies.</p> <p> <ephtml> &lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align="center" /&gt;&lt;th align="center"&gt;Testing units&lt;/th&gt;&lt;th align="center"&gt;Filtering&lt;/th&gt;&lt;th align="center"&gt;Number of testing units&lt;/th&gt;&lt;th align="center"&gt;Number of variants&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align="center"&gt;WGScan Fixed CADD threshold&lt;/td&gt;&lt;td align="center"&gt;Sliding windows&lt;/td&gt;&lt;td align="center"&gt;MAF &amp;#8804; 1% CADD v1.4 &amp;#8805; 15&lt;/td&gt;&lt;td align="center"&gt;377,092&lt;/td&gt;&lt;td align="center"&gt;96,347&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;RAVA-FIRST units (CADD regions) No CADD filtering&lt;/td&gt;&lt;td align="center" rowspan="3"&gt;CADD regions&lt;/td&gt;&lt;td align="center"&gt;MAF &amp;#8804; 1%&lt;/td&gt;&lt;td align="center"&gt;103,439&lt;/td&gt;&lt;td align="center"&gt;9,423,012&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;RAVA-FIRST units (CADD regions) Fixed CADD threshold&lt;/td&gt;&lt;td align="center"&gt;MAF &amp;#8804; 1% CADD v1.4 &amp;#8805; 15&lt;/td&gt;&lt;td align="center"&gt;10,389&lt;/td&gt;&lt;td align="center"&gt;96,294&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;RAVA-FIRST units (CADD regions) RAVA-FIRST filtering&lt;/td&gt;&lt;td align="center"&gt;MAF &amp;#8804; 1% ACS &amp;#8805; median&lt;/td&gt;&lt;td align="center"&gt;95,220&lt;/td&gt;&lt;td align="center"&gt;3,494,327&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>QQ-plots for the WSS tests using those three strategies are shown in Fig 3. As expected, a lower significance threshold is required to reach genome-wide significance with the sliding window procedure due to the higher number of testing units. 
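</p> <p>To make the effect of the number of testing units concrete, the sketch below derives a Bonferroni-style threshold for each strategy from the counts in Table 4. It assumes a family-wise error rate of 0.05 divided by the number of testing units, which is an illustrative convention rather than a threshold stated in the text; the dictionary keys are shorthand labels introduced here.</p>

```python
# Number of testing units per strategy, taken from Table 4.
TESTING_UNITS = {
    "WGScan, CADD >= 15": 377_092,
    "CADD regions, no CADD filtering": 103_439,
    "CADD regions, CADD >= 15": 10_389,
    "CADD regions, RAVA-FIRST filtering": 95_220,
}

ALPHA = 0.05  # assumed family-wise error rate, not stated in the text


def bonferroni_threshold(n_tests, alpha=ALPHA):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests >= 1:
        return alpha / n_tests
    raise ValueError("n_tests must be a positive integer")


thresholds = {name: bonferroni_threshold(n) for name, n in TESTING_UNITS.items()}

for name, thr in thresholds.items():
    print(f"{name}: {thr:.2e}")
```

<p>Under this assumption, the sliding windows analysis requires the most stringent threshold because it has the largest number of testing units.</p> <p>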
Accordingly, the computation time was much lower for the two analyses by CADD regions (6 min when filtering based on a fixed CADD score threshold and 25 min when using the region-specific CADD thresholds) than for the sliding windows procedure (47 min). Our dataset contains fewer than 200 individuals, suggesting that the gain in computation time of CADD regions compared to sliding window procedures would be even greater in larger WGS datasets. No significant result was found when no functional filter was applied nor when selecting variants with a CADD score greater than 15. One association reached borderline significance (p = 6.41∙10<sups>−7</sups>) when using the RAVA-FIRST strategy, i.e. with CADD regions and the corresponding ACS filtering.</p> <p>Graph: Fig 3 QQ-plot of WSS analyses on VTE data using the four strategies of analysis. Early-onset patients (&lt;50 years old) were compared to late-onset patients (≥50 years old).</p> <p>This association maps to R126442, a CADD region of 21 kb on chromosome 18 (18:66788277–66809402) that contains 30 rare variants after RAVA-FIRST filtering. In this region, none of the variants observed in VTE patients or in gnomAD achieved a CADD score above 15. This explains why the association could not be detected by the two other strategies based on a fixed CADD score ≥ 15. The median of CADD scores observed for gnomAD variants in this region is 1.73 and the ACS of selected variants range from 1.82 to 8.50. These observations emphasize the need to adapt thresholds to the genomic region under analysis. Interestingly, only early-onset VTE patients carry qualifying rare variants and have non-null WSS scores (Fig 4 and S3 File). Among early-onset patients, WSS scores also tend to decrease with increasing age of onset. Information about the CADD region R126442 is available in the S3 File. 
Information about individuals (WSS score, age and sex) and variants (position, adjusted CADD score and weight in WSS) is given.</p> <p>Graph: Fig 4 WSS scores in the CADD region depending on the age at first VTE event. The dashed line corresponds to the age of 50 years separating early-onset from late-onset events.</p> <p>To make sure that there was indeed an advantage in using CADD regions compared to random windows over each chromosome, we shuffled the CADD regions on chromosome 18, computed new CADD medians in each region and tested for association again. We repeated this procedure 500 times and looked at the region where the top signal (lowest p-value) is located in each permutation (S3 Fig). We found an enrichment of top signals around R126442, and no other region on the chromosome reached the same significance level. Specifically, the top signal overlaps with R126442 in 34.2% of the replicates and this percentage increases to 62.4% if we consider the top 5 signals. By contrast, the percentage of CADD regions overlapping with R126442 is smaller than 0.1% when looking at the whole p-value spectrum.</p> <p>The CADD region R126442 was then tested for association with 20 biological VTE biomarkers available in MARTHA patients: antithrombin, basophils, eosinophils, Factor VIII, Factor XI, fibrinogen, haematocrit, lymphocytes, mean corpuscular volume, mean platelet volume, monocytes, neutrophils, PAI-1, platelet count, protein C, protein S, prothrombin time, red blood cell count, von Willebrand Factor, and white blood cell count. For this, a linear regression model adjusted for age at sampling and sex was used. At the Bonferroni threshold of 0.0025 (0.05/20), one significant association (p = 7.1∙10<sups>−4</sups>) was observed: VTE patients with a non-null WSS score exhibited decreased haematocrit levels, a surrogate marker of red blood cells (S4 Table). 
A similar trend (p = 4.6∙10<sups>−3</sups>) was observed with red blood cell count.</p> <p>We also investigated the association of the identified region with 376 plasma protein antibodies that were selected to be involved in thrombosis-related processes and that have been previously profiled in MARTHA [[<reflink idref="bib32" id="ref52">32</reflink>], [<reflink idref="bib36" id="ref53">36</reflink>]]. Regression analyses were conducted on log-transformed values of antibodies and were adjusted for age, sex, and three internal control antibodies. To handle the correlation between measured protein antibodies, we used the Li and Ji method [[<reflink idref="bib37" id="ref54">37</reflink>]] to estimate the number of effective independent tests. This number, calculated to be 163, was then used to define a Bonferroni threshold for declaring study-wide statistical significance. Although none reached the study-wide significance level of p = 3.1∙10<sups>−4</sups> after correction for multiple testing, it is worth noting that the two proteins with the strongest associations (marginal association at p &lt; 0.001), procalcitonin tagged by the HPA043700 antibody (p = 7.2∙10<sups>−4</sups>) and PDPK1 tagged by HPA035199 (p = 7.5∙10<sups>−4</sups>), have both been suggested to be involved in red blood cell biology [[<reflink idref="bib38" id="ref55">38</reflink>]].</p> <p>According to ENCODE data, the R126442 CADD region overlaps "intergenic" and "regulatory" categories with one distant enhancer-like signature. To further describe this region, we looked at the positions of topologically associating domains (TADs) in https://dna.cs.miami.edu/TADKB/brows.php in HUVEC and HMEC cell lines, two cell types known to be relevant for VTE pathophysiology. We found that the CADD region is included in the TAD 18:66450000–68150000. By studying TADs described by Lieberman-Aiden et al. 
2009 [[<reflink idref="bib40" id="ref56">40</reflink>]] in other cell lines such as KBM7, K562 or GM12878, we retrieved a TAD with similar positions, giving additional evidence for the presence of this TAD around the CADD region associated with early-onset patients. We then explored this TAD region for the presence of candidate VTE genes whose regulation could be influenced by the enhancer region that maps to our R126442 region. Using the UCSC genome browser [[<reflink idref="bib41" id="ref57">41</reflink>]] integrating information about interactions between GeneHancer regulatory elements and gene expression (see S4 Fig), we identified <emph>CD226</emph> as a strong biological candidate. <emph>CD226</emph> codes for a glycoprotein expressed at the surface of several types of cells, including blood cells, and several studies have shown that it is associated with vascular endothelial dysfunction [[<reflink idref="bib42" id="ref58">42</reflink>]–[<reflink idref="bib44" id="ref59">44</reflink>]]. Genetic variants in <emph>CD226</emph> have also been found to be associated with several blood cell traits including platelets, white blood cells (e.g. neutrophils, eosinophils) [[<reflink idref="bib45" id="ref60">45</reflink>]] and reticulocyte counts [[<reflink idref="bib46" id="ref61">46</reflink>]], another red blood cell biomarker.</p> <hd id="AN0159163981-16">Discussion</hd> <p>Even though whole genome sequencing data are now widely available, rare variant association tests (RVAT) usually remain restricted to the coding parts of the genome. This is explained by the lack of tools to explore rare variant associations outside genes [[<reflink idref="bib11" id="ref62">11</reflink>]]. It is especially difficult to predict the functional consequence of non-coding variants, and it is not currently possible to analyse them in RVAT without using computationally intensive sliding window procedures. 
In this work, we propose RAVA-FIRST, an entirely new strategy for analysing rare variants in the coding and the non-coding genome that leverages functional information. RAVA-FIRST is composed of three steps. First, RAVA-FIRST groups variants observed in cases and controls into new testing units, the so-called "CADD regions". These CADD regions are defined over the entire genome based on CADD scores of variants observed in gnomAD. They are large enough to include a sufficient number of rare variants to allow RVAT. They tend to preserve functional elements, the majority of which are not split into several CADD regions. Second, RAVA-FIRST filters variants based on region-specific adjusted CADD thresholds that allow the selection of the best candidate variants within each region. This filtering approach was found to be more efficient than traditional approaches at discriminating between benign and pathogenic variants within a set of variants. Indeed, our benchmarking study using a set of ClinVar variants compared to 1000 Genomes polymorphisms showed that the other filtering strategies were good at identifying true causal variants (true positive rates were high) but poor at identifying the non-causal variants (true negative rates were low), especially in the coding genome. Both true positive and true negative rates are important to achieve a high percentage of causal variants within testing units, a major driver of power in RVAT, especially in burden tests [[<reflink idref="bib2" id="ref63">2</reflink>], [<reflink idref="bib7" id="ref64">7</reflink>]]. Thus, the RAVA-FIRST filtering strategy is expected to result in an appreciable gain of power. Indeed, RAVA-FIRST makes it possible to keep the most important functional variants within the coding, regulatory and intergenic categories of the genome by adapting the CADD score threshold to the genomic context. 
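</p> <p>The contrast between a fixed threshold and the region-specific filtering can be sketched as follows. The example assumes, as in Table 4, that qualifying variants are those whose adjusted CADD score (ACS) is at least the median score of gnomAD variants in the same CADD region. The function names and all scores are hypothetical, the same scores are used for both filters for simplicity, and the code is not the Ravages implementation.</p>

```python
from statistics import median


def region_specific_filter(sample_acs, gnomad_acs):
    """Keep sample variants whose ACS is at least the region's gnomAD median."""
    threshold = median(gnomad_acs)
    return [score for score in sample_acs if score >= threshold]


def fixed_threshold_filter(sample_scores, threshold=15.0):
    """Keep sample variants whose score is at least a fixed genome-wide threshold."""
    return [score for score in sample_scores if score >= threshold]


# Hypothetical scores for one low-scoring intergenic region.
gnomad_acs = [0.4, 1.1, 1.7, 2.9, 6.0]  # median is 1.7
sample_acs = [0.9, 1.8, 4.2, 8.5]

print(region_specific_filter(sample_acs, gnomad_acs))  # [1.8, 4.2, 8.5]
print(fixed_threshold_filter(sample_acs))              # []
```

<p>In such a region the fixed threshold retains no variant at all, whereas the region-specific threshold still selects the highest-scoring variants relative to the local background.</p> <p>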
Third, RAVA-FIRST includes a burden test that integrates information on genomic categories in the regression model and that, coupled with the region-specific filtering, leads to a better detection of causal variants, should they cluster in one of these genomic categories only. We also showed through simulations that good power levels were maintained using the RAVA-FIRST burden test when causal variants were randomly sampled.</p> <p>RAVA-FIRST was applied to real WGS data from VTE patients, where an accumulation of rare variants in patients with early-onset events was investigated. We did not detect any significant signal using the sliding window procedure or CADD regions when qualifying rare variants were selected based on a minor allele frequency threshold and/or a fixed CADD threshold. However, we detected an association signal using both the grouping and filtering of rare variants proposed in RAVA-FIRST. The associated CADD region is intergenic, contains a predicted enhancer and is surrounded by a TAD containing 5 genes including <emph>CD226</emph>, a strong candidate for blood cell traits that are now well recognized to be key players in VTE pathophysiology [[<reflink idref="bib31" id="ref65">31</reflink>]]. All rare variants in this region present low CADD scores and were not even included in analyses based on a fixed CADD threshold, highlighting the importance of considering the genomic context to detect the most important predicted functional variants within each CADD region. These 31 rare variants are exclusively observed in early-onset cases. Fourteen of these variants are absent from gnomAD, and 10 of the 17 remaining variants have a lower frequency in the gnomAD population than in our sample. This reinforces the value of the association signal in this CADD region, although it should be further described and validated using functional experiments. 
Preliminary investigations that need to be further explored, at both experimental and epidemiological levels, strongly suggest that this region is associated with several inflammatory markers impaired in anaemia of inflammation [[<reflink idref="bib39" id="ref66">39</reflink>], [<reflink idref="bib47" id="ref67">47</reflink>]] and in platelets, both mechanisms being involved in thrombotic processes [[<reflink idref="bib48" id="ref68">48</reflink>]].</p> <p>The RAVA-FIRST approach could be improved in several respects. First, the definition of CADD regions relies on the gnomAD population and on the adjusted CADD threshold. We chose to use the whole gnomAD dataset, but it could be of interest to select only some of the populations to detect population-specific associations that could for example be explained by ancestry-related differential expression patterns [[<reflink idref="bib49" id="ref69">49</reflink>]]. Nevertheless, in classical exome analyses, rare variants are usually filtered based on the maximum frequency observed among multiple populations. Furthermore, CADD regions are not defined for low-covered and non-sequenced genomic regions in gnomAD, and their definition could benefit from the inclusion of data from other large population datasets where these regions are better covered. We also observed that CADD regions close to the centromeres can be very large, possibly due to less accurate annotation scores resulting from only a few high-quality genomes mapping these areas. We therefore recommend cautious interpretation of association signals that would be detected in these regions. To define the regulatory regions of the genome as one of the three genomic categories, we decided to include all genomic elements directly implicated in regulatory functions, but we did not include silencers or lncRNA for example. 
However, the choice of elements to include in the regulatory category will only impact the adjusted CADD scores, which are similar between regulatory and intergenic regions, and will therefore not have a large impact on the definition of CADD regions. As an example, the use of DECRES [[<reflink idref="bib50" id="ref70">50</reflink>]] to predict enhancers and promoters instead of SCREEN results in highly concordant CADD regions, 80% of them being identical. The choice of focusing on variants seen at least twice in gnomAD and with an ACS larger than 20 could also be discussed. This decision was made based on fine tuning to obtain testing units with sizes that were the most compatible with rare variant analysis, but this could also be adapted to the genomic context as we have done by grouping small regions where several variants showed high ACS.</p> <p>By using CADD scores to define the testing units in RAVA-FIRST, we were able to propose a general framework to cover the entire genome. Indeed, while several other predictive tools have been proposed (for example LINSIGHT [[<reflink idref="bib51" id="ref71">51</reflink>]], JARVIS [[<reflink idref="bib52" id="ref72">52</reflink>]] or ORION [[<reflink idref="bib53" id="ref73">53</reflink>]]), only a few provide a score that is variant-specific and defined in both the coding and non-coding parts of the genome. The use of the same framework to define testing units in the whole genome offers several advantages, including the region-specific filtering, which removes the need to select a hard threshold to filter rare variants in RVAT. In addition, the newly defined CADD regions can be used in existing software that requires regions as input parameters [[<reflink idref="bib54" id="ref74">54</reflink>]], making it possible to apply the wide variety of RVAT available in those programs to the whole genome. 
In particular, Bayesian methods, which have been shown to be of great promise in the analysis and filtering of rare variants [[<reflink idref="bib56" id="ref75">56</reflink>]], could be applied beyond genes by using CADD regions.</p> <p>CADD regions represent predefined testing units for RVAT that cover the highest proportion of the genome and have been made publicly available. They are part of a whole new strategy of rare variant analysis in the whole genome, RAVA-FIRST, that further benefits from the integration of functional information both for the filtering of rare variants and their analysis with burden tests. RAVA-FIRST has been implemented in the R package Ravages, available on CRAN and on GitHub, offering an easy and straightforward tool to perform RVAT in the whole genome. We believe that our developments will help researchers to explore the role of genome-wide rare variants in complex diseases, first through the redefinition of testing units in the coding genome, where clusters of causal variants can be found within genes and retrieved using CADD regions [[<reflink idref="bib10" id="ref76">10</reflink>]], and second through the study of non-coding variants, especially intergenic ones, which are currently often excluded from the analysis. Going beyond the gene and the consequences on proteins, RAVA-FIRST will contribute to a better understanding of the biological mechanisms behind complex diseases.</p> <hd id="AN0159163981-17">Supporting information</hd> <p>S1 Fig. Definition of CADD regions and removal of low-covered and non-sequenced regions in gnomAD.</p> <p>(TIF)</p> <p>S2 Fig. Percentage of CADD regions overlapping each of the three genomic categories.</p> <p>(TIF)</p> <p>S3 Fig. Association analysis on VTE data on chromosome 18 by shuffling the CADD regions (500 replicates).</p> <p>(TIF)</p> <p>S4 Fig. 
Screenshot of the TAD 18:66450000–68150000 in the UCSC genome browser containing the CADD region R126442 and a potential enhancer regulating the CD226 gene, a candidate gene in VTE.</p> <p>(TIF)</p> <p>S1 Table. Sources used to get genomic elements for comparisons with CADD regions.</p> <p>(DOCX)</p> <p>S2 Table. Percentage of genomic elements entirely encompassed within a CADD region.</p> <p>(DOCX)</p> <p>S3 Table. Type I error of the classical WSS and the RAVA-FIRST WSS using 5∙10<sups>6</sups> simulations under the null hypothesis.</p> <p>(DOCX)</p> <p>S4 Table. Characteristics of the studied VTE sample.</p> <p>Mean (Standard Deviation) for quantitative variables. Count (%) for qualitative variables.</p> <p>(DOCX)</p> <p>S1 File. Details about the RAVA-FIRST method and its evaluation.</p> <p>(DOCX)</p> <p>S2 File. Variants used for the evaluation of RAVA-FIRST comparing ClinVar pathogenic variants to 1000 Genomes polymorphisms.</p> <p>(XLSX)</p> <p>S3 File. Information on the CADD region R126442 associated with VTE age at onset.</p> <p>Information about individuals (WSS score, age and sex) and variants (position, adjusted CADD score and weight in WSS) is given.</p> <p>(XLSX)</p> <ref id="AN0159163981-18"> <title> Footnotes </title> <blist> <bibl id="bib1" idref="ref1" type="bt">1</bibl> <bibtext> The authors have declared that no competing interests exist.</bibtext> </blist> </ref> <ref id="AN0159163981-19"> <title> References </title> <blist> <bibtext> Madsen BE, Browning SR (2009) A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genetics 5:e1000384. doi: 10.1371/journal.pgen.1000384, 19214210</bibtext> </blist> <blist> <bibl id="bib2" idref="ref2" type="bt">2</bibl> <bibtext> Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X (2011) Rare-variant association testing for sequencing data with the sequence kernel association test. 
Am J Hum Genet 89:82–93. doi: 10.1016/j.ajhg.2011.05.029, 21737059</bibtext> </blist> <blist> <bibl id="bib3" idref="ref3" type="bt">3</bibl> <bibtext> Lee S, Abecasis GR, Boehnke M, Lin X (2014) Rare-variant association analysis: study designs and statistical tests. Am J Hum Genet 95:5–23. doi: 10.1016/j.ajhg.2014.06.009, 24995866</bibtext> </blist> <blist> <bibl id="bib4" idref="ref6" type="bt">4</bibl> <bibtext> Bellenguez C, Charbonnier C, Grenier-Boley B, Quenez O, Le Guennec K, Nicolas G, et al (2017) Contribution to Alzheimer's disease risk of rare variants in TREM2, SORL1, and ABCA7 in 1779 cases and 1273 controls. Neurobiol Aging 59:220.e1–220.e9</bibtext> </blist> <blist> <bibl id="bib5" idref="ref15" type="bt">5</bibl> <bibtext> Shaffer JR, LeClair J, Carlson JC, Feingold E, Buxó CJ, Christensen K, et al (2019) Association of low-frequency genetic variants in regulatory regions with nonsyndromic orofacial clefts. American Journal of Medical Genetics Part A 179:467–474. doi: 10.1002/ajmg.a.61002, 30582786</bibtext> </blist> <blist> <bibl id="bib6" idref="ref7" type="bt">6</bibl> <bibtext> Wang Q, Dhindsa RS, Carss K, Harper AR, Nag A, Tachmazidou I, et al (2021) Rare variant contribution to human disease in 281,104 UK Biobank exomes. Nature. doi: 10.1038/s41586-021-03855-y, 34375979</bibtext> </blist> <blist> <bibl id="bib7" idref="ref9" type="bt">7</bibl> <bibtext> Li B, Leal SM (2008) Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet 83:311–321. doi: 10.1016/j.ajhg.2008.06.024, 18691683</bibtext> </blist> <blist> <bibl id="bib8" idref="ref10" type="bt">8</bibl> <bibtext> Bis JC, Jian X, Kunkle BW, Chen Y, Hamilton-Nelson KL, Bush WS, et al (2018) Whole exome sequencing study identifies novel rare and common Alzheimer's-Associated variants involved in immune response and transcriptional regulation. Mol Psychiatry. 
doi: 10.1038/s41380-018-0112-7, 30108311</bibtext> </blist> <blist> <bibl id="bib9" type="bt">9</bibl> <bibtext> Cirulli ET, White S, Read RW, Elhanan G, Metcalf WJ, Tanudjaja F, et al (2020) Genome-wide rare variant analysis for thousands of phenotypes in over 70,000 exomes from two cohorts. Nat Commun. doi: 10.1038/s41467-020-14288-y, 31992710</bibtext> </blist> <blist> <bibtext> Guey S, Kraemer M, Hervé D, Ludwig T, Kossorotoff M, Bergametti F, et al (2017) Rare RNF213 variants in the C-terminal region encompassing the RING-finger domain are associated with moyamoya angiopathy in Caucasians. Eur J Hum Genet 25:995–1003. doi: 10.1038/ejhg.2017.92, 28635953</bibtext> </blist> <blist> <bibtext> Bocher O, Génin E (2020) Rare variant association testing in the non-coding genome. Hum Genet. doi: 10.1007/s00439-020-02190-y, 32500240</bibtext> </blist> <blist> <bibtext> UK10K Consortium, Walter K, Min JL, Huang J, Crooks L, Memari Y, et al (2015) The UK10K project identifies rare variants in health and disease. Nature 526:82–90. doi: 10.1038/nature14962, 26367797</bibtext> </blist> <blist> <bibtext> Zhang F, Lupski JR (2015) Non-coding genetic variants in human disease. Hum Mol Genet 24:R102–110. doi: 10.1093/hmg/ddv259, 26152199</bibtext> </blist> <blist> <bibtext> Castel SE, Cervera A, Mohammadi P, Aguet F, Reverter F, Wolman A, et al (2018) Modified penetrance of coding variants by cis-regulatory variation contributes to disease risk. Nature Genetics 50:1327–1334. doi: 10.1038/s41588-018-0192-y, 30127527</bibtext> </blist> <blist> <bibtext> Morrison AC, Huang Z, Yu B, Metcalf G, Liu X, Ballantyne C, et al (2017) Practical Approaches for Whole-Genome Sequence Analysis of Heart- and Blood-Related Traits. 
The American Journal of Human Genetics 100:205–215. doi: 10.1016/j.ajhg.2016.12.009, 28089252</bibtext> </blist> <blist> <bibtext> Cochran JN, Geier EG, Bonham LW, Newberry JS, Amaral MD, Thompson ML, et al (2020) Non-coding and Loss-of-Function Coding Variants in TET2 are Associated with Multiple Neurodegenerative Diseases. Am J Hum Genet 106:632–645. doi: 10.1016/j.ajhg.2020.03.010, 32330418</bibtext> </blist> <blist> <bibtext> Li Z, Li X, Liu Y, Shen J, Chen H, Zhou H, et al (2019) Dynamic Scan Procedure for Detecting Rare-Variant Association Regions in Whole-Genome Sequencing Studies. The American Journal of Human Genetics 104:802–814. doi: 10.1016/j.ajhg.2019.03.002, 30982610</bibtext> </blist> <blist> <bibtext> He Z, Xu B, Buxbaum J, Ionita-Laza I (2019) A genome-wide scan statistic framework for whole-genome sequence data analysis. Nature Communications 10:1–11</bibtext> </blist> <blist> <bibtext> Bocher O, Marenne G, Pierre AS, Ludwig TE, Guey S, Tournier-Lasserve E, et al (2019) Rare variant association testing for multicategory phenotype. Genetic Epidemiology 43:646–656. doi: 10.1002/gepi.22210, 31087445</bibtext> </blist> <blist> <bibtext> Bocher O, Marenne G, Tournier-Lasserve E, FREX Consortium, Génin E, Perdry H (2021) Extension of SKAT to multi-category phenotypes through a geometrical interpretation. Eur J Hum Genet. doi: 10.1038/s41431-020-00792-8, 33446828</bibtext> </blist> <blist> <bibtext> Havrilla JM, Pedersen BS, Layer RM, Quinlan AR (2019) A map of constrained coding regions in the human genome. Nature Genetics 51:88–95. doi: 10.1038/s41588-018-0294-6, 30531870</bibtext> </blist> <blist> <bibtext> Genome Aggregation Database Consortium, Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, et al (2020) The mutational constraint spectrum quantified from variation in 141,456 humans. 
Nature 581:434–443 doi: 10.1038/s41586-020-2308-7, 32461654</bibtext> </blist> <blist> <bibtext> Rentzsch P, Witten D, Cooper GM, Shendure J, Kircher M (2019) CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Research 47:D886–D894 doi: 10.1093/nar/gky1016, 30371827</bibtext> </blist> <blist> <bibtext> Nishizaki SS, Boyle AP (2017) Mining the Unknown: Assigning Function to Noncoding Single Nucleotide Polymorphisms. Trends in Genetics 33:34–45 doi: 10.1016/j.tig.2016.10.008, 27939749</bibtext> </blist> <blist> <bibtext> Pujar S, O'Leary NA, Farrell CM, Loveland JE, Mudge JM, Wallin C, et al (2018) Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation. Nucleic Acids Res 46:D221–D228 doi: 10.1093/nar/gkx1031, 29126148</bibtext> </blist> <blist> <bibtext> Barrett LW, Fletcher S, Wilton SD (2012) Regulation of eukaryotic gene expression by the untranslated gene regions and other non-coding elements. Cell Mol Life Sci 69:3613–3634 doi: 10.1007/s00018-012-0990-9, 22538991</bibtext> </blist> <blist> <bibtext> Moore JE, Purcaro MJ, Pratt HE, Epstein CB, Shoresh N, Adrian J, et al (2020) Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583:699–710 doi: 10.1038/s41586-020-2493-4, 32728249</bibtext> </blist> <blist> <bibtext> Landrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, et al (2018) ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res 46:D1062–D1067 doi: 10.1093/nar/gkx1153, 29165669</bibtext> </blist> <blist> <bibtext> The 1000 Genomes Project Consortium (2015) A global reference for human genetic variation. Nature 526:68–74 doi: 10.1038/nature15393, 26432245</bibtext> </blist> <blist> <bibtext> Itan Y, Shang L, Boisson B, Ciancanelli MJ, Markle JG, Martinez-Barricarte R, et al (2016) The mutation significance cutoff: gene-level thresholds for variant predictions. 
Nat Methods 13:109–110 doi: 10.1038/nmeth.3739, 26820543</bibtext> </blist> <blist> <bibtext> Lindström S, Wang L, Smith EN, Gordon W, van Hylckama Vlieg A, de Andrade M, et al (2019) Genomic and transcriptomic association studies identify 16 novel susceptibility loci for venous thromboembolism. Blood 134:1645–1657 doi: 10.1182/blood.2019000435, 31420334</bibtext> </blist> <blist> <bibtext> Razzaq M, Iglesias MJ, Ibrahim-Kosta M, Goumidi L, Soukarieh O, Proust C, et al (2021) An artificial neural network approach integrating plasma proteomics and genetic data identifies PLXNA4 as a new susceptibility locus for pulmonary embolism. Sci Rep 11:14015 doi: 10.1038/s41598-021-93390-7, 34234248</bibtext> </blist> <blist> <bibtext> Germain M, Saut N, Greliche N, Dina C, Lambert J-C, Perret C, et al (2011) Genetics of Venous Thrombosis: Insights from a New Genome Wide Association Study. PLOS ONE 6:e25581 doi: 10.1371/journal.pone.0025581, 21980494</bibtext> </blist> <blist> <bibtext> Roupie A-L, Dossier A, Goulenok T, Perozziello A, Papo T, Sacre K (2016) First venous thromboembolism in admitted patients younger than 50 years old. European Journal of Internal Medicine 34:e18–e20 doi: 10.1016/j.ejim.2016.05.013, 27230786</bibtext> </blist> <blist> <bibtext> Marenne G, Ludwig TE, Bocher O, Herzig AF, Aloui C, Tournier-Lasserve E, Génin E. RAVAQ: An integrative pipeline from quality control to region-based rare variant association analysis. Genetic Epidemiology. https://doi.org/10.1002/gepi.22450</bibtext> </blist> <blist> <bibtext> Razzaq M, Goumidi L, Iglesias M-J, Munsch G, Bruzelius M, Ibrahim-Kosta M, et al (2021) Explainable Artificial Neural Network for Recurrent Venous Thromboembolism Based on Plasma Proteomics. In: Cinquemani E, Paulevé L (eds) Computational Methods in Systems Biology. 
Springer International Publishing, Cham, pp 108–121</bibtext> </blist> <blist> <bibtext> Li J, Ji L (2005) Adjusting multiple testing in multilocus analyses using the eigenvalues of a correlation matrix. Heredity 95:221–227 doi: 10.1038/sj.hdy.6800717, 16077740</bibtext> </blist> <blist> <bibtext> Cokic VP, Bhattacharya B, Beleslin-Cokic BB, Noguchi CT, Puri RK, Schechter AN (2012) JAK-STAT and AKT pathway-coupled genes in erythroid progenitor cells through ontogeny. J Transl Med 10:116 doi: 10.1186/1479-5876-10-116, 22676255</bibtext> </blist> <blist> <bibtext> Weiss G, Ganz T, Goodnough LT (2019) Anemia of inflammation. Blood 133:40–50 doi: 10.1182/blood-2018-06-856500, 30401705</bibtext> </blist> <blist> <bibtext> Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, et al (2009) Comprehensive mapping of long range interactions reveals folding principles of the human genome. Science 326:289–293 doi: 10.1126/science.1181369, 19815776</bibtext> </blist> <blist> <bibtext> Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D (2002) The Human Genome Browser at UCSC. Genome Res 12:996–1006 doi: 10.1101/gr.229102, 12045153</bibtext> </blist> <blist> <bibtext> Chen L, Xie X, Zhang X, Jia W, Jian J, Song C, Jin B (2003) The expression, regulation and adhesion function of a novel CD molecule, CD226, on human endothelial cells. Life Sci 73:2373–2382 doi: 10.1016/s0024-3205(03)00606-4, 12941439</bibtext> </blist> <blist> <bibtext> Kojima H, Kanada H, Shimizu S, Kasama E, Shibuya K, Nakauchi H, et al (2003) CD226 mediates platelet and megakaryocytic cell adhesion to vascular endothelial cells. J Biol Chem 278:36748–36753 doi: 10.1074/jbc.M300702200, 12847109</bibtext> </blist> <blist> <bibtext> Zhou S, Xie J, Yu C, Feng Z, Cheng K, Ma J, et al (2021) CD226 deficiency promotes glutaminolysis and alleviates mitochondria damage in vascular endothelial cells under hemorrhagic shock. 
FASEB J 35:e21998 doi: 10.1096/fj.202101134R, 34669985</bibtext> </blist> <blist> <bibtext> Chen M-H, Raffield LM, Mousas A, Sakaue S, Huffman JE, Moscati A, et al (2020) Trans-ethnic and Ancestry-Specific Blood-Cell Genetics in 746,667 Individuals from 5 Global Populations. Cell 182:1198–1213.e14 doi: 10.1016/j.cell.2020.06.045, 32888493</bibtext> </blist> <blist> <bibtext> Vuckovic D, Bao EL, Akbari P, Lareau CA, Mousas A, Jiang T, et al (2020) The Polygenic and Monogenic Basis of Blood Traits and Diseases. Cell 182:1214–1231.e11 doi: 10.1016/j.cell.2020.08.008, 32888494</bibtext> </blist> <blist> <bibtext> Nemeth E, Ganz T (2014) Anemia of inflammation. Hematol Oncol Clin North Am 28:671–681, vi doi: 10.1016/j.hoc.2014.04.005, 25064707</bibtext> </blist> <blist> <bibtext> Wagner DD, Burger PC (2003) Platelets in inflammation and thrombosis. Arterioscler Thromb Vasc Biol 23:2131–2137 doi: 10.1161/01.ATV.0000095974.95122.EC, 14500287</bibtext> </blist> <blist> <bibtext> Halachev M, Meynert A, Taylor MS, Vitart V, Kerr SM, Klaric L, et al (2019) Increased ultra-rare variant load in an isolated Scottish population impacts exonic and regulatory regions. PLoS Genet 15:e1008480 doi: 10.1371/journal.pgen.1008480, 31765389</bibtext> </blist> <blist> <bibtext> Li Y, Shi W, Wasserman WW (2018) Genome-wide prediction of cis-regulatory regions using supervised deep learning methods. BMC Bioinformatics 19:202 doi: 10.1186/s12859-018-2187-1, 29855387</bibtext> </blist> <blist> <bibtext> Huang Y-F, Gulko B, Siepel A (2017) Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data. Nature Genetics 49:618–624 doi: 10.1038/ng.3810, 28288115</bibtext> </blist> <blist> <bibtext> Vitsios D, Dhindsa RS, Middleton L, Gussow AB, Petrovski S (2021) Prioritizing non-coding regions based on human genomic constraint and sequence context with deep learning. 
Nat Commun 12:1504 doi: 10.1038/s41467-021-21790-4, 33686085</bibtext> </blist> <blist> <bibtext> Gussow AB, Copeland BR, Dhindsa RS, Wang Q, Petrovski S, Majoros WH, et al (2017) Orion: Detecting regions of the human non-coding genome that are intolerant to variation using population genetics. PLOS ONE 12:e0181604 doi: 10.1371/journal.pone.0181604, 28797091</bibtext> </blist> <blist> <bibtext> Zhan X, Hu Y, Li B, Abecasis GR, Liu DJ (2016) RVTESTS: an efficient and comprehensive tool for rare variant association analysis using sequence data. Bioinformatics 32:1423–1426 doi: 10.1093/bioinformatics/btw079, 27153000</bibtext> </blist> <blist> <bibtext> Baskurt Z, Mastromatteo S, Gong J, Wintle RF, Scherer SW, Strug LJ (2020) VikNGS: a C++ variant integration kit for next generation sequencing association analysis. Bioinformatics 36:1283–1285 doi: 10.1093/bioinformatics/btz716, 31580400</bibtext> </blist> <blist> <bibtext> Quintana MA, Berstein JL, Thomas DC, Conti DV (2011) Incorporating model uncertainty in detecting rare variants: the Bayesian risk index. Genetic Epidemiology 35:638–649 doi: 10.1002/gepi.20613, 22009789</bibtext> </blist> <blist> <bibtext> Greene D, Richardson S, Turro E (2017) A Fast Association Test for Identifying Pathogenic Variants Involved in Rare Diseases. The American Journal of Human Genetics 101:104–114 doi: 10.1016/j.ajhg.2017.05.015, 28669401</bibtext> </blist> </ref> <aug> <p>By Ozvan Bocher; Thomas E.
Ludwig; Marie-Sophie Oglobinsky; Gaëlle Marenne; Jean-François Deleuze; Suryakant Suryakant; Jacob Odeberg; Pierre-Emmanuel Morange; David-Alexandre Trégouët; Hervé Perdry and Emmanuelle Génin</p> </aug>
Header	DbId: edsdoj DbLabel: Directory of Open Access Journals An: edsdoj.47dcd592d87c4bea91b6dd59452c4a97 AccessLevel: 3 PubType: Academic Journal PubTypeId: academicJournal
Items	– Name: Title Label: Title Group: Ti Data: Testing for association with rare variants in the coding and non-coding genome: RAVA-FIRST, a new approach based on CADD deleteriousness score. – Name: Author Label: Authors Group: Au Data: Ozvan Bocher; Thomas E Ludwig; Marie-Sophie Oglobinsky; Gaëlle Marenne; Jean-François Deleuze; Suryakant Suryakant; Jacob Odeberg; Pierre-Emmanuel Morange; David-Alexandre Trégouët; Hervé Perdry; Emmanuelle Génin – Name: TitleSource Label: Source Group: Src Data: PLoS Genetics, Vol 18, Iss 9, p e1009923 (2022) – Name: Publisher Label: Publisher Information Group: PubInfo Data: Public Library of Science (PLoS), 2022. 
– Name: DatePubCY Label: Publication Year Group: Date Data: 2022 – Name: Subset Label: Collection Group: HoldingsInfo Data: LCC:Genetics – Name: Subject Label: Subject Terms Group: Su Data: Genetics; QH426-470 – Name: Abstract Label: Description Group: Ab Data: Rare variant association tests (RVAT) have been developed to study the contribution of rare variants widely accessible through high-throughput sequencing technologies. RVAT require to aggregate rare variants in testing units and to filter variants to retain only the most likely causal ones. In the exome, genes are natural testing units and variants are usually filtered based on their functional consequences. However, when dealing with whole-genome sequence (WGS) data, both steps are challenging. No natural biological unit is available for aggregating rare variants. Sliding windows procedures have been proposed to circumvent this difficulty, however they are blind to biological information and result in a large number of tests. We propose a new strategy to perform RVAT on WGS data: "RAVA-FIRST" (RAre Variant Association using Functionally-InfoRmed STeps) comprising three steps. (1) New testing units are defined genome-wide based on functionally-adjusted Combined Annotation Dependent Depletion (CADD) scores of variants observed in the gnomAD populations, which are referred to as "CADD regions". (2) A region-dependent filtering of rare variants is applied in each CADD region. (3) A functionally-informed burden test is performed with sub-scores computed for each genomic category within each CADD region. Both on simulations and real data, RAVA-FIRST was found to outperform other WGS-based RVAT. Applied to a WGS dataset of venous thromboembolism patients, we identified an intergenic region on chromosome 18 enriched for rare variants in early-onset patients. 
This region that was missed by standard sliding windows procedures is included in a TAD region that contains a strong candidate gene. RAVA-FIRST enables new investigations of rare non-coding variants in complex diseases, facilitated by its implementation in the R package Ravages. – Name: TypeDocument Label: Document Type Group: TypDoc Data: article – Name: Format Label: File Description Group: SrcInfo Data: electronic resource – Name: Language Label: Language Group: Lang Data: English – Name: ISSN Label: ISSN Group: ISSN Data: 1553-7390; 1553-7404 – Name: NoteTitleSource Label: Relation Group: SrcInfo Data: https://doaj.org/toc/1553-7390; https://doaj.org/toc/1553-7404 – Name: DOI Label: DOI Group: ID Data: 10.1371/journal.pgen.1009923 – Name: URL Label: Access URL Group: URL Data: https://doaj.org/article/47dcd592d87c4bea91b6dd59452c4a97 – Name: AN Label: Accession Number Group: ID Data: edsdoj.47dcd592d87c4bea91b6dd59452c4a97
RecordInfo	BibRecord: BibEntity: Identifiers: – Type: doi Value: 10.1371/journal.pgen.1009923 Languages: – Text: English PhysicalDescription: Pagination: StartPage: e1009923 Subjects: – SubjectFull: Genetics Type: general – SubjectFull: QH426-470 Type: general Titles: – TitleFull: Testing for association with rare variants in the coding and non-coding genome: RAVA-FIRST, a new approach based on CADD deleteriousness score. Type: main BibRelationships: HasContributorRelationships: – PersonEntity: Name: NameFull: Ozvan Bocher – PersonEntity: Name: NameFull: Thomas E Ludwig – PersonEntity: Name: NameFull: Marie-Sophie Oglobinsky – PersonEntity: Name: NameFull: Gaëlle Marenne – PersonEntity: Name: NameFull: Jean-François Deleuze – PersonEntity: Name: NameFull: Suryakant Suryakant – PersonEntity: Name: NameFull: Jacob Odeberg – PersonEntity: Name: NameFull: Pierre-Emmanuel Morange – PersonEntity: Name: NameFull: David-Alexandre Trégouët – PersonEntity: Name: NameFull: Hervé Perdry – PersonEntity: Name: NameFull: Emmanuelle Génin IsPartOfRelationships: – BibEntity: Dates: – D: 01 M: 09 Type: published Y: 2022 Identifiers: – Type: issn-print Value: 15537390 – Type: issn-print Value: 15537404 Numbering: – Type: volume Value: 18 – Type: issue Value: 9 Titles: – TitleFull: PLoS Genetics Type: main