Inferring a gene regulatory network (GRN) from gene expression data is a computationally expensive task, exacerbated by increasing data sizes due to advances in high-throughput gene profiling technology.

The Arboreto software library addresses this issue by providing a computational strategy for running the class of GRN inference algorithms exemplified by GENIE3 [1] on hardware ranging from a single computer to a multi-node compute cluster. Algorithms in this class perform one step per target gene in the dataset: a regression model is trained to predict the target gene’s expression profile from a set of candidate regulators, and the most important regulators are then read off that model.
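
The per-target step described above can be sketched in a few lines of plain Python. This is an illustration only: it swaps the tree-based regressors for a univariate least-squares fit per regulator and uses the squared Pearson correlation as the importance score. The function names (`pearson_r2`, `infer_links`) and the toy expression values are hypothetical and not part of Arboreto.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation between two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no information
    return (sxy * sxy) / (sxx * syy)


def infer_links(expression, regulators):
    """One step per target gene: score every candidate regulator for that target."""
    links = []
    for target, target_profile in expression.items():
        for tf in regulators:
            if tf == target:
                continue  # a gene is not scored as its own regulator
            # Stand-in importance (assumption): R^2 of a one-regulator linear fit.
            score = pearson_r2(expression[tf], target_profile)
            links.append((tf, target, score))
    # Rank all regulator-target links by decreasing importance.
    return sorted(links, key=lambda link: link[2], reverse=True)


# Toy expression profiles (hypothetical values), one list per gene.
expression = {
    "TF1": [1.0, 2.0, 3.0, 4.0],
    "TF2": [2.0, 1.0, 4.0, 3.0],
    "G1": [2.0, 4.0, 6.0, 8.0],  # follows TF1 exactly
    "G2": [3.0, 1.0, 2.0, 2.0],
}
links = infer_links(expression, regulators=["TF1", "TF2"])
```

Because each target gene is handled independently of the others, the outer loop is the part that can be distributed across workers.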

Members of this class of GRN inference algorithms are attractive from a computational point of view because they are parallelizable by nature: the regression for one target gene does not depend on the regression for any other. In Arboreto, we specify the parallelizable computation as a Dask graph [2], a data structure that represents the task schedule of a computation. A Dask scheduler assigns the tasks in a Dask graph to the available computational resources. Arboreto uses the Dask distributed scheduler to spread the computational tasks over multiple processes running on one or more machines.
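
To make the idea of a task graph concrete, the sketch below writes a small graph in the dictionary form that Dask documents (keys mapped either to data or to `(function, *arguments)` tuples) and walks it with a toy recursive evaluator. The evaluator and the functions `regress_target` and `concat_links` are hypothetical stand-ins. A real Dask scheduler would also run independent tasks in parallel, and this is not necessarily how Arboreto builds its graph internally.

```python
def regress_target(target, regulators):
    """Placeholder for one per-target regression; returns (TF, target, importance) links."""
    return [(tf, target, 1.0) for tf in regulators if tf != target]


def concat_links(*parts):
    """Merge the per-target link lists into one network."""
    return [link for part in parts for link in part]


# Dask-style graph: each key maps to data or to a (callable, *args) task tuple.
# String arguments that are keys of the graph refer to the result of that key.
graph = {
    "regulators": ["TF1", "TF2"],
    "links-G1": (regress_target, "G1", "regulators"),
    "links-G2": (regress_target, "G2", "regulators"),
    "links-TF1": (regress_target, "TF1", "regulators"),
    "network": (concat_links, "links-G1", "links-G2", "links-TF1"),
}


def evaluate(graph, key):
    """Toy sequential evaluator (assumption: not Dask's scheduler)."""
    node = graph[key]
    if isinstance(node, tuple) and node and callable(node[0]):
        func, *args = node
        resolved = [
            evaluate(graph, arg) if isinstance(arg, str) and arg in graph else arg
            for arg in args
        ]
        return func(*resolved)
    return node  # plain data


network = evaluate(graph, "network")
```

The three `links-*` tasks share no dependencies on each other, which is what lets a scheduler hand them to different workers.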

Arboreto currently supports two GRN inference algorithms:

GRNBoost2: the Arboreto flagship algorithm, a fast GRN inference algorithm using stochastic Gradient Boosting Machine [3] regression with early-stopping regularization.
GENIE3: the popular classic GRN inference algorithm using Random Forest (RF) or ExtraTrees (ET) regression.
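
Early-stopping regularization, mentioned for GRNBoost2 above, means that boosting rounds stop being added once they no longer improve the model. The sketch below shows one generic way to express such a rule. The stopping criterion (mean improvement over a trailing window), the window length, the improvement values, and the function name `rounds_until_stop` are all assumptions for illustration, not Arboreto's exact criterion.

```python
def rounds_until_stop(improvements, window=3):
    """Return how many boosting rounds are kept before stopping.

    `improvements` holds the estimated improvement contributed by each round.
    Boosting stops after the first round at which the mean improvement over
    the trailing `window` rounds is no longer positive.
    """
    for n in range(window, len(improvements) + 1):
        recent = improvements[n - window:n]
        if sum(recent) / window <= 0:
            return n
    return len(improvements)  # never stalled: keep every round


# Hypothetical per-round improvements that level off and turn negative.
improvements = [0.9, 0.5, 0.2, 0.05, -0.01, -0.03, -0.02, 0.01]
kept = rounds_until_stop(improvements, window=3)
```

Averaging over a window rather than reacting to a single bad round makes the rule less sensitive to noise in the per-round estimates.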

Usage Example

# import python modules
import pandas as pd
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2

if __name__ == '__main__':
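    # load the data; replace the '<ex_path>' and '<tf_path>' placeholders
    # with the paths to your expression matrix and transcription factor list
    ex_matrix = pd.read_csv('<ex_path>', sep='\t')
    tf_names = load_tf_names('<tf_path>')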

    # infer the gene regulatory network
    network = grnboost2(expression_data=ex_matrix,
                        tf_names=tf_names)

    # show the first few regulatory links
    print(network.head())

TF	target	importance
G109	G1406	151.648784
G16	G1440	136.741815
G188	G938	124.707570
G10	G1312	124.195566
G48	G1419	121.488200
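
The result is a table of regulator-target links with an importance score per link. As a small illustration of working with it, the sketch below parses the five rows shown above using only the standard library and keeps the links at or above a cutoff. The cutoff of 124.5 and the variable names are arbitrary choices for this example, not a recommendation from Arboreto.

```python
import csv
import io

# The five example rows from the output above, as tab-separated text.
rows_text = "\n".join([
    "TF\ttarget\timportance",
    "G109\tG1406\t151.648784",
    "G16\tG1440\t136.741815",
    "G188\tG938\t124.707570",
    "G10\tG1312\t124.195566",
    "G48\tG1419\t121.488200",
])

reader = csv.DictReader(io.StringIO(rows_text), delimiter="\t")
links = [(row["TF"], row["target"], float(row["importance"])) for row in reader]

# Keep only links whose importance reaches the (hypothetical) cutoff.
threshold = 124.5
strong_links = [link for link in links if link[2] >= threshold]

# Group the remaining targets by their regulator.
targets_by_tf = {}
for tf, target, _ in strong_links:
    targets_by_tf.setdefault(tf, []).append(target)
```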

Check out more examples.

License

BSD 3-Clause License

References

[1]	Huynh-Thu, V. A., Irrthum, A., Wehenkel, L., & Geurts, P. (2010). Inferring regulatory networks from expression data using tree-based methods. PLoS ONE, 5(9), e12776.

[2]	Rocklin, M. (2015). Dask: parallel computation with blocked algorithms and task scheduling. In Proceedings of the 14th Python in Science Conference (pp. 130-136).

[3]	Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4), 367-378.

[4]	Marbach, D., Costello, J. C., Küffner, R., Vega, N. M., Prill, R. J., Camacho, D. M., … & DREAM5 Consortium. (2012). Wisdom of crowds for robust gene network inference. Nature Methods, 9(8), 796-804.