Statistical tools for network data: Prediction and resampling

Citation

Li, Tianxi (2018). Statistical tools for network data: Prediction and resampling.

Abstract

Advances in data collection and social media have led to more and more network data appearing in diverse areas, such as the social sciences, the internet, transportation, and biology. This thesis develops new principled statistical tools for network analysis, with emphasis on both appealing statistical properties and computational efficiency. Our first project focuses on building prediction models for network-linked data. Prediction algorithms typically assume the training data are independent samples, but in many modern applications samples come from individuals connected by a network. For example, in adolescent health studies of risk-taking behaviors, information on the subjects’ social network is often available and plays an important role through network cohesion, the empirically observed phenomenon of friends behaving similarly. Taking cohesion into account in prediction models should allow us to improve their performance. We propose a network-based penalty on individual node effects to encourage similarity between predictions for linked nodes, and show that incorporating it into prediction leads to improvement over traditional models both theoretically and empirically when network cohesion is present. The penalty can be used with many loss-based prediction methods, such as regression, generalized linear models, and Cox’s proportional hazards model. Applications to predicting levels of recreational activity and marijuana usage among teenagers from the AddHealth study based on both demographic covariates and friendship networks are discussed in detail. We show that our approach to taking friendships into account can significantly improve predictions of behavior while providing interpretable estimates of covariate effects. Resampling, data splitting, and cross-validation are powerful general strategies in statistical inference, but resampling from a network remains a challenging problem.
Many statistical models and methods for networks need model selection and tuning parameters, which could be done by cross-validation if we had a good method for splitting network data; however, splitting network nodes into groups requires deleting edges and destroys some of the structure. Here we propose a new network cross-validation strategy based on splitting edges rather than nodes, which avoids losing information and is applicable to a wide range of network models. We provide a theoretical justification for our method in a general setting and demonstrate how our method can be used in a number of specific model selection and parameter tuning tasks, with extensive numerical results on simulated networks demonstrating its competitiveness with task-specific methods. We also apply the method to analysis of a citation network of statisticians and obtain meaningful research communities. Finally, we consider the problem of community detection on partially observed networks. Communities are one important type of structure in networks and they have been widely studied. However, in practice, network data are often collected through sampling mechanisms, such as survey questionnaires, instead of direct observation. The noise and bias introduced by such sampling mechanisms can obscure the community structure and invalidate the assumptions of standard community detection methods. We propose a model to incorporate neighborhood sampling, through a model reflective of survey designs, into community detection for directed networks, since friendship networks obtained from surveys are naturally directed. We model the edge sampling probabilities as a function of both individual preferences and community parameters, and fit the model by a combination of spectral clustering and the method of moments. The algorithm is computationally efficient and comes with a theoretical guarantee of consistency.
We evaluate the proposed model in extensive simulation studies and apply it to a faculty hiring dataset, discovering a meaningful hierarchy of communities among US business schools.
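
The abstract's network-based penalty on individual node effects can be illustrated with a small sketch. The abstract does not state the penalty's form, so the code below assumes one common choice: linear regression with a per-node intercept alpha and a quadratic graph-Laplacian penalty lam * alpha' L alpha, which equals lam times the sum of squared differences of alpha across edges. The function names, the parameters `lam` and `gamma`, and the small ridge term that keeps the linear system well conditioned are all assumptions made for illustration, not the thesis's implementation.

```python
import numpy as np


def graph_laplacian(adjacency):
    """Unnormalized Laplacian L = D - A of an undirected graph."""
    adjacency = np.asarray(adjacency, dtype=float)
    return np.diag(adjacency.sum(axis=1)) - adjacency


def fit_cohesion_regression(X, y, adjacency, lam=1.0, gamma=1e-8):
    """Minimize ||y - alpha - X beta||^2 + lam * alpha' (L + gamma I) alpha.

    alpha holds one effect per node; the penalty pulls the effects of
    linked nodes toward each other. Returns (alpha, beta).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    # Hypothetical regularization: gamma * I guards against a singular system.
    L = graph_laplacian(adjacency) + gamma * np.eye(n)

    # Stack the node effects and covariates into one design matrix [I, X].
    Z = np.hstack([np.eye(n), X])

    # Penalize only the node-effect block; beta is left unpenalized.
    penalty = np.zeros((n + p, n + p))
    penalty[:n, :n] = lam * L

    # Solve the penalized normal equations in closed form.
    theta = np.linalg.solve(Z.T @ Z + penalty, Z.T @ y)
    return theta[:n], theta[n:]


if __name__ == "__main__":
    # Toy path graph 0 - 1 - 2 with one covariate (made-up numbers).
    A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
    X = np.array([[0.5], [-1.0], [2.0]])
    y = np.array([1.0, 2.0, 4.0])
    alpha, beta = fit_cohesion_regression(X, y, A, lam=1.0)
    print("node effects:", alpha)
    print("covariate effect:", beta)
```

As `lam` grows, the node effects on a connected graph are forced toward a common value, so the fit approaches ordinary regression with a single intercept; as `lam` shrinks, each node effect is free to absorb its own residual.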

URL

https://deepblue.lib.umich.edu/bitstream/handle/2027.42/145894/tianxili_1.pdf?sequence=1&isAllowed=y

Reference Type

Thesis/Dissertation

Book Title

Statistics

Author(s)

Li, Tianxi

Series Author(s)

Levina, Liza

Year Published

2018

Volume Number

PhD

Pages

189

Publisher

University of Michigan

Reference ID

9358