Network dependence

  1. Vainora, Julius
Dirixida por:
  1. Miguel Angel Delgado González Director

Universidade de defensa: Universidad Carlos III de Madrid

Fecha de defensa: 27 de xullo de 2020

Tribunal:
  1. Wenceslao González Manteiga Presidente
  2. Carlos Velasco Gómez Secretario/a
  3. Gábor Lugosi Vogal

Tipo: Tese

Resumo

The thesis considers the problem of making valid inferences with network data taking into account that, in general, in socio-economic networks there exists no ordering of vertices, making direct extensions of spatial or time series methods unsatisfactory. It is natural to assume that data is dependent when the underlying entities form a network, but it is far from obvious what exactly is the relation between network topology and data dependence. While accounting for dependence between network entity attributes is instrumental for valid inferences in econometrics, it has received little formal treatment and tends to be neglected. Although certain methods of networks in econometrics are fairly well established by now, they are likely to lack flexibility due to being extensions of methods for spatial and time series data. The first contribution of this thesis is developing a formal three-component framework for network-dependent data in Chapter 1 consequently giving rise to new ways of making inferences. It is assumed that there are K entities of interest, such as individuals, firms, crimes, municipalities, or countries. The first component of the proposed framework is a network defined as a pair of a random graph and a vector of entity characteristics. As their characteristic each entity has assigned a single random variable such as household income or country gross domestic product in a given year. The random graph can be seen as a representation of a relevant socio-economic phenomenon concerned with a given set of entities. In this thesis we consider only undirected and unweighted random graphs, where loops or multiple edges are not allowed. The second framework component is a classifier function --- a bivariate graph statistic assigning a class to any two vertices relative to a given graph. The classifier serves as a generalization of the length of a shortest path distance and can be seen as a function determining the network ``dependence direction''. For illustrational purposes, in this thesis we consider classifiers corresponding to the number of common neighbours, absolute degrees difference, and the length of a shortest path. We also provide an example data generating process for each classifier and shed light on the validity of related assumptions outlined in the thesis. The network stationarity definition is the last component creating a bridge between the topology of the random graph and the dependence between the entity characteristics. In particular, network covariance stationarity assumes that the covariance between two entity characteristics conditionally on the whole graph equals the covariance conditionally only on the class of the corresponding two entities. Further, the latter conditional covariance function is the same across all pairs of entities and is the network autocovariance function of interest. Hence, if two pairs of vertices belong to the same class, then their conditional covariances will coincide. We extend this notion to network mean stationarity and strong stationarity. As one of the main applications, the proposed methodology provides an alternative cross-sectional dependence modelling approach. Proposed estimators are easily implementable in practice and require only a single network observation. The network stationarity definition encompasses dependence structures such as uncorrelatedness, clustering, time series, and panel data stationarity. Further, an asymptotic theory for networks is developed in Chapter 2. It starts with a discussion of key concepts such as superpopulation model and infinite-population inference. We consider complete-data generating processes allowing to observe full networks as opposed to incomplete-data generating process. Hence, each network could be seen as a finite population with its calculated, say, mean and variance then being nonrandom. However, we consider a superpopulation approach, where each finite population is seen to be an outcome of a superpopulation model with superpopulation parameters that are the objects of interest in estimation. We focus on inferences based on a single growing network, which is particularly relevant in practice, where multiple realizations are rare. To restrict the degree of dependence we apply conditional strong mixing and develop its analogue for networks. In particular, instead of considering pairwise distances between entities in some Euclidean space, akin to random fields, conditional mixing coefficients are redefined to use pairwise classes on a graph, in this way taking into account network interactions. We prove two laws of large numbers (one for the constant network mean and one for the case when the conditional mean depends on the entity class), an autocovariance function consistency result, and a central limit theorem, where a large portion of assumptions is random graph regularity conditions. For instance, one of the assumptions requires that the size of a given class would diverge in probability, while another assumption requires the order of the second moment of the class size to be not higher than any of its positive quantiles. The results and their proof strategy can be seen as an extension of those of random fields. Lastly, we consider a set of theoretical and empirical applications. We first briefly discuss the consistency and asymptotic normality of the ordinary least squares estimator under network dependence in the model errors. Next we proceed with two alternatives for making inferences --- a feasible generalized least squares estimator with more stringent assumptions and a network-robust estimator for standard errors, which provides valid results under infinitely many true, yet unknown, classifiers, and is a simple and direct extension of cluster-robust inferences. We illustrate the latter estimator with microfinance data from Indian villages using several classifiers. Chapter 3 introduces a new concept of network dependence counterfactuals. In particular, suppose that one is dealing with a stationary time series or a stationary spatial process having autocovariance functions characterized by distance in time or space. Then introducing new points does not lead to any changes in the dependence between the already existing points since none of the pairwise distances changed, i.e., typically there are no interactions between points in time or space. Meanwhile, introducing a single new vertex to a network can create multiple new connections and paths. That is, as a result of a single new network entity, the entire network dependence structure may change, and the magnitude of this change is no longer immediately obvious or irrelevant, as it is in time series or spatial data. Noteworthy, such counterfactual outcomes are unique to network data and are very distinct from the classical ones by being multivariate, not involving any two distinct groups or treatment effects, and by dealing with counterfactual variances and covariances rather than counterfactual mean level, among other aspects. Thus, the notion of network dependence counterfactuals is another important contribution of this thesis. We exploit the network stationarity assumption allowing to estimate, given a single network observation, the counterfactual covariance between outcome variables of any two network entities under a given hypothetical network structure. As applications, we suggest ways to measure influence of any combination of sets of vertices and edges; how this idea can be further applied to prediction on networks as to make inferences about unobserved state variables; ways to measure network robustness as the ability to withstand failures and perturbations; how to map the observed unweighted graph into a weighted one; and ways to generate a sequence of future vertices and their connections that would improve the overall network dependence flow. In most of the thesis chapters the classifier is unspecified or equal to some explicit bivariate graph statistic for the sake of an example. In practice, however, one is typically unable to provide a closed-form expression for a valid classifier due to the complexity of networks. The goal then is to develop a flexible class of data-driven classifiers able to automatically uncover complex dependence patterns. Chapter 4 makes first steps to address a practical issue that manually specified classifiers are likely to be too restrictive to accurately capture complex patterns in network dependence. We provide motivation for the importance of the issue and introduce machine learning algorithms as a possible solution. In particular, we focus on, and briefly review, a range of graph embedding algorithms. We outline a general methodology of applying graph embedding algorithms to estimate a classifier and provide a further extension to network dependence counterfactuals. Importantly, as long as the graph embedding algorithm at hand can handle directed and weighted graphs, this approach provides an additional benefit over explicitly specified classifiers making it challenging to incorporate edge weights and directions. We finish the chapter by an empirical application using adjacency spectral embedding and the microfinance data of the largest village.