PLoS One. 2021; 16(3): e0248573.
On using centrality to understand importance of entities in the Panama Papers
Mayank Kejriwal
Information Sciences Institute, University of Southern California, Marina del Rey, CA, United States
Hocine Cherifi, Editor
Received 2020 December 6; Accepted 2021 Mar 1.
Abstract
The Panama Papers comprise one of the most recent influential leaks, containing detailed information on intermediary companies (such as law firms), offshore entities and company officers, and serve as a valuable source of insight into the operations of (approximately) 214,000 shell companies incorporated in tax havens around the globe over the past half century. Entities and relations in the papers can be used to construct a network that permits, in principle, a systematic and scientific study at scale using techniques developed in the computational social science and network science communities. In this paper, we propose such a study by attempting to quantify and profile the importance of entities. In particular, our research explores whether intermediaries are significantly more influential than offshore entities, and whether different centrality measures lead to varying, or even incompatible, conclusions. Some findings yield conclusions that resemble Simpson's paradox. We also explore the role that jurisdictions play in determining entity importance.
Introduction
Since being leaked in 2015, the so-called Panama Papers (an 11.5 million document trove detailing information on roughly 214,000 offshore entities, intermediaries and officers) have exposed corruption, money laundering and tax evasion at an unprecedented global scale. An important economic consequence of this leak, according to a recent report, has been the collection of more than 1.2 billion USD in back taxes and penalties by governments around the world [1].
Because the condensed, publicly available version of the information can be expressed as a graph, structural properties of the entities can be quantified using network science. Particularly interesting is the question of which entities are influential in such a network, and to what extent the importance is determined by factors such as the class of the entity (e.g., whether the entity is an intermediary, an offshore organization or an officer of the organization or intermediary), the computational measure employed for quantifying the importance (e.g., the betweenness centrality [2]) and the national jurisdiction affiliated with the entity (e.g., Hong Kong).
In this paper, we use the data made publicly available by the International Consortium of Investigative Journalists (ICIJ) to selectively construct networks and study the importance of entity classes in the Panama Papers by modeling entities as vertices. An established way to understand which vertices are central or important is by computing their centrality. Since being published more than a half-century ago, centrality metrics like betweenness and information centralities have become well-studied and established in network science [2, 3].
However, it has also been understood as early as the 1970s [4] that different centrality measures seem to underlie different real-world social phenomena. In the context of the Panama Papers, then, several important questions arise. For instance, given a centrality metric, which class of entities is the most 'influential' on average? Are there strong, positive correlations when either the class or the centrality measure is varied? How important is the role of an entity's jurisdiction in determining whether it is central? Thus far, these questions have not been answered using a well-defined quantitative methodology for the entities in the Panama Papers.
We design and conduct a rigorous series of experiments to answer these questions, while also illuminating interesting aspects of different centrality measures such as betweenness and current flow. For instance, our experiments show that, while some findings are consistent across virtually all centrality measures (e.g., high scores are typically assigned to intermediaries by virtually all centrality measures), there are significant distributional and statistical disparities between centralities (and in particular, the information centrality), especially when conditioned on an entity class. Some findings are also found to lead to results that resemble Simpson's paradox [5], especially when comparing the findings on a particular entity class to the overall network.
We also qualify some of our findings while controlling for national jurisdiction, and find intriguing relationships between the different centrality measures and entity classes even at the aggregate level of jurisdictions. Our full set of results provides detailed insights on the distribution of centrality in an interconnected system of entities that, despite having attracted significant qualitative scrutiny from legal scholars and sociologists [6, 7], has received little attention (especially at scale) from the computational social sciences.
Background and related work
Since the release of the Panama Papers by ICIJ, multiple analyses have been presented, including a bestselling book [1]. Much of this analysis has been sociological or legal in nature. For example, [8] discuss how firms use secret offshore vehicles to 'finance corruption, avoid taxes and expropriate shareholders'. In a law review, [6] study the disclosures surrounding the leaked documents and provide a discussion on the impact of bribery on the global community, as well as tax evasion. Numerous other references cover similar issues, often spanning disciplines: a selected few include [7, 9–11]. Computational studies of any kind have not been common; [12] is one rare instance of a work that uses the network to study the financial networks of the Middle East, but the scope and analysis are both geopolitically and structurally limited. Another work, which involves information extraction but not network science, is the multilingual system proposed in [13]. In our own recent work [14], we did structural studies on selectively constructed Panama Papers networks by using network science, and found that the networks tend to follow a power-law degree distribution, but are extremely fragmented. However, the importance of entities, or the dependence of any such importance on the entities' jurisdictions, was not studied in that work.
Another relevant paper, very recently published [15], proposed an algorithm to detect 'suspicious' entities in databases such as this one, where suspicious entities were defined as entities that were likely to engage in illegal acts. The authors of that work used external databases and known lists of suspicious entities to verify their ground truth. In contrast, this work makes no claim about the legality of actions, but attempts an aggregate study of entity importance, after adopting appropriate controls, using structural properties of the constructed networks.
Centrality is an extremely well-studied area in network science, and the first centrality metrics were published more than fifty years ago [2]. Analysis and quantification of vertex (and in some cases, link) importance using centrality is standard in computational network science [3], but many questions remain (regarding both practice and theory), particularly involving networks that are not as ubiquitous or well-studied as social networks [3]. Furthermore, there is no one 'good' centrality measure; several decades ago, Freeman did detailed studies on centrality and suggested that different centrality measures correspond to different social occurrences [4]. The work by Freeman is particularly relevant for this paper since we also provide reasonably strong evidence that different centrality measures seem to correspond to different phenomena in the network underlying the Panama Papers, and in some cases, exhibit interesting aspects such as Simpson's paradox (a phenomenon which Freeman did not explore in his own studies) [5]. More recently, Amrit and Maat [16] study information centrality (one of the centralities we also use in this work as a measure of importance) in a simulated setting and show that, contrary to previous work that postulated that it was more correlated with closeness centrality, it is more similar to degree and eigenvector centrality. Our data in the real-world Panama Papers setting partially support this conclusion, as we discuss later, though we find that the conclusion can change depending on the class of entity being studied.
In other work, Abbasi et al. [17] use a similar kind of argument to hypothesize that the degree centrality of a researcher's collaboration network positively correlates with performance. Other papers have used other centrality measures to study various phenomena; for a detailed (and relatively recent) survey on centrality and its history, as well as applications, we recommend the work by Das et al. [3]. Similarly, Lu et al. [18] provide a review of vital node identification in complex networks, including an introductory treatment of the various centralities used in this article. Ghalmane et al. [19] extended the standard centrality measures, including betweenness and closeness, that were originally defined for networks with no community structure to modular networks. Their proposed "modular centrality" is a two-dimensional vector. Afterwards, the modular centrality was extended to networks with overlapping communities [20]. Sciarra et al. [21] propose multi-component centrality metrics as a natural extension of standard centrality metrics, using tests on a variety of networks to show that standard metrics can perform less than satisfactorily. Rajeh et al. [22] study the interplay between hierarchy and centrality in complex networks; in particular, their results show that network density and transitivity can play an important role in determining the redundancy between centrality and hierarchy measures. Many of the centrality measures studied in that article are also used herein for experiments, including the current-flow closeness centrality (which has been shown to be equivalent to information centrality), betweenness and degree centrality. While we do not study hierarchy directly in this work, it remains an interesting area of future research in the context of studying the Panama Papers.
Research questions
We briefly state the research questions under consideration in this paper below. While the first question studies the importance of individual entities in the full (i.e., global) Panama network, the structure of which is technically described in the next section, the second question attempts to understand differences at the level of national jurisdictions.
- Class-specific centrality distributions: Under the assumption that established centrality measures (such as degree and betweenness centrality) can be used to measure node importance, which of the three entity classes (intermediaries, offshore entities and officers) have relatively high centrality values? Furthermore, are their centrality distributions consistent and similar across different centrality measures?
- Jurisdictional dependencies: Given that we know the jurisdictions of many of the entities in the network, can we quantify the effect of jurisdictions on the different classes of entities? Do some jurisdictions have a higher probability of containing more important entities for a given class (e.g., intermediaries) than others? When using aggregated measures of entity importance as a variable, how strong are the associations between jurisdictions?
Materials and methods
We use the latest version of the Panama Papers dataset available on the ICIJ page [23]. Many other details, including ethical statements on using the data for research as well as definitions of some of the key terms, can also be found on the project page. There are three main classes of entities, namely offshore entities, intermediaries, and officers, that are of interest to us in this paper and are modeled as vertices in the network on which we conduct centrality studies. A fourth class of 'entity' (the string representation of an entity's address) is also present but does not have outgoing edges, and serves no purpose for these studies; hence, we only consider the three entity classes noted above. An offshore entity is defined by the ICIJ as "a company, trust or fund created by an agent in a low-tax jurisdiction that often attracts non-resident clients through preferential tax treatment." An intermediary is defined as (usually) "a law-firm or a middleman that asks an offshore service provider to create an offshore firm for a client." An officer could belong to a "wide class of individuals, including beneficiaries and nominees, who are in a position of significant influence in the associated organization, which is an offshore-entity or an intermediary." We provide an illustrative visualization in Fig 1. An important point to note is that, despite what the word implies, the term "officer" can legally be used to refer to a corporate entity rather than a human. Furthermore, as shown in the figure, an intermediary can serve more than one offshore entity.
An illustration showing one possible interlinked arrangement of intermediaries (triangles), officers (squares) and offshore entities (circles).
We use the actual relations (edge labels) in the dataset to show the way in which these entities can be linked.
While the original network has both directionality and labels on edges, we consider the simple, undirected equivalent of this network. There are two reasons for this decision. First, edge directions are arbitrary in the network and are based on the edge label (e.g., if we changed a relation from 'president of' to 'has president', the directionality of all edges with this relational label would reverse). As such, directions do not represent a meaningful real-world quantity like information flow or follower/followee semantics, as in other social, supply-chain or organizational networks. Second, not all centrality metrics are defined for directed networks, and in the general case, the centrality metrics have been best studied for simple, unlabeled networks. To obtain a simple, undirected network from the raw data, we ignore the edge labels, remove directions and collapse multiple edges between two vertices into one canonical edge. The resulting network has 657,489 edges and 559,433 non-singleton nodes (a singleton node being defined as one that has no edges incident upon it). In studying the degree and connected component (CC) distributions of this network, previous work has found that this network is disconnected, and that the degree and CC size distributions obey the power law. The network also has very low density (≤ 10^-5) and transitivity (≤ 10^-7) [14]. Similar findings hold even when the network is constructed in slightly different ways. For example, when only retaining nodes that are incident to an "officer_of" relation (which yields a sub-graph that eliminates the intermediaries in the network), we find that the resulting graph still exhibits low density and transitivity. Analogous findings are observed when we retain nodes connected through an "intermediary_of" relation instead (which eliminates officers) [14].
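The simplification step (ignore labels, remove directions, collapse parallel edges) can be illustrated with a minimal Python sketch. This is not the code used for the experiments: the node identifiers and the labels "has_officer" and "shareholder_of" are hypothetical, and dropping self-loops is an assumption made here so that the result is a simple graph.

```python
def simplify(labeled_edges):
    """Collapse directed, labeled edges into a set of simple undirected edges."""
    simple_edges = set()
    for source, _label, target in labeled_edges:
        if source == target:
            continue  # assumption: self-loops are dropped in a simple graph
        # A frozenset ignores direction; the enclosing set removes duplicates.
        simple_edges.add(frozenset((source, target)))
    return simple_edges


def density(n_nodes, n_edges):
    """Density of a simple undirected graph: 2m / (n(n - 1))."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


# Toy records with hypothetical identifiers (not taken from the ICIJ data).
raw_edges = [
    ("officer_1", "officer_of", "entity_1"),
    ("entity_1", "has_officer", "officer_1"),     # same pair, reversed direction
    ("officer_1", "shareholder_of", "entity_1"),  # same pair, different label
    ("intermediary_1", "intermediary_of", "entity_1"),
    ("intermediary_1", "intermediary_of", "entity_2"),
]

edges = simplify(raw_edges)
nodes = set().union(*edges)  # only nodes with at least one edge (non-singletons)
print(len(nodes), "nodes,", len(edges), "edges")

# Density implied by the node and edge counts reported in the text.
print(density(559_433, 657_489))
```

The last line evaluates the standard density formula on the reported counts, which gives a value below the 10^-5 bound quoted above.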
In this paper, both research questions are investigated using the single network described above (which contains all three entity types, namely officers, offshore entities and intermediaries), although in researching the second research question, aggregations are conducted at the level of jurisdictional dependencies for a given entity type. For instance, when studying intermediaries in Germany, aggregations would be conducted for the measure under study (which is typically a centrality measure, as discussed later) only for intermediaries in Germany. However, this does not involve construction of a new network (involving only intermediaries in Germany, for example). In fact, such an exercise would be counter-productive to the scientific aims of this paper, since the Panama Papers has complicated international linkages that strongly affect the centrality values. It is essential to study the properties (e.g., individual node centralities) of the network in its global context, as we do in this paper. Once computed, the individual metrics can be grouped and aggregated in a variety of ways, depending on the research question being studied.
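The "compute globally, then group" idea can be sketched as follows. The sketch assumes that centrality values have already been computed on the full network and are available as a dictionary; the function name, node identifiers and numeric values are hypothetical, and the mean is only one of several possible aggregates.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_jurisdiction(centrality, node_class, jurisdiction, target_class):
    """Average globally computed centralities per jurisdiction for one entity class."""
    groups = defaultdict(list)
    for node, value in centrality.items():
        # Keep only nodes of the requested class whose jurisdiction is known.
        if node_class.get(node) == target_class and node in jurisdiction:
            groups[jurisdiction[node]].append(value)
    return {name: mean(values) for name, values in groups.items()}


# Hypothetical centralities, computed once on the full (global) network.
centrality = {"i1": 0.5, "i2": 0.25, "i3": 0.125, "e1": 0.0625, "o1": 0.03125}
node_class = {
    "i1": "intermediary", "i2": "intermediary", "i3": "intermediary",
    "e1": "offshore entity", "o1": "officer",
}
# Jurisdiction is unknown for o1, so it has no entry.
jurisdiction = {"i1": "Germany", "i2": "Germany", "i3": "Hong Kong", "e1": "Germany"}

result = aggregate_by_jurisdiction(centrality, node_class, jurisdiction, "intermediary")
print(result)
```

No sub-network is built for any jurisdiction; the grouping only selects which of the global values enter each average.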
As first described in the introduction, our goal in this work is to study the importance of various classes of nodes in the Panama Papers. In keeping with prior studies, we also proposed using centrality measures to measure such importance. Below, we enumerate the specific centrality measures used in this paper. All experiments in this paper were conducted using the NetworkX package [24]. Note that both the research questions mentioned earlier rely methodologically on these centrality measures.
- Degree centrality: A conceptually simple measure of centrality, the degree centrality of a node v is a function of the node's degree deg(v). We obtain a normalized value by dividing each node's degree deg(v) by |V| − 1 (the maximum theoretical degree).
- Information centrality: Information centrality, originally proposed in 1989 [25], is based on the 'information' contained in all possible paths between pairs of points. It was motivated by general ideas of statistical estimation, and departed from many of the traditional centralities (such as betweenness) in considering all paths between points rather than just the 'geodesic' (or shortest) paths. It was shown to be equivalent to current-flow closeness centrality [26]. For complete details, including the definition of 'information' used by the proposing authors, we refer the reader to the original paper [25].
- Closeness centrality: In a connected graph, the normalized closeness centrality (or closeness) of a node is the reciprocal of the average length of the shortest path between the node and all other nodes in the graph. Since closeness centrality is only well-defined if the graph is connected, we independently compute it for nodes in each of the individual connected components in G. In future studies, one could also consider a variant of closeness centrality where an adjustment, suggested by Wasserman and Faust [27], could be used to account for the imbalanced size of components.
- Betweenness centrality: Betweenness centrality, one of the most classic measures of centrality that was first proposed in 1948 [2], is a measure of the importance of a node to the flow of information between every node pair, assuming that the information primarily flows over the shortest paths between the pair. Specifically, the betweenness centrality of a node v is the sum of the fraction of all-pairs shortest paths that pass through v. For a graph with hundreds of thousands of nodes and edges, computing the exact betweenness centrality is not viable in reasonable time. Hence, we used a well-known approximation method [28], whereby we randomly choose k 'pivot' nodes for computing the set of all-pairs shortest paths. We tried various values of k and found that the centrality distribution started to stabilize around 100 pivots, which we used as the value of k for all our experiments.
- Current-flow betweenness centrality: Current-flow betweenness centrality, proposed by Newman in 2005 [29], uses an electrical current model for information spreading, in contrast to betweenness centrality (which uses shortest paths). Because of its high complexity, we use an approximation algorithm proposed in [26] that is able to approximate the true value within absolute error of a parameter ϵ with high probability; its run-time depends on m and n, the number of edges and nodes respectively. We use the default ϵ value of 0.5 in the NetworkX implementation of this algorithm. Another parameter, kmax, which is the maximum number of sample node pairs to use for approximation, was set to 10,000.
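To make the first and third definitions concrete, the sketch below computes normalized degree centrality and per-component closeness from scratch on a toy graph, using only the Python standard library. It is an illustration under stated assumptions, not the experimental code: closeness is taken as the reciprocal of the average shortest-path length within a node's own component, and the toy edges are hypothetical. In NetworkX, the functions that appear to correspond to the five measures are `degree_centrality`, `information_centrality`, `closeness_centrality`, `betweenness_centrality` (with its `k` argument for pivot sampling) and `approximate_current_flow_betweenness_centrality` (with `epsilon` and `kmax`).

```python
from collections import deque


def build_adjacency(edges):
    """Adjacency sets for a simple undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_centrality(adj):
    """deg(v) / (|V| - 1) for every node v."""
    n = len(adj)
    return {v: len(neighbors) / (n - 1) for v, neighbors in adj.items()}


def bfs_distances(adj, source):
    """Shortest-path lengths from source to every node in its component."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def closeness_per_component(adj):
    """Closeness computed independently inside each connected component."""
    closeness = {}
    for v in adj:
        dist = bfs_distances(adj, v)  # reaches only the component of v
        total = sum(dist.values())
        # (component size - 1) / (sum of distances) = 1 / (average distance)
        closeness[v] = (len(dist) - 1) / total if total > 0 else 0.0
    return closeness


# Toy graph: a star centered on "a" plus a separate two-node component.
adj = build_adjacency([("a", "b"), ("a", "c"), ("a", "d"), ("e", "f")])
degree = degree_centrality(adj)
closeness = closeness_per_component(adj)
print(degree["a"], closeness["a"], closeness["b"], closeness["e"])
```

In this toy graph, node "e" in the two-node component obtains the same closeness as the hub "a" of the larger component, which is the kind of component-size imbalance that the Wasserman and Faust adjustment is meant to address.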
While other centrality measures also exist, we chose the five measures enumerated above for various reasons. First, all five measures are established and have been used in many publications over the years, which allows a stronger basis for comparisons. An advantage of being more established is that standard implementations for these also exist in packages like Python NetworkX, which would allow our presented results to be more easily replicated by other interested researchers. Second, the five measures are also diverse. While betweenness and degree centralities are classic measures that capture the importance of a node from an information-flow standpoint and a local-connectivity standpoint, respectively, more modern centrality measures such as current-flow betweenness centrality are inspired by more advanced models (such as the electrical current model), often from the natural sciences. The theoretical properties of these measures have also been much discussed in several reviews and surveys, as earlier described in Background and Related Work. Beyond these given measures, however, we note that re-running these experiments with other centrality measures, including the modular centrality and the overlapping modular centrality, is an interesting avenue of future work that may yield further insights into the networks.
One methodological concern that might arise when using such a range of centrality measures is that their statistics may not be directly comparable. We address this issue in two different ways. First, when reporting on these measures, we employ a range of statistics, as opposed to just means and standard deviations. For instance, by also reporting on minimum and maximum observed values, we provide an accurate sense of the scaling properties of these measures (at least, relative to one another). Second, when measuring associations between these measures, we use the non-parametric Spearman's rank-order correlation, rather than measures like the Pearson correlation, which only tend to work for linearly related data. We provide additional details on the rationale for using Spearman's correlation later, but one important advantage is that it allows us to compare the centralities as ordinal variables. We emphasize that we do not claim that any one centrality measure is superior or inferior to (or even more informative than) another. Rather, as our results will show, the centrality measures (taken together) provide a more comprehensive and reliable picture of the findings than any one centrality measure could have been trusted to do.
Results
Research question 1: Class-specific centrality distributions
Recall that the first question sought to investigate which of the three entity classes (intermediaries, offshore entities and officers) in the Panama Papers had high centrality values compared to the others (thereby signifying higher importance of that class of entities) and also whether the centrality measures were consistent with respect to this determination. Using the centrality measures noted in the previous section, we tabulate the key results below.
First, in Table 1 we provide some basic statistics for each of the three entity classes and the five centrality measures. The results show that, in general, officers are less central in this network compared to intermediaries and offshore entities. In looking at the mean centralities across measures and classes in Table 1, we find that, with the exception of closeness centrality, where the average officer's centrality (0.0422) far exceeds that of intermediaries (0.00787) and is only slightly lower than that of offshore entities (0.0463), the average centrality for an officer node is usually an order of magnitude lower than for intermediaries. However, the quantitative difference must be interpreted carefully, as prior work has shown that centrality measures are best used in a ranking-based framework (i.e., for ranking the nodes in order of importance) rather than for quantifying the importance (or the differences in importance) [30]. A less drastic, but still highly significant, difference is observed between the centralities of offshore entities and intermediaries. Even when considering the most central (Max.) entity in each class, we still find the highest value to be obtained by an intermediary in most cases, though the betweenness centrality is an interesting exception, in that the most central officer (0.320) achieves a much higher value compared to the most central intermediary (0.0627) or offshore entity (0.0146). Even these basic statistics, therefore, show that interpretations of entity importance can start diverging depending on both the entity class and the specific centrality measure used.
Table 1
D: degree; I: information; C: closeness; B: betweenness; F: current-flow betweenness.
Class | Statistic | D | I | C | B | F |
---|---|---|---|---|---|---|
Intermediaries | Max. | 0.0151 | 1.0 | 0.0965 | 0.0627 | 1.725 |
Intermediaries | Mean | 3.259e-5 | 0.399 | 0.00787 | 8.741e-5 | 0.315 |
Intermediaries | Std. Dev. | 2.59e-4 | 0.409 | 0.0213 | 1.25e-3 | 0.435 |
Officers | Max. | 8.33e-3 | 0.4 | 0.115 | 0.320 | 0.875 |
Officers | Mean | 2.64e-6 | 7.6e-3 | 0.0422 | 3.87e-6 | 8.67e-4 |
Officers | Std. Dev. | 1.812e-5 | 0.0318 | 0.0298 | 6.66e-4 | 0.0138 |
Offshore entities | Max. | 2.16e-3 | 1.0 | 0.103 | 0.0146 | 1.575 |
Offshore entities | Mean | 5.089e-6 | 0.0329 | 0.0463 | 1.026e-5 | 0.0294 |
Offshore entities | Std. Dev. | 8.335e-6 | 0.147 | 0.0329 | 1.68e-4 | 0.141 |
With this methodological caveat in mind, a key result that does emerge in Table 1 is that intermediaries are significantly more central than the other entity classes. This result is not especially surprising, as there is some evidence that centrality and intermediary-like function are correlated (at least in some systems, such as transportation) [31]. Also, as far back as 1978, Freeman [4] argued that central nodes were in the 'thick of things' or were the sort of focal points or gatekeepers that intermediaries tend to be in complex systems involving finance, auditing and law.
To study the relationship between the centrality measures after controlling for the entity class, we quantify (in Table 2) correlational relationships, using the non-parametric Spearman's rank-order correlation, between the different centrality measures. Note that, except for the small correlation between information and degree centrality in the offshore entity correlation table (p = 0.0106; hence, significant, but not highly significant), we found all results to be highly significant (p ≤ 0.01). The main reason that Spearman's rank correlation is preferred as an associative metric for these experiments is that the different centrality measures have non-intuitive and non-homogeneous scaling and the relationship between them may not be linear, which would lead to a methodological issue with using measures like Pearson's correlation. However, since all the variables are ordinal (even if their statistical and scaling properties are different), we can measure the monotonic relationship between them to understand how they co-vary across both entities and entity classes. Hence, Spearman's rank correlation is the methodologically appropriate measure to use here.
Table 2
D: degree; I: information; C: closeness; B: betweenness; F: current-flow betweenness.
Class | | D | I | C | B | F |
---|---|---|---|---|---|---|
Intermediaries | D | 1.0 | -0.694 | 0.812 | 0.466 | 0.833 |
Intermediaries | I | -0.694 | 1.0 | -0.969 | -0.500 | -0.432 |
Intermediaries | C | 0.812 | -0.969 | 1.0 | 0.515 | 0.587 |
Intermediaries | B | 0.466 | -0.500 | 0.515 | 1.0 | 0.030 |
Intermediaries | F | 0.833 | -0.432 | 0.587 | 0.030 | 1.0 |
Intermediaries | AVG. | 0.483 | -0.319 | 0.389 | 0.302 | 0.404 |
Officers | D | 1.0 | 0.125 | 0.075 | 0.523 | 0.868 |
Officers | I | 0.125 | 1.0 | -0.483 | 0.058 | 0.201 |
Officers | C | 0.075 | -0.483 | 1.0 | 0.100 | -0.023 |
Officers | B | 0.523 | 0.058 | 0.100 | 1.0 | 0.456 |
Officers | F | 0.868 | 0.201 | -0.023 | 0.456 | 1.0 |
Officers | AVG. | 0.518 | 0.180 | 0.134 | 0.427 | 0.500 |
Offshore entities | D | 1.0 | 0.006 | 0.171 | 0.718 | 0.562 |
Offshore entities | I | 0.006 | 1.0 | -0.479 | -0.305 | 0.376 |
Offshore entities | C | 0.171 | -0.479 | 1.0 | 0.516 | -0.087 |
Offshore entities | B | 0.718 | -0.305 | 0.516 | 1.0 | 0.205 |
Offshore entities | F | 0.562 | 0.376 | -0.087 | 0.205 | 1.0 |
Offshore entities | AVG. | 0.491 | 0.120 | 0.224 | 0.427 | 0.411 |
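The reason for preferring a rank-based correlation can be seen in a small, self-contained Python sketch. It implements Spearman's correlation as the Pearson correlation of average ranks (ties share the mean of their positions) and applies it to hypothetical values that are monotonically but not linearly related. The data are invented for illustration, and no p-values are computed here; a library routine such as `scipy.stats.spearmanr` would normally be used for that.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical centralities with very different scaling but identical ordering.
degree_like = [1, 2, 3, 4, 5]
betweenness_like = [0.0, 0.001, 0.01, 0.1, 10.0]

print("Spearman:", spearman(degree_like, betweenness_like))
print("Pearson: ", pearson(degree_like, betweenness_like))
```

Because the two toy sequences order the nodes identically, the rank correlation is perfect, while the Pearson coefficient is pulled well below 1 by the non-linear scaling.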
The results show that, on average, with the exception of the information centrality for intermediary entities, there is reasonable positive correlation between all the centrality metrics, though the values vary considerably, and for certain cases, there are negative (sometimes, strongly so) correlations. For instance, there is a small (but significant) negative correlation between the closeness and current-flow betweenness centrality distributions of both officers and offshore entities, but the same correlation becomes positive when considered for intermediaries. This lends further credence to the hypothesis that we cannot study structure in the Panama Papers without controlling for either the centrality measure used or the class of entities being studied.
Furthermore, since the diagonal correlations are guaranteed to be 1.0, the reported averages are inflated if we wish to consider only the correlations between distinct centrality measures. Excluding the diagonal, a column with reported average a has an off-diagonal average of (5a − 1)/4. When only averaging off-diagonal elements for each column, we find that I becomes negative for all three entity classes, while C becomes negative for officers and is only barely positive for offshore entities (it remains positive for intermediaries). Nevertheless, I and C are negatively correlated in all three entity classes.
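As a check on the averaging, the following Python sketch recomputes the column averages for the second 5×5 block of Table 2 (which the surrounding text indicates is the officer block), once with and once without the diagonal. The helper name is hypothetical; the numbers are copied from the table.

```python
LABELS = ["D", "I", "C", "B", "F"]

# Second block of Table 2 (read here as the officer correlations).
officers = [
    [1.0, 0.125, 0.075, 0.523, 0.868],
    [0.125, 1.0, -0.483, 0.058, 0.201],
    [0.075, -0.483, 1.0, 0.100, -0.023],
    [0.523, 0.058, 0.100, 1.0, 0.456],
    [0.868, 0.201, -0.023, 0.456, 1.0],
]


def column_means(matrix, include_diagonal=True):
    """Mean of each column, optionally skipping the diagonal entry."""
    n = len(matrix)
    means = []
    for col in range(n):
        values = [
            matrix[row][col]
            for row in range(n)
            if include_diagonal or row != col
        ]
        means.append(sum(values) / len(values))
    return means


with_diagonal = column_means(officers)
off_diagonal = column_means(officers, include_diagonal=False)

for label, a, b in zip(LABELS, with_diagonal, off_diagonal):
    print(f"{label}: with diagonal {a:.3f}, off-diagonal {b:.3f}")
```

For this block, the averages that include the diagonal reproduce the AVG. row of the table, and the off-diagonal averages for I and C are both negative.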
In the Background and Related Work section, we mentioned that Amrit and Maat [16] had determined (with simulated information flows) that information centrality was more similar to degree (and also eigenvector, which is not considered in this work) than to closeness centrality. In Table 2, we find this to be partially true (I is always negatively correlated with C), although I is also negatively correlated with D for intermediaries, and very weakly correlated with D for offshore entities. This suggests that, even structurally, different kinds of information are flowing (with different strengths) between entities in these different classes, and a good theory would need to robustly explain such varying associations.
Even though intermediaries are the most central entities in the network, they also seem to have the highest variance (typically) in Table 1. This suggests significant distributional differences between entity classes, even after being conditioned on a single centrality measure.
However, although useful, correlational, point-statistical and aggregate statistical measures only provide limited information about the actual centrality distributions of the entity classes. To understand the distributions of class-specific centralities, we computed histograms of centrality values for all three entity classes and all five centrality measures. Fig 2 illustrates these histograms for degree, betweenness and current-flow centralities, while Fig 3 illustrates the histograms for information and closeness centralities. We separated these plots based on the observed extremity of values on the y-scale. For instance, while the x-axis is plotted on a natural log scale (with ln(0) taken to be 0) for both figures, Fig 2 also uses the log scale for the y-axis (since there are extreme differences that are difficult to illustrate using an ordinary scale). In comparison, Fig 3 has less extremity and the trend is better illustrated using a histogram, with the y-axis (for a bin) now defined as the count of nodes having centralities in the bin range. Formally, looking at the data in Table 1, we note that information and closeness centralities generally have a smaller standard deviation as a percentage of the mean, compared with degree and betweenness centralities. Current flow seems to be more like the former, but has an extreme outlier at 0, as shown in Fig 2.
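Two small computations from this paragraph can be sketched in Python. The first expresses the standard deviation as a percentage of the mean, using the Mean and Std. Dev. values from the first block of Table 1 (read here as the intermediary rows). The second is the log transform for the x-axis with ln(0) taken to be 0. The helper name `safe_ln` is hypothetical, and this is an illustrative sketch rather than the plotting code behind the figures.

```python
from math import log

# Mean and Std. Dev. rows from the first block of Table 1.
means = {"D": 3.259e-5, "I": 0.399, "C": 0.00787, "B": 8.741e-5, "F": 0.315}
std_devs = {"D": 2.59e-4, "I": 0.409, "C": 0.0213, "B": 1.25e-3, "F": 0.435}

# Standard deviation as a percentage of the mean, per centrality measure.
relative_spread = {
    name: 100 * std_devs[name] / means[name] for name in means
}
for name, value in relative_spread.items():
    print(f"{name}: {value:.0f}%")


def safe_ln(x):
    """Natural log with the convention ln(0) = 0 used for the x-axis."""
    return log(x) if x > 0 else 0.0


# Centralities below 1 map to negative values, so zeros stand apart at x = 0.
print([round(safe_ln(v), 2) for v in [0.0, 1e-5, 0.01, 0.5]])
```

For these rows, information, closeness and current flow have a much smaller relative spread than degree and betweenness.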
Fig 2. Degree, betweenness and current flow centrality frequency distributions.
Both axes are on the natural log scale. Note outliers at x = 0 for both current flow and betweenness.
Fig 3. Closeness and information centrality frequency histograms.
Only the x-axis is on the natural log scale, with a granularity of 500 bins per plot.
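The binning behind such plots can be sketched in a few lines. The following is a minimal illustration, not the code used in the study: the function names `log_histogram` and `relative_std` are hypothetical, equal-width bins on the log axis are an assumption, and the population standard deviation is assumed for the "standard deviation as a percentage of the mean" comparison.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def log_histogram(values, n_bins=500):
    """Count values in equal-width bins on a natural-log x-axis.

    Follows the convention stated in the text: ln(0) is taken to be 0.
    Returns (left bin edge, count) pairs for the non-empty bins.
    """
    logged = [math.log(v) if v > 0 else 0.0 for v in values]
    lo, hi = min(logged), max(logged)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values coincide
    # The maximum value is clamped into the last bin.
    counts = Counter(min(int((x - lo) / width), n_bins - 1) for x in logged)
    return [(lo + b * width, counts[b]) for b in sorted(counts)]


def relative_std(values):
    """Standard deviation as a percentage of the mean (population form assumed)."""
    return 100.0 * pstdev(values) / mean(values)
```

Taking the logarithm of the resulting counts as well would give the log-log view described for Fig 2.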
There is a clear difference in the conclusions that one would draw from these two figures, notwithstanding class differences within the context of a single centrality measure. Especially prominent is the clear difference exhibited by the centrality frequency distribution for intermediaries compared to the other two classes, for both the information and closeness centralities. The actual relationships seem to be inverted when comparing across the two centralities, lending further credence to previous observations that the information centrality is expressing a 'different' model of importance than the other measures. In contrast to all other centralities, the degree centrality exhibits a stable (and relatively homogeneous) power-law distribution for all three entity classes. Current flow and betweenness centralities have more heterogeneity, but do not show the drastic differences between classes that the closeness and information centralities do.
These differences also raise the question as to whether the trends being shown in the figures for a particular centrality measure would reverse for a particular entity class compared to the overall network. That is, if we computed the centrality frequency plot for the full network, rather than 'separating' centrality results by entity class as we have done in these results, would the actual conclusion (about the positive or negative correlation observed in such a trend) be inverted? Such an inversion would be an important instance of Simpson's paradox, also called the Yule-Simpson effect (see [32] for a review on this effect in research findings), and would provide strong evidence for always being wary, at least in the context of the Panama Papers but possibly beyond, of the validity of such findings without controlling for the centrality measure being used and the entity class being studied.
To quantify the extent of Simpson's paradox for a particular centrality measure, we compute the Spearman's rank correlation between two paired variables, namely the centrality and the frequency of that centrality. We compute the correlation both for the individual entity classes (per centrality measure) as well as for the overall network. Table 3 shows the sign agreement between the former and the latter. A negative sign indicates Simpson's paradox for that particular entity class. For intermediaries, we find that the paradox manifests for current flow centrality, which is not apparent from Fig 2, where the three plots seem to be complementing each other. Furthermore, we observe the paradox for offshore entities on the betweenness centrality, which is also not apparent in Fig 2. These results show that the centralities seem to be capturing very different phenomena about these entity classes and their interactions than suggested by the overall network or the centrality behavior of the other entity classes. We do not have a sociologically grounded or theoretical explanation for what may be causing such reversals in the Panama network, but believe that it is an important feature of the network, especially given its unusual nature.
Table 3
Entity class | D | I | C | B | F |
---|---|---|---|---|---|
Intermediary | + | + | + | + | - |
Offshore Entity | + | + | - | + | + |
Officer | + | + | + | + | + |
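The sign-agreement test behind Table 3 can be sketched as follows. This is an illustrative reading of the procedure, not the author's code: it assumes that "frequency" means the number of nodes sharing an exact centrality value (a binned frequency would work the same way), and the helper names `ranks`, `spearman`, `centrality_frequency_rho` and `sign_agreement` are hypothetical.

```python
from collections import Counter


def ranks(xs):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0


def centrality_frequency_rho(values):
    """Correlate each distinct centrality value with how often it occurs."""
    freq = Counter(values)
    distinct = sorted(freq)
    return spearman(distinct, [freq[v] for v in distinct])


def sign_agreement(values_by_class):
    """'+' if a class-level rho has the same sign as the network-level rho, else '-'."""
    overall = [v for vals in values_by_class.values() for v in vals]
    overall_rho = centrality_frequency_rho(overall)
    return {
        cls: "+" if centrality_frequency_rho(vals) * overall_rho > 0 else "-"
        for cls, vals in values_by_class.items()
    }
```

Running `sign_agreement` once per centrality measure, with one list of values per entity class, would yield one column of a table like Table 3. In this sketch a correlation of exactly zero is reported as '-'.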
Research question 2: Jurisdictional dependencies
Thus far, we have studied the entities in the Panama Papers from a global perspective. In the real world, these entities are heavily constrained (or encouraged) by their national jurisdictions; in many cases, they are set up specifically to take advantage of their tax jurisdictions for their clients. The popular notion (often also depicted and dramatized in fictional works) is that intermediaries and offshore entities are set up in 'tax havens' such as the Cayman Islands. In practice, complex multi-hop chains of entities are involved in moving money from the originator to its presumptive final destination. Rather than take on the daunting task of uncovering or discovering such potentially illegitimate chains of transactions (which may not even be possible from the public record), we take on the more modest goal of measuring national-level jurisdictional dependencies of the three entity classes. In other words, we are looking to see if national jurisdictions demonstrate non-random patterns in the class-controlled centrality distributions of the entities within the jurisdiction, as well as inter-centrality (and inter-class) correlations between the entities.
As a first step towards such an analysis, we obtain the jurisdiction of each node in the network. Since the jurisdictions of some nodes are unknown, and in some cases, the jurisdiction is also ambiguous or complex (more than one jurisdiction is listed), we only retain nodes that are associated with exactly one jurisdiction. Furthermore, to avoid the effects of jurisdictions that do not have sufficient representation (i.e., too few entities have that jurisdiction), we limit our analysis to jurisdictions that have at least 10 associated entities from each of the entity classes. Following this preprocessing, we are left with 320,564 entities and 76 jurisdictions for our study. The probability distribution of country frequency for all three entity classes is plotted in Fig 4. The average number of intermediary, officer and offshore entities per country is 153.6, 1626.4 and 2437.9 respectively.
Fig 4. Share of entity class versus country index.
A log-transform was applied to the count of entities per country per class, and country indices were assigned arbitrarily. Hence, each point on the x-axis represents a single country, with the entity class shares shown as y-value percentages.
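The two preprocessing filters described above (exactly one jurisdiction per node, and at least 10 entities of every class per jurisdiction) can be sketched as below. The record layout `(node_id, entity_class, jurisdictions)` and the function name `filter_by_jurisdiction` are assumptions made for illustration; the released data may be organized differently.

```python
from collections import defaultdict

CLASSES = ("Intermediary", "Officer", "Offshore Entity")


def filter_by_jurisdiction(nodes, min_per_class=10):
    """Apply both filters to (node_id, entity_class, jurisdictions) records.

    `jurisdictions` is assumed to be a list that may be empty (unknown)
    or contain several entries (ambiguous).
    """
    # Filter 1: keep only nodes with exactly one listed jurisdiction.
    single = [(n, cls, js[0]) for n, cls, js in nodes if len(js) == 1]

    # Count the surviving nodes per jurisdiction and entity class.
    counts = defaultdict(lambda: defaultdict(int))
    for _, cls, j in single:
        counts[j][cls] += 1

    # Filter 2: keep jurisdictions with enough entities in every class.
    keep = {
        j for j, per_class in counts.items()
        if all(per_class.get(c, 0) >= min_per_class for c in CLASSES)
    }
    return [(n, cls, j) for n, cls, j in single if j in keep]
```

Requiring the threshold for every class, rather than for the jurisdiction total, is what keeps the later per-class means from resting on a handful of nodes.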
Returning to the core of the research question, we gathered a detailed set of correlations to determine the association between centrality and nationality for all entity classes. Before describing our experimental design, we note that, since the associations are measured at the level of countries, we have to assign an 'importance score' to a country in the context of a given centrality measure and a given entity class. The simplest way to assign such a score (and one that we adopt in this paper) is to compute the mean of the centralities (for the given centrality measure) of all entities that belong to the given entity class and that list the country as their jurisdiction. Recall that our earlier preprocessing filtered out countries that do not have at least 10 entities (listing that country as their jurisdiction) from each of the entity classes to ensure sufficiently robust statistics for all entity classes. Put more formally, given the centrality C(u) of node u that is associated with national jurisdiction J and has entity class E, we define u(j) = J and u(e) = E in a slight abuse of notation. The importance score of J is then given by:
$$I_C^E(J) = \frac{1}{|\{u \mid u(j) = J,\ u(e) = E\}|} \sum_{u \,:\, u(j) = J,\ u(e) = E} C(u)$$
We use the subscript C on I to indicate that the importance score depends on the centrality measure employed. Similarly, we use the superscript E to indicate the class of entities (e.g., intermediaries) being used. In designing the experiments for such an analysis, one might be tempted to only measure correlations within the context of a single centrality measure (i.e., limit the scope of the study to a question such as how strong or significant is the association between the country importance scores of intermediaries and officers (or any pair of distinct entity classes), with importance scores computed using a single (given) centrality measure such as betweenness?). However, as we saw before, there are non-trivial relationships between the different centrality measures both within and across entity classes, and it is certainly plausible that such relationships may persist (or become amplified) when we conduct a similar analysis at the level of countries. Hence, we do not impose a constraint on the centrality measure used or the entity class. Rather, we consider all five centrality measures for this research question as well; hence, there are five importance scores calculated per jurisdiction per entity class.
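The mean-based importance score defined above amounts to a group-by-and-average. A minimal sketch follows, assuming the per-node centralities are available as flat `(jurisdiction, entity_class, centrality_name, value)` records; that layout and the name `importance_scores` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def importance_scores(records):
    """Mean centrality per (centrality measure, entity class, jurisdiction).

    `records` is an iterable of (jurisdiction, entity_class,
    centrality_name, value) tuples, one per node and measure.
    """
    grouped = defaultdict(list)
    for jurisdiction, entity_class, centrality_name, value in records:
        grouped[(centrality_name, entity_class, jurisdiction)].append(value)
    return {key: mean(values) for key, values in grouped.items()}
```

With five measures and three entity classes, each jurisdiction ends up with 15 scores. The mean is sensitive to outliers, so a median could be substituted in the same place if a more robust score were wanted.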
For our analysis, we begin by computing a 15 × 15 matrix (i.e., three entity classes × five centrality measures) of Spearman's rank correlation coefficients between each pair of 76-dimensional importance score vectors $I_{C_i}^{E_i}$ and $I_{C_j}^{E_j}$ (there are 76 dimensions because there are 76 jurisdictions in our filtered dataset described earlier), where $C_i, C_j \in \{D, C, B, I, F\}$ and $E_i, E_j \in \{\text{Intermediary}, \text{Officer}, \text{Offshore Entity}\}$ are allowed to vary independently, leading to 15 possible vectors. To illustrate the patterns effectively, we plot the matrix as a heat map (Fig 5). Yellow values indicate weak (or no) correlation, green values indicate high positive correlations and red values indicate high negative correlations.
Fig 5. Spearman's rank correlation coefficients between each pair of 76-dimensional importance score vectors, defined in the text.
Similar to previous figures, F, B, C, D and I refer to current flow betweenness, betweenness, closeness, degree and information centralities respectively, while Int., Off and O ent respectively stand for Intermediary, Officer and Offshore Entity. Yellow values indicate weak (or no) correlation, green values indicate high positive correlations and red values indicate high negative correlations.
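The matrix underlying such a heat map can be assembled as sketched below. This is an illustration under stated assumptions: `scores` is a hypothetical dictionary keyed by `(centrality measure, entity class, jurisdiction)`, and the rank correlation uses the textbook shortcut formula, which is only exact when no two jurisdictions share a score within a vector.

```python
from itertools import product


def spearman_no_ties(x, y):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); assumes no tied values."""
    n = len(x)
    rank_x = {i: r for r, i in enumerate(sorted(range(n), key=x.__getitem__))}
    rank_y = {i: r for r, i in enumerate(sorted(range(n), key=y.__getitem__))}
    d2 = sum((rank_x[i] - rank_y[i]) ** 2 for i in range(n))
    return 1 - 6 * d2 / (n * (n * n - 1))


def correlation_matrix(scores, jurisdictions, measures="DCBIF",
                       classes=("Intermediary", "Officer", "Offshore Entity")):
    """Return (labels, matrix) with one row and column per (measure, class) pair."""
    labels = list(product(measures, classes))
    # One importance-score vector per label, ordered by jurisdiction.
    vectors = {
        (m, c): [scores[(m, c, j)] for j in jurisdictions] for m, c in labels
    }
    matrix = [[spearman_no_ties(vectors[a], vectors[b]) for b in labels]
              for a in labels]
    return labels, matrix
```

With the default arguments and 76 jurisdictions this produces the 15 × 15 matrix; grouping the labels by measure first is what makes the 'diagonal' blocks correspond to a single centrality across entity classes.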
We observe from the heatmap that:
- Analogous to our findings for Research Question (RQ) 1, information centrality (I) is once again found to be negatively or weakly correlated with closeness centrality (C), and also with betweenness centrality (B). Unlike the results in Table 2, however, we do find generally negative correlations with degree centrality (D) also. One hypothesis is that information centrality may exhibit Simpson's paradox-like behavior when using jurisdiction (rather than entity class, as in RQ1) as a control variable. We leave exploring this hypothesis for future work.
- Betweenness centrality is generally the most positively correlated, both with itself (among the different entity classes) and also with many of the other centrality measures. This suggests that, for general studies of structure and influence on the Panama Papers, B may be a more reliable metric than others.
- Interestingly, when we look at the 'diagonal' blocks in the heatmap, we find that, for both D and F (current-flow betweenness), there are weak or even negative correlations between entity classes. These centralities seem to be suggesting that different jurisdictions occupy different niches, since (using F as an example) high intermediary centralities would suggest weaker offshore entity centralities. Similarly, for degree centrality, jurisdictions' offshore entity centralities are negatively correlated with officer centralities.
Discussion
We summarize some of the key implications of the results in the previous sections:
- First, while differences are observed when using different centrality measures, an important commonality (with few exceptions) is that entities that are intermediaries are seen to have overwhelmingly high centrality compared to both officers and offshore organizations. Intermediaries clearly play a crucial role in the financial system of money movement represented in the Panama Papers. Since intermediaries tend to be incorporated locally (such as a law firm or an accounting firm), they would be subject to the jurisdiction of the locality where a shell corporation is being created or money is being moved, as opposed to a multinational corporation that may be subject to the source country's jurisdiction.
- Second, despite the finding above, the choice of centrality is an important one, since some centralities yield inverted results compared to the others. In particular, the information centrality is often negatively correlated with the other measures; in some cases, the current-flow betweenness centrality and the closeness centrality can also exhibit inverted behavior. Most likely, it is not the case that one centrality measure is 'right'; instead, we hypothesize that the different centralities are capturing different kinds of importance. Previous work has shed some light on this [4], but the issue is not resolved in the broader community.
- Third, Simpson's paradox-like behavior is observed when we attempt to understand and compare the relation between centrality values and their frequencies at the overall network level (which includes all the entities and classes) versus for each of the classes individually. The behavior is observed when using the current-flow betweenness and closeness centralities for intermediaries and offshore entities respectively. For the majority of centralities, there is agreement. More research is needed to understand why the paradox arose for the two cases above.
- Finally, many of the results above are replicated when we account for jurisdictions. In particular, information centrality is again found to exhibit somewhat inverted behavior. However, clear differences also begin to emerge once we look at jurisdictions, and the results suggest that different jurisdictions or nations may serve as different niches in the system.
An important point is that centrality measures, though well established in the network science and complex systems community as a means for quantifying importance, are not perfect. There is also evidence to suggest that they may under-estimate the importance of non-hub nodes [33]. Some of these caveats likely apply to the Panama Papers, which is an unusual system to begin with, and which has many fragmented components. As the influential review of Borgatti and Everett showed, the accuracy of centrality indices also depends on the network topology [34].
Conclusion
Since their release, the Panama Papers have come under wide scrutiny by experts in legal scholarship, sociology and tax policy. A limited number of studies have used computational techniques to study the structure of the entities and officers in the network, but studies spanning jurisdictions or entity classes have thus far not been forthcoming. In this paper, we specifically study entity importance across three different classes (intermediaries, officers and offshore entities) in the Panama Papers by employing five well-defined centrality measures. Our experimental study is based on two related research questions, one of which maps out consistencies and differences (in measuring entity importance) between the centrality measures and entity classes, and the second of which considers similar questions but at the aggregate view of national jurisdictions.
Our results suggest many open questions for future investigation. Although we studied Simpson's paradox in the context of measuring centrality frequency correlations, a similar experiment could be designed where the national jurisdiction is used as the control variable. We hypothesize that values for this variable for which Simpson's paradox manifests may be correlated with legal and regulatory characteristics of the jurisdiction (such as whether the jurisdiction is a tax haven, scores badly on corruption indices, etc.). Similarly, in the context of Research Question 2, we note that the centralities of the various entity classes within a national jurisdiction can also be studied using a similar experimental framework. In principle, such experiments are similar to those conducted in support of Research Question 1, but limited to the set of entities {u|u(j) = J} given a particular jurisdiction J (e.g., the USA). Heatmaps could be constructed per country, and the jurisdictional properties of countries could be studied comparatively by studying individual heat map differences. Finally, we also plan to look at the community structure at both the local and aggregated levels to examine similar and dissimilar behaviors, in a similar vein to [20]. The core-periphery structure of the network using hierarchical measures (similar to [22]) is yet another relevant aspect of future work, since identifying how nodes are positioned in the Panama Papers network could provide useful insights.
Funding Statement
The author(s) received no specific funding for this work.
References
1. Obermayer B, Obermaier F. The Panama Papers: Breaking the story of how the rich and powerful hide their money. Oneworld Publications; 2016.
2. Bavelas A. A mathematical model for group structures. Applied anthropology. 1948;7(3):16–30.
3. Das K, Samanta S, Pal M. Study on centrality measures in social networks: a survey. Social Network Analysis and Mining. 2018;8(1):13. doi:10.1007/s13278-018-0493-2
4. Freeman LC. Centrality in social networks conceptual clarification. Social networks. 1978;1(3):215–239. doi:10.1016/0378-8733(78)90021-7
5. Wagner CH. Simpson's paradox in real life. The American Statistician. 1982;36(1):46–48. doi:10.2307/2684093
6. Trautman LJ. Following the money: lessons from the Panama Papers: part 1: tip of the iceberg. Penn St L Rev. 2016;121:807.
7. Ellis J. Corruption, Social Sciences and the Law: Exploration across the disciplines. Routledge; 2019.
8. O'Donovan J, Wagner HF, Zeume S. The value of offshore secrets: evidence from the panama papers. The Review of Financial Studies. 2019;32(11):4117–4155. doi:10.1093/rfs/hhz017
9. Cooley A, Heathershaw J, Sharman J. The rise of kleptocracy: Laundering cash, whitewashing reputations. Journal of Democracy. 2018;29(1):39–53. doi:10.1353/jod.2018.0003
10. Neu D, Saxton G, Everett J, Shiraz AR. Speaking truth to power: Twitter reactions to the panama papers. Journal of Business Ethics. 2020;162(2):473–485. doi:10.1007/s10551-018-3997-9
11. Nerudova D, Solilova V, Litzman M, Janskỳ P. International tax planning within the structure of corporate entities owned by the shareholder-individuals through Panama Papers destinations. Development Policy Review. 2020;38(1):124–139. doi:10.1111/dpr.12403
12. Rabab'Ah A, Al-Ayyoub M, Shehab MA, Jararweh Y, Jansen BJ. Using the panama papers to explore the financial networks of the middle east. In: 2016 11th International Conference for Internet Technology and Secured Transactions (ICITST). IEEE; 2016. p. 92–97.
13. Wiedemann G, Yimam SM, Biemann C. A Multilingual Information Extraction Pipeline for Investigative Journalism. arXiv preprint arXiv:180900221. 2018.
14. Kejriwal M, Dang A. Structural studies of the global networks exposed in the Panama papers. Applied Network Science. 2020;5(1):1–24.
15. Joaristi M, Serra E, Spezzano F. Detecting suspicious entities in Offshore Leaks networks. Social Network Analysis and Mining. 2019;9(1):62. doi:10.1007/s13278-019-0607-5
16. Understanding Information Centrality Metric: A Simulation Approach. CoRR. 2018;abs/1812.01292.
17. Abbasi A, Chung KSK, Hossain L. Egocentric analysis of co-authorship network structure, position and performance. Information Processing & Management. 2012;48(4):671–679. doi:10.1016/j.ipm.2011.09.001
18. Lü L, Chen D, Ren XL, Zhang QM, Zhang YC, Zhou T. Vital nodes identification in complex networks. Physics Reports. 2016;650:1–63. doi:10.1016/j.physrep.2016.06.007
19. Ghalmane Z, El Hassouni M, Cherifi C, Cherifi H. Centrality in modular networks. EPJ Data Science. 2019;8(1):15. doi:10.1140/epjds/s13688-019-0195-7
20. Ghalmane Z, Cherifi C, Cherifi H, El Hassouni M. Centrality in complex networks with overlapping community structure. Scientific reports. 2019;9(1):1–29.
21. Sciarra C, Chiarotti G, Laio F, Ridolfi L. A change of perspective in network centrality. Scientific reports. 2018;8(1):1–9.
22. Rajeh S, Savonnet M, Leclercq E, Cherifi H. Interplay between hierarchy and centrality in complex networks. IEEE Access. 2020;8:129717–129742. doi:10.1109/Access.2020.3009525
24. Hagberg A, Swart P, S Chult D. Exploring network structure, dynamics, and function using NetworkX. Los Alamos National Lab.(LANL), Los Alamos, NM (United States); 2008.
25. Stephenson K, Zelen M. Rethinking centrality: Methods and examples. Social networks. 1989;11(1):1–37. doi:10.1016/0378-8733(89)90016-6
26. Brandes U, Fleischer D. Centrality measures based on current flow. In: Annual symposium on theoretical aspects of computer science. Springer; 2005. p. 533–544.
27. Wasserman S, Faust K, et al. Social network analysis: Methods and applications. vol. 8. Cambridge university press; 1994.
28. Brandes U, Pich C. Centrality estimation in large networks. International Journal of Bifurcation and Chaos. 2007;17(07):2303–2318. doi:10.1142/S0218127407018403
29. Newman ME. A measure of betweenness centrality based on random walks. Social networks. 2005;27(1):39–54. doi:10.1016/j.socnet.2004.11.009
30. Bauer F, Lizier JT. Identifying influential spreaders and efficiently estimating infection numbers in epidemic models: A walk counting approach. EPL (Europhysics Letters). 2012;99(6):68007. doi:10.1209/0295-5075/99/68007
31. Fleming DK, Hayuth Y. Spatial characteristics of transportation hubs: centrality and intermediacy. Journal of transport geography. 1994;2(1):3–18. doi:10.1016/0966-6923(94)90030-2
32. Goltz HH, Smith ML. Yule-Simpson's paradox in research. Practical Assessment, Research, and Evaluation. 2010;15(1):15.
33. Šikić M, Lančić A, Antulov-Fantulin N, Štefančić H. Epidemic centrality - is there an underestimated epidemic impact of network peripheral nodes? The European Physical Journal B. 2013;86(10):1–13.
34. Borgatti SP, Everett MG. A graph-theoretic perspective on centrality. Social networks. 2006;28(4):466–484. doi:10.1016/j.socnet.2005.11.005
Articles from PLoS One are provided here courtesy of Public Library of Science
Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7993786/