Collection resources
Frieze, Alan; Vera, Juan; Chakrabarti, Soumen
There is much current interest in the evolution of social networks, in particular, the World Wide Web graph, through time. ``Preferential attachment'' and the ``copying model'' are well-known models that explain the observed power-law degree distribution of the graph reasonably well. However, existing evolution models do not include the significant influence of search engines on how webpage authors find existing pages and create links to them. Recent applied work has raised the concern that highly popular search engines limit the attention of authors to a small set of ``celebrity'' URLs, for any query. Page authors frequently (with probability $p$) locate...
Erlebach, Thomas; Hall, Alexander; Panconesi, Alessandro; Vukadinović, Danica
In the valley-free path model, a path in a given directed graph is valid if it consists of a sequence of forward edges followed by a sequence of backward edges. This model is motivated by routing policies of autonomous systems in the Internet. We give a $2$-approximation algorithm for the problem of computing a maximum number of edge- or vertex-disjoint valid paths between two given vertices $s$ and $t$, and we show that no better approximation ratio is possible unless $P=NP$. Furthermore, we give a $2$-approximation algorithm for the problem of computing a minimum vertex cut that separates $s$ and...
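To make the valley-free condition concrete, here is a minimal validity checker for this model (an illustrative sketch; the identifiers are not from the paper). An edge present in both directions is treated as forward for as long as the forward phase lasts, which is never worse than switching early:

```python
def is_valley_free(edges, path):
    """Check whether `path` (a sequence of vertices) is valid in the
    valley-free model: a run of forward edges of the directed graph
    `edges` (a set of (u, v) pairs) followed by a run of backward edges."""
    seen_backward = False
    for u, v in zip(path, path[1:]):
        if (u, v) in edges and not seen_backward:
            continue                 # forward edge, still in the forward phase
        if (v, u) in edges:
            seen_backward = True     # backward edge starts/continues phase two
            continue
        return False                 # a forward edge after a backward one,
                                     # or no edge at all: path is invalid
    return True
```

For example, with edges $\{(1,2),(2,3),(4,3)\}$ the path $1,2,3,4$ is valid (two forward steps, then one backward step), while any path that uses a forward edge after a backward one is rejected.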
Charbit, Pierre; Scott, Alex D.
Motivated by copying models of the web graph, Bonato and Janssen introduced the following simple construction: given a graph $G$, for each vertex $x$ and each subset $X$ of its closed neighborhood, add a new vertex $y$ whose neighbors are exactly $X$. Iterating this construction yields a limit graph $\uparrow G$. Bonato and Janssen claimed that the limit graph is independent of $G$, and it is known as the infinite locally random graph. We show that this picture is incorrect: there are in fact infinitely many isomorphism classes of limit graph, and we give a classification. We also consider the...
Bao, Ying; Feng, Guang; Liu, Tie-Yan; Ma, Zhi-Ming; Wang, Ying
In this paper we suggest evaluating the importance of a website by the mean frequency with which the random-surfer Markov chain on the Internet graph visits that website. We show that this mean frequency equals the sum of the PageRanks of all the webpages in the website (hence referred to as PageRankSum), and we propose a novel algorithm, AggregateRank, based on the theory of stochastic complementation, to calculate the rank of a website. The AggregateRank algorithm approximates the PageRankSum accurately, while its computational complexity is much lower than that of computing the PageRankSum directly. By constructing...
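The PageRankSum notion itself is easy to illustrate: compute ordinary PageRank by power iteration and sum the scores of the pages belonging to each website. The naive sketch below is exactly what AggregateRank is designed to approximate without computing the full PageRank vector (all identifiers are illustrative):

```python
def pagerank(links, d=0.85, iters=100):
    """Plain power-iteration PageRank with uniform teleportation.
    `links` maps every page to its list of out-links; every page
    must appear as a key (possibly with an empty list)."""
    pages = sorted(links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = d * r[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:                     # dangling page: spread mass uniformly
                for q in pages:
                    new[q] += d * r[p] / n
        r = new
    return r

def pagerank_sum(links, site_of):
    """PageRankSum: the importance of a website is the sum of the
    PageRanks of its pages; `site_of` maps page -> website."""
    sums = {}
    for page, score in pagerank(links).items():
        sums[site_of[page]] = sums.get(site_of[page], 0.0) + score
    return sums
```

On a toy graph the per-site sums add up to 1, since PageRank is a probability distribution over pages.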
Gleich, David; Polito, Marzia
In this paper, we consider the problem of calculating fast and accurate approximations to the personalized PageRank score of a webpage. We focus on techniques to improve speed by limiting the amount of web graph data we need to access.
¶ Our algorithms provide both the approximation to the personalized PageRank score as well as guidance in using only the necessary information—and therefore sensibly reduce not only the computational cost of the algorithm but also the memory and memory bandwidth requirements. We report experiments with these algorithms on web graphs of up to 118 million pages and prove a theoretical approximation...
Grünbaum, F. Alberto; Matusevich, Laura Felicia
We consider the problem of determining the unknown characteristics of a random routing strategy from end-to-end measurements. More specifically, we construct a Markov chain that models the traffic of messages in a multiterminal network consisting of input, intermediate, and output terminals. The topology of the network is assumed to be known, but the Markovian routing strategy is not known. We solve the problem of determining the unknown one-step transition probability matrix of our random walk from input-output measurements of ``travel time.'' We give explicit inversion formulas (up to a natural gauge) in a nontrivial example. The result holds for a...
Avrachenkov, Konstantin; Lebedev, Dmitri
PageRank is one of the principal criteria according to which Google ranks webpages. PageRank can be interpreted as a frequency of webpage visits by a random surfer, and thus it reflects the popularity of a webpage. In the present work we find an analytical expression for the expected PageRank value in a scale-free growing network model as a function of the age of the growing network and the age of a particular node. Then, we derive asymptotics that show that PageRank follows closely a power law in the middle range of its values. The exponent of the theoretical power law...
Flaxman, Abraham D.; Frieze, Alan M.; Vera, Juan
We study a random graph $G_n$ that combines certain aspects of geometric random graphs and preferential attachment graphs. The vertices of $G_n$ are $n$ sequentially generated points $x_1,x_2,\ldots,x_n$ chosen uniformly at random from the unit sphere in $\mathbb{R}^3$. After generating $x_t$, we randomly connect that point to $m$ points from those points in $x_1,x_2,\ldots,x_{t-1}$ that are within distance $r$ of $x_t$. Neighbors are chosen with probability proportional to their current degree, and a parameter $\alpha$ biases the choice towards self loops. We show that if $m$ is sufficiently large, if $r\geq \ln n / n^{1/2-\beta}$ for some constant $\beta$, and...
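A simplified sketch of such a geometric preferential-attachment process follows (illustrative only; in particular, the paper's exact treatment of the self-loop bias $\alpha$ may differ from the simplification used here):

```python
import math
import random

def random_sphere_point():
    """Uniform point on the unit sphere in R^3, by normalizing a
    standard Gaussian vector."""
    while True:
        v = [random.gauss(0, 1) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 1e-12:
            return tuple(c / norm for c in v)

def geometric_pa_graph(n, m, r, alpha=1.0):
    """Sketch of the process: each new point connects to m earlier
    points within distance r, chosen with probability proportional to
    current degree; `alpha` is extra weight for a self loop (a
    simplification of the bias parameter in the abstract)."""
    points, degree, edges = [], [], []
    for t in range(n):
        points.append(random_sphere_point())
        degree.append(0)
        nearby = [s for s in range(t)
                  if math.dist(points[s], points[t]) <= r]
        for _ in range(m):
            # roulette-wheel choice over nearby degrees, plus alpha
            # weight reserved for a self loop at the new vertex
            pick = random.uniform(0, alpha + sum(degree[s] for s in nearby))
            target = t
            for s in nearby:
                if pick < degree[s]:
                    target = s
                    break
                pick -= degree[s]
            edges.append((t, target))
            degree[t] += 1
            degree[target] += 1
    return edges
```

With $r \geq 2$ (the sphere's diameter) every earlier point is a candidate and the process degenerates to ordinary preferential attachment with self-loop bias.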
Fontoura, Marcus; Lempel, Ronny; Qi, Runping; Zien, Jason
Today's search engines are increasingly required to broaden their capabilities beyond free-text search. More complex features, such as supporting range constraints over numeric data, are becoming common; structured search over XML data will soon follow. This is particularly true in the enterprise search domain, where engines attempt to integrate data from the web and corporate knowledge portals with data residing in proprietary databases. In this paper we extend previous schemes by which an inverted-index-based search engine can efficiently support queries that contain numeric restrictions in addition to standard, free-text portions. Furthermore, we analyze both the known schemes and our extensions...
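One well-known family of schemes of this kind indexes each numeric value at several binary-prefix granularities, so that any range restriction becomes a union of $O(\log)$ index terms. The toy sketch below illustrates the idea (it is not necessarily one of the specific schemes analyzed in the paper):

```python
from collections import defaultdict

LEVELS = 8  # support non-negative values below 2**(LEVELS + 1)

def numeric_terms(value):
    """Posting terms for one value: its binary prefix at each level."""
    return [(level, value >> level) for level in range(LEVELS + 1)]

def index_docs(docs):
    """Build an inverted index; `docs` maps doc-id -> numeric value."""
    postings = defaultdict(set)
    for doc_id, value in docs.items():
        for term in numeric_terms(value):
            postings[term].add(doc_id)
    return postings

def range_terms(lo, hi):
    """Cover [lo, hi] with O(log) prefix terms, working bottom-up as in
    a segment-tree range decomposition."""
    terms, level = [], 0
    while lo <= hi:
        if lo & 1:                    # lo is a right child: take it alone
            terms.append((level, lo)); lo += 1
        if hi & 1 == 0:               # hi is a left child: take it alone
            terms.append((level, hi)); hi -= 1
        lo >>= 1; hi >>= 1; level += 1
    return terms

def range_query(postings, lo, hi):
    """Documents whose value lies in [lo, hi]: a union of term postings."""
    result = set()
    for term in range_terms(lo, hi):
        result |= postings.get(term, set())
    return result
```

For instance, the range $[0, 20]$ decomposes into just three terms (covering $[0,15]$, $[16,19]$, and $\{20\}$), regardless of how many documents match.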
Mihail, Milena; Saberi, Amin; Tetali, Prasad
We show that in power law random graphs, almost surely, the expected rate at which a random walk with lookahead discovers the nodes of the graph is sublinear.
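The quantity in question is easy to simulate: a random walk with lookahead 1 discovers, at each step, its current node together with all of that node's neighbors. A toy simulation (identifiers are illustrative):

```python
import random

def discovered_with_lookahead(adj, steps, start=0):
    """Random walk with lookahead 1 on the adjacency dict `adj`:
    at every step the walk discovers its current node and all of its
    neighbors.  Returns the count of discovered nodes after each step."""
    seen = {start} | set(adj[start])
    cur = start
    counts = [len(seen)]
    for _ in range(steps):
        cur = random.choice(adj[cur])
        seen.add(cur)
        seen.update(adj[cur])
        counts.append(len(seen))
    return counts
```

Plotting such counts against the number of steps on a power-law graph is one way to observe the sublinear discovery rate the abstract refers to.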
Allavena, André; Dasgupta, Anirban; Hopcroft, John; Kumar, Ravi
While several analytic models aim to explain the existence of short paths in social networks such as the web, relatively few address the problem of efficiently finding them, especially in a decentralized manner. Since developing purely decentralized search algorithms in general social-network models appears hard, we relax the notion of decentralized search by allowing the option of storing a small amount of preprocessed information about the network. We show that one can identify a small set of vertices in an undirected social network so that connectivity information of the vertices in this set can be used in conjunction with the...
Chung, Fan; Lu, Linyuan
We examine a number of generalized and extended versions of concentration inequalities and martingale inequalities. These inequalities are effective for analyzing processes with quite general conditions as illustrated in an example for an infinite Polya process and web graphs.
Chakrabarti, Amit; Do Ba, Khanh; Muthukrishnan, S.
We consider the problem of computing information-theoretic functions, such as entropy, on a data stream, using sublinear space.
¶ Our first result deals with a measure we call the entropy norm of an input stream: it is closely related to entropy but is structurally similar to the well-studied notion of frequency moments. We give a polylogarithmic-space, one-pass algorithm for estimating this norm under certain conditions on the input stream. We also prove a lower bound that rules out such an algorithm if these conditions do not hold.
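The entropy norm of a stream with item frequencies $f_i$ is $\sum_i f_i \log f_i$. An AMS-style estimator for sums of this shape samples a uniformly random stream position and counts recurrences of the sampled item from that position onward. The sketch below is a two-pass illustration of that idea only; the paper's algorithms are one-pass and use polylogarithmic space:

```python
import math
import random
from collections import Counter

def entropy_norm(stream):
    """Exact entropy norm: sum over items of f * log2(f)."""
    return sum(f * math.log2(f) for f in Counter(stream).values())

def estimate_entropy_norm(stream, samples=3000):
    """AMS-style estimator.  Sample position j, let r be the number of
    occurrences of stream[j] at positions j..n-1; then
    n * (r log r - (r-1) log(r-1)) is an unbiased estimate of the
    entropy norm (the inner differences telescope over each item's
    occurrences).  Average many independent samples."""
    n = len(stream)
    def one_sample():
        j = random.randrange(n)
        r = sum(1 for k in range(j, n) if stream[k] == stream[j])
        prev = (r - 1) * math.log2(r - 1) if r > 1 else 0.0
        return n * (r * math.log2(r) - prev)
    return sum(one_sample() for _ in range(samples)) / samples
```

Unbiasedness follows because averaging over the $f_i$ positions of item $i$ telescopes the differences down to $f_i \log f_i$.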
¶ Our second group of results is for estimating the empirical entropy of an input...
Berkhin, Pavel
We introduce a novel bookmark-coloring algorithm (BCA) that computes authority weights over web pages using the web hyperlink structure. The computed vector (BCV) is similar to the PageRank vector defined for page-specific teleportation. Meanwhile, BCA is very fast, and the BCV is sparse. BCA also has important algebraic properties: if the BCVs corresponding to a set of pages (called a hub) are known, they can be leveraged in computing an arbitrary BCV via a straightforward algebraic process, and hub BCVs can themselves be computed and encoded efficiently.
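The bookmark-coloring metaphor can be sketched as a "push" process: paint injected at the bookmark page is partly retained (a fraction $\alpha$, yielding the sparse authority vector) and partly distributed to out-neighbors, until every page's undistributed paint falls below a threshold. The sketch below is a simplification; details such as the processing order and parameter conventions are illustrative, not taken from the paper:

```python
def bookmark_coloring(links, seed, alpha=0.15, eps=1e-6):
    """Push-style sketch of bookmark coloring.  `links` maps each page
    to its out-links; paint starts at `seed`.  Each push retains
    alpha * w at the page and distributes (1 - alpha) * w to its
    out-neighbors; pages with residual paint below `eps` are left
    untouched, which keeps the result vector sparse."""
    retained = {}              # the (sparse) bookmark-coloring vector
    wet = {seed: 1.0}          # paint not yet distributed
    while wet:
        page, w = max(wet.items(), key=lambda kv: kv[1])
        if w < eps:
            break
        del wet[page]
        retained[page] = retained.get(page, 0.0) + alpha * w
        outs = links.get(page, [])
        if outs:
            share = (1 - alpha) * w / len(outs)
            for q in outs:
                wet[q] = wet.get(q, 0.0) + share
    return retained
```

On a graph without dangling pages the paint is conserved, so the retained masses approach a probability distribution as `eps` shrinks; only pages actually reached by paint appear in the result.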
Łuczak, Tomasz; Prałat, Paweł
We propose a new random model of web graphs in which the degree of a vertex depends on its age. We characterize the degree sequence of this model and study its behaviour near the connectivity threshold.
Pandurangan, Gopal; Raghavan, Prabhakar; Upfal, Eli
Recent work on modeling the web graph has dwelt on capturing the degree distributions observed on the web. Pointing out that this represents a heavy reliance on “local” properties of the web graph, we study the distribution of PageRank values on the web. Our measurements suggest that PageRank values on the web follow a power law. We then develop generative models for the web graph that explain this observation and moreover remain faithful to previously studied degree distributions. We analyze these models and compare the analysis to both snapshots from the web and to graphs generated by simulations on the...
Mitzenmacher, Michael
I argue that power law research must move from focusing on observation, interpretation, and modeling of power law behavior to instead considering the challenging problems of validation of models and control of systems.
Li, Lun; Alderson, David; Doyle, John C.; Willinger, Walter
There is a large, popular, and growing literature on ``scale-free'' networks, with the Internet and metabolic networks perhaps the canonical examples. While this has in many ways reinvigorated graph theory, there is unfortunately no consistent, precise definition of scale-free graphs and few rigorous proofs of many of their claimed properties. In fact, it is easily shown that the existing theory has many inherent contradictions and that the most celebrated claims regarding the Internet and biology are verifiably false. In this paper, we introduce a structural metric that allows us to differentiate between all simple, connected graphs having an identical...
Boldi, Paolo; Vigna, Sebastiano
We introduce a new family of simple, complete instantaneous codes for positive integers, called ζ codes, which are suitable for integers distributed as a power law with small exponent (smaller than 2). The main motivation for the introduction of ζ codes comes from web-graph compression: if nodes are numbered according to URL lexicographical order, gaps in successor lists are distributed according to a power law with small exponent. We give estimates of the expected length of ζ codes against power-law distributions, and compare the results with analogous estimates for the more classical γ, δ and variable-length block codes.
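For intuition: the $\zeta_k$ code of a positive integer $x$ with $2^{hk} \le x < 2^{(h+1)k}$ is the unary code of $h+1$ followed by a minimal (truncated) binary code of $x - 2^{hk}$; in particular $\zeta_1$ coincides with the $\gamma$ code. A sketch of the encoder, producing bit lists rather than packed bit streams for clarity:

```python
def to_bits(v, width):
    """The value v written in exactly `width` bits, most significant first."""
    return [(v >> i) & 1 for i in range(width - 1, -1, -1)]

def unary(h):
    """Unary code of h >= 1: h - 1 zeroes followed by a one."""
    return [0] * (h - 1) + [1]

def minimal_binary(v, z):
    """Truncated (minimal) binary code of v in the interval [0, z):
    the 2**s - z smallest values get s - 1 bits, the rest get s bits,
    where s = ceil(log2 z)."""
    if z <= 1:
        return []
    s = (z - 1).bit_length()          # ceil(log2 z) for z >= 2
    short = (1 << s) - z              # how many values use s - 1 bits
    if v < short:
        return to_bits(v, s - 1)
    return to_bits(v + short, s)

def zeta(x, k):
    """zeta_k code of the positive integer x (as a list of bits)."""
    h = (x.bit_length() - 1) // k     # largest h with 2**(h*k) <= x
    low, high = 1 << (h * k), 1 << ((h + 1) * k)
    return unary(h + 1) + minimal_binary(x - low, high - low)

def gamma(x):
    """Elias gamma code, for comparison: zeta_1 reproduces it exactly."""
    b = x.bit_length() - 1
    return unary(b + 1) + to_bits(x - (1 << b), b)
```

For example, $\zeta_2(2)$ is 110 and $\zeta_2(3)$ is 111, since both fall in the first bucket $[1, 4)$, whose interval of size 3 gives 1 the short code and 2, 3 the long codes.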
Boldi, Paolo; Santini, Massimo; Vigna, Sebastiano
Deciding which kind of visiting strategy accumulates high-quality pages more quickly is one of the most often debated issues in the design of web crawlers.
¶ This paper proposes a related, and previously overlooked, measure of effectiveness for crawl strategies: whether the graph obtained after a partial visit is in some sense representative of the underlying web graph as far as the computation of PageRank is concerned. More precisely, we are interested in determining how rapidly the computation of PageRank over the visited subgraph yields node orders that agree with the ones computed in the complete graph; orders are compared using...