Big Data: Principles and best practices of scalable realtime data systems

Nathan Marz

Language: English

Pages: 328

ISBN: 1617290343

Format: PDF / Kindle (mobi) / ePub

Summary

Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data. It describes a scalable, easy-to-understand approach to big data systems that can be built and run by a small team. Following a realistic example, this book guides readers through the theory of big data systems, how to implement them in practice, and how to deploy and operate them once they're built.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the Book

Web-scale applications like social networks, real-time analytics, or e-commerce sites deal with a lot of data, whose volume and velocity exceed the limits of traditional database systems. These applications require architectures built around clusters of machines to store and process data of any size or speed. Fortunately, scale and simplicity are not mutually exclusive.

Big Data teaches you to build big data systems using an architecture designed specifically to capture and analyze web-scale data. This book presents the Lambda Architecture, a scalable, easy-to-understand approach that can be built and run by a small team. You'll explore the theory of big data systems and how to implement them in practice. In addition to discovering a general framework for processing big data, you'll learn specific technologies like Hadoop, Storm, and NoSQL databases.

This book requires no previous exposure to large-scale data analysis or NoSQL tools. Familiarity with traditional databases is helpful.

What's Inside

Introduction to big data systems
Real-time processing of web-scale data
Tools like Hadoop, Cassandra, and Storm
Extensions to traditional database skills

About the Authors

Nathan Marz is the creator of Apache Storm and the originator of the Lambda Architecture for big data systems. James Warren is an analytics architect with a background in machine learning and scientific computing.

Table of Contents

A new paradigm for Big Data

PART 1 BATCH LAYER

Data model for Big Data
Data model for Big Data: Illustration
Data storage on the batch layer
Data storage on the batch layer: Illustration
Batch layer
Batch layer: Illustration
An example batch layer: Architecture and algorithms
An example batch layer: Implementation

PART 2 SERVING LAYER

Serving layer
Serving layer: Illustration

PART 3 SPEED LAYER

Realtime views
Realtime views: Illustration
Queuing and stream processing
Queuing and stream processing: Illustration
Micro-batch stream processing
Micro-batch stream processing: Illustration
Lambda Architecture in depth

The Intelligent Web: Search, Smart Algorithms, and Big Data

Computational Aspects of Cooperative Game Theory (Synthesis Lectures on Artificial Intelligence and Machine Learning)

Working With Unix Processes

Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (2nd Edition) (Data-Centric Systems and Applications)

Introduction to Reversible Computing

Data-driven Generation of Policies

System being linearly scalable. A linearly scalable system can maintain performance under increased load by adding resources in proportion to the increased load. A nonlinearly scalable system, despite being “scalable,” isn’t particularly useful. Suppose the number of machines you need in relation to the load on your system has a quadratic relationship, like in figure 6.8. The costs of running your system would rise dramatically over time. Increasing your load ten-fold.
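To make the difference concrete, here is a minimal arithmetic sketch. It assumes a hypothetical system that needs 10 machines at its current load; that baseline and the class and method names are illustrative and do not come from the book.

```java
// Illustrative sketch: machine counts under linear vs. quadratic scaling.
// The baseline of 10 machines is an assumption made up for this example.
final class ScalingSketch {

    // Linear scaling: machines grow in direct proportion to the load multiple.
    static long machinesLinear(long baseMachines, long loadFactor) {
        return baseMachines * loadFactor;
    }

    // Quadratic scaling: machines grow with the square of the load multiple.
    static long machinesQuadratic(long baseMachines, long loadFactor) {
        return baseMachines * loadFactor * loadFactor;
    }

    public static void main(String[] args) {
        long base = 10;
        for (long factor : new long[] {1, 2, 10}) {
            System.out.printf("load x%d: linear=%d, quadratic=%d%n",
                    factor,
                    machinesLinear(base, factor),
                    machinesQuadratic(base, factor));
        }
    }
}
```

Under these assumptions, a ten-fold load increase takes the linear system from 10 to 100 machines, while the quadratic one would need 1,000.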

Look at a couple of examples. The first example uses the Count aggregator to find the number of people each person follows:

new Subquery("?person", "?count")
    .predicate(FOLLOWS, "?person", "_")
    .predicate(new Count(), "?count");

The underscore informs JCascalog to ignore this field. The output field names define all potential groupings. When executing the aggregator, the output fields imply tuples should be grouped by ?person.
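For readers new to this style of query, the grouping it implies can be sketched in plain Java. This is only an analogy using the standard streams API, not JCascalog code; the Follows record and the sample data are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class FollowCountSketch {

    // Hypothetical stand-in for one tuple of the FOLLOWS dataset.
    record Follows(String person, String followed) {}

    // Group tuples by ?person and count each group. The followed field is
    // never read, mirroring the "_" placeholder in the query.
    static Map<String, Long> countFollows(List<Follows> tuples) {
        return tuples.stream().collect(
                Collectors.groupingBy(Follows::person, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Follows> data = List.of(
                new Follows("alice", "bob"),
                new Follows("alice", "carol"),
                new Follows("bob", "alice"));
        System.out.println(countFollows(data)); // prints {alice=2, bob=1}
    }
}
```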

To this end, JCascalog exposes simple interfaces to define new filters, functions, and aggregators. Most importantly, this is all done with regular Java code by implementing the appropriate interfaces.

FILTERS

We’ll begin with filters. A filter predicate requires a single method named isKeep that returns true if the input tuple should be kept, and false if it should be filtered. The following is a filter that keeps all tuples where the input is greater than 10: public static class.
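The excerpt stops before the filter's code. As a rough sketch of the isKeep idea only, here is a plain-Java version. The TupleFilter interface below is a hypothetical stand-in invented for this example and is not JCascalog's real filter class, whose signature the excerpt does not show.

```java
import java.util.ArrayList;
import java.util.List;

final class FilterSketch {

    // Hypothetical stand-in for a filter predicate; NOT JCascalog's interface.
    interface TupleFilter {
        boolean isKeep(int input);
    }

    // Keeps inputs strictly greater than 10, as described in the text.
    static final class GreaterThanTen implements TupleFilter {
        @Override
        public boolean isKeep(int input) {
            return input > 10;
        }
    }

    // Apply a filter to each input, retaining only the values it keeps.
    static List<Integer> apply(TupleFilter filter, List<Integer> inputs) {
        List<Integer> kept = new ArrayList<>();
        for (int value : inputs) {
            if (filter.isKeep(value)) {
                kept.add(value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(apply(new GreaterThanTen(), List.of(3, 10, 11, 42))); // prints [11, 42]
    }
}
```

Note that 10 itself is dropped, since the condition is "greater than 10" rather than "at least 10".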

SuperWebAnalytics.com. SuperWebAnalytics.com is a more sophisticated and realistic example that’s intended to really demonstrate the intricacies of batch computation in terms of architecture, algorithms, and implementation.

An example batch layer: Architecture and algorithms

This chapter covers
■ Building a batch layer from end to end
■ Practical examples of precomputation
■ Iterative graph algorithms
■ HyperLogLog for efficient.

Batch view should aggregate the pageviews for each URL at hourly, daily, 7-day, and 28-day granularities. The approach you’ll take is to first aggregate the pageviews at an hourly granularity. This will reduce the size of the data by many orders of magnitude. Afterward, you’ll roll up the hourly values to obtain the counts for the larger buckets. The latter operations will be much faster due to the smaller size of the input. Let’s start with the pipe diagram to compute the number of pageviews at.
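The roll-up step described above can be sketched in plain Java for a single URL. This is an illustration under stated assumptions, not the book's pipeline: hours are numbered from zero, larger buckets are fixed windows aligned to hour zero, and the rollUp helper and sample counts are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

final class RollupSketch {

    // Sum hourly pageview counts into buckets of hoursPerBucket hours each.
    // Keys of the input are hour indexes; keys of the result are bucket indexes.
    static Map<Long, Long> rollUp(Map<Long, Long> hourlyCounts, int hoursPerBucket) {
        Map<Long, Long> rolled = new TreeMap<>();
        for (Map.Entry<Long, Long> entry : hourlyCounts.entrySet()) {
            long bucket = entry.getKey() / hoursPerBucket;
            rolled.merge(bucket, entry.getValue(), Long::sum);
        }
        return rolled;
    }

    // Made-up hourly counts for one URL: hour index -> pageviews.
    static Map<Long, Long> sampleHourly() {
        Map<Long, Long> hourly = new TreeMap<>();
        hourly.put(0L, 5L);
        hourly.put(1L, 3L);
        hourly.put(25L, 7L);
        hourly.put(170L, 2L);
        return hourly;
    }

    public static void main(String[] args) {
        Map<Long, Long> hourly = sampleHourly();
        System.out.println(rollUp(hourly, 24));      // daily:  {0=8, 1=7, 7=2}
        System.out.println(rollUp(hourly, 24 * 7));  // 7-day:  {0=15, 1=2}
        System.out.println(rollUp(hourly, 24 * 28)); // 28-day: {0=17}
    }
}
```

Because each roll-up reads only the already-aggregated hourly counts, its input is far smaller than the raw pageview data, which is the speed-up the passage refers to.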

