Data Science at the Command Line: Facing the Future with Time-Tested Tools

Language: English

Pages: 212

ISBN: 1491947853

Format: PDF / Kindle (mobi) / ePub


This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data.

To get you started—whether you’re on Windows, OS X, or Linux—author Jeroen Janssens introduces the Data Science Toolbox, an easy-to-install virtual environment packed with over 80 command-line tools.

Discover why the command line is an agile, scalable, and extensible technology. Even if you’re already comfortable processing data with, say, Python or R, you’ll greatly improve your data science workflow by also leveraging the power of the command line.

  • Obtain data from websites, APIs, databases, and spreadsheets
  • Perform scrub operations on plain text, CSV, HTML/XML, and JSON
  • Explore data, compute descriptive statistics, and create visualizations
  • Manage your data science workflow using Drake
  • Create reusable tools from one-liners and existing Python or R code
  • Parallelize and distribute data-intensive pipelines using GNU Parallel
  • Model data with dimensionality reduction, clustering, regression, and classification algorithms

Architecture Principles: The Cornerstones of Enterprise Architecture (The Enterprise Engineering Series)

Database Systems: Design, Implementation and Management (11th Edition)

Distributed Computing Through Combinatorial Topology

Operating System Concepts (7th Edition)

Randomized Algorithms

Real-Time Collision Detection

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Mike Loukides, Ann Spencer, and Marie Beaugureau
Production Editor: Matthew Hacker
Copyeditor: Kiel Van Horn

Not have to be a problem when the input data is finite, like a file. However, when the input data is a nonstop stream, such blocking command-line tools are useless. Luckily, Python and R can both process data in a streaming manner. You can apply a function on a line-per-line basis, for example. Examples 4-7 and 4-8 are two minimal examples that demonstrate how this works in Python and R, respectively. They compute the square of every integer that is piped to them. Example 4-7.
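
The excerpt above breaks off before the book's own listing. What follows is a minimal sketch of the line-per-line idea it describes, not the book's Example 4-7. The helper name `squares` and the script name `square.py` are assumptions made here for illustration. The script reads standard input one line at a time and prints each square immediately, so it never waits for the input to end.

```python
#!/usr/bin/env python3
import sys


def squares(lines):
    """Yield the square of each integer, consuming one line at a time."""
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines instead of failing on int("").
            continue
        yield int(line) ** 2


def main():
    # Iterating over sys.stdin yields lines as they arrive, so this also
    # works on a nonstop stream. flush=True pushes every result downstream
    # right away instead of holding it in an output buffer.
    for value in squares(sys.stdin):
        print(value, flush=True)


if __name__ == "__main__":
    main()
```

Saved as the hypothetical `square.py`, it could be used in a pipeline such as `seq 1 5 | python3 square.py`. Because `squares` is a generator, nothing is accumulated in memory between lines.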

Philosophy, which makes several appearances throughout the book. Once you become familiar with the command line, and learn how to combine command-line tools, you will have developed an invaluable skill—and if you can create new tools, you’ll be a cut above.

How to Read This Book

In general, you’re advised to read this book in a linear fashion. Once a concept or command-line tool has been introduced, chances are that we employ it in a later chapter. For example, in Chapter 9,.

Species

Note that this basic command assumes that the file is delimited by commas. Just as a reminder, if you intend to use this command often, you could define a function in your ~/.bashrc file called, say, names:

    names () { sed -e 's/,/\n/g;q'; }

Which you can then use like this:

    $ < data/investments2.csv names
    company_permalink
    company_name
    company_category_list
    company_market
    company_country_code
    company_state_code
    company_region
    company_city
    investor_permalink

Inspecting Data and Its.

  • How to install it
  • How to obtain help
  • An example usage

All command-line tools listed here are included in the Data Science Toolbox for Data Science at the Command Line. See Chapter 2 for instructions on how to set it up. The install commands assume that you’re running Ubuntu 14.04. Please note that citing open source software is not trivial, and that some information may be missing or incorrect.

alias

Define or display aliases. Alias is a Bash builtin.

    $ help alias
    $ alias ll='ls.
