Guerrilla Analytics: A Practical Approach to Working with Data
Format: PDF / Kindle (mobi) / ePub
Doing data science is difficult. Projects are typically very dynamic, with requirements that change as data understanding grows. The data itself arrives piecemeal, is added to and replaced, contains undiscovered flaws and comes from a variety of sources. Teams also have mixed skill sets and tooling is often limited. Despite these disruptions, a data science team must get off the ground fast and begin demonstrating value with traceable, tested work products. This is when you need Guerrilla Analytics.
In this book, you will learn about:
- The Guerrilla Analytics Principles: simple rules of thumb for maintaining data provenance across the entire analytics life cycle from data extraction, through analysis to reporting
- Reproducible, traceable analytics: how to design and implement work products that are reproducible, testable and stand up to external scrutiny
- Practice tips and war stories: 90 practice tips and 16 war stories based on real-world project challenges encountered in consulting, pre-sales and research
- Preparing for battle: how to set up your team's analytics environment in terms of tooling, skill sets, workflows and conventions
- Data gymnastics: over a dozen analytics patterns that your team will encounter again and again in projects
Text using “sed.”
• When finished, count the number of international phone numbers in the output file using another “grep.”
Availability of a command line is simply essential for many Guerrilla Analytics tasks.
18.5. High-level scripting language
Command-line scripts are quick and dirty. While they can be fully functional programs in their own right, sometimes you want the extra features and ease of use of a high-level scripting language. This is a programming language that is abstracted away.
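The counting step described in this excerpt is done with “grep” on the command line. As a rough sketch of the same step in a high-level scripting language, the Python below counts the lines that contain an international-looking phone number, which is what `grep -c` would report. The excerpt does not define what an international number looks like, so the pattern (a leading "+" followed by 8 to 15 digits, optionally separated by spaces or hyphens) and the sample lines are assumptions made for illustration only.

```python
import re

# Assumption: an "international" number starts with "+" and has 8 to 15 digits,
# optionally separated by single spaces or hyphens. Adjust to the real data.
INTL_PHONE = re.compile(r"\+\d(?:[ -]?\d){7,14}")


def count_international_numbers(lines):
    """Count lines containing at least one international-looking number.

    This mirrors `grep -c`, which counts matching lines, not total matches.
    """
    return sum(1 for line in lines if INTL_PHONE.search(line))


# Hypothetical sample standing in for the output file mentioned in the text.
sample = [
    "Call +353 1 234 5678 for support",
    "Local office: 01 234 5678",
    "Fax: +44-20-7946-0958",
]
print(count_international_numbers(sample))
```

For a real file, the same function can be given an open file handle, since iterating over a file yields its lines.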
File’s output. The code file’s output should appear at the end of the code file (since it is the last thing the code file does) and should be clearly marked as the code file’s output. A simple convention on dataset naming can achieve this. For example, a convention such as prefixing all intermediate dataset names with a character like “_” makes it clear that these datasets are less important to the overall results.
7.4.3. Advantages
When code file outputs are clearly identifiable, the advantages.
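To illustrate the naming convention mentioned in this excerpt, here is a minimal sketch that separates intermediate datasets from a code file's outputs using only the “_” prefix. The function name and the dataset names are hypothetical and do not come from the book.

```python
def split_by_convention(dataset_names, prefix="_"):
    """Separate intermediate datasets from code file outputs by name alone."""
    intermediate = [name for name in dataset_names if name.startswith(prefix)]
    outputs = [name for name in dataset_names if not name.startswith(prefix)]
    return intermediate, outputs


# Hypothetical dataset names created by one code file, in creation order.
created = ["_customers_raw", "_customers_deduped", "customers_clean"]
intermediate, outputs = split_by_convention(created)
print(outputs)
```

Because the convention lives in the name itself, anyone browsing the data store can tell outputs from working datasets without opening the code.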
Members to remember exactly what every interface dataset represents and how it should be used? Where did that data field come from? Which build dataset do I go to for a particular data sample? A traditional approach here would call for a data dictionary document to be maintained so that the team can look it up and understand how best to use build datasets. A Guerrilla Analytics project rarely has time for this approach. Analysts have to stop their analytics work to go off and find documentation.
Different team members perform the data extraction on separate occasions and do not share the format of the data extracts with one another.
• Consistent rates of activity: Data from an earlier and later time period will obviously have different date values. However, you might expect the rate of activities such as sales or logins to be consistent in both datasets. You could also look for consistency in periodic activities. For example, does account closing always happen within the last 2 days of.
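A check for consistent rates of activity could be sketched as follows. This is one possible interpretation, not the book's method: it compares the average number of events per calendar day in two extracts and flags them as consistent if the rates differ by no more than a chosen fraction. The function names, the 20% tolerance, and the sample dates are all assumptions.

```python
from datetime import date


def daily_rate(event_dates):
    """Average number of events per calendar day over the period covered."""
    if not event_dates:
        return 0.0
    span_days = (max(event_dates) - min(event_dates)).days + 1
    return len(event_dates) / span_days


def rates_consistent(earlier, later, tolerance=0.2):
    """True if the two daily rates differ by at most `tolerance` (relative)."""
    rate_a, rate_b = daily_rate(earlier), daily_rate(later)
    if rate_a == 0 and rate_b == 0:
        return True
    return abs(rate_a - rate_b) / max(rate_a, rate_b) <= tolerance


# Hypothetical login dates: two logins a day for five days in each extract,
# with one login missing from the later extract.
earlier = [date(2014, 1, d) for d in range(1, 6) for _ in range(2)]
later = [date(2014, 2, d) for d in range(1, 6) for _ in range(2)][:-1]
print(rates_consistent(earlier, later))
```

A large gap between the two rates does not prove the data is wrong, but it is a cheap signal that an extract may be incomplete or was produced differently.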
Security concerns often limit this access in analytics projects but there are ways to provide Internet access for the team while mitigating these concerns.
• Encryption: Data security is often important in projects. In very fast-paced Guerrilla Analytics projects, good-quality encryption software should be available to the team to mitigate risk of data loss during transfer of data to the team and during delivery of results back to the customer.
• Code libraries for data wrangling: Guerrilla.