Statistics for Absolute Beginners
A Plain English Introduction
Oliver Theobald
Published by Scatterplot Press
First Edition
Copyright 2017 by Oliver Theobald
All rights reserved. No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the publisher, except in the case of brief quotations embodied in critical reviews and certain other non-commercial uses permitted by copyright law.
Please contact the author at oliver.theobald@scatterplotpress.com with feedback, media inquiries, requests for a university desk copy, or reports of omissions or errors in this book.
General Terms
Data
A term for any value that describes the characteristics or attributes of an item and that can be moved, processed, and analyzed. The item could be a transaction, a person, an event, a result, a change in the weather, or any of countless other possibilities. Data can contain various sorts of information, and through statistical analysis, these recorded values can be better understood and used to support or debunk a research hypothesis.
Population
The parent group from which the experiment's data is collected, e.g., all registered users of an online shopping platform or all investors in cryptocurrency.
Sample
A subset of a population collected for the purpose of an experiment, e.g., 10% of all registered users of an online shopping platform or 5% of all investors in cryptocurrency. A sample is often used in statistical experiments for practical reasons, as it might be impossible or prohibitively expensive to directly analyze the full population.
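As a rough illustration of drawing a sample from a population, the sketch below takes a simple random sample of 10% of users. The user IDs, the population size of 1,000, and the helper name draw_sample are all made up for this example; real studies often need more careful sampling designs.

```python
import random

# Hypothetical population: IDs for 1,000 registered users (invented for illustration).
population = [f"user_{i}" for i in range(1000)]

def draw_sample(items, fraction, seed=None):
    """Return a simple random sample holding `fraction` of `items`, without replacement."""
    rng = random.Random(seed)            # a seeded generator makes the draw repeatable
    k = round(len(items) * fraction)     # number of items to draw
    return rng.sample(items, k)

sample = draw_sample(population, 0.10, seed=42)
print(len(sample))  # 100
```

Every user has the same chance of being selected, and no user appears twice in the sample.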
Variable
A characteristic of an item from the population that varies in quantity or quality from another item, e.g., the Category of a product sold on Amazon. A variable that varies with regard to quantity and takes on numeric values is known as a quantitative variable, e.g., the Price of a product. A variable that varies in quality or class is called a qualitative variable, e.g., the Product Name of an item sold on Amazon. Assigning a qualitative value is often referred to as classification, as it involves assigning each item to a class.
Amazon product dataset
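To make the distinction between quantitative and qualitative variables concrete, here is a minimal sketch using a few rows shaped like the dataset described here. The rows, product names, and figures are hypothetical, as is the helper variable_kind, which simply checks whether every value of a variable is numeric.

```python
# Hypothetical rows; the names, categories, and figures are invented for illustration.
products = [
    {"Product Name": "Desk Lamp", "Category": "Home", "Price": 24.99, "Reviews": 312},
    {"Product Name": "Yoga Mat", "Category": "Sports", "Price": 18.50, "Reviews": 1045},
    {"Product Name": "USB Cable", "Category": "Electronics", "Price": 6.99, "Reviews": 87},
]

def variable_kind(rows, name):
    """Label a variable 'quantitative' if every value is numeric, else 'qualitative'."""
    values = [row[name] for row in rows]
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "quantitative" if numeric else "qualitative"

kinds = {name: variable_kind(products, name) for name in products[0]}
for name, kind in kinds.items():
    print(f"{name}: {kind}")
```

Under this simple rule, Product Name and Category come out as qualitative, while Price and Reviews come out as quantitative.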
Discrete Variable
A variable that can only accept a finite number of values, e.g., customers purchasing a product on Amazon.com can rate the product as 1, 2, 3, 4, or 5 stars. In other words, the product has five distinct rating possibilities, and the reviewer cannot submit their own rating value of 2.5 or 0.0009. Helpful tip: qualitative variables are discrete, e.g., Name or Category of a product.
Continuous Variable
A variable that can assume an infinite number of values, e.g., depending on supply and demand, gold can be converted into unlimited possible values expressed in U.S. dollars. Unlike a discrete variable, a continuous variable can also assume values arbitrarily close together. In the case of our dataset, Price and Reviews are treated as continuous variables.
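One way to picture the difference is to ask which values each kind of variable will accept. The sketch below is only an illustration: the function names are hypothetical, and the rule that a price must be a non-negative number is an assumption added for the example.

```python
# The five star ratings described in the text form a finite set of allowed values.
ALLOWED_RATINGS = {1, 2, 3, 4, 5}

def is_valid_rating(value):
    """A discrete variable accepts only the values in its allowed set."""
    return value in ALLOWED_RATINGS

def is_valid_price(value):
    """A continuous variable accepts any value in a range.

    Assumption for this sketch: a price is any non-negative number.
    """
    return isinstance(value, (int, float)) and value >= 0

print(is_valid_rating(4))       # True
print(is_valid_rating(2.5))     # False
print(is_valid_price(19.99))    # True
print(is_valid_price(19.995))   # True: values arbitrarily close together are allowed
```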
Categorical Variables
A variable whose possible values consist of a discrete set of categories, such as gender or political allegiance, rather than numbers quantifying values on a continuous scale.
Ordinal Variables
As a subcategory of categorical variables, ordinal variables categorize values in a logical and meaningful sequence. Unlike standard categorical variables, e.g., gender or film genre, ordinal variables contain an intrinsic ordering or sequence such as {small; medium; large} or {dissatisfied; neutral; satisfied; very satisfied}.
The distance of separation between ordinal values does not need to be consistent or quantified. For example, the measurable gap in performance between a gold and silver medalist in athletics need not mirror the difference in performance between a silver and bronze medalist.
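The idea that ordinal values have an order but no fixed spacing can be sketched in a few lines. The satisfaction levels come from the example above; the survey responses are made up, and the use of a rank lookup is just one possible way to represent the ordering.

```python
# Order of the categories, taken from the satisfaction example in the text.
LEVELS = ["dissatisfied", "neutral", "satisfied", "very satisfied"]
RANK = {level: position for position, level in enumerate(LEVELS)}

# Hypothetical survey responses, invented for illustration.
responses = ["satisfied", "dissatisfied", "very satisfied", "neutral", "satisfied"]

# Sorting is meaningful for ordinal values because the categories have an order.
ordered = sorted(responses, key=RANK.__getitem__)
print(ordered)

# The median relies only on order, so it is a sensible summary here.
median = ordered[len(ordered) // 2]
print(median)  # satisfied
```

The ranks 0 to 3 record position only. Subtracting or averaging them would imply equal gaps between levels, which ordinal data does not guarantee.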
Independent and Dependent Variables
An independent variable (expressed as X) is the variable that supposedly impacts the dependent variable (expressed as y). For example, the supply of oil (independent variable) impacts the cost of fuel (dependent variable). As the dependent variable is dependent on the independent variable, it is generally the independent variable that is tested in experiments. As the value of the independent variable changes, the effect on the dependent variable is observed and recorded.
In analyzing Amazon products, we could examine Category, Reviews, and 2-Day Delivery as the independent variables and observe how changes in those variables affect the dependent variable of Price. Equally, we could select the Reviews variable as the dependent variable and examine Price, 2-Day Delivery, and Category as the independent variables and observe how these variables influence the number of customer reviews.
The labels of independent and dependent are hence determined by experiment design rather than inherent composition, which means one variable could be a dependent variable in one study and an independent variable in another.
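Because the labels depend on experiment design, the same rows can be split into independent variables (X) and a dependent variable (y) in more than one way. The sketch below uses hypothetical rows and a hypothetical helper named split_variables to show Price and then Reviews taking the dependent role.

```python
# Hypothetical rows; all values are invented for illustration.
products = [
    {"Category": "Home", "Reviews": 312, "2-Day Delivery": True, "Price": 24.99},
    {"Category": "Sports", "Reviews": 1045, "2-Day Delivery": False, "Price": 18.50},
    {"Category": "Electronics", "Reviews": 87, "2-Day Delivery": True, "Price": 6.99},
]

def split_variables(rows, dependent):
    """Return (X, y): X keeps every variable except `dependent`; y holds `dependent`."""
    X = [{name: value for name, value in row.items() if name != dependent}
         for row in rows]
    y = [row[dependent] for row in rows]
    return X, y

# Study 1: Price is the dependent variable.
X_price, y_price = split_variables(products, dependent="Price")
print(y_price)     # [24.99, 18.5, 6.99]

# Study 2: Reviews is the dependent variable, and Price becomes independent.
X_reviews, y_reviews = split_variables(products, dependent="Reviews")
print(y_reviews)   # [312, 1045, 87]
```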
INTRODUCTION
"Let's listen to the data."
"Do you have the numbers to back that up?"
We live in an age and society where we trust technology and quantifiable information more than we trust each other, and sometimes ourselves. The gut feeling and conviction of Steve Jobs to know what consumers would later want is revered and romanticized. Yet there's sparse literature (Blink by Malcolm Gladwell is a notable exception), an eerie absence of online learning courses, and little sign of a mainstream movement promoting one person's unaided intuition as a prerequisite to success in business. Everyone is too preoccupied with thinking about quantitative evidence, including the personal data generated by Apple's expanding line of products. Extensive customer profiling and procuring data designed to wrench out our every hidden desire are dominant and pervasive trends in business today.
Perhaps Jobs represents a statistical anomaly. His legacy cannot be wiped from the dataset, but few in the business world would set out to emulate him without data in their pocket. As Wired Magazine's Editor-in-Chief Chris Anderson puts it, we don't need theories but rather data to look at and analyze in the current age of big data.
Data, both big and small, is collected instantly and constantly: how far we travel each day, who we interact with online, and where we spend our money. Every bit of data has a story to tell. But, left isolated, these parcels of information rest dormant and underutilized, equivalent to Lego blocks cordoned into bags of separate pieces.