Harlan Harris - Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work

Year: 2013. Publisher: O'Reilly Media.


Despite the excitement around data science, big data, and analytics, the ambiguity of these terms has led to poor communication between data scientists and organizations seeking their help. In this report, authors Harlan Harris, Sean Murphy, and Marck Vaisman examine their survey of several hundred data science practitioners in mid-2012, when they asked respondents how they viewed their skills, careers, and experiences with prospective employers. The results are striking.

Based on the survey data, the authors found that data scientists today can be clustered into four subgroups, each with a different mix of skillsets. Their purpose is to identify a new, more precise vocabulary for data science roles, teams, and career paths.

This report describes:

  • Four data scientist clusters: Data Businesspeople, Data Creatives, Data Developers, and Data Researchers
  • Cases in miscommunication between data scientists and organizations looking to hire
  • Why T-shaped data scientists have an advantage in breadth and depth of skills
  • How organizations can apply the survey results to identify, train, integrate, team up, and promote data scientists

Appendix A. Survey Details
Design and Invitation

The survey was created on KwikSurveys.com (note: hacked, closed, and reopened under new management since we used it). The first page described the survey, stated our privacy policy, and thanked participants. The second page posed the skills-sorting task. The third page asked about education and experiences. The fourth page asked about professional web presence. The fifth page asked about self-identification and had some basic demographic questions. The final page thanked the participant and provided a link to send to others.

After testing, we posted links on social and professional networking sites, emailed friends and colleagues, and so forth. A sample personal invitation was:

As someone in the broad Analytics / Data Science / Big Data / Applied Stats / Machine Learning space, would you be willing to take a brief survey? Three of us in the DC Data Science community wondered about the ways that skills and experiences of practitioners in these fields vary, and are collecting some data to help us learn more. By participating, you would help us define these new fields better, and we hope the results will help people such as yourself talk about how your skills and your work fit in with everyone else's. Should take 10 minutes or less!

Skills List

Here is the list of skills we provided (in random order) and asked respondents to sort:

  • Algorithms (ex: computational complexity, CS theory)
  • Back-End Programming (ex: JAVA/Rails/Objective C)
  • Bayesian/Monte-Carlo Statistics (ex: MCMC, BUGS)
  • Big and Distributed Data (ex: Hadoop, Map/Reduce)
  • Business (ex: management, business development, budgeting)
  • Classical Statistics (ex: general linear model, ANOVA)
  • Data Manipulation (ex: regexes, R, SAS, web scraping)
  • Front-End Programming (ex: JavaScript, HTML, CSS)
  • Graphical Models (ex: social networks, Bayes networks)
  • Machine Learning (ex: decision trees, neural nets, SVM, clustering)
  • Math (ex: linear algebra, real analysis, calculus)
  • Optimization (ex: linear, integer, convex, global)
  • Product Development (ex: design, project management)
  • Science (ex: experimental design, technical writing/publishing)
  • Simulation (ex: discrete, agent-based, continuous)
  • Spatial Statistics (ex: geographic covariates, GIS)
  • Structured Data (ex: SQL, JSON, XML)
  • Surveys and Marketing (ex: multinomial modeling)
  • Systems Administration (ex: *nix, DBA, cloud tech.)
  • Temporal Statistics (ex: forecasting, time-series analysis)
  • Unstructured Data (ex: noSQL, text mining)
  • Visualization (ex: statistical graphics, mapping, web-based dataviz)
Non-negative Matrix Factorization

We used Non-negative Matrix Factorization to perform our Skills and Self-ID clusterings. NMF attempts to find a matrix factorization where all elements of the basis vectors are constrained to be non-negative. This is natural in data sets such as our skills rankings, which range from 0 (lowest or missing) to 21 (highest).
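As an illustration of the 0-to-21 scale, the following Python sketch turns one respondent's sorted skills into a row of non-negative scores. The encoding rule is an assumption (the top-ranked skill gets the highest score, and any skill left unsorted gets 0); the survey's actual encoding is not spelled out here, and `encode_ranking` is a hypothetical helper name.

```python
def encode_ranking(ordered_skills, all_skills):
    """Map a respondent's ordering (best skill first) to non-negative scores.

    Assumed rule: with N skills in total, the top-ranked skill scores N - 1,
    the next scores N - 2, and so on. Skills the respondent did not sort
    keep a score of 0, the same value a last-place skill receives.
    """
    top = len(all_skills) - 1
    scores = {skill: 0 for skill in all_skills}
    for position, skill in enumerate(ordered_skills):
        scores[skill] = top - position
    # Return scores in the fixed column order used for every respondent.
    return [scores[skill] for skill in all_skills]


# Toy example with four skills instead of the survey's 22.
skills = ["Algorithms", "Business", "Math", "Visualization"]
row = encode_ranking(["Math", "Algorithms"], skills)
```

Stacking one such row per respondent gives a matrix with no negative entries, which is the kind of input NMF expects.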

The R NMF package that we used was not available via CRAN at the time of writing, but could be downloaded from the CRAN archives.

We used the standard Brunet et al. (2004) method, which attempts to minimize KL-divergence. Note that NMF attempts to globally optimize a non-smooth function from a random initial state, and so we used 200 random runs to find a relatively reliable factorization. (See main text for several skills/self-ID terms that sometimes fell into other groups when different random seeds were chosen. These small differences did not appreciably affect our overall results.) The ranks of 5 and 4 (for Skills and Self-ID, respectively) were chosen to maximize the informativeness and interpretability (evaluated subjectively) of the resulting basis vectors. Lower ranks yielded vague factors, while higher ranks yielded less informative results compared to the raw ranks/ratings.
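The procedure described above (a KL-divergence objective, random starting points, and many runs with the best one kept) can be sketched in plain Python. This is not the R NMF package used for the survey analysis. It is a minimal illustration that assumes the multiplicative update rules commonly used for the KL objective, and the function names, iteration count, and toy matrix are all hypothetical.

```python
import math
import random

EPS = 1e-12  # guards against division by zero and log(0)


def reconstruct(W, H, i, j):
    """Entry (i, j) of the matrix product W H."""
    return sum(W[i][k] * H[k][j] for k in range(len(H)))


def kl_divergence(V, W, H):
    """Generalized KL divergence D(V || WH) for non-negative nested lists."""
    total = 0.0
    for i, row in enumerate(V):
        for j, v in enumerate(row):
            wh = reconstruct(W, H, i, j) + EPS
            if v > 0:
                total += v * math.log(v / wh)
            total += wh - v
    return total


def nmf_kl(V, rank, iters=200, seed=0):
    """One NMF run from a random start using multiplicative KL updates."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]

    def ratios():
        # Element-wise V / (WH), recomputed before each half-step.
        return [[V[i][j] / (reconstruct(W, H, i, j) + EPS) for j in range(m)]
                for i in range(n)]

    for _ in range(iters):
        R = ratios()
        for k in range(rank):
            col_sum = sum(W[i][k] for i in range(n)) + EPS
            for j in range(m):
                H[k][j] *= sum(W[i][k] * R[i][j] for i in range(n)) / col_sum
        R = ratios()
        for k in range(rank):
            row_sum = sum(H[k]) + EPS
            for i in range(n):
                W[i][k] *= sum(R[i][j] * H[k][j] for j in range(m)) / row_sum
    return W, H, kl_divergence(V, W, H)


def best_of_runs(V, rank, n_runs=200, iters=200):
    """Repeat NMF from different random seeds and keep the lowest-divergence run."""
    runs = [nmf_kl(V, rank, iters=iters, seed=s) for s in range(n_runs)]
    return min(runs, key=lambda run: run[2])


# Toy data: six "respondents" scoring four "skills", with two obvious groups.
toy = [[5, 4, 0, 1], [4, 5, 1, 0], [0, 1, 5, 4],
       [1, 0, 4, 5], [5, 5, 0, 0], [0, 0, 5, 5]]
W, H, divergence = best_of_runs(toy, rank=2, n_runs=5, iters=100)
```

Here `W` has one row per respondent and `H` one row per latent factor. Because the updates only multiply by non-negative ratios, both matrices stay non-negative throughout, and restarting from several seeds reduces the chance of reporting a poor local optimum.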

The results of NMF are two matrices: a coefficients matrix that describes how the observed dimensions can be approximately reconstructed with a smaller number of latent factors, and a basis matrix that describes how individual respondents' rankings/ratings can be approximately summarized using the latent factors. We categorize individual respondents by normalizing the basis matrix and selecting the largest latent factor loading. The normalized coefficients matrix is used to assign skills/self-ID terms to Skill Groups/Self-ID Groups.
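The assignment step can be illustrated with a short Python sketch. The text does not say which normalization was used, so this example assumes each respondent's loadings (and each term's coefficients) are scaled to sum to 1. The function names are hypothetical, and the matrices are toy values rather than survey results.

```python
def normalize_rows(matrix):
    """Scale each row to sum to 1 (all-zero rows are left as zeros)."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([x / total if total > 0 else 0.0 for x in row])
    return result


def assign_respondents(basis):
    """Label each respondent (row of the basis matrix) with its largest factor."""
    return [max(range(len(row)), key=row.__getitem__)
            for row in normalize_rows(basis)]


def assign_terms(coefficients):
    """Label each skill or self-ID term (column) with its largest factor."""
    columns = [list(col) for col in zip(*coefficients)]
    return [max(range(len(col)), key=col.__getitem__)
            for col in normalize_rows(columns)]


# Toy example: three respondents, two latent factors, three terms.
basis = [[0.2, 1.8], [3.0, 1.0], [0.5, 0.4]]
coefficients = [[2.0, 0.1, 1.0],
                [1.0, 0.9, 3.0]]
respondent_groups = assign_respondents(basis)
term_groups = assign_terms(coefficients)
```

With this particular normalization the largest loading is the same before and after scaling. The scaling matters mainly when loadings are compared across respondents, as in the T-shaped analysis.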

For the T-shaped skills analysis, we use the normalized basis vectors for each respondent. A simulated comparison matrix was constructed by multiplying 1,000 simulated random rankings of skills by the computed coefficients matrix to get skill group loadings for each simulated respondent. For both that matrix and the observed basis matrix, we sorted the normalized loadings from high to low, then plotted the means in the following order (left-to-right): 5, 3, 1, 2, 4.
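The sorting and reordering steps can be sketched in Python as follows. Several details are assumptions made for illustration: random rankings are permutations of the scores 0 to N - 1, loadings are obtained by multiplying each ranking by the transpose of the coefficients matrix (one row per factor), and normalization means scaling each respondent's loadings to sum to 1. All function names are hypothetical.

```python
import random


def simulate_loadings(coefficients, n_sim=1000, seed=0):
    """Project random skill rankings onto the latent factors."""
    rng = random.Random(seed)
    n_skills = len(coefficients[0])
    simulated = []
    for _ in range(n_sim):
        ranking = list(range(n_skills))  # assumed scores 0 .. n_skills - 1
        rng.shuffle(ranking)
        simulated.append([sum(r * c for r, c in zip(ranking, factor))
                          for factor in coefficients])
    return simulated


def mean_sorted_profile(loadings):
    """Normalize each row, sort it high to low, then average by position."""
    width = len(loadings[0])
    totals = [0.0] * width
    for row in loadings:
        row_sum = sum(row)
        ordered = sorted((x / row_sum if row_sum > 0 else 0.0 for x in row),
                         reverse=True)
        for position, value in enumerate(ordered):
            totals[position] += value
    return [t / len(loadings) for t in totals]


def t_shape_order(profile, order=(5, 3, 1, 2, 4)):
    """Arrange the sorted means so the largest sits in the middle."""
    return [profile[rank - 1] for rank in order]


# Toy coefficients: 5 factors by 22 skills, filled with arbitrary values.
rng = random.Random(1)
toy_coefficients = [[rng.random() for _ in range(22)] for _ in range(5)]
baseline = t_shape_order(mean_sorted_profile(simulate_loadings(toy_coefficients)))
```

Applying `mean_sorted_profile` and `t_shape_order` to the observed basis matrix as well gives two profiles that can be plotted side by side. A taller central peak in the observed profile than in the simulated baseline indicates that respondents concentrate on one skill group more than random rankings would.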

Acknowledgements

Thank you to those who gave us valuable feedback on drafts of this article and the survey, to those who gave us opportunities to share the results, as well as to our hundreds of survey participants around the world. Particular thanks to: MB, DC, AD, AF, NK, JDL, HM, NN, DJP, NT, KV, JW, JMW, and EZ.

Harlan Harris, Sean Murphy, and Marck Vaisman

About the Authors

Harlan D. Harris is a Senior Data Scientist at Kaplan Test Prep, the Co-Founder and Co-Organizer of the Data Science DC Meetup, and the Co-Founder and President of Data Community DC, Inc. He has a PhD in Computer Science (Machine Learning) from the University of Illinois at Urbana-Champaign and worked as a researcher in several Psychology departments before turning to industry.

Sean Patrick Murphy, with degrees in mathematics, electrical engineering, and biomedical engineering and an MBA from Oxford University, has served as a senior scientist at the Johns Hopkins Applied Physics Laboratory for the past ten years. Previously, he served as the Chief Data Scientist at WiserTogether, a series A funded health care analytics firm, and the Director of Research at Manhattan Prep, a boutique graduate educational company. He was also the co-founder and CEO of a big data-focused startup: CloudSpree.

Marck Vaisman is a data scientist, consultant, entrepreneur, master munger and hacker. Marck is the Principal Data Scientist at DataXtract, LLC helping clients from start-ups to Fortune 500 firms with all kinds of data science projects. His professional experience spans the management consulting, telecommunications, Internet, and technology industries. He is the co-founder of Data Community DC, Inc. and co-organizer of the Data Science DC and R Users DC meetup groups. He has an MBA from Vanderbilt University and a B.S. in Mechanical Engineering from Boston University. Marck is also a contributing author of The Bad Data Handbook.

Chapter 1. Introduction

Binita, Chao, Dmitri, and Rebecca are data scientists. What does that statement tell you about them? Probably not as much as you'd like. You know they probably know something about statistics, programming, and data visualization. You'd hope that they had some experience finding insights from data, maybe even big data. But if you're trying to find the best person for a job, you need to be more specific than just "doctor," or "athlete," or "data scientist." And that's a problem. Finding the right people for a task is all about efficient communication and, without the appropriate shared vocabulary, data science talent and data science problems are too often kept apart.

The three of us, organizers of data science events in Washington, DC, decided that we wanted to do something about this problem after too many personal experiences of failures caused by miscommunication. So in mid-2012 we surveyed data scientists, asking about their experiences and how they viewed their own skills and careers. The results may help us, as a professional community, settle on finer-grained descriptions and more effective means of communicating about what we do for a living.
