Appendix A. Survey Details
Design and Invitation
The survey was created on KwikSurveys.com (note: hacked, closed, and reopened under new management since we used it). The first page described the survey, stated our privacy policy, and thanked participants. The second page posed the skills-sorting task. The third page asked about education and experiences. The fourth page asked about professional web presence. The fifth page asked about self-identification and had some basic demographic questions. The final page thanked the participant and provided a link to send to others.
After testing, we posted links on social and professional networking sites, emailed friends and colleagues, and so forth. A sample personal invitation was:
As someone in the broad Analytics / Data Science / Big Data / Applied Stats / Machine Learning space, would you be willing to take a brief survey? Three of us in the DC Data Science community wondered about the ways that skills and experiences of practitioners in these fields vary, and are collecting some data to help us learn more. By participating, you would help us define these new fields better, and we hope the results will help people such as yourself talk about how your skills and your work fit in with everyone else's. Should take 10 minutes or less!
Non-negative Matrix Factorization
We used Non-negative Matrix Factorization to perform our Skills and Self-ID clusterings. NMF attempts to find a matrix factorization where all elements of the basis vectors are constrained to be non-negative. This is natural in data sets such as our skills rankings, which range from 0 (lowest or missing) to 21 (highest).
The R NMF package that we used is not currently available via CRAN, but can be downloaded from the archives.
We used the standard Brunet et al. (2004) method, which attempts to minimize KL-divergence. Note that NMF attempts to globally optimize a non-smooth function from a random initial state, and so we used 200 random runs to find a relatively reliable factorization. (See main text for several skills/self-ID terms that sometimes fell into other groups when different random seeds were chosen. These small differences did not appreciably affect our overall results.) The ranks of 5 and 4 (for Skills and Self-ID, respectively) were chosen to maximize the informativeness and interpretability (evaluated subjectively) of the resulting basis vectors. Lower ranks yielded vague factors, while higher ranks yielded less informative results compared to the raw ranks/ratings.
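To make the procedure concrete, here is a minimal sketch of KL-divergence NMF with random restarts. It is written in plain Python rather than with the R NMF package we used, and it assumes the textbook multiplicative update rules for the KL objective. The function names, the toy matrix, and the iteration count are all illustrative assumptions, not our actual code or data.

```python
import math
import random

EPS = 1e-9  # small constant guarding against division by zero and log(0)

def matmul(W, H):
    """Product of W (n x r) and H (r x p) as nested lists."""
    r, p = len(H), len(H[0])
    return [[sum(row[a] * H[a][j] for a in range(r)) for j in range(p)]
            for row in W]

def kl_divergence(V, WH):
    """Generalized KL divergence D(V || WH), treating 0 * log(0) as 0."""
    total = 0.0
    for v_row, wh_row in zip(V, WH):
        for v, wh in zip(v_row, wh_row):
            if v > 0:
                total += v * math.log(v / (wh + EPS))
            total += wh - v
    return total

def nmf_kl(V, rank, n_iter, rng):
    """One run of multiplicative KL updates from a random non-negative start."""
    n, p = len(V), len(V[0])
    W = [[rng.random() + EPS for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + EPS for _ in range(p)] for _ in range(rank)]
    for _ in range(n_iter):
        WH = matmul(W, H)
        for a in range(rank):  # update the coefficients matrix H
            w_sum = sum(W[i][a] for i in range(n)) + EPS
            for j in range(p):
                num = sum(W[i][a] * V[i][j] / (WH[i][j] + EPS) for i in range(n))
                H[a][j] *= num / w_sum
        WH = matmul(W, H)
        for a in range(rank):  # update the basis matrix W
            h_sum = sum(H[a]) + EPS
            for i in range(n):
                num = sum(H[a][j] * V[i][j] / (WH[i][j] + EPS) for j in range(p))
                W[i][a] *= num / h_sum
    return W, H, kl_divergence(V, matmul(W, H))

def nmf_best_of(V, rank, n_runs=200, n_iter=100, seed=0):
    """Keep the factorization with the lowest divergence across random restarts."""
    rng = random.Random(seed)
    runs = [nmf_kl(V, rank, n_iter, rng) for _ in range(n_runs)]
    return min(runs, key=lambda run: run[2])

# Toy respondents-by-skills matrix (made-up values with two obvious groups).
V = [[9, 8, 0, 1], [8, 9, 1, 0], [7, 9, 0, 0],
     [0, 1, 9, 8], [1, 0, 8, 9], [0, 0, 9, 7]]
W, H, divergence = nmf_best_of(V, rank=2, n_runs=5)
print(round(divergence, 3))
```

Because each run starts from a different random state, individual runs can settle on different factorizations. Keeping the lowest-divergence run is one simple way to get a more reliable result.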
The results of NMF are two matrices: a coefficients matrix that describes how the observed dimensions can be approximately reconstructed with a smaller number of latent factors, and a basis matrix that describes how individual respondents' rankings/ratings can be approximately summarized using the latent factors. We categorize individual respondents by normalizing the basis matrix and selecting the largest latent factor loading. The normalized coefficients matrix is used to assign skills/self-ID terms to Skill Groups/Self-ID Groups.
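The assignment step can be sketched as follows. This is an illustration only: it assumes that "normalizing" means scaling each respondent's loadings (and each term's loadings) to sum to 1, which is one plausible reading. The function names and the small matrices are hypothetical.

```python
def normalize(values):
    """Scale non-negative values to sum to 1 (all zeros stay all zeros)."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else [0.0] * len(values)

def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda k: values[k])

def assign_respondents(W):
    """One group index per respondent, from row-normalized basis loadings."""
    return [argmax(normalize(row)) for row in W]

def assign_terms(H):
    """One group index per skill or self-ID term, from column-normalized coefficients."""
    columns = zip(*H)  # one tuple of factor loadings per term
    return [argmax(normalize(col)) for col in columns]

# Made-up loadings: 3 respondents, 2 latent factors, 4 terms.
W = [[2.0, 1.0], [0.5, 1.5], [3.0, 0.0]]
H = [[4.0, 3.0, 0.0, 1.0], [1.0, 0.0, 5.0, 2.0]]
print(assign_respondents(W))  # [0, 1, 0]
print(assign_terms(H))        # [0, 0, 1, 1]
```

Under this reading, normalization does not change which factor is largest for a given respondent or term, but it puts the loadings on a common scale so they can be compared across respondents.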
For T-shaped skills analysis, we use the normalized basis vectors for each respondent. A baseline matrix for comparison was constructed by multiplying 1,000 simulated random rankings of skills by the computed coefficients matrix to get skill group loadings for each simulated respondent. For both that matrix and the observed basis matrix we then sorted the normalized loadings from high to low, then plotted the means in the following order (left-to-right): 5, 3, 1, 2, 4.
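A minimal sketch of the simulated baseline follows. It assumes that each simulated respondent ranks 22 skills with a random permutation of 0 to 21 (the range given above), and that loadings are normalized to sum to 1 per respondent. The coefficients matrix here is filled with random placeholder values, since the fitted matrix is not reproduced in this appendix, and all function names are hypothetical.

```python
import random

def simulate_rankings(n_sim, n_skills, rng):
    """Each simulated respondent gets a random permutation of 0..n_skills-1."""
    rankings = []
    for _ in range(n_sim):
        ranks = list(range(n_skills))
        rng.shuffle(ranks)
        rankings.append(ranks)
    return rankings

def group_loadings(rankings, H):
    """Multiply rankings (n x skills) by H transposed (skills x groups)."""
    return [[sum(x * h for x, h in zip(row, h_row)) for h_row in H]
            for row in rankings]

def mean_sorted_profile(loadings):
    """Normalize each row to sum to 1, sort high to low, then average by position."""
    rows = []
    for row in loadings:
        total = sum(row)
        rows.append(sorted((x / total for x in row), reverse=True))
    return [sum(col) / len(rows) for col in zip(*rows)]

def t_order(profile):
    """Arrange five sorted means as 5, 3, 1, 2, 4 so the largest sits in the middle."""
    return [profile[k] for k in (4, 2, 0, 1, 3)]

rng = random.Random(0)
n_skills, n_groups = 22, 5
# Placeholder coefficients matrix (groups x skills); real values would come from NMF.
H = [[rng.random() for _ in range(n_skills)] for _ in range(n_groups)]
baseline = mean_sorted_profile(
    group_loadings(simulate_rankings(1000, n_skills, rng), H))
print([round(x, 3) for x in t_order(baseline)])
```

Applying the same sorting and ordering to the observed basis matrix gives a profile that can be compared against this baseline. A taller central peak than the baseline would suggest more T-shaped respondents.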
Acknowledgements
Thank you to those who gave us valuable feedback on drafts of this article and the survey, to those who gave us opportunities to share the results, as well as to our hundreds of survey participants around the world. Particular thanks to: MB, DC, AD, AF, NK, JDL, HM, NN, DJP, NT, KV, JW, JMW, and EZ.
Harlan Harris, Sean Murphy, and Marck Vaisman
About the Authors
Harlan D. Harris is a Senior Data Scientist at Kaplan Test Prep, the Co-Founder and Co-Organizer of the Data Science DC Meetup, and the Co-Founder and President of Data Community DC, Inc. He has a PhD in Computer Science (Machine Learning) from the University of Illinois at Urbana-Champaign and worked as a researcher in several Psychology departments before turning to industry.
Sean Patrick Murphy, with degrees in mathematics, electrical engineering, and biomedical engineering and an MBA from Oxford University, has served as a senior scientist at the Johns Hopkins Applied Physics Laboratory for the past ten years. Previously, he served as the Chief Data Scientist at WiserTogether, a series A funded health care analytics firm, and the Director of Research at Manhattan Prep, a boutique graduate educational company. He was also the co-founder and CEO of a big data-focused startup: CloudSpree.
Marck Vaisman is a data scientist, consultant, entrepreneur, master munger, and hacker. Marck is the Principal Data Scientist at DataXtract, LLC, helping clients from start-ups to Fortune 500 firms with all kinds of data science projects. His professional experience spans the management consulting, telecommunications, Internet, and technology industries. He is the co-founder of Data Community DC, Inc. and co-organizer of the Data Science DC and R Users DC meetup groups. He has an MBA from Vanderbilt University and a B.S. in Mechanical Engineering from Boston University. Marck is also a contributing author of The Bad Data Handbook.
Chapter 1. Introduction
Binita, Chao, Dmitri, and Rebecca are data scientists. What does that statement tell you about them? Probably not as much as you'd like. You know they probably know something about statistics, programming, and data visualization. You'd hope that they had some experience finding insights from data, maybe even "big data." But if you're trying to find the best person for a job, you need to be more specific than just "doctor," or "athlete," or "data scientist." And that's a problem. Finding the right people for a task is all about efficient communication and, without the appropriate shared vocabulary, data science talent and data science problems are too often kept apart.
The three of us, organizers of data science events in Washington, DC, decided that we wanted to do something about this problem after too many personal experiences of failures caused by miscommunication. So in mid-2012 we surveyed data scientists, asking about their experiences and how they viewed their own skills and careers. The results may help us, as a professional community, settle on finer-grained descriptions and more effective means of communicating about what we do for a living.