Executive Summary
O'Reilly Media conducted an anonymous salary and tools survey in 2012 and 2013 with attendees of the Strata Conference: Making Data Work in Santa Clara, California, and Strata + Hadoop World in New York. Respondents from 37 US states and 33 countries, representing a variety of industries in the public and private sector, completed the survey.
We ran the survey to better understand which tools data analysts and data scientists use and how those tools correlate with salary. Not all respondents describe their primary role as data scientist/data analyst, but almost all respondents are exposed to data analytics. Similarly, while just over half the respondents described themselves as technical leads, almost all reported that some part of their role included technical duties (i.e., 10 to 20% of their responsibilities included data analysis or software development).
We looked at which tools correlate with others (if respondents use one, are they more likely to use another?) and created a network graph of the positive correlations. Tools could then be compared with salary, either individually or collectively, based on where they clustered on the graph.
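As an illustration of the kind of analysis described above, the following is a minimal sketch, not the survey's actual method. It assumes each response is a set of selected tools, measures pairwise association with the phi coefficient (the Pearson correlation of two yes/no indicators), and keeps only positively correlated pairs as graph edges. The toy responses and the names `phi` and `positive_edges` are hypothetical.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy responses: each set lists the tools one respondent selected.
# These values are made up for illustration; they are not survey data.
responses = [
    {"SQL", "R", "Python"},
    {"SQL", "Excel"},
    {"R", "Python", "Hadoop"},
    {"SQL", "Excel", "Tableau"},
    {"Python", "Hadoop"},
    {"SQL", "R", "Python", "Hadoop"},
]


def phi(a, b, responses):
    """Phi coefficient: Pearson correlation of two binary 'uses tool' indicators."""
    n = len(responses)
    n_a = sum(a in r for r in responses)              # respondents using tool a
    n_b = sum(b in r for r in responses)              # respondents using tool b
    n_ab = sum(a in r and b in r for r in responses)  # respondents using both
    denom = sqrt(n_a * (n - n_a) * n_b * (n - n_b))
    if denom == 0:
        # A tool used by everyone or by no one has no variance to correlate.
        return 0.0
    return (n * n_ab - n_a * n_b) / denom


def positive_edges(responses, threshold=0.0):
    """Return {(tool_a, tool_b): phi} for every pair correlated above `threshold`."""
    tools = sorted(set().union(*responses))
    edges = {}
    for a, b in combinations(tools, 2):
        c = phi(a, b, responses)
        if c > threshold:
            edges[(a, b)] = c
    return edges


edges = positive_edges(responses)
for (a, b), c in sorted(edges.items(), key=lambda kv: -kv[1]):
    print(f"{a} -- {b}: {c:.2f}")
```

The resulting edge dictionary could be handed to any graph library for layout and cluster detection; the threshold of 0.0 is an arbitrary choice for this sketch.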
We found:
By a significant margin, more respondents used SQL than any other tool (71% of respondents, compared to 43% for the next highest ranked tool, R).
The open source tools R and Python, used by 43% and 40% of respondents, respectively, proved more widely used than Excel (used by 36% of respondents).
Salaries positively correlated with the number of tools used by respondents. The average respondent selected 10 tools and had a median income of $100k; those using 15 or more tools had a median salary of $130k.
Tool use fell into two clusters of correlated tools: one consisting of open source tools (R, Python, Hadoop frameworks, and several scalable machine learning tools), the other consisting of commercial tools such as Excel, MSSQL, Tableau, Oracle RDB, and BusinessObjects.
Respondents who use more tools from the commercial cluster tend to use them in isolation, without many other tools.
Respondents selecting tools from the open source cluster had higher salaries than respondents selecting commercial tools. For example, respondents who selected 6 of the 19 open source tools had a median salary of $130k, while those using 5 of the 13 commercial cluster tools earned a median salary of $90k.
Note
We suspect that a scarcity of resources trained in the newer open source tools creates demand that bids up salaries compared to the more mature commercial cluster tools.
Salary Report
Big data can be described as both ordinary and arcane. The basic premise behind its genesis and utility is as simple as its name: efficient access to more (much more) data can transform how we understand and solve major problems for business and government. On the other hand, the field of big data has ushered in new, complex tools that relatively few people understand or have even heard of. But is it worth learning them?
If you have any involvement in data analytics and want to develop your career, the answer is yes. At the last two Strata conferences (New York 2012 and Santa Clara 2013), we collected surveys from our attendees about, among other things, the tools they use and their salaries. Here's what we found:
Several open source tools used in analytics, such as R and Python, are just as important as, or even more important than, traditional data tools such as SAS or Excel.
Some traditional tools such as Excel, SAS, and SQL are used in relative isolation.
Using a wider variety of tools (programming languages, visualization tools, relational database/Hadoop platforms) correlates with higher salary.
Using more tools tailored to working with big data, such as MapR, Cassandra, Hive, MongoDB, Apache Hadoop, and Cloudera, also correlates with higher salary.
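The relationship between tool count and salary in the findings above can be checked with a simple grouped median. The sketch below is an assumption about how such a comparison might be set up, not the report's procedure. The `respondents` values are made up, and `median_salary_by_bucket` and its `cutoff` parameter are hypothetical names.

```python
from statistics import median

# Hypothetical (tool_count, salary_in_thousands_usd) pairs.
# These values are made up for illustration; they are not survey data.
respondents = [
    (3, 80), (5, 90), (8, 95), (10, 100),
    (12, 110), (15, 125), (17, 130), (20, 140),
]


def median_salary_by_bucket(respondents, cutoff=15):
    """Split respondents at `cutoff` tools and return each group's median salary."""
    fewer = [salary for count, salary in respondents if count < cutoff]
    more = [salary for count, salary in respondents if count >= cutoff]
    return {
        f"under {cutoff} tools": median(fewer) if fewer else None,
        f"{cutoff}+ tools": median(more) if more else None,
    }


result = median_salary_by_bucket(respondents)
for bucket, value in result.items():
    print(f"{bucket}: median ${value}k")
```

Medians are used rather than means because salary data tends to be skewed by a few very high values. A higher median in the larger-toolset group shows association only, not that adding tools causes higher pay.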
We should note that Strata attendees comprise a special group and do not form an unbiased sample of everyone who seriously works with data. These are people deeply involved with or interested in big data, seeking to network with others on the field's cutting edge and learn about the new technologies defining it. In short, they are ahead of the curve. If a trend observed in the sample is not consistent with what would be observed in the larger population (of analysts, data scientists, and so on), then this trend could represent the direction big data is headed. This is likely to be the case for tool usage.
The majority of the survey's respondents were from the US, with most of the rest coming from Canada and Europe. Among those from the US, 68% were from states on either coast.
Our sample represented a wide range of ages, with most respondents in their thirties and forties. About 40% of respondents were based in the West, while the rest of the respondents were evenly distributed in the Northeast, Mid-Atlantic, South, and Midwest regions. California, Maryland, and Washington had the highest median salaries, while respondents in the South and Midwest reported the lowest median salaries.
Twenty-three industries were represented (those with at least 10 respondents are shown above) and about one-fifth came from startups. A significant share of respondents, 42%, work in software-oriented segments: software and application development, IT/solutions/VARs, data and information services, and manufacturing/design (IT/OEM). Government and education represent 14% of respondents. About 21% of those responding work for startups, with early startups, surprisingly, showing the highest median salary, $130k. Public companies had a median salary of $110k, private companies $100k, and N/A (mostly government and education) $80k.