Author: Mısra Turp, Data Scientist at IBM
“My background is in computer science and before getting into data science I was working with robots and other branches of artificial intelligence for the fun of it. My current goal is to help busy professionals have a smooth transition to a career in data science.”
I get a lot of questions about what the relevant skills to learn are to get a data science job and to become a data scientist. My classic answer is “learn Python or R and some common Python frameworks that are used for data analysis”. This approach makes sense to me because that’s what I did.
I know one popular piece of advice for aspiring data scientists is to look at job postings and investigate the requirements. I think this is a good way to get a clear picture of the hard skills to learn. BUT. I also know that these job postings are often prepared by HR people who do not really understand the requirements themselves and just write down what an expert in their company told them. I wouldn’t be surprised if some were using templates of job postings found online.
To satisfy my curiosity and understand what a data science position requires according to companies, I took it upon myself to look into job postings on the internet. I selected 100 job postings from the US and Europe on LinkedIn. All of them are entry-level and all of them are written in English; for Germany, for example, I skipped the job listings that were in German. In this article, I will share my results with you. I will show which skills came up most often and tell you what I think about how these requirements reflect reality. With that said, don’t forget to take everything you read online with a grain of salt and think about how it fits your specific situation.
To start off, I divided the requirements into 9 groups. I call them skill types:
- Background (Mathematics, Statistics, Computer Science and Machine Learning)
- Education (Master’s degree, PhD, Master’s or PhD)
- Languages (Python, Scala, Java, R, C++, Julia)
- Databases (SQL, Kafka, NoSQL)
- Big Data Related Technologies (Spark, Hadoop, MapReduce)
- Business tools (Excel, PowerPoint, Word)
- Visualization tools (Dash, D3, Tableau, Qlikview, PowerBI, ggplot)
- Python frameworks (Pandas, Scikit-learn, TensorFlow, Keras)
- Soft skills (communication, English, presentation skills, attention to detail)
On top of the requirements, I looked at the common disciplines mentioned in the postings (Machine Learning, Deep Learning, Analytics, Natural Language Processing, Computer Vision and Robotics) in order to understand what the nature of these postings is.
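If you want to try a similar tally yourself, the counting can be sketched in a few lines of Python. This is only a minimal sketch of the idea, not the exact process behind the numbers in this article: the `SKILL_TYPES` mapping copies just three of the groups above, the two example postings are invented for illustration, and the `mentions` helper is a naive case-insensitive whole-token match that will miss or over-count some real-world phrasing.

```python
import re
from collections import Counter

# Three skill types copied from the list above (shortened for the example).
SKILL_TYPES = {
    "Languages": ["Python", "Scala", "Java", "R", "C++", "Julia"],
    "Databases": ["SQL", "Kafka", "NoSQL"],
    "Big Data": ["Spark", "Hadoop", "MapReduce"],
}

def mentions(skill, text):
    """Return True if `skill` appears in `text` as a whole token."""
    # Lookarounds instead of \b so that "C++" works and "SQL" does not
    # match inside "NoSQL", nor "Java" inside "JavaScript".
    pattern = r"(?<![A-Za-z0-9+#])" + re.escape(skill) + r"(?![A-Za-z0-9+#])"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def count_mentions(postings):
    """Count how many postings mention each skill and each skill type."""
    skill_counts, type_counts = Counter(), Counter()
    for text in postings:
        for skill_type, skills in SKILL_TYPES.items():
            found = [s for s in skills if mentions(s, text)]
            skill_counts.update(found)
            if found:
                type_counts[skill_type] += 1  # at most one count per posting
    return skill_counts, type_counts

# Hypothetical postings, made up for this example only.
postings = [
    "Entry-level data scientist: Python and SQL required, Spark is a plus.",
    "Junior analyst with R or Python; experience with NoSQL and Hadoop.",
]
skill_counts, type_counts = count_mentions(postings)
print(skill_counts["Python"], skill_counts["SQL"], type_counts["Languages"])
```

Counting each type at most once per posting is a design choice: it answers "how many postings ask for any language" and not "how many language names appear in total".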
Let’s start with the requirements. Here is a chart showing the total number of times a type of requirement was mentioned on a job posting out of the 100.
Education and programming language requirements are the ones mentioned the most often compared to other types of requirements. The interesting thing we see in this chart is the importance of soft skills compared to all the other hard skills or specific tools such as frameworks and libraries. This really puts the importance of being a good communicator into perspective. Especially for someone who is not 100% confident in their technical skills, emphasizing your soft skills could be the boost you need to get ahead of the competition.
Here are all of the types of requirements divided into specific skills.
Chart: occurrence of requirement types, shown with the specific skills
For education requirements, I counted the number of times a Master’s degree was asked for, a PhD was asked for, and a Master’s or PhD was asked for. There were no job postings where a PhD was a requirement. It would have been ridiculous anyway to ask for a PhD for an entry-level position.
As you can see, a Master’s degree in an area related to computer science, statistics or mathematics is asked for by 93 of the job listings, 29 of which say that a PhD is also acceptable (!). But there is no reason to lose hope if you don’t have these degrees. Many of these job listings state this requirement as a good-to-have or a plus. Basically, these companies are trying to minimise their risk while hiring someone by checking if they fit traditional requirements such as a Master’s degree in computer science. At the end of the day, they are looking for someone who can bring the most value. Convincing them of your value is up to you.
Languages are my favourite thing to talk about because of the amount of controversy around them. Not surprisingly, Python leads, appearing in 86% of all listings. That might not seem high enough at first glance, but it was mentioned 96.63% of the time a language requirement was given. So we can say that if you have to learn a programming language, make it Python. Not convinced? Let’s look at the other programming languages.
R was not mentioned in as many job listings as I thought it would be. With only 16 mentions, it is nearly as low as C++, a programming language I did not expect to see on data science job postings. Julia, a programming language I recently heard about, is only mentioned 3 times out of the 100 job postings. I guess we can say it is still an up-and-coming language that has a lot of coming left to do. Finally, Scala and Java seem to be competing for second place but they are not anywhere close to how popular Python is.
It seems like Python is the go-to language for data science. To understand this better, let’s look at a different chart with every instance a language requirement was given. We can see here how Python is required on nearly all of them. And almost every time another language is given as a requirement, Python is there too. *Python drops the mic*
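The two Python percentages quoted earlier measure different things, and a little arithmetic ties them together. The sketch below assumes that 89 listings named at least one programming language. That figure is not stated anywhere in this article; it is simply the number that 86 mentions and 96.63% imply.

```python
TOTAL_POSTINGS = 100
python_mentions = 86  # from the text: Python appears in 86% of all listings

# Assumption: 89 listings stated any language requirement at all.
# This is inferred from the quoted 96.63%, not given in the article.
postings_with_language = 89

share_of_all = python_mentions / TOTAL_POSTINGS
share_given_language = python_mentions / postings_with_language

print(f"Share of all listings: {share_of_all:.2%}")
print(f"Share of listings naming a language: {share_given_language:.2%}")
```

The first number answers "how often does Python show up at all", the second "how often does Python show up when a company bothers to name a language", which is the more telling one here.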
When it comes to the background training one should have, which was mentioned on 77 of the 100 job listings, a statistics background seems to be very desirable, closely followed by computer science and mathematics. This might seem scary to you because there is not much one can do to get a degree in these fields, at least not very quickly. I should remind you, though, that on most of those job listings the background training is marked as a good-to-have. As I said before, if your goal is to get into this new area of work that you don’t have the right background for, you need to keep your hopes up and compensate with other skills.
As I mentioned in previous posts, soft skills are among the most important things to highlight when applying for a data science position. 75% of all job listings mention soft skills, and 90.67% of those 75 postings mention communication as an important skill to have. Yes, communication is a vague word but you can read here what it means for data scientists specifically.
Language skills (English, to be exact) were mentioned in those listings together with presentation skills and attention to detail. Better adjust those cover letters to emphasize your soft skills and give a couple of examples of how you use them on projects.
Looking at frameworks and commonly used libraries, we see that TensorFlow is the most popular of the bunch, especially compared to its counterpart Keras, which was mentioned only 3 times. This is not really surprising, and it is a good thing to keep in mind if you are looking into Neural Network/Deep Learning frameworks to use.
One thing that did surprise me though was how infrequently Pandas and scikit-learn were mentioned. The likely reason is that extremely common libraries like Pandas and Scikit-learn are taken for granted. Companies probably assume you know how to handle and manipulate data and implement fundamental machine learning algorithms as an entry-level data scientist.
Among database technologies, SQL seems to be the most important: it was mentioned 59 of the 62 times a database technology was mentioned. In comparison, NoSQL was mentioned 5 times and Kafka 9 times.
There are posts on the internet claiming that if you are not dealing with big data, you should not and cannot call yourself a data scientist. I would love to send these findings to the authors of those posts, to be honest. In my experience and that of many of my colleagues, you often do not deal with extremely large amounts of data. You have a moderate amount of data and you do the best you can with it to the best of your knowledge. It is therefore not surprising that big data tools came up in only 37 of the job listings, 28 of which mention Spark and 23 Hadoop. They were mostly mentioned together. MapReduce was not mentioned in these job listings.
I get questions about learning git, Linux or cloud solutions from aspiring data scientists. I always say that from what I’ve seen so far, the need for Linux skills is company dependent. It is also evident here since only 6 job listings mentioned it.
Cloud, however, is still getting more widely adopted and I think it is definitely a plus to say you can work in the cloud. Nevertheless, the people who deal with deployment and integration tools on the cloud are often data engineers and not data scientists.
On the other hand, I think knowing how to use git, at least simple pulling and pushing and the idea of version control, is good for any job where you’ll be working in a team. Actually, even when you’re working alone you should use version control. Over the years I’ve seen some grand dramas simply because version control was neglected. I would suggest that you not take the frequency of git mentions too seriously and learn the fundamentals anyway. I’m sure your future team lead would be more than happy to know you already know how to use it.
There are many visualization tools and libraries out there. Dash and Tableau seem to be the more popular ones among employers. Matplotlib, D3, Qlikview, PowerBI and ggplot are apparently not seen as an important requirement on 98% of the job listings.
I would say you probably will learn how to use Matplotlib while doing personal projects or even Kaggle competitions. And, to be honest, that’s the one I use the most often.
When it comes to software like Tableau, QlikView and PowerBI, I think many data scientists are hesitant to use them. Mostly because you can easily visualise what you need to see with code. I can see how for serious meetings and important presentations the more professional looking and complex visualisation prepared by these software solutions would be helpful though.
Long story short, I would probably not spend too much time learning about how to make pretty visualisations unless you are aiming for a visuals-heavy type of data science.
Surprisingly, even though we always think of coding when we say data science, Excel is a frequently mentioned requirement too. I use it occasionally for simple tasks like checking whether the data looks like what I expect or sorting the data by a column without having to fire up my coding environment. I don’t really know how these companies think you will use Excel or how deep an Excel knowledge they expect you to have. I’d say as long as you know how to code, your Excel skills are not going to be super necessary.
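For comparison, the same kind of quick check and sort takes only a few lines of Python. This is a minimal sketch using the standard library. The column names `skill` and `mentions` are hypothetical, and the in-memory text stands in for whatever file you would otherwise open in Excel; the three counts are the language figures quoted earlier in this article.

```python
import csv
import io

# In-memory stand-in for a small CSV file (hypothetical column names).
raw = io.StringIO(
    "skill,mentions\n"
    "Julia,3\n"
    "Python,86\n"
    "R,16\n"
)

reader = csv.DictReader(raw)
rows = list(reader)

# Quick look: are the columns and the number of rows what we expect?
print(reader.fieldnames, len(rows))

# Sort by the "mentions" column, largest first. Values are read as strings,
# so they are converted to int for a numeric (not alphabetical) sort.
rows.sort(key=lambda row: int(row["mentions"]), reverse=True)
for row in rows:
    print(row["skill"], row["mentions"])
```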
For the sake of completeness, I checked to see how many times other office tools were mentioned. Microsoft Word was mentioned 6 times and PowerPoint 0 times. I guess companies are getting over the obsession of listing Microsoft Office tools as requirements after all.
Which discipline are these jobs in?
Analytics was the most frequently mentioned discipline, with 74 mentions. Machine learning follows analytics closely. This is not surprising, though, because data analytics and machine learning are the core of data science.
Deep learning is mentioned in only 22 of the 100 postings. This shows that it is far from being the dominant discipline in data science but there are definitely a significant number of companies developing applications using deep learning. Natural language processing, computer vision and robotics are mentioned 4, 3 and 1 time(s) respectively. In light of these numbers, we can conclude that unless you’re aiming for that specific 2%, you need to get your basic data analytics and machine learning skills in order before focusing on more advanced topics.
One of the more important things when learning a new profession is to keep yourself motivated and to set your priorities straight. I hope this post helped you understand what companies are generally looking for so that you can decide more clearly on what to work on next.