We live in the Age of Data, there is no doubt about it. Whether you are checking your e-mail in the morning, turning on your TV or reading the newspaper, you are surrounded and influenced by data at all times. Especially since the rise of mobile devices, we carry an open door to a universe of data with us all day long. But what most people do not realize is that they are not only data consumers but also data producers. Every time you browse the internet, your actions are being tracked; every time you buy groceries, your shopping list is being recorded for analysis; every time you listen to a song on iTunes, Spotify, etc., the music industry is a step closer to understanding its customers. And these are just a few examples of how companies are being flooded with data every day. That data is crucial to understanding their business and their customers, and failing to use it will result in many missed opportunities.
An excess of information is as problematic as a lack of it. Big Data is the term that has recently become associated with the process of dealing with unmanageable amounts of information. It has become a big buzzword, and top influencers such as McKinsey and IBM are talking about it. Even the popular JP Morgan summer reading list contains a book about Big Data.
When dealing with Big Data, traditional Business Intelligence techniques are no longer sufficient: companies get too much information from too many sources, and the data no longer fits in an Excel spreadsheet. So the big players in this game are the Data Scientists. These are the professionals capable of slicing and dicing this huge amount of information, creating complex algorithms that identify patterns to be used for business decisions, and translating all these bits and bytes into a language that stakeholders can understand.
But the role of a Data Scientist is not always well defined. People wonder whether they should be classified as Scientists, Engineers, Statisticians, Software Developers… and the truth is that they are a mix of all of these. The process of understanding Big Data goes through many different phases, each of them requiring a distinct set of skills. There is a visualization that shows this intersection:
First of all, infrastructure needs to be deployed to store and retrieve a huge amount of data. Traditional relational databases struggle to do the job because they do not scale out easily. They are good at storing structured data, but this focus on structure makes them difficult to distribute across multiple machines. Therefore, NoSQL databases such as Cassandra, HBase and MongoDB are the choice of most Data Scientists. Especially popular is the distributed file system HDFS (Hadoop Distributed File System), the open source counterpart of GFS (Google File System), which allowed Google to index a huge part of the vast internet and make it usable for its customers. Setting up the infrastructure, understanding the schemas and distributing your data across a computer cluster require a good amount of Hacking Skills, especially if that data has to be cleaned and refined.
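To give a feel for what distributing data across a cluster means, here is a minimal sketch of hash partitioning, one simple way of deciding which machine stores which record. The node names, the sample records and the helper functions `node_for_key` and `partition` are all hypothetical and only for illustration; real systems layer replication, rebalancing and failure handling on top of this idea.

```python
import hashlib

def node_for_key(key, nodes):
    """Pick a node by hashing the key, so the same key always lands on the same node."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

def partition(records, nodes):
    """Assign every (key, value) record to exactly one node."""
    placement = {node: [] for node in nodes}
    for key, value in records:
        placement[node_for_key(key, nodes)].append((key, value))
    return placement

# Hypothetical three-machine cluster and a handful of made-up records.
nodes = ["node-a", "node-b", "node-c"]
records = [
    ("user:1", "bought milk"),
    ("user:2", "played a song"),
    ("user:3", "opened the app"),
    ("user:1", "bought bread"),
]

placement = partition(records, nodes)
for node, stored in placement.items():
    print(node, stored)
```

Because the node is derived from the key alone, both records for `user:1` end up on the same machine, so no central lookup table is needed to find them again.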
But storing the data is not enough. It also has to be understood, and this requires substantive expertise about the specific market at hand. It requires constant communication with, and understanding of, the different business units of an organization in order to define relevant hypotheses that will be confirmed or dismissed with the data. Once these hypotheses have been defined, the data cannot be examined manually to identify the relevant information; there is simply too much of it. Statistical analysis and machine learning techniques are therefore the most suitable tools for this task. Clustering, regression, statistical inference, Support Vector Machines… are very common terms in the Data Science vocabulary. But using those techniques is not as straightforward as in the past, since they have to be parallelized across multiple computers using techniques such as MapReduce. And finally, when the data has been processed and analyzed, it has to be presented, so Data Visualization is an area that a Data Scientist needs to master. There is no value in extracting relevant information if it cannot be properly communicated.
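To make the MapReduce idea mentioned above a bit more concrete, here is a minimal single-machine sketch of its three stages (map, shuffle, reduce) applied to the classic word count problem. It is not Hadoop code: the function names and the sample documents are assumptions made for illustration, and everything runs sequentially in one process. In a real cluster, the map and reduce calls would run in parallel on different machines and the framework would handle the shuffle.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def map_reduce(documents):
    """Run the three stages in order over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Made-up input; each string stands in for a file stored on a different machine.
documents = ["big data is big", "data science needs data"]
word_counts = map_reduce(documents)
print(word_counts)
```

The key design point is that `map_phase` looks at one document at a time and `reduce_phase` at one key at a time, so neither needs to see the whole data set. That independence is what lets the work be spread across many computers.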
Data Science has been described as the “sexiest job of the 21st century” by the Harvard Business Review. Some universities, such as Columbia University and New York University, have already jumped on the train by creating programs with very specific training for this type of professional. Over the past year there has been a rise in the number of companies demanding people with this set of skills. And there will be more to come as companies discover the hidden opportunities behind Big Data and the technologies to process it become more mature. So being an early adopter of Data Science and understanding its importance is going to be a key differentiator for any young professional or any company looking for success.
Are you ready to jump on that train with me?

Lluis Canet