Learning Data Science does not start with doing Data Science

Those who are in love with practice without theoretical knowledge are like the sailor who goes onto a ship without rudder or compass and who never can be certain whither he is going. – Leonardo da Vinci

The foundation of Data Science rests on a body of math and statistics knowledge that provides the ability to understand and model datasets. Because data modeling algorithms may look to the untrained eye like simply a collection of lines of code, those who jump into modeling data without the statistical or mathematical background may end up making serious mistakes that nullify the results of any analysis they do.

If you are serious about becoming a Data Scientist, avoid the temptation of jumping straight into modeling data. Instead, ensure you have the right math and statistics foundation in place before learning how to build models. For details on the overall skills required to be a Data Scientist, read my previous blog What are the skills that define the role of Data Scientist? For resources on the different paths to becoming a Data Scientist, read Where do Data Scientists come from?

There are multiple ways to get the skills to become a Data Scientist, but regardless of the option you choose, make sure the right foundation is built into the program, or find ways to build it yourself, whether through college classes, digital classes, or even good old-fashioned academic books. Specifically, what math knowledge should you have? Here is a summary of some of the most common areas:

Probability Foundation

What is Probability?, Sample Spaces, Properties/Rules of Probability, Probability of Combinations of Events (Intersection of Events, Union of Events, Contingency Tables), Conditional Probabilities, Independent vs. Dependent Events, Bayes’ Theorem, Counting Principles (Permutations, Combinations), Sampling Techniques, Probability Distributions (Continuous and Discrete Random Variables, Cumulative Distribution, Binomial Distribution, Poisson Distribution, Geometric Distribution, Exponential Distribution, Normal Distribution, Chi-Square Distribution, Expected Values)
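
Since Bayes’ Theorem trips up so many newcomers, here is a minimal Python sketch of it in action. The scenario is the classic diagnostic-test example, and all the numbers are invented for illustration:

```python
# Bayes' theorem: P(disease | positive test) from known rates.
# All numbers below are illustrative assumptions, not real medical data.
def bayes(prior, sensitivity, false_positive_rate):
    """P(A|B) = P(B|A) * P(A) / P(B), with P(B) expanded by total probability."""
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

# 1% prevalence, 99% sensitivity, 5% false positive rate:
posterior = bayes(prior=0.01, sensitivity=0.99, false_positive_rate=0.05)
```

Even with a 99% sensitive test, the low prior drags the posterior down to roughly 17% – exactly the kind of counterintuitive result that surprises those without the probability foundation.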

Statistics Foundation

  • Descriptive Statistics: Quantitative Data vs. Qualitative Data, Types of measurements: nominal, ordinal, intervals, ratios, Frequency Distributions, Relative Frequency Distributions, Cumulative Frequency Distributions, Measures of Central Tendency: Mean, Median and Mode, Measures of Dispersion: Range, Variance, Standard Deviation, Measures of Relative Position: Quartiles, Interquartile Range, Outliers, The empirical rule (normal distributions) and Chebyshev’s Theorem, Visualizing Data: Histograms, Stem and Leaf, Box Plots
  • Inferential Statistics: Sampling Distribution of the Mean, Sampling Distribution of the Proportion, Standard Error of the Mean, The Central Limit Theorem, Confidence Intervals and their interpretation, Effects of changing confidence levels and sample sizes, Working with small vs. large samples, Formulating and testing hypotheses, The Null and Alternative Hypotheses, Type I and Type II Errors, 1-tailed vs. 2-tailed hypothesis tests, Testing the mean and the proportion of a population using 1 sample, Testing the difference in means and proportions using 2 samples, Analysis of Variance (ANOVA) comparing 3 or more population means, Understanding the role of Alpha and the p-value, Working with Dependent vs. Independent samples, Correlation and Simple Regression (confidence intervals, hypothesis tests on the regression line, regression assumptions), Multiple Regression (assumptions, multicollinearity)
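
As a small taste of the inferential ideas above, here is a minimal sketch of a 95% confidence interval for a mean, using only the Python standard library and the normal approximation. The sample data is invented for illustration:

```python
import math
import statistics

def confidence_interval(sample, z=1.96):
    """95% CI for the mean using the normal approximation (z = 1.96).
    For small samples a t critical value should be used instead."""
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))  # standard error
    return mean - z * se, mean + z * se

data = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2, 12.3, 11.7, 12.0, 12.1]
low, high = confidence_interval(data)
```

Understanding what that interval does and does not say (it is a statement about the procedure, not a 95% probability that the true mean sits in this particular interval) is precisely the kind of foundation this section is about.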

Linear Algebra Foundation: Vectors and Real Coordinate Spaces (Adding, Subtracting, and Multiplying Vectors by a Scalar, Linear Combinations, Linear Dependence and Independence), Matrix Transformations (Functions and Linear Transformations, Inverses and Determinants, Matrix Multiplication, Transpose), Orthogonal Complements and Projections, Eigenvectors and Eigenvalues.
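
To make the linear algebra concrete, here is a small pure-Python sketch that recovers the eigenvalues of a 2×2 matrix from its characteristic polynomial. (In practice you would use a library such as NumPy; this is only to show the underlying idea.)

```python
import math

def eigenvalues_2x2(a, b, c, d):
    """Real eigenvalues of [[a, b], [c, d]] via the characteristic
    polynomial: lambda^2 - trace*lambda + det = 0."""
    trace, det = a + d, a * d - b * c
    disc = trace * trace - 4 * det
    if disc < 0:
        raise ValueError("complex eigenvalues")
    root = math.sqrt(disc)
    return (trace + root) / 2, (trace - root) / 2

# [[2, 1], [1, 2]] has eigenvalues 3 and 1
lam_max, lam_min = eigenvalues_2x2(2, 1, 1, 2)
```

Eigenvalues and eigenvectors show up again later in techniques like principal component analysis, which is one reason this foundation matters.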

The bottom line is that when you move from working in a deterministic world to a probabilistic one, it is important to understand the implications of this paradigm change.

Where do Data Scientists Come From? How Do I become One?


I frequently get asked: How do I become a data scientist? How do Data Scientists get their skills? There are several options, but first, let’s take a look at the profile of data scientists and see how your skills compare. According to a 2018 study by Burtch Works, it is estimated that:

  • 91% of data scientists have an advanced degree (43% hold a Master’s degree, and 48% hold a PhD)
  • 25% of data scientists hold a degree in statistics or mathematics, 20% have a computer science degree, 20% hold a degree in the natural sciences, 18% hold an engineering degree, 8% hold a business degree, 5% hold a social science degree, and 4% hold an economics degree
  • 44% of data scientists are employed by the technology industry, followed by financial services (14%), marketing services (9%), consulting (8%), healthcare/pharma (6%), and retail (5%)
  • 62% of data scientists are US citizens, 19% have permanent residency and the rest are on temporary visas
  • 85% of data scientists are male

These statistics should give you a general idea about the skillset shared by the community of data scientists.

How do I become a Data Scientist? Where do I get the skills?

There are several options depending on how much time you want to spend acquiring the skills, how much money you want to spend, and how deep you want your skills to be. Here is a summary of the most popular options:

1- PhD: If you are interested in a PhD, or are already pursuing one, more likely than not you will acquire skills that will help you become a Data Scientist. As we saw in the statistics above, today almost half of data scientists have a PhD. On average, a PhD will take several years to complete.

2- Master of Science in Analytics/Data Science: The first master’s program focused on Analytics/Data Science was started by the Institute for Advanced Analytics (IAA) at NC State in 2007 (I graduated from this program in 2012). It was for many years the only program in the nation, and due to high demand, many new programs have been launched in the past few years. There are now over 200 programs in the United States; some are full time, some part time, and some online. If you are curious about which programs could match your needs, you can use the interactive map kept by the IAA to find options in the United States. It will usually take you 1 to 2 years to complete a master’s degree.

3- Data Science Bootcamps: There are several alternatives. Switchup has a good summary of the best Data Science Bootcamps. Figure 1 summarizes a comparison of these programs using Switchup’s ratings and details from each of the programs (disclaimer: this information changes frequently, so it may be out of date by the time you read it). Refer to their websites for up-to-date details. You will be able to complete a bootcamp in a matter of months.

Figure 1 Summary of Information on the Best Data Science Bootcamps (Switchup)

4- Data Science Online Certifications: MOOCs provide a wealth of training options: Coursera, edX, Cognitiveclass.AI, and Udacity are all great choices. Udacity has an advantage over the others in that they offer mentorship together with their education programs. (See what one student of both Udacity and other programs has to say about his experience.) Prices and lengths of programs vary greatly among these choices, and they offer a lot of flexibility. However, you will still need practical experience. These classes can give you a foundation that will need to be supplemented with practical experience to become a Junior Data Scientist.

For more information on the type of skills needed to be a Data Scientist, refer to What are the skills that define the role of Data Scientists?

What are the skills that define the role of Data Scientists?

Data Science is an emerging field, but it is definitely not a new field. Yet, many people still struggle to define Data Science as a field, and more importantly, struggle to define the set of skills that collectively define a “Data Scientist”.

What is data science?

Data Science is a cross-disciplinary set of skills found at the intersection of statistics, computer programming, and domain expertise. Perhaps one of the simplest definitions is illustrated by Drew Conway’s Data Science Venn Diagram (Figure 1), first published on his blog in September 2010. Discussions about this field, however, go back as far as 50 years. If you are interested in learning more about the history of the Data Science field, you can read the 50 Years of Data Science paper written by David Donoho.

Figure 1 – Drew Conway’s Data Science Venn Diagram

The bottom line is that Data Science comprises three distinct but overlapping areas: a set of math and statistics knowledge that provides the ability to understand and model datasets, a set of computer programming/hacking skills to leverage algorithms that can analyze and visualize data, and the domain expertise needed to ask the right questions and put the answers in the right context.

It is important to call attention to the “Danger Zone” above, as there is nothing more dangerous than aspiring Data Scientists who do not have the appropriate math and statistical foundation to model data.

What skills define the role of Data Scientists?

A Data Scientist is not just a computer programmer, or just a statistician, or just a business analyst. In order to be a data scientist, individuals need to acquire knowledge from all of these disciplines, and at a minimum develop skills in the following areas:

1. Probability, Statistics, and Math Foundation. This includes probability theory, sampling, probability distributions, descriptive statistics (measures of central tendency and dispersion, etc.), inferential statistics (correlations, regressions, the central limit theorem, confidence intervals, development and testing of hypotheses, etc.), and linear algebra (working with vectors and matrices, eigenvectors, eigenvalues, etc.).

2. Computer Programming. Throughout the years, SAS has probably been the most commonly used programming language for Data Science, but adoption of the open source languages Python and R has increased significantly. If you are starting today to acquire data science skills, my recommendation would be to focus on Python. Looking at worldwide Google searches for “R Data Science” and comparing them to “Python Data Science”, the trend is clear (Figure 2). Interest in Python has surpassed R and continues on a positive trend. This makes sense given that Python allows you to create models and also to deploy them as part of an enterprise application, so data scientists and app developers can work together within the same platform to build and deploy models end to end. R, while easier in some cases for modeling purposes, was not designed as a multi-purpose programming language.

Figure 2 Worldwide searches for “R Data Science” vs. “Python Data Science”. Google Trends (June 2018)

3. Data Science Foundation. This involves learning what data science is and its value in specific use cases. It also involves learning how to formulate problems as research questions with associated hypotheses, and applying the scientific method to business problems. Data Science is an iterative process, so it is critical to have a solid understanding of the methodologies used in its execution (define the problem, gather information, form a hypothesis, find/collect data, clean/transform data, analyze data and interpret results, form a new hypothesis).

Figure 3 Data Science Iterative Cycle

4. Data Preparation/Data Wrangling. Data is by definition dirty. Before data can be analyzed and modeled, it needs to be collected, integrated, cleaned, manipulated, and transformed. Although this is the domain of “Data Engineers”, Data Scientists should also have a solid understanding of how to construct usable, clean datasets.
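
A minimal wrangling sketch gives a feel for what this looks like in practice. The records below are entirely hypothetical; the steps (normalize text, coerce types, impute a missing value) are typical of the work:

```python
# Hypothetical raw records: inconsistent names, string-typed numbers,
# and a missing income value.
raw = [
    {"name": " Alice ", "age": "34", "income": "72000"},
    {"name": "BOB",     "age": "29", "income": ""},       # missing income
    {"name": "carol",   "age": "41", "income": "88000"},
]

# Normalize text and coerce types
clean = [
    {"name": r["name"].strip().title(),
     "age": int(r["age"]),
     "income": float(r["income"]) if r["income"] else None}
    for r in raw
]

# Impute missing incomes with the mean of the observed ones
observed = [r["income"] for r in clean if r["income"] is not None]
mean_income = sum(observed) / len(observed)
for r in clean:
    if r["income"] is None:
        r["income"] = mean_income
```

Even in this toy example, the imputation step is a modeling decision (mean vs. median vs. dropping the row), which is why data scientists cannot treat wrangling as someone else’s problem.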

5. Model Building. This is the core of data science execution, where different algorithms are used to train models with data (structured and unstructured) and the best algorithm is selected. At this stage, data scientists need to make basic decisions about the data, such as how to deal with missing values, outliers, unbalanced data, multicollinearity, etc. They need solid knowledge of feature selection techniques (which data to include in the analysis), and need to be proficient in techniques for dimensionality reduction such as principal component analysis. Data scientists should be able to test different supervised and unsupervised algorithms such as regression, logistic regression, decision trees, boosting, random forests, Support Vector Machines, association rules, classification, clustering, neural networks, time series, survival analysis, etc. Once different algorithms are tested, the “best” algorithm is selected using model accuracy metrics. Data scientists should also be skilled in data visualization techniques, and should have solid communication skills to properly share the results of the analysis and the recommendations with nontechnical audiences.
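
The selection step above can be sketched in a few lines of pure Python: fit a simple linear regression on a training split, compare it to a mean-only baseline on held-out data, and keep whichever has the lower RMSE. The data is invented for illustration; a real workflow would use a library like scikit-learn and proper cross-validation:

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def rmse(ys, preds):
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

# Hypothetical data with a linear trend; first 6 points train, last 2 test
xs, ys = list(range(8)), [1.0, 3.1, 4.9, 7.2, 9.0, 11.1, 12.8, 15.2]
a, b = fit_line(xs[:6], ys[:6])

baseline = [sum(ys[:6]) / 6] * 2               # model 1: predict the mean
linear = [a + b * x for x in xs[6:]]           # model 2: linear regression
best = "linear" if rmse(ys[6:], linear) < rmse(ys[6:], baseline) else "baseline"
```

The key idea is that the comparison happens on data the models never saw during training; without that discipline, accuracy metrics are meaningless.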

6. Model deployment.  A very important part of building models is to understand how to deploy those models for consumption from a data application. While this is typically the domain of machine learning engineers and application developers, data scientists should be familiar with the different methods to deploy models.

7. Big Data Foundation. A lot of organizations have deployed big data infrastructure such as Hadoop and Spark.  It is important for data scientists to know how to work with these environments.

8. Soft Skills.  Successful data scientists should also have the following soft skills:

a. Ability to work in teams. Because of the interdisciplinary nature of this field, it is by definition a team sport. While every data scientist on a team will need a good foundation in all the skills defined above, the depth of those skills will vary among them. This is not a field for individualistic stars, but a field for natural team players.

b. Communication Skills. Data scientists need to be able to explain the results of their analysis and the implications of those results in nontechnical terms. The best analysis can go to waste if not properly communicated.

Last but not least, it is important to remember that the most important characteristic of great data scientists is CURIOSITY. Data Scientists should be relentless in their search for the best data and the best algorithm, and should also be lifelong learners as this field is advancing very rapidly.

In summary, if you are interested in the Data Science field, or if you are exploring ways to develop your skills, make sure that you are addressing all these areas, and especially make sure not to end up in the danger zone having programming skills and domain knowledge but lacking the math and statistics foundation needed to model data correctly.

If you are a digital marketer and think Russian Troll farms only impact the world of politics, think again….

By now, you have probably heard about how trolls and bots were used to influence the 2016 elections in the United States.  And if you are a marketer with expertise in Social Media, you can easily understand how Social Media Channels – because of their network effects –  can easily support the rapid dissemination of any messages – positive and negative.

If you are not familiar with how they work in the context of politics, let’s start with simple definitions:

  • Bots: automated accounts that repost content, usually focused on a specific hashtag or a specific digital destination. A specific message can be disseminated in seconds by thousands of them working together without human intervention
  • Trolls: accounts created by individuals, usually with a fake identity, focused on writing content – typically on controversial topics – that is then posted organically or promoted via paid ads and supported by an army of bots for reposts. These individuals may be sitting somewhere in Russia, but the accounts are created under personas such as a 30-year-old housewife in Michigan or an 18-year-old gun lover.

Bots and trolls can be found anywhere in the world, but the most sophisticated operation is found in Russia, where they have been used internally to promote Putin’s agenda while making it seem like individual people are talking about their priorities on social media. This video provides additional information about the Russian troll farms.

Let’s say, however, that you are not interested in politics. As a marketer, how does this impact you and your priorities?

US social media ad spend is expected to reach $14 billion in 2018, up from just $6.1 billion in 2013. If you are a CMO or a digital marketer, you know a significant part of your budget is spent on social media. But what happens when the platform includes a significant number of trolls sitting somewhere in Russia but posing as individuals in the US? The result is that audience metrics are significantly distorted, and your money may be spent reaching fake accounts.

  • 10% of Facebook’s 2.07 billion monthly users are now estimated to be duplicate accounts, up from the 6% estimated previously. The social network’s estimate of fake accounts – accounts not associated with a real person – increased from 1% to 2-3%. These figures mean that there are now roughly 207 million duplicate accounts and as many as 60 million fake accounts on the network. Facebook says they are working on ways to take this into account when campaigns are being created, but is it enough?
  • Twitter is estimated to have about 50 million fake accounts.

As advertisers, we should demand more focus on getting this issue fixed. After all, you need to make sure your money is not wasted on advertising to fake accounts. Technically it is probably a difficult challenge, but the same way email systems had to find ways to reduce the impact of spam and build better spam filters, it is time for social media organizations to focus on technologies that help reduce this problem. A couple of approaches come to mind: increasing the use of machine learning models to support the identification of bot and troll accounts, and using technologies like blockchain for digital ID so that people on social networks are actually who they say they are.
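
To give a flavor of what bot identification involves, here is a toy heuristic of the kind that might feed a real machine learning model. The features and thresholds are entirely invented for illustration and are far simpler than anything a platform would actually deploy:

```python
# Toy heuristic for flagging bot-like accounts.  Features and thresholds
# are invented for illustration only.
def bot_score(posts_per_day, repost_ratio, account_age_days):
    score = 0
    if posts_per_day > 50:        # superhuman posting rate
        score += 2
    if repost_ratio > 0.9:        # almost never posts original content
        score += 2
    if account_age_days < 30:     # brand-new account
        score += 1
    return score

def looks_like_bot(**features):
    return bot_score(**features) >= 3

suspicious = looks_like_bot(posts_per_day=120, repost_ratio=0.97,
                            account_age_days=12)
normal = looks_like_bot(posts_per_day=3, repost_ratio=0.2,
                        account_age_days=900)
```

Real systems would learn such weights from labeled data rather than hard-code them, which is exactly where machine learning models come in.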

What Data Science Tools Have you used in the past 12 months? KDnuggets Poll Results are out!

The results of the 18th annual KDnuggets Software Poll were recently published. This poll asks “What Predictive Analytics, Data Mining, Data Science Software/Tools have you used in the past 12 months?”.  This poll attracted 2,900 voters, and it is also worth mentioning that it sometimes attracts controversy due to excessive voting by some vendors.  See all data at KDNuggets.

Some of the most relevant findings are:

  • Python has now overtaken R as a Data Science tool – barely, but still noticeably (53% vs. 52% usage, but Python grew 15% while R only grew 6%)
  • There are now 2 newcomers joining the top 10 list: Tensorflow and Anaconda
  • Use of Excel for Analytics purposes decreased by 16%
  • In terms of programming languages, Python, R and SQL run the show with usage of all 3 growing
  • Big Data tools were simplified to only 4 categories: Hadoop Open Source, Hadoop Commercial, SQL on Hadoop Tools, and Spark. The highest-growth tool is SQL on Hadoop, and usage of Hadoop Open Source is decreasing


We have 2 newcomers this year: Anaconda and Tensorflow

Top 2 tools:

  • Use: Python (53%) and R (52%)
  • Growth: Tensorflow (197%), Anaconda (37%)


Top 2 languages:

  • Use: Python (53%) and R (52%)
  • Growth: Python (15%), R (6%)



The Big Data tools on the survey have been simplified to 4 categories: Hadoop Open Source, Hadoop Commercial, Spark, and SQL on Hadoop Tools.

Top 2 Big data tools:

  • Use: Spark (23%) and Hadoop Open Source (15%)
  • Growth: SQL on Hadoop Tools (41%), Spark (5%)

It is important to note the 32% decrease in usage of Hadoop Open Source. I am not sure whether there has been a real decrease, or whether it is an artifact of the survey splitting the Hadoop category in two (Open Source and Commercial): part of this “decrease” could simply be attributed to the fact that there are now 2 categories instead of 1.


Top Deep Learning Tools:

  • Use: Tensorflow (20%) and Keras (9.5%)
  • Growth: Microsoft CNTK (278%), mxnet (200%)

What is Cognitive Computing? The 3 Things Series

*** The 3 Things Series aims to simplify – sometimes even oversimplify – technology concepts so that you learn 3 things about a topic ***. Opinions are my own.

Cognitive Computing is an evolving technology that –  although in its infancy – has the potential to revolutionize all industries, by redefining how work gets done, and by augmenting human capabilities.

In a general sense, cognitive systems – with IBM Watson at the forefront of this new era – combine natural language processing, machine learning, and real-time computing power to process vast amounts of data – structured and unstructured – to provide answers to a specific request.

These systems are very different from traditional systems that are programmed based on rules and logic. They enable people to create value by finding insights in volumes of data, while mimicking cognitive elements of human expertise. They “learn” about a specific domain, develop hypotheses, evaluate those hypotheses, and choose the best option. They do this at massive speed and scale.

So what does this all mean? I find that the best way is to show you the origins of IBM Watson. If you watch this 3-minute video, you will have a better understanding: http://bit.ly/2qhElVe

So how does a cognitive system work?

1.    The system first needs to be trained with domain-specific data. In the case of Jeopardy!, Watson had access to about 200 million pages of structured and unstructured content, including the full text of Wikipedia. This stage is very important, as the outcomes will only be as good as the information used in the training process. In fact, when Watson was being developed, Urban Dictionary was included as part of the training, only to be removed when the system quickly learned to use offensive words. If we think about the field of oncology, for example, the training data would include information on research papers, drugs used, demographic data on individuals receiving treatment, outcomes, etc. So the system must be able to understand both structured and unstructured data, and also to interact with humans in a natural way (through natural language).

2.    Once a system is trained, it will have the ability to form hypotheses, make considered arguments, and prioritize recommendations to help humans make better decisions. Cognitive systems are probabilistic, and they generate responses according to levels of confidence. They can also show the evidence for the responses – what data backs up the answer and the confidence score. In the Jeopardy! scenario, you probably noticed in the video above that the top 3 options were displayed on the screen together with their confidence scores. Decisions can then be made based on those confidence scores to select the best option. In the oncology example, the ability of a doctor to look at all the evidence the system used, and at a collection of hypotheses that may include some the doctor had not considered before, is extremely valuable.

3.    Cognitive systems ingest and accumulate data insights from every interaction. The confidence levels they provide are subject to change as subject matter experts grade the responses, since the system is not programmed but trained by experts who enhance, scale, and accelerate their expertise.
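
The confidence-scored selection in step 2 can be illustrated with a toy sketch. This is only an illustration of the idea of ranking candidate answers by normalized confidence, not a description of how Watson or any real system actually scores evidence; the candidates and raw scores are made up:

```python
import math

# Made-up raw evidence scores for candidate answers
candidates = {"Toronto": 1.2, "Chicago": 3.4, "Boston": 2.1}

# Normalize raw scores into confidences that sum to 1 (softmax)
total = sum(math.exp(s) for s in candidates.values())
confidences = {name: math.exp(s) / total for name, s in candidates.items()}

# Rank answers by confidence, highest first
ranked = sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)
top_answer, top_confidence = ranked[0]
```

The point is that the system does not return a single “right” answer: it returns all candidates with confidences attached, and a human (or a threshold) decides whether the top one is trustworthy enough to act on.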

So what are the 3 things to remember about cognitive systems? They are trained, not programmed; they provide probabilistic responses with confidence levels instead of exact answers; and they get better over time as experts enhance, scale, and accelerate their expertise.

What is Analytics – The 3 Things Series


Before I embark on explaining what Analytics is, as well as the different types of Analytics, let’s just talk for a second about why Analytics. The field of Analytics was born with the goal of using data, and the analysis of that data, to improve performance in key business domains; this basically means having the ability to make better decisions, and to execute the right actions based on data insight.

So what is Analytics?

The field of Analytics involves all that is necessary to drive better decision making and add value, such as data platforms (on premise, or on private or public cloud), access to data (structured and unstructured), and tools for quantitative analysis and data visualization. In other words, Analytics is all about turning data into insight, which in the world of business means turning data into competitive advantage.

Analytics Types

There are three main types of Analytics:

1- Descriptive Analytics help you understand “What happened?”. The goal of descriptive analytics, as the name implies, is to describe or summarize raw data and turn it into something that makes sense to the human eye – typically by presenting the data in tables or reports, or in visualizations and charts. They are very useful for understanding past behaviors, and how the past might influence future outcomes. Basic statistics like averages, sums, percent changes, or proportions fit into descriptive analytics. This is the simplest form of analytics, but it is nevertheless extremely useful and necessary as a stepping stone to more sophisticated and valuable analytics.
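
A descriptive summary is easy to sketch with the Python standard library. The daily sales figures below are invented for illustration:

```python
import statistics

# Hypothetical daily sales for one week
sales = [120, 135, 118, 142, 135, 150, 128]

summary = {
    "mean": statistics.mean(sales),
    "median": statistics.median(sales),
    "mode": statistics.mode(sales),
    "stdev": round(statistics.stdev(sales), 2),
    "range": max(sales) - min(sales),
}
```

Nothing here predicts anything; it simply describes what happened, which is exactly the scope of descriptive analytics.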

2- Predictive Analytics help you understand “What could happen?”. There are two main goals: finding relationships or patterns, and predicting what could potentially happen. Predictive analytics help you reason about the future while providing actionable insights based on data. They don’t provide predictions that are 100% accurate; rather, they provide estimates of the likelihood of a future outcome. They can be used throughout an organization to forecast sales or inventories, to detect fraud, to understand customer behavior, or in any scenario where relationships among data and “predicting the future” will help make better decisions. We are all familiar with our credit scores, right? That is an example of predictive analytics, where historical data on how well you manage your credit is used to produce a score that can then be used as a proxy for how much of a credit risk you might be.
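
In the spirit of the credit score example, here is a toy scoring sketch. The features, weights, floor, and cutoffs are entirely made up for illustration and bear no relation to any real scoring model:

```python
# Invented weights for a toy credit-style score
WEIGHTS = {"on_time_payment_rate": 400, "utilization_headroom": 200,
           "years_of_history": 1}

def score(profile):
    base = 300  # hypothetical floor
    return base + sum(WEIGHTS[k] * v for k, v in profile.items())

def risk_band(s):
    return "low" if s >= 700 else "medium" if s >= 600 else "high"

applicant = {"on_time_payment_rate": 0.98,   # 98% of payments on time
             "utilization_headroom": 0.6,    # 60% of credit unused
             "years_of_history": 12}

s = score(applicant)
band = risk_band(s)
```

A real predictive model would learn these weights from historical outcomes instead of hand-picking them, but the shape is the same: past behavior in, estimated future risk out.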

3- Prescriptive Analytics help you determine “What should we do?”. Prescriptive analytics are all about providing specific guidance about what to do. They attempt to quantify the effect of future decisions by looking at the possible outcomes of each scenario before the actual decisions are made. They use a combination of business rules, algorithms, and modeling procedures to evaluate possible outcomes. They are typically used in supply chain management, price optimization, and workforce planning, among others. They are very useful, as the name implies, to “prescribe” a direction after examining multiple possible scenarios.
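
The scenario-comparison idea behind prescriptive analytics can be sketched in a few lines: evaluate candidate prices against a demand model and prescribe the most profitable one. The linear demand curve and unit cost below are invented for illustration:

```python
UNIT_COST = 4.0

def demand(price):
    """Assumed linear demand curve: higher price, fewer units sold."""
    return max(0.0, 1000 - 80 * price)

def profit(price):
    return (price - UNIT_COST) * demand(price)

# Evaluate each candidate scenario before any decision is made
candidate_prices = [5.0, 6.5, 8.0, 9.5, 11.0]
best_price = max(candidate_prices, key=profit)
```

Real price optimization would use richer demand models and constraints, but the prescriptive pattern is the same: simulate each scenario, then recommend the best one.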

In summary, Analytics helps you turn your data into insight for better decision making, and there are 3 main types of Analytics that you use depending on your goal: descriptive analytics to understand what has happened in the past, predictive analytics to understand relationships among data and provide predictions about what may happen in the future, and prescriptive analytics to provide specific recommendations about what to do in specific scenarios.

United’s CEO Fails to Understand the Power Shift brought on by Social Media


By now everyone has seen the despicable way a United Airlines passenger was treated, being forcibly removed from a plane to release his seat to another passenger. I personally find the whole situation appalling, but that is not what I want to discuss now. I want to talk about the way United’s CEO handled the situation, and how it clearly demonstrates a failure to understand how social media has shifted power from the few to the many.

Imagine if this situation had happened 20 years ago, before social media was ingrained in our lives. Even if people on the plane had been able to take pictures or record video, what choices did they have to share them? Only a handful of people would have heard the story. The CEO and executives, in turn, would also have had a lot of power to control the message, and probably would have been able to get away with this without losing several million dollars in market valuation to people’s outrage around the world.

But in fact what happened was that within minutes of this event, people around the world were seeing video and pictures of this physical assault, and people were waiting for, or in fact, expecting a statement from United’s leadership. So after the first failure at managing the overbooking situation, came the second: the CEO’s response.

  1. He apologizes for “re-accommodating” a passenger, assuming that would be the end of the story
  2. He proceeds to blame the passenger, accusing him of being belligerent and indicating it is important to find out why “the passenger acted the way he did”. He also doubles down by congratulating his employees for a job well done. Tone deaf much?
  3. After losing almost $1B in valuation at some point during the day, he finally comes out with the statement he should have issued from the beginning: he is sorry, this shouldn’t have happened, and they will take measures to keep it from happening again.

Lesson to learn from this event? Corporations – and their leaders – cannot get away with a lot of the things they could have gotten away with in the past, and we have social media platforms to thank for that. Ideally, business leaders would care about their clients and their business, but even if, as a leader, you truly don’t care and your first reaction is to say “Not our fault. We did everything by the book”, know that such a statement is most likely only going to amplify existing outrage.

Yes, you may think it is possible to get away with it, seeing how some politicians get away with so much gaslighting these days, but they have something you don’t: followers willing to be gaslighted because they are blinded by their passion for a political party. More likely than not, your clients and others don’t have that level of passion for your business.

So next time something like this happens, stay away from the temptation to blame the victim and address the situation the right way, which includes some combination of:

  • We are sorry
  • The buck stops here
  • We are investigating
  • We are taking steps to ensure this doesn’t happen again

Don’t forget: social media has shifted the power from the few to the many.

What is the Business Value of Big Data? – The Three Things Series

*** The 3 Things Series aims to simplify – sometimes even oversimplify – technology concepts so that you learn 3 things about a topic ***. Opinions are my own.

Organizations typically embark on Big Data projects with 3 goals in mind: cost reduction, improved decision making, and the ability to create new products and services.


1- Cost Reduction 

As the quantity and complexity of data in organizations increase, so does the cost of storing and processing that data. Decisions are then made about how much data to keep available for analysis, and how much “historic” data to move to tape or other less expensive resources. The problem with this strategy is that by limiting the data that can be analyzed, the insight that can be derived from it is also limited.

In recent years, technology developments, especially in open source, have made cost reduction a reality through inexpensive technology such as Hadoop clusters (Hadoop is a framework that distributes both data storage and data processing across many inexpensive computers). Hadoop clusters give organizations the ability to keep more data available for analysis at a lower cost, and to easily add complex data types (images, sound, etc.) to the pool of data to be analyzed.
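The "distributed processing" idea behind Hadoop can be sketched in a few lines of plain Python. This is a toy word count running on a single machine, with illustrative data; the map and reduce phases shown here are the same two steps that Hadoop's MapReduce engine actually spreads across the computers in a cluster.

```python
from collections import defaultdict

def map_phase(records):
    # Map: each worker turns its slice of the data into (key, value)
    # pairs -- here, (word, 1) for every word it sees.
    for line in records:
        for word in line.lower().split():
            yield word, 1

def reduce_phase(pairs):
    # Reduce: all pairs with the same key are aggregated into one result.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Stand-in dataset; in a real cluster these lines would live in files
# distributed across many machines.
logs = ["big data big insight", "big value"]
counts = reduce_phase(map_phase(logs))
print(counts["big"])  # 3
```

Because each phase only needs its own slice of the data, adding more machines lets you analyze more data without moving it all to one expensive server.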

2- Improved Decision Making

Data analysis can be significantly improved by adding new data sources and new data types to traditional data. For example, a data-driven retailer may see significant benefits in its inventory planning if a new data source, such as weather data, is added to the model to better predict sales and inventory requirements. An enriched model may be able to predict shortages of winter clothing by incorporating temperature into the existing models. Additional benefits can be achieved if more complex data is analyzed. For example, this same retailer may better target its social media ads by evaluating not only its clients' purchasing history, but also the actions those clients take on social media to interact with the retailer's brand and those of its competitors.
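The retailer example boils down to a data-enrichment step: join an external dataset (weather) to your own records before modeling. A minimal sketch, with made-up field names and numbers purely for illustration:

```python
# Hypothetical daily sales records (the retailer's own data).
sales = [
    {"date": "2014-01-06", "coats_sold": 120},
    {"date": "2014-01-07", "coats_sold": 340},
]

# Hypothetical external weather data, keyed by the same dates.
weather = {
    "2014-01-06": {"temp_f": 28},
    "2014-01-07": {"temp_f": -10},  # an extreme-cold day
}

# Enrich: join on date so every record also carries temperature,
# which a sales-prediction model can then use as a feature.
enriched = [{**row, **weather[row["date"]]} for row in sales]

# A trivial stand-in for the "enriched model": flag days where extreme
# cold coincides with unusually high coat sales as shortage risks.
for row in enriched:
    row["shortage_risk"] = row["temp_f"] < 0 and row["coats_sold"] > 300

print(enriched[1]["shortage_risk"])  # True
```

The real modeling would of course be statistical rather than a hand-written rule, but the join-then-model pattern is the core of the enrichment idea.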

3- Development of New Products and Services

The most strategic and innovative business benefits will likely come from the ability to use new data, or new sources of data, to create new products and services. Think for a minute about the data our cars generate (we don't necessarily see it, but more and more cars are equipped with sensors that collect a lot of data about our driving). Using this data, insurance companies can offer policies that are dynamically priced based on an individual's driving history (good news for you only if you are a safe driver, of course). Integrating weather data can also bring tremendous savings to an insurance company. Some insurers have achieved significant savings per claim simply by letting clients know that a storm is coming and recommending they not leave their cars exposed to the elements (again, assuming that as a client you listen to your insurance company's recommendations).

In summary, when thinking of the business value of Big Data, think of three areas of value:

  • Cost reductions
  • Improved decision making
  • Ability to create new products and services


What is Big Data? (The 3 Things Series)

*** The 3 Things Series aims to simplify – sometimes even oversimplify – technology concepts so that you learn 3 things about a topic ***. Opinions are my own.

The technology industry is full of "buzzwords," with Big Data being one of the most used in recent years. Organizations have always dealt with data and have stored it in databases, but the chart below shows how Google searches have shifted over the years, comparing searches for "Databases" with searches for "Big Data".


[Image: Google search trends, "Databases" vs. "Big Data"]


Big Data in general refers to the ability to gather, store, manage, manipulate, and – most important – get insights out of vast amounts of data. The typical question is "how big does data need to be to be considered Big?" And the answer is: it depends. When it comes to size, one organization's Big Data may be another organization's small data.

There are 3 things to remember that define “Big Data”:

  • Volume. This refers to size. If you are capturing vast amounts of information, you probably have Big Data on your hands.
  • Velocity. Are you working with data at rest or data in motion? If you are analyzing sales figures for the past year, that data is at rest: it is not constantly changing. If, on the other hand, you are analyzing tweets to understand how clients are reacting to a product announcement, that is data in motion, as it changes continuously. The daily volume may not be big, but the fact that the data is in motion is part of what defines Big Data.
  • Variety. As the ability to capture, store, and analyze more data has increased, so has the interest in analyzing data that is more complex in nature. For example, an insurance company may want to analyze recordings of customer service calls to determine which characteristics of a conversation led to a policy sale, a retailer may want to analyze videos to determine how people navigate the store and how that affects sales, or a hospital may want to analyze x-rays to find patterns and correlations between common symptoms in patients.
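The at-rest versus in-motion distinction from the Velocity point above can be sketched in a few lines of Python. The datasets and keywords are made up for illustration; the point is the shape of the computation, not the data.

```python
# Data at rest: last year's sales already sit complete on disk,
# so we can compute over the whole dataset in one pass.
sales_2014 = [120, 340, 95]
total = sum(sales_2014)

def tweet_stream():
    # Stand-in for a live feed: data "in motion" arrives one record
    # at a time and the dataset never "finishes".
    for text in ["love the new phone", "phone battery is awful", "great phone"]:
        yield text

# Data in motion: maintain a running result that is updated as each
# record arrives, instead of waiting for a complete dataset.
negative = 0
for tweet in tweet_stream():
    if "awful" in tweet or "hate" in tweet:
        negative += 1

print(total, negative)  # 555 1
```

Batch tools (like classic Hadoop MapReduce) work on data at rest; stream-processing tools keep running aggregates like the tweet counter above.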

So when it comes to the definition of Big Data, remember 3 things, or the 3 Vs:

  • Volume (size)
  • Velocity (frequency of data updates during analysis)
  • Variety (complexity of the data to analyze: images, videos, text, log files, etc.)