This is the one AI skill everyone needs to have and understand

The AI field is being revolutionized by Generative AI, which will bring powerful new capabilities that can be turned into value for individuals, businesses, and society. It is a great time to work with AI. But it is also overwhelming, challenging, and confusing for most people. If you are outside the AI world, you may be wondering: What really is AI? Is AI going to save the world? Is AI going to destroy the world? And on top of that… what is Generative AI? There are so many AI fields, subfields, capabilities, technologies, algorithms, and frameworks that it is hard to know where to start. So start here and learn one thing: AI IS PROBABILISTIC!

What does probabilistic mean? It means that any answer you get from an AI system is not a precise answer but a prediction with a certain degree of confidence.

AI systems are not programmed like other, deterministic technologies. In the world of technology outside of AI, you program a computer to make deterministic, “precise” decisions: if the user clicks on this button, then show them an image; if the account balance reaches $5, send the user an email; and so on. AI systems, on the other hand, are “trained” using data either to predict new data based on the input data (traditional AI), such as an inventory level prediction or a product you might like, or to generate complex new data based on an input prompt (Generative AI), such as images, videos, code, and text. Training an AI model means you feed it existing data and then use mathematical models that align well with that data to predict future data. In other words, what sits underneath these AI predictions is a collection of mathematical, probabilistic models. In essence, you have something similar to a weather forecast: there is an x% chance that it will rain tomorrow. We all know how that goes, right? Maybe it will rain, maybe it won’t, because a weather forecast is based on a mathematical, probabilistic model. Mathematical probabilistic models are the foundation of any AI system. The problem is that unlike weather forecasts, AI systems are not telling us “there is an x% chance that what I am telling you is correct.” They just give us an answer, so the responsibility today is on us to understand this.
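
To make this concrete, here is a minimal sketch in Python. The scikit-learn library and the tiny made-up fruit dataset are illustrative choices of mine, not a reference to any particular AI product; the point is only that what a trained model computes is a probability, and the single “answer” we usually see is just the most likely option.

```python
# Minimal sketch: a classifier's "answer" is really a probability estimate.
# The tiny fruit dataset below is made up purely for illustration.
from sklearn.linear_model import LogisticRegression

# Features: [weight_in_grams, sweetness_score]; labels: 0 = lemon, 1 = orange
X = [[120, 2], [130, 3], [150, 7], [170, 8], [140, 4], [160, 9]]
y = [0, 0, 1, 1, 0, 1]

model = LogisticRegression().fit(X, y)

new_fruit = [[145, 5]]
print(model.predict(new_fruit))        # e.g. [1] -> the single "answer" we usually see
print(model.predict_proba(new_fruit))  # e.g. [[0.45, 0.55]] -> the confidence behind that answer
```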

So now that you know that AI systems are probabilistic, how do you use this information? You must understand that any output from an AI system will sometimes be wrong, no matter how great it looks or how confident it sounds (as with Generative Text models), and you need to determine when it is safe to use that output based on the impact of an incorrect one. Here are some examples of how to use this one skill to interact with AI:

  • Business Leaders. If you are a business leader driving an AI strategy, or aiming to leverage AI to impact your business area, know that with any AI implementation you not only need to look at how to leverage AI to increase revenue, reduce costs, or create new business models and customer experiences. You also need to think about the increased need for risk management. What will happen in the percentage of cases where AI provides an incorrect answer? Will the impact be minor? Will it be a disaster that requires guardrails? Or will it be a disaster of such magnitude that you cannot let the system be used unless there is a human in the loop? Will one segment of the population be impacted differently, and negatively, compared with other segments by automated AI-driven decisions? Always remember: AI outputs are not “precise” with 100% certainty. They are probabilistic, with precision that is not 100%.

  • Developers. Developers now have access to one of the greatest playgrounds in the world, with multiple technologies available to integrate AI capabilities into their applications. With just a call to an API, you can leverage pretrained AI models and add innovation to your applications without the deep mathematical expertise required to train AI models. But if you haven’t worked with AI before, you are used to working with deterministic technologies: you call an API, you get a response, and that response should be 100% precise. With AI APIs, you are getting a probabilistic output, which means that depending on the inputs you give the model, you will get different results. These results may be very close to the answer you need if the input is similar to the data used to train the model, or completely off if the new input is significantly different, mathematically speaking, from any data used to train the model. Understanding this is key: depending on the application where these capabilities are being integrated, you need to be mindful of causing no harm if the response the API produces is biased or completely wrong. And if you are comparing models from different vendors, know that testing models with one set of inputs may give you completely different results once you change the input sets. So run multiple tests, and don’t decide to adopt a model through an API after testing with a single set of inputs (see the sketch after this list).

  • Data Scientists. You have got this. Even when you are not training Machine Learning models but leveraging pretrained AI models, you are used to working in a probabilistic space. You need to be an ambassador of this understanding and help the business and technical people in your organization, and the friends in your circles, internalize how to make decisions and evaluate the outputs of AI technologies.

  • Everybody else. The latest developments in the AI space have put AI in the hands of everyone, and more and more you will be using systems, tools, and apps with increasing AI power. Just remember that any AI technology you are interacting with will sometimes give you results that are not correct. It doesn’t mean the system does not work; it is actually working as it should: in a probabilistic manner. It means you, the human, need to assess the impact of the technology you are using and determine how safe it is for you to use. For example, a tool that serves you an ice cream it “thinks” you should try is probably safe to use if you don’t have any allergies. But if you have a peanut allergy, it may not be safe to use unless there are guardrails that allow you to provide that restriction.
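
For developers comparing vendors, here is a hedged sketch of the multi-input-set testing described above. The function call_vendor_model is a hypothetical placeholder, not a real API, and the prompts and suite names are made up for illustration.

```python
# Sketch: compare a model's behavior across several *different* input sets, not just one.
def call_vendor_model(prompt: str) -> str:
    # Hypothetical stub: replace with the real client call for the vendor API you are evaluating.
    return f"(stubbed response for: {prompt[:30]}...)"

test_suites = {
    "typical_inputs": ["Summarize this invoice ...", "Summarize this receipt ..."],
    "edge_cases": ["Summarize this empty document", "Summarize this text written in emoji"],
    "out_of_domain": ["What is the airspeed velocity of an unladen swallow?"],
}

for suite_name, prompts in test_suites.items():
    print(f"--- {suite_name} ---")
    for prompt in prompts:
        output = call_vendor_model(prompt)
        # Record the output and have a human (or an automated check) judge it;
        # good results on one suite say little about results on the others.
        print(prompt[:40], "->", output)
```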

In summary, today the field of AI could not be more exciting or more full of potential for individuals, businesses, and societies. But because AI systems don’t tell us their outputs are probabilistic, we need to always remember it, and not only remember it but assess the impact of any decision made by an AI system for us, or made by us based on an AI system’s output. Keep in mind that these technologies are so sophisticated that if we forget they are just giving us a probabilistic prediction, it is very easy to think that they are in fact reasoning, and even acting like a “real” human.

AI and Generative AI are taking the world by storm

AI is everywhere, and the hype around it is like nothing we have ever seen before. Just take a look at what Google Trends shows us about worldwide web searches for the word AI over the past 19 years. There was a peak in 2011, probably generated by IBM Watson winning Jeopardy! and Apple’s launch of Siri, two extremely significant innovations at the time. But nothing compares to the levels of interest generated over the past few months for AI and Generative AI.

Worldwide Searches for the word AI – 2004-2023

I have worked in the AI field for over a decade, and I have never seen AI technologies evolve with the speed and innovation we are seeing now through Generative AI. The AI field is being revolutionized, and these new technologies will bring powerful new capabilities that can be turned into value for individuals, businesses, and society. And this is what Google Trends shows us about worldwide web searches for Generative AI during the past 19 years.

Worldwide Searches for the words Generative AI – 2004-2023

This AI revolution is great for those of us who are passionate about AI’s capabilities to change the world in a positive way, as long as it is deployed with safety in mind. It is also highly relevant for businesses looking to derive incremental and transformational business value. It is safe to say that the world of AI is now moving into a whole new era.

International Women’s Day in the workplace should be about supporting women!

Today is March 8th, 2021, and there will be a lot of celebratory messages to women. As a woman who has spent three decades working outside the home and several years focused on increasing my education, it still pains me that women typically make less money than men for a similar job, especially if they happen to be Black or Hispanic like me. It also pains me that the pay gap gets larger as their careers progress: even when women get promoted, the difference between what they make and what their male coworkers make gets larger. So here are some recommendations for a more actionable celebration of women in the workplace:

  1. Unconscious Bias. Be mindful of your unconscious biases. If you think you have no biases, that probably means you have a blind spot. We all have unconscious biases. Learn more about this topic and how to overcome them.
  2. Hiring. Diversity in hiring needs to be intentional. If all the resumes you see, after they have passed through the selection algorithm and the recruiter’s assessment, tend to represent only one segment of the population (e.g., white men), go back to your recruiters and ask for a more diverse pool of candidates. They are out there and you will find them; sometimes their resumes are just not getting through.
  3. Data. Ideally, organizations should be examining data on how their compensation and promotion practices differ by gender, race, age, etc. But even if your organization does not publish this data, if you are a manager you can still look at your own team’s data to determine whether there are imbalances (a minimal code sketch of this exercise appears after this list).
    • Run a compensation report on your team members and sort the data by salary within Salary Bands. Examine each of the groups (by salary band) carefully. Are most of the people on the bottom of the list females or minorities?
    • Now sort the data by time since last promotion. Are most of the people on the bottom of the list females or minorities?
    • Now go deeper. Look at each individual’s background: academic degrees, years of experience, performance reviews, and use that information to determine whether the imbalances are warranted. Here is a useful exercise: remove the names and replace them with a code, then look at the data without the names. Is the salary and promotion data in alignment with what each individual brings to the organization?
    • Keep this in mind for the next promotion cycle. You may have an opportunity to rectify previous imbalances.
  4. Be a sponsor, not just a mentor. Make sure you are providing opportunities that bring diverse pools of people onto strategic assignments, and give them visibility. Sometimes your best employees are not the most vocal ones. Help get their voices heard.
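
For managers who can export their team’s data, here is a minimal pandas sketch of the exercise in point 3. The file name team_compensation.csv and its columns (name, gender, salary_band, salary, months_since_promotion) are assumptions for illustration; adapt them to whatever your HR system actually provides.

```python
# Minimal sketch of the compensation review exercise; file name and columns are hypothetical.
import pandas as pd

team = pd.read_csv("team_compensation.csv")

# Sort by salary within each salary band and look at who sits at the bottom of each group.
by_band = team.sort_values(["salary_band", "salary"])
print(by_band[["name", "gender", "salary_band", "salary"]])

# Same exercise for time since last promotion.
by_promotion = team.sort_values("months_since_promotion", ascending=False)
print(by_promotion[["name", "gender", "months_since_promotion"]])

# "Remove the names": replace them with a code and review the data blind.
blind = team.drop(columns=["name"]).assign(code=[f"E{i:03d}" for i in range(len(team))])
print(blind.groupby("gender")["salary"].describe())
```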

Data Analysts, Data Scientists, ML and AI Specialists are the jobs with highest demand according to WEF’s 2020 Future of Jobs report

Summary

In October 2020, the World Economic Forum published the report “The Future of Jobs”. The report offers deep insights into technological adoption over the next five years, and it maps the jobs and skills of the future, including a deep dive into Data and AI skills. It shows that technological adoption continues to expand, while skills availability remains the #1 barrier to that adoption. Businesses and governments around the world are investing significantly in upskilling and reskilling programs, with a significant percentage of that investment going toward transitions into Data and AI jobs. Demand for Data Analysts, Data Scientists, and AI Specialists is high, but the skills gap that needs to be addressed to successfully transition into those roles is large.

Some of the key findings:

  1. Skills gaps continue to be high. This includes skills like critical thinking, analysis, and problem-solving, as well as self-management skills such as active learning, resilience, stress tolerance, and flexibility. On average, companies estimate that around 40% of workers will require reskilling of six months or less, and 94% of business leaders report that they expect employees to pick up new skills on the job, sharply up from 65% in 2018.
  2. Online learning is on the rise. There has been a four-fold increase in the number of individuals seeking out opportunities for learning online on their own initiative, a five-fold increase in employer provision of online learning opportunities to workers, and a nine-fold enrollment increase for learners accessing online learning through government programs.
  3. The window of opportunity to reskill and upskill workers has become shorter. The share of core skills that will change in the next five years is 40%, and 50% of all employees will need reskilling.
  4. The large majority of employers recognize the value of human capital investment. 66% of employers surveyed expect to get a return on investment in upskilling and reskilling within one year. Employers expect to offer reskilling and upskilling to over 70% of their employees, but employee engagement with those courses is lagging, with only 42% of employees taking up employer-supported reskilling and upskilling opportunities.

Over the past decade, a set of ground-breaking, emerging technologies have signaled the start of the Fourth Industrial Revolution. By 2025, the capabilities of machines and algorithms will be more broadly employed than in previous years, and the work hours performed by machines will match the time spent working by human beings. This augmentation of work will disrupt the employment prospects of workers across a broad range of industries and geographies, and we will see job growth in the ‘jobs of tomorrow’— such as roles at the forefront of the data and AI economy, as well as new roles in engineering, cloud computing and product development. 

Technological Adoption

The past two years have seen a clear acceleration in the adoption of new technologies. Cloud computing, big data and e-commerce remain high priorities, following a trend established in previous years. However, there has also been a significant rise in interest in encryption, and a significant increase in the number of firms expecting to adopt robots and artificial intelligence.  These new technologies are set to drive future growth across industries, as well as to increase the demand for new job roles and skill sets. Figure 1 shows technologies likely to be adopted by 2025 (by share of companies surveyed).

By 2025 the average estimated time spent by humans and machines at work will be at parity based on today’s tasks. Algorithms and machines will be primarily focused on the tasks of information and data processing and retrieval, administrative tasks and some aspects of traditional manual labor. The tasks where humans are expected to retain their comparative advantage include managing, advising, decision-making, reasoning, communicating and interacting. 

Emerging Jobs

Similar to the last survey in 2018, the leading positions in growing demand are roles such as Data Analysts and Scientists, AI and Machine Learning Specialists, Robotics Engineers, Software and Application developers as well as Digital Transformation Specialists. However, job roles such as Process Automation Specialists, Information Security Analysts and Internet of Things Specialists are newly emerging among a cohort of roles which are seeing growing demand from employers. The emergence of these roles reflects the acceleration of automation as well as the resurgence of cybersecurity risks. Figure 2 shows the top 20 job roles in increasing demand across industries, with Data Analysts, Data Scientists, and AI Specialists ranked with the highest demand overall. 

Figure 2 – Top 20 job roles in increasing demand across industries

These emerging jobs have been organized in clusters, and this report presents a unique analysis which examines key learnings gleaned from job transitions into those emerging clusters using LinkedIn and Coursera data gathered over the past five years.  The main clusters are: Data and AI, Cloud Computing, Engineering, Content Production, Marketing, People and Culture, and Product Development and Sales. Figure 3 shows Data and AI roles organized according to the scale of each opportunity within the cluster.

Figure 3 – Data and AI Job Cluster

Emerging Skills

The ability of global companies to harness the growth potential of new technological adoption is limited by skills shortages. Figure 4 shows that skills gaps in the local markets and inability to attract the right talent remain among the leading barriers to the adoption of new technologies. 

Figure 4 -Perceived barriers to the adoption of new technologies

Skill shortages are more acute in emerging professions. Business leaders consistently cite difficulties when hiring for Data Analysts and Scientists, AI and Machine Learning Specialists as well as Software and Application Developers. 

To address skills shortages, companies are investing in upskilling and reskilling programs. However, employee engagement with those courses is lagging, with only 42% of employees taking up employer-supported reskilling and upskilling opportunities. There are, however, significant challenges in the number of skills that need to be developed, especially for emerging roles in Data Science and Artificial Intelligence. Figure 5 illustrates the skills gap that needs to be closed for individuals to transition into these roles, with Artificial Intelligence, NLP, Data Science, and Signal Processing representing the largest sets of skills needed for a successful transition.

Figure 5 – Typical skills gaps across successful job transitions

Furthermore, the report uses data from Coursera learners to estimate the distance from the optimal level of mastery for learners aiming to transition into Data and AI, and quantifies the days of learning needed for the average worker to reach that level of mastery (Figure 6).

Figure 6 – Top 10 skills by required level of mastery and time to achieve that mastery

Mastery score is the score attained by those in the top 80% on an assessment for that skill. Mastery gap is measured as a percentage representing the score among those looking to transition to the occupation as a share of the score among those already in the occupation. 

In conclusion, technological adoption continues expanding, and skills availability remains the #1 barrier to that adoption. Businesses and governments around the world are investing significantly in upskilling and reskilling programs, with a significant percentage of that investment going towards transitions into Data and AI Jobs. Demand for Data Analysts, Data Scientists, and AI Specialists is high, but the skills gap that needs to be addressed to successfully transition into those roles is large.

Coronavirus in Latin America

Coronavirus in Latin America has spread with a lag of several weeks compared with the rest of the world, with most documented confirmed cases starting the first week of March. Many countries in Latin America have already established severe social distancing and lockdowns, so hopefully, with the benefit of early measures, the spread of the virus has the potential to be more controlled, but of course it is too early to know. You can explore some of the data by clicking on this link and navigating each of the tabs.

The data comes from the World Health Organization, which lags by a couple of days, and I update it frequently.

Coronavirus Stats



Data gives me peace of mind. Even when the data shows a really negative trend, it gives me a sense of control to know what is going on. So with the Coronavirus pandemic, I decided to spend some time looking at the data from the World Health Organization, and I built a few interactive charts using Tableau Public. The news is definitely not good, especially for those of us in the USA, as we are just getting started and the number of confirmed cases is growing exponentially. The death rate in the USA is lower than in other countries, for now, so that is the good news.

Here are some of the visualizations if you are interested…. Just click on Coronavirus Stats, and navigate the charts by clicking on the top tabs.

Learning Data Science does not start with doing Data Science

Those who are in love with practice without theoretical knowledge are like the sailor who goes onto a ship without rudder or compass and who never can be certain whither he is going.

The foundation of Data Science lies in a set of math and statistics knowledge that provides the ability to understand and model datasets. Because data modeling algorithms may look, to the untrained eye, like simply a collection of lines of code, those who jump into modeling data without the statistical or mathematical background may end up making serious mistakes that nullify the results of any analysis they do.

If you are serious about becoming a Data Scientist, avoid the temptation of jumping into modeling data first. Instead, ensure you have the right math and statistical foundation in place before learning how to build models. For details on the overall skills required to be a Data Scientist, read my previous blog What are the skills that define the role of Data Scientist? For resources on alternatives to become a Data Scientist, read Where do Data Scientists come from?

There are multiple ways to acquire the skills to become a Data Scientist, but regardless of the option you choose, make sure the right foundation is built into the program, or find ways to build that foundation yourself, whether through college classes, digital classes, or even good old-fashioned academic books. Specifically, what math knowledge should you have? Here is a summary of some of the most common areas:

Probability Foundation

What is Probability, Sample Spaces, Properties/Rules of Probability, Probability of Combinations of Events (Intersection of Events, Union of Events, Contingency Tables), Conditional Probabilities, Independent vs. Dependent Events, Bayes’ Theorem, Counting Principles (Permutations, Combinations), Sampling Techniques, Probability Distributions (Continuous and Discrete Random Variables, Cumulative Distributions, Binomial Distribution, Poisson Distribution, Geometric Distribution, Exponential Distribution, Normal Distribution, Chi-Square Distribution, Expected Values)
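
As a small taste of these topics, here is a short Python sketch applying Bayes’ Theorem and the Binomial Distribution. SciPy and the specific numbers are illustrative choices of mine.

```python
# Two of the listed topics in action: Bayes' Theorem and the Binomial Distribution.
from scipy import stats

# Bayes' Theorem: P(disease | positive test), with made-up illustrative numbers.
p_disease = 0.01              # prior probability of having the disease
p_pos_given_disease = 0.95    # test sensitivity
p_pos_given_healthy = 0.05    # false positive rate
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))       # ~0.161, far lower than most people guess

# Binomial Distribution: probability of exactly 7 heads in 10 fair coin flips.
print(stats.binom.pmf(k=7, n=10, p=0.5))   # ~0.117
```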

Statistics Foundation

  • Descriptive Statistics: Quantitative Data vs. Qualitative Data, Types of measurements: nominal, ordinal, intervals, ratios, Frequency Distributions, Relative Frequency Distributions, Cumulative Frequency Distributions, Measures of Central Tendency: Mean, Median and Mode, Measures of Dispersion: Range, Variance, Standard Deviation, Measures of Relative Position: Quartiles, Interquartile Range, Outliers, The empirical rule (normal distributions) and Chebyshev’s Theorem, Visualizing Data: Histograms, Stem and Leaf, Box Plots
  • Inferential Statistics: Sampling distribution of the mean, Sampling Distribution of the Proportion, Standard Error of the Mean, The Central Limit Theorem, Confidence Intervals and their interpretation, Effects of changing confidence levels and sample sizes, Working with small vs large samples, Formulating and testing hypothesis, The Null and Alternative Hypothesis, Type I and Type II Errors, 1-tail vs 2-tail hypothesis, Testing the Mean and the proportion of a population using 1 sample, Testing the difference in means and proportions using 2 samples, Analysis of Variance (ANOVA) comparing 3 or more population means, Understanding the role of Alpha and the p-value, Working with Dependent vs. Independent samples, Correlation and Simple Regression (confidence intervals, hypothesis test on the regression line, regression assumptions), Multiple Regression (assumptions, multicollinearity)
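
Here is a similar short sketch for two of the inferential statistics topics, a confidence interval and a two-sample t-test, again using SciPy and simulated data as illustrative choices.

```python
# Inferential statistics in action: a confidence interval and a two-sample t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=100, scale=15, size=40)   # simulated sample data
group_b = rng.normal(loc=108, scale=15, size=40)

# 95% confidence interval for the mean of group_a (t-distribution, unknown sigma).
mean_a = group_a.mean()
sem_a = stats.sem(group_a)
ci_low, ci_high = stats.t.interval(0.95, df=len(group_a) - 1, loc=mean_a, scale=sem_a)
print(f"Group A mean {mean_a:.1f}, 95% CI ({ci_low:.1f}, {ci_high:.1f})")

# Two-sample t-test: is the difference between the two group means statistically significant?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")   # compare p to your chosen alpha (e.g. 0.05)
```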

Linear Algebra Foundation: Vectors and Real Coordinate Spaces (Adding, Subtracting, and multiplying vectors by a scalar, Linear Combinations, Linear dependence and independence), Matrix Transformations (Functions and Linear Transformations, Inverses and determinants, Matrix Multiplications, Transpose), Orthogonal Complements and Projections, eigenvectors, eigenvalues.
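
And a tiny NumPy example for the linear algebra side, computing the eigenvalues and eigenvectors of a small matrix.

```python
# Eigenvectors and eigenvalues with NumPy, one of the listed linear algebra topics.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # e.g. [3. 1.] (order is not guaranteed)
print(eigenvectors)   # columns are the corresponding unit eigenvectors

# Check the defining property A @ v = lambda * v for the first pair.
v = eigenvectors[:, 0]
print(np.allclose(A @ v, eigenvalues[0] * v))   # True
```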

The bottom line is that when you move from working in a deterministic world to a probabilistic world, it is important to understand the implications of this paradigm change.

Where do Data Scientists Come From? How Do I become One?


I frequently get asked… How do I become a data scientist? How do Data Scientists get their skills? There are several options, but first, let’s take a look at the profile of data scientists and see how your skills compare to theirs. According to a 2018 study by Burtch Works, it is estimated that:

  • 91% of data scientists have an advanced degree (43% hold a Master’s degree, and 48% hold a PhD).
  • 25% of data scientists hold a degree in statistics or mathematics, 20% have a computer science degree, 20% hold a degree in the natural sciences, 18% hold an engineering degree, 8% hold a business degree, 5% hold a social science degree, and 4% an economics degree
  • 44% of data scientists are employed by the technology industry, followed by Financial Services with 14%, marketing services with 9%, consulting 8%, Healthcare/Pharma 6% and retail with 5%
  • 62% of data scientists are US citizens, 19% have permanent residency and the rest are on temporary visas
  • 85% of data scientists are male

These statistics should give you a general idea about the skillset shared by the community of data scientists.

How do I become a Data Scientist? Where do I get the skills?

There are several options depending on how much time you want to spend acquiring the skills, how much money you want to spend, and how deep you want your skills to be. Here is a summary of the most popular options:

1- PhD: If you are interested in a PhD, or are already pursuing one, more likely than not you will acquire skills that will help you become a Data Scientist. As we saw in the statistics above, today almost half of data scientists have a PhD. On average, a PhD will take several years to complete.

2- Master of Science in Analytics/Data Science: The first master’s program focused on Analytics/Data Science was started by the Institute for Advanced Analytics (IAA) at NC State in 2007 (I graduated from this program in 2012). For many years it was the only program in the nation, and due to high demand, many new programs have been launched in the past few years. There are now over 200 programs in the United States, some full time, some part time, and some online. If you are curious about which programs could match your needs, you can use the interactive map maintained by the IAA to find options in the United States. It will usually take you one to two years to complete a master’s degree.

3- Data Science Bootcamps: There are several alternatives. Switchup has a good summary of the best Data Science bootcamps. Figure 1 summarizes a comparison of these programs using Switchup’s ratings and details from each of the programs (disclaimer: this information changes frequently, so it may be out of date by the time you read it). Refer to their websites for up-to-date details. You will be able to complete a bootcamp in a matter of months.

Figure 1 Summary of Information on the Best Data Science Bootcamps (Switchup)

4- Data Science Online Certifications: MOOCs provide a wealth of training options: Coursera, Edx, Cognitiveclass.AI, and Udacity are all great choices. Udacity has an advantage over the others in that they offer mentorship together with their education programs (see what one student of both Udacity and other programs has to say about his experience). Prices and lengths of programs vary greatly among these choices, and they offer a lot of flexibility. However, you will still need practical experience. These classes can give you a foundation that will need to be supplemented with practical experience to become a junior Data Scientist.

For more information on the type of skills needed to be a Data Scientist, refer to What are the skills that define the role of Data Scientists?

What are the skills that define the role of Data Scientists?

Data Science is an emerging field, but it is definitely not a new field. Yet, many people still struggle to define Data Science as a field, and more importantly, struggle to define the set of skills that collectively define a “Data Scientist”.

What is data science?

Data Science is a cross-disciplinary set of skills found at the intersection of statistics, computer programming, and domain expertise.  Perhaps one of the simplest definitions is illustrated by Drew Conway’s Data Science Venn Diagram (Figure 1), first published on his blog in September 2010.  Discussions about this field, however, go as far back as 50 years.  If you are interested in learning more about the history of the Data Science field, you can read it in the  50 Years of Data Science paper written by David Donoho.

Figure 1 – Drew Conway’s Data Science Venn Diagram

The bottom line is that Data science comprises three distinct and overlapping areas: a set of math and statistics knowledge which provides the ability to understand and model datasets, a set of computer programming/hacking skills to leverage algorithms that can analyze and visualize data, and the domain expertise needed to ask the right questions, and put the answers in the right context.

It is important to call attention to the “Danger Zone” above, as there is nothing more dangerous than aspiring Data Scientists who do not have the appropriate math and statistical foundation to model data.

What skills define the role of Data Scientists?

A Data Scientist is not just a computer programmer, or just a statistician, or just a business analyst. To be a data scientist, individuals need to acquire knowledge from all of these disciplines and, at a minimum, develop skills in the following areas:

1. Probability, Statistics, and Math Foundation. This includes probability theory, sampling, probability distributions, descriptive statistics (measures of central tendency and dispersion, etc.), inferential statistics (correlations, regressions, the central limit theorem, confidence intervals, development and testing of hypotheses, etc.), and linear algebra (working with vectors and matrices, eigenvectors, eigenvalues, etc.).

2. Computer Programming. Throughout the years, SAS has probably been the most commonly used programming language for Data Science, but adoption of the open source languages Python and R has increased significantly. If you are starting to acquire data science skills today, my recommendation would be to focus on Python. Looking at worldwide Google searches for “R Data Science” and comparing them to “Python Data Science”, the trends are clear (Figure 2). Interest in Python has surpassed interest in R and continues on a positive trend. This makes sense given that Python allows you to create models and also to deploy them as part of an enterprise application, so within the same platform data scientists and app developers can work together to build and deploy end-to-end models. R, while easier in some cases for modeling purposes, was not designed as a multi-purpose programming language.

Figure 2 Worldwide searches for “R Data Science” vs. “Python Data Science”. Google Trends (June 2018)

3. Data Science Foundation. This involves learning what data science is and its value in specific use cases. It also involves learning how to formulate problems as research questions with associated hypotheses, and applying the scientific method to business problems. Data Science is an iterative process, so it is critical to have a solid understanding of the methodologies used to execute that iterative process (define the problem, gather information, form a hypothesis, find/collect data, clean/transform data, analyze data and interpret results, form a new hypothesis).

Figure 3 Data Science Iterative Cycle

4. Data Preparation/Data Wrangling. Data is, by definition, dirty. Before data can be analyzed and modeled, it needs to be collected, integrated, cleaned, manipulated, and transformed. Although this is the domain of “Data Engineers”, Data Scientists should also have a solid understanding of how to construct usable, clean datasets.
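
As a flavor of what this looks like in practice, here is a small pandas sketch. The raw table, its columns, and the specific cleaning steps are hypothetical and chosen purely for illustration.

```python
# Hypothetical example of routine data cleaning: duplicates, mixed types, and missing values.
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "signup_date": ["2021-01-05", "2021-02-10", "2021-02-10", None, "2021-03-22"],
    "monthly_spend": ["120.5", "80", "80", None, "95.25"],
    "segment": ["Gold", "gold", "gold", "Silver", None],
})

clean = (
    raw.drop_duplicates(subset="customer_id")       # remove the duplicated customer row
       .assign(
           signup_date=lambda d: pd.to_datetime(d["signup_date"], errors="coerce"),
           monthly_spend=lambda d: pd.to_numeric(d["monthly_spend"], errors="coerce"),
           segment=lambda d: d["segment"].str.title().fillna("Unknown"),
       )
)
# Impute the remaining missing spend with the median (one of several defensible choices).
clean["monthly_spend"] = clean["monthly_spend"].fillna(clean["monthly_spend"].median())
print(clean)
```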

5. Model Building. This is the core of data science execution, where different algorithms are used to train models with data (structured and unstructured) and the best algorithm is selected. At this stage, data scientists need to make basic decisions about the data, such as how to deal with missing values, outliers, unbalanced data, multicollinearity, etc. They need to have solid knowledge of feature selection techniques (which data to include in the analysis) and be proficient in techniques for dimensionality reduction such as principal component analysis. Data scientists should be able to test different supervised and unsupervised algorithms such as regressions, logistic regressions, decision trees, boosting, random forests, Support Vector Machines, association rules, classification, clustering, neural networks, time series, survival analysis, etc. Once different algorithms are tested, the “best” algorithm is selected using model accuracy metrics. Data scientists should also be skilled in data visualization techniques and should have solid communication skills to properly share the results of the analysis, and the recommendations, with nontechnical audiences.
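
To illustrate the “try several algorithms and pick the best” part of this step, here is a hedged scikit-learn sketch. The built-in breast cancer dataset, the three candidate models, and ROC AUC as the accuracy metric are illustrative assumptions, not a prescription.

```python
# Compare a few candidate algorithms with cross-validation and a common accuracy metric.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name:>20}: mean ROC AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")
```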

6. Model Deployment. A very important part of building models is understanding how to deploy those models for consumption by a data application. While this is typically the domain of machine learning engineers and application developers, data scientists should be familiar with the different methods for deploying models.
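
One common deployment pattern, sketched with its assumptions flagged: a model trained and serialized elsewhere (the file name model.pkl is hypothetical) wrapped in a small Flask REST endpoint. This is only one of several valid approaches; batch scoring and managed cloud endpoints are others.

```python
# Minimal sketch: serve a previously trained, pickled model behind a REST endpoint.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:   # hypothetical model trained and saved elsewhere
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body such as {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)
```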

7. Big Data Foundation. Many organizations have deployed big data infrastructure such as Hadoop and Spark. It is important for data scientists to know how to work in these environments.
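
Here is a minimal PySpark sketch of what that can look like; the Parquet path and column names are hypothetical. The point is simply that the same analytical thinking is expressed through Spark’s DataFrame API once the data no longer fits on a laptop.

```python
# Aggregate a (hypothetical) large orders dataset with Spark's DataFrame API.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("big-data-foundation").getOrCreate()

orders = spark.read.parquet("s3://example-bucket/orders/")   # hypothetical path and schema

daily_revenue = (
    orders.groupBy(F.to_date("order_timestamp").alias("order_date"))
          .agg(F.sum("amount").alias("revenue"),
               F.countDistinct("customer_id").alias("customers"))
          .orderBy("order_date")
)
daily_revenue.show(10)
```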

8. Soft Skills.  Successful data scientists should also have the following soft skills:

a. Ability to work in teams. Because of the interdisciplinary nature of this field, it is by definition a team sport. While every data scientist on a team needs a good foundation in all the skills defined above, the depth of those skills will vary among them. This is not a field for individualistic stars, but a field for natural team players.

b. Communication Skills. Data scientists need to be able to explain the results of their analysis, and the implications of those results, in nontechnical terms. The best analysis can go to waste if it is not properly communicated.

Last but not least, it is important to remember that the most important characteristic of great data scientists is CURIOSITY. Data Scientists should be relentless in their search for the best data and the best algorithm, and should also be lifelong learners as this field is advancing very rapidly.

In summary, if you are interested in the Data Science field, or if you are exploring ways to develop your skills, make sure that you are addressing all these areas, and especially make sure not to end up in the danger zone having programming skills and domain knowledge but lacking the math and statistics foundation needed to model data correctly.

If you are a digital marketer and think Russian Troll farms only impact the world of politics, think again….

By now, you have probably heard about how trolls and bots were used to influence the 2016 elections in the United States. And if you are a marketer with expertise in Social Media, you can easily understand how social media channels, because of their network effects, can easily support the rapid dissemination of any message, positive or negative.

If you are not familiar with how they work in the context of politics, let’s start with simple definitions:

  • Bots: automated accounts that repost content, usually focused on a specific hashtag or a specific digital destination. A specific message can be disseminated in seconds by thousands of them working together without human intervention.
  • Trolls: accounts created by individuals, usually with a fake identity, focused on writing content, typically on controversial topics, that is then posted organically or promoted via paid ads and supported by an army of bots for reposts. These individuals are probably sitting somewhere in Russia, but the accounts are created as, for example, a 30-year-old housewife in Michigan or an 18-year-old gun lover.

Bots and trolls can be found anywhere in the world, but the most sophisticated operation is found in Russia, where they have been used internally to promote Putin’s agenda while making it seem like individual people are talking about their priorities on social media. This video provides additional information about the Russian troll farms.

Let’s say however, that you are not interested in politics, as a marketer how does this impact you and your priorities?

US social media ad spend is expected to reach $14 billion in 2018, up from just $6.1 billion in 2013. If you are a CMO or a digital marketer, you know a significant part of your budget is spent on social media. But what happens when the platform includes a significant number of trolls sitting somewhere in Russia but posing as individuals in the US? The result is that audience metrics are significantly impacted, and your money may be spent reaching fake accounts.

  • 10% of Facebook’s 2.07 billion monthly users are now estimated to be duplicate accounts, up from the 6% estimated previously. The social network’s share of fake accounts, meaning accounts not associated with a real person, increased from 1% to 2-3%. These figures mean that there are now roughly 207 million duplicate accounts and as many as 60 million fake accounts on the network. Facebook says it is working on ways to take this into account when campaigns are being created, but is it enough?
  • Twitter is estimated to have about 50 million fake accounts.

As advertisers, we should demand more focus on fixing this issue. After all, you need to make sure your money is not wasted on advertising to fake accounts. Technically it is probably a difficult challenge, but in the same way that email systems had to find ways to reduce the impact of spam and build better spam filters, it is time for social media organizations to focus on technologies that help reduce this problem. A couple of approaches come to mind: increasing the use of machine learning models to identify bot and troll accounts, and using technologies like blockchain for digital identity so that people on social networks are actually who they say they are.
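
To make the first idea a bit more tangible, here is a hedged sketch of a supervised bot-scoring model. The labeled file labeled_accounts.csv, its feature columns, and the choice of gradient boosting are assumptions for illustration; real platforms rely on far richer behavioral signals.

```python
# Hypothetical sketch: score accounts as likely-bot vs. likely-human from simple features.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

accounts = pd.read_csv("labeled_accounts.csv")   # hypothetical labeled dataset
features = ["posts_per_day", "followers", "following", "account_age_days",
            "pct_reposts", "avg_seconds_between_posts"]
X = accounts[features]
y = accounts["is_bot"]                           # 1 = known bot/troll, 0 = genuine account

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
model = GradientBoostingClassifier().fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))
# Accounts scored above a chosen probability threshold get flagged for human review,
# not auto-banned: the model's output is, once again, probabilistic.
```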