Artificial Intelligence

Exploring ChatGPT's Advanced Data Analysis Capabilities

Synthetic Data and Surveys – Photo by Author – David E. Sweenor

Creating synthetic data to analyze survey data

Recently, a colleague asked if I'd be interested in analyzing some newly collected survey data. I must admit there was a little trepidation, since I haven't analyzed raw survey data in quite some time. For the past several years, survey tools like Qualtrics, SurveyMonkey, Pollfish, and Alchemer have embedded analytics and visualizations directly into their web apps, making it easy for anyone to analyze survey data at lightning speed. However, those options were not available for this request.

The other wrinkle in this little story is that my Python skills are pretty rusty, and I didn’t have an analytics software platform available. So, what were my options? As I pondered this, I downloaded KNIME and thought it would be an excellent opportunity to explore ChatGPT’s Advanced Data Analysis capability. I will share my KNIME experience in a separate post and focus this discussion on ChatGPT.

What is Synthetic Data?

Quite simply, it's artificially fabricated data, usually generated with machine learning (ML) technologies. Synthetic data can be quite sophisticated, capturing complex mathematical relationships between different variables. The Synthetic Data Vault describes it this way: "Although the synthetic data is entirely machine generated, it maintains the original format and mathematical properties. This makes synthetic data versatile. It can completely replace the existing data in a workflow, or it can supplement the data to enhance its utility."[1]

Synthetic data can undoubtedly be structured data (i.e., numbers) but can also be text, images, or other formats. In fact, images were one of the first types of synthetic data to take off. Many builders of computer vision applications utilize synthetic data to create different colors, shadows, angles, and other properties to train the algorithm to recognize objects under various conditions and perspectives. It’s also relied on quite heavily to train autonomous vehicles.

Why is Synthetic Data Needed?

The need for synthetic data continues to grow alongside data privacy regulations and restrictions. The analyst firm Gartner predicts that by 2024, 60% of the data used for artificial intelligence (AI) and analytics projects will be synthetic.[2] By creating synthetic data sets for projects, organizations can collaborate easily without fear of exposing sensitive or proprietary information. Another great use of synthetic data is training AI and machine learning (ML) models, one of the primary use cases that start-up Mostly.ai touts.[3] For example, ML models that detect manufacturing equipment anomalies or fraudulent transactions are, by their nature, trained on rare events. With synthetic data, data science teams can augment the original data set with synthetic examples of those rare events, making the data more balanced and easier for the model to generalize from.
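To make the rare-event augmentation idea concrete, here is a minimal sketch (my own toy illustration, not Mostly.ai's method): it fabricates new anomaly rows by jittering copies of the real ones with Gaussian noise. Production generators are far more principled, but the balancing effect is the same.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy sensor readings: 95 "normal" rows, only 5 rare "anomaly" rows.
normal = rng.normal(loc=0.0, scale=1.0, size=(95, 3))
anomaly = rng.normal(loc=4.0, scale=0.5, size=(5, 3))

def augment(rare_rows, n_new, noise_scale=0.1, rng=rng):
    """Naively synthesize new rare-event rows by jittering real ones."""
    picks = rng.integers(0, len(rare_rows), size=n_new)
    noise = rng.normal(scale=noise_scale, size=(n_new, rare_rows.shape[1]))
    return rare_rows[picks] + noise

# Grow the rare class from 5 rows to 95 so both classes are balanced.
synthetic_anomalies = augment(anomaly, n_new=90)
balanced_rare = np.vstack([anomaly, synthetic_anomalies])
print(len(normal), len(balanced_rare))  # 95 95
```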

Synthetic Data Creation

Not wanting to load proprietary data into ChatGPT, I created synthetic data that mimicked the structure of the survey data set I received using ChatGPT. Here’s the prompt used to create the synthetic data:

Prompt:

Please create a synthetic data set with the following attributes:

  1. Number of Respondents (Rows): How many survey responses would you like in the dataset? 50 respondents
  2. Questions/Attributes (Columns):
    1. List of questions or attributes you’d like to include
      1. Attributes (integer): Response ID – Unique for each response
      2. Job Level (categorical): Individual contributor, Manager (distribution=20%), Director (distribution=40%), VP/SVP/EVP (distribution=10%), CxO/Founder (distribution=10%)
      3. Job Function (categorical): IT, CoE, Finance, R&D, Operations, HR
      4. Headquarter Location (categorical): United States (distribution=40%), insert EU countries (distribution=20%), India, Saudi Arabia, South Africa, Japan
      5. Number of Employees (categorical): 1 – 499, 499 – 1000, 1000 – 10,000, 10,001 – 25,000, 25,000+
      6. Phase of Moon (categorical): insert phases of the moon
      7. Question 1 – How happy are you? (ordinal): insert a 5-point Likert scale for happiness. Create a 40% correlation with phase of moon for very happy = full moon; a 30% negative correlation for happiness for phase of moon = new moon.
      8. Question 2 – What factors lead to your happiness? (categorical): [insert 5 happiness factors, two of them being ice-cream and puppies]. Each respondent can select 3.
      9. Question 3 – What factors lead to your unhappiness? (categorical): [insert 5 unhappiness factors, one being snow-rain and one being snake-bite]. Each respondent can select 3.
      10. Question 4 – what is the air speed of an unladen swallow? (categorical) Choices are European or African.
      11. Question 5 – write-in response, approximately one sentence in length, about what makes people happy.
      12. For the factors that lead to happiness or unhappiness, make sure the table structure has 5 columns for each. If the respondent selected it, put it in the column and if not, leave it blank. You can label the columns using the format: [Response selected]: question
      13. For the unhappiness question, make sure only 60% of the respondents answered that question.
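For readers who want a feel for what ChatGPT generates behind the scenes, here is a simplified, hypothetical Python sketch of the same idea: fabricate categorical columns from target distributions and bias the Likert answers by moon phase. The column names follow the prompt above, but the helper function and the exact probability weights are my own assumptions, not ChatGPT's actual code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 50  # 50 respondents, per the prompt

job_levels = ["Individual contributor", "Manager", "Director",
              "VP/SVP/EVP", "CxO/Founder"]
# Prompt specifies 20/40/10/10; the remaining 20% goes to ICs.
job_probs = [0.20, 0.20, 0.40, 0.10, 0.10]

moon_phases = ["New Moon", "First Quarter", "Full Moon", "Last Quarter"]

df = pd.DataFrame({
    "Response ID": np.arange(1, N + 1),
    "Job Level": rng.choice(job_levels, size=N, p=job_probs),
    "Phase of Moon": rng.choice(moon_phases, size=N),
})

likert = ["Very Unhappy", "Unhappy", "Neutral", "Happy", "Very Happy"]

def happiness(phase):
    # Bias answers toward "Very Happy" at full moon and toward
    # "Very Unhappy" at new moon, roughly mimicking the prompt's
    # requested correlations (weights are illustrative guesses).
    if phase == "Full Moon":
        p = [0.05, 0.05, 0.10, 0.30, 0.50]
    elif phase == "New Moon":
        p = [0.40, 0.25, 0.15, 0.15, 0.05]
    else:
        p = [0.10, 0.15, 0.30, 0.30, 0.15]
    return rng.choice(likert, p=p)

df["Happiness"] = df["Phase of Moon"].map(happiness)
print(df.head())
```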

The resulting data set looked like the following, which I split into two for easier readability:

Table 1.1: Synthetic Data Set – Part 1 (first nine columns)

| Response ID | Job Level | Job Function | HQ Location | Number of Employees | Phase of Moon | Happiness | Swallow Speed | Write-in Response |
|---|---|---|---|---|---|---|---|---|
| 1 | Director | Finance | South Africa | 25,000 | Waxing Gibbous | Happy | European | Achieving a personal goal gives me immense satisfaction. |
| 2 | Individual contributor | IT | United States | 1000 – 10,000 | Full Moon | Happy | European | Sunsets on the beach make me feel alive and happy. |
| 3 | CxO/Founder | Finance | Saudi Arabia | 25,000 | First Quarter | Happy | European | Listening to my favorite songs lifts up my mood. |
| 4 | CxO/Founder | R&D | Japan | 499 – 1000 | New Moon | Very Unhappy | European | A good book and a cup of tea is my happiness. |
| 5 | CxO/Founder | R&D | Japan | 25,000 | Last Quarter | Very Unhappy | European | Being with family always brings joy. |

Table 1.2: Synthetic Data Set – Part 2 (last ten columns)

| Ice-cream: Happiness | Puppies: Happiness | Sunshine: Happiness | Vacation: Happiness | Music: Happiness | Snow-Rain: Unhappiness | Snake-Bite: Unhappiness | Traffic: Unhappiness | Mondays: Unhappiness | Power Outage: Unhappiness |
|---|---|---|---|---|---|---|---|---|---|
| Ice-cream | Puppies | Sunshine |  |  | Snow-Rain | Snake-Bite | Traffic |  |  |
|  | Puppies |  | Vacation | Music | Snow-Rain | Snake-Bite |  | Mondays |  |
|  | Puppies |  | Vacation | Music | Snow-Rain | Snake-Bite |  |  | Power Outage |
| Ice-cream |  |  | Vacation | Music | Snow-Rain |  | Traffic |  | Power Outage |
| Ice-cream | Puppies | Sunshine |  |  | Snow-Rain |  | Traffic | Mondays |  |
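Before analyzing this wide layout, it helps to convert the selected/blank factor columns into booleans, which are easy to sum, cross-tab, or feed to a model. A small sketch, using a hypothetical two-column slice of the structure above:

```python
import pandas as pd

# Hypothetical slice of the wide multi-select layout: one column per
# factor, holding the factor name if selected and "" (blank) otherwise.
df = pd.DataFrame({
    "Ice-cream: Happiness": ["Ice-cream", "", "", "Ice-cream"],
    "Puppies: Happiness": ["Puppies", "Puppies", "Puppies", ""],
})

# Non-blank cell -> True; blank (or missing) cell -> False.
bools = df.notna() & (df != "")
print(bools.sum())  # selection counts per factor
```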

Learnings from Synthetic Data Creation

Strengths:

  • ChatGPT's Advanced Data Analysis feature allows users to quickly create basic synthetic data sets for training, code debugging, and more.
  • Creating the synthetic data is an interactive process–you can ask the engine to modify the most recent output until the data structure is what you envision.
  • ChatGPT doesn't always get it right. When I first created the synthetic data set, I asked it to "Insert European Countries," and the resulting column contained the countries specified along with some literal values that said "European countries." By contrast, the similar instruction to insert a 5-point Likert scale worked fine.
  • Users of ChatGPT’s Advanced Data Analysis can easily save the Python code generated into a Python notebook for future reuse.

Cautions:

  • Creating synthetic data sets with complex relationships and structure among the variables is probably not a job for ChatGPT; users should take a programmatic approach to understand the distributions of the original data set so they can mimic those nuances in the synthetic data.
  • Unless you have an enterprise license, data input into ChatGPT is not protected, so you should not put any sensitive, protected, or proprietary data into ChatGPT.
  • Since synthetic data is made up and ChatGPT’s response is probabilistic, the next time you use the same prompt to create synthetic data, you may end up with a different result.
  • On more than one occasion, the Python code generated didn't work correctly, so ChatGPT tried again. On a few occasions, it took up to seven attempts to create the necessary data.

Next up, let’s analyze the survey data.

Survey Data Analysis

Since ChatGPT uses Python and associated libraries for its data analysis, you can do just about anything you can imagine. This includes:

  • ANOVA or t-tests:
    • Compare means across different groups.
  • Cluster Analysis:
    • Group similar respondents based on their responses.
  • Correlation Analysis:
    • Analyze the relationship between two or more variables.
    • Scatter plots for visualizing relationships.
  • Cross-tabs of the data and Chi-square Tests:
    • Analyze the relationship between two categorical variables.
  • Descriptive Statistics:
    • Count, mean, median, and mode for numeric data.
    • Frequencies for categorical data.
    • Distributions of the data.
  • Factor Analysis:
    • Analyze the most common factors leading to happiness or unhappiness.
    • Check if there are any interesting patterns or clusters in these factors.
  • Missing Data Analysis:
    • Assess how much data is missing for each attribute.
    • Analyze and impute missing data if there is any.
  • Predictive Modeling:
    • Build simple models to predict specific outcomes based on the data (e.g., predicting happiness level based on other variables).
  • Text Analysis:
    • Word frequency analysis: identify the most commonly used words in the responses.
    • Bigram and trigram analysis: determine the most common two-word and three-word phrases to capture context.
    • Word cloud visualization: create a visual representation of the most frequent words.
    • Sentiment analysis: while our dataset contains synthetic responses, we can still gauge the overall sentiment (positive, negative, neutral) of the written responses.
    • Topic Modeling (if the dataset is large enough): Use techniques like Latent Dirichlet Allocation (LDA) to identify topics or themes in the responses.
  • Trend Analysis:
    • Analyze if there are specific trends in the data over time or across other factors.
  • Visualizations:
    • Histograms or bar plots for the distribution of categorical data, pie charts for proportional representations.
    • Correlation heatmaps.
    • Box plots for distribution and outlier detection.
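As an example of the cross-tab and chi-square item above, here is a short sketch using pandas and SciPy on an invented slice of survey data (the job levels and answers below are made up for illustration):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical survey slice: job level vs. the swallow-speed answer.
df = pd.DataFrame({
    "Job Level": ["Director", "Director", "Manager",
                  "Manager", "Director", "Manager"],
    "Swallow": ["European", "African", "European",
                "European", "African", "African"],
})

# Cross-tab the two categorical variables, then test independence.
table = pd.crosstab(df["Job Level"], df["Swallow"])
chi2, p, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2={chi2:.3f}, p={p:.3f}, dof={dof}")
```

With real survey sizes you would check the expected cell counts before trusting the p-value; chi-square results are unreliable when expected counts are very small.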

Rather than sharing the direct prompt-response pairs, I have included a few examples of the analysis output based on my synthetic data and will summarize my learnings.
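To illustrate the text-analysis items listed above, here is a dependency-free sketch of word and bigram counting over write-in-style responses (the responses are invented stand-ins for Question 5):

```python
from collections import Counter
import re

# Invented write-in responses standing in for the survey's Question 5.
responses = [
    "Sunsets on the beach make me feel alive and happy.",
    "Being with family always brings joy.",
    "A good book and a cup of tea is my happiness.",
]

def tokens(text):
    # Lowercase and keep only word characters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())

# Word frequencies across all responses.
words = [w for r in responses for w in tokens(r)]

# Bigrams computed per response so pairs never span two answers.
bigrams = [b for r in responses for b in zip(tokens(r), tokens(r)[1:])]

print(Counter(words).most_common(3))
print(Counter(bigrams).most_common(3))
```

In practice you would also drop stop words ("the", "and", etc.) before counting, otherwise they dominate the frequency list.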

Figure 1.1: Distribution of Job Levels

Figure 1.2: Explanations of Correlations

Figure 1.3: Scatter Plot

Figure 1.4: Top Factors Analysis

Figure 1.5: Cross-Tab Analysis

Calculation Explanations:

Prompt:

How did you calculate the director-snake-bite cell percentage?

Response:

Let’s break down the calculation for the “Director” job level and the “Snake-bite” unhappiness factor:

  • First, I determined the number of “Director” respondents who selected “Snake-bite” as an unhappiness factor.
  • Then, I divided that count by the total number of respondents with the “Director” job level.
  • Finally, I multiplied the result by 100 to express it as a percentage.

Let’s verify the calculation for the “Director” job level and the “Snake-bite” unhappiness factor.

The percentage of "Director" respondents who selected "Snake-bite" as an unhappiness factor is approximately 34.78%, which matches the value presented in the crosstab table earlier.

To clarify:

  • There were 8 "Director" respondents who selected "Snake-bite" as an unhappiness factor.
  • The total number of "Director" respondents in the dataset is 23.
  • Therefore, the percentage is calculated as 8/23 × 100 ≈ 34.78%.

Learnings from Survey Analysis: 

Strengths:

  • ChatGPT’s ability to analyze data using plain-language prompts is a potential game-changer for organizations. However, you still need to know what questions to ask of the data.
  • Given that ChatGPT uses Python for analysis, ChatGPT can conduct just about any analysis you can think of.
  • ChatGPT can quickly create relevant analyses and visualizations based on the questions you ask of the data.
  • With ChatGPT, you can ask it how to interpret an analysis result or visualization, and it will describe the result in detail. Over the years, this has been the dream of many analytics and BI vendors.
  • ChatGPT generates Python code you can reuse with your favorite IDE–like Jupyter Notebooks.
  • In addition to numeric and categorical analysis, ChatGPT can also be used to analyze open-ended write-in questions.

Cautions:

  • ChatGPT’s session times out relatively quickly–so if you leave it idle for too long, it has trouble restarting the session, and you’ll need to start over.
  • The analysis is not repeatable–in other words, if you need to start over, you must either 1) save and re-run the generated Python code yourself or 2) type the prompts all over again.
  • On more than one occasion, ChatGPT displayed a warning message and attempted to run the analysis again, but to no avail; the session had not actually restarted.
  • Unless you have an enterprise plan, the data submitted to ChatGPT is not protected and may be monitored. Avoid submitting any sensitive, confidential, or proprietary information to ChatGPT. So, for many who do not have an enterprise plan, ChatGPT may not be a viable option for you.
  • Asking analysis questions of ChatGPT in plain language feels quite natural, but beware: when you ask for specific calculations, examine how they are performed so you know what you're getting. In my cross-tab example in Figure 1.5, the percentage it calculated was not what I expected.
  • ChatGPT can generate all the standard graphics expected for survey analysis. I did not explore more complex graphics.

If you enjoyed this article, please like the article, highlight interesting sections, and share comments. Consider signing up for the newsletter at the bottom of this page and following me on Medium and LinkedIn.


If you’re interested in this topic, consider TinyTechGuides’ latest report, The CIO’s Guide to Adopting Generative AI: Five Keys to Success or Artificial Intelligence: An Executive Guide to Make AI Work for Your Business.


[1] Patki, Neha, and Carles Sala. 2021. “Meet the Synthetic Data Vault.” DataCebo Blog. May 4, 2021. https://datacebo.com/blog/intro-to-sdv/.

[2] Castellanos, Sara. 2021. “Fake It to Make It: Companies Beef up AI Models with Synthetic Data.” Wall Street Journal, July 23, 2021, sec. WSJ Pro. https://www.wsj.com/articles/fake-it-to-make-it-companies-beef-up-ai-models-with-synthetic-data-11627032601.

[3] “Synthetic Data for Advanced Analytics, AI and Machine Learning – MOSTLY AI.” n.d. Mostly.AI. Accessed October 6, 2023. https://mostly.ai/use-case/synthetic-data-for-analytics-ai-training.