How We Generated Synthetic Data with Claude

Why you should read

This post shows how synthetic data can offer a practical solution. Discover the steps, challenges, and potential of creating synthetic datasets that preserve real patterns without exposing sensitive details.

TL;DR

Synthetic data mirrors real-world trends without exposing sensitive details. Learn how we used Claude to generate realistic yet anonymous datasets, how we overcame the challenges along the way, and how we built dashboards to track meaningful metrics, all while maintaining user privacy.

Time to read

Approximately 8 minutes

In many companies, big data holds useful insights but also poses privacy risks. Teams need to study patterns, run experiments, or create demos without leaking sensitive information. Synthetic data offers a solution. It looks and behaves like the real thing but erases personal details. Our goal was to make a dataset that reflects true trends, such as shipping costs or product preferences, yet ensures no row matches an actual person.

Below, we explain synthetic data, why we use it, how we generated it, and what issues we faced.

What Is Synthetic Data

Synthetic data is a set of made-up records that mirror real information’s shapes and patterns. If the source data has many weekend shoppers, the synthetic version should too. But every entry is fictional, which shields private facts like names or exact addresses. Researchers can still see how sales vary by day or how certain products sell more in certain regions.

Why We Bothered

Tough data privacy laws and internal rules ban sharing real user details. Yet teams need real-feeling numbers to test apps, build machine learning models, and discuss trends. Synthetic data is a middle ground: it holds onto the key relationships in the original dataset, like how refunds happen more often on pricey items, while removing the risk of exposing customers' personal info.

Our Process for Generating Synthetic Data with Claude

Below is the full conversation that shows how we iterated on a Snowflake script. We added columns, fixed function arguments, and aligned fields. We also updated a Rill dashboard to handle new product details and order types. Nothing has been removed.

Step 1: Setting Up the Base Table

I need help creating a dummy data set for us to use for some sales demos. This is going to be
for a table called All Orders. Attached is some sample data and below is the DDL. We will
have data coming from amazon, shopify, and retail locations like Target and Walmart. Please
consider this an ecommerce firm. Generate data from 2022 to the end of 2024

create or replace TRANSIENT TABLE ANALYTICS.MART.ALL_ORDERS (
SELLING_PLATFORM VARCHAR(7),
FULFILLMENT_CHANNEL VARCHAR(18),
ORDER_ID VARCHAR(16777216),
ORDER_URL VARCHAR(16777216),
CUSTOMER_EMAIL VARCHAR(256),
ORDER_TIMESTAMP TIMESTAMP_TZ(9),
ORDER_DATE DATE,
DAY_OF_WEEK NUMBER(2,0),
DAYNAME VARCHAR(3),
MONTH NUMBER(2,0),
YEAR NUMBER(4,0),
DAY_OF_YEAR NUMBER(4,0),
ITEM_TOTAL_PRICE FLOAT,
TOTAL_QUANTITY NUMBER(38,5),
TOTAL_DISCOUNT_AMOUNT FLOAT,
SHIPPING_CITY VARCHAR(16777216),
SHIPPING_STATE VARCHAR(16777216),
SHIPPING_ZIPCODE VARCHAR(16777216),
TOTAL_PRODUCT_WEIGHT FLOAT,
ZONE NUMBER(1,0),
ZONE_STATE VARCHAR(16777216),
HAS_REFUND BOOLEAN,
REFUND_DATE TIMESTAMP_NTZ(9),
TOTAL_REFUND_AMOUNT FLOAT,
TOTAL_PRODUCT_COST FLOAT,
TOTAL_FEES FLOAT,
REFUND_REASON VARCHAR(16777216),
DISCOUNT_CODE VARCHAR(16777216)
);

Claude suggests a simple plan to fill this ALL_ORDERS table with fake records. The user’s DDL includes columns for order details, timestamps, prices, shipping locations, refunds, and discounts.

Step 2: Generating Random Data in Snowflake

Ok how about helping me generate the random data in snowflake using the data generation queries and some other functions.

Data generation functions

Data Generation functions allow you to generate data. Snowflake supports two types of
data generation functions:
* Random, which can be useful for testing purposes.
These functions produce a random value each time. Each value is independent of the other
values generated by other calls to the function. The underlying algorithm produces pseudo-
random values, and thus the values are not truly random or independent, but without
knowing the algorithm, the values are essentially unpredictable, usually evenly distributed (if
the sample size is large), and pseudo-independent of each other.

* Controlled distribution, which can be useful for providing unique ID numbers for records
that do not already have unique identifiers.
These functions produce values that are not independent. For example, the NORMAL function returns values that follow an approximately normal distribution, and the SEQ family of functions returns a sequence of values.

Note
The UNIFORM function is listed as a controlled-distribution function, but is intended to
generate evenly-distributed values. In other words, it acts as though it's a "random"
function, but we refer to it as a controlled distribution function because the distribution is
explicitly specified and because you can choose a data-generation function that produces
non-uniform values over a large sample size.

List of functions

Random
* RANDOM - Returns a pseudo-random 64-bit integer.
* RANDSTR - Returns a random string of specified length.
* UUID_STRING - Returns a random RFC 4122-compliant UUID as a formatted string.

Controlled Distribution
* NORMAL - Returns a normal-distributed floating point number, with specified mean and standard deviation.
* UNIFORM - Returns a uniformly random number within the specified range.
* ZIPF - Returns a Zipf-distributed integer.
* SEQ1/SEQ2/SEQ4/SEQ8 - Returns a sequence of monotonically increasing integers.

Claude explains how to use Snowflake’s data generation functions (RANDOM(), UNIFORM(), and others) for building rows that look like real orders. It mentions the difference between fully random and controlled distributions.
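A minimal sketch of how these functions combine into an INSERT (column names come from the DDL above; the row count, date math, and value ranges are illustrative, not Claude's exact script):

```sql
-- Illustrative only: populate a few ALL_ORDERS columns with synthetic rows.
-- GENERATOR produces the rows; UUID_STRING/UNIFORM/RANDOM fill the values.
INSERT INTO ANALYTICS.MART.ALL_ORDERS
    (ORDER_ID, ORDER_TIMESTAMP, ORDER_DATE,
     ITEM_TOTAL_PRICE, TOTAL_QUANTITY, HAS_REFUND)
SELECT
    UUID_STRING()                                       AS ORDER_ID,
    -- random second within 2022-01-01 .. end of 2024 (~3 years of seconds)
    DATEADD('second', UNIFORM(0, 94607999, RANDOM()),
            '2022-01-01 00:00:00 +00:00'::TIMESTAMP_TZ) AS ORDER_TIMESTAMP,
    ORDER_TIMESTAMP::DATE                               AS ORDER_DATE,  -- lateral alias
    ROUND(UNIFORM(20::FLOAT, 200::FLOAT, RANDOM()), 2)  AS ITEM_TOTAL_PRICE,
    UNIFORM(1, 5, RANDOM())                             AS TOTAL_QUANTITY,
    UNIFORM(0, 99, RANDOM()) < 5                        AS HAS_REFUND   -- ~5% refunds
FROM TABLE(GENERATOR(ROWCOUNT => 100000));
```

The remaining columns are filled the same way; GENERATOR with a ROWCOUNT is what turns a single SELECT into a bulk load.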

Step 3: Creating the Table in a Specific Schema + Fixing UNIFORM Argument Issues

I need to create the table in brainforge_rill.sales schema first

Claude updates the script to create ALL_ORDERS in brainforge_rill.sales. It inserts random ranges for price columns, shipping states, etc. Then:

argument 2 to function UNIFORM needs to be constant, found 'ITEM_TOTAL_PRICE * 0.2'

Snowflake insists on fixed numeric arguments for UNIFORM. Claude fixes this by generating a random percentage in a constant range, then multiplying it by ITEM_TOTAL_PRICE.
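The fix can be sketched like this (the 20% cap comes from the error message above; the rounding is our illustrative choice):

```sql
-- Fails: UNIFORM's bounds must be constants
-- UNIFORM(0, ITEM_TOTAL_PRICE * 0.2, RANDOM()) AS TOTAL_DISCOUNT_AMOUNT

-- Works: draw a percentage from a constant range, then scale by the price
ROUND(ITEM_TOTAL_PRICE * UNIFORM(0::FLOAT, 0.2::FLOAT, RANDOM()), 2)
    AS TOTAL_DISCOUNT_AMOUNT
```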

Step 4: Adding New Columns (Product Type, Box Quantity, Order Type)

Great that was perfect - I got this feedback. Whats the best way to incorporate? Let me
know what your going to do and I'll let you know if we can move forward with that

Can we add a few more cuts to this though?
* By Product Type
* Product A
* Product B
* By Quantity
* 1 box
* 2 box
* 4 boxes
* By Order Type
* Subscription
* On Time Order
(edited)

They're a subscription product, so want to make sure that they see the cuts they always ask
about

Claude adds columns for PRODUCT_TYPE (A or B), QUANTITY_BOXES (1, 2, or 4), and ORDER_TYPE (Subscription or One-Time). Item prices now depend on product type and box count. This ensures more realistic variation.
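The new columns can be sketched as follows (the 70% subscription share and the per-box prices are illustrative assumptions, not values from the original script):

```sql
SELECT
    -- 50/50 split between the two products
    CASE WHEN UNIFORM(0, 1, RANDOM()) = 0
         THEN 'Product A' ELSE 'Product B' END          AS PRODUCT_TYPE,
    -- only 1, 2, or 4 boxes are allowed
    CASE UNIFORM(1, 3, RANDOM())
         WHEN 1 THEN 1 WHEN 2 THEN 2 ELSE 4 END         AS QUANTITY_BOXES,
    -- subscription-heavy mix, since they're a subscription product
    CASE WHEN UNIFORM(0, 99, RANDOM()) < 70
         THEN 'Subscription' ELSE 'One-Time' END        AS ORDER_TYPE,
    -- price now depends on product type and box count (lateral aliases)
    QUANTITY_BOXES *
        CASE PRODUCT_TYPE
             WHEN 'Product A' THEN 39.99
             ELSE 49.99 END                             AS ITEM_TOTAL_PRICE
FROM TABLE(GENERATOR(ROWCOUNT => 10));
```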

Step 5: Matching Column Order & Fixing the REFUND_DATE Type

Got this
Expression type does not match column data type, expecting TIMESTAMP_NTZ(9) but got NUMBER(2,0) for column REFUND_DATE

Claude aligns every column in the INSERT statement with the CREATE TABLE. It also ensures REFUND_DATE uses TIMESTAMP_NTZ(9) to match the DDL.
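The type fix can be sketched like this (the 30-day refund window is an illustrative assumption):

```sql
-- Fails: yields NUMBER(2,0), but the column is TIMESTAMP_NTZ(9)
-- UNIFORM(1, 30, RANDOM()) AS REFUND_DATE

-- Works: offset the order timestamp by a random number of days,
-- then cast to TIMESTAMP_NTZ to match the DDL
CASE WHEN HAS_REFUND
     THEN DATEADD('day', UNIFORM(1, 30, RANDOM()),
                  ORDER_TIMESTAMP)::TIMESTAMP_NTZ
END AS REFUND_DATE
```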

Step 6: Updating the Rill Dashboard YAML

This is my Rill dashboard yaml. I want to modify it to include the three new columns we
created as dimensions and any relevant metrics to include

# Dashboard YAML
# Reference documentation: https://docs.rilldata.com/reference/project-files/dashboards
# This file was generated using AI.
type: metrics_view
title: All Orders Model Metrics
model: all_orders_model
timeseries: ORDER_TIMESTAMP
dimensions:
- label: Selling Platform
  column: SELLING_PLATFORM
  description: ""
- label: Fulfillment Channel
  column: FULFILLMENT_CHANNEL
  description: ""
- label: Order Id
  column: ORDER_ID
  description: ""
- label: Order Url
  column: ORDER_URL
  description: ""
- label: Customer Email
  column: CUSTOMER_EMAIL
  description: ""
- label: Dayname
  column: DAYNAME
  description: ""
- label: Shipping City
  column: SHIPPING_CITY
  description: ""
- label: Shipping State
  column: SHIPPING_STATE
  description: ""
- label: Shipping Zipcode
  column: SHIPPING_ZIPCODE
  description: ""
- label: Has Refund
  column: HAS_REFUND
  description: ""
- label: Refund Reason
  column: REFUND_REASON
  description: ""
- label: Discount Code
  column: DISCOUNT_CODE
  description: ""
measures:
- name: total_orders
  label: Total Orders
  expression: COUNT(ORDER_ID)
  description: The total number of orders.
  format_preset: humanize
  valid_percent_of_total: false
- name: total_revenue
  label: Total Revenue
  expression: SUM(ITEM_TOTAL_PRICE)
  description: The total revenue from all orders.
  format_preset: currency_usd
  valid_percent_of_total: true
- name: average_order_value
  label: Average Order Value
  expression: AVG(ITEM_TOTAL_PRICE)
  description: The average value of an order.
  format_preset: currency_usd
  valid_percent_of_total: false
- name: total_quantity_sold
  label: Total Quantity Sold
  expression: SUM(TOTAL_QUANTITY)
  description: The total quantity of items sold.
  format_preset: humanize
  valid_percent_of_total: true
- name: total_discount_amount
  label: Total Discount Amount
  expression: SUM(TOTAL_DISCOUNT_AMOUNT)
  description: The total amount of discounts applied to orders.
  format_preset: currency_usd
  valid_percent_of_total: true
- name: total_shipping_weight
  label: Total Shipping Weight
  expression: SUM(TOTAL_PRODUCT_WEIGHT)
  description: The total weight of products shipped.
  format_preset: humanize
  valid_percent_of_total: true
- name: total_refund_amount
  label: Total Refund Amount
  expression: SUM(TOTAL_REFUND_AMOUNT)
  description: The total amount refunded to customers.
  format_preset: currency_usd
  valid_percent_of_total: true
- name: average_product_cost
  label: Average Product Cost
  expression: AVG(TOTAL_PRODUCT_COST)
  description: The average cost of products sold.
  format_preset: humanize
  valid_percent_of_total: false
- name: total_fees
  label: Total Fees
  expression: SUM(TOTAL_FEES)
  description: The total fees associated with orders.
  format_preset: currency_usd
  valid_percent_of_total: true
- name: total_orders_with_refunds
  label: Total Orders with Refunds
  expression: COUNT(CASE WHEN HAS_REFUND THEN 1 END)
  description: The total number of orders that had refunds.
  format_preset: humanize
  valid_percent_of_total: false
available_time_zones:
- America/Los_Angeles
- America/Chicago
- America/New_York
- Europe/London
- Europe/Paris
- Asia/Jerusalem
- Europe/Moscow
- Asia/Kolkata
- Asia/Shanghai
- Asia/Tokyo
- Australia/Sydney
available_time_ranges:
- PT24H
- P7D
- P14D
- P4W
- P3M
- P12M
- rill-TD
- rill-WTD
- rill-MTD
- rill-QTD
- rill-YTD
- rill-PDC
- rill-PWC
- rill-PMC
- rill-PQC
- rill-PYC

Claude modifies the dashboard config to add PRODUCT_TYPE, QUANTITY_BOXES, and ORDER_TYPE as dimensions. It also updates measures like “Total Quantity Sold” and introduces new metrics (e.g., total subscription orders vs. one-time orders).
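The additions look roughly like this in the YAML, following the structure of the file above (the labels and the exact measure name are our choices, not confirmed from the final dashboard):

```yaml
# New entries appended to the existing dimensions: list
- label: Product Type
  column: PRODUCT_TYPE
  description: ""
- label: Quantity Boxes
  column: QUANTITY_BOXES
  description: ""
- label: Order Type
  column: ORDER_TYPE
  description: ""

# Example of a new measure for the subscription cut, under measures:
- name: total_subscription_orders
  label: Total Subscription Orders
  expression: COUNT(CASE WHEN ORDER_TYPE = 'Subscription' THEN 1 END)
  description: The total number of subscription orders.
  format_preset: humanize
  valid_percent_of_total: false
```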

Hurdles and Trade-Offs

1. Balance - If we randomize too much, the data loses meaning. If we stay too close to the source, someone may recognize real user traits in a row.

2. Messy Source Data - Real datasets have odd values or missing fields. We had to keep that flavor so our synthetic data wouldn't seem too perfect or obviously fake.

3. Regulatory Questions - Some laws might treat near-identical rows as still “real.” We had to be strict with columns like addresses or emails.

4. Convincing Stakeholders - Some leaders doubt that “fake” info can reveal true patterns. We offered them comparisons to show that the synthetic set matched real trends.

Future Directions

  • Automated Tools - More software that scans a dataset and turns it into a synthetic version with minimal manual effort.
  • Clear Sensitivity Levels - Each column gets a tag (like high or low sensitivity), so we know how much to randomize it.
  • Ongoing Updates - Real data changes. Synthetic data should refresh as trends shift.
  • Shared Frameworks - More open-source libraries and examples will help companies adopt synthetic data best practices.

Conclusion

Synthetic data is not a quick fix. It requires careful design, checks, and open dialogue about what matters most: privacy or precision. But it can open the door to true collaboration on tough analytics problems. By preserving the shape of the data and removing personal details, it lets teams innovate without harming user trust.
