How We Generated Synthetic Data with Claude
Why you should read this
This post will show you how synthetic data can offer a practical solution. Discover the steps, challenges, and potential of creating synthetic datasets that preserve patterns without exposing sensitive details.
TL;DR
Synthetic data mirrors real-world trends without exposing sensitive details. Learn how we used Claude to generate realistic yet anonymous datasets, overcame challenges, and built dashboards to track meaningful metrics, all while maintaining user privacy.
Time to read
Approximately 8 minutes
In many companies, big data holds useful insights but also poses privacy risks. Teams need to study patterns, run experiments, or create demos without leaking sensitive information. Synthetic data offers a solution. It looks and behaves like the real thing but erases personal details. Our goal was to make a dataset that reflects true trends, such as shipping costs or product preferences, yet ensures no row matches an actual person.
Below, we explain synthetic data, why we use it, how we generated it, and what issues we faced.
What Is Synthetic Data?
Synthetic data is a set of made-up records that mirror real information’s shapes and patterns. If the source data has many weekend shoppers, the synthetic version should too. But every entry is fictional, which shields private facts like names or exact addresses. Researchers can still see how sales vary by day or how certain products sell more in certain regions.
Why We Bothered
Tough data privacy laws and internal rules ban sharing real user details. Yet teams need real-feeling numbers to test apps, build machine learning models, and discuss trends. Synthetic data is a middle ground. It holds onto the key relationships in the original dataset, like how refunds happen more often on pricey items, while removing the risk of exposing customers' personal information.
Our Process and Steps on Generating Synthetic Data with Claude
Below is the full conversation that shows how we iterated on a Snowflake script. We added columns, fixed function arguments, and aligned fields. We also updated a Rill dashboard to handle new product details and order types. Nothing has been removed.
Step 1: Setting Up the Base Table
I need help creating a dummy data set for us to use for some sales demos. This is going to be
for a table called All Orders. Attached is some sample data and below is the DDL. We will
have data coming from amazon, shopify, and retail locations like Target and Walmart. Please
consider this an ecommerce firm. Generate data from 2022 to the end of 2024
create or replace TRANSIENT TABLE ANALYTICS.MART.ALL_ORDERS (
SELLING_PLATFORM VARCHAR(7),
FULFILLMENT_CHANNEL VARCHAR(18),
ORDER_ID VARCHAR(16777216),
ORDER_URL VARCHAR(16777216),
CUSTOMER_EMAIL VARCHAR(256),
ORDER_TIMESTAMP TIMESTAMP_TZ(9),
ORDER_DATE DATE,
DAY_OF_WEEK NUMBER(2,0),
DAYNAME VARCHAR(3),
MONTH NUMBER(2,0),
YEAR NUMBER(4,0),
DAY_OF_YEAR NUMBER(4,0),
ITEM_TOTAL_PRICE FLOAT,
TOTAL_QUANTITY NUMBER(38,5),
TOTAL_DISCOUNT_AMOUNT FLOAT,
SHIPPING_CITY VARCHAR(16777216),
SHIPPING_STATE VARCHAR(16777216),
SHIPPING_ZIPCODE VARCHAR(16777216),
TOTAL_PRODUCT_WEIGHT FLOAT,
ZONE NUMBER(1,0),
ZONE_STATE VARCHAR(16777216),
HAS_REFUND BOOLEAN,
REFUND_DATE TIMESTAMP_NTZ(9),
TOTAL_REFUND_AMOUNT FLOAT,
TOTAL_PRODUCT_COST FLOAT,
TOTAL_FEES FLOAT,
REFUND_REASON VARCHAR(16777216),
DISCOUNT_CODE VARCHAR(16777216)
);
Claude suggests a simple plan to fill this ALL_ORDERS table with fake records. The user's DDL includes columns for order details, timestamps, prices, shipping locations, refunds, and discounts.
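The plan boils down to a single INSERT ... SELECT over Snowflake's row generator. Here is a minimal sketch of that idea, filling only a handful of the columns above; the full script covers every column, and the 100,000 row count and price range are assumptions for illustration:

INSERT INTO ANALYTICS.MART.ALL_ORDERS (SELLING_PLATFORM, ORDER_ID, ORDER_DATE, ITEM_TOTAL_PRICE)
SELECT
    DECODE(UNIFORM(1, 3, RANDOM()), 1, 'amazon', 2, 'shopify', 3, 'retail'),  -- pick a selling platform
    UUID_STRING(),                                                            -- fictional order id
    DATEADD(day, UNIFORM(0, 1095, RANDOM()), '2022-01-01'::DATE),             -- random date in 2022-2024
    UNIFORM(20, 200, RANDOM())::FLOAT                                         -- plausible order total (assumed range)
FROM TABLE(GENERATOR(ROWCOUNT => 100000));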
Step 2: Generating Random Data in Snowflake
Ok how about helping me generate the random data in snowflake using the data generation
queries and some other functions. Data generation functions
Data Generation functions allow you to generate data. Snowflake supports two types of
data generation functions:
* Random, which can be useful for testing purposes.
These functions produce a random value each time. Each value is independent of the other
values generated by other calls to the function. The underlying algorithm produces pseudo-
random values, and thus the values are not truly random or independent, but without
knowing the algorithm, the values are essentially unpredictable, usually evenly distributed (if
the sample size is large), and pseudo-independent of each other.
* Controlled distribution, which can be useful for providing unique ID numbers for records
that do not already have unique identifiers.
These functions produce values that are not independent. For example, the NORMAL function returns values that have an approximately normal distribution, and the SEQ family of functions returns a sequence of values.
Note
The UNIFORM function is listed as a controlled-distribution function, but is intended to
generate evenly-distributed values. In other words, it acts as though it's a "random"
function, but we refer to it as a controlled distribution function because the distribution is
explicitly specified and because you can choose a data-generation function that produces
non-uniform values over a large sample size.
List of functions
Random
- RANDOM: Returns a pseudo-random 64-bit integer.
- RANDSTR: Returns a random string of specified length.
- UUID_STRING: Returns a random RFC 4122-compliant UUID as a formatted string.
Controlled Distribution
- NORMAL: Returns a normal-distributed floating point number, with specified mean and standard deviation.
- UNIFORM: Returns a uniformly random number within the specified range.
- ZIPF: Returns a Zipf-distributed integer.
- SEQ1/SEQ2/SEQ4/SEQ8: Returns a sequence of monotonically increasing integers.
Claude explains how to use Snowflake's data generation functions (RANDOM(), UNIFORM(), and others) for building rows that look like real orders. It mentions the difference between fully random and controlled distributions.
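To make that distinction concrete, here is a small illustrative query (not part of the original script) that exercises each family of functions:

SELECT
    RANDOM()                  AS random_int,      -- pseudo-random 64-bit integer
    RANDSTR(8, RANDOM())      AS random_string,   -- random 8-character string
    UUID_STRING()             AS unique_id,       -- RFC 4122-compliant UUID
    UNIFORM(1, 100, RANDOM()) AS uniform_value,   -- evenly distributed between 1 and 100
    NORMAL(50, 10, RANDOM())  AS normal_value,    -- normal distribution, mean 50, stddev 10
    ZIPF(1, 10, RANDOM())     AS zipf_value,      -- Zipf-distributed, skewed toward small integers
    SEQ4()                    AS row_seq          -- monotonically increasing sequence
FROM TABLE(GENERATOR(ROWCOUNT => 5));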
Step 3: Creating the Table in a Specific Schema + Fixing UNIFORM Argument Issues
I need to create the table in brainforge_rill.sales schema first
Claude updates the script to create ALL_ORDERS in brainforge_rill.sales. It inserts random ranges for price columns, shipping states, and so on. Then Snowflake returns an error:
argument 2 to function UNIFORM needs to be constant, found 'ITEM_TOTAL_PRICE * 0.2'
Snowflake insists on fixed numeric arguments for UNIFORM. Claude fixes this by generating a random percentage in a constant range, then multiplying it by ITEM_TOTAL_PRICE.
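A small reproduction of the error and the workaround; the price bounds are made up, and the 20% cap simply mirrors the expression quoted in the error message:

WITH base AS (
    SELECT UNIFORM(20, 200, RANDOM())::FLOAT AS ITEM_TOTAL_PRICE
    FROM TABLE(GENERATOR(ROWCOUNT => 10))
)
SELECT
    ITEM_TOTAL_PRICE,
    -- Fails: UNIFORM(0, ITEM_TOTAL_PRICE * 0.2, RANDOM()) -- the upper bound is not a constant
    -- Works: draw a percentage from a constant range, then scale it by the price
    ROUND(ITEM_TOTAL_PRICE * UNIFORM(0::FLOAT, 0.2::FLOAT, RANDOM()), 2) AS TOTAL_DISCOUNT_AMOUNT
FROM base;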
Step 4: Adding New Columns—Product Type, Box Quantity, Order Type
Great that was perfect - I got this feedback. Whats the best way to incorporate? Let me
know what your going to do and I'll let you know if we can move forward with that
Can we add a few more cuts to this though?
* By Product Type
* Product A
* Product B
* By Quantity
* 1 box
* 2 box
* 4 boxes
* By Order Type
* Subscription
* On Time Order
(edited)
They're a subscription product, so want to make sure that they see the cuts they always ask
about
Claude adds columns for PRODUCT_TYPE (A or B), QUANTITY_BOXES (1, 2, or 4), and ORDER_TYPE (Subscription or One-Time). Item prices now depend on product type and box count, which ensures more realistic variation.
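A sketch of how those columns and the price logic might look; the per-box prices and the subscription mix below are illustrative assumptions, not values from the real script:

WITH base AS (
    SELECT
        DECODE(UNIFORM(1, 2, RANDOM()), 1, 'Product A', 2, 'Product B')  AS PRODUCT_TYPE,
        DECODE(UNIFORM(1, 3, RANDOM()), 1, 1, 2, 2, 3, 4)                AS QUANTITY_BOXES,  -- 1, 2, or 4 boxes
        IFF(UNIFORM(1, 100, RANDOM()) <= 70, 'Subscription', 'One-Time') AS ORDER_TYPE       -- ~70% subscriptions (assumption)
    FROM TABLE(GENERATOR(ROWCOUNT => 100000))
)
SELECT
    PRODUCT_TYPE,
    QUANTITY_BOXES,
    ORDER_TYPE,
    -- price scales with box count and differs by product (illustrative per-box prices)
    QUANTITY_BOXES * IFF(PRODUCT_TYPE = 'Product A', 39.00, 49.00)       AS ITEM_TOTAL_PRICE
FROM base;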
Step 5: Matching Column Order & Fixing the REFUND_DATE Type
Got this
Expression type does not match column data type, expecting TIMESTAMP_NTZ(9) but got NUMBER(2,0) for column REFUND_DATE
Claude aligns every column in the INSERT statement with the CREATE TABLE. It also ensures REFUND_DATE uses TIMESTAMP_NTZ(9) to match the DDL.
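The refund-date expression, sketched with an assumed refund rate and refund window (the real script's values may differ):

WITH base AS (
    SELECT
        UNIFORM(1, 10, RANDOM()) = 1                                          AS HAS_REFUND,       -- ~10% refund rate (assumption)
        DATEADD(day, UNIFORM(0, 1095, RANDOM()), '2022-01-01'::TIMESTAMP_NTZ) AS ORDER_TIMESTAMP
    FROM TABLE(GENERATOR(ROWCOUNT => 10))
)
SELECT
    HAS_REFUND,
    -- cast so the expression resolves to TIMESTAMP_NTZ(9), matching the column's DDL
    IFF(HAS_REFUND,
        DATEADD(day, UNIFORM(1, 30, RANDOM()), ORDER_TIMESTAMP)::TIMESTAMP_NTZ(9),
        NULL) AS REFUND_DATE
FROM base;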
Step 6: Updating the Rill Dashboard YAML
This is my Rill dashboard yaml. I want to modify it to include the three new columns we
created as dimensions and any relevant metrics to include
# Dashboard YAML
# Reference documentation: https://docs.rilldata.com/reference/project-files/dashboards
# This file was generated using AI.
type: metrics_view
title: All Orders Model Metrics
model: all_orders_model
timeseries: ORDER_TIMESTAMP
dimensions:
  - label: Selling Platform
    column: SELLING_PLATFORM
    description: ""
  - label: Fulfillment Channel
    column: FULFILLMENT_CHANNEL
    description: ""
  - label: Order Id
    column: ORDER_ID
    description: ""
  - label: Order Url
    column: ORDER_URL
    description: ""
  - label: Customer Email
    column: CUSTOMER_EMAIL
    description: ""
  - label: Dayname
    column: DAYNAME
    description: ""
  - label: Shipping City
    column: SHIPPING_CITY
    description: ""
  - label: Shipping State
    column: SHIPPING_STATE
    description: ""
  - label: Shipping Zipcode
    column: SHIPPING_ZIPCODE
    description: ""
  - label: Has Refund
    column: HAS_REFUND
    description: ""
  - label: Refund Reason
    column: REFUND_REASON
    description: ""
  - label: Discount Code
    column: DISCOUNT_CODE
    description: ""
measures:
  - name: total_orders
    label: Total Orders
    expression: COUNT(ORDER_ID)
    description: The total number of orders.
    format_preset: humanize
    valid_percent_of_total: false
  - name: total_revenue
    label: Total Revenue
    expression: SUM(ITEM_TOTAL_PRICE)
    description: The total revenue from all orders.
    format_preset: currency_usd
    valid_percent_of_total: true
  - name: average_order_value
    label: Average Order Value
    expression: AVG(ITEM_TOTAL_PRICE)
    description: The average value of an order.
    format_preset: currency_usd
    valid_percent_of_total: false
  - name: total_quantity_sold
    label: Total Quantity Sold
    expression: SUM(TOTAL_QUANTITY)
    description: The total quantity of items sold.
    format_preset: currency_usd
    valid_percent_of_total: true
  - name: total_discount_amount
    label: Total Discount Amount
    expression: SUM(TOTAL_DISCOUNT_AMOUNT)
    description: The total amount of discounts applied to orders.
    format_preset: currency_usd
    valid_percent_of_total: true
  - name: total_shipping_weight
    label: Total Shipping Weight
    expression: SUM(TOTAL_PRODUCT_WEIGHT)
    description: The total weight of products shipped.
    format_preset: humanize
    valid_percent_of_total: true
  - name: total_refund_amount
    label: Total Refund Amount
    expression: SUM(TOTAL_REFUND_AMOUNT)
    description: The total amount refunded to customers.
    format_preset: currency_usd
    valid_percent_of_total: true
  - name: average_product_cost
    label: Average Product Cost
    expression: AVG(TOTAL_PRODUCT_COST)
    description: The average cost of products sold.
    format_preset: humanize
    valid_percent_of_total: false
  - name: total_fees
    label: Total Fees
    expression: SUM(TOTAL_FEES)
    description: The total fees associated with orders.
    format_preset: currency_usd
    valid_percent_of_total: true
  - name: total_orders_with_refunds
    label: Total Orders with Refunds
    expression: COUNT(CASE WHEN HAS_REFUND THEN 1 END)
    description: The total number of orders that had refunds.
    format_preset: humanize
    valid_percent_of_total: false
available_time_zones:
  - America/Los_Angeles
  - America/Chicago
  - America/New_York
  - Europe/London
  - Europe/Paris
  - Asia/Jerusalem
  - Europe/Moscow
  - Asia/Kolkata
  - Asia/Shanghai
  - Asia/Tokyo
  - Australia/Sydney
available_time_ranges:
  - PT24H
  - P7D
  - P14D
  - P4W
  - P3M
  - P12M
  - rill-TD
  - rill-WTD
  - rill-MTD
  - rill-QTD
  - rill-YTD
  - rill-PDC
  - rill-PWC
  - rill-PMC
  - rill-PQC
  - rill-PYC
Claude modifies the dashboard config to add PRODUCT_TYPE, QUANTITY_BOXES, and ORDER_TYPE as dimensions. It also updates measures like "Total Quantity Sold" and introduces new metrics (e.g., total subscription orders vs. one-time orders); an illustrative excerpt follows.
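The post does not reproduce the exact additions, but following the structure of the YAML above, the new dimensions and a pair of the new measures plausibly look like this (names and expressions are assumptions):

# merged into the existing dimensions list
dimensions:
  - label: Product Type
    column: PRODUCT_TYPE
    description: ""
  - label: Quantity Boxes
    column: QUANTITY_BOXES
    description: ""
  - label: Order Type
    column: ORDER_TYPE
    description: ""
# merged into the existing measures list
measures:
  - name: total_subscription_orders
    label: Total Subscription Orders
    expression: COUNT(CASE WHEN ORDER_TYPE = 'Subscription' THEN 1 END)
    description: The total number of subscription orders.
    format_preset: humanize
    valid_percent_of_total: false
  - name: total_one_time_orders
    label: Total One-Time Orders
    expression: COUNT(CASE WHEN ORDER_TYPE = 'One-Time' THEN 1 END)
    description: The total number of one-time orders.
    format_preset: humanize
    valid_percent_of_total: false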
Hurdles and Trade-Offs
1. Balance - If we randomize too much, the data loses meaning. If we randomize too little, someone may still recognize a real user's traits in a row.
2. Messy Source Data - Real sets have odd values or missing fields. We had to keep that flavor so our synthetic data wouldn't seem perfect or fake.
3. Regulatory Questions - Some laws might treat near-identical rows as still "real." We had to be strict with columns like addresses or emails.
4. Convincing Stakeholders - Some leaders doubt that "fake" info can reveal true patterns. We offered them comparisons to show that the synthetic set matched real trends.
Future Directions
- Automated Tools - More software that scans a dataset and turns it into a synthetic version with minimal manual effort.
- Clear Sensitivity Levels - Each column gets a tag (like high or low sensitivity), so we know how much to randomize it.
- Ongoing Updates - Real data changes. Synthetic data should refresh as trends shift.
- Shared Frameworks - More open-source libraries and examples will help companies adopt synthetic data best practices.
Conclusion
Synthetic data is not a quick fix. It requires careful design, validation checks, and open dialogue about what matters most: privacy or precision. Done well, it opens up real collaboration on tough analytics problems. By preserving shapes and removing personal details, it lets teams innovate without harming user trust.