Why you should read
This post will show you how synthetic data can offer a practical solution. Discover the steps, challenges, and potential of creating synthetic datasets that preserve real patterns without exposing sensitive details.
TL;DR
Synthetic data mirrors real-world trends without exposing sensitive details. Learn how we used Claude to generate realistic yet anonymous datasets, overcame challenges, and built dashboards to track meaningful metrics, all while maintaining user privacy.
Time to read
Approximately 8 minutes
In many companies, big data holds useful insights but also poses privacy risks. Teams need to study patterns, run experiments, or create demos without leaking sensitive information. Synthetic data offers a solution. It looks and behaves like the real thing but erases personal details. Our goal was to make a dataset that reflects true trends, such as shipping costs or product preferences, yet ensures no row matches an actual person.
Below, we explain synthetic data, why we use it, how we generated it, and what issues we faced.
What Is Synthetic Data
Synthetic data is a set of made-up records that mirror real information’s shapes and patterns. If the source data has many weekend shoppers, the synthetic version should too. But every entry is fictional, which shields private facts like names or exact addresses. Researchers can still see how sales vary by day or how certain products sell more in certain regions.
Why We Bothered
Tough data privacy laws and internal rules ban sharing real user details. Yet teams need real-feeling numbers to test apps, build machine learning models, and discuss trends. Synthetic data is a middle ground. It holds onto the key relationships in the original dataset, like how refunds happen more often on pricey items, while removing the risk of exposing customers' personal info.
Our Process for Generating Synthetic Data with Claude
Below is the full conversation, showing how we iterated on a Snowflake script: we added columns, fixed function arguments, and aligned fields. We also updated a Rill dashboard to handle the new product details and order types. Nothing has been removed.
Step 1: Setting Up the Base Table
Claude suggests a simple plan to fill the ALL_ORDERS table with fake records. The user's DDL includes columns for order details, timestamps, prices, shipping locations, refunds, and discounts.
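For orientation, here is a minimal sketch of that kind of DDL, assuming a handful of representative columns; the names and types are illustrative, not the exact production schema.

```sql
-- Minimal sketch of the base table; columns and types are illustrative.
CREATE OR REPLACE TABLE ALL_ORDERS (
    ORDER_ID          VARCHAR,
    ORDER_DATE        TIMESTAMP_NTZ(9),
    ITEM_TOTAL_PRICE  NUMBER(10, 2),
    SHIPPING_COST     NUMBER(10, 2),
    SHIPPING_STATE    VARCHAR,
    DISCOUNT_AMOUNT   NUMBER(10, 2),
    REFUND_DATE       TIMESTAMP_NTZ(9)
);
```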
Step 2: Generating Random Data in Snowflake
Claude explains how to use Snowflake's data generation functions (RANDOM(), UNIFORM(), and others) for building rows that look like real orders. It mentions the difference between fully random and controlled distributions.
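A hedged sketch of that pattern, with made-up ranges and a shortened state list: TABLE(GENERATOR(...)) supplies the rows, and UNIFORM(min, max, RANDOM()) keeps each value inside a believable, controlled range instead of being unbounded noise.

```sql
-- Illustrative only: generate 10,000 fake orders in one INSERT.
INSERT INTO ALL_ORDERS
SELECT
    UUID_STRING()                                                    AS ORDER_ID,
    -- a random purchase time within the last year
    DATEADD('day', -UNIFORM(0, 365, RANDOM()), CURRENT_TIMESTAMP())  AS ORDER_DATE,
    UNIFORM(1000, 20000, RANDOM()) / 100                             AS ITEM_TOTAL_PRICE,  -- $10.00-$200.00
    UNIFORM(500, 2500, RANDOM()) / 100                               AS SHIPPING_COST,     -- $5.00-$25.00
    -- pick a shipping state from a small illustrative list
    ARRAY_CONSTRUCT('CA', 'NY', 'TX', 'WA')[UNIFORM(0, 3, RANDOM())]::STRING AS SHIPPING_STATE,
    0                                                                AS DISCOUNT_AMOUNT,   -- filled in later
    NULL                                                             AS REFUND_DATE        -- filled in later
FROM TABLE(GENERATOR(ROWCOUNT => 10000));
```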
Step 3: Creating the Table in a Specific Schema + Fixing UNIFORM Argument Issues
Claude updates the script to create ALL_ORDERS in brainforge_rill.sales and inserts random ranges for price columns, shipping states, and so on. Then a snag: Snowflake insists on constant numeric arguments for UNIFORM. Claude fixes this by generating a random percentage within a constant range and multiplying it by ITEM_TOTAL_PRICE, as sketched below.
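Here is the shape of that fix as a sketch; the 0-30% discount range is our invention for illustration.

```sql
-- Not allowed: UNIFORM's min/max must be constants, so this fails:
--   UNIFORM(0, ITEM_TOTAL_PRICE, RANDOM())
-- Workaround: draw a percentage from a constant range, then scale it.
UPDATE brainforge_rill.sales.ALL_ORDERS
SET DISCOUNT_AMOUNT = ROUND(ITEM_TOTAL_PRICE * UNIFORM(0, 30, RANDOM()) / 100, 2);
```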
Step 4: Adding New Columns—Product Type, Box Quantity, Order Type
Claude adds columns for PRODUCT_TYPE (A or B), QUANTITY_BOXES (1, 2, or 4), and ORDER_TYPE (Subscription or One-Time). Item prices now depend on product type and box count, which ensures more realistic variation.
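A sketch of the idea, with invented base prices: the attributes are drawn first in a CTE, then the price is derived from them so that type, box count, and total stay consistent within each row.

```sql
WITH attrs AS (
    SELECT
        DECODE(UNIFORM(1, 2, RANDOM()), 1, 'A', 2, 'B')                AS PRODUCT_TYPE,
        DECODE(UNIFORM(1, 3, RANDOM()), 1, 1, 2, 2, 3, 4)              AS QUANTITY_BOXES,
        IFF(UNIFORM(1, 10, RANDOM()) <= 6, 'Subscription', 'One-Time') AS ORDER_TYPE
    FROM TABLE(GENERATOR(ROWCOUNT => 10000))
)
SELECT
    PRODUCT_TYPE,
    QUANTITY_BOXES,
    ORDER_TYPE,
    -- price follows from type and box count (illustrative base prices)
    IFF(PRODUCT_TYPE = 'A', 29.99, 44.99) * QUANTITY_BOXES AS ITEM_TOTAL_PRICE
FROM attrs;
```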
Step 5: Matching Column Order & Fixing the REFUND_DATE Type
Claude aligns every column in the INSERT statement with the CREATE TABLE. It also ensures REFUND_DATE uses TIMESTAMP_NTZ(9) to match the DDL.
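A sketch of the corrected refund logic, assuming (for illustration) that roughly one order in ten is refunded within a month of purchase:

```sql
-- Cast to TIMESTAMP_NTZ(9) so the value matches the DDL exactly.
UPDATE ALL_ORDERS
SET REFUND_DATE = DATEADD('day', UNIFORM(1, 30, RANDOM()), ORDER_DATE)::TIMESTAMP_NTZ(9)
WHERE UNIFORM(1, 10, RANDOM()) = 1;
```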
Step 6: Updating the Rill Dashboard YAML
It modifies the dashboard config to add PRODUCT_TYPE, QUANTITY_BOXES, and ORDER_TYPE as dimensions. It also updates measures like "Total Quantity Sold" and introduces new metrics (e.g., total subscription orders vs. one-time orders).
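A trimmed, illustrative fragment of what those additions might look like in the dashboard YAML; the labels and measure expressions are ours, loosely following Rill's dimensions/measures layout rather than reproducing the exact file.

```yaml
# Illustrative fragment only; the real config has more fields.
dimensions:
  - label: Product Type
    column: PRODUCT_TYPE
  - label: Quantity (Boxes)
    column: QUANTITY_BOXES
  - label: Order Type
    column: ORDER_TYPE

measures:
  - label: Total Quantity Sold
    expression: SUM(QUANTITY_BOXES)
  - label: Subscription Orders
    expression: SUM(CASE WHEN ORDER_TYPE = 'Subscription' THEN 1 ELSE 0 END)
  - label: One-Time Orders
    expression: SUM(CASE WHEN ORDER_TYPE = 'One-Time' THEN 1 ELSE 0 END)
```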
Hurdles and Trade-Offs
1. Balance - If we randomize too much, the data loses meaning. If we stay too close to the source, someone may recognize real user traits in a row.
2. Messy Source Data - Real sets have odd values or missing fields. We had to keep that flavor so our synthetic data wouldn't seem perfect or fake.
3. Regulatory Questions - Some laws might treat near-identical rows as still “real.” We had to be strict with columns like addresses or emails.
4. Convincing Stakeholders - Some leaders doubt that “fake” info can reveal true patterns. We offered them comparisons to show that the synthetic set matched real trends.
Future Directions
- Automated Tools - More software that scans a dataset and turns it into a synthetic version with minimal manual effort.
- Clear Sensitivity Levels - Each column gets a tag (like high or low sensitivity), so we know how much to randomize it.
- Ongoing Updates - Real data changes. Synthetic data should refresh as trends shift.
- Shared Frameworks - More open-source libraries and examples will help companies adopt synthetic data best practices.
Conclusion
Synthetic data is not a quick fix. It requires careful design, checks, and open dialogue about what matters most: privacy or precision. Done well, it opens up true collaboration on tough analytics problems. By preserving shapes and removing personal details, it lets teams innovate without harming user trust.