• I am an AI expert and this is why synthetic data is so popular fo

    From TechnologyDaily@1337:1/100 to All on Monday, August 18, 2025 15:30:09
    I am an AI expert and this is why synthetic data is so popular for LLMs

    Date:
    Mon, 18 Aug 2025 14:27:39 +0000

    Description:
    Regulated industries must take precautions when using data to train LLMs. Here's how synthetic data helps.

    FULL STORY ======================================================================

    As of March 2025, 40% of global companies report using artificial intelligence (AI) in their business. While the benefits offered by this transformational tool can feel nearly limitless, the reality is that AI isn't inherently secure, especially for companies dealing with sensitive information.

    AI is designed to analyze large datasets quickly, detect patterns, and respond in as little time as possible. But many tools train on whatever data you provide, which makes them a risky place for sensitive information. Sharing private information, intentionally or not, can create long-term risks to client privacy, especially in regulated industries that handle extremely personal data, such as healthcare, finance, or law.

    The benefits of leveraging synthetic data

    AI works best with strong, structured, and relevant data. Whenever possible, real-world data is ideal, but that's not always an option. Regulations like HIPAA and GDPR restrict teams from sharing personal data externally, including with AI models. That's where synthetic data shines.

    You'll often see synthetic data used as a placeholder, especially when legal approvals or NDAs are still in progress. Instead of stalling development, teams can keep moving forward with stand-in data, then switch to production data later to validate the results. This keeps projects moving while staying compliant.

    In other cases, synthetic data fills in the gaps. You might have real data, but not enough of it, or not enough variation to properly train your model. A good rule of thumb: you'll need roughly 10 times as many data samples as model parameters. When real data falls short, synthetic data can help augment and diversify your training set.

    Considerations for using synthetic data

    One common misconception is that synthetic data is just fake data. But in reality, it's often based on real-world information that's been restructured, anonymized, or generated to mirror actual scenarios. Think of it like a flight simulator: useful for training and preparation, but not the same as flying a real plane. Synthetic data can help teams test and train AI models, but it shouldn't be seen as a complete replacement for production data.

    That said, it does come with risks, particularly around re-identification. If synthetic data can be traced back to the original source, the whole premise of privacy falls apart. One of the most critical steps is to ensure the original dataset is no longer stored or accessible once the synthetic version is created. Simply having the two datasets in proximity to each other creates unnecessary risk.
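
    As a rough illustration of the re-identification concern above, the sketch below counts how many synthetic records are verbatim copies of an original record. It would need to run before the original dataset is retired. The field names and values are hypothetical, and an exact-match check only catches the most obvious leaks; it is not a full re-identification assessment.

```python
def row_key(row):
    """Turn a record (dict of hashable values) into an order-independent key."""
    return tuple(sorted(row.items()))


def exact_match_rate(original_rows, synthetic_rows):
    """Return the fraction of synthetic rows that are verbatim copies of an original row."""
    if not synthetic_rows:
        return 0.0
    originals = {row_key(row) for row in original_rows}
    copies = sum(1 for row in synthetic_rows if row_key(row) in originals)
    return copies / len(synthetic_rows)


# Hypothetical records, invented for illustration only.
original = [
    {"age": 54, "zip": "30301", "balance": 1200},
    {"age": 31, "zip": "94105", "balance": 860},
]
synthetic = [
    {"age": 52, "zip": "30305", "balance": 1175},
    {"age": 31, "zip": "94105", "balance": 860},  # verbatim copy of a real row
    {"age": 47, "zip": "60614", "balance": 430},
    {"age": 29, "zip": "94110", "balance": 910},
]

rate = exact_match_rate(original, synthetic)
print(f"{rate:.0%} of synthetic rows copy an original row")
```

    Any nonzero rate here would suggest the generator is memorizing real records and that the synthetic set should be regenerated or filtered before use.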

    Another challenge is outliers. These are extreme or unusual values that can not only skew model training but also serve as clues about the original data. For example, if you're generating synthetic banking data and one of the transactions is for $10 million while the rest are in the hundreds, that single value becomes a beacon. It's both a modeling issue and a potential privacy concern.
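
    One way to spot a value like that $10 million transaction is a median-based outlier check. The sketch below is an assumption about how a team might do it, not a method the article prescribes: it uses the modified z-score with the conventional 0.6745 scaling constant and a 3.5 cutoff, and the transaction amounts are made up.

```python
from statistics import median


def flag_outliers(amounts, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    The score is based on the median and the median absolute deviation (MAD),
    so a single extreme value cannot inflate the yardstick it is measured by.
    """
    med = median(amounts)
    mad = median(abs(x - med) for x in amounts)
    if mad == 0:
        # More than half the values equal the median; flag anything that differs.
        return [x for x in amounts if x != med]
    return [x for x in amounts if 0.6745 * abs(x - med) / mad > threshold]


# Hypothetical synthetic banking transactions, in dollars.
transactions = [120.0, 340.5, 275.0, 410.25, 198.0, 305.75, 10_000_000.0]

flagged = flag_outliers(transactions)
print(flagged)
```

    Flagged values can then be reviewed by hand, capped, or dropped, depending on whether they matter for the model and whether they could point back to a real account.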

    In many cases, partially synthetic data can offer the best of both worlds. You might use real documents or datasets while anonymizing any personally identifiable information. For example, you could keep the visual data from an X-ray but strip out details like the patient's name, the facility, or the diagnosis. That way, you retain data complexity without exposing sensitive information.

    Finally, before using any synthetic dataset in a project, it's worth having someone outside the core team take a final look. A fresh perspective can help spot anything you've missed, whether it's residual identifiers, overlooked outliers, or subtle signs that the data could still be traced back to a real person.

    Conclusion

    Using synthetic data doesn't have to be all or nothing. Many projects benefit from a hybrid approach, especially in early phases. In a world racing to adopt AI, it's easy to move fast and overlook the risks. But safe, responsible model training is everyone's responsibility.

    Synthetic data isn't just a workaround; it's a bridge to building secure, innovative systems that respect privacy and compliance from day one.


    This article was produced as part of TechRadarPro's Expert Insights channel, where we feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing, find out more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro



    ======================================================================
    Link to news story: https://www.techradar.com/pro/i-am-an-ai-expert-and-this-is-why-synthetic-data-is-so-popular-for-llms


    --- Mystic BBS v1.12 A49 (Linux/64)
    * Origin: tqwNet Technology News (1337:1/100)