• FaceDeer@fedia.io
    13 hours ago

    Thanks for asking. My comment was off the top of my head, based on stuff I’ve read over the years, so first I did a little fact-checking of myself to make sure. There’s still a lot of black magic involved in training LLMs, so the exact mix of training data varies a lot depending on who you ask. In some cases raw data is still used for the initial training, to get LLMs to the point where they can respond coherently to prompts, and synthetic data is more often used for the fine-tuning phase, where they’re trained to respond to prompts in particular ways. But there doesn’t seem to be any reason why synthetic data can’t be used for the whole training run; it’s just that well-curated, high-quality raw data is already available.

    This article on how to use LLMs to generate synthetic data seems to be pretty comprehensive, starting with the basics and then going into detail about how to generate it with a system called DeepEval. In another comment in this thread I pointed to NVIDIA’s Nemotron-4 models as another example.

    • leftzero@lemmynsfw.com
      8 hours ago

      there doesn’t seem to be any reason why synthetic data can’t be used for the whole training run

      Ah, of course, it’s LLMs all the way down!

      No, but seriously, you’re aware they’re selling this shit as a replacement for search engines, are you not?

      • FaceDeer@fedia.io
        1 hour ago

        No, it’s not “LLMs all the way down.” Synthetic data is still ultimately built on raw data; generating it just improves the form that data takes, and the process includes lots of curation steps to filter the result for quality.
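
        To make that concrete, here is a rough toy sketch in Python of the shape of such a pipeline: raw text goes in, gets restated as prompt/response pairs, and a filter throws out the junk. Everything in it is an assumption for illustration. `RAW_DOCS`, `rewrite_as_qa`, `passes_quality_filter` and `build_synthetic_dataset` are made-up names, the "rewrite" step is a plain string operation standing in for an LLM call, and the filter thresholds are arbitrary. It is not how DeepEval or Nemotron-4 actually do it.

```python
import re

# Hypothetical raw documents; a real pipeline would read a curated corpus.
RAW_DOCS = [
    "The Nile is a river in Africa. It flows north into the Mediterranean Sea.",
    "click here click here click here",
    "Water boils at 100 degrees Celsius at sea level.",
]


def rewrite_as_qa(doc: str) -> dict:
    """Stand-in for an LLM call that restates raw text as a prompt/response pair.

    Here it just takes the first sentence; the point is that the output
    is derived from the raw document, not invented from nothing.
    """
    first_sentence = re.split(r"(?<=[.!?])\s+", doc.strip())[0]
    return {
        "prompt": "State one fact from the source text.",
        "response": first_sentence,
        "source": doc,  # keep provenance so every example traces back to raw data
    }


def passes_quality_filter(example: dict) -> bool:
    """Toy curation step: drop very short or highly repetitive responses."""
    words = example["response"].lower().split()
    if len(words) < 5:
        return False
    # Ratio of distinct words to total words; low values mean repetition.
    return len(set(words)) / len(words) >= 0.6


def build_synthetic_dataset(raw_docs: list) -> list:
    """Rewrite every raw document, then keep only examples that pass the filter."""
    candidates = [rewrite_as_qa(doc) for doc in raw_docs]
    return [ex for ex in candidates if passes_quality_filter(ex)]


if __name__ == "__main__":
    for example in build_synthetic_dataset(RAW_DOCS):
        print(example["response"])
```

        Note that each surviving example still carries its `source`, which is the whole point: the synthetic set is a cleaned-up, reshaped view of the raw data, and the spammy "click here" document gets dropped by the filter.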

        I don’t know what you mean by “a replacement for search engines.” LLMs are commonly being used to summarize search engine results, but there’s still a search engine providing the LLM with sources to generate that summary from.