• cyd@lemmy.world
    2 days ago

    No AI org of any significant size will ever disclose its full training set, and it’s foolish to expect such a standard to be met. There is just too much liability. No matter how clean your data collection procedure is, there’s no way to guarantee that a data set with billions of samples won’t contain at least one thing a lawyer could zero in on and drag you into a lawsuit over.

    What DeepSeek did (full disclosure of methods in a scientific paper, release of weights under the MIT license, and release of some auxiliary code) is as much as one can expect.