• The Octonaut@mander.xyz
      5 days ago

      The data part, i.e. the very first part of the OSI’s definition.

      It’s not available from their articles: https://arxiv.org/html/2501.12948v1 and https://arxiv.org/html/2401.02954v1

      Nor on their GitHub: https://github.com/deepseek-ai/DeepSeek-LLM

      Note that the OSI only asks for transparency about what the dataset was - a name and the fee paid will do - not that full access to it be free and Free.

      It’s worth mentioning too that they’ve used the MIT license for the “code” included with the model (a few YAML files to feed it to software) but they have created their own unrecognised non-free license for the model itself. Why they have this misleading label on their GitHub page would only be speculation.

      Without the dataset being made available, nobody can accurately recreate, modify or learn from the model they’ve released. This is the only sane definition of open source available for an LLM, since it is not in itself code with a “source”.

        • The Octonaut@mander.xyz
          5 days ago

          That’s the “prover” dataset, i.e. the evaluation dataset mentioned in the articles I linked you to. It’s for checking the output; it is not the training data.

          It’s also 20 MB, which is minuscule, not just for a training dataset but even by the standard of what you seem to think is a “huge data file” in general.

          You really need to stop digging and admit this is one more thing you have only a surface-level understanding of.