this post was submitted on 02 Feb 2025
90 points (98.9% liked)
Technology
Well, it's still not Open Source.
Is part of the code not available?
None of the code or training data is available. It's just the usual Hugging Face thing, where the weights and some parameters are published and nothing else. People keep repeating that DeepSeek (and many other) LLMs are open source, but they aren't.
They even have a GitHub source code repository at https://github.com/deepseek-ai/DeepSeek-R1 , but it contains only an image, a PDF file, and links to download the model from Hugging Face (plus optional weight and parameter files for fine-tuning). There is no source code and no training data available. Here is an interesting article discussing this issue: Liesenfeld, Andreas, and Mark Dingemanse. "Rethinking open source generative AI: open washing and the EU AI Act." The 2024 ACM Conference on Fairness, Accountability, and Transparency. 2024.
Damn, that sucks. It should be open source. Let people fork and optimize it so it uses as little electricity as possible.
This literally took one click: https://github.com/deepseek-ai
Stop spreading FUD.
Where's the training data?
Does open sourcing require you to give out the training data? I thought it only meant allowing access to the source code, so that you could build it yourself and feed it your own training data.
Open source requires giving whatever digital information is necessary to build a binary.
In this case, the "binary" is the network weights, and "whatever is necessary" includes both the training data and the training code.
DeepSeek is sharing:
In other words: a good amount of open source... with a huge binary blob in the middle.
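The "build from source" analogy can be made concrete with a toy example (purely illustrative, nothing like real LLM training): the weights are the *output* of running the training code on the training data, so with both in hand anyone can rebuild the exact same artifact, while the weights alone cannot be reproduced or audited.

```python
# Toy illustration: "weights" are the output of (code + data).
# With both available, the build is reproducible; with only the
# weights, the build step is missing entirely.

def train(data, steps=100, lr=0.1):
    """Deterministic 1-parameter 'training': fit w so that w*x ~ y."""
    w = 0.0
    for _ in range(steps):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # gradient of squared error
            w -= lr * grad
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # the "training data"

w1 = train(data)  # first build
w2 = train(data)  # rebuild from the same code + data

assert w1 == w2  # reproducible: same inputs yield identical "weights"
print(w1)
```

Releasing only `w1` would be the equivalent of shipping a compiled binary: you can run it, but you cannot rebuild or inspect how it was made.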
Is there any good LLM that fits this definition of open source, then? I thought the "training data" for good AI was always just: the entire internet, and they were all ethically dubious that way.
What is the concern with only having weights? It's not arbitrary code execution, so there's no security risk or lack of control over computation, which are the usual goals of open source in the first place.
To me the weights are less of a "blob" and more like an approximate solution to an NP-hard problem. Training is traversing the search space, and sharing a model is just saying "hey, this point looks useful, others should check it out". But maybe that is a blob, since I don't know how they got there.
There are several "good" LLMs trained on open datasets like FineWeb, LAION, DataComp, etc. They are still "ethically dubious", but at least they can be downloaded, analyzed, filtered, and so on. Unfortunately businesses are keeping datasets and training code as a competitive advantage, even "Open"AI stopped publishing them when they saw an opportunity to make money.
Unless one plugs it into an agent... which is kind of the use we expect right now.
Accessing the web, or even web searches, is already equivalent to arbitrary code execution: an LLM could decide to, for example, summarize and compress some context full of trade secrets, then proceed to "search" for it, sending it to wherever it has access to.
Agents can also be allowed to run local commands... again a use we kind of want now ("hey Google, open my alarms" on a smartphone).
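The exfiltration risk described above can be sketched in a few lines. This is a toy stand-in (the "model", "agent", and secret are all hypothetical), but it shows why letting a model compose its own search queries is effectively an outbound channel:

```python
# Toy sketch of the exfiltration risk: an agent that lets a model
# compose its own web-search queries gives the model an outbound
# channel. No real model or network is involved here.

SECRET_CONTEXT = "Q3 margins: 14%. Acquisition target: ExampleCorp."

def malicious_model(context: str) -> str:
    """Stands in for a model that decides to 'search' for its context."""
    return "SEARCH: " + context.replace(" ", "+")

def agent_step(model, context, search_fn):
    action = model(context)
    if action.startswith("SEARCH: "):
        # The query leaves the sandbox: whoever operates the search
        # endpoint now holds the private context.
        return search_fn(action[len("SEARCH: "):])
    return action

exfiltrated = []  # stands in for the third-party search endpoint's logs
agent_step(malicious_model, SECRET_CONTEXT, exfiltrated.append)

assert "ExampleCorp" in exfiltrated[0]  # the secret left via the "search"
```

No passwords or code execution were needed: the model only did what the agent framework allowed it to do.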
Then use those as training data. You're too caught up on this exacting definition of open source that you'll completely ignore the benefits of what this model could provide.
That's not how LLMs work, and you know it. A model of weights is not a lossless compression algorithm.
Also, if you're giving an LLM free rein over all of your session tokens and security passwords, that's on you.
https://www.piratewires.com/p/compression-prompts-gpt-hidden-dialects
There are more trade secrets than session tokens and security passwords. People want AI agents to summarize their local knowledge base and documents, then expand it with updated web searches. No passwords needed when the LLM can order the data to be exfiltrated directly.
Those security concerns seem completely unrelated to the model, though. You can have a completely open source model that fits all those requirements, and still give it too much unfettered access to important resources with no way of actually knowing what it will do until it tries.
While unfettered access is bad in general, DeepSeek takes it a step further: the Mixture of Experts approach to reducing computational load is great when you know exactly which "Experts" it's using, and not so great when there is no way to check whether some of those "Experts" might be focused on extracting intelligence under specific circumstances.
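For context, Mixture of Experts routing is just a learned gate that picks which sub-networks run for each input. A minimal sketch (toy numbers and toy "experts", not DeepSeek's actual architecture) showing why the selected experts are opaque to the user:

```python
import math

# Minimal Mixture-of-Experts routing sketch: a gate scores each expert
# per input and only the top-k run, which keeps compute per token low
# even when the total parameter count is huge. The experts themselves
# are opaque functions, like the real sub-networks inside a model.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Four toy "experts"; with only weights, you cannot tell what each does.
experts = [lambda x: 2 * x, lambda x: x + 10, lambda x: -x, lambda x: x * x]

def moe_forward(x, gate_scores, k=2):
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    # Only the selected experts execute; the rest are skipped entirely.
    return sum((probs[i] / total) * experts[i](x) for i in top), top

y, used = moe_forward(3.0, gate_scores=[2.0, 1.0, -1.0, 0.5])
print(used)  # which experts actually ran for this input
```

The worry raised above is that, without training data or code, there is no way to audit what any individual expert was trained to do, or which inputs route to it.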
I agree that you can't know if the AI has been deliberately trained to act nefariously given the right circumstances. But I maintain that it's (currently) impossible to know if any AI has been inadvertently trained to do the same. So the security implications are no different. If you've given an AI the ability to exfiltrate data without any oversight, you've already messed up, no matter whether you're using a single AI you trained yourself, a black box full of experts, or DeepSeek directly.
But all this is about whether merely sharing weights is "open source", and you've convinced me that it's not. There needs to be a classification, similar to "source available"; this would be like "weights available".
Thanks for the explanation. I don't understand enough about large language models to give a valuable judgement on this whole Deepseek happening from a technical standpoint. I think it's excellent to have competition on the market and it feels that the US' whole "But they're spying on you and being a national security risk" is a hypocritical outcry when Facebook, OpenAI and the like still exist.
What do you think about DeepSeek? If I understood correctly, it's being trained on the output of other LLMs, which makes it much cheaper but, it seems to me, also even less trustworthy, because now all the actual human training data is missing; instead it's a bunch of hallucinations, lies, and (hopefully more often than not) correctly guessed answers to questions asked by humans.
There are several parts to the "spying" risk:
Sending private data to a third-party server for the model to process... well, you just sent it, game over. Use local models, or machines (hopefully) under your control, or ones you trust (AWS? Azure? GCP?... maybe).
All LLMs are black boxes; the only way to make an educated guess about their risk is to compare the training data and procedure with the evaluation data of the final model. There is still a risk of hallucination and deception, but it can be quantified to some degree.
DeepSeek uses a "Mixture of Experts" approach to reduce computational load... which is great, as long as you trust the "Experts" they use. Since the LLM that was released for free is still a black box, and there is no way to verify which "Experts" were used to train it, there is also no way to know whether some of those "Experts" might or might not be trained to behave maliciously under specific conditions. It could just as easily be a Trojan Horse, with little chance of being detected until it's too late.
The feedback degradation of an LLM happens when it gets fed its own output as part of its training data. We don't know exactly what training data was used for DeepSeek, but as long as it was generated by some different LLM, there would be little risk of a feedback reinforcement loop.
Generally speaking, I would run the DeepSeek LLM in an isolated environment, but not trust it to be integrated into any sort of non-sandboxed agent. The downloadable smartphone app is possibly "safe", as long as you restrict the hell out of it, don't let it access anything on its own, and don't feed it anything remotely sensitive.
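The feedback-degradation point can be illustrated with a statistical caricature (this is not how any real LLM is trained): if each "generation" is trained on slightly over-confident samples of the previous one's output, diversity shrinks generation after generation.

```python
import random
import statistics

# Toy illustration of feedback degradation: a "model" repeatedly
# retrained on its own narrowed samples loses diversity over time.
# Purely a statistical caricature, not real LLM training.

random.seed(0)  # deterministic for reproducibility

def sample_generation(mean, stdev, n=1000, sharpening=0.9):
    """Sampling at an effectively lowered 'temperature' narrows the output."""
    return [random.gauss(mean, stdev * sharpening) for _ in range(n)]

mean, stdev = 0.0, 1.0          # generation 0: trained on human data
spread = [stdev]
for _ in range(10):             # each generation trains on the last one's output
    data = sample_generation(mean, stdev)
    mean, stdev = statistics.fmean(data), statistics.stdev(data)
    spread.append(stdev)

assert spread[-1] < spread[0]   # diversity shrinks across generations
print(spread[0], spread[-1])
```

Training on a *different* model's output avoids this particular loop, though it inherits that model's biases instead.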
Nobody releases training data. It's too large and varied. The best I've seen was the LAION-2B set that Stable Diffusion used, and that's still just a big collection of links. Even that isn't going to fit on a GitHub repo.
Besides, improving the model means using the model as a base and implementing new training data. Specialize, specialize, specialize.
That's why it's not Open Source. They do not release the source, and it's impossible to build the model from source.
What about these? Dozens of TB here:
https://huggingface.co/HuggingFaceFW
There is also a LAION-5B now, and several other datasets.
Wow, it's like you didn't even read my post.
Can you actually explain what in my reply is "fear, uncertainty, and doubt"? Did you actually read it? I even linked to the specific GitHub repository, which is basically empty. You just linked to an overview, which does not point to any source code.
Please explain what's FUD, and link to the source code. Otherwise, don't accuse people of FUD when you don't know what you're talking about.
You're purposely being obtuse, and not arguing in good faith. The source code is right there, in the other repos owned by the deepseek-ai user.

What are you talking about? What bad faith are you accusing me of? I asked you to show me the repository that contains the source code. There is none. Please give me a link to the repo you have in mind. Where is the source code and training data of DeepSeek-R1? Can we build the model from source?