this post was submitted on 11 Sep 2023
634 points (98.3% liked)

Ask Lemmy

[–] circuitfarmer@lemmy.sdf.org 108 points 1 year ago (2 children)

Technically not my industry anymore, but: companies that sell human-generated AI training data to other companies most often are selling data that a) isn't 100% human generated or b) was generated by a group of people pretending to belong to a different demographic to save money.

To give an example, let's say a company wants a training set of 50,000 text utterances of US English for chatbot training. More often than not, this data will be generated using contract workers in a non-US locale who have been told to try to sound as American as possible. The Philippines is a common choice at the moment, where workers are often paid between $1 and $2 an hour: more than an order of magnitude less than what it would generally cost to use real US English speakers.

In the last year or so, it's also become common to generate all of the utterances using a language model, like ChatGPT. Then, you use the same worker pool to perform a post-edit task (look at what ChatGPT came up with, edit it if it's weird, and then approve it). This reduces the time that the worker needs to spend on the project while also ensuring that each datapoint has "seen a set of eyes".

Obviously, this makes for bad training data -- for one, workers from the wrong locale will not be generating the locale-specific nuance that is desired by this kind of training data. It's much worse when it's actually generated by ChatGPT, since it ends up being a kind of AI feedback loop. But every company I've worked for in that space has done it, and most of them would not be profitable at all if they actually produced the product as intended. The clients know this -- which is perhaps why it ends up being this strange facade of "yep, US English wink wink" on every project.

[–] IphtashuFitz@lemmy.world 20 points 1 year ago

A couple decades ago I worked for a speech recognition company that developed tools for the telephony industry. Every week or two all the employees would be handed sheets of words or phrases with instructions to call a specific telephone extension and read them off. That’s how they collected training data…

[–] Bluetreefrog@lemmy.world 14 points 1 year ago (1 children)

I'm not surprised tbh. I've perused some of the text training datasets and they were pretty bad. The classification is dodgy too. I ended up starting my own dataset because of this.

[–] Linus_Torvalds@lemmy.world 1 points 1 year ago

What do you mean by 'classification'? Sentiment analysis?