635

What is a well known 'public secret' in the industry you work in that the majority of outsiders are unaware of? (lemm.ee)

submitted 1 year ago by NotSpez@lemm.ee to c/asklemmy@lemmy.world

659 comments


[–] circuitfarmer@lemmy.sdf.org 108 points 1 year ago (2 children)

Technically not my industry anymore, but: companies that sell human-generated AI training data to other companies most often are selling data that a) isn't 100% human generated or b) was generated by a group of people pretending to belong to a different demographic to save money.

To give an example, let's say a company wants a training set of 50,000 text utterances of US English for chatbot training. More often than not, this data will be generated using contract workers in a non-US locale who have been told to try to sound as American as possible. The Philippines is a common choice at the moment, where workers are often paid between $1 and $2 an hour: more than an order of magnitude less than what it would generally cost to use real US English speakers.

In the last year or so, it's also become common to generate all of the utterances using a language model, like ChatGPT. Then, you use the same worker pool to perform a post-edit task (look at what ChatGPT came up with, edit it if it's weird, and then approve it). This reduces the time that the worker needs to spend on the project while also ensuring that each datapoint has "seen a set of eyes".
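For anyone who wants the post-edit step spelled out, here is a minimal Python sketch of that workflow. It is only an illustration under my own assumptions: the `Utterance` record, the `post_edit` function, and the `toy_reviewer` stand-in are all hypothetical names, and the drafts are hard-coded rather than pulled from a real language model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Utterance:
    text: str               # final text of the datapoint
    source: str             # "model" for machine-generated drafts
    reviewed: bool = False  # True once a worker has looked at it
    edited: bool = False    # True if the worker changed the draft


def post_edit(drafts: list[str],
              review: Callable[[str], Optional[str]]) -> list[Utterance]:
    """Run each machine-generated draft past a reviewer.

    `review` returns None to approve the draft unchanged,
    or a corrected string if the draft looked weird.
    """
    results = []
    for draft in drafts:
        correction = review(draft)
        final = draft if correction is None else correction
        results.append(Utterance(text=final,
                                 source="model",
                                 reviewed=True,
                                 edited=correction is not None))
    return results


def toy_reviewer(draft: str) -> Optional[str]:
    # Hypothetical stand-in for a human worker: fixes one British spelling.
    fixed = draft.replace("colour", "color")
    return None if fixed == draft else fixed


# Hard-coded drafts standing in for language-model output (assumption).
drafts = ["What colour is the sky?", "Can you reset my password?"]
dataset = post_edit(drafts, toy_reviewer)
```

Note that every record ends up with `reviewed=True` while `source` stays `"model"`: each datapoint has been looked at, but it is still machine-generated text underneath.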

Obviously, this makes for bad training data -- for one, workers from the wrong locale will not be generating the locale-specific nuance that is desired by this kind of training data. It's much worse when it's actually generated by ChatGPT, since it ends up being a kind of AI feedback loop. But every company I've worked for in that space has done it, and most of them would not be profitable at all if they actually produced the product as intended. The clients know this -- which is perhaps why it ends up being this strange facade of "yep, US English wink wink" on every project.

[–] IphtashuFitz@lemmy.world 20 points 1 year ago

A couple decades ago I worked for a speech recognition company that developed tools for the telephony industry. Every week or two all the employees would be handed sheets of words or phrases with instructions to call a specific telephone extension and read them off. That’s how they collected training data…

[–] Bluetreefrog@lemmy.world 14 points 1 year ago (1 children)

I'm not surprised tbh. Having perused some of the text training datasets, I found they were pretty bad. The classification is dodgy too. I ended up starting my own dataset because of this.

[–] Linus_Torvalds@lemmy.world 1 points 1 year ago

What do you mean by 'classification'? Sentiment analysis?