I’d like to explore a project for NLP-based search on the fediverse. But I’m a fediverse beginner and am not sure whether it’s possible to index fediverse content.
My general idea is:
- Set up my own read-only instance, let’s say of kbin. I’m not sure if the concept of a read-only instance makes sense. It’s read-only because the instance only needs to be able to read the content already on the fediverse and doesn’t need the ability to post content.
- At some regular interval, let’s say once a day, monitor any changes in the content from the previous run. I’m not sure if there is a single “fediverse” where all the content can be read from. If not, then I can start with tracking the same content as on kbin.social. Is it possible to monitor changes to content on a kbin instance?
- I’ll convert the content into vector embeddings using an NLP ML model like CLIP. The embeddings will be stored in a vector store. The vector store will also include the URL of the content as metadata.
- When a user requests a search, the search term is converted to its vector embedding using the same ML model and the most similar vectors are identified.
- The user gets the search results as urls of the most relevant content, and perhaps a preview of the content. The user can then access the full content from where it’s originally posted using its url.
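To make steps 3 and 4 concrete, here is a rough sketch of the embed-and-search part. It is only an illustration: `embed` is a placeholder (a hashed bag of words) standing in for the real ML model, and `VectorStore` is a hypothetical in-memory class rather than any particular vector database. The idea is just that the query and the content go through the same embedding function and results come back as URLs ranked by cosine similarity.

```python
import hashlib
import math

DIM = 64  # toy embedding size; a real model would dictate this


def embed(text):
    """Placeholder embedding: hashed bag of words, L2-normalised.

    Assumption: a real setup would call the ML model here instead.
    """
    vec = [0.0] * DIM
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class VectorStore:
    """Hypothetical in-memory store keeping (url, vector) pairs."""

    def __init__(self):
        self.items = []

    def add(self, url, text):
        # The URL is stored as metadata next to the embedding.
        self.items.append((url, embed(text)))

    def search(self, query, k=3):
        # Vectors are unit length, so the dot product is the cosine similarity.
        q = embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), url)
            for url, vec in self.items
        ]
        scored.sort(reverse=True)
        return [(url, score) for score, url in scored[:k]]
```

Usage would look like `store.add(url, text)` for each piece of content during the daily run, then `store.search("some query")` at request time to get back `(url, score)` pairs.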
I’m comfortable with setting up steps 3 and 4. But I do not know the fediverse well enough to say whether steps 1, 2, and 5 would work, or even make sense the way I’m envisioning them.
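For step 2, what I have in mind is roughly the sketch below. It assumes I can somehow obtain the current content as a mapping from URL to text, which is exactly the part I don’t know how to do on the fediverse. The function names `fingerprint` and `diff_snapshot` are made up for illustration. Each run hashes the text, compares it with the hashes saved from the previous run, and reports what needs to be re-embedded or dropped from the index.

```python
import hashlib


def fingerprint(text):
    """Stable hash of a post's text, used to spot edits between runs."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_snapshot(previous, current_posts):
    """Compare this run's posts with the previous run's fingerprints.

    previous: dict of url -> fingerprint saved by the last run.
    current_posts: dict of url -> text fetched in this run (assumed input).
    Returns (new_or_changed_urls, removed_urls, updated_snapshot).
    """
    snapshot = {url: fingerprint(text) for url, text in current_posts.items()}
    changed = sorted(u for u, h in snapshot.items() if previous.get(u) != h)
    removed = sorted(u for u in previous if u not in snapshot)
    return changed, removed, snapshot
```

Only the URLs in the first list would need new embeddings, and the ones in the second list would be deleted from the vector store.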
Can some of the fediverse veterans help me understand if this is a feasible approach or if I’ve got it all wrong?