this post was submitted on 31 Jul 2023

Well hello again. I have just learned that the host that recently had both NVMe drives fail upon drive replacement now has new problems: the filesystem reports permanent data errors affecting the databases of both the Matrix server and the Telegram bridge.

I have just rented a new machine and am about to restore the database snapshot from July 26, just in case. All the troubleshooting over the past few days was very exhausting; however, I will try to do or at least prepare this within the next few hours.

Update

After a rescan the errors went away; however, the drives logged errors too. The question now is whether the data integrity can be trusted.

Status August 1st

Well ... good question... optimizations were made last night, the restore was successful and ... we are back to debugging outgoing federation :(


The new hardware will also be a bit more powerful... and yes, I have not forgotten that I wanted to update that database. It's just that I was busy debugging federation problems.

[–] milan@discuss.tchncs.de 1 points 1 year ago* (last edited 1 year ago) (1 children)

I am a bit confused now... the spare was at 98%, as can be read in my snippet above ... where does it say "no spare available"? I think it is on me to request a swap, and that's what I did, as even the one with slightly less wear reported 255% used, which afaik is an approximate estimate of how much of the rated lifetime has been used up, based on r/w cycles (not sure about all factors).
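
To make the relation between the two values a bit more concrete, here is a minimal Python sketch of how one might read them together. The function name `assess_nvme_wear` and the 10% spare threshold are assumptions made for illustration and are not taken from the actual drive output; the only values from this thread are 98% spare and 255% used. As far as I know, NVMe caps the "percentage used" field at 255, so that reading only says "somewhere beyond 254%".

```python
from typing import List

# NVMe reports any "percentage used" value above 254 as 255.
SATURATED_PERCENTAGE_USED = 255


def assess_nvme_wear(available_spare: int, spare_threshold: int,
                     percentage_used: int) -> List[str]:
    """Return warnings for three NVMe SMART health fields (all in percent).

    Hypothetical helper for illustration only; it does not query a drive.
    """
    warnings = []
    if available_spare < spare_threshold:
        warnings.append(
            f"available spare {available_spare}% is below "
            f"threshold {spare_threshold}%"
        )
    if percentage_used >= SATURATED_PERCENTAGE_USED:
        warnings.append(
            "percentage used is saturated at 255: "
            "wear is far beyond the rated endurance"
        )
    elif percentage_used >= 100:
        warnings.append(
            f"percentage used {percentage_used}%: rated endurance consumed"
        )
    return warnings


if __name__ == "__main__":
    # 98% spare and 255% used are the values mentioned in this comment;
    # the 10% threshold is an assumed example value.
    for warning in assess_nvme_wear(98, 10, 255):
        print(warning)
```

So a drive can still show plenty of spare blocks while the wear estimate is already maxed out, which is why I asked for the swap.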

The one the hoster left in for me to play with, said no:

[Wed Jul 26 19:19:10 2023] nvme nvme1: I/O 9 QID 0 timeout, disable controller
[Wed Jul 26 19:19:10 2023] nvme nvme1: Device shutdown incomplete; abort shutdown
[Wed Jul 26 19:19:10 2023] nvme nvme1: Removing after probe failure status: -4

Tried multiple kernel flags and such but couldn't get past that error. Would have been interesting to have the hoster ship the thing to me (and maybe that would have been a long enough cooldown to get the thing working again), but I assume that would have been expensive from Helsinki.

[–] Haui@discuss.tchncs.de 1 points 1 year ago

My bad. I must have misread. Sorry.

Yes, shipping it to you would probably have been a good idea. Does it cost a lot less to use the Helsinki location? Otherwise Falkenstein would be a pretty good alternative, I guess.