6 changes: 6 additions & 0 deletions docs/news/posts/blogger-post-2006-09-23-1.md
@@ -0,0 +1,6 @@
---
date: 2006-09-23
---

# Open Computing Facility Blog Launch
In an effort to improve communication with our users, the Open Computing Facility has launched a blog! Status and feature updates, random staff trivia, and other tasty bits of information will be posted here on a regular basis. This blog will also serve as a means for notifying users of unexpected service interruptions.
6 changes: 6 additions & 0 deletions docs/news/posts/blogger-post-2006-09-25-1.md
@@ -0,0 +1,6 @@
---
date: 2006-09-25
---

# Random Issues and Printer Information
The Open Computing Facility is experiencing some issues with our web server and some core services at the moment. We're working on figuring out what's broken (for the technically-minded, our NFS server seems to be borked) and getting it working again. I'll post status updates as they come.

Oh, and we've decided to order another maintenance kit for our printer. You might have noticed splotches appearing on printouts in the last few weeks -- our fuser was slowly dying, so we ordered a kit to fix it. Unfortunately, our order apparently got lost by CDW, so we're overnighting another kit. Our printer should be back soon! For now, we're using one of our older printers, so print times might be a bit long...

**UPDATE #1**: It's past 1 AM, and the OCF is still broken. sluo came in and is working diligently on fixing stuff because I don't know enough about NIS+ and NFS to do anything :/

**UPDATE #2**: Our super ex-SM was able to bring most services up by 2 AM. I think. Anyway, a bunch of other staffers showed up to provide moral support (by playing Quake 3). Here are some pictures of us staffers working for you -- we're always working for you (with regards to Verizon)...

[![OCF Tech Support Party at 2 AM](http://static.flickr.com/97/252229713_cb04a9bbc1_m.jpg)](http://www.flickr.com/photos/ocf/252229713/)

[![OCF Tech Support Party at 2 AM](http://static.flickr.com/120/252239895_68f171b90d_m.jpg)](http://www.flickr.com/photos/ocf/252239895/)

**UPDATE #3**: It's almost 3 AM, and we're still in here. Anybody know of any 24-hour food places that are within walking distance of Sproul or that deliver?
6 changes: 6 additions & 0 deletions docs/news/posts/blogger-post-2006-09-26-1.md
@@ -0,0 +1,6 @@
---
date: 2006-09-26
---

# Printer Fixed!
Our printer repair kit arrived this morning, and I installed it into our printer with tender loving care. Print jobs are now whizzing through the print queue, and there are no more splotches on the printouts (thanks to the new fuser)!

Now I just wonder how long this new printer happiness is going to last before the next breakdown...
6 changes: 6 additions & 0 deletions docs/news/posts/blogger-post-2006-09-29-1.md
@@ -0,0 +1,6 @@
---
date: 2006-09-29
---

# Penguins! Everywhere!
[![Penguins in the lab](http://photos1.blogger.com/blogger/7260/3883/320/singleframe.jpg)](http://photos1.blogger.com/blogger/7260/3883/1600/singleframe.jpg)

[PENGUINS](http://www.ocf.berkeley.edu/%7Ejameson/img/singleframe.jpeg), [PENGUINS](http://www.ocf.berkeley.edu/%7Ejameson/img/singleframe2.jpeg) IN THE LAB!!!

AHH [PENGUINS](http://www.ocf.berkeley.edu/%7Ejameson/img/singleframe3.jpeg) [EVERYWHERE](http://www.ocf.berkeley.edu/%7Ejameson/img/singleframe5.jpeg)!!!

-jlee

Callug [penguins](http://www.ocf.berkeley.edu/%7Ejameson/img/singleframe4.jpeg) took over the OCF on Friday at 2:30 AM.
6 changes: 6 additions & 0 deletions docs/news/posts/blogger-post-2006-10-02-1.md
@@ -0,0 +1,6 @@
---
date: 2006-10-02
---

# Shiny New Lab and Computers
We've cleaned up the OCF lab and installed a bunch of super fast new computers running Debian Linux. Please give the machines a shot; IMHO, they're much easier to use and more reliable than our Windows systems (this is not a dig at the Windows team, who have been practically living in the OCF to ensure that the systems work).
6 changes: 6 additions & 0 deletions docs/news/posts/blogger-post-2006-10-06-1.md
@@ -0,0 +1,6 @@
---
date: 2006-10-06
---

# We're Always Working for You!
Even in our sleep...

[![Jameson Hard at Work](http://static.flickr.com/109/261878171_5ca76f7354_m.jpg)](http://www.flickr.com/photos/sle/261878171/)
6 changes: 6 additions & 0 deletions docs/news/posts/blogger-post-2006-10-12-1.md
@@ -0,0 +1,6 @@
---
date: 2006-10-12
---

# (More) Unscheduled Downtime
The OCF is currently (mostly) down due to a hardware failure on one of our core servers (specifically, war appears to have lost its SCSI controller). We're working to restore service; check back here for updates.
6 changes: 6 additions & 0 deletions docs/news/posts/blogger-post-2006-10-12-2.md
@@ -0,0 +1,6 @@
---
date: 2006-10-12
---

# Status Update
We're still working on the problem. It seems like our disk array that holds all user data is having some troubles. I'm currently backing up all user mail to a safe location, and we're working on doing the same with home directories (i.e., your regular data and web space).
6 changes: 6 additions & 0 deletions docs/news/posts/blogger-post-2006-10-12-3.md
@@ -0,0 +1,6 @@
---
date: 2006-10-12
---

# Status Update
Mail is pretty much safe for the moment. We have two complete backups of mail stored on different systems. We're getting some errors while performing our initial backup of user data, but we're hoping that these errors are only temporary (i.e., they'll be solved when we fsck the file system).

Some users have requested an ETA, and, for the moment, it seems the earliest the OCF will be back up and in working order is this weekend. No guarantees, though. Although we know how important it is for our users to have access to their mail, data, and web services, we're trying to take our time and do everything right so no user data is lost. If there are any special circumstances or issues that you believe we should be aware of, please visit our IRC channel (#ocf on irc.ocf.berkeley.edu).
6 changes: 6 additions & 0 deletions docs/news/posts/blogger-post-2006-10-13-1.md
@@ -0,0 +1,6 @@
---
date: 2006-10-13
---

# Status Update
We're still backing up the rest of user data on our disk array to some spare space we have on our servers. Since we have upwards of 400 GB of data, and we're transferring most of it over NFS (regular Ethernet and not SCSI or Fibre Channel), it's taking a long time.

Some users have asked about data loss during this recovery. Most mail daemons should be smart enough to retry delivery once service to the OCF is restored. If our downtime ends up becoming prolonged, we will try to figure out a way to queue mail so it doesn't end up getting bounced.

In regards to user data (i.e., anything other than mail), we're pulling the data off the disk array as quickly as possible. So far, it seems like most user data is intact; we're only getting about 1-2% corruption. That's not to say that that 1-2% of data is lost; we're just pulling the good data from the disk array at the moment. We haven't even begun to run the Unix equivalent of Scandisk, so it seems like there's a good chance we'll have 100% data recovery. Keep your fingers crossed, though.

Beyond the fact that we're working with such a massive amount of data, one of the holdups in our recovery is acquiring an LSI Logic PCI-X SAS/SATA host controller that supports Sun Solaris SPARC so we can set up a staging area to back up our disk array. If you don't understand what all those acronyms mean, let's just say you can't walk into any CompUSA or Best Buy and find that card. The only place that seems to carry the card is Newegg, but it's $300, and, even with overnight shipping, the earliest we're getting it is Monday.

Yury and I (the current Site Managers) have been taking long shifts in the OCF to get user data back, and most of the other staff have been around to provide assistance (thanks, sluo, for saving us when we don't know Solaris 10!), so we're working on it!
6 changes: 6 additions & 0 deletions docs/news/posts/blogger-post-2006-10-13-2.md
@@ -0,0 +1,6 @@
---
date: 2006-10-13
---

# Status Update
Some OCF staffers are making a trip to Fremont to pick up some 500 GB drives so we have more space to back up user data. They should be back in Berkeley by 5 PM today, and I'll be working through the night to set up the drives so that we can dump data to them.

Home directories have been successfully backed up, and we're currently going through web space. After web space, we'll have Microsoft Windows profiles and MySQL/PostgreSQL databases left to back up (I'm sure the other staffers will correct me if I'm missing something here).

Oh, and in regards to a user's comment, yes, mail in other directories should be restored (it's part of the home directories, which are almost done being dumped from the array).

Thanks for all your support!
6 changes: 6 additions & 0 deletions docs/news/posts/blogger-post-2006-10-14-1.md
@@ -0,0 +1,6 @@
---
date: 2006-10-14
---

# Status Update
We're almost done backing up all user data. While the backup was going on, we were able to assemble a simple 1 TB ZFS array using our newly acquired 500 GB Seagate SATA drives from Fry's. Once the backup finishes, we'll do a raw dump of the disk array (where user data was stored) to our ZFS array (which we built just last night). This will provide a secondary backup, just in case things go wrong -- we want to be extra careful with user data. Once that completes, we'll perform an fsck of the disk array, and, if everything goes well, most or all user data should be safe and accessible, and we'll start bringing OCF services back up. In other words, if everything does go well, some OCF services should be back up by the end of the weekend.

Now, if things go wrong and the disk array starts spitting out errors, we're going to attempt to recover data from our ZFS array (it's basically our backup of the backup). If that would result in too much downtime, we'll dump our first backup onto the disk array (that is, the copy where we have 99% of user data or so) and work on bringing the OCF back up as quickly as possible. That way, users will have access to their data as soon as possible, and we'll work on restoring the extra 1% of data as we can from our secondary backup, without too much time pressure.
6 changes: 6 additions & 0 deletions docs/news/posts/blogger-post-2006-10-15-1.md
@@ -0,0 +1,6 @@
---
date: 2006-10-15
---

# Status Update
The fsck didn't go so well. We're restoring our secondary backup of the disk array and going for another attempt at fixing the file system. This will be our final attempt at repairing the file system; we don't want to prolong our downtime, since the process of restoring the backup to the disk array takes upwards of 10 hours. If we are unable to restore the file system, we'll wipe the disk array clean, create a new UFS file system, and rebuild user data from the tar archives we created on Friday and Saturday.

That is, our first attempt at repairing the file system failed. We're going to try again, but we're trying to balance our recovery efforts with minimizing downtime. If we can't repair our file system, we're just going to wipe the slate clean and pull data from an archive we made, which may be missing a very small fraction of user data (basically the data that was damaged during the initial hardware failure). Our worst-case estimate is around 1% data loss; most users won't be affected, and for the users who are, most of the files we were unable to recover seem to be unimportant files (browser cache files, temporary lock files, etc.).

So, just to be clear: we're trying our best to get 100% data recovery, but doing so while minimizing downtime is difficult. Our worst-case scenario is bringing back the OCF with about 99% of the data intact and working with users to recover any important data from the 1% that may be lost.
6 changes: 6 additions & 0 deletions docs/news/posts/blogger-post-2006-10-16-1.md
@@ -0,0 +1,6 @@
---
date: 2006-10-16
---

# Status Update
I'm about to head back to the OCF and swap our SCSI connectors to our disk array so we can continue with our fsck recovery efforts. In the meantime, the other staffers are working on bringing our mail server back online so we can queue incoming mail.
6 changes: 6 additions & 0 deletions docs/news/posts/blogger-post-2006-10-16-2.md
@@ -0,0 +1,6 @@
---
date: 2006-10-16
---

# Status Update
A staffer requested that I create a new post for every status update, as it'll be much easier for people using RSS readers to stay informed, so I've taken all the previous status updates and created a new post for each one and will continue doing so in the future.

Our second attempt at repairing the file system failed. Therefore, per the course of action I mentioned in my last status update, we've decided to wipe the disk array clean and rebuild the file system from our primary backup. This should help minimize downtime and get the OCF back to peak performance ASAP.

Once the dust settles, we (the Site Managers) will probably be sending out a more formal email message describing the failure, our response to it, and how we plan on avoiding such failures in the future.

Oh, and I should note that we're currently queuing incoming mail.
6 changes: 6 additions & 0 deletions docs/news/posts/blogger-post-2006-10-16-3.md
@@ -0,0 +1,6 @@
---
date: 2006-10-16
---

# Status Update
A new file system was created on the disk array around 5 AM this morning, and we've been transferring data back to the array over NFS and regular Ethernet (silly endian issues prevent us from connecting the array directly to our backup system using U320 SCSI). At the current rate, we should be finished transferring the files over sometime tomorrow. A staffer might be able to drop by the lab today and figure out a better wiring method to improve transfer speeds and get data onto the array faster.

Also, here's a lesson for all system administrators out there: DO NOT BUY DISK ARRAYS FROM SHADY VENDORS. Since our budget is relatively limited, we've always been pretty conservative with our purchases and have primarily relied upon donations to keep us going (thanks to Sun Microsystems for our super fast servers!). Consequently, when we needed to expand the OCF's disk offerings, we were only able to justify the purchase of a 'budget' disk array. That was two years ago. Our disk array is currently failing, and the company we bought it from went out of business and was bought by some other company, which only wants to perform service via RMA through a process that might take 15 days.

ARE THEY FREAKING INSANE? So, we're supposed to send them our 3U form-factor disk array with 12 drives via postal mail and be down for 15 more days? Uh, no thanks.

OK, the end of my status update and rant.
6 changes: 6 additions & 0 deletions docs/news/posts/blogger-post-2006-10-17-1.md
@@ -0,0 +1,6 @@
---
date: 2006-10-17
---

# Status Update
Files are still being copied to the array, but at the current rate, we should definitely have all files back on our disk array by Tuesday morning. If all goes well, the OCF should be back up and running by Tuesday night or Wednesday morning. I hope.

Since we're already down, we've decided to migrate our mail service over to a much faster server. Hopefully that'll allow some good to come out of this entire mess...
6 changes: 6 additions & 0 deletions docs/news/posts/blogger-post-2006-10-17-2.md
@@ -0,0 +1,6 @@
---
date: 2006-10-17
---

# Status Update
All user data has been restored to the disk array. I'm currently running an fsck on the array just to make sure that the array hasn't already corrupted the data. We'll be keeping very regular backups of user data until we can figure out what's wrong with our disk array or until we can get it serviced, so there shouldn't be any more extended downtime like this (at least as a result of the disk array).

We're on track for the OCF coming back online sometime tonight or tomorrow morning. As an added safety precaution, I'm currently setting up a two-way RAID-1 mirror with a hot spare on our primary NFS server (basically the computer that serves all user files) to make everything triply redundant.

Thanks for all your support through this process!
6 changes: 6 additions & 0 deletions docs/news/posts/blogger-post-2006-10-18-1.md
@@ -0,0 +1,6 @@
---
date: 2006-10-18
---

# OCF Sorta Back Up
We've just finished restoring most OCF services. Home directories and web pages should work, and logins to all our general servers should also work. There might be some glitches here and there as we put the finishing touches on our restoration, though, so please bear with us.

Mail is still offline, but we're still queuing it. The main reason mail remains offline is that we need to run all the mail we've queued up through our mail delivery system again, and we can only really do that once we're sure everything works.
6 changes: 6 additions & 0 deletions docs/news/posts/blogger-post-2006-10-19-1.md
@@ -0,0 +1,6 @@
---
date: 2006-10-19
---

# More Services Being Restored
We've re-enabled access to mail that was delivered before our array failure on Thursday. Users should be able to read and manage their old mail. Unfortunately, you'll need to log in to one of our shell servers to read your mail; we haven't brought POP or IMAP back up yet. At the same time, we're working on delivering mail that was queued during the downtime. Please have patience; we're working very hard on it.

Also, we're aware of the issues with databases and are trying to debug the problem.

I'm currently attempting to probe our disk image of the array to see if I can find any files that were lost during the restoration process. Since the image is about 1 TB, it's not going as quickly as I'd like, but it's still running.
6 changes: 6 additions & 0 deletions docs/news/posts/blogger-post-2006-10-22-1.md
@@ -0,0 +1,6 @@
---
date: 2006-10-22
---

# Mail and Database Status Update
We've begun delivering mail that was queued up during our array failure outage. We've also enabled IMAP and POP3 access again, so users should be able to read their mail using their favorite mail client. It may take a couple of days for all mail to be delivered; hundreds of thousands of messages were queued up during the outage.

MySQL databases have been restored as well as we can restore them. Once we re-import all the databases into our MySQL server, we'll bring MySQL back online. A small number of users have irrecoverable errors in their databases; we will be contacting each of the affected users individually to work with them on recovering their data.
6 changes: 6 additions & 0 deletions docs/news/posts/blogger-post-2006-10-26-1.md
@@ -0,0 +1,6 @@
---
date: 2006-10-26
---

# Random Outages
Over the course of the past day and a half, we have been experiencing some random errors with our primary authentication system. These errors have led to login difficulties for some users and some other problems in the physical lab (printer queue jams, frozen terminals, etc.). We're not quite sure what's causing the problems, but we have a pretty good idea it's related to our use of NIS+ (an old standard developed by Sun Microsystems that has since been deprecated). Thanks to sluo, we were able to recover from these errors, so everything should be up and working now.

In regards to our other services:

The mail queue is still being processed, but there's still a huge chunk of mail left in the queue.

MySQL databases should be restored as well as we can restore them. Users with data that we have identified as problematic to recover will be individually contacted via email tomorrow evening (I'm consolidating a list of the errors we received so I can send it all in one pass).

PostgreSQL is currently being looked at and debugged.

We've found a way to get our disk array serviced, but it means sending back critical parts of it. Since downtime is unacceptable, we're going to build a temporary disk array out of commodity parts and use it in a 'hot-swap' manner with our current disk array. I'm waiting for the parts in the mail, though...

Sorry about not updating this blog in the past couple of days, but I've been rather busy, and I only got to leave the OCF around 4 AM yesterday morning.