Pressing Pause on Lion

[UPDATED – 1/26/2012]

There is doing something and doing something right.

When you take on any project, particularly one affecting 1,000+ people, you better make sure you’re doing it as right as possible.

My school is a year and a half into our 1:1 Initiative. We provided every faculty member and student (grades 4-12) with a MacBook Pro and a 500GB hard drive for Time Machine backups. Everyone is an administrator on their own machine, which, as one of my techs (@damienbarrett) will always point out to me, is part of the problem. We don’t let students plug their machines into power while in the classroom (not enough outlets & fire hazard), and we require that they come to school with their machines fully charged each day, expecting that charge to last the entire day. We also said that we would keep the machines as up-to-date as possible with the latest OS and software. Because we want these machines to be the users’ primary devices, we want their access to them to be as close to 24/7/365 as possible.

To meet those last goals (a charge that lasts all day, an up-to-date OS and software, and access as close to 24/7/365 as possible), we planned to replace the battery in every machine and upgrade the OS from Snow Leopard to Lion. We figured we could do this over a two-week period, doing one grade level per day (65-105 machines per day).

The batteries would be easy, but the Lion upgrade needed to be as right as possible. If it wasn’t, there would be a whole lot of upset people, learning with the laptops would slow to a crawl, and the perception (perception = reality) would be that technology just doesn’t work.

For the batteries, we just needed to order them and swap them out on the day of the upgrade, with no testing needed. To get the Lion upgrade done right, though, we would need to test.

PHASE ONE (2 months ahead of full upgrade): We first upgraded members of the technology staff, then our student leaders, followed by a select group of faculty. During this period we reported back issues and application glitches and created learning resources for the full-scale upgrade (http://lion.mka.org).

PHASE TWO (1 week ahead of full upgrade): We upgraded the entire 4th grade a week ahead of the rest of the faculty and students and replaced the batteries in their laptops.

PHASE THREE (full upgrade): Over a two-week period we would upgrade the remaining faculty and student machines.

A day before we reached Phase Three, we had to PAUSE THE PROCESS!

Why now? Why did we wait so long?

Because of the way we upgraded people in Phase One and Phase Two, we didn’t see issues in the context of a full classroom or a full grade level. The issues that we did see were either handled by the users themselves, never reported, or covered by an admittedly temporary patch.

Once Lion was out of the cage in the classroom, we began to see widespread freezing and system lock-ups, pinwheeling (beach-balling), “comas” (the mouse moves but nothing else works), and incredibly slow behavior.

Our upgrade process mirrored what a user would do if they upgraded via the App Store: Lion was installed on top of the existing Snow Leopard OS, a “dirty install”.

We tried numerous things to resolve the issues (thx @damienbarrett):

  1. Permissions repair.
  2. Deep permissions repair and ACL reset from the “Password Reset” tool buried in the Lion Recovery partition.
  3. Clearing system, font, application, and user cache files using OnyX.
  4. Disabling some of the “animations” in the Finder using OnyX.
  5. File system check (fsck), both from the command line (Single User Mode) and from Disk Utility.
  6. Disk directory repair and rebuild using DiskWarrior.
  7. Deletion of all ByHost files in the user account’s ~/Library/Preferences/ByHost folder (NEW – may help).
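Item 7, and the command-line parts of items 1 and 5, can be scripted. The sketch below is a minimal, hypothetical helper, not necessarily how we ran it: it moves a user’s ByHost files aside instead of deleting them, so they can be put back if clearing them doesn’t help, and it lists the Lion-era diskutil and fsck commands only as comments. The function name and the ByHost-backup folder location are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: move (not delete) a user's ByHost preference files into a
# backup folder so they can be restored if clearing them doesn't help.
# Assumptions: run as that user with their apps quit; the function name
# and the backup folder location are hypothetical, for illustration only.

move_byhost_prefs() {
    home_dir="${1:-$HOME}"
    byhost="$home_dir/Library/Preferences/ByHost"
    backup="$home_dir/ByHost-backup"

    moved=0
    for f in "$byhost"/*; do
        # An unmatched glob stays literal, so skip anything that isn't real.
        [ -e "$f" ] || continue
        mkdir -p "$backup"
        mv "$f" "$backup/"
        moved=$((moved + 1))
    done
    # Print how many files were moved (0 if the folder was empty or missing).
    echo "$moved"
}

# Lion-era equivalents of steps 1 and 5 (reference only, not run here):
#   sudo diskutil repairPermissions /   # step 1: repair boot-volume permissions
#   /sbin/fsck -fy                      # step 5: in Single User Mode (Cmd-S at boot)
```

For example, `move_byhost_prefs /Users/jsmith` (a hypothetical account) prints how many files it moved; applications should recreate any ByHost preferences they need the next time they save settings.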

In the end, what solved the majority of the issues was a clean install followed by restoring the user account (data & apps) from its Time Machine backup.

This process, however, takes much longer. It will require us to “push pause on Lion”, do the battery upgrade as scheduled (over the next two weeks), and use the following 6 weeks before spring break for the clean installs and account migrations.

It will also take considerably more time, as we will only be able to do roughly 40 machines per day. Since we will no longer be doing whole grades on a given day, we will need to secure additional loaner machines (roughly 4 dozen) so as not to disrupt learning. What was planned as a two-week process will become a two-month one.
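The rough arithmetic behind the new schedule can be checked in a few lines. This sketch assumes a hypothetical round total of 1,000 machines (the post only says 1,000+ people) and a 5-day school week; the function names are illustrative.

```shell
#!/bin/sh
# Sketch: how many school days a re-imaging pass takes at a fixed daily
# throughput, rounding up because a partial day still costs a day.
# The 1,000-machine total used in the examples is a hypothetical round number.

days_needed() {
    total=$1
    per_day=$2
    # Integer ceiling division: (total + per_day - 1) / per_day
    echo $(( (total + per_day - 1) / per_day ))
}

weeks_needed() {
    # Assumes a 5-day school week; rounds up to whole weeks.
    days=$(days_needed "$1" "$2")
    echo $(( (days + 4) / 5 ))
}
```

At roughly 40 machines per day, `days_needed 1000 40` gives 25 school days, or 5 weeks from `weeks_needed 1000 40`, which fits inside the six weeks before spring break; add the two-week battery pass and the project stretches to about two months. At the originally planned pace of about 100 machines per day, the same total comes out to 10 days, the original two weeks.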

Regardless, the lesson learned here is that with any project you need to be willing to press pause up until the last minute. It may not work out the way that you planned, but in the end it is better to have done it right than to have just done it.

UPDATE (1/23/2012 – Thx @damienbarrett) – Details on the types of problems/behavior we are seeing after the Lion upgrade. We see them both after a “dirty” upgrade over the old OS and after a clean install followed by migrating the user’s folder and applications over from Time Machine.

By order of severity:

  1. Complete system freeze. Description of symptoms: the operating system freezes up almost completely. You can move the mouse, but nothing is clickable. If music is playing, it continues to play, but everything else is “locked up”. The only recourse is to “hard power off” the machine by holding down the power button and then powering it back up.
  2. Constant spinning beachballs. Description of symptoms: applications often enter a state where there is a constant spinning beachball or busy cursor. Sometimes an application is launched and the spinning beachball never goes away. Often, you can force-quit the application and launch it again. Similarly, there are slowdowns in Finder operations: opening folders, navigating folders, moving or copying documents between folders, etc.
  3. Coma. Machines go into a “coma” and will not wake from sleep. As with a freeze, a forced restart is the only recourse.
  4. General sluggishness or slowness. Navigating Finder windows takes a long time. Opening a document takes 2x – 4x longer.
  5. Application wonkiness. Some applications (web browsers, mostly) don’t behave as expected. Sites won’t load, or the browser freezes on websites.

Update (1/24/2012) – The latest thought is that the issue may be related to Time Machine Local Snapshots, which run on laptops when they are not connected to their backup drives. We have disabled local snapshots and will be testing to see if this alleviates the problems/behaviors. This will NOT affect backups when the drive is connected.

Terminal command used to disable: sudo tmutil disablelocal
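Turning snapshots back on later uses the mirror-image tmutil verb, enablelocal. A minimal wrapper for rolling either change out to many machines might look like the sketch below; the function name and the DRY_RUN switch are hypothetical additions for rehearsing a script, not part of tmutil.

```shell
#!/bin/sh
# Sketch: switch Time Machine local snapshots off or on with tmutil
# (the disablelocal/enablelocal verbs available in OS X Lion).
# DRY_RUN=1 only prints the command; that switch is a hypothetical
# convenience for rehearsing a rollout script, not a tmutil feature.

local_snapshots() {
    case "$1" in
        off) verb=disablelocal ;;
        on)  verb=enablelocal ;;
        *)   echo "usage: local_snapshots on|off" >&2; return 2 ;;
    esac

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "sudo tmutil $verb"
    else
        sudo tmutil "$verb"
    fi
}
```

With DRY_RUN=1 set, `local_snapshots off` just prints `sudo tmutil disablelocal`; without it, the command actually runs and prompts for an administrator password.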

[Screenshots: Local Snapshots ON and Local Snapshots OFF]

Update (1/26/2012) – This is what I believe will be the final update on this post, as we have been able to confirm an issue with Time Machine Local Snapshots AND Sophos Anti-Virus.

There is an open thread on the Sophos boards discussing the matter. To paraphrase, it has to do with Sophos scanning the ‘MobileBackups’ folder. Either excluding this folder from scans or disabling local snapshots as I’ve described will help with the problem(s).
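To spot-check whether a given laptop still has a local-snapshot folder for Sophos to scan, something like the following could be run. This is a minimal sketch under an assumption: that on Lion the snapshots live in a hidden folder named .MobileBackups at the root of the startup volume. The function name is hypothetical, and whether the folder disappears right after disabling local snapshots is worth confirming on a test machine.

```shell
#!/bin/sh
# Sketch: report whether a volume still has a local-snapshot folder that
# a real-time anti-virus scanner could be crawling.
# Assumption: on OS X Lion, local snapshots live in /.MobileBackups at the
# root of the startup volume. The function name is hypothetical.

check_mobilebackups() {
    root="${1:-/}"
    # Strip one trailing slash so "/" yields "/.MobileBackups".
    dir="${root%/}/.MobileBackups"
    if [ -d "$dir" ]; then
        echo "present: $dir"
        return 1    # non-zero: snapshot folder found
    fi
    echo "absent: $dir"
    return 0        # zero: nothing for the scanner to crawl
}
```

Run with no argument, it checks the startup volume; the exit status makes it easy to drop into a login or inventory script.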

About William Stites

Currently the Director of Technology for Montclair Kimberley Academy, occasional consultant, serial volunteer for ATIS, husband, and father to two crazy kids who make me smile every day.
This entry was posted in 1to1, Administration & Management, Schools, Technical.