*** cvsupd. - Sep 17 2001 - Fixed by Mike on Sep 24 2001

The old version of cvsupd had the billennium bug: once the number of
seconds since the epoch exceeded 1 billion, cvsupd broke. Upgrading
was a disaster on Linux. It appears the new version was trashing the
boot block in Linux, so nodes were not booting after a cvsupd run.
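
For reference, the rollover itself is easy to reproduce. The sketch
below shows when the epoch counter reached 1 billion and how a parser
that assumes a 9-digit timestamp would misread it. The fixed-width
parser is hypothetical; this entry does not record how cvsupd actually
failed, so treat it as one plausible failure mode and not a diagnosis.

```python
from datetime import datetime, timezone

BILLION = 1_000_000_000

# The instant the Unix clock reached 1,000,000,000 seconds:
# 2001-09-09 01:46:40 UTC.
rollover = datetime.fromtimestamp(BILLION, tz=timezone.utc)


def parse_fixed_width(field: str, width: int = 9) -> int:
    """Hypothetical parser that reads only `width` digits of a timestamp."""
    return int(field[:width])


before = str(BILLION - 1)  # 9 digits
after = str(BILLION)       # 10 digits

# A 9-digit parser is right up to the rollover and wrong just after it:
# the 10-digit value loses its last digit.
ok = parse_fixed_width(before)
truncated = parse_fixed_width(after)
```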


*** mountd/exports - Sep 17 2001 - Needs to be fixed

Reported by Matt on Sep 17 2001, but actually a known bug with the
exports_setup script and the current mountd/kernel impl, which wipes
out all mounts before installing the new set.  This causes transient
failures in NFS access from the testbed nodes since the mounts become
momentarily invalid.


*** Batch Mode Nits - Sep 18 2001 - Fixed by LBS on Sep 25 2001

Reported by Mike.

Nit: experiment create date in Experiment Info "header" is not set.
Is this field meaningless or just not filled in correctly?

Nit: there are two "header" tables shown
I assume this is because the batch code prints out a header and then calls
the regular experiment info script to do the rest.  Anyway, the second table
is missing lots of date info as well.

Nit: web page for expr takes forever to show the nodes that were allocated.
For a regular experiment, the allocated nodes show up almost immediately,
presumably as soon as assign is done.  For a batch experiment, it seems to
take minutes.  I can go out and look at assign.log and see that nodes have
been assigned almost right away, they just don't wind up in the report
(the DB?) for a while.


*** Shell experiments - Sep 18 2001 - Needs to be fixed

Reported by Rob.

Perhaps a feature instead of a bug. Shell experiments remain in the
"new" state, and are thus not cleaned up (nodes freed) when the
experiment is terminated. I view that as intended behaviour, but it's
easy to change, so we should.


*** Broken lilo.conf - Sep 17 2001 - Partially fixed by LBS on Sep 24 2001

Reported by Matt (and then Rob).

According to Matt, our lilo.conf file is broken. In the next version
of the disk image, we should fix it. Matt sent a fixed version to
testbed-ops on Sep. 17 that we could use as a base.

Partially fixed means I hardwired some assembly code to the serial
line at 115200. A bug was causing it to go to the VGA all the time,
and the higher speeds are just a mess in lilo and I did not want to
mess with the assembly language. Too much of a time sink.


*** IPOD support - Sep 17 2001 - Needs to be fixed.

Reported by Rob.

Our Linux kernel needs to be rebuilt with ping of death support.


*** DummyNet support - Sept 19 2001 - Needs to be fixed.

Reported by Jay.

We need to pass through more DummyNet pipe configuration parameters
for delay nodes. It basically needs some front end parser work (not
too hard), and some changes to the DB where we store that stuff (a few
tables, not too hard), the tmcd to return the additional stuff (not
too hard), and the client delay configuration to use the additional
stuff when configuring the pipes (not too hard).


*** Multiple batch/reload daemons. - Sept 24 2001 - Needs to be fixed.

It is possible to start multiple batch and reload daemons (and many
other TB daemons). Can really mess things up!
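
One common guard is an exclusive lock on a per-daemon lock file, taken
at startup. The sketch below is a minimal Python illustration of that
technique, not the testbed's code; the function name and the lock file
path in the usage note are assumptions made for the example.

```python
import fcntl
import os


def acquire_daemon_lock(path: str):
    """Return an open lock file if we are the only instance, else None.

    The caller must keep the returned file object open for the life of
    the daemon; closing it (or exiting) releases the lock.
    """
    lock_file = open(path, "a+")
    try:
        # Non-blocking exclusive lock: fails at once if another holder exists.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        lock_file.close()
        return None
    # Record our pid so whoever inspects the file can see the owner.
    lock_file.seek(0)
    lock_file.truncate()
    lock_file.write(f"{os.getpid()}\n")
    lock_file.flush()
    return lock_file
```

A daemon would call this first thing, for example with a hypothetical
path such as /var/run/batch_daemon.lock, and exit with an error if it
gets None back. Because the kernel drops the lock when the process
dies, a crashed daemon does not leave a stale lock behind.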


*** No text box in approve user page - Sept 24 2001 - Needs to be fixed.

Reported by Mac.

We need to add a text box for sending a message to the applicant, just
like we have in the project approval page.

*** Cleanup of batch experiments - Sep 26 2001 - Needs to be fixed.

Reported by LBS.

The batch experiment system was created before Chris' big rework of the
tb system. At this point, it would be cleaner to combine the batch
experiment table into the normal experiment table (would result in a
couple of extra slots). The two tables make for consistency and
locking problems. What I do like about the current batch system is
that it is more scriptable, because most stuff is done in the backend
script; the web interface does just a bit of checking and then passes
the whole thing off. The normal experiment creation path should look
like this too, since splitting stuff between the web interface and the
perl scripts is messy, and the PHP DB interface is not as nice or
robust as the perl interface.

*** LastLogin info in web page - Sep 27 2001 - Done by LBS on Oct 1 2001.

    From: Jay Lepreau <lepreau@cs.utah.edu>
    To: "Leigh B. Stoller" <stoller@moab.cs.utah.edu>
    Cc: testbed@fast.cs.utah.edu
    Subject: Re: cvs commit: testbed/www showproject_list.php3 
    Date: Wed, 26 Sep 2001 19:11:11 MDT
    
    If it were pretty easy to get wtmp info from plastic
    into the database, and displayed in a separate column
    on this page, that would be great.
    
    It's the info we need to know who's really active and who's inactive,
    which is not only interesting but affects who we ask to free up nodes.

    LBS Comment: Probably want last web login info too. This would be
    easy to add with a last_login DB table that would be updated in
    DOLOGIN in the web server.
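
A rough sketch of the last_login idea, in Python with sqlite3 standing
in for the testbed database. The column names (uid, time) and the
record_login helper are assumptions for illustration; only the table
name comes from the comment above.

```python
import sqlite3
import time


def record_login(db, uid, now=None):
    """Insert or overwrite the row holding uid's most recent web login."""
    when = int(time.time()) if now is None else now
    db.execute(
        "INSERT OR REPLACE INTO last_login (uid, time) VALUES (?, ?)",
        (uid, when),
    )
    db.commit()


# In-memory database for the example; one row per user.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE last_login (uid TEXT PRIMARY KEY, time INTEGER)")

# Two logins by the same (made-up) user leave a single, updated row.
record_login(db, "alice", now=1001548800)
record_login(db, "alice", now=1001635200)
rows = db.execute("SELECT uid, time FROM last_login").fetchall()
```

Keying the table on uid keeps it at one row per user, so the page that
lists users can join against it without any aggregation.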


*** Need to install ssh on Linux - Sep 27 2001 - Needs to be done.

Suggested by Jay.

I think we should install xinetd and rsh by default.  (But not turn on
rsh.) For one thing, many computational cluster users will want MPI.

RPMS: /proj/parmc/rpms/xinetd-2.1.8.9pre14-6.i386.rpm
      /proj/parmc/rpms/rsh-server-0.17-2.5.i386.rpm


*** Needs web page to change proj trust - Oct 3 2001 - Needs to be fixed.

Reported by James Griffioen.

No way to change the trust level for a user in a project. Need to
provide a web page for it.

*** Paperbag/plasticwrap bug - Oct 5 2001 -- Fixed by Rob.

Reported by magnus.

    magnus@ops ~> os_load -i UTAHPC-FBSD+LINUX -w pc121 pc122 pc123 pc124 
    pc125 pc126 pc127
    Sorry, you used a forbidden character
    **********
    SSH failed. You may need to run the following commands:
    
    mkdir -m 0755 /users/magnus/.ssh
    ssh-keygen -P '' -f /users/magnus/.ssh/identity
    cp /users/magnus/.ssh/identity.pub /users/magnus/.ssh/authorized_keys
    chmod 600 /users/magnus/.ssh/authorized_keys
    **********


*** Inconsistent exit in scripts. Oct 3 2001 - Needs to be fixed

When going to background we should install a __DIE__ hook to make sure
that the log file gets emailed off. Generally, the fatal error and
email stuff is very inconsistent. Normal users should get warm fuzzies.
Informational stuff should go to us.
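
The shape of such a hook, sketched in Python for illustration. The
scripts in question are perl, where the mechanism would be the
__DIE__ handler named above; sys.excepthook is the closest Python
analogue. install_fatal_hook and the send_log callback are
hypothetical names.

```python
import sys


def install_fatal_hook(send_log):
    """Run send_log(summary) whenever an uncaught exception ends the script."""
    previous = sys.excepthook

    def hook(exc_type, exc, tb):
        try:
            # Hand a one-line summary to the caller-supplied reporter,
            # which is where mailing the log file would happen.
            send_log(f"{exc_type.__name__}: {exc}")
        finally:
            # Still print the usual traceback via the previous hook.
            previous(exc_type, exc, tb)

    sys.excepthook = hook
    return hook
```

Installing one hook in a shared library, and having every script call
it before going to background, is one way to get the consistency this
entry asks for.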


*** Per User info change - Oct 8 2001 - Needs to be fixed

    From: Jay Lepreau <lepreau@cs.utah.edu>
    To: Leigh Stoller <stoller@fast.cs.utah.edu>
    Subject: Re: Feature Request 
    Date: Mon, 08 Oct 2001 06:14:43 MDT
    
    After I sent that, I just checked it out.  Realized that you
    (logically) display the info on the general user info page.  What I
    was originally wanting was more accurate and precise data for the
    general project info page.  That would take a bunch of processing: for
    each proj, go thru its users, find the minimum time for each type of
    login and display that.
    
    Don't need it right now, but eventually we're going to need the kind
    of info to move projects to 'inactive'.
    
    The same info is what we'll need to deschedule experiments, or arm twist
    their users.  But for expts we have the creator, who will usually
    be the one using the machines.
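
The per-project processing described in the mail could look roughly
like the Python sketch below. It reads "minimum time" as the shortest
time since a login, i.e. the most recent login of each type across the
project's users; that reading, the function name, and the input shapes
are all assumptions.

```python
from collections import defaultdict


def project_last_logins(members, logins):
    """Most recent login per project and login type.

    members: iterable of (project, user) pairs.
    logins:  dict mapping (user, login_type) -> Unix time of last login.
    """
    users_by_project = defaultdict(set)
    for project, user in members:
        users_by_project[project].add(user)

    summary = {}
    for project, users in users_by_project.items():
        latest = {}
        for (user, kind), when in logins.items():
            if user in users:
                # Keep the newest timestamp seen for this login type.
                latest[kind] = max(latest.get(kind, when), when)
        summary[project] = latest
    return summary
```

A project whose newest entry of every type is older than some cutoff
would then be a candidate for 'inactive'.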


*** Node control changes lost on swapin/out - Oct 30 2001 - Fixed by LBS

    > From: Mac Newbold <newbold@cs.utah.edu>
    > Subject: Re: swapin/out anomoly
    > Date: Tue, 30 Oct 2001 15:53:45 -0700 (MST)
    > 
    > Sometimes when we update settings, we update only the physical, and not
    > the virtual, and the virtual is all that persists between swaps. IMHO, I
    > think what we need to do is evaluate which things make sense to keep
    > between swaps, which things don't, and which things should offer a
    > choice. This affects OSs, delay params, and potentially other things
    > too. Perhaps there's even a unified solution we can implement as part of
    > the swapout process.
    
    It's probably a good idea to take the nodecontrol web page and remove the
    stuff that changes the DB. Instead, let's use a backend perl script that
    will update the DB appropriately (and can be used from the command line
    too). Basically, I'm not happy about doing that much DB munging in the web
    interface, especially virt_nodes and virt_lans (since it would be nice to
    present a web interface to change the delay params at some point).
    
*** Frisbee sucks up CPU - Nov 10 2001 - Needs to be fixed

Frisbee is sucking up 15% of the CPU. Needs to be profiled.

*** Add virtual name to node control form. - Nov 12 2001 -
    Fixed by Leigh on Nov 28 2001.

    Requested by Dave.

*** Add quotas on /users. - Nov 12 2001 - Needs to be done.

    Need to build and install a kernel on ops with quotas configured in.

*** Have Linux kernel source available - Nov 12 2001 - Needs to be done.

    Requested by Jay.

*** Use switch port for 10Mb/s - Nov 28 2001 - Needs to be done.

    LBS - Nov 28th: I worked on this. Turns out the switches do not
    like it when the nodes force their ports into 10Mb/s full duplex,
    and the switch disables the port.

*** Add "Must Change Password" state. - Dec 3rd 2001 - Needs to be done.

    Suggested by Mac. When we change the password and email it, we
    should require that the user change the password the next time
    they log in.

*** Mysterious TCP drop problem - Nov 21 2001 - Needs to be fixed.

    From: Tian Bu <tbu@cs.umass.edu>
    To: Mike Hibler <mike@fast.cs.utah.edu>
    cc: <testbed-ops@fast.cs.utah.edu>
    Subject: News on packet drop
    Date: Thu, 29 Nov 2001 14:52:04 -0500 (EST)
    
    
    After spending sometime on investigating why the packet drop occur
    between a pair of linux nodes, I found that this problem is not
    related to the OS. Instead, the mysterious packet drop only occurs
    between a pair of nodes where one end is PC850 and the other end 
    is PC600.  The setting I first saw the drop was a link between
    a PC850 and a PC600. I did report that the drop does not appear
    between a pair of freeBSD nodes. That is misleading because it was 
    measured on a different setting where the pair of nodes I reserved happen 
    to be both PC850. :-). Today I start another experiment where there are
    a pair of PC850 running Linux and observe no mysterious packet drop
    between them. I guess there may be some incompatible features between 
    either the interfaces installed on these two different types of machines.
    