1. 08 Nov, 2018 5 commits
  2. 07 Nov, 2018 3 commits
    • Leigh Stoller's avatar
      Implement OS list optimization described in issue #415. Let's try it on
      the Mothership for a while and see what breaks. · ee5f241d
      Leigh Stoller authored
      ee5f241d
    • Leigh Stoller's avatar
      Quick fix for watchdog/backup interaction; use a script lock. · 72b4ba32
      Leigh Stoller authored
      From Slack:
      
      What I notice is that mysqldump is read locking all of the tables for a
      long time. This time gets longer and longer of course as the DB gets
      bigger. Last night enough stuff backed up (trying to get various write
      locks) that we hit the 500 thread limit. I only know this cause mysql
      prints "killing 501" threads at 2:03am. Which makes me wonder if our
      thread limit is too small (but seems like it would have to be much
      bigger) or if our backup strategy is inappropriate for how big the DB is
      and how busy the system is. But to be clear, I am not even sure if
      mysqld throws in the towel when it hits 500 threads, I am in the midst
      of reading obtuse mysql documentation. There are a bunch of other
      error messages that I do not understand yet.
      
      I can reproduce this in my elabinelab with a 10 line perl script. Two
      problems: 1) we do not use the permission system, so we cannot use
      dynamic permissions, which means that the single thread that is left
      for just this case can be used by anyone, and so the server is fully
      out of threads. And 2) then the Emulab mysql watchdog cannot perform its
      query, and so it thinks mysqld has gone catatonic and kills it, right in
      the middle of the backup. Yuck * 2.
      
      And if anyone is curious about a more typical approach: "If you want to
      do this for MyISAM or mixed tables without any downtime from locking the
      tables, you can set up a slave database, and take your snapshots from
      there. Setting up the slave database, unfortunately, causes some
      downtime to export the live database, but once it's running, you should
      be able to lock its tables, and export using the methods others have
      described. When this is happening, it will lag behind the master, but
      won't stop the master from updating its tables, and will catch up as
      soon as the backup is complete"
      72b4ba32
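
The commit message does not show how the script lock is implemented. The following is only a minimal sketch, assuming an advisory `flock`-style lock file that the backup script holds while mysqldump runs, and assuming the watchdog skips its probe (and therefore any kill) while that lock is held. The function names and the lock path are hypothetical.

```python
import fcntl
import os
import tempfile


def try_script_lock(path):
    """Try to take an exclusive, non-blocking advisory lock on `path`.

    Returns the open file object holding the lock, or None if some
    other holder (for example the backup script) already has it.
    """
    fh = open(path, "a")
    try:
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        fh.close()
        return None
    return fh


def release_script_lock(fh):
    """Drop the lock and close the underlying file."""
    fcntl.flock(fh, fcntl.LOCK_UN)
    fh.close()


def watchdog_should_probe(path):
    """Return False while the lock is held, so the watchdog leaves mysqld alone."""
    fh = try_script_lock(path)
    if fh is None:
        return False
    release_script_lock(fh)
    return True


# Hypothetical usage: the backup holds the lock, the watchdog backs off.
lock_path = os.path.join(tempfile.mkdtemp(), "backup.lock")
held = try_script_lock(lock_path)
print(watchdog_should_probe(lock_path))
release_script_lock(held)
print(watchdog_should_probe(lock_path))
```

A lock like this only keeps the watchdog from killing mysqld in the middle of a backup; it does nothing about the thread exhaustion described above.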
    • Leigh Stoller's avatar
      Extend cheesy hack for nodes that take a long time to reload,
      like a Mellanox switch. · ebadc7d1
      Leigh Stoller authored
      ebadc7d1
  3. 06 Nov, 2018 4 commits
  4. 05 Nov, 2018 6 commits
    • Leigh Stoller's avatar
      Working Mellanox user alloc switch support (issue #445): · 95e7bded
      Leigh Stoller authored
      * The primary problem with the Mellanox is that the install image does a
        kexec out of ONIE into Linux, spends 30+ minutes doing stuff, and then
        reboots. This throws the reload state machine out of whack because we
        do not get a chance to send the RELOADDONE state. So ... some changes
        to rc.testbed and rc.reload on the USB dongle: the ONIE MFS sends
        RELOADING and writes a flag file to the ONIE partition on the
        "disk" (not the USB). Then comes the kexec into MLNX, the install
        happens, and the switch reboots. The next boot into ONIE sees the flag
        file, erases it, and sends RELOADDONE. It waits for a bit, and then
        continues on the normal path. This abuses stated in that there are
        whiny messages in the stated log file, but I am immune to stated
        whining.
      
      * Another item of note is that the switch DHCPs, but only to get the IP
        info, there is no ability to give it an initial config file like we
        can with the Dell switches. The main problem here is that the switch
        comes up with its default login/password, which is obviously well known
        because it's in the manual. That means there is a window where the
        switch is vulnerable, but since we block the switches from the public
        side, this is not a serious problem. As soon as we can get in (sshd is
        running) we log in and update the config with passwords, keys,
        etc.
      
      * Other changes to the machine dependent osload library module. I had
        done some of this before switching to the Dells way back when, but it
        needed to be updated/completed.
      95e7bded
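
The flag-file handshake in the first bullet can be sketched as follows. This is an illustration only, not the actual rc.testbed/rc.reload code: `onie_boot_step`, the flag path, and the `send_state` callback are hypothetical names, and the sketch assumes it runs only on the reload path of the ONIE MFS.

```python
import os
import tempfile


def onie_boot_step(flag_path, send_state):
    """One pass through the ONIE boot logic described in the commit message.

    Returns "install" when the caller should kexec into the installer,
    or "normal" when the caller should continue on the normal path.
    """
    if os.path.exists(flag_path):
        # Second boot: the install already happened and the switch rebooted.
        os.remove(flag_path)
        send_state("RELOADDONE")
        return "normal"
    # First boot: announce the reload and leave a marker that survives
    # the installer's own reboot.
    send_state("RELOADING")
    with open(flag_path, "w") as fh:
        fh.write("reload in progress\n")
    return "install"


# Hypothetical usage: two boots in a row, collecting the states sent.
flag = os.path.join(tempfile.mkdtemp(), "reload.flag")
states = []
print(onie_boot_step(flag, states.append), onie_boot_step(flag, states.append))
print(states)
```

Keeping the marker on the switch's own "disk" rather than the USB dongle matches the description above. The short wait after sending RELOADDONE is left out of the sketch.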
    • Leigh Stoller's avatar
      Changes to how we handle/report mapping failures that also fail the
      empty testbed test. · 11074445
      Leigh Stoller authored
      
      Prior to this commit, we were not invoking the empty testbed case
      consistently. Now we do, but that exposed another problem: reporting
      that error to the Portal in a meaningful way. Basically, we can report
      a different error code for an impossible to map error, but then we lose
      the info we store now about what the actual failure was (which we show
      to the user with additional helpful info). Since we cannot (easily)
      change the Geni API for CreateSliver(), I have elected to continue the
      practice of returning the specific error codes (which also go into the
      database for long term historical info), and add more helpful text
      for the Portal user that explains clearly that the mapping is impossible
      on the target cluster. This extra text also goes into the database in the
      attached message field, so we can come back later and post process if
      we decide to do something different.
      11074445
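
The reporting choice described above (keep the specific error code, append explanatory text) might look roughly like this. It is a sketch under stated assumptions: the `MappingFailure` record, the `report_failure` helper, the wording of the note, and the numeric code in the usage lines are all hypothetical and are not taken from the Emulab or Geni code.

```python
from dataclasses import dataclass

# Hypothetical wording; the real Portal text is not shown in the commit.
IMPOSSIBLE_NOTE = (
    "This request cannot be mapped on the target cluster, "
    "even when the cluster is empty."
)


@dataclass
class MappingFailure:
    code: int      # the specific error code, returned unchanged
    message: str   # the text stored in the attached message field


def report_failure(code, detail, fails_empty_testbed):
    """Keep the specific error code and extend the message when the
    request also fails the empty testbed test."""
    message = detail
    if fails_empty_testbed:
        message = f"{detail}\n{IMPOSSIBLE_NOTE}"
    return MappingFailure(code=code, message=message)


# Hypothetical usage with a made-up error code.
failure = report_failure(26, "Not enough nodes of the requested type", True)
print(failure.code)
print(failure.message)
```

Because the code is unchanged and the extra text is only appended, the original failure detail stays available for later post-processing.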
    • Leigh Stoller's avatar
      7eeaa0fc
    • Leigh Stoller's avatar
      Allow NTPSERVER override in the NS file. · 849432e1
      Leigh Stoller authored
      849432e1
  5. 30 Oct, 2018 5 commits
  6. 29 Oct, 2018 2 commits
  7. 26 Oct, 2018 10 commits
  8. 25 Oct, 2018 5 commits
    • Aleksander Maricq's avatar
      Add defs file for amaricq · 2f41610c
      Aleksander Maricq authored
      2f41610c
    • David Johnson's avatar
      Replace the Docker entrypoint/cmd/env implementation for augmented images. · a986a085
      David Johnson authored
      (Also, add support for the user to change container entrypoint at runtime.
      Note also that the server side now stores the entrypoint/cmd/env
      attributes as base64url-encoded virt_node_attributes, so that we can
      just use the existing table_regex for those values.)
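
As a rough illustration of why base64url helps here: the encoded form uses only letters, digits, `-`, `_`, and `=` padding, so arbitrary shell text collapses into a character set that one simple pattern can validate. The sketch below uses Python's standard `base64` module; the helper names and the `SAFE` pattern are hypothetical, and whether the server keeps or strips the padding is not stated in the commit.

```python
import base64
import re


def encode_attr(value):
    """Encode an attribute value as base64url text (padding kept)."""
    return base64.urlsafe_b64encode(value.encode("utf-8")).decode("ascii")


def decode_attr(encoded):
    """Recover the original attribute value from its base64url form."""
    return base64.urlsafe_b64decode(encoded.encode("ascii")).decode("utf-8")


# Stand-in for a permissive column check; not the real table_regex.
SAFE = re.compile(r"^[A-Za-z0-9_=-]*$")

# Hypothetical usage: a command full of shell metacharacters.
cmd = '/bin/sh -c "echo $HOME; ls ~/*?>"'
encoded = encode_attr(cmd)
print(bool(SAFE.match(encoded)), decode_attr(encoded) == cmd)
```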
      
      We add a new runit service (/etc/service/dockerentrypoint) to
      clientside/tmcc/linux/docker/dockerfiles/common to handle the
      entrypoint/cmd/env/workingdir/user emulation.  From the comments:
      
        Docker's semantics for ENTRYPOINT/CMD vary depending on whether those
        values are specified as arrays of strings, or simply as single strings
        (which must be interpreted by /bin/sh -c).
      
        Handling all the quoting possibilities in the shell is a major pain.
        So, this script handles the basic stuff (in particular, sourcing env
        vars, because we want the shell to interpret them!) -- then execs our
        perl companion script (run.pl) to deal with the entrypoint/command
        files that libvnode_docker::emulabizeImage and
        libvnode_docker::vnodeCreate populated.
      
        libvnode_docker creates these single-line files in /etc/emulab/docker
        as either string:hexstr(<entrypoint-or-cmd-string>), or
        array:hexstr(a[0]),hexstr(a[1])... .  This allows us to preserve the
        original type of the image's entrypoint/cmd as well as the runtime
        entrypoint/cmd, and to preserve the exact bytes for the eventual final
        call to exec.
      
        The static files built into an emulabized image are
        /etc/emulab/docker/{entrypoint.image,cmd.image}, and those created
        dynamically at runtime if the user changes the entrypoint or cmd are
        bind-mounted to /etc/emulab/docker/{entrypoint.runtime,cmd.runtime}.
      
        Given the presence (or absence!) of those files, this script
        implements the emulation, based upon the content in those files.
      a986a085
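
The single-line file format quoted above can be sketched as an encode/decode pair. This is an illustration in Python, not the actual run.pl or libvnode_docker code. It assumes `hexstr` means the hexadecimal encoding of the UTF-8 bytes, that an empty array payload means an empty array, and that the string form is run through `/bin/sh -c` as the comments say; all function names are hypothetical.

```python
def encode_entry(value):
    """Encode a string or a list of strings as one line, preserving its type."""
    if isinstance(value, str):
        return "string:" + value.encode("utf-8").hex()
    return "array:" + ",".join(item.encode("utf-8").hex() for item in value)


def decode_entry(line):
    """Decode one line back into a str (string form) or a list (array form)."""
    kind, _, payload = line.strip().partition(":")
    if kind == "string":
        return bytes.fromhex(payload).decode("utf-8")
    if kind == "array":
        if payload == "":
            # Assumption: an empty payload is an empty array.
            return []
        return [bytes.fromhex(part).decode("utf-8") for part in payload.split(",")]
    raise ValueError("unknown entry type: " + repr(kind))


def to_argv(value):
    """String form is interpreted by /bin/sh -c; array form is exec'd as is."""
    if isinstance(value, str):
        return ["/bin/sh", "-c", value]
    return list(value)


# Hypothetical usage: round-trip both forms and build the final argv.
for original in ("echo $HOME", ["ls", "-l", "/tmp"]):
    line = encode_entry(original)
    print(line, to_argv(decode_entry(line)))
```

Hex-encoding each element separately keeps commas, quotes, and newlines inside an argument from colliding with the `,` separator, which is one way to preserve the exact bytes for the final exec.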
    • David Johnson's avatar
      993e9f8c
    • David Johnson's avatar
      e48155a7