Commit 098084e0 authored by Robert Ricci

Bump all remaining lecture numbers up to make room for a new one

parent 6e766bf5
\documentclass[12pt]{article}
\input{../../texstuff/fonts.sty}
\input{../../texstuff/notepaper.sty}
\usepackage{outlines}
\title{CS6963 Lecture \#12}
\author{Robert Ricci}
\date{February 20, 2014}
\begin{document}
\maketitle
\begin{outline}
\1 Left over from last time
\1 Overall goal of experiment design:
\2 Learn as much as possible from as few experiments as possible
\1 Some terminology
\2 Response variable: outcome
\3 \textit{Why call it a variable?}
\3 \textit{Examples of non-performance variables?}
\2 Factors: things you change
\3 \textit{Why call them predictor variables?}
\2 Primary / secondary factors
\3 \textit{How to decide which ones to use?}
\2 Replication: how many repetitions
\1 Important properties of experiment design
\2 Every experiment should get you closer to answering one of the questions
\2 You should be able to explain all behavior in the results---if not, you
may need more experiments
\2 Control all variables you can
\2 Measure the variables you can't control
\1 Interacting factors
\2 Understand which of your factors interact, and which are independent
\2 Saves you a lot of time not running experiments that don't reveal more
information
\2 May take a few experiments to determine
\2 A good (negative) example: the FV paper
\2 If you know for sure they are independent, make sure to say so in the
paper
\1 Common mistakes
\2 Ignoring variation in experimental error
\2 Not controlling params
\2 One factor at a time experiments
\2 Not isolating effects
\2 Too many experiments
\3 Break into several sub-evals to answer questions, evaluate particular
pieces of the SUT
\1 Discussing design of lab 1
\2 Our goal: Which variant of TCP should I run on my webserver?
\2 Congestion control: cubic vs. newreno
\2 SACK or no SACK? (orthogonal to CC algo)
\2 ``Doesn't make a difference'' is an okay answer, but have to prove it.
\1 Questions to answer
\2 Which provides my clients the best experience?
\3 Low latency to load page
\3 High throughput
\2 Which allows me to serve more users?
\3 Resources on server
\3 Fairness between clients
\1 Metrics
\2 Time to download one page
\2 Error rate
\2 Throughput
\2 Jain's fairness index
\2 Resource usage on server (e.g., CPU)
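\3 For reference (standard definition, not covered in these notes): for
per-client throughputs $x_1, \ldots, x_n$,
\[ J = \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n \sum_{i=1}^{n} x_i^2} \]
which ranges from $1/n$ (one client gets everything) to $1$ (perfectly fair)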
\1 Parameters and factors
\2 TCP parameters
\2 Number of simultaneous clients
\2 File size distribution
\2 Which webserver?
\2 Client load generator?
\2 Packet loss
\2 Client RTT
\2 Client bandwidth
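\3 Rough scale check (illustrative, not a plan): treating $k$ of the above as
two-level factors, a full factorial design needs $2^k$ runs; even $k = 8$
already means $2^8 = 256$ experiments per replication, which is why choosing
primary factors matters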
\1 Tools to use
\2 Webserver: which one (apache 2.2?)
\2 \texttt{tc}
\2 Client workload generator (\texttt{httperf?})
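\3 Illustrative \texttt{tc} invocation (device name is a placeholder):
\texttt{tc qdisc add dev eth0 root netem delay 50ms loss 1\%} adds a fixed
delay and random loss on a link using \texttt{netem}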
\1 Experiments to run
\2 Determine whether SACK and CC are interacting factors
\2 Max out number of clients
\1 How to present results
\1 For next time
\2 Finish paper analysis before class
\end{outline}
\end{document}
\documentclass[12pt]{article}
\usepackage[no-math]{fontspec}
\usepackage{sectsty}
\usepackage[margin=1.25in]{geometry}
\usepackage{outlines}
\usepackage{pdfpages}
\setmainfont[Numbers=OldStyle,Ligatures=TeX]{Equity Text A}
\setmonofont{Inconsolata}
\newfontfamily\titlefont[Numbers=OldStyle,Ligatures=TeX]{Equity Caps A}
\allsectionsfont{\titlefont}
\title{CS6963 Lecture \#13}
\author{Robert Ricci}
\date{February 27, 2014}
\begin{document}
\maketitle
\begin{outline}
\1 What we decided last time:
\2 Remember, overall goal is to decide whether to use NewReno or cubic TCP on
our webserver, and whether or not to enable SACK
\2 SUT includes network and NIC, but not webserver or client
\2 Questions to answer:
\3 Which variant provides higher throughput / goodput?
\3 Which variant gives lower delay?
\3 Which provides better fairness between clients?
\3 How many TCP sessions / clients can we support?
\2 Secondary things to look at:
\3 How many retransmissions are caused?
\3 What is the utilization on the server NIC?
\3 What is the interaction between the application and TCP?

\1 Things to decide for today:
\2 What metrics will we use?
\2 What are the parameters?
\2 Which ones will we vary as factors?
\3 How will we decide what values they should take on?
\2 What will we use as our workload generator?
\2 How will we collect measurements?
\2 What will our major set of evaluations be?
\2 How will we present results?

\0%
\begin{description}

\item[Client workload generator: Naveen, Binh] \hfill \\

A tool for actually making HTTP requests; we should be reasonably confident
that this tool does not itself cause some kind of bottleneck or timing
artifacts, which would make it part of the system under test, which we decided
we didn't want it to be. This tool will need to be capable of, or scriptable
to, make requests to URLs according to some distribution and with some timing
model (e.g., how long do I wait between page loads, and do I wait for the
previous page to finish loading before loading the next one?).

\item[Server to respond to clients: Christopher, Hyunwook] \hfill \\

Like the client, we need to be confident that the server introduces minimal
effects on the system so that it does not become part of the system under
test. We need to be able to serve objects of varying size according to some
distribution.

\item[Tools for analyzing client to server communication: Ren, Philip] \hfill \\

We decided that we would capture the response variables (bandwidth, latency,
etc.) by capturing packets on the wire. So, we will need to decide what tools
to use to capture these packets, and we will need to be able to compute the
higher-level metrics from the raw traces that we collect.

\item[Data about distribution of web requests: Chaitu, Aisha] \hfill \\

We need to cause the client program to make requests according to some
distribution that is representative of what real webservers see: for example,
what are the sizes of the objects fetched, what is the time between objects
being fetched, and what is the ratio of data downloaded from the server to
data uploaded to the server? As much as possible, we should use distributions
gathered from real systems, so we should try to find studies, traces,
datasets, etc.

\item[Data about distribution of client network performance: Junguk, Makito] \hfill \\

We need to model the network conditions between the clients and the server:
what is their round-trip latency to the server, what bandwidth do they have
available, and what packet loss rates do they see? As with the distribution of
requests, we should try to use distributions gathered from real networks, and
should look for studies, etc.\ giving us distributions to use for these
values.

\end{description}
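As a concrete sketch of the kind of client invocation we have in mind
(parameter values here are illustrative, nothing is decided):
\texttt{httperf --server <server> --port 80 --num-conns 1000 --rate 10}
opens connections at a fixed rate and reports connection and reply statistics.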
\1 For next time
\2 Read Chapter 14, linear regression

\end{outline}
\newpage
\includepdf[pages={1}]{board-drawing.pdf}
\end{document}
\documentclass[12pt]{article}
\usepackage[no-math]{fontspec}
\usepackage{sectsty}
\usepackage[margin=1.25in]{geometry}
\usepackage{outlines}
\usepackage{pdfpages}
\setmainfont[Numbers=OldStyle,Ligatures=TeX]{Equity Text A}
\setmonofont{Inconsolata}
\newfontfamily\titlefont[Numbers=OldStyle,Ligatures=TeX]{Equity Caps A}
\allsectionsfont{\titlefont}
\title{CS6963 Lecture \#14}
\author{Robert Ricci}
\date{March 4, 2014}
\begin{document}
\maketitle
\begin{outline}
\1 Today: How well does your data fit a line?
\2 More complicated regressions exist, of course, but we'll stick with this one for now
\2 Eyeballing is just not rigorous enough

\1 Basic model: $y_i = b_0 + b_1 x_i + e_i$
\2 $y_i$ is the response we are predicting
\2 $b_0$ is the $y$-intercept
\2 $b_1$ is the slope
\2 $x_i$ is the predictor
\2 $e_i$ is the error
\2 \textit{Which of these are random variables?}
\3 A: All but $x_i$: the $b$s are estimated from random variables, and $e$ is a difference between random variables
\3 So, we can compute statistics on them

\1 Two criteria for getting the $b$s
\2 Zero total error
\2 Minimize SSE (sum of squared errors)
\2 Example of why one is not enough: two points, infinite lines with zero total error
\2 Squared errors are always positive, so that criterion alone could overshoot
or undershoot

\1 Deriving $b_0$ is easy
\2 Solve for $e_i$: $e_i = y_i - (b_0 + b_1 x_i)$
\2 Take the mean over all $i$: $\overline{e} = \overline{y} - b_0 - b_1 \overline{x}$
\2 Set the mean error to 0 to get $b_0 = \overline{y} - b_1 \overline{x}$
\2 Now we just need $b_1$

\1 Deriving $b_1$ is harder
\2 SSE = sum of squared errors over all $i$
\2 We want a minimum value for this
\2 It's a function with one local minimum
\2 So we can differentiate and look for zero
\2 Write it in terms of variances, $s_y^2 - 2 b_1 s^2_{xy} + b_1^2 s_x^2$, then take the derivative with respect to $b_1$
\2 $s^2_{xy}$ is the covariance of $x$ and $y$ (see p.~181)
\2 In the end, this gives us $b_1 = \frac{s^2_{xy}}{s_x^2}$
\3 Covariance of $x$ and $y$ divided by variance of $x$
\3 $b_1 = \frac{\sum x_i y_i - n \overline{x}\,\overline{y}}{\sum x_i^2 - n(\overline{x})^2}$
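\3 A quick sanity check with made-up numbers (not from the book): for the
points $(1,1)$, $(2,2)$, $(3,4)$ we have $n = 3$, $\overline{x} = 2$,
$\overline{y} = 7/3$, $\sum x_i y_i = 17$, $\sum x_i^2 = 14$, so
$b_1 = \frac{17 - 3 \cdot 2 \cdot (7/3)}{14 - 3 \cdot 2^2} = \frac{3}{2}$ and
$b_0 = \frac{7}{3} - \frac{3}{2} \cdot 2 = -\frac{2}{3}$; the residuals
$(\frac{1}{6}, -\frac{1}{3}, \frac{1}{6})$ sum to zero, as the zero-total-error
criterion requires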
\1 SS*
\2 SSE = sum of squared errors
\2 SST = total sum of squares (TSS): sum of squared differences from the mean
\2 SS0 = $n \overline{y}^2$
\2 SSY = sum of the squares of all $y$, so SST = SSY - SS0
\2 SSR = variation explained by the regression: SST - SSE

\1 Point of the above: we can talk about two sources that explain variance: sum of
squared differences from the mean, and sum of squared errors
\2 $R^2 = \frac{SSR}{SST}$
\2 The ratio is the fraction of variation explained by the regression: close to 1 is good (1 is the max possible)
\2 If the regression sucks, SSR will be close to 0
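\3 Tiny made-up check: fitting the points $(1,1)$, $(2,2)$, $(3,4)$ by least
squares gives $\hat{y} = -\frac{2}{3} + \frac{3}{2}x$, with
$SSE = \frac{1}{6}$ and $SST = \frac{14}{3}$, so
$R^2 = 1 - \frac{1/6}{14/3} = \frac{27}{28} \approx 0.96$: almost all of the
variation is explained by the line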
\1 Remember, our error terms and $b$s are random variables
\2 We can calculate stddev, etc.\ on them
\2 The variance is $s_e^2 = \frac{SSE}{n-2}$, the MSE (mean squared error)
\2 Confidence intervals, too
\2 \textit{What do confidence intervals tell us in this case?}
\3 A: Our confidence in how close to the true slope our estimate is
\3 For example: how sure are we that two slopes are actually different?
\2 \textit{When would we want to show that the confidence interval for $b_1$ includes zero?}

\1 Confidence intervals for predictions
\2 Confidence intervals are tightest near the middle of the sample
\2 If we go far out, our confidence is low, which makes intuitive sense
\2 $s_e \big(\frac{1}{m} + \frac{1}{n} + \frac{(x_p - \overline{x})^2}{\sum x_i^2 - n \overline{x}^2}\big)^{\frac{1}{2}}$
\2 $s_e$ is the stddev of the error
\2 $m$ is how many predictions we are making
\2 $x_p$ is the value of $x$ at which we are predicting
\2 $x_p - \overline{x}$ captures the distance from the center of the sample
\2 \textit{Why is it smaller for larger $m$?}
\3 A: Averaging more predictions reduces the variance, under the assumption of a normal error distribution

\1 Residuals
\2 AKA error values
\2 We can expect several things from them if our assumptions about the regression are correct
\2 They will not show trends: \textit{why would a trend be a problem?}
\3 Tells us that an assumption has been violated
\3 If not randomly distributed for different $x$, it tells us there is a systematic error at high or low values: error and predictor are not independent
\2 Q-Q plot of the error distribution vs.\ a normal distribution
\2 Want the spread (stddev) to be constant across the range

\1 For next time
\2 Start filling out your section in the cs6963-lab1 repo
\2 Be careful to only modify the parts of the .tex file for your section
\3 Unless you want to suggest a broader change
\2 Fork it, give your partner access, and send me a merge request before
the start of class Thursday
\2 Check in any notes you create, reference papers
\2 You are empowered to make decisions
\2 The goal is to describe things in sufficient detail that people can start
implementing
\2 We will try to finish up our plan by deciding what experiments to run
and how to present results on Thursday
\2 Need the next two paper volunteers; let's get them out before spring
break

\end{outline}
\newpage
\includepdf[pages={1}]{board-drawing.pdf}
\end{document}
\documentclass[12pt]{article}
\usepackage[no-math]{fontspec}
\usepackage{sectsty}
\usepackage[margin=1.25in]{geometry}
\usepackage{outlines}
\setmainfont[Numbers=OldStyle,Ligatures=TeX]{Equity Text A}
\setmonofont{Inconsolata}
\newfontfamily\titlefont[Numbers=OldStyle,Ligatures=TeX]{Equity Caps A}
\allsectionsfont{\titlefont}
\title{CS6963 Lecture \#15}
\author{Robert Ricci}
\date{March 6, 2014}
\begin{document}
\maketitle
\begin{outline}
\1 For today: finish planning the lab

\1 Executive decisions I made (can discuss, though!)
\2 Keep one distribution for client behavior
\2 One distance per experiment?
\2 Use Linux \texttt{tc} for traffic shaping
\2 100 Mbit server NIC
\2 Draw the topology

\1 Major evaluations
\2 cf.\ questions from Lecture 13
\2 Calibrate how many runs to do
\2 How to present data: tables, graphs, etc.

\1 Interfaces between the pieces
\2 Get distributions of session sizes
\2 Set client conditions

\1 Other grungy stuff
\2 Time synchronization
\2 Clients into traffic shaping pipes
\2 Calculating stats from packet streams
\1 Next step assignments
\2 Continue to divide up by the same areas?
\2 Estimate the amount of work to do
\1 For next time
\2 I won't be here
\2 Guest lectures by Xing Lin and Weibin Sun
\2 Papers posted
\2 Form not required
\2 Do think actively about questions as you read the papers
\2 You are encouraged to suggest ways to improve the evaluations

\end{outline}
\end{document}
\documentclass[12pt]{article}
\usepackage[no-math]{fontspec}
\usepackage{sectsty}
\usepackage[margin=1.25in]{geometry}
\usepackage{outlines}
\setmainfont[Numbers=OldStyle,Ligatures=TeX]{Equity Text A}
\setmonofont{Inconsolata}
\newfontfamily\titlefont[Numbers=OldStyle,Ligatures=TeX]{Equity Caps A}
\allsectionsfont{\titlefont}
\title{CS6963 Lecture \#16}
\author{Robert Ricci}
\date{March 25, 2014}
\begin{document}
\maketitle
\begin{outline}

\1 From last time
\2 Thanks to Junguk and Makito for the scripts!
\2 Quick status: client, server, network conditions, client request
sizes, analysis