\2 If you know for sure they are independent, make sure to say so in the paper
\begin{description}
\item[Client to make requests]\hfill\\
A tool for actually making HTTP requests. We should be reasonably confident that this tool does not itself introduce bottlenecks or timing artifacts; that would make it part of the system under test, which we decided we do not want. The tool needs to be capable of, or scriptable to, making requests to URLs according to some distribution and with some timing model (e.g., how long to wait between page loads, and whether to wait for the previous page to finish loading before starting the next).
\item[Server to respond to clients: Christopher, Hyunwook]\hfill\\
Like the client, the server must introduce minimal effects on the system so that it does not become part of the system under test. It needs to serve objects of varying size according to some distribution.
\item[Tools for analyzing client to server communication: Ren, Philip]\hfill\\
We decided to capture the response variables (bandwidth, latency, etc.) by capturing packets on the wire. So we need to choose tools for capturing these packets, and we need to be able to compute the higher-level metrics from the raw traces we collect.
\item[Data about distribution of web requests: Chaitu, Aisha]\hfill\\
We need the client program to make requests according to a distribution representative of what real webservers see: for example, the sizes of objects fetched, the time between fetches, and the ratio of data downloaded from the server to data uploaded to it. As much as possible, we should use distributions gathered from real systems, so we should look for studies, traces, datasets, etc.
\item[Data about distribution of client network performance: Junguk, Makito]\hfill\\
We need to model the network conditions between the clients and the server: their round-trip latency to the server, the bandwidth they have available, the packet loss rates they see, etc. As with the distribution of requests, we should use distributions gathered from real networks, and should look for studies giving us values to use.
\end{description}
\1 Common mistakes
\2 Ignoring variation in experimental error
\2 Not controlling parameters
\2 One-factor-at-a-time experiments
\2 Not isolating effects
\2 Too many experiments
\3 Break into several sub-evaluations to answer questions and evaluate particular pieces of the SUT
\1 Discussing design of lab 1
\2 Our goal: which variant of TCP should I run on my webserver?
\2 Congestion control: CUBIC vs. NewReno
\2 SACK or no SACK? (orthogonal to the congestion-control algorithm)
\2 ``Doesn't make a difference'' is an okay answer, but we have to prove it.
\1 Questions to answer
\2 Which provides my clients the best experience?
\3 Low latency to load page
\3 High throughput
\2 Which allows me to serve more users?
\3 Resources on server
\3 Fairness between clients
\1 Metrics
\2 Time to download one page
\2 Error rate
\2 Throughput
\2 Jain's fairness index
\2 Resource usage on server (eg. CPU)
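Jain's fairness index has a simple closed form: $J = (\sum x_i)^2 / (n \sum x_i^2)$, where the $x_i$ are each client's share (e.g., throughput). A minimal sketch (the throughput numbers are made up):

```python
# Jain's fairness index: J = (sum x_i)^2 / (n * sum x_i^2).
# J = 1 when all clients get equal shares; J -> 1/n as one client dominates.
def jains_index(xs):
    n = len(xs)
    return sum(xs) ** 2 / (n * sum(x * x for x in xs))

print(jains_index([10.0, 10.0, 10.0, 10.0]))  # equal shares -> 1.0
print(jains_index([40.0, 0.0, 0.0, 0.0]))     # one client hogs -> 0.25
```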
\1 Parameters and factors
\2 TCP parameters
\2 Number of simultaneous clients
\2 File size distribution
\2 Which webserver?
\2 Client load generator?
\2 Packet loss
\2 Client RTT
\2 Client bandwidth
\1 Tools to use
\2 Webserver: which one? (Apache 2.2?)
\2 \texttt{tc} to emulate client latency, loss, and bandwidth
\2 Client workload generator (\texttt{httperf}?)
\1 Experiments to run
\2 Determine whether SACK and CC are interacting factors
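The interaction check can be sketched with a $2^2$ design: measure every combination of the two factors and see whether the effect of one factor changes with the level of the other. All the timing numbers below are hypothetical placeholders, not measurements:

```python
# 2^2 design check: are SACK and congestion control interacting factors?
# Hypothetical mean page-load times (seconds); real values come from experiments.
times = {
    ("cubic",   "nosack"): 1.40, ("cubic",   "sack"): 1.10,
    ("newreno", "nosack"): 1.60, ("newreno", "sack"): 1.30,
}
# With no interaction, the effect of SACK is the same under both CC algorithms.
sack_effect_cubic   = times[("cubic", "sack")]   - times[("cubic", "nosack")]
sack_effect_newreno = times[("newreno", "sack")] - times[("newreno", "nosack")]
interaction = sack_effect_cubic - sack_effect_newreno
print(interaction)  # ~0 here: the effects are additive, so the factors look independent
```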
\1 Linear regression
\2 The sum of squared errors is a function of $b_1$ with one local minimum
\2 So we can differentiate and look for zero
\2 $s_y^2 - 2 b_1 s^2_{xy} + b_1^2 s_x^2$, then take the derivative
\2 $s^2_{xy}$ is the covariance of $x$ and $y$ (see p. 181)
\2 In the end, gives us $b_1=\frac{s^2_{xy}}{s_x^2}$
\3 Covariance of $x$ and $y$ divided by variance of $x$
\3 $\frac{\sum{xy}- n \overline{x}\overline{y}}{\sum{x^2}- n(\overline{x})^2}$
\1 SS*
\2 SSE = sum of squared errors
\2 SST = total sum of squares (TSS): squared differences from the mean
\2 SS0 = $n\overline{y}^2$ (the square of $\overline{y}$, $n$ times)
\2 SSY = sum of the squares of all $y$, so SST = SSY - SS0
\2 SSR = variation explained by the regression: SST - SSE
\1 Point of the above: we can talk about two sources that explain variance: the sum of squared differences from the mean, and the sum of squared errors
\2 $R^2=\frac{SSR}{SST}$
\2 The ratio is the fraction of variation explained by the regression; close to 1 is good (1 is the maximum possible)
\2 If the regression sucks, SSR will be close to 0
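The decomposition can be checked numerically. A minimal sketch with made-up $(x, y)$ samples, using the $b_1$ formula from the notes:

```python
# Least-squares fit and the SST = SSE + SSR decomposition.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]   # made-up sample data
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# b1 = (sum xy - n*xbar*ybar) / (sum x^2 - n*xbar^2); b0 = ybar - b1*xbar
b1 = (sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar) / \
     (sum(x * x for x in xs) - n * xbar ** 2)
b0 = ybar - b1 * xbar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # unexplained error
sst = sum((y - ybar) ** 2 for y in ys)                        # total variation
ssr = sst - sse                                               # explained by regression
r2 = ssr / sst
print(round(b1, 3), round(r2, 3))  # R^2 near 1: the line explains most of the variation
```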
\1 Remember, our error terms and $b$s are random variables
\2 We can calculate stddev, etc. on them
\2 Variance is $s_e^2=\frac{SSE}{n-2}$, the MSE (mean squared error)
\2 Confidence intervals, too
\2\textit{What do confidence intervals tell us in this case?}
\3 A: Our confidence in how close to the true slope our estimate is
\3 For example: How sure are we that two slopes are actually different
\2\textit{When would we want to show that the confidence interval for $b_1$ includes zero?}
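A sketch of the slope's confidence interval, assuming the standard formula $s_{b_1} = s_e / \sqrt{\sum x^2 - n\overline{x}^2}$; the data is made up and the $t$ quantile comes from a standard table:

```python
import math

# Confidence interval for the slope: b1 +/- t * s_b1,
# where s_b1 = s_e / sqrt(sum x^2 - n*xbar^2) and s_e^2 = SSE / (n - 2).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]   # made-up sample data
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b1 = (sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar) / \
     (sum(x * x for x in xs) - n * xbar ** 2)
b0 = ybar - b1 * xbar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
se = math.sqrt(sse / (n - 2))
sb1 = se / math.sqrt(sum(x * x for x in xs) - n * xbar ** 2)
t = 3.182  # t[0.975; n-2 = 3 degrees of freedom], for a 95% interval
lo, hi = b1 - t * sb1, b1 + t * sb1
print(lo, hi)  # if the interval excludes zero, the slope is significant at 95%
```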
\1 Confidence intervals for predictions
\2 Confidence intervals tightest near middle of sample
\2 If we go far out, our confidence is low, which makes intuitive sense
\2 $s_e \big(\frac{1}{m}+\frac{1}{n}+\frac{(x_p -\overline{x})^2}{\sum{x^2}- n \overline{x}^2}\big)^\frac{1}{2}$
\2 $s_e$ is the stddev of the errors
\2 $m$ is how many predictions we are making
\2 $x_p$ is the value of $x$ at which we are predicting
\2 $x_p -\overline{x}$ captures the distance from the center of the sample
\2 \textit{Why is it smaller for larger $m$?}
\3 Accounts for variance, assumption of normal distribution
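A quick numeric check of the prediction-stddev formula above (all the inputs are made-up values for illustration):

```python
import math

# Stddev of a predicted mean of m future observations at x = xp:
# s_yp = s_e * (1/m + 1/n + (xp - xbar)^2 / (sum x^2 - n*xbar^2))^(1/2)
def prediction_stddev(se, m, n, xp, xbar, sum_x2):
    return se * math.sqrt(1 / m + 1 / n + (xp - xbar) ** 2 / (sum_x2 - n * xbar ** 2))

# Made-up numbers: 5 samples with xbar = 3, sum x^2 = 55, s_e = 0.17.
near = prediction_stddev(0.17, 1, 5, 3.0, 3.0, 55.0)   # at the center of the sample
far  = prediction_stddev(0.17, 1, 5, 10.0, 3.0, 55.0)  # far outside the sample
print(near < far)  # True: predictions get less certain away from the center
many = prediction_stddev(0.17, 30, 5, 3.0, 3.0, 55.0)  # mean of many observations
print(many < near)  # True: larger m shrinks the 1/m term
```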
\1 Residuals
\2 AKA error values
\2 We can expect several things from them if our assumptions about regressions are correct
\2 They should not show trends: \textit{why would a trend be a problem?}
\3 Tells us that an assumption has been violated
\3 If not randomly distributed for different $x$, tells us there is a systematic error at high or low values - error and predictor not independent
\2 Q-Q plot of error distribution vs. normal distribution
\2 Want the spread (stddev) of the residuals to be constant across the range
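A small sketch of these residual sanity checks, reusing made-up sample data and its least-squares fit:

```python
# Basic residual sanity checks (the data and fit are made-up examples).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]
b0, b1 = 1.09, 0.97              # least-squares fit for this data
resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Least squares guarantees the residuals sum to ~0; a clear trend with x,
# or a spread that grows with x, would signal a violated assumption.
print(abs(sum(resid)) < 1e-6)  # True
low  = [abs(r) for x, r in zip(xs, resid) if x <= 3]
high = [abs(r) for x, r in zip(xs, resid) if x > 3]
print(sum(low) / len(low), sum(high) / len(high))  # spreads should be comparable
```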
\1 For next time
\2 Start filling out your section in cs6963-lab1 repo
\2 Read Chapter 14, linear regression
\2 Be careful to only modify parts of the .tex file for your section
\3 Unless you want to suggest a broader change
\2 Fork it, give your partner access, and send me a merge request before
the start of class Thursday
\2 Check in any notes you create, reference papers
\2 You are empowered to make decisions
\2 Goal is to describe in sufficient detail that people can start
implementing
\2 We will try to finish up our plan by deciding what experiments to run
and how to present results on Thursday
\2 Need the next two paper volunteers; let's get them out before spring