All new accounts created on Gitlab now require administrator approval. If you invite any collaborators, please let Flux staff know so they can approve the accounts.

assign_internals.txt 15.3 KB
Newer Older
Leigh B. Stoller's avatar
Leigh B. Stoller committed
1 2 3 4 5 6
# Copyright (c) 2000-2002 University of Utah and the Flux Group.
# All rights reserved.

7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

This document attempts to describe some of the internals of assign.
Much of the simulated annealing code of assign is not documented here
as it was not written by this author.  What is covered is a high level
overview and details of the scoring code.


assign attempts to solve a graph to graph mapping problem using
simulated annealing.  The 'physical' graph consists of nodes (pnodes),
links (plinks), and switches.  The virtual graph consists of nodes
(vnodes) and links (vlinks).  As assign runs it matches vnodes to
pnodes and vlinks to plinks.  An incremental scoring system keeps
track of the running score, both as a quality (reflected as a low
score) and a number of violations.  A correct solution is a solution
with 0 violations.  A good solution is a correct solution with a low
score.  Assign uses simulated annealing to attempt to minimize both
violations and score.

Simulated Annealing

Simulate annealing works by randomly shuffling the state of a problem
around.  In general changes that result in a increase in score are
ignored while those resulting in a decrease are accepted.  The
problem with this approach is that it is easy to fall into a local
minima.  To avoid this in simulated annealing there is a random chance
of accepting a change with an increase in score.  This chance starts
rather high and as the algorithm proceeds steadily decreases.  The
likelihood of accepting a "bad" change is reflected in the
temperature.  Once the temperature has reach a threshold the algorithm
exits and prints out the best solution found.  A good visual metaphor
is to view the solution space as a hilly landscape with the altitude
reflecting the score.  The current solution is a boulder.  During
simulated annealing the landscape is shaken causing the boulder to
bounce about.  As time goes on the landscape shakes less and less.

Changes and Scoring

The simulated annealing loop interfaces with the scoring system and
the data via two routines, add_node and remove_node.  Both routines
update the data structures appropriately and modify the score and
Christopher Alfeld's avatar
Christopher Alfeld committed
violations count.  add_node takes a vnode and a pnode to map it to
52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
and returns whether that match is even remotely valid or not.
remove_node just takes a vnode.  The initial state of the mapping is
assumed to be the null mapping.  I.e., no vnodes have been mapped and
thus no physical resources used.  This starts with a violation count
equal to the number of vnodes as each unmapped vnode causes one

The Type System

Each vnode has a type and a valid solution must match each vnode to a
pnode that can support that type.  A vnode has a single type but a
pnode may be able to support several types and even have multiple
vnodes in a single vnode.  For example, a standard testbed PC with
four interfaces can either be a test node or up to two delay nodes.  A
pnode that has not been matched to a vnode is called virgin and can
take on any of its types.  Once it has been mapped to a vnode it
becomes typed and is locked into a certain type (that of the vnode).
Christopher Alfeld's avatar
Christopher Alfeld committed
When add_node is called the first thing checked is whether the pnode
71 72
has a suitable type for the vnode.  If it is virgin then all its types
are checked, otherwise only its current type is checked.  If the types
Christopher Alfeld's avatar
Christopher Alfeld committed
can not be matched then add_node returns an invalid condition,
74 75 76 77 78 79 80 81 82 83 84 85 86 87
otherwise it continues on.

Each type of a pnode also has a limit.  This is the number of vnodes
of that type that can fit in the pnode.  A vnode can not be placed in a pnode unless there is room left.

Under no conditions is a vnode mapped to a switch.


Each vlink has the amount of bandwidth is uses.  Each plink has its
bandwidth capacity and two arbitrary strings which are associated with
each end.  These strings are not used for any scoring purposes and are
just passed through, they are their for the use of the user.  A
virtual to physical link mapping comes in one of four types.  DIRECT
89 90 91 92 93 94 95 96 97
links represent physical wires connecting one node directly to
another.  INTRASWITCH links are links that go from a node to a switch
and then from that switch to the destination node.  INTERSWITCH links
go from a node to a switch, then through one or more interconnects to
another switch, and finally to the destination node.  DIRECT mappings
are always one to one, i.e. there is only one plink corresponding to
the vlink.  In INTRASWITCH maps there are two plinks for the vlink;
one to the switch, and one from the switch.  In INTERSWITCH maps there
are at least three (to the switch, at least one interconnect, and from
98 99 100 101
the final switch to the destination) and can be any number.  The final
type is a TRIVIAL link.  This type occurs when trying to assign a
vlink between vnodes that are in the same pnode.  TRIVIAL links impose
zero cost and have no data associated with them.
102 103 104 105 106 107 108 109

For interswitch plinks (switch to switch links) any number of vlinks
can use the plink.  For all other plinks only a single vlink is
allowed to use the plink without violations.  Every vlink beyond the
first results in a violation.  Violations also occur for every vlink
that causes the used bandwidth of a plink to exceed the allowed

110 111 112
A vlink can be marked emulated.  An emulated vlink can share a plink
with other emulated vlinks.

113 114 115 116 117 118 119 120 121
The switches and their interconnects are collectively called the
switch fabric.  The shortest path between any two switches is
precomputed.  This means that every interswitch link between the same
two switches will use the same path.  This also means that any
switches connected by multiple links will only use a single link.  For
bandwidth calculating purposes multiple links between switches should
be collated into a single link.

122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189

Assign defines an equivalence relation of physical nodes and uses this
relation to partition the physical space into equivalence classes.
Simulated annealing is then done on the classes rather than the
individual nodes.  In addition a cost is imposed on each pclass used,
causing assign to prefer solutions with similar nodes.

Two pnodes are equivalent if they have identical type information,
identical feature information, and a bijection of links exists which
preserves destination and bandwidth.  I.e. every link on one node can
be matched to a link on the other node such that they have the same
destination and bandwidth.

The equivalence classes are automatically generated during the
initialization step and stored in a dictionary of arrays, with keys as
types.  During annealing a starting pclass is randomly chosen and then
the pclasses are rotated through until a valid class is found.  As the
list of pclasses is chosen by type assign will never look at pclasses
that can not support the type of the node.

A complication arises in pnodes that can contain multiple vnodes.  To
handle this each pclass has a list of all nodes of a given type.  Say
a node of type T is requested from pclass P.  Assign looks at the list
of T nodes in P and chooses the first one.  It then removes that node
from all other type lists in P.  In addition, if the node is now full
it is removed from the T list in P, otherwise it is left at the front.
When a node is unassigned, if it is now empty it is added to the end
of all type lists in its class.  Otherwise, if it still has a vnode
and hence a type, it is only added to the beginning of the list of its
current type.  By this method assign tends to fill nodes completely
before moving on to a new node.


VClasses allow the user to specify a class of virtual nodes and then
have the vnodes be of that class.  Each vclass has a weight and a list
of allowable types.  Assign attempts to map nodes such that all nodes
of the same vclass have the same type.

To do this, assign keeps track of the dominant type of a vclass.  This
is the type that the majority of members are.  Every time a member of
a vclass is assigned or unassigned the dominant type and score from
that vclass are recalculated.  The score of a vclass (not the member
nodes) is 0 if all members are of the dominant type, and the weight
otherwise.  A weight of 1 or more also generates a vclass violation.

Due to the nature of assign when a member of a vclass is to be mapped
its type must be chosen before it is fitted.  It would be nice if
assign computed all possible pclasses that could satisfy the member,
based on its possible types, and then loop through that list.  This is
future work.  For the moment, when assign sees a member of a vclass it
chooses a type for it.  A little over half the time it will choose the
dominant class, otherwise it chooses randomly.

Fixed Nodes

Fixed nodes are a mechanism by which a user can manually specify part
of the mapping themselves.  Fixed nodes show up in assign as a special
case.  If a node is fixed it is marked as such and mapped to the
correct pnode during initialization.  From then on it can never be

190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214

Assign makes heavy use of LEDA.  Before getting deeply into the code
it would be a good idea to become familiar with LEDA's parameterized
GRAPHs and other basic data structures.

Overview of the Code

CC Files: - Contains the simulated annealing loop and all
initialization code, including main(). - Contains init_score, add_node, remove_node and all
associated helper routines. - Contains parse_top which parses and sets up the virtual
topology. - Contains parse_ptop which parses and sets up the
physical topology.

215 216 217 218 219 - Contains the equivalence relation, the code to generate
the pclasses, and the code to maintain the pclasses. - Contains the code to setup vclasses and calculate score
deltas when nodes are mapped and unmapped..
220 221 222 223 224 225 226 227 228 229 230 231 232 233

Header Files:

common.h - Contains all the parameters including all the score

score.h - Defines the datastructure and prototypes for the scoring

physical.h - Defines the physical elements: pnodes, plinks, and switch
fabric structures.

virtual.h - Defines the virtual elements: vnodes and vlinks.

234 235 236 237
pclass.h - Defines the pclass class.

vclass.h - Defines the vclass class.

238 239 240 241 242 243 244 245 246 247 248 249 250 251

Important Globals

switch_pred_array - The predecessor array to represent the shortest
path between any two switches.

PG - The physical topology.

G - The virtual topology.

bestscore, absbest - The bestscore this iteration and the absolute
best score.

absnodes - Contains the absolute best solution.
253 254 255

unassigned_nodes - Priority queue of nodes to be assigned.

256 257 258 259 260
type_table - Indexed by type, contains a list of pclasses that can
satisfy that type.

fixed_nodes - Indexed by vname, contains the pname the node is fixed
261 262 263 264 265 266 267 268


During the initialization process assign reads in the top and ptop
files, initializes the random number generator, and calculates the
shortest paths between any two switches.  

This next step deserves some comments.  parse_ptop will fill out both
270 271 272 273 274 275 276 277 278 279 280 281 282
the PG, the completely physical topology, and SG, the switch fabric.
The SG graph contains slinks and snodes which exists in one-to-one
correspondence to the switches and interswitch links in PG.  Each
slink and snode has a field, mate, which contains the corresponding
edge or node in PG.  Assign determines edge costs as 10000 minus the
bandwidth of the interswitch link and then uses the DIJKSTRA_T
algorithm from LEDA to compute the shortest paths between all
switches.  These paths are stored in switch_distances.  For a given
switch sw, switch_distances[sw] is a node_array (indexed by node) of
the predecessor node.  This is used by the scoring code to calculate
interswitch paths.  If SCORE_DEBUG is defined the both edge costs and
predecessors are output.

Finally assign computes all physical equivalence classes.
284 285 286 287 288 289 290 291

The Main Loop

A strange sequence of calls that exist for historical reasons
ultimately end up with assign() which is the main simulated annealing
loop.  This particularly code has had at least three and possibly as
many as five different authors and none of them quite understand it.
292 293 294 295 296 297
In essence it randomly chooses an unassigned vnode and then a pclass
that can satisfy the vnode and calls add_node to attempt to map
them.  If add_node succeeds it then compares scores and decides
whether to accept the mapping.  If no vnodes are unassigned then it
randomly chooses an assign vnode, removes it via remove_node and then
randomly maps it to a pnode via add_node.
298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407

The Scoring Code


This initializes the scoring datastructures.  First it sets score and
violations to 0 and blanks the vinfo structure (vinfo stores counts of
each type of violations).  It then looks at all vnodes and add the
cost of an unassigned node to the score and adds a violations.  It
then goes through all vnodes, vlinks, plinks, and pnodes and sets
their data members appropriately for being unused and unassigned.


If pnode is virgin
	If matching type exists then set type of pnode to T
   Else fail
Else if types match and and pnode is not full then we're ok
Else fail

# At this point we know that we can fit the vnode into the pnode with
# all types ok.  We now go onto links.

Foreach vlink in vnode
	If destination is mapped
		# We assign links when both ends are mapped.  Otherwise we leave
		# them unassigned.
		If direct_link exists then map vlink to single direct plink
		Else if vnode and destination are on same switch then
			Find links from vnode to switch to destination and map to them.
			If interswitch path exists
				Map vlink to interswitch path
				# It's possible to score this as a violation instead of
				# a failure however it was found to do better when we failed.

# We now have all links set up we just need to do the last few things:
# increment the load, features and desires, count a pnode used if this
# was virgin, count a switch as used if this is the pnode on the
# switch, and reduce the penalty and score for a newly assigned vnode.
Finish setting up pnode


# This is almost the exist reverse of add_node
Let pnode be what vnode is mapped to.

Foreach vlink in vnode
	Unassociated all plinks
	Remove score for link type and plink costs.
	Remove any violations for bandwidth.

Remove any violations associated with no connections on this node.
Reduce the load and free up pnode is load is now 0.
Free the switch if this was the last node on the switch.
Penalize for an unassigned vnode.
Undo all features and desires.


This routine looks for a direct link from A to B.  It does this in a
pretty basic fashion.  It looks at all edges from A and looks for the
edge going to B with the least users and less bandwidth used.


Very similar to direct link except is going from A to the switch A is
connected to rather than from A to B.


Extremely basic.  Just calls find_link_to_switch on both src and dst.


Another very basic routine.  Just reads off the from the switch_preds
datastructure the shortest path between the src switch and the dst


For non interswitch links this penalizes users beyond the first.  

For all links this keeps track of bandwidth and penalizes going over
bandwidth allowance.


Exact opposite of score_link.