SUBGRADIENT OPTIMIZATION PRIOR TO COLUMN GENERATION — A MEANS FOR PREDICTING OPTIMAL COLUMNS

Abstract: We propose a two-phase combination of two optimization techniques, Lagrangian relaxation and column generation, with the aim of overcoming their respective drawbacks. In a prediction phase, subgradient optimization is used, and the Lagrangian relaxed solutions found are used to initialize a master problem. In a solution phase, column generation is performed. We provide a validation of this two-phase method through an asymptotic result for the prediction phase and give guidelines for its truncated usage. The two-phase method is assessed on a multicommodity network flow problem, for which it performs significantly better than a pure column generation method. We conclude that the subgradient optimization prediction phase can accelerate a column generation method considerably.


Introduction
A well-known way to approach a structured integer linear optimization problem is to exploit the structure in a price-directive decomposition scheme. Often a Lagrangian dual scheme is used, and we can distinguish between search methods and cutting plane methods for solving the dual problem. A popular search method for finding an approximate dual solution is subgradient optimization. Cutting plane methods for the dual problem are also well established, and they are dually equivalent to Dantzig-Wolfe decomposition and column generation applied to a convexified version of the integer problem. Both subgradient optimization and Dantzig-Wolfe decomposition produce lower bounds to the optimal value of the convexified problem (assuming minimization).
An advantage of subgradient optimization is that the move to the next dual iterate is inexpensive, once the Lagrangian relaxed problem, or subproblem, has been solved. A disadvantage is the lack of a good termination criterion. In practice the method is often terminated after a predetermined number of iterations. Another disadvantage of subgradient optimization is that feasible solutions are usually not easily obtained, neither to the integer problem nor to its convexified version, and therefore upper bounds to their optimal values are not easily available. It is however quite common that a Lagrangian heuristic [e.g., 15] is used to manipulate Lagrangian subproblem solutions into feasible solutions to the integer problem and upper bounds to its optimal value.
Advantages of Dantzig-Wolfe decomposition are that feasible solutions to the convexified problem are obtained and that it has finite convergence. The former fact enables early termination, when the gap between the upper and lower bounds to the optimal value of the convexified problem is small enough. A drawback is that each iteration is quite expensive, since a new dual iterate is computed through the reoptimization of a linear program. Another drawback is the inherent instability [e.g., 10, Ch. 15], in the sense that the values of the dual variables may change significantly between iterations, which may lead to a poor convergence behaviour.
To prevent this phenomenon, a stabilization mechanism can be introduced. One such mechanism can be viewed as the boxstep trust-region method [20] applied in the dual space. This stabilization has been used with good results by e.g. Larsson et al. [16] and Westerlund et al. [27]. A similar stabilization technique is proposed by du Merle et al. [7]. It is successfully used in three types of applications.
Another possibility to improve the behaviour of a column generation scheme is to heuristically generate high-quality starting columns [e.g., 19, Sect. 4.1.1]. We propose a specific way of generating high-quality starting columns.
We design a two-phase method that benefits from the advantages of both subgradient optimization and column generation. A prediction phase uses subgradient optimization, during which the Lagrangian subproblem solutions are stored. At termination, a lower bound for the convexified problem and a number of subproblem solutions are at hand. In a following solution phase, these subproblem solutions are used to initialize the Dantzig-Wolfe master problem, whereafter column generation is performed as usual.
The prediction phase can be seen as a heuristic for setting up a high-quality initial master problem, so that fewer columns need to be generated in the solution phase. Alternatively, the solution phase can be thought of as verifying the outcome of the prediction phase. If the latter works perfectly, only one master problem needs to be solved for detecting optimality. We prove that with an appropriate choice of steplengths in the subgradient optimization in the prediction phase, optimal columns in the master problem are found asymptotically.
Some procedures similar to our two-phase method can be found in the literature. In [28], subgradient optimization is performed in a first phase, and in each iteration dual cuts are stored and a Dantzig-Wolfe master problem is solved. The master objective value is used as an upper bound in the Polyak steplength formula (which uses the relaxation parameter value one). If the same cut has been generated too many times, the method switches to a second phase, in which the Dantzig-Wolfe method is used. An advantage of this two-phase method is that it is free from parameters. A clear disadvantage is that a master problem is solved in each subgradient iteration, which can be computationally burdensome. The only information used from the master problem of the first phase is its objective value; in particular, the dual solution is not used.
The idea above is further developed by Vera et al. [26], who report on different criteria to decide when to switch from the first to the second phase. Further, the effect of using a bundle method instead of a linear programming approach in the second phase is investigated. A two-phase combination of subgradient optimization with a bundle method is studied also in [13]. Barahona and Jensen [5] apply Dantzig-Wolfe decomposition to a plant location problem, and every few iterations of this method they run a fixed number of subgradient optimization iterations, starting from the dual solution to the master problem. The master problem receives all subproblem solutions found in these iterations, and the final dual values are used in the Dantzig-Wolfe subproblem. Substantial reductions in computing times are reported, compared to the standard Dantzig-Wolfe method. This line of research is continued in [11], where it is also discussed how to use subgradient optimization for finding approximate dual solutions to Dantzig-Wolfe master problems.
The contributions of our work are as follows. First, an asymptotic result for the prediction phase is shown. This result provides a theoretical basis for a two-phase method. Second, guidelines for practical usage are discussed. Third, computational experiments with multicommodity network flow problems show that the two-phase method can perform favourably compared to the standard Dantzig-Wolfe method.
The remainder of the paper is organized as follows. First, we introduce notation and outline some known results. In Section 3 the asymptotic result is given together with a summary of the two-phase method and a discussion of truncated usage. The application to multicommodity network flows is described in Section 4. Further, Section 5 presents results of some illustrative experiments. In Section 6, numerical results are given. Conclusions and directions for further research are pointed out in Section 7.

Preliminaries
The two-phase method is applicable to linear optimization problems and to convexified integer linear optimization problems. In order to make the presentation as general as possible, we consider the latter, with the problem stated as

[IP]  min c^T x
      s.t. Ax ≥ b,                                   (1)
           x ∈ X ⊂ R^n,

where b ∈ R^m and X is a finite set, say X = {x^1, . . . , x^N}, with N typically being a huge number. The problem is assumed to be easier to solve if constraint (1) is relaxed. Let P be the convexified problem obtained if X is replaced by its convex hull. (If we were considering a linear optimization problem, then X would be assumed to be a polytope, with {x^1, . . . , x^N} being its extreme points.) The Dantzig-Wolfe master problem of P is given by

[MP]  min Σ_{j=1}^{N} (c^T x^j) λ_j
      s.t. Σ_{j=1}^{N} (Ax^j) λ_j ≥ b,               (2)
           Σ_{j=1}^{N} λ_j = 1,                      (3)
           λ_j ≥ 0,  j = 1, . . . , N.

Let u and w be dual variables associated with constraints (2) and (3), respectively. The linear programming dual problem is given by

[DMP]  max b^T u + w
       s.t. (c^T − u^T A) x^j ≥ w,  j = 1, . . . , N,   (4)
            u ≥ 0,

and it is equivalent to the Lagrangian dual problem

[LD]  max_{u ≥ 0} h(u) := b^T u + min_{x ∈ X} (c^T − u^T A) x.

Let U* be its set of optimal solutions. Let v(·) denote the optimal value of a problem. Then, since problem P is convex, v(IP) ≥ v(P) = v(MP) = v(DMP) = v(LD) holds. A subgradient optimization method for problem LD starts from an arbitrary feasible dual point, u^0, and generates a sequence of dual iterates according to

u^{s+1} = max{ 0, u^s + t_s γ^s },  s = 0, 1, . . . ,   (5)

where the maximum is taken component-wise, and t_s and γ^s are the steplength and the subgradient, respectively, used in iteration s. Here, γ^s = b − Ax(u^s) ∈ ∂h(u^s), with x(u^s) being the Lagrangian subproblem solution in iteration s. The following convergence result is shown in [14]; see also [6, Ch. 3, Th. 4.5].

Proposition 1. If the steplengths fulfil the divergent series condition

t_s > 0, s = 0, 1, . . . ,   t_s → 0,   and   Σ_{s=0}^{∞} t_s = +∞,   (6)

then h(u^s) → v(LD) and the sequence {u^s} converges to a point u* ∈ U*.
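To make the iteration (5) concrete, the following sketch applies the projected subgradient update to a tiny invented instance (the data c, A, b and the finite set X are illustrative assumptions, and the Lagrangian subproblem is solved by enumeration, which is only sensible for toy-sized X):

```python
import numpy as np

# Hypothetical tiny instance: min c^T x  s.t.  Ax >= b,  x in X (finite).
c = np.array([3.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.5])
X = [np.array(v, dtype=float) for v in ((0, 0), (1, 0), (0, 1), (1, 1))]

def solve_subproblem(u):
    """Return x(u) minimizing (c - A^T u)^T x over the finite set X."""
    reduced = c - A.T @ u
    return min(X, key=lambda x: float(reduced @ x))

u = np.zeros(1)                          # u^0: arbitrary feasible dual point
for s in range(200):
    x_u = solve_subproblem(u)
    gamma = b - A @ x_u                  # subgradient of h at u^s
    t = 1.0 / (s + 1)                    # divergent series steplength, cf. (6)
    u = np.maximum(0.0, u + t * gamma)   # projected update (5)

# Dual objective value at the final iterate: a lower bound on v(LD) = v(P)
h = b @ u + min(float((c - A.T @ u) @ x) for x in X)
```

For this instance the iterates oscillate around the dual optimum, and the final value of h is a lower bound on v(P).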
The divergent series requirement (6) is not as restrictive as it might seem, since any steplength selection rule can easily be modified to fulfil (6) by replacing tentative steplengths with their projections onto the intervals [α/(s + 1), β/(s+1)], where β > α > 0. Because the projected steplengths are then trapped between two series that fulfil (6), this also holds for the projected steplengths.
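This modification can be sketched in one small helper; the interval bounds alpha and beta below are arbitrary illustrative choices:

```python
def project_steplength(t, s, alpha=1e-4, beta=10.0):
    """Project a tentative steplength t for iteration s onto the interval
    [alpha/(s+1), beta/(s+1)]. The projected steplengths then tend to zero
    while their sum diverges (as the harmonic series does), so the
    divergent series condition (6) is guaranteed."""
    return min(max(t, alpha / (s + 1)), beta / (s + 1))
```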
Below we give a method for solving problem P by computing a weighted average of the subproblem solutions encountered when applying subgradient optimization to the Lagrangian dual problem. The method is outlined in [24, p. 117]. Larsson and Liu [14] proved that it asymptotically finds optimal primal solutions if the steplengths fulfil the condition (6). Both references consider application of Lagrangian relaxation to a linear optimization problem. Let

x̄(l) = ( Σ_{s=0}^{l−1} t_s x(u^s) ) / ( Σ_{s=0}^{l−1} t_s ),  l = 1, 2, . . . .   (7)

The expression (7) defines a convex combination of subproblem solutions. This combination can alternatively be expressed as

x̄(l) = Σ_{j=1}^{N} λ_j(l) x^j,   (8)

where

λ_j(l) = ( Σ_{s ∈ S_j(l)} t_s ) / ( Σ_{s=0}^{l−1} t_s ),  j = 1, . . . , N,   (9)

with

S_j(l) = { s ∈ {0, . . . , l − 1} : x(u^s) = x^j }.

Results similar to the proposition below have been shown by Larsson and Liu [14] and Larsson et al. [17], for linear and convex optimization, respectively.

Proposition 2. Let the steplengths in the dual subgradient procedure (5) applied to LD be chosen so that condition (6) is fulfilled. Let the sequences {x̄(l)} and {λ(l)} be defined by formulas (7) and (9), respectively. Then each of the sequences {x̄(l)} and {λ(l)} has at least one accumulation point, and any accumulation points of these sequences are optimal in P and MP, respectively.
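The ergodic averaging (7)-(9) can be sketched as follows on an invented toy instance (the data and the steplength rule are illustrative assumptions). Each distinct subproblem solution accumulates a weight equal to the sum of the steplengths of the iterations in which it occurred, and the collection is started only after a number of initial iterations so that early subproblem solutions do not dominate the average:

```python
import numpy as np

# Invented toy instance: min c^T x  s.t.  Ax >= b,  x in X (finite).
c = np.array([3.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.5])
X = [np.array(v, dtype=float) for v in ((0, 0), (1, 0), (0, 1), (1, 1))]

u = np.zeros(1)
weights, t_sum = {}, 0.0      # index j of x^j -> numerator of (9)
delay = 100                   # start averaging only after some iterations
for s in range(2000):
    reduced = c - A.T @ u
    j = min(range(len(X)), key=lambda i: float(reduced @ X[i]))  # subproblem
    t = 1.0 / (s + 1)         # steplengths fulfilling condition (6)
    if s >= delay:
        weights[j] = weights.get(j, 0.0) + t
        t_sum += t
    u = np.maximum(0.0, u + t * (b - A @ X[j]))     # update (5)

lam = {j: w / t_sum for j, w in weights.items()}    # convexity weights (9)
x_bar = sum(l_j * X[j] for j, l_j in lam.items())   # ergodic average (8)
```

Here x_bar approaches an optimal solution of the convexified problem, although a finitely truncated average is typically only near-feasible.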
Since the subgradient optimization scheme is memoryless and any iterate can be considered to be the initial one, the result of this proposition holds also when the generation of the sequences {x(l)} and {λ(l)} is delayed for an arbitrary number of iterations, with the formulas (7) and (9) modified accordingly.
A clear drawback of the method implied by the proposition is that when the subgradient optimization scheme is terminated finitely, the averaged primal solution is typically only near-feasible (but also near-optimal, disregarding its infeasibility), which is due to the inherent dual character of the scheme. The practical use of such a solution is thus not immediate. For some applications, small primal infeasibilities may not matter. In others, tailored heuristics can be used to modify the near-feasible solution into a feasible and near-optimal solution; see e.g. [14] on this topic. Some further related works are [23,17,4,22,1,9].
Below, a slightly different strategy is employed. Subgradient optimization is performed and Lagrangian subproblem solutions are collected, but instead of assigning predetermined weights to these, as above, a restricted Dantzig-Wolfe master problem is constructed and solved in order to compute optimal weights. (Feasibility of this master problem can be assured by a proper use of starting columns.) If needed, column generation is then performed until overall optimality (or sufficient near-optimality) is reached.

The Two-Phase Method
We here develop the use of subgradient optimization to predict optimal basic columns in a Dantzig-Wolfe master problem. First, an asymptotic result is given. Then the method is outlined. Finally we discuss practical issues for truncated usage.
Let s̄ be a nonnegative integer, define the set

J_s̄ = { j : x^j = x(u^s) for some s ≥ s̄ },

and consider the restricted master problem

[MP(s̄)]  min Σ_{j ∈ J_s̄} (c^T x^j) λ_j
         s.t. Σ_{j ∈ J_s̄} (Ax^j) λ_j ≥ b,
              Σ_{j ∈ J_s̄} λ_j = 1,
              λ_j ≥ 0,  j ∈ J_s̄.

Theorem 3. Let the steplengths in the dual subgradient procedure (5) applied to LD be chosen so that condition (6) is fulfilled, and let the sequence {λ(l)} be defined by formula (9). Let u* be the limit point of the sequence {u^s}, according to Proposition 1, and let w* solve DMP for u = u*. Define the reduced costs c̄_s = (c^T − u*^T A) x(u^s) − w*. Then v(MP(s̄)) = v(P) holds for any nonnegative integer s̄. Further, c̄_s = 0 holds for every s that is sufficiently large.
Proof. According to Proposition 2, the sequence {λ(l)} has an accumulation point, say λ̄, which is optimal in MP. If λ̄_j > 0, then x(u^s) = x^j holds for at least one s ≥ s̄, which in turn implies that the problem MP(s̄) contains the variable λ_j. Then the solution λ_j = λ̄_j, j ∈ J_s̄, is feasible in MP(s̄), and it follows that

v(MP(s̄)) ≤ Σ_{j ∈ J_s̄} (c^T x^j) λ̄_j = v(MP) = v(P).

Since MP(s̄) is a restriction of MP, the reverse inequality v(MP(s̄)) ≥ v(MP) also holds, which proves the first conclusion. From the equivalence between DMP and LD follows that

w* = min_{x ∈ X} (c^T − u*^T A) x.   (10)

Since the Lagrangian dual objective function h is polyhedral, x(u^s) solves the Lagrangian subproblem for both u = u* and u = u^s when s is large enough. Hence, both γ^s ∈ ∂h(u^s) and γ^s ∈ ∂h(u*) hold, so that

(c^T − u*^T A) x(u^s) = min_{x ∈ X} (c^T − u*^T A) x,

which together with (10) yields that

c̄_s = (c^T − u*^T A) x(u^s) − w* = 0,

which proves the second conclusion. □
According to the theorem, all subproblem solutions needed to construct a master problem that solves problem P will eventually be found by the subgradient optimization, no matter when the construction of the master problem begins (that is, subproblem solutions from early iterations are never indispensable). Further, in case of dual non-degeneracy, late iterations will only provide subproblem solutions that give optimal basic columns in the master problem. (In case of dual degeneracy, non-basic columns with zero reduced costs can also be obtained.) These results justify the use of subgradient optimization as a means for predicting optimal basic columns, before applying column generation.
Proposition 2 provides solution procedures that asymptotically solve problem P or M P . Finitely generated solutions are however typically infeasible in the respective problems. In contrast, a finitely generated problem M P (s) can be used for finding feasible and also near-optimal solutions to problem M P (and P ). The important difference between the averaged solutionx(l) and the solution to P obtained from a finitely generated problem M P (s), is, clearly, that the former is defined by the expression (8) with the convexity weights given a priori by formula (9), while in the latter case the values of the convexity weights are optimized with respect to the objective in P and the constraint (1).
Our two-phase method works as follows. In the prediction phase of the method, s̄ subgradient optimization iterations are performed, with steplengths computed according to a rule such that (6) holds. Subproblem solutions are collected from iteration s̲ ≤ s̄ onwards. On termination of the prediction phase, a lower bound v̲(LD) on v(LD) and a collection of subproblem solutions have been obtained. First, in the solution phase, an initial master problem is constructed, using the subproblem solutions already collected. (Other starting columns may of course also be included.) Thereafter, column generation in a Dantzig-Wolfe fashion is performed, until the relative gap between upper and lower bounds on the optimal value does not exceed ε ≥ 0. [The lower bound is initialized to v̲(LD).]

We last discuss practical considerations of a truncated use of the solution procedure suggested by Theorem 3. Ideally, one would like to find all optimal basic columns in MP, which corresponds to finding all constraints (4) that are fulfilled with equality at an optimal solution to DMP, that is, the segments of the Lagrangian dual function that are active at an optimal solution. (We ignore the issue of degeneracy.) In order to find these constraints (or most of them) in a finite number of iterations, the subgradient steplengths need to ensure that the dual iterates reach near-optimality, that is, come close to u* (as defined in Proposition 1), while they also oscillate around u*. Below we discuss the consequences of failing to meet either of these two requirements. In this respect there are two pitfalls: the steplengths can be too small or too large.
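The two-phase method can be sketched as follows on an invented toy instance. Enumeration over a small finite X stands in for the Lagrangian subproblem, scipy's LP solver stands in for the restricted master problem, and the choices of 200 subgradient iterations with collection over the last 10 are arbitrary illustrative settings:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy instance: min c^T x  s.t.  Ax >= b,  x in X (finite).
c = np.array([3.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.5])
X = [np.array(v, dtype=float) for v in ((0, 0), (1, 0), (0, 1), (1, 1))]

def subproblem(u):
    """Lagrangian subproblem: minimize (c - A^T u)^T x over X."""
    reduced = c - A.T @ u
    return min(X, key=lambda x: float(reduced @ x))

# --- Prediction phase: subgradient optimization, collecting late columns ---
u = np.zeros(len(b))
columns = []
for s in range(200):
    x_u = subproblem(u)
    if s >= 190 and not any(np.array_equal(x_u, x) for x in columns):
        columns.append(x_u)                        # store subproblem solution
    t = 1.0 / (s + 1)                              # steplengths fulfilling (6)
    u = np.maximum(0.0, u + t * (b - A @ x_u))

# --- Solution phase: Dantzig-Wolfe column generation from stored columns ---
for _ in range(50):
    costs = np.array([float(c @ x) for x in columns])
    Aub = -np.column_stack([A @ x for x in columns])    # -(Ax^j) lam <= -b
    Aeq = np.ones((1, len(columns)))                    # convexity constraint
    res = linprog(costs, A_ub=Aub, b_ub=-b, A_eq=Aeq, b_eq=[1.0],
                  method="highs")
    u_mp = -res.ineqlin.marginals                  # duals of (2), u >= 0
    w_mp = res.eqlin.marginals[0]                  # dual of (3)
    x_new = subproblem(u_mp)
    if float((c - A.T @ u_mp) @ x_new) - w_mp >= -1e-9:
        break                                      # no negative reduced cost
    columns.append(x_new)

v_P = res.fun   # optimal value of the convexified problem P
```

In a well-predicted run, the columns collected in the prediction phase already contain an optimal basis, and the solution phase terminates after solving a single master problem.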
The two pitfalls are illustrated on the fictitious Lagrangian dual problem in Figure 1 (a one-dimensional example of DMP with six constraints). Suppose first that the steplengths decrease too rapidly. Then the dual iterates may never reach the vicinity of u* or they may approach this point without oscillating around it, and therefore all constraints (4) needed to solve problem DMP may not be found. If for example u^0 = 0 and the steplengths are too small, then maybe only the segments c_1 and c_2 have been visited at termination (in which case MP(s̄) becomes infeasible). Suppose instead that the steplengths decrease too slowly. Then the dual iterates may never come close to u* and the constraints needed to solve DMP may therefore never be found. If again u^0 = 0, it can in this case happen that only the segments c_1 and c_6 have been visited at termination. In either of these cases the quality of the problem MP(s̄) may be poor.

In order to make an accurate truncated prediction of optimal basic columns, the subgradient optimization steplengths must thus be chosen so that the iterates reach the vicinity of u*, while they also need to decrease slowly enough to cause the iterates to oscillate around u*. The first requirement coincides with the usual one in subgradient optimization, when the goal is to compute a strong lower bound, while the second one is a supplement.

The Multicommodity Network Flow Problem
The two-phase method is expected to be advantageous in applications where the master problems are computationally burdensome, since the prediction phase should reduce the number of master problems to be solved. Inexpensive subproblems also favour the two-phase method, since many subgradient iterations can then be afforded, allowing high-quality columns to be found. With subproblems that are instead costly, it is likely that too much computing time will be spent in the prediction phase, and the two-phase method is then less promising.
We perform experiments with a linear optimization problem with the former characteristics, namely a linear minimum-cost multicommodity network flow (MCNF) problem where each commodity is given by an origin-destination (OD) pair. The problem can be described by an arc-node formulation or by a path formulation.
Consider a directed graph G = (N, A), with N and A being the sets of nodes and directed arcs, respectively. The sets of arcs emanating from and terminating at node i are denoted by A_i^+ and A_i^-, respectively. Let K be the set of commodities. Let o(k) and d(k) be the origin and the destination, respectively, for commodity k, and let d_k be the demand. Further, c_ka denotes the unit cost of sending flow of commodity k on arc a and u_a denotes a total flow capacity on the arc. With variable x_ka denoting the flow of commodity k on arc a, the arc-node formulation is as follows.

[MCNF]  min Σ_{k ∈ K} Σ_{a ∈ A} c_ka x_ka
        s.t. Σ_{a ∈ A_i^+} x_ka − Σ_{a ∈ A_i^-} x_ka = d_k if i = o(k), −d_k if i = d(k), and 0 otherwise,  i ∈ N, k ∈ K,
             Σ_{k ∈ K} x_ka ≤ u_a,  a ∈ A,
             x_ka ≥ 0,  a ∈ A, k ∈ K.

The number of flow variables is clearly |A| · |K|, and it typically grows fast with the size of the network. If the flow cost does not vary by commodity, the number of variables can be reduced considerably by aggregation of commodities with respect to their source (or, alternatively, destination) nodes. Such aggregation may however impair the performance of some methods, and in particular Dantzig-Wolfe decomposition [12].
In the path formulation of MCNF, we let P_k be the set of elementary paths from o(k) to d(k), and let the parameter δ_kpa = 1 if path p ∈ P_k uses arc a and δ_kpa = 0 otherwise. With c_kp being the cost of path p, that is, c_kp = Σ_{a ∈ A} δ_kpa c_ka, and the variable h_kp being the flow on path p ∈ P_k, the problem can be stated as below.

[MCNF-P]  min Σ_{k ∈ K} Σ_{p ∈ P_k} c_kp h_kp
          s.t. Σ_{p ∈ P_k} h_kp = d_k,  k ∈ K,
               Σ_{k ∈ K} Σ_{p ∈ P_k} δ_kpa h_kp ≤ u_a,  a ∈ A,
               h_kp ≥ 0,  p ∈ P_k, k ∈ K.

The two formulations are tightly related through Dantzig-Wolfe decomposition. When applying this method to the arc-node formulation, the subproblem separates over commodities. For commodity k, an optimal subproblem solution is an extreme point of the set defined by the flow conservation and nonnegativity constraints for the commodity, and it describes a flow of strength d_k along a path in P_k that is shortest with respect to the subproblem cost function. This generation of columns to the Dantzig-Wolfe master problem is equivalent to a direct application of column generation to the formulation MCNF-P. In fact, the master problem is obtained from MCNF-P by introducing new variables that equal h_kp/d_k. The equivalence between MCNF-P and a Dantzig-Wolfe reformulation of the arc-node formulation of MCNF was observed in [25].
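The per-commodity pricing subproblem described above, a shortest path with respect to dual-adjusted arc costs, can be sketched with a minimal Dijkstra implementation; the four-node example data and the arc price are invented for illustration, and the adjusted cost c_a + u_a assumes a nonnegative price u_a derived from the capacity constraint's dual value:

```python
import heapq

def cheapest_path(arcs, costs, prices, origin, dest):
    """Pricing subproblem for one commodity: shortest origin-dest path with
    respect to the dual-adjusted arc costs c_a + u_a (u_a >= 0)."""
    adj = {}
    for (i, j), c_a, u_a in zip(arcs, costs, prices):
        adj.setdefault(i, []).append((j, c_a + u_a))
    dist, pred = {origin: 0.0}, {}
    heap = [(0.0, origin)]
    while heap:
        d, i = heapq.heappop(heap)
        if i == dest:
            break
        if d > dist.get(i, float("inf")):
            continue                      # stale heap entry
        for j, w in adj.get(i, ()):
            if d + w < dist.get(j, float("inf")):
                dist[j], pred[j] = d + w, i
                heapq.heappush(heap, (d + w, j))
    path, node = [], dest
    while node != origin:                 # recover the arc sequence
        path.append((pred[node], node))
        node = pred[node]
    return list(reversed(path)), dist[dest]

# Hypothetical 4-node example: a price on the direct arc diverts the flow.
arcs   = [(0, 1), (1, 3), (0, 2), (2, 3), (0, 3)]
costs  = [1.0, 1.0, 2.0, 2.0, 1.5]
prices = [0.0, 0.0, 0.0, 0.0, 2.0]        # dual price on the direct arc (0, 3)
path, cost = cheapest_path(arcs, costs, prices, 0, 3)
```

Scaled by the demand d_k, the returned path defines one column of the master problem.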
It was early recognized [e.g., 2] that price-directive decomposition is a promising approach for designing efficient solution methods for the MCNF problem. Straightforward Dantzig-Wolfe decomposition, that is, a cutting plane approach to the Lagrangian dual problem, is however not competitive, and several researchers have proposed other strategies for solving the Lagrangian dual problem. A bundle method is presented in [8], and in [18] it is proposed to use an augmented Lagrangian function to improve the rate of convergence in the dual space. In [3], encouraging results of applying a proximal analytic center cutting plane method are reported.
We have used two sets of test problems, which are among those used in [18] and [3]. For both sets, the unit flow cost varies by arc but not by commodity. The first set comprises planar networks, with nodes randomly generated in the plane and arc costs that are the Euclidean distances. Arcs are created in ascending order of Euclidean distance, and so that the network becomes planar. The origins and destinations of commodities are randomly chosen pairs of nodes. The arc capacities and commodity demands are uniformly distributed. The networks in the second problem set have a grid topology. For an n × n grid, the network has n^2 nodes and 4n(n − 1) directed arcs. Capacities and commodities are generated as for the planar networks, but the arc costs have a uniform distribution.

Table 1 gives properties of the problem instances. The first two columns give the name of the instance and the numbers of nodes, arcs and OD-pairs. The optimal value is z*. We also compute the optimal value that is obtained if all arc capacity constraints are relaxed, denoted z̄. The ratio τ = (z* − z̄)/z* measures how restrictive the arc capacities are. Further, let ρ be the portion of the capacity constraints that have non-zero dual variables when the final master problem has been solved. This value is another measure of how restrictive the arc capacities are. Finally, ς is the number of paths used in the final master problem solution divided by the number of OD-pairs, that is, the average number of paths used per OD-pair in an optimal solution.

We have also made experiments on a set of modified instances, obtained from the instances planar500 and grid11. The modified instances are constructed by scaling the demand in each OD-pair by a given factor. For example, the instance planar500 2.5 is obtained from the instance planar500 by scaling all demands by the factor 2.5.
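Grid instances of the kind described above can be generated along the following lines; this is a sketch, and the uniform distribution bounds are illustrative assumptions, not the values used in the experiments:

```python
import random

def grid_instance(n, seed=0):
    """Generate an n-by-n grid network: n^2 nodes and 4n(n-1) directed arcs
    (each grid edge in both directions). Costs and capacities are drawn
    uniformly; the bounds below are illustrative assumptions only."""
    rng = random.Random(seed)

    def node(r, c):
        return r * n + c

    arcs = []
    for r in range(n):
        for c in range(n):
            if c + 1 < n:   # horizontal grid edge, both directions
                arcs += [(node(r, c), node(r, c + 1)),
                         (node(r, c + 1), node(r, c))]
            if r + 1 < n:   # vertical grid edge, both directions
                arcs += [(node(r, c), node(r + 1, c)),
                         (node(r + 1, c), node(r, c))]
    cost = {a: rng.uniform(1.0, 10.0) for a in arcs}
    cap = {a: rng.uniform(5.0, 50.0) for a in arcs}
    return arcs, cost, cap

arcs, cost, cap = grid_instance(4)   # 16 nodes, 4*4*(4-1) = 48 directed arcs
```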
With larger demands, the flow will be more congested, more arc capacities will be saturated and the optimal objective value will increase, as can be seen in Table 2.

An Illustration of Truncated Performance
Below we report on experiments that illustrate truncated usage of subgradient optimization as a means for predicting optimal columns, and especially how the quality of the columns found depends on the choice of parameter values. For this purpose, the instance grid8 is used. All input and output parameters are given in Table 3. (Here, D-W stands for Dantzig-Wolfe. Some of the parameters are not used until later.) The subgradient steplength in iteration s is t_s = a/s, where a is a positive parameter. For the experiments reported in this section, a = 1.0 is used.

In a first experiment, the quality of the prediction is measured in terms of the number of columns that need to be generated in the solution phase in order to reach optimality. Figure 2 shows how the quality of the column prediction depends on s̲ and σ. Each curve in the figure corresponds to a given value of σ. Two patterns can be seen. First, for a given value of s̲, a more accurate prediction is observed for higher values of σ. Second, for a given value of σ it is typically better to do many subgradient iterations before beginning to collect columns.
In another experiment, the value of σ was fixed to 20 and four runs were made with s̲ = 1, 100, 200, and 400, respectively. The objective value of the first master problem was compared with the optimal value, and a relative gap was computed. The resulting relative gaps were 0.25%, 0.03%, 0.007%, and 0.002%, respectively. In a similar experiment, s̲ was fixed to 150 and five runs were made with σ = 5, 10, 15, 20, and 25, which gave relative gaps of 0.14%, 0.04%, 0.02%, 0.01%, and 0.007%, respectively.
We conclude that the quality of the predicted columns improves with increasing values of s̲ and σ, as expected. There is of course a trade-off between the quality of the prediction and its computational cost. With a large value of s̲, many subgradient iterations are performed, which can become costly. Also, with a large value of σ, the first master problem will contain more columns, which can make the solution phase more costly.

Numerical Experiments
We here compare the performance of the two-phase method to that of the standard Dantzig-Wolfe method on the MCNF problem. The aim is only to show that the prediction phase can significantly accelerate the column generation, and not to compare with other solution approaches for this application. After some initial experiments, the parameter values s̄ = 200 and s̲ = 190 were fixed, giving σ = 10. The value of the steplength parameter a was calibrated for different instances. For all grid instances, a = 0.01 is used. For all planar instances, except for planar500, a = 10.0 is used. For planar500 and for the modified planar500 instances, a = 1.0 and a = 5.0 are used, respectively.
The results are presented in Tables 4 to 7. The contents of the columns in the tables are given in Table 3. In Tables 4 and 5, numerical results with solution accuracy ε = 10^−6 are presented. In Tables 6 and 7, the solution accuracy is ε = 10^−3.
From the CPU time ratios in Table 4 we conclude that the two-phase method performs well compared to the standard Dantzig-Wolfe method, especially for the larger instances. Further, it is the solution phase of the two-phase method that is most computationally costly, again especially for the larger instances.
To study the effect of the tightness of the arc capacities, we used the set of modified instances. As can be seen in Tables 2 and 5, for higher demands in the OD-pairs, the value of ρ increases, the number of paths needed in each OD-pair increases, the advantage of using the two-phase method becomes more and more evident, and the fraction of the CPU time spent in the second phase increases. (The last effect is due to the CPU time in the first phase being almost unchanged while the CPU time in the second phase grows.)

Table 6 shows results for the solution accuracy ε = 10^−3. When comparing with Table 4, some differences can be seen. Considering the values of cg for the planar instances, somewhat smaller numbers are obtained in Table 6. For the grid instances, however, the values of cg are significantly smaller. For the instances grid11, grid12 and grid13, all columns that are needed for solving the problems to an accuracy of ε = 10^−3 are found in the prediction phase.
When comparing the values of cg in Tables 5 and 7, it can be seen that there are no large differences for the modified planar500 instances. For the grid11 1.25 and grid11 1.5 instances, however, the prediction phase was successful in finding the columns needed to solve the problem for the solution accuracy ε = 10^−3.

Conclusions and Further Research
We have presented a two-phase method for optimization problems that benefit from price-directive decomposition. The method comprises a prediction phase, using subgradient optimization, and a solution phase, using column generation in a Dantzig-Wolfe environment. The combination is founded on an asymptotic result for the prediction phase. A truncated usage of the prediction phase can yield an initial Dantzig-Wolfe master problem of high quality, resulting in a shorter computing time for the overall method. In order to benefit from the use of the two-phase method, the subproblems must be computationally inexpensive compared to the master problems, since a relatively large number of subgradient optimization iterations needs to be performed. The two-phase method is applied to a multicommodity network flow problem, which has the characteristics that should be advantageous for the method, and our numerical experiments show that it indeed performs favourably compared to standard Dantzig-Wolfe decomposition, in terms of computing times. Of particular interest is that the reduction in computing time tends to increase with the size of the instance and the tightness of the arc capacities, that is, for instances that are more computationally demanding.
A drawback of the two-phase method is its several control parameters. Suitable values for most of these can, however, be chosen according to simple rules. Our experience is that only a single parameter needs more careful calibration, namely the one that controls the steplengths in the subgradient optimization scheme. A topic for further research would be to try to incorporate into the framework of our two-phase method a subgradient algorithm without any control parameter that needs to be calibrated.

Table 7: Results for modified instances, with accuracy ε = 10^−3.