GAN experiment (Computer Vision)
A Generative Adversarial Network (GAN) is a Deep Learning model that has two actors: the Discriminator and the Generator. The goal in training a GAN is to have the Generator create realistic-looking fake data samples from noise, while the Discriminator learns to distinguish between the true data and the fake data that came from the Generator.
Generative Adversarial Networks (GANs) and their variants are notorious for their training instability. Since their introduction in 2014 by Ian Goodfellow, countless variations in architectures and novel loss functions have been proposed. A small handful are pictured below:
At Neurales, we are applying neuroevolution to GAN training. This is where we initialize a population of N chromosomes (GAN architectures), and evolve them over M generations. After the evolution finishes, we are left with GANs that have “good” performance, quantified by a fitness function.
The idea of neuroevolution isn’t new, but we are applying the idea in an interesting way.
First, we use an “elitist” selection approach, rather than a crossover-mutation approach. Every generation, a fraction of the best models is kept, while the others are discarded and new random architectures are initialized. This process repeats every generation.
As seen in the Adversarial Swans post (also on our website), elitist selection leaves the surviving chromosomes unchanged from generation to generation, which guarantees that the maximum fitness stays constant or improves. The catch is that the maximum fitness may stay constant for many generations, with no clear indicator of eventual improvement. With our GAN models this is mostly, but not entirely, true. Since we retrain every model (including the most fit ones), the maximum fitness may go down from generation to generation under different “initial conditions”. This is because the learning rates (among other hyperparameters) for the generator and discriminator networks are chosen at random, and GAN training is incredibly sensitive to learning rate choices.
Naturally, there are countless considerations which will not be investigated on our first “run” of the evolutionary algorithms. We could consider learning rate schedulers which decay the learning rate after different numbers of epochs (and by different factors). We could consider using different loss functions with and without regularization. We could consider varying the optimizer chosen from chromosome to chromosome.
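The elitist loop described above can be sketched in a few lines. This is a minimal illustration, not the actual Neurales implementation: the chromosome is a hypothetical stand-in (a single number instead of a GAN architecture), `toy_fitness` is made up, and `keep_fraction=0.3` is an assumed value since the text only says "a fraction".

```python
import random

def random_chromosome(rng):
    # Hypothetical stand-in for a randomly initialized GAN architecture.
    return rng.uniform(0.0, 1.0)

def toy_fitness(chromosome):
    # Made-up deterministic fitness that peaks at 0.5.
    return 1.0 - abs(chromosome - 0.5)

def evolve(fitness_fn, population_size=10, generations=20,
           keep_fraction=0.3, seed=0):
    """Elitist selection: keep the best fraction of the population and
    replace everything else with fresh random chromosomes."""
    rng = random.Random(seed)
    population = [random_chromosome(rng) for _ in range(population_size)]
    n_keep = max(1, int(population_size * keep_fraction))
    best_per_generation = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness_fn, reverse=True)
        best_per_generation.append(fitness_fn(ranked[0]))
        elites = ranked[:n_keep]  # carried over unchanged
        newcomers = [random_chromosome(rng)
                     for _ in range(population_size - n_keep)]
        population = elites + newcomers
    return best_per_generation

history = evolve(toy_fitness)
```

Because `toy_fitness` is deterministic, the best fitness in `history` never decreases. Replacing it with a fitness that depends on a fresh training run with randomly drawn learning rates removes that guarantee, which is the caveat described above.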
CIFAR10, CIFAR100: Images resized to 64 x 64
ImageNet: Images all resized to 200 x 200
Optimizer: SGD with learning rate $10^{-5} < \eta < 10^{-4}$ (for both the Discriminator and Generator networks) and momentum $0.05 < \alpha < 0.9$.
Batch size: 64
Epochs: 10 for CIFAR, 50 for ImageNet
Chromosomes: 10 for CIFAR, 30 for ImageNet
Generations: 20 for CIFAR, 200 for ImageNet
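As a rough sketch, the settings above could be collected into a small configuration, with the SGD hyperparameters drawn per chromosome. The dictionary layout and function names are hypothetical, and two points are assumptions the text does not confirm: values are sampled uniformly, and the two networks get independent learning rates but share one momentum value.

```python
import random

# Values copied from the settings list; the layout itself is illustrative.
EXPERIMENT = {
    "cifar": {"image_size": 64, "epochs": 10,
              "chromosomes": 10, "generations": 20},
    "imagenet": {"image_size": 200, "epochs": 50,
                 "chromosomes": 30, "generations": 200},
}
BATCH_SIZE = 64

def sample_sgd_settings(rng):
    """Draw SGD settings for one chromosome (uniform sampling assumed)."""
    return {
        "generator_lr": rng.uniform(1e-5, 1e-4),
        "discriminator_lr": rng.uniform(1e-5, 1e-4),
        "momentum": rng.uniform(0.05, 0.9),
    }

def total_training_runs(dataset):
    """Training runs needed if every chromosome is retrained each generation."""
    cfg = EXPERIMENT[dataset]
    return cfg["chromosomes"] * cfg["generations"]
```

With these numbers, retraining every chromosome each generation means 200 training runs for CIFAR and 6,000 for ImageNet.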
Taking a 1 x N noisy vector and feeding it into the generator is common, but in this experiment, we try to feed the generator a noisy 3-Tensor (channels x width x height), where we use between 2 and 5 channels, with width and height between 25% and 75% of the image sizes (so 16-48 height/width for CIFAR, and 50-150 height/width for ImageNet). A two-channel noisy image which could be fed into the network is pictured below:
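A minimal sketch of how such an input could be drawn, using plain Python lists rather than a tensor library. The function names are hypothetical, and two choices are assumptions: the noise is square (width equals height) and Gaussian. The text specifies only the channel count and the 25% to 75% size range.

```python
import random

def sample_noise_shape(image_size, rng):
    """Pick (channels, width, height) for the generator input."""
    channels = rng.randint(2, 5)              # 2 to 5 channels, inclusive
    low = image_size // 4                     # 25% of the image size
    high = (3 * image_size) // 4              # 75% of the image size
    side = rng.randint(low, high)             # assumption: square noise
    return (channels, side, side)

def sample_noise(shape, rng):
    """Fill a channels x width x height nested list with Gaussian noise."""
    channels, width, height = shape
    return [[[rng.gauss(0.0, 1.0) for _ in range(height)]
             for _ in range(width)]
            for _ in range(channels)]
```

For a 64 x 64 target this yields sides between 16 and 48, and for 200 x 200 between 50 and 150, matching the ranges quoted above.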
Before discussing the layers and other hyperparameters in the GAN itself, it’s important to discuss the fitness function chosen for the GANs. The fitness function favors models with fewer parameters as well as those having lower loss on the last epoch of training.
In this experiment, we set $\lambda = 3/4$ and $\gamma = 1/4$.
Here, the P value is defined as:
$P = \dfrac{5 \times 10^{6}}{\sum_i (G_i + D_i)}$ (CIFAR100)
$P = \dfrac{5 \times 10^{7}}{\sum_i (G_i + D_i)}$ (ImageNet)
Here $\sum_i (G_i + D_i)$ is the total number of Generator and Discriminator weights.
As is common with genetic algorithms, this fitness function is heuristic. The first term describes how much we want to prioritize a low loss, while the second term determines how much we want to prioritize a model with fewer parameters. ImageNet is a much more “difficult” dataset. As such, it is probably unrealistic to expect the model to do well with anything fewer than 50M parameters, whereas for CIFAR100, 5M is a “reasonable” number of parameters. Lastly, we want to normalize the fitness between 0 and 1, and that is achieved by ensuring $\lambda + \gamma = 1$.
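The parameter term and the weighting can be sketched as follows. This is a hedged illustration rather than the exact fitness function: it reads P as the parameter budget (5M or 50M) divided by the total weight count, takes the weights as λ = 3/4 and γ = 1/4 so that they sum to 1, and assumes the two terms are combined as a weighted sum. The exact form of the loss term is not given in the text, so the caller supplies it as a number. All names are hypothetical.

```python
# Parameter budgets quoted in the text: 5M for CIFAR100, 50M for ImageNet.
PARAMETER_BUDGET = {"cifar100": 5e6, "imagenet": 5e7}

def parameter_term(generator_weights, discriminator_weights, dataset):
    """P: budget divided by the total number of weights in both networks."""
    total = generator_weights + discriminator_weights
    return PARAMETER_BUDGET[dataset] / total

def fitness(loss_term, p_term, lam=0.75, gamma=0.25):
    """Assumed weighted sum of a loss-based term and the parameter term.

    loss_term is a placeholder: higher should mean lower final-epoch loss.
    """
    return lam * loss_term + gamma * p_term
```

Note that under this reading P exceeds 1 for models smaller than the budget, so the fitness stays within 0 and 1 only if both terms do. This sketch does not clip either term.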
Now, we turn our attention back to the choices of layers for the GAN.
Layer types: What types of layers do we want to use for the Generator? What about the Discriminator? Should both networks have the same number and type of layers? In this experiment, we settled on a “simple” case. We use Conv-BN-PReLU blocks followed by one FC layer for the Discriminator and ConvTranspose-BN-PReLU blocks in the Generator. A quick visual reminder is pictured below (note Deconvolution is synonymous with Transposed Convolution).
Each network is built by stacking multiple blocks, so the number of layers is equal to 3 x the number of blocks.
Channels: between 5 and 300
Kernel size: Between 3 and 5 (stride of 1, padding of 0)
Batch Norm (BN) layer: momentum between 0.1 and 0.9
FC layer: Hidden layer size between 30 and 1200, followed by a Sigmoid activation layer.
PReLU: we use a parameter a for each channel, initialized to 0.25.
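To make these ranges concrete, here is a small sketch that samples one block's hyperparameters and tracks how the spatial size changes through a stack of blocks. With stride 1 and padding 0, a convolution with kernel size k shrinks each side by k - 1 and a transposed convolution grows it by k - 1. The function names are hypothetical, and uniform sampling is an assumption.

```python
import random

def sample_block(rng):
    """Draw hyperparameters for one Conv/ConvTranspose-BN-PReLU block."""
    return {
        "channels": rng.randint(5, 300),
        "kernel_size": rng.randint(3, 5),
        "bn_momentum": rng.uniform(0.1, 0.9),
        "prelu_init": 0.25,
    }

def conv_output_size(size, kernel_size):
    """Side length after a convolution with stride 1 and padding 0."""
    return size - kernel_size + 1

def conv_transpose_output_size(size, kernel_size):
    """Side length after a transposed convolution, stride 1, padding 0."""
    return size + kernel_size - 1

def generator_output_size(noise_size, blocks):
    """Side length after passing noise through all generator blocks."""
    size = noise_size
    for block in blocks:
        size = conv_transpose_output_size(size, block["kernel_size"])
    return size

def layer_count(blocks):
    """Each block contributes three layers."""
    return 3 * len(blocks)
```

If no other upsampling is used, each generator block adds only 2 to 4 pixels per side. Growing 48 x 48 noise to a 64 x 64 image would then take four kernel-5 blocks, while 16 x 16 noise would need between 12 and 24 blocks.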
The reasoning for these layer choices is that they may be “good enough”. Giving a large number of convolutional channels helps the network to have “enough” learnable parameters. The Batch Norm layer (also with learnable parameters) is there to help reduce overfitting caused by potentially having too many channels (not likely for ImageNet, but possible for CIFAR). Using vanilla BN, where there are no learnable parameters, is known to be problematic for GANs. Various theories exist as to why, but having the BN layer include learnable parameters might prove useful. The algorithm for BN is pictured below for reference. Note the last step includes the learnable parameters $\gamma$ and $\beta$.
Finally, in the Discriminator, we add one additional FC layer before the Sigmoid activation layer so that the network has an additional opportunity to learn features from the convolutional layers.
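For reference, the standard training-time Batch Norm transform for a single channel can be written in a few lines of plain Python. This sketch omits the running statistics used at inference time (which the BN momentum setting controls). The γ and β here are the BN scale and shift, unrelated to the γ used as a weight in the fitness function.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel's activations over a mini-batch,
    then apply the learnable scale (gamma) and shift (beta)."""
    m = len(values)
    mean = sum(values) / m
    var = sum((v - mean) ** 2 for v in values) / m  # biased batch variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta
            for v in values]
```

With `gamma=1.0` and `beta=0.0` the output has roughly zero mean and unit variance. Learning other values lets the network undo the normalization where that helps.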
Using PReLU instead of ReLU gives the ability to adjust the importance of each channel in each layer, whereas standard ReLU is not as flexible, and so each channel’s activation relies on the weights alone. A quick visualization of how different parameters are used per channel is pictured below:
This lack of flexibility may become a problem in deeper networks, where the gradients decay; there, the PReLU in the earlier layers can adjust the “strength” of the activations. We will see how effective this is.
In future runs, we will give both networks the ability to randomly choose both the type of layer and the hyperparameters in each layer. We also want to give the chromosomes the flexibility to choose their optimizer, as well as a choice in loss function. We aim, by our third set of runs, to have at least the following:
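The per-channel behavior of PReLU is simple to write down: positive inputs pass through unchanged, and negative inputs are scaled by that channel's learnable slope a. A minimal sketch, with each channel represented as a flat list of activations (the function names are illustrative):

```python
def prelu(x, a):
    """PReLU for a single activation: x if positive, otherwise a * x."""
    return x if x > 0 else a * x

def prelu_per_channel(feature_maps, slopes):
    """Apply PReLU with one slope per channel.

    feature_maps: list of channels, each a flat list of activations.
    slopes: one slope per channel (initialized to 0.25 in this experiment).
    """
    return [[prelu(x, a) for x in channel]
            for channel, a in zip(feature_maps, slopes)]
```

Setting every slope to 0 recovers standard ReLU, so ReLU is the special case in which no channel can adjust its negative-side response.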
- Convolutional Layers
- Max & Average Pooling Layers (for the Discriminator)
- Skip/residual connections
- Batch Norm & Instance Norm Layers
- FC layers at any layer in the network
- Activation function choice
- Loss functions
  - BCE loss
  - WGAN loss
This will also require us to more carefully define a fitness function, which will help us to better quantify what we mean by “realistic” images from “good” models. We plan on potentially incorporating the FID score and others into future fitness functions.
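For orientation, the two loss families named in the list above can be sketched in their textbook forms. These are standard formulations, not a description of what future runs will use. The function names are illustrative, the generator BCE loss shown is the common non-saturating variant, and the Lipschitz constraint that WGAN additionally requires (weight clipping or a gradient penalty) is omitted.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def bce_discriminator_loss(d_real, d_fake):
    """d_real, d_fake: Sigmoid outputs in (0, 1) on real and fake batches."""
    return (-mean([math.log(p) for p in d_real])
            - mean([math.log(1.0 - p) for p in d_fake]))

def bce_generator_loss(d_fake):
    """Non-saturating generator loss: push D(G(z)) towards 1."""
    return -mean([math.log(p) for p in d_fake])

def wgan_critic_loss(c_real, c_fake):
    """c_real, c_fake: unbounded critic scores (no Sigmoid)."""
    return mean(c_fake) - mean(c_real)

def wgan_generator_loss(c_fake):
    """The generator tries to raise the critic's score on fake samples."""
    return -mean(c_fake)
```

One practical consequence is that a WGAN critic outputs unbounded scores, so the final Sigmoid layer used with BCE would not be applied.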
In an ideal world, we would only *need* one run. If we consider our search space to be all combinations of layers, activations, and optimizers, then we will essentially cover *all* possible architecture combinations a human could design. Having such a large search space, however, would require an incredibly long training time (months for ImageNet). As such, with each run, we will do our best to intelligently refine our search space so that we initialize a population that is more likely to be fit. This does inject some degree of human bias, but it helps make the search space for the genetic algorithms manageable.
With each run, we will publish some metrics that we have collected during the run, and with time, we are confident that we will be able to find the right combination of loss functions, architecture layers, and learning rates that can consistently give strong results.
Sign up at www.neurales.ai