-The first step was to analyze the data. I tried looking at decision tree classifiers, but the resulting trees were too complex to draw any conclusions from.
-I normalized the data to zero mean and unit variance, as I found this gave better results.
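A minimal sketch of this zero-mean, unit-variance normalization, assuming the data is a NumPy array with one feature per column (the function name and toy data are my own):

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Scale each column (feature) to zero mean and unit variance.

    eps guards against division by zero for constant columns.
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / (sigma + eps)

# toy data: 3 samples, 2 features on very different scales
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
Z = standardize(X)
```

In practice the training-set mean and standard deviation should be saved and reused to transform validation and test data, so all splits are scaled identically.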
-Having computed the variance of every feature, I found that many features had very low variance. Since these features barely vary across samples, I deduced they were not contributing much to the prediction.
-After experimenting with a couple of values, I found that a variance threshold of 0.1 gave the best results.
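The variance-based feature selection described above can be sketched as follows; the helper name and toy data are my own, and the 0.1 threshold is the one from the text:

```python
import numpy as np

def select_by_variance(X, threshold=0.1):
    """Keep only the columns whose variance exceeds the threshold.

    Returns the reduced matrix and the boolean mask of kept columns.
    """
    variances = X.var(axis=0)
    mask = variances > threshold
    return X[:, mask], mask

# toy data: the second column is nearly constant (very low variance)
X = np.array([[1.0, 0.50],
              [2.0, 0.51],
              [3.0, 0.49]])
X_reduced, kept = select_by_variance(X, threshold=0.1)
```

Note that if normalization to unit variance is applied first, every surviving feature has variance close to 1, so the thresholding should be done on the raw (or only mean-centered) data.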
-The second insight was varying the number of units in the hidden layer (I opted for a simple architecture with a single hidden layer). After checking the results for a couple of values, I found that setting the number of hidden units to int(1.65 * input_layer_units) gave good results.
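The sizing heuristic amounts to a one-liner; the 1.65 factor is the one from the text, while the function name is mine:

```python
def hidden_units(input_layer_units, factor=1.65):
    """Heuristic from the write-up: hidden layer size = int(factor * input units)."""
    return int(factor * input_layer_units)
```

For example, a network with 20 input features would get a hidden layer of 33 units under this rule. This is an empirical choice, not a principled formula; the right width generally depends on the data set and should be validated.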
-Thirdly, the gradient descent I applied to find the weight matrices was taking too much time with exact line search. So I experimented with a couple of values for the descent rate and kept the 5 best. After backpropagation, the descent step is evaluated with each of the 5 rates, and the rate that yields the minimum cost is used for that iteration.
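This per-iteration rate selection can be sketched as below. A simple quadratic stands in for the network's cost and gradient (which would come from backpropagation), and the five candidate rates are assumptions, not the write-up's actual values:

```python
import numpy as np

# Hypothetical stand-ins: a quadratic cost with minimum at w = 3,
# in place of the network cost and backpropagated gradient.
def cost(w):
    return float(np.sum((w - 3.0) ** 2))

def grad(w):
    return 2.0 * (w - 3.0)

# 5 candidate descent rates (illustrative values only)
RATES = [0.001, 0.01, 0.05, 0.1, 0.3]

def descend(w, n_iters=50):
    """Each iteration, try the step for every candidate rate and keep
    whichever yields the lowest cost -- a cheap stand-in for exact line search."""
    for _ in range(n_iters):
        g = grad(w)
        w = min((w - r * g for r in RATES), key=cost)
    return w

w_final = descend(np.array([0.0]))
```

Compared with exact line search, this costs only 5 extra cost evaluations per iteration, which matches the speed motivation given in the text.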