We’ll be using the numpy module to convert data to NumPy arrays, which is what Scikit-learn expects. We will talk more about preprocessing and cross_validation when we get to them in the code, but preprocessing is the module used to clean and scale data prior to machine learning, and cross_validation is used in the testing stages. Finally, we’re also importing the LinearRegression algorithm as well as svm from Scikit-learn, which we’ll use as our machine learning algorithms to demonstrate results.
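As a rough preview of what these modules do, here is a minimal sketch on a tiny made-up dataset. The numbers and variable names are purely illustrative. It also assumes a recent Scikit-learn release, where the splitting helper lives in sklearn.model_selection rather than in the older cross_validation module.

```python
import numpy as np
from sklearn import preprocessing
# Assumption: recent Scikit-learn versions expose train_test_split here;
# older releases provided it in sklearn.cross_validation instead.
from sklearn.model_selection import train_test_split

# Made-up data: four samples with two feature columns on very different scales.
raw_features = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 800.0]]
raw_labels = [10.0, 20.0, 30.0, 40.0]

# Convert plain Python lists to NumPy arrays.
X = np.array(raw_features)
y = np.array(raw_labels)

# Scale each feature column to zero mean and unit variance.
X_scaled = preprocessing.scale(X)

# Hold back 25% of the samples for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)
```

Scaling matters most for algorithms that are sensitive to feature magnitude, such as svm. Holding back a test set lets us measure how the model does on data it has not seen.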
At this point, we’ve got data that we think is useful. How does the actual machine learning work? With supervised learning, you have features and labels. The features are the descriptive attributes, and the label is what you’re attempting to predict or forecast. Another common regression example is predicting the dollar value of an insurance policy premium for someone. The company may collect, for example, your age, past driving infractions, public criminal record, and credit score. The company takes this data for past customers and pairs it with the “ideal premium” it thinks each customer should have been given, or with the premium it actually charged if it considered that amount profitable.
Thus, for training the machine learning model, the features are the customer attributes and the label is the premium associated with those attributes.
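To make the features-and-labels idea concrete, here is a minimal sketch of the insurance example. Every number in it is invented for illustration, and the column choice (age, driving infractions, credit score) is an assumption, not a real insurer’s data or pricing method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Features: one row per past customer.
# Hypothetical columns: age, past driving infractions, credit score.
X = np.array([
    [25, 2, 620],
    [40, 0, 710],
    [33, 1, 680],
    [58, 0, 790],
    [19, 3, 580],
    [47, 1, 650],
], dtype=float)

# Labels: the premium (in dollars) associated with each customer above.
# These values are made up for the sake of the example.
y = np.array([1430, 890, 1140, 630, 1680, 1030], dtype=float)

# Train: the model learns a mapping from features to the label.
model = LinearRegression()
model.fit(X, y)

# Predict the premium for a new customer described by the same three features.
new_customer = np.array([[30, 1, 700]], dtype=float)
predicted_premium = model.predict(new_customer)[0]
print(predicted_premium)
```

Each row of X describes one customer, and the matching entry in y is the value we want the model to learn to forecast. The same pattern applies to any supervised learning problem.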