This blog covers multiclass classification on a customer complaints dataset.
After preprocessing the data we will build multiple models with different estimators and different hyperparemeters to find the best performing model.
As an example dataset, we'll import Consumer_Complaints.csv.
['I have outdated information on my credit report that I have previously disputed that has yet to be removed this information is more then seven years old and does not meet credit reporting requirements',
'An account on my credit report has a mistaken date. I mailed in a debt validation letter to allow XXXX to correct the information. I received a letter in the mail, stating that Experian received my correspondence and found it to be " suspicious \'\' and that " I did n\'t write it \'\'. Experian \'s letter is worded to imply that I am incapable of writing my own letter. I was deeply offended by this implication. \nI called Experian to figure out why my letter was so suspicious. I spoke to a representative who was incredibly unhelpful, She did not effectively answer any questions I asked of her, and she kept ignoring what I was saying regarding the offensive letter and my dispute process. I feel the representative did what she wanted to do, and I am not satisfied. It is STILL not clear to me why I received this letter. I typed this letter, I signed this letter, and I paid to mail this letter, yet Experian willfully disregarded my lawful request. \nI am disgusted with this entire situation, and I would like for my dispute to be handled appropriately, and I would like for an Experian representative to contact me and give me a real explanation for this letter.' ]
[ 'outdated information credit report previously disputed yet removed information seven years old meet credit reporting requirements',
'account credit report mistaken date mailed debt validation letter allow correct information received letter mail stating eperian received correspondence found suspicious nt write eperian letter worded imply incapable writing letter deeply offended implication called eperian figure letter suspicious spoke representative incredibly unhelpful effectively answer questions asked kept ignoring saying regarding offensive letter dispute process feel representative wanted satisfied still clear received letter typed letter signed letter paid mail letter yet eperian willfully disregarded lawful request disgusted entire situation would like dispute handled appropriately would like eperian representative contact give real eplanation letter',
'company refuses provide verification validation debt per right fdcpa believe debt mine',
'complaint regards square two financial refer cfpb case number regarding cach l l c square two financial utilized entire social security number include date birth pfd document listed complaint initial complaint cach l l c square two financial breach following identity theft assumption deterrence act privacy act social security privacy actwhich carries maimum fine calendar cap year breach title solution cach llc handle correction square two financial two square financial submitted subscriber name form listed cfpb case # rendered liable matter addition account number associated universal data form could use account number instead ssn dob also includes removal form cfpb case # listed pdf document attached case number square two financial contacted email regards matter addition information sale distribution via fa fascanned copied stored retrieval system recorded transmitted digitally electronically without epressed written consent information protected copyright publishing laws information protected freedom speech include uniform commercial codes rights reserved world wide',
'started refinance home mortgage process cash option necessary documents submitted initial review got good faith estimate loan amount closing cost based estimate deposit made towards appraisal appraisal came lesser amount agreed reduce loan amount etent however got revised estimate less additional closing cost towards points etc got numerous revised estimates different loan amounts closing cost took months reach definite closing document hence want get back deposit'
The classifiers and learning algorithms can not directly process the text documents in their original form, as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length. Therefore, during the preprocessing step, the texts are converted to a more manageable representation.
One common approach for extracting features from text is to use the bag of words model: a model where for each document, a complaint narrative in our case, the presence (and often the frequency) of words is taken into consideration, but the order in which they occur is ignored.
Specifically, for each term in our dataset, we will calculate a measure called Term Frequency, Inverse Document Frequency, abbreviated to tf-idf. We will use sklearn.feature_extraction.text.TfidfVectorizer to calculate a tf-idf vector for each of consumer complaint narratives
First we create an evaluation function to output all the needs metrics
Now we make predictions using the test data to see how the model performs
Let's create a Classification report
- Accuracy is a good measure to start with if all classes are balanced (e.g. same amount of samples which are labelled with 0 or 1).
- Precision and recall become more important when classes are imbalanced. If false positive predictions are worse than false negatives, aim for higher precision. If false negative predictions are worse than false positives, aim for higher recall.
- F1-score is a combination of precision and recall.
And finally a Confusion Matrix
Display the unique products and there category ids
The Jupyter Notebook can be found here, GitHub