Basic FAQ Chatbot using Machine Learning
Today we’re going to create a basic chatbot using machine learning that can answer user queries based on the FAQ data of any company or organization.
Note: This tutorial is intended for beginners in Machine Learning/NLP.
Dataset:
For this tutorial, we are going to use the HDFC Bank FAQ dataset, which contains user queries and their corresponding answers.
Link to the dataset
Data Cleaning and Preprocessing
1. Below you can see what the dataset looks like:
For this tutorial, we only need the ‘question’ and ‘answer’ columns. The ‘found_duplicate’ column is not required, so we will drop it.
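Dropping the column can be sketched with pandas — the inline rows below are made up for illustration; in practice the DataFrame would be loaded from the FAQ dataset file:

```python
# A minimal sketch, assuming the data is in a pandas DataFrame with
# 'question', 'answer', and 'found_duplicate' columns.
import pandas as pd

df = pd.DataFrame({
    "question": ["How do I reset my NetBanking password?"],
    "answer": ["Use the 'Forgot Password' option on the login page."],
    "found_duplicate": [False],
})

# Keep only the columns we need for the chatbot
df = df.drop(columns=["found_duplicate"])
print(df.columns.tolist())  # ['question', 'answer']
```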
2. Now we will clean this data: use regular expressions to remove any irrelevant characters, then tokenize the text, and finally lemmatize the tokens.
During lemmatization, the ending of a word is removed to return its base form, also known as the lemma. For example, Running becomes Run and Playing becomes Play.
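The cleaning step can be sketched as below. Note that the suffix rule is only a crude stand-in to illustrate the idea; for real lemmatization you would use a library such as NLTK’s WordNetLemmatizer or spaCy:

```python
# Minimal cleaning sketch: regex cleanup -> tokenization -> crude "lemmatization".
# The suffix-stripping rule below is a toy stand-in for a real lemmatizer.
import re

def crude_lemma(token):
    # Toy rule: running -> run, playing -> play (NOT a real lemmatizer)
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        if len(token) > 2 and token[-1] == token[-2]:
            token = token[:-1]  # drop doubled consonant: "runn" -> "run"
    return token

def clean(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # strip irrelevant characters
    tokens = text.split()                     # simple whitespace tokenization
    return " ".join(crude_lemma(t) for t in tokens)

print(clean("Running & Playing, today!"))  # -> "run play today"
```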
3. Almost all answers in our dataset are unique, so we have to augment the data to generate questions similar to those already present; this makes the model generalize better to real-life queries. For augmentation, we are going to use a super cool open-source library called nlpaug.
Data Augmentation helps us to increase the size of the dataset and introduce variability in the dataset, without actually collecting new data.
For example, “I have no time” after augmentation becomes “I do not have time”.
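nlpaug ships ready-made augmenters (synonym replacement, contextual word embeddings, and more). The toy dictionary-based synonym swap below only illustrates the idea without any model downloads — the synonym table and sentence are made up:

```python
# Toy word-level augmenter: a hand-written stand-in for nlpaug's
# synonym-based augmenters (the SYNONYMS table is hypothetical).
SYNONYMS = {"have": "possess", "no": "not any"}

def augment(sentence):
    # Replace each word with its synonym when one is available
    return " ".join(SYNONYMS.get(w.lower(), w) for w in sentence.split())

print(augment("I have no time"))  # -> "I possess not any time"
```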
4. The next step is to transform the questions into vector form, since they are plain strings. We are going to use the TF-IDF vectorizer here.
5. Since the ‘answer’ column is also a string column, we should encode it as well; we are going to use a label encoder here.
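Both steps can be sketched with scikit-learn — the three question/answer pairs below are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

questions = ["how do i reset my password",
             "how do i close my account",
             "what documents are needed to open an account"]
answers = ["Use the 'Forgot Password' link on the login page.",
           "Submit an account-closure request at your branch.",
           "You need a valid ID proof and an address proof."]

# TF-IDF turns each question into a sparse numeric vector
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(questions)

# Label encoding maps each unique answer string to an integer class id
encoder = LabelEncoder()
y = encoder.fit_transform(answers)

print(X.shape[0], sorted(y))  # 3 questions, class ids [0, 1, 2]
```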
Modelling
Now that our dataset is cleaned and preprocessed, we are ready to train our machine learning model on this data.
There are various machine learning algorithms we could use, such as Logistic Regression, Multinomial Naive Bayes, and Linear SVM/SGD Classifier.
For this tutorial, we are going to use the SGD Classifier because of its fast training time.
The loss function used here is ‘modified_huber’ instead of the default ‘hinge’, because we also want prediction probabilities, which hinge loss does not provide.
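A minimal training sketch — the tiny question/answer set is hypothetical, standing in for the vectorized questions and encoded answers from the preprocessing steps:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import LabelEncoder

questions = ["how do i reset my password",
             "how do i close my account",
             "what documents are needed to open an account"]
answers = ["Use the 'Forgot Password' link on the login page.",
           "Submit an account-closure request at your branch.",
           "You need a valid ID proof and an address proof."]

X = TfidfVectorizer().fit_transform(questions)
y = LabelEncoder().fit_transform(answers)

# 'modified_huber' supports predict_proba; the default 'hinge' does not
model = SGDClassifier(loss="modified_huber", random_state=42)
model.fit(X, y)

probs = model.predict_proba(X)
print(probs.shape)  # one probability per class for each question
```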
Evaluation
Now that we have trained our model, it’s time to evaluate it. Since this is a multi-class classification model, we generate predictions for the test data and compare them against the true labels.
Accuracy, Precision, Recall, and ROC-AUC Score look good enough to try the model on real-life queries. We can improve this model further with hyper-parameter tuning.
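These metrics can be computed with scikit-learn; `y_test`, `y_pred`, and `proba` below are made-up stand-ins for the real test-set outputs:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

# Stand-in predictions for a 3-class problem (not real model output)
y_test = np.array([0, 1, 2, 1])
y_pred = np.array([0, 1, 2, 2])
proba = np.array([[0.8, 0.1, 0.1],
                  [0.1, 0.7, 0.2],
                  [0.2, 0.2, 0.6],
                  [0.1, 0.4, 0.5]])

print(accuracy_score(y_test, y_pred))                       # 0.75
print(precision_score(y_test, y_pred, average="macro"))
print(recall_score(y_test, y_pred, average="macro"))
# Multi-class ROC-AUC needs per-class probability scores
print(roc_auc_score(y_test, proba, multi_class="ovr"))
```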
Testing on Example queries
Let’s test our model on a sample query:
For each question, we first clean it and then transform it into vector form using the TF-IDF vectorizer. We also check the prediction probability: if it is less than 0.1, we tell the end-user that the model is not confident enough about that particular prediction.
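Putting it together — the `answer_query` helper and the tiny training set below are hypothetical, but the vectorize → predict-with-threshold flow matches the steps above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import LabelEncoder

questions = ["how do i reset my password",
             "how do i close my account",
             "what documents are needed to open an account"]
answers = ["Use the 'Forgot Password' link on the login page.",
           "Submit an account-closure request at your branch.",
           "You need a valid ID proof and an address proof."]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(questions)
encoder = LabelEncoder()
y = encoder.fit_transform(answers)
model = SGDClassifier(loss="modified_huber", random_state=42).fit(X, y)

def answer_query(query, threshold=0.1):
    # Clean (lowercase here as a stand-in), vectorize, then predict
    vec = vectorizer.transform([query.lower()])
    proba = model.predict_proba(vec)[0]
    if proba.max() < threshold:
        return "Sorry, I'm not confident enough to answer that."
    return encoder.inverse_transform([proba.argmax()])[0]

print(answer_query("how can i reset my password"))
```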
Congrats, we have successfully created a chatbot for bank queries that can be deployed and integrated into any website.
Link to Github Code
Clap and share the article if you liked this tutorial. Feedback and suggestions are most welcome.