Creating a Searchable Database with Text Extracted from Scanned PDFs or Images: PDF Text OCR - Searchable PDF Text in a Database

In this short tutorial, I show how to extract text from images and scanned PDFs and store the results in a database to make the documents searchable.

PDF documents and images that contain text are difficult to work with. Most business people read through many pages by hand to find the information they need. Here we use a Python program that takes a PDF, scanned or not, or any image that contains text, extracts the text page by page, and indexes each page in a dataframe. The dataframe can be stored in any database of your choice, where users can run NLP searches or mine the text in the table.
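As a rough illustration of the page-by-page idea, here is a minimal sketch, not the tutorial's actual code. It assumes the pdf2image and pytesseract packages (plus the Poppler and Tesseract binaries they rely on) for the OCR step, and it stores the rows in a SQLite table rather than a dataframe so that the indexing and search parts run with the standard library alone. The function names, the table name `pages`, and its columns are hypothetical.

```python
import sqlite3
from typing import List, Tuple


def ocr_pdf_pages(pdf_path: str) -> List[str]:
    """Return the OCR text of each page (assumes pdf2image and pytesseract)."""
    # Imported lazily so the indexing code below works without the OCR stack.
    from pdf2image import convert_from_path
    import pytesseract

    # convert_from_path renders each page as an image; image_to_string runs OCR.
    return [pytesseract.image_to_string(img) for img in convert_from_path(pdf_path)]


def index_pages(conn: sqlite3.Connection, doc_name: str, pages: List[str]) -> None:
    """Store one row per page: document name, 1-based page number, page text."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (doc TEXT, page INTEGER, text TEXT)"
    )
    conn.executemany(
        "INSERT INTO pages VALUES (?, ?, ?)",
        [(doc_name, number, text) for number, text in enumerate(pages, start=1)],
    )
    conn.commit()


def search(conn: sqlite3.Connection, term: str) -> List[Tuple[str, int]]:
    """Return (document, page) pairs whose text contains the search term."""
    rows = conn.execute(
        "SELECT doc, page FROM pages WHERE text LIKE ? ORDER BY doc, page",
        (f"%{term}%",),
    )
    return rows.fetchall()
```

A typical flow would be `index_pages(conn, "report.pdf", ocr_pdf_pages("report.pdf"))` followed by `search(conn, "invoice")`, where `report.pdf` is a placeholder file name. A plain `LIKE` match is the simplest possible search; a full-text index would be the natural next step for larger collections.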


Named Entity Recognition With Stanford NLP NER Package: Automated Information Extraction from Text - Natural Language Processing

In this short post, I show how to extract the entities in an article or any other text document using natural language processing techniques. We will use the powerful NER package from Stanford NLP.

An entity could be a person, organization, location, date or time, amount of money, or percentage, and the list goes on. This is useful when you quickly need to gather the salient information in a very long document: for example, who contacted whom, at what time and place, which organization they work for or are discussing, and whether money was involved in the dealings and how much.
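To make the idea concrete, here is a small sketch of one step that usually follows tagging. It assumes the tagger returns a list of (token, label) pairs with "O" marking tokens outside any entity, which is the shape NLTK's StanfordNERTagger wrapper produces; running the tagger itself needs Java and the Stanford model files, so it is left out. The helper name `group_entities` and the sample sentence are made up for illustration.

```python
from itertools import groupby
from typing import List, Tuple


def group_entities(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge consecutive tokens that share a label into one (text, label) entity."""
    entities = []
    for label, run in groupby(tagged, key=lambda pair: pair[1]):
        if label != "O":  # "O" means the token is not part of an entity
            entities.append((" ".join(token for token, _ in run), label))
    return entities


# Hypothetical tagger output for "Jane Doe joined Acme Corp in Berlin"
tagged = [
    ("Jane", "PERSON"), ("Doe", "PERSON"), ("joined", "O"),
    ("Acme", "ORGANIZATION"), ("Corp", "ORGANIZATION"),
    ("in", "O"), ("Berlin", "LOCATION"),
]
print(group_entities(tagged))
```

One limitation of this simple grouping: two different entities of the same type that sit directly next to each other are merged into one.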


Classification of Customer Complaints using Tensorflow, Transfer Learning: Text Classification with Word Embeddings

In this post, I show how to classify the text of consumer complaints into the following categories: Debt collection, Consumer Loan, Mortgage, Credit card, Credit reporting, Student loan, Bank account or service, Payday loan, Money transfers, Other financial service, Prepaid card.

This kind of model is useful for a customer service department that wants to classify the complaints it receives from customers. Sorting the issues into buckets helps the department provide tailored solutions to the customers in each group.
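The intuition behind classifying text with reused word embeddings can be shown with a toy, dependency-free sketch. This is not the TensorFlow model from the post: it swaps in a much simpler technique, averaging word vectors and picking the category whose centroid is closest by cosine similarity. The two-dimensional vectors, the example phrases, and all function names are invented for illustration; real pre-trained embeddings have hundreds of dimensions.

```python
import math
from typing import Dict, Tuple

Vector = Tuple[float, ...]

# Toy 2-d "embeddings" (invented values standing in for pre-trained vectors).
EMBEDDINGS: Dict[str, Vector] = {
    "loan": (1.0, 0.0), "mortgage": (0.9, 0.1), "house": (0.8, 0.2),
    "card": (0.0, 1.0), "credit": (0.1, 0.9), "charge": (0.2, 0.8),
}


def embed(text: str) -> Vector:
    """Average the vectors of the known words in the text."""
    vectors = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return (0.0, 0.0)
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(text: str, centroids: Dict[str, Vector]) -> str:
    """Return the category whose centroid is most similar to the text vector."""
    vector = embed(text)
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


# Category centroids built from a few representative words per category.
centroids = {
    "Mortgage": embed("mortgage house loan"),
    "Credit card": embed("credit card charge"),
}
print(classify("unexpected charge on my credit card", centroids))
```

A trained neural classifier learns the decision boundaries from labeled complaints instead of relying on hand-picked centroids, but the reuse of existing word vectors is the same basic idea.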


Downloading Kaggle Datasets into Google Colab: Easy Access to Kaggle Datasets in Colab

In this tutorial, I show how to download Kaggle datasets into Google Colab. Kaggle has been, and remains, the de facto platform for trying your hand at data science projects. The platform hosts a large collection of free datasets for machine learning projects.

Colab is another product from Google, the company behind Kaggle. It is a platform for training machine learning models and deep neural networks free of charge, with no installation required. One key thing that makes Colab a game changer, especially for people who do not own a laptop with a GPU, is that users have the option to train their models on a free GPU. Colab does not have the trove of datasets that Kaggle hosts on its platform, so it would be nice to access the Kaggle datasets from Colab. There is in fact a Kaggle API that we can use in Colab, but setting it up is not so easy. I want to show how to use the API in a few simple steps.


Exploring the Causes of Death of Soccer Players: SPARQL and Python Tutorial

As we watch soccer players exhibit their skills on the pitch at the World Cup, we might think these players are healthy in every sense, given the amount of work they put in before and during each game. Health experts advise that exercise is paramount for avoiding many diseases. In that context, I wondered what impact exercise has on sportsmen and sportswomen, and on soccer players in particular.

This question cannot be answered for players who are currently active because I do not have access to their medical records, so I decided to use public data on Wikipedia that shows the causes of death of soccer players around the world.
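For a sense of what such a query can look like, here is a hedged sketch that assumes the Wikidata SPARQL endpoint; the tutorial itself may use a different source, such as DBpedia. In Wikidata, `P106` is the occupation property, `Q937857` is the item for association football player, and `P509` is cause of death. The function names are hypothetical, and `run_query` needs network access.

```python
import json
import urllib.parse
import urllib.request
from typing import List, Tuple

ENDPOINT = "https://query.wikidata.org/sparql"  # assumption: Wikidata as the source


def build_query(limit: int = 10) -> str:
    """Count deceased football players per recorded cause of death."""
    return f"""
SELECT ?causeLabel (COUNT(?player) AS ?deaths) WHERE {{
  ?player wdt:P106 wd:Q937857 ;
          wdt:P509 ?cause .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
GROUP BY ?causeLabel
ORDER BY DESC(?deaths)
LIMIT {int(limit)}
"""


def parse_results(payload: dict) -> List[Tuple[str, int]]:
    """Turn SPARQL JSON results into (cause, count) pairs."""
    return [
        (row["causeLabel"]["value"], int(row["deaths"]["value"]))
        for row in payload["results"]["bindings"]
    ]


def run_query(query: str) -> List[Tuple[str, int]]:
    """Send the query to the endpoint and return the parsed rows."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": "soccer-causes-of-death-sketch/0.1",  # placeholder value
        },
    )
    with urllib.request.urlopen(request) as response:
        return parse_results(json.load(response))
```

Calling `run_query(build_query(10))` would return the most frequently recorded causes. Keep in mind that such counts only cover players whose cause of death has been entered in the knowledge base, so they describe the data rather than the whole population of players.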
