Natural Language Processing using OpenAI API

Introduction

Large Language Models (LLMs) are good at understanding and interpreting natural-language text, and they can be used to extract information from text with high accuracy. In this tutorial, we will use the OpenAI API to extract location information from news articles and display the results on a map. The tutorial also shows how to develop prompts suitable for data processing pipelines, so that the model returns structured data. The notebook is based on the excellent course ChatGPT Prompt Engineering for Developers by Andrew Ng.

Overview of the Task

We will take 3 news articles about human-elephant conflict in India, extract information about each incident using an LLM, and geocode the results to create a map.

Input Data:

  • article1.txt, article2.txt, article3.txt: Sample news articles

Output Layers:

  • An interactive map of locations and data extracted from the articles.

Setup and Data Download

%%capture
if 'google.colab' in str(get_ipython()):
  !pip install openai mapclassify
from folium import Figure
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import GoogleV3
import folium
import geopandas as gpd
import json
import openai
import os
import pandas as pd
import textwrap

Add your OpenAI API key below. You need to sign up and obtain a key, which requires setting up a billing account. If you want to experiment, you can use the free environment provided by the ChatGPT Prompt Engineering for Developers course.

Add your Google Maps API key below. This requires signing up through the Google Cloud Console and setting up a billing account. Once done, make sure to enable the Geocoding API and get a key.

openai.api_key = ''
google_maps_api_key = ''
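
To avoid pasting keys into a shared notebook, one option is to read them from environment variables. The snippet below is a minimal sketch of that idea, not part of the original notebook: `read_key` is a hypothetical helper, and the variable name `GOOGLE_MAPS_API_KEY` is an assumption (any name works as long as it matches your environment).

```python
import os

def read_key(name, default=''):
    """Return the value of environment variable `name`, or `default` if unset."""
    return os.environ.get(name, default)

# Usage in this notebook (environment variable names are assumptions):
# openai.api_key = read_key('OPENAI_API_KEY')
# google_maps_api_key = read_key('GOOGLE_MAPS_API_KEY')
```

An empty string is returned when the variable is missing, which matches the placeholder values above.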

Initialize the model.

client = openai.OpenAI(api_key=openai.api_key)

def get_completion(prompt, model='gpt-3.5-turbo'):
    messages = [{'role': 'user', 'content': prompt}]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,  # This is the degree of randomness of the model's output
    )
    return response.choices[0].message.content
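
In a pipeline that processes many articles, an API call can occasionally fail for transient reasons such as rate limits or network errors. The following is a minimal sketch of a generic retry wrapper with exponential backoff. `with_retries` is a hypothetical helper written for illustration, not part of the openai library, and the attempt count and delays are arbitrary assumptions.

```python
import time

def with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() and return its result, retrying on any exception.

    The wait doubles after each failed attempt (base_delay, 2 * base_delay, ...).
    The last exception is re-raised once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

It could be used as `with_retries(lambda: get_completion(prompt))`. Catching every exception keeps the sketch short; in practice you would likely retry only on the error types that are worth retrying.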

Load Data

data_folder = 'data'
output_folder = 'output'

if not os.path.exists(data_folder):
    os.mkdir(data_folder)
if not os.path.exists(output_folder):
    os.mkdir(output_folder)

def download(url):
    filename = os.path.join(data_folder, os.path.basename(url))
    if not os.path.exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

data_url = 'https://github.com/spatialthoughts/geopython-tutorials/releases/download/data/'

articles = ['article1.txt', 'article2.txt', 'article3.txt']

for article in articles:
  download(data_url + article)
Downloaded data/article1.txt
Downloaded data/article2.txt
Downloaded data/article3.txt

Get AI Predictions

Read the data.

articles_texts = []

for article in articles:
  path = os.path.join(data_folder, article)
  # Use a context manager so the file is closed after reading
  with open(path, 'r') as f:
    articles_texts.append(f.read())

Display the first article.

wrapped_article = textwrap.fill(articles_texts[0], width=80)
print(wrapped_article)
Title: 2 Persons Trampled To Death By Elephants In 2 Days In Odisha’s Dhenkanal
Description: Dhenkanal: Human casualty due to elephant attack continued in
Odisha’s Dhenkanal district as a man was trampled to death by a herd on
Saturday. According to sources, the incident tool place when the victim, Khirod
Samal of Neulapoi village under Sadangi forest range, had gone to collect cashew
nuts from a nearby orchard in the morning. He came face to face with 3 elephants
who had separated from a herd and were creating a rampage in the area.  Though
Khirod tried to escape from the place, the elephants caught hold of him and
trampled him to death. It took place hardly 100 metre from the panchayat office
in the area.  On being informed, forester Madhusita Pati from Joronda went to
the spot along with a team of Forest officials. She sent the body for post-
mortem and advised the villagers not to venture into the forest till the Forest
officials send the elephants back.  In a similar incident on Friday, one person
was killed in elephant attack in the district. The deceased was identified as
Lakshmidhar Sahu of Bali Kiari village under Angat Jarda Panchayat in Hindol
forest range. He was attacked by the elephant in the morning when he had gone to
the village pond.

We design a prompt to extract specific information from the news article in JSON format.

results = []

for article_text in articles_texts:
  prompt = f"""
    Identify the following items from the news article
    - Location of the incident
    - Number of people injured
    - Number of people killed
    - Short summary

    The news article is delimited with triple quotes.
    Format your response as a JSON object with 'location', 'num_killed' and \
    'summary' as the keys.
    If the information isn't present, use 'unknown' as the value.
    Make your response as short as possible.

    News article: '''{article_text}'''
  """
  response = get_completion(prompt)
  results.append(json.loads(response))
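
Calling `json.loads` directly on the reply assumes the model always returns bare, valid JSON. That is not guaranteed: a reply may be wrapped in a Markdown code fence or may omit a key. The sketch below shows one defensive way to parse a reply. `parse_response` and `EXPECTED_KEYS` are hypothetical names introduced here, and falling back to 'unknown' mirrors the instruction given in the prompt.

```python
import json

EXPECTED_KEYS = ('location', 'num_killed', 'summary')
FENCE = '`' * 3  # Markdown code fence marker

def parse_response(text):
    """Parse a model reply into a dict containing exactly the expected keys."""
    cleaned = text.strip()
    # Remove a surrounding Markdown code fence, if present
    if cleaned.startswith(FENCE):
        cleaned = cleaned.strip('`')
        # Drop an optional language tag such as 'json'
        if cleaned.lower().startswith('json'):
            cleaned = cleaned[4:]
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    # Missing keys fall back to 'unknown'
    return {key: data.get(key, 'unknown') for key in EXPECTED_KEYS}
```

In the loop above, `results.append(parse_response(response))` would then never raise on a malformed reply, at the cost of silently recording 'unknown' values that you may want to review.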

We can turn the list of parsed JSON responses into a Pandas DataFrame.

df = pd.DataFrame.from_dict(results)
df
   location                             num_killed  summary
0  Dhenkanal, Odisha                    2           2 persons trampled to death by elephants in 2 ...
1  Jharkhand's Latehar district         3           Three members of a family, including a 3-year-...
2  Perumugai in the T.N. Palayam block  1           Wild elephant Karuppan trampled a 48-year-old ...

Geocode Locations

We were able to extract the descriptive location name from the article. Now we can use a geocoding service to map the location to coordinates.

locator = GoogleV3(api_key=google_maps_api_key)
geocode_fn = RateLimiter(locator.geocode, min_delay_seconds=2)

df['geocoded'] = df['location'].apply(geocode_fn)
df['geocoded']
geocoded
0 (Dhenkanal, Odisha, India, (20.6504753, 85.598...
1 (Latehar, Jharkhand, India, (23.7555791, 84.35...
2 (Perumugai, Tamil Nadu 632009, India, (12.9376...
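
When many articles mention the same place, geocoding each row separately repeats paid, rate-limited requests. One possible approach is to cache results by query string. The snippet below is a minimal sketch under that assumption. `make_cached_geocoder` is a hypothetical helper, not part of geopy, and it wraps any callable that takes a location string.

```python
def make_cached_geocoder(geocode):
    """Wrap a geocoding callable so each distinct query is looked up only once."""
    cache = {}

    def cached(query):
        # Failed lookups (None) are cached too, so they are not retried
        if query not in cache:
            cache[query] = geocode(query)
        return cache[query]

    return cached
```

In this notebook it could wrap the rate-limited function, for example `geocode_fn = make_cached_geocoder(RateLimiter(locator.geocode, min_delay_seconds=2))`.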

We extract the latitude and longitude from the geocoded response.

df['point'] = df['geocoded'].apply(lambda loc: tuple(loc.point) if loc else None)
df[['latitude', 'longitude', 'altitude']] = pd.DataFrame(df['point'].tolist(), index=df.index)
df_output = df[['location', 'num_killed', 'summary', 'latitude', 'longitude']].copy()
df_output
   location                             num_killed  summary                                            latitude   longitude
0  Dhenkanal, Odisha                    2           2 persons trampled to death by elephants in 2 ...  20.650475  85.598122
1  Jharkhand's Latehar district         3           Three members of a family, including a 3-year-...  23.755579  84.354205
2  Perumugai in the T.N. Palayam block  1           Wild elephant Karuppan trampled a 48-year-old ...  12.937608  79.185825

Convert the Pandas DataFrame into a GeoPandas GeoDataFrame so we can display the results on a map.

geometry = gpd.points_from_xy(df_output.longitude, df_output.latitude)
gdf = gpd.GeoDataFrame(df_output, crs='EPSG:4326', geometry=geometry)
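# total_bounds is (minx, miny, maxx, maxy), i.e. (min lon, min lat, max lon, max lat)
bounds = gdf.total_bounds

fig = Figure(width=700, height=400)

m = folium.Map()
# folium expects [[south, west], [north, east]] in (lat, lon) order
m.fit_bounds([[bounds[1], bounds[0]], [bounds[3], bounds[2]]])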

gdf.explore(
    m=m,
    tooltip=['location', 'num_killed'],
    popup=['location', 'num_killed'],
    marker_kwds=dict(radius=5))

fig.add_child(m)
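
The `output_folder` created earlier is a natural place to keep the extracted results. With GeoPandas, `gdf.to_file(path, driver='GeoJSON')` is the usual way to write them. As a dependency-free illustration of what such a file contains, the sketch below builds a GeoJSON FeatureCollection from plain row dictionaries. `rows_to_geojson` is a hypothetical helper, and it assumes each row has `latitude` and `longitude` keys as in `df_output`.

```python
import json
import math

def _is_missing(value):
    """True for None or NaN, the two ways a failed geocode may show up."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def rows_to_geojson(rows):
    """Build a GeoJSON FeatureCollection of points from row dictionaries."""
    features = []
    for row in rows:
        lat, lon = row.get('latitude'), row.get('longitude')
        if _is_missing(lat) or _is_missing(lon):
            continue  # skip rows that could not be geocoded
        properties = {k: v for k, v in row.items()
                      if k not in ('latitude', 'longitude')}
        features.append({
            'type': 'Feature',
            # GeoJSON stores coordinates as [longitude, latitude]
            'geometry': {'type': 'Point', 'coordinates': [lon, lat]},
            'properties': properties,
        })
    return {'type': 'FeatureCollection', 'features': features}
```

For example, `rows_to_geojson(df_output.to_dict('records'))` could be passed to `json.dump` with a file opened under `output_folder`. Depending on the pandas version, numeric values may need converting to plain Python types first.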
