You will carry out this project in teams of two. You can do the project with your same teammate as for Project 1 and you are also welcome to switch teammates if you wish. In this case, please be considerate and notify your Project 1 teammate immediately. If you want to form a new team but can't find a teammate, please follow these steps:
You do not need to notify us of your team composition. Instead, you and your teammate will indicate your team composition when you submit your project on Gradescope (click on "Add Group Member" after one of you has submitted your project). You will upload your final electronic submission on Gradescope exactly once per team, rather than once per student.
This project is about information extraction on the web, or the task of extracting "structured" information that is embedded in natural language text on the web. As we discussed in class, information extraction has many applications and, notably, is becoming increasingly important for web search.
In this project, you will implement a version of the Iterative Set Expansion (ISE) algorithm that we described in class: for a target information extraction task, an "extraction confidence threshold," a "seed query" for the task, and a desired number of tuples k, you will follow ISE, starting with the seed query (which should correspond to a plausible tuple for the relation to extract), to return k tuples extracted for the specified relation from web pages with at least the given extraction confidence, and following the procedure that we outline below.
The objective of this project is to provide you with hands-on experience on how to (i) retrieve and parse webpages; (ii) prepare and annotate text on the webpages for subsequent analysis; and (iii) extract structured information from the webpages. You will exercise both a "traditional" information extraction approach (using SpanBERT) that involves multiple steps of data annotation, as well as an approach that reflects the ongoing paradigm shift from multi-step data pipelines with specialized models for extraction tasks to strong "few-shot" learners (using Google's Gemini API). You will implement both approaches, and then select one of them for each specific run via a command-line option.
You will develop and run your project on the Google Cloud infrastructure, using your LionMail account as you did for Project 1.
IMPORTANT NOTE: To avoid memory-related problems, and to adapt to the requirements of the libraries that you will use in this project, please create a brand-new VM for Project 2 (i.e., don't use the same VM as for Project 1). For this, follow all steps in our earlier instructions but with the following two changes:
You should install Python 3.10 (or newer) and create a Python virtual environment to develop and test your code, as follows:

sudo apt update
sudo apt install python3
sudo apt install python3-venv
sudo apt install -y python3-pip

Create a virtual environment named dbproj:

python3 -m venv dbproj

(If you run out of space in your /home directory, move the virtual environment to a different location, as follows:

sudo mv ~/dbproj /opt/
ln -s /opt/dbproj ~/dbproj

)

To activate the virtual environment, run source dbproj/bin/activate (or source /home/<your_uni>/dbproj/bin/activate.)

NOTE: When you run the apt or apt-get commands below, you may get an error that says "ModuleNotFoundError: No module named 'apt_pkg'." In this case, please perform the following steps:

cd /usr/lib/python3/dist-packages
sudo ln -s apt_pkg.cpython-36m-x86_64-linux-gnu.so apt_pkg.so
cd ~
sudo pip3 install --upgrade google-api-python-client
Your program will rely on:

- The Google Custom Search API, as in Project 1. To install the Python client library, inside your dbproj virtual environment, run: pip3 install --upgrade google-api-python-client
- Beautiful Soup, to extract plain text from webpages. To install it, run: pip3 install beautifulsoup4
- spaCy, to annotate the text. To install it, run:
sudo apt-get update
pip3 install -U pip setuptools wheel
pip3 install -U spacy
python3 -m spacy download en_core_web_lg
- The pre-trained SpanBERT classifier, to extract four types of relations: Schools_Attended (internal name: per:schools_attended), Work_For (internal name: per:employee_of), Live_In (internal name: per:cities_of_residence), and Top_Member_Employees (internal name: org:top_members/employees). To install SpanBERT, run:
git clone https://github.com/Shreyas200188/SpanBERT
cd SpanBERT
pip3 install -r requirements.txt
bash download_finetuned.sh
- The Google Gemini API. To install it, run: pip install -q -U google-generativeai
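For concreteness, here is a minimal sketch of how the search and webpage-retrieval pieces might fit together with these libraries (the helper names get_top_urls and get_plain_text are ours, not part of any provided code):

import requests
from bs4 import BeautifulSoup
from googleapiclient.discovery import build

def get_top_urls(query, api_key, engine_id):
    """Return the URLs of the top results for `query` from the Custom Search API."""
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=query, cx=engine_id).execute()
    return [item["link"] for item in res.get("items", [])]

def get_plain_text(url):
    """Fetch a webpage and strip its markup; return None if retrieval fails."""
    try:
        html = requests.get(url, timeout=20).text
    except requests.RequestException:
        return None  # skip pages that cannot be retrieved
    return BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)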
Overall, your program should receive as input:

- a choice of using SpanBERT (-spanbert) or Google Gemini (-gemini) for the extraction process;
- your Google Custom Search Engine JSON API key and engine ID;
- your Google Gemini API key (used when -gemini is specified);
- an integer r, indicating the relation to extract;
- a real number t between 0 and 1, indicating the "extraction confidence threshold" (ignored when -gemini is specified);
- a "seed query" q, corresponding to a plausible tuple for the relation to extract; and
- an integer k greater than 0, indicating the number of tuples that we request in the output.
Then, your program should perform the following steps:

1. Initialize X, the set of extracted tuples, as the empty set.
2. Query your Google Custom Search Engine to obtain the URLs for the top-10 webpages for query q; initially, q is the seed query.
3. For each URL from the previous step that you have not processed before:
a. Retrieve the corresponding webpage; if you cannot retrieve it (e.g., because of a timeout), just skip it and move on to the next URL.
b. Extract the plain text from the webpage using Beautiful Soup.
c. If the resulting plain text is longer than 10,000 characters, truncate it and keep only its first 10,000 characters, for efficiency.
d. Use the spaCy library to split the plain text into sentences and extract the named entities, if any, that appear in each sentence (see below).
e. If -spanbert is specified, use the sentences and named entity pairs as input to SpanBERT to predict the corresponding relations, and extract all instances of the relation specified by input parameter r. Otherwise, if -gemini is specified, use the Google Gemini API for relation extraction. See below for details on how to perform this step.
f. If -spanbert is specified, identify the tuples that have an associated extraction confidence of at least t and add them to set X. Otherwise, if -gemini is specified, identify all the tuples that have been extracted and add them to set X (we do not receive extraction confidence values from the Google Gemini API, so feel free to hard-code a value of 1.0 as the confidence value for all Gemini-extracted tuples).
4. Remove exact duplicates from X: if X contains two or more copies of the same tuple, keep only the copy with the highest extraction confidence (if -spanbert is specified) and remove from X the duplicate copies. (You do not need to remove approximate duplicates, for simplicity.)
5. If X contains at least k tuples, return the top-k such tuples and stop. If -spanbert is specified, your output should have the tuples sorted in decreasing order by extraction confidence, together with the extraction confidence of each tuple. If -gemini is specified, your output can have the tuples in any order (if you have more than k tuples, then you can return an arbitrary subset of k tuples). (Alternatively, you can return all of the tuples in X, not just the top-k such tuples; this is what the reference implementation does.)
6. Otherwise, select from X a tuple y such that (1) y has not been used for querying yet and (2) if -spanbert is specified, y has an extraction confidence that is highest among the tuples in X that have not yet been used for querying. (You can break ties arbitrarily.) Create a query q from tuple y by just concatenating the attribute values together, and go to Step 2. If no such y tuple exists, then stop. (ISE has "stalled" before retrieving k high-confidence tuples.)
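To make the control flow concrete, here is a minimal sketch of the ISE loop under the steps above; extract_tuples_from_url is a hypothetical helper standing in for steps 3.a-3.f (with the relation and threshold applied inside it), and get_top_urls is the sketch from earlier:

def iterative_set_expansion(seed_query, k, api_key, engine_id):
    X = {}                  # extracted tuple -> highest extraction confidence seen
    used_queries = set()    # queries already issued (bookkeeping for step 6)
    processed_urls = set()  # URLs already processed (bookkeeping for step 3)
    q = seed_query
    while True:
        used_queries.add(q)
        for url in get_top_urls(q, api_key, engine_id):    # step 2
            if url in processed_urls:
                continue                                   # step 3: skip seen URLs
            processed_urls.add(url)
            for tup, conf in extract_tuples_from_url(url): # steps 3.a-3.f
                X[tup] = max(conf, X.get(tup, 0.0))        # step 4: exact dedup
        if len(X) >= k:                                    # step 5
            return sorted(X.items(), key=lambda item: -item[1])
        # Step 6: highest-confidence tuple not yet used for querying.
        unused = {tup: conf for tup, conf in X.items()
                  if " ".join(tup) not in used_queries}
        if not unused:
            return sorted(X.items(), key=lambda item: -item[1])  # ISE has stalled
        y = max(unused, key=unused.get)
        q = " ".join(y)  # concatenate the attribute values of tuple y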
Steps 3.d and 3.e above require that you use the spaCy library to annotate the plain text from each webpage and extract tuples for the target relation r using (1) the pre-trained SpanBERT classifier, when -spanbert is specified; or (2) the Google Gemini API, when -gemini is specified.
Relation extraction is a complex task that traditionally operates over text that has been annotated with appropriate tools. In particular, the spaCy library that you will use in this project provides a variety of text pre-processing tools (e.g., sentence splitting, tokenization, named entity recognition).
For your project, you should use spaCy for splitting the text to sentences and for named entity recognition for each of the sentences. You can find instructions on how to apply spaCy for this task here and in our example script (see below).
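For orientation, a minimal sentence-splitting and entity-recognition snippet looks as follows (a sketch with our own sample sentence, not the provided example script):

import spacy

nlp = spacy.load("en_core_web_lg")  # the model downloaded during setup
doc = nlp("Bill Gates stepped down as chairman of Microsoft in February 2014.")

for sentence in doc.sents:
    # Each named entity carries a surface string and a label such as PERSON or ORG.
    entities = [(ent.text, ent.label_) for ent in sentence.ents]
    print(sentence.text, entities)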
If -spanbert is specified, after having identified named entities for a sentence, you should use the pre-trained SpanBERT classifier for relation extraction. SpanBERT is a BERT-based relation classifier that considers as input (1) a sentence; (2) a subject entity from the sentence; and (3) an object entity from the sentence. SpanBERT then returns the predicted relation and the respective confidence value. You can find instructions on how to apply SpanBERT for this task here and in our example script (see below).
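The exact interface is defined in the provided spanbert.py and illustrated in example_relations.py; purely for orientation, and with the input format below being our assumption rather than a documented API, a call might look roughly like this:

# Sketch only: check example_relations.py for the actual input format and usage.
from spanbert import SpanBERT  # the classifier class provided in spanbert.py

spanbert = SpanBERT("./pretrained_spanbert")  # path to the fine-tuned model
# ASSUMED format: one dict per entity pair, with the sentence tokens and the
# subject/object surface text, entity type, and token span.
examples = [{
    "tokens": ["Bill", "Gates", "worked", "at", "Microsoft", "."],
    "subj": ("Bill Gates", "PERSON", (0, 1)),
    "obj": ("Microsoft", "ORGANIZATION", (4, 4)),
}]
# ASSUMED: predict returns one (relation, confidence) pair per input example.
for example, (relation, confidence) in zip(examples, spanbert.predict(examples)):
    print(example["subj"][0], example["obj"][0], relation, confidence)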
We have put together two minimal Python scripts, namely, spacy_help_functions.py and example_relations.py, that perform the full relation extraction pipeline, to illustrate how the spaCy library is integrated with SpanBERT. To run these scripts, you need to place them under the same directory as the spanbert.py file (provided here).
As an example, consider the following sentence and a "conceptual walk-through" of the various steps of the full information extraction process (note that this is not how the output of our reference implementation is formatted):
spaCy extracted entities: [('Bill Gates', 'PERSON'), ('Microsoft', 'ORGANIZATION'), ('February 2014', 'DATE'), ('Satya Nadella', 'PERSON')]
Candidate entity pairs:
1. Subject: ('Bill Gates', 'PERSON') Object: ('Microsoft', 'ORGANIZATION')
2. Subject: ('Microsoft', 'ORGANIZATION') Object: ('Bill Gates', 'PERSON')
3. Subject: ('Bill Gates', 'PERSON') Object: ('Satya Nadella', 'PERSON')
4. Subject: ('Satya Nadella', 'PERSON') Object: ('Bill Gates', 'PERSON')
5. Subject: ('Microsoft', 'ORGANIZATION') Object: ('Satya Nadella', 'PERSON')
6. Subject: ('Satya Nadella', 'PERSON') Object: ('Microsoft', 'ORGANIZATION')
SpanBERT extracted relations:
1. Subject: Bill Gates Object: Microsoft Relation: per:employee_of Confidence: 1.00
2. Subject: Microsoft Object: Bill Gates Relation: org:top_members/employees Confidence: 0.99
3. Subject: Bill Gates Object: Satya Nadella Relation: no_relation Confidence: 1.00
4. Subject: Satya Nadella Object: Bill Gates Relation: no_relation Confidence: 0.52
5. Subject: Microsoft Object: Satya Nadella Relation: no_relation Confidence: 0.99
6. Subject: Satya Nadella Object: Microsoft Relation: per:employee_of Confidence: 0.98
Note that in the above example, SpanBERT runs 6
times for the same sentence, each time with a different entity
pair. SpanBERT extracts relations for 3 entity
pairs and predicts the no_relation
type for the rest of
the pairs (i.e., no relations were extracted). Each relation type
predicted by SpanBERT is listed together with the
associated extraction confidence score.
Unfortunately, the SpanBERT classifier is computationally expensive, so for efficiency you need to minimize its use. Specifically, you should not run SpanBERT over entity pairs that do not contain named entities of the right type for the relation of interest r. The required named entity types for each relation type are as follows:

- Schools_Attended: Subject: PERSON, Object: ORGANIZATION
- Work_For: Subject: PERSON, Object: ORGANIZATION
- Live_In: Subject: PERSON, Object: one of LOCATION, CITY, STATE_OR_PROVINCE, or COUNTRY
- Top_Member_Employees: Subject: ORGANIZATION, Object: PERSON
As an example, consider extraction for the Work_For relation (internal name: per:employee_of). You should only keep entity pairs where the subject entity type is PERSON and the object entity type is ORGANIZATION. By applying this constraint to the example sentence above ("Bill Gates stepped down ... appointed CEO Satya Nadella."), SpanBERT would run only for the first entity pair ('Bill Gates', 'Microsoft') and the sixth entity pair ('Satya Nadella', 'Microsoft'). Note that the subject and object entities might appear in either order in a sentence and this is fine.
So to annotate the text, you should implement two steps. First, you should use spaCy to identify the sentences in the webpage text, together with the named entities, if any, that appear in each sentence. Then, you should construct the candidate entity pairs and run the expensive SpanBERT model separately over only those entity pairs that contain both named entity types required by the relation of interest, as specified above. IMPORTANT: You must not run SpanBERT over any entity pair that is missing one or both entities of the types required by the relation. If a sentence is missing one or both entities of the types required by the relation, you should skip it and move on to the next sentence.
While running the second step over a sentence, SpanBERT looks for a predefined set of relations in the sentence. We are interested in just the four relations mentioned above. (If you are curious about the other relations available, please check the complete list as well as an article with a detailed description.)
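Here is a minimal sketch of this two-step filtering for the Work_For relation (subject PERSON, object ORGANIZATION); the label mapping and the pair construction below are our own conventions, not provided code:

import spacy

# Map spaCy entity labels to the entity types used by the relation constraints.
SPACY_TO_TYPE = {"PERSON": "PERSON", "ORG": "ORGANIZATION"}
SUBJ_TYPE, OBJ_TYPE = "PERSON", "ORGANIZATION"  # requirement for Work_For

nlp = spacy.load("en_core_web_lg")
plain_text = "Bill Gates stepped down as chairman of Microsoft in February 2014."
doc = nlp(plain_text)

for sentence in doc.sents:
    entities = [(ent.text, SPACY_TO_TYPE.get(ent.label_)) for ent in sentence.ents]
    # Keep only candidate pairs that have BOTH required entity types.
    pairs = [(subj, obj)
             for subj in entities if subj[1] == SUBJ_TYPE
             for obj in entities if obj[1] == OBJ_TYPE]
    if not pairs:
        continue  # sentence lacks a required entity; never run the second step
    # ... only now run SpanBERT (or send the sentence to Gemini) on these pairs.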
If -gemini is specified, you will use Google's Gemini API with the 2.0 Flash model for relation extraction, rather than SpanBERT. Google Gemini and its peer large language models, or LLMs, represent a major milestone in generative modeling of text. LLMs have achieved state-of-the-art performance on many text-centric tasks, such as machine translation and question answering, and perform impressively on relation extraction as well.
Google's Gemini API can extract relations directly
from sentences, as a response to a textual "prompt" provided by you,
as we discussed in class. The prompt can also specify the format
that you would like the output in, so that you can easily integrate
this output into the rest of the data pipeline in this
project. Google
Gemini's web interface is helpful to experiment with designing a
good prompt to extract relations from text.
IMPORTANT: You need to be logged into your
personal Gmail/Google account for this web
interface to work; it will not work if you are logged into your
Columbia account, unfortunately.
There are infinitely many prompts that will work well; there is not one single magic prompt that you have to discover. We have put together a minimal Python script, namely gemini_helper_6111.py, to illustrate how to invoke the API.
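For orientation, a minimal call through the google-generativeai package might look as follows (the prompt wording is ours, and the model identifier "gemini-2.0-flash" is an assumption based on the 2.0 Flash model named above; see gemini_helper_6111.py for the intended usage):

import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")     # placeholder; use your own key
model = genai.GenerativeModel("gemini-2.0-flash")  # assumed model identifier

# An illustrative prompt; designing a good prompt is part of the project.
prompt = (
    'Extract all Work_For relations from the sentence below. Return each '
    'relation on its own line as ["SUBJECT", "Work_For", "OBJECT"].\n'
    'Sentence: "Bill Gates stepped down as chairman of Microsoft in February 2014."'
)
response = model.generate_content(prompt)
print(response.text)  # parse this text into tuples to add to set X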
For SpanBERT above, we followed a 2-step process of first tagging sentences and then classifying each candidate relation with SpanBERT, for each relation type. In contrast, Google Gemini can extract relations directly over the plain text, with no named-entity tagging.
However, because extraction with Google Gemini's API is still computationally expensive (and in other settings can cost actual money!), you should still follow the multistep process described above for SpanBERT, so that you only feed Gemini sentences that have the proper entities for the relation of interest. Specifically, you should first tag the sentences using spaCy to identify those that contain named entity pairs of the right type for the relation of interest r, and then feed a plain-text version of these sentences, without any tags, to Google Gemini's API.
IMPORTANT NOTE 1: You must not submit to Google's Gemini API any sentences that are missing one or both entities of the types required by the relation. If a sentence is missing one or both entities of the types required by the relation, you should skip it and move on to the next sentence, for computational efficiency and, importantly, to avoid monetary charges.
IMPORTANT NOTE 2: To avoid overloading Google Gemini, you can assume that whenever we specify -gemini, the value of the number of tuples k that we request will never exceed a modest number such as 10.
IMPORTANT NOTE 3: You may occasionally encounter an Internal Server Error exception (google.api_core.exceptions.InternalServerError: 500) when making a call to the Google Gemini API. This is a transient error signifying that Google's servers are currently at capacity. If you frequently receive this error, try waiting a bit for Google's servers to clear up. On a related note, the API might return an error 429 "Resource has been exhausted", meaning that you are sending requests too fast. To avoid this error, you might want to include a line time.sleep(5) in your code before each call to the Google Gemini API.
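If you want to guard against these transient errors programmatically, a simple retry wrapper along the following lines (our own sketch, not required) may help:

import time
import google.api_core.exceptions

def call_gemini_with_retries(model, prompt, max_attempts=3):
    """Call the Gemini API, sleeping between attempts to ride out 500/429 errors."""
    for _ in range(max_attempts):
        time.sleep(5)  # throttle requests to avoid 429 "Resource has been exhausted"
        try:
            return model.generate_content(prompt).text
        except (google.api_core.exceptions.InternalServerError,
                google.api_core.exceptions.ResourceExhausted):
            continue  # transient server-side error; wait and try again
    return None  # give up on this call after repeated failures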
Your Project 2 submission will consist of the following three components, which you should submit on Gradescope by Monday March 31 at 5 p.m. ET:
- Your Python code.
- A README file for your project.
- A transcript of two runs of your program: one with parameters -spanbert 2 0.7 "bill gates microsoft" 10 (i.e., using SpanBERT for r=2, t=0.7, q=[bill gates microsoft], and k=10), and one with parameters -gemini 2 0.0 "bill gates microsoft" 10 (i.e., using Google's Gemini API for r=2 and t=0, which is ignored for -gemini, q=[bill gates microsoft], and k=10). The format of each transcript should closely follow the format of the reference implementation, and should print the same information (i.e., number of characters, sentences, relations, etc.) as the corresponding session of the reference implementation.

To submit your project, please follow these steps:
- Include your code, your README file, and the two transcripts, making clear which transcript corresponds to the run with -spanbert and which corresponds to the run with -gemini.
We created a reference implementation for this project. The reference implementation is called as follows:

python3 project2.py [-spanbert|-gemini] <google api key> <google engine id> <google gemini api key> <r> <t> <q> <k>

where:

- [-spanbert|-gemini] indicates whether to use SpanBERT or Google Gemini for relation extraction;
- <google api key> and <google engine id> are your Google Custom Search Engine JSON API key and engine ID;
- <google gemini api key> is your Google Gemini API key;
- <r> is an integer between 1 and 4, indicating the relation to extract: 1 for Schools_Attended, 2 for Work_For, 3 for Live_In, and 4 for Top_Member_Employees;
- <t> is a real number between 0 and 1, indicating the extraction confidence threshold (ignored for -gemini);
- <q> is the seed query, given as a quoted list of words corresponding to a plausible tuple for the relation to extract; and
- <k> is an integer greater than 0, indicating the number of tuples requested in the output.
Unfortunately, the SpanBERT classifier requires substantial amounts of memory to run. Therefore, a VM that could support many concurrent runs of the reference implementation, to accommodate the number of students in the class, would exceed our available Google Cloud budget. So rather than giving you direct access to the reference implementation, we provide the transcripts of a variety of runs of the reference implementation. We will keep adding and updating these transcripts periodically, so you have reasonably up-to-date runs available.
Please adhere to the format of the reference implementation for your submission and your transcript file. Also, you can use the transcripts of the reference implementation to give you an idea of how good your overall system should be. Ideally, the number of querying iterations that your system takes to extract the number of tuples requested should be at least as low as that of our reference implementation.
NOTE/HINT: The "prompt" for the -gemini option of the reference implementation uses the following examples of relations:

'["Jeff Bezos", "Schools_Attended", "Princeton University"]'
'["Alec Radford", "Work_For", "OpenAI"]'
'["Mariah Carey", "Live_In", "New York City"]'
'["Nvidia", "Top_Member_Employees", "Jensen Huang"]'

Specifically, the prompt lists the one example above for the relation requested by input parameter r, together with an example sentence including the relation. (Think of it as a "one-shot in-context learning" prompt, as we discussed in class.) You are welcome to use or not use these examples as part of your prompt; overall, the design of the text prompt is part of what you need to complete for the project. Finally, please recall that we saw an example in class of a possible prompt for this general task, which you are welcome to adapt.
A part of your grade will be based on the correctness of your overall system. Another part of your grade will be based on the number of querying iterations that your system takes to extract the number of tuples requested: ideally, this number should be at least as low as that of our reference implementation. We will not grade you on the run-time efficiency of each individual iteration, as long as you correctly implement the two annotation "steps" described above; in particular, note that you must not run the second (expensive) step for all sentences, but rather you should restrict that second step to only those sentences that satisfy the criteria described above. We will also grade your submission based on the quality of your code, the quality of the README file, and the quality of your transcript.