This repository was archived by the owner on Sep 3, 2024. It is now read-only.

Commit 4b5a802

Push final code.
1 parent 5d952c1 commit 4b5a802


43 files changed: +141434 −2 lines changed

README.md

100644 → 100755
+103 −2
@@ -1,2 +1,103 @@
-# S4_semantic_shift
-code for reproducing results in AAAI 2021 paper

# Fake it Till You Make it: Self-Supervised Semantic Shifts for Monolingual Word Embedding Tasks

This repository contains the code for the experiments presented in the paper `Fake it Till You Make it: Self-Supervised Semantic Shifts for Monolingual Word Embedding Tasks` (2020).

## Requirements

This repository mainly contains Python 3 routines; its dependencies are listed in `requirements.txt`. To install the dependencies using pip/venv, run:

```
pip3 install -r requirements.txt
```

## Setup

After installing the requirements, run `setup.sh` to configure the environment and download the pre-trained word embeddings:

```
sh setup.sh
```

This will create the folders that store the results and will download the pre-trained vectors. The download is approximately 500 MB.

Alternatively, the pre-trained embeddings can be downloaded [here](https://zenodo.org/record/3890109/files/wordvectors.zip?download=1).
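For a quick sanity check after downloading, the embeddings can be loaded with the `WordVectors` class defined in `WordVectors.py` (shown further below). A minimal sketch, assuming the archive was unpacked into a local `wordvectors/` folder; the file names used here are placeholders, not the actual names shipped in the archive:

```python
from WordVectors import WordVectors, intersection

# Placeholder paths; substitute the files unpacked by setup.sh.
wv_a = WordVectors(input_file="wordvectors/bnc.vec")
wv_b = WordVectors(input_file="wordvectors/coca.vec")

# Restrict both spaces to their shared vocabulary before comparing them.
wv_a, wv_b = intersection(wv_a, wv_b)
print(len(wv_a), "words in common, dimension", wv_a.dimension)
```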
## Results

### British English :gb: vs. American English :us:

Results for the classification task of detecting semantic shift between British English and American English.

**Requires the pre-trained word embeddings from BNC and COCA.**

To reproduce these results, run:

```
chmod +x ukus_experiment.sh
./ukus_experiment.sh
```

By default, results are saved to `results/ukus/cls_results.txt`.

|Method|Alignment|Accuracy|Precision|Recall|F1|
|------|---------|--------|---------|------|--|
|COS|global|0.35|0.71|0.19|0.30|
|S4-D|global|0.45 ± 0.02|0.45 ± 0.02|0.45 ± 0.02|0.45 ± 0.03|
|Noisy-Pairs|-|0.29|1.00|0.03|0.06|
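The `COS` row above is the cosine-distance baseline: after aligning the two embedding spaces, a target word is predicted as semantically shifted when the cosine distance between its two vectors exceeds a threshold. A rough sketch of that idea using the `WordVectors` and `align` utilities from this commit; `targets` and `threshold` are placeholders supplied by the caller, not values taken from the experiment scripts:

```python
import numpy as np
from WordVectors import intersection
from alignment import align

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def predict_shift(wv_uk, wv_us, targets, threshold=0.5):
    # Keep the shared vocabulary, then align wv_uk onto wv_us globally.
    wv_uk, wv_us = intersection(wv_uk, wv_us)
    wv_uk_aligned, wv_us, _ = align(wv_uk, wv_us)
    # A word is labeled as shifted when its two vectors disagree beyond the threshold.
    return {w: cosine_distance(wv_uk_aligned[w], wv_us[w]) > threshold
            for w in targets if w in wv_uk_aligned}
```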
### SemEval-2020 Task on Unsupervised Lexical Semantic Change Detection

Results for the binary classification task on semantic shift for multiple languages (SemEval-2020 Task 1): English, German, Latin, and Swedish.

**Requires the pre-trained embeddings from SemEval.**

To reproduce these results:

```
chmod +x semeval_experiment.sh
./semeval_experiment.sh
```

By default, results are saved to `results/semeval/cls_results.txt`.

|Method|Language|Mean acc.|Max acc.|
|------|--------|---------|--------|
|s4|english|0.62|0.70|
|noise-aware|english|0.61|0.65|
|top-10|english|0.59|0.68|
|bot-10|english|0.58|0.68|
|global|english|0.61|0.68|
|top-5|english|0.59|0.65|
|bot-5|english|0.57|0.68|
### arXiv Semantic Shift Discovery

Word discovery experiment on the arXiv data set for the subjects Artificial Intelligence (cs.AI) and Classical Physics (physics.class-ph). The table below lists the top semantically shifted words uniquely discovered by the Global, Noise-Aware, and S4-A alignments, respectively, as well as the most shifted words commonly discovered by all three methods.

**Requires the pre-trained embeddings from arXiv.**

To reproduce these results:

```
chmod +x arxiv_experiment.sh
./arxiv_experiment.sh
```

The table of results is saved in `results/arxiv/table.txt`; the ranking correlation plot is saved in `results/arxiv/arxiv_ranking.pdf`.

|Global|Noise-Aware|S4-A|Common|
|------|-----------|----|------|
|agent|components|concepts|nodes|
|approximation|element|density|phys|
|boundary|mass|deterministic|polynomial|
|conceptual|order|die|probability|
|knowledge|solution|edge|respect|
|plane|space|equations|rev|
|reference|state|fields|rough|
|rules|term|internal|rule|
|system|time|light|tensor|
|systems|vector|los|variables|
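The discovery setup can be approximated by ranking the shared vocabulary by post-alignment cosine distance and comparing the top of the ranking across alignment strategies. The sketch below compares only a global alignment against a top-k anchored one (the Noise-Aware and S4-A alignments are not part of the files shown in this commit); `wv_ai` and `wv_physics` are assumed to be already-loaded `WordVectors` objects, and the `k`/`anchor_top` values are placeholders:

```python
import numpy as np
from WordVectors import intersection
from alignment import align

def top_shifted(wv_a, wv_b, k=10, **align_kwargs):
    """Rank shared words by cosine distance after aligning wv_a onto wv_b."""
    wv_a, wv_b = intersection(wv_a, wv_b)
    wv_a, wv_b, _ = align(wv_a, wv_b, **align_kwargs)
    def dist(w):
        u, v = wv_a[w], wv_b[w]
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return sorted(wv_a.words, key=dist, reverse=True)[:k]

# Placeholder comparison: global alignment vs. a top-k anchored alignment.
global_top = set(top_shifted(wv_ai, wv_physics))
anchored_top = set(top_shifted(wv_ai, wv_physics, anchor_top=10000))
print("unique to global:", global_top - anchored_top)
print("common to both:", global_top & anchored_top)
```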

WordVectors.py

+203
@@ -0,0 +1,203 @@

```python
import numpy as np
from collections import OrderedDict
from sklearn import preprocessing


# This file contains the WordVectors class used to load and handle word embeddings
def intersection(*args):
    """
    This function returns the intersection between WordVectors objects
    I.e.: all words that occur in both objects simultaneously as well as their
    respective word vectors
    Returns: list(WordVectors) objects with intersecting words
    """
    if len(args) < 2:
        print("! Error: intersection requires at least 2 WordVectors objects")
        return None
    # Get intersecting words
    # WARNING: using set intersection will affect the order of words
    # in the original word vectors; to keep results consistent
    # it is better to iterate over the list of words, so that the
    # resulting order follows the first WordVectors' order
    common_words = set.intersection(*[set(wv.words) for wv in args])
    # Get intersecting words following the order of the first WordVectors
    words = [w for w in args[0].words if w in common_words]

    # Retrieve vectors from each input for the intersecting words
    wv_out = list()  # list of output WordVectors
    for wv in args:
        wv_out.append(WordVectors(words=words, vectors=[wv[w] for w in words]))

    return wv_out


def union(*args, f="average"):
    """
    Performs union of two or more word vectors, returning a new WordVectors
    containing the union of words and a combination of vectors according to
    the given function.
    Arguments:
        *args - list of WordVectors objects
        f     - (str) function to use when combining word vectors (defaults to average)
    Returns:
        wv    - WordVectors containing the union of the input args
    """

    if f == 'average':
        f = lambda x: sum(x)/len(x)

    union_words = set.union(*[set(wv.words) for wv in args])

    words = list(union_words)
    vectors = np.zeros((len(words), args[0].dimension), dtype=float)
    for i, w in enumerate(words):
        # Get list of existing vectors for w
        vecs = np.array([wv[w] for wv in args if w in wv])
        vectors[i] = f(vecs)  # Combine vectors

    wv_out = WordVectors(words=words, vectors=vectors)

    return wv_out


# Implements a WordVectors class that performs the mapping of word tokens to vectors
# Stores words as an ordered list with a word -> id mapping, plus an (n, dim) vector matrix
class WordVectors:
    """
    WordVectors class containing methods for handling the mapping of words
    to vectors.
    Attributes
    - word_id -- OrderedDict mapping word to id in list of vectors
    - words -- list of words mapping id (index) to word string
    - vectors -- n x dim matrix of word vectors, follows id order
    - counts -- not used at the moment, designed to store word count
    - dimension -- dimension of word vectors
    - zipped -- a zipped list of (word, vec) used to construct the object
    - min_freq -- filter out words whose frequency is less than min_freq
    """
    def __init__(self, words=None, vectors=None, counts=None, zipped=None,
                 input_file=None, centered=True, normalized=False,
                 min_freq=0, word_frequency=None):

        if words is not None and vectors is not None:
            self.word_id = OrderedDict()
            self.words = list()
            for i, w in enumerate(words):
                self.word_id[w] = i
            self.words = list(words)
            self.vectors = np.array(vectors)
            self.counts = counts
            self.dimension = len(vectors[0])
        elif zipped:
            pass
        elif input_file:
            self.dimension = 0
            self.word_id = dict()
            self.words = list()
            self.counts = dict()
            self.vectors = None
            self.read_file(input_file)

        if centered:
            self.center()
        if normalized:
            self.normalize()

        if word_frequency:
            self.filter_frequency(min_freq, word_frequency)

    def center(self):
        self.vectors = self.vectors - self.vectors.mean(axis=0, keepdims=True)

    def normalize(self):
        self.vectors = preprocessing.normalize(self.vectors, norm="l2")

    def get_words(self):
        return self.word_id.keys()

    # Returns a numpy (m, dim) array for a given list of words
    # I.e.: select vectors whose words are in the argument words
    def get_vectors_from_words(self, words):
        vectors = np.zeros((len(words), self.dimension))
        for i, w in enumerate(words):
            vectors[i] = self[w]
        return vectors

    # Return (word, vec) for a given word
    # In future versions may only return self.vectors
    def loc(self, word, return_word=False):
        if return_word:
            return word, self.vectors[self.word_id[word]]
        else:
            return self.vectors[self.word_id[word]]

    def get_count(self, word):
        # Note: counts is only populated when passed to the constructor
        return self.counts[self.word_id[word]]

    # Get word, vector pair from id
    def iloc(self, id_query, return_word=False):
        if return_word:
            return self.words[id_query], self.vectors[id_query]
        else:
            return self.vectors[id_query]

    # Overload [], given word w returns its vector
    def __getitem__(self, key):
        if isinstance(key, int) or isinstance(key, np.int64):
            return self.iloc(key)
        elif isinstance(key, slice):  # slice
            return ([w for w in self.words[key.start: key.stop]],
                    [v for v in self.vectors[key.start: key.stop]])
        return self.loc(key)

    def __len__(self):
        return len(self.words)

    def __contains__(self, word):
        return word in self.word_id

    def filter_frequency(self, min_freq, word_frequency):
        print("Filtering %d" % min_freq)
        words_kept = list()
        vectors_kept = list()
        for word, vec in zip(self.words, self.vectors):
            if word in word_frequency and word_frequency[word] > min_freq:
                words_kept.append(word)
                vectors_kept.append(vec)

        self.words = words_kept
        self.vectors = np.array(vectors_kept)
        self.word_id = OrderedDict()
        for i, w in enumerate(self.words):
            self.word_id[w] = i

        print(" - Found %d words" % len(self.words))

    # Read a file in the following format:
    # The first line contains "n_items dim"; each following line is "word v1 v2 ... v_dim"
    def read_file(self, path):
        with open(path) as fin:
            n_words, dim = map(int, fin.readline().rstrip().split(" ", 1))
            self.dimension = dim
            # print("Reading WordVectors (%d,%d)" % (n_words, dim))

            # Use this function to process line reading in map
            def process_line(s):
                s = s.rstrip().split(" ", 1)
                w = s[0]
                v = np.array(s[1].split(" "), dtype=float)
                return w, v

            data = map(process_line, fin.readlines())
            self.words, self.vectors = zip(*data)
            self.words = list(self.words)
            self.word_id = {w: i for i, w in enumerate(self.words)}
            self.vectors = np.array(self.vectors, dtype=float)

    def save_txt(self, path):
        with open(path, "w") as fout:
            fout.write("%d %d\n" % (len(self.word_id), self.dimension))
            for word, vec in zip(self.words, self.vectors):
                v_string = " ".join(map(str, vec))
                fout.write("%s %s\n" % (word, v_string))
```
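A small usage sketch for the class above, using toy random vectors (the words and dimensionality here are placeholders):

```python
import numpy as np
from WordVectors import WordVectors, intersection, union

# Two tiny toy embedding spaces (3-dimensional) with overlapping vocabularies.
wv1 = WordVectors(words=["cat", "dog", "car"],
                  vectors=np.random.rand(3, 3), centered=False)
wv2 = WordVectors(words=["dog", "car", "bus"],
                  vectors=np.random.rand(3, 3), centered=False)

print(wv1["dog"])       # vector lookup by word (__getitem__ -> loc)
print(wv1.iloc(0))      # vector lookup by integer id
print("dog" in wv2)     # membership test (__contains__)

a, b = intersection(wv1, wv2)   # shared vocabulary: ["dog", "car"]
merged = union(wv1, wv2)        # all words, vectors averaged where both exist
print(a.words, len(merged))
```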

alignment.py

+47
@@ -0,0 +1,47 @@

```python
from scipy.linalg import orthogonal_procrustes
import numpy as np
from WordVectors import WordVectors


# Word alignment module
def align(wv1, wv2, anchor_indices=None, anchor_words=None, anchor_top=None,
          anchor_bot=None, anchor_random=None,
          exclude={},
          method="procrustes"):
    """
    Implements OP (orthogonal Procrustes) alignment for a given set of landmarks.
    If no landmark is given, performs global alignment.
    Arguments:
        wv1 - WordVectors object to align to wv2
        wv2 - Target WordVectors. Will align wv1 to it.
        anchor_indices - (optional) uses word indices as landmarks
        anchor_words - (optional) uses words as landmarks
        anchor_top - (optional) uses the first anchor_top words as landmarks
        anchor_bot - (optional) uses the last anchor_bot words as landmarks
        anchor_random - (optional) samples anchor_random words as landmarks
        exclude - set of words to exclude from alignment
        method - Alignment objective. Currently only supports orthogonal Procrustes.
    """
    if anchor_top is not None:
        v1 = [wv1.vectors[i] for i in range(anchor_top) if wv1.words[i] not in exclude]
        v2 = [wv2.vectors[i] for i in range(anchor_top) if wv2.words[i] not in exclude]
    elif anchor_bot is not None:
        # Use the last anchor_bot words of each vocabulary as landmarks
        v1 = [wv1.vectors[-i] for i in range(1, anchor_bot + 1) if wv1.words[-i] not in exclude]
        v2 = [wv2.vectors[-i] for i in range(1, anchor_bot + 1) if wv2.words[-i] not in exclude]
    elif anchor_random is not None:
        anchors = np.random.choice(range(len(wv1.vectors)), anchor_random)
        v1 = [wv1.vectors[i] for i in anchors if wv1.words[i] not in exclude]
        v2 = [wv2.vectors[i] for i in anchors if wv2.words[i] not in exclude]
    elif anchor_indices is not None:
        v1 = [wv1.vectors[i] for i in anchor_indices if wv1.words[i] not in exclude]
        v2 = [wv2.vectors[i] for i in anchor_indices if wv2.words[i] not in exclude]
    elif anchor_words is not None:
        v1 = [wv1[w] for w in anchor_words if w not in exclude]
        v2 = [wv2[w] for w in anchor_words if w not in exclude]
    else:  # just use all words
        v1 = [wv1[w] for w in wv1.words if w not in exclude]
        v2 = [wv2[w] for w in wv2.words if w not in exclude]
    v1 = np.array(v1)
    v2 = np.array(v2)
    if method == "procrustes":  # align with OP
        Q, _ = orthogonal_procrustes(v1, v2)
    else:
        raise ValueError("Unsupported alignment method: %s" % method)

    wv1_ = WordVectors(words=wv1.words, vectors=np.dot(wv1.vectors, Q))

    return wv1_, wv2, Q
```
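A minimal sketch of how `align` might be called, assuming two embedding files are available locally (the file paths and landmark words below are placeholders and must exist in both vocabularies):

```python
import numpy as np
from WordVectors import WordVectors, intersection
from alignment import align

# Placeholder input files; substitute real embeddings in word2vec text format.
wv1 = WordVectors(input_file="wordvectors/source.vec")
wv2 = WordVectors(input_file="wordvectors/target.vec")
wv1, wv2 = intersection(wv1, wv2)  # shared vocabulary, consistent word order

# Global alignment uses every shared word as a landmark.
wv1_global, _, Q = align(wv1, wv2)
# Anchored variants: the first 5000 words, or an explicit landmark list.
wv1_top, _, _ = align(wv1, wv2, anchor_top=5000)
wv1_anchor, _, _ = align(wv1, wv2, anchor_words=["time", "state", "system"])

# Q is orthogonal, so the alignment is a rotation/reflection of wv1's space.
print(np.allclose(Q @ Q.T, np.eye(wv1.dimension), atol=1e-6))
```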
