Initialization

Enable the logging module

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Miscellaneous notes

Commonly used APIs


from collections import defaultdict
from pprint import pprint
from gensim import corpora, models

default_dict = defaultdict(int)

dictionary = corpora.Dictionary(texts)
dictionary.save('my_dict.dict') # one way to save the dictionary (binary)
dictionary.save_as_text('my_dict_text.dict') # save the dictionary as plain text
dictionary.token2id # mapping from token to integer id
dictionary.doc2bow(new_doc.lower().split()) # convert a document to bag-of-words

corpora.MmCorpus.serialize(path, corpus) # store a corpus to disk
corpus = corpora.MmCorpus('/tmp/corpus.mm') # load it back

model.save(path)
model = models.TfidfModel.load(path) # load() is a classmethod; TfidfModel shown as an example

pprint(texts)
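
The cheat sheet above only shows the save side for the dictionary. For completeness, a minimal sketch of the matching load calls (file names reuse the ones above; these are standard gensim classmethods):

from gensim import corpora

# load a dictionary saved with dictionary.save(...)
dictionary = corpora.Dictionary.load('my_dict.dict')
# load a dictionary saved with dictionary.save_as_text(...)
dictionary = corpora.Dictionary.load_from_text('my_dict_text.dict')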

Tutorial_1: Corpora and Vector Spaces



Building the dictionary and corpus

from collections import defaultdict
from pprint import pprint
from gensim import corpora

# Build the corpus; the texts in `documents` are already space-separated
def construct_corpus(documents):
    # load stop words (load_stopwords_dict is the author's helper)
    stopwords_dict = load_stopwords_dict()
    # tokenize and remove stop words
    texts = [[word for word in doc.strip().split(' ') if word not in stopwords_dict] for doc in documents]
    # frequency is a dict mapping token -> count
    frequency = defaultdict(int)
    for text in texts:
        for token in text:
            frequency[token] += 1
    # filter out tokens that appear only once
    texts = [[token for token in text if frequency[token] > 1] for text in texts]
    dictionary = corpora.Dictionary(texts) # build the dictionary
    dictionary.save('.\\tmp\\my_dict.dict') # one way to save the dictionary
    dictionary.save_as_text('.\\tmp\\my_dict_text.dict') # save the dictionary as plain text
    corpus = [dictionary.doc2bow(text) for text in texts] # convert each doc to bow via the dictionary
    corpora.MmCorpus.serialize('.\\tmp\\corpus.mm', corpus) # store to disk, for later use
    # print(corpus)
    return corpus

Saving and loading the corpus

Saving the corpus

from gensim import corpora

# the most commonly used format
corpora.MmCorpus.serialize('/tmp/deerwester.mm', corpus) # store to disk, for later use

# other formats: Joachim's SVMlight format, Blei's LDA-C format, and GibbsLDA++ format
corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)
corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)
corpora.LowCorpus.serialize('/tmp/corpus.low', corpus)

MmCorpus.serialize produces two files: `deerwester.mm` and `deerwester.mm.index`.

Structure of the .mm file

%%MatrixMarket matrix coordinate real general
37789 46872 1089557
1 1 2
1 2 2
1 3 2
1 4 2
1 5 2
1 6 1
1 7 2
1 8 2
1 9 2
1 10 2
1 11 2
1 12 1
1 13 2
1 14 1
2 3 1
2 15 1
2 16 1
2 17 1
2 18 1
2 19 1
2 20 1
2 21 1
2 22 1
2 23 1
2 24 1
2 25 1

In the second line, the first number is the number of documents, the second is the size of the dictionary (vocabulary), and the third is the total number of non-zero entries, i.e. distinct (document, term) pairs across all documents.

In the data rows that follow, the first column is the document number, the second is the id of a term in the dictionary, and the third is the number of times that term occurs in that document.
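
As a quick sanity check, here is a minimal sketch with a made-up two-document corpus showing how the header and data rows map to doc2bow output (the exact term ids depend on the order the dictionary assigns them):

from gensim import corpora

texts = [['a', 'b', 'a'], ['b', 'c']] # hypothetical two-document corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('tiny.mm', corpus)
# tiny.mm should look like:
#   %%MatrixMarket matrix coordinate real general
#   2 3 4      <- 2 documents, 3 dictionary entries, 4 non-zero entries
#   1 1 2      <- doc 1, term 1, frequency 2 (indices are 1-based in the file)
#   ...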

Loading a corpus

# returns an iterator, a streamed object, so printing it directly only shows the object
corpus = corpora.MmCorpus('/tmp/corpus.mm')
# one way of printing a corpus: load it entirely into memory
print(list(corpus)) # calling list() will convert any sequence to a plain Python list
# another way of doing it: print one document at a time, making use of the streaming interface
for doc in corpus:
    print(doc)
# the second way is more memory-friendly, but for testing and development purposes, nothing beats the simplicity of calling list(corpus)

Loading the dictionary and corpus as a stream

from six import iteritems
# collect statistics about all tokens, loading one line at a time
dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
# remove stop words and words that appear only once
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist if stopword in dictionary.token2id] # map stop words to their ids
once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1] # collect tokens that appear only once
dictionary.filter_tokens(stop_ids + once_ids) # remove stop words and words that appear only once
dictionary.compactify() # remove gaps in the id sequence after words were removed
Benefit: the data is loaded one line at a time, so a large file never has to fit in memory at once. It is also more efficient than loading everything first and then filtering, since it saves one full pass over the data.
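
The same streaming idea applies to the corpus itself. The pattern from the gensim tutorial is a class whose __iter__ yields one bag-of-words vector per line; a minimal sketch, assuming the dictionary built above and the same mycorpus.txt:

from gensim import corpora

class MyCorpus(object):
    """Stream one bow vector at a time instead of holding all docs in RAM."""
    def __init__(self, dictionary, path):
        self.dictionary = dictionary
        self.path = path
    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                # one document per line, tokens separated by whitespace
                yield self.dictionary.doc2bow(line.lower().split())

corpus_memory_friendly = MyCorpus(dictionary, 'mycorpus.txt')
for vector in corpus_memory_friendly: # only one vector is in memory at a time
    print(vector)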

Interoperability with NumPy and SciPy

# NumPy interoperability
import gensim
import numpy as np
numpy_matrix = np.random.randint(10, size=[5, 2]) # random matrix as an example
corpus = gensim.matutils.Dense2Corpus(numpy_matrix)
# number_of_corpus_features is a placeholder for the vocabulary size
numpy_matrix = gensim.matutils.corpus2dense(corpus, num_terms=number_of_corpus_features)

# SciPy interoperability
import scipy.sparse
scipy_sparse_matrix = scipy.sparse.random(5, 2) # random sparse matrix as example
corpus = gensim.matutils.Sparse2Corpus(scipy_sparse_matrix)
scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)
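
One pitfall here: corpus2dense needs num_terms because a streamed corpus does not carry its vocabulary size. A sketch, assuming `corpus` and `dictionary` are the bow corpus and dictionary built in Tutorial 1:

import gensim
# pass the vocabulary size explicitly; the dictionary knows it
dense = gensim.matutils.corpus2dense(corpus, num_terms=len(dictionary))
print(dense.shape) # (num_terms, num_documents)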

Tutorial_2: Topics and Transformations


Models and model initialization

Initializing TF-IDF

# load the corpus and dictionary
corpus_path = './tmp/corpus.mm'
corpus = load_corpus(corpus_path) # corpus is an iterable (load_corpus is the author's helper)

# convert to a weighted BOW; corpus is plain bag-of-words (integer counts)
tfidf_model = models.TfidfModel(corpus, normalize=True) # new model
# returns an iterator; results are computed lazily during iteration,
# which reflects gensim's memory-independence design
tfidf_bow = tfidf_model[corpus] # convert bow to tf-idf bow
for doc in tfidf_bow:
    print(doc)
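
The trained model can also transform a single unseen document. A quick sketch, assuming the dictionary from Tutorial 1 is still in scope:

new_bow = dictionary.doc2bow("华为 业务".lower().split()) # hypothetical query
print(tfidf_model[new_bow]) # list of (token_id, tf-idf weight) pairs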

Common VSM models

1. TF-IDF
2. LSI (Latent Semantic Indexing)
3. RP (Random Projections)
4. LDA (Latent Dirichlet Allocation)
5. HDP (Hierarchical Dirichlet Process)
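
These transformations can be chained, which is the usual gensim workflow. A minimal sketch, reusing corpus and dictionary from Tutorial 1, that feeds TF-IDF weights into LSI instead of raw counts:

from gensim import models

tfidf_model = models.TfidfModel(corpus) # step 1: tf-idf weighting
tfidf_bow = tfidf_model[corpus] # lazily wraps the bow corpus
lsi_model = models.LsiModel(tfidf_bow, id2word=dictionary, num_topics=2) # step 2: LSI on top
lsi_model.print_topics(2) # the topics show up via the INFO logging enabled at the top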


Tutorial_3: Similarity Queries


Similarity queries

def LSI_similarity_demo(corpus, dictionary, docs):
    # 256 is the number of latent features
    lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=256)

    index = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it
    index.save('./tmp/lsi_similarity.index')

    index = similarities.MatrixSimilarity.load('./tmp/lsi_similarity.index')

    doc = "华为 消费者 业务 确实 中兴 要强 好多"
    vec_bow = dictionary.doc2bow(doc.lower().split())
    vec_lsi = lsi[vec_bow] # convert the query to LSI space

    sims = index[vec_lsi] # perform a similarity query against the corpus
    # sort by similarity score in descending order: the minus sign negates each
    # score, and item[1] is the second element of the (document index, score)
    # tuple produced by enumerate()
    sims = sorted(enumerate(sims), key=lambda item: -item[1])
    for i in range(10):
        doc_idx = sims[i][0]
        score = sims[i][1]
        print('(' + str(doc_idx) + ', ' + str(score) + ', ' + docs[doc_idx] + ')')
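
Note that MatrixSimilarity keeps the entire index in memory. For corpora too large for that, gensim's similarities.Similarity shards the index to disk under a given path prefix and streams the shards during queries; a minimal sketch, reusing lsi, corpus, and vec_lsi from the demo above:

from gensim import similarities

# './tmp/shard' is the on-disk prefix for the index shards
index = similarities.Similarity('./tmp/shard', lsi[corpus], num_features=256)
sims = index[vec_lsi] # same query interface as MatrixSimilarity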