A Comparison of the Multiverse of Word Spaces
For decades, machine learning has tried not only to learn from numerical data but also to make sense of string data: text labeling, spam detection, sentiment analysis, document classification, and so on. Models that take strings as predictors do not literally use the “string” type in the calculations behind them; a computer does not understand how to add “apple” and “blender” to get “juice”. This is where word embedding comes into play: it converts string data into numerical data, usually by transforming words or documents into vectors. Beyond acting as a converter, a word embedding as a whole is also a kind of dictionary of the documents, capturing the words they contain, the relations between words, and so on.
This transformation comes in various formats, such as the Count Vector, TF-IDF Vector, Co-Occurrence Vector, Skip-Gram, and Continuous Bag of Words. Each format creates its own universe of words, so the word “apple” in one universe can be completely different from “apple” in another. As a result, the same machine learning model can produce different predictions depending on the format. But how big are the differences? And which one is more effective?
Word Space
In mathematics, a vector space is a collection of objects called vectors, which may be added together and multiplied (or scaled) by numbers called scalars (Wikipedia). A Word Space, then, is like a room full of words, each represented as a vector that can be scaled and added to the others.
The picture above is an example of transforming words into 2-dimensional vectors (usually drawn as points). In a Word Space, each vector is a finite n-dimensional vector, where n is a natural number greater than zero. These n-dimensional vectors contain n real numbers, so they can be added to and subtracted from each other.
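As a toy illustration, here is a tiny, completely made-up 2-dimensional Word Space in Python; the numbers are invented purely so that the classic “king - man + woman ≈ queen” relation works out, not taken from any real model:

```python
import numpy as np

# Made-up 2-dimensional word vectors, purely for illustration.
word_space = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
}

# Because words are real-valued vectors, they can be added and subtracted.
result = word_space["king"] - word_space["man"] + word_space["woman"]

# In a well-trained space, the nearest word to `result` should be "queen".
closest = min(word_space, key=lambda w: np.linalg.norm(word_space[w] - result))
print(closest)  # queen
```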
However, not every Word Space can produce results as striking as the ones in the picture.
Working in a Multiverse of Words
As mentioned before, a single word can be represented by different vectors depending on the technique and the corpus the word comes from. A corpus is a collection of texts: a collection of articles, a collection of books, or even a single book. So even with the same technique, the word “leaf” in a corpus of biology books and “leaf” in a corpus of one year of newspapers will have different vector representations. For machine learning, the corpus can come from anywhere as long as it is general enough and contains all the words in the training data; sometimes the corpus is simply all the text in the training data itself.
Once the corpus is ready, it is time to convert the words or documents in the corpus into vectors. Here is a quick explanation of some of the methods for doing so.
A. n-count vector (documents to vectors)
This method converts a document (imagine one long sentence) into an n-dimensional vector whose entries describe the statistics (a count or a percentage) of the influential words in the document, relative to the corpus. Usually, the influential words are the words with the highest counts in the corpus.
For example, take a corpus with the following 3 documents:
- I Love you and your cat and dog and all
- Love Love you and my dog now
- my cat loves me and my dog
These are the word counts for the corpus,
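A minimal way to reproduce these counts is with Python's collections.Counter; the snippet below lowercases every word and does no stemming, so “loves” is counted separately from “Love”:

```python
from collections import Counter

corpus = [
    "I Love you and your cat and dog and all",
    "Love Love you and my dog now",
    "my cat loves me and my dog",
]

# Count every (lowercased) word across the whole corpus.
counts = Counter(word for doc in corpus for word in doc.lower().split())
print(counts.most_common())
# [('and', 5), ('love', 3), ('dog', 3), ('my', 3), ('you', 2), ('cat', 2),
#  ('i', 1), ('your', 1), ('all', 1), ('now', 1), ('loves', 1), ('me', 1)]
```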
Use the counts above as a reference to create an n-count vector for every document. The first step is to define the most influential words, which in this case are the words that appear more than once. There are six such words, so each document will be transformed into a 6-count vector.
The most common variant of this transformation is the 6-count vector whose entries are the raw counts of the influential words in the document.
Another variant is the 6-count vector that contains the percentage of each influential word in the document.
The last variant is the boolean 6-count vector, where an entry is 1 if the document contains the influential word and 0 otherwise.
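Reusing the `corpus` list from the snippet above, here is a rough sketch of all three variants; the ordering of the six influential words is an arbitrary choice made for this illustration:

```python
# The six influential words: every corpus word that appears more than once.
influential = ["and", "love", "dog", "my", "you", "cat"]

def count_vector(doc):
    words = doc.lower().split()
    return [words.count(w) for w in influential]          # raw counts

def percentage_vector(doc):
    words = doc.lower().split()
    return [round(words.count(w) / len(words), 2) for w in influential]  # shares

def boolean_vector(doc):
    words = doc.lower().split()
    return [1 if w in words else 0 for w in influential]  # presence / absence

for doc in corpus:
    print(count_vector(doc), percentage_vector(doc), boolean_vector(doc))
# The first document, for instance, becomes [3, 1, 1, 0, 1, 1] as a raw 6-count vector.
```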
B. n-TF-IDF vector (documents to vectors)
The next method is TF-IDF (Term Frequency - Inverse Document Frequency). This embedding method transforms every word in a document into a weight based on the following formula,
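The standard form of this weighting, which matches the description below, is

w_{i,j} = tf_{i,j} \times \log\left( \frac{N}{df_i} \right)

where tf_{i,j} is the number of times term i appears in document j, df_i is the number of documents that contain term i, and N is the total number of documents in the corpus.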
The formula evaluates how important a word is to a document in the corpus by expressing it as a weight. The weight of a word is therefore different in every document (which explains why the weight has two indexes). Here is an example of the TF-IDF vectors for the corpus above, using the same six influential words,
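As a sketch, the weights can be computed directly from that formula, reusing `corpus` and `influential` from the earlier snippets (real libraries such as scikit-learn apply extra smoothing, so their numbers will not match exactly):

```python
import math

docs = [doc.lower().split() for doc in corpus]
N = len(docs)

def tfidf_vector(words):
    vec = []
    for term in influential:
        tf = words.count(term)                        # term frequency in this document
        df = sum(1 for d in docs if term in d)        # documents containing the term
        vec.append(round(tf * math.log(N / df), 3))   # w_ij = tf_ij * log(N / df_i)
    return vec

for words in docs:
    print(tfidf_vector(words))
# "and" and "dog" appear in every document, so their weight is log(3/3) = 0 everywhere.
```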
C. n-Word2Vec (words to vectors)
The last method that will be discussed is Word2Vec. Unlike TF-IDF, the idea of this method is to transform words into vectors that carry contextual meaning. Contextual meaning makes the distance between two synonymous or correlated words smaller, while unrelated or opposite words tend to end up farther apart. It lets the computer understand that the word ‘fruit’ is related to ‘apple’, ‘orange’, and so on. However, the contextual insight will differ from corpus to corpus.
The process that produces this kind of transformation is quite different from the previous ones, because it involves training a neural network.
In practice, the usual way to do word2vec is to collect a corpus and train on it with an existing word2vec implementation, such as gensim for Python or rword2vec for R. After training, the model can transform a word into an n-dimensional vector.
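As a rough sketch, training word2vec with gensim on the toy corpus from earlier could look like the code below; it assumes gensim 4 or newer (where the dimensionality parameter is called vector_size), and with such a tiny corpus the resulting vectors are essentially noise and only illustrate the API:

```python
from gensim.models import Word2Vec

# `corpus` as defined earlier; word2vec expects tokenized sentences.
sentences = [doc.lower().split() for doc in corpus]

# Train a small skip-gram model (sg=1).
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

vec = model.wv["dog"]                         # the 50-dimensional vector for "dog"
print(vec.shape)                              # (50,)
print(model.wv.most_similar("dog", topn=3))   # the nearest words in this tiny universe
```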
Pre-trained Universe
The internet is a mysterious place: many different pre-trained models (models that have already been trained on a specific, usually very large, corpus) can be found and downloaded for free. These models are sometimes provided with different vector lengths and are usually more than 1 gigabyte in size.
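For instance, gensim ships a downloader module that can fetch several well-known pre-trained models; the snippet below loads the Google News word2vec vectors, which are roughly 1.6 GB (the exact model names and sizes depend on what the gensim-data repository currently hosts):

```python
import gensim.downloader as api

# Download (once) and load a pre-trained model; the first call can take a while.
wv = api.load("word2vec-google-news-300")

print(wv["apple"].shape)                  # (300,) -- each word is a 300-dimensional vector
print(wv.most_similar("fruit", topn=5))   # contextually related words
```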
The Best Universe
In short, as the no free lunch theorem states, there is no single best model or technique for transforming a word into a vector. It really depends on the case, the corpus, and, most importantly, the objective.
For example, a general-context or pre-trained model cannot be used directly for a specialized case such as medical text. Some words have a different meaning in general usage than in medicine, and some medical terms will simply not exist in the model's vocabulary.
As another example, the word2vec method cannot be used when the data is likely to contain many new words that are not in the corpus. In that case, the n-count vector is a more reliable way to transform the words.
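To make that concrete, reusing the word2vec `model` and the `count_vector` function from the earlier snippets: a document full of unseen words still gets a (mostly zero) count vector, while asking word2vec for an out-of-vocabulary word simply fails:

```python
new_doc = "my hamster loves the vacuum"

print(count_vector(new_doc))   # still a valid 6-count vector: [0, 0, 0, 1, 0, 0]

try:
    model.wv["hamster"]        # "hamster" never appeared in the training corpus...
except KeyError:
    print("word2vec has no vector for 'hamster'")   # ...so it cannot be embedded
```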
Conclusion
The embedding process is an important step for doing analysis or machine learning on string data. There are multiple word embedding techniques, which together create something like a multiverse of words. These universes have different characteristics and uses: some carry contextual insight, and some cannot be used for specific use cases. Therefore, learning the deeper context of how an embedding will be used in the analysis is necessary. This, too, is a consequence of the no free lunch theorem discussed above.