Confused about code in Lec 13

leon1209
Posts: 6
Joined: Thu May 28, 2020 3:58 am

Confused about code in Lec 13

Post by leon1209 » Sun Jun 07, 2020 8:30 pm

Hi,

I have a question about the code in Lec 13 "Text Classification in Code". In your code, before X_train and Y_train are passed to a Random Forest Classifier, X_train is a 2D numpy array, but Y_train is the set of labels, which is a bunch of text strings. I am really confused about this. Why can we pass a 2D numpy array as X_train and a bunch of text as Y_train, and the classifier still works?

More generally, I am confused about the purpose of the whole code. I can read through every line of it but still don't know what it is trying to do... Since we already have pre-trained word vectors, what exactly are we training for?

Please advise, thanks.

lazyprogrammer
Site Admin
Posts: 33
Joined: Sat Jul 28, 2018 3:46 am

Re: Confused about code in Lec 13

Post by lazyprogrammer » Sun Jun 07, 2020 10:01 pm

Ytrain is a 1-D array of labels.

Just like if you have a dog vs. cat classifier, then Ytrain would be ["cat", "dog", "dog", ...]

Scikit-learn allows you to use strings as labels (if you implement an ML classifier yourself, as we do in Supervised Machine Learning, you can see that supporting this wouldn't be that difficult).
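For example, here is a minimal sketch with made-up numbers, just to show that string labels work directly:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy feature matrix: 4 samples, 3 features each (made-up numbers)
Xtrain = np.array([[1.0, 0.0, 2.0],
                   [0.5, 1.5, 0.0],
                   [2.0, 0.1, 1.0],
                   [0.0, 2.0, 0.5]])

# String labels work as-is -- no need to encode them as integers yourself
Ytrain = np.array(["cat", "dog", "cat", "dog"])

model = RandomForestClassifier(n_estimators=10)
model.fit(Xtrain, Ytrain)

# Predictions come back as strings too
print(model.predict(Xtrain[:2]))
```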

I'd recommend that if you aren't familiar with scikit-learn, you take the free prerequisite course first and get more familiar with the library.


> Since we already have pre-trained word vector, what exactly are we training for?

I think you are confusing the feature vectors with the classifier. We are training the classifier using "bag of words features".
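As a generic sketch (not the exact lecture code -- the toy sentences and CountVectorizer here are just stand-ins for bag-of-words style features), the training step looks like this:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# Toy documents and labels (placeholders, not the course data)
train_texts = ["stock split approved by the board",
               "quarterly net profit rises",
               "company completes sale of shares"]
train_labels = ["earn", "earn", "acq"]

# Bag-of-words: each document becomes a fixed-length vector of word counts
vectorizer = CountVectorizer()
Xtrain = vectorizer.fit_transform(train_texts)

# The classifier is what gets trained -- the features are fixed once computed
model = RandomForestClassifier(n_estimators=50)
model.fit(Xtrain, train_labels)

print(model.predict(vectorizer.transform(["board approves stock split"])))
```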

Later in the course we discuss how to find word vectors (this is just the beginner's corner section).

I would recommend looking at the "Course Outline" to understand what each section talks about.

leon1209
Posts: 6
Joined: Thu May 28, 2020 3:58 am

Re: Confused about code in Lec 13

Post by leon1209 » Mon Jun 08, 2020 5:30 am

Hi,

Thanks for the clarification. I revisited the code and now understand the general idea. My understanding is that we use the pre-trained word vectors to classify texts by first summing the vectors of the words in each training text to get a feature vector, and then training a classifier to minimize the cost between the prediction made from that feature vector and the label.
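In code, my understanding looks roughly like the sketch below (word_vectors here is a hypothetical dict of pre-trained embeddings with made-up values, not the actual variables from the lecture):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical pre-trained vectors: word -> fixed-length numpy array
# (in the real code these would be loaded from a pre-trained embedding file)
word_vectors = {
    "stock": np.array([0.1, 0.3, -0.2]),
    "split": np.array([0.0, 0.5, 0.4]),
    "profit": np.array([-0.3, 0.2, 0.1]),
}
DIM = 3  # embedding dimension of the toy vectors above

def doc_to_vector(text):
    # Sum/average the vectors of the words we have embeddings for
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vecs:
        return np.zeros(DIM)
    return np.mean(vecs, axis=0)

train_texts = ["stock split", "profit stock"]   # toy documents
train_labels = ["earn", "acq"]                  # toy labels

Xtrain = np.array([doc_to_vector(t) for t in train_texts])
model = RandomForestClassifier(n_estimators=10).fit(Xtrain, train_labels)
print(model.predict(Xtrain))
```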

But now I have another question that bothers me:

I know the pre-trained vectors were trained on a corpus (such as Wikipedia) with deep learning techniques, which means they were trained on text that makes sense linguistically. Through that training, the vectors come to represent the characteristics of a word (they can show analogies, synonyms, etc., as you've shown in the lectures). However, in this specific code, the texts we are trying to classify don't even read like normal language. This bothers me because I thought we could only use pre-trained vectors (trained on text that makes sense) to classify text that also makes sense linguistically. Yet the classifier still works. Why?

Meanwhile, since the pre-trained vectors can classify texts that don't read naturally, I feel that the semantic properties of the word vectors aren't really being used, because we are just doing vector addition over arbitrary text. So could we instead assign a vector of random values to each word in the training texts, feed those to the classifier, and let it do the classification?
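One way I imagine testing that idea (purely illustrative, not from the course code) would be a sketch like this:

```python
import numpy as np

# Give every word a random vector instead of a pre-trained one,
# then featurize documents the same way as before.
rng = np.random.default_rng(0)
DIM = 50
random_vectors = {}  # word -> random vector, created on first sight

def doc_to_vector(text):
    vecs = []
    for w in text.lower().split():
        if w not in random_vectors:
            random_vectors[w] = rng.normal(size=DIM)
        vecs.append(random_vectors[w])
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

# These features would then be fed to the classifier exactly as before,
# to compare against the pre-trained word vectors.
```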

Sorry that my question is rather long, but it has bothered me while implementing the code. I'd really appreciate it if you could share some thoughts!

lazyprogrammer
Site Admin
Posts: 33
Joined: Sat Jul 28, 2018 3:46 am

Re: Confused about code in Lec 13

Post by lazyprogrammer » Mon Jun 08, 2020 7:03 am

I would recommend reading the documentation (link in the code) if you think the text doesn't make sense.

These data files are created from news articles (Reuters) and are a classic dataset in machine learning.

leon1209
Posts: 6
Joined: Thu May 28, 2020 3:58 am

Re: Confused about code in Lec 13

Post by leon1209 » Mon Jun 08, 2020 4:18 pm

Hi,

I've read the texts from the training dataset, but they still don't make sense to me. Here is part of the training data:

"earn champion products ch approves stock split champion products inc said its board of directors approved a two for one stock split of its common shares for shareholders of record as of april the company also said its board voted to recommend to shareholders at the annual meeting april an increase in the authorized capital stock from five mln to mln shares reuter
acq computer terminal systems cpml completes sale computer terminal systems inc said it has completed the sale of shares of its common stock and warrants to acquire an additional one mln shares to sedio n v of lugano switzerland for dlrs the company said the warrants are exercisable for five years at a purchase price of dlrs per share computer terminal said sedio also has the right to buy additional shares and increase its total holdings up to pct of the computer terminal s outstanding common stock under certain circumstances involving change of control at the company the company said if the conditions occur the warrants would be exercisable at a price equal to pct of its common stock s market price at the time not to exceed dlrs per share computer terminal also said it sold the technolgy rights to its dot matrix impact technology including any future improvements to woodco inc of houston tex for dlrs but it said it would continue to be the exclusive worldwide licensee of the technology for woodco the company said the moves were part of its reorganization plan and would help pay current operation costs and ensure product delivery computer terminal makes computer generated labels forms tags and ticket printers and terminals reuter
earn cobanco inc cbco year net shr cts vs dlrs net vs assets mln vs mln deposits mln vs mln loans mln vs mln note th qtr not available year includes extraordinary gain from tax carry forward of dlrs or five cts per shr reuter
earn am international inc am nd qtr jan oper shr loss two cts vs profit seven cts oper shr profit vs profit revs mln vs mln avg shrs mln vs mln six mths oper shr profit nil vs profit cts oper net profit vs profit revs mln vs mln avg shrs mln vs mln note per shr calculated after payment of preferred dividends results exclude credits of or four cts and or nine cts for qtr and six mths vs or six cts and or cts for prior periods from operating loss carryforwards reuter
earn brown forman inc bfd th qtr net shr one dlr vs cts net mln vs mln revs mln vs mln nine mths shr dlrs vs dlrs net mln vs mln revs billion vs mln reuter"

It has plenty of abbreviations such as mln, cts, dlrs, etc.

