word2vec: commit 891d84c6
Authored Sep 06, 2014 by tmikolov

update to 0.1c version

Parent: 5815e5d0
Showing 8 changed files with 115 additions and 93 deletions (+115 -93)

demo-analogy.sh           +4   -4
demo-classes.sh           +1   -1
demo-phrase-accuracy.sh   +9  -10
demo-phrases.sh           +9   -6
demo-word-accuracy.sh     +1   -1
demo-word.sh              +2   -2
makefile                  +2   -2
word2vec.c               +87  -67
demo-analogy.sh
@@ -3,9 +3,9 @@ if [ ! -e text8 ]; then
   wget http://mattmahoney.net/dc/text8.zip -O text8.gz
   gzip -d text8.gz -f
 fi
-echo -----------------------------------------------------------------------------------------------------
-echo Note that for the word analogy to perform well, the models should be trained on much larger data sets
+echo ---------------------------------------------------------------------------------------------------
+echo Note that for the word analogy to perform well, the model should be trained on much larger data set
 echo Example input: paris france berlin
-echo -----------------------------------------------------------------------------------------------------
-time ./word2vec -train text8 -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1
+echo ---------------------------------------------------------------------------------------------------
+time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
 ./word-analogy vectors.bin
demo-classes.sh
@@ -3,6 +3,6 @@ if [ ! -e text8 ]; then
   wget http://mattmahoney.net/dc/text8.zip -O text8.gz
   gzip -d text8.gz -f
 fi
-time ./word2vec -train text8 -output classes.txt -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -classes 500
+time ./word2vec -train text8 -output classes.txt -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -iter 15 -classes 500
 sort classes.txt -k 2 -n > classes.sorted.txt
 echo The word classes were saved to file classes.sorted.txt
demo-phrase-accuracy.sh
 make
-if [ ! -e text8 ]; then
-  wget http://mattmahoney.net/dc/text8.zip -O text8.gz
-  gzip -d text8.gz -f
+if [ ! -e news.2012.en.shuffled ]; then
+  wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2012.en.shuffled.gz
+  gzip -d news.2012.en.shuffled.gz -f
 fi
-echo ----------------------------------------------------------------------------------------------------------------
-echo Note that the accuracy and coverage of the test set questions is going to be low with this small training corpus
-echo To achieve better accuracy, larger training set is needed
-echo ----------------------------------------------------------------------------------------------------------------
-time ./word2phrase -train text8 -output text8-phrase -threshold 500 -debug 2 -min-count 3
-time ./word2vec -train text8-phrase -output vectors-phrase.bin -cbow 0 -size 300 -window 10 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1 -min-count 3
-./compute-accuracy vectors-phrase.bin <questions-phrases.txt
+sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" < news.2012.en.shuffled | tr -c "A-Za-z'_ \n" " " > news.2012.en.shuffled-norm0
+time ./word2phrase -train news.2012.en.shuffled-norm0 -output news.2012.en.shuffled-norm0-phrase0 -threshold 200 -debug 2
+time ./word2phrase -train news.2012.en.shuffled-norm0-phrase0 -output news.2012.en.shuffled-norm0-phrase1 -threshold 100 -debug 2
+tr A-Z a-z < news.2012.en.shuffled-norm0-phrase1 > news.2012.en.shuffled-norm1-phrase1
+time ./word2vec -train news.2012.en.shuffled-norm1-phrase1 -output vectors-phrase.bin -cbow 1 -size 200 -window 10 -negative 25 -hs 0 -sample 1e-5 -threads 20 -binary 1 -iter 15
+./compute-accuracy vectors-phrase.bin < questions-phrases.txt
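The new phrase pipeline cleans the news corpus before running word2phrase: typographic apostrophes are mapped to ASCII, doubled apostrophes become a space, every character outside letters, apostrophe, underscore, space and newline is replaced by a space, and the text is lowercased after phrase detection. A minimal sketch of those text steps on a sample line follows. The function names `normalize` and `lowercase` are hypothetical wrappers added for illustration; the `sed` and `tr` invocations are the ones from the diff.

```shell
#!/bin/sh
# Sketch only: normalize and lowercase are hypothetical wrapper names,
# not part of the word2vec repository. The sed/tr commands are copied
# from the updated demo-phrase-accuracy.sh.

# Map typographic apostrophes to ASCII, turn '' into a space, then
# replace everything except letters, ', _, space and newline by a space.
normalize() {
  sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" | tr -c "A-Za-z'_ \n" " "
}

# Lowercase ASCII letters (applied in the demo after word2phrase has run).
lowercase() {
  tr A-Z a-z
}

# Digits and punctuation in the sample line are turned into spaces.
printf '%s\n' "Hello, World 42!" | normalize | lowercase
```

Note that in the demo the lowercasing happens only after both word2phrase passes, so phrase detection still sees the original capitalization.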
demo-phrases.sh
 make
-if [ ! -e text8 ]; then
-  wget http://mattmahoney.net/dc/text8.zip -O text8.gz
-  gzip -d text8.gz -f
+if [ ! -e news.2012.en.shuffled ]; then
+  wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2012.en.shuffled.gz
+  gzip -d news.2012.en.shuffled.gz -f
 fi
-time ./word2phrase -train text8 -output text8-phrase -threshold 500 -debug 2
-time ./word2vec -train text8-phrase -output vectors-phrase.bin -cbow 0 -size 300 -window 10 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1
-./distance vectors-phrase.bin
\ No newline at end of file
+sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" < news.2012.en.shuffled | tr -c "A-Za-z'_ \n" " " > news.2012.en.shuffled-norm0
+time ./word2phrase -train news.2012.en.shuffled-norm0 -output news.2012.en.shuffled-norm0-phrase0 -threshold 200 -debug 2
+time ./word2phrase -train news.2012.en.shuffled-norm0-phrase0 -output news.2012.en.shuffled-norm0-phrase1 -threshold 100 -debug 2
+tr A-Z a-z < news.2012.en.shuffled-norm0-phrase1 > news.2012.en.shuffled-norm1-phrase1
+time ./word2vec -train news.2012.en.shuffled-norm1-phrase1 -output vectors-phrase.bin -cbow 1 -size 200 -window 10 -negative 25 -hs 0 -sample 1e-5 -threads 20 -binary 1 -iter 15
+./distance vectors-phrase.bin
demo-word-accuracy.sh
@@ -3,6 +3,6 @@ if [ ! -e text8 ]; then
   wget http://mattmahoney.net/dc/text8.zip -O text8.gz
   gzip -d text8.gz -f
 fi
-time ./word2vec -train text8 -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1
+time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
 ./compute-accuracy vectors.bin 30000 < questions-words.txt
 # to compute accuracy with the full vocabulary, use: ./compute-accuracy vectors.bin < questions-words.txt
demo-word.sh
@@ -3,5 +3,5 @@ if [ ! -e text8 ]; then
   wget http://mattmahoney.net/dc/text8.zip -O text8.gz
   gzip -d text8.gz -f
 fi
-time ./word2vec -train text8 -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1
-./distance vectors.bin
\ No newline at end of file
+time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
+./distance vectors.bin
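The word-level demos touched by this commit (demo-word.sh, demo-analogy.sh, demo-word-accuracy.sh) all move to the same training settings: CBOW (`-cbow 1`) with 25 negative samples and hierarchical softmax turned off, a window of 8, a subsampling threshold of 1e-4, 20 threads and 15 iterations. As a rough sketch, those shared flags could be kept in one place. The helper `w2v_train_cmd` below is hypothetical and not part of the repository; it only prints the command and does not run any training.

```shell
#!/bin/sh
# Hypothetical helper, not part of word2vec: prints the training command
# that the updated word-level demo scripts use.
# $1 = training corpus, $2 = output vectors file
w2v_train_cmd() {
  echo "./word2vec -train $1 -output $2 -cbow 1 -size 200 -window 8" \
       "-negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15"
}

# Dry run: show the command for the text8 corpus.
w2v_train_cmd text8 vectors.bin
```

Printing rather than executing keeps the sketch safe to run on a machine where the binaries have not been built; piping the output to `sh` would start the actual training.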
makefile
 CC = gcc
-#The -Ofast might not work with older versions of gcc; in that case, use -O2
-CFLAGS = -lm -pthread -Ofast -march=native -Wall -funroll-loops -Wno-unused-result
+#Using -Ofast instead of -O3 might result in faster code, but is supported only by newer GCC versions
+CFLAGS = -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
 
 all: word2vec word2phrase distance word-analogy compute-accuracy
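The makefile comment notes that `-Ofast` is only accepted by newer GCC versions, which is why the default drops to `-O3`. One possible way to choose between the two at build time is to probe the compiler first. The sketch below assumes a hypothetical helper named `pick_opt_flag`, which is not part of the repository; it asks the compiler to syntax-check a trivial program with `-Ofast` and falls back to `-O3` if that fails.

```shell
#!/bin/sh
# Hypothetical helper, not part of the word2vec makefile.
# $1 = compiler command. Prints -Ofast if the compiler accepts that
# option, otherwise -O3 (the default chosen in this commit).
pick_opt_flag() {
  # -fsyntax-only avoids writing any output file; the program is read
  # from standard input via the here-document.
  if "$1" -Ofast -fsyntax-only -x c - >/dev/null 2>&1 <<'EOF'
int main(void) { return 0; }
EOF
  then
    echo "-Ofast"
  else
    echo "-O3"
  fi
}

# Report the flag for the compiler in $CC (gcc if unset).
echo "optimisation flag for ${CC:-gcc}: $(pick_opt_flag "${CC:-gcc}")"
```

The result could then be passed to make on the command line, for example by overriding `CFLAGS`, although the makefile in this commit simply hard-codes `-O3`.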
word2vec.c
This diff is collapsed in the source view; its contents are not shown.