How To Search CC-CEDICT

This article gives detailed technical information about how to search the CC-CEDICT Chinese-English dictionary. For shorter introductions, there is a short article about how to easily search Simplified Chinese words with the CC-CEDICT Chinese-English Dictionary, and another about how to search Traditional Chinese words with the CC-CEDICT Chinese-English Dictionary. If you want to learn about the format of the CC-CEDICT dictionary, read this article.

Simple way to search CC-CEDICT

CC-CEDICT can be downloaded here as a text file, and most programs that open text files support searching with CTRL+f. If you hold down the CTRL key and press f, a search bar will usually appear, and you can type what you want to find. For example, I could press CTRL+f in a text editor, Microsoft Word, or Firefox, search for a word such as 中國 (which means China), and step through the matches until I reach the entry I want. If you prefer not to get too technical, this simple way is probably the best for you.

Advanced ways to search CC-CEDICT

The following explanation assumes you have downloaded the CC-CEDICT text file here. If you are comfortable with technical commands and want to do very efficient searches, the advanced ways are probably the best for you.

I believe that the key to searching this dictionary quickly is a technology called “regular expressions”. The dictionary is not sorted in any easy way, and since it has more than 170,000 entries, looking through them one by one would take far too much time. Wikipedia defines regular expressions as follows: “In theoretical computer science and formal language theory, a regular expression (abbreviated regex or regexp and sometimes called a rational expression) is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. “find and replace”-like operations.” https://en.wikipedia.org/wiki/Regular_expression

I will now show you some commands that search the dictionary quickly. You can copy and paste what I do, and edit the commands for yourself.
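To make the idea of a search pattern concrete, here is a minimal sketch that runs on a tiny sample file instead of the real dictionary. It assumes each line has the layout "Traditional Simplified [pinyin] /definition/", which is the layout of the upstream CC-CEDICT release; the three sample lines and the temporary file are only for illustration, and your downloaded file may differ.

```shell
# Build a three-line sample in the assumed
# "Traditional Simplified [pinyin] /definition/" layout.
sample="$(mktemp)"
printf '%s\n' \
  '中國 中国 [Zhong1 guo2] /China/' \
  '中國人 中国人 [Zhong1 guo2 ren2] /Chinese person/' \
  '搜 搜 [sou1] /to search/' > "$sample"

# Without an anchor, the pattern matches anywhere in a line.
# -c prints the number of matching lines instead of the lines themselves.
grep -c "中國" "$sample"     # prints 2

# ^ anchors the pattern to the start of the line, and the trailing space
# ends the first field, so 中國人 no longer matches.
grep -c "^中國 " "$sample"   # prints 1
```

The same two ideas, anchoring with ^ and ending a field with a space, are what the commands in the rest of this article rely on.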

If you use Linux, the commands grep and sed will search this dictionary very well. If you use Mac OS, the steps should be almost identical: just open your terminal and follow along. This tutorial assumes zero knowledge of sed and grep, so you can copy and paste what I write without understanding the commands. It also assumes that you are using some distribution of Linux or Mac OS, but I have included links to software that lets you do the same things in Windows if you prefer. If you want to learn more about grep and sed, I recommend searching for them online.

For the commands below, make sure that you have your terminal open, and you have changed to the correct directory where the dictionary is located.

How to do advanced searches with grep on CC-CEDICT

In general, if you want to search for a character whose traditional form is the same as its simplified form, you can use a grep command like the following:

grep -m 1 "^搜 " cc-cedict_2016_08_12.txt
The above command searches for the character 搜 and only matches a line where it is the whole first entry (in other words, the entry for the character by itself). The -m 1 option tells grep to stop after the first matching line. We need the space after the character because there could be vocabulary starting with that character.

If you know that the character is a simplified version of another character, you must do a somewhat different search (if you want a unique search output).
If you wanted to search for the character 国 and tried a similar search as above using
grep -m 1 "^国" cc-cedict_2016_08_12.txt
it would output nothing, because the format of the dictionary always has the traditional character first.

(Note that in grep, \s matches a whitespace character such as a space. You could also type a plain space with your space bar.)
So, to search only for the character 国, you can do the following:
grep -m 1 "^\s\s国" cc-cedict_2016_08_12.txt
or
grep -m 1 "^  国" cc-cedict_2016_08_12.txt

To search only for a specific word with at least 2 Traditional Characters, such as 中國, you can do
grep -m 1 "^中國" cc-cedict_2016_08_12.txt

To search only for a specific word with at least 2 Simplified Characters, such as 中国, you can do
grep -m 1 "^\s\s\s中国" cc-cedict_2016_08_12.txt
or
grep -m 1 "^   中国" cc-cedict_2016_08_12.txt
(Note that there are 3 spaces here; remember that \s is how you write a space in grep. If the word consisted of 3 characters, you would put 4 spaces. If the word consisted of 5 characters, you would put 6 spaces, and so on.)
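If counting characters becomes tedious, one alternative is to match on the second field directly. The following is only a sketch: it assumes each line has the layout "Traditional Simplified [pinyin] /definition/" (the layout of the upstream CC-CEDICT release, which may not be the layout of your file), and lookup_simplified is a hypothetical helper name, not something that ships with the dictionary.

```shell
# lookup_simplified WORD FILE  (hypothetical helper)
# Prints the first line whose second space-separated field is exactly WORD.
lookup_simplified() {
  # ^[^ ]* skips the traditional field, whatever its length.
  # The spaces on both sides of the word keep 中国 from matching 中国人.
  grep -m 1 "^[^ ]* $1 " "$2"
}

# A small sample file in the assumed layout, for illustration only.
sample="$(mktemp)"
printf '%s\n' \
  '中國人 中国人 [Zhong1 guo2 ren2] /Chinese person/' \
  '中國 中国 [Zhong1 guo2] /China/' \
  '國 国 [guo2] /country/' > "$sample"

lookup_simplified 中国 "$sample"   # prints the 中國 中国 line
lookup_simplified 国 "$sample"     # prints the 國 国 line
```

Because the pattern skips the whole first field, the same function works for words of any length.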

If you are searching for a word using Traditional Characters and you don’t care where in the line it appears, you can do something like the following:
grep "中國" cc-cedict_2016_08_12.txt
The above command searches for the word 中國 anywhere in any line.

If you are searching for a word using Simplified Characters and you don’t care where in the line it appears, you can do something like the following:
grep "中国" cc-cedict_2016_08_12.txt
The above command searches for the word 中国 anywhere in any line.
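An unanchored search like this can print a great many lines on the full dictionary. As a rough sketch, you can count the matches first with -c, then print only the first few with -m, adding -n to show line numbers. The sample file and its four lines below are made up for illustration and assume the "Traditional Simplified [pinyin] /definition/" layout.

```shell
# A small sample file in the assumed layout, for illustration only.
sample="$(mktemp)"
printf '%s\n' \
  '中國 中国 [Zhong1 guo2] /China/' \
  '中國人 中国人 [Zhong1 guo2 ren2] /Chinese person/' \
  '搜 搜 [sou1] /to search/' \
  '新中國 新中国 [Xin1 Zhong1 guo2] /New China/' > "$sample"

# Count how many lines contain 中国 before printing anything.
grep -c "中国" "$sample"        # prints 3

# Print only the first 2 matching lines, each prefixed with its line number.
grep -n -m 2 "中国" "$sample"
```

On the real file, replace "$sample" with the name of your dictionary file.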

You can find grep documentation for Mac OS here.

If you use Windows, you can download grep for Windows at the link below
Grep For Windows

How to do advanced searches with sed on CC-CEDICT

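The following commands give exactly the same outputs as the grep commands above. They are just another way to do the same searches. In each command, -n stops sed from printing every line, p prints the matching line, and q quits after the first match.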

sed -n '/^搜 / {p;q}' cc-cedict_2016_08_12.txt
The above command searches for the character 搜 and only matches a line where it is the whole first entry in the line (in other words, the entry for the character by itself).

If you know that the character is a simplified version of another character, you must do a somewhat different search (if you want a unique search output).
If you wanted to search for the character 国, the same kind of search as above would output nothing, because the format of the dictionary always has the traditional character first.

(Note that in sed, \s matches a whitespace character such as a space. You could also type a plain space with your space bar.)
So, to search only for the character 国, you can do the following:
sed -n '/^\s\s国/ {p;q}' cc-cedict_2016_08_12.txt
or
sed -n '/^  国/ {p;q}' cc-cedict_2016_08_12.txt

To search only for a specific word with at least 2 Traditional Characters, such as 中國, you can do
sed -n '/^中國/ {p;q}' cc-cedict_2016_08_12.txt

To search only for a specific word with at least 2 Simplified Characters, such as 中国, you can do
sed -n '/^   中国/ {p;q}' cc-cedict_2016_08_12.txt
(Note that there are 3 spaces here; you could also write each one as \s. If the word consisted of 3 characters, you would put 4 spaces. If the word consisted of 5 characters, you would put 6 spaces, and so on.)

If you are searching for a word using Traditional Characters and you don’t care where in the line it appears, you can do something like the following:
sed -n '/中國/ p' cc-cedict_2016_08_12.txt
The above command searches for the word 中國 anywhere in any line.

If you are searching for a word using Simplified Characters and you don’t care where in the line it appears, you can do something like the following:
sed -n '/中国/ p' cc-cedict_2016_08_12.txt
The above command searches for the word 中国 anywhere in any line.
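If you run these sed searches often, you could wrap the print-and-quit idiom in a small shell function. This is only a sketch: first_match is a hypothetical name, and the sample lines assume the "Traditional Simplified [pinyin] /definition/" layout. The pattern is pasted straight into the sed script, so it must not contain a / character.

```shell
# first_match PATTERN FILE  (hypothetical helper)
# Prints the first line matching PATTERN, then stops reading the file.
first_match() {
  # -n suppresses automatic printing; p prints the match; q quits.
  # The semicolon before } is accepted by GNU sed and expected by BSD sed.
  sed -n "/$1/{p;q;}" "$2"
}

# A small sample file in the assumed layout, for illustration only.
sample="$(mktemp)"
printf '%s\n' \
  '中國人 中国人 [Zhong1 guo2 ren2] /Chinese person/' \
  '中國 中国 [Zhong1 guo2] /China/' \
  '搜 搜 [sou1] /to search/' > "$sample"

first_match "^中國 " "$sample"   # anchored: prints the entry for 中國 itself
first_match "中国" "$sample"     # unanchored: prints the first line containing 中国
```

For any given pattern, this prints the same line as grep -m 1 with that pattern.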

If you use Windows, you can download sed for Windows at the link below
Sed for Windows

For Mac OS, the steps for searching should be almost identical to the steps I have described for Linux. You just open your terminal and use the appropriate commands. One difference to watch for: the sed that ships with Mac OS generally expects a semicolon before the closing brace, so write {p;q;} instead of {p;q} there.
If you are unsure about how to use the Mac terminal, you can read about it here. Furthermore, here is a link to documentation for sed in Mac OS.
