CS W1001 Introduction to Computer Science
Homework 3 (11 points)
Unix, Database Queries (and a little Java)
Due Date:
Tuesday, March 21.
For this assignment, hardcopies (only) of the first Part are required,
and both hardcopies and electronic submissions of the second and third
Parts are required. The course web page provides directions for
electronic submission. Hardcopies are due in class, as usual.
Reading:
"An Invitation to Computer Science," Section 5.4; page 233 and
Sections 6.1, 6.2, 6.4 (pages 268-284 of that section are optional),
and 6.5; and page 291 and Sections 7.1 and 7.2 (but not 7.2.2,
and don't worry about understanding the assembly language instructions
-- assembly language is just a slightly friendlier version of the
notoriously technical machine language).
Read Sections 8.4.1 and 11.3 about the Structured Query Language (SQL)
and Database queries.
Also, read the first 2 pages of Chapter 8 about why there is no single
high-level programming language that is used universally. By the way, although
the text uses C++ and we will instead use Java, the two look very
similar for the limited things we'll do with Java
this semester. Look through Chapter 7 and Section 8.2 if you are
interested in seeing information about other high level languages.
They are all very similar in that when you are programming, they each
need you to type in about the same level of detail as you would
specify for a "pseudo-code" algorithm.
While you are reading you can always get some extra perspective
by looking at where you are within the table of contents. The
table of contents breaks down the organization of the text,
which in turn follows the levels of abstraction depicted
on the front cover!
Part I: Computer Organization Theory Questions (3 points)
Show your work or explain your answer to each of the following:
- Chapter 5, exercises 4,5,7, and 11.
- (This isn't a computer organization question, but I'm sticking
it here anyway.) Describe two different algorithms to do
the query below in Part II, question #1. Hint: The first method
can be very similar to the algorithm in Figure 2.9 (page 46), especially
after modifying it according to exercise 13 (page 62). The second method
corresponds to the algorithm the computer is following
when you use a Unix pipeline, as in Part II below.
Part II: Database Queries with Unix (5 points)
The database you will use is in ~es66/W1001/films/. In that directory
you will find a database of films ("films.txt") and a "README" file
that lists the fields given for each film. Copy the database of films
(and the README) into a directory of your own:
cp ~es66/W1001/films/* .
If you want to see a more complete
on-line database of movies check out the "Internet Movie Database" at
www.imdb.com.
You will do queries on this database. We will describe the queries
abstractly using the Structured Query Language (SQL) as in the reading
above. SQL is often the way a person describes a query to a machine.
But, for this assignment, to actually make the queries happen
automatically with a computer, you will use the versatility of the
Unix Operating System. In particular, you will create a "Unix
pipeline", which starts with "cat films.txt" and then pipes that
to another Unix command. For example, with the Unix pipeline "cat
films.txt | grep 'Robert De Niro'" you will enact the following query:
SELECT *
FROM films.txt
WHERE actors INCLUDES "Robert De Niro"
This is because "grep" will output all lines that have the given
string of characters (characters means letters of the alphabet or
numerals, not parts in a movie!). In this case 'Robert De Niro' is
the character string. I put this string in single quotes since it
includes spaces; you have to do that too (double quotes also
work). Now, if De Niro had directed a movie, that movie would also appear
since "grep" just looks anywhere on the whole line. But for the sake
of this assignment you can assume that no actor directs and no
director acts (which I think is true for our small database anyway).
However, we all know that De Niro's directorial debut, "A Bronx Tale",
was smashing!
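To see the "anywhere on the whole line" behavior for yourself, here is a small sketch you can paste into a Unix window. The three rows and their field order are invented for illustration; they are not taken from films.txt, whose real layout is described in the README.

```shell
# Hypothetical sample rows; the field order here is an assumption,
# not the real films.txt layout (check the README for that).
dir=$(mktemp -d)
cat > "$dir/sample.txt" <<'EOF'
Sample Film One: Director A: Robert De Niro, Actor B: R : 3 stars
Sample Film Two: Robert De Niro: Actor C: R : 2 stars
Sample Film Three: Director D: Actor E: PG : 3 stars
EOF

# grep keeps every line that contains the string, wherever it appears.
matches=$(cat "$dir/sample.txt" | grep 'Robert De Niro')
echo "$matches"

# Remove the scratch directory.
rm -r "$dir"
```

Both of the first two rows come out, even though the name sits in a different field in each, because "grep" does not know anything about fields.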
Ok, for the next example let's do a fancier query:
SELECT title
FROM films.txt
WHERE rating = "R" AND grade = "3 stars"
The Unix pipeline for this is
cat films.txt | grep ' R ' | grep '3 stars' | cut -d':' -f1
Note that the "AND" is done by piping the output
of one "grep" into the input of another "grep". Each "grep" passes along
only the lines that match its pattern. Also note that ' R ' has spaces
on either side of it. This is because otherwise "grep" will
match any occurrence of "R", even as part of a bigger word like "Ronin". The
last command, "cut", pulls the "title" field from each line, ignoring
the rest of the line (try out the pipeline without the last command).
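The effect of the spaces around "R" is easy to check on a tiny file. The sketch below uses three invented rows (the titles, names, and field order are assumptions made for this illustration, not the contents of films.txt) and runs the pipeline twice, once with ' R ' and once with a bare 'R'.

```shell
# Hypothetical sample rows; only "title is the first ':' field" comes
# from the assignment, the rest of the layout is an assumption.
dir=$(mktemp -d)
cat > "$dir/sample.txt" <<'EOF'
Robot Picnic: Director A: Actor B: PG : 3 stars
Sample Film Two: Director C: Actor D: R : 3 stars
Sample Film Three: Director E: Actor F: R : 2 stars
EOF

# AND: the second grep only sees the lines that got past the first.
titles=$(cat "$dir/sample.txt" | grep ' R ' | grep '3 stars' | cut -d':' -f1)
echo "$titles"

# Without the spaces, 'R' also matches the "R" in "Robot".
loose=$(cat "$dir/sample.txt" | grep 'R' | grep '3 stars' | cut -d':' -f1)
echo "$loose"

# Remove the scratch directory.
rm -r "$dir"
```

The first pipeline prints only "Sample Film Two". The second one also lets "Robot Picnic" through, even though that row is rated PG.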
Make use of the guide of common Unix filters (hard copy handed out to
you in class) to do the following problems. Note that you will not
necessarily need to use everything on that guide. For each problem,
you must submit the Unix pipeline and the results of the query. You
can put this all neatly into a text file that you edit with Pico (or
Emacs) -- that would be a good way since you can then print out this
file for hardcopy submission, and electronically submit it. It is
pretty easy to cut and paste from your Unix window to another window
that has Pico running in it. Using Pico is a nice skill; you'll be
editing Java source code with it for the last two assignments of this
semester.
- Create a Unix pipeline to do the following query:
SELECT director
FROM films.txt
WHERE rating = "R" AND grade = "3 stars"
This one differs from the example above in that it asks for director, not title.
- Create a Unix pipeline to do the following query:
SELECT title, time
FROM films.txt
WHERE rating = "PG" AND grade = "2 stars"
Note: put spaces around "PG" in the grep line in order to avoid
also picking out PG-13 movies.
- I want to select a movie that I can bring a child to see. How
can I create a Unix pipeline to do the following SQL query:
SELECT title
FROM films.txt
WHERE rating <> "R" (i.e. the rating does not equal R)
This requires doing a "NOT". Check out the "-v" option of "grep" on
the attached page. Also, remember to put spaces around the "R" in
the grep line so it doesn't find occurrences of "R" that are
part of a bigger word. For this one, do not put the results
of the query in the file you will submit. Instead, create
a separate file with the results (put "> kidmovies" at the end
of the pipeline, but be careful -- it erases the file "kidmovies" or
whatever name you choose, if such a file already exists).
- Create a Unix pipeline to compute the number of movies that are
not rated. The final output must be a number, not a list of movies.
Hint: "wc". Another hint: since some movies are "not rated" and
others are "NOT RATED" or "Not Rated", use the -i option of grep to
have it ignore capitalization.
- Create a Unix pipeline to show the number of movies with
each grade (this is called a histogram). The grade
of each movie is the number of stars, from 1-4. "NA stars" means
"not available".
Hint: use "cut" to get
the grade, then "uniq" with the "-c" option. Whoops, "uniq"
expects its input to be sorted, so pipe it through "sort" first.
- Use Unix to do the following query:
SELECT *
FROM films.txt
WHERE director = "Mimi Leder" OR actors INCLUDES "Kate Capshaw"
This one has an "OR", which has to be handled differently than "AND". How
can you do this? Hint: you may need more than one pipeline to find
the comprehensive answer to this SQL query.
- Design one or more Unix pipelines to find out which rating (i.e.,
"R", "PG", etc.) gets the best and worst grades (i.e. "1 stars", "2.5
stars", etc.). Show the results and draw conclusions with a couple of
English sentences. Perhaps "R" has better movies 'cause they can show
exciting stuff. But perhaps "R" has worse movies since the film-maker
can rely on superficial garbage to sell her/his movie. There is
more than one way to do this problem -- it is open-ended. But
you should find out how many movies of each rating get
high or low grades, somehow.
- Think of a query that interests you. Show the SQL version of this
query, the Unix pipeline to make it happen, and the results of posing
the query to our database of films. Some queries may be difficult to
do with Unix pipelines - if you get stuck, change your query. If the
output of the query is long, put it in a different file and submit it
electronically only.
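If you want to get a feel for the filters named in the hints above before pointing them at films.txt, you can try them on a small unrelated file first. The sketch below is only a warm-up: the file "pets.txt" and its name:kind layout are made up for this illustration and have nothing to do with the film database.

```shell
# Hypothetical practice file, one "name:kind" entry per line.
dir=$(mktemp -d)
cat > "$dir/pets.txt" <<'EOF'
Rex:dog
Tom:cat
Fido:Dog
Felix:cat
Polly:parrot
EOF

# NOT: -v keeps the lines that do NOT match; > saves them in a file
# (and overwrites that file if it already exists).
cat "$dir/pets.txt" | grep -v 'cat' > "$dir/notcats"
notcats=$(cat "$dir/notcats")

# Counting: -i ignores capitalization, wc -l counts the lines left over.
dogs=$(cat "$dir/pets.txt" | grep -i 'dog' | wc -l)

# Histogram: uniq -c only merges neighboring duplicates, so sort first.
hist=$(cat "$dir/pets.txt" | cut -d':' -f2 | sort | uniq -c)

echo "$notcats"
echo "$dogs"
echo "$hist"

# Remove the scratch directory.
rm -r "$dir"
```

Notice in the histogram that "dog" and "Dog" are counted separately: "uniq" pays attention to capitalization, just as "grep" does without the -i option.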
Part III: Your First Java Program (3 points)
For this first jaunt with Java, you will copy a Java source
code file we have already created for you. You will try
it out, and then make a minor modification, and then
try it again.
- Use mkdir to create a new subdirectory for this
part of this assignment.
- Copy the file ~es66/W1001/src/Inchtocm.java to your own directory.
(Ignore the other files in that src directory until otherwise instructed --
they are not really ready yet.)
- Use "cat" or "pico" to look at the file. This file
is Java source code.
- "javac Inchtocm.java" will compile this program.
- "ls" will show you there is a new file, Inchtocm.class.
This is the Java bytecode that can be executed by the Java
emulator (officially called the Java Virtual Machine).
- "java Inchtocm 3" will execute it with the emulator. Be
sure to include the "3". Now try it with a number
other than 3. This number is the input to the program. If
you try it without a number, it will give you a nasty
error message.
- WARNING: Remember, "javac" is to compile, and "java" is to
execute -- a difference of one letter only! This is easy to confuse -
I do it all the time.
- Now, edit and compile Inchtocm.java (saving it in Cmtoinch.java)
so that it performs the opposite conversion. To get this to work, you
only need to change the formula. However, only changing
the formula means the "inches" variable will have the length
in centimeters, and the "cms" variable will have the length
in inches. Therefore, to do a complete job, you also
need to change each "cms" to "inches" and vice versa.
- Take notice: since you are now working with a file
of java source named "Cmtoinch.java", you must change
each occurrence of "Inchtocm" in the file to "Cmtoinch", or
it will not compile without an error. Be careful: capitalization
matters.
- Note that when you try it out you still have to put
a number on the command line to provide input. However, in
this case, the number is the length in centimeters, not
in inches.
- You must make sure your new version actually works.
It must compile with no error messages (if you make a mistake
when you edit it, even a minor mistake such as deleting
a semicolon, the compiler will give you an error message
when you try to compile), and it must work when you try it out.
This way you will know you did it correctly -- it is as if the computer
checks your work for you.
- When you submit, only submit your new ".java" file (the Java
source code). DO NOT submit any ".class" files (Java bytecode) --
these take up much more space than the source code files, and your TA
can create the ".class" file herself if she needs to by using the Java
compiler. This guideline applies to all Java programming assignments
in the remaining homeworks this semester.
Thanks to Andrew Kosoresow and Michael Grossberg for help developing
part III above.
email: evs at cs dot columbia dot edu