As mentioned earlier, when htdig is run with the -t option it produces two ASCII files,
db.docs and db.wordlist. The db.docs file contains a series of lines, each of them
relating to an indexed document. These lines carry information such as the document
URL, title, modification time, size, meta description, excerpt, and other useful
attributes of the document. The db.wordlist file has a line for each word found in the
indexed documents. Each line in this file contains a word, followed by the ID of the
document where it was found, its location in the document, its weight, the number of
occurrences, etc.
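As a simple illustration of how these dumps can be consumed, the one-liner below lists every word recorded for a given document. It assumes whitespace-separated fields with the word in the first column and the document ID in the second, following the description above; the exact field layout may vary between htdig versions, and the document ID 42 is only an example.

   awk '$2 == 42 { print $1 }' db.wordlist | sort -u   # words indexed for document ID 42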
The default configuration file for htdig is htdig.conf. That is where all
configuration options for htdig (and for the other tools, if they are used) are set. Since all
of its parameters will probably be left at their default values, the file's content will
not be reproduced in this paper. Initially, the security sites indexed by SisBrAV's htdig will
be the ones listed in items (5), (7), (8), (9) and (10) of the References section. The
number of sites can (and will) be expanded considerably, but at first
only these five sites were chosen. The way htdig indexes each site will be almost the
same: the only parameter that differs from one site to another is the number of
hops from the starting URL. For example, if maxhops is set to 2, htdig will index the
starting URL, then follow all the links in that page and index the documents it finds,
then follow the links in those documents and index them as well, and finally stop the
digging process. Since each site organizes and displays its documents in its own way,
the number of hops necessary to gather all relevant vulnerability information will
vary from site to site.
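For illustration, a minimal htdig.conf might contain little more than the fragment below. The attribute names follow the standard ht://Dig configuration syntax; the path, the URL and the hop count shown are placeholders, not SisBrAV's actual settings.

   database_dir:   /var/lib/htdig/db
   start_url:      http://www.example-security-site.org/advisories/
   limit_urls_to:  ${start_url}
   max_hop_count:  2

Since the hop limit differs from site to site, in SisBrAV this value is not fixed in a shared configuration file but supplied for each run by the script described next.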
To address this, a simple UNIX bash script will be used to read a file whose lines each
contain a URL and a number (the maximum number of hops from that initial URL),
separated by a TAB. The script will produce a different htdig command for each site,
according to the maximum number of hops defined. The maximum number of hops
for each site is defined by the SisBrAV administrators, who inspect the sites and
determine how many levels the spider has to crawl down in order to obtain the
largest possible amount of relevant vulnerability information while picking up as
little unnecessary content as possible.
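A minimal sketch of such a script is shown below. It assumes a sites.txt file in the format just described and a shared configuration file htdig-common.conf; these file names and paths, as well as the use of htdig's -h (hop count) command-line option, are illustrative assumptions rather than SisBrAV's actual implementation.

   #!/bin/bash
   # Launch one htdig run per security site, honouring each site's hop limit.
   # Expected input (sites.txt): one "<start URL><TAB><max hops>" pair per line.
   COMMON_CONF=/etc/htdig/htdig-common.conf
   while IFS=$'\t' read -r url hops; do
       [ -z "$url" ] && continue                # skip blank lines
       conf=$(mktemp /tmp/htdig.XXXXXX)         # per-site configuration file
       cat "$COMMON_CONF"      >  "$conf"       # shared attributes (database_dir, etc.)
       echo "start_url: $url"  >> "$conf"       # site-specific starting URL
       htdig -t -c "$conf" -h "$hops"           # -h limits hops from the start URL
       rm -f "$conf"
   done < sites.txt

All runs share the same database directory, so the indexed data from the different sites accumulates in a single set of htdig databases.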
Htdig also generates several files in the Berkeley DB format. These files will then be
analyzed by the IPS Module, as explained in the next section.
3.2 Interpreter, Parser and Sorter
The IPS Module will probably be written in Java. It will use a heuristic algorithm to
perform content analysis of the data stored in the Berkeley DB files created by
htdig, in order to feed the Central Database with accurate vulnerability information.
The data is parsed and the vulnerabilities are grouped into previously defined,
hierarchically organized classes.
First, the IPS program performs the sorting process. It analyzes all
the entries in the database, looking for ambiguous or duplicated information about the same
vulnerability. Then it parses the content of the information in order to group the
vulnerability entries into classes according to their main aspects: remote/local, type,
low/medium/high importance score, etc. It also determines the systems and services in
which each vulnerability occurs. If there is more than one entry for the same
vulnerability, it correlates all the information found in those entries to make sure the
attributes are set as precisely as possible. For example, if a given vulnerability is