
<Chapter Label="hash">
<Heading>Hashing techniques</Heading>

<Section Label="hashidea">
<Heading>The idea of hashing</Heading>

If one wants to store
a certain set of similar objects and wants to quickly access a given
one (or come back with the result that it is unknown), the first idea
would be to store them in a list, possibly sorted for faster access.
This, however, would still need about <M>\log(n)</M> comparisons to find
a given element or to decide that it is not yet stored.


Therefore one uses a much bigger array together with a function on the
space of possible objects with integer values to decide where in the
array to store a certain object. If this so-called hash function
distributes the actually stored objects well enough over the array,
the access time is constant on average. Of course, a hash function
will usually not be injective, so one needs a strategy for what to do
in case of a so-called <Q>collision</Q>, that is, if more than one
object with the same hash value has to be stored. This package
provides two ways to deal with collisions: one is implemented in the
so-called <Q>HashTabs</Q> and the other in the <Q>TreeHashTabs</Q>. The
former simply uses other parts of the array to store the data involved
in the collisions and the latter uses an AVL tree (see Chapter <Ref
 Chap="avl"/>) to store all data objects with the same hash value.
Both are used in basically the same way but sometimes behave a bit
differently.


The basic functions to work with hash tables are <Ref
 Oper="HTCreate"/>, <Ref Oper="HTAdd"/>, <Ref Oper="HTValue"/>,
<Ref Oper="HTDelete"/> and <Ref Oper="HTUpdate"/>. They are described
in Section <Ref Sect="hashtables"/>.
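
A typical session might look like the following minimal sketch. It assumes
that the <Package>orb</Package> package is loaded; the keys, the values and
the table length <C>1009</C> are chosen only for illustration.

<Listing Type="GAP">
<![CDATA[
LoadPackage("orb");

# Create a HashTab for lists of integers. The sample key only serves
# to select a suitable hash function.
ht := HTCreate([1, 2, 3], rec(hashlen := 1009));

HTAdd(ht, [1, 2, 3], "first");      # store a key together with a value
HTAdd(ht, [4, 5, 6], true);         # the value true needs no extra memory

HTValue(ht, [1, 2, 3]);             # the value stored with this key
HTValue(ht, [7, 8, 9]);             # fail, since this key is unknown
HTUpdate(ht, [1, 2, 3], "second");  # replace the stored value
]]>
</Listing>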
</Section>

<Section Label="hashfunc">
<Heading>Hash functions</Heading>

In the <Package>orb</Package> package hash functions are chosen
automatically by giving a sample object together with the length
of the hash table. This is done with the following operation:

<ManSection>
<Oper Name="ChooseHashFunction" Arg="ob, len"/>
<Returns> a record </Returns>
<Description>
The first argument <A>ob</A> must be a sample object, that is, an object
like those we want to store in the hash table later on. The argument
<A>len</A> is an integer that gives the length of the hash table. Note
that this might be called later on automatically, when a hash table
is increased in size. The operation returns a record with two components.
The component <C>func</C> is a &GAP; function taking two arguments, see
below. The component <C>data</C> is some &GAP; object. Later on, the
hash function will be called with two arguments: the first is the
object for which it should compute the hash value and the second
must be the data stored in the <C>data</C> component.

The hash function has to return values between <M>1</M> and the
hash length <A>len</A> inclusive.

This setup is chosen such that the hash functions can be global
objects that are not created during the execution of <Ref
Oper="ChooseHashFunction"/> but still can change their behaviour
depending on the data.
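
As an illustration, the returned record could be used directly as in the
following sketch. The sample list and the length <C>1009</C> are arbitrary
choices, and the variable names are only for this example.

<Listing Type="GAP">
<![CDATA[
LoadPackage("orb");

sample := [1, 2, 3];
hfrec := ChooseHashFunction(sample, 1009);

# The hash function takes the object and the data component.
h := hfrec.func(sample, hfrec.data);

# According to the description above, h lies between 1 and 1009.
h in [1 .. 1009];
]]>
</Listing>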
</Description>
</ManSection>

In the following we just document for which types of objects there
are hash functions that can be found using <Ref
Oper="ChooseHashFunction"/>.

<ManSection>
<Meth Name="ChooseHashFunction" Arg="ob, len" Label="gf2vec"/>
<Returns> a record </Returns>
<Description>
This method is for compressed vectors over the field <C>GF(2)</C> of
two elements. Note that there is no hash function for non-compressed
vectors over <C>GF(2)</C> because those objects cannot efficiently
be recognised from their type.

Note that you can only use the resulting hash functions for vectors
of the same length.
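
The following sketch shows one way to obtain compressed vectors before a
hash function is chosen. It uses the &GAP; library function
<C>ConvertToVectorRep</C>; the vectors and the length <C>1009</C> are
arbitrary examples.

<Listing Type="GAP">
<![CDATA[
LoadPackage("orb");

v := [Z(2), 0*Z(2), Z(2), Z(2)];
ConvertToVectorRep(v, 2);           # compress v in place over GF(2)

hfrec := ChooseHashFunction(v, 1009);

# Another compressed vector of the same length can be hashed with
# the same function and data.
w := [0*Z(2), Z(2), Z(2), 0*Z(2)];
ConvertToVectorRep(w, 2);
hw := hfrec.func(w, hfrec.data);
]]>
</Listing>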
</Description>
</ManSection>

<ManSection>
<Meth Name="ChooseHashFunction" Arg="ob, len" Label="8bitvec"/>
<Returns> a record </Returns>
<Description>
This method is for compressed vectors over a finite field with up
to <M>256</M> elements. Note that there is no hash function for
such non-compressed vectors because those objects cannot efficiently be
recognised from their type.

Note that you can only use the resulting hash functions for vectors
of the same length.
</Description>
</ManSection>

<ManSection>
<Meth Name="ChooseHashFunction" Arg="ob, len" Label="gf2mat"/>
<Returns> a record </Returns>
<Description>
This method is for compressed matrices over the field <C>GF(2)</C> of
two elements. Note that there is no hash function for non-compressed
matrices over <C>GF(2)</C> because those objects cannot efficiently
be recognised from their type.

Note that you can only use the resulting hash functions for matrices
of the same size.
</Description>
</ManSection>

<ManSection>
<Meth Name="ChooseHashFunction" Arg="ob, len" Label="8bitmat"/>
<Returns> a record </Returns>
<Description>
This method is for compressed matrices over a finite field with up
to <M>256</M> elements. Note that there is no hash function for
such non-compressed matrices because those objects cannot efficiently be
recognised from their type.

Note that you can only use the resulting hash functions for matrices
of the same size.
</Description>
</ManSection>

<ManSection>
<Meth Name="ChooseHashFunction" Arg="ob, len" Label="int"/>
<Returns> a record </Returns>
<Description>
This method is for integers.
</Description>
</ManSection>

<ManSection>
<Meth Name="ChooseHashFunction" Arg="ob, len" Label="perm"/>
<Returns> a record </Returns>
<Description>
This method is for permutations.
</Description>
</ManSection>

<ManSection>
<Meth Name="ChooseHashFunction" Arg="ob, len" Label="intlist"/>
<Returns> a record </Returns>
<Description>
This method is for lists of integers.
</Description>
</ManSection>

<ManSection>
<Meth Name="ChooseHashFunction" Arg="ob, len" Label="NBitsPcWord"/>
<Returns> a record </Returns>
<Description>
This method is for kernel Pc words.
</Description>
</ManSection>

<ManSection>
<Meth Name="ChooseHashFunction" Arg="ob, len" Label="IntLists"/>
<Returns> a record </Returns>
<Description>
This method is for lists of integers.
</Description>
</ManSection>

<ManSection>
<Meth Name="ChooseHashFunction" Arg="ob, len" Label="MatLists"/>
<Returns> a record </Returns>
<Description>
This method is for lists of matrices.
</Description>
</ManSection>

</Section>

<Section Label="hashtables">
 <Heading>Using hash tables</Heading>

<ManSection>
<Oper Name="HTCreate" Arg="sample [, opt]"/>
<Returns> a new hash table object </Returns>
<Description>
A new hash table for objects like <A>sample</A>
is created. The second argument <A>opt</A> is an optional options
record, which will be supplied in most cases, if only to specify the
length and type of the hash table to be used.
The following components in this record can be bound:

<List>
 <Mark><C>treehashsize</C></Mark>
 <Item>
If this component is bound the type of the hash table is a
TreeHashTab.
The value must be a positive integer and will be the size of
the hash table. Note that for this type of hash table the keys to be
stored in the hash must be comparable using <M>&lt;</M>. A three-way
comparison function can be supplied using the component <C>cmpfunc</C>
(see below).
 </Item>
 <Mark><C>treehashtab</C></Mark>
 <Item>
If this component is bound the type of the hash table is a
TreeHashTab.
This option is superfluous if <C>treehashsize</C> is used.
 </Item>
 <Mark><C>forflatplainlists</C></Mark>
 <Item>
If this component is set to <K>true</K> then the user guarantees that
all the elements in the hash will be flat plain lists, that is, plain
lists with no subobjects. For example lists of immediate integers
will fulfill this requirement, but ranges do not.
In this case, a particularly good and
efficient hash function will automatically be taken and the
components <C>hashfunc</C>, <C>hfbig</C> and <C>hfdbig</C>
are ignored. Note that this cannot be automatically
detected because it depends not only on the sample point but also
potentially on all the other points to be stored in the hash table.
 </Item>
 <Mark><C>hf</C> and <C>hfd</C></Mark>
 <Item>
If these components are bound, they are used as the hash function. The
value of <C>hf</C> must be a function taking two arguments, the first
being the object for which the hash function shall be computed and the
second being the value of <C>hfd</C>. The returned value must be an
integer in the range from <M>1</M> to the length of the hash.
If either of these components is not bound, an automatic choice
of the hash function is made using <Ref Oper="ChooseHashFunction"/>
and the supplied sample object <A>sample</A>.

Note that if you specify these two components and are using a HashTab
table then this table cannot grow unless you also bind the
components <C>hfbig</C>, <C>hfdbig</C> and <C>cangrow</C>.
 </Item>
 <Mark><C>cmpfunc</C></Mark>
 <Item>
This component can be bound to a three-way comparison function taking
two arguments <A>a</A> and <A>b</A> (which will be keys for the
TreeHashTab) and returning <M>-1</M> if <M><A>a</A> &lt; <A>b</A></M>,
<M>0</M> if <M><A>a</A> = <A>b</A></M> and <M>1</M> if <M><A>a</A>
 &gt; <A>b</A></M>. If this component is not bound the function
<Ref Func="AVLCmp"/> is taken, which simply calls the generic
operations <C>&lt;</C> and <C>=</C> to do the job.
 </Item>
 <Mark><C>hashlen</C></Mark>
 <Item>
If this component is bound the type of the hash table is a standard
HashTab table. That is, collisions are dealt with by storing
additional entries in other slots. This is the traditional way to
implement a hash table. Note that currently deleting entries in such a
hash table is not implemented, since it could only be done by leaving
a <Q>deleted</Q> mark which could pollute the hash table. Use
TreeHashTabs
instead if you need deletion. The value bound to
<C>hashlen</C> must be a positive integer and will be the initial
length of the hash table.

Note that it is a good idea to choose a prime number as
the hash length due to the algorithm for collision handling which
works particularly well in that case. The hash function is chosen
automatically.
 </Item>
 <Mark><C>hashtab</C></Mark>
 <Item>
If this component is bound the type of the hash table is a standard
HashTab table. This component is superfluous if <C>hashlen</C> is
bound.
 </Item>
 <Mark><C>eqf</C></Mark>
 <Item>
For HashTab tables the function taking two arguments bound to this
component is used to compare keys in the hash table. If this component
is not bound the usual <C>=</C> operation is taken.
</Item>
<Mark><C>hfbig</C>, <C>hfdbig</C> and <C>cangrow</C></Mark>
 <Item>
 If you have used the components <C>hf</C> and <C>hfd</C> then
 your hash table cannot automatically grow when it fills up.
 This is because the length of the table is built into the hash
 function. If you still want your hash table to be able to grow
 automatically, then bind a hash function returning arbitrary
 integers to <C>hfbig</C>, the corresponding data for the
 second argument to <C>hfdbig</C> and bind <C>cangrow</C> to
 <K>true</K>. Then the hash table will automatically grow and
 take this new hash function modulo the new length of the hash
 table as hash function.
 </Item>
</List>
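
The options described above could be combined as in the following sketch.
The sample keys and the length <C>1009</C> are arbitrary, and
<C>MyHashFunc</C> is a hypothetical user-supplied hash function that is not
part of the package.

<Listing Type="GAP">
<![CDATA[
LoadPackage("orb");

# Standard HashTab with an automatically chosen hash function and a
# prime initial length.
ht1 := HTCreate([1, 2, 3], rec(hashlen := 1009));

# TreeHashTab of size 1009; the keys must be comparable using <.
ht2 := HTCreate([1, 2, 3], rec(treehashsize := 1009));

# HashTab for integers with a user-supplied hash function. MyHashFunc
# is hypothetical; it maps an integer into the range from 1 to data.
MyHashFunc := function(x, data)
  return (x mod data) + 1;
end;
ht3 := HTCreate(17, rec(hashlen := 1009, hf := MyHashFunc, hfd := 1009));
]]>
</Listing>

Since <C>ht3</C> binds <C>hf</C> and <C>hfd</C> without <C>hfbig</C>,
<C>hfdbig</C> and <C>cangrow</C>, it is a table of fixed length.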
</Description>
</ManSection>

<ManSection>
<Oper Name="HTAdd" Arg="ht, key, value"/>
<Returns> an integer or <K>fail</K> </Returns>
<Description>
Stores the object <A>key</A> into the hash table <A>ht</A> and stores
the value <A>value</A> together with <A>key</A>. The result is <K>fail</K>
if an error occurred, which can be that an object equal to <A>key</A> is
already stored in the hash table or that the hash table is already full.
The latter can only happen if the hash table is not a TreeHashTab and
cannot grow automatically.

If no error occurs, the result is an integer indicating the place in the
hash table where the object is stored. Note that this position may
change once the hash table grows automatically.

If the value <A>value</A> is <K>true</K> for all objects in the hash,
no extra memory is used for the values. All other values are stored
in the hash. The value <K>fail</K> cannot be stored as it indicates
that the object is not found in the hash.

See Section <Ref Sect="hashdata"/> for details on the data structures and
especially about memory requirements.
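
A short sketch of the behaviour described above, using integer keys and an
arbitrary table length of <C>101</C>:

<Listing Type="GAP">
<![CDATA[
LoadPackage("orb");

ht := HTCreate(1, rec(hashlen := 101));

pos := HTAdd(ht, 42, "answer");   # an integer position in the table
HTAdd(ht, 7, true);               # no extra memory for the value true
dup := HTAdd(ht, 42, "again");    # fail is expected: 42 is already stored

HTValue(ht, 42);                  # the value stored first
]]>
</Listing>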
</Description>
</ManSection>

<ManSection>
<Oper Name="HTValue" Arg="ht, key"/>
<Returns> <K>fail</K> or <K>true</K> or a value</Returns>
<Description>
Looks up the object <A>key</A> in the hash table <A>ht</A>. If the object
is not found, <K>fail</K> is returned. Otherwise, the value stored with
the object is returned. Note that if this value was <K>true</K>
no extra memory is used for this.
</Description>
</ManSection>

<ManSection>
<Oper Name="HTUpdate" Arg="ht, key, value"/>
<Returns> <K>fail</K> or <K>true</K> or a value</Returns>
<Description>
 The object <A>key</A> must already be stored in the hash table
 <A>ht</A>,
 otherwise this operation returns <K>fail</K>. The value stored
 with <A>key</A> in the hash is replaced by <A>value</A> and the
 previously stored value is returned.
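
 For example, under the assumption of integer keys and a table length of
 <C>101</C>, an update could be sketched like this:

<Listing Type="GAP">
<![CDATA[
LoadPackage("orb");

ht := HTCreate(1, rec(hashlen := 101));
HTAdd(ht, 5, "old");

old := HTUpdate(ht, 5, "new");   # returns the previously stored value
HTValue(ht, 5);                  # now the new value

missing := HTUpdate(ht, 6, "x"); # fail, since 6 was never added
]]>
</Listing>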
</Description>
</ManSection>

<ManSection>
<Oper Name="HTDelete" Arg="ht, key"/>
<Returns> <K>fail</K> or <K>true</K> or a value</Returns>
<Description>
 The object <A>key</A> along with its stored value is removed from
 the hash table <A>ht</A>. Note that this currently only works for
 TreeHashTabs and not for HashTab tables. It is an error
 if <A>key</A> is not found in the hash table and <K>fail</K> is
 returned in this case.
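
 Since deletion is only available for TreeHashTabs, the following sketch
 creates one using <C>treehashsize</C>; the keys and the size <C>101</C>
 are arbitrary.

<Listing Type="GAP">
<![CDATA[
LoadPackage("orb");

ht := HTCreate(1, rec(treehashsize := 101));
HTAdd(ht, 5, "five");

HTDelete(ht, 5);             # removes the key together with its value
HTValue(ht, 5);              # fail, the key is gone
again := HTDelete(ht, 5);    # fail, since 5 is no longer stored
]]>
</Listing>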
</Description>
</ManSection>

<ManSection>
<Func Name="HTGrow" Arg="ht, ob"/>
<Returns> nothing </Returns>
<Description>
This is a more or less internal operation. It is called when the space
in a hash table becomes scarce. The first argument <A>ht</A> must be a
hash table object, the second a sample point.
The function increases the hash size
by a factor of 2. This makes it necessary to choose a new hash
function. Normally this is done with the usual
<C>ChooseHashFunction</C> method. However, one can bind the two
components <C>hfbig</C> and <C>hfdbig</C> in the options record
of <Ref Oper="HTCreate"/> to a function and a record
respectively and bind <C>cangrow</C> to <K>true</K>.
In that case, upon growing the hash, a new hash function
is created by taking the function <C>hfbig</C> together with
<C>hfdbig</C> as second data argument and reducing the resulting
integer modulo the hash length. In this way one can specify a hash
function suitable for all hash sizes by simply producing big enough
hash values.
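
The following sketch illustrates such a setup for integer keys. The
functions <C>SmallHash</C> and <C>BigHash</C> are hypothetical and not part
of the package. The wrapper at the end only illustrates the reduction
modulo the hash length; adding <M>1</M> to reach the range from <M>1</M> to
the hash length is an assumption of this sketch.

<Listing Type="GAP">
<![CDATA[
LoadPackage("orb");

# SmallHash is tied to the initial length, BigHash returns arbitrary
# integers and ignores its data argument.
SmallHash := function(x, data) return (x mod data) + 1; end;
BigHash := function(x, data) return x; end;

ht := HTCreate(1, rec(hashlen := 11,
                      hf := SmallHash, hfd := 11,
                      hfbig := BigHash, hfdbig := rec(),
                      cangrow := true));

# Storing more keys than the initial length requires the table to grow.
for i in [1 .. 100] do
  HTAdd(ht, i, true);
od;

# Illustration of a hash function derived from BigHash for length len.
ReducedHash := function(x, len)
  return (BigHash(x, rec()) mod len) + 1;
end;
]]>
</Listing>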
</Description>
</ManSection>

</Section>

<Section Label="hashdata">
<Heading> The data structures for hash tables </Heading>

A legacy hash table object is just a record with the following components:
<List>
<Mark><C>els</C></Mark>
<Item> A &GAP; list storing the elements. Its length can be as long
as the component <C>len</C> indicates but will only grow as necessary
when elements are stored in the hash.
</Item>
<Mark><C>vals</C></Mark>
<Item> A &GAP; list storing the corresponding values. If a value is
<K>true</K> nothing is stored here to save memory.
</Item>
<Mark><C>len</C></Mark>
<Item> Length of the hash table.</Item>
<Mark><C>nr</C></Mark>
<Item> Number of elements stored in the hash table.</Item>
<Mark><C>hf</C></Mark>
<Item> The hash function (value of the <C>func</C> component in the
record returned by <Ref Oper="ChooseHashFunction"/>). </Item>
<Mark><C>hfd</C></Mark>
<Item> The data for the second argument of the hash function (value of
the <C>data</C> component in the record returned by
<Ref Oper="ChooseHashFunction"/>). </Item>
<Mark><C>eqf</C></Mark>
<Item> A comparison function taking two arguments and returning
<K>true</K> for equality or <K>false</K> otherwise. </Item>
<Mark><C>collisions</C></Mark>
<Item> Number of collisions (see below). </Item>
<Mark><C>accesses</C></Mark>
<Item> Number of lookup or store accesses to the hash. </Item>
<Mark><C>cangrow</C></Mark>
<Item> A boolean value indicating whether the hash can grow automatically
or not.</Item>
<Mark><C>ishash</C></Mark>
<Item> Is <K>true</K> to indicate that this is a hash table record.</Item>
<Mark><C>hfbig</C> and <C>hfdbig</C></Mark>
<Item> Used for hash tables which need to be able to grow but where
 the user supplied the hash function. See <Ref
 Oper="HTCreate"/> for more details.</Item>
</List>

New-style HashTab objects are component objects with the same
components except that there is no component <C>ishash</C> since
these objects are recognised by their type.
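
As a sketch, the components of such a component object can be inspected
with the <C>!.</C> operator. The table below is an arbitrary example with
integer keys.

<Listing Type="GAP">
<![CDATA[
LoadPackage("orb");

ht := HTCreate(1, rec(hashlen := 101));
HTAdd(ht, 3, "three");
HTAdd(ht, 4, true);

ht!.len;          # length of the hash table
ht!.nr;           # number of elements stored so far
ht!.accesses;     # number of lookup or store accesses
ht!.collisions;   # number of collisions
ht!.cangrow;      # whether the table can grow automatically
]]>
</Listing>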


A TreeHashTab is very similar. It is a positional object with
basically the same components, except that <C>eqf</C> is replaced
by the three-way comparison function <C>cmpfunc</C>. Since
TreeHashTabs do not grow, the components <C>hfbig</C>, <C>hfdbig</C>
and <C>cangrow</C> are never bound. Each slot in the <C>els</C>
component is either unbound (empty), or bound to the only key stored
in the hash which has this hash value or, if there is more than one
key for that hash value, the slot is bound to an AVL tree containing
all such keys (and values).

<Subsection>
<Heading> Memory requirements </Heading>

Due to the data structure defined above the hash table
will need one machine word (<M>4</M> bytes on 32-bit machines and
<M>8</M> bytes on 64-bit machines) per possible entry in the hash if all values
corresponding to objects in the hash are <K>true</K> and two machine
words otherwise. This means that the memory requirement for the hash
itself is proportional to the hash table length and not to the number
of objects actually stored in the hash!

In addition one of course needs the memory to store
the objects themselves.

For TreeHashTabs there are additional memory requirements. As soon as
there is more than one key hashing to the same value, the memory for
an AVL tree object is needed in addition. An AVL tree object needs
about 10 machine words for the tree object and then another 4 machine
words for each entry stored in the tree. Note that for many collisions
this can be significantly more than for HashTab tables. However, the
advantage of TreeHashTabs is that even for a bad hash function the
performance is never worse than <M>\log(n)</M> for each operation where
<M>n</M> is the number of keys in the hash with the same hash value.
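
The figures above can be turned into a rough estimate. The two helper
functions in this sketch are hypothetical and only restate the arithmetic
from the text; they do not measure actual memory usage.

<Listing Type="GAP">
<![CDATA[
# Bytes needed for the table itself: one machine word per slot if all
# values are true, two machine words per slot otherwise.
HashTabMemoryEstimate := function(len, alltrue, wordbytes)
  if alltrue then
    return len * wordbytes;
  else
    return 2 * len * wordbytes;
  fi;
end;

# Approximate bytes for one AVL tree holding k colliding keys:
# about 10 words for the tree object plus 4 words per entry.
AVLTreeMemoryEstimate := function(k, wordbytes)
  return (10 + 4 * k) * wordbytes;
end;

HashTabMemoryEstimate(100003, true, 8);    # 800024
HashTabMemoryEstimate(100003, false, 8);   # 1600048
AVLTreeMemoryEstimate(3, 8);               # 176
]]>
</Listing>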
</Subsection>

<Subsection>
<Heading> Handling of collisions </Heading>

This subsection is only relevant for HashTab objects.

If two or more objects have the same hash value, the following is done:
If the hash value is coprime to the hash length, the hash value is
taken as <Q>the
increment</Q>, otherwise <M>1</M> is taken. The code to find the proper
place for an object just repeatedly adds the increment to the current
position modulo the hash length. Due to the choice of the increment this
will eventually try all
places in the hash table. Every such increment step is counted as
a collision in the <C>collisions</C> component in the hash table.
This algorithm explains why it is sensible to
choose a prime number as the length of a hash table.
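
The probing scheme can be sketched in plain &GAP; as follows. The function
<C>ProbeSequence</C> is a hypothetical illustration and not the code used
in the package. It lists the positions that would be tried for a hash
value <C>h</C> between <M>1</M> and the hash length <C>len</C>.

<Listing Type="GAP">
<![CDATA[
ProbeSequence := function(h, len)
  local inc, pos, seq, i;
  # The increment is the hash value if it is coprime to the length,
  # otherwise 1.
  if Gcd(h, len) = 1 then
    inc := h;
  else
    inc := 1;
  fi;
  seq := [];
  pos := h;
  for i in [1 .. len] do
    Add(seq, pos);
    # Add the increment modulo the length, keeping positions in 1..len.
    pos := pos + inc;
    if pos > len then
      pos := pos - len;
    fi;
  od;
  return seq;
end;

ProbeSequence(3, 7);   # [ 3, 6, 2, 5, 1, 4, 7 ]
ProbeSequence(4, 8);   # [ 4, 5, 6, 7, 8, 1, 2, 3 ]
]]>
</Listing>

In both calls every position occurs exactly once. With the prime length
<M>7</M> every hash value below <M>7</M> is coprime to the length and can
serve as increment, whereas for length <M>8</M> the hash value <M>4</M>
falls back to the increment <M>1</M>.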
</Subsection>

<Subsection>
<Heading> Efficiency </Heading>

Hashing is efficient as long as there are not too many collisions.
It is not a problem if the number of collisions (counted in the
<C>collisions</C> component) is smaller than the number of accesses
(counted in the <C>accesses</C> component).


A high number of collisions can be caused by a bad hash function,
by a hash table that is too small (do not fill a hash table to
more than about 80&percent;), or by objects to store that
are just not well enough distributed. Hash tables will grow automatically
if too many collisions are detected or if they are filled to
80&percent;.

The advantage of TreeHashTabs is that even for a bad hash function the
performance is never worse than <M>\log(n)</M> for each operation where
<M>n</M> is the number of keys in the hash with the same hash value.
However, they need a bit more memory.
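
To judge the efficiency of a HashTab in practice, one could compare the
counters mentioned above. The helper <C>HashTabStatistics</C> in this
sketch is hypothetical; it only reads the components <C>nr</C>,
<C>len</C>, <C>collisions</C> and <C>accesses</C> of a HashTab object.

<Listing Type="GAP">
<![CDATA[
LoadPackage("orb");

HashTabStatistics := function(ht)
  local ratio;
  # Avoid a division by zero for a table that was never accessed.
  if ht!.accesses = 0 then
    ratio := 0;
  else
    ratio := ht!.collisions / ht!.accesses;
  fi;
  return rec(fill := ht!.nr / ht!.len,
             collisionsPerAccess := ratio);
end;

ht := HTCreate(1, rec(hashlen := 101));
for i in [1 .. 50] do
  HTAdd(ht, i, true);
od;
stats := HashTabStatistics(ht);
]]>
</Listing>

A value of <C>stats.collisionsPerAccess</C> below <M>1</M> corresponds to
the unproblematic case described above, and <C>stats.fill</C> shows how
close the table is to the 80&percent; mark.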
</Subsection>

</Section>



</Chapter>
