public class InternalBloomFilter extends Object
The code in this class is adapted from org.apache.hadoop.util.bloom.BloomFilter in Apache Hadoop.
The serialization and deserialization logic is identical to, and byte-compatible with, Hadoop's
org.apache.hadoop.util.bloom.BloomFilter, so this class correctly reads bloom
filters serialized by older Hudi versions using Hadoop's BloomFilter.
Hudi serializes bloom filters and writes them to Parquet file footers and to the metadata table's bloom filter partition, which contains the bloom filters for all data files. Keeping the bloom filter serde, and therefore this code, in the Hudi repository avoids breaking changes to the storage format and its bytes.
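To illustrate why the serialized bytes have to stay stable, here is a minimal sketch of a write/read round trip over DataOutput and DataInput. SerdeSketch, its field order, and its byte layout are hypothetical and chosen only for illustration; they are not the Hadoop or Hudi wire format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical stand-in for a serializable filter; the byte layout is illustrative only. */
class SerdeSketch {
    final int vectorSize;
    final int nbHash;
    final BitSet bits;

    SerdeSketch(int vectorSize, int nbHash, BitSet bits) {
        this.vectorSize = vectorSize;
        this.nbHash = nbHash;
        this.bits = bits;
    }

    /** Writes a fixed header followed by the bit vector padded to whole bytes. */
    void write(DataOutput out) throws IOException {
        out.writeInt(nbHash);
        out.writeInt(vectorSize);
        out.write(Arrays.copyOf(bits.toByteArray(), (vectorSize + 7) / 8));
    }

    /** Reads the fields back in exactly the order write() emitted them. */
    static SerdeSketch readFrom(DataInput in) throws IOException {
        int nbHash = in.readInt();
        int vectorSize = in.readInt();
        byte[] raw = new byte[(vectorSize + 7) / 8];
        in.readFully(raw);
        return new SerdeSketch(vectorSize, nbHash, BitSet.valueOf(raw));
    }

    /** Convenience wrapper: serialize into an in-memory byte array. */
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Convenience wrapper: deserialize from an in-memory byte array. */
    static SerdeSketch fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return readFrom(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A reader that expected the two header integers in the opposite order would misinterpret every filter already written to storage, which is the kind of break the compatibility guarantee above is meant to rule out.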
The Bloom filter is a data structure that was introduced in 1970 and that has been adopted by the networking research community in the past decade thanks to the bandwidth efficiencies that it offers for the transmission of set membership information between networked hosts. A sender encodes the information into a bit vector, the Bloom filter, that is more compact than a conventional representation. Computation and space costs for construction are linear in the number of elements. The receiver uses the filter to test whether various elements are members of the set. Though the filter will occasionally return a false positive, it will never return a false negative. When creating the filter, the sender can choose its desired point in a trade-off between the false positive rate and the size.
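The behaviour described above can be sketched in a few lines of Java. BloomSketch below is a hypothetical teaching class, not InternalBloomFilter: it takes String keys instead of Key objects, and it derives bit positions from two seeded FNV-1a hashes, which is an assumption made for brevity rather than the hashing that Hudi's HashFunction performs.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Hypothetical teaching sketch of a Bloom filter; not Hudi's implementation. */
class BloomSketch {
    private final BitSet bits;
    private final int vectorSize;
    private final int nbHash;

    BloomSketch(int vectorSize, int nbHash) {
        if (vectorSize <= 0 || nbHash <= 0) {
            throw new IllegalArgumentException("vectorSize and nbHash must be positive");
        }
        this.vectorSize = vectorSize;
        this.nbHash = nbHash;
        this.bits = new BitSet(vectorSize);
    }

    /** Sets the nbHash bit positions derived from the key. */
    void add(String key) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < nbHash; i++) {
            bits.set(position(data, i));
        }
    }

    /** Returns false only if the key was definitely never added; true may be a false positive. */
    boolean membershipTest(String key) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < nbHash; i++) {
            if (!bits.get(position(data, i))) {
                return false;
            }
        }
        return true;
    }

    /** Double hashing (h1 + i * h2), reduced to a valid index in [0, vectorSize). */
    private int position(byte[] data, int i) {
        int h1 = fnv1a(data, 0x811C9DC5); // standard FNV-1a offset basis
        int h2 = fnv1a(data, 0x9E3779B9); // arbitrary second seed (assumption)
        return Math.floorMod(h1 + i * h2, vectorSize);
    }

    /** 32-bit FNV-1a over the key bytes, starting from the given seed. */
    private static int fnv1a(byte[] data, int seed) {
        int h = seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }
}
```

Because add only ever sets bits and membershipTest checks the same positions, a key that was added always tests positive. A key that was never added can still test positive when other keys happen to have set all of its positions.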
Originally created by European Commission One-Lab Project 034819.
| Modifier and Type | Field and Description |
|---|---|
| `protected HashFunction` | `hash` - The hash function used to map a key to several positions in the vector. |
| `protected int` | `hashType` - Type of hashing function to use. |
| `protected int` | `nbHash` - The number of hash functions to consider. |
| `protected int` | `vectorSize` - The vector size of this filter. |
| Constructor and Description |
|---|
| `InternalBloomFilter()` - Default constructor, for use with readFields. |
| `InternalBloomFilter(int vectorSize, int nbHash, int hashType)` - Constructor. |
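The constructor leaves the choice of vectorSize and nbHash to the caller. As a hedged illustration, the standard textbook formulas for a target false positive rate p over n expected entries are m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. BloomSizingSketch and its method names are hypothetical, and this page does not state whether Hudi's callers size their filters this way.

```java
/** Hypothetical helper applying the textbook Bloom filter sizing formulas. */
class BloomSizingSketch {

    /** Bits needed for numEntries keys at the given false positive rate (0 < rate < 1). */
    static int optimalVectorSize(int numEntries, double errorRate) {
        if (numEntries <= 0 || errorRate <= 0.0 || errorRate >= 1.0) {
            throw new IllegalArgumentException("need numEntries > 0 and 0 < errorRate < 1");
        }
        double ln2 = Math.log(2);
        return (int) Math.ceil(-numEntries * Math.log(errorRate) / (ln2 * ln2));
    }

    /** Number of hash functions that minimizes false positives for the given sizes. */
    static int optimalNbHash(int numEntries, int vectorSize) {
        if (numEntries <= 0 || vectorSize <= 0) {
            throw new IllegalArgumentException("numEntries and vectorSize must be positive");
        }
        long rounded = Math.round((double) vectorSize / numEntries * Math.log(2));
        return (int) Math.max(1L, rounded);
    }
}
```

For 1000 expected entries and a 1% false positive rate, these formulas give a vector of 9586 bits and 7 hash functions, values that could then be passed as vectorSize and nbHash.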
| Modifier and Type | Method and Description |
|---|---|
| `void` | `add(Collection<Key> keys)` - Adds a collection of keys to this filter. |
| `void` | `add(Key key)` - Adds a key to this filter. |
| `void` | `add(Key[] keys)` - Adds an array of keys to this filter. |
| `void` | `add(List<Key> keys)` - Adds a list of keys to this filter. |
| `void` | `and(org.apache.hudi.common.bloom.InternalFilter filter)` - Performs a logical AND between this filter and a specified filter. |
| `int` | `getVectorSize()` |
| `boolean` | `membershipTest(Key key)` - Determines whether a specified key belongs to this filter. |
| `void` | `not()` - Performs a logical NOT on this filter. |
| `void` | `or(org.apache.hudi.common.bloom.InternalFilter filter)` - Performs a logical OR between this filter and a specified filter. |
| `void` | `readFields(DataInput in)` - Deserialize the fields of this object from in. |
| `String` | `toString()` |
| `void` | `write(DataOutput out)` - Serialize the fields of this object to out. |
| `void` | `xor(org.apache.hudi.common.bloom.InternalFilter filter)` - Performs a logical XOR between this filter and a specified filter. |
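The logical operations listed above act on the filters' underlying bit vectors and store the result in the receiving filter. A minimal sketch of that idea using java.util.BitSet follows. BitVectorSketch is a hypothetical class, and its rule of rejecting operands of a different size is an assumption made for the example, not documented behaviour of InternalBloomFilter.

```java
import java.util.BitSet;

/** Hypothetical bit-vector wrapper showing in-place AND, OR, XOR and NOT. */
class BitVectorSketch {
    final int vectorSize;
    final BitSet bits;

    /** Creates a vector of the given size with the listed bit positions set. */
    BitVectorSketch(int vectorSize, int... setBits) {
        this.vectorSize = vectorSize;
        this.bits = new BitSet(vectorSize);
        for (int bit : setBits) {
            if (bit < 0 || bit >= vectorSize) {
                throw new IllegalArgumentException("bit out of range: " + bit);
            }
            bits.set(bit);
        }
    }

    /** Keeps only the bits set in both vectors; the result is assigned to this vector. */
    void and(BitVectorSketch other) {
        requireSameSize(other);
        bits.and(other.bits);
    }

    /** Sets every bit set in either vector; the result is assigned to this vector. */
    void or(BitVectorSketch other) {
        requireSameSize(other);
        bits.or(other.bits);
    }

    /** Keeps the bits set in exactly one of the vectors; the result is assigned to this vector. */
    void xor(BitVectorSketch other) {
        requireSameSize(other);
        bits.xor(other.bits);
    }

    /** Flips every bit in [0, vectorSize); the result is assigned to this vector. */
    void not() {
        bits.flip(0, vectorSize);
    }

    // Assumption for this sketch: operands must have the same vector size.
    private void requireSameSize(BitVectorSketch other) {
        if (other == null || other.vectorSize != vectorSize) {
            throw new IllegalArgumentException("vectors must have the same size");
        }
    }
}
```

For Bloom filters built with the same size and hash functions, OR yields a filter that answers positively for every key added to either input, which is why it is the operation typically used to merge filters.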
protected int vectorSize
protected HashFunction hash
protected int nbHash
protected int hashType
public InternalBloomFilter()
public InternalBloomFilter(int vectorSize, int nbHash, int hashType)
vectorSize - The vector size of this filter.
nbHash - The number of hash functions to consider.
hashType - Type of the hashing function (see Hash).

public void add(Key key)
key - The key to add.

public void and(org.apache.hudi.common.bloom.InternalFilter filter)
Invariant: The result is assigned to this filter.
filter - The filter to AND with.

public boolean membershipTest(Key key)
key - The key to test.

public void not()
The result is assigned to this filter.

public void or(org.apache.hudi.common.bloom.InternalFilter filter)
Invariant: The result is assigned to this filter.
filter - The filter to OR with.

public void xor(org.apache.hudi.common.bloom.InternalFilter filter)
Invariant: The result is assigned to this filter.
filter - The filter to XOR with.

public int getVectorSize()

public void write(DataOutput out) throws IOException
Serialize the fields of this object to out.
out - DataOutput to serialize this object into.
Throws: IOException

public void readFields(DataInput in) throws IOException
Deserialize the fields of this object from in.
For efficiency, implementations should attempt to re-use storage in the existing object where possible.
in - DataInput to deserialize this object from.
Throws: IOException

public void add(List<Key> keys)
keys - The list of keys.

public void add(Collection<Key> keys)
keys - The collection of keys.

public void add(Key[] keys)
keys - The array of keys.

Copyright © 2024 The Apache Software Foundation. All rights reserved.