Package org.corpus_tools.graphannis
Class CorpusStorageManager
- java.lang.Object
-
- org.corpus_tools.graphannis.CorpusStorageManager
-
- All Implemented Interfaces:
AutoCloseable
public class CorpusStorageManager extends Object implements AutoCloseable
An API for managing corpora stored in a common location on the file system.
- Author:
- Thomas Krause <krauseto@hu-berlin.de>
-
-
Nested Class Summary
- static class CorpusStorageManager.CountResult: Contains the extended results of the count query.
- static class CorpusStorageManager.ExportFormat: An enum of all supported export formats of graphANNIS.
- static class CorpusStorageManager.ImportFormat: An enum of all supported input formats of graphANNIS.
- static class CorpusStorageManager.QueryLanguage: An enum of all supported query languages of graphANNIS.
- static class CorpusStorageManager.ResultOrder: Defines the order of results of a "find" query.
-
Constructor Summary
- CorpusStorageManager(String dbDir): Create a new instance with an automatically determined size of the internal corpus cache.
- CorpusStorageManager(String dbDir, String logfile, LogLevel level, boolean useParallel): Create a new instance with an automatically determined size of the internal corpus cache.
- CorpusStorageManager(String dbDir, String logfile, LogLevel level, boolean useParallel, long maxCacheSize): Create a new instance with a maximum size for the internal corpus cache.
-
Method Summary
- void applyUpdate(String corpusName, GraphUpdate update): Apply a sequence of updates to this graph for a corpus.
- void close()
- Graph corpusGraph(String corpusName): Return the copy of the graph of the corpus structure given by its name.
- Graph corpusGraphForQuery(String corpusName, String query, CorpusStorageManager.QueryLanguage queryLanguage): Return the copy of the graph of the corpus structure which includes all nodes matched by the given query.
- long count(Iterable<String> corpusNames, String query, CorpusStorageManager.QueryLanguage queryLanguage): Count the number of results for a query.
- CorpusStorageManager.CountResult countExtra(Iterable<String> corpusNames, String query, CorpusStorageManager.QueryLanguage queryLanguage): Count the number of results for a query and return both the total number of matches and also the number of documents in the result set.
- boolean deleteCorpus(String corpusName): Delete a corpus from this corpus storage.
- void exportToFileSystem(String[] corpora, String path, CorpusStorageManager.ExportFormat format): Export a corpus to an external location on the file system using the given format.
- String[] find(Iterable<String> corpusNames, String query, CorpusStorageManager.QueryLanguage queryLanguage, long offset, Optional<Long> limit): Find all results for a `query` and return the match ID for each result in default order.
- String[] find(Iterable<String> corpusNames, String query, CorpusStorageManager.QueryLanguage queryLanguage, long offset, Optional<Long> limit, CorpusStorageManager.ResultOrder order): Find all results for a `query` and return the match ID for each result.
- List<FrequencyTableEntry<String>> frequency(Iterable<String> corpusNames, String query, CorpusStorageManager.QueryLanguage queryLanguage, String frequencyQueryDefinition): Execute a frequency query.
- List<Component> getAllComponentsByType(String corpusName, ComponentType componentType): Returns a list of all components of a corpus given by its name and a given component type.
- List<NodeDesc> getNodeDescriptions(String query, CorpusStorageManager.QueryLanguage queryLanguage)
- void importFromFileSystem(String path, CorpusStorageManager.ImportFormat format, String corpusName, boolean diskBased): Import a corpus from an external location on the file system into this corpus storage.
- void importFromFileSystem(String path, CorpusStorageManager.ImportFormat format, String corpusName, boolean diskBased, boolean overwriteExisting): Import a corpus from an external location on the file system into this corpus storage.
- String[] list(): List all available corpora in the corpus storage.
- List<Annotation> listEdgeAnnotations(String corpusName, ComponentType componentType, String componentName, String componentLayer, boolean listValues, boolean onlyMostFrequentValues): Returns a list of all edge annotations of a corpus given by its name and a given component.
- List<Annotation> listNodeAnnotations(String corpusName, boolean listValues, boolean onlyMostFrequentValues): Returns a list of all node annotations of a corpus given its name.
- Graph subcorpusGraph(String corpusName, List<String> documentIDs): Return the copy of a subgraph which includes all nodes that belong to any of the given list of sub-corpus/document identifiers.
- Graph subgraph(String corpusName, List<String> nodeIDs, long ctxLeft, long ctxRight, Optional<String> segmentation): Return the copy of a subgraph which includes the given list of node annotation identifiers, the nodes that cover the same token as the given nodes and all nodes that cover the token which are part of the defined context.
- Graph subGraphForQuery(String corpusName, String query, CorpusStorageManager.QueryLanguage queryLanguage): Return the copy of a subgraph which includes all nodes matched by the given query.
- void unloadCorpus(String corpusName): Unloads a corpus from the cache.
- boolean validateQuery(Iterable<String> corpusNames, String query, CorpusStorageManager.QueryLanguage queryLanguage): Parses a query and checks if it is valid.
-
-
-
Constructor Detail
-
CorpusStorageManager
public CorpusStorageManager(String dbDir) throws GraphANNISException
Create a new instance with an automatically determined size of the internal corpus cache. This constructor version does not use parallel query execution and uses an automatic strategy for its internal corpus cache.
- Parameters:
dbDir - The path on the file system where the corpus storage content is located. Must be an existing directory.
- Throws:
GraphANNISException
-
CorpusStorageManager
public CorpusStorageManager(String dbDir, String logfile, LogLevel level, boolean useParallel) throws GraphANNISException
Create a new instance with an automatically determined size of the internal corpus cache.
- Parameters:
dbDir - The path on the file system where the corpus storage content is located. Must be an existing directory.
logfile - Path where the log file should be written.
level - Log level for the log file.
useParallel - If "true", the system uses parallel joins on all available cores.
- Throws:
GraphANNISException
-
CorpusStorageManager
public CorpusStorageManager(String dbDir, String logfile, LogLevel level, boolean useParallel, long maxCacheSize) throws GraphANNISException
Create a new instance with a maximum size for the internal corpus cache.
- Parameters:
dbDir - The path on the file system where the corpus storage content is located. Must be an existing directory.
logfile - Path where the log file should be written.
level - Log level for the log file.
useParallel - If "true", the system uses parallel joins on all available cores.
maxCacheSize - Fixed maximum size of the cache in megabytes.
- Throws:
GraphANNISException
-
-
Method Detail
-
list
public String[] list() throws GraphANNISException
List all available corpora in the corpus storage.
- Returns:
- A list of corpus names.
- Throws:
GraphANNISException
-
listNodeAnnotations
public List<Annotation> listNodeAnnotations(String corpusName, boolean listValues, boolean onlyMostFrequentValues) throws GraphANNISException
Returns a list of all node annotations of a corpus given its name.
- Parameters:
corpusName - The name of the corpus.
listValues - If true, include the possible values in the result.
onlyMostFrequentValues - If both this argument and "listValues" are true, only return the most frequent value for each annotation name.
- Returns:
- list of annotations
- Throws:
GraphANNISException
-
listEdgeAnnotations
public List<Annotation> listEdgeAnnotations(String corpusName, ComponentType componentType, String componentName, String componentLayer, boolean listValues, boolean onlyMostFrequentValues) throws GraphANNISException
Returns a list of all edge annotations of a corpus given by its name and a given component.
- Parameters:
corpusName - The name of the corpus.
componentType - Type of the component.
componentName - Name of the component.
componentLayer - A layer name that allows grouping different components into the same layer. Can be empty.
listValues - If true, include the possible values in the result.
onlyMostFrequentValues - If both this argument and "listValues" are true, only return the most frequent value for each annotation name.
- Returns:
- list of annotations
- Throws:
GraphANNISException
-
getAllComponentsByType
public List<Component> getAllComponentsByType(String corpusName, ComponentType componentType) throws GraphANNISException
Returns a list of all components of a corpus given by its name and a given component type.
- Parameters:
corpusName - The name of the corpus.
componentType - Type of the components to be returned.
- Returns:
- A list of all components of this type.
- Throws:
GraphANNISException
-
validateQuery
public boolean validateQuery(Iterable<String> corpusNames, String query, CorpusStorageManager.QueryLanguage queryLanguage) throws GraphANNISException
Parses a query and checks if it is valid.
- Parameters:
corpusNames - The names of the corpora the query would be executed on (needed to catch certain corpus-specific semantic errors).
query - The query as string.
queryLanguage - The query language of the query (e.g. AQL).
- Returns:
- True if this is a valid query, false otherwise.
- Throws:
GraphANNISException
-
getNodeDescriptions
public List<NodeDesc> getNodeDescriptions(String query, CorpusStorageManager.QueryLanguage queryLanguage) throws GraphANNISException
- Throws:
GraphANNISException
-
count
public long count(Iterable<String> corpusNames, String query, CorpusStorageManager.QueryLanguage queryLanguage) throws GraphANNISException
Count the number of results for a query.
- Parameters:
corpusNames - The names of the corpora to execute the query on.
query - The query as string.
queryLanguage - The query language of the query (e.g. AQL).
- Returns:
- Returns the count as number.
- Throws:
GraphANNISException
-
countExtra
public CorpusStorageManager.CountResult countExtra(Iterable<String> corpusNames, String query, CorpusStorageManager.QueryLanguage queryLanguage) throws GraphANNISException
Count the number of results for a query and return both the total number of matches and the number of documents in the result set.
- Parameters:
corpusNames - The names of the corpora to execute the query on.
query - The query as string.
queryLanguage - The query language of the query (e.g. AQL).
- Returns:
- An object containing both the match and document counts
- Throws:
GraphANNISException
-
find
public String[] find(Iterable<String> corpusNames, String query, CorpusStorageManager.QueryLanguage queryLanguage, long offset, Optional<Long> limit) throws GraphANNISException
Find all results for a `query` and return the match ID for each result in default order. The query is paginated and an offset and limit can be specified.
- Parameters:
corpusNames - The names of the corpora to execute the query on.
query - The query as string.
queryLanguage - The query language of the query (e.g. AQL).
offset - Skip the first `n` results, where `n` is the offset.
limit - Return at most `n` matches, where `n` is the limit.
- Returns:
- An array of node identifiers
- Throws:
GraphANNISException
-
find
public String[] find(Iterable<String> corpusNames, String query, CorpusStorageManager.QueryLanguage queryLanguage, long offset, Optional<Long> limit, CorpusStorageManager.ResultOrder order) throws GraphANNISException
Find all results for a `query` and return the match ID for each result. The query is paginated and an offset and limit can be specified.
- Parameters:
corpusNames - The names of the corpora to execute the query on.
query - The query as string.
queryLanguage - The query language of the query (e.g. AQL).
offset - Skip the first `n` results, where `n` is the offset.
limit - Return at most `n` matches, where `n` is the limit.
order - Specify the order of the matches.
- Returns:
- An array of node identifiers
- Throws:
GraphANNISException
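The offset and limit parameters of find can be illustrated on a plain in-memory list. The following is a minimal sketch of the documented pagination semantics, not the library's implementation: the class PaginationSketch and its method page are hypothetical names, and treating an empty Optional as "no limit" is an assumption that the text above does not state.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of graphANNIS: models "skip the first
// `offset` results, then return at most `limit` matches" on a list.
final class PaginationSketch {

    static List<String> page(List<String> matches, long offset, Optional<Long> limit) {
        if (offset < 0) {
            throw new IllegalArgumentException("offset must not be negative");
        }
        // Skipping more results than exist yields an empty page.
        int from = (int) Math.min(offset, (long) matches.size());
        long remaining = matches.size() - from;
        long count = remaining;
        if (limit.isPresent()) {
            // Assumption: an absent limit means "return everything that is left".
            long requested = Math.max(limit.get(), 0L);
            count = Math.min(requested, remaining);
        }
        return matches.subList(from, from + (int) count);
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("m0", "m1", "m2", "m3", "m4");
        System.out.println(page(all, 1, Optional.of(2L)));   // prints [m1, m2]
        System.out.println(page(all, 3, Optional.empty()));  // prints [m3, m4]
    }
}
```

A caller that pages through a result set would typically keep the limit fixed and increase the offset by that amount for each request.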
-
subgraph
public Graph subgraph(String corpusName, List<String> nodeIDs, long ctxLeft, long ctxRight, Optional<String> segmentation) throws GraphANNISException
Return the copy of a subgraph which includes the given list of node annotation identifiers, the nodes that cover the same token as the given nodes and all nodes that cover the token which are part of the defined context.
- Parameters:
corpusName - The name of the corpus from which the subgraph should be generated.
nodeIDs - A set of node annotation identifiers describing the subgraph.
ctxLeft - Left context in token distance to be included in the subgraph.
ctxRight - Right context in token distance to be included in the subgraph.
segmentation - The name of the segmentation which should be used as the base for the context. Use Optional.empty() to define the context in the default token layer.
- Returns:
- The subgraph.
- Throws:
GraphANNISException
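To make the ctxLeft and ctxRight parameters more concrete, the sketch below computes which token positions fall into the context of a match. It is a deliberately simplified model: it assumes tokens are addressed by a zero-based index in a single document, and it ignores segmentations and the covering nodes that the real subgraph also includes. The class ContextWindowSketch and its method window are hypothetical and not part of graphANNIS.

```java
// Hypothetical model of a token context window; not the graphANNIS implementation.
final class ContextWindowSketch {

    /**
     * Returns {first, last}, the inclusive token index range that covers the
     * matched tokens plus ctxLeft tokens before and ctxRight tokens after them,
     * clamped to the bounds of the document.
     */
    static int[] window(int matchStart, int matchEnd, long ctxLeft, long ctxRight, int tokenCount) {
        if (tokenCount <= 0 || matchStart < 0 || matchEnd < matchStart || matchEnd >= tokenCount) {
            throw new IllegalArgumentException("match range must lie inside the document");
        }
        if (ctxLeft < 0 || ctxRight < 0) {
            throw new IllegalArgumentException("context sizes must not be negative");
        }
        // Clamp the context sizes first so the additions below cannot overflow.
        long left = Math.min(ctxLeft, (long) tokenCount);
        long right = Math.min(ctxRight, (long) tokenCount);
        int first = (int) Math.max(0L, matchStart - left);
        int last = (int) Math.min(tokenCount - 1L, matchEnd + right);
        return new int[] {first, last};
    }

    public static void main(String[] args) {
        // A match on tokens 5..6 in an 8-token document, with 2 tokens of left
        // context and 3 tokens of right context.
        int[] w = window(5, 6, 2, 3, 8);
        System.out.println(w[0] + ".." + w[1]); // prints 3..7
    }
}
```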
-
subcorpusGraph
public Graph subcorpusGraph(String corpusName, List<String> documentIDs) throws GraphANNISException
Return the copy of a subgraph which includes all nodes that belong to any of the given list of sub-corpus/document identifiers.
- Parameters:
corpusName - The name of the corpus from which the subgraph should be generated.
documentIDs - A set of sub-corpus/document identifiers describing the subgraph.
- Returns:
- The subgraph.
- Throws:
GraphANNISException
-
corpusGraph
public Graph corpusGraph(String corpusName) throws GraphANNISException
Return the copy of the graph of the corpus structure given by its name.
- Parameters:
corpusName - The name of the corpus.
- Returns:
- The corpus graph
- Throws:
GraphANNISException
-
corpusGraphForQuery
public Graph corpusGraphForQuery(String corpusName, String query, CorpusStorageManager.QueryLanguage queryLanguage) throws GraphANNISException
Return the copy of the graph of the corpus structure which includes all nodes matched by the given query.
- Parameters:
corpusName - The name of the corpus.
query - The query as string.
queryLanguage - The query language of the query (e.g. AQL).
- Returns:
- The corpus graph
- Throws:
GraphANNISException
-
subGraphForQuery
public Graph subGraphForQuery(String corpusName, String query, CorpusStorageManager.QueryLanguage queryLanguage) throws GraphANNISException
Return the copy of a subgraph which includes all nodes matched by the given query.
- Parameters:
corpusName - The name of the corpus.
query - The query as string.
queryLanguage - The query language of the query (e.g. AQL).
- Returns:
- The subgraph
- Throws:
GraphANNISException
-
frequency
public List<FrequencyTableEntry<String>> frequency(Iterable<String> corpusNames, String query, CorpusStorageManager.QueryLanguage queryLanguage, String frequencyQueryDefinition) throws GraphANNISException
Execute a frequency query.
- Parameters:
corpusNames - The names of the corpora to execute the query on.
query - The query as string.
queryLanguage - The query language of the query (e.g. AQL).
frequencyQueryDefinition - A comma-separated list of single frequency definition items as string. Each frequency definition must consist of two parts, separated by ":": the name of the referenced node and the (possibly qualified) annotation name or "tok". E.g. a frequency definition like 1:tok,3:pos,4:tiger::pos would extract the token value for node #1, the pos annotation for node #3 and the pos annotation in the tiger namespace for node #4.
- Returns:
- A list of frequency table entries.
- Throws:
GraphANNISException
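The structure of a frequencyQueryDefinition string can be shown by splitting the documented example into its parts. This is only an illustrative sketch built from the format description above: the class FrequencyDefinitionSketch and its nested Item type are hypothetical, and graphANNIS parses the string internally in its own way.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for illustration; not part of graphANNIS.
final class FrequencyDefinitionSketch {

    /** One frequency definition item such as "4:tiger::pos". */
    static final class Item {
        final String nodeRef;   // name of the referenced query node, e.g. "4"
        final String namespace; // null when the annotation name is unqualified
        final String name;      // annotation name or "tok"

        Item(String nodeRef, String namespace, String name) {
            this.nodeRef = nodeRef;
            this.namespace = namespace;
            this.name = name;
        }
    }

    static List<Item> parse(String definition) {
        List<Item> items = new ArrayList<>();
        for (String raw : definition.split(",")) {
            String part = raw.trim();
            // The first ":" separates the node reference from the annotation name.
            int sep = part.indexOf(':');
            if (sep <= 0 || sep == part.length() - 1) {
                throw new IllegalArgumentException("Expected <node>:<annotation> but got: " + part);
            }
            String nodeRef = part.substring(0, sep);
            String annotation = part.substring(sep + 1);
            // A qualified annotation name has the form namespace::name.
            int ns = annotation.indexOf("::");
            if (ns >= 0) {
                items.add(new Item(nodeRef, annotation.substring(0, ns), annotation.substring(ns + 2)));
            } else {
                items.add(new Item(nodeRef, null, annotation));
            }
        }
        return items;
    }

    public static void main(String[] args) {
        for (Item item : parse("1:tok,3:pos,4:tiger::pos")) {
            String ns = item.namespace == null ? "(none)" : item.namespace;
            System.out.println("node #" + item.nodeRef + ", namespace " + ns + ", name " + item.name);
        }
    }
}
```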
-
importFromFileSystem
public void importFromFileSystem(String path, CorpusStorageManager.ImportFormat format, String corpusName, boolean diskBased) throws GraphANNISException
Import a corpus from an external location on the file system into this corpus storage. This will not overwrite an existing corpus.
- Parameters:
path - The location on the file system where the corpus data is located.
format - The format in which this corpus data is stored.
corpusName - If not "null", override the name of the new corpus for file formats that already provide a corpus name.
diskBased - If true, certain elements like the node annotation storage will be disk-based instead of using in-memory representations.
- Throws:
GraphANNISException
-
importFromFileSystem
public void importFromFileSystem(String path, CorpusStorageManager.ImportFormat format, String corpusName, boolean diskBased, boolean overwriteExisting) throws GraphANNISException
Import a corpus from an external location on the file system into this corpus storage.
- Parameters:
path - The location on the file system where the corpus data is located.
format - The format in which this corpus data is stored.
corpusName - If not "null", override the name of the new corpus for file formats that already provide a corpus name.
diskBased - If true, certain elements like the node annotation storage will be disk-based instead of using in-memory representations.
overwriteExisting - If true, overwrite a possibly existing corpus.
- Throws:
GraphANNISException
-
exportToFileSystem
public void exportToFileSystem(String[] corpora, String path, CorpusStorageManager.ExportFormat format) throws GraphANNISException
Export a corpus to an external location on the file system using the given format.
- Parameters:
corpora - The corpora to include in the exported file(s).
path - The location on the file system where the corpus data should be written to.
format - The format in which this corpus data will be stored.
- Throws:
GraphANNISException
-
deleteCorpus
public boolean deleteCorpus(String corpusName) throws GraphANNISException
Delete a corpus from this corpus storage.
- Parameters:
corpusName - The name of the corpus to delete.
- Returns:
- "true" if the corpus was successfully deleted and "false" if no such corpus existed.
- Throws:
GraphANNISException
-
unloadCorpus
public void unloadCorpus(String corpusName) throws GraphANNISException
Unloads a corpus from the cache.
- Parameters:
corpusName - The name of the corpus to unload.
- Throws:
GraphANNISException
-
applyUpdate
public void applyUpdate(String corpusName, GraphUpdate update) throws GraphANNISException
Apply a sequence of updates to this graph for a corpus. It is ensured that the update process is atomic and that the changes are persisted to disk if no exception was thrown.
- Parameters:
corpusName - The name of the corpus to apply the updates on.
update - The sequence of updates.
- Throws:
GraphANNISException
-
close
public void close() throws Exception
- Specified by:
close in interface AutoCloseable
- Throws:
Exception
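Because the class implements AutoCloseable, an instance can be managed with a try-with-resources statement, which calls close() automatically when the block exits. The sketch below shows the pattern with a hypothetical stand-in class (StandInStorage) so that it runs without the graphANNIS library; it does not reproduce the behavior of the real CorpusStorageManager.

```java
// Illustrates try-with-resources on an AutoCloseable type.
final class TryWithResourcesSketch {

    /** Hypothetical stand-in for a closeable storage manager. */
    static final class StandInStorage implements AutoCloseable {
        private boolean closed = false;

        String[] list() {
            if (closed) {
                throw new IllegalStateException("storage already closed");
            }
            return new String[] {"exampleCorpus"};
        }

        boolean isClosed() {
            return closed;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Uses the storage inside a try block and returns it after it was closed. */
    static StandInStorage useAndClose() {
        StandInStorage observed = null;
        try (StandInStorage storage = new StandInStorage()) {
            observed = storage;
            System.out.println("corpora listed: " + storage.list().length);
        } // close() is called here, even if the block throws
        return observed;
    }

    public static void main(String[] args) {
        System.out.println("closed after block: " + useAndClose().isClosed());
    }
}
```

With the real class, close() is declared to throw Exception, so the enclosing method has to catch or declare it.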
-
-