Class CorpusStorageManager

  • All Implemented Interfaces:
    AutoCloseable

    public class CorpusStorageManager
    extends Object
    implements AutoCloseable
    An API for managing corpora stored in a common location on the file system.
    Author:
    Thomas Krause <krauseto@hu-berlin.de>
    • Constructor Detail

      • CorpusStorageManager

        public CorpusStorageManager​(String dbDir)
                             throws GraphANNISException
        Create a new instance with an automatically determined size for the internal corpus cache. This constructor variant does not use parallel query execution and uses an automatic strategy for sizing its internal corpus cache.
        Parameters:
        dbDir - The path on the filesystem where the corpus storage content is located. Must be an existing directory.
        Throws:
        GraphANNISException
      • CorpusStorageManager

        public CorpusStorageManager​(String dbDir,
                                    String logfile,
                                    LogLevel level,
                                    boolean useParallel)
                             throws GraphANNISException
        Create a new instance with an automatically determined size for the internal corpus cache.
        Parameters:
        dbDir - The path on the filesystem where the corpus storage content is located. Must be an existing directory.
        logfile - Path to where a logfile should be written
        level - Log level for the logfile
        useParallel - If "true", the system uses parallel joins on all available cores.
        Throws:
        GraphANNISException
      • CorpusStorageManager

        public CorpusStorageManager​(String dbDir,
                                    String logfile,
                                    LogLevel level,
                                    boolean useParallel,
                                    long maxCacheSize)
                             throws GraphANNISException
        Create a new instance with a maximum size for the internal corpus cache.
        Parameters:
        dbDir - The path on the filesystem where the corpus storage content is located. Must be an existing directory.
        logfile - Path to where a logfile should be written
        level - Log level for the logfile
        useParallel - If "true", the system uses parallel joins on all available cores.
        maxCacheSize - Fixed maximum size of the cache in megabytes.
        Throws:
        GraphANNISException
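        Since the class implements AutoCloseable, an instance can be managed with a try-with-resources block so that close() is called even if an exception is thrown. The following is only a minimal sketch of that pattern: StandInStorage is a hypothetical stand-in and not the real CorpusStorageManager, which needs the graphANNIS library and an existing database directory.

```java
// Minimal sketch of the try-with-resources pattern.
// StandInStorage is a hypothetical stand-in, NOT the real CorpusStorageManager.
class TryWithResourcesSketch {

    /** Hypothetical resource that records whether close() was called. */
    static class StandInStorage implements AutoCloseable {
        private boolean closed = false;

        boolean isClosed() {
            return closed;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Uses the stand-in inside a try-with-resources block and returns it afterwards. */
    static StandInStorage useAndClose() {
        StandInStorage storage = new StandInStorage();
        try (StandInStorage s = storage) {
            // Work with the resource here, e.g. run queries against it.
        }
        // close() has been invoked automatically when the block was left.
        return storage;
    }

    public static void main(String[] args) {
        System.out.println("closed after block: " + useAndClose().isClosed());
    }
}
```

        With the real class, the expression in the try header would presumably be one of the constructors documented above.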
    • Method Detail

      • listNodeAnnotations

        public List<Annotation> listNodeAnnotations​(String corpusName,
                                                    boolean listValues,
                                                    boolean onlyMostFrequentValues)
                                             throws GraphANNISException
        Returns a list of all node annotations of a corpus given its name.
        Parameters:
        corpusName - The name of the corpus
        listValues - If true, include the possible values in the result.
        onlyMostFrequentValues - If both this argument and "listValues" are true, only return the most frequent value for each annotation name.
        Returns:
        list of annotations
        Throws:
        GraphANNISException
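        The interaction of "listValues" and "onlyMostFrequentValues" can be illustrated with a small sketch. It only models the selection of values for a single annotation name as described above. The helper selectValues, the sample counts, and the use of value counts as input are assumptions for illustration and say nothing about how graphANNIS computes the result internally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the documented flag semantics.
// Hypothetical helper, not graphANNIS code.
class AnnotationValueSelectionSketch {

    /** Selects the values to report for one annotation name. */
    static List<String> selectValues(Map<String, Long> valueCounts,
                                     boolean listValues,
                                     boolean onlyMostFrequentValues) {
        if (!listValues) {
            // Values are not requested at all.
            return Collections.emptyList();
        }
        if (!onlyMostFrequentValues) {
            // All possible values are requested.
            return new ArrayList<>(valueCounts.keySet());
        }
        // Both flags are true: keep only the value with the highest count.
        String best = null;
        long bestCount = -1;
        for (Map.Entry<String, Long> entry : valueCounts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best == null
                ? Collections.<String>emptyList()
                : Collections.singletonList(best);
    }

    /** Made-up counts for a "pos" annotation, for illustration only. */
    static Map<String, Long> sampleCounts() {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("NN", 5L);
        counts.put("VVFIN", 2L);
        counts.put("ART", 3L);
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = sampleCounts();
        System.out.println(selectValues(counts, false, true));
        System.out.println(selectValues(counts, true, false));
        System.out.println(selectValues(counts, true, true));
    }
}
```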
      • listEdgeAnnotations

        public List<Annotation> listEdgeAnnotations​(String corpusName,
                                                    ComponentType componentType,
                                                    String componentName,
                                                    String componentLayer,
                                                    boolean listValues,
                                                    boolean onlyMostFrequentValues)
                                             throws GraphANNISException
        Returns a list of all edge annotations of a corpus, given its name and a component.
        Parameters:
        corpusName - The name of the corpus
        componentType - Type of the component.
        componentName - Name of the component.
        componentLayer - A layer name that allows grouping different components into the same layer. Can be empty.
        listValues - If true, include the possible values in the result.
        onlyMostFrequentValues - If both this argument and "listValues" are true, only return the most frequent value for each annotation name.
        Returns:
        list of annotations
        Throws:
        GraphANNISException
      • getAllComponentsByType

        public List<Component> getAllComponentsByType​(String corpusName,
                                                      ComponentType componentType)
                                               throws GraphANNISException
        Returns a list of all components of a corpus, given its name and a component type.
        Parameters:
        corpusName - The name of the corpus
        componentType - Type of the component to be returned
        Returns:
        A list of all components of this type.
        Throws:
        GraphANNISException
      • validateQuery

        public boolean validateQuery​(Iterable<String> corpusNames,
                                     String query,
                                     CorpusStorageManager.QueryLanguage queryLanguage)
                              throws GraphANNISException
        Parses a query and checks if it is valid.
        Parameters:
        corpusNames - The names of the corpora the query would be executed on (needed to catch certain corpus-specific semantic errors).
        query - The query as string.
        queryLanguage - The query language of the query (e.g. AQL).
        Returns:
        True if this is a valid query, false otherwise.
        Throws:
        GraphANNISException
      • find

        public String[] find​(Iterable<String> corpusNames,
                             String query,
                             CorpusStorageManager.QueryLanguage queryLanguage,
                             long offset,
                             Optional<Long> limit)
                      throws GraphANNISException
        Find all results for a `query` and return the match ID for each result in default order. The query is paginated and an offset and limit can be specified.
        Parameters:
        corpusNames - The names of the corpora to execute the query on.
        query - The query as string.
        queryLanguage - The query language of the query (e.g. AQL).
        offset - Skip the first n results, where n is the offset.
        limit - Return at most n matches, where n is the limit.
        Returns:
        An array of node identifiers
        Throws:
        GraphANNISException
      • find

        public String[] find​(Iterable<String> corpusNames,
                             String query,
                             CorpusStorageManager.QueryLanguage queryLanguage,
                             long offset,
                             Optional<Long> limit,
                             CorpusStorageManager.ResultOrder order)
                      throws GraphANNISException
        Find all results for a `query` and return the match ID for each result. The query is paginated and an offset and limit can be specified.
        Parameters:
        corpusNames - The names of the corpora to execute the query on.
        query - The query as string.
        queryLanguage - The query language of the query (e.g. AQL).
        offset - Skip the first `n` results, where `n` is the offset.
        limit - Return at most `n` matches, where `n` is the limit.
        order - Specify the order of the matches.
        Returns:
        An array of node identifiers
        Throws:
        GraphANNISException
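        The meaning of "offset" and "limit" for both find variants can be illustrated on an in-memory list. This is a minimal sketch under the assumption that an empty limit means "no upper bound". The helper page and the match IDs "m1" to "m5" are hypothetical and do not reflect the real match ID format.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Minimal sketch of offset/limit pagination on an in-memory list.
// Hypothetical helper, not graphANNIS code.
class PaginationSketch {

    /** Skips the first `offset` matches and returns at most `limit` of the rest. */
    static List<String> page(List<String> allMatches, long offset, Optional<Long> limit) {
        return allMatches.stream()
                .skip(offset)
                // Assumption: an empty limit means that all remaining matches are returned.
                .limit(limit.orElse(Long.MAX_VALUE))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up match IDs for illustration only.
        List<String> matches = Arrays.asList("m1", "m2", "m3", "m4", "m5");
        System.out.println(page(matches, 1, Optional.of(2L)));
        System.out.println(page(matches, 3, Optional.empty()));
    }
}
```

        A caller that wants to walk through all results page by page would increase the offset by the page size after each call and stop once a page comes back with fewer entries than the limit.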
      • subgraph

        public Graph subgraph​(String corpusName,
                              List<String> nodeIDs,
                              long ctxLeft,
                              long ctxRight,
                              Optional<String> segmentation)
                       throws GraphANNISException
        Return a copy of a subgraph that includes the nodes given by the list of node annotation identifiers, the nodes that cover the same tokens as the given nodes, and all nodes that cover tokens that are part of the defined context.
        Parameters:
        corpusName - The name of the corpus from which the subgraph should be generated.
        nodeIDs - A list of node annotation identifiers describing the subgraph.
        ctxLeft - Left context in token distance to be included in the subgraph.
        ctxRight - Right context in token distance to be included in the subgraph.
        segmentation - The name of the segmentation that should be used as the base for the context. Use Optional.empty() to define the context in the default token layer.
        Returns:
        The subgraph.
        Throws:
        GraphANNISException
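        How "ctxLeft" and "ctxRight" widen the covered token range can be sketched with plain token indices. This is only an illustration for a single matched span. The helper window, the zero-based token positions, and the clamping at the document boundaries are assumptions and are not taken from the graphANNIS implementation.

```java
// Minimal sketch of a context window measured in token distance.
// Hypothetical helper, not graphANNIS code.
class ContextWindowSketch {

    /**
     * Returns {first, last} (inclusive, zero-based) token positions covered by a match
     * spanning matchStart..matchEnd plus the left and right context.
     */
    static int[] window(int tokenCount, int matchStart, int matchEnd,
                        long ctxLeft, long ctxRight) {
        // Assumption: the context is cut off at the start and end of the token sequence.
        long first = Math.max(0L, matchStart - ctxLeft);
        long last = Math.min(tokenCount - 1L, matchEnd + ctxRight);
        return new int[] {(int) first, (int) last};
    }

    public static void main(String[] args) {
        // A document with 10 tokens, a match on tokens 4..5, 2 tokens left and 3 tokens right.
        int[] w = window(10, 4, 5, 2, 3);
        System.out.println(w[0] + ".." + w[1]);
    }
}
```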
      • subcorpusGraph

        public Graph subcorpusGraph​(String corpusName,
                                    List<String> documentIDs)
                             throws GraphANNISException
        Return a copy of a subgraph that includes all nodes belonging to any of the given sub-corpus/document identifiers.
        Parameters:
        corpusName - The name of the corpus from which the subgraph should be generated.
        documentIDs - A list of sub-corpus/document identifiers describing the subgraph.
        Returns:
        The subgraph.
        Throws:
        GraphANNISException
      • corpusGraph

        public Graph corpusGraph​(String corpusName)
                          throws GraphANNISException
        Return the copy of the graph of the corpus structure given by its name.
        Parameters:
        corpusName - The name of the corpus.
        Returns:
        The corpus graph
        Throws:
        GraphANNISException
      • corpusGraphForQuery

        public Graph corpusGraphForQuery​(String corpusName,
                                         String query,
                                         CorpusStorageManager.QueryLanguage queryLanguage)
                                  throws GraphANNISException
        Return the copy of the graph of the corpus structure which includes all nodes matched by the given query.
        Parameters:
        corpusName - The name of the corpus.
        query - The query as string.
        queryLanguage - The query language of the query (e.g. AQL).
        Returns:
        The corpus graph
        Throws:
        GraphANNISException
      • frequency

        public List<FrequencyTableEntry<String>> frequency​(Iterable<String> corpusNames,
                                                           String query,
                                                           CorpusStorageManager.QueryLanguage queryLanguage,
                                                           String frequencyQueryDefinition)
                                                    throws GraphANNISException
        Execute a frequency query.
        Parameters:
        corpusNames - The names of the corpora to execute the query on.
        query - The query as string.
        queryLanguage - The query language of the query (e.g. AQL).
        frequencyQueryDefinition - A comma-separated list of frequency definition items as a string. Each frequency definition must consist of two parts, separated by ":": the name of the referenced node and the (possibly qualified) annotation name or "tok". For example, a frequency definition like
                                         1:tok,3:pos,4:tiger::pos
        would extract the token value for node #1, the pos annotation for node #3, and the pos annotation in the tiger namespace for node #4.
        Returns:
        A list of frequency table entries.
        Throws:
        GraphANNISException
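        The structure of the "frequencyQueryDefinition" string can be made explicit with a small parser. This is a minimal sketch based only on the format described above. The class Item, the method parse, and the error handling are hypothetical and are not part of the graphANNIS API.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch that splits a frequency definition string into its items.
// Hypothetical helper, not graphANNIS code.
class FrequencyDefinitionSketch {

    /** One item: the referenced node, an optional namespace, and the annotation name. */
    static final class Item {
        final String nodeRef;
        final String namespace; // null if the annotation name is not qualified
        final String name;

        Item(String nodeRef, String namespace, String name) {
            this.nodeRef = nodeRef;
            this.namespace = namespace;
            this.name = name;
        }

        @Override
        public String toString() {
            return nodeRef + " -> " + (namespace == null ? "" : namespace + "::") + name;
        }
    }

    static List<Item> parse(String definition) {
        List<Item> items = new ArrayList<>();
        for (String raw : definition.split(",")) {
            String part = raw.trim();
            // The first ":" separates the node reference from the annotation name.
            int sep = part.indexOf(':');
            if (sep <= 0 || sep == part.length() - 1) {
                throw new IllegalArgumentException("Expected <node>:<annotation> but got: " + part);
            }
            String nodeRef = part.substring(0, sep);
            String annotation = part.substring(sep + 1);
            // A qualified annotation name has the form <namespace>::<name>.
            int nsSep = annotation.indexOf("::");
            String namespace = nsSep >= 0 ? annotation.substring(0, nsSep) : null;
            String name = nsSep >= 0 ? annotation.substring(nsSep + 2) : annotation;
            items.add(new Item(nodeRef, namespace, name));
        }
        return items;
    }

    public static void main(String[] args) {
        for (Item item : parse("1:tok,3:pos,4:tiger::pos")) {
            System.out.println(item);
        }
    }
}
```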
      • importFromFileSystem

        public void importFromFileSystem​(String path,
                                         CorpusStorageManager.ImportFormat format,
                                         String corpusName,
                                         boolean diskBased)
                                  throws GraphANNISException
        Import a corpus from an external location on the file system into this corpus storage. This will not overwrite an existing corpus.
        Parameters:
        path - The location on the file system where the corpus data is located.
        format - The format in which this corpus data is stored.
        corpusName - If not "null", override the name of the new corpus for file formats that already provide a corpus name.
        diskBased - If true, certain elements like the node annotation storage will be disk-based instead of using in-memory representations.
        Throws:
        GraphANNISException
      • importFromFileSystem

        public void importFromFileSystem​(String path,
                                         CorpusStorageManager.ImportFormat format,
                                         String corpusName,
                                         boolean diskBased,
                                         boolean overwriteExisting)
                                  throws GraphANNISException
        Import a corpus from an external location on the file system into this corpus storage.
        Parameters:
        path - The location on the file system where the corpus data is located.
        format - The format in which this corpus data is stored.
        corpusName - If not "null", override the name of the new corpus for file formats that already provide a corpus name.
        diskBased - If true, certain elements like the node annotation storage will be disk-based instead of using in-memory representations.
        overwriteExisting - If true, overwrite an existing corpus of the same name, if there is one.
        Throws:
        GraphANNISException
      • exportToFileSystem

        public void exportToFileSystem​(String[] corpora,
                                       String path,
                                       CorpusStorageManager.ExportFormat format)
                                throws GraphANNISException
        Export a corpus to an external location on the file system using the given format.
        Parameters:
        corpora - The corpora to include in the exported file(s).
        path - The location on the file system where the corpus data should be written to.
        format - The format in which this corpus data will be stored.
        Throws:
        GraphANNISException
      • deleteCorpus

        public boolean deleteCorpus​(String corpusName)
                             throws GraphANNISException
        Delete a corpus from this corpus storage.
        Parameters:
        corpusName - The name of the corpus to delete.
        Returns:
        "true" if the corpus was successfully deleted and "false" if no such corpus existed.
        Throws:
        GraphANNISException
      • applyUpdate

        public void applyUpdate​(String corpusName,
                                GraphUpdate update)
                         throws GraphANNISException
        Apply a sequence of updates to this graph for a corpus. The update process is guaranteed to be atomic, and the changes are persisted to disk if no exception was thrown.
        Parameters:
        corpusName - The name of the corpus to apply the updates to
        update - The sequence of updates.
        Throws:
        GraphANNISException