Interface FileBasedStatisticsReportableInputFormat


  • @PublicEvolving
    public interface FileBasedStatisticsReportableInputFormat
    Extension of an input format that can report estimated statistics for file-based connectors.

    This interface is used by file-based connectors that also implement SupportsStatisticReport. File formats differ in how they store and expose statistics. For example, Parquet and ORC both store metadata in the file footer, including row count, min/max values, and null count. CSV, by contrast, has no metadata other than the file size, so one approach to estimating the row count is to divide the total file size by the average length of the sampled rows.
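
    The CSV heuristic described above can be sketched in plain Java. This is only an illustration of the arithmetic, not Flink's implementation: the class name CsvRowCountEstimator, the method estimateRowCount, and the assumption of UTF-8 rows terminated by a single newline byte are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of Flink: illustrates the
// "file size / average sampled row length" estimate for CSV files.
class CsvRowCountEstimator {

    /**
     * Estimates the number of rows in a CSV file.
     *
     * @param fileSizeBytes total size of the file in bytes
     * @param sampledRows   a small sample of rows read from the file, without line terminators
     * @return the estimated row count, or 0 if no estimate can be made
     */
    static long estimateRowCount(long fileSizeBytes, List<String> sampledRows) {
        if (fileSizeBytes <= 0 || sampledRows.isEmpty()) {
            return 0L;
        }
        long sampledBytes = 0L;
        for (String row : sampledRows) {
            // Assumption: UTF-8 encoding and a one-byte '\n' terminator per row.
            sampledBytes += row.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        double averageRowBytes = (double) sampledBytes / sampledRows.size();
        return Math.round(fileSizeBytes / averageRowBytes);
    }
}
```

    Sampling only a few rows keeps the estimate cheap, which matters because statistics are gathered while the query plan is being optimized. The estimate degrades when row lengths vary widely or the sample is not representative.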

    Note: This method is called during the plan optimization phase, so implementations of this interface should be as lightweight as possible while still reporting as much information as they can.

    • Method Detail

      • reportStatistics

        TableStats reportStatistics(List<org.apache.flink.core.fs.Path> files,
                                    DataType producedDataType)
        Returns the estimated statistics of this input format.
        Parameters:
        files - the files whose statistics are to be estimated.
        producedDataType - the final output type of the format.