
Find duplicate files in SFTP/FTP server

The following example uses the WinSCP .NET assembly from a PowerShell script. If you prefer another language, you can easily translate it.

You can use the script to efficiently find duplicate files on a remote SFTP/FTP server. The script first iterates the remote directory tree and looks for files with the same size. When it finds files of equal size, it by default downloads them and compares their SHA-1 checksums locally.
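
The same size-first, checksum-second idea can be tried on a local folder before pointing the script at a server. The following is only a minimal sketch under that assumption: `Find-LocalDuplicates` is a hypothetical helper name, it does not use WinSCP at all, and it relies on the `Get-FileHash` cmdlet (PowerShell 4 or newer).

```powershell
# Sketch: find duplicates in a local folder by size first, then by SHA-1.
# Find-LocalDuplicates is a hypothetical helper, not part of WinSCP or the script.
function Find-LocalDuplicates ($root)
{
    Get-ChildItem -LiteralPath $root -File -Recurse |
        # Only files that share a size can be duplicates
        Group-Object Length |
        Where-Object { $_.Count -gt 1 } |
        ForEach-Object {
            # Hash only the candidates within one size group
            $_.Group |
                Group-Object { (Get-FileHash -LiteralPath $_.FullName -Algorithm SHA1).Hash } |
                Where-Object { $_.Count -gt 1 } |
                ForEach-Object {
                    # One output object per set of identical files
                    [PSCustomObject]@{ Checksum = $_.Name; Paths = @($_.Group.FullName) }
                }
        }
}
```

Calling `Find-LocalDuplicates "C:\path"` emits one object per set of identical files. Files with a unique size are never hashed, which is the saving the remote script aims for as well.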

You can install this script as a WinSCP extension by using this page URL in the Add Extension command. If you know that the server supports a protocol extension for calculating checksums, you can improve the extension's efficiency by configuring it to ask the server for the checksum, sparing the file download.


To run the script manually, use:

powershell.exe -File C:\path\FindDuplicates.ps1 -sessionUrl "sftp://username:password@example.com/" -remotePath "/path" -remoteChecksumAlg sha-1
# @name         Find &Duplicates...
# @command      powershell.exe -ExecutionPolicy Bypass -File "%EXTENSION_PATH%" ^
#                   -sessionUrl "!S" -remotePath "!/" -pause ^
#                   -remoteChecksumAlg "%RemoteChecksumAlg%" -sessionLogPath "%SessionLogPath%"
# @description  Searches for duplicate files on the server, starting from the current directory
# @flag         RemoteFiles
# @version      7
# @homepage     https://winscp.net/eng/docs/library_example_find_duplicate_files
# @require      WinSCP 5.12
# @option       RemoteChecksumAlg -config -run combobox "&Checksum:" "local" ^
#                   "local=Local sha-1" "sha1=Remote sha-1" "sha256=Remote sha-256" ^
#                   "md5=Remote md5" 
# @option       SessionLogPath -config sessionlogfile
# @optionspage  https://winscp.net/eng/docs/library_example_find_duplicate_files#options
 
param (
    # Use Generate Session URL function to obtain a value for -sessionUrl parameter.
    $sessionUrl = "sftp://user:mypassword;fingerprint=ssh-rsa-xx-xx-xx@example.com/",
    [Parameter(Mandatory = $True)]
    $remotePath,
    $remoteChecksumAlg = $Null,
    $sessionLogPath = $Null,
    [Switch]
    $pause
)
 
function FileChecksum ($remotePath)
{
    if (!($checksums.ContainsKey($remotePath)))
    {
        if (!$remoteChecksumAlg -or ($remoteChecksumAlg -eq "local"))
        {
            Write-Host "Downloading file $remotePath..."
            # Download file
            $localPath = [System.IO.Path]::GetTempFileName()
            $transferResult = $session.GetFiles($remotePath, $localPath)
 
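            if ($transferResult.IsSuccess)
            {
                $stream = [System.IO.File]::OpenRead($localPath)
                try
                {
                    $checksum = [System.BitConverter]::ToString($sha1.ComputeHash($stream))
                }
                finally
                {
                    # Release the file handle even if hashing fails
                    $stream.Dispose()
                }
 
                Write-Host "Downloaded file $remotePath checksum is $checksum"
            }
            else
            {
                Write-Host "Error downloading file ${remotePath}: $($transferResult.Failures[0])"
                # $False marks a file whose checksum could not be calculated
                $checksum = $False
            }
 
            # GetTempFileName creates the file, so remove it after a failure too
            Remove-Item $localPath -ErrorAction SilentlyContinue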
        }
        else
        {
            Write-Host "Request checksum for file $remotePath..."
            $buf = $session.CalculateFileChecksum($remoteChecksumAlg, $remotePath)
            $checksum = [System.BitConverter]::ToString($buf)
            Write-Host "File $remotePath checksum is $checksum"
        }
 
        $checksums[$remotePath] = $checksum
    }
 
    return $checksums[$remotePath]
}
 
try
{
    # Load WinSCP .NET assembly
    $assemblyPath = if ($env:WINSCP_PATH) { $env:WINSCP_PATH } else { $PSScriptRoot }
    Add-Type -Path (Join-Path $assemblyPath "WinSCPnet.dll")
 
    # Setup session options from URL
    $sessionOptions = New-Object WinSCP.SessionOptions
    $sessionOptions.ParseUrl($sessionUrl)
 
    $session = New-Object WinSCP.Session
    
    try
    {
        $session.SessionLogPath = $sessionLogPath
 
        Write-Host "Connecting..."
        $session.Open($sessionOptions)
 
        # Handle errors when enumerating the files
        $session.add_Failed( { 
            Write-Host "Error: $($_.Error.Message)"
        } )
 
        $sizes = @{}
        $checksums = @{}
        $duplicates = @{}
        
        $sha1 = [System.Security.Cryptography.SHA1]::Create()
 
        $files =
            $session.EnumerateRemoteFiles(
                $remotePath, "*", [WinSCP.EnumerationOptions]::AllDirectories)
 
        foreach ($fileInfo in $files)
        {
            Write-Host "Found file $($fileInfo.FullName) with size $($fileInfo.Length)"
 
            if ($sizes.ContainsKey($fileInfo.Length))
            {
                $checksum = FileChecksum $fileInfo.FullName
 
                foreach ($otherFilePath in $sizes[$fileInfo.Length])
                {
                    $otherChecksum = FileChecksum $otherFilePath
 
                    # A failed download yields $False; never treat two failures as a match
                    if ($checksum -and ($checksum -eq $otherChecksum))
                    {
                        Write-Host (
                            "Checksums of files $($fileInfo.FullName) and " +
                            "$otherFilePath are identical")
                        $duplicates[$fileInfo.FullName] = $otherFilePath
                    }
                }
            }
            else
            {
                $sizes[$fileInfo.Length] = @()
            }
 
            $sizes[$fileInfo.Length] += $fileInfo.FullName
        }
    }
    finally
    {
        # Disconnect, clean up
        $session.Dispose()
    }
 
    # Print results
    Write-Host
 
    if ($duplicates.Count -gt 0)
    {
        Write-Host "Duplicates found:"
 
        foreach ($path1 in $duplicates.Keys)
        {
            Write-Host "$path1 <=> $($duplicates[$path1])"
        }
    }
    else
    {
        Write-Host "No duplicates found."
    }
 
    $result = 0
}
catch
{
    Write-Host "Error: $($_.Exception.Message)"
    $result = 1
}
 
# Pause if -pause switch was used
if ($pause)
{
    Write-Host "Press any key to exit..."
    [System.Console]::ReadKey() | Out-Null
}
 
exit $result
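
The script reports duplicates as pairs of paths. If you would rather see each set of identical files once, the collected checksums can be regrouped. The following is a sketch only: `Get-DuplicateSets` is a hypothetical helper that is not part of the script above, and it assumes a hashtable shaped like the script's `$checksums` (path as the key, checksum string or `$False` as the value).

```powershell
# Sketch: turn a path-to-checksum table into sets of identical files.
# Get-DuplicateSets is a hypothetical helper, not part of the script above.
function Get-DuplicateSets ($checksums)
{
    $checksums.GetEnumerator() |
        # Skip entries marked $False (checksum could not be calculated)
        Where-Object { $_.Value } |
        Group-Object Value |
        # A checksum seen only once has no duplicate
        Where-Object { $_.Count -gt 1 } |
        ForEach-Object {
            [PSCustomObject]@{
                Checksum = $_.Name
                Paths = @($_.Group | ForEach-Object { $_.Key } | Sort-Object)
            }
        }
}
```

For example, `Get-DuplicateSets $checksums | ForEach-Object { $_.Paths -join " <=> " }` would print one line per set.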


Options

The Checksum selection lets you choose which checksum algorithm to use and whether the checksum is calculated locally or remotely. Select Local sha-1 to calculate the SHA-1 checksum locally. This is a universal option that works with any server, but WinSCP needs to download all candidate files. If you know that the server supports a protocol extension for calculating checksums, you can improve the extension's efficiency by selecting a remote calculation. The list contains some common algorithms that some servers support. However, you can type in the name of any other algorithm supported by the server.
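
How the selected value steers the script can be sketched in isolation. The two helpers below are hypothetical names introduced for illustration only. The condition in `Test-LocalChecksum` mirrors the test in the script's `FileChecksum` function, and `Get-Sha1String` formats a hash the same way the script does, here applied to an in-memory byte array instead of a downloaded file.

```powershell
# Returns $True when the script would download files and hash them locally.
# Test-LocalChecksum is a hypothetical helper; the condition mirrors the script.
function Test-LocalChecksum ($remoteChecksumAlg)
{
    return (!$remoteChecksumAlg -or ($remoteChecksumAlg -eq "local"))
}

# Hashes a byte array with SHA-1 and formats it as hyphen-separated hex,
# the same format the script uses when comparing checksums.
# Get-Sha1String is a hypothetical helper.
function Get-Sha1String ([byte[]]$bytes)
{
    $sha1 = [System.Security.Cryptography.SHA1]::Create()
    try
    {
        return [System.BitConverter]::ToString($sha1.ComputeHash($bytes))
    }
    finally
    {
        $sha1.Dispose()
    }
}
```

With these definitions, `Test-LocalChecksum "local"` and `Test-LocalChecksum $Null` both return `$True`, while any other algorithm name selects the remote calculation. `Get-Sha1String ([System.Text.Encoding]::ASCII.GetBytes("abc"))` returns `A9-99-3E-36-47-06-81-6A-BA-3E-25-71-78-50-C2-6C-9C-D0-D8-9D`.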

In the Session log file, you can specify a path to a session log file. The option is available on the Preferences dialog only.

In the Keyboard shortcut, you can specify a keyboard shortcut for the extension. The option is available on the Preferences dialog only.

Last modified: by martin