I need to split a DataFrame column into several columns so that each cell holds only two digits. The current DataFrame looks like this:

          Name     |  Number |  Code |
         ..............................
         Tom      | 78797071|       0
         Nick     |         | 89797071
         Juli     |         | 57797074
         June     | 39797571|       0
         Junw     |         | 23000000|

If Code contains an 8-digit number, split it into four two-digit values, one per column; if 00 appears in any of the DIV columns, the row should be marked 'incomplete'.

The new DataFrame should look like this:

     Name     |  Number |  Code |  DIV|DIV2|DIV3|DIV4|Incomplete  |
     ........................................................................
     Tom      | 78797071|       0 | 0 |   0|  0 |   0 |incomplete |
     Nick     |         | 89797071| 89| 79 | 70 | 71  |complete   |
     Juli     |         | 57797074| 57| 79 | 70 | 74  |complete   |
     June     | 39797571|       0 |  0|   0|  0 |   0 |complete   |
     Junw     |         | 23000000| 23|  00| 00 | 00  |incomplete |
PURU, Oct 5, 2019 at 08:40

3 answers

Best answer

You can use str.findall("..") to split the values into two-character chunks, then join the resulting columns back onto the original df. Use apply to derive the complete/incomplete status.

import pandas as pd

df = pd.DataFrame({"Name":["Tom","Nick","Juli","June","Junw"],
                   "Number":[78797071, 0, 0, 39797571, 0],
                   "Code":[0, 89797071, 57797074, 0, 23000000]})

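# Split each code into two-character chunks; "0" yields no chunks, so its row is filled with "00".
parts = pd.DataFrame(df["Code"].astype(str).str.findall("..").values.tolist()).add_prefix("DIV")
df = df.join(parts).fillna("00")

# Flag a row as incomplete if any of its DIV columns is "00".
df["Incomplete"] = df.iloc[:, 3:7].apply(lambda row: "incomplete" if row.str.contains("00").any() else "complete", axis=1)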

print (df)

Output
   Name    Number      Code DIV0 DIV1 DIV2 DIV3  Incomplete
0   Tom  78797071         0   00   00   00   00  incomplete
1  Nick         0  89797071   89   79   70   71    complete
2  Juli         0  57797074   57   79   70   74    complete
3  June  39797571         0   00   00   00   00  incomplete
4  Junw         0  23000000   23   00   00   00  incomplete
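
To see why the fillna("00") step is needed, it may help to look at what str.findall("..") returns for a value shorter than two characters. This is only a minimal sketch; the names codes, pairs and parts are made up for illustration and are not part of the answer above.

```python
import pandas as pd

# Hypothetical sample: one placeholder code and one full 8-digit code.
codes = pd.Series(["0", "89797071"])

# ".." matches any two characters, so a one-character string yields no matches.
pairs = codes.str.findall("..")
print(pairs.tolist())  # [[], ['89', '79', '70', '71']]

# Building a frame from the ragged lists leaves the short row missing,
# which is what fillna("00") then fills in.
parts = pd.DataFrame(pairs.tolist()).add_prefix("DIV").fillna("00")
print(parts)
```

Note that the generated column names start at DIV0 because add_prefix is applied to the default integer column labels.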
Henry Yik, Oct 5, 2019 at 07:00

You can do this with the string functions zfill and findall, as shown below.


import numpy as np
import pandas as pd

## convert Code to string so the .str accessor can be used
df.Code = df.Code.astype(str)

## zfill pads the string with zeros to length 8; findall finds each pair of digits
## explode splits each list into rows (explode requires pandas 0.25 or later)
## reshape turns the flat array into 4 columns
arr = df.Code.str.zfill(8).str.findall(r"(\d\d)").explode().values.reshape(-1, 4)

## create a new DataFrame from arr with the given column names
df2 = pd.DataFrame(arr, columns=[f"Div{i+1}" for i in range(arr.shape[1])])

## set the "Incomplete" column to incomplete if any column in the row contains "00"
df2["Incomplete"] = np.where(np.any(arr == "00", axis=1), "incomplete", "complete")

pd.concat([df, df2], axis=1)


Result

   Name    Number      Code Div1 Div2 Div3 Div4  Incomplete
0   Tom  78797071         0   00   00   00   00  incomplete
1  Nick            89797071   89   79   70   71    complete
2  Juli            57797074   57   79   70   74    complete
3  June  39797571         0   00   00   00   00  incomplete
4  Junw            23000000   23   00   00   00  incomplete
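
The explode/reshape step above relies on every padded code producing exactly four pairs. As a possible alternative, str.extract returns one column per capture group and keeps the original index, so nothing has to be reshaped. This is a sketch under the assumption that codes are at most 8 digits; the three-row df and the names padded, parts and result are made up for illustration.

```python
import pandas as pd

# Hypothetical sample mirroring the Code column in the question.
df = pd.DataFrame({"Name": ["Tom", "Nick", "Junw"],
                   "Code": [0, 89797071, 23000000]})

# Pad to 8 characters, then capture the four digit pairs in one pass.
# str.extract returns one column per capture group and keeps df's index.
padded = df["Code"].astype(str).str.zfill(8)
parts = padded.str.extract(r"^(\d{2})(\d{2})(\d{2})(\d{2})$")
parts.columns = [f"Div{i + 1}" for i in range(parts.shape[1])]

# A row is incomplete if any of its pairs is "00".
is_incomplete = parts.eq("00").any(axis=1)
parts["Incomplete"] = is_incomplete.map({True: "incomplete", False: "complete"})

result = pd.concat([df, parts], axis=1)
print(result)
```

A code that is not exactly 8 digits after padding comes back as NaN in the Div columns instead of shifting the other rows, so such rows would still need separate handling.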
Dev Khadka, Oct 5, 2019 at 07:13

Try this quick solution.

import pandas as pd
import re

#data-preprocessing
data = {'Name': ['Tom','Nick','Juli','June','Junw'],'Code': ['0', '89797071', '57797074', '0', '23000000']}

#I omitted Number key in data

df = pd.DataFrame(data)

print(df)

#find patterns: four two-digit groups, or a code made up only of zeros

pattern = r'(\d{2})(\d{2})(\d{2})(\d{2})'
zero_pattern = r'0+'

split_data = []

for code in df['Code']:

  splitted = re.fullmatch(pattern, code)
  if splitted:
    #8-digit code: keep the four pairs and flag any '00' pair
    temp = list(splitted.groups())
    if '00' in temp:
      temp.append('incomplete')
    else:
      temp.append('complete')
    split_data.append(temp)
  elif re.fullmatch(zero_pattern, code):
    #elif keeps this to one output row per code, even for codes that start with 0
    split_data.append(['0','0','0','0','incomplete'])

#make right dataframe

col_name = ['DIV1','DIV2','DIV3','DIV4','Incomplete']

df2 = pd.DataFrame(split_data, columns=col_name)  

df[col_name]= df2

print(df)

Output

   Name      Code
0   Tom         0
1  Nick  89797071
2  Juli  57797074
3  June         0
4  Junw  23000000
   Name      Code DIV1 DIV2 DIV3 DIV4  Incomplete
0   Tom         0    0    0    0    0  incomplete
1  Nick  89797071   89   79   70   71    complete
2  Juli  57797074   57   79   70   74    complete
3  June         0    0    0    0    0  incomplete
4  Junw  23000000   23   00   00   00  incomplete
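
The per-row logic of the loop above can also be isolated in a small function, which makes it easier to check individual codes. This is only a sketch: split_code and PAIR_PATTERN are hypothetical names, and treating every value that is not exactly 8 digits as a placeholder is an assumption, not something the question specifies.

```python
import re

PAIR_PATTERN = re.compile(r"(\d{2})(\d{2})(\d{2})(\d{2})")

def split_code(code):
    """Hypothetical helper: return four pairs plus a status label for one code string."""
    match = PAIR_PATTERN.fullmatch(code)
    if match is None:
        # Assumption: anything that is not exactly 8 digits is a placeholder.
        return ["0", "0", "0", "0", "incomplete"]
    pairs = list(match.groups())
    status = "incomplete" if "00" in pairs else "complete"
    return pairs + [status]

print(split_code("89797071"))  # ['89', '79', '70', '71', 'complete']
print(split_code("23000000"))  # ['23', '00', '00', '00', 'incomplete']
print(split_code("0"))         # ['0', '0', '0', '0', 'incomplete']
```

Because the function always returns exactly five items, a list built with [split_code(c) for c in df['Code']] would stay aligned with the rows of df.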
QuantStats, Oct 5, 2019 at 07:01